The test statistic

$$z = (4 N w_1 w_2)^{\frac{1}{2}} \left[ \arcsin\left( \left( \hat{p}_2 + \frac{1\{\hat{p}_2 < \hat{p}_1\}}{2 N w_2} \right)^{\frac{1}{2}} \right) - \arcsin\left( \left( \hat{p}_1 + \frac{1\{\hat{p}_1 < \hat{p}_2\}}{2 N w_1} \right)^{\frac{1}{2}} \right) \right]$$

is assumed to have a null distribution of $N(0,1)$ and an alternative distribution of $N(\delta, 1)$, where

$$\delta = (4 N w_1 w_2)^{\frac{1}{2}} \left[ \arcsin\left( \left( p_2 + \frac{1\{p_2 < p_1\}}{2 N w_2} \right)^{\frac{1}{2}} \right) - \arcsin\left( \left( p_1 + \frac{1\{p_1 < p_2\}}{2 N w_1} \right)^{\frac{1}{2}} \right) \right]$$
The approximate power for the 1-sided balanced case is given by Walters (1979) and is easily extended to the unbalanced and 2-sided cases:

$$\mathrm{power} = \begin{cases} \Phi\left( \delta - z_{1-\alpha} \right), & \text{upper 1-sided} \\ \Phi\left( -\delta - z_{1-\alpha} \right), & \text{lower 1-sided} \\ \Phi\left( \delta - z_{1-\alpha/2} \right) + \Phi\left( -\delta - z_{1-\alpha/2} \right), & \text{2-sided} \end{cases}$$
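As a numeric illustration of these formulas, the following DATA step sketch evaluates δ and the Walters approximation; all input values (proportions, total sample size, weights, α) are assumptions chosen for illustration:

   data arcsin_power;
      /* assumed illustrative inputs */
      p1 = 0.3; p2 = 0.2;            /* conjectured proportions    */
      N  = 200; w1 = 0.5; w2 = 0.5;  /* total N and group weights  */
      alpha = 0.05;

      /* continuity-corrected arcsine effect size (delta) */
      delta = sqrt(4*N*w1*w2) *
              ( arsin(sqrt(p2 + (p2 < p1)/(2*N*w2)))
              - arsin(sqrt(p1 + (p1 < p2)/(2*N*w1))) );

      /* Walters approximation for each sidedness */
      z  = probit(1 - alpha);
      power_upper  = probnorm( delta - z);
      power_lower  = probnorm(-delta - z);
      z2 = probit(1 - alpha/2);
      power_2sided = probnorm(delta - z2) + probnorm(-delta - z2);
   run;

The ARSIN, PROBIT, and PROBNORM functions supply the arcsine, normal quantile, and normal CDF, respectively.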
Analyses in the TWOSAMPLEMEANS Statement

Two-sample t Test Assuming Equal Variances (TEST=DIFF)
The hypotheses for the two-sample t test are

$$H_0: \mu_{\mathrm{diff}} = \mu_0$$
$$H_1: \begin{cases} \mu_{\mathrm{diff}} \ne \mu_0, & \text{2-sided} \\ \mu_{\mathrm{diff}} > \mu_0, & \text{upper 1-sided} \\ \mu_{\mathrm{diff}} < \mu_0, & \text{lower 1-sided} \end{cases}$$

The test assumes normally distributed data and a common standard deviation in the two groups, and it requires N ≥ 3, n1 ≥ 1, and n2 ≥ 1. The test statistics are

$$t = N^{\frac{1}{2}} (w_1 w_2)^{\frac{1}{2}} \left( \frac{\bar{x}_2 - \bar{x}_1 - \mu_0}{s_p} \right) \sim t(N-2, \delta)$$
$$t^2 \sim F(1, N-2, \delta^2)$$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample means and $s_p$ is the pooled standard deviation, and

$$\delta = N^{\frac{1}{2}} (w_1 w_2)^{\frac{1}{2}} \left( \frac{\mu_{\mathrm{diff}} - \mu_0}{\sigma} \right)$$
The test is

$$\text{Reject } H_0 \text{ if } \begin{cases} t^2 \ge F_{1-\alpha}(1, N-2), & \text{2-sided} \\ t \ge t_{1-\alpha}(N-2), & \text{upper 1-sided} \\ t \le t_{\alpha}(N-2), & \text{lower 1-sided} \end{cases}$$
Exact power computations for t tests are given in O'Brien and Muller (1993, section 8.2.1):

$$\mathrm{power} = \begin{cases} P\left( F(1, N-2, \delta^2) \ge F_{1-\alpha}(1, N-2) \right), & \text{2-sided} \\ P\left( t(N-2, \delta) \ge t_{1-\alpha}(N-2) \right), & \text{upper 1-sided} \\ P\left( t(N-2, \delta) \le t_{\alpha}(N-2) \right), & \text{lower 1-sided} \end{cases}$$
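SAS exposes the noncentral t and F distributions through the PROBT and PROBF functions, so these exact powers can be sketched directly in a DATA step; all input values below are assumptions for illustration:

   data ttest_power;
      /* assumed illustrative inputs */
      mudiff = 5; mu0 = 0; sigma = 10;
      N = 60; w1 = 0.5; w2 = 0.5;
      alpha = 0.05;
      df = N - 2;

      delta = sqrt(N*w1*w2) * (mudiff - mu0) / sigma;

      /* exact powers from the noncentral F and t distributions */
      power_2sided = 1 - probf(finv(1-alpha, 1, df), 1, df, delta**2);
      power_upper  = 1 - probt(tinv(1-alpha, df), df, delta);
      power_lower  =     probt(tinv(alpha, df), df, delta);
   run;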
Solutions for N, n1, n2, α, and δ are obtained by numerically inverting the power equation. Closed-form solutions for other parameters, in terms of δ, are as follows:

$$\mu_{\mathrm{diff}} = \delta \sigma (N w_1 w_2)^{-\frac{1}{2}} + \mu_0$$
$$\mu_1 = \mu_2 - \delta \sigma (N w_1 w_2)^{-\frac{1}{2}} - \mu_0$$
$$\mu_2 = \delta \sigma (N w_1 w_2)^{-\frac{1}{2}} + \mu_0 + \mu_1$$
$$\sigma = \begin{cases} \delta^{-1} (N w_1 w_2)^{\frac{1}{2}} (\mu_{\mathrm{diff}} - \mu_0), & |\delta| > 0 \\ \text{undefined}, & \text{otherwise} \end{cases}$$
$$w_1 = \begin{cases} \frac{1}{2} \pm \frac{1}{2} \left[ 1 - \frac{4 \delta^2 \sigma^2}{N (\mu_{\mathrm{diff}} - \mu_0)^2} \right]^{\frac{1}{2}}, & 0 < |\delta| \le \frac{1}{2} N^{\frac{1}{2}} \frac{|\mu_{\mathrm{diff}} - \mu_0|}{\sigma} \\ \text{undefined}, & \text{otherwise} \end{cases}$$
$$w_2 = \begin{cases} \frac{1}{2} \pm \frac{1}{2} \left[ 1 - \frac{4 \delta^2 \sigma^2}{N (\mu_{\mathrm{diff}} - \mu_0)^2} \right]^{\frac{1}{2}}, & 0 < |\delta| \le \frac{1}{2} N^{\frac{1}{2}} \frac{|\mu_{\mathrm{diff}} - \mu_0|}{\sigma} \\ \text{undefined}, & \text{otherwise} \end{cases}$$
Finally, here is a derivation of the solution for $w_1$: Solve the δ equation for $w_1$ (which requires the quadratic formula). Then determine the range of δ given $w_1$:

$$\min_{w_1}(\delta) = \begin{cases} 0, & \text{when } w_1 = 0 \text{ or } 1, \text{ if } (\mu_{\mathrm{diff}} - \mu_0) \ge 0 \\ \frac{1}{2} N^{\frac{1}{2}} \frac{(\mu_{\mathrm{diff}} - \mu_0)}{\sigma}, & \text{when } w_1 = \frac{1}{2}, \text{ if } (\mu_{\mathrm{diff}} - \mu_0) < 0 \end{cases}$$

$$\max_{w_1}(\delta) = \begin{cases} 0, & \text{when } w_1 = 0 \text{ or } 1, \text{ if } (\mu_{\mathrm{diff}} - \mu_0) < 0 \\ \frac{1}{2} N^{\frac{1}{2}} \frac{(\mu_{\mathrm{diff}} - \mu_0)}{\sigma}, & \text{when } w_1 = \frac{1}{2}, \text{ if } (\mu_{\mathrm{diff}} - \mu_0) \ge 0 \end{cases}$$

This implies

$$|\delta| \le \frac{1}{2} N^{\frac{1}{2}} \frac{|\mu_{\mathrm{diff}} - \mu_0|}{\sigma}$$

Two-sample Satterthwaite t Test Assuming Unequal Variances (TEST=DIFF_SATT)
The hypotheses for the two-sample Satterthwaite t test are

$$H_0: \mu_{\mathrm{diff}} = \mu_0$$
$$H_1: \begin{cases} \mu_{\mathrm{diff}} \ne \mu_0, & \text{2-sided} \\ \mu_{\mathrm{diff}} > \mu_0, & \text{upper 1-sided} \\ \mu_{\mathrm{diff}} < \mu_0, & \text{lower 1-sided} \end{cases}$$

The test assumes normally distributed data and requires N ≥ 3, n1 ≥ 1, and n2 ≥ 1. The test statistics are

$$t = \frac{\bar{x}_2 - \bar{x}_1 - \mu_0}{\left[ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right]^{\frac{1}{2}}} = N^{\frac{1}{2}} \, \frac{\bar{x}_2 - \bar{x}_1 - \mu_0}{\left[ \frac{s_1^2}{w_1} + \frac{s_2^2}{w_2} \right]^{\frac{1}{2}}}$$

$$F = t^2$$
where $\bar{x}_1$ and $\bar{x}_2$ are the sample means and $s_1$ and $s_2$ are the sample standard deviations. As DiSantostefano and Muller (1995, p. 585) state, the test is based on assuming that under $H_0$, F is distributed as $F(1, \nu)$, where $\nu$ is given by Satterthwaite's approximation (Satterthwaite 1946):

$$\nu = \frac{\left( \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \right)^2}{\frac{\left( \frac{\sigma_1^2}{n_1} \right)^2}{n_1 - 1} + \frac{\left( \frac{\sigma_2^2}{n_2} \right)^2}{n_2 - 1}} = \frac{\left( \frac{\sigma_1^2}{w_1} + \frac{\sigma_2^2}{w_2} \right)^2}{\frac{\left( \frac{\sigma_1^2}{w_1} \right)^2}{N w_1 - 1} + \frac{\left( \frac{\sigma_2^2}{w_2} \right)^2}{N w_2 - 1}}$$
Since $\nu$ is unknown, in practice it must be replaced by an estimate

$$\hat{\nu} = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{\left( \frac{s_1^2}{n_1} \right)^2}{n_1 - 1} + \frac{\left( \frac{s_2^2}{n_2} \right)^2}{n_2 - 1}} = \frac{\left( \frac{s_1^2}{w_1} + \frac{s_2^2}{w_2} \right)^2}{\frac{\left( \frac{s_1^2}{w_1} \right)^2}{N w_1 - 1} + \frac{\left( \frac{s_2^2}{w_2} \right)^2}{N w_2 - 1}}$$
So the test is

$$\text{Reject } H_0 \text{ if } \begin{cases} F \ge F_{1-\alpha}(1, \hat{\nu}), & \text{2-sided} \\ t \ge t_{1-\alpha}(\hat{\nu}), & \text{upper 1-sided} \\ t \le t_{\alpha}(\hat{\nu}), & \text{lower 1-sided} \end{cases}$$
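As a small numeric illustration of the $\hat{\nu}$ formula, the following sketch (sample sizes and standard deviations are assumed values) computes the estimated degrees of freedom:

   data satt_df;
      /* assumed illustrative sample statistics */
      n1 = 20; n2 = 30;
      s1 = 4;  s2 = 9;    /* sample standard deviations */

      v1 = s1**2/n1;
      v2 = s2**2/n2;

      /* Satterthwaite's approximate degrees of freedom */
      nu_hat = (v1 + v2)**2 / ( v1**2/(n1-1) + v2**2/(n2-1) );
   run;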
Exact solutions for power for the 2-sided and upper 1-sided cases are given in Moser, Stevens, and Watts (1989). The lower 1-sided case follows easily using symmetry. The equations are as follows:

$$\mathrm{power} = \begin{cases} \int_0^\infty P\left( F(1, N-2, \lambda) > h(u)\, F_{1-\alpha}(1, v(u)) \mid u \right) f(u)\, du, & \text{2-sided} \\ \int_0^\infty P\left( t(N-2, \lambda^{\frac{1}{2}}) > [h(u)]^{\frac{1}{2}}\, t_{1-\alpha}(v(u)) \mid u \right) f(u)\, du, & \text{upper 1-sided} \\ \int_0^\infty P\left( t(N-2, \lambda^{\frac{1}{2}}) < [h(u)]^{\frac{1}{2}}\, t_{\alpha}(v(u)) \mid u \right) f(u)\, du, & \text{lower 1-sided} \end{cases}$$

where

$$h(u) = \frac{\left( \frac{1}{n_1} + \frac{u}{n_2} \right) (n_1 + n_2 - 2)}{\left[ (n_1 - 1) + (n_2 - 1) \frac{u \sigma_1^2}{\sigma_2^2} \right] \left( \frac{1}{n_1} + \frac{\sigma_2^2}{\sigma_1^2 n_2} \right)}$$

$$v(u) = \frac{\left( \frac{1}{n_1} + \frac{u}{n_2} \right)^2}{\frac{1}{n_1^2 (n_1 - 1)} + \frac{u^2}{n_2^2 (n_2 - 1)}}$$

$$\lambda = \frac{(\mu_{\mathrm{diff}} - \mu_0)^2}{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$f(u) = \frac{\Gamma\left( \frac{n_1 + n_2 - 2}{2} \right)}{\Gamma\left( \frac{n_1 - 1}{2} \right) \Gamma\left( \frac{n_2 - 1}{2} \right)} \left[ \frac{\sigma_1^2 (n_2 - 1)}{\sigma_2^2 (n_1 - 1)} \right]^{\frac{n_2 - 1}{2}} u^{\frac{n_2 - 3}{2}} \left[ 1 + \frac{n_2 - 1}{n_1 - 1} \cdot \frac{u \sigma_1^2}{\sigma_2^2} \right]^{-\frac{n_1 + n_2 - 2}{2}}$$

The density $f(u)$ is obtained from the fact that

$$\frac{u \sigma_1^2}{\sigma_2^2} \sim F(n_2 - 1, n_1 - 1)$$

Two-sample Pooled t Test of Mean Ratio with Lognormal Data (TEST=RATIO)
The lognormal case is handled by re-expressing the analysis equivalently as a normality-based test on the log-transformed data, using properties of the lognormal distribution as discussed in Johnson and Kotz (1970, chapter 14). The approaches in the "Two-sample t Test Assuming Equal Variances (TEST=DIFF)" section on page 3526 then apply.

In contrast to the usual t test on normal data, the hypotheses with lognormal data are defined in terms of geometric means rather than arithmetic means. The test assumes equal coefficients of variation in the two groups. The hypotheses for the two-sample t test with lognormal data are

$$H_0: \frac{\gamma_2}{\gamma_1} = \gamma_0$$
$$H_1: \begin{cases} \frac{\gamma_2}{\gamma_1} \ne \gamma_0, & \text{2-sided} \\ \frac{\gamma_2}{\gamma_1} > \gamma_0, & \text{upper 1-sided} \\ \frac{\gamma_2}{\gamma_1} < \gamma_0, & \text{lower 1-sided} \end{cases}$$

Let $\mu_1^\star$, $\mu_2^\star$, and $\sigma^\star$ be the (arithmetic) means and common standard deviation of the corresponding normal distributions of the log-transformed data. The hypotheses can be rewritten as follows:

$$H_0: \mu_2^\star - \mu_1^\star = \log(\gamma_0)$$
$$H_1: \begin{cases} \mu_2^\star - \mu_1^\star \ne \log(\gamma_0), & \text{2-sided} \\ \mu_2^\star - \mu_1^\star > \log(\gamma_0), & \text{upper 1-sided} \\ \mu_2^\star - \mu_1^\star < \log(\gamma_0), & \text{lower 1-sided} \end{cases}$$

where

$$\mu_1^\star = \log \gamma_1 \qquad \mu_2^\star = \log \gamma_2$$
The test assumes lognormally distributed data and requires N ≥ 3, n1 ≥ 1, and n2 ≥ 1. The power is

$$\mathrm{power} = \begin{cases} P\left( F(1, N-2, \delta^2) \ge F_{1-\alpha}(1, N-2) \right), & \text{2-sided} \\ P\left( t(N-2, \delta) \ge t_{1-\alpha}(N-2) \right), & \text{upper 1-sided} \\ P\left( t(N-2, \delta) \le t_{\alpha}(N-2) \right), & \text{lower 1-sided} \end{cases}$$

where

$$\delta = N^{\frac{1}{2}} (w_1 w_2)^{\frac{1}{2}} \left( \frac{\mu_2^\star - \mu_1^\star - \log(\gamma_0)}{\sigma^\star} \right)$$
$$\sigma^\star = \left[ \log(\mathrm{CV}^2 + 1) \right]^{\frac{1}{2}}$$
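The calculation reduces to the equal-variance t test power applied on the log scale, so it can be sketched in a DATA step; the geometric means, CV, and design values below are assumptions for illustration:

   data lognormal_ratio_power;
      /* assumed illustrative inputs */
      gamma1 = 1.0; gamma2 = 1.3; gamma0 = 1.0;  /* geometric means, null ratio */
      CV = 0.5; N = 50; w1 = 0.5; w2 = 0.5; alpha = 0.05;

      /* standard deviation of the log-transformed data */
      sigma_star = sqrt(log(CV**2 + 1));

      delta = sqrt(N*w1*w2) * (log(gamma2) - log(gamma1) - log(gamma0))
              / sigma_star;

      df = N - 2;
      power_upper = 1 - probt(tinv(1-alpha, df), df, delta);
   run;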
Additive Equivalence Test for Mean Difference with Normal Data (TEST=EQUIV_DIFF)
The hypotheses for the equivalence test are

$$H_0: \mu_{\mathrm{diff}} < \theta_L \quad \text{or} \quad \mu_{\mathrm{diff}} > \theta_U$$
$$H_1: \theta_L \le \mu_{\mathrm{diff}} \le \theta_U$$

The analysis is the two one-sided tests (TOST) procedure of Schuirmann (1987). The test assumes normally distributed data and requires N ≥ 3, n1 ≥ 1, and n2 ≥ 1. Phillips (1990) derives an expression for the exact power assuming a balanced design; the results are easily adapted to an unbalanced design:

$$\mathrm{power} = Q_{N-2}\left( -t_{1-\alpha}(N-2),\; \frac{\mu_{\mathrm{diff}} - \theta_U}{\sigma N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}}};\; 0,\; \frac{(N-2)^{\frac{1}{2}} (\theta_U - \theta_L)}{2 \sigma N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}} \left( t_{1-\alpha}(N-2) \right)} \right)$$
$$\qquad - \; Q_{N-2}\left( t_{1-\alpha}(N-2),\; \frac{\mu_{\mathrm{diff}} - \theta_L}{\sigma N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}}};\; 0,\; \frac{(N-2)^{\frac{1}{2}} (\theta_U - \theta_L)}{2 \sigma N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}} \left( t_{1-\alpha}(N-2) \right)} \right)$$
where Q· (·, ·; ·, ·) is Owen’s Q function, defined in the “Common Notation” section on page 3498.
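Owen's Q function is not available as a built-in SAS function, but a conservative check on the TOST power can be sketched from the two one-sided noncentral t probabilities. This Bonferroni-style bound ignores the joint structure that Owen's Q captures, so it is only a lower bound on the exact power, and all input values here are assumptions:

   data tost_power_bound;
      /* assumed illustrative inputs */
      mudiff = 0; thetaL = -5; thetaU = 5;
      sigma = 8; N = 60; w1 = 0.5; w2 = 0.5; alpha = 0.05;

      df = N - 2;
      se = sigma / sqrt(N*w1*w2);
      tcrit = tinv(1-alpha, df);

      deltaU = (mudiff - thetaU)/se;   /* noncentrality for the upper test */
      deltaL = (mudiff - thetaL)/se;   /* noncentrality for the lower test */

      /* P(reject both one-sided tests) >= P(A) + P(B) - 1 */
      pU = probt(-tcrit, df, deltaU);      /* upper test rejects */
      pL = 1 - probt(tcrit, df, deltaL);   /* lower test rejects */
      power_lower_bound = max(0, pU + pL - 1);
   run;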
Multiplicative Equivalence Test for Mean Ratio with Lognormal Data (TEST=EQUIV_RATIO)
The lognormal case is handled by re-expressing the analysis equivalently as a normality-based test on the log-transformed data, using properties of the lognormal distribution as discussed in Johnson and Kotz (1970, chapter 14). The approaches in the "Additive Equivalence Test for Mean Difference with Normal Data (TEST=EQUIV_DIFF)" section on page 3530 then apply.

In contrast to the additive equivalence test on normal data, the hypotheses with lognormal data are defined in terms of geometric means rather than arithmetic means. The hypotheses for the equivalence test are

$$H_0: \frac{\gamma_T}{\gamma_R} \le \theta_L \quad \text{or} \quad \frac{\gamma_T}{\gamma_R} \ge \theta_U$$
$$H_1: \theta_L < \frac{\gamma_T}{\gamma_R} < \theta_U$$

where $0 < \theta_L < \theta_U$.

The analysis is the two one-sided tests (TOST) procedure of Schuirmann (1987) on the log-transformed data. The test assumes lognormally distributed data and requires N ≥ 3, n1 ≥ 1, and n2 ≥ 1. Diletti, Hauschke, and Steinijans (1991) derive an expression for the exact power assuming a crossover design; the results are easily adapted to an unbalanced two-sample design:

$$\mathrm{power} = Q_{N-2}\left( -t_{1-\alpha}(N-2),\; \frac{\log\left( \frac{\gamma_T}{\gamma_R} \right) - \log(\theta_U)}{\sigma^\star N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}}};\; 0,\; \frac{(N-2)^{\frac{1}{2}} \left( \log(\theta_U) - \log(\theta_L) \right)}{2 \sigma^\star N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}} \left( t_{1-\alpha}(N-2) \right)} \right)$$
$$\qquad - \; Q_{N-2}\left( t_{1-\alpha}(N-2),\; \frac{\log\left( \frac{\gamma_T}{\gamma_R} \right) - \log(\theta_L)}{\sigma^\star N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}}};\; 0,\; \frac{(N-2)^{\frac{1}{2}} \left( \log(\theta_U) - \log(\theta_L) \right)}{2 \sigma^\star N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}} \left( t_{1-\alpha}(N-2) \right)} \right)$$

where

$$\sigma^\star = \left[ \log(\mathrm{CV}^2 + 1) \right]^{\frac{1}{2}}$$

is the (assumed common) standard deviation of the normal distribution of the log-transformed data, and $Q_\cdot(\cdot, \cdot; \cdot, \cdot)$ is Owen's Q function, defined in the "Common Notation" section on page 3498.
Confidence Interval for Mean Difference (CI=DIFF)
This analysis of precision applies to the standard t-based confidence interval:

$$\left[ (\bar{x}_2 - \bar{x}_1) - t_{1-\frac{\alpha}{2}}(N-2) \frac{s_p}{\sqrt{N w_1 w_2}},\;\; (\bar{x}_2 - \bar{x}_1) + t_{1-\frac{\alpha}{2}}(N-2) \frac{s_p}{\sqrt{N w_1 w_2}} \right], \quad \text{2-sided}$$
$$\left[ (\bar{x}_2 - \bar{x}_1) - t_{1-\alpha}(N-2) \frac{s_p}{\sqrt{N w_1 w_2}},\;\; \infty \right), \quad \text{upper 1-sided}$$
$$\left( -\infty,\;\; (\bar{x}_2 - \bar{x}_1) + t_{1-\alpha}(N-2) \frac{s_p}{\sqrt{N w_1 w_2}} \right], \quad \text{lower 1-sided}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample means and $s_p$ is the pooled standard deviation. The "half-width" is defined as the distance from the point estimate $\bar{x}_2 - \bar{x}_1$ to a finite endpoint:

$$\text{half-width} = \begin{cases} t_{1-\frac{\alpha}{2}}(N-2) \frac{s_p}{\sqrt{N w_1 w_2}}, & \text{2-sided} \\ t_{1-\alpha}(N-2) \frac{s_p}{\sqrt{N w_1 w_2}}, & \text{1-sided} \end{cases}$$

A "valid" confidence interval captures the true mean difference. The exact probability of obtaining at most the target confidence interval half-width h, unconditional or conditional on validity, is given by Beal (1989):

$$\Pr(\text{half-width} \le h) = \begin{cases} P\left( \chi^2(N-2) \le \frac{h^2 N (N-2) (w_1 w_2)}{\sigma^2 \left( t^2_{1-\frac{\alpha}{2}}(N-2) \right)} \right), & \text{2-sided} \\ P\left( \chi^2(N-2) \le \frac{h^2 N (N-2) (w_1 w_2)}{\sigma^2 \left( t^2_{1-\alpha}(N-2) \right)} \right), & \text{1-sided} \end{cases}$$

$$\Pr(\text{half-width} \le h \mid \text{validity}) = \begin{cases} \frac{2}{1-\alpha} \left[ Q_{N-2}\left( t_{1-\frac{\alpha}{2}}(N-2), 0; 0, b_2 \right) - Q_{N-2}(0, 0; 0, b_2) \right], & \text{2-sided} \\ \frac{1}{1-\alpha}\, Q_{N-2}\left( t_{1-\alpha}(N-2), 0; 0, b_2 \right), & \text{1-sided} \end{cases}$$
where

$$b_2 = \frac{h (N-2)^{\frac{1}{2}}}{\sigma \left( t_{1-\frac{\alpha}{c}}(N-2) \right) N^{-\frac{1}{2}} (w_1 w_2)^{-\frac{1}{2}}}$$
$$c = \text{number of sides}$$

and $Q_\cdot(\cdot, \cdot; \cdot, \cdot)$ is Owen's Q function, defined in the "Common Notation" section on page 3498.

A "quality" confidence interval is both sufficiently narrow (half-width ≤ h) and valid:

$$\Pr(\text{quality}) = \Pr(\text{half-width} \le h \text{ and validity}) = \Pr(\text{half-width} \le h \mid \text{validity}) (1 - \alpha)$$
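The unconditional probability involves only central chi-square probabilities and central t quantiles, so it can be sketched directly with PROBCHI and TINV; the inputs below are assumptions for illustration:

   data ci_halfwidth_prob;
      /* assumed illustrative inputs */
      h = 2; sigma = 5; N = 60; w1 = 0.5; w2 = 0.5; alpha = 0.05;

      df = N - 2;
      t2 = tinv(1 - alpha/2, df)**2;  /* squared 2-sided t critical value */

      /* unconditional Pr(half-width <= h) for the 2-sided interval */
      prob_width = probchi( h**2 * N * df * w1 * w2 / (sigma**2 * t2), df );
   run;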
Analyses in the TWOSAMPLESURVIVAL Statement

Rank Tests for Two Survival Curves (TEST=LOGRANK, TEST=GEHAN, TEST=TARONEWARE)
The method is from Lakatos (1988) and Cantor (1997, pp. 83–92). Define the following notation:

   X_j(i)  = ith input time point on survival curve for group j
   S_j(i)  = input survivor function value corresponding to X_j(i)
   h_j(t)  = hazard rate for group j at time t
   Ψ_j(t)  = loss hazard rate for group j at time t
   λ_j     = exponential hazard rate for group j
   R       = hazard ratio of group 2 to group 1 ≡ (assumed constant) value of h_2(t)/h_1(t)
   m_j     = median survival time for group j
   b       = number of subintervals per time unit
   T       = accrual time
   τ       = post-accrual follow-up time
   L_j     = exponential loss rate for group j
   XL_j    = input time point on loss curve for group j
   SL_j    = input survivor function value corresponding to XL_j
   mL_j    = median loss time for group j
   r_i     = rank for ith time point

Each survival curve can be specified in one of several ways.

• For exponential curves:
  – a single point (X_j(1), S_j(1)) on the curve
  – median survival time
  – hazard rate
  – hazard ratio (for curve 2, with respect to curve 1)

• For piecewise linear curves with proportional hazards:
  – a set of points {(X_1(1), S_1(1)), (X_1(2), S_1(2)), . . .} (for curve 1)
  – hazard ratio (for curve 2, with respect to curve 1)

• For arbitrary piecewise linear curves:
  – a set of points {(X_j(1), S_j(1)), (X_j(2), S_j(2)), . . .}

A total of M evenly spaced time points {t_0 = 0, t_1, t_2, . . . , t_M = T + τ} are used in calculations, where

$$M = \mathrm{floor}\left( (T + \tau) b \right)$$
The hazard function is calculated for each survival curve at each time point. For an exponential curve, the (constant) hazard is given by one of the following, depending on the input parameterization:

$$h_j(t) = \begin{cases} \lambda_j \\ \lambda_1 R \\ \dfrac{-\log\left( \frac{1}{2} \right)}{m_j} \\ \dfrac{-\log(S_j(1))}{X_j(1)} \\ \dfrac{-\log(S_1(1))}{X_1(1)}\, R \end{cases}$$
For a piecewise linear curve, define the following additional notation:

$$t_i^- = \text{largest input time } X \text{ such that } X \le t_i$$
$$t_i^+ = \text{smallest input time } X \text{ such that } X > t_i$$

The hazard is computed using linear interpolation as follows:

$$h_j(t_i) = \frac{S_j(t_i^-) - S_j(t_i^+)}{\left[ S_j(t_i^-) - \frac{S_j(t_i^-) - S_j(t_i^+)}{t_i^+ - t_i^-} \left( t_i - t_i^- \right) \right] \left( t_i^+ - t_i^- \right)}$$
With proportional hazards, the hazard rate of group 2's curve in terms of the hazard rate of group 1's curve is

$$h_2(t) = h_1(t) R$$

Hazard function values $\{\Psi_j(t_i)\}$ for the loss curves are computed in an analogous way from $\{L_j, XL_j, SL_j, mL_j\}$.

The expected number at risk $N_j(i)$ at time i in group j is calculated for each group and time points 0 through M − 1, as follows:

$$N_j(0) = N w_j$$
$$N_j(i+1) = N_j(i) \left[ 1 - \frac{1}{b} h_j(t_i) - \frac{1}{b} \Psi_j(t_i) - 1\{t_i > \tau\} \frac{1}{b (T + \tau - t_i)} \right]$$

Define $\theta_i$ as the ratio of hazards and $\phi_i$ as the ratio of expected numbers at risk for time $t_i$:

$$\theta_i = \frac{h_2(t_i)}{h_1(t_i)} \qquad \phi_i = \frac{N_2(i)}{N_1(i)}$$
The expected number of deaths in each subinterval is calculated as follows:

$$D_i = \frac{1}{b} \left[ h_1(t_i) N_1(i) + h_2(t_i) N_2(i) \right]$$

The rank values are calculated as follows according to which test statistic is used:

$$r_i = \begin{cases} 1, & \text{log-rank} \\ N_1(i) + N_2(i), & \text{Gehan} \\ \sqrt{N_1(i) + N_2(i)}, & \text{Tarone-Ware} \end{cases}$$

The distribution of the test statistic is approximated by $N(E, 1)$ where

$$E = \frac{\sum_{i=0}^{M-1} D_i r_i \left[ \frac{\phi_i \theta_i}{1 + \phi_i \theta_i} - \frac{\phi_i}{1 + \phi_i} \right]}{\sqrt{\sum_{i=0}^{M-1} D_i r_i^2 \frac{\phi_i}{(1 + \phi_i)^2}}}$$

Note that $N^{\frac{1}{2}}$ can be factored out of the mean E, and so it can be expressed equivalently as

$$E = N^{\frac{1}{2}} E^\star = N^{\frac{1}{2}} \, \frac{\sum_{i=0}^{M-1} D_i^\star r_i^\star \left[ \frac{\phi_i \theta_i}{1 + \phi_i \theta_i} - \frac{\phi_i}{1 + \phi_i} \right]}{\sqrt{\sum_{i=0}^{M-1} D_i^\star (r_i^\star)^2 \frac{\phi_i}{(1 + \phi_i)^2}}}$$

where $E^\star$ is free of N and

$$D_i^\star = \frac{1}{b} \left[ h_1(t_i) N_1^\star(i) + h_2(t_i) N_2^\star(i) \right]$$

$$r_i^\star = \begin{cases} 1, & \text{log-rank} \\ N_1^\star(i) + N_2^\star(i), & \text{Gehan} \\ \sqrt{N_1^\star(i) + N_2^\star(i)}, & \text{Tarone-Ware} \end{cases}$$

$$N_j^\star(0) = w_j$$
$$N_j^\star(i+1) = N_j^\star(i) \left[ 1 - \frac{1}{b} h_j(t_i) - \frac{1}{b} \Psi_j(t_i) - 1\{t_i > \tau\} \frac{1}{b (T + \tau - t_i)} \right]$$

The approximate power is

$$\mathrm{power} = \begin{cases} \Phi\left( -N^{\frac{1}{2}} E^\star - z_{1-\alpha} \right), & \text{upper 1-sided} \\ \Phi\left( N^{\frac{1}{2}} E^\star - z_{1-\alpha} \right), & \text{lower 1-sided} \\ \Phi\left( -N^{\frac{1}{2}} E^\star - z_{1-\frac{\alpha}{2}} \right) + \Phi\left( N^{\frac{1}{2}} E^\star - z_{1-\frac{\alpha}{2}} \right), & \text{2-sided} \end{cases}$$

Note that the upper and lower 1-sided cases are expressed differently than in other analyses. This is because $E^\star > 0$ corresponds to a higher survival curve in group
1 and thus, by the convention used in PROC POWER for 2-group analyses, the lower side. For the 1-sided cases, a closed-form inversion of the power equation yields an approximate total sample size

$$N = \left( \frac{z_{\mathrm{power}} + z_{1-\alpha}}{E^\star} \right)^2$$
For the 2-sided case, the solution for N is obtained by numerically inverting the power equation.
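To make the recursion concrete, the following DATA step sketch applies the formulas above to two exponential survival curves with no loss to follow-up and then solves the closed-form 1-sided sample size; all design values are assumptions for illustration:

   data lakatos_logrank;
      /* assumed illustrative design: exponential hazards, no loss */
      h1 = 0.14; h2 = 0.10;        /* group hazard rates               */
      T = 2; tau = 3; b = 12;      /* accrual, follow-up, subintervals */
      w1 = 0.5; w2 = 0.5;
      alpha = 0.05; power = 0.8;

      M = floor((T + tau)*b);
      N1 = w1; N2 = w2;            /* per-unit numbers at risk N_j*(i) */
      num = 0; den = 0;
      do i = 0 to M - 1;
         t = i/b;
         D = (h1*N1 + h2*N2)/b;    /* expected deaths D_i*             */
         r = 1;                    /* log-rank ranks                   */
         theta = h2/h1;
         phi   = N2/N1;
         num = num + D*r*( phi*theta/(1 + phi*theta) - phi/(1 + phi) );
         den = den + D*r**2 * phi/(1 + phi)**2;
         /* advance the recursion (administrative censoring after tau) */
         cens = (t > tau) / (b*(T + tau - t));
         N1 = N1*(1 - h1/b - cens);
         N2 = N2*(1 - h2/b - cens);
      end;
      Estar = num/sqrt(den);
      /* closed-form total N for the 1-sided test */
      Ntotal = ( (probit(power) + probit(1-alpha)) / Estar )**2;
   run;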
Examples

Example 57.1. One-Way ANOVA

This example deals with the same situation as in Example 34.1 on page 1951 of Chapter 34, "The GLMPOWER Procedure."

Hocking (1985, p. 109) describes a study of the effectiveness of electrolytes in reducing lactic acid buildup for long-distance runners. You are planning a similar study in which you will allocate five different fluids to runners on a 10-mile course and measure lactic acid buildup immediately after the race. The fluids consist of water and two commercial electrolyte drinks, EZDure and LactoZap, each prepared at two concentrations, low (EZD1 and LZ1) and high (EZD2 and LZ2).

You conjecture that the standard deviation of lactic acid measurements given any particular fluid is about 3.75, and that the expected lactic acid values will correspond roughly to those in Table 57.27. You are least familiar with the LZ1 drink and hence decide to consider a range of reasonable values for that mean.

Table 57.27. Mean Lactic Acid Buildup by Fluid

   Water   EZD1   EZD2   LZ1        LZ2
   35.6    33.7   30.2   29 or 28   25.9
You are interested in four different comparisons, shown in Table 57.28 with appropriate contrast coefficients.

Table 57.28. Planned Comparisons

                                 Contrast Coefficients
   Comparison                  Water  EZD1  EZD2  LZ1  LZ2
   Water versus electrolytes     4     -1    -1   -1   -1
   EZD versus LZ                 0      1     1   -1   -1
   EZD1 versus EZD2              0      1    -1    0    0
   LZ1 versus LZ2                0      0     0    1   -1
For each of these contrasts you want to determine the sample size required to achieve a power of 0.9 for detecting an effect with magnitude in accord with Table 57.27. You are not yet attempting to choose a single sample size for the study, but rather checking the range of sample sizes needed for individual contrasts. You plan to test each contrast at α = 0.025. In the interests of reducing costs, you will provide twice as many runners with water as with any of the electrolytes; in other words, you will use a sample size weighting scheme of 2:1:1:1:1. Use the ONEWAYANOVA statement in the POWER procedure to compute the sample sizes. The statements required to perform this analysis are as follows:

   proc power;
      onewayanova
         groupmeans   = 35.6 | 33.7 | 30.2 | 29 28 | 25.9
         stddev       = 3.75
         groupweights = (2 1 1 1 1)
         alpha        = 0.025
         ntotal       = .
         power        = 0.9
         contrast     = (4 -1 -1 -1 -1) (0 1 1 -1 -1)
                        (0 1 -1 0 0)   (0 0 0 1 -1);
   run;

The NTOTAL= option with a missing value (.) indicates total sample size as the result parameter. The GROUPMEANS= option with values from Table 57.27 specifies your conjectures for the means. With only one mean varying (the LZ1 mean), the "crossed" notation is simpler, showing scenarios for each group mean separated by a vertical bar (|). See the "Specifying Value Lists in Analysis Statements" section on page 3490 for more details on crossed and matched notations for grouped values. The contrasts in Table 57.28 are specified with the CONTRAST= option, using the "matched" notation with each contrast enclosed in parentheses. The STDDEV=, ALPHA=, and POWER= options specify the error standard deviation, significance level, and power. The GROUPWEIGHTS= option specifies the weighting scheme. Default values for the NULL= and SIDES= options specify a 2-sided t test of the contrast equal to 0. See Output 57.1.1 for the output.
Output 57.1.1. Sample Sizes for One-Way ANOVA Contrasts

The POWER Procedure
Single DF Contrast in One-Way ANOVA

Fixed Scenario Elements

   Method                 Exact
   Alpha                  0.025
   Standard Deviation     3.75
   Group Weights          2 1 1 1 1
   Nominal Power          0.9
   Number of Sides        2
   Null Contrast Value    0

Computed N Total

   Index  ---Contrast---  ----------Means----------  Actual Power  N Total
     1    4 -1 -1 -1 -1   35.6 33.7 30.2 29 25.9        0.947         30
     2    4 -1 -1 -1 -1   35.6 33.7 30.2 28 25.9        0.901         24
     3    0  1  1 -1 -1   35.6 33.7 30.2 29 25.9        0.929         60
     4    0  1  1 -1 -1   35.6 33.7 30.2 28 25.9        0.922         48
     5    0  1 -1  0  0   35.6 33.7 30.2 29 25.9        0.901        174
     6    0  1 -1  0  0   35.6 33.7 30.2 28 25.9        0.901        174
     7    0  0  0  1 -1   35.6 33.7 30.2 29 25.9        0.902        222
     8    0  0  0  1 -1   35.6 33.7 30.2 28 25.9        0.902        480
The sample sizes in Output 57.1.1 range from 24 for the comparison of water versus electrolytes to 480 for the comparison of LZ1 versus LZ2, both assuming the smaller LZ1 mean. The sample size for the latter comparison is relatively large because the small mean difference of 28 − 25.9 = 2.1 is hard to detect. The Nominal Power of 0.9 in the Fixed Scenario Elements table in Output 57.1.1 represents the input target power, and the Actual Power column in the Computed N Total table is the power at the sample size (N Total) adjusted to achieve the specified sample weighting. Note that all of the sample sizes are rounded up to multiples of 6 to preserve integer group sizes (since the group weights add up to 6). You can use the NFRACTIONAL option in the ONEWAYANOVA statement to compute raw fractional sample sizes.

Suppose you want to plot the required sample size for the range of power values from 0.5 to 0.95. First, define the analysis by specifying the same statements as before, but add the PLOTONLY option to the PROC POWER statement to disable the nongraphical results. Next, specify the PLOT statement with X=POWER to request a plot with power on the x-axis. (The result parameter, here sample size, is always plotted on the other axis.) Use the MIN= and MAX= options in the PLOT statement to specify the power range.

   proc power plotonly;
      onewayanova
         groupmeans   = 35.6 | 33.7 | 30.2 | 29 28 | 25.9
         stddev       = 3.75
         groupweights = (2 1 1 1 1)
         alpha        = 0.025
         ntotal       = .
         power        = 0.9
         contrast     = (4 -1 -1 -1 -1) (0 1 1 -1 -1)
                        (0 1 -1 0 0)   (0 0 0 1 -1);
      plot x=power min=.5 max=.95;
   run;
See Output 57.1.2 for the resulting plot.

Output 57.1.2. Plot of Sample Size versus Power for One-Way ANOVA Contrasts
In Output 57.1.2, the line style identifies the contrast, and the plotting symbol identifies the group means scenario. The plot shows that the required sample size is highest for the (0 0 0 1 -1) contrast, corresponding to the test of LZ1 versus LZ2 that was previously found to require the most resources, in either cell means scenario.

Note that some of the plotted points in Output 57.1.2 are unevenly spaced. This is because the plotted points are the rounded sample size results at their corresponding actual power levels. The range specified with the MIN= and MAX= values in the PLOT statement corresponds to nominal power levels. In some cases, actual power is substantially higher than nominal power. To obtain plots with evenly spaced points (but with fractional sample sizes at the computed points), you can use the NFRACTIONAL option in the analysis statement preceding the PLOT statement.

Finally, suppose you want to plot the power for the range of sample sizes you will likely consider for the study (the range of 24 to 480 that achieves 0.9 power for different comparisons). In the ONEWAYANOVA statement, identify power as the result
(POWER=.), and specify NTOTAL=24. The following statements produce the plot:
   proc power plotonly;
      onewayanova
         groupmeans   = 35.6 | 33.7 | 30.2 | 29 28 | 25.9
         stddev       = 3.75
         groupweights = (2 1 1 1 1)
         alpha        = 0.025
         ntotal       = 24
         power        = .
         contrast     = (4 -1 -1 -1 -1) (0 1 1 -1 -1)
                        (0 1 -1 0 0)   (0 0 0 1 -1);
      plot x=n min=24 max=480;
   run;
The X=N option in the PLOT statement requests a plot with sample size on the x-axis. Note that the value specified with the NTOTAL=24 option is not used. It is overridden in the plot by the MIN= and MAX= options in the PLOT statement, and the PLOTONLY option in the PROC POWER statement disables nongraphical results. But the NTOTAL= option (along with a value) is still needed in the ONEWAYANOVA statement as a placeholder, to identify the desired parameterization for sample size. See Output 57.1.3 for the plot.

Output 57.1.3. Plot of Power versus Sample Size for One-Way ANOVA Contrasts
Although Output 57.1.2 and Output 57.1.3 present essentially the same computations for practical power ranges, each provides a different quick visual assessment. Output 57.1.2 reveals the range of required sample sizes for powers of interest, and Output 57.1.3 reveals the range of achieved powers for sample sizes of interest.
Example 57.2. The Sawtooth Power Function in Proportion Analyses

For many common statistical analyses, the power curve is monotonically increasing: the more samples you take, the more power you achieve. However, in statistical analyses of discrete data, such as tests of proportions, the power curve is often nonmonotonic. A small increase in sample size can result in a decrease in power, a decrease that is sometimes substantial. The explanation is that the actual significance level (in other words, the achieved Type 1 error rate) for discrete tests strays below the target level and varies with sample size. The power loss from a decrease in the Type 1 error rate may outweigh the power gain from an increase in sample size. The example discussed in this section demonstrates this "sawtooth" phenomenon. For additional discussion on the topic, refer to Chernick and Liu (2002).

Suppose you have a new scheduling system for an airline, and you want to determine how many flights you must observe to have at least an 80% chance of establishing an improvement in the proportion of late arrivals on a specific travel route. You will use a 1-sided exact binomial proportion test with a null proportion of 30%, the frequency of late arrivals under the previous scheduling system, and a nominal significance level of α = 0.05. Well-supported predictions estimate the new late arrival rate to be about 20%, and you will base your sample size determination on this assumption.

The POWER procedure does not currently compute exact sample size directly for the exact binomial test. But you can get an initial estimate by computing the approximate sample size required for a z test. Use the ONESAMPLEFREQ statement in the POWER procedure with TEST=Z and METHOD=NORMAL to compute the approximate sample size to achieve a power of 0.8 using the z test. The following statements perform the analysis:

   proc power;
      onesamplefreq test=z method=normal
         sides          = 1
         alpha          = 0.05
         nullproportion = 0.3
         proportion     = 0.2
         ntotal         = .
         power          = 0.8;
   run;
The NTOTAL= option with a missing value (.) indicates sample size as the result parameter. The SIDES=1 option specifies a 1-sided test. The ALPHA=, NULLPROPORTION=, and POWER= options specify the significance level of 0.05, null value of 0.3, and target power of 0.8. The PROPORTION= option specifies your conjecture of 0.2 for the true proportion.
Output 57.2.1. Approximate Sample Size for z Test of a Proportion

The POWER Procedure
Z Test for Binomial Proportion

Fixed Scenario Elements

   Method               Normal approximation
   Number of Sides      1
   Null Proportion      0.3
   Alpha                0.05
   Binomial Proportion  0.2
   Nominal Power        0.8

Computed N Total

   Actual Power  N Total
   0.800         119
The results, shown in Output 57.2.1, indicate that you need to observe about N=119 flights to have an 80% chance of rejecting the hypothesis of a late arrival proportion of 30% or higher, if the true proportion is 20%, using the z test. A similar analysis (Output 57.2.2) reveals an approximate sample size of N=129 for the z test with continuity correction, which is performed using TEST=ADJZ:

   proc power;
      onesamplefreq test=adjz method=normal
         sides          = 1
         alpha          = 0.05
         nullproportion = 0.3
         proportion     = 0.2
         ntotal         = .
         power          = 0.8;
   run;
Output 57.2.2. Approximate Sample Size for z Test with Continuity Correction

The POWER Procedure
Z Test for Binomial Proportion with Continuity Adjustment

Fixed Scenario Elements

   Method               Normal approximation
   Number of Sides      1
   Null Proportion      0.3
   Alpha                0.05
   Binomial Proportion  0.2
   Nominal Power        0.8

Computed N Total

   Actual Power  N Total
   0.801         129
Based on the approximate sample size results, you decide to explore the power of the exact binomial test for sample sizes between 110 and 140. The following statements produce the plot:

   proc power plotonly;
      onesamplefreq test=exact
         sides          = 1
         alpha          = 0.05
         nullproportion = 0.3
         proportion     = 0.2
         ntotal         = 119
         power          = .;
      plot x=n min=110 max=140 step=1
           yopts=(ref=.8) xopts=(ref=119 129);
   run;
The TEST=EXACT option in the ONESAMPLEFREQ statement specifies the exact binomial test, and the missing value (.) for the POWER= option indicates power as the result parameter. The PLOTONLY option in the PROC POWER statement disables nongraphical output. The PLOT statement with X=N requests a plot with sample size on the x-axis. The MIN= and MAX= options in the PLOT statement specify the sample size range. The YOPTS=(REF=) and XOPTS=(REF=) options add reference lines to highlight the approximate sample size results. The STEP=1 option produces a point at each integer sample size. The sample size value specified with the NTOTAL= option in the ONESAMPLEFREQ statement is overridden by the MIN= and MAX= options in the PLOT statement. Output 57.2.3 shows the plot.
Output 57.2.3. Plot of Power versus Sample Size for Exact Binomial Test
Note the sawtooth pattern in Output 57.2.3. Although the power surpasses the target level of 0.8 at N=119, it decreases to 0.79 with N=120 and further to 0.76 with N=122 before rising again to 0.81 with N=123. Not until N=130 does the power stay above the 0.8 target. Thus, a more conservative sample size recommendation of 130 might be appropriate, depending on the precise goals of the sample size determination.

In addition to considering alternative sample sizes, you may also want to assess the sensitivity of the power to inaccuracies in assumptions about the true proportion. The following statements produce a plot including true proportion values of 0.18 and 0.22. They are identical to the previous statements except for the additional true proportion values specified with the PROPORTION= option in the ONESAMPLEFREQ statement.

   proc power plotonly;
      onesamplefreq test=exact
         sides          = 1
         alpha          = 0.05
         nullproportion = 0.3
         proportion     = 0.18 0.2 0.22
         ntotal         = 119
         power          = .;
      plot x=n min=110 max=140 step=1
           yopts=(ref=.8) xopts=(ref=119 129);
   run;
Output 57.2.4 shows the plot.
Output 57.2.4. Plot for Assessing Sensitivity to True Proportion Value
The plot reveals a dramatic sensitivity to the true proportion value. For N=119, the power is about 0.92 if the true proportion is 0.22, and as low as 0.62 if the proportion is 0.18. Note also that the power jumps occur at the same sample sizes in all three curves; the curves are only shifted and stretched vertically. This is because spikes and valleys in power curves are invariant to the true proportion value; they are due to changes in the critical value of the test.

A closer look at some ancillary output from the analysis sheds light on this property of the sawtooth pattern. You can add an ODS OUTPUT statement to save the plot content corresponding to Output 57.2.3 to a data set:

   proc power plotonly;
      ods output plotcontent=PlotData;
      onesamplefreq test=exact
         sides          = 1
         alpha          = 0.05
         nullproportion = 0.3
         proportion     = 0.2
         ntotal         = 119
         power          = .;
      plot x=n min=110 max=140 step=1
           yopts=(ref=.8) xopts=(ref=119 129);
   run;
The PlotData data set contains parameter values for each point in the plot. The parameters include underlying characteristics of
the putative test. The following statements print the critical value and actual significance level along with sample size and power.
   proc print data=PlotData;
      var NTotal LowerCritVal Alpha Power;
   run;
Output 57.2.5 shows the plot data.

Output 57.2.5. Numerical Content of Plot

   Obs  NTotal  Lower CritVal  Alpha   Power
     1    110        24        0.0356  0.729
     2    111        24        0.0313  0.713
     3    112        25        0.0446  0.771
     4    113        25        0.0395  0.756
     5    114        25        0.0349  0.741
     6    115        26        0.0490  0.795
     7    116        26        0.0435  0.781
     8    117        26        0.0386  0.767
     9    118        26        0.0341  0.752
    10    119        27        0.0478  0.804
    11    120        27        0.0425  0.790
    12    121        27        0.0377  0.776
    13    122        27        0.0334  0.762
    14    123        28        0.0465  0.812
    15    124        28        0.0414  0.799
    16    125        28        0.0368  0.786
    17    126        28        0.0327  0.772
    18    127        29        0.0453  0.820
    19    128        29        0.0404  0.807
    20    129        29        0.0359  0.794
    21    130        30        0.0493  0.838
    22    131        30        0.0441  0.827
    23    132        30        0.0394  0.815
    24    133        30        0.0351  0.803
    25    134        31        0.0480  0.845
    26    135        31        0.0429  0.834
    27    136        31        0.0384  0.823
    28    137        31        0.0342  0.811
    29    138        32        0.0466  0.851
    30    139        32        0.0418  0.841
    31    140        32        0.0374  0.830
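You can verify any row of Output 57.2.5 directly with the PROBBNML function, which returns the lower-tail binomial probability. The following sketch recomputes the actual significance level and power for N=119 and the lower critical value 27; the results should agree with the corresponding Alpha and Power entries above:

   data check_row;
      n = 119; crit = 27;
      /* actual alpha: P(X <= crit) under the null proportion 0.3 */
      alpha = probbnml(0.3, n, crit);
      /* power: P(X <= crit) under the true proportion 0.2 */
      power = probbnml(0.2, n, crit);
   run;

   proc print data=check_row; run;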
Note that whenever the critical value changes, the actual α jumps up to a value close to the nominal α=0.05, and the power also jumps up. Then, while the critical value stays constant, the actual α and power slowly decrease. The critical value is independent of the true proportion value. So, you can achieve a locally maximal power by choosing a sample size corresponding to a spike on the sawtooth curve, and this choice is locally optimal regardless of the unknown value of the true proportion. Locally optimal sample sizes in this case include 115, 119, 123, 127, 130, and 134.

As a point of interest, the power does not always jump sharply and decrease gradually. The shape of the sawtooth depends on the direction of the test and the location of the null proportion relative to 0.5. For example, if the direction of the hypothesis in this example is reversed (by switching true and null proportion values) so that the rejection region is in the upper tail, then the power curve exhibits sharp decreases and gradual increases. The following statements are similar to those producing the plot in Output 57.2.3 but with the values of the PROPORTION= and NULLPROPORTION= options switched.
   proc power plotonly;
      onesamplefreq test=exact
         sides          = 1
         alpha          = 0.05
         nullproportion = 0.2
         proportion     = 0.3
         ntotal         = 119
         power          = .;
      plot x=n min=110 max=140 step=1;
   run;
The resulting plot is shown in Output 57.2.6.

Output 57.2.6. Plot of Power versus Sample Size for Another 1-Sided Test
Finally, 2-sided tests can lead to even more irregular power curve shapes, since changes in lower and upper critical values affect the power in different ways. The following statements produce a plot of power versus sample size for the scenario of a 2-sided test with high alpha and a true proportion close to the null value.
   proc power plotonly;
      onesamplefreq test=exact
         sides          = 2
         alpha          = 0.2
         nullproportion = 0.1
         proportion     = 0.09
         ntotal         = 10
         power          = .;
      plot x=n min=2 max=100 step=1;
   run;
The resulting plot is shown in Output 57.2.7.

Output 57.2.7. Plot of Power versus Sample Size for a 2-Sided Test
Due to the irregular shapes of power curves for proportion tests, the question "Which sample size should I use?" is often insufficient. A sample size solution produced directly in PROC POWER reveals the smallest possible sample size to achieve your target power. But as the examples in this section demonstrate, it is helpful to consult graphs for answers to questions such as the following:

• Which sample size will guarantee that all higher sample sizes also achieve my target power?

• Given a candidate sample size, can I increase it slightly to achieve locally maximal power, or perhaps even decrease it and get higher power?
Example 57.3. Simple AB/BA Crossover Designs

Crossover trials are experiments in which each subject is given a sequence of different treatments. They are especially common in clinical trials for medical studies. The reduction in variability from taking multiple measurements on a subject allows for more precise treatment comparisons. The simplest such design is the AB/BA crossover, in which each subject receives each of two treatments in a randomized order. Under certain simplifying assumptions, you can test the treatment difference in an AB/BA crossover trial using either a paired or two-sample t test (or equivalence test, depending on the hypothesis). This example will demonstrate when and how you can use the PAIREDMEANS statement in PROC POWER to perform power analyses for AB/BA crossover designs.

Senn (1993, Chapter 3) discusses a study comparing the effects of two bronchodilator medications in treatment of asthma, using an AB/BA crossover design. Suppose you want to plan a similar study comparing two new medications, "Xilodol" and "Brantium." Half of the patients would be assigned to sequence AB, getting a dose of Xilodol in the first treatment period, a wash-out period of one week, and then a dose of Brantium in the second treatment period. The other half would be assigned to sequence BA, following the same schedule but with the drugs reversed. In each treatment period you would administer the drugs in the morning and then measure peak expiratory flow (PEF) at the end of the day, with higher PEF representing better lung function.

You conjecture that the mean and standard deviation of PEF are about μA = 310 and σA = 40 for Xilodol and μB = 330 and σB = 55 for Brantium, and that each pair of measurements on the same subject will have a correlation of about 0.3. You want to compute the power of both 1-sided and 2-sided tests of mean difference, with a significance level of α = 0.01, for a sample size of 100 patients and also plot the power for a range of 50 to 200 patients. Note that the allocation ratio of patients to the two sequences is irrelevant in this analysis.

The choice of statistical test depends on which assumptions are reasonable. One possibility is a t test. A paired or two-sample t test is valid when there is no carryover effect and no interactions between patients, treatments, and periods. See Senn (1993, Chapter 3) for more details. The choice between a paired or a two-sample test depends on what you assume about the period effect. If you assume no period effect, then a paired t test is the appropriate analysis for the design, with the first member of each pair being the Xilodol measurement (regardless of which sequence the patient belongs to). Otherwise the two-sample t test approach is called for, since this analysis adjusts for the period effect using an extra degree of freedom.

Suppose you assume no period effect. Then you can use the PAIREDMEANS statement in PROC POWER with the TEST=DIFF option to perform a sample size analysis for the paired t test. Indicate power as the result parameter by specifying the POWER= option with a missing value (.). Specify the conjectured means and standard deviations for each drug using the PAIREDMEANS= and PAIREDSTDDEVS= options and the correlation using the CORR= option. Specify both 1- and 2-sided
tests using the SIDES= option, the significance level using the ALPHA= option, and the sample size (in terms of number of pairs) using the NPAIRS= option. Generate a plot of power versus sample size by specifying the PLOT statement with X=N to request a plot with sample size on the x-axis. (The result parameter, here power, is always plotted on the other axis.) Use the MIN= and MAX= options in the PLOT statement to specify the sample size range (as numbers of pairs). The following statements perform the sample size analysis.

   proc power;
      pairedmeans test=diff
         pairedmeans   = (330 310)
         pairedstddevs = (40 55)
         corr          = 0.3
         sides         = 1 2
         alpha         = 0.01
         npairs        = 100
         power         = .;
      plot x=n min=50 max=200;
   run;
Default values for the NULLDIFF= and DIST= options specify a null mean difference of 0 and the assumption of normally distributed data. The output is shown in Output 57.3.1 and Output 57.3.2.

Output 57.3.1. Power for Paired t Analysis of Crossover Design

The POWER Procedure
Paired t Test for Mean Difference

Fixed Scenario Elements

   Distribution           Normal
   Method                 Exact
   Alpha                  0.01
   Mean 1                 330
   Mean 2                 310
   Standard Deviation 1   40
   Standard Deviation 2   55
   Correlation            0.3
   Number of Pairs        100
   Null Difference        0

Computed Power

   Index  Sides  Power
   1      1      0.865
   2      2      0.801
Output 57.3.2. Plot of Power versus Sample Size for Paired t Analysis of Crossover Design
The Computed Power table in Output 57.3.1 shows that the power with 100 patients is about 0.8 for the 2-sided test and 0.87 for the 1-sided test with the alternative of larger Brantium mean. In Output 57.3.2, the line style identifies the number of sides of the test. The plotting symbols identify locations of actual computed powers; the curves are linear interpolations of these points. The plot demonstrates how much higher the power is for the 1-sided test than the 2-sided test for the range of sample sizes.

Suppose now that instead of detecting a difference between Xilodol and Brantium, you want to establish that they are similar, in particular, that the absolute mean PEF difference is at most 35. You might consider this goal if, for example, one of the drugs has fewer side effects and if a difference of no more than 35 is considered clinically small. Instead of a standard t test, you would conduct an equivalence test of the treatment mean difference for the two drugs. You would test the hypothesis that the true difference is less than -35 or more than 35 against the alternative that the mean difference is between -35 and 35, using an additive model and a two one-sided tests ("TOST") analysis.

Assuming no period effect, you can use the PAIREDMEANS statement with the TEST=EQUIV_ADD option to perform a sample size analysis for the paired equivalence test. Indicate power as the result parameter by specifying the POWER= option with a missing value (.). Use the LOWER= and UPPER= options to specify the equivalence bounds of -35 and 35. Use the PAIREDMEANS=, PAIREDSTDDEVS=, CORR=, and ALPHA= options in the same way as in the t test at the beginning of this example to specify the remaining parameters.
The following statements perform the sample size analysis.
   proc power;
      pairedmeans test=equiv_add
         lower         = -35
         upper         = 35
         pairedmeans   = (330 310)
         pairedstddevs = (40 55)
         corr          = 0.3
         alpha         = 0.01
         npairs        = 100
         power         = .;
   run;
The default option DIST=NORMAL specifies an assumption of normally distributed data. The output is shown in Output 57.3.3.

Output 57.3.3. Power for Paired Equivalence Test for Crossover Design

The POWER Procedure
Equivalence Test for Paired Mean Difference

Fixed Scenario Elements

   Distribution               Normal
   Method                     Exact
   Lower Equivalence Bound    -35
   Upper Equivalence Bound    35
   Alpha                      0.01
   Reference Mean             330
   Treatment Mean             310
   Standard Deviation 1       40
   Standard Deviation 2       55
   Correlation                0.3
   Number of Pairs            100

Computed Power

   Power
   0.598
The power for the paired equivalence test with 100 patients is about 0.6.
Example 57.4. Noninferiority Test with Lognormal Data

The typical goal in noninferiority testing is to conclude that a new treatment or process or product is not appreciably worse than some standard. This is accomplished by convincingly rejecting a 1-sided null hypothesis that the new treatment is appreciably worse than the standard. When designing such studies, investigators must define precisely what constitutes "appreciably worse."
You can use the POWER procedure for sample size analyses for a variety of noninferiority tests, by specifying custom, 1-sided null hypotheses for common tests. This example illustrates the strategy (often called Blackwelder's scheme, Blackwelder 1982) by comparing the means of two independent lognormal samples. The logic applies to one-sample, two-sample, and paired-sample problems involving normally distributed measures and proportions.

Suppose you are designing a study hoping to show that a new (less expensive) manufacturing process does not produce appreciably more pollution than the current process. Quantifying "appreciably worse" as 10%, you seek to show that the mean pollutant level from the new process is less than 110% of that from the current process. In standard hypothesis testing notation, you seek to reject

$$H_0: \frac{\mu_{\mathrm{new}}}{\mu_{\mathrm{current}}} \ge 1.10$$

in favor of

$$H_A: \frac{\mu_{\mathrm{new}}}{\mu_{\mathrm{current}}} < 1.10$$

This is described graphically in Figure 57.8. Mean ratios below 100% are better levels for the new process; a ratio of 100% indicates absolute equivalence; ratios of 100–110% are "tolerably" worse; and ratios exceeding 110% are appreciably worse.

Figure 57.8. Hypotheses for the Pollutant Study (the axis shows μnew/μcurrent, with the region below 110% labeled "not appreciably worse (HA)" and the region above 110% labeled "appreciably worse (H0)")
An appropriate test for this situation is the common two-group t test on log-transformed data. The hypotheses become

$$H_0: \log(\mu_{\mathrm{new}}) - \log(\mu_{\mathrm{current}}) \ge \log(1.10)$$
$$H_A: \log(\mu_{\mathrm{new}}) - \log(\mu_{\mathrm{current}}) < \log(1.10)$$

Measurements of the pollutant level will be taken using laboratory models of the two processes and will be treated as independent lognormal observations with a coefficient of variation (σ/μ) between 0.5 and 0.6 for both processes. You will end up with 300 measurements for the current process and 180 for the new one. It is important to
avoid a Type 1 error here, so you set the Type 1 error rate to 0.01. Your theoretical work suggests that the new process will actually reduce the pollutant by about 10% (to 90% of current), but you need to compute and graph the power of the study if the new levels are actually between 70% and 120% of current levels.

Implement the sample size analysis using the TWOSAMPLEMEANS statement in PROC POWER with the TEST=RATIO option. Indicate power as the result parameter by specifying the POWER= option with a missing value (.). Specify a series of scenarios for the mean ratio between 0.7 and 1.2 using the MEANRATIO= option. Use the NULLRATIO= option to specify the null mean ratio of 1.10. Specify SIDES=L to indicate a 1-sided test with the alternative hypothesis stating that the mean ratio is lower than the null value. Specify the significance level, scenarios for the coefficient of variation, and the group sample sizes using the ALPHA=, CV=, and GROUPNS= options. Generate a plot of power versus mean ratio by specifying the PLOT statement with X=EFFECT to request a plot with mean ratio on the x-axis. (The result parameter, here power, is always plotted on the other axis.) Use the STEP= option in the PLOT statement to specify an interval of 0.05 between computed points in the plot. The following statements perform the desired analysis.

   proc power;
      twosamplemeans test=ratio
         meanratio = 0.7 to 1.2 by 0.1
         nullratio = 1.10
         sides     = L
         alpha     = 0.01
         cv        = 0.5 0.6
         groupns   = (300 180)
         power     = .;
      plot x=effect step=0.05;
   run;
Note the use of SIDES=L, which forces computations for cases that need a rejection region that is opposite to the one providing the most one-tailed power; in this case, it is the lower tail. Such cases will show power that is less than the prescribed Type 1 error rate. The default option DIST=LOGNORMAL specifies the assumption of lognormally distributed data. The default MIN= and MAX= options in the PLOT statement specify an x-axis range identical to the effect size range in the TWOSAMPLEMEANS statement (mean ratios between 0.7 and 1.2). See the output in Output 57.4.1 and Output 57.4.2.
Output 57.4.1. Power for Noninferiority Test of Ratio

The POWER Procedure
Two-sample t Test for Mean Ratio

Fixed Scenario Elements

   Distribution               Lognormal
   Method                     Exact
   Number of Sides            L
   Null Geometric Mean Ratio  1.1
   Alpha                      0.01
   Group 1 Sample Size        300
   Group 2 Sample Size        180

Computed Power

   Index  Geo Mean Ratio  CV   Power
     1    0.7             0.5  >.999
     2    0.7             0.6  >.999
     3    0.8             0.5  >.999
     4    0.8             0.6  >.999
     5    0.9             0.5  0.985
     6    0.9             0.6  0.933
     7    1.0             0.5  0.424
     8    1.0             0.6  0.306
     9    1.1             0.5  0.010
    10    1.1             0.6  0.010
    11    1.2             0.5  <.001
    12    1.2             0.6  <.001
Output 57.4.2. Plot of Power versus Mean Ratio for Noninferiority Test
The Computed Power table in Output 57.4.1 shows that power exceeds 0.90 if the true mean ratio is 90% or less, as surmised. But power is unacceptably low (0.31–0.42) if the processes happen to be truly equivalent. Note that the power is identical to the alpha level (0.01) if the true mean ratio is 1.10 and below 0.01 if the true mean ratio is appreciably worse (> 110%). In Output 57.4.2, the line style identifies the coefficient of variation. The plotting symbols identify locations of actual computed powers; the curves are linear interpolations of these points.
Example 57.5. Multiple Regression and Correlation

You are working with a team of preventive cardiologists investigating whether elevated serum homocysteine levels are linked to atherosclerosis (plaque buildup in coronary arteries). The planned analysis is an ordinary least squares regression to assess the relationship between total homocysteine level (tHcy) and a plaque burden index (PBI), adjusting for six other variables: age, gender, plasma levels of folate, vitamins B6 and B12, and a serum cholesterol index. You will regress PBI on tHcy and the six other predictors (plus the intercept) and use a Type III F test to assess whether tHcy is a significant predictor after adjusting for the others. You wonder whether 100 subjects will provide adequate statistical power.

This is a correlational study at a single time. Subjects will be screened so that about half will have had a heart problem. All eight variables will be measured during one visit. Most clinicians are familiar with simple correlations between two variables, so you decide to pose the statistical problem in terms of estimating and testing the partial correlation between X1 = tHcy and Y = PBI, controlling for the six other predictor
variables ($R_{Y X_1 \mid X_{-1}}$). This greatly simplifies matters, especially the elicitation of the conjectured effect.
You use partial regression plots like that shown in Figure 57.9 to teach the team that the partial correlation between PBI and tHcy is the correlation of two sets of residuals obtained from ordinary regression models, one from regressing PBI on the six covariates and the other from regressing tHcy on the same covariates. Thus each subject has “expected” tHcy and PBI values based on the six covariates. The cardiologists believe that subjects who are relatively higher than expected on tHcy will also be relatively higher than expected on PBI. The partial correlation quantifies that adjusted association just like a standard simple correlation does with the unadjusted linear association between two variables.
Figure 57.9. Partial Regression Plot (a scatter of Y versus X, each adjusted for the common set of 6 covariates, annotated with ρ = 0.35)
Based on previously published studies of various coronary risk factors and after viewing a set of scatterplots showing various correlations, the team surmises that the true partial correlation is likely to be at least 0.35. You want to compute the statistical power for a sample size of N = 100, using α = 0.05. You also want to plot power for sample sizes between 50 and 150. Use the MULTREG statement to compute the power and the PLOT statement to produce the graph. Since the predictors are observed rather than fixed in advance, and a joint multivariate normal assumption seems tenable, use MODEL=RANDOM. The following statements perform the power analysis:
   proc power;
      multreg
         model           = random
         nfullpredictors = 7
         ntestpredictors = 1
         partialcorr     = 0.35
         ntotal          = 100
         power           = .;
      plot x=n min=50 max=150;
   run;
The POWER=. option identifies power as the parameter to compute. The NFULLPREDICTORS= option specifies 7 total predictors (not including the intercept), and the NTESTPREDICTORS= option indicates that 1 of those predictors is being tested. The PARTIALCORR= and NTOTAL= options specify the partial correlation and sample size, respectively. The default value for the ALPHA= option sets the significance level to 0.05. The X=N option in the PLOT statement requests a plot of sample size on the x-axis, and the MIN= and MAX= options specify the sample size range. Output 57.5.1 shows the output, and Output 57.5.2 shows the plot.

Output 57.5.1. Power Analysis for Multiple Regression

The POWER Procedure
Type III F Test in Multiple Regression

Fixed Scenario Elements

   Method                              Exact
   Model                               Random X
   Number of Predictors in Full Model  7
   Number of Test Predictors           1
   Partial Correlation                 0.35
   Total Sample Size                   100
   Alpha                               0.05

Computed Power

   Power
   0.939
Output 57.5.2. Plot of Power versus Sample Size for Multiple Regression
For the sample size N = 100, the study is almost balanced with respect to Type 1 and Type 2 error rates, with α = 0.05 and β = 1 − 0.939 = 0.061. The study thus seems well designed at this sample size.

Now suppose that in a follow-up meeting with the cardiologists, you discover that their specific intent is to demonstrate that the (partial) correlation between PBI and tHcy is greater than 0.2. You suggest changing the planned data analysis to a 1-sided Fisher's z test with a null correlation of 0.2. The following statements perform a power analysis for this test:

   proc power;
      onecorr dist=fisherz
         npvars   = 6
         corr     = 0.35
         nullcorr = 0.2
         sides    = 1
         ntotal   = 100
         power    = .;
   run;
The DIST=FISHERZ option in the ONECORR statement specifies Fisher’s z test. The NPVARS= option specifies that 6 additional variables are adjusted for in the partial correlation. The CORR= option specifies the conjectured correlation of 0.35, and the NULLCORR= option indicates the null value of 0.2. The SIDES= option specifies a 1-sided test. Output 57.5.3 shows the output.
Output 57.5.3. Power Analysis for Fisher's z Test

The POWER Procedure
Fisher's z Test for Pearson Correlation

Fixed Scenario Elements

   Distribution                        Fisher's z transformation of r
   Method                              Normal approximation
   Number of Sides                     1
   Null Correlation                    0.2
   Number of Variables Partialled Out  6
   Correlation                         0.35
   Total Sample Size                   100
   Nominal Alpha                       0.05

Computed Power

   Actual Alpha  Power
   0.05          0.466
The power for Fisher's z test is less than 50%, the decrease being mostly due to the smaller effect size (relative to the null value). When asked for a recommendation for a new sample size goal, you compute the required sample size to achieve a power of 0.95 (to balance Type 1 and Type 2 errors) and 0.85 (a threshold deemed to be minimally acceptable to the team). The following statements perform the sample size determination:

   proc power;
      onecorr dist=fisherz
         npvars   = 6
         corr     = 0.35
         nullcorr = 0.2
         sides    = 1
         ntotal   = .
         power    = 0.85 0.95;
   run;
The NTOTAL=. option identifies sample size as the parameter to compute, and the POWER= option specifies the target powers.
Output 57.5.4. Sample Size Determination for Fisher's z Test

The POWER Procedure
Fisher's z Test for Pearson Correlation

Fixed Scenario Elements

   Distribution                        Fisher's z transformation of r
   Method                              Normal approximation
   Number of Sides                     1
   Null Correlation                    0.2
   Number of Variables Partialled Out  6
   Correlation                         0.35
   Nominal Alpha                       0.05

Computed N Total

   Index  Nominal Power  Actual Alpha  Actual Power  N Total
   1      0.85           0.05          0.850         280
   2      0.95           0.05          0.950         417
The results in Output 57.5.4 reveal a required sample size of 417 to achieve a power of 0.95 and 280 to achieve a power of 0.85.
Example 57.6. Comparing Two Survival Curves

You are consulting for a clinical research group planning a trial to compare survival rates for proposed and standard cancer treatments. The planned data analysis is a log-rank test to nonparametrically compare the overall survival curves for the two treatments. Your goal is to determine an appropriate sample size to achieve a power of 0.8 for a 2-sided test with α = 0.05 using a balanced design.

The survival curve for patients on the standard treatment is well-known to be approximately exponential with a median survival time of five years. The research group conjectures that the new proposed treatment will yield a (nonexponential) survival curve similar to the dashed line in Output 57.6.1. Patients will be accrued uniformly over two years and then followed for an additional three years past the accrual period. Some loss to follow-up is expected, with roughly exponential rates that would result in about 50% loss with the standard treatment within 10 years. The loss to follow-up with the proposed treatment is more difficult to predict, but 50% loss would be expected to occur sometime between years 5 and 20.
Output 57.6.1. Survival Curves
Use the TWOSAMPLESURVIVAL statement with the TEST=LOGRANK option to compute the required sample size for the log-rank test. The following statements perform the analysis:

   proc power;
      twosamplesurvival test=logrank
         curve("Standard")  = 5 : 0.5
         curve("Proposed")  = (1 to 5 by 1):(0.95 0.9 0.75 0.7 0.6)
         groupsurvival      = "Standard" | "Proposed"
         accrualtime        = 2
         followuptime       = 3
         groupmedlosstimes  = 10 | 20 5
         power              = 0.8
         npergroup          = .;
   run;
The CURVE= option defines the two survival curves. The “Standard” curve has only one point, specifying an exponential form with a survival probability of 0.5 at year 5. The “Proposed” curve is a piecewise linear curve defined by the five points shown in Output 57.6.1. The GROUPSURVIVAL= option assigns the survival curves to the two groups, and the ACCRUALTIME= and FOLLOWUPTIME= options specify the accrual and follow-up times. The GROUPMEDLOSSTIMES= option specifies the years at which 50% loss is expected to occur. The POWER= option specifies the target power, and the NPERGROUP=. option identifies sample size per group as the parameter to compute. Default values for the SIDES= and ALPHA= options specify a 2-sided test with α = 0.05.
Output 57.6.2 shows the results.

Output 57.6.2. Sample Size Determination for Log-Rank Test

The POWER Procedure
Log-Rank Test for Two Survival Curves

Fixed Scenario Elements

   Method                        Lakatos normal approximation
   Accrual Time                  2
   Follow-up Time                3
   Group 1 Survival Curve        Standard
   Form of Survival Curve 1      Exponential
   Group 2 Survival Curve        Proposed
   Form of Survival Curve 2      Piecewise Linear
   Group 1 Median Loss Time      10
   Nominal Power                 0.8
   Number of Sides               2
   Number of Time Sub-Intervals  12
   Alpha                         0.05

Computed N Per Group

   Index  Median Loss Time 2  Actual Power  N Per Group
   1      20                  0.800         228
   2      5                   0.801         234
The required sample size per group to achieve a power of 0.8 is 228 if the median loss time is 20 years for the proposed treatment. Only six more patients are required in each group if the median loss time is as short as five years.
Example 57.7. Confidence Interval Precision

An investment firm has hired you to help plan a study to estimate the success of a new investment strategy called "IntuiVest." The study involves complex simulations of market conditions over time, and it tracks the balance of a hypothetical brokerage account starting with $50,000. Each simulation is very expensive in terms of computing time. You are asked to determine an appropriate number of simulations to estimate the average change in the account balance at the end of three years.

The goal is to have a 95% chance of obtaining a 90% confidence interval whose half-width is at most $1,000. That is, the firm wants to have a 95% chance of being able to correctly claim at the end of the study that "Our research shows with 90% confidence that IntuiVest yields a profit of $X +/- $1,000 at the end of three years on an initial investment of $50,000 (under simulated market conditions)."

The probability of achieving the desired precision (that is, a small interval width) can be calculated either unconditionally or conditionally given that the true mean is captured by the interval. You decide to use the conditional form, considering two of its advantages:
• The conditional probability is usually lower than the unconditional probability for the same sample size, meaning that the conditional form is generally conservative.

• The overall probability of achieving the desired precision and capturing the true mean is easily computed as the product of the half-width probability and the confidence level, as formalized below. In this case, the overall probability is 0.95 × 0.9 = 0.855.
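In symbols, writing W for the event that the half-width is at most the target and C for the event that the interval covers the true mean, the second point is just the multiplication rule:

$$
\Pr(W \cap C) \;=\; \Pr(W \mid C)\,\Pr(C) \;=\; 0.95 \times 0.90 \;=\; 0.855
$$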
Based on some initial simulations, you expect a standard deviation between $25,000 and $45,000 for the ending account balance. You will consider both of these values in the sample size analysis.

As mentioned in the "Overview of Power Concepts" section on page 3488, an analysis of confidence interval precision is analogous to a traditional power analysis, with "CI Half-Width" taking the place of effect size and "Prob(Width)" taking the place of power. In this example, the target CI Half-Width is 1000, and the desired Prob(Width) is 0.95.

In addition to computing sample sizes for a half-width of $1,000, you are asked to plot the required number of simulations for a range of half-widths between $500 and $2,000. Use the ONESAMPLEMEANS statement with the CI=T option to implement the sample size determination. The following statements perform the analysis:
proc power;
   onesamplemeans ci=t
      alpha     = 0.1
      halfwidth = 1000
      stddev    = 25000 45000
      probwidth = 0.95
      ntotal    = .;
   plot x=effect min=500 max=2000;
run;
The NTOTAL=. option identifies sample size as the parameter to compute. The ALPHA=0.1 option specifies a confidence level of 1 − α = 0.9. The HALFWIDTH= option specifies the target half-width, and the STDDEV= option specifies the conjectured standard deviation values. The PROBWIDTH= option specifies the desired probability of achieving the target precision. The default value PROBTYPE=CONDITIONAL specifies that this probability is conditional on the true mean being captured by the interval. The default of SIDES=2 indicates a 2-sided interval. Output 57.7.1 shows the output, and Output 57.7.2 shows the plot.
Output 57.7.1. Sample Size Determination for Confidence Interval Precision

The POWER Procedure
Confidence Interval for Mean

Fixed Scenario Elements
Distribution          Normal
Method                Exact
Alpha                 0.1
CI Half-Width         1000
Nominal Prob(Width)   0.95
Number of Sides       2
Prob Type             Conditional

Computed N Total
Index   Std Dev   Actual Prob(Width)   N Total
  1      25000          0.951           1788
  2      45000          0.950           5652
Output 57.7.2. Plot of Sample Size vs. Confidence Interval Half-Width
The number of simulations required in order to have a 95% chance of obtaining a half-width of at most 1000 is between 1788 and 5652, depending on the standard deviation. The plot reveals that over 20,000 simulations would be required for a half-width of 500 assuming the higher standard deviation.
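As a rough sanity check on these numbers, you can ignore the sampling variability of the standard deviation and the 95% Prob(Width) requirement and treat the half-width as fixed at the normal-theory value. This back-of-envelope approximation is not how PROC POWER computes the exact results above, but it lands in the right ballpark:

$$
n \;\approx\; \left(\frac{z_{0.95}\,\sigma}{h}\right)^{2}
  \;=\; \left(\frac{1.645 \times 25000}{1000}\right)^{2}
  \;\approx\; 1692
$$

which is close to, and as expected somewhat below, the exact 1788 for σ = $25,000.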
Example 57.8. Customizing Plots

The example in this section demonstrates various ways you can modify and enhance plots:

• assigning analysis parameters to axes
• fine-tuning a sample size axis
• adding reference lines
• linking plot features to analysis parameters
• choosing key (legend) styles
• modifying symbol locations

The example plots are all based on a sample size analysis for a two-sample t test of group mean difference. You start by computing the sample size required to achieve a power of 0.9 using a 2-sided test with α = 0.05, assuming the first mean is 12, the second mean is either 15 or 18, and the standard deviation is either 7 or 9.

Use the TWOSAMPLEMEANS statement with the TEST=DIFF option to compute the required sample sizes. Indicate total sample size as the result parameter by supplying a missing value (.) with the NTOTAL= option. Use the GROUPMEANS=, STDDEV=, and POWER= options to specify values of the other parameters. The following statements perform the sample size computations.

proc power;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = 0.9
      ntotal     = .;
run;
Default values for the NULLDIFF=, SIDES=, GROUPWEIGHTS=, and DIST= options specify a null mean difference of 0, 2-sided test, balanced design, and assumption of normally distributed data. Output 57.8.1 shows that the required sample size ranges from 60 to 382 depending on the unknown standard deviation and second mean.
Output 57.8.1. Computed Sample Sizes

The POWER Procedure
Two-sample t Test for Mean Difference

Fixed Scenario Elements
Distribution      Normal
Method            Exact
Group 1 Mean      12
Nominal Power     0.9
Number of Sides   2
Null Difference   0
Alpha             0.05
Group 1 Weight    1
Group 2 Weight    1

Computed N Total
Index   Mean2   Std Dev   Actual Power   N Total
  1      15        7          0.902        232
  2      15        9          0.901        382
  3      18        7          0.904         60
  4      18        9          0.904         98
Assigning Analysis Parameters to Axes

Use the PLOT statement to produce plots for all power and sample size analyses in PROC POWER. For the sample size analysis described at the beginning of this example, suppose you want to plot the required sample size on the y-axis against a range of powers between 0.5 and 0.95 on the x-axis.

The X= and Y= options specify which parameter to plot against the result, and which axis to assign to this parameter. You can use either the X= or Y= option, but not both. Use the X=POWER option in the PLOT statement to request a plot with power on the x-axis. The result parameter, here total sample size, is always plotted on the other axis. Use the MIN= and MAX= options to specify the range of the axis indicated with either the X= or the Y= option. Here, specify MIN=0.5 and MAX=0.95 to specify the power range. The following statements produce the plot.

proc power plotonly;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = 0.9
      ntotal     = .;
   plot x=power min=0.5 max=0.95;
run;
Note that the value (0.9) of the POWER= option in the TWOSAMPLEMEANS statement is only a placeholder when the PLOTONLY option is used and both the MIN= and MAX= options are used, because the values of the MIN= and MAX= options override the value of 0.9. But the POWER= option itself is still required in the TWOSAMPLEMEANS statement, to provide a complete specification of the sample size analysis. The resulting plot is shown in Output 57.8.2.

Output 57.8.2. Plot of Sample Size versus Power
The line style identifies the group means scenario, and the plotting symbol identifies the standard deviation scenario. The locations of plotting symbols indicate computed sample sizes; the curves are linear interpolations of these points. By default, each curve consists of approximately 20 computed points (sometimes slightly more or less, depending on the analysis).

If you would rather plot power on the y-axis versus sample size on the x-axis, you have two general strategies to choose from. One strategy is to use the Y= option instead of the X= option in the PLOT statement:

plot y=power min=0.5 max=0.95;
Output 57.8.3. Plot of Power versus Sample Size using First Strategy
Note that the resulting plot (Output 57.8.3) is essentially a mirror image of Output 57.8.2. The axis ranges are set such that each curve in Output 57.8.3 contains similar values of Y instead of X. Each plotted point represents the computed value of the x-axis at the input value of the y-axis.

A second strategy for plotting power versus sample size (when originally solving for sample size) is to invert the analysis and base the plot on computed power for a given range of sample sizes. This strategy works well for monotonic power curves (as is the case for the t test and most other continuous analyses). It is advantageous in the sense of preserving the traditional role of the y-axis as the computed parameter. A common way to implement this strategy is

• Determine the range of sample sizes sufficient to cover the desired power range for all curves (where each "curve" represents a scenario for standard deviation and second group mean).
• Use this range for the x-axis of a plot.

To determine the required sample sizes for target powers of 0.5 and 0.95, change the values in the POWER= option to reflect this range:
proc power;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = 0.5 0.95
      ntotal     = .;
run;
Output 57.8.4 reveals that a sample size range of 24 to 470 is approximately sufficient to cover the desired power range of 0.5 to 0.95 for all curves ("approximately" because the actual power at the rounded sample size of 24 is slightly higher than the nominal power of 0.5).

Output 57.8.4. Computed Sample Sizes

The POWER Procedure
Two-sample t Test for Mean Difference

Fixed Scenario Elements
Distribution      Normal
Method            Exact
Group 1 Mean      12
Number of Sides   2
Null Difference   0
Alpha             0.05
Group 1 Weight    1
Group 2 Weight    1

Computed N Total
Index   Mean2   Std Dev   Nominal Power   Actual Power   N Total
  1      15        7           0.50           0.502         86
  2      15        7           0.95           0.951        286
  3      15        9           0.50           0.505        142
  4      15        9           0.95           0.950        470
  5      18        7           0.50           0.519         24
  6      18        7           0.95           0.953         74
  7      18        9           0.50           0.516         38
  8      18        9           0.95           0.952        120
To plot power on the y-axis for sample sizes between 20 and 500, use the X=N option in the PLOT statement with MIN=20 and MAX=500:

proc power plotonly;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = .
      ntotal     = 200;
   plot x=n min=20 max=500;
run;
Each curve in the resulting plot in Output 57.8.5 covers at least a power range of 0.5 to 0.95.

Output 57.8.5. Plot of Power versus Sample Size Using Second Strategy
Finally, suppose you want to produce a plot of sample size versus effect size for a power of 0.9. In this case, the "effect size" is defined to be the mean difference. You need to reparameterize the analysis by using the MEANDIFF= option instead of the GROUPMEANS= option to produce a plot, since each plot axis must be represented by a scalar parameter. Use the X=EFFECT option in the PLOT statement to assign the mean difference to the x-axis. The following statements produce a plot of required sample size to detect mean differences between 3 and 6.

proc power plotonly;
   twosamplemeans test=diff
      meandiff = 3 6
      stddev   = 7 9
      power    = 0.9
      ntotal   = .;
   plot x=effect min=3 max=6;
run;
The resulting plot (Output 57.8.6) shows how the required sample size decreases with increasing mean difference.
Output 57.8.6. Plot of Sample Size versus Mean Difference
Fine-Tuning a Sample Size Axis

Consider the following plot request for a sample size analysis similar to the one in Output 57.8.1 but with only a single scenario, and with unbalanced sample size allocation of 2:1.

proc power plotonly;
   ods output plotcontent=PlotData;
   twosamplemeans test=diff
      groupmeans   = 12 | 18
      stddev       = 7
      groupweights = 2 | 1
      power        = .
      ntotal       = 20;
   plot x=n min=20 max=50 npoints=20;
run;
The MIN=, MAX=, and NPOINTS= options in the PLOT statement request a plot with 20 points between 20 and 50. But the resulting plot (Output 57.8.7) appears to have only 11 points, and they range from 18 to 48.
Output 57.8.7. Plot with Overlapping Points
This plot has fewer points than usual because of the rounding of sample sizes. If you do not use the NFRACTIONAL option in the analysis statement (here, the TWOSAMPLEMEANS statement), then the set of sample size points determined by the MIN=, MAX=, NPOINTS=, and STEP= options in the PLOT statement may be rounded to satisfy the allocation weights. In this case, they are rounded down to the nearest multiples of 3 (the sum of the weights), and many of the points overlap. To see the overlap, you can print the NominalNTotal (unadjusted) and NTotal (rounded) variables in the PlotContent ODS object (here saved to a data set called PlotData):

proc print data=PlotData;
   var NominalNTotal NTotal;
run;
The output is shown in Output 57.8.8.
Output 57.8.8. Sample Sizes

Obs   Nominal NTotal   NTotal
  1        18.0          18
  2        19.6          18
  3        21.2          21
  4        22.7          21
  5        24.3          24
  6        25.9          24
  7        27.5          27
  8        29.1          27
  9        30.6          30
 10        32.2          30
 11        33.8          33
 12        35.4          33
 13        36.9          36
 14        38.5          36
 15        40.1          39
 16        41.7          39
 17        43.3          42
 18        44.8          42
 19        46.4          45
 20        48.0          48
Besides the overlapping of sample size points, another peculiarity that might occur without the NFRACTIONAL option is unequal spacing, as in the plot in Output 57.8.9, created with the following statements.

proc power plotonly;
   twosamplemeans test=diff
      groupmeans   = 12 | 18
      stddev       = 7
      groupweights = 2 | 1
      power        = .
      ntotal       = 20;
   plot x=n min=20 max=50 npoints=5;
run;
Output 57.8.9. Plot with Unequally Spaced Points
If you want to guarantee evenly spaced, nonoverlapping sample size points in your plots, you can either (1) use the NFRACTIONAL option in the analysis statement preceding the PLOT statement, or (2) use the STEP= option and provide values for the MIN=, MAX=, and STEP= options in the PLOT statement that are multiples of the sum of the allocation weights. Note that this sum is simply 1 for one-sample and paired designs and 2 for balanced two-sample designs. So, any integer step value works well for one-sample and paired designs, and any even step value works well for balanced two-sample designs. Both of these strategies will avoid rounding adjustments. The following statements implement the first strategy to create the plot in Output 57.8.10, using the NFRACTIONAL option in the TWOSAMPLEMEANS statement.
proc power plotonly;
   twosamplemeans test=diff nfractional
      groupmeans   = 12 | 18
      stddev       = 7
      groupweights = 2 | 1
      power        = .
      ntotal       = 20;
   plot x=n min=20 max=50 npoints=20;
run;
Output 57.8.10. Plot with Fractional Sample Sizes
To implement the second strategy, use multiples of 3 for the STEP=, MIN=, and MAX= options in the PLOT statement (because the sum of the allocation weights is 2 + 1 = 3). The following statements use STEP=3, MIN=18, and MAX=48 to create a plot that looks identical to Output 57.8.7 but suffers no overlapping of points.

proc power plotonly;
   twosamplemeans test=diff
      groupmeans   = 12 | 18
      stddev       = 7
      groupweights = 2 | 1
      power        = .
      ntotal       = 20;
   plot x=n min=18 max=48 step=3;
run;
Adding Reference Lines

Suppose you want to add reference lines to highlight power=0.8 and power=0.9 on the plot in Output 57.8.5. You can add simple reference lines using the YOPTS= option and REF= suboption in the PLOT statement to produce Output 57.8.11, using the following statements.

proc power plotonly;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = .
      ntotal     = 100;
   plot x=n min=20 max=500
        yopts=(ref=0.8 0.9);
run;
Output 57.8.11. Plot with Simple Reference Lines on Y-Axis
Or, you can specify CROSSREF=YES to add reference lines that intersect each curve and cross over to the other axis:

plot x=n min=20 max=500 yopts=(ref=0.8 0.9 crossref=yes);
The resulting plot is shown in Output 57.8.12.
Output 57.8.12. Plot with CROSSREF=YES Style Reference Lines from Y-Axis
You can also add reference lines for the x-axis by using the XOPTS= option instead of the YOPTS= option. For example, the following PLOT statement produces Output 57.8.13, which has crossing reference lines highlighting the sample size of 100.

plot x=n min=20 max=500 xopts=(ref=100 crossref=yes);
Output 57.8.13. Plot with CROSSREF=YES Style Reference Lines from X-Axis
Linking Plot Features to Analysis Parameters

You can use the VARY option in the PLOT statement to specify which of the following features you wish to associate with analysis parameters:

• line style
• plotting symbol
• color
• panel

You can specify mappings between each of these features and one or more analysis parameters, or you can simply choose a subset of these features to use (and rely on default settings to associate these features with multiple-valued analysis parameters). Suppose you supplement the sample size analysis in Output 57.8.5 to include three values of alpha, using the following statements.
proc power plotonly;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      alpha      = 0.01 0.025 0.1
      power      = .
      ntotal     = 100;
   plot x=n min=20 max=500;
run;
The defaults for the VARY option in the PLOT statement specify line style varying by the ALPHA= parameter, plotting symbol varying by the GROUPMEANS= parameter, panel varying by the STDDEV= parameter, and color remaining constant. The resulting plot, consisting of two panels, is shown in Output 57.8.14 and Output 57.8.15.

Output 57.8.14. Plot with Default VARY Settings: Panel 1 of 2
Output 57.8.15. Plot with Default VARY Settings: Panel 2 of 2
Suppose you want to produce a plot with only one panel that varies color in addition to line style and plotting symbol. Include the LINESTYLE, SYMBOL, and COLOR keywords in the VARY option in the PLOT statement, as follows, to produce the plot in Output 57.8.16.

plot x=n min=20 max=500 vary (linestyle, symbol, color);
Output 57.8.16. Plot with Varying Color Instead of Panel
Finally, suppose you want to specify which features are used and which analysis parameters they are linked to. The following PLOT statement produces a two-panel plot (shown in Output 57.8.17 and Output 57.8.18) in which line style varies by standard deviation, plotting symbol varies by both alpha and sides, and panel varies by means.
plot x=n min=20 max=500 vary (linestyle by stddev, symbol by alpha sides, panel by groupmeans);
Output 57.8.17. Plot with Features Explicitly Linked to Parameters: Panel 1 of 2
Output 57.8.18. Plot with Features Explicitly Linked to Parameters: Panel 2 of 2
Choosing Key (Legend) Styles

The default style for the key (or "legend") is one that displays the association between levels of features and levels of analysis parameters, located below the x-axis. For example, Output 57.8.5 demonstrates this style of key. You can reproduce Output 57.8.5 with the same key but a different location, inside the plotting region, using the POS=INSET option within the KEY=BYFEATURE option in the PLOT statement. The following statements produce the plot in Output 57.8.19.
proc power plotonly;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = .
      ntotal     = 200;
   plot x=n min=20 max=500
        key = byfeature(pos=inset);
run;
Output 57.8.19. Plot with a By-Feature Key Inside the Plotting Region
Alternatively, you can specify a key that identifies each individual curve separately by number using the KEY=BYCURVE option in the PLOT statement:

plot x=n min=20 max=500 key = bycurve;
The resulting plot is shown in Output 57.8.20.

Output 57.8.20. Plot with a Numbered By-Curve Key
Use the NUMBERS=OFF option within the KEY=BYCURVE option to specify a nonnumbered key that identifies curves with samples of line styles, symbols, and colors:

plot x=n min=20 max=500 key = bycurve(numbers=off pos=inset);
The POS=INSET suboption places the key within the plotting region. The resulting plot is shown in Output 57.8.21.
Output 57.8.21. Plot with a Nonnumbered By-Curve Key
Finally, you can attach labels directly to curves with the KEY=ONCURVES option. The following PLOT statement produces Output 57.8.22.

plot x=n min=20 max=500 key = oncurves;
Output 57.8.22. Plot with Directly Labeled Curves
Modifying Symbol Locations

The default locations for plotting symbols are the points computed directly from the power and sample size algorithms. For example, Output 57.8.5 shows plotting symbols corresponding to computed points. The curves connecting these points are interpolated (as indicated by the INTERPOL= option in the PLOT statement).

You can modify the locations of plotting symbols using the MARKERS= option in the PLOT statement. The MARKERS=ANALYSIS option places plotting symbols at locations corresponding to the input specified in the analysis statement preceding the PLOT statement. You may prefer this as an alternative to using reference lines to highlight specific points. For example, you can reproduce Output 57.8.5, but with the plotting symbols located at the sample sizes shown in Output 57.8.1, using the following statements.

proc power plotonly;
   twosamplemeans test=diff
      groupmeans = 12 | 15 18
      stddev     = 7 9
      power      = .
      ntotal     = 232 382 60 98;
   plot x=n min=20 max=500 markers=analysis;
run;
The analysis statement here is the TWOSAMPLEMEANS statement. The MARKERS=ANALYSIS option in the PLOT statement causes the plotting symbols to occur at the sample sizes specified by the NTOTAL= option in the TWOSAMPLEMEANS statement: 232, 382, 60, and 98. The resulting plot is shown in Output 57.8.23.

Output 57.8.23. Plot with MARKERS=ANALYSIS
You can also use the MARKERS=NICE option to align symbols with the tick marks on one of the axes (the x-axis when the X= option is used, or the y-axis when the Y= option is used):

plot x=n min=20 max=500 markers=nice;
The plot created by this PLOT statement is shown in Output 57.8.24.
Output 57.8.24. Plot with MARKERS=NICE
Note that the plotting symbols are aligned with the tick marks on the x-axis because the X= option is specified.
References

Anderson, T.W. (1984), An Introduction to Multivariate Statistical Analysis, Second Edition, New York: John Wiley & Sons.

Beal, S.L. (1989), "Sample Size Determination for Confidence Intervals on the Population Means and on the Difference between Two Population Means," Biometrics, 45, 969–977.

Blackwelder, W.C. (1982), "'Proving the Null Hypothesis' in Clinical Trials," Controlled Clinical Trials, 3, 345–353.

Cantor, A.B. (1997), Extending SAS Survival Analysis Techniques for Medical Research, Cary, NC: SAS Institute Inc.

Castelloe, J.M. (2000), "Sample Size Computations and Power Analysis with the SAS System," Proceedings of the Twenty-fifth Annual SAS Users Group International Conference, Paper 265-25, Cary, NC: SAS Institute Inc.

Castelloe, J.M. and O'Brien, R.G. (2001), "Power and Sample Size Determination for Linear Models," Proceedings of the Twenty-sixth Annual SAS Users Group International Conference, Paper 240-26, Cary, NC: SAS Institute Inc.

Chernick, M.R. and Liu, C.Y. (2002), "The Saw-Toothed Behavior of Power Versus Sample Size and Software Solutions: Single Binomial Proportion Using Exact Methods," The American Statistician, 56, 149–155.

Connor, R.J. (1987), "Sample Size for Testing Differences in Proportions for the Paired-Sample Design," Biometrics, 43, 207–211.

Diegert, C. and Diegert, K.V. (1981), "Note on Inversion of Casagrande-Pike-Smith Approximate Sample-Size Formula for Fisher-Irwin Test on 2 × 2 Tables," Biometrics, 37, 595.

Diletti, D., Hauschke, D., and Steinijans, V.W. (1991), "Sample Size Determination for Bioequivalence Assessment by Means of Confidence Intervals," International Journal of Clinical Pharmacology, Therapy and Toxicology, 29, 1–8.

DiSantostefano, R.L. and Muller, K.E. (1995), "A Comparison of Power Approximations for Satterthwaite's Test," Communications in Statistics — Simulation and Computation, 24 (3), 583–593.

Fisher, R.A. (1921), "On the 'Probable Error' of a Coefficient of Correlation Deduced from a Small Sample," Metron, 1, 3–32.

Fleiss, J.L., Tytun, A., and Ury, H.K. (1980), "A Simple Approximation for Calculating Sample Sizes for Comparing Independent Proportions," Biometrics, 36, 343–346.

Gatsonis, C. and Sampson, A.R. (1989), "Multiple Correlation: Exact Power and Sample Size Calculations," Psychological Bulletin, 106, 516–524.

Hocking, R.R. (1985), The Analysis of Linear Models, Monterey, CA: Brooks/Cole Publishing Company.

Johnson, N.L. and Kotz, S. (1970), Distributions in Statistics: Continuous Univariate Distributions — 1, New York: John Wiley & Sons.

Johnson, N.L., Kotz, S., and Balakrishnan, N. (1995), Continuous Univariate Distributions, Volume 2, Second Edition, New York: John Wiley & Sons.

Jones, R.M. and Miller, K.S. (1966), "On the Multivariate Lognormal Distribution," Journal of Industrial Mathematics, 16, 63–76.

Lachin, J.M. (1992), "Power and Sample Size Evaluation for the McNemar Test with Application to Matched Case-Control Studies," Statistics in Medicine, 11, 1239–1251.

Lakatos, E. (1988), "Sample Sizes Based on the Log-Rank Statistic in Complex Clinical Trials," Biometrics, 44, 229–241.

Lenth, R.V. (2001), "Some Practical Guidelines for Effective Sample Size Determination," The American Statistician, 55, 187–193.

Maxwell, S.E. (2000), "Sample Size and Multiple Regression Analysis," Psychological Methods, 5, 434–458.

Miettinen, O.S. (1968), "The Matched Pairs Design in the Case of All-or-None Responses," Biometrics, 339–352.

Moser, B.K., Stevens, G.R., and Watts, C.L. (1989), "The Two-Sample T Test Versus Satterthwaite's Approximate F Test," Communications in Statistics A — Theory and Methods, 18, 3963–3975.

Muller, K.E. and Benignus, V.A. (1992), "Increasing Scientific Power with Statistical Power," Neurotoxicology and Teratology, 14, 211–219.

O'Brien, R.G. and Muller, K.E. (1993), "Unified Power Analysis for t-Tests Through Multivariate Hypotheses," in Applied Analysis of Variance in Behavioral Science, ed. L.K. Edwards, New York: Marcel Dekker, Chapter 8, 297–344.

Owen, D.B. (1965), "A Special Case of a Bivariate Non-Central t-Distribution," Biometrika, 52, 437–446.

Pagano, M. and Gauvreau, K. (1993), Principles of Biostatistics, Belmont, CA: Wadsworth, Inc.

Phillips, K.F. (1990), "Power of the Two One-Sided Tests Procedure in Bioequivalence," Journal of Pharmacokinetics and Biopharmaceutics, 18, 137–144.

Satterthwaite, F.W. (1946), "An Approximate Distribution of Estimates of Variance Components," Biometrics Bulletin, 2, 110–114.

Schork, M. and Williams, G. (1980), "Number of Observations Required for the Comparison of Two Correlated Proportions," Communications in Statistics — Simulation and Computation, 9, 349–357.

Schuirmann, D.J. (1987), "A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability," Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.

Senn, S. (1993), Cross-over Trials in Clinical Research, New York: John Wiley & Sons, Inc.

Stuart, A. and Ord, J.K. (1994), Kendall's Advanced Theory of Statistics, Volume 1: Distribution Theory, Sixth Edition, Baltimore: Edward Arnold Publishers Ltd.

Walters, D.E. (1979), "In Defence of the Arc Sine Approximation," The Statistician, 28, 219–232.
Chapter 58
The PRINCOMP Procedure

Chapter Contents

OVERVIEW
GETTING STARTED
SYNTAX
   PROC PRINCOMP Statement
   BY Statement
   FREQ Statement
   PARTIAL Statement
   VAR Statement
   WEIGHT Statement
DETAILS
   Missing Values
   Output Data Sets
   Computational Resources
   Displayed Output
   ODS Table Names
   ODS Graphics (Experimental)
EXAMPLES
   Example 58.1. Temperatures
   Example 58.2. Crime Rates
   Example 58.3. Basketball Data
   Example 58.4. PRINCOMP Graphics (Experimental)
REFERENCES
Chapter 58
The PRINCOMP Procedure

Overview

The PRINCOMP procedure performs principal component analysis. As input you can use raw data, a correlation matrix, a covariance matrix, or a sums of squares and crossproducts (SSCP) matrix. You can create output data sets containing eigenvalues, eigenvectors, and standardized or unstandardized principal component scores.

Principal component analysis is a multivariate technique for examining relationships among several quantitative variables. The choice between using factor analysis and principal component analysis depends in part upon your research objectives. You should use the PRINCOMP procedure if you are interested in summarizing data and detecting linear relationships. Plots of principal components are especially valuable tools in exploratory data analysis. You can use principal components to reduce the number of variables in regression, clustering, and so on. See Chapter 5, "Introduction to Multivariate Procedures," for a detailed comparison of the PRINCOMP and FACTOR procedures.

Experimental graphics are now available with the PRINCOMP procedure. For more information, see the "ODS Graphics" section on page 3613.

Principal component analysis was originated by Pearson (1901) and later developed by Hotelling (1933). The application of principal components is discussed by Rao (1964), Cooley and Lohnes (1971), and Gnanadesikan (1977). Excellent statistical treatments of principal components are found in Kshirsagar (1972), Morrison (1976), and Mardia, Kent, and Bibby (1979).

Given a data set with p numeric variables, you can compute p principal components. Each principal component is a linear combination of the original variables, with coefficients equal to the eigenvectors of the correlation or covariance matrix. The eigenvectors are customarily taken with unit length. The principal components are sorted by descending order of the eigenvalues, which are equal to the variances of the components.

Principal components have a variety of useful properties (Rao 1964; Kshirsagar 1972):

• The eigenvectors are orthogonal, so the principal components represent jointly perpendicular directions through the space of the original variables.

• The principal component scores are jointly uncorrelated. Note that this property is quite distinct from the previous one.

• The first principal component has the largest variance of any unit-length linear combination of the observed variables. The jth principal component has the largest variance of any unit-length linear combination orthogonal to the first j − 1 principal components. The last principal component has the smallest variance of any linear combination of the original variables.

• The scores on the first j principal components have the highest possible generalized variance of any set of unit-length linear combinations of the original variables.

• The first j principal components provide a least-squares solution to the model Y = XB + E, where Y is an n × p matrix of the centered observed variables; X is the n × j matrix of scores on the first j principal components; B is the j × p matrix of eigenvectors; E is an n × p matrix of residuals; and you want to minimize trace(E′E), the sum of all the squared elements in E. In other words, the first j principal components are the best linear predictors of the original variables among all possible sets of j variables, although any nonsingular linear transformation of the first j principal components would provide equally good prediction. The same result is obtained if you want to minimize the determinant or the Euclidean (Schur, Frobenius) norm of E′E rather than the trace.

• In geometric terms, the j-dimensional linear subspace spanned by the first j principal components provides the best possible fit to the data points as measured by the sum of squared perpendicular distances from each data point to the subspace. This is in contrast to the geometric interpretation of least squares regression, which minimizes the sum of squared vertical distances. For example, suppose you have two variables. Then, the first principal component minimizes the sum of squared perpendicular distances from the points to the first principal axis. This is in contrast to least squares, which would minimize the sum of squared vertical distances from the points to the fitted line.
Principal component analysis can also be used for exploring polynomial relationships and for multivariate outlier detection (Gnanadesikan 1977), and it is related to factor analysis, correspondence analysis, allometry, and biased regression techniques (Mardia, Kent, and Bibby 1979).
Getting Started

The following example uses the PRINCOMP procedure to analyze job performance. Police officers were rated by their supervisors in 14 categories as part of standard police departmental administrative procedure. The following statements create the Jobratings data set:

options validvarname=any;
data Jobratings;
   input ('Communication Skills'n
          'Problem Solving'n
          'Learning Ability'n
          'Judgment Under Pressure'n
          'Observational Skills'n
          'Willingness to Confront Problems'n
          'Interest in People'n
          'Interpersonal Sensitivity'n
          'Desire for Self-Improvement'n
          'Appearance'n
          'Dependability'n
          'Physical Ability'n
          'Integrity'n
          'Overall Rating'n) (1.);
   datalines;
26838853879867
74758876857667
56757863775875
67869777988997
...
99899899899899
76656399567486
;
The data set Jobratings contains 14 variables. Each variable contains the job ratings, using a scale measurement from 1 to 10 (1=fail to comply, 10=exceptional). The last variable, Overall Rating, contains a score as an overall index on how each officer performs.

The following statements request a principal component analysis on the Jobratings data set and output the scores to the Scores data set (OUT=Scores). Note that the variable Overall Rating is excluded from the analysis.

proc princomp data=Jobratings(drop='Overall Rating'n) out=Scores;
run;
Figure 58.1 to Figure 58.3 display the PROC PRINCOMP output, beginning with simple statistics followed by the correlation matrix. The PROC PRINCOMP statement requests by default principal components computed from the correlation matrix, so the total variance is equal to the number of variables, 13. In this example, it would also be reasonable to use the COV option, which would cause variables with a high variance (such as Dependability) to have more influence on the results than variables with a low variance (such as Learning Ability). If you used the COV option, scores would be computed from centered rather than standardized variables.
The PRINCOMP Procedure

Observations   103
Variables       13

Simple Statistics

Variable                                  Mean           StD
Communication Skills               6.650485437   1.764068036
Problem Solving                    6.631067961   1.590352602
Learning Ability                   6.990291262   1.339411238
Judgment Under Pressure            6.737864078   1.731830976
Observational Skills               6.932038835   1.761584269
Willingness to Confront Problems   7.291262136   1.525155524
Interest in People                 6.708737864   1.892353385
Interpersonal Sensitivity          6.621359223   1.760773587
Desire for Self-Improvement        6.572815534   1.729796212
Appearance                         7.000000000   1.798692335
Dependability                      6.825242718   1.917040123
Physical Ability                   7.203883495   1.555251845
Integrity                          7.213592233   1.845240223

Figure 58.1. Number of Observations and Simple Statistics from the PRINCOMP Procedure
The PRINCOMP Procedure

Correlation Matrix

                                   Communication   Problem   Learning   Judgment Under
                                      Skills       Solving   Ability       Pressure
Communication Skills                  1.0000       0.6280     0.5546       0.5538
Problem Solving                       0.6280       1.0000     0.5690       0.6195
Learning Ability                      0.5546       0.5690     1.0000       0.4892
Judgment Under Pressure               0.5538       0.6195     0.4892       1.0000
Observational Skills                  0.5381       0.4284     0.6230       0.3733
Willingness to Confront Problems      0.5265       0.5015     0.5245       0.4004
Interest in People                    0.4391       0.3972     0.2735       0.6226
Interpersonal Sensitivity             0.5030       0.4398     0.1855       0.6134
Desire for Self-Improvement           0.5642       0.4090     0.5737       0.4826
Appearance                            0.4913       0.3873     0.3988       0.2266
Dependability                         0.5471       0.4546     0.5110       0.5471
Physical Ability                      0.2192       0.3201     0.2269       0.3476
Integrity                             0.5081       0.3846     0.3142       0.5883

                                   Observational   Willingness to      Interest
                                      Skills       Confront Problems   in People
Communication Skills                  0.5381          0.5265            0.4391
Problem Solving                       0.4284          0.5015            0.3972
Learning Ability                      0.6230          0.5245            0.2735
Judgment Under Pressure               0.3733          0.4004            0.6226
Observational Skills                  1.0000          0.7300            0.2616
Willingness to Confront Problems      0.7300          1.0000            0.2233
Interest in People                    0.2616          0.2233            1.0000
Interpersonal Sensitivity             0.1655          0.1291            0.8051
Desire for Self-Improvement           0.5985          0.5307            0.4857
Appearance                            0.4177          0.4825            0.2679
Dependability                         0.5626          0.4870            0.6074
Physical Ability                      0.4274          0.4872            0.3768
Integrity                             0.3906          0.3260            0.7452

Figure 58.2. Correlation Matrix from the PRINCOMP Procedure

                                   Interpersonal   Desire for          Appearance
                                   Sensitivity     Self-Improvement
Communication Skills                  0.5030          0.5642            0.4913
Problem Solving                       0.4398          0.4090            0.3873
Learning Ability                      0.1855          0.5737            0.3988
Judgment Under Pressure               0.6134          0.4826            0.2266
Observational Skills                  0.1655          0.5985            0.4177
Willingness to Confront Problems      0.1291          0.5307            0.4825
Interest in People                    0.8051          0.4857            0.2679
Interpersonal Sensitivity             1.0000          0.3713            0.2600
Desire for Self-Improvement           0.3713          1.0000            0.4474
Appearance                            0.2600          0.4474            1.0000
Dependability                         0.5408          0.5981            0.5089
Physical Ability                      0.2182          0.3752            0.3820
Integrity                             0.6920          0.5664            0.4135

                                   Dependability   Physical Ability   Integrity
Communication Skills                  0.5471          0.2192            0.5081
Problem Solving                       0.4546          0.3201            0.3846
Learning Ability                      0.5110          0.2269            0.3142
Judgment Under Pressure               0.5471          0.3476            0.5883
Observational Skills                  0.5626          0.4274            0.3906
Willingness to Confront Problems      0.4870          0.4872            0.3260
Interest in People                    0.6074          0.3768            0.7452
Interpersonal Sensitivity             0.5408          0.2182            0.6920
Desire for Self-Improvement           0.5981          0.3752            0.5664
Appearance                            0.5089          0.3820            0.4135
Dependability                         1.0000          0.4461            0.6536
Physical Ability                      0.4461          1.0000            0.3810
Integrity                             0.6536          0.3810            1.0000

Figure 58.3. (Continued) Correlation Matrix from the PRINCOMP Procedure
Figure 58.4 displays the eigenvalues. The first principal component explains about 50% of the total variance, the second principal component explains about 13.6%, and the third principal component explains about 7.7%. Note that the eigenvalues sum to the total variance. The eigenvalues indicate that three to five components provide a good summary of the data, with three components accounting for about 71.7% of the total variance and five components explaining about 82.7%. Subsequent components contribute less than 5% each.
The PRINCOMP Procedure

Eigenvalues of the Correlation Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    6.54740242    4.77468744      0.5036        0.5036
  2    1.77271499    0.76747933      0.1364        0.6400
  3    1.00523565    0.26209665      0.0773        0.7173
  4    0.74313901    0.06479499      0.0572        0.7745
  5    0.67834402    0.22696368      0.0522        0.8267
  6    0.45138034    0.06922167      0.0347        0.8614
  7    0.38215866    0.08432613      0.0294        0.8908
  8    0.29783254    0.02340663      0.0229        0.9137
  9    0.27442591    0.01208809      0.0211        0.9348
 10    0.26233782    0.01778332      0.0202        0.9550
 11    0.24455450    0.04677622      0.0188        0.9738
 12    0.19777828    0.05508241      0.0152        0.9890
 13    0.14269586                    0.0110        1.0000

Figure 58.4. Eigenvalues from the PRINCOMP Procedure
Figure 58.5 and Figure 58.6 display the eigenvectors. From the eigenvectors matrix, you can represent the first principal component Prin1 as a linear combination of the original variables:

   Prin1 = 0.303548 × (Communication Skills)
         + 0.278034 × (Problem Solving)
         + 0.266521 × (Learning Ability)
         ...
         + 0.298246 × (Integrity)

and, similarly, the second principal component Prin2 is

   Prin2 = 0.052039 × (Communication Skills)
         + 0.057046 × (Problem Solving)
         + 0.288152 × (Learning Ability)
         ...
         − 0.301812 × (Integrity)

where the variables are standardized.
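Because the components are linear combinations of the standardized ratings, you can reproduce the first score column by hand. The following sketch (the data set names StdRatings and CheckPrin1 are illustrative choices, not part of the example above) standardizes the ratings with PROC STANDARD and applies the Prin1 coefficients from Figure 58.5; ManualPrin1 should match the Prin1 variable in the Scores data set up to rounding of the printed coefficients.

proc standard data=Jobratings(drop='Overall Rating'n)
              mean=0 std=1 out=StdRatings;
run;

data CheckPrin1;
   set StdRatings;
   /* coefficients are the Prin1 eigenvector entries in Figure 58.5 */
   ManualPrin1 = 0.303548 * 'Communication Skills'n
               + 0.278034 * 'Problem Solving'n
               + 0.266521 * 'Learning Ability'n
               + 0.294376 * 'Judgment Under Pressure'n
               + 0.276641 * 'Observational Skills'n
               + 0.267580 * 'Willingness to Confront Problems'n
               + 0.278060 * 'Interest in People'n
               + 0.253814 * 'Interpersonal Sensitivity'n
               + 0.299833 * 'Desire for Self-Improvement'n
               + 0.237358 * Appearance
               + 0.319480 * Dependability
               + 0.213868 * 'Physical Ability'n
               + 0.298246 * Integrity;
run;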
The PRINCOMP Procedure

Eigenvectors

Variable                             Prin1      Prin2      Prin3      Prin4
Communication Skills              0.303548   0.052039   -.329181   -.227039
Problem Solving                   0.278034   0.057046   -.400112   0.300476
Learning Ability                  0.266521   0.288152   -.354591   -.020735
Judgment Under Pressure           0.294376   -.199458   -.255164   0.397306
Observational Skills              0.276641   0.366979   0.065959   0.035711
Willingness to Confront Problems  0.267580   0.392989   0.098723   0.184409
Interest in People                0.278060   -.432916   0.118113   0.046047
Interpersonal Sensitivity         0.253814   -.495662   -.064547   -.060000
Desire for Self-Improvement       0.299833   0.099077   0.061097   -.211279
Appearance                        0.237358   0.190065   0.248353   -.544587
Dependability                     0.319480   -.049742   0.169476   -.156070
Physical Ability                  0.213868   0.097499   0.614959   0.514519
Integrity                         0.298246   -.301812   0.190222   -.169062

Variable                             Prin5      Prin6      Prin7      Prin8
Communication Skills              0.181087   -.416563   0.143543   0.333846
Problem Solving                   0.453604   0.096750   0.048904   0.199259
Learning Ability                  -.219329   0.578388   -.114808   0.064088
Judgment Under Pressure           -.030188   0.102087   0.068204   -.591822
Observational Skills              -.325257   -.301254   -.297894   0.163484
Willingness to Confront Problems  0.038278   -.458585   -.044796   -.365684
Interest in People                -.111279   0.030870   -.011105   0.154829
Interpersonal Sensitivity         0.107807   -.170305   -.088194   0.192725
Desire for Self-Improvement       -.427477   0.105369   0.689011   0.087453
Appearance                        0.568044   0.221643   0.049267   -.257497
Dependability                     -.130575   0.202301   -.594850   0.081242
Physical Ability                  0.203995   0.173168   0.169247   0.302536
Integrity                         -.130757   -.100039   0.029456   -.317545

Figure 58.5. Eigenvectors from the PRINCOMP Procedure
Eigenvectors

Variable                             Prin9     Prin10     Prin11     Prin12
Communication Skills              -.430955   0.375983   0.028370   -.252778
Problem Solving                   0.256098   -.372914   -.434417   0.069863
Learning Ability                  0.224706   0.287031   0.210540   -.284355
Judgment Under Pressure           -.358618   0.178270   0.118318   0.306490
Observational Skills              0.258377   0.223793   -.079692   0.565290
Willingness to Confront Problems  0.129976   -.330710   0.275249   -.386151
Interest in People                0.321200   -.081470   0.393841   -.210915
Interpersonal Sensitivity         0.137468   -.074821   0.285447   0.276824
Desire for Self-Improvement       -.121474   -.363854   -.052085   0.151436
Appearance                        0.087395   0.061890   0.168369   0.236655
Dependability                     -.495598   -.377561   -.164909   -.090904
Physical Ability                  -.149625   0.258321   -.006202   -.055828
Integrity                         0.271060   0.297010   -.612497   -.276273

Variable                            Prin13
Communication Skills              -.122809
Problem Solving                   -.116642
Learning Ability                  0.248555
Judgment Under Pressure           -.126636
Observational Skills              -.168555
Willingness to Confront Problems  0.177688
Interest in People                -.610215
Interpersonal Sensitivity         0.643410
Desire for Self-Improvement       0.053834
Appearance                        -.113705
Dependability                     -.018094
Physical Ability                  0.133430
Integrity                         0.114965

Figure 58.6. (Continued) Eigenvectors from the PRINCOMP Procedure
The first component reflects overall performance since the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on the variables Observational Skills and Willingness to Confront Problems but even higher negative loadings on the variables Interest in People and Interpersonal Sensitivity. This component seems to reflect the ability to take action, but it also reflects a lack of interpersonal skills. The third eigenvector has a very high positive loading on the variable Physical Ability and high negative loadings on the variables Problem Solving and Learning Ability. This component seems to reflect physical strength, but also shows poor learning and problem-solving skills.

In short, the three components represent:

First Component: overall performance
Second Component: smart, tough, and introverted
Third Component: superior strength and average intellect
Syntax

The following statements are available in PROC PRINCOMP.

PROC PRINCOMP < options > ;
   BY variables ;
   FREQ variable ;
   PARTIAL variables ;
   VAR variables ;
   WEIGHT variable ;

Usually only the VAR statement is used in addition to the PROC PRINCOMP statement. The rest of this section provides detailed syntax information for each of the preceding statements, beginning with the PROC PRINCOMP statement. The remaining statements are described in alphabetical order.
PROC PRINCOMP Statement

PROC PRINCOMP < options > ;

The PROC PRINCOMP statement starts the PRINCOMP procedure and, optionally, identifies input and output data sets, specifies details of the analysis, or suppresses the display of output. You can specify the following options in the PROC PRINCOMP statement.

Task                             Options
Specify data sets                DATA=, OUT=, OUTSTAT=
Specify details of analysis      COV, N=, NOINT, PREFIX=, SINGULAR=, STD, VARDEF=
Suppress the display of output   NOPRINT
The following list provides details on these options. COVARIANCE COV
computes the principal components from the covariance matrix. If you omit the COV option, the correlation matrix is analyzed. Use of the COV option causes variables with large variances to be more strongly associated with components with large eigenvalues and causes variables with small variances to be more strongly associated with
PROC PRINCOMP Statement
components with small eigenvalues. You should not specify the COV option unless the units in which the variables are measured are comparable or the variables are standardized in some way. DATA=SAS-data-set
specifies the SAS data set to be analyzed. The data set can be an ordinary SAS data set or a TYPE=ACE, TYPE=CORR, TYPE=COV, TYPE=FACTOR, TYPE=SSCP, TYPE=UCORR, or TYPE=UCOV data set (see Appendix A, “Special SAS Data Sets,”). Also, the PRINCOMP procedure can read the – TYPE– =‘COVB’ matrix from a TYPE=EST data set. If you omit the DATA= option, the procedure uses the most recently created SAS data set. N=number
specifies the number of principal components to be computed. The default is the number of variables. The value of the N= option must be an integer greater than or equal to zero. NOINT
omits the intercept from the model. In other words, the NOINT option requests that the covariance or correlation matrix not be corrected for the mean. When you use the PRINCOMP procedure with the NOINT option, the covariance matrix and, hence, the standard deviations are not corrected for the mean. If you are interested in the standard deviations corrected for the mean, you can get them by using a procedure such as the MEANS procedure. If you use a TYPE=SSCP data set as input to the PRINCOMP procedure and list the variable Intercept in the VAR statement, the procedure acts as if you had also specified the NOINT option. If you use NOINT and also create an OUTSTAT= data set, the data set is TYPE=UCORR or TYPE=UCOV rather than TYPE=CORR or TYPE=COV. NOPRINT
suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, “Using the Output Delivery System.” OUT=SAS-data-set
creates an output SAS data set that contains all the original data as well as the principal component scores. If you want to create a permanent SAS data set, you must specify a two-level name (refer to SAS Language Reference: Concepts for information on permanent SAS data sets). OUTSTAT=SAS-data-set
creates an output SAS data set that contains means, standard deviations, number of observations, correlations or covariances, eigenvalues, and eigenvectors. If you specify the COV option, the data set is TYPE=COV or TYPE=UCOV, depending on the NOINT option, and it contains covariances; otherwise, the data set is TYPE=CORR or TYPE=UCORR, depending on the NOINT option, and it contains correlations. If you specify the PARTIAL statement, the OUTSTAT= data set contains R-squares as well. If you want to create a permanent SAS data set, you must specify a two-level
3605
3606
Chapter 58. The PRINCOMP Procedure
name (refer to SAS Language Reference: Concepts for information on permanent SAS data sets). PREFIX=name
specifies a prefix for naming the principal components. By default, the names are Prin1, Prin2, . . . , Prinn. If you specify PREFIX=ABC, the components are named ABC1, ABC2, ABC3, and so on. The number of characters in the prefix plus the number of digits required to designate the variables should not exceed the current name length defined by the VALIDVARNAME= system option. SINGULAR=p SING=p
specifies the singularity criterion, where 0 < p < 1. If a variable in a PARTIAL statement has an R-square as large as 1 − p when predicted from the variables listed before it in the statement, the variable is assigned a standardized coefficient of 0. By default, SINGULAR=1E−8. STANDARD STD
standardizes the principal component scores in the OUT= data set to unit variance. If you omit the STANDARD option, the scores have variance equal to the corresponding eigenvalue. Note that STANDARD has no effect on the eigenvalues themselves. VARDEF=DF | N | WDF | WEIGHT | WGT
specifies the divisor used in calculating variances and standard deviations. By default, VARDEF=DF. The following table displays the values and associated divisors.
Value          Divisor                    Formula
DF             error degrees of freedom   n − i  (before partialling)
                                          n − p − i  (after partialling)
N              number of observations     n
WEIGHT | WGT   sum of weights             Σ(j=1 to n) wj
WDF            sum of weights minus one   [Σ(j=1 to n) wj] − i  (before partialling)
                                          [Σ(j=1 to n) wj] − p − i  (after partialling)
In the formulas for VARDEF=DF and VARDEF=WDF, p is the number of degrees of freedom of the variables in the PARTIAL statement, and i is 0 if the NOINT option is specified and 1 otherwise.
BY Statement

BY variables ;

You can specify a BY statement with PROC PRINCOMP to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.

• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the PRINCOMP procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
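As a minimal sketch of the first alternative, assuming a data set Ratings with a grouping variable Precinct (both names are hypothetical), you could sort and then analyze each group separately:

proc sort data=Ratings;
   by Precinct;
run;

proc princomp data=Ratings;
   by Precinct;   /* one principal component analysis per precinct */
run;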
FREQ Statement

FREQ variable ;

The FREQ statement specifies a variable that provides frequencies for each observation in the DATA= data set. Specifically, if n is the value of the FREQ variable for a given observation, then that observation is used n times.

The analysis produced using a FREQ statement reflects the expanded number of observations. The total number of observations is considered equal to the sum of the FREQ variable. You could produce the same analysis (without the FREQ statement) by first creating a new data set that contains the expanded number of observations. For example, if the value of the FREQ variable is 5 for the first observation, the first 5 observations in the new data set would be identical. Each observation in the old data set would be replicated nj times in the new data set, where nj is the value of the FREQ variable for that observation.

If the value of the FREQ variable is missing or is less than one, the observation is not used in the analysis. If the value is not an integer, only the integer portion is used.
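The following sketch illustrates this equivalence, assuming a hypothetical data set Ratings with a frequency variable Count. The DATA step expansion also mirrors the rules stated above: a missing Count or a Count less than one produces no output observations, and a fractional Count is effectively truncated by the DO loop.

data Expanded;
   set Ratings;
   do _i = 1 to Count;   /* replicate each observation Count times */
      output;
   end;
   drop _i Count;        /* drop Count so both analyses use the same variables */
run;

proc princomp data=Expanded;   /* expanded data, no FREQ statement */
run;

proc princomp data=Ratings;    /* equivalent analysis via FREQ */
   freq Count;
run;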
PARTIAL Statement

PARTIAL variables ;

If you want to analyze a partial correlation or covariance matrix, specify the names of the numeric variables to be partialled out in the PARTIAL statement. The PRINCOMP procedure computes the principal components of the residuals from the prediction of the VAR variables by the PARTIAL variables. If you request an OUT= or OUTSTAT= data set, the residual variables are named by prefixing the characters R_ to the VAR variables. Thus, the number of characters required to distinguish the VAR variables should be, at most, two characters fewer than the current name length defined by the VALIDVARNAME= system option.
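For instance, with a hypothetical Fitness data set, the following sketch extracts components of Oxygen, RunTime, and RunPulse after removing the linear effect of Age. Per the naming rule above, the OUT= data set would also contain the residual variables R_Oxygen, R_RunTime, and R_RunPulse.

proc princomp data=Fitness out=PartialPC;
   partial Age;                    /* partial out the linear effect of Age */
   var Oxygen RunTime RunPulse;    /* components of the residuals          */
run;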
VAR Statement

VAR variables ;

The VAR statement lists the numeric variables to be analyzed. If you omit the VAR statement, all numeric variables not specified in other statements are analyzed. If, however, the DATA= data set is TYPE=SSCP, the default set of variables used as VAR variables does not include Intercept so that the correlation or covariance matrix is constructed correctly. If you want to analyze Intercept as a separate variable, you should specify it in the VAR statement.
WEIGHT Statement

WEIGHT variable ;

If you want to use relative weights for each observation in the input data set, place the weights in a variable in the data set and specify the name in a WEIGHT statement. This is often done when the variance associated with each observation is different and the values of the weight variable are proportional to the reciprocals of the variances. The observation is used in the analysis only if the value of the WEIGHT statement variable is nonmissing and is greater than zero.
Details

Missing Values

Observations with missing values for any variable in the VAR, PARTIAL, FREQ, or WEIGHT statement are omitted from the analysis and are given missing values for principal component scores in the OUT= data set. If a correlation, covariance, or SSCP matrix is read, it can contain missing values as long as every pair of variables has at least one nonmissing entry.
Output Data Sets

OUT= Data Set

The OUT= data set contains all the variables in the original data set plus new variables containing the principal component scores. The N= option determines the number of new variables. The names of the new variables are formed by concatenating the value given by the PREFIX= option (or Prin if PREFIX= is omitted) and the numbers 1, 2, 3, and so on.

The new variables have mean 0 and variance equal to the corresponding eigenvalue, unless you specify the STANDARD option to standardize the scores to unit variance. Also, if you specify the COV option, the procedure computes the principal component scores from the corrected or the uncorrected (if the NOINT option is specified) variables rather than the standardized variables.

If you use a PARTIAL statement, the OUT= data set also contains the residuals from predicting the VAR variables from the PARTIAL variables. The names of the residual variables are formed by prefixing R_ to the names of the VAR variables.

An OUT= data set cannot be created if the DATA= data set is TYPE=ACE, TYPE=CORR, TYPE=COV, TYPE=EST, TYPE=FACTOR, TYPE=SSCP, TYPE=UCORR, or TYPE=UCOV.
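As a quick check of these properties, you could request two components from the Jobratings example and inspect their covariance matrix, which should be diagonal with approximately the first two eigenvalues (6.55 and 1.77 from Figure 58.4) on the diagonal. The data set name PCScores and the PC prefix are illustrative choices.

proc princomp data=Jobratings(drop='Overall Rating'n)
              n=2 prefix=PC out=PCScores noprint;
run;

proc corr data=PCScores cov;   /* COV prints the covariance matrix of the scores */
   var PC1 PC2;
run;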
OUTSTAT= Data Set

The OUTSTAT= data set is similar to the TYPE=CORR data set produced by the CORR procedure. The following table relates the TYPE= value for the OUTSTAT= data set to the options specified in the PROC PRINCOMP statement.

   Options        TYPE=
   (default)      CORR
   COV            COV
   NOINT          UCORR
   COV NOINT      UCOV
Notice that the default (neither the COV nor the NOINT option) produces a TYPE=CORR data set.

The new data set contains the following variables:

• the BY variables, if any
• two new character variables, _TYPE_ and _NAME_
• the variables analyzed, that is, those in the VAR statement; or, if there is no VAR statement, all numeric variables not listed in any other statement; or, if there is a PARTIAL statement, the residual variables as described under the OUT= data set

Each observation in the new data set contains some type of statistic as indicated by the _TYPE_ variable. The values of the _TYPE_ variable are as follows:
MEAN
   mean of each variable. If you specify the PARTIAL statement, this observation is omitted.

STD
   standard deviations. If you specify the COV option, this observation is omitted, so the SCORE procedure does not standardize the variables before computing scores. If you use the PARTIAL statement, the standard deviation of a variable is computed as its root mean squared error as predicted from the PARTIAL variables.

USTD
   uncorrected standard deviations. When you specify the NOINT option in the PROC PRINCOMP statement, the OUTSTAT= data set contains standard deviations not corrected for the mean. However, if you also specify the COV option in the PROC PRINCOMP statement, this observation is omitted.

N
   number of observations on which the analysis is based. This value is the same for each variable. If you specify the PARTIAL statement and the value of the VARDEF= option is DF or unspecified, then the number of observations is decremented by the degrees of freedom for the PARTIAL variables.

SUMWGT
   the sum of the weights of the observations. This value is the same for each variable. If you specify the PARTIAL statement and VARDEF=WDF, then the sum of the weights is decremented by the degrees of freedom for the PARTIAL variables. This observation is output only if the value is different from that in the observation with _TYPE_='N'.

CORR
   correlations between each variable and the variable specified by the _NAME_ variable. The number of observations with _TYPE_='CORR' is equal to the number of variables being analyzed. If you specify the COV option, no _TYPE_='CORR' observations are produced. If you use the PARTIAL statement, the partial correlations, not the raw correlations, are output.

UCORR
   uncorrected correlation matrix. When you specify the NOINT option without the COV option in the PROC PRINCOMP statement, the OUTSTAT= data set contains a matrix of correlations not corrected for the means. However, if you also specify the COV option in the PROC PRINCOMP statement, this observation is omitted.

COV
   covariances between each variable and the variable specified by the _NAME_ variable. _TYPE_='COV' observations are produced only if you specify the COV option. If you use the PARTIAL statement, the partial covariances, not the raw covariances, are output.

UCOV
   uncorrected covariance matrix. When you specify the NOINT and COV options in the PROC PRINCOMP statement, the OUTSTAT= data set contains a matrix of covariances not corrected for the means.

EIGENVAL
   eigenvalues. If the N= option requests fewer than the maximum number of principal components, only the specified number of eigenvalues is produced, with missing values filling out the observation.

SCORE
   eigenvectors. The _NAME_ variable contains the name of the corresponding principal component as constructed from the PREFIX= option. The number of observations with _TYPE_='SCORE' equals the number of principal components computed. The eigenvectors have unit length unless you specify the STD option, in which case the unit-length eigenvectors are divided by the square roots of the eigenvalues to produce scores with unit standard deviations. To obtain the principal component scores, if the COV option is not specified, these coefficients should be multiplied by the standardized data; with the COV option, these coefficients should be multiplied by the centered data. Means obtained from the observation with _TYPE_='MEAN' and standard deviations obtained from the observation with _TYPE_='STD' should be used for centering and standardizing the data.

USCORE
   scoring coefficients to be applied without subtracting the mean from the raw variables. _TYPE_='USCORE' observations are produced when you specify the NOINT option in the PROC PRINCOMP statement. To obtain the principal component scores, these coefficients should be multiplied by the data that are standardized by the uncorrected standard deviations obtained from the observation with _TYPE_='USTD'.

RSQUARED
   R squares for each VAR variable as predicted by the PARTIAL variables.

B
   regression coefficients for each VAR variable as predicted by the PARTIAL variables. This observation is produced only if you specify the COV option.

STB
   standardized regression coefficients for each VAR variable as predicted by the PARTIAL variables. If you specify the COV option, this observation is omitted.
The data set can be used with the SCORE procedure to compute principal component scores, or it can be used as input to the FACTOR procedure specifying METHOD=SCORE to rotate the components. If you use the PARTIAL statement, the scoring coefficients should be applied to the residuals, not the original variables.
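A minimal sketch of this usage (the data sets Train and NewData and the variables X1-X5 are hypothetical):

   proc princomp data=Train outstat=PCStats noprint;
      var X1-X5;
   run;

   /* Apply the stored scoring coefficients to new data */
   proc score data=NewData score=PCStats out=NewScores;
      var X1-X5;
   run;

The NewScores data set contains the score variables Prin1, Prin2, and so on, computed from the means, standard deviations, and scoring coefficients stored in PCStats.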
Computational Resources

Let

   n = number of observations
   v = number of VAR variables
   p = number of PARTIAL variables
   c = number of components
• The minimum allocated memory required is

     232v + 120p + 48c + max(8cv, 8vp + 4(v + p)(v + p + 1))

  bytes.

• The time required to compute the correlation matrix is roughly proportional to

     n(v + p)^2 + p(v + p)(v + p + 1)/2

• The time required to compute eigenvalues is roughly proportional to v^3.

• The time required to compute eigenvectors is roughly proportional to cv^2.
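As a worked illustration of the memory formula (the sizes are hypothetical), with v = 10, p = 2, and c = 10:

   232(10) + 120(2) + 48(10) + max(8*10*10, 8*10*2 + 4*12*13)
      = 2320 + 240 + 480 + max(800, 784)
      = 3840 bytes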
Displayed Output

The PRINCOMP procedure displays the following items if the DATA= data set is not TYPE=CORR, TYPE=COV, TYPE=SSCP, TYPE=UCORR, or TYPE=UCOV:

• Simple Statistics, including the Mean and Std (standard deviation) for each variable. If you specify the NOINT option, the uncorrected standard deviation (UStD) is displayed.
• the Correlation or, if you specify the COV option, the Covariance Matrix

The PRINCOMP procedure displays the following items if you use the PARTIAL statement:

• Regression Statistics, giving the R-square and RMSE (root mean square error) for each VAR variable as predicted by the PARTIAL variables (not shown)
• Standardized Regression Coefficients or, if you specify the COV option, Regression Coefficients for predicting the VAR variables from the PARTIAL variables (not shown)
• the Partial Correlation Matrix or, if you specify the COV option, the Partial Covariance Matrix (not shown)

The PRINCOMP procedure displays the following item if you specify the COV option:

• the Total Variance

The PRINCOMP procedure displays the following items unless you specify the NOPRINT option:

• Eigenvalues of the correlation or covariance matrix, as well as the Difference between successive eigenvalues, the Proportion of variance explained by each eigenvalue, and the Cumulative proportion of variance explained
• the Eigenvectors
ODS Table Names

PROC PRINCOMP assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 58.1. ODS Tables Produced in PROC PRINCOMP
   ODS Table Name     Description                                  Statement / Option
   Corr               Correlation Matrix                           default unless COV is specified
   Cov                Covariance Matrix                            default if COV is specified
   Eigenvalues        Eigenvalues                                  default
   Eigenvectors       Eigenvectors                                 default
   NObsNVar           Number of Observations, Variables, and       default
                      (Partial) Variables
   ParCorr            Partial Correlation Matrix                   PARTIAL statement
   ParCov             Uncorrected Partial Covariance Matrix        PARTIAL statement, COV
   RegCoef            Regression Coefficients                      PARTIAL statement, COV
   RSquareRMSE        Regression Statistics: R-Squares and RMSEs   PARTIAL statement
   SimpleStatistics   Simple Statistics                            default
   StdRegCoef         Standardized Regression Coefficients         PARTIAL statement
   TotalVariance      Total Variance                               COV (PROC PRINCOMP statement)
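For example, a hedged sketch (the data set MyData and the variables X1-X5 are hypothetical) that captures the Eigenvalues table as a data set by using the ODS OUTPUT statement:

   ods output Eigenvalues=EigenTable;
   proc princomp data=MyData;
      var X1-X5;
   run;

   proc print data=EigenTable;
   run;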
ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the PRINCOMP procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request these graphs, you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, "Statistical Graphics Using ODS."

You can specify the N= option in the PROC PRINCOMP statement to control the number of principal components to be displayed.
ODS Graph Names

PROC PRINCOMP assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 58.2.
Table 58.2. ODS Graphics Produced by PROC PRINCOMP

   ODS Graph Name              Plot Description                        Statement
   EigenvaluePlot              Eigenvalues and Proportion Plot         default
   PaintedPrinCompScoresPlot   Painted Component Scores Plot:          default and nvar* >= 3
                               2nd versus 3rd, painted by 1st
   PrinCompMatrixPlot          Component Scores Matrix Plot            default and nvar >= 2
   PrinCompPatternPlot         Component Pattern Plot                  default
   PrinCompScoresPlot12        Component Scores Plot: 1st versus 2nd   default and nvar >= 2
   PrinCompScoresPlot13        Component Scores Plot: 1st versus 3rd   default and nvar >= 3

   * nvar is the number of variables to be analyzed
Examples

Example 58.1. Temperatures

This example analyzes mean daily temperatures in selected cities in January and July. Both the raw data and the principal components are plotted to illustrate how principal components are orthogonal rotations of the original variables.

The following statements create the Temperature data set:

data Temperature;
   title 'Mean Temperature in January and July for Selected Cities';
   input City $1-15 January July;
   cards;
Mobile          51.2  81.6
Phoenix         51.2  91.2
Little Rock     39.5  81.4
Sacramento      45.1  75.2
Denver          29.9  73.0
Hartford        24.8  72.7
Wilmington      32.0  75.8
Washington DC   35.6  78.7
Jacksonville    54.6  81.0
Miami           67.2  82.3
Atlanta         42.4  78.0
Boise           29.0  74.5
Chicago         22.9  71.9
Peoria          23.8  75.1
Indianapolis    27.9  75.0
Des Moines      19.4  75.1
Wichita         31.3  80.7
Louisville      33.3  76.9
New Orleans     52.9  81.9
Portland, ME    21.5  68.0
Baltimore       33.4  76.6
Boston          29.2  73.3
Detroit         25.5  73.3
Sault Ste Marie 14.2  63.8
Duluth           8.5  65.6
Minneapolis     12.2  71.9
Jackson         47.1  81.7
Kansas City     27.8  78.8
St Louis        31.3  78.6
Great Falls     20.5  69.3
Omaha           22.6  77.2
Reno            31.9  69.3
Concord         20.6  69.7
Atlantic City   32.7  75.1
Albuquerque     35.2  78.7
Albany          21.5  72.0
Buffalo         23.7  70.1
New York        32.2  76.6
Charlotte       42.1  78.5
Raleigh         40.5  77.5
Bismarck         8.2  70.8
Cincinnati      31.1  75.6
Cleveland       26.9  71.4
Columbus        28.4  73.6
Oklahoma City   36.8  81.5
Portland, OR    38.1  67.1
Philadelphia    32.3  76.8
Pittsburgh      28.1  71.9
Providence      28.4  72.1
Columbia        45.4  81.2
Sioux Falls     14.2  73.3
Memphis         40.5  79.6
Nashville       38.3  79.6
Dallas          44.8  84.8
El Paso         43.6  82.3
Houston         52.1  83.3
Salt Lake City  28.0  76.7
Burlington      16.8  69.8
Norfolk         40.5  78.3
Richmond        37.5  77.9
Spokane         25.4  69.7
Charleston, WV  34.5  75.0
Milwaukee       19.4  69.9
Cheyenne        26.6  69.1
;
The following statements plot the Temperature data set. For information on the %PLOTIT macro, see Appendix B, "Using the %PLOTIT Macro."

   title2 'Plot of Raw Data';
   %plotit(data=Temperature, labelvar=City,
           plotvars=July January, color=black, colors=black);
   run;
The results are displayed in Output 58.1.1, which shows a scatter diagram of the 64 pairs of data points with July temperatures plotted against January temperatures.

Output 58.1.1. Plot of Raw Data
The following statement requests a principal component analysis on the Temperature data set and outputs the scores to the Prin data set (OUT=Prin):

   proc princomp data=Temperature cov out=Prin;
      var July January;
   run;
Output 58.1.2 displays the PROC PRINCOMP output. The standard deviation of January (11.712) is higher than the standard deviation of July (5.128). The COV option in the PROC PRINCOMP statement requests the principal components to be computed from the covariance matrix. The total variance is 163.474. The first principal component explains about 94% of the total variance, and the second principal component explains only about 6%. The eigenvalues sum to the total variance.
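As a hedged cross-check of these results (assuming SAS/IML is available), you can reproduce the eigendecomposition of the displayed covariance matrix directly; the eigenvalues sum to the trace of the matrix, which is the total variance reported above:

   proc iml;
      /* covariance matrix of July and January from Output 58.1.2 */
      cov = {26.2924777  46.8282912,
             46.8282912 137.1810888};
      call eigen(eval, evec, cov);   /* eigenvalues, eigenvectors */
      total = eval[+];               /* 154.31 + 9.16 = 163.47    */
      print eval evec total;
   quit;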
Note that January receives a higher loading on Prin1 because it has a higher standard deviation than July, and the PRINCOMP procedure calculates the scores using the centered variables rather than the standardized variables.

Output 58.1.2. Results of Principal Component Analysis

          Mean Temperature in January and July for Selected Cities

                         The PRINCOMP Procedure

                         Observations    64
                         Variables        2

                           Simple Statistics

                               July        January
                Mean    75.60781250    32.09531250
                StD      5.12761910    11.71243309

                           Covariance Matrix

                               July        January
                July     26.2924777     46.8282912
                January  46.8282912    137.1810888

                     Total Variance    163.47356647

                 Eigenvalues of the Covariance Matrix

                Eigenvalue    Difference    Proportion    Cumulative
           1    154.310607    145.147647        0.9439        0.9439
           2      9.162960                      0.0561        1.0000

                             Eigenvectors

                               Prin1       Prin2
                July        0.343532    0.939141
                January     0.939141    -.343532
The following statement plots the Prin data set created from the previous PROC PRINCOMP statement:

   title2 'Plot of Principal Components';
   %plotit(data=Prin, labelvar=City,
           plotvars=Prin2 Prin1, color=black, colors=black);
   run;
Output 58.1.3 displays a plot of the second principal component Prin2 against the first principal component Prin1. It is clear from this plot that the principal components are orthogonal rotations of the original variables and that the first principal component has a larger variance than the second principal component. In fact, Prin1 has a larger variance than either of the original variables July and January.

Output 58.1.3. Plot of Principal Components
Example 58.2. Crime Rates

The following data provide crime rates per 100,000 people in seven categories for each of the fifty states in 1977. Since there are seven numeric variables, it is impossible to plot all the variables simultaneously. Principal components can be used to summarize the data in two or three dimensions, and they help to visualize the data. The following statements produce Output 58.2.1:

data Crime;
   title 'Crime Rates per 100,000 Population by State';
   input State $1-15 Murder Rape Robbery Assault
         Burglary Larceny Auto_Theft;
   cards;
Alabama         14.2 25.2  96.8 278.3 1135.5 1881.9  280.7
Alaska          10.8 51.6  96.8 284.0 1331.7 3369.8  753.3
Arizona          9.5 34.2 138.2 312.3 2346.1 4467.4  439.5
Arkansas         8.8 27.6  83.2 203.4  972.6 1862.1  183.4
California      11.5 49.4 287.0 358.0 2139.4 3499.8  663.5
Colorado         6.3 42.0 170.7 292.9 1935.2 3903.2  477.1
Connecticut      4.2 16.8 129.5 131.8 1346.0 2620.7  593.2
Delaware         6.0 24.9 157.0 194.2 1682.6 3678.4  467.0
Florida         10.2 39.6 187.9 449.1 1859.9 3840.5  351.4
Georgia         11.7 31.1 140.5 256.5 1351.1 2170.2  297.9
Hawaii           7.2 25.5 128.0  64.1 1911.5 3920.4  489.4
Idaho            5.5 19.4  39.6 172.5 1050.8 2599.6  237.6
Illinois         9.9 21.8 211.3 209.0 1085.0 2828.5  528.6
Indiana          7.4 26.5 123.2 153.5 1086.2 2498.7  377.4
Iowa             2.3 10.6  41.2  89.8  812.5 2685.1  219.9
Kansas           6.6 22.0 100.7 180.5 1270.4 2739.3  244.3
Kentucky        10.1 19.1  81.1 123.3  872.2 1662.1  245.4
Louisiana       15.5 30.9 142.9 335.5 1165.5 2469.9  337.7
Maine            2.4 13.5  38.7 170.0 1253.1 2350.7  246.9
Maryland         8.0 34.8 292.1 358.9 1400.0 3177.7  428.5
Massachusetts    3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan         9.3 38.9 261.9 274.6 1522.7 3159.0  545.5
Minnesota        2.7 19.5  85.9  85.8 1134.7 2559.3  343.1
Mississippi     14.3 19.6  65.7 189.1  915.6 1239.9  144.4
Missouri         9.6 28.3 189.0 233.5 1318.3 2424.2  378.4
Montana          5.4 16.7  39.2 156.8  804.9 2773.2  309.2
Nebraska         3.9 18.1  64.7 112.7  760.0 2316.1  249.1
Nevada          15.8 49.1 323.1 355.0 2453.1 4212.6  559.2
New Hampshire    3.2 10.7  23.2  76.0 1041.7 2343.9  293.4
New Jersey       5.6 21.0 180.4 185.1 1435.8 2774.5  511.5
New Mexico       8.8 39.1 109.6 343.4 1418.7 3008.6  259.5
New York        10.7 29.4 472.6 319.1 1728.0 2782.0  745.8
North Carolina  10.6 17.0  61.3 318.3 1154.1 2037.8  192.1
North Dakota     0.9  9.0  13.3  43.8  446.1 1843.0  144.7
Ohio             7.8 27.3 190.5 181.1 1216.0 2696.8  400.4
Oklahoma         8.6 29.2  73.8 205.0 1288.2 2228.1  326.8
Oregon           4.9 39.9 124.1 286.9 1636.4 3506.1  388.9
Pennsylvania     5.6 19.0 130.3 128.0  877.5 1624.1  333.2
Rhode Island     3.6 10.5  86.5 201.0 1489.5 2844.1  791.4
South Carolina  11.9 33.0 105.9 485.3 1613.6 2342.4  245.1
South Dakota     2.0 13.5  17.9 155.7  570.5 1704.4  147.5
Tennessee       10.1 29.7 145.8 203.9 1259.7 1776.5  314.0
Texas           13.3 33.8 152.4 208.2 1603.1 2988.7  397.6
Utah             3.5 20.3  68.8 147.3 1171.6 3004.6  334.5
Vermont          1.4 15.9  30.8 101.2 1348.2 2201.0  265.2
Virginia         9.0 23.3  92.1 165.7  986.2 2521.2  226.7
Washington       4.3 39.6 106.2 224.8 1605.6 3386.9  360.3
West Virginia    6.0 13.2  42.2  90.9  597.4 1341.7  163.3
Wisconsin        2.8 12.9  52.2  63.7  846.9 2614.2  220.7
Wyoming          5.4 21.9  39.7 173.9  811.6 2772.2  282.0
;
proc princomp out=Crime_Components;
run;
Output 58.2.1. Results of Principal Component Analysis: PROC PRINCOMP

               Crime Rates per 100,000 Population by State

                         The PRINCOMP Procedure

                         Observations    50
                         Variables        7

                           Simple Statistics

              Murder          Rape        Robbery        Assault
   Mean  7.444000000   25.73400000    124.0920000    211.3000000
   StD   3.866768941   10.75962995     88.3485672    100.2530492

            Burglary       Larceny     Auto_Theft
   Mean  1291.904000   2671.288000    377.5260000
   StD    432.455711    725.908707    193.3944175

                           Correlation Matrix

               Murder    Rape  Robbery  Assault  Burglary  Larceny  Auto_Theft
   Murder      1.0000  0.6012   0.4837   0.6486    0.3858   0.1019      0.0688
   Rape        0.6012  1.0000   0.5919   0.7403    0.7121   0.6140      0.3489
   Robbery     0.4837  0.5919   1.0000   0.5571    0.6372   0.4467      0.5907
   Assault     0.6486  0.7403   0.5571   1.0000    0.6229   0.4044      0.2758
   Burglary    0.3858  0.7121   0.6372   0.6229    1.0000   0.7921      0.5580
   Larceny     0.1019  0.6140   0.4467   0.4044    0.7921   1.0000      0.4442
   Auto_Theft  0.0688  0.3489   0.5907   0.2758    0.5580   0.4442      1.0000

                 Eigenvalues of the Correlation Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    4.11495951    2.87623768        0.5879        0.5879
   2    1.23872183    0.51290521        0.1770        0.7648
   3    0.72581663    0.40938458        0.1037        0.8685
   4    0.31643205    0.05845759        0.0452        0.9137
   5    0.25797446    0.03593499        0.0369        0.9506
   6    0.22203947    0.09798342        0.0317        0.9823
   7    0.12405606                      0.0177        1.0000

                             Eigenvectors

                  Prin1     Prin2     Prin3     Prin4     Prin5     Prin6     Prin7
   Murder      0.300279  -.629174  0.178245  -.232114  0.538123  0.259117  0.267593
   Rape        0.431759  -.169435  -.244198  0.062216  0.188471  -.773271  -.296485
   Robbery     0.396875  0.042247  0.495861  -.557989  -.519977  -.114385  -.003903
   Assault     0.396652  -.343528  -.069510  0.629804  -.506651  0.172363  0.191745
   Burglary    0.440157  0.203341  -.209895  -.057555  0.101033  0.535987  -.648117
   Larceny     0.357360  0.402319  -.539231  -.234890  0.030099  0.039406  0.601690
   Auto_Theft  0.295177  0.502421  0.568384  0.419238  0.369753  -.057298  0.147046
The eigenvalues indicate that two or three components provide a good summary of the data, two components accounting for 76% of the total variance and three components explaining 87%. Subsequent components contribute less than 5% each.

The first component is a measure of overall crime rate since the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on the variables Auto_Theft and Larceny and high negative loadings on the variables Murder and Assault. There is also a small positive loading on Burglary and a small negative loading on Rape. This component seems to measure the preponderance of property crime over violent crime. The interpretation of the third component is not obvious.

A simple way to examine the principal components in more detail is to display the output data set sorted by each of the large components. The following statements produce Output 58.2.2 through Output 58.2.3:

   proc sort data=Crime_Components;
      by Prin1;
   run;

   proc print;
      id State;
      var Prin1 Prin2 Murder Rape Robbery Assault Burglary
          Larceny Auto_Theft;
      title2 'States Listed in Order of Overall Crime Rate';
      title3 'As Determined by the First Principal Component';
   run;
   proc sort data=Crime_Components;
      by Prin2;
   run;

   proc print;
      id State;
      var Prin1 Prin2 Murder Rape Robbery Assault Burglary
          Larceny Auto_Theft;
      title2 'States Listed in Order of Property Vs. Violent Crime';
      title3 'As Determined by the Second Principal Component';
   run;
Output 58.2.2. OUT= Data Set Sorted by First Principal Component

               Crime Rates per 100,000 Population by State
               States Listed in Order of Overall Crime Rate
              As Determined by the First Principal Component

State             Prin1     Prin2  Murder  Rape  Robbery  Assault  Burglary  Larceny  Auto_Theft
North Dakota    -3.96408   0.38767    0.9   9.0     13.3     43.8     446.1   1843.0       144.7
South Dakota    -3.17203  -0.25446    2.0  13.5     17.9    155.7     570.5   1704.4       147.5
West Virginia   -3.14772  -0.81425    6.0  13.2     42.2     90.9     597.4   1341.7       163.3
Iowa            -2.58156   0.82475    2.3  10.6     41.2     89.8     812.5   2685.1       219.9
Wisconsin       -2.50296   0.78083    2.8  12.9     52.2     63.7     846.9   2614.2       220.7
New Hampshire   -2.46562   0.82503    3.2  10.7     23.2     76.0    1041.7   2343.9       293.4
Nebraska        -2.15071   0.22574    3.9  18.1     64.7    112.7     760.0   2316.1       249.1
Vermont         -2.06433   0.94497    1.4  15.9     30.8    101.2    1348.2   2201.0       265.2
Maine           -1.82631   0.57878    2.4  13.5     38.7    170.0    1253.1   2350.7       246.9
Kentucky        -1.72691  -1.14663   10.1  19.1     81.1    123.3     872.2   1662.1       245.4
Pennsylvania    -1.72007  -0.19590    5.6  19.0    130.3    128.0     877.5   1624.1       333.2
Montana         -1.66801   0.27099    5.4  16.7     39.2    156.8     804.9   2773.2       309.2
Minnesota       -1.55434   1.05644    2.7  19.5     85.9     85.8    1134.7   2559.3       343.1
Mississippi     -1.50736  -2.54671   14.3  19.6     65.7    189.1     915.6   1239.9       144.4
Idaho           -1.43245  -0.00801    5.5  19.4     39.6    172.5    1050.8   2599.6       237.6
Wyoming         -1.42463   0.06268    5.4  21.9     39.7    173.9     811.6   2772.2       282.0
Arkansas        -1.05441  -1.34544    8.8  27.6     83.2    203.4     972.6   1862.1       183.4
Utah            -1.04996   0.93656    3.5  20.3     68.8    147.3    1171.6   3004.6       334.5
Virginia        -0.91621  -0.69265    9.0  23.3     92.1    165.7     986.2   2521.2       226.7
North Carolina  -0.69925  -1.67027   10.6  17.0     61.3    318.3    1154.1   2037.8       192.1
Kansas          -0.63407  -0.02804    6.6  22.0    100.7    180.5    1270.4   2739.3       244.3
Connecticut     -0.54133   1.50123    4.2  16.8    129.5    131.8    1346.0   2620.7       593.2
Indiana         -0.49990   0.00003    7.4  26.5    123.2    153.5    1086.2   2498.7       377.4
Oklahoma        -0.32136  -0.62429    8.6  29.2     73.8    205.0    1288.2   2228.1       326.8
Rhode Island    -0.20156   2.14658    3.6  10.5     86.5    201.0    1489.5   2844.1       791.4
Tennessee       -0.13660  -1.13498   10.1  29.7    145.8    203.9    1259.7   1776.5       314.0
Alabama         -0.04988  -2.09610   14.2  25.2     96.8    278.3    1135.5   1881.9       280.7
New Jersey       0.21787   0.96421    5.6  21.0    180.4    185.1    1435.8   2774.5       511.5
Ohio             0.23953   0.09053    7.8  27.3    190.5    181.1    1216.0   2696.8       400.4
Georgia          0.49041  -1.38079   11.7  31.1    140.5    256.5    1351.1   2170.2       297.9
Illinois         0.51290   0.09423    9.9  21.8    211.3    209.0    1085.0   2828.5       528.6
Missouri         0.55637  -0.55851    9.6  28.3    189.0    233.5    1318.3   2424.2       378.4
Hawaii           0.82313   1.82392    7.2  25.5    128.0     64.1    1911.5   3920.4       489.4
Washington       0.93058   0.73776    4.3  39.6    106.2    224.8    1605.6   3386.9       360.3
Delaware         0.96458   1.29674    6.0  24.9    157.0    194.2    1682.6   3678.4       467.0
Massachusetts    0.97844   2.63105    3.1  20.8    169.1    231.6    1532.2   2311.3      1140.1
Louisiana        1.12020  -2.08327   15.5  30.9    142.9    335.5    1165.5   2469.9       337.7
New Mexico       1.21417  -0.95076    8.8  39.1    109.6    343.4    1418.7   3008.6       259.5
Texas            1.39696  -0.68131   13.3  33.8    152.4    208.2    1603.1   2988.7       397.6
Oregon           1.44900   0.58603    4.9  39.9    124.1    286.9    1636.4   3506.1       388.9
South Carolina   1.60336  -2.16211   11.9  33.0    105.9    485.3    1613.6   2342.4       245.1
Maryland         2.18280  -0.19474    8.0  34.8    292.1    358.9    1400.0   3177.7       428.5
Michigan         2.27333   0.15487    9.3  38.9    261.9    274.6    1522.7   3159.0       545.5
Alaska           2.42151   0.16652   10.8  51.6     96.8    284.0    1331.7   3369.8       753.3
Colorado         2.50929   0.91660    6.3  42.0    170.7    292.9    1935.2   3903.2       477.1
Arizona          3.01414   0.84495    9.5  34.2    138.2    312.3    2346.1   4467.4       439.5
Florida          3.11175  -0.60392   10.2  39.6    187.9    449.1    1859.9   3840.5       351.4
New York         3.45248   0.43289   10.7  29.4    472.6    319.1    1728.0   2782.0       745.8
California       4.28380   0.14319   11.5  49.4    287.0    358.0    2139.4   3499.8       663.5
Nevada           5.26699  -0.25262   15.8  49.1    323.1    355.0    2453.1   4212.6       559.2
Output 58.2.3. OUT= Data Set Sorted by Second Principal Component

               Crime Rates per 100,000 Population by State
            States Listed in Order of Property Vs. Violent Crime
              As Determined by the Second Principal Component

State             Prin1     Prin2  Murder  Rape  Robbery  Assault  Burglary  Larceny  Auto_Theft
Mississippi     -1.50736  -2.54671   14.3  19.6     65.7    189.1     915.6   1239.9       144.4
South Carolina   1.60336  -2.16211   11.9  33.0    105.9    485.3    1613.6   2342.4       245.1
Alabama         -0.04988  -2.09610   14.2  25.2     96.8    278.3    1135.5   1881.9       280.7
Louisiana        1.12020  -2.08327   15.5  30.9    142.9    335.5    1165.5   2469.9       337.7
North Carolina  -0.69925  -1.67027   10.6  17.0     61.3    318.3    1154.1   2037.8       192.1
Georgia          0.49041  -1.38079   11.7  31.1    140.5    256.5    1351.1   2170.2       297.9
Arkansas        -1.05441  -1.34544    8.8  27.6     83.2    203.4     972.6   1862.1       183.4
Kentucky        -1.72691  -1.14663   10.1  19.1     81.1    123.3     872.2   1662.1       245.4
Tennessee       -0.13660  -1.13498   10.1  29.7    145.8    203.9    1259.7   1776.5       314.0
New Mexico       1.21417  -0.95076    8.8  39.1    109.6    343.4    1418.7   3008.6       259.5
West Virginia   -3.14772  -0.81425    6.0  13.2     42.2     90.9     597.4   1341.7       163.3
Virginia        -0.91621  -0.69265    9.0  23.3     92.1    165.7     986.2   2521.2       226.7
Texas            1.39696  -0.68131   13.3  33.8    152.4    208.2    1603.1   2988.7       397.6
Oklahoma        -0.32136  -0.62429    8.6  29.2     73.8    205.0    1288.2   2228.1       326.8
Florida          3.11175  -0.60392   10.2  39.6    187.9    449.1    1859.9   3840.5       351.4
Missouri         0.55637  -0.55851    9.6  28.3    189.0    233.5    1318.3   2424.2       378.4
South Dakota    -3.17203  -0.25446    2.0  13.5     17.9    155.7     570.5   1704.4       147.5
Nevada           5.26699  -0.25262   15.8  49.1    323.1    355.0    2453.1   4212.6       559.2
Pennsylvania    -1.72007  -0.19590    5.6  19.0    130.3    128.0     877.5   1624.1       333.2
Maryland         2.18280  -0.19474    8.0  34.8    292.1    358.9    1400.0   3177.7       428.5
Kansas          -0.63407  -0.02804    6.6  22.0    100.7    180.5    1270.4   2739.3       244.3
Idaho           -1.43245  -0.00801    5.5  19.4     39.6    172.5    1050.8   2599.6       237.6
Indiana         -0.49990   0.00003    7.4  26.5    123.2    153.5    1086.2   2498.7       377.4
Wyoming         -1.42463   0.06268    5.4  21.9     39.7    173.9     811.6   2772.2       282.0
Ohio             0.23953   0.09053    7.8  27.3    190.5    181.1    1216.0   2696.8       400.4
Illinois         0.51290   0.09423    9.9  21.8    211.3    209.0    1085.0   2828.5       528.6
California       4.28380   0.14319   11.5  49.4    287.0    358.0    2139.4   3499.8       663.5
Michigan         2.27333   0.15487    9.3  38.9    261.9    274.6    1522.7   3159.0       545.5
Alaska           2.42151   0.16652   10.8  51.6     96.8    284.0    1331.7   3369.8       753.3
Nebraska        -2.15071   0.22574    3.9  18.1     64.7    112.7     760.0   2316.1       249.1
Montana         -1.66801   0.27099    5.4  16.7     39.2    156.8     804.9   2773.2       309.2
North Dakota    -3.96408   0.38767    0.9   9.0     13.3     43.8     446.1   1843.0       144.7
New York         3.45248   0.43289   10.7  29.4    472.6    319.1    1728.0   2782.0       745.8
Maine           -1.82631   0.57878    2.4  13.5     38.7    170.0    1253.1   2350.7       246.9
Oregon           1.44900   0.58603    4.9  39.9    124.1    286.9    1636.4   3506.1       388.9
Washington       0.93058   0.73776    4.3  39.6    106.2    224.8    1605.6   3386.9       360.3
Wisconsin       -2.50296   0.78083    2.8  12.9     52.2     63.7     846.9   2614.2       220.7
Iowa            -2.58156   0.82475    2.3  10.6     41.2     89.8     812.5   2685.1       219.9
New Hampshire   -2.46562   0.82503    3.2  10.7     23.2     76.0    1041.7   2343.9       293.4
Arizona          3.01414   0.84495    9.5  34.2    138.2    312.3    2346.1   4467.4       439.5
Colorado         2.50929   0.91660    6.3  42.0    170.7    292.9    1935.2   3903.2       477.1
Utah            -1.04996   0.93656    3.5  20.3     68.8    147.3    1171.6   3004.6       334.5
Vermont         -2.06433   0.94497    1.4  15.9     30.8    101.2    1348.2   2201.0       265.2
New Jersey       0.21787   0.96421    5.6  21.0    180.4    185.1    1435.8   2774.5       511.5
Minnesota       -1.55434   1.05644    2.7  19.5     85.9     85.8    1134.7   2559.3       343.1
Delaware         0.96458   1.29674    6.0  24.9    157.0    194.2    1682.6   3678.4       467.0
Connecticut     -0.54133   1.50123    4.2  16.8    129.5    131.8    1346.0   2620.7       593.2
Hawaii           0.82313   1.82392    7.2  25.5    128.0     64.1    1911.5   3920.4       489.4
Rhode Island    -0.20156   2.14658    3.6  10.5     86.5    201.0    1489.5   2844.1       791.4
Massachusetts    0.97844   2.63105    3.1  20.8    169.1    231.6    1532.2   2311.3      1140.1
Another recommended procedure is to make scatter plots of the first few components. The sorted listings help to identify observations on the plots. The following statements produce Output 58.2.4 through Output 58.2.5:

   title2 'Plot of the First Two Principal Components';
   %plotit(data=Crime_Components, labelvar=State,
           plotvars=Prin2 Prin1, color=black, colors=black);
   run;
   title2 'Plot of the First and Third Principal Components';
   %plotit(data=Crime_Components, labelvar=State,
           plotvars=Prin3 Prin1, color=black, colors=black);
   run;
Output 58.2.4. Plot of the First Two Principal Components
Output 58.2.5. Plot of the First and Third Principal Components
It is possible to identify regional trends on the plot of the first two components. Nevada and California are at the extreme right, with high overall crime rates but an average ratio of property crime to violent crime. North and South Dakota are on the extreme left with low overall crime rates. Southeastern states tend to be in the bottom of the plot, with a higher-than-average ratio of violent crime to property crime. New England states tend to be in the upper part of the plot, with a greater-than-average ratio of property crime to violent crime. The most striking feature of the plot of the first and third principal components is that Massachusetts and New York are outliers on the third component.
Example 58.3. Basketball Data

The data in this example are rankings of 35 college basketball teams. The rankings were made before the start of the 1985–86 season by 10 news services. The purpose of the principal component analysis is to compute a single variable that best summarizes all 10 of the preseason rankings.

Note that the various news services rank different numbers of teams, varying from 20 through 30 (there is a missing rank in one of the variables, WashPost). And,
of course, each service does not rank the same teams, so there are missing values in these data. Each of the 35 teams is ranked by at least one news service.

The PRINCOMP procedure omits observations with missing values. To obtain principal component scores for all of the teams, it is necessary to replace the missing values. Since it is the best teams that are ranked, it is not appropriate to replace missing values with the mean of the nonmissing values. Instead, an ad hoc method is used that replaces missing values by the mean of the unassigned ranks. For example, if 20 teams are ranked by a news service, then ranks 21 through 35 are unassigned. The mean of ranks 21 through 35 is 28, so missing values for that variable are replaced by the value 28. To prevent the method of missing-value replacement from having an undue effect on the analysis, each observation is weighted according to the number of nonmissing values it has.

See Example 59.3 in Chapter 59, "The PRINQUAL Procedure," for an alternative analysis of these data.

Since the first principal component accounts for 78% of the variance, there is substantial agreement among the rankings. The eigenvector shows that all the news services are about equally weighted, so a simple average would work almost as well as the first principal component.

The following statements produce Output 58.3.1 through Output 58.3.3:

/*-----------------------------------------------------------*/
/*                                                           */
/* Preseason 1985 College Basketball Rankings                */
/* (rankings of 35 teams by 10 news services)                */
/*                                                           */
/* Note: (a) news services rank varying numbers of teams;    */
/*       (b) not all teams are ranked by all news services;  */
/*       (c) each team is ranked by at least one service;    */
/*       (d) rank 20 is missing for UPI.                     */
/*                                                           */
/*-----------------------------------------------------------*/

data HoopsRanks;
   input School $13. CSN DurSun DurHer WashPost USAToday Sport
         InSports UPI AP SI;
   label CSN      = 'Community Sports News (Chapel Hill, NC)'
         DurSun   = 'Durham Sun'
         DurHer   = 'Durham Morning Herald'
         WashPost = 'Washington Post'
         USAToday = 'USA Today'
         Sport    = 'Sport Magazine'
         InSports = 'Inside Sports'
         UPI      = 'United Press International'
         AP       = 'Associated Press'
         SI       = 'Sports Illustrated';
   format CSN--SI 5.1;
   cards;
Louisville     1  8  1  9  8  9  6 10  9  9
Georgia Tech   2  2  4  3  1  1  1  2  1  1
Kansas         3  4  5  1  5 11  8  4  5  7
Michigan       4  5  9  4  2  5  3  1  3  2
Duke           5  6  7  5  4 10  4  5  6  5
UNC            6  1  2  2  3  4  2  3  2  3
Syracuse       7 10  6 11  6  6  5  6  4 10
Notre Dame     8 14 15 13 11 20 18 13 12  .
Kentucky       9 15 16 14 14 19 11 12 11 13
LSU           10  9 13  . 13 15 16  9 14  8
DePaul        11  . 21 15 20  . 19  .  . 19
Georgetown    12  7  8  6  9  2  9  8  8  4
Navy          13 20 23 10 18 13 15  . 20  .
Illinois      14  3  3  7  7  3 10  7  7  6
Iowa          15 16  .  . 23  .  . 14  . 20
Arkansas      16  .  .  . 25  .  .  .  . 16
Memphis State 17  . 11  . 16  8 20  . 15 12
Washington    18  .  .  .  .  .  . 17  .  .
UAB           19 13 10  . 12 17  . 16 16 15
UNLV          20 18 18 19 22  . 14 18 18  .
NC State      21 17 14 16 15  . 12 15 17 18
Maryland      22  .  .  . 19  .  .  . 19 14
Pittsburgh    23  .  .  .  .  .  .  .  .  .
Oklahoma      24 19 17 17 17 12 17  . 13 17
Indiana       25 12 20 18 21  .  .  .  .  .
Virginia      26  . 22  .  . 18  .  .  .  .
Old Dominion  27  .  .  .  .  .  .  .  .  .
Auburn        28 11 12  8 10  7  7 11 10 11
St. Johns     29  .  .  .  . 14  .  .  .  .
UCLA          30  .  .  .  .  .  . 19  .  .
St. Joseph's   .  . 19  .  .  .  .  .  .  .
Tennessee      .  . 24  .  . 16  .  .  .  .
Montana        .  .  . 20  .  .  .  .  .  .
Houston        .  .  .  . 24  .  .  .  .  .
Virginia Tech  .  .  .  .  .  . 13  .  .  .
;

/*-----------------------------------------------------------*/
/* PROC MEANS is used to output a data set containing the    */
/* maximum value of each of the newspaper and magazine       */
/* rankings. The output data set, MaxRank, is then used to   */
/* set the missing values to the next highest rank plus      */
/* thirty-six, divided by two (that is, the mean of the      */
/* missing ranks). This ad hoc method of replacing missing   */
/* values is based more on intuition than on rigorous        */
/* statistical theory. Observations are weighted by the      */
/* number of nonmissing values.                              */
/*-----------------------------------------------------------*/

title 'Pre-Season 1985 College Basketball Rankings';

proc means data=HoopsRanks;
   output out=MaxRank
      max=CSNMax DurSunMax DurHerMax WashPostMax USATodayMax
          SportMax InSportsMax UPIMax APMax SIMax;
run;
Output 58.3.1. Summary Statistics for Basketball Rankings Using PROC MEANS

               Pre-Season 1985 College Basketball Rankings

                          The MEANS Procedure

Variable   Label                                        N          Mean
------------------------------------------------------------------------
CSN        Community Sports News (Chapel Hill, NC)     30    15.5000000
DurSun     Durham Sun                                  20    10.5000000
DurHer     Durham Morning Herald                       24    12.5000000
WashPost   Washington Post                             19    10.4210526
USAToday   USA Today                                   25    13.0000000
Sport      Sport Magazine                              20    10.5000000
InSports   Inside Sports                               20    10.5000000
UPI        United Press International                  19    10.0000000
AP         Associated Press                            20    10.5000000
SI         Sports Illustrated                          20    10.5000000
------------------------------------------------------------------------

Variable   Label                                      Std Dev     Minimum
--------------------------------------------------------------------------
CSN        Community Sports News (Chapel Hill, NC)  8.8034084   1.0000000
DurSun     Durham Sun                                5.9160798   1.0000000
DurHer     Durham Morning Herald                     7.0710678   1.0000000
WashPost   Washington Post                           6.0673607   1.0000000
USAToday   USA Today                                 7.3598007   1.0000000
Sport      Sport Magazine                            5.9160798   1.0000000
InSports   Inside Sports                             5.9160798   1.0000000
UPI        United Press International                5.6273143   1.0000000
AP         Associated Press                          5.9160798   1.0000000
SI         Sports Illustrated                        5.9160798   1.0000000
--------------------------------------------------------------------------

Variable   Label                                       Maximum
----------------------------------------------------------------
CSN        Community Sports News (Chapel Hill, NC)  30.0000000
DurSun     Durham Sun                                20.0000000
DurHer     Durham Morning Herald                     24.0000000
WashPost   Washington Post                           20.0000000
USAToday   USA Today                                 25.0000000
Sport      Sport Magazine                            20.0000000
InSports   Inside Sports                             20.0000000
UPI        United Press International                19.0000000
AP         Associated Press                          20.0000000
SI         Sports Illustrated                        20.0000000
----------------------------------------------------------------
data Basketball;
   set HoopsRanks;
   if _n_=1 then set MaxRank;
   array Services{10} CSN--SI;
   array MaxRanks{10} CSNMax--SIMax;
   keep School CSN--SI Weight;
   Weight=0;
   do i=1 to 10;
      if Services{i}=. then Services{i}=(MaxRanks{i}+36)/2;
      else Weight=Weight+1;
   end;
run;
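The expression (MaxRanks{i}+36)/2 in this DATA step is the mean of the unassigned ranks described earlier: if a service ranks m teams, ranks m+1 through 35 are unassigned, and their mean is

   ((m+1) + 35) / 2 = (m + 36) / 2

For example, with m = 20 the replacement value is 56/2 = 28.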
proc princomp data=Basketball n=1 out=PCBasketball standard;
   var CSN--SI;
   weight Weight;
run;
Output 58.3.2. Principal Components Analysis of Basketball Rankings Using PROC PRINCOMP

                         The PRINCOMP Procedure

                         Observations    35
                         Variables       10

                           Simple Statistics

                 CSN        DurSun        DurHer      WashPost      USAToday
   Mean  13.33640553   13.06451613   12.88018433   13.83410138   12.55760369
   StD   22.08036285   21.66394183   21.38091837   23.47841791   20.48207965

               Sport      InSports           UPI            AP            SI
   Mean  13.83870968   13.24423963   13.59216590   12.83410138   13.52534562
   StD   23.37756267   22.20231526   23.25602811   21.40782406   22.93219584

                           Correlation Matrix

                CSN  DurSun  DurHer  WashPost  USAToday   Sport  InSports     UPI      AP      SI
   CSN       1.0000  0.6505  0.6415    0.6121    0.7456  0.4806    0.6558  0.7007  0.6779  0.6135
   DurSun    0.6505  1.0000  0.8341    0.7667    0.8860  0.6940    0.7702  0.9015  0.8437  0.7518
   DurHer    0.6415  0.8341  1.0000    0.7035    0.8877  0.7788    0.7900  0.7676  0.8788  0.7761
   WashPost  0.6121  0.7667  0.7035    1.0000    0.7984  0.6598    0.8717  0.6953  0.7809  0.5952
   USAToday  0.7456  0.8860  0.8877    0.7984    1.0000  0.7716    0.8475  0.8539  0.9479  0.8426
   Sport     0.4806  0.6940  0.7788    0.6598    0.7716  1.0000    0.7176  0.6220  0.8217  0.7701
   InSports  0.6558  0.7702  0.7900    0.8717    0.8475  0.7176    1.0000  0.7920  0.8830  0.7332
   UPI       0.7007  0.9015  0.7676    0.6953    0.8539  0.6220    0.7920  1.0000  0.8436  0.7738
   AP        0.6779  0.8437  0.8788    0.7809    0.9479  0.8217    0.8830  0.8436  1.0000  0.8212
   SI        0.6135  0.7518  0.7761    0.5952    0.8426  0.7701    0.7332  0.7738  0.8212  1.0000

                 Eigenvalues of the Correlation Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    7.88601647                      0.7886        0.7886

                             Eigenvectors

                                                            Prin1
   CSN       Community Sports News (Chapel Hill, NC)     0.270205
   DurSun    Durham Sun                                  0.326048
   DurHer    Durham Morning Herald                       0.324392
   WashPost  Washington Post                             0.300449
   USAToday  USA Today                                   0.345200
   Sport     Sport Magazine                              0.293881
   InSports  Inside Sports                               0.324088
   UPI       United Press International                  0.319902
   AP        Associated Press                            0.342151
   SI        Sports Illustrated                          0.308570
proc sort data=PCBasketball;
   by Prin1;
run;

proc print;
   var School Prin1;
   title 'Pre-Season 1985 College Basketball Rankings';
   title2 'College Teams as Ordered by PROC PRINCOMP';
run;
Output 58.3.3. Basketball Rankings Using PROC PRINCOMP

               Pre-Season 1985 College Basketball Rankings
               College Teams as Ordered by PROC PRINCOMP

   OBS   School           Prin1
     1   Georgia Tech    -0.58068
     2   UNC             -0.53317
     3   Michigan        -0.47874
     4   Kansas          -0.40285
     5   Duke            -0.38464
     6   Illinois        -0.33586
     7   Syracuse        -0.31578
     8   Louisville      -0.31489
     9   Georgetown      -0.29735
    10   Auburn          -0.09785
    11   Kentucky         0.00843
    12   LSU              0.00872
    13   Notre Dame       0.09407
    14   NC State         0.19404
    15   UAB              0.19771
    16   Oklahoma         0.23864
    17   Memphis State    0.25319
    18   Navy             0.28921
    19   UNLV             0.35103
    20   DePaul           0.43770
    21   Iowa             0.50213
    22   Indiana          0.51713
    23   Maryland         0.55910
    24   Arkansas         0.62977
    25   Virginia         0.67586
    26   Washington       0.67756
    27   Tennessee        0.70822
    28   St. Johns        0.71425
    29   Virginia Tech    0.71638
    30   St. Joseph's     0.73492
    31   UCLA             0.73965
    32   Pittsburgh       0.75078
    33   Houston          0.75534
    34   Montana          0.75790
    35   Old Dominion     0.76821
Example 58.4. PRINCOMP Graphics (Experimental)

This example illustrates the experimental ODS graphics in PROC PRINCOMP, using the example in the "Getting Started" section on page 3596. The following statements request plots in PROC PRINCOMP:
   ods html;
   ods graphics on;

   proc princomp data=Jobratings(drop='Overall Rating'n) n=5;
   run;

   ods graphics off;
   ods html close;
These graphical displays are requested by specifying the experimental ODS GRAPHICS statement. For general information about ODS graphics, see Chapter 15, "Statistical Graphics Using ODS." For specific information about the graphics available in the PRINCOMP procedure, see the "ODS Graphics" section on page 3613. The N=5 option in the PROC PRINCOMP statement sets the number of principal components to 5.

Output 58.4.1 shows the eigenvalue plots. Each point in the plot on the left shows an eigenvalue; each point in the plot on the right shows the (cumulative) proportion of variance explained by each component.

Output 58.4.1. Eigenvalue Scatter Plot (Experimental)
Output 58.4.2 shows a scatter plot matrix of the first five components. The histogram of each component is displayed in the diagonal elements of the matrix.
Output 58.4.2. Component Scores Matrix Plot (Experimental)
Output 58.4.3 shows a component pattern profile. The Y-axis shows the correlation between a component and a variable. There is one profile for each component, and line patterns are used to differentiate the components. The nearly horizontal profile for the first component indicates that the first component is about equally correlated with all the variables. The second component is positively correlated with the variables Observational Skills and Willingness to Confront Problems and is negatively correlated with the variables Interest in People and Interpersonal Sensitivity.
Output 58.4.3. Component Pattern Plot (Experimental)
Output 58.4.4 shows a scatter plot of the first and second components. Observation numbers are used as the plotting symbol.
Output 58.4.4. Component Scores Plot: 1st versus 2nd (Experimental)
Output 58.4.5 shows a scatter plot of the first and third components.
Output 58.4.5. Component Scores Plot: 1st versus 3rd (Experimental)
Output 58.4.6 shows a scatter plot of the second and third components, displaying density with color. Color interpolation is based on the first component, going from blue (light gray) at the minimum, through magenta (dark gray) at the median, to red (black) at the maximum.
Output 58.4.6. Painted Components Scores Plot: 2nd versus 3rd, Painted by 1st (Experimental)
References

Cooley, W.W. and Lohnes, P.R. (1971), Multivariate Data Analysis, New York: John Wiley & Sons, Inc.

Gnanadesikan, R. (1977), Methods for Statistical Data Analysis of Multivariate Observations, New York: John Wiley & Sons, Inc.

Hotelling, H. (1933), "Analysis of a Complex of Statistical Variables into Principal Components," Journal of Educational Psychology, 24, 417–441, 498–520.

Kshirsagar, A.M. (1972), Multivariate Analysis, New York: Marcel Dekker, Inc.

Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis, London: Academic Press.

Morrison, D.F. (1976), Multivariate Statistical Methods, Second Edition, New York: McGraw-Hill Book Co.

Pearson, K. (1901), "On Lines and Planes of Closest Fit to Systems of Points in Space," Philosophical Magazine, 6(2), 559–572.

Rao, C.R. (1964), "The Use and Interpretation of Principal Component Analysis in Applied Research," Sankhya A, 26, 329–358.
Chapter 59
The PRINQUAL Procedure

Chapter Contents

OVERVIEW ............................................................ 3641
   The Three Methods of Variable Transformation ..................... 3643
GETTING STARTED ..................................................... 3644
SYNTAX .............................................................. 3651
   PROC PRINQUAL Statement .......................................... 3651
   BY Statement ..................................................... 3658
   FREQ Statement ................................................... 3658
   ID Statement ..................................................... 3659
   TRANSFORM Statement .............................................. 3659
   WEIGHT Statement ................................................. 3666
DETAILS ............................................................. 3667
   Missing Values ................................................... 3667
   Controlling the Number of Iterations ............................. 3667
   Performing a Principal Component Analysis of Transformed Data .... 3668
   Using the MAC Method ............................................. 3669
   Output Data Set .................................................. 3669
   Avoiding Constant Transformations ................................ 3672
   Constant Variables ............................................... 3673
   Character OPSCORE Variables ...................................... 3673
   REITERATE Option Usage ........................................... 3673
   Passive Observations ............................................. 3674
   Computational Resources .......................................... 3675
   Displayed Output ................................................. 3676
   ODS Table Names .................................................. 3676
   ODS Graphics (Experimental) ...................................... 3677
EXAMPLES ............................................................ 3677
   Example 59.1. Multidimensional Preference Analysis of Cars Data .. 3677
   Example 59.2. MDPREF of Cars Data, ODS Graphics (Experimental) ... 3687
   Example 59.3. Principal Components of Basketball Rankings ........ 3688
REFERENCES .......................................................... 3700
Chapter 59
The PRINQUAL Procedure

Overview

The PRINQUAL procedure obtains linear and nonlinear transformations of variables by using the method of alternating least squares to optimize properties of the transformed variables' covariance or correlation matrix. Nonoptimal transformations for logarithm, rank, exponentiation, inverse sine, and logit are also available with PROC PRINQUAL.

Experimental graphics are now available with the PRINQUAL procedure. For more information, see the "ODS Graphics" section on page 3677.

The PRINQUAL (principal components of qualitative data) procedure is a data transformation procedure that is based on the work of Kruskal and Shepard (1974); Young, Takane, and de Leeuw (1978); Young (1981); and Winsberg and Ramsay (1983). You can use PROC PRINQUAL to

• generalize ordinary principal component analysis to a method capable of analyzing data that are not quantitative
• perform metric and nonmetric multidimensional preference (MDPREF) analyses (Carroll 1972)
• preprocess data, transforming variables prior to their use in other data analyses
• summarize mixed quantitative and qualitative data and detect nonlinear relationships
• reduce the number of variables for subsequent use in regression analyses, cluster analyses, and other analyses

The PRINQUAL procedure provides three methods of transforming a set of qualitative and quantitative variables to optimize the transformed variables' covariance or correlation matrix. These methods are

• maximum total variance (MTV)
• minimum generalized variance (MGV)
• maximum average correlation (MAC)

All three methods attempt to find transformations that decrease the rank of the covariance matrix computed from the transformed variables. Transforming the variables to maximize the variance accounted for by a few linear combinations (using the MTV method) locates the observations in a space with dimensionality that approximates the stated number of linear combinations as much as possible, given the transformation constraints. Transforming the variables to minimize their generalized variance or
maximize the sum of correlations also reduces the dimensionality. The transformed qualitative (nominal and ordinal) variables can be thought of as quantified by the analysis, with the quantification done in the context set by the algorithm. The data are quantified so that the proportion of variance accounted for by a stated number of principal components is locally maximal, the generalized variance of the variables is locally minimal, or the average of the correlations is locally maximal.

The data can contain variables with nominal, ordinal, interval, and ratio scales of measurement (Siegel 1956). Any mix is allowed with all methods. PROC PRINQUAL can

• transform nominal variables by scoring the categories to optimize the covariance matrix (Fisher 1938)
• transform ordinal variables monotonically by scoring the ordered categories so that order is weakly preserved (adjacent categories can be merged) and the covariance matrix is optimized. You can untie ties optimally or leave them tied (Kruskal 1964). You can also transform ordinal variables to ranks.
• transform interval and ratio scale of measurement variables linearly, or transform them nonlinearly with spline transformations (de Boor 1978; van Rijckevorsel 1982) or monotone spline transformations (Winsberg and Ramsay 1983). In addition, nonoptimal transformations for logarithm, exponential, power, logit, and inverse trigonometric sine are available.
• for all transformations, estimate missing data without constraint, with category constraints (missing values within the same group get the same value), and with order constraints (missing value estimates in adjacent groups can be tied to preserve a specified ordering). Refer to Gifi (1990) and Young (1981).

The PROC PRINQUAL iterations produce a set of transformed variables. Each variable's new scoring satisfies a set of constraints based on the original scoring of the variable and the specified transformation type. First, all variables are required to satisfy transformation standardization constraints; that is, all variables have a fixed mean and variance. The other constraints include linear constraints, weak order constraints, category constraints, and smoothness constraints. The new set of scores is selected from the sets of possible scorings that do not violate the constraints so that the method criterion is locally optimized.

The displayed output from PROC PRINQUAL is a listing of the iteration history. However, the primary output from PROC PRINQUAL is an output data set. By default, the procedure creates an output data set that contains variables with _TYPE_='SCORE'. These observations contain original variables, transformed variables, components, or data approximations. If you specify the CORRELATIONS option in the PROC PRINQUAL statement, the data set also contains observations with _TYPE_='CORR'; these observations contain correlations or component structure information.
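As a hedged sketch of a typical invocation (the data set Survey and all variable names are hypothetical), different transformation types can be mixed in a single TRANSFORM statement:

   proc prinqual data=Survey out=Transformed method=mtv n=2;
      transform monotone(Rating1 Rating2)   /* ordinal: monotone scoring */
                opscore(Region)             /* nominal: optimal scoring  */
                spline(Income / nknots=3)   /* nonlinear spline          */
                linear(Age);                /* linear transformation     */
   run;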
The Three Methods of Variable Transformation

The three methods of variable transformation provided by PROC PRINQUAL are discussed in the following sections.
The Maximum Total Variance (MTV) Method

The MTV method (Young, Takane, and de Leeuw 1978) is based on the principal component model, and it attempts to maximize the sum of the first r eigenvalues of the covariance matrix. This method transforms variables to be (in a least-squares sense) as similar to linear combinations of r principal component score variables as possible, where r can be much smaller than the number of variables. This maximizes the total variance of the first r components (the trace of the covariance matrix of the first r principal components). Refer to Kuhfeld, Sarle, and Young (1985).

On each iteration, the MTV algorithm alternates classical principal component analysis (Hotelling 1933) with optimal scaling (Young 1981). When all variables are ordinal preference ratings, this corresponds to Carroll's (1972) MDPREF analysis. You can request the dummy variable initialization method suggested by Tenenhaus and Vachette (1977), who independently proposed the same iterative algorithm for nominal and interval scale-of-measurement variables.
The Minimum Generalized Variance (MGV) Method

The MGV method (Sarle 1984) uses an iterated multiple regression algorithm in an attempt to minimize the determinant of the covariance matrix of the transformed variables. This method transforms each variable to be (in a least-squares sense) as similar to linear combinations of the remaining variables as possible. This locally minimizes the generalized variance of the transformed variables, the determinant of the covariance matrix, the volume of the parallelepiped defined by the transformed variables, and the sphericity (the extent to which a quadratic form in the optimized covariance matrix defines a sphere). Refer to Kuhfeld, Sarle, and Young (1985).

On each iteration for each variable, the MGV algorithm alternates multiple regression with optimal scaling. The multiple regression involves predicting the selected variable from all other variables. You can request a dummy variable initialization using a modification of the Tenenhaus and Vachette (1977) method that is appropriate with a regression algorithm. This method can be viewed as a way of investigating the nature of the linear and nonlinear dependencies in, and the rank of, a data matrix containing variables that can be nonlinearly transformed. This method tries to create a less-than-full-rank data matrix. The matrix contains the transformation of each variable that is most similar to what the other transformed variables predict.
The Maximum Average Correlation (MAC) Method

The MAC method (de Leeuw 1985) uses an iterated constrained multiple regression algorithm in an attempt to maximize the average of the elements of the correlation matrix. This method transforms each variable to be (in a least-squares sense) as similar to the average of the remaining variables as possible.

On each iteration for each variable, the MAC algorithm alternates computing an equally weighted average of the other variables with optimal scaling. The MAC
method is similar to the MGV method in that each variable is scaled to be as similar to a linear combination of the other variables as possible, given the constraints on the transformation. However, optimal weights are not computed. You can use the MAC method when all variables are positively correlated or when no monotonicity constraints are placed on any transformations. Do not use this method with negatively correlated variables when some optimal transformations are constrained to be increasing because the signs of the correlations are not taken into account. The MAC method is useful as an initialization method for the MTV and MGV methods.
Getting Started

In the following example, PROC PRINQUAL uses the MTV method. Suppose that the problem is to linearize a curve through three-dimensional space. Let
   X1 = X^3
   X2 = X1 - X^5
   X3 = X2 - X^6
where X = -1.00, -0.98, -0.96, ..., 1.00. These three variables define a curve in three-dimensional space. The GPLOT procedure is used to display two-dimensional views of this curve. These data are completely described by three linear components, but they define a single curve, which could be described as a single nonlinear component.

PROC PRINQUAL is used to attempt to straighten the curve into a one-dimensional line with a continuous transformation of each variable. The N=1 option in the PROC PRINQUAL statement requests one principal component. The TRANSFORM statement requests a cubic spline transformation with nine knots.

Splines are curves, which are usually required to be continuous and smooth. Splines are usually defined as piecewise polynomials of degree n with function values and first n - 1 derivatives that agree at the points where they join. The abscissa values of the join points are called knots. The term "spline" is also used for polynomials (splines with no knots) and piecewise polynomials with more than one discontinuous derivative. Splines with no knots are generally smoother than splines with knots, which are generally smoother than splines with multiple discontinuous derivatives. Splines with few knots are generally smoother than splines with many knots; however, increasing the number of knots usually increases the fit of the spline function to the data. Knots give the curve freedom to bend to more closely follow the data. Refer to Smith (1979) for an excellent introduction to splines. For another example of using splines, see Example 75.1 in Chapter 75, "The TRANSREG Procedure."

One component accounts for 71 percent of the variance of the untransformed data, and after 45 iterations, over 98 percent of the variance of the transformed data is accounted for by one component (see Figure 59.2). Note that the algorithm would
Note that the algorithm would not have converged with 50 iterations and the default convergence criterion, so more iterations may be needed for this problem. PROC PRINQUAL creates an output data set (which is not displayed) that contains both the original and transformed variables. The original variables have the names X1, X2, and X3. Transformed variables are named TX1, TX2, and TX3. All observations in the output data set have _TYPE_='SCORE', since the CORRELATIONS option is not specified in the PROC PRINQUAL statement.

The GPLOT procedure uses this output data set and displays the nonlinear transformations of all three variables and the nearly one-dimensional scatter plot (see Figure 59.3 and Figure 59.4). PROC PRINQUAL tries to project each variable on the first principal component. Notice that the curve in this example is closer to a circle than to a function from some views (see the plot of X3 vs. X2 in Figure 59.1) and that the first component does not run approximately from one end point of the curve to the other (see Figure 59.4). Since the curve has these characteristics, PROC PRINQUAL linearizes the scatter plot by collapsing the scatter around the principal axis, not by straightening the curve into a single line. PROC PRINQUAL would straighten simpler curves.

The following statements produce Figure 59.1 through Figure 59.4:

   * Generate a Three-Dimensional Curve;
   data X;
      do X = -1 to 1 by 0.02;
         X1 = X ** 3;
         X2 = X1 - X ** 5;
         X3 = X2 - X ** 6;
         output;
      end;
      drop X;
   run;

   goptions goutmode=replace nodisplay;
   %let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
   * Depending on your goptions, these plot options may work better:
   * %let opts = haxis=axis2 vaxis=axis1 frame;

   proc gplot data=X;
      title;
      axis1 minor=none label=(angle=90 rotate=0) order=(-1 to 1);
      axis2 minor=none order=(-1 to 1);
      plot X1*X2 / &opts name='prqin1';
      plot X3*X2 / &opts name='prqin2' vreverse;
      plot X1*X3 / &opts name='prqin3';
      symbol1 color=blue;
   run; quit;
   goptions display;

   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:prqin1 2:prqin2 3:prqin3;
   run; quit;
   * Try to Straighten the Curve;
   proc prinqual data=X n=1 maxiter=50 covariance converge=0.007;
      title 'Iteratively Derive Variable Transformations';
      transform spline(X1-X3 / nknots=9);
   run;

   * Plot the Transformations;
   goptions nodisplay;
   proc gplot;
      title;
      axis1 minor=none label=(angle=90 rotate=0);
      axis2 minor=none;
      plot TX1*X1 / &opts name='prqin4';
      plot TX2*X2 / &opts name='prqin5';
      plot TX3*X3 / &opts name='prqin6';
      symbol1 color=blue;
   run; quit;
   goptions display;

   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:prqin4 2:prqin6 3:prqin5;
   run; quit;

   * Plot the Straightened Scatter Plot;
   goptions nodisplay;
   proc gplot;
      axis1 minor=none label=(angle=90 rotate=0) order=(-1 to 1);
      axis2 minor=none order=(-1 to 1);
      plot TX1*TX2 / &opts name='prqin7';
      plot TX3*TX2 / &opts name='prqin8' vreverse;
      plot TX1*TX3 / &opts name='prqin9';
      symbol1 color=blue;
   run; quit;
   goptions display;

   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:prqin7 2:prqin8 3:prqin9;
   run; quit;
Figure 59.1. Three-Dimensional Curve Example Output
                Iteratively Derive Variable Transformations

                          The PRINQUAL Procedure

                 PRINQUAL MTV Algorithm Iteration History

   Iteration    Average    Maximum    Proportion    Criterion
      Number     Change     Change   of Variance       Change    Note
   --------------------------------------------------------------------
           1    0.16253    1.33045       0.71369
           2    0.07871    0.94549       0.79035      0.07667
           3    0.06518    0.80219       0.86334      0.07299
           4    0.05322    0.57928       0.91379      0.05045
           5    0.04154    0.38404       0.94204      0.02825
           6    0.03181    0.24391       0.95640      0.01436
           7    0.02461    0.15397       0.96349      0.00709
           8    0.01982    0.10205       0.96704      0.00355
           9    0.01662    0.07393       0.96894      0.00189
          10    0.01439    0.06232       0.97005      0.00112
          11    0.01288    0.05436       0.97081      0.00075
          12    0.01189    0.04911       0.97139      0.00058
          13    0.01119    0.04531       0.97188      0.00049
          14    0.01068    0.04276       0.97232      0.00044
          15    0.01027    0.04115       0.97273      0.00041
          16    0.00993    0.04039       0.97313      0.00040
          17    0.00965    0.04249       0.97351      0.00038
          18    0.00940    0.04400       0.97388      0.00037
          19    0.00919    0.04509       0.97423      0.00036
          20    0.00900    0.04587       0.97458      0.00034
          21    0.00883    0.04643       0.97491      0.00033
          22    0.00867    0.04681       0.97523      0.00032
          23    0.00852    0.04705       0.97555      0.00031
          24    0.00839    0.04719       0.97585      0.00031
          25    0.00827    0.04724       0.97615      0.00030
          26    0.00816    0.04722       0.97644      0.00029
          27    0.00805    0.04713       0.97672      0.00028
          28    0.00795    0.04699       0.97700      0.00027
          29    0.00785    0.04680       0.97726      0.00027
          30    0.00776    0.04656       0.97752      0.00026
          31    0.00768    0.04629       0.97777      0.00025
          32    0.00760    0.04598       0.97802      0.00025
          33    0.00752    0.04564       0.97826      0.00024
          34    0.00745    0.04528       0.97849      0.00023
          35    0.00739    0.04489       0.97872      0.00023
          36    0.00733    0.04448       0.97894      0.00022
          37    0.00729    0.04405       0.97915      0.00022
          38    0.00724    0.04361       0.97936      0.00021
          39    0.00720    0.04315       0.97957      0.00021
          40    0.00716    0.04268       0.97977      0.00020
          41    0.00713    0.04219       0.97997      0.00020
          42    0.00709    0.04170       0.98016      0.00019
          43    0.00706    0.04120       0.98035      0.00019
          44    0.00703    0.04070       0.98054      0.00019
          45    0.00699    0.04019       0.98072      0.00018    Converged

                           Algorithm converged.
Figure 59.2. PROC PRINQUAL MTV Iteration History
Figure 59.3. Variable Transformation Plots
Figure 59.4. Plots of the Nearly One-Dimensional Curve
Syntax

The following statements are available in PROC PRINQUAL.
   PROC PRINQUAL < options > ;
      TRANSFORM transform(variables < / t-options >)
                < ... transform(variables < / t-options >) > ;
      BY variables ;
      FREQ variable ;
      ID variables ;
      WEIGHT variable ;

To use PROC PRINQUAL, you need the PROC PRINQUAL and TRANSFORM statements. You can abbreviate all options and t-options to their first three letters. This is a special feature of PROC PRINQUAL and is not generally true of other SAS/STAT procedures.

The rest of this section provides detailed syntax information for each of the preceding statements, beginning with the PROC PRINQUAL statement. The remaining statements are described in alphabetical order.
PROC PRINQUAL Statement

PROC PRINQUAL < options > ;

The PROC PRINQUAL statement starts the PRINQUAL procedure. Optionally, this statement identifies an input data set, creates an output data set, specifies the algorithm and other computational details, and controls displayed output. The following table summarizes the options available in the PROC PRINQUAL statement.
Task                                                       Option

Identify input data set
   specifies input SAS data set                            DATA=

Specify details for output data set
   outputs approximations to transformed variables         APPROXIMATIONS
   specifies prefix for approximation variables            APREFIX=
   outputs correlations and component structure matrix     CORRELATIONS
   specifies a multidimensional preference analysis        MDPREF
   specifies output data set                               OUT=
   specifies prefix for principal component score          PREFIX=
      variables
   replaces raw data with transformed data                 REPLACE
   outputs principal component scores                      SCORES
   standardizes principal component scores                 STANDARD
   specifies transformation standardization                TSTANDARD=
   specifies prefix for transformed variables              TPREFIX=

Control iterative algorithm
   analyzes covariances                                    COVARIANCE
   initializes using dummy variables                       DUMMY
   specifies iterative algorithm                           METHOD=
   specifies number of principal components                N=
   suppresses numerical error checking                     NOCHECK
   specifies number of MGV models before refreshing        REFRESH=
   restarts iterations                                     REITERATE
   specifies singularity criterion                         SINGULAR=
   specifies input observation type                        TYPE=

Control the number of iterations
   specifies minimum criterion change                      CCONVERGE=
   specifies number of first iteration to be displayed     CHANGE=
   specifies minimum data change                           CONVERGE=
   specifies number of MAC initialization iterations       INITITER=
   specifies maximum number of iterations                  MAXITER=

Specify details for handling missing values
   includes monotone special missing values                MONOTONE=
   excludes observations with missing values               NOMISS
   unties special missing values                           UNTIE=

Suppress displayed output
   suppresses displayed output                             NOPRINT
The following list describes these options in alphabetical order.

APREFIX=name APR=name
specifies a prefix for naming the approximation variables. By default, APREFIX=A. Specifying the APREFIX= option also implies the APPROXIMATIONS option.
APPROXIMATIONS APPROX APP
includes principal component approximations to the transformed variables (Eckart and Young 1936) in the output data set. Variable names are constructed from the value of the APREFIX= option and the input variable names. If you specify the APREFIX= option, then approximations are automatically included. If you specify the APPROXIMATIONS option and not the APREFIX= option, then the APPROXIMATIONS option uses the default, APREFIX=A, to construct the variable names.

CCONVERGE=n CCO=n
specifies the minimum change in the criterion being optimized that is required to continue iterating. By default, CCONVERGE=0.0. The CCONVERGE= option is ignored for METHOD=MAC. For the MGV method, specify CCONVERGE=-2 to ensure data convergence.

CHANGE=n CHA=n
specifies the number of the first iteration to be displayed in the iteration history table. The default is CHANGE=1. When you specify a larger value for n, the first n − 1 iterations are not displayed, thus speeding up the analysis. The CHANGE= option is most useful with the MGV method, which is much slower than the other methods.

CONVERGE=n CON=n
specifies the minimum average absolute change in standardized variable scores that is required to continue iterating. By default, CONVERGE=0.00001. Average change is computed over only those variables that can be transformed by the iterations, that is, all LINEAR, OPSCORE, MONOTONE, UNTIE, SPLINE, MSPLINE, and SSPLINE variables and nonoptimal transformation variables with missing values. For more information, see the section “Optimal Transformations” on page 3662.

COVARIANCE COV
computes the principal components from the covariance matrix. The variables are always centered to mean zero. If you do not specify the COVARIANCE option, the variables are also standardized to variance one, which means the analysis is based on the correlation matrix.

CORRELATIONS COR
includes correlations and the component structure matrix in the output data set. By default, this information is not included.

DATA=SAS-data-set
specifies the SAS data set to be analyzed. The data set must be an ordinary SAS data set; it cannot be a TYPE=CORR or TYPE=COV data set. If you omit the DATA= option, the PRINQUAL procedure uses the most recently created SAS data set.
DUMMY DUM
expands variables specified for OPSCORE optimal transformations to dummy variables for the initialization (Tenenhaus and Vachette 1977). By default, the initial values of OPSCORE variables are the actual data values. The dummy variable nominal initialization requires considerable time and memory, so it might not be possible to use the DUMMY option with large data sets. No separate report of the initialization is produced. Initialization results are incorporated into the first iteration displayed in the iteration history table. For details, see the section “Optimal Transformations” on page 3662.

INITITER=n INI=n
specifies the number of MAC iterations required to initialize the data before starting MTV or MGV iterations. By default, INITITER=0. The INITITER= option is ignored if METHOD=MAC.

MAXITER=n MAX=n
specifies the maximum number of iterations. By default, MAXITER=30.

MDPREF MDP
specifies a multidimensional preference analysis by implying the STANDARD, SCORES, and CORRELATIONS options. This option also suppresses warnings when there are more variables than observations.

METHOD=MAC | MGV | MTV MET=MAC | MGV | MTV
specifies the optimization method. By default, METHOD=MTV. Values of the METHOD= option are MTV for maximum total variance, MGV for minimum generalized variance, or MAC for maximum average correlation. You can use the MAC method when all variables are positively correlated or when no monotonicity constraints are placed on any transformations. See the section “The Three Methods of Variable Transformation” on page 3643.

MONOTONE=two-letters MON=two-letters
specifies the first and last special missing value in the list of those special missing values to be estimated using within-variable order and category constraints. By default, there are no order constraints on missing value estimates. The two-letters value must consist of two letters in alphabetical order. For example, MONOTONE=DF means that the estimate of .D must be less than or equal to the estimate of .E, which must be less than or equal to the estimate of .F; no order constraints are placed on estimates of ._, .A through .C, and .G through .Z. For details, see the “Missing Values” section on page 3667, and “Optimal Scaling” in Chapter 75, “The TRANSREG Procedure.”
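For example, assuming a hypothetical data set Survey with rating variables Q1-Q5, the following sketch constrains the estimates of the special missing values .D, .E, and .F to be ordered:

   proc prinqual data=Survey monotone=DF out=TSurvey;
      transform monotone(Q1-Q5);
   run;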
N=n
specifies the number of principal components to be computed. By default, N=2.

NOCHECK NOC
turns off computationally intensive numerical error checking for the MGV method. If you do not specify the NOCHECK option, the procedure computes R² from the squared length of the predicted values vector and compares this value to the R² computed from the error sum of squares that is a by-product of the sweep algorithm (Goodnight 1978). If the two values of R² differ by more than the square root of the value of the SINGULAR= option, a warning is displayed, the value of the REFRESH= option is halved, and the model is refit after refreshing. Specifying the NOCHECK option slightly speeds up the algorithm. Note that other less computationally intensive error checking is always performed.

NOMISS NOM
excludes all observations with missing values from the analysis, but does not exclude them from the OUT= data set. If you omit the NOMISS option, PROC PRINQUAL simultaneously computes the optimal transformations of the nonmissing values and estimates the missing values that minimize squared error.

Casewise deletion of observations with missing values occurs when you specify the NOMISS option, when there are missing values in IDENTITY variables, when there are weights less than or equal to 0, or when there are frequencies less than 1. Excluded observations are output with a blank value for the _TYPE_ variable, and they have a weight of 0. They do not contribute to the analysis but are scored and transformed as supplementary or passive observations. See the “Passive Observations” section on page 3674 and the “Missing Values” section on page 3667 for more information on excluded observations and missing data.

NOPRINT NOP
suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, “Using the Output Delivery System.”

OUT=SAS-data-set
specifies an output SAS data set that contains results of the analysis. If you omit the OUT= option, PROC PRINQUAL still creates an output data set and names it using the DATAn convention. If you want to create a permanent SAS data set, you must specify a two-level name. (Refer to the discussion in SAS Language Reference: Concepts.) You can use the REPLACE, APPROXIMATIONS, SCORES, and CORRELATIONS options to control what information is included in the output data set. For details, see the “Output Data Set” section on page 3669.

PREFIX=name PRE=name
specifies a prefix for naming the principal components. By default, PREFIX=Prin. As a result, the principal component default names are Prin1, Prin2, ..., Prinn.
REFRESH=n REF=n
specifies the number of variables to scale in the MGV method before computing a new inverse. By default, REFRESH=5. PROC PRINQUAL uses the REFRESH= option in the sweep algorithm of the MGV method. Large values for the REFRESH= option make the method run faster but with increased error. Small values make the method run more slowly and with more numerical accuracy.

REITERATE REI
enables the PRINQUAL procedure to use previous transformations as starting points. The REITERATE option affects only variables that are iteratively transformed (specified as LINEAR, SPLINE, MSPLINE, SSPLINE, UNTIE, OPSCORE, and MONOTONE). For iterative transformations, the REITERATE option requests a search in the input data set for a variable that consists of the value of the TPREFIX= option followed by the original variable name. If such a variable is found, it is used to provide the initial values for the first iteration. The final transformation is a member of the transformation family defined by the original variable, not the transformation family defined by the initialization variable. See the “REITERATE Option Usage” section on page 3673.

REPLACE REP
replaces the original data with the transformed data in the output data set. The names of the transformed variables in the output data set correspond to the names of the original variables in the input data set. If you do not specify the REPLACE option, both original variables and transformed variables (with names constructed from the TPREFIX= option and the original variable names) are included in the output data set.

SCORES SCO
includes principal component scores in the output data set. By default, scores are not included.

SINGULAR=n SIN=n
specifies the largest value within rounding error of zero. By default, SINGULAR=1E−8. The PRINQUAL procedure uses the value of the SINGULAR= option for checking (1 − R²) when constructing full-rank matrices of predictor variables, checking denominators before dividing, and so on.

STANDARD STD
standardizes the principal component scores in the output data set to mean zero and variance one instead of the default mean zero and variance equal to the corresponding eigenvalue. See the SCORES option.
TPREFIX=name TPR=name
specifies a prefix for naming the transformed variables. By default, TPREFIX=T. The TPREFIX= option is ignored if you specify the REPLACE option.

TSTANDARD=CENTER | NOMISS | ORIGINAL | Z TST=CEN | NOM | ORI | Z
specifies the standardization of the transformed variables in the OUT= data set. By default, TSTANDARD=ORIGINAL. When the TSTANDARD= option is specified in the PROC statement, it specifies the default standardization for all variables. When you specify TSTANDARD= as a t-option, it overrides the default standardization just for selected variables.

CENTER
centers the output variables to mean zero, but the variances are the same as the variances of the input variables.
NOMISS
sets the means and variances of the transformed variables in the OUT= data set, computed over all output values that correspond to nonmissing values in the input data set, to the means and variances computed from the nonmissing observations of the original variables. The TSTANDARD=NOMISS specification is useful with missing data. When a variable is linearly transformed, the final variable contains the original nonmissing values and the missing value estimates. In other words, the nonmissing values are unchanged. If your data have no missing values, TSTANDARD=NOMISS and TSTANDARD=ORIGINAL produce the same results.
ORIGINAL
sets the means and variances of the transformed variables to the means and variances of the original variables. This is the default.
Z
standardizes the variables to mean zero, variance one.
For nonoptimal variable transformations, the means and variances of the original variables are actually the means and variances of the nonlinearly transformed variables, unless you specify the ORIGINAL nonoptimal t-option in the TRANSFORM statement. For example, if a variable X with no missing values is specified as LOG, then, by default, the final transformation of X is simply LOG(X), not LOG(X) standardized to the mean of X and variance of X.

TYPE='text'|name TYP='text'|name
specifies the valid value for the _TYPE_ variable in the input data set. If PROC PRINQUAL finds an input _TYPE_ variable, it uses only observations with a _TYPE_ value that matches the TYPE= value. This enables a PROC PRINQUAL OUT= data set containing correlations to be used as input to PROC PRINQUAL without requiring a WHERE statement to exclude the correlations. If a _TYPE_ variable is not in the data set, all observations are used. The default is TYPE='SCORE', so if you do not specify the TYPE= option, only observations with _TYPE_='SCORE' are used.
PROC PRINQUAL displays a note when it reads observations with blank values of _TYPE_, but it does not automatically exclude those observations. Data sets created by the TRANSREG and PRINQUAL procedures have blank _TYPE_ values for those observations that were excluded from the analysis due to nonpositive weights, nonpositive frequencies, or missing data. When these observations are read again, they are excluded for the same reason that they were excluded from their original analysis, not because their _TYPE_ value is blank.

UNTIE=two-letters UNT=two-letters
specifies the first and last special missing value in the list of those special missing values that are to be estimated with within-variable order constraints but no category constraints. The two-letters value must consist of two letters in alphabetical order. By default, there are category constraints but no order constraints on special missing value estimates. For details, see the “Missing Values” section on page 3667. Also, see “Optimal Scaling” in Chapter 75, “The TRANSREG Procedure.”
BY Statement

BY variables ;

You can specify a BY statement with PROC PRINQUAL to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement options NOTSORTED or DESCENDING in the BY statement for the PRINQUAL procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
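For example, assuming a hypothetical data set Raw with a grouping variable Region and variables X1-X4, you might sort the data and then request a separate analysis for each group:

   proc sort data=Raw;
      by Region;
   run;

   proc prinqual data=Raw out=TRaw;
      transform spline(X1-X4);
      by Region;
   run;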
FREQ Statement

FREQ variable ;

If one variable in the input data set represents the frequency of occurrence for other values in the observation, list the variable’s name in a FREQ statement. PROC PRINQUAL then treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation.
Noninteger values of the FREQ variable are truncated to the largest integer less than the FREQ value. The observation is used in the analysis only if the value of the FREQ statement variable is greater than or equal to 1.
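For example, assuming a hypothetical data set of profiles in which the variable Count gives the number of times each observation occurs, you might specify:

   proc prinqual data=Profiles out=TProfiles;
      transform spline(X1-X3);
      freq Count;
   run;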
ID Statement

ID variables ;

The ID statement includes additional character or numeric variables in the output data set. The variables must be contained in the input data set.
TRANSFORM Statement

TRANSFORM transform(variables < / t-options >)
          < ... transform(variables < / t-options >) > ;

The TRANSFORM statement lists the variables to be analyzed (variables) and specifies the transformation (transform) to apply to each variable listed. You must specify a transformation for each variable list in the TRANSFORM statement. The variables are variables in the data set. The t-options are transformation options that provide details for the transformation; these depend on the transform chosen. The t-options are listed after a slash in the parentheses that enclose the variables. For example, the following statements find a quadratic polynomial transformation of all variables in the data set:

   proc prinqual;
      transform spline(_all_ / degree=2);
   run;
Or, if N1 through N10 are nominal variables and M1 through M10 are ordinal variables, you can use the following statements.

   proc prinqual;
      transform opscore(N1-N10) monotone(M1-M10);
   run;
The following sections describe the transformations available (specified with transform) and the options available for some of the transformations (specified with t-options).
Families of Transformations

There are three types of transformation families: nonoptimal, optimal, and other. Each family is summarized as follows.

Nonoptimal transformations
preprocess the specified variables, replacing each one with a single new nonoptimal, nonlinear transformation.
Optimal transformations
replace the specified variables with new, iteratively derived optimal transformation variables that fit the specified model better than the original variable (except for contrived cases where the transformation fits the model exactly as well as the original variable).
Other transformations
are the IDENTITY and SSPLINE transformations. These do not fit into either of the preceding categories.
The following table summarizes the transformations in each family.
Family                                        Members of Family

Nonoptimal transformations
   inverse trigonometric sine                 ARSIN
   exponential                                EXP
   logarithm                                  LOG
   logit                                      LOGIT
   raises variables to specified power        POWER
   transforms to ranks                        RANK

Optimal transformations
   linear                                     LINEAR
   monotonic, ties preserved                  MONOTONE
   monotonic B-spline                         MSPLINE
   optimal scoring                            OPSCORE
   B-spline                                   SPLINE
   monotonic, ties not preserved              UNTIE

Other transformations
   identity, no transformation                IDENTITY
   iterative smoothing spline                 SSPLINE
The transform is followed by a variable (or list of variables) enclosed in parentheses. Optionally, depending on the transform, the parentheses can also contain t-options, which follow the variables and a slash. For example,

   transform log(X Y);
computes the LOG transformation of X and Y. A more complex example is

   transform spline(Y / nknots=2) log(X1 X2 X3);
The preceding statement uses the SPLINE transformation of the variable Y and the LOG transformation of the variables X1, X2, and X3. In addition, it uses the NKNOTS= option with the SPLINE transformation and specifies two knots. The rest of this section provides syntax details for members of the three families of transformations. The t-options are discussed in the section “Transformation Options (t-options)” on page 3663.
Nonoptimal Transformations
Nonoptimal transformations are computed before the iterative algorithm begins. Nonoptimal transformations create a single new transformed variable that replaces the original variable. The new variable is not transformed by the subsequent iterative algorithms (except for a possible linear transformation and missing value estimation). The following list provides syntax and details for nonoptimal variable transformations.

ARSIN ARS
finds an inverse trigonometric sine transformation. Variables following ARSIN must be numeric, in the interval −1.0 ≤ X ≤ 1.0, and they are typically continuous.

EXP
exponentiates variables (the variable X is transformed to a^X). To specify the value of a, use the PARAMETER= t-option. By default, a is the mathematical constant e = 2.718.... Variables following EXP must be numeric, and they are typically continuous.

LOG
transforms variables to logarithms (the variable X is transformed to log_a(X)). To specify the base of the logarithm, use the PARAMETER= t-option. The default is a natural logarithm with base e = 2.718.... Variables following LOG must be numeric and positive, and they are typically continuous.

LOGIT
finds a logit transformation on the variables. The logit of X is log(X/(1 − X)). Unlike other transformations, LOGIT does not have a three-letter abbreviation. Variables following LOGIT must be numeric, in the interval 0.0 < X < 1.0, and they are typically continuous.

POWER POW
raises variables to a specified power (the variable X is transformed to X^a). You must specify the power parameter a by specifying the PARAMETER= t-option following the variables:

   power(variable / parameter=number)
You can use POWER for squaring variables (PARAMETER=2), reciprocal transformations (PARAMETER=−1), square roots (PARAMETER=0.5), and so on. Variables following POWER must be numeric, and they are typically continuous.

RANK RAN
transforms variables to ranks. Ranks are averaged within ties. The smallest input value is assigned the smallest rank. Variables following RANK must be numeric.
Optimal Transformations
Optimal transformations are iteratively derived. Missing values for these types of variables can be optimally estimated (see the “Missing Values” section on page 3667). The following list provides syntax and details for optimal transformations.

LINEAR LIN
finds an optimal linear transformation of each variable. For variables with no missing values, the transformed variable is the same as the original variable. For variables with missing values, the transformed nonmissing values have a different scale and origin than the original values. Variables following LINEAR must be numeric.

MONOTONE MON
finds a monotonic transformation of each variable, with the restriction that ties are preserved. The Kruskal (1964) secondary least-squares monotonic transformation is used. This transformation weakly preserves order and category membership (ties). Variables following MONOTONE must be numeric, and they are typically discrete.

MSPLINE MSP
finds a monotonically increasing B-spline transformation with monotonic coefficients (de Boor 1978; de Leeuw 1986) of each variable. You can specify the DEGREE=, KNOTS=, NKNOTS=, and EVENLY t-options with MSPLINE. By default, PROC PRINQUAL uses a quadratic spline. Variables following MSPLINE must be numeric, and they are typically continuous.

OPSCORE OPS
finds an optimal scoring of each variable. The OPSCORE transformation assigns scores to each class (level) of the variable. Fisher’s (1938) optimal scoring method is used. Variables following OPSCORE can be either character or numeric; numeric variables should be discrete.

SPLINE SPL
finds a B-spline transformation (de Boor 1978) of each variable. By default, PROC PRINQUAL uses a cubic polynomial transformation. You can specify the DEGREE=, KNOTS=, NKNOTS=, and EVENLY t-options with SPLINE. Variables following SPLINE must be numeric, and they are typically continuous.

UNTIE UNT
finds a monotonic transformation of each variable without the restriction that ties are preserved. The PRINQUAL procedure uses the Kruskal (1964) primary least-squares monotonic transformation method. This transformation weakly preserves order but not category membership (it may untie some previously tied values). Variables following UNTIE must be numeric, and they are typically discrete.
Other Transformations

IDENTITY IDE
specifies variables that are not changed by the iterations. The IDENTITY transformation is used for variables when no transformation and no missing data estimation are desired. However, the REFLECT, ADDITIVE, TSTANDARD=Z, and TSTANDARD=CENTER options can linearly transform all variables, including IDENTITY variables, after the iterations. Observations with missing values in IDENTITY variables are excluded from the analysis, and no optimal scores are computed for missing values in IDENTITY variables. Variables following IDENTITY must be numeric.

SSPLINE SSP
finds an iterative smoothing spline transformation of each variable. The SSPLINE transformation does not generally minimize squared error. You can specify the smoothing parameter with either the SM= t-option or the PARAMETER= t-option. The default smoothing parameter is SM=0. Variables following SSPLINE must be numeric, and they are typically continuous.
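The families can be mixed freely in a single TRANSFORM statement. For example, the following sketch (with hypothetical variable names) combines a nonoptimal transformation, two optimal transformations, and an identity transformation in one analysis:

   proc prinqual data=Demo n=2 out=TDemo;
      transform log(Income)            /* nonoptimal   */
                monotone(Rating)       /* optimal      */
                spline(Age / nknots=5) /* optimal      */
                identity(Baseline);    /* no transform */
   run;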
Transformation Options (t-options)

If you use a nonoptimal, optimal, or other transformation, you can use t-options, which specify additional details of the transformation. The t-options are specified within the parentheses that enclose variables and are listed after a slash. For example,

   proc prinqual;
      transform spline(X Y / nknots=3);
   run;
The preceding statements find an optimal variable transformation (SPLINE) of the variables X and Y and use a t-option to specify the number of knots (NKNOTS=). The following is a more complex example.

   proc prinqual;
      transform spline(Y / nknots=3) spline(X1 X2 / nknots=6);
   run;
These statements use the SPLINE transformation for all three variables and use t-options as well; the NKNOTS= option specifies the number of knots for the spline. The following sections discuss the t-options available for nonoptimal, optimal, and other transformations. The following table summarizes the t-options.
Table 59.1. t-options Available in the TRANSFORM Statement
Task                                                 Option

Nonoptimal transformation t-options
   uses original mean and variance                   ORIGINAL

Parameter t-options
   specifies miscellaneous parameters                PARAMETER=
   specifies smoothing parameter                     SM=

Spline t-options
   specifies the degree of the spline                DEGREE=
   spaces the knots evenly                           EVENLY
   specifies the interior knots or break points      KNOTS=
   creates n knots                                   NKNOTS=

Other t-options
   renames variables                                 NAME=
   reflects the variable around the mean             REFLECT
   specifies transformation standardization          TSTANDARD=
Nonoptimal Transformation t-options

ORIGINAL ORI
matches the variable’s final mean and variance to the mean and variance of the original variable. By default, the mean and variance are based on the transformed values. The ORIGINAL t-option is available for all of the nonoptimal transformations.

Parameter t-options

PARAMETER=number PAR=number
specifies the transformation parameter. The PARAMETER= t-option is available for the EXP, LOG, POWER, SMOOTH, and SSPLINE transformations. For EXP, the parameter is the value to be exponentiated; for LOG, the parameter is the base value; and for POWER, the parameter is the power. For SMOOTH and SSPLINE, the parameter is the raw smoothing parameter. (You can specify a SAS/GRAPH-style smoothing parameter with the SM= t-option.) The default for the PARAMETER= t-option for the LOG and EXP transformations is e = 2.718.... The default parameter for SSPLINE is computed from SM=0. For the POWER transformation, you must specify the PARAMETER= t-option; there is no default.

SM=n
specifies a SAS/GRAPH-style smoothing parameter in the range 0 to 100. You can specify the SM= t-option only with the SSPLINE transformation. The smoothness of the function increases as the value of the smoothing parameter increases. By default, SM=0.
Spline t-options
The following t-options are available with the SPLINE and MSPLINE optimal transformations.

DEGREE=n DEG=n
specifies the degree of the B-spline transformation. The degree must be a nonnegative integer. The defaults are DEGREE=3 for SPLINE variables and DEGREE=2 for MSPLINE variables. The polynomial degree should be a small integer, usually 0, 1, 2, or 3. Larger values are rarely useful. If you have any doubt as to what degree to specify, use the default.

EVENLY EVE
is used with the NKNOTS= t-option to space the knots evenly. The differences between adjacent knots are constant. If you specify NKNOTS=k, k knots are created at minimum + i((maximum − minimum)/(k + 1)) for i = 1, ..., k. For example, if you specify

   spline(X / nknots=2 evenly)
and the variable X has a minimum of 4 and a maximum of 10, then the two interior knots are 6 and 8. Without the EVENLY t-option, the NKNOTS= t-option places knots at percentiles, so the knots are not evenly spaced.

KNOTS=number-list | n TO m BY p KNO=number-list | n TO m BY p
specifies the interior knots or break points. By default, there are no knots. The first time you specify a value in the knot list, it indicates a discontinuity in the nth (from DEGREE=n) derivative of the transformation function at the value of the knot. The second mention of a value indicates a discontinuity in the (n − 1)th derivative of the transformation function at the value of the knot. Knots can be repeated any number of times for decreasing smoothness at the break points, but the values in the knot list can never decrease.

You cannot use the KNOTS= t-option with the NKNOTS= t-option. You should keep the number of knots small (see the section “Specifying the Number of Knots” on page 4613 in Chapter 75, “The TRANSREG Procedure”).

NKNOTS=n NKN=n
creates n knots, the first at the 100/(n + 1) percentile, the second at the 200/(n + 1) percentile, and so on. Knots are always placed at data values; there is no interpolation. For example, if NKNOTS=3, knots are placed at the twenty-fifth percentile, the median, and the seventy-fifth percentile. By default, NKNOTS=0. The NKNOTS= t-option must be ≥ 0.
You cannot use the NKNOTS= t-option with the KNOTS= t-option. You should keep the number of knots small (see the section “Specifying the Number of Knots” on page 4613 in Chapter 75, “The TRANSREG Procedure”).

Other t-options
The following t-options are available for all transformations.

NAME=(variable-list) NAM=(variable-list)
renames variables as they are used in the TRANSFORM statement. This option allows a variable to be used more than once. For example, if the variable X is a character variable, then the following step stores both the original character variable X and a numeric variable XC that contains category numbers in the output data set.

   proc prinqual data=A n=1 out=B;
      transform linear(Y) opscore(X / name=(XC));
      id X;
   run;

REFLECT REF
reflects the transformation y = −(y − ȳ) + ȳ after the iterations are completed and before the final standardization and results calculations.

TSTANDARD=CENTER | NOMISS | ORIGINAL | Z TST=CEN | NOM | ORI | Z
specifies the standardization of the transformed variables in the OUT= data set. By default, TSTANDARD=ORIGINAL. When the TSTANDARD= option is specified in the PROC PRINQUAL statement, it specifies the default standardization for all variables. When you specify TSTANDARD= as a t-option, it overrides the default standardization just for selected variables.
WEIGHT Statement

WEIGHT variable ;

When you use a WEIGHT statement, a weighted residual sum of squares is minimized. The WEIGHT statement has no effect on degrees of freedom or number of observations, but the weights affect most other calculations. The observation is used in the analysis only if the value of the WEIGHT statement variable is greater than 0.
Details

Missing Values

PROC PRINQUAL can estimate missing values, subject to optional constraints, so that the covariance matrix is optimized. The procedure provides several approaches for handling missing data. When you specify the NOMISS option in the PROC PRINQUAL statement, observations with missing values are excluded from the analysis. Otherwise, missing data are estimated, using variable means as initial estimates. Missing values for OPSCORE character variables are treated the same as any other category during the initialization. See the section “Missing Values” on page 4599 in Chapter 75, “The TRANSREG Procedure,” for more information on missing data estimation.
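As a quick sketch of the two approaches (the data set name and variables are hypothetical), the first step below estimates the missing values during the iterations, while the second excludes incomplete observations from the analysis:

   * Default: missing values are estimated;
   proc prinqual data=A out=B;
      transform spline(X1-X3);
   run;

   * NOMISS: incomplete observations become passive;
   proc prinqual data=A nomiss out=C;
      transform spline(X1-X3);
   run;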
Controlling the Number of Iterations

Several options in the PROC PRINQUAL statement control the number of iterations performed. Iteration terminates when any one of the following conditions is satisfied:

• The number of iterations equals the value of the MAXITER= option.
• The average absolute change in variable scores from one iteration to the next is less than the value of the CONVERGE= option.
• The criterion change is less than the value of the CCONVERGE= option.

With the MTV method, the change in the proportion of variance criterion can become negative when the data have converged so that it is numerically impossible, within machine precision, to increase the criterion. Because the MTV algorithm is convergent, a negative criterion change is the result of very small amounts of rounding error. The MGV method displays the average squared multiple correlation (which is not the criterion being optimized), so the criterion change can become negative well before convergence. The MAC method criterion (average correlation) is never computed, so the CCONVERGE= option is ignored for METHOD=MAC. You can specify a negative value for either convergence option if you want to define convergence only in terms of the other convergence option.

With the MGV method, iterations minimize the generalized variance (determinant), but the generalized variance is not reported for two reasons. First, in most data sets, the generalized variance is almost always near zero (or will be after one or two iterations), which is its minimum. This does not mean that iteration is complete; it simply means that at least one multiple correlation is at or near one. The algorithm continues minimizing the determinant in (m − 1), (m − 2) dimensions, and so on. Because the generalized variance is almost always near zero, it does not provide a good indication of how the iterations are progressing. The mean R² provides a better indication of convergence. The second reason for not reporting the generalized variance is that almost no additional time is required to compute R² values for each step. This is because the error sum of squares is a by-product of the algorithm at each step.
Computing the determinant at the end of each iteration adds more computations to an already computationally intensive algorithm. You can increase the number of iterations to ensure convergence by increasing the value of the MAXITER= option and decreasing the value of the CONVERGE= option. Because the average absolute change in standardized variable scores seldom decreases below 1E−11, you typically do not specify a value for the CONVERGE= option less than 1E−8 or 1E−10. Most of the data changes occur during the first few iterations, but the data can still change after 50 or even 100 iterations. You can try different combinations of values for the CONVERGE= and MAXITER= options to ensure convergence without extreme overiteration. If the data do not converge with the default specifications, specify the REITERATE option, or try CONVERGE=1E−8 and MAXITER=50, or CONVERGE=1E−10 and MAXITER=200.
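For example, a step such as the following sketch (hypothetical data set and variables) requests tighter convergence and more iterations than the defaults:

   proc prinqual data=A out=B converge=1E-8 maxiter=100;
      transform mspline(X1-X10 / nknots=5);
   run;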
Performing a Principal Component Analysis of Transformed Data

PROC PRINQUAL produces an iteration history table that displays (for each iteration) the iteration number, the maximum and average absolute change in standardized variable scores computed over the iteratively transformed variables, the criterion being optimized, and the criterion change. In order to examine the results of the analysis in more detail, you can analyze the information in the output data set using other SAS procedures.

Specifically, use the PRINCOMP procedure to perform a components analysis on the transformed data. PROC PRINCOMP accepts the raw data from PROC PRINQUAL but issues a warning because the PROC PRINQUAL output data set has _NAME_ and _TYPE_ variables, but it is not a TYPE=CORR data set. You can ignore this warning.

If the output data set contains both scores and correlations, you must subset it for analysis with PROC PRINCOMP. Otherwise, the correlation observations are treated as ordinary observations and the PROC PRINCOMP results are incorrect. For example, consider the following statements:

   proc prinqual data=a out=b correlations replace;
      transform spline(var1-var50 / nknots=3);
   run;

   proc princomp data=b;
      where _TYPE_='SCORE';
   run;
Also note that the proportion of variance accounted for, as reported by PROC PRINCOMP, can exceed the proportion of variance accounted for in the last PROC PRINQUAL iteration. This is because PROC PRINQUAL reports the variance accounted for by the components analysis that generated the current scaling of the data, not a components analysis of the current scaling of the data.
Using the MAC Method

You can use the MAC algorithm alone by specifying METHOD=MAC, or you can use it as an initialization algorithm for METHOD=MTV and METHOD=MGV analyses by specifying the iteration option INITITER=. If any variables are negatively correlated, do not use the MAC algorithm with monotonic transformations (MONOTONE, UNTIE, and MSPLINE) because the signs of the correlations among the variables are not used when computing variable approximations. If an approximation is negatively correlated with the original variable, monotone constraints would make the optimally scaled variable a constant, which is not allowed (see the section “Avoiding Constant Transformations” on page 3672).

When used with other transformations, the MAC algorithm can reverse the scoring of the variables. So, for example, if variable X is designated LOG(X) with METHOD=MAC and TSTANDARD=ORIGINAL, the final transformation (for example, TX) may not be LOG(X). If TX is not LOG(X), it has the same mean as LOG(X) and the same variance as LOG(X), and it is perfectly negatively correlated with LOG(X). PROC PRINQUAL displays a note for every variable that is reversed in this manner.

You can use the METHOD=MAC algorithm to reverse the scorings of some rating variables before a factor analysis. The correlations among bipolar ratings such as 'like - dislike', 'hot - cold', and 'fragile - monumental' are typically both positive and negative. If some items are reversed to say 'dislike - like', 'cold - hot', and 'monumental - fragile', some of the negative signs can be eliminated, and the factor pattern matrix would be cleaner. You can use PROC PRINQUAL with METHOD=MAC and LINEAR transformations to reverse some items, maximizing the average of the intercorrelations.
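A minimal sketch of this use (the data set Ratings and the variables Item1-Item20 are hypothetical) is

   proc prinqual data=Ratings method=mac n=1 out=Reversed replace;
      transform linear(Item1-Item20);
   run;

The Reversed data set then contains the items, some with reversed scorings, ready for input to a factor analysis.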
Output Data Set

The PRINQUAL procedure produces an output data set by default. By specifying the OUT=, APPROXIMATIONS, SCORES, REPLACE, and CORRELATIONS options in the PROC PRINQUAL statement, you can name this data set and control, to some extent, the contents of it.
Structure and Content

The output data set can have 16 different forms, depending on the specified combinations of the REPLACE, SCORES, APPROXIMATIONS, and CORRELATIONS options. You can specify any combination of these options. To illustrate, assume that the data matrix consists of N observations and m variables, and n components are computed. Then, define the following:

D
the N × m matrix of original data with variable names that correspond to the names of the variables in the input data set. However, when you use the OPSCORE transformation on character variables, those variables are replaced by numeric variables that contain category numbers
T
the N × m matrix of transformed data with variable names constructed from the value of the TPREFIX= option (if you do not specify the REPLACE option) and the names of the variables in the input data set
S
the N × n matrix of component scores with variable names constructed from the value of the PREFIX= option and integers
A
the N × m matrix of data approximations with variable names constructed from the value of the APREFIX= option and the names of the variables in the input data set
R_TD
the m × m matrix of correlations between the transformed variables and the original variables with variable names that correspond to the names of the variables in the input data set. When missing values exist, casewise deletion is used to compute the correlations.
R_TT
the m × m matrix of correlations among the transformed variables with the variable names constructed from the value of the TPREFIX= option (if you do not specify the REPLACE option) and the names of the variables in the input data set
R_TS
the m × n matrix of correlations between the transformed variables and the principal component scores (component structure matrix) with variable names constructed from the value of the PREFIX= option and integers
R_TA
the m × m matrix of correlations between the transformed variables and the variable approximations with variable names constructed from the value of the APREFIX= option and the names of the variables in the input data set
To create a data set WORK.A that contains all information, specify the following options in the PROC PRINQUAL statement

   proc prinqual scores approximations correlations out=a;

and also use a TRANSFORM statement appropriate for your data. Then the WORK.A data set contains

   [ D     T     S     A    ]
   [ R_TD  R_TT  R_TS  R_TA ]
To eliminate the bottom partitions that contain the correlations and component structure, do not specify the CORRELATIONS option. For example, use the following PROC PRINQUAL statement with an appropriate TRANSFORM statement.

   proc prinqual scores approximations out=a;

Then the WORK.A data set contains

   [ D  T  S  A ]
If you use the following PROC PRINQUAL statement (with an appropriate TRANSFORM statement)
   proc prinqual out=a;
this creates a data set WORK.A of the form

   [ D  T ]
To output transformed data and component scores only, specify the following options in the PROC PRINQUAL statement:

   proc prinqual replace scores out=a;

Then the WORK.A data set contains

   [ T  S ]
_TYPE_ and _NAME_ Variables

In addition to the preceding information, the output data set contains two character variables, the variable _TYPE_ (length 8) and the variable _NAME_ (length 32).

The _TYPE_ variable has the value 'SCORE' if the observation contains variables, transformed variables, components, or data approximations; the _TYPE_ variable has the value 'CORR' if the observation contains correlations or component structure.

By default, the _NAME_ variable has values 'ROW1', 'ROW2', and so on, for the observations with _TYPE_='SCORE'. If you use an ID statement, the variable _NAME_ contains the formatted ID variable for SCORES observations. The values of the variable _NAME_ for observations with _TYPE_='CORR' are the names of the transformed variables.

Certain procedures, such as PROC PRINCOMP, which can use the PROC PRINQUAL output data set, issue a warning that the PROC PRINQUAL data set contains _NAME_ and _TYPE_ variables but is not a TYPE=CORR data set. You can ignore this warning.
Variable Names

The TPREFIX=, APREFIX=, and PREFIX= options specify prefixes for the transformed and approximation variable names and for principal component score variables, respectively. PROC PRINQUAL constructs transformed and approximation variable names from a prefix and the first characters of the original variable name. The number of characters in the prefix plus the number of characters in the original variable name (including the final digits, if any) required to uniquely designate the new variables should not exceed 32. For example, if the APREFIX= parameter that you specify is one character, PROC PRINQUAL adds the first 31 characters of the original variable name; if your prefix is four characters, only the first 28 characters of the original variable name are added.
Effect of the TSTANDARD= and COVARIANCE Options

The values in the output data set are affected by the TSTANDARD= and COVARIANCE options. If you specify TSTANDARD=NOMISS, the NOMISS standardization is performed on the transformed data after the iterations have been completed, but before the output data set is created. The new means and variances are used in creating the output data set. Then, if you do not specify the COVARIANCE option, the data are transformed to mean zero and variance one. The principal component scores and data approximations are computed from the resulting matrix. The data are then linearly transformed to have the mean and variance specified by the TSTANDARD= option. The data approximations are transformed so that the means within each pair of a transformed variable and its approximation are the same. The ratio of the variance of a variable approximation to the variance of the corresponding transformed variable equals the proportion of the variance of the variable that is accounted for by the components model.

If you specify the COVARIANCE option and do not specify TSTANDARD=Z, you can input the transformed data to PROC PRINCOMP, again specifying the COVARIANCE option, to perform a components analysis of the results of PROC PRINQUAL. Similarly, if you do not specify the COVARIANCE option with PROC PRINQUAL and you input the transformed data to PROC PRINCOMP without the COVARIANCE option, you receive the same report. However, some combinations of PROC PRINQUAL options, such as COVARIANCE and TSTANDARD=Z, while valid, produce approximations and scores that cannot be reproduced by PROC PRINCOMP.

The component scores in the output data set are computed from the correlations among the transformed variables, or from the covariances if you specified the COVARIANCE option. The component scores are computed after the TSTANDARD=NOMISS transformation, if specified. The means of the component scores in the output data set are always zero. The variances equal the corresponding eigenvalues, unless you specify the STANDARD option; then the variances are set to one.
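To illustrate the round trip described above, the following sketch (hypothetical data set A and variables X1-X5) transforms the variables based on the covariance matrix and then reproduces the component analysis with PROC PRINCOMP, again specifying the COVARIANCE option:

   proc prinqual data=A covariance n=2 out=B replace;
      transform spline(X1-X5);
   run;

   proc princomp data=B covariance;
      var X1-X5;
   run;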
Avoiding Constant Transformations

There are times when the optimal scaling produces a constant transformed variable. This can happen with the MONOTONE, UNTIE, and MSPLINE transformations when the target is negatively correlated with the original input variable. It can happen with all transformations when the target is uncorrelated with the original input variable. When this happens, the procedure modifies the target to avoid a constant transformation. This strategy avoids certain nonoptimal solutions.

If the transformation is monotonic and a constant transformed variable results, the procedure multiplies the target by −1 and tries the optimal scaling again. If the transformation is not monotonic or if the multiplication by −1 did not help, the procedure tries using a random target. If the transformation is still constant, the previous nonconstant transformation is retained. When a constant transformation is avoided by any strategy, this message is displayed: “A constant transformation was avoided for name.”
Constant Variables

Constant and almost constant variables are zeroed and ignored.
Character OPSCORE Variables

Character OPSCORE variables are replaced by a numeric variable containing category numbers before the iterations, and the character values are discarded. Only the first eight characters are considered when determining category membership. If you want the original character variable in the output data set, give it a different name in the OPSCORE specification (OPSCORE(x / name=(x2))) and name the original variable in the ID statement (ID x;).
REITERATE Option Usage

You can use the REITERATE option to perform additional iterations when PROC PRINQUAL stops before the data have adequately converged. For example, suppose that you execute the following code:

   proc prinqual data=A cor out=B;
      transform mspline(X1-X5);
   run;
If the transformations do not converge in the default 30 iterations, you can perform more iterations without repeating the first 30 iterations.

   proc prinqual data=B reiterate cor out=B;
      transform mspline(X1-X5);
   run;
Note that a WHERE statement is not necessary to exclude the correlation observations. They are automatically excluded because their _TYPE_ variable value is not 'SCORE'.

You can also use the REITERATE option to specify starting values other than the original values for the transformations. Providing alternate starting points may avoid local optima. Here are two examples.

   proc prinqual data=A out=B;
      transform rank(X1-X5);
   run;

   proc prinqual data=B reiterate out=C;
      /* Use ranks as the starting point. */
      transform monotone(X1-X5);
   run;

   data B;
      set A;
      array TXS[5] TX1-TX5;
      do j = 1 to 5;
         TXS[j] = normal(0);
      end;
   run;

   proc prinqual data=B reiterate out=C;
      /* Use a random starting point. */
      transform monotone(X1-X5);
   run;
Note that divergence with the REITERATE option, particularly in the second iteration, is not an error since the initial transformation is not required to be a valid member of the transformation family. When you specify the REITERATE option, the iteration does not terminate when the criterion change is negative during the first ten iterations.
Passive Observations

Observations may be excluded from the analysis for several reasons, including zero weight, zero frequency, missing values in variables designated IDENTITY, or missing values with the NOMISS option specified. These observations are passive in that they do not contribute to determining transformations, R², total variance, and so on. However, some information can be computed for them, such as approximations, principal component scores, and transformed values. Passive observations in the output data set have a blank value for the variable _TYPE_.

Missing value estimates for passive observations may converge slowly with METHOD=MTV. In the following example, the missing value estimates should be 2, 5, and 8. Since the nonpassive observations do not change, the procedure converges in one iteration but the missing value estimates do not converge. The extra iterations produced by specifying CONVERGE=−1 and CCONVERGE=−1, as shown in the second PROC PRINQUAL step, generate the expected results.

   data A;
      input X Y;
      datalines;
   1 1
   2 .
   3 3
   4 4
   5 .
   6 6
   7 7
   8 .
   9 9
   ;

   proc prinqual data=A nomiss n=1 out=B method=mtv;
      transform lin(X Y);
   run;
proc print;
run;

proc prinqual nomiss data=A n=1 out=B method=mtv
              converge=-1 cconverge=-1;
   transform lin(X Y);
run;

proc print;
run;
Computational Resources

This section provides information on the computational resources required to run PROC PRINQUAL. Let

   N = number of observations
   m = number of variables
   n = number of principal components
   k = maximum spline degree
   p = maximum number of knots
• For the MTV algorithm, more than 56m + 8Nm + 8(6N + (p + k + 2)(p + k + 11)) bytes of array space are required.
• For the MGV and MAC algorithms, more than 56m bytes plus the maximum of the data matrix size and the optimal scaling work space are required. The data matrix size is 8Nm bytes. The optimal scaling work space requires less than 8(6N + (p + k + 2)(p + k + 11)) bytes.
• For the MTV and MGV algorithms, more than 56m + 4m(m + 1) bytes of array space are required.
• PROC PRINQUAL tries to store the original and transformed data in memory. If there is not enough memory, a utility data set is used, potentially resulting in a large increase in execution time.

The preceding formulas underestimate the amount of memory needed to handle most problems; they give an absolute minimum. If a utility data set is used, and if memory could be used with perfect efficiency, then roughly the stated amounts would suffice. In reality, most problems require at least two or three times the minimum. (A worked sketch of the MTV memory formula follows this list.)
• PROC PRINQUAL sorts the data once. The sort time is roughly proportional to mN^(3/2).
• For the MTV algorithm, the time required to compute the variable approximations is roughly proportional to 2Nm² + 5m³ + nm².
• For the MGV algorithm, one regression analysis per iteration is required to compute model parameter estimates. The time required for accumulating the crossproduct matrix is roughly proportional to Nm². The time required to compute the regression coefficients is roughly proportional to m³. For each variable for each iteration, the swept crossproduct matrix is updated with time roughly proportional to m(N + m). The swept crossproduct matrix is updated for each variable with time roughly proportional to m², until computations are refreshed, requiring all sweeps to be performed again.
• The only computationally intensive part of the MAC algorithm is the optimal scaling, since variable approximations are simple averages.
• Each optimal scaling is a multiple regression problem, although some transformations are handled with faster special-case algorithms. The number of regressors for the optimal scaling problems depends on the original values of the variable and the type of transformation. For each monotone spline transformation, an unknown number of multiple regressions is required to find a set of coefficients that satisfies the constraints. The B-spline basis is generated twice for each SPLINE and MSPLINE transformation for each iteration. The time required to generate the B-spline basis is roughly proportional to Nk².
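As a worked sketch of the MTV memory formula (the problem sizes are hypothetical), the following DATA step evaluates the lower bound for N = 10000 observations, m = 20 variables, p = 100 knots, and k = 3:

data _null_;
   N = 10000; m = 20; p = 100; k = 3;
   /* minimum MTV array space, in bytes */
   mtv_bytes = 56*m + 8*N*m + 8*(6*N + (p + k + 2)*(p + k + 11));
   /* most problems need at least two or three times this minimum */
   put 'Minimum MTV array space: ' mtv_bytes comma12. ' bytes';
run;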
Displayed Output

The main output from the PRINQUAL procedure is the output data set. However, the procedure does produce displayed output in the form of an iteration history table that includes the following:

• the iteration number
• the criterion being optimized
• the criterion change
• the maximum and average absolute change in standardized variable scores, computed over variables that can be iteratively transformed
• notes
• the final convergence status
ODS Table Names

PROC PRINQUAL assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, “Using the Output Delivery System.”
Table 59.2. ODS Tables Produced in PROC PRINQUAL

ODS Table Name      Description                    Statement   Option
ConvergenceStatus   Convergence Status             PROC        default
Footnotes           Iteration History Footnotes    PROC        default
MAC                 MAC Iteration History          PROC        METHOD=MAC
MGV                 MGV Iteration History          PROC        METHOD=MGV
MTV                 MTV Iteration History          PROC        METHOD=MTV
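For example, here is a brief sketch of capturing one of these tables as a data set with the ODS OUTPUT statement (the data set and variable names are hypothetical; METHOD=MTV is the default, so the MTV table is the one produced):

ods output MTV=IterHist;
proc prinqual data=A n=2 out=B;
   transform monotone(X1-X5);
run;

proc print data=IterHist;
run;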
ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the PRINQUAL procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request a graph you must specify the ODS GRAPHICS statement in addition to the following option. For more information on the ODS GRAPHICS statement, see Chapter 15, “Statistical Graphics Using ODS.” The following table shows the available plot option.

Option     Plot Description
MDPREF     Multidimensional preference analysis
ODS Graph Names

PROC PRINQUAL assigns a name to the graph it creates using ODS. You can use this name to reference the graph when using ODS. The name is listed in Table 59.3. To request a graph you must specify the ODS GRAPHICS statement in addition to the option indicated in Table 59.3. For more information on the ODS GRAPHICS statement, see Chapter 15, “Statistical Graphics Using ODS.”

Table 59.3. ODS Graphics Produced by PROC PRINQUAL

ODS Graph Name   Plot Description                       Statement   Option
PrinqualPlot     Multidimensional preference analysis   PROC        MDPREF
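A minimal request, then, pairs the ODS GRAPHICS statement with the MDPREF option (the data set and variables here are hypothetical; compare the full Example 59.2 later in this chapter):

ods graphics on;
proc prinqual data=A n=2 mdpref out=B;
   transform monotone(X1-X5);
run;
ods graphics off;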
Examples

Example 59.1. Multidimensional Preference Analysis of Cars Data

This example uses PROC PRINQUAL to perform a nonmetric multidimensional preference (MDPREF) analysis (Carroll 1972). MDPREF analysis is a principal component analysis of a data matrix with columns that correspond to people and rows that correspond to objects. The data are ratings or rankings of each person's preference for each object. The data are the transpose of the usual multivariate data matrix.
(In other words, the columns are people instead of the more typical matrix where rows represent people.) The final result of an MDPREF analysis is a biplot (Gabriel 1981) of the resulting preference space. A biplot displays the judges and objects in a single plot by projecting them onto the plane in the transformed variable space that accounts for the most variance.

The data are ratings by 25 judges of their preference for each of 17 automobiles. The ratings are made on a 0 to 9 scale, with 0 meaning very weak preference and 9 meaning very strong preference for the automobile. These judgments were made in 1980 about that year's products. There are two additional variables that indicate the manufacturer and model of the automobile.

This example uses PROC PRINQUAL, PROC FACTOR, and the %PLOTIT macro. PROC FACTOR is used before PROC PRINQUAL to perform a principal component analysis of the raw judgments. PROC FACTOR is also used immediately after PROC PRINQUAL, since PROC PRINQUAL is a scoring procedure that optimally scores the data but does not report the principal component analysis. The %PLOTIT macro produces the biplot. For information on the %PLOTIT macro, see Appendix B, “Using the %PLOTIT Macro.”

The scree plot, in the standard principal component analysis reported by PROC FACTOR, shows that two principal components should be retained for further use. (See the scree plot in Output 59.1.1; there is a clear separation between the first two components and the remaining components.) There are nine eigenvalues that are precisely zero because there are nine fewer observations than variables in the data matrix. PROC PRINQUAL is then used to monotonically transform the raw judgments to maximize the proportion of variance accounted for by the first two principal components.

The following statements create the data set and perform a principal component analysis of the original data. These statements produce Output 59.1.1.

title 'Preference Ratings for Automobiles Manufactured in 1980';

data CarPref;
   input Make $ 1-10 Model $ 12-22 @25 (Judge1-Judge25) (1.);
   datalines;
Cadillac   Eldorado     8007990491240508971093809
Chevrolet  Chevette     0051200423451043003515698
Chevrolet  Citation     4053305814161643544747795
Chevrolet  Malibu       6027400723121345545668658
Ford       Fairmont     2024006715021443530648655
Ford       Mustang      5007197705021101850657555
Ford       Pinto        0021000303030201500514078
Honda      Accord       5956897609699952998975078
Honda      Civic        4836709507488852567765075
Lincoln    Continental  7008990592230409962091909
Plymouth   Gran Fury    7006000434101107333458708
Plymouth   Horizon      3005005635461302444675655
Plymouth   Volare       4005003614021602754476555
Pontiac    Firebird     0107895613201206958265907
Volkswagen Dasher       4858696508877795377895000
Volkswagen Rabbit       4858509709695795487885000
Volvo      DL           9989998909999987989919000
;
* Principal Component Analysis of the Original Data;

options ls=80 ps=65;

proc factor data=CarPref nfactors=2 scree;
   ods select Eigenvalues ScreePlot;
   var Judge1-Judge25;
   title3 'Principal Components of Original Data';
run;
Output 59.1.1. Principal Component Analysis of Original Data

Preference Ratings for Automobiles Manufactured in 1980
Principal Components of Original Data

The FACTOR Procedure
Initial Factor Method: Principal Components

Eigenvalues of the Correlation Matrix: Total = 25  Average = 1

       Eigenvalue    Difference    Proportion    Cumulative
  1    10.8857202     5.0349926        0.4354        0.4354
  2     5.8507276     3.8077964        0.2340        0.6695
  3     2.0429312     0.5207808        0.0817        0.7512
  4     1.5221504     0.3078035        0.0609        0.8121
  5     1.2143469     0.2564839        0.0486        0.8606
  6     0.9578630     0.2197345        0.0383        0.8989
  7     0.7381286     0.1497259        0.0295        0.9285
  8     0.5884027     0.2117186        0.0235        0.9520
  9     0.3766841     0.1091250        0.0151        0.9671
 10     0.2675591     0.0773893        0.0107        0.9778
 11     0.1901698     0.0463921        0.0076        0.9854
 12     0.1437776     0.0349382        0.0058        0.9911
 13     0.1088394     0.0607418        0.0044        0.9955
 14     0.0480977     0.0056610        0.0019        0.9974
 15     0.0424367     0.0202714        0.0017        0.9991
 16     0.0221653     0.0221653        0.0009        1.0000
 17     0.0000000     0.0000000        0.0000        1.0000
 18     0.0000000     0.0000000        0.0000        1.0000
 19     0.0000000     0.0000000        0.0000        1.0000
 20     0.0000000     0.0000000        0.0000        1.0000
 21     0.0000000     0.0000000        0.0000        1.0000
 22     0.0000000     0.0000000        0.0000        1.0000
 23     0.0000000     0.0000000        0.0000        1.0000
 24     0.0000000     0.0000000        0.0000        1.0000
 25     0.0000000                      0.0000        1.0000
Output 59.1.1. (continued)

[Scree Plot of Eigenvalues: eigenvalue plotted against component number, 1 through 25. The first eigenvalue (about 10.9) and the second (about 5.9) stand well apart from components 3 through 25, which drop toward zero.]
To fit the nonmetric MDPREF model, you can use the PRINQUAL procedure. The MONOTONE option is specified in the TRANSFORM statement to request a nonmetric MDPREF analysis; alternatively, you can specify the IDENTITY option for a metric analysis. Several options are used in the PROC PRINQUAL statement. The option DATA=CarPref specifies the input data set, OUT=Results creates an output data set, and N=2 and the default METHOD=MTV transform the data to better fit a two-component model. The REPLACE option replaces the original data with the monotonically transformed data in the OUT= data set. The MDPREF option standardizes the component scores to variance one so that the geometry of the biplot is correct, and it creates two variables in the OUT= data set named Prin1 and Prin2. These variables contain the standardized principal component scores and structure matrix, which are used to make the biplot.

If the variables in data matrix X are standardized to mean zero and variance one, and n is the number of rows in X, then X = VΛ^(1/2)W′ is the principal component model, where X′X/(n − 1) = WΛW′. The matrices W and Λ contain the eigenvectors and eigenvalues of the correlation matrix of X. The first two columns of V, the standardized component scores, and of WΛ^(1/2), the structure matrix, are output. The advantage of creating a biplot based on principal components is that the coordinates do not depend on the sample size.

The following statements transform the data and produce Output 59.1.2.

* Transform the Data to Better Fit a Two Component Model;

proc prinqual data=CarPref out=Results n=2 replace mdpref;
   id model;
   transform monotone(Judge1-Judge25);
   title2 'Multidimensional Preference (MDPREF) Analysis';
   title3 'Optimal Monotonic Transformation of Preference Data';
run;
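As a side check on the algebra above (not part of the original example; the 5×3 data matrix is made up), the following PROC IML sketch verifies that the standardized scores and structure matrix reproduce the standardized data:

proc iml;
   x  = {1 2 1, 2 1 3, 3 5 2, 4 4 5, 5 3 4};  /* hypothetical 5 x 3 data */
   n  = nrow(x);
   xc = x - x[:,];                       /* center each column              */
   xs = xc / sqrt(xc[##,] / (n - 1));    /* scale each column to variance 1 */
   call eigen(lambda, w, xs`*xs/(n-1));  /* eigen of the correlation matrix */
   v  = xs * w * diag(1/sqrt(lambda));   /* standardized component scores V */
   s  = w * diag(sqrt(lambda));          /* structure matrix W*Lambda**(1/2) */
   err = max(abs(xs - v*s`));            /* should be essentially zero      */
   print err;
quit;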
Output 59.1.2. Transformation of Automobile Preference Data

Preference Ratings for Automobiles Manufactured in 1980
Multidimensional Preference (MDPREF) Analysis
Optimal Monotonic Transformation of Preference Data

The PRINQUAL Procedure

PRINQUAL MTV Algorithm Iteration History

Iteration    Average    Maximum    Proportion    Criterion
   Number     Change     Change    of Variance      Change    Note
----------------------------------------------------------------------------
        1    0.24994    1.28017        0.66946
        2    0.07223    0.36958        0.80194       0.13249
        3    0.04522    0.29026        0.81598       0.01404
        4    0.03096    0.25213        0.82178       0.00580
        5    0.02182    0.23045        0.82493       0.00315
        6    0.01602    0.19017        0.82680       0.00187
        7    0.01219    0.14748        0.82793       0.00113
        8    0.00953    0.11031        0.82861       0.00068
        9    0.00737    0.06461        0.82904       0.00043
       10    0.00556    0.04469        0.82930       0.00026
       11    0.00445    0.04087        0.82944       0.00014
       12    0.00381    0.03706        0.82955       0.00011
       13    0.00319    0.03348        0.82965       0.00009
       14    0.00255    0.02999        0.82971       0.00006
       15    0.00213    0.02824        0.82976       0.00005
       16    0.00183    0.02646        0.82980       0.00004
       17    0.00159    0.02472        0.82983       0.00003
       18    0.00139    0.02305        0.82985       0.00003
       19    0.00123    0.02145        0.82988       0.00002
       20    0.00109    0.01993        0.82989       0.00002
       21    0.00096    0.01850        0.82991       0.00001
       22    0.00086    0.01715        0.82992       0.00001
       23    0.00076    0.01588        0.82993       0.00001
       24    0.00067    0.01440        0.82994       0.00001
       25    0.00059    0.00871        0.82994       0.00001
       26    0.00050    0.00720        0.82995       0.00000
       27    0.00043    0.00642        0.82995       0.00000
       28    0.00037    0.00573        0.82995       0.00000
       29    0.00031    0.00510        0.82995       0.00000
       30    0.00027    0.00454        0.82995       0.00000    Not Converged

WARNING: Failed to converge, however criterion change is less than 0.0001.
The iteration history displayed by PROC PRINQUAL indicates that the proportion of variance is increased from an initial 0.66946 to 0.82995. The proportion of variance accounted for by PROC PRINQUAL on the first iteration equals the cumulative proportion of variance shown by PROC FACTOR for the first two principal components. In this example, PROC PRINQUAL’s initial iteration performs a standard principal component analysis of the raw data. The columns labeled Average Change, Maximum Change, and Variance Change contain values that always decrease, indicating that PROC PRINQUAL is improving the transformations at a monotonically decreasing rate over the iterations. This does not always happen, and when it does not, it suggests that the analysis may be converging to a degenerate solution. See Example 59.3 on page 3688 for a discussion of a degenerate solution. The algorithm does not converge in 30 iterations. However, the criterion change is small, indicating that more iterations are unlikely to have much effect on the results.
The second PROC FACTOR analysis is performed on the transformed data. The WHERE statement is used to retain only the monotonically transformed judgments. The scree plot shows that the first two eigenvalues are now much larger than the remaining smaller eigenvalues. The second eigenvalue has increased markedly at the expense of the next several eigenvalues. Two principal components seem to be necessary and sufficient to adequately describe these judges' preferences for these automobiles. The cumulative proportion of variance displayed by PROC FACTOR for the first two principal components is 0.83. The following statements perform the analysis and produce Output 59.1.3:

* Final Principal Component Analysis;

proc factor data=Results nfactors=2 scree;
   ods select Eigenvalues ScreePlot;
   var Judge1-Judge25;
   where _TYPE_='SCORE';
   title3 'Principal Components of Monotonically Transformed Data';
run;
Output 59.1.3. Principal Components of Transformed Data

Preference Ratings for Automobiles Manufactured in 1980
Multidimensional Preference (MDPREF) Analysis
Principal Components of Monotonically Transformed Data

The FACTOR Procedure
Initial Factor Method: Principal Components

Eigenvalues of the Correlation Matrix: Total = 25  Average = 1

       Eigenvalue    Difference    Proportion    Cumulative
  1    11.5959045     2.4429455        0.4638        0.4638
  2     9.1529589     7.9952554        0.3661        0.8300
  3     1.1577036     0.3072013        0.0463        0.8763
  4     0.8505023     0.1284323        0.0340        0.9103
  5     0.7220700     0.2613540        0.0289        0.9392
  6     0.4607160     0.0958339        0.0184        0.9576
  7     0.3648821     0.0877851        0.0146        0.9722
  8     0.2770970     0.1250945        0.0111        0.9833
  9     0.1520025     0.0506622        0.0061        0.9894
 10     0.1013403     0.0292763        0.0041        0.9934
 11     0.0720640     0.0200979        0.0029        0.9963
 12     0.0519661     0.0336675        0.0021        0.9984
 13     0.0182987     0.0027059        0.0007        0.9991
 14     0.0155927     0.0093669        0.0006        0.9997
 15     0.0062258     0.0055503        0.0002        1.0000
 16     0.0006755     0.0006755        0.0000        1.0000
 17     0.0000000     0.0000000        0.0000        1.0000
 18     0.0000000     0.0000000        0.0000        1.0000
 19     0.0000000     0.0000000        0.0000        1.0000
 20     0.0000000     0.0000000        0.0000        1.0000
 21     0.0000000     0.0000000        0.0000        1.0000
 22     0.0000000     0.0000000        0.0000        1.0000
 23     0.0000000     0.0000000        0.0000        1.0000
 24     0.0000000     0.0000000        0.0000        1.0000
 25     0.0000000                      0.0000        1.0000
Output 59.1.3. (continued)

[Scree Plot of Eigenvalues for the transformed data: the first eigenvalue (about 11.6) and the second (about 9.2) are well separated from components 3 through 25, which are near zero.]
The remainder of the example constructs the MDPREF biplot. A biplot is a plot that displays the relation between the row points and the columns of a data matrix. The rows of V, the standardized component scores, and WΛ^(1/2), the structure matrix, contain enough information to reproduce X. The (i, j) element of X is the product of row i of V and row j of WΛ^(1/2). If all but the first two columns of V and WΛ^(1/2) are discarded, the (i, j) element of X is approximated by the product of row i of V and row j of WΛ^(1/2).

Since the MDPREF analysis is based on a principal component model, the dimensions of the MDPREF biplot are the first two principal components. The first principal component is the longest dimension through the MDPREF biplot. The first principal component is overall preference, which is the most salient dimension in the preference judgments. One end points in the direction that is on the average preferred most by the judges, and the other end points in the least preferred direction. The second principal component is orthogonal to the first principal component, and it is the orthogonal direction that is the second most salient. The interpretation of the second dimension varies from example to example.

With an MDPREF biplot, it is geometrically appropriate to represent each automobile (object) by a point and each judge by a vector. The automobile points have coordinates that are the scores of the automobile on the first two principal components. The judge vectors emanate from the origin of the space and go through a point with coordinates that are the coefficients of the judge (variable) on the first two principal components. The absolute length of a vector is arbitrary. However, the relative lengths of the vectors indicate fit, with the squared lengths being proportional to the communalities in the PROC FACTOR output. The direction of the vector indicates the direction that is most preferred by the individual judge, with preference increasing as the vector moves from the origin.

Let v′ be row i of V, u′ be row j of U = WΛ^(1/2), ‖v‖ be the length of v, ‖u‖ be the length of u, and θ be the angle between v and u. The predicted degree of preference that an individual judge has for an automobile is u′v = ‖u‖ ‖v‖ cos θ. Each car point can be orthogonally projected onto the vector. The projection of car i on vector j is u((u′v)/(u′u)), and the length of this projection is ‖v‖ cos θ. The automobile that projects farthest along a vector in the direction it points is that judge's most preferred automobile, since the length of this projection, ‖v‖ cos θ, differs from the predicted preference, ‖u‖ ‖v‖ cos θ, only by ‖u‖, which is constant within each judge.

To interpret the biplot, look for directions through the plot that show a continuous change in some attribute of the automobiles, or look for regions in the plot that contain clusters of automobile points and determine what attributes the automobiles have in common. Those points that are tightly clustered in a region of the plot represent automobiles that have the same preference patterns across the judges. Those vectors that point in roughly the same direction represent judges who tend to have similar preference patterns.
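Here is a small PROC IML sketch of this projection arithmetic (the score row v and structure row u are made-up two-dimensional values, not taken from the example):

proc iml;
   v = {1.2  0.4};               /* hypothetical car point, a row of V    */
   u = {0.8 -0.3};               /* hypothetical judge vector, a row of U */
   pref = u * v`;                /* predicted preference u'v              */
   proj = (u*v` / (u*u`)) * u;   /* projection of the car onto the vector */
   plen = u*v` / sqrt(u*u`);     /* signed length ||v||cos(theta)         */
   print pref plen proj;
quit;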
The following statements construct the biplot and produce Output 59.1.4:

title3 'Biplot of Automobiles and Judges';
%plotit(data=results, datatype=mdpref 2);
The DATATYPE=MDPREF 2 option indicates that the coordinates come from an MDPREF analysis, so the macro represents the scores as points and the structure as vectors, with the vectors stretched by a factor of two to make a better graphical display.

Output 59.1.4. Preference Ratings for Automobiles Manufactured in 1980 (biplot of automobile points and judge vectors)
In the biplot, American automobiles are located on the left of the space, while European and Japanese automobiles are located on the right. At the top of the space are expensive American automobiles (Cadillac Eldorado, Lincoln Continental) while at the bottom are inexpensive ones (Pinto, Chevette). The first principal component differentiates American from imported automobiles, and the second arranges automobiles by price and other associated characteristics. The two expensive American automobiles form a cluster, the sporty automobile (Firebird) is by itself, the Volvo DL is by itself, and the remaining imported autos
form a cluster, as do the remaining American autos. It seems there are 5 prototypical automobiles in this set of 17, in terms of preference patterns among the 25 judges. Most of the judges prefer the imported automobiles, especially the Volvo. There is also a fairly large minority that prefer the expensive cars, whether or not they are American (those with vectors that point towards one o’clock), or simply prefer expensive American automobiles (vectors that point towards eleven o’clock). There are two people who prefer anything except expensive American cars (five o’clock vectors), and one who prefers inexpensive American cars (seven o’clock vector). Several vectors point toward the upper-right corner of the plot, toward a region with no cars. This is the region between the European and Japanese cars on the right and the luxury cars on the top. This suggests that there is a market for luxury Japanese and European cars.
Example 59.2. Multidimensional Preference Analysis of Cars Data, ODS Graphics (Experimental)

The following graphical displays are requested by specifying the experimental ODS GRAPHICS statement. For general information about ODS graphics, see Chapter 15, “Statistical Graphics Using ODS.” For specific information about the graphics available in the PRINQUAL procedure, see the “ODS Graphics” section on page 3677.

title 'Preference Ratings for Automobiles Manufactured in 1980';

options validvarname=any;

data CarPref;
   input Make $ 1-10 Model $ 12-22 @25 ('1'n-'25'n) (1.);
   datalines;
Cadillac   Eldorado     8007990491240508971093809
Chevrolet  Chevette     0051200423451043003515698
Chevrolet  Citation     4053305814161643544747795
Chevrolet  Malibu       6027400723121345545668658
Ford       Fairmont     2024006715021443530648655
Ford       Mustang      5007197705021101850657555
Ford       Pinto        0021000303030201500514078
Honda      Accord       5956897609699952998975078
Honda      Civic        4836709507488852567765075
Lincoln    Continental  7008990592230409962091909
Plymouth   Gran Fury    7006000434101107333458708
Plymouth   Horizon      3005005635461302444675655
Plymouth   Volare       4005003614021602754476555
Pontiac    Firebird     0107895613201206958265907
Volkswagen Dasher       4858696508877795377895000
Volkswagen Rabbit       4858509709695795487885000
Volvo      DL           9989998909999987989919000
;

ods html;
ods graphics on;

proc prinqual data=CarPref out=Results n=2 replace mdpref maxiter=100;
   id model;
   transform monotone('1'n-'25'n);
   title2 'Multidimensional Preference (MDPREF) Analysis';
   title3 'Optimal Monotonic Transformation of Preference Data';
run;

ods graphics off;
ods html close;
Output 59.2.1. Multidimensional Preference Analysis (Experimental)
Example 59.3. Principal Components of Basketball Rankings

The data in this example are 1985–1986 preseason rankings of 35 college basketball teams by 10 different news services. The services do not all rank the same teams or the same number of teams, so there are missing values in these data. Each of the 35 teams in the data set is ranked by at least one news service. One way of summarizing these data is with a principal component analysis, since the rankings should all be related to a single underlying variable, the first principal component.

You can use PROC PRINQUAL to estimate the missing ranks and compute scores for all observations. You can formulate a PROC PRINQUAL analysis that assumes that the observed ranks are ordinal variables and replaces the ranks with new numbers that are monotonic with the ranks and better fit the one principal component model.
The missing rank estimates need to be constrained, since a news service would have positioned the unranked teams below the teams it ranked. PROC PRINQUAL should impose order constraints within the nonmissing values and between the missing and nonmissing values, but not within the missing values. PROC PRINQUAL has sophisticated missing data handling facilities; however, these facilities cannot directly handle this problem. The solution requires reformulating the problem. By performing some preliminary data manipulations, specifying the N=1 option in the PROC PRINQUAL statement, and specifying the UNTIE transformation in the TRANSFORM statement, you can make the missing value estimates conform to the requirements.

The PROC MEANS step finds the largest rank for each variable. The next DATA step replaces missing values with a value that is one larger than the largest observed rank. The N=1 option (in the PRINQUAL procedure) specifies that the variables should be transformed to make them as one-dimensional as possible. The UNTIE transformation in the TRANSFORM statement monotonically transforms the ranks, untying any ties in an optimal way. Because the only ties are for the values that replace the missing values, and because these values are larger than the observed values, the rescoring of the data satisfies the preceding requirements.

The following statements create the data set and perform the transformations discussed previously. These statements produce Output 59.3.1.

* Example 2: Basketball Data
* Preseason 1985 College Basketball Rankings
* (rankings of 35 teams by 10 news services)
*
* Note: (a) Various news services rank varying numbers of teams.
*       (b) Not all 35 teams are ranked by all news services.
*       (c) Each team is ranked by at least one service.
*       (d) Rank 20 is missing for UPI.;
title1 '1985 Preseason College Basketball Rankings';

data bballm;
   input School $13. CSN DurhamSun DurhamHerald WashingtonPost
         USA_Today SportMagazine InsideSports UPI AP
         SportsIllustrated;
   label CSN               = 'Community Sports News (Chapel Hill, NC)'
         DurhamSun         = 'Durham Sun'
         DurhamHerald      = 'Durham Morning Herald'
         WashingtonPost    = 'Washington Post'
         USA_Today         = 'USA Today'
         SportMagazine     = 'Sport Magazine'
         InsideSports      = 'Inside Sports'
         UPI               = 'United Press International'
         AP                = 'Associated Press'
         SportsIllustrated = 'Sports Illustrated';
   format CSN--SportsIllustrated 5.1;
   datalines;
Louisville     1  8  1  9  8  9  6 10  9  9
Georgia Tech   2  2  4  3  1  1  1  2  1  1
Kansas         3  4  5  1  5 11  8  4  5  7
Michigan       4  5  9  4  2  5  3  1  3  2
Duke           5  6  7  5  4 10  4  5  6  5
UNC            6  1  2  2  3  4  2  3  2  3
Syracuse       7 10  6 11  6  6  5  6  4 10
Notre Dame     8 14 15 13 11 20 18 13 12  .
Kentucky       9 15 16 14 14 19 11 12 11 13
LSU           10  9 13  . 13 15 16  9 14  8
DePaul        11  . 21 15 20  . 19  .  . 19
Georgetown    12  7  8  6  9  2  9  8  8  4
Navy          13 20 23 10 18 13 15  . 20  .
Illinois      14  3  3  7  7  3 10  7  7  6
Iowa          15 16  .  . 23  .  . 14  . 20
Arkansas      16  .  .  . 25  .  .  .  . 16
Memphis State 17  . 11  . 16  8 20  . 15 12
Washington    18  .  .  .  .  .  . 17  .  .
UAB           19 13 10  . 12 17  . 16 16 15
UNLV          20 18 18 19 22  . 14 18 18  .
NC State      21 17 14 16 15  . 12 15 17 18
Maryland      22  .  .  . 19  .  .  . 19 14
Pittsburgh    23  .  .  .  .  .  .  .  .  .
Oklahoma      24 19 17 17 17 12 17  . 13 17
Indiana       25 12 20 18 21  .  .  .  .  .
Virginia      26  . 22  .  . 18  .  .  .  .
Old Dominion  27  .  .  .  .  .  .  .  .  .
Auburn        28 11 12  8 10  7  7 11 10 11
St. Johns     29  .  .  .  . 14  .  .  .  .
UCLA          30  .  .  .  .  .  . 19  .  .
St. Joseph's   .  . 19  .  .  .  .  .  .  .
Tennessee      .  . 24  .  . 16  .  .  .  .
Montana        .  .  . 20  .  .  .  .  .  .
Houston        .  .  .  . 24  .  .  .  .  .
Virginia Tech  .  .  .  .  .  . 13  .  .  .
;
* Find maximum rank for each news service and replace
* each missing value with the next highest rank.;

proc means data=bballm noprint;
   output out=maxrank
      max=mcsn mdurs mdurh mwas musa mspom mins mupi map mspoi;
run;

data bball;
   set bballm;
   if _n_=1 then set maxrank;
   array services[10] CSN--SportsIllustrated;
   array maxranks[10] mcsn--mspoi;
   keep School CSN--SportsIllustrated;
   do i=1 to 10;
      if services[i]=. then services[i]=maxranks[i]+1;
   end;
run;
* Assume that the ranks are ordinal and that unranked teams
* would have been ranked lower than ranked teams.
* Monotonically transform all ranked teams while estimating
* the unranked teams. Enforce the constraint that the missing
* ranks are estimated to be less than the observed ranks.
* Order the unranked teams optimally within this constraint.
* Do this so as to maximize the variance accounted for by one
* linear combination. This makes the data as nearly rank one
* as possible, given the constraints.
*
* NOTE: The UNTIE transformation should be used with caution.
* It frequently produces degenerate results.;
proc prinqual data=bball out=tbball scores n=1 tstandard=z;
   title2 'Optimal Monotonic Transformation of Ranked Teams';
   title3 'with Constrained Estimation of Unranked Teams';
   transform untie(CSN -- SportsIllustrated);
   id School;
run;
Output 59.3.1. Transformation of Basketball Team Rankings

1985 Preseason College Basketball Rankings
Optimal Monotonic Transformation of Ranked Teams
with Constrained Estimation of Unranked Teams

The PRINQUAL Procedure

PRINQUAL MTV Algorithm Iteration History

Iteration    Average    Maximum    Proportion    Criterion
   Number     Change     Change    of Variance      Change    Note
----------------------------------------------------------------------------
        1    0.18563    0.76531        0.85850
        2    0.03225    0.14627        0.94362       0.08512
        3    0.02126    0.10530        0.94669       0.00307
        4    0.01467    0.07526        0.94801       0.00132
        5    0.01067    0.05282        0.94865       0.00064
        6    0.00800    0.03669        0.94899       0.00034
        7    0.00617    0.02862        0.94919       0.00020
        8    0.00486    0.02636        0.94932       0.00013
        9    0.00395    0.02453        0.94941       0.00009
       10    0.00327    0.02300        0.94947       0.00006
       11    0.00275    0.02166        0.94952       0.00005
       12    0.00236    0.02041        0.94956       0.00004
       13    0.00205    0.01927        0.94959       0.00003
       14    0.00181    0.01818        0.94962       0.00003
       15    0.00162    0.01719        0.94964       0.00002
       16    0.00147    0.01629        0.94966       0.00002
       17    0.00136    0.01546        0.94968       0.00002
       18    0.00128    0.01469        0.94970       0.00002
       19    0.00121    0.01398        0.94971       0.00001
       20    0.00115    0.01332        0.94973       0.00001
       21    0.00111    0.01271        0.94974       0.00001
       22    0.00105    0.01213        0.94975       0.00001
       23    0.00099    0.01155        0.94976       0.00001
       24    0.00095    0.01095        0.94977       0.00001
       25    0.00091    0.01038        0.94978       0.00001
       26    0.00088    0.00986        0.94978       0.00001
       27    0.00084    0.00936        0.94979       0.00001
       28    0.00081    0.00889        0.94980       0.00001
       29    0.00077    0.00846        0.94980       0.00000
       30    0.00073    0.00805        0.94980       0.00000    Not Converged

WARNING: Failed to converge, however criterion change is less than 0.0001.
An alternative approach is to use the pairwise deletion option of the CORR procedure to compute the correlation matrix and then use PROC PRINCOMP or PROC FACTOR to perform the principal component analysis. This approach has several disadvantages. The correlation matrix may not be positive semidefinite (psd), an assumption required for principal component analysis. PROC PRINQUAL always produces a psd correlation matrix. Even with pairwise deletion, PROC CORR removes the six observations with only a single nonmissing value from this data set. Finally, it is still not possible to calculate scores on the principal components for those teams that have missing values. It is possible to compute the composite ranking using PROC PRINCOMP and some preliminary data manipulations, similar to those discussed previously. Chapter 58, “The PRINCOMP Procedure,” contains an example where the average
of the unused ranks in each poll is substituted for the missing values, and each observation is weighted by the number of nonmissing values. This method has much to recommend it. It is much faster and simpler than using PROC PRINQUAL. It is also much less prone to degeneracies and capitalization on chance. However, PROC PRINCOMP does not allow the nonmissing ranks to be monotonically transformed and the missing values untied to optimize fit.

PROC PRINQUAL monotonically transforms the observed ranks and estimates the missing ranks (within the constraints given previously) to account for almost 95 percent of the variance of the transformed data by just one dimension. PROC FACTOR is then used to report details of the principal component analysis of the transformed data. As shown by the Factor Pattern values in Output 59.3.2, nine of the ten news services have a correlation of 0.95 or larger with the scores on the first principal component after the data are optimally transformed. The scores are sorted, and the composite ranking is displayed following the PROC FACTOR output. More confidence can be placed in the stability of the scores for the teams that are ranked by the majority of the news services than in scores for teams that are seldom ranked.

The monotonic transformations are plotted for each of the ten news services. These plots are the values of the raw ranks (with the missing ranks replaced by the maximum rank plus one) versus the rescored (transformed) ranks. The transformations are the step functions that maximize the fit of the data to the principal component model. Smoother transformations could be found by using MSPLINE transformations, but MSPLINE transformations would not correctly handle the missing data problem.

The following statements perform the final analysis and produce Output 59.3.2:

* Perform the Final Principal Component Analysis;

proc factor nfactors=1;
   var TCSN -- TSportsIllustrated;
   title4 'Principal Component Analysis';
run;

proc sort;
   by Prin1;
run;

* Display Scores on the First Principal Component;

proc print;
   title4 'Teams Ordered by Scores on First Principal Component';
   var School Prin1;
run;

* Plot the Transformations;

goptions goutmode=replace nodisplay;
%let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
* Depending on your goptions, these plot options may work better:
* %let opts = haxis=axis2 vaxis=axis1 frame;

proc gplot;
   title;
   axis1 minor=none label=(angle=90 rotate=0)
         order=(-3 to 2 by 1);
   axis2 minor=none order=(0 to 40 by 10);
   plot TCSN*CSN                             / &opts name='prqex1';
   plot TDurhamSun*DurhamSun                 / &opts name='prqex2';
   plot TDurhamHerald*DurhamHerald           / &opts name='prqex3';
   plot TWashingtonPost*WashingtonPost       / &opts name='prqex4';
   plot TUSA_Today*USA_Today                 / &opts name='prqex5';
   plot TSportMagazine*SportMagazine         / &opts name='prqex6';
   plot TInsideSports*InsideSports           / &opts name='prqex7';
   plot TUPI*UPI                             / &opts name='prqex8';
   plot TAP*AP                               / &opts name='prqex9';
   plot TSportsIllustrated*SportsIllustrated / &opts name='prqex10';
   symbol1 c=blue;
run;
quit;

goptions display;

proc greplay nofs tc=sashelp.templt template=l2r2;
   igout gseg;
   treplay 1:prqex1 2:prqex2 3:prqex3 4:prqex4;
   treplay 1:prqex5 2:prqex6 3:prqex7 4:prqex8;
   treplay 1:prqex9 3:prqex10;
run;
quit;
Output 59.3.2. Alternative Approach for Analyzing Basketball Rankings

1985 Preseason College Basketball Rankings
Optimal Monotonic Transformation of Ranked Teams
with Constrained Estimation of Unranked Teams
Principal Component Analysis

The FACTOR Procedure
Initial Factor Method: Principal Components

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 10  Average = 1

       Eigenvalue    Difference    Proportion    Cumulative
  1    9.49808040    9.27698055        0.9498        0.9498
  2    0.22109985    0.13434105        0.0221        0.9719
  3    0.08675881    0.01266762        0.0087        0.9806
  4    0.07409119    0.03048596        0.0074        0.9880
  5    0.04360523    0.00567364        0.0044        0.9924
  6    0.03793160    0.02098385        0.0038        0.9962
  7    0.01694775    0.00299099        0.0017        0.9979
  8    0.01395675    0.00982630        0.0014        0.9992
  9    0.00413045    0.00073249        0.0004        0.9997
 10    0.00339797                      0.0003        1.0000

1 factor will be retained by the NFACTOR criterion.

Factor Pattern

                                                       Factor1
TCSN                 CSN Transformation                0.91136
TDurhamSun           DurhamSun Transformation          0.98887
TDurhamHerald        DurhamHerald Transformation       0.97402
TWashingtonPost      WashingtonPost Transformation     0.97408
TUSA_Today           USA_Today Transformation          0.98867
TSportMagazine       SportMagazine Transformation      0.95331
TInsideSports        InsideSports Transformation       0.98521
TUPI                 UPI Transformation                0.98534
TAP                  AP Transformation                 0.99590
TSportsIllustrated   SportsIllustrated Transformation  0.98615

Variance Explained by Each Factor

Factor1
9.4980804

Final Communality Estimates: Total = 9.498080

      TCSN   TDurhamSun   TDurhamHerald   TWashingtonPost   TUSA_Today
0.83057866   0.97785439      0.94870875        0.94882907   0.97747798

TSportMagazine   TInsideSports         TUPI          TAP   TSportsIllustrated
    0.90879058      0.97064640   0.97088804   0.99181626            0.97249026
Output 59.3.2. (continued)

1985 Preseason College Basketball Rankings
Optimal Monotonic Transformation of Ranked Teams
with Constrained Estimation of Unranked Teams
Teams Ordered by Scores on First Principal Component

OBS   School           Prin1
  1   Georgia Tech    -6.20315
  2   UNC             -5.93314
  3   Michigan        -5.71034
  4   Kansas          -4.78699
  5   Duke            -4.75896
  6   Illinois        -4.19220
  7   Georgetown      -4.02861
  8   Louisville      -3.73087
  9   Syracuse        -3.47497
 10   Auburn          -1.78429
 11   LSU             -0.35928
 12   Memphis State    0.46737
 13   Kentucky         0.63661
 14   Notre Dame       0.71919
 15   Navy             0.76187
 16   UAB              0.98316
 17   DePaul           1.09891
 18   Oklahoma         1.12012
 19   NC State         1.15144
 20   UNLV             1.28766
 21   Iowa             1.45260
 22   Indiana          1.48123
 23   Maryland         1.54935
 24   Virginia         2.01385
 25   Arkansas         2.02718
 26   Washington       2.10878
 27   Tennessee        2.27770
 28   Virginia Tech    2.36103
 29   St. Johns        2.37387
 30   Montana          2.43502
 31   UCLA             2.52481
 32   Pittsburgh       3.00907
 33   Old Dominion     3.03324
 34   St. Joseph's     3.39259
 35   Houston          4.69614
Output 59.3.3. Monotonic Transformation for Each News Service
Output 59.3.3. (continued)
Output 59.3.3. (continued)
The ordinary PROC PRINQUAL missing data handling facilities do not work for these data because they do not constrain the missing data estimates properly. If you code the missing ranks as missing and specify linear transformations, then you can compute least-squares estimates of the missing values without transforming the observed values. The first principal component then accounts for 92 percent of the variance after 20 iterations. However, Virginia Tech is ranked number 11 by its score even though it appeared in only one poll (InsideSports ranked it number 13, anchoring it firmly in the middle). Specifying monotone transformations is also inappropriate, since they too allow unranked teams to move in between ranked teams.

With these data, the combination of monotone transformations and the freedom to score the missing ranks without constraint leads to degenerate transformations. PROC PRINQUAL tries to merge the 35 points into two points, producing a perfect fit in one dimension. There is evidence for this after 20 iterations, when the Average Change, Maximum Change, and Variance Change values are all increasing instead of showing the more stable decreasing change rate seen in the analysis shown. The change rates all stop increasing after 41 iterations, and it is clear by 70 or 80 iterations that one component will account for 100 percent of the transformed variables' variance after sufficient iteration. While this may seem desirable (after all, it is a perfect fit), you should, in fact, be on guard when this happens. Whenever convergence is slow, the rates of change increase, or the final data perfectly fit the model, the solution is probably degenerating due to too few constraints on the scorings.
PROC PRINQUAL can account for 100 percent of the variance by scoring Montana and UCLA with one positive value on all variables and scoring all the other teams with one negative value on all variables. This inappropriate analysis suggests that all ranked teams are equally good except for two teams that are less good. Both of these two teams are ranked by only one news service, and their only nonmissing rank is last in the poll. This accounts for the degeneracy.
References

de Boor, C. (1978), A Practical Guide to Splines, New York: Springer-Verlag.

Carroll, J.D. (1972), “Individual Differences and Multidimensional Scaling,” in Multidimensional Scaling: Theory and Applications in the Behavioral Sciences (Volume 1), eds. R.N. Shepard, A.K. Romney, and S.B. Nerlove, New York: Seminar Press.

Eckart, C. and Young, G. (1936), “The Approximation of One Matrix by Another of Lower Rank,” Psychometrika, 1, 211–218.

Fisher, R. (1938), Statistical Methods for Research Workers (10th Edition), Edinburgh: Oliver and Boyd Press.

Gabriel, K.R. (1981), “Biplot Display of Multivariate Matrices for Inspection of Data and Diagnosis,” in Interpreting Multivariate Data, ed. V. Barnett, London: John Wiley & Sons, Inc.

Gifi, A. (1990), Nonlinear Multivariate Analysis, New York: John Wiley & Sons, Inc.

Goodnight, J.H. (1978), SAS Technical Report R-106, The SWEEP Operator: Its Importance in Statistical Computing, Cary, NC: SAS Institute Inc.

Harman, H.H. (1976), Modern Factor Analysis, Third Edition, Chicago: University of Chicago Press.

Hotelling, H. (1933), “Analysis of a Complex of Statistical Variables into Principal Components,” Journal of Educational Psychology, 24, 498–520.

Kuhfeld, W.F., Sarle, W.S., and Young, F.W. (1985), “Methods of Generating Model Estimates in the PRINQUAL Macro,” SAS Users Group International Conference Proceedings: SUGI 10, Cary, NC: SAS Institute Inc., 962–971.

Kruskal, J.B. (1964), “Multidimensional Scaling By Optimizing Goodness of Fit to a Nonmetric Hypothesis,” Psychometrika, 29, 1–27.

Kruskal, J.B. and Shepard, R.N. (1974), “A Nonmetric Variety of Linear Factor Analysis,” Psychometrika, 38, 123–157.

de Leeuw, J. (1985), (Personal Conversation).

de Leeuw, J. (1986), “Regression with Optimal Scaling of the Dependent Variable,” Department of Data Theory, The University of Leiden, The Netherlands.

van Rijckevorsel, J. (1982), “Canonical Analysis with B-Splines,” in COMPSTAT 1982, Part I, eds. H. Caussinus, P. Ettinger, and R. Tomassone, Vienna: Physica-Verlag.

Sarle, W.S. (1984), (Personal Conversation).

Siegel, S. (1956), Nonparametric Statistics, New York: McGraw Hill Book Co.

Smith, P.L. (1979), “Splines as a Useful and Convenient Statistical Tool,” The American Statistician, 33, 57–62.

Tenenhaus, M. and Vachette, J.L. (1977), “PRINQUAL: Un Programme d'Analyse en Composantes Principales d'un Ensemble de Variables Nominales ou Numeriques,” Les Cahiers de Recherche #68, CESA, Jouy-en-Josas, France.

Winsberg, S. and Ramsay, J.O. (1983), “Monotone Spline Transformations for Dimension Reduction,” Psychometrika, 48, 575–595.

Young, F.W. (1981), “Quantitative Analysis of Qualitative Data,” Psychometrika, 46, 357–388.

Young, F.W., Takane, Y., and de Leeuw, J. (1978), “The Principal Components of Mixed Measurement Level Multivariate Data: An Alternating Least Squares Method with Optimal Scaling Features,” Psychometrika, 43, 279–281.
Chapter 60
The PROBIT Procedure

Chapter Contents

OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3705
GETTING STARTED . . . . . . . . . . . . . . . . . . . . . . . . . . . 3706
   Estimating the Natural Response Threshold Parameter . . . . . . . . 3706
SYNTAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3710
   PROC PROBIT Statement . . . . . . . . . . . . . . . . . . . . . . . 3711
   BY Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 3714
   CDFPLOT Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3715
   CLASS Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 3723
   INSET Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 3723
   IPPPLOT Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3725
   LPREDPLOT Statement . . . . . . . . . . . . . . . . . . . . . . . . 3733
   MODEL Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 3741
   OUTPUT Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3745
   PREDPPLOT Statement . . . . . . . . . . . . . . . . . . . . . . . . 3746
   WEIGHT Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3754
DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3754
   Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . 3754
   Response Level Ordering . . . . . . . . . . . . . . . . . . . . . . 3754
   Computational Method . . . . . . . . . . . . . . . . . . . . . . . 3755
   Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 3757
   INEST= SAS-data-set . . . . . . . . . . . . . . . . . . . . . . . . 3757
   Model Specification . . . . . . . . . . . . . . . . . . . . . . . . 3758
   Lack of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . 3759
   Rescaling the Covariance Matrix . . . . . . . . . . . . . . . . . . 3760
   Tolerance Distribution . . . . . . . . . . . . . . . . . . . . . . 3760
   Inverse Confidence Limits . . . . . . . . . . . . . . . . . . . . . 3761
   OUTEST= SAS-data-set . . . . . . . . . . . . . . . . . . . . . . . 3762
   XDATA= SAS-data-set . . . . . . . . . . . . . . . . . . . . . . . . 3763
   Displayed Output . . . . . . . . . . . . . . . . . . . . . . . . . 3763
   ODS Table Names . . . . . . . . . . . . . . . . . . . . . . . . . . 3765
EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3766
   Example 60.1. Dosage Levels . . . . . . . . . . . . . . . . . . . . 3766
   Example 60.2. Multilevel Response . . . . . . . . . . . . . . . . . 3775
   Example 60.3. Logistic Regression . . . . . . . . . . . . . . . . . 3780
   Example 60.4. An Epidemiology Study . . . . . . . . . . . . . . . . 3784
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3796
Chapter 60
The PROBIT Procedure

Overview

The PROBIT procedure calculates maximum likelihood estimates of regression parameters and the natural (or threshold) response rate for quantal response data from biological assays or other discrete event data. This includes probit, logit, ordinal logistic, and extreme value (or gompit) regression models.

Probit analysis developed from the need to analyze qualitative (dichotomous or polytomous) dependent variables within the regression framework. Many response variables are binary by nature (yes/no), while others are measured ordinally rather than continuously (degree of severity). Collett (1991) and Agresti (1990), for example, have shown ordinary least squares (OLS) regression to be inadequate when the dependent variable is discrete. Probit or logit analyses are more appropriate in this case.

The PROBIT procedure computes maximum likelihood estimates of the parameters β and C of the probit equation using a modified Newton-Raphson algorithm. When the response Y is binary, with values 0 and 1, the probit equation is

   p = Pr(Y = 0) = C + (1 − C)F(x′β)

where

   β   is a vector of parameter estimates
   F   is a cumulative distribution function (the normal, logistic, or extreme value)
   x   is a vector of explanatory variables
   p   is the probability of a response
   C   is the natural (threshold) response rate
Notice that PROC PROBIT, by default, models the probability of the lower response levels. The choice of the distribution function F (normal for the probit model, logistic for the logit model, and extreme value or Gompertz for the gompit model) determines the type of analysis. For most problems, there is relatively little difference between the normal and logistic specifications of the model. Both distributions are symmetric about the value zero. The extreme value (or Gompertz) distribution, however, is not symmetric, approaching 0 on the left more slowly than it approaches 1 on the right. You can use the extreme value distribution where such asymmetry is appropriate.

For ordinal response models, the response, Y, of an individual or an experimental unit may be restricted to one of a (usually small) number, k + 1 (k ≥ 1), of ordinal values, denoted for convenience by 1, . . . , k, k + 1. For example, the severity of coronary disease can be classified into three response categories as 1=no disease, 2=angina
pectoris, and 3=myocardial infarction. The PROBIT procedure fits a common slopes cumulative model, which is a parallel lines regression model based on the cumulative probabilities of the response categories rather than on their individual probabilities. The cumulative model has the form

   Pr(Y ≤ 1 | x) = F(x′β)
   Pr(Y ≤ i | x) = F(α_i + x′β),   2 ≤ i ≤ k
where α_2, . . . , α_k are k − 1 intercept parameters. By default, the covariate vector x contains an overall intercept term.

You can set or estimate the natural (threshold) response rate C. Estimation of C can begin either from an initial value that you specify or from the rate observed in a control group. By default, the natural response rate is fixed at zero.

An observation in the data set analyzed by the PROBIT procedure may contain the response and explanatory values for one subject. Alternatively, it may provide the number of observed events from a number of subjects at a particular setting of the explanatory variables. In this case, PROC PROBIT models the probability of an event.
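For a concrete, minimal sketch of fitting such a cumulative model (the data set coronary and the predictors age and chol are hypothetical; the three-level response follows the coronary disease example above):

proc probit data=coronary;
   class severity;              /* ordinal response with levels 1, 2, 3 */
   model severity = age chol;   /* common slopes cumulative model */
run;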
Getting Started

The following example illustrates how you can use the PROBIT procedure to compute the threshold response rate and regression parameter estimates for quantal response data.
Estimating the Natural Response Threshold Parameter

Suppose you want to test the effect of a drug at 12 dosage levels. You randomly divide 180 subjects into 12 groups of 15, one group for each dosage level. You then conduct the experiment and, for each subject, record the presence or absence of a positive response to the drug. You summarize the data by counting the number of subjects responding positively in each dose group. Your data set is as follows:

data study;
   input Dose Respond;
   Number = 15;
   datalines;
0    3
1.1  4
1.3  4
2.0  3
2.2  5
2.8  4
3.7  5
3.9  9
4.4  8
4.8 11
5.9 12
6.8 13
;
run;
The variable dose represents the amount of drug administered. The first group, receiving a dose level of 0, is the control group. The variable number represents the number of subjects in each group. All groups are equal in size; hence, number has the value 15 for all observations. The variable respond represents the number of subjects responding to the associated drug dosage.

You can model the probability of positive response as a function of dosage using the following statements:

proc probit data=study log10 optc;
   model Respond/Number = Dose;
   output out=new p=p_hat;
   predpplot var=Dose cfit=blue cframe=ligr inborder;
   inset;
   ippplot var=Dose cfit=blue cframe=ligr inborder;
   inset;
run;
The DATA= option specifies that PROC PROBIT analyze the SAS data set study. The LOG10 option replaces the first continuous independent variable (dose) by its common logarithm. The OPTC option estimates the natural response rate. When you use the LOG10 option with the OPTC option, any observations with a dose value less than or equal to zero are used in the estimation as a control group. The OUTPUT statement creates a new data set, new, that contains all the variables in the original data set, and a new variable, p– hat, that represents the predicted probabilities. The MODEL statement specifies a proportional response using the variables respond and number in events/trials syntax. The variable dose is the stimulus or explanatory variable. The results from this analysis are displayed in the following figures.
Probit Procedure

Model Information

Data Set                               WORK.STUDY
Events Variable                        Respond
Trials Variable                        Number
Number of Observations                 12
Number of Events                       81
Number of Trials                       180
Number of Events In Control Group      3
Number of Trials In Control Group      15
Name of Distribution                   Normal
Log Likelihood                         -104.3945783

Algorithm converged.

Figure 60.1. Model Fitting Information for the PROBIT Procedure
Figure 60.1 displays background information about the model fit. Included are the name of the input data set, the response variables used, and the number of observations, events, and trials. The last line in Figure 60.1 shows the final value of the log-likelihood function.

Figure 60.2 displays the table of parameter estimates for the model. The parameter C, which is the natural response threshold or the proportion of individuals responding at zero dose, is estimated to be 0.2409. Since both the intercept and the slope coefficient have significant p-values (0.0020, 0.0010), you can write the model for

   Pr(response) = C + (1 − C)F(x′β)

as

   Pr(response) = 0.2409 + 0.7591 Φ(−4.1439 + 6.2308 × log10(dose))

where Φ is the normal cumulative distribution function.
Probit Procedure

Analysis of Parameter Estimates

                                Standard        95% Confidence        Chi-
Parameter     DF   Estimate        Error            Limits          Square   Pr > ChiSq
Intercept      1    -4.1438       1.3415      -6.7731    -1.5146      9.54       0.0020
Log10(Dose)    1     6.2308       1.8996       2.5076     9.9539     10.76       0.0010
_C_            1     0.2409       0.0523       0.1385     0.3433
Figure 60.2. Model Parameter Estimates for the PROBIT Procedure
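To make the fitted equation concrete, a small sketch (the dose value 2.0 is arbitrary) evaluates the estimated response probability directly from the printed coefficients:

data check;
   dose = 2.0;   /* hypothetical dose level */
   p = 0.2409 + 0.7591*probnorm(-4.1438 + 6.2308*log10(dose));
   put p=;       /* should agree with p_hat from the OUTPUT statement */
run;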
Finally, PROC PROBIT specifies the resulting tolerance distribution by providing the mean MU and scale parameter SIGMA as well as the covariance matrix of the distribution parameters.
Probit Procedure

Probit Model in Terms of Tolerance Distribution

        MU         SIGMA
0.66506312    0.16049411

Estimated Covariance Matrix for Tolerance Parameters

                MU        SIGMA          _C_
MU        0.001158    -0.000493     0.000954
SIGMA    -0.000493     0.002394    -0.000999
_C_       0.000954    -0.000999     0.002731
Figure 60.3. Tolerance Distribution Estimates for the PROBIT Procedure
Figure 60.4. Plot of Observed and Fitted Probabilities versus Dose Level
The PREDPPLOT statement creates the plot in Figure 60.4, showing the relationship between dosage level, observed response proportions, and estimated probability values. The dashed lines represent pointwise confidence bands for the fitted probabilities, and a reference line is plotted at the estimated threshold value of .24.
Figure 60.5. Inverse Predicted Probability Plot with Fiducial Limits
The IPPPLOT statement creates the plot in Figure 60.5, showing the inverse relationship between dosage level and observed response proportions/estimated probability values. The dashed lines represent pointwise fiducial limits for the predicted values of the dose variable, and a reference line is also plotted at the estimated threshold value of .24. The INSET statement after each of these plot statements draws a box within the plot. In the inset box, summary information about the model fitting is printed.
Syntax

The following statements are available in PROC PROBIT.
PROC PROBIT < options > ;
   MODEL response=independents < / options > ;
   BY variables ;
   CLASS variables ;
   OUTPUT < OUT=SAS-data-set > < options > ;
   WEIGHT variable ;
   CDFPLOT < VAR=variable > < options > ;
   INSET < keyword-list > < / options > ;
   IPPPLOT < VAR=variable > < options > ;
   LPREDPLOT < VAR=variable > < options > ;
   PREDPPLOT < VAR=variable > < options > ;
A MODEL statement is required. Only a single MODEL statement can be used with one invocation of the PROBIT procedure. If multiple MODEL statements are present, only the last one is used. Main effects and higher-order terms can be specified in the MODEL statement, similar to the GLM procedure. If a CLASS statement is used, it must precede the MODEL statement. The CDFPLOT, INSET, IPPPLOT, LPREDPLOT, and PREDPPLOT statements are used to produce graphical output. You can use any appropriate combination of the graphical statements after the MODEL statement.
PROC PROBIT Statement

PROC PROBIT < options > ;

The PROC PROBIT statement starts the procedure. You can specify the following options in the PROC PROBIT statement.

COVOUT
writes the parameter estimate covariance matrix to the OUTEST= data set.

C=rate
OPTC
controls how the natural response is handled. Specify the OPTC option to request that the natural response rate C be estimated. Specify the C=rate option to set the natural response rate or to provide the initial estimate of the natural response rate. The natural response rate value must be a number between 0 and 1.

• If you specify neither the OPTC nor the C= option, a natural response rate of zero is assumed.
• If you specify both the OPTC and the C= option, the C= option should be a reasonable initial estimate of the natural response rate. For example, you could use the ratio of the number of responses to the number of subjects in a control group.
• If you specify the C= option but not the OPTC option, the natural response rate is set to the specified value and not estimated.
• If you specify the OPTC option but not the C= option, PROC PROBIT's action depends on the data, as follows:
   – If you specify either the LN or LOG10 option and some subjects have values of the first independent variable (dose) less than or equal to zero, these subjects are treated as a control group. The initial estimate of C is then the ratio of the number of responses to the number of subjects in this group.
   – If you do not specify the LN or LOG10 option or if there is no control group, then one of the following occurs:
      · If all responses are greater than zero, the initial estimate of the natural response rate is the minimal response rate (the ratio of the number of responses to the number of subjects in a dose group) across all dose levels.
      · If one or more of the responses is zero (making the response rate zero in that dose group), the initial estimate of the natural rate is the reciprocal of twice the largest number of subjects in any dose group in the experiment.
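For example, the following sketch requests estimation of C and supplies a starting value computed from the control group shown earlier (3 responses out of 15 subjects); the value 0.2 is purely an initial estimate, not the final one.

   proc probit data=study log10 optc c=0.2;
      model respond/number = dose;   /* C is estimated, starting from 0.2 */
   run;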
DATA=SAS-data-set
specifies the SAS data set to be used by PROC PROBIT. By default, the procedure uses the most recently created SAS data set.

GOUT=graphics-catalog
specifies a graphics catalog in which to save graphics output.

HPROB=p
specifies a minimum probability level for the Pearson chi-square to indicate a good fit. The default value is 0.10. The LACKFIT option must also be specified for this option to have any effect. For Pearson goodness-of-fit chi-square values with probability greater than the HPROB= value, the fiducial limits, if requested with the INVERSECL option, are computed using a critical value of 1.96. For chi-square values with probability less than the value of the HPROB= option, the critical value is a 0.95 two-sided quantile value taken from the t distribution with degrees of freedom equal to (k − 1) × m − q, where k is the number of levels for the response variable, m is the number of different sets of independent variable values, and q is the number of parameters fit in the model. Note that the HPROB= option can also appear in the MODEL statement.

INEST=SAS-data-set
specifies an input SAS data set that contains initial estimates for all the parameters in the model. See the section "INEST= SAS-data-set" on page 3757 for a detailed description of the contents of the INEST= data set.

INVERSECL
computes confidence limits for the values of the first continuous independent variable (such as dose) that yield selected response rates. If the algorithm fails to converge (this can happen when C is nonzero), missing values are reported for the confidence limits. See the section "Inverse Confidence Limits" on page 3761 for details. Note that the INVERSECL option can also appear in the MODEL statement.

LACKFIT
performs two goodness-of-fit tests (a Pearson chi-square test and a log-likelihood ratio chi-square test) for the fitted model. To compute the test statistics, proper grouping of the observations into subpopulations is needed. You can use the AGGREGATE or AGGREGATE= option to this end. See the entry for the AGGREGATE and AGGREGATE= options under the MODEL statement. If neither AGGREGATE nor AGGREGATE= is specified, PROC PROBIT assumes each observation is from a separate subpopulation and computes the goodness-of-fit test statistics only for the events/trials syntax. Note: This test is not appropriate if the data are very sparse, with only a few values at each set of the independent variable values.
If the Pearson chi-square test statistic is significant, then the covariance estimates and standard error estimates are adjusted. See the "Lack of Fit Tests" section on page 3759 for a description of the tests. Note that the LACKFIT option can also appear in the MODEL statement.

LOG
LN
analyzes the data by replacing the first continuous independent variable with its natural logarithm. This variable is usually the level of some treatment such as dosage. In addition to the usual output given by the INVERSECL option, the estimated dose values and 95% fiducial limits for dose are also displayed. If you specify the OPTC option, any observations with a dose value less than or equal to zero are used in the estimation as a control group. If you do not specify the OPTC option with the LOG or LN option, then any observations with values of the first continuous independent variable less than or equal to zero are ignored.

LOG10
specifies an analysis like that of the LN or LOG option except that the common logarithm (log to the base 10) of the dose value is used rather than the natural logarithm.

NAMELEN=n
specifies the length of effect names in tables and output data sets to be n characters, where n is a value between 20 and 200. The default length is 20 characters.

NOPRINT
suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, "Using the Output Delivery System."

OPTC
controls how the natural response is handled. See the description of the C= option on page 3711 for details.

ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the levels of the classification variables specified in the CLASS statement, including the levels of the response variable. Response level ordering is important since PROC PROBIT always models the probability of response levels at the beginning of the ordering. See the section "Response Level Ordering" on page 3754 for further details. This ordering also determines which parameters in the model correspond to each level in the data. The following table shows how PROC PROBIT interprets values of the ORDER= option.

Value of ORDER=   Levels Sorted By
DATA              order of appearance in the input data set
FORMATTED         formatted value
FREQ              descending frequency count; levels with the most
                  observations come first in the order
INTERNAL          unformatted value
By default, ORDER=FORMATTED. For the values FORMATTED and INTERNAL, the sort order is machine dependent. For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide.

OUTEST=SAS-data-set
specifies a SAS data set to contain the parameter estimates and, if the COVOUT option is specified, their estimated covariances. If you omit this option, the output data set is not created. The contents of the data set are described in the section "OUTEST= SAS-data-set" on page 3762.

XDATA=SAS-data-set
specifies an input SAS data set that contains values for all the independent variables in the MODEL statement and variables in the CLASS statement. When predicted values and/or fiducial limits for a single continuous variable (dose variable) are required, you can use the XDATA= data set to specify fixed values for the other effects in the MODEL statement. These specified values are also used for generating plots. See the section "XDATA= SAS-data-set" on page 3763 for a detailed description of the contents of the XDATA= data set.
BY Statement

BY variables ;

You can specify a BY statement with PROC PROBIT to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order on each of the BY variables, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the PROBIT procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
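For example, the following hypothetical sketch sorts the data by a grouping variable region (an illustrative name, not part of the study data) and then fits a separate model for each group.

   proc sort data=study out=study_srt;
      by region;
   run;

   proc probit data=study_srt log10;
      by region;
      model respond/number = dose;
   run;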
CDFPLOT Statement

CDFPLOT < VAR=variable > < options > ;

VAR=variable
specifies a single continuous variable (dose variable) in the independent variable list of the MODEL statement. If a VAR= variable is not specified, the first single continuous variable in the independent variable list of the MODEL statement is used. If such a variable does not exist in the independent variable list of the MODEL statement, an error is reported. The predicted cumulative distribution function is defined as

$\hat{F}_j(x) = C + (1 - C)\,F(\hat{a}_j + x'\hat{b})$

where $j = 1, \ldots, k$ are the indexes of the $k$ levels of the multinomial response variable, $F$ is the CDF of the distribution used to model the cumulative probabilities, $\hat{b}$ is the vector of estimated parameters, $x$ is the covariate vector, $\hat{a}_j$ are estimated ordinal intercepts with $\hat{a}_1 = 0$, and $C$ is the threshold parameter, either known or estimated from the model. Let $x_1$ be the covariate corresponding to the dose variable and $x_{-1}$ be the vector of the rest of the covariates, and let the corresponding estimated parameters be $\hat{b}_1$ and $\hat{b}_{-1}$. Then

$\hat{F}_j(x) = C + (1 - C)\,F(\hat{a}_j + x_1\hat{b}_1 + x_{-1}'\hat{b}_{-1})$

To plot $\hat{F}_j$ as a function of $x_1$, $x_{-1}$ must be specified. You can use the XDATA= option to provide the values of $x_{-1}$ (see the XDATA= option in the PROC PROBIT statement for details), or use the default values that follow these rules:

• If the effect contains a continuous variable (or variables), the overall mean of this effect is used.
• If the effect is a single classification variable, the highest level of the variable is used.

options
specify the levels of the multinomial response variable for which the cdf curves are requested, and add features to the plot. There are k − 1 curves for a k-level multinomial response variable (for the highest level, the curve is the constant line 1). You can specify any of them to be plotted by the LEVEL= option in the CDFPLOT statement. See the LEVEL= option for how to specify the levels. An attached box on the right side of the plot is used to label these curves with the names of their levels. You can specify the color of this box using the CLABBOX= option.
You can use options in the CDFPLOT statement to

• superimpose specification limits
• specify the levels for which the cdf curves are requested
• specify graphical enhancements (such as color or text height)
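As a brief illustration, the following sketch requests cdf curves for two levels of a hypothetical multinomial response Symptoms; the variable and level names are illustrative.

   proc probit data=study order=data;
      class Symptoms;
      model Symptoms = dose;
      cdfplot var=dose / level=("None" "Mild");
   run;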
Summary of Options

The following tables list all options by function. The "Dictionary of Options" on page 3718 describes each option in detail.

CDF Options

Table 60.1. Options for CDFPLOT
LEVEL=character-list    specifies the names of the levels for which the cdf curves are requested
NOTHRESH                suppresses the threshold line
THRESHLABPOS=value      specifies the position for the label of the threshold line
General Options

Table 60.2. Color Options

CAXIS=color     specifies color for axis
CFIT=color      specifies color for fitted curves
CFRAME=color    specifies color for frame
CGRID=color     specifies color for grid lines
CHREF=color     specifies color for HREF= lines
CLABBOX=color   specifies color for label box
CTEXT=color     specifies color for text
CVREF=color     specifies color for VREF= lines
Table 60.3. Options to Enhance Plots Produced on Graphics Devices

ANNOTATE=SAS-data-set   specifies an ANNOTATE data set
INBORDER                requests a border around plot
LFIT=linetype           specifies line style for fitted curves
LGRID=linetype          specifies line style for grid lines
NOFRAME                 suppresses the frame around plotting areas
NOGRID                  suppresses grid lines
NOFIT                   suppresses cdf curves
NOHLABEL                suppresses horizontal labels
NOHTICK                 suppresses horizontal ticks
NOVTICK                 suppresses vertical ticks
TURNVLABELS             vertically strings out characters in vertical labels
WFIT=n                  specifies thickness for fitted curves
WGRID=n                 specifies thickness for grids
WREFL=n                 specifies thickness for reference lines
Table 60.4. Axis Options

HAXIS=value1 to value2   specifies tick mark values for horizontal axis
HOFFSET=value            specifies offset for horizontal axis
HLOWER=value             specifies lower limit on horizontal axis scale
HUPPER=value             specifies upper limit on horizontal axis scale
NHTICK=n                 specifies number of ticks for horizontal axis
NVTICK=n                 specifies number of ticks for vertical axis
VAXIS=value1 to value2   specifies tick mark values for vertical axis
VAXISLABEL='string'      specifies label for vertical axis
VOFFSET=value            specifies offset for vertical axis
VLOWER=value             specifies lower limit on vertical axis scale
VUPPER=value             specifies upper limit on vertical axis scale
WAXIS=n                  specifies thickness for axis
Table 60.5. Graphics Catalog Options

DESCRIPTION='string'   specifies description for graphics catalog member
NAME='string'          specifies name for plot in graphics catalog
Table 60.6. Options for Text Enhancement

FONT=font        specifies software font for text
HEIGHT=value     specifies height of text used outside framed areas
INFONT=font      specifies software font for text inside framed areas
INHEIGHT=value   specifies height of text inside framed areas
Table 60.7. Options for Reference Lines

HREF<(INTERSECT)>=value-list           requests horizontal reference line
HREFLABELS=('label1',. . .,'labeln')   specifies labels for HREF= lines
HREFLABPOS=n                           specifies vertical position of labels for HREF= lines
LHREF=linetype                         specifies line style for HREF= lines
LVREF=linetype                         specifies line style for VREF= lines
VREF<(INTERSECT)>=value-list           requests vertical reference line
VREFLABELS=('label1',. . .,'labeln')   specifies labels for VREF= lines
VREFLABPOS=n                           specifies horizontal position of labels for VREF= lines
Dictionary of Options

The following entries provide detailed descriptions of the options in the CDFPLOT statement.

ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an ANNOTATE data set, as described in SAS/GRAPH Software: Reference, that enables you to add features to the cdf plot. The ANNOTATE= data set you specify in the CDFPLOT statement is used for all plots created by the statement.

CAXIS=color
CAXES=color

specifies the color used for the axes and tick marks. This option overrides any COLOR= specifications in an AXIS statement. The default is the first color in the device color list.

CFIT=color

specifies the color for the fitted cdf curves. The default is the first color in the device color list.

CFRAME=color
CFR=color

specifies the color for the area enclosed by the axes and frame. This area is not shaded by default.

CGRID=color

specifies the color for grid lines. The default is the first color in the device color list.

CLABBOX=color

specifies the color for the area enclosed by the label box for cdf curves. This area is not shaded by default.
CHREF=color
CH=color

specifies the color for lines requested by the HREF= option. The default is the first color in the device color list.

CTEXT=color

specifies the color for tick mark values and axis labels. The default is the color specified for the CTEXT= option in the most recent GOPTIONS statement.

CVREF=color
CV=color

specifies the color for lines requested by the VREF= option. The default is the first color in the device color list.

DESCRIPTION='string'
DES='string'

specifies a description, up to 40 characters, that appears in the PROC GREPLAY master menu. The default is the variable name.

FONT=font

specifies a software font for reference line and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the most recent GOPTIONS statement. Hardware characters are used by default.

HAXIS=value1 to value2 <by value3>

specifies tick mark values for the horizontal axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. If value3 is omitted, a value of 1 is used. Examples of HAXIS= lists are:

   haxis = 0 to 10
   haxis = 2 to 10 by 2
   haxis = 0 to 200 by 10

HEIGHT=value

specifies the height of text used outside framed areas. The default value is 3.846 (in percentage).

HLOWER=value

specifies the lower limit on the horizontal axis scale. The HLOWER= option specifies value as the lower horizontal axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the HAXIS= option is used.
HOFFSET=value
specifies the offset for the horizontal axis. The default value is 1.

HUPPER=value

specifies value as the upper horizontal axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the HAXIS= option is used.

HREF<(INTERSECT)>=value-list

requests reference lines perpendicular to the horizontal axis. If (INTERSECT) is specified, a second reference line perpendicular to the vertical axis is drawn that intersects the fit line at the same point as the horizontal axis reference line. If a horizontal axis reference line label is specified, the intersecting vertical axis reference line is labeled with the vertical axis value. See also the CHREF=, HREFLABELS=, and LHREF= options.

HREFLABELS='label1',. . .,'labeln'
HREFLABEL='label1',. . .,'labeln'
HREFLAB='label1',. . .,'labeln'

specifies labels for the lines requested by the HREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

HREFLABPOS=n

specifies the vertical position of labels for HREF= lines. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   top
2   staggered from top
3   bottom
4   staggered from bottom
5   alternating from top
6   alternating from bottom
INBORDER
requests a border around cdf plots.

LEVEL=(character-list)
ORDINAL=(character-list)

specifies the names of the levels for which cdf curves are requested. Names should be quoted and separated by spaces. If no correct name is provided, no cdf curve is plotted.

LFIT=linetype
specifies a line style for fitted curves. By default, fitted curves are drawn by connecting solid lines (linetype = 1).
LGRID=linetype
specifies a line style for all grid lines. linetype is between 1 and 46. The default is 35.

LHREF=linetype
LH=linetype

specifies the line type for lines requested by the HREF= option. The default is 2, which produces a dashed line.

LVREF=linetype
LV=linetype

specifies the line type for lines requested by the VREF= option. The default is 2, which produces a dashed line.

NAME='string'

specifies a name for the plot, up to eight characters, that appears in the PROC GREPLAY master menu. The default is 'PROBIT'.

NOFIT

suppresses the fitted cdf curves.

NOFRAME

suppresses the frame around plotting areas.

NOGRID

suppresses grid lines.

NOHLABEL

suppresses horizontal labels.

NOHTICK

suppresses horizontal tick marks.

NOTHRESH

suppresses the threshold line.

NOVLABEL

suppresses vertical labels.

NOVTICK

suppresses vertical tick marks.

THRESHLABPOS=n

specifies the horizontal position of labels for the threshold line. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   left
2   right
VAXIS=value1 to value2 <by value3>

specifies tick mark values for the vertical axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. This method of specification of tick marks is not valid for logarithmic axes. If value3 is omitted, a value of 1 is used. Examples of VAXIS= lists are:

   vaxis = 0 to 10
   vaxis = 0 to 2 by .1

VAXISLABEL='string'

specifies a label for the vertical axis.

VLOWER=value

specifies the lower limit on the vertical axis scale. The VLOWER= option specifies value as the lower vertical axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the VAXIS= option is used.

VREF<(INTERSECT)>=value-list

requests reference lines perpendicular to the vertical axis. If (INTERSECT) is specified, a second reference line perpendicular to the horizontal axis is drawn that intersects the fit line at the same point as the vertical axis reference line. If a vertical axis reference line label is specified, the intersecting horizontal axis reference line is labeled with the horizontal axis value. See also the CVREF=, LVREF=, and VREFLABELS= options.

VREFLABELS='label1',. . .,'labeln'
VREFLABEL='label1',. . .,'labeln'
VREFLAB='label1',. . .,'labeln'

specifies labels for the lines requested by the VREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

VREFLABPOS=n

specifies the horizontal position of labels for VREF= lines. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   left
2   right

VUPPER=value

specifies the upper limit on the vertical axis scale. The VUPPER= option specifies value as the upper vertical axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the VAXIS= option is used.
WAXIS=n
specifies line thickness for axes and frame. The default value is 1.

WFIT=n

specifies line thickness for fitted curves. The default value is 1.

WGRID=n

specifies line thickness for grids. The default value is 1.

WREFL=n
specifies line thickness for reference lines. The default value is 1.
CLASS Statement

CLASS variables ;

The CLASS statement names the classification variables to be used in the analysis. Classification variables can be either character or numeric. If a single response variable is specified in the MODEL statement, it must also be specified in a CLASS statement. Class levels are determined from the formatted values of the CLASS variables. Thus, you can use formats to group values into levels. See the discussion of the FORMAT procedure in SAS Language Reference: Dictionary. If the CLASS statement is used, it must appear before any of the MODEL statements.
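Because class levels are determined from formatted values, a format can group raw values into levels, as in this hypothetical sketch (the severity variable and its coding are illustrative).

   proc format;
      value sevfmt 0 = 'None' 1-2 = 'Mild' 3-high = 'Severe';
   run;

   proc probit data=study order=internal;
      class severity;
      format severity sevfmt.;
      model severity = dose;
   run;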
INSET Statement

INSET < keyword-list > < / options > ;

By default, inset entries are identified with appropriate labels. However, you can provide a customized label by specifying the keyword for that entry followed by the equal sign (=) and the label in quotes. For example, the following INSET statement produces an inset containing the number of observations and the name of the distribution, labeled "Sample Size" and "Distribution" in the inset.
inset nobs=’Sample Size’ dist=’Distribution’;
If you specify a keyword that does not apply to the plot you are creating, then the keyword is ignored. The options control the appearance of the box. If you specify more than one INSET statement, only the first one is used.
Keywords Used in the INSET Statement

The following tables list keywords available in the INSET statement to display summary statistics, distribution parameters, and distribution fitting information.

Table 60.8. Summary Statistics

NOBS       number of observations
NTRIALS    number of trials
NEVENTS    number of events
C          the user-specified threshold
OPTC       the estimated natural threshold
NRESPLEV   number of levels of the response variable
Table 60.9. General Information

CONFIDENCE   confidence coefficient for all confidence intervals or for the Weibayes fit
DIST         name of the distribution
Options Used in the INSET Statement

The following tables list the options available in the INSET statement.
Table 60.10. General Appearance Options

FONT=font                specifies software font for text
HEIGHT=value             specifies height of text
HEADER='quoted string'   specifies text for header or box title
NOFRAME                  omits frame around box
POS=value                determines the position of the inset. The value can be a
                         compass point (N, NE, E, SE, S, SW, W, NW) or a pair of
                         coordinates (x, y) enclosed in parentheses. The coordinates
                         can be specified in axis percent units or axis data units.
REFPOINT=name            specifies the reference point for an inset that is positioned
                         by a pair of coordinates with the POS= option. You use the
                         REFPOINT= option in conjunction with the POS= coordinates.
                         The REFPOINT= option specifies which corner of the inset
                         frame you have specified with coordinates (x, y), and it can
                         take the value of BR (bottom right), BL (bottom left), TR
                         (top right), or TL (top left). The default is REFPOINT=BL.
                         If the inset position is specified as a compass point, then
                         the REFPOINT= option is ignored.
Table 60.11. Color and Pattern Options

CFILL=color     specifies color for filling box
CFILLH=color    specifies color for filling box header
CFRAME=color    specifies color for frame
CHEADER=color   specifies color for text in header
CTEXT=color     specifies color for text
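A hedged sketch combining several of these options: the inset is positioned by coordinates with its top left corner as the reference point, and the header text is illustrative.

   predpplot var=dose;
   inset nobs='Sample Size' dist='Distribution'
         / header='Model Summary' pos=(10, 90) refpoint=tl;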
IPPPLOT Statement

IPPPLOT < VAR=variable > < options > ;

VAR=variable
specifies a single continuous variable (dose variable) in the independent variable list of the MODEL statement. If a VAR= variable is not specified, the first single continuous variable in the independent variable list of the MODEL statement is used. If such a variable does not exist in the independent variable list of the MODEL statement, an error is reported.

For the binomial model, the response variable is a probability. An estimate of the dose level $\hat{x}_1$ needed for a response of $p$ is given by

$\hat{x}_1 = \left(F^{-1}(p) - x_{-1}'\hat{b}_{-1}\right) / \hat{b}_1$

where $F$ is the cumulative distribution function used to model the probability, $x_{-1}$ is the vector of the rest of the covariates, $\hat{b}_{-1}$ is the vector of the estimated parameters corresponding to $x_{-1}$, and $\hat{b}_1$ is the estimated parameter for the dose variable of interest. To plot $\hat{x}_1$ as a function of $p$, $x_{-1}$ must be specified. You can use the XDATA= option to provide the values of $x_{-1}$ (see the XDATA= option in the PROC PROBIT statement for details), or use the default values that follow these rules:

• If the effect contains a continuous variable (or variables), the overall mean of this effect is used.
• If the effect is a single classification variable, the highest level of the variable is used.

options
add features to the plot. You can use options in the IPPPLOT statement to

• superimpose specification limits
• suppress or add the observed data points on the plot
• suppress or add the fiducial limits on the plot
• specify graphical enhancements (such as color or text height)
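A minimal sketch of an IPPPLOT request, assuming the binomial model used earlier; the NOCONF option suppresses the fiducial limits described above.

   proc probit data=study log10 optc;
      model respond/number = dose;
      ippplot var=dose / noconf;
   run;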
Summary of Options

The following tables list all options by function. The "Dictionary of Options" on page 3728 describes each option in detail.

IPP Options

Table 60.12. Plot Layout Options for IPPPLOT

NOCONF                suppresses fiducial limits
NODATA                suppresses observed data points on the plot
NOTHRESH              suppresses the threshold line
THRESHLABPOS=value    specifies the position for the label of the threshold line
General Options
Table 60.13. Color Options

CAXIS=color    specifies color for axis
CFIT=color     specifies color for fitted curves
CFRAME=color   specifies color for frame
CGRID=color    specifies color for grid lines
CHREF=color    specifies color for HREF= lines
CTEXT=color    specifies color for text
CVREF=color    specifies color for VREF= lines
Table 60.14. Options to Enhance Plots Produced on Graphics Devices

ANNOTATE=SAS-data-set   specifies an ANNOTATE data set
INBORDER                requests a border around plot
LFIT=linetype           specifies line style for fitted curves and confidence limits
LGRID=linetype          specifies line style for grid lines
NOFRAME                 suppresses the frame around plotting areas
NOGRID                  suppresses grid lines
NOFIT                   suppresses fitted curves
NOHLABEL                suppresses horizontal labels
NOHTICK                 suppresses horizontal ticks
NOVTICK                 suppresses vertical ticks
TURNVLABELS             vertically strings out characters in vertical labels
WFIT=n                  specifies thickness for fitted curves
WGRID=n                 specifies thickness for grids
WREFL=n                 specifies thickness for reference lines
Table 60.15. Axis Options

HAXIS=value1 to value2   specifies tick mark values for horizontal axis
HOFFSET=value            specifies offset for horizontal axis
HLOWER=value             specifies lower limit on horizontal axis scale
HUPPER=value             specifies upper limit on horizontal axis scale
NHTICK=n                 specifies number of ticks for horizontal axis
NVTICK=n                 specifies number of ticks for vertical axis
VAXIS=value1 to value2   specifies tick mark values for vertical axis
VAXISLABEL='string'      specifies label for vertical axis
VOFFSET=value            specifies offset for vertical axis
VLOWER=value             specifies lower limit on vertical axis scale
VUPPER=value             specifies upper limit on vertical axis scale
WAXIS=n                  specifies thickness for axis
Table 60.16. Options for Reference Lines

HREF<(INTERSECT)>=value-list           requests horizontal reference line
HREFLABELS=('label1',. . .,'labeln')   specifies labels for HREF= lines
HREFLABPOS=n                           specifies vertical position of labels for HREF= lines
LHREF=linetype                         specifies line style for HREF= lines
LVREF=linetype                         specifies line style for VREF= lines
VREF<(INTERSECT)>=value-list           requests vertical reference line
VREFLABELS=('label1',. . .,'labeln')   specifies labels for VREF= lines
VREFLABPOS=n                           specifies horizontal position of labels for VREF= lines
Table 60.17. Graphics Catalog Options

DESCRIPTION='string'   specifies description for graphics catalog member
NAME='string'          specifies name for plot in graphics catalog
Table 60.18. Options for Text Enhancement

FONT=font        specifies software font for text
HEIGHT=value     specifies height of text used outside framed areas
INFONT=font      specifies software font for text inside framed areas
INHEIGHT=value   specifies height of text inside framed areas
Dictionary of Options

The following entries provide detailed descriptions of the options in the IPPPLOT statement.

ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an ANNOTATE data set, as described in SAS/GRAPH Software: Reference, that enables you to add features to the ipp plot. The ANNOTATE= data set you specify in the IPPPLOT statement is used for all plots created by the statement.

CAXIS=color
CAXES=color

specifies the color used for the axes and tick marks. This option overrides any COLOR= specifications in an AXIS statement. The default is the first color in the device color list.

CFIT=color
specifies the color for the fitted ipp curves. The default is the first color in the device color list.
CFRAME=color
CFR=color

specifies the color for the area enclosed by the axes and frame. This area is not shaded by default.

CGRID=color

specifies the color for grid lines. The default is the first color in the device color list.

CHREF=color
CH=color

specifies the color for lines requested by the HREF= option. The default is the first color in the device color list.

CTEXT=color

specifies the color for tick mark values and axis labels. The default is the color specified for the CTEXT= option in the most recent GOPTIONS statement.

CVREF=color
CV=color

specifies the color for lines requested by the VREF= option. The default is the first color in the device color list.

DESCRIPTION='string'
DES='string'

specifies a description, up to 40 characters, that appears in the PROC GREPLAY master menu. The default is the variable name.

FONT=font

specifies a software font for reference line and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the most recent GOPTIONS statement. Hardware characters are used by default.

HAXIS=value1 to value2 <by value3>

specifies tick mark values for the horizontal axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. If value3 is omitted, a value of 1 is used. Examples of HAXIS= lists are:

   haxis = 0 to 10
   haxis = 2 to 10 by 2
   haxis = 0 to 200 by 10

HEIGHT=value
specifies the height of text used outside framed areas. The default value is 3.846 (in percentage).
HLOWER=value
specifies the lower limit on the horizontal axis scale. The HLOWER= option specifies value as the lower horizontal axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the HAXIS= option is used.

HOFFSET=value

specifies the offset for the horizontal axis. The default value is 1.

HUPPER=value

specifies value as the upper horizontal axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the HAXIS= option is used.

HREF<(INTERSECT)>=value-list

requests reference lines perpendicular to the horizontal axis. If (INTERSECT) is specified, a second reference line perpendicular to the vertical axis is drawn that intersects the fit line at the same point as the horizontal axis reference line. If a horizontal axis reference line label is specified, the intersecting vertical axis reference line is labeled with the vertical axis value. See also the CHREF=, HREFLABELS=, and LHREF= options.

HREFLABELS='label1',. . .,'labeln'
HREFLABEL='label1',. . .,'labeln'
HREFLAB='label1',. . .,'labeln'

specifies labels for the lines requested by the HREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

HREFLABPOS=n

specifies the vertical position of labels for HREF= lines. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   top
2   staggered from top
3   bottom
4   staggered from bottom
5   alternating from top
6   alternating from bottom
INBORDER
requests a border around ipp plots.

LFIT=linetype
specifies a line style for fitted curves and confidence limits. By default, fitted curves are drawn by connecting solid lines (linetype = 1) and confidence limits are drawn by connecting dashed lines (linetype = 3).
LGRID=linetype
specifies a line style for all grid lines. linetype is between 1 and 46. The default is 35.

LHREF=linetype
LH=linetype

specifies the line type for lines requested by the HREF= option. The default is 2, which produces a dashed line.

LVREF=linetype
LV=linetype

specifies the line type for lines requested by the VREF= option. The default is 2, which produces a dashed line.

NAME='string'

specifies a name for the plot, up to eight characters, that appears in the PROC GREPLAY master menu. The default is 'PROBIT'.

NOCONF

suppresses fiducial limits from the plot.

NODATA

suppresses observed data points from the plot.

NOFIT

suppresses the fitted ipp curves.

NOFRAME

suppresses the frame around plotting areas.

NOGRID

suppresses grid lines.

NOHLABEL

suppresses horizontal labels.

NOHTICK

suppresses horizontal tick marks.

NOTHRESH

suppresses the threshold line.

NOVLABEL

suppresses vertical labels.

NOVTICK

suppresses vertical tick marks.

THRESHLABPOS=n

specifies the vertical position of labels for the threshold line. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   top
2   bottom
VAXIS=value1 to value2 <by value3>

specifies tick mark values for the vertical axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. This method of specification of tick marks is not valid for logarithmic axes. If value3 is omitted, a value of 1 is used. Examples of VAXIS= lists are:

   vaxis = 0 to 10
   vaxis = 0 to 2 by .1

VAXISLABEL='string'

specifies a label for the vertical axis.

VLOWER=value

specifies the lower limit on the vertical axis scale. The VLOWER= option specifies value as the lower vertical axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the VAXIS= option is used.

VREF<(INTERSECT)>=value-list

requests reference lines perpendicular to the vertical axis. If (INTERSECT) is specified, a second reference line perpendicular to the horizontal axis is drawn that intersects the fit line at the same point as the vertical axis reference line. If a vertical axis reference line label is specified, the intersecting horizontal axis reference line is labeled with the horizontal axis value. See also the CVREF=, LVREF=, and VREFLABELS= options.

VREFLABELS='label1',. . .,'labeln'
VREFLABEL='label1',. . .,'labeln'
VREFLAB='label1',. . .,'labeln'

specifies labels for the lines requested by the VREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

VREFLABPOS=n

specifies the horizontal position of labels for VREF= lines. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   left
2   right
VUPPER=value
specifies the upper limit on the vertical axis scale. The VUPPER= option specifies value as the upper vertical axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the VAXIS= option is used.
WAXIS=n
specifies line thickness for axes and frame. The default value is 1.

WFIT=n

specifies line thickness for fitted curves. The default value is 1.

WGRID=n

specifies line thickness for grids. The default value is 1.

WREFL=n
specifies line thickness for reference lines. The default value is 1.
LPREDPLOT Statement

LPREDPLOT < VAR=variable > < options > ;

VAR=variable
specifies a single continuous variable (dose variable) in the independent variable list of the MODEL statement for which the linear predictor plot is plotted. If a VAR= variable is not specified, the first single continuous variable in the independent variable list of the MODEL statement is used. If such a variable does not exist in the independent variable list of the MODEL statement, an error is reported.

Let $x_1$ be the covariate of the dose variable, $x_{-1}$ be the vector of the rest of the covariates, $\hat{b}_{-1}$ be the vector of estimated parameters corresponding to $x_{-1}$, and $\hat{b}_1$ be the estimated parameter for the dose variable of interest.

To plot the linear predictor $x'\hat{b}$ as a function of $x_1$, $x_{-1}$ must be specified. You can use the XDATA= option to provide the values of $x_{-1}$ (see the XDATA= option in the PROC PROBIT statement for details), or use the default values that follow these rules:

• If the effect contains a continuous variable (or variables), the overall mean of this effect is used.
• If the effect is a single classification variable, the highest level of the variable is used.

options
add features to the plot. For the multinomial model, you can use the LEVEL= option to specify the levels for which the linear predictor lines are plotted. The lines are labeled in the middle by the names of their levels. You can use options in the LPREDPLOT statement to

• superimpose specification limits
• suppress or add the observed data points on the plot for the binomial model
• suppress or add the confidence limits for the binomial model
• specify the levels for which the linear predictor lines are requested for the multinomial model
• specify graphical enhancements (such as color or text height)
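A hypothetical sketch for the multinomial case, reusing the illustrative Symptoms response from the MODEL statement discussion; the LEVEL= option selects the linear predictor lines to draw.

   proc probit data=study;
      class Symptoms;
      model Symptoms = dose;
      lpredplot var=dose / level=("None" "Mild");
   run;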
Summary of Options

The following tables list all options by function. The "Dictionary of Options" on page 3736 describes each option in detail.

LPRED Options

Table 60.19. Plot Layout Options for LPREDPLOT

LEVEL=character-list   specifies the names of the levels for which the linear predictor lines are requested (only for the multinomial model)
NOCONF                 suppresses confidence limits (only for the binomial model)
NODATA                 suppresses observed data points on the plot (only for the binomial model)
NOTHRESH               suppresses the threshold line
THRESHLABPOS=value     specifies the position for the label of the threshold line
General Options

Table 60.20. Color Options

CAXIS=color    specifies color for axis
CFIT=color     specifies color for fitted curves
CFRAME=color   specifies color for frame
CGRID=color    specifies color for grid lines
CHREF=color    specifies color for HREF= lines
CTEXT=color    specifies color for text
CVREF=color    specifies color for VREF= lines
Table 60.21. Options to Enhance Plots Produced on Graphics Devices

ANNOTATE=SAS-data-set   specifies an ANNOTATE data set
INBORDER                requests a border around plot
LFIT=linetype           specifies line style for fitted curves and confidence limits
LGRID=linetype          specifies line style for grid lines
NOFRAME                 suppresses the frame around plotting areas
NOGRID                  suppresses grid lines
NOFIT                   suppresses fitted curves
NOHLABEL                suppresses horizontal labels
NOHTICK                 suppresses horizontal ticks
NOVTICK                 suppresses vertical ticks
TURNVLABELS             vertically strings out characters in vertical labels
WFIT=n                  specifies thickness for fitted curves
WGRID=n                 specifies thickness for grids
WREFL=n                 specifies thickness for reference lines
Table 60.22. Axis Options

HAXIS=value1 to value2   specifies tick mark values for horizontal axis
HOFFSET=value            specifies offset for horizontal axis
HLOWER=value             specifies lower limit on horizontal axis scale
HUPPER=value             specifies upper limit on horizontal axis scale
NHTICK=n                 specifies number of ticks for horizontal axis
NVTICK=n                 specifies number of ticks for vertical axis
VAXIS=value1 to value2   specifies tick mark values for vertical axis
VAXISLABEL='string'      specifies label for vertical axis
VOFFSET=value            specifies offset for vertical axis
VLOWER=value             specifies lower limit on vertical axis scale
VUPPER=value             specifies upper limit on vertical axis scale
WAXIS=n                  specifies thickness for axis
Table 60.23. Graphics Catalog Options

DESCRIPTION='string'   specifies description for graphics catalog member
NAME='string'          specifies name for plot in graphics catalog
Table 60.24. Options for Text Enhancement

FONT=font        specifies software font for text
HEIGHT=value     specifies height of text used outside framed areas
INFONT=font      specifies software font for text inside framed areas
INHEIGHT=value   specifies height of text inside framed areas
Table 60.25. Options for Reference Lines

HREF<(INTERSECT)>=value-list           requests horizontal reference line
HREFLABELS=('label1',. . .,'labeln')   specifies labels for HREF= lines
HREFLABPOS=n                           specifies vertical position of labels for HREF= lines
LHREF=linetype                         specifies line style for HREF= lines
LVREF=linetype                         specifies line style for VREF= lines
VREF<(INTERSECT)>=value-list           requests vertical reference line
VREFLABELS=('label1',. . .,'labeln')   specifies labels for VREF= lines
VREFLABPOS=n                           specifies horizontal position of labels for VREF= lines
Dictionary of Options

The following entries provide detailed descriptions of the options in the LPREDPLOT statement.

ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an ANNOTATE data set, as described in SAS/GRAPH Software: Reference, that enables you to add features to the lpred plot. The ANNOTATE= data set you specify in the LPREDPLOT statement is used for all plots created by the statement.

CAXIS=color
CAXES=color

specifies the color used for the axes and tick marks. This option overrides any COLOR= specifications in an AXIS statement. The default is the first color in the device color list.

CFIT=color

specifies the color for the fitted lpred lines. The default is the first color in the device color list.

CFRAME=color
CFR=color

specifies the color for the area enclosed by the axes and frame. This area is not shaded by default.

CGRID=color

specifies the color for grid lines. The default is the first color in the device color list.

CHREF=color
CH=color
specifies the color for lines requested by the HREF= option. The default is the first color in the device color list.
CTEXT=color
specifies the color for tick mark values and axis labels. The default is the color specified for the CTEXT= option in the most recent GOPTIONS statement.

CVREF=color
CV=color

specifies the color for lines requested by the VREF= option. The default is the first color in the device color list.

DESCRIPTION='string'
DES='string'

specifies a description, up to 40 characters, that appears in the PROC GREPLAY master menu. The default is the variable name.

FONT=font

specifies a software font for reference line and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the most recent GOPTIONS statement. Hardware characters are used by default.

HAXIS=value1 to value2 <by value3>

specifies tick mark values for the horizontal axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. If value3 is omitted, a value of 1 is used. Examples of HAXIS= lists are:

   haxis = 0 to 10
   haxis = 2 to 10 by 2
   haxis = 0 to 200 by 10

HEIGHT=value

specifies the height of text used outside framed areas. The default value is 3.846 (in percentage).

HLOWER=value

specifies the lower limit on the horizontal axis scale. The HLOWER= option specifies value as the lower horizontal axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the HAXIS= option is used.

HOFFSET=value

specifies the offset for the horizontal axis. The default value is 1.

HUPPER=value
specifies value as the upper horizontal axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the HAXIS= option is used.
HREF<(INTERSECT)>=value-list
requests reference lines perpendicular to the horizontal axis. If (INTERSECT) is specified, a second reference line perpendicular to the vertical axis is drawn that intersects the fit line at the same point as the horizontal axis reference line. If a horizontal axis reference line label is specified, the intersecting vertical axis reference line is labeled with the vertical axis value. See also the CHREF=, HREFLABELS=, and LHREF= options.

HREFLABELS='label1',. . .,'labeln'
HREFLABEL='label1',. . .,'labeln'
HREFLAB='label1',. . .,'labeln'

specifies labels for the lines requested by the HREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

HREFLABPOS=n

specifies the vertical position of labels for HREF= lines. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   top
2   staggered from top
3   bottom
4   staggered from bottom
5   alternating from top
6   alternating from bottom
INBORDER
requests a border around lpred plots.

LEVEL=(character-list)
ORDINAL=(character-list)

specifies the names of the levels for which linear predictor lines are requested. Names should be quoted and separated by spaces. If no correct name is provided, no lpred line is plotted.

LFIT=linetype

specifies a line style for fitted curves and confidence limits. By default, fitted curves are drawn by connecting solid lines (linetype = 1) and confidence limits are drawn by connecting dashed lines (linetype = 3).

LGRID=linetype

specifies a line style for all grid lines. linetype is between 1 and 46. The default is 35.

LHREF=linetype
LH=linetype
specifies the line type for lines requested by the HREF= option. The default is 2, which produces a dashed line.
LVREF=linetype
LV=linetype

specifies the line type for lines requested by the VREF= option. The default is 2, which produces a dashed line.

NAME='string'

specifies a name for the plot, up to eight characters, that appears in the PROC GREPLAY master menu. The default is 'PROBIT'.

NOCONF

suppresses confidence limits from the plot. This works only for the binomial model. Confidence limits are not plotted for the multinomial model.

NODATA

suppresses observed data points from the plot. This works only for the binomial model. Data points are not plotted for the multinomial model.

NOFIT

suppresses the fitted lpred lines.

NOFRAME

suppresses the frame around plotting areas.

NOGRID

suppresses grid lines.

NOHLABEL

suppresses horizontal labels.

NOHTICK

suppresses horizontal tick marks.

NOTHRESH

suppresses the threshold line.

NOVLABEL

suppresses vertical labels.

NOVTICK

suppresses vertical tick marks.

THRESHLABPOS=n

specifies the horizontal position of labels for the threshold line. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   left
2   right
VAXIS=value1 to value2 <by value3>

specifies tick mark values for the vertical axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. This method of specification of tick marks is not valid for logarithmic axes. If value3 is omitted, a value of 1 is used. Examples of VAXIS= lists are:

   vaxis = 0 to 10
   vaxis = 0 to 2 by .1

VAXISLABEL='string'

specifies a label for the vertical axis.

VLOWER=value

specifies the lower limit on the vertical axis scale. The VLOWER= option specifies value as the lower vertical axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the VAXIS= option is used.

VREF<(INTERSECT)>=value-list

requests reference lines perpendicular to the vertical axis. If (INTERSECT) is specified, a second reference line perpendicular to the horizontal axis is drawn that intersects the fit line at the same point as the vertical axis reference line. If a vertical axis reference line label is specified, the intersecting horizontal axis reference line is labeled with the horizontal axis value. See also the CVREF=, LVREF=, and VREFLABELS= options.

VREFLABELS='label1',. . .,'labeln'
VREFLABEL='label1',. . .,'labeln'
VREFLAB='label1',. . .,'labeln'

specifies labels for the lines requested by the VREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

VREFLABPOS=n

specifies the horizontal position of labels for VREF= lines. The following table shows valid values for n and the corresponding label placements.

n   label placement
1   left
2   right

VUPPER=number
specifies the upper limit on the vertical axis scale. The VUPPER= option specifies number as the upper vertical axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the VAXIS= option is used.
WAXIS=n
specifies line thickness for axes and frame. The default value is 1.

WFIT=n

specifies line thickness for fitted lines. The default value is 1.

WGRID=n

specifies line thickness for grids. The default value is 1.

WREFL=n
specifies line thickness for reference lines. The default value is 1.
MODEL Statement
< label: > MODEL response=effects < / options > ;

< label: > MODEL events/trials=effects < / options > ;
The MODEL statement names the variables used as the response and the independent variables. Additionally, you can specify the distribution used to model the response, as well as other options. Only a single MODEL statement can be used with one invocation of the PROBIT procedure. If multiple MODEL statements are present, only the last is used. Main effects and interaction terms can be specified in the MODEL statement, similar to the GLM procedure. The optional label is used to label output from the matching MODEL statement.

The response can be a single variable with a value that is used to indicate the level of the observed response. Such a response variable must be listed in the CLASS statement. For example, the response might be a variable called Symptoms that takes on the values 'None,' 'Mild,' or 'Severe.' Note that, for dichotomous response variables, the probability of the lower sorted value is modeled by default (see the "Details" section beginning on page 3754). Because the model fit by the PROBIT procedure requires ordered response levels, you may need to use either the ORDER=DATA option in the PROC PROBIT statement or a numeric coding of the response to get the desired ordering of levels.

Alternatively, the response can be specified as a pair of variable names separated by a slash (/). The value of the first variable, events, is the number of positive responses (or events). The value of the second variable, trials, is the number of trials. Both variables must be numeric and non-negative, and the ratio of the first variable value to the second variable value must be between 0 and 1, inclusive. For example, the variables might be hits, a variable containing the number of hits for a baseball player, and AtBats, a variable containing the number of times at bat. A model for hitting proportion (batting average) as a function of age could be specified as

   model hits/AtBats=age;
The effects following the equal sign are the covariates in the model. Higher-order effects, such as interactions and nested terms, are allowed in the list, similar to the GLM procedure. Variable names and combinations of variable names representing higher-order terms are allowed to appear in this list. Class variables can be used as effects, and indicator variables are generated for the class levels. If you do not specify any covariates following the equal sign, an intercept-only model is fit.

The following options are available in the MODEL statement.

AGGREGATE
AGGREGATE=(variable-list)
specifies the subpopulations on which the Pearson chi-square test statistic and the log-likelihood ratio chi-square test statistic (deviance) are calculated if the LACKFIT option is specified. See the section "Rescaling the Covariance Matrix" on page 3760 for details of Pearson's chi-square and deviance calculations. Observations with common values in the given list of variables are regarded as coming from the same subpopulation. Variables in the list can be any variables in the input data set. Specifying the AGGREGATE option is equivalent to specifying the AGGREGATE= option with a variable list that includes all independent variables in the MODEL statement. The PROBIT procedure sorts the input data set according to the variables specified in this list. Information for the sorted data set is reported in the "Response-Covariate Profile" table.

The deviance and Pearson goodness-of-fit statistics are calculated if the LACKFIT option is specified in the MODEL statement. The calculated results are reported in the "Goodness-of-Fit" table. If the Pearson chi-square test is significant with the test level specified by the HPROB= option, the fiducial limits, if requested with the INVERSECL option in the MODEL statement, are modified (see the section "Inverse Confidence Limits" on page 3761 for details). Also, the covariance matrix is rescaled by the dispersion parameter when the SCALE= option is specified.
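For example, the following sketch requests the goodness-of-fit tests with subpopulations defined by the values of the dose variable; the data set and variable names are those used earlier in this chapter.

   proc probit data=study;
      model respond/number = dose / lackfit aggregate=(dose);
   run;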
sets the significance level for the confidence intervals for regression parameters, fiducial limits for the predicted values, and confidence intervals for the predicted probabilities. The value must be between 0 and 1. The default value is ALPHA=0.05.

CONVERGE=value
specifies the convergence criterion. Convergence is declared when the maximum change in the parameter estimates between Newton-Raphson steps is less than the value specified. The change is a relative change if the parameter is greater than 0.01 in absolute value; otherwise, it is an absolute change. By default, CONVERGE=1.0E-8.

CORRB
displays the estimated correlation matrix of the parameter estimates.

COVB
displays the estimated covariance matrix of the parameter estimates.

DISTRIBUTION=distribution-type
DIST=distribution-type
D=distribution-type
specifies the cumulative distribution function used to model the response probabilities. The distributions are described in the “Details” section beginning on page 3754. Valid values for distribution-type are

   NORMAL                              the normal distribution for the probit model
   LOGISTIC                            the logistic distribution for the logit model
   EXTREMEVALUE | EXTREME | GOMPERTZ   the extreme value, or Gompertz, distribution for the gompit model

By default, DISTRIBUTION=NORMAL.

HPROB=p
specifies a minimum probability level for the Pearson chi-square to indicate a good fit. The default value is 0.10. The LACKFIT option must also be specified for this option to have any effect. For Pearson goodness-of-fit chi-square values with probability greater than the HPROB= value, the fiducial limits, if requested with the INVERSECL option, are computed using a critical value of 1.96. For chi-square values with probability less than the value of the HPROB= option, the critical value is a 0.95 two-sided quantile value taken from the t distribution with degrees of freedom equal to (k − 1) × m − q, where k is the number of levels for the response variable, m is the number of different sets of independent variable values, and q is the number of parameters fit in the model. If you specify the HPROB= option in both the PROC PROBIT and MODEL statements, the MODEL statement option takes precedence.

INITIAL=values
sets initial values for the parameters in the model other than the intercept. The values must be given in the order in which the variables are listed in the MODEL statement. If some of the independent variables listed in the MODEL statement are classification variables, then there must be as many values given for that variable as there are classification levels minus 1. The INITIAL option can be specified as follows.

   Type of List               Specification
   list separated by blanks   initial=3 4 5
   list separated by commas   initial=3,4,5
By default, all parameters have initial estimates of zero. Note: The INITIAL= option is overwritten by the INEST= option in the PROC PROBIT statement.

INTERCEPT=value
initializes the intercept parameter to value. By default, INTERCEPT=0.

INVERSECL
computes confidence limits for the values of the first continuous independent variable (such as dose) that yield selected response rates. If the algorithm fails to converge (this can happen when C is nonzero), missing values are reported for the confidence limits. See the section “Inverse Confidence Limits” on page 3761 for details.
ITPRINT
displays the iteration history, the final evaluation of the gradient, and the second derivative matrix (Hessian).

LACKFIT
performs two goodness-of-fit tests (a Pearson chi-square test and a log-likelihood ratio chi-square test) for the fitted model. To compute the test statistics, proper grouping of the observations into subpopulations is needed. You can use the AGGREGATE or AGGREGATE= option to this end. See the entry for the AGGREGATE and AGGREGATE= options under the MODEL statement. If neither AGGREGATE nor AGGREGATE= is specified, PROC PROBIT assumes each observation is from a separate subpopulation and computes the goodness-of-fit test statistics only for the events/trials syntax.

Note: This test is not appropriate if the data are very sparse, with only a few values at each set of the independent variable values. If the Pearson chi-square test statistic is significant, then the covariance estimates and standard error estimates are adjusted. See the section “Lack of Fit Tests” on page 3759 for a description of the tests. Note that the LACKFIT option can also appear in the PROC PROBIT statement. See the section “PROC PROBIT Statement” on page 3711 for details.

MAXITER=value
MAXIT=value
specifies the maximum number of iterations to be performed in estimating the parameters. By default, MAXITER=50.

NOINT
fits a model with no intercept parameter. If the INTERCEPT= option is also specified, the intercept is fixed at the specified value; otherwise, it is set to zero. This is most useful when the response is binary. When the response has k levels, then k − 1 intercept parameters are fit. The NOINT option sets the intercept parameter corresponding to the lowest response level equal to zero. A Lagrange multiplier, or score, test for the restricted model is computed when the NOINT option is specified.

SCALE=scale
enables you to specify the method for estimating the dispersion parameter. To correct for overdispersion or underdispersion, the covariance matrix is multiplied by the estimate of the dispersion parameter. Valid values for scale are as follows:

D | DEVIANCE
specifies that the dispersion parameter be estimated by the deviance divided by its degrees of freedom.
P | PEARSON
specifies that the dispersion parameter be estimated by the Pearson chi-square statistic divided by its degrees of freedom. This is the default.
You can use the AGGREGATE= option to define the subpopulations for calculating the Pearson chi-square statistic and the deviance.
The “Goodness-of-Fit” table includes the Pearson chi-square statistic, the deviance, the degrees of freedom for each, the ratio of each statistic to its degrees of freedom, and the corresponding p-value.

SINGULAR=value
specifies the singularity criterion for determining linear dependencies in the set of independent variables. The sum of squares and cross-products matrix of the independent variables is formed and swept. If the relative size of a pivot becomes less than the value specified, then the variable corresponding to the pivot is considered to be linearly dependent on the previous set of variables considered. By default, SINGULAR=1E−12.
OUTPUT Statement

OUTPUT <OUT=SAS-data-set> <options> ;

The OUTPUT statement specifies the statistics to include in the output data set and assigns names to the new variables that contain the statistics. Specify a keyword for each desired statistic (see the following list of keywords), an equal sign, and the variable to contain the statistic.

The keywords allowed and the statistics they represent are as follows:

   PROB | P   cumulative probability estimates, $p = C + (1 - C)F(a_j + \mathbf{x}'\boldsymbol{\beta})$
   STD        standard error estimates of $a_j + \mathbf{x}'\mathbf{b}$
   XBETA      estimates of $a_j + \mathbf{x}'\boldsymbol{\beta}$

OUT=SAS-data-set
names the output data set. By default, the new data set is named using the DATAn convention.
When the single variable response syntax is used, the _LEVEL_ variable is added to the output data set, and there are k − 1 output observations for each input observation, where k is the number of response levels. There is no observation output corresponding to the highest response level. For each of the k − 1 observations, the PROB variable contains the fitted probability of obtaining a response level up to the level indicated by the _LEVEL_ variable, the XBETA variable contains $a_j + \mathbf{x}'\mathbf{b}$, where j references the levels ($a_1 = 0$), and the STD variable contains the standard error estimate of the XBETA variable. See the “Details” section, which follows, for the formulas for the parameterizations.
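As an illustration, a sketch of an OUTPUT statement that saves all three statistics (using the multilevel data of Example 60.2; the output data set name and new variable names are arbitrary):

   proc probit data=multi order=data;
      class Prep Symptoms;
      model Symptoms = Prep LDose;
      weight N;
      /* one output observation per response level except the highest;
         _LEVEL_ identifies the level for each output observation */
      output out=preds prob=phat std=sehat xbeta=xb;
   run;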
PREDPPLOT Statement

PREDPPLOT <VAR=variable> <options> ;

The PREDPPLOT statement plots the predicted probabilities. The VAR= option specifies a single continuous variable (dose variable) in the independent variable list of the MODEL statement. If a VAR= variable is not specified, the first single continuous variable in the independent variable list of the MODEL statement is used. If such a variable does not exist in the independent variable list of the MODEL statement, an error is reported.

The predicted probability is

\[\hat{p} = \hat{C} + (1 - \hat{C})F(\mathbf{x}'\hat{\mathbf{b}})\]

for the binomial model and

\[\hat{p}_1 = \hat{C} + (1 - \hat{C})F(\mathbf{x}'\hat{\mathbf{b}})\]
\[\hat{p}_j = (1 - \hat{C})\left(F(\hat{a}_j + \mathbf{x}'\hat{\mathbf{b}}) - F(\hat{a}_{j-1} + \mathbf{x}'\hat{\mathbf{b}})\right), \quad j = 2, \ldots, k-1\]
\[\hat{p}_k = (1 - \hat{C})\left(1 - F(\hat{a}_{k-1} + \mathbf{x}'\hat{\mathbf{b}})\right)\]

for the multinomial model with k response levels, where F is the cumulative distribution function used to model the probability, $\mathbf{x}$ is the vector of the covariates, $\hat{a}_j$ are the estimated ordinal intercepts with $\hat{a}_1 = 0$, C is the threshold parameter, either known or estimated from the model, and $\hat{\mathbf{b}}$ is the vector of estimated parameters.

To plot $\hat{p}$ (or $\hat{p}_j$) as a function of a continuous variable $x_1$, the remaining covariates $\mathbf{x}_{-1}$ must be specified. You can use the XDATA= option to provide the values of $\mathbf{x}_{-1}$ (see the XDATA= option in the PROC PROBIT statement for details), or use the default values that follow the rules:
• If the effect contains a continuous variable (or variables), the overall mean of this effect is used.
• If the effect is a single classification variable, the highest level of the variable is used.

options
enable you to plot the observed data and add features to the plot (a brief sketch follows the list). You can use options in the PREDPPLOT statement to

• superimpose specification limits
• suppress or add observed data points for the binomial model
• suppress or add confidence limits for the binomial model
• specify the levels for which predicted probability curves are requested for the multinomial model
• specify graphical enhancements (such as color or text height)
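As a brief sketch (using the dosage data of Example 60.1; the particular option choices are illustrative only):

   proc probit log10 data=a;
      model Response/N = Dose / d=logistic;
      /* plot fitted probabilities against Dose, suppressing confidence limits */
      predpplot var=Dose noconf;
   run;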
Summary of Options

The following tables list all options by function. The “Dictionary of Options” on page 3749 describes each option in detail.

PREDPPLOT Options

Table 60.26. Plot Layout Options for PREDPPLOT
   LEVEL=character-list   specifies the names of the levels for which the predicted probability curves are requested (only for the multinomial model)
   NOCONF                 suppresses confidence limits
   NODATA                 suppresses observed data points on the plot
   NOTHRESH               suppresses the threshold line
   THRESHLABPOS=value     specifies the position for the label of the threshold line
General Options

Table 60.27. Color Options
   CAXIS=color     specifies color for the axes
   CFIT=color      specifies color for fitted curves
   CFRAME=color    specifies color for frame
   CGRID=color     specifies color for grid lines
   CHREF=color     specifies color for HREF= lines
   CLABBOX=color   specifies color for label box
   CTEXT=color     specifies color for text
   CVREF=color     specifies color for VREF= lines
Table 60.28. Options to Enhance Plots Produced on Graphics Devices
   ANNOTATE=SAS-data-set   specifies an ANNOTATE data set
   INBORDER                requests a border around plot
   LFIT=linetype           specifies line style for fitted curves and confidence limits
   LGRID=linetype          specifies line style for grid lines
   NOFRAME                 suppresses the frame around plotting areas
   NOGRID                  suppresses grid lines
   NOFIT                   suppresses fitted curves
   NOHLABEL                suppresses horizontal labels
   NOHTICK                 suppresses horizontal ticks
   NOVTICK                 suppresses vertical ticks
   TURNVLABELS             vertically strings out characters in vertical labels
   WFIT=n                  specifies thickness for fitted curves
   WGRID=n                 specifies thickness for grids
   WREFL=n                 specifies thickness for reference lines
Table 60.29. Axis Options
   HAXIS=value1 to value2   specifies tick mark values for horizontal axis
   HOFFSET=value            specifies offset for horizontal axis
   HLOWER=value             specifies lower limit on horizontal axis scale
   HUPPER=value             specifies upper limit on horizontal axis scale
   NHTICK=n                 specifies number of ticks for horizontal axis
   NVTICK=n                 specifies number of ticks for vertical axis
   VAXIS=value1 to value2   specifies tick mark values for vertical axis
   VAXISLABEL='string'      specifies label for vertical axis
   VOFFSET=value            specifies offset for vertical axis
   VLOWER=value             specifies lower limit on vertical axis scale
   VUPPER=value             specifies upper limit on vertical axis scale
   WAXIS=n                  specifies thickness for axis
Table 60.30. Graphics Catalog Options
   DESCRIPTION='string'   specifies description for graphics catalog member
   NAME='string'          specifies name for plot in graphics catalog
Table 60.31. Options for Text Enhancement
   FONT=font        specifies software font for text
   HEIGHT=value     specifies height of text used outside framed areas
   INFONT=font      specifies software font for text inside framed areas
   INHEIGHT=value   specifies height of text inside framed areas
Table 60.32. Options for Reference Lines
   HREF<(INTERSECT)>=value-list            requests horizontal reference line
   HREFLABELS=('label1', . . . ,'labeln')  specifies labels for HREF= lines
   HREFLABPOS=n                            specifies vertical position of labels for HREF= lines
   LHREF=linetype                          specifies line style for HREF= lines
   LVREF=linetype                          specifies line style for VREF= lines
   VREF<(INTERSECT)>=value-list            requests vertical reference line
   VREFLABELS=('label1', . . . ,'labeln')  specifies labels for VREF= lines
   VREFLABPOS=n                            specifies horizontal position of labels for VREF= lines
Dictionary of Options

The following entries provide detailed descriptions of the options in the PREDPPLOT statement.

ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an ANNOTATE data set, as described in SAS/GRAPH Software: Reference, that enables you to add features to the predicted probability plot. The ANNOTATE= data set you specify in the PREDPPLOT statement is used for all plots created by the statement.

CAXIS=color
CAXES=color
specifies the color used for the axes and tick marks. This option overrides any COLOR= specifications in an AXIS statement. The default is the first color in the device color list.

CFIT=color
specifies the color for the fitted predicted probability curves. The default is the first color in the device color list.

CFRAME=color
CFR=color
specifies the color for the area enclosed by the axes and frame. This area is not shaded by default.

CGRID=color
specifies the color for grid lines. The default is the first color in the device color list.

CHREF=color
CH=color
specifies the color for lines requested by the HREF= option. The default is the first color in the device color list.
CTEXT=color
specifies the color for tick mark values and axis labels. The default is the color specified for the CTEXT= option in the most recent GOPTIONS statement.

CVREF=color
CV=color
specifies the color for lines requested by the VREF= option. The default is the first color in the device color list.

DESCRIPTION='string'
DES='string'
specifies a description, up to 40 characters, that appears in the PROC GREPLAY master menu. The default is the variable name.

FONT=font
specifies a software font for reference line and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the most recent GOPTIONS statement. Hardware characters are used by default.

HAXIS=value1 to value2 <by value3>
specifies tick mark values for the horizontal axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. If value3 is omitted, a value of 1 is used. Examples of HAXIS= lists are:

   haxis = 0 to 10
   haxis = 2 to 10 by 2
   haxis = 0 to 200 by 10

HEIGHT=value
specifies the height of text used outside framed areas.

HLOWER=value
specifies the lower limit on the horizontal axis scale. The HLOWER= option specifies value as the lower horizontal axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the HAXIS= option is used.

HOFFSET=value
specifies the offset for the horizontal axis. The default value is 1.

HUPPER=value
specifies value as the upper horizontal axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the HAXIS= option is used.
HREF<(INTERSECT)>=value-list
requests reference lines perpendicular to the horizontal axis. If (INTERSECT) is specified, a second reference line perpendicular to the vertical axis is drawn that intersects the fit line at the same point as the horizontal axis reference line. If a horizontal axis reference line label is specified, the intersecting vertical axis reference line is labeled with the vertical axis value. See also the CHREF=, HREFLABELS=, and LHREF= options.

HREFLABELS='label1', . . . ,'labeln'
HREFLABEL='label1', . . . ,'labeln'
HREFLAB='label1', . . . ,'labeln'
specifies labels for the lines requested by the HREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

HREFLABPOS=n
specifies the vertical position of labels for HREF= lines. The following table shows valid values for n and the corresponding label placements.

   n   label placement
   1   top
   2   staggered from top
   3   bottom
   4   staggered from bottom
   5   alternating from top
   6   alternating from bottom

INBORDER
requests a border around predicted probability plots.

LEVEL=(character-list)
ORDINAL=(character-list)
specifies the names of the levels for which predicted probability curves are requested. Names should be quoted and separated by spaces. If no correct name is provided, no fitted probability curve is plotted.

LFIT=linetype
specifies a line style for fitted curves and confidence limits. By default, fitted curves are drawn by connecting solid lines (linetype = 1) and confidence limits are drawn by connecting dashed lines (linetype = 3).

LGRID=linetype
specifies a line style for all grid lines. linetype is between 1 and 46. The default is 35.

LHREF=linetype
LH=linetype
specifies the line type for lines requested by the HREF= option. The default is 2, which produces a dashed line.
LVREF=linetype
LV=linetype
specifies the line type for lines requested by the VREF= option. The default is 2, which produces a dashed line.

NAME='string'
specifies a name for the plot, up to eight characters, that appears in the PROC GREPLAY master menu. The default is 'PROBIT'.

NOCONF
suppresses confidence limits from the plot. This works only for the binomial model. Confidence limits are not plotted for the multinomial model.

NODATA
suppresses observed data points from the plot. This works only for the binomial model. The data points are not plotted for the multinomial model.

NOFIT
suppresses the fitted predicted probability curves.

NOFRAME
suppresses the frame around plotting areas.

NOGRID
suppresses grid lines.

NOHLABEL
suppresses horizontal labels.

NOHTICK
suppresses horizontal tick marks.

NOTHRESH
suppresses the threshold line.

NOVLABEL
suppresses vertical labels.

NOVTICK
suppresses vertical tick marks.

THRESHLABPOS=n
specifies the horizontal position of labels for the threshold line. The following table shows valid values for n and the corresponding label placements.

   n   label placement
   1   left
   2   right
VAXIS=value1 to value2 <by value3>
specifies tick mark values for the vertical axis. value1, value2, and value3 must be numeric, and value1 must be less than value2. The lower tick mark is value1. Tick marks are drawn at increments of value3. The last tick mark is the greatest value that does not exceed value2. This method of specification of tick marks is not valid for logarithmic axes. If value3 is omitted, a value of 1 is used. Examples of VAXIS= lists are:

   vaxis = 0 to 10
   vaxis = 0 to 2 by .1

VAXISLABEL='string'
specifies a label for the vertical axis.

VLOWER=value
specifies the lower limit on the vertical axis scale. The VLOWER= option specifies value as the lower vertical axis tick mark. The tick mark interval and the upper axis limit are determined automatically. This option has no effect if the VAXIS= option is used.

VREF<(INTERSECT)>=value-list
requests reference lines perpendicular to the vertical axis. If (INTERSECT) is specified, a second reference line perpendicular to the horizontal axis is drawn that intersects the fit line at the same point as the vertical axis reference line. If a vertical axis reference line label is specified, the intersecting horizontal axis reference line is labeled with the horizontal axis value. See also the CVREF=, LVREF=, and VREFLABELS= options.

VREFLABELS='label1', . . . ,'labeln'
VREFLABEL='label1', . . . ,'labeln'
VREFLAB='label1', . . . ,'labeln'
specifies labels for the lines requested by the VREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can be up to 16 characters.

VREFLABPOS=n
specifies the horizontal position of labels for VREF= lines. The following table shows valid values for n and the corresponding label placements.

   n   label placement
   1   left
   2   right

VUPPER=value
specifies the upper limit on the vertical axis scale. The VUPPER= option specifies value as the upper vertical axis tick mark. The tick mark interval and the lower axis limit are determined automatically. This option has no effect if the VAXIS= option is used.
WAXIS=n
specifies line thickness for axes and frame. The default value is 1.

WFIT=n
specifies line thickness for fitted curves. The default value is 1.

WGRID=n
specifies line thickness for grids. The default value is 1.

WREFL=n
specifies line thickness for reference lines. The default value is 1.
WEIGHT Statement

WEIGHT variable ;

A WEIGHT statement can be used with PROC PROBIT to weight each observation by the value of the variable specified. The contribution of each observation to the likelihood function is multiplied by the value of the weight variable. Observations with zero, negative, or missing weights are not used in model estimation.
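For example, grouped data with one row per covariate pattern might be weighted by a count variable (a sketch; the data set and variable names are hypothetical):

   proc probit data=grouped;
      class outcome;
      model outcome = x;
      weight count;   /* each observation contributes as 'count' cases */
   run;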
Details

Missing Values

PROC PROBIT does not use any observations having missing values for any of the independent variables, the response variables, or the weight variable. If only the response variables are missing, statistics requested in the OUTPUT statement are computed.
Response Level Ordering

For binary response data, PROC PROBIT fits the following model by default,

\[\Phi^{-1}\left(\frac{p - C}{1 - C}\right) = \mathbf{x}'\boldsymbol{\beta}\]
where p is the probability of the response level identified as the first level in the “Weighted Frequency Counts for the Ordered Response Categories” table in the output and Φ is the normal cumulative distribution function. By default, the covariate vector x contains an intercept term. This is sometimes called Abbot’s formula. Because of the symmetry of the normal (and logistic) distribution, the effect of reversing the order of the two response values is to change the signs of β in the preceding equation. By default, response levels appear in ascending, sorted order (that is, the lowest level appears first and then the next lowest, and so on). There are a number of ways that you can control the sort order of the response categories and, therefore, which level is assigned the first ordered level. One of the most common sets of response levels is {0,1}, with 1 representing the event with the probability that is to be modeled.
Consider the example where Y takes the values 1 and 0 for event and nonevent, respectively, and EXPOSURE is the explanatory variable. By default, PROC PROBIT assigns the first ordered level to response level 0, causing the probability of the nonevent to be modeled. There are several ways to change this. Besides recoding the variable Y, you can

• assign a format to Y such that the first formatted value (when the formatted values are put in sorted order) corresponds to the event. For this example, Y=0 could be assigned formatted value ‘nonevent’ and Y=1 could be assigned formatted value ‘event.’ Since ORDER=FORMATTED by default, Y=1 becomes the first ordered level. See Example 60.3 for an illustration of this method.

   proc format;
      value disease 1='event' 0='nonevent';
   run;

   proc probit;
      model y=exposure;
      format y disease.;
   run;
• arrange the input data set so that Y=1 appears first and use the ORDER=DATA option in the PROC PROBIT statement. Since ORDER=DATA sorts levels in order of their appearance in the data set, Y=1 becomes the first ordered level. Note that this option causes class variables to be sorted by their order of appearance in the data set, also.
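As a minimal sketch of this second approach, assuming the input data set has already been arranged so that the Y=1 observations appear first:

   proc probit order=data;
      model y=exposure;   /* Y=1 is now the first ordered level */
   run;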
Computational Method

The log-likelihood function is maximized by means of a ridge-stabilized Newton-Raphson algorithm. Initial regression parameter estimates are set to zero. The INITIAL= and INTERCEPT= options in the MODEL statement can be used to give nonzero initial estimates. The log-likelihood function, L, is computed as

\[L = \sum_i w_i \ln(p_i)\]

where the sum is over the observations in the data set, $w_i$ is the weight for the ith observation, and $p_i$ is the modeled probability of the observed response. In the case of the events/trials syntax in the MODEL statement, each observation contributes two terms corresponding to the probability of the event and the probability of its complement:

\[L = \sum_i w_i \left[ r_i \ln(p_i) + (n_i - r_i) \ln(1 - p_i) \right]\]

where $r_i$ is the number of events and $n_i$ is the number of trials for observation i. This log-likelihood function differs from the log-likelihood function for a binomial
or multinomial distribution by additive terms consisting of the log of binomial or multinomial coefficients. These terms are parameter-independent and do not affect the model estimation or the standard errors and tests.

The estimated covariance matrix, V, of the parameter estimates is computed as the negative inverse of the information matrix of second derivatives of L with respect to the parameters evaluated at the final parameter estimates. Thus, the estimated covariance matrix is derived from the observed information matrix rather than the expected information matrix (these are generally not the same). The standard error estimates for the parameter estimates are taken as the square roots of the corresponding diagonal elements of V.

If convergence of the maximum likelihood estimates is attained, a Type III chi-square test statistic is computed for each effect, testing whether there is any contribution from any of the levels of the effect. This statistic is computed as a quadratic form in the appropriate parameter estimates using the corresponding submatrix of the asymptotic covariance matrix estimate. Refer to Chapter 32, “The GLM Procedure,” and Chapter 11, “The Four Types of Estimable Functions,” for more information about Type III estimable functions. The asymptotic covariance matrix is computed as the inverse of the observed information matrix. Note that if the NOINT option is specified and class variables are used, the first class variable contains a contribution from an intercept term. The results are displayed in an ODS table named Type3Analysis.

Chi-square tests for individual parameters are Wald tests based on the observed information matrix and the parameter estimates. If an effect has a single degree of freedom in the parameter estimates table, the chi-square test for this parameter is equivalent to the Type III test for this effect.

In releases previous to Version 8.2, a multiple degree of freedom statistic was computed for each effect to test for contribution from any level of the effect. In general, the Type III test statistic in a main effect only model (no interaction terms) will be equal to the previously computed effect statistic, unless there are collinearities among the effects. If there are collinearities, the Type III statistic will adjust for them, and the value of the Type III statistic and the number of degrees of freedom might not be equal to those of the previous effect statistic.

The theory behind these tests assumes large samples. If the samples are not large, it may be better to base the tests on log-likelihood ratios. These changes in log likelihood can be obtained by fitting the model twice, once with all the parameters of interest and once leaving out the parameters to be tested. Refer to Cox and Oakes (1984) for a discussion of the merits of some possible test methods.

If some of the independent variables are perfectly correlated with the response pattern, then the theoretical parameter estimates may be infinite. Although fitted probabilities of 0 and 1 are not especially pathological, infinite parameter estimates are required to yield these probabilities. Due to the finite precision of computer arithmetic, the actual parameter estimates are not infinite. Indeed, since the tails of the distributions allowed in the PROBIT procedure become small rapidly, an argument to the cumulative distribution function of around 20 becomes effectively infinite. In the
case of such parameter estimates, the standard error estimates and the corresponding chi-square tests are not trustworthy.
Distributions

The distributions, F(x), allowed in the PROBIT procedure are specified with the DISTRIBUTION= option in the MODEL statement. The cumulative distribution functions for the available distributions are

\[F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)\, dz \quad \text{(normal)}\]

\[F(x) = \frac{1}{1 + e^{-x}} \quad \text{(logistic)}\]

\[F(x) = 1 - e^{-e^{x}} \quad \text{(extreme value or Gompertz)}\]

The variances of these three distributions are not all equal to 1, and their means are not all equal to zero. Their means and variances are shown in the following table, where γ is the Euler constant.

   Distribution                Mean   Variance
   Normal                      0      1
   Logistic                    0      π²/3
   Extreme value or Gompertz   −γ     π²/6
When comparing parameter estimates using different distributions, you need to take into account the different scalings and, for the extreme value (or Gompertz) distribution, a possible shift in location. For example, if the fitted probabilities are in the neighborhood of 0.1 to 0.9, then the parameter estimates from the logistic model should be about $\pi/\sqrt{3}$ larger than the estimates from the probit model.
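As a check of this rule of thumb against the dosage data later in this chapter (Example 60.1): $\pi/\sqrt{3} \approx 1.814$, and the probit slope estimate there is 3.4181, so the rescaled value $3.4181 \times 1.814 \approx 6.20$ is in rough agreement with the fitted logistic slope of 5.9702.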
INEST= SAS-data-set

The INEST= data set names a SAS data set that specifies initial estimates for all the parameters in the model. The INEST= data set must contain the intercept variables (named Intercept for binary response models and Intercept, Intercept2, Intercept3, and so forth, for multinomial response models) and all independent variables in the MODEL statement.

If BY processing is used, the INEST= data set should also include the BY variables, and there must be at least one observation for each BY group. If there is more than one observation in a BY group, the first one read is used for that BY group.

If the INEST= data set also contains the _TYPE_ variable, only observations with _TYPE_ value 'PARMS' are used as starting values. Combining the INEST= data
set and the option MAXIT= in the MODEL statement, partial scoring can be done, such as predicting on a validation data set by using the model built from a training data set. You can specify starting values for the iterative algorithm in the INEST= data set. This data set overwrites the INITIAL= option in the MODEL statement, which is a little difficult to use for models with multilevel interaction effects. The INEST= data set has the same structure as the “OUTEST= SAS-data-set” on page 3762, but is not required to have all the variables or observations that appear in the OUTEST= data set. One simple use of the INEST= option is passing the previous OUTEST= data set directly to the next model as an INEST= data set, assuming that the two models have the same parameterization.
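A sketch of this training/validation pattern follows (the data set names are hypothetical, and MAXITER=0 is assumed to suppress further iteration so that the supplied estimates are used essentially unchanged):

   /* fit on training data and save the estimates */
   proc probit data=train outest=est;
      model y/n = x;
   run;

   /* score validation data from the saved estimates */
   proc probit data=valid inest=est;
      model y/n = x / maxiter=0;
      output out=scored p=phat;
   run;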
Model Specification

For a two-level response, the probability that the lesser response occurs is modeled by the probit equation as

\[p = C + (1 - C)F(\mathbf{x}'\mathbf{b})\]

The probability of the other (complementary) event is 1 − p.

For a multilevel response with outcomes labeled $l_i$ for $i = 1, 2, \ldots, k$, the probability, $p_j$, of observing level $l_j$ is as follows:

\[p_1 = C + (1 - C)F(\mathbf{x}'\mathbf{b})\]
\[p_2 = (1 - C)\left[ F(a_2 + \mathbf{x}'\mathbf{b}) - F(\mathbf{x}'\mathbf{b}) \right]\]
\[\vdots\]
\[p_j = (1 - C)\left[ F(a_j + \mathbf{x}'\mathbf{b}) - F(a_{j-1} + \mathbf{x}'\mathbf{b}) \right]\]
\[\vdots\]
\[p_k = (1 - C)\left( 1 - F(a_{k-1} + \mathbf{x}'\mathbf{b}) \right)\]
Thus, for a k-level response, there are k − 2 additional parameters, $a_2, a_3, \ldots, a_{k-1}$, estimated. These parameters are denoted by Interceptj, $j = 2, 3, \ldots, k-1$, in the output.

An intercept parameter is always added to the set of independent variables as the first term in the model unless the NOINT option is specified in the MODEL statement. If a classification variable taking on k levels is used as one of the independent variables, a set of k indicator variables is generated to model the effect of this variable. Because of the presence of the intercept term, there are at most k − 1 degrees of freedom for this effect in the model.
Lack of Fit Tests

Two goodness-of-fit tests can be requested from the PROBIT procedure: a Pearson chi-square test and a log-likelihood ratio chi-square test.

To compute the test statistics, you can use the AGGREGATE or AGGREGATE= option grouping the observations into subpopulations. If neither AGGREGATE nor AGGREGATE= is specified, PROC PROBIT assumes that each observation is from a separate subpopulation and computes the goodness-of-fit test statistics only for the events/trials syntax.

If the Pearson goodness-of-fit chi-square test is requested and the p-value for the test is too small, variances and covariances are adjusted by a heterogeneity factor (the goodness-of-fit chi-square divided by its degrees of freedom) and a critical value from the t distribution is used to compute the fiducial limits. The Pearson chi-square test statistic is computed as
\[\chi^2_P = \sum_{i=1}^{m} \sum_{j=1}^{k} \frac{(r_{ij} - n_i \hat{p}_{ij})^2}{n_i \hat{p}_{ij}}\]

where the sum on i is over the groupings, the sum on j is over the levels of response, $r_{ij}$ is the frequency of response level j for the ith grouping, $n_i$ is the total frequency for the ith grouping, and $\hat{p}_{ij}$ is the fitted probability for the jth level at the ith grouping.

The likelihood ratio chi-square test statistic is computed as

\[\chi^2_D = 2 \sum_{i=1}^{m} \sum_{j=1}^{k} r_{ij} \ln\left(\frac{r_{ij}}{n_i \hat{p}_{ij}}\right)\]
This quantity is sometimes called the deviance. If the modeled probabilities fit the data, these statistics should be approximately distributed as chi-square with degrees of freedom equal to (k−1)×m−q, where k is the number of levels of the multinomial or binomial response, m is the number of sets of independent variable values (covariate patterns), and q is the number of parameters fit in the model. In order for the Pearson statistic and the deviance to be distributed as chi-square, there must be sufficient replication within the groupings. When this is not true, the data are sparse, and the p-values for these statistics are not valid and should be ignored. Similarly, these statistics, divided by their degrees of freedom, cannot serve as indicators of overdispersion. A large difference between the Pearson statistic and the deviance provides some evidence that the data are too sparse to use either statistic.
Rescaling the Covariance Matrix

One way of correcting overdispersion is to multiply the covariance matrix by a dispersion parameter. You can supply the value of the dispersion parameter directly, or you can estimate the dispersion parameter based on either the Pearson chi-square statistic or the deviance for the fitted model. The Pearson chi-square statistic $\chi^2_P$ and the deviance $\chi^2_D$ are defined in the section “Lack of Fit Tests” on page 3759. If the SCALE= option is specified in the MODEL statement, the dispersion parameter is estimated by

\[\hat{\sigma}^2 = \begin{cases} \chi^2_P / (m(k-1) - q) & \text{SCALE=PEARSON} \\ \chi^2_D / (m(k-1) - q) & \text{SCALE=DEVIANCE} \\ (\text{constant})^2 & \text{SCALE=constant} \end{cases}\]
In order for the Pearson statistic and the deviance to be distributed as chi-square, there must be sufficient replication within the subpopulations. When this is not true, the data are sparse, and the p-values for these statistics are not valid and should be ignored. Similarly, these statistics, divided by their degrees of freedom, cannot serve as indicators of overdispersion. A large difference between the Pearson statistic and the deviance provides some evidence that the data are too sparse to use either statistic. You can use the AGGREGATE (or AGGREGATE=) option to define the subpopulation profiles. If you do not specify this option, each observation is regarded as coming from a separate subpopulation. For events/trials syntax, each observation represents n Bernoulli trials, where n is the value of the trials variable; for singletrial syntax, each observation represents a single trial. Without the AGGREGATE (or AGGREGATE=) option, the Pearson chi-square statistic and the deviance are calculated only for events/trials syntax. Note that the parameter estimates are not changed by this method. However, their standard errors are adjusted for overdispersion, affecting their significance tests.
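For instance, a sketch of requesting the fit statistics together with a Pearson-based rescaling over subpopulations defined by a covariate (the data set and variable names are hypothetical):

   proc probit data=assay;
      /* group observations by ldose, report fit statistics,
         and rescale the covariance matrix by the Pearson dispersion */
      model r/n = ldose / lackfit scale=pearson aggregate=(ldose);
   run;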
Tolerance Distribution

For a single independent variable, such as a dosage level, the models for the probabilities can be justified on the basis of a population with mean µ and scale parameter σ of tolerances for the subjects. Then, given a dose x, the probability, P, of observing a response in a particular subject is the probability that the subject’s tolerance is less than the dose, or

\[P = F\left(\frac{x - \mu}{\sigma}\right)\]

Thus, in this case, the intercept parameter, $b_0$, and the regression parameter, $b_1$, are related to µ and σ by

\[b_1 = \frac{1}{\sigma}\]

\[b_0 = -\frac{\mu}{\sigma}\]
Note: The parameter σ is not equal to the standard deviation of the population of tolerances for the logistic and extreme value distributions.
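As a numeric check using the probit fit in Example 60.1, where MU = 0.5303 and SIGMA = 0.2926 are reported: $b_1 = 1/0.2926 \approx 3.418$ and $b_0 = -0.5303/0.2926 \approx -1.813$, matching the parameter estimates 3.4181 and −1.8127 in that output.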
Inverse Confidence Limits

In bioassay problems, estimates of the values of the independent variables that yield a desired response are often needed. For instance, the value yielding a 50% response rate (called the ED50 or LD50) is often used. The INVERSECL option requests that confidence limits be computed for the value of the independent variable that yields a specified response. These limits are computed only for the first continuous variable effect in the model. The other variables are set either at their mean values if they are continuous or at the reference (last) level if they are discrete variables. For a discussion of inverse confidence limits, refer to Hubert, Bohidar, and Peace (1988).

For the PROBIT procedure, the response variable is a probability. An estimate of the first continuous variable value needed to achieve a response of p is given by

\[\hat{x}_1 = \frac{1}{b_1}\left( F^{-1}(p) - \mathbf{x}^{*\prime} \mathbf{b}^{*} \right)\]

where F is the cumulative distribution function used to model the probability, $\mathbf{x}^{*}$ is the vector of independent variables excluding the first one, which can be specified by the XDATA= option described in the section “XDATA= SAS-data-set” on page 3763, $\mathbf{b}^{*}$ is the vector of parameter estimates excluding the first one, and $b_1$ is the estimated regression coefficient for the independent variable of interest. Note that, for both binary and ordinal models, the INVERSECL option provides estimates of the value of $x_1$ yielding Pr(first response level) = p, for various values of p.

This estimator is given as a ratio of random variables, for example, r = a/b. Confidence limits for this ratio can be computed using Fieller’s theorem. A brief description of this theorem follows; refer to Finney (1971) for a more complete description of Fieller’s theorem. If the random variables a and b are thought to be distributed as jointly normal, then for any fixed value r the following probability statement holds if z is an α/2 quantile from the standard normal distribution and V is the variance-covariance matrix of a and b:

\[\Pr\left[ (a - rb)^2 > z^2 \left( V_{aa} - 2 r V_{ab} + r^2 V_{bb} \right) \right] = \alpha\]

Usually the inequality can be solved for r to yield a confidence interval. The PROBIT procedure uses a value of 1.96 for z, corresponding to an α value of 0.05, unless the goodness-of-fit p-value is less than the specified value of the HPROB= option. When this happens, the covariance matrix is scaled by the heterogeneity factor, and a t distribution quantile is used for z.
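Carrying out the algebra (a standard step, not quoted from the procedure documentation itself): expanding the probability statement and collecting powers of r shows that the boundary values of r are the roots of the quadratic

\[\left(b^2 - z^2 V_{bb}\right) r^2 - 2\left(ab - z^2 V_{ab}\right) r + \left(a^2 - z^2 V_{aa}\right) = 0\]

so the confidence region consists of the values of r between (or, in the degenerate cases described next, outside) these roots.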
It is possible for the roots of the equation for r to be imaginary or for the confidence interval to be all points outside of an interval. In these cases, the limits are set to missing by the PROBIT procedure.

Although the normal and logistic distributions give comparable fitted values of p if the empirically observed proportions are not too extreme, they can give appreciably different values when extrapolated into the tails. Correspondingly, the estimates of the confidence limits and dose values can be different for the two distributions even when they agree quite well in the body of the data. Extrapolation outside of the range of the actual data is often sensitive to model assumptions, and caution is advised if extrapolation is necessary.
OUTEST= SAS-data-set

The OUTEST= data set contains parameter estimates and the log likelihood for the model. You can specify a label in the MODEL statement to distinguish between the estimates for different models fit using the PROBIT procedure. If you specify the COVOUT option, the OUTEST= data set also contains the estimated covariance matrix of the parameter estimates.

The OUTEST= data set contains each variable used as a dependent or independent variable in any MODEL statement. One observation consists of parameter values for the model with the dependent variable having the value −1. If you specify the COVOUT option, there are additional observations containing the rows of the estimated covariance matrix. For these observations, the dependent variable contains the parameter estimate for the corresponding row variable. The following variables are also added to the data set:
   _MODEL_     a character variable containing the label of the MODEL statement, if present, or blank otherwise

   _NAME_      a character variable containing the name of the dependent variable for the parameter estimates observations or the name of the row for the covariance matrix estimates

   _TYPE_      a character variable containing the type of the observation, either PARMS for parameter estimates or COV for covariance estimates

   _DIST_      a character variable containing the name of the distribution modeled

   _LNLIKE_    a numeric variable containing the last computed value of the log likelihood

   _C_         a numeric variable containing the estimated threshold parameter

   INTERCEPT   a numeric variable containing the intercept parameter estimates and covariances

Any BY variables specified are also added to the OUTEST= data set.
XDATA= SAS-data-set

The XDATA= data set is used for specifying values for the effects in the MODEL statement when predicted values and/or fiducial limits for a single continuous variable (dose variable) are required. It is also used for plots specified by the CDFPLOT, IPPPLOT, LPREDPLOT, and PREDPPLOT statements.

The XDATA= option names a SAS data set that contains user input values for all the independent variables in the MODEL statement and the variables in the CLASS statement. The XDATA= data set has the same structure as the DATA= data set but is not required to have all the variables or observations that appear in the DATA= data set.

The XDATA= data set must contain all the independent variables in the MODEL statement and variables in the CLASS statement. Even though variables in the CLASS statement may not be used in the MODEL statement, valid values are required for these variables in the XDATA= data set. Missing values are not allowed. For independent variables in the MODEL statement, although the dose variable’s value is not used in the computing of predicted values and/or fiducial limits for the dose variable, missing values are not allowed in the XDATA= data set for any of the independent variables. Missing values are allowed for the dependent variables and other variables if they are included in the XDATA= data set and not listed in the CLASS statement.

If BY processing is used, the XDATA= data set should also include the BY variables, and there must be at least one valid observation for each BY group. If there is more than one valid observation in one BY group, the last one read is used for that BY group.

If there is no XDATA= data set in the PROC PROBIT statement, by default, the PROBIT procedure will use the overall mean for effects containing a continuous variable (or variables) and the highest level of a single classification variable as the reference level. The rules are summarized as follows (a sketch of a suitable XDATA= data set follows the list):

• If the effect contains a continuous variable (or variables), the overall mean of this effect is used.
• If the effect is a single classification variable, the highest level of the variable is used.
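A sketch of such a data set for a hypothetical model with one classification variable, prep, and a dose variable, ldose (the dose value supplied is only a placeholder, since it is not used in the computation):

   data xvals;
      input prep $ ldose;
   datalines;
   stand 0
   ;

   proc probit data=assay xdata=xvals;
      class prep;
      model r/n = prep ldose / inversecl;
   run;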
Displayed Output

If you request the iteration history (ITPRINT), PROC PROBIT displays

• the current value of the log likelihood
• the ridging parameter for the modified Newton-Raphson optimization process
• the current estimate of the parameters
• the current estimate of the parameter C for a natural (threshold) model
• the values of the gradient and the Hessian on the last iteration
If you include CLASS variables, PROC PROBIT displays

• the numbers of levels for each CLASS variable
• the (ordered) values of the levels
• the number of observations used

After the model is fit, PROC PROBIT displays

• the name of the input data set
• the name of the dependent variables
• the number of observations used
• the number of events and the number of trials
• the final value of the log-likelihood function
• the parameter estimates
• the standard error estimates of the parameter estimates
• approximate chi-square test statistics for the test

If you specify the COVB or CORRB options, PROC PROBIT displays

• the estimated covariance matrix for the parameter estimates
• the estimated correlation matrix for the parameter estimates

If you specify the LACKFIT option, PROC PROBIT displays

• a count of the number of levels of the response and the number of distinct sets of independent variables
• a goodness-of-fit test based on the Pearson chi-square
• a goodness-of-fit test based on the likelihood-ratio chi-square

If you specify only one independent variable, the normal distribution is used to model the probabilities, and the response is binary, PROC PROBIT displays

• the mean MU of the stimulus tolerance
• the scale parameter SIGMA of the stimulus tolerance
• the covariance matrix for MU, SIGMA, and the natural response parameter C

If you specify the INVERSECL options, PROC PROBIT also displays

• the estimated dose along with the 95% fiducial limits for probability levels 0.01 to 0.10, 0.15 to 0.85 by 0.05, and 0.90 to 0.99
ODS Table Names

PROC PROBIT assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, “Using the Output Delivery System.”

Table 60.33. ODS Tables Produced in PROC PROBIT

   ODS Table Name       Description                                Statement   Option
   ClassLevels          Class variable levels                      CLASS       default
   ConvergenceStatus    Convergence status                         MODEL       default
   CorrB                Parameter estimate correlation matrix      MODEL       CORRB
   CovB                 Parameter estimate covariance matrix       MODEL       COVB
   CovTolerance         Covariance matrix for location and scale   MODEL       default∗
   GoodnessOfFit        Goodness of fit tests                      MODEL       LACKFIT
   IterHistory          Iteration history                          MODEL       ITPRINT
   LagrangeStatistics   Lagrange statistics                        MODEL       NOINT
   LastGrad             Last evaluation of the gradient            MODEL       ITPRINT
   LastHess             Last evaluation of the Hessian             MODEL       ITPRINT
   LogProbitAnalysis    Probit analysis for log dose               MODEL       INVERSECL
   ModelInfo            Model information                          MODEL       default
   MuSigma              Location and scale                         MODEL       default∗
   NObs                 Observations Summary                       PROC        default
   ParameterEstimates   Parameter estimates                        MODEL       default
   ParmInfo             Parameter indices                          MODEL       default
   ProbitAnalysis       Probit analysis for linear dose            MODEL       INVERSECL
   ResponseLevels       Response-covariate profile                 MODEL       LACKFIT
   ResponseProfiles     Counts for ordinal data                    MODEL       default
   Type3Analysis        Type 3 tests                               MODEL       default∗

∗ Depends on data.
Examples

Example 60.1. Dosage Levels

In this example, Dose is a variable representing the level of a stimulus, N represents the number of subjects tested at each level of the stimulus, and Response is the number of subjects responding to that level of the stimulus. Both probit and logit response models are fit to the data. The LOG10 option in the PROC PROBIT statement requests that the log base 10 of Dose is used as the independent variable. Specifically, for a given level of Dose, the probability p of a positive response is modeled as

\[p = \Pr(\text{Response}) = F\left(b_0 + b_1 \times \log_{10}(\text{Dose})\right)\]

The probabilities are estimated first using the normal distribution function (the default) and then using the logistic distribution function. Note that, in this model specification, the natural rate is assumed to be zero. The LACKFIT option specifies lack-of-fit tests and the INVERSECL option specifies inverse confidence limits.

In the DATA step that reads the data, a number of observations are generated that have a missing value for the response. Although the PROBIT procedure does not use the observations with the missing values to fit the model, it does give predicted values for all nonmissing sets of independent variables. These data points fill in the plot of fitted and observed values in the logistic model displayed in Output 60.1.2. The plot, requested with the PREDPPLOT statement, displays the estimated logistic cumulative distribution function and the observed response rates. The VAR=DOSE option specifies the horizontal axis variable in the plot.

The following statements produce Output 60.1.1:

   data a;
      infile cards eof=eof;
      input Dose N Response;
      Observed= Response/N;
      output;
      return;
      eof: do Dose=0.5 to 7.5 by 0.25;
         output;
      end;
   datalines;
   1 10 1
   2 12 2
   3 10 4
   4 10 5
   5 12 8
   6 10 8
   7 10 10
   ;
   proc probit log10;
      model Response/N=Dose / lackfit inversecl itprint;
      output out=B p=Prob std=std xbeta=xbeta;
      title 'Output from Probit Procedure';
   run;

   symbol v=dot c=white;

   proc probit log10;
      model Response/N=Dose / d=logistic inversecl;
      predpplot var=dose cfit=blue cframe=ligr inborder;
      output out=B p=Prob std=std xbeta=xbeta;
      title 'Output from Probit Procedure';
   run;
Output 60.1.1. Dosage Levels: PROC PROBIT

Output from Probit Procedure

Probit Procedure

Iteration History for Parameter Estimates

   Iter   Ridge   Loglikelihood   Intercept      Log10(Dose)
      0       0      -51.292891    0              0
      1       0      -37.881166   -1.355817008    2.635206083
      2       0      -37.286169   -1.764939171    3.3408954936
      3       0      -37.280389   -1.812147863    3.4172391614
      4       0      -37.280388   -1.812704962    3.418117919
      5       0      -37.280388   -1.812704962    3.418117919
Output from Probit Procedure

Probit Procedure

Model Information

   Data Set                  WORK.B
   Events Variable           Response
   Trials Variable           N
   Number of Observations    7
   Number of Events          38
   Number of Trials          74
   Missing Values            29
   Name of Distribution      Normal
   Log Likelihood            -37.28038802

Last Evaluation of the Negative of the Gradient

   Intercept     Log10(Dose)
   3.434907E-7   -2.09809E-8

Last Evaluation of the Negative of the Hessian

                 Intercept      Log10(Dose)
   Intercept     36.005280383   20.152675982
   Log10(Dose)   20.152675982   13.078826305

Goodness-of-Fit Tests

   Statistic            Value    DF   Pr > ChiSq
   Pearson Chi-Square   3.6497    5   0.6009
   L.R. Chi-Square      4.6381    5   0.4616

Response-Covariate Profile

   Response Levels              2
   Number of Covariate Values   7
The p-values in the Goodness-of-Fit table of 0.6009 for the Pearson chi-square and 0.4616 for the likelihood ratio chi-square indicate an adequate fit for the model fit with the normal distribution.
Output from Probit Procedure

Probit Procedure

Analysis of Parameter Estimates

                                 Standard   95% Confidence       Chi-
   Parameter     DF   Estimate   Error      Limits               Square   Pr > ChiSq
   Intercept      1   -1.8127    0.4493     -2.6934   -0.9320    16.27    <.0001
   Log10(Dose)    1    3.4181    0.7455      1.9569    4.8794    21.02    <.0001

Probit Model in Terms of Tolerance Distribution

   MU           SIGMA
   0.53032254   0.29255866

Estimated Covariance Matrix for Tolerance Parameters

            MU          SIGMA
   MU        0.002418   -0.000409
   SIGMA    -0.000409    0.004072
Tolerance distribution parameter estimates for the normal distribution indicate a mean tolerance for the population of 0.5303.
Output from Probit Procedure

Probit Procedure

Probit Analysis on Log10(Dose)

   Probability   Log10(Dose)   95% Fiducial Limits
   0.01          -0.15027      -0.69518   0.07710
   0.02          -0.07052      -0.55766   0.13475
   0.03          -0.01992      -0.47064   0.17156
   0.04           0.01814      -0.40534   0.19941
   0.05           0.04911      -0.35233   0.22218
   0.06           0.07546      -0.30731   0.24165
   0.07           0.09857      -0.26793   0.25881
   0.08           0.11926      -0.23273   0.27425
   0.09           0.13807      -0.20080   0.28837
   0.10           0.15539      -0.17147   0.30142
   0.15           0.22710      -0.05086   0.35631
   0.20           0.28410       0.04369   0.40124
   0.25           0.33299       0.12343   0.44116
   0.30           0.37690       0.19348   0.47857
   0.35           0.41759       0.25658   0.51504
   0.40           0.45620       0.31429   0.55182
   0.45           0.49356       0.36754   0.58999
   0.50           0.53032       0.41693   0.63057
   0.55           0.56709       0.46296   0.67451
   0.60           0.60444       0.50618   0.72271
   0.65           0.64305       0.54734   0.77603
   0.70           0.68374       0.58745   0.83550
   0.75           0.72765       0.62776   0.90265
   0.80           0.77655       0.66999   0.98008
   0.85           0.83354       0.71675   1.07279
   0.90           0.90525       0.77313   1.19191
   0.91           0.92257       0.78646   1.22098
   0.92           0.94139       0.80083   1.25265
   0.93           0.96208       0.81653   1.28759
   0.94           0.98519       0.83394   1.32672
   0.95           1.01154       0.85367   1.37149
   0.96           1.04250       0.87669   1.42424
   0.97           1.08056       0.90480   1.48928
   0.98           1.13116       0.94189   1.57602
   0.99           1.21092       0.99987   1.71321
The LD50 (ED50 for log dose) is 0.5303, the dose corresponding to a probability of 0.5. This is the same as the mean tolerance for the normal distribution.
Output from Probit Procedure

Probit Procedure

Probit Analysis on Dose

   Probability   Dose       95% Fiducial Limits
   0.01           0.70750   0.20175    1.19427
   0.02           0.85012   0.27691    1.36380
   0.03           0.95517   0.33834    1.48444
   0.04           1.04266   0.39324    1.58274
   0.05           1.11971   0.44429    1.66793
   0.06           1.18976   0.49282    1.74443
   0.07           1.25478   0.53960    1.81473
   0.08           1.31600   0.58515    1.88042
   0.09           1.37427   0.62980    1.94252
   0.10           1.43019   0.67380    2.00181
   0.15           1.68696   0.88950    2.27147
   0.20           1.92353   1.10584    2.51906
   0.25           2.15276   1.32870    2.76161
   0.30           2.38180   1.56128    3.01000
   0.35           2.61573   1.80543    3.27374
   0.40           2.85893   2.06200    3.56306
   0.45           3.11573   2.33098    3.89038
   0.50           3.39096   2.61175    4.27138
   0.55           3.69051   2.90374    4.72619
   0.60           4.02199   3.20759    5.28090
   0.65           4.39594   3.52651    5.97077
   0.70           4.82770   3.86765    6.84706
   0.75           5.34134   4.24385    7.99189
   0.80           5.97787   4.67724    9.55169
   0.85           6.81617   5.20900   11.82480
   0.90           8.03992   5.93105   15.55653
   0.91           8.36704   6.11584   16.63320
   0.92           8.73752   6.32165   17.89163
   0.93           9.16385   6.55431   19.39034
   0.94           9.66463   6.82245   21.21881
   0.95          10.26925   7.13949   23.52275
   0.96          11.02811   7.52816   26.56066
   0.97          12.03830   8.03149   30.85201
   0.98          13.52585   8.74763   37.67206
   0.99          16.25233   9.99709   51.66627
The ED50 for dose is 3.39 with a 95% confidence interval of (2.61, 4.27).
Plot of Observed and Fitted Probabilities

Probit Procedure

Model Information

   Data Set                  WORK.A
   Events Variable           Response
   Trials Variable           N
   Number of Observations    7
   Number of Events          38
   Number of Trials          74
   Missing Values            29
   Name of Distribution      Logistic
   Log Likelihood            -37.11065336
Algorithm converged.
Plot of Observed and Fitted Probabilities

Probit Procedure

Analysis of Parameter Estimates

                                 Standard   95% Confidence       Chi-
   Parameter     DF   Estimate   Error      Limits               Square   Pr > ChiSq
   Intercept      1   -3.2246    0.8861     -4.9613   -1.4880    13.24    0.0003
   Log10(Dose)    1    5.9702    1.4492      3.1299    8.8105    16.97    <.0001
The regression parameter estimates for the logistic model of −3.22 and 5.97 are approximately $\pi/\sqrt{3}$ times as large as those for the normal model.
Plot of Observed and Fitted Probabilities

Probit Procedure

Probit Analysis on Log10(Dose)

   Probability   Log10(Dose)   95% Fiducial Limits
   0.01          -0.22955      -0.97441   0.04234
   0.02          -0.11175      -0.75158   0.12404
   0.03          -0.04212      -0.62018   0.17265
   0.04           0.00780      -0.52618   0.20771
   0.05           0.04693      -0.45265   0.23533
   0.06           0.07925      -0.39205   0.25826
   0.07           0.10686      -0.34037   0.27796
   0.08           0.13103      -0.29521   0.29530
   0.09           0.15259      -0.25502   0.31085
   0.10           0.17209      -0.21875   0.32498
   0.15           0.24958      -0.07552   0.38207
   0.20           0.30792       0.03092   0.42645
   0.25           0.35611       0.11742   0.46451
   0.30           0.39820       0.19143   0.49932
   0.35           0.43644       0.25684   0.53275
   0.40           0.47221       0.31588   0.56619
   0.45           0.50651       0.36986   0.60089
   0.50           0.54013       0.41957   0.63807
   0.55           0.57374       0.46559   0.67894
   0.60           0.60804       0.50846   0.72474
   0.65           0.64381       0.54896   0.77673
   0.70           0.68205       0.58815   0.83637
   0.75           0.72414       0.62752   0.90582
   0.80           0.77233       0.66915   0.98876
   0.85           0.83067       0.71631   1.09242
   0.90           0.90816       0.77562   1.23343
   0.91           0.92766       0.79014   1.26931
   0.92           0.94922       0.80607   1.30912
   0.93           0.97339       0.82378   1.35391
   0.94           1.00100       0.84384   1.40523
   0.95           1.03332       0.86713   1.46546
   0.96           1.07245       0.89511   1.53864
   0.97           1.12237       0.93053   1.63228
   0.98           1.19200       0.97952   1.76329
   0.99           1.30980       1.06166   1.98569
Plot of Observed and Fitted Probabilities

Probit Procedure
Probit Analysis on Dose

[Table: estimated Dose and 95% fiducial limits for probability levels 0.01 through 0.99. At probability 0.50 the estimated dose is 3.46837, with fiducial limits 2.62768 and 4.34578.]
Both the ED50 and the LD50 are similar to those for the normal model. The PREDPPLOT statement creates the plot of observed and fitted probabilities in Output 60.1.2. The dashed lines represent pointwise confidence bands for the probabilities.
Output 60.1.2. Plot of Observed and Fitted Probabilities
Example 60.2. Multilevel Response

In this example, two preparations, a standard preparation and a test preparation, are each given at several dose levels to groups of insects. The symptoms are recorded for each insect within each group, and two multilevel probit models are fit. Because the natural sort order of the three levels is not the same as the response order, the ORDER=DATA option is specified in the PROC PROBIT statement to get the desired order. The following statements produce Output 60.2.1:

data multi;
   input Prep $ Dose Symptoms $ N;
   LDose=log10(Dose);
   if Prep='test' then PrepDose=LDose;
   else PrepDose=0;
   datalines;
stand  10  None    33
stand  10  Mild     7
stand  10  Severe  10
stand  20  None    17
stand  20  Mild    13
stand  20  Severe  17
stand  30  None    14
stand  30  Mild     3
stand  30  Severe  28
stand  40  None     9
stand  40  Mild     8
stand  40  Severe  32
test   10  None    44
test   10  Mild     6
test   10  Severe   0
test   20  None    32
test   20  Mild    10
test   20  Severe  12
test   30  None    23
test   30  Mild     7
test   30  Severe  21
test   40  None    16
test   40  Mild     6
test   40  Severe  19
;
proc probit order=data;
   class Prep Symptoms;
   nonpara: model Symptoms=Prep LDose PrepDose / lackfit;
   weight N;
   title 'Probit Models for Symptom Severity';
run;

proc probit order=data;
   class Prep Symptoms;
   parallel: model Symptoms=Prep LDose / lackfit;
   weight N;
   title 'Probit Models for Symptom Severity';
run;
The first model allows for nonparallelism between the dose response curves for the two preparations by including an interaction between Prep and LDose. The interaction term is labeled PrepDose in the "Analysis of Parameter Estimates" table. The results of this first model indicate that the parameter for the interaction term is not significant, having a Wald chi-square of 0.73. Also, since the first model is a generalization of the second, a likelihood ratio test statistic for this same parameter can be obtained by multiplying the difference in log likelihoods between the two models by 2. The value obtained, 2 × (−345.94 − (−346.31)) = 0.73, is in close agreement with the Wald chi-square from the first model; a small sketch of this computation follows. The lack-of-fit test statistics for the two models do not indicate a problem with either fit.
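A minimal sketch (not part of the original example) of the likelihood ratio computation, using the log likelihoods reported in the two "Model Information" tables below:

   data lrtest;
      ll_full    = -345.9401767;   /* model with the PrepDose interaction */
      ll_reduced = -346.306141;    /* parallel-lines model                */
      chisq  = 2*(ll_full - ll_reduced);   /* likelihood ratio chi-square */
      pvalue = 1 - probchi(chisq, 1);      /* 1 degree of freedom         */
   run;

   proc print data=lrtest;
   run;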
Output 60.2.1. Multilevel Response: PROC PROBIT

Probit Models for Symptom Severity
Probit Procedure

Class Level Information
Name       Levels  Values
Prep            2  stand test
Symptoms        3  None Mild Severe
Probit Models for Symptom Severity
Probit Procedure

Model Information
Data Set                  WORK.MULTI
Dependent Variable        Symptoms
Weight Variable           N
Number of Observations    23
Missing Values            1
Name of Distribution      Normal
Log Likelihood            -345.9401767
Probit Models for Symptom Severity
Probit Procedure

Analysis of Parameter Estimates
Parameter    DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept     1    3.8080          0.6252    2.5827       5.0333       37.10      <.0001
Intercept2    1    0.4684          0.0559    0.3589       0.5780       70.19      <.0001
Prep stand    1   -1.2573          0.8190   -2.8624       0.3479        2.36      0.1247
Prep test     0    0.0000          0.0000    0.0000       0.0000           .           .
LDose         1   -2.1512          0.3909   -2.9173      -1.3851       30.29      <.0001
PrepDose      1   -0.5072          0.5945   -1.6724       0.6580        0.73      0.3935

Probit Models for Symptom Severity
Probit Procedure

Class Level Information
Name       Levels  Values
Prep            2  stand test
Symptoms        3  None Mild Severe
Probit Models for Symptom Severity
Probit Procedure

Model Information
Data Set                  WORK.MULTI
Dependent Variable        Symptoms
Weight Variable           N
Number of Observations    23
Missing Values            1
Name of Distribution      Normal
Log Likelihood            -346.306141
Probit Models for Symptom Severity
Probit Procedure

Analysis of Parameter Estimates
Parameter    DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept     1    3.4148          0.4126    2.6061       4.2235       68.50      <.0001
Intercept2    1    0.4678          0.0558    0.3584       0.5772       70.19      <.0001
Prep stand    1   -0.5675          0.1259   -0.8142      -0.3208       20.33      <.0001
Prep test     0    0.0000          0.0000    0.0000       0.0000           .           .
LDose         1   -2.3721          0.2949   -2.9502      -1.7940       64.68      <.0001
The negative coefficient associated with LDose indicates that the probability of having no symptoms (Symptoms='None') or no or mild symptoms (Symptoms='None' or Symptoms='Mild') decreases as LDose increases; that is, the probability of a severe symptom increases with LDose. This association is apparent for both treatment groups. The negative coefficient associated with the standard treatment group (Prep = stand) indicates that the standard treatment is associated with more severe symptoms across all LDose values.

The following statements use the PREDPPLOT statement to create the plot, shown in Output 60.2.2, of the probabilities of the response taking on individual levels as a function of LDose. Since there are two covariates, LDose and Prep, the value of the CLASS variable Prep is fixed at the highest level, test. Although not shown here, the CDFPLOT statement creates similar plots of the cumulative response probabilities, instead of individual response level probabilities (a sketch of this request follows Output 60.2.2).

proc probit data=multi order=data;
   class Prep Symptoms;
   parallel: model Symptoms=Prep LDose / lackfit;
   predpplot var=ldose level=("None" "Mild" "Severe")
             cfit=blue cframe=ligr inborder noconf;
   weight N;
   title 'Probit Models for Symptom Severity';
run;

Output 60.2.2. Plot of Predicted Probabilities for the Test Preparation Group
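As a hedged sketch of the CDFPLOT request mentioned above (the plot options shown simply mirror the PREDPPLOT call; they are assumptions, not taken from the original example):

   proc probit data=multi order=data;
      class Prep Symptoms;
      parallel: model Symptoms=Prep LDose / lackfit;
      cdfplot var=ldose cfit=blue cframe=ligr inborder noconf;
      weight N;
      title 'Probit Models for Symptom Severity';
   run;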
The following statements use the XDATA= data set to create a plot of the predicted probabilities with Prep set to the stand level. The resulting plot is shown in Output 60.2.3.

data xrow;
   input Prep $ Dose Symptoms $ N;
   LDose=log10(Dose);
   datalines;
stand 40 Severe 32
;
run;

proc probit data=multi order=data xdata=xrow;
   class Prep Symptoms;
   parallel: model Symptoms=Prep LDose / lackfit;
   predpplot var=ldose level=("None" "Mild" "Severe")
             cfit=blue cframe=ligr inborder noconf;
   weight N;
   title 'Predicted Probabilities for Standard Preparation';
run;
Output 60.2.3. Plot of Predicted Probabilities for the Standard Preparation Group
Example 60.3. Logistic Regression

In this example, a series of people are questioned as to whether or not they would subscribe to a new newspaper. For each person, the variables sex (Female, Male), age, and subs (1=yes, 0=no) are recorded. The PROBIT procedure is used to fit a logistic regression model to the probability of a positive response (subscribing) as a function of the variables sex and age. Specifically, the probability of subscribing is modeled as

   p = Pr(subs = 1) = F(b0 + b1 × sex + b2 × age)

where F is the cumulative logistic distribution function. By default, the PROBIT procedure models the probability of the lower response level for binary data. One way to model Pr(subs = 1) is to format the response variable so that the formatted value corresponding to subs=1 is the lower level. The following statements format the values of subs as 1 = 'accept' and 0 = 'reject', so that PROBIT models Pr(accept) = Pr(subs = 1). The following statements produce Output 60.3.1:

data news;
   input sex $ age subs;
   datalines;
Female 35 0
Male   44 0
Male   45 1
Female 47 1
Female 51 0
Female 47 0
Male   54 1
Male   47 1
Female 35 0
Female 34 0
Female 48 0
Female 56 1
Male   46 1
Female 59 1
Female 46 1
Male   59 1
Male   38 1
Female 39 0
Male   49 1
Male   42 1
Male   50 1
Female 45 0
Female 47 0
Female 30 1
Female 39 0
Female 51 0
Female 45 0
Female 43 1
Male   39 1
Male   31 0
Female 39 0
Male   34 0
Female 52 1
Female 46 0
Male   58 1
Female 50 1
Female 32 0
Female 52 1
Female 35 0
Female 51 0
;
proc format;
   value subscrib 1 = 'accept'
                  0 = 'reject';
run;

proc probit;
   class subs sex;
   model subs=sex age / d=logistic itprint;
   format subs subscrib.;
   title 'Logistic Regression of Subscription Status';
run;
Output 60.3.1. Logistic Regression: PROC PROBIT

Logistic Regression of Subscription Status
Probit Procedure

Class Level Information
Name   Levels  Values
subs        2  accept reject
sex         2  Female Male
PROC PROBIT is modeling the probabilities of levels of subs having LOWER Ordered Values in the response profile table.
Logistic Regression of Subscription Status
Probit Procedure

Iteration History for Parameter Estimates
Iter  Ridge  Loglikelihood  Intercept      sexFemale      age
   0      0  -27.725887      0              0             0
   1      0  -20.142659     -3.634567629   -1.648455751   0.1051634384
   2      0  -19.52245      -5.254865196   -2.234724956   0.1506493473
   3      0  -19.490439     -5.728485385   -2.409827238   0.1639621828
   4      0  -19.490303     -5.76187293    -2.422349862   0.1649007124
   5      0  -19.490303     -5.7620267     -2.422407743   0.1649050312
   6      0  -19.490303     -5.7620267     -2.422407743   0.1649050312
Model Information
Data Set                  WORK.NEWS
Dependent Variable        subs
Number of Observations    40
Name of Distribution      Logistic
Log Likelihood            -19.49030281
PROC PROBIT is modeling the probabilities of levels of subs having LOWER Ordered Values in the response profile table.
Logistic Regression of Subscription Status
Probit Procedure

PROC PROBIT is modeling the probabilities of levels of subs having LOWER Ordered Values in the response profile table.
Last Evaluation of the Negative of the Gradient
Intercept       sexFemale       age
-5.95457E-12    8.768328E-10    -1.636696E-8
Last Evaluation of the Negative of the Hessian
              Intercept      sexFemale      age
Intercept     6.4597397447   4.6042218284   292.04051848
sexFemale     4.6042218284   4.6042218284   216.20829515
age           292.04051848   216.20829515   13487.329973
Algorithm converged.
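The asymptotic covariance matrix of the parameter estimates is the inverse of this negative Hessian, and the standard errors are the square roots of its diagonal. If SAS/IML is available, a minimal sketch of the check (not part of the original example):

   proc iml;
      /* negative Hessian from the table above */
      H = {   6.4597397447    4.6042218284   292.04051848,
              4.6042218284    4.6042218284   216.20829515,
            292.04051848    216.20829515  13487.329973 };
      cov = inv(H);              /* asymptotic covariance matrix     */
      se  = sqrt(vecdiag(cov));  /* about 2.76, 0.96, 0.065, matching
                                    the standard errors shown below  */
      print se;
   quit;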
Logistic Regression of Subscription Status
Probit Procedure

PROC PROBIT is modeling the probabilities of levels of subs having LOWER Ordered Values in the response profile table.

Analysis of Parameter Estimates
Parameter    DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept     1   -5.7620          2.7635  -11.1783      -0.3458        4.35      0.0371
sex Female    1   -2.4224          0.9559   -4.2959      -0.5489        6.42      0.0113
sex Male      0    0.0000          0.0000    0.0000       0.0000           .           .
age           1    0.1649          0.0652    0.0371       0.2927        6.40      0.0114
From Output 60.3.1, there appears to be an effect due to both the variables sex and age. The positive coefficient for age indicates that older people are more likely to subscribe than younger people. The negative coefficient for sex indicates that females are less likely to subscribe than males.
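As a rough illustration (not part of the original example), the fitted model can be evaluated directly; here the estimates from Output 60.3.1 give the predicted probability that a 45-year-old male subscribes:

   data pred;
      eta = -5.7620 + 0.1649*45;   /* sex=Male is the reference level */
      p   = logistic(eta);         /* about 0.84                      */
   run;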
Example 60.4. An Epidemiology Study

The data, which are from an epidemiology study, consist of five variables: the number, r, of individuals surviving after an epidemic, out of n treated, for combinations of medicine dosage (dose), treatment (treat = A, B), and sex (sex = 0 (Female), 1 (Male)). To see whether the two treatments have different effects on male and female individual survival rates, the interaction term between the two variables treat and sex is included in the model. The following invocation of PROC PROBIT fits the binary probit model to the grouped data:

data epidemic;
   input treat $ dose n r sex;
   label dose = Dose;
   datalines;
A 2.17 142 142 0
A  .57 132  47 1
A 1.68 128 105 1
A 1.08 126 100 0
A 1.79 125 118 0
B 1.66 117 115 1
B 1.49 127 114 0
B 1.17  51  44 1
B 2.00 127 126 0
B  .80 129 100 1
;

data xval;
   input treat $ dose sex;
   datalines;
B 2. 1
;

title 'Epidemiology Study';
proc probit optc lackfit covout
     data = epidemic outest = out1 xdata = xval;
   class treat sex;
   model r/n = dose treat sex sex*treat / corrb covb inversecl;
   output out = out2 p = p;
   predpplot var = dose
             font = swiss
             vref(intersect) = .6667
             vreflab = 'two thirds'
             vreflabpos = 2
             cfit=blue cframe=ligr;
   inset / cfill = white ctext = blue pos = se;
   ippplot font = swiss
           href(intersect) = .75
           hreflab = 'three quarters'
           vreflabpos = 2
           threshlabpos = 2
           cfit=blue cframe=ligr;
   inset / cfill = white ctext = blue;
   lpredplot font = swiss
             vref(intersect) = 1.
             vreflab = 'unit probit'
             vreflabpos = 2
             cfit=blue cframe=ligr;
   inset / cfill = white ctext = blue;
run;
The results of this analysis are shown in the following tables and figures. Beginning with SAS Release 8.2, the PROBIT procedure does not support multiple MODEL statements; only the last one is used if there is more than one MODEL statement in one invocation of the PROBIT procedure.

Output 60.4.1. Class Level Information

Epidemiology Study
Probit Procedure

Class Level Information
Name   Levels  Values
treat       2  A B
sex         2  0 1
Output 60.4.1 displays the table of level information for all classification variables in the CLASS statement.
Output 60.4.2. Parameter Information

Epidemiology Study
Probit Procedure

Parameter Information
Parameter    Effect      treat  sex
Intercept    Intercept
dose         dose
treatA       treat       A
treatB       treat       B
sex0         sex                0
sex1         sex                1
treatAsex0   treat*sex   A      0
treatAsex1   treat*sex   A      1
treatBsex0   treat*sex   B      0
treatBsex1   treat*sex   B      1
Output 60.4.2 displays the table of parameter information for the effects in the MODEL statement. The name of a parameter is generated by combining the variable names and level names in the effect; the maximum length of a parameter name is 32. The names of the effects are specified in the MODEL statement. The length of effect names can be specified by the NAMELEN= option in the PROC PROBIT statement, with a default length of 20.

Output 60.4.3. Model Information

Epidemiology Study
Probit Procedure

Model Information
Data Set                  WORK.EPIDEMIC
Events Variable           r
Trials Variable           n
Number of Observations    10
Number of Events          1011
Number of Trials          1204
Name of Distribution      Normal
Log Likelihood            -387.2467391
Algorithm converged.
Output 60.4.3 displays background information about the model fit. Included are the name of the input data set, the response variables used, and the number of observations, events, and trials. The table also includes the status of the convergence of the model fitting algorithm and the final value of the log-likelihood function.
Output 60.4.4. Goodness-of-Fit Tests and Response-Covariate Profile

Epidemiology Study
Probit Procedure

Goodness-of-Fit Tests
Statistic            Value   DF  Pr > ChiSq
Pearson Chi-Square  4.9317    4      0.2944
L.R. Chi-Square     5.7079    4      0.2220

Response-Covariate Profile
Response Levels               2
Number of Covariate Values   10
Output 60.4.4 displays the table of goodness-of-fit tests requested with the LACKFIT option in the PROC PROBIT statement. Two goodness-of-fit statistics, the Pearson chi-square statistic and the likelihood ratio chi-square statistic, are computed. The grouping method for computing these statistics can be specified by the AGGREGATE= option; details can be found in the description of the AGGREGATE= option, and an example can be found in the second part of this example. By default, the PROBIT procedure uses the covariates in the MODEL statement to do the grouping: observations with the same values of the covariates in the MODEL statement are grouped into cells, and the two statistics are computed according to these cells. The total number of cells and the number of levels for the response variable are reported next in the "Response-Covariate Profile."

In this example, neither the Pearson chi-square nor the likelihood ratio chi-square test is significant at the 0.1 level, which is the default test level used by the PROBIT procedure. This means that the model, which includes the interaction of treat and sex, is suitable for this epidemiology data set. (Further investigation shows that models without the interaction of treat and sex are not acceptable by either test.)

Output 60.4.5. Type III Tests

Epidemiology Study
Probit Procedure

Type III Analysis of Effects
Effect       DF  Wald Chi-Square  Pr > ChiSq
dose          1          42.1691      <.0001
treat         1          16.1421      <.0001
sex           1           1.7710      0.1833
treat*sex     1          13.9343      0.0002
Output 60.4.5 displays the Type III test results for all effects specified in the MODEL statement, which include the degrees of freedom for the effect, the Wald Chi-Square test statistic, and the p-value.
Output 60.4.6. Analysis of Parameter Estimates

Epidemiology Study
Probit Procedure

Analysis of Parameter Estimates
Parameter       DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept        1   -0.8871          0.3632   -1.5991     -0.1752        5.96      0.0146
dose             1    1.6774          0.2583    1.1711      2.1837       42.17      <.0001
treat A          1   -1.2537          0.2616   -1.7664     -0.7410       22.97      <.0001
treat B          0    0.0000          0.0000    0.0000      0.0000           .           .
sex 0            1   -0.4633          0.2289   -0.9119     -0.0147        4.10      0.0429
sex 1            0    0.0000          0.0000    0.0000      0.0000           .           .
treat*sex A 0    1    1.2899          0.3456    0.6126      1.9672       13.93      0.0002
treat*sex A 1    0    0.0000          0.0000    0.0000      0.0000           .           .
treat*sex B 0    0    0.0000          0.0000    0.0000      0.0000           .           .
treat*sex B 1    0    0.0000          0.0000    0.0000      0.0000           .           .
_C_              1    0.2735          0.0946    0.0881      0.4589
Output 60.4.6 displays the table of parameter estimates for the model. The PROBIT procedure displays information for all the parameters of an effect. Degenerate parameters are indicated by zero degrees of freedom. Confidence intervals are computed for all parameters with nonzero degrees of freedom, including the natural threshold C if the OPTC option is specified in the PROC PROBIT statement. The confidence level can be specified by the ALPHA= option in the MODEL statement; the default confidence level is 95%. From this table, you can see the following results:

• dose has a significant positive effect on the survival rate.
• Individuals under treatment A have a lower survival rate.
• Male individuals have a higher survival rate.
• Female individuals under treatment A have a higher survival rate.
Output 60.4.7. Estimated Covariance Matrix

Epidemiology Study
Probit Procedure

Estimated Covariance Matrix
             Intercept       dose     treatA       sex0  treatAsex0        _C_
Intercept     0.131944  -0.087353   0.053551   0.030285   -0.067056  -0.028073
dose         -0.087353   0.066723  -0.047506  -0.034081    0.058620   0.018196
treatA        0.053551  -0.047506   0.068425   0.036063   -0.075323  -0.017084
sex0          0.030285  -0.034081   0.036063   0.052383   -0.063599  -0.008088
treatAsex0   -0.067056   0.058620  -0.075323  -0.063599    0.119408   0.019134
_C_          -0.028073   0.018196  -0.017084  -0.008088    0.019134   0.008948
Output 60.4.8. Estimated Correlation Matrix

Epidemiology Study
Probit Procedure

Estimated Correlation Matrix
             Intercept       dose     treatA       sex0  treatAsex0        _C_
Intercept     1.000000  -0.930998   0.563595   0.364284   -0.534227  -0.817027
dose         -0.930998   1.000000  -0.703083  -0.576477    0.656744   0.744699
treatA        0.563595  -0.703083   1.000000   0.602359   -0.833299  -0.690420
sex0          0.364284  -0.576477   0.602359   1.000000   -0.804154  -0.373565
treatAsex0   -0.534227   0.656744  -0.833299  -0.804154    1.000000   0.585364
_C_          -0.817027   0.744699  -0.690420  -0.373565    0.585364   1.000000
Output 60.4.7 and Output 60.4.8 display the estimated covariance matrix and the estimated correlation matrix, respectively, for the estimated parameters with nonzero degrees of freedom. Both are computed from the inverse of the Hessian matrix of the estimated parameters.
Output 60.4.9. Probit Analysis on Dose

Epidemiology Study
Probit Procedure
Probit Analysis on dose

[Table: estimated dose and 95% fiducial limits for probability levels 0.01 through 0.99. At probability 0.50 the estimate is 0.52888, with fiducial limits 0.14481 and 0.75654.]
Output 60.4.9 displays the computed values and fiducial limits for the first single continuous variable dose in the MODEL statement, given the probability levels, without the effect of the natural threshold, when the INVERSECL option in the MODEL statement is specified. If there is no single continuous variable in the MODEL specification but the INVERSECL option is specified, an error is reported. If the XDATA= option is used to input a data set for the independent variables in the MODEL statement, the PROBIT procedure uses these values for the independent variables other than the single continuous variable. Missing values are not permitted in the XDATA= data set for the independent variables, although the value for the single continuous variable is not used in computing the fiducial limits; a suitable valid value should be given. In the data set xval created by the SAS statements on page 3784, Dose = 2.
See the section "XDATA= SAS-data-set" on page 3763 for the default values for those effects other than the single continuous variable for which the fiducial limits are computed. In this example, there are two classification variables, treat and sex. Fiducial limits for the dose variable are computed for the highest level of the classification variables, treat = B and sex = 1, which is the default specification; since these are the default values, you would get the same values and fiducial limits if you did not specify the XDATA= option in this example. The confidence level for the fiducial limits can be specified by the ALPHA= option in the MODEL statement; the default level is 95%. If a LOG10 or LOG option is used in the PROC PROBIT statement, the values and the fiducial limits are computed for both the single continuous variable and its logarithm.

Output 60.4.10. Outest Data Set for Epidemiology Study

[Listing of the OUTEST= data set Out1 (12 observations): identification columns (_MODEL_, _NAME_, _TYPE_, _DIST_ = Normal, _STATUS_ = 0 Converged, _LNLIKE_ = -387.247) followed by one column per parameter (Intercept, dose, treatA, treatB, sex0, sex1, treatAsex0, treatAsex1, treatBsex0, treatBsex1, _C_). The first observation (_TYPE_=PARMS) holds the parameter estimates; the remaining observations (_TYPE_=COV) hold the estimated covariance matrix.]
Output 60.4.10 displays the OUTEST= data set. All parameters for an effect are included. The following three outputs, Output 60.4.11, Output 60.4.12, and Output 60.4.13, are generated from the three plot statements. The first plot, specified with the PREDPPLOT statement, is the plot of the predicted probability against the single continuous variable Dose, which is specified by the VAR= option in the PREDPPLOT statement. This single continuous variable must be in the MODEL statement. If the VAR= option is not used, the first single continuous variable in the MODEL statement is used. In this example, you would get the same plot if the VAR = dose was not used in the PREDPPLOT statement. You can specify values of other independent
variables in the MODEL statement using an XDATA= data set, or by using the default values. The second plot, specified with the IPPPLOT statement, is the inverse of the predicted probability plot with the fiducial limits. It should be pointed out that the fiducial limits are not just the inverse of the confidence limits in the predicted probability plot; see the section "Inverse Confidence Limits" on page 3761 for the computation of these limits. The third plot, specified with the LPREDPLOT statement, is the plot of the linear predictor x′β against the first single continuous variable (or the single continuous variable specified by the VAR= option) with the Wald confidence intervals. After each plot statement, an optional INSET statement is used to draw a box within the plot (inset box). In the inset box, information about the model fitting can be specified. See "INSET Statement" on page 3723 for more detail.

Output 60.4.11. Predicted Probability Plot
Output 60.4.12. Inverse Predicted Probability Plot
Output 60.4.13. Linear Predictor Plot
Combining an INEST= data set and the MAXIT= option in the MODEL statement, the PROBIT procedure can do prediction, provided the parameterizations for the models used for the training data and the validation data are exactly the same. After the first invocation of PROC PROBIT, you have the estimated parameters and their covariance matrix in the data set OUTEST=Out1, and the fitted probabilities for the training data set epidemic in the data set OUTPUT=Out2. See Output 60.4.10 on page 3791 for the data set Out1 and Output 60.4.14 on page 3795 for the data set Out2.
The validation data are collected in the data set validate. The second invocation of PROC PROBIT simply passes the estimated parameters from the training data set epidemic to the validation data set validate for prediction; the predicted probabilities are stored in the data set OUTPUT=Out3 (see Output 60.4.15 on page 3795). The third invocation of PROC PROBIT passes the estimated parameters as initial values for a new fit of the validation data set using the same model; predicted probabilities are stored in the data set OUTPUT=Out4 (see Output 60.4.16 on page 3795). Goodness-of-fit tests are computed based on the cells grouped by the AGGREGATE= group variable. Results are shown in Output 60.4.17 on page 3796.

data validate;
   input treat $ dose sex n r group;
   datalines;
B 2.0 0 44 43 1
B 2.0 1 54 52 2
B 1.5 1 36 32 3
B 1.5 0 45 40 4
A 2.0 0 66 64 5
A 2.0 1 89 89 6
A 1.5 1 45 39 7
A 1.5 0 66 60 8
B 2.0 0 44 44 1
B 2.0 1 54 54 2
B 1.5 1 36 30 3
B 1.5 0 45 41 4
A 2.0 0 66 65 5
A 2.0 1 89 88 6
A 1.5 1 45 38 7
A 1.5 0 66 59 8
;

proc probit optc data = validate inest = out1;
   class treat sex;
   model r/n = dose treat sex sex*treat / maxit = 0;
   output out = out3 p = p;
run;

proc probit optc lackfit data = validate inest = out1;
   class treat sex;
   model r/n = dose treat sex sex*treat / aggregate = group;
   output out = out4 p = p;
run;
Output 60.4.14. Out2

Obs  treat  dose    n    r  sex        p
  1  A      2.17  142  142    0  0.99272
  2  A      0.57  132   47    1  0.35925
  3  A      1.68  128  105    1  0.81899
  4  A      1.08  126  100    0  0.77517
  5  A      1.79  125  118    0  0.96682
  6  B      1.66  117  115    1  0.97901
  7  B      1.49  127  114    0  0.90896
  8  B      1.17   51   44    1  0.89749
  9  B      2.00  127  126    0  0.98364
 10  B      0.80  129  100    1  0.76414
Output 60.4.15. Out3

Obs  treat  dose  sex   n   r  group        p
  1  B       2.0    0  44  43      1  0.98364
  2  B       2.0    1  54  52      2  0.99506
  3  B       1.5    1  36  32      3  0.96247
  4  B       1.5    0  45  40      4  0.91145
  5  A       2.0    0  66  64      5  0.98500
  6  A       2.0    1  89  89      6  0.91835
  7  A       1.5    1  45  39      7  0.74300
  8  A       1.5    0  66  60      8  0.91666
  9  B       2.0    0  44  44      1  0.98364
 10  B       2.0    1  54  54      2  0.99506
 11  B       1.5    1  36  30      3  0.96247
 12  B       1.5    0  45  41      4  0.91145
 13  A       2.0    0  66  65      5  0.98500
 14  A       2.0    1  89  88      6  0.91835
 15  A       1.5    1  45  38      7  0.74300
 16  A       1.5    0  66  59      8  0.91666
Output 60.4.16. Out4

Obs  treat  dose  sex   n   r  group        p
  1  B       2.0    0  44  43      1  0.98954
  2  B       2.0    1  54  52      2  0.98262
  3  B       1.5    1  36  32      3  0.86187
  4  B       1.5    0  45  40      4  0.90095
  5  A       2.0    0  66  64      5  0.98768
  6  A       2.0    1  89  89      6  0.98614
  7  A       1.5    1  45  39      7  0.88075
  8  A       1.5    0  66  60      8  0.88964
  9  B       2.0    0  44  44      1  0.98954
 10  B       2.0    1  54  54      2  0.98262
 11  B       1.5    1  36  30      3  0.86187
 12  B       1.5    0  45  41      4  0.90095
 13  A       2.0    0  66  65      5  0.98768
 14  A       2.0    1  89  88      6  0.98614
 15  A       1.5    1  45  38      7  0.88075
 16  A       1.5    0  66  59      8  0.88964
Output 60.4.17. Goodness-of-Fit Table

Probit Procedure

Goodness-of-Fit Tests
Statistic            Value   DF  Pr > ChiSq
Pearson Chi-Square  2.8101    2      0.2454
L.R. Chi-Square     2.8080    2      0.2456
Chapter 61
The REG Procedure

Chapter Contents

OVERVIEW  3799

GETTING STARTED  3800
   Simple Linear Regression  3800
   Polynomial Regression  3804
   Using PROC REG Interactively  3812

SYNTAX  3813
   PROC REG Statement  3815
   ADD Statement  3819
   BY Statement  3819
   DELETE Statement  3820
   FREQ Statement  3820
   ID Statement  3821
   MODEL Statement  3821
   MTEST Statement  3832
   OUTPUT Statement  3833
   PAINT Statement  3835
   PLOT Statement  3839
   PRINT Statement  3851
   REFIT Statement  3852
   RESTRICT Statement  3852
   REWEIGHT Statement  3854
   TEST Statement  3858
   VAR Statement  3859
   WEIGHT Statement  3859

DETAILS  3859
   Missing Values  3859
   Input Data Sets  3860
   Output Data Sets  3863
   Interactive Analysis  3869
   Model-Selection Methods  3873
   Criteria Used in Model-Selection Methods  3876
   Limitations in Model-Selection Methods  3877
   Parameter Estimates and Associated Statistics  3877
   Predicted and Residual Values  3879
   Line Printer Scatter Plot Features  3882
   Models of Less Than Full Rank  3893
   Collinearity Diagnostics  3895
   Model Fit and Diagnostic Statistics  3896
   Influence Diagnostics  3898
   Reweighting Observations in an Analysis  3903
   Testing for Heteroscedasticity  3910
   Multivariate Tests  3910
   Autocorrelation in Time Series Data  3915
   Computations for Ridge Regression and IPC Analysis  3916
   Construction of Q-Q and P-P Plots  3917
   Computational Methods  3917
   Computer Resources in Regression Analysis  3917
   Displayed Output  3918
   ODS Table Names  3920
   ODS Graphics (Experimental)  3922

EXAMPLES  3924
   Example 61.1. Aerobic Fitness Prediction  3924
   Example 61.2. Predicting Weight by Height and Age  3938
   Example 61.3. Regression with Quantitative and Qualitative Variables  3943
   Example 61.4. Displaying Plots for Simple Linear Regression  3948
   Example 61.5. Creating a Cp Plot  3949
   Example 61.6. Controlling Plot Appearance  3950
   Example 61.7. Plotting Model Diagnostic Statistics  3952
   Example 61.8. Creating PP and QQ Plots  3953
   Example 61.9. Displaying Confidence and Prediction Intervals  3955
   Example 61.10. Displaying the Ridge Trace for Acetylene Data  3956
   Example 61.11. Plotting Variance Inflation Factors  3958
   Example 61.12. ODS Graphics  3960

REFERENCES  3966
Chapter 61
The REG Procedure

Overview

The REG procedure is one of many regression procedures in the SAS System. It is a general-purpose procedure for regression, while other SAS regression procedures provide more specialized applications. Other SAS/STAT procedures that perform at least one type of regression analysis are the CATMOD, GENMOD, GLM, LOGISTIC, MIXED, NLIN, ORTHOREG, PROBIT, RSREG, and TRANSREG procedures. SAS/ETS procedures are specialized for applications in time-series or simultaneous systems. These other SAS/STAT regression procedures are summarized in Chapter 2, "Introduction to Regression Procedures," which also contains an overview of regression techniques and defines many of the statistics computed by PROC REG and other regression procedures.

PROC REG provides the following capabilities:

• multiple MODEL statements
• nine model-selection methods
• interactive changes both in the model and the data used to fit the model
• linear equality restrictions on parameters
• tests of linear hypotheses and multivariate hypotheses
• collinearity diagnostics
• predicted values, residuals, studentized residuals, confidence limits, and influence statistics
• correlation or crossproduct input
• requested statistics available for output through output data sets
• experimental ODS graphics, now available; for more information, see the "ODS Graphics" section on page 3922 (these plots are available in addition to the line-printer and the traditional high-resolution plots currently available in PROC REG)
• plots
   − plot model fit summary statistics and diagnostic statistics
   − produce normal quantile-quantile (Q-Q) and probability-probability (P-P) plots for statistics such as residuals
   − specify special shorthand options to plot ridge traces, confidence intervals, and prediction intervals
   − display the fitted model equation, summary statistics, and reference lines on the plot
   − control the graphics appearance with PLOT statement options and with global graphics statements, including the TITLE, FOOTNOTE, NOTE, SYMBOL, and LEGEND statements
   − "paint" or highlight line-printer scatter plots
   − produce partial regression leverage line-printer plots
Nine model-selection methods are available in PROC REG. In the simplest method, PROC REG fits the complete model that you specify. The other eight methods involve various ways of including or excluding variables from the model. You specify these methods with the SELECTION= option in the MODEL statement (a short example follows the list). The methods are identified in the following list and are explained in detail in the "Model-Selection Methods" section on page 3873.

NONE       no model selection. This is the default. The complete model specified in the MODEL statement is fit to the data.
FORWARD    forward selection. This method starts with no variables in the model and adds variables.
BACKWARD   backward elimination. This method starts with all variables in the model and deletes variables.
STEPWISE   stepwise regression. This is similar to the FORWARD method except that variables already in the model do not necessarily stay there.
MAXR       forward selection to fit the best one-variable model, the best two-variable model, and so on. Variables are switched so that R2 is maximized.
MINR       similar to the MAXR method, except that variables are switched so that the increase in R2 from adding a variable to the model is minimized.
RSQUARE    finds a specified number of models with the highest R2 in a range of model sizes.
ADJRSQ     finds a specified number of models with the highest adjusted R2 in a range of model sizes.
CP         finds a specified number of models with the lowest Cp in a range of model sizes.
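For instance, a minimal sketch of requesting one of these methods (the data set and variable names here are hypothetical; SLENTRY= and SLSTAY= set the entry and stay significance levels):

   proc reg data=mydata;
      model y = x1 x2 x3 x4 x5 / selection=stepwise slentry=0.15 slstay=0.15;
   run;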
Getting Started

Simple Linear Regression

Suppose that a response variable Y can be predicted by a linear function of a regressor variable X. You can estimate β0, the intercept, and β1, the slope, in

   Yi = β0 + β1 Xi + εi
for the observations i = 1, 2, . . . , n. Fitting this model with the REG procedure requires only the following MODEL statement, where y is the outcome variable and x is the regressor variable.

proc reg;
   model y=x;
run;

For example, you might use regression analysis to find out how well you can predict a child's weight if you know that child's height. The following data are from a study of nineteen children. Height and weight are measured for each child.

title 'Simple Linear Regression';
data Class;
   input Name $ Height Weight Age @@;
   datalines;
Alfred  69.0 112.5 14   Alice  56.5  84.0 13   Barbara 65.3  98.0 13
Carol   62.8 102.5 14   Henry  63.5 102.5 14   James   57.3  83.0 12
Jane    59.8  84.5 12   Janet  62.5 112.5 15   Jeffrey 62.5  84.0 13
John    59.0  99.5 12   Joyce  51.3  50.5 11   Judy    64.3  90.0 14
Louise  56.3  77.0 12   Mary   66.5 112.0 15   Philip  72.0 150.0 16
Robert  64.8 128.0 12   Ronald 67.0 133.0 15   Thomas  57.5  85.0 11
William 66.5 112.0 15
;

The equation of interest is

   Weight = β0 + β1 Height + ε
The variable Weight is the response or dependent variable in this equation, and β0 and β1 are the unknown parameters to be estimated. The variable Height is the regressor or independent variable, and ε is the unknown error. The following commands invoke the REG procedure and fit this model to the data.

proc reg;
   model Weight = Height;
run;
Figure 61.1 includes some information concerning model fit.
Simple Linear Regression

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Analysis of Variance
Source             DF  Sum of Squares  Mean Square  F Value  Pr > F
Model               1      7193.24912   7193.24912    57.08  <.0001
Error              17      2142.48772    126.02869
Corrected Total    18      9335.73684

Root MSE          11.22625    R-Square  0.7705
Dependent Mean   100.02632    Adj R-Sq  0.7570
Coeff Var         11.22330
Figure 61.1. ANOVA Table
The F statistic for the overall model is highly significant (F =57.076, p<0.0001), indicating that the model explains a significant portion of the variation in the data. The degrees of freedom can be used in checking accuracy of the data and model. The model degrees of freedom are one less than the number of parameters to be estimated. This model estimates two parameters, β0 and β1; thus, the degrees of freedom should be 2 − 1 = 1. The corrected total degrees of freedom are always one less than the total number of observations in the data set, in this case 19 − 1 = 18. Several simple statistics follow the ANOVA table. The Root MSE is an estimate of the standard deviation of the error term. The coefficient of variation, or Coeff Var, is a unitless expression of the variation in the data. The R-Square and Adj R-Square are two statistics used in assessing the fit of the model; values close to 1 indicate a better fit. The R-Square of 0.77 indicates that Height accounts for 77% of the variation in Weight. The "Parameter Estimates" table shown in Figure 61.2 contains the estimates of β0 and β1. The table also contains the t statistics and the corresponding p-values for testing whether each parameter is significantly different from zero. The p-values (t = −4.43, p = 0.0004 and t = 7.55, p < 0.0001) indicate that the intercept and Height parameter estimates, respectively, are highly significant.
Simple Linear Regression

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Parameter Estimates
Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1          -143.02692        32.27459    -4.43    0.0004
Height       1             3.89903         0.51609     7.55    <.0001
Figure 61.2. Parameter Estimates
From the parameter estimates, the fitted model is Weight = −143.0 + 3.9 × Height
The REG procedure can be used interactively. After you specify a model with the MODEL statement and submit the PROC REG statements, you can submit further statements without reinvoking the procedure. The following command can now be issued to request a plot of the residual versus the predicted values, as shown in Figure 61.3.

plot r.*p.;
run;
Figure 61.3. Plot of Residual vs. Predicted Values
A trend in the residuals would indicate nonconstant variance in the data. Figure 61.3 may indicate a slight trend in the residuals; they appear to increase slightly as the predicted values increase. A fan-shaped trend may indicate the need for a variancestabilizing transformation. A curved trend (such as a semi-circle) may indicate the need for a quadratic term in the model. Since these residuals have no apparent trend, the analysis is considered to be acceptable.
Polynomial Regression

Consider a response variable Y that can be predicted by a polynomial function of a regressor variable X. You can estimate β0, the intercept, β1, the slope due to X, and β2, the slope due to X², in

   Yi = β0 + β1 Xi + β2 Xi² + εi

for the observations i = 1, 2, . . . , n. Consider the following example on population growth trends. The population of the United States from 1790 to 2000 is fit to linear and quadratic functions of time. Note that the quadratic term, YearSq, is created in the DATA step; this is done since polynomial effects such as Year*Year cannot be specified in the MODEL statement in PROC REG. The data are as follows:

data USPopulation;
   input Population @@;
   retain Year 1780;
   Year=Year+10;
   YearSq=Year*Year;
   Population=Population/1000;
   datalines;
3929 5308 7239 9638 12866 17069 23191 31443
39818 50155 62947 75994 91972 105710 122775
131669 151325 179323 203211 226542 248710
281422
;
The following statements begin the analysis. (Influence diagnostics and autocorrelation information for the full model are shown in Figure 61.43 on page 3900 and Figure 61.57 on page 3916.)

symbol1 c=blue;
proc reg data=USPopulation;
   var YearSq;
   model Population=Year / r cli clm;
   plot r.*p. / cframe=ligr;
run;
The DATA= option ensures that the procedure uses the intended data set. Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement. In the MODEL statement, three options are specified:
R requests a residual analysis to be performed, CLI requests 95% confidence limits for an individual value, and CLM requests these limits for the expected value of the dependent variable. You can request specific 100(1 − α)% limits with the ALPHA= option in the PROC REG or MODEL statement. A plot of the residuals against the predicted values is requested by the PLOT statement. The ANOVA table is displayed in Figure 61.4.

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Analysis of Variance
Source             DF  Sum of Squares  Mean Square  F Value  Pr > F
Model               1          146869       146869   228.92  <.0001
Error              20           12832    641.58160
Corrected Total    21          159700

Root MSE          25.32946    R-Square  0.9197
Dependent Mean    94.64800    Adj R-Sq  0.9156
Coeff Var         26.76175

Parameter Estimates
Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1         -2345.85498       161.39279   -14.54    <.0001
Year         1             1.28786         0.08512    15.13    <.0001
Figure 61.4. ANOVA Table and Parameter Estimates
The Model F statistic is significant (F =228.92, p<0.0001), indicating that the model accounts for a significant portion of variation in the data. The R-Square indicates that the model accounts for 92% of the variation in population growth. The fitted equation for this model is

   Population = −2345.85 + 1.29 × Year
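A quick check in a DATA step (a sketch, not part of the original text) shows how far this linear fit falls short at the end of the series; compare with the last row of Figure 61.5:

   data check;
      Year = 2000;
      PredPop = -2345.85498 + 1.28786*Year;  /* about 229.9 */
      /* the observed 2000 population is 281.422 (millions) */
   run;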
Figure 61.5 shows the confidence limits for both individual and expected values resulting from the CLM and CLI options.
The REG Procedure
Model: MODEL1
Dependent Variable: Population

[Output Statistics table: for each of the 22 observations, the observed Population, the predicted value, the standard error of the mean predicted value, the 95% CL Mean, and the 95% CL Predict. For example, for 2000 (observation 22) the observed value is 281.4220 and the predicted value is 229.8738, with standard error 10.4424, 95% CL Mean (208.0913, 251.6562), and 95% CL Predict (172.7235, 287.0240).]
Figure 61.5. Confidence Limits
The observed dependent variable is displayed for each observation along with its predicted value from the regression equation and the standard error of the mean predicted value. The 95% CL Mean columns are the confidence limits for the expected value of each observation. The 95% CL Predict columns are the confidence limits for the individual observations. Figure 61.6 displays the residual analysis requested by the R option.
Output Statistics

[Residual analysis table: for each of the 22 observations, the residual, its standard error, the studentized residual with a line-printer plot of its magnitude, and Cook's D. The studentized residuals range from about −1.02 to 2.23, and the largest Cook's D values occur at the first and last observations (0.381 and 0.511).]

Sum of Residuals                       0
Sum of Squared Residuals           12832
Predicted Residual SS (PRESS)      16662
Figure 61.6. Residual Analysis
The residual, its standard error, and the studentized residuals are displayed for each observation. The studentized residual is the residual divided by its standard error. The magnitude of each studentized residual is shown in a plot. Studentized residuals follow a t distribution and can be used to identify outlying or extreme observations. Asterisks (*) extending beyond the dashed lines indicate that the residual is more than three standard errors from zero. Many observations having absolute studentized residuals greater than 2 may indicate an inadequate model. The wave pattern seen in this plot is also an indication that the model is inadequate; a quadratic term may be needed or autocorrelation may be present in the data. Cook’s D is a measure of the change in the predicted values upon deletion of that observation from the data set; hence, it measures the influence of the observation on the estimated regression coefficients. A fairly close agreement between the PRESS statistic (see Table 61.6 on page 3897) and the Sum of Squared Residuals indicates that the MSE is a reasonable measure of the predictive accuracy of the fitted model (Neter, Wasserman, and Kutner, 1990). A plot of the residuals versus predicted values is shown in Figure 61.7.
Figure 61.7. Plot of Residual vs. Predicted Values
The wave pattern of the studentized residual plot is seen here again. The semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends. Using the interactive feature of PROC REG, the following commands add the variable YearSq to the independent variables and refit the model.

add YearSq;
print;
plot / cframe=ligr;
run;
The ADD statement requests that YearSq be added to the model, and the PRINT statement displays the ANOVA table for the new model. The PLOT statement with no variables recreates the most recent plot requested, in this case a plot of residuals versus predicted values. Figure 61.8 displays the ANOVA table and estimates for the new model.
The REG Procedure
Model: MODEL1.1
Dependent Variable: Population

Analysis of Variance
Source             DF  Sum of Squares  Mean Square  F Value  Pr > F
Model               2          159529        79765  8864.19  <.0001
Error              19       170.97193      8.99852
Corrected Total    21          159700

Root MSE           2.99975    R-Square  0.9989
Dependent Mean    94.64800    Adj R-Sq  0.9988
Coeff Var          3.16938

Parameter Estimates
Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1               21631       639.50181    33.82    <.0001
Year         1           -24.04581         0.67547   -35.60    <.0001
YearSq       1             0.00668      0.00017820    37.51    <.0001
Figure 61.8. ANOVA Table and Parameter Estimates
The overall F statistic is still significant (F =8864.19, p<0.0001). The R-square has increased from 0.9197 to 0.9989, indicating that the model now accounts for 99.9% of the variation in Population. All effects are significant, with p<0.0001 for each effect in the model. The fitted equation is now

   Population = 21631 − 24.046 × Year + 0.0067 × YearSq
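A minimal sketch (not from the original text) of scoring the quadratic fit at a new year; the full-precision estimates from Figure 61.8 are used, since the rounded coefficients above would shift the result slightly:

   data score;
      Year = 2010;
      YearSq = Year*Year;
      PredPop = 21631 - 24.04581*Year + 0.00668*YearSq;  /* about 287 (millions) */
   run;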
The confidence limits and residual analysis for the second model are displayed in Figure 61.9.
The REG Procedure
Model: MODEL1.1
Dependent Variable: Population

[Output Statistics table: for each of the 22 observations, the observed Population, the predicted value, the standard error of the mean predicted value, the 95% CL Mean, and the 95% CL Predict for the quadratic model; the predicted values now track the observed values closely (for example, 276.6642 predicted versus 281.4220 observed for 2000).]

[Residual analysis table: residuals, standard errors, studentized residuals with a line-printer plot, and Cook's D for each observation. The wave pattern is gone; the two largest studentized residuals in magnitude are −2.632 and −2.601 (observations 16 and 17).]

Sum of Residuals                 -4.4596E-11
Sum of Squared Residuals           170.97193
Predicted Residual SS (PRESS)      237.71229
Figure 61.9. Confidence Limits and Residual Analysis
The plot of the studentized residuals shows that the wave structure is gone. The PRESS statistic is much closer to the Sum of Squared Residuals now, and both statistics have been dramatically reduced. Most of the Cook’s D statistics have also been reduced.
Figure 61.10. Plot of Residual vs. Predicted Values
The plot of residuals versus predicted values seen in Figure 61.10 has improved, since a major trend is no longer visible. To create a plot of the observed values, predicted values, and confidence limits against Year all on the same plot, and to exert some control over the look of the resulting plot, you can submit the following statements.

symbol1 v=dot c=yellow h=.3;
symbol2 v=square c=red;
symbol3 f=simplex c=blue h=2 v='-';
symbol4 f=simplex c=blue h=2 v='-';
plot (Population predicted. u95. l95.)*Year / overlay cframe=ligr;
run;
Figure 61.11. Plot of Population vs Year with Confidence Limits
The SYMBOL statements request that the actual data be displayed as dots, the predicted values as squares, and the upper and lower 95% confidence limits for an individual value (sometimes called a prediction interval) as dashes. PROC REG provides the short-hand commands CONF and PRED to request confidence and prediction intervals for simple regression models; see the "PLOT Statement" section on page 3839 for details. To complete an analysis of these data, you may want to examine influence statistics and, since the data are essentially time series data, examine the Durbin-Watson statistic. You might also want to examine other residual plots, such as the residuals vs. regressors.
Using PROC REG Interactively

PROC REG can be used interactively. After you specify a model with a MODEL statement and run REG with a RUN statement, a variety of statements can be executed without reinvoking REG. The "Interactive Analysis" section on page 3869 describes which statements can be used interactively. These interactive statements can be executed singly or in groups by following the single statement or group of statements with a RUN statement. Note that the MODEL statement can be repeated. This is an important difference from the GLM procedure, which allows only one MODEL statement.

If you use REG interactively, you can end the REG procedure with a DATA step, another PROC step, an ENDSAS statement, or with a QUIT statement. The syntax of the QUIT statement is

   quit;
When you are using REG interactively, additional RUN statements do not end REG but tell the procedure to execute additional statements.

When a BY statement is used with PROC REG, interactive processing is not possible; that is, once the first RUN statement is encountered, processing proceeds for each BY group in the data set, and no further statements are accepted by the procedure.

When using REG interactively, you can fit a model, perform diagnostics, then refit the model, and perform diagnostics on the refitted model. Most of the interactive statements implicitly refit the model; for example, if you use the ADD statement to add a variable to the model, the regression equation is automatically recomputed. The two exceptions to this automatic recomputing are the PAINT and REWEIGHT statements. These two statements do not cause the model to be refitted. To do so, you can follow these statements either with a REFIT statement, which causes the model to be explicitly recomputed, or with another interactive statement that causes the model to be implicitly recomputed.
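For concreteness, here is a minimal sketch of such an interactive session. It is illustrative only (not part of the original chapter text) and assumes the SASHELP.CLASS sample data set shipped with SAS; note that Age is listed in a VAR statement so that the ADD statement can add it later.

   proc reg data=sashelp.class;
      var Age;                 /* listed so that it can be ADDed later */
      model Weight = Height;   /* initial model */
   run;
   add Age;                    /* implicitly refits the model with Age added */
   print;
   run;
   reweight Weight > 100;      /* excludes these observations; does not refit */
   refit;                      /* explicitly recomputes the model */
   print;
   run;
   quit;

The QUIT statement (or a new DATA or PROC step) ends the procedure.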
Syntax

The following statements are available in PROC REG.

   PROC REG < options > ;
      < label: > MODEL dependents=<regressors> < / options > ;
      BY variables ;
      FREQ variable ;
      ID variables ;
      VAR variables ;
      WEIGHT variable ;
      ADD variables ;
      DELETE variables ;
      < label: > MTEST < equation , . . . , equation > < / options > ;
      OUTPUT < OUT=SAS-data-set > keyword=names < . . . keyword=names > ;
      PAINT < condition | ALLOBS > < / options > | < STATUS | UNDO > ;
      PLOT < yvariable*xvariable > < =symbol > < . . . yvariable*xvariable > < =symbol > < / options > ;
      PRINT < options > < ANOVA > < MODELDATA > ;
      REFIT ;
      RESTRICT equation < , . . . , equation > ;
      REWEIGHT < condition | ALLOBS > < / options > | < STATUS | UNDO > ;
      < label: > TEST equation < , . . . , equation > < / option > ;
Although there are numerous statements and options available in PROC REG, many analyses use only a few of them. Often you can find the features you need by looking at an example or by scanning this section. In the preceding list, brackets denote optional specifications, and vertical bars denote a choice of one of the specifications separated by the vertical bars. In all cases, label is optional.

The PROC REG statement is required. To fit a model to the data, you must specify the MODEL statement. If you want to use only the options available in the PROC REG statement, you do not need a MODEL statement, but you must use a VAR statement. (See the example in the "OUTSSCP= Data Sets" section on page 3868.)

Several MODEL statements can be used. In addition, several MTEST, OUTPUT, PAINT, PLOT, PRINT, RESTRICT, and TEST statements can follow each MODEL statement. The ADD, DELETE, and REWEIGHT statements are used interactively to change the regression model and the data used in fitting the model. The ADD, DELETE, MTEST, OUTPUT, PLOT, PRINT, RESTRICT, and TEST statements implicitly refit the model; changes made to the model are reflected in the results from these statements. The REFIT statement is used to refit the model explicitly and is most helpful when it follows PAINT and REWEIGHT statements, which do not refit the model. The BY, FREQ, ID, VAR, and WEIGHT statements are optionally specified once for the entire PROC step, and they must appear before the first RUN statement.

When TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets are used as input data sets to PROC REG, statements and options that require the original data are not available. Specifically, the OUTPUT, PAINT, PLOT, and REWEIGHT statements and the MODEL and PRINT statement options P, R, CLM, CLI, DW, DWPROB, INFLUENCE, and PARTIAL are disabled.

You can specify the following statements with the REG procedure in addition to the PROC REG statement:

ADD
adds independent variables to the regression model.
BY
specifies variables to define subgroups for the analysis.
DELETE
deletes independent variables from the regression model.
FREQ
specifies a frequency variable.
ID
names a variable to identify observations in the tables.
MODEL
specifies the dependent and independent variables in the regression model, requests a model selection method, displays predicted values, and provides details on the estimates (according to which options are selected).
MTEST
performs multivariate tests across multiple dependent variables.
OUTPUT
creates an output data set and names the variables to contain predicted values, residuals, and other diagnostic statistics.
PAINT
paints points in scatter plots.
PLOT
generates scatter plots.
PRINT
displays information about the model and can reset options.
REFIT
refits the model.
RESTRICT
places linear equality restrictions on the parameter estimates.
REWEIGHT
excludes specific observations from analysis or changes the weights of observations used.
TEST
performs an F test on linear functions of the parameters.
VAR
lists variables for which crossproducts are to be computed, variables that can be interactively added to the model, or variables to be used in scatter plots.
WEIGHT
declares a variable to weight observations.
PROC REG Statement

PROC REG < options > ;

The PROC REG statement is required. If you want to fit a model to the data, you must also use a MODEL statement. If you want to use only the PROC REG options, you do not need a MODEL statement, but you must use a VAR statement. If you do not use a MODEL statement, then the COVOUT and OUTEST= options are not available. Table 61.1 lists the options you can use with the PROC REG statement. Note that any option specified in the PROC REG statement applies to all MODEL statements.

Table 61.1. PROC REG Statement Options

Data Set Options
   DATA=         names a data set to use for the regression
   OUTEST=       outputs a data set that contains parameter estimates and other model fit summary statistics
   OUTSSCP=      outputs a data set that contains sums of squares and crossproducts
   COVOUT        outputs the covariance matrix for parameter estimates to the OUTEST= data set
   EDF           outputs the number of regressors, the error degrees of freedom, and the model R2 to the OUTEST= data set
   OUTSTB        outputs standardized parameter estimates to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
   OUTSEB        outputs standard errors of the parameter estimates to the OUTEST= data set
   OUTVIF        outputs the variance inflation factors to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
   PCOMIT=       performs incomplete principal component analysis and outputs estimates to the OUTEST= data set
   PRESS         outputs the PRESS statistic to the OUTEST= data set
   RIDGE=        performs ridge regression analysis and outputs estimates to the OUTEST= data set
   RSQUARE       same effect as the EDF option
   TABLEOUT      outputs standard errors, confidence limits, and associated test statistics of the parameter estimates to the OUTEST= data set

High Resolution Graphics Options
   ANNOTATE=     specifies an annotation data set
   GOUT=         specifies the graphics catalog in which graphics output is saved

Display Options
   CORR          displays correlation matrix for variables listed in MODEL and VAR statements
   SIMPLE        displays simple statistics for each variable listed in MODEL and VAR statements
   USSCP         displays uncorrected sums of squares and crossproducts matrix
   ALL           displays all statistics (CORR, SIMPLE, and USSCP)
   NOPRINT       suppresses output
   LINEPRINTER   creates requested plots as line printer plots

Other Options
   ALPHA=        sets significance value for confidence and prediction intervals and tests
   SINGULAR=     sets criterion for checking for singularity
Following are explanations of the options that you can specify in the PROC REG statement (in alphabetical order). Note that any option specified in the PROC REG statement applies to all MODEL statements.

ALL
requests the display of many tables. Using the ALL option in the PROC REG statement is equivalent to specifying ALL in every MODEL statement. The ALL option also implies the CORR, SIMPLE, and USSCP options.

ALPHA=number
sets the significance level used for the construction of confidence intervals. The value must be between 0 and 1; the default value of 0.05 results in 95% intervals. This option affects the PROC REG option TABLEOUT; the MODEL options CLB, CLI, and CLM; the OUTPUT statement keywords LCL, LCLM, UCL, and UCLM; the PLOT statement keywords LCL., LCLM., UCL., and UCLM.; and the PLOT statement options CONF and PRED.

ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an input data set containing annotate variables, as described in SAS/GRAPH Software: Reference. You can use this data set to add features to plots. Features provided in this data set are applied to all plots produced in the current run of PROC REG. To add features to individual plots, use the ANNOTATE= option in the PLOT statement. This option cannot be used if the LINEPRINTER option is specified.
CORR
displays the correlation matrix for all variables listed in the MODEL or VAR statement.

COVOUT
outputs the covariance matrices for the parameter estimates to the OUTEST= data set. This option is valid only if the OUTEST= option is also specified. See the "OUTEST= Data Set" section on page 3863.

DATA=SAS-data-set
names the SAS data set to be used by PROC REG. The data set can be an ordinary SAS data set or a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set. If one of these special TYPE= data sets is used, the OUTPUT, PAINT, PLOT, and REWEIGHT statements and some options in the MODEL and PRINT statements are not available. See Appendix A, "Special SAS Data Sets," for more information on TYPE= data sets. If the DATA= option is not specified, PROC REG uses the most recently created SAS data set.

EDF
outputs the number of regressors in the model excluding and including the intercept, the error degrees of freedom, and the model R2 to the OUTEST= data set.

GOUT=graphics-catalog
specifies the graphics catalog in which graphics output is saved. The default graphics catalog is WORK.GSEG. The GOUT= option cannot be used if the LINEPRINTER option is specified.

LINEPRINTER | LP
creates plots requested as line printer plots. If you do not specify this option, requested plots are created on a high resolution graphics device. This option is required if plots are requested and you do not have SAS/GRAPH software.

NOPRINT
suppresses the normal display of results. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, "Using the Output Delivery System," for more information.

OUTEST=SAS-data-set
requests that parameter estimates and optional model fit summary statistics be output to this data set. See the "OUTEST= Data Set" section on page 3863 for details. If you want to create a permanent SAS data set, you must specify a two-level name (refer to the section "SAS Files" in SAS Language Reference: Concepts for more information on permanent SAS data sets).

OUTSEB
outputs the standard errors of the parameter estimates to the OUTEST= data set. The value SEB for the variable _TYPE_ identifies the standard errors. If the RIDGE= or PCOMIT= option is specified, additional observations are included and identified by the values RIDGESEB and IPCSEB, respectively, for the variable _TYPE_. The standard errors for ridge regression estimates and IPC estimates are limited in their usefulness because these estimates are biased. This option is available for all model selection methods except RSQUARE, ADJRSQ, and CP.

OUTSSCP=SAS-data-set
requests that the sums of squares and crossproducts matrix be output to this TYPE=SSCP data set. See the "OUTSSCP= Data Sets" section on page 3868 for details. If you want to create a permanent SAS data set, you must specify a two-level name (refer to the section "SAS Files" in SAS Language Reference: Concepts for more information on permanent SAS data sets).
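As a brief sketch (illustrative, using the SASHELP.CLASS sample data set), the following step uses only PROC REG statement options, so a VAR statement replaces the MODEL statement:

   proc reg data=sashelp.class outsscp=sscp;
      var Height Weight Age;   /* no MODEL statement is needed for OUTSSCP= */
   run;
   quit;

   proc print data=sscp;       /* sscp is a TYPE=SSCP data set */
   run;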
OUTSTB
outputs the standardized parameter estimates as well as the usual estimates to the OUTEST= data set when the RIDGE= or PCOMIT= option is specified. The values RIDGESTB and IPCSTB for the variable _TYPE_ identify ridge regression estimates and IPC estimates, respectively.

OUTVIF
outputs the variance inflation factors (VIF) to the OUTEST= data set when the RIDGE= or PCOMIT= option is specified. The factors are the diagonal elements of the inverse of the correlation matrix of regressors as adjusted by ridge regression or IPC analysis. These observations are identified in the output data set by the values RIDGEVIF and IPCVIF for the variable _TYPE_.

PCOMIT=list
requests an incomplete principal components (IPC) analysis for each value m in the list. The procedure computes parameter estimates using all but the last m principal components. Each value of m produces a set of IPC estimates, which are output to the OUTEST= data set. The values of m are saved by the variable _PCOMIT_, and the value of the variable _TYPE_ is set to IPC to identify the estimates. Only nonnegative integers can be specified with the PCOMIT= option. If you specify the PCOMIT= option, RESTRICT statements are ignored.

PRESS
outputs the PRESS statistic to the OUTEST= data set. The values of this statistic are saved in the variable _PRESS_. This option is available for all model selection methods except RSQUARE, ADJRSQ, and CP.

RIDGE=list
requests a ridge regression analysis and specifies the values of the ridge constant k (see the "Computations for Ridge Regression and IPC Analysis" section on page 3916). Each value of k produces a set of ridge regression estimates that are placed in the OUTEST= data set. The values of k are saved by the variable _RIDGE_, and the value of the variable _TYPE_ is set to RIDGE to identify the estimates. Only nonnegative numbers can be specified with the RIDGE= option. Example 61.10 on page 3956 illustrates this option. If ODS graphics are in effect (see the "ODS Graphics" section on page 3922), then ridge regression plots are automatically produced. These plots consist of panels containing ridge traces for the regressors, with at most eight ridge traces per panel.
If you specify the RIDGE= option, RESTRICT statements are ignored.
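For example, here is a minimal sketch of a ridge analysis over a small grid of k values (illustrative only; the variable names are from the SASHELP.CLASS sample data set):

   proc reg data=sashelp.class outest=ridge_est outvif outstb
            ridge=0 0.01 0.05 0.1;
      model Weight = Height Age;
   run;
   quit;

   proc print data=ridge_est;  /* one set of estimates per ridge constant k */
   run;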
RSQUARE
has the same effect as the EDF option.

SIMPLE
displays the sum, mean, variance, standard deviation, and uncorrected sum of squares for each variable used in PROC REG.

SINGULAR=n
tunes the mechanism used to check for singularities. The default value is machine dependent but is approximately 1E−7 on most machines. This option is rarely needed. Singularity checking is described in the "Computational Methods" section on page 3917.

TABLEOUT
outputs the standard errors and 100(1 − α)% confidence limits for the parameter estimates, the t statistics for testing if the estimates are zero, and the associated p-values to the OUTEST= data set. The _TYPE_ variable values STDERR, LnB, UnB, T, and PVALUE, where n = 100(1 − α), identify these rows in the OUTEST= data set. The α-level can be set with the ALPHA= option in the PROC REG or MODEL statement. The OUTEST= option must be specified in the PROC REG statement for this option to take effect.
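A minimal sketch (again using the SASHELP.CLASS sample data set; with ALPHA=0.1 the confidence-limit rows are labeled L90B and U90B):

   proc reg data=sashelp.class outest=est tableout alpha=0.1;
      model Weight = Height Age;
   run;
   quit;

   proc print data=est;   /* _TYPE_ rows: PARMS, STDERR, T, PVALUE, L90B, U90B */
   run;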
USSCP
displays the uncorrected sums-of-squares and crossproducts matrix for all variables used in the procedure.
ADD Statement

ADD variables ;

The ADD statement adds independent variables to the regression model. Only variables used in the VAR statement or used in MODEL statements before the first RUN statement can be added to the model. You can use the ADD statement interactively to add variables to the model or to include a variable that was previously deleted with a DELETE statement. Each use of the ADD statement modifies the MODEL label. See the "Interactive Analysis" section on page 3869 for an example.
BY Statement

BY variables ;

You can specify a BY statement with PROC REG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in the order of the BY variables. The variables are one or more variables in the input data set. If your input data set is not sorted in ascending order, use one of the following alternatives.

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the REG procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure (in base SAS software).
When a BY statement is used with PROC REG, interactive processing is not possible; that is, once the first RUN statement is encountered, processing proceeds for each BY group in the data set, and no further statements are accepted by the procedure. A BY statement that appears after the first RUN statement is ignored. For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
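A minimal sketch of a BY-group analysis (illustrative, assuming the SASHELP.CLASS sample data set):

   proc sort data=sashelp.class out=class_sorted;
      by Sex;
   run;

   proc reg data=class_sorted;
      by Sex;                  /* one regression per BY group */
      model Weight = Height;
   run;
   quit;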
DELETE Statement

DELETE variables ;

The DELETE statement deletes independent variables from the regression model. The DELETE statement performs the opposite function of the ADD statement and is used in a similar manner. Each use of the DELETE statement modifies the MODEL label. For an example of how the ADD statement is used (and how the DELETE statement can be used), see the "Interactive Analysis" section on page 3869.
FREQ Statement

FREQ variable ;

When a FREQ statement appears, each observation in the input data set is assumed to represent n observations, where n is the value of the FREQ variable. The analysis produced using a FREQ statement is the same as an analysis produced using a data set that contains n observations in place of each observation in the input data set. When the procedure determines degrees of freedom for significance tests, the total number of observations is considered to be equal to the sum of the values of the FREQ variable.

If the value of the FREQ variable is missing or is less than 1, the observation is not used in the analysis. If the value is not an integer, only the integer portion is used. The FREQ statement must appear before the first RUN statement, or it is ignored.
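A minimal sketch with hypothetical grouped data, where the variable Count plays the role of the FREQ variable (the data set and variable names are illustrative only):

   data grouped;               /* hypothetical grouped data */
      input x y Count;
      datalines;
   1 10 3
   2 14 5
   3 19 2
   ;
   run;

   proc reg data=grouped;
      freq Count;              /* each row counts as Count observations */
      model y = x;
   run;
   quit;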
ID Statement

ID variables ;

When one of the MODEL statement options CLI, CLM, P, R, or INFLUENCE is requested, the variables listed in the ID statement are displayed beside each observation. These variables can be used to identify each observation. If the ID statement is omitted, the observation number is used to identify the observations. Although there are no restrictions on the length of ID variables, PROC REG may truncate ID values to 16 characters for display purposes.
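For example (a sketch using the SASHELP.CLASS sample data set, where Name identifies each student):

   proc reg data=sashelp.class;
      id Name;                     /* Name is printed beside each observation */
      model Weight = Height / r;   /* the R option produces observation-level output */
   run;
   quit;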
MODEL Statement

< label: > MODEL dependents=<regressors> < / options > ;

Table 61.2 lists the options available in the MODEL statement.

Table 61.2. MODEL Statement Options

Model Selection and Details of Selection
   SELECTION=    specifies model selection method
   BEST=         specifies maximum number of subset models displayed or output to the OUTEST= data set
   DETAILS       produces summary statistics at each step
   DETAILS=      specifies the display details for forward, backward, and stepwise methods
   GROUPNAMES=   provides names for groups of variables
   INCLUDE=      includes first n variables in the model
   MAXSTEP=      specifies maximum number of steps that may be performed
   NOINT         fits a model without the intercept term
   PCOMIT=       performs incomplete principal component analysis and outputs estimates to the OUTEST= data set
   RIDGE=        performs ridge regression analysis and outputs estimates to the OUTEST= data set
   SLE=          sets criterion for entry into model
   SLS=          sets criterion for staying in model
   START=        specifies number of variables in model to begin the comparing and switching process
   STOP=         stops selection criterion

Fit Statistics
   ADJRSQ        computes adjusted R2
   AIC           computes Akaike's information criterion
   B             computes parameter estimates for each model
   BIC           computes Sawa's Bayesian information criterion
   CP            computes Mallows' Cp statistic
   GMSEP         computes estimated MSE of prediction assuming multivariate normality
   JP            computes Jp, the final prediction error
   MSE           computes MSE for each model
   PC            computes Amemiya's prediction criterion
   RMSE          displays root MSE for each model
   SBC           computes the SBC statistic
   SP            computes Sp statistic for each model
   SSE           computes error sum of squares for each model

Data Set Options
   EDF           outputs the number of regressors, the error degrees of freedom, and the model R2 to the OUTEST= data set
   OUTSEB        outputs standard errors of the parameter estimates to the OUTEST= data set
   OUTSTB        outputs standardized parameter estimates to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
   OUTVIF        outputs the variance inflation factors to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
   PRESS         outputs the PRESS statistic to the OUTEST= data set
   RSQUARE       has same effect as the EDF option

Regression Calculations
   I             displays inverse of sums of squares and crossproducts
   XPX           displays sums-of-squares and crossproducts matrix

Details on Estimates
   ACOV          displays asymptotic covariance matrix of estimates assuming heteroscedasticity
   COLLIN        produces collinearity analysis
   COLLINOINT    produces collinearity analysis with intercept adjusted out
   CORRB         displays correlation matrix of estimates
   COVB          displays covariance matrix of estimates
   PARTIALR2     displays squared semi-partial correlation coefficients using Type I sums of squares
   PCORR1        displays squared partial correlation coefficients using Type I sums of squares
   PCORR2        displays squared partial correlation coefficients using Type II sums of squares
   SCORR1        displays squared semi-partial correlation coefficients using Type I sums of squares
   SCORR2        displays squared semi-partial correlation coefficients using Type II sums of squares
   SEQB          displays a sequence of parameter estimates during selection process
   SPEC          tests that first and second moments of model are correctly specified
   SS1           displays the sequential sums of squares
   SS2           displays the partial sums of squares
   STB           displays standardized parameter estimates
   TOL           displays tolerance values for parameter estimates
   VIF           computes variance-inflation factors

Predicted and Residual Values
   CLB           computes 100(1 − α)% confidence limits for the parameter estimates
   CLI           computes 100(1 − α)% confidence limits for an individual predicted value
   CLM           computes 100(1 − α)% confidence limits for the expected value of the dependent variable
   DW            computes a Durbin-Watson statistic
   DWPROB        computes a Durbin-Watson statistic and p-value
   INFLUENCE     computes influence statistics
   P             computes predicted values
   PARTIAL       displays partial regression plots for each regressor
   R             produces analysis of residuals

Display Options and Other Options
   ALL           requests the following options: ACOV, CLB, CLI, CLM, CORRB, COVB, I, P, PCORR1, PCORR2, R, SCORR1, SCORR2, SEQB, SPEC, SS1, SS2, STB, TOL, VIF, XPX
   ALPHA=        sets significance value for confidence and prediction intervals and tests
   NOPRINT       suppresses display of results
   SIGMA=        specifies the true standard deviation of error term for computing CP and BIC
   SINGULAR=     sets criterion for checking for singularity
You can specify the following options in the MODEL statement after a slash (/).

ACOV
displays the estimated asymptotic covariance matrix of the estimates under the hypothesis of heteroscedasticity. See the section "Testing for Heteroscedasticity" on page 3910 for more information.
ADJRSQ
computes R2 adjusted for degrees of freedom for each model selected (Darlington 1968; Judge et al. 1980).

AIC
outputs Akaike's information criterion for each model selected (Akaike 1969; Judge et al. 1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the AIC statistic is also added to the SubsetSelSummary table.

ALL
requests all these options: ACOV, CLB, CLI, CLM, CORRB, COVB, I, P, PCORR1, PCORR2, R, SCORR1, SCORR2, SEQB, SPEC, SS1, SS2, STB, TOL, VIF, and XPX.

ALPHA=number
sets the significance level used for the construction of confidence intervals for the current MODEL statement. The value must be between 0 and 1; the default value of 0.05 results in 95% intervals. This option affects the MODEL options CLB, CLI, and CLM; the OUTPUT statement keywords LCL, LCLM, UCL, and UCLM; the PLOT statement keywords LCL., LCLM., UCL., and UCLM.; and the PLOT statement options CONF and PRED. Specifying this option in the MODEL statement takes precedence over the ALPHA= option in the PROC REG statement.

B
is used with the RSQUARE, ADJRSQ, and CP model-selection methods to compute estimated regression coefficients for each model selected.

BEST=n
is used with the RSQUARE, ADJRSQ, and CP model-selection methods. If SELECTION=CP or SELECTION=ADJRSQ is specified, the BEST= option specifies the maximum number of subset models to be displayed or output to the OUTEST= data set. For SELECTION=RSQUARE, the BEST= option requests the maximum number of subset models for each size. If the BEST= option is used without the B option (displaying estimated regression coefficients), the variables in each MODEL are listed in order of inclusion instead of the order in which they appear in the MODEL statement. If the BEST= option is omitted and the number of regressors is less than 11, all possible subsets are evaluated. If the BEST= option is omitted and the number of regressors is greater than 10, the number of subsets selected is, at most, equal to the number of regressors. A small value of the BEST= option greatly reduces the CPU time required for large problems.
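For example, the following sketch (assuming the SASHELP.CARS sample data set is available) reports the two best subsets of each size, with their coefficients:

   proc reg data=sashelp.cars;
      model MPG_City = EngineSize Horsepower Weight Wheelbase Length
            / selection=rsquare best=2 b;   /* two best subsets per size */
   run;
   quit;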
BIC
outputs Sawa's Bayesian information criterion for each model selected (Sawa 1978; Judge et al. 1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the BIC statistic is also added to the SubsetSelSummary table.
CLB
requests the 100(1 − α)% upper- and lower-confidence limits for the parameter estimates. By default, the 95% limits are computed; the ALPHA= option in the PROC REG or MODEL statement can be used to change the α-level.

CLI
requests the 100(1 − α)% upper- and lower-confidence limits for an individual predicted value. By default, the 95% limits are computed; the ALPHA= option in the PROC REG or MODEL statement can be used to change the α-level. The confidence limits reflect variation in the error, as well as variation in the parameter estimates. See the "Predicted and Residual Values" section on page 3879 and Chapter 2, "Introduction to Regression Procedures," for more information.

CLM
displays the 100(1 − α)% upper- and lower-confidence limits for the expected value of the dependent variable (mean) for each observation. By default, the 95% limits are computed; the ALPHA= option in the PROC REG or MODEL statement can be used to change the α-level. This is not a prediction interval (see the CLI option) because it takes into account only the variation in the parameter estimates, not the variation in the error term. See the section "Predicted and Residual Values" on page 3879 and Chapter 2 for more information.
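A brief sketch contrasting the two kinds of limits (using the SASHELP.CLASS sample data set), with 90% limits requested via the ALPHA= option:

   proc reg data=sashelp.class;
      model Weight = Height / clm cli alpha=0.1;   /* 90% mean and individual limits */
   run;
   quit;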
COLLIN
requests a detailed analysis of collinearity among the regressors. This includes eigenvalues, condition indices, and decomposition of the variances of the estimates with respect to each eigenvalue. See the "Collinearity Diagnostics" section on page 3895.

COLLINOINT
requests the same analysis as the COLLIN option with the intercept variable adjusted out rather than included in the diagnostics. See the "Collinearity Diagnostics" section on page 3895.

CORRB
displays the correlation matrix of the estimates. This is the (X'X)^-1 matrix scaled to unit diagonals.

COVB
displays the estimated covariance matrix of the estimates. This matrix is (X'X)^-1 s^2, where s^2 is the estimated mean squared error.

CP
outputs Mallows' Cp statistic for each model selected (Mallows 1973; Hocking 1976) to the OUTEST= data set. See the "Criteria Used in Model-Selection Methods" section on page 3876 for a discussion of the use of Cp. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the CP statistic is also added to the SubsetSelSummary table.

DETAILS
DETAILS=name
specifies the level of detail produced when the BACKWARD, FORWARD, or STEPWISE method is used, where name can be ALL, STEPS, or SUMMARY. The DETAILS or DETAILS=ALL option produces entry and removal statistics for each variable in the model building process, ANOVA and parameter estimates at each step, and a selection summary table. The option DETAILS=STEPS provides the step information and summary table. The option DETAILS=SUMMARY produces only the summary table. The default if the DETAILS option is omitted is DETAILS=STEPS.

DW
calculates a Durbin-Watson statistic to test whether or not the errors have first-order autocorrelation. (This test is appropriate only for time series data.) The sample autocorrelation of the residuals is also produced. See the section "Autocorrelation in Time Series Data" on page 3915.

DWPROB
calculates a Durbin-Watson statistic and a p-value to test whether or not the errors have first-order autocorrelation. Note that it is not necessary to specify the DW option if the DWPROB option is specified. (This test is appropriate only for time series data.) The sample autocorrelation of the residuals is also produced. See the section "Autocorrelation in Time Series Data" on page 3915.

EDF
outputs the number of regressors in the model excluding and including the intercept, the error degrees of freedom, and the model R2 to the OUTEST= data set.

GMSEP
outputs the estimated mean square error of prediction to the OUTEST= data set, assuming that both independent and dependent variables are multivariate normal (Stein 1960; Darlington 1968). (Note that Hocking's formula (1976, eq. 4.20) contains a misprint: "n − 1" should read "n − 2.") If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the GMSEP statistic is also added to the SubsetSelSummary table.

GROUPNAMES='name1' 'name2' . . .
provides names for variable groups. This option is available only in the BACKWARD, FORWARD, and STEPWISE methods. The group name can be up to 32 characters. Subsets of independent variables listed in the MODEL statement can be designated as variable groups. This is done by enclosing the appropriate variables in braces. Variables in the same group are entered into or removed from the regression model at the same time. However, if the tolerance of any variable (see the TOL option on page 3832) in a group is less than the setting of the SINGULAR= option, then the variable is not entered into the model with the rest of its group. If the GROUPNAMES= option is not used, then the names GROUP1, GROUP2, . . ., GROUPn are assigned to groups encountered in the MODEL statement. Variables not enclosed by braces are used as groups of a single variable. For example,

   model y={x1 x2} x3 / selection=stepwise groupnames='x1 x2' 'x3';

As another example,

   model y={ht wgt age} bodyfat / selection=forward groupnames='htwgtage' 'bodyfat';

I
displays the (X'X)^-1 matrix. The inverse of the crossproducts matrix is bordered by the parameter estimates and SSE matrices.

INCLUDE=n
forces the first n independent variables listed in the MODEL statement to be included in all models. The selection methods are performed on the other variables in the MODEL statement. The INCLUDE= option is not available with SELECTION=NONE.

INFLUENCE
requests a detailed analysis of the influence of each observation on the estimates and the predicted values. See the "Influence Diagnostics" section on page 3898 for details.

JP
outputs Jp, the estimated mean square error of prediction for each model selected, to the OUTEST= data set, assuming that the values of the regressors are fixed and that the model is correct. The Jp statistic is also called the final prediction error (FPE) by Akaike (Nicholson 1948; Lord 1950; Mallows 1967; Darlington 1968; Rothman 1968; Akaike 1969; Hocking 1976; Judge et al. 1980). If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the Jp statistic is also added to the SubsetSelSummary table.

MSE
computes the mean square error for each model selected (Darlington 1968).

MAXSTEP=n
specifies the maximum number of steps that are done when SELECTION=FORWARD, SELECTION=BACKWARD, or SELECTION=STEPWISE is used. The default value is the number of independent variables in the model for the forward and backward methods and three times this number for the stepwise method.

NOINT
suppresses the intercept term that is otherwise included in the model.

NOPRINT
suppresses the normal display of regression results. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, “Using the Output Delivery System,” for more information.
OUTSEB
outputs the standard errors of the parameter estimates to the OUTEST= data set. The value SEB for the variable _TYPE_ identifies the standard errors. If the RIDGE= or PCOMIT= option is specified, additional observations are included and identified by the values RIDGESEB and IPCSEB, respectively, for the variable _TYPE_. The standard errors for ridge regression estimates and incomplete principal components (IPC) estimates are limited in their usefulness because these estimates are biased. This option is available for all model-selection methods except RSQUARE, ADJRSQ, and CP.

OUTSTB
outputs the standardized parameter estimates as well as the usual estimates to the OUTEST= data set when the RIDGE= or PCOMIT= option is specified. The values RIDGESTB and IPCSTB for the variable _TYPE_ identify ridge regression estimates and IPC estimates, respectively.

OUTVIF
outputs the variance inflation factors (VIF) to the OUTEST= data set when the RIDGE= or PCOMIT= option is specified. The factors are the diagonal elements of the inverse of the correlation matrix of regressors as adjusted by ridge regression or IPC analysis. These observations are identified in the output data set by the values RIDGEVIF and IPCVIF for the variable _TYPE_.

P
calculates predicted values from the input data and the estimated model. The display includes the observation number, the ID variable (if one is specified), the actual and predicted values, and the residual. If the CLI, CLM, or R option is specified, the P option is unnecessary. See the section "Predicted and Residual Values" on page 3879 for more information.

PARTIAL
requests partial regression leverage plots for each regressor. If ODS Graphics are in effect (see the "ODS Graphics" section on page 3922), then these partial plots are produced in panels with up to six plots per panel. See the "Influence Diagnostics" section on page 3898 for more information.

PARTIALR2 < ( < TESTS > < SEQTESTS > ) >
See the SCORR1 option.

PC
outputs Amemiya's prediction criterion for each model selected (Amemiya 1976; Judge et al. 1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the PC statistic is also added to the SubsetSelSummary table.

PCOMIT=list
requests an IPC analysis for each value m in the list. The procedure computes parameter estimates using all but the last m principal components. Each value of m produces a set of IPC estimates, which is output to the OUTEST= data set. The values of m are saved by the variable _PCOMIT_, and the value of the variable _TYPE_ is set to IPC to identify the estimates. Only nonnegative integers can be specified with the PCOMIT= option. If you specify the PCOMIT= option, RESTRICT statements are ignored. The PCOMIT= option is ignored if you use the SELECTION= option in the MODEL statement.

PCORR1
displays the squared partial correlation coefficients using Type I sums of squares (SS). This is calculated as SS/(SS + SSE), where SSE is the error sum of squares.

PCORR2
displays the squared partial correlation coefficients using Type II sums of squares. These are calculated the same way as with the PCORR1 option, except that Type II SS are used instead of Type I SS.

PRESS
outputs the PRESS statistic to the OUTEST= data set. The values of this statistic are saved in the variable _PRESS_. This option is available for all model-selection methods except RSQUARE, ADJRSQ, and CP.

R
requests an analysis of the residuals. The results include everything requested by the P option plus the standard errors of the mean predicted and residual values, the studentized residual, and Cook's D statistic to measure the influence of each observation on the parameter estimates. See the section "Predicted and Residual Values" on page 3879 for more information.

RIDGE=list
requests a ridge regression analysis and specifies the values of the ridge constant k (see the "Computations for Ridge Regression and IPC Analysis" section on page 3916). Each value of k produces a set of ridge regression estimates that are placed in the OUTEST= data set. The values of k are saved by the variable _RIDGE_, and the value of the variable _TYPE_ is set to RIDGE to identify the estimates. Only nonnegative numbers can be specified with the RIDGE= option. Example 61.10 on page 3956 illustrates this option. If you specify the RIDGE= option, RESTRICT statements are ignored. The RIDGE= option is ignored if you use the SELECTION= option in the MODEL statement.

RMSE
displays the root mean square error for each model selected.

RSQUARE
has the same effect as the EDF option.

SBC
outputs the SBC statistic for each model selected (Schwarz 1978; Judge et al. 1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the SBC statistic is also added to the SubsetSelSummary table.
SCORR1 < ( < TESTS > < SEQTESTS > ) >
displays the squared semi-partial correlation coefficients using Type I sums of squares. This is calculated as SS/SST, where SST is the corrected total SS. If the NOINT option is used, the uncorrected total SS is used in the denominator. The optional arguments TESTS and SEQTESTS request F-tests, p-values, and cumulative R-square values as variables are sequentially added to a model. The F-test values are computed as the Type I sum of squares for the variable in question divided by a mean square error. If you specify the TESTS option, the denominator MSE is the residual mean square for the full model specified in the MODEL statement. If you specify the SEQTESTS option, the denominator MSE is the residual mean square for the model containing all the independent variables that have been added to the model up to and including the variable in question. The TESTS and SEQTESTS options are not supported if you specify model selection methods, or the RIDGE= or PCOMIT= options. Note that the PARTIALR2 option is a synonym for the SCORR1 option.

SCORR2 < ( TESTS ) >
displays the squared semi-partial correlation coefficients using Type II sums of squares. These are calculated the same way as with the SCORR1 option, except that Type II SS are used instead of Type I SS. The optional TESTS argument requests F-tests and p-values as variables are sequentially added to a model. The F-test values are computed as the Type II sum of squares for the variable in question divided by the residual mean square for the full model specified in the MODEL statement. The TESTS option is not supported if you specify model selection methods, or the RIDGE= or PCOMIT= options.

SELECTION=name
specifies the method used to select the model, where name can be FORWARD (or F), BACKWARD (or B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP, or NONE (use the full model). The default method is NONE. See the "Model-Selection Methods" section on page 3873 for a description of each method.
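For instance, here is a stepwise-selection sketch (assuming the SASHELP.CARS sample data set is available); the SLENTRY= and SLSTAY= options used here are described later in this section:

   proc reg data=sashelp.cars;
      model MPG_City = EngineSize Horsepower Weight Wheelbase Length
            / selection=stepwise slentry=0.15 slstay=0.15 details=summary;
   run;
   quit;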
SEQB
produces a sequence of parameter estimates as each variable is entered into the model. This is displayed as a matrix where each row is a set of parameter estimates.

SIGMA=n
specifies the true standard deviation of the error term to be used in computing the CP and BIC statistics. If the SIGMA= option is not specified, an estimate from the full model is used. This option is available in the RSQUARE, ADJRSQ, and CP model-selection methods only.

SINGULAR=n
tunes the mechanism used to check for singularities. Specifying this option in the MODEL statement takes precedence over the SINGULAR= option in the PROC REG statement. The default value is machine dependent but is approximately 1E−7 on most machines. This option is rarely needed. Singularity checking is described in the “Computational Methods” section on page 3917.
SLENTRY=value
SLE=value
specifies the significance level for entry into the model used in the FORWARD and STEPWISE methods. The defaults are 0.50 for FORWARD and 0.15 for STEPWISE.

SLSTAY=value
SLS=value
specifies the significance level for staying in the model for the BACKWARD and STEPWISE methods. The defaults are 0.10 for BACKWARD and 0.15 for STEPWISE.

SP
outputs the Sp statistic for each model selected (Hocking 1976) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the SP statistic is also added to the SubsetSelSummary table.

SPEC
performs a test that the first and second moments of the model are correctly specified. See the section "Testing for Heteroscedasticity" on page 3910 for more information.

SS1
displays the sequential sums of squares (Type I SS) along with the parameter estimates for each term in the model. See Chapter 11, "The Four Types of Estimable Functions," for more information on the different types of sums of squares.

SS2
displays the partial sums of squares (Type II SS) along with the parameter estimates for each term in the model. See the SS1 option also.

SSE
computes the error sum of squares for each model selected.

START=s
is used to begin the comparing-and-switching process in the MAXR, MINR, and STEPWISE methods for a model containing the first s independent variables in the MODEL statement, where s is the START value. For these methods, the default is START=0. For the RSQUARE, ADJRSQ, and CP methods, START=s specifies the smallest number of regressors to be reported in a subset model. For these methods, the default is START=1. The START= option cannot be used with model-selection methods other than the six described here.

STB
produces standardized regression coefficients. A standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor.
STOP=s
causes PROC REG to stop when it has found the "best" s-variable model, where s is the STOP value. For the RSQUARE, ADJRSQ, and CP methods, STOP=s specifies the largest number of regressors to be reported in a subset model. For the MAXR and MINR methods, STOP=s specifies the largest number of regressors to be included in the model. The default setting for the STOP= option is the number of variables in the MODEL statement. This option can be used only with the MAXR, MINR, RSQUARE, ADJRSQ, and CP methods.

TOL
produces tolerance values for the estimates. Tolerance for a variable is defined as 1 − R2, where R2 is obtained from the regression of the variable on all other regressors in the model. See the section "Collinearity Diagnostics" on page 3895 for more detail.

VIF
produces variance inflation factors with the parameter estimates. Variance inflation is the reciprocal of tolerance. See the section "Collinearity Diagnostics" on page 3895 for more detail.

XPX
displays the X'X crossproducts matrix for the model. The crossproducts matrix is bordered by the X'Y and Y'Y matrices.
MTEST Statement

< label: > MTEST < equation < , . . . , equation > > < / options > ;

where each equation is a linear function composed of coefficients and variable names. The label is optional.

The MTEST statement is used to test hypotheses in multivariate regression models where there are several dependent variables fit to the same regressors. If no equations or options are specified, the MTEST statement tests the hypothesis that all estimated parameters except the intercept are zero.

The hypotheses that can be tested with the MTEST statement are of the form

   (Lβ − cj)M = 0

where L is a linear function on the regressor side, β is a matrix of parameters, c is a column vector of constants, j is a row vector of ones, and M is a linear function on the dependent side. The special case where the constants are zero is

   LβM = 0

See the section "Multivariate Tests" on page 3910 for more details.

Each linear function extends across either the regressor variables or the dependent variables. If the equation is across the dependent variables, then the constant term, if specified, must be zero. The equations for the regressor variables form the L matrix and c vector in the preceding formula; the equations for dependent variables form the M matrix. If no equations for the dependent variables are given, PROC REG uses an identity matrix for M, testing the same hypothesis across all dependent variables. If no equations for the regressor variables are given, PROC REG forms a linear function corresponding to a test that all the nonintercept parameters are zero.

As an example, consider the following statements:

   model y1 y2 y3=x1 x2 x3;
   mtest x1,x2;
   mtest y1-y2, y2-y3, x1;
   mtest y1-y2;
The first MTEST statement tests the hypothesis that the X1 and X2 parameters are zero for Y1, Y2, and Y3. In addition, the second MTEST statement tests the hypothesis that the X1 parameter is the same for all three dependent variables. For the same model, the third MTEST statement tests the hypothesis that all parameters except the intercept are the same for dependent variables Y1 and Y2.

You can specify the following options in the MTEST statement.

CANPRINT
displays the canonical correlations for the hypothesis combinations and the dependent variable combinations. If you specify

   mtest / canprint;
the canonical correlations between the regressors and the dependent variables are displayed.

DETAILS
displays the M matrix and various intermediate calculations.

PRINT
displays the H and E matrices.
OUTPUT Statement

OUTPUT < OUT=SAS-data-set > keyword=names < . . . keyword=names > ;

The OUTPUT statement creates a new SAS data set that saves diagnostic measures calculated after fitting the model. The OUTPUT statement refers to the most recent MODEL statement. At least one keyword=names specification is required.

All the variables in the original data set are included in the new data set, along with variables created in the OUTPUT statement. These new variables contain the values of a variety of statistics and diagnostic measures that are calculated for each observation in the data set. If you want to create a permanent SAS data set, you must specify a two-level name (for example, libref.data-set-name). For more information on permanent SAS data sets, refer to the section "SAS Files" in SAS Language Reference: Concepts.

The OUTPUT statement cannot be used when a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used as the input data set for PROC REG. See the "Input Data Sets" section on page 3860 for more details.

The statistics created in the OUTPUT statement are described in this section. More details are contained in the "Predicted and Residual Values" section on page 3879 and the "Influence Diagnostics" section on page 3898. Also see Chapter 2, "Introduction to Regression Procedures," for definitions of the statistics available from the REG procedure.

You can specify the following options in the OUTPUT statement.

OUT=SAS-data-set
gives the name of the new data set. By default, the procedure uses the DATAn convention to name the new data set.

keyword=names
specifies the statistics to include in the output data set and names the new variables that contain the statistics. Specify a keyword for each desired statistic (see the following list of keywords), an equal sign, and the variable or variables to contain the statistic. In the output data set, the first variable listed after a keyword in the OUTPUT statement contains that statistic for the first dependent variable listed in the MODEL statement; the second variable contains the statistic for the second dependent variable in the MODEL statement, and so on. The list of variables following the equal sign can be shorter than the list of dependent variables in the MODEL statement. In this case, the procedure creates the new names in order of the dependent variables in the MODEL statement. For example, the SAS statements

   proc reg data=a;
      model y z=x1 x2;
      output out=b p=yhat zhat r=yresid zresid;
   run;
create an output data set named b. In addition to the variables in the input data set, b contains the following variables:

• yhat, with values that are predicted values of the dependent variable y
• zhat, with values that are predicted values of the dependent variable z
• yresid, with values that are the residual values of y
• zresid, with values that are the residual values of z
You can specify the following keywords in the OUTPUT statement. See the "Model Fit and Diagnostic Statistics" section on page 3896 for computational formulas.

Table 61.3. Keywords for OUTPUT Statement

   COOKD=names           Cook's D influence statistic
   COVRATIO=names        standard influence of observation on covariance of betas, as discussed in the "Influence Diagnostics" section on page 3898
   DFFITS=names          standard influence of observation on predicted value
   H=names               leverage, x_i(X'X)^-1 x_i'
   LCL=names             lower bound of a 100(1 − α)% confidence interval for an individual prediction; this includes the variance of the error as well as the variance of the parameter estimates
   LCLM=names            lower bound of a 100(1 − α)% confidence interval for the expected value (mean) of the dependent variable
   PREDICTED | P=names   predicted values
   PRESS=names           ith residual divided by (1 − h), where h is the leverage, and where the model has been refit without the ith observation
   RESIDUAL | R=names    residuals, calculated as ACTUAL minus PREDICTED
   RSTUDENT=names        a studentized residual with the current observation deleted
   STDI=names            standard error of the individual predicted value
   STDP=names            standard error of the mean predicted value
   STDR=names            standard error of the residual
   STUDENT=names         studentized residuals, which are the residuals divided by their standard errors
   UCL=names             upper bound of a 100(1 − α)% confidence interval for an individual prediction
   UCLM=names            upper bound of a 100(1 − α)% confidence interval for the expected value (mean) of the dependent variable
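Here is a sketch that collects several of these diagnostics at once (using the SASHELP.CLASS sample data set; the output variable names pred, resid, rstud, rstud_del, lev, and cd are arbitrary choices for illustration):

   proc reg data=sashelp.class noprint;
      model Weight = Height Age;
      output out=diag p=pred r=resid student=rstud
             rstudent=rstud_del h=lev cookd=cd;
   run;
   quit;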
PAINT Statement

PAINT < condition | ALLOBS > < / options > ;
PAINT < STATUS | UNDO > ;

The PAINT statement selects observations to be painted or highlighted in a scatter plot on line printer output; the PAINT statement is ignored if the LINEPRINTER option is not specified in the PROC REG statement.

All observations that satisfy condition are painted using some specific symbol. The PAINT statement does not generate a scatter plot and must be followed by a PLOT statement, which does generate a scatter plot. Several PAINT statements can be used before a PLOT statement, and all prior PAINT statement requests are applied to all later PLOT statements.
The PAINT statement lists the observation numbers of the observations selected, the total number of observations selected, and the plotting symbol used to paint the points. On a plot, paint symbols take precedence over all other symbols. If any position contains more than one painted point, the paint symbol for the observation plotted last is used. The PAINT statement cannot be used when a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used as the input data set for PROC REG. Also, the PAINT statement cannot be used for models with more than one dependent variable. Note that the syntax for the PAINT statement is the same as the syntax for the REWEIGHT statement. For detailed examples of painting scatter plots, see the section “Painting Scatter Plots” on page 3889.
Specifying Condition

Condition is used to select observations to be painted. The syntax of condition is

   variable compare value

or

   variable compare value logical variable compare value

where

variable
is one of the following:
• a variable name in the input data set
• OBS., which is the observation number
• keyword., where keyword is a keyword for a statistic requested in the OUTPUT statement

compare
is an operator that compares variable to value. Compare can be any one of the following: <, <=, >, >=, =, ^=. The operators LT, LE, GT, GE, EQ, and NE can be used instead of the preceding symbols. Refer to the "Expressions" section in SAS Language Reference: Concepts for more information on comparison operators.

value
gives an unformatted value of variable. Observations are selected to be painted if they satisfy the condition created by variable compare value. Value can be a number or a character string. If value is a character string, it must be eight characters or less and must be enclosed in quotes. In addition, value is case-sensitive. In other words, the statements

   paint name='henry';

and

   paint name='Henry';

are not the same.

logical
is one of two logical operators. Either AND or OR can be used. To specify AND, use AND or the symbol &. To specify OR, use OR or the symbol |.
Examples of the variable compare value form are

   paint name='Henry';
   paint residual.>=20;
   paint obs.=99;

Examples of the variable compare value logical variable compare value form are

   paint name='Henry'|name='Mary';
   paint residual.>=20 or residual.<=10;
   paint obs.>=11 and residual.<=20;
Using ALLOBS

Instead of specifying condition, the ALLOBS option can be used to select all observations. This is most useful when you want to unpaint all observations. For example,

   paint allobs / reset;
resets the symbols for all observations.
Options in the PAINT Statement

The following options can be used when either a condition is specified, the ALLOBS option is specified, or when nothing is specified before the slash. If only an option is listed, the option applies to the observations selected in the previous PAINT statement, not to the observations selected by reapplying the condition from the previous PAINT statement. For example, in the statements

   paint r.>0 / symbol='a';
   reweight r.>0;
   refit;
   paint / symbol='b';
the second PAINT statement paints only those observations selected in the first PAINT statement. No additional observations are painted even if, after refitting the model, there are new observations that meet the condition in the first PAINT statement. Note: Options are not available when either the UNDO or STATUS option is used.
You can specify the following options after a slash (/).

NOLIST
suppresses the display of the selected observation numbers. If the NOLIST option is not specified, a list of observations selected is written to the log. The list includes the observation numbers and painting symbol used to paint the points. The total number of observations selected to be painted is also shown.

RESET
changes the painting symbol to the current default symbol, effectively unpainting the observations selected. If you set the default symbol by using the SYMBOL= option in the PLOT statement, the RESET option in the PAINT statement changes the painting symbol to the symbol you specified. Otherwise, the default symbol of '1' is used.

SYMBOL='character'
specifies a painting symbol. If the SYMBOL= option is omitted, the painting symbol is either the one used in the most recent PAINT statement or, if there are no previous PAINT statements, the symbol ’@’. For example, paint / symbol=’#’;
changes the painting symbol for the observations selected by the most recent PAINT statement to ’#’. As another example, paint temp lt 22 / symbol=’c’;
changes the painting symbol to 'c' for all observations with TEMP<22. In general, the numbers 1, 2, ..., 9 and the asterisk are not recommended as painting symbols. These symbols are used as default symbols in the PLOT statement, where they represent the number of replicates at a point. If SYMBOL='' (no space between the quotes) is used, no painting is done in the current plot. If SYMBOL=' ' (a blank between the quotes) is used, observations are painted with a blank and are no longer seen on the plot.
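As a hedged sketch of the blank-symbol behavior (TEMP and Y are hypothetical variables from an input data set):

   paint temp lt 22 / symbol=' ';  /* blank symbol: these points vanish from the plot */
   plot y*temp;                    /* painted observations are plotted as blanks */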
STATUS and UNDO

Instead of specifying condition or the ALLOBS option, you can use the STATUS or UNDO option as follows:

STATUS

lists (on the log) the observation number and plotting symbol of all currently painted observations.

UNDO

undoes changes made by the most recent PAINT statement. Observations may be, but are not necessarily, unpainted. For example,

   paint obs.<=10 / symbol='a';
   ...other interactive statements
   paint obs.=1 / symbol='b';
   ...other interactive statements
   paint undo;
The last PAINT statement changes the plotting symbol used for observation 1 back to 'a'. If the statement

   paint / reset;
is used instead, observation 1 is unpainted.
PLOT Statement

PLOT < yvariable*xvariable > < =symbol > < ... yvariable*xvariable > < =symbol > < / options > ;

The PLOT statement in PROC REG displays scatter plots with yvariable on the vertical axis and xvariable on the horizontal axis. Line printer plots are generated if the LINEPRINTER option is specified in the PROC REG statement; otherwise, the traditional high-resolution graphics plots are created. Points in line printer plots can be marked with symbols, while global graphics statements such as GOPTIONS and SYMBOL are used to enhance the high-resolution graphics plots. Note that the plots you request using the PLOT statement are independent of the experimental ODS graphics (see the "ODS Graphics" section on page 3922) that are now available in PROC REG.

As with most other interactive statements, the PLOT statement implicitly refits the model. For example, if a PLOT statement is preceded by a REWEIGHT statement, the model is recomputed, and the plot reflects the new model. If there are multiple MODEL statements preceding a PLOT statement, then the PLOT statement refers to the latest MODEL statement.

The PLOT statement cannot be used when TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets are used as input to PROC REG.

You can specify several PLOT statements for each MODEL statement, and you can specify more than one plot in each PLOT statement. For detailed examples of using the PLOT statement and its options, see the section "Producing Scatter Plots" on page 3882.
Specifying Yvariables, Xvariables, and Symbol

More than one yvariable*xvariable pair can be specified to request multiple plots. The yvariables and xvariables can be

• any variables specified in the VAR or MODEL statement before the first RUN statement
• keyword., where keyword is a regression diagnostic statistic available in the OUTPUT statement (see Table 61.4 on page 3842). For example,

   plot predicted.*residual.;
generates one plot of the predicted values by the residuals for each dependent variable in the MODEL statement. These statistics can also be plotted against any of the variables in the VAR or MODEL statements.
• the keyword OBS. (the observation number), which can be plotted against any of the preceding variables
• the keyword NPP. or NQQ., which can be used with any of the preceding variables to construct normal P-P or Q-Q plots, respectively (see the section "Construction of Q-Q and P-P Plots" on page 3917 and Example 61.8 on page 3953 for more information, and the sketch following this list)
• keywords for model fit summary statistics available in the OUTEST= data set with _TYPE_=PARMS (see Table 61.4 on page 3842). A SELECTION= method (other than NONE) must be requested in the MODEL statement for these variables to be plotted. If one member of a yvariable*xvariable pair is from the OUTEST= data set, the other member must also be from the OUTEST= data set.
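For instance, a minimal sketch of a normal Q-Q plot of the residuals, assuming the fitness data used in the examples of this chapter:

   proc reg data=fitness;
      model Oxygen=RunTime;
      plot r.*nqq.;   /* residuals against normal quantiles */
   run;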
The OUTPUT statement and the OUTEST= option are not required when their keywords are specified in the PLOT statement.

The yvariable and xvariable specifications can be replaced by a set of variables and statistics enclosed in parentheses. When this occurs, all possible combinations of yvariable and xvariable are generated. For example, the following two statements are equivalent.

   plot (y1 y2)*(x1 x2);
   plot y1*x1 y1*x2 y2*x1 y2*x2;

The statement

   plot;

is equivalent to respecifying the most recent PLOT statement without any options. However, the line printer options COLLECT, HPLOTS=, SYMBOL=, and VPLOTS=, described in the "Line Printer Plots" section on page 3848, apply across PLOT statements and remain in effect if they have been previously specified. Options used for the traditional high-resolution graphics plots are described in the following section.
Traditional High-Resolution Graphics Plots

The display of high-resolution graphics plots is described in the following paragraphs, the options are summarized in Table 61.5 and described in the section "Dictionary of PLOT Statement Options" on page 3844, and the "Examples" section on page 3924 contains several examples of the graphics output.

Several line printer statements and options are not supported for high-resolution graphics. In particular, the PAINT statement is disabled, as are the PLOT statement options CLEAR, COLLECT, HPLOTS=, NOCOLLECT, SYMBOL=, and VPLOTS=. To display more than one plot per page or to collect plots from multiple PLOT statements, use the PROC GREPLAY statement (refer to SAS/GRAPH Software: Reference). Also note that high-resolution graphics options are not recognized for line printer plots.

The fitted model equation and a label are displayed in the top margin of the plot; this display can be suppressed with the NOMODEL option. If the label is requested but cannot fit on one line, it is not displayed. The equation and label are displayed on one line when possible; if more lines are required, the label is displayed in the first line with the model equation in successive lines. If displaying the entire equation causes the plot to be unacceptably small, the equation is truncated. Table 61.5 on page 3843 lists options to control the display of the equation. The "Examples" section on page 3924 illustrates the display of the model equation.

Four statistics are displayed by default in the right margin: the number of observations, R-square, the adjusted R-square, and the root mean square error. (See Output 61.4.1 on page 3949.) The display of these statistics can be suppressed with the NOSTAT option. You can specify other options to request the display of various statistics in the right margin; see Table 61.5 on page 3843.

A default reference line at zero is displayed if residuals are plotted; see Output 61.7.1 on page 3952. If the dependent variable is plotted against the independent variable in a simple linear regression model, the fitted regression line is displayed by default. (See Output 61.4.1 on page 3949.) Default reference lines can be suppressed with the NOLINE option; the lines are not displayed if the OVERLAY option is specified.

Specialized plots are requested with special options. For each coefficient, the RIDGEPLOT option plots the ridge estimates against the ridge values k; see the description of the RIDGEPLOT option in the section "Dictionary of PLOT Statement Options" beginning on page 3844 and Example 61.10 on page 3956 for more details. The CONF option plots 100(1 − α)% confidence intervals for the mean while the PRED option plots 100(1 − α)% prediction intervals; see the description of these options in the section "Dictionary of PLOT Statement Options" beginning on page 3844 and in Example 61.9 on page 3955 for more details.

If a SELECTION= method is requested, the fitted model equation and the statistics displayed in the margin correspond to the selected model. For the ADJRSQ and CP methods, the selected model is treated as a submodel of the full model. If a CP.*NP. plot is requested, the CHOCKING= and CMALLOWS= options display model selection reference lines; see the descriptions of these options in the section "Dictionary of PLOT Statement Options" beginning on page 3844 and Example 61.5 on page 3949 for more details.

PLOT Statement variable Keywords
The following table lists the keywords available as PLOT statement xvariables and yvariables. All keywords have a trailing dot; for example, “COOKD.” requests Cook’s D statistic. Neither the OUTPUT statement nor the OUTEST= option needs to be specified.
Table 61.4. Keywords for PLOT Statement xvariables and yvariables
Keyword                    Description

Diagnostic Statistics
COOKD.                     Cook's D influence statistic
COVRATIO.                  standard influence of observation on covariance of betas
DFFITS.                    standard influence of observation on predicted value
H.                         leverage
LCL.                       lower bound of 100(1 − α)% confidence interval for individual prediction
LCLM.                      lower bound of 100(1 − α)% confidence interval for the mean of the dependent variable
PREDICTED. | PRED. | P.    predicted values
PRESS.                     residuals from refitting the model with current observation deleted
RESIDUAL. | R.             residuals
RSTUDENT.                  studentized residuals with the current observation deleted
STDI.                      standard error of the individual predicted value
STDP.                      standard error of the mean predicted value
STDR.                      standard error of the residual
STUDENT.                   residuals divided by their standard errors
UCL.                       upper bound of 100(1 − α)% confidence interval for individual prediction
UCLM.                      upper bound of 100(1 − α)% confidence interval for the mean of the dependent variable

Other Keywords Used with Diagnostic Statistics
NPP.                       normal probability-probability plot
NQQ.                       normal quantile-quantile plot
OBS.                       observation number (cannot plot against OUTEST= statistics)

Model Fit Summary Statistics
ADJRSQ.                    adjusted R-square
AIC.                       Akaike's information criterion
BIC.                       Sawa's Bayesian information criterion
CP.                        Mallows' Cp statistic
EDF.                       error degrees of freedom
GMSEP.                     estimated MSE of prediction, assuming multivariate normality
IN.                        number of regressors in the model not including the intercept
JP.                        final prediction error
MSE.                       mean squared error
NP.                        number of parameters in the model (including the intercept)
PC.                        Amemiya's prediction criterion
RMSE.                      root MSE
RSQ.                       R-square
SBC.                       SBC statistic
SP.                        SP statistic
SSE.                       error sum of squares
Summary of PLOT Statement Graphics Options
The following table lists the PLOT statement options by function. These options are available unless the LINEPRINTER option is specified in the PROC REG statement. For complete descriptions, see the section "Dictionary of PLOT Statement Options" beginning on page 3844.

Table 61.5. High-Resolution Graphics Options

Option                     Description

General Graphics Options
ANNOTATE=SAS-data-set      specifies the annotate data set
CHOCKING=color             requests reference lines for Cp model selection criteria
CMALLOWS=color             requests a reference line for the Cp model selection criterion
CONF                       requests plots of 100(1 − α)% confidence intervals for the mean
DESCRIPTION='string'       specifies a description for graphics catalog member
NAME='string'              names the plot in graphics catalog
OVERLAY                    overlays plots from the same model
PRED                       requests plots of 100(1 − α)% prediction intervals for individual responses
RIDGEPLOT                  requests the ridge trace for ridge regression

Axis and Legend Options
LEGEND=LEGENDn             specifies LEGEND statement to be used
HAXIS=values               specifies tick mark values for horizontal axis
VAXIS=values               specifies tick mark values for vertical axis

Reference Line Options
HREF=values                specifies reference lines perpendicular to horizontal axis
LHREF=linetype             specifies line style for HREF= lines
LLINE=linetype             specifies line style for lines displayed by default
LVREF=linetype             specifies line style for VREF= lines
NOLINE                     suppresses display of any default reference line
VREF=values                specifies reference lines perpendicular to vertical axis

Color Options
CAXIS=color                specifies color for axis line and tick marks
CFRAME=color               specifies color for frame
CHREF=color                specifies color for HREF= lines
CLINE=color                specifies color for lines displayed by default
CTEXT=color                specifies color for text
CVREF=color                specifies color for VREF= lines

Options for Displaying the Fitted Model Equation
MODELFONT=font             specifies font of model equation and model label
MODELHT=value              specifies text height of model equation and model label
MODELLAB='label'           specifies model label
NOMODEL                    suppresses display of the fitted model and the label

Options for Displaying Statistics in the Plot Margin
AIC                        displays Akaike's information criterion
Table 61.5. (continued)
Option                     Description
BIC                        displays Sawa's Bayesian information criterion
CP                         displays Mallows' Cp statistic
EDF                        displays the error degrees of freedom
GMSEP                      displays the estimated MSE of prediction assuming multivariate normality
IN                         displays the number of regressors in the model not including the intercept
JP                         displays the Jp statistic
MSE                        displays the mean squared error
NOSTAT                     suppresses display of the default statistics: the number of observations, R-square, adjusted R-square, and the root mean square error
NP                         displays the number of parameters in the model including the intercept, if any
PC                         displays the PC statistic
SBC                        displays the SBC statistic
SP                         displays the S(p) statistic
SSE                        displays the error sum of squares
STATFONT=font              specifies font of text displayed in the margin
STATHT=value               specifies height of text displayed in the margin
Dictionary of PLOT Statement Options
The following entries describe the PLOT statement options in detail. Note that these options are available unless you specify the LINEPRINTER option in the PROC REG statement.

AIC

displays Akaike's information criterion in the plot margin.

ANNOTATE=SAS-data-set
ANNO=SAS-data-set

specifies an input data set that contains appropriate variables for annotation. This applies only to displays created with the current PLOT statement. Refer to SAS/GRAPH Software: Reference for more information.

BIC

displays Sawa's Bayesian information criterion in the plot margin.

CAXIS=color
CAXES=color
CA=color

specifies the color for the axes, frame, and tick marks.

CFRAME=color
CFR=color

specifies the color for filling the area enclosed by the axes and the frame.
CHOCKING=color
requests reference lines corresponding to the equations Cp = p and Cp = 2p − pfull, where pfull is the number of parameters in the full model (excluding the intercept) and p is the number of parameters in the subset model (including the intercept). The color must be specified; the Cp = p line is solid and the Cp = 2p − pfull line is dashed. Only PLOT statements of the form PLOT CP.*NP. produce these lines. For the purpose of parameter estimation, Hocking (1976) suggests selecting a model where Cp ≤ 2p − pfull. For the purpose of prediction, Hocking suggests the criterion Cp ≤ p. You can request the single reference line Cp = p with the CMALLOWS= option. If, for example, you specify both CHOCKING=RED and CMALLOWS=BLUE, then the Cp = 2p − pfull line is red and the Cp = p line is blue (see Example 61.5 on page 3949).
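A minimal sketch along these lines, assuming the fitness data used in the examples of this chapter (a SELECTION= method is required for CP. and NP. to be plotted):

   proc reg data=fitness;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=rsquare;
      plot cp.*np. / chocking=red cmallows=blue;  /* Hocking and Mallows reference lines */
   run;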
CHREF=color
CH=color

specifies the color for lines requested with the HREF= option.

CLINE=color
CL=color
specifies the color for lines displayed by default. See the NOLINE option later in this section for details.

CMALLOWS=color
requests a Cp = p reference line, where p is the number of parameters (including the intercept) in the subset model. The color must be specified; the line is solid. Only PLOT statements of the form PLOT CP.*NP. produce this line. Mallows (1973) suggests that all subset models with Cp small and near p be considered for further study. See the CHOCKING= option for related model selection criteria.

CONF
is a keyword used as a shorthand option to request plots that include 100(1 − α)% confidence intervals for the mean response (see Example 61.9 on page 3955). The ALPHA= option in the PROC REG or MODEL statement selects the significance level α, which is 0.05 by default. The CONF option is valid for simple regression models only, and is ignored for plots where confidence intervals are inappropriate. The CONF option replaces the CONF95 option; however, the CONF95 option is still supported when the ALPHA= option is not specified. The OVERLAY option is ignored when the CONF option is specified.
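As an illustrative sketch, again assuming the fitness data (the 90% level comes from ALPHA=0.1):

   proc reg data=fitness;
      model Oxygen=RunTime / alpha=0.1;  /* a simple regression model, as CONF requires */
      plot Oxygen*RunTime / conf;        /* overlay the 90% confidence band for the mean */
   run;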
CP

displays Mallows' Cp statistic in the plot margin.

CTEXT=color
CT=color
specifies the color for text including tick mark labels, axis labels, the fitted model label and equation, the statistics displayed in the margin, and legends. (See Example 61.6 on page 3950.)
CVREF=color
CV=color
specifies the color for lines requested with the VREF= option.

DESCRIPTION='string'
DESC='string'
specifies a descriptive string, up to 40 characters, that appears in the description field of the PROC GREPLAY master menu.

EDF

displays the error degrees of freedom in the plot margin.

GMSEP

displays the estimated mean square error of prediction in the plot margin. Note that the estimate is calculated under the assumption that both independent and dependent variables have a multivariate normal distribution.

HAXIS=values
HA=values

specifies tick mark values for the horizontal axis.

HREF=values

specifies where reference lines perpendicular to the horizontal axis are to appear.

IN

displays the number of regressors in the model (not including the intercept) in the plot margin.

JP

displays the Jp statistic in the plot margin.

LEGEND=LEGENDn

specifies the LEGENDn statement to be used. The LEGENDn statement is a global graphics statement; refer to SAS/GRAPH Software: Reference for more information.

LHREF=linetype
LH=linetype

specifies the line style for lines requested with the HREF= option. The default linetype is 2. Note that LHREF=1 requests a solid line. Refer to SAS/GRAPH Software: Reference for a table of available line types.

LLINE=linetype
LL=linetype

specifies the line style for reference lines displayed by default; see the NOLINE option for details. The default linetype is 2. Note that LLINE=1 requests a solid line.

LVREF=linetype
LV=linetype

specifies the line style for lines requested with the VREF= option. The default linetype is 2. Note that LVREF=1 requests a solid line.

MODELFONT=font
specifies the font used for displaying the fitted model label and the fitted model equation. Refer to SAS/GRAPH Software: Reference for tables of software fonts.

MODELHT=height
specifies the text height for the fitted model label and the fitted model equation.

MODELLAB='label'

specifies the label to be displayed with the fitted model equation. By default, no label is displayed. If the label does not fit on one line, it is not displayed. See the explanation in the section "Traditional High-Resolution Graphics Plots" beginning on page 3840 for more information.

MSE

displays the mean squared error in the plot margin.

NAME='string'

specifies a descriptive string, up to eight characters, that appears in the name field of the PROC GREPLAY master menu. The default string is REG.

NOLINE

suppresses the display of default reference lines. A default reference line at zero is displayed if residuals are plotted. If the dependent variable is plotted against the independent variable in a simple regression model, then the fitted regression line is displayed by default. Default reference lines are not displayed if the OVERLAY option is specified.

NOMODEL

suppresses the display of the fitted model equation.

NOSTAT

suppresses the display of statistics in the plot margin. By default, the number of observations, R-square, adjusted R-square, and the root MSE are displayed.

NP

displays the number of regressors in the model including the intercept, if any, in the plot margin.

OVERLAY

overlays all plots specified in the PLOT statement from the same model on one set of axes. The variables for the first plot label the axes. The procedure automatically scales the axes to fit all of the variables unless the HAXIS= or VAXIS= option is used. Default reference lines are not displayed. A default legend is produced; the LEGEND= option can be used to customize the legend. See Example 61.11 on page 3958.

PC

displays the PC statistic in the plot margin.

PRED
is a keyword used as a shorthand option to request plots that include 100(1 − α)% prediction intervals for individual responses (see Example 61.9 on page 3955). The ALPHA= option in the PROC REG or MODEL statement selects the significance level α, which is 0.05 by default. The PRED option is valid for simple regression models only, and is ignored for plots where prediction intervals are inappropriate.
The PRED option replaces the PRED95 option; however, the PRED95 option is still supported when the ALPHA= option is not specified. The OVERLAY option is ignored when the PRED option is specified.

RIDGEPLOT
creates overlaid plots of ridge estimates against ridge values for each coefficient. The points corresponding to the estimates of each coefficient in the plot are connected by lines. For ridge estimates to be computed and plotted, the OUTEST= option must be specified in the PROC REG statement, and the RIDGE= list must be specified in either the PROC REG or the MODEL statement. See Example 61.10 on page 3956.
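A minimal sketch of the required combination of options (the data set and variable names are hypothetical):

   proc reg data=mydata outest=ridge_est ridge=0 to 0.1 by 0.01;
      model y=x1 x2 x3;
      plot / ridgeplot nomodel nostat;  /* one ridge trace per coefficient */
   run;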
SBC

displays the SBC statistic in the plot margin.

SP
displays the Sp statistic in the plot margin.

SSE

displays the error sum of squares in the plot margin.

STATFONT=font

specifies the font used for displaying the statistics that appear in the plot margin. Refer to SAS/GRAPH Software: Reference for tables of software fonts.

STATHT=height

specifies the text height of the statistics that appear in the plot margin.

USEALL

specifies that predicted values at data points with missing dependent variable(s) are included on appropriate plots. By default, only points used in constructing the SSCP matrix appear on plots.

VAXIS=values
VA=values

specifies tick mark values for the vertical axis.

VREF=values
specifies where reference lines perpendicular to the vertical axis are to appear.
Line Printer Plots

Line printer plots are requested with the LINEPRINTER option in the PROC REG statement. Points in line printer plots can be marked with symbols, which can be specified as a single character enclosed in quotes or the name of any variable in the input data set.

If a character variable is used for the symbol, the first (left-most) nonblank character in the formatted value of the variable is used as the plotting symbol. If a character in quotes is specified, that character becomes the plotting symbol. If a character is used as the plotting symbol, and if there are different plotting symbols needed at the same point, the symbol '?' is used at that point.

If an unformatted numeric variable is used for the symbol, the symbols '1', '2', ..., '9' are used for variable values 1, 2, ..., 9. For noninteger values, only the integer portion is used as the plotting symbol. For values of 10 or greater, the symbol '*' is used. For negative values, a '?' is used. If a numeric variable is used, and if there is more than one plotting symbol needed at the same point, the sum of the variable values is used at that point. If the sum exceeds 9, the symbol '*' is used.

If a symbol is not specified, the number of replicates at the point is displayed. The symbol '*' is used if there are ten or more replicates.

If the LINEPRINTER option is used, you can specify the following options in the PLOT statement after a slash (/):

CLEAR
clears any collected scatter plots before plotting begins but does not turn off the COLLECT option. Use this option when you want to begin a new collection with the plots in the current PLOT statement. For more information on collecting plots, see the COLLECT and NOCOLLECT options in this section.

COLLECT

specifies that plots begin to be collected from one PLOT statement to the next and that subsequent plots show an overlay of all collected plots. This option enables you to overlay plots before and after changes to the model or to the data used to fit the model. Plots collected before changes are unaffected by the changes and can be overlaid on later plots. You can request more than one plot with this option, and you do not need to request the same number of plots in subsequent PLOT statements. If you specify an unequal number of plots, plots in corresponding positions are overlaid. For example, the statements

   plot residual.*predicted. y*x / collect;
   run;

produce two plots. If these statements are then followed by

   plot residual.*x;
   run;

two plots are again produced. The first plot shows residual against X values overlaid on residual against predicted values. The second plot is the same as that produced by the first PLOT statement. Axes are scaled for the first plot or plots collected. The axes are not rescaled as more plots are collected. Once specified, the COLLECT option remains in effect until the NOCOLLECT option is specified.

HPLOTS=number
sets the number of scatter plots that can be displayed across the page. The procedure begins with one plot per page. The value of the HPLOTS= option remains in effect until you change it in a later PLOT statement. See the VPLOTS= option for an example.
NOCOLLECT
specifies that the collection of scatter plots ends after adding the plots in the current PLOT statement. PROC REG starts with the NOCOLLECT option in effect. After you specify the NOCOLLECT option, any following PLOT statement produces a new plot that contains only the plots requested by that PLOT statement. For more information, see the COLLECT option.

OVERLAY

allows requested scatter plots to be superimposed. The axes are scaled so that points on all plots are shown. If the HPLOTS= or VPLOTS= option is set to more than one, the overlaid plot occupies the first position on the page. The OVERLAY option is similar to the COLLECT option in that both options produce superimposed plots. However, OVERLAY superimposes only the plots in the associated PLOT statement; COLLECT superimposes plots across PLOT statements. The OVERLAY option can be used when the COLLECT option is in effect.

SYMBOL='character'

changes the default plotting symbol used for all scatter plots produced in the current and in subsequent PLOT statements. Both SYMBOL='' and SYMBOL=' ' are allowed. If the SYMBOL= option has not been specified, the default symbol is '1' for positions with one observation, '2' for positions with two observations, and so on. For positions with more than 9 observations, '*' is used. The SYMBOL= option (or a plotting symbol) is needed to avoid any confusion caused by this default convention. Specifying a particular symbol is especially important when either the OVERLAY or COLLECT option is being used. If you specify the SYMBOL= option and use a number for character, that number is used for all points in the plot. For example, the statement

   plot y*x / symbol='1';

produces a plot with the symbol '1' used for all points. If you specify a plotting symbol and the SYMBOL= option, the plotting symbol overrides the SYMBOL= option. For example, in the statements

   plot y*x y*v='.' / symbol='*';
the symbol used for the plot of Y against X is ’*’, and a ’.’ is used for the plot of Y against V. If a paint symbol is defined with a PAINT statement, the paint symbol takes precedence over both the SYMBOL= option and the default plotting symbol for the PLOT statement.
VPLOTS=number
sets the number of scatter plots that can be displayed down the page. The procedure begins with one plot per page. The value of the VPLOTS= option remains in effect until you change it in a later PLOT statement. For example, to specify a total of six plots per page, with two rows of three plots, use the HPLOTS= and VPLOTS= options as follows:

   plot y1*x1 y1*x2 y1*x3
        y2*x1 y2*x2 y2*x3 / hplots=3 vplots=2;
   run;
PRINT Statement

PRINT < options > < ANOVA > < MODELDATA > ;

The PRINT statement enables you to interactively display the results of MODEL statement options, produce an ANOVA table, display the data for variables used in the current model, or redisplay the options specified in a MODEL or a previous PRINT statement. In addition, like most other interactive statements in PROC REG, the PRINT statement implicitly refits the model; thus, effects of REWEIGHT statements are seen in the resulting tables. If the experimental ODS graphics are in effect (see the "ODS Graphics" section on page 3922), the PRINT statement also requests the display of the ODS graphics associated with the current model.

The following specifications can appear in the PRINT statement:
options
interactively displays the results of MODEL statement options, where options is one or more of the following: ACOV, ALL, CLI, CLM, COLLIN, COLLINOINT, CORRB, COVB, DW, I, INFLUENCE, P, PARTIAL, PCORR1, PCORR2, R, SCORR1, SCORR2, SEQB, SPEC, SS1, SS2, STB, TOL, VIF, or XPX. See the “MODEL Statement” section on page 3821 for a description of these options.
ANOVA
produces the ANOVA table associated with the current model. This is either the model specified in the last MODEL statement or the model that incorporates changes made by ADD, DELETE, or REWEIGHT statements after the last MODEL statement.
MODELDATA
displays the data for variables used in the current model.
Use the statement

   print;
to reprint options in the most recently specified PRINT or MODEL statement. Options that require original data values, such as R or INFLUENCE, cannot be used when a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used as the input data set to PROC REG. See the “Input Data Sets” section on page 3860 for more detail.
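For instance, a hedged sketch of interactive use, assuming the fitness data:

   proc reg data=fitness;
      model Oxygen=RunTime Age;
      run;
      print cli;    /* confidence limits for individual predicted values */
      print anova;  /* ANOVA table for the current model */
   run;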
REFIT Statement

REFIT;

The REFIT statement causes the current model and corresponding statistics to be recomputed immediately. No output is generated by this statement. The REFIT statement is needed after one or more REWEIGHT statements to cause them to take effect before subsequent PAINT or REWEIGHT statements. This is sometimes necessary when you are using statistical conditions in REWEIGHT statements. For example, with these statements

   paint student.>2;
   plot student.*p.;
   reweight student.>2;
   refit;
   paint student.>2;
   plot student.*p.;
the second PAINT statement paints any additional observations that meet the condition after deleting observations and refitting the model. The REFIT statement is used because the REWEIGHT statement does not cause the model to be recomputed. In this particular example, the same effect could be achieved by replacing the REFIT statement with a PLOT statement. Most interactive statements can be used to implicitly refit the model; any plots or statistics produced by these statements reflect changes made to the model and changes made to the data used to compute the model. The two exceptions are the PAINT and REWEIGHT statements, which do not cause the model to be recomputed.
RESTRICT Statement

RESTRICT equation < , ... , equation > ;

A RESTRICT statement is used to place restrictions on the parameter estimates in the MODEL preceding it. More than one RESTRICT statement can follow each MODEL statement. Each RESTRICT statement replaces any previous RESTRICT statement. To lift all restrictions on a model, submit a new MODEL statement. If there are several restrictions, separate them with commas. The statement

   restrict equation1=equation2=equation3;

is equivalent to imposing the two restrictions

   restrict equation1=equation2;
   restrict equation2=equation3;
Each restriction is written as a linear equation and can be written as
equation
or
equation = equation

The form of each equation is

   c1 × variable1 ± c2 × variable2 ± ··· ± cn × variablen

where the cj's are constants and the variablej's are any regressor variables. When no equal sign appears, the linear combination is set equal to zero. Each variable name mentioned must be a variable in the MODEL statement to which the RESTRICT statement refers. The keyword INTERCEPT can also be used as a variable name, and it refers to the intercept parameter in the regression model.

Note that the parameters associated with the variables are restricted, not the variables themselves. Restrictions should be consistent and not redundant. Examples of valid RESTRICT statements include the following:

   restrict x1;
   restrict a+b=1;
   restrict a=b=c;
   restrict a=b, b=c;
   restrict 2*f=g+h, intercept+f=0;
   restrict f=g=h=intercept;

The third and fourth statements in this list produce identical restrictions. You cannot specify

   restrict f-g=0, f-intercept=0, g-intercept=1;
because the three restrictions are not consistent. If these restrictions are included in a RESTRICT statement, one of the restrict parameters is set to zero and has zero degrees of freedom, indicating that PROC REG is unable to apply a restriction. The restrictions usually operate even if the model is not of full rank. Check to ensure that DF= −1 for each restriction. In addition, the Model DF should decrease by 1 for each restriction. The parameter estimates are those that minimize the quadratic criterion (SSE) subject to the restrictions. If a restriction cannot be applied, its parameter value and degrees of freedom are listed as zero. The method used for restricting the parameter estimates is to introduce a Lagrangian parameter for each restriction (Pringle and Raynor 1971). The estimates of these parameters are displayed with test statistics. Note that the t statistic reported for
the Lagrangian parameters does not follow a Student's t distribution, but its square follows a beta distribution (LaMotte 1994). The p-value for these parameters is computed using the beta distribution.

The Lagrangian parameter γ measures the sensitivity of the SSE to the restriction constant. If the restriction constant is changed by a small amount ε, the SSE is changed by 2γε. The t ratio tests the significance of the restrictions. If γ is zero, the restricted estimates are the same as the unrestricted estimates, and a change in the restriction constant in either direction increases the SSE.

RESTRICT statements are ignored if the PCOMIT= or RIDGE= option is specified in the PROC REG statement.
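As a hedged, minimal sketch of a RESTRICT statement in use, assuming the fitness data (the particular constraint is purely illustrative):

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight;
      restrict Age+Weight=0;  /* constrain the Age and Weight coefficients to sum to zero */
   run;

In the resulting output, the restriction row should show DF = −1, and the Model DF should decrease by 1, as described above.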
REWEIGHT Statement

REWEIGHT < condition | ALLOBS > < / options > ;
REWEIGHT < STATUS | UNDO > ;

The REWEIGHT statement interactively changes the weights of observations that are used in computing the regression equation. The REWEIGHT statement can change observation weights, or set them to zero, which causes selected observations to be excluded from the analysis. When a REWEIGHT statement sets observation weights to zero, the observations are not deleted from the data set. More than one REWEIGHT statement can be used. The requests from all REWEIGHT statements are applied to the subsequent statements. Each use of the REWEIGHT statement modifies the MODEL label.

The model and corresponding statistics are not recomputed after a REWEIGHT statement. For example, with the following statements

   reweight r.>0;
   reweight r.>0;
the second REWEIGHT statement does not exclude any additional observations since the model is not recomputed after the first REWEIGHT statement. Use either a REFIT statement to explicitly refit the model, or implicitly refit the model by following the REWEIGHT statement with any other interactive statement except a PAINT statement or another REWEIGHT statement. The REWEIGHT statement cannot be used if a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used as an input data set to PROC REG. Note that the syntax used in the REWEIGHT statement is the same as the syntax in the PAINT statement. The syntax of the REWEIGHT statement is described in the following sections. For detailed examples of using this statement see the section “Reweighting Observations in an Analysis” on page 3903.
Specifying Condition

Condition is used to find observations to be reweighted. The syntax of condition is

   variable compare value

or

   variable compare value logical variable compare value

where

variable

is one of the following:
• a variable name in the input data set
• OBS., which is the observation number
• keyword., where keyword is a keyword for a statistic requested in the OUTPUT statement. The keyword specification is applied to all dependent variables in the model.
compare
is an operator that compares variable to value. Compare can be any one of the following: <, <=, >, >=, =, ^=. The operators LT, LE, GT, GE, EQ, and NE can be used instead of the preceding symbols. Refer to the "Expressions" chapter in SAS Language Reference: Concepts for more information on comparison operators.
value
gives an unformatted value of variable. Observations are selected to be reweighted if they satisfy the condition created by variable compare value. Value can be a number or a character string. If value is a character string, it must be eight characters or less and must be enclosed in quotes. In addition, value is case-sensitive. In other words, the following two statements are not the same:

   reweight name='steve';
   reweight name='Steve';
logical
is one of two logical operators. Either AND or OR can be used. To specify AND, use AND or the symbol &. To specify OR, use OR or the symbol |.
Examples of the variable compare value form are

   reweight obs. le 10;
   reweight temp=55;
   reweight type='new';

Examples of the variable compare value logical variable compare value form are

   reweight obs.<=10 and residual.<2;
   reweight student.<-2 or student.>2;
   reweight name='Mary' | name='Susan';
Using ALLOBS

Instead of specifying condition, you can use the ALLOBS option to select all observations. This is most useful when you want to restore the original weights of all observations. For example,

   reweight allobs / reset;

resets weights for all observations and uses all observations in the subsequent analysis. Note that

   reweight allobs;
specifies that all observations be excluded from analysis. Consequently, using ALLOBS is useful only if you also use one of the options discussed in the following section.
Options in the REWEIGHT Statement

The following options can be used when a condition is specified, when the ALLOBS option is specified, or when nothing is specified before the slash. If only an option is listed, the option applies to the observations selected in the previous REWEIGHT statement, not to the observations selected by reapplying the condition from the previous REWEIGHT statement. For example, with the statements

   reweight r.>0 / weight=0.1;
   refit;
   reweight;

the second REWEIGHT statement excludes from the analysis only those observations selected in the first REWEIGHT statement. No additional observations are excluded even if there are new observations that meet the condition in the first REWEIGHT statement.

Note: Options are not available when either the UNDO or STATUS option is used.

NOLIST

suppresses the display of the selected observation numbers. If you omit the NOLIST option, a list of observations selected is written to the log.

RESET

resets the observation weights to their original values as defined by the WEIGHT statement or to WEIGHT=1 if no WEIGHT statement is specified. For example,

   reweight / reset;
resets observation weights to the original weights in the data set. If previous REWEIGHT statements have been submitted, this REWEIGHT statement applies
only to the observations selected by the previous REWEIGHT statement. Note that, although the RESET option does reset observation weights to their original values, it does not cause the model and corresponding statistics to be recomputed.

WEIGHT=value

changes observation weights to the specified nonnegative real number. If you omit the WEIGHT= option, the observation weights are set to zero, and observations are excluded from the analysis. For example,

   reweight name='Alan';
   ...other interactive statements
   reweight / weight=0.5;
The first REWEIGHT statement changes weights to zero for all observations with name=’Alan’, effectively deleting these observations. The subsequent analysis does not include these observations. The second REWEIGHT statement applies only to those observations selected by the previous REWEIGHT statement, and it changes the weights to 0.5 for all the observations with NAME=’Alan’. Thus, the next analysis includes all original observations; however, those observations with NAME=’Alan’ have their weights set to 0.5.
STATUS and UNDO

If you omit condition and the ALLOBS option, you can specify one of the following options.

STATUS

writes to the log the observation number and weight of all reweighted observations. If an observation's weight has been set to zero, it is reported as deleted. However, the observation is not deleted from the data set, only from the analysis.

UNDO

undoes the changes made by the most recent REWEIGHT statement. Weights may be, but are not necessarily, reset. For example, in these statements

   reweight student.>2 / weight=0.1;
   reweight;
   reweight undo;
the first REWEIGHT statement sets the weights of observations that satisfy the condition to 0.1. The second REWEIGHT statement sets the weights of the same observations to zero. The third REWEIGHT statement undoes the second, changing the weights back to 0.1.
TEST Statement

< label: > TEST equation < , ... , equation > < / options > ;

The TEST statement tests hypotheses about the parameters estimated in the preceding MODEL statement. It has the same syntax as the RESTRICT statement except that it allows an option. Each equation specifies a linear hypothesis to be tested. The rows of the hypothesis are separated by commas.

Variable names must correspond to regressors, and each variable name represents the coefficient of the corresponding variable in the model. An optional label is useful to identify each test with a name. The keyword INTERCEPT can be used instead of a variable name to refer to the model's intercept.

The REG procedure performs an F test for the joint hypotheses specified in a single TEST statement. More than one TEST statement can accompany a MODEL statement. The numerator is the usual quadratic form of the estimates; the denominator is the mean squared error. If the hypotheses can be represented by

   Lβ = c

then the numerator of the F test is

   Q = (Lb − c)′(L(X′X)⁻L′)⁻¹(Lb − c)

divided by its degrees of freedom, where b is the estimate of β. For example,

   model y=a1 a2 b1 b2;
   aplus: test a1+a2=1;
   b1: test b1=0, b2=0;
   b2: test b1, b2;

The last two statements are equivalent; since no constant is specified, zero is assumed. Note that, when the ACOV option is specified in the MODEL statement, tests are recomputed using the heteroscedasticity consistent covariance matrix (see the section "Testing for Heteroscedasticity" on page 3910).

One option can be specified in the TEST statement after a slash (/):

PRINT

displays intermediate calculations. This includes L(X′X)⁻L′ bordered by Lb − c, and (L(X′X)⁻L′)⁻¹ bordered by (L(X′X)⁻L′)⁻¹(Lb − c).
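A hedged sketch of a labeled joint test with the PRINT option, assuming the fitness data:

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight;
      agewt: test Age, Weight / print;  /* joint F test of Age=0 and Weight=0 */
   run;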
VAR Statement

VAR variables ;

The VAR statement is used to include numeric variables in the crossproducts matrix that are not specified in the first MODEL statement. Variables not listed in MODEL statements before the first RUN statement must be listed in the VAR statement if you want the ability to add them interactively to the model with an ADD statement, to include them in a new MODEL statement, or to plot them in a scatter plot with the PLOT statement. In addition, if you want to use options in the PROC REG statement and do not want to fit a model to the data (with a MODEL statement), you must use a VAR statement.
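For instance, a hedged sketch of reserving a variable for later interactive use, assuming the fitness data:

   proc reg data=fitness;
      var MaxPulse;   /* include MaxPulse in the crossproducts matrix */
      model Oxygen=RunTime Age;
      run;
      add MaxPulse;   /* possible only because MaxPulse was listed in the VAR statement */
      print;
   run;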
WEIGHT Statement

WEIGHT variable ;

A WEIGHT statement names a variable in the input data set with values that are relative weights for a weighted least-squares fit. If the weight value is proportional to the reciprocal of the variance for each observation, then the weighted estimates are the best linear unbiased estimates (BLUE).

Values of the weight variable must be nonnegative. If an observation's weight is zero, the observation is deleted from the analysis. If a weight is negative or missing, it is set to zero, and the observation is excluded from the analysis. A more complete description of the WEIGHT statement can be found in Chapter 32, "The GLM Procedure."

Observation weights can be changed interactively with the REWEIGHT statement; see the section "REWEIGHT Statement" beginning on page 3854.
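A minimal sketch (the data set and variable names are hypothetical; W might hold reciprocal variances):

   proc reg data=mydata;
      weight w;        /* relative weights for the weighted least-squares fit */
      model y=x;
   run;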
Details

Missing Values

PROC REG constructs only one crossproducts matrix for the variables in all regressions. If any variable needed for any regression is missing, the observation is excluded from all estimates. If you include variables with missing values in the VAR statement, the corresponding observations are excluded from all analyses, even if you never include the variables in a model. PROC REG assumes that you may want to include these variables after the first RUN statement and deletes observations with missing values.
Input Data Sets

PROC REG does not compute new regressors. For example, if you want a quadratic term in your model, you should create a new variable when you prepare the input data. The statement

   model y=x1 x1*x1;

is not valid in PROC REG. (Note that this MODEL statement is valid in the GLM procedure.)
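Instead, a hedged sketch of preparing the quadratic term in a DATA step (the data set and variable names are hypothetical):

   data b;
      set a;
      x1sq=x1*x1;      /* construct the quadratic regressor beforehand */
   run;
   proc reg data=b;
      model y=x1 x1sq; /* now a valid PROC REG model */
   run;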
The input data set for most applications of PROC REG contains standard rectangular data, but special TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets can also be used. TYPE=CORR and TYPE=COV data sets created by the CORR procedure contain means and standard deviations. In addition, TYPE=CORR data sets contain correlations and TYPE=COV data sets contain covariances. TYPE=SSCP data sets created in previous runs of PROC REG that used the OUTSSCP= option contain the sums of squares and crossproducts of the variables. See Appendix A, "Special SAS Data Sets," and the "SAS Files" section in SAS Language Reference: Concepts for more information on special SAS data sets.

These summary files save CPU time. It takes nk² operations (where n = number of observations and k = number of variables) to calculate crossproducts; the regressions are of the order k³. When n is in the thousands and k is less than 10, you can save 99 percent of the CPU time by reusing the SSCP matrix rather than recomputing it.

When you want to use a special SAS data set as input, PROC REG must determine the TYPE for the data set. PROC CORR and PROC REG automatically set the type for their output data sets. However, if you create the data set by some other means (such as a DATA step) you must specify its type with the TYPE= data set option. If the TYPE for the data set is not specified when the data set is created, you can specify TYPE= as a data set option in the DATA= option in the PROC REG statement. For example,

   proc reg data=a(type=corr);
When TYPE=CORR, TYPE=COV, or TYPE=SSCP data sets are used with PROC REG, statements and options that require the original data values have no effect. The OUTPUT, PAINT, PLOT, and REWEIGHT statements and the MODEL and PRINT statement options P, R, CLM, CLI, DW, INFLUENCE, and PARTIAL are disabled since the original observations needed to calculate predicted and residual values are not present.
Example Using TYPE=CORR Data Set

This example uses PROC CORR to produce an input data set for PROC REG. The fitness data for this analysis can be found in Example 61.1 on page 3924.

   proc corr data=fitness outp=r noprint;
      var Oxygen RunTime Age Weight RunPulse MaxPulse RestPulse;
   proc print data=r;
   proc reg data=r;
      model Oxygen=RunTime Age Weight;
   run;
Since the OUTP= data set from PROC CORR is automatically set to TYPE=CORR, the TYPE= data set option is not required in this example. The data set containing the correlation matrix is displayed by the PRINT procedure as shown in Figure 61.12. Figure 61.13 shows results from the regression using the TYPE=CORR data as an input data set.
Obs  _TYPE_  _NAME_       Oxygen  RunTime      Age   Weight  RunPulse  MaxPulse  RestPulse

  1  MEAN              47.3758  10.5861  47.6774  77.4445   169.645   173.774    53.4516
  2  STD                5.3272   1.3874   5.2114   8.3286    10.252     9.164     7.6194
  3  N                 31.0000  31.0000  31.0000  31.0000    31.000    31.000    31.0000
  4  CORR    Oxygen     1.0000  -0.8622  -0.3046  -0.1628    -0.398    -0.237    -0.3994
  5  CORR    RunTime   -0.8622   1.0000   0.1887   0.1435     0.314     0.226     0.4504
  6  CORR    Age       -0.3046   0.1887   1.0000  -0.2335    -0.338    -0.433    -0.1641
  7  CORR    Weight    -0.1628   0.1435  -0.2335   1.0000     0.182     0.249     0.0440
  8  CORR    RunPulse  -0.3980   0.3136  -0.3379   0.1815     1.000     0.930     0.3525
  9  CORR    MaxPulse  -0.2367   0.2261  -0.4329   0.2494     0.930     1.000     0.3051
 10  CORR    RestPulse -0.3994   0.4504  -0.1641   0.0440     0.352     0.305     1.0000

Figure 61.12. TYPE=CORR Data Set Created by PROC CORR

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3       656.27095    218.75698    30.27  <.0001
Error            27       195.11060      7.22632
Corrected Total  30       851.38154

Root MSE          2.68818    R-Square  0.7708
Dependent Mean   47.37581    Adj R-Sq  0.7454
Coeff Var         5.67416

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            93.12615         7.55916    12.32    <.0001
RunTime     1            -3.14039         0.36738    -8.55    <.0001
Age         1            -0.17388         0.09955    -1.75    0.0921
Weight      1            -0.05444         0.06181    -0.88    0.3862

Figure 61.13. Regression on TYPE=CORR Data Set
Example Using TYPE=SSCP Data Set

The following example uses the saved crossproducts matrix:

   proc reg data=fitness outsscp=sscp noprint;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse;
   proc print data=sscp;
   proc reg data=sscp;
      model Oxygen=RunTime Age Weight;
   run;
First, all variables are used to fit the data and create the SSCP data set. Figure 61.14 shows the PROC PRINT display of the SSCP data set. The SSCP data set is then used as the input data set for PROC REG, and a reduced model is fit to the data. Figure 61.15 shows the PROC REG results for the reduced model. (For the PROC REG results for the full model, see Figure 61.27 on page 3877.) In the preceding example, the TYPE= data set option is not required since PROC REG sets the OUTSSCP= data set to TYPE=SSCP.

Obs  _TYPE_  _NAME_     Intercept   RunTime        Age     Weight   RunPulse   MaxPulse  RestPulse     Oxygen

 1   SSCP    Intercept      31.00    328.17    1478.00    2400.78    5259.00    5387.00    1657.00    1468.65
 2   SSCP    RunTime       328.17   3531.80   15687.24   25464.71   55806.29   57113.72   17684.05   15356.14
 3   SSCP    Age          1478.00  15687.24   71282.00  114158.90  250194.00  256218.00   78806.00   69767.75
 4   SSCP    Weight       2400.78  25464.71  114158.90  188008.20  407745.67  417764.62  128409.28  113522.26
 5   SSCP    RunPulse     5259.00  55806.29  250194.00  407745.67  895317.00  916499.00  281928.00  248497.31
 6   SSCP    MaxPulse     5387.00  57113.72  256218.00  417764.62  916499.00  938641.00  288583.00  254866.75
 7   SSCP    RestPulse    1657.00  17684.05   78806.00  128409.28  281928.00  288583.00   90311.00   78015.41
 8   SSCP    Oxygen       1468.65  15356.14   69767.75  113522.26  248497.31  254866.75   78015.41   70429.86
 9   N                      31.00     31.00      31.00      31.00      31.00      31.00      31.00      31.00

Figure 61.14. TYPE=SSCP Data Set Created by PROC REG
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3       656.27095    218.75698    30.27  <.0001
Error            27       195.11060      7.22632
Corrected Total  30       851.38154

Root MSE          2.68818    R-Square  0.7708
Dependent Mean   47.37581    Adj R-Sq  0.7454
Coeff Var         5.67416

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            93.12615         7.55916    12.32    <.0001
RunTime     1            -3.14039         0.36738    -8.55    <.0001
Age         1            -0.17388         0.09955    -1.75    0.0921
Weight      1            -0.05444         0.06181    -0.88    0.3862
Figure 61.15. Regression on TYPE=SSCP Data Set
Output Data Sets

OUTEST= Data Set

The OUTEST= specification produces a TYPE=EST output SAS data set containing estimates and optional statistics from the regression models. For each BY group on each dependent variable occurring in each MODEL statement, PROC REG outputs an observation to the OUTEST= data set. The variables output to the data set are as follows:

• the BY variables, if any
• _MODEL_, a character variable containing the label of the corresponding MODEL statement, or MODELn if no label is specified, where n is 1 for the first MODEL statement, 2 for the second MODEL statement, and so on
• _TYPE_, a character variable with the value 'PARMS' for every observation
• _DEPVAR_, the name of the dependent variable
• _RMSE_, the root mean squared error or the estimate of the standard deviation of the error term
• Intercept, the estimated intercept, unless the NOINT option is specified
• all the variables listed in any MODEL or VAR statement. Values of these variables are the estimated regression coefficients for the model. A variable
that does not appear in the model corresponding to a given observation has a missing value in that observation. The dependent variable in each model is given a value of −1.
If you specify the COVOUT option, the covariance matrix of the estimates is output after the estimates; the _TYPE_ variable is set to the value 'COV' and the names of the rows are identified by the 8-byte character variable, _NAME_.

If you specify the TABLEOUT option, the following statistics listed by _TYPE_ are added after the estimates:

• STDERR, the standard error of the estimate
• T, the t statistic for testing if the estimate is zero
• PVALUE, the associated p-value
• LnB, the 100(1 − α)% lower confidence limit for the estimate, where n is the nearest integer to 100(1 − α) and α defaults to 0.05 or is set using the ALPHA= option in the PROC REG or MODEL statement
• UnB, the 100(1 − α)% upper confidence limit for the estimate

Specifying the option ADJRSQ, AIC, BIC, CP, EDF, GMSEP, JP, MSE, PC, RSQUARE, SBC, SP, or SSE in the PROC REG or MODEL statement automatically outputs these statistics and the model R-square for each model selected, regardless of the model selection method. Additional variables, in order of occurrence, are as follows.

• _IN_, the number of regressors in the model not including the intercept
• _P_, the number of parameters in the model including the intercept, if any
• _EDF_, the error degrees of freedom
• _SSE_, the error sum of squares, if the SSE option is specified
• _MSE_, the mean squared error, if the MSE option is specified
• _RSQ_, the R-square statistic
• _ADJRSQ_, the adjusted R-square, if the ADJRSQ option is specified
• _CP_, the Cp statistic, if the CP option is specified
• _SP_, the Sp statistic, if the SP option is specified
• _JP_, the Jp statistic, if the JP option is specified
• _PC_, the PC statistic, if the PC option is specified
• _GMSEP_, the GMSEP statistic, if the GMSEP option is specified
• _AIC_, the AIC statistic, if the AIC option is specified
• _BIC_, the BIC statistic, if the BIC option is specified
• _SBC_, the SBC statistic, if the SBC option is specified
The following is an example with a display of the OUTEST= data set. This example uses the population data given in the section "Polynomial Regression" beginning on page 3804. Figure 61.16 on page 3865 through Figure 61.18 on page 3866 show the regression equations and the resulting OUTEST= data set.

   proc reg data=USPopulation outest=est;
      m1: model Population=Year;
      m2: model Population=Year YearSq;
   proc print data=est;
   run;
The REG Procedure
Model: m1
Dependent Variable: Population

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1          146869       146869   228.92  <.0001
Error            20           12832    641.58160
Corrected Total  21          159700

Root MSE         25.32946    R-Square  0.9197
Dependent Mean   94.64800    Adj R-Sq  0.9156
Coeff Var        26.76175

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1         -2345.85498       161.39279   -14.54    <.0001
Year        1             1.28786         0.08512    15.13    <.0001

Figure 61.16. Regression Output for Model M1
The REG Procedure
Model: m2
Dependent Variable: Population

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2          159529        79765  8864.19  <.0001
Error            19       170.97193      8.99852
Corrected Total  21          159700

Root MSE          2.99975    R-Square  0.9989
Dependent Mean   94.64800    Adj R-Sq  0.9988
Coeff Var         3.16938

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1               21631       639.50181    33.82    <.0001
Year        1           -24.04581         0.67547   -35.60    <.0001
YearSq      1             0.00668      0.00017820    37.51    <.0001

Figure 61.17. Regression Output for Model M2

Obs  _MODEL_  _TYPE_  _DEPVAR_    _RMSE_   Intercept      Year  Population      YearSq
 1   m1       PARMS   Population  25.3295    -2345.85    1.2879          -1           .
 2   m2       PARMS   Population   2.9998    21630.89  -24.0458          -1  .006684346

Figure 61.18. OUTEST= Data Set
The following modification of the previous example uses the TABLEOUT and ALPHA= options to obtain additional information in the OUTEST= data set:

   proc reg data=USPopulation outest=est tableout alpha=0.1;
      m1: model Population=Year/noprint;
      m2: model Population=Year YearSq/noprint;
   proc print data=est;
   run;
Notice that the TABLEOUT option causes standard errors, t statistics, p-values, and confidence limits for the estimates to be added to the OUTEST= data set. Also note that the ALPHA= option is used to set the confidence level at 90%. The OUTEST= data set follows.
   Obs  _MODEL_  _TYPE_   _DEPVAR_     _RMSE_  Intercept      Year   YearSq  Population
    1   m1       PARMS    Population  25.3295   -2345.85    1.2879   .               -1
    2   m1       STDERR   Population  25.3295     161.39    0.0851   .                .
    3   m1       T        Population  25.3295     -14.54   15.1300   .                .
    4   m1       PVALUE   Population  25.3295       0.00    0.0000   .                .
    5   m1       L90B     Population  25.3295   -2624.21    1.1411   .                .
    6   m1       U90B     Population  25.3295   -2067.50    1.4347   .                .
    7   m2       PARMS    Population   2.9998   21630.89  -24.0458   0.0067          -1
    8   m2       STDERR   Population   2.9998     639.50    0.6755   0.0002           .
    9   m2       T        Population   2.9998      33.82  -35.5988  37.5096           .
   10   m2       PVALUE   Population   2.9998       0.00    0.0000   0.0000           .
   11   m2       L90B     Population   2.9998   20525.11  -25.2138   0.0064           .
   12   m2       U90B     Population   2.9998   22736.68  -22.8778   0.0070           .

Figure 61.19. The OUTEST= Data Set When TABLEOUT is Specified
A slightly different OUTEST= data set is created when you use the RSQUARE selection method. This example requests only the "best" model for each subset size but asks for a variety of model selection statistics, as well as the estimated regression coefficients. An OUTEST= data set is created and displayed. See Figure 61.20 and Figure 61.21 for results.

   proc reg data=fitness outest=est;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=rsquare mse jp gmsep cp aic bic sbc b best=1;
   proc print data=est;
   run;
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Oxygen

                          R-Square Selection Method

   Number in                                      Estimated MSE
       Model  R-Square     C(p)      AIC      BIC  of Prediction     J(p)      MSE       SBC
           1    0.7434  13.6988  64.5341  65.4673         8.0546   8.0199  7.53384  67.40210
           2    0.7642  12.3894  63.9050  64.8212         7.9478   7.8621  7.16842  68.20695
           3    0.8111   6.9596  59.0373  61.3127         6.8583   6.7253  5.95669  64.77326
           4    0.8368   4.8800  56.4995  60.3996         6.3984   6.2053  5.34346  63.66941
           5    0.8480   5.1063  56.2986  61.5667         6.4565   6.1782  5.17634  64.90250
           6    0.8487   7.0000  58.1616  64.0748         6.9870   6.5804  5.36825  68.19952

   Number in            ----------------------Parameter Estimates----------------------
       Model  R-Square  Intercept       Age    Weight   RunTime  RunPulse  RestPulse  MaxPulse
           1    0.7434   82.42177         .         .  -3.31056         .          .         .
           2    0.7642   88.46229  -0.15037         .  -3.20395         .          .         .
           3    0.8111  111.71806  -0.25640         .  -2.82538  -0.13091          .         .
           4    0.8368   98.14789  -0.19773         .  -2.76758  -0.34811          .   0.27051
           5    0.8480  102.20428  -0.21962  -0.07230  -2.68252  -0.37340          .   0.30491
           6    0.8487  102.93448  -0.22697  -0.07418  -2.62865  -0.36963   -0.02153   0.30322

Figure 61.20. PROC REG Output for Physical Fitness Data: Best Models
   Obs  _MODEL_  _TYPE_  _DEPVAR_   _RMSE_  Intercept       Age     Weight    RunTime  RunPulse  RestPulse  MaxPulse
    1   MODEL1   PARMS   Oxygen    2.74478     82.422         .          .   -3.31056         .          .         .
    2   MODEL1   PARMS   Oxygen    2.67739     88.462  -0.15037          .   -3.20395         .          .         .
    3   MODEL1   PARMS   Oxygen    2.44063    111.718  -0.25640          .   -2.82538  -0.13091          .         .
    4   MODEL1   PARMS   Oxygen    2.31159     98.148  -0.19773          .   -2.76758  -0.34811          .   0.27051
    5   MODEL1   PARMS   Oxygen    2.27516    102.204  -0.21962  -0.072302   -2.68252  -0.37340          .   0.30491
    6   MODEL1   PARMS   Oxygen    2.31695    102.934  -0.22697  -0.074177   -2.62865  -0.36963  -0.021534   0.30322

   Obs  Oxygen  _IN_  _P_  _EDF_    _MSE_    _RSQ_     _CP_     _JP_  _GMSEP_
    1       -1     1    2     29  7.53384  0.74338  13.6988  8.01990  8.05462
    2       -1     2    3     28  7.16842  0.76425  12.3894  7.86214  7.94778
    3       -1     3    4     27  5.95669  0.81109   6.9596  6.72530  6.85833
    4       -1     4    5     26  5.34346  0.83682   4.8800  6.20531  6.39837
    5       -1     5    6     25  5.17634  0.84800   5.1063  6.17821  6.45651
    6       -1     6    7     24  5.36825  0.84867   7.0000  6.58043  6.98700

   Obs    _AIC_    _BIC_    _SBC_
    1   64.5341  65.4673  67.4021
    2   63.9050  64.8212  68.2069
    3   59.0373  61.3127  64.7733
    4   56.4995  60.3996  63.6694
    5   56.2986  61.5667  64.9025
    6   58.1616  64.0748  68.1995

Figure 61.21. PROC PRINT Output for Physical Fitness Data: OUTEST= Data Set
OUTSSCP= Data Sets

The OUTSSCP= option produces a TYPE=SSCP output SAS data set containing sums of squares and crossproducts. A special row (observation) and column (variable) of the matrix called Intercept contain the number of observations and sums.
Observations are identified by the character variable _NAME_. The data set contains all variables used in MODEL statements. You can specify additional variables that you want included in the crossproducts matrix with a VAR statement.

The SSCP data set is used when a large number of observations are explored in many different runs. The SSCP data set can be saved and used for subsequent runs, which are much less expensive since PROC REG never reads the original data again. If you run PROC REG once to create only an SSCP data set, you should list all the variables that you may need in a VAR statement or include all the variables that you may need in a MODEL statement.

The following example uses the fitness data from Example 61.1 on page 3924 to produce an output data set with the OUTSSCP= option. The resulting output is shown in Figure 61.22.

   proc reg data=fitness outsscp=sscp;
      var Oxygen RunTime Age Weight RestPulse RunPulse MaxPulse;
   proc print data=sscp;
   run;
Since a model is not fit to the data and since the only request is to create the SSCP data set, a MODEL statement is not required in this example. However, since the MODEL statement is not used, the VAR statement is required.

   Obs  _TYPE_  _NAME_     Intercept     Oxygen   RunTime        Age     Weight  RestPulse   RunPulse   MaxPulse
    1   SSCP    Intercept      31.00    1468.65    328.17    1478.00    2400.78    1657.00    5259.00    5387.00
    2   SSCP    Oxygen       1468.65   70429.86  15356.14   69767.75  113522.26   78015.41  248497.31  254866.75
    3   SSCP    RunTime       328.17   15356.14   3531.80   15687.24   25464.71   17684.05   55806.29   57113.72
    4   SSCP    Age          1478.00   69767.75  15687.24   71282.00  114158.90   78806.00  250194.00  256218.00
    5   SSCP    Weight       2400.78  113522.26  25464.71  114158.90  188008.20  128409.28  407745.67  417764.62
    6   SSCP    RestPulse    1657.00   78015.41  17684.05   78806.00  128409.28   90311.00  281928.00  288583.00
    7   SSCP    RunPulse     5259.00  248497.31  55806.29  250194.00  407745.67  281928.00  895317.00  916499.00
    8   SSCP    MaxPulse     5387.00  254866.75  57113.72  256218.00  417764.62  288583.00  916499.00  938641.00
    9   N                      31.00      31.00     31.00      31.00      31.00      31.00      31.00      31.00

Figure 61.22. SSCP Data Set Created with OUTSSCP= Option: REG Procedure
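Because the saved data set is TYPE=SSCP, it can serve directly as input in a later run; the following is a minimal sketch (the choice of regressors is illustrative) of refitting from the crossproducts without rereading the original data:

   /* Fit a model from the saved SSCP data set rather than the raw data */
   proc reg data=sscp;
      model Oxygen=RunTime Age Weight;
   run;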
Interactive Analysis

PROC REG enables you to change interactively both the model and the data used to compute the model, and to produce and highlight scatter plots. See the section "Using PROC REG Interactively" on page 3812 for an overview of interactive analysis using PROC REG. The following statements can be used interactively (without reinvoking PROC REG): ADD, DELETE, MODEL, MTEST, OUTPUT, PAINT, PLOT, PRINT, REFIT, RESTRICT, REWEIGHT, and TEST. All interactive features are disabled if there is a BY statement.

The ADD, DELETE, and REWEIGHT statements can be used to modify the current MODEL. Every use of an ADD, DELETE, or REWEIGHT statement causes the model label to be modified by attaching an additional number to it. This number is the cumulative total of the number of ADD, DELETE, or REWEIGHT statements following the current MODEL statement.

A more detailed explanation of changing the data used to compute the model is given in the section "Reweighting Observations in an Analysis" on page 3903. Extra features for line printer scatter plots are discussed in the section "Line Printer Scatter Plot Features" on page 3882.

The following example illustrates the usefulness of the interactive features. First, the full regression model is fit to the class data (see the "Getting Started" section on page 3800), and Figure 61.23 is produced.

   proc reg data=Class;
      model Weight=Age Height;
   run;
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Weight

                             Analysis of Variance

                                    Sum of         Mean
   Source            DF           Squares       Square    F Value    Pr > F
   Model              2        7215.63710   3607.81855      27.23    <.0001
   Error             16        2120.09974    132.50623
   Corrected Total   18        9335.73684

   Root MSE          11.51114    R-Square    0.7729
   Dependent Mean   100.02632    Adj R-Sq    0.7445
   Coeff Var         11.50811

                             Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error    t Value    Pr > |t|
   Intercept    1    -141.22376     33.38309      -4.23      0.0006
   Age          1       1.27839      3.11010       0.41      0.6865
   Height       1       3.59703      0.90546       3.97      0.0011

Figure 61.23. Interactive Analysis: Full Model
Next, the regression model is reduced by the following statements, and Figure 61.24 is produced.

   delete age;
   print;
   run;
                               The REG Procedure
                                Model: MODEL1.1
                          Dependent Variable: Weight

                             Analysis of Variance

                                    Sum of         Mean
   Source            DF           Squares       Square    F Value    Pr > F
   Model              1        7193.24912   7193.24912      57.08    <.0001
   Error             17        2142.48772    126.02869
   Corrected Total   18        9335.73684

   Root MSE          11.22625    R-Square    0.7705
   Dependent Mean   100.02632    Adj R-Sq    0.7570
   Coeff Var         11.22330

                             Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error    t Value    Pr > |t|
   Intercept    1    -143.02692     32.27459      -4.43      0.0004
   Height       1       3.89903      0.51609       7.55      <.0001

Figure 61.24. Interactive Analysis: Reduced Model
Note that the MODEL label has been changed from MODEL1 to MODEL1.1, as the original MODEL has been changed by the DELETE statement. The following statements generate a scatter plot of the residuals against the predicted values from the full model. Figure 61.25 is produced, and the scatter plot shows a possible outlier.

   add age;
   plot r.*p. / cframe=ligr;
   run;
Figure 61.25. Interactive Analysis: Scatter Plot
The following statements delete the observation with the largest residual, refit the regression model, and produce a scatter plot of residuals against predicted values for the refitted model. Figure 61.26 shows the new scatter plot.

   reweight r.>20;
   plot / cframe=ligr;
   run;
Figure 61.26. Interactive Analysis: Scatter Plot for Refitted Model
Model-Selection Methods

The nine methods of model selection implemented in PROC REG are specified with the SELECTION= option in the MODEL statement. Each method is discussed in this section.
Full Model Fitted (NONE)

This method is the default and provides no model selection capability. The complete model specified in the MODEL statement is used to fit the model. For many regression analyses, this may be the only method you need.
Forward Selection (FORWARD)

The forward-selection technique begins with no variables in the model. For each of the independent variables, the FORWARD method calculates F statistics that reflect the variable's contribution to the model if it is included. The p-values for these F statistics are compared to the SLENTRY= value that is specified in the MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If no F statistic has a p-value smaller than the SLENTRY= value, the FORWARD selection stops. Otherwise, the FORWARD method adds the variable that has the largest F statistic to the model. The FORWARD method then calculates F statistics again for the variables still remaining outside the model, and the evaluation process is repeated. Thus, variables are added one by one to the model until no remaining variable produces a significant F statistic. Once a variable is in the model, it stays.
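As a concrete sketch using the fitness data from Example 61.1 (the SLENTRY= value shown is illustrative, not a recommendation), the following statements request forward selection:

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / selection=forward slentry=0.10;   /* entry threshold is illustrative */
   run;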
Backward Elimination (BACKWARD)

The backward elimination technique begins by calculating F statistics for a model including all of the independent variables. Then the variables are deleted from the model one by one until all the variables remaining in the model produce F statistics significant at the SLSTAY= level specified in the MODEL statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the variable showing the smallest contribution to the model is deleted.
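A parallel sketch for backward elimination (the SLSTAY= value shown simply restates the default):

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / selection=backward slstay=0.10;   /* stay threshold is illustrative */
   run;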
Stepwise (STEPWISE)

The stepwise method is a modification of the forward-selection technique and differs in that variables already in the model do not necessarily stay there. As in the forward-selection method, variables are added one by one to the model, and the F statistic for a variable to be added must be significant at the SLENTRY= level. After a variable is added, however, the stepwise method looks at all the variables already included in the model and deletes any variable that does not produce an F statistic significant at the SLSTAY= level. Only after this check is made and the necessary deletions accomplished can another variable be added to the model. The stepwise process ends when none of the variables outside the model has an F statistic significant at the SLENTRY= level and every variable in the model is significant at the SLSTAY= level, or when the variable to be added to the model is the one just deleted from it.
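A sketch combining both thresholds (the values are again illustrative):

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / selection=stepwise slentry=0.15 slstay=0.15;
   run;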
Maximum R2 Improvement (MAXR)

The maximum R2 improvement technique does not settle on a single model. Instead, it tries to find the "best" one-variable model, the "best" two-variable model, and so forth, although it is not guaranteed to find the model with the largest R2 for each size. The MAXR method begins by finding the one-variable model producing the highest R2. Then another variable, the one that yields the greatest increase in R2, is added. Once the two-variable model is obtained, each of the variables in the model is compared to each variable not in the model. For each comparison, the MAXR method determines if removing one variable and replacing it with the other variable increases R2. After comparing all possible switches, the MAXR method makes the switch that produces the largest increase in R2. Comparisons begin again, and the process continues until the MAXR method finds that no switch could increase R2. Thus, the two-variable model achieved is considered the "best" two-variable model the technique can find. Another variable is then added to the model, and the comparing-and-switching process is repeated to find the "best" three-variable model, and so forth.

The difference between the STEPWISE method and the MAXR method is that all switches are evaluated before any switch is made in the MAXR method. In the STEPWISE method, the "worst" variable may be removed without considering what adding the "best" remaining variable might accomplish. The MAXR method may require much more computer time than the STEPWISE method.
Minimum R2 Improvement (MINR)

The MINR method closely resembles the MAXR method, but the switch chosen is the one that produces the smallest increase in R2. For a given number of variables in the model, the MAXR and MINR methods usually produce the same "best" model, but the MINR method considers more models of each size.
R2 Selection (RSQUARE)

The RSQUARE method finds subsets of independent variables that best predict a dependent variable by linear regression in the given sample. You can specify the largest and smallest number of independent variables to appear in a subset and the number of subsets of each size to be selected. The RSQUARE method can efficiently perform all possible subset regressions and display the models in decreasing order of R2 magnitude within each subset size. Other statistics are available for comparing subsets of different sizes. These statistics, as well as estimated regression coefficients, can be displayed or output to a SAS data set.

The subset models selected by the RSQUARE method are optimal in terms of R2 for the given sample, but they are not necessarily optimal for the population from which the sample is drawn or for any other sample for which you may want to make predictions. If a subset model is selected on the basis of a large R2 value or any other criterion commonly used for model selection, then all regression statistics computed for that model under the assumption that the model is given a priori, including all statistics computed by PROC REG, are biased.

While the RSQUARE method is a useful tool for exploratory model building, no statistical method can be relied on to identify the "true" model. Effective model building requires substantive theory to suggest relevant predictors and plausible functional forms for the model.

The RSQUARE method differs from the other selection methods in that RSQUARE always identifies the model with the largest R2 for each number of variables considered. The other selection methods are not guaranteed to find the model with the largest R2. The RSQUARE method requires much more computer time than the other selection methods, so a different selection method such as the STEPWISE method is a good choice when there are many independent variables to consider.
Adjusted R2 Selection (ADJRSQ)

This method is similar to the RSQUARE method, except that the adjusted R2 statistic is used as the criterion for selecting models, and the method finds the models with the highest adjusted R2 within the range of sizes.
Mallows' Cp Selection (CP)

This method is similar to the ADJRSQ method, except that Mallows' Cp statistic is used as the criterion for model selection. Models are listed in ascending order of Cp.
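A minimal sketch requesting Mallows' Cp selection (the BEST= value is illustrative):

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / selection=cp adjrsq best=10;   /* also lists adjusted R-square */
   run;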
Additional Information on Model-Selection Methods

If the RSQUARE or STEPWISE procedure (as documented in SAS User's Guide: Statistics, Version 5 Edition) is requested, PROC REG with the appropriate model-selection method is actually used. Reviews of model-selection methods by Hocking (1976) and Judge et al. (1980) describe these and other variable-selection methods.
Criteria Used in Model-Selection Methods

When many significance tests are performed, each at a level of, for example, 5 percent, the overall probability of rejecting at least one true null hypothesis is much larger than 5 percent. If you want to guard against including any variables that do not contribute to the predictive power of the model in the population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE methods.

In most applications, many of the variables considered have some predictive power, however small. If you want to choose the model that provides the best prediction using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size, so you should use a moderate significance level, perhaps in the range of 10 percent to 25 percent.

In addition to R2, the Cp statistic is displayed for each model generated in the model-selection methods. The Cp statistic is proposed by Mallows (1973) as a criterion for selecting a model. It is a measure of total squared error defined as
   Cp = (SSEp / s²) − (N − 2p)

where s² is the MSE for the full model, and SSEp is the sum-of-squares error for a model with p parameters including the intercept, if any. If Cp is plotted against p, Mallows recommends the model where Cp first approaches p. When the right model is chosen, the parameter estimates are unbiased, and this is reflected in Cp near p. For further discussion, refer to Daniel and Wood (1980).

The adjusted R² statistic is an alternative to R² that is adjusted for the number of parameters in the model. The adjusted R² statistic is calculated as
   ADJRSQ = 1 − (n − i)(1 − R²) / (n − p)

where n is the number of observations used in fitting the model, and i is an indicator variable that is 1 if the model includes an intercept, and 0 otherwise.
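As a quick check, both formulas reproduce values reported earlier in this section. For the four-regressor model in Figure 61.20 (p = 5 including the intercept, N = 31, error degrees of freedom 26, and s² = 5.36825 from the full model), SSEp = 5.34346 × 26 ≈ 138.930, so

\[
C_p = \frac{138.930}{5.36825} - (31 - 2\cdot 5) \approx 25.88 - 21 = 4.88
\]

which matches the reported C(p) = 4.8800. Similarly, for model m1 in Figure 61.16 (n = 22, i = 1, p = 2, R² = 146869/159700 ≈ 0.91966),

\[
\mathrm{ADJRSQ} = 1 - \frac{(22-1)(1-0.91966)}{22-2} \approx 1 - 0.0844 = 0.9156
\]

matching the reported Adj R-Sq.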
Limitations in Model-Selection Methods

The use of model-selection methods can be time-consuming in some cases because there is no built-in limit on the number of independent variables, and the calculations for a large number of independent variables can be lengthy. The recommended limit on the number of independent variables for the MINR method is 20 + i, where i is the value of the INCLUDE= option. For the RSQUARE, ADJRSQ, or CP methods, with a large value of the BEST= option, adding one more variable to the list from which regressors are selected may significantly increase the CPU time. Also, the time required for the analysis is highly dependent on the data and on the values of the BEST=, START=, and STOP= options.
Parameter Estimates and Associated Statistics

The following example uses the fitness data from Example 61.1 on page 3924. Figure 61.28 shows the parameter estimates and the tables from the SS1, SS2, STB, CLB, COVB, and CORRB options:

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / ss1 ss2 stb clb covb corrb;
   run;
The procedure first displays an Analysis of Variance table (Figure 61.27). The F statistic for the overall model is significant, indicating that the model explains a significant portion of the variation in the data.

                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Oxygen

                             Analysis of Variance

                                    Sum of         Mean
   Source            DF           Squares       Square    F Value    Pr > F
   Model              6         722.54361    120.42393      22.43    <.0001
   Error             24         128.83794      5.36825
   Corrected Total   30         851.38154

   Root MSE           2.31695    R-Square    0.8487
   Dependent Mean    47.37581    Adj R-Sq    0.8108
   Coeff Var          4.89057

Figure 61.27. ANOVA Table
The procedure next displays Parameter Estimates and some associated statistics (Figure 61.28). First, the estimates are shown, followed by their Standard Errors. The next two columns of the table contain the t statistics and the corresponding probabilities for testing the null hypothesis that the parameter is zero. These probabilities are usually referred to as p-values. For example, the Intercept term in the model is estimated to be 102.9 and is significantly different from zero.

The next two columns of the table are the result of requesting the SS1 and SS2 options, and they show sequential and partial Sums of Squares (SS) associated with each variable. The Standardized Estimates (produced by the STB option) are the parameter estimates that result when all variables are standardized to a mean of 0 and a variance of 1. These estimates are computed by multiplying the original estimates by the standard deviation of the regressor (independent) variable and then dividing by the standard deviation of the dependent variable. The CLB option adds the upper and lower 95% confidence limits for the parameter estimates; the α level can be changed by specifying the ALPHA= option in the PROC REG or MODEL statement.

                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Oxygen

                             Parameter Estimates

               Parameter  Standard                          Type I     Type II  Standardized      95% Confidence
   Variable DF  Estimate     Error t Value Pr > |t|             SS          SS      Estimate              Limits
   Intercept 1 102.93448  12.40326    8.30   <.0001          69578   369.72831             0   77.33541  128.53355
   RunTime   1  -2.62865   0.38456   -6.84   <.0001      632.90010   250.82210      -0.68460   -3.42235   -1.83496
   Age       1  -0.22697   0.09984   -2.27   0.0322       17.76563    27.74577      -0.22204   -0.43303   -0.02092
   Weight    1  -0.07418   0.05459   -1.36   0.1869        5.60522     9.91059      -0.11597   -0.18685    0.03850
   RunPulse  1  -0.36963   0.11985   -3.08   0.0051       38.87574    51.05806      -0.71133   -0.61699   -0.12226
   MaxPulse  1   0.30322   0.13650    2.22   0.0360       26.82640    26.49142       0.52161    0.02150    0.58493
   RestPulse 1  -0.02153   0.06605   -0.33   0.7473        0.57051     0.57051      -0.03080   -0.15786    0.11480

Figure 61.28. SS1, SS2, STB, CLB, COVB, and CORRB Options: Parameter Estimates

The final two tables are produced as a result of requesting the COVB and CORRB options (Figure 61.29). These tables show the estimated covariance matrix of the parameter estimates and the estimated correlation matrix of the estimates.
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Oxygen

                           Covariance of Estimates

   Variable      Intercept       RunTime           Age        Weight      RunPulse      MaxPulse     RestPulse
   Intercept  153.84081152  0.7678373769  -0.902049478  -0.178237818   0.280796516  -0.832761667  -0.147954715
   RunTime    0.7678373769  0.1478880839  -0.014191688  -0.004417672  -0.009047784  0.0046249498  -0.010915224
   Age        -0.902049478  -0.014191688   0.009967521  0.0010219105  -0.001203914  0.0035823843  0.0014897532
   Weight     -0.178237818  -0.004417672  0.0010219105  0.0029804131  0.0009644683  -0.001372241  0.0003799295
   RunPulse    0.280796516  -0.009047784  -0.001203914  0.0009644683  0.0143647273  -0.014952457  -0.000764507
   MaxPulse   -0.832761667  0.0046249498  0.0035823843  -0.001372241  -0.014952457  0.0186309364  0.0003425724
   RestPulse  -0.147954715  -0.010915224  0.0014897532  0.0003799295  -0.000764507  0.0003425724  0.0043631674

                          Correlation of Estimates

   Variable   Intercept  RunTime      Age   Weight  RunPulse  MaxPulse  RestPulse
   Intercept     1.0000   0.1610  -0.7285  -0.2632    0.1889   -0.4919    -0.1806
   RunTime       0.1610   1.0000  -0.3696  -0.2104   -0.1963    0.0881    -0.4297
   Age          -0.7285  -0.3696   1.0000   0.1875   -0.1006    0.2629     0.2259
   Weight       -0.2632  -0.2104   0.1875   1.0000    0.1474   -0.1842     0.1054
   RunPulse      0.1889  -0.1963  -0.1006   0.1474    1.0000   -0.9140    -0.0966
   MaxPulse     -0.4919   0.0881   0.2629  -0.1842   -0.9140    1.0000     0.0380
   RestPulse    -0.1806  -0.4297   0.2259   0.1054   -0.0966    0.0380     1.0000

Figure 61.29. SS1, SS2, STB, CLB, COVB, and CORRB Options: Covariances and Correlations
For further discussion of the parameters and statistics, see the “Displayed Output” section on page 3918, and Chapter 2, “Introduction to Regression Procedures.”
Predicted and Residual Values

The display of the predicted values and residuals is controlled by the P, R, CLM, and CLI options in the MODEL statement. The P option causes PROC REG to display the observation number, the ID value (if an ID statement is used), the actual value, the predicted value, and the residual. The R, CLI, and CLM options also produce the items under the P option. Thus, P is unnecessary if you use one of the other options.

The R option requests more detail, especially about the residuals. The standard errors of the mean predicted value and the residual are displayed. The studentized residual, which is the residual divided by its standard error, is both displayed and plotted. A measure of influence, Cook's D, is displayed. Cook's D measures the change to the estimates that results from deleting each observation (Cook 1977, 1979). This statistic is very similar to DFFITS.

The CLM option requests that PROC REG display the 100(1 − α)% lower and upper confidence limits for the mean predicted values. This accounts for the variation due to estimating the parameters only. If you want a 100(1 − α)% confidence interval for observed values, then you can use the CLI option, which adds in the variability of the error term. The α level can be specified with the ALPHA= option in the PROC REG or MODEL statement.

You can use these statistics in PLOT and PAINT statements. This is useful in performing a variety of regression diagnostics. For definitions of the statistics produced by these options, see Chapter 2, "Introduction to Regression Procedures."

The following example uses the US population data found in the section "Polynomial Regression" beginning on page 3804.

   data USPop2;
      input Year @@;
      YearSq=Year*Year;
      datalines;
   2010 2020 2030
   ;
   data USPop2;
      set USPopulation USPop2;
   proc reg data=USPop2;
      id Year;
      model Population=Year YearSq / r cli clm;
   run;
                               The REG Procedure
                                 Model: MODEL1
                        Dependent Variable: Population

                             Analysis of Variance

                                    Sum of         Mean
   Source            DF           Squares       Square    F Value    Pr > F
   Model              2            159529        79765    8864.19    <.0001
   Error             19         170.97193      8.99852
   Corrected Total   21            159700

   Root MSE           2.99975    R-Square    0.9989
   Dependent Mean    94.64800    Adj R-Sq    0.9988
   Coeff Var          3.16938

                             Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error    t Value    Pr > |t|
   Intercept    1         21631    639.50181      33.82      <.0001
   Year         1     -24.04581      0.67547     -35.60      <.0001
   YearSq       1       0.00668   0.00017820      37.51      <.0001

Figure 61.30. Regression Using the R, CLI, and CLM Options
                               The REG Procedure
                                 Model: MODEL1
                        Dependent Variable: Population

                              Output Statistics

                Dependent  Predicted   Std Error
   Obs   Year    Variable      Value  Mean Predict       95% CL Mean        95% CL Predict
     1   1790      3.9290      6.2127      1.7565     2.5362    9.8892    -1.0631   13.4884
     2   1800      5.3080      5.7226      1.4560     2.6751    8.7701    -1.2565   12.7017
     3   1810      7.2390      6.5694      1.2118     4.0331    9.1057    -0.2021   13.3409
     4   1820      9.6380      8.7531      1.0305     6.5963   10.9100     2.1144   15.3918
     5   1830     12.8660     12.2737      0.9163    10.3558   14.1916     5.7087   18.8386
     6   1840     17.0690     17.1311      0.8650    15.3207   18.9415    10.5968   23.6655
     7   1850     23.1910     23.3254      0.8613    21.5227   25.1281    16.7932   29.8576
     8   1860     31.4430     30.8566      0.8846    29.0051   32.7080    24.3107   37.4024
     9   1870     39.8180     39.7246      0.9163    37.8067   41.6425    33.1597   46.2896
    10   1880     50.1550     49.9295      0.9436    47.9545   51.9046    43.3476   56.5114
    11   1890     62.9470     61.4713      0.9590    59.4641   63.4785    54.8797   68.0629
    12   1900     75.9940     74.3499      0.9590    72.3427   76.3571    67.7583   80.9415
    13   1910     91.9720     88.5655      0.9436    86.5904   90.5405    81.9836   95.1473
    14   1920    105.7100    104.1178      0.9163   102.2000  106.0357    97.5529  110.6828
    15   1930    122.7750    121.0071      0.8846   119.1556  122.8585   114.4612  127.5529
    16   1940    131.6690    139.2332      0.8613   137.4305  141.0359   132.7010  145.7654
    17   1950    151.3250    158.7962      0.8650   156.9858  160.6066   152.2618  165.3306
    18   1960    179.3230    179.6961      0.9163   177.7782  181.6139   173.1311  186.2610
    19   1970    203.2110    201.9328      1.0305   199.7759  204.0896   195.2941  208.5715
    20   1980    226.5420    225.5064      1.2118   222.9701  228.0427   218.7349  232.2779
    21   1990    248.7100    250.4168      1.4560   247.3693  253.4644   243.4378  257.3959
    22   2000    281.4220    276.6642      1.7565   272.9877  280.3407   269.3884  283.9400
    23   2010           .    304.2484      2.1073   299.8377  308.6591   296.5754  311.9214
    24   2020           .    333.1695      2.5040   327.9285  338.4104   324.9910  341.3479
    25   2030           .    363.4274      2.9435   357.2665  369.5883   354.6310  372.2238

                              Output Statistics

                           Std Error    Student
   Obs   Year   Residual    Residual   Residual   Cook's D
     1   1790    -2.2837       2.432     -0.939      0.153
     2   1800    -0.4146       2.623     -0.158      0.003
     3   1810     0.6696       2.744      0.244      0.004
     4   1820     0.8849       2.817      0.314      0.004
     5   1830     0.5923       2.856      0.207      0.001
     6   1840    -0.0621       2.872    -0.0216      0.000
     7   1850    -0.1344       2.873    -0.0468      0.000
     8   1860     0.5864       2.866      0.205      0.001
     9   1870     0.0934       2.856     0.0327      0.000
    10   1880     0.2255       2.847     0.0792      0.000
    11   1890     1.4757       2.842      0.519      0.010
    12   1900     1.6441       2.842      0.578      0.013
    13   1910     3.4065       2.847      1.196      0.052
    14   1920     1.5922       2.856      0.557      0.011
    15   1930     1.7679       2.866      0.617      0.012
    16   1940    -7.5642       2.873     -2.632      0.208
    17   1950    -7.4712       2.872     -2.601      0.205
    18   1960    -0.3731       2.856     -0.131      0.001
    19   1970     1.2782       2.817      0.454      0.009
    20   1980     1.0356       2.744      0.377      0.009
    21   1990    -1.7068       2.623     -0.651      0.044
    22   2000     4.7578       2.432      1.957      0.666
    23   2010          .           .          .          .
    24   2020          .           .          .          .
    25   2030          .           .          .          .

   [A line printer bar chart of the studentized residuals (scale −2 to 2) appears
    between the Student Residual and Cook's D columns in the original output.]

   Sum of Residuals                  -4.4596E-11
   Sum of Squared Residuals            170.97193
   Predicted Residual SS (PRESS)       237.71229

Figure 61.31. Regression Using the R, CLI, and CLM Options
After producing the usual Analysis of Variance and Parameter Estimates tables (Figure 61.30), the procedure displays the results of requesting the options for predicted and residual values (Figure 61.31). For each observation, the requested information is shown. Note that the ID variable is used to identify each observation. Also note that, for observations with missing dependent variables, the predicted value, standard error of the predicted value, and confidence intervals for the predicted value are still available.

The plots of studentized residuals and Cook's D statistics are displayed as a result of requesting the R option. In the plot of studentized residuals, a large number of observations with absolute values greater than two indicates an inadequate model. A version of the studentized residual plot can be created on a high-resolution graphics device; see Example 61.7 on page 3952 for a similar example.
Line Printer Scatter Plot Features

This section discusses the special options available with line printer scatter plots. Detailed examples of high-resolution graphics plots and options are given in Example 61.6 on page 3950.
Producing Scatter Plots

The interactive PLOT statement available in PROC REG enables you to look at scatter plots of data and diagnostic statistics. These plots can help you to evaluate the model and detect outliers in your data. Several options enable you to place multiple plots on a single page, superimpose plots, and collect plots to be overlaid by later plots. The PAINT statement can be used to highlight points on a plot. See the section "Painting Scatter Plots" on page 3889 for more information on painting.

The Class data set introduced in the "Getting Started" section on page 3800 is used in the following examples.

You can superimpose several plots with the OVERLAY option. With the following statements, a plot of Weight against Height is overlaid with plots of the predicted values and the 95% prediction intervals. The model on which the statistics are based is the full model including Height and Age. These statements produce Figure 61.32:

   proc reg data=Class lineprinter;
      model Weight=Height Age / noprint;
      plot (ucl. lcl. p.)*Height='-' Weight*Height
           / overlay symbol='o';
   run;
[Line printer plot: Weight plotted against Height (horizontal axis, 50 to 72), overlaid with the predicted values and the upper and lower bounds of the 95% prediction interval; data points are marked 'o' and predicted values and limits are marked '-'.]
Figure 61.32. Scatter Plot Showing Data, Predicted Values, and Confidence Limits
In this plot, the data values are marked with the symbol 'o' and the predicted values and prediction interval limits are labeled with the symbol '-'. The plot is scaled to accommodate the points from all plots. This is an important difference from the COLLECT option, which does not rescale plots after the first plot or plots are collected. You could separate the overlaid plots by using the following statements:

   plot;
   run;
This places each of the four plots on a separate page, while the statements

   plot / overlay;
   run;

repeat the previous overlaid plot. In general, the statement

   plot;

is equivalent to respecifying the most recent PLOT statement without any options. However, the COLLECT, HPLOTS=, SYMBOL=, and VPLOTS= options apply across PLOT statements and remain in effect.

The next example shows how you can overlay plots of statistics before and after a change in the model. For the full model involving Height and Age, the ordinary residuals and the studentized residuals are plotted against the predicted values. The COLLECT option causes these plots to be collected or retained for re-display later. The option HPLOTS=2 allows the two plots to appear side by side on one page. The symbol 'f' is used on these plots to identify them as resulting from the full model. These statements produce Figure 61.33:

   plot r.*p. student.*p. / collect hplots=2 symbol='f';
   run;
Line Printer Scatter Plot Features
The REG Procedure Model: MODEL1
30
20
R E S I D U A L
10
0
-10
-20
-+-----+-----+-----+-----+-----+| | + + | | | | | | | | | | | f | + + | | | | | | | f | | f | | f | + f + | | | | | f | | | | f | | | + f f + | | | f | | f | | f | | f | | | + + | | | f f | | | | f | | | | f | + + -+-----+-----+-----+-----+-----+40 60 80 100 120 140
3
2
S T U D E N T
1
0
-1
-2
-+-----+-----+-----+-----+-----+-| | + + | | | | | | | | | | | f | + + | | | | | | | | | f | | f f | + f + | | | | | f | | | | f | | | + f f + | | | f | | f f | | | | f | | | + + | f f | | | | f | | f | | | | | + + -+-----+-----+-----+-----+-----+-40 60 80 100 120 140
PRED
PRED
Figure 61.33. Collecting Residual Plots for the Full Model
Note that these plots are not overlaid. The COLLECT option does not overlay the plots in one PLOT statement but retains them so that they can be overlaid by later plots. When the COLLECT option appears in a PLOT statement, the plots in that statement become the first plots in the collection. Next, the model is reduced by deleting the Age variable. The PLOT statement requests the same plots as before but labels the points with the symbol 'r' denoting the reduced model. The following statements produce Figure 61.34:

   delete Age;
   plot r.*p. student.*p. / symbol='r';
   run;
[Two line printer plots, side by side: RESIDUAL against PRED and STUDENT against PRED, with points from the full model marked 'f', points from the reduced model marked 'r', and positions containing points from both models marked '?'.]
Figure 61.34. Overlaid Residual Plots for Full and Reduced Models
Notice that the COLLECT option causes the corresponding plots to be overlaid. Also notice that the DELETE statement causes the model label to be changed from MODEL1 to MODEL1.1. The points labeled ’f’ are from the full model, and points labeled ’r’ are from the reduced model. Positions labeled ’?’ contain at least one point from each model. In this example, the OVERLAY option cannot be used because all of the plots to be overlaid cannot be specified in one PLOT statement. With the COLLECT option, any changes to the model or the data used to fit the model do not affect plots collected before the changes. Collected plots are always reproduced exactly as they first appear. (Similarly, a PAINT statement does not affect plots collected before the PAINT statement is issued.)
The previous example overlays the residual plots for two different models. You may prefer to see them side by side on the same page. This can also be done with the COLLECT option by using a blank plot. Continuing from the last example, the COLLECT, HPLOTS=2, and SYMBOL='r' options are still in effect. In the following PLOT statement, the CLEAR option deletes the collected plots and allows the specified plot to begin a new collection. The plot created is the residual plot for the reduced model. These statements produce Figure 61.35:

   plot r.*p. / clear;
   run;
[Line printer plot: RESIDUAL against PRED for the reduced model, with points marked 'r'.]
Figure 61.35. Residual Plot for Reduced Model Only
The next statements add the variable Age to the model and place the residual plot for the full model next to the plot for the reduced model. Notice that a blank plot is created in the first plot request by placing nothing between the quotes. Since the COLLECT option is in effect, this plot is superimposed on the residual plot for the reduced model. The residual plot for the full model is created by the second request. The result is the desired side-by-side plots. The NOCOLLECT option turns off the collection process after the specified plots are added and displayed. Any PLOT statements that follow show only the newly specified plots. These statements produce Figure 61.36:

   add Age;
   plot r.*p.='' r.*p.='f' / nocollect;
   run;
[Two line printer plots, side by side: RESIDUAL against PRED for the reduced model (points marked 'r') and RESIDUAL against PRED for the full model (points marked 'f').]
Figure 61.36. Side-by-Side Residual Plots for the Full and Reduced Models
Frequently, when the COLLECT option is in effect, you want the current and following PLOT statements to show only the specified plots. To do this, use both the CLEAR and NOCOLLECT options in the current PLOT statement.
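For example, assuming a collection is active, a single PLOT statement of the following form (illustrative) both deletes the collected plots and ends collection, so that only the newly specified plot is shown:

   plot r.*p. / clear nocollect;
   run;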
Painting Scatter Plots

Painting scatter plots is a useful interactive tool that enables you to mark points of interest in scatter plots. Painting can be used to identify extreme points in scatter plots or to reveal the relationship between two scatter plots. The Class data (from the "Simple Linear Regression" section on page 3800) is used to illustrate some of these applications. First, a scatter plot of the studentized residuals against the predicted values is generated. This plot is shown in Figure 61.37.

   proc reg data=Class lineprinter;
      model Weight=Age Height / noprint;
      plot student.*p.;
   run;
[Line printer plot: Studentized Residual against Predicted Value of Weight (50 to 140), with each point marked by the number of observations at that position.]
Figure 61.37. Plotting Studentized Residuals Against Predicted Values
Then, the following statements identify the observation 'Henry' in the scatter plot and produce Figure 61.38:

   paint Name='Henry' / symbol='H';
   plot;
   run;
[The same plot of Studentized Residual against Predicted Value of Weight, with the observation for Henry now marked 'H'.]
Figure 61.38. Painting One Observation
Next, the following statements identify observations with large absolute residuals:

   paint student.>=2 or student.<=-2 / symbol='s';
   plot;
   run;
The log shows the observation numbers found with these conditions and gives the painting symbol and the number of observations found. Note that the previous PAINT statement is also used in the PLOT statement. Figure 61.39 shows the scatter plot produced by the preceding statements.
[The same plot, with observations whose studentized residuals are at least 2 in absolute value marked 's' and Henry still marked 'H'.]
Figure 61.39. Painting Several Observations
The following statements relate two different scatter plots. These statements produce Figure 61.40.

   paint student.>=1 / symbol='p';
   paint student.<1 and student.>-1 / symbol='s';
   paint student.<=-1 / symbol='n';
   plot student.*p. cookd.*h. / hplots=2;
   run;
[Two line printer plots, side by side: STUDENT against PRED and COOKD against H (the hat values, 0.05 to 0.35). The same painting symbols appear in both plots: 'p' for studentized residuals of at least 1, 's' for those between −1 and 1, and 'n' for those of −1 or less.]
Figure 61.40. Painting Observations on More than One Plot
Models of Less Than Full Rank

If the model is not full rank, there are an infinite number of least-squares solutions for the estimates. PROC REG chooses a nonzero solution for all variables that are linearly independent of previous variables and a zero solution for other variables. This solution corresponds to using a generalized inverse in the normal equations, and the expected values of the estimates are the Hermite normal form of X multiplied by the true parameters:

   E(b) = (X′X)⁻ (X′X) β
Degrees of freedom for the zeroed estimates are reported as zero. The hypotheses that are not testable have t tests reported as missing. The message that the model is not full rank includes a display of the relations that exist in the matrix.

The next example uses the fitness data from Example 61.1 on page 3924. The variable Dif=RunPulse−RestPulse is created. When this variable is included in the model along with RunPulse and RestPulse, there is a linear dependency (or exact collinearity) between the independent variables. Figure 61.41 shows how this problem is diagnosed.

   data fit2;
      set fitness;
      Dif=RunPulse-RestPulse;
   proc reg data=fit2;
      model Oxygen=RunTime Age Weight
            RunPulse MaxPulse RestPulse Dif;
   run;
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Oxygen

                             Analysis of Variance

                                    Sum of         Mean
   Source            DF           Squares       Square    F Value    Pr > F
   Model              6         722.54361    120.42393      22.43    <.0001
   Error             24         128.83794      5.36825
   Corrected Total   30         851.38154

   Root MSE           2.31695    R-Square    0.8487
   Dependent Mean    47.37581    Adj R-Sq    0.8108
   Coeff Var          4.89057

   NOTE: Model is not full rank. Least-squares solutions for the parameters are
         not unique. Some statistics will be misleading. A reported DF of 0 or B
         means that the estimate is biased.
   NOTE: The following parameters have been set to 0, since the variables are a
         linear combination of other variables as shown.

         Dif = RunPulse - RestPulse

                             Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error    t Value    Pr > |t|
   Intercept    1     102.93448     12.40326       8.30      <.0001
   RunTime      1      -2.62865      0.38456      -6.84      <.0001
   Age          1      -0.22697      0.09984      -2.27      0.0322
   Weight       1      -0.07418      0.05459      -1.36      0.1869
   RunPulse     B      -0.36963      0.11985      -3.08      0.0051
   MaxPulse     1       0.30322      0.13650       2.22      0.0360
   RestPulse    B      -0.02153      0.06605      -0.33      0.7473
   Dif          0             0            .          .           .
Figure 61.41. Model That Is Not Full Rank: REG Procedure
PROC REG produces a message informing you that the model is less than full rank. Parameters with DF=0 are not estimated, and parameters with DF=B are biased. In addition, the form of the linear dependency among the regressors is displayed.
Collinearity Diagnostics

When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors. This problem is called collinearity or multicollinearity. It is a good idea to find out which variables are nearly collinear with which other variables. The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.

The COLLIN option in the MODEL statement requests that a collinearity analysis be performed. First, X′X is scaled to have 1s on the diagonal. If you specify the COLLINOINT option, the intercept variable is adjusted out first. Then the eigenvalues and eigenvectors are extracted. The analysis in PROC REG is reported with eigenvalues of X′X rather than singular values of X. The eigenvalues of X′X are the squares of the singular values of X. The condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. The largest condition index is the condition number of the scaled X matrix. Belsley, Kuh, and Welsch (1980) suggest that, when this number is around 10, weak dependencies may be starting to affect the regression estimates. When this number is larger than 100, the estimates may have a fair amount of numerical error (although the statistical standard error almost always is much greater than the numerical error).

For each variable, PROC REG produces the proportion of the variance of the estimate accounted for by each principal component. A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.

The VIF option in the MODEL statement provides the Variance Inflation Factors (VIF). These factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. There are no formal criteria for deciding if a VIF is large enough to affect the predicted values.

The TOL option requests the tolerance values for the parameter estimates. The tolerance is defined as 1/VIF.

For a complete discussion of the preceding methods, refer to Belsley, Kuh, and Welsch (1980). For a more detailed explanation of using the methods with PROC REG, refer to Freund and Littell (1986).

This example uses the COLLIN option on the fitness data found in Example 61.1 on page 3924. The following statements produce Figure 61.42.

   proc reg data=fitness;
      model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
            / tol vif collin;
   run;
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: Oxygen

                             Analysis of Variance

                                    Sum of         Mean
   Source            DF           Squares       Square    F Value    Pr > F
   Model              6         722.54361    120.42393      22.43    <.0001
   Error             24         128.83794      5.36825
   Corrected Total   30         851.38154

   Root MSE           2.31695    R-Square    0.8487
   Dependent Mean    47.37581    Adj R-Sq    0.8108
   Coeff Var          4.89057

                             Parameter Estimates

               Parameter  Standard                                    Variance
   Variable DF  Estimate     Error t Value Pr > |t|    Tolerance     Inflation
   Intercept 1 102.93448  12.40326    8.30   <.0001            .             0
   RunTime   1  -2.62865   0.38456   -6.84   <.0001      0.62859       1.59087
   Age       1  -0.22697   0.09984   -2.27   0.0322      0.66101       1.51284
   Weight    1  -0.07418   0.05459   -1.36   0.1869      0.86555       1.15533
   RunPulse  1  -0.36963   0.11985   -3.08   0.0051      0.11852       8.43727
   MaxPulse  1   0.30322   0.13650    2.22   0.0360      0.11437       8.74385
   RestPulse 1  -0.02153   0.06605   -0.33   0.7473      0.70642       1.41559

                           Collinearity Diagnostics

                          Condition  -----------Proportion of Variation-----------
   Number   Eigenvalue       Index    Intercept     RunTime         Age      Weight
        1      6.94991     1.00000   0.00002326  0.00021086  0.00015451  0.00019651
        2      0.01868    19.29087      0.00218     0.02522     0.14632     0.01042
        3      0.01503    21.50072   0.00061541     0.12858     0.15013     0.23571
        4      0.00911    27.62115      0.00638     0.60897     0.03186     0.18313
        5      0.00607    33.82918      0.00133     0.12501     0.11284     0.44442
        6      0.00102    82.63757      0.79966     0.09746     0.49660     0.10330
        7   0.00017947   196.78560      0.18981     0.01455     0.06210     0.02283

   Number     RunPulse    MaxPulse   RestPulse
        1   0.00000862  0.00000634  0.00027850
        2   0.00000244  0.00000743     0.39064
        3      0.00119     0.00125     0.02809
        4      0.00149     0.00123     0.19030
        5      0.01506     0.00833     0.36475
        6      0.06948     0.00561     0.02026
        7      0.91277     0.98357     0.00568

Figure 61.42. Regression Using the TOL, VIF, and COLLIN Options
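As a numerical check using the values in Figure 61.42, the largest condition index is the square root of the ratio of the largest eigenvalue to the smallest:

\[
\sqrt{6.94991 / 0.00017947} \approx 196.79
\]

in agreement with the reported 196.78560. The RunPulse and MaxPulse estimates, whose variance proportions on that last component are 0.91277 and 0.98357, are the ones affected by the near dependency, consistent with their large VIFs (8.43727 and 8.74385).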
Model Fit and Diagnostic Statistics

This section gathers the formulas for the statistics available in the MODEL, PLOT, and OUTPUT statements. The model to be fit is Y = Xβ + ε, and the parameter estimate is denoted by b = (X′X)⁻X′Y. The subscript i denotes values for the ith observation, the parenthetical subscript (i) means that the statistic is computed using all observations except the ith observation, and the subscript jj indicates the jth diagonal matrix entry. The ALPHA= option in the PROC REG or MODEL statement is used to set the α value for the t statistics.

Table 61.6 contains the summary statistics for assessing the fit of the model.
Table 61.6. Formulas and Definitions for Model Fit Summary Statistics

   MODEL Option
   or Statistic   Definition or Formula
   n              the number of observations
   p              the number of parameters including the intercept
   i              1 if there is an intercept, 0 otherwise
   σ̂²             the estimate of pure error variance from the SIGMA= option
                  or from fitting the full model
   SST0           the uncorrected total sum of squares for the dependent variable
   SST1           the total sum of squares corrected for the mean for the
                  dependent variable
   SSE            the error sum of squares
   MSE            SSE / (n − p)
   R²             1 − SSE / SSTi
   ADJRSQ         1 − (n − i)(1 − R²) / (n − p)
   AIC            n ln(SSE/n) + 2p
   BIC            n ln(SSE/n) + 2(p + 2)q − 2q², where q = n σ̂² / SSE
   CP (Cp)        SSE / σ̂² + 2p − n
   GMSEP          MSE (n + 1)(n − 2) / ( n(n − p − 1) ) = (1/n) Sp (n + 1)(n − 2)
   JP (Jp)        ((n + p)/n) MSE
   PC             ((n + p)/(n − p)) (1 − R²) = Jp (n / SSTi)
   PRESS          the sum of squares of predri (see Table 61.7)
   RMSE           √MSE
   SBC            n ln(SSE/n) + p ln(n)
   SP (Sp)        MSE / (n − p − 1)
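As a check, the AIC and SBC formulas reproduce the values reported for the one-regressor model in Figure 61.20 (n = 31, p = 2, and SSE = 7.53384 × 29 ≈ 218.48):

\[
\mathrm{AIC} = 31\,\ln(218.48/31) + 2\cdot 2 \approx 60.53 + 4 = 64.53,
\qquad
\mathrm{SBC} = 60.53 + 2\,\ln(31) \approx 67.40
\]

matching the reported 64.5341 and 67.4021.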
Table 61.7 contains the diagnostic statistics and their formulas; these formulas and further information can be found in Chapter 2, "Introduction to Regression Procedures," and in the "Influence Diagnostics" section on page 3898. Each statistic is computed for each observation.

Table 61.7. Formulas and Definitions for Diagnostic Statistics

   MODEL Option
   or Statistic       Formula
   PRED (Ŷi)          Xi b
   RES (ri)           Yi − Ŷi
   H (hi)             xi (X′X)⁻ xi′
   STDP               √( hi σ̂² )
   STDI               √( (1 + hi) σ̂² )
   STDR               √( (1 − hi) σ̂² )
   LCL                Ŷi − t(α/2) · STDI
   LCLM               Ŷi − t(α/2) · STDP
   UCL                Ŷi + t(α/2) · STDI
   UCLM               Ŷi + t(α/2) · STDP
   STUDENT            ri / STDRi
   RSTUDENT           ri / ( σ̂(i) √(1 − hi) )
   COOKD              (1/p) STUDENT² ( STDP² / STDR² )
   COVRATIO           det( σ̂²(i) (X′(i)X(i))⁻¹ ) / det( σ̂² (X′X)⁻¹ )
   DFFITS             ( Ŷi − Ŷ(i) ) / ( σ̂(i) √hi )
   DFBETASj           ( bj − b(i)j ) / ( σ̂(i) √(X′X)jj )
   PRESS (predri)     ri / (1 − hi)
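Many of these statistics can be captured in a data set with the OUTPUT statement; the following is a minimal sketch (the output data set name and the variable names after the equal signs are illustrative, and only a subset of keywords is shown):

   proc reg data=fitness;
      model Oxygen=RunTime Age RunPulse;
      /* Write selected diagnostics, one value per observation */
      output out=diagnostics p=pred r=resid student=stud
             rstudent=rstud h=leverage cookd=cooksd dffits=dffit;
   run;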
Influence Diagnostics

This section discusses the INFLUENCE option, which produces several influence statistics, and the PARTIAL option, which produces partial regression leverage plots.
The INFLUENCE Option

The INFLUENCE option (in the MODEL statement) requests the statistics proposed by Belsley, Kuh, and Welsch (1980) to measure the influence of each observation on the estimates. Influential observations are those that, according to various criteria, appear to have a large influence on the parameter estimates.

Let b(i) be the parameter estimates after deleting the ith observation; let s²(i) be the variance estimate after deleting the ith observation; let X(i) be the X matrix without the ith observation; let ŷ(i) be the ith value predicted without using the ith observation; let ri = yi − ŷi be the ith residual; and let hi be the ith diagonal of the projection matrix for the predictor space, also called the hat matrix:

   hi = xi (X′X)⁻¹ xi′

Belsley, Kuh, and Welsch propose a cutoff of 2p/n, where n is the number of observations used to fit the model and p is the number of parameters in the model. Observations with hi values above this cutoff should be investigated.

For each observation, PROC REG first displays the residual, the studentized residual (RSTUDENT), and the hi. The studentized residual RSTUDENT differs slightly from STUDENT since the error variance is estimated by s²(i) without the ith observation, not by s². For example,

   RSTUDENT = ri / ( s(i) √(1 − hi) )
Observations with RSTUDENT larger than 2 in absolute value may need some attention.

The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the ith observation:

   COVRATIO = \frac{\det(s_{(i)}^2 (X_{(i)}' X_{(i)})^{-1})}{\det(s^2 (X'X)^{-1})}

Belsley, Kuh, and Welsch suggest that observations with

   |COVRATIO - 1| \ge \frac{3p}{n}
where p is the number of parameters in the model and n is the number of observations used to fit the model, are worth investigation.

The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space.

   DFFITS = \frac{\hat{y}_i - \hat{y}_{(i)}}{s_{(i)} \sqrt{h_i}}

Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a size-adjusted cutoff recommended by Belsley, Kuh, and Welsch is 2\sqrt{p/n}, where n and p are as defined previously. The DFFITS statistic is very similar to Cook's D, defined in the section "Predicted and Residual Values" on page 3879.
The DFBETAS statistics are the scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation:

   DFBETAS_j = \frac{b_j - b_{(i)j}}{s_{(i)} \sqrt{(X'X)_{jj}}}

where (X'X)_{jj} is the (j, j)th element of (X'X)^{-1}.

In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and 2/\sqrt{n} as a size-adjusted cutoff.

Figure 61.43 shows the tables produced by the INFLUENCE option for the population example (the section "Polynomial Regression" beginning on page 3804). See Figure 61.30 for the fitted regression equation.

   proc reg data=USPopulation;
      model Population=Year YearSq / influence;
   run;
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Output Statistics

                            Hat Diag      Cov              ----------DFBETAS----------
 Obs  Residual  RStudent          H     Ratio    DFFITS   Intercept      Year    YearSq
   1   -2.2837   -0.9361     0.3429    1.5519   -0.6762     -0.4924    0.4862   -0.4802
   2   -0.4146   -0.1540     0.2356    1.5325   -0.0855     -0.0540    0.0531   -0.0523
   3    0.6696    0.2379     0.1632    1.3923    0.1050      0.0517   -0.0505    0.0494
   4    0.8849    0.3065     0.1180    1.3128    0.1121      0.0335   -0.0322    0.0310
   5    0.5923    0.2021     0.0933    1.2883    0.0648      0.0040   -0.0032    0.0025
   6   -0.0621   -0.0210     0.0831    1.2827   -0.0063      0.0012   -0.0012    0.0013
   7   -0.1344   -0.0455     0.0824    1.2813   -0.0136      0.0054   -0.0055    0.0056
   8    0.5864    0.1994     0.0870    1.2796    0.0615     -0.0339    0.0343   -0.0347
   9    0.0934    0.0318     0.0933    1.2969    0.0102     -0.0067    0.0067   -0.0068
  10    0.2255    0.0771     0.0990    1.3040    0.0255     -0.0182    0.0183   -0.0183
  11    1.4757    0.5090     0.1022    1.2550    0.1717     -0.1272    0.1275   -0.1276
  12    1.6441    0.5680     0.1022    1.2420    0.1916     -0.1426    0.1426   -0.1424
  13    3.4065    1.2109     0.0990    1.0320    0.4013     -0.2895    0.2889   -0.2880
  14    1.5922    0.5470     0.0933    1.2345    0.1755     -0.1173    0.1167   -0.1160
  15    1.7679    0.6064     0.0870    1.2123    0.1871     -0.1076    0.1067   -0.1056
  16   -7.5642   -3.2147     0.0824    0.3286   -0.9636      0.4130   -0.4063    0.3987
  17   -7.4712   -3.1550     0.0831    0.3425   -0.9501      0.2131   -0.2048    0.1957
  18   -0.3731   -0.1272     0.0933    1.2936   -0.0408     -0.0007    0.0012   -0.0016
  19    1.2782    0.4440     0.1180    1.2906    0.1624      0.0415   -0.0432    0.0449
  20    1.0356    0.3687     0.1632    1.3741    0.1628      0.0732   -0.0749    0.0766
  21   -1.7068   -0.6406     0.2356    1.4380   -0.3557     -0.2107    0.2141   -0.2176
  22    4.7578    2.1312     0.3429    0.9113    1.5395      1.0656   -1.0793    1.0933

Sum of Residuals                    -4.4596E-11
Sum of Squared Residuals              170.97193
Predicted Residual SS (PRESS)         237.71229
Figure 61.43. Regression Using the INFLUENCE Option
In Figure 61.43, observations 16, 17, and 22 exceed the cutoff value of 2 for RSTUDENT. None of the observations exceeds the general cutoff of 2 for DFFITS or the DFBETAS, but observations 16, 17, and 22 exceed at least one of the size-adjusted cutoffs for these statistics. Observations 1 and 22 exceed the cutoff for the hat diagonals, and observations 1, 2, 16, and 17 exceed the cutoffs for COVRATIO. Taken together, these statistics indicate that you should look first at observations 16, 17, and 22 and then perhaps investigate the other observations that exceeded a cutoff.
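This screening can be automated by saving the statistics and comparing them to the cutoffs in a DATA step. The following is a sketch only; the cutoff values 2p/n, 2, 3p/n, and 2\sqrt{p/n} are hard-coded here for this example (n = 22 observations, p = 3 parameters), and the data set and flag names are arbitrary:

   proc reg data=USPopulation noprint;
      model Population=Year YearSq;
      output out=infl h=lev rstudent=rstud covratio=covr dffits=dfits;
   run;

   data flagged;
      set infl;
      n = 22; p = 3;                           /* observations, parameters  */
      flagHat  = (lev > 2*p/n);                /* high leverage             */
      flagRStu = (abs(rstud) > 2);             /* large studentized resid.  */
      flagCov  = (abs(covr - 1) >= 3*p/n);     /* COVRATIO cutoff           */
      flagDff  = (abs(dfits) > 2*sqrt(p/n));   /* size-adjusted DFFITS      */
   run;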
The PARTIAL Option

The PARTIAL option in the MODEL statement produces partial regression leverage plots. If the experimental ODS graphics are not in effect, this option requires the use of the LINEPRINTER option in the PROC REG statement. One plot is created for each regressor in the current full model. For example, plots are produced for regressors included by using ADD statements; plots are not produced for interim models in the various model-selection methods but only for the full model. If you use a model-selection method and the final model contains only a subset of the original regressors, the PARTIAL option still produces plots for all regressors in the full model. If the experimental ODS graphics are in effect, these plots are produced as high-resolution graphics, in panels with a maximum of six partial regression leverage plots per panel. Multiple panels are displayed for models with more than six regressors.

For a given regressor, the partial regression leverage plot is the plot of the dependent variable and the regressor after they have been made orthogonal to the other regressors in the model. These can be obtained by plotting the residuals for the dependent variable against the residuals for the selected regressor, where the residuals for the dependent variable are calculated with the selected regressor omitted, and the residuals for the selected regressor are calculated from a model where the selected regressor is regressed on the remaining regressors. A line fit to the points has a slope equal to the parameter estimate in the full model.

When the experimental ODS graphics are not in effect, points in the plot are marked by the number of replicates appearing at one position. The symbol '*' is used if there are ten or more replicates. If an ID statement is specified, the left-most nonblank character in the value of the ID variable is used as the plotting symbol.

The following statements use the fitness data in Example 61.1 on page 3924 with the PARTIAL option and the ODS GRAPHICS statement to produce the partial regression leverage plots. The plots are shown in Figure 61.44. For general information about ODS graphics, see Chapter 15, "Statistical Graphics Using ODS." For specific information about the graphics available in the REG procedure, see the "ODS Graphics" section on page 3922.

   ods html;
   ods graphics on;

   proc reg data=fitness;
      model Oxygen=RunTime Weight Age / partial;
   run;

   ods graphics off;
   ods html close;
Figure 61.44. Partial Regression Leverage Plots (Experimental)
The following statements create a similar panel of partial regression plots using the OUTPUT data set and the GPLOT procedure. Four plots (created by regressing Oxygen and one of the variables on the remaining variables) are displayed in Figure 61.45. Notice that the Int variable is explicitly added to be used as the intercept term.

   data fitness2;
      set fitness;
      Int=1;
   proc reg data=fitness2 noprint;
      model Oxygen Int = RunTime Weight Age / noint;
      output out=temp r=ry rx;
   symbol1 c=blue;
   proc gplot data=temp;
      plot ry*rx / cframe=ligr;
      label ry='Oxygen' rx='Intercept';
   run;
Figure 61.45. Partial Regression Leverage Plots
Reweighting Observations in an Analysis

Reweighting observations is an interactive feature of PROC REG that enables you to change the weights of observations used in computing the regression equation. Observations can also be deleted from the analysis (not from the data set) by changing their weights to zero. The Class data (in the "Getting Started" section on page 3800) are used to illustrate some of the features of the REWEIGHT statement. First, the full model is fit, and the residuals are displayed in Figure 61.46.

   proc reg data=Class;
      model Weight=Age Height / p;
      id Name;
   run;
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Output Statistics

                  Dependent   Predicted
 Obs  Name         Variable       Value    Residual
   1  Alfred       112.5000    124.8686    -12.3686
   2  Alice         84.0000     78.6273      5.3727
   3  Barbara       98.0000    110.2812    -12.2812
   4  Carol        102.5000    102.5670     -0.0670
   5  Henry        102.5000    105.0849     -2.5849
   6  James         83.0000     80.2266      2.7734
   7  Jane          84.5000     89.2191     -4.7191
   8  Janet        112.5000    102.7663      9.7337
   9  Jeffrey       84.0000    100.2095    -16.2095
  10  John          99.5000     86.3415     13.1585
  11  Joyce         50.5000     57.3660     -6.8660
  12  Judy          90.0000    107.9625    -17.9625
  13  Louise        77.0000     76.6295      0.3705
  14  Mary         112.0000    117.1544     -5.1544
  15  Philip       150.0000    138.2164     11.7836
  16  Robert       128.0000    107.2043     20.7957
  17  Ronald       133.0000    118.9529     14.0471
  18  Thomas        85.0000     79.6676      5.3324
  19  William      112.0000    117.1544     -5.1544

Sum of Residuals                             0
Sum of Squared Residuals            2120.09974
Predicted Residual SS (PRESS)       3272.72186
Figure 61.46. Full Model for CLASS Data, Residuals Shown
Upon examining the data and residuals, you realize that observation 17 (Ronald) was mistakenly included in the analysis. Also, you would like to examine the effect of reweighting to 0.5 those observations with residuals that have absolute values greater than or equal to 17.

   reweight obs.=17;
   reweight r. le -17 or r. ge 17 / weight=0.5;
   print p;
   run;
At this point, a message (on the log) appears that tells you which observations have been reweighted and what the new weights are. Figure 61.47 is produced.
The REG Procedure
Model: MODEL1.2
Dependent Variable: Weight

Output Statistics

                   Weight   Dependent   Predicted
 Obs  Name       Variable    Variable       Value    Residual
   1  Alfred       1.0000    112.5000    121.6250     -9.1250
   2  Alice        1.0000     84.0000     79.9296      4.0704
   3  Barbara      1.0000     98.0000    107.5484     -9.5484
   4  Carol        1.0000    102.5000    102.1663      0.3337
   5  Henry        1.0000    102.5000    104.3632     -1.8632
   6  James        1.0000     83.0000     79.9762      3.0238
   7  Jane         1.0000     84.5000     87.8225     -3.3225
   8  Janet        1.0000    112.5000    103.6889      8.8111
   9  Jeffrey      1.0000     84.0000     98.7606    -14.7606
  10  John         1.0000     99.5000     85.3117     14.1883
  11  Joyce        1.0000     50.5000     58.6811     -8.1811
  12  Judy         0.5000     90.0000    106.8740    -16.8740
  13  Louise       1.0000     77.0000     76.8377      0.1623
  14  Mary         1.0000    112.0000    116.2429     -4.2429
  15  Philip       1.0000    150.0000    135.9688     14.0312
  16  Robert       0.5000    128.0000    103.5150     24.4850
  17  Ronald            0    133.0000    117.8121     15.1879
  18  Thomas       1.0000     85.0000     78.1398      6.8602
  19  William      1.0000    112.0000    116.2429     -4.2429

Sum of Residuals                             0
Sum of Squared Residuals            1500.61194
Predicted Residual SS (PRESS)       2287.57621

NOTE: The above statistics use observation weights or frequencies.
Figure 61.47. Model with Reweighted Observations
The first REWEIGHT statement excludes observation 17, and the second REWEIGHT statement reweights observations 12 and 16 to 0.5. An important feature to note from this example is that the model is not refit until after the PRINT statement. REWEIGHT statements do not cause the model to be refit. This is so that multiple REWEIGHT statements can be applied to a subsequent model.

In this example, since the intent is to reweight observations with large residuals, the observation that was mistakenly included in the analysis should be deleted; then, the model should be fit for those remaining observations, and the observations with large residuals should be reweighted. To accomplish this, use the REFIT statement. Note that the model label has been changed from MODEL1 to MODEL1.2, as two REWEIGHT statements have been used. These statements produce Figure 61.48:

   reweight allobs / weight=1.0;
   reweight obs.=17;
   refit;
   reweight r. le -17 or r. ge 17 / weight=.5;
   print;
   run;
The REG Procedure
Model: MODEL1.5
Dependent Variable: Weight

Output Statistics

                   Weight   Dependent   Predicted
 Obs  Name       Variable    Variable       Value    Residual
   1  Alfred       1.0000    112.5000    120.9716     -8.4716
   2  Alice        1.0000     84.0000     79.5342      4.4658
   3  Barbara      1.0000     98.0000    107.0746     -9.0746
   4  Carol        1.0000    102.5000    101.5681      0.9319
   5  Henry        1.0000    102.5000    103.7588     -1.2588
   6  James        1.0000     83.0000     79.7204      3.2796
   7  Jane         1.0000     84.5000     87.5443     -3.0443
   8  Janet        1.0000    112.5000    102.9467      9.5533
   9  Jeffrey      1.0000     84.0000     98.3117    -14.3117
  10  John         1.0000     99.5000     85.0407     14.4593
  11  Joyce        1.0000     50.5000     58.6253     -8.1253
  12  Judy         1.0000     90.0000    106.2625    -16.2625
  13  Louise       1.0000     77.0000     76.5908      0.4092
  14  Mary         1.0000    112.0000    115.4651     -3.4651
  15  Philip       1.0000    150.0000    134.9953     15.0047
  16  Robert       0.5000    128.0000    103.1923     24.8077
  17  Ronald            0    133.0000    117.0299     15.9701
  18  Thomas       1.0000     85.0000     78.0288      6.9712
  19  William      1.0000    112.0000    115.4651     -3.4651

Sum of Residuals                             0
Sum of Squared Residuals            1637.81879
Predicted Residual SS (PRESS)       2473.87984

NOTE: The above statistics use observation weights or frequencies.
Figure 61.48. Observations Excluded from Analysis, Model Refitted and Observations Reweighted
Notice that this results in a slightly different model than the previous set of statements: only observation 16 is reweighted to 0.5. Also note that the model label is now MODEL1.5, since five REWEIGHT statements have been used for this model.

Another important feature of the REWEIGHT statement is the ability to nullify the effect of a previous or all REWEIGHT statements. First, assume that you have several REWEIGHT statements in effect and you want to restore the original weights of all the observations. The following REWEIGHT statement accomplishes this and produces Figure 61.49:

   reweight allobs / reset;
   print;
   run;
The REG Procedure
Model: MODEL1.6
Dependent Variable: Weight

Output Statistics

                  Dependent   Predicted
 Obs  Name         Variable       Value    Residual
   1  Alfred       112.5000    124.8686    -12.3686
   2  Alice         84.0000     78.6273      5.3727
   3  Barbara       98.0000    110.2812    -12.2812
   4  Carol        102.5000    102.5670     -0.0670
   5  Henry        102.5000    105.0849     -2.5849
   6  James         83.0000     80.2266      2.7734
   7  Jane          84.5000     89.2191     -4.7191
   8  Janet        112.5000    102.7663      9.7337
   9  Jeffrey       84.0000    100.2095    -16.2095
  10  John          99.5000     86.3415     13.1585
  11  Joyce         50.5000     57.3660     -6.8660
  12  Judy          90.0000    107.9625    -17.9625
  13  Louise        77.0000     76.6295      0.3705
  14  Mary         112.0000    117.1544     -5.1544
  15  Philip       150.0000    138.2164     11.7836
  16  Robert       128.0000    107.2043     20.7957
  17  Ronald       133.0000    118.9529     14.0471
  18  Thomas        85.0000     79.6676      5.3324
  19  William      112.0000    117.1544     -5.1544

Sum of Residuals                             0
Sum of Squared Residuals            2120.09974
Predicted Residual SS (PRESS)       3272.72186
Figure 61.49. Restoring Weights of All Observations
The resulting model is identical to the original model specified at the beginning of this section. Notice that the model label is now MODEL1.6. Note that the Weight column does not appear, since all observations have been reweighted to have weight=1.

Now suppose you want only to undo the changes made by the most recent REWEIGHT statement. Use REWEIGHT UNDO for this. The following statements produce Figure 61.50:

   reweight r. le -12 or r. ge 12 / weight=.75;
   reweight r. le -17 or r. ge 17 / weight=.5;
   reweight undo;
   print;
   run;
The REG Procedure
Model: MODEL1.9
Dependent Variable: Weight

Output Statistics

                   Weight   Dependent   Predicted
 Obs  Name       Variable    Variable       Value    Residual
   1  Alfred       0.7500    112.5000    125.1152    -12.6152
   2  Alice        1.0000     84.0000     78.7691      5.2309
   3  Barbara      0.7500     98.0000    110.3236    -12.3236
   4  Carol        1.0000    102.5000    102.8836     -0.3836
   5  Henry        1.0000    102.5000    105.3936     -2.8936
   6  James        1.0000     83.0000     80.1133      2.8867
   7  Jane         1.0000     84.5000     89.0776     -4.5776
   8  Janet        1.0000    112.5000    103.3322      9.1678
   9  Jeffrey      0.7500     84.0000    100.2835    -16.2835
  10  John         0.7500     99.5000     86.2090     13.2910
  11  Joyce        1.0000     50.5000     57.0745     -6.5745
  12  Judy         0.7500     90.0000    108.2622    -18.2622
  13  Louise       1.0000     77.0000     76.5275      0.4725
  14  Mary         1.0000    112.0000    117.6752     -5.6752
  15  Philip       1.0000    150.0000    138.9211     11.0789
  16  Robert       0.7500    128.0000    107.0063     20.9937
  17  Ronald       0.7500    133.0000    119.4681     13.5319
  18  Thomas       1.0000     85.0000     79.3061      5.6939
  19  William      1.0000    112.0000    117.6752     -5.6752

Sum of Residuals                             0
Sum of Squared Residuals            1694.87114
Predicted Residual SS (PRESS)       2547.22751

NOTE: The above statistics use observation weights or frequencies.
Figure 61.50. Example of UNDO in REWEIGHT Statement
The resulting model reflects changes made only by the first REWEIGHT statement, since the third REWEIGHT statement negates the effect of the second REWEIGHT statement. Observations 1, 3, 9, 10, 12, 16, and 17 have their weights changed to 0.75. Note that the label MODEL1.9 reflects the use of nine REWEIGHT statements for the current model.

Now suppose you want to reset the observations selected by the most recent REWEIGHT statement to their original weights. Use the REWEIGHT statement with the RESET option to do this. The following statements produce Figure 61.51:

   reweight r. le -12 or r. ge 12 / weight=.75;
   reweight r. le -17 or r. ge 17 / weight=.5;
   reweight / reset;
   print;
   run;
The REG Procedure
Model: MODEL1.12
Dependent Variable: Weight

Output Statistics

                   Weight   Dependent   Predicted
 Obs  Name       Variable    Variable       Value    Residual
   1  Alfred       0.7500    112.5000    126.0076    -13.5076
   2  Alice        1.0000     84.0000     77.8727      6.1273
   3  Barbara      0.7500     98.0000    111.2805    -13.2805
   4  Carol        1.0000    102.5000    102.4703      0.0297
   5  Henry        1.0000    102.5000    105.1278     -2.6278
   6  James        1.0000     83.0000     80.2290      2.7710
   7  Jane         1.0000     84.5000     89.7199     -5.2199
   8  Janet        1.0000    112.5000    102.0122     10.4878
   9  Jeffrey      0.7500     84.0000    100.6507    -16.6507
  10  John         0.7500     99.5000     86.6828     12.8172
  11  Joyce        1.0000     50.5000     56.7703     -6.2703
  12  Judy         1.0000     90.0000    108.1649    -18.1649
  13  Louise       1.0000     77.0000     76.4327      0.5673
  14  Mary         1.0000    112.0000    117.1975     -5.1975
  15  Philip       1.0000    150.0000    138.7581     11.2419
  16  Robert       1.0000    128.0000    108.7016     19.2984
  17  Ronald       0.7500    133.0000    119.0957     13.9043
  18  Thomas       1.0000     85.0000     80.3076      4.6924
  19  William      1.0000    112.0000    117.1975     -5.1975

Sum of Residuals                             0
Sum of Squared Residuals            1879.08980
Predicted Residual SS (PRESS)       2959.57279

NOTE: The above statistics use observation weights or frequencies.
Figure 61.51. REWEIGHT Statement with RESET option
Note that observations that meet the condition of the second REWEIGHT statement (residuals with an absolute value greater than or equal to 17) now have weights reset to their original value of 1. Observations 1, 3, 9, 10, and 17 have weights of 0.75, but observations 12 and 16 (which meet the condition of the second REWEIGHT statement) have their weights reset to 1. Notice how the last three examples show three ways to change weights back to a previous value. In the first example, ALLOBS and the RESET option are used to change weights for all observations back to their original values. In the second example, the UNDO option is used to negate the effect of a previous REWEIGHT statement, thus changing weights for observations selected in the previous REWEIGHT statement to the weights specified in still another REWEIGHT statement. In the third example, the RESET option is used to change weights for observations selected in a previous REWEIGHT statement back to their original values. Finally, note that the label MODEL1.12 indicates that twelve REWEIGHT statements have been applied to the original model.
Testing for Heteroscedasticity

The regression model is specified as y_i = x_i\beta + \epsilon_i, where the \epsilon_i's are identically and independently distributed: E(\epsilon) = 0 and E(\epsilon\epsilon') = \sigma^2 I. If the \epsilon_i's are not independent or their variances are not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is inconsistent.

In the case of heteroscedasticity, the ACOV option provides a consistent estimate of the covariance matrix. If the regression data are from a simple random sample, the ACOV option produces the covariance matrix. This matrix is

   (X'X)^{-1} (X' \mathrm{diag}(e_i^2) X) (X'X)^{-1}

where

   e_i = y_i - x_i b

The SPEC option performs a model specification test. The null hypothesis for this test maintains that the errors are homoscedastic, independent of the regressors, and that several technical assumptions about the model specification are valid. For details, see theorem 2 and assumptions 1–7 of White (1980). When the model is correctly specified and the errors are independent of the regressors, the rejection of this null hypothesis is evidence of heteroscedasticity.

In implementing this test, an estimator of the average covariance matrix (White 1980, p. 822) is constructed and inverted. The nonsingularity of this matrix is one of the assumptions in the null hypothesis about the model specification. When PROC REG determines this matrix to be numerically singular, a generalized inverse is used and a note to this effect is written to the log. In such cases, care should be taken in interpreting the results of this test.

When you specify the SPEC option, tests listed in the TEST statement are performed with both the usual covariance matrix and the heteroscedasticity consistent covariance matrix. Tests performed with the consistent covariance matrix are asymptotic. For more information, refer to White (1980).

Both the ACOV and SPEC options can be specified in a MODEL or PRINT statement.
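A minimal sketch of requesting both options follows; the data set name and the regressor names are hypothetical:

   proc reg data=mydata;
      model y = x1 x2 / acov spec;   /* consistent covariance + White's test */
      test x1, x2;                   /* with SPEC, the test is also computed
                                        with the consistent covariance matrix */
   run;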
Multivariate Tests

The MTEST statement described in the "MTEST Statement" section on page 3832 can test hypotheses involving several dependent variables in the form

   (L\beta - cj)M = 0

where L is a linear function on the regressor side, \beta is a matrix of parameters, c is a column vector of constants, j is a row vector of ones, and M is a linear function on the dependent side. The special case where the constants are zero is

   L\beta M = 0
To test this hypothesis, PROC REG constructs two matrices called H and E that correspond to the numerator and denominator of a univariate F test:

   H = M'(LB - cj)' (L(X'X)^{-}L')^{-1} (LB - cj) M
   E = M'(Y'Y - B'(X'X)B) M
These matrices are displayed for each MTEST statement if the PRINT option is specified. Four test statistics based on the eigenvalues of E^{-1}H or (E + H)^{-1}H are formed. These are Wilks' Lambda, Pillai's Trace, the Hotelling-Lawley Trace, and Roy's maximum root. These test statistics are discussed in Chapter 2, "Introduction to Regression Procedures."

The following statements perform a multivariate analysis of variance and produce Figure 61.52 through Figure 61.56:

   * Manova Data from Morrison (1976, 190);
   data a;
      input sex $ drug $ @;
      do rep=1 to 4;
         input y1 y2 @;
         sexcode=(sex='m')-(sex='f');
         drug1=(drug='a')-(drug='c');
         drug2=(drug='b')-(drug='c');
         sexdrug1=sexcode*drug1;
         sexdrug2=sexcode*drug2;
         output;
      end;
      datalines;
   m a 5  6  5  4  9  9  7  6
   m b 7  6  7  7  9 12  6  8
   m c 21 15 14 11 17 12 12 10
   f a 7 10  6  6  9  7  8 10
   f b 10 13  8  7  7  6  6  9
   f c 16 12 14  9 14  8 10  5
   ;

   proc reg;
      model y1 y2=sexcode drug1 drug2 sexdrug1 sexdrug2;
      y1y2drug: mtest y1=y2, drug1, drug2;
      drugshow: mtest drug1, drug2 / print canprint;
   run;
The REG Procedure
Model: MODEL1
Dependent Variable: y1

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF           Squares         Square    F Value    Pr > F
Model                   5         316.00000       63.20000      12.04    <.0001
Error                  18          94.50000        5.25000
Corrected Total        23         410.50000

Root MSE             2.29129    R-Square    0.7698
Dependent Mean       9.75000    Adj R-Sq    0.7058
Coeff Var           23.50039

                             Parameter Estimates

               Parameter      Standard
Variable   DF   Estimate         Error    t Value    Pr > |t|
Intercept   1    9.75000       0.46771      20.85      <.0001
sexcode     1    0.16667       0.46771       0.36      0.7257
drug1       1   -2.75000       0.66144      -4.16      0.0006
drug2       1   -2.25000       0.66144      -3.40      0.0032
sexdrug1    1   -0.66667       0.66144      -1.01      0.3269
sexdrug2    1   -0.41667       0.66144      -0.63      0.5366
Figure 61.52. Multivariate Analysis of Variance: REG Procedure
The REG Procedure
Model: MODEL1
Dependent Variable: y2

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF           Squares         Square    F Value    Pr > F
Model                   5          69.33333       13.86667       2.19    0.1008
Error                  18         114.00000        6.33333
Corrected Total        23         183.33333

Root MSE             2.51661    R-Square    0.3782
Dependent Mean       8.66667    Adj R-Sq    0.2055
Coeff Var           29.03782

                             Parameter Estimates

               Parameter      Standard
Variable   DF   Estimate         Error    t Value    Pr > |t|
Intercept   1    8.66667       0.51370      16.87      <.0001
sexcode     1    0.16667       0.51370       0.32      0.7493
drug1       1   -1.41667       0.72648      -1.95      0.0669
drug2       1   -0.16667       0.72648      -0.23      0.8211
sexdrug1    1   -1.16667       0.72648      -1.61      0.1257
sexdrug2    1   -0.41667       0.72648      -0.57      0.5734
Figure 61.53. Multivariate Analysis of Variance: REG Procedure

The REG Procedure
Model: MODEL1
Multivariate Test: y1y2drug

              Multivariate Statistics and Exact F Statistics
                            S=1    M=0    N=8

Statistic                        Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda               0.28053917      23.08         2        18    <.0001
Pillai's Trace              0.71946083      23.08         2        18    <.0001
Hotelling-Lawley Trace      2.56456456      23.08         2        18    <.0001
Roy's Greatest Root         2.56456456      23.08         2        18    <.0001
Figure 61.54. Multivariate Analysis of Variance: First Test
The four multivariate test statistics are all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not the same across dependent variables y1 and y2.
The REG Procedure
Model: MODEL1
Multivariate Test: drugshow

        Error Matrix (E)
         94.5        76.5
         76.5         114

     Hypothesis Matrix (H)
          301        97.5
         97.5    36.333333333

                 Adjusted    Approximate        Squared
  Canonical     Canonical       Standard      Canonical
Correlation   Correlation          Error    Correlation
   0.905903      0.899927       0.040101       0.820661
   0.244371             .       0.210254       0.059717

        Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

     Eigenvalue    Difference    Proportion    Cumulative
  1      4.5760        4.5125        0.9863        0.9863
  2      0.0635                      0.0137        1.0000

Test of H0: The canonical correlations in the current
            row and all that follow are zero

     Likelihood    Approximate
          Ratio        F Value    Num DF    Den DF    Pr > F
  1  0.16862952          12.20         4        34    <.0001
  2  0.94028273           1.14         1        18    0.2991
Figure 61.55. Multivariate Analysis of Variance: Second Test
The REG Procedure
Model: MODEL1
Multivariate Test: drugshow

             Multivariate Statistics and F Approximations
                        S=2    M=-0.5    N=7.5

Statistic                        Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda               0.16862952      12.20         4        34    <.0001
Pillai's Trace              0.88037810       7.08         4        36    0.0003
Hotelling-Lawley Trace      4.63953666      19.40         4    19.407    <.0001
Roy's Greatest Root         4.57602675      41.18         2        18    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
Figure 61.56. Multivariate Analysis of Variance: Second Test (continued)
The four multivariate test statistics are all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not zero for both dependent variables.
Autocorrelation in Time Series Data

When regression is performed on time series data, the errors may not be independent. Often errors are autocorrelated; that is, each error is correlated with the error immediately before it. Autocorrelation is also a symptom of systematic lack of fit. The DW option provides the Durbin-Watson d statistic to test that the autocorrelation is zero:

   d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}

The value of d is close to 2 if the errors are uncorrelated. The distribution of d is reported by Durbin and Watson (1951). Tables of the distribution are found in most econometrics textbooks, such as Johnston (1972) and Pindyck and Rubinfeld (1981).

The sample autocorrelation estimate is displayed after the Durbin-Watson statistic. It is computed as

   r = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}

This autocorrelation of the residuals may not be a very good estimate of the autocorrelation of the true errors, especially if there are few observations and the independent variables have certain patterns. If there are missing observations in the regression, these measures are computed as though the missing observations did not exist.

Positive autocorrelation of the errors generally tends to make the estimate of the error variance too small, so confidence intervals are too narrow and true null hypotheses are rejected with a higher probability than the stated significance level. Negative autocorrelation of the errors generally tends to make the estimate of the error variance too large, so confidence intervals are too wide and the power of significance tests is reduced. With either positive or negative autocorrelation, least-squares parameter estimates are usually not as efficient as generalized least-squares parameter estimates. For more details, refer to Judge et al. (1985, Chapter 8) and the SAS/ETS User's Guide.

The following SAS statements request the DW option for the US population data (see Figure 61.57):

   proc reg data=USPopulation;
      model Population=Year YearSq / dw;
   run;
The REG Procedure
Model: MODEL1
Dependent Variable: Population

Durbin-Watson D                  1.191
Number of Observations              22
1st Order Autocorrelation        0.323
Figure 61.57. Regression Using DW Option
Computations for Ridge Regression and IPC Analysis

In ridge regression analysis, the crossproduct matrix for the independent variables is centered (the NOINT option is ignored if it is specified) and scaled to one on the diagonal elements. The ridge constant k (specified with the RIDGE= option) is then added to each diagonal element of the crossproduct matrix. The ridge regression estimates are the least-squares estimates obtained by using the new crossproduct matrix.

Let X be an n \times p matrix of the independent variables after centering the data, and let Y be an n \times 1 vector corresponding to the dependent variable. Let D be a p \times p diagonal matrix with diagonal elements as in X'X. The ridge regression estimate corresponding to the ridge constant k can be computed as

   D^{-1/2} (Z'Z + kI_p)^{-1} Z'Y

where Z = XD^{-1/2} and I_p is a p \times p identity matrix.

For IPC analysis, the smallest m eigenvalues of Z'Z (where m is specified with the PCOMIT= option) are omitted to form the estimates.

For information about ridge regression and IPC standardized parameter estimates, parameter estimate standard errors, and variance inflation factors, refer to Rawlings (1988), Neter, Wasserman, and Kutner (1990), and Marquardt and Snee (1975). Unlike Rawlings (1988), the REG procedure uses the mean squared errors of the submodels instead of the full model MSE to compute the standard errors of the parameter estimates.
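A minimal sketch of a ridge trace computation follows; the grid of ridge constants and the use of the fitness data are illustrative choices, not part of the original text:

   proc reg data=fitness outest=ridgeest outvif
            ridge=0 to 0.05 by 0.005;   /* grid of ridge constants k */
      model Oxygen = RunTime Age Weight;
   run;

   proc print data=ridgeest;            /* one row of estimates per k */
   run;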
Construction of Q-Q and P-P Plots

If a normal probability-probability or quantile-quantile plot for the variable x is requested, the n nonmissing values of x are first ordered from smallest to largest:

   x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}

If a Q-Q plot is requested (with a PLOT statement of the form PLOT yvariable*NQQ.), the ith ordered value x_{(i)} is represented by a point with y-coordinate x_{(i)} and x-coordinate \Phi^{-1}\left(\frac{i - 0.375}{n + 0.25}\right), where \Phi(\cdot) is the standard normal distribution.

If a P-P plot is requested (with a PLOT statement of the form PLOT yvariable*NPP.), the ith ordered value x_{(i)} is represented by a point with y-coordinate i/n and x-coordinate \Phi\left(\frac{x_{(i)} - \mu}{\sigma}\right), where \mu is the mean of the nonmissing x-values and \sigma is the standard deviation. If an x-value has multiplicity k (that is, x_{(i)} = \cdots = x_{(i+k-1)}), then only the point \left(\Phi\left(\frac{x_{(i)} - \mu}{\sigma}\right), \frac{i+k-1}{n}\right) is displayed.
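For example, assuming the fitness data, the following sketch requests a normal Q-Q plot and a normal P-P plot of the residuals with the PLOT statement:

   proc reg data=fitness;
      model Oxygen = RunTime;
      plot r.*nqq.;   /* normal quantile-quantile plot of residuals */
      plot r.*npp.;   /* normal probability-probability plot        */
   run;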
Computational Methods

The REG procedure first composes a crossproducts matrix. The matrix can be calculated from input data, reformed from an input correlation matrix, or read in from an SSCP data set. For each model, the procedure selects the appropriate crossproducts from the main matrix. The normal equations formed from the crossproducts are solved using a sweep algorithm (Goodnight 1979). The method is accurate for data that are reasonably scaled and not too collinear.

The mechanism that PROC REG uses to check for singularity involves the diagonal (pivot) elements of X'X as it is being swept. If a pivot is less than SINGULAR*CSS, then a singularity is declared and the pivot is not swept (where CSS is the corrected sum of squares for the regressor and SINGULAR is machine dependent but is approximately 1E-7 on most machines or reset in the PROC statement).

The sweep algorithm is also used in many places in the model-selection methods. The RSQUARE method uses the leaps and bounds algorithm by Furnival and Wilson (1974).
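To illustrate the role of the sweep operator, here is a small SAS/IML sketch (not PROC REG internals) that solves the normal equations by sweeping a bordered crossproducts matrix; the tiny X and y are made up for the example:

   proc iml;
      X = {1 1, 1 2, 1 3, 1 4};     /* intercept column plus one regressor */
      y = {1, 3, 2, 5};
      M = (X || y)` * (X || y);     /* bordered crossproducts matrix       */
      S = sweep(M, 1:2);            /* sweep the X'X pivots                */
      b   = S[1:2, 3];              /* swept border holds the estimates    */
      sse = S[3, 3];                /* corner holds the error sum of squares */
      print b sse;
   quit;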
Computer Resources in Regression Analysis

The REG procedure is efficient for ordinary regression; however, requests for optional features can greatly increase the amount of time required. The major computational expense in the regression analysis is the collection of the crossproducts matrix. For p variables and n observations, the time required is proportional to np^2. For each model run, PROC REG needs time roughly proportional to k^3, where k is the number of regressors in the model. Add an additional nk^2 for one of the R, CLM, or CLI options and another nk^2 for the INFLUENCE option.
Most of the memory that PROC REG needs to solve large problems is used for crossproducts matrices. PROC REG requires 4p^2 bytes for the main crossproducts matrix plus 4k^2 bytes for the largest model. If several output data sets are requested, memory is also needed for buffers.

See the "Input Data Sets" section on page 3860 for information on how to use TYPE=SSCP data sets to reduce computing time.
Displayed Output

Many of the more specialized tables are described in detail in previous sections. Most of the formulas for the statistics are in Chapter 2, "Introduction to Regression Procedures," while other formulas can be found in the section "Model Fit and Diagnostic Statistics" on page 3896 and the "Influence Diagnostics" section on page 3898.

The analysis-of-variance table includes

• the Source of the variation, Model for the fitted regression, Error for the residual error, and C Total for the total variation after correcting for the mean. The Uncorrected Total Variation is produced when the NOINT option is used.
• the degrees of freedom (DF) associated with the source
• the Sum of Squares for the term
• the Mean Square, the sum of squares divided by the degrees of freedom
• the F Value for testing the hypothesis that all parameters are zero except for the intercept. This is formed by dividing the mean square for Model by the mean square for Error.
• the Prob>F, the probability of getting a greater F statistic than that observed if the hypothesis is true. This is the significance probability.

Other statistics displayed include the following:

• Root MSE is an estimate of the standard deviation of the error term. It is calculated as the square root of the mean square error.
• Dep Mean is the sample mean of the dependent variable.
• C.V. is the coefficient of variation, computed as 100 times Root MSE divided by Dep Mean. This expresses the variation in unitless values.
• R-Square is a measure between 0 and 1 that indicates the portion of the (corrected) total variation that is attributed to the fit rather than left to residual error. It is calculated as SS(Model) divided by SS(Total). It is also called the coefficient of determination. It is the square of the multiple correlation; in other words, the square of the correlation between the dependent variable and the predicted values.
• Adj R-Sq, the adjusted R^2, is a version of R^2 that has been adjusted for degrees of freedom. It is calculated as

     \bar{R}^2 = 1 - \frac{(n - i)(1 - R^2)}{n - p}

  where i is equal to 1 if there is an intercept and 0 otherwise; n is the number of observations used to fit the model; and p is the number of parameters in the model.
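As a numerical check, assuming the y1 fit shown earlier in Figure 61.52 (n = 24, p = 6, i = 1, R^2 = 0.7698):

     \bar{R}^2 = 1 - \frac{(24-1)(1-0.7698)}{24-6} = 1 - \frac{23 \times 0.2302}{18} \approx 0.7058

which agrees with the Adj R-Sq value displayed there.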
The parameter estimates and associated statistics are then displayed, and they include the following:

• the Variable used as the regressor, including the name Intercept to represent the estimate of the intercept parameter
• the degrees of freedom (DF) for the variable. There is one degree of freedom unless the model is not full rank.
• the Parameter Estimate
• the Standard Error, the estimate of the standard deviation of the parameter estimate
• T for H0: Parameter=0, the t test that the parameter is zero. This is computed as the Parameter Estimate divided by the Standard Error.
• the Prob > |T|, the probability that a t statistic would obtain a greater absolute value than that observed given that the true parameter is zero. This is the two-tailed significance probability.

If model-selection methods other than NONE, RSQUARE, ADJRSQ, or CP are used, the analysis-of-variance table and the parameter estimates with associated statistics are produced at each step. Also displayed are

• C(p), which is Mallows' C_p statistic
• bounds on the condition number of the correlation matrix for the variables in the model (Berk 1977)

After statistics for the final model are produced, the following is displayed when the method chosen is FORWARD, BACKWARD, or STEPWISE:

• a Summary table listing Step number, Variable Entered or Removed, Partial and Model R-Square, and C(p) and F statistics

The RSQUARE method displays its results beginning with the model containing the fewest independent variables and producing the largest R^2. Results for other models with the same number of variables are then shown in order of decreasing R^2, and so on, for models with larger numbers of variables.
The ADJRSQ and CP methods group models of all sizes together and display results beginning with the model having the optimal value of adjusted R^2 and C_p, respectively.

For each model considered, the RSQUARE, ADJRSQ, and CP methods display the following:

• Number in Model or IN, the number of independent variables used in each model
• R-Square or RSQ, the squared multiple correlation coefficient

If the B option is specified, the RSQUARE, ADJRSQ, and CP methods produce the following:

• Parameter Estimates, the estimated regression coefficients

If the B option is not specified, the RSQUARE, ADJRSQ, and CP methods display the following:

• Variables in Model, the names of the independent variables included in the model
ODS Table Names

PROC REG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 61.8. ODS Tables Produced in PROC REG

ODS Table Name       Description                                            Statement  Option
ACovEst              Consistent covariance of estimates matrix              MODEL      ALL, ACOV
ACovTestANOVA        Test ANOVA using ACOV estimates                        TEST       ACOV (MODEL statement)
ANOVA                Model ANOVA table                                      MODEL      default
CanCorr              Canonical correlations for hypothesis combinations     MTEST      CANPRINT
CollinDiag           Collinearity Diagnostics table                         MODEL      COLLIN
CollinDiagNoInt      Collinearity Diagnostics for no intercept model        MODEL      COLLINOINT
ConditionBounds      Bounds on condition number                             MODEL      (SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR) and DETAILS
Corr                 Correlation matrix for analysis variables              PROC       ALL, CORR
CorrB                Correlation of estimates                               MODEL      CORRB
CovB                 Covariance of estimates                                MODEL      COVB
CrossProducts        Bordered model X'X matrix                              MODEL      ALL, XPX
DWStatistic          Durbin-Watson statistic                                MODEL      ALL, DW
DependenceEquations  Linear dependence equations                            MODEL      default if needed
Eigenvalues          MTest eigenvalues                                      MTEST      CANPRINT
Eigenvectors         MTest eigenvectors                                     MTEST      CANPRINT
EntryStatistics      Entry statistics for selection methods                 MODEL      (SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR) and DETAILS
ErrorPlusHypothesis  MTest error plus hypothesis matrix H+E                 MTEST      PRINT
ErrorSSCP            MTest error matrix E                                   MTEST      PRINT
FitStatistics        Model fit statistics                                   MODEL      default
HypothesisSSCP       MTest hypothesis matrix                                MTEST      PRINT
InvMTestCov          Inv(L Ginv(X'X) L') and Inv(Lb-c)                      MTEST      DETAILS
InvTestCov           Inv(L Ginv(X'X) L') and Inv(Lb-c)                      TEST       PRINT
InvXPX               Bordered X'X inverse matrix                            MODEL      I
MTestCov             L Ginv(X'X) L' and Lb-c                                MTEST      DETAILS
MTransform           MTest matrix M, across dependents                      MTEST      DETAILS
MultStat             Multivariate test statistics                           MTEST      default
NObs                 Number of observations                                 MODEL      default
OutputStatistics     Output statistics table                                MODEL      ALL, CLI, CLM, INFLUENCE, P, R
ParameterEstimates   Model parameter estimates                              MODEL      default
RemovalStatistics    Removal statistics for selection methods               MODEL      (SELECTION=BACKWARD | STEPWISE | MAXR | MINR) and DETAILS
ResidualStatistics   Residual statistics and PRESS statistic                MODEL      ALL, CLI, CLM, INFLUENCE, P, R
SelParmEst           Parameter estimates for selection methods              MODEL      SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR
SelectionSummary     Selection summary for forward, backward and stepwise methods   MODEL   SELECTION=BACKWARD | FORWARD | STEPWISE
SeqParmEst           Sequential parameter estimates                         MODEL      SEQB
SimpleStatistics     Simple statistics for analysis variables               PROC       ALL, SIMPLE
SpecTest             White's heteroscedasticity test                        MODEL      ALL, SPEC
SubsetSelSummary     Selection summary for R-Square, Adj-RSq and Cp methods MODEL      SELECTION=RSQUARE | ADJRSQ | CP
TestANOVA            Test ANOVA table                                       TEST       default
TestCov              L Ginv(X'X) L' and Lb-c                                TEST       PRINT
USSCP                Uncorrected SSCP matrix for analysis variables         PROC       ALL, USSCP
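For example, the following sketch captures the parameter estimates and fit statistics tables in data sets; the data set names pe and fit are arbitrary:

   ods output ParameterEstimates=pe FitStatistics=fit;
   proc reg data=fitness;
      model Oxygen = RunTime Age;
   run;
   quit;

   proc print data=pe;
   run;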
ODS Graphics (Experimental)

This section describes the use of ODS for creating statistical graphs with the REG procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request these graphs you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, "Statistical Graphics Using ODS."

When the experimental ODS graphics are in effect, the REG procedure produces a variety of plots. For models with multiple dependent variables, separate plots are produced for each dependent variable. For jobs with more than one MODEL statement, plots are produced for each model statement. The plots available are as follows:

• With a single regressor, a scatterplot of the input data overlayed with the fitted regression line, confidence band, and prediction limits.
• A summary panel of fit diagnostics:
   – Residuals versus the predicted values
   – Studentized residuals versus the predicted values
   – Studentized residuals versus the leverage
   – Normal quantile plot of the residuals
   – Dependent variable values versus the predicted values
   – Cook's D versus observation number
   – Histogram of the residuals
   – A "Residual-Fit" (or RF) plot consisting of side-by-side quantile plots of the centered fit and the residuals. This plot "shows how much variation in the data is explained by the fit and how much remains in the residuals" (Cleveland, 1993).
  If the PLOTS(UNPACKPANELS) option is specified in the PROC REG statement, then the eight plots in the fit diagnostics panel are displayed individually.
• Panels of the residuals versus the regressors in the model. Note that each panel contains at most six plots, and multiple panels are used in the case that there are more than six regressors (including the intercept) in the model.
• If the PARTIAL option is specified in a MODEL statement, panels of the partial regression plots for each regressor (see the "The PARTIAL Option" section on page 3901). Note that each panel contains at most six partial plots, and multiple panels are used in the case that there are more than six regressors in the model.
• If the RIDGE= option is specified in the model statement, panels of ridge traces versus the specified ridge parameters for each regressor in the model. At most eight ridge traces are included on a panel and multiple panels are used for models with more than eight regressors.
specifies characteristics of the graphics produced when you use the experimental ODS GRAPHICS statement. You can specify the following general-plot-options in parentheses after the PLOTS option: UNPACK | UNPACKPANELS specifies that plots in the fit diagnostics panel should
be displayed separately. MAXPOINTS=number | NONE specifies that plots with elements that require processing more than number points are suppressed. The default
is MAXPOINTS=5000. MAXPOINTS=NONE.
This cutoff is ignored if you specify
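A minimal sketch combining these options with the experimental ODS graphics follows; the variable choices are illustrative:

   ods html;
   ods graphics on;

   proc reg data=fitness plots(unpackpanels maxpoints=none);
      model Oxygen = RunTime Age Weight;
   run;
   quit;

   ods graphics off;
   ods html close;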
ODS Graph Names

PROC REG assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 61.9. To request these graphs you must specify the ODS GRAPHICS statement. For more information on the ODS GRAPHICS statement, see Chapter 15, "Statistical Graphics Using ODS."

Table 61.9. ODS Graphics Produced by PROC REG

ODS Graph Name       Plot Description                                                 PLOTS Option
ActualByPredicted    Dependent variable versus predicted values                       UNPACKPANELS
CooksD               Cook's D statistic versus observation number                     UNPACKPANELS
DiagnosticsPanel     Panel of fit diagnostics
Fit                  Regression line, confidence band, and prediction limits
                     overlayed on scatterplot of data
PartialPlotPaneli    Panel i of partial regression plots
QQPlot               Normal quantile plot of residuals                                UNPACKPANELS
ResidualByPredicted  Residuals versus predicted values                                UNPACKPANELS
ResidualHistogram    Histogram of fit residuals                                       UNPACKPANELS
ResidualPaneli       Panel i of residuals versus regressors
RFPlot               Side-by-side plots of quantiles of centered fit and residuals    UNPACKPANELS
RidgePaneli          Panel i of ridge traces
RStudentByLeverage   Studentized residuals versus leverage                            UNPACKPANELS
RStudentByPredicted  Studentized residuals versus predicted values                    UNPACKPANELS
Examples

Example 61.1. Aerobic Fitness Prediction

Aerobic fitness (measured by the ability to consume oxygen) is fit to some simple exercise tests. The goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive and cumbersome oxygen consumption measurements. Three model-selection methods are used: forward selection, backward selection, and MAXR selection. The following statements produce Output 61.1.1 through Output 61.1.5. (Collinearity diagnostics for the full model are shown in Figure 61.42 on page 3896.)
   *-------------------Data on Physical Fitness-------------------*
   | These measurements were made on men involved in a physical   |
   | fitness course at N.C.State Univ. The variables are Age      |
   | (years), Weight (kg), Oxygen intake rate (ml per kg body     |
   | weight per minute), time to run 1.5 miles (minutes), heart   |
   | rate while resting, heart rate while running (same time      |
   | Oxygen rate measured), and maximum heart rate recorded while |
   | running.                                                     |
   | ***Certain values of MaxPulse were changed for this analysis.|
   *--------------------------------------------------------------*;
   data fitness;
      input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
      datalines;
   44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
   44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
   38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
   40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
   44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
   44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
   45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
   54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
   51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
   48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
   57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
   52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
   51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
   51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
   49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
   52 82.78 47.467 10.50 53 170 172
   ;

   proc reg data=fitness;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=forward;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=backward;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=maxr;
   run;
The FORWARD model-selection method begins with no variables in the model and adds RunTime, then Age,...
Output 61.1.1. Forward Selection Method: PROC REG

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

                           Forward Selection: Step 1

       Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   1        632.90010      632.90010      84.01    <.0001
Error                  29        218.48144        7.53384
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      82.42177       3.85530    3443.36654     457.05    <.0001
RunTime        -3.31056       0.36119     632.90010      84.01    <.0001

Bounds on condition number: 1, 1
--------------------------------------------------------------------------------

                           Forward Selection: Step 2

         Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   2        650.66573      325.33287      45.38    <.0001
Error                  28        200.71581        7.16842
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      88.46229       5.37264    1943.41071     271.11    <.0001
Age            -0.15037       0.09551      17.76563       2.48    0.1267
RunTime        -3.20395       0.35877     571.67751      79.75    <.0001

Bounds on condition number: 1.0369, 4.1478
--------------------------------------------------------------------------------
...then RunPulse, then MaxPulse,...
                           Forward Selection: Step 3

       Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   3        690.55086      230.18362      38.64    <.0001
Error                  27        160.83069        5.95669
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept     111.71806      10.23509     709.69014     119.14    <.0001
Age            -0.25640       0.09623      42.28867       7.10    0.0129
RunTime        -2.82538       0.35828     370.43529      62.19    <.0001
RunPulse       -0.13091       0.05059      39.88512       6.70    0.0154

Bounds on condition number: 1.3548, 11.597
--------------------------------------------------------------------------------

                           Forward Selection: Step 4

       Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   4        712.45153      178.11288      33.33    <.0001
Error                  26        138.93002        5.34346
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      98.14789      11.78569     370.57373      69.35    <.0001
Age            -0.19773       0.09564      22.84231       4.27    0.0488
RunTime        -2.76758       0.34054     352.93570      66.05    <.0001
RunPulse       -0.34811       0.11750      46.90089       8.78    0.0064
MaxPulse        0.27051       0.13362      21.90067       4.10    0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------
...and finally, Weight. The final variable available to add to the model, RestPulse, is not added since it does not meet the 50% (the default value of the SLE option is 0.5 for FORWARD selection) significance-level criterion for entry into the model.
                           Forward Selection: Step 5

        Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   5        721.97309      144.39462      27.90    <.0001
Error                  25        129.40845        5.17634
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept     102.20428      11.97929     376.78935      72.79    <.0001
Age            -0.21962       0.09550      27.37429       5.29    0.0301
Weight         -0.07230       0.05331       9.52157       1.84    0.1871
RunTime        -2.68252       0.34099     320.35968      61.89    <.0001
RunPulse       -0.37340       0.11714      52.59624      10.16    0.0038
MaxPulse        0.30491       0.13394      26.82640       5.18    0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------

No other variable met the 0.5000 significance level for entry into the model.

                         Summary of Forward Selection

        Variable    Number    Partial      Model
Step    Entered     Vars In   R-Square   R-Square       C(p)    F Value   Pr > F
  1     RunTime           1     0.7434     0.7434    13.6988      84.01   <.0001
  2     Age               2     0.0209     0.7642    12.3894       2.48   0.1267
  3     RunPulse          3     0.0468     0.8111     6.9596       6.70   0.0154
  4     MaxPulse          4     0.0257     0.8368     4.8800       4.10   0.0533
  5     Weight            5     0.0112     0.8480     5.1063       1.84   0.1871
The BACKWARD model-selection method begins with the full model.
Output 61.1.2. Backward Selection Method: PROC REG

The REG Procedure
Model: MODEL2
Dependent Variable: Oxygen

                          Backward Elimination: Step 0

         All Variables Entered: R-Square = 0.8487 and C(p) = 7.0000

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   6        722.54361      120.42393      22.43    <.0001
Error                  24        128.83794        5.36825
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept     102.93448      12.40326     369.72831      68.87    <.0001
Age            -0.22697       0.09984      27.74577       5.17    0.0322
Weight         -0.07418       0.05459       9.91059       1.85    0.1869
RunTime        -2.62865       0.38456     250.82210      46.72    <.0001
RunPulse       -0.36963       0.11985      51.05806       9.51    0.0051
RestPulse      -0.02153       0.06605       0.57051       0.11    0.7473
MaxPulse        0.30322       0.13650      26.49142       4.93    0.0360

Bounds on condition number: 8.7438, 137.13
--------------------------------------------------------------------------------
RestPulse is the first variable deleted,...
                          Backward Elimination: Step 1

      Variable RestPulse Removed: R-Square = 0.8480 and C(p) = 5.1063

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   5        721.97309      144.39462      27.90    <.0001
Error                  25        129.40845        5.17634
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept     102.20428      11.97929     376.78935      72.79    <.0001
Age            -0.21962       0.09550      27.37429       5.29    0.0301
Weight         -0.07230       0.05331       9.52157       1.84    0.1871
RunTime        -2.68252       0.34099     320.35968      61.89    <.0001
RunPulse       -0.37340       0.11714      52.59624      10.16    0.0038
MaxPulse        0.30491       0.13394      26.82640       5.18    0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------
...followed by Weight. No other variables are deleted from the model since the variables remaining (Age, RunTime, RunPulse, and MaxPulse) are all significant at the 10% (the default value of the SLS option is 0.1 for the BACKWARD elimination method) significance level.
                          Backward Elimination: Step 2

        Variable Weight Removed: R-Square = 0.8368 and C(p) = 4.8800

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   4        712.45153      178.11288      33.33    <.0001
Error                  26        138.93002        5.34346
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      98.14789      11.78569     370.57373      69.35    <.0001
Age            -0.19773       0.09564      22.84231       4.27    0.0488
RunTime        -2.76758       0.34054     352.93570      66.05    <.0001
RunPulse       -0.34811       0.11750      46.90089       8.78    0.0064
MaxPulse        0.27051       0.13362      21.90067       4.10    0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1000 level.

                        Summary of Backward Elimination

        Variable    Number    Partial      Model
Step    Removed     Vars In   R-Square   R-Square       C(p)    F Value   Pr > F
  1     RestPulse         5     0.0007     0.8480     5.1063       0.11   0.7473
  2     Weight            4     0.0112     0.8368     4.8800       1.84   0.1871
The MAXR method tries to find the “best” one-variable model, the “best” twovariable model, and so on. For the fitness data, the one-variable model contains RunTime; the two-variable model contains RunTime and Age;
Output 61.1.3. Maximum R-Square Improvement Selection Method: PROC REG

The REG Procedure
Model: MODEL3
Dependent Variable: Oxygen

                     Maximum R-Square Improvement: Step 1

       Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   1        632.90010      632.90010      84.01    <.0001
Error                  29        218.48144        7.53384
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      82.42177       3.85530    3443.36654     457.05    <.0001
RunTime        -3.31056       0.36119     632.90010      84.01    <.0001

Bounds on condition number: 1, 1
--------------------------------------------------------------------------------

The above model is the best 1-variable model found.

                     Maximum R-Square Improvement: Step 2

         Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

                             Analysis of Variance

                                     Sum of           Mean
Source                 DF          Squares         Square    F Value    Pr > F
Model                   2        650.66573      325.33287      45.38    <.0001
Error                  28        200.71581        7.16842
Corrected Total        30        851.38154

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      88.46229       5.37264    1943.41071     271.11    <.0001
Age            -0.15037       0.09551      17.76563       2.48    0.1267
RunTime        -3.20395       0.35877     571.67751      79.75    <.0001

Bounds on condition number: 1.0369, 4.1478
--------------------------------------------------------------------------------

The above model is the best 2-variable model found.
                   Maximum R-Square Improvement: Step 3

      Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              3       690.55086    230.18362      38.64    <.0001
   Error             27       160.83069      5.95669
   Corrected Total   30       851.38154

                           Parameter Estimates

                 Parameter     Standard
   Variable       Estimate        Error    Type II SS    F Value    Pr > F

   Intercept     111.71806     10.23509     709.69014     119.14    <.0001
   Age            -0.25640      0.09623      42.28867       7.10    0.0129
   RunTime        -2.82538      0.35828     370.43529      62.19    <.0001
   RunPulse       -0.13091      0.05059      39.88512       6.70    0.0154

   Bounds on condition number: 1.3548, 11.597
   -------------------------------------------------------------------------
The above model is the best 3-variable model found.
                   Maximum R-Square Improvement: Step 4

      Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              4       712.45153    178.11288      33.33    <.0001
   Error             26       138.93002      5.34346
   Corrected Total   30       851.38154

                           Parameter Estimates

                 Parameter     Standard
   Variable       Estimate        Error    Type II SS    F Value    Pr > F

   Intercept      98.14789     11.78569     370.57373      69.35    <.0001
   Age            -0.19773      0.09564      22.84231       4.27    0.0488
   RunTime        -2.76758      0.34054     352.93570      66.05    <.0001
   RunPulse       -0.34811      0.11750      46.90089       8.78    0.0064
   MaxPulse        0.27051      0.13362      21.90067       4.10    0.0533

   Bounds on condition number: 8.4182, 76.851
   -------------------------------------------------------------------------
The above model is the best 4-variable model found.
                   Maximum R-Square Improvement: Step 5

      Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              5       721.97309    144.39462      27.90    <.0001
   Error             25       129.40845      5.17634
   Corrected Total   30       851.38154

                           Parameter Estimates

                 Parameter     Standard
   Variable       Estimate        Error    Type II SS    F Value    Pr > F

   Intercept     102.20428     11.97929     376.78935      72.79    <.0001
   Age            -0.21962      0.09550      27.37429       5.29    0.0301
   Weight         -0.07230      0.05331       9.52157       1.84    0.1871
   RunTime        -2.68252      0.34099     320.35968      61.89    <.0001
   RunPulse       -0.37340      0.11714      52.59624      10.16    0.0038
   MaxPulse        0.30491      0.13394      26.82640       5.18    0.0316

   Bounds on condition number: 8.7312, 104.83
   -------------------------------------------------------------------------
The above model is the best 5-variable model found.
                   Maximum R-Square Improvement: Step 6

     Variable RestPulse Entered: R-Square = 0.8487 and C(p) = 7.0000

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              6       722.54361    120.42393      22.43    <.0001
   Error             24       128.83794      5.36825
   Corrected Total   30       851.38154

                           Parameter Estimates

                 Parameter     Standard
   Variable       Estimate        Error    Type II SS    F Value    Pr > F

   Intercept     102.93448     12.40326     369.72831      68.87    <.0001
   Age            -0.22697      0.09984      27.74577       5.17    0.0322
   Weight         -0.07418      0.05459       9.91059       1.85    0.1869
   RunTime        -2.62865      0.38456     250.82210      46.72    <.0001
   RunPulse       -0.36963      0.11985      51.05806       9.51    0.0051
   RestPulse      -0.02153      0.06605       0.57051       0.11    0.7473
   MaxPulse        0.30322      0.13650      26.49142       4.93    0.0360

   Bounds on condition number: 8.7438, 137.13
   -------------------------------------------------------------------------
The above model is the best 6-variable model found.

No further improvement in R-Square is possible.
Note that for all three of these methods, RestPulse contributes least to the model. In the case of forward selection, it is not added to the model. In the case of backward elimination, it is the first variable to be removed from the model. In the case of MAXR selection, RestPulse is included only for the full model.

For the STEPWISE, BACKWARD, and FORWARD selection methods, you can control the amount of detail displayed by using the DETAILS option. For example, the following statements display only the selection summary table for the FORWARD selection method.

   proc reg data=fitness;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=forward details=summary;
   run;
Output 61.1.4. Forward Selection Summary

                             The REG Procedure
                               Model: MODEL1
                         Dependent Variable: Oxygen

                        Summary of Forward Selection

        Variable    Number   Partial     Model
 Step   Entered    Vars In  R-Square  R-Square      C(p)   F Value   Pr > F

    1   RunTime          1    0.7434    0.7434   13.6988     84.01   <.0001
    2   Age              2    0.0209    0.7642   12.3894      2.48   0.1267
    3   RunPulse         3    0.0468    0.8111    6.9596      6.70   0.0154
    4   MaxPulse         4    0.0257    0.8368    4.8800      4.10   0.0533
    5   Weight           5    0.0112    0.8480    5.1063      1.84   0.1871
Next, the RSQUARE model-selection method is used to request R2 and Cp statistics for all possible combinations of the six independent variables. The following statements produce Output 61.1.5:

   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare cp;
   title 'Physical fitness data: all models';
   run;
Output 61.1.5. All Models by the RSQUARE Method: PROC REG

                     Physical fitness data: all models

                             The REG Procedure
                               Model: MODEL2
                         Dependent Variable: Oxygen

                          R-Square Selection Method

 Number in
   Model    R-Square      C(p)   Variables in Model

       1      0.7434   13.6988   RunTime
       1      0.1595  106.3021   RestPulse
       1      0.1584  106.4769   RunPulse
       1      0.0928  116.8818   Age
       1      0.0560  122.7072   MaxPulse
       1      0.0265  127.3948   Weight
 ---------------------------------------------------------------------------
       2      0.7642   12.3894   Age RunTime
       2      0.7614   12.8372   RunTime RunPulse
       2      0.7452   15.4069   RunTime MaxPulse
       2      0.7449   15.4523   Weight RunTime
       2      0.7435   15.6746   RunTime RestPulse
       2      0.3760   73.9645   Age RunPulse
       2      0.3003   85.9742   Age RestPulse
       2      0.2894   87.6951   RunPulse MaxPulse
       2      0.2600   92.3638   Age MaxPulse
       2      0.2350   96.3209   RunPulse RestPulse
       2      0.1806  104.9523   Weight RestPulse
       2      0.1740  105.9939   RestPulse MaxPulse
       2      0.1669  107.1332   Weight RunPulse
       2      0.1506  109.7057   Age Weight
       2      0.0675  122.8881   Weight MaxPulse
 ---------------------------------------------------------------------------
       3      0.8111    6.9596   Age RunTime RunPulse
       3      0.8100    7.1350   RunTime RunPulse MaxPulse
       3      0.7817   11.6167   Age RunTime MaxPulse
       3      0.7708   13.3453   Age Weight RunTime
       3      0.7673   13.8974   Age RunTime RestPulse
       3      0.7619   14.7619   RunTime RunPulse RestPulse
       3      0.7618   14.7729   Weight RunTime RunPulse
       3      0.7462   17.2588   Weight RunTime MaxPulse
       3      0.7452   17.4060   RunTime RestPulse MaxPulse
       3      0.7451   17.4243   Weight RunTime RestPulse
       3      0.4666   61.5873   Age RunPulse RestPulse
       3      0.4223   68.6250   Age RunPulse MaxPulse
       3      0.4091   70.7102   Age Weight RunPulse
       3      0.3900   73.7424   Age RestPulse MaxPulse
       3      0.3568   79.0013   Age Weight RestPulse
       3      0.3538   79.4891   RunPulse RestPulse MaxPulse
       3      0.3208   84.7216   Weight RunPulse MaxPulse
       3      0.2902   89.5693   Age Weight MaxPulse
       3      0.2447   96.7952   Weight RunPulse RestPulse
       3      0.1882  105.7430   Weight RestPulse MaxPulse
 ---------------------------------------------------------------------------
       4      0.8368    4.8800   Age RunTime RunPulse MaxPulse
       4      0.8165    8.1035   Age Weight RunTime RunPulse
       4      0.8158    8.2056   Weight RunTime RunPulse MaxPulse
       4      0.8117    8.8683   Age RunTime RunPulse RestPulse
       4      0.8104    9.0697   RunTime RunPulse RestPulse MaxPulse
       4      0.7862   12.9039   Age Weight RunTime MaxPulse
       4      0.7834   13.3468   Age RunTime RestPulse MaxPulse
       4      0.7750   14.6788   Age Weight RunTime RestPulse
       4      0.7623   16.7058   Weight RunTime RunPulse RestPulse
       4      0.7462   19.2550   Weight RunTime RestPulse MaxPulse
       4      0.5034   57.7590   Age Weight RunPulse RestPulse
       4      0.5025   57.9092   Age RunPulse RestPulse MaxPulse
       4      0.4717   62.7830   Age Weight RunPulse MaxPulse
       4      0.4256   70.0963   Age Weight RestPulse MaxPulse
       4      0.3858   76.4100   Weight RunPulse RestPulse MaxPulse
 ---------------------------------------------------------------------------
       5      0.8480    5.1063   Age Weight RunTime RunPulse MaxPulse
       5      0.8370    6.8461   Age RunTime RunPulse RestPulse MaxPulse
       5      0.8176    9.9348   Age Weight RunTime RunPulse RestPulse
       5      0.8161   10.1685   Weight RunTime RunPulse RestPulse MaxPulse
       5      0.7887   14.5111   Age Weight RunTime RestPulse MaxPulse
       5      0.5541   51.7233   Age Weight RunPulse RestPulse MaxPulse
 ---------------------------------------------------------------------------
       6      0.8487    7.0000   Age Weight RunTime RunPulse RestPulse MaxPulse
The models in Output 61.1.5 are arranged first by the number of variables in the model and second by the magnitude of R2 for the model. Before making a final decision about which model to use, you would want to perform collinearity diagnostics. Note that, since many different models have been fit and the choice of a final model is based on R2 , the statistics are biased and the p-values for the parameter estimates are not valid.
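As a hedged illustration of the collinearity check suggested above, the TOL, VIF, and COLLIN options of the MODEL statement request tolerances, variance inflation factors, and collinearity diagnostics for a candidate model; the four-variable model used here is only an illustrative choice, not a recommendation from the original text:

   /* Sketch: collinearity diagnostics for one candidate model. */
   proc reg data=fitness;
      model Oxygen=Age RunTime RunPulse MaxPulse / tol vif collin;
   run;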
Example 61.2. Predicting Weight by Height and Age

In this example, the weights of school children are modeled as a function of their heights and ages. Modeling is performed separately for boys and girls. The example shows the use of a BY statement with PROC REG, multiple MODEL statements, and the OUTEST= and OUTSSCP= options, which create data sets. Since the BY statement is used, interactive processing is not possible in this example; no statements can appear after the first RUN statement. The following statements produce Output 61.2.1 through Output 61.2.4:

   *------------Data on Age, Weight, and Height of Children-------*
   | Age (months), height (inches), and weight (pounds) were      |
   | recorded for a group of school children.                     |
   | From Lewis and Taylor (1967).                                |
   *--------------------------------------------------------------*;
   data htwt;
      input sex $ age :3.1 height weight @@;
      datalines;
   f 143 56.3  85.0  f 155 62.3 105.0  f 153 63.3 108.0  f 161 59.0  92.0
   f 191 62.5 112.5  f 171 62.5 112.0  f 185 59.0 104.0  f 142 56.5  69.0
   f 160 62.0  94.5  f 140 53.8  68.5  f 139 61.5 104.0  f 178 61.5 103.5
   f 157 64.5 123.5  f 149 58.3  93.0  f 143 51.3  50.5  f 145 58.8  89.0
   f 191 65.3 107.0  f 150 59.5  78.5  f 147 61.3 115.0  f 180 63.3 114.0
   f 141 61.8  85.0  f 140 53.5  81.0  f 164 58.0  83.5  f 176 61.3 112.0
   f 185 63.3 101.0  f 166 61.5 103.5  f 175 60.8  93.5  f 180 59.0 112.0
   f 210 65.5 140.0  f 146 56.3  83.5  f 170 64.3  90.0  f 162 58.0  84.0
   f 149 64.3 110.5  f 139 57.5  96.0  f 186 57.8  95.0  f 197 61.5 121.0
   f 169 62.3  99.5  f 177 61.8 142.5  f 185 65.3 118.0  f 182 58.3 104.5
   f 173 62.8 102.5  f 166 59.3  89.5  f 168 61.5  95.0  f 169 62.0  98.5
   f 150 61.3  94.0  f 184 62.3 108.0  f 139 52.8  63.5  f 147 59.8  84.5
   f 144 59.5  93.5  f 177 61.3 112.0  f 178 63.5 148.5  f 197 64.8 112.0
   f 146 60.0 109.0  f 145 59.0  91.5  f 147 55.8  75.0  f 145 57.8  84.0
   f 155 61.3 107.0  f 167 62.3  92.5  f 183 64.3 109.5  f 143 55.5  84.0
   f 183 64.5 102.5  f 185 60.0 106.0  f 148 56.3  77.0  f 147 58.3 111.5
   f 154 60.0 114.0  f 156 54.5  75.0  f 144 55.8  73.5  f 154 62.8  93.5
   f 152 60.5 105.0  f 191 63.3 113.5  f 190 66.8 140.0  f 140 60.0  77.0
   f 148 60.5  84.5  f 189 64.3 113.5  f 143 58.3  77.5  f 178 66.5 117.5
   f 164 65.3  98.0  f 157 60.5 112.0  f 147 59.5 101.0  f 148 59.0  95.0
   f 177 61.3  81.0  f 171 61.5  91.0  f 172 64.8 142.0  f 190 56.8  98.5
   f 183 66.5 112.0  f 143 61.5 116.5  f 179 63.0  98.5  f 186 57.0  83.5
   f 182 65.5 133.0  f 182 62.0  91.5  f 142 56.0  72.5  f 165 61.3 106.5
   f 165 55.5  67.0  f 154 61.0 122.5  f 150 54.5  74.0  f 155 66.0 144.5
   f 163 56.5  84.0  f 141 56.0  72.5  f 147 51.5  64.0  f 210 62.0 116.0
   f 171 63.0  84.0  f 167 61.0  93.5  f 182 64.0 111.5  f 144 61.0  92.0
   f 193 59.8 115.0  f 141 61.3  85.0  f 164 63.3 108.0  f 186 63.5 108.0
   f 169 61.5  85.0  f 175 60.3  86.0  f 180 61.3 110.5  m 165 64.8  98.0
   m 157 60.5 105.0  m 144 57.3  76.5  m 150 59.5  84.0  m 150 60.8 128.0
   m 139 60.5  87.0  m 189 67.0 128.0  m 183 64.8 111.0  m 147 50.5  79.0
   m 146 57.5  90.0  m 160 60.5  84.0  m 156 61.8 112.0  m 173 61.3  93.0
   m 151 66.3 117.0  m 141 53.3  84.0  m 150 59.0  99.5  m 164 57.8  95.0
   m 153 60.0  84.0  m 206 68.3 134.0  m 250 67.5 171.5  m 176 63.8  98.5
   m 176 65.0 118.5  m 140 59.5  94.5  m 185 66.0 105.0  m 180 61.8 104.0
   m 146 57.3  83.0  m 183 66.0 105.5  m 140 56.5  84.0  m 151 58.3  86.0
   m 151 61.0  81.0  m 144 62.8  94.0  m 160 59.3  78.5  m 178 67.3 119.5
   m 193 66.3 133.0  m 162 64.5 119.0  m 164 60.5  95.0  m 186 66.0 112.0
   m 143 57.5  75.0  m 175 64.0  92.0  m 175 68.0 112.0  m 175 63.5  98.5
   m 173 69.0 112.5  m 170 63.8 112.5  m 174 66.0 108.0  m 164 63.5 108.0
   m 144 59.5  88.0  m 156 66.3 106.0  m 149 57.0  92.0  m 144 60.0 117.5
   m 147 57.0  84.0  m 188 67.3 112.0  m 169 62.0 100.0  m 172 65.0 112.0
   m 150 59.5  84.0  m 193 67.8 127.5  m 157 58.0  80.5  m 168 60.0  93.5
   m 140 58.5  86.5  m 156 58.3  92.5  m 156 61.5 108.5  m 158 65.0 121.0
   m 184 66.5 112.0  m 156 68.5 114.0  m 144 57.0  84.0  m 176 61.5  81.0
   m 168 66.5 111.5  m 149 52.5  81.0  m 142 55.0  70.0  m 188 71.0 140.0
   m 203 66.5 117.0  m 142 58.8  84.0  m 189 66.3 112.0  m 188 65.8 150.5
   m 200 71.0 147.0  m 152 59.5 105.0  m 174 69.8 119.5  m 166 62.5  84.0
   m 145 56.5  91.0  m 143 57.5 101.0  m 163 65.3 117.5  m 166 67.3 121.0
   m 182 67.0 133.0  m 173 66.0 112.0  m 155 61.8  91.5  m 162 60.0 105.0
   m 177 63.0 111.0  m 177 60.5 112.0  m 175 65.5 114.0  m 166 62.0  91.0
   m 150 59.0  98.0  m 150 61.8 118.0  m 188 63.3 115.5  m 163 66.0 112.0
   m 171 61.8 112.0  m 162 63.0  91.0  m 141 57.5  85.0  m 174 63.0 112.0
   m 142 56.0  87.5  m 148 60.5 118.0  m 140 56.8  83.5  m 160 64.0 116.0
   m 144 60.0  89.0  m 206 69.5 171.5  m 159 63.3 112.0  m 149 56.3  72.0
   m 193 72.0 150.0  m 194 65.3 134.5  m 152 60.8  97.0  m 146 55.0  71.5
   m 139 55.0  73.5  m 186 66.5 112.0  m 161 56.8  75.0  m 153 64.8 128.0
   m 196 64.5  98.0  m 164 58.0  84.0  m 159 62.8  99.0  m 178 63.8 112.0
   m 153 57.8  79.5  m 155 57.3  80.5  m 178 63.5 102.5  m 142 55.0  76.0
   m 164 66.5 112.0  m 189 65.0 114.0  m 164 61.5 140.0  m 167 62.0 107.5
   m 151 59.3  87.0
   ;
   title '----- Data on age, weight, and height of children ------';
   proc reg outest=est1 outsscp=sscp1 rsquare;
      by sex;
      eq1: model weight=height;
      eq2: model weight=height age;
   proc print data=sscp1;
      title2 'SSCP type data set';
   proc print data=est1;
      title2 'EST type data set';
   run;
Output 61.2.1. Height and Weight Data: Female Children

        ----- Data on age, weight, and height of children ------
   --------------------------------- sex=f ---------------------------------

                             The REG Procedure
                                Model: eq1
                         Dependent Variable: weight

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              1           21507        21507     141.09    <.0001
   Error            109           16615    152.42739
   Corrected Total  110           38121

   Root MSE           12.34615    R-Square    0.5642
   Dependent Mean     98.87838    Adj R-Sq    0.5602
   Coeff Var          12.48620

                           Parameter Estimates

                     Parameter     Standard
   Variable    DF     Estimate        Error    t Value    Pr > |t|

   Intercept    1   -153.12891     21.24814      -7.21      <.0001
   height       1      4.16361      0.35052      11.88      <.0001
        ----- Data on age, weight, and height of children ------
   --------------------------------- sex=f ---------------------------------

                             The REG Procedure
                                Model: eq2
                         Dependent Variable: weight

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              2           22432        11216      77.21    <.0001
   Error            108           15689    145.26700
   Corrected Total  110           38121

   Root MSE           12.05268    R-Square    0.5884
   Dependent Mean     98.87838    Adj R-Sq    0.5808
   Coeff Var          12.18939

                           Parameter Estimates

                     Parameter     Standard
   Variable    DF     Estimate        Error    t Value    Pr > |t|

   Intercept    1   -150.59698     20.76730      -7.25      <.0001
   height       1      3.60378      0.40777       8.84      <.0001
   age          1      1.90703      0.75543       2.52      0.0130
Output 61.2.2. Height and Weight Data: Male Children

        ----- Data on age, weight, and height of children ------
   --------------------------------- sex=m ---------------------------------

                             The REG Procedure
                                Model: eq1
                         Dependent Variable: weight

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              1           31126        31126     206.24    <.0001
   Error            124           18714    150.92222
   Corrected Total  125           49840

   Root MSE           12.28504    R-Square    0.6245
   Dependent Mean    103.44841    Adj R-Sq    0.6215
   Coeff Var          11.87552

                           Parameter Estimates

                     Parameter     Standard
   Variable    DF     Estimate        Error    t Value    Pr > |t|

   Intercept    1   -125.69807     15.99362      -7.86      <.0001
   height       1      3.68977      0.25693      14.36      <.0001
        ----- Data on age, weight, and height of children ------
   --------------------------------- sex=m ---------------------------------

                             The REG Procedure
                                Model: eq2
                         Dependent Variable: weight

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              2           32975        16487     120.24    <.0001
   Error            123           16866    137.11922
   Corrected Total  125           49840

   Root MSE           11.70979    R-Square    0.6616
   Dependent Mean    103.44841    Adj R-Sq    0.6561
   Coeff Var          11.31945

                           Parameter Estimates

                     Parameter     Standard
   Variable    DF     Estimate        Error    t Value    Pr > |t|

   Intercept    1   -113.71346     15.59021      -7.29      <.0001
   height       1      2.68075      0.36809       7.28      <.0001
   age          1      3.08167      0.83927       3.67      0.0004
For both females and males, the overall F statistics for both models are significant, indicating that each model explains a significant portion of the variation in the data. For females, the full model is

   weight = -150.60 + 3.60 × height + 1.91 × age

and, for males, the full model is

   weight = -113.71 + 2.68 × height + 3.08 × age

Output 61.2.3. SSCP Matrix

        ----- Data on age, weight, and height of children ------
                           SSCP type data set

 Obs  sex  _TYPE_  _NAME_      Intercept      height       weight         age

   1   f   SSCP    Intercept       111.0     6718.40     10975.50     1824.90
   2   f   SSCP    height         6718.4   407879.32    669469.85   110818.32
   3   f   SSCP    weight        10975.5   669469.85   1123360.75   182444.95
   4   f   SSCP    age            1824.9   110818.32    182444.95    30363.81
   5   f   N                       111.0      111.00       111.00      111.00
   6   m   SSCP    Intercept       126.0     7825.00     13034.50     2072.10
   7   m   SSCP    height         7825.0   488243.60    817919.60   129432.57
   8   m   SSCP    weight        13034.5   817919.60   1398238.75   217717.45
   9   m   SSCP    age            2072.1   129432.57    217717.45    34515.95
  10   m   N                       126.0      126.00       126.00      126.00
The OUTSSCP= data set is shown in Output 61.2.3. Note how the BY groups are separated. Observations with _TYPE_='N' contain the number of observations in the associated BY group. Observations with _TYPE_='SSCP' contain the rows of the uncorrected sums of squares and crossproducts matrix. The observations with _NAME_='Intercept' contain crossproducts for the intercept.

Output 61.2.4. OUTEST Data Set

        ----- Data on age, weight, and height of children ------
                            EST type data set

 Obs sex _MODEL_ _TYPE_ _DEPVAR_  _RMSE_ Intercept  height weight     age _IN_ _P_ _EDF_   _RSQ_

   1  f  eq1     PARMS  weight   12.3461  -153.129 4.16361     -1       .    1   2   109 0.56416
   2  f  eq2     PARMS  weight   12.0527  -150.597 3.60378     -1 1.90703    2   3   108 0.58845
   3  m  eq1     PARMS  weight   12.2850  -125.698 3.68977     -1       .    1   2   124 0.62451
   4  m  eq2     PARMS  weight   11.7098  -113.713 2.68075     -1 3.08167    2   3   123 0.66161
The OUTEST= data set is displayed in Output 61.2.4; again, the BY groups are separated. The _MODEL_ column contains the labels for models from the MODEL statements. If no labels are specified, the defaults MODEL1 and MODEL2 would appear as values for _MODEL_. Note that _TYPE_='PARMS' for all observations, indicating that all observations contain parameter estimates. The _DEPVAR_ column displays the dependent variable, and the _RMSE_ column gives the Root Mean Square Error for the associated model. The Intercept column gives the estimate for the intercept for the associated model, and variables with the same name
as variables in the original data set (height, age) give parameter estimates for those variables. The dependent variable, weight, is shown with a value of -1. The _IN_ column contains the number of regressors in the model not including the intercept; _P_ contains the number of parameters in the model; _EDF_ contains the error degrees of freedom; and _RSQ_ contains the R2 statistic. Finally, note that the _IN_, _P_, _EDF_, and _RSQ_ columns appear in the OUTEST= data set since the RSQUARE option is specified in the PROC REG statement.
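One common use of such an OUTEST= data set is scoring new observations. The following sketch assumes a hypothetical data set newkids containing sex, height, and age (it is not part of the original example); the PREDICT option asks PROC SCORE to compute predicted values rather than negative residuals:

   /* Hypothetical scoring sketch: newkids is an assumed data set.  */
   /* Both data sets must be sorted by sex for BY processing.       */
   proc score data=newkids score=est1 out=pred type=parms predict;
      var height age;
      by sex;
   run;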
Example 61.3. Regression with Quantitative and Qualitative Variables

At times it is desirable to have independent variables in the model that are qualitative rather than quantitative. This is easily handled in a regression framework. Regression uses qualitative variables to distinguish between populations. There are two main advantages of fitting both populations in one model. You gain the ability to test for different slopes or intercepts in the populations, and more degrees of freedom are available for the analysis.

Regression with qualitative variables is different from analysis of variance and analysis of covariance. Analysis of variance uses qualitative independent variables only. Analysis of covariance uses quantitative variables in addition to the qualitative variables in order to account for correlation in the data and reduce MSE; however, the quantitative variables are not of primary interest and merely improve the precision of the analysis.

Consider the case where Yi is the dependent variable, X1i is a quantitative variable, X2i is a qualitative variable taking on values 0 or 1, and X1iX2i is the interaction. The variable X2i is called a dummy, binary, or indicator variable. With values 0 or 1, it distinguishes between two populations. The model is of the form

   Yi = β0 + β1 X1i + β2 X2i + β3 X1i X2i + εi

for the observations i = 1, 2, ..., n. The parameters to be estimated are β0, β1, β2, and β3. The number of dummy variables used is one less than the number of qualitative levels. This yields a nonsingular X'X matrix (a coding sketch appears at the end of this introduction). See Chapter 10 of Neter, Wasserman, and Kutner (1990) for more details.

An example from Neter, Wasserman, and Kutner (1990) follows. An economist is investigating the relationship between the size of an insurance firm and the speed at which it implements new insurance innovations. He believes that the type of firm may affect this relationship and suspects that there may be some interaction between the size and type of firm. The dummy variable in the model allows the two firm types to have different intercepts. The interaction term allows the firms to have different slopes as well.

In this study, Yi is the number of months from the time the first firm implemented the innovation to the time it was implemented by the ith firm. The variable X1i is the size of the firm, measured in total assets of the firm.
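The following DATA step sketches the dummy-variable coding described above for a hypothetical qualitative variable with three levels (A, B, and C); the data set names and the variable region are illustrative placeholders, not part of this example. Two dummy variables suffice, with level A serving as the reference level:

   /* Hypothetical sketch of dummy coding for a three-level variable. */
   data coded;
      set rawdata;             /* hypothetical input data set        */
      d1 = (region = 'B');     /* 1 if region is B, 0 otherwise      */
      d2 = (region = 'C');     /* 1 if region is C, 0 otherwise      */
   run;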
The variable X2i denotes the firm type and is 0 if the firm is a mutual company and 1 if the firm is a stock company. The dummy variable allows each firm type to have a different intercept and slope.

The previous model can be broken down into a model for each firm type by plugging in the values for X2i. If X2i = 0, the model is

   Yi = β0 + β1 X1i + εi

This is the model for a mutual company. If X2i = 1, the model for a stock firm is

   Yi = (β0 + β2) + (β1 + β3) X1i + εi

This model has intercept β0 + β2 and slope β1 + β3.

The data* follow. Note that the interaction term is created in the DATA step since polynomial effects such as size*type are not allowed in the MODEL statement in the REG procedure.

   title 'Regression With Quantitative and Qualitative Variables';
   data insurance;
      input time size type @@;
      sizetype=size*type;
      datalines;
   17 151 0   26  92 0   21 175 0   30  31 0   22 104 0
    0 277 0   12 210 0   19 120 0    4 290 0   16 238 0
   28 164 1   15 272 1   11 295 1   38  68 1   31  85 1
   21 224 1   20 166 1   13 305 1   30 124 1   14 246 1
   ;
   run;
The following statements begin the analysis:

   proc reg data=insurance;
      model time = size type sizetype;
   run;
The ANOVA table is displayed in Output 61.3.1.
∗ From Neter, J. et al., Applied Linear Statistical Models, Third Edition, Copyright (c) 1990, Richard D. Irwin. Reprinted with permission of The McGraw-Hill Companies.
Output 61.3.1. ANOVA Table and Parameter Estimates

          Regression With Quantitative and Qualitative Variables

                             The REG Procedure
                               Model: MODEL1
                          Dependent Variable: time

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              3      1504.41904    501.47301      45.49    <.0001
   Error             16       176.38096     11.02381
   Corrected Total   19      1680.80000

   Root MSE            3.32021    R-Square    0.8951
   Dependent Mean     19.40000    Adj R-Sq    0.8754
   Coeff Var          17.11450

                           Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error    t Value    Pr > |t|

   Intercept    1      33.83837      2.44065      13.86      <.0001
   size         1      -0.10153      0.01305      -7.78      <.0001
   type         1       8.13125      3.65405       2.23      0.0408
   sizetype     1   -0.00041714      0.01833      -0.02      0.9821
The overall F statistic is significant (F = 45.490, p < 0.0001). The interaction term is not significant (t = -0.023, p = 0.9821). Hence, this term should be removed and the model refitted, as shown in the following statements.

   delete sizetype;
   print;
   run;
The DELETE statement removes the interaction term (sizetype) from the model. The new ANOVA table is shown in Output 61.3.2.
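As an aside, the interaction could also have been assessed without refitting interactively; the following sketch, which is not part of the original example, uses the TEST statement of PROC REG. The F test it prints is equivalent to the square of the t test for sizetype shown above:

   /* Sketch: F test of the interaction via a TEST statement. */
   proc reg data=insurance;
      model time = size type sizetype;
      test sizetype = 0;   /* tests that the interaction is zero */
   run;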
Output 61.3.2. ANOVA Table and Parameter Estimates

          Regression With Quantitative and Qualitative Variables

                             The REG Procedure
                              Model: MODEL1.1
                          Dependent Variable: time

                           Analysis of Variance

                                  Sum of         Mean
   Source            DF         Squares       Square    F Value    Pr > F

   Model              2      1504.41333    752.20667      72.50    <.0001
   Error             17       176.38667     10.37569
   Corrected Total   19      1680.80000

   Root MSE            3.22113    R-Square    0.8951
   Dependent Mean     19.40000    Adj R-Sq    0.8827
   Coeff Var          16.60377

                           Parameter Estimates

                      Parameter     Standard
   Variable    DF      Estimate        Error    t Value    Pr > |t|

   Intercept    1      33.87407      1.81386      18.68      <.0001
   size         1      -0.10174      0.00889     -11.44      <.0001
   type         1       8.05547      1.45911       5.52      <.0001
The overall F statistic is still significant (F = 72.497, p < 0.0001). The intercept and the coefficients associated with size and type are significantly different from zero (t = 18.675, p < 0.0001; t = -11.443, p < 0.0001; t = 5.521, p < 0.0001, respectively). Notice that the R2 did not change with the omission of the interaction term. The fitted model is

   time = 33.87 - 0.102 × size + 8.055 × type

The fitted model for a mutual company (X2i = 0) is

   time = 33.87 - 0.102 × size

and the fitted model for a stock company (X2i = 1) is

   time = (33.87 + 8.055) - 0.102 × size

So the two models have different intercepts but the same slope.

Now plot the residuals versus predicted values using the firm type as the plot symbol (PLOT=TYPE); this can be useful in determining if the firm types have different residual patterns. PROC REG does not support the plot y*x=type syntax for high-resolution graphics, so use PROC GPLOT to create Output 61.3.3. First, the OUTPUT statement saves the residuals and predicted values from the new model in the OUT= data set.

   output out=out r=r p=p;
   run;

   symbol1 v='0' c=blue f=swissb;
   symbol2 v='1' c=yellow f=swissb;
   axis1 label=(angle=90);
   proc gplot data=out;
      plot r*p=type / nolegend vaxis=axis1 cframe=ligr;
      plot p*size=type / nolegend vaxis=axis1 cframe=ligr;
   run;

Output 61.3.3. Plot of Residual vs. Predicted Values
The residuals show no major trend. Neither firm type by itself shows a trend either. This indicates that the model is satisfactory. A plot of the predicted values versus size appears in Output 61.3.4, where the firm type is again used as the plotting symbol.
Output 61.3.4. Plot of Predicted vs. Size
The different intercepts are very evident in this plot.
Example 61.4. Displaying Plots for Simple Linear Regression

This example introduces the basic PROC REG graphics syntax used to produce a standard plot of data from the aerobic fitness data set (Example 61.1 on page 3924). A simple linear regression of Oxygen on RunTime is performed, and a plot of Oxygen*RunTime is requested. The fitted model, the regression line, and the four default statistics are also displayed in Output 61.4.1.

   data fitness;
      set fitness;
      label Age      ='age(years)'
            Weight   ='weight(kg)'
            Oxygen   ='oxygen uptake(ml/kg/min)'
            RunTime  ='1.5 mile time(min)'
            RestPulse='rest pulse'
            RunPulse ='running pulse'
            MaxPulse ='maximum running pulse';
   proc reg data=fitness;
      model Oxygen=RunTime;
      plot Oxygen*RunTime / cframe=ligr;
   run;
Output 61.4.1. Simple Linear Regression
Example 61.5. Creating a Cp Plot

The Cp statistics for model selection are plotted against the number of parameters in the model, and the CHOCKING= and CMALLOWS= options draw useful reference lines. Note the four default statistics in the plot margin, the default model equation, and the default legend in Output 61.5.1.

   title 'Cp Plot with Reference Lines';
   symbol1 c=green;
   proc reg data=fitness;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=rsquare noprint;
      plot cp.*np. / chocking=red cmallows=blue
           vaxis=0 to 15 by 5 cframe=ligr;
   run;
Output 61.5.1. Cp Plot
Using the criteria suggested by Hocking (1976) (see the section “Dictionary of PLOT Statement Options” beginning on page 3844), Output 61.5.1 indicates that a 6-variable model is a reasonable choice for doing parameter estimation, while a 5-variable model may be suitable for doing prediction.
Example 61.6. Controlling Plot Appearance with Graphic Options

This example uses model fit summary statistics from the OUTEST= data set to create a plot for a model selection analysis. Global graphics statements and PLOT statement options are used to control the appearance of the plot.

   goptions ctitle=black htitle=3.5pct ftitle=swiss
            ctext =magenta htext =3.0pct ftext =swiss
            cback =ligr border;
   symbol1 v=circle c=red h=1 w=2;
   title1 'Selection=Rsquare';
   title2 'plot Rsquare versus the number of parameters P in '
          'each model';
   proc reg data=fitness;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
            / selection=rsquare noprint;
      plot rsq.*np. / aic bic edf gmsep jp np pc sbc sp
           haxis=2 to 7 by 1
           caxis=red cframe=white ctext=blue
           modellab='Full Model' modelht=2.4 statht=2.4;
   run;
In the GOPTIONS statement,

   BORDER   frames the entire display
   CBACK=   specifies the background color
   CTEXT=   selects the default color for the border and all text,
            including titles, footnotes, and notes
   CTITLE=  specifies the title, footnote, note, and border color
   HTEXT=   specifies the height for all text in the display
   HTITLE=  specifies the height for the first title line
   FTEXT=   selects the default font for all text, including titles,
            footnotes, notes, the model label and equation, the
            statistics, the axis labels, the tick values, and the legend
   FTITLE=  specifies the first title font

For more information on the GOPTIONS statement and other global graphics statements, refer to SAS/GRAPH Software: Reference.

Output 61.6.1. Controlling Plot Appearance and Plotting OUTEST= Statistics
In Output 61.6.1, note the following:

• The PLOT statement option CTEXT= affects all text not controlled by the CTITLE= option in the GOPTIONS statement. Hence, the GOPTIONS statement option CTEXT=MAGENTA has no effect. Therefore, the color of the title is black and all other text is blue.

• The area enclosed by the axes and the frame has a white background, while the background outside the plot area is gray.

• The MODELHT= option allows the entire model equation to fit on one line.

• The STATHT= option allows the statistics in the margin to fit in one column.

• The displayed statistics and the fitted model equation refer to the selected model.

See the “Traditional High-Resolution Graphics Plots” section beginning on page 3840 for more information.
Example 61.7. Plotting Model Diagnostic Statistics

This example illustrates how you can display diagnostics for checking the adequacy of a regression model. The following statements plot the studentized deleted residuals against the observation number for the full model. Vertical reference lines at ±tinv(.95, n - p - 1) = ±1.714 are added to identify possible outlying Oxygen values. A vertical reference line is displayed at zero by default when the RSTUDENT option is specified. The graph is shown in Output 61.7.1. Observations 15 and 17 are indicated as possible outliers.

   title 'Check for Outlying Observations';
   symbol v=dot h=1 c=green;
   proc reg data=fitness;
      model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse;
      plot rstudent.*obs. / vref=-1.714 1.714 cvref=blue lvref=1
           href=0 to 30 by 5 chref=red cframe=ligr;
   run;
Output 61.7.1. Plotting Model Diagnostic Statistics
Example 61.8. Creating PP and QQ Plots The following program creates probability-probability plots and quantile-quantile plots of the residuals (Output 61.8.1 and Output 61.8.2, respectively). An annotation data set is created to produce the (0,0)−(1,1) reference line for the PP-plot. Note that the NOSTAT option for the PP-plot suppresses the statistics that would be displayed in the margin.
   data annote1;
      length function color $8;
      retain ysys xsys '2' color 'black';
      function='move'; x=0; y=0; output;
      function='draw'; x=1; y=1; output;
   run;

   symbol1 c=blue;
   proc reg data=fitness;
      title 'PP Plot';
      model Oxygen=RunTime / noprint;
      plot npp.*r. / annotate=annote1 nostat cframe=ligr
           modellab="'Best' Two-Parameter Model:";
   run;

   title 'QQ Plot';
   plot r.*nqq. / noline mse cframe=ligr
        modellab="'Best' Two-Parameter Model:";
   run;
Output 61.8.1. Normal Probability-Probability Plot for the Residuals
Output 61.8.2. Normal Quantile-Quantile Plot for the Residuals
Example 61.9. Displaying Confidence and Prediction Intervals

This example illustrates how you can use shorthand commands to plot the dependent variable, the predicted value, and the 95% confidence or prediction intervals against a regressor. The following statements use the PRED option to create a plot with prediction intervals; the CONF option works similarly. Results are displayed in Output 61.9.1. Note that the statistics displayed by default in the margin are suppressed while three other statistics are exhibited.

   legend1 position=(bottom left inside)
           across=1 cborder=red offset=(0,0)
           shape=symbol(3,1) label=none value=(height=.8);
   title 'Prediction Intervals';
   symbol1 c=yellow v=- h=1;
   symbol2 c=red;
   symbol3 c=blue;
   symbol4 c=blue;
   proc reg data=fitness;
      model Oxygen=RunTime / noprint;
      plot Oxygen*RunTime / pred nostat mse aic bic
           caxis=red ctext=blue cframe=ligr
           legend=legend1 modellab=' ';
   run;
Output 61.9.1. Prediction Intervals
Plots can be produced with both confidence and prediction intervals using the following statement.

   plot Oxygen*RunTime / conf pred;
Example 61.10. Displaying the Ridge Trace for Acetylene Data

This example and Example 61.11 use the acetylene data in Marquardt and Snee (1975) to illustrate the RIDGEPLOT and OUTVIF options.

   data acetyl;
      input x1-x4 @@;
      x1x2 = x1 * x2;
      x1x1 = x1 * x1;
      label x1  ='reactor temperature(celsius)'
            x2  ='h2 to n-heptone ratio'
            x3  ='contact time(sec)'
            x4  ='conversion percentage'
            x1x2='temperature-ratio interaction'
            x1x1='squared temperature';
      datalines;
   1300  7.5 .012  49     1300  9   .012  50.2   1300 11 .0115 50.5
   1300 13.5 .013  48.5   1300 17   .0135 47.5   1300 23 .012  44.5
   1200  5.3 .04   28     1200  7.5 .038  31.5   1200 11 .032  34.5
   1200 13.5 .026  35     1200 17   .034  38     1200 23 .041  38.5
   1100  5.3 .084  15     1100  7.5 .098  17     1100 11 .092  20.5
   1100 17   .086  29.5
   ;
   title 'Ridge Trace of Acetylene Data';
   symbol1 v=x        c=blue;
   symbol2 v=circle   c=yellow;
   symbol3 v=square   c=cyan;
   symbol4 v=triangle c=green;
   symbol5 v=plus     c=orange;
   legend2 position=(bottom right inside)
           across=3 cborder=black offset=(0,0)
           label=(color=blue position=(top center)
                  'independent variables')
           cframe=white;
   proc reg data=acetyl outvif
            outest=b ridge=0 to 0.02 by .002;
      model x4=x1 x2 x3 x1x2 x1x1 / noprint;
      plot / ridgeplot nomodel legend=legend2 nostat
             vref=0 lvref=1 cvref=blue cframe=ligr;
   run;
The results produced by the RIDGEPLOT option are shown in Output 61.10.1. The OUTVIF option outputs the variance inflation factors to the OUTEST= data set, which is used in Example 61.11.

Output 61.10.1. Using the RIDGEPLOT Option for Ridge Regression
If you specify the experimental ODS GRAPHICS statement (see the “ODS Graphics” section on page 3922), a plot of ridge traces is produced, without the need to specify the RIDGEPLOT option in the PLOT statement. For general information about ODS graphics, see Chapter 15, “Statistical Graphics Using ODS.” The following statements provide an example:

   ods html;
   ods graphics on;
   proc reg data=acetyl outest=b ridge=0 to 0.02 by .002;
      model x4=x1 x2 x3 x1x2 x1x1 / noprint;
   run;
   ods graphics off;
   ods html close;
Output 61.10.2. Ridge Traces Produced with ODS Graphics (Experimental)
Example 61.11. Plotting Variance Inflation Factors

This example uses the REG procedure to create plots from a data set. The variance inflation factors (output by the OUTVIF option in the previous example) are plotted against the ridge regression control values k. The following statements create Output 61.11.1:
   data b (keep=_RIDGE_ x1-x3 x1x2 x1x1);
      set b;
      if _TYPE_='RIDGEVIF';
      label x1='variance inflation factor';
   run;
   legend3 position=(top right inside)
           across=3 cborder=black cframe=white
           label=(color=blue position=(top center)
                  'independent variables')
           value=('X1' 'X2' 'X3' 'X1X2' 'X1X1');
   symbol1 c=blue   /*v=circle  */;
   symbol2 c=yellow /*v=x       */;
   symbol3 c=cyan   /*v=triangle*/;
   symbol4 c=green  /*v=square  */;
   symbol5 c=orange /*v=diamond */;
   title 'Variance Inflation Factors of Acetylene Data';
   proc reg data=b;
      var _RIDGE_ x3 x1x2 x1x1;
      model x1=x2 / noprint;
      plot (x1 x2 x3 x1x2 x1x1)*_RIDGE_
           / nomodel nostat legend=legend3 overlay
             vaxis = 0 to 75 by 25 cframe=ligr
             haxis = 0 to .02 by .002;
      footnote "Note: the VIF at k=0 is 7682 for X1, "
               "6643 for X1X1, 345 for X1X2, and 320 for X2";
   run;
The GPLOT procedure can create the same plot with the following statements. The resulting display is not shown in this report.

   axis1 label=(a=90 r=0 'variance inflation factor')
         order=(0 to 75 by 25) minor=none;
   proc gplot data=b;
      plot (x1 x2 x3 x1x2 x1x1)*_RIDGE_
           / legend=legend3 overlay frame
             vaxis = axis1
             haxis = 0 to .02 by .002 hminor=0;
      footnote "Note: the VIF at k=0 is 7682 for X1, "
               "6643 for X1X1, 345 for X1X2, and 320 for X2";
   run;
Output 61.11.1. Using PROC REG to Plot the VIFs
Example 61.12. ODS Graphics

This example highlights the use of ODS for creating statistical graphs with the REG procedure. The USPopulation example is revisited, showing how these graphics can be used to enrich the analysis. Note that the ODS graphics available with PROC REG can be obtained in addition to the graphics you can request with the PLOT statement. To request the ODS plots you need to specify the experimental ODS GRAPHICS statement. For general information about ODS graphics, see Chapter 15, “Statistical Graphics Using ODS.” For specific information about the graphics available in the REG procedure, see the “ODS Graphics” section on page 3922. The following statements produce the default plots:

   ods html;
   ods graphics on;
   proc reg data=USPopulation;
      Linear:    model Population=Year;
      Quadratic: model Population=Year YearSq;
   run; quit;
   ods graphics off;
   ods html close;
Output 61.12.1. Fit Diagnostics For the Model Linear in Year (Experimental)
When the experimental ODS graphics are in effect, the fit diagnostic panel Output 61.12.1 is produced by default. These diagnostic plots suggest that while the linear model captures the increasing trend in the data, the model could be significantly improved by adding a term which is quadratic in the variable year:

• The plots of residual and studentized residual versus predicted value show a clear quadratic pattern.

• The plot of studentized residual versus leverage seems to indicate that there are two outlying data points. However, the plot of Cook's D distance versus observation number reveals that these two points are just the data points for the endpoint years 1790 and 2000. These points show up as apparent outliers because the departure of the linear model from the underlying quadratic behavior in the data shows up most strongly at these endpoints.

• The normal quantile plot of the residuals and the residual histogram are not consistent with the assumption of Gaussian errors. This occurs as the residuals themselves still contain the quadratic behavior that is not captured by the linear model.

• The plot of the dependent variable versus the predicted value exhibits a quadratic form around the 45 degree line which represents a perfect fit.

• The “Residual-Fit” (or RF) plot consisting of side-by-side quantile plots of the centered fit and the residuals shows that the spread in the residuals is no greater than the spread in the centered fit. For inappropriate models, the spread of the residuals in such a plot is often greater than the spread of the centered fit. In this case, the RF plot shows that the linear model does indeed capture the increasing trend in the data, and hence accounts for much of the variation in the response.
Output 61.12.2. Residual By Year For the Model Linear in Year (Experimental)
The plot of residual versus the regressor shown in Output 61.12.2 also indicates the need for a quadratic term in the model. There is a strong quadratic trend still remaining in the residuals. Note that for models with multiple regressors, the plots of residual versus each of the regressors are displayed in panels with up to six plots per panel.

Output 61.12.3. Fit Plot with Confidence Band and Prediction Limits (Experimental)
Output 61.12.3 shows a scatterplot of the data overlaid with the regression line, 95% confidence band, and prediction limits. Note that this plot also indicates that the model fails to capture the quadratic nature of the data. This plot is produced for models containing a single regressor. You can use the ALPHA= option in the MODEL statement to change the significance level of the confidence band and prediction limits.
Output 61.12.4. Fit Diagnostics For the Model Quadratic in Year (Experimental)
By contrast, Output 61.12.4 shows the fit diagnostics panel for the model that includes a quadratic term for year. These diagnostics indicate that this model is significantly more successful than the corresponding linear model:

• The plots of residuals and studentized residuals versus predicted values exhibit no obvious patterns.

• The points on the plot of the dependent variable versus the predicted values lie along a 45 degree line, indicating that the model successfully predicts the behavior of the dependent variable.

• The plot of studentized residual versus leverage shows that the years 1790 and 2000 are leverage points, with 2000 showing up as an outlier. This is confirmed by the plot of Cook's D distance versus observation number. This suggests that while the quadratic model fits the current data well, the model may not be quite so successful over a wider range of data. You might want to investigate whether the population trend over the last couple of decades is growing slightly faster than quadratically.

If you want to obtain the plots in the Diagnostics Panel as individual plots, you can do so by specifying the PLOTS(UNPACKPANELS) option in the PROC REG statement. The following statements provide an example:

   ods html;
   ods graphics on;
   proc reg data=USPopulation plots(unpackpanels);
      Quadratic: model Population=Year YearSq;
   run; quit;
   ods graphics off;
   ods html close;
Output 61.12.5. Residual Histogram (Experimental)
The residual histogram is shown in Output 61.12.5.
References

Akaike, H. (1969), "Fitting Autoregressive Models for Prediction," Annals of the Institute of Statistical Mathematics, 21, 243–247.

Allen, D.M. (1971), "Mean Square Error of Prediction as a Criterion for Selecting Variables," Technometrics, 13, 469–475.

Allen, D.M. and Cady, F.B. (1982), Analyzing Experimental Data by Regression, Belmont, CA: Lifetime Learning Publications.

Amemiya, T. (1976), "Selection of Regressors," Technical Report No. 225, Stanford, CA: Stanford University.

Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnostics, New York: John Wiley & Sons, Inc.

Berk, K.N. (1977), "Tolerance and Condition in Regression Computations," Journal of the American Statistical Association, 72, 863–866.

Bock, R.D. (1975), Multivariate Statistical Methods in Behavioral Research, New York: McGraw-Hill Book Co.

Box, G.E.P. (1966), "The Use and Abuse of Regression," Technometrics, 8, 625–629.

Cleveland, W.S. (1993), Visualizing Data, Summit, NJ: Hobart Press.

Cook, R.D. (1977), "Detection of Influential Observations in Linear Regression," Technometrics, 19, 15–18.

Cook, R.D. (1979), "Influential Observations in Linear Regression," Journal of the American Statistical Association, 74, 169–174.

Daniel, C. and Wood, F. (1980), Fitting Equations to Data, Revised Edition, New York: John Wiley & Sons, Inc.

Darlington, R.B. (1968), "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69, 161–182.

Draper, N. and Smith, H. (1981), Applied Regression Analysis, Second Edition, New York: John Wiley & Sons, Inc.

Durbin, J. and Watson, G.S. (1951), "Testing for Serial Correlation in Least Squares Regression," Biometrika, 37, 409–428.

Freund, R.J. and Littell, R.C. (1986), SAS System for Regression, 1986 Edition, Cary, NC: SAS Institute Inc.

Furnival, G.M. and Wilson, R.W. (1974), "Regression by Leaps and Bounds," Technometrics, 16, 499–511.

Gauss, K.F. (1809), Werke, 4, 1–93.

Goodnight, J.H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149–158. (Also available as The Sweep Operator: Its Importance in Statistical Computing, SAS Technical Report R-106.)

Grunfeld, Y. (1958), "The Determinants of Corporate Investment," unpublished thesis, Chicago; discussed in Boot, J.C.G. (1960), "Investment Demand: An Empirical Contribution to the Aggregation Problem," International Economic Review, 1, 3–30.

Hocking, R.R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1–50.

Johnston, J. (1972), Econometric Methods, New York: McGraw-Hill Book Co.

Judge, G.G., Griffiths, W.E., Hill, R.C., and Lee, T. (1980), The Theory and Practice of Econometrics, New York: John Wiley & Sons, Inc.

Judge, G.G., Griffiths, W.E., Hill, R.C., Lutkepohl, H., and Lee, T.C. (1985), The Theory and Practice of Econometrics, Second Edition, New York: John Wiley & Sons, Inc.

Kennedy, W.J. and Gentle, J.E. (1980), Statistical Computing, New York: Marcel Dekker, Inc.

LaMotte, L.R. (1994), "A Note on the Role of Independence in t Statistics Constructed From Linear Statistics in Regression Models," The American Statistician, 48, 238–240.

Lewis, T. and Taylor, L.R. (1967), Introduction to Experimental Ecology, New York: Academic Press, Inc.

Lord, F.M. (1950), "Efficiency of Prediction when a Regression Equation from One Sample is Used in a New Sample," Research Bulletin No. 50-40, Princeton, NJ: Educational Testing Service.

Mallows, C.L. (1967), "Choosing a Subset Regression," unpublished report, Bell Telephone Laboratories.

Mallows, C.L. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.

Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis, London: Academic Press, Inc.

Markov, A.A. (1900), Wahrscheinlichkeitsrechnung, Leipzig: Teubner.

Marquardt, D.W. and Snee, R.D. (1975), "Ridge Regression in Practice," The American Statistician, 29 (1), 3–20.

Morrison, D.F. (1976), Multivariate Statistical Methods, Second Edition, New York: McGraw-Hill, Inc.

Mosteller, F. and Tukey, J.W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley Publishing Co., Inc.

Neter, J., Wasserman, W., and Kutner, M.H. (1990), Applied Linear Statistical Models, Third Edition, Homewood, IL: Richard D. Irwin, Inc.

Nicholson, G.E., Jr. (1948), "The Application of a Regression Equation to a New Sample," unpublished Ph.D. dissertation, University of North Carolina at Chapel Hill.

Pillai, K.C.S. (1960), Statistical Table for Tests of Multivariate Hypotheses, Manila: The Statistical Center, University of the Philippines.

Pindyck, R.S. and Rubinfeld, D.L. (1981), Econometric Models and Econometric Forecasts, Second Edition, New York: McGraw-Hill Book Co.

Pringle, R.M. and Raynor, A.A. (1971), Generalized Inverse Matrices with Applications to Statistics, New York: Hafner Publishing Company.

Rao, C.R. (1973), Linear Statistical Inference and Its Applications, Second Edition, New York: John Wiley & Sons, Inc.

Rawlings, J.O. (1988), Applied Regression Analysis: A Research Tool, Belmont, CA: Wadsworth, Inc.

Rothman, D. (1968), Letter to the editor, Technometrics, 10, 432.

Sall, J.P. (1981), SAS Regression Applications, Revised Edition, SAS Technical Report A-102, Cary, NC: SAS Institute Inc.

Sawa, T. (1978), "Information Criteria for Discriminating Among Alternative Regression Models," Econometrica, 46, 1273–1282.

Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of Statistics, 6, 461–464.

Stein, C. (1960), "Multiple Regression," in Contributions to Probability and Statistics, eds. I. Olkin et al., Stanford, CA: Stanford University Press.

Timm, N.H. (1975), Multivariate Analysis with Applications in Education and Psychology, Monterey, CA: Brooks-Cole Publishing Co.

Weisberg, S. (1980), Applied Linear Regression, New York: John Wiley & Sons, Inc.

White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817–838.
Chapter 62
The ROBUSTREG Procedure

Chapter Contents

OVERVIEW . . . 3971

GETTING STARTED . . . 3972
   M Estimation . . . 3972
   LTS Estimation . . . 3979

SYNTAX . . . 3982
   PROC ROBUSTREG Statement . . . 3983
   BY Statement . . . 3988
   CLASS Statement . . . 3989
   ID Statement . . . 3989
   MODEL Statement . . . 3989
   OUTPUT Statement . . . 3990
   PERFORMANCE Statement . . . 3991
   TEST Statement . . . 3992
   WEIGHT Statement . . . 3992

DETAILS . . . 3992
   M Estimation . . . 3993
   High Breakdown Value Estimation . . . 3999
   MM Estimation . . . 4006
   Robust Multivariate Location and Scale Estimates . . . 4009
   Leverage Point and Outlier Detection . . . 4010
   INEST= Data Set . . . 4011
   OUTEST= Data Set . . . 4011
   Computational Resources . . . 4012
   ODS Table Names . . . 4012
   ODS Graphics (Experimental) . . . 4013

EXAMPLES . . . 4016
   Example 62.1. Comparison of Robust Estimates . . . 4016
   Example 62.2. Robust ANOVA . . . 4020
   Example 62.3. Growth Study of De Long and Summers . . . 4024

REFERENCES . . . 4029
Chapter 62
The ROBUSTREG Procedure

Overview

The main purpose of robust regression is to detect outliers and provide resistant (stable) results in the presence of outliers. In order to achieve this stability, robust regression limits the influence of outliers. Historically, three classes of problems have been addressed with robust regression techniques:

• problems with outliers in the y-direction (response direction)
• problems with multivariate outliers in the x-space (i.e., outliers in the covariate space, which are also referred to as leverage points)
• problems with outliers in both the y-direction and the x-space

Many methods have been developed in response to these problems. However, in statistical applications of outlier detection and robust regression, the methods most commonly used today are Huber M estimation, high breakdown value estimation, and combinations of these two methods. The new ROBUSTREG procedure in this version provides four such methods (a usage sketch follows the list below): M estimation, LTS estimation, S estimation, and MM estimation.

1. M estimation was introduced by Huber (1973), and it is the simplest approach both computationally and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in analyzing data for which it can be assumed that the contamination is mainly in the response direction.

2. Least Trimmed Squares (LTS) estimation is a high breakdown value method introduced by Rousseeuw (1984). The breakdown value is a measure of the proportion of contamination that an estimation method can withstand and still maintain its robustness. The performance of this method was improved by the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000).

3. S estimation is a high breakdown value method introduced by Rousseeuw and Yohai (1984). With the same breakdown value, it has a higher statistical efficiency than LTS estimation.

4. MM estimation, introduced by Yohai (1987), combines high breakdown value estimation and M estimation. It has both the high breakdown property and a higher statistical efficiency than S estimation.

Experimental graphics are now available with the ROBUSTREG procedure. For more information, see the section "ODS Graphics" on page 4013.
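As a minimal usage sketch, each of the four methods is requested with the METHOD= option in the PROC ROBUSTREG statement; the data set mydata and the variables y, x1, and x2 below are illustrative placeholders:

   /* Placeholders: mydata, y, x1, x2 are illustrative names.     */
   proc robustreg data=mydata method=mm;  /* also: method=m       */
      model y = x1 x2;                    /* (default), method=lts,
                                             or method=s          */
   run;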
Getting Started

The following examples demonstrate how you can use the ROBUSTREG procedure to fit a linear regression model and conduct outlier and leverage point diagnostics.
M Estimation

This example shows how you can use the ROBUSTREG procedure to do M estimation, which is a commonly used method for outlier detection and robust regression when contamination is mainly in the response direction.

   data stack;
      input x1 x2 x3 y;
      datalines;
   80  27  89  42
   80  27  88  37
   75  25  90  37
   62  24  87  28
   62  22  87  18
   62  23  87  18
   62  24  93  19
   62  24  93  20
   58  23  87  15
   58  18  80  14
   58  18  89  14
   58  17  88  13
   58  18  82  11
   58  19  93  12
   50  18  89   8
   50  18  86   7
   50  19  72   8
   50  19  79   8
   50  20  80   9
   56  20  82  15
   70  20  91  15
   ;
The data set stack is the well-known stackloss data set presented by Brownlee (1965). The data describe the operation of a plant for the oxidation of ammonia to nitric acid and consist of 21 four-dimensional observations. The explanatory variables for the response stackloss (y) are the rate of operation (x1), the cooling water inlet temperature (x2), and the acid concentration (x3). The following ROBUSTREG statements analyze the data: proc robustreg data=stack; model y = x1 x2 x3 / diagnostics leverage; id x1; test x3; run;
M Estimation
By default, the procedure does M estimation with the bisquare weight function, and it uses the median method for estimating the scale parameter. The MODEL statement specifies the covariate effects. The DIAGNOSTICS option requests a table for outlier diagnostics, and the LEVERAGE option adds leverage point diagnostic results to this table for continuous covariate effects. The ID statement specifies that variable x1 is used to identify each observation in this table. If the ID statement is omitted, the observation number is used to identify the observations. The TEST statement requests a test of significance for the covariate effects specified. The results of this analysis are displayed in the following figures.
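The defaults just described can also be stated explicitly. The following sketch spells out the bisquare weight function and the median scale estimate through options of METHOD=M; it fits the same model as the statements above:

   /* Sketch: the defaults written out explicitly. */
   proc robustreg data=stack method=m(wf=bisquare scale=med);
      model y = x1 x2 x3 / diagnostics leverage;
      id x1;
      test x3;
   run;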
                          The ROBUSTREG Procedure

                             Model Information

              Data Set                        WORK.STACK
              Dependent Variable              y
              Number of Covariates            3
              Number of Observations          21
              Method                          M Estimation

                             Summary Statistics

                                                       Standard
 Variable        Q1    Median        Q3       Mean    Deviation       MAD

 x1         53.0000   58.0000   62.0000    60.4286       9.1683    5.9304
 x2         18.0000   20.0000   24.0000    21.0952       3.1608    2.9652
 x3         82.0000   87.0000   89.5000    86.2857       5.3586    4.4478
 y          10.0000   15.0000   19.5000    17.5238      10.1716    5.9304

Figure 62.1. Model Fitting Information and Summary Statistics
Figure 62.1 displays the model fitting information and summary statistics for the response variable and the continuous covariates. The columns labeled Q1, Median, and Q3 provide the lower quantile, median, and upper quantile. The column labeled MAD provides a robust estimate of the univariate scale, which is computed as the corrected median absolute deviation (MAD). The columns labeled Mean and Standard Deviation provide the usual mean and standard deviation. A large difference between the standard deviation and the MAD for a variable indicates some big jumps in that variable. In the stackloss data, the stackloss (response variable y) has the biggest difference between the standard deviation and the MAD.
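As an aside, not part of the original example, a univariate check of this kind can also be done outside PROC ROBUSTREG: the ROBUSTSCALE option of PROC UNIVARIATE prints the MAD alongside other robust scale estimators for comparison with the standard deviation.

   /* Aside: robust scale estimates for the response variable. */
   proc univariate data=stack robustscale;
      var y;
   run;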
The ROBUSTREG Procedure Parameter Estimates
Parameter DF Estimate Intercept x1 x2 x3 Scale
1 -42.2854 1 0.9276 1 0.6507 1 -0.1123 1 2.2819
Standard Error
95% Confidence Limits
9.5045 -60.9138 -23.6569 0.1077 0.7164 1.1387 0.2940 0.0744 1.2270 0.1249 -0.3571 0.1324
Figure 62.2. Model Parameter Estimates
ChiSquare Pr > ChiSq 19.79 74.11 4.90 0.81
<.0001 <.0001 0.0269 0.3683
3973
3974
Chapter 62. The ROBUSTREG Procedure
Figure 62.2 displays the table of robust parameter estimates, standard errors, and confidence limits. The row labeled Scale provides a point estimate of the scale parameter in the linear regression model, which is obtained by the median method. See the section “M Estimation” on page 3993 for more information about scale estimation methods. For the stackloss data, M estimation yields the fitted linear model: yˆ = −42.2845 + 0.9276x1 + 0.6507x2 − 0.1123x3
The ROBUSTREG Procedure Diagnostics
Obs 1 2 3 4 21
x1 80.000000 80.000000 75.000000 62.000000 70.000000
Mahalanobis Distance
Robust MCD Distance
2.2536 2.3247 1.5937 1.2719 2.1768
5.5284 5.6374 4.1972 1.5887 3.6573
Leverage * * *
Standardized Robust Residual 1.0995 -1.1409 1.5604 3.0381 -4.5733
*
Outlier
* *
Diagnostics Summary Observation Type Outlier Leverage
Proportion
Cutoff
0.0952 0.1905
3.0000 3.0575
Figure 62.3. Diagnostics
Figure 62.3 displays outlier and leverage point diagnostics. Standardized robust residuals are computed based on the estimated parameters. Both the Mahalanobis distance and the robust MCD distance are displayed. Outliers and leverage points, identified with asterisks, are defined by the standardized robust residuals and robust MCD distances which exceed the corresponding cutoff values displayed in the diagnostics profile. Observations 4 and 21 are outliers because their standardized robust residuals exceed the cutoff value in absolute value. The procedure detects 4 observations with high leverage, which is contributed mainly by x1, especially for the first three observations. This can be verified by the definition of the robust MCD distance in the section “Robust Multivariate Location and Scale Estimates” on page 4009. Leverage points (points with high leverage) with smaller standardized robust residuals than the cutoff value in absolute value are called good leverage points; otherwise, called bad leverage points. Observations 21 is a bad leverage point. Two particularly useful plots for revealing outliers and leverage points are a scatter plot of the standardized robust residuals against the robust distances (RDPLOT) and a scatter plot of the robust distances against the classical Mahalanobis distances (DDPLOT). For the stackloss data, the following statements produce the RDPLOT in Figure 62.4 and the DDPLOT in Figure 62.5. The histogram and the normal quantilequantile plots for the standardized robust residuals are also created with the
M Estimation
RESHISTOGRAM and RESQQPLOT options in the PROC statement. See Figure 62.6 and Figure 62.7.
Figure 62.4. RDPLOT for Stackloss Data (Experimental)
Figure 62.5. DDPLOT for Stackloss Data (Experimental)
3975
3976
Chapter 62. The ROBUSTREG Procedure
Figure 62.6. Histogram (Experimental)
Figure 62.7. Q-Q PLOT (Experimental)
M Estimation
ods html; ods graphics on; proc robustreg data=stack plots=(rdplot ddplot reshistogram resqqplot); model y = x1 x2 x3; run; ods graphics off; ods html close;
These plots are helpful in identifying outliers, good, and bad high leverage points. These graphical displays are requested by specifying the experimental ODS GRAPHICS statement and the experimental PLOT | PLOTS= option in the PROC statement. For general information about ODS graphics, see Chapter 15, “Statistical Graphics Using ODS.” For specific information about the graphics available in the ROBUSTREG procedure, see the section “ODS Graphics” on page 4013. The ROBUSTREG Procedure Goodness-of-Fit Statistic R-Square AICR BICR Deviance
Value 0.6659 29.5231 36.3361 125.7905
Figure 62.8. Goodness-of-Fit
Figure 62.8 displays robust versions of goodness-of-fit statistics for the model. You can use the robust information criteria, AICR and BICR, for model selection and comparison. For both AICR and BICR, the lower the value the more describable the model. The ROBUSTREG Procedure Robust Linear Tests Test
Test Rho Rn2
Test Statistic 0.9378 0.8092
Lambda DF 0.7977
1 1
ChiSquare Pr > ChiSq 1.18 0.81
0.2782 0.3683
Figure 62.9. Test of Significance
Figure 62.9 displays the test results requested by the TEST statement. The ROBUSTREG procedure conducts two robust linear tests, the ρ-test and the Rn2 -test. See the section “Linear Tests” on page 3998 for information on how the procedure computes test statistics and the correction factor lambda. You can conclude that the effect x3 is not significant.
3977
3978
Chapter 62. The ROBUSTREG Procedure
For the bisquare weight function, the default constant c is 4.685 such that the asymptotic efficiency of the M estimates is 95% with the Gaussian distribution. See the section “M Estimation” on page 3993 for details. The smaller the constant c, the lower the asymptotic efficiency but the sharper the M estimate as an outlier detector. For the stackloss data set, you could consider using a sharper outlier detector. In the following invocation of the ROBUSTREG procedure, a smaller constant, e.g. c = 3.5, is used. proc robustreg method=m(wf=bisquare(c=3.5)) data=stack; model y = x1 x2 x3 / diagnostics leverage; id x1; test x3; run;
The ROBUSTREG Procedure Parameter Estimates
Parameter DF Estimate Intercept x1 x2 x3 Scale
Standard Error
1 -37.1076 1 0.8191 1 0.5173 1 -0.0728 1 1.4265
95% Confidence Limits
5.4731 -47.8346 -26.3805 0.0620 0.6975 0.9407 0.1693 0.1855 0.8492 0.0719 -0.2138 0.0681
ChiSquare Pr > ChiSq 45.97 174.28 9.33 1.03
<.0001 <.0001 0.0022 0.3111
Figure 62.10. Model Parameter Estimates
Figure 62.10 displays the table of robust parameter estimates, standard errors, and confidence limits with the constant c = 3.5. The refitted linear model is: yˆ = −37.1076 + 0.8191x1 + 0.5173x2 − 0.0728x3
The ROBUSTREG Procedure Diagnostics
Obs 1 2 3 4 21
x1 80.000000 80.000000 75.000000 62.000000 70.000000
Mahalanobis Distance
Robust MCD Distance
2.2536 2.3247 1.5937 1.2719 2.1768
5.5284 5.6374 4.1972 1.5887 3.6573
Leverage * * *
4.2719 0.7158 4.4142 5.7792 -6.2727
*
Diagnostics Summary Observation Type Outlier Leverage
Figure 62.11. Diagnostics
Standardized Robust Residual
Proportion
Cutoff
0.1905 0.1905
3.0000 3.0575
Outlier * * * *
LTS Estimation
Figure 62.11 displays outlier and leverage point diagnostics with the constant c = 3.5. Besides observations 4 and 21, observations 1 and 3 are also detected as outliers.
LTS Estimation If the data are contaminated in the x-space, M estimation does not do well. The following example shows how you can use LTS estimation to deal with this situation. data hbk; input index$ datalines; 1 10.1 19.6 2 9.5 20.5 3 10.7 20.2 4 9.9 21.5 5 10.3 21.1 6 10.8 20.4 7 10.5 20.9 8 9.9 19.6 9 9.7 20.7 10 9.3 19.7 11 11.0 24.0 12 12.0 23.0 13 12.0 26.0 14 11.0 34.0 15 3.4 2.9 16 3.1 2.2 17 0.0 1.6 18 2.3 1.6 19 0.8 2.9 20 3.1 3.4 21 2.6 2.2 22 0.4 3.2 23 2.0 2.3 24 1.3 2.3 25 1.0 0.0 26 0.9 3.3 27 3.3 2.5 28 1.8 0.8 29 1.2 0.9 30 1.2 0.7 31 3.1 1.4 32 0.5 2.4 33 1.5 3.1 34 0.4 0.0 35 3.1 2.4 36 1.1 2.2 37 0.1 3.0 38 1.5 1.2 ;
x1 x2 x3 y @@; 28.3 28.9 31.0 31.7 31.1 29.2 29.1 28.8 31.0 30.3 35.0 37.0 34.0 34.0 2.1 0.3 0.2 2.0 1.6 2.2 1.9 1.9 0.8 0.5 0.4 2.5 2.9 2.0 0.8 3.4 1.0 0.3 1.5 0.7 3.0 2.7 2.6 0.2
9.7 10.1 10.3 9.5 10.0 10.0 10.8 10.3 9.6 9.9 -0.2 -0.4 0.7 0.1 -0.4 0.6 -0.2 0.0 0.1 0.4 0.9 0.3 -0.8 0.7 -0.3 -0.8 -0.7 0.3 0.3 -0.3 0.0 -0.4 -0.6 -0.7 0.3 -1.0 -0.6 0.9
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
2.1 0.5 3.4 0.3 0.1 1.8 1.9 1.8 3.0 3.1 3.1 2.1 2.3 3.3 0.3 1.1 0.5 1.8 1.8 2.4 1.6 0.3 0.4 0.9 1.1 2.8 2.0 0.2 1.6 0.1 2.0 1.0 2.2 0.6 0.3 0.0 0.3
0.0 2.0 1.6 1.0 3.3 0.5 0.1 0.5 0.1 1.6 2.5 2.8 1.5 0.6 0.4 3.0 2.4 3.2 0.7 3.4 2.1 1.5 3.4 0.1 2.7 3.0 0.7 1.8 2.0 0.0 0.6 2.2 2.5 2.0 1.7 2.2 0.4
1.2 1.2 2.9 2.7 0.9 3.2 0.6 3.0 0.8 3.0 1.9 2.9 0.4 1.2 3.3 0.3 0.9 0.9 0.7 1.5 3.0 3.3 3.0 0.3 0.2 2.9 2.7 0.8 1.2 1.1 0.3 2.9 2.3 1.5 2.2 1.6 2.6
-0.7 -0.5 -0.1 -0.7 0.6 -0.7 -0.5 -0.4 -0.9 0.1 0.9 -0.4 0.7 -0.5 0.7 0.7 0.0 0.1 0.7 -0.1 -0.3 -0.9 -0.3 0.6 -0.3 -0.5 0.6 -0.9 -0.7 0.6 0.2 0.7 0.2 -0.2 0.4 -0.9 0.2
The data set hbk is an artificial data set generated by Hawkins, Bradu, and Kass (1984). Both ordinary least squares (OLS) estimation and M estimation (not shown
3979
3980
Chapter 62. The ROBUSTREG Procedure
here) suggest that observations 11 to 14 are serious outliers. However, these four observations were generated from the underlying model, whereas observations 1 to 10 were contaminated. The reason that OLS estimation and M estimation do not pick up the contaminated observations is that they cannot distinguish good leverage points (observations 11 to 14) from bad leverage points (observations 1 to 10). In such cases, the LTS method identifies the true outliers. The following statements invoke the ROBUSTREG procedure with the LTS estimation method. proc robustreg data=hbk fwls method=lts; model y = x1 x2 x3 / diagnostics leverage; id index; run;
The ROBUSTREG Procedure Model Information Data Set Dependent Variable Number of Covariates Number of Observations Method
WORK.HBK y 3 75 LTS Estimation
Summary Statistics
Variable x1 x2 x3 y
Q1
Median
Q3
Mean
Standard Deviation
MAD
0.8000 1.0000 0.9000 -0.5000
1.8000 2.2000 2.1000 0.1000
3.1000 3.3000 3.0000 0.7000
3.2067 5.5973 7.2307 1.2787
3.6526 8.2391 11.7403 3.4928
1.9274 1.6309 1.7791 0.8896
Figure 62.12. Model Fitting Information and Summary Statistics
Figure 62.12 displays the model fitting information and summary statistics for the response variable and independent covariates.
The ROBUSTREG Procedure LTS Profile Total Number of Observations Number of Squares Minimized Number of Coefficients Highest Possible Breakdown Value
75 57 4 0.2533
Figure 62.13. LTS Profile
Figure 62.13 displays information about the LTS fit, which includes the breakdown value of the LTS estimate. Roughly speaking, the breakdown value is a measure of the proportion of contamination that an estimation method can withstand and still maintain its robustness. In this example the LTS estimate minimizes the sum of 57
LTS Estimation
smallest squares of residuals. It can still pick up the right model if the remaining 18 observations are contaminated. This corresponds to the breakdown value around 0.25, which is set as the default. The ROBUSTREG Procedure LTS Parameter Estimates Parameter Intercept x1 x2 x3 Scale (sLTS) Scale (Wscale)
DF
Estimate
1 1 1 1 0 0
-0.3431 0.0901 0.0703 -0.0731 0.7451 0.5749
Figure 62.14. LTS Parameter Estimates
Figure 62.14 displays parameter estimates for covariates and scale. Two robust estimates of the scale parameter are displayed. See the section “Final Weighted Scale Estimator” on page 4002 for how these estimates are computed. The weighted scale estimate (Wscale) is a more efficient estimate of the scale parameter.
The ROBUSTREG Procedure Diagnostics
Obs
index
1 3 5 7 9 11 13 15 17 19 21 23 25 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Mahalanobis Distance
Robust MCD Distance
1.9168 1.8558 2.3137 2.2297 2.1001 2.1462 2.0105 1.9193 2.2212 2.3335 2.4465 3.1083 2.6624 6.3816
29.4424 30.2054 31.8909 32.8621 32.2778 30.5892 30.6807 29.7994 31.9537 30.9429 36.6384 37.9552 36.9175 41.0914
Leverage
Standardized Robust Residual
* * * * * * * * * * * * * *
17.0868 17.8428 18.3063 16.9702 17.7498 17.5155 18.8801 18.2253 17.1843 17.8021 0.0406 -0.0874 1.0776 -0.7875
Outlier * * * * * * * * * *
Diagnostics Summary Observation Type Outlier Leverage
Proportion
Cutoff
0.1333 0.1867
3.0000 3.0575
Figure 62.15. Diagnostics
Figure 62.15 displays outlier and leverage point diagnostics. The ID variable index is used to identify the observations. The first ten observations are identified as outliers and observations 11 to 14 are identified as good leverage points.
3981
3982
Chapter 62. The ROBUSTREG Procedure
The ROBUSTREG Procedure Parameter Estimates for Final Weighted Least Squares Fit
Parameter Intercept x1 x2 x3 Scale
DF Estimate 1 1 1 1 0
-0.1805 0.0814 0.0399 -0.0517 0.5572
Standard Error 0.1044 0.0667 0.0405 0.0354
95% Confidence Limits -0.3852 -0.0493 -0.0394 -0.1210
ChiSquare Pr > ChiSq
0.0242 0.2120 0.1192 0.0177
2.99 1.49 0.97 2.13
0.0840 0.2222 0.3242 0.1441
Figure 62.16. Final Weighted LS Estimates
Figure 62.16 displays the final weighted least squares estimates. These estimates are least squares estimates computed after deleting the detected outliers.
Syntax PROC ROBUSTREG < options > ; BY variables ; CLASS variables ; ID variables ; MODEL response = <effects> < / options > ; OUTPUT < OUT= SAS-data-set > < options > ; PERFORMANCE < options > ; TEST ’label’ effects ; WEIGHT variable ; The PROC ROBUSTREG statement invokes the procedure. The METHOD= option in the PROC ROBUSTREG statement selects one of the four estimation methods, M, LTS, S, and MM. By default, Huber M estimation is used. The MODEL statement is required and specifies the variables used in the regression. Main effects and interaction terms can be specified in the MODEL statement, as in the GLM procedure. The CLASS statement specifies which explanatory variables are treated as categorical. These variables are allowed in the MODEL statement only for M estimation, and not for other estimation methods. The ID statement names variables to identify observations in the outlier diagnostics tables. The WEIGHT statement identifies a variable in the input data set whose values are used to weight the observations. The OUTPUT statement creates an output data set containing final weights, predicted values, and residuals. The TEST statement requests robust linear tests for the model parameters. The PERFORMANCE statement tunes the performance of the procedure by using single or multiple processors available on the hardware. In one invocation of PROC ROBUSTREG, multiple OUTPUT and TEST statements are allowed.
PROC ROBUSTREG Statement
PROC ROBUSTREG Statement PROC ROBUSTREG < options > ; The PROC ROBUSTREG statement invokes the procedure. You can specify the following options in the PROC ROBUSTREG statement. COVOUT
saves the estimated covariance matrix in the OUTEST= data set for M estimation and MM estimation. DATA=SAS-data-set
specifies the input SAS data set used by PROC ROBUSTREG. By default, the most recently created SAS data set is used. FWLS
requests that final weighted least squares estimators be computed. INEST= SAS-data-set
specifies an input SAS data set that contains initial estimates for all the parameters in the model. See the section “INEST= Data Set” on page 4011 for a detailed description of the contents of the INEST= data set. ITPRINT
displays the iteration history for the iteratively reweighted least squares algorithm used by M and MM estimation. You can also use this option in the MODEL statement. NAMELEN=n
specifies the length of effect names in tables and output data sets to be n characters, where n is a value between 20 and 200. The default length is 20 characters. ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the levels of the classification variables (specified in the CLASS statement). This ordering determines which parameters in the model correspond to each level in the data. The following table explains how PROC ROBUSTREG interprets values of the ORDER= option. Table 62.1. Options for Order
Value of ORDER= DATA
Levels Sorted By order of appearance in the input data set
FORMATTED
formatted value
FREQ
descending frequency count; levels with the most observations come first in the order
INTERNAL
unformatted value
By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent. For more information on sorting order, refer to the chapter titled “The SORT Procedure” in the SAS Procedures Guide.
3983
3984
Chapter 62. The ROBUSTREG Procedure
OUTEST=SAS-data-set
specifies an output SAS data set containing the parameter estimates, and, if the COVOUT option is specified, the estimated covariance matrix. See the section “OUTEST= Data Set” on page 4011 for a detailed description of the contents of the OUTEST= data set. SEED=number
specifies the seed for the random number generator used to randomly select the subgroups and subsets for LTS and S estimation. By default or you specify zero, the ROBUSTREG procedure generates a seed between one and one billion. METHOD= method type < ( options ) >
specifies the estimation method and options specify some additional options for the estimation method. PROC ROBUSTREG provides four estimation methods: M estimation, LTS estimation, S estimation, and MM estimation. The default method is M estimation. Since the LTS and S methods use subsampling algorithms, it is not suitable to apply these methods to an analysis with continuous independent variables which have only a few nonzero values or a few nonzero values within one BY group.
Options with METHOD=M With METHOD=M, you can specify the following additional options: ASYMPCOV=H1 | H2 | H3
specifies the type of asymptotic covariance computed for the M estimate. The three types are described in the section “Asymptotic Covariance and Confidence Intervals” on page 3997. By default, ASYMPCOV= H1. CONVERGENCE=criterion <(EPS=value)>
specifies a convergence criterion for the M estimate. Table 62.2. Options to Specify Convergence Criteria
Type residual weight coefficient
Option CONVERGENCE= RESID CONVERGENCE= WEIGHT CONVERGENCE= COEF
By default, CONVERGENCE = COEF. You can specify the precision of the convergence can be specified with the EPS= option. By default, EPS=1.E−8. MAXITER=n
sets the maximum number of iterations during the parameter estimation. By default, MAXITER=1000. SCALE=scale type | value
specifies the scale parameter or a method for estimating the scale parameter.
PROC ROBUSTREG Statement
Table 62.3. Options to Specify Scale
Scale Median estimate Tukey estimate Huber estimate Fixed constant
Option SCALE=MED SCALE=TUKEY<(D=d)> SCALE=HUBER<(D=d)> SCALE=value
Default d 2.5 2.5
By default, SCALE = MED. WF | WEIGHTFUNCTION=function type
specifies the weight function used for the M estimate. The ROBUSTREG procedure provides ten weight functions, which are listed in the following table. You can specify the parameters in these functions with the A=, B=, and C= options. These functions are described in the section “M Estimation” on page 3993. The default weight function is bisquare. Table 62.4. Options to Specify Weight Functions
Weight Function andrews bisquare cauchy fair hampel huber logistic median talworth welsch
Option WF = ANDREWS<(C=c)> WF = BISQUARE<(C=c)> WF = CAUCHY<(C=c)> WF = FAIR<(C=c)> WF = HAMPEL<(
Default a, b, c 1.339 4.685 2.385 1.4 2, 4, 8 1.345 1.205 0.01 2.795 2.985
Options with METHOD=LTS With METHOD=LTS, you can specify the following additional options: CSTEP=n
specifies the number of C-steps for the LTS estimate. See the section “LTS Estimate” on page 4000 for how the default value is determined. IADJUST=ALL | NONE
requests (IADJUST=ALL) or suppresses (IADJUST=NONE) the intercept adjustment for all estimates in the LTS-algorithm. By default, the intercept adjustment is used for data sets with less than 10000 observations. See the section “Algorithm” on page 4001 for details. H=n
specifies the quantile for the LTS estimate. See the section “LTS Estimate” on page 4000 for how the default value is determined.
3985
3986
Chapter 62. The ROBUSTREG Procedure
NBEST=n
specifies the number of best solutions kept for each subgroup during the computation of the LTS estimate. The default number is 10, which is the maximum number allowed. NREP=n
specifies the number of repeats of least squares fit in subgroups during the computation of the LTS estimate See the section “LTS Estimate” on page 4000 for how the default number is determined. SUBANALYSIS
requests a display of the subgrouping information and parameter estimates within subgroups. This option may generate the following ODS tables: Table 62.5. ODS Tables Available with SUBANALYSIS
ODS Table Name BestEstimates
Description Best final estimates for LTS
BestSubEstimates
Best estimates for each subgroup
CStep
C-Step information for LTS
Groups
Grouping information for LTS
Some of these tables are data dependent. SUBGROUPSIZE=n
specifies the data set size of the subgroups in the computation of the LTS estimate. The default number is 300.
Options with METHOD=S With METHOD=S, you can specify the following additional options: ASYMPCOV=H1 | H2 | H3 | H4
specifies the type of asymptotic covariance computed for the S estimate. The four types are described in the section “Asymptotic Covariance and Confidence Intervals” on page 4005. By default, ASYMPCOV= H4. CHIF= TUKEY | YOHAI
specifies the χ function for the S estimate. PROC ROBUSTREG provides two χ functions, Tukey’s BISQUARE function and Yohai’s OPTIMAL function, which you can request with CHIF=TUKEY and CHIF=YOHAI, respectively. The default is Tukey’s bisquare function. EFF=value
specifies the efficiency for the S estimate. The parameter k0 in the χ function is determined by this efficiency. The default efficiency is determined such that the consistent S estimate has the breakdown value of 25%.
PROC ROBUSTREG Statement
MAXITER=n
sets the maximum number of iterations for computing the scale parameter of the S estimate. By default, MAXITER=1000. NREP=n
specifies the number of repeats of subsampling in the computation of the S estimate. See the section “Algorithm” on page 4004 for how the default number of repeats is determined. NOREFINE
suppresses the refinement for the S estimate. See the section “Algorithm” on page 4004 for details. SUBSETSIZE=n
specifies the size of the subset for the S estimate. See the section “Algorithm” on page 4004 for how its default value is determined. TOLERANCE=value
specifies the tolerance for the S estimate of the scale. The default value is .001.
Options with METHOD=MM With METHOD=MM, you can specify the following additional options: ASYMPCOV=H1 | H2 | H3 | H4
specifies the type of asymptotic covariance computed for the MM estimate. The four types are described in the “Details” section. By default, ASYMPCOV= H4. BIASTEST<(ALPHA= number)>
requests the bias test for the final MM estimate. See the section “Bias Test” on page 4008 for details about this test. CHIF= TUKEY | YOHAI
selects the χ function for the MM estimate. PROC ROBUSTREG provides two χ functions: Tukey’s BISQUARE function and Yohai’s OPTIMAL function, which you can request with CHIF=TUKEY and CHIF=YOHAI, respectively. The default is Tukey’s bisquare function. This χ function is also used by the initial S estimate if you specify the INITEST=S option. CONVERGENCE=criterion <(EPS=number)>
specifies a convergence criterion for the MM estimate. Table 62.6. Options to Specify Convergence Criteria
Type residual weight coefficient
Option CONVERGENCE= RESID CONVERGENCE= WEIGHT CONVERGENCE= COEF
By default, CONVERGENCE = COEF. You can specify the precision of the convergence with the EPS= option. By default, EPS=1.E−8.
3987
3988
Chapter 62. The ROBUSTREG Procedure
EFF=value
specifies the efficiency for the MM estimate. The parameter k1 in the χ function is determined by this efficiency. The default efficiency is set to 85%, which corresponds to k1 = 3.440 for CHIF=TUKEY or k1 = 0.868 for CHIF=YOHAI. INITH=n
specifies the integer h for the initial LTS estimator used by the MM estimator. See the section “Algorithm” on page 4007 for how to specify h and how the default is determined. INITEST= LTS | S
specifies the initial estimator for the MM estimator. By default, the LTS estimator is used as the initial estimator for the MM estimator. K0=number
specifies the parameter k0 in the χ function for the MM estimate. For CHIF=TUKEY, the default is k0 = 2.9366. For CHIF=YOHAI, the default is k0 = 0.7405. These default values correspond to the 25% breakdown value of the MM estimator. MAXITER=n
sets the maximum number of iterations during the parameter estimation. By default, MAXITER=1000.
BY Statement BY variables ; You can specify a BY statement with PROC ROBUSTREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives: • Sort the data using the SORT procedure with a similar BY statement. • Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the ROBUSTREG procedure. The NOTSORTED option does not mean that the data are unsorted, but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order. • Create an index on the BY variables using the DATASETS procedure. For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the SAS Procedures Guide.
MODEL Statement
CLASS Statement CLASS variables ; Explanatory variables that are classification variables rather than quantitative numeric variables must be listed in the CLASS statement. For each explanatory variable listed in the CLASS statement, indicator variables are generated for the levels assumed by the CLASS variable. If the CLASS statement is used, it must appear before the MODEL statement.
ID Statement ID variables ; When the diagnostics table is requested with the DIAGNOSTICS option in the MODEL statement, the variables listed in the ID statement are displayed besides the observation number. These variables can be used to identify each observation. If the ID statement is omitted, the observation number is used to identify the observations.
MODEL Statement
Options You can specify the following options for the model fit. ALPHA=value
specifies the significance level for the confidence intervals for regression parameters. The value must be between 0 and 1. By default, ALPHA = 0.05. CORRB
produces the estimated correlation matrix of the parameter estimates. COVB
produces the estimated covariance matrix of the parameter estimates. CUTOFF=value
specifies the multiplier of the cutoff value for outlier detection. By default, CUTOFF = 3. DIAGNOSTICS<(ALL)>
requests the outlier diagnostics. By default, only observations identified as outliers or leverage points are displayed. To request that all observations be displayed, specify the ALL option.
3989
3990
Chapter 62. The ROBUSTREG Procedure
ITPRINT
displays the iteration history for the iteratively reweighted least squares algorithm used by M and MM estimation. You can also use this option in the PROC statement. LEVERAGE<(CUTOFF=value | CUTOFFALPHA=value | QUANTILE=n)>
requests an analysis of leverage points for the continuous covariates. The results are added to the diagnostics table, which you can request with the DIAGNOSTICS option in the MODEL statement. You can specify the cutoff valueqfor leverage point
detection with the CUTOFF= option. The default cutoff value is χ2p;1−α , where α can be specified with the CUTOFFALPHA= option. By default, α = .025. You can use the QUANTILE= option to specify the quantile to be minimized for the MCD algorithm used for the leverage point analysis. By default, QUANTILE=[(3n + p + 1)/4], where n is the number of observations and p is the number of independent variables. The LEVERAGE option is ignored if the model includes class variables as covariates. Since the MCD algorithm uses subsampling, it is not suitable to apply the leverage point analysis to continuous variables which have only a few nonzero values or a few nonzero values within one BY group. NOGOODFIT
suppresses the computation of goodness-of-fit statistics. NOINT
specifies no-intercept regression. SINGULAR=value
specifies the tolerance for testing singularity of the information matrix and the crossproducts matrix for the initial least-squares estimates. Roughly, the test requires that a pivot be at least this value times the original diagonal value. By default, SINGULAR = 1.E−12.
OUTPUT Statement OUTPUT
PERFORMANCE Statement keyword=name
specifies the statistics to include in the output data set and gives names to the new variables. Specify a keyword for each desired statistic (see the following list), an equal sign, and the variable to contain the statistic.
The keywords allowed and the statistics they represent are as follows: LEVERAGE
specifies a variable to indicate leverage points. To include this variable in the OUTPUT data set, you must specify the LEVERAGE option in the PROC statement. See the section “Leverage Point and Outlier Detection” on page 4010 for how to define LEVERAGE.
OUTLIER
specifies a variable to indicate outliers. See the section “Leverage Point and Outlier Detection” on page 4010 for how to define OUTLIER.
PREDICTED | P
specifies a variable to contain the estimated response.
RESIDUAL | R
specifies a variable to contain the residuals yi − xTi b
SRESIDUAL | SR
specifies a variable to contain the standardized residuals yi − xTi b σ ˆ
STDP
specifies a variable to contain the estimates of the standard errors of the estimated response.
WEIGHT
specifies a variable to contain the computed final weights.
PERFORMANCE Statement You use the PERFORMANCE statement to specify options that tune the performance of PROC ROBUSTREG. By default these options are chosen to maximize performance. See Chen (2002) for some empirical results.
PERFORMANCE < options > ; The following option is available: CPUCOUNT=n
specifies the number of threads to use in the computation of LTS or S estimation (initial LTS or S estimation for MM estimation). By default this will be equal to the number of processors on the hardware.
3991
3992
Chapter 62. The ROBUSTREG Procedure
TEST Statement
WEIGHT Statement WEIGHT variable ; The WEIGHT statement specifies a weight variable in the input data set. If you want to use fixed weights for each observation in the input data set, place the weights in a variable in the data set and specify the name in a WEIGHT statement. The values of the WEIGHT variable can be nonintegral and are not truncated. Observations with nonpositive or missing values for the weight variable do not contribute to the fit of the model.
Details This section describes the statistical and computational aspects of the ROBUSTREG procedure. The following notation is used throughout this section. Let X = (xij ) denote an n × p matrix, y = (y1 , ..., yn )T a given n-vector of responses, and θ = (θ1 , ..., θp )T an unknown p-vector of parameters or coefficients whose components are to be estimated. The matrix X is called the design matrix. Consider the usual linear model y = Xθ + e where e = (e1 , ..., en )T is an n-vector of unknown errors. It is assumed that (for a given X) the components ei of e are independent and identically distributed according to a distribution L(·/σ), where σ is a scale parameter (usually unknown). The vector of residuals for a given value of θˆ is denoted by r = (r1 , ..., rn )T and the ith row of the matrix X is denoted by xTi .
M Estimation
M Estimation M estimation in the context of regression was first introduced by Huber (1973) as an result of making the least squares approach robust. Although M estimators are not robust with respect to leverage points, they are popular in applications where leverage points are not an issue. Instead of minimizing a sum of squares of the residuals, a Huber-type M estimator θˆM of θ minimizes a sum of less rapidly increasing functions of the residuals:
Q(θ) =
n X i=1
ri ρ( ) σ
where r = y − Xθ. For the ordinary least squares estimation, ρ is the quadratic function. If σ is known, by taking derivatives with respect to θ, θˆM is also a solution of the system of p equations: n X i=1
ri ψ( )xij = 0, j = 1, ..., p σ
where ψ = ρ0 . If ρ is convex, θˆM is the unique solution. The ROBUSTREG procedure solves this system by using iteratively reweighted least squares (IRLS). The weight function w(x) is defined as
w(x) =
ψ(x) x
The ROBUSTREG procedure provides ten kinds of weight functions (corresponding to ten ρ-functions) through the WEIGHTFUNCTION= option in the MODEL statement. See the section “Weight Functions” on page 3995 for a complete discussion. You can specify the scale parameter σ with the SCALE= option in the PROC statement. If σ is unknown, both θ and σ are estimated by minimizing the function
Q(θ, σ) =
n X i=1
ri [ρ( ) + a]σ, a > 0 σ
The algorithm proceeds by alternately improving θˆ in a location step and σ ˆ in a scale step. For the scale step, three methods are available to estimate σ, which you can select with the SCALE= option.
3993
3994
Chapter 62. The ROBUSTREG Procedure
1. (SCALE=HUBER<(D=d)>) Compute σ ˆ by the iteration n
(ˆ σ (m+1) )2 =
1 X ri χd ( (m) )(ˆ σ (m) )2 nh σ ˆ i=1
where χd (x) =
x2 /2 d2 /2
if |x| < d otherwise
√ − 12 d2 2 2 ) is is the Huber function and h = n−p n (d + (1 − d )Φ(d) − .5 − d 2πe the Huber constant (refer to Huber 1981, p. 179). You can specify d with the D= option. By default, d = 2.5. 2. (SCALE=TUKEY<(D=d)>) Compute σ ˆ by solving the supplementary equation n
1 X ri χd ( ) = β n−p σ i=1
where χd (x) =
3x2 d2
−
3x4 d4
+
x6 d6
1
if |x| < d otherwise
R Here ψ = 16 χ01 is Tukey’s biweight function, and β = χd (s)dΦ(s) is the constant such that the solution σ ˆ is asymptotically consistent when L(·/σ) = Φ(·) (refer to Hampel et. al. 1986, p. 149). You can specify d with the D= option. By default, d = 2.5. 3. (SCALE=MED) Compute σ ˆ by the iteration σ ˆ (m+1) = median{|yi − xTi θˆ(m) |/β0 , i = 1, ..., n} where β0 = Φ−1 (.75) is the constant such that the solution σ ˆ is asymptotically consistent when L(·/σ) = Φ(·) (refer to Hampel et. al. 1986, p. 312). Note that SCALE = MED is the default.
Algorithm The basic algorithm for computing M estimates for regression is iteratively reweighted least squares (IRLS). As the name suggests, a weighted least squares fit is carried out inside an iteration loop. For each iteration, a set of weights for the observations is used in the least squares fit. The weights are constructed by applying a weight function to the current residuals. Initial weights are based on residuals from an initial fit. The ROBUSTREG procedure uses the unweighted least squares fit as a default initial fit. The iteration terminates when a convergence criterion is satisfied. The maximum number of iterations is set to 1000. You can specify the weight function and the convergence criteria.
M Estimation
Weight Functions You can specify the weight function for M estimation with the WEIGHTFUNCTION= option. The ROBUSTREG procedure provides ten weight functions. By default, the procedure uses the bisquare weight function. In most cases, M estimates are more sensitive to the parameters of these weight functions than to the type of the weight function. The median weight function is not stable and is seldom recommended in data analysis; it is included in the procedure for completeness. You can specify the parameters for these weight functions. Except for the hampel and median weight functions, default values for these parameters are defined such that the corresponding M estimates have 95% asymptotic efficiency in the location model with the Gaussian distribution (see Holland and Welsch (1977)). The following list shows the weight functions available.
( andrews
x c
W (x, c) =
0
bisquare
W (x, c) =
cauchy
W (x, c) =
fair
W (x, c) =
sin( xc )
if |x| ≤ πc otherwise
(1 − ( xc )2 )2 if |x| < c 0 otherwise
1 1+(
|x| 2 ) c
1 |x| (1+ c )
3995
3996
hampel
Chapter 62. The ROBUSTREG Procedure
W (x, a, b, c) =
huber
W (x, c) =
logistic
W (x, c) =
median
talworth
|x| < a a < |x| ≤ b
b < |x| ≤ c otherwise
1 c |x|
|x| a c−|x| |x| c−b
0
if |x| < c otherwise
tanh( xc ) x c
1 c 1 |x|
1 if |x| < c 0 otherwise
W (x, c) =
W (x, c) =
1 a
if x = 0 otherwise
M Estimation
welsch
W (x, c) = exp(− 12 ( xc )2 )
See Table 62.4 on page 3985 for the default values of the constants in these weight functions.
Convergence Criteria The following convergence criteria are available in PROC ROBUSTREG: 1. Relative change in the scaled residuals (CONVERGENCE= RESID) 2. Relative change in the coefficients (CONVERGENCE= COEF) 3. Relative change in weights (CONVERGENCE= W) You can specify the criteria with the CONVERGENCE= option in the PROC statement. The default is CONVERGENCE= COEF. You can specify the precision of the convergence criterion with the EPS= sub-option. In addition to these convergence criteria, a convergence criterion based on scaleindependent measure of the gradient is always checked. See Coleman, et. al. (1980). A warning is issued if this criterion is not satisfied.
Asymptotic Covariance and Confidence Intervals The following three estimators of the asymptotic covariance of the robust estimator are available in PROC ROBUSTREG: H1: K
P − p)] (ψ(ri ))2 T −1 P (X X) [(1/n) (ψ 0 (ri ))]2
2 [1/(n
P [1/(n − p)] (ψ(ri ))2 −1 P H2: K W [(1/n) (ψ 0 (ri ))] H3: K −1
X 1 (ψ(ri ))2 W −1 (X T X)W −1 (n − p) 0
) where K = 1 + np var(ψ is a correction factor and Wjk = (Eψ 0 )2 Huber (1981, p. 173) for more details.
P
ψ 0 (ri )xij xik . Refer to
3997
3998
Chapter 62. The ROBUSTREG Procedure
You can specify the asymptotic covariance estimate with the option ASYMPCOV= option. The ROBUSTREG procedure uses H1 as the default because of its simplicity and stability. Confidence intervals are computed from the diagonal elements of the estimated asymptotic covariance matrix. R2 and Deviance The robust version of R2 is defined as P
2
R =
P yi −xTi θˆ µ ρ( yi −ˆ ) − ρ( sˆ ) sˆ P yi −ˆµ ρ( sˆ )
and the robust deviance is defined as the optimal value of the objective function on the σ 2 -scale: D = 2(ˆ s)2
X
ρ(
yi − xTi θˆ ) sˆ
where ρ0 = ψ, θˆ is the M estimator of θ, µ ˆ is the M estimator of location, and sˆ is the M estimator of the scale parameter in the full model.
Linear Tests Two tests are available in PROC ROBUSTREG for the canonical linear hypothesis H0 :
θj = 0, j = q + 1, ..., p
The first test is a robust version of the F test, which is refered to as the ρ-test. Denote the M estimators in the full and reduced model as θˆ0 ∈ Ω0 and θˆ1 ∈ Ω1 , respectively. Let Q0 = Q(θˆ0 ) = min{Q(θ)|θ ∈ Ω0 } Q1 = Q(θˆ1 ) = min{Q(θ)|θ ∈ Ω1 } with Q=
n X i=1
ri ρ( ) σ
The robust F test is based on the test statistic Sn2 =
2 [Q1 − Q0 ] p−q
High Breakdown Value Estimation
Asymptotically Sn2 ∼ λχ2p−q under H0 , where the standardization factor is λ = R 2 R ψ (s)dΦ(s)/ ψ 0 (s)dΦ(s) and Φ is the cumulative distribution function of the standard normal distribution. Large values of Sn2 are significant. This test is a special case of the general τ -test of Hampel et. al. (1986, Section 7.2). The second test is a robust version of the Wald test, which is refered to as Rn2 -test. The test uses a test statistic −1 ˆ Rn2 = n(θˆq+1 , ..., θˆp )H22 (θq+1 , ..., θˆp )T
where n1 H22 is the (p − q) × (p − q) lower right block of the asymptotic covariance matrix of the M estimate θˆM of θ in a p-parameter linear model. Under H0 , the statistic Rn2 has an asymptotic χ2 distribution with p − q degrees of freedom. Large absolute values of Rn2 are significant. Refer to Hampel et. al. (1986, Chapter 7).
Model Selection When M estimation is used, two criteria are available in PROC ROBUSTREG for model selection. The first criterion is a counterpart of the Akaike (1974) AIC criterion for robust regression, and it is defined as
AICR = 2
n X
ρ(ri:p ) + αp
i=1
ˆ σ, σ where ri:p = (yi − xTi θ)/ˆ ˆ is a robust estimate of σ and θˆ is the M estimator with p-dimensional design matrix. As with AIC, α is the weight of the penalty for dimensions. The ROBUSTREG procedure uses α = 2Eψ 2 /Eψ 0 (Ronchetti (1985)) and estimates it using the final robust residuals. The second criterion is a robust version of the Schwarz information criteria(BIC), and it is defined as BICR = 2
n X
ρ(ri:p ) + p log(n)
i=1
High Breakdown Value Estimation The breakdown value of an estimator is defined as the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from its value on the uncontamined data. The breakdown value of an estimator can be used as a measure of the robustness of the estimator. Rousseeuw and Leroy (1987) and others introduced the following high breakdown value estimators for linear regression.
3999
4000
Chapter 62. The ROBUSTREG Procedure
LTS Estimate The least trimmed squares (LTS) estimate proposed by Rousseeuw (1984) is defined as the p-vector θˆLT S = arg min QLT S (θ) θ
where
QLT S (θ) =
h X
2 r(i)
i=1 2 ≤ r 2 ≤ ... ≤ r 2 are the ordered squared residuals r 2 = (y − xT θ)2 , i = r(1) i i i (2) (n)
1, ..., n, and h is defined in the range
n 2
+1≤h≤
3n+p+1 . 4
You can specify the parameter h with the H= option in the PROC statement. By default, h = [ 3n+p+1 ]. The breakdown value is n−h 4 n for the LTS estimate. The least median of squares (LMS) estimate is defined as the p-vector θˆLM S = arg min QLM S (θ) θ
where 2 QLM S (θ) = r(h) 2 ≤ r 2 ≤ ... ≤ r 2 are the ordered squared residuals r 2 = (y − xT θ)2 , i = r(1) i i i (2) (n)
1, ..., n, and h is defined in the range
n 2
+1≤h≤
3n+p+1 . 4
The breakdown value for the LMS estimate is also n−h n . However the LTS estimate has several advantages over the LMS estimate. Its objective function is smoother, making the LTS estimate less “jumpy” (i.e. sensitive to local effects) than the LMS estimate. Its statistical efficiency is better, because the LTS estimate is asymptotically normal whereas the LMS estimate has a lower convergence rate (Rousseeuw and Leroy (1987)). Another important advantage is that, using the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000), the LTS estimate takes less computing time and is more accurate. The ROBUSTREG procedure computes LTS estimates using the FAST-LTS algorithm. The estimates are often used to detect outliers in the data, which are then downweighted in the resulting weighted LS regression.
High Breakdown Value Estimation
Algorithm
Least trimmed squares (LTS) regression is based on the subset of h observations (out of a total of n observations) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n2 and n. The LTS method was proposed by Rousseeuw (1984, p. 876) as a highly robust regression estimator with breakdown value n−h n . The ROBUSTREG procedure uses the FAST-LTS algorithm given by Rousseeuw and Van Driessen (1998). The intercept adjustment technique is also used in this implementation. However, because this adjustment is expensive to compute, it is optional. You can use the IADJUST option in the PROC statement to request or suppress the intercept adjustment. By default, PROC ROBUSTREG does intercept adjustment for data sets with less than 10000 observations. The algorithm is described briefly as follows. Refer to Rousseeuw and Van Driessen (2000) for details. 1. The default h is [ 3n+p+1 ], where p is the number of independent variables. You 4 can specify any integer h with [ n2 ] + 1 ≤ h ≤ [ 3n+p+1 ] with the H= option 4 in the MODEL statement. The breakdown value for LTS, n−h n , is reported. The default h is a good compromise between breakdown value and statistical efficiency. 2. If p = 1 (single regressor) the procedure uses the exact algorithm of Rousseeuw and Leroy (1987, p. 172-172). 3. If p ≥ 2, the procedure uses the following algorithm. If n < 2ssubs, where ssubs is the size of the subgroups (you can specify ssubs using the SUBGROUPSIZE= option in the PROC statement, by default, ssubs = 300), draw a random p-subset and compute the regression coefficients using these p points (if the regression is degenerate, draw another p-subset). Compute the absolute residuals for all observations in the data set and select the first h points with smallest absolute residuals. From this selected h-subset, carry out nsteps C-steps (Concentration step, see Rousseeuw and Van Driessen (2000) for details. You can specify nsteps with the CSTEP= option in the PROC statement, by default, nsteps = 2). Redraw p-subsets and repeat the preceding computing procedure nrep times and find the nbsol (at most) solutions with the lowest sums of h squared residuals. nrep can be specified with the NREP= option in the PROC statement. By default, NREP=min{500, np }. For small n and p, all np subsets are used and the NREP= option is ignored (Rousseeuw and Hubert (1996)). nbsol can be specified with the NBEST= option in the PROC statement. By default, NBEST=10. For each of these nbsol best solutions, take C-steps until convergence and find the best final solution. 4. If n ≥ 5ssubs, construct 5 disjoint random subgroups with size ssubs. If 2ssubs < n < 5ssubs, the data are split into at most four subgroups with ssubs or more observations in each subgroup, so that each observation belongs to a subgroup and such that the subgroups have roughly the same size. Let nsubs denote the number of subgroups. Inside each subgroup, repeat the pronrep cedure in Step 3 [ nsubs ] times and keep the nbsol best solutions. Pool the subgroups, yielding the merged set of size nmerged . In the merged set, for each of the nsubs × nbsol best solutions, carry out nsteps C-steps using nmerged and hmerged = [nmerged nh ] and keep the nbsol best solutions. In the full data set, for
4001
4002
Chapter 62. The ROBUSTREG Procedure each of these nbsol best solutions, take C-steps using n and h until convergence and find the best final solution.
R2 The robust version of R2 for the LTS estimate is defined as 2 RLT S =1−
s2LT S (X, y) s2LT S (1, y)
for models with the intercept term and as 2 RLT S =1−
s2LT S (X, y) s2LT S (0, y)
for models without the intercept term, where v u h u1 X 2 sLT S (X, y) = dh,n t r(i) h i=1
sLT S is a preliminary estimate of the parameter σ in the distribution function L(·/σ). Here dh,n is chosen to make sLT S consistent assuming a Gaussian model. Specifically, s dh,n = 1/
1−
ch,n = 1/Φ−1 (
2n φ(1/ch,n ) hch,n
h+n ) 2n
with Φ and φ being the distribution function and the density function of the standard normal distribution, respectively. Final Weighted Scale Estimator
The ROBUSTREG procedure displays two scale estimators, sLT S and Wscale. The estimate Wscale is a more efficient scale estimate based on the preliminary estimate sLT S , and it is defined as s P wi ri2 Wscale = P i i wi − p where wi =
0 1
if |ri |/sLT S > k otherwise
You can specify k with the CUTOFF= option in the MODEL statement. By default, k = 3.
High Breakdown Value Estimation
S Estimate The S estimate proposed by Rousseeuw and Yohai (1984) is defined as the p-vector θˆS = arg min S(θ) θ
where the dispersion S(θ) is the solution of n
1 X yi − xTi θ χ( ) = β. n−p S i=1
R Here β is set to χ(s)dΦ(s) such that θˆS and S(θˆS ) are asymptotically consistent estimates of θ and σ for the Gaussian regression model. The breakdown value of the S estimate is β sups χ(s) The ROBUSTREG procedure provides two choices for χ: the Tukey function and the Yohai function. The Tukey function, which you can specify with the option CHIF=TUKEY, is χk0 (s) =
3( ks0 )2 − 3( ks0 )4 + ( ks0 )6 , if |s| ≤ k0 1 otherwise
The constant k0 controls the breakdown value and efficiency of the S estimate. By specifying the efficiency using the EFF= option, you can determine the corresponding k0 . The default k0 is 2.9366 such that the breakdown value of the S estimate is 0.25 with a corresponding asymptotic efficiency for the Gaussian model of 75.9%. The Yohai function, which you can specify with the option CHIF=YOHAI, is
χk0 (s) =
s2 2 k02 [b0
+ b1 ( ks0 )2 + b2 ( ks0 )4 +b3 ( ks0 )6 + b4 ( ks0 )8 ] 3.25k02
if |s| ≤ 2k0 if 2k0 < |s| ≤ 3k0 if |s| > 3k0
where b0 = 1.792, b1 = −0.972, b2 = 0.432, b3 = −0.052, and b4 = 0.002. By specifying the efficiency using the EFF= option, you can determine the corresponding k0 . By default, k0 is set to 0.7405 such that the breakdown value of the S estimate is 0.25 with a corresponding asymptotic efficiency for the Gaussian model of 72.7%.
4003
4004
Chapter 62. The ROBUSTREG Procedure
Algorithm
The ROBUSTREG procedure implements the algorithm by Marazzi (1993) for the S estimate, which is a refined version of the algorithm proposed by Ruppert (1992). The refined algorithm is briefly described as follows. Initialize iter = 1. 1. Draw a random q-subset of the total n observations and compute the regression coefficients using these q observations (if the regression is degenerate, draw another q-subset), where q ≥ p can be specified with the SUBSIZE= option. By default, q = p. P 2. Compute the residuals: ri = yi − pi=1 xij θj for i = 1, ..., n. If iter = 1, set ∗ ∗ s∗ = 2median{|r i |, i = 1, ..., n}; if s = 0, set s = min{|ri |, i = 1, ..., n}; Pn ∗ ∗ ∗ while i=1 χ(rP i /s ) > (n − p)β, set s = 1.5s ; go to Step 3. n ∗ If iter > 1 and i=1 χ(ri /s ) <= (n − p)β, go to the Step 3; else go to Step 5. 3. Solve for s the equation n
1 X χ(ri /s) = β n−p i=1
using an iterative algorithm. 4. If iter > 1 and s > s∗ , go to Step 5. Otherwise, set s∗ = s and θ∗ = θ. If s∗ < T OLS, return s∗ and θ∗ ; else go to Step 5. 5. if iter < N REP , set iter = iter + 1 and return to Step 1; else return s∗ and θ∗ . The ROBUSTREG procedure does the following refinement step by default. You can request this refinement not be done using the NOREFINE option in the PROC statement. 6. Let ψ = χ0 . Using the values s∗ and θ∗ from the previous steps, compute M estimates θM and σM of θ and σ with the setup for M estimation in the section “M Estimation” on page 3993. If σM > s∗ , give a warning and return s∗ and θ∗ ; otherwise, return σM and θM . You can specify T OLS with the TOLERANCE= option; by default, T OLS = .001. Alternately You can specify N REP with the NREP= option. You can also use the options NREP= NREP0 or NREP= NREP1 to determine N REP according to the following table. NREP= NREP0 is set as the default.
High Breakdown Value Estimation
Table 62.7. Default NREP
P 1 2 3 4 5 6 7 8 9 >9
NREP0 150 300 400 500 600 700 850 1250 1500 1500
NREP1 500 1000 1500 2000 2500 3000 3000 3000 3000 3000
R2 and Deviance The robust version of R2 for the S estimate is defined as RS2 = 1 −
(n − p)Sp2 (n − 1)Sµ2
for the model with the intercept term and RS2
(n − p)Sp2 =1− nS02
for the model without the intercept term, where Sp is the S estimate of the scale in the full model, Sµ is the S estimate of the scale in the regression model with only the intercept term, and S0 is the S estimate of the scale without any regressor. The deviance D is defined as the optimal value of the objective function on the σ 2 -scale: D = Sp2 Asymptotic Covariance and Confidence Intervals
Since the S estimate satisfies the first-order necessary conditions as the M estimate, it has the same asymptotic covariance as that of the M estimate. All three estimators of the asymptotic covariance for the M estimate in the section “Asymptotic Covariance and Confidence Intervals” on page 3997 can be used for the S estimate. Besides, the weighted covariance estimator H4 described in the section “Asymptotic Covariance and Confidence Intervals” on page 4008 is also available and is set as the default. Confidence intervals for estimated parameters are computed from the diagonal elements of the estimated asymptotic covariance matrix.
4005
4006
Chapter 62. The ROBUSTREG Procedure
MM Estimation MM estimation is a combination of high breakdown value estimation and efficient estimation, which was introduced by Yohai (1987). It has three steps: 1. Compute an initial (consistent) high breakdown value estimate θˆ0 . The ROBUSTREG procedure provides two kinds of estimates as the initial estimate, the LTS estimate and the S estimate. By default, the LTS estimate because of its speed, efficiency, and high breakdown value. The breakdown value of the final MM estimate is decided by the breakdown value of the initial LTS estimate and the constant k0 in the χ function. To use the S estimate as the initial estimate, you specify the INITEST=S option in the PROC statement. In this case, the breakdown value of the final MM estimate is decided only by the constant k0 . Instead of computing the LTS estimate or the S estimate as initial estimates, you can also specify the initial estimate explicitly using the INEST= option in the PROC statement. See the section “INEST= Data Set” on page 4011 for details. 2. Find σ ˆ 0 such that n
1 X yi − xTi θˆ0 χ( )=β n−p σ ˆ0 i=1
where β =
R
χ(s)dΦ(s).
The ROBUSTREG procedure provides two choices for χ: the Tukey function and the Yohai function. The Tukey function, which you can specify with the option CHIF=TUKEY, is χk0 (s) =
3( ks0 )2 − 3( ks0 )4 + ( ks0 )6 , if |s| ≤ k0 1 otherwise
where k0 can be specified with the K0= option. The default k0 = 2.9366 such that the asymptotically consistent scale estimate σ ˆ 0 has the breakdown value of 25%. The Yohai function, which you can specify with the option CHIF=YOHAI, is
χk0 (s) =
s2 2 k02 [b0
+ b1 ( ks0 )2 + b2 ( ks0 )4 +b3 ( ks0 )6 + b4 ( ks0 )8 ] 3.25k02
if |s| ≤ 2k0 if 2k0 < |s| ≤ 3k0 if |s| > 3k0
where b0 = 1.792, b1 = −0.972, b2 = 0.432, b3 = −0.052, and b4 = 0.002. You can specify k0 with the K0= option. The default k0 is .7405 such that the asymptotically consistent scale estimate σ ˆ 0 has the breakdown value of 25%.
MM Estimation
3. Find a local minimum θˆM M of QM M =
n X
ρ(
i=1
yi − xTi θ ) σ ˆ0
such that QM M (θˆM M ) ≤ QM M (θˆ0 ). The algorithm for M estimation is used here. The ROBUSTREG procedure provides two choices for χ: the Tukey function and the Yohai function. The Tukey function, which you can specify with the option CHIF=TUKEY, is 3( ks1 )2 − 3( ks1 )4 + ( ks1 )6 , if |s| ≤ k1 ρ(s) = χk1 (s) = 1 otherwise where k1 can be specified with the K1= option. The default k1 is 3.440 such that the MM estimate has 85% asymptotic efficiency with the Gaussian distribution. The Yohai function, which you can specify with the option CHIF=Yohai, is
ρ(s) = χk1 (s) =
s2 2 k12 [b0
+ b1 ( ks1 )2 + b2 ( ks1 )4 +b3 ( ks1 )6 + b4 ( ks1 )8 ] 3.25k12
if |s| ≤ 2k1 if 2k1 < |s| ≤ 3k1 if |s| > 3k1
where k1 can be specified with the K1= option. The default k1 is 0.868 such that the MM estimate has 85% asymptotic efficiency with the Gaussian distribution.
Algorithm The initial LTS estimate is computed using the algorithm described in the section “LTS Estimate” on page 4000. You can control the quantile of the LTS estimate with the option INITH=h, where h is an integer between [ n2 ] + 1 and [ 3n+p+1 ]. By default, 4 h = [ 3n+p+1 ], which corresponds to a breakdown value of around 25%. 4 The initial S estimate is computed using the algorithm described in the section “S Estimate” on page 4003. You can control the breakdown value and efficiency of this initial S estimate by the constant k0 which can be specified with the K0 option. The scale parameter σ is solved by an iterative algorithm (σ (m+1) )2 =
n X ri 1 χk0 ( (m) )(σ (m) )2 (n − p)β σ i=1
where β =
R
χk0 (s)dΦ(s).
Once the scale parameter is computed, the iteratively reweighted least squares (IRLS) algorithm with fixed scale parameter is used to compute the final MM estimator.
4007
4008
Chapter 62. The ROBUSTREG Procedure
Convergence Criteria In the iterative algorithm for the scale parameter, the relative change of the scale parameter controls the convergence. In the iteratively reweighted least squares algorithm, the same convergence criteria for the M estimate used before are used here.
Bias Test Although the final MM estimate inherits the high-breakdown-value properity, its bias due to the distortion of the outliers can be high. Yohai, Stahel, and Zamar (1991) introduced a bias test. The ROBUSTREG procedure implements this test when you specify the BIASTEST option in the PROC statement. This test bases on the initial scale estimate σ ˆ 0 and the final scale estimate σ ˆ10 , which is the solution of n
1 X yi − xTi θˆM M χ( )=β n−p σ ˆ10 i=1
Let ψk0 (·) = χ0k0 (·) and ψk1 (·) = χ0k1 (·), where 0 denotes the derivative with respect to the argument. Compute r˜i = (yi − xTi θˆ0 )/ˆ σ 0 for i = 1, ..., n P 0 (1/n) ψk0 (˜ ri ) P v0 = 0 (ˆ σ1 /n) ψk0 (˜ ri )˜ ri ψk0 (˜ r) (0) P i0 pi = for i = 1, ..., n (1/n) ψk0 (˜ ri ) (1)
pi
=
d2 = T
=
ψk1 (˜ r) P i0 for i = 1, ..., n (1/n) ψk1 (˜ ri ) 1 X (1) (0) (pi − pi )2 n 2n(ˆ σ10 − σ ˆ0) v0 d2 (ˆ σ 0 )2
Standard asymptotic theory shows that T approximately follows a χ2 -distribution with p degrees of freedom. If T exceeds the α quantile χ2α of the χ2 -distribution with p degrees of freedom, then the ROBUSTREG procedure gives a warning and recommends to use other methods. Otherwise the final MM estimate and the initial scale estimate are reported. You can specify α with the ALPHA= option following the BIASTEST option. By default, ALPHA= 0.99.
Asymptotic Covariance and Confidence Intervals
Since the MM estimate is computed as an M estimate with a fixed scale in the last step, the asymptotic covariance for the M estimate can be used here for the asymptotic covariance of the MM estimate. Besides the three estimators H1, H2, and H3 described in the section "Asymptotic Covariance and Confidence Intervals" on page 3997, a weighted covariance estimator H4 is available:

   \mathrm{H4:} \quad K^2 \, \frac{[1/(n-p)] \sum_i (\psi(r_i))^2}{[(1/n) \sum_i \psi'(r_i)]^2} \, W^{-1}

where

   K = 1 + \frac{p}{n} \, \frac{\mathrm{var}(\psi')}{(E\psi')^2}

is the correction factor, W_{jk} = \frac{1}{\bar{w}} \sum_i w_i x_{ij} x_{ik}, and \bar{w} = \frac{1}{n} \sum_i w_i.
You can specify these estimators with the option ASYMPCOV=[H1 | H2 | H3 | H4]. The ROBUSTREG procedure uses H4 as the default. Confidence intervals for estimated parameters are computed from the diagonal elements of the estimated asymptotic covariance matrix.

R2 and Deviance
The robust version of R² for the MM estimate is defined as

   R^2 = \frac{ \sum_i \rho\left( \frac{y_i - \hat{\mu}}{\hat{s}} \right) - \sum_i \rho\left( \frac{y_i - x_i^T \hat{\theta}}{\hat{s}} \right) }{ \sum_i \rho\left( \frac{y_i - \hat{\mu}}{\hat{s}} \right) }

and the robust deviance is defined as the optimal value of the objective function on the σ²-scale:

   D = 2(\hat{s})^2 \sum_i \rho\left( \frac{y_i - x_i^T \hat{\theta}}{\hat{s}} \right)

where ρ' = ψ, θ̂ is the MM estimator of θ, μ̂ is the MM estimator of location, and ŝ is the MM estimator of the scale parameter in the full model.
Linear Tests
For MM estimation, the same ρ test and R²_n test used for M estimation can be used. See the section "Linear Tests" on page 3998 for details.
Model Selection
For MM estimation, the same two model selection methods used for M estimation can be used. See the section "Model Selection" on page 3999 for details.
Robust Multivariate Location and Scale Estimates
The ROBUSTREG procedure uses robust multivariate location and scale estimates for leverage-point detection. The procedure provides the minimum covariance determinant (MCD) method, which was introduced by Rousseeuw (1984).
Algorithm
PROC ROBUSTREG implements the algorithm given by Rousseeuw and Van Driessen (1999) for MCD, which is similar to the algorithm for LTS.
Robust Distance
The Mahalanobis distance is defined as

   MD(x_i) = \left[ (x_i - \bar{x})^T \bar{C}(X)^{-1} (x_i - \bar{x}) \right]^{1/2}

where \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i and \bar{C}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^T (x_i - \bar{x}). Here x_i = (x_{i1}, \ldots, x_{i(p-1)})^T does not include the constant variable. The relation between the Mahalanobis distance MD(x_i) and the hat matrix H = (h_{ij}) = X(X^T X)^{-1} X^T is

   h_{ii} = \frac{1}{n-1} MD_i^2 + \frac{1}{n}
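This relation can be verified numerically. The following SAS/IML sketch checks it on a small simulated design; the data are made up purely for illustration:

   proc iml;
      call randseed(123);
      n  = 10;
      x  = randfun(n, "Normal") || randfun(n, "Normal");  /* p-1 = 2 regressors   */
      dx = j(n, 1, 1) || x;                               /* design with intercept */
      H  = dx * inv(dx`*dx) * dx`;                        /* hat matrix            */

      xbar = mean(x);                        /* row vector of column means */
      xc   = x - repeat(xbar, n, 1);         /* centered regressors        */
      C    = xc`*xc / (n-1);                 /* sample covariance matrix   */
      md2  = j(n, 1, .);
      do i = 1 to n;
         md2[i] = xc[i,] * inv(C) * xc[i,]`; /* squared Mahalanobis distance */
      end;

      /* the two columns agree: h_ii = MD_i**2/(n-1) + 1/n */
      print (vecdiag(H))[label="h_ii"] (md2/(n-1) + 1/n)[label="MD2/(n-1) + 1/n"];
   quit;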
The robust distance is defined as

   RD(x_i) = \left[ (x_i - T(X))^T C(X)^{-1} (x_i - T(X)) \right]^{1/2}

where T(X) and C(X) are the robust multivariate location and scale estimates obtained by MCD. These distances are used to detect leverage points.
Leverage Point and Outlier Detection
Let C(p) = \sqrt{\chi^2_{p;\,1-\alpha}} be the cutoff value. The variable LEVERAGE is defined as

   \mathrm{LEVERAGE} =
   \begin{cases}
      0 & \text{if } RD(x_i) \le C(p) \\
      1 & \text{otherwise}
   \end{cases}

You can specify a cutoff value with the LEVERAGE option in the MODEL statement. Residuals r_i, i = 1, \ldots, n, based on the robust regression estimates are used to detect vertical outliers. The variable OUTLIER is defined as

   \mathrm{OUTLIER} =
   \begin{cases}
      0 & \text{if } |r_i| \le k\sigma \\
      1 & \text{otherwise}
   \end{cases}
You can specify the multiplier k of the cutoff value by using the CUTOFF= option in the MODEL statement. An ODS table called DIAGNOSTICS contains these two variables.
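For example, both variables can be requested in the MODEL statement as follows; the data set and variables are placeholders:

   proc robustreg data=mydata;
      model y = x1 x2 x3 / diagnostics leverage;
   run;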
INEST= Data Set
When you use M or MM estimation, you can use the INEST= data set to specify initial estimates for all the parameters in the model. The INEST= option is ignored if you specify LTS or S estimation with the METHOD=LTS or METHOD=S option, or if you specify the INITEST= option after the METHOD=MM option in the PROC statement.

The INEST= data set must contain the intercept variable (named Intercept) and all independent variables in the MODEL statement. If BY processing is used, the INEST= data set should also include the BY variables, and there must be at least one observation for each BY group. If there is more than one observation in a BY group, the first one read is used for that BY group. If the INEST= data set also contains the _TYPE_ variable, only observations with a _TYPE_ value of "PARMS" are used as starting values.

You can specify starting values for the iteratively reweighted least squares algorithm in the INEST= data set. The INEST= data set has the same structure as the OUTEST= data set but is not required to have all the variables or observations that appear in the OUTEST= data set. One simple use of the INEST= option is passing the previous OUTEST= data set directly to the next model as an INEST= data set, assuming that the two models have the same parameterization.
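For example, the following sketch fits one model and passes its estimates to a second run as starting values; the data sets and variables are hypothetical:

   proc robustreg data=batch1 method=m outest=est1;
      model y = x1 x2;
   run;

   proc robustreg data=batch2 method=m inest=est1;
      model y = x1 x2;
   run;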
OUTEST= Data Set
The OUTEST= data set contains parameter estimates for the model. You can specify a label in the MODEL statement to distinguish between the estimates for different models fit by the ROBUSTREG procedure. If the COVOUT option is specified, the OUTEST= data set also contains the estimated covariance matrix of the parameter estimates. Note that if the ROBUSTREG procedure does not converge, the parameter estimates are set to missing in the OUTEST= data set.

The OUTEST= data set contains all variables specified in the MODEL statement and the BY statement. One observation consists of parameter values for the model with the dependent variable having the value −1. If the COVOUT option is specified, there are additional observations containing the rows of the estimated covariance matrix. For these observations, the dependent variable contains the parameter estimate for the corresponding row variable. The following variables are also added to the data set:
_MODEL_      a character variable containing the label of the MODEL statement, if present. Otherwise, the variable's value is blank.

_NAME_       a character variable containing the name of the dependent variable for the parameter estimates observations or the name of the row for the covariance matrix estimates

_TYPE_       a character variable containing the type of the observation, either PARMS for parameter estimates or COV for covariance estimates

_METHOD_     a character variable containing the type of estimation method: M estimation, LTS estimation, S estimation, or MM estimation

_STATUS_     a character variable containing the status of model fitting: Converged, Warning, or Failed

INTERCEPT    a numeric variable containing the intercept parameter estimates and covariances

_SCALE_      a numeric variable containing the scale parameter estimates
Any BY variables specified are also added to the OUTEST= data set.
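A sketch of writing the estimates and their covariance matrix to a data set (the data set and variables are placeholders):

   proc robustreg data=mydata method=mm outest=est covout;
      model y = x1 x2 x3;
   run;

   proc print data=est;
   run;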
Computational Resources
The algorithms for the various estimation methods need different amounts of memory for working space. Let p be the number of parameters estimated and n be the number of observations used in the model estimation.

For M estimation, the minimum working space (in bytes) needed is

   3n + 2p² + 30p

If sufficient space is available, the input data set is also kept in memory; otherwise, the input data set is reread for computing the iteratively reweighted least squares estimates, and the execution time of the procedure increases substantially. For each reweighted least squares step, O(np² + p³) multiplications and additions are required for computing the crossproduct matrix and its inverse. The O(v) notation means that, for large values of the argument v, O(v) is approximately a constant times v. Since the iteratively reweighted least squares algorithm converges very quickly (normally within fewer than 20 iterations), the computation of M estimates is fast.

LTS estimation is computationally more expensive. The minimum working space (in bytes) needed is

   np + 12n + 4p² + 60p

The memory is mainly used to store the current data used by LTS for modeling. The LTS algorithm uses subsampling and spends much of its computing time on resampling and computing estimates for subsamples. Since it resamples if singularity is detected, it may take more time if the data set has serious singularities.

The MCD algorithm for high leverage point diagnostics is similar to the LTS algorithm.
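For a concrete sense of these bounds, the following DATA step evaluates both formulas for a hypothetical problem size of n = 1000 and p = 5:

   data _null_;
      n = 1000;                                /* observations                */
      p = 5;                                   /* estimated parameters        */
      m_bytes   = 3*n + 2*p**2 + 30*p;         /* M estimation: 3200 bytes    */
      lts_bytes = n*p + 12*n + 4*p**2 + 60*p;  /* LTS estimation: 17400 bytes */
      put m_bytes= lts_bytes=;
   run;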
ODS Table Names
The ROBUSTREG procedure assigns a name to each table it creates. You can specify these names when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table.
Table 62.8. ODS Tables Produced in PROC ROBUSTREG
ODS Table Name       Description                              Statement  Option
BestEstimates        Best final estimates for LTS             PROC       SUBANALYSIS
BestSubEstimates     Best estimates for each subgroup         PROC       SUBANALYSIS*
BiasTest             Bias test for MM estimation              PROC       BIASTEST
ClassLevels          Class variable levels                    CLASS      default*
CorrB                Parameter estimate correlation matrix    MODEL      CORRB
CovB                 Parameter estimate covariance matrix     MODEL      COVB
CStep                C-step for LTS fitting                   PROC       SUBANALYSIS
Diagnostics          Outlier diagnostics                      MODEL      DIAGNOSTICS
DiagSummary          Summary of the outlier diagnostics       MODEL      default
GoodFit              R2, deviance, AIC, and BIC               MODEL      default
InitLTSProfile       Profile for initial LTS estimate         PROC       METHOD
InitSProfile         Profile for initial S estimate           PROC       METHOD
IterHistory          Iteration history                        PROC       ITPRINT
LTSEstimates         LTS parameter estimates                  PROC       METHOD
LTSLocationScale     Location and scale for LTS               PROC       METHOD
LTSProfile           Profile for LTS estimate                 PROC       METHOD
LTSRsquare           R2 for LTS estimate                      PROC       METHOD
MMProfile            Profile for MM estimate                  PROC       METHOD
ModelInfo            Model information                        MODEL      default
NObs                 Observations summary                     PROC       default
ParameterEstimates   Parameter estimates                      MODEL      default
ParameterEstimatesF  Final weighted LS estimates              PROC       FWLS
ParameterEstimatesR  Reduced parameter estimates              TEST       default
ParmInfo             Parameter indices                        MODEL      default
SProfile             Profile for S estimate                   PROC       METHOD
Groups               Groups for LTS fitting                   PROC       SUBANALYSIS*
SummaryStatistics    Summary statistics for model variables   MODEL      default
TestsProfile         Results for tests                        TEST       default

* Depends on data.
ODS Graphics (Experimental) Graphical displays are important in robust regression and outlier detection. Two plots are particularly useful for revealing outliers and leverage points. The first is a scatter plot of the standardized robust residuals against the robust distances (RDPLOT). The second is a scatter plot of the robust distances against the classical Mahalanobis distances (DDPLOT). See Figure 62.4 on page 3975 and Figure 62.5 on page 3975 for examples. In addition to these two plots, a histogram and a quantile-quantile plot of the standardized robust residuals are also helpful. See Figure 62.6 on page 3976 and Figure 62.7 on page 3976 for examples. This section describes the use of ODS for creating these four plots with the ROBUSTREG procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.
To request these plots you must specify the ODS GRAPHICS statement in addition to the PLOT= (or PLOTS=) option, which is described as follows. For more information on the ODS GRAPHICS statement, see Chapter 15, "Statistical Graphics Using ODS."

You can specify the PLOT= or PLOTS= option in the PROC statement to request one or more plots:

   PLOT=keyword
   PLOTS=(keyword-list)
requests plots for robust regression. You can specify one or more of the following keywords:

Table 62.9. Options for Plots

Keyword        Plot
DDPLOT         Robust distance - Mahalanobis distance
RDPLOT         Standardized robust residual - Robust distance
RESHISTOGRAM   Histogram of standardized robust residuals
RESQQPLOT      Q-Q plot of standardized robust residuals
ALL            All plots
With the RDPLOT and DDPLOT options, you can label the points on the plots by specifying the LABEL= suboption immediately after the keyword:

   PLOT=DDPLOT<(LABEL=label-method)>
   PLOT=RDPLOT<(LABEL=label-method)>
You can specify one of the following label methods:

Table 62.10. Label Methods

Value of LABEL=   Label Method
ALL               label all points
OUTLIER           label outliers
LEVERAGE          label leverage points
NONE              no labels
By default, the ROBUSTREG procedure labels both outliers and leverage points. If you specify ID variables in the ID statement, the values of the first ID variable are used as labels; otherwise, observation numbers are used as labels. The histogram is superimposed with a normal density curve and a kernel density curve.
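Putting these options together, the plots might be requested as follows; the data set, variables, and ID variable are placeholders for this experimental syntax:

   ods html;
   ods graphics on;

   proc robustreg data=mydata plots=(rdplot(label=leverage) ddplot reshistogram);
      model y = x1 x2 x3;
      id name;
   run;

   ods graphics off;
   ods html close;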
ODS Graph Names
PROC ROBUSTREG assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 62.11 on page 4015.
To request these graphs you must specify the ODS GRAPHICS statement in addition to the PLOT= (or PLOTS=) option described in Table 62.9 on page 4014. For more information on the ODS GRAPHICS statement, see Chapter 15, "Statistical Graphics Using ODS."

Table 62.11. ODS Graphics Produced by PROC ROBUSTREG

ODS Graph Name     Plot Description                                 Statement  PLOTS= Option
DDPlot             Robust distance - Mahalanobis distance           PROC       DDPLOT
RDPlot             Standardized robust residual - Robust distance   PROC       RDPLOT
ResidualHistogram  Histogram of standardized robust residuals       PROC       RESHISTOGRAM
ResidualQQPlot     Q-Q plot of standardized robust residuals        PROC       RESQQPLOT
Examples

Example 62.1. Comparison of Robust Estimates
This example illustrates differences in the performance of robust estimates available in the ROBUSTREG procedure. The following statements generate 1000 random observations. The first 900 observations are from a linear model, and the last 100 observations are significantly biased in the y-direction. In other words, ten percent of the observations are contaminated with outliers.

   data a (drop=i);
      do i=1 to 1000;
         x1=rannor(1234);
         x2=rannor(1234);
         e=rannor(1234);
         if i > 900 then y=100 + e;
         else y=10 + 5*x1 + 3*x2 + .5 * e;
         output;
      end;
   run;

   proc reg data=a;
      model y = x1 x2;
   run;

   proc robustreg data=a method=m;
      model y = x1 x2;
   run;

   proc robustreg data=a method=mm;
      model y = x1 x2;
   run;

Output 62.1.1. OLS Estimates for Data with 10% Contamination

The REG Procedure
Model: MODEL1
Dependent Variable: y
Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            19.06712         0.86322    22.09    <.0001
x1          1             3.55485         0.86892     4.09    <.0001
x2          1             2.12341         0.83039     2.56    0.0107
Output 62.1.2. M Estimates for Data with 10% Contamination

The ROBUSTREG Procedure

Model Information
Data Set                WORK.B
Dependent Variable      y
Number of Covariates    2
Number of Observations  1000
Method                  M Estimation

Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   10.0024          0.0174     9.9683     10.0364      331908      <.0001
x1          1    5.0077          0.0175     4.9735      5.0420     82106.9      <.0001
x2          1    3.0161          0.0167     2.9834      3.0488     32612.5      <.0001
Scale       1    0.5780
Output 62.1.3. MM Estimates for Data with 10% Contamination

The ROBUSTREG Procedure

Model Information
Data Set                WORK.B
Dependent Variable      y
Number of Covariates    2
Number of Observations  1000
Method                  MM Estimation

Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   10.0035          0.0176     9.9690     10.0379      323947      <.0001
x1          1    5.0085          0.0178     4.9737      5.0433     79600.6      <.0001
x2          1    3.0181          0.0168     2.9851      3.0511     32165.0      <.0001
Scale       0    0.6733
The tables of parameter estimates generated by the ROBUSTREG procedure using M estimation and MM estimation are shown in Output 62.1.2 and Output 62.1.3. For comparison, the ordinary least squares (OLS) estimates produced by the REG procedure are shown in Output 62.1.1. Both the M estimate and the MM estimate correctly estimate the regression coefficients for the underlying model (10, 5, and 3), but the OLS estimate does not. The next statements demonstrate that if the percentage of contamination is increased to 40%, the M estimates and MM estimates with default options fail to pick up the underlying model. However, by tuning the constant c for the M estimate and the constants INITH and K0 for the MM estimate, you can increase the breakdown values of these estimates and capture the right model. Output 62.1.4 and Output 62.1.5 display these estimates.
   data b (drop=i);
      do i=1 to 1000;
         x1=rannor(1234);
         x2=rannor(1234);
         e=rannor(1234);
         if i > 600 then y=100 + e;
         else y=10 + 5*x1 + 3*x2 + .5 * e;
         output;
      end;
   run;

   proc robustreg data=b method=m(wf=bisquare(c=2));
      model y = x1 x2;
   run;

   proc robustreg data=b method=mm(inith=502 k0=1.8);
      model y = x1 x2;
   run;
Output 62.1.4. M Estimates for Data with 40% Contamination

The ROBUSTREG Procedure

Model Information
Data Set                WORK.B
Dependent Variable      y
Number of Covariates    2
Number of Observations  1000
Method                  M Estimation

Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   10.0137          0.0219     9.9708     10.0565      209688      <.0001
x1          1    4.9905          0.0220     4.9473      5.0336     51399.1      <.0001
x2          1    3.0399          0.0210     2.9987      3.0811     20882.4      <.0001
Scale       1    1.0531
Output 62.1.5. MM Estimates for Data with 40% Contamination

The ROBUSTREG Procedure

Model Information
Data Set                WORK.B
Dependent Variable      y
Number of Covariates    2
Number of Observations  1000
Method                  MM Estimation

Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   10.0103          0.0213     9.9686     10.0520      221639      <.0001
x1          1    4.9890          0.0218     4.9463      5.0316     52535.7      <.0001
x2          1    3.0363          0.0201     2.9970      3.0756     22895.4      <.0001
Scale       0    1.8997
When there are bad leverage points, the M estimates fail to pick up the underlying model no matter what constant c you use. In this case, other estimates (LTS, S, and MM estimates) in PROC ROBUSTREG, which are robust to bad leverage points, will pick up the underlying model. The following statements generate 1000 observations with 1% bad high leverage points.

   data b (drop=i);
      do i=1 to 1000;
         x1=rannor(1234);
         x2=rannor(1234);
         e=rannor(1234);
         if i > 600 then y=100 + e;
         else y=10 + 5*x1 + 3*x2 + .5 * e;
         if i < 11 then x1=200 * rannor(1234);
         if i < 11 then x2=200 * rannor(1234);
         if i < 11 then y= 100*e;
         output;
      end;
   run;

   proc robustreg data=b method=s(k0=1.8);
      model y = x1 x2;
   run;

   proc robustreg data=b method=mm(inith=502 k0=1.8);
      model y = x1 x2;
   run;

Output 62.1.6. S Estimates for Data with 1% Leverage Points

The ROBUSTREG Procedure

Model Information
Data Set                WORK.C
Dependent Variable      y
Number of Covariates    2
Number of Observations  1000
Method                  S Estimation

Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1    9.9808          0.0216     9.9383     10.0232      212532      <.0001
x1          1    5.0303          0.0208     4.9896      5.0710     58656.3      <.0001
x2          1    3.0217          0.0222     2.9782      3.0652     18555.7      <.0001
Scale       0    2.2094
Output 62.1.7. MM Estimates for Data with 1% Leverage Points

The ROBUSTREG Procedure

Model Information
Data Set                WORK.C
Dependent Variable      y
Number of Covariates    2
Number of Observations  1000
Method                  MM Estimation

Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1    9.9820          0.0215     9.9398     10.0241      215369      <.0001
x1          1    5.0303          0.0206     4.9898      5.0707     59469.1      <.0001
x2          1    3.0222          0.0221     2.9789      3.0655     18744.9      <.0001
Scale       0    2.2134
Output 62.1.6 displays the S estimates and Output 62.1.7 displays the MM estimates with initial LTS estimates.
Example 62.2. Robust ANOVA
The classical analysis of variance (ANOVA) technique based on least squares assumes that the underlying experimental errors are normally distributed. However, data often contain outliers due to recording or other errors. In other cases, extreme responses occur when control variables in the experiment are set to extreme values. It is important to distinguish these extreme points and to determine whether they are outliers or important extreme cases. You can use the ROBUSTREG procedure for robust analysis of variance based on M estimation. Typically, there are no high leverage points in a well-designed experiment, so M estimation is appropriate.

The following example shows how to use the ROBUSTREG procedure for robust ANOVA. An experiment was carried out to study the effects of two successive treatments (T1, T2) on the recovery time of mice with certain diseases. Sixteen mice were randomly assigned to four groups for the four different combinations of the treatments. The recovery times (time) were recorded (in hours).

   data recover;
      input id T1 $ T2 $ time;
      datalines;
   1  0 0 20.2
   2  0 0 23.9
   3  0 0 21.9
   4  0 0 42.4
   5  1 0 27.2
   6  1 0 34.0
   7  1 0 27.4
   8  1 0 28.5
   9  0 1 25.9
   10 0 1 34.5
   11 0 1 25.1
   12 0 1 34.2
   13 1 1 35.0
   14 1 1 33.9
   15 1 1 38.3
   16 1 1 39.9
   ;
The following statements invoke the GLM procedure for a standard ANOVA.

   proc glm data=recover;
      class T1 T2;
      model time = T1 T2 T1*T2;
   run;

Output 62.2.1. Overall ANOVA

The GLM Procedure
Dependent Variable: time
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3     209.9118750   69.9706250     1.86  0.1905
Error            12     451.9225000   37.6602083
Corrected Total  15     661.8343750

R-Square  Coeff Var  Root MSE  time Mean
0.317167   19.94488  6.136791   30.76875
Output 62.2.2. Model ANOVA

The GLM Procedure
Dependent Variable: time

Source  DF    Type I SS  Mean Square  F Value  Pr > F
T1       1   81.4506250   81.4506250     2.16  0.1671
T2       1  106.6056250  106.6056250     2.83  0.1183
T1*T2    1   21.8556250   21.8556250     0.58  0.4609

Source  DF  Type III SS  Mean Square  F Value  Pr > F
T1       1   81.4506250   81.4506250     2.16  0.1671
T2       1  106.6056250  106.6056250     2.83  0.1183
T1*T2    1   21.8556250   21.8556250     0.58  0.4609
Output 62.2.1 indicates that the overall model effect is not significant at the 10% level and Output 62.2.2 indicates that neither treatment is significant at the 10% level.
The following statements invoke the ROBUSTREG procedure with the same model.

   proc robustreg data=recover;
      class T1 T2;
      model time = T1 T2 T1*T2 / diagnostics;
      T1_T2: test T1*T2;
      output out=robout r=resid sr=stdres;
   run;
Output 62.2.3 shows some basic information about the model and the response variable time.

Output 62.2.3. Model Fitting Information and Summary Statistics

The ROBUSTREG Procedure

Model Information
Data Set                         WORK.RECOVER
Dependent Variable               time
Number of Covariates             2
Number of Continuous Covariates  0
Number of Discrete Covariates    2
Number of Observations           16
Method                           M Estimation

Summary Statistics

Variable       Q1   Median       Q3     Mean  Standard Deviation     MAD
time      25.5000  31.2000  34.7500  30.7688              6.6425  6.8941
The Parameter Estimates table in Output 62.2.4 indicates that the main effects of both treatments are significant at the 5% level.

Output 62.2.4. Model Parameter Estimates

The ROBUSTREG Procedure

Parameter Estimates
Parameter    DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept     1   36.7655          2.0489    32.7497    40.7814       321.98      <.0001
T1      0     1   -6.8307          2.8976   -12.5100    -1.1514         5.56      0.0184
T1      1     0    0.0000          0.0000     0.0000     0.0000            .           .
T2      0     1   -7.6755          2.8976   -13.3548    -1.9962         7.02      0.0081
T2      1     0    0.0000          0.0000     0.0000     0.0000            .           .
T1*T2   0 0   1   -0.2619          4.0979    -8.2936     7.7698         0.00      0.9490
T1*T2   0 1   0    0.0000          0.0000     0.0000     0.0000            .           .
T1*T2   1 0   0    0.0000          0.0000     0.0000     0.0000            .           .
T1*T2   1 1   0    0.0000          0.0000     0.0000     0.0000            .           .
Scale         1    3.5346
Output 62.2.5. Diagnostics

The ROBUSTREG Procedure

Diagnostics

Obs  Standardized Robust Residual  Outlier
  4                        5.7722  *

Diagnostics Summary

Observation Type  Proportion  Cutoff
Outlier               0.0625  3.0000
The reason for the difference between the traditional ANOVA and the robust ANOVA is explained by Output 62.2.5, which shows that the fourth observation is an outlier. Further investigation shows that the original value of 24.4 for the fourth observation was recorded incorrectly. Output 62.2.6 displays the robust test results. The interaction between the two treatments is not significant. Output 62.2.7 displays the robust residuals and standardized robust residuals.

Output 62.2.6. Test of Significance

The ROBUSTREG Procedure

Robust Linear Tests: T1_T2
Test  Test Statistic  Lambda  DF  Chi-Square  Pr > ChiSq
Rho           0.0041  0.7977   1        0.01      0.9431
Rn2           0.0041            1        0.00      0.9490
Output 62.2.7. ROBUSTREG Output

Obs  T1  T2  time    resid    stdres
  1   0   0  20.2  -1.7974  -0.50851
  2   0   0  23.9   1.9026   0.53827
  3   0   0  21.9  -0.0974  -0.02756
  4   0   0  42.4  20.4026   5.77222
  5   1   0  27.2  -1.8900  -0.53472
  6   1   0  34.0   4.9100   1.38911
  7   1   0  27.4  -1.6900  -0.47813
  8   1   0  28.5  -0.5900  -0.16693
  9   0   1  25.9  -4.0348  -1.14152
 10   0   1  34.5   4.5652   1.29156
 11   0   1  25.1  -4.8348  -1.36785
 12   0   1  34.2   4.2652   1.20668
 13   1   1  35.0  -1.7655  -0.49950
 14   1   1  33.9  -2.8655  -0.81070
 15   1   1  38.3   1.5345   0.43413
 16   1   1  39.9   3.1345   0.88679
Example 62.3. Growth Study of De Long and Summers
Robust regression and outlier detection techniques have considerable applications to econometrics. The following example from Zaman, Rousseeuw, and Orhan (2001) shows how these techniques substantially improve the ordinary least squares (OLS) results for the growth study of De Long and Summers. De Long and Summers (1991) studied the national growth of 61 countries from 1960 to 1985 using OLS.

   data growth;
      input country $ GDP LFG EQP NEQ GAP @@;
      datalines;
   Argentin  0.0089 0.0118 0.0214 0.2286 0.6079
   Austria   0.0332 0.0014 0.0991 0.1349 0.5809
   Belgium   0.0256 0.0061 0.0684 0.1653 0.4109
   Bolivia   0.0124 0.0209 0.0167 0.1133 0.8634
   Botswana  0.0676 0.0239 0.1310 0.1490 0.9474
   Brazil    0.0437 0.0306 0.0646 0.1588 0.8498
   Cameroon  0.0458 0.0169 0.0415 0.0885 0.9333
   Canada    0.0169 0.0261 0.0771 0.1529 0.1783
   Chile     0.0021 0.0216 0.0154 0.2846 0.5402
   Colombia  0.0239 0.0266 0.0229 0.1553 0.7695
   CostaRic  0.0121 0.0354 0.0433 0.1067 0.7043
   Denmark   0.0187 0.0115 0.0688 0.1834 0.4079
   Dominica  0.0199 0.0280 0.0321 0.1379 0.8293
   Ecuador   0.0283 0.0274 0.0303 0.2097 0.8205
   ElSalvad  0.0046 0.0316 0.0223 0.0577 0.8414
   Ethiopia  0.0094 0.0206 0.0212 0.0288 0.9805
   Finland   0.0301 0.0083 0.1206 0.2494 0.5589
   France    0.0292 0.0089 0.0879 0.1767 0.4708
   Germany   0.0259 0.0047 0.0890 0.1885 0.4585
   Greece    0.0446 0.0044 0.0655 0.2245 0.7924
   Guatemal  0.0149 0.0242 0.0384 0.0516 0.7885
   Honduras  0.0148 0.0303 0.0446 0.0954 0.8850
   HongKong  0.0484 0.0359 0.0767 0.1233 0.7471
   India     0.0115 0.0170 0.0278 0.1448 0.9356
   Indonesi  0.0345 0.0213 0.0221 0.1179 0.9243
   Ireland   0.0288 0.0081 0.0814 0.1879 0.6457
   Israel    0.0452 0.0305 0.1112 0.1788 0.6816
   Italy     0.0362 0.0038 0.0683 0.1790 0.5441
   IvoryCoa  0.0278 0.0274 0.0243 0.0957 0.9207
   Jamaica   0.0055 0.0201 0.0609 0.1455 0.8229
   Japan     0.0535 0.0117 0.1223 0.2464 0.7484
   Kenya     0.0146 0.0346 0.0462 0.1268 0.9415
   Korea     0.0479 0.0282 0.0557 0.1842 0.8807
   Luxembou  0.0236 0.0064 0.0711 0.1944 0.2863
   Madagasc -0.0102 0.0203 0.0219 0.0481 0.9217
   Malawi    0.0153 0.0226 0.0361 0.0935 0.9628
   Malaysia  0.0332 0.0316 0.0446 0.1878 0.7853
   Mali      0.0044 0.0184 0.0433 0.0267 0.9478
   Mexico    0.0198 0.0349 0.0273 0.1687 0.5921
   Morocco   0.0243 0.0281 0.0260 0.0540 0.8405
   Netherla  0.0231 0.0146 0.0778 0.1781 0.3605
   Nigeria  -0.0047 0.0283 0.0358 0.0842 0.8579
   Norway    0.0260 0.0150 0.0701 0.2199 0.3755
   Pakistan  0.0295 0.0258 0.0263 0.0880 0.9180
   Panama    0.0295 0.0279 0.0388 0.2212 0.8015
   Paraguay  0.0261 0.0299 0.0189 0.1011 0.8458
   Peru      0.0107 0.0271 0.0267 0.0933 0.7406
   Philippi  0.0179 0.0253 0.0445 0.0974 0.8747
   Portugal  0.0318 0.0118 0.0729 0.1571 0.8033
   Senegal  -0.0011 0.0274 0.0193 0.0807 0.8884
   Spain     0.0373 0.0069 0.0397 0.1305 0.6613
   SriLanka  0.0137 0.0207 0.0138 0.1352 0.8555
   Tanzania  0.0184 0.0276 0.0860 0.0940 0.9762
   Thailand  0.0341 0.0278 0.0395 0.1412 0.9174
   Tunisia   0.0279 0.0256 0.0428 0.0972 0.7838
   U.K.      0.0189 0.0048 0.0694 0.1132 0.4307
   U.S.      0.0133 0.0189 0.0762 0.1356 0.0000
   Uruguay   0.0041 0.0052 0.0155 0.1154 0.5782
   Venezuel  0.0120 0.0378 0.0340 0.0760 0.4974
   Zambia   -0.0110 0.0275 0.0702 0.2012 0.8695
   Zimbabwe  0.0110 0.0309 0.0843 0.1257 0.8875
   ;
The regression equation they used is

   GDP = \beta_0 + \beta_1 \, LFG + \beta_2 \, GAP + \beta_3 \, EQP + \beta_4 \, NEQ + \epsilon

where the response variable is the growth in gross domestic product per worker (GDP) and the regressors are labor force growth (LFG), relative GDP gap (GAP), equipment investment (EQP), and nonequipment investment (NEQ).

The following statements invoke the REG procedure for the OLS analysis:

   proc reg data=growth;
      model GDP = LFG GAP EQP NEQ;
   run;

Output 62.3.1. OLS Estimates

The REG Procedure
Model: MODEL1
Dependent Variable: GDP
Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            -0.01430         0.01028    -1.39    0.1697
LFG         1            -0.02981         0.19838    -0.15    0.8811
GAP         1             0.02026         0.00917     2.21    0.0313
EQP         1             0.26538         0.06529     4.06    0.0002
NEQ         1             0.06236         0.03482     1.79    0.0787
The OLS analysis of Output 62.3.1 indicates that GAP and EQP have a significant influence on GDP at the 5% level.
The following statements invoke the ROBUSTREG procedure with the default M estimation.

   proc robustreg data=growth;
      model GDP = LFG GAP EQP NEQ / diagnostics leverage;
      output out=robout r=resid sr=stdres;
   run;

Output 62.3.2. Model Fitting Information and Summary Statistics

The ROBUSTREG Procedure
Model Information
Data Set                MYLIB.GROWTH
Dependent Variable      GDP
Number of Covariates    4
Number of Observations  61
Method                  M Estimation

Summary Statistics

Variable       Q1   Median       Q3     Mean  Standard Deviation      MAD
LFG        0.0118   0.0239   0.0281   0.0211             0.00979  0.00949
GAP        0.5796   0.8015   0.8863   0.7258              0.2181   0.1778
EQP        0.0265   0.0433   0.0720   0.0523              0.0296   0.0325
NEQ        0.0956   0.1356   0.1812   0.1399              0.0570   0.0624
GDP        0.0121   0.0231   0.0310   0.0224              0.0155   0.0150
Output 62.3.2 displays model information and summary statistics for variables in the model.

Output 62.3.3. M Estimates

The ROBUSTREG Procedure
Parameter Estimates

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   -0.0247          0.0097    -0.0437     -0.0058        6.53      0.0106
LFG         1    0.1040          0.1867    -0.2619      0.4699        0.31      0.5775
GAP         1    0.0250          0.0086     0.0080      0.0419        8.36      0.0038
EQP         1    0.2968          0.0614     0.1764      0.4172       23.33      <.0001
NEQ         1    0.0885          0.0328     0.0242      0.1527        7.29      0.0069
Scale       1    0.0099
Output 62.3.4. Diagnostics

The ROBUSTREG Procedure

Diagnostics

Obs  Mahalanobis Distance  Robust MCD Distance  Leverage  Standardized Robust Residual  Outlier
  1                2.6083               4.0639      *                          -0.9424
  5                3.4351               6.7391      *                           1.4200
  8                3.1876               4.6843      *                          -0.1972
  9                3.6752               5.0599      *                          -1.8784
 17                2.6024               3.8186      *                          -1.7971
 23                2.1225               3.8238      *                           1.7161
 27                2.6461               5.0336      *                           0.0909
 31                2.9179               4.7140      *                           0.0216
 53                2.2600               4.3193      *                          -1.8082
 57                3.8701               5.4874      *                           0.1448
 58                2.5953               3.9671      *                          -0.0978
 59                2.9239               4.1663      *                           0.3573
 60                1.8562               2.7135                                 -4.9798      *
 61                1.9634               3.9128      *                          -2.5959

Diagnostics Summary

Observation Type  Proportion  Cutoff
Outlier               0.0164  3.0000
Leverage              0.2131  3.3382
Output 62.3.5. Goodness-of-Fit

The ROBUSTREG Procedure

Goodness-of-Fit

Statistic    Value
R-Square    0.3178
AICR       80.2134
BICR       91.5095
Deviance    0.0070
Output 62.3.3 displays the M estimates. Besides GAP and EQP, the robust analysis also indicates that NEQ is significant. This new finding is explained by Output 62.3.4, which shows that Zambia, the sixtieth country in the data, is an outlier. Output 62.3.4 also identifies leverage points based on the robust MCD distances; however, there are no serious high leverage points in this data set. Output 62.3.5 displays robust versions of goodness-of-fit statistics for the model.

The following statements invoke the ROBUSTREG procedure with LTS estimation, which was used by Zaman, Rousseeuw, and Orhan (2001). The results are consistent with those of M estimation.

   proc robustreg method=lts(h=33) fwls data=growth;
      model GDP = LFG GAP EQP NEQ / diagnostics leverage;
      output out=robout r=resid sr=stdres;
   run;
Output 62.3.6. LTS Estimates

The ROBUSTREG Procedure

LTS Profile
Total Number of Observations          61
Number of Squares Minimized           33
Number of Coefficients                 5
Highest Possible Breakdown Value  0.4590

LTS Parameter Estimates

Parameter       DF  Estimate
Intercept        1   -0.0249
LFG              1    0.1123
GAP              1    0.0214
EQP              1    0.2669
NEQ              1    0.1110
Scale (sLTS)     0    0.0076
Scale (Wscale)   0    0.0109
Output 62.3.6 displays the LTS estimates.

Output 62.3.7. Diagnostics and LTS R-Square

The ROBUSTREG Procedure
Diagnostics

Obs  Mahalanobis Distance  Robust MCD Distance  Leverage  Standardized Robust Residual  Outlier
  1                2.6083               4.0639      *                          -1.0715
  5                3.4351               6.7391      *                           1.6574
  8                3.1876               4.6843      *                          -0.2324
  9                3.6752               5.0599      *                          -2.0896
 17                2.6024               3.8186      *                          -1.6367
 23                2.1225               3.8238      *                           1.7570
 27                2.6461               5.0336      *                           0.2334
 31                2.9179               4.7140      *                           0.0971
 53                2.2600               4.3193      *                          -1.2978
 57                3.8701               5.4874      *                           0.0605
 58                2.5953               3.9671      *                          -0.0857
 59                2.9239               4.1663      *                           0.4113
 60                1.8562               2.7135                                 -4.4984      *
 61                1.9634               3.9128      *                          -2.1201

Diagnostics Summary

Observation Type  Proportion  Cutoff
Outlier               0.0164  3.0000
Leverage              0.2131  3.3382

R-Square for LTS Estimation

R-Square  0.7418
Output 62.3.7 displays outlier and leverage point diagnostics based on the LTS estimates.
Output 62.3.8. Final Weighted LS Estimates

The ROBUSTREG Procedure

Parameter Estimates for Final Weighted Least Squares Fit

Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   -0.0222          0.0093    -0.0405     -0.0039        5.65      0.0175
LFG         1    0.0446          0.1771    -0.3026      0.3917        0.06      0.8013
GAP         1    0.0245          0.0082     0.0084      0.0406        8.89      0.0029
EQP         1    0.2824          0.0581     0.1685      0.3964       23.60      <.0001
NEQ         1    0.0849          0.0314     0.0233      0.1465        7.30      0.0069
Scale       0    0.0116
Output 62.3.8 displays the final weighted least squares estimates, which are identical to those reported in Zaman, Rousseeuw, and Orhan (2001).
References

Akaike, H. (1974), "A New Look at the Statistical Identification Model," IEEE Transactions on Automatic Control, 19, 716-723.

Brownlee, K.A. (1965), Statistical Theory and Methodology in Science and Engineering, Second Edition, New York: John Wiley & Sons, Inc.

Chen, C. (2002), "Robust Regression and Outlier Detection with the ROBUSTREG Procedure," Proceedings of the Twenty-seventh Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.

Coleman, D., Holland, P., Kaden, N., Klema, V., and Peters, S.C. (1980), "A System of Subroutines for Iteratively Reweighted Least-Squares Computations," ACM Transactions on Mathematical Software, 6, 327-336.

De Long, J.B. and Summers, L.H. (1991), "Equipment Investment and Economic Growth," Quarterly Journal of Economics, 106, 445-501.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986), Robust Statistics, The Approach Based on Influence Functions, New York: John Wiley & Sons, Inc.

Hawkins, D.M., Bradu, D., and Kass, G.V. (1984), "Location of Several Outliers in Multiple Regression Data Using Elemental Sets," Technometrics, 26, 197-208.

Holland, P. and Welsch, R. (1977), "Robust Regression Using Iteratively Reweighted Least-Squares," Communications in Statistics: Theory and Methods, 6, 813-827.

Huber, P.J. (1973), "Robust Regression: Asymptotics, Conjectures and Monte Carlo," Annals of Statistics, 1, 799-821.

Huber, P.J. (1981), Robust Statistics, New York: John Wiley & Sons, Inc.

Marazzi, A. (1993), Algorithms, Routines, and S Functions for Robust Statistics, Pacific Grove, CA: Wadsworth & Brooks/Cole.

Ronchetti, E. (1985), "Robust Model Selection in Regression," Statistics and Probability Letters, 3, 21-23.
Rousseeuw, P.J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.

Rousseeuw, P.J. and Hubert, M. (1996), "Recent Development in PROGRESS," Computational Statistics and Data Analysis, 21, 67-85.

Rousseeuw, P.J. and Leroy, A.M. (1987), Robust Regression and Outlier Detection, New York: John Wiley & Sons, Inc.

Rousseeuw, P.J. and Van Driessen, K. (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, 41, 212-223.

Rousseeuw, P.J. and Van Driessen, K. (2000), "An Algorithm for Positive-Breakdown Regression Based on Concentration Steps," in Data Analysis: Scientific Modeling and Practical Application, ed. W. Gaul, O. Opitz, and M. Schader, New York: Springer-Verlag, 335-346.

Rousseeuw, P.J. and Yohai, V. (1984), "Robust Regression by Means of S Estimators," in Robust and Nonlinear Time Series Analysis, ed. J. Franke, W. Härdle, and R.D. Martin, Lecture Notes in Statistics, 26, New York: Springer-Verlag, 256-274.

Ruppert, D. (1992), "Computing S Estimators for Regression and Multivariate Location/Dispersion," Journal of Computational and Graphical Statistics, 1, 253-270.

Yohai, V.J. (1987), "High Breakdown Point and High Efficiency Robust Estimates for Regression," Annals of Statistics, 15, 642-656.

Yohai, V.J., Stahel, W.A., and Zamar, R.H. (1991), "A Procedure for Robust Estimation and Inference in Linear Regression," in Stahel, W.A. and Weisberg, S.W., eds., Directions in Robust Statistics and Diagnostics, Part II, New York: Springer-Verlag.

Yohai, V.J. and Zamar, R.H. (1997), "Optimal Locally Robust M Estimate of Regression," Journal of Statistical Planning and Inference, 64, 309-323.

Zaman, A., Rousseeuw, P.J., and Orhan, M. (2001), "Econometric Applications of High-Breakdown Robust Regression Techniques," Economics Letters, 71, 1-8.
Chapter 63
The RSREG Procedure

Chapter Contents

OVERVIEW ...................................................... 4033
   Comparison to Other SAS Software ........................... 4033
   Terminology ................................................ 4034
GETTING STARTED ............................................... 4034
   A Response Surface with a Simple Optimum ................... 4034
SYNTAX ........................................................ 4039
   PROC RSREG Statement ....................................... 4039
   BY Statement ............................................... 4040
   ID Statement ............................................... 4041
   MODEL Statement ............................................ 4041
   RIDGE Statement ............................................ 4043
   WEIGHT Statement ........................................... 4044
DETAILS ....................................................... 4045
   Introduction to Response Surface Experiments ............... 4045
   Coding the Factor Variables ................................ 4047
   Missing Values ............................................. 4048
   Plotting the Surface ....................................... 4048
   Searching for Multiple Response Conditions ................. 4048
   Handling Covariates ........................................ 4050
   Computational Method ....................................... 4050
   Output Data Sets ........................................... 4051
   Displayed Output ........................................... 4052
   ODS Table Names ............................................ 4054
EXAMPLES ...................................................... 4055
   Example 63.1. A Saddle-Surface Response Using Ridge Analysis ... 4055
   Example 63.2. Response Surface Analysis with Covariates ........ 4059
REFERENCES .................................................... 4061
Chapter 63
The RSREG Procedure

Overview
The RSREG procedure uses the method of least squares to fit quadratic response surface regression models. Response surface models are a kind of general linear model in which attention focuses on characteristics of the fit response function and in particular, where optimum estimated response values occur.

In addition to fitting a quadratic function, you can use the RSREG procedure to

• test for lack of fit
• test for the significance of individual factors
• analyze the canonical structure of the estimated response surface
• compute the ridge of optimum response
• predict new values of the response
Comparison to Other SAS Software
Other SAS/STAT procedures can be used to fit the response surface, but the RSREG procedure is more specialized. The following statements model a three-factor response surface in PROC RSREG:

   proc rsreg;
      model y=x1 x2 x3;
   run;

These statements are more compact than the statements for other regression procedures in SAS/STAT software. For example, the equivalent statements for the GLM procedure are

   proc glm;
      model y=x1 x1*x1 x2 x1*x2 x2*x2 x3 x1*x3 x2*x3 x3*x3;
   run;
Additionally, PROC RSREG includes specialized methodology for analyzing the fitted response surface, such as canonical analysis and optimum response ridges. Note that the ADX Interface in SAS/QC software provides an interactive environment for constructing and analyzing many different kinds of experiments, including response surface experiments. The ADX Interface is the preferred interactive SAS
System tool for analyzing experiments, since it includes facilities for checking underlying assumptions and graphically optimizing the response surface. The RSREG procedure is appropriate for analyzing experiments in a batch environment.
Terminology
Variables are referred to according to the following conventions:

factor variables
independent variables used in constructing the quadratic response surface. To estimate the necessary parameters, each variable must have at least three distinct values in the data. Independent variables must be numeric.
response variables
the dependent variables to which the quadratic response surface is fit. Dependent variables must be numeric.
covariates
additional independent variables for use in the regression but not in the formation of the quadratic response surface. Covariates must be numeric.
WEIGHT variable
a variable for weighting the observations in the regression. The WEIGHT variable must be numeric.
ID variables
variables not in the above lists that are transferred to an output data set containing statistics for each observation in the input data set. This data set is created using the OUT= option in the PROC RSREG statement. ID variables can be either character or numeric.
BY variables
variables for grouping observations. Separate analyses are obtained for each BY group. BY variables can be either character or numeric.
Getting Started

A Response Surface with a Simple Optimum
This example uses the three-factor quadratic model discussed in John (1971). Schneider and Stockett (1963) performed an experiment, with several factors, aimed at reducing the unpleasant odor of a chemical. The objective is to minimize the unpleasant odor of the chemical. The following statements read the data.

   title 'Response Surface with a Simple Optimum';
   data smell;
      input Odor T R H @@;
      label T = "Temperature"
            R = "Gas-Liquid Ratio"
            H = "Packing Height";
      datalines;
    66  40 .3 4    39 120 .3 4    43  40 .7 4    49 120 .7 4
    58  40 .5 2    17 120 .5 2    -5  40 .5 6   -40 120 .5 6
    65  80 .3 2     7  80 .7 2    43  80 .3 6   -22  80 .7 6
   -31  80 .5 4   -35  80 .5 4   -26  80 .5 4
   ;
The INPUT statement names the variables contained in the SAS data set smell; the variable Odor is the response, while the variables T, R, and H are the independent factors. The following statements invoke PROC RSREG on the data set smell. Figure 63.1 through Figure 63.3 display the results of the analysis, including a lack-of-fit test requested with the LACKFIT option.

   proc rsreg data=smell;
      model Odor = T R H / lackfit;
   run;

Response Surface with a Simple Optimum

The RSREG Procedure

Coding Coefficients for the Independent Variables

Factor  Subtracted off  Divided by
T            80.000000   40.000000
R             0.500000    0.200000
H             4.000000    2.000000

Response Surface for Variable Odor

Response Mean               15.200000
Root MSE                    22.478508
R-Square                       0.8820
Coefficient of Variation     147.8849

Regression    DF  Type I Sum of Squares  R-Square  F Value  Pr > F
Linear         3            7143.250000    0.3337     4.71  0.0641
Quadratic      3                  11445    0.5346     7.55  0.0264
Crossproduct   3             293.500000    0.0137     0.19  0.8965
Total Model    9                  18882    0.8820     4.15  0.0657

Residual     DF  Sum of Squares  Mean Square  F Value  Pr > F
Lack of Fit   3     2485.750000   828.583333    40.75  0.0240
Pure Error    2       40.666667    20.333333
Total Error   5     2526.416667   505.283333
Figure 63.1. Summary Statistics and Analysis of Variance
Figure 63.1 displays the coding coefficients for the transformation of the independent variables to lie between −1 and 1, simple statistics for the response variable,
hypothesis tests for linear, quadratic, and crossproduct terms, and the lack-of-fit test. The hypothesis tests can be used to gain a rough idea of the importance of the effects; here the crossproduct terms are not significant. However, the lack of fit for the model is significant, so more complicated modeling or further experimentation with additional variables should be performed before firm statements are made concerning the underlying process.

Response Surface with a Simple Optimum

The RSREG Procedure
Parameter  DF      Estimate  Standard Error  t Value  Pr > |t|  Parameter Estimate from Coded Data
Intercept   1    568.958333      134.609816     4.23    0.0083                          -30.666667
T           1     -4.102083        1.489024    -2.75    0.0401                          -12.125000
R           1  -1345.833333      335.220685    -4.01    0.0102                          -17.000000
H           1    -22.166667       29.780489    -0.74    0.4902                          -21.375000
T*T         1      0.020052        0.007311     2.74    0.0407                           32.083333
R*T         1      1.031250        1.404907     0.73    0.4959                            8.250000
R*R         1   1195.833333      292.454665     4.09    0.0095                           47.833333
H*T         1      0.018750        0.140491     0.13    0.8990                            1.500000
H*R         1     -4.375000       28.098135    -0.16    0.8824                           -1.750000
H*H         1      1.520833        2.924547     0.52    0.6252                            6.083333

Factor  DF  Sum of Squares  Mean Square  F Value  Pr > F  Label
T        4     5258.016026  1314.504006     2.60  0.1613  Temperature
R        4           11045  2761.150641     5.46  0.0454  Gas-Liquid Ratio
H        4     3813.016026   953.254006     1.89  0.2510  Packing Height
Figure 63.2. Parameter Estimates and Hypothesis Tests
Parameter estimates and the factor ANOVA are shown in Figure 63.2. Looking at the parameter estimates, you can see that the crossproduct terms are not significantly different from zero, as noted previously. The "Estimate" column contains estimates based on the raw data, and the "Parameter Estimate from Coded Data" column contains those based on the coded data. The factor ANOVA table displays tests for all four parameters corresponding to each factor: the parameters for the linear effect, the quadratic effect, and the effects of the crossproducts with each of the other two factors. The only factor with a significant overall effect is R, indicating that the level of noise left unexplained by the model is still too high to estimate the effects of T and H accurately. This may be due to the lack of fit.
Response Surface with a Simple Optimum

The RSREG Procedure

Canonical Analysis of Response Surface Based on Coded Data

        Critical Value
Factor     Coded     Uncoded  Label
T       0.121913   84.876502  Temperature
R       0.199575    0.539915  Gas-Liquid Ratio
H       1.770525    7.541050  Packing Height

Predicted value at stationary point: -52.024631

                     Eigenvectors
Eigenvalues          T          R          H
  48.858807   0.238091   0.971116  -0.015690
  31.103461   0.970696  -0.237384   0.037399
   6.037732  -0.032594   0.024135   0.999177

Stationary point is a minimum.
Figure 63.3. Canonical Analysis and Eigenvectors
Figure 63.3 contains the canonical analysis and eigenvectors. The canonical analysis indicates that the directions of principal orientation for the predicted response surface are along the axes associated with the three factors, confirming the small interaction effect in the regression ANOVA. The largest eigenvalue (48.8588) corresponds to the eigenvector {0.238091, 0.971116, −0.015690}, the largest component of which (0.971116) is associated with R; similarly, the second largest eigenvalue (31.1035) is associated with T. The third eigenvalue (6.0377), associated with H, is quite a bit smaller than the other two, indicating that the response surface is relatively insensitive to changes in this factor.

The coded form of the canonical analysis indicates that the estimated response surface is at a minimum when T and R are both near the middle of their respective ranges and H is relatively high; in uncoded terms, the model predicts that the unpleasant odor is minimized when T = 84.876502, R = 0.539915, and H = 7.541050.

To plot the response surface with respect to two of the factor variables, first fix H, the least significant factor variable, at its estimated optimum value and generate a grid of points for T and R. To ensure that the grid data do not affect parameter estimates, the response variable (Odor) is set to missing. (See the "Missing Values" section on page 4048.) The following statements produce and graph the necessary data. Initial DATA steps create a grid over T and R, with H set to a constant value, and combine this grid with the original data. Then PROC RSREG is used to create predictions for the combined data. Finally, PROC G3D is used to create a surface plot of the predictions.
   data grid;
      do;
         Odor = . ;
         H = 7.541;
         do T = 20 to 140 by 5;
            do R = .1 to .9 by .05;
               output;
            end;
         end;
      end;

   data grid;
      set smell grid;
   run;

   proc rsreg data=grid out=predict noprint;
      model Odor = T R H / predict;
   run;

   data plot;
      set predict;
      if H = 7.541;

   proc g3d data=plot;
      plot T*R=Odor / rotate=38 tilt=75 xticknum=3 yticknum=3
                      zmax=300 zmin=-60 ctop=red cbottom=blue caxis=black;
   run;
The first DATA step creates grid points for T and R at H=7.541 and sets Odor to missing, and the second DATA step concatenates these grid points with the original data. Predicted values are created in the SAS data set predict by invoking the RSREG procedure with the PREDICT option in the MODEL statement. The analysis is not displayed due to the NOPRINT option. The third DATA step subsets the predicted values over just the grid points (excluding the predictions at the original points). PROC G3D is then used to create the three-dimensional plot shown in Figure 63.4.
Figure 63.4. The Response Surface Obtained from the PREDICT Option
Syntax
The following statements are available in PROC RSREG.
   PROC RSREG < options > ;
      MODEL responses = independents < / options > ;
      RIDGE < options > ;
      WEIGHT variable ;
      ID variables ;
      BY variables ;

The PROC RSREG and MODEL statements are required. The BY, ID, MODEL, RIDGE, and WEIGHT statements are described after the PROC RSREG statement, and they can appear in any order.
PROC RSREG Statement

   PROC RSREG < options > ;

The PROC RSREG statement invokes the procedure. You can specify the following options in the PROC RSREG statement.
DATA=SAS-data-set
   specifies the input SAS data set that contains the data to be analyzed. By default, PROC RSREG uses the most recently created SAS data set.

NOPRINT
   suppresses the normal display of results when only the output data set is required. For more information, see the description of the NOPRINT option in the "MODEL Statement" and "RIDGE Statement" sections. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 14, "Using the Output Delivery System," for more information.

OUT=SAS-data-set
   creates an output SAS data set that contains statistics for each observation in the input data set. In particular, this data set contains the BY variables, the ID variables, the WEIGHT variable, the variables in the MODEL statement, and the output options requested in the MODEL statement. You must specify output options in the MODEL statement; otherwise, the output data set is created but contains no observations. To create a permanent SAS data set, you must specify a two-level name (refer to the discussion in SAS Language Reference: Concepts for more information on permanent SAS data sets). For details on the data set created by PROC RSREG, see the "Output Data Sets" section on page 4051.
BY Statement

   BY variables ;

You can specify a BY statement with PROC RSREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives (a sketch of the first appears after this list):

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the RSREG procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure (in base SAS software).

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
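As promised, here is a sketch of the first alternative, with a hypothetical data set and BY variable:

   proc sort data=experiments;
      by batch;
   run;

   proc rsreg data=experiments;
      by batch;
      model y = x1 x2 x3;
   run;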
ID Statement

   ID variables ;

The ID statement names variables that are to be transferred to the data set created by the OUT= option in the PROC RSREG statement.
MODEL Statement

   MODEL responses = independents < / options > ;

The MODEL statement lists response (dependent) variables followed by an equal sign and then lists independent variables, some of which may be covariates.

The output options to the MODEL statement specify which statistics are output to the data set created using the OUT= option in the PROC RSREG statement. If none of the options are selected, the data set is created but contains no observations. The option keywords become values of the special variable _TYPE_ in the output data set. Any of the following options can be specified.

Task                              Options
Analyze Original Data             NOCODE
Fit Model to First BY Group Only  BYOUT
Declare Covariates                COVAR=
Request Additional Statistics     PRESS
Request Additional Tests          LACKFIT
Suppress Displayed Output         NOANOVA, NOOPTIMAL, NOPRINT
Output Statistics                 ACTUAL, PREDICT, RESIDUAL, L95, U95, L95M, U95M, D
ACTUAL
   specifies that the observed response values from the input data set be written to the output data set.

BYOUT
   uses only the first BY group to estimate the model. Subsequent BY groups have scoring statistics computed in the output data set only. The BYOUT option is used only when a BY statement is specified.
COVAR=n
   declares that the first n variables on the right-hand side of the model are simple linear regressors (covariates) and not factors in the quadratic response surface. By default, PROC RSREG forms quadratic and crossproduct effects for all regressor variables in the MODEL statement. See the "Handling Covariates" section on page 4050 for more details and Example 63.2 on page 4059 for an example using covariates. A short usage sketch appears after this option list.
specifies that Cook’s D influence statistic be written to the output data set. See Chapter 2, “Introduction to Regression Procedures,” for details and formulas. LACKFIT
performs a lack-of-fit test. Refer to Draper and Smith (1981) for a discussion of lack-of-fit tests. L95
specifies that the lower bound of a 95% confidence interval for an individual predicted value be written to the output data set. The variance used in calculating this bound is a function of both the mean square error and the variance of the parameter estimates. See Chapter 2 for details and formulas. L95M
specifies that the lower bound of a 95% confidence interval for the expected value of the dependent variable be written to the output data set. The variance used in calculating this bound is a function of the variance of the parameter estimates. See Chapter 2 for details and formulas. NOANOVA NOAOV
suppresses the display of the analysis of variance and parameter estimates from the model fit. NOCODE
performs the canonical and ridge analyses with the parameter estimates derived from fitting the response to the original values of the factors variables, rather than their coded values (see the “Coding the Factor Variables” section on page 4047 for more details.) Use this option if the data are already stored in a coded form. NOOPTIMAL NOOPT
suppresses the display of the canonical analysis for the quadratic response surface. NOPRINT
suppresses the display of both the analysis of variance and the canonical analysis. PREDICT
specifies that the values predicted by the model be written to the output data set. PRESS
computes and displays the predicted residual sum of squares (PRESS) statistic for each dependent variable in the model. The PRESS statistic is added to the summary information at the beginning of the analysis of variance, so if the NOANOVA or
RIDGE Statement
NOPRINT option is specified, PRESS has no effect. See Chapter 2 for details and formulas. RESIDUAL
specifies that the residuals, calculated as ACTUAL − PREDICTED, be written to the output data set. U95
specifies that the upper bound of a 95% confidence interval for an individual predicted value be written to the output data set. The variance used in calculating this bound is a function of both the mean square error and the variance of the parameter estimates. See Chapter 2 for details and formulas. U95M
specifies that the upper bound of a 95% confidence interval for the expected value of the dependent variable be written to the output data set. The variance used in calculating this bound is a function of the variance of the parameter estimates. See Chapter 2 for details and formulas.
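As a quick illustration of how several of these options combine, the following sketch assumes a hypothetical data set exp with a response y, two factors x1 and x2, and a blocking covariate block (all names are placeholders, not part of this chapter's examples):

   proc rsreg data=exp out=stats;
      model y = block x1 x2 / covar=1 lackfit predict residual;
   run;

Here block enters the model only as a linear covariate because of COVAR=1, the lack-of-fit test is requested, and the predicted values and residuals are written to the OUT= data set stats.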
RIDGE Statement

RIDGE < options > ;

A RIDGE statement computes the ridge of optimum response. The ridge starts at a given point x0, and the point on the ridge at radius r from x0 is the collection of factor settings that optimizes the predicted response at this radius. You can think of the ridge as climbing or falling as fast as possible on the surface of predicted response. Thus, the ridge analysis can be used as a tool to help interpret an existing response surface or to indicate the direction in which further experimentation should be performed.

The default starting point, x0, has each coordinate equal to the point midway between the highest and lowest values of the factor in the design. The default radii at which the ridge is computed are 0, 0.1, ..., 0.9, 1. If, as usual, the ridge analysis is based on the response surface fit to coded values for the factor variables (see the "Coding the Factor Variables" section on page 4047 for details), then this results in a ridge that starts at the point with a coded zero value for each coordinate and extends toward, but not beyond, the edge of the range of experimentation. Alternatively, both the center point for the ridge and the radii at which it is to be computed can be specified. You can specify the following options in the RIDGE statement:

CENTER=uncoded-factor-values
gives the coordinates of the point x0 from which to begin the ridge. The coordinates should be given in the original (uncoded) factor variable values and should be separated by commas. There must be as many coordinates specified as there are factors in the model, and the order of the coordinates must be the same as that used in the MODEL statement. This starting point should be well inside the range of experimentation. The default sets each coordinate equal to the value midway between the highest and lowest values for the associated factor.
MAXIMUM
MAX
computes the ridge of maximum response. Both the MIN and MAX options can be specified; at least one must be specified.

MINIMUM
MIN
computes the ridge of minimum response. Both the MIN and MAX options can be specified; at least one must be specified.

NOPRINT
suppresses the display of the ridge analysis when only an output data set is required.

OUTR=SAS-data-set
creates an output SAS data set containing the computed optimum ridge. For details, see the "Output Data Sets" section on page 4051.

RADIUS=coded-radii
gives the distances from the ridge starting point at which to compute the optimum. The values in the list represent distances between coded points. The list can take any of the following forms or can be composed of mixtures of them:

   m1, m2, ..., mn    several values

   m TO n             a sequence where m equals the starting value, n equals the
                      ending value, and the increment equals 1

   m TO n BY i        a sequence where m equals the starting value, n equals the
                      ending value, and i equals the increment

Mixtures of the preceding forms should be separated by commas. The default list runs from 0 to 1 by increments of 0.1. The following are examples of valid lists.

   radius=0 to 5 by .5;
   radius=0, .2, .25, .3, .5 to 1.0 by .1;
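Putting these options together, a RIDGE statement for a hypothetical two-factor model might look like the following sketch (the data set exp, the factor values in CENTER=, and the model variables are illustrative assumptions):

   proc rsreg data=exp;
      model y = x1 x2;
      ridge max min center=10, 150 radius=0 to 2 by .25 outr=ridgeout;
   run;

This requests both the maximum and minimum ridges, starts them at the uncoded point (10, 150), evaluates them at coded radii 0, 0.25, ..., 2, and saves the results in the data set ridgeout.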
WEIGHT Statement

WEIGHT variable ;

When a WEIGHT statement is used, a weighted residual sum of squares

   $\sum_i w_i (y_i - \hat{y}_i)^2$

is minimized, where $w_i$ is the value of the variable specified in the WEIGHT statement, $y_i$ is the observed value of the response variable, and $\hat{y}_i$ is the predicted value of the response variable. The observation is used in the analysis only if the value of the WEIGHT statement variable is greater than zero. The WEIGHT statement has no effect on degrees of freedom or number of observations. If the weights for the observations are proportional to the reciprocals of the error variances, then the weighted least-squares estimates are best linear unbiased estimators (BLUE).
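A minimal sketch of the statement in context, assuming a hypothetical data set exp that contains a weight variable wt:

   proc rsreg data=exp;
      weight wt;
      model y = x1 x2;
   run;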
Details

Introduction to Response Surface Experiments

Many industrial experiments are conducted to discover which values of given factor variables optimize a response. If each factor is measured at three or more values, a quadratic response surface can be estimated by least-squares regression. The predicted optimal value can be found from the estimated surface if the surface is shaped like a simple hill or a valley. If the estimated surface is more complicated, or if the predicted optimum is far from the region of experimentation, then the shape of the surface can be analyzed to indicate the directions in which new experiments should be performed.

Suppose that a response variable y is measured at combinations of values of two factor variables, x1 and x2. The quadratic response-surface model for this variable is written as

   $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \epsilon$

The steps in the analysis for such data are

1. model fitting and analysis of variance to estimate parameters
2. canonical analysis to investigate the shape of the predicted response surface
3. ridge analysis to search for the region of optimum response
Model Fitting and Analysis of Variance

The first task in analyzing the response surface is to estimate the parameters of the model by least-squares regression and to obtain information about the fit in the form of an analysis of variance. The estimated surface is typically curved: a "hill" whose peak occurs at the unique estimated point of maximum response, a "valley," or a "saddle-surface" with no unique minimum or maximum. Use the results of this phase of the analysis to answer the following questions:

• What is the contribution of each type of effect (linear, quadratic, and crossproduct) to the statistical fit? The ANOVA table with sources labeled "Regression" addresses this question.

• What part of the residual error is due to lack of fit? Does the quadratic response model adequately represent the true response surface? If you specify the LACKFIT option in the MODEL statement, then the ANOVA table with sources labeled "Residual" addresses this question.

• What is the contribution of each factor variable to the statistical fit? Can the response be predicted as well if the variable is removed? The ANOVA table with sources labeled "Factor" addresses this question.

• What are the predicted responses for a grid of factor values? (See the section "Plotting the Surface" on page 4048 and the "Searching for Multiple Response Conditions" section on page 4048.)
Lack-of-Fit Test

The test for lack-of-fit compares the variation around the model with "pure" variation within replicated observations. This measures the adequacy of the quadratic response surface model. In particular, if there are $n_i$ replicated observations $Y_{i1}, \ldots, Y_{in_i}$ of the response all at the same values $x_i$ of the factors, then we can predict the true response at $x_i$ either by using the predicted value $\hat{Y}_i$ based on the model or by using the mean $\bar{Y}_i$ of the replicated values. The test for lack-of-fit decomposes the residual error into a component due to the variation of the replications around their mean value (the "pure" error), and a component due to the variation of the mean values around the model prediction (the "bias" error):

   $\sum_i \sum_{j=1}^{n_i} (Y_{ij} - \hat{Y}_i)^2 = \sum_i \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 + \sum_i n_i (\bar{Y}_i - \hat{Y}_i)^2$
If the model is adequate, then both components estimate the nominal level of error; however, if the bias component of error is much larger than the pure error, then this constitutes evidence that there is significant lack of fit. If some observations in your design are replicated, you can test for lack of fit by specifying the LACKFIT option in the MODEL statement. Note that, since all other tests use total error rather than pure error, you may want to hand-calculate the tests with respect to pure error if the lack-of-fit is significant. On the other hand, significant lack-of-fit indicates the quadratic model is inadequate, so if this is a problem you can also try to refine the model, possibly using PROC GLM for general polynomial modeling; refer to Chapter 32, “The GLM Procedure,” for more information. Example 63.1 on page 4055 illustrates the use of the LACKFIT option.
Canonical Analysis

The second task in analyzing the response surface is to examine the overall shape of the curve and determine whether the estimated stationary point is a maximum, a minimum, or a saddle point. The canonical analysis can be used to answer the following questions:

• Is the surface shaped like a hill, a valley, a saddle surface, or a flat surface?

• If there is a unique optimum combination of factor values, where is it?

• To which factor or factors are the predicted responses most sensitive?

The eigenvalues and eigenvectors in the matrix of second-order parameters characterize the shape of the response surface. The eigenvectors point in the directions of principal orientation for the surface, and the signs and magnitudes of the associated eigenvalues give the shape of the surface in these directions. Positive eigenvalues indicate directions of upward curvature, and negative eigenvalues indicate directions of downward curvature. The larger an eigenvalue is in absolute value, the more pronounced is the curvature of the response surface in the associated direction. Often, all of the coefficients of an eigenvector except for one are relatively small, indicating that the vector points roughly along the axis associated with the factor corresponding to the single large coefficient. In this case, the canonical analysis can be used to determine the relative sensitivity of the predicted response surface to variations in that factor. (See the "Getting Started" section on page 4034 for an example.)
Ridge Analysis

If the estimated surface is found to have a simple optimum well within the range of experimentation, the analysis performed by the preceding two steps may be sufficient. In more complicated situations, further search for the region of optimum response is required. The method of ridge analysis computes the estimated ridge of optimum response for increasing radii from the center of the original design. The ridge analysis answers the following question:

• If there is not a unique optimum of the response surface within the range of experimentation, in which direction should further searching be done in order to locate the optimum?

You can use the RIDGE statement to compute the ridge of maximum or minimum response.
Coding the Factor Variables

For the results of the canonical and ridge analyses to be interpretable, the values of different factor variables should be comparable. This is because the canonical and ridge analyses of the response surface are not invariant with respect to differences in scale and location of the factor variables. The analysis of variance is not affected by these changes. Although the actual predicted surface does not change, its parameterization does. The usual solution to this problem is to code each factor variable so that its minimum in the experiment is −1 and its maximum is 1 and to carry through the analysis with the coded values instead of the original ones. This practice has the added benefit of making 1 a reasonable boundary radius for the ridge analysis since 1 represents approximately the edge of the experimental region. By default, PROC RSREG computes the linear transformation to perform this coding as the data are initially read in, and the canonical and ridge analyses are performed on the model fit to the coded data. The actual form of the coding operation for each value of a variable is

   coded value = (original value − M) / S

where M is the average of the highest and lowest values for the variable in the design and S is half their difference.
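For example, in the MBT experiment of Example 63.1, Time ranges from 4 to 20 hours, so M = 12 and S = 8, and an uncoded reaction time of 20 hours is coded as (20 − 12)/8 = 1.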
Missing Values

If an observation has missing data for any of the variables used by the procedure, then that observation is not used in the estimation process. If one or more response variables are missing, but no factor or covariate variables are missing, then predicted values and confidence limits are computed for the output data set, but the residual and Cook's D statistic are missing.
Plotting the Surface

You can generate predicted values for a grid of points with the PREDICT option (see the "Getting Started" section on page 4034 for an example) and then use these values to create a contour plot or a three-dimensional plot of the response surface over a two-dimensional grid. Any two factor variables can be chosen to form the grid for the plot. Several plots can be generated by using different pairs of factor variables.
Searching for Multiple Response Conditions

Suppose you want to find the factor setting that produces responses in a certain region. For example, you have the following data with two factors and three responses:

   data a;
      input x1 x2 y1 y2 y3;
      datalines;
   -1      -1      1.8   1.940   3.6398
   -1       1      2.6   1.843   4.9123
    1      -1      5.4   1.063   6.0128
    1       1      0.7   1.639   2.3629
    0       0      8.5   0.134   9.0910
    0       0      3.0   0.545   3.7349
    0       0      9.8   0.453  10.4412
    0       0      4.1   1.117   5.0042
    0       0      4.8   1.690   6.6245
    0       0      5.9   1.165   6.9420
    0       0      7.3   1.013   8.7442
    0       0      9.3   1.179  10.2762
    1.4142  0      3.9   0.945   5.0245
   -1.4142  0      1.7   0.333   2.4041
    0       1.4142 3.0   1.869   5.2695
    0      -1.4142 5.7   0.099   5.4346
   ;

You want to find the values of x1 and x2 that maximize y1 subject to y2<2 and an upper bound on y3. First, use a DATA step to append a grid of points over the factor region, setting the responses to missing so that the grid does not influence the model fit:
   data b;
      set a end=eof;
      output;
      if eof then do;
         y1=.; y2=.; y3=.;
         do x1=-2 to 2 by .1;
            do x2=-2 to 2 by .1;
               output;
            end;
         end;
      end;
   run;
Next, use PROC RSREG to fit a response surface model to the data and to compute predicted values for both the observed data and the grid, putting the predicted values in a data set c.

   proc rsreg data=b out=c;
      model y1 y2 y3=x1 x2 / predict;
   run;
Finally, find the subset of predicted values that satisfy the constraints, sort by the unconstrained variable, and display the top five predictions.

   data d;
      set c;
      if y2<2;
      if y3
The final results are displayed in Figure 63.5. They indicate that optimal values of the factors are around 0.3 for x1 and around -0.5 for x2.

   Obs    x1     x2    _TYPE_        y1        y2        y3    i
     1    0.3   -0.5   PREDICT   6.92570   0.75784   7.60471   1
     2    0.3   -0.6   PREDICT   6.91424   0.74174   7.54194   2
     3    0.3   -0.4   PREDICT   6.91003   0.77870   7.64341   3
     4    0.4   -0.6   PREDICT   6.90769   0.73357   7.51836   4
     5    0.4   -0.5   PREDICT   6.90540   0.75135   7.56883   5

Figure 63.5. Top Five Predictions
Handling Covariates

Covariate regressors are added to a response surface model because they are believed to account for a sizable yet relatively uninteresting portion of the variation in the data. What the experimenter is really interested in is the response corrected for the effect of the covariates. A common example is the block effect in a block design. In the canonical and ridge analyses of a response surface, which estimate responses at hypothetical levels of the factor variables, the actual value of the predicted response is computed using the average values of the covariates. The estimated response values do optimize the estimated surface of the response corrected for covariates, but true prediction of the response requires actual values for the covariates. You can use the COVAR= option in the MODEL statement to include covariates in the response surface model. Example 63.2 on page 4059 illustrates the use of this option.
Computational Method

Canonical Analysis

For each response variable, the model can be written in the form

   $y_i = x_i' A x_i + b' x_i + c' z_i + \epsilon_i$

where

   $y_i$           is the ith observation of the response variable.
   $x_i$           $= (x_{i1}, x_{i2}, \ldots, x_{ik})'$ are the k factor variables for the ith observation.
   $z_i$           $= (z_{i1}, z_{i2}, \ldots, z_{iL})'$ are the L covariates, including the intercept term.
   $A$             is the k × k symmetrized matrix of quadratic parameters, with diagonal elements equal to the coefficients of the pure quadratic terms in the model and off-diagonal elements equal to half the coefficient of the corresponding cross product.
   $b$             is the k × 1 vector of linear parameters.
   $c$             is the L × 1 vector of covariate parameters, one of which is the intercept.
   $\epsilon_i$    is the error associated with the ith observation. Tests performed by PROC RSREG assume that errors are independently and normally distributed with mean zero and variance $\sigma^2$.
The parameters in A, b, and c are estimated by least squares. To optimize y with respect to x, take partial derivatives, set them to zero, and solve:

   $\frac{\partial y}{\partial x} = 2x'A + b' = 0 \quad \Longrightarrow \quad x = -\frac{1}{2} A^{-1} b$
You can determine if the solution is a maximum or minimum by looking at the eigenvalues of A:

   If the eigenvalues...    then the solution is...
   are all negative         a maximum
   are all positive         a minimum
   have mixed signs         a saddle point
   contain zeros            in a flat area
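To make this computation concrete, the following PROC IML sketch reproduces the stationary point and eigenvalues for the MBT surface of Example 63.1 later in this chapter; the matrix A and vector b are assembled from the parameter estimates for the coded data shown in Output 63.1.2:

   proc iml;
      /* Coded estimates: diagonal = pure quadratic terms,
         off-diagonal = half the crossproduct coefficient    */
      A = {1.384394 -3.609023, -3.609023 -8.852519};
      b = {-1.014287, -8.676768};      /* coded linear terms */
      x = -0.5 * inv(A) * b;           /* stationary point   */
      lambda = eigval(A);              /* mixed signs here   */
      print x lambda;
   quit;

The printed stationary point, about (−0.44, −0.31) in coded units, and the eigenvalues 2.53 and −10.00 match Output 63.1.3; the mixed signs identify a saddle point.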
Ridge Analysis

The eigenvector for the largest eigenvalue gives the direction of steepest ascent from the stationary point, if positive, or steepest descent, if negative. The eigenvectors corresponding to small or zero eigenvalues point in directions of relative flatness.

The point on the optimum response ridge at a given radius R from the ridge origin is found by optimizing

   $(x_0 + d)' A (x_0 + d) + b'(x_0 + d)$

over d satisfying $d'd = R^2$, where $x_0$ is the k × 1 vector containing the ridge origin and A and b are as previously discussed. By the method of Lagrange multipliers, the optimal d has the form

   $d = -(A - \mu I)^{-1} (A x_0 + 0.5 b)$

where I is the k × k identity matrix and $\mu$ is chosen so that $d'd = R^2$. There may be several values of $\mu$ that satisfy this constraint; the right one depends on which sort of response ridge is of interest. If you are searching for the ridge of maximum response, then the appropriate $\mu$ is the unique one that satisfies the constraint and is greater than all the eigenvalues of A. Similarly, the appropriate $\mu$ for the ridge of minimum response satisfies the constraint and is less than all the eigenvalues of A. (Refer to Myers and Montgomery (1995) for details.)
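A sketch of this computation for the maximum-response ridge, again using the coded estimates from Example 63.1 (ridge origin $x_0 = 0$): since $\|d(\mu)\|$ decreases continuously to zero as $\mu$ grows past the largest eigenvalue of A, the constraint can be solved by simple bisection on $\mu$.

   proc iml;
      A = {1.384394 -3.609023, -3.609023 -8.852519};
      b = {-1.014287, -8.676768};
      R = 1;                                /* coded radius           */
      start dnorm(mu) global(A, b);         /* ||d|| for a trial mu   */
         return(sqrt(ssq(inv(A - mu*I(2)) * 0.5*b)));
      finish;
      lo = max(eigval(A)) + 1e-8;           /* ||d|| is huge here     */
      hi = lo + 1;
      do while (dnorm(hi) > R); hi = hi + 1; end;
      do iter = 1 to 50;                    /* bisect on mu           */
         mu = (lo + hi) / 2;
         if dnorm(mu) > R then lo = mu; else hi = mu;
      end;
      d = -inv(A - mu*I(2)) * 0.5*b;        /* coded factor settings  */
      print d;   /* about {0.806, -0.591}: Time 18.45, Temp 232.26    */
   quit;

The result agrees with the last row of the ridge analysis in Output 63.1.4.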
Output Data Sets

OUT=SAS-data-set

An output data set containing statistics requested with options in the MODEL statement for each observation in the input data set is created whenever the OUT= option is specified in the PROC RSREG statement. The data set contains the following variables:

• the BY variables

• the ID variables

• the WEIGHT variable

• the independent variables in the MODEL statement
• the variable _TYPE_, which identifies the observation type in the output data set. _TYPE_ is a character variable with a length of eight, and it takes on the values 'ACTUAL', 'PREDICT', 'RESIDUAL', 'U95M', 'L95M', 'U95', 'L95', and 'D', corresponding to the options specified.

• the response variables containing special output values identified by the _TYPE_ variable
All confidence limits use the two-tailed Student’s t value.
OUTR=SAS-data-set

An output data set containing the optimum response ridge is created when the OUTR= option is specified in the RIDGE statement. The data set contains the following variables:

• the current values of the BY variables

• a character variable _DEPVAR_ containing the name of the dependent variable

• a character variable _TYPE_ identifying the type of ridge being computed, MINIMUM or MAXIMUM. If both MAXIMUM and MINIMUM are specified, the data set contains observations for the minimum ridge followed by observations for the maximum ridge.

• a numeric variable _RADIUS_ giving the distance from the ridge starting point

• the values of the model factors at the estimated optimum point at distance _RADIUS_ from the ridge starting point

• a numeric variable _PRED_, which is the estimated expected value of the dependent variable at the optimum

• a numeric variable _STDERR_, which is the standard error of the estimated expected value
Displayed Output

All estimates and hypothesis tests assume that the model is correctly specified and the errors are distributed according to classical statistical assumptions. The output displayed by PROC RSREG includes the following.
Estimation and Analysis of Variance

• The actual form of the coding operation for each value of a variable is

      coded value = (original value − M) / S

  where M is the average of the highest and lowest values for the variable in the design and S is half their difference. The Subtracted off column contains the M values for this formula for each factor variable, and S is found in the Divided by column.
• The summary table for the response variable contains the following information.

  − Response Mean is the mean of the response variable in the sample. When a WEIGHT statement is used, the mean $\bar{y}$ is calculated by

        $\bar{y} = \frac{\sum_i w_i y_i}{\sum_i w_i}$

  − Root MSE estimates the standard deviation of the response variable and is calculated as the square root of the Total Error mean square.

  − The R-Square value is $R^2$, or the coefficient of determination. $R^2$ measures the proportion of the variation in the response that is attributed to the model rather than to random error.

  − The Coefficient of Variation is 100 times the ratio of the Root MSE to the Response Mean.

• A table analyzing the significance of the terms of the regression is displayed. Terms are brought into the regression in four steps: (1) the Intercept and any covariates in the model, (2) Linear terms like X1 and X2, (3) pure Quadratic terms like X1*X1 or X2*X2, and (4) Crossproduct terms like X1*X2.

  − The Degrees of Freedom should be the same as the number of corresponding parameters unless one or more of the parameters are not estimable.

  − Type I Sum of Squares, also called the sequential sums of squares, measure the reduction in the error sum of squares as sets of terms (Linear, Quadratic, and so forth) are added to the model.

  − R-Square measures the portion of total $R^2$ contributed as each set of terms (Linear, Quadratic, and so forth) is added to the model.

  − Each F Value tests the null hypothesis that all parameters in the term are zero using the Total Error mean square as the denominator. This item is a test of a Type I hypothesis, containing the usual F test numerator, conditional on the effects of subsequent variables not being in the model.

  − Pr > F is the significance value or probability of obtaining at least as great an F ratio given that the null hypothesis is true.

• The Total Error Sum of Squares can be partitioned into Lack of Fit and Pure Error. When Lack of Fit is significant, there is variation around the model other than random error (such as cubic effects of the factor variables).

  − The Total Error Mean Square estimates $\sigma^2$, the variance.

  − F Value tests the null hypothesis that the variation is adequately described by random error.

• A table containing the parameter estimates from the model is displayed.

  − The Parameter Estimate column contains the parameter estimates based on the uncoded values of the factor variables. If an effect is a linear combination of previous effects, the parameter for the effect is not estimable. When this happens, the degrees of freedom are zero, the parameter estimate is set to zero, and the estimates and tests on other parameters are conditional on this parameter being zero.
  − The Standard Error column contains the estimated standard deviations of the parameter estimates based on uncoded data.

  − The t Value column contains t values of a test of the null hypothesis that the true parameter is zero when the uncoded values of the factor variables are used.

  − Pr > |T| gives the significance value or probability of a greater absolute t ratio given that the true parameter is zero.

  − The Parameter Estimate from Coded Data column contains the parameter estimates based on the coded values of the factor variables. These are the estimates used in the subsequent canonical and ridge analyses.

• The sums of squares are partitioned by the Factors in the model, and an analysis table is displayed. The test on a factor, say X1, is a joint test on all the parameters involving that factor. For example, the test for X1 tests the null hypothesis that the true parameters for X1, X1*X1, and X1*X2 are all zero.
Canonical Analysis

• The Critical Value columns contain the values of the factor variables that correspond to the stationary point of the fitted response surface. The critical values can be at a minimum, maximum, or saddle point.

• The Eigenvalues and Eigenvectors are from the matrix of quadratic parameter estimates based on the coded data. They characterize the shape of the response surface.
Ridge Analysis

• Coded Radius is the distance from the coded version of the associated point to the coded version of the origin of the ridge. The origin is given by the point at radius zero.

• Estimated Response is the estimated value of the response variable at the associated point. The Standard Error of this estimate is also given. This quantity is useful for assessing the relative credibility of the prediction at a given radius. Typically, this standard error increases rapidly as the ridge moves up to and beyond the design perimeter, reflecting the inherent difficulty of making predictions beyond the range of experimentation.

• Uncoded Factor Values are the values of the uncoded factor variables that give the optimum response at this radius from the ridge origin.
ODS Table Names

PROC RSREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."
Table 63.1. ODS Tables Produced in PROC RSREG

   ODS Table Name       Description                                          Statement
   Coding               Coding coefficients for the independent variables    default
   ErrorANOVA           Error analysis of variance                           default
   FactorANOVA          Factor analysis of variance                          default
   FitStatistics        Overall statistics for fit                           default
   ModelANOVA           Model analysis of variance                           default
   ParameterEstimates   Estimated linear parameters                          default
   Ridge                Ridge analysis for optimum response                  RIDGE
   Spectral             Spectral analysis                                    default
   StationaryPoint      Stationary point of response surface                 default
Examples

Example 63.1. A Saddle-Surface Response Using Ridge Analysis

Frankel (1961) reports an experiment aimed at maximizing the yield of mercaptobenzothiazole (MBT) by varying processing time and temperature. Myers (1976) uses a two-factor model in which the estimated surface does not have a unique optimum. A ridge analysis is used to determine the region in which the optimum lies. The objective is to find the settings of time and temperature in the processing of a chemical that maximize the yield. The following statements read the data and invoke PROC RSREG. These statements produce Output 63.1.1 through Output 63.1.5:

   data d;
      input Time Temp MBT;
      label Time = "Reaction Time (Hours)"
            Temp = "Temperature (Degrees Centigrade)"
            MBT  = "Percent Yield Mercaptobenzothiazole";
      datalines;
    4.0 250 83.8
   20.0 250 81.7
   12.0 250 82.4
   12.0 250 82.9
   12.0 220 84.7
   12.0 280 57.9
   12.0 250 81.2
    6.3 229 81.3
    6.3 271 83.1
   17.7 229 85.3
   17.7 271 72.7
    4.0 250 82.0
   ;

   proc sort;
      by Time Temp;
   run;
   proc rsreg;
      model MBT=Time Temp / lackfit;
      ridge max;
   run;
Output 63.1.1. Coding and Response Variable Information

                          The RSREG Procedure

           Coding Coefficients for the Independent Variables

               Factor    Subtracted off    Divided by
               Time          12.000000       8.000000
               Temp         250.000000      30.000000

   Response Surface for Variable MBT: Percent Yield Mercaptobenzothiazole

               Response Mean                79.916667
               Root MSE                      4.615964
               R-Square                        0.8003
               Coefficient of Variation        5.7760
Output 63.1.2. Analyses of Variance

                          The RSREG Procedure

                         Type I Sum
   Regression      DF    of Squares    R-Square    F Value    Pr > F
   Linear           2    313.585803      0.4899       7.36    0.0243
   Quadratic        2    146.768144      0.2293       3.44    0.1009
   Crossproduct     1     51.840000      0.0810       2.43    0.1698
   Total Model      5    512.193947      0.8003       4.81    0.0410

                         Sum of
   Residual        DF    Squares        Mean Square    F Value    Pr > F
   Lack of Fit      3    124.696053       41.565351      39.63    0.0065
   Pure Error       3      3.146667        1.048889
   Total Error      6    127.842720       21.307120

                                                                     Parameter
                                     Standard                        Estimate from
   Parameter    DF    Estimate       Error         t Value  Pr > |t| Coded Data
   Intercept     1    -545.867976    277.145373     -1.97    0.0964    82.173110
   Time          1       6.872863      5.004928      1.37    0.2188    -1.014287
   Temp          1       4.989743      2.165839      2.30    0.0608    -8.676768
   Time*Time     1       0.021631      0.056784      0.38    0.7164     1.384394
   Temp*Time     1      -0.030075      0.019281     -1.56    0.1698    -7.218045
   Temp*Temp     1      -0.009836      0.004304     -2.29    0.0623    -8.852519

                   Sum of
   Factor   DF     Squares       Mean Square   F Value   Pr > F   Label
   Time      3      61.290957     20.430319       0.96   0.4704   Reaction Time (Hours)
   Temp      3     461.250925    153.750308       7.22   0.0205   Temperature (Degrees Centigrade)
Output 63.1.2 shows that the lack of fit for the model is highly significant. Since the quadratic model does not fit the data very well, firm statements about the underlying process should not be based only on the current analysis. Note from the analysis of variance for the model that the test for the time factor is not significant. If further experimentation is undertaken, it might be best to fix Time at a moderate to high value and to concentrate on the effect of temperature. In the actual experiment discussed here, extra runs were made that confirmed the results of the following analysis.

Output 63.1.3. Canonical Analysis

                          The RSREG Procedure

        Canonical Analysis of Response Surface Based on Coded Data

                  Critical Value
   Factor       Coded        Uncoded     Label
   Time     -0.441758       8.465935     Reaction Time (Hours)
   Temp     -0.309976     240.700718     Temperature (Degrees Centigrade)

   Predicted value at stationary point: 83.741940

                          Eigenvectors
   Eigenvalues         Time          Temp
      2.528816     0.953223     -0.302267
     -9.996940     0.302267      0.953223

   Stationary point is a saddle point.
The canonical analysis (Output 63.1.3) indicates that the predicted response surface is shaped like a saddle. The eigenvalue of 2.5 shows that the valley orientation of the saddle is less curved than the hill orientation, with eigenvalue of −9.99. The coefficients of the associated eigenvectors show that the valley is more aligned with Time and the hill with Temp. Because the canonical analysis resulted in a saddle point, the estimated surface does not have a unique optimum.

Output 63.1.4. Ridge Analysis

                          The RSREG Procedure

          Estimated Ridge of Maximum Response for Variable MBT:
                  Percent Yield Mercaptobenzothiazole

     Coded     Estimated     Standard       Uncoded Factor Values
    Radius      Response        Error         Time          Temp
       0.0     82.173110     2.665023    12.000000    250.000000
       0.1     82.952909     2.648671    11.964493    247.002956
       0.2     83.558260     2.602270    12.142790    244.023941
       0.3     84.037098     2.533296    12.704153    241.396084
       0.4     84.470454     2.457836    13.517555    239.435227
       0.5     84.914099     2.404616    14.370977    237.919138
       0.6     85.390012     2.410981    15.212247    236.624811
       0.7     85.906767     2.516619    16.037822    235.449230
       0.8     86.468277     2.752355    16.850813    234.344204
       0.9     87.076587     3.130961    17.654321    233.284652
       1.0     87.732874     3.648568    18.450682    232.256238
However, the ridge analysis in Output 63.1.4 indicates that maximum yields will result from relatively high reaction times and low temperatures. A contour plot of the predicted response surface, shown in Output 63.1.5, confirms this conclusion.

Output 63.1.5. Contour Plot of Predicted Response Surface
The statements that produce this plot follow. Note that contour and three-dimensional plots can be created interactively using SAS/INSIGHT software or the ADX Interface in SAS/QC software. Initial DATA steps create a grid over Time and Temp and combine this grid with the original data, using a variable flag to indicate the grid. Then, PROC RSREG is used to create predictions for the combined data. Finally, PROC GCONTOUR displays a contour plot of the predictions over just the grid.

   data b;
      set d;
      flag=1;
      MBT=.;
      do Time=0 to 20 by 1;
         do Temp=220 to 280 by 5;
            output;
         end;
      end;

   data c;
      set d b;

   proc rsreg data=c out=e noprint;
      model MBT=Time Temp / predict;
      id flag;
   run;
   axis1 label=(angle=90) minor=none;
   axis2 order=(220 to 280 by 20) minor=none;
   proc gcontour data=e(where=(flag=1));
      plot Time*Temp=MBT / nlevels=12 vaxis=axis1 haxis=axis2
                           nolegend autolabel
                           llevels=2 2 2 1 1 1 1 1 1 1 1 1;
   run;
Example 63.2. Response Surface Analysis with Covariates

One way of viewing covariates is as extra sources of variation in the dependent variable that may mask the variation due to primary factors. This example demonstrates the use of the COVAR= option in PROC RSREG to fit a response surface model to the dependent variable values corrected for the covariates.

You have a chemical process with a yield that you hypothesize to be dependent on three factors: reaction time, reaction temperature, and reaction pressure. You perform an experiment to measure this dependence. You are willing to include up to 20 runs in your experiment, but you can perform no more than 8 runs on the same day, so the design for the experiment is composed of three blocks. Additionally, you know that the grade of raw material for the reaction has a significant impact on the yield. You have no control over this, but you keep track of it. The following statements create a SAS data set containing the results of the experiment:

   data Experiment;
      input Day Grade Time Temp Pressure Yield;
      datalines;
   1 67 -1     -1     -1      32.98
   1 68 -1      1      1      47.04
   1 70  1     -1      1      67.11
   1 66  1      1     -1      26.94
   1 74  0      0      0     103.22
   1 68  0      0      0      42.94
   2 75 -1     -1      1     122.93
   2 69 -1      1     -1      62.97
   2 70  1     -1     -1      72.96
   2 71  1      1      1      94.93
   2 72  0      0      0      93.11
   2 74  0      0      0     112.97
   3 69  1.633  0      0      78.88
   3 67 -1.633  0      0      52.53
   3 68  0      1.633  0      68.96
   3 71  0     -1.633  0      92.56
   3 70  0      0      1.633  88.99
   3 72  0      0     -1.633 102.50
   3 70  0      0      0      82.84
   3 72  0      0      0     103.12
   ;
Your first analysis neglects to take the covariates into account. The following statements use PROC RSREG to fit a response surface to the observed yield, but note that Day and Grade are omitted.

   proc rsreg data=Experiment;
      model Yield = Time Temp Pressure;
   run;
The ANOVA results (shown in Output 63.2.1) indicate that no process variable effects are significantly larger than the background noise.

Output 63.2.1. Analysis of Variance Ignoring Covariates

                          The RSREG Procedure

                         Type I Sum
   Regression      DF    of Squares     R-Square    F Value    Pr > F
   Linear           3    1880.842426      0.1353       0.67    0.5915
   Quadratic        3    2370.438681      0.1706       0.84    0.5023
   Crossproduct     3     241.873250      0.0174       0.09    0.9663
   Total Model      9    4493.154356      0.3233       0.53    0.8226

                         Sum of
   Residual        DF    Squares        Mean Square
   Total Error     10    9405.129724     940.512972
However, when the yields are adjusted for covariate effects of day and grade of raw material, very strong process variable effects are revealed. The following statements produce the ANOVA results in Output 63.2.2. Note that in order to include the effects of the classification factor Day as covariates, you need to create dummy variables indicating each day separately.

   data Experiment;
      set Experiment;
      d1 = (Day = 1);
      d2 = (Day = 2);
      d3 = (Day = 3);

   proc rsreg data=Experiment;
      model Yield = d1-d3 Grade Time Temp Pressure / covar=4;
   run;
Output 63.2.2. Analysis of Variance Including Covariates

                          The RSREG Procedure

                         Type I Sum
   Regression      DF    of Squares     R-Square    F Value    Pr > F
   Covariates       3    13695            0.9854     316957    <.0001
   Linear           3    156.524497       0.0113    3622.53    <.0001
   Quadratic        3    22.989775        0.0017     532.06    <.0001
   Crossproduct     3    23.403614        0.0017     541.64    <.0001
   Total Model     12    13898            1.0000    80413.2    <.0001

                         Sum of
   Residual        DF    Squares      Mean Square
   Total Error      7    0.100820        0.014403
The results show very strong effects due to both the covariates and the process variables.
References

Box, G.E.P. (1954), "The Exploration and Exploitation of Response Surfaces: Some General Considerations," Biometrics, 10, 16.

Box, G.E.P. and Draper, N.R. (1982), "Measures of Lack of Fit for Response Surface Designs and Predictor Variable Transformations," Technometrics, 24, 1–8.

Box, G.E.P. and Draper, N.R. (1987), Empirical Model-Building and Response Surfaces, New York: John Wiley & Sons, Inc.

Box, G.E.P. and Hunter, J.S. (1957), "Multifactor Experimental Designs for Exploring Response Surfaces," Annals of Mathematical Statistics, 28, 195–242.

Box, G.E.P., Hunter, W.G., and Hunter, J.S. (1978), Statistics for Experimenters, New York: John Wiley & Sons, Inc.

Box, G.E.P. and Wilson, K.J. (1951), "On the Experimental Attainment of Optimum Conditions," Journal of the Royal Statistical Society, Ser. B, 13, 1–45.

Cochran, W.G. and Cox, G.M. (1957), Experimental Designs, Second Edition, New York: John Wiley & Sons, Inc.

Draper, N.R. (1963), "Ridge Analysis of Response Surfaces," Technometrics, 5, 469–479.

Draper, N.R. and John, J.A. (1988), "Response Surface Designs for Quantitative and Qualitative Variables," Technometrics, 30 (4), 423–428.

Draper, N.R. and Smith, H. (1981), Applied Regression Analysis, Second Edition, New York: John Wiley & Sons, Inc.

John, P.W.M. (1971), Statistical Design and Analysis of Experiments, New York: Macmillan Publishing Co., Inc.

Mead, R. and Pike, D.J. (1975), "A Review of Response Surface Methodology from a Biometric Point of View," Biometrics, 31, 803.
Meyer, D.C. (1963), "Response Surface Methodology in Education and Psychology," Journal of Experimental Education, 31, 329.

Myers, R.H. (1976), Response Surface Methodology, Blacksburg, VA: Virginia Polytechnic Institute and State University.

Myers, R.H. and Montgomery, D.C. (1995), Response Surface Methodology, New York: John Wiley & Sons, Inc.

Schneider, A.M. and Stockett, A.L. (1963), "An Experiment to Select Optimum Operating Conditions on the Basis of Arbitrary Preference Ratings," Chemical Engineering Progress Symposium Series, 59.
Chapter 64
The SCORE Procedure

Chapter Contents

OVERVIEW ............................................................ 4065
   Raw Data Set ..................................................... 4065
   Scoring Coefficients Data Set .................................... 4065
   Standardization of Raw Data ...................................... 4066

GETTING STARTED ..................................................... 4067

SYNTAX .............................................................. 4071
   PROC SCORE Statement ............................................. 4072
   BY Statement ..................................................... 4073
   ID Statement ..................................................... 4074
   VAR Statement .................................................... 4074

DETAILS ............................................................. 4074
   Missing Values ................................................... 4074
   Regression Parameter Estimates from PROC REG .................... 4074
   Output Data Set .................................................. 4075
   Computational Resources .......................................... 4075

EXAMPLES ............................................................ 4075
   Example 64.1. Factor Scoring Coefficients ........................ 4076
   Example 64.2. Regression Parameter Estimates ..................... 4081
   Example 64.3. Custom Scoring Coefficients ........................ 4086

REFERENCES .......................................................... 4087
Overview

The SCORE procedure multiplies values from two SAS data sets, one containing coefficients (for example, factor-scoring coefficients or regression coefficients) and the other containing raw data to be scored using the coefficients from the first data set. The result of this multiplication is a SAS data set containing linear combinations of the coefficients and the raw data values.

Many statistical procedures output coefficients that PROC SCORE can apply to raw data to produce scores. The new score variable is formed as a linear combination of raw data and scoring coefficients. For each observation in the raw data set, PROC SCORE multiplies the value of a variable in the raw data set by the matching scoring coefficient from the data set of scoring coefficients. This multiplication process is repeated for each variable in the VAR statement. The resulting products are then summed to produce the value of the new score variable. This entire process is repeated for each observation in the raw data set. In other words, PROC SCORE cross multiplies part of one data set with another.
Raw Data Set

The raw data set can contain the original data used to calculate the scoring coefficients, or it can contain an entirely different data set. The raw data set must contain all the variables needed to produce scores. In addition, the scoring coefficients and the variables in the raw data set that are used in scoring must have the same names. See the section "Getting Started" beginning on page 4067.
Scoring Coefficients Data Set

The data set containing scoring coefficients must contain two special variables: the _TYPE_ variable and the _NAME_ or _MODEL_ variable.

• The _TYPE_ variable identifies the observations that contain scoring coefficients.

• The _NAME_ or _MODEL_ variable provides a SAS name for the new score variable.

PROC SCORE first looks for a _NAME_ variable in the SCORE= input data set. If there is such a variable, the variable's value is what SCORE uses to name the new score variable. If the SCORE= data set does not have a _NAME_ variable, then PROC SCORE looks for a _MODEL_ variable.

For example, PROC FACTOR produces an output data set that contains factor-scoring coefficients. In this output data set, the scoring coefficients are identified by _TYPE_='SCORE'. For _TYPE_='SCORE', the _NAME_ variable has values of 'Factor1', 'Factor2', and so forth. PROC SCORE gives the new score variables the names Factor1, Factor2, and so forth.

As another example, the REG procedure produces an output data set that contains parameter estimates. In this output data set, the parameter estimates are identified by _TYPE_='PARMS'. The _MODEL_ variable contains the label used in the MODEL statement in PROC REG, or it uses MODELn if no label is specified. This label is the name PROC SCORE gives to the new score variable.
Standardization of Raw Data

PROC SCORE automatically standardizes or centers the DATA= variables for you, based on information from the original variables and analysis from the SCORE= data set.

If the SCORE= scoring coefficients data set contains observations with _TYPE_='MEAN' and _TYPE_='STD', then PROC SCORE standardizes the raw data before scoring. For example, this type of SCORE= data set can come from PROC PRINCOMP without the COV option.

If the SCORE= scoring coefficients data set contains observations with _TYPE_='MEAN' but _TYPE_='STD' is absent, then PROC SCORE centers the raw data (the means are subtracted) before scoring. For example, this type of SCORE= data set can come from PROC PRINCOMP with the COV option.

If the SCORE= scoring coefficients data set does not contain observations with _TYPE_='MEAN' and _TYPE_='STD', or if you use the NOSTD option, then PROC SCORE does not center or standardize the raw data.

If the SCORE= scoring coefficients are obtained from observations with _TYPE_='USCORE', then PROC SCORE "standardizes" the raw data using the uncorrected standard deviations identified by _TYPE_='USTD', and the means are not subtracted from the raw data. For example, this type of SCORE= data set can come from PROC PRINCOMP with the NOINT option. For more information on _TYPE_='USCORE' scoring coefficients in TYPE=UCORR or TYPE=UCOV output data sets, see Appendix A, "Special SAS Data Sets."

You can use PROC SCORE to score the data that were also used to generate the scoring coefficients, although more typically, scoring results are directly obtained from the OUT= data set in a procedure that computes scoring coefficients.

When scoring new data, it is important to realize that PROC SCORE assumes that the new data have approximately the same scales as the original data. For example, if you specify the COV option with PROC PRINCOMP for the original analysis, the scoring coefficients in the PROC PRINCOMP OUTSTAT= data set are not appropriate for standardized data. With the COV option, PROC PRINCOMP will not output _TYPE_='STD' observations to the OUTSTAT= data set, and PROC SCORE will only subtract the means of the original (not new) variables from the new variables before multiplying. Without the COV option in PROC PRINCOMP, both the original variable means and standard deviations will be in the OUTSTAT= data set, and PROC SCORE will subtract the original variable means from the new variables and divide them by the original variable standard deviations before multiplying.

In general, procedures that output scoring coefficients in their OUTSTAT= data sets provide the necessary information for PROC SCORE to determine the appropriate standardization. However, if you use PROC SCORE with a scoring coefficients data set that you constructed without _TYPE_='MEAN' and _TYPE_='STD' observations, you might have to do the relevant centering or standardization of the new data first. If you do this, you must use the means and standard deviations of the original variables, that is, the variables that were used to generate the coefficients, not the means and standard deviations of the variables to be scored. See the section "Getting Started" on page 4067 for further illustration.
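A minimal sketch of this pattern, assuming hypothetical data sets train (used for the analysis) and new (to be scored), each with variables x1-x3:

   proc princomp data=train outstat=stats;   /* keeps MEAN and STD */
      var x1-x3;
   run;

   proc score data=new score=stats out=scored;
      var x1-x3;
   run;

Because the OUTSTAT= data set contains _TYPE_='MEAN' and _TYPE_='STD' observations, PROC SCORE standardizes the new data with the original means and standard deviations before multiplying by the coefficients.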
Getting Started

The SCORE procedure multiplies the values from two SAS data sets and creates a new data set to contain the results of the multiplication. The variables in the new data set are linear combinations of the variables in the two input data sets. Typically, one of these data sets contains raw data that you want to score, and the other data set contains scoring coefficients.

The following example demonstrates how to use the SCORE procedure to multiply values from two SAS data sets, one containing factor-scoring coefficients and the other containing raw data to be scored using the scoring coefficients.

Suppose you are interested in the performance of three different types of schools: private schools, state-run urban schools, and state-run rural schools. You want to compare the schools' performances as measured by student grades on standard tests in English, mathematics, and biology. You administer these tests and record the scores for each of the three types of schools.

The following DATA step creates the SAS data set Schools. The data are provided by Chaseling (1996).

   data Schools;
      input Type $ English Math Biology @@;
      datalines;
   p 52 55 45   p 42 49 40   p 63 64 54
   p 47 50 51   p 64 69 47   p 63 67 54
   p 59 63 42   p 56 61 41   p 41 44 72
   p 39 42 45   p 56 63 44   p 63 73 42
   p 62 68 46   p 51 61 51   p 45 56 54
   p 63 66 63   p 65 67 57   p 49 50 47
   p 47 48 34   p 53 54 46   p 49 40 43
   p 50 41 50   p 82 72 80   p 68 61 62
   p 68 61 46   p 63 53 48   p 77 72 74
   p 50 47 60   p 61 49 48   p 64 54 45
   p 60 53 40   p 80 69 75   p 76 69 77
   p 55 48 51   p 85 76 80   p 70 64 48
   p 61 51 61   p 51 47 58   p 78 72 79
   p 52 47 46   u 49 47 58   u 64 72 45
   u 36 44 46   u 32 43 46   u 52 57 42
   u 45 47 53   u 44 52 43   u 54 63 42
   u 39 45 49   u 48 51 46   u 53 61 54
   u 28 32 33   u 52 59 44   u 54 61 51
   u 60 65 66   u 60 63 63   u 47 52 49
   u 28 31 32   u 43 46 45   u 40 42 48
   u 66 51 48   u 79 68 77   u 58 52 49
   u 34 29 33   u 47 35 40   u 60 49 49
   u 62 50 51   u 69 50 47   u 59 41 52
   u 56 44 43   u 76 61 74   u 50 36 52
   u 69 56 52   u 57 41 55   u 56 44 51
   u 52 42 42   u 51 36 42   u 44 31 57
   u 79 68 77   u 61 44 41   r 38 28 22
   r 35 28 24   r 50 47 48   r 36 28 38
   r 69 65 53   r 55 44 41   r 62 58 45
   r 57 55 32   r 47 42 66   r 45 38 45
   r 56 55 42   r 39 36 33   r 63 51 42
   r 42 41 48   r 51 44 52   r 47 42 44
   r 53 42 47   r 62 59 48   r 80 74 81
   r 95 79 95   r 65 60 43   r 67 60 53
   r 42 43 50   r 70 68 55   r 63 56 48
   r 37 33 34   r 49 47 49   r 42 43 50
   r 44 46 47   r 62 55 44   r 67 64 52
   r 77 77 69   r 43 42 52   r 51 54 45
   r 67 65 45   r 65 73 49   r 34 29 32
   r 50 47 49   r 55 48 46   r 38 36 51
   ;
The data set Schools contains the character variable Type, which represents the type of school. Valid values are p (private schools), r (state-run rural schools), and u (state-run urban schools). The three numeric variables in the data set are English, Math, and Biology, which represent the student scores for English, mathematics, and biology, respectively. The double trailing at sign (@@) in the INPUT statement specifies that observations are input from each line until all values are read.

The following statements invoke the FACTOR procedure to compute the data set of factor scoring coefficients. The statements perform a principal components factor analysis using all three numeric variables in the SAS data set Schools. The OUTSTAT= option requests that PROC FACTOR output the factor scores to the data set Scores. The NOPRINT option suppresses display of the output.

   proc factor data=Schools score outstat=Scores noprint;
      var english math biology;
   run;

   proc score data=schools score=Scores out=New;
      var english math biology;
      id type;
   run;
The SCORE procedure is then invoked using Schools as the raw data set to be scored and Scores as the scoring data set. The OUT= option creates the SAS data set New to contain the linear combinations. The VAR statement specifies that the variables English, Math, and Biology are used in computing scores. The ID statement copies the variable Type from the Schools data set to the output data set New.

The following statements print the SAS output data set Scores, the first two observations from the original data set Schools, and the first two observations of the resulting data set New.

   title 'OUTSTAT= Data Set from PROC FACTOR';
   proc print data=Scores;
   run;

   title 'First Two Observations of the DATA= Data Set from PROC SCORE';
   proc print data=Schools(obs=2);
   run;

   title 'First Two Observations of the OUT= Data Set from PROC SCORE';
   proc print data=New(obs=2);
   run;
Figure 64.1 displays the output data set Scores produced by the FACTOR procedure. The last observation (observation number 11) contains the scoring coefficients (_TYPE_='SCORE'). Only one factor has been retained. Figure 64.1 also lists the first two observations of the original SAS data set Schools and the first two observations of the output data set New from the SCORE procedure.
                  OUTSTAT= Data Set from PROC FACTOR

   Obs    _TYPE_      _NAME_      English       Math     Biology
     1    MEAN                     55.525      52.325      50.350
     2    STD                      12.949      12.356      12.239
     3    N                       120.000     120.000     120.000
     4    CORR        English       1.000       0.833       0.672
     5    CORR        Math          0.833       1.000       0.594
     6    CORR        Biology       0.672       0.594       1.000
     7    COMMUNAL                  0.881       0.827       0.696
     8    PRIORS                    1.000       1.000       1.000
     9    EIGENVAL                  2.405       0.437       0.159
    10    PATTERN     Factor1       0.939       0.910       0.834
    11    SCORE       Factor1       0.390       0.378       0.347

       First Two Observations of the DATA= Data Set from PROC SCORE

   Obs    Type    English    Math    Biology
     1    p            52      55         45
     2    p            42      49         40

       First Two Observations of the OUT= Data Set from PROC SCORE

   Obs    Type     Factor1
     1    p       -0.17604
     2    p       -0.80294

Figure 64.1. Views of the Scores, Schools, and New Data Sets
The score variable Factor1 in the New data set is named according to the value of the _NAME_ variable in the Scores data set. The values of the variable Factor1 are computed as follows: the original data set variables are standardized to a mean of 0 and a variance of 1 because the Scores data set contains observations with _TYPE_='MEAN' and _TYPE_='STD'. These standardized variables are then multiplied by their respective standardized scoring coefficients from the data set Scores. These products are summed over all three variables, and the sum is the value of the new variable Factor1.
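A quick sketch of this computation in a DATA step, using the means, standard deviations, and coefficients from Figure 64.1 (the data set name check is arbitrary):

   data check;
      f1 = (52-55.525)/12.949*0.390
         + (55-52.325)/12.356*0.378
         + (45-50.350)/12.239*0.347;   /* about -0.176 */
      put f1=;
   run;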
The first two values of the scored variable Factor1 are obtained as follows:

   $\frac{52 - 55.525}{12.949} \times 0.390 + \frac{55 - 52.325}{12.356} \times 0.378 + \frac{45 - 50.350}{12.239} \times 0.347 = -0.17604$

   $\frac{42 - 55.525}{12.949} \times 0.390 + \frac{49 - 52.325}{12.356} \times 0.378 + \frac{40 - 50.350}{12.239} \times 0.347 = -0.80294$

The following statements request that the GCHART procedure produce a horizontal bar chart of the variable Type. The length of each bar represents the mean of the variable Factor1.
   proc gchart;
      hbar type / type=mean sumvar=Factor1;
   run;
Figure 64.2. Bar Chart of School Type
Figure 64.2 displays the mean score of the variable Factor1 for each of the three school types. For private schools (Type=p), the average value of the variable Factor1 is 0.384, while for state-run schools the average value is much lower. The state-run urban schools (Type=u) have the lowest mean value of -0.202, and the state-run rural schools (Type=r) have a mean value of -0.183.
Syntax

The following statements are available in the SCORE procedure.

   PROC SCORE DATA= SAS-data-set < options > ;
      BY variables ;
      ID variables ;
      VAR variables ;

The only required statement is the PROC SCORE statement. The BY, ID, and VAR statements are described following the PROC SCORE statement.
PROC SCORE Statement

PROC SCORE DATA= SAS-data-set < options > ;

You can specify the following options in the PROC SCORE statement.

DATA=SAS-data-set
names the input SAS data set containing the raw data to score. This specification is required.

NOSTD
suppresses centering and scaling of the raw data. Ordinarily, if PROC SCORE finds _TYPE_='MEAN', _TYPE_='USCORE', _TYPE_='USTD', or _TYPE_='STD' observations in the SCORE= data set, the procedure uses these to standardize the raw data before scoring.

OUT=SAS-data-set
specifies the name of the SAS data set created by PROC SCORE. If you want to create a permanent SAS data set, you must specify a two-level name. (Refer to "SAS Files" in SAS Language Reference: Concepts for more information on permanent SAS data sets.) If the OUT= option is omitted, PROC SCORE still creates an output data set and automatically names it according to the DATAn convention, just as if you omitted a data set name in a DATA statement.

PREDICT
specifies that PROC SCORE should treat coefficients of −1 in the SCORE= data set as 0. In regression applications, the dependent variable is coded with a coefficient of −1. Applied directly to regression results, PROC SCORE produces negative residuals (see the description of the RESIDUAL option, which follows); the PREDICT option produces predicted values instead.

RESIDUAL
reverses the sign of each score. Applied directly to regression results, PROC SCORE produces negative residuals (PREDICT−ACTUAL); the RESIDUAL option produces positive residuals (ACTUAL−PREDICT) instead.

SCORE=SAS-data-set
names the data set containing the scoring coefficients. If you omit the SCORE= option, the most recently created SAS data set is used. This data set must have two special variables: _TYPE_ and either _NAME_ or _MODEL_.

TYPE=name | 'string'
specifies the observations in the SCORE= data set that contain scoring coefficients. The TYPE= procedure option is unrelated to the data set option that has the same name. PROC SCORE examines the values of the special variable _TYPE_ in the SCORE= data set. When the value of _TYPE_ matches TYPE=name, the observation in the SCORE= data set is used to score the raw data in the DATA= data set. If you omit the TYPE= option, scoring coefficients are read from observations with either _TYPE_='SCORE' or _TYPE_='USCORE'. Because the default for PROC SCORE is TYPE=SCORE, you need not specify the TYPE= option for factor scoring or for computing scores from OUTSTAT= data sets from the CANCORR, CANDISC, PRINCOMP, or VARCLUS procedure. When you use regression coefficients from PROC REG, specify TYPE=PARMS.

The maximum length of the argument specified in the TYPE= option depends on the length defined by the VALIDVARNAME= SAS system option. For additional information, refer to SAS Language Reference: Dictionary. Note that the TYPE= option setting is not case-sensitive. For example, the two option settings, TYPE='MyScore' and TYPE='myscore', are equivalent.
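For instance, scoring regression parameter estimates might look like the following sketch (the data set fit and the model variables are hypothetical placeholders):

   proc reg data=fit outest=est;
      oxygen: model Oxygen = Age Weight RunTime;
   run;

   proc score data=fit score=est type=parms predict out=pred;
      var Oxygen Age Weight RunTime;
   run;

Because the model is labeled oxygen, the new score variable in pred is named oxygen; with the PREDICT option it contains predicted values rather than negative residuals.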
BY Statement

BY variables ;

You can specify a BY statement with PROC SCORE to obtain separate scoring for observations in groups defined by the BY variables. You can also specify a BY statement to apply separate groups of scoring coefficients to the entire DATA= data set.

If your SCORE= input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.

• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the SCORE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.

If the DATA= data set does not contain any of the BY variables, the entire DATA= data set is scored by each BY group of scoring coefficients in the SCORE= data set.

If the DATA= data set contains some but not all of the BY variables, or if some BY variables do not have the same type or length in the DATA= data set as in the SCORE= data set, then PROC SCORE prints an error message and stops.

If all the BY variables appear in the DATA= data set with the same type and length as in the SCORE= data set, then each BY group in the DATA= data set is scored using scoring coefficients from the corresponding BY group in the SCORE= data set. The BY groups in the DATA= data set must be in the same order as in the SCORE= data set. All BY groups in the DATA= data set must also appear in the SCORE= data set. If you do not specify the NOTSORTED option, some BY groups can appear in the SCORE= data set but not in the DATA= data set; such BY groups are not used in computing scores.
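A brief sketch of grouped scoring, assuming hypothetical data sets raw and coef that both contain a BY variable Region and scoring variables x1-x3:

   proc sort data=raw;
      by Region;
   run;

   proc score data=raw score=coef out=scored;
      by Region;
      var x1-x3;
   run;

Each Region group in raw is scored with the coefficients from the matching Region group in coef.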
ID Statement

ID variables ;

The ID statement identifies variables from the DATA= data set to be included in the OUT= data set. If there is no ID statement, all variables from the DATA= data set are included in the OUT= data set. The ID variables can be character or numeric.
VAR Statement

VAR variables ;

The VAR statement specifies the variables to be used in computing scores. These variables must be in both the DATA= and SCORE= input data sets and must be numeric. If you do not specify a VAR statement, the procedure uses all numeric variables in the SCORE= data set. You should almost always specify a VAR statement with PROC SCORE because you would rarely use all the numeric variables in your data set to compute scores.
Details

Missing Values

If one of the scoring variables in the DATA= data set has a missing value for an observation, all the scores have missing values for that observation. There are two exceptions: if the PREDICT option is specified, the variable with a coefficient of −1 can tolerate a missing value and still produce a prediction score, and a variable with a coefficient of 0 can always tolerate a missing value.

If a scoring coefficient in the SCORE= data set has a missing value for an observation, the coefficient is not used in creating the new score variable for the observation. In other words, missing values of scoring coefficients are treated as zeros. This treatment affects only the observation in which the missing value occurs.
Regression Parameter Estimates from PROC REG

If the SCORE= data set is an OUTEST= data set produced by PROC REG and if you specify TYPE=PARMS, the interpretation of the new score variables depends on the PROC SCORE options chosen and the variables listed in the VAR statement.

If the VAR statement contains only the independent variables used in a model in PROC REG, the new score variables give the predicted values.

If the VAR statement contains the dependent variables and the independent variables used in a model in PROC REG, the interpretation of the new score variables depends on the PROC SCORE options chosen. If you omit both the PREDICT and the RESIDUAL options, the new score variables give negative residuals (PREDICT−ACTUAL). If you specify the RESIDUAL option, the new score variables give positive residuals (ACTUAL−PREDICT). If you specify the PREDICT option, the new score variables give predicted values.
Unless you specify the NOINT option for PROC REG, the OUTEST= data set contains the variable Intercept. The SCORE procedure uses the intercept value in computing the scores.
Output Data Set

PROC SCORE produces an output data set but displays no output. The output OUT= data set contains the following:

• the ID variables, if any

• all variables from the DATA= data set, if no ID variables are specified

• the BY variables, if any

• the new score variables, named from the _NAME_ or _MODEL_ values in the SCORE= data set
Computational Resources

Let

   v = number of variables used in computing scores

   s = number of new score variables

   b = maximum number of new score variables in a BY group

   n = number of observations
Memory

The array storage required is approximately 8(4v + (3 + v)b + s) bytes. When you do not use BY processing, the array storage required is approximately 8(4v + (4 + v)s) bytes.
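As a worked example (the numbers are illustrative, not from the procedure itself): scoring v = 5 variables into s = 2 score variables without BY processing requires approximately 8(4·5 + (4 + 5)·2) = 8 × 38 = 304 bytes of array storage.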
Time

The time required to construct the scoring matrix is roughly proportional to vs, and the time needed to compute the scores is roughly proportional to nvs.
Examples

The following three examples use a subset of the Fitness data set. The complete data set is given in Chapter 61, "The REG Procedure."
Example 64.1. Factor Scoring Coefficients

This example shows how to use PROC SCORE with factor scoring coefficients. First, the FACTOR procedure produces an output data set containing scoring coefficients in observations identified by _TYPE_='SCORE'. These data, together with the original data set Fitness, are supplied to PROC SCORE, resulting in a data set containing scores Factor1 and Factor2. These statements produce Output 64.1.1 through Output 64.1.3:

   /* This data set contains only the first 12 observations   */
   /* from the full data set used in the chapter on PROC REG. */
   data Fitness;
      input Age Weight Oxygen RunTime RestPulse RunPulse @@;
      datalines;
   44 89.47 44.609 11.37 62 178   40 75.07 45.313 10.07 62 185
   44 85.84 54.297  8.65 45 156   42 68.15 59.571  8.17 40 166
   38 89.02 49.874  9.22 55 178   47 77.45 44.811 11.63 58 176
   40 75.98 45.681 11.95 70 176   43 81.19 49.091 10.85 64 162
   44 81.42 39.442 13.08 63 174   38 81.87 60.055  8.63 48 170
   44 73.03 50.541 10.13 45 168   45 87.66 37.388 14.03 56 186
   ;

   proc factor data=Fitness outstat=FactOut
               method=prin rotate=varimax score;
      var Age Weight RunTime RunPulse RestPulse;
      title 'FACTOR SCORING EXAMPLE';
   run;

   proc print data=FactOut;
      title2 'Data Set from PROC FACTOR';
   run;

   proc score data=Fitness score=FactOut out=FScore;
      var Age Weight RunTime RunPulse RestPulse;
   run;

   proc print data=FScore;
      title2 'Data Set from PROC SCORE';
   run;
Output 64.1.1 shows the PROC FACTOR output. The scoring coefficients for the two factors are shown at the end of the PROC FACTOR output.
Output 64.1.1. Creating an OUTSTAT= Data Set with PROC FACTOR

                          FACTOR SCORING EXAMPLE

                           The FACTOR Procedure
                Initial Factor Method: Principal Components

      Eigenvalues of the Correlation Matrix: Total = 5  Average = 1

           Eigenvalue    Difference    Proportion    Cumulative
      1    2.30930638    1.11710686        0.4619        0.4619
      2    1.19219952    0.30997249        0.2384        0.7003
      3    0.88222702    0.37965990        0.1764        0.8767
      4    0.50256713    0.38886717        0.1005        0.9773
      5    0.11369996                      0.0227        1.0000

                              Factor Pattern

                              Factor1     Factor2
              Age             0.29795     0.93675
              Weight          0.43282    -0.17750
              RunTime         0.91983     0.28782
              RunPulse        0.72671    -0.38191
              RestPulse       0.81179    -0.23344
                           The FACTOR Procedure
                Initial Factor Method: Principal Components

                    Variance Explained by Each Factor

                          Factor1       Factor2
                        2.3093064     1.1921995

             Final Communality Estimates: Total = 3.501506

           Age      Weight     RunTime    RunPulse   RestPulse
    0.96628351  0.21883401  0.92893333  0.67396207  0.71349297

                           The FACTOR Procedure
                         Rotation Method: Varimax

                    Orthogonal Transformation Matrix

                                  1           2
                      1     0.92536     0.37908
                      2    -0.37908     0.92536

                          Rotated Factor Pattern

                              Factor1     Factor2
              Age            -0.07939     0.97979
              Weight          0.46780    -0.00018
              RunTime         0.74207     0.61503
              RunPulse        0.81725    -0.07792
              RestPulse       0.83969     0.09172

                    Variance Explained by Each Factor

                          Factor1       Factor2
                        2.1487753     1.3527306

             Final Communality Estimates: Total = 3.501506

           Age      Weight     RunTime    RunPulse   RestPulse
    0.96628351  0.21883401  0.92893333  0.67396207  0.71349297

     Squared Multiple Correlations of the Variables with Each Factor

                          Factor1       Factor2
                        1.0000000     1.0000000

                    Standardized Scoring Coefficients

                              Factor1     Factor2
              Age            -0.17846     0.77600
              Weight          0.22987    -0.06672
              RunTime         0.27707     0.37440
              RunPulse        0.41263    -0.17714
              RestPulse       0.39952    -0.04793
Output 64.1.2 lists the OUTSTAT= data set from PROC FACTOR. Note that observations 18 and 19 have _TYPE_='SCORE'. Observations 1 and 2 have _TYPE_='MEAN' and _TYPE_='STD', respectively. These four observations are used by PROC SCORE.
Output 64.1.2. OUTSTAT= Data Set from PROC FACTOR Reproduced with PROC PRINT

                          FACTOR SCORING EXAMPLE
                        Data Set from PROC FACTOR

                                                       Run      Rest
 Obs  _TYPE_    _NAME_         Age   Weight  RunTime   Pulse    Pulse

   1  MEAN                 42.4167  80.5125  10.6483  172.917  55.6667
   2  STD                   2.8431   6.7660   1.8444    8.918   9.2769
   3  N                    12.0000  12.0000  12.0000   12.000  12.0000
   4  CORR      Age         1.0000   0.0128   0.5005   -0.095  -0.0080
   5  CORR      Weight      0.0128   1.0000   0.2637    0.173   0.2396
   6  CORR      RunTime     0.5005   0.2637   1.0000    0.556   0.6620
   7  CORR      RunPulse   -0.0953   0.1731   0.5555    1.000   0.4853
   8  CORR      RestPulse  -0.0080   0.2396   0.6620    0.485   1.0000
   9  COMMUNAL              0.9663   0.2188   0.9289    0.674   0.7135
  10  PRIORS                1.0000   1.0000   1.0000    1.000   1.0000
  11  EIGENVAL              2.3093   1.1922   0.8822    0.503   0.1137
  12  UNROTATE  Factor1     0.2980   0.4328   0.9198    0.727   0.8118
  13  UNROTATE  Factor2     0.9368  -0.1775   0.2878   -0.382  -0.2334
  14  TRANSFOR  Factor1     0.9254  -0.3791        .        .        .
  15  TRANSFOR  Factor2     0.3791   0.9254        .        .        .
  16  PATTERN   Factor1    -0.0794   0.4678   0.7421    0.817   0.8397
  17  PATTERN   Factor2     0.9798  -0.0002   0.6150   -0.078   0.0917
  18  SCORE     Factor1    -0.1785   0.2299   0.2771    0.413   0.3995
  19  SCORE     Factor2     0.7760  -0.0667   0.3744   -0.177  -0.0479
Since the PROC SCORE statement does not contain the NOSTD option, the data in the Fitness data set are standardized before scoring. For each variable specified in the VAR statement, the mean and standard deviation are obtained from the FactOut data set. For each observation in the Fitness data set, the variables are then standardized. For example, for observation 1 in the Fitness data set, the variable Age is standardized to 0.5569 = [(44 − 42.4167)/2.8431].

After the data in the Fitness data set are standardized, the standardized values of the variables in the VAR statement are multiplied by the matching coefficients in the FactOut data set, and the resulting products are summed. This sum is output as a value of the new score variable.

Output 64.1.3 displays the FScore data set produced by PROC SCORE. This data set contains the variables Age, Weight, Oxygen, RunTime, RestPulse, and RunPulse from the Fitness data set. It also contains Factor1 and Factor2, the two new score variables.
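To make this arithmetic concrete, the following DATA step replicates the Factor1 score for observation 1 by hand. This is an illustrative sketch, not part of the printed example; the means, standard deviations, and coefficients are copied from Output 64.1.2.

   data check1;
      /* values for observation 1 of the Fitness data set */
      Age=44; Weight=89.47; RunTime=11.37; RunPulse=178; RestPulse=62;
      /* standardize using the _TYPE_='MEAN' and _TYPE_='STD' rows, */
      /* then apply the _TYPE_='SCORE' coefficients for Factor1     */
      Factor1 = -0.17846*((Age       - 42.4167)/2.8431)
              +  0.22987*((Weight    - 80.5125)/6.7660)
              +  0.27707*((RunTime   - 10.6483)/1.8444)
              +  0.41263*((RunPulse  - 172.917)/8.918 )
              +  0.39952*((RestPulse - 55.6667)/9.2769);
      put Factor1=;   /* approximately 0.82, matching Output 64.1.3 */
   run;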
Output 64.1.3. OUT= Data Set from PROC SCORE Reproduced with PROC PRINT

                          FACTOR SCORING EXAMPLE
                         Data Set from PROC SCORE

 Obs  Age  Weight  Oxygen  RunTime  RestPulse  RunPulse   Factor1   Factor2

   1   44   89.47  44.609    11.37      62        178      0.82129   0.35663
   2   40   75.07  45.313    10.07      62        185      0.71173  -0.99605
   3   44   85.84  54.297     8.65      45        156     -1.46064   0.36508
   4   42   68.15  59.571     8.17      40        166     -1.76087  -0.27657
   5   38   89.02  49.874     9.22      55        178      0.55819  -1.67684
   6   47   77.45  44.811    11.63      58        176     -0.00113   1.40715
   7   40   75.98  45.681    11.95      70        176      0.95318  -0.48598
   8   43   81.19  49.091    10.85      64        162     -0.12951   0.36724
   9   44   81.42  39.442    13.08      63        174      0.66267   0.85740
  10   38   81.87  60.055     8.63      48        170     -0.44496  -1.53103
  11   44   73.03  50.541    10.13      45        168     -1.11832   0.55349
  12   45   87.66  37.388    14.03      56        186      1.20836   1.05948
Example 64.2. Regression Parameter Estimates

In this example, PROC REG computes regression parameter estimates for the Fitness data. (See Example 64.1 to create the Fitness data set.) The parameter estimates are output to a data set and used as scoring coefficients. For the first part of this example, PROC SCORE is used to score the Fitness data, which are the same data used in the regression. In the second part of this example, PROC SCORE is used to score a new data set, Fitness2.

For PROC SCORE, the TYPE= specification is PARMS, and the names of the score variables are found in the variable _MODEL_, which gets its values from the model label. The following code produces Output 64.2.1 through Output 64.2.3:

   proc reg data=Fitness outest=RegOut;
      OxyHat: model Oxygen=Age Weight RunTime RunPulse RestPulse;
      title 'REGRESSION SCORING EXAMPLE';
   run;

   proc print data=RegOut;
      title2 'OUTEST= Data Set from PROC REG';
   run;

   proc score data=Fitness score=RegOut out=RScoreP type=parms;
      var Age Weight RunTime RunPulse RestPulse;
   run;

   proc print data=RScoreP;
      title2 'Predicted Scores for Regression';
   run;

   proc score data=Fitness score=RegOut out=RScoreR type=parms;
      var Oxygen Age Weight RunTime RunPulse RestPulse;
   run;

   proc print data=RScoreR;
      title2 'Negative Residual Scores for Regression';
   run;
Output 64.2.1 shows the PROC REG output. The column labeled "Parameter Estimate" lists the parameter estimates. These estimates are output to the RegOut data set.

Output 64.2.1. Creating an OUTEST= Data Set with PROC REG

                        REGRESSION SCORING EXAMPLE

                            The REG Procedure
                              Model: oxyhat
                        Dependent Variable: Oxygen

                           Analysis of Variance

                                   Sum of        Mean
 Source             DF            Squares      Square   F Value   Pr > F

 Model                5         509.62201   101.92440     15.80   0.0021
 Error                6          38.70060     6.45010
 Corrected Total     11         548.32261

          Root MSE            2.53970    R-Square    0.9294
          Dependent Mean     48.38942    Adj R-Sq    0.8706
          Coeff Var           5.24847

                           Parameter Estimates

                      Parameter     Standard
 Variable      DF      Estimate        Error    t Value    Pr > |t|

 Intercept      1     151.91550     31.04738       4.89      0.0027
 Age            1      -0.63045      0.42503      -1.48      0.1885
 Weight         1      -0.10586      0.11869      -0.89      0.4068
 RunTime        1      -1.75698      0.93844      -1.87      0.1103
 RunPulse       1      -0.22891      0.12169      -1.88      0.1090
 RestPulse      1      -0.17910      0.13005      -1.38      0.2176
Output 64.2.2 lists the RegOut data set. Notice that _TYPE_='PARMS' and _MODEL_='oxyhat', which are from the label in the MODEL statement in PROC REG.
Output 64.2.2. OUTEST= Data Set from PROC REG Reproduced with PROC PRINT

                        REGRESSION SCORING EXAMPLE
                      OUTEST= Data Set from PROC REG

 Obs  _MODEL_  _TYPE_  _DEPVAR_   _RMSE_   Intercept       Age

   1  oxyhat   PARMS   Oxygen    2.53970     151.916  -0.63045

 Obs    Weight   RunTime   RunPulse   RestPulse   Oxygen

   1  -0.10586  -1.75698   -0.22891    -0.17910       -1
Output 64.2.3 lists the data sets created by PROC SCORE. Since the SCORE= data set does not contain observations with _TYPE_='MEAN' or _TYPE_='STD', the data in the Fitness data set are not standardized before scoring. The SCORE= data set contains the variable Intercept, so this intercept value is used in computing the score.

To produce the RScoreP data set, the VAR statement in PROC SCORE includes only the independent variables from the model in PROC REG. As a result, the OxyHat variable contains predicted values.

To produce the RScoreR data set, the VAR statement in PROC SCORE includes both the dependent variables and the independent variables from the model in PROC REG. As a result, the OxyHat variable contains negative residuals (PREDICT−ACTUAL). If the RESIDUAL option is specified, the variable OxyHat contains positive residuals (ACTUAL−PREDICT). If the PREDICT option is specified, the OxyHat variable contains predicted values.
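For instance, to obtain the positive residuals directly, the last scoring step could be rerun with the RESIDUAL option. This variation is a sketch implied by the text above rather than part of the printed example; the output data set name RScoreRes is hypothetical:

   proc score data=Fitness score=RegOut out=RScoreRes
              type=parms residual;
      var Oxygen Age Weight RunTime RunPulse RestPulse;
   run;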
Output 64.2.3. Predicted and Residual Scores from the OUT= Data Set Created by PROC SCORE and Reproduced Using PROC PRINT

                        REGRESSION SCORING EXAMPLE
                      Predicted Scores for Regression

 Obs  Age  Weight  Oxygen  RunTime  RestPulse  RunPulse    oxyhat

   1   44   89.47  44.609    11.37      62        178      42.8771
   2   40   75.07  45.313    10.07      62        185      47.6050
   3   44   85.84  54.297     8.65      45        156      56.1211
   4   42   68.15  59.571     8.17      40        166      58.7044
   5   38   89.02  49.874     9.22      55        178      51.7386
   6   47   77.45  44.811    11.63      58        176      42.9756
   7   40   75.98  45.681    11.95      70        176      44.8329
   8   43   81.19  49.091    10.85      64        162      48.6020
   9   44   81.42  39.442    13.08      63        174      41.4613
  10   38   81.87  60.055     8.63      48        170      56.6171
  11   44   73.03  50.541    10.13      45        168      52.1299
  12   45   87.66  37.388    14.03      56        186      37.0080

                        REGRESSION SCORING EXAMPLE
                  Negative Residual Scores for Regression

 Obs  Age  Weight  Oxygen  RunTime  RestPulse  RunPulse    oxyhat

   1   44   89.47  44.609    11.37      62        178     -1.73195
   2   40   75.07  45.313    10.07      62        185      2.29197
   3   44   85.84  54.297     8.65      45        156      1.82407
   4   42   68.15  59.571     8.17      40        166     -0.86657
   5   38   89.02  49.874     9.22      55        178      1.86460
   6   47   77.45  44.811    11.63      58        176     -1.83542
   7   40   75.98  45.681    11.95      70        176     -0.84811
   8   43   81.19  49.091    10.85      64        162     -0.48897
   9   44   81.42  39.442    13.08      63        174      2.01935
  10   38   81.87  60.055     8.63      48        170     -3.43787
  11   44   73.03  50.541    10.13      45        168      1.58892
  12   45   87.66  37.388    14.03      56        186     -0.38002
The second part of this example uses the parameter estimates to score a new data set. The following code produces Output 64.2.4 and Output 64.2.5:

   /* The FITNESS2 data set contains observations 13-16 from */
   /* the FITNESS data set used in EXAMPLE 2 in the PROC REG */
   /* chapter.                                               */
   data Fitness2;
      input Age Weight Oxygen RunTime RestPulse RunPulse;
      datalines;
   45 66.45 44.754 11.12 51 176
   47 79.15 47.273 10.60 47 162
   54 83.12 51.855 10.33 50 166
   49 81.42 49.156  8.95 44 180
   ;

   proc print data=Fitness2;
      title 'REGRESSION SCORING EXAMPLE';
      title2 'New Raw Data Set to be Scored';
   run;

   proc score data=Fitness2 score=RegOut out=NewPred
              type=parms nostd predict;
      var Oxygen Age Weight RunTime RunPulse RestPulse;
   run;

   proc print data=NewPred;
      title2 'Predicted Scores for Regression';
      title3 'for Additional Data from FITNESS2';
   run;
Output 64.2.4 lists the Fitness2 data set.

Output 64.2.4. Listing of the Fitness2 Data Set

                        REGRESSION SCORING EXAMPLE
                      New Raw Data Set to be Scored

 Obs  Age  Weight  Oxygen  RunTime  RestPulse  RunPulse

   1   45   66.45  44.754    11.12      51        176
   2   47   79.15  47.273    10.60      47        162
   3   54   83.12  51.855    10.33      50        166
   4   49   81.42  49.156     8.95      44        180
PROC SCORE scores the Fitness2 data set using the parameter estimates in the RegOut data set. These parameter estimates result from fitting a regression equation to the Fitness data set. The NOSTD option is specified, so the raw data are not standardized before scoring. (However, the NOSTD option is not necessary here. The SCORE= data set does not contain observations with _TYPE_='MEAN' or _TYPE_='STD', so standardization is not performed.)

The VAR statement contains the dependent variables and the independent variables used in PROC REG. In addition, the PREDICT option is specified. This combination gives predicted values for the new score variable. The name of the new score variable is OxyHat, from the value of the _MODEL_ variable in the SCORE= data set. Output 64.2.5 shows the data set produced by PROC SCORE.
Output 64.2.5. Predicted Scores from the OUT= Data Set Created by PROC SCORE and Reproduced Using PROC PRINT

                        REGRESSION SCORING EXAMPLE
                      Predicted Scores for Regression
                    for Additional Data from FITNESS2

 Obs  Age  Weight  Oxygen  RunTime  RestPulse  RunPulse    oxyhat

   1   45   66.45  44.754    11.12      51        176      47.5507
   2   47   79.15  47.273    10.60      47        162      49.7802
   3   54   83.12  51.855    10.33      50        166      43.9682
   4   49   81.42  49.156     8.95      44        180      47.5949
Example 64.3. Custom Scoring Coefficients

This example uses a specially created custom scoring data set and produces Output 64.3.1 and Output 64.3.2. The first scoring coefficient creates a variable that is Age−Weight; the second scoring coefficient evaluates RunPulse−RestPulse; and the third scoring coefficient totals all five variables. Since the scoring coefficients data set (data set A) does not contain any observations with _TYPE_='MEAN' or _TYPE_='STD', the data in the Fitness data set (see Example 64.1) are not standardized before scoring.

   data A;
      input _type_ $ _name_ $
            Age Weight RunTime RunPulse RestPulse;
      datalines;
   SCORE AGE_WGT 1 -1 0 0 0
   SCORE RUN_RST 0 0 0 1 -1
   SCORE TOTAL   1 1 1 1 1
   ;

   proc print data=A;
      title 'CONSTRUCTED SCORING EXAMPLE';
      title2 'Scoring Coefficients';
   run;

   proc score data=Fitness score=A out=B;
      var Age Weight RunTime RunPulse RestPulse;
   run;

   proc print data=B;
      title2 'Scored Data';
   run;
Output 64.3.1. Custom Scoring Data Set and Scored Fitness Data: PROC PRINT

                       CONSTRUCTED SCORING EXAMPLE
                          Scoring Coefficients

                                        Run   Run    Rest
 Obs  _type_   _name_    Age  Weight   Time   Pulse  Pulse

   1  SCORE    AGE_WGT     1     -1      0      0      0
   2  SCORE    RUN_RST     0      0      0      1     -1
   3  SCORE    TOTAL       1      1      1      1      1
Output 64.3.2. Custom Scored Fitness Data: PROC PRINT

                       CONSTRUCTED SCORING EXAMPLE
                               Scored Data

 Obs  Age  Weight  Oxygen  RunTime  RestPulse  RunPulse  AGE_WGT  RUN_RST   TOTAL

   1   44   89.47  44.609    11.37      62        178     -45.47     116   384.84
   2   40   75.07  45.313    10.07      62        185     -35.07     123   372.14
   3   44   85.84  54.297     8.65      45        156     -41.84     111   339.49
   4   42   68.15  59.571     8.17      40        166     -26.15     126   324.32
   5   38   89.02  49.874     9.22      55        178     -51.02     123   369.24
   6   47   77.45  44.811    11.63      58        176     -30.45     118   370.08
   7   40   75.98  45.681    11.95      70        176     -35.98     106   373.93
   8   43   81.19  49.091    10.85      64        162     -38.19      98   361.04
   9   44   81.42  39.442    13.08      63        174     -37.42     111   375.50
  10   38   81.87  60.055     8.63      48        170     -43.87     122   346.50
  11   44   73.03  50.541    10.13      45        168     -29.03     123   340.16
  12   45   87.66  37.388    14.03      56        186     -42.66     130   388.69
Chapter 65
The SIM2D Procedure

Chapter Contents

OVERVIEW ..................................................... 4091
   Introduction to Spatial Simulation ........................ 4091

GETTING STARTED .............................................. 4092
   Preliminary Spatial Data Analysis ......................... 4092
   Investigating Variability by Simulation ................... 4095

SYNTAX ....................................................... 4097
   PROC SIM2D Statement ...................................... 4099
   COORDINATES Statement ..................................... 4099
   GRID Statement ............................................ 4100
   SIMULATE Statement ........................................ 4101
   MEAN Statement ............................................ 4105

DETAILS ...................................................... 4106
   Computational and Theoretical Details of Spatial Simulation 4106
   Output Data Set ........................................... 4110

EXAMPLE ...................................................... 4111
   Example 65.1. Simulation .................................. 4111

REFERENCES ................................................... 4115
Overview

The SIM2D procedure produces a spatial simulation for a Gaussian random field with a specified mean and covariance structure in two dimensions using an LU decomposition technique.

The simulation can be conditional or unconditional. If it is conditional, a set of coordinates and associated field values are read from a SAS data set. The resulting simulation honors these data values.

You can specify the mean structure as a quadratic in the coordinates. You can specify the covariance by naming the form and supplying the associated parameters. PROC SIM2D can handle anisotropic and nested semivariogram models. Three covariance models are supported: Gaussian, exponential, and spherical. A single nugget effect is also supported.

You can specify the locations of simulation points in a GRID statement, or they can be read from a SAS data set. The grid specification is most suitable for a regular grid; the data set specification can handle any irregular pattern of points.

The SIM2D procedure writes the simulated values for each grid point to an output data set. The SIM2D procedure does not produce any displayed output.
Introduction to Spatial Simulation

The purpose of spatial simulation is to produce a set of partial realizations of a spatial random field (SRF) Z(s), s ∈ D ⊂ R², in a way that preserves a specified mean µ(s) = E[Z(s)] and covariance structure Cz(s1 − s2) = cov(Z(s1), Z(s2)). The realizations are partial in the sense that they occur only at a finite set of locations (s1, s2, ..., sn). These locations are typically on a regular grid, but they can be arbitrary locations in the plane.

There are a number of different types of spatial simulation and associated computational methods. PROC SIM2D produces simulations for continuous processes in two dimensions. This means that the possible values of the measured quantity Z(s0) at location s0 = (x0, y0) can vary continuously over a certain range. An additional assumption, needed for computational purposes, is that the spatial random field Z(s) is Gaussian.

Spatial simulation is different from spatial prediction, where the emphasis is on producing a point estimate at a given grid location. In this sense, spatial prediction is local. In contrast, spatial simulation is global; the emphasis is on the entire realization (Z(s1), Z(s2), ..., Z(sn)).
Given the correct mean µ(s) and covariance structure Cz (s1 − s2 ), SRF quantities that are difficult or impossible to calculate in a spatial prediction context can easily be approximated by repeated simulations.
Getting Started

Spatial simulation, just like spatial prediction, requires a model of spatial dependence, usually in terms of the covariance Cz(h). For a given set of spatial data Z(si), i = 1, ..., n, the covariance structure (both the form and parameter values) can be found by the VARIOGRAM procedure. This example uses the coal seam thickness data that is also used in the "Getting Started" section of Chapter 80, "The VARIOGRAM Procedure."
Preliminary Spatial Data Analysis

In this example, the data consist of coal seam thickness measurements (in feet) taken over an approximately square area. The coordinates are offsets from a point in the southwest corner of the measurement area, with the north and east distances in units of thousands of feet.

It is instructive to see the locations of the measured points in the area where you want to perform spatial simulations. It is generally desirable to have these locations scattered evenly around the simulation area. First, the data are input and the sample locations plotted.
   data thick;
      input east north thick @@;
      datalines;
    0.7  59.6  34.1   2.1  82.7  42.2   4.7  75.1  39.5
    4.8  52.8  34.3   5.9  67.1  37.0   6.0  35.7  35.9
    6.4  33.7  36.4   7.0  46.7  34.6   8.2  40.1  35.4
   13.3   0.6  44.7  13.3  68.2  37.8  13.4  31.3  37.8
   17.8   6.9  43.9  20.1  66.3  37.7  22.7  87.6  42.8
   23.0  93.9  43.6  24.3  73.0  39.3  24.8  15.1  42.3
   24.8  26.3  39.7  26.4  58.0  36.9  26.9  65.0  37.8
   27.7  83.3  41.8  27.9  90.8  43.3  29.1  47.9  36.7
   29.5  89.4  43.0  30.1   6.1  43.6  30.8  12.1  42.8
   32.7  40.2  37.5  34.8   8.1  43.3  35.3  32.0  38.8
   37.0  70.3  39.2  38.2  77.9  40.7  38.9  23.3  40.5
   39.4  82.5  41.4  43.0   4.7  43.3  43.7   7.6  43.1
   46.4  84.1  41.5  46.7  10.6  42.6  49.9  22.1  40.7
   51.0  88.8  42.0  52.8  68.9  39.3  52.9  32.7  39.2
   55.5  92.9  42.2  56.0   1.6  42.7  60.6  75.2  40.1
   62.1  26.6  40.1  63.0  12.7  41.8  69.0  75.6  40.1
   70.5  83.7  40.9  70.9  11.0  41.7  71.5  29.5  39.8
   78.1  45.5  38.7  78.2   9.1  41.7  78.4  20.0  40.8
   80.5  55.9  38.7  81.1  51.0  38.6  83.8   7.9  41.6
   84.5  11.0  41.5  85.2  67.3  39.4  85.5  73.0  39.8
   86.7  70.4  39.6  87.2  55.7  38.8  88.1   0.0  41.6
   88.4  12.1  41.3  88.4  99.6  41.2  88.8  82.9  40.5
   88.9   6.2  41.5  90.6   7.0  41.5  90.7  49.6  38.9
   91.5  55.4  39.0  92.9  46.8  39.1  93.4  70.9  39.7
   94.8  71.5  39.7  96.2  84.3  40.3  98.2  58.2  39.5
   ;

   proc gplot data=thick;
      title 'Locations of Measured Samples';
      plot north*east / frame cframe=ligr
                        haxis=axis1 vaxis=axis2;
      symbol1 v=dot color=blue;
      axis1 minor=none;
      axis2 minor=none label=(angle=90 rotate=0);
      label east  = 'East'
            north = 'North';
   run;

Figure 65.1. Locations of Measured Samples

   proc g3d data=thick;
      title 'Surface Plot of Coal Seam Thickness';
      scatter east*north=thick / xticknum=5 yticknum=5
                                 grid zmin=20 zmax=65;
      label east  = 'East'
            north = 'North'
            thick = 'Thickness';
   run;

Figure 65.2. Surface Plot of Coal Seam Thickness
Figure 65.2 shows the small scale variation typical of spatial data, but there does not appear to be any surface trend. Hence, you can work with the original thickness data rather than residuals from a trend surface fit. In fact, a reasonable approximation of the spatial process generating the coal seam data is given by

$$ Z(\mathbf{s}) = \mu + \varepsilon(\mathbf{s}) $$

where the $\varepsilon(\mathbf{s})$ is a Gaussian SRF with Gaussian covariance structure

$$ C_z(h) = c_0 \exp\!\left(-\frac{h^2}{a_0^2}\right) $$
Note that the term "Gaussian" is used in two ways in this description. For a set of locations s1, s2, ..., sn, the random vector

$$ \mathbf{Z}(\mathbf{s}) = \begin{pmatrix} Z(\mathbf{s}_1) \\ Z(\mathbf{s}_2) \\ \vdots \\ Z(\mathbf{s}_n) \end{pmatrix} $$

has a multivariate Gaussian or normal distribution Nn(µ, Σ). The (i,j)th element of Σ is computed by Cz(si − sj), which happens to be a Gaussian functional form. Any functional form for Cz(h) yielding a valid covariance matrix Σ can be used.

Both the functional form of Cz(h) and the parameter values

$$ \mu = 40.14, \qquad c_0 = 7.5, \qquad a_0 = 30.0 $$

are visually estimated using PROC VARIOGRAM, a DATA step, and the GPLOT procedure. Refer to the "Getting Started" section beginning on page 4852 in the chapter on the VARIOGRAM procedure for details on how these parameter values are obtained.

The choice of a Gaussian functional form for Cz(h) is simply based on the data, and it is not at all crucial to the simulation. However, it is crucial to the simulation method used in PROC SIM2D that Z(s) be a Gaussian SRF. For details, see the section "Computational and Theoretical Details of Spatial Simulation" beginning on page 4106.
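As a quick numerical check of these parameters (a worked example, not part of the original analysis): at a separation of h = 30 (thousand feet), the covariance is

$$ C_z(30) = 7.5 \, \exp\!\left(-\frac{30^2}{30^2}\right) = 7.5\,e^{-1} \approx 2.76 $$

so thickness values separated by the range $a_0$ retain roughly 37 percent of the sill $c_0 = 7.5$.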
Investigating Variability by Simulation

The variability of Z(s), modeled by Z(s) = µ + ε(s) with the Gaussian covariance structure Cz(h) found previously, is not obvious from the covariance model form and parameters. The variation around the mean of the surface is relatively small, making it difficult visually to pick up differences in surface plots of simulated realizations. Instead, you investigate variations at selected grid points.

To do this investigation, this example uses PROC SIM2D and specifies the Gaussian model with the parameters found previously. Five thousand simulations (iterations) are performed on two points: the extreme south-west point of the region and a point towards the north-east corner of the region. Because of the irregular nature of these points, a GDATA= data set is produced with the coordinates of the selected points. Summary statistics are computed for each of these grid points by using a BY statement in PROC UNIVARIATE.
   data grid;
      input xc yc;
      datalines;
   0  0
   75 75
   run;

   proc sim2d data=thick outsim=sim1;
      simulate var=thick numreal=5000 seed=79931
               scale=7.5 range=30.0 form=gauss;
      mean 40.14;
      coordinates xc=east yc=north;
      grid gdata=grid xc=xc yc=yc;
   run;

   proc sort data=sim1;
      by gxc gyc;
   run;

   proc univariate data=sim1;
      var svalue;
      by gxc gyc;
      title 'Simulation Statistics at Selected Grid Points';
   run;
              Simulation Statistics at Selected Grid Points
 ---- X-coordinate of the grid point=0  Y-coordinate of the grid point=0 ----

                          The UNIVARIATE Procedure
             Variable:  SVALUE  (Simulated Value at Grid Point)

                                  Moments

      N                        5000    Sum Weights             5000
      Mean               40.1387121    Sum Observations  200693.561
      Std Deviation      0.54603592    Variance          0.29815523
      Skewness           -0.0217334    Kurtosis          -0.0519914
      Uncorrected SS     8057071.54    Corrected SS        1490.478
      Coeff Variation    1.36037231    Std Error Mean    0.00772211

                        Basic Statistical Measures

            Location                   Variability

        Mean     40.13871     Std Deviation          0.54604
        Median   40.14620     Variance               0.29816
        Mode            .     Range                  3.81973
                              Interquartile Range    0.76236

Figure 65.3. Simulation Statistics at Grid Point (XC=0, YC=0)
              Simulation Statistics at Selected Grid Points
 ---- X-coordinate of the grid point=0  Y-coordinate of the grid point=0 ----

                          The UNIVARIATE Procedure
             Variable:  SVALUE  (Simulated Value at Grid Point)

                          Tests for Location: Mu0=0

               Test           -Statistic-    -----p Value------

               Student's t    t  5197.892    Pr > |t|    <.0001
               Sign           M      2500    Pr >= |M|   <.0001
               Signed Rank    S   6251250    Pr >= |S|   <.0001

                          Quantiles (Definition 5)

                          Quantile      Estimate

                          100% Max       41.9369
                          99%            41.4002
                          95%            41.0273
                          90%            40.8334
                          75% Q3         40.5168
                          50% Median     40.1462
                          25% Q1         39.7544
                          10%            39.4509
                          5%             39.2384
                          1%             38.8656
                          0% Min         38.1172

                            Extreme Observations

                 ------Lowest------        ------Highest-----

                    Value      Obs            Value      Obs

                  38.1172     2691          41.8085     1149
                  38.2959     1817          41.8251     3612
                  38.3370     3026          41.8446     3757
                  38.3834     2275          41.9338      135
                  38.4198     3100          41.9369     4536

Figure 65.4. Simulation Statistics at Grid Point (XC=75, YC=75)
Syntax

The following statements are available in PROC SIM2D.

   PROC SIM2D options ;
      COORDINATES coordinate-variables ;
      GRID grid-options ;
      SIMULATE simulate-options ;
      MEAN mean-options ;
The SIMULATE and MEAN statements are hierarchical; you can specify any number of SIMULATE statements, but you must specify at least one. If you specify a MEAN statement, it refers to the preceding SIMULATE statement. If you do not specify a MEAN statement, a zero mean model is simulated.

You must specify a single COORDINATES statement to identify the x and y coordinate variables in the input data set when you perform a conditional simulation. You must also specify a single GRID statement to specify the grid information.

The following table outlines the options available in PROC SIM2D classified by function.

Table 65.1. Options Available in the SIM2D Procedure

Task                                               Statement     Option

Data Set Options
  specify input data set                           PROC SIM2D    DATA=
  specify grid data set                            GRID          GDATA=
  specify quadratic form data set                  MEAN          QDATA=
  write simulated values                           PROC SIM2D    OUTSIM=

Declaring the Role of Variables
  specify the conditioning variable                SIMULATE      VAR=
  specify the x and y coordinate variables
    in the DATA= data set                          COORDINATES   XC= YC=
  specify the x and y coordinate variables
    in the GDATA= data set                         GRID          XC= YC=
  specify the constant coefficient variable
    in the QDATA= data set                         MEAN          CONST=
  specify the linear x coefficient variable
    in the QDATA= data set                         MEAN          CX=
  specify the linear y coefficient variable
    in the QDATA= data set                         MEAN          CY=
  specify the quadratic x coefficient variable
    in the QDATA= data set                         MEAN          CXX=
  specify the quadratic y coefficient variable
    in the QDATA= data set                         MEAN          CYY=
  specify the quadratic xy coefficient variable
    in the QDATA= data set                         MEAN          CXY=

Controlling the Simulation
  specify the number of realizations               SIMULATE      NUMREAL=
  specify the seed value for the random generator  SIMULATE      SEED=

Controlling the Mean Quadratic Surface
  specify the CONST term                           MEAN          CONST=
  specify the linear x term                        MEAN          CX=
  specify the linear y term                        MEAN          CY=
  specify the quadratic x term                     MEAN          CXX=
  specify the quadratic y term                     MEAN          CYY=
  specify the quadratic cross term                 MEAN          CXY=

Controlling the Semivariogram Model
  specify a nugget effect                          SIMULATE      NUGGET=
  specify a functional form                        SIMULATE      FORM=
  specify nested functional forms                  SIMULATE      FORM=(f1, ..., fk)
  specify a range parameter                        SIMULATE      RANGE=
  specify nested range parameters                  SIMULATE      RANGE=(r1, ..., rk)
  specify a scale parameter                        SIMULATE      SCALE=
  specify nested scale parameters                  SIMULATE      SCALE=(s1, ..., sk)
  specify an angle for an anisotropic model        SIMULATE      ANGLE=
  specify nested angles                            SIMULATE      ANGLE=(a1, ..., ak)
  specify a minor-major axis ratio for an
    anisotropic model                              SIMULATE      RATIO=
  specify nested minor-major axis ratios           SIMULATE      RATIO=(ra1, ..., rak)
PROC SIM2D Statement

PROC SIM2D options ;

You can specify the following options with the PROC SIM2D statement.

DATA=SAS-data-set
  specifies a SAS data set containing the x and y coordinate variables and the SIMULATE VAR= variables. This data set is required if any of the SIMULATE statements are conditional, that is, if you specify the VAR= option. If none of the SIMULATE statements are conditional, then you do not need the DATA= option, and this option is ignored if you specify it.

NARROW
  restricts the variables included in the OUTSIM= data set. When you specify the NARROW option, only four variables are included. This option is useful when a large number of simulations are produced. Including only four variables reduces the memory required for the OUTSIM= data set. For details on the variables that are excluded with the NARROW option, see the section "Output Data Set" on page 4110.

OUTSIM=SAS-data-set
  specifies a SAS data set to store the simulation values, iteration number, SIMULATE statement label, variable name, and grid location. For details, see the section "Output Data Set" on page 4110.
COORDINATES Statement

COORDINATES coordinate-variables ;

The following two options give the name of the variables in the DATA= data set containing the values of the x and y coordinates of the conditioning data.

Only one COORDINATES statement is allowed, and it is applied to all SIMULATE statements that have a VAR= specification. In other words, it is assumed that all the VAR= variables in all SIMULATE statements have the same x and y coordinates.

You can abbreviate the COORDINATES statement as COORD.

XCOORD=(variable-name)
XC=(variable-name)
  gives the name of the variable containing the x coordinate of the data in the DATA= data set.

YCOORD=(variable-name)
YC=(variable-name)
  gives the name of the variable containing the y coordinate of the data locations in the DATA= data set.
GRID Statement

GRID grid-options ;

The following options can be used to specify the grid of spatial locations at which to perform the simulations. A single GRID statement is required and is applied to all SIMULATE statements.

There are two basic methods for specifying the grid. You can specify the x and y coordinates explicitly, or they can be read from a SAS data set. The options for the explicit specification of grid locations are as follows.

X=number
X=x1, ..., xm
X=x1 to xm
X=x1 to xm by δx
  specifies the x coordinate of the grid locations.

Y=number
Y=y1, ..., ym
Y=y1 to ym
Y=y1 to ym by δy
  specifies the y coordinate of the grid locations.

For example, the following two GRID statements are equivalent:

   grid x=1,2,3,4,5 y=0,2,4,6,8,10;
   grid x=1 to 5 y=0 to 10 by 2;

To specify grid locations from a SAS data set, you must provide the name of the data set and the variables containing the values of the x and y coordinates.

GRIDDATA=SAS-data-set
GDATA=SAS-data-set
  specifies a SAS data set containing the x and y grid coordinates.

XCOORD=(variable-name)
XC=(variable-name)
  gives the name of the variable containing the x coordinate of the grid locations in the GRIDDATA= data set.

YCOORD=(variable-name)
YC=(variable-name)
  gives the name of the variable containing the y coordinate of the grid locations in the GRIDDATA= data set.
SIMULATE Statement

SIMULATE simulate-options ;

The SIMULATE statement specifies details on the simulation and the covariance model used in the simulation. You can specify the following options with a SIMULATE statement, which can be abbreviated by SIM.

NUMREAL=number
NUMR=number
NR=number
  specifies the number of realizations to produce for the spatial process specified by the covariance model. Note that the number of observations in the OUTSIM= data set contributed by a given SIMULATE statement is the product of the NUMREAL= value with the number of grid points. This can cause the OUTSIM= data set to become large even for moderate values of the NUMREAL= option.

VAR=(variable-name)
  specifies the single numeric variable used as the conditioning variable in the simulation. In other words, the simulation is conditional on the values of the VAR= variable found in the DATA= data set. If you omit the VAR= option, the simulation is unconditional. Since multiple SIMULATE statements are allowed, you can perform both unconditional and conditional simulations with a single PROC SIM2D statement, as shown in the sketch that follows.
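As a minimal sketch of mixing the two kinds of simulation in one procedure call (the data set name obs, the variables x, y, and z, and the covariance parameter values are hypothetical, illustrative choices):

   proc sim2d data=obs outsim=sims;
      coordinates xc=x yc=y;
      grid x=0 to 100 by 10 y=0 to 100 by 10;
      /* conditional: honors the measured values of z */
      cond:   simulate var=z numreal=100 scale=5 range=20 form=exp;
      /* unconditional: no VAR=, zero-mean field      */
      uncond: simulate numreal=100 scale=5 range=20 form=exp;
   run;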
Covariance Model Specification

There are two ways to specify a semivariogram or covariance model. In the first method, you can specify the required parameters SCALE, RANGE, and FORM, and possibly the optional parameters NUGGET, ANGLE, and RATIO, explicitly in the SIMULATE statement.

In the second method, you can specify an MDATA= data set. This data set contains variables corresponding to the required SCALE, RANGE, and FORM parameters, and, optionally, variables for the NUGGET, ANGLE, and RATIO parameters.

The two methods are exclusive; either you specify all parameters explicitly, or they are all read from the MDATA= data set.
ANGLE=angle
ANGLE=(angle1, ..., anglek)
  specifies the angle of the major axis for anisotropic models, measured in degrees clockwise from the N-S axis. In the case of a nested semivariogram model, you can specify an angle for each nesting. The default is ANGLE=0.

FORM=form-spec
FORM=(form-spec1, form-spec2, ..., form-speck)
  specifies the functional form or forms of the semivariogram model, where form-spec can take only the values SPHERICAL, EXPONENTIAL, and GAUSSIAN. The two ways of specifying the FORM= parameter allow specification of both nested and nonnested models. The following abbreviations are permitted. For the spherical model, you can specify the form-spec as FORM=SPHERICAL, FORM=SPH, or FORM=S. For the exponential model, you can specify the form-spec as FORM=EXPONENTIAL, FORM=EXP, or FORM=E. For the Gaussian model, you can specify the form-spec as FORM=GAUSSIAN, FORM=GAUSS, or FORM=G.

MDATA=SAS-data-set
  specifies the input data set that contains parameter values for the covariance or semivariogram model. The MDATA= data set must contain variables named SCALE, RANGE, and FORM, and it can optionally contain the variables NUGGET, ANGLE, and RATIO. The FORM variable must be character, and it can assume the same values allowed in the explicit FORM= syntax described previously. The RANGE and SCALE variables must be numeric. The optional variables ANGLE, RATIO, and NUGGET must also be numeric if present.

  The number of observations present in the MDATA= data set corresponds to the level of nesting of the covariance or semivariogram model. For example, to specify a nonnested model using a spherical covariance, an MDATA= data set might look like the following.

   data md1;
      input scale range form $;
      datalines;
   25 10 sph
   run;

The PROC SIM2D statement to use the MDATA= specification is of the form

   proc sim2d data=...;
      sim var=.... mdata=md1;
   run;

This is equivalent to the following explicit specification of the covariance model parameters:

   proc sim2d data=...;
      sim var=.... scale=25 range=10 form=sph;
   run;
The following MDATA= data set is an example of an anisotropic nested model:

   data md2;
      input scale range form $ nugget angle ratio;
      datalines;
   20 8 S 5 35 .7
   12 3 G 5  0 .8
    4 1 G 5 45 .5
   ;

   proc sim2d data=...;
      sim var=.... mdata=md2;
   run;

This is equivalent to the following explicit specification of the covariance model parameters:

   proc sim2d data=...;
      sim var=.... scale=(20,12,4) range=(8,3,1)
                   form=(S,G,G) angle=(35,0,45)
                   ratio=(.7,.8,.5) nugget=5;
   run;

This example is somewhat artificial in that it is usually hard to detect different anisotropy directions and ratios for different nestings using an experimental semivariogram. Note that the NUGGET value is the same for all nestings. This is always the case; the nugget effect is a single additive term for all models. For further details, refer to the section "The Nugget Effect" on page 2051 in Chapter 37, "The KRIGE2D Procedure."

The SIMULATE statement can be given a label. This is useful for identification in the OUTSIM= data set when multiple SIMULATE statements are specified. For example,

   proc sim2d data=...;
      gauss1: sim var=.... form=gauss;
              mean ....;
      gauss2: sim var=.... form=gauss;
              mean ....;
      exp1:   sim var=.... form=exp;
              mean ....;
      exp2:   sim var=.... form=exp;
              mean ....;
   run;

In the OUTSIM= data set, the values 'GAUSS1', 'GAUSS2', 'EXP1', and 'EXP2' for the LABEL variable help to identify the realizations corresponding to the four SIMULATE statements. If you do not provide a label for a SIMULATE statement, a default label of SIMn is given, where n is the number of unlabeled SIMULATE statements seen so far.
NUGGET=number
  specifies the nugget effect for the model. This effect is due to a discontinuity in the semivariogram as determined by plotting the sample semivariogram (refer to the section "The Nugget Effect" on page 2051 in the chapter on the KRIGE2D procedure for details). For models without any nugget effect, the NUGGET= option is left out. The default is NUGGET=0.

RANGE=range
RANGE=(range1, ..., rangek)
  specifies the range parameter in the semivariogram models. In the case of a nested semivariogram model, you must specify a range for each nesting. The range parameter is the divisor in the exponent in all supported models. It has the units of distance or distance squared for these models, and it is related to the correlation scale for the underlying spatial process. Refer to the section "Theoretical Semivariogram Models" beginning on page 2045 in the chapter on the KRIGE2D procedure for details on how the RANGE= values are determined.

RATIO=ratio
RATIO=(ratio1, ..., ratiok)
  specifies the ratio of the length of the minor axis to the length of the major axis for anisotropic models. The value of the RATIO= option must be between 0 and 1. In the case of a nested semivariogram model, you can specify a ratio for each nesting. The default is RATIO=1.

SCALE=scale
SCALE=(scale1, ..., scalek)
  specifies the scale parameter in semivariogram models. In the case of a nested semivariogram model, you must specify a scale for each nesting. The scale parameter is the multiplicative factor in all supported models; it has the same units as the variance of the VAR= variable. Refer to the section "Theoretical Semivariogram Models" beginning on page 2045 in the chapter on the KRIGE2D procedure for details on how the SCALE= values are determined.

SEED=seed-value
  specifies the seed to use for the random number generator. If you omit the SEED= value, the system clock is used.

SINGULAR=number
  gives the singularity criterion for solving the set of linear equations involved in the computation of the mean and covariance of the conditional distribution associated with a given SIMULATE statement. The larger the value of the SINGULAR= option, the easier it is for the covariance matrix system to be declared singular. The default is SINGULAR=1E-8. For more details on the use of the SINGULAR= option, see the section "Computational and Theoretical Details of Spatial Simulation" beginning on page 4106.
MEAN Statement

MEAN spec1, ..., spec6 ;
MEAN QDATA=SAS-data-set CONST=var1 CX=var2 CY=var3 CXX=var4 CYY=var5 CXY=var6 ;
MEAN QDATA=SAS-data-set ;

A mean function µ(s) that is a quadratic in the coordinates can be written

$$ \mu(\mathbf{s}) = \mu(x, y) = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 x^2 + \beta_4 y^2 + \beta_5 xy $$

The MEAN statement is used to specify the quadratic surface to use as the mean function for the simulated SRF. There are three ways to specify the MEAN statement. The MEAN statement allows the specification of the coefficients β0, ..., β5 either explicitly or through a QDATA= data set. An example of an explicit specification is

   mean 1.4 + 2.5*x + 3.6*y + .47*x*x + .58*y*y + .69*x*y;

In this example, all terms have a nonzero coefficient. Any term with a zero coefficient is simply left out of the specification. For example,

   mean 1.4;

is a valid quadratic form with all terms having zero coefficients except the constant term.

An equivalent way of specifying the mean function is through the QDATA= data set. For example, the MEAN statement

   mean 1.4 + 2.5*x + 3.6*y + .47*x*x + .58*y*y + .69*x*y;

can be alternatively specified by the following DATA step and MEAN statement:

   data q1;
      input c1 c2 c3 c4 c5 c6;
      datalines;
   1.4 2.5 3.6 .47 .58 .69
   run;

   proc sim2d data=....;
      simulate ...;
      mean qdata=q1 const=c1 cx=c2 cy=c3
           cxx=c4 cyy=c5 cxy=c6;
   run;
The QDATA= data set specifies the data set containing the coefficients. The parameters CONST=, CX=, CY=, CXX=, CYY=, and CXY= specify the variables in the QDATA= data set that correspond to the constant, linear x, linear y, and so on. For any coefficient not specified in this list, the QDATA= data set is checked for the presence of variables with the default names CONST, CX, CY, CXX, CYY, and CXY. If these variables are present, their values are taken as the corresponding coefficients. Hence, you can rewrite the previous example as

   data q1;
      input const cx cy cxx cyy cxy;
      datalines;
   1.4 2.5 3.6 .47 .58 .69
   ;

   proc sim2d data=....;
      simulate ...;
      mean qdata=q1;
   run;

If a given coefficient does not appear in the list or in the data set with the default name, a value of zero is assumed.
Details

Computational and Theoretical Details of Spatial Simulation

Introduction

There are a number of approaches to simulating spatial random fields or, more generally, simulating sets of dependent random variables. These include sequential indicator methods, turning bands, and the Karhunen-Loeve expansion. Refer to Christakos (1992, Chapter 8) and Duetsch and Journel (1992, Chapter V) for details.

A particularly simple method available for Gaussian spatial random fields is the LU decomposition method. This method is computationally efficient. For a given covariance matrix, the LU = LL^T decomposition is computed once, and the simulation proceeds by repeatedly generating a vector of independent N(0,1) random variables and multiplying by the L matrix.

One problem with this technique is memory requirements; memory is required to hold the full data and grid covariance matrix in core. While this is especially limiting in the three-dimensional case, you can use PROC SIM2D, which handles only two-dimensional data, for moderately sized simulation problems.
Theoretical Development

It is a simple matter to produce an N(0,1) random number, and by stacking k N(0,1) random numbers in a column vector, you can obtain a vector with independent standard normal components W ~ Nk(0, I). The meaning of the terms independence and randomness in the context of a deterministic algorithm required for the generation of these numbers is a little subtle; refer to Knuth (1981, Vol. 2, Chapter 3) for details.
Rather than $\mathbf{W} \sim N_k(\mathbf{0}, \mathbf{I})$, what is required is the generation of a vector $\mathbf{Z} \sim N_k(\mathbf{0}, \mathbf{C})$, that is,

$$ \mathbf{Z} = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_k \end{pmatrix} $$

with covariance matrix

$$ \mathbf{C} = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1k} \\ C_{21} & C_{22} & \cdots & C_{2k} \\ \vdots & & & \vdots \\ C_{k1} & C_{k2} & \cdots & C_{kk} \end{pmatrix} $$

If the covariance matrix is symmetric and positive definite, it has a Cholesky root $\mathbf{L}$ such that $\mathbf{C}$ can be factored as

$$ \mathbf{C} = \mathbf{L}\mathbf{L}^T $$

where $\mathbf{L}$ is lower triangular. Refer to Ralston and Rabinowitz (1978, Chapter 9, Section 3-3) for details. The vector $\mathbf{Z}$ can be generated by the transformation $\mathbf{Z} = \mathbf{L}\mathbf{W}$. Note that this is where the assumption of a Gaussian SRF is crucial. When $\mathbf{W} \sim N_k(\mathbf{0}, \mathbf{I})$, then $\mathbf{Z} = \mathbf{L}\mathbf{W}$ is also Gaussian. The mean of $\mathbf{Z}$ is

$$ E(\mathbf{Z}) = \mathbf{L}\,E(\mathbf{W}) = \mathbf{0} $$

and the variance is

$$ \mathrm{Var}(\mathbf{Z}) = \mathrm{Var}(\mathbf{L}\mathbf{W}) = E(\mathbf{L}\mathbf{W}\mathbf{W}^T\mathbf{L}^T) = \mathbf{L}\,E(\mathbf{W}\mathbf{W}^T)\,\mathbf{L}^T = \mathbf{L}\mathbf{L}^T = \mathbf{C} $$

Consider now an SRF $Z(\mathbf{s})$, $\mathbf{s} \in D \subset R^2$, with spatial covariance function $C(\mathbf{h})$. Fix locations $\mathbf{s}_1, \mathbf{s}_2, \cdots, \mathbf{s}_k$, and let $\mathbf{Z}$ denote the random vector

$$ \mathbf{Z} = \begin{pmatrix} Z(\mathbf{s}_1) \\ Z(\mathbf{s}_2) \\ \vdots \\ Z(\mathbf{s}_k) \end{pmatrix} $$

with corresponding covariance matrix

$$ \mathbf{C}_z = \begin{pmatrix} C(\mathbf{0}) & C(\mathbf{s}_1-\mathbf{s}_2) & \cdots & C(\mathbf{s}_1-\mathbf{s}_k) \\ C(\mathbf{s}_2-\mathbf{s}_1) & C(\mathbf{0}) & \cdots & C(\mathbf{s}_2-\mathbf{s}_k) \\ \vdots & & & \vdots \\ C(\mathbf{s}_k-\mathbf{s}_1) & C(\mathbf{s}_k-\mathbf{s}_2) & \cdots & C(\mathbf{0}) \end{pmatrix} $$
Since this covariance matrix is symmetric and positive definite, it has a Cholesky root, and the $Z(\mathbf{s}_i), i = 1, \cdots, k$ can be simulated as described previously. This is how the SIM2D procedure implements unconditional simulation in the zero mean case. More generally,

$$ Z(\mathbf{s}) = \mu(\mathbf{s}) + \varepsilon(\mathbf{s}) $$

with $\mu(\mathbf{s})$ being a quadratic form in the coordinates $\mathbf{s} = (x, y)$ and $\varepsilon(\mathbf{s})$ being an SRF having the same covariance matrix $\mathbf{C}_z$ as previously. In this case, the $\mu(\mathbf{s}_i), i = 1, \cdots, k$ is computed once and added to the simulated vector $\varepsilon(\mathbf{s}_i), i = 1, \cdots, k$ for each realization.

For a conditional simulation, the distribution of

$$ \mathbf{Z} = \begin{pmatrix} Z(\mathbf{s}_1) \\ Z(\mathbf{s}_2) \\ \vdots \\ Z(\mathbf{s}_k) \end{pmatrix} $$

must be conditioned on the observed data. The relevant general result concerning conditional distributions of multivariate normal random variables is the following. Let $\mathbf{X} \sim N_m(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$ \mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} $$

The subvector $\mathbf{X}_1$ is $k \times 1$, $\mathbf{X}_2$ is $n \times 1$, $\boldsymbol{\Sigma}_{11}$ is $k \times k$, $\boldsymbol{\Sigma}_{22}$ is $n \times n$, and $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}^T$ is $k \times n$, with $k + n = m$. The full vector $\mathbf{X}$ is partitioned into two subvectors $\mathbf{X}_1$ and $\mathbf{X}_2$, and $\boldsymbol{\Sigma}$ is similarly partitioned into covariances and cross covariances.

With this notation, the distribution of $\mathbf{X}_1$ conditioned on $\mathbf{X}_2 = \mathbf{x}_2$ is $N_k(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}})$, with

$$ \tilde{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) $$

and

$$ \tilde{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} $$
Refer to Searle (1971, pp. 46–47) for details. The correspondence with the conditional spatial simulation problem is as follows. Let the coordinates of the observed data points be denoted $\tilde{\mathbf{s}}_1, \tilde{\mathbf{s}}_2, \cdots, \tilde{\mathbf{s}}_n$, with values $\tilde{z}_1, \tilde{z}_2, \cdots, \tilde{z}_n$. Let $\tilde{\mathbf{Z}}$ denote the random vector

$$ \tilde{\mathbf{Z}} = \begin{pmatrix} Z(\tilde{\mathbf{s}}_1) \\ Z(\tilde{\mathbf{s}}_2) \\ \vdots \\ Z(\tilde{\mathbf{s}}_n) \end{pmatrix} $$

The random vector $\tilde{\mathbf{Z}}$ corresponds to $\mathbf{X}_2$, while $\mathbf{Z}$ corresponds to $\mathbf{X}_1$. Then $\mathbf{Z} \mid \tilde{\mathbf{Z}} = \tilde{\mathbf{z}}$ is $N_k(\tilde{\boldsymbol{\mu}}, \tilde{\mathbf{C}})$ as in the previous distribution. The matrix

$$ \tilde{\mathbf{C}} = \mathbf{C}_{11} - \mathbf{C}_{12}\mathbf{C}_{22}^{-1}\mathbf{C}_{21} $$

is again positive definite, so a Cholesky factorization can be performed.

The dimension $n$ for $\tilde{\mathbf{Z}}$ is simply the number of nonmissing observations for the VAR= variable; the values $\tilde{z}_1, \tilde{z}_2, \cdots, \tilde{z}_n$ are the values of this variable. The coordinates $\tilde{\mathbf{s}}_1, \tilde{\mathbf{s}}_2, \cdots, \tilde{\mathbf{s}}_n$ are also found in the DATA= data set, with the variables corresponding to the x and y coordinates identified in the COORDINATES statement. Note that all VAR= variables use the same set of conditioning coordinates; this fixes the matrix $\mathbf{C}_{22}$ for all simulations.

The dimension $k$ for $\mathbf{Z}$ is the number of grid points specified in the GRID statement. Since there is a single GRID statement, this fixes the matrix $\mathbf{C}_{11}$ for all simulations. Similarly, $\mathbf{C}_{12}$ is fixed.

The Cholesky factorization $\tilde{\mathbf{C}} = \mathbf{L}\mathbf{L}^T$ is computed once, as is the mean correction

$$ \tilde{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \mathbf{C}_{12}\mathbf{C}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) $$

Note that the means $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are computed using the grid coordinates $\mathbf{s}_1, \mathbf{s}_2, \cdots, \mathbf{s}_k$, the data coordinates $\tilde{\mathbf{s}}_1, \tilde{\mathbf{s}}_2, \cdots, \tilde{\mathbf{s}}_n$, and the quadratic form specification from the MEAN statement.

The simulation is now performed exactly as in the unconditional case. A $k \times 1$ vector of independent standard $N(0,1)$ random variables is generated and multiplied by $\mathbf{L}$, and $\tilde{\boldsymbol{\mu}}$ is added to the transformed vector. This is repeated $N$ times, where $N$ is the value specified for the NR= option.
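The core of this algorithm is easy to exercise directly. The following PROC IML sketch (an illustration of the same Cholesky approach, not part of the SIM2D implementation; the grid locations and covariance parameters are assumed values) performs a small unconditional zero-mean simulation:

   proc iml;
      /* covariance parameters: Gaussian form, sill c0, range a0 */
      c0 = 7.5;  a0 = 30.0;
      /* a small 3-point grid, one (x,y) pair per row */
      s = {0 0, 20 0, 0 20};
      k = nrow(s);
      /* build the k x k covariance matrix from pairwise distances */
      C = j(k, k, 0);
      do i = 1 to k;
         do j = 1 to k;
            h2 = ssq(s[i,] - s[j,]);
            C[i,j] = c0 # exp(-h2 / (a0##2));
         end;
      end;
      L = t(root(C));           /* root() returns R with C = R`*R */
      w = normal(j(k, 1, 123)); /* independent N(0,1) deviates    */
      z = L * w;                /* one realization, Z ~ N(0, C)   */
      print z;
   quit;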
Computational Details

In the computation of $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ described in the previous section, the inverse $\boldsymbol{\Sigma}_{22}^{-1}$ is never actually computed; an equation of the form $\boldsymbol{\Sigma}_{22}\mathbf{A} = \mathbf{B}$ is solved for $\mathbf{A}$ using a modified Gaussian elimination algorithm that takes advantage of the fact that $\boldsymbol{\Sigma}_{22}$ is symmetric with constant diagonal $C_z(\mathbf{0})$ that is larger than all off-diagonal elements.

The SINGULAR= option pertains to this algorithm. The value specified for the SINGULAR= option is scaled by $C_z(\mathbf{0})$ before comparison with the pivot element.
Memory Usage

For conditional simulations, the largest matrix held in core at any one time depends on the number of grid points and data points. Using the previous notation, the data-data covariance matrix $\mathbf{C}_{22}$ is $n \times n$, where $n$ is the number of nonmissing observations for the VAR= variable in the DATA= data set. The grid-data cross covariance $\mathbf{C}_{12}$ is $n \times k$, where $k$ is the number of grid points. The grid-grid covariance $\mathbf{C}_{11}$ is $k \times k$. The maximum memory required at any one time for storing these matrices is

$$ \max\bigl(k(k+1),\; n(n+1) + 2nk\bigr) \times \mathrm{sizeof(double)} $$

There are additional memory requirements that add to the total memory usage, but usually these matrix calculations dominate, especially when the number of grid points is large.
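As a worked example (the grid and data counts are illustrative): with $n = 75$ data points and $k = 100$ grid points, the bound is $\max(100 \cdot 101,\; 75 \cdot 76 + 2 \cdot 75 \cdot 100) \times 8 = \max(10100,\; 20700) \times 8 = 165{,}600$ bytes, so at this grid size the data-side matrices dominate.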
Output Data Set

The SIM2D procedure produces a single output data set: the OUTSIM= SAS data set. The OUTSIM= data set contains all the needed information to uniquely identify the simulated values. The OUTSIM= data set contains the following variables:

• LABEL, which is the label for the current SIMULATE statement

• VARNAME, which is the name of the conditioning variable for the current SIMULATE statement

• _ITER_, which is the iteration number within the current SIMULATE statement

• GXC, which is the x-coordinate for the current grid point

• GYC, which is the y-coordinate for the current grid point

• SVALUE, which is the value of the simulated variable

If you specify the NARROW option in the PROC SIM2D statement, the LABEL and VARNAME variables are not included in the OUTSIM= data set. This option is useful in the case where the number of data points, grid points, and realizations are such that they generate a very large OUTSIM= data set. The size of the OUTSIM= data set is reduced when these variables are not included.

In the case of an unconditional simulation, the VARNAME variable is not included. In the case of mixed conditional and unconditional simulations (that is, when multiple SIMULATE statements are specified and one or more contain a VAR= specification and one or more do not), the VARNAME variable is included but is given a missing value for those observations corresponding to an unconditional simulation.
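Since each grid point contributes NUMREAL= observations to the OUTSIM= data set, a typical post-processing step is to average SVALUE over realizations at each grid location, as in the following sketch (the data set sim1 and the variables gxc, gyc, and svalue are from the "Getting Started" example; the output name avgsim is hypothetical):

   proc sort data=sim1;
      by gxc gyc;
   run;

   proc means data=sim1 noprint;
      var svalue;
      by gxc gyc;
      output out=avgsim mean=meansim;
   run;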
Example

Example 65.1. Simulation

Continuing with the coal seam thickness example from the "Getting Started" section beginning on page 4092, this example asks a more complicated question. This question is economic in nature, and the (approximate) answer requires the use of simulation.
Simulating a Subregion for Economic Feasibility

The coal seam must be of a minimum thickness, called a cutoff value, for a mining operation to be profitable. Suppose that, for a subregion of the measured area, the cost of mining is higher than in the remaining areas due to the geology of the overburden. This higher cost results in a higher thickness cutoff value for the subregion. Suppose also that it is determined from a detailed cost analysis that at least 60 percent of the subregion must exceed a seam thickness of 39.7 feet for profitability.

How can you use the SRF model (µ and Cz(s)) and the measured seam thickness values Z(si), i = 1, ..., 75 to determine, in some approximate way, if at least 60 percent of the subregion exceeds this minimum?

Spatial prediction does not appear to be helpful in answering this question. While it is easy to determine if a predicted value at a location in the subregion is above the 39.7 feet cutoff value, it is not clear how to incorporate the standard error associated with the predicted value. The standard error is what characterizes the stochastic nature of the prediction (and the underlying SRF). It is clear that it must be included in any realistic approach to the problem.

A conditional simulation, on the other hand, seems to be a natural way of obtaining an approximate answer. By simulating the SRF on a sufficiently fine grid in the subregion, you can determine the proportion of grid points in which the mean value over realizations exceeds the 39.7 feet cutoff and compare it with the 60 percent value needed for profitability.

It is desirable in any simulation study that the quantity being estimated (in this case, the proportion exceeding the 39.7 feet cutoff) not depend on the number of simulations performed. For example, suppose that the maximum seam thickness is simulated. It is likely that the maximum value increases as the number of simulations performed increases. Hence, a simulation is not useful for such an estimate. A simulation is useful for determining the distribution of the maximum, but there are general theoretical results for such distributions, making such a simulation unnecessary. Refer to Leadbetter, Lindgren, and Rootzen (1983) for details.

In the case of simulating the proportion exceeding the 39.7 feet cutoff, it is expected that this quantity will settle down to a fixed value as the number of realizations increases. At a fixed grid point, the quantity being compared with the cutoff value is the mean over all simulated realizations; this mean value settles down to a fixed number as the number of realizations increases. In the same manner, the proportion of the grid where the mean values exceed the cutoff also becomes constant. This can be tested using PROC SIM2D.
A crucial, nonprovable assumption in applying SRF theory to the coal seam thickness data is that the values Z(s_i), i = 1, ..., 75, represent a single realization from the set of all possible realizations consistent with the SRF model (µ and Cz(s)). A conditional simulation repeatedly produces other possible simulated realizations consistent with the model and data. However, the only concern of the mining company is with this single unique realization. It is not concerned with similar coal fields to be mined sometime in the future; it may never see another coal field remotely similar to this one, or it may not be in business in the future.

Hence, the proportion found by generating repeated simulated realizations must somehow relate back to the unique realization that is the coal field (seam thickness). This is done by relating the proportion found from a simulation to the spatial mean proportion for the unique realization. The term "spatial mean" is simply an appropriate integral over the fixed (but unknown) spatial function z(s). (The SRF is denoted Z(s); a particular realization, a deterministic function of the spatial coordinates, is denoted z(s).) This interpretation requires an ergodic assumption, which is also needed in the original estimation of Cz(s). Refer to Cressie (1993, pp. 53–58) for a discussion of ergodicity and Gaussian SRFs.
Implementation Using PROC SIM2D

The subregion to be considered is the southeast corner of the field, which is a square region with length 40 distance units (in thousands of feet). PROC SIM2D is run on the entire data set for conditioning, while the simulation grid covers only this subregion. It is convenient to be able to vary the seed, the grid increment, and the number of simulations performed. The following macro implements the computation of the percent area exceeding the cutoff value by using the seed, the grid increment, and the number of simulated realizations as macro arguments.

The data set produced by PROC SIM2D is transposed so that each grid location is a separate variable. The MEANS procedure is then used to average the simulated value at each grid point over all realizations. It is this average that is compared to the cutoff value. The last DATA step does the comparison, determines the percent of the grid locations that exceed this cutoff value, and writes the results to the listing file in the form of a report.

The macro is first invoked with a relatively coarse grid (grid increment of 10 distance units) and a small number of realizations (5). The next invocation uses a finer grid and 50 realizations, and the final invocation uses the same grid increment and 500 realizations. Each time, the macro is invoked with a different seed. The simulations indicate that around 87 percent of the subregion exceeds the cutoff value.

The number of grid points in the simulation increases with the square of the decrease in the grid increment, leading to long CPU processing times. Increasing the number of realizations results in a linear increase in processing time. Hence, using as coarse a grid as possible allows more realizations and experimentation with different seeds.
   /*- Set the covariance model parameters and cutoff value -*/
   %let cc0=7.5;
   %let aa0=30.0;
   %let form=gauss;
   %let cut=39.7;

   data thick;
      input east north thick @@;
      datalines;
    0.7  59.6  34.1   2.1  82.7  42.2   4.7  75.1  39.5
    4.8  52.8  34.3   5.9  67.1  37.0   6.0  35.7  35.9
    6.4  33.7  36.4   7.0  46.7  34.6   8.2  40.1  35.4
   13.3   0.6  44.7  13.3  68.2  37.8  13.4  31.3  37.8
   17.8   6.9  43.9  20.1  66.3  37.7  22.7  87.6  42.8
   23.0  93.9  43.6  24.3  73.0  39.3  24.8  15.1  42.3
   24.8  26.3  39.7  26.4  58.0  36.9  26.9  65.0  37.8
   27.7  83.3  41.8  27.9  90.8  43.3  29.1  47.9  36.7
   29.5  89.4  43.0  30.1   6.1  43.6  30.8  12.1  42.8
   32.7  40.2  37.5  34.8   8.1  43.3  35.3  32.0  38.8
   37.0  70.3  39.2  38.2  77.9  40.7  38.9  23.3  40.5
   39.4  82.5  41.4  43.0   4.7  43.3  43.7   7.6  43.1
   46.4  84.1  41.5  46.7  10.6  42.6  49.9  22.1  40.7
   51.0  88.8  42.0  52.8  68.9  39.3  52.9  32.7  39.2
   55.5  92.9  42.2  56.0   1.6  42.7  60.6  75.2  40.1
   62.1  26.6  40.1  63.0  12.7  41.8  69.0  75.6  40.1
   70.5  83.7  40.9  70.9  11.0  41.7  71.5  29.5  39.8
   78.1  45.5  38.7  78.2   9.1  41.7  78.4  20.0  40.8
   80.5  55.9  38.7  81.1  51.0  38.6  83.8   7.9  41.6
   84.5  11.0  41.5  85.2  67.3  39.4  85.5  73.0  39.8
   86.7  70.4  39.6  87.2  55.7  38.8  88.1   0.0  41.6
   88.4  12.1  41.3  88.4  99.6  41.2  88.8  82.9  40.5
   88.9   6.2  41.5  90.6   7.0  41.5  90.7  49.6  38.9
   91.5  55.4  39.0  92.9  46.8  39.1  93.4  70.9  39.7
   94.8  71.5  39.7  96.2  84.3  40.3  98.2  58.2  39.5
   ;

   %macro area_sim(seed=,nr=,ginc=);
      %let ngrid=%eval(40/&ginc+1);
      %let tgrid=%eval(&ngrid*&ngrid);

      proc sim2d data=thick outsim=sim1;
         simulate var=thick numreal=&nr seed=&seed
                  scale=&cc0 range=&aa0 form=&form;
         mean 40.14;
         coordinates xc=east yc=north;
         grid x=60 to 100 by &ginc
              y=0 to 40 by &ginc;
      run;

      proc transpose data=sim1 out=sim2 prefix=sims;
         by _iter_;
         var svalue;
      run;
      proc means data=sim2 noprint n mean;
         var sims1-sims&tgrid;
         output out=msim n=numsim mean=ms1-ms&tgrid;
      run;

      /*- Determine the percentage of sites exceeding cutoff -*/
      data _null_;
         file print;
         array simss ms1-ms&tgrid;
         set msim;

         /*- Loop over the grid sites to test cutoff -*/
         cflag=0;
         do ss=1 to &tgrid;
            tempv=simss[ss];
            if simss[ss] > &cut then do;
               cflag + 1;
            end;
         end;

         area_per=100*(cflag/&tgrid);
         put // +5 'Conditional Simulation of Coal Seam'
                   ' Thickness for Subregion';
         put / +5 'Subregion is South-East Corner 40 by 40'
                  ' distance units';
         put / +5 "Seed:&seed" +2 "Grid Increment:&ginc";
         put / +5 "Total Number of Grid Points:&tgrid" +2
                  "Number of Simulations:&nr";
         put / +5 "Percent of Subregion Exceeding Cutoff of &cut ft.:"
               +2 area_per 5.2;
      run;
   %mend area_sim;
   %area_sim(seed=12345,nr=5,ginc=10);
   %area_sim(seed=54321,nr=50,ginc=1);
   %area_sim(seed=655311,nr=500,ginc=1);

Output 65.1.1. Conditional Simulation of Coal Seam Thickness

     Conditional Simulation of Coal Seam Thickness for Subregion

     Subregion is South-East Corner 40 by 40 distance units

     Seed:12345  Grid Increment:10

     Total Number of Grid Points:25  Number of Simulations:5

     Percent of Subregion Exceeding Cutoff of 39.7 ft.:  80.00
     Conditional Simulation of Coal Seam Thickness for Subregion

     Subregion is South-East Corner 40 by 40 distance units

     Seed:54321  Grid Increment:1

     Total Number of Grid Points:1681  Number of Simulations:50

     Percent of Subregion Exceeding Cutoff of 39.7 ft.:  88.34
     Conditional Simulation of Coal Seam Thickness for Subregion

     Subregion is South-East Corner 40 by 40 distance units

     Seed:655311  Grid Increment:1

     Total Number of Grid Points:1681  Number of Simulations:500

     Percent of Subregion Exceeding Cutoff of 39.7 ft.:  87.63
References

Christakos, G. (1992), Random Field Models in Earth Sciences, New York: Academic Press.

Cressie, N.A.C. (1993), Statistics for Spatial Data, New York: John Wiley & Sons, Inc.

Deutsch, C.V. and Journel, A.G. (1992), GSLIB: Geostatistical Software Library and User's Guide, New York: Oxford University Press.

Knuth, D.E. (1981), The Art of Computer Programming: Seminumerical Algorithms, Vol. 2, Second Edition, Reading, MA: Addison-Wesley.

Leadbetter, M.R., Lindgren, G., and Rootzen, H. (1983), Extremes and Related Properties of Random Sequences and Processes, New York: Springer-Verlag.

Ralston, A. and Rabinowitz, P. (1978), A First Course in Numerical Analysis, Second Edition, New York: McGraw-Hill, Inc.

Searle, S.R. (1971), Linear Models, New York: John Wiley & Sons, Inc.
Chapter 66
The STDIZE Procedure

Chapter Contents

OVERVIEW
GETTING STARTED
SYNTAX
   PROC STDIZE Statement
   BY Statement
   FREQ Statement
   LOCATION Statement
   SCALE Statement
   VAR Statement
   WEIGHT Statement
DETAILS
   Standardization Methods
   Computation of the Statistics
   Computing Quantiles
   Missing Values
   Output Data Sets
   Displayed Output
   ODS Table Names
EXAMPLES
   Example 66.1. Standardization of Variables in Cluster Analysis
REFERENCES
Overview

The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods, such as mean, median, std, range, Huber's estimate, Tukey's biweight estimate, and Andrews' wave estimate, are available in the STDIZE procedure.

In addition, you can multiply each standardized value by a constant and add a constant. Thus, the final output value is

   result = add + multiply × (original − location) / scale

where

   result   = final output value
   add      = constant to add (ADD= option)
   multiply = constant to multiply by (MULT= option)
   original = original input value
   location = location measure
   scale    = scale measure
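As a concrete illustration of this formula, the following minimal sketch (the data set scores and variable total are hypothetical) uses the ADD= and MULT= options to rescale standardized values to T-scores with location 50 and scale 10:

   proc stdize data=scores out=tscores method=std mult=10 add=50;
      var total;
   run;

Following the formula above, each output value is 50 + 10 × (total − mean)/std.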
PROC STDIZE can also find quantiles in one pass of the data, a capability that is especially useful for very large data sets. With such data sets, the UNIVARIATE procedure may require excessive memory or time.
Getting Started

The following example demonstrates how you can use the STDIZE procedure to obtain location and scale measures of your data.

In the following hypothetical data set, a random sample of grade 12 students is selected from a number of co-educational schools. Each school is classified as one of two types: Urban or Rural. There are 40 observations. The variables are id (student identification), Type (type of school attended: 'urban'=urban area and 'rural'=rural area), and total (total assessment scores in History, Geometry, and Chemistry). The following DATA step creates the SAS data set TotalScores.
   data TotalScores;
      title 'High School Scores Data';
      input id Type $ total;
      datalines;
    1 rural 135
    2 rural 125
    3 rural 223
    4 rural 224
    5 rural 133
    6 rural 253
    7 rural 144
    8 rural 193
    9 rural 152
   10 rural 178
   11 rural 120
   12 rural 180
   13 rural 154
   14 rural 184
   15 rural 187
   16 rural 111
   17 rural 190
   18 rural 128
   19 rural 110
   20 rural 217
   21 urban 192
   22 urban 186
   23 urban  64
   24 urban 159
   25 urban 133
   26 urban 163
   27 urban 130
   28 urban 163
   29 urban 189
   30 urban 144
   31 urban 154
   32 urban 198
   33 urban 150
   34 urban 151
   35 urban 152
   36 urban 151
   37 urban 127
   38 urban 167
   39 urban 170
   40 urban 123
   ;
   run;
Suppose you would now like to standardize the total scores in different types of schools prior to any further analysis. Before standardizing the total scores, you can use the Schematic Plots from PROC UNIVARIATE to summarize the total scores for both types of schools.

   proc univariate data=TotalScores plot;
      var total;
      by Type;
   run;
The PLOT option in the PROC UNIVARIATE statement creates the Schematic Plots and several other types of plots. The Schematic Plots display side-by-side box plots for each BY group (Figure 66.1). The vertical axis represents the total scores, and the horizontal axis displays two box plots: the one on the left is for the rural scores and the one on the right is for the urban scores.
                         High School Scores Data

                        The UNIVARIATE Procedure
                            Variable:  total
                            Schematic Plots

   [Side-by-side box plots of total by Type, rural on the left and
    urban on the right. The urban box plot shows a single low outlier,
    marked by the symbol '0', near the bottom of the scale.]

Figure 66.1. Schematic Plots from PROC UNIVARIATE
Inspection reveals that one urban score is a low outlier. Also, if you compare the lengths of the two box plots, there seems to be twice as much dispersion for the rural scores as for the urban scores.

                         High School Scores Data

------------------------------- Type=urban -------------------------------

                        The UNIVARIATE Procedure
                            Variable:  total

                          Extreme Observations

                 ----Lowest----        ----Highest---

                 Value      Obs        Value      Obs

                    64       23          170       39
                   123       40          186       22
                   127       37          189       29
                   130       27          192       21
                   133       25          198       32

Figure 66.2. Table for Extreme Observations When Type=urban
Figure 66.2 displays the table from PROC UNIVARIATE for the lowest and highest five total scores for urban schools. The outlier (Obs = 23), marked in Figure 66.1 by the symbol '0', has a score of 64.

The following statements use the traditional standardization (METHOD=STD) to compute the location and scale measures:

   proc stdize data=totalscores method=std pstat;
      title2 'METHOD=STD';
      var total;
      by Type;
   run;
                         High School Scores Data
                              METHOD=STD

------------------------------- Type=rural -------------------------------

                          The STDIZE Procedure

                       Location and Scale Measures

              Location = mean    Scale = standard deviation

              Name          Location         Scale     N
              total       167.050000     41.956713    20

------------------------------- Type=urban -------------------------------

                       Location and Scale Measures

              Location = mean    Scale = standard deviation

              Name          Location         Scale     N
              total       153.300000     30.066768    20

Figure 66.3. Location and Scale Measures Table When METHOD=STD
Figure 66.3 displays the table of location and scale measures from the PROC STDIZE statement. PROC STDIZE uses the mean as the location measure and the standard deviation as the scale measure for standardizing. The PSTAT option displays this table; otherwise, no display is created. The ratio of the scale of rural scores to the scale of urban scores is approximately 1.4 (41.96/30.07). This ratio is smaller than the dispersion ratio observed in the previous Schematic Plots.

The STDIZE procedure provides several location and scale measures that are resistant to outliers. The following statements invoke three different standardization methods and display the Location and Scale Measures tables:

   proc stdize data=totalscores method=mad pstat;
      title2 'METHOD=MAD';
      var total;
      by Type;
   run;
   proc stdize data=totalscores method=iqr pstat;
      title2 'METHOD=IQR';
      var total;
      by Type;
   run;

   proc stdize data=totalscores method=abw(4) pstat;
      title2 'METHOD=ABW(4)';
      var total;
      by Type;
   run;
The results from this analysis are displayed in the following figures.

                         High School Scores Data
                              METHOD=MAD

------------------------------- Type=rural -------------------------------

                          The STDIZE Procedure

                       Location and Scale Measures

         Location = median    Scale = median abs dev from median

              Name          Location         Scale     N
              total       166.000000     32.000000    20

------------------------------- Type=urban -------------------------------

                       Location and Scale Measures

         Location = median    Scale = median abs dev from median

              Name          Location         Scale     N
              total       153.000000     15.500000    20

Figure 66.4. Location and Scale Measures Table When METHOD=MAD
Figure 66.4 displays the table of location and scale measures when the standardization method is MAD. The location measure is the median, and the scale measure is the median absolute deviation from median. The ratio of the scale of rural scores to the scale of urban scores is approximately 2.06 (32.0/15.5) and is close to the dispersion ratio observed in the previous Schematic Plots.
                         High School Scores Data
                              METHOD=IQR

------------------------------- Type=rural -------------------------------

                          The STDIZE Procedure

                       Location and Scale Measures

             Location = median    Scale = interquartile range

              Name          Location         Scale     N
              total       166.000000     61.000000    20

------------------------------- Type=urban -------------------------------

                       Location and Scale Measures

             Location = median    Scale = interquartile range

              Name          Location         Scale     N
              total       153.000000     30.000000    20

Figure 66.5. Location and Scale Measures Table When METHOD=IQR
Figure 66.5 displays the table of location and scale measures when the standardization method is IQR. The location measure is the median, and the scale measure is the interquartile range. The ratio of the scale of rural scores to the scale of urban scores is approximately 2.03 (61/30) and is, in fact, the dispersion ratio observed in the previous Schematic Plots.
                         High School Scores Data
                             METHOD=ABW(4)

------------------------------- Type=rural -------------------------------

                          The STDIZE Procedure

                       Location and Scale Measures

     Location = biweight 1-step M-estimate    Scale = biweight A-estimate

              Name          Location         Scale     N
              total       162.889603     56.662855    20

------------------------------- Type=urban -------------------------------

                       Location and Scale Measures

     Location = biweight 1-step M-estimate    Scale = biweight A-estimate

              Name          Location         Scale     N
              total       156.014608     28.615980    20

Figure 66.6. Location and Scale Measures Table When METHOD=ABW
Figure 66.6 displays the table of location and scale measures when the standardization method is ABW. The location measure is the biweight 1-step M-estimate, and the scale measure is the biweight A-estimate. Note that the initial estimate for ABW is MAD. The tuning constant (4) of ABW is obtained by the following steps:

1. For rural scores, the location estimate for MAD is 166.0 and the scale estimate for MAD is 32.0. The maximum of the rural scores is 253 (not shown), and the minimum is 110 (not shown). Thus, the tuning constant needs to be 3 so that it does not reject any observation that has a score between 110 and 253.

2. For urban scores, the location estimate for MAD is 153.0 and the scale estimate for MAD is 15.5. The maximum of the urban scores is 198, and the minimum (also an outlier) is 64. Thus, the tuning constant needs to be 4 so that it rejects the outlier (64) but includes the maximum (198) as a normal observation.

3. The maximum of the tuning constants, obtained in steps 1 and 2, is 4.

Refer to Goodall (1983, Chapter 11) for details on the tuning constant.

The ratio of the scale of rural scores to the scale of urban scores is approximately 1.98 (56.66/28.62). It is also close to the dispersion ratio observed in the previous Schematic Plots. The preceding analysis shows that METHOD=MAD, METHOD=IQR, and METHOD=ABW all provide better dispersion ratios than does METHOD=STD.
You can recompute the standard deviation after deleting the outlier from the original data set for comparison. The following statements create a data set NoOutlier that excludes the outlier from the TotalScores data set and invoke PROC STDIZE with METHOD=STD.

   data NoOutlier;
      set totalscores;
      if (total = 64) then delete;
   run;

   proc stdize data=NoOutlier method=std pstat;
      title2 'after removing outlier, METHOD=STD';
      var total;
      by Type;
   run;
                         High School Scores Data
                    after removing outlier, METHOD=STD

------------------------------- Type=rural -------------------------------

                          The STDIZE Procedure

                       Location and Scale Measures

              Location = mean    Scale = standard deviation

              Name          Location         Scale     N
              total       167.050000     41.956713    20

------------------------------- Type=urban -------------------------------

                       Location and Scale Measures

              Location = mean    Scale = standard deviation

              Name          Location         Scale     N
              total       158.000000     22.088207    19

Figure 66.7. After Deleting the Outlier, Location and Scale Measures Table When METHOD=STD
Figure 66.7 displays the location and scale measures after deleting the outlier. The lack of resistance of the standard deviation to outliers is clearly illustrated: if you delete the outlier, the sample standard deviation of urban scores changes from 30.07 to 22.09. The new ratio of the scale of rural scores to the scale of urban scores is approximately 1.90 (41.96/22.09).
Syntax

The following statements are available in the STDIZE procedure.

   PROC STDIZE < options > ;
      BY variables ;
      FREQ variable ;
      LOCATION variables ;
      SCALE variables ;
      VAR variables ;
      WEIGHT variable ;

The PROC STDIZE statement is required. The BY, FREQ, LOCATION, SCALE, VAR, and WEIGHT statements are described in alphabetical order following the PROC STDIZE statement.
PROC STDIZE Statement

PROC STDIZE < options > ;

The PROC STDIZE statement invokes the procedure. You can specify the following options in the PROC STDIZE statement.

Table 66.1. Summary of PROC STDIZE Statement Options

Task                              Option     Description
Specify standardization methods   METHOD=    specifies the name of the standardization method
                                  INITIAL=   specifies the method for computing initial
                                             estimates for the A estimates
Unstandardize variables           UNSTD      unstandardizes variables when you also specify
                                             the METHOD=IN option
Process missing values            NOMISS     omits observations with any missing values
                                             from computation
                                  MISSING=   specifies the method or a numeric value for
                                             replacing missing values
                                  REPLACE    replaces missing data by zero in the
                                             standardized data
                                  REPONLY    replaces missing data by the location measure
                                             (does not standardize the data)
Specify data set details          DATA=      specifies the input data set
                                  OUT=       specifies the output data set
                                  OUTSTAT=   specifies the output statistic data set
Specify computational settings    VARDEF=    specifies the variances divisor
                                  NMARKERS=  specifies the number of markers when you also
                                             specify PCTLMTD=ONEPASS
                                  MULT=      specifies the constant to multiply each value
                                             by after standardizing
                                  ADD=       specifies the constant to add to each value
                                             after standardizing and multiplying by the
                                             value specified in the MULT= option
                                  FUZZ=      specifies the relative fuzz factor for writing
                                             the output
Specify percentiles               PCTLDEF=   specifies the definition of percentiles when
                                             you also specify the PCTLMTD=ORD_STAT option
                                  PCTLMTD=   specifies the method used to estimate
                                             percentiles
                                  PCTLPTS=   writes observations containing percentiles to
                                             the data set specified in the OUTSTAT= option
Normalize scale estimators        NORM       normalizes the scale estimator to be consistent
                                             for the standard deviation of a normal
                                             distribution
                                  SNORM      normalizes the scale estimator to have an
                                             expectation of approximately 1 for a standard
                                             normal distribution
Specify output                    PSTAT      displays the location and scale measures
These options and their abbreviations are described, in alphabetical order, in the remainder of this section.

ADD=c
   specifies a constant, c, to add to each value after standardizing and multiplying by the value you specify in the MULT= option. The default value is 0.

DATA=SAS-data-set
   specifies the input data set to be standardized. If you omit the DATA= option, the most recently created data set is used.

FUZZ=c
   specifies the relative fuzz factor. The default value is 1E-14. For the OUT= data set, the score is computed as follows:

      if |Result| < Scale × Fuzz, then Result = 0

   For the OUTSTAT= data set and the Location and Scale table, the scale and location values are computed as follows:

      if Scale < |Location| × Fuzz, then Scale = 0

   Otherwise,

      if |Location| < Scale × Fuzz, then Location = 0

INITIAL=method
   specifies the method for computing initial estimates for the A estimates (ABW, AWAVE, and AHUBER). The following methods are not allowed: INITIAL=ABW, INITIAL=AHUBER, INITIAL=AWAVE, and INITIAL=IN. The default is INITIAL=MAD.

METHOD=name
   specifies the name of the method for computing location and scale measures. Valid values for name are as follows: MEAN, MEDIAN, SUM, EUCLEN, USTD, STD, RANGE, MIDRANGE, MAXABS, IQR, MAD, ABW, AHUBER, AWAVE, AGK, SPACING, L, and IN. For details on these methods, see the descriptions in the "Standardization Methods" section on page 4136. The default is METHOD=STD.

MISSING=method
MISSING=value
   specifies the method (or a numeric value) for replacing missing values. If you omit the MISSING= option, the REPLACE option replaces missing values with the location measure given by the METHOD= option. Specify the MISSING= option when you want to replace missing values with a different value. You can specify any name that is valid in the METHOD= option except the name IN. The corresponding location measure is used to replace missing values. If a numeric value is given, the value replaces missing values after standardizing the data. However, you can specify the REPONLY option with the MISSING= option to suppress standardization for cases in which you want only to replace missing values.
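For instance, the following minimal sketch (the data set scores is hypothetical) only replaces missing values of total by the median, without standardizing:

   proc stdize data=scores out=filled reponly missing=median;
      var total;
   run;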
MULT=c
   specifies a constant, c, by which to multiply each value after standardizing. The default value is 1.

NMARKERS=n
   specifies the number of markers used when you specify the one-pass algorithm (PCTLMTD=ONEPASS). The value n must be greater than or equal to 5. The default value is 105.

NOMISS
   omits observations with missing values for any of the analyzed variables from calculation of the location and scale measures. If you omit the NOMISS option, all nonmissing values are used.

NORM
   normalizes the scale estimator to be consistent for the standard deviation of a normal distribution when you specify the option METHOD=AGK, METHOD=IQR, METHOD=MAD, or METHOD=SPACING.
OUT=SAS-data-set
   specifies the name of the SAS data set created by PROC STDIZE. The output data set is a copy of the DATA= data set except that the analyzed variables have been standardized. Note that analyzed variables are those specified in the VAR statement or, if there is no VAR statement, all numeric variables not listed in any other statement. See the section "Output Data Sets" on page 4141 for more information. If you want to create a permanent SAS data set, you must specify a two-level name. (Refer to "SAS Files" in SAS Language Reference: Concepts for more information on permanent SAS data sets.) If you omit the OUT= option, PROC STDIZE creates an output data set named according to the DATAn convention.

OUTSTAT=SAS-data-set
   specifies the name of the SAS data set containing the location and scale measures and other computed statistics. See the section "Output Data Sets" on page 4141 for more information.

PCTLDEF=percentiles
   specifies which of five definitions is used to calculate percentiles when you specify the option PCTLMTD=ORD_STAT. By default, PCTLDEF=5. Note that the option PCTLMTD=ONEPASS implies a specification of PCTLDEF=5. See the section "Computational Methods for the PCTLDEF= Option" on page 4140 for details on the PCTLDEF= option.

PCTLMTD=ORD_STAT
PCTLMTD=ONEPASS | P2
   specifies the method used to estimate percentiles. Specify the PCTLMTD=ORD_STAT option to compute the percentiles by the order-statistics method. The PCTLMTD=ONEPASS option modifies an algorithm invented by Jain and Chlamtac (1985). See the "Computing Quantiles" section on page 4139 for more details on this algorithm.

PCTLPTS=n
   writes percentiles to the OUTSTAT= data set. Values of n can be any decimal number between 0 and 100, inclusive. A requested percentile is identified by the _TYPE_ variable in the OUTSTAT= data set with a value of Pn. For example, suppose you specify the option PCTLPTS=10, 30. The corresponding observations in the OUTSTAT= data set that contain the 10th and the 30th percentiles would then have values _TYPE_=P10 and _TYPE_=P30, respectively.
PSTAT
   displays the location and scale measures.

REPLACE
   replaces missing data with the value 0 in the standardized data (this value corresponds to the location measure before standardizing). To replace missing data by other values, see the preceding description of the MISSING= option. You cannot specify both the REPLACE and REPONLY options.

REPONLY
   replaces missing data only; PROC STDIZE does not standardize the data. Missing values are replaced with the location measure unless you also specify the MISSING=value option, in which case missing values are replaced with value. You cannot specify both the REPLACE and REPONLY options.

SNORM
   normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution when you specify the METHOD=SPACING option.
UNSTD
UNSTDIZE
   unstandardizes variables when you specify the METHOD=IN(ds) option. The location and scale measures, along with the constants for addition and multiplication that the unstandardization is based upon, are identified by the _TYPE_ variable in the ds data set. The ds data set must have a _TYPE_ variable and contain the following two observations: a _TYPE_='LOCATION' observation and a _TYPE_='SCALE' observation. The variable _TYPE_ can also contain the optional observations 'ADD' and 'MULT'; if these observations are not found in the ds data set, the constants specified in the ADD= and MULT= options (or their default values) are used for unstandardization. See the "OUTSTAT= Data Set" section on page 4141 for details on the statistics that each value of _TYPE_ represents.

   The formula used for unstandardization is as follows. If the final output value from the previous standardization is calculated as

      result = add + multiply × (original − location) / scale

   then the unstandardized value is

      original = scale × (result − add) / multiply + location
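As a sketch of a standardize/unstandardize round trip (the data set and variable names are hypothetical), the OUTSTAT= data set from the first step supplies the measures that the second step inverts:

   proc stdize data=scores out=z method=std outstat=measures;
      var total;
   run;

   /* recover the original values from the standardized ones */
   proc stdize data=z out=back method=in(measures) unstd;
      var total;
   run;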
VARDEF=DF
VARDEF=N
VARDEF=WDF
VARDEF=WEIGHT | WGT
   specifies the divisor to be used in the calculation of variances. By default, VARDEF=DF. The values and associated divisors are as follows.

   Value          Divisor                    Formula
   DF             degrees of freedom         n − 1
   N              number of observations     n
   WDF            sum of weights minus 1     (Σ_i w_i) − 1
   WEIGHT | WGT   sum of weights             Σ_i w_i
BY Statement

BY variables ;

You can specify a BY statement with PROC STDIZE to obtain separate standardization for observations in groups defined by the BY variables. If your DATA= input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the STDIZE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.

When you specify the option METHOD=IN(ds), the following rules are applied to BY-group processing:

• If the ds data set does not contain any of the BY variables, the entire DATA= data set is standardized by the location and scale measures (along with the constants for addition and multiplication) in the ds data set.
• If the ds data set contains some, but not all, of the BY variables, or if some BY variables do not have the same type or length in the ds data set that they have in the DATA= data set, PROC STDIZE displays an error message and stops.
• If all of the BY variables appear in the ds data set with the same type and length as in the DATA= data set, each BY group in the DATA= data set is standardized using the location and scale measures (along with the constants for addition and multiplication) from the corresponding BY group in the ds data set. The BY groups in the ds data set must be in the same order as they appear in the DATA= data set. All BY groups in the DATA= data set must also appear in the ds data set. If you do not specify the NOTSORTED option, some BY groups can appear in the ds data set but not in the DATA= data set; such BY groups are not used in standardizing data.
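As an illustration of the third rule (a minimal sketch using the TotalScores data set from the "Getting Started" section), the following standardizes each Type BY group with measures estimated from the corresponding group:

   proc sort data=TotalScores;
      by Type;
   run;

   /* the OUTSTAT= data set contains the BY variable Type */
   proc stdize data=TotalScores method=std outstat=bymeasures;
      var total;
      by Type;
   run;

   proc stdize data=TotalScores out=zscores method=in(bymeasures);
      var total;
      by Type;
   run;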
FREQ Statement

FREQ | FREQUENCY variable ;

If one variable in the input data set represents the frequency of occurrence for other values in the observation, specify the variable name in a FREQ statement. PROC STDIZE treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation. Nonintegral values of the FREQ variable are truncated to the largest integer less than the FREQ value. If the FREQ variable has a value that is less than 1 or is missing, the observation is not used in the analysis.
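A minimal sketch (the data set gradecounts and its count variable are hypothetical): each observation is treated as if it appeared count times when the location and scale measures are computed.

   proc stdize data=gradecounts out=zscores method=std;
      var total;
      freq count;
   run;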
LOCATION Statement

LOCATION variables ;

The LOCATION statement specifies a list of numeric variables that contain location measures in the input data set specified by the METHOD=IN option.

SCALE Statement

SCALE variables ;

The SCALE statement specifies the list of numeric variables containing scale measures in the input data set specified by the METHOD=IN option.

VAR Statement

VAR | VARIABLES variables ;

The VAR statement lists numeric variables to be standardized. If you omit the VAR statement, all numeric variables not listed in the BY, FREQ, and WEIGHT statements are used.
WEIGHT Statement

WGT | WEIGHT variable ;

The WEIGHT statement specifies a numeric variable in the input data set with values that are used to weight each observation. Only one variable can be specified. The WEIGHT variable values can be nonintegers. An observation is used in the analysis only if the value of the WEIGHT variable is greater than zero. The WEIGHT variable applies only when you specify the option METHOD=MEAN, METHOD=SUM, METHOD=EUCLEN, METHOD=USTD, METHOD=STD, METHOD=AGK, or METHOD=L.

PROC STDIZE uses the value of the WEIGHT variable w_i as follows. The sample mean and the (uncorrected) sample variances are computed as

   x̄_w = (Σ_i w_i x_i) / (Σ_i w_i)

   us_w² = (Σ_i w_i x_i²) / d

   s_w² = (Σ_i w_i (x_i − x̄_w)²) / d
the weighted mean, xw P the weighted sum, i wi xi the weighted uncorrected standard deviation, p the weighted standard deviation, s2w
p
us2w
the weighted Euclidean length, computed as the square root of the weighted uncorrected sum of squares: sX
wi xi 2
i
AGK
the AGK estimate. This estimate is documented further in the ACECLUS procedure as the METHOD=COUNT option. See the discussion of the WEIGHT statement in Chapter 16, “The ACECLUS Procedure,” for information on how the WEIGHT variable is applied to the AGK estimate.
L
the Lp estimate. This estimate is documented further in the FASTCLUS procedure as the LEAST= option. See the discussion of the WEIGHT statement in Chapter 28, “The FASTCLUS Procedure,” for information on how the WEIGHT variable is used to compute weighted cluster means. Note that the number of clusters is always 1.
Details Standardization Methods The following table lists standardization methods and their corresponding location and scale measures available with the METHOD= option.
Standardization Methods
Table 66.2. Available Standardization Methods
Method MEAN MEDIAN SUM EUCLEN USTD STD RANGE MIDRANGE MAXABS IQR MAD
Location mean median 0 0 0 mean minimum midrange 0 median median
ABW(c) AHUBER(c) AWAVE(c) AGK(p) SPACING(p) L(p) IN(ds)
biweight 1-step M-estimate Huber 1-step M-estimate Wave 1-step M-estimate mean mid minimum-spacing L(p) read from data set
Scale 1 1 sum Euclidean length standard deviation about origin standard deviation range range/2 maximum absolute value interquartile range median absolute deviation from median biweight A-estimate Huber A-estimate Wave A-estimate AGK estimate (ACECLUS) minimum spacing L(p) read from data set
For METHOD=ABW(c), METHOD=AHUBER(c), or METHOD=AWAVE(c), c is a positive numeric tuning constant. For METHOD=AGK(p), p is a numeric constant giving the proportion of pairs to be included in the estimation of the within-cluster variances. For METHOD=SPACING(p), p is a numeric constant giving the proportion of data to be contained in the spacing. For METHOD=L(p), p is a numeric constant greater than or equal to 1 specifying the power to which differences are to be raised in computing an L(p) or Minkowski metric. For METHOD=IN(ds), ds is the name of a SAS data set that meets either one of the following two conditions: • contains a – TYPE– variable. The observation that contains the location measure corresponds to the value – TYPE– = ’LOCATION’ and the observation that contains the scale measure corresponds to the value – TYPE– = ’SCALE’. You can also use a data set created by the OUTSTAT= option from another PROC STDIZE statement as the ds data set. See the section “Output Data Sets” on page 4141 for the contents of the OUTSTAT= data set. • contains the location and scale variables specified by the LOCATION and SCALE statements. PROC STDIZE reads in the location and scale variables in the ds data set by first looking for the – TYPE– variable in the ds data set. If it finds this variable, PROC
4137
4138
Chapter 66. The STDIZE Procedure
STDIZE continues to search for all variables specified in the VAR statement. If it does not find the – TYPE– variable, PROC STDIZE searches for the location variables specified in the LOCATION statement and the scale variables specified in the SCALE statement. For robust estimators, refer to Goodall (1983) and Iglewicz (1983). The MAD method has the highest breakdown point (50%), but it is somewhat inefficient. The ABW, AHUBER, and AWAVE methods provide a good compromise between breakdown and efficiency. The L(p) location estimates are increasingly robust as p drops from 2 (corresponding to least squares, or mean estimation) to 1 (corresponding to least absolute value, or median estimation). However, the L(p) scale estimates are not robust. The SPACING method is robust to both outliers and clustering (Jannsen et al. 1995) and is, therefore, a good choice for cluster analysis or nonparametric density estimation. The mid-minimum spacing method estimates the mode for small p. The AGK method is also robust to clustering and more efficient than the SPACING method, but it is not as robust to outliers and takes longer to compute. If you expect g clusters, the argument to METHOD=SPACING or METHOD=AGK should be g1 or less. The AGK method is less biased than the SPACING method for small samples. As a general guide, it is reasonable to use AGK for samples of size 100 or less and SPACING for samples of size 1000 or more, with the treatment of intermediate sample sizes depending on the available computer resources.
Computation of the Statistics Formulas for statistics of METHOD=MEAN, METHOD=MEDIAN, METHOD=SUM, METHOD=USTD, METHOD=STD, METHOD=RANGE, and METHOD=IQR are given in the chapter on elementary statistics procedures in the SAS Procedures Guide. Note that the computations of median and upper and lower quartiles depend on the PCTLMTD= option. The other statistics listed in Table 66.2, except for METHOD=IN, are described as follows: EUCLEN
Euclidean length. pPn 2 x i=1 i where xi is the ith observation and n is the total number of observations in the sample.
L(p)
Minkowski metric. This metric is documented as the LEAST=p option in the PROC FASTCLUS statement of the FASTCLUS procedure (see Chapter 28, “The FASTCLUS Procedure,” ). If you specify METHOD=L(p) in the PROC STDIZE statement, your results are similar to those obtained from PROC FASTCLUS if you specify the LEAST=p option with MAXCLUS=1 (and use the default values of the MAXITER= option). The difference between the two types of calculations concerns the maximum number of iterations. In PROC STDIZE, it is a criteria for convergence on
Computing Quantiles
all variables; In PROC FASTCLUS, it is a criteria for convergence on a single variable. The location and scale measures for L(p) are output to the OUTSEED= data set in PROC FASTCLUS. MIDRANGE
(maximum + minimum)/2
ABW(c)
Tukey’s biweight. Refer to Goodall (1983, pp. 376–378, p. 385) for the biweight 1-step M-estimate. Also refer to Iglewicz (1983, pp. 416–418) for the biweight A-estimate.
AHUBER(c)
Hubers. Refer to Goodall (1983, pp. 371–374) for the Huber 1step M-estimate. Also refer to Iglewicz (1983, pp. 416–418) for the Huber A-estimate of scale.
AWAVE(c)
Andrews’ Wave. Refer to Goodall (1983, p. 376) for the Wave 1-step M-estimate. Also refer to Iglewicz (1983, pp. 416 –418) for the Wave A-estimate of scale.
AGK(p)
The noniterative univariate form of the estimator described by Art, Gnanadesikan, and Kettenring (1982). The AGK estimate is documented in the section on the METHOD= option in the PROC ACECLUS statement of the ACECLUS procedure (also see the “Background” section on page 388 in Chapter 16, “The ACECLUS Procedure,” ). Specifying METHOD=AGK(p) in the PROC STDIZE statement is the same as specifying METHOD=COUNT and P=p in the PROC ACECLUS statement.
SPACING(p)
The absolute difference between two data values. The minimum spacing for a proportion p is the minimum absolute difference between two data values that contain a proportion p of the data between them. The mid minimum-spacing is the mean of these two data values.
Computing Quantiles PROC STDIZE offers two methods for computing quantiles: the one-pass approach and the order-statistics approach (like that used in the UNIVARIATE procedure). The one-pass approach used in PROC STDIZE modifies the P2 algorithm for histograms proposed by Jain and Chlamtac (1985). The primary difference comes from the movement of markers. The one-pass method allows a marker to move to the right (or left) by more than one position (to the largest possible integer) as long as it does not result in two markers being in the same position. The modification is necessary in order to incorporate the FREQ variable. You may obtain inaccurate results if you use the one-pass approach to estimate quantiles beyond the quartiles (that is, when you estimate quantiles < P25 or > P75). A large sample size (10,000 or more) is often required if the tail quantiles (quantiles <= P10 or >= P90 ) are requested. Note that, for variables with highly skewed or heavy-tailed distributions, tail quantile estimates may be inaccurate.
4139
4140
Chapter 66. The STDIZE Procedure
The order-statistics approach for estimating quantiles is faster than the one-pass method but requires that the entire data set be stored in memory. The accuracy in estimating the quantiles is comparable for both methods when the requested percentiles are between the lower and upper quartiles. The default is PCTLMTD=ORD– STAT if enough memory is available; otherwise, PCTLMTD=ONEPASS.
Computational Methods for the PCTLDEF= Option You can specify one of five methods for computing quantile statistics when you use the order-statistics approach (PCTLMTD=ORD– STAT); otherwise, the PCTLDEF=5 method is used when you use the one-pass approach (PCTLMTD=ONEPASS). Let n be the number of nonmissing values for a variable, and let x1 , x2 , . . . , xn represent the ordered values of the variable. For the tth percentile, let p = t/100. In the following definitions numbered 1, 2, 3, and 5, let np = j + g where j is the integer part and g is the fractional part of np. For definition 4, let (n + 1)p = j + g Given the preceding definitions, the tth percentile, y, is defined as follows: PCTLDEF=1
PCTLDEF=1    weighted average at x_{np}
             y = (1 − g) x_j + g x_{j+1}
             where x_0 is taken to be x_1

PCTLDEF=2    observation numbered closest to np
             y = x_i
             where i is the integer part of np + 1/2 if g ≠ 1/2. If g = 1/2, then y = x_j if j is even, or y = x_{j+1} if j is odd.

PCTLDEF=3    empirical distribution function
             y = x_j if g = 0
             y = x_{j+1} if g > 0

PCTLDEF=4    weighted average aimed at x_{p(n+1)}
             y = (1 − g) x_j + g x_{j+1}
             where x_{n+1} is taken to be x_n

PCTLDEF=5    empirical distribution function with averaging
             y = (x_j + x_{j+1}) / 2 if g = 0
             y = x_{j+1} if g > 0
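As a worked illustration of these definitions (the numbers are chosen for illustration, not taken from procedure output): suppose n = 10 and t = 25, so p = 0.25 and np = 2.5, giving j = 2 and g = 0.5. Then

   PCTLDEF=1:  y = (1 − 0.5) x_2 + 0.5 x_3 = (x_2 + x_3)/2
   PCTLDEF=2:  g = 1/2 and j = 2 is even, so y = x_2
   PCTLDEF=3:  g > 0, so y = x_3
   PCTLDEF=4:  (n + 1)p = 2.75, so j = 2 and g = 0.75, giving y = 0.25 x_2 + 0.75 x_3
   PCTLDEF=5:  g > 0, so y = x_3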
Missing Values

Missing values can be replaced by the location measure or by any specified constant (see the REPLACE option and the MISSING= option). You can also suppress standardization if you want only to replace missing values (see the REPONLY option). If you specify the NOMISS option, PROC STDIZE omits observations with any missing values in the analyzed variables from computation of the location and scale measures.
Output Data Sets

OUT= Data Set

The output data set is a copy of the DATA= data set except that the analyzed variables have been standardized. Analyzed variables are those listed in the VAR statement or, if there is no VAR statement, all numeric variables not listed in any other statement.
OUTSTAT= Data Set

The new data set contains the following variables:

• the BY variables, if any
• _TYPE_, a character variable
• the analyzed variables

Each observation in the new data set contains a type of statistic as indicated by the _TYPE_ variable. The values of the _TYPE_ variable are as follows:

_TYPE_

LOCATION          location measure of each variable

SCALE             scale measure of each variable

ADD               constant specified in the ADD= option. This value is the same for each variable.

MULT              constant specified in the MULT= option. This value is the same for each variable.

N                 total number of nonmissing positive frequencies of each variable

NORM              norm measure of each variable. This observation is produced only when you specify the NORM option with METHOD=AGK, METHOD=IQR, METHOD=MAD, or METHOD=SPACING or when you specify the SNORM option with METHOD=SPACING.

NObsRead          number of physical records read

NObsUsed          number of physical records used in the analysis

NObsMiss          number of physical records containing missing values

SumFreqsRead      sum of the frequency variable (or the sum of NObsUsed ones when there is no frequency variable) for all observations read

SumFreqsUsed      sum of the frequency variable (or the sum of NObsUsed ones when there is no frequency variable) for all observations used in the analysis

SumWeightsRead    sum of the weight variable (or the sum of NObsUsed ones when there is no weight variable) for all observations read

SumWeightsUsed    sum of the weight variable (or the sum of NObsUsed ones when there is no weight variable) for all observations used in the analysis

Pn                percentiles of each variable, as specified by the PCTLPTS= option. The argument n is any real number such that 0 ≤ n ≤ 100.
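For instance, the following minimal sketch extracts only the location and scale observations from an OUTSTAT= data set (the output data set names are hypothetical):

   proc stdize data=TotalScores method=mad outstat=stats;
      var total;
   run;

   data locscale;
      set stats;
      where _type_ in ('LOCATION','SCALE');
   run;

   proc print data=locscale;
   run;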
Displayed Output

If you specify the PSTAT option, PROC STDIZE displays the following statistics for each variable:

• the name of the variable, Name
• the location estimate, Location
• the scale estimate, Scale
• the norm estimate, Norm (when you specify the NORM option with METHOD=AGK, METHOD=IQR, METHOD=MAD, or METHOD=SPACING or when you specify the SNORM option with METHOD=SPACING)
• the total nonmissing positive frequencies, N
ODS Table Names

PROC STDIZE assigns a name to the single table it creates. You can use this name to reference the table when using the Output Delivery System (ODS) to select output or create an output data set. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 66.3. ODS Table Produced in PROC STDIZE

ODS Table Name   Description                   Option
Statistics       Location and Scale Measures   PSTAT
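A minimal sketch of capturing that table as a data set via ODS (the output data set name is hypothetical); the PSTAT option is required so that the table is produced:

   ods output Statistics=stdmeasures;

   proc stdize data=TotalScores method=std pstat;
      var total;
   run;

   proc print data=stdmeasures;
   run;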
Examples

Example 66.1. Standardization of Variables in Cluster Analysis
To illustrate the effect of standardization in cluster analysis, this example uses the Fish data set described in the "Getting Started" section of Chapter 28, "The FASTCLUS Procedure." The numbers are measurements taken on 159 fish caught from the same lake (Laengelmavesi) near Tampere in Finland; this data set is available from the Data Archive of the Journal of Statistics Education. The complete data set is displayed in Chapter 67, "The STEPDISC Procedure."

The species (Bream, Parkki, Pike, Perch, Roach, Smelt, and Whitefish), weight, three different length measurements (measured from the nose of the fish to the beginning of its tail, the notch of its tail, and the end of its tail), height, and width of each fish are recorded. The height and width are recorded as percentages of the third length variable.

Several new variables are created in the Fish data set: Weight3, Height, Width, and logLengthRatio. The weight of a fish indicates its size: a heavier fish tends to be larger than a lighter one. To get a one-dimensional measure of the size of a fish, take the cubic root of the weight (Weight3). The variables Height, Width, Length1, Length2, and Length3 are rescaled in order to adjust for dimensionality. The logLengthRatio variable measures the tail length. Because the new variables Weight3–logLengthRatio depend on the variable Weight, observations with missing values for Weight are not added to the data set. Consequently, there are 157 observations in the SAS data set Fish.

Before you perform a cluster analysis on coordinate data, it is necessary to consider scaling or transforming the variables, since variables with large variances tend to have a larger effect on the resulting clusters than those with small variances. This example uses three different approaches to standardize or transform the data prior to the cluster analysis. The first approach uses several standardization methods provided in the STDIZE procedure. However, since standardization is not always appropriate prior to clustering (refer to Milligan and Cooper, 1987, for a Monte Carlo study on various methods of variable standardization), the second approach performs the cluster analysis with no standardization. The third approach invokes the ACECLUS procedure to transform the data into a within-cluster covariance matrix.

The clustering is performed by the FASTCLUS procedure to find seven clusters. Note that the variables Length2 and Length3 are eliminated from this analysis since they both are significantly and highly correlated with the variable Length1. The correlation coefficients are 0.9958 and 0.9604, respectively. An output data set is created, and the FREQ procedure is invoked to compare the clusters with the species classification. The DATA step is as follows:
   proc format;
      value specfmt
         1='Bream'
         2='Roach'
         3='Whitefish'
         4='Parkki'
         5='Perch'
         6='Pike'
         7='Smelt';

   data Fish (drop=HtPct WidthPct);
      title 'Fish Measurement Data';
      input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;
      if Weight <= 0 or Weight=. then delete;
      Weight3=Weight**(1/3);
      Height=HtPct*Length3/(Weight3*100);
      Width=WidthPct*Length3/(Weight3*100);
      Length1=Length1/Weight3;
      Length2=Length2/Weight3;
      Length3=Length3/Weight3;
      logLengthRatio=log(Length3/Length1);
      format Species specfmt.;
      symbol = put(Species, specfmt2.);
      datalines;
   1 242.0 23.2 25.4 30.0 38.4 13.4
   1 290.0 24.0 26.3 31.2 40.0 13.8
   1 340.0 23.9 26.5 31.1 39.8 15.1
   1 363.0 26.3 29.0 33.5 38.0 13.3
   ... [155 more records]
   ;
   run;
The following macro, Std, standardizes the Fish data. The macro takes a single argument, mtd, which selects the METHOD= specification to be used in PROC STDIZE.

   /*--- macro for standardization ---*/
   %macro Std(mtd);
      title2 "Data is standardized by PROC STDIZE with METHOD= &mtd";
      proc stdize data=fish out=sdzout method=&mtd;
         var Length1 logLengthRatio Height Width Weight3;
      run;
   %mend Std;
The following macro, FastFreq, includes a PROC FASTCLUS statement for performing cluster analysis and a PROC FREQ statement for cross-tabulating species with the cluster membership information derived from the preceding PROC FASTCLUS statement. The macro takes a single argument, ds, which selects the input data set to be used in PROC FASTCLUS.
   /*--- macro for clustering and cross-tabulating ---*/
   /*--- cluster membership with species           ---*/
   %macro FastFreq(ds);
      proc fastclus data=&ds out=clust maxclusters=7 maxiter=100 noprint;
         var Length1 logLengthRatio Height Width Weight3;
      run;

      proc freq data=clust;
         tables species*cluster;
      run;
   %mend FastFreq;
The following analysis (labeled 'Approach 1') includes 18 different methods of standardization followed by clustering. Since there is a large amount of output from this approach, only results from METHOD=STD, METHOD=RANGE, METHOD=AGK(.14), and METHOD=SPACING(.14) are shown. The following statements produce Output 66.1.1 through Output 66.1.4.

   /**********************************************************/
   /*                                                        */
   /*   Approach 1: data is standardized by PROC STDIZE      */
   /*                                                        */
   /**********************************************************/
   %Std(MEAN);
   %FastFreq(sdzout);

   %Std(MEDIAN);
   %FastFreq(sdzout);

   %Std(SUM);
   %FastFreq(sdzout);

   %Std(EUCLEN);
   %FastFreq(sdzout);

   %Std(USTD);
   %FastFreq(sdzout);

   %Std(STD);
   %FastFreq(sdzout);

   %Std(RANGE);
   %FastFreq(sdzout);

   %Std(MIDRANGE);
   %FastFreq(sdzout);

   %Std(MAXABS);
   %FastFreq(sdzout);

   %Std(IQR);
   %FastFreq(sdzout);

   %Std(MAD);
   %FastFreq(sdzout);

   %Std(AGK(.14));
   %FastFreq(sdzout);

   %Std(SPACING(.14));
   %FastFreq(sdzout);

   %Std(ABW(5));
   %FastFreq(sdzout);

   %Std(AWAVE(5));
   %FastFreq(sdzout);

   %Std(L(1));
   %FastFreq(sdzout);

   %Std(L(1.5));
   %FastFreq(sdzout);

   %Std(L(2));
   %FastFreq(sdzout);
Output 66.1.1. Data Is Standardized by PROC STDIZE with METHOD=STD

                         Fish Measurement Data
           Data is standardized by PROC STDIZE with METHOD= STD

                          The FREQ Procedure

          Table of Species by CLUSTER (cell frequencies shown;
       the row and column percentages of the original listing are
                            omitted here)

   Species        1     2     3     4     5     6     7   Total
   -------------------------------------------------------------
   Bream          0     0     0     0     0    34     0      34
   Roach          0     0     0     0     0     0    19      19
   Whitefish      0     2     0     1     0     0     3       6
   Parkki         0     0     0     0    11     0     0      11
   Perch          0    17     0    12     0     0    27      56
   Pike          17     0     0     0     0     0     0      17
   Smelt          0     0    13     0     0     0     1      14
   -------------------------------------------------------------
   Total         17    19    13    13    11    34    50     157
Output 66.1.2. Data Is Standardized by PROC STDIZE with METHOD=RANGE

                        Fish Measurement Data
        Data is standardized by PROC STDIZE with METHOD= RANGE

                          The FREQ Procedure

        Table of Species by CLUSTER (cell frequencies)

                              CLUSTER
   Species        1     2     3     4     5     6     7    Total
   Bream          0     0    34     0     0     0     0       34
   Roach          0     0     0    19     0     0     0       19
   Whitefish      0     0     0     3     3     0     0        6
   Parkki         0     0     0     0     0    11     0       11
   Perch          0     0     0     9    20     0    27       56
   Pike          17     0     0     0     0     0     0       17
   Smelt          0    14     0     0     0     0     0       14
   Total         17    14    34    31    23    11    27      157
Output 66.1.3. Data Is Standardized by PROC STDIZE with METHOD=AGK(.14)

                        Fish Measurement Data
       Data is standardized by PROC STDIZE with METHOD= AGK(.14)

                          The FREQ Procedure

        Table of Species by CLUSTER (cell frequencies)

                              CLUSTER
   Species        1     2     3     4     5     6     7    Total
   Bream          0     0    34     0     0     0     0       34
   Roach          0     0     0    17     0     0     2       19
   Whitefish      0     0     0     3     0     3     0        6
   Parkki        11     0     0     0     0     0     0       11
   Perch          0     0     0     3     0    20    33       56
   Pike           0     0     0     0    17     0     0       17
   Smelt          0    14     0     0     0     0     0       14
   Total         11    14    34    23    17    23    35      157
Output 66.1.4. Data Is Standardized by PROC STDIZE with METHOD=SPACING(.14)

                        Fish Measurement Data
     Data is standardized by PROC STDIZE with METHOD= SPACING(.14)

                          The FREQ Procedure

        Table of Species by CLUSTER (cell frequencies)

                              CLUSTER
   Species        1     2     3     4     5     6     7    Total
   Bream          0     0     0     0     0     0    34       34
   Roach          0     0     0    17     0     2     0       19
   Whitefish      3     0     0     3     0     0     0        6
   Parkki         0     0    11     0     0     0     0       11
   Perch         20     0     0     0     0    36     0       56
   Pike           0    17     0     0     0     0     0       17
   Smelt          0     0     0     0    14     0     0       14
   Total         23    17    11    20    14    38    34      157
The following analysis (labeled 'Approach 2') applies the cluster analysis directly to the original data. The following statements produce Output 66.1.5.

   /**********************************************************/
   /*                                                        */
   /*  Approach 2: data is untransformed                     */
   /*                                                        */
   /**********************************************************/
   title2 'Data is untransformed';
   %FastFreq(fish);
Output 66.1.5. Untransformed Data

                        Fish Measurement Data
                        Data is untransformed

                          The FREQ Procedure

        Table of Species by CLUSTER (cell frequencies)

                              CLUSTER
   Species        1     2     3     4     5     6     7    Total
   Bream         13     0     0     0     0     0    21       34
   Roach          3     4     0     0    12     0     0       19
   Whitefish      3     0     0     0     0     0     3        6
   Parkki         2     3     0     0     6     0     0       11
   Perch          8     9     0     1    20     0    18       56
   Pike           0     0    10     0     1     4     2       17
   Smelt          0     0     0    14     0     0     0       14
   Total         29    16    10    15    39     4    44      157
The following analysis (labeled 'Approach 3') transforms the original data with the ACECLUS procedure and creates a TYPE=ACE output data set that is used as an input data set for the cluster analysis. The following statements produce Output 66.1.6.

   /**********************************************************/
   /*                                                        */
   /*  Approach 3: data is transformed by PROC ACECLUS       */
   /*                                                        */
   /**********************************************************/
   title2 'Data is transformed by PROC ACECLUS';
   proc aceclus data=fish out=ace p=.02 noprint;
      var Length1 logLengthRatio Height Width Weight3;
   run;
   %FastFreq(ace);
Output 66.1.6. Data Is Transformed by PROC ACECLUS

                        Fish Measurement Data
                 Data is transformed by PROC ACECLUS

                          The FREQ Procedure

        Table of Species by CLUSTER (cell frequencies)

                              CLUSTER
   Species        1     2     3     4     5     6     7    Total
   Bream         13     0     0     0     0     0    21       34
   Roach          3     4     0     0    12     0     0       19
   Whitefish      3     0     0     0     0     0     3        6
   Parkki         2     3     0     0     6     0     0       11
   Perch          8     9     0     1    20     0    18       56
   Pike           0     0    10     0     1     4     2       17
   Smelt          0     0     0    14     0     0     0       14
   Total         29    16    10    15    39     4    44      157
Table 66.4 summarizes the classification results. In this table, the first column gives the standardization method, the second column gives the number of clusters into which the seven species are classified, and the third column gives the total number of misclassified observations.
Table 66.4. Summary of Clustering Results

   Method of               Number of
   Standardization         Clusters     Misclassification
   MEAN                        5               71
   MEDIAN                      5               71
   SUM                         6               51
   EUCLEN                      6               45
   USTD                        6               45
   STD                         5               33
   RANGE                       7               32
   MIDRANGE                    7               32
   MAXABS                      7               26
   IQR                         5               28
   MAD                         4               35
   ABW(5)                      6               34
   AWAVE(5)                    6               29
   AGK(.14)                    7               28
   SPACING(.14)                7               25
   L(1)                        6               41
   L(1.5)                      5               33
   L(2)                        5               33
   untransformed               5               71
   PROC ACECLUS                5               71
Consider the results displayed in Output 66.1.1. In that analysis, the method of standardization is STD, and the number of clusters and the number of misclassifications are computed as shown in Table 66.5.

Table 66.5. Computations of Numbers of Clusters and Misclassification When the Standardization Method Is STD

                 Cluster     Misclassification
   Species       Number      in Each Species
   Bream            6                0
   Roach            7                0
   Whitefish        7                3
   Parkki           5                0
   Perch            7               29
   Pike             1                0
   Smelt            3                1
In Output 66.1.1, the Bream species is classified as cluster 6 since all 34 Bream fish are categorized into cluster 6 with no misclassification. A similar pattern is seen with the Roach, Parkki, Pike, and Smelt species. For the Whitefish species, two fish are categorized into cluster 2, one fish is categorized into cluster 4, and three fish are categorized into cluster 7. Because the majority of this species is categorized into cluster 7, it is recorded in Table 66.5 as being classified as cluster 7 with 3 misclassifications. A similar pattern is seen with the Perch species: it is classified as cluster 7 with 29 misclassifications.
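This counting rule is easy to automate. The following sketch is one possible way to do it, not part of the example's macros; it assumes a data set clusout that contains Species and the CLUSTER variable assigned by PROC FASTCLUS:

   proc freq data=clusout noprint;
      tables Species*CLUSTER / out=xtab(keep=Species CLUSTER count);
   run;

   /* within each species, the most frequent cluster is the assigned
      cluster; observations in any other cluster count as misclassified */
   proc sort data=xtab;
      by Species descending count;
   run;

   data misclass;
      set xtab;
      by Species;
      if first.Species then misclassified = 0;
      else misclassified + count;
      if last.Species then output;
      keep Species misclassified;
   run;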
In summary, when the standardization method is STD, the seven species of fish are classified into only 5 clusters, and the total number of misclassified observations is 33.

The results of this analysis demonstrate that when variables are standardized by the STDIZE procedure with methods such as RANGE, MIDRANGE, MAXABS, AGK(.14), and SPACING(.14), the FASTCLUS procedure produces the correct number of clusters and fewer misclassifications than it does when other standardization methods are used. The SPACING method attains the best result, probably because the variables Length1 and Height both exhibit marked groupings (bimodality) in their distributions.
References

Art, D., Gnanadesikan, R., and Kettenring, R. (1982), "Data-based Metrics for Cluster Analysis," Utilitas Mathematica, 21A, 75–99.

Goodall, C. (1983), "M-Estimators of Location: An Outline of Theory," in Hoaglin, D.C., Mosteller, M., and Tukey, J.W., eds., Understanding Robust and Exploratory Data Analysis, New York: John Wiley & Sons, Inc.

Iglewicz, B. (1983), "Robust Scale Estimators and Confidence Intervals for Location," in Hoaglin, D.C., Mosteller, M., and Tukey, J.W., eds., Understanding Robust and Exploratory Data Analysis, New York: John Wiley & Sons, Inc.

Jain, R. and Chlamtac, I. (1985), "The P2 Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations," Communications of the ACM, 28(10), 1076–1085.

Jannsen, P., Marron, J.S., Veraverbeke, N., and Sarle, W.S. (1995), "Scale Measures for Bandwidth Selection," Journal of Nonparametric Statistics, 5(4), 359–380.

Journal of Statistics Education, "Fish Catch Data Set," [http://www.stat.ncsu.edu/info/jse], accessed 4 December 1997.

Milligan, G.W. and Cooper, M.C. (1987), "A Study of Variable Standardization," College of Administrative Science Working Paper Series, 87–63, Columbus, OH: Ohio State University.
Chapter 67
The STEPDISC Procedure

Chapter Contents

OVERVIEW  4157

GETTING STARTED  4158

SYNTAX  4165
   PROC STEPDISC Statement  4165
   BY Statement  4168
   CLASS Statement  4169
   FREQ Statement  4169
   VAR Statement  4169
   WEIGHT Statement  4169

DETAILS  4170
   Missing Values  4170
   Input Data Sets  4170
   Computational Resources  4171
   Displayed Output  4172
   ODS Table Names  4174

EXAMPLE  4175
   Example 67.1. Performing a Stepwise Discriminant Analysis  4175

REFERENCES  4181
Chapter 67
The STEPDISC Procedure

Overview

Given a classification variable and several quantitative variables, the STEPDISC procedure performs a stepwise discriminant analysis to select a subset of the quantitative variables for use in discriminating among the classes. Within each class, the quantitative variables are assumed to be multivariate normal with a common covariance matrix. The STEPDISC procedure can use forward selection, backward elimination, or stepwise selection (Klecka 1980). The STEPDISC procedure is a useful prelude to further analyses using the CANDISC procedure or the DISCRIM procedure.

With PROC STEPDISC, variables are chosen to enter or leave the model according to one of two criteria:

• the significance level of an F-test from an analysis of covariance, where the variables already chosen act as covariates and the variable under consideration is the dependent variable

• the squared partial correlation for predicting the variable under consideration from the CLASS variable, controlling for the effects of the variables already selected for the model

Forward selection begins with no variables in the model. At each step, PROC STEPDISC enters the variable that contributes most to the discriminatory power of the model as measured by Wilks' lambda, the likelihood ratio criterion. When none of the unselected variables meets the entry criterion, the forward selection process stops.

Backward elimination begins with all variables in the model except those that are linearly dependent on previous variables in the VAR statement. At each step, the variable that contributes least to the discriminatory power of the model as measured by Wilks' lambda is removed. When all remaining variables meet the criterion to stay in the model, the backward elimination process stops.

Stepwise selection begins, like forward selection, with no variables in the model. At each step, the model is examined. If the variable in the model that contributes least to the discriminatory power of the model as measured by Wilks' lambda fails to meet the criterion to stay, then that variable is removed. Otherwise, the variable not in the model that contributes most to the discriminatory power of the model is entered. When all variables in the model meet the criterion to stay and none of the other variables meet the criterion to enter, the stepwise selection process stops. Stepwise selection is the default method of variable selection.
It is important to realize that, in the selection of variables for entry, only one variable can be entered into the model at each step. The selection process does not take into account the relationships between variables that have not yet been selected. Thus, some important variables could be excluded in the process. Also, Wilks' lambda may not be the best measure of discriminatory power for your application. However, if you use PROC STEPDISC carefully, in combination with your knowledge of the data and careful cross-validation, it can be a valuable aid in selecting variables for a discrimination model.

As with any stepwise procedure, it is important to remember that, when many significance tests are performed, each at a level of, for example, 5% (0.05), the overall probability of rejecting at least one true null hypothesis is much larger than 5%. If you want to prevent including any variables that do not contribute to the discriminatory power of the model in the population, you should specify a very small significance level. In most applications, all variables considered have some discriminatory power, however small. To choose the model that provides the best discrimination using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size.

Costanza and Afifi (1979) use Monte Carlo studies to compare alternative stopping rules that can be used with the forward selection method in the two-group multivariate normal classification problem. Five different numbers of variables, ranging from 10 to 30, are considered in the studies. The comparison is based on conditional and estimated unconditional probabilities of correct classification. They conclude that the use of a moderate significance level, in the range of 10 percent to 25 percent, often performs better than the use of a much larger or a much smaller significance level.

The significance level and the squared partial correlation criteria select variables in the same order, although they may select different numbers of variables. Increasing the sample size tends to increase the number of variables selected when using significance levels, but it has little effect on the number selected using squared partial correlations.

See Chapter 6, "Introduction to Discriminant Procedures," for more information on discriminant analysis.
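The selection method and the significance levels are controlled by options on the PROC STEPDISC statement, documented in the "Syntax" section. As a minimal sketch (the data set mydata, class variable group, and variables x1-x10 are placeholders), a forward selection at a moderate entry level might be requested as follows:

   proc stepdisc data=mydata method=forward slentry=0.15;
      class group;
      var x1-x10;
   run;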
Getting Started

The data in this example are measurements on 159 fish caught in Finland's lake Laengelmavesi; this data set is available from the Data Archive of the Journal of Statistics Education. For each of the seven species (bream, parkki, pike, perch, roach, smelt, and whitefish), the weight, length, height, and width of each fish are tallied. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable.

PROC STEPDISC will select a subset of the six quantitative variables that may be useful for differentiating between the fish species. This subset is used in conjunction with PROC CANDISC and PROC DISCRIM to develop discrimination models.
The following program creates the data set fish and uses PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are set to 0.15.

   proc format;
      value specfmt
         1='Bream'
         2='Roach'
         3='Whitefish'
         4='Parkki'
         5='Perch'
         6='Pike'
         7='Smelt';

   data fish (drop=HtPct WidthPct);
      title 'Fish Measurement Data';
      input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;
      Height=HtPct*Length3/100;
      Width=WidthPct*Length3/100;
      format Species specfmt.;
      datalines;
   1 242.0 23.2 25.4 30.0 38.4 13.4    1 290.0 24.0 26.3 31.2 40.0 13.8
   1 340.0 23.9 26.5 31.1 39.8 15.1    1 363.0 26.3 29.0 33.5 38.0 13.3

      ... more data lines ...

   ;

   proc stepdisc data=fish;
      class Species;
   run;
PROC STEPDISC begins by displaying summary information about the analysis; see Figure 67.1. This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed.
                        Fish Measurement Data

                        The STEPDISC Procedure

            The Method for Selecting Variables is STEPWISE

   Observations    158       Variable(s) in the Analysis        6
   Class Levels      7       Variable(s) will be Included       0
                             Significance Level to Enter     0.15
                             Significance Level to Stay      0.15

                        Class Level Information

                 Variable
   Species       Name          Frequency      Weight    Proportion
   Bream         Bream                34     34.0000      0.215190
   Parkki        Parkki               11     11.0000      0.069620
   Perch         Perch                56     56.0000      0.354430
   Pike          Pike                 17     17.0000      0.107595
   Roach         Roach                20     20.0000      0.126582
   Smelt         Smelt                14     14.0000      0.088608
   Whitefish     Whitefish             6      6.0000      0.037975

Figure 67.1. Summary Information
For each entry step, the statistics for entry are displayed for all variables not currently selected; see Figure 67.2. The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.
                        Fish Measurement Data

                        The STEPDISC Procedure

                      Stepwise Selection: Step 1

                  Statistics for Entry, DF = 6, 151

   Variable    R-Square    F Value    Pr > F    Tolerance
   Weight        0.3750      15.10    <.0001       1.0000
   Length1       0.6017      38.02    <.0001       1.0000
   Length2       0.6098      39.32    <.0001       1.0000
   Length3       0.6280      42.49    <.0001       1.0000
   Height        0.7553      77.69    <.0001       1.0000
   Width         0.4806      23.29    <.0001       1.0000

                  Variable Height will be entered.

            Variable(s) that have been Entered: Height

                       Multivariate Statistics

   Statistic                     Value     F Value   Num DF   Den DF   Pr > F
   Wilks' Lambda              0.244670       77.69        6      151   <.0001
   Pillai's Trace             0.755330       77.69        6      151   <.0001
   Average Squared
   Canonical Correlation      0.125888

Figure 67.2. Step 1: Variable Height Selected for Entry
For each removal step (Figure 67.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.
                        Fish Measurement Data

                        The STEPDISC Procedure

                      Stepwise Selection: Step 2

                 Statistics for Removal, DF = 6, 151

   Variable    R-Square    F Value    Pr > F
   Height        0.7553      77.69    <.0001

                   No variables can be removed.

                  Statistics for Entry, DF = 6, 150

   Variable    Partial R-Square    F Value    Pr > F    Tolerance
   Weight                0.7388      70.71    <.0001       0.4690
   Length1               0.9220     295.35    <.0001       0.6083
   Length2               0.9229     299.31    <.0001       0.5892
   Length3               0.9173     277.37    <.0001       0.5056
   Width                 0.8783     180.44    <.0001       0.3699

                  Variable Length2 will be entered.

        Variable(s) that have been Entered: Length2 Height

                       Multivariate Statistics

   Statistic                     Value     F Value   Num DF   Den DF   Pr > F
   Wilks' Lambda              0.018861      157.04       12      300   <.0001
   Pillai's Trace             1.554349       87.78       12      302   <.0001
   Average Squared
   Canonical Correlation      0.259058

Figure 67.3. Step 2: No Variable Is Removed; Variable Length2 Added
The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps, as specified by the MAXSTEP= option, has been attained. In this example, at Step 7, no variables can be either removed or entered (Figure 67.4). Steps 3 through 6 are not displayed in this document.
                        Fish Measurement Data

                        The STEPDISC Procedure

                      Stepwise Selection: Step 7

                 Statistics for Removal, DF = 6, 146

   Variable    Partial R-Square    F Value    Pr > F
   Weight                0.4521      20.08    <.0001
   Length1               0.2987      10.36    <.0001
   Length2               0.5250      26.89    <.0001
   Length3               0.7948      94.25    <.0001
   Height                0.7257      64.37    <.0001
   Width                 0.5757      33.02    <.0001

      No variables can be removed. No further steps are possible.

Figure 67.4. Step 7: No Variables Entered or Removed
PROC STEPDISC ends by displaying a summary of the steps.

                        Fish Measurement Data
                        The STEPDISC Procedure

                      Stepwise Selection Summary

         Number            Partial                      Wilks'       Pr <     Average Squared        Pr >
   Step  In      Entered   R-Square  F Value  Pr > F    Lambda       Lambda   Canonical Correlation  ASCC
      1   1      Height      0.7553    77.69  <.0001    0.24466983   <.0001   0.12588836             <.0001
      2   2      Length2     0.9229   299.31  <.0001    0.01886065   <.0001   0.25905822             <.0001
      3   3      Length3     0.8826   186.77  <.0001    0.00221342   <.0001   0.38427100             <.0001
      4   4      Width       0.5775    33.72  <.0001    0.00093510   <.0001   0.45200732             <.0001
      5   5      Weight      0.4461    19.73  <.0001    0.00051794   <.0001   0.49488458             <.0001
      6   6      Length1     0.2987    10.36  <.0001    0.00036325   <.0001   0.51744189             <.0001

   (The Removed column of the summary is empty; no variable was removed at any step.)

Figure 67.5. Step Summary
All the variables in the data set are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
Syntax

The following statements are available in PROC STEPDISC.

   PROC STEPDISC < options > ;
      CLASS variable ;
      BY variables ;
      FREQ variable ;
      VAR variables ;
      WEIGHT variable ;

The BY, CLASS, FREQ, VAR, and WEIGHT statements are described after the PROC STEPDISC statement.
PROC STEPDISC Statement

PROC STEPDISC < options > ;

The PROC STEPDISC statement invokes the STEPDISC procedure. The PROC STEPDISC statement has the following options.

Table 67.1. STEPDISC Procedure Options

   Task                          Options
   Specify Data Set              DATA=
   Select Method                 METHOD=
   Selection Criterion           SLENTRY=, SLSTAY=, PR2ENTRY=, PR2STAY=
   Selection Process             INCLUDE=, MAXSTEP=, START=, STOP=
   Determine Singularity         SINGULAR=
   Control Displayed Output
      Correlations               BCORR, PCORR, TCORR, WCORR
      Covariances                BCOV, PCOV, TCOV, WCOV
      SSCP Matrices              BSSCP, PSSCP, TSSCP, WSSCP
      Miscellaneous              ALL, SIMPLE, STDMEAN
   Suppress Output               SHORT
ALL
   activates all of the display options.

BCORR
   displays between-class correlations.

BCOV
   displays between-class covariances. The between-class covariance matrix equals the between-class SSCP matrix divided by n(c − 1)/c, where n is the number of observations and c is the number of classes. The between-class covariances should be interpreted in comparison with the total-sample and within-class covariances, not as formal estimates of population parameters.

BSSCP
   displays the between-class SSCP matrix.

DATA=SAS-data-set
   specifies the data set to be analyzed. The data set can be an ordinary SAS data set or one of several specially structured data sets created by statistical procedures available with SAS/STAT software. These specially structured data sets include TYPE=CORR, COV, CSSCP, and SSCP. If the DATA= option is omitted, the procedure uses the most recently created SAS data set.

INCLUDE=n
   includes the first n variables in the VAR statement in every model. By default, INCLUDE=0.

MAXSTEP=n
   specifies the maximum number of steps. By default, MAXSTEP= two times the number of variables in the VAR statement.

METHOD=BACKWARD | BW
METHOD=FORWARD | FW
METHOD=STEPWISE | SW
   specifies the method used to select the variables in the model. The BACKWARD method specifies backward elimination, FORWARD specifies forward selection, and STEPWISE specifies stepwise selection. By default, METHOD=STEPWISE.

PCORR
   displays pooled within-class correlations (partial correlations based on the pooled within-class covariances).

PCOV
   displays pooled within-class covariances.

PR2ENTRY=p
PR2E=p
   specifies the partial R² for adding variables in the forward selection mode, where p ≤ 1.

PR2STAY=p
PR2S=p
   specifies the partial R² for retaining variables in the backward elimination mode, where p ≤ 1.

PSSCP
   displays the pooled within-class corrected SSCP matrix.

SHORT
   suppresses the displayed output from each step.

SIMPLE
   displays simple descriptive statistics for the total sample and within each class.

SINGULAR=p
   specifies the singularity criterion for entering variables, where 0 < p < 1. PROC STEPDISC precludes the entry of a variable if the squared multiple correlation of the variable with the variables already in the model exceeds 1 − p. With more than one variable already in the model, PROC STEPDISC also excludes a variable if it would cause any of the variables already in the model to have a squared multiple correlation (with the entering variable and the other variables in the model) exceeding 1 − p. By default, SINGULAR= 1E−8.

SLENTRY=p
SLE=p
   specifies the significance level for adding variables in the forward selection mode, where 0 ≤ p ≤ 1. The default value is 0.15.

SLSTAY=p
SLS=p
   specifies the significance level for retaining variables in the backward elimination mode, where 0 ≤ p ≤ 1. The default value is 0.15.

START=n
   specifies that the first n variables in the VAR statement be used to begin the selection process. When you specify METHOD=FORWARD or METHOD=STEPWISE, the default value is 0; when you specify METHOD=BACKWARD, the default value is the number of variables in the VAR statement.

STDMEAN
   displays total-sample and pooled within-class standardized class means.

STOP=n
   specifies the number of variables in the final model. The STEPDISC procedure stops the selection process when a model with n variables is found. This option applies only when you specify METHOD=FORWARD or METHOD=BACKWARD. When you specify METHOD=FORWARD, the default value is the number of variables in the VAR statement; when you specify METHOD=BACKWARD, the default value is 0.

TCORR
   displays total-sample correlations.

TCOV
   displays total-sample covariances.

TSSCP
   displays the total-sample corrected SSCP matrix.

WCORR
   displays within-class correlations for each class level.

WCOV
   displays within-class covariances for each class level.

WSSCP
   displays the within-class corrected SSCP matrix for each class level.
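As an illustration of how these options combine (a sketch reusing the fish data set from the "Getting Started" section; the particular settings are arbitrary), the following step requests backward elimination with a stricter staying criterion and suppresses the per-step output:

   proc stepdisc data=fish method=backward slstay=0.05 short;
      class Species;
      var Length1 Length2 Length3 Height Width Weight;
   run;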
BY Statement

BY variables ;

You can specify a BY statement with PROC STEPDISC to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.

• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the STEPDISC procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

• Create an index on the BY variables using the DATASETS procedure (in base SAS software).

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
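For example, assuming a hypothetical grouping variable Lake in the fish data, a separate variable selection for each lake could be requested with this sketch:

   proc sort data=fish;
      by Lake;
   run;

   proc stepdisc data=fish;
      by Lake;
      class Species;
   run;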
CLASS Statement

CLASS variable ;

The values of the CLASS variable define the groups for analysis. Class levels are determined by the formatted values of the CLASS variable. The CLASS variable can be numeric or character. A CLASS statement is required.
FREQ Statement

FREQ variable ;

If a variable in the data set represents the frequency of occurrence for the other values in the observation, include the name of the variable in a FREQ statement. The procedure then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation. The total number of observations is considered to be equal to the sum of the FREQ variable when the procedure determines degrees of freedom for significance probabilities.

If the value of the FREQ variable is missing or is less than one, the observation is not used in the analysis. If the value is not an integer, the value is truncated to an integer.
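A minimal sketch (the data set and variable names are placeholders), assuming one observation per distinct measurement pattern with a count variable Count:

   proc stepdisc data=patterns;
      class Species;
      freq Count;
      var Length Height Width;
   run;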
VAR Statement

VAR variables ;

The VAR statement specifies the quantitative variables eligible for selection. The default is all numeric variables not listed in other statements.
WEIGHT Statement

WEIGHT variable ;

To use relative weights for each observation in the input data set, place the weights in a variable in the data set and specify the name in a WEIGHT statement. This is often done when the variance associated with each observation is different and the values of the WEIGHT variable are proportional to the reciprocals of the variances. If the value of the WEIGHT variable is missing or is less than zero, then a value of zero for the weight is assumed.

The WEIGHT and FREQ statements have a similar effect, except that the WEIGHT statement does not alter the degrees of freedom.
Details

Missing Values

Observations containing missing values are omitted from the analysis.
Input Data Sets

The input data set can be an ordinary SAS data set or one of several specially structured data sets created by statistical procedures available with SAS/STAT software. For more information on these data sets, see Appendix A, "Special SAS Data Sets." The BY variable in these data sets becomes the CLASS variable in PROC STEPDISC. These specially structured data sets include

• TYPE=CORR data sets created by PROC CORR using a BY statement

• TYPE=COV data sets created by PROC PRINCOMP using both the COV option and a BY statement

• TYPE=CSSCP data sets created by PROC CORR using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option

• TYPE=SSCP data sets created by PROC REG using both the OUTSSCP= option and a BY statement

When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, the STEPDISC procedure reads the number of observations for each class from the observations with _TYPE_='N' and the variable means in each class from the observations with _TYPE_='MEAN'. The procedure then reads the within-class correlations from the observations with _TYPE_='CORR', the standard deviations from the observations with _TYPE_='STD' (data set TYPE=CORR), the within-class covariances from the observations with _TYPE_='COV' (data set TYPE=COV), or the within-class corrected sums of squares and crossproducts from the observations with _TYPE_='CSSCP' (data set TYPE=CSSCP).

When the data set does not include any observations with _TYPE_='CORR' (data set TYPE=CORR), _TYPE_='COV' (data set TYPE=COV), or _TYPE_='CSSCP' (data set TYPE=CSSCP) for each class, PROC STEPDISC reads the pooled within-class information from the data set. In this case, the STEPDISC procedure reads the pooled within-class correlations from the observations with _TYPE_='PCORR', the pooled within-class standard deviations from the observations with _TYPE_='PSTD' (data set TYPE=CORR), the pooled within-class covariances from the observations with _TYPE_='PCOV' (data set TYPE=COV), or the pooled within-class corrected SSCP matrix from the observations with _TYPE_='PSSCP' (data set TYPE=CSSCP).

When the input data set is TYPE=SSCP, the STEPDISC procedure reads the number of observations for each class from the observations with _TYPE_='N', the sum of weights of observations from the variable INTERCEPT in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', the variable sums from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', and the uncorrected sums of squares and crossproducts from the variable=variablenames in observations with _TYPE_='SSCP' and _NAME_=variablenames.
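For example, a sketch of the TYPE=CORR route using the fish data (PROC CORR's OUTP= option creates a TYPE=CORR data set; the BY variable then serves as the CLASS variable in PROC STEPDISC):

   proc sort data=fish;
      by Species;
   run;

   proc corr data=fish outp=fishcorr noprint;
      by Species;
      var Length1 Length2 Length3 Height Width Weight;
   run;

   proc stepdisc data=fishcorr;
      class Species;
   run;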
Computational Resources

In the following discussion, let

   n = number of observations
   c = number of class levels
   v = number of variables in the VAR list
   l = length of the CLASS variable
   t = v + c − 1
Memory Requirements

The amount of memory in bytes for temporary storage needed to process the data is

   c(4v² + 28v + 3l + 4c + 72) + 16v² + 92v + 4t² + 20t + 4l

Additional temporary storage of 72 bytes at each step is also required to store the results.
Time Requirements

The following factors determine the time requirements of a stepwise discriminant analysis.

• The time needed for reading the data and computing covariance matrices is proportional to nv². The STEPDISC procedure must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from n to n ln(c).

• The time needed for stepwise discriminant analysis is proportional to the number of steps required to select the set of variables in the discrimination model. The number of steps required depends on the data set itself and the selection method and criterion used in the procedure. Each forward or backward step takes time proportional to (v + c)².
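As a rough illustration, the following sketch plugs the "Getting Started" dimensions into the memory formula (v = 6 variables, c = 7 classes; the CLASS variable length l = 8 is an assumption):

   data _null_;
      v = 6;              /* variables in the VAR list               */
      c = 7;              /* class levels                            */
      l = 8;              /* length of the CLASS variable (assumed)  */
      t = v + c - 1;
      bytes = c*(4*v**2 + 28*v + 3*l + 4*c + 72)
              + 16*v**2 + 92*v + 4*t**2 + 20*t + 4*l;
      put 'Approximate temporary storage: ' bytes 'bytes';
   run;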
Displayed Output

The STEPDISC procedure displays the following output:

• Class Level Information, including the values of the classification variable, the Frequency of each value, the Weight of each value, and the Proportion of each value in the total sample

Optional output includes

• Within-Class SSCP Matrices for each group
• Pooled Within-Class SSCP Matrix
• Between-Class SSCP Matrix
• Total-Sample SSCP Matrix
• Within-Class Covariance Matrices for each group
• Pooled Within-Class Covariance Matrix
• Between-Class Covariance Matrix, equal to the between-class SSCP matrix divided by n(c − 1)/c, where n is the number of observations and c is the number of classes
• Total-Sample Covariance Matrix
• Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the within-class population correlation coefficients are zero
• Pooled Within-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the partial population correlation coefficients are zero
• Between-Class Correlation Coefficients and Pr > |r| to test the hypothesis that the between-class population correlation coefficients are zero
• Total-Sample Correlation Coefficients and Pr > |r| to test the hypothesis that the total population correlation coefficients are zero
• descriptive Simple Statistics including N (the number of observations), Sum, Mean, Variance, and Standard Deviation for the total sample and within each class
• Total-Sample Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the total-sample standard deviation
• Pooled Within-Class Standardized Class Means, obtained by subtracting the grand mean from each class mean and dividing by the pooled within-class standard deviation

At each step, the following statistics are displayed:

• for each variable considered for entry or removal: Partial R-Square, the squared (partial) correlation, the F statistic, and Pr > F, the probability level, from a one-way analysis of covariance

• the minimum Tolerance for entering each variable. A variable is entered only if its tolerance and the tolerances for all variables already in the model are greater than the value specified in the SINGULAR= option. The tolerance for the entering variable is 1 − R² from regressing the entering variable on the other variables already in the model. The tolerance for a variable already in the model is 1 − R² from regressing that variable on the entering variable and the other variables already in the model. With m variables already in the model, for each entering variable, m + 1 multiple regressions are performed using the entering variable and each of the m variables already in the model as a dependent variable. These m + 1 tolerances are computed for each entering variable, and the minimum tolerance is displayed for each. The tolerance is computed using the total-sample correlation matrix. It is customary to compute tolerance using the pooled within-class correlation matrix (Jennrich 1977), but it is possible for a variable with excellent discriminatory power to have a high total-sample tolerance and a low pooled within-class tolerance. For example, PROC STEPDISC enters a variable that yields perfect discrimination (that is, produces a canonical correlation of one), but a program using pooled within-class tolerance does not.

• the variable Label, if any

• the name of the variable chosen

• the variables already selected or removed

• Wilks' Lambda and the associated F approximation with degrees of freedom and Pr < F, the associated probability level after the selected variable has been entered or removed. Wilks' lambda is the likelihood ratio statistic for testing the hypothesis that the means of the classes on the selected variables are equal in the population (see the "Multivariate Tests" section in Chapter 2, "Introduction to Regression Procedures"). Lambda is close to zero if any two groups are well separated.

• Pillai's Trace and the associated F approximation with degrees of freedom and Pr > F, the associated probability level after the selected variable has been entered or removed. Pillai's trace is a multivariate statistic for testing the hypothesis that the means of the classes on the selected variables are equal in the population (see the "Multivariate Tests" section in Chapter 2).

• Average Squared Canonical Correlation (ASCC). The ASCC is Pillai's trace divided by the number of groups minus 1. The ASCC is close to 1 if all groups are well separated and if all or most directions in the discriminant space show good separation for at least two groups.

• Summary to give statistics associated with the variable chosen at each step. The summary includes the following:
   − Step number
   − Variable Entered or Removed
   − Number In, the number of variables in the model
   − Partial R-Square
   − the F Value for entering or removing the variable
   − Pr > F, the probability level for the F statistic
   − Wilks' Lambda
   − Pr < Lambda based on the F approximation to Wilks' Lambda
   − Average Squared Canonical Correlation
   − Pr > ASCC based on the F approximation to Pillai's trace
   − the variable Label, if any
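The tolerance computation can be reproduced outside the procedure. The following sketch, which is not part of PROC STEPDISC's own output, uses the iris example that follows: with PetalLength already in the model, the entry tolerance reported for SepalWidth should equal 1 minus the R-square from this total-sample regression, up to rounding.

   proc reg data=iris;
      model SepalWidth = PetalLength;   /* candidate regressed on the
                                           variable already in the model */
   run;
   quit;
   /* tolerance for SepalWidth = 1 - (R-square from this regression) */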
ODS Table Names

PROC STEPDISC assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 67.2. ODS Tables Produced in PROC STEPDISC

   ODS Table Name     Description                                      PROC STEPDISC Option
   BCorr              Between-class correlations                       BCORR
   BCov               Between-class covariances                        BCOV
   BSSCP              Between-class SSCP matrix                        BSSCP
   Counts             Number of observations, variables, classes, df   default
   CovDF              DF for covariance matrices, not printed          any *COV option
   Levels             Class level information                          default
   Messages           Entry/removal messages                           default
   Multivariate       Multivariate statistics                          default
   PCorr              Pooled within-class correlations                 PCORR
   PCov               Pooled within-class covariances                  PCOV
   PSSCP              Pooled within-class SSCP matrix                  PSSCP
   PStdMeans          Pooled standardized class means                  STDMEAN
   SimpleStatistics   Simple statistics                                SIMPLE
   Steps              Stepwise selection entry/removal                 default
   Summary            Stepwise selection summary                       default
   TCorr              Total-sample correlations                        TCORR
   TCov               Total-sample covariances                         TCOV
   TSSCP              Total-sample SSCP matrix                         TSSCP
   TStdMeans          Total standardized class means                   STDMEAN
   Variables          Variable lists                                   default
   WCorr              Within-class correlations                        WCORR
   WCov               Within-class covariances                         WCOV
   WSSCP              Within-class SSCP matrices                       WSSCP
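For example, the following sketch captures the selection summary (ODS table name Summary) in a data set; the output data set name StepSummary is arbitrary:

   ods output Summary=StepSummary;

   proc stepdisc data=iris;
      class Species;
   run;

   proc print data=StepSummary;
   run;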
Example

Example 67.1. Performing a Stepwise Discriminant Analysis

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica.

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';

   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      datalines;
   50 33 14 02 1   64 28 56 22 3   65 28 46 15 2   67 31 56 24 3
   63 28 51 15 3   46 34 14 03 1   69 31 51 23 3   62 22 45 15 2
   59 32 48 18 2   46 36 10 02 1   61 30 46 14 2   60 27 51 16 2
   65 30 52 20 3   56 25 39 11 2   65 30 55 18 3   58 27 51 19 3
   68 32 59 23 3   51 33 17 05 1   57 28 45 13 2   62 34 54 23 3
   77 38 67 22 3   63 33 47 16 2   67 33 57 25 3   76 30 66 21 3
   49 25 45 17 3   55 35 13 02 1   67 30 52 23 3   70 32 47 14 2
   64 32 45 15 2   61 28 40 13 2   48 31 16 02 1   59 30 51 18 3
   55 24 38 11 2   63 25 50 19 3   64 32 53 23 3   52 34 14 02 1
   49 36 14 01 1   54 30 45 15 2   79 38 64 20 3   44 32 13 02 1
   67 33 57 21 3   50 35 16 06 1   58 26 40 12 2   44 30 13 02 1
   77 28 67 20 3   63 27 49 18 3   47 32 16 02 1   55 26 44 12 2
   50 23 33 10 2   72 32 60 18 3   48 30 14 03 1   51 38 16 02 1
   61 30 49 18 3   48 34 19 02 1   50 30 16 02 1   50 32 12 02 1
   61 26 56 14 3   64 28 56 21 3   43 30 11 01 1   58 40 12 02 1
   51 38 19 04 1   67 31 44 14 2   62 28 48 18 3   49 30 14 02 1
   51 35 14 02 1   56 30 45 15 2   58 27 41 10 2   50 34 16 04 1
   46 32 14 02 1   60 29 45 15 2   57 26 35 10 2   57 44 15 04 1
   50 36 14 02 1   77 30 61 23 3   63 34 56 24 3   58 27 51 19 3
   57 29 42 13 2   72 30 58 16 3   54 34 15 04 1   52 41 15 01 1
   71 30 59 21 3   64 31 55 18 3   60 30 48 18 3   63 29 56 18 3
   49 24 33 10 2   56 27 42 13 2   57 30 42 12 2   55 42 14 02 1
   49 31 15 02 1   77 26 69 23 3   60 22 50 15 3   54 39 17 04 1
   66 29 46 13 2   52 27 39 14 2   60 34 45 16 2   50 34 15 02 1
   44 29 14 02 1   50 20 35 10 2   55 24 37 10 2   58 27 39 12 2
   47 32 13 02 1   46 31 15 02 1   69 32 57 23 3   62 29 43 13 2
   74 28 61 19 3   59 30 42 15 2   51 34 15 02 1   50 35 13 03 1
   56 28 49 20 3   60 22 40 10 2   73 29 63 18 3   67 25 58 18 3
   49 31 15 01 1   67 31 47 15 2   63 23 44 13 2   54 37 15 02 1
   56 30 41 13 2   63 25 49 15 2   61 28 47 12 2   64 29 43 13 2
   51 25 30 11 2   57 28 41 13 2   65 30 58 22 3   69 31 54 21 3
   54 39 13 04 1   51 35 14 03 1   72 36 61 25 3   65 32 51 20 3
   61 29 47 14 2   56 29 36 13 2   69 31 49 15 2   64 27 53 19 3
   68 30 55 21 3   55 25 40 13 2   48 34 16 02 1   48 30 14 01 1
   45 23 13 03 1   57 25 50 20 3   57 38 17 03 1   51 38 15 03 1
   55 23 40 13 2   66 30 44 14 2   68 28 48 14 2   54 34 17 02 1
   51 37 15 04 1   52 35 15 02 1   58 28 51 24 3   67 30 50 17 2
   63 33 60 25 3   53 37 15 02 1
   ;
A stepwise discriminant analysis is performed using stepwise selection. In the PROC STEPDISC statement, the BSSCP and TSSCP options display the between-class SSCP matrix and the total-sample corrected SSCP matrix. By default, the significance level of an F test from an analysis of covariance is used as the selection criterion. The variable under consideration is the dependent variable, and the variables already chosen act as covariates.

The following SAS statements produce Output 67.1.1 through Output 67.1.8:

   proc stepdisc data=iris bsscp tsscp;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;

Output 67.1.1. Iris Data: Summary Information

                        Fisher (1936) Iris Data

                        The STEPDISC Procedure

            The Method for Selecting Variables is STEPWISE

   Observations    150       Variable(s) in the Analysis        4
   Class Levels      3       Variable(s) will be Included       0
                             Significance Level to Enter     0.15
                             Significance Level to Stay      0.15

                        Class Level Information

                 Variable
   Species       Name           Frequency      Weight    Proportion
   Setosa        Setosa                50     50.0000      0.333333
   Versicolor    Versicolor            50     50.0000      0.333333
   Virginica     Virginica             50     50.0000      0.333333
Output 67.1.2. Iris Data: Between-Class and Total-Sample SSCP Matrices

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                          Between-Class SSCP Matrix

 Variable     Label                 SepalLength    SepalWidth   PetalLength    PetalWidth

 SepalLength  Sepal Length in mm.    6321.21333   -1995.26667   16524.84000    7127.93333
 SepalWidth   Sepal Width in mm.    -1995.26667    1134.49333   -5723.96000   -2293.26667
 PetalLength  Petal Length in mm.   16524.84000   -5723.96000   43710.28000   18677.40000
 PetalWidth   Petal Width in mm.     7127.93333   -2293.26667   18677.40000    8041.33333

                          Total-Sample SSCP Matrix

 Variable     Label                 SepalLength    SepalWidth   PetalLength    PetalWidth

 SepalLength  Sepal Length in mm.   10216.83333    -632.26667   18987.30000    7692.43333
 SepalWidth   Sepal Width in mm.     -632.26667    2830.69333   -4911.88000   -1812.42667
 PetalLength  Petal Length in mm.   18987.30000   -4911.88000   46432.54000   19304.58000
 PetalWidth   Petal Width in mm.     7692.43333   -1812.42667   19304.58000    8656.99333
In Step 1, the tolerance is 1.0 for each variable under consideration because no variables have yet entered the model. Variable PetalLength is selected because its F statistic, 1180.161, is the largest among all variables.

Output 67.1.3. Iris Data: Stepwise Selection Step 1

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                         Stepwise Selection: Step 1

                      Statistics for Entry, DF = 2, 147

   Variable     Label                  R-Square    F Value    Pr > F    Tolerance

   SepalLength  Sepal Length in mm.      0.6187     119.26    <.0001       1.0000
   SepalWidth   Sepal Width in mm.       0.4008      49.16    <.0001       1.0000
   PetalLength  Petal Length in mm.      0.9414    1180.16    <.0001       1.0000
   PetalWidth   Petal Width in mm.       0.9289     960.01    <.0001       1.0000

                      Variable PetalLength will be entered.

                      Variable(s) that have been Entered

                                  PetalLength

                            Multivariate Statistics

   Statistic                         Value    F Value    Num DF    Den DF    Pr > F

   Wilks' Lambda                  0.058628    1180.16         2       147    <.0001
   Pillai's Trace                 0.941372    1180.16         2       147    <.0001
   Average Squared Canonical
   Correlation                    0.470686
In Step 2, with variable PetalLength already in the model, PetalLength is tested for removal before selecting a new variable for entry. Since PetalLength meets the criterion to stay, it is used as a covariate in the analysis of covariance for variable selection. Variable SepalWidth is selected because its F statistic, 43.035, is the largest among all variables not in the model and its associated tolerance, 0.8164, meets the criterion to enter. The process is repeated in Steps 3 and 4. Variable PetalWidth is entered in Step 3, and variable SepalLength is entered in Step 4.
Output 67.1.4. Iris Data: Stepwise Selection Step 2

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                         Stepwise Selection: Step 2

                     Statistics for Removal, DF = 2, 147

   Variable     Label                  R-Square    F Value    Pr > F

   PetalLength  Petal Length in mm.      0.9414    1180.16    <.0001

                        No variables can be removed.

                      Statistics for Entry, DF = 2, 146

                                        Partial
   Variable     Label                  R-Square    F Value    Pr > F    Tolerance

   SepalLength  Sepal Length in mm.      0.3198      34.32    <.0001       0.2400
   SepalWidth   Sepal Width in mm.       0.3709      43.04    <.0001       0.8164
   PetalWidth   Petal Width in mm.       0.2533      24.77    <.0001       0.0729

                      Variable SepalWidth will be entered.

                      Variable(s) that have been Entered

                             SepalWidth PetalLength

                            Multivariate Statistics

   Statistic                         Value    F Value    Num DF    Den DF    Pr > F

   Wilks' Lambda                  0.036884     307.10         4       292    <.0001
   Pillai's Trace                 1.119908      93.53         4       294    <.0001
   Average Squared Canonical
   Correlation                    0.559954
Output 67.1.5. Iris Data: Stepwise Selection Step 3

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                         Stepwise Selection: Step 3

                     Statistics for Removal, DF = 2, 146

                                        Partial
   Variable     Label                  R-Square    F Value    Pr > F

   SepalWidth   Sepal Width in mm.       0.3709      43.04    <.0001
   PetalLength  Petal Length in mm.      0.9384    1112.95    <.0001

                        No variables can be removed.

                      Statistics for Entry, DF = 2, 145

                                        Partial
   Variable     Label                  R-Square    F Value    Pr > F    Tolerance

   SepalLength  Sepal Length in mm.      0.1447      12.27    <.0001       0.1323
   PetalWidth   Petal Width in mm.       0.3229      34.57    <.0001       0.0662

                      Variable PetalWidth will be entered.

                      Variable(s) that have been Entered

                       SepalWidth PetalLength PetalWidth

                            Multivariate Statistics

   Statistic                         Value    F Value    Num DF    Den DF    Pr > F

   Wilks' Lambda                  0.024976     257.50         6       290    <.0001
   Pillai's Trace                 1.189914      71.49         6       292    <.0001
   Average Squared Canonical
   Correlation                    0.594957
Output 67.1.6. Iris Data: Stepwise Selection Step 4

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                         Stepwise Selection: Step 4

                     Statistics for Removal, DF = 2, 145

                                        Partial
   Variable     Label                  R-Square    F Value    Pr > F

   SepalWidth   Sepal Width in mm.       0.4295      54.58    <.0001
   PetalLength  Petal Length in mm.      0.3482      38.72    <.0001
   PetalWidth   Petal Width in mm.       0.3229      34.57    <.0001

                        No variables can be removed.

                      Statistics for Entry, DF = 2, 144

                                        Partial
   Variable     Label                  R-Square    F Value    Pr > F    Tolerance

   SepalLength  Sepal Length in mm.      0.0615       4.72    0.0103       0.0320

                     Variable SepalLength will be entered.

                       All variables have been entered.

                            Multivariate Statistics

   Statistic                         Value    F Value    Num DF    Den DF    Pr > F

   Wilks' Lambda                  0.023439     199.15         8       288    <.0001
   Pillai's Trace                 1.191899      53.47         8       290    <.0001
   Average Squared Canonical
   Correlation                    0.595949
Since no more variables can be added to or removed from the model, the procedure stops at Step 5 and displays a summary of the selection process.

Output 67.1.7. Iris Data: Stepwise Selection Step 5

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                         Stepwise Selection: Step 5

                     Statistics for Removal, DF = 2, 144

                                        Partial
   Variable     Label                  R-Square    F Value    Pr > F

   SepalLength  Sepal Length in mm.      0.0615       4.72    0.0103
   SepalWidth   Sepal Width in mm.       0.2335      21.94    <.0001
   PetalLength  Petal Length in mm.      0.3308      35.59    <.0001
   PetalWidth   Petal Width in mm.       0.2570      24.90    <.0001
No variables can be removed. No further steps are possible.
Output 67.1.8. Iris Data: Stepwise Selection Summary

                           Fisher (1936) Iris Data

                           The STEPDISC Procedure

                          Stepwise Selection Summary

         Number                                      Partial
   Step      In   Entered        Removed            R-Square    F Value    Pr > F

      1       1   PetalLength                         0.9414    1180.16    <.0001
      2       2   SepalWidth                          0.3709      43.04    <.0001
      3       3   PetalWidth                          0.3229      34.57    <.0001
      4       4   SepalLength                         0.0615       4.72    0.0103

                                                               Average Squared
                                      Wilks'        Pr <       Canonical            Pr >
   Step   Label                       Lambda        Lambda     Correlation          ASCC

      1   Petal Length in mm.         0.05862828    <.0001     0.47068586         <.0001
      2   Sepal Width in mm.          0.03688411    <.0001     0.55995394         <.0001
      3   Petal Width in mm.          0.02497554    <.0001     0.59495691         <.0001
      4   Sepal Length in mm.         0.02343863    <.0001     0.59594941         <.0001
References

Costanza, M.C. and Afifi, A.A. (1979), "Comparison of Stopping Rules in Forward Stepwise Discriminant Analysis," Journal of the American Statistical Association, 74, 777-785.

Fisher, R.A. (1936), "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188.

Jennrich, R.I. (1977), "Stepwise Discriminant Analysis," in Statistical Methods for Digital Computers, eds. K. Enslein, A. Ralston, and H. Wilf, New York: John Wiley & Sons, Inc.

Journal of Statistics Education, "Fish Catch Data Set" [http://www.stat.ncsu.edu/info/jse], accessed 4 December 1997.

Klecka, W.R. (1980), Discriminant Analysis, Sage University Paper Series on Quantitative Applications in the Social Sciences, Series No. 07-019, Beverly Hills: Sage Publications.

Rencher, A.C. and Larson, S.F. (1980), "Bias in Wilks's Λ in Stepwise Discriminant Analysis," Technometrics, 22, 349-356.
Chapter 68
The SURVEYFREQ Procedure

Chapter Contents

OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4185

GETTING STARTED . . . . . . . . . . . . . . . . . . . . . . . . . . . 4185

SYNTAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4192
   PROC SURVEYFREQ Statement . . . . . . . . . . . . . . . . . . . . . 4193
   BY Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 4195
   CLUSTER Statement . . . . . . . . . . . . . . . . . . . . . . . . . 4195
   STRATA Statement . . . . . . . . . . . . . . . . . . . . . . . . . 4196
   TABLES Statement . . . . . . . . . . . . . . . . . . . . . . . . . 4196
   WEIGHT Statement . . . . . . . . . . . . . . . . . . . . . . . . . 4203

DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4203
   Specifying the Sample Design . . . . . . . . . . . . . . . . . . . 4203
   Domain Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 4205
   Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . 4205
   Statistical Computations . . . . . . . . . . . . . . . . . . . . . 4206
      Definitions and Notation . . . . . . . . . . . . . . . . . . . . 4207
      Totals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4209
      Covariance of Totals . . . . . . . . . . . . . . . . . . . . . . 4210
      Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . 4210
      Row and Column Proportions . . . . . . . . . . . . . . . . . . . 4212
      Confidence Limits . . . . . . . . . . . . . . . . . . . . . . . 4213
      Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . . 4214
      Coefficient of Variation . . . . . . . . . . . . . . . . . . . . 4214
      Design Effect . . . . . . . . . . . . . . . . . . . . . . . . . 4215
      Expected Weighted Frequency . . . . . . . . . . . . . . . . . . 4215
      Rao-Scott Chi-Square Test . . . . . . . . . . . . . . . . . . . 4216
      Rao-Scott Likelihood Ratio Chi-Square Test . . . . . . . . . . . 4219
      Wald Chi-Square Test . . . . . . . . . . . . . . . . . . . . . . 4221
      Wald Log-Linear Chi-Square Test . . . . . . . . . . . . . . . . 4223
   Displayed Output . . . . . . . . . . . . . . . . . . . . . . . . . 4225
   ODS Table Names . . . . . . . . . . . . . . . . . . . . . . . . . . 4230

EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4230
   Example 68.1. Two-Way Tables . . . . . . . . . . . . . . . . . . . 4230
   Example 68.2. Multiway Tables . . . . . . . . . . . . . . . . . . . 4234
   Example 68.3. Output Data Sets . . . . . . . . . . . . . . . . . . 4236

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4237
Chapter 68
The SURVEYFREQ Procedure

Overview

The SURVEYFREQ procedure produces one-way to n-way frequency and crosstabulation tables from sample survey data. These tables include estimates of population totals and proportions, and the corresponding standard errors. PROC SURVEYFREQ computes the variance estimates based on the sample design used to obtain the survey data. The design can be a complex multistage survey design with stratification, clustering, and unequal weighting. PROC SURVEYFREQ also provides design-based tests of independence and association between variables.

PROC SURVEYFREQ uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. This method is appropriate for all designs where the first-stage sample is selected with replacement, or where the first-stage sampling fraction is small, as it often is in practice. The Taylor expansion method obtains a linear approximation for the estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971, Fuller 1975). When there are clusters or primary sampling units (PSUs) in the sample design, the procedure estimates variance from the variation among PSUs. When the design is stratified, the procedure combines stratum variance estimates to compute the overall variance estimate.
Getting Started

The following example shows how you can use PROC SURVEYFREQ to analyze sample survey data. The example uses data from a customer satisfaction survey for a student information system (SIS), a software product that provides modules for student registration, class scheduling, attendance, grade reporting, and other functions.

The software company conducted a survey of school personnel who use the SIS. A probability sample of SIS users was selected from the study population, which included SIS users at middle schools and high schools in the three-state area of Georgia, South Carolina, and North Carolina. The sample design for this survey was a two-stage stratified design. A first-stage sample of schools was selected from the list of schools using the SIS in the three-state area. The list of schools, or the first-stage sampling frame, was stratified by state and by customer status (whether the school was a new user of the system or a renewal user). Within the first-stage strata, schools were selected with probability proportional to size and with replacement, where the size measure was school enrollment. From each sample school, five staff members were randomly selected to complete the SIS satisfaction questionnaire. These staff members included three teachers and two administrators or guidance staff members.
The SAS data set SIS_Survey contains the survey results, as well as the sample design information needed to analyze the data. This data set includes an observation for each school staff member responding to the survey. The variable Response contains the staff member's response on overall satisfaction with the system. The variable State contains the school's state, and the variable NewUser contains the school's customer status ('New Customer' or 'Renewal Customer'). These two variables determine the first-stage strata from which schools were selected. The variable School contains the school identification code and identifies the first-stage sampling units, or clusters. The variable SamplingWeight contains the overall sampling weight for each respondent. Overall sampling weights were computed from the selection probabilities at each stage of sampling and were adjusted for nonresponse.

Other variables in the data set SIS_Survey include SchoolType and Department. The variable SchoolType identifies the school as a high school or a middle school. The variable Department identifies the staff member as a teacher, or an administrator or guidance department member.

The following PROC SURVEYFREQ statements request a one-way table for the variable Response.

title 'School Information System Survey';
proc surveyfreq data=SIS_Survey;
   tables Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;
The PROC SURVEYFREQ statement invokes the procedure and identifies the input data set to be analyzed. The TABLES statement requests a one-way table for the variable Response. The table request syntax for PROC SURVEYFREQ is very similar to the PROC FREQ table request syntax. This example shows a request for a single one-way table, but you can also request two-way tables or multiway tables. As in PROC FREQ, you can request more than one table in the same TABLES statement, and you can use multiple TABLES statements in the same invocation of PROC SURVEYFREQ.

The STRATA, CLUSTER, and WEIGHT statements provide sample design information to the procedure, so that the analysis is done according to the sample design used for the survey and the estimates apply to the study population. The STRATA statement names the variables State and NewUser, which identify the first-stage strata. Note that the design for this example also includes stratification at the second stage of selection (by type of school personnel), but you specify only the first-stage strata for PROC SURVEYFREQ. The CLUSTER statement identifies School as the cluster, or first-stage sampling unit. The WEIGHT statement names the sampling weight variable.
Figure 68.1 and Figure 68.2 display the output produced by PROC SURVEYFREQ, which includes the Data Summary table and the one-way Table of Response. The Data Summary table is produced by default unless you specify the NOSUMMARY option. This table shows there are 6 strata, 370 clusters or schools, and 1850 observations or respondents in the SIS_Survey data set. The sum of the sampling weights is approximately 39,000, which estimates the total number of school personnel using the SIS in the study area.

               School Information System Survey

                   The SURVEYFREQ Procedure

                        Data Summary

           Number of Strata                       6
           Number of Clusters                   370
           Number of Observations              1850
           Sum of Weights                38899.6482
Figure 68.1. SIS_Survey Data Summary
Figure 68.2 displays the one-way table for Response, which provides estimates of the population total (weighted frequency) and the population percentage for each category, or level, of Response. The response level 'Very Unsatisfied' has a frequency of 304, which means that 304 sample respondents fall into this category. It is estimated that 17.17% of all school personnel in the study population fall into this category, and the standard error of this estimate is 1.29%. Note that the estimates apply to the population of all SIS users in the study area, as opposed to describing only the sample of 1850 respondents. The estimate of the total number of school personnel 'Very Unsatisfied' is 6,678, with a standard deviation of 502. The standard errors computed by PROC SURVEYFREQ are based on the multistage stratified design used for the survey. This differs from some of the traditional analysis procedures, which assume the design is simple random sampling from an infinite population.

                      School Information System Survey

                               Table of Response

                                  Weighted   Std Dev of               Std Err of
Response            Frequency    Frequency     Wgt Freq    Percent       Percent
----------------------------------------------------------------------------------
Very Unsatisfied          304         6678    501.61039    17.1676        1.2872
Unsatisfied               326         6907    495.94101    17.7564        1.2712
Neutral                   581        12291    617.20147    31.5965        1.5795
Satisfied                 455         9309    572.27868    23.9311        1.4761
Very Satisfied            184         3714    370.66577     9.5483        0.9523

Total                    1850        38900    129.85268    100.000
----------------------------------------------------------------------------------
Figure 68.2. One-Way Table of Response
The following PROC SURVEYFREQ statements request confidence limits for the percentage estimates and a chi-square goodness-of-fit test for the one-way table of Response.

proc surveyfreq data=SIS_Survey nosummary;
   tables Response / cl nowt chisq;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;
The NOSUMMARY option in the PROC statement suppresses the Data Summary table. In the TABLES statement, the CL option requests confidence limits for the percentages in the one-way table. The NOWT option suppresses display of the weighted frequencies and their standard deviations. The CHISQ option requests a Rao-Scott chi-square goodness-of-fit test.

Figure 68.3 shows the one-way table of Response, which includes confidence limits for the percentages. The 95% confidence limits for the percentage of users that are 'Very Unsatisfied' are 14.64% and 19.70%. To change the α level of the confidence limits, which equals 5% by default, you can use the ALPHA= option. As for the other estimates and standard errors produced by PROC SURVEYFREQ, these confidence limit computations take into account the complex sample design used for the survey, and the results apply to the entire study population.
                      School Information System Survey

                          The SURVEYFREQ Procedure

                               Table of Response

                                          Std Err of    95% Confidence Limits
Response             Frequency   Percent     Percent         for Percent
--------------------------------------------------------------------------------
Very Unsatisfied           304   17.1676      1.2872     14.6364      19.6989
Unsatisfied                326   17.7564      1.2712     15.2566      20.2562
Neutral                    581   31.5965      1.5795     28.4904      34.7026
Satisfied                  455   23.9311      1.4761     21.0285      26.8338
Very Satisfied             184    9.5483      0.9523      7.6756      11.4210

Total                     1850   100.000
--------------------------------------------------------------------------------
Figure 68.3. Confidence Limits for Response Percentages
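As noted above, the ALPHA= option changes the confidence level. For example, the following variation of the preceding statements would request 90% confidence limits; the 0.1 value here is illustrative and is not part of the original example:

proc surveyfreq data=SIS_Survey nosummary;
   tables Response / cl alpha=0.1;   /* 100(1-0.1)% = 90% confidence limits */
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;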
Figure 68.4 shows the chi-square goodness-of-fit results for the table of Response. The null hypothesis for this test is equal proportions for the levels of the one-way table. (To test a null hypothesis of specified proportions instead of equal proportions, you can use the TESTP= option to specify null hypothesis proportions.)

The chi-square test invoked by the CHISQ option is the Rao-Scott design-adjusted chi-square test, which takes the survey design into account and provides inferences for the entire study population. To produce the Rao-Scott chi-square statistic, PROC SURVEYFREQ first computes the usual Pearson chi-square statistic based on the weighted frequencies, and then adjusts this value with a design correction. An F approximation is also provided. For the table of Response, the F value is 632.85 with a p-value < .0001, which leads to rejection of the null hypothesis of equal proportions for all response levels.

                    Table of Response

               Rao-Scott Chi-Square Test

          Pearson Chi-Square        5294.7773
          Design Correction            2.0916

          Rao-Scott Chi-Square      2531.3980
          DF                                4
          Pr > ChiSq                   <.0001

          F Value                    632.8495
          Num DF                            4
          Den DF                         1456
          Pr > F                       <.0001

          Sample Size = 1850
Figure 68.4. Chi-Square Goodness-of-Fit Test for Response
Continuing to analyze the SIS_Survey data, the following PROC SURVEYFREQ statements request a two-way table for the variables SchoolType by Response.

proc surveyfreq data=SIS_Survey nosummary;
   tables SchoolType * Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;

The STRATA, CLUSTER, and WEIGHT statements do not change from the one-way table example, since the survey design and the input data set are the same. These SURVEYFREQ statements request a different table, but specify the same sample design information.
Figure 68.5 shows the two-way table produced. The first variable named in the two-way table request, SchoolType, is referred to as the row variable, and the second variable named, Response, is referred to as the column variable. Two-way tables display all column variable levels for each row variable level. So this two-way table lists all levels of the column variable Response for each level of the row variable SchoolType, 'Middle School' and 'High School'. Also, SchoolType = 'Total' shows the distribution of Response overall for both types of schools. And Response = 'Total' provides totals over all levels of response, for each type of school and overall. To suppress these totals, you can use the NOTOTAL option.

By default, without any other TABLES statement options, a two-way table displays the frequency, weighted frequency and its standard deviation, and percentage and its standard error for each table cell, or combination of row and column variable levels. But there are several options available to customize your table display by adding more information or suppressing some of the default information.

                           School Information System Survey

                               The SURVEYFREQ Procedure

                           Table of SchoolType by Response

                                                 Weighted   Std Dev of              Std Err of
SchoolType      Response           Frequency    Frequency     Wgt Freq    Percent      Percent
------------------------------------------------------------------------------------------------
Middle School   Very Unsatisfied         116         2496    351.43834     6.4155       0.9030
                Unsatisfied              109         2389    321.97957     6.1427       0.8283
                Neutral                  234         4856    504.20553    12.4847       1.2953
                Satisfied                197         4064    443.71188    10.4467       1.1417
                Very Satisfied            94         1952    302.17144     5.0193       0.7758
                Total                    750        15758         1000    40.5089       2.5691
------------------------------------------------------------------------------------------------
High School     Very Unsatisfied         188         4183    431.30589    10.7521       1.1076
                Unsatisfied              217         4518    446.31768    11.6137       1.1439
                Neutral                  347         7434    574.17175    19.1119       1.4726
                Satisfied                258         5245    498.03221    13.4845       1.2823
                Very Satisfied            90         1762    255.67158     4.5290       0.6579
                Total                   1100        23142         1003    59.4911       2.5691
------------------------------------------------------------------------------------------------
Total           Very Unsatisfied         304         6678    501.61039    17.1676       1.2872
                Unsatisfied              326         6907    495.94101    17.7564       1.2712
                Neutral                  581        12291    617.20147    31.5965       1.5795
                Satisfied                455         9309    572.27868    23.9311       1.4761
                Very Satisfied           184         3714    370.66577     9.5483       0.9523
                Total                   1850        38900    129.85268    100.000
------------------------------------------------------------------------------------------------
Figure 68.5. Two-Way Table of SchoolType by Response
The following PROC SURVEYFREQ statements request a two-way table of SchoolType by Response with row percentages, and also request a chi-square test for association between the two variables.

proc surveyfreq data=SIS_Survey nosummary;
   tables SchoolType * Response / row nowt chisq;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;

The ROW option in the TABLES statement requests row percentages, which display the distribution of Response as a percentage of each level of the row variable SchoolType. The NOWT option suppresses display of the weighted frequencies and their standard deviations. The CHISQ option requests a Rao-Scott chi-square test of association between SchoolType and Response.

Figure 68.6 displays the two-way table produced. For middle schools, it is estimated that 25.79% of school personnel are satisfied with the school information system, and 12.39% are very satisfied. For high schools, these estimates are 22.67% and 7.61%, respectively.

                           School Information System Survey

                               The SURVEYFREQ Procedure

                           Table of SchoolType by Response

                                                          Std Err of       Row    Std Err of
SchoolType      Response           Frequency    Percent      Percent   Percent   Row Percent
------------------------------------------------------------------------------------------------
Middle School   Very Unsatisfied         116     6.4155       0.9030   15.8373        1.9920
                Unsatisfied              109     6.1427       0.8283   15.1638        1.8140
                Neutral                  234    12.4847       1.2953   30.8196        2.5173
                Satisfied                197    10.4467       1.1417   25.7886        2.2947
                Very Satisfied            94     5.0193       0.7758   12.3907        1.7449
                Total                    750    40.5089       2.5691   100.000
------------------------------------------------------------------------------------------------
High School     Very Unsatisfied         188    10.7521       1.1076   18.0735        1.6881
                Unsatisfied              217    11.6137       1.1439   19.5218        1.7280
                Neutral                  347    19.1119       1.4726   32.1255        2.0490
                Satisfied                258    13.4845       1.2823   22.6663        1.9240
                Very Satisfied            90     4.5290       0.6579    7.6128        1.0557
                Total                   1100    59.4911       2.5691   100.000
------------------------------------------------------------------------------------------------
Total           Very Unsatisfied         304    17.1676       1.2872
                Unsatisfied              326    17.7564       1.2712
                Neutral                  581    31.5965       1.5795
                Satisfied                455    23.9311       1.4761
                Very Satisfied           184     9.5483       0.9523
                Total                   1850    100.000
------------------------------------------------------------------------------------------------
Figure 68.6. Two-Way Table with Row Percentages
Figure 68.7 displays the chi-square test results. The Rao-Scott chi-square statistic equals 190.19, and the corresponding F value is 47.55 with a p-value < .0001. This indicates a significant association between school type (middle school or high school) and satisfaction with the student information system.

               Table of SchoolType by Response

                  Rao-Scott Chi-Square Test

          Pearson Chi-Square         394.9453
          Design Correction            2.0766

          Rao-Scott Chi-Square       190.1879
          DF                                4
          Pr > ChiSq                   <.0001

          F Value                     47.5470
          Num DF                            4
          Den DF                         1456
          Pr > F                       <.0001

          Sample Size = 1850
Figure 68.7. Chi-Square Test of No Association
Syntax

The following statements are available in PROC SURVEYFREQ:

PROC SURVEYFREQ < options > ;
   BY variables ;
   CLUSTER variables ;
   STRATA variables < / option > ;
   TABLES requests < / options > ;
   WEIGHT variable ;

The PROC SURVEYFREQ statement invokes the procedure, identifies the data set to be analyzed, and provides sample design information. The PROC SURVEYFREQ statement is required. The TABLES statement specifies frequency or crosstabulation tables and requests tests and statistics for those tables. The STRATA statement lists the variables that form the strata in a stratified sample design. The CLUSTER statement specifies cluster identification variables in a clustered sample design. The WEIGHT statement names the sampling weight variable. You can use a BY statement with PROC SURVEYFREQ to obtain separate analyses for groups defined by the BY variables.

All statements can appear multiple times except the PROC SURVEYFREQ statement and the WEIGHT statement, which can appear only once.
The rest of this section gives detailed syntax information for the BY, CLUSTER, STRATA, TABLES, and WEIGHT statements in alphabetical order after the description of the PROC SURVEYFREQ statement.

PROC SURVEYFREQ Statement

PROC SURVEYFREQ < options > ;

The PROC SURVEYFREQ statement invokes the procedure. In this statement, you identify the data set to be analyzed and specify sample design information. The DATA= option names the input data set to be analyzed. If your analysis includes a finite population correction factor, you can input either the sampling rate or the population total using the RATE= or TOTAL= option. If your design is stratified, with different sampling rates or totals for different strata, then you can input these stratum rates or totals in a SAS data set containing the stratification variables.

You can specify the following options in the PROC SURVEYFREQ statement:

DATA=SAS-data-set
   names the SAS data set to be analyzed by PROC SURVEYFREQ. If you omit the DATA= option, the procedure uses the most recently created SAS data set.

MISSING
   requests that the procedure treat missing values as a valid category for all categorical variables, which include TABLES variables, STRATA variables, and CLUSTER variables. For more information, see the section "Missing Values" on page 4205.

NOSUMMARY
   suppresses the display of the Data Summary table, which PROC SURVEYFREQ produces by default. For a description of this table, see the section "Data and Sample Design Summary Table" on page 4225.

ORDER=DATA | FORMATTED | FREQ | INTERNAL
   specifies the order in which the values of the frequency and crosstabulation table variables are to be reported. The following table shows how PROC SURVEYFREQ interprets values of the ORDER= option:

   DATA         orders values according to their order in the input data set.

   FORMATTED    orders values by their formatted values. This order is operating-environment dependent. By default, the order is ascending.

   FREQ         orders values by descending frequency count. The frequency count of a variable value is its (nonweighted) frequency of occurrence or sample size, and not its weighted frequency.

   INTERNAL     orders values by their unformatted values, which yields the same order that the SORT procedure does. This order is operating-environment dependent.

   By default, ORDER=INTERNAL.
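As a brief sketch, the following statements would list the levels of Response in the SIS_Survey example by descending sample size rather than by unformatted value:

proc surveyfreq data=SIS_Survey order=freq;
   tables Response;               /* levels listed by descending frequency count */
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;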
PAGE
   displays only one table per page. Otherwise, PROC SURVEYFREQ displays multiple tables per page as space permits.

RATE=value | SAS-data-set
R=value | SAS-data-set
   specifies the sampling rate as a nonnegative value, or identifies an input data set that gives the stratum sampling rates in a variable named _RATE_. The procedure uses this information to compute a finite population correction for variance estimation. If your sample design has multiple stages, you should specify the first-stage sampling rate, which is the ratio of the number of PSUs selected to the total number of PSUs in the population.

   For a nonstratified sample design, or for a stratified sample design with the same sampling rate in all strata, you should specify a nonnegative value for the RATE= option. If your design is stratified with different sampling rates in the strata, then you should name a SAS data set that contains the stratification variables and the sampling rates. See the section "Population Totals and Sampling Rates" on page 4204 for more details.

   The sampling rate value must be a nonnegative number. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYFREQ will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

   If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.

TOTAL=value | SAS-data-set
N=value | SAS-data-set
   specifies the total number of primary sampling units (PSUs) in the study population as a positive value, or identifies an input data set that gives the stratum population totals in a variable named _TOTAL_. The procedure uses this information to compute a finite population correction for variance estimation. For a nonstratified sample design, or for a stratified sample design with the same population total in all strata, you should specify a positive value for the TOTAL= option. If your sample design is stratified with different population totals in the strata, then you should name a SAS data set that contains the stratification variables and the population totals. See the section "Population Totals and Sampling Rates" on page 4204 for more details.

   If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
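For example, if the same first-stage sampling rate applied in every stratum of the SIS_Survey design, a single value could be given directly; the 5% rate below is purely illustrative:

proc surveyfreq data=SIS_Survey rate=0.05;  /* common first-stage sampling rate (assumed value) */
   tables Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;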
BY Statement

BY variables ;

You can specify a BY statement with PROC SURVEYFREQ to obtain separate analyses on observations in groups defined by the BY variables. The variables are one or more variables in the input data set.

Note that using a BY statement provides completely separate analyses of the BY groups. It does not provide a statistically valid subpopulation or domain analysis, the difference being that in domain analysis the total number of units in the subpopulation is not known with certainty. You should include the domain variable(s) in your TABLES request to obtain domain analysis. See the section "Domain Analysis" on page 4205 for more details.

If you specify more than one BY statement, the procedure uses only the last BY statement and ignores any previous BY statements.

When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the SURVEYFREQ procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
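As an illustration, the following sketch sorts the data by a hypothetical grouping variable Region (not part of the SIS_Survey example) and then produces completely separate analyses for each region; recall that this is not a domain analysis:

proc sort data=SIS_Survey;
   by Region;                     /* Region is a hypothetical BY variable */
run;

proc surveyfreq data=SIS_Survey;
   by Region;                     /* one separate analysis per region */
   tables Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;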
CLUSTER Statement

CLUSTER variables ;

The CLUSTER statement names variables that identify the first-stage clusters, or primary sampling units (PSUs), in a clustered sample design. The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata. If your sample design has clustering at multiple stages, you should specify only the first-stage clusters, or PSUs, in the CLUSTER statement. See the section "Specifying the Sample Design" on page 4203 for more information.

The CLUSTER variables are one or more variables in the DATA= input data set. These variables can be either character or numeric, but the procedure treats them as categorical variables. The formatted values of the CLUSTER variables determine the CLUSTER variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can use multiple CLUSTER statements to specify CLUSTER variables. The procedure uses variables from all CLUSTER statements to create clusters.
STRATA Statement

STRATA variables < / option > ;

The STRATA statement names variables that form the strata in a stratified sample design. The combinations of categories of STRATA variables define the strata in the sample, where strata are nonoverlapping subgroups that were sampled independently. If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. See the section "Specifying the Sample Design" on page 4203 for more information.

The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric, but the procedure treats them as categorical. The formatted values of the STRATA variables determine the STRATA variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can specify the following option in the STRATA statement after a slash (/):

LIST
   displays a "Stratum Information" table, which lists all strata together with the corresponding values of the STRATA variables. This table provides the number of observations and number of clusters for each stratum, as well as the sampling fraction if you specify the RATE= or the TOTAL= option. See the section "Stratum Information Table" on page 4225 for more information.
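For example, adding the LIST option to the "Getting Started" analysis displays the "Stratum Information" table:

proc surveyfreq data=SIS_Survey;
   tables Response;
   strata State NewUser / list;   /* request the Stratum Information table */
   cluster School;
   weight SamplingWeight;
run;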
TABLES Statement

TABLES requests < / options > ;

The TABLES statement requests one-way to n-way frequency and crosstabulation tables and statistics for those tables. If you omit the TABLES statement, PROC SURVEYFREQ generates one-way frequency tables for all data set variables that are not listed in the other statements.

The following argument is required in the TABLES statement.

requests
   specify the frequency and crosstabulation tables to produce. A request is composed of one variable name or several variable names separated by asterisks. To request a one-way frequency table, use a single variable. To request a two-way crosstabulation table, use an asterisk between two variables. To request a multiway table (an n-way table, where n > 2), separate the desired variables with asterisks. The unique values of these variables form the rows, columns, and layers of the table. For two-way to multiway tables, the values of the last variable form the crosstabulation table columns, while the values of the next-to-last variable form the rows. Each level (or combination of levels) of the other variables forms one layer. PROC SURVEYFREQ produces a separate crosstabulation table for each layer. For example, a specification of A*B*C*D in a TABLES statement produces k tables, where k is the number of different combinations of levels for A and B. Each table lists the levels for D (columns) within each level of C (rows).

   You can use multiple TABLES statements in the PROC SURVEYFREQ step. You can also specify any number of table requests in a single TABLES statement. To specify multiple table requests quickly, use a grouping syntax by placing parentheses around several variables and joining other variables or variable combinations. For example, the following statements illustrate grouping syntax:

   Table 68.1. Grouping Syntax

   Request                   Equivalent to

   tables A*(B C);           tables A*B A*C;
   tables (A B)*(C D);       tables A*C B*C A*D B*D;
   tables (A B C)*D;         tables A*D B*D C*D;
   tables A -- C;            tables A B C;
   tables (A -- C)*D;        tables A*D B*D C*D;
The TABLES statement variables are one or more variables from the DATA= input data set. These variables can be either character or numeric, but the procedure treats them as categorical variables. PROC SURVEYFREQ uses the formatted values of the TABLES variables to determine the categorical variable levels. So if you assign a format to a variable with a FORMAT statement, PROC SURVEYFREQ formats the values before dividing observations into the levels of a frequency or crosstabulation table, as the sketch below illustrates. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

The frequency or crosstabulation table lists the values of both character and numeric variables in ascending order based on internal (unformatted) variable values unless you change the order with the ORDER= option. To list the values in ascending order by formatted value, use ORDER=FORMATTED in the PROC SURVEYFREQ statement.
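The following sketch illustrates grouping values into table levels with a format; the data set MySurvey, the variable Age, the weight variable, and the cutpoints are all hypothetical:

proc format;
   value agegrp low-<20 = 'Under 20'       /* formatted values define the levels */
                20-<40  = '20 to 39'
                40-high = '40 and Over';
run;

proc surveyfreq data=MySurvey;
   tables Age;
   format Age agegrp.;                     /* table levels are the agegrp. groups */
   weight SamplingWeight;
run;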
Without Options

If you request a frequency or crosstabulation table without specifying options, PROC SURVEYFREQ produces the following for each table level or cell:

• frequency (sample size)
• weighted frequency (estimated total)
• standard error of weighted frequency
• percentage (estimated proportion)
• standard error of percentage
The table displays weighted frequencies if your analysis includes a WEIGHT statement, or if you specify the WTFREQ option in the TABLES statement. The table also displays the number of observations with missing values. See the section “One-Way Frequency Tables” on page 4226 and the section “Crosstabulation Tables” on page 4227 for more information.
Options

The following table lists the options available with the TABLES statement. Descriptions follow in alphabetical order.

Table 68.2. TABLES Statement Options

Option        Description

Control Statistical Analysis
ALPHA=        sets the level for confidence limits
CHISQ         requests Rao-Scott chi-square test
CHISQ1        requests Rao-Scott modified chi-square test
DDF=          specifies denominator DF for Wald chi-square tests
LRCHISQ       requests Rao-Scott likelihood ratio test
LRCHISQ1      requests Rao-Scott modified likelihood ratio test
TESTP=        specifies null proportions for one-way chi-square tests
WCHISQ        requests Wald chi-square test
WLLCHISQ      requests Wald log-linear chi-square test

Control Additional Table Information
CL            displays confidence limits for percents
CLWT          displays confidence limits for weighted frequencies
COL           displays column percents and standard errors
CV            displays coefficients of variation for percents
CVWT          displays coefficients of variation for weighted frequencies
DEFF          displays design effects for percents
EXPECTED      displays expected weighted frequencies for two-way tables
ROW           displays row percents and standard errors
VAR           displays variances for percents
VARWT         displays variances for weighted frequencies
WTFREQ        displays weighted frequencies and standard errors when there is no WEIGHT statement

Control Displayed Output
NOFREQ        suppresses display of frequency counts
NOPERCENT     suppresses display of percents
NOPRINT       suppresses display of tables but displays statistical tests
NOSPARSE      suppresses display of zero rows and columns in two-way tables
NOSTD         suppresses display of standard errors for all estimates
NOTOTAL       suppresses display of row and column totals
NOWT          suppresses display of weighted frequencies
You can specify the following options in a TABLES statement:

ALPHA=α
   sets the level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default is 0.05. A confidence level of α produces 100(1 − α)% confidence limits. The default of ALPHA=0.05 produces 95% confidence limits. You request confidence limits for percentages with the CL option, and you request confidence limits for weighted frequencies with the CLWT option. See the section "Confidence Limits" on page 4213 for more information.

CHISQ
   requests the Rao-Scott chi-square test. This test applies a design effect correction to the Pearson chi-square statistic computed from the weighted frequencies. See the section "Rao-Scott Chi-Square Test" on page 4216 for more information.

   By default for one-way tables, the CHISQ option provides a design-based goodness-of-fit test for equal proportions. To compute the test for other null hypothesis proportions, specify the null proportions with the TESTP= option. The CHISQ option uses proportion estimates to compute the design effect correction. To use null hypothesis proportions instead, specify the CHISQ1 option.

CHISQ1
   requests the Rao-Scott modified chi-square test. This test applies a design effect correction to the Pearson chi-square statistic computed from the weighted frequencies, and bases the design effect correction on null hypothesis proportions. See the section "Rao-Scott Chi-Square Test" on page 4216 for more information. To compute the design effect correction from proportion estimates instead of null proportions, specify the CHISQ option. By default for one-way tables, the CHISQ1 option provides a design-based goodness-of-fit test for equal proportions. To compute the test for other null hypothesis proportions, specify the null proportions with the TESTP= option.

CL
   requests confidence limits for the percentages, or proportion estimates, in the crosstabulation table. PROC SURVEYFREQ determines the confidence coefficient from the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. See the section "Confidence Limits" on page 4213 for more information.

CLWT
   requests confidence limits for the weighted frequencies, or estimated totals, in the crosstabulation table. PROC SURVEYFREQ determines the confidence coefficient from the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. See the section "Confidence Limits" on page 4213 for more information.
COL
   displays the column percentage, or estimated proportion of the column total, for each cell in a two-way table. The COL option also displays the standard errors of the column percentages. See the section "Row and Column Proportions" on page 4212 for more information. This option has no effect for one-way tables.

CV
   displays the coefficient of variation for each percentage, or proportion estimate, in the crosstabulation table. See the section "Coefficient of Variation" on page 4214 for more information.

CVWT
   displays the coefficient of variation for each weighted frequency, or estimated total, in the crosstabulation table. See the section "Coefficient of Variation" on page 4214 for more information.

DDF=df
   specifies the denominator degrees of freedom for the F-statistics used in the Wald chi-square tests. By default, the denominator degrees of freedom is the number of clusters minus the number of strata. See the section "Wald Chi-Square Test" on page 4221 and the section "Wald Log-Linear Chi-Square Test" on page 4223 for more information. You request the Wald chi-square test with the WCHISQ option, and you request the Wald log-linear chi-square test with the WLLCHISQ option.

DEFF
   displays the design effect for each overall proportion estimate in the crosstabulation table. See the section "Design Effect" on page 4215 for more information.

EXPECTED
   displays expected weighted frequencies for the table cells in a two-way table. The expected frequencies are computed under the null hypothesis that the row and column variables are independent. See the section "Expected Weighted Frequency" on page 4215 for more information. This option has no effect for one-way tables.

LRCHISQ
   requests the Rao-Scott likelihood ratio chi-square test. This test applies a design effect correction to the likelihood ratio chi-square statistic computed from the weighted frequencies. See the section "Rao-Scott Likelihood Ratio Chi-Square Test" on page 4219 for more information.

   By default for one-way tables, the LRCHISQ option provides a design-based test for equal proportions. To compute the test for other null hypothesis proportions, specify the null proportions with the TESTP= option. The LRCHISQ option uses proportion estimates to compute the design effect correction. To use null hypothesis proportions instead, specify the LRCHISQ1 option.

LRCHISQ1
   requests the Rao-Scott modified likelihood ratio chi-square test. This test applies a design effect correction to the likelihood ratio chi-square statistic computed from the weighted frequencies, and bases the design effect correction on null hypothesis proportions. See the section "Rao-Scott Likelihood Ratio Chi-Square Test" on page 4219 for more information. To compute the design effect correction from proportion estimates instead of null proportions, specify the LRCHISQ option. By default for one-way tables, the LRCHISQ1 option provides a design-based test for equal proportions. To compute the test for other null hypothesis proportions, specify the null proportions with the TESTP= option.

NOFREQ
   suppresses the display of cell frequencies in the crosstabulation table. The NOFREQ option also suppresses the display of row, column, and overall table frequencies.

NOPERCENT
   suppresses the display of cell percentages in the crosstabulation table. The NOPERCENT option also suppresses the display of standard errors of the percentages.

NOPRINT
   suppresses the display of frequency and crosstabulation tables but displays all requested statistical tests. Note that this option disables the Output Delivery System (ODS) for the suppressed tables. For more information, see Chapter 14, "Using the Output Delivery System."

NOSPARSE
   suppresses the display of variable levels with zero frequency in two-way tables. By default, the procedure displays all levels of the column variable within each level of the row variable, including any column variable levels with zero frequency for that row. For multiway tables, the procedure displays all levels of the row variable for each layer of the table by default, including any row variable levels with zero frequency for the layer. Also by default, the procedure displays all variable levels that occur in the input data set, including those levels with no observations actually used in the analysis due to missing or nonpositive weights or missing values. See the section "Missing Values" on page 4205 for details.

NOSTD
   suppresses the display of all standard errors in the crosstabulation table.

NOTOTAL
   suppresses the display of row totals, column totals, and overall totals in the crosstabulation table.

NOWT
   suppresses the display of weighted frequencies in the crosstabulation table. The NOWT option also suppresses the display of standard errors of the weighted frequencies.

ROW
   displays the row percentage, or estimated proportion of the row total, for each cell in a two-way table. The ROW option also displays the standard errors of the row percentages. See the section "Row and Column Proportions" on page 4212 for more information. This option has no effect for one-way tables.
TESTP=(values)
   specifies null hypothesis proportions, or test percentages, for one-way chi-square tests. You can separate values with blanks or commas. Specify values in probability form as numbers between 0 and 1, where the proportions sum to 1. Or specify values in percentage form as numbers between 1 and 100, where the percentages sum to 100. PROC SURVEYFREQ treats the value 1 as the percentage form 1%. The number of TESTP= values must equal the number of variable levels in the one-way table. List these values in the order in which the corresponding variable levels appear in the output.

   When you specify the TESTP= option, PROC SURVEYFREQ displays the specified test percentages in the one-way frequency table. The TESTP= option has no effect for two-way tables.

   PROC SURVEYFREQ uses the TESTP= values for all one-way chi-square tests you request in the TABLES statement. The available one-way chi-square tests include the Rao-Scott (Pearson) chi-square test and the Rao-Scott likelihood ratio chi-square test and their modified versions, requested by the options CHISQ, CHISQ1, LRCHISQ, and LRCHISQ1. See the section "Rao-Scott Chi-Square Test" on page 4216 and the section "Rao-Scott Likelihood Ratio Chi-Square Test" on page 4219 for more details.
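For example, the following statements test the five levels of Response in the SIS_Survey example against specified null percentages rather than equal proportions; the test percentages here are purely illustrative and sum to 100:

proc surveyfreq data=SIS_Survey nosummary;
   tables Response / chisq testp=(10 20 40 20 10);  /* illustrative null percentages */
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;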
VAR
   displays the variance estimate for each percentage in the crosstabulation table. See the section "Proportions" on page 4210 for details.

VARWT
   displays the variance estimate for each weighted frequency, or estimated total, in the crosstabulation table. See the section "Totals" on page 4209 for details.

WCHISQ
   requests the Wald chi-square test. See the section "Wald Chi-Square Test" on page 4221 for more information. By default, the denominator degrees of freedom for the Wald test F-statistic is the number of clusters minus the number of strata. Alternatively, you can specify the denominator degrees of freedom with the DDF= option.

WLLCHISQ
   requests the Wald log-linear chi-square test. See the section "Wald Log-Linear Chi-Square Test" on page 4223 for more information. By default, the denominator degrees of freedom for the Wald test F-statistic is the number of clusters minus the number of strata. Alternatively, you can specify the denominator degrees of freedom with the DDF= option.

WTFREQ
   displays the weighted frequencies and their standard errors when you do not specify a WEIGHT statement. PROC SURVEYFREQ displays the weighted frequencies by default when the analysis includes a WEIGHT statement. Without a WEIGHT statement, PROC SURVEYFREQ assigns all observations a weight of 1.
WEIGHT Statement

WEIGHT variable ;

The WEIGHT statement names the variable that contains the sampling weights. This variable must be numeric. If you do not specify a WEIGHT statement, PROC SURVEYFREQ assigns all observations a weight of 1.

Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. See the section "Missing Values" on page 4205 for more information.

If you specify more than one WEIGHT statement, the procedure uses only the first WEIGHT statement and ignores the rest.
Details

Specifying the Sample Design

PROC SURVEYFREQ produces tables and statistics based on the sample design used to obtain the survey data. The procedure uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. See the section "Statistical Computations" on page 4206 for details. This method is appropriate for all designs where the first-stage sample is selected with replacement, or where the first-stage sampling fraction is small, as it often is in practice.

PROC SURVEYFREQ can be used for single-stage designs or for multistage designs, with or without stratification, and with or without unequal weighting. You provide sample design information with the STRATA, CLUSTER, and WEIGHT statements, and with the RATE= or TOTAL= option in the PROC SURVEYFREQ statement.

When there are clusters, or PSUs, in the sample design, the procedure estimates variance from the variance among PSUs. For a multistage sample design, the variance estimation method depends only on the first stage of the sample design. So, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling.
Stratification

If your sample design is stratified at the first stage of sampling, use the STRATA statement to name variables that form the strata. The combinations of categories of STRATA variables define the strata in the sample, where strata are nonoverlapping subgroups that were sampled independently. If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. If you do not specify a STRATA statement, PROC SURVEYFREQ assumes there is no stratification at the first stage.
Clustering

If your sample design selects clusters at the first stage of sampling, use the CLUSTER statement to name variables that identify the first-stage clusters, or primary sampling units (PSUs). The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata.
If your sample design has clustering at multiple stages, you should specify only the first-stage clusters, or PSUs, in the CLUSTER statement. PROC SURVEYFREQ assumes that each cluster defined by the CLUSTER statement variables represents a PSU in the sample, and that each observation belongs to one PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.
Weighting

If your sample design includes unequal weighting, use the WEIGHT statement to name the variable that contains the sampling weights. If you do not specify a WEIGHT statement, PROC SURVEYFREQ assigns all observations a weight of 1. Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. See the section "Missing Values" on page 4205 for more information.
Population Totals and Sampling Rates

If your analysis needs to include a finite population correction (fpc), you can input either the sampling rate or the population total using the RATE= option or the TOTAL= option in the PROC SURVEYFREQ statement. (You cannot specify both of these options in the same PROC SURVEYFREQ statement.) If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the study population. If you are specifying the TOTAL= option for a multistage design, you should input the total number of PSUs in the study population.

For a nonstratified sample design, or for a stratified sample design with the same sampling rate or the same population total in all strata, you should use the RATE=value option or the TOTAL=value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you can use the RATE=SAS-data-set option or the TOTAL=SAS-data-set option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option.

The secondary data set must contain all the stratification variables listed in the STRATA statement and all the variables in the BY statement. Furthermore, the BY groups must appear in the same order as in the primary data set. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL=SAS-data-set option, the secondary data set must have a variable named _TOTAL_ that contains the stratum population totals. Alternatively, if you specify the RATE=SAS-data-set option, the secondary data set must have a variable named _RATE_ that contains the stratum sampling rates. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of _TOTAL_ or _RATE_ for that stratum and ignores the rest.
The value in the RATE= option or the values of _RATE_ in the secondary data set must be nonnegative numbers. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYFREQ will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. If you specify the TOTAL=value option, value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.
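As a hedged illustration of supplying stratum totals in a secondary data set, the following sketch (not from the original documentation) assumes a single hypothetical STRATA variable Region with made-up totals; all names here are placeholders:

   data StratumTotals;              /* secondary data set                 */
      input Region $ _TOTAL_;       /* must contain the STRATA variables
                                       and a variable named _TOTAL_       */
      datalines;
   East 1500
   West 2200
   ;

   proc surveyfreq data=MySurvey total=StratumTotals;
      tables Response;
      strata Region;                /* same stratification variable       */
      weight SamplingWeight;
   run;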
Domain Analysis

PROC SURVEYFREQ provides domain analysis through its multiway table capability. Domain analysis refers to the computation of statistics for subpopulations, or domains, in addition to the computation of statistics for the entire study population. Formation of these domains may be unrelated to the sample design, so the domain sample sizes may actually be random variables. Domain analysis takes this variability into account by using the entire sample when estimating variance for domain estimates. This is also known as subgroup analysis, subpopulation analysis, or subdomain analysis. For more information on domain analysis, refer to Lohr (1999), Cochran (1977), and Fuller et al. (1989).

To request domain analysis with PROC SURVEYFREQ, you should include the domain variable(s) in your TABLES statement request. For example, specifying DOMAIN * A * B in a TABLES statement produces separate two-way tables of A by B for each level of DOMAIN. If your domains are formed by more than one variable, you can specify DomainVariable_1 * DomainVariable_2 * A * B, for example, to obtain two-way tables of A by B for each domain formed by the different combinations of levels for DomainVariable_1 and DomainVariable_2.

Including the domain variables in a TABLES statement request gives a different analysis from that obtained by using a BY statement, which provides completely separate analyses of the BY groups. The BY statement can also be used to analyze the data set by subgroups, but it does not produce a valid domain analysis. The BY statement is appropriate only when the number of units in each subgroup is known with certainty; when the subgroup sample size is a random variable, include the domain variables in your TABLES statement request, as in the sketch below.
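A minimal sketch of a domain request; the data set name and all variable names (Domain, A, B, StratumVar, ClusterVar, WeightVar) are hypothetical placeholders:

   proc surveyfreq data=MySurvey;
      tables Domain * A * B;        /* one A-by-B table per Domain level;
                                       variances use the full sample      */
      strata StratumVar;
      cluster ClusterVar;
      weight WeightVar;
   run;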
Missing Values

If an observation has a missing value or a nonpositive value for the WEIGHT variable, then PROC SURVEYFREQ excludes that observation from the analysis. An observation is also excluded from the analysis if it has a missing value for any STRATA or CLUSTER variable, unless you specify the MISSING option in the PROC SURVEYFREQ statement. The MISSING option requests that the procedure treat missing values as a valid category for all categorical variables, which include strata variables, cluster variables, and classification or table variables.

Additionally, PROC SURVEYFREQ excludes an observation from a crosstabulation table (and any associated analyses) if that observation has a missing value for any
of the table variables, unless you specify the MISSING option. When the procedure excludes observations with missing values from a table, it displays the total frequency of missing observations below that table. With the MISSING option, the procedure treats the missing values as a valid category and includes them in calculations of percentages and other statistics.

If all values in a stratum are excluded from the analysis of a table as missing values, then that stratum is called an empty stratum. Empty strata are not counted in the total number of strata for the table, which is used to determine the degrees of freedom for confidence limits and tests. Similarly, empty clusters and missing observations are not included in the total counts of clusters and observations used in the analysis of the table.

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS summary table that contains the number of (nonmissing) observations, strata, and clusters that are included in the analysis of the requested table. When there are missing observations, empty strata, or empty clusters for the requested table, then these numbers in the "Table Summary" differ from the total number of observations, strata, and clusters that are present in the input data set and reported in the "Data Summary." See Example 68.3 on page 4236 for more information on the "Table Summary."

If you have missing values in your survey data for any reason (such as nonresponse), this can compromise the quality of your survey results. An observation without missing values is called a complete respondent, and an observation with missing values is called an incomplete respondent. If the complete respondents are different from the incomplete respondents with regard to a survey effect or outcome, then survey estimates will be biased and will not accurately represent the survey population. There are a variety of techniques in sample design and survey operations that can reduce nonresponse. Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEYFREQ. Refer to Cochran (1977), Kalton and Kasprzyk (1986), and Brick and Kalton (1996) for more details.
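A short sketch of the MISSING option, again with placeholder names; with the option, missing levels of the strata, cluster, and table variables are kept as valid categories:

   proc surveyfreq data=MySurvey missing;
      tables A * B;                 /* missing values become a table level */
      strata StratumVar;
      cluster ClusterVar;
      weight WeightVar;
   run;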
Statistical Computations

The SURVEYFREQ procedure uses the Taylor series expansion method to estimate standard errors of estimators of proportions for crosstabulation tables. For sample survey data, the proportion estimator is a ratio estimator formed from estimators of totals. For example, to estimate the proportion in a crosstabulation table cell, the procedure uses the ratio of the estimator of the cell total frequency to the estimator of the overall population total, where these totals are linear statistics computed from the survey data. The Taylor series expansion method obtains a first-order linear approximation for the ratio estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971, Fuller 1975).

When there are clusters, or PSUs, in the sample design, the procedure estimates variance from the variance among PSUs. When the design is stratified, the procedure combines stratum variance estimates to compute the overall variance estimate. For a
multistage sample design, the variance estimation method depends only on the first stage of the sample design. So, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling. This variance estimation method assumes that the first-stage sampling fraction is small, or the first-stage sample is drawn with replacement, as it often is in practice.

In addition to this required sample design information, you also need to specify the sampling weights for a valid analysis, if the weights are not equal. Quite often in complex surveys, respondents have unequal weights, which reflect unequal selection probabilities and adjustments for nonresponse. For more information on the analysis of sample survey data, refer to Lohr (1999), Särndal, Swensson, and Wretman (1992), Lee, Forthofer, and Lorimor (1989), Cochran (1977), Kish (1965), and Hansen, Hurwitz, and Madow (1953).
Definitions and Notation

For a stratified clustered sample design, define the following:

$$
\begin{aligned}
h &= 1, 2, \ldots, H &&\text{is the stratum number, with a total of } H \text{ strata} \\
i &= 1, 2, \ldots, n_h &&\text{is the cluster number within stratum } h, \text{ with a total of } n_h \text{ sample clusters from stratum } h \\
j &= 1, 2, \ldots, m_{hi} &&\text{is the unit number within cluster } i \text{ of stratum } h, \text{ with a total of } m_{hi} \text{ sample units from cluster } i \text{ of stratum } h \\
n &= \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi} &&\text{is the total number of observations in the sample} \\
f_h &&&\text{is the first-stage sampling rate for stratum } h \\
W_{hij} &&&\text{is the sampling weight of unit } j \text{ in cluster } i \text{ of stratum } h
\end{aligned}
$$

The sampling rate $f_h$ is the fraction of first-stage units (PSUs) selected for the sample. You can specify the stratum sampling rates with the RATE= option. Or if you specify population totals with the TOTAL= option, PROC SURVEYFREQ computes $f_h$ as the ratio of the stratum sample size to the stratum total, in terms of PSUs. See the section "Population Totals and Sampling Rates" on page 4204 for details. If you do not specify the RATE= option or the TOTAL= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible and does not use a finite population correction when computing variances.

This notation is also applicable to other sample designs. For example, for a design without stratification, you can let $H = 1$; for a sample design without clustering, you can let $m_{hi} = 1$ for every $h$ and $i$, replacing clusters with individual sampling units.
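As a small numeric illustration (the counts are made up), suppose the TOTAL= option gives a stratum a population of 1,000 PSUs and the sample contains 50 PSUs from that stratum. Then

$$ f_h = \frac{50}{1000} = 0.05, \qquad 1 - f_h = 0.95 $$

so each of that stratum's variance contributions is multiplied by the finite population correction factor 0.95.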
For a two-way table representing the crosstabulation of two variables, define the following, where there are R levels of the row variable and C levels of the column variable:

$$
\begin{aligned}
r &= 1, 2, \ldots, R &&\text{is the row number, with a total of } R \text{ rows} \\
c &= 1, 2, \ldots, C &&\text{is the column number, with a total of } C \text{ columns} \\
N_{rc} &&&\text{is the population total in row } r \text{ and column } c \\
N_{r\cdot} &= \sum_{c=1}^{C} N_{rc} &&\text{is the total in row } r \\
N_{\cdot c} &= \sum_{r=1}^{R} N_{rc} &&\text{is the total in column } c \\
N &= \sum_{r=1}^{R} \sum_{c=1}^{C} N_{rc} &&\text{is the overall total} \\
P_{rc} &= N_{rc} / N &&\text{is the population proportion in row } r \text{ and column } c \\
P_{r\cdot} &= N_{r\cdot} / N &&\text{is the proportion in row } r \\
P_{\cdot c} &= N_{\cdot c} / N &&\text{is the proportion in column } c \\
P^{r}_{rc} &= N_{rc} / N_{r\cdot} &&\text{is the row proportion for cell } (r, c) \\
P^{c}_{rc} &= N_{rc} / N_{\cdot c} &&\text{is the column proportion for cell } (r, c)
\end{aligned}
$$
For a specified observation (identified by stratum, cluster, and unit number within the cluster), define the following to indicate whether or not that observation belongs to cell $(r, c)$, row $r$ and column $c$, of the two-way table, for $r = 1, 2, \ldots, R$ and $c = 1, 2, \ldots, C$:

$$
\delta_{hij}(r, c) =
\begin{cases}
1 & \text{if observation } (hij) \text{ is in cell } (r, c) \\
0 & \text{otherwise}
\end{cases}
$$

Similarly, define the following functions to indicate the observation's row classification and the observation's column classification:

$$
\delta_{hij}(r) =
\begin{cases}
1 & \text{if observation } (hij) \text{ is in row } r \\
0 & \text{otherwise}
\end{cases}
\qquad
\delta_{hij}(c) =
\begin{cases}
1 & \text{if observation } (hij) \text{ is in column } c \\
0 & \text{otherwise}
\end{cases}
$$
Totals

PROC SURVEYFREQ estimates population frequency totals for the specified crosstabulation tables, including totals for two-way table cells, rows, columns, and overall totals. The procedure computes the estimate of the total frequency in table cell $(r, c)$ as the weighted frequency sum

$$ \widehat{N}_{rc} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c) \, W_{hij} $$

Similarly, PROC SURVEYFREQ computes estimates of the row totals, column totals, and overall totals as

$$ \widehat{N}_{r\cdot} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r) \, W_{hij} $$

$$ \widehat{N}_{\cdot c} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(c) \, W_{hij} $$

$$ \widehat{N} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} W_{hij} $$
The estimators of totals are linear sample statistics, and so their variances can be estimated directly, without the Taylor series approximation that is used for proportions. PROC SURVEYFREQ estimates the variance of the total frequency in table cell $(r, c)$ as

$$ \widehat{\mathrm{Var}}(\widehat{N}_{rc}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\widehat{N}_{rc}) $$

where if $n_h > 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{N}_{rc}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( n_{rc}^{hi} - \bar{n}_{rc}^{h} \right)^2 $$

where

$$ n_{rc}^{hi} = \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c) \, W_{hij} \qquad\qquad \bar{n}_{rc}^{h} = \sum_{i=1}^{n_h} n_{rc}^{hi} \, / \, n_h $$

and if $n_h = 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{N}_{rc}) =
\begin{cases}
\text{missing} & \text{if } n_{h'} = 1 \text{ for all } h' = 1, 2, \ldots, H \\
0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H
\end{cases}
$$
The standard deviation of the total is computed as

$$ \mathrm{Std}(\widehat{N}_{rc}) = \sqrt{ \widehat{\mathrm{Var}}(\widehat{N}_{rc}) } $$
The variances and standard deviations are computed in a similar manner for row totals, column totals, and overall table totals.
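To make the weighted-total formula concrete, here is a hedged DATA step sketch (not part of the procedure itself) that sums $\delta_{hij}(r,c)\,W_{hij}$ for one cell by hand; the data set, variable names, and cell values are hypothetical placeholders:

   data _null_;
      set MySurvey end=done;
      retain cellTotal 0;
      if A = 'r1' and B = 'c1' then      /* delta(r,c) = 1 for this cell */
         cellTotal + SamplingWeight;     /* accumulate the weight        */
      if done then put 'Estimated cell total: ' cellTotal;
   run;

PROC SURVEYFREQ reports the same quantity as the Weighted Frequency for that table cell.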
Covariance of Totals

PROC SURVEYFREQ estimates the covariance between the total frequency estimates for table cells $(r, c)$ and $(a, b)$ as

$$ \widehat{\mathrm{Cov}}(\widehat{N}_{rc}, \widehat{N}_{ab}) = \sum_{h=1}^{H} \left( \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} ( n_{rc}^{hi} - \bar{n}_{rc}^{h} ) ( n_{ab}^{hi} - \bar{n}_{ab}^{h} ) \right) $$

The estimated covariance matrix of the table cell totals $\widehat{\mathbf{N}}$ is $\widehat{\mathbf{V}}(\widehat{\mathbf{N}})$, an $RC \times RC$ matrix that contains the pairwise table cell covariances $\widehat{\mathrm{Cov}}(\widehat{N}_{rc}, \widehat{N}_{ab})$ for $r = 1, \ldots, R$; $c = 1, \ldots, C$; $a = 1, \ldots, R$; and $b = 1, \ldots, C$.
Proportions

PROC SURVEYFREQ computes the estimate of the proportion in table cell $(r, c)$ as the ratio of the estimated total for the table cell to the estimated overall total,

$$ \widehat{P}_{rc} = \widehat{N}_{rc} \, / \, \widehat{N} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c) \, W_{hij} \Bigg/ \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} W_{hij} $$

Using the Taylor series expansion method, PROC SURVEYFREQ estimates the variance of this proportion estimate as

$$ \widehat{\mathrm{Var}}(\widehat{P}_{rc}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\widehat{P}_{rc}) $$

where if $n_h > 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{P}_{rc}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( e_{rc}^{hi} - \bar{e}_{rc}^{h} \right)^2 $$

where

$$ e_{rc}^{hi} = \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r, c) - \widehat{P}_{rc} \right) W_{hij} \, / \, \widehat{N} \qquad\qquad \bar{e}_{rc}^{h} = \sum_{i=1}^{n_h} e_{rc}^{hi} \, / \, n_h $$

and if $n_h = 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{P}_{rc}) =
\begin{cases}
\text{missing} & \text{if } n_{h'} = 1 \text{ for all } h' = 1, 2, \ldots, H \\
0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H
\end{cases}
$$

The standard error of the proportion is computed as

$$ \mathrm{StdErr}(\widehat{P}_{rc}) = \sqrt{ \widehat{\mathrm{Var}}(\widehat{P}_{rc}) } $$
Similarly, the estimate of the proportion in row $r$ is

$$ \widehat{P}_{r\cdot} = \widehat{N}_{r\cdot} \, / \, \widehat{N} $$

and its variance estimate is

$$ \widehat{\mathrm{Var}}(\widehat{P}_{r\cdot}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\widehat{P}_{r\cdot}) $$

where if $n_h > 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{P}_{r\cdot}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( e_{r\cdot}^{hi} - \bar{e}_{r\cdot}^{h} \right)^2 $$

where

$$ e_{r\cdot}^{hi} = \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r) - \widehat{P}_{r\cdot} \right) W_{hij} \, / \, \widehat{N} \qquad\qquad \bar{e}_{r\cdot}^{h} = \sum_{i=1}^{n_h} e_{r\cdot}^{hi} \, / \, n_h $$

and if $n_h = 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{P}_{r\cdot}) =
\begin{cases}
\text{missing} & \text{if } n_{h'} = 1 \text{ for all } h' = 1, 2, \ldots, H \\
0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H
\end{cases}
$$

The standard error of the proportion in row $r$ is computed as

$$ \mathrm{StdErr}(\widehat{P}_{r\cdot}) = \sqrt{ \widehat{\mathrm{Var}}(\widehat{P}_{r\cdot}) } $$
Computations for the proportion in column c are done in the same way.
Row and Column Proportions

PROC SURVEYFREQ computes the estimate of the row proportion for table cell $(r, c)$ as the ratio of the estimated total for the table cell to the estimated total for row $r$,

$$ \widehat{P}^{r}_{rc} = \widehat{N}_{rc} \, / \, \widehat{N}_{r\cdot} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c) \, W_{hij} \Bigg/ \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r) \, W_{hij} $$

Again using the Taylor series expansion method, PROC SURVEYFREQ estimates the variance of this row proportion estimate as

$$ \widehat{\mathrm{Var}}(\widehat{P}^{r}_{rc}) = \sum_{h=1}^{H} \widehat{\mathrm{Var}}_h(\widehat{P}^{r}_{rc}) $$

where if $n_h > 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{P}^{r}_{rc}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( g_{rc}^{hi} - \bar{g}_{rc}^{h} \right)^2 $$

where

$$ g_{rc}^{hi} = \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r, c) - \widehat{P}^{r}_{rc} \, \delta_{hij}(r) \right) W_{hij} \, / \, \widehat{N}_{r\cdot} \qquad\qquad \bar{g}_{rc}^{h} = \sum_{i=1}^{n_h} g_{rc}^{hi} \, / \, n_h $$

and if $n_h = 1$,

$$ \widehat{\mathrm{Var}}_h(\widehat{P}^{r}_{rc}) =
\begin{cases}
\text{missing} & \text{if } n_{h'} = 1 \text{ for all } h' = 1, 2, \ldots, H \\
0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H
\end{cases}
$$

The standard error of the row proportion is computed as

$$ \mathrm{StdErr}(\widehat{P}^{r}_{rc}) = \sqrt{ \widehat{\mathrm{Var}}(\widehat{P}^{r}_{rc}) } $$

Similarly, PROC SURVEYFREQ estimates the column proportion for table cell $(r, c)$ as the ratio of the estimated total for the table cell to the estimated total for column $c$,

$$ \widehat{P}^{c}_{rc} = \widehat{N}_{rc} \, / \, \widehat{N}_{\cdot c} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(r, c) \, W_{hij} \Bigg/ \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \delta_{hij}(c) \, W_{hij} $$

The variance estimate for the column proportion is computed as described above for the row proportion, but with

$$ g_{rc}^{hi} = \sum_{j=1}^{m_{hi}} \left( \delta_{hij}(r, c) - \widehat{P}^{c}_{rc} \, \delta_{hij}(c) \right) W_{hij} \, / \, \widehat{N}_{\cdot c} $$
Confidence Limits

If you specify the CL option in the TABLES statement, PROC SURVEYFREQ computes confidence limits for the proportions in the frequency and crosstabulation tables. The confidence coefficient is determined according to the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. For the proportion in table cell $(r, c)$, the confidence limits are computed as

$$ \widehat{P}_{rc} \pm t_{df,\,\alpha/2} \cdot \mathrm{StdErr}(\widehat{P}_{rc}) $$

where $\widehat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, $\mathrm{StdErr}(\widehat{P}_{rc})$ is the standard error of the estimate, and $t_{df,\,\alpha/2}$ is the $100(1-\alpha/2)$ percentile of the t distribution with df degrees of freedom calculated as described in the "Degrees of Freedom" section on page 4214. The confidence limits for row proportions and column proportions are computed similarly to the confidence limits for table cell proportions.

If you specify the CLWT option in the TABLES statement, PROC SURVEYFREQ computes confidence limits for the weighted frequencies, or totals, in the crosstabulation tables. For the total in table cell $(r, c)$, the confidence limits are computed as

$$ \widehat{N}_{rc} \pm t_{df,\,\alpha/2} \cdot \mathrm{StdErr}(\widehat{N}_{rc}) $$

where $\widehat{N}_{rc}$ is the estimate of the total frequency in table cell $(r, c)$, $\mathrm{StdErr}(\widehat{N}_{rc})$ is the standard error of the estimate, and $t_{df,\,\alpha/2}$ is the $100(1-\alpha/2)$ percentile of the t distribution with df degrees of freedom calculated as described in the "Degrees of Freedom" section on page 4214. The confidence limits for row totals, column totals, and the overall total are computed similarly to the confidence limits for table cell totals.

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS summary table that contains the number of (nonmissing) observations, strata, and clusters that are included in the analysis of the requested table. When you request confidence limits, this table also contains the degrees of freedom df and the value of $t_{df,\,\alpha/2}$ used to compute the confidence limits. See Example 68.3 on page 4236 for more information on the table summary.
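A brief sketch requesting confidence limits, using the SIS_Survey design from this chapter's examples; the ALPHA= value shown is an arbitrary illustration:

   proc surveyfreq data=SIS_Survey;
      tables Department * Response / cl clwt alpha=0.1;  /* 90% limits for
                                       proportions (CL) and totals (CLWT) */
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;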
Degrees of Freedom

To compute confidence limits for proportions and totals, PROC SURVEYFREQ uses the $100(1-\alpha/2)$ percentile from the t distribution with df degrees of freedom. PROC SURVEYFREQ calculates the degrees of freedom for t as the number of clusters minus the number of strata. If there are no clusters, then df equals the number of observations minus the number of strata. If the design is not stratified, then df equals the number of clusters minus one. If missing values or missing weights are present in the data, the number of strata, the number of observations, and the number of clusters are counted based on the observations in nonempty strata. See the section "Missing Values" on page 4205 for details.

For the Wald F statistics, PROC SURVEYFREQ also calculates the denominator degrees of freedom as the number of clusters minus the number of strata. Alternatively, you can specify the denominator degrees of freedom for these tests with the DDF= option in the TABLES statement. See the section "Wald Chi-Square Test" on page 4221 and the section "Wald Log-Linear Chi-Square Test" on page 4223 for details.

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS summary table that contains the number of (nonmissing) observations, strata, and clusters that are included in the analysis of the requested table. When you request confidence limits or chi-square tests, this table also contains the degrees of freedom df. See Example 68.3 on page 4236 for more information on the table summary.
Coefficient of Variation

If you specify the CV option in the TABLES statement, PROC SURVEYFREQ computes the coefficients of variation for the proportion estimates in the frequency and crosstabulation tables. The coefficient of variation is the ratio of the standard error to the estimate. For the proportion in table cell $(r, c)$, the coefficient of variation is computed as

$$ \mathrm{CV}(\widehat{P}_{rc}) = \mathrm{StdErr}(\widehat{P}_{rc}) \, / \, \widehat{P}_{rc} $$

where $\widehat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, and $\mathrm{StdErr}(\widehat{P}_{rc})$ is the standard error of the estimate. The coefficients of variation for row proportions and column proportions are computed similarly.

If you specify the CVWT option in the TABLES statement, PROC SURVEYFREQ computes the coefficients of variation for the weighted frequencies, or estimated totals, in the crosstabulation tables. For the total in table cell $(r, c)$, the coefficient of variation is computed as

$$ \mathrm{CV}(\widehat{N}_{rc}) = \mathrm{StdErr}(\widehat{N}_{rc}) \, / \, \widehat{N}_{rc} $$

where $\widehat{N}_{rc}$ is the estimate of the total in table cell $(r, c)$, and $\mathrm{StdErr}(\widehat{N}_{rc})$ is the standard error of the estimate. The coefficients of variation for row totals, column totals, and the overall total are computed similarly.
Design Effect

If you specify the DEFF option in the TABLES statement, PROC SURVEYFREQ computes design effects for the overall proportion estimates in the frequency and crosstabulation tables. The design effect for an estimate is the ratio of the actual variance (estimated based on the sample design) to the variance of a simple random sample with the same number of observations. Refer to Lohr (1999) and Kish (1965). The design effect for the proportion in table cell $(r, c)$ is computed as

$$ \mathrm{DEFF}(\widehat{P}_{rc}) = \widehat{\mathrm{Var}}(\widehat{P}_{rc}) \, / \, \widehat{\mathrm{Var}}_{SRS}(\widehat{P}_{rc}) = \widehat{\mathrm{Var}}(\widehat{P}_{rc}) \Big/ \left\{ (1 - f) \, \widehat{P}_{rc} \, (1 - \widehat{P}_{rc}) \, / \, (n - 1) \right\} $$

where $\widehat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, $\widehat{\mathrm{Var}}(\widehat{P}_{rc})$ is the variance of the estimate, $f$ is the overall sampling fraction, and $n$ is the number of observations in the sample.

PROC SURVEYFREQ determines the value of $f$, the overall sampling fraction, based on the RATE= and TOTAL= options. If you do not specify either of these options, then PROC SURVEYFREQ assumes the value of $f$ is negligible and does not use a finite population correction in the analysis, as described in the section "Population Totals and Sampling Rates" on page 4204. If you specify RATE=value, then PROC SURVEYFREQ uses this value for the overall sampling fraction $f$. If you specify TOTAL=value, then PROC SURVEYFREQ computes $f$ as the ratio of the number of PSUs in the sample to the specified total.

If you specify stratum sampling rates with the RATE=SAS-data-set option, then PROC SURVEYFREQ computes stratum totals based on these stratum sampling rates and the number of sample PSUs in each stratum. The procedure sums the stratum totals to form the overall total, and computes $f$ as the ratio of the number of sample PSUs to the overall total. Alternatively, if you specify stratum totals with the TOTAL=SAS-data-set option, then PROC SURVEYFREQ sums these totals to compute the overall total. The overall sampling fraction $f$ is then computed as the ratio of the number of sample PSUs to the overall total.
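A short sketch requesting design effects; the RATE= value here is an arbitrary placeholder for the overall sampling fraction:

   proc surveyfreq data=SIS_Survey rate=0.02;  /* hypothetical f = 0.02 */
      tables Department * Response / deff;
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;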
Expected Weighted Frequency

If you specify the EXPECTED option in the TABLES statement, PROC SURVEYFREQ displays expected weighted frequencies for the table cells in two-way tables. The expected weighted frequencies are computed under the null hypothesis that the row and column variables are independent, as

$$ E_{rc} = \widehat{N}_{r\cdot} \, \widehat{N}_{\cdot c} \, / \, \widehat{N} $$

where $\widehat{N}_{r\cdot}$ is the estimated total for row $r$, $\widehat{N}_{\cdot c}$ is the estimated total for column $c$, and $\widehat{N}$ is the estimated overall total. Equivalently,

$$ E_{rc} = \widehat{P}_{r\cdot} \, \widehat{P}_{\cdot c} \, \widehat{N} $$
These expected values are used in the design-based chi-square tests of independence, as described in “Rao-Scott Chi-Square Test” and the section “Wald Chi-Square Test” on page 4221.
Rao-Scott Chi-Square Test

The Rao-Scott chi-square test is a design-adjusted version of the Pearson chi-square test, which involves differences between observed and expected frequencies. For two-way tables, the null hypothesis for this test is no association between the row and column variables. For one-way tables, the Rao-Scott chi-square tests the null hypothesis of equal proportions, or you can specify null proportions for one-way tables with the TESTP= option.

Two forms of the design correction are available for the Rao-Scott tests. One form of the design correction uses the proportion estimates, and you request the corresponding Rao-Scott chi-square test with the CHISQ option. The other form of the design correction uses the null hypothesis proportions. You request this test, called the Rao-Scott modified chi-square test, with the CHISQ1 option. Refer to Lohr (1999), Thomas, Singh, and Roberts (1996), and Rao and Scott (1981, 1984, 1987) for details on design-adjusted chi-square tests.

Two-Way Tables
The Rao-Scott chi-square statistic is computed from the Pearson chi-square statistic and a design correction based on the design effects of the proportions. Under the null hypothesis of no association between the row and column variables, this statistic approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom. An F approximation is also given. The Rao-Scott chi-square is computed as

$$ Q_{RS} = Q_P \, / \, D $$

where $D$ is the design correction described in the "Design Correction for Two-Way Tables" section on page 4217, and $Q_P$ is the Pearson chi-square based on the estimated totals,

$$ Q_P = \sum_{r} \sum_{c} ( \widehat{N}_{rc} - E_{rc} )^2 \, / \, E_{rc} $$

where $\widehat{N}_{rc}$ is the estimated total for table cell $(r, c)$, and $E_{rc}$ is the expected total for cell $(r, c)$ under the null hypothesis of no association,

$$ E_{rc} = \widehat{N}_{r\cdot} \, \widehat{N}_{\cdot c} \, / \, \widehat{N} $$

Under the null hypothesis of no association, the Rao-Scott chi-square $Q_{RS}$ approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$ F = Q_{RS} \, / \, (R-1)(C-1) $$
which has an F distribution with $(R-1)(C-1)$ and $(R-1)(C-1)\kappa$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section "Degrees of Freedom" on page 4214.

Design Correction for Two-Way Tables
If you specify the CHISQ option, the design correction is computed from the estimated proportions, as

$$ D = \left\{ \sum_{r} \sum_{c} (1 - \widehat{P}_{rc}) \, \mathrm{DEFF}(\widehat{P}_{rc}) - \sum_{r} (1 - \widehat{P}_{r\cdot}) \, \mathrm{DEFF}(\widehat{P}_{r\cdot}) - \sum_{c} (1 - \widehat{P}_{\cdot c}) \, \mathrm{DEFF}(\widehat{P}_{\cdot c}) \right\} \Big/ (R-1)(C-1) $$

where

$$ \mathrm{DEFF}(\widehat{P}_{rc}) = \widehat{\mathrm{Var}}(\widehat{P}_{rc}) \, / \, \widehat{\mathrm{Var}}_{SRS}(\widehat{P}_{rc}) = \widehat{\mathrm{Var}}(\widehat{P}_{rc}) \Big/ \left\{ (1 - f) \, \widehat{P}_{rc} \, (1 - \widehat{P}_{rc}) \, / \, (n - 1) \right\} $$

as described in the section "Design Effect" on page 4215. $\widehat{P}_{rc}$ is the estimate of the proportion in table cell $(r, c)$, $\widehat{\mathrm{Var}}(\widehat{P}_{rc})$ is the variance of the estimate, $f$ is the overall sampling fraction, and $n$ is the number of observations in the sample. $\mathrm{DEFF}(\widehat{P}_{r\cdot})$, the design effect for the estimate of the proportion in row $r$, and $\mathrm{DEFF}(\widehat{P}_{\cdot c})$, the design effect for the estimate of the proportion in column $c$, are computed similarly.

If you specify the CHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis cell proportions, computed as the product of the corresponding estimated row and column proportions:

$$ D_0 = \left\{ \sum_{r} \sum_{c} (1 - P^{0}_{rc}) \, \mathrm{DEFF}_0(\widehat{P}_{rc}) - \sum_{r} (1 - \widehat{P}_{r\cdot}) \, \mathrm{DEFF}(\widehat{P}_{r\cdot}) - \sum_{c} (1 - \widehat{P}_{\cdot c}) \, \mathrm{DEFF}(\widehat{P}_{\cdot c}) \right\} \Big/ (R-1)(C-1) $$

where

$$ P^{0}_{rc} = \widehat{P}_{r\cdot} \cdot \widehat{P}_{\cdot c} $$

and

$$ \mathrm{DEFF}_0(\widehat{P}_{rc}) = \widehat{\mathrm{Var}}(\widehat{P}_{rc}) \, / \, \widehat{\mathrm{Var}}_{SRS}(P^{0}_{rc}) = \widehat{\mathrm{Var}}(\widehat{P}_{rc}) \Big/ \left\{ (1 - f) \, P^{0}_{rc} \, (1 - P^{0}_{rc}) \, / \, (n - 1) \right\} $$
One-Way Tables
For one-way tables, the Rao-Scott chi-square statistic provides a design-based goodness-of-fit test for equal proportions. Or if you specify null proportions with the TESTP= option, PROC SURVEYFREQ computes the goodness-of-fit test for the specified proportions. Under the null hypothesis, the Rao-Scott chi-square statistic approximately follows a chi-square distribution with $(C-1)$ degrees of freedom for a table with $C$ levels. PROC SURVEYFREQ also computes an F statistic that may provide a better approximation. The Rao-Scott chi-square is computed as

$$ Q_{RS} = Q_P \, / \, D $$

where $D$ is the design correction described in the section "Design Correction for One-Way Tables" on page 4219, and $Q_P$ is the Pearson chi-square based on the estimated totals,

$$ Q_P = \sum_{c} ( \widehat{N}_{c} - E_{c} )^2 \, / \, E_{c} $$

where $E_c$ is the expected total for level $c$ under the null hypothesis. For the null hypothesis of equal proportions,

$$ E_c = \widehat{N} \, / \, C $$

For specified null proportions,

$$ E_c = \widehat{N} \cdot P^{0}_{c} $$

where $P^0_c$ is the null proportion for level $c$. Under the null hypothesis, the one-way Rao-Scott chi-square $Q_{RS}$ approximately follows a chi-square distribution with $(C-1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$ F = Q_{RS} \, / \, (C-1) $$

which has an F distribution with $(C-1)$ and $(C-1)\kappa$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section "Degrees of Freedom" on page 4214.
Design Correction for One-Way Tables
If you specify the CHISQ option, the design correction is computed from the estimated proportions, as

$$ D = \sum_{c} (1 - \widehat{P}_{c}) \, \mathrm{DEFF}(\widehat{P}_{c}) \Big/ (C-1) $$

where

$$ \mathrm{DEFF}(\widehat{P}_{c}) = \widehat{\mathrm{Var}}(\widehat{P}_{c}) \Big/ \left\{ (1 - f) \, \widehat{P}_{c} \, (1 - \widehat{P}_{c}) \, / \, (n - 1) \right\} $$

$\widehat{P}_{c}$ is the proportion estimate for table level $c$, $\widehat{\mathrm{Var}}(\widehat{P}_{c})$ is the variance of the estimate, $f$ is the overall sampling fraction, and $n$ is the number of observations in the sample.

If you specify the CHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis proportions, either equal proportions for all levels or the proportions you specify with the TESTP= option:

$$ D_0 = \sum_{c} (1 - P^{0}_{c}) \, \mathrm{DEFF}_0(\widehat{P}_{c}) \Big/ (C-1) $$

where

$$ \mathrm{DEFF}_0(\widehat{P}_{c}) = \widehat{\mathrm{Var}}(\widehat{P}_{c}) \Big/ \left\{ (1 - f) \, P^{0}_{c} \, (1 - P^{0}_{c}) \, / \, (n - 1) \right\} $$

and $P^0_c = 1/C$ for equal proportions, or $P^0_c$ takes the value specified with the TESTP= option.
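A hedged sketch of the one-way Rao-Scott tests; the TESTP= values are arbitrary illustrations that must match the number of levels of the table variable:

   proc surveyfreq data=SIS_Survey;
      tables Response / chisq chisq1
             testp=(0.1 0.2 0.4 0.2 0.1);  /* hypothetical null proportions
                                              for the five Response levels */
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;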
Rao-Scott Likelihood Ratio Chi-Square Test

The Rao-Scott likelihood ratio chi-square test is a design-adjusted version of the likelihood ratio test, which involves ratios between observed and expected frequencies and tests the null hypothesis of no association between the row and column variables in a two-way table. For one-way tables, the null hypothesis is equal proportions for the table levels, or you can specify other null proportions with the TESTP= option. Refer to Lohr (1999), Thomas, Singh, and Roberts (1996), and Rao and Scott (1981, 1984, 1987).

Two forms of the design correction are available for the Rao-Scott tests. One form of the design correction uses the proportion estimates, and you request the corresponding Rao-Scott likelihood ratio test with the LRCHISQ option. The other form of the design correction uses the null hypothesis proportions. You request this test, called the Rao-Scott modified likelihood ratio test, with the LRCHISQ1 option.
Two-Way Tables

The Rao-Scott likelihood ratio statistic is computed from the likelihood ratio chi-square statistic and a design correction based on the design effects of the proportions. Under the null hypothesis of no association, this statistic approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom. An F approximation is also given. The Rao-Scott likelihood ratio chi-square is computed as

$$ G^2_{RS} = G^2 \, / \, D $$

where $G^2$ is the likelihood ratio chi-square based on the estimated totals, and $D$ is the design correction,

$$ G^2 = 2 \sum_{r} \sum_{c} \widehat{N}_{rc} \, \ln \left( \widehat{N}_{rc} \, / \, E_{rc} \right) $$

where $\widehat{N}_{rc}$ is the estimated total for table cell $(r, c)$, and $E_{rc}$ is the expected total for cell $(r, c)$ under the null hypothesis of no association,

$$ E_{rc} = \widehat{N}_{r\cdot} \, \widehat{N}_{\cdot c} \, / \, \widehat{N} $$

The Rao-Scott likelihood ratio chi-square uses the same design correction $D$ as the Rao-Scott (Pearson) chi-square uses, which is described in the section "Design Correction for Two-Way Tables" on page 4217. If you specify the LRCHISQ option, the design correction is computed from the estimated proportions. If you specify the LRCHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis cell proportions, computed as the product of the corresponding estimated row and column proportions.

Under the null hypothesis of no association, the Rao-Scott likelihood ratio chi-square $G^2_{RS}$ approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$ F = G^2_{RS} \, / \, (R-1)(C-1) $$

which has an F distribution with $(R-1)(C-1)$ and $(R-1)(C-1)\kappa$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section "Degrees of Freedom" on page 4214.

One-Way Tables
For one-way tables, the Rao-Scott likelihood ratio chi-square statistic provides a design-based goodness-of-fit test for equal proportions. Or if you specify null proportions with the TESTP= option, PROC SURVEYFREQ computes the goodness-of-fit test for the specified proportions. Under the null hypothesis, the Rao-Scott likelihood ratio statistic approximately follows a chi-square distribution with $(C-1)$ degrees of freedom for a table with $C$ levels. An F approximation is also given. The Rao-Scott likelihood ratio chi-square is computed as

$$ G^2_{RS} = G^2 \, / \, D $$

where $G^2$ is the likelihood ratio chi-square based on the estimated totals, and $D$ is the design correction,

$$ G^2 = 2 \sum_{c} \widehat{N}_{c} \, \ln \left( \widehat{N}_{c} \, / \, E_{c} \right) $$

where $E_c$ is the expected total for level $c$ under the null hypothesis. For the null hypothesis of equal proportions,

$$ E_c = \widehat{N} \, / \, C $$

For specified null proportions $P^0_c$,

$$ E_c = \widehat{N} \cdot P^{0}_{c} $$

The Rao-Scott likelihood ratio chi-square uses the same design correction $D$ as the Rao-Scott (Pearson) chi-square uses, which is described in the section "Design Correction for One-Way Tables" on page 4219. If you specify the LRCHISQ option, the design correction is computed from the estimated proportions. If you specify the LRCHISQ1 option for the Rao-Scott modified test, the design correction uses the null hypothesis cell proportions.

Under the null hypothesis, the Rao-Scott likelihood ratio chi-square $G^2_{RS}$ approximately follows a chi-square distribution with $(C-1)$ degrees of freedom. A better approximation may be obtained by the F statistic

$$ F = G^2_{RS} \, / \, (C-1) $$

which has an F distribution with $(C-1)$ and $(C-1)\kappa$ degrees of freedom under the null hypothesis, where $\kappa$ equals the number of clusters minus the number of strata, as described in the section "Degrees of Freedom" on page 4214.
Wald Chi-Square Test

PROC SURVEYFREQ provides two Wald chi-square tests for independence of the row and column variables in two-way tables: a Wald chi-square test based on the difference between observed and expected weighted cell frequencies, and a Wald log-linear chi-square test based on the log odds ratio. These statistics test for independence of the row and column variables in two-way tables, taking into account the complex survey design. Refer to Bedrick (1983), Koch, Freeman, and Freeman (1975), and Wald (1943) for information on Wald statistics and their applications to categorical data analysis.

For these two tests, PROC SURVEYFREQ computes the generalized Wald chi-square statistic, the corresponding Wald F statistic, and also an adjusted Wald F statistic for tables larger than 2 × 2. Under the null hypothesis of independence, the Wald chi-square statistic approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom for very large samples. However, it has been shown that this test may perform poorly in terms of actual significance level and power, especially for tables with a large number of cells or for samples with a relatively small number of clusters. Refer to Thomas and Rao (1984, 1985) and Lohr (1999) for more
information. Refer to Fellegi (1980) and Hidiroglou, Fuller, and Hickman (1980) for information on the adjusted Wald F statistic. Thomas and Rao (1984) found that the adjusted Wald F statistic provides a more stable test than the chi-square statistic, although its power may be low when the number of sample clusters is not large. Refer also to Korn and Graubard (1990) and Thomas, Singh, and Roberts (1996).

If you specify the WCHISQ option in the TABLES statement, PROC SURVEYFREQ computes a Wald test for independence in the two-way table based on the differences between the observed (weighted) cell frequencies and the expected frequencies. Under the null hypothesis of independence of the row and column variables, the expected cell frequencies are computed as

$$ E_{rc} = \widehat{N}_{r\cdot} \, \widehat{N}_{\cdot c} \, / \, \widehat{N} $$

where $\widehat{N}_{r\cdot}$ is the estimated total for row $r$, $\widehat{N}_{\cdot c}$ is the estimated total for column $c$, and $\widehat{N}$ is the estimated overall total, as described in the section "Expected Weighted Frequency" on page 4215. And the null hypothesis that the population weighted frequencies equal the expected frequencies is

$$ H_0\colon\; Y_{rc} = N_{rc} - E_{rc} = 0 \quad \text{for all } r = 1, \ldots, (R-1) \text{ and } c = 1, \ldots, (C-1) $$

This null hypothesis can be stated equivalently in terms of cell proportions, with the expected cell proportions computed as the products of the marginal row and column proportions. The generalized Wald chi-square statistic is computed as

$$ Q_{\mathrm{Wald}} = \widehat{\mathbf{Y}}' \left( \mathbf{H} \, \widehat{\mathbf{V}}(\widehat{\mathbf{N}}) \, \mathbf{H}' \right)^{-1} \widehat{\mathbf{Y}} $$

where $\widehat{\mathbf{Y}}$ is the $(R-1)(C-1)$ array of the differences between the observed and expected weighted frequencies $(\widehat{N}_{rc} - E_{rc})$, and $\mathbf{H} \, \widehat{\mathbf{V}}(\widehat{\mathbf{N}}) \, \mathbf{H}'$ estimates the variance of $\widehat{\mathbf{Y}}$.

$\widehat{\mathbf{V}}(\widehat{\mathbf{N}})$ is the covariance matrix of the estimates $\widehat{N}_{rc}$, and its computation is described in the section "Covariance of Totals" on page 4210.

$\mathbf{H}$ is an $(R-1)(C-1)$ by $RC$ matrix containing the partial derivatives of the elements of $\widehat{\mathbf{Y}}$ with respect to the elements of $\widehat{\mathbf{N}}$. The elements of $\mathbf{H}$ are computed as follows, where $a$ denotes a row different from row $r$, and $b$ denotes a column different from column $c$:

$$
\begin{aligned}
\partial \widehat{Y}_{rc} \, / \, \partial \widehat{N}_{rc} &= 1 - \left( \widehat{N}_{r\cdot} + \widehat{N}_{\cdot c} - \widehat{N}_{r\cdot} \widehat{N}_{\cdot c} / \widehat{N} \right) / \, \widehat{N} \\
\partial \widehat{Y}_{rc} \, / \, \partial \widehat{N}_{ac} &= - \left( \widehat{N}_{r\cdot} - \widehat{N}_{r\cdot} \widehat{N}_{\cdot c} / \widehat{N} \right) / \, \widehat{N} \\
\partial \widehat{Y}_{rc} \, / \, \partial \widehat{N}_{rb} &= - \left( \widehat{N}_{\cdot c} - \widehat{N}_{r\cdot} \widehat{N}_{\cdot c} / \widehat{N} \right) / \, \widehat{N} \\
\partial \widehat{Y}_{rc} \, / \, \partial \widehat{N}_{ab} &= \widehat{N}_{r\cdot} \widehat{N}_{\cdot c} \, / \, \widehat{N}^{2}
\end{aligned}
$$
Under the null hypothesis of independence, the statistic $Q_{\mathrm{Wald}}$ approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom for very large samples. PROC SURVEYFREQ computes the Wald F statistic as

$$ F_{\mathrm{Wald}} = Q_{\mathrm{Wald}} \, / \, (R-1)(C-1) $$

Under the null hypothesis of independence, $F_{\mathrm{Wald}}$ approximately follows an F distribution with $(R-1)(C-1)$ numerator degrees of freedom. By default, PROC SURVEYFREQ computes the denominator degrees of freedom as the number of clusters minus the number of strata, as described in the section "Degrees of Freedom" on page 4214. Alternatively, you can specify the denominator degrees of freedom with the DDF= option in the TABLES statement.

For tables larger than 2 × 2, PROC SURVEYFREQ also computes the adjusted Wald F statistic

$$ F_{\mathrm{Adj\,Wald}} = \frac{s - k + 1}{k s} \, Q_{\mathrm{Wald}} $$

where $k = (R-1)(C-1)$, and $s$ is the number of clusters minus the number of strata. Alternatively, you can specify the value of $s$ with the DDF= option in the TABLES statement. Note that for 2 × 2 tables, $k = (R-1)(C-1) = 1$, so the adjusted Wald F statistic equals the (unadjusted) Wald F statistic, with the same numerator and denominator degrees of freedom. Under the null hypothesis, $F_{\mathrm{Adj\,Wald}}$ approximately follows an F distribution with $k$ numerator degrees of freedom and $(s - k + 1)$ denominator degrees of freedom.
Wald Log-Linear Chi-Square Test

If you specify the WLLCHISQ option in the TABLES statement, PROC SURVEYFREQ computes a Wald test for independence based on the log odds ratios. See the section "Wald Chi-Square Test" on page 4221 for more information on Wald tests. For a two-way table of $R$ rows and $C$ columns, the Wald log-linear test is based on the $(R-1)(C-1)$ array of

$$ \widehat{Y}_{rc} = \log \widehat{N}_{rc} - \log \widehat{N}_{rC} - \log \widehat{N}_{Rc} + \log \widehat{N}_{RC} $$

where $\widehat{N}_{rc}$ is the estimated total for table cell $(r, c)$. The null hypothesis of independence between the row and column variables is

$$ H_0\colon\; Y_{rc} = 0 \quad \text{for all } r = 1, \ldots, (R-1) \text{ and } c = 1, \ldots, (C-1) $$

This null hypothesis can be stated equivalently in terms of cell proportions.
The generalized Wald log-linear chi-square statistic is computed as

$$ Q_{\mathrm{Wald\,LL}} = \widehat{\mathbf{Y}}' \, \widehat{\mathbf{V}}(\widehat{\mathbf{Y}})^{-1} \, \widehat{\mathbf{Y}} $$

where $\widehat{\mathbf{Y}}$ is the $(R-1)(C-1)$ array of the $\widehat{Y}_{rc}$, and $\widehat{\mathbf{V}}(\widehat{\mathbf{Y}})$ estimates the variance of $\widehat{\mathbf{Y}}$,

$$ \widehat{\mathbf{V}}(\widehat{\mathbf{Y}}) = \mathbf{A} \, \mathbf{D}^{-1} \, \widehat{\mathbf{V}}(\widehat{\mathbf{N}}) \, \mathbf{D}^{-1} \, \mathbf{A}' $$

where $\widehat{\mathbf{V}}(\widehat{\mathbf{N}})$ is the covariance matrix of the estimates $\widehat{N}_{rc}$, as described in the section "Covariance of Totals" on page 4210, $\mathbf{D}$ is the $RC \times RC$ diagonal matrix with the estimated totals $\widehat{N}_{rc}$ on the diagonal, and $\mathbf{A}$ is the $(R-1)(C-1)$ by $RC$ linear contrast matrix.

Under the null hypothesis of independence, the statistic $Q_{\mathrm{Wald\,LL}}$ approximately follows a chi-square distribution with $(R-1)(C-1)$ degrees of freedom for very large samples. PROC SURVEYFREQ computes the Wald log-linear F statistic as

$$ F_{\mathrm{Wald\,LL}} = Q_{\mathrm{Wald\,LL}} \, / \, (R-1)(C-1) $$

Under the null hypothesis of independence, $F_{\mathrm{Wald\,LL}}$ approximately follows an F distribution with $(R-1)(C-1)$ numerator degrees of freedom. By default, PROC SURVEYFREQ computes the denominator degrees of freedom as the number of clusters minus the number of strata, as described in the section "Degrees of Freedom" on page 4214. Alternatively, you can specify the denominator degrees of freedom with the DDF= option in the TABLES statement.

For tables larger than 2 × 2, PROC SURVEYFREQ also computes the adjusted Wald log-linear F statistic

$$ F_{\mathrm{Adj\,Wald\,LL}} = \frac{s - k + 1}{k s} \, Q_{\mathrm{Wald\,LL}} $$

where $k = (R-1)(C-1)$, and $s$ is the number of clusters minus the number of strata. Alternatively, you can specify the value of $s$ with the DDF= option in the TABLES statement. Note that for 2 × 2 tables, $k = (R-1)(C-1) = 1$, so the adjusted Wald F statistic equals the (unadjusted) Wald F statistic, with the same numerator and denominator degrees of freedom. Under the null hypothesis, $F_{\mathrm{Adj\,Wald\,LL}}$ approximately follows an F distribution with $k$ numerator degrees of freedom and $(s - k + 1)$ denominator degrees of freedom.
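A brief sketch requesting both Wald tests; the DDF= value is an arbitrary placeholder showing how to override the default denominator degrees of freedom:

   proc surveyfreq data=SIS_Survey;
      tables Department * Response / wchisq wllchisq ddf=150;  /* ddf=150
                                                 is hypothetical          */
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;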
Displayed Output

Data and Sample Design Summary Table

The "Data Summary" table provides information on the input data set and the sample design. PROC SURVEYFREQ displays this table unless you specify the NOSUMMARY option in the PROC SURVEYFREQ statement.

The "Data Summary" table displays the total number of valid observations. To be considered valid, an observation must have a nonmissing, positive WEIGHT value if you specify a WEIGHT statement. If you do not specify the MISSING option, a valid observation must also have nonmissing values for all STRATA and CLUSTER variables. The number of valid observations may differ from the number of nonmissing observations for an individual analysis variable, which the procedure displays in the frequency or crosstabulation tables. See the section "Missing Values" on page 4205 for more information.

PROC SURVEYFREQ displays the following information in the "Data Summary" table:

• Number of Strata, if you specify a STRATA statement
• Number of Clusters, if you specify a CLUSTER statement
• Number of Observations, which is the total number of valid observations
• Sum of Weights, which is the sum over all valid observations, if you specify a WEIGHT statement
Stratum Information Table

If you specify the LIST option in the STRATA statement, PROC SURVEYFREQ displays a "Stratum Information" table. This table provides the following information for each stratum:

• Stratum Index, which is a sequential stratum identification number
• STRATA variable(s), which lists the levels of STRATA variables for the stratum
• Number of Observations, which is the number of valid observations in the stratum
• Population Total for the stratum, if you specify the TOTAL= option
• Sampling Rate for the stratum, if you specify the TOTAL= option or the RATE= option. If you specify the TOTAL= option, the sampling rate is based on the number of valid observations in the stratum.
• Number of Clusters, which is the number of clusters in the stratum, if you specify a CLUSTER statement
One-Way Frequency Tables

PROC SURVEYFREQ displays one-way frequency tables for all one-way table requests in the TABLES statements, unless you specify the NOPRINT option in the TABLES statement. A one-way table shows the sample frequency distribution of a single variable, and provides estimates for its population distribution in terms of totals and proportions. For each level of the variable, PROC SURVEYFREQ displays the following information in the one-way table:

• Frequency count, giving the number of sample observations for the level
• Weighted Frequency total, estimating the total population frequency for the level
• Standard Deviation of Weighted Frequency
• Percent, estimating the population proportion for the level
• Standard Error of Percent

The one-way table displays weighted frequencies if your analysis includes a WEIGHT statement, or if you specify the WTFREQ option in the TABLES statement. The one-way table also displays the Frequency Missing, or the number of observations with missing values.

You can suppress the frequency counts by specifying the NOFREQ option in the TABLES statement. Also, the NOWT option suppresses the weighted frequencies and their standard deviations. The NOPERCENT option suppresses the percentages and their standard errors. The NOSTD option suppresses the standard errors of the percentages and the standard deviations of the weighted frequencies. The NOTOTAL option suppresses the total row of the one-way table.

PROC SURVEYFREQ optionally displays the following information for a one-way table:

• Variance of Weighted Frequency, if you specify the VARWT option
• Confidence Limits for Weighted Frequency, if you specify the CLWT option
• Coefficient of Variation for Weighted Frequency, if you specify the CVWT option
• Test Percent, if you specify the TESTP= option
• Variance of Percent, if you specify the VAR option
• Confidence Limits for Percent, if you specify the CL option
• Coefficient of Variation for Percent, if you specify the CV option
• Design Effect for Percent, if you specify the DEFF option
Crosstabulation Tables

PROC SURVEYFREQ displays all multiway table requests in the TABLES statements, unless you specify the NOPRINT option in the TABLES statement. For two-way to multiway crosstabulation tables, the values of the last variable in the table request form the table columns. The values of the next-to-last variable form the rows. Each level (or combination of levels) of the other variables forms one layer. PROC SURVEYFREQ produces a separate two-way crosstabulation table for each layer.

For each layer, the crosstabulation table displays the row and column variable names and values (or levels). Each two-way table lists levels of the column variable within each level of the row variable. By default, the procedure displays all levels of the column variable within each level of the row variables, including any column variable levels with zero frequency for that row. For multiway tables, the procedure displays all levels of the row variable for each layer of the table by default, including any row levels with zero frequency for that layer. You can suppress the display of zero frequency levels by specifying the NOSPARSE option.

For each combination of variable levels, or table cell, the two-way table displays the following information:

• Frequency, giving the number of observations that have the indicated values of the two variables
• Weighted Frequency total, estimating the total population frequency for the table cell
• Standard Deviation of Weighted Frequency
• Percent, estimating the population proportion for the table cell
• Standard Error of Percent

The two-way table displays weighted frequencies if your analysis includes a WEIGHT statement, or if you specify the WTFREQ option in the TABLES statement. The two-way table also displays the Frequency Missing, or the number of observations with missing values.

You can suppress the frequency counts by specifying the NOFREQ option in the TABLES statement. Also, the NOWT option suppresses the weighted frequencies and their standard deviations. The NOPERCENT option suppresses the percentages and their standard errors. The NOSTD option suppresses the standard errors of the percentages and the standard deviations of the weighted frequencies. The NOTOTAL option suppresses the row totals and column totals.
PROC SURVEYFREQ optionally displays the following information for a two-way table:

• Expected Weighted Frequency, if you specify the EXPECTED option
• Variance of Weighted Frequency, if you specify the VARWT option
• Confidence Limits for Weighted Frequency, if you specify the CLWT option
• Coefficient of Variation for Weighted Frequency, if you specify the CVWT option
• Variance of Percent, if you specify the VAR option
• Confidence Limits for Percent, if you specify the CL option
• Coefficient of Variation for Percent, if you specify the CV option
• Design Effect for Percent, if you specify the DEFF option
• Row Percent, estimating the cell's proportion of the population total for that cell's row, if you specify the ROW option
• Standard Error of Row Percent, if you specify the ROW option
• Variance of Row Percent, if you specify the VAR option and the ROW option
• Confidence Limits for Row Percent, if you specify the CL option and the ROW option
• Coefficient of Variation for Row Percent, if you specify the CV option and the ROW option
• Column Percent, estimating the cell's proportion of the population total for that cell's column, if you specify the COL option
• Standard Error of Column Percent, if you specify the COL option
• Variance of Column Percent, if you specify the VAR option and the COL option
• Confidence Limits for Column Percent, if you specify the CL option and the COL option
• Coefficient of Variation for Column Percent, if you specify the CV option and the COL option

If you specify the ROW option, the NOPERCENT option suppresses the row percentages and their standard errors. The NOSTD option suppresses the standard errors of the row percentages. Similarly, if you specify the COL option, the NOPERCENT option suppresses the column percentages and their standard errors. The NOSTD option suppresses the standard errors of the column percentages.
Statistical Tests

If you specify the CHISQ option for the Rao-Scott chi-square test, the CHISQ1 option for the modified test, the LRCHISQ option for the Rao-Scott likelihood ratio chi-square test, or the LRCHISQ1 option for the modified test, PROC SURVEYFREQ displays the following information:

• Pearson Chi-Square, if you specify the CHISQ or CHISQ1 option
• Likelihood Ratio Chi-Square, if you specify the LRCHISQ or LRCHISQ1 option
• Design Correction
• Rao-Scott Chi-Square, if you specify the CHISQ or CHISQ1 option
• Rao-Scott Likelihood Ratio Chi-Square, if you specify the LRCHISQ or LRCHISQ1 option
• DF, the degrees of freedom for the chi-square test
• Pr > ChiSq, the p-value for the chi-square test
• F Value
• Num DF, the numerator degrees of freedom for F
• Den DF, the denominator degrees of freedom for F
• Pr > F, the p-value for the F test

If you specify the WCHISQ option for the Wald chi-square test or the WLLCHISQ option for the Wald log-linear chi-square test, PROC SURVEYFREQ displays the following information:

• Wald Chi-Square, if you specify the WCHISQ option
• Wald Log-Linear Chi-Square, if you specify the WLLCHISQ option
• F Value
• Num DF, the numerator degrees of freedom for F
• Den DF, the denominator degrees of freedom for F
• Pr > F, the p-value for the F test
• Adjusted F Value, for tables larger than 2 × 2
• Num DF, the numerator degrees of freedom for Adjusted F
• Den DF, the denominator degrees of freedom for Adjusted F
• Pr > Adj F, the p-value for the Adjusted F test
ODS Table Names

PROC SURVEYFREQ assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 68.3. ODS Tables Produced in PROC SURVEYFREQ

ODS Table Name   Description                       Statement        Option
ChiSq            Chi-square test                   TABLES           CHISQ
ChiSq1           Modified chi-square test          TABLES           CHISQ1
CrossTabs        Crosstabulation table             TABLES           (n-way table request, n > 1)
LRChiSq          Likelihood ratio test             TABLES           LRCHISQ
LRChiSq1         Modified likelihood ratio test    TABLES           LRCHISQ1
OneWay           One-way frequency table           PROC or TABLES   (with no TABLES stmt) or (one-way table request)
StrataInfo       Stratum information               STRATA           LIST
Summary          Data summary                      PROC             default
TableSummary     Table summary (not displayed)     TABLES           default
WChiSq           Wald chi-square test              TABLES           WCHISQ
WLLChiSq         Wald log-linear chi-square test   TABLES           WLLCHISQ
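As a hedged illustration of capturing these tables in data sets, the following sketch uses the ODS OUTPUT statement with two of the table names above; the output data set names are arbitrary:

   ods output CrossTabs=TableOut ChiSq=TestOut;  /* hypothetical data sets */
   proc surveyfreq data=SIS_Survey;
      tables Department * Response / chisq;
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;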
Examples

Example 68.1. Two-Way Tables

This example uses the SIS_Survey data set from the section "Getting Started" on page 4185. The data set contains results from a customer satisfaction survey for a student information system (SIS). The following PROC SURVEYFREQ statements request a two-way table for the variables Department by Response and customize the crosstabulation table display.

   proc surveyfreq data=SIS_Survey;
      tables Department * Response / cv deff nowt nostd nototal;
      strata State NewUser / list;
      cluster School;
      weight SamplingWeight;
   run;
The TABLES statement requests a two-way table of Department by Response. The CV option requests coefficients of variation for the percentage estimates. The DEFF option requests design effects for the percentage estimates. The NOWT option suppresses display of the weighted frequencies, and the NOSTD option suppresses display of standard errors for the estimates. The NOTOTAL option suppresses the row totals, column totals, and overall totals.
The WCHISQ option requests a Wald chi-square test of association between the variables Department and Response. The STRATA, CLUSTER, and WEIGHT statements provide sample design information to the procedure, so that the analysis will be done according to the sample design used for the survey. The STRATA statement names the variables State and NewUser, which identify the first-stage strata. The LIST option in the STRATA statement requests a Stratum Information table. The CLUSTER statement identifies School as the cluster or first-stage sampling unit. The WEIGHT statement names the sampling weight variable.

Output 68.1.1 displays the Data Summary and Stratum Information tables produced by PROC SURVEYFREQ. The Stratum Information table lists the six strata in the survey and shows the number of clusters, or schools, and the number of observations in each stratum.

Output 68.1.1. Data Summary and Stratum Information

                School Information System Survey
                    The SURVEYFREQ Procedure

                         Data Summary

            Number of Strata                     6
            Number of Clusters                 370
            Number of Observations            1850
            Sum of Weights              38899.6482
                      Stratum Information

   Stratum                              Number of   Number of
   Index    State   NewUser                   Obs    Clusters
   --------------------------------------------------------------
      1     GA      Renewal Customer          315          63
      2     GA      New Customer              355          71
      3     NC      Renewal Customer          280          56
      4     NC      New Customer              420          84
      5     SC      Renewal Customer          210          42
      6     SC      New Customer              270          54
   --------------------------------------------------------------
Output 68.1.2 displays the two-way table of Department by Response. According to the TABLES statement options specified, this two-way table includes coefficients of variation and design effects for the percentage estimates, and it does not show the weighted frequencies or the standard errors of the estimates.

Output 68.1.2. Two-Way Table of Department by Response

                   School Information System Survey

                      The SURVEYFREQ Procedure

                   Table of Department by Response

                                                            CV for     Design
Department        Response            Frequency   Percent   Percent    Effect
-------------------------------------------------------------------------------
Faculty           Very Unsatisfied          209   13.4987    0.0865    2.1586
                  Unsatisfied               203   13.0710    0.0868    2.0962
                  Neutral                   346   22.4127    0.0629    2.1157
                  Satisfied                 254   16.2006    0.0806    2.3232
                  Very Satisfied             98    6.2467    0.1362    2.2842
-------------------------------------------------------------------------------
Admin/Guidance    Very Unsatisfied           95    3.6690    0.1277    1.1477
                  Unsatisfied               123    4.6854    0.1060    1.0211
                  Neutral                   235    9.1838    0.0700    0.9166
                  Satisfied                 201    7.7305    0.0756    0.8848
                  Very Satisfied             86    3.3016    0.1252    0.9892
-------------------------------------------------------------------------------
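As a quick check on how these columns relate, recall that a coefficient of variation is the standard error of an estimate divided by the estimate itself. For the Faculty "Very Unsatisfied" cell, for example, 0.0865 × 13.4987 ≈ 1.17, which matches the standard error of 1.1675 reported for that cell in Output 68.1.3.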
The following PROC SURVEYFREQ statements request a two-way table of Department by Response that includes row percentages, and also a Wald chi-square test of association between the two table variables.

   proc surveyfreq data=SIS_Survey nosummary;
      tables Department * Response / row nowt wchisq;
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;

The WCHISQ option requests the Wald chi-square test of association between the variables Department and Response.
Output 68.1.3 displays the two-way table. The row percentages show the distribution of Response for Department = ’Faculty’ and for Department = ’Admin/Guidance’.
Output 68.1.3. Table of Department by Response with Row Percentages

                   School Information System Survey

                      The SURVEYFREQ Procedure

                   Table of Department by Response

                                                         Std Err of        Row    Std Err of
Department        Response            Frequency  Percent    Percent    Percent   Row Percent
----------------------------------------------------------------------------------------------
Faculty           Very Unsatisfied          209  13.4987     1.1675    18.8979        1.6326
                  Unsatisfied               203  13.0710     1.1350    18.2992        1.5897
                  Neutral                   346  22.4127     1.4106    31.3773        1.9705
                  Satisfied                 254  16.2006     1.3061    22.6805        1.8287
                  Very Satisfied             98   6.2467     0.8506     8.7452        1.1918
                  Total                    1110  71.4297     0.1468    100.000
----------------------------------------------------------------------------------------------
Admin/Guidance    Very Unsatisfied           95   3.6690     0.4684    12.8419        1.6374
                  Unsatisfied               123   4.6854     0.4966    16.3995        1.7446
                  Neutral                   235   9.1838     0.6430    32.1447        2.2300
                  Satisfied                 201   7.7305     0.5842    27.0579        2.0406
                  Very Satisfied             86   3.3016     0.4133    11.5560        1.4466
                  Total                     740  28.5703     0.1468    100.000
----------------------------------------------------------------------------------------------
Total             Very Unsatisfied          304  17.1676     1.2872
                  Unsatisfied               326  17.7564     1.2712
                  Neutral                   581  31.5965     1.5795
                  Satisfied                 455  23.9311     1.4761
                  Very Satisfied            184   9.5483     0.9523
                  Total                    1850  100.000
----------------------------------------------------------------------------------------------
Output 68.1.4 displays the Wald chi-square test for association between Department and Response. The Wald chi-square is 11.44, and the corresponding adjusted F value is 2.84 with a p-value of .0243. This indicates a significant association between department (faculty or admin/guidance) and satisfaction with the student information system.

Output 68.1.4. Wald Chi-Square Test

          Table of Department by Response

              Wald Chi-Square Test

              Chi-Square      11.4454

              F Value          2.8613
              Num DF                4
              Den DF              364
              Pr > F           0.0234

              Adj F Value      2.8378
              Num DF                4
              Den DF              361
              Pr > Adj F       0.0243

              Sample Size = 1850
Example 68.2. Multiway Tables

Continuing to use the SIS_Survey data set from the section "Getting Started" on page 4185, this example shows how to produce multiway tables. The following PROC SURVEYFREQ statements request a table of Department by SchoolType by Response for the student information system survey.

   proc surveyfreq data=SIS_Survey;
      tables Department * SchoolType * Response
             SchoolType * Response;
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;
The TABLES statement requests a multiway table with SchoolType as the row variable, Response as the column variable, and Department as the layer variable. This request produces a separate two-way table of SchoolType by Response for each level of the variable Department. The TABLES statement also requests a two-way table of SchoolType by Response, which totals the multiway table over both levels of Department. As in the previous examples, the STRATA, CLUSTER, and WEIGHT statements provide sample design information, so that the analysis will be done according to the design used for this survey. Output 68.2.1 displays the multiway table produced by PROC SURVEYFREQ, which includes a table of SchoolType by Response for Department = ’Faculty’ and for Department = ’Admin/Guidance’.
Output 68.2.1. Multiway Table of Department by SchoolType by Response

                      School Information System Survey

                         The SURVEYFREQ Procedure

                      Table of SchoolType by Response
                     Controlling for Department=Faculty

                                              Weighted   Std Dev of              Std Err of
SchoolType      Response           Frequency  Frequency    Wgt Freq    Percent      Percent
--------------------------------------------------------------------------------------------
Middle School   Very Unsatisfied          74       1846   301.22637     6.6443       1.0838
                Unsatisfied               78       1929   283.11476     6.9428       1.0201
                Neutral                  130       3289   407.80855    11.8369       1.4652
                Satisfied                113       2795   368.85087    10.0597       1.3288
                Very Satisfied            55       1378   261.63311     4.9578       0.9411
                Total                    450      11237   714.97120    40.4415       2.5713
--------------------------------------------------------------------------------------------
High School     Very Unsatisfied         135       3405   389.42313    12.2536       1.3987
                Unsatisfied              125       3155   384.56734    11.3563       1.3809
                Neutral                  216       5429   489.37826    19.5404       1.7564
                Satisfied                141       3507   417.54773    12.6208       1.5040
                Very Satisfied            43       1052   221.59367     3.7874       0.7984
                Total                    660      16549   719.61536    59.5585       2.5713
--------------------------------------------------------------------------------------------
Total           Very Unsatisfied         209       5251   454.82598    18.8979       1.6326
                Unsatisfied              203       5085   442.39032    18.2992       1.5897
                Neutral                  346       8718   550.81735    31.3773       1.9705
                Satisfied                254       6302   507.01711    22.6805       1.8287
                Very Satisfied            98       2430   330.97602     8.7452       1.1918
                Total                   1110      27786   119.25529    100.000
--------------------------------------------------------------------------------------------
                      Table of SchoolType by Response
                 Controlling for Department=Admin/Guidance

                                              Weighted   Std Dev of              Std Err of
SchoolType      Response           Frequency  Frequency    Wgt Freq    Percent      Percent
--------------------------------------------------------------------------------------------
Middle School   Very Unsatisfied          42  649.43427   133.06194     5.8435       1.1947
                Unsatisfied               31  460.35557   100.80158     4.1422       0.9076
                Neutral                  104       1568   186.99946    14.1042       1.6804
                Satisfied                 84       1269   165.71127    11.4142       1.4896
                Very Satisfied            39  574.93878   110.37243     5.1732       0.9942
                Total                    300       4521   287.86832    40.6774       2.5801
--------------------------------------------------------------------------------------------
High School     Very Unsatisfied          53  777.77725   136.41869     6.9983       1.2285
                Unsatisfied               92       1362   175.40862    12.2573       1.5806
                Neutral                  131       2005   212.34804    18.0404       1.8990
                Satisfied                117       1739   190.07798    15.6437       1.7118
                Very Satisfied            47  709.37033   126.54394     6.3828       1.1371
                Total                    440       6593   288.92483    59.3226       2.5801
--------------------------------------------------------------------------------------------
Total           Very Unsatisfied          95       1427   182.28132    12.8419       1.6374
                Unsatisfied              123       1823   193.43045    16.3995       1.7446
                Neutral                  235       3572   250.22739    32.1447       2.2300
                Satisfied                201       3007   226.82311    27.0579       2.0406
                Very Satisfied            86       1284   160.83434    11.5560       1.4466
                Total                    740      11114    60.78850    100.000
--------------------------------------------------------------------------------------------
Example 68.3. Output Data Sets

PROC SURVEYFREQ uses the Output Delivery System (ODS) to create output data sets. This is a departure from older SAS procedures that provide OUTPUT statements for similar functionality. Using ODS, you can create a SAS data set from any piece of PROC SURVEYFREQ output. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

When you select tables for ODS output data sets, you reference the tables by their ODS table names. Each table created by PROC SURVEYFREQ is assigned a name, and the section "ODS Table Names" on page 4230 lists the table names. To save the one-way table of Response from Figure 68.3 in an output data set, use an ODS OUTPUT statement as follows:

   proc surveyfreq data=SIS_Survey;
      tables Response / cl nowt;
      ods output OneWay=ResponseTable;
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;
Output 68.3.1 displays the output data set ResponseTable, which contains the one-way table of Response. This data set has six observations, and each of these observations corresponds to a row of the one-way table. The first five observations correspond to the five levels of Response, as they are ordered in the display, and the last observation corresponds to the overall total, which is the last row of the table. The data set ResponseTable includes a variable corresponding to each column of the one-way table. For example, the variable Percent contains the percentage estimates, and the variables LowerCL and UpperCL contain the lower and upper confidence limits for the percentage estimates.

Output 68.3.1. ResponseTable Output Data Set

Obs   Table            Response            Frequency    Percent    StdErr    LowerCL    UpperCL

  1   Table Response   Very Unsatisfied          304    17.1676    1.2872    14.6364    19.6989
  2   Table Response   Unsatisfied               326    17.7564    1.2712    15.2566    20.2562
  3   Table Response   Neutral                   581    31.5965    1.5795    28.4904    34.7026
  4   Table Response   Satisfied                 455    23.9311    1.4761    21.0285    26.8338
  5   Table Response   Very Satisfied            184     9.5483    0.9523     7.6756    11.4210
  6   Table Response   .                        1850    100.000         .          .          .
PROC SURVEYFREQ also creates a table summary that is not displayed. Some of the information in this table is similar to that contained in the "Data Summary" table, but the table summary describes the data used to analyze the specified table, while the data summary describes the entire input data set. Due to missing values, for example, the number of observations (or strata or clusters) used to analyze a particular table may differ from the number of observations (or strata or clusters) reported for the input data set in the "Data Summary" table. See the section "Missing Values" on page 4205 for more details. If you request confidence limits, the "Table Summary" table also contains the degrees of freedom and the t-value used to compute the confidence limits. The following statements store the nondisplayed "Table Summary" table in the output data set ResponseSummary.

   proc surveyfreq data=SIS_Survey;
      tables Response / cl nowt;
      ods output TableSummary=ResponseSummary;
      strata State NewUser;
      cluster School;
      weight SamplingWeight;
   run;
Output 68.3.2 displays the output data set ResponseSummary.

Output 68.3.2. ResponseSummary Output Data Set

                          Number of    Number of    Number of    Degrees of             t
Obs   Table            Observations       Strata     Clusters       Freedom    Percentile

  1   Table Response           1850            6          370           364      1.966503
References

Agresti, A. (1990), Categorical Data Analysis, New York: John Wiley & Sons, Inc.

Agresti, A. (1996), An Introduction to Categorical Data Analysis, New York: John Wiley & Sons, Inc.

Bedrick, E.J. (1983), "Adjusted Chi-Squared Tests for Cross-Classified Tables of Survey Data," Biometrika, 70, 591–596.

Brick, J.M. and Kalton, G. (1996), "Handling Missing Data in Survey Research," Statistical Methods in Medical Research, 5, 215–238.

Cochran, W.G. (1977), Sampling Techniques, Third Edition, New York: John Wiley & Sons, Inc.

Fellegi, I.P. (1980), "Approximate Tests of Independence and Goodness of Fit Based on Stratified Multistage Samples," Journal of the American Statistical Association, 75, 261–268.

Fienberg, S.E. (1980), The Analysis of Cross-Classified Data, Second Edition, Cambridge, MA: MIT Press.

Fleiss, J.L. (1981), Statistical Methods for Rates and Proportions, Second Edition, New York: John Wiley & Sons, Inc.

Fuller, W.A. (1975), "Regression Analysis for Sample Survey," Sankhyā, 37, Series C, Pt. 3, 117–132.
Fuller, W.A., Kennedy, W., Schnell, D., Sullivan, G., and Park, H.J. (1989), PC CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Hansen, M.H., Hurwitz, W.N., and Madow, W.G. (1953), Sample Survey Methods and Theory, Volumes I and II, New York: John Wiley & Sons, Inc.

Hidiroglou, M.A., Fuller, W.A., and Hickman, R.D. (1980), SUPER CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Kalton, G. (1983), Introduction to Survey Sampling, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-035, Beverly Hills, CA, and London: Sage Publications, Inc.

Kalton, G. and Kasprzyk, D. (1986), "The Treatment of Missing Survey Data," Survey Methodology, 12, 1–16.

Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons, Inc.

Koch, G.G., Freeman, D.H., and Freeman, J.L. (1975), "Strategies in the Multivariate Analysis of Data from Complex Surveys," International Statistical Review, 43, 59–78.

Koch, G.G., Landis, J.R., Freeman, D.H., Freeman, J.L., and Lehnen, R.G. (1977), "A General Methodology for the Analysis of Experiments with Repeated Measurement of Categorical Data," Biometrics, 33, 133–158.

Korn, E.L. and Graubard, B.I. (1990), "Simultaneous Testing with Complex Survey Data: Use of Bonferroni t-Statistics," The American Statistician, 44, 270–276.

Lee, E.S., Forthofer, R.N., and Lorimor, R.J. (1989), Analyzing Complex Survey Data, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-071, Beverly Hills, CA, and London: Sage Publications, Inc.

Levy, P. and Lemeshow, S. (1999), Sampling of Populations, Methods and Applications, Third Edition, New York: John Wiley & Sons, Inc.

Lohr, S.L. (1999), Sampling: Design and Analysis, Pacific Grove, CA: Duxbury Press.

Nathan, G. (1975), "Tests for Independence in Contingency Tables from Stratified Samples," Sankhyā, 37, Series C, 77–87.

Rao, J.N.K. and Scott, A.J. (1979), "Chi-Squared Tests for Analysis of Categorical Data from Complex Surveys," Proceedings of the Survey Research Methods Section, ASA, 58–66.

Rao, J.N.K. and Scott, A.J. (1981), "The Analysis of Categorical Data from Complex Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables," Journal of the American Statistical Association, 76, 221–230.

Rao, J.N.K. and Scott, A.J. (1984), "On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data," The Annals of Statistics, 12, 46–60.

Rao, J.N.K. and Scott, A.J. (1987), "On Simple Adjustments to Chi-Square Tests with Survey Data," The Annals of Statistics, 15, 385–397.
Särndal, C.E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag Inc.

Satterthwaite, F.E. (1946), "An Approximate Distribution of Estimates of Variance Components," Biometrics, 2, 110–114.

Shah, B.V., Barnwell, B.G., and Bieler, G.S. (1996), SUDAAN User's Manual: Release 7.0, Research Triangle Park, NC: Research Triangle Institute.

Thomas, D.R. and Rao, J.N.K. (1984), "A Monte Carlo Study of Exact Levels of Goodness-of-Fit Statistics Under Cluster Sampling," Proceedings of the Survey Research Methods Section, ASA, 207–211.

Thomas, D.R. and Rao, J.N.K. (1985), "On the Power of Some Goodness-of-Fit Tests Under Cluster Sampling," Proceedings of the Survey Research Methods Section, ASA, 291–296.

Thomas, D.R., Singh, A.C., and Roberts, G.R. (1996), "Tests of Independence on Two-Way Tables Under Cluster Sampling: An Evaluation," International Statistical Review, 64, 295–311.

Wald, A. (1943), "Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large," Transactions of the American Mathematical Society, 54, 426–482.

Woodruff, R.S. (1971), "A Simple Method for Approximating the Variance of a Complicated Estimate," Journal of the American Statistical Association, 66, 411–414.
Chapter 69
The SURVEYLOGISTIC Procedure

Chapter Contents

OVERVIEW .............................................................. 4243

GETTING STARTED ....................................................... 4245

SYNTAX ................................................................ 4250
   PROC SURVEYLOGISTIC Statement ...................................... 4250
   BY Statement ....................................................... 4252
   CLASS Statement .................................................... 4253
   CLUSTER Statement .................................................. 4255
   CONTRAST Statement ................................................. 4255
   FREQ Statement ..................................................... 4258
   MODEL Statement .................................................... 4258
      Response Variable Options ....................................... 4259
      Model Options ................................................... 4261
   STRATA Statement ................................................... 4266
   TEST Statement ..................................................... 4266
   UNITS Statement .................................................... 4267
   WEIGHT Statement ................................................... 4268

DETAILS ............................................................... 4268
   Missing Values ..................................................... 4268
   Model Specification ................................................ 4269
      Response Level Ordering ......................................... 4269
      CLASS Variable Parameterization ................................. 4270
      Link Functions and the Corresponding Distributions .............. 4273
   Model Fitting ...................................................... 4275
      Determining Observations for Likelihood Contributions ........... 4275
      Iterative Algorithms for Model-Fitting .......................... 4275
      Convergence Criteria ............................................ 4277
      Existence of Maximum Likelihood Estimates ....................... 4277
      Model Fitting Statistics ........................................ 4279
      Generalized Coefficient of Determination ........................ 4280
      INEST= Data Set ................................................. 4280
   Survey Design Information .......................................... 4280
      Specification of Population Totals and Sampling Rates ........... 4280
      Primary Sampling Units (PSUs) ................................... 4281
   Variance Estimation for Sample Survey Data ......................... 4282
      Notation ........................................................ 4282
      Likelihood Function ............................................. 4283
      Generalized Logistic Model ...................................... 4284
      Cumulative Logit Model .......................................... 4284
      Complementary log-log Model ..................................... 4284
      Probit Model .................................................... 4285
      Estimated Variances ............................................. 4285
      Adjustments to the Variance Estimation .......................... 4286
   Hypothesis Testing and Estimation .................................. 4287
      Score Statistics and Tests ...................................... 4287
      Testing the Parallel Lines Assumption ........................... 4287
      Wald Confidence Intervals for Parameters ........................ 4288
      Testing Linear Hypotheses about the Regression Coefficients ..... 4288
      Odds Ratio Estimation ........................................... 4288
      Rank Correlation of Observed Responses and Predicted
         Probabilities ................................................ 4292
   Output ............................................................. 4292
      Displayed Output ................................................ 4292
      ODS Table Names ................................................. 4295

EXAMPLES .............................................................. 4296
   Example 69.1. Logistic Regression with Different Link Functions
      for Stratified Cluster Sampling ................................. 4296
   Example 69.2. The Household Component of the Medical Expenditure
      Panel Survey (MEPS) ............................................. 4302

REFERENCES ............................................................ 4310
Chapter 69
The SURVEYLOGISTIC Procedure

Overview

Categorical responses arise extensively in survey research. Common examples of responses include

• binary: e.g., attended graduate school or not
• ordinal: e.g., mild, moderate, and severe pain
• nominal: e.g., ABC, NBC, CBS, FOX TV network viewed at a certain hour

Logistic regression analysis is often used to investigate the relationship between such discrete responses and a set of explanatory variables. See Binder (1981, 1983), Roberts, Rao, and Kumar (1987), Skinner, Holt, and Smith (1989), Morel (1989), and Lehtonen and Pahkinen (1995) for papers that describe logistic regression for sample survey data.

For binary response models, the response of a sampling unit can take a specified value or not (for example, attended graduate school or not). Suppose x is a row vector of explanatory variables and π is the response probability to be modeled. The linear logistic model has the form

   logit(π) ≡ log( π / (1 − π) ) = α + xβ

where α is the intercept parameter and β is the vector of slope parameters.

The logistic model shares a common feature with the more general class of generalized linear models, namely, that a function g = g(µ) of the expected value, µ, of the response variable is assumed to be linearly related to the explanatory variables. Since µ implicitly depends on the stochastic behavior of the response, and since the explanatory variables are assumed to be fixed, the function g provides the link between the random (stochastic) component and the systematic (deterministic) component of the response variable. For this reason, Nelder and Wedderburn (1972) refer to g(·) as a link function. One advantage of the logit function over other link functions is that differences on the logistic scale are interpretable regardless of whether the data are sampled prospectively or retrospectively (McCullagh and Nelder 1989, Chapter 4).

Other link functions that are widely used in practice are the probit function and the complementary log-log function. The SURVEYLOGISTIC procedure enables you to choose one of these link functions, resulting in fitting a broad class of binary response models of the form

   g(π) = α + xβ
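As an illustration (a minimal sketch; the data set and variable names here are hypothetical), the LINK= option in the MODEL statement selects among these link functions:

   proc surveylogistic data=MySurvey;   /* hypothetical data set */
      strata Region;                    /* hypothetical stratification variable */
      weight SamplingWeight;
      /* LINK=PROBIT fits a probit model; LINK=LOGIT (the default)
         and LINK=CLOGLOG are the other binary-response choices */
      model Attended(event='1') = Income / link=probit;
   run;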
For ordinal response models, the response Y of an individual or an experimental unit may be restricted to one of a usually small number of ordinal values, denoted for convenience by 1, . . . , D, D + 1 (D ≥ 1). For example, pain severity can be classified into three response categories as 1=mild, 2=moderate, and 3=severe. The SURVEYLOGISTIC procedure fits a common slopes cumulative model, which is a parallel lines regression model based on the cumulative probabilities of the response categories rather than on their individual probabilities. The cumulative model has the form

   g(Pr(Y ≤ d | x)) = αd + xβ,   1 ≤ d ≤ D

where α1, . . . , αD are D intercept parameters and β is the vector of slope parameters. This model has been considered by many researchers. Aitchison and Silvey (1957) and Ashford (1959) employ a probit scale and provide a maximum likelihood analysis; Walker and Duncan (1967) and Cox and Snell (1989) discuss the use of the log-odds scale. For the log-odds scale, the cumulative logit model is often referred to as the proportional odds model.

For nominal response logistic models, where the D + 1 possible responses have no natural ordering, the logit model can also be extended to a generalized logit model, which has the form

   log( Pr(Y = i | x) / Pr(Y = D + 1 | x) ) = αi + β′i x,   i = 1, . . . , D

where α1, . . . , αD are D intercept parameters, and β1, . . . , βD are D vectors of parameters. These models were introduced by McFadden (1974) as the discrete choice model, and they are also known as multinomial models.

The SURVEYLOGISTIC procedure fits linear logistic regression models for discrete response survey data by the method of maximum likelihood. For statistical inferences, PROC SURVEYLOGISTIC incorporates complex survey sample designs, including designs with stratification, clustering, and unequal weighting.

The maximum likelihood estimation is carried out with either the Fisher scoring algorithm or the Newton-Raphson algorithm. You can specify starting values for the parameter estimates. The logit link function in the ordinal logistic regression models can be replaced by the probit function or the complementary log-log function.

Odds ratio estimates are displayed along with parameter estimates. You can also specify the change in the explanatory variables for which odds ratio estimates are desired. Variances of the regression parameters and odds ratios are computed using the Taylor expansion approximation; see Binder (1983).

The SURVEYLOGISTIC procedure enables you to specify categorical variables (also known as CLASS variables) as explanatory variables. It also enables you to specify interaction terms in the same way as in the LOGISTIC procedure.

Like many procedures in SAS/STAT software that allow the specification of CLASS variables, the SURVEYLOGISTIC procedure provides a CONTRAST statement
for specifying customized hypothesis tests concerning the model parameters. The CONTRAST statement also provides estimation of individual rows of contrasts, which is particularly useful for obtaining odds ratio estimates for various levels of the CLASS variables.
Getting Started

The SURVEYLOGISTIC procedure is similar to the LOGISTIC procedure and other regression procedures in the SAS System. Refer to Chapter 42, "The LOGISTIC Procedure," for general information about how to perform logistic regression using SAS. PROC SURVEYLOGISTIC is designed to handle sample survey data, and thus it incorporates the sampling design information into the analysis.

The following example illustrates how to use PROC SURVEYLOGISTIC to perform logistic regression for sample survey data. In the customer satisfaction survey example in the "Getting Started" section on page 4422 of Chapter 72, "The SURVEYSELECT Procedure," an Internet service provider conducts a customer satisfaction survey. The survey population consists of the company's current subscribers from four states: Alabama (AL), Florida (FL), Georgia (GA), and South Carolina (SC). The company plans to select a sample of customers from this population, interview the selected customers and ask their opinions on customer service, and then make inferences about the entire population of subscribers from the sample data. A stratified sample is selected using the probability proportional to size (PPS) method. The sample design divides the customers into strata depending on their types ('Old' or 'New') and their states (AL, FL, GA, SC). There are eight strata in all. Within each stratum, customers are selected and interviewed using the PPS with replacement method, where the size variable is Usage. The stratified PPS sample contains 192 customers. The data are stored in the SAS data set SampleStrata. Figure 69.1 displays the first 10 observations of this data set.

                        Customer Satisfaction Survey
              Stratified PPS Sampling (First 10 Observations)

                        Customer                                           Sampling
 Obs   State   Type           ID   Rating                   Usage            Weight

   1   AL      New       2178037   Unsatisfied              23.53           14.7473
   2   AL      New      75375074   Unsatisfied              99.11            3.5012
   3   AL      New     116722913   Satisfied                31.11           11.1546
   4   AL      New     133059995   Neutral                  52.70           19.7542
   5   AL      New     216784622   Satisfied                 8.86           39.1613
   6   AL      New     225046040   Neutral                   8.32           41.6960
   7   AL      New     238463776   Satisfied                 4.63           74.9483
   8   AL      New     255918199   Unsatisfied              10.05           34.5405
   9   AL      New     395767821   Extremely Unsatisfied    33.14           10.4719
  10   AL      New     409095328   Satisfied                10.67           32.5295

Figure 69.1. Stratified PPS Sample (First 10 Observations)
In the SAS data set SampleStrata, the variable CustomerID uniquely identifies each customer. The variable State contains the state of the customer's address. The variable Type equals 'Old' if the customer has subscribed to the service for more than one year; otherwise, the variable Type equals 'New'. The variable Usage contains the customer's average monthly service usage, in hours. The variable Rating contains the customer's responses to the survey. The sample design uses an unequal probability sampling method, with the sampling weights stored in the variable SamplingWeight.

The following SAS statements fit a cumulative logistic model between the satisfaction levels and the Internet usage using the stratified PPS sample.

   title 'Customer Satisfaction Survey';
   proc surveylogistic data=SampleStrata;
      strata state type/list;
      model Rating (order=internal) = Usage;
      weight SamplingWeight;
   run;
The PROC statement invokes the SURVEYLOGISTIC procedure. The STRATA statement specifies the stratification variables State and Type that are used in the sample design. The LIST option requests a summary of the stratification. In the MODEL statement, Rating is the response variable and Usage is the explanatory variable. The ORDER=internal option for the response variable Rating requests that the procedure order the response levels by their internal numeric values (1-5) instead of their formatted character values. The WEIGHT statement specifies the variable SamplingWeight that contains the sampling weights.

The results of this analysis are shown in the following tables.

                    Customer Satisfaction Survey

                   The SURVEYLOGISTIC Procedure

                        Model Information

      Data Set                      WORK.SAMPLESTRATA
      Response Variable             Rating
      Number of Response Levels     5
      Stratum Variables             State
                                    Type
      Number of Strata              8
      Weight Variable               SamplingWeight    Sampling Weight
      Model                         Cumulative Logit
      Optimization Technique        Fisher's Scoring
      Variance Adjustment           Degrees of Freedom (DF)

Figure 69.2. Stratified PPS Sample, Model Information
PROC SURVEYLOGISTIC first lists the following model fitting information and sample design information in Figure 69.2:

• The link function is the logit of the cumulative of the lower response categories.
• The Fisher scoring optimization technique is used to obtain the maximum likelihood estimates for the regression coefficients.
• The response variable is Rating, which has five response levels.
• The stratification variables are State and Type.
• There are eight strata in the sample.
• The weight variable is SamplingWeight.
• The variance adjustment method used for the regression coefficients is the default degrees of freedom adjustment.
                    Customer Satisfaction Survey

      Number of Observations Read                192
      Number of Observations Used                192
      Sum of Weights Read                   13262.74
      Sum of Weights Used                   13262.74
Figure 69.3. Stratified PPS Sample, Number of Observations
Figure 69.3 lists the number of observations in the data set and the number of observations used in the analysis. Since no missing values are present in this example, all observations in the data set are used in the analysis. The sums of weights are also reported in this table.

                    Customer Satisfaction Survey

                         Response Profile

      Ordered                              Total        Total
        Value    Rating                Frequency       Weight

            1    Extremely Unsatisfied        52    2067.1092
            2    Unsatisfied                  47    2148.7127
            3    Neutral                      47    3649.4869
            4    Satisfied                    38    2533.5379
            5    Extremely Satisfied           8    2863.8888
Probabilities modeled are cumulated over the lower Ordered Values.
Figure 69.4. Stratified PPS Sample, Response Profile
The “Response Profile” table in Figure 69.4 lists the five response levels, their ordered values, and their total frequencies and total weights for each category. Due to the ORDER=internal option for the response variable Rating, the category “Extremely Unsatisfied” has the Ordered Value 1, the category “Unsatisfied” has the Ordered Value 2, and so on.
                    Customer Satisfaction Survey

                       Stratum Information

      Stratum
        Index    State    Type    N Obs
      -------------------------------------------
            1    AL       New        22
            2             Old        24
            3    FL       New        25
            4             Old        22
            5    GA       New        25
            6             Old        25
            7    SC       New        24
            8             Old        25
      -------------------------------------------
Figure 69.5. Stratified PPS Sample, Stratification Summary
Figure 69.5 displays the output of the stratification summary. There are a total of eight strata, and each stratum is defined by the customer types within each state. The table also shows the number of customers within each stratum.

                    Customer Satisfaction Survey

          Score Test for the Proportional Odds Assumption

          Chi-Square        DF      Pr > ChiSq

           3692.2558         3          <.0001
Figure 69.6. Stratified PPS Sample, Testing the Proportional Odds Assumption
Figure 69.6 shows the chi-square test for testing the proportional odds assumption. The test is highly significant, which indicates that the cumulative logit model may not adequately fit the data.

                    Customer Satisfaction Survey

                    Model Convergence Status

          Convergence criterion (GCONV=1E-8) satisfied.


                      Model Fit Statistics

                              Intercept     Intercept and
          Criterion                Only        Covariates

          AIC                 42099.954         41378.851
          SC                  42112.984         41395.139
          -2 Log L            42091.954         41368.851
Figure 69.7. Stratified PPS Sample, Model Fitting Information
Figure 69.7 shows that the iterative algorithm converged and the maximum likelihood estimates were obtained for this example. The "Model Fit Statistics" table contains the Akaike Information Criterion (AIC), the Schwarz Criterion (SC), and the negative of twice the log likelihood (-2 Log L) for the intercept-only model and the fitted model. AIC and SC can be used to compare different models, and the ones with smaller values are preferred.

                    Customer Satisfaction Survey

             Testing Global Null Hypothesis: BETA=0

          Test                 Chi-Square    DF    Pr > ChiSq

          Likelihood Ratio       723.1023     1        <.0001
          Score                  465.4939     1        <.0001
          Wald                     4.5212     1        0.0335
Figure 69.8. Stratified PPS Sample, Testing Global Null Hypothesis
The table "Testing Global Null Hypothesis: BETA=0" in Figure 69.8 shows the likelihood ratio test, the efficient score test, and the Wald test for testing the significance of the explanatory variable (Usage). All tests are significant.

                    Customer Satisfaction Survey

               Analysis of Maximum Likelihood Estimates

                                                Standard        Wald
Parameter                  DF    Estimate          Error    Chi-Square    Pr > ChiSq

Intercept  Extremely
           Unsatisfied      1     -2.0168         0.3988       25.5769        <.0001
Intercept  Unsatisfied      1     -1.0527         0.3543        8.8292        0.0030
Intercept  Neutral          1      0.1334         0.4189        0.1015        0.7501
Intercept  Satisfied        1      1.0751         0.5794        3.4432        0.0635
Usage                       1      0.0377         0.0178        4.5212        0.0335
Figure 69.9. Stratified PPS Sample, Parameter Estimates
Figure 69.9 shows the parameter estimates of the logistic regression and their standard errors.

                    Customer Satisfaction Survey

                      Odds Ratio Estimates

                        Point          95% Wald
          Effect     Estimate     Confidence Limits

          Usage         1.038      1.003      1.075
Figure 69.10. Stratified PPS Sample, Odds Ratios
Figure 69.10 displays the odds ratio estimate and its 95% Wald confidence limits.
Syntax

The following statements are available in PROC SURVEYLOGISTIC:

PROC SURVEYLOGISTIC < options >;
   BY variables;
   CLASS variable <(v-options)> <variable <(v-options)>... > < / v-options >;
   CLUSTER variables;
   CONTRAST 'label' row-description <,..., row-description> < / options >;
   FREQ variable;
   MODEL events/trials = < effects > < / options >;
   MODEL variable < (variable-options) > = < effects > < / options >;
   STRATA variables < / options >;
   TEST equation1 <,..., equationk> < / options >;
   UNITS independent1=list1 <...independentk=listk> < / options >;
   WEIGHT variable;
PROC SURVEYLOGISTIC Statement

PROC SURVEYLOGISTIC < options >;

The PROC SURVEYLOGISTIC statement invokes the SURVEYLOGISTIC procedure and optionally identifies input data sets and controls the ordering of the response levels.

ALPHA=α
sets the confidence level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default value is 0.05. A confidence level of α produces 100(1 − α)% confidence limits. The default of ALPHA=0.05 produces 95% confidence limits.

DATA=SAS-data-set
names the SAS data set containing the data to be analyzed. If you omit the DATA= option, the procedure uses the most recently created SAS data set.
INEST= SAS-data-set
names the SAS data set that contains initial estimates for all the parameters in the model. BY-group processing is allowed in setting up the INEST= data set. See the section "INEST= Data Set" on page 4280 for more information.

MISSING
requests that the procedure treat missing values as a valid category for all categorical variables, which include classification variables in the model, strata variables, and cluster variables.

NAMELEN=n
specifies the length of effect names in tables and output data sets to be n characters, where n is a value between 20 and 200. The default length is 20 characters.

NOSORT
suppresses the internal sorting process to shorten the computation time if the data set is presorted by the STRATA and CLUSTER variables. By default, the procedure sorts the data by the STRATA variables if you use the STRATA statement; then the procedure sorts the data by the CLUSTER variables within strata. If your data are already stored in the order of the STRATA and CLUSTER variables, then you can specify this option to omit the sorting process and reduce the use of computing resources, especially when your data set is very large. However, if you specify the NOSORT option when your data are not presorted by the STRATA and CLUSTER variables, then any change in these variables creates a new stratum or cluster.

RATE=value | SAS-data-set
R=value | SAS-data-set
specifies the sampling rate as a nonnegative value, or specifies an input data set that contains the stratum sampling rates. The procedure uses this information to compute a finite population correction (fpc) for variance estimation when the sample design is without replacement. If your sample design has multiple stages, you should specify the first-stage sampling rate, which is the ratio of the number of primary sampling units (PSUs) selected to the total number of PSUs in the population.

For a nonstratified sample design, or for a stratified sample design with the same sampling rate in all strata, you should specify a nonnegative value for the RATE= option. If your design is stratified with different sampling rates in the strata, then you should name a SAS data set that contains the stratification variables and the sampling rates. See the section "Specification of Population Totals and Sampling Rates" on page 4280 for more details.

The value in the RATE= option or the values of _RATE_ in the secondary data set must be nonnegative numbers. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYLOGISTIC will convert that number to a proportion. The procedure treats the value 1 as 100%.

If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
TOTAL=value | SAS-data-set N=value | SAS-data-set
specifies the total number of primary sampling units (PSUs) in the study population as a positive value, or names an input data set that contains the stratum population totals. The procedure uses this information to compute a finite population correction for variance estimation. For a nonstratified sample design, or for a stratified sample design with the same population total in all strata, you should specify a positive value for the TOTAL= option. If your sample design is stratified with different population totals in the strata, then you should name a SAS data set that contains the stratification variables and the population totals. See the section “Specification of Population Totals and Sampling Rates” on page 4280 for more details. If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
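For example (a sketch with hypothetical stratum totals; as described in the "Specification of Population Totals and Sampling Rates" section, the totals data set is assumed here to store the population totals in a variable named _TOTAL_):

   data StratumTotals;          /* hypothetical population totals per stratum */
      input State $ Type $ _TOTAL_;
      datalines;
   AL New 1340
   AL Old 1890
   FL New 2370
   FL Old 2785
   ;

   proc surveylogistic data=SampleStrata total=StratumTotals;
      strata State Type;
      model Rating(order=internal) = Usage;
      weight SamplingWeight;
   run;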
BY Statement

BY variables;

You can specify a BY statement with PROC SURVEYLOGISTIC to obtain separate analyses on observations in groups defined by the BY variables. Note that using a BY statement provides completely separate analyses of the BY groups. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. When a BY statement appears, the procedure expects the input data sets to be sorted in the order of the BY variables. The variables are one or more variables in the input data set.

If you specify more than one BY statement, the procedure uses only the latest BY statement and ignores any previous ones.

If your input data set is not sorted in ascending order, use one of the following alternatives (see the sketch after this list):

• Sort the data using the SORT procedure with a similar BY statement.
• Use the BY statement options NOTSORTED or DESCENDING in the BY statement. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
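For example (a minimal sketch with a hypothetical BY variable Region), you can sort the data first and then run a separate analysis for each region:

   proc sort data=MySurvey;
      by Region;
   run;

   proc surveylogistic data=MySurvey;
      by Region;
      strata Stratum;
      weight SamplingWeight;
      model Response(event='1') = Age Income;
   run;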
CLASS Statement

CLASS variable <(v-options)> <variable <(v-options)>... > < / v-options >;

The CLASS statement names the classification variables to be used in the analysis. You can specify the following v-options either for an individual CLASS variable, enclosed in parentheses after the variable name, or globally for all CLASS variables, after a slash (/).

CPREFIX=n
specifies that, at most, the first n characters of a CLASS variable name be used in creating names for the corresponding dummy variables. The default is 32 − min(32, max(2, f)), where f is the formatted length of the CLASS variable.

DESCENDING
DESC
reverses the sorting order of the classification variable.

LPREFIX=n
specifies that, at most, the first n characters of a CLASS variable label be used in creating labels for the corresponding dummy variables.

ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the levels of classification variables. This ordering determines which parameters in the model correspond to each level in the data, so the ORDER= option may be useful when you use the CONTRAST statement. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. The following table shows how PROC SURVEYLOGISTIC interprets values of the ORDER= option.

Value of ORDER=    Levels Sorted By

DATA               order of appearance in the input data set

FORMATTED          external formatted value, except for numeric variables
                   with no explicit format, which are sorted by their
                   unformatted (internal) value

FREQ               descending frequency count; levels with the most
                   observations come first in the order

INTERNAL           unformatted value
By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent. For more information on sorting order, see the chapter
on the SORT procedure in the SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.

PARAM=keyword
specifies the parameterization method for the classification variable or variables. Design matrix columns are created from CLASS variables according to the following coding schemes. The default is PARAM=EFFECT. If PARAM=ORTHPOLY or PARAM=POLY, and the CLASS levels are numeric, then the ORDER= option in the CLASS statement is ignored, and the internal, unformatted values are used.

EFFECT               specifies effect coding
GLM                  specifies less-than-full-rank, reference cell coding;
                     this option can only be used as a global option
ORDINAL              specifies the cumulative parameterization for an
                     ordinal CLASS variable
POLYNOMIAL | POLY    specifies polynomial coding
REFERENCE | REF      specifies reference cell coding
ORTHEFFECT           orthogonalizes PARAM=EFFECT
ORTHORDINAL |
ORTHOTHERM           orthogonalizes PARAM=ORDINAL
ORTHPOLY             orthogonalizes PARAM=POLYNOMIAL
ORTHREF              orthogonalizes PARAM=REFERENCE
The EFFECT, POLYNOMIAL, REFERENCE, ORDINAL, and their orthogonal parameterizations are full rank. The REF= option in the CLASS statement determines the reference level for the EFFECT, REFERENCE, and their orthogonal parameterizations.

Parameter names for a CLASS predictor variable are constructed by concatenating the CLASS variable name with the CLASS levels. However, for the POLYNOMIAL and orthogonal parameterizations, parameter names are formed by concatenating the CLASS variable name and keywords that reflect the parameterization.

REF='level' | keyword
specifies the reference level for PARAM=EFFECT or PARAM=REFERENCE. For an individual (but not a global) variable REF= option, you can specify the level of the variable to use as the reference level. For a global or individual variable REF= option, you can use one of the following keywords. The default is REF=LAST.

FIRST    designates the first ordered level as reference
LAST     designates the last ordered level as reference
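For example (a sketch with hypothetical variables), the following CLASS statement applies reference cell coding globally and makes 'New' the reference level for Type:

   proc surveylogistic data=MySurvey;
      class Type(ref='New') State / param=ref;
      weight SamplingWeight;
      model Response(event='1') = Type State Usage;
   run;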
CLUSTER Statement

CLUSTER | CLUSTERS variables;

The CLUSTER statement names variables that identify the clusters in a clustered sample design. The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata. If your sample design has clustering at multiple stages, you should identify only the first-stage clusters, or primary sampling units (PSUs), in the CLUSTER statement. See the section "Primary Sampling Units (PSUs)" on page 4281 for more information.

The CLUSTER variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLUSTER variables determine the CLUSTER variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can use multiple CLUSTER statements to specify cluster variables. The procedure uses all variables from all CLUSTER statements to create clusters.
CONTRAST Statement

CONTRAST 'label' row-description <,..., row-description> < / options >;

where a row-description is:

   effect values <,... effect values>

The CONTRAST statement provides a mechanism for obtaining customized hypothesis tests. It is similar to the CONTRAST statement in PROC LOGISTIC and PROC GLM, depending on the coding schemes used with any classification variables involved.

The CONTRAST statement enables you to specify a matrix, L, for testing the hypothesis Lθ = 0, where θ is the parameter vector. You must be familiar with the details of the model parameterization that PROC SURVEYLOGISTIC uses (for more information, see the PARAM= option in the section "CLASS Statement" on page 4253). Optionally, the CONTRAST statement enables you to estimate each row, l′iθ, of Lθ and test the hypothesis l′iθ = 0. Computed statistics are based on the asymptotic chi-square distribution of the Wald statistic.

There is no limit to the number of CONTRAST statements that you can specify, but they must appear after the MODEL statement. The following parameters are specified in the CONTRAST statement:

label
identifies the contrast on the output. A label is required for every contrast specified, and it must be enclosed in quotes.
effect
identifies an effect that appears in the MODEL statement. The name INTERCEPT can be used as an effect when one or more intercepts are included in the model. You do not need to include all effects that are included in the MODEL statement.
values
are constants that are elements of the L matrix associated with the effect. To correctly specify your contrast, it is crucial to know the ordering of parameters within each effect and the variable levels associated with any parameter. The “Class Level Information” table shows the ordering of levels within variables. The E option, described later in this section, enables you to verify the proper correspondence of values to parameters.
The rows of L are specified in order and are separated by commas. Multiple degree-of-freedom hypotheses can be tested by specifying multiple row-descriptions. For any of the full-rank parameterizations, if an effect is not specified in the CONTRAST statement, all of its coefficients in the L matrix are set to 0. If too many values are specified for an effect, the extra ones are ignored. If too few values are specified, the remaining ones are set to 0.

When you use effect coding (by default or by specifying PARAM=EFFECT in the CLASS statement), all parameters are directly estimable (involve no other parameters). For example, suppose an effect-coded CLASS variable A has four levels. Then there are three parameters (α1, α2, α3) representing the first three levels, and the fourth parameter is represented by

   −α1 − α2 − α3

To test the first versus the fourth level of A, you would test

   α1 = −α1 − α2 − α3

or, equivalently,

   2α1 + α2 + α3 = 0

which, in the form Lθ = 0, is

             [ α1 ]
   [ 2 1 1 ] [ α2 ] = 0
             [ α3 ]

Therefore, you would use the following CONTRAST statement:

   contrast '1 vs. 4' A 2 1 1;

To contrast the third level with the average of the first two levels, you would test

   (α1 + α2) / 2 = α3

or, equivalently,

   α1 + α2 − 2α3 = 0

Therefore, you would use the following CONTRAST statement:

   contrast '1&2 vs. 3' A 1 1 -2;

Other CONTRAST statements are constructed similarly. For example,

   contrast '1 vs. 2 '    A  1 -1  0;
   contrast '1&2 vs. 4 '  A  3  3  2;
   contrast '1&2 vs. 3&4' A  2  2  0;
   contrast 'Main Effect' A  1  0  0,
                          A  0  1  0,
                          A  0  0  1;
When you use the less-than-full-rank parameterization (by specifying PARAM=GLM in the CLASS statement), each row is checked for estimability. If PROC SURVEYLOGISTIC finds a contrast to be nonestimable, it displays missing values in corresponding rows in the results. PROC SURVEYLOGISTIC handles missing level combinations of classification variables in the same manner as PROC LOGISTIC. Parameters corresponding to missing level combinations are not included in the model. This convention can affect the way in which you specify the L matrix in your CONTRAST statement.

If the elements of L are not specified for an effect that contains a specified effect, then the elements of the specified effect are distributed over the levels of the higher-order effect just as the LOGISTIC procedure does for its CONTRAST and ESTIMATE statements. For example, suppose that the model contains effects A and B and their interaction A*B. If you specify a CONTRAST statement involving A alone, the L matrix contains nonzero terms for both A and A*B, since A*B contains A.

The degrees of freedom is the number of linearly independent constraints implied by the CONTRAST statement, that is, the rank of L.

You can specify the following options after a slash (/).

ALPHA=α
sets the confidence level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default value is 0.05. A confidence level of α produces 100(1 − α)% confidence limits. The default of ALPHA=0.05 produces 95% confidence limits.

E
requests that the L matrix be displayed.
ESTIMATE=keyword
requests that each individual contrast (that is, each row, l′iβ, of Lβ) or exponentiated contrast (exp(l′iβ)) be estimated and tested. PROC SURVEYLOGISTIC displays the point estimate, its standard error, a Wald confidence interval, and a Wald chi-square test for each contrast. The significance level of the confidence interval is controlled by the ALPHA= option. You can estimate the contrast or the exponentiated contrast (exp(l′iβ)), or both, by specifying one of the following keywords:

PARM    specifies that the contrast itself be estimated
EXP     specifies that the exponentiated contrast be estimated
BOTH    specifies that both the contrast and the exponentiated contrast be estimated
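For example (a sketch continuing the four-level effect-coded CLASS variable A from the earlier contrasts), the following statement estimates the exponentiated contrast for level 1 versus level 2, with a 90% Wald confidence interval:

   contrast '1 vs. 2' A 1 -1 0 / estimate=exp alpha=0.1;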
SINGULAR = number
tunes the estimability check. This option is ignored when the full-rank parameterization is used. If v is a vector, define ABS(v) to be the largest absolute value of the elements of v. For a row vector l′ of the contrast matrix L, define c to be equal to ABS(l) if ABS(l) is greater than 0; otherwise, c equals 1. If ABS(l′ − l′T) is greater than c*number, then l is declared nonestimable. The T matrix is the Hermite form matrix I₀⁻I₀, where I₀⁻ represents a generalized inverse of the information matrix I₀ of the null model. The value for number must be between 0 and 1; the default value is 1E−4.
FREQ Statement

FREQ variable;

The variable in the FREQ statement identifies a variable that contains the frequency of occurrence of each observation. PROC SURVEYLOGISTIC treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If the frequency value is not an integer, it is truncated to an integer. If the frequency value is less than 1 or missing, the observation is not used in the model fitting. When the FREQ statement is not specified, each observation is assigned a frequency of 1.

If you use the events/trials syntax in the MODEL statement, the FREQ statement is disallowed because the event and trial variables represent the frequencies in the data set.
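For example (a minimal sketch, assuming a hypothetical variable Count that records how many identical observations each data line represents):

   proc surveylogistic data=GroupedData;   /* hypothetical grouped data set */
      freq Count;
      weight SamplingWeight;
      model Response(event='1') = Age Income;
   run;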
MODEL Statement

MODEL events/trials = < effects > < / options >;
MODEL variable < (variable-options) > = < effects > < / options >;

The MODEL statement names the response variable and the explanatory effects, including covariates, main effects, interactions, and nested effects; see the section "Specification of Effects" on page 1784 of Chapter 32, "The GLM Procedure," for more information. If you omit the explanatory variables, the procedure fits an intercept-only model. Model options can be specified after a slash (/).

Two forms of the MODEL statement can be specified. The first form, referred to as single-trial syntax, is applicable to binary, ordinal, and nominal response data. The second form, referred to as events/trials syntax, is restricted to the case of binary response data. The single-trial syntax is used when each observation in the DATA= data set contains information on only a single trial, for instance, a single subject in an experiment. When each observation contains information on multiple binary-response trials, such as the counts of the number of subjects observed and the number responding, then events/trials syntax can be used.

In the events/trials syntax, you specify two variables that contain count data for a binomial experiment. These two variables are separated by a slash. The value of the first variable, events, is the number of positive responses (or events). The value of the second variable, trials, is the number of trials. The values of both events and (trials−events) must be nonnegative and the value of trials must be positive for the response to be valid.

In the single-trial syntax, you specify one variable (on the left side of the equal sign) as the response variable. This variable can be character or numeric. Options specific to the response variable can be specified immediately after the response variable with a pair of parentheses around them.

For both forms of the MODEL statement, explanatory effects follow the equal sign. Variables can be either continuous or classification variables. Classification variables can be character or numeric, and they must be declared in the CLASS statement. When an effect is a classification variable, the procedure enters a set of coded columns into the design matrix instead of directly entering a single column containing the values of the variable.
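For example (a sketch with hypothetical variables), the two forms below specify comparable binary models; the first uses events/trials syntax with count variables, and the second uses single-trial syntax with one observation per subject:

   /* events/trials syntax: Deaths events out of Exposed trials */
   model Deaths/Exposed = Dose;

   /* single-trial syntax: one observation per subject */
   model Survived(event='1') = Dose;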
Response Variable Options

You specify the following options by enclosing them in a pair of parentheses after the response variable.

DESCENDING | DESC
reverses the order of response categories. If both the DESCENDING and ORDER= options are specified, PROC SURVEYLOGISTIC orders the response categories according to the ORDER= option and then reverses that order. See the "Response Level Ordering" section on page 4269 for more detail.

EVENT='category' | keyword
specifies the event category for the binary response model. PROC SURVEYLOGISTIC models the probability of the event category. The EVENT= option has no effect when there are more than two response categories. You can specify the value (formatted if a format is applied) of the event category in quotes, or you can specify one of the following keywords. The default is EVENT=FIRST.

FIRST    designates the first ordered category as the event
LAST     designates the last ordered category as the event
One of the most common sets of response levels is {0,1}, with 1 representing the event for which the probability is to be modeled. Consider the example where Y takes the values 1 and 0 for event and nonevent, respectively, and Exposure is the explanatory variable. To specify the value 1 as the event category, use the MODEL statement

   model Y(event='1') = Exposure;

ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the levels of the response variable. By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. The following table shows the interpretation of the ORDER= values.

Value of ORDER=   Levels Sorted By
DATA              order of appearance in the input data set
FORMATTED         external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ              descending frequency count; levels with the most observations come first in the order
INTERNAL          unformatted value
For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts. REFERENCE=’category’ | keyword REF=’category’ | keyword
specifies the reference category for the generalized logit model and the binary response model. For the generalized logit model, each nonreference category is contrasted with the reference category. For the binary response model, specifying one response category as the reference is the same as specifying the other response category as the event category. You can specify the value (formatted if a format is applied) of the reference category in quotes or you can specify one of the following keywords. The default is REF=LAST. FIRST
designates the first ordered category as the reference
LAST
designates the last ordered category as the reference
Model Options

Model options can be specified after a slash (/). Table 69.1 summarizes the options available in the MODEL statement.

Table 69.1. Model Statement Options
Option          Description

Model Specification Options
LINK=           Specifies link function
NOINT           Suppresses intercept(s)
OFFSET=         Specifies offset variable

Convergence Criterion Options
ABSFCONV=       Specifies absolute function convergence criterion
FCONV=          Specifies relative function convergence criterion
GCONV=          Specifies relative gradient convergence criterion
XCONV=          Specifies relative parameter convergence criterion
MAXITER=        Specifies maximum number of iterations
NOCHECK         Suppresses checking for infinite parameters
RIDGING=        Specifies technique used to improve the log-likelihood function when its value is worse than that of the previous step
SINGULAR=       Specifies tolerance for testing singularity
TECHNIQUE=      Specifies iterative algorithm for maximization

Options for Adjustment to Variance Estimation
VADJUST=        Specifies variance estimation adjustment method

Options for Confidence Intervals
ALPHA=          Specifies α for the 100(1 − α)% confidence intervals
CLPARM          Computes confidence intervals for parameters
CLODDS          Computes confidence intervals for odds ratios

Options for Display of Details
CORRB           Displays correlation matrix
COVB            Displays covariance matrix
EXPB            Displays exponentiated values of estimates
ITPRINT         Displays iteration history
NODUMMYPRINT    Suppresses “Class Level Information” table
PARMLABEL       Displays parameter labels
RSQUARE         Displays generalized R²
STB             Displays standardized estimates

The following list describes these options.

ABSFCONV=value
specifies the absolute function convergence criterion. Convergence requires a small change in the log-likelihood function in subsequent iterations,

   |l^(i) − l^(i−1)| < value

where l^(i) is the value of the log-likelihood function at iteration i. See the section “Convergence Criteria” on page 4277.
ALPHA=α
sets the level of significance α for 100(1 − α)% confidence intervals for regression parameters or odds ratios. The value α must be between 0 and 1. By default, α is equal to the value of the ALPHA= option in the PROC SURVEYLOGISTIC statement, or α = 0.05 if the option is not specified. This option has no effect unless confidence limits for the parameters or odds ratios are requested. CLODDS
requests confidence intervals for the odds ratios. Computation of these confidence intervals is based on individual Wald tests. The confidence coefficient can be specified with the ALPHA= option. See the “Wald Confidence Intervals for Parameters” section on page 4288 for more information. CLPARM
requests confidence intervals for the parameters. Computation of these confidence intervals is based on the individual Wald tests. The confidence coefficient can be specified with the ALPHA= option. See the “Wald Confidence Intervals for Parameters” section on page 4288 for more information. CORRB
displays the correlation matrix of the parameter estimates. COVB
displays the covariance matrix of the parameter estimates. EXPB EXPEST
displays the exponentiated values (e^θ̂_i) of the parameter estimates θ̂_i in the “Analysis of Maximum Likelihood Estimates” table for the logit model. These exponentiated values are the estimated odds ratios for the parameters corresponding to the continuous explanatory variables.
specifies the relative function convergence criterion. Convergence requires a small relative change in the log-likelihood function in subsequent iterations,

   |l^(i) − l^(i−1)| / (|l^(i−1)| + 1E−6) < value

where l^(i) is the value of the log-likelihood at iteration i. See the section “Convergence Criteria” on page 4277.
GCONV=value
specifies the relative gradient convergence criterion. Convergence requires that the normalized prediction function reduction is small,

   g^(i)′ [I^(i)]^(−1) g^(i) / (|l^(i)| + 1E−6) < value

where l^(i) is the value of the log-likelihood function, g^(i) is the gradient vector, and I^(i) is the (expected) information matrix, all evaluated at iteration i. This is the default convergence criterion, and the default value is 1E−8. See the section “Convergence Criteria” on page 4277.

ITPRINT
displays the iteration history of the maximum-likelihood model fitting. The ITPRINT option also displays the last evaluation of the gradient vector and the final change in the −2 Log Likelihood. LINK=keyword L=keyword
specifies the link function linking the response probabilities to the linear predictors. You can specify one of the following keywords. The default is LINK=LOGIT. CLOGLOG
the complementary log-log function. PROC SURVEYLOGISTIC fits the binary complementary log-log model for binary response and fits the cumulative complementary log-log model when there are more than two response categories. Aliases: CCLOGLOG, CCLL, CUMCLOGLOG.
GLOGIT
the generalized logit function. PROC SURVEYLOGISTIC fits the generalized logit model where each nonreference category is contrasted with the reference category. You can use the response variable option REF= to specify the reference category.
LOGIT
the cumulative logit function. PROC SURVEYLOGISTIC fits the binary logit model when there are two response categories and fits the cumulative logit model when there are more than two response categories. Aliases: CLOGIT, CUMLOGIT.
PROBIT
the inverse standard normal distribution function. PROC SURVEYLOGISTIC fits the binary probit model when there are two response categories and fits the cumulative probit model when there are more than two response categories. Aliases: NORMIT, CPROBIT, CUMPROBIT.
See the section “Link Functions and the Corresponding Distributions” on page 4273 for details. MAXITER=n
specifies the maximum number of iterations to perform. By default, MAXITER=25. If convergence is not attained in n iterations, the displayed output created by the procedure contains results that are based on the last maximum likelihood iteration. NOCHECK
disables the checking process to determine whether maximum likelihood estimates of the regression parameters exist. If you are sure that the estimates are finite, this option can reduce the execution time if the estimation takes more than eight iterations. For more information, see the “Existence of Maximum Likelihood Estimates” section on page 4277.
NODUMMYPRINT NODESIGNPRINT NODP
suppresses the “Class Level Information” table, which shows how the design matrix columns for the CLASS variables are coded. NOINT
suppresses the intercept for the binary response model or the first intercept for the ordinal response model. OFFSET= name
names the offset variable. The regression coefficient for this variable will be fixed at 1. PARMLABEL
displays the labels of the parameters in the “Analysis of Maximum Likelihood Estimates” table. RIDGING=ABSOLUTE | RELATIVE | NONE
specifies the technique used to improve the log-likelihood function when its value in the current iteration is less than that in the previous iteration. If you specify the RIDGING=ABSOLUTE option, the diagonal elements of the negative (expected) Hessian are inflated by adding the ridge value. If you specify the RIDGING=RELATIVE option, the diagonal elements are inflated by a factor of 1 plus the ridge value. If you specify the RIDGING=NONE option, the crude line search method of taking half a step is used instead of ridging. By default, RIDGING=RELATIVE. RSQUARE RSQ
requests a generalized R2 measure for the fitted model. For more information, see the “Generalized Coefficient of Determination” section on page 4280. SINGULAR=value
specifies the tolerance for testing the singularity of the Hessian matrix (Newton-Raphson algorithm) or the expected value of the Hessian matrix (Fisher-scoring algorithm). The Hessian matrix is the matrix of second partial derivatives of the log likelihood. The test requires that a pivot for sweeping this matrix be at least this number times a norm of the matrix. Values of the SINGULAR= option must be numeric. By default, SINGULAR=1E−12. STB
displays the standardized estimates for the parameters for the continuous explanatory variables in the “Analysis of Maximum Likelihood Estimates” table. The standardized estimate of θ_i is given by θ̂_i/(s/s_i), where s_i is the total sample standard deviation for the ith explanatory variable and

   s = π/√3    Logistic
       1       Normal
       π/√6    Extreme-value
For the intercept parameters and parameters associated with a CLASS variable, the standardized estimates are set to missing. TECHNIQUE=FISHER | NEWTON TECH=FISHER | NEWTON
specifies the optimization technique for estimating the regression parameters. NEWTON (or NR) is the Newton-Raphson algorithm and FISHER (or FS) is the Fisher-scoring algorithm. Both techniques yield the same estimates, but the estimated covariance matrices are slightly different except for the case when the LOGIT link is specified for binary response data. The default is TECHNIQUE=FISHER. See the section “Iterative Algorithms for Model-Fitting” on page 4275 for details. VADJUST=DF | MOREL | NONE < ( Morel-options ) > VARADJ=DF | MOREL | NONE < ( Morel-options ) > VARADJUST=DF | MOREL | NONE < ( Morel-options ) >
specifies an adjustment to the variance estimation (on page 4286) for the regression coefficients. By default, PROC SURVEYLOGISTIC uses the degrees of freedom adjustment VADJUST=DF. You can specify the VADJUST=MOREL option for the variance adjustment proposed by Morel (1989). If you do not wish to use any variance adjustment, you can specify the VADJUST=NONE option. You can specify the following Morel-options within parentheses after the VADJUST=MOREL option. ADJBOUND=φ
sets the upper bound coefficient φ in the variance adjustment. This upper bound must be positive. By default, the procedure uses φ = 0.5. See the section “Adjustments to the Variance Estimation” on page 4286 for more details on how this upper bound is used in the variance estimation. DEFFBOUND=δ
sets the lower bound of the estimated design effect in the variance adjustment. This lower bound must be positive. By default, the procedure uses δ = 1. See the section “Adjustments to the Variance Estimation” on page 4286 for more details on how this lower bound is used in the variance estimation. XCONV=value
specifies the relative parameter convergence criterion. Convergence requires a small relative parameter change in subsequent iterations,

   max_j |δ_j^(i)| < value

where

   δ_j^(i) = θ_j^(i) − θ_j^(i−1)                     if |θ_j^(i−1)| < 0.01
   δ_j^(i) = (θ_j^(i) − θ_j^(i−1)) / θ_j^(i−1)       otherwise

and θ_j^(i) is the estimate of the jth parameter at iteration i. See the section “Iterative Algorithms for Model-Fitting” on page 4275.
STRATA Statement STRATA | STRATUM variables < / option > ; The STRATA statement names variables that form the strata in a stratified sample design. The combinations of levels of STRATA variables define the strata in the sample. If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. See the section “Specification of Population Totals and Sampling Rates” on page 4280 for more information. The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the STRATA variables determine the levels. Thus, you can use formats to group values into levels. See the discussion of the FORMAT procedure in the SAS Procedures Guide. You can specify the following option in the STRATA statement after a slash (/): LIST
displays a “Stratum Information” table, which includes values of the STRATA variables and sampling rates for each stratum. This table also provides the number of observations and number of clusters for each stratum and analysis variable. See the section “Displayed Output” on page 4292 for more details.
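For example, a stratified clustered design might be specified as in the following minimal sketch; the data set and variable names are hypothetical:

   proc surveylogistic data=SchoolSurvey;
      strata Region / list;              /* first-stage strata; LIST displays the Stratum Information table */
      cluster School;                    /* first-stage clusters (PSUs) */
      weight SamplingWeight;             /* sampling weights */
      model Outcome(event='1') = Age Income;
   run;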
TEST Statement < label: > TEST equation1 < , . . . , < equationk >>< /option > ; The TEST statement tests linear hypotheses about the regression coefficients. The Wald test is used to jointly test the null hypotheses (H0 : Lθ = c) specified in a single TEST statement. When c = 0 you should specify a CONTRAST statement instead. Each equation specifies a linear hypothesis (a row of the L matrix and the corresponding element of the c vector); multiple equations are separated by commas. The label, which must be a valid SAS name, is used to identify the resulting output and should always be included. You can submit multiple TEST statements. The form of an equation is as follows:
term < ±term . . . > < = ±term < ±term . . . >> where term is a parameter of the model, or a constant, or a constant times a parameter. For a binary response model, the intercept parameter is named INTERCEPT;
for an ordinal response model, the intercept parameters are named INTERCEPT, INTERCEPT2, INTERCEPT3, and so on. When no equal sign appears, the expression is set to 0. The following code illustrates possible uses of the TEST statement:

   proc surveylogistic;
      model y = a1 a2 a3 a4;
      test1: test intercept + .5 * a2 = 0;
      test2: test intercept + .5 * a2;
      test3: test a1=a2=a3;
      test4: test a1=a2, a2=a3;
   run;
Note that the first and second TEST statements are equivalent, as are the third and fourth TEST statements. You can specify the following option in the TEST statement after a slash (/). PRINT
displays intermediate calculations in the testing of the null hypothesis H0: Lθ = c. This includes L V̂(θ̂)L′ bordered by (Lθ̂ − c) and [L V̂(θ̂)L′]^(−1) bordered by [L V̂(θ̂)L′]^(−1)(Lθ̂ − c), where θ̂ is the maximum likelihood estimator of θ and V̂(θ̂) is the estimated covariance matrix of θ̂. For more information, see the “Testing Linear Hypotheses about the Regression Coefficients” section on page 4288.
UNITS Statement UNITS independent1 = list1 < . . . independentk = listk >< /option > ; The UNITS statement enables you to specify units of change for the continuous explanatory variables so that customized odds ratios can be estimated. An estimate of the corresponding odds ratio is produced for each unit of change specified for an explanatory variable. The UNITS statement is ignored for CLASS variables. If the CLODDS option is specified in the MODEL statement, the corresponding confidence limits for the odds ratios are also displayed. The term independent is the name of an explanatory variable and list represents a list of units of change, separated by spaces, that are of interest for that variable. Each unit of change in a list has one of the following forms: • number • SD or −SD • number * SD where number is any nonzero number, and SD is the sample standard deviation of the corresponding independent variable. For example, X = −2 requests an odds ratio that represents the change in the odds when the variable X is decreased by two units.
X = 2*SD requests an estimate of the change in the odds when X is increased by two sample standard deviations. You can specify the following option in the UNITS statement after a slash (/). DEFAULT= list
gives a list of units of change for all explanatory variables that are not specified in the UNITS statement. Each unit of change can be in any of the forms described previously. If the DEFAULT= option is not specified, PROC SURVEYLOGISTIC does not produce customized odds ratio estimates for any explanatory variable that is not listed in the UNITS statement. For more information, see the “Odds Ratio Estimation” section on page 4288.
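For example, the following sketch (data set and variable names hypothetical) requests customized odds ratios for a 10-unit change in Age and for one and two sample standard deviations of Weight:

   proc surveylogistic data=Health;
      model Disease(event='1') = Age Weight / clodds;
      units Age=10 Weight=SD 2*SD;       /* customized odds ratio units */
   run;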
WEIGHT Statement WEIGHT variable < / option >; The WEIGHT statement names the variable that contains the sampling weights. This variable must be numeric. If you do not specify a WEIGHT statement, PROC SURVEYLOGISTIC assigns all observations a weight of 1. Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. If you specify more than one WEIGHT statement, the procedure uses only the first WEIGHT statement and ignores the rest.
Details Missing Values Any observation with missing values for the response, offset, or explanatory variables is excluded from the analysis. The estimated linear predictor, its standard error estimate, the fitted probabilities, and their confidence limits are not computed for any observation with missing offset or explanatory variable values. An observation is also excluded if it has a missing value for any STRATA or CLUSTER variable, unless the MISSING option is used in the PROC SURVEYLOGISTIC statement. Missing values in your survey data (such as nonresponse) can compromise the quality of your results. An observation without missing values is called a complete respondent, and an observation with missing values is called an incomplete respondent. If the missing data are missing at random, then PROC SURVEYLOGISTIC produces unbiased results when it excludes observations with missing values. However, if the complete respondents are different from the incomplete respondents with regard to a survey effect or outcome, then excluding nonrespondents from the analysis may result in biased estimates that do not accurately represent the survey population. When the missing data are not missing at random, you should use imputation to replace missing values with acceptable values and use sampling weight adjustments
to compensate for nonresponse before you use PROC SURVEYLOGISTIC. Refer to Cochran (1977), Kalton and Kasprzyk (1986), and Brick and Kalton (1996) for more information.
Model Specification

Response Level Ordering

Response level ordering is important because, by default, PROC SURVEYLOGISTIC models the probabilities of response levels with lower Ordered Values. Ordered Values, displayed in the “Response Profile” table, are assigned to response levels in ascending sorted order. That is, the lowest response level is assigned Ordered Value 1, the next lowest is assigned Ordered Value 2, and so on. For example, if your response variable Y takes values in {1, . . . , D + 1}, then the functions of the response probabilities modeled with the cumulative model are

   logit(Pr(Y ≤ i | x)),   i = 1, . . . , D

and for the generalized logit model they are

   log( Pr(Y = i | x) / Pr(Y = D + 1 | x) ),   i = 1, . . . , D

where the highest Ordered Value Y = D + 1 is the reference level. You can change these default functions by specifying the EVENT=, the REF=, the DESCENDING, or the ORDER= response variable options in the MODEL statement.

For binary response data with event and nonevent categories, the procedure models the function

   logit(π) = log( π / (1 − π) )

where π is the probability of the response level assigned Ordered Value 1 in the “Response Profiles” table. Since
logit(π) = −logit(1 − π)
the effect of reversing the order of the two response levels is to change the signs of α and β in the model logit(π) = α + β′x. If your event category has a higher Ordered Value than the nonevent category, the procedure models the nonevent probability. You can use response variable options to model the event probability. For example, suppose the binary response variable Y
takes the values 1 and 0 for event and nonevent, respectively, and Exposure is the explanatory variable. By default, the procedure assigns Ordered Value 1 to response level Y=0, and Ordered Value 2 to response level Y=1. Therefore, the procedure models the probability of the nonevent (Ordered Value=1) category. To model the event probability, you can do the following: • explicitly state which response level is to be modeled using the response variable option EVENT= in the MODEL statement, model Y(event=’1’) = Exposure;
• specify the response variable option DESCENDING in the MODEL statement, model Y(descending)=Exposure;
• specify the response variable option REF= in the MODEL statement as the nonevent category for the response variable. This option is most useful when you are fitting a generalized logit model. model Y(ref=’0’) = Exposure;
• assign a format to Y such that the first formatted value (when the formatted values are put in sorted order) corresponds to the event. For this example, Y=1 is assigned formatted value ‘event’ and Y=0 is assigned formatted value ‘nonevent’. Since ORDER=FORMATTED by default, Ordered Value 1 is assigned to response level Y=1 so the procedure models the event.

   proc format;
      value Disease 1='event' 0='nonevent';
   run;
   proc surveylogistic;
      format Y Disease.;
      model Y = Exposure;
   run;
CLASS Variable Parameterization Consider a model with one CLASS variable A with four levels, 1, 2, 5, and 7. Details of the possible choices for the PARAM= option follow. EFFECT
Three columns are created to indicate group membership of the nonreference levels. For the reference level, all three dummy variables have a value of −1. For instance, if the reference level is 7 (REF=7), the design matrix columns for A are as follows.
Effect Coding Design Matrix
A     A1    A2    A5
1      1     0     0
2      0     1     0
5      0     0     1
7     −1    −1    −1
Parameter estimates of CLASS main effects using the effect coding scheme estimate the difference in the effect of each nonreference level compared to the average effect over all four levels. GLM
As in PROC GLM, four columns are created to indicate group membership. The design matrix columns for A are as follows.
GLM Coding Design Matrix
A     A1    A2    A5    A7
1      1     0     0     0
2      0     1     0     0
5      0     0     1     0
7      0     0     0     1
Parameter estimates of CLASS main effects using the GLM coding scheme estimate the difference in the effects of each level compared to the last level. ORDINAL
Three columns are created to indicate group membership of the higher levels of the effect. For the first level of the effect (which for A is 1), all three dummy variables have a value of 0. The design matrix columns for A are as follows.

Ordinal Coding Design Matrix
A     A2    A5    A7
1      0     0     0
2      1     0     0
5      1     1     0
7      1     1     1
The first level of the effect is a control or baseline level. Parameter estimates of CLASS main effects using the ORDINAL coding scheme estimate the effect on the response as the ordinal factor is set to each succeeding level. When the parameters for an ordinal main effect have the same sign, the response effect is monotonic across the levels. POLYNOMIAL POLY
Three columns are created. The first represents the linear term (x), the second represents the quadratic term (x2 ), and the third represents the cubic term (x3 ), where x is the level value. If the CLASS levels are not numeric, they are translated into 1, 2, 3, . . . according to their sorting order. The design matrix columns for A are as follows.
Polynomial Coding Design Matrix
A     APOLY1   APOLY2   APOLY3
1        1        1        1
2        2        4        8
5        5       25      125
7        7       49      343
REFERENCE REF
Three columns are created to indicate group membership of the nonreference levels. For the reference level, all three dummy variables have a value of 0. For instance, if the reference level is 7 (REF=7), the design matrix columns for A are as follows.

Reference Coding Design Matrix
A     A1    A2    A5
1      1     0     0
2      0     1     0
5      0     0     1
7      0     0     0
Parameter estimates of CLASS main effects using the reference coding scheme estimate the difference in the effect of each nonreference level compared to the effect of the reference level. ORTHEFFECT The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=EFFECT. The design matrix columns for A are as follows.
Orthogonal Effect Coding Design Matrix
A     AOEFF1     AOEFF2     AOEFF3
1      1.41421   −0.81650   −0.57735
2      0.00000    1.63299   −0.57735
5      0.00000    0.00000    1.73205
7     −1.41421   −0.81649   −0.57735
ORTHORDINAL ORTHOTHERM The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=ORDINAL. The design matrix columns for A are as follows.
Orthogonal Ordinal Coding Design Matrix
A     AOORD1     AOORD2     AOORD3
1     −1.73205    0.00000    0.00000
2      0.57735   −1.63299    0.00000
5      0.57735    0.81650   −1.41421
7      0.57735    0.81650    1.41421
ORTHPOLY
The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=POLY. The design matrix columns for A are as follows.

Orthogonal Polynomial Coding Design Matrix
A     AOPOLY1   AOPOLY2   AOPOLY5
1     −1.153     0.907    −0.921
2     −0.734    −0.540     1.473
5      0.524    −1.370    −0.921
7      1.363     1.004     0.368
ORTHREF
The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=REFERENCE. The design matrix columns for A are as follows.
Orthogonal Reference Coding Design Matrix
A     AOREF1     AOREF2     AOREF3
1      1.73205    0.00000    0.00000
2     −0.57735    1.63299    0.00000
5     −0.57735   −0.81650    1.41421
7     −0.57735   −0.81650   −1.41421
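For instance, a coding scheme can be selected with the PARAM= option in the CLASS statement, as in this minimal sketch (data set and variable names hypothetical):

   proc surveylogistic data=MySurvey;
      class A(param=ref ref='7');        /* reference coding with level 7 as the reference */
      model Y(event='1') = A X;
   run;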
Link Functions and the Corresponding Distributions

Four link functions are available in the SURVEYLOGISTIC procedure. The logit function is the default. To specify a different link function, use the LINK= option in the MODEL statement. The link functions and the corresponding distributions are as follows:

• The logit function

     g(π) = log( π / (1 − π) )

  is the inverse of the cumulative logistic distribution function, which is

     F(x) = 1 / (1 + e^(−x))

• The probit (or normit) function

     g(π) = Φ^(−1)(π)

  is the inverse of the cumulative standard normal distribution function, which is

     F(x) = Φ(x) = (1/√(2π)) ∫ from −∞ to x of e^(−z²/2) dz

  Traditionally, the probit function includes an additive constant 5, but throughout PROC SURVEYLOGISTIC, the terms probit and normit are used interchangeably, defined as g(π) above.

• The complementary log-log function

     g(π) = log(−log(1 − π))

  is the inverse of the cumulative extreme-value function (also called the Gompertz distribution), which is

     F(x) = 1 − e^(−e^x)

• The generalized logit function extends the binary logit link to a vector of levels (π_1, . . . , π_(k+1)) by contrasting each level with a fixed level

     g(π_i) = log( π_i / π_(k+1) ),   i = 1, . . . , k

The variances of the normal, logistic, and extreme-value distributions are not the same. Their respective means and variances are

Distribution      Mean    Variance
Normal              0       1
Logistic            0       π²/3
Extreme-value      −γ       π²/6
where γ is the Euler constant. In comparing parameter estimates using different link functions, you need to take into account the different scalings of the corresponding distributions and, for the complementary log-log function, a possible shift in location. For example, if the fitted probabilities are in the neighborhood of 0.1 to 0.9, then the parameter estimates using the logit link function should be about π/√3 ≈ 1.8 larger than the estimates from the probit link function.
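For example, a cumulative probit model for an ordinal response might be requested as in the following sketch (data set and variable names hypothetical):

   proc surveylogistic data=MySurvey;
      model Severity = Dose / link=probit;   /* cumulative probit model for ordinal Severity */
   run;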
Model Fitting

Determining Observations for Likelihood Contributions

If you use events/trials syntax, each observation is split into two observations. One has the response value 1 with a frequency equal to the frequency of the original observation (which is 1 if the FREQ statement is not used) times the value of the events variable. The other observation has the response value 2 and a frequency equal to the frequency of the original observation times the value of (trials − events). These two observations have the same explanatory variable values and the same FREQ and WEIGHT values as the original observation.

For either single-trial or events/trials syntax, let j index all observations. In other words, for single-trial syntax, j indexes the actual observations. And, for events/trials syntax, j indexes the observations after splitting (as described previously). If your data set has 30 observations and you use single-trial syntax, j has values from 1 to 30; if you use events/trials syntax, j has values from 1 to 60.

Suppose the response variable in a cumulative response model can take on the ordered values 1, . . . , k, k + 1, where k is an integer ≥ 1. The likelihood for the jth observation with ordered response value y_j and explanatory variables row vector x_j is given by

   L_j = F(α_1 + x_j β)                            y_j = 1
   L_j = F(α_i + x_j β) − F(α_(i−1) + x_j β)       1 < y_j = i ≤ k
   L_j = 1 − F(α_k + x_j β)                        y_j = k + 1

where F(·) is the logistic, normal, or extreme-value distribution function, α_1, . . . , α_k are ordered intercept parameters, and β is the slope parameter vector.

For the generalized logit model, letting the (k + 1)st level be the reference level, the intercepts α_1, . . . , α_k are unordered and the slope vector β_i varies with each logit. The likelihood for the jth observation with response value y_j and explanatory variables row vector x_j is given by

   L_j = Pr(Y = y_j | x_j)
       = e^(α_i + x_j β_i) / ( 1 + Σ_(i=1)^k e^(α_i + x_j β_i) )    1 ≤ y_j = i ≤ k
       = 1 / ( 1 + Σ_(i=1)^k e^(α_i + x_j β_i) )                    y_j = k + 1
Iterative Algorithms for Model-Fitting

Two iterative maximum likelihood algorithms are available in PROC SURVEYLOGISTIC to obtain the maximum likelihood estimate θ̂ of the model parameter θ. The default is the Fisher-scoring method, which is equivalent to fitting by iteratively reweighted least squares. The alternative algorithm is the
Newton-Raphson method. Both algorithms give the same parameter estimates. The covariance matrix of θ̂ is estimated in the section “Variance Estimation for Sample Survey Data” on page 4282. For a generalized logit model, only the Newton-Raphson technique is available. You can use the TECHNIQUE= option to select a fitting algorithm.

Iteratively Reweighted Least-Squares Algorithm (Fisher Scoring)
Let Y be the response variable, which takes values 1, . . . , k, k + 1 (k ≥ 1). Let j index all observations, and let Y_j be the value of the response for the jth observation. Consider the multinomial variable Z_j = (Z_1j, . . . , Z_kj)′ such that

   Z_ij = 1 if Y_j = i, and 0 otherwise

and Z_(k+1)j = 1 − Σ_(i=1)^k Z_ij. With π_ij denoting the probability that the jth observation has response value i, the expected value of Z_j is π_j = (π_1j, . . . , π_kj)′, and π_(k+1)j = 1 − Σ_(i=1)^k π_ij. The covariance matrix of Z_j is V_j, which is the covariance matrix of a multinomial random variable for one trial with parameter vector π_j. Let θ be the vector of regression parameters; for example, θ = (α_1, . . . , α_k, β′)′ for the cumulative logit model. Let D_j be the matrix of partial derivatives of π_j with respect to θ. The estimating equation for the regression parameters is

   Σ_j D_j′ W_j (Z_j − π_j) = 0

where W_j = w_j f_j V_j^(−1), and w_j and f_j are the WEIGHT and FREQ values of the jth observation.

With a starting value of θ^(0), the maximum likelihood estimate of θ is obtained iteratively as

   θ^(i+1) = θ^(i) + ( Σ_j D_j′ W_j D_j )^(−1) Σ_j D_j′ W_j (Z_j − π_j)

where D_j, W_j, and π_j are evaluated at the ith iteration θ^(i). The expression after the plus sign is the step size. If the log likelihood evaluated at θ^(i+1) is less than that evaluated at θ^(i), then θ^(i+1) is recomputed by step-halving or ridging. The iterative scheme continues until convergence is obtained, that is, until θ^(i+1) is sufficiently close to θ^(i). Then the maximum likelihood estimate of θ is θ̂ = θ^(i+1).

By default, starting values are zero for the slope parameters, and for the intercept parameters, starting values are the observed cumulative logits (that is, logits of the observed cumulative proportions of response). Alternatively, the starting values may be specified with the INEST= option.
Newton-Raphson Algorithm

Let

   g = Σ_j w_j f_j ∂l_j/∂θ
   H = Σ_j −w_j f_j ∂²l_j/∂θ²

be the gradient vector and the Hessian matrix, where l_j = log L_j is the log likelihood for the jth observation. With a starting value of θ^(0), the maximum likelihood estimate θ̂ of θ is obtained iteratively until convergence is obtained:

   θ^(i+1) = θ^(i) + H^(−1) g

where H and g are evaluated at the ith iteration θ^(i). If the log likelihood evaluated at θ^(i+1) is less than that evaluated at θ^(i), then θ^(i+1) is recomputed by step-halving or ridging. The iterative scheme continues until convergence is obtained, that is, until θ^(i+1) is sufficiently close to θ^(i). Then the maximum likelihood estimate of θ is θ̂ = θ^(i+1).
Convergence Criteria Four convergence criteria are allowed, namely, ABSFCONV=, FCONV=, GCONV=, and XCONV=. If you specify more than one convergence criterion, the optimization is terminated as soon as one of the criteria is satisfied. If none of the criteria is specified, the default is GCONV=1E−8.
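For instance, the fitting algorithm and convergence criteria can be controlled together, as in this minimal sketch (data set and variable names hypothetical):

   proc surveylogistic data=MySurvey;
      model Y(event='1') = X1 X2 / technique=newton xconv=1e-6 maxiter=100;
   run;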
Existence of Maximum Likelihood Estimates

The likelihood equation for a logistic regression model does not always have a finite solution. Sometimes there is a nonunique maximum on the boundary of the parameter space, at infinity. The existence, finiteness, and uniqueness of maximum likelihood estimates for the logistic regression model depend on the patterns of data points in the observation space (Albert and Anderson 1984; Santner and Duffy 1986). Consider a binary response model. Let Y_j be the response of the jth subject, and let x_j be the vector of explanatory variables (including the constant 1 associated with the intercept). There are three mutually exclusive and exhaustive types of data configurations: complete separation, quasi-complete separation, and overlap.

Complete Separation

There is a complete separation of data points if there exists a vector b that correctly allocates all observations to their response groups; that is,
There is a complete separation of data points if there exists a vector b that correctly allocates all observations to their response groups; that is,
   b′x_j > 0    Y_j = 1
   b′x_j < 0    Y_j = 2
This configuration gives nonunique infinite estimates. If the iterative process of maximizing the likelihood function is allowed to continue, the log likelihood diminishes to zero, and the dispersion matrix becomes unbounded.
Quasi-Complete Separation
The data are not completely separable, but there is a vector b such that

   b′x_j ≥ 0    Y_j = 1
   b′x_j ≤ 0    Y_j = 2

and equality holds for at least one subject in each response group. This configuration also yields nonunique infinite estimates. If the iterative process of maximizing the likelihood function is allowed to continue, the dispersion matrix becomes unbounded and the log likelihood diminishes to a nonzero constant.
Overlap
If neither complete nor quasi-complete separation exists in the sample points, there is an overlap of sample points. In this configuration, the maximum likelihood estimates exist and are unique.
Complete separation and quasi-complete separation are problems typically encountered with small data sets. Although complete separation can occur with any type of data, quasi-complete separation is not likely with truly continuous explanatory variables. The SURVEYLOGISTIC procedure uses a simple empirical approach to recognize the data configurations that lead to infinite parameter estimates. The basis of this approach is that any convergence method of maximizing the log likelihood must yield a solution giving complete separation, if such a solution exists. In maximizing the log likelihood, there is no checking for complete or quasi-complete separation if convergence is attained in eight or fewer iterations. Subsequent to the eighth iteration, the probability of the observed response is computed for each observation. If the probability of the observed response is one for all observations, there is a complete separation of data points and the iteration process is stopped. If the complete separation of data has not been determined and an observation is identified to have an extremely large probability (≥0.95) of the observed response, there are two possible situations. First, there is overlap in the data set, and the observation is an atypical observation of its own group. The iterative process, if allowed to continue, will stop when a maximum is reached. Second, there is quasi-complete separation in the data set, and the asymptotic dispersion matrix is unbounded. If any of the diagonal elements of the dispersion matrix for the standardized observations vectors (all explanatory variables standardized to zero mean and unit variance) exceeds 5,000, quasi-complete separation is declared and the iterative process is stopped. If either complete separation or quasi-complete separation is detected, a warning message is displayed in the procedure output.
Checking for quasi-complete separation is less foolproof than checking for complete separation. The NOCHECK option in the MODEL statement turns off the process of checking for infinite parameter estimates. In cases of complete or quasi-complete separation, turning off the checking process typically results in the procedure failing to converge.
Model Fitting Statistics

Suppose the model contains s explanatory effects. For the jth observation, let π̂_j be the estimated probability of the observed response. The three criteria displayed by the SURVEYLOGISTIC procedure are calculated as follows:

• −2 Log Likelihood:

     −2 Log L = −2 Σ_j w_j f_j log(π̂_j)

  where w_j and f_j are the weight and frequency values of the jth observation. For binary response models using events/trials syntax, this is equivalent to

     −2 Log L = −2 Σ_j w_j f_j { r_j log(π̂_j) + (n_j − r_j) log(1 − π̂_j) }

  where r_j is the number of events, n_j is the number of trials, and π̂_j is the estimated event probability.

• Akaike Information Criterion:

     AIC = −2 Log L + 2p

  where p is the number of parameters in the model. For cumulative response models, p = k + s, where k is the total number of response levels minus one, and s is the number of explanatory effects. For the generalized logit model, p = k(s + 1).

• Schwarz Criterion:

     SC = −2 Log L + p log( Σ_j f_j )
where p is as defined previously. The −2 Log Likelihood statistic has a chi-square distribution under the null hypothesis (that all the explanatory effects in the model are zero) and the procedure produces a p-value for this statistic. The AIC and SC statistics give two different ways of adjusting the −2 Log Likelihood statistic for the number of terms in the model and the number of observations used.
Generalized Coefficient of Determination

Cox and Snell (1989, pp. 208–209) propose the following generalization of the coefficient of determination to a more general linear model:

   R² = 1 − { L(0) / L(θ̂) }^(2/n)

where L(0) is the likelihood of the intercept-only model, L(θ̂) is the likelihood of the specified model, and n is the sample size. The quantity R² achieves a maximum of less than 1 for discrete models, where the maximum is given by

   R²_max = 1 − { L(0) }^(2/n)

Nagelkerke (1991) proposes the following adjusted coefficient, which can achieve a maximum value of 1:

   R̃² = R² / R²_max

Properties and interpretation of R² and R̃² are provided in Nagelkerke (1991). In the “Testing Global Null Hypothesis: BETA=0” table, R² is labeled as “RSquare” and R̃² is labeled as “Max-rescaled RSquare.” Use the RSQUARE option to request R² and R̃².
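For example (data set and variable names hypothetical):

   proc surveylogistic data=MySurvey;
      model Y(event='1') = X1 X2 / rsquare;   /* requests R-Square and Max-rescaled R-Square */
   run;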
INEST= Data Set You can specify starting values for the iterative algorithm in the INEST= data set. The INEST= data set contains one observation for each BY group. The INEST= data set must contain the intercept variables (named Intercept for binary response models and Intercept, Intercept2, Intercept3, and so forth, for ordinal response models) and all explanatory variables in the MODEL statement. If BY processing is used, the INEST= data set should also include the BY variables, and there must be one observation for each BY group. If the INEST= data set also contains the – TYPE– variable, only observations with – TYPE– value ’PARMS’ are used as starting values.
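A minimal sketch of supplying starting values through the INEST= data set for a binary response model (data set and variable names hypothetical):

   data StartVals;
      Intercept = 0;  X1 = 0.5;  X2 = -0.2;   /* one observation of starting values */
   run;

   proc surveylogistic data=MySurvey inest=StartVals;
      model Y(event='1') = X1 X2;
   run;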
Survey Design Information Specification of Population Totals and Sampling Rates Variance estimates in survey samples involve a finite population correction (fpc) for sampling without replacement. For small sampling fractions or sampling with replacement, it is appropriate to ignore this correction (Cochran 1977; Kish 1965), and PROC SURVEYLOGISTIC does so by default. If your analysis requires an fpc, specify the sampling fraction using either the RATE= option or the TOTAL= option on the PROC SURVEYLOGISTIC statement. If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates.
If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of primary sampling units (PSUs) in the sample to the total number of PSUs in the study population. If you are using the TOTAL= option for a multistage design, you should specify the total number of PSUs in the study population. See the section “Primary Sampling Units (PSUs)” on page 4281 for more details. For a nonstratified sample design, or for a stratified sample design for which all the strata have the same sampling rate or population total, you should specify the rate or total as the value of the RATE=value option or the TOTAL=value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you should use the RATE=SAS-data-set option or the TOTAL=SAS-dataset option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option. The secondary data set must contain all the stratification variables listed in the STRATA statement, as well as all the variables in the BY statement if any of these statements are specified. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL=SAS-data-set option, the secondary data set must have a variable named – TOTAL– that contains the stratum population totals. If you specify the RATE=SAS-data-set option, the secondary data set must have a variable named – RATE– that contains the stratum sampling rates. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of – TOTAL– or – RATE– for that stratum and ignores the rest. The value in the RATE= option or the values of – RATE– in the secondary data set must be positive numbers. You can specify value as a number between 0 and 1; or you can specify value in percentage form as a number between 1 and 100, in which case PROC SURVEYLOGISTIC will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. If you specify the TOTAL=value option, this value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.
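For example, stratum population totals might be supplied in a secondary data set as in the following sketch; the data set names, stratum values, and totals are hypothetical:

   data StrataTotals;                      /* one observation per stratum */
      input Region $ _TOTAL_;
      datalines;
   East 1500
   West 2200
   ;

   proc surveylogistic data=MySurvey total=StrataTotals;
      strata Region;
      model Y(event='1') = X;
   run;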
Primary Sampling Units (PSUs) When you have clusters, or primary sampling units (PSUs), in your sample design, the procedure estimates the variance based on the variation among PSUs. Use the CLUSTER statement to identify the first-stage clusters in your design. PROC SURVEYLOGISTIC assumes that each cluster represents a PSU in the sample and that each observation is an element of a PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.
Variance Estimation for Sample Survey Data

Due to the variability of characteristics among items in the population, researchers apply scientific sample designs in the sample selection process to reduce the risk of a distorted view of the population, and they make inferences about the population based on the information from the sample survey data. In order to make statistically valid inferences for the population, they must incorporate the sample design in the data analysis. The SURVEYLOGISTIC procedure fits linear logistic regression models for discrete response survey data using the maximum likelihood method. In the variance estimation, the procedure incorporates complex survey sample designs, including designs with stratification, clustering, and unequal weighting. The procedure uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. This method obtains a linear approximation for the estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself. See Binder (1981, 1983), Roberts, Rao, and Kumar (1987), Skinner, Holt, and Smith (1989), Morel (1989), and Lehtonen and Pahkinen (1995) for papers that describe logistic regression for sample survey data.

When there are clusters, or primary sampling units (PSUs), in the sample design, the procedure estimates variance from the variation among PSUs. When the design is stratified, the procedure pools stratum variance estimates to compute the overall variance estimate. For t tests of the estimates, the degrees of freedom equals the number of clusters minus the number of strata in the sample design. Statistical analyses, such as hypothesis tests and confidence limits, will depend on these variance estimates.
Notation

Let Y be the response variable with categories 1, 2, . . . , D, D + 1. The p covariates are denoted by a p-dimensional row vector x. For a stratified clustered sample design, each observation is represented by the row vector

   (w_hij, y_hij′, y_hij(D+1), x_hij)

where

• h = 1, 2, . . . , H is the stratum number, with a total of H strata
• i = 1, 2, . . . , n_h is the cluster number within stratum h, with a total of n_h clusters
• j = 1, 2, . . . , m_hi is the unit number within cluster i of stratum h, with a total of m_hi units
• w_hij denotes the sampling weight
• y_hij is a D-dimensional column vector whose elements are indicator variables for the first D categories of variable Y. If the response of the jth member of the ith cluster in stratum h falls in category d, the dth element of the vector is one, and the remaining elements of the vector are zero, where d = 1, 2, . . . , D
• y_hij(D+1) is the indicator variable for the (D + 1)st category of variable Y
• x_hij denotes the k-dimensional row vector of explanatory variables for the jth member of the ith cluster in stratum h. If there is an intercept, then x_hij1 ≡ 1.
• ñ = Σ_(h=1)^H n_h is the total number of clusters in the entire sample
• n = Σ_(h=1)^H Σ_(i=1)^(n_h) m_hi is the total sample size

The following notations are also used in the following sections:

• f_h denotes the sampling rate for stratum h
• π_hij is the expected vector of the response variable:

     π_hij = E(y_hij | x_hij) = (π_hij1, π_hij2, . . . , π_hijD)′
     π_hij(D+1) = E(y_hij(D+1) | x_hij)

Note that π_hij(D+1) = 1 − 1′π_hij, where 1 is a D-dimensional column vector whose elements are 1.
Likelihood Function

Let f(·) be a link function such that

   π = f(x, θ)

where θ is a p-dimensional column vector of regression coefficients. The pseudo log likelihood is

   l(θ) = Σ_(h=1)^H Σ_(i=1)^(n_h) Σ_(j=1)^(m_hi) w_hij [ (log(π_hij))′ y_hij + log(π_hij(D+1)) y_hij(D+1) ]

Denote the maximum likelihood estimator as θ̂, which is a solution to the estimating equations:

   Σ_(h=1)^H Σ_(i=1)^(n_h) Σ_(j=1)^(m_hi) w_hij D_hij′ ( diag(π_hij) − π_hij π_hij′ )^(−1) (y_hij − π_hij) = 0

where D_hij is the matrix of partial derivatives of the link function f with respect to θ.

To obtain the maximum likelihood estimator θ̂, the procedure uses iterations with a starting value θ^(0) for θ. See the section “Iterative Algorithms for Model-Fitting” on page 4275 for details.
Generalized Logistic Model

Formulation of the generalized logit models for nominal response variables can be found in Agresti (1990). Without loss of generality, let the last category, D + 1, be the reference category for the response variable Y. The link function for the generalized logistic model is defined as

   π_hijd = e^(x_hij β_d) / ( 1 + Σ_(r=1)^D e^(x_hij β_r) )

and the model parameters are:

   β_d = (β_d1, β_d2, . . . , β_dk)′
   θ = (β_1′, β_2′, . . . , β_D′)′

for d = 1, 2, . . . , D.
Cumulative Logit Model

Details of the cumulative logit model (or proportional odds model) can be found in McCullagh and Nelder (1989). Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

   F_hijd = Σ_(r=1)^d π_hijr

for d = 1, 2, . . . , D. Then the link function for the proportional odds model is

   log( F_hijd / (1 − F_hijd) ) = α_d + x_hij β

with the model parameters:

   β = (β_1, β_2, . . . , β_k)′
   α = (α_1, α_2, . . . , α_D)′,   α_1 < α_2 < · · · < α_D
   θ = (α′, β′)′

Complementary Log-Log Model

Using the notation of the previous section, the link function for the complementary log-log model is

   log(−log(1 − F_hijd)) = α_d + x_hij β

with the model parameters:

   β = (β_1, β_2, . . . , β_k)′
   α = (α_1, α_2, . . . , α_D)′,   α_1 < α_2 < · · · < α_D
   θ = (α′, β′)′
Probit Model

Another commonly used model for ordinal responses is the probit model, with the link function

   F_hijd = Φ(α_d + x_hij β)

where

   Φ(z_0) = (1/√(2π)) ∫ from −∞ to z_0 of e^(−z²/2) dz

is the cumulative distribution function of the standard normal distribution. The model parameters are:

   β = (β_1, β_2, . . . , β_k)′
   α = (α_1, α_2, . . . , α_D)′,   α_1 < α_2 < · · · < α_D
   θ = (α′, β′)′
Estimated Variances

Using a Taylor approximation, the estimated covariance matrix of θ̂ is

   V̂(θ̂) = Q̂^(−1) Ĝ Q̂^(−1)

where

   Q̂ = Σ_(h=1)^H Σ_(i=1)^(n_h) Σ_(j=1)^(m_hi) w_hij D̂_hij′ ( diag(π̂_hij) − π̂_hij π̂_hij′ )^(−1) D̂_hij

   Ĝ = [(n − 1)/(n − p)] Σ_(h=1)^H [ n_h(1 − f_h)/(n_h − 1) ] Σ_(i=1)^(n_h) (e_hi· − ē_h··)(e_hi· − ē_h··)′

   e_hi· = Σ_(j=1)^(m_hi) w_hij D̂_hij′ ( diag(π̂_hij) − π̂_hij π̂_hij′ )^(−1) (y_hij − π̂_hij)

   ē_h·· = (1/n_h) Σ_(i=1)^(n_h) e_hi·

If you use the Newton-Raphson algorithm by specifying the TECHNIQUE=NEWTON option in the MODEL statement, the matrix Q̂ is replaced by the negative (expected) Hessian matrix. The matrices of partial derivatives D̂_hij and the response probabilities π̂_hij are evaluated at θ̂.
Adjustments to the Variance Estimation

The factor (n − 1)/(n − p) in the computation of the matrix Ĝ should reduce the small-sample bias associated with using the estimated function to calculate deviations (Morel 1989; Hidiroglou, Fuller, and Hickman 1980). For simple random sampling, this factor contributes to the degrees-of-freedom correction applied to the residual mean square for ordinary least squares in which p parameters are estimated. By default, the procedure uses this adjustment in variance estimation. This is equivalent to specifying the VADJUST=DF option in the MODEL statement. If you do not wish to use this multiplier in the variance estimation, you can specify the VADJUST=NONE option in the MODEL statement to suppress this factor.

In addition, you can specify the VADJUST=MOREL option to compute a further adjustment to the variance estimator for the regression coefficients θ̂, introduced by Morel (1989):

   V̂(θ̂) = Q̂^(−1) Ĝ Q̂^(−1) + κ λ Q̂^(−1)

where, for given nonnegative constants δ and φ,

   κ = max( δ, p^(−1) tr( Q̂^(−1) Ĝ ) )
   λ = min( φ, p/(ñ − p) )

The adjustment κλQ̂^(−1) will

• reduce the small-sample bias reflected in inflated Type 1 error rates
• guarantee a positive definite estimated covariance matrix provided that Q̂^(−1) exists
• be close to zero when the sample size becomes large

In this adjustment, κ is an estimate of the design effect, which has been bounded below by the positive constant δ. You can use DEFFBOUND=δ in the VADJUST=MOREL option in the MODEL statement to specify this lower bound; by default, the procedure uses δ = 1. The factor λ converges to zero when the sample size becomes large, and λ has an upper bound φ. You can use ADJBOUND=φ in the VADJUST=MOREL option in the MODEL statement to specify this upper bound; by default, the procedure uses φ = 0.5.
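For example, the Morel adjustment with its default bounds written out explicitly might be requested as in this sketch (data set and variable names hypothetical):

   proc surveylogistic data=MySurvey;
      model Y(event='1') = X / vadjust=morel(deffbound=1 adjbound=0.5);
   run;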
Hypothesis Testing and Estimation

Score Statistics and Tests

To understand the general form of the score statistics, let g(θ) be the vector of first partial derivatives of the log likelihood with respect to the parameter vector θ, and let H(θ) be the matrix of second partial derivatives of the log likelihood with respect to θ. That is, g(θ) is the gradient vector, and H(θ) is the Hessian matrix. Let I(θ) be either −H(θ) or the expected value of −H(θ). Consider a null hypothesis H0. Let θ̂_0 be the MLE of θ under H0. The chi-square score statistic for testing H0 is defined by

   g′(θ̂_0) I^(−1)(θ̂_0) g(θ̂_0)

and it has an asymptotic χ² distribution with r degrees of freedom under H0, where r is the number of restrictions imposed on θ by H0.
Testing the Parallel Lines Assumption

For an ordinal response, PROC SURVEYLOGISTIC performs a test of the parallel lines assumption. In the displayed output, this test is labeled “Score Test for the Equal Slopes Assumption” when the LINK= option is NORMIT or CLOGLOG. When LINK=LOGIT, the test is labeled as “Score Test for the Proportional Odds Assumption” in the output. This section describes the methods used to calculate the test.

For this test the number of response levels, D + 1, is assumed to be strictly greater than 2. Let Y be the response variable taking values 1, . . . , D, D + 1. Suppose there are k explanatory variables. Consider the general cumulative model without making the parallel lines assumption

   g(Pr(Y ≤ d | x)) = (1, x) θ_d,   1 ≤ d ≤ D

where g(·) is the link function, and θ_d = (α_d, β_d1, . . . , β_dk)′ is a vector of unknown parameters consisting of an intercept α_d and k slope parameters β_d1, . . . , β_dk. The parameter vector for this general cumulative model is

   θ = (θ_1′, . . . , θ_D′)′

Under the null hypothesis of parallelism H0: β_1i = β_2i = · · · = β_Di, 1 ≤ i ≤ k, there is a single common slope parameter for each of the k explanatory variables. Let β_1, . . . , β_k be the common slope parameters. Let α̂_1, . . . , α̂_D and β̂_1, . . . , β̂_k be the MLEs of the intercept parameters and the common slope parameters. Then, under H0, the MLE of θ is

   θ̂_0 = (θ̂_1′, . . . , θ̂_D′)′   with   θ̂_d = (α̂_d, β̂_1, . . . , β̂_k)′,   1 ≤ d ≤ D
and the chi-square score statistic g′(θ̂_0) I^(−1)(θ̂_0) g(θ̂_0) has an asymptotic chi-square distribution with k(D − 1) degrees of freedom. This tests the parallel lines assumption by testing the equality of separate slope parameters simultaneously for all explanatory variables.
Wald Confidence Intervals for Parameters

Wald confidence intervals are sometimes called normal confidence intervals. They are based on the asymptotic normality of the parameter estimators. The 100(1 − α)% Wald confidence interval for θ_j is given by

   θ̂_j ± z_(1−α/2) σ̂_j

where z_p is the 100pth percentile of the standard normal distribution, θ̂_j is the maximum likelihood estimate of θ_j, and σ̂_j is the standard error estimate of θ̂_j in the section “Variance Estimation for Sample Survey Data” on page 4282.
Testing Linear Hypotheses about the Regression Coefficients

Linear hypotheses for θ are expressed in matrix form as

   H0: Lθ = c

where L is a matrix of coefficients for the linear hypotheses, and c is a vector of constants. The vector of regression coefficients θ includes slope parameters as well as intercept parameters. The Wald chi-square statistic for testing H0 is computed as

   χ²W = (Lθ̂ − c)′ [L V̂(θ̂) L′]⁻¹ (Lθ̂ − c)

where V̂(θ̂) is the estimated covariance matrix in the section "Variance Estimation for Sample Survey Data" on page 4282. Under H0, χ²W has an asymptotic chi-square distribution with r degrees of freedom, where r is the rank of L.
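Such hypotheses are supplied to the procedure with the TEST statement, whose results appear in the "Linear Hypothesis Testing" table described later in this section. A minimal sketch, assuming a hypothetical data set and predictors:

proc surveylogistic data=MySurvey;   /* hypothetical data set */
   weight SamplingWeight;
   model Response(event='Yes') = Age Income;
   /* Wald test of H0: equal Age and Income slopes; here L has rank r = 1 */
   EqualSlopes: test Age = Income;
run;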
Odds Ratio Estimation

Consider a dichotomous response variable with outcomes event and nonevent. Consider a dichotomous risk factor variable X that takes the value 1 if the risk factor is present and 0 if the risk factor is absent. According to the logistic model, the log odds function, g(X), is given by
   g(X) ≡ log( Pr(event | X) / Pr(nonevent | X) ) = β0 + β1 X
The odds ratio ψ is defined as the ratio of the odds for those with the risk factor (X = 1) to the odds for those without the risk factor (X = 0). The log of the odds ratio is given by

   log(ψ) ≡ log(ψ(X = 1, X = 0)) = g(X = 1) − g(X = 0) = β1
The parameter β1 associated with X represents the change in the log odds from X = 0 to X = 1. So, the odds ratio is obtained by simply exponentiating the value of the parameter associated with the risk factor. The odds ratio indicates how the odds of event change as you change X from 0 to 1. For instance, ψ = 2 means that the odds of an event when X = 1 are twice the odds of an event when X = 0.

Suppose the values of the dichotomous risk factor are coded as constants a and b instead of 0 and 1. The odds when X = a become exp(β0 + aβ1), and the odds when X = b become exp(β0 + bβ1). The odds ratio corresponding to an increase in X from a to b is

   ψ = exp[(b − a)β1] = [exp(β1)]^(b−a) ≡ [exp(β1)]^c

Note that for any a and b such that c = b − a = 1, ψ = exp(β1). So the odds ratio can be interpreted as the change in the odds for any increase of one unit in the corresponding risk factor. However, the change in odds for some amount other than one unit is often of greater interest. For example, a change of one pound in body weight may be too small to be considered important, while a change of 10 pounds may be more meaningful. The odds ratio for a change in X from a to b is estimated by raising the odds ratio estimate for a unit change in X to the power of c = b − a as shown previously. (As a hypothetical numeric illustration, if the estimated slope for body weight in pounds were β̂1 = 0.08, the estimated odds ratio for a 10-pound increase would be [exp(0.08)]^10 = exp(0.8) ≈ 2.23.)

For a polytomous risk factor, the computation of odds ratios depends on how the risk factor is parameterized. For illustration, suppose that Race is a risk factor with four categories: White, Black, Hispanic, and Other. For the effect parameterization scheme (PARAM=EFFECT) with White as the reference group, the design variables for Race are as follows.
               Design Variables
   Race        X1    X2    X3
   Black        1     0     0
   Hispanic     0     1     0
   Other        0     0     1
   White       -1    -1    -1
The log odds for Black is

   g(Black) = β0 + β1(X1 = 1) + β2(X2 = 0) + β3(X3 = 0) = β0 + β1

The log odds for White is

   g(White) = β0 + β1(X1 = −1) + β2(X2 = −1) + β3(X3 = −1) = β0 − β1 − β2 − β3
Therefore, the log odds ratio of Black versus White becomes

   log(ψ(Black, White)) = g(Black) − g(White) = 2β1 + β2 + β3

For the reference cell parameterization scheme (PARAM=REF) with White as the reference cell, the design variables for race are as follows.
               Design Variables
   Race        X1    X2    X3
   Black        1     0     0
   Hispanic     0     1     0
   Other        0     0     1
   White        0     0     0
The log odds ratio of Black versus White is given by

   log(ψ(Black, White)) = g(Black) − g(White)
                        = (β0 + β1(X1 = 1) + β2(X2 = 0) + β3(X3 = 0)) − (β0 + β1(X1 = 0) + β2(X2 = 0) + β3(X3 = 0))
                        = β1

For the GLM parameterization scheme (PARAM=GLM), the design variables are as follows.
               Design Variables
   Race        X1    X2    X3    X4
   Black        1     0     0     0
   Hispanic     0     1     0     0
   Other        0     0     1     0
   White        0     0     0     1
The log odds ratio of Black versus White is

   log(ψ(Black, White)) = g(Black) − g(White)
                        = (β0 + β1(X1 = 1) + β2(X2 = 0) + β3(X3 = 0) + β4(X4 = 0)) − (β0 + β1(X1 = 0) + β2(X2 = 0) + β3(X3 = 0) + β4(X4 = 1))
                        = β1 − β4
Consider the hypothetical example of heart disease and race in Hosmer and Lemeshow (2000, p. 51). The entries in the following contingency table represent counts.
                              Race
   Disease Status   White   Black   Hispanic   Other
   Present              5      20         15      10
   Absent              20      10         10      10
The computation of the odds ratio of Black versus White for the various parameterization schemes is tabulated in the following table.
   Odds Ratio of Heart Disease Comparing Black to White
            Parameter Estimates
   PARAM    β̂1      β̂2      β̂3      β̂4      Odds Ratio Estimation
   EFFECT   0.7651  0.4774  0.0719          exp(2 × 0.7651 + 0.4774 + 0.0719) = 8
   REF      2.0794  1.7917  1.3863          exp(2.0794) = 8
   GLM      2.0794  1.7917  1.3863  0.0000  exp(2.0794) = 8
Since the log odds ratio (log(ψ)) is a linear function of the parameters, the Wald confidence interval for log(ψ) can be derived from the parameter estimates and the estimated covariance matrix. Confidence intervals for the odds ratios are obtained by exponentiating the corresponding confidence intervals for the log odds ratios. In the displayed output of PROC SURVEYLOGISTIC, the "Odds Ratio Estimates" table contains the odds ratio estimates and the corresponding 95% Wald confidence intervals computed using the covariance matrix in the section "Variance Estimation for Sample Survey Data" on page 4282. For continuous explanatory variables, these odds ratios correspond to a unit increase in the risk factors.

To customize odds ratios for specific units of change for a continuous risk factor, you can use the UNITS statement to specify a list of relevant units for each explanatory variable in the model. Estimates of these customized odds ratios are given in a separate table. Let (Lj, Uj) be a confidence interval for log(ψ). The corresponding lower and upper confidence limits for the customized odds ratio exp(cβj) are exp(cLj) and exp(cUj), respectively (for c > 0), or exp(cUj) and exp(cLj), respectively (for c < 0). You use the CLODDS option to request the confidence intervals for the odds ratios.

For a generalized logit model, odds ratios are computed similarly, except D odds ratios are computed for each effect, corresponding to the D logits in the model.
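A minimal sketch of these requests follows, assuming a hypothetical data set with a continuous risk factor BodyWeight measured in pounds; the option spellings follow the descriptions above:

proc surveylogistic data=MySurvey;   /* hypothetical data set */
   weight SamplingWeight;
   model Response(event='Yes') = BodyWeight Age / clodds;
   /* customized odds ratios for a 10-pound change in BodyWeight
      and a 5-year change in Age, reported in a separate table */
   units BodyWeight=10 Age=5;
run;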
Rank Correlation of Observed Responses and Predicted Probabilities The predicted mean score of an observation is the sum of the Ordered Values (shown in the Response Profile table) minus one, weighted by the corresponding Ppredicted D+1 probabilities for that observation; that is, the predicted means score= d=1 (d − 1)ˆ πd , where D+1 is the number of response levels and π ˆd is the predicted probability of the dth (ordered) response. A pair of observations with different observed responses is said to be concordant if the observation with the lower ordered response value has a lower predicted mean score than the observation with the higher ordered response value. If the observation with the lower ordered response value has a higher predicted mean score than the observation with the higher ordered response value, then the pair is discordant. If the pair is neither concordant nor discordant, it is a tie. Enumeration of the total numbers of concordant and discordant pairs is carried out by categorizing the predicted mean score into intervals of length D/500 and accumulating the corresponding frequencies of observations. Let N be the sum of observation frequencies in the data. Suppose there is a total of t pairs with different responses, nc of them are concordant, nd of them are discordant, and t − nc − nd of them are tied. PROC SURVEYLOGISTIC computes the following four indices of rank correlation for assessing the predictive ability of a model: c = (nc + 0.5(t − nc − nd ))/t Somers’ D = (nc − nd )/t Goodman-Kruskal Gamma = (nc − nd )/(nc + nd ) Kendall’s Tau-a = (nc − nd )/(0.5N (N − 1)) Note that c also gives an estimate of the area under the receiver operating characteristic (ROC) curve when the response is binary (Hanley and McNeil 1982). For binary responses, the predicted mean score is equal to the predicted probability for Ordered Value 2. As such, the preceding definition of concordance is consistent with the definition used in previous releases for the binary response model.
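Because these four indices are simple functions of the pair counts, they are easy to evaluate directly. The following DATA step sketch uses made-up values of nc, nd, t, and N just to illustrate the arithmetic:

data RankCorr;
   nc = 3000; nd = 1000; t = 4500; N = 200;   /* hypothetical pair counts */
   c       = (nc + 0.5*(t - nc - nd)) / t;    /* = 0.7222 */
   SomersD = (nc - nd) / t;                   /* = 0.4444 */
   Gamma   = (nc - nd) / (nc + nd);           /* = 0.5    */
   TauA    = (nc - nd) / (0.5*N*(N - 1));     /* = 0.1005 */
run;

proc print data=RankCorr;
run;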
Output

Displayed Output

The displayed output of the SURVEYLOGISTIC procedure includes the following:

• name of the input Data Set
• name and label of the Response Variable if the single-trial syntax is used
• number of Response Levels
• name of the Events Variable if the events/trials syntax is used
• name of the Trials Variable if the events/trials syntax is used
• Number of Observations read from the input data set
• Number of Observations used in the analysis
• name of the Frequency Variable if the FREQ statement is specified
• Sum of Frequencies of all the observations read from the input data set
• Sum of Frequencies of all the observations used in the analysis
• name of the Weight Variable if the WEIGHT statement is specified
• Sum of Weights of all the observations read from the input data set
• Sum of Weights of all the observations used in the analysis
• name of the Offset Variable if the OFFSET= option is specified
• name(s) of the stratification variable(s) if the STRATA statement is specified
• total number of strata if the STRATA statement is specified
• name(s) of the cluster variable(s) if the CLUSTER statement is specified
• total number of clusters if the CLUSTER statement is specified
• Link Function
• variance adjustment method
• parameters used in the VADJUST=MOREL option if this option is specified
• "Response Profile" table, which gives, for each response level, the ordered value (an integer between one and the number of response levels, inclusive); the value of the response variable if the single-trial syntax is used or the values "EVENT" and "NO EVENT" if the events/trials syntax is used; the count or frequency; and the sum of weights if the WEIGHT statement is specified
• "Class Level Information" table, which gives the level and the design variables for each CLASS explanatory variable
• "Maximum Likelihood Iterative Phase" table, which gives the iteration number, the step size (in the scale of 1.0, .5, .25, and so on) or the ridge value, −2 log likelihood, and parameter estimates for each iteration. Also displayed are the last evaluation of the gradient vector and the last change in the −2 log likelihood. You need to use the ITPRINT option in the MODEL statement to obtain this table
• score test result for testing the parallel lines assumption, if an ordinal response model is fitted. If LINK=CLOGLOG or LINK=PROBIT, this test is labeled "Score Test for the Parallel Slopes Assumption." The proportional odds assumption is a special case of the parallel lines assumption when LINK=LOGIT. In this case, the test is labeled "Score Test for the Proportional Odds Assumption"
• "Model Fit Statistics" and "Testing Global Null Hypothesis: BETA=0" tables, which give the various criteria (−2 Log L, AIC, SC) based on the likelihood for fitting a model with intercepts only and for fitting a model with intercepts and explanatory variables. If you specify the NOINT option, these statistics are calculated without considering the intercept parameters. The third column of the table gives the chi-square statistics and p-values for the −2 Log L statistic and for the Score statistic. These test the joint effect of the explanatory
variables included in the model. The Score criterion is always missing for the models identified by the first two columns of the table. Note also that the first two rows of the Chi-Square column are always missing, since tests cannot be performed for AIC and SC
• generalized R2 measures for the fitted model if you specify the RSQUARE option in the MODEL statement
• "Type III Analysis of Effects" table if the model contains an effect involving a CLASS variable. This table gives the Wald Chi-square statistic, the degrees of freedom, and the p-value for each effect in the model
• "Analysis of Maximum Likelihood Estimates" table, which includes
  − maximum likelihood estimate of the parameter
  − estimated standard error of the parameter estimate, computed as the square root of the corresponding diagonal element of the estimated covariance matrix
  − Wald chi-square statistic, computed by squaring the ratio of the parameter estimate divided by its standard error estimate
  − p-value of the Wald chi-square statistic with respect to a chi-square distribution with one degree of freedom
  − standardized estimate for the slope parameter, given by β̂i/(s/si), where si is the total sample standard deviation for the ith explanatory variable and

       s = π/√3   Logistic
           1      Normal
           π/√6   Extreme-value

    You need to specify the STB option in the MODEL statement to obtain these estimates. Standardized estimates of the intercept parameters are set to missing.
  − value of exp(β̂i) for each slope parameter βi if you specify the EXPB option in the MODEL statement. For continuous variables, this is equivalent to the estimated odds ratio for a 1 unit change.
  − label of the variable (if space permits) if you specify the PARMLABEL option in the MODEL statement. Due to constraints on the line size, the variable label may be suppressed in order to display the table in one panel. Use the SAS system option LINESIZE= to specify a larger line size to accommodate variable labels. A shorter line size can break the table into two panels allowing labels to be displayed.
• "Odds Ratio Estimates" table, which contains the odds ratio estimates and the corresponding 95% Wald confidence intervals. For continuous explanatory variables, these odds ratios correspond to a unit increase in the risk factors.
• measures of association between predicted probabilities and observed responses, which include a breakdown of the number of pairs with different responses, and four rank correlation indexes: Somers' D, Goodman-Kruskal Gamma, Kendall's Tau-a, and c
• confidence intervals for all the parameters if you use the CLPARM option in the MODEL statement
• confidence intervals for all the odds ratios if you use the CLODDS option in the MODEL statement
• "Analysis of Effects not in the Model" table, which gives the score chi-square statistic for testing the significance of each variable not in the model after adjusting for the variables already in the model, and the p-value of the chi-square statistic with respect to a chi-square distribution with one degree of freedom. You specify the DETAILS option in the MODEL statement to obtain this table.
• estimated covariance matrix of the parameter estimates if you use the COVB option in the MODEL statement
• estimated correlation matrix of the parameter estimates if you use the CORRB option in the MODEL statement
• "Linear Hypothesis Testing" table, which gives the result of the Wald test for each TEST statement (if specified)
ODS Table Names

PROC SURVEYLOGISTIC assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 69.2. ODS Tables Produced in PROC SURVEYLOGISTIC

   ODS Table Name        Description                                            Statement          Option
   ClassLevelInfo        CLASS variable levels and design variables             MODEL              default (with CLASS vars)
   CLOdds                Wald's confidence limits for odds ratios               MODEL              CLODDS
   CLparmWald            Wald's confidence limits for parameters                MODEL              CLPARM
   ContrastCoeff         L matrix from CONTRAST                                 CONTRAST           E
   ContrastEstimate      Estimates from CONTRAST                                CONTRAST           ESTIMATE=
   ContrastTest          Wald test for CONTRAST                                 CONTRAST           default
   ConvergenceStatus     Convergence status                                     MODEL              default
   CorrB                 Estimated correlation matrix of parameter estimators   MODEL              CORRB
   CovB                  Estimated covariance matrix of parameter estimators    MODEL              COVB
   CumulativeModelTest   Test of the cumulative model assumption                MODEL              (ordinal response)
   DesignSummary         Design summary                                         STRATA | CLUSTER   default
   FitStatistics         Model fit statistics                                   MODEL              default
   GlobalTests           Test for global null hypothesis                        MODEL              default
   IterHistory           Iteration history                                      MODEL              ITPRINT
   LastGradient          Last evaluation of gradient                            MODEL              ITPRINT
   LogLikeChange         Final change in the log likelihood                     MODEL              ITPRINT
   ModelInfo             Model information                                      PROC               default
   NObs                  Number of observations                                 PROC               default
   OddsRatios            Odds ratios                                            MODEL              default
   ParameterEstimates    Maximum likelihood estimates of model parameters       MODEL              default
   RSquare               R-square                                               MODEL              RSQUARE
   ResponseProfile       Response profile                                       PROC               default
   SimpleStatistics      Summary statistics for explanatory variables           PROC               SIMPLE
   StrataInfo            Stratum information                                    STRATA             LIST
   TestPrint1            L[cov(b)]L' and Lb-c                                   TEST               PRINT
   TestPrint2            Ginv(L[cov(b)]L') and Ginv(L[cov(b)]L')(Lb-c)          TEST               PRINT
   TestStmts             Linear hypotheses testing results                      TEST               default
   TypeIII               Type III tests of effects                              MODEL              default (with CLASS variables)
By referring to the names of such tables, you can use the ODS OUTPUT statement to place one or more of these tables in output data sets.
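For instance, the following sketch (the data set and model variables are hypothetical; the table names come from Table 69.2) saves two of these tables in SAS data sets named Parms and ORs:

proc surveylogistic data=MySurvey;        /* hypothetical data set */
   weight SamplingWeight;
   model Response(event='Yes') = X1 X2;
   ods output ParameterEstimates=Parms    /* table names from Table 69.2 */
              OddsRatios=ORs;
run;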
Examples

Example 69.1. Logistic Regression with Different Link Functions for Stratified Cluster Sampling

A market research firm conducts a survey among undergraduate students at a certain university to evaluate three new Web designs for a commercial Web site targeting undergraduate students at the university. The sample design is a stratified sample where strata are students' classes. Within each class, 300 students are randomly selected using simple random sampling without replacement. The total number of students in each class in the fall semester of 2001 is shown in the following table:

   Class           Enrollment
   1 - Freshman         3,734
   2 - Sophomore        3,565
   3 - Junior           3,903
   4 - Senior           4,196
This total enrollment information is saved in the SAS data set Enrollment using the following SAS statements:

proc format;
   value Class 1='Freshman'
               2='Sophomore'
               3='Junior'
               4='Senior';
run;

data Enrollment;
   format Class Class.;
   input Class _TOTAL_;
   datalines;
1 3734
2 3565
3 3903
4 4196
;
In the data set Enrollment, the variable _TOTAL_ contains the enrollment figures for all classes. They are also the population size for each stratum in this example.

Each student selected in the sample evaluates one randomly selected Web design using the following scale:

   1   dislike very much
   2   dislike
   3   neutral
   4   like
   5   like very much
The survey results are collected and shown in the following table, with the three different Web designs coded as A, B, and C.

   Evaluation of New Web Designs
                           Rating Counts
   Strata       Design     1     2     3     4     5
   Freshman     A         10    34    35    16    15
                B          8    21    23    26    22
                C          5    10    24    30    21
   Sophomore    A          1    14    25    23    37
                B         11    14    20    34    21
                C         16    19    30    23    12
   Junior       A         19    12    26    18    25
                B         11    14    24    33    18
                C         10    18    32    23    17
   Senior       A          8    15    35    30    12
                B         15    22    34     9    20
                C          2    34    30    18    16
The survey results are stored in a SAS data set WebSurvey using the following SAS statements.

proc format;
   value Design 1='A'
                2='B'
                3='C';
   value Rating 1='dislike very much'
                2='dislike'
                3='neutral'
                4='like'
                5='like very much';
run;

data WebSurvey;
   format Class Class. Design Design. Rating Rating.;
   do Class=1 to 4;
      do Design=1 to 3;
         do Rating=1 to 5;
            input Count @@;
            output;
         end;
      end;
   end;
   datalines;
10 34 35 16 15  8 21 23 26 22  5 10 24 30 21
 1 14 25 23 37 11 14 20 34 21 16 19 30 23 12
19 12 26 18 25 11 14 24 33 18 10 18 32 23 17
 8 15 35 30 12 15 22 34  9 20  2 34 30 18 16
;

data WebSurvey;
   set WebSurvey;
   if Class=1 then Weight=3734/300;
   if Class=2 then Weight=3565/300;
   if Class=3 then Weight=3903/300;
   if Class=4 then Weight=4196/300;
run;
The data set WebSurvey contains the variables Class, Design, Rating, Count, and Weight. The variable Class is the stratum variable, with four strata: freshman, sophomore, junior, and senior. The variable Design specifies the three new Web designs: A, B, and C. The variable Rating contains students' evaluations for the new Web designs. The variable Count gives the frequency with which each Web design received each rating within each stratum. The variable Weight contains the sampling weights, which are the reciprocals of selection probabilities in this example.
Output 69.1.1. Web Design Survey Sample (First 20 Observations)

   Obs   Class       Design   Rating              Count    Weight
     1   Freshman    A        dislike very much      10   12.4467
     2   Freshman    A        dislike                34   12.4467
     3   Freshman    A        neutral                35   12.4467
     4   Freshman    A        like                   16   12.4467
     5   Freshman    A        like very much         15   12.4467
     6   Freshman    B        dislike very much       8   12.4467
     7   Freshman    B        dislike                21   12.4467
     8   Freshman    B        neutral                23   12.4467
     9   Freshman    B        like                   26   12.4467
    10   Freshman    B        like very much         22   12.4467
    11   Freshman    C        dislike very much       5   12.4467
    12   Freshman    C        dislike                10   12.4467
    13   Freshman    C        neutral                24   12.4467
    14   Freshman    C        like                   30   12.4467
    15   Freshman    C        like very much         21   12.4467
    16   Sophomore   A        dislike very much       1   11.8833
    17   Sophomore   A        dislike                14   11.8833
    18   Sophomore   A        neutral                25   11.8833
    19   Sophomore   A        like                   23   11.8833
    20   Sophomore   A        like very much         37   11.8833
Output 69.1.1 shows the first 20 observations of the data set. The following SAS statements perform the logistic regression.

proc surveylogistic data=WebSurvey total=Enrollment;
   stratum Class;
   freq Count;
   class Design;
   model Rating (order=internal) = Design;
   weight Weight;
run;
The PROC statement invokes PROC SURVEYLOGISTIC. The TOTAL= option specifies the data set Enrollment, which contains the population totals in the strata. The population totals are used to calculate the finite population correction factor in the variance estimates. The response variable Rating is in the ordinal scale. A cumulative logit model is used to investigate the responses to the Web designs. In the MODEL statement, Rating is the response variable, and Design is the effect in the regression model. The ORDER=INTERNAL option is used for the response variable Rating to sort the ordinal response levels of Rating by its internal (numerical) values rather than by the formatted values (e.g., "like very much"). Because the sample design involves stratified simple random sampling, the STRATA statement is used to specify the stratification variable Class. The WEIGHT statement specifies the variable Weight for sampling weights.
Output 69.1.2. Web Design Survey, Model Information

   The SURVEYLOGISTIC Procedure

   Model Information
   Data Set                       WORK.WEBSURVEY
   Response Variable              Rating
   Number of Response Levels      5
   Frequency Variable             Count
   Stratum Variable               Class
   Number of Strata               4
   Weight Variable                Weight
   Model                          Cumulative Logit
   Optimization Technique         Fisher's Scoring
   Variance Adjustment            Degrees of Freedom (DF)
   Finite Population Correction   Used

   Response Profile
   Ordered Value   Rating              Total Frequency   Total Weight
   1               dislike very much               116      1489.0733
   2               dislike                         227      2933.0433
   3               neutral                         338      4363.3767
   4               like                            283      3606.8067
   5               like very much                  236      3005.7000

   Probabilities modeled are cumulated over the lower Ordered Values.
The sample and analysis summary is shown in Output 69.1.2. There are five response levels for Rating, with 'dislike very much' as the lowest ordered value. The regression model is modeling lower cumulative probabilities using logit as the link function. Because the TOTAL= option is used, the finite population correction is included in the variance estimation. The sampling weight is also used in the analysis.

Output 69.1.3. Web Design Survey, Testing the Proportional Odds Assumption

   Score Test for the Proportional Odds Assumption

   Chi-Square    DF    Pr > ChiSq
      98.1957     6        <.0001
In Output 69.1.3, the score chi-square for testing the proportional odds assumption is 98.1957, which is highly significant. This indicates that the cumulative logit model may not adequately fit the data. An alternative model is to use the generalized logit model with the LINK=GLOGIT option as shown in the following SAS statements:

proc surveylogistic data=WebSurvey total=Enrollment;
   stratum Class;
   freq Count;
   class Design;
   model Rating (ref='neutral') = Design / link=glogit;
   weight Weight;
run;
The REF='neutral' option is used for the response variable Rating to indicate that all other response levels are referenced to the level 'neutral.' The LINK=GLOGIT option requests that the procedure fit a generalized logit model.

Output 69.1.4. Web Design Survey, Model Information

   The SURVEYLOGISTIC Procedure

   Model Information
   Data Set                       WORK.WEBSURVEY
   Response Variable              Rating
   Number of Response Levels      5
   Frequency Variable             Count
   Stratum Variable               Class
   Number of Strata               4
   Weight Variable                Weight
   Model                          Generalized Logit
   Optimization Technique         Fisher's Scoring
   Variance Adjustment            Degrees of Freedom (DF)
   Finite Population Correction   Used

   Response Profile
   Ordered Value   Rating              Total Frequency   Total Weight
   1               dislike                         227      2933.0433
   2               dislike very much               116      1489.0733
   3               like                            283      3606.8067
   4               like very much                  236      3005.7000
   5               neutral                         338      4363.3767

   Logits modeled use Rating='neutral' as the reference category.
The summary of the analysis is shown in Output 69.1.4, which indicates that the generalized logit model is used in the analysis.

Output 69.1.5. Web Design Survey, Class Level Information

   Class Level Information
   Class    Value   Design Variables
   Design   A         1    0
            B         0    1
            C        -1   -1
Output 69.1.5 shows the parameterization for the main effect Design. Output 69.1.6. Web Design Survey, Parameter and Odds Ratio Estimates Analysis of Maximum Likelihood Estimates
Parameter
Rating
Intercept Intercept Intercept Intercept Design Design Design Design Design Design Design Design
dislike dislike very much like like very much dislike dislike very much like like very much dislike dislike very much like like very much
A A A A B B B B
DF
Estimate
Standard Error
Wald Chi-Square
Pr > ChiSq
1 1 1 1 1 1 1 1 1 1 1 1
-0.3964 -1.0826 -0.1892 -0.3767 -0.0942 -0.0647 -0.1370 0.0446 0.0391 0.2721 0.1669 0.1420
0.0832 0.1045 0.0780 0.0824 0.1166 0.1469 0.1104 0.1130 0.1201 0.1448 0.1102 0.1174
22.7100 107.3889 5.8888 20.9223 0.6518 0.1940 1.5400 0.1555 0.1057 3.5294 2.2954 1.4641
<.0001 <.0001 0.0152 <.0001 0.4195 0.6596 0.2146 0.6933 0.7451 0.0603 0.1298 0.2263
Odds Ratio Estimates
Effect Design Design Design Design Design Design Design Design
Rating A A A A B B B B
vs vs vs vs vs vs vs vs
C C C C C C C C
dislike dislike very much like like very much dislike dislike very much like like very much
Point Estimate 0.861 1.153 0.899 1.260 0.984 1.615 1.218 1.389
95% Wald Confidence Limits 0.583 0.692 0.618 0.851 0.659 0.975 0.838 0.925
1.272 1.923 1.306 1.865 1.471 2.675 1.768 2.086
The parameter and odds ratio estimates are shown in Output 69.1.6. For each odds ratio estimate, the 95% confidence limits shown in the table contain the value 1.0. Therefore, no conclusion can be made based on this survey about which Web design is preferred.
Example 69.2. The Household Component of the Medical Expenditure Panel Survey (MEPS)

The Household Component of the Medical Expenditure Panel Survey (MEPS-HC) is designed to produce national and regional estimates of the health care use, expenditures, sources of payment, and insurance coverage of the U.S. civilian noninstitutionalized population (MEPS Fact Sheet, 2001). The sample design of the survey includes stratification, clustering, multiple stages of selection, and disproportionate sampling. Furthermore, the MEPS sampling weights reflect adjustments for survey nonresponse and adjustments to population control totals from the Current Population Survey (Computing Standard Errors for MEPS Estimates, 2003).

In this example, the 1999 full-year consolidated data file HC-038 (PUF Data Files, 2002) from the MEPS is used to investigate the relationship between medical insurance coverage and the demographic variables. The data can be downloaded directly from the Agency for Healthcare Research and Quality (AHRQ) Web site (http://www.meps.ahrq.gov/Puf/PufDetail.asp?ID=93) in either ASCII format or SAS transport format. The Web site includes a detailed description of the data as well as the SAS program code used to access and to format it.

For this example, the SAS transport format data file for HC-038 is downloaded to 'C:\H38.ssp' on a Windows-based PC. The instructions on the Web site lead to the following SAS statements for creating a SAS data set named MEPS, which contains only the sample design variables and other variables necessary for this analysis.

proc format;
   value racex     -9 = 'NOT ASCERTAINED'
                   -8 = 'DK'
                   -7 = 'REFUSED'
                   -1 = 'INAPPLICABLE'
                    1 = 'AMERICAN INDIAN'
                    2 = 'ALEUT, ESKIMO'
                    3 = 'ASIAN OR PACIFIC ISLANDER'
                    4 = 'BLACK'
                    5 = 'WHITE'
                   91 = 'OTHER';
   value sex       -9 = 'NOT ASCERTAINED'
                   -8 = 'DK'
                   -7 = 'REFUSED'
                   -1 = 'INAPPLICABLE'
                    1 = 'MALE'
                    2 = 'FEMALE';
   value povcat9h   1 = 'NEGATIVE OR POOR'
                    2 = 'NEAR POOR'
                    3 = 'LOW INCOME'
                    4 = 'MIDDLE INCOME'
                    5 = 'HIGH INCOME';
   value inscov9f   1 = 'ANY PRIVATE'
                    2 = 'PUBLIC ONLY'
                    3 = 'UNINSURED';
run;

libname puflib 'C:\';
filename in1 'C:\H38.ssp';

proc xcopy in=in1 out=puflib import;
run;

data meps;
   set puflib.H38;
   label racex= sex= inscov99= povcat99=
         varstr99= varpsu99= perwt99f= totexp99=;
   format racex racex. sex sex.
          povcat99 povcat9h. inscov99 inscov9f.;
   keep inscov99 sex racex povcat99 varstr99
        varpsu99 perwt99f totexp99;
run;
There are a total of 24,618 observations in this SAS data set. Each observation corresponds to a person in the survey. The stratification variable is VARSTR99, which identifies the 143 strata in the sample. The variable VARPSU99 identifies the 460 PSUs in the sample. The sampling weights are stored in the variable PERWT99F. The response variable is the health insurance coverage indicator variable, INSCOV99, which has three values:

   1   the person had any private insurance coverage any time during 1999
   2   the person had only public insurance coverage during 1999
   3   the person was uninsured during all of 1999
The demographic variables include gender (SEX), race (RACEX), and family income level as a percent of the poverty line (POVCAT99). The variable RACEX has five categories:

   1   American Indian
   2   Aleut, Eskimo
   3   Asian or Pacific Islander
   4   Black
   5   White
The variable POVCAT99 is constructed by dividing family income by the applicable poverty line (based on family size and composition), with the resulting percentages grouped into five categories:

   1   negative or poor (less than 100%)
   2   near poor (100% to less than 125%)
   3   low income (125% to less than 200%)
   4   middle income (200% to less than 400%)
   5   high income (greater than or equal to 400%)
The data set also contains the total health care expenditure in 1999, TOTEXP99, which is used as a covariate in the analysis.
Output 69.2.1. 1999 Full-year MEPS (First 30 Observations)

   Obs  SEX     RACEX  POVCAT99          INSCOV99     TOTEXP99  PERWT99F  VARSTR99  VARPSU99
     1  MALE    WHITE  MIDDLE INCOME     PUBLIC ONLY      2735  14137.86       131         2
     2  FEMALE  WHITE  MIDDLE INCOME     ANY PRIVATE      6687  17050.99       131         2
     3  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE        60  35737.55       131         2
     4  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE        60  35862.67       131         2
     5  FEMALE  WHITE  MIDDLE INCOME     ANY PRIVATE       786  19407.11       131         2
     6  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE       345  18499.83       131         2
     7  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE       680  18499.83       131         2
     8  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE      3226  22394.53       136         1
     9  FEMALE  WHITE  MIDDLE INCOME     ANY PRIVATE      2852  27008.96       136         1
    10  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE       112  25108.71       136         1
    11  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE      3179  17569.81       136         1
    12  MALE    WHITE  MIDDLE INCOME     ANY PRIVATE       168  21478.06       136         1
    13  FEMALE  WHITE  MIDDLE INCOME     ANY PRIVATE      1066  21415.68       136         1
    14  MALE    WHITE  NEGATIVE OR POOR  PUBLIC ONLY         0  12254.66       125         1
    15  MALE    WHITE  NEGATIVE OR POOR  ANY PRIVATE         0  17699.75       125         1
    16  FEMALE  WHITE  NEGATIVE OR POOR  UNINSURED           0  18083.15       125         1
    17  MALE    BLACK  NEGATIVE OR POOR  PUBLIC ONLY       230   6537.97        78        10
    18  MALE    WHITE  LOW INCOME        UNINSURED         408   8951.36        95         2
    19  FEMALE  WHITE  LOW INCOME        UNINSURED           0  11833.00        95         2
    20  MALE    WHITE  LOW INCOME        UNINSURED          40  12754.07        95         2
    21  FEMALE  WHITE  LOW INCOME        UNINSURED          51  14698.57        95         2
    22  MALE    WHITE  LOW INCOME        UNINSURED           0   3890.20        92        19
    23  FEMALE  WHITE  LOW INCOME        UNINSURED         610   5882.29        92        19
    24  MALE    WHITE  LOW INCOME        PUBLIC ONLY        24   8610.47        92        19
    25  FEMALE  BLACK  MIDDLE INCOME     UNINSURED        1758      0.00        64         1
    26  MALE    BLACK  MIDDLE INCOME     PUBLIC ONLY       551   7049.70        64         1
    27  MALE    BLACK  MIDDLE INCOME     ANY PRIVATE        65  34067.03        64         1
    28  FEMALE  BLACK  NEGATIVE OR POOR  PUBLIC ONLY         0   9313.84        73        12
    29  FEMALE  BLACK  NEGATIVE OR POOR  PUBLIC ONLY        10  14697.03        73        12
    30  MALE    BLACK  NEGATIVE OR POOR  PUBLIC ONLY         0   4574.73        73        12
Output 69.2.1 displays the first 30 observations of this data set. The following SAS statements fit a generalized logit model for the 1999 full-year consolidated MEPS data.

proc surveylogistic data=meps;
   stratum VARSTR99;
   cluster VARPSU99;
   weight PERWT99F;
   class SEX RACEX POVCAT99;
   model INSCOV99 = TOTEXP99 SEX RACEX POVCAT99 / link=glogit;
run;
The STRATUM statement specifies the stratification variable VARSTR99. The CLUSTER statement specifies the PSU variable VARPSU99. The WEIGHT statement specifies the sample weight variable PERWT99F. The demographic variables
SEX, RACEX, and POVCAT99 are listed in the CLASS statement to indicate that they are categorical independent variables in the MODEL statement. In the MODEL statement, the response variable is INSCOV99, and the independent variables are TOTEXP99 along with the selected demographic variables. The LINK= option requests the procedure to fit the generalized logit model because the response variable INSCOV99 has nominal responses. The results of this analysis are shown in the following tables.

Output 69.2.2. MEPS, Model Information

   The SURVEYLOGISTIC Procedure

   Model Information
   Data Set                    WORK.MEPS
   Response Variable           INSCOV99
   Number of Response Levels   3
   Stratum Variable            VARSTR99
   Number of Strata            143
   Cluster Variable            VARPSU99
   Number of Clusters          460
   Weight Variable             PERWT99F
   Model                       Generalized Logit
   Optimization Technique      Fisher's Scoring
   Variance Adjustment         Degrees of Freedom (DF)
PROC SURVEYLOGISTIC lists the model fitting information and sample design information in Output 69.2.2.

Output 69.2.3. MEPS, Number of Observations

   Number of Observations Read       24618
   Number of Observations Used       23565
   Sum of Weights Read            2.7641E8
   Sum of Weights Used            2.7641E8
Output 69.2.3 displays the number of observations and the total of sampling weights both in the data set and used in the analysis. Only the observations with positive person-level weight are used in the analysis. Therefore, 1,053 observations with zero person-level weights were deleted.
Output 69.2.4. MEPS, Response Profile

   Response Profile
   Ordered Value   INSCOV99      Total Frequency   Total Weight
   1               ANY PRIVATE             16130      204403997
   2               PUBLIC ONLY              4241       41809572
   3               UNINSURED                3194       30197198

   Logits modeled use INSCOV99='UNINSURED' as the reference category.
Output 69.2.4 lists the three insurance coverage levels for the response variable INSCOV99. The "UNINSURED" category is used as the reference category in the model.

Output 69.2.5. MEPS, Classification Levels

   Class Level Information
   Class      Value                       Design Variables
   SEX        FEMALE                       1
              MALE                        -1
   RACEX      ALEUT, ESKIMO                1    0    0    0
              AMERICAN INDIAN              0    1    0    0
              ASIAN OR PACIFIC ISLANDER    0    0    1    0
              BLACK                        0    0    0    1
              WHITE                       -1   -1   -1   -1
   POVCAT99   HIGH INCOME                  1    0    0    0
              LOW INCOME                   0    1    0    0
              MIDDLE INCOME                0    0    1    0
              NEAR POOR                    0    0    0    1
              NEGATIVE OR POOR            -1   -1   -1   -1
Output 69.2.5 shows the parameterization in the regression model for each categorical independent variable.
Output 69.2.6. MEPS, Parameter Estimates

   Analysis of Maximum Likelihood Estimates
                                                                      Standard       Wald
   Parameter                         INSCOV99      DF   Estimate      Error      Chi-Square   Pr > ChiSq
   Intercept                         ANY PRIVATE    1     2.7703      0.1892       214.3326       <.0001
   Intercept                         PUBLIC ONLY    1     1.9216      0.1547       154.2029       <.0001
   TOTEXP99                          ANY PRIVATE    1   0.000215    0.000071         9.1900       0.0024
   TOTEXP99                          PUBLIC ONLY    1   0.000241    0.000072        11.1515       0.0008
   SEX FEMALE                        ANY PRIVATE    1     0.1208      0.0248        23.7174       <.0001
   SEX FEMALE                        PUBLIC ONLY    1     0.1741      0.0308        31.9573       <.0001
   RACEX ALEUT, ESKIMO               ANY PRIVATE    1     7.1457      0.6981       104.7599       <.0001
   RACEX ALEUT, ESKIMO               PUBLIC ONLY    1     7.6303      0.5018       231.2565       <.0001
   RACEX AMERICAN INDIAN             ANY PRIVATE    1    -2.0904      0.2606        64.3323       <.0001
   RACEX AMERICAN INDIAN             PUBLIC ONLY    1    -1.8992      0.2897        42.9775       <.0001
   RACEX ASIAN OR PACIFIC ISLANDER   ANY PRIVATE    1    -1.8055      0.2308        61.1936       <.0001
   RACEX ASIAN OR PACIFIC ISLANDER   PUBLIC ONLY    1    -1.9914      0.2288        75.7282       <.0001
   RACEX BLACK                       ANY PRIVATE    1    -1.7517      0.1983        78.0413       <.0001
   RACEX BLACK                       PUBLIC ONLY    1    -1.7038      0.1693       101.3199       <.0001
   POVCAT99 HIGH INCOME              ANY PRIVATE    1     1.4560      0.0685       452.1841       <.0001
   POVCAT99 HIGH INCOME              PUBLIC ONLY    1    -0.6092      0.0903        45.5393       <.0001
   POVCAT99 LOW INCOME               ANY PRIVATE    1    -0.3066      0.0666        21.1762       <.0001
   POVCAT99 LOW INCOME               PUBLIC ONLY    1    -0.0239      0.0754         0.1007       0.7510
   POVCAT99 MIDDLE INCOME            ANY PRIVATE    1     0.6467      0.0587       121.1736       <.0001
   POVCAT99 MIDDLE INCOME            PUBLIC ONLY    1    -0.3496      0.0807        18.7732       <.0001
   POVCAT99 NEAR POOR                ANY PRIVATE    1    -0.8015      0.1076        55.4443       <.0001
   POVCAT99 NEAR POOR                PUBLIC ONLY    1     0.2985      0.0952         9.8308       0.0017

Output 69.2.6 displays the parameter estimates and their standard errors.
Output 69.2.7. MEPS, Odds Ratios

   Odds Ratio Estimates
                                                                       Point      95% Wald
   Effect                                          INSCOV99         Estimate      Confidence Limits
   TOTEXP99                                        ANY PRIVATE         1.000      1.000      1.000
   TOTEXP99                                        PUBLIC ONLY         1.000      1.000      1.000
   SEX FEMALE vs MALE                              ANY PRIVATE         1.273      1.155      1.403
   SEX FEMALE vs MALE                              PUBLIC ONLY         1.417      1.255      1.598
   RACEX ALEUT, ESKIMO vs WHITE                    ANY PRIVATE      >999.999   >999.999   >999.999
   RACEX ALEUT, ESKIMO vs WHITE                    PUBLIC ONLY      >999.999   >999.999   >999.999
   RACEX AMERICAN INDIAN vs WHITE                  ANY PRIVATE         0.553      0.340      0.901
   RACEX AMERICAN INDIAN vs WHITE                  PUBLIC ONLY         1.146      0.603      2.179
   RACEX ASIAN OR PACIFIC ISLANDER vs WHITE        ANY PRIVATE         0.735      0.500      1.082
   RACEX ASIAN OR PACIFIC ISLANDER vs WHITE        PUBLIC ONLY         1.045      0.656      1.665
   RACEX BLACK vs WHITE                            ANY PRIVATE         0.776      0.639      0.943
   RACEX BLACK vs WHITE                            PUBLIC ONLY         1.394      1.132      1.717
   POVCAT99 HIGH INCOME vs NEGATIVE OR POOR        ANY PRIVATE        11.595      9.301     14.455
   POVCAT99 HIGH INCOME vs NEGATIVE OR POOR        PUBLIC ONLY         0.274      0.213      0.353
   POVCAT99 LOW INCOME vs NEGATIVE OR POOR         ANY PRIVATE         1.990      1.607      2.464
   POVCAT99 LOW INCOME vs NEGATIVE OR POOR         PUBLIC ONLY         0.492      0.395      0.614
   POVCAT99 MIDDLE INCOME vs NEGATIVE OR POOR      ANY PRIVATE         5.162      4.200      6.343
   POVCAT99 MIDDLE INCOME vs NEGATIVE OR POOR      PUBLIC ONLY         0.356      0.280      0.451
   POVCAT99 NEAR POOR vs NEGATIVE OR POOR          ANY PRIVATE         1.213      0.903      1.630
   POVCAT99 NEAR POOR vs NEGATIVE OR POOR          PUBLIC ONLY         0.680      0.527      0.877

Output 69.2.7 displays the odds ratio estimates and their 95% Wald confidence limits. For example, after adjusting for the effects of sex, race, and total health care expenditures, a person with high income is estimated to be 11.595 times more likely than a poor person to choose private health care insurance over no insurance, but only 0.274 times as likely to choose public health insurance over no insurance.
References

Agresti, A. (1984), Analysis of Ordinal Categorical Data, New York: John Wiley & Sons, Inc.

Agresti, A. (1990), Categorical Data Analysis, New York: John Wiley & Sons, Inc.

Aitchison, J. and Silvey, S. D. (1957), "The Generalization of Probit Analysis to the Case of Multiple Responses," Biometrika, 44, 131–140.

Albert, A. and Anderson, J. A. (1984), "On the Existence of Maximum Likelihood Estimates in Logistic Regression Models," Biometrika, 71, 1–10.

Ashford, J. R. (1959), "An Approach to the Analysis of Data for Semi-Quantal Responses in Biological Assay," Biometrics, 15, 573–581.

Binder, D. A. (1981), "On the Variances of Asymptotically Normal Estimators from Complex Surveys," Survey Methodology, 7, 157–170.

Binder, D. A. (1983), "On the Variances of Asymptotically Normal Estimators from Complex Surveys," International Statistical Review, 51, 279–292.

Binder, D. A. and Roberts, G. R. (2003), "Design-based and Model-based Methods for Estimating Model Parameters," in Analysis of Survey Data, ed. C. Skinner and R. Chambers, New York: John Wiley & Sons, Inc.

Brick, J. M. and Kalton, G. (1996), "Handling Missing Data in Survey Research," Statistical Methods in Medical Research, 5, 215–238.

Cochran, W. G. (1977), Sampling Techniques, Third Edition, New York: John Wiley & Sons, Inc.

Collett, D. (1991), Modelling Binary Data, London: Chapman and Hall.

Computing Standard Errors for MEPS Estimates, January 2003, Agency for Healthcare Research and Quality, Rockville, MD, [http://www.meps.ahrq.gov/factsheets/FS_StandardErrors.htm].

Cox, D. R. and Snell, E. J. (1989), The Analysis of Binary Data, Second Edition, London: Chapman and Hall.

Fleiss, J. L. (1981), Statistical Methods for Rates and Proportions, Second Edition, New York: John Wiley & Sons, Inc.

Freeman, D. H., Jr. (1987), Applied Categorical Data Analysis, New York: Marcel Dekker, Inc.

Hanley, J. A. and McNeil, B. J. (1982), "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve," Radiology, 143, 29–36.

Hidiroglou, M. A., Fuller, W. A., and Hickman, R. D. (1980), SUPER CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Hosmer, D. W., Jr. and Lemeshow, S. (2000), Applied Logistic Regression, Second Edition, New York: John Wiley & Sons, Inc.

Kalton, G. and Kasprzyk, D. (1986), "The Treatment of Missing Survey Data," Survey Methodology, 12, 1–16.

Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons, Inc.

Korn, E. and Graubard, B. (1999), Analysis of Health Surveys, New York: John Wiley & Sons, Inc.

Lancaster, H. O. (1961), "Significance Tests in Discrete Distributions," Journal of the American Statistical Association, 56, 223–234.

Lehtonen, R. and Pahkinen, E. (1995), Practical Methods for Design and Analysis of Complex Surveys, Chichester: John Wiley & Sons, Inc.

McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, London: Chapman and Hall.

McFadden, D. (1974), "Conditional Logit Analysis of Qualitative Choice Behaviour," in Frontiers in Econometrics, ed. P. Zarembka, New York: Academic Press.

MEPS Fact Sheet, February 2001, Agency for Healthcare Research and Quality, Rockville, MD, [http://www.meps.ahrq.gov/whatismeps/bulletin.htm].

Morel, G. (1989), "Logistic Regression under Complex Survey Designs," Survey Methodology, 15, 203–223.

Nagelkerke, N. J. D. (1991), "A Note on a General Definition of the Coefficient of Determination," Biometrika, 78, 691–692.

Nelder, J. A. and Wedderburn, R. W. M. (1972), "Generalized Linear Models," Journal of the Royal Statistical Society, Series A, 135, 761–768.

PUF Data Files, October 2002, Agency for Healthcare Research and Quality, Rockville, MD, [http://www.meps.ahrq.gov/Puf/PufDetail.asp?ID=93].

Roberts, G., Rao, J. N. K., and Kumar, S. (1987), "Logistic Regression Analysis of Sample Survey Data," Biometrika, 74, 1–12.

Santner, T. J. and Duffy, E. D. (1986), "A Note on A. Albert and J. A. Anderson's Conditions for the Existence of Maximum Likelihood Estimates in Logistic Regression Models," Biometrika, 73, 755–758.

Skinner, C. J., Holt, D., and Smith, T. M. F. (1989), Analysis of Complex Surveys, New York: John Wiley & Sons, Inc.

Stokes, M. E., Davis, C. S., and Koch, G. G. (2000), Categorical Data Analysis Using the SAS System, Second Edition, Cary, NC: SAS Institute Inc.

Walker, S. H. and Duncan, D. B. (1967), "Estimation of the Probability of an Event as a Function of Several Independent Variables," Biometrika, 54, 167–179.
Chapter 70
The SURVEYMEANS Procedure

Chapter Contents

OVERVIEW . . . 4315
GETTING STARTED . . . 4315
   Simple Random Sampling . . . 4315
   Stratified Sampling . . . 4318
   Output Data Set . . . 4321
SYNTAX . . . 4322
   PROC SURVEYMEANS Statement . . . 4323
   BY Statement . . . 4328
   CLASS Statement . . . 4329
   CLUSTER Statement . . . 4329
   DOMAIN Statement . . . 4330
   RATIO Statement . . . 4330
   STRATA Statement . . . 4332
   VAR Statement . . . 4332
   WEIGHT Statement . . . 4333
DETAILS . . . 4333
   Missing Values . . . 4333
   Survey Data Analysis . . . 4334
      Specification of Population Totals and Sampling Rates . . . 4334
      Primary Sampling Units (PSUs) . . . 4335
      Domain Analysis . . . 4336
   Statistical Computations . . . 4336
      Definition and Notation . . . 4336
      Mean . . . 4337
      Ratio . . . 4339
      Proportions . . . 4341
      Total . . . 4341
   Output . . . 4345
      Output Data Sets . . . 4345
      Displayed Output . . . 4345
      ODS Table Names . . . 4349
EXAMPLES . . . 4350
   Example 70.1. Stratified Cluster Sample Design . . . 4350
   Example 70.2. Domain Analysis . . . 4354
   Example 70.3. Ratio Analysis . . . 4358
   Example 70.4. Analyzing Survey Data with Missing Values . . . 4358
REFERENCES . . . 4361
Chapter 70
The SURVEYMEANS Procedure

Overview

The SURVEYMEANS procedure produces estimates of survey population means and totals from sample survey data. The procedure also produces variance estimates, confidence limits, and other descriptive statistics. When computing these estimates, the procedure takes into account the sample design used to select the survey sample. The sample design can be a complex survey sample design with stratification, clustering, and unequal weighting.

PROC SURVEYMEANS uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. This method obtains a linear approximation for the estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971, Fuller 1975). When there are clusters, or primary sampling units (PSUs), in the sample design, the procedure estimates variance from the variation among PSUs. When the design is stratified, the procedure pools stratum variance estimates to compute the overall variance estimate.

PROC SURVEYMEANS uses the Output Delivery System (ODS) to place results in output data sets. This is a departure from older SAS procedures that provide OUTPUT statements for similar functionality.
Getting Started

This section demonstrates how you can use the SURVEYMEANS procedure to produce descriptive statistics from sample survey data. For a complete description of PROC SURVEYMEANS, please refer to the "Syntax" section on page 4322. The "Examples" section on page 4350 provides more complicated examples to illustrate the applications of PROC SURVEYMEANS.
Simple Random Sampling

This example illustrates how you can use PROC SURVEYMEANS to estimate population means and proportions from sample survey data. The study population is a junior high school with a total of 4,000 students in grades 7, 8, and 9. Researchers want to know how much these students spend weekly for ice cream, on average, and what percentage of students spend at least $10 weekly for ice cream.

To answer these questions, 40 students were selected from the entire student population using simple random sampling (SRS). Selection by simple random sampling means that all students have an equal chance of being selected, and no student can be selected more than once. Each student selected for the sample was asked how much
he spends for ice cream per week, on average. The SAS data set named IceCream saves the responses of the 40 students:

data IceCream;
   input Grade Spending @@;
   if (Spending < 10) then Group='less';
   else Group='more';
   datalines;
7  7  7  7  8 12  9 10  7  1  7 10  7  3  7  2  9 15  8 16
7  6  7  6  7  6  9 15  9  8  9  7  7  3  7 12  7  4  9 14
8 18  7  4  7 11  9  8  8 10  8 13  7  2  9  6  8 20  8 17
9  9  9 11  8 19  8 14  7  2  7  2  7  2  9  8  7  1  7  9
;
The variable Grade contains a student's grade. The variable Spending contains a student's response on how much he spends per week for ice cream, in dollars. The variable Group is created to indicate whether a student spends at least $10 weekly for ice cream: Group='more' if a student spends at least $10, or Group='less' if a student spends less than $10.

You can use PROC SURVEYMEANS to produce estimates for the entire student population, based on this random sample of 40 students:

title1 'Analysis of Ice Cream Spending';
title2 'Simple Random Sample Design';
proc surveymeans data=IceCream total=4000;
   var Spending Group;
run;
The PROC SURVEYMEANS statement invokes the procedure. The TOTAL=4000 option specifies the total number of students in the study population, or school. The procedure uses this total to adjust variance estimates for the effects of sampling from a finite population. The VAR statement names the variables to analyze, Spending and Group. Figure 70.1 displays the results from this analysis. There are a total of 40 observations used in the analysis. The “Class Level Information” table lists the two levels of the variable Group. This variable is a character variable, and so PROC SURVEYMEANS provides a categorical analysis for it, estimating the relative frequency or proportion for each level. If you want a categorical analysis for a numeric variable, you can name that variable in the CLASS statement.
Analysis of Ice Cream Spending
Simple Random Sample Design

The SURVEYMEANS Procedure

   Data Summary
   Number of Observations    40

   Class Level Information
   Class Variable    Levels    Values
   Group                  2    less more
Statistics Std Error Lower 95% Variable Level N Mean of Mean CL for Mean --------------------------------------------------------------------------------Spending 40 8.750000 0.845139 7.040545 Group less 23 0.575000 0.078761 0.415690 more 17 0.425000 0.078761 0.265690 --------------------------------------------------------------------------------Statistics Upper 95% Variable Level CL for Mean --------------------------------Spending 10.459455 Group less 0.734310 more 0.584310 ---------------------------------
Figure 70.1. Analysis of Ice Cream Spending, Simple Random Sample Design
The "Statistics" table displays the estimates for each analysis variable. By default, PROC SURVEYMEANS displays the number of observations, the estimate of the mean, its standard error, and 95% confidence limits for the mean. You can obtain other statistics by specifying the corresponding statistic-keywords in the PROC SURVEYMEANS statement. The estimate of the average weekly ice cream expense is $8.75 for students in this school. The standard error of this estimate is $0.85, and the 95% confidence interval for weekly ice cream expense is from $7.04 to $10.46. The analysis variable Group is a character variable, and so PROC SURVEYMEANS analyzes it as categorical, estimating the relative frequency or proportion for each level or category. These estimates are displayed in the Mean column of the "Statistics" table. It is estimated that 57.5% of all students spend less than $10 weekly on ice cream, while 42.5% of the students spend at least $10 weekly. The standard error of each estimate is 7.9%.
Stratified Sampling

Suppose that the sample of students described in the previous section was actually selected using stratified random sampling. In stratified sampling, the study population is divided into nonoverlapping strata, and samples are selected from each stratum independently. The list of students in this junior high school was stratified by grade, yielding three strata: grades 7, 8, and 9. A simple random sample of students was selected from each grade. Table 70.1 shows the total number of students in each grade.

   Table 70.1. Number of Students by Grade
   Grade    Number of Students
   7                     1,824
   8                     1,025
   9                     1,151
   Total                 4,000
To analyze this stratified sample, you need to provide the population totals for each stratum to PROC SURVEYMEANS. The SAS data set named StudentTotals contains the information from Table 70.1:

data StudentTotals;
   input Grade _total_;
   datalines;
7 1824
8 1025
9 1151
;
The variable Grade is the stratum identification variable, and the variable _TOTAL_ contains the total number of students for each stratum. PROC SURVEYMEANS requires you to use the variable name _TOTAL_ for the stratum population totals. The procedure uses the stratum population totals to adjust variance estimates for the effects of sampling from a finite population. If you do not provide population totals or sampling rates, then the procedure assumes that the proportion of the population in the sample is very small, and the computation does not involve a finite population correction.

In a stratified sample design, when the sampling rates in the strata are unequal, you need to use sampling weights to reflect this information in order to produce an unbiased mean estimator. In this example, the appropriate sampling weights are reciprocals of the probabilities of selection. You can use the following data step to create the sampling weights:

data IceCream;
   set IceCream;
   if Grade=7 then Prob=20/1824;
   if Grade=8 then Prob=9/1025;
   if Grade=9 then Prob=11/1151;
   Weight=1/Prob;
run;
If you use PROC SURVEYSELECT to select your sample, PROC SURVEYSELECT creates these sampling weights for you.

The following SAS statements perform the stratified analysis of the survey data:

title1 'Analysis of Ice Cream Spending';
title2 'Stratified Simple Random Sample Design';
proc surveymeans data=IceCream total=StudentTotals;
   stratum Grade / list;
   var Spending Group;
   weight Weight;
run;
The PROC SURVEYMEANS statement invokes the procedure. The DATA= option names the SAS data set IceCream as the input data set to be analyzed. The TOTAL= option names the data set StudentTotals as the input data set containing the stratum population totals. Comparing this to the analysis in the “Simple Random Sampling” section on page 4315, notice that the TOTAL=StudentTotals option is used here instead of the TOTAL=4000 option. In this stratified sample design, the population totals are different for different strata, and so you need to provide them to PROC SURVEYMEANS in a SAS data set. The STRATA statement identifies the stratification variable Grade. The LIST option in the STRATA statement requests that the procedure display stratum information. The WEIGHT statement tells the procedure that the variable Weight contains the sampling weights.

                 Analysis of Ice Cream Spending
             Stratified Simple Random Sample Design

                  The SURVEYMEANS Procedure

                        Data Summary

            Number of Strata                   3
            Number of Observations            40
            Sum of Weights                  4000
                  Class Level Information

            Class
            Variable      Levels    Values
            Group              2    less more
Figure 70.2. Data Summary
Figure 70.2 displays information on the input data set. There are three strata in the design, and 40 observations in the sample. The categorical variable Group has two levels, ‘less’ and ‘more’.
                 Analysis of Ice Cream Spending
             Stratified Simple Random Sample Design

                  The SURVEYMEANS Procedure

                     Stratum Information

Stratum           Population   Sampling
  Index   Grade        Total       Rate   N Obs   Variable   Level      N
---------------------------------------------------------------------------
      1   7             1824      1.10%      20   Spending             20
                                                  Group      less      17
                                                             more       3
      2   8             1025      0.88%       9   Spending              9
                                                  Group      less       0
                                                             more       9
      3   9             1151      0.96%      11   Spending             11
                                                  Group      less       6
                                                             more       5
---------------------------------------------------------------------------
Figure 70.3. Stratum Information
Figure 70.3 displays information for each stratum. The table displays a Stratum Index and the values of the STRATA variable. The Stratum Index identifies each stratum by a sequentially assigned number. For each stratum, the table gives the population total (total number of students), the sampling rate, and the sample size. The stratum sampling rate is the ratio of the number of students in the sample to the number of students in the population for that stratum. The table also lists each analysis variable and the number of stratum observations for that variable. For categorical variables, the table lists each level and the number of sample observations in that level.

                 Analysis of Ice Cream Spending
             Stratified Simple Random Sample Design

                  The SURVEYMEANS Procedure

                          Statistics

                                      Std Error    Lower 95%
Variable   Level     N        Mean      of Mean    CL for Mean
-----------------------------------------------------------------
Spending            40    9.141298     0.531799       8.063771
Group      less     23    0.544555     0.058424       0.426177
           more     17    0.455445     0.058424       0.337068
-----------------------------------------------------------------

                          Statistics

                              Upper 95%
           Variable   Level   CL for Mean
           -------------------------------
           Spending            10.218825
           Group      less      0.662932
                      more      0.573823
           -------------------------------
Figure 70.4. Analysis of Ice Cream Spending, Stratified SRS Design
Figure 70.4 shows that

   • the estimate of average weekly ice cream expense is $9.14 for students in this school, with a standard error of $0.53 and a 95% confidence interval from $8.06 to $10.22.

   • an estimated 54.5% of all students spend less than $10 weekly on ice cream, and 45.5% spend more, with a standard error of 5.8% for each estimate.
Output Data Set

PROC SURVEYMEANS uses the Output Delivery System (ODS) to create output data sets. This is a departure from older SAS procedures that provide OUTPUT statements for similar functionality. For more information on ODS, see Chapter 14, “Using the Output Delivery System.”

For example, to save the “Statistics” table shown in Figure 70.4 in the previous section in an output data set, you use the ODS OUTPUT statement as follows:

   title1 'Analysis of Ice Cream Spending';
   title2 'Stratified Simple Random Sample Design';
   proc surveymeans data=IceCream total=StudentTotals;
      stratum Grade / list;
      var Spending Group;
      weight Weight;
      ods output Statistics=MyStat;
   run;
The statement

   ods output Statistics=MyStat;
requests that the “Statistics” table that appears in Figure 70.4 be placed in a SAS data set named MyStat. The PRINT procedure displays observations of the data set MyStat:

   proc print data=MyStat;
   run;
Figure 70.5 displays the data set MyStat.
                 Analysis of Ice Cream Spending
             Stratified Simple Random Sample Design

       Var       Var                                 Lower       Upper
OBS    Name      Level    N        Mean    StdErr    CLMean      CLMean

 1     Spending          40    9.141298  0.531799    8.063771   10.218825
 2     Group     less    23    0.544555  0.058424    0.426177    0.662932
 3     Group     more    17    0.455445  0.058424    0.337068    0.573823
Figure 70.5. The Data Set MyStat
The section “ODS Table Names” on page 4349 gives the complete list of the tables produced by PROC SURVEYMEANS.
Syntax

The following statements are available in PROC SURVEYMEANS.
   PROC SURVEYMEANS < options > < statistic-keywords > ;
      BY variables ;
      CLASS variables ;
      CLUSTER variables ;
      DOMAIN variables < variable*variable variable*variable*variable ... > ;
      RATIO < 'label' > variables / variables ;
      STRATA variables < / option > ;
      VAR variables ;
      WEIGHT variable ;

The PROC SURVEYMEANS statement invokes the procedure. It optionally names the input data sets and specifies statistics for the procedure to compute. The PROC SURVEYMEANS statement is required. The VAR statement identifies the variables to be analyzed. The CLASS statement identifies those numeric variables that are to be analyzed as categorical variables. The STRATA statement lists the variables that form the strata in a stratified sample design. The CLUSTER statement specifies cluster identification variables in a clustered sample design. The DOMAIN statement lists the variables that define domains for subpopulation analysis. The RATIO statement requests ratio analysis for means or proportions of analysis variables. The WEIGHT statement names the sampling weight variable. You can use a BY statement with PROC SURVEYMEANS to obtain separate analyses for groups defined by the BY variables.

All statements can appear multiple times except the PROC SURVEYMEANS statement and the WEIGHT statement, which can appear only once. The rest of this section gives detailed syntax information for the BY, CLASS, CLUSTER, DOMAIN, RATIO, STRATA, VAR, and WEIGHT statements in alphabetical order after the description of the PROC SURVEYMEANS statement.
PROC SURVEYMEANS Statement

   PROC SURVEYMEANS < options > < statistic-keywords > ;

The PROC SURVEYMEANS statement invokes the procedure. In this statement, you identify the data set to be analyzed and specify sample design information. The DATA= option names the input data set to be analyzed. If your analysis includes a finite population correction factor, you can input either the sampling rate or the population total using the RATE= or TOTAL= option. If your design is stratified, with different sampling rates or totals for different strata, then you can input these stratum rates or totals in a SAS data set containing the stratification variables.

In the PROC SURVEYMEANS statement, you also can use statistic-keywords to specify statistics for the procedure to compute. Available statistics include the population mean and population total, together with their variance estimates and confidence limits. You can also request data set summary information and sample design information.

You can specify the following options in the PROC SURVEYMEANS statement:

ALPHA=α
sets the confidence level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default value is 0.05. A confidence level of α produces 100(1 − α)% confidence limits. The default of ALPHA=0.05 produces 95% confidence limits.
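For instance, the following minimal sketch (reusing the IceCream data and the 4,000-student population total from the earlier examples) requests 90% rather than the default 95% confidence limits for mean spending:

   proc surveymeans data=IceCream total=4000 alpha=0.1;
      var Spending;
   run;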
DATA=SAS-data-set

specifies the SAS data set to be analyzed by PROC SURVEYMEANS. If you omit the DATA= option, the procedure uses the most recently created SAS data set.
MISSING

requests that the procedure treat missing values as a valid category for all categorical variables, which include categorical analysis variables, strata variables, cluster variables, and domain variables.
ORDER=DATA | FORMATTED | INTERNAL

specifies the order in which the values of the categorical variables are to be reported. The following shows how PROC SURVEYMEANS interprets values of the ORDER= option:

DATA          orders values according to their order in the input data set.
FORMATTED     orders values by their formatted values. This order is operating environment dependent. By default, the order is ascending.
INTERNAL      orders values by their unformatted values, which yields the same order that the SORT procedure does. This order is operating environment dependent.
By default, ORDER=FORMATTED. The ORDER= option applies to all the categorical variables. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values.
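As a small sketch, the following statements report the levels of the character variable Group in the order in which they first appear in the input data set, instead of in the default formatted order:

   proc surveymeans data=IceCream total=4000 order=data;
      var Group;
   run;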
RATE=value | SAS-data-set
R=value | SAS-data-set

specifies the sampling rate as a nonnegative value, or names an input data set that contains the stratum sampling rates. The procedure uses this information to compute a finite population correction for variance estimation. If your sample design has multiple stages, you should specify the first-stage sampling rate, which is the ratio of the number of PSUs selected to the total number of PSUs in the population.

For a nonstratified sample design, or for a stratified sample design with the same sampling rate in all strata, you should specify a nonnegative value for the RATE= option. If your design is stratified with different sampling rates in the strata, then you should name a SAS data set that contains the stratification variables and the sampling rates. See the section “Specification of Population Totals and Sampling Rates” on page 4334 for more details.

The sampling rate value must be a nonnegative number. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYMEANS will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
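For illustration, here is a hedged sketch of the stratified form of this option for the ice cream example. The data set StudentRates is hypothetical (it is not created elsewhere in this chapter), but the variable name _RATE_ is the one the procedure requires, and the rates in percent form mirror the stratum sampling rates displayed in Figure 70.3:

   data StudentRates;
      input Grade _rate_;
      datalines;
   7 1.10
   8 0.88
   9 0.96
   ;

   proc surveymeans data=IceCream rate=StudentRates;
      strata Grade;
      var Spending Group;
      weight Weight;
   run;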
STACKING

requests the procedure to produce the output data sets using a stacking table structure, which was the default in releases prior to Version 9. The new default is to produce a rectangular table structure in the output data sets. The STACKING option affects the following tables:

   • Domain
   • Ratio
   • Statistics
   • StrataInfo
When you use the ODS statement to create SAS data sets for these tables in the output, the data set structure can be either stacking or rectangular. A rectangular structure creates one observation for each analysis variable in the data set. However, if you use the STACKING option in Version 9, the procedure creates only one observation in the output data set for all analysis variables. The following example shows these two structures in output data sets:

   data new;
      input sex$ x;
      datalines;
   M 12
   F 5
   M 13
   F 23
   F 11
   ;

   proc surveymeans data=new mean;
      ods output statistics=rectangle;
   run;
   proc print data=rectangle;
   run;

   proc surveymeans data=new mean stacking;
      ods output statistics=stacking;
   run;
   proc print data=stacking;
   run;
Figure 70.6 shows the rectangular structure of the output data set for the statistics table.

            rectangle structure in the output data set

       Var     Var
OBS    Name    Level          Mean        StdErr

 1     x                 12.800000      2.905168
 2     sex     F          0.600000      0.244949
 3     sex     M          0.400000      0.244949
Figure 70.6. Rectangular Structure in the Output Data Set
Figure 70.7 shows the stacking structure of the output data set for the statistics table.
             stacking structure in the output data set

OBS    x        x_Mean      x_StdErr    sex_F       sex_F_Mean

 1     x     12.800000      2.905168    sex=F         0.600000

OBS    sex_F_StdErr    sex_M    sex_M_Mean    sex_M_StdErr

 1         0.244949    sex=M      0.400000        0.244949
Figure 70.7. Stacking Structure in the Output Data Set

TOTAL=value | SAS-data-set
N=value | SAS-data-set
specifies the total number of primary sampling units (PSUs) in the study population as a positive value, or names an input data set that contains the stratum population totals. The procedure uses this information to compute a finite population correction for variance estimation. For a nonstratified sample design, or for a stratified sample design with the same population total in all strata, you should specify a positive value for the TOTAL= option. If your sample design is stratified with different population totals in the strata, then you should name a SAS data set that contains the stratification variables and the population totals. See the section “Specification of Population Totals and Sampling Rates” on page 4334 for more details.

If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
statistic-keywords

specifies the statistics for the procedure to compute. If you do not specify any statistic-keywords, PROC SURVEYMEANS computes the NOBS, MEAN, STDERR, and CLM statistics by default.

The statistics produced depend on the type of the analysis variable. If you name a numeric variable in the CLASS statement, then the procedure analyzes that variable as a categorical variable. The procedure always analyzes character variables as categorical. See the section “CLASS Statement” on page 4329 for more information. PROC SURVEYMEANS computes MIN, MAX, and RANGE for numeric variables but not for categorical variables. For numeric variables, the keyword MEAN produces the mean, but for categorical variables it produces the proportion in each category or level. Also for categorical variables, the keyword NOBS produces the number of observations for each variable level, and the keyword NMISS produces the number of missing observations for each level. If you request the keyword NCLUSTER for a categorical variable, PROC SURVEYMEANS displays for each level the number of clusters with observations in that level. PROC SURVEYMEANS computes SUMWGT in the same way for both categorical and numeric variables, as the sum of the weights over all nonmissing observations.
PROC SURVEYMEANS performs univariate analysis, analyzing each variable separately. Thus the number of nonmissing and missing observations may not be the same for all analysis variables. See the section “Missing Values” on page 4333 for more information.

If you use the keyword RATIO without the keyword MEAN, the keyword MEAN is implied. Other available statistics computed for a ratio are N, NCLU, SUMWGT, RATIO, STDERR, DF, T, PROBT, and CLM, as listed below. If no statistics are requested, the procedure will compute the ratio and its standard error by default for a RATIO statement.

The valid statistic-keywords are as follows:

ALL            all statistics listed

CLM            100(1 − α)% two-sided confidence limits for the MEAN, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05

CLSUM          100(1 − α)% two-sided confidence limits for the SUM, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05

CV             coefficient of variation for MEAN

CVSUM          coefficient of variation for SUM

DF             degrees of freedom for the t test

LCLM           100(1 − α)% one-sided lower confidence limit of the MEAN, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05

LCLMSUM        100(1 − α)% one-sided lower confidence limit of the SUM, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05

MAX            maximum value

MEAN           mean for a numeric variable, or the proportion in each category for a categorical variable

MIN            minimum value

NCLUSTER       number of clusters

NMISS          number of missing observations

NOBS           number of nonmissing observations

RANGE          range, MAX−MIN

RATIO          ratio of means or proportions

STD            standard deviation of the SUM. When you request SUM, the procedure computes STD by default.

STDERR         standard error of the MEAN or RATIO. When you request MEAN or RATIO, the procedure computes STDERR by default.
SUM            weighted sum, $\sum w_i y_i$, or estimated population total when the appropriate sampling weights are used

SUMWGT         sum of the weights, $\sum w_i$

T              t-value and its corresponding p-value with DF degrees of freedom for $H_0\colon \theta = 0$, where $\theta$ is the population mean or the population ratio
UCLM           100(1 − α)% one-sided upper confidence limit of the MEAN, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05

UCLMSUM        100(1 − α)% one-sided upper confidence limit of the SUM, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05

VAR            variance of the MEAN or RATIO

VARSUM         variance of the SUM
See the section “Statistical Computations” on page 4336 for details on how PROC SURVEYMEANS computes these statistics.
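As a brief sketch, the following statements request a specific set of statistics for the ice cream data, the sample size, mean, confidence limits, estimated total with its confidence limits, and coefficient of variation, instead of the default set:

   proc surveymeans data=IceCream total=4000 nobs mean clm sum clsum cv;
      var Spending;
   run;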
BY Statement

   BY variables ;

You can specify a BY statement with PROC SURVEYMEANS to obtain separate analyses on observations in groups defined by the BY variables. Note that using a BY statement provides completely separate analyses of the BY groups. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. You should use the DOMAIN statement to obtain domain analysis.

When a BY statement appears, the procedure expects the input data sets to be sorted in order of the BY variables. The variables are one or more variables in the input data set. If you specify more than one BY statement, the procedure uses only the latest BY statement and ignores any previous ones.

If your input data set is not sorted in ascending order, use one of the following alternatives:

   • Sort the data using the SORT procedure with a similar BY statement.

   • Use the BY statement options NOTSORTED or DESCENDING in the BY statement. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

   • Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
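For example, a minimal sketch that sorts the ice cream data by grade and then produces a completely separate analysis for each grade (population totals are omitted here for simplicity):

   proc sort data=IceCream;
      by Grade;
   run;

   proc surveymeans data=IceCream;
      by Grade;
      var Spending;
      weight Weight;
   run;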
CLASS Statement

   CLASS | CLASSES variables ;

The CLASS statement names variables to be analyzed as categorical variables. For categorical variables, PROC SURVEYMEANS estimates the proportion in each category or level, instead of the overall mean. PROC SURVEYMEANS always analyzes character variables as categorical. If you want categorical analysis for a numeric variable, you must include that variable in the CLASS statement.

The CLASS variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLASS variables determine the categorical variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can use multiple CLASS statements to specify categorical variables. When you specify class variables, you may use the SAS system option SUMSIZE= to limit (or to specify) the amount of memory that is available for data analysis. Refer to the chapter on SAS System options in SAS Language Reference: Dictionary for a description of the SUMSIZE= option.
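For example, a minimal sketch using the data set new from the STACKING example earlier in this chapter: x is numeric, so naming it in the CLASS statement makes PROC SURVEYMEANS estimate the proportion in each distinct value of x instead of the mean of x:

   proc surveymeans data=new;
      class x;
      var x;
   run;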
CLUSTER Statement

   CLUSTER | CLUSTERS variables ;

The CLUSTER statement names variables that identify the clusters in a clustered sample design. The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata. If your sample design has clustering at multiple stages, you should identify only the first-stage clusters, or primary sampling units (PSUs), in the CLUSTER statement. See the section “Primary Sampling Units (PSUs)” on page 4335 for more information.

The CLUSTER variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLUSTER variables determine the CLUSTER variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.
You can use multiple CLUSTER statements to specify cluster variables. The procedure uses variables from all CLUSTER statements to create clusters.
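As a hedged sketch (the data set SchoolSample and the variable Classroom are hypothetical, not part of the examples in this chapter), suppose whole classrooms were selected as first-stage units from a population of 250 classrooms:

   proc surveymeans data=SchoolSample total=250;
      cluster Classroom;
      var Spending;
      weight Weight;
   run;

Here TOTAL=250 gives the total number of PSUs (classrooms) in the study population, as the TOTAL= option requires for a clustered design.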
DOMAIN Statement

   DOMAIN | SUBGROUP variables < variable*variable variable*variable*variable ... > ;

The DOMAIN statement requests analysis for subpopulations, or domains, in addition to analysis for the entire study population. The DOMAIN statement names the variables that identify domains, which are called domain variables.

It is common practice to compute statistics for domains. The formation of these domains may be unrelated to the sample design. Therefore, the sample sizes for the domains are random variables. In order to incorporate this variability into the variance estimation, you should use a DOMAIN statement. Note that a DOMAIN statement is different from a BY statement. In a BY statement, you treat the sample sizes as fixed in each subpopulation, and you perform analysis within each BY group independently. See the section “Domain Statistics” on page 4342 for more details.

A domain variable can be either character or numeric. However, the procedure treats domain variables as categorical variables. If a variable appears by itself in a DOMAIN statement, each level of this variable determines a domain in the study population. If two or more variables are joined by asterisks (*), then every possible combination of levels of the variables determines a domain. The procedure performs a descriptive analysis within each domain defined by the domain variables.

The formatted values of the domain variables determine the categorical variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.
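For instance, a hedged sketch for the stratified ice cream design (the variable Gender is hypothetical and not in the example data): each level of Gender defines one set of domains, and each Gender-by-Grade combination defines another:

   proc surveymeans data=IceCream total=StudentTotals;
      strata Grade;
      var Spending;
      weight Weight;
      domain Gender Gender*Grade;
   run;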
RATIO Statement

   RATIO < 'label' > variables / variables ;

The RATIO statement requests ratio analysis for means or proportions of analysis variables. A RATIO statement names the variables whose means will be used as numerators or denominators in a ratio. Variables appearing before the slash (/), called numerator variables, are used for numerators. Variables appearing after the slash (/), called denominator variables, are used for denominators. These variables can be any number of analysis variables, either continuous or categorical, in the input data set.

You can optionally specify a label for each RATIO statement to identify the ratios in the output. Labels must be enclosed in single quotes. If a RATIO statement does not have any numerator variable or denominator variable specified, the RATIO statement will be ignored.

A numerator or denominator variable must be an analysis variable. That is, if there is a VAR statement, then a numerator or denominator variable must appear in the VAR statement. If there is no VAR statement, a numerator or denominator variable must be on the default analysis variable list (see the section “VAR Statement” on page 4332). If a numerator or denominator variable is not an analysis variable, it is ignored.

The computation of ratios depends on whether the numerator and denominator variables are continuous or categorical. For continuous variables, ratios are calculated with the means of the variables. For example, for continuous variables X, Y, Z, and T, the following RATIO statement requests the procedure to analyze the ratios $\bar{x}/\bar{z}$, $\bar{x}/\bar{t}$, $\bar{y}/\bar{z}$, and $\bar{y}/\bar{t}$:

   ratio x y / z t;
If a continuous variable appears as both a numerator and a denominator variable, the ratio of this variable with itself is ignored. For categorical variables, ratios are calculated with the proportions for the categories of a categorical variable. For example, if the categorical variable Gender has values “Male” and “Female,” with proportions $p_m = \Pr(\text{Gender} = \text{“Male”})$ and $p_f = \Pr(\text{Gender} = \text{“Female”})$, and Y is a continuous variable, then the following RATIO statement requests the procedure to analyze the ratios $p_m/p_f$, $p_f/p_m$, $\bar{y}/p_m$, and $\bar{y}/p_f$:

   ratio Gender y / Gender;
If a categorical variable appears as both a numerator and a denominator variable, then the ratios of the proportions for all categories are computed, except the ratio of each category with itself. You may have more than one RATIO statement. Each RATIO statement produces ratios independently using its own numerator and denominator variables. Each RATIO statement also produces its own ratio analysis table.

Available statistics for a ratio are

   • N, number of observations used to compute the ratio
   • NCLU, number of clusters
   • SUMWGT, sum of weights
   • RATIO, ratio
   • STDERR, standard error of ratio
   • VAR, variance of ratio
   • T, t-value of ratio
   • PROBT, p-value of t
   • DF, degrees of freedom of t
   • CLM, two-sided confidence limits of ratio
   • UCLM, one-sided upper confidence limit of ratio
   • LCLM, one-sided lower confidence limit of ratio
The procedure calculates these statistics based on the statistic-keywords, described on page 4326, that you specify in the PROC statement. If a statistic-keyword is not appropriate for a RATIO statement, that statistic-keyword is ignored. If no valid statistics are requested for a RATIO statement, the procedure computes the ratio and its standard error by default. Note that ratios within a domain are currently not available.

When calculating the means or proportions for the numerator and denominator variables in a ratio, an observation is excluded if it has a missing value in either the continuous numerator variable or the denominator variable. An observation with missing values is also excluded for categorical numerator or denominator variables, unless the MISSING option is used.
STRATA Statement

   STRATA | STRATUM variables < / option > ;

The STRATA statement names variables that form the strata in a stratified sample design. The combinations of categories of STRATA variables define the strata in the sample. If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. See the section “Specification of Population Totals and Sampling Rates” on page 4334 for more information.

The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the STRATA variables determine the levels. Thus, you can use formats to group values into levels. See the discussion of the FORMAT procedure in the SAS Procedures Guide.

You can specify the following option in the STRATA statement after a slash (/):

LIST
displays a “Stratum Information” table, which includes values of the STRATA variables and sampling rates for each stratum. This table also provides the number of observations and number of clusters for each stratum and analysis variable. See the section “Displayed Output” on page 4345 for more details.
VAR Statement

   VAR variables ;

The VAR statement names the variables to be analyzed. If you want a categorical analysis for a numeric variable, you must also name that variable in the CLASS statement. For categorical variables, PROC SURVEYMEANS estimates the proportion in each category or level, instead of the overall mean. Character variables are always analyzed as categorical variables. See the section “CLASS Statement” on page 4329 for more information.
If you do not specify a VAR statement, then PROC SURVEYMEANS analyzes all variables in the DATA= input data set, except those named in the BY, CLUSTER, STRATA, and WEIGHT statements.
WEIGHT Statement

   WEIGHT | WGT variable ;

The WEIGHT statement names the variable that contains the sampling weights. This variable must be numeric. If you do not specify a WEIGHT statement, PROC SURVEYMEANS assigns all observations a weight of 1. Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. If you specify more than one WEIGHT statement, the procedure uses only the first WEIGHT statement and ignores the rest.
Details

Missing Values

When computing statistics for an analysis variable, PROC SURVEYMEANS omits observations with missing values for that variable. The procedure bases statistics for each variable only on observations that have nonmissing values for that variable. If you specify the MISSING option, described on page 4323, in the PROC SURVEYMEANS statement, the procedure treats missing values of a categorical variable as a valid category.

An observation is also excluded if it has a missing value for any STRATA or CLUSTER variable, unless the MISSING option is used. If an observation has a missing value or a nonpositive value for the WEIGHT variable, then PROC SURVEYMEANS excludes that observation from the analysis.

The procedure performs univariate analysis and analyzes each VAR variable separately. Thus, the number of missing observations may be different for different variables. You can specify the keyword NMISS in the PROC SURVEYMEANS statement to display the number of missing values for each analysis variable in the “Statistics” table.

If you have missing values in your survey data for any reason (such as nonresponse), this can compromise the quality of your survey results. An observation without missing values is called a complete respondent, and an observation with missing values is called an incomplete respondent. If the complete respondents are different from the incomplete respondents with regard to a survey effect or outcome, then the survey estimates will be biased and will not accurately represent the survey population. A variety of techniques in sample design and survey operations can reduce nonresponse. Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEYMEANS. Refer to Cochran (1977), Kalton and Kasprzyk (1986), and Brick and Kalton (1996) for more details.
PROC SURVEYMEANS assumes that missing data are missing at random, because the patterns of missing data are unknown. Therefore, PROC SURVEYMEANS excludes those observations with missing values. If there is evidence indicating that the missing data are not missing at random, for example, if complete respondents are different from incomplete respondents for your study, you can use the DOMAIN statement to compute the descriptive statistics among complete respondents from your survey data without imputation on incomplete respondents. See Example 70.4 on page 4358.

If missing values result in empty strata in the sample, then they will have an impact on the statistical computation, which uses the total number of strata. If all the observations in a stratum have missing weights or missing values for the current analysis variable, this stratum is an empty stratum. For example,

   data new;
      input stratum y z w;
      datalines;
   1 . 13 40
   1 2  9  .
   1 .  5 25
   2 5 10 20
   2 8 60 15
   ;

   proc surveymeans df mean nobs nmiss;
      strata stratum;
      var y z;
      weight w;
   run;
You analyze variables Y and Z, with weight variable W and stratum variable STRATUM. For variable Y, all observations in STRATUM=1 have missing values or missing weights; therefore, the analysis for variable Y uses only observations in STRATUM=2. Thus, for variable Y, STRATUM=1 is an empty stratum and STRATUM=2 is a non-empty stratum. Note, however, that STRATUM=1 is a non-empty stratum for variable Z. If your sample design contains stratification, PROC SURVEYMEANS analyzes only the data in non-empty strata. Therefore, the total number of strata for an analysis variable means the total number of non-empty strata. In this example, the total number of strata for Y and Z is one and two, respectively.
Survey Data Analysis

Specification of Population Totals and Sampling Rates

If your analysis should include a finite population correction (fpc), you can input either the sampling rate or the population total using the RATE= option or the TOTAL= option. (You cannot specify both of these options in the same PROC SURVEYMEANS statement.) If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the study population. If you are specifying the TOTAL= option for a multistage design, you should input the total number of PSUs in the study population. See the section “Primary Sampling Units (PSUs)” on page 4335 for more details.

For a nonstratified sample design, or for a stratified sample design with the same sampling rate or the same population total in all strata, you should use the RATE=value option or the TOTAL=value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you can use the RATE=SAS-data-set option or the TOTAL=SAS-data-set option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option.

The secondary data set must contain all the stratification variables listed in the STRATA statement and all the variables in the BY statement. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL=SAS-data-set option, the secondary data set must have a variable named _TOTAL_ that contains the stratum population totals. Or if you specify the RATE=SAS-data-set option, the secondary data set must have a variable named _RATE_ that contains the stratum sampling rates. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of _TOTAL_ or _RATE_ for that stratum and ignores the rest.

The value in the RATE= option or the values of _RATE_ in the secondary data set must be nonnegative numbers. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYMEANS will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.

If you specify the TOTAL=value option, value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.
Primary Sampling Units (PSUs)

When you have clusters, or primary sampling units (PSUs), in your sample design, the procedure estimates variance from the variation among PSUs. See the section “Variance and Standard Error of the Mean” on page 4338 and the section “Variance and Standard Deviation of the Total” on page 4341.

You can use the CLUSTER statement to identify the first-stage clusters in your design. PROC SURVEYMEANS assumes that each cluster represents a PSU in the sample and that each observation is an element of a PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.
Domain Analysis

It is common practice to compute statistics for subpopulations, or domains, in addition to computing statistics for the entire study population. Analysis for domains using the entire sample is called domain analysis (also called subgroup analysis, subpopulation analysis, or subdomain analysis). The formation of these subpopulations of interest may be unrelated to the sample design. Therefore, the sample sizes for the subpopulations may actually be random variables.

In order to incorporate this variability into the variance estimation, you should use a DOMAIN statement. Note that using a BY statement provides completely separate analyses of the BY groups. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. For more detailed information about domain analysis, refer to Kish (1965).
Statistical Computations

The SURVEYMEANS procedure uses the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. This method obtains a linear approximation for the estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971; Fuller 1975). When there are clusters, or PSUs, in the sample design, the procedure estimates variance from the variation among PSUs. When the design is stratified, the procedure pools stratum variance estimates to compute the overall variance estimate. For t tests of the estimates, the degrees of freedom equals the number of clusters minus the number of strata in the sample design.

For a multistage sample design, the variance estimation method depends only on the first stage of the sample design. So, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling. This variance estimation method assumes that the first-stage sampling fraction is small, or the first-stage sample is drawn with replacement, as it often is in practice.

Quite often in complex surveys, respondents have unequal weights, which reflect unequal selection probabilities and adjustments for nonresponse. In such surveys, the appropriate sampling weights must be used to obtain valid estimates for the study population. For more information on the analysis of sample survey data, refer to Lee, Forthoffer, and Lorimor (1989), Cochran (1977), Kish (1965), and Hansen, Hurwitz, and Madow (1953).
Definition and Notation

For a stratified clustered sample design, together with the sampling weights, the sample can be represented by an $n \times (P+1)$ matrix

$$(\mathbf{w}, \mathbf{Y}) = \left( w_{hij},\, \mathbf{y}_{hij} \right) = \left( w_{hij},\, y_{hij}^{(1)},\, y_{hij}^{(2)},\, \ldots,\, y_{hij}^{(P)} \right)$$
where

   • $h = 1, 2, \ldots, H$ is the stratum number, with a total of $H$ strata

   • $i = 1, 2, \ldots, n_h$ is the cluster number within stratum $h$, with a total of $n_h$ clusters

   • $j = 1, 2, \ldots, m_{hi}$ is the unit number within cluster $i$ of stratum $h$, with a total of $m_{hi}$ units

   • $p = 1, 2, \ldots, P$ is the analysis variable number, with a total of $P$ variables

   • $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample

   • $w_{hij}$ denotes the sampling weight for observation $j$ in cluster $i$ of stratum $h$

   • $\mathbf{y}_{hij} = \left( y_{hij}^{(1)}, y_{hij}^{(2)}, \ldots, y_{hij}^{(P)} \right)$ are the observed values of the analysis variables for observation $j$ in cluster $i$ of stratum $h$, including both the values of numerical variables and the values of indicator variables for levels of categorical variables.

For a categorical variable $C$, let $l$ denote the number of levels of $C$, and denote the level values as $c_1, c_2, \ldots, c_l$. Then there are $l$ indicator variables associated with these levels. That is, for level $C = c_k$ ($k = 1, 2, \ldots, l$), a $y^{(q)}$ ($q \in \{1, 2, \ldots, P\}$) contains the values of the indicator variable for the category $C = c_k$, with the value for observation $j$ in cluster $i$ of stratum $h$:

$$y_{hij}^{(q)} = I_{\{C = c_k\}}(h, i, j) = \begin{cases} 1 & \text{if } C_{hij} = c_k \\ 0 & \text{otherwise} \end{cases}$$

Therefore, the total number of analysis variables, $P$, is the total number of numerical variables plus the total number of levels of all categorical variables.

Also, $f_h$ denotes the sampling rate for stratum $h$. You can use the TOTAL= option or the RATE= option to input population totals or sampling rates. See the section “Specification of Population Totals and Sampling Rates” on page 4334 for details. If you input stratum totals, PROC SURVEYMEANS computes $f_h$ as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYMEANS uses these values directly for $f_h$. If you do not specify the TOTAL= option or the RATE= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible, and a finite population correction is not used when computing variances.

This notation is also applicable to other sample designs. For example, for a sample design without stratification, you can let $H = 1$; for a sample design without clusters, you can let $m_{hi} = 1$ for every $h$ and $i$.

Mean

When you specify the keyword MEAN, the procedure computes the estimate of the mean (mean per element) from the survey data. Also, the procedure computes the mean by default if you do not specify any statistic-keywords in the PROC SURVEYMEANS statement.
PROC SURVEYMEANS computes the estimate of the mean as

$$\widehat{\bar{Y}} = \left( \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij} \right) / w_{\cdots}$$

where

$$w_{\cdots} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$$

is the sum of the weights over all observations in the sample.
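As a quick check of this formula against output shown earlier, consider the five observations of x in the data set new from the STACKING example. With no WEIGHT statement, every $w_{hij} = 1$, so $w_{\cdots} = 5$ and

$$\widehat{\bar{Y}} = \frac{12 + 5 + 13 + 23 + 11}{5} = \frac{64}{5} = 12.8$$

which matches the Mean of 12.800000 reported for x in Figure 70.6.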
Variance and Standard Error of the Mean

When you specify the keyword STDERR, the procedure computes the standard error of the mean. Also, the procedure computes the standard error by default if you specify the keyword MEAN, or if you do not specify any statistic-keywords in the PROC SURVEYMEANS statement. The keyword VAR requests the variance of the mean.

PROC SURVEYMEANS uses the Taylor series expansion theory to estimate the variance of the mean $\widehat{\bar{Y}}$. The procedure computes the estimated variance as

$$\widehat{V}(\widehat{\bar{Y}}) = \sum_{h=1}^{H} \widehat{V}_h(\widehat{\bar{Y}})$$

where if $n_h > 1$,

$$\widehat{V}_h(\widehat{\bar{Y}}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( e_{hi\cdot} - \bar{e}_{h\cdot\cdot} \right)^2$$

$$e_{hi\cdot} = \left( \sum_{j=1}^{m_{hi}} w_{hij} \left( y_{hij} - \widehat{\bar{Y}} \right) \right) / w_{\cdots}$$

$$\bar{e}_{h\cdot\cdot} = \left( \sum_{i=1}^{n_h} e_{hi\cdot} \right) / n_h$$

and if $n_h = 1$,

$$\widehat{V}_h(\widehat{\bar{Y}}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard error of the mean is the square root of the estimated variance:

$$\mathrm{StdErr}(\widehat{\bar{Y}}) = \sqrt{\widehat{V}(\widehat{\bar{Y}})}$$
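Continuing the small check begun above: for the data set new there is one stratum ($H = 1$), each observation is its own PSU ($n_1 = 5$, $m_{1i} = 1$), all weights are 1, and no fpc is used ($f_1 = 0$). Then $e_{i\cdot} = (y_i - 12.8)/5$, $\bar{e} = 0$, and

$$\widehat{V}(\widehat{\bar{Y}}) = \frac{5}{4} \sum_{i=1}^{5} \left( \frac{y_i - 12.8}{5} \right)^2 = \frac{168.8}{20} = 8.44, \qquad \mathrm{StdErr}(\widehat{\bar{Y}}) = \sqrt{8.44} \approx 2.9052$$

which agrees with the Std Err of 2.905168 for x in Figure 70.6 (and, as expected in this simple case, equals the familiar $s^2/n$).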
Ratio

When you use a RATIO statement, the procedure produces statistics requested by the statistic-keywords in the PROC SURVEYMEANS statement. Suppose that you want to calculate the ratio of variable Y over variable X. Let $x_{hij}$ be the value of variable X for the $j$th member in cluster $i$ in the $h$th stratum. The ratio of Y over X is

$$\widehat{R} = \frac{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}}{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij}}$$

PROC SURVEYMEANS uses the Taylor series expansion method to estimate the variance of the ratio $\widehat{R}$ as

$$\widehat{V}(\widehat{R}) = \sum_{h=1}^{H} \widehat{V}_h(\widehat{R})$$

where if $n_h > 1$,

$$\widehat{V}_h(\widehat{R}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( g_{hi\cdot} - \bar{g}_{h\cdot\cdot} \right)^2$$

$$g_{hi\cdot} = \frac{\sum_{j=1}^{m_{hi}} w_{hij} \left( y_{hij} - x_{hij} \widehat{R} \right)}{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij}}$$

$$\bar{g}_{h\cdot\cdot} = \left( \sum_{i=1}^{n_h} g_{hi\cdot} \right) / n_h$$

and if $n_h = 1$,

$$\widehat{V}_h(\widehat{R}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard error of the ratio is the square root of the estimated variance:

$$\mathrm{StdErr}(\widehat{R}) = \sqrt{\widehat{V}(\widehat{R})}$$

t Test for the Mean

If you specify the keyword T, PROC SURVEYMEANS computes the t-value for testing that the population mean equals zero, $H_0\colon \bar{Y} = 0$. The test statistic equals

$$t(\widehat{\bar{Y}}) = \widehat{\bar{Y}} / \mathrm{StdErr}(\widehat{\bar{Y}})$$

The two-sided p-value for this test is

$$\mathrm{Prob}\left( |T| > |t(\widehat{\bar{Y}})| \right)$$
where T is a random variable with the t distribution with df degrees of freedom. PROC SURVEYMEANS calculates the degrees of freedom for the t test as the number of clusters minus the number of strata. If there are no clusters, then df equals the number of observations minus the number of strata. If the design is not stratified, then df equals the number of clusters minus one. The procedure displays df for the t test if you specify the keyword DF in the PROC SURVEYMEANS statement. If missing values or missing weights are present in your data, the number of strata, the number of observations, and the number of clusters are counted based on the observations in non-empty strata. See the section “Missing Values” on page 4333 for details. For degrees of freedom in domain analysis, see the section “Domain Statistics” on page 4342.
Confidence Limits for the Mean

If you specify the keyword CLM, the procedure computes two-sided confidence limits for the mean. Also, the procedure includes the confidence limits by default if you do not specify any statistic-keywords in the PROC SURVEYMEANS statement. The confidence coefficient is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. The confidence limits are computed as

$$\widehat{\bar{Y}} \;\pm\; \mathrm{StdErr}(\widehat{\bar{Y}}) \cdot t_{df,\,\alpha/2}$$

where $\widehat{\bar{Y}}$ is the estimate of the mean, $\mathrm{StdErr}(\widehat{\bar{Y}})$ is the standard error of the mean, and $t_{df,\,\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of the t distribution with $df$ calculated as described in the section “t Test for the Mean” on page 4339.

If you specify the keyword UCLM, the procedure computes the one-sided upper $100(1 - \alpha)$ confidence limit for the mean:

$$\widehat{\bar{Y}} \;+\; \mathrm{StdErr}(\widehat{\bar{Y}}) \cdot t_{df,\,\alpha}$$

If you specify the keyword LCLM, the procedure computes the one-sided lower $100(1 - \alpha)$ confidence limit for the mean:

$$\widehat{\bar{Y}} \;-\; \mathrm{StdErr}(\widehat{\bar{Y}}) \cdot t_{df,\,\alpha}$$
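For a worked illustration using the stratified ice cream example: there are 40 observations, 3 strata, and no clusters, so $df = 40 - 3 = 37$ and $t_{37,\,0.025} \approx 2.026$ (t quantile rounded here). The 95% confidence limits for mean spending are then

$$9.1413 \;\pm\; 2.026 \times 0.5318 \;\approx\; (8.064,\ 10.219)$$

matching the limits shown in Figure 70.4.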
Coefficient of Variation

If you specify the keyword CV, PROC SURVEYMEANS computes the coefficient of variation, which is the ratio of the standard error of the mean to the estimated mean:

$$cv(\bar{Y}) = \mathrm{StdErr}(\widehat{\bar{Y}}) \;/\; \widehat{\bar{Y}}$$

If you specify the keyword CVSUM, PROC SURVEYMEANS computes the coefficient of variation for the estimated total, which is the ratio of the standard deviation of the sum to the estimated total:

$$cv(Y) = \mathrm{Std}(\widehat{Y}) \;/\; \widehat{Y}$$
Proportions

If you specify the keyword MEAN for a categorical variable, PROC SURVEYMEANS estimates the proportion, or relative frequency, for each level of the categorical variable. If you do not specify any statistic-keywords in the PROC SURVEYMEANS statement, the procedure estimates the proportions for levels of the categorical variables, together with their standard errors and confidence limits. The procedure estimates the proportion in level $c_k$ for variable $C$ as

$$\hat{p} = \frac{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}^{(q)}}{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}}$$

where $y_{hij}^{(q)}$ is the value of the indicator function for level $C = c_k$, defined in the section “Definition and Notation” on page 4336; $y_{hij}^{(q)}$ equals 1 if the observed value of variable $C$ equals $c_k$, and $y_{hij}^{(q)}$ equals 0 otherwise. Since the proportion estimator is actually an estimator of the mean for an indicator variable, the procedure computes its variance and standard error according to the method outlined in the section “Variance and Standard Error of the Mean” on page 4338. Similarly, the procedure computes confidence limits for proportions as described in the section “Confidence Limits for the Mean” on page 4340.
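As a small check against Figure 70.6: in the data set new, three of the five equally weighted observations have sex=F, so the estimated proportion and its standard error (computed from the indicator variable by the mean formulas above) are

$$\hat{p} = \frac{3}{5} = 0.6, \qquad \mathrm{StdErr}(\hat{p}) = \sqrt{\frac{5}{4} \sum_{i=1}^{5} \left( \frac{y_i^{(q)} - 0.6}{5} \right)^2} = \sqrt{0.06} \approx 0.2449$$

which agrees with the Mean of 0.600000 and Std Err of 0.244949 reported for sex=F.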
Total

If you specify the keyword SUM, the procedure computes the estimate of the population total from the survey data. The estimate of the total is the weighted sum over the sample:

$$\widehat{Y} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$

For a categorical variable level, $\widehat{Y}$ estimates its total frequency in the population.
Variance and Standard Deviation of the Total

When you specify the keyword STD or the keyword SUM, the procedure estimates the standard deviation of the total. The keyword VARSUM requests the variance of the total. PROC SURVEYMEANS estimates the variance of the total as

$$\widehat{V}(\widehat{Y}) = \sum_{h=1}^{H} \widehat{V}_h(\widehat{Y})$$

where if $n_h > 1$,

$$\widehat{V}_h(\widehat{Y}) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( y_{hi\cdot} - \bar{y}_{h\cdot\cdot} \right)^2$$

$$y_{hi\cdot} = \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$

$$\bar{y}_{h\cdot\cdot} = \left( \sum_{i=1}^{n_h} y_{hi\cdot} \right) / n_h$$

and if $n_h = 1$,

$$\widehat{V}_h(\widehat{Y}) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

The standard deviation of the total equals

$$\mathrm{Std}(\widehat{Y}) = \sqrt{\widehat{V}(\widehat{Y})}$$
Confidence Limits of a Total

If you specify the keyword CLSUM, the procedure computes confidence limits for the total. The confidence coefficient is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits. The confidence limits are computed as

$$\widehat{Y} \;\pm\; \mathrm{Std}(\widehat{Y}) \cdot t_{df,\,\alpha/2}$$

where $\widehat{Y}$ is the estimate of the total, $\mathrm{Std}(\widehat{Y})$ is the estimated standard deviation, and $t_{df,\,\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of the t distribution with $df$ calculated as described in the section “t Test for the Mean” on page 4339.

If you specify the keyword UCLSUM, the procedure computes the one-sided upper $100(1 - \alpha)$ confidence limit for the sum:

$$\widehat{Y} \;+\; \mathrm{Std}(\widehat{Y}) \cdot t_{df,\,\alpha}$$

If you specify the keyword LCLSUM, the procedure computes the one-sided lower $100(1 - \alpha)$ confidence limit for the sum:

$$\widehat{Y} \;-\; \mathrm{Std}(\widehat{Y}) \cdot t_{df,\,\alpha}$$
Domain Statistics

When you use a DOMAIN statement to request a domain analysis, the procedure computes the requested statistics for each domain. For a domain $D$, let $I_D$ be the corresponding indicator variable:

$$I_D(h, i, j) = \begin{cases} 1 & \text{if observation } (h, i, j) \text{ belongs to } D \\ 0 & \text{otherwise} \end{cases}$$

Let

$$z_{hij} = y_{hij}\, I_D(h, i, j) = \begin{cases} y_{hij} & \text{if observation } (h, i, j) \text{ belongs to } D \\ 0 & \text{otherwise} \end{cases}$$

The requested statistics for variable $y$ in domain $D$ are computed based on the values of $z$.

Domain Mean

The estimated mean of $y$ in the domain $D$ is

$$\widehat{\bar{Y}}_D = \left( \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} v_{hij}\, z_{hij} \right) / v_{\cdots}$$

where

$$v_{hij} = w_{hij}\, I_D(h, i, j) = \begin{cases} w_{hij} & \text{if observation } (h, i, j) \text{ belongs to } D \\ 0 & \text{otherwise} \end{cases}$$

$$v_{\cdots} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} v_{hij}$$

The variance of $\widehat{\bar{Y}}_D$ is estimated by

$$\widehat{V}(\widehat{\bar{Y}}_D) = \sum_{h=1}^{H} \widehat{V}_h(\widehat{\bar{Y}}_D)$$

where if $n_h > 1$,

$$\widehat{V}_h(\widehat{\bar{Y}}_D) = \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} \left( r_{hi\cdot} - \bar{r}_{h\cdot\cdot} \right)^2$$

$$r_{hi\cdot} = \left( \sum_{j=1}^{m_{hi}} v_{hij} \left( z_{hij} - \widehat{\bar{Y}}_D \right) \right) / v_{\cdots}$$

$$\bar{r}_{h\cdot\cdot} = \left( \sum_{i=1}^{n_h} r_{hi\cdot} \right) / n_h$$

and if $n_h = 1$,

$$\widehat{V}_h(\widehat{\bar{Y}}_D) = \begin{cases} \text{missing} & \text{if } n_{h'} = 1 \text{ for } h' = 1, 2, \ldots, H \\ 0 & \text{if } n_{h'} > 1 \text{ for some } 1 \le h' \le H \end{cases}$$

Domain Total

The estimated total in domain $D$ is

$$\widehat{Y}_D = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} v_{hij}\, z_{hij}$$
4344
Chapter 70. The SURVEYMEANS Procedure
and its estimated variance is
Vb (YbD ) =
H X
ch (YbD ) V
h=1
where if nh > 1, n
ch (YbD ) = V
h nh (1 − fh ) X (zhi· − z¯h·· )2 nh − 1
i=1
zhi· =
z¯h·· =
mhi X
vhij zhij
j=1 nh X
! zhi·
/ nh
i=1
and if nh = 1, ch (YbD ) = V
missing 0
if nh0 = 1 for h0 = 1, 2, . . . , H if nh0 > 1 for some 1 < h0 < H
Degrees of Freedom For domain analysis, PROC SURVEYMEANS computes the degrees of freedom for t tests as the number of clusters in the non-empty strata minus the number of non-empty strata. When the sample design has no clusters, the degrees of freedom equals the number of observations in non-empty strata minus the number of non-empty strata. As discussed in the section “Missing Values” on page 4333, missing values and missing weights can result in empty strata. In domain analysis, an empty stratum can also occur when the stratum contains no observations in the specified domain. If no observations in a whole stratum belong to a domain, then this stratum is called an empty stratum for that domain. For example, data new; input str clu y w d; datalines; 1 1 . 40 9 1 2 2 . 9 1 3 . 25 9 2 4 5 20 9 2 5 8 15 9 3 6 5 30 7 3 7 9 89 7 3 8 6 23 7 ; proc surveymeans df nobs nclu nmiss; strata str; cluster clu; var y; weight w; domain d; run;
Table 70.2. Calculations of df for Y

                               Domain D=7                  Domain D=9
Non-Empty Strata               STR=3                       STR=2
Clusters Used in the Analysis  CLU=6, CLU=7, and CLU=8     CLU=4 and CLU=5
df                             3 − 1 = 2                   2 − 1 = 1
Although there are three strata in the data set, STR=1 is an empty stratum for variable Y because of missing values and missing weights. In addition, no observations in stratum STR=3 belong to domain D=9. Therefore, STR=3 becomes an empty stratum as well for variable Y in domain D=9. As a result, the total number of non-empty strata for domain D=9 is one. The non-empty stratum for domain D=9 and variable Y is stratum STR=2. The total number of clusters for domain D=9 is two, which belong to stratum STR=2. Thus, for variable Y in domain D=9, the degrees of freedom for the t tests of the domain mean is df = 2 − 1 = 1. Similarly, for domain D=7, strata STR=1 and STR=2 are both empty strata, so the total number of strata is one (STR=3), and the total number of clusters is three (CLU=6, CLU=7, and CLU=8). Table 70.2 illustrates how domains affect the total number of clusters and total number of strata in the df calculation. Figure 70.8 shows the df computed by the procedure.

                  The SURVEYMEANS Procedure

                     Domain Analysis: d

d    Variable        N       N Miss     Clusters     DF
------------------------------------------------------------------------------
7    y               3            0            3      6
9    y               2            2            2      4
------------------------------------------------------------------------------
Figure 70.8. Degrees of Freedom in Domain Analysis
Output

Output Data Sets

Output data sets from PROC SURVEYMEANS are produced using ODS (Output Delivery System). ODS encompasses more than just the production of output data sets. For example, you can use ODS to manipulate the format of your output, the headers and titles of the tables, and the order of the columns in a table. For a more detailed description on using ODS, see Chapter 14, “Using the Output Delivery System.”
Displayed Output

By default PROC SURVEYMEANS displays a “Data Summary” table and a “Statistics” table. If you specify CLASS variables, or if you specify any character variables in the VAR statement, then the procedure displays a “Class Level Information” table. If you specify the LIST option in the STRATA statement, then the procedure displays a “Stratum Information” table. If you have a DOMAIN statement, the procedure displays a “Domain Analysis” table. If you have a RATIO statement, the procedure displays a “Ratio Analysis” table.

Data and Sample Design Summary
The “Data Summary” table provides information on the input data set and the sample design. This table displays the total number of valid observations, where an observation is considered valid if it has nonmissing values for all procedure variables other than the analysis variables; that is, for all specified STRATA, CLUSTER, and WEIGHT variables. This number may differ from the number of nonmissing observations for an individual analysis variable, which the procedure displays in the “Statistics” table. See the section “Missing Values” on page 4333 for more information.

PROC SURVEYMEANS displays the following information in the “Data Summary” table:

   • Number of Strata, if you specify a STRATA statement
   • Number of Clusters, if you specify a CLUSTER statement
   • Number of Observations, which is the total number of valid observations
   • Sum of Weights, which is the sum over all valid observations, if you specify a WEIGHT statement

Class Level Information
If you use a CLASS statement to name classification variables for categorical analysis, or if you list any character variables in the VAR statement, then PROC SURVEYMEANS displays a “Class Level Information” table. This table contains the following information for each classification variable:

   • Class Variable, which lists each CLASS variable name
   • Levels, which is the number of values or levels of the classification variable
   • Values, which lists the values of the classification variable. The values are separated by a white space character; therefore, to avoid confusion, you should not include a white space character within a classification variable value.

Stratum Information
If you specify the LIST option in the STRATA statement, PROC SURVEYMEANS displays a “Stratum Information” table. This table displays the number of valid observations in each stratum, as well as the number of nonmissing stratum observations for each analysis variable. The “Stratum Information” table provides the following for each stratum:

   • Stratum Index, which is a sequential stratum identification number
   • STRATA variable(s), which lists the levels of STRATA variables for the stratum
   • Population Total, if you specify the TOTAL= option
   • Sampling Rate, if you specify the TOTAL= option or the RATE= option. If you specify the TOTAL= option, the sampling rate is based on the number of valid observations in the stratum.
   • N Obs, which is the number of valid observations
   • Variable, which lists each analysis variable name
   • Levels, which identifies each level for categorical variables
   • N, which is the number of nonmissing observations for the analysis variable
   • Clusters, which is the number of clusters, if you specify a CLUSTER statement

Statistics
The “Statistics” table displays all of the statistics that you request with statistickeywords described on page 4326 in the PROC SURVEYMEANS statement. If you do not specify any statistic-keywords, then by default this table displays the following information for each analysis variable: the sample size, the mean, the standard error of the mean, and the confidence limits for the mean. The “Statistics” table may contain the following information for each analysis variable, depending on which statistic-keywords you request: • Variable name • Level, which identifies each level for categorical variables • N, which is the number of nonmissing observations • N Miss, which is the number of missing observations • Minimum • Maximum • Range • Number of Clusters • Sum of Weights • DF, which is the degrees of freedom for the t test • Mean • Std Error of Mean, which is the standard error of the mean • Var of Mean, which is the variance of the mean • t Value, for testing H0 : population MEAN = 0 • Pr > | t |, which is the two-sided p-value for the t test • 100(1−α)% CL for Mean, which are two-sided confidence limits for the mean • 100(1 − α)% Upper CL for Mean, which are one-sided upper confidence limits for the mean • 100(1−α)% Lower CL for Mean, which are one-sided lower confidence limits for the mean
• Coeff of Variation, which are the coefficients of variation for the mean and the sum
• Sum
• Std Dev, which is the standard deviation of the sum
• Var of Sum, which is the variance of the sum
• 100(1 − α)% CL for Sum, which are two-sided confidence limits for the sum
• 100(1 − α)% Upper CL for Sum, which are one-sided upper confidence limits for the sum
• 100(1 − α)% Lower CL for Sum, which are one-sided lower confidence limits for the sum
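For instance, the following statements request several of these statistics explicitly for the ice cream study (a minimal sketch; the keyword list shown is illustrative, not exhaustive):

   proc surveymeans data=IceCream total=StudentTotals
                    mean stderr clm sum std;
      strata Grade;
      var Spending;
      weight Weight;
   run;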
Domain Analysis
If you use a DOMAIN statement, the procedure displays statistics in each domain in a “Domain Analysis” table. A “Domain Analysis” table contains all the columns in the “Statistics” table, plus columns of domain variable values. Note that depending on how you define the domains with domain variables, the procedure may produce more than one “Domain Analysis” table. For example, in the following DOMAIN statement

   domain A B*C*D A*C C;
you use four definitions to define domains:
• A: all the levels of A
• C: all the levels of C
• A*C: all the interactive levels of A and C
• B*C*D: all the interactive levels of B, C, and D
The procedure displays four “Domain Analysis” tables, one for each domain definition. However, if you use the ODS OUTPUT statement to create an output data set for domain analysis, the output data set contains a variable Domain whose values are these domain definitions.
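For example, a minimal sketch of saving all four domain tables in a single output data set (the data set and variable names here are hypothetical):

   proc surveymeans data=MyData;
      var Y;
      weight W;
      domain A B*C*D A*C C;
      ods output Domain = MyDomains;   /* variable Domain identifies the definition */
   run;

Ratio Analysis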
The “Ratio Analysis” table displays all of the statistics that you request with statistic-keywords in the PROC statement described on page 4326. If you do not specify any statistic-keywords, then by default this table displays the ratio and its standard error. The “Ratio Analysis” table may contain the following information for each ratio, depending on which statistic-keywords you request:
• Numerator, which identifies the numerator variable of the ratio
• Denominator, which identifies the denominator variable of the ratio
• N, which is the number of observations used in the ratio analysis
• Number of Clusters
• Sum of Weights
• DF, which is the degrees of freedom for the t test
• Ratio
• Std Error of Ratio, which is the standard error of the ratio
• Var of Ratio, which is the variance of the ratio
• t Value, for testing H0: population RATIO = 0
• Pr > |t|, which is the two-sided p-value for the t test
• 100(1 − α)% CL for Ratio, which are two-sided confidence limits for the ratio
• Upper 100(1 − α)% CL for Ratio, which are one-sided upper confidence limits for the ratio
• Lower 100(1 − α)% CL for Ratio, which are one-sided lower confidence limits for the ratio
When you use the ODS OUTPUT statement to create an output data set, if you use labels for your RATIO statement, these labels are saved in a variable Ratio Statement in the output data set.
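A minimal sketch of a labeled RATIO statement whose label is then carried into the output data set (the output data set name is arbitrary):

   proc surveymeans data=Company total=800 ratio;
      var Profit Employee;
      weight Weight;
      ratio 'Profit per Employee' Profit / Employee;
      ods output Ratio = MyRatios;
   run;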
ODS Table Names

PROC SURVEYMEANS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, “Using the Output Delivery System.”

Table 70.3. ODS Tables Produced in PROC SURVEYMEANS
   ODS Table Name   Description               Statement   Option
   ClassVarInfo     Class level information   CLASS       default
   Domain           Statistics in domains     DOMAIN      default
   Ratio            Statistics for ratios     RATIO       default
   Statistics       Statistics                PROC        default
   StrataInfo       Stratum information       STRATA      LIST
   Summary          Data summary              PROC        default
For example, the following statements create an output data set named MyStrata, which contains the “StrataInfo” table, and an output data set named MyStat, which contains the “Statistics” table for the ice cream study discussed in the section “Stratified Sampling” on page 4318:

   title1 'Analysis of Ice Cream Spending';
   title2 'Stratified Simple Random Sample Design';
   proc surveymeans data=IceCream total=StudentTotals;
      strata Grade / list;
      var Spending Group;
      weight Weight;
      ods output StrataInfo = MyStrata
                 Statistics = MyStat;
   run;
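You can then process these output data sets as you would any other SAS data set; for example, to list their contents:

   proc print data=MyStrata;
   run;
   proc print data=MyStat;
   run;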
Examples

The “Getting Started” section on page 4315 contains examples of analyzing data from simple random sampling and stratified simple random sample designs. This section provides more examples that illustrate how to use PROC SURVEYMEANS.
Example 70.1. Stratified Cluster Sample Design

Consider the example in the section “Stratified Sampling” on page 4318. The study population is a junior high school with a total of 4,000 students in grades 7, 8, and 9. Researchers want to know how much these students spend weekly for ice cream, on the average, and what percentage of students spend at least $10 weekly for ice cream. The example in the section “Stratified Sampling” on page 4318 assumes that the sample of students was selected using a stratified simple random sample design. This example shows analysis based on a more complex sample design. Suppose that every student belongs to a study group and that study groups are formed within each grade level. Each study group contains between two and four students. Table 70.4 shows the total number of study groups for each grade.

Table 70.4. Study Groups and Students by Grade
   Grade    Number of Study Groups    Number of Students
   7                           608                 1,824
   8                           252                 1,025
   9                           403                 1,151
   Total                     1,263                 4,000
It is quicker and more convenient to collect data from students in the same study group than to collect data from students individually. Therefore, this study uses a stratified clustered sample design. The primary sampling units, or clusters, are study groups. The list of all study groups in the school is stratified by grade level. From each grade level, a sample of study groups is randomly selected, and all students in each selected study group are interviewed. The sample consists of eight study groups from the 7th grade, three groups from the 8th grade, and five groups from the 9th grade. The SAS data set named IceCreamStudy saves the responses of the selected students:

   data IceCreamStudy;
      input Grade StudyGroup Spending @@;
      if (Spending < 10) then Group='less';
      else Group='more';
      datalines;
   7  34  7   7  34  7   7 412  4   9  27 14
   7  34  2   9 230 15   9  27 15   7 501  2
   9 230  8   9 230  7   7 501  3   8  59 20
   7 403  4   7 403 11   8  59 13   8  59 17
   8 143 12   8 143 16   8  59 18   9 235  9
   8 143 10   9 312  8   9 235  6   9 235 11
   9 312 10   7 321  6   8 156 19   8 156 14
   7 321  3   7 321 12   7 489  2   7 489  9
   7  78  1   7  78 10   7 489  2   7 156  1
   7  78  6   7 412  6   7 156  2   9 301  8
   ;
In the data set IceCreamStudy, the variable Grade contains a student’s grade. The variable StudyGroup identifies a student’s study group. It is possible for students from different grades to have the same study group number because study groups are sequentially numbered within each grade. The variable Spending contains a student’s response to how much he or she spends per week for ice cream, in dollars. The variable Group indicates whether a student spends at least $10 weekly for ice cream. It is not necessary to store the data in order of grade and study group. The SAS data set StudyGroups is created to provide PROC SURVEYMEANS with the sample design information shown in Table 70.4:

   data StudyGroups;
      input Grade _total_;
      datalines;
   7 608
   8 252
   9 403
   ;
The variable Grade identifies the strata, and the variable _TOTAL_ contains the total number of study groups in each stratum. As discussed in the section “Specification of Population Totals and Sampling Rates” on page 4334, the population totals stored in the variable _TOTAL_ should be expressed in terms of the primary sampling units (PSUs), which are study groups in this example. Therefore, the variable _TOTAL_ contains the total number of study groups for each grade, rather than the total number of students. In order to obtain unbiased estimates, you create sampling weights using the following SAS statements:

   data IceCreamStudy;
      set IceCreamStudy;
      if Grade=7 then Prob=8/608;
      if Grade=8 then Prob=3/252;
      if Grade=9 then Prob=5/403;
      Weight=1/Prob;
The sampling weights are the reciprocals of the probabilities of selection. The variable Weight contains the sampling weights. Because the sampling design is
clustered, and all students from each selected cluster are interviewed, the sampling weights equal the inverse of the cluster (or study group) selection probabilities. The following SAS statements perform the analysis for this sample design:

   title1 'Analysis of Ice Cream Spending';
   title2 'Stratified Clustered Sample Design';
   proc surveymeans data=IceCreamStudy total=StudyGroups;
      strata Grade / list;
      cluster StudyGroup;
      var Spending Group;
      weight Weight;
   run;
Output 70.1.1. Data Summary and Class Information

                 Analysis of Ice Cream Spending
               Stratified Clustered Sample Design

                   The SURVEYMEANS Procedure

                         Data Summary
           Number of Strata                    3
           Number of Clusters                 16
           Number of Observations             40
           Sum of Weights                 3162.6

                   Class Level Information
           Class Variable    Levels    Values
           Group                  2    less more
Output 70.1.1 provides information on the sample design and the input data set. There are 3 strata in the sample design, and the sample contains 16 clusters and 40 observations. The variable Group has two levels, ‘less’ and ‘more’.
Output 70.1.2. Stratum Information

                         Analysis of Ice Cream Spending
                       Stratified Clustered Sample Design

                           The SURVEYMEANS Procedure

                              Stratum Information

 Stratum           Population   Sampling
   Index   Grade        Total       Rate   N Obs   Variable   Level    N   Clusters
 -----------------------------------------------------------------------------------
       1   7              608      1.32%      20   Spending           20          8
                                                   Group      less    17          8
                                                              more     3          3
       2   8              252      1.19%       9   Spending            9          3
                                                   Group      less     0          0
                                                              more     9          3
       3   9              403      1.24%      11   Spending           11          5
                                                   Group      less     6          4
                                                              more     5          4
 -----------------------------------------------------------------------------------
Output 70.1.2 displays information for each stratum. Since the primary sampling units in this design are study groups, the population totals shown in Output 70.1.2 are the total numbers of study groups for each stratum or grade. This differs from Figure 70.3 on page 4320, which provides the population totals in terms of students since students were the primary sampling units for that design. Output 70.1.2 also displays the number of clusters for each stratum and analysis variable.
Output 70.1.3. Statistics

                       Analysis of Ice Cream Spending
                     Stratified Clustered Sample Design

                         The SURVEYMEANS Procedure

                                Statistics

                                     Std Error    Lower 95%      Upper 95%
 Variable   Level    N       Mean      of Mean    CL for Mean    CL for Mean
 ----------------------------------------------------------------------------
 Spending           40   8.923860     0.650859       7.517764      10.329957
 Group      less    23   0.561437     0.056368       0.439661       0.683213
            more    17   0.438563     0.056368       0.316787       0.560339
 ----------------------------------------------------------------------------
Output 70.1.3 displays the estimates of the average weekly ice cream expense and the percentage of students spending at least $10 weekly for ice cream.
Example 70.2. Domain Analysis

Suppose that you are studying profiles of the 800 top-performing companies to provide information on their impact on the economy. You are also interested in the company profiles within each market type. A sample of 66 companies is selected with unequal probability across market types. However, market type is not included in the sample design. Thus, the number of companies within each market type is a random variable in your sample. To obtain statistics within each market type, you should use domain analysis. The data of the 66 companies are saved in the following data set:
   data Company;
      length Type $14;
      input Type$ Asset Sale Value Profit Employee Weight;
      datalines;
   Other           2764.0  1828.0  1850.3  144.0  18.7   9.6
   Energy         13246.2  4633.5  4387.7  462.9  24.3  42.6
   Finance         3597.7   377.8    93.0   14.0   1.1  12.2
   Transportation  6646.1  6414.2  2377.5  348.2  47.1  21.8
   HiTech          1068.4  1689.8  1430.2   72.9   4.6   4.3
   Manufacturing   1125.0  1719.4  1057.5   98.1  20.4   4.5
   Other           1459.0  1241.4   452.7   24.5  20.1   5.5
   Finance         2672.3   262.5   296.2   23.1   2.2   9.3
   Finance          311.0   566.2   932.0   52.8   2.7   1.9
   Energy          1148.6  1014.6   485.1   60.6   4.0   4.5
   Finance         5327.0   572.4   372.9   25.2   4.2  17.7
   Energy          1602.7   678.4   653.0   75.6   2.8   6.0
   Energy          5808.8  1288.4  2007.0  318.8   5.9  19.2
   Medical          268.8   204.4   820.9   45.6   3.7   1.8
   Transportation  5222.6  2627.8  1910.0  245.6  22.8  17.4
   Other            872.7  1419.4   939.3   69.7  12.2   3.7
   Retail          4461.7  8946.8  4662.7  289.0 132.1  15.0
   HiTech          6719.2  6942.0  8240.2  381.3  85.8  22.1
   Retail           833.4  1538.8  1090.3   64.9  15.4   3.5
   Finance          415.9   167.3  1126.8   56.8   0.7   2.2
   HiTech           442.4  1139.9  1039.9   57.6  22.7   2.3
   Other            801.5  1157.0   664.2   56.9  15.5   3.4
   Finance         4954.8   468.8   366.4   41.7   3.0  16.5
   Finance         2661.9   257.9   181.1   21.2   2.1   9.3
   Finance         5345.8   530.1   337.4   36.4   4.3  17.8
   Energy          3334.3  1644.7  1407.8  157.6   6.4  11.4
   Manufacturing   1826.6  2671.7   483.2   71.3  25.3   6.7
   Retail           618.8  2354.7   767.7   58.6  19.0   2.9
   Retail          1529.1  6534.0   826.3   58.3  65.8   5.7
   Manufacturing   4458.4  4824.5  3132.1   28.9  67.0  15.0
   HiTech          5831.7  6611.1  9464.7  459.6  86.7  19.3
   Medical         6468.3  4199.2  3170.4  270.1  59.5  21.3
   Energy          1720.7   473.1   811.1   86.6   1.6   6.3
   Energy          1679.7  1379.9   721.1   91.8   4.5   6.2
   Retail          4018.2 16823.4  2038.3  178.1 162.0  13.6
   Other            227.1   575.8  1083.8   62.6   1.9   1.6
   Finance         3872.8   362.0   209.3   27.6   2.4  13.1
   Retail          3359.3  4844.7  2651.4  224.1  75.6  11.5
   Energy          1295.6   356.9   180.8  162.3   0.6   5.0
   Energy          1658.0   626.6   688.0  126.0   3.5   6.1
   Finance        12156.7  1345.5   680.7  106.6   9.4  39.2
   HiTech          3982.6  4196.0  3946.8  313.9  64.3  13.5
   Finance         8760.7   886.4  1006.9   90.0   7.5  28.5
   Manufacturing   2362.2  3153.3  1080.0  137.0  25.2   8.4
   Transportation  2499.9  3419.0   992.6   47.2  25.3   8.8
   Energy          1430.4  1610.0   664.3   77.7   3.5   5.4
   Energy         13666.5 15465.4  2736.7  411.4  26.6  43.9
   Manufacturing   4069.3  4174.7  2907.6  289.2  38.2  13.7
   Energy          2924.7   711.9  1067.8  146.7   3.4  10.1
   Transportation  1262.1  1716.0   364.3   71.2  14.5   4.9
   Medical          684.4   672.9   287.4   61.8   6.0   3.1
   Energy          3069.3  1719.0  1439.0  196.4   4.9  10.6
   Medical          246.5   318.8   924.1   43.8   3.1   1.7
   Finance        11562.2  1128.5   580.4   64.2   6.7  37.3
   Finance         9316.0  1059.4   816.5   95.9   8.0  30.2
   Retail          1094.3  3848.0   563.3   29.4  44.7   4.4
   Retail          1102.1  4878.3   932.4   65.2  47.3   4.4
   HiTech           466.4   675.8   845.7   64.5   5.2   2.4
   Manufacturing  10839.4  5468.7  1895.4  232.8  47.8  35.0
   Manufacturing    733.5  2135.3    96.6   10.9   2.7   3.2
   Manufacturing  10354.2 14477.4  5607.2  321.9 188.5  33.5
   Energy          1902.1  2697.9   329.3   34.2   2.2   6.9
   Other           2245.2  2132.2  2230.4  198.9   8.0   8.0
   Transportation   949.4  1248.3   298.9   35.4  10.4   3.9
   Retail          2834.4  2884.6   458.2   41.2  49.8   9.8
   Retail          2621.1  6173.8  1992.7  183.7 115.1   9.2
   ;
For each company in your sample,
• the variable Type identifies the type of market for the company.
• the variable Asset contains the company’s assets in millions of dollars.
• the variable Sale contains sales in millions of dollars.
• the variable Value contains the market value of the company in millions of dollars.
• the variable Profit contains the profit in millions of dollars.
• the variable Employee stores the number of employees in thousands.
• the variable Weight contains the sampling weight.
The following SAS statements use PROC SURVEYMEANS to perform the domain analysis, estimating means and other statistics for the overall population and also for the subpopulations (or domains) defined by market type. The DOMAIN statement specifies Type as the domain variable:

   title1 'Top Companies Profile Study';
   proc surveymeans data=Company total=800 mean sum;
      var Asset Sale Value Profit Employee;
      weight Weight;
      domain Type;
   run;
Output 70.2.1. Company Profile Study

                      Top Companies Profile Study
                       The SURVEYMEANS Procedure

                            Data Summary
              Number of Observations             66
              Sum of Weights                  799.8

                              Statistics

                               Std Error
 Variable           Mean         of Mean          Sum        Std Dev
 --------------------------------------------------------------------
 Asset       6523.488510      720.557075      5217486        1073829
 Sale        4215.995799      839.132506      3371953         847885
 Value       2145.935121      342.531720      1716319         359609
 Profit       188.788210       25.057876       150993          30144
 Employee      36.874869        7.787857        29493    7148.003298
 --------------------------------------------------------------------
Output 70.2.1 shows that there are 66 observations in the sample. The sum of the sampling weights equals 799.8, which is close to the total number of companies in the study population. The “Statistics” table in Output 70.2.1 displays the estimates of the mean and total for all analysis variables for the entire 800 companies, while Output 70.2.2 shows the mean and total estimates for each company type.

Output 70.2.2. Domain Analysis for Company Profile Study

                          Top Companies Profile Study
                           The SURVEYMEANS Procedure

                             Domain Analysis: Type

                                          Std Error
 Type             Variable        Mean      of Mean           Sum       Std Dev
 --------------------------------------------------------------------------------
 Energy           Asset    7868.302932  1941.699163       1449341        785962
                  Sale     5419.679099  2416.214417        998305        673373
                  Value    2249.297177   520.295162        414321        213580
                  Profit    289.564658    52.512141         53338         25927
                  Employee   14.151194     3.974697   2606.650000   1481.777769
 Finance          Asset    7890.190264  1057.185336       1855773        704506
                  Sale      829.210502   115.762531        195030         74436
                  Value     565.068197    76.964547        132904         48156
                  Profit     63.716837    10.099341         14986   5801.108513
                  Employee    5.806293     0.811555   1365.640000    519.658410
 HiTech           Asset    5031.959781   732.436967        321542        183302
                  Sale     5464.292019   731.296997        349168        196013
                  Value    6707.828482  1194.160584        428630        249154
                  Profit    346.407042    42.299004         22135         12223
                  Employee   70.766980     8.683595   4522.010000   2524.778281
 Manufacturing    Asset    7403.004250  1454.921083        888361        492577
                  Sale     7207.638833  2112.444703        864917        501679
                  Value    2986.442750   799.121544        358373        196979
                  Profit    211.933583    39.993255         25432         13322
                  Employee   83.314333    31.089019   9997.720000   6294.309490
 Medical          Asset    5046.570609  1218.444638        140799        131942
                  Sale     3313.219713   758.216303         92439         85655
                  Value    2561.614695   530.802245         71469         64663
                  Profit    218.682796    44.051447   6101.250000   5509.560969
                  Employee   46.518996    11.135955   1297.880000   1213.651734
 Other            Asset    1850.250000   338.128984         58838         31375
                  Sale     1620.784906   168.686773         51541         24593
                  Value    1432.820755   297.869828         45564         24204
                  Profit    115.089937    27.970560   3659.860000   2018.201371
                  Employee   14.306604     2.313733    454.950000    216.327710
 Retail           Asset    2939.845750   393.692369        235188         94605
                  Sale     7395.453500  1746.187580        591636        263263
                  Value    2103.863125   529.756409        168309         78304
                  Profit    157.171875    31.734253         12574   5478.281027
                  Employee   93.624000    15.726743   7489.920000   3093.832061
 Transportation   Asset    4712.047359   888.954411        267644        163516
                  Sale     4030.233275  1015.555708        228917        142669
                  Value    1703.330282   313.841326         96749         58947
                  Profit    224.762324    56.168925         12767   8287.585418
                  Employee   30.946303     6.786270   1757.750000   1066.586615
 --------------------------------------------------------------------------------
Example 70.3. Ratio Analysis

Suppose you are interested in the profit per employee and the sale per employee among the 800 top-performing companies in the data in the previous example. The following SAS statements illustrate how you can use PROC SURVEYMEANS to estimate these ratios:

   title1 'Ratio Analysis in Top Companies Profile Study';
   proc surveymeans data=Company total=800 ratio;
      var Profit Sale Employee;
      weight Weight;
      ratio Profit Sale / Employee;
   run;
The RATIO statement requests the ratio of the profit and the sale to the number of employees.

Output 70.3.1. Estimate Ratios

           Ratio Analysis in Top Companies Profile Study

                    The SURVEYMEANS Procedure

                         Ratio Analysis

       Numerator    Denominator          Ratio       Std Err
       --------------------------------------------------------
       Sale         Employee        114.332497     20.502742
       Profit       Employee          5.119698      1.058939
       --------------------------------------------------------
Output 70.3.1 shows the estimated ratios and their standard errors. Because the profit and the sale figures are in millions of dollars, and the employee numbers in thousands, the profit per employee is estimated as $5,120 with a standard error of $1,059, and the sale per employee is $114,333 with a standard error of $20,503.
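Equivalently, because each ratio estimate is the ratio of the estimated totals, you can verify it from the sums in Output 70.2.1: the profit per employee is Sum(Profit)/Sum(Employee) = 150993/29493 = 5.1197, and the sale per employee is 3371953/29493 = 114.33, matching the Ratio column above.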
Example 70.4. Analyzing Survey Data with Missing Values

As described in the section “Missing Values” on page 4333, the SURVEYMEANS procedure excludes an observation from the analysis if it has a missing value for the analysis variable or a nonpositive value for the WEIGHT variable. However, if there is evidence indicating that the nonrespondents are different from the respondents for your study, you can use the DOMAIN statement to compute descriptive statistics among respondents from your survey data without imputation for nonrespondents. Note that although the variance estimation for respondents takes into account the assumption that the study population consists of distinct groups of respondents and nonrespondents, the degrees of freedom will not adjust for the nonrespondents because they are deleted from the computation. As a result, there are fewer degrees of freedom and wider confidence limits in comparison to counting
those nonrespondents for degrees of freedom. When the sample size and the number of respondents are large, the difference may be ignored. Consider the ice cream example in the section “Stratified Sampling” on page 4318. Suppose that some of the students failed to provide the amounts spent on ice cream, as shown in the following data set IceCream:

   data IceCream;
      input Grade Spending @@;
      datalines;
   7  7   7  7   8  .   9 10   7  .   7 10   7  3
   7  .   9 15   8 16   7  6   7  6   7  6   9 15
   9  8   9  7   7  3   7 12   7  4   9 14   8 18
   7  4   7 11   9  8   8  .   8 13   7  .   9  .
   8 20   8 17   9  9   9 11   8 19   8 14   7  2
   7  2   7  2   9  .   7  1   7  9
   ;
   data StudentTotals;
      input Grade _total_;
      datalines;
   7 1824
   8 1025
   9 1151
   ;
Considering the possibility that those students who didn’t respond spend differently than those students who did respond, you can create an indicator variable to identify the respondents and nonrespondents with the following SAS DATA step statements:

   data IceCream;
      set IceCream;
      if Spending=. then Indicator='Nonrespondent';
      else do;
         Indicator='Respondent';
         if (Spending < 10) then Group='less';
         else Group='more';
      end;
      if Grade=7 then Prob=20/1824;
      if Grade=8 then Prob=9/1025;
      if Grade=9 then Prob=11/1151;
      Weight=1/Prob;
The variable Indicator identifies a student in the data set as either a respondent or a nonrespondent. The variable Group specifies whether a respondent spent at least $10 weekly. The following SAS statements produce the desired analysis:

   title1 'Analysis of Ice Cream Spending';
   proc surveymeans data=IceCream total=StudentTotals mean sum;
      strata Grade / list;
      var Spending Group;
      weight Weight;
      domain Indicator;
   run;
Output 70.4.1. Analysis of Incomplete Ice Cream Data Excluding Observations with Missing Values

                   Analysis of Ice Cream Spending
                    The SURVEYMEANS Procedure

                          Data Summary
             Number of Strata                    3
             Number of Observations             40
             Sum of Weights                   4000

                            Statistics

                                   Std Error
 Variable   Level        Mean        of Mean            Sum        Std Dev
 ----------------------------------------------------------------------------
 Spending            9.770542       0.541381          32139    1780.792065
 Group      less     0.515404       0.067092    1695.345455     220.690305
            more     0.484596       0.067092    1594.004040     220.690305
 ----------------------------------------------------------------------------
Output 70.4.1 shows the mean and total estimates excluding those students who failed to provide the spending amount on ice cream.

Output 70.4.2. Analysis of Incomplete Ice Cream Data Treating Respondents as a Domain

                      Analysis of Ice Cream Spending
                       The SURVEYMEANS Procedure

                      Domain Analysis: Indicator

                                                 Std Error
 Indicator       Variable   Level        Mean      of Mean            Sum        Std Dev
 ------------------------------------------------------------------------------------------
 Nonrespondent   Spending                   .            .              .              .
                 Group      less            .            .              .              .
                            more            .            .              .              .
 Respondent      Spending            9.770542     0.652347          32139    3515.126876
                 Group      less     0.515404     0.067092    1695.345455     220.690305
                            more     0.484596     0.067092    1594.004040     220.690305
 ------------------------------------------------------------------------------------------
Output 70.4.2 shows the mean and total estimates treating respondents as a domain in the student population. Compared to the estimates in Output 70.4.1, the point estimates are the same, but the variance estimates are slightly higher.
References

Brick, J.M. and Kalton, G. (1996), “Handling Missing Data in Survey Research,” Statistical Methods in Medical Research, 5, 215–238.

Cochran, W.G. (1977), Sampling Techniques, Third Edition, New York: John Wiley & Sons, Inc.

Foreman, E.K. (1991), Survey Sampling Principles, New York: Marcel Dekker, Inc.

Fuller, W.A. (1975), “Regression Analysis for Sample Survey,” Sankhyā, 37, Series C, Pt. 3, 117–132.

Fuller, W.A., Kennedy, W., Schnell, D., Sullivan, G., and Park, H.J. (1989), PC CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Hansen, M.H., Hurwitz, W.N., and Madow, W.G. (1953), Sample Survey Methods and Theory, Volumes I and II, New York: John Wiley & Sons, Inc.

Hidiroglou, M.A., Fuller, W.A., and Hickman, R.D. (1980), SUPER CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Kalton, G. (1983), Introduction to Survey Sampling, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-035, Beverly Hills, CA, and London: Sage Publications, Inc.

Kalton, G. and Kasprzyk, D. (1986), “The Treatment of Missing Survey Data,” Survey Methodology, 12, 1–16.

Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons, Inc.

Lee, E.S., Forthofer, R.N., and Lorimor, R.J. (1989), Analyzing Complex Survey Data, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-071, Beverly Hills, CA, and London: Sage Publications, Inc.

Levy, P. and Lemeshow, S. (1999), Sampling of Populations, Methods and Applications, Third Edition, New York: John Wiley & Sons, Inc.

Pringle, R.M. and Raynor, A.A. (1971), Generalized Inverse Matrices with Applications to Statistics, New York: Hafner Publishing Co.

Woodruff, R.S. (1971), “A Simple Method for Approximating the Variance of a Complicated Estimate,” Journal of the American Statistical Association, 66, 411–414.
Chapter 71
The SURVEYREG Procedure

Chapter Contents

OVERVIEW ................................................... 4365
GETTING STARTED ............................................ 4365
   Simple Random Sampling .................................. 4365
   Stratified Sampling ..................................... 4368
   Output Data Set ......................................... 4372
SYNTAX ..................................................... 4373
   PROC SURVEYREG Statement ................................ 4373
   BY Statement ............................................ 4375
   CLASS Statement ......................................... 4375
   CLUSTER Statement ....................................... 4376
   CONTRAST Statement ...................................... 4376
   ESTIMATE Statement ...................................... 4378
   MODEL Statement ......................................... 4379
   STRATA Statement ........................................ 4381
   WEIGHT Statement ........................................ 4382
DETAILS .................................................... 4382
   Missing Values .......................................... 4382
   Survey Design Information ............................... 4382
      Specification of Population Totals and Sampling Rates  4382
      Primary Sampling Units (PSUs) ........................ 4383
   Computational Details ................................... 4384
      Notation ............................................. 4384
      Regression Coefficients .............................. 4384
      Variance Estimation .................................. 4385
      Degrees of Freedom ................................... 4385
      Testing Effects ...................................... 4386
      Design Effect ........................................ 4388
      Stratum Collapse ..................................... 4388
      Sampling Rate of the Pooled Stratum from Collapse .... 4389
      Contrasts ............................................ 4389
   Output .................................................. 4390
      Displayed Output ..................................... 4390
      Output Data Sets ..................................... 4393
      ODS Table Names ...................................... 4394
EXAMPLES ................................................... 4395
   Example 71.1. Simple Random Sampling .................... 4395
   Example 71.2. Simple Random Cluster Sampling ............ 4397
   Example 71.3. Regression Estimator for Simple Random Sample 4400
   Example 71.4. Stratified Sampling ....................... 4401
   Example 71.5. Regression Estimator for Stratified Sample  4407
   Example 71.6. Stratum Collapse .......................... 4411
REFERENCES ................................................. 4416
Chapter 71
The SURVEYREG Procedure

Overview

The SURVEYREG procedure performs regression analysis for sample survey data. This procedure can handle complex survey sample designs, including designs with stratification, clustering, and unequal weighting. The procedure fits linear models for survey data and computes regression coefficients and their variance-covariance matrix. The procedure also provides significance tests for the model effects and for any specified estimable linear functions of the model parameters. Using the regression model, the procedure can compute predicted values for the sample survey data.

PROC SURVEYREG computes the regression coefficient estimators by generalized least-squares estimation using element-wise regression. The procedure assumes that the regression coefficients are the same across strata and primary sampling units (PSUs). To estimate the variance-covariance matrix for the regression coefficients, PROC SURVEYREG uses the Taylor expansion theory for estimating sampling errors of estimators based on complex sample designs (Woodruff 1971; Fuller 1975; Särndal, Swenson, and Wretman 1992, Chapter 5 and Chapter 13). This method obtains a linear approximation for the estimator and then uses the variance estimator for this approximation to estimate the variance of the estimator itself.

PROC SURVEYREG uses the ODS (Output Delivery System) to place results in output data sets. This is a departure from older SAS procedures that provide OUTPUT statements for similar functionality.
Getting Started

This section demonstrates how you can use PROC SURVEYREG to perform a regression analysis for sample survey data. For a complete description of the usage of PROC SURVEYREG, see the section “Syntax” on page 4373. The “Examples” section on page 4395 provides more detailed examples that illustrate the applications of PROC SURVEYREG.
Simple Random Sampling

Suppose that, in a junior high school, there are a total of 4,000 students in grades 7, 8, and 9. You want to know how household income and the number of children in a household affect students’ average weekly spending for ice cream. In order to answer this question, you draw a sample using simple random sampling from the student population in the junior high school. You randomly select 40 students and ask them their average weekly expenditure for ice cream, their household income, and the number of children in their household. The answers from the 40 students are saved as a SAS data set:
   data IceCream;
      input Grade Spending Income Kids @@;
      datalines;
   7  7 39 2   7  7 38 1   8 12 47 1
   9 10 47 4   7  1 34 4   7 10 43 2
   7  3 44 4   8 20 60 3   8 19 57 4
   7  2 35 2   7  2 36 1   9 15 51 1
   8 16 53 1   7  6 37 4   7  6 41 2
   7  6 39 2   9 15 50 4   8 17 57 3
   8 14 46 2   9  8 41 2   9  8 41 1
   9  7 47 3   7  3 39 3   7 12 50 2
   7  4 43 4   9 14 46 3   8 18 58 4
   9  9 44 3   7  2 37 1   7  1 37 2
   7  4 44 2   7 11 42 2   9  8 41 2
   8 10 42 2   8 13 46 1   7  2 40 3
   9  6 45 1   9 11 45 4   7  2 36 1
   7  9 46 1
   ;
In the data set IceCream, the variable Grade indicates a student’s grade. The variable Spending contains the dollar amount of each student’s average weekly spending for ice cream. The variable Income specifies the household income, in thousands of dollars. The variable Kids indicates how many children are in a student’s family. The following PROC SURVEYREG statements request a regression analysis:

   title1 'Ice Cream Spending Analysis';
   title2 'Simple Random Sample Design';
   proc surveyreg data=IceCream total=4000;
      class Kids;
      model Spending = Income Kids / solution anova;
   run;
The PROC SURVEYREG statement invokes the procedure. The TOTAL=4000 option specifies the total in the population from which the sample is drawn. The CLASS statement requests that the procedure use the variable Kids as a classification variable in the analysis. The MODEL statement describes the linear model that you want to fit, with Spending as the dependent variable and Income and Kids as the independent variables. The SOLUTION option in the MODEL statement requests that the procedure output the regression coefficient estimates. The ANOVA option requests that the procedure output the ANOVA table.
                 Ice Cream Spending Analysis
                 Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                        Data Summary
             Number of Observations            40
             Mean of Spending             8.75000
             Sum of Spending            350.00000

                       Fit Statistics
             R-square                      0.8132
             Root MSE                      2.4506
             Denominator DF                    39

                   Class Level Information
             Class Variable   Levels   Values
             Kids                  4   1 2 3 4

Figure 71.1. Summary of Data
Figure 71.1 displays the summary of the data, the summary of the fit, and the levels of the classification variable Kids. The “Fit Statistics” table displays the denominator degrees of freedom, which are used in F tests and t tests in the regression analysis.
                 Ice Cream Spending Analysis
                 Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                    Tests of Model Effects

         Effect       Num DF    F Value    Pr > F
         Model             4     119.15    <.0001
         Intercept         1     153.32    <.0001
         Income            1     324.45    <.0001
         Kids              3       0.92    0.4385

NOTE: The denominator degrees of freedom for the F tests is 39.

Figure 71.2. Testing Effects in the Regression
Figure 71.2 displays the ANOVA table for the regression and the tests for model
effects. The effect Income is significant in the linear regression model, while the effect Kids is not significant at the 5% level.
                 Ice Cream Spending Analysis
                 Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                Estimated Regression Coefficients

                                 Standard
   Parameter      Estimate          Error    t Value    Pr > |t|
   Intercept   -26.084677     2.46720403     -10.57       <.0001
   Income        0.775330     0.04304415      18.01       <.0001
   Kids 1        0.897655     1.12352876       0.80       0.4292
   Kids 2        1.494032     1.24705263       1.20       0.2381
   Kids 3       -0.513181     1.33454891      -0.38       0.7027
   Kids 4        0.000000     0.00000000        .            .

NOTE: The denominator degrees of freedom for the t tests is 39. Matrix X'X is singular and a generalized inverse was used to solve the normal equations. Estimates are not unique.

Figure 71.3. Regression Coefficients
The regression coefficient estimates and their standard errors and associated t tests are displayed in Figure 71.3.
Stratified Sampling

Suppose that the previous student sample is actually drawn using a stratified sample design. The strata are grades in the junior high school: 7, 8, and 9. Within strata, simple random samples are selected. Table 71.1 provides the number of students in each grade.

Table 71.1. Students in Grades
   Grade    Number of Students
   7                     1,824
   8                     1,025
   9                     1,151
   Total                 4,000
In order to analyze this sample using PROC SURVEYREG, you need to input the stratification information by creating a SAS data set with the information in Table 71.1. The following SAS statements create such a data set called StudentTotals:

   data StudentTotals;
      input Grade _TOTAL_;
      datalines;
   7 1824
   8 1025
   9 1151
   ;
The variable Grade is the stratification variable, and the variable _TOTAL_ contains the total numbers of students in each stratum in the survey population. PROC SURVEYREG requires you to use the keyword _TOTAL_ as the name of the variable that contains the population total information. In a stratified sample design, when the sampling rates in the strata are unequal, you need to use sampling weights to reflect this information. For this example, the appropriate sampling weights are the reciprocals of the probabilities of selection. You can use the following data step to create the sampling weights:

   data IceCream;
      set IceCream;
      if Grade=7 then Prob=20/1824;
      if Grade=8 then Prob=9/1025;
      if Grade=9 then Prob=11/1151;
      Weight=1/Prob;
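A sketch of how such a stratified sample could have been drawn with PROC SURVEYSELECT (the frame data set Students, containing one record per student, and the seed are hypothetical):

   proc surveyselect data=Students method=srs n=(20 9 11)
                     seed=39647 out=IceCream;
      strata Grade;
   run;

The output data set then carries the selection probabilities and sampling weights (the STATS option can be used to request them explicitly if they are not included by default).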
If you use PROC SURVEYSELECT to select your sample, PROC SURVEYSELECT creates these sampling weights for you. The following statements demonstrate how you can fit a linear model while incorporating the sample design information (stratification):

   title1 'Ice Cream Spending Analysis';
   title2 'Stratified Simple Random Sample Design';
   proc surveyreg data=IceCream total=StudentTotals;
      strata Grade / list;
      class Kids;
      model Spending = Income Kids / solution anova;
      weight Weight;
   run;
Comparing these statements to those in the section “Simple Random Sampling” on page 4365, you can see how the TOTAL=StudentTotals option replaces the previous TOTAL=4000 option. The STRATA statement specifies the stratification variable Grade. The LIST option in the STRATA statement requests that the stratification information be included in the output. The WEIGHT statement specifies the weight variable.
                 Ice Cream Spending Analysis
            Stratified Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                        Data Summary
             Number of Observations            40
             Sum of Weights                 4000.0
             Weighted Mean of Spending     9.14130
             Weighted Sum of Spending      36565.2

                       Design Summary
             Number of Strata                    3

                       Fit Statistics
             R-square                       0.8219
             Root MSE                       2.4185
             Denominator DF                     37

Figure 71.4. Summary of the Regression
Figure 71.4 summarizes the data information, the sample design information, and the fit information. Note that, due to the stratification, the denominator degrees of freedom for F tests and t tests is 37, which is different from the analysis in Figure 71.1.
                 Ice Cream Spending Analysis
            Stratified Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                     Stratum Information

       Stratum                  Population    Sampling
         Index   Grade   N Obs       Total        Rate
             1       7      20        1824       1.10%
             2       8       9        1025       0.88%
             3       9      11        1151       0.96%

                   Class Level Information
             Class Variable   Levels   Values
             Kids                  4   1 2 3 4

Figure 71.5. Stratification and Classification Information
For each stratum, Figure 71.5 displays the value of identifying variables, the number of observations (sample size), the total population size, and the calculated sampling rate or fraction.
                 Ice Cream Spending Analysis
            Stratified Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                    Tests of Model Effects

         Effect       Num DF    F Value    Pr > F
         Model             4     124.85    <.0001
         Intercept         1     150.95    <.0001
         Income            1     326.89    <.0001
         Kids              3       0.99    0.4081

NOTE: The denominator degrees of freedom for the F tests is 37.

Figure 71.6. Testing Effects
Figure 71.6 displays the ANOVA table for the regression and tests for the significance of model effects under the stratified sample design. The Income effect is strongly significant, while the Kids effect is not significant at the 5% level.
                 Ice Cream Spending Analysis
            Stratified Simple Random Sample Design

                   The SURVEYREG Procedure
       Regression Analysis for Dependent Variable Spending

                Estimated Regression Coefficients

                                 Standard
   Parameter      Estimate          Error    t Value    Pr > |t|
   Intercept   -26.086882     2.44108058     -10.69       <.0001
   Income        0.776699     0.04295904      18.08       <.0001
   Kids 1        0.888631     1.07000634       0.83       0.4116
   Kids 2        1.545726     1.20815863       1.28       0.2087
   Kids 3       -0.526817     1.32748011      -0.40       0.6938
   Kids 4        0.000000     0.00000000        .            .

NOTE: The denominator degrees of freedom for the t tests is 37. Matrix X'WX is singular and a generalized inverse was used to solve the normal equations. Estimates are not unique.

Figure 71.7. Regression Coefficients
The regression coefficient estimates for the stratified sample, along with their standard errors and associated t tests, are displayed in Figure 71.7.
You can request other statistics and tests using PROC SURVEYREG. You can also analyze data from a more complex sample design. The remainder of this chapter provides more detailed information.
Output Data Set

PROC SURVEYREG uses the Output Delivery System (ODS) to create output data sets. This is a departure from older SAS procedures that provide OUTPUT statements for similar functionality. For more information on ODS, see Chapter 14, “Using the Output Delivery System.” For example, to save the “ParameterEstimates” table (Figure 71.7) in the previous section in an output data set, you use the ODS OUTPUT statement as follows:

   title1 'Ice Cream Spending Analysis';
   title2 'Stratified Simple Random Sample Design';
   proc surveyreg data=IceCream total=StudentTotals;
      strata Grade / list;
      class Kids;
      model Spending = Income Kids / solution;
      weight Weight;
      ods output ParameterEstimates = MyParmEst;
   run;
The statement

   ods output ParameterEstimates = MyParmEst;

requests that the “ParameterEstimates” table that appears in Figure 71.7 be placed in a SAS data set named MyParmEst. The PRINT procedure displays observations of the data set MyParmEst:

   proc print data=MyParmEst;
   run;
Figure 71.8 displays the observations in the data set MyParmEst.
                 Ice Cream Spending Analysis
            Stratified Simple Random Sample Design

 OBS  Parameter      Estimate        StdErr  DenDF  tValue   Probt
   1  Intercept    -26.086882    2.44108058     37  -10.69  <.0001
   2  Income         0.776699    0.04295904     37   18.08  <.0001
   3  Kids 1         0.888631    1.07000634     37    0.83  0.4116
   4  Kids 2         1.545726    1.20815863     37    1.28  0.2087
   5  Kids 3        -0.526817    1.32748011     37   -0.40  0.6938
   6  Kids 4         0.000000    0.00000000     37       .       .

Figure 71.8. The Data Set MyParmEst
The section “ODS Table Names” on page 4394 gives the complete list of the tables produced by PROC SURVEYREG.
Syntax

The following statements are available in PROC SURVEYREG:
   PROC SURVEYREG < options > ;
      BY variables ;
      CLASS variables ;
      CLUSTER variables ;
      CONTRAST 'label' effect values < . . . effect values > < / options > ;
      ESTIMATE 'label' effect values < . . . effect values > < / options > ;
      MODEL dependent = < effects > < / options > ;
      STRATA variables < / options > ;
      WEIGHT variable ;

The PROC SURVEYREG and MODEL statements are required. If your model contains classification effects, you must list the classification variables in a CLASS statement, and the CLASS statement must precede the MODEL statement. If you use a CONTRAST statement or an ESTIMATE statement, the MODEL statement must precede the CONTRAST or ESTIMATE statement. The CLASS, CLUSTER, STRATA, CONTRAST, and ESTIMATE statements can appear multiple times. You should use only one MODEL statement and one WEIGHT statement.
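Putting these together, a minimal sketch of a call that uses most of the statements in a valid order (the data set and variable names here are hypothetical):

   proc surveyreg data=MySurvey total=StrataTotals;
      strata Region;                /* stratification variable          */
      cluster School;               /* first-stage clusters (PSUs)      */
      class Gender;                 /* must precede the MODEL statement */
      model Score = Gender Income / solution;
      weight SamplingWeight;
      contrast 'Gender effect' Gender 1 -1;   /* must follow MODEL */
   run;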
PROC SURVEYREG Statement

PROC SURVEYREG < options > ;

The PROC SURVEYREG statement invokes the procedure. You can specify the following options in the PROC SURVEYREG statement:

ALPHA=α
sets the confidence level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default value is 0.05. A confidence level of α produces 100(1 − α)% confidence limits. The default of ALPHA=0.05 produces 95% confidence limits. DATA=SAS-data-set
specifies the SAS data set to be analyzed by PROC SURVEYREG. If you omit the DATA= option, the procedure uses the most recently created SAS data set.
RATE=value | SAS-data-set R=value | SAS-data-set
specifies the sampling rate as a non-negative value, or specifies an input data set that contains the stratum sampling rates. The procedure uses this information to compute a finite population correction for variance estimation. If your sample design has multiple stages, you should specify the first-stage sampling rate, which is the ratio of the number of PSUs selected to the total number of PSUs in the population. For a nonstratified sample design, or for a stratified sample design with the same sampling rate in all strata, you should specify a non-negative value for the RATE= option. If your design is stratified with different sampling rates in the strata, then you should name a SAS data set that contains the stratification variables and the sampling rates. See the section “Specification of Population Totals and Sampling Rates” on page 4382 for more details. The value in the RATE= option or the values of _RATE_ in the secondary data set must be non-negative numbers. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYREG will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option. TOTAL=value | SAS-data-set N=value | SAS-data-set
specifies the total number of primary sampling units in the study population as a positive value, or specifies an input data set that contains the stratum population totals. The procedure uses this information to compute a finite population correction for variance estimation. For a nonstratified sample design, or for a stratified sample design with the same population total in all strata, you should specify a positive value for the TOTAL= option. If your sample design is stratified with different population totals in the strata, then you should name a SAS data set that contains the stratification variables and the population totals. See the section Specification of Population Totals and Sampling Rates on page 4382 for more details. If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option. TRUNCATE
specifies that class levels should be determined using no more than the first 16 characters of the formatted values of the CLASS, STRATA, and CLUSTER variables. When formatted values are longer than 16 characters, you can use this option in order to revert to the levels as determined in releases previous to Version 9.
BY Statement

BY variables ;

You can specify a BY statement with PROC SURVEYREG to obtain separate analyses on observations in groups defined by the BY variables. Note that using a BY statement provides completely separate analyses of the BY groups. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. For more information on subpopulation analysis for sample survey data, refer to Cochran (1977). When a BY statement appears, the procedure expects the input data sets to be sorted in order of the BY variables. If you specify more than one BY statement, the procedure uses only the latest BY statement and ignores any previous ones. If your input data set is not sorted in ascending order, use one of the following alternatives:
• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the SURVEYREG procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.
For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
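For example, a minimal sketch of the first alternative (the data set MySurvey and the variable Region are hypothetical):

   proc sort data=MySurvey;
      by Region;
   run;
   proc surveyreg data=MySurvey;
      by Region;
      model Spending = Income;
   run;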
CLASS Statement

CLASS | CLASSES variables ;

The CLASS statement specifies the classification variables to be used in the model. Typical class variables are TREATMENT, GENDER, RACE, GROUP, and REPLICATION. If you specify the CLASS statement, it must appear before the MODEL statement. Classification variables can be either character or numeric. Class levels are determined from the formatted values of the CLASS variables. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Concepts. By default, class levels are determined from the entire formatted values of the CLASS variables. Note that this represents a slight change from previous releases in the way in which class levels are determined. In releases prior to Version 9, class levels were determined using no more than the first 16 characters of the formatted values. If you wish to revert to this
previous behavior you can use the TRUNCATE option in the PROC SURVEYREG statement. You can use multiple CLASS statements to specify classification variables.
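For example, a sketch of using a format to group the numeric Kids values into two class levels (the format name is arbitrary):

   proc format;
      value KidsFmt 1='one child' 2-4='two or more';
   run;
   proc surveyreg data=IceCream total=4000;
      class Kids;
      format Kids KidsFmt.;
      model Spending = Income Kids / solution;
   run;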
CLUSTER Statement

CLUSTER | CLUSTERS variables ;

The CLUSTER statement specifies variables that identify clusters in a clustered sample design. The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata. If your sample design has clustering at multiple stages, you should identify only the first-stage clusters, or primary sampling units (PSUs), in the CLUSTER statement. The CLUSTER variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLUSTER variables determine the CLUSTER variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary. By default, clusters are determined from the entire formatted values of the CLUSTER variables. Note that this represents a slight change from previous releases in the way in which clusters are determined. In releases prior to Version 9, clusters were determined using no more than the first 16 characters of the formatted values. If you wish to revert to this previous behavior you can use the TRUNCATE option in the PROC SURVEYREG statement. You can use multiple CLUSTER statements to specify cluster variables. The procedure uses variables from all CLUSTER statements to create clusters.
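A minimal sketch of a two-stage design in which only the first-stage clusters are identified (the data set and variable names are hypothetical):

   proc surveyreg data=SchoolSurvey total=StrataTotals;
      strata District;
      cluster School;       /* first-stage sampling units only */
      model Score = Income;
      weight SamplingWeight;
   run;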
CONTRAST Statement

CONTRAST 'label' effect values < / options > ;
CONTRAST 'label' effect values < . . . effect values > < / options > ;

The CONTRAST statement provides custom hypothesis tests for linear combinations of the regression parameters H0: Lβ = 0, where L is the vector or matrix you specify and β is the vector of regression parameters. Thus, to use this feature, you must be familiar with the details of the model parameterization used by PROC SURVEYREG. For information on the parameterization, see the section “Parameterization of PROC GLM Models” on page 1787 in Chapter 32, “The GLM Procedure.” Each term in the MODEL statement, called an effect, is a variable or a combination of variables. You can specify an effect with a variable name or a special notation using variable names and operators. For more details on how to specify an effect, see the section “Specification of Effects” on page 1784 in Chapter 32, “The GLM Procedure.” For each CONTRAST statement, PROC SURVEYREG computes Wald’s F test. The procedure displays this value with the degrees of freedom, and identifies it with the
contrast label. The numerator degrees of freedom for Wald’s F test equals rank(L). The denominator degrees of freedom equals the number of clusters (or the number of observations if there is no CLUSTER statement) minus the number of strata. Alternatively, you can use the DF= option in the MODEL statement to specify the denominator degrees of freedom. You can specify any number of CONTRAST statements, but they must appear after the MODEL statement. In the CONTRAST statement, label
identifies the contrast in the output. A label is required for every contrast specified. Labels must be enclosed in single quotes.
effect
identifies an effect that appears in the MODEL statement. You can use the INTERCEPT keyword as an effect when an intercept is fitted in the model. You do not need to include all effects that are in the MODEL statement.
values
are constants that are elements of L associated with the effect.
You can specify the following options in the CONTRAST statement after a slash (/): E
displays the entire coefficient L vector or matrix. NOFILL
requests no filling in higher-order effects. When you specify only certain portions of L, by default PROC SURVEYREG constructs the remaining elements from the context. (For more information, see the section “Specification of ESTIMATE Expressions” on page 1801 in Chapter 32, “The GLM Procedure.” ) When you specify the NOFILL option, PROC SURVEYREG does not construct the remaining portions and treats the vector or matrix L as it is defined in the CONTRAST statement. SINGULAR=value
specifies the sensitivity for checking estimability. If v is a vector, define ABS(v) to be the largest absolute value of the elements of v. Say H is the (X'X)^- X'X matrix, and C is ABS(L) except for elements of L that equal 0, and then C is 1. If ABS(L − LH) > C·value, then L is declared nonestimable. The SINGULAR=value must be between 0 and 1, and the default is 10^-4. As stated previously, the CONTRAST statement enables you to perform hypothesis tests H0: Lβ = 0. If the L matrix contains more than one contrast, then you can separate the rows of the L matrix with commas. For example, for the model

   proc surveyreg;
      class A B;
      model Y=A B;
   run;
with A at 5 levels and B at 2 levels, the parameter vector is

   (µ  α1  α2  α3  α4  α5  β1  β2)

To test the hypothesis that the pooled A linear and A quadratic effect is zero, you can use the following L matrix:

   L = ( 0  -2  -1   0   1   2   0   0 )
       ( 0   2  -1  -2  -1   2   0   0 )
The corresponding CONTRAST statement is

   contrast 'A Linear & Quadratic' a -2 -1  0  1 2,
                                   a  2 -1 -2 -1 2;
ESTIMATE Statement

ESTIMATE 'label' effect values < / options > ;
ESTIMATE 'label' effect values < . . . effect values > < / options > ;

You can use an ESTIMATE statement to estimate a linear function of the regression parameters by multiplying a row vector L by the parameter estimate vector β̂. Each term in the MODEL statement, called an effect, is a variable or a combination of variables. You can specify an effect with a variable name or with a special notation using variable names and operators. For more details on how to specify an effect, see the section “Specification of Effects” on page 1784 in Chapter 32, “The GLM Procedure.”

PROC SURVEYREG checks the linear function for estimability (see the SINGULAR= option described on page 4379). The procedure displays the estimate Lβ̂ along with its standard error and t test. If you specify the CLPARM option in the MODEL statement, PROC SURVEYREG also displays confidence limits for the linear function. By default, the degrees of freedom for the t test equals the number of clusters (or the number of observations if there is no CLUSTER statement) minus the number of strata. Alternatively, you can specify the degrees of freedom with the DF= option in the MODEL statement. You can specify any number of ESTIMATE statements, but they must appear after the MODEL statement. In the ESTIMATE statement,
b along with its standard error and t test. If The procedure displays the estimate Lβ you specify the CLPARM option in the MODEL statement, PROC SURVEYREG also displays confidence limits for the linear function. By default, the degrees of freedom for the t test equals the number of clusters (or the number of observations if there is no CLUSTER statement) minus the number of strata. Alternatively, you can specify the degrees of freedom with the DF= option in the MODEL statement. You can specify any number of ESTIMATE statements, but they must appear after the MODEL statement. In the ESTIMATE statement, label
identifies the linear function L in the output. A label is required for every function specified. Labels must be enclosed in single quotes.
effect
identifies an effect that appears in the MODEL statement. You can use the INTERCEPT keyword as an effect when an intercept is fitted in the model. You do not need to include all effects that are in the MODEL statement.
values
are constants that are elements of the vector L associated with the effect. For example, the following statement forms an estimate that is the difference between the parameters estimated for the first and second levels of the CLASS variable A:

estimate 'A1 vs A2' A 1 -1;
You can specify the following options in the ESTIMATE statement after a slash (/):

DIVISOR=value
specifies a value by which to divide all coefficients so that fractional coefficients can be entered as integers. For example, you can use

estimate '1/3(A1+A2) - 2/3A3' a 1 1 -2 / divisor=3;

instead of

estimate '1/3(A1+A2) - 2/3A3' a 0.33333 0.33333 -0.66667;

E
displays the entire coefficient vector L.

NOFILL
requests no filling in higher-order effects. When you specify only certain portions of the vector L, by default PROC SURVEYREG constructs the remaining elements from the context. (See the section "Specification of ESTIMATE Expressions" on page 1801 in Chapter 32, "The GLM Procedure.") When you specify the NOFILL option, PROC SURVEYREG does not construct the remaining portions and treats the vector L as it is defined in the ESTIMATE statement.

SINGULAR=value
specifies the sensitivity for checking estimability. If v is a vector, define ABS(v) to be the largest absolute value of the elements of v. Let H be the (X′X)⁻X′X matrix, and let C be ABS(L), except that elements of C corresponding to zero elements of L are set to 1. If ABS(L − LH) > C×value, then L is declared nonestimable. The SINGULAR= value must be between 0 and 1, and the default is 10⁻⁴.
MODEL Statement

MODEL dependent = < effects > < / options > ;

The MODEL statement specifies the dependent (response) variable and the independent (regressor) variables or effects. Each term in a MODEL statement, called an effect, is a variable or a combination of variables. You can specify an effect with a variable name or with special notation using variable names and operators. For more information on how to specify an effect, see the section "Specification of Effects" on page 1784 in Chapter 32, "The GLM Procedure." The dependent variable must be numeric. Only one MODEL statement is allowed for each PROC SURVEYREG
statement. If you specify more than one MODEL statement, the procedure uses the first model and ignores the rest. You can specify the following options in the MODEL statement after a slash (/):

ADJRSQ
requests that the procedure compute the adjusted multiple R-square.

ANOVA
requests that the ANOVA table be produced in the output. By default, the ANOVA table is not printed.

CLPARM
requests confidence limits for the parameter estimates. The SURVEYREG procedure determines the confidence coefficient using the ALPHA= option, which by default equals 0.05 and produces 95% confidence bounds. The CLPARM option also requests confidence limits for all the estimable linear functions of regression parameters in the ESTIMATE statements. Note that when there is a CLASS statement, you need to use the SOLUTION option with the CLPARM option to obtain the parameter estimates and their confidence limits.
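For example, a minimal sketch of this combination (the data set and variable names here are hypothetical):

   proc surveyreg data=MySurvey;
      class Region;                       /* hypothetical classification variable */
      model Income = Region HouseholdSize / clparm solution;
      weight SamplingWeight;
   run;

Because a CLASS statement is present, the SOLUTION option is needed for the parameter estimates, and hence their CLPARM confidence limits, to be displayed.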
COVB
displays the estimated covariance matrix of the regression coefficient estimates.

DEFF
displays design effects for the regression coefficient estimates.

DF=value
specifies the denominator degrees of freedom for the F tests and the degrees of freedom for the t tests. The default is the number of clusters (or the number of observations if there is no CLUSTER statement) minus the number of actual strata. The number of actual strata equals the number of strata in the data before collapsing, minus the number of strata collapsed, plus 1. See the section "Stratum Collapse" on page 4388 for details on collapsing of strata.

INVERSE | I
displays the inverse or the generalized inverse of the X′X matrix. When there is a WEIGHT variable, the procedure displays the inverse or the generalized inverse of the X′WX matrix, where W is the diagonal matrix constructed from WEIGHT variable values.

NOINT
omits the intercept from the model.

SOLUTION
displays a solution to the normal equations, that is, the parameter estimates. The SOLUTION option is useful only when you use a CLASS statement. If you do not specify a CLASS statement, PROC SURVEYREG displays parameter estimates by default. But if you specify a CLASS statement, PROC SURVEYREG does not display parameter estimates unless you also specify the SOLUTION option.
VADJUST=DF | NONE
VARADJ=DF | NONE
VARADJUST=DF | NONE
specifies whether to use the degrees of freedom adjustment (n − 1)/(n − p) in the computation of the matrix G for the variance estimation on page 4385. If you do not specify the VADJUST= option, PROC SURVEYREG uses the degrees of freedom adjustment by default, which is equivalent to specifying VADJUST=DF. If you do not want to use this variance adjustment, you can specify the VADJUST=NONE option.

XPX | X
displays the X′X matrix, or the X′WX matrix when there is a WEIGHT variable, where W is the diagonal matrix constructed from WEIGHT variable values. The XPX option also displays the crossproducts vector X′y, or X′Wy.
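As an illustrative sketch of how several of these options can be combined (the data set and variable names are hypothetical, and DF=30 is an arbitrary value):

   proc surveyreg data=MySurvey;
      strata Stratum;
      cluster PSU;
      model y = x1 x2 / anova deff covb df=30 vadjust=none;
      weight SamplingWeight;
   run;

These statements request the ANOVA table, the design effects, and the covariance matrix of the estimates, override the default denominator degrees of freedom with 30, and suppress the (n − 1)/(n − p) variance adjustment.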
STRATA Statement

STRATA | STRATUM variables < / options > ;

The STRATA statement specifies variables that form the strata in a stratified sample design. The combinations of categories of STRATA variables define the strata in the sample. If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. See the section "Specification of Population Totals and Sampling Rates" on page 4382 for more information.

The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. By default, strata are determined from the entire formatted values of the STRATA variables. Note that this represents a slight change from previous releases in the way in which strata are determined. In releases prior to Version 9, strata were determined using no more than the first 16 characters of the formatted values. If you wish to revert to this previous behavior, you can use the TRUNCATE option in the PROC SURVEYREG statement. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide. You can use multiple STRATA statements to specify stratum variables.

You can specify the following options in the STRATA statement after a slash (/):

LIST
displays a "Stratum Information" table, which includes values of the STRATA variables and the number of observations, number of clusters, population total, and sampling rate for each stratum. This table also displays stratum collapse information.

NOCOLLAPSE
prevents the procedure from collapsing, or combining, strata that have only one sampling unit. By default, the procedure collapses strata that contain only one sampling unit. See the section “Stratum Collapse” on page 4388 for details.
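For instance, a minimal sketch (the data set and variable names are hypothetical) that displays the "Stratum Information" table and prevents collapsing:

   proc surveyreg data=MySurvey;
      strata State Region / list nocollapse;
      model y = x;
      weight SamplingWeight;
   run;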
WEIGHT Statement

WEIGHT | WGT variable ;

The WEIGHT statement specifies the variable that contains the sampling weights. This variable must be numeric. If you do not specify a WEIGHT statement, PROC SURVEYREG assigns all observations a weight of 1. Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. If you specify more than one WEIGHT statement, the procedure uses only the first WEIGHT statement and ignores the rest.
Details

Missing Values
If an observation has a missing value or a nonpositive value for the WEIGHT variable, then PROC SURVEYREG excludes that observation from the analysis. An observation is also excluded if it has a missing value for any STRATA variable, CLUSTER variable, dependent variable, or any variable used in the independent effects. The analysis includes all observations in the data set that have nonmissing values for all these design and analysis variables.

If you have missing values in your survey data for any reason (such as nonresponse), this can compromise the quality of your survey results. If the respondents are different from the nonrespondents with regard to a survey effect or outcome, then survey estimates will be biased and will not accurately represent the survey population. There are a variety of techniques in sample design and survey operations that can reduce nonresponse. Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEYREG. Refer to Cochran (1977) for more details.
Survey Design Information

Specification of Population Totals and Sampling Rates
If your analysis should include a finite population correction (fpc), you can input either the sampling rate or the population total using the RATE= option or the TOTAL= option. You cannot specify both of these options in the same PROC SURVEYREG statement. If you do not specify one of these options, the procedure does not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

If your design has multiple stages of selection and you are specifying the RATE= option, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the study population. If you are specifying the TOTAL= option for a multistage design, you should input the total number of PSUs in the study population.
For a nonstratified sample design, or for a stratified sample design with the same sampling rate or the same population total in all strata, you should use the RATE=value option or the TOTAL=value option. If your sample design is stratified with different sampling rates or population totals in the strata, then you can use the RATE=SAS-data-set option or the TOTAL=SAS-data-set option to name a SAS data set that contains the stratum sampling rates or totals. This data set is called a secondary data set, as opposed to the primary data set that you specify with the DATA= option.

The secondary data set must contain all the stratification variables listed in the STRATA statement and all the variables in the BY statement. If there are formats associated with the STRATA variables and the BY variables, then the formats must be consistent in the primary and the secondary data sets. If you specify the TOTAL=SAS-data-set option, the secondary data set must have a variable named _TOTAL_ that contains the stratum population totals. Or if you specify the RATE=SAS-data-set option, the secondary data set must have a variable named _RATE_ that contains the stratum sampling rates. The secondary data set must contain all BY and STRATA groups that occur in the primary data set. If the secondary data set contains more than one observation for any one stratum, then the procedure uses the first value of _TOTAL_ or _RATE_ for that stratum and ignores the rest.

The value in the RATE= option, or the values of _RATE_ in the secondary data set, must be nonnegative numbers. You can specify a sampling rate as a number between 0 and 1. Or you can specify a sampling rate in percentage form as a number between 1 and 100, and PROC SURVEYREG will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. If you specify the TOTAL=value option, value must not be less than the sample size. If you provide stratum population totals in a secondary data set, these values must not be less than the corresponding stratum sample sizes.
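As a sketch of this setup (the stratum values and totals here are hypothetical), a secondary data set for the TOTAL= option carries the STRATA variables and the _TOTAL_ variable:

   data StratumTotals;            /* secondary data set: one row per stratum  */
      input Region $ _TOTAL_;     /* _TOTAL_ holds the stratum population total */
      datalines;
   East 1500
   West 2200
   ;

   proc surveyreg data=MySurvey total=StratumTotals;
      strata Region;
      model y = x;
      weight SamplingWeight;
   run;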
Primary Sampling Units (PSUs)
When you have clusters, or primary sampling units (PSUs), in your sample design, the procedure estimates variance from the variation among PSUs. For more information, see the section "Variance Estimation" on page 4385. You can use the CLUSTER statement to identify the first-stage clusters in your design. PROC SURVEYREG assumes that each cluster represents a PSU in the sample and that each observation is an element of a PSU. If you do not specify a CLUSTER statement, the procedure treats each observation as a PSU.
Computational Details

Notation
For a stratified clustered sample design, observations are represented by an n × (p + 2) matrix

(w, y, X) = (w_{hij}, y_{hij}, x_{hij})

where

• w denotes the sampling weight vector
• y denotes the dependent variable
• X denotes the design matrix. (When an effect contains only classification variables, the columns of X corresponding to this effect contain only 0s and 1s; no reparameterization is made.)
• h = 1, 2, ..., H is the stratum number, with a total of H strata
• i = 1, 2, ..., n_h is the cluster number within stratum h, with a total of n_h clusters
• j = 1, 2, ..., m_{hi} is the unit number within cluster i of stratum h, with a total of m_{hi} units
• p is the total number of parameters (including an intercept if the INTERCEPT effect is included in the MODEL statement)
• $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample

Also, f_h denotes the sampling rate for stratum h. You can use the TOTAL= option or the RATE= option to input population totals or sampling rates. See the section "Specification of Population Totals and Sampling Rates" on page 4382 for details. If you input stratum totals, PROC SURVEYREG computes f_h as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYREG uses these values directly for f_h. If you do not specify the TOTAL= option or the RATE= option, then the procedure assumes that the stratum sampling rates f_h are negligible, and a finite population correction is not used when computing variances.
Regression Coefficients
PROC SURVEYREG solves the normal equations X′WXβ = X′Wy using a modified sweep routine that produces a generalized (g2) inverse (X′WX)⁻ and a solution (Pringle and Rayner 1971)

$$\hat{\beta} = (X'WX)^{-} X'Wy$$

where W is the diagonal matrix constructed from WEIGHT variable values. For models with class variables, there are more design matrix columns than there are degrees of freedom (DF) for the effect. Thus, there are linear dependencies among the columns. In this case, the parameters are not estimable; there is an infinite number of
least-squares solutions. PROC SURVEYREG uses a generalized (g2) inverse to obtain values for the estimates. The solution values are not displayed unless you specify the SOLUTION option in the MODEL statement. The solution has the characteristic that estimates are 0 whenever the design column for that parameter is a linear combination of previous columns. (Strictly termed, the solution values should not be called estimates.) With this full parameterization, hypothesis tests are constructed to test linear functions of the parameters that are estimable.
Variance Estimation PROC SURVEYREG uses the Taylor series expansion theory to estimate the covariance-variance matrix of the estimated regression coefficients (Fuller 1975). Let b r = y − Xβ where the (h, i, j)th element is rhij . Compute 1 × p row vectors ehij
= whij rhij xhij
ehi· =
mhi X
ehij
j=1
¯h·· = e
nh 1 X ehi· nh i=1
and calculate the p × p matrix H
n
h n − 1 X nh (1 − fh ) X ¯h·· )0 (ehi· − e ¯h·· ) G= (ehi· − e n−p nh − 1
h=1
i=1
PROC SURVEYREG computes the covariance matrix of β as b = (X0 WX)− G(X0 WX)− V b should reduce the The factor (n − 1)/(n − p) in the computation of the matrix G small sample bias associated with using the estimated function to calculate deviations (Hidiroglou et al. (1980)). For simple random sampling, this factor contributes to the degrees of freedom correction applied to the residual mean square for ordinary least squares in which p parameter are estimated. By default, the procedure will use this adjustment in the variance estimation. It is equivalent to specify the VADJUST=DF option in the MODEL statement. If you do not wish to use this multiplier in the variance estimation, you can specify the VADJUST=NONE option in the MODEL statement to suppress this factor.
Degrees of Freedom PROC SURVEYREG produces tests for the significance of model effects, regression parameters, estimable functions specified in the ESTIMATE statement, and contrasts specified in the CONTRAST statement. It computes all these tests taking into account the sample design. The degrees of freedom for these tests differ from the degrees of freedom for the ANOVA table, which does not consider the sample design.
Denominator Degrees of Freedom
The denominator DF refers to the denominator degrees of freedom for F tests and to the degrees of freedom for t tests in the analysis. By default, the denominator DF equals the number of clusters minus the actual number of strata. If there are no clusters, the denominator DF equals the number of observations minus the actual number of strata. The actual number of strata equals

• one, if there is no STRATA statement
• the number of strata in the sample, if there is a STRATA statement but the procedure does not collapse any strata
• the number of strata in the sample after collapsing, if there is a STRATA statement and the procedure collapses strata that have only one sampling unit

Alternatively, you can specify the denominator DF using the DF= option on page 4380 in the MODEL statement.

Numerator Degrees of Freedom
The numerator DF refers to the numerator degrees of freedom for the Wald F statistic associated with an effect or with a contrast. The procedure computes the Wald F statistic for an effect as a Type III test; that is, the test has the following properties:

• The hypothesis for an effect does not involve parameters of other effects except for containing effects (which it must involve to be estimable).
• The hypotheses to be tested are invariant to the ordering of effects in the model.

See the section "Testing Effects" on page 4386 for more information. The numerator DF for the Wald F statistic for a contrast is the rank of the L matrix that defines the contrast.
Testing Effects
For each effect in the model, PROC SURVEYREG computes an L matrix such that every element of Lβ is estimable; the L matrix has the maximum possible rank associated with the effect. To test the effect, the procedure uses the Wald F statistic for the hypothesis H0: Lβ = 0. The Wald F statistic equals

$$F_{Wald} = \frac{(L\hat{\beta})'\, (L\hat{V}L')^{-1}\, (L\hat{\beta})}{\mathrm{rank}(L)}$$

with numerator degrees of freedom equal to rank(L) and denominator degrees of freedom equal to the number of clusters minus the number of strata (unless you have specified the denominator degrees of freedom with the DF= option in the MODEL statement; see the section "Denominator Degrees of Freedom" on page 4386). It is possible that the L matrix cannot be constructed for an effect, in which case that effect is not testable. For more information on how the matrix L is constructed, see the discussion in Chapter 11, "The Four Types of Estimable Functions."
Analysis of Variance (ANOVA)
PROC SURVEYREG produces an analysis of variance table for the model specified in the MODEL statement. This table is identical to the one produced by the GLM procedure for the model. PROC SURVEYREG computes ANOVA table entries using the sampling weights, but not the sample design information on stratification and clustering. The degrees of freedom (DF) displayed in the ANOVA table are the same as those in the ANOVA table produced by PROC GLM. The Total DF is the total degrees of freedom used to obtain the regression coefficient estimates. The Total DF equals the total number of observations minus 1 if the model includes an intercept. If the model does not include an intercept, the Total DF equals the total number of observations. The Model DF equals the degrees of freedom for the effects in the MODEL statement, not including the intercept. The Error DF equals the total DF minus the model DF.

Multiple R-square
PROC SURVEYREG computes a multiple R-square for the weighted regression as

$$R^2 = 1 - \frac{SS_{error}}{SS_{total}}$$

where SS_error is the error sum of squares in the ANOVA table,

$$SS_{error} = r'Wr$$

and SS_total is the total sum of squares,

$$SS_{total} = \begin{cases} y'Wy & \text{if no intercept} \\ y'Wy - \left( \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij} \right)^2 / \, w_{\cdots} & \text{otherwise} \end{cases}$$

where $w_{\cdots}$ is the sum of the sampling weights over all observations.
Adjusted R-square
If you specify the ADJRSQ option in the MODEL statement, PROC SURVEYREG computes an adjusted multiple R-square for the weighted regression as

$$ADJRSQ = \begin{cases} 1 - \dfrac{n(1-R^2)}{n-p} & \text{if no intercept} \\[1ex] 1 - \dfrac{(n-1)(1-R^2)}{n-p} & \text{otherwise} \end{cases}$$

where R² is the multiple R-square.
Root Mean Square Errors
PROC SURVEYREG computes the square root of the mean square error as

$$\sqrt{\text{MSE}} = \sqrt{\frac{n \, SS_{error}}{(n-p)\, w_{\cdots}}}$$

where $w_{\cdots}$ is the sum of the sampling weights over all observations.
Design Effect
If you specify the DEFF option in the MODEL statement, PROC SURVEYREG calculates the design effects for the regression coefficients. The design effect of an estimate is the ratio of the actual variance to the variance computed under the assumption of simple random sampling:

$$DEFF = \frac{\text{Variance under the Sample Design}}{\text{Variance under Simple Random Sampling}}$$

Refer to Kish (1965, p. 258). PROC SURVEYREG computes the numerator as described in the section "Variance Estimation" on page 4385. The denominator is computed under the assumption that the sample design is simple random sampling, with no stratification and no clustering. To compute the variance under the assumption of simple random sampling, PROC SURVEYREG calculates the sampling rate as follows. If you specify both sampling weights and sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is calculated as

$$f_{SRS} = n \, / \, w_{\cdots}$$

where n is the sample size and $w_{\cdots}$ (the sum of the weights over all observations) estimates the population size. If the sum of the weights is less than the sample size, f_SRS is set to zero. If you specify sampling rates for the analysis but not sampling weights, then PROC SURVEYREG computes the sampling rate under simple random sampling as the average of the stratum sampling rates:

$$f_{SRS} = \frac{1}{H} \sum_{h=1}^{H} f_h$$

If you do not specify sampling rates (or population totals) for the analysis, then the sampling rate under simple random sampling is assumed to be zero:

$$f_{SRS} = 0$$
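As a small worked illustration of the first case, suppose (hypothetically) a sample of n = 50 observations whose sampling weights sum to $w_{\cdots} = 1000$; then

$$f_{SRS} = \frac{50}{1000} = 0.05$$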
Stratum Collapse If there is only one sampling unit in a stratum, then PROC SURVEYREG cannot estimate the variance for this stratum. To estimate stratum variances, by default the procedure collapses, or combines, those strata that contain only one sampling unit. If you specify the NOCOLLAPSE option in the STRATA statement, PROC SURVEYREG
does not collapse strata and uses a variance estimate of 0 for any stratum that contains only one sampling unit. If you do not specify the NOCOLLAPSE option, PROC SURVEYREG collapses strata according to the following rules. If there are multiple strata that each contain only one sampling unit, then the procedure collapses, or combines, all these strata into a new pooled stratum. If there is only one stratum with a single sampling unit, then PROC SURVEYREG collapses that stratum with the preceding stratum, where strata are ordered by the STRATA variable values. If the stratum with one sampling unit is the first stratum, then the procedure combines it with the following stratum. If you specify stratum sampling rates using the RATE=SAS-data-set option, PROC SURVEYREG computes the sampling rate for the new pooled stratum as the weighted average of the sampling rates for the collapsed strata. See the section “Computational Details” on page 4384 for details. If the specified sampling rate equals 0 for any of the collapsed strata, then the pooled stratum is assigned a sampling rate of 0. If you specify stratum totals using the TOTAL=SAS-data-set option, PROC SURVEYREG combines the totals for the collapsed strata to compute the sampling rate for the new pooled stratum.
Sampling Rate of the Pooled Stratum from Collapse
Assuming that PROC SURVEYREG collapses single-unit strata $h_1, h_2, \ldots, h_c$ into the pooled stratum, the procedure calculates the sampling rate for the pooled stratum as

$$f_{\text{Pooled Stratum}} = \begin{cases} 0 & \text{if any of } f_{h_l} = 0, \text{ where } l = 1, 2, \ldots, c \\[1ex] \left( \displaystyle\sum_{l=1}^{c} n_{h_l} f_{h_l}^{-1} \right)^{-1} \displaystyle\sum_{l=1}^{c} n_{h_l} & \text{otherwise} \end{cases}$$
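For a hypothetical illustration, suppose two single-unit strata ($n_{h_1} = n_{h_2} = 1$) with sampling rates $f_{h_1} = 0.1$ and $f_{h_2} = 0.2$ are collapsed; then

$$f_{\text{Pooled Stratum}} = \left( \frac{1}{0.1} + \frac{1}{0.2} \right)^{-1} (1 + 1) = \frac{2}{15} \approx 0.133$$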
Contrasts
You can use the CONTRAST statement to perform custom hypothesis tests. If the hypothesis is testable in the univariate case, the Wald F statistic for H0: Lβ = 0 is computed as

$$F_{Wald} = \frac{(L_{Full}\hat{\beta})'\, (L_{Full}\hat{V}L_{Full}')^{-1}\, (L_{Full}\hat{\beta})}{\mathrm{rank}(L)}$$

where L is the contrast vector or matrix you specify, β is the vector of regression parameters, β̂ = (X′WX)⁻X′WY, V̂ is the estimated covariance matrix of β̂, rank(L) is the rank of L, and L_Full is a matrix such that

• L_Full has the same number of columns as L
• L_Full has full row rank
• the rank of L_Full equals the rank of the L matrix
• all rows of L_Full are estimable functions
• the Wald F statistic computed using the L_Full matrix is equivalent to the Wald F statistic computed using the L matrix with any row deleted that is a linear combination of previous rows
If L is a full-rank matrix, and all rows of L are estimable functions, then L_Full is the same as L. It is possible that the L_Full matrix cannot be constructed for contrasts in a CONTRAST statement, in which case the contrasts are not testable.
Output

Displayed Output
The SURVEYREG procedure produces the following output.

Data Summary
By default, PROC SURVEYREG displays the following information in the "Data Summary" table:

• Number of Observations, which is the total number of observations used in the analysis, excluding observations with missing values
• Sum of Weights, if you specify a WEIGHT statement
• Mean of the dependent variable in the MODEL statement, or Weighted Mean if you specify a WEIGHT statement
• Sum of the dependent variable in the MODEL statement, or Weighted Sum if you specify a WEIGHT statement

Design Summary
When you specify a CLUSTER statement or a STRATA statement, the procedure displays a "Design Summary" table, which provides the following sample design information:

• Number of Strata, if you specify a STRATA statement
• Number of Strata Collapsed, if the procedure collapses strata
• Number of Clusters, if you specify a CLUSTER statement
• Overall Sampling Rate used to calculate the design effect, if you specify the DEFF option in the MODEL statement

Fit Statistics
By default, PROC SURVEYREG displays the following regression statistics in the "Fit Statistics" table:

• R-square for the regression
• Root MSE, which is the square root of the mean square error
• Denominator DF, which is the denominator degrees of freedom for the F tests and also the degrees of freedom for the t tests produced by the procedure
Stratum Information
When you specify the LIST option in the STRATA statement, PROC SURVEYREG displays a "Stratum Information" table, which provides the following information for each stratum:

• Stratum Index, which is a sequential stratum identification number
• STRATA variable(s), which lists the levels of STRATA variables for the stratum
• Population Total, if you specify the TOTAL= option
• Sampling Rate, if you specify the TOTAL= option or the RATE= option. If you specify the TOTAL= option, the sampling rate is based on the number of nonmissing observations in the stratum.
• N Obs, which is the number of observations
• number of Clusters, if you specify a CLUSTER statement
• Collapsed, which has the value 'Yes' if the stratum is collapsed with another stratum before analysis

If PROC SURVEYREG collapses strata, the "Stratum Information" table also displays stratum information for the new, collapsed stratum. The new stratum has a Stratum Index of 0 and is labeled 'Pooled'.

Class Level Information
If you use a CLASS statement to name classification variables, PROC SURVEYREG displays a "Class Level Information" table. This table contains the following information for each classification variable:

• Class Variable, which lists each CLASS variable name
• Levels, which is the number of values or levels of the classification variable
• Values, which lists the values of the classification variable. The values are separated by a white space character; therefore, to avoid confusion, you should not include a white space character within a classification variable value.

X′X Matrix
If you specify the XPX option in the MODEL statement, PROC SURVEYREG displays the X′X matrix, or the X′WX matrix when there is a WEIGHT variable. This option also displays the crossproducts vector X′y or X′Wy, where y is the response vector (dependent variable).

Inverse Matrix of X′X
If you specify the INV option in the MODEL statement, PROC SURVEYREG displays the inverse or the generalized inverse of the X′X matrix. When there is a WEIGHT variable, the procedure displays the inverse or the generalized inverse of the X′WX matrix.
ANOVA for Dependent Variable
If you specify the ANOVA option in the MODEL statement, PROC SURVEYREG displays an analysis of variance table for the dependent variable. This table is identical to the ANOVA table displayed by the GLM procedure.

Tests of Model Effects
By default, PROC SURVEYREG displays a "Tests of Model Effects" table, which provides Wald's F test for each effect in the model. The table contains the following information for each effect:

• Effect, which is the effect name
• Num DF, which is the numerator degrees of freedom for Wald's F test
• F Value, which is Wald's F statistic
• Pr > F, which is the significance probability corresponding to the F Value

A footnote displays the denominator degrees of freedom, which is the same for all effects.

Estimated Regression Coefficients
PROC SURVEYREG displays the "Estimated Regression Coefficients" table by default when there is no CLASS statement. The procedure also displays this table when you specify a CLASS statement and also specify the SOLUTION option in the MODEL statement. This table contains the following information for each regression parameter:

• Parameter, which identifies the effect or regressor variable
• Estimate, which is the estimate of the regression coefficient
• Standard Error, which is the standard error of the estimate
• t Value, which is the t statistic for testing H0: Parameter = 0
• Pr > |t|, which is the two-sided significance probability corresponding to the t Value

Covariance of Estimated Regression Coefficients
When you specify the COVB option in the MODEL statement, PROC SURVEYREG displays the "Covariance of Estimated Regression Coefficients" matrix.

Coefficients of Contrast
When you specify the E option in a CONTRAST statement, PROC SURVEYREG displays a "Coefficients of Contrast" table for the contrast. You can use this table to check the coefficients you specified in the CONTRAST statement. Also, this table gives a note for a nonestimable contrast.
Analysis of Contrasts
If you specify a CONTRAST statement, PROC SURVEYREG produces an "Analysis of Contrasts" table, which displays Wald's F test for the contrast. If you use more than one CONTRAST statement, the procedure displays all results in the same table. The "Analysis of Contrasts" table contains the following information for each contrast:

• Contrast, which is the label of the contrast
• Num DF, which is the numerator degrees of freedom for Wald's F test
• F Value, which is Wald's F statistic for testing H0: Contrast = 0
• Pr > F, which is the significance probability corresponding to the F Value

Coefficients of Estimate
When you specify the E option in an ESTIMATE statement, PROC SURVEYREG displays a "Coefficients of Estimate" table for the linear function of the regression parameters in the ESTIMATE statement. You can use this table to check the coefficients you specified in the ESTIMATE statement. Also, this table gives a note for a nonestimable function.

Analysis of Estimable Functions
If you specify an ESTIMATE statement, PROC SURVEYREG checks the function for estimability. If the function is estimable, PROC SURVEYREG produces an "Analysis of Estimable Functions" table, which displays the estimate and the corresponding t test. If you use more than one ESTIMATE statement, the procedure displays all results in the same table. The table contains the following information for each estimable function:

• Parameter, which is the label of the function
• Estimate, which is the estimate of the estimable linear function
• Standard Error, which is the standard error of the estimate
• t Value, which is the t statistic for testing H0: Estimable Function = 0
• Pr > |t|, which is the two-sided significance probability corresponding to the t Value
Output Data Sets
Output data sets from PROC SURVEYREG are produced using ODS (Output Delivery System). ODS encompasses more than just the production of output data sets. For example, you can use ODS to manipulate the format of your output, the headers and titles of the tables, and the order of the columns in a table. For a more detailed description of using ODS, see Chapter 14, "Using the Output Delivery System."
ODS Table Names
PROC SURVEYREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 71.2. ODS Tables Produced in PROC SURVEYREG

ODS Table Name       Description                                       Statement          Option
ANOVA                ANOVA for dependent variable                      MODEL              ANOVA
ClassVarInfo         Class level information                           CLASS              default
ContrastCoef         Coefficients of contrast                          CONTRAST           E
Contrasts            Analysis of contrasts                             CONTRAST           default
CovB                 Covariance of estimated regression coefficients   MODEL              COVB
DataSummary          Data summary                                      MODEL              default
DesignSummary        Design summary                                    STRATA | CLUSTER   default
Effects              Tests of model effects                            MODEL              default
EstimateCoef         Coefficients of estimate                          ESTIMATE           E
Estimates            Analysis of estimable functions                   ESTIMATE           default
FitStatistics        Fit statistics                                    MODEL              default
InvXPX               Inverse matrix of X′X                             MODEL              INV
ParameterEstimates   Estimated regression coefficients                 MODEL              default
StrataInfo           Stratum information                               STRATA             LIST
XPX                  X′X matrix                                        MODEL              XPX
By referring to the names of such tables, you can use the ODS OUTPUT statement to place one or more of these tables in output data sets. For example, the following statements create an output data set named MyStrata, which contains the "StrataInfo" table; an output data set named MyParmEst, which contains the "ParameterEstimates" table; and an output data set named Cov, which contains the "CovB" table for the ice cream study discussed in the section "Stratified Sampling" on page 4368:

title1 'Ice Cream Spending Analysis';
title2 'Stratified Simple Random Sample Design';
proc surveyreg data=IceCream total=StudentTotals;
   strata Grade / list;
   class Kids;
   model Spending = Income Kids / solution covb;
   weight Weight;
   ods output StrataInfo = MyStrata
              ParameterEstimates = MyParmEst
              CovB = Cov;
run;
Note that the option CovB is specified in the MODEL statement in order to produce the covariance matrix table.
Examples

Example 71.1. Simple Random Sampling
This example investigates the relationship between the labor force participation rate (LFPR) of women in 1968 and 1972 in large cities in the United States. A simple random sample of 19 cities is drawn from a total of 200 cities. For each selected city, the LFPRs are recorded and saved in a SAS data set named Labor. The LFPR in 1972 is contained in the variable LFPR1972, and the LFPR in 1968 is identified by the variable LFPR1968:

data Labor;
   input City $ 1-16 LFPR1972 LFPR1968;
   datalines;
New York         .45  .42
Los Angeles      .50  .50
Chicago          .52  .52
Philadelphia     .45  .45
Detroit          .46  .43
San Francisco    .55  .55
Boston           .60  .45
Pittsburgh       .49  .34
St. Louis        .35  .45
Connecticut      .55  .54
Washington D.C.  .52  .42
Cincinnati       .53  .51
Baltimore        .57  .49
Newark           .53  .54
Minn/St. Paul    .59  .50
Buffalo          .64  .58
Houston          .50  .49
Patterson        .57  .56
Dallas           .64  .63
;
Assume that the LFPRs in 1968 and 1972 have a linear relationship, as shown in the following model:
LFPR1972 = β0 + β1 × LFPR1968 + error

You can use PROC SURVEYREG to obtain the estimated regression coefficients and estimated standard errors of the regression coefficients. The following statements perform the regression analysis:

title 'Study of Labor Force Participation Rates of Women';
proc surveyreg data=Labor total=200;
   model LFPR1972 = LFPR1968;
run;
Here, the TOTAL=200 option specifies the finite population total from which the simple random sample of 19 cities is drawn. You can specify the same information by using the sampling rate option RATE=0.095 (19/200 = 0.095).

Output 71.1.1. Summary of Regression Using Simple Random Sampling

Study of Labor Force Participation Rates of Women
The SURVEYREG Procedure
Regression Analysis for Dependent Variable LFPR1972

Data Summary
Number of Observations    19
Mean of LFPR1972          0.52684
Sum of LFPR1972           10.01000

Fit Statistics
R-square          0.3970
Root MSE          0.05657
Denominator DF    18

Output 71.1.1 summarizes the data information and the fit information.

Output 71.1.2. Regression Coefficient Estimates
Study of Labor Force Participation Rates of Women
The SURVEYREG Procedure
Regression Analysis for Dependent Variable LFPR1972

Tests of Model Effects
Effect       Num DF   F Value   Pr > F
Model        1        13.84     0.0016
Intercept    1        4.63      0.0452
LFPR1968     1        13.84     0.0016

NOTE: The denominator degrees of freedom for the F tests is 18.

Estimated Regression Coefficients
Parameter    Estimate      Standard Error   t Value   Pr > |t|
Intercept    0.20331056    0.09444296       2.15      0.0452
LFPR1968     0.65604048    0.17635810       3.72      0.0016

NOTE: The denominator degrees of freedom for the t tests is 18.
Output 71.1.2 presents the significance tests for the model effects and the estimated regression coefficients. The F tests and t tests for the effects in the model are also presented in these tables. From the regression performed by PROC SURVEYREG, you obtain a positive estimated slope for the linear relationship between the LFPR in 1968 and the LFPR in 1972. The regression coefficients are all significant at the 5% level: the effects Intercept and LFPR1968 are significant in the model at the 5% level. In this example, the F test for the overall model without the intercept is the same as the test for the effect LFPR1968.
Example 71.2. Simple Random Cluster Sampling
This example illustrates the use of regression analysis in a simple random cluster sample design. The data are from Särndal, Swenson, and Wretman (1992, p. 652). A total of 284 Swedish municipalities are grouped into 50 clusters of neighboring municipalities. Five clusters with a total of 32 municipalities are randomly selected. The results from the regression analysis in which clusters are used in the sample design are compared to the results of a regression analysis that ignores the clusters. The linear relationship between the population in 1975 and in 1985 is investigated. The 32 selected municipalities in the sample are saved in the data set Municipalities:

data Municipalities;
   input Municipality Cluster Population85 Population75;
   datalines;
205 37   5   5
206 37  11  11
207 37  13  13
208 37   8   8
209 37  17  19
  6  2  16  15
  7  2  70  62
  8  2  66  54
  9  2  12  12
 10  2  60  50
 94 17   7   7
 95 17  16  16
 96 17  13  11
 97 17  12  11
 98 17  70  67
 99 17  20  20
100 17  31  28
101 17  49  48
276 50   6   7
277 50   9  10
278 50  24  26
279 50  10   9
280 50  67  64
281 50  39  35
282 50  29  27
283 50  10   9
284 50  27  31
 52 10   7   6
 53 10   9   8
 54 10  28  27
 55 10  12  11
 56 10 107 108
;
The variable Municipality identifies the municipalities in the sample; the variable Cluster indicates the cluster to which a municipality belongs; and the variables Population85 and Population75 contain the municipality populations in 1985 and in 1975 (in thousands), respectively. A regression analysis is performed by PROC SURVEYREG with a CLUSTER statement:

title1 'Regression Analysis for Swedish Municipalities';
title2 'Cluster Simple Random Sampling';
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
run;
The TOTAL=50 option specifies the total number of clusters in the sampling frame.

Output 71.2.1. Regression Analysis for Simple Random Cluster Sampling

Regression Analysis for Swedish Municipalities
Cluster Simple Random Sampling
The SURVEYREG Procedure
Regression Analysis for Dependent Variable Population85

Data Summary
Number of Observations    32
Mean of Population85      27.50000
Sum of Population85       880.00000

Design Summary
Number of Clusters    5

Fit Statistics
R-square          0.9860
Root MSE          3.0488
Denominator DF    4

Estimated Regression Coefficients
Parameter       Estimate      Standard Error   t Value   Pr > |t|
Intercept       -0.0191292    0.89204053       -0.02     0.9839
Population75    1.0546253     0.05167565       20.41     <.0001

NOTE: The denominator degrees of freedom for the t tests is 4.
Output 71.2.1 displays the data summary, design summary, fit statistics, and regression coefficient estimates. Since the sample design includes clusters, the procedure displays the total number of clusters in the sample in the "Design Summary" table. In the "Estimated Regression Coefficients" table, the estimated slope for the linear relationship is 1.05, which is significant at the 5% level; but the intercept is not significant. This suggests that a regression line crossing the origin can be established between populations in 1975 and in 1985.

The CLUSTER statement is necessary in PROC SURVEYREG in order to incorporate the sample design. If you do not specify a CLUSTER statement in the regression analysis, the standard deviation of the regression coefficients will be incorrectly estimated:

title1 'Regression Analysis for Swedish Municipalities';
title2 'Simple Random Sampling';
proc surveyreg data=Municipalities total=284;
   model Population85=Population75;
run;
The analysis ignores the clusters in the sample, assuming that the sample design is a simple random sampling. Therefore, the TOTAL= option specifies the total number of municipalities, which is 284.

Output 71.2.2. Regression Analysis for Simple Random Sampling

Regression Analysis for Swedish Municipalities
Simple Random Sampling
The SURVEYREG Procedure
Regression Analysis for Dependent Variable Population85

Data Summary
Number of Observations    32
Mean of Population85      27.50000
Sum of Population85       880.00000

Fit Statistics
R-square          0.9860
Root MSE          3.0488
Denominator DF    31

Estimated Regression Coefficients
Parameter       Estimate      Standard Error   t Value   Pr > |t|
Intercept       -0.0191292    0.67417606       -0.03     0.9775
Population75    1.0546253     0.03668414       28.75     <.0001

NOTE: The denominator degrees of freedom for the t tests is 31.
Output 71.2.2 displays the regression results ignoring the clusters. Compared to the results in Output 71.2.1 on page 4398, the regression coefficient estimates are the same. However, without using clusters, the regression coefficients have a smaller variance estimate in Output 71.2.2. Using clusters in the analysis, the estimated regression coefficient for the effect Population75 is 1.05, with an estimated standard error of 0.05, as displayed in Output 71.2.1; without using the clusters, the estimate is 1.05, but with an estimated standard error of 0.04, as displayed in Output 71.2.2. To estimate the variance of the regression coefficients correctly, you should include the clustering information in the regression analysis.
Example 71.3. Regression Estimator for Simple Random Sample
Using auxiliary information, you can construct regression estimators to provide more accurate estimates of the population characteristics that are of interest. With ESTIMATE statements in PROC SURVEYREG, you can specify a regression estimator as a linear function of the regression parameters to estimate the population total. This example illustrates this application, using the data in the previous example. In this sample, a linear model between the Swedish populations in 1975 and in 1985 is established:

Population85 = α + β × Population75 + error

Assuming that the total population in 1975 is known to be 8200 (in thousands), you can use the ESTIMATE statement to predict the 1985 total population using the following statements:

title1 'Regression Analysis for Swedish Municipalities';
title2 'Estimate Total Population';
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
   estimate '1985 population' Intercept 284 Population75 8200;
run;
Since each observation in the sample is a municipality, and there is a total of 284 municipalities in Sweden, the coefficient for Intercept (α) in the ESTIMATE statement is 284, and the coefficient for Population75 (β) is the total population in 1975 (8.2 million).
Output 71.3.1. Use the Regression Estimator to Estimate the Population Total

Regression Analysis for Swedish Municipalities
Estimate Total Population
The SURVEYREG Procedure
Regression Analysis for Dependent Variable Population85

Analysis of Estimable Functions
Parameter          Estimate      Standard Error   t Value   Pr > |t|
1985 population    8642.49485    258.558613       33.43     <.0001

NOTE: The denominator degrees of freedom for the t tests is 4.
Output 71.3.1 displays the regression results and the estimation of the total population. Using the linear model, you can predict the total population in 1985 to be 8.64 million, with a standard error of 0.26 million.
Example 71.4. Stratified Sampling
This example illustrates using the SURVEYREG procedure to perform a regression in a stratified sample design. Consider a population of 235 farms producing corn in Nebraska and Iowa. You are interested in the relationship between corn yield (CornYield) and the total farm size (FarmArea). Each state is divided into several regions, and each region is used as a stratum. Within each stratum, a simple random sample with replacement is drawn. A total of 19 farms is selected using a stratified simple random sample. The sample size and population size within each stratum are displayed in Table 71.3.

Table 71.3. Number of Farms in Each Stratum

                              Number of Farms
Stratum   State      Region   Population   Sample
1         Iowa       1            100         3
2         Iowa       2             50         5
3         Iowa       3             15         3
4         Nebraska   1             30         6
5         Nebraska   2             40         2
Total                             235        19
Three models for the data are considered:

• Model I — Common intercept and slope:

  Corn Yield = α + β × Farm Area

• Model II — Common intercept, different slopes:

  Corn Yield = α + β_Iowa × Farm Area          if the farm is in Iowa
  Corn Yield = α + β_Nebraska × Farm Area      if the farm is in Nebraska
• Model III — Different intercepts and different slopes:

  Corn Yield = α_Iowa + β_Iowa × Farm Area              if the farm is in Iowa
  Corn Yield = α_Nebraska + β_Nebraska × Farm Area      if the farm is in Nebraska
Data from the stratified sample are saved in the SAS data set Farms. In the data set Farms, the variable Weight represents the sampling weight. In this example, the sampling weights are the reciprocals of the selection probabilities:

data Farms;
   input State $ Region FarmArea CornYield Weight;
   datalines;
Iowa     1 100  54 33.333
Iowa     1  83  25 33.333
Iowa     1  25  10 33.333
Iowa     2 120  83 10.000
Iowa     2  50  35 10.000
Iowa     2 110  65 10.000
Iowa     2  60  35 10.000
Iowa     2  45  20 10.000
Iowa     3  23   5  5.000
Iowa     3  10   8  5.000
Iowa     3 350 125  5.000
Nebraska 1 130  20  5.000
Nebraska 1 245  25  5.000
Nebraska 1 150  33  5.000
Nebraska 1 263  50  5.000
Nebraska 1 320  47  5.000
Nebraska 1 204  25  5.000
Nebraska 2  80  11 20.000
Nebraska 2  48   8 20.000
;
The information on population size in each stratum is saved in the SAS data set StratumTotals:

data StratumTotals;
   input State $ Region _TOTAL_;
   datalines;
Iowa     1 100
Iowa     2  50
Iowa     3  15
Nebraska 1  30
Nebraska 2  40
;
Using the sample data from the data set Farms and the control information data from the data set StratumTotals, you can fit Model I using PROC SURVEYREG with the following statements:
title1 'Analysis of Farm Area and Corn Yield';
title2 'Model I: Same Intercept and Slope';
proc surveyreg data=Farms total=StratumTotals;
   strata State Region / list;
   model CornYield = FarmArea;
   weight Weight;
run;
Output 71.4.1. Data Summary and Stratum Information Fitting Model I

Analysis of Farm Area and Corn Yield
Model I: Same Intercept and Slope
The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Data Summary
Number of Observations       19
Sum of Weights               234.99900
Weighted Mean of CornYield   31.56029
Weighted Sum of CornYield    7416.6

Design Summary
Number of Strata    5

Fit Statistics
R-square          0.3882
Root MSE          20.6422
Denominator DF    14

Stratum Information
Stratum Index   State      Region   N Obs   Population Total   Sampling Rate
1               Iowa       1        3       100                3.00%
2               Iowa       2        5        50                10.0%
3               Iowa       3        3        15                20.0%
4               Nebraska   1        6        30                20.0%
5               Nebraska   2        2        40                5.00%
Output 71.4.1 displays the data summary and stratification information from fitting Model I. The sampling rates are computed automatically by the procedure based on the sample sizes and the population totals in the strata.
4403
4404
Chapter 71. The SURVEYREG Procedure
Output 71.4.2. Estimated Regression Coefficients and the Estimated Covariance Matrix

Analysis of Farm Area and Corn Yield
Model I: Same Intercept and Slope
The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Tests of Model Effects
Effect      Num DF   F Value   Pr > F
Model       1        21.74     0.0004
Intercept   1        4.93      0.0433
FarmArea    1        21.74     0.0004

NOTE: The denominator degrees of freedom for the F tests is 14.

Estimated Regression Coefficients
Parameter   Estimate      Standard Error   t Value   Pr > |t|
Intercept   11.8162978    5.31981027       2.22      0.0433
FarmArea    0.2126576     0.04560949       4.66      0.0004

NOTE: The denominator degrees of freedom for the t tests is 14.
Output 71.4.2 displays tests of model effects and the estimated regression coefficients. Alternatively, you can assume that the linear relationship between corn yield (CornYield) and farm area (FarmArea) is different among the states (Model II). In order to analyze the data using this model, you create auxiliary variables FarmAreaNE and FarmAreaIA to represent farm area in the different states:

FarmAreaNE = 0 if the farm is in Iowa; FarmArea if the farm is in Nebraska
FarmAreaIA = FarmArea if the farm is in Iowa; 0 if the farm is in Nebraska

The following statements create these variables in a new data set called FarmsByState and use PROC SURVEYREG to fit Model II:

title1 'Analysis of Farm Area and Corn Yield';
title2 'Model II: Same Intercept, Different Slopes';
data FarmsByState;
   set Farms;
   if State='Iowa' then do;
      FarmAreaIA=FarmArea;
      FarmAreaNE=0;
   end;
   else do;
      FarmAreaIA=0;
      FarmAreaNE=FarmArea;
   end;
run;
The following statements perform the regression using the new data set FarmsByState. The analysis uses the auxiliary variables FarmAreaIA and FarmAreaNE as the regressors:

proc surveyreg data=FarmsByState total=StratumTotals;
   strata State Region;
   model CornYield = FarmAreaIA FarmAreaNE;
   weight Weight;
run;
Output 71.4.3. Regression Results from Fitting Model II

Analysis of Farm Area and Corn Yield
Model II: Same Intercept, Different Slopes
The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Data Summary
Number of Observations       19
Sum of Weights               234.99900
Weighted Mean of CornYield   31.56029
Weighted Sum of CornYield    7416.6

Design Summary
Number of Strata    5

Fit Statistics
R-square          0.8158
Root MSE          11.6759
Denominator DF    14

Estimated Regression Coefficients
Parameter    Estimate      Standard Error   t Value   Pr > |t|
Intercept    4.04234816    3.80934848       1.06      0.3066
FarmAreaIA   0.41696069    0.05971129       6.98      <.0001
FarmAreaNE   0.12851012    0.02495495       5.15      0.0001

NOTE: The denominator degrees of freedom for the t tests is 14.
Output 71.4.3 displays the data summary, design information, fit statistics, and parameter estimates. The estimated slope parameters for each state are quite different from the estimated slope in Model I. The results from the regression show that Model II fits these data better than Model I. For Model III, different intercepts are used for the linear relationship in two states. The following statements illustrate the use of the NOINT option in the MODEL statement associated with the CLASS statement to fit Model III:
title2 'Model III: Different Intercepts and Slopes';
proc surveyreg data=FarmsByState total=StratumTotals;
   strata State Region;
   class State;
   model CornYield = State FarmAreaIA FarmAreaNE / noint covb solution;
   weight Weight;
run;
The MODEL statement includes the classification effect State as a regressor. Therefore, the parameter estimates for the effect State present the intercepts in the two states.

Output 71.4.4. Regression Results for Fitting Model III
Analysis of Farm Area and Corn Yield
Model III: Different Intercepts and Slopes
The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Data Summary
Number of Observations       19
Sum of Weights               234.99900
Weighted Mean of CornYield   31.56029
Weighted Sum of CornYield    7416.6

Design Summary
Number of Strata    5

Fit Statistics
R-square          0.9300
Root MSE          11.9810
Denominator DF    14

Estimated Regression Coefficients
Parameter        Estimate      Standard Error   t Value   Pr > |t|
State Iowa       5.27797099    5.27170400       1.00      0.3337
State Nebraska   0.65275201    1.70031616       0.38      0.7068
FarmAreaIA       0.40680971    0.06458426       6.30      <.0001
FarmAreaNE       0.14630563    0.01997085       7.33      <.0001

NOTE: The denominator degrees of freedom for the t tests is 14.

Covariance of Estimated Regression Coefficients
                 State Iowa      State Nebraska   FarmAreaIA      FarmAreaNE
State Iowa       27.790863033    0                -0.205517205    0
State Nebraska   0               2.8910750385     0               -0.027354011
FarmAreaIA       -0.205517205    0                0.0041711265    0
FarmAreaNE       0               -0.027354011     0               0.0003988349
Output 71.4.4 displays the regression results for fitting Model III, including the data summary, parameter estimates, and covariance matrix of the regression coefficients. The estimated covariance matrix shows a lack of correlation between the regression coefficients from different states. This suggests that Model III might be the best choice for building a model for farm area and corn yield in these two states. However, some statistics remain the same under different regression models, for example, Weighted Mean of CornYield. These estimators do not rely on the particular model you use.
Example 71.5. Regression Estimator for Stratified Sample
This example uses the corn yield data from the previous example to illustrate how to construct a regression estimator for a stratified sample design. Similar to Example 71.3 on page 4400, by incorporating auxiliary information into a regression estimator, the procedure can produce more accurate estimates of the population characteristics that are of interest. In this example, the sample design is a stratified sample design. The auxiliary information is the total farm area in the regions of each state, as displayed in Table 71.4. You want to estimate the total corn yield using this information under the three linear models given in Example 71.4.

Table 71.4. Information for Each Stratum

                              Number of Farms
Stratum   State      Region   Population   Sample
1         Iowa       1            100         3
2         Iowa       2             50         5
3         Iowa       3             15         3
4         Nebraska   1             30         6
5         Nebraska   2             40         2
Total                             235        19

Total Farm Area: Iowa 13,200; Nebraska 8,750; Total 21,950
The regression estimator to estimate the total corn yield under Model I can be obtained by using PROC SURVEYREG with an ESTIMATE statement:

title1 'Estimate Corn Yield from Farm Size';
title2 'Model I: Same Intercept and Slope';
proc surveyreg data=Farms total=StratumTotals;
   strata State Region / list;
   class State Region;
   model CornYield = FarmArea State*Region / solution;
   weight Weight;
   estimate 'Estimate of CornYield under Model I'
            INTERCEPT 235 FarmArea 21950
            State*Region 100 50 15 30 40 / e;
run;
To apply the constraint in each stratum that the weighted total number of farms equals the total number of farms in the stratum, you can include the strata as an effect in the MODEL statement, the effect State*Region. Thus, the CLASS statement must list the STRATA variables, State and Region, as classification variables. The following
ESTIMATE statement specifies the regression estimator, which is a linear function of the regression parameters:

   estimate 'Estimate of CornYield under Model I'
            INTERCEPT 235 FarmArea 21950
            State*Region 100 50 15 30 40 / e;
This linear function contains the total for each explanatory variable in the model. Because the sampling units are farms in this example, the coefficient for Intercept in the ESTIMATE statement is the total number of farms (235); the coefficient for FarmArea is the total farm area listed in Table 71.4 (21,950); and the coefficients for the effect State*Region are the total numbers of farms in each stratum (as displayed in Table 71.4). Output 71.5.1. Regression Estimator for the Total of CornYield under Model I
Estimate Corn Yield from Farm Size
Model I: Same Intercept and Slope

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Analysis of Estimable Functions

Parameter                                 Estimate    Standard Error    t Value    Pr > |t|
Estimate of CornYield under Model I     7463.52329        926.841541       8.05      <.0001
NOTE: The denominator degrees of freedom for the t tests is 14.
Output 71.5.1 displays the results of the ESTIMATE statement. The regression estimator for the total of CornYield in Iowa and Nebraska is 7464 under Model I, with a standard error of 927. Under Model II, a regression estimator for the total can be obtained using the following statements:

   title1 'Estimate Corn Yield from Farm Size';
   title2 'Model II: Same Intercept, Different Slopes';
   proc surveyreg data=FarmsByState total=StratumTotals;
      strata State Region;
      class State Region;
      model CornYield = FarmAreaIA FarmAreaNE State*Region / solution;
      weight Weight;
      estimate 'Total of CornYield under Model II'
               INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750
               State*Region 100 50 15 30 40 / e;
   run;
In this model, you also need to include the strata as a fixed effect in the MODEL statement. The other regressors are the auxiliary variables FarmAreaIA and FarmAreaNE (defined in Example 71.4). In the following ESTIMATE statement, the coefficient for Intercept is still the total number of farms, and the coefficients for FarmAreaIA and FarmAreaNE are the total farm areas in Iowa and Nebraska, respectively, as displayed in Table 71.4. The total numbers of farms in the strata are the coefficients for the strata effect:

   estimate 'Total of CornYield under Model II'
            INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750
            State*Region 100 50 15 30 40 / e;
Output 71.5.2. Regression Estimator for the Total of CornYield under Model II
Estimate Corn Yield from Farm Size
Model II: Same Intercept, Different Slopes

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Analysis of Estimable Functions

Parameter                               Estimate    Standard Error    t Value    Pr > |t|
Total of CornYield under Model II     7580.48657        859.180439       8.82      <.0001
NOTE: The denominator degrees of freedom for the t tests is 14.
Output 71.5.2 shows that the regression estimator for the total corn yield in the two states under Model II is 7580, with a standard error of 859. The regression estimator under Model II has a slightly smaller standard error than the one under Model I. Finally, you can apply Model III to the data and estimate the total corn yield. Under Model III, you can also obtain the regression estimators for the total corn yield in each state. Three ESTIMATE statements are used in the following statements to create the three regression estimators:

   title1 'Estimate Corn Yield from Farm Size';
   title2 'Model III: Different Intercepts and Slopes';
   proc surveyreg data=FarmsByState total=StratumTotals;
      strata State Region;
      class State Region;
      model CornYield = State FarmAreaIA FarmAreaNE
                        State*Region / noint solution;
      weight Weight;
      estimate 'Total CornYield in Iowa under Model III'
               State 165 0 FarmAreaIA 13200 FarmAreaNE 0
               State*Region 100 50 15 0 0 / e;
      estimate 'Total CornYield in Nebraska under Model III'
               State 0 70 FarmAreaIA 0 FarmAreaNE 8750
               State*Region 0 0 0 30 40 / e;
      estimate 'Total CornYield in both states under Model III'
               State 165 70 FarmAreaIA 13200 FarmAreaNE 8750
               State*Region 100 50 15 30 40 / e;
   run;
The fixed effect State is added to the MODEL statement, together with the NOINT option, to obtain different intercepts in the two states. Among the ESTIMATE statements, the coefficients for the explanatory variables differ depending on which regression estimator is requested. For example, in the ESTIMATE statement

   estimate 'Total CornYield in Iowa under Model III'
            State 165 0 FarmAreaIA 13200 FarmAreaNE 0
            State*Region 100 50 15 0 0 / e;
the coefficients for the effect State are 165 and 0, respectively. These values indicate that the total number of farms is 165 in Iowa and 0 in Nebraska, because this estimate is for the total corn yield in Iowa only. Similarly, the total numbers of farms in the three regions of Iowa are used as the coefficients for the strata effect State*Region, as displayed in Table 71.4. Output 71.5.3. Regression Estimator for the Total of CornYield under Model III
Estimate Corn Yield from Farm Size
Model III: Different Intercepts and Slopes

The SURVEYREG Procedure
Regression Analysis for Dependent Variable CornYield

Analysis of Estimable Functions

Parameter                                           Estimate    Standard Error    t Value    Pr > |t|
Total CornYield in Iowa under Model III           6246.10697        851.272372       7.34      <.0001
Total CornYield in Nebraska under Model III       1334.37961        116.302948      11.47      <.0001
Total CornYield in both states under Model III    7580.48657        859.180439       8.82      <.0001
NOTE: The denominator degrees of freedom for the t tests is 14.
Output 71.5.3 displays the results from the three regression estimators using Model III. Since the estimates for the two states are independent, the estimated total corn yield for both states equals the sum of the estimated totals for Iowa and Nebraska,
6246 + 1334 = 7580. This regression estimator is the same as the one under Model II. The variance of the regression estimator of the total corn yield in both states is the sum of the variances of the regression estimators for the total corn yield in each state. Therefore, you do not need Model III to obtain the regression estimator for the total corn yield unless you also need to estimate the total corn yield for each individual state.
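As a quick check of this additivity (a back-of-the-envelope computation from the reported standard errors, which are rounded, so the agreement is only approximate): the standard error of the combined estimator is the square root of the sum of the squared state-level standard errors,

   sqrt(851.272372**2 + 116.302948**2) = sqrt(724664.7 + 13526.4) = sqrt(738191.1) = 859.18

which matches the standard error 859.180439 reported under both Model II and Model III.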
Example 71.6. Stratum Collapse

In a stratified sample, it is possible that some strata will have only one sampling unit. When this happens, PROC SURVEYREG collapses the strata that contain a single sampling unit into a pooled stratum. For more detailed information on stratum collapse, see the section "Stratum Collapse" on page 4388. Suppose that you have the following data:

   data Sample;
      input Stratum X Y W;
      datalines;
   10 0 0 5
   10 1 1 5
   11 1 1 10
   11 1 2 10
   12 3 3 16
   33 4 4 45
   14 6 7 50
   12 3 4 16
   ;
The variable Stratum is again the stratification variable, the variable X is the independent variable, and the variable Y is the dependent variable. You want to regress Y on X. In the data set Sample, both Stratum=33 and Stratum=14 contain one observation. By default, PROC SURVEYREG collapses these strata into one pooled stratum in the regression analysis. To input the finite population correction information, you create the SAS data set StratumTotals:

   data StratumTotals;
      input Stratum _TOTAL_;
      datalines;
   10 10
   11 20
   12 32
   33 40
   33 45
   14 50
   15 .
   66 70
   ;
The variable Stratum is the stratification variable, and the variable _TOTAL_ contains the stratum totals. The data set StratumTotals contains more strata than the data set Sample. Also, in the data set StratumTotals, more than one observation contains the stratum total for Stratum=33:

   33 40
   33 45
PROC SURVEYREG allows this type of input. The procedure simply ignores strata that are not present in the data set Sample; for the multiple entries of a stratum, the procedure uses the first observation. In this example, Stratum=33 has the stratum total _TOTAL_=40. The following SAS statements perform the regression analysis:

   title1 'Stratified Sample with Single Sampling Unit in Strata';
   title2 'With Stratum Collapse';
   proc surveyreg data=Sample total=StratumTotals;
      strata Stratum / list;
      model Y = X;
      weight W;
   run;
Output 71.6.1. Summary of Data and Regression
Stratified Sample with Single Sampling Unit in Strata
With Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Data Summary
Number of Observations            8
Sum of Weights            157.00000
Weighted Mean of Y          4.31210
Weighted Sum of Y         677.00000

Design Summary
Number of Strata              5
Number of Strata Collapsed    2

Fit Statistics
R-square          0.9564
Root MSE          0.5111
Denominator DF         4
Output 71.6.1 shows that there are a total of five strata in the input data set, and two strata are collapsed into a pooled stratum. The denominator degrees of freedom
is 4, due to the collapse (see the section “Denominator Degrees of Freedom” on page 4386). Output 71.6.2. Stratification Information
Stratified Sample with Single Sampling Unit in Strata
With Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Stratum Information

Stratum
Index    Collapsed    Stratum    N Obs    Population Total    Sampling Rate
1                     10         2        10                  20.0%
2                     11         2        20                  10.0%
3                     12         2        32                   6.25%
4        Yes          14         1        50                   2.00%
5        Yes          33         1        40                   2.50%
0        Pooled                  2        90                   2.22%
NOTE: Strata with only one observation are collapsed into the stratum with Stratum Index "0".
Output 71.6.2 displays the stratification information, including stratum collapse. Under the column Collapsed, the fourth stratum (Stratum=14) and the fifth (Stratum=33) are marked as 'Yes', which indicates that these two strata are collapsed into the pooled stratum (Stratum Index=0). The sampling rate for the pooled stratum is 2.22% (see the section "Sampling Rate of the Pooled Stratum from Collapse" on page 4389).
Output 71.6.3. Parameter Estimates and Effect Tests
Stratified Sample with Single Sampling Unit in Strata
With Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Tests of Model Effects

Effect       Num DF    F Value    Pr > F
Model             1     173.01    0.0002
Intercept         1       0.00    0.9961
X                 1     173.01    0.0002
NOTE: The denominator degrees of freedom for the F tests is 4.
Estimated Regression Coefficients

Parameter      Estimate    Standard Error    t Value    Pr > |t|
Intercept    0.00179469        0.34306373       0.01      0.9961
X            1.12598708        0.08560466      13.15      0.0002
NOTE: The denominator degrees of freedom for the t tests is 4.
Output 71.6.3 displays the parameter estimates and the tests of the significance of the model effects. Alternatively, if you prefer not to collapse strata with a single sampling unit, you can specify the NOCOLLAPSE option in the STRATA statement:

   title1 'Stratified Sample with Single Sampling Unit in Strata';
   title2 'Without Stratum Collapse';
   proc surveyreg data=Sample total=StratumTotals;
      strata Stratum / list nocollapse;
      model Y = X;
      weight W;
   run;
Output 71.6.4. Summary of Data and Regression
Stratified Sample with Single Sampling Unit in Strata
Without Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Data Summary
Number of Observations            8
Sum of Weights            157.00000
Weighted Mean of Y          4.31210
Weighted Sum of Y         677.00000

Design Summary
Number of Strata    5

Fit Statistics
R-square          0.9564
Root MSE          0.5111
Denominator DF         3
Output 71.6.4 does not contain the stratum collapse information displayed in Output 71.6.1, and the denominator degrees of freedom is 3 instead of 4. Output 71.6.5. Stratification Information
Stratified Sample with Single Sampling Unit in Strata
Without Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Stratum Information

Stratum
Index    Stratum    N Obs    Population Total    Sampling Rate
1        10         2        10                  20.0%
2        11         2        20                  10.0%
3        12         2        32                   6.25%
4        14         1        50                   2.00%
5        33         1        40                   2.50%
In Output 71.6.5, although the fourth stratum and the fifth stratum contain only one observation, no stratum collapse occurs.
Output 71.6.6. Parameter Estimates and Effect Tests
Stratified Sample with Single Sampling Unit in Strata
Without Stratum Collapse

The SURVEYREG Procedure
Regression Analysis for Dependent Variable Y

Tests of Model Effects

Effect       Num DF    F Value    Pr > F
Model             1     347.27    0.0003
Intercept         1       0.00    0.9962
X                 1     347.27    0.0003
NOTE: The denominator degrees of freedom for the F tests is 3.
Estimated Regression Coefficients

Parameter      Estimate    Standard Error    t Value    Pr > |t|
Intercept    0.00179469        0.34302581       0.01      0.9962
X            1.12598708        0.06042241      18.64      0.0003
NOTE: The denominator degrees of freedom for the t tests is 3.
As a result of not collapsing strata, the standard error estimates of the parameters differ from those in Output 71.6.3, as do the tests of the significance of the model effects.
References

Cochran, W.G. (1977), Sampling Techniques, Third Edition, New York: John Wiley & Sons, Inc.

Foreman, E.K. (1991), Survey Sampling Principles, New York: Marcel Dekker, Inc.

Fuller, W.A. (1975), "Regression Analysis for Sample Survey," Sankhyā, 37 (3), Series C, 117–132.

Fuller, W.A., Kennedy, W., Schnell, D., Sullivan, G., and Park, H.J. (1989), PC CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Hidiroglou, M.A., Fuller, W.A., and Hickman, R.D. (1980), SUPER CARP, Ames, IA: Statistical Laboratory, Iowa State University.

Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons, Inc.

Pringle, R.M. and Raynor, A.A. (1971), Generalized Inverse Matrices with Applications to Statistics, New York: Hafner Publishing Co.

Särndal, C.E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag Inc.
Woodruff, R.S. (1971), “A Simple Method for Approximating the Variance of a Complicated Estimate,” Journal of the American Statistical Association, 66, 411–414.
Chapter 72
The SURVEYSELECT Procedure

Chapter Contents

OVERVIEW . . . 4421
GETTING STARTED . . . 4422
   Simple Random Sampling . . . 4423
   Stratified Sampling . . . 4425
   Stratified Sampling with Control Sorting . . . 4428
SYNTAX . . . 4430
   PROC SURVEYSELECT Statement . . . 4430
   CONTROL Statement . . . 4443
   ID Statement . . . 4443
   SIZE Statement . . . 4444
   STRATA Statement . . . 4444
DETAILS . . . 4445
   Missing Values . . . 4445
   Sorting by CONTROL Variables . . . 4445
   Sample Selection Methods . . . 4446
   Simple Random Sampling . . . 4447
   Unrestricted Random Sampling . . . 4447
   Systematic Random Sampling . . . 4448
   Sequential Random Sampling . . . 4448
   PPS Sampling without Replacement . . . 4449
   PPS Sampling with Replacement . . . 4451
   PPS Systematic Sampling . . . 4451
   PPS Sequential Sampling . . . 4452
   Brewer's PPS Method . . . 4453
   Murthy's PPS Method . . . 4454
   Sampford's PPS Method . . . 4455
   Output Data Set . . . 4456
   Displayed Output . . . 4458
   ODS Table Names . . . 4460
EXAMPLES . . . 4460
   Example 72.1. Replicated Sampling . . . 4460
   Example 72.2. PPS Selection of Two Units Per Stratum . . . 4463
   Example 72.3. PPS (Dollar-Unit) Sampling . . . 4466
REFERENCES . . . 4469
Chapter 72
The SURVEYSELECT Procedure

Overview

The SURVEYSELECT procedure provides a variety of methods for selecting probability-based random samples. The procedure can select a simple random sample or can sample according to a complex multistage sample design that includes stratification, clustering, and unequal probabilities of selection. With probability sampling, each unit in the survey population has a known, positive probability of selection. This property of probability sampling avoids selection bias and enables you to use statistical theory to make valid inferences from the sample to the survey population.

To select a sample with PROC SURVEYSELECT, you input a SAS data set that contains the sampling frame or list of units from which the sample is to be selected. You also specify the selection method, the desired sample size or sampling rate, and other selection parameters. The SURVEYSELECT procedure selects the sample, producing an output data set that contains the selected units, their selection probabilities, and sampling weights. When you are selecting a sample in multiple stages, you invoke the procedure separately for each stage of selection, inputting the frame and selection parameters for each current stage.

The SURVEYSELECT procedure provides methods for both equal probability sampling and probability proportional to size (PPS) sampling. In equal probability sampling, each unit in the sampling frame, or in a stratum, has the same probability of being selected for the sample. In PPS sampling, a unit's selection probability is proportional to its size measure. For details on probability sampling methods, refer to Lohr (1999), Kish (1965, 1987), Kalton (1983), and Cochran (1977).

The SURVEYSELECT procedure provides the following equal probability sampling methods:

• simple random sampling
• unrestricted random sampling (with replacement)
• systematic random sampling
• sequential random sampling

This procedure also provides the following probability proportional to size (PPS) methods:

• PPS sampling without replacement
• PPS sampling with replacement
• PPS systematic sampling
• PPS algorithms for selecting two units per stratum
• sequential PPS sampling with minimum replacement
The procedure uses fast, efficient algorithms for these sample selection methods. Thus, it performs well even for large input data sets or sampling frames, which may occur in practice for large-scale sample surveys. The SURVEYSELECT procedure can perform stratified sampling, selecting samples independently within the specified strata, or nonoverlapping subgroups of the survey population. Stratification controls the distribution of the sample size in the strata. It is widely used in practice toward meeting a variety of survey objectives. For example, with stratification you can ensure adequate sample sizes for subgroups of interest, including small subgroups, or you can use stratification toward improving the precision of the overall estimates. When you are using a systematic or sequential selection method, the SURVEYSELECT procedure also can sort by control variables within strata for the additional control of implicit stratification. The SURVEYSELECT procedure provides replicated sampling, where the total sample is composed of a set of replicates, each selected in the same way. You can use replicated sampling to study variable nonsampling errors, such as variability in the results obtained by different interviewers. You can also use replication to compute standard errors for the combined sample estimates.
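For readers who want to see replicated sampling in code before the examples that follow, here is a minimal sketch. It borrows the Customers data set introduced in the next section and uses the REP= option described later in this chapter; treat it as an illustration rather than output-verified code.

   proc surveyselect data=Customers method=srs n=100 rep=4
                     out=SampleReps;
   run;

These statements would draw four independent simple random samples of 100 customers each, all selected with the same design, as described above.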
Getting Started

In this example, an Internet service provider conducts a customer satisfaction survey. The survey population consists of the company's current subscribers. The company plans to select a sample of customers from this population, interview the selected customers, and then make inferences about the entire survey population from the sample data. The SAS data set Customers contains the sampling frame, which is the list of units in the survey population. The sample of customers will be selected from this sampling frame. The data set Customers is constructed from the company's customer database. It contains one observation for each customer, with a total of 13,471 observations. Figure 72.1 displays the first 10 observations of the data set Customers.

Internet Service Provider Customers
(First 10 Observations)

Obs    CustomerID     State    Type    Usage
  1    416-87-4322    AL       New       839
  2    288-13-9763    GA       Old       224
  3    339-00-8654    GA       Old      2451
  4    118-98-0542    GA       New       349
  5    421-67-0342    FL       New       562
  6    623-18-9201    SC       New        68
  7    324-55-0324    FL       Old       137
  8    832-90-2397    AL       Old      1563
  9    586-45-0178    GA       New       615
 10    801-24-5317    SC       New       728
Figure 72.1. Customers Data Set (First 10 Observations)
In the SAS data set Customers, the variable CustomerID uniquely identifies each customer. The variable State contains the state of the customer’s address. The company has customers in the following four states: Georgia (GA), Alabama (AL), Florida (FL), and South Carolina (SC). The variable Type equals ‘Old’ if the customer has subscribed to the service for more than one year; otherwise, the variable Type equals ‘New’. The variable Usage contains the customer’s average monthly service usage, in minutes. The following sections illustrate the use of PROC SURVEYSELECT for probability sampling with three different designs for the customer satisfaction survey. All three designs are one stage, with customers as the sampling units. The first design is simple random sampling without stratification. In the second design, customers are stratified by state and type, and the sample is selected by simple random sampling within strata. In the third design, customers are sorted within strata by usage, and the sample is selected by systematic random sampling within strata.
Simple Random Sampling

The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using simple random sampling:

   title 'Customer Satisfaction Survey';
   proc surveyselect data=Customers method=srs n=100
                     out=SampleSRS;
   run;
The PROC SURVEYSELECT statement invokes the procedure. The DATA= option names the SAS data set Customers as the input data set from which to select the sample. The METHOD=SRS option specifies simple random sampling as the sample selection method. In simple random sampling, each unit has an equal probability of selection, and sampling is without replacement. Without-replacement sampling means that a unit cannot be selected more than once. The N=100 option specifies a sample size of 100 customers. The OUT= option stores the sample in the SAS data set named SampleSRS. Figure 72.2 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A sample of 100 customers is selected from the data set Customers by simple random sampling. With simple random sampling and no stratification in the sample design, the selection probability is the same for all units in the sample. In this sample, the selection probability for each customer equals 0.007423, which is the sample size (100) divided by the population size (13,471). The sampling weight equals 134.71 for each customer in the sample, where the weight is the inverse of the selection probability. If you specify the STATS option, PROC SURVEYSELECT includes the selection probabilities and sampling weights in the output data set. (This information is always included in the output data set for more complex designs.)
The random number seed is 39647. PROC SURVEYSELECT uses this number as the initial seed for random number generation. Since the SEED= option is not specified in the PROC SURVEYSELECT statement, the seed value is obtained using the time of day from the computer’s clock. You can specify SEED=39647 to reproduce this sample.
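For instance, to reproduce the sample shown here, you could rerun the step with the seed fixed (a sketch; the other options are unchanged):

   proc surveyselect data=Customers method=srs n=100 seed=39647
                     out=SampleSRS;
   run;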
Customer Satisfaction Survey

The SURVEYSELECT Procedure

Selection Method    Simple Random Sampling

Input Data Set               CUSTOMERS
Random Number Seed               39647
Sample Size                        100
Selection Probability         0.007423
Sampling Weight                 134.71
Output Data Set              SAMPLESRS
Figure 72.2. Sample Selection Summary
The sample of 100 customers is stored in the SAS data set SampleSRS. PROC SURVEYSELECT does not display this output data set. The following PROC PRINT statements display the first 20 observations of SampleSRS:

   title1 'Customer Satisfaction Survey';
   title2 'Sample of 100 Customers, Selected by SRS';
   title3 '(First 20 Observations)';
   proc print data=SampleSRS(obs=20);
   run;
Figure 72.3 displays the first 20 observations of the output data set SampleSRS, which contains the sample of customers. This data set includes all the variables from the DATA= input data set Customers. If you do not want to include all variables, you can use the ID statement to specify which variables to copy from the input data set to the output (sample) data set.
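For example, to keep only the customer identifier in the sample data set, you might write the following sketch, which uses the ID statement described in the "Syntax" section:

   proc surveyselect data=Customers method=srs n=100
                     out=SampleSRS;
      id CustomerID;
   run;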
Customer Satisfaction Survey
Sample of 100 Customers, Selected by SRS
(First 20 Observations)

Obs    CustomerID     State    Type    Usage
  1    036-89-0212    FL       New        74
  2    045-53-3676    AL       New       411
  3    050-99-2380    GA       Old       167
  4    066-93-5368    AL       Old      1232
  5    082-99-9234    FL       New        90
  6    097-17-4766    FL       Old       131
  7    110-73-1051    FL       Old       102
  8    111-91-6424    GA       New       247
  9    127-39-4594    GA       New        61
 10    162-50-3866    FL       New       100
 11    162-56-1370    FL       New       224
 12    167-21-6808    SC       New        60
 13    168-02-5189    AL       Old      7553
 14    174-07-8711    FL       New       284
 15    187-03-7510    SC       New        21
 16    190-78-5019    GA       New       185
 17    200-75-0054    GA       New       224
 18    201-14-1003    GA       Old      3437
 19    207-15-7701    GA       Old        24
 20    211-14-1373    AL       Old        88
Figure 72.3. Customer Sample (First 20 Observations)
Stratified Sampling

In this section, stratification is added to the sample design for the customer satisfaction survey. The sampling frame, or list of all customers, is stratified by State and Type. This divides the sampling frame into nonoverlapping subgroups formed from the values of the State and Type variables. Samples are then selected independently within the strata. PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the Customers data set by the stratification variables State and Type:

   proc sort data=Customers;
      by State Type;
   run;
The following PROC FREQ statements display the crosstabulation of the Customers data set by State and Type:

   proc freq data=Customers;
      tables State*Type;
   run;
The FREQ Procedure

Table of State by Type

State          Type
Frequency| Percent | Row Pct | Col Pct |New |Old | Total ---------+--------+--------+ AL | 1238 | 706 | 1944 | 9.19 | 5.24 | 14.43 | 63.68 | 36.32 | | 14.43 | 14.43 | ---------+--------+--------+ FL | 2170 | 1370 | 3540 | 16.11 | 10.17 | 26.28 | 61.30 | 38.70 | | 25.29 | 28.01 | ---------+--------+--------+ GA | 3488 | 1940 | 5428 | 25.89 | 14.40 | 40.29 | 64.26 | 35.74 | | 40.65 | 39.66 | ---------+--------+--------+ SC | 1684 | 875 | 2559 | 12.50 | 6.50 | 19.00 | 65.81 | 34.19 | | 19.63 | 17.89 | ---------+--------+--------+ Total 8580 4891 13471 63.69 36.31 100.00
Figure 72.4. Stratification of Customers by State and Type
Figure 72.4 presents the table of State by Type for the 13,471 customers. There are four states and two levels of Type, forming a total of eight strata. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set according to the stratified sample design:

   title1 'Customer Satisfaction Survey';
   title2 'Stratified Sampling';
   proc surveyselect data=Customers method=srs n=15
                     seed=1953 out=SampleStrata;
      strata State Type;
   run;
The STRATA statement names the stratification variables State and Type. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=15 option specifies a sample size of 15 customers for each stratum. If you want to specify different sample sizes for different strata, you can use the N=SAS-data-set option to name a secondary data set that contains the stratum sample sizes. The SEED=1953 option specifies ’1953’ as the initial seed for random number generation.
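As an aside, if the strata required different sample sizes, the secondary-data-set approach might look like the following sketch. The stratum-size variable name _NSIZE_ is taken from the SAMPSIZE= option documentation rather than from this example, so treat the details as an assumption; the strata must appear in the same sorted order as in the input data set:

   data StratumSizes;
      input State $ Type $ _NSIZE_;
      datalines;
   AL New 15
   AL Old 15
   FL New 15
   FL Old 15
   GA New 15
   GA Old 15
   SC New 15
   SC Old 15
   ;

   proc surveyselect data=Customers method=srs n=StratumSizes
                     seed=1953 out=SampleStrata;
      strata State Type;
   run;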
Customer Satisfaction Survey
Stratified Sampling

The SURVEYSELECT Procedure

Selection Method    Simple Random Sampling
Strata Variables    State Type

Input Data Set               CUSTOMERS
Random Number Seed                1953
Stratum Sample Size                 15
Number of Strata                     8
Total Sample Size                  120
Output Data Set           SAMPLESTRATA
Figure 72.5. Sample Selection Summary
Figure 72.5 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A total of 120 customers are selected. The following PROC PRINT statements display the first 30 observations of the output data set SampleStrata:

   title1 'Customer Satisfaction Survey';
   title2 'Sample Selected by Stratified Design';
   title3 '(First 30 Observations)';
   proc print data=SampleStrata(obs=30);
   run;
Figure 72.6 displays the first 30 observations of the output data set SampleStrata, which contains the sample of 120 customers, 15 customers from each of the eight strata. The variable SelectionProb contains the selection probability for each customer in the sample. Since customers are selected with equal probability within strata in this design, the selection probability equals the stratum sample size (15) divided by the stratum population size. The selection probabilities differ from stratum to stratum since the population sizes differ. The selection probability for each customer in the first stratum (State=‘AL’ and Type=‘New’) is 0.012116, and the selection probability is 0.021246 for customers in the second stratum. The variable SamplingWeight contains the sampling weights, which are computed as inverse selection probabilities.
Customer Satisfaction Survey
Sample Selected by Stratified Design
(First 30 Observations)

Obs    State    Type    CustomerID     Usage    Selection Prob    Sampling Weight
  1    AL       New     002-26-1498     1189          0.012116            82.5333
  2    AL       New     070-86-8494      106          0.012116            82.5333
  3    AL       New     121-28-6895       76          0.012116            82.5333
  4    AL       New     131-79-7630      265          0.012116            82.5333
  5    AL       New     211-88-4991      108          0.012116            82.5333
  6    AL       New     222-81-3742       83          0.012116            82.5333
  7    AL       New     238-46-3776      278          0.012116            82.5333
  8    AL       New     370-01-0671      123          0.012116            82.5333
  9    AL       New     407-07-5479     1580          0.012116            82.5333
 10    AL       New     550-90-3188      177          0.012116            82.5333
 11    AL       New     582-40-9610       46          0.012116            82.5333
 12    AL       New     672-59-9114       66          0.012116            82.5333
 13    AL       New     848-60-3119       28          0.012116            82.5333
 14    AL       New     886-83-4909      170          0.012116            82.5333
 15    AL       New     993-31-7677       64          0.012116            82.5333
 16    AL       Old     124-60-0495       80          0.021246            47.0667
 17    AL       Old     128-54-9590       56          0.021246            47.0667
 18    AL       Old     204-05-4017       17          0.021246            47.0667
 19    AL       Old     210-68-8704     4363          0.021246            47.0667
 20    AL       Old     239-75-4343      430          0.021246            47.0667
 21    AL       Old     317-70-6496      452          0.021246            47.0667
 22    AL       Old     365-37-1340       21          0.021246            47.0667
 23    AL       Old     399-78-7900      108          0.021246            47.0667
 24    AL       Old     404-90-6273      824          0.021246            47.0667
 25    AL       Old     421-04-8548     1332          0.021246            47.0667
 26    AL       Old     604-48-0587       16          0.021246            47.0667
 27    AL       Old     774-04-0162      318          0.021246            47.0667
 28    AL       Old     849-66-4156       79          0.021246            47.0667
 29    AL       Old     937-69-9106      182          0.021246            47.0667
 30    AL       Old     985-09-8691       24          0.021246            47.0667
Figure 72.6. Customer Sample (First 30 Observations)
Stratified Sampling with Control Sorting

The next sample design for the customer satisfaction survey uses stratification by State. The sampling frame is also sorted by Type and Usage before sample selection, to provide additional control over the distribution of the sample. Customers are then selected by systematic random sampling within strata. Selection by systematic sampling, together with control sorting, spreads the sample uniformly over the range of type and usage values within each stratum or state. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using this design:

   title1 'Customer Satisfaction Survey';
   title2 'Stratified Sampling with Control Sorting';
   proc surveyselect data=Customers method=sys rate=.02
                     seed=1234 out=SampleControl;
      strata State;
      control Type Usage;
   run;
The STRATA statement names the stratification variable State. The CONTROL statement names the control variables Type and Usage. In the PROC SURVEYSELECT statement, the METHOD=SYS option requests systematic random sampling. The RATE=.02 option specifies a sampling rate of 2% for each stratum. The SEED=1234 option specifies the initial seed for random number generation. Figure 72.7 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A sample of 271 customers is selected, using systematic random sampling within strata determined by State. The sampling frame Customers is sorted by control variables Type and Usage within strata. The type of sorting is serpentine, which is used by default since SORT=NEST is not specified. See the section “Sorting by CONTROL Variables” on page 4445 for a description of serpentine sorting. The sorted data set replaces the input data set. (To store the sorted input data in another data set, leaving the input data set unsorted, use the OUTSORT= option.) The output data set SampleControl contains the sample of customers.
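If you wanted to keep the original order of Customers and store the sorted frame separately, a sketch with the OUTSORT= option (described in the "Syntax" section) might look like this; SortedCustomers is a hypothetical data set name:

   proc surveyselect data=Customers method=sys rate=.02
                     seed=1234 out=SampleControl
                     outsort=SortedCustomers;
      strata State;
      control Type Usage;
   run;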
Customer Satisfaction Survey
Stratified Sampling with Control Sorting

The SURVEYSELECT Procedure

Selection Method     Systematic Random Sampling
Strata Variable      State
Control Variables    Type Usage
Control Sorting      Serpentine

Input Data Set                CUSTOMERS
Random Number Seed                 1234
Stratum Sampling Rate              0.02
Number of Strata                      4
Total Sample Size                   271
Output Data Set           SAMPLECONTROL

Figure 72.7. Sample Selection Summary
Syntax

The following statements are available in PROC SURVEYSELECT:
   PROC SURVEYSELECT options ;
      STRATA variables ;
      CONTROL variables ;
      SIZE variable ;
      ID variables ;

The PROC SURVEYSELECT statement invokes the procedure and optionally identifies input and output data sets. It also specifies the selection method, the sample size, and other sample design parameters. The PROC SURVEYSELECT statement is required. The SIZE statement identifies the variable that contains the size measures. It is required for any selection method that is probability proportional to size (PPS). The remaining statements are optional.

The STRATA statement identifies a variable or set of variables that stratify the input data set. When you specify a STRATA statement, PROC SURVEYSELECT selects samples independently from the strata formed by the STRATA variables. The CONTROL statement identifies variables for ordering units within strata. It can be used for systematic and sequential sampling methods. The ID statement identifies variables to copy from the input data set to the output data set of selected units. The rest of this section gives detailed syntax information for the CONTROL, ID, SIZE, and STRATA statements in alphabetical order after the description of the PROC SURVEYSELECT statement.
PROC SURVEYSELECT Statement

PROC SURVEYSELECT options ;

The PROC SURVEYSELECT statement invokes the procedure and optionally identifies input and output data sets. If you do not name a DATA= input data set, the procedure selects the sample from the most recently created SAS data set. If you do not name an OUT= output data set to contain the sample of selected units, the procedure still creates an output data set and names it according to the DATAn convention. The PROC SURVEYSELECT statement also specifies the sample selection method, the sample size, and other sample design parameters. If you do not specify a selection method, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) if there is no SIZE statement. If you specify a SIZE statement but do not specify a selection method, PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS). You must specify the sample size or sampling rate unless you request a method that selects two units from each stratum (METHOD=PPS_BREWER or METHOD=PPS_MURTHY). You can use the SAMPSIZE=n option to specify the sample size, or you can use the SAMPSIZE=SAS-data-set option to name a secondary input data set that contains
stratum sample sizes. You can also specify stratum sampling rates, minimum size measures, maximum size measures, and certainty size measures in the secondary input data set. See the descriptions of the SAMPSIZE=, SAMPRATE=, MINSIZE=, MAXSIZE=, and CERTSIZE= options. You can name only one secondary input data set in each invocation of the procedure. The following table lists the options available with the PROC SURVEYSELECT statement. Descriptions follow in alphabetical order. Table 72.1. PROC SURVEYSELECT Statement Options
Task                                  Options

Specify the input data set            DATA=

Specify output data sets              OUT=
                                      OUTSORT=

Suppress displayed output             NOPRINT

Specify selection method              METHOD=

Specify sample size                   SAMPSIZE=
                                      SELECTALL

Specify sampling rate                 SAMPRATE=
                                      NMIN=
                                      NMAX=

Specify number of replicates          REP=

Adjust size measures                  MINSIZE=
                                      MAXSIZE=

Specify certainty size measures       CERTSIZE=

Specify sorting type                  SORT=

Specify random number seed            SEED=

Control OUT= contents                 JTPROBS
                                      OUTALL
                                      OUTHITS
                                      OUTSEED
                                      OUTSIZE
                                      STATS
You can specify the following options in the PROC SURVEYSELECT statement:

CERTSIZE
requests automatic selection of those units with size measures greater than or equal to the stratum certainty size measure. You provide sampling unit size measures in the DATA= input data set variable named in the SIZE statement. And you provide the stratum certainty size measures in the secondary input data set variable _CERTSIZE_. Use the CERTSIZE option when you have already named the secondary input data set in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set, MAXSIZE=SAS-data-set, or MINSIZE=SAS-data-set. You can name only one secondary input data set in each invocation of the procedure.
If any unit's size measure is greater than or equal to the certainty size measure for its stratum, then PROC SURVEYSELECT selects this unit with certainty. Each certainty size measure must be a positive number. The CERTSIZE option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. If you want to specify a single certainty size measure in the PROC SURVEYSELECT statement, use the CERTSIZE=certain option.

CERTSIZE=certain
specifies the certainty size measure. PROC SURVEYSELECT selects with certainty any unit with size measure greater than or equal to the value certain, which must be a positive number. You provide size measures in the DATA= input data set variable named in the SIZE statement. This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. If you request a stratified sample design with a STRATA statement and specify the CERTSIZE= option, PROC SURVEYSELECT uses the certainty size certain for all strata. If you do not want to use the same certainty size for all strata, use the CERTSIZE=SAS-data-set option to specify a certainty size for each stratum.

CERTSIZE=SAS-data-set
names a SAS data set that contains the certainty size measures for the strata. PROC SURVEYSELECT selects with certainty any unit with size measure greater than or equal to the certainty size measure for its stratum. You provide sampling unit size measures in the DATA= input data set variable named in the SIZE statement. And you provide the stratum certainty size measures in the CERTSIZE= input data set variable _CERTSIZE_. Each certainty size measure must be a positive number. This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the CERTSIZE= data set as in the DATA= data set. The CERTSIZE= data set must include a variable named _CERTSIZE_ that contains the certainty size measure for each stratum.

CERTSIZE=P=p
specifies the certainty proportion. PROC SURVEYSELECT selects with certainty any unit with size measure greater than or equal to the proportion p of the total size for all units in the stratum. The procedure repeats this process with the remaining units until no more certainty units are selected. You provide size measures in the DATA= input data set variable named in the SIZE statement. This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The certainty proportion must be a positive number. You can specify p as a number between 0 and 1. Or you can specify p in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. If you request a stratified sample design with a STRATA statement and specify the CERTSIZE=P= option, PROC SURVEYSELECT uses the same certainty proportion p for all strata.
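To make the CERTSIZE= variants concrete, here is a sketch of certainty selection in a PPS design. The data set Accounts and the size variable BookValue are hypothetical, and the statements are illustrative rather than output-verified:

   proc surveyselect data=Accounts method=pps n=10
                     certsize=5000 out=SampleAccounts;
      size BookValue;
   run;

Any account whose BookValue is 5000 or more would be selected with certainty.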
DATA=SAS-data-set
names the SAS data set from which PROC SURVEYSELECT selects the sample. If you omit the DATA= option, the procedure uses the most recently created SAS data set. In sampling terminology, the input data set is the sampling frame, or list of units from which the sample is selected.

JTPROBS
includes joint probabilities of selection in the OUT= output data set. This option is available for the following probability proportional to size selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=PPS_WR. By default, PROC SURVEYSELECT outputs joint selection probabilities for METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units per stratum. For details on computation of joint selection probabilities for a particular sampling method, see the method description in the section "Sample Selection Methods" on page 4446. For more information on the contents of the output data set, see the section "Output Data Set" on page 4456.

MAXSIZE
requests that sampling unit size measures be adjusted according to the stratum maximum size measures in the secondary input data set. You provide sampling unit size measures in the DATA= input data set variable named in the SIZE statement. And you provide the stratum maximum size measures in the secondary input data set variable _MAXSIZE_. Use the MAXSIZE option when you have already named the secondary input data set in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set, MINSIZE=SAS-data-set, or CERTSIZE=SAS-data-set. You can name only one secondary input data set in each invocation of the procedure. If any size measure exceeds the maximum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure downward to equal the maximum size measure. Each maximum size measure must be a positive number. The MAXSIZE option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratification. If you want to specify a single maximum size value in the PROC SURVEYSELECT statement, use the MAXSIZE=max option.

MAXSIZE=max
specifies the maximum size measure allowed. If any size measure exceeds the value max, then PROC SURVEYSELECT adjusts this size measure to equal max, which must be a positive number. You provide size measures in the DATA= input data set variable named in the SIZE statement. This option is available whenever you specify a SIZE statement for selection with probability proportional to size. If you request a stratified sample design with a STRATA statement and specify the MAXSIZE= option, PROC SURVEYSELECT uses the maximum size max for all strata. If you do not want to use the same maximum size for all strata, use the MAXSIZE=SAS-data-set option to specify a maximum size for each stratum.
MAXSIZE=SAS-data-set
names a SAS data set that contains the maximum size measures allowed for the strata. If any size measure exceeds the maximum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure downward to equal the maximum size measure. You provide sampling unit size measures in the DATA= input data set variable named in the SIZE statement. And you provide the stratum maximum size measures in the MAXSIZE= input data set variable _MAXSIZE_. Each maximum size measure must be a positive number. This option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratified selection. The MAXSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the MAXSIZE= data set as in the DATA= data set. The MAXSIZE= data set must include a variable named _MAXSIZE_ that contains the maximum size measure for each stratum.
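The following sketch shows the shape of a MAXSIZE= secondary data set; the frame data set Farms2, its size variable Acres, and the stratification variable Region are hypothetical:

   data MaxMeasures;
      input Region $ _MAXSIZE_;
      datalines;
   East 5000
   West 8000
   ;

   proc surveyselect data=Farms2 method=pps n=5
                     maxsize=MaxMeasures out=SamplePPS;
      strata Region;
      size Acres;
   run;

Any Acres value above its stratum's _MAXSIZE_ would be adjusted downward to that maximum before selection.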
METHOD=name
M=name

specifies the method for sample selection. If you do not specify the METHOD= option, by default, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) if there is no SIZE statement. If you specify a SIZE statement, the default selection method is probability proportional to size without replacement (METHOD=PPS). Valid values for name are as follows:

PPS
requests selection with probability proportional to size and without replacement. See the section “PPS Sampling without Replacement” on page 4449 for details. If you specify METHOD=PPS, you must name the size measure variable in the SIZE statement.
PPS_BREWER | BREWER
requests selection according to Brewer's method. Brewer's method selects two units from each stratum with probability proportional to size and without replacement. See the section "Brewer's PPS Method" on page 4453 for details. If you specify METHOD=PPS_BREWER, you must name the size measure variable in the SIZE statement. You do not need to specify the sample size with the SAMPSIZE= option, since Brewer's method selects two units from each stratum.

PPS_MURTHY | MURTHY
requests selection according to Murthy's method. Murthy's method selects two units from each stratum with probability proportional to size and without replacement. See the section "Murthy's PPS Method" on page 4454 for details. If you specify METHOD=PPS_MURTHY, you must name the size measure variable in the SIZE statement. You do not need to specify the sample size with the SAMPSIZE= option, since Murthy's method selects two units from each stratum.
PPS_SAMPFORD | SAMPFORD
requests selection according to Sampford's method. Sampford's method selects units with probability proportional to size and without replacement. See the section "Sampford's PPS Method" on page 4455 for details. If you specify METHOD=PPS_SAMPFORD, you must name the size measure variable in the SIZE statement.

PPS_SEQ | CHROMY
requests sequential selection with probability proportional to size and with minimum replacement. This method is also known as Chromy's method. See the section "PPS Sequential Sampling" on page 4452 for details. If you specify METHOD=PPS_SEQ, you must name the size measure variable in the SIZE statement.

PPS_SYS
requests systematic selection with probability proportional to size. See the section "PPS Systematic Sampling" on page 4451 for details on this method. If you specify METHOD=PPS_SYS, you must name the size measure variable in the SIZE statement.
PPS_WR
requests selection with probability proportional to size and with replacement. See the section "PPS Sampling with Replacement" on page 4451 for details on this method. If you specify METHOD=PPS_WR, you must name the size measure variable in the SIZE statement.
SEQ
requests sequential selection according to Chromy's method. If you specify METHOD=SEQ and do not specify a size measure variable with the SIZE statement, PROC SURVEYSELECT uses sequential zoned selection with equal probability and without replacement. See the section "Sequential Random Sampling" on page 4448 for details on this method. If you specify METHOD=SEQ and also name a size measure variable in the SIZE statement, PROC SURVEYSELECT uses METHOD=PPS_SEQ, which is sequential selection with probability proportional to size and with minimum replacement. See the section "PPS Sequential Sampling" on page 4452 for details on this method.
SRS
requests simple random sampling, which is selection with equal probability and without replacement. See the section “Simple Random Sampling” on page 4447 for details. This method is the default if you do not specify the METHOD= option and also do not specify a SIZE statement.
SYS
requests systematic random sampling. If you specify METHOD=SYS and do not specify a size measure variable with the SIZE statement, PROC SURVEYSELECT uses systematic selection with equal probability. See the section "Systematic Random Sampling" on page 4448 for details on this method. If you specify METHOD=SYS and also name a size measure variable in the SIZE statement, PROC SURVEYSELECT uses METHOD=PPS_SYS, which is systematic selection with probability proportional to size. See the section "PPS Systematic Sampling" on page 4451 for details.
URS
requests unrestricted random sampling, which is selection with equal probability and with replacement. See the section “Unrestricted Random Sampling” on page 4447 for details.
MINSIZE
requests that sampling unit size measures be adjusted according to the stratum minimum size measures in the secondary input data set. You provide sampling unit size measures in the DATA= input data set variable named in the SIZE statement. And you provide the stratum minimum size measures in the secondary input data set variable _MINSIZE_. Use the MINSIZE option when you have already named the secondary input data set in another option, such as SAMPSIZE=SAS-data-set, SAMPRATE=SAS-data-set, MAXSIZE=SAS-data-set, or CERTSIZE=SAS-data-set. You can name only one secondary input data set in each invocation of the procedure. If any size measure is less than the minimum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure upward to equal the minimum size measure. Each minimum size measure must be a positive number. The MINSIZE option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratification. If you want to specify a single minimum size value in the PROC SURVEYSELECT statement, use the MINSIZE=min option.

MINSIZE=min
specifies the minimum size measure allowed. If any size measure is less than the value min, then PROC SURVEYSELECT adjusts this size measure upward to equal min, which must be a positive number. You provide size measures in the DATA= input data set variable named in the SIZE statement. This option is available whenever you specify a SIZE statement for selection with probability proportional to size. If you request a stratified sample design with a STRATA statement and specify the MINSIZE= option, PROC SURVEYSELECT uses the minimum size min for all strata. If you do not want to use the same minimum size for all strata, use the MINSIZE=SAS-data-set option to specify a minimum size for each stratum.

MINSIZE=SAS-data-set
names a SAS data set that contains the minimum size measures allowed for the strata. If any size measure is less than the minimum size measure for its stratum, then PROC SURVEYSELECT adjusts this size measure upward to equal the minimum size measure. You provide sampling unit size measures in the DATA= input data set variable named in the SIZE statement. And you provide the stratum minimum size measures in the MINSIZE= input data set variable _MINSIZE_. Each minimum size measure must be a positive number. This option is available whenever you specify a SIZE statement for probability proportional to size selection and a STRATA statement for stratified selection. The MINSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the MINSIZE= data set as in the DATA= data set. The MINSIZE= data set must include a variable named _MINSIZE_ that contains the minimum size measure for each stratum.
NMAX=n
specifies the maximum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the desired stratum sample size from the specified sampling rate and the total number of units in the stratum. If this sample size is greater than the value NMAX=n, then PROC SURVEYSELECT selects the maximum of n units. The maximum sample size n must be a positive integer. The NMAX= option is available only with the SAMPRATE= option, which may be used with equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ).

NMIN=n
specifies the minimum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the desired stratum sample size from the specified sampling rate and the total number of units in the stratum. If this sample size is less than the value NMIN=n, then PROC SURVEYSELECT selects the minimum of n units. The minimum sample size n must be a positive integer. The NMIN= option is available only with the SAMPRATE= option, which may be used with equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ).

NOPRINT
suppresses the display of all output. You can use the NOPRINT option when you want only to create an output data set. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, "Using the Output Delivery System."

OUT=SAS-data-set
names the output data set that contains the sample. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique. The output data set contains the units selected for the sample, as well as design information and selection statistics, depending on the selection method and output options you specify. See the descriptions for the options JTPROBS, OUTHITS, OUTSEED, OUTSIZE, and STATS. For information on the contents of the output data set, see the section "Output Data Set" on page 4456. By default, the output data set contains only those units selected for the sample. To include all observations from the input data set in the output data set, use the OUTALL option.

OUTALL
includes all observations from the input data set in the output data set. By default, the output data set includes only those observations selected for the sample. When you specify the OUTALL option, the output data set includes all observations from DATA= and also contains a variable to indicate each observation’s selection status. The variable Selected equals 1 for an observation selected for the sample, and equals
0 for an observation not selected. For information on the contents of the output data set, see the section "Output Data Set" on page 4456. The OUTALL option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ).

OUTHITS
includes a separate observation in the output data set for each selection when the same unit is selected more than once. By default, the output data set contains only one observation for each selected unit, even if it is selected more than once, and the variable NumberHits contains the number of hits or selections for that unit. The OUTHITS option is available for selection methods that select with replacement or with minimum replacement (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ).
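For example, a with-replacement design that lists each selection separately might be requested as in this sketch (not output-verified):

   proc surveyselect data=Customers method=urs n=100 outhits
                     out=SampleURS;
   run;

Without OUTHITS, a customer selected twice would appear once with NumberHits=2; with OUTHITS, that customer would appear as two observations.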
OUTSEED

includes the initial seed for each stratum in the output data set. The variable InitialSeed contains the stratum initial seed. See the section "Sample Selection Methods" on page 4446 for information on initial seeds and random number generation in PROC SURVEYSELECT. To reproduce the same sample for any stratum in a subsequent execution of PROC SURVEYSELECT, you can specify the same stratum initial seed with the SEED=SAS-data-set option, along with the same sample selection parameters.

OUTSIZE
OUTSIZE
includes additional design and sampling frame parameters in the output data set. If you specify the OUTSIZE option, PROC SURVEYSELECT includes the sample size or sampling rate in the output data set. When you request the OUTSIZE option and also specify the SIZE statement, the procedure outputs the size measure total for the sampling frame. If you do not specify the SIZE statement, the procedure outputs the total number of sampling units in the frame. Also, PROC SURVEYSELECT includes the minimum size measure if you specify the MINSIZE= option, the maximum size measure if you specify the MAXSIZE= option, and the certainty size measure if you specify the CERTSIZE= option. If you have a stratified design, the output data set includes the stratum-level values of these parameters. Otherwise, the output data set includes the overall population-level values. For information on the contents of the output data set, see the section “Output Data Set” on page 4456.
OUTSORT=SAS-data-set
names an output data set that contains the sorted input data set. This option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample. If you specify CONTROL variables but do not name an output data set with the OUTSORT= option, then the sorted data set replaces the input data set.
REP=nrep
specifies the number of sample replicates. If you specify the REP= option, PROC SURVEYSELECT selects nrep independent samples, each with the same specified sample size or sampling rate and the same sample design. You can use replicated sampling to provide a simple method of variance estimation for any form of statistic, as well as to evaluate variable nonsampling errors such as interviewer differences. Refer to Lohr (1999), Kish (1965, 1987), and Kalton (1983) for information on replicated sampling.
SAMPRATE=r
RATE=r
specifies the sampling rate, which is the proportion of units selected for the sample. The sampling rate r must be a positive number. You can specify r as a number between 0 and 1. Or you can specify r in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the sampling rate r as the interval. See the section “Systematic Random Sampling” on page 4448 for details. For other selection methods, PROC SURVEYSELECT converts the sampling rate r to the sample size before selection, multiplying the rate by the number of units in the stratum or frame and rounding up to the nearest integer. If you request a stratified sample design with a STRATA statement and specify the SAMPRATE=r option, PROC SURVEYSELECT uses the sampling rate r for each stratum. If you do not want to use the same sampling rate for each stratum, use the SAMPRATE=(values) option or the SAMPRATE=SAS-data-set option to specify a sampling rate for each stratum.
SAMPRATE=(values)
RATE=(values)
specifies sampling rates for the strata. You can separate values with blanks or commas. The number of SAMPRATE= values must equal the number of strata in the input data set. List the stratum sampling rate values in the order in which the strata appear in the input data set. If you use the SAMPRATE=(values) option, the input data set must be sorted by the STRATA variables in ascending order. You cannot use the DESCENDING or NOTSORTED options in the STRATA statement. Each stratum sampling rate value must be a positive number. You can specify each value as a number between 0 and 1. Or you can specify a value in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.
The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the stratum sampling rate as the interval for the stratum. See the section “Systematic Random Sampling” on page 4448 for details on systematic sampling. For other selection methods, PROC SURVEYSELECT converts the stratum sampling rate to a stratum sample size before selection, multiplying the rate by the number of units in the stratum and rounding up to the nearest integer.
SAMPRATE=SAS-data-set
RATE=SAS-data-set
names a SAS data set that contains sampling rates for the strata. This input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SAMPRATE= data set as in the DATA= data set. The SAMPRATE= data set should have a variable _RATE_ that contains the sampling rate for each stratum. Each sampling rate value must be a positive number. You can specify each value as a number between 0 and 1. Or you can specify a value in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%. The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the stratum sampling rate as the interval for the stratum. See the section “Systematic Random Sampling” on page 4448 for details. For other selection methods, PROC SURVEYSELECT converts the stratum sampling rate to the stratum sample size before selection, multiplying the rate by the number of units in the stratum and rounding up to the nearest integer.
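A minimal sketch of supplying per-stratum rates; only the AL and FL stratum values appear elsewhere in this chapter's examples, so GA and SC, like the rates themselves, are assumptions for illustration:

data StratumRates;
   input State $ _RATE_;   /* _RATE_ holds each stratum's sampling rate */
   datalines;
AL 0.02
FL 0.05
GA 0.03
SC 0.04
;

proc surveyselect data=Customers method=srs
                  samprate=StratumRates seed=1953 out=RateSample;
   strata State;   /* strata must appear in the same order as the rates */
run;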
SAMPSIZE=n
N=n
specifies the sample size, which is the number of units selected for the sample. The sample size n must be a positive integer. For methods that select without replacement, the sample size n must not exceed the number of units in the input data set. If you request a stratified sample design with a STRATA statement and specify the SAMPSIZE=n option, PROC SURVEYSELECT selects n units from each stratum. For methods that select without replacement, the sample size n must not exceed the number of units in any stratum. If you do not want to select the same number of units from each stratum, use the SAMPSIZE=(values) option or the SAMPSIZE=SAS-data-set option to specify different sample sizes for the strata. For without-replacement selection methods, by default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum. However, you can change this default by specifying the SELECTALL option. With the SELECTALL option, PROC SURVEYSELECT selects all stratum units whenever the stratum sample size exceeds the number of units in the stratum.
SAMPSIZE=(values) N=(values)
specifies sample sizes for the strata. You can separate values with blanks or commas. The number of SAMPSIZE= values must equal the number of strata in the input data set. List the stratum sample size values in the order in which the strata appear in the input data set. If you use the SAMPSIZE=(values) option, the input data set must be sorted by the STRATA variables in ascending order. You cannot use the DESCENDING or NOTSORTED options in the STRATA statement. Each stratum sample size value must be a positive integer. For without-replacement selection methods, by default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum. However, you can change this default by specifying the SELECTALL option. With the SELECTALL option, PROC SURVEYSELECT selects all stratum units whenever the stratum sample size exceeds the number of units in the stratum.
SAMPSIZE=SAS-data-set
N=SAS-data-set
names a SAS data set that contains the sample sizes for the strata. This input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SAMPSIZE= data set as in the DATA= data set. The SAMPSIZE= data set should have a variable _NSIZE_ that contains the sample size for each stratum. Each stratum sample size value must be a positive integer. For without-replacement selection methods, by default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum. However, you can change this default by specifying the SELECTALL option. With the SELECTALL option, PROC SURVEYSELECT selects all stratum units whenever the stratum sample size exceeds the number of units in the stratum.
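A sketch of the data set form, using the stratum sample sizes (8, 12, 20, 10) from Example 72.1; as before, the GA and SC stratum values are assumptions for illustration:

data StratumSizes;
   input State $ _NSIZE_;   /* _NSIZE_ holds each stratum's sample size */
   datalines;
AL  8
FL 12
GA 20
SC 10
;

proc surveyselect data=Customers method=seq
                  sampsize=StratumSizes seed=40070 out=SizeSample;
   strata State;
run;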
SEED=number
specifies the initial seed for random number generation. The value of the SEED= option must be an integer. If you do not specify the SEED= option, or if the SEED= value is negative or zero, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed. See the section “Sample Selection Methods” on page 4446 for more information. Whether or not you specify the SEED= option, PROC SURVEYSELECT displays the value of the initial seed in the “Sample Selection Summary” table. If you need to reproduce the same sample in a subsequent execution of PROC SURVEYSELECT, you can specify this same seed value with the SEED= option, along with the same sample selection parameters, and PROC SURVEYSELECT will reproduce the sample. If you request a stratified sample design with a STRATA statement, you can use the SEED=SAS-data-set option to specify an initial seed for each stratum. Otherwise, PROC SURVEYSELECT generates random numbers continuously across strata from
the random number stream initialized by the SEED= value, as described in the section “Sample Selection Methods” on page 4446. To include the stratum initial seeds in the output data set, use the OUTSEED option.
SEED=SAS-data-set
names a SAS data set that contains initial seeds for the strata. This input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SEED= data set as in the DATA= data set. The SEED= data set should have a variable _SEED_ that contains the initial seed for each stratum. Each stratum initial seed value should be an integer. If the initial seed value for the first stratum is not a positive integer, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed. If the initial seed value for a subsequent stratum is not a positive integer, PROC SURVEYSELECT continues to use the random number stream already initialized by the seed for the previous stratum. See the section “Sample Selection Methods” on page 4446 for more information. To include the stratum initial seeds in the output data set, specify the OUTSEED option. If you specify initial seeds by strata with the SEED=SAS-data-set option, you can reproduce the same sample in a subsequent execution of PROC SURVEYSELECT by specifying these same stratum initial seeds, along with the same sample selection parameters. If you need to reproduce the same sample for only a subset of the strata, you can use the same initial seeds for those strata in the subset.
SELECTALL
requests that PROC SURVEYSELECT select all stratum units whenever the stratum sample size exceeds the total number of units in the stratum, for without-replacement selection methods. By default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum, for methods that select without replacement. The SELECTALL option is available for without-replacement selection methods, which include METHOD=SRS, METHOD=SYS, METHOD=SEQ, METHOD=PPS, and METHOD=PPS_SAMPFORD. The SELECTALL option is not available for with-replacement selection methods, with-minimum-replacement methods, or for those PPS methods that select two units per stratum.
SORT=NEST | SERP
specifies the type of sorting by CONTROL variables. The option SORT=NEST requests nested sorting, and SORT=SERP requests hierarchic serpentine sorting. The default is SORT=SERP. See the section “Sorting by CONTROL Variables” on page 4445 for descriptions of serpentine and nested sorting. When there is only one CONTROL variable, the two types of sorting are equivalent. This option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). When you specify a CONTROL
statement, PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample. With sorting by CONTROL variables, you can also use the OUTSORT= option to name an output data set that contains the sorted input data set. Otherwise, if you do not specify the OUTSORT= option, then the sorted data set replaces the input data set.
STATS
includes selection probabilities and sampling weights in the OUT= output data set for equal probability selection methods when you do not specify a STRATA statement. This option is available for the following equal probability selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ. For PPS selection methods and stratified designs, the output data set contains selection probabilities and sampling weights by default. For more information on the contents of the output data set, see the section “Output Data Set” on page 4456.
CONTROL Statement
CONTROL variables ;
The CONTROL statement names variables for sorting the input data set. CONTROL variables can be character or numeric.
PROC SURVEYSELECT sorts the input data set by the CONTROL variables before selecting the sample. If you also specify a STRATA statement, PROC SURVEYSELECT sorts by CONTROL variables within strata. Control sorting is available for systematic and sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). By default, PROC SURVEYSELECT uses hierarchic serpentine sorting by the CONTROL variables. If you specify the SORT=NEST option, the procedure uses nested sorting. See the description for the SORT= option. For more information on serpentine and nested sorting, see the section “Sorting by CONTROL Variables” on page 4445. You can use the OUTSORT= option to name an output data set that contains the sorted input data set. If you do not specify the OUTSORT= option when you use the CONTROL statement, then the sorted data set replaces the input data set.
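A minimal sketch combining control sorting with OUTSORT=, so the original frame is preserved; the data set and variable names follow this chapter's Customers example, and the sample size and seed are assumptions:

proc surveyselect data=Customers method=seq n=50 seed=1953
                  out=SeqSample
                  outsort=SortedFrame;   /* keep Customers unchanged */
   strata State;
   control Type Usage;   /* serpentine sort within strata by default */
run;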
ID Statement
ID variables ;
The ID statement names variables from the DATA= input data set to be included in the OUT= data set of selected units. If there is no ID statement, PROC SURVEYSELECT includes all variables from the DATA= data set in the OUT= data set. The ID variables can be character or numeric.
SIZE Statement
SIZE variable ;
The SIZE statement names one and only one size measure variable, which contains the size measures to be used when sampling with probability proportional to size. The SIZE variable must be numeric. When the value of an observation’s SIZE variable is missing or nonpositive, that observation has no chance of being selected for the sample. The SIZE statement is required for all PPS selection methods, which include METHOD=PPS, METHOD=PPS_BREWER, METHOD=PPS_MURTHY, METHOD=PPS_SAMPFORD, METHOD=PPS_SEQ, METHOD=PPS_SYS, and METHOD=PPS_WR. For details on how size measures are used, see the descriptions of PPS methods in the section “Sample Selection Methods” on page 4446. Note that a unit’s size measure, specified in the SIZE statement and used for PPS selection, is not the same as the sample size. The sample size is the number of units selected for the sample, and you can specify this with the SAMPSIZE= option.
STRATA Statement
STRATA variables ;
You can specify a STRATA statement with PROC SURVEYSELECT to partition the input data set into nonoverlapping groups defined by the STRATA variables. PROC SURVEYSELECT then selects independent samples from these strata, according to the selection method and design parameters specified in the PROC SURVEYSELECT statement. For information on the use of stratification in sample design, refer to Lohr (1999), Kalton (1983), Kish (1965, 1987), and Cochran (1977).
The variables are one or more variables in the input data set. The STRATA variables function much like BY variables, and PROC SURVEYSELECT expects the input data set to be sorted in order of the STRATA variables. If you specify a CONTROL statement, or if you specify METHOD=PPS, the input data set must be sorted in ascending order of the STRATA variables. This means you cannot use the STRATA option NOTSORTED or DESCENDING when you specify a CONTROL statement or METHOD=PPS. If your input data set is not sorted by the STRATA variables in ascending order, use one of the following alternatives:
• Sort the data using the SORT procedure with the STRATA variables in a BY statement (see the sketch at the end of this section).
• Specify the option NOTSORTED or DESCENDING in the STRATA statement for the SURVEYSELECT procedure (when you do not specify a CONTROL statement or METHOD=PPS). The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the STRATA variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the STRATA variables using the DATASETS procedure.
For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
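For the first alternative above, a minimal sketch; the data set and variable names are assumed from this chapter's examples:

proc sort data=Customers;
   by State;   /* sort the frame by the STRATA variables */
run;

proc surveyselect data=Customers method=srs n=10 seed=1953
                  out=StratSample;
   strata State;
run;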
Details
Missing Values
If an observation has a missing or nonpositive value for the SIZE variable, PROC SURVEYSELECT excludes that observation from the sample selection. The procedure writes a note to the log giving the number of observations omitted due to missing or nonpositive size measures. PROC SURVEYSELECT treats missing STRATA variable values like any other STRATA variable value. The missing values form a separate stratum. If a value of _NSIZE_ is missing in the SAMPSIZE= input data set, then PROC SURVEYSELECT writes an error message to the log and does not select a sample from that stratum. The procedure treats missing values of _RATE_, _MINSIZE_, _MAXSIZE_, and _CERTSIZE_ similarly.
Sorting by CONTROL Variables
If you specify a CONTROL statement, PROC SURVEYSELECT sorts the input data set by the CONTROL variables before selecting the sample. If you also specify a STRATA statement, the procedure sorts by CONTROL variables within strata. Sorting by CONTROL variables is available for systematic and sequential selection methods, which include METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ. Sorting provides additional control over the distribution of the sample, giving some benefits of proportionate stratification. By default, the sorted data set replaces the input data set. Or you can use the OUTSORT= option to name an output data set that contains the sorted input data set.
PROC SURVEYSELECT provides two types of sorting: nested sorting and hierarchic serpentine sorting. If you specify the SORT=NEST option, then the procedure sorts by the CONTROL variables according to nested sorting. If you do not specify the SORT=NEST option, the procedure uses serpentine sorting by default. These two types of sorting are equivalent when there is only one CONTROL variable.
If you request nested sorting, PROC SURVEYSELECT sorts observations in the same order as PROC SORT does for an ascending sort by the CONTROL variables. Refer to the chapter on the SORT procedure in the SAS Procedures Guide. PROC SURVEYSELECT sorts within strata if you also specify a STRATA statement. The procedure first arranges the input observations in ascending order of the first CONTROL variable. Then within each level of the first CONTROL variable, the procedure arranges the observations in ascending order of the second CONTROL variable. This continues for all CONTROL variables specified.
In hierarchic serpentine sorting, PROC SURVEYSELECT sorts by the first CONTROL variable in ascending order. Then within the first level of the first CONTROL variable, the procedure sorts by the second CONTROL variable in ascending order. Within the second level of the first CONTROL variable, the procedure sorts by the second CONTROL variable in descending order. Sorting by the second CONTROL variable continues to alternate between ascending and descending sorting throughout all levels of the first CONTROL variable. If there is a third CONTROL variable, the procedure sorts by that variable within levels formed from the first two CONTROL variables, again alternating between ascending and descending sorting. This continues for all CONTROL variables specified. This sorting algorithm minimizes the change from one observation to the next with respect to the CONTROL variable values, thus making nearby observations more similar. For more information on serpentine sorting, refer to Chromy (1979) and Williams and Chromy (1980).
Sample Selection Methods
PROC SURVEYSELECT provides a variety of methods for selecting probability-based random samples. With probability sampling, each unit in the survey population has a known, positive probability of selection. This property of probability sampling avoids selection bias and enables you to use statistical theory to make valid inferences from the sample to the survey population. Refer to Lohr (1999), Kish (1965, 1987), Kalton (1983), and Cochran (1977) for more information on probability sampling.
In equal probability sampling, each unit in the sampling frame, or in a stratum, has the same probability of being selected for the sample. PROC SURVEYSELECT provides the following methods that select units with equal probability: simple random sampling, unrestricted random sampling, systematic random sampling, and sequential random sampling. In simple random sampling, units are selected without replacement, which means that a unit cannot be selected more than once. Both systematic and sequential equal probability sampling are also without replacement. In unrestricted random sampling, units are selected with replacement, which means that a unit can be selected more than once. In with-replacement sampling, the number of hits refers to the number of times a unit is selected.
In probability proportional to size (PPS) sampling, a unit’s selection probability is proportional to its size measure. PROC SURVEYSELECT provides the following methods that select units with probability proportional to size (PPS): PPS sampling without replacement, PPS sampling with replacement, PPS systematic sampling, PPS sequential sampling, Brewer’s method, Murthy’s method, and Sampford’s method. PPS sampling is often used in cluster sampling, where you select clusters (or groups of sampling units) of varying size in the first stage of selection. For example, clusters may be schools, hospitals, or geographical areas, and the final sampling units may be students, patients, or citizens. Cluster sampling can provide efficiencies in frame construction and other survey operations. Refer to Lohr (1999), Kalton (1983), Kish (1965), and the other references cited in the following sections for more information.
All the probability sampling methods provided by PROC SURVEYSELECT use random numbers in their selection algorithms, as described in the following sections
and in the references cited. PROC SURVEYSELECT uses a uniform random number function to generate streams of pseudo-random numbers from an initial starting point, or seed. You can use the SEED= option to specify the initial seed. If you do not specify the SEED= option, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed. PROC SURVEYSELECT generates uniform random numbers according to the method of Fishman and Moore (1982), using a prime modulus multiplicative generator with modulus $2^{31} - 1$ and multiplier 397204094. PROC SURVEYSELECT uses the same uniform random number generator as the RANUNI function. For more information on the RANUNI function, see the SAS Language Reference: Dictionary.
The following sections give detailed descriptions of the sample selection methods available in PROC SURVEYSELECT. In these sections, $n_h$ denotes the sample size (the number of units in the sample) for stratum h, and $N_h$ denotes the population size (the number of units in the population) for stratum h, for $h = 1, 2, \ldots, H$. When the sample design is not stratified, n denotes the sample size, and N denotes the population size. For PPS sampling, $M_{hi}$ represents the size measure for unit i in stratum h, $M_{h\cdot}$ is the total of all size measures for the population of stratum h, and $Z_{hi} = M_{hi}/M_{h\cdot}$ is the relative size of unit i in stratum h.
Simple Random Sampling
The method of simple random sampling (METHOD=SRS) selects units with equal probability and without replacement. Each possible sample of n different units out of N has the same probability of being selected. The selection probability for each individual unit equals $n/N$. When you request stratified sampling with a STRATA statement, PROC SURVEYSELECT selects samples independently within strata. The selection probability for a unit in stratum h equals $n_h/N_h$ for stratified simple random sampling.
By default, PROC SURVEYSELECT uses Floyd’s ordered hash table algorithm for simple random sampling. This algorithm is fast, efficient, and appropriate for large data sets. Refer to Bentley and Floyd (1987) and Bentley and Knuth (1986). If there is not enough memory available for Floyd’s algorithm, PROC SURVEYSELECT switches to the sequential algorithm of Fan, Muller, and Rezucha (1962), which requires less memory but may require more time to select the sample. When PROC SURVEYSELECT uses the alternative sequential algorithm, it writes a note to the log. To request the sequential algorithm even when enough memory is available for Floyd’s algorithm, you can specify METHOD=SRS2 in the PROC SURVEYSELECT statement.
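As a minimal illustration (data set name, sample size, and seed assumed), each unit's selection probability here is 100/N for a frame of N units:

proc surveyselect data=Customers method=srs n=100 seed=1953
                  out=SRSSample;
run;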
Unrestricted Random Sampling
The method of unrestricted random sampling (METHOD=URS) selects units with equal probability and with replacement. Because units are selected with replacement, a unit can be selected for the sample more than once. The expected number of selections or hits for each unit equals $n/N$ when sampling without stratification. For stratified sampling, the expected number of hits for a unit in stratum h equals $n_h/N_h$. Note that the expected number of hits exceeds one when the sample size n is greater than the population size N.
For unrestricted random sampling, by default, the output data set contains one observation for each distinct unit selected for the sample, together with a variable NumberHits that gives the number of times the observation was selected. But if you specify the OUTHITS option, then the output data set contains a separate observation for each selection, so that a unit selected three times, for example, is represented by three observations in the output data set. For information on the contents of the output data set, see the section “Output Data Set” on page 4456.
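A sketch of with-replacement selection with one output row per hit; the data set name, sample size, and seed are assumptions:

proc surveyselect data=Customers method=urs n=100 seed=1953
                  outhits              /* one observation per selection */
                  out=URSSample;
run;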
Systematic Random Sampling
The method of systematic random sampling (METHOD=SYS) selects units at a fixed interval throughout the sampling frame or stratum after a random start. If you specify the sample size (or the stratum sample sizes) with the SAMPSIZE= option, PROC SURVEYSELECT uses a fractional interval to provide exactly the specified sample size. The interval equals $N/n$, or $N_h/n_h$ for stratified sampling. The selection probability for each unit equals $n/N$, or $n_h/N_h$ for stratified sampling. If you specify the sampling rate (or the stratum sampling rates) with the SAMPRATE= option, PROC SURVEYSELECT uses the inverse of the rate as the interval for systematic selection. The selection probability for each unit equals the specified rate.
Systematic random sampling controls the distribution of the sample by spreading it throughout the sampling frame or stratum at equal intervals, thus providing implicit stratification. You can use the CONTROL statement to order the input data set by the CONTROL variables before sample selection. If you also use a STRATA statement, PROC SURVEYSELECT sorts by the CONTROL variables within strata. If you do not specify a CONTROL statement, PROC SURVEYSELECT applies systematic selection to the observations in the order in which they appear in the input data set.
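For instance, a 2% systematic sample ordered by control variables; the names and the rate are assumptions, and the selection interval is the inverse of the rate, here 50:

proc surveyselect data=Customers method=sys samprate=0.02
                  seed=1953 out=SysSample;
   control Type Usage;   /* implicit stratification via ordering */
run;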
Sequential Random Sampling
If you specify the option METHOD=SEQ and do not include a SIZE statement, PROC SURVEYSELECT uses the equal probability version of Chromy’s method for sequential random sampling. This method selects units sequentially with equal probability and without replacement. Refer to Chromy (1979) and Williams and Chromy (1980). See the section “PPS Sequential Sampling” on page 4452 for a description of Chromy’s PPS selection method.
Sequential random sampling controls the distribution of the sample by spreading it throughout the sampling frame or stratum, thus providing implicit stratification according to the order of units in the frame or stratum. You can use the CONTROL statement to sort the input data set by the CONTROL variables before sample selection. If you also use a STRATA statement, PROC SURVEYSELECT sorts by the CONTROL variables within strata. By default, the procedure uses hierarchic serpentine ordering for sorting. If you specify the SORT=NEST option, the procedure uses nested sorting. See the section “Sorting by CONTROL Variables” on page 4445 for descriptions of serpentine and nested sorting. If you do not specify a CONTROL statement, PROC SURVEYSELECT applies sequential selection to the observations in the order in which they appear in the input data set.
Following Chromy’s method of sequential selection, PROC SURVEYSELECT randomly chooses a starting unit from the entire stratum (or frame, if the design is not
stratified). Using this unit as the first one, the procedure treats the stratum units as a closed loop. This is done so that all pairwise (joint) selection probabilities are positive and an unbiased variance estimator can be obtained. The procedure numbers units sequentially from the random start to the end of the stratum and then continues from the beginning of the stratum until all units are numbered. Beginning with the randomly chosen starting unit, PROC SURVEYSELECT accumulates the expected number of selections or hits, where the expected number of selections EShi equals nh /Nh for all units i in stratum h. The procedure computes
$$I_{hi} = \mathrm{Int}\left( \sum_{j=1}^{i} ES_{hj} \right) = \mathrm{Int}\left( i \, n_h / N_h \right)$$
$$F_{hi} = \mathrm{Frac}\left( \sum_{j=1}^{i} ES_{hj} \right) = \mathrm{Frac}\left( i \, n_h / N_h \right)$$
where Int denotes the integer part of the number, and Frac denotes the fractional part. Considering each unit sequentially, Chromy’s method determines whether unit i is selected by comparing the total number of selections for the first $i-1$ units,
$$T_{h(i-1)} = \sum_{j=1}^{i-1} S_{hj}$$
with the value of $I_{h(i-1)}$.
If $T_{h(i-1)} = I_{h(i-1)}$, Chromy’s method determines whether or not unit i is selected as follows. If $F_{hi} = 0$ or $F_{h(i-1)} > F_{hi}$, then unit i is selected with certainty. Otherwise, unit i is selected with probability
$$\left( F_{hi} - F_{h(i-1)} \right) / \left( 1 - F_{h(i-1)} \right)$$
If $T_{h(i-1)} = I_{h(i-1)} + 1$, Chromy’s method determines whether or not unit i is selected as follows. If $F_{hi} = 0$ or $F_{hi} > F_{h(i-1)}$, then the unit is not selected. Otherwise, unit i is selected with probability
$$F_{hi} \, / \, F_{h(i-1)}$$
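To make the recursion concrete, here is a minimal DATA step sketch of Chromy's equal-probability rule for a single stratum. The stratum sizes (Nh = 100, nh = 10), the seed, and the omission of the circular random start are simplifying assumptions; this illustrates the update rule only and is not the procedure's actual implementation:

data chromy_sketch;
   nh = 10;  Nh = 100;                  /* assumed stratum design */
   T = 0;                               /* running total of hits  */
   do i = 1 to Nh;
      I_prev = floor((i-1)*nh/Nh);              /* I_h(i-1) */
      F_prev = (i-1)*nh/Nh - I_prev;            /* F_h(i-1) */
      F_cur  = i*nh/Nh - floor(i*nh/Nh);        /* F_hi     */
      if T = I_prev then do;                    /* first case */
         if F_cur = 0 or F_prev > F_cur then S = 1;   /* certainty */
         else S = (ranuni(40070) < (F_cur - F_prev)/(1 - F_prev));
      end;
      else do;                                  /* T = I_prev + 1 */
         if F_cur = 0 or F_cur > F_prev then S = 0;   /* skip unit */
         else S = (ranuni(40070) < F_cur/F_prev);
      end;
      T = T + S;                        /* accumulate T_hi */
      output;
   end;
   keep i S T;
run;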
PPS Sampling without Replacement
If you specify the option METHOD=PPS, PROC SURVEYSELECT selects units with probability proportional to size and without replacement. The selection probability for unit i in stratum h equals $n_h Z_{hi}$. The procedure uses the Hanurav-Vijayan algorithm for PPS selection without replacement. Hanurav (1967) introduced this algorithm for the selection of two units per stratum, and Vijayan (1968) generalized it for the selection of more than two units. The algorithm enables computation of joint
selection probabilities and provides joint selection probability values that usually ensure nonnegativity and stability of the Sen-Yates-Grundy variance estimator. Refer to Fox (1989), Golmant (1990), and Watts (1991) for details.
Notation in the remainder of this section drops the stratum subscript h for simplicity, but selection is still done independently within strata if you specify a stratified design. For a stratified design, n now denotes the sample size for the current stratum, N denotes the stratum population size, and $M_i$ denotes the size measure for unit i in the stratum. If the design is not stratified, this notation applies to the entire sampling frame.
According to the Hanurav-Vijayan algorithm, PROC SURVEYSELECT first orders units within the stratum in ascending order by size measure, so that $M_1 \leq M_2 \leq \cdots \leq M_N$. Then the procedure selects the PPS sample of n observations as follows:
1. The procedure randomly chooses one of the integers $1, 2, \ldots, n$ with probabilities $\theta_1, \theta_2, \ldots, \theta_n$, where
$$\theta_i = n \, (Z_{N-n+i+1} - Z_{N-n+i}) \, (T + i \, Z_{N-n+1}) \, / \, T$$
$Z_j = M_j / M$, $T = \sum_{j=1}^{N-n} Z_j$, and, by definition, $Z_{N+1} = 1/n$ to ensure that $\sum_{i=1}^{n} \theta_i = 1$.
2. If i is the integer selected in step 1, the procedure includes the last $n - i$ units of the stratum in the sample, where the units are ordered by size measure as described previously. The procedure then selects the remaining i units according to steps 3 through 6 below.
3. The procedure defines new normed size measures for the remaining $N - n + i$ stratum units that were not selected in steps 1 and 2:
$$Z_j^* = \begin{cases} Z_j \, / \, (T + i \, Z_{N-n+1}) & \text{for } j = 1, \ldots, N-n+1 \\ Z_{N-n+1} \, / \, (T + i \, Z_{N-n+1}) & \text{for } j = N-n+2, \ldots, N-n+i \end{cases}$$
4. The procedure selects the next unit from the first $N - n + 1$ stratum units with probability proportional to $a_j(1)$, where
$$a_1(1) = i \, Z_1^* \qquad a_j(1) = i \, Z_j^* \prod_{k=1}^{j-1} \left[ 1 - (i-1) \, P_k \right] \quad \text{for } j = 2, \ldots, N-n+1$$
and $P_k = M_k \, / \, (M_{k+1} + M_{k+2} + \cdots + M_{N-n+i})$.
5. If stratum unit $j_1$ is the unit selected in step 4, then the procedure selects the next unit from units $j_1 + 1$ through $N - n + 2$ with probability proportional to $a_j(2, j_1)$, where
$$a_{j_1+1}(2, j_1) = (i-1) \, Z_{j_1+1}^* \qquad a_j(2, j_1) = (i-1) \, Z_j^* \prod_{k=j_1+1}^{j-1} \left[ 1 - (i-2) \, P_k \right] \quad \text{for } j = j_1+2, \ldots, N-n+2$$
6. The procedure repeats step 5 until all n sample units are selected.
If you request the JTPROBS option, PROC SURVEYSELECT computes the joint selection probabilities for all pairs of selected units in each stratum. The joint selection probability for units i and j in the stratum equals
$$P_{(ij)} = \sum_{r=1}^{n} \theta_r \, K_{ij}^{(r)}$$
where
$$K_{ij}^{(r)} = \begin{cases} 1 & i, j > N-n+r \\ r \, Z_{N-n+1} \, / \, (T + r \, Z_{N-n+1}) & N-n < i \leq N-n+r, \; j > N-n+r \\ r \, Z_i \, / \, (T + r \, Z_{N-n+1}) & 1 \leq i \leq N-n, \; j > N-n+r \\ \pi_{ij}^{(r)} & j \leq N-n+r \end{cases}$$
and
$$\pi_{ij}^{(r)} = \frac{r \, (r-1)}{2} \, P_i \, Z_j \prod_{k=1}^{i-1} (1 - P_k)$$
where $P_k = M_k \, / \, (M_{k+1} + M_{k+2} + \cdots + M_{N-n+r})$.
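A minimal sketch requesting these joint probabilities, using the HospitalFrame data set defined in Example 72.2; the seed, sample size, and output name are assumptions:

proc surveyselect data=HospitalFrame method=pps n=2
                  jtprobs               /* adds Unit and JtProb_ variables */
                  seed=40070 out=PPSJoint;
   size SizeMeasure;
run;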
PPS Sampling with Replacement
If you specify the option METHOD=PPS_WR, PROC SURVEYSELECT selects units with probability proportional to size and with replacement. The procedure makes $n_h$ independent random selections from the stratum of $N_h$ units, selecting with probability $Z_{hi} = M_{hi}/M_{h\cdot}$. Because units are selected with replacement, a unit can be selected for the sample more than once. The expected number of selections or hits for unit i in stratum h equals $n_h Z_{hi}$.
If you request the JTPROBS option, PROC SURVEYSELECT computes the joint expected number of hits for all pairs of selected units in each stratum. The joint expected number of hits for units i and j in stratum h equals
$$P_{h(ij)} = \begin{cases} n_h (n_h - 1) \, Z_{hi} Z_{hj} & \text{for } j \neq i \\ n_h (n_h - 1) \, Z_{hi} Z_{hi} \, / \, 2 & \text{for } j = i \end{cases}$$
PPS Systematic Sampling
If you specify the option METHOD=PPS_SYS, PROC SURVEYSELECT selects units by systematic random sampling with probability proportional to size. Systematic sampling selects units at a fixed interval throughout the stratum or sampling frame after a random start. PROC SURVEYSELECT uses a fractional interval to provide exactly the specified sample size. The interval equals $M_{h\cdot}/n_h$ for stratified sampling and $M/n$ for sampling without stratification. Depending on the sample size and the values of the size measures, it may be possible for a unit to be selected more
than once. The expected number of selections or hits for unit i in stratum h equals $n_h M_{hi}/M_{h\cdot} = n_h Z_{hi}$. Refer to Cochran (1977, pp. 265–266) and Madow (1949). Systematic random sampling controls the distribution of the sample by spreading it throughout the sampling frame or stratum at equal intervals, thus providing implicit stratification. You can use the CONTROL statement to order the input data set by the CONTROL variables before sample selection. If you also use a STRATA statement, PROC SURVEYSELECT sorts by the CONTROL variables within strata. If you do not specify a CONTROL statement, PROC SURVEYSELECT applies systematic selection to the observations in the order in which they appear in the input data set.
PPS Sequential Sampling
If you specify the option METHOD=PPS_SEQ, PROC SURVEYSELECT uses Chromy’s method of sequential random sampling. Refer to Chromy (1979) and Williams and Chromy (1980). Chromy’s method selects units sequentially with probability proportional to size and with minimum replacement. Selection with minimum replacement means that the actual number of hits for a unit can equal the integer part of the expected number of hits for that unit, or the next largest integer. This can be compared to selection without replacement, where each unit can be selected only once, so the number of hits can equal 0 or 1. The other alternative is selection with replacement, where there is no restriction on the number of hits for each unit, so the number of hits can equal $0, 1, \ldots, n_h$, where $n_h$ is the stratum sample size.
Sequential random sampling controls the distribution of the sample by spreading it throughout the sampling frame or stratum, thus providing implicit stratification according to the order of units in the frame or stratum. You can use the CONTROL statement to sort the input data set by the CONTROL variables before sample selection. If you also use a STRATA statement, PROC SURVEYSELECT sorts by the CONTROL variables within strata. By default, the procedure uses hierarchic serpentine ordering to sort the sampling frame by the CONTROL variables within strata. If you specify the SORT=NEST option, the procedure uses nested sorting. See the section “Sorting by CONTROL Variables” on page 4445 for descriptions of serpentine and nested sorting. If you do not specify a CONTROL statement, PROC SURVEYSELECT applies sequential selection to the observations in the order in which they appear in the input data set.
According to Chromy’s method of sequential selection, PROC SURVEYSELECT first chooses a starting unit randomly from the entire stratum, with probability proportional to size. The procedure uses this unit as the first one and treats the stratum observations as a closed loop. This is done so that all pairwise (joint) expected numbers of hits are positive and an unbiased variance estimator can be obtained. The procedure numbers observations sequentially from the random start to the end of the stratum and then continues from the beginning of the stratum until all units are numbered.
Beginning with the randomly chosen starting unit, Chromy’s method partitions the ordered stratum sampling frame into $n_h$ zones of equal size. There is one selection from each zone and a total of $n_h$ selections or hits, although fewer than $n_h$ distinct
units may be selected. Beginning with the random start, the procedure accumulates the expected number of hits and computes
$$ES_{hi} = n_h Z_{hi}$$
$$I_{hi} = \mathrm{Int}\left( \sum_{j=1}^{i} ES_{hj} \right) \qquad F_{hi} = \mathrm{Frac}\left( \sum_{j=1}^{i} ES_{hj} \right)$$
where $ES_{hi}$ represents the expected number of hits for unit i in stratum h, Int denotes the integer part of the number, and Frac denotes the fractional part.
Considering each unit sequentially, Chromy’s method determines the actual number of hits for unit i by comparing the total number of hits for the first $i-1$ units,
$$T_{h(i-1)} = \sum_{j=1}^{i-1} S_{hj}$$
with the value of $I_{h(i-1)}$.
If $T_{h(i-1)} = I_{h(i-1)}$, Chromy’s method determines the total number of hits for the first i units as follows. If $F_{hi} = 0$ or $F_{h(i-1)} > F_{hi}$, then $T_{hi} = I_{hi}$. Otherwise, $T_{hi} = I_{hi} + 1$ with probability
$$\left( F_{hi} - F_{h(i-1)} \right) / \left( 1 - F_{h(i-1)} \right)$$
And the number of hits for unit i equals $T_{hi} - T_{h(i-1)}$.
If $T_{h(i-1)} = I_{h(i-1)} + 1$, Chromy’s method determines the total number of hits for the first i units as follows. If $F_{hi} = 0$, then $T_{hi} = I_{hi}$. If $F_{hi} > F_{h(i-1)}$, then $T_{hi} = I_{hi} + 1$. Otherwise, $T_{hi} = I_{hi} + 1$ with probability
$$F_{hi} \, / \, F_{h(i-1)}$$
Brewer’s PPS Method
Brewer’s method (METHOD=PPS_BREWER) selects two units from each stratum, with probability proportional to size and without replacement. The selection probability for unit i in stratum h equals $2M_{hi}/M_{h\cdot} = 2Z_{hi}$.
Brewer’s algorithm first selects a unit with probability
$$\frac{Z_{hi} \, (1 - Z_{hi})}{D_h \, (1 - 2Z_{hi})}$$
where
$$D_h = \sum_{i=1}^{N_h} \frac{Z_{hi} \, (1 - Z_{hi})}{1 - 2Z_{hi}}$$
Then a second unit is selected from the remaining units with probability
$$\frac{Z_{hj}}{1 - Z_{hi}}$$
where unit i is the first unit selected. The joint selection probability for units i and j in stratum h equals
$$P_{h(ij)} = \frac{2 \, Z_{hi} Z_{hj}}{D_h} \cdot \frac{1 - Z_{hi} - Z_{hj}}{(1 - 2Z_{hi}) \, (1 - 2Z_{hj})}$$
Brewer’s method requires that the relative size $Z_{hi}$ be less than 0.5 for all units. Refer to Cochran (1977, pp. 261–263) and Brewer (1963). Brewer’s method yields the same selection probabilities and joint selection probabilities as Durbin’s method. Refer to Cochran (1977) and Durbin (1967).
Murthy’s PPS Method
Murthy’s method (METHOD=PPS_MURTHY) selects two units from each stratum, with probability proportional to size and without replacement. The selection probability for unit i in stratum h equals
$$P_{hi} = Z_{hi} \left[ 1 + K - Z_{hi} / (1 - Z_{hi}) \right]$$
where $Z_{hi} = M_{hi}/M_{h\cdot}$ and
$$K = \sum_{j=1}^{N_h} Z_{hj} / (1 - Z_{hj})$$
Murthy’s algorithm first selects a unit with probability $Z_{hi}$. Then a second unit is selected from the remaining units with probability $Z_{hj}/(1 - Z_{hi})$, where unit i is the first unit selected. The joint selection probability for units i and j in stratum h equals
$$P_{h(ij)} = Z_{hi} Z_{hj} \, \frac{2 - Z_{hi} - Z_{hj}}{(1 - Z_{hi}) \, (1 - Z_{hj})}$$
Refer to Cochran (1977, pp. 263–265) and Murthy (1957).
Sampford’s PPS Method
Sampford’s method (METHOD=PPS_SAMPFORD) is an extension of Brewer’s method that selects more than two units from each stratum, with probability proportional to size and without replacement. The selection probability for unit i in stratum h equals
$$P_{hi} = n_h \, \frac{M_{hi}}{M_{h\cdot}} = n_h Z_{hi}$$
Sampford’s method first selects a unit from stratum h with probability $Z_{hi}$. Then subsequent units are selected with probability proportional to
$$\frac{Z_{hi}}{1 - n_h Z_{hi}}$$
and with replacement. If the same unit appears more than once in the sample of size $n_h$, then Sampford’s algorithm rejects that sample and selects a new sample. The sample is accepted if it contains $n_h$ distinct units.
The joint selection probability for units i and j in stratum h equals
$$P_{h(ij)} = K_h \, \lambda_i \lambda_j \sum_{t=2}^{n_h} \left[ t - n_h (P_{hi} + P_{hj}) \right] L_{n_h - t}(ij) \, / \, n_h^{t-2}$$
where
$$\lambda_i = \frac{Z_{hi}}{1 - n_h Z_{hi}} \qquad L_m = \sum_{S(m)} \lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_m}$$
where $S(m)$ denotes all possible samples of size m, for $m = 1, 2, \ldots, N_h$. The sum $L_m(ij)$ is defined similarly to $L_m$ but sums over all possible samples of size m that do not include units i and j, and
$$K_h = \left( \sum_{t=1}^{n_h} t \, L_{n_h - t} \, / \, n_h^t \right)^{-1}$$
Sampford’s method requires that the relative size $Z_{hi}$ be less than $1/n_h$ for all units. Refer to Cochran (1977, pp. 262–263) and Sampford (1967).
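A minimal sketch, again using the HospitalFrame data from Example 72.2; the sample size and seed are assumptions, and the method requires every relative size to be below 1/n:

proc surveyselect data=HospitalFrame method=pps_sampford n=3
                  seed=40070 out=SampfordSample;
   size SizeMeasure;   /* every Z_i must be < 1/3 for this call */
run;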
Output Data Set
PROC SURVEYSELECT creates a SAS data set that contains the sample of selected units. You can specify the name of this output data set with the OUT= option in the PROC SURVEYSELECT statement. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique.
By default, the output data set contains one observation for each unit selected for the sample. But if you specify the OUTALL option, the output data set includes all observations from the input data set. With OUTALL, the output data set also contains a variable to indicate each observation’s selection status. The variable Selected equals 1 for an observation selected for the sample, and equals 0 for an observation not selected. The OUTALL option is available only for equal probability selection methods.
If you specify the OUTHITS option for methods that may select the same unit more than once (that is, methods that select with replacement or with minimum replacement), the output data set contains a separate observation for each selection. If you do not specify the OUTHITS option, the output data set contains only one observation for each selected unit, even if the unit is selected more than once, and the variable NumberHits contains the number of hits or selections for that unit.
The output data set contains design information and selection statistics, depending on the selection method and output options you specify. The output data set can include the following variables:
• Selected, which indicates whether or not the observation is selected for the sample. This variable is included if you specify the OUTALL option. It equals 1 for an observation selected for the sample, and it equals 0 for an observation not selected.
• STRATA variables, which you specify in the STRATA statement
• Replicate, which is the sample replicate number. This variable is included when you request replicated sampling with the REP= option.
• ID variables, which you name in the ID statement
• CONTROL variables, which you specify in the CONTROL statement
• Zone, which is the selection zone. This variable is included for METHOD=PPS_SEQ.
• SIZE variable, which you specify in the SIZE statement
• AdjustedSize, which is the adjusted size measure. This variable is included if you request adjusted sizes with the MINSIZE= option or the MAXSIZE= option.
• Certain, which indicates certainty selection. This variable is included if you specify the CERTSIZE= option. It equals 1 for units included with certainty because their size measures exceed the certainty size measure. Otherwise, it equals 0.
• NumberHits, which is the number of hits or selections. This variable is included for selection methods that are with replacement or with minimum replacement (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ).
The output data set includes the following variables if you request a PPS selection method or if you specify the STATS option for other methods:
• ExpectedHits, which is the expected number of hits or selections. This variable is included for selection methods that are with replacement or with minimum replacement, and so may select the same unit more than once (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ).
• SelectionProb, which is the probability of selection. This variable is included for selection methods that are without replacement.
• SamplingWeight, which is the sampling weight. This variable equals the inverse of ExpectedHits or SelectionProb.
For METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units from each stratum with probability proportional to size, the output data set contains the following variable:
• JtSelectionProb, which is the joint probability of selection for the two units selected from the stratum
If you request the JTPROBS option to compute joint probabilities of selection for METHOD=PPS or METHOD=PPS_SAMPFORD, then the output data set contains the following variables:
• Unit, which is an identification variable that numbers the selected units sequentially within each stratum
• JtProb_1, JtProb_2, JtProb_3, . . . , where the variable JtProb_1 contains the joint probability of selection for the current unit and unit 1. Similarly, JtProb_2 contains the joint probability of selection for the current unit and unit 2, and so on.
If you request the JTPROBS option for METHOD=PPS_WR, then the output data set contains the following variables:
• Unit, which is an identification variable that numbers the selected units sequentially within each stratum
• JtHits_1, JtHits_2, JtHits_3, . . . , where the variable JtHits_1 contains the joint expected number of hits for the current unit and unit 1. Similarly, JtHits_2 contains the joint expected number of hits for the current unit and unit 2, and so on.
If you request the OUTSIZE option, the output data set contains the following variables. If you specify a STRATA statement, the output data set includes stratum-level values of these variables. Otherwise, the output data set contains population-level values of these variables.
• MinimumSize, which is the minimum size measure specified with the MINSIZE= option. This variable is included if you request the MINSIZE= option.
• MaximumSize, which is the maximum size measure specified with the MAXSIZE= option. This variable is included if you request the MAXSIZE= option.
• CertaintySize, which is the certainty size measure specified with the CERTSIZE= option. This variable is included if you request the CERTSIZE= option.
• Total, which is the total number of sampling units in the stratum. This variable is included if there is no SIZE statement.
• TotalSize, which is the total of size measures in the stratum. This variable is included if there is a SIZE statement.
• TotalAdjSize, which is the total of adjusted size measures in the stratum. This variable is included if there is a SIZE statement and if you request adjusted sizes with the MAXSIZE= option or the MINSIZE= option.
• SamplingRate, which is the sampling rate. This variable is included if you specify the SAMPRATE= option.
• SampleSize, which is the sample size. This variable is included if you specify the SAMPSIZE= option, or if you specify METHOD=PPS_BREWER or METHOD=PPS_MURTHY, which select two units from each stratum.
If you request the OUTSEED option, the output data set contains the following variable:
• InitialSeed, which is the initial seed for the stratum.
Displayed Output
By default, PROC SURVEYSELECT displays two tables that summarize the sample selection. You can suppress display of these tables by using the NOPRINT option. PROC SURVEYSELECT creates an output data set that contains the units selected for the sample. The procedure does not display this output data set. Use PROC PRINT, PROC REPORT, or any other SAS reporting tool to display the output data set.
PROC SURVEYSELECT displays the following information in the “Sample Selection Method” table:
• Selection Method
• Size Measure variable, if you specify a SIZE statement
• Minimum Size Measure, if you specify the MINSIZE= option
• Maximum Size Measure, if you specify the MAXSIZE= option
• Certainty Size Measure, if you specify the CERTSIZE= option
• Strata Variables, if you specify a STRATA statement
• Control Variables, if you specify a CONTROL statement
• type of Control Sorting, Serpentine or Nested, if you specify a CONTROL statement
PROC SURVEYSELECT displays the following information in the “Sample Selection Summary” table:
• Input Data Set name
• Sorted Data Set name, if you specify the OUTSORT= option
• Random Number Seed
• Sample Size or Stratum Sample Size, if you specify the SAMPSIZE=n option
• Sample Size Data Set, if you specify the SAMPSIZE=SAS-data-set option
• Sampling Rate or Stratum Sampling Rate, if you specify the SAMPRATE=r option
• Sampling Rate Data Set, if you specify the SAMPRATE=SAS-data-set option
• Minimum Sample Size or Stratum Minimum Sample Size, if you specify the NMIN= option with the SAMPRATE= option
• Maximum Sample Size or Stratum Maximum Sample Size, if you specify the NMAX= option with the SAMPRATE= option
• Selection Probability, if you specify METHOD=SRS, METHOD=SYS, or METHOD=SEQ and do not specify a STRATA statement
• Expected Number of Hits, if you specify METHOD=URS and do not specify a STRATA statement
• Sampling Weight for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ), if you do not specify a STRATA statement
• Number of Strata, if you specify a STRATA statement
• Number of Replicates, if you specify the REP= option
• Total Sample Size, if you specify a STRATA statement or the REP= option
• Output Data Set name
ODS Table Names
PROC SURVEYSELECT assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, “Using the Output Delivery System.”

Table 72.2. ODS Tables Produced in PROC SURVEYSELECT

ODS Table Name   Description                 Statement   Option
Method           Sample selection method     PROC        default
Summary          Sample selection summary    PROC        default
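For example, to capture the summary table as a data set (the input data set and sample parameters are assumed):

ods output Summary=SelSummary;   /* ODS table name from Table 72.2 */
proc surveyselect data=Customers method=srs n=50 seed=1953
                  out=Sample;
run;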
Examples
Example 72.1. Replicated Sampling
This example uses the Customers data set from the section “Getting Started” on page 4422. The data set Customers contains an Internet service provider’s current subscribers, and the service provider wants to select a sample from this population for a customer satisfaction survey.
This example illustrates replicated sampling, which selects multiple samples from the survey population according to the same design. You can use replicated sampling to provide a simple method of variance estimation, or to evaluate variable nonsampling errors such as interviewer differences. Refer to Lohr (1999), Kish (1965, 1987), and Kalton (1983) for information on replicated sampling.
This design includes four replicates, each with a sample size of 50 customers. The sampling frame is stratified by State and sorted by Type and Usage within strata. Customers are selected by sequential random sampling with equal probability within strata. The following PROC SURVEYSELECT statements select a probability sample of customers from the Customers data set using this design.

title1 'Customer Satisfaction Survey';
title2 'Replicated Sampling';

proc surveyselect data=Customers method=seq
                  n=(8 12 20 10) rep=4
                  seed=40070 out=SampleRep;
   strata State;
   control Type Usage;
run;
The STRATA statement names the stratification variable State. The CONTROL statement names the control variables Type and Usage. In the PROC SURVEYSELECT statement, the METHOD=SEQ option requests sequential random sampling. The REP=4 option specifies four replicates of this sample. The N=(8 12 20 10) option specifies the stratum sample sizes for each replicate. The
N= option lists the stratum sample sizes in the same order as the strata appear in the Customers data set, which has been sorted by State. The sample size of eight customers corresponds to the first stratum, State = ‘AL’. The sample size 12 corresponds to the next stratum, State = ‘FL’, and so on. The SEED=40070 option specifies 40070 as the initial seed for random number generation.
Output 72.1.1 displays the output from PROC SURVEYSELECT, which summarizes the sample selection. A total of 200 customers is selected in four replicates. PROC SURVEYSELECT selects each replicate using sequential random sampling within strata determined by State. The sampling frame Customers is sorted by the control variables Type and Usage within strata, according to hierarchic serpentine sorting. The output data set SampleRep contains the sample.

Output 72.1.1. Sample Selection Summary

                   Customer Satisfaction Survey
                        Replicated Sampling

                   The SURVEYSELECT Procedure

       Selection Method     Sequential Random Sampling
                            With Equal Probability
       Strata Variable      State
       Control Variables    Type Usage
       Control Sorting      Serpentine

       Input Data Set        CUSTOMERS
       Random Number Seed    40070
       Number of Strata      4
       Number of Replicates  4
       Total Sample Size     200
       Output Data Set       SAMPLEREP
The following PROC PRINT statements display the selected customers for the first stratum, State = ‘AL’, from the output data set SampleRep.

title1 'Customer Satisfaction Survey';
title2 'Sample Selected by Replicated Design';
title3 '(First Stratum)';

proc print data=SampleRep;
   where State = 'AL';
run;
4461
4462
Chapter 72. The SURVEYSELECT Procedure
Output 72.1.2. Customer Sample (First Stratum) Customer Satisfaction Survey Sample Selected by Replicated Design (First Stratum)
Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
State AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL AL
Replicate 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
CustomerID 882-37-7496 581-32-5534 980-29-2898 172-56-4743 998-55-5227 625-44-3396 627-48-2509 257-66-6558 622-83-1680 343-57-1186 976-05-3796 859-74-0652 476-48-1066 109-27-8914 743-25-0298 722-08-2215 668-57-7696 300-72-0129 073-60-0765 526-87-0258 726-61-0387 632-29-9020 417-17-8378 091-26-2366 336-04-1288 827-04-7407 317-70-6496 002-38-4582 181-83-3990 675-34-7393 228-07-6671 298-46-2434
Type New New Old Old Old New New New New New New New New Old Old Old New New New Old Old Old New New New New Old Old Old New New New
Usage
Selection Prob
572 863 571 128 35 60 114 172 22 53 110 303 839 2102 376 105 200 471 656 672 150 51 56 93 419 650 452 206 33 47 65 161
.004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226 .004115226
Sampling Weight 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243 243
Output 72.1.2 displays the 32 sample customers of the first stratum (State = ‘AL’) from the output data set SampleRep, which includes the entire sample of 200 customers. The variable SelectionProb contains the selection probability, and SamplingWeight contains the sampling weight. Since customers are selected with equal probability within strata in this design, all customers in the same stratum have the same selection probability. These selection probabilities and sampling weights apply to a single replicate, and the variable Replicate contains the sample replicate number.
Example 72.2. PPS Selection of Two Units Per Stratum
Example 72.2. PPS Selection of Two Units Per Stratum A state health agency plans to conduct a state-wide survey of a variety of different hospital services. The agency plans to select a probability sample of individual discharge records within hospitals using a two-stage sample design. First stage units are hospitals, and second stage units are patient discharges during the study time period. Hospitals are stratified first according to geographic region and then by rural/urban type and size of hospital. Two hospitals are selected from each stratum with probability proportional to size. This example describes hospital selection for this survey using PROC SURVEYSELECT. The data set HospitalFrame contains all hospitals in the first geographical region of this state. data HospitalFrame; input Hospital$ Type$ SizeMeasure @@; if (SizeMeasure < 20) then Size=’Small ’; else if (SizeMeasure < 50) then Size=’Medium’; else Size=’Large ’; datalines; 034 Rural 0.870 107 Rural 1.316 079 Rural 2.127 223 Rural 3.960 236 Rural 5.279 165 Rural 5.893 086 Rural 0.501 141 Rural 11.528 042 Urban 3.104 124 Urban 4.033 006 Urban 4.249 261 Urban 4.376 195 Urban 5.024 190 Urban 10.373 038 Urban 17.125 083 Urban 40.382 259 Urban 44.942 129 Urban 46.702 133 Urban 46.992 218 Urban 48.231 026 Urban 61.460 058 Urban 65.931 119 Urban 66.352 ;
In the SAS data set HospitalFrame, the variable Hospital identifies the hospital. The variable Type equals ‘Urban’ if the hospital is located in an urbanized area, and ‘Rural’ otherwise. The variable SizeMeasure contains the hospital’s size measure, which is constructed from past data on service utilization for the hospital together with the desired sampling rates for each service. This size measure reflects the amount of relevant survey information expected from the hospital. Refer to Drummond et al. (1982) for details on this type of size measure. The variable Size equals ‘Small’, ‘Medium’, or ‘Large’, depending on the value of the hospital’s size measure. The following PROC PRINT statements display the data set Hospital Frame. title1 ’Hospital Utilization Survey’; title2 ’Sampling Frame, Region 1’; proc print data=HospitalFrame; run;
4463
4464
Chapter 72. The SURVEYSELECT Procedure
Output 72.2.1. Sampling Frame Hospital Utilization Survey Sampling Frame, Region 1
Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Hospital 034 107 079 223 236 165 086 141 042 124 006 261 195 190 038 083 259 129 133 218 026 058 119
Type Rural Rural Rural Rural Rural Rural Rural Rural Urban Urban Urban Urban Urban Urban Urban Urban Urban Urban Urban Urban Urban Urban Urban
Size Measure 0.870 1.316 2.127 3.960 5.279 5.893 0.501 11.528 3.104 4.033 4.249 4.376 5.024 10.373 17.125 40.382 44.942 46.702 46.992 48.231 61.460 65.931 66.352
Size Small Small Small Small Small Small Small Small Small Small Small Small Small Small Small Medium Medium Medium Medium Medium Large Large Large
The following PROC SURVEYSELECT statements select a probability sample of hospitals from the HospitalFrame data set, using a stratified design with PPS selection of two units from each stratum. title1 ’Hospital Utilization Survey’; proc surveyselect data=HospitalFrame method=pps_brewer seed=48702 out=SampleHospitals; size SizeMeasure; strata Type Size notsorted; run;
The STRATA statement names the stratification variables Type and Size. The NOTSORTED option specifies that observations with the same STRATA variable values are grouped together but are not necessarily sorted in alphabetical or increasing numerical order. In the HospitalFrame data set, Size = ‘Small’ precedes Size = ‘Medium’. In the PROC SURVEYSELECT statement, the METHOD=PPS– BREWER option requests sample selection by Brewer’s method, which selects two units per stratum with probability proportional to size. The SEED=48702 option specifies ’48702’ as the initial seed for random number generation. The SIZE statement specifies the size measure variable. It is not necessary to specify the sample size with the N= option, since Brewer’s method always selects two units from each stratum.
Example 72.2. PPS Selection of Two Units Per Stratum
Output 72.2.2 displays the output from PROC SURVEYSELECT. A total of 8 hospitals were selected from the 4 strata. The data set SampleHospitals contains the selected hospitals. Output 72.2.2. Sample Selection Summary Hospital Utilization Survey The SURVEYSELECT Procedure Selection Method Size Measure Strata Variables
Brewer’s PPS Method SizeMeasure Type Size
Input Data Set Random Number Seed Stratum Sample Size Number of Strata Total Sample Size Output Data Set
HOSPITALFRAME 48702 2 4 8 SAMPLEHOSPITALS
The following PROC PRINT statements display the sample hospitals. title1 ’Hospital Utilization Survey’; title2 ’Sample Selected by Stratified PPS Design’; proc print data=SampleHospitals; run;
Output 72.2.3. Sample Hospitals Hospital Utilization Survey Sample Selected by Stratified PPS Design
Obs 1 2 3 4 5 6 7 8
Type Rural Rural Urban Urban Urban Urban Urban Urban
Size
Hospital
Small Small Small Small Medium Medium Large Large
079 236 006 195 133 218 026 058
Size Measure 2.127 5.279 4.249 5.024 46.992 48.231 61.460 65.931
Selection Prob
Sampling Weight
Jt Selection Prob
0.13516 0.33545 0.17600 0.20810 0.41357 0.42448 0.63445 0.68060
7.39868 2.98106 5.68181 4.80533 2.41795 2.35584 1.57617 1.46929
0.01851 0.01851 0.01454 0.01454 0.11305 0.11305 0.31505 0.31505
The variable SelectionProb contains the selection probability for each hospital in the sample. The variable JtSelectionProb contains the joint probability of selection for the two sample hospitals in the same stratum. The variable SamplingWeight contains the sampling weight component for this first stage of the design. The final-stage weight components, which correspond to patient record selection within hospitals, can be multiplied by the hospital weight components to obtain the overall sampling weights.
4465
4466
Chapter 72. The SURVEYSELECT Procedure
Example 72.3. PPS (Dollar-Unit) Sampling A small company wants to audit employee travel expenses in an effort to improve the expense reporting procedure and possibly reduce expenses. The company does not have resources to examine all expense reports and wants to use statistical sampling to objectively select expense reports for audit. The data set TravelExpense contains the dollar amount of all employee travel expense transactions during the past month. data TravelExpense; input ID$ Amount @@; if (Amount < 500) then Level=’1_Low ’; else if (Amount > 1500) then Level=’3_High’; else Level=’2_Avg ’; datalines; 110 237.18 002 567.89 234 118.50 743 74.38 411 1287.23 782 258.10 216 325.36 174 218.38 568 1670.80 302 134.71 285 2020.70 314 47.80 139 1183.45 775 330.54 425 780.10 506 895.80 239 620.10 011 420.18 672 979.66 142 810.25 738 670.85 192 314.58 243 87.50 263 1893.40 496 753.30 332 540.65 486 2580.35 614 230.56 654 185.60 308 688.43 784 505.14 017 205.48 162 650.42 289 1348.34 691 30.50 545 2214.80 517 940.35 382 217.85 024 142.90 478 806.90 107 560.72 ;
In the SAS data set TravelExpense, the variable ID identifies the travel expense report. The variable Amount contains the dollar amount of the reported expense. The variable Level equals ‘1– Low’, ‘2– Avg’, or ‘3– High’, depending on the value of Amount. In the sample design for this audit, expense reports are stratified by Level. This ensures that each of these expense levels is included in the sample and also permits a disproportionate allocation of the sample, selecting proportionately more of the expense reports from the higher levels. Within strata, the sample of expense reports is selected with probability proportional to the amount of the expense, thus giving a greater chance of selection to larger expenses. In auditing terms, this is known as monetary-unit sampling. Refer to Wilburn (1984). PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the TravelExpense data set by the stratification variable Level. proc sort data=TravelExpense; by Level; run;
Example 72.3. PPS (Dollar-Unit) Sampling
The following PROC PRINT statements display the sampling frame data set TravelExpense, which contains 41 observations. title1 ’Travel Expense Audit’; proc print data=TravelExpense; run;
Output 72.3.1. Sampling Frame Travel Expense Audit Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
ID 110 002 234 743 411 782 216 174 568 302 285 314 139 775 425 506 239 011 672 142 738 192 243 263 496 332 486 614 654 308 784 017 162 289 691 545 517 382 024 478 107
Amount 237.18 567.89 118.50 74.38 1287.23 258.10 325.36 218.38 1670.80 134.71 2020.70 47.80 1183.45 330.54 780.10 895.80 620.10 420.18 979.66 810.25 670.85 314.58 87.50 1893.40 753.30 540.65 2580.35 230.56 185.60 688.43 505.14 205.48 650.42 1348.34 30.50 2214.80 940.35 217.85 142.90 806.90 560.72
Level 1_Low 2_Avg 1_Low 1_Low 2_Avg 1_Low 1_Low 1_Low 3_High 1_Low 3_High 1_Low 2_Avg 1_Low 2_Avg 2_Avg 2_Avg 1_Low 2_Avg 2_Avg 2_Avg 1_Low 1_Low 3_High 2_Avg 2_Avg 3_High 1_Low 1_Low 2_Avg 2_Avg 1_Low 2_Avg 2_Avg 1_Low 3_High 2_Avg 1_Low 1_Low 2_Avg 2_Avg
4467
4468
Chapter 72. The SURVEYSELECT Procedure
The following PROC SURVEYSELECT statements select a probability sample of expense reports from the TravelExpense data set using the stratified design with PPS selection within strata. title1 ’Travel Expense Audit’; proc surveyselect data=TravelExpense method=pps n=(6 10 4) seed=47279 out=AuditSample; size Amount; strata Level; run;
The STRATA statement names the stratification variable Level. The SIZE statement specifies the size measure variable Amount. In the PROC SURVEYSELECT statement, the METHOD=PPS option requests sample selection with probability proportional to size and without replacement. The N=(6 10 4) option specifies the stratum sample sizes, listing the sample sizes in the same order that the strata appear in the TravelExpense data set. The sample size of 6 corresponds to the first stratum, Level = ‘1– Low’, the sample size of 10 corresponds to the second stratum, Level = ‘2– Avg’, and 4 corresponds to the last stratum, Level = ‘3– High’. The SEED=47279 option specifies ’47279’ as the initial seed for random number generation. Output 72.3.2 displays the output from PROC SURVEYSELECT. A total of 20 expense reports is selected for audit. The data set AuditSample contains the sample of travel expense reports. Output 72.3.2. Sample Selection Summary Travel Expense Audit The SURVEYSELECT Procedure Selection Method Size Measure Strata Variable
PPS, Without Replacement Amount Level
Input Data Set Random Number Seed Number of Strata Total Sample Size Output Data Set
TRAVELEXPENSE 47279 3 20 AUDITSAMPLE
The following PROC PRINT statements display the audit sample, which is shown in Output 72.3.3. title1 ’Travel Expense Audit’; title2 ’Sample Selected by Stratified PPS Design’; proc print data=AuditSample; run;
References
Output 72.3.3. Audit Sample Travel Expense Audit Sample Selected by Stratified PPS Design
Obs
Level
ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1_Low 1_Low 1_Low 1_Low 1_Low 1_Low 2_Avg 2_Avg 2_Avg 2_Avg 2_Avg 2_Avg 2_Avg 2_Avg 2_Avg 2_Avg 3_High 3_High 3_High 3_High
654 017 382 614 782 775 784 332 002 239 738 496 425 478 672 139 568 263 285 486
Amount 185.60 205.48 217.85 230.56 258.10 330.54 505.14 540.65 567.89 620.10 670.85 753.30 780.10 806.90 979.66 1183.45 1670.80 1893.40 2020.70 2580.35
Selection Prob
Sampling Weight
0.31105 0.34437 0.36510 0.38640 0.43256 0.55396 0.34623 0.37057 0.38924 0.42503 0.45981 0.51633 0.53470 0.55307 0.67148 0.81116 0.64385 0.72963 0.77869 0.99435
3.21489 2.90385 2.73896 2.58797 2.31183 1.80518 2.88823 2.69853 2.56909 2.35278 2.17479 1.93676 1.87022 1.80810 1.48925 1.23280 1.55316 1.37056 1.28421 1.00568
References Bentley, J.L. and Floyd, R. (1987), “A Sample of Brilliance,” Communications of the Association for Computing Machinery, 30, 754–757. Bentley, J.L. and Knuth, D. (1986), “Literate Programming,” Communications of the Association for Computing Machinery, 29, 364–369. Brewer, K.W.R. (1963), “A Model of Systematic Sampling with Unequal Probabilities,” Australian Journal of Statistics, 5, 93–105. Chromy, J.R. (1979), “Sequential Sample Selection Methods,” Proceedings of the American Statistical Association, Survey Research Methods Section, 401–406. Cochran, W.G. (1977), Sampling Techniques, Third Edition, New York: John Wiley & Sons, Inc. Drummond, D., Lessler, J., Watts, D., and Williams, S. (1982), “A Design for Achieving Prespecified Levels of Representation for Multiple Domains in Health Record Samples,” Proceedings of the Fourth Conference on Health Survey Research Methods, DHHS Publication No. (PHS) 84-3346, Washington, D.C.: National Center for Health Services Research, 233–248. Durbin, J. (1967), “Design of Multi-stage Surveys for the Estimation of Sampling Errors,” Applied Statistics, 16, 152–164.
4469
4470
Chapter 72. The SURVEYSELECT Procedure
Fan, C.T., Muller, M.E., and Rezucha, I. (1962), “Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers,” Journal of the American Statistical Association, 57, 387–402. Fishman, G.S. and Moore, L.R. (1982), “A Statistical Evaluation of Multiplicative Congruential Generators with Modulus (231 − 1),” Journal of the American Statistical Association, 77, 129–136. Fox, D.R. (1989), “Computer Selection of Size-Biased Samples,” The American Statistician, 43(3), 168–171. Golmant, J. (1990), “Correction: Computer Selection of Size-Biased Samples,” The American Statistician, 44(2), 194. Hanurav, T.V. (1967), “Optimum Utilization of Auxiliary Information: πps Sampling of Two Units from a Stratum,” Journal of the Royal Statistical Society, Series B, 29, 374–391. Kalton, G. (1983), Introduction to Survey Sampling, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-035, Beverly Hills, CA and London: Sage Publications, Inc. Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons, Inc. Kish, L. (1987), Statistical Design for Research, New York: John Wiley & Sons, Inc. Lohr, S.L. (1999), Sampling: Design and Analysis, Pacific Grove, CA: Duxbury Press. Madow, W.G. (1949), “On the Theory of Systematic Sampling, II,” Annals of Mathematical Statistics, 20, 333–354. McLeod, A.I. and Bellhouse, D.R. (1983), “A Convenient Algorithm for Drawing a Simple Random Sample,” Applied Statistics, 32, 182–183. Murthy, M.N. (1957), “Ordered and Unordered Estimators in Sampling Without Replacement,” Sankhy¯ a, 18, 379–390. Murthy, M.N. (1967), Sampling Theory and Methods, Calcutta, India: Statistical Publishing Society. Sampford, M.R. (1967), “On Sampling without Replacement with Unequal Probabilities of Selection,” Biometrika, 54, 499–513. Vijayan, K. (1968), “An Exact πps Sampling Scheme: Generalization of a Method of Hanurav,” Journal of the Royal Statistical Society, Series B, 30, 556–566. Watts, D.L. (1991), “Correction: Computer Selection of Size-Biased Samples,” The American Statistician, 45(2), 172. Wilburn, A.J. (1984), Practical Statistical Sampling for Auditors, New York: Marcel Dekker, Inc. Williams, R.L. and Chromy, J.R. (1980), “SAS Sample Selection Macros,” Proceedings of the Fifth Annual SAS Users Group International Conference, 5, 392–396.
Chapter 73
The TPHREG Procedure (Experimental) Chapter Contents OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4473 SYNTAX . . . . . . . . . . . PROC TPHREG Statement MODEL Statement . . . . CLASS Statement . . . . . CONTRAST Statement . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. 4474 . 4474 . 4474 . 4477 . 4479
DETAILS . . . . . . . . . . . . . . . . . . . . CLASS Variable Parameterization . . . . . Miscellaneous Changes from PROC PHREG Displayed Output . . . . . . . . . . . . . . ODS Table Names . . . . . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. 4482 . 4482 . 4485 . 4486 . 4488
EXAMPLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4489 Example 73.1. Analysis of the VA Lung Cancer Data . . . . . . . . . . . . 4489 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4494
4472
Chapter 73. The TPHREG Procedure (Experimental)
Chapter 73
The TPHREG Procedure (Experimental) Overview The TPHREG procedure, experimental in this release, includes most of the functionality of the PHREG procedure (see Chapter 54, “The PHREG Procedure”) with the additional benefits of the CLASS statement. The CLASS statement enables you to specify categorical variables (also known as factors or CLASS variables) to be used in the analysis. Model effects, including covariates, main effects (CLASS variables), crossed effects (interactions), and nested effects, can be specified in the same way as in the GLM procedure. The CLASS statement supports the less-than-full-rank parameterization as in the GLM procedure as well as various full-rank parameterization methods such as reference coding, effect coding, and orthogonal polynomial coding. For some of the full-rank coding schemes, you can designate a specific value (category or level) of the CLASS variable as the reference level. The CLASS statement also enables you to specify the ordering of the categories of CLASS variables, to reverse the ordering of the categories, and to treat categories with missing values as valid categories. With the TPHREG procedure, you can control how to move model effects in and out of a model with various model-building strategies such as forward selection, backward elimination, or stepwise selection. When there are no interaction terms, a main effect can enter or leave a model in a single step based on the p-value of the score or Wald statistic, respectively. When there are crossed or nested effects, the selection process also depends on whether you want to preserve model hierarchy. The HIERARCHY= option in the MODEL statement enables you to specify whether model hierarchy is to be preserved, how model hierarchy is applied, and whether a single effect or multiple effects can be moved in a single step. The TPHREG procedure also enables you to specify CONTRAST statements for testing customized hypotheses concerning the regression parameters. Each CONTRAST statement also provides estimation of individual rows of contrasts, which is particularly useful in comparing the hazards between the categories of a CLASS variable.
4474
Chapter 73. The TPHREG Procedure (Experimental)
Syntax The PROC TPHREG statement invokes the TPHREG procedure. All the other statements in the PHREG procedure (with the exception of the experimental ASSESS statement) are available in the TPHREG procedure. The MODEL statement in the TPHREG procedure enables you to specify explanatory effects, not just individual continuous variables, and has additional options specifically for having CLASS variables. In addition, you can specify the CLASS statement and the CONTRAST statement as follows:
CLASS variable <(options)>
PROC TPHREG Statement PROC TPHREG < options > ; All PROC PHREG statement options can be used in the PROC TPHREG statement. In addition, you can specify the following option: NAMELEN=n
specifies the length of effect names in tables and output data sets to be n characters, where n is a value between 20 and 200. The default length is 20 characters.
MODEL Statement MODEL time < *censor ( list ) > = effects < /options > ; MODEL (t1, t2) < *censor(list) > = effects < /options > ; The specifications of the time variables, the censoring indicator, and censored values are same as those in the PHREG procedure. The model effects, which follow the equal sign, include continuous or CLASS variables as the main effects. Categorical variables, which can be character or numeric, must be declared in the CLASS statement. Crossed and nested effects can be specified in the same fashion as in the GLM procedure (see the section “Specification of Effects” on page 1784 of Chapter 32, “The GLM Procedure,” for more information).
MODEL Statement
Any MODEL statement options in the PHREG procedure can be used in the TPHREG procedure. To accommodate the broader specification of model effects, the variable-selection options INCLUDE=, START=, and STOP= have been modified. INCLUDE=n
includes the first n effects in the MODEL statement in every model. By default, INCLUDE=0. The INCLUDE= option has no effect when SELECTION=NONE. START=n
begins the FORWARD, BACKWARD, or STEPWISE selection process with the first n effects listed in the MODEL statement. The value of n ranges from 0 to s, where s is the total number of effects in the MODEL statement. The default value of n is s for the BACKWARD method and 0 for the FORWARD and STEPWISE methods. Note that START=n specifies only that the first n effects appear in the first model, while INCLUDE=n requires that the first n effects be included in every model. For the SCORE method, START=n specifies that the smallest models contain n effects, where n ranges from 1 to s; the default value is 1. The START= option has no effect when SELECTION=NONE. STOP=n
specifies the maximum (FORWARD method) or minimum (BACKWARD method) number of effects to be included in the final model. The effect selection process is stopped when n effects are found. The value of n ranges from 0 to s, where s is the total number of effects in the MODEL statement. The default value of n is s for the FORWARD method and 0 for the BACKWARD method. For the SCORE method, STOP=n specifies that the smallest models contain n effects, where n ranges from 1 to s; the default value of n is s. The STOP= option has no effect when SELECTION=NONE or STEPWISE. Two new options are added to the MODEL statement in the TPHREG procedure. HIERARCHY=keyword HIER=keyword
specifies whether and how the model hierarchy requirement is applied and whether a single effect or multiple effects are allowed to enter or leave the model in one step. You can specify that only CLASS variable effects, or both CLASS and continuous variable effects, be subject to the hierarchy requirement. The HIERARCHY= option is ignored unless you also specify the FORWARD, BACKWARD, or STEPWISE selection method. Model hierarchy refers to the requirement that, for any term to be in the model, all effects contained in the term must be present in the model. For example, in order for the interaction A*B to enter the model, the main effects A and B must be in the model. Likewise, neither effect A nor B can leave the model while the interaction A*B is in the model.
4475
4476
Chapter 73. The TPHREG Procedure (Experimental)
The keywords you can specify in the HIERARCHY= option are described as follows: NONE Model hierarchy is not maintained. Any single effect can enter or leave the model at any given step of the selection process. SINGLE Only one effect can enter or leave the model at one time, subject to the model hierarchy requirement. For example, suppose that you specify the main effects A and B and the interaction of A*B in the model. In the first step of the selection process, either A or B can enter the model. In the second step, the other main effect can enter the model. The interaction effect can enter the model only when both main effects have already been entered. Also, before A or B can be removed from the model, the A*B interaction must first be removed. All effects (CLASS and continuous variables) are subject to the hierarchy requirement. SINGLECLASS This is the same as HIERARCHY=SINGLE except that only CLASS effects are subject to the hierarchy requirement. MULTIPLE More than one effect can enter or leave the model at one time, subject to the model hierarchy requirement. In a forward selection step, a single main effect can enter the model, or an interaction can enter the model together with all the effects that are contained in the interaction. In a backward elimination step, an interaction itself, or the interaction together with all the effects that the interaction contains, can be removed. All effects (CLASS and continuous variable) are subject to the hierarchy requirement. MULTIPLECLASS This is the same as HIERARCHY=MULTIPLE except that only CLASS effects are subject to the hierarchy requirement. The default value is HIERARCHY=SINGLE, which means that model hierarchy is to be maintained for all effects (that is, both CLASS and continuous variable effects) and that only a single effect can enter or leave the model at each step. NODUMMYPRINT NODESIGNPRINT NODP
suppresses the “Class Level Information” table, which shows how the design matrix columns for the CLASS variables are coded.
CLASS Statement
CLASS Statement CLASS variable <(options)>
specifies that, at most, the first n characters of a CLASS variable name be used in creating names for the corresponding dummy variables. The default is 32 − min(32, max(2, f )), where f is the formatted length of the CLASS variable. DESCENDING DESC
reverses the sorting order of the categorical variable. LPREFIX= n
specifies that, at most, the first n characters of a CLASS variable label be used in creating labels for the corresponding dummy variables. MISSING
allows missing value (for example,‘.’ for a numeric variable and blanks for a character variable) as a valid value for the CLASS variable. ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the categories of cateogrical variables. This ordering determines which parameters in the model correspond to each level in the data, so the ORDER= option may be useful when you use the CONTRAST statement. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. The following table shows how PROC TPHREG interprets values of the ORDER= option. Value of ORDER= DATA
Levels Sorted By order of appearance in the input data set
FORMATTED
external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ
descending frequency count; levels with the most observations come first in the order
INTERNAL
unformatted value
4477
4478
Chapter 73. The TPHREG Procedure (Experimental)
By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent. For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts. PARAM=keyword
specifies the parameterization method for the categorical variable or variables. Design matrix columns are created from CLASS variables according to the following coding schemes. The default is PARAM=REF. If PARAM=ORTHPOLY or PARAM=POLY, and the CLASS levels are numeric, then the ORDER= option in the CLASS statement is ignored, and the internal, unformatted values are used. See the “CLASS Variable Parameterization” section on page 4482 for further details. EFFECT
specifies effect coding
GLM
specifies less-than-full-rank, reference-cell coding; this option can only be used as a global option
ORDINAL
specifies the cumulative parameterization for an ordinal CLASS variable.
POLYNOMIAL POLY REFERENCE REF ORTHEFFECT
specifies polynomial coding specifies reference cell coding orthogonalizes PARAM=EFFECT
ORTHORDINAL orthogonalizes PARAM=ORDINAL ORTHPOLY
orthogonalizes PARAM=POLYNOMIAL
ORTHREF
orthogonalizes PARAM=REFERENCE
The EFFECT, POLYNOMIAL, REFERENCE, ORDINAL, and their orthogonal parameterizations are full rank parameterization. The REF= option in the CLASS statement determines the reference level for the EFFECT, REFERENCE, and their orthogonal parameterizations. Parameter names for a CLASS predictor variable are constructed by concatenating the CLASS variable name with the CLASS levels. However, for the POLYNOMIAL and orthogonal parameterizations, parameter names are formed by concatenating the CLASS variable name and keywords that reflect the parameterization. REF=’level’ | keyword
specifies the reference level for PARAM=EFFECT or PARAM=REF. For an individual variable, you can specify a specific level of the variable in the REF= option. For a global or individual variable REF= option, you can use one of the following keywords. The default is REF=LAST. FIRST
designates the first ordered level as reference
LAST
designates the last ordered level as reference
CONTRAST Statement
TRUNCATE
specifies that class levels should be determined using no more than the first 16 characters of the formatted values of CLASS variables. This is a global option, not an individual CLASS variable option.
CONTRAST Statement CONTRAST ’label’ row-description <,... row-description>< /options> ; where a row-description is: effect values <,...effect values> The CONTRAST statement provides a mechanism for obtaining customized hypothesis tests. It is similar to the CONTRAST statement in PROC GLM and PROC CATMOD, depending on the coding schemes used with any categorical variables involved. The CONTRAST statement enables you to specify a matrix, L, for testing the hypothesis Lβ = 0. You must be familiar with the details of the model parameterization that PROC TPHREG uses (for more information, see the PARAM= option in the “CLASS Statement” section on page 4477). Optionally, the CONTRAST statement enables you to estimate each row, l0i β, of Lβ and test the hypothesis l0i β = 0. Computed statistics are based on the asymptotic chi-square distribution of the Wald statistic. There is no limit to the number of CONTRAST statements that you can specify, but they must appear after the MODEL statement. The following parameters are specified in the CONTRAST statement: label
identifies the contrast on the output. A label is required for every contrast specified, and it must be enclosed in quotes.
effect
identifies an effect that appears in the MODEL statement. You do not need to include all effects that are included in the MODEL statement.
values
are constants that are elements of the L matrix associated with the effect. To correctly specify your contrast, it is crucial to know the ordering of parameters within each effect and the variable levels associated with any parameter. The “Class Level Information” table shows the ordering of levels within variables. The E option, described later in this section, enables you to verify the proper correspondence of values to parameters.
The rows of L are specified in order and are separated by commas. Multiple degreeof-freedom hypotheses can be tested by specifying multiple row-descriptions. For any of the full-rank parameterizations, if an effect is not specified in the CONTRAST statement, all of its coefficients in the L matrix are set to 0. If too many values are specified for an effect, the extra ones are ignored. If too few values are specified, the remaining ones are set to 0.
4479
4480
Chapter 73. The TPHREG Procedure (Experimental)
When you use effect coding (by specifying PARAM=EFFECT in the CLASS statement), all parameters are directly estimable (involve no other parameters). For example, suppose an effect coded CLASS variable A has four levels. Then there are three parameters (α1 , α2 , α3 ) representing the first three levels, and the fourth parameter is represented by −α1 − α2 − α3 To test the first versus the fourth level of A, you would test α1 = −α1 − α2 − α3 or, equivalently, 2α1 + α2 + α3 = 0 which, in the form Lβ = 0, is
2 1 1
α1 α2 = 0 α3
Therefore, you would use the following CONTRAST statement: contrast ’1 vs. 4’ A 2 1 1;
To contrast the third level with the average of the first two levels, you would test α1 + α2 = α3 2 or, equivalently, α1 + α2 − 2α3 = 0 Therefore, you would use the following CONTRAST statement: contrast ’1&2 vs. 3’ A 1 1 -2;
Other CONTRAST statements are constructed similarly. For example, contrast contrast contrast contrast
’1 vs. 2 ’ ’1&2 vs. 4 ’ ’1&2 vs. 3&4’ ’Main Effect’
A A A A A A
1 -1 3 3 2 2 1 0 0 1 0 0
0; 2; 0; 0, 0, 1;
CONTRAST Statement
When you use the less-than-full-rank parameterization (by specifying PARAM=GLM in the CLASS statement), each row is checked for estimability. If PROC TPHREG finds a contrast to be nonestimable, it displays missing values in corresponding rows in the results. PROC TPHREG handles missing level combinations of categorical variables in the same manner as PROC GLM. Parameters corresponding to missing level combinations are not included in the model. This convention can affect the way in which you specify the L matrix in your CONTRAST statement. If the elements of L are not specified for an effect that contains a specified effect, then the elements of the specified effect are distributed over the levels of the higher-order effect just as the GLM procedure does for its CONTRAST and ESTIMATE statements. For example, suppose that the model contains effects A and B and their interaction A*B. If you specify a CONTRAST statement involving A alone, the L matrix contains nonzero terms for both A and A*B, since A*B contains A. The degrees of freedom is the number of linearly independent constraints implied by the CONTRAST statement, that is, the rank of L. You can specify the following options after a slash (/). ALPHA= p
specifies the level of significance p for the 100(1 − p)% confidence interval for each contrast when the ESTIMATE option is specified. The value p must be between 0 and 1. By default, p is equal to the value of the ALPHA= option in the PROC TPHREG statement, or 0.05 if that option is not specified. E
requests that the L matrix be displayed. ESTIMATE=keyword
requests that each individual contrast (that is, each row, li0 β, of Lβ) or exponentiated 0 contrast (eli β ) be estimated and tested. PROC TPHREG displays the point estimate, its standard error, a Wald confidence interval, and a Wald chi-square test for each contrast. The significance level of the confidence interval is controlled by the ALPHA= 0 option. You can estimate the contrast or the exponentiated contrast (eli β ), or both, by specifying one of the following keywords: PARM
specifies that the contrast itself be estimated
EXP
specifies that the exponentiated contrast be estimated
BOTH
specifies that both the contrast and the exponentiated contrast be estimated
SINGULAR = number
tunes the estimability check. This option is ignored when the full-rank parameterization is used. If v is a vector, define ABS(v) to be the largest absolute value of the elements of v. For a row vector l0 of the contrast matrix L, define c to be equal to ABS(l) if ABS(l) is greater than 0; otherwise, c equals 1. If ABS(l0 − l0 T ) is greater than c ∗ number, then l is declared nonestimable. The T matrix is the Hermite form − matrix I − 0 I 0 , where I 0 represents a generalized inverse of the information matrix I 0
4481
4482
Chapter 73. The TPHREG Procedure (Experimental)
of the null model. The value for number must be between 0 and 1; the default value is 1E−4.
Details CLASS Variable Parameterization Consider a model with one CLASS variable A with four levels, 1, 2, 5, and 7. Details of the possible choices for the PARAM= option follow. EFFECT
Three columns are created to indicate group membership of the nonreference levels. For the reference level, all three dummy variables have a value of −1. For instance, if the reference level is 7 (REF=7), the design matrix columns for A are as follows.
A 1 2 5 7
Effect Coding Design Matrix A1 A2 A5 1 0 0 0 1 0 0 0 1 −1 −1 −1
Parameter estimates of CLASS main effects using the effect coding scheme estimate the difference in the effect of each nonreference level compared to the average effect over all four levels. GLM
As in PROC GLM, four columns are created to indicate group membership. The design matrix columns for A are as follows.
A 1 2 5 7
GLM Coding Design Matrix A1 A2 A5 A7 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Parameter estimates of CLASS main effects using the GLM coding scheme estimate the difference in the effects of each level compared to the last level.
CLASS Variable Parameterization ORDINAL
Three columns are created to indicate group membership of the higher levels of the effect. For the first level of the effect (which for A is 1), all three dummy variables have a value of 0. The design matrix columns for A are as follows. Ordinal Coding Design Matrix A A2 A5 A7 1 0 0 0 2 1 0 0 5 1 1 0 7 1 1 1
The first level of the effect is a control or baseline level. Parameter estimates of CLASS main effects using the ORDINAL coding scheme estimate the effect on the response as the ordinal factor is set to each succeeding level. When the parameters for an ordinal main effect have the same sign, the response effect is monotonic across the levels. POLYNOMIAL POLY
Three columns are created. The first represents the linear term (x), the second represents the quadratic term (x2 ), and the third represents the cubic term (x3 ), where x is the level value. If the CLASS levels are not numeric, they are translated into 1, 2, 3, . . . according to their sorting order. The design matrix columns for A are as follows.
A 1 2 5 7
Polynomial Coding Design Matrix APOLY1 APOLY2 APOLY3 1 1 1 2 4 8 5 25 125 7 49 343
4483
4484
Chapter 73. The TPHREG Procedure (Experimental)
REFERENCE REF
Three columns are created to indicate group membership of the nonreference levels. For the reference level, all three dummy variables have a value of 0. For instance, if the reference level is 7 (REF=7), the design matrix columns for A are as follows. Reference Coding Design Matrix A A1 A2 A5 1 1 0 0 2 0 1 0 5 0 0 1 7 0 0 0
Parameter estimates of CLASS main effects using the reference coding scheme estimate the difference in the effect of each nonreference level compared to the effect of the reference level. ORTHEFFECT The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=EFFECT. The design matrix columns for A are as follows.
A 1 2 5 7
Orthogonal Effect Coding Design Matrix AOEFF1 AOEFF2 AOEFF3 1.41421 −0.81650 −0.57735 0.00000 1.63299 −0.57735 0.00000 0.00000 1.73205 −1.41421 −0.81649 −0.57735
ORTHORDINAL The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=ORDINAL. The design matrix columns for A are as follows.
A 1 2 5 7
Orthogonal Ordinal Coding Design Matrix AOORD1 AOORD2 AOORD3 −1.73205 0.00000 0.00000 0.57735 −1.63299 0.00000 0.57735 0.81650 −1.41421 0.57735 0.81650 1.41421
Miscellaneous Changes from PROC PHREG ORTHPOLY
The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=POLY. The design matrix columns for A are as follows. Orthogonal Polynomial Coding Design Matrix AOPOLY1 AOPOLY2 AOPOLY5 −1.153 0.907 −0.921 −0.734 −0.540 1.473 0.524 −1.370 −0.921 1.363 1.004 0.368
A 1 2 5 7
ORTHREF
The columns are obtained by applying the Gram-Schmidt orthogonalization to the columns for PARAM=REFERENCE. The design matrix columns for A are as follows.
A 1 2 5 7
Orthogonal Reference Coding Design Matrix AOREF1 AOREF2 AOREF3 1.73205 0.00000 0.00000 −0.57735 1.63299 0.00000 −0.57735 −0.81650 1.41421 −0.57735 −0.81650 −1.41421
Miscellaneous Changes from PROC PHREG The default method of computing the survivor function estimate is METHOD=CH which is based on the empirical cumulative hazard function estimate rather than the product-limit estimate (METHOD=PL) as in PROC PHREG. This applies to both the OUTPUT and BASELINE statements. The OUT= data set in the OUTPUT statement contains the entire input data set along with statistics you request using the keyword=name options. Observations in the OUT= data set follow the same order as the input data set. The ORDER=SORTED option in the OUTPUT statement of PROC PHREG (see the “OUTPUT Statement” section (page 3233) in Chapter 54, “The PHREG Procedure,” ) is no longer available. The BASELINE statement in PROC PHREG enables you to to predict the cumulative mean function (CMF) and the cumulative hazard function (CUMHAZ) for recurrent events models. However, such features are not yet available in the BASELINE statement of PROC TPHREG.
4485
4486
Chapter 73. The TPHREG Procedure (Experimental)
Displayed Output If you use the NOPRINT option in the PROC TPHREG statement, the procedure does not display any output. Otherwise, the displayed output of the TPHREG procedure includes the following: • the “Model Information” table, which contains − the two-level name of the input data set − the name and label of the failure-time variable − if you specify the censoring variable, - the name and label of the censoring variable - the values that the censoring variable assumes to indicate censored times − if you use the OFFSET= option in the MODEL statement, the name and label of the offset variable − if you specify the FREQ statement, the name and label of the frequency variable − if you specify the WEIGHT statement, the name and label of the weight variable − the method of handling ties in the failure time • the “Class Level Information” table, which shows the levels and the corresponding design variables for each CLASS explanatory variable • the “Summary of the Number of Event and Censored Values” table, which gives, for each stratum, the breakdown of the number of events and censored values. This table is not produced if the NOSUMMARY option is specified. • if you specify the SIMPLE option in the PROC TPHREG statement, the “Descriptive Statistics for Continuous Explanatory Variables” table for continuous explanatory variables, and the “Frequency Distribution of CLASS Variables” table. The “Descriptive Statistics for Continuous Explanatory Variables” table contains the mean, standard deviation, maximum and minimum of each continuous variable specified in the MODEL statement. If the WEIGHT statement is specified, the “Frequency Distribution of Class Variables” table also contains the weight distributions of the CLASS variables. • if you specify the ITPRINT option in the MODEL statement, the “Iteration History” table, which displays the iteration number, step size, log likelihood, and parameter estimates at each iteration The last evaluation of the gradient vector is also displayed. • the “Model Fit Statistics” table, which gives the values of −2 log likelihood for fitting a model with no explanatory variable and for fitting a model with all the explanatory variables. The AIC and SBC are also given in this table. • the “Testing Global Null Hypothesis: BETA=0” table, which displays results of the likelihood ratio test, the score test, and the Wald test
Displayed Output
• if the model contains an effect involving a CLASS variable, the “Type 3 Tests” table, which gives the Wald chi-square statistic, the degrees of freedom, and the p-value for each effect in the model • the “Analysis of Maximum Likelihood Estimates” table, which contains the following: − the maximum likelihood estimate of the parameter − the estimated standard error of the parameter estimate, computed as the square root of the corresponding diagonal element of the estimated covariance matrix − if you specify the COVS option in the PROC statement, the ratio of the robust standard error estimate to the model-based standard error estimate − the Wald Chi-Square statistic, computed as the square of the parameter estimate divided by its standard error estimate − the degrees of freedom of the Wald chi-square statistic. It has a value of 1 unless the corresponding parameter is redundant or infinite, in which case the value is 0. − the p-value of the Wald chi-square statistic with respect to a chi-square distribution with one degree of freedom − the hazards ratio estimate computed by exponentiating the parameter estimate − if you specified the RISKLIMITS option in the MODEL statement, the confidence limits for the hazards ratio • if you specify SELECTION=SCORE in the MODEL statement, the “Regression Models Selected by Score Criterion” table, which gives the number of explanatory variables in each model, the score chi-square statistic, and the names of the variables included in the model • if you use the FORWARD or STEPWISE selection method and specify the DETAILS option in the MODEL statement, the “Effects to Enter” table, which gives the score chi-square statistic for testing the significance of each candidate effect for entry (after adjusting for the effects already in the model), the degrees of freedom of the score chi-square statistic, and the corresponding p-value. This table is produced before an effect is selected for entry. • if you use the BACKWARD or STEPWISE selection method and specify the DETAILS option in the MODEL statement, the “Effects to Remove” table, which gives the Wald chi-square statistic for testing the significance of each candidate effect for removal, the degrees of freedom of the Wald chi-square, and the corresponding p-value. This table is produced before an effect is selected for removal. • if you use the BACKWARD, FORWARD, or STEPWISE selection method, a table summarizing the model-building process, which gives the step number, the effect entered or removed at each step, the chi-square statistic, and the corresponding p-value on which the selection is based • if you use the COVB option in the MODEL statement, the estimated covariance matrix of the parameter estimates
4487
4488
Chapter 73. The TPHREG Procedure (Experimental) • if you use the CORRB option in the MODEL statement, the estimated correlation matrix of the parameter estimates • if you specify a CONTRAST statement, the “Contrast Test Results” table, which gives the result of the Wald test for each CONTRAST specified. If you specify the E option in the CONTRAST statement, then the contrast matrix is displayed. If you specify the ESTIMATE= option in the CONTRAST statement, the “Contrast Rows Estimation and Testing Results” table is produced, which includes the point estimate and confidence interval for each row of the contrast matrix, and the corresponding Wald test as well. • if you specify a TEST statement, − the “Linear Coefficients” table, which gives the coefficients and constants of the linear hypothesis (if the E option is specified) − the printing of the intermediate calculations of the Wald test (if the option PRINT is specified) − the “Test Results” table, which gives the Wald chi-square statistic, the degrees of freedom, and the p-value − the “Average Effect” table, which gives the weighted average of the parameter estimates for the variables in the TEST statement, the estimated standard error, the z-score, and the p-value (if the AVERAGE option is specified)
ODS Table Names PROC TPHREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. Table 73.1. ODS Tables Produced in PROC TPHREG
ODS Table Name BestSubsets CensoredSummary ClassLevelFreq ClassLevelInfo ClassWgt ContrastCoeff ContrastEstimate ContrastTest ConvergenceStatus CorrB CovB
Description Best subset selection Summary of event and censored observations Frequency breakdown of CLASS variables CLASS variable levels and design variables Weight breakdown of CLASS variables L matrix for contrasts Individual contrast estimates Wald test for contrasts Convergence status Estimated correlation matrix of parameter estimators Estimated covariance matrix of parameter estimators
Statement MODEL MODEL
Option SELECTION=SCORE default
PROC
SIMPLE (with CLASS vars)
MODEL
default (with CLASS vars)
WEIGHT
SIMPLE (with CLASS vars)
CONTRAST CONTRAST CONTRAST MODEL MODEL
E ESTIMATE= default default CORRB
MODEL
COVB
Example 73.1. Analysis of the VA Lung Cancer Data
4489
Table 73.1. (continued)
ODS Table Name EffectsToEnter EffectsToRemove FitStatistics GlobalScore GlobalTests IterHistory LastGradient ModelBuildingSummary ModelInfo NObs ParameterEstimates ResidualChiSq SimpleStatistics TestAverage TestCoeff TestPrint1 TestPrint2 TestStmts Type3
Description Eligible effects for entry to model Eligible effects for removal from model Model fit statistics Global chi-square test Tests of the global null hypothesis Iteration history Last evaluation of gradient Summary of model building Model information Number of observations Maximum likelihood estimates of model parameters Residual chi-square Summary statistics for continuous explanatory variables Average Effect for test coefficients for linear hypotheses L[cov(b)]L’ and Lb-c Ginv(L[cov(b)]L’) and Ginv(L[cov(b)]L’)(Lb-c) Linear Hypotheses Test Results Type 3 tests of effects
Statement MODEL
Option SELECTION=F/S
MODEL
SELECTION=B/S
MODEL MODEL MODEL
default NOFIT default
MODEL MODEL MODEL PROC MODEL
ITPRINT ITPRINT SELECTION=B/F/S default default default
MODEL PROC
SELECTION=F/B SIMPLE
TEST TEST TEST TEST
AVERAGE E PRINT PRINT
TEST MODEL
default default (with CLASS vars)
Example Example 73.1. Analysis of the VA Lung Cancer Data This example uses the Veteran’s Administration lung cancer data presented in Appendix 1 of Kalbfleisch and Prentice (1980). In this trial, males with advanced inoperable lung cancer were randomized to a standard therapy and a test chemotherapy. The primary end point for the therapy comparison was time to death in days, represented by the variable Time. Negative values of Time are censored values. The data include information on a number of explanatory variables: Therapy (type of therapy: standard or test), Cell (type of tumor cell: adeno, large, small, or squamous), Prior (prior therapy: 0=no, 10=yes), Age (age in years), Duration (months from diagnosis to randomization), and Kps (Karnofsky performance scale). A censoring indicator variable Censor is created from the data, with value 1 indicating a censored time and value 0 an event time.
4490
Chapter 73. The TPHREG Procedure (Experimental) proc format; value yesno 0=’no’ 10=’yes’; run; data VALung; drop check m; retain Therapy Cell; infile cards column=column; length Check $ 1; label Time=’time to death in days’ Kps=’Karnofsky performance scale’ Duration=’months from diagnosis to randomization’ Age=’age in years’ Prior=’prior therapy’ Cell=’cell type’ Therapy=’type of treatment’; format Prior yesno.; M=Column; input Check $ @@; if M>Column then M=1; if Check=’s’|Check=’t’ then do; input @M Therapy $ Cell $; delete; end; else do; input @M Time Kps Duration Age Prior @@; censor=(Time<0); Time=abs(Time); end; datalines; standard squamous 72 60 7 69 0 411 70 5 64 10 228 60 3 38 0 126 60 118 70 11 65 10 10 20 5 49 0 82 40 10 69 10 110 80 314 50 18 43 0 -100 70 6 70 0 42 60 4 81 0 8 40 144 30 4 63 0 -25 80 9 52 10 11 70 11 48 10 standard small 30 60 3 61 0 384 60 9 42 0 4 40 2 35 0 54 80 13 60 4 56 0 -123 40 3 55 0 -97 60 5 67 0 153 60 59 30 2 65 0 117 80 3 46 0 16 30 4 53 10 151 50 22 60 4 68 0 56 80 12 43 10 21 40 2 55 10 18 20 139 80 2 64 0 20 30 5 65 0 31 75 3 65 0 52 70 287 60 25 66 10 18 30 4 60 0 51 60 1 67 0 122 80 27 60 8 62 0 54 70 1 67 0 7 50 7 72 0 63 50 392 40 4 68 0 10 40 23 67 10 standard adeno 8 20 19 61 10 92 70 10 60 0 35 40 6 62 0 117 80 132 80 5 50 0 12 50 4 63 10 162 80 5 64 0 3 30 95 80 4 34 0 standard large 177 50 16 66 10 162 80 5 62 0 216 50 15 52 0 553 70 278 60 12 63 0 12 40 12 68 10 260 80 5 45 0 200 80 156 70 2 66 0 -182 90 2 62 0 143 90 8 60 0 105 80 103 80 5 38 0 250 70 8 53 10 100 60 13 37 10 test squamous 999 90 12 54 10 112 80 6 60 0 -87 80 3 48 0 -231 50 242 50 1 70 0 991 70 7 50 10 111 70 3 62 0 1 20 587 60 3 58 0 389 90 2 62 0 33 30 6 64 0 25 20 357 70 13 58 0 467 90 2 64 0 201 80 28 52 10 1 50 30 70 11 63 0 44 60 13 70 10 283 90 2 51 0 15 50
9 63 10 29 68 0 58 63 10
4 14 12 15 2 28 11
63 10 63 10 69 0 42 0 55 0 53 0 48 0
2 38 3 43
0 0
2 47 0 12 41 10 11 66 0
8 21 36 7 13
52 10 65 10 63 0 35 0 40 10
Example 73.1. Analysis of the VA Lung Cancer Data test small 25 30 2 69 0 87 60 2 60 0 24 60 8 49 0 61 70 2 71 0 51 30 87 59 10 test adeno 24 40 2 60 0 51 60 5 62 0 8 50 5 66 0 140 70 3 63 0 45 40 3 69 0 test large 52 60 4 45 0 15 30 5 63 0 111 60 5 64 0 ;
4491
-103 2 99 25 29
70 22 36 10 40 36 44 10 70 3 72 0 70 2 70 0 40 8 67 0
21 20 8 95
20 30 80 70
4 9 2 1
71 0 54 10 68 0 61 0
13 7 99 80
30 2 62 20 11 66 85 4 62 50 17 71
0 0 0 0
18 90 36 186 80
40 5 69 10 60 22 50 10 70 8 61 0 90 3 60 0 40 4 63 0
-83 52 48 84
99 60 10 80
3 3 4 4
57 0 43 0 81 0 62 10
31 73 7 19
80 3 60 3 40 4 50 10
39 70 58 42
0 0 0 0
53 60 12 66 133 75 1 65 49 30 3 37
0 0 0
164 70 15 68 10 43 60 11 49 10 231 70 18 67 10
19 30 4 39 10 340 80 10 64 10 378 80 4 65 0
PROC TPHREG is invoked to fit the Cox proportional hazards model to these data. Variables Prior, Cell, and Therapy, which are categorical variables, are declared in the CLASS statement. By default, PROC TPHREG parameterizes the CLASS variables using the reference coding with the last category as the reference category. However, you can explicitly specify the reference category of your choice. Here, Prior=no is chosen as the reference category for prior therapy, Cell=large is chosen as the reference category for type of tumor cell, and Therapy=standard is chosen as the reference category for the type of therapy. Both the continuous explanatory variables (Kps, Duration, and Age) and the CLASS explanatory variables (Prior, Cell, and Therapy) are specified in the MODEL statement. Knowing how the Cell variable is parameterized, the hazards ratios of all pairs of cell-type groups can be estimated using the ESTIMATE=EXP option in a CONTRAST statement. proc tphreg data=VALung; class Prior(ref=’no’) Cell(ref=’large’) Therapy(ref=’standard’); model Time*censor(1) = Kps Duration Age Prior Cell Therapy; contrast ’Pairwise’ cell 1 0 0, /* adeno vs large */ cell 0 1 0, /* small vs large */ cell 0 0 1, /* squamous vs large */ cell 1 -1 0, /* adeno vs small */ cell 1 0 -1, /* adeno vs squamous */ cell 0 1 -1 /* small vs squamous */ / estimate=exp; run;
The output of PROC TPHREG is very similar to that of PROC PHREG, with additional tables for displaying the parameterization of the CLASS variables, the multiparameter tests for the model effects, and the analysis results of the specified contrasts.
4492
Chapter 73. The TPHREG Procedure (Experimental)
Output 73.1.1. Reference Coding of CLASS Variables The TPHREG Procedure Class Level Information Class
Value
Design Variables
Prior
no yes
0 1
Cell
adeno large small squamous
1 0 0 0
Therapy
standard test
0 1
0 0 1 0
0 0 0 1
Coding of the CLASS variables is displayed in Output 73.1.1. There is one dummy variable for Prior and one for Therapy, since both variables are binary. The dummy variable has a value of 0 for the reference category (Prior=no, Therapy=standard). The CLASS variable Cell has four categories and are represented by three dummy variables. Note that the reference category, Cell=large, has a value of 0 for all three dummy variables. Output 73.1.2. Wald Tests for Individual Model Effects Type 3 Tests
Effect Kps Duration Age Prior Cell Therapy
DF
Wald Chi-Square
Pr > ChiSq
1 1 1 1 3 1
35.1124 0.0001 0.8443 0.0971 17.9164 1.9579
<.0001 0.9920 0.3582 0.7554 0.0005 0.1617
The test results of individual model effects are shown in Output 73.1.2. There is a strong prognostic effect of the Karnofsky performance status on patient survival (p < 0.0001), and the survival times in the various cell-type groups differ significantly (p = 0.0005). However, there is a lack of evidence that the test chemotherapy differs from the standard therapy (p = 0.1617) after accounting for the prognostic effects of other variables.
Example 73.1. Analysis of the VA Lung Cancer Data
Output 73.1.3. Inference about the Regression Parameters Analysis of Maximum Likelihood Estimates
Parameter Kps Duration Age Prior Cell Cell Cell Therapy
DF
Parameter Estimate
Standard Error
Chi-Square
Pr > ChiSq
1 1 1 1 1 1 1 1
-0.03262 -0.0000916 -0.00855 0.07232 0.78867 0.45686 -0.39963 0.28994
0.00551 0.00913 0.00930 0.23213 0.30267 0.26627 0.28266 0.20721
35.1124 0.0001 0.8443 0.0971 6.7899 2.9438 1.9988 1.9579
<.0001 0.9920 0.3582 0.7554 0.0092 0.0862 0.1574 0.1617
yes adeno small squamous test
Analysis of Maximum Likelihood Estimates Hazard Ratio
Parameter Kps Duration Age Prior Cell Cell Cell Therapy
yes adeno small squamous test
0.968 1.000 0.991 1.075 2.200 1.579 0.671 1.336
Variable Label Karnofsky performance scale months from diagnosis to randomization age in years prior therapy yes cell type adeno cell type small cell type squamous type of treatment test
In the Cox proportional hazards model, the effects of the covariates are to act multiplicatively on the hazard of the survival time, and therefore it is a little easier to interpret the corresponding hazards ratios than the regression parameters. For a parameter that corresponds to an continous variable, the hazard ratio is the ratio of hazard rates for a increase of one unit of the variable. From Output 73.1.3, the hazard ratio estimate for Kps is 0.968, meaning that an increase of 10 units in Karnofsky performance scale will shrink the hazard rate by 1 − (0.968)10 =28%. For a CLASS variable parameter, the hazard ratio is the ratio of the hazard rates between the given category and the reference category. The hazard rate of Cell=adeno is 220% that of Cell=large, the hazard rate of Cell=small is 158% that of Cell=large, and the hazard rate of Cell=squamous is only 67% that of Cell=large. Output 73.1.4. Overall Test for All Paired Cell-type Groups Contrast Test Results
Contrast
DF
Wald Chi-Square
Pr > ChiSq
Pairwise
3
17.9164
0.0005
Although there are six pairwise comparisons for the four types of tumor cells in the CONTRAST statement, the overall test has only 3 degrees of freedom (Output
4493
4494
Chapter 73. The TPHREG Procedure (Experimental)
73.1.4). In fact this is the very same testing of no prognostics effect between the cell-type groups as shown in Output 73.1.2. Output 73.1.5. Hazards Ratios for All Paired Cell-type Groups Contrast Rows Estimation and Testing Results
Contrast
Type
Pairwise Pairwise Pairwise Pairwise Pairwise Pairwise
EXP EXP EXP EXP EXP EXP
Row
Estimate
Standard Error
Alpha
1 2 3 4 5 6
2.2005 1.5791 0.6706 1.3935 3.2815 2.3549
0.6660 0.4205 0.1895 0.3840 0.9870 0.6480
0.05 0.05 0.05 0.05 0.05 0.05
Confidence Limits 1.2159 0.9370 0.3853 0.8119 1.8200 1.3732
3.9824 2.6611 1.1669 2.3916 5.9167 4.0384
Contrast Rows Estimation and Testing Results
Contrast
Type
Pairwise Pairwise Pairwise Pairwise Pairwise Pairwise
EXP EXP EXP EXP EXP EXP
Row
Wald Chi-Square
Pr > ChiSq
1 2 3 4 5 6
6.7899 2.9438 1.9988 1.4497 15.6101 9.6866
0.0092 0.0862 0.1574 0.2286 <.0001 0.0019
Output 73.1.5 is generated by the ESTIMATE=EXP option in the CONTRAST statement. Values of the Estimate column are the estimated hazard ratios: 2.200 for ‘adeno’ vs ‘large’, 1.579 for ‘small’ versus ‘large’, 0.671 for ‘squamous’ versus ‘large’, 1.394 for ‘adeno’ versus ’small’, 3.282 for ‘adeno’ versus ‘squamous’, and 2.355 for ‘small’ versus ‘squamous’. Note that the first three hazard ratio estimates are already given in the parameter estimate table (Output 73.1.3), and therefore there is no need to specify those first three rows in the CONTRAST statement.
References Kalbfleisch, J. D. and Prentice, R. L. (1980), The Statistical Analysis of Failure Time Data, New York: John Wiley & Sons, Inc.
Chapter 74
The TPSPLINE Procedure Chapter Contents OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4497 The Penalized Least Squares Estimate . . . . . . . . . . . . . . . . . . . . 4497 PROC TPSPLINE with Large Data Sets . . . . . . . . . . . . . . . . . . . 4499 GETTING STARTED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4499 SYNTAX . . . . . . . . . . . . PROC TPSPLINE Statement BY Statement . . . . . . . . FREQ Statement . . . . . . . ID Statement . . . . . . . . . MODEL Statement . . . . . SCORE Statement . . . . . . OUTPUT Statement . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. 4506 . 4507 . 4507 . 4507 . 4508 . 4508 . 4510 . 4510
DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4511 Computational Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . 4511 ODS Tables Produced by PROC TPSPLINE . . . . . . . . . . . . . . . . . 4515 EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . Example 74.1. Partial Spline Model Fit . . . . . . . . . . . Example 74.2. Spline Model With Higher-Order Penalty . . Example 74.3. Multiple Minima of the GCV Function . . . Example 74.4. Large Data Set Application . . . . . . . . . Example 74.5. Computing a Bootstrap Confidence Interval
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. 4515 . 4515 . 4518 . 4522 . 4527 . 4531
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4540
4496
Chapter 74. The TPSPLINE Procedure
Chapter 74
The TPSPLINE Procedure Overview The TPSPLINE procedure uses the penalized least squares method to fit a nonparametric regression model. It computes thin-plate smoothing splines to approximate smooth multivariate functions observed with noise. The TPSPLINE procedure allows great flexibility in the possible form of the regression surface. In particular, PROC TPSPLINE makes no assumptions of a parametric form for the model. The generalized cross validation (GCV) function may be used to select the amount of smoothing. The TPSPLINE procedure complements the methods provided by the standard SAS regression procedures such as the GLM, REG and NLIN procedures. These procedures can handle most situations in which you specify the regression model and the model is known up to a fixed number of parameters. However, when you have no prior knowledge about the model, or when you know that the data cannot be represented by a model with a fixed number of parameters, you can use the TPSPLINE procedure to model the data. The TPSPLINE procedure uses the penalized least squares method to fit the data with a flexible model in which the number of effective parameters can be as large as the number of unique design points. Hence, as the sample size increases, the model space increases as well, enabling the thin-plate smoothing spline to fit more complicated situations. The main features of the TPSPLINE procedure are as follows: • provides penalized least squares estimates • supports the use of multidimensional data • supports multiple SCORE statements • fits both semiparametric models and nonparametric models • provides options for handling large data sets • supports multiple dependent variables • enables you to choose a particular model by specifying the model degrees of freedom or smoothing parameter
The Penalized Least Squares Estimate Penalized least squares estimates provide a way to balance fitting the data closely and avoiding excessive roughness or rapid variation. A penalized least squares estimate
4498
Chapter 74. The TPSPLINE Procedure
is a surface that minimizes the penalized least squares over the class of all surfaces satisfying sufficient regularity conditions. Define xi as a d-dimensional covariate vector, zi as a p-dimensional covariate vector, and yi as the observation associated with (xi , zi ). Assuming that the relation between zi and yi is linear but the relation between xi and yi is unknown, you can fit the data using a semiparametric model as follows: yi = f (xi ) + zi β + i where f is an unknown function that is assumed to be reasonably smooth, i , i = 1, · · · , n are independent, zero-mean random errors, and β is a p-dimensional unknown parametric vector. This model consists of two parts. The zi β is the parametric part of the model, and the zi are the regression variables. The f (xi ) is the nonparametric part of the model, and the xi are the smoothing variables. The ordinary least squares method estimates f (xi ) and β by minimizing the quantity: n
1X (yi − f (xi ) − zi β)2 n i=1
However, the functional space of f (x) is so large that you can always find a function f that interpolates the data points. In order to obtain an estimate that fits the data well and has some degree of smoothness, you can use the penalized least squares method. The penalized least squares function is defined as n
Sλ (f ) =
1X (yi − f (xi ) − zi β)2 + λJ2 (f ) n i=1
where J2 (f ) is the penalty on the roughness of f and is defined, in most cases, as the integral of the square of the second derivative of f . The first term measures the goodness of fit and the second term measures the smoothness associated with f . The λ term is the smoothing parameter, which governs the tradeoff between smoothness and goodness of fit. When λ is large, it more heavily penalizes rougher fits. Conversely, a small value of λ puts more emphasis on the goodness of fit. The estimate fλ is selected from a reproducing kernel Hilbert space, and it can be represented as a linear combination of a sequence of basis functions. Hence, the final estimates of f can be written as
fλ (xi ) = θ0 +
d X j=1
θj xij +
n X j=1
δj Bj (xi )
Getting Started
where Bj is the basis function, which depends on where the data xj is located, and θj and δj are the coefficients that need to be estimated. For a fixed λ, the coefficients (θ, δ, β) can be estimated by solving an n × n system. The smoothing parameter can be chosen by minimizing the generalized cross validation (GCV) function. If you write ˆ = A(λ)y y then A(λ) is referred to as the hat or smoothing matrix, and the GCV function V (λ) is defined as V (λ) =
(1/n)||(I − A(λ))y||2 [(1/n)tr(I − A(λ))]2
PROC TPSPLINE with Large Data Sets The calculation of the penalized least squares estimate is computationally intensive. The amount of memory and CPU time needed for the analysis depend on the number of unique design points, which corresponds to the number of unknown parameters to be estimated. You can specify the D= value option in the MODEL statement to reduce the number of unknown parameters. The option groups design points by the specified range (see the D= option on page 4509). PROC TPSPLINE selects one design point from the group and treats all observations in the group as replicates of that design point. Calculation of the thin-plate smoothing spline estimates are based on the reprocessed data. The way to choose the design point from a group depends on the order of the data. Therefore, different orders of input data may result in different estimates. This option, by combining several design points into one, reduces the number of unique design points, thereby approximating the original data. The D= value you specify determines the width of the range used to group the data.
Getting Started The following example demonstrates how you can use the TPSPLINE procedure to fit a semiparametric model. Suppose that y is a continuous variable and x1 and x2 are two explanatory variables of interest. To fit a smoothing spline model, you can use a MODEL statement similar to that used in many regression procedures in the SAS System. proc tpspline; model y = (x1 x2); run;
4499
4500
Chapter 74. The TPSPLINE Procedure
The TPSPLINE procedure can fit semiparametric models; the parentheses in the preceding MODEL statement separates the smoothing variables from the regression variables. The following statements illustrates this syntax. proc tpspline; model y = x3 (x1 x2); run;
This model assumes a linear relation with x3 and an unknown functional relation with x1 and x2. If you want to fit several responses using the same explanatory variables, you can save computation time by using the multiple responses feature in the MODEL statement. For example, if y1 and y2 are two response variables, the following MODEL statement can be used to fit two models. Separate analyses are then performed for each response variable. proc tpspline; model y1 y2 = (x1 x2); run;
The following example illustrates the use of PROC TPSPLINE. The data are from Bates, Lindstrom, Wahba, and Yandell (1987). data Measure; input x1 x2 datalines; -1.0 -1.0 -.5 -1.0 .0 -1.0 .5 -1.0 1.0 -1.0 -1.0 -.5 -.5 -.5 .0 -.5 .5 -.5 1.0 -.5 -1.0 .0 -.5 .0 .0 .0 .5 .0 1.0 .0 -1.0 .5 -.5 .5 .0 .5 .5 .5 1.0 .5 -1.0 1.0 -.5 1.0 .0 1.0 .5 1.0 1.0 1.0 ;
y @@; 15.54483570 18.67397826 19.66086310 18.59838649 15.86842815 10.92383867 14.81392847 16.56449698 14.90792284 10.91956264 9.61492010 14.03133439 15.77400253 13.99627680 9.55700164 11.20625177 14.83723493 16.55494349 14.98448603 11.14575565 15.82595514 18.64014953 19.54375504 18.56884576 15.86586951
-1.0 -.5 .0 .5 1.0 -1.0 -.5 .0 .5 1.0 -1.0 -.5 .0 .5 1.0 -1.0 -.5 .0 .5 1.0 -1.0 -.5 .0 .5 1.0
-1.0 -1.0 -1.0 -1.0 -1.0 -.5 -.5 -.5 -.5 -.5 .0 .0 .0 .0 .0 .5 .5 .5 .5 .5 1.0 1.0 1.0 1.0 1.0
15.76312613 18.49722167 19.80231311 18.51904737 16.03913832 11.14066546 14.82830425 16.44307297 15.05653924 10.94227538 9.64648093 14.03122345 16.00412514 14.02826553 9.58467047 11.08651907 14.99369172 16.51294369 14.71816070 11.17168689 15.96022497 18.56095997 19.80902641 18.61010439 15.90136745
Getting Started
The data set Measure contains three variables x1, x2, and y. Suppose that you want to fit a surface by using the variables x1 and x2 to model the response y. The variables x1 and x2 are spaced evenly on a [−1 × 1] × [−1 × 1] square, and the response y is generated by adding a random error to a function f (x1, x2). The raw data are plotted using the G3D procedure. In order to plot those replicates, the data are jittered a little bit. data Measure1; set Measure; run; proc sort data=Measure1; by x2 x1; run; data measure1; set measure1; by x1; if last.x1 then x1=x1+0.00001; run; proc g3d data=Measure1; scatter x2*x1=y /size=.5 zmin=9 zmax=21 zticknum=4; title "Raw Data"; run;
Figure 74.1 displays the raw data.
Figure 74.1. Plot of Data Set MEASURE
4501
4502
Chapter 74. The TPSPLINE Procedure
The following statements invoke the TPSPLINE procedure, using the Measure data set as input. In the MODEL statement, the x1 and x2 variables are listed as smoothing variables. The LOGNLAMBDA= option returns a list of GCV values with log10 (nλ) ranging from −4 to −2. The OUTPUT statement creates the data set estimate to contain the predicted values and the 95% upper and lower confidence limits. proc tpspline data=Measure; model y=(x1 x2) /lognlambda=(-4 to -2 by 0.1); output out=estimate pred uclm lclm; run; proc print data=estimate; run;
The results of this analysis are displayed in the following figures. Figure 74.2 shows that the data set Measure contains 50 observations with 25 unique design points. The GCV values are listed along with the log10 of nλ. The value of log10 (nλ) that minimizes the GCV function is around −3.5. The final thin-plate smoothing spline estimate is based on LOGNLAMBDA = −3.4762. The residual sum of squares is 0.246110, and the degrees of freedom is 24.593203. The standard deviation, defined as RSS/(Tr(I-A)), is 0.098421. The predictions and 95% confidence limits are displayed in Figure 74.3. The TPSPLINE Procedure Dependent Variable: y Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
50 0 25
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
Figure 74.2. Output from PROC TPSPLINE
0 2 2 3
Getting Started
The TPSPLINE Procedure Dependent Variable: y GCV Function log10(n*Lambda) -4.000000 -3.900000 -3.800000 -3.700000 -3.600000 -3.500000 -3.400000 -3.300000 -3.200000 -3.100000 -3.000000 -2.900000 -2.800000 -2.700000 -2.600000 -2.500000 -2.400000 -2.300000 -2.200000 -2.100000 -2.000000
GCV 0.019215 0.019183 0.019148 0.019113 0.019082 0.019064* 0.019074 0.019135 0.019286 0.019584 0.020117 0.021015 0.022462 0.024718 0.028132 0.033165 0.040411 0.050614 0.064699 0.083813 0.109387
Note: * indicates minimum GCV value.
Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
-3.4762 2558.1432 0.2461 25.4068 24.5932 0.0984
4503
4504
Chapter 74. The TPSPLINE Procedure
Estimates from Proc TPSPLINE Obs
x1
x2
y
P_y
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
-1.0 -1.0 -0.5 -0.5 0.0 0.0 0.5 0.5 1.0 1.0 -1.0 -1.0 -0.5 -0.5 0.0 0.0 0.5 0.5 1.0 1.0 -1.0 -1.0 -0.5 -0.5 0.0 0.0 0.5 0.5 1.0 1.0 -1.0 -1.0 -0.5 -0.5 0.0 0.0 0.5 0.5 1.0 1.0 -1.0 -1.0 -0.5 -0.5 0.0 0.0 0.5 0.5 1.0 1.0
-1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
15.5448 15.7631 18.6740 18.4972 19.6609 19.8023 18.5984 18.5190 15.8684 16.0391 10.9238 11.1407 14.8139 14.8283 16.5645 16.4431 14.9079 15.0565 10.9196 10.9423 9.6149 9.6465 14.0313 14.0312 15.7740 16.0041 13.9963 14.0283 9.5570 9.5847 11.2063 11.0865 14.8372 14.9937 16.5549 16.5129 14.9845 14.7182 11.1458 11.1717 15.8260 15.9602 18.6401 18.5610 19.5438 19.8090 18.5688 18.6101 15.8659 15.9014
15.6474 15.6474 18.5783 18.5783 19.7270 19.7270 18.5552 18.5552 15.9436 15.9436 11.0467 11.0467 14.8246 14.8246 16.5102 16.5102 14.9812 14.9812 10.9497 10.9497 9.6372 9.6372 14.0188 14.0188 15.8822 15.8822 14.0006 14.0006 9.5769 9.5769 11.1614 11.1614 14.9182 14.9182 16.5386 16.5386 14.8549 14.8549 11.1727 11.1727 15.8851 15.8851 18.5946 18.5946 19.6729 19.6729 18.5832 18.5832 15.8761 15.8761
LCLM_y
UCLM_y
15.5115 15.5115 18.4430 18.4430 19.5917 19.5917 18.4199 18.4199 15.8077 15.8077 10.9114 10.9114 14.6896 14.6896 16.3752 16.3752 14.8461 14.8461 10.8144 10.8144 9.5019 9.5019 13.8838 13.8838 15.7472 15.7472 13.8656 13.8656 9.4417 9.4417 11.0261 11.0261 14.7831 14.7831 16.4036 16.4036 14.7199 14.7199 11.0374 11.0374 15.7493 15.7493 18.4593 18.4593 19.5376 19.5376 18.4478 18.4478 15.7402 15.7402
15.7832 15.7832 18.7136 18.7136 19.8622 19.8622 18.6905 18.6905 16.0794 16.0794 11.1820 11.1820 14.9597 14.9597 16.6452 16.6452 15.1162 15.1162 11.0850 11.0850 9.7724 9.7724 14.1538 14.1538 16.0171 16.0171 14.1356 14.1356 9.7122 9.7122 11.2967 11.2967 15.0532 15.0532 16.6736 16.6736 14.9900 14.9900 11.3080 11.3080 16.0210 16.0210 18.7299 18.7299 19.8081 19.8081 18.7185 18.7185 16.0120 16.0120
Figure 74.3. Data Set ESTIMATE
The fitted surface is plotted with PROC G3D as follows. proc g3d data=estimate; plot x2*x1=p_y/grid zmin=9 zmax=21 zticknum=4; title ’Plot of Fitted Surface’; run;
Getting Started
The resulting plot is displayed in Figure 74.4.
Figure 74.4. Plot of TPSPLINE Fit of Data Set Measure
Because the data in data set Measure are very sparse, the fitted surface is not smooth. To produce a smoother surface, the following statements generate the data set Pred in order to obtain a finer grid. The SCORE statement evaluates the fitted surface at those new design points. data pred; do x1=-1 to 1 by 0.1; do x2=-1 to 1 by 0.1; output; end; end; run; proc tpspline data=measure; model y=(x1 x2)/lognlambda=(-4 to -2 by 0.1); score data=pred out=predy; run; proc g3d data=predy; plot x2*x1=p_y/grid zmin=9 zmax=21 zticknum=4; title ’Plot of Fitted Surface on a Fine Grid’; run;
The surface plot based on the finer grid is displayed in Figure 74.5. The plot shows
4505
4506
Chapter 74. The TPSPLINE Procedure
that a parametric model with quadratic terms of x1 and x2 provides a reasonable fit to the data.
Figure 74.5. Plot of TPSPLINE fit
Syntax PROC TPSPLINE < option > ; MODEL dependents = < variables > (variables) < /options > ; SCORE data=SAS-data-set out=SAS-data-set ; OUTPUT < out=SAS-data-set > keyword < · · · keyword > ; BY variables ; FREQ variable ; ID variables ; The syntax in PROC TPSPLINE is similar to that of other regression procedures in the SAS System. The PROC TPSPLINE and MODEL statements are required. The SCORE statement can appear multiple times; all other statements appear only once. The syntax for PROC TPSPLINE is described in the following sections in alphabetical order after the description of the PROC TPSPLINE statement.
FREQ Statement
PROC TPSPLINE Statement PROC TPSPLINE < option > ; The PROC TPSPLINE statement invokes the procedure. You can specify the following option. DATA=SAS-data-set
specifies the SAS data set to be read by PROC TPSPLINE. The default value is the most recently created data set.
BY Statement BY variables ; You can specify a BY statement with PROC TPSPLINE to obtain separate analysis on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives: • Sort the data using the SORT procedure with a similar BY statement. • Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the TPSPLINE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order. • Create an index on the BY variables using the DATASETS procedure. For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
FREQ Statement FREQ variable ; If one variable in your input data set represents the frequency of occurrence for other values in the observation, specify the variable’s name in a FREQ statement. PROC TPSPLINE treats the data as if each observation appears n times, where n is the value of the FREQ variable for the observation. If the value of the FREQ variable is less than one, the observation is not used in the analysis. Only the integer portion of the value is used.
4507
4508
Chapter 74. The TPSPLINE Procedure
ID Statement ID variables ; The variables in the ID statement are copied from the input data set to the OUT= data set. If you omit the ID statement, only the variables used in the MODEL statement and requested statistics are included in the output data set.
MODEL Statement MODEL dependents = < regression variables > (smoothing variables) < /options > ; The MODEL statement specifies the dependent variables, the independent regression variables, which are listed with no parentheses, and the independent smoothing variables, which are listed inside parentheses. The regression variables are optional. At least one smoothing variable is required, and it must be listed after the regression variables. No variables can be listed in both the regression variable list and the smoothing variable list. If you specify more than one dependent variable, PROC TPSPLINE calculates a thinplate smoothing spline estimate for each dependent variable, using the regression variables and smoothing variables specified on the right-hand side. If you specify regression variables, PROC TPSPLINE fits a semiparametric model using the regression variables as the linear part of the model. You can specify the following options in the MODEL statement. ALPHA=number
specifies the significance level α of the confidence limits on the final thin-plate smoothing spline estimate when you request confidence limits to be included in the output data set. Specify number as a value between 0 and 1. The default value is 0.05. See the “OUTPUT Statement” section on page 4510 for more information on the OUTPUT statement. DF=number
specifies the degrees of freedom of the thin-plate smoothing spline estimate, defined as df = trace(A(λ)) where A(λ) is the hat matrix. Specify number as a value between zero and the number of unique design points.
MODEL Statement
DISTANCE=number D=number
defines a range such that if two data points (xi , zi ) and (xj , zj ) satisfy maxk |xik − xjk | ≤ D/2 then these data points are treated as replicates, where xi are the smoothing variables and zi are the regression variables. You can use the DISTANCE= option to reduce the number of unique design points by treating nearby data as replicates. This can be useful when you have a large data set. The default value is 0. LAMBDA0=number
specifies the smoothing parameter, λ0 , to be used in the thin-plate smoothing spline estimate. By default, PROC TPSPLINE uses the λ parameter that minimizes the GCV function for the final fit. The LAMBDA0= value must be positive. LAMBDA=list-of-values
specifies a set of values for the λ parameter. PROC TPSPLINE returns a GCV value for each λ point that you specify. You can use the LAMBDA= option to study the GCV function curve for a set of values for λ. All values listed in the LAMBDA= option must be positive. LOGNLAMBDA0=number LOGNL0=number
specifies the smoothing parameter λ0 on the log10(nλ) scale. If you specify both the LOGNL0= and LAMBDA0= options, only the value provided by the LOGNL0= option is used. By default, PROC TPSPLINE uses the λ parameter that minimizes the GCV function for the estimate. LOGNLAMBDA=list-of-values LOGNL=list-of-values
specifies a set of values for the λ parameter on the log10(nλ) scale. PROC TPSPLINE returns a GCV value for each λ point that you specify. You can use the LOGNLAMBDA= option to study the GCV function curve for a set of λ values. If you specify both the LOGNL= and LAMBDA= options, only the list of values provided by LOGNL= option is used. In some cases, the LOGNL= option may be prefered over the LAMBDA= option. Because the LAMBDA= value must be positive, a small change in that value can result in a major change in the GCV value. If you instead specify λ on the log10 scale, the allowable range is enlarged to include negative values. Thus, the GCV function is less sensitive to changes in LOGNLAMBDA. M=number
specifies the order of the derivative in the penalty term. The M= value must be a positive integer. The default value is the max(2, IN T (d/2) + 1), where d is the number of smoothing variables.
4509
4510
Chapter 74. The TPSPLINE Procedure
SCORE Statement SCORE DATA=SAS-data-set OUT=SAS-data-set ; The SCORE statement calculates predicted values for a new data set. If you have multiple data sets to predict, you can specify multiple SCORE statements. You must use a SCORE statement for each data set. The following keywords must be specified in the SCORE statement. DATA=SAS-data-set
specifies the input SAS data set containing the smoothing variables x and regression variables z. The predicted response (y) value is computed for each (x, z) pair. The data set must include all independent variables specified in the MODEL statement. OUT=SAS-data-set
specifies the name of the SAS data set to contain the predictions.
OUTPUT Statement OUTPUT OUT=SAS-data-set < keyword · · · keyword > ; The OUTPUT statement creates a new SAS data set containing diagnostic measures calculated after fitting the model. You can request a variety of diagnostic measures that are calculated for each observation in the data set. The new data set contains the variables specified in the MODEL statement in addition to the requested variables. If no keyword is present, the data set contains only the predicted values. Details on the specifications in the OUTPUT statement are as follows. OUT=SAS-data-set
specifies the name of the new data set to contain the diagnostic measures. This specification is required. keyword
specifies the statistics to include in the output data set. The names of the new variables that contain the statistics are formed by using a prefix of one or more characters that identify the statistic, followed by an underscore (– ), followed by the dependent variable name. For example, suppose that you have two dependent variables, say y1 and y2, and you specify the keywords PRED, ADIAG, and UCLM. The output SAS data set will contain the following variables:
Computational Formulas
4511
• P– y1 and P– y2 • ADIAG– y1 and ADIAG– y2 • UCLM– y1 and UCLM– y2 The keywords and the statistics they represent are as follows: RESID | R
residual values, calculated as ACTUAL - PREDICTED
PRED
predicted values
STD
standard error of the mean predicted value
UCLM
upper limit of the confidence interval for the expected value of the dependent variables. By default, PROC TPSPLINE computes 95% confidence limits.
LCLM
lower limit of the confidence interval for the expected value of the dependent variables. By default, PROC TPSPLINE computes 95% confidence limits.
ADIAG
diagonal element of the hat matrix associated with the observation
COEF
coefficients arranged in the order of (θ0 , θ1 , · · · , θd , δ1 , · · · δnUnique ) where nUnique is the number of unique data points. This option can only be used when there is only one dependent variable in the model.
Details Computational Formulas The theoretical foundations for the thin-plate smoothing spline are described in Duchon (1976, 1977) and Meinguet (1979). Further results and applications are given in Wahba and Wendelberger (1980), Hutchinson and Bischof (1983), and Seaman and Hutchinson (1985). Suppose that Hm is a space of functions whose partial derivatives of total order m are in L2 (E d ) where E d is the domain of x. Now, consider the data model yi = f (x1 (i), . . . , xd (i)) + i , i = 1, . . . , n where f ∈ Hm . Using the notation from the section “The Penalized Least Squares Estimate” on page 4497, for a fixed λ, estimate f by minimizing the penalized least squares function n
1X (yi − f (xi ) − zi β)2 + λJm (f ) n i=1
4512
Chapter 74. The TPSPLINE Procedure
There are several ways to define Jm (f ). For the thin-plate smoothing spline, with x of dimension d, define Jm (f ) as Z
∞
Z
−∞
where
P
i αi
∞
X
···
Jm (f ) =
−∞
h m! α1 ! · · · αd !
∂mf ∂x1 α1 ···∂xd αd
i2
dx1 · · · dxd
= m.
When d = 2 and m = 2, Jm (f ) is as follows: Z
∞
Z
∞
J2 (f ) = −∞
h (
−∞
i2
∂2f ∂x1 2
+2
h
∂2f ∂x1 ∂x2
i2
+
h
∂2f ∂x2 2
i2
)dx1 dx2
In general, m and d must satisfy the condition that 2m − d > 0. For the sake of simplicity, the formulas and equations that follow assume m = 2. Refer to Wahba (1990) and Bates et al. (1987) for more details. Duchon (1976) showed that fλ can be represented as
fλ (xi ) = θ0 +
d X
θj xij +
j=1
where E2 (s) =
1 ||s||2 23 π
n X
δj E2 (xi − xj )
j=1
ln(||s||).
If you define K = (K)ij = E2 (xi − xj ) and T = (T)ij = (xij ), the goal is to find coefficients β, θ, and δ that minimize Sλ (β, θ, δ) =
1 ||y − Tθ − Kδ − Zβ||2 + λδ T Kδ n
A unique solution is guaranteed if the matrix T is of full rank and δ T Kδ ≥ 0. θ If α = and X = (T : Z), the expression for Sλ becomes β 1 ||y − Xα − Kδ||2 + λδ T Kδ n The coefficients α and δ can be obtained by solving (K + nλIn )δ + Xα = y XT δ = 0 To compute α and δ, let the QR decomposition of X be X = (Q1 : Q2 )
R 0
Computational Formulas
where (Q1 : Q2 ) is an orthogonal matrix and R is upper triangular, with XT Q2 = 0 (Dongarra et al. 1979). Since XT δ = 0, δ must be in the column space of Q2 . Therefore, δ can be expressed as δ = Q2 γ for a vector γ. Substituting δ = Q2 γ into the preceding equation and multiplying through by QT 2 gives T QT 2 (K + nλIn )Q2 γ = Q2 y
or −1 T δ = Q2 γ = Q2 [QT Q2 y 2 (K + nλIn )Q2 ]
The coefficient α can be obtained by solving Rα = QT 1 [y − (K + nλIn )δ] The influence matrix A(λ) is defined as ˆ = A(λ)y y and has the form −1 T A(λ) = I − nλQ2 [QT Q2 2 (K + nλIn )Q2 ]
Similar to the regression case, and if you consider the trace of A(λ) as the degrees of freedom for the information signal and the trace of (In − A(λ)) as the degrees of freedom for the noise component, the estimate σ 2 can be represented as σ ˆ2 =
RSS(λ) T race(In − A(λ))
where RSS(λ) is the residual sum of squares. Theoretical properties of these estimates have not yet been published. However, good numerical results in simulation studies have been described by several authors. For more information, refer to O’Sullivan and Wong (1987), Nychka (1986a, 1986b, and 1988), and Hall and Titterington (1987).
Confidence Intervals Viewing the spline model as a Bayesian model, Wahba (1983) proposed Bayesian confidence intervals for smoothing spline estimates as follows: fˆλ (xi ) ± zα/2
p
σ ˆ 2 aii (λ)
4513
4514
Chapter 74. The TPSPLINE Procedure
where aii (λ) is the ith diagonal element of the A(λ) matrix and zα/2 is the α/2 point of the normal distribution. The confidence intervals are interpreted as intervals “across the function” as opposed to point-wise intervals. Suppose that you fit a spline estimate to experimental data that consists of a true function f and a random error term, i . In repeated experiments, it is likely that about 100(1 − α)% of the confidence intervals cover the corresponding true values, although some values are covered every time and other values are not covered by the confidence intervals most of the time. This effect is more pronounced when the true surface or surface has small regions of particularly rapid change.
Smoothing Parameter The quantity λ is called the smoothing parameter, which controls the balance between the goodness of fit and the smoothness of the final estimate. A large λ heavily penalizes the mth derivative of the function, thus forcing f (m) close to 0. A small λ places less of a penalty on rapid change in f (m) (x), resulting in an estimate that tends to interpolate the data points. The smoothing parameter greatly affects the analysis, and it should be selected with care. One method is to perform several analyses with different values for λ and compare the resulting final estimates. A more objective way to select the smoothing parameter λ is to use the “leave-outone” cross validation function, which is an approximation of the predicted mean squares error. A generalized version of the leave-out-one cross validation function is proposed by Wahba (1990) and is easy to calculate. This Generalized Cross Validation (GCV) function (V (λ)) is defined as
V (λ) =
(1/n)||(I − A(λ))y||2 [(1/n)tr(I − A(λ))]2
The justification for using the GCV function to select λ relies on asymptotic theory. Thus, you cannot expect good results for very small sample sizes or when there is not enough information in the data to separate the information signal from the noise component. Simulation studies suggest that for independent and identically distributed Gaussian noise, you can obtain reliable estimates of λ for n greater than 25 or 30. Note that, even for large values of n (say n ≥ 50), in extreme Monte Carlo simulations there may be a small percentage of unwarranted extreme estimates in which ˆ = 0 or λ ˆ = ∞ (Wahba 1983). Generally, if σ 2 is known to within an order of λ magnitude, the occasional extreme case can be readily identified. As n gets larger, the effect becomes weaker. The GCV function is fairly robust against nonhomogeneity of variances and nonGaussian errors (Villalobos and Wahba 1987). Andrews (1988) has provided favorable theoretical results when variances are unequal. However, this selection method is likely to give unsatisfactory results when the errors are highly correlated. The GCV value may be suspect when λ is extremely small because computed values may become indistinguishable from zero. In practice, calculations with λ = 0
Example 74.1. Partial Spline Model Fit
or λ near 0 can cause numerical instabilities resulting in an unsatisfactory solution. Simulation studies have shown that a λ with log10 (nλ) > −8 is small enough that the final estimate based on this λ almost interpolates the data points. A GCV value based on a λ ≤ 10−8 may not be accurate.
ODS Tables Produced by PROC TPSPLINE PROC TPSPLINE assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, “Using the Output Delivery System.” Table 74.1. ODS Tables Produced by PROC TPSPLINE
ODS Table Name DataSummary FitSummary FitStatistics GCVFunction
Description Data summary Fit parameters and fit summary Model fit statistics GCV table
Statement PROC PROC
Option default default
PROC MODEL
default LOGNLAMBDA, LAMBDA
By referring to the names of such tables, you can use the ODS OUTPUT statement to place one or more of these tables in output data sets. For example, the following statements create an output data set named FitStats containing the FitStatistics table, an output data set named DataInfo containing the DataSummary table, an output data set named ModelInfo containing the FitSummary and an output data set named GCVFunc containing the GCVFunction. proc tpspline data=Melanoma; model Incidences=Year /LOGNLAMBDA=(-4 to 0 by 0.2); ods output FitStatistics = FitStats DataSummary = DataInfo FitSummary = ModelInfo GCVFunction = GCVFunc; run;
Examples Example 74.1. Partial Spline Model Fit The following example analyzes the data set Measure that was introduced in the “Getting Started” section on page 4499. That analysis determined that the final estimated surface can be represented by a quadratic function for one or both of the independent variables. This example illustrates how you can use PROC TPSPLINE to fit a partial spline model. The data set Measure is fit using the following model: f (x1, x2) = 1 + x1 + x21 + h(x2 )
4515
4516
Chapter 74. The TPSPLINE Procedure
The model has a parametric component (associated with the x1 variable) and a nonparametric component (associated with the x2 variable). The following statements fit a partial spline model. data Measure; set Measure; x1sq = x1*x1; run; data pred; do x1=-1 to 1 by 0.1; do x2=-1 to 1 by 0.1; x1sq = x1*x1; output; end; end; run; proc tpspline data= measure; model y = x1 x1sq (x2); score data = pred out = predy; run;
Output 74.1.1 displays the results from these statements. Output 74.1.1. Output from PROC TPSPLINE The TPSPLINE Procedure Dependent Variable: y Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
50 0 5
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
-2.2374 205.3461 8.5821 43.1534 6.8466 0.4460
2 1 2 4
Example 74.1. Partial Spline Model Fit
As displayed in Output 74.1.1, there are five unique design points for the smoothing variable x2 and two regression variables in the model (x1,x1sq). The dimension of the null space (polynomial space) is 4. The standard deviation of the estimate is much larger than the one based on the model with both x1 and x2 as smoothing variables (0.445954 compared to 0.098421). One of the many possible explanations may be that the number of unique design points of the smoothing variable is too small to warrant an accurate estimate for h(x2). The following statements produce a surface plot for the partial spline model: title ’Plot of Fitted Surface on a Fine Grid’; proc g3d data=predy; plot x2*x1=p_y/grid zmin=9 zmax=21 zticknum=4; run;
The surface displayed in Output 74.1.2 is similar to the one estimated by using the full nonparametric model (displayed in Figure 74.5). Output 74.1.2. Plot of TPSPLINE Fit from the Partial Spline Model
4517
4518
Chapter 74. The TPSPLINE Procedure
Example 74.2. Spline Model With Higher-Order Penalty The following example continues the analysis of the data set Measure to illustrate how you can use PROC TPSPLINE to fit a spline model with a higher-order penalty term. Spline models with high-order penalty terms move low-order polynomial terms into the null space. Hence, there is no penalty for these terms, and they can vary without constraint. As shown in the previous analyses, the final model for the data set Measure must include quadratic terms for both x1 and x2. This example fits the following model: f (x1 , x2 ) = θ0 + θ1 x1 + θ2 x21 + θ3 x2 + θ4 x22 + θ5 x1 ∗ x2 + g(x1 , x2 ) The model includes quadratic terms for both variables, although it differs from the usual linear model. The nonparametric term g(x1 , x2 ) explains the variation of the data unaccounted for by a simple quadratic surface. To modify the order of the derivative in the penalty term, specify the M= option. The following statements specify the option M=3 in order to include the quadratic terms in the null space: data measure; set measure; x1sq = x1*x1; x2sq = x2*x2; x1x2 = x1*x2; ; proc tpspline data= measure; model y = (x1 x2) / m=3; score data = pred out = predy; run;
The output resulting from these statements is displayed in Output 74.2.1.
Example 74.2. Spline Model With Higher-Order Penalty
Output 74.2.1. Output from PROC TPSPLINE with M=3 The TPSPLINE Procedure Dependent Variable: y Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
50 0 25
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
0 2 3 6
Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
-3.7831 2092.4495 0.2731 29.1716 20.8284 0.0968
The model contains six terms in the null space. Compare Output 74.2.1 with Figure 74.2: the LOGNLAMBDA value and the smoothing penalty differ significantly. Note that, in general, these terms are not directly comparable for different models. The final estimate based on this model is close to the estimate based on the model using the default, M=2. In the following statements, the REG procedure fits a quadratic surface model to the data set Measure. proc reg data= measure; model y = x1 x1sq x2 x2sq x1x2; run;
The results are displayed in Output 74.2.2.
4519
4520
Chapter 74. The TPSPLINE Procedure
Output 74.2.2. Quadratic Surface Model: The REG Procedure The REG Procedure Model: MODEL1 Dependent Variable: y Analysis of Variance
Source
DF
Sum of Squares
Mean Square
Model Error Corrected Total
5 44 49
443.20502 8.93874 452.14376
88.64100 0.20315
Root MSE Dependent Mean Coeff Var
0.45073 15.08548 2.98781
R-Square Adj R-Sq
F Value
Pr > F
436.33
<.0001
0.9802 0.9780
Parameter Estimates
Variable Intercept x1 x1sq x2 x2sq x1x2
DF
Parameter Estimate
Standard Error
t Value
Pr > |t|
1 1 1 1 1 1
14.90834 0.01292 -4.85194 0.02618 5.20624 -0.04814
0.12519 0.09015 0.15237 0.09015 0.15237 0.12748
119.09 0.14 -31.84 0.29 34.17 -0.38
<.0001 0.8867 <.0001 0.7729 <.0001 0.7076
The REG procedure produces slightly different results. To fit a similar model with PROC TPSPLINE, you can use a MODEL statement specifying the degrees of freedom with the DF= option. You can also use a large value for the LOGNLAMBDA0= option to force a parametric model fit. Because there is one degree of freedom for each of the following terms, Intercept, x1, x2, x1sq, x2sq, and x1x2, the DF=6 option is used. proc tpspline data=measure; model y=(x1 x2) /m=3 df=6 lognlambda=(-4 to 1 by 0.2); score data = pred out = predy; run;
The results are displayed in Output 74.2.3. PROC TPSPLINE displays the list of GCV values for comparison.
Example 74.2. Spline Model With Higher-Order Penalty Output 74.2.3. Output from PROC TPSPLINE Using M=3 and DF=6 The TPSPLINE Procedure Dependent Variable: y Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
50 0 25
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
0 2 3 6
GCV Function log10(n*Lambda) -4.000000 -3.800000 -3.600000 -3.400000 -3.200000 -3.000000 -2.800000 -2.600000 -2.400000 -2.200000 -2.000000 -1.800000 -1.600000 -1.400000 -1.200000 -1.000000 -0.800000 -0.600000 -0.400000 -0.200000 0 0.200000 0.400000 0.600000 0.800000 1.000000
GCV 0.016330 0.016051* 0.016363 0.017770 0.021071 0.027496 0.038707 0.056292 0.080613 0.109714 0.139642 0.166338 0.187437 0.202625 0.212871 0.219512 0.223727 0.226377 0.228041 0.229085 0.229740 0.230153 0.230413 0.230576 0.230680 0.230745
Note: * indicates minimum GCV value.
4521
4522
Chapter 74. The TPSPLINE Procedure The TPSPLINE Procedure Dependent Variable: y Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
2.3830 0.0000 8.9384 43.9997 6.0003 0.4507
The final estimate is based on 6.000330 degrees of freedom because there are already 6 degrees of freedom in the null space and the search range for lambda is not large enough (in this case, setting DF=6 is equivalent to setting lambda = ∞). The standard deviation and RSS (Output 74.2.3) are close to the sum of squares for the error term and the root MSE from the the linear regression model (Output 74.2.2), respectively. For this model, the optimal LOGNLAMBDA is around −3.8, which produces a standard deviation estimate of 0.096765 (see Output 74.2.1) and a GCV value of 0.016051, while the model specifying DF=6 results in a LOGNLAMBDA larger than 1 and a GCV value larger than 0.23074. The nonparametric model, based on the GCV, should provide better prediction, but the linear regression model can be more easily interpreted.
Example 74.3. Multiple Minima of the GCV Function The following data represent the deposition of sulfate (SO4 ) at 179 sites in 48 contiguous states of the United States in 1990. Each observation records the latitude and longitude of the site as well as the SO4 deposition at the site measured in gram per square meter (g/m2 ). You can use PROC TPSPLINE to fit a surface that reflects the general trend and that reveals underlying features of the data. data so4; input latitude longitude so4 @@; datalines; 32.45833 87.24222 1.403 34.28778 85.96889 2.103 33.07139 109.86472 0.299 36.07167 112.15500 0.304 31.95056 112.80000 0.263 33.60500 92.09722 1.950 34.17944 93.09861 2.168 36.08389 92.58694 1.578 . . . 162 additional observations . . . 45.82278 91.87444 0.984 41.34028 106.19083 0.335
Example 74.3. Multiple Minima of the GCV Function
4523
42.73389 108.85000 0.236 42.49472 108.82917 0.313 42.92889 109.78667 0.182 43.22278 109.99111 0.161 43.87333 104.19222 0.306 44.91722 110.42028 0.210 45.07611 72.67556 2.646 ; data pred; do latitude = 25 to 47 by 1; do longitude = 68 to 124 by 1; output; end; end; run;
The preceding statements create the SAS data set so4 and the data set pred in order to make predictions on a regular grid. The following statements fit a surface for SO4 deposition. The ODS OUTPUT statement creates a data set called GCV to contain the GCV values for LOGNLAMBDA in the range from −6 to 1. proc tpspline data=so4; ods output GCVFunction=gcv; model so4 = (latitude longitude) /lognlambda=(-6 to 1 by 0.1); score data=pred out=prediction1; run;
Partial output from these statements is displayed in Output 74.3.1. Output 74.3.1. Partial Output from PROC TPSPLINE for Data Set SO4 The TPSPLINE Procedure Dependent Variable: so4 Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
179 0 179
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
0.2770 2.4588 12.4450 140.2750 38.7250 0.2979
0 2 2 3
4524
Chapter 74. The TPSPLINE Procedure
The following statements produce Output 74.3.2: symbol1 interpol=join value=none; title "GCV Function"; proc gplot data=gcv; plot gcv*lognlambda/frame cframe=ligr vaxis=axis1 haxis=axis2; run;
Output 74.3.2 displays the plot of the GCV function versus nlambda in log10 scale. The GCV function has two minima. PROC TPSPLINE locates the minimum at 0.277005. The figure also displays a local minimum located around −2.56. Note that the TPSPLINE procedure may not always find the global minimum, although it did in this case. Output 74.3.2. GCV Function of SO4 Data Set
The following analysis specifies the option LOGNLAMBDA0=−2.56. The output is displayed in Output 74.3.3. proc tpspline data=so4; model so4 = (latitude longitude) /lognlambda0=-2.56; score data=pred out=prediction2; run;
Example 74.3. Multiple Minima of the GCV Function
Output 74.3.3. Output from PROC TPSPLINE for Data Set SO4 with LOGNLAMBDA=−2.56 The TPSPLINE Procedure Dependent Variable: so4 Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
179 0 179
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
0 2 2 3
Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
-2.5600 177.2144 0.0438 7.2086 171.7914 0.0779
The smoothing penalty is much larger in Output 74.3.3 than that displayed in Output 74.3.1. The estimate in Output 74.3.1 uses a large lambda value and, therefore, the surface is smoother than the estimate using LOGNLAMBDA=−2.56 (Output 74.3.3). The estimate based on LOGNLAMBDA=−2.56 has a larger value for the degrees of freedom, and it has a much smaller standard deviation. However, a smaller standard deviation in nonparametric regression does not necessarily mean that the estimate is good: a small λ value always produces an estimate closer to the data and, therefore, a smaller standard deviation. The following statements produce two contour plots of the estimates using the GCONTOUR procedure. In the final step, the plots are placed into a single graphic with the GREPLAY procedure.
4525
4526
Chapter 74. The TPSPLINE Procedure title "TPSPLINE fit with lognlambda=0.277"; proc gcontour data=prediction1 gout=grafcat; plot latitude*longitude = P_so4/ name="tpscon1" legend=legend1 vaxis=axis1 haxis=axis2 cframe=ligr hreverse; run; title "TPSPLINE fit with lognlambda=-2.56"; proc gcontour data=prediction2 gout=grafcat; plot latitude*longitude = P_so4/ name="tpscon2" legend=legend1 vaxis=axis1 haxis=axis2 cframe=ligr hreverse; run; title; proc greplay igout=grafcat tc=sashelp.templt template=v2 nofs; treplay 1:tpscon1 2:tpscon2; quit;
Compare the two estimates by examining the contour plots of both estimates (Output 74.3.4). Output 74.3.4. Contour Plot of TPSPLINE Estimates with Different Lambdas
As the contour plots show, the estimate with LOGNLAMBDA=0.277 may repre-
Example 74.4. Large Data Set Application
sent the underlying trend, while the estimate with the LOGNLAMBDA=-2.56 is very rough and may be modeling the noise component.
Example 74.4. Large Data Set Application The following example illustrates how you can use the D= option to decrease the computation time needed by the TPSPLINE procedure. Note that, while the D= option can be helpful in decreasing computation time for large data sets, it may produce unexpected results when used with small data sets. The following statements generate the data set large: data large; do x=-5 to 5 by 0.02; y=5*sin(3*x)+1*rannor(57391); output; end; run;
The data set large contains 501 observations with one independent variable x and one dependent variable y. The following statements invoke PROC TPSPLINE to produce a thin-plate smoothing spline estimate and the associated 99% confidence interval. The output statistics are saved in the data set fit1. proc tpspline data=large; model y =(x) /lambda=(-5 to -1 by 0.2) alpha=0.01; output out=fit1 pred lclm uclm; run;
The results from this MODEL statement are displayed in Output 74.4.1.
4527
4528
Chapter 74. The TPSPLINE Procedure
Output 74.4.1. Output from PROC TPSPLINE without the D= Option The TPSPLINE Procedure Dependent Variable: y Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
501 0 501
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
0 1 2 2
GCV Function log10(n*Lambda) -5.000000 -4.800000 -4.600000 -4.400000 -4.200000 -4.000000 -3.800000 -3.600000 -3.400000 -3.200000 -3.000000 -2.800000 -2.600000 -2.400000 -2.200000 -2.000000 -1.800000 -1.600000 -1.400000 -1.200000 -1.000000
GCV 1.258653 1.228743 1.205835 1.188371 1.174644 1.163102 1.152627 1.142590 1.132700 1.122789 1.112755 1.102642 1.092769 1.083779 1.076636 1.072763* 1.074636 1.087152 1.120339 1.194023 1.344213
Note: * indicates minimum GCV value.
The TPSPLINE Procedure Dependent Variable: y Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
-1.9483 9953.7063 475.0984 471.0861 29.9139 1.0042
The following statements specify an identical model, but with the additional specification of the D= option. The estimates are obtained by treating nearby points as
Example 74.4. Large Data Set Application
4529
replicates. proc tpspline data=large; model y =(x) /lambda=(-5 to -1 by 0.2) d=0.05 alpha=0.01; output out=fit2 pred lclm uclm; run;
The output is displayed in Output 74.4.2. Output 74.4.2. Output from PROC TPSPLINE with the D= Option The TPSPLINE Procedure Dependent Variable: y Summary of Input Data Set Number of Non-Missing Observations Number of Missing Observations Unique Smoothing Design Points
501 0 251
Summary of Final Model Number of Regression Variables Number of Smoothing Variables Order of Derivative in the Penalty Dimension of Polynomial Space
0 1 2 2
GCV Function log10(n*Lambda) -5.000000 -4.800000 -4.600000 -4.400000 -4.200000 -4.000000 -3.800000 -3.600000 -3.400000 -3.200000 -3.000000 -2.800000 -2.600000 -2.400000 -2.200000 -2.000000 -1.800000 -1.600000 -1.400000 -1.200000 -1.000000
GCV 1.306536 1.261692 1.226881 1.200060 1.179284 1.162776 1.149072 1.137120 1.126220 1.115884 1.105766 1.095730 1.085972 1.077066 1.069954 1.066076* 1.067929 1.080419 1.113564 1.187172 1.337252
Note: * indicates minimum GCV value.
4530
Chapter 74. The TPSPLINE Procedure The TPSPLINE Procedure Dependent Variable: y Summary Statistics of Final Estimation log10(n*Lambda) Smoothing Penalty Residual SS Tr(I-A) Model DF Standard Deviation
-1.9477 9943.5615 472.1424 471.0901 29.9099 1.0011
The difference between the two estimates is minimal. However, the CPU time for the second MODEL statement is only about 1/8 of the CPU time used in the first model fit.

The following statements produce a plot for comparison of the two estimates:

   data fit2;
      set fit2;
      P1_y    = P_y;
      LCLM1_y = LCLM_y;
      UCLM1_y = UCLM_y;
      drop P_y LCLM_y UCLM_y;

   proc sort data=fit1;
      by x y;

   proc sort data=fit2;
      by x y;

   data comp;
      merge fit1 fit2;
      by x y;
      label p1_y  ="Yhat1"    p_y   ="Yhat0"
            lclm_y="Lower CL" uclm_y="Upper CL";

   symbol1 i=join v=none;
   symbol2 i=join v=none;
   symbol3 i=join v=none color=cyan;
   symbol4 i=join v=none color=cyan;

   title  'Comparison of Two Estimates';
   title2 'with and without the D= Option';
   proc gplot data=comp;
      plot P_y*x=1 P1_y*x=2 LCLM_y*x=4 UCLM_y*x=4
           /overlay legend=legend1 vaxis=axis1 haxis=axis2
            frame cframe=ligr;
   run;
The estimates from fit1 and fit2 are displayed in Output 74.4.3 with the 99% confidence interval from the fit1 output data set.

Output 74.4.3. Comparison of Two Fits with and without the D= Option
Example 74.5. Computing a Bootstrap Confidence Interval

The following example illustrates how you can construct a bootstrap confidence interval by using the multiple responses feature in PROC TPSPLINE. Numerous epidemiological observations have indicated that exposure to solar radiation is an important factor in the etiology of melanoma. The following data present age-adjusted melanoma incidences for 37 years from the Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980). The data are analyzed by Ramsay and Silverman (1997).
   data melanoma;
      input year incidences @@;
      datalines;
   1936 0.9   1937 0.8   1938 0.8   1939 1.3
   1940 1.4   1941 1.2   1942 1.7   1943 1.8
   1944 1.6   1945 1.5   1946 1.5   1947 2.0
   1948 2.5   1949 2.7   1950 2.9   1951 2.5
   1952 3.1   1953 2.4   1954 2.2   1955 2.9
   1956 2.5   1957 2.6   1958 3.2   1959 3.8
   1960 4.2   1961 3.9   1962 3.7   1963 3.3
   1964 3.7   1965 3.9   1966 4.1   1967 3.8
   1968 4.7   1969 4.4   1970 4.8   1971 4.8
   1972 4.8
   ;
   run;
The variable incidences records the number of melanoma cases per 100,000 people for the years 1936 to 1972. The following model fits the data and requests a 90% Bayesian confidence interval along with the estimate.

   proc tpspline data=melanoma;
      model incidences = (year) /alpha = 0.1;
      output out = result pred uclm lclm;
   run;
The output is displayed in Output 74.5.1.

Output 74.5.1. Output from PROC TPSPLINE for the Melanoma Data Set

                     The TPSPLINE Procedure
                Dependent Variable: incidences

        Summary of Input Data Set
        Number of Non-Missing Observations     37
        Number of Missing Observations          0
        Unique Smoothing Design Points         37

        Summary of Final Model
        Number of Regression Variables          0
        Number of Smoothing Variables           1
        Order of Derivative in the Penalty      2
        Dimension of Polynomial Space           2

        Summary Statistics of Final Estimation
        log10(n*Lambda)         -0.0607
        Smoothing Penalty        0.5171
        Residual SS              1.2243
        Tr(I-A)                 22.5852
        Model DF                14.4148
        Standard Deviation       0.2328
The following statements produce a plot of the estimated curve:

   symbol1 h=1pct;
   symbol2 i=join v=none;
   symbol3 i=join v=none;
   symbol4 i=join v=none c=cyan;

   legend1 frame cframe=ligr cborder=black
           label=none position=center;
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;
   title1 'Age-adjusted Melanoma Incidences for 37 years';
   proc gplot data=result;
      plot incidences*year=1
           p_incidences*year=2
           lclm_incidences*year=3
           uclm_incidences*year=4
           /overlay legend=legend1 vaxis=axis1 haxis=axis2
            frame cframe=ligr;
   run;
The estimated curve is displayed with 90% confidence interval bands in Output 74.5.2. The number of melanoma incidences exhibits a periodic pattern and increases over the years. The periodic pattern is related to sunspot activity and the accompanying fluctuations in solar radiation.
Output 74.5.2. TPSPLINE Estimate and 90% Confidence Interval of Melanoma Data
Wang and Wahba (1995) compared several bootstrap confidence intervals to Bayesian confidence intervals for smoothing splines. Both bootstrap and Bayesian confidence intervals are across-the-curve intervals, not point-wise intervals. They concluded that bootstrap confidence intervals work as well as Bayesian intervals with respect to average coverage probability, and that bootstrap confidence intervals appear to be better for small sample sizes. Based on their simulation, the "percentile-t interval" performs better than the other types of bootstrap intervals.

Suppose that $\hat{f}_{\hat{\lambda}}$ and $\hat{\sigma}$ are the estimates of $f$ and $\sigma$ from the data. Assume that $\hat{f}_{\hat{\lambda}}$ is the "true" $f$, and generate the bootstrap sample as follows:

$$y_i^* = \hat{f}_{\hat{\lambda}}(x_i) + \epsilon_i^*, \quad i = 1, \cdots, n$$

where $\epsilon^* = (\epsilon_1^*, \cdots, \epsilon_n^*)^T \sim N(0, \hat{\sigma}^2 I_{n \times n})$. Denote $f_{\hat{\lambda}}^*(x_i)$ as the random variable of the bootstrap estimate at $x_i$. Repeat this process $K$ times, so that at each point $x_i$ you have $K$ realizations of $f_{\hat{\lambda}}^*(x_i)$. For each fixed $x_i$, consider the following statistic $D_i^*$, which is similar to a Student's $t$ statistic:

$$D_i^* = \frac{f_{\hat{\lambda}}^*(x_i) - \hat{f}_{\hat{\lambda}}(x_i)}{\hat{\sigma}_i^*}$$
where $\hat{\sigma}_i^*$ is the estimate of $\hat{\sigma}$ based on the $i$th bootstrap sample. Suppose $\chi_{\alpha/2}$ and $\chi_{1-\alpha/2}$ are the lower and upper $\alpha/2$ points of the empirical distribution of $D_i^*$. The $(1-\alpha)100\%$ bootstrap confidence interval is defined as

$$\left( \hat{f}_{\hat{\lambda}}(x_i) - \chi_{1-\alpha/2}\,\hat{\sigma}, \;\; \hat{f}_{\hat{\lambda}}(x_i) - \chi_{\alpha/2}\,\hat{\sigma} \right)$$

Bootstrap confidence intervals are easy to interpret and can be used with any distribution. However, because they require $K$ model fits, their construction is computationally intensive.

The multiple dependent variables feature in PROC TPSPLINE enables you to fit multiple models with the same independent variables. The procedure calculates the matrix decomposition part of the calculations only once, regardless of the number of dependent variables in the model. These calculations are responsible for most of the computing time used by the TPSPLINE procedure. This feature is particularly useful when you need to generate a bootstrap confidence interval.

To construct a bootstrap confidence interval, perform the following tasks:

• Fit the data using PROC TPSPLINE and obtain estimates $\hat{f}_{\hat{\lambda}}(x_i)$ and $\hat{\sigma}$.
• Generate $K$ bootstrap samples based on $\hat{f}_{\hat{\lambda}}(x_i)$ and $\hat{\sigma}$.
• Fit the $K$ bootstrap samples with the TPSPLINE procedure to obtain estimates of $\hat{f}_{\hat{\lambda}}^*(x_i)$ and $\hat{\sigma}_i^*$.
• Compute $D_i^*$ and the values $\chi_{\alpha/2}$ and $\chi_{1-\alpha/2}$.
The following statements illustrate this process:

   proc tpspline data=melanoma;
      model incidences = (year) /alpha = 0.1;
      output out = result pred uclm lclm;
   run;
The output from the initial PROC TPSPLINE analysis is displayed in Output 74.5.3. The data set result contains the predicted values and confidence limits from the analysis.
Output 74.5.3. Output from PROC TPSPLINE for the Melanoma Data Set

                     The TPSPLINE Procedure
                Dependent Variable: incidences

        Summary of Input Data Set
        Number of Non-Missing Observations     37
        Number of Missing Observations          0
        Unique Smoothing Design Points         37

        Summary of Final Model
        Number of Regression Variables          0
        Number of Smoothing Variables           1
        Order of Derivative in the Penalty      2
        Dimension of Polynomial Space           2

        Summary Statistics of Final Estimation
        log10(n*Lambda)         -0.0607
        Smoothing Penalty        0.5171
        Residual SS              1.2243
        Tr(I-A)                 22.5852
        Model DF                14.4148
        Standard Deviation       0.2328
The following statements illustrate how you can obtain a bootstrap confidence interval for the melanoma data. They create the data set bootstrap, using information from the preceding PROC TPSPLINE execution; as displayed in Output 74.5.3, $\hat{\sigma} = 0.232823$. The values of $\hat{f}_{\hat{\lambda}}(x_i)$ are stored in the data set result in the variable P_incidences.

   data bootstrap;
      set result;
      array y{1070} y1-y1070;
      do i=1 to 1070;
         y{i} = p_incidences + 0.232823*rannor(123456789);
      end;
      keep y1-y1070 p_incidences year;
   run;

   ods listing close;
   proc tpspline data=bootstrap;
      ods output FitStatistics=FitResult;
      id p_incidences;
      model y1-y1070 = (year);
      output out=result2;
   run;
   ods listing;
The DATA step generates 1,070 bootstrap samples based on the previous estimate from PROC TPSPLINE. For this data set, some of the bootstrap samples result in $\lambda$s (selected by the GCV function) that cause problematic behavior; thus, an additional 70 bootstrap samples are generated. The ODS listing destination is closed before PROC TPSPLINE is invoked. The model fits all of the y1-y1070 variables as dependent variables, and the models are fit for all bootstrap samples simultaneously. The output data set result2 contains the variables year, y1-y1070, p_y1-p_y1070, and P_incidences.

The ODS OUTPUT statement writes the FitStatistics table to the data set FitResult. The data set FitResult contains two variables, Parameter and Value, and it is used in subsequent calculations for $D_i^*$.

In the data set FitResult, there are 63 estimates with a standard deviation of zero, suggesting that those estimates provide perfect fits of the data; they are caused by $\hat{\lambda}$s that are approximately equal to zero. For small sample sizes, there is a positive probability that the $\lambda$ chosen by the GCV function will be zero (refer to Wang and Wahba 1995). In the following steps, these cases are removed from the bootstrap samples as "bad" samples: they represent failure of the GCV function.

The following SAS statements manipulate the data set FitResult, retaining the standard deviations for all bootstrap samples and merging FitResult with the data set result2, which contains the estimates for the bootstrap samples. In the final data set boot, the $D_i^*$ statistics are calculated.

   data FitResult;
      set FitResult;
      if Parameter="Standard Deviation";
      keep Value;
   run;

   proc transpose data=FitResult out=sd prefix=sd;

   data result2;
      if _N_ = 1 then set sd;
      set result2;

   data boot;
      set result2;
      array y{1070} p_y1-p_y1070;
      array sd{1070} sd1-sd1070;
      do i=1 to 1070;
         if sd{i} > 0 then do;
            d = (y{i} - P_incidences)/sd{i};
            obs = _N_;
            output;
         end;
      end;
      keep d obs P_incidences year;
   run;
The following SAS statements retain the first 1000 bootstrap samples and calculate the values $\chi_{\alpha/2}$ and $\chi_{1-\alpha/2}$ with $\alpha = 0.1$.
   proc sort data=boot;
      by obs;
   run;

   data boot;
      set boot;
      by obs;
      retain n;
      if first.obs then n=1;
      else n=n+1;
      if n > 1000 then delete;
   run;
   proc sort data=boot;
      by obs d;
   run;

   data chi1 chi2;
      set boot;
      if (_N_ = (obs-1)*1000+50)  then output chi1;
      if (_N_ = (obs-1)*1000+950) then output chi2;
   run;

   proc sort data=result;
      by year;
   run;

   proc sort data=chi1;
      by year;
   run;

   proc sort data=chi2;
      by year;
   run;

   data result;
      merge result chi1(rename=(d=chi05)) chi2(rename=(d=chi95));
      keep year incidences P_incidences lower upper
           LCLM_incidences UCLM_incidences;
      lower = -chi95*0.232823 + P_incidences;
      upper = -chi05*0.232823 + P_incidences;
      label lower="Lower 90% CL (Bootstrap)"
            upper="Upper 90% CL (Bootstrap)"
            lclm_incidences="Lower 90% CL (Bayesian)"
            uclm_incidences="Upper 90% CL (Bayesian)";
   run;
The data set result contains the variables year, incidences, the TPSPLINE estimate P_incidences, and the 90% Bayesian and 90% bootstrap confidence intervals.

The following statements produce Output 74.5.4:

   symbol1 v=dot  h=1pct;
   symbol2 i=join v=none l=1;
   symbol3 i=join v=none l=33;
   symbol4 i=join v=none l=33;
   symbol5 i=join v=none l=43 c=green;
   symbol6 i=join v=none l=43 c=green;

   title1 'Age-adjusted Melanoma Incidences for 37 years';
   proc gplot data=result;
      plot incidences      * year=1
           p_incidences    * year=2
           lclm_incidences * year=3
           uclm_incidences * year=3
           lower           * year=4
           upper           * year=4
           /overlay legend=legend1 vaxis=axis1 haxis=axis2
            frame cframe=ligr;
   run;
Output 74.5.4 displays the plot of the variable incidences, the predicted values, and the Bayesian and bootstrap confidence intervals. The plot shows that the bootstrap confidence interval is similar to the Bayesian confidence interval. However, the Bayesian confidence interval is symmetric around the estimates, while the bootstrap confidence interval is not.
Output 74.5.4. Comparison of Bayesian and Bootstrap Confidence Interval for Melanoma Data
References

Andrews, D. (1988), Asymptotic Optimality of GCL, Cross-Validation, and GCV in Regression with Heteroscedastic Errors, Cowles Foundation, Yale University, New Haven, CT, manuscript.

Bates, D.; Lindstrom, M.; Wahba, G.; and Yandell, B. (1987), "GCVPACK - Routines for Generalized Cross Validation," Comm. Statist. B-Simulation Comput., 16, 263-297.

Dongarra, J.; Bunch, J.; Moler, C.; and Stewart, G. (1979), Linpack Users' Guide, Philadelphia: Society for Industrial and Applied Mathematics.

Duchon, J. (1976), "Fonctions-Spline et Esperances Conditionnelles de Champs Gaussiens," Ann. Sci. Univ. Clermont Ferrand II Math., 14, 19-27.

Duchon, J. (1977), "Splines Minimizing Rotation-Invariant Semi-Norms in Sobolev Spaces," in Constructive Theory of Functions of Several Variables, eds. W. Schempp and K. Zeller, 85-100.

Hall, P. and Titterington, D. (1987), "Common Structure of Techniques for Choosing Smoothing Parameters in Regression Problems," J. Roy. Statist. Soc. Ser. B, 49, 184-198.

Houghton, A. N.; Flannery, J.; and Viola, M. V. (1980), "Malignant Melanoma in Connecticut and Denmark," International Journal of Cancer, 25, 95-104.

Hutchinson, M. and Bischof, R. (1983), "A New Method for Estimating the Spatial Distribution of Mean Seasonal and Annual Rainfall Applied to the Hunter Valley, New South Wales," Aust. Met. Mag., 31, 179-184.

Meinguet, J. (1979), "Multivariate Interpolation at Arbitrary Points Made Simple," J. Appl. Math. Phys. (ZAMP), 5, 439-468.

Nychka, D. (1986a), "The Average Posterior Variance of a Smoothing Spline and a Consistent Estimate of the Mean Square Error," Tech. Report 168, The Institute of Statistics, North Carolina State University, Raleigh, NC.

Nychka, D. (1986b), "A Frequency Interpretation of Bayesian 'Confidence' Intervals for Smoothing Splines," Tech. Report 169, The Institute of Statistics, North Carolina State University, Raleigh, NC.

Nychka, D. (1988), "Confidence Intervals for Smoothing Splines," J. Amer. Statist. Assoc., 83, 1134-1143.

O'Sullivan, F. and Wong, T. (1987), "Determining a Function Diffusion Coefficient in the Heat Equation," Tech. Report 98, Department of Statistics, University of California, Berkeley, CA.

Ramsay, J. and Silverman, B. (1997), Functional Data Analysis, New York: Springer-Verlag.

Seaman, R. and Hutchinson, M. (1985), "Comparative Real Data Tests of Some Objective Analysis Methods by Withholding," Aust. Met. Mag., 33, 37-46.

Villalobos, M. and Wahba, G. (1987), "Inequality Constrained Multivariate Smoothing Splines with Application to the Estimation of Posterior Probabilities," J. Amer. Statist. Assoc., 82, 239-248.

Wahba, G. (1983), "Bayesian 'Confidence Intervals' for the Cross Validated Smoothing Spline," J. Roy. Statist. Soc. Ser. B, 45, 133-150.

Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: Society for Industrial and Applied Mathematics.

Wahba, G. and Wendelberger, J. (1980), "Some New Mathematical Methods for Variational Objective Analysis Using Splines and Cross Validation," Monthly Weather Rev., 108, 1122-1145.

Wang, Y. and Wahba, G. (1995), "Bootstrap Confidence Intervals for Smoothing Splines and their Comparison to Bayesian Confidence Intervals," J. Statistical Computation and Simulation, 51, 263-279.
Chapter 75
The TRANSREG Procedure

Chapter Contents

OVERVIEW ................................................... 4545

GETTING STARTED ............................................ 4546
   Main-Effects ANOVA ...................................... 4546
   Detecting Nonlinear Relationships ....................... 4550

SYNTAX ..................................................... 4553
   PROC TRANSREG Statement ................................. 4553
   BY Statement ............................................ 4556
   FREQ Statement .......................................... 4557
   ID Statement ............................................ 4557
   MODEL Statement ......................................... 4557
   OUTPUT Statement ........................................ 4582
   WEIGHT Statement ........................................ 4592

DETAILS .................................................... 4592
   Model Statement Usage ................................... 4592
   Box-Cox Transformations ................................. 4595
   Smoothing Splines ....................................... 4596
   Missing Values .......................................... 4599
   Missing Values, UNTIE, and Hypothesis Tests ............. 4600
   Controlling the Number of Iterations .................... 4601
   Using the REITERATE Algorithm Option .................... 4602
   Avoiding Constant Transformations ....................... 4603
   Constant Variables ...................................... 4604
   Character OPSCORE Variables ............................. 4604
   Convergence and Degeneracies ............................ 4604
   Implicit and Explicit Intercepts ........................ 4605
   Passive Observations .................................... 4605
   Point Models ............................................ 4605
   Redundancy Analysis ..................................... 4606
   Optimal Scaling ......................................... 4609
   OPSCORE, MONOTONE, UNTIE, and LINEAR Transformations .... 4610
   SPLINE and MSPLINE Transformations ...................... 4611
   Specifying the Number of Knots .......................... 4613
   SPLINE, BSPLINE, and PSPLINE Comparisons ................ 4614
   Hypothesis Tests ........................................ 4615
   Output Data Set ......................................... 4617
   OUTTEST= Output Data Set ................................ 4626
   Computational Resources ................................. 4627
   Solving Standard Least-Squares Problems ................. 4628
   Using the DESIGN Output Option .......................... 4654
   Discrete Choice Experiments: DESIGN, NORESTORE, NOZERO .. 4660
   ANOVA Codings ........................................... 4662
   Centering ............................................... 4675
   Displayed Output ........................................ 4676
   ODS Table Names ......................................... 4676

EXAMPLES ................................................... 4678
   Example 75.1. Using Splines and Knots ................... 4678
   Example 75.2. Nonmetric Conjoint Analysis of Tire Data .. 4690
   Example 75.3. Metric Conjoint Analysis of Tire Data ..... 4694
   Example 75.4. Transformation Regression of Exhaust
                 Emissions Data ............................ 4708
   Example 75.5. Preference Mapping of Cars Data ........... 4717
   Example 75.6. Box Cox ................................... 4721

REFERENCES ................................................. 4737
Chapter 75
The TRANSREG Procedure

Overview

The TRANSREG (transformation regression) procedure fits linear models, optionally with spline and other nonlinear transformations, and it can be used to code experimental designs prior to their use in other analyses.

The TRANSREG procedure fits many types of linear models, including

• ordinary regression and ANOVA
• metric and nonmetric conjoint analysis (Green and Wind 1975; de Leeuw, Young, and Takane 1976)
• metric and nonmetric vector and ideal point preference mapping (Carroll 1972)
• simple, multiple, and multivariate regression with variable transformations (Young, de Leeuw, and Takane 1976; Winsberg and Ramsay 1980; Breiman and Friedman 1985)
• redundancy analysis (Stewart and Love 1968) with variable transformations (Israels 1984)
• canonical correlation analysis with variable transformations (van der Burg and de Leeuw 1983)
• response surface regression (Meyers 1976; Khuri and Cornell 1987) with variable transformations
• linear models with Box-Cox (1964) transformations of the dependent variables

The data set can contain variables measured on nominal, ordinal, interval, and ratio scales (Siegel 1956). Any mix of these variable types is allowed for the dependent and independent variables. The TRANSREG procedure can transform

• nominal variables, by scoring the categories to minimize squared error (Fisher 1938) or by expanding them into dummy variables
• ordinal variables, by monotonically scoring the ordered categories so that order is weakly preserved (adjacent categories can be merged) and squared error is minimized. Ties can be optimally untied or left tied (Kruskal 1964). Ordinal variables can also be transformed to ranks.
• interval and ratio scale of measurement variables, linearly or nonlinearly with spline (de Boor 1978; van Rijckevorsel 1982) or monotone spline (Winsberg and Ramsay 1980) transformations. In addition, smooth, logarithmic, exponential, power, logit, and inverse trigonometric sine transformations are available.
Transformations produced by the PROC TRANSREG multiple regression algorithm with spline transformations requested are often similar to transformations produced by the ACE smooth regression method of Breiman and Friedman (1985). However, ACE does not explicitly optimize a loss function (de Leeuw 1986), while PROC TRANSREG always explicitly optimizes a squared-error loss function.

PROC TRANSREG extends the ordinary general linear model by providing optimal variable transformations that are iteratively derived using the method of alternating least squares (Young 1981). PROC TRANSREG iterates until convergence, alternating

• finding least-squares estimates of the parameters of the model given the current scoring of the data (that is, the current vectors)
• finding least-squares estimates of the scoring parameters given the current set of model parameters

For more background on alternating least-squares optimal scaling methods and transformation regression methods, refer to Young, de Leeuw, and Takane (1976), Winsberg and Ramsay (1980), Young (1981), Gifi (1990), Schiffman, Reynolds, and Young (1981), van der Burg and de Leeuw (1983), Israels (1984), Breiman and Friedman (1985), and Hastie and Tibshirani (1986). (These are just a few of the many relevant sources.)
Getting Started

This section provides several examples that illustrate features of the TRANSREG procedure.
Main-Effects ANOVA

This example shows how to use the TRANSREG procedure to code and fit a main-effects ANOVA model. The input data set contains the dependent variable Y, factors X1 and X2, and 12 observations. The following statements perform a main-effects ANOVA:

   title 'Introductory Main-Effects ANOVA Example';

   data A;
      input Y X1 $ X2 $;
      datalines;
   8 a a
   7 a a
   4 a b
   3 a b
   5 b a
   4 b a
   2 b b
   1 b b
   8 c a
   7 c a
   5 c b
   2 c b
   ;

   *---Fit a Main-Effects ANOVA model with 1, 0, -1 coding. ---;
   proc transreg ss2;
      model identity(Y) = class(X1 X2 / effects);
      output coefficients replace;
   run;

   *---Print TRANSREG output data set---;
   proc print label;
      format Intercept -- X2a 5.2;
   run;
                Introductory Main-Effects ANOVA Example

                       The TRANSREG Procedure

                   Dependent Variable Identity(Y)

                      Class Level Information

                     Class    Levels    Values
                     X1            3    a b c
                     X2            2    a b

                  Number of Observations Read    12
                  Number of Observations Used    12

     TRANSREG Univariate Algorithm Iteration History for Identity(Y)

     Iteration    Average    Maximum                Criterion
        Number     Change     Change    R-Square       Change    Note
     -------------------------------------------------------------------
             1    0.00000    0.00000     0.88144               Converged

     Algorithm converged.

                    Hypothesis Tests for Identity(Y)

       Univariate ANOVA Table Based on the Usual Degrees of Freedom

                                Sum of       Mean
     Source            DF      Squares     Square    F Value    Pr > F
     Model              3     57.00000   19.00000      19.83    0.0005
     Error              8      7.66667    0.95833
     Corrected Total   11     64.66667

           Root MSE          0.97895    R-Square    0.8814
           Dependent Mean    4.66667    Adj R-Sq    0.8370
           Coeff Var        20.97739

     Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II
                                   Sum of     Mean
  Variable   DF  Coefficient      Squares   Square  F Value  Pr > F  Label
  Intercept   1    4.6666667      261.333  261.333   272.70  <.0001  Intercept
  Class.X1a   1    0.8333333        4.167    4.167     4.35  0.0705  X1 a
  Class.X1b   1   -1.6666667       16.667   16.667    17.39  0.0031  X1 b
  Class.X2a   1    1.8333333       40.333   40.333    42.09  0.0002  X2 a

Figure 75.1. ANOVA Example Output from PROC TRANSREG
The iteration history in Figure 75.1 shows that the final R-Square of 0.88144 is reached on the first iteration.
This is followed by ANOVA, fit statistics, and regression tables. PROC TRANSREG uses an effects (also called deviations-from-means or 0, 1, -1) coding in this example. For more information on using PROC TRANSREG for ANOVA and other codings, see the "ANOVA Codings" section on page 4662.

The TRANSREG procedure produces the data set displayed in Figure 75.2.

                Introductory Main-Effects ANOVA Example

  Obs  _TYPE_    _NAME_   Y   Intercept   X1 a    X1 b    X2 a   X1  X2

    1  SCORE     ROW1     8      1.00     1.00    0.00    1.00   a   a
    2  SCORE     ROW2     7      1.00     1.00    0.00    1.00   a   a
    3  SCORE     ROW3     4      1.00     1.00    0.00   -1.00   a   b
    4  SCORE     ROW4     3      1.00     1.00    0.00   -1.00   a   b
    5  SCORE     ROW5     5      1.00     0.00    1.00    1.00   b   a
    6  SCORE     ROW6     4      1.00     0.00    1.00    1.00   b   a
    7  SCORE     ROW7     2      1.00     0.00    1.00   -1.00   b   b
    8  SCORE     ROW8     1      1.00     0.00    1.00   -1.00   b   b
    9  SCORE     ROW9     8      1.00    -1.00   -1.00    1.00   c   a
   10  SCORE     ROW10    7      1.00    -1.00   -1.00    1.00   c   a
   11  SCORE     ROW11    5      1.00    -1.00   -1.00   -1.00   c   b
   12  SCORE     ROW12    2      1.00    -1.00   -1.00   -1.00   c   b
   13  M COEFFI  Y        .      4.67     0.83   -1.67    1.83
   14  MEAN      Y        .       .       5.50    3.00    6.50

Figure 75.2. Output Data Set from PROC TRANSREG
The output data set has three kinds of observations, identified by values of _TYPE_.

• When _TYPE_='SCORE', the observation contains information on the dependent and independent variables as follows:
  - Y is the original dependent variable.
  - X1 and X2 are the independent classification variables, and the Intercept through X2 a columns contain the main-effects design matrix that PROC TRANSREG creates. The variable names are Intercept, X1a, X1b, and X2a. Their labels are shown in the listing.
• When _TYPE_='M COEFFI', the observation contains coefficients of the final linear model.
• When _TYPE_='MEAN', the observation contains the marginal means.

The observations with _TYPE_='SCORE' form the score partition of the data set, and the observations with _TYPE_='M COEFFI' and _TYPE_='MEAN' form the coefficient partition of the data set.
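Because the score and coefficient partitions are stacked in one data set, a short DATA step can separate them when you need to post-process one partition. The following is a minimal sketch; the input data set name trans_out is hypothetical and stands for whatever OUT= data set you created.

   *---Split a TRANSREG OUT= data set into its two partitions---;
   *---(trans_out is a hypothetical OUT= data set name)---;
   data scores coefficients;
      set trans_out;
      if _type_ = 'SCORE' then output scores;   /* score partition */
      else output coefficients;                 /* 'M COEFFI' and 'MEAN' rows */
   run;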
Detecting Nonlinear Relationships

The TRANSREG procedure can detect nonlinear relationships among variables. For example, suppose 400 observations are generated from the following function

$$t = \frac{x}{4} + \sin(x)$$

and data are created as follows

$$y = t + \epsilon$$

where $\epsilon$ is random normal error. The following statements find a cubic spline transformation of X with four knots. For information on using splines and knots, see the "Smoothing Splines" section on page 4596, the "Solving Standard Least-Squares Problems" section on page 4628, Example 75.1, and Example 75.4.

The following statements produce Figure 75.3 through Figure 75.4:

   title 'Curve Fitting Example';

   *---Create An Artificial Nonlinear Scatter Plot---;
   data Curve;
      Pi=constant('pi');
      Pi4=4*Pi;
      Increment=Pi4/400;
      do X=Increment to Pi4 by Increment;
         T=X/4 + sin(X);
         Y=T + normal(7);
         output;
      end;
   run;

   *---Request a Spline Transformation of X---;
   proc transreg data=Curve dummy;
      model identity(Y)=spline(X / nknots=4);
      output predicted;
      id T;
   run;

   *---Plot the Results---;
   goptions goutmode=replace nodisplay;
   %let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
   * Depending on your goptions, these plot options may work better:
   * %let opts = haxis=axis2 vaxis=axis1 frame;
   proc gplot;
      title;
      axis1 minor=none label=(angle=90 rotate=0);
      axis2 minor=none;
      plot T*X=2 / &opts name='tregin1';
      plot Y*X=1 / &opts name='tregin2';
      plot Y*X=1 T*X=2 PY*X=3 / &opts name='tregin3' overlay;
      symbol1 color=blue   v=star i=none;
      symbol2 color=yellow v=none i=join line=1;
      symbol3 color=red    v=none i=join line=2;
   run; quit;
   goptions display;

   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:tregin1 2:tregin3 3:tregin2;
   run; quit;
PROC TRANSREG increases the squared multiple correlation from the original value of 0.19945 to 0.47062. The plot of T by X shows the original function, the plot of Y by X shows the error-perturbed data, and the third plot shows the data, the true function as a solid curve, and the regression function as the dashed curve. The regression function closely approximates the true function.

                        Curve Fitting Example

                       The TRANSREG Procedure

      TRANSREG MORALS Algorithm Iteration History for Identity(Y)

     Iteration    Average    Maximum                Criterion
        Number     Change     Change    R-Square       Change    Note
     -------------------------------------------------------------------
             0    0.74855    1.29047     0.19945
             1    0.00000    0.00000     0.47062      0.27117    Converged

     Algorithm converged.
Figure 75.3. Curve Fitting Example Output
Figure 75.4. Plots for the Curve Fitting Example
Syntax

The following statements are available in PROC TRANSREG.

   PROC TRANSREG < DATA=SAS-data-set >
                 < OUTTEST=SAS-data-set > < a-options > < o-options > ;
      MODEL < transform(dependents < / t-options >)
            < transform(dependents < / t-options >)...> = >
            transform(independents < / t-options >)
            < transform(independents < / t-options >)...> < / a-options > ;
      OUTPUT < OUT=SAS-data-set > < o-options > ;
      ID variables ;
      FREQ variable ;
      WEIGHT variable ;
      BY variables ;

To use the TRANSREG procedure, you need the PROC TRANSREG and MODEL statements. To produce an OUT= output data set, the OUTPUT statement is required. PROC TRANSREG enables you to specify the same options in more than one statement. All of the MODEL statement a-options (algorithm options) and all of the OUTPUT statement o-options (output options) can also be specified in the PROC TRANSREG statement.

You can abbreviate all a-options, o-options, and t-options (transformation options) to their first three letters. This is a special feature of the TRANSREG procedure and is not generally true of other SAS/STAT procedures. See Table 75.1 on page 4554.

The rest of this section provides detailed syntax information for each of the preceding statements, beginning with the PROC TRANSREG statement. The remaining statements are described in alphabetical order.
PROC TRANSREG Statement

   PROC TRANSREG < DATA=SAS-data-set >
                 < OUTTEST=SAS-data-set > < a-options > < o-options > ;

The PROC TRANSREG statement starts the TRANSREG procedure. Optionally, this statement identifies an input and an OUTTEST= data set, specifies the algorithm and other computational details, requests displayed output, and controls the contents of the OUT= data set (which is created with the OUTPUT statement). The DATA= and OUTTEST= options can appear only in the PROC TRANSREG statement.

The following table summarizes options available in the PROC TRANSREG statement. All a-options and o-options are described in the sections on either the MODEL or OUTPUT statement, in which these options can also be specified.
Table 75.1. Options Available in the TRANSREG Procedure

Task                                                 Option             Statement

Identify input data set
  specifies input SAS data set                       DATA=              PROC

Output data set with test statistics
  specifies output test statistics data set          OUTTEST=           PROC

Input data set
  specifies input observation type                   TYPE=              MODEL
  restarts iterations                                REITERATE          MODEL

Specify method and control iterations
  specifies minimum criterion change                 CCONVERGE=         MODEL
  specifies minimum data change                      CONVERGE=          MODEL
  specifies canonical dummy-variable initialization  DUMMY              MODEL
  specifies maximum number of iterations             MAXITER=           MODEL
  specifies iterative algorithm                      METHOD=            MODEL
  specifies number of canonical variables            NCAN=              MODEL
  specifies singularity criterion                    SINGULAR=          MODEL

Control missing data handling
  METHOD=MORALS fits each model individually         INDIVIDUAL         MODEL
  includes monotone special missing values           MONOTONE=          MODEL
  excludes observations with missing values          NOMISS             MODEL
  unties special missing values                      UNTIE=             MODEL

Control intercept and CLASS variables
  CLASS dummy variable name prefix                   CPREFIX=           MODEL
  CLASS dummy variable label prefix                  LPREFIX=           MODEL
  no intercept or centering                          NOINT              MODEL
  order of class variable levels                     ORDER=             MODEL
  controls output of reference levels                REFERENCE=         MODEL
  CLASS dummy variable label separators              SEPARATORS=        MODEL

Control displayed output
  confidence limits alpha                            ALPHA=             MODEL
  displays parameter estimate confidence limits      CL                 MODEL
  displays model specification details               DETAIL             MODEL
  displays iteration histories                       HISTORY            MODEL
  suppresses displayed output                        NOPRINT            MODEL
  suppresses the iteration histories                 SHORT              MODEL
  displays regression results                        SS2                MODEL
  displays ANOVA table                               TEST               MODEL
  displays conjoint part-worth utilities             UTILITIES          MODEL

Control standardization
  fits additive model                                ADDITIVE           MODEL
  do not zero constant variables                     NOZEROCONSTANT     MODEL
  specifies transformation standardization           TSTANDARD=         MODEL

Predicted values, residuals, scores
  outputs canonical scores                           CANONICAL          OUTPUT
  outputs individual confidence limits               CLI                OUTPUT
  outputs mean confidence limits                     CLM                OUTPUT
  specifies design matrix coding                     DESIGN=            OUTPUT
  outputs leverage                                   LEVERAGE           OUTPUT
  does not restore missing values                    NORESTOREMISSING   OUTPUT
  suppresses output of scores                        NOSCORES           OUTPUT
  outputs predicted values                           PREDICTED          OUTPUT
  outputs redundancy variables                       REDUNDANCY=        OUTPUT
  outputs residuals                                  RESIDUALS          OUTPUT

Output data set replacement
  replaces dependent variables                       DREPLACE           OUTPUT
  replaces independent variables                     IREPLACE           OUTPUT
  replaces all variables                             REPLACE            OUTPUT

Output data set coefficients
  outputs coefficients                               COEFFICIENTS       OUTPUT
  outputs ideal point coordinates                    COORDINATES        OUTPUT
  outputs marginal means                             MEANS              OUTPUT
  outputs redundancy analysis coefficients           MREDUNDANCY        OUTPUT

Output data set variable name prefixes
  dependent variable approximations                  ADPREFIX=          OUTPUT
  independent variable approximations                AIPREFIX=          OUTPUT
  canonical dependent variables                      CDPREFIX=          OUTPUT
  conservative-individual-lower CL                   CILPREFIX=         OUTPUT
  canonical independent variables                    CIPREFIX=          OUTPUT
  conservative-individual-upper CL                   CIUPREFIX=         OUTPUT
  conservative-mean-lower CL                         CMLPREFIX=         OUTPUT
  conservative-mean-upper CL                         CMUPREFIX=         OUTPUT
  METHOD=MORALS untransformed dependent              DEPENDENT=         OUTPUT
  liberal-individual-lower CL                        LILPREFIX=         OUTPUT
  liberal-individual-upper CL                        LIUPREFIX=         OUTPUT
  liberal-mean-lower CL                              LMLPREFIX=         OUTPUT
  liberal-mean-upper CL                              LMUPREFIX=         OUTPUT
  residuals                                          RDPREFIX=          OUTPUT
  predicted values                                   PPREFIX=           OUTPUT
  redundancy variables                               RPREFIX=           OUTPUT
  transformed dependents                             TDPREFIX=          OUTPUT
  transformed independents                           TIPREFIX=          OUTPUT

Output data set macros
  creates macro variables                            MACRO              OUTPUT

Output data set details
  dependent and independent approximations           APPROXIMATIONS     OUTPUT
  canonical correlation coefficients                 CCC                OUTPUT
  canonical elliptical point coordinates             CEC                OUTPUT
  canonical point coordinates                        CPC                OUTPUT
  canonical quadratic point coordinates              CQC                OUTPUT
  approximations to transformed dependents           DAPPROXIMATIONS    OUTPUT
  approximations to transformed independents         IAPPROXIMATIONS    OUTPUT
  elliptical point coordinates                       MEC                OUTPUT
  point coordinates                                  MPC                OUTPUT
  quadratic point coordinates                        MQC                OUTPUT
  multiple regression coefficients                   MRC                OUTPUT
DATA=SAS-data-set
   specifies the SAS data set to be analyzed. If you do not specify the DATA= option, PROC TRANSREG uses the most recently created SAS data set. The data set must be an ordinary SAS data set; it cannot be a special TYPE= data set.

OUTTEST=SAS-data-set
   specifies an output data set to contain hypothesis test results. When you specify the OUTTEST= option, the data set contains ANOVA results. When you specify the SS2 a-option, regression tables are also output. When you specify the UTILITIES o-option, conjoint analysis part-worth utilities are also output. For more information on the OUTTEST= data set, see the "OUTTEST= Output Data Set" section on page 4626.
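For instance, the following sketch saves the ANOVA and regression tables to a data set named tests for later printing or post-processing; the input data set exp and its variables are hypothetical, and SS2 is specified so that the regression tables are included:

   *---A minimal sketch; exp, y, x1, and x2 are hypothetical---;
   proc transreg data=exp ss2 outtest=tests;
      model identity(y) = class(x1 x2 / effects);
      output;
   run;

   proc print data=tests;
   run;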
BY Statement

   BY variables ;

You can specify a BY statement with PROC TRANSREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the TRANSREG procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
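For example, the following sketch sorts by a grouping variable and then fits a separate spline model within each group; the data set students and its variables are hypothetical:

   *---A minimal sketch; students, score, age, and gender are hypothetical---;
   proc sort data=students;
      by gender;
   run;

   proc transreg data=students;
      model identity(score) = spline(age);
      output out=fits predicted;
      by gender;               /* one analysis per gender group */
   run;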
FREQ Statement

   FREQ variable ;

If one variable in the input data set represents the frequency of occurrence for other values in the observation, specify the variable's name in a FREQ statement. PROC TRANSREG then treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation. Noninteger values of the FREQ variable are truncated to the largest integer less than the FREQ value. The observation is used in the analysis only if the value of the FREQ statement variable is greater than or equal to 1.
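For instance, if each row of a hypothetical data set summary represents count identical observations, a sketch of FREQ statement usage is:

   *---A minimal sketch; summary, y, x1, x2, and count are hypothetical---;
   proc transreg data=summary;
      model identity(y) = class(x1 x2);
      freq count;              /* each row counts as 'count' observations */
      output;
   run;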
ID Statement

   ID variables ;

The ID statement includes additional character or numeric variables in the OUT= data set. The variables must be contained in the input data set.
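For example, this sketch (with hypothetical data set and variable names) carries the identifiers subject and date through to the OUT= data set alongside the scores:

   *---A minimal sketch; survey, rating, price, subject, and date are hypothetical---;
   proc transreg data=survey;
      model identity(rating) = monotone(price);
      id subject date;         /* copied to the OUT= data set */
      output out=results predicted;
   run;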
MODEL Statement

   MODEL < transform(dependents < / t-options >)
         < transform(dependents < / t-options >)...> = >
         transform(independents < / t-options >)
         < transform(independents < / t-options >)...> < / a-options > ;

The MODEL statement specifies the dependent and independent variables (dependents and independents, respectively) and specifies the transformation (transform) to apply to each variable. Only one MODEL statement can appear in the TRANSREG procedure. The t-options are transformation options, and the a-options are the algorithm options. The t-options provide details for the transformation; these depend on the transform chosen. The t-options are listed after a slash in the parentheses that enclose the variable list (either dependents or independents). The a-options control the algorithm used, details of iteration, details of how the intercept and dummy variables are generated, and displayed output details. The a-options are listed after the entire model specification (the dependents, independents, transformations, and t-options) and after a slash. You can also specify the algorithm options in the PROC TRANSREG statement. When you specify the DESIGN o-option, dependents and an equal sign are not required.

The operators "*", "|", and "@" from the GLM procedure are available for interactions with the CLASS expansion and the IDENTITY transformation:

   class(a * b  c | d  e | f ... @ n)
   identity(a * b  c | d  e | f ... @ n)
In addition, transformations and spline expansions can be crossed with classification variables:

   transform(var) * class(group)
   transform(var) | class(group)

See the "Types of Effects" section on page 1784 in Chapter 32, "The GLM Procedure," for a description of the @, *, and | operators, and see the "Model Statement Usage" section on page 4592 for information on how to use these operators in PROC TRANSREG. Note that nesting is not allowed in PROC TRANSREG.

The next three sections discuss the transformations available (transforms) (see the "Families of Transformations" section on page 4558), the transformation options (t-options) (see the "Transformation Options (t-options)" section on page 4564), and the algorithm options (a-options) (see the "Algorithm Options (a-options)" section on page 4573).
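As a concrete sketch of these operators (all variable names here are hypothetical), the first model below requests main effects and the two-way interaction of two classification variables, and the second crosses a spline of z with the classification variable group:

   *---Main effects of x1 and x2 plus their interaction---;
   proc transreg;
      model identity(y) = class(x1 | x2);
      output;
   run;

   *---A spline of z crossed with the classification variable group---;
   proc transreg;
      model identity(y) = spline(z / nknots=3) * class(group);
      output;
   run;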
Families of Transformations

In the MODEL statement, transform specifies a transformation in one of four families.

Variable expansions           preprocess the specified variables, replacing them with more variables.

Nonoptimal transformations    preprocess the specified variables, replacing each one with a single new nonoptimal, nonlinear transformation.

Optimal transformations       replace the specified variables with new, iteratively derived optimal transformation variables that fit the specified model better than the original variable (except for contrived cases where the transformation fits the model exactly as well as the original variable).

Other transformations         are the IDENTITY and SSPLINE transformations. These do not fit into the preceding categories.
The following table summarizes the transformations in each family.
Family                                  Members of Family

Variable expansions
  B-spline basis                        BSPLINE
  set of dummy variables                CLASS
  elliptical response surface           EPOINT
  circular response surface             POINT
  piecewise polynomial basis            PSPLINE
  quadratic response surface            QPOINT

Nonoptimal transformations
  inverse trigonometric sine            ARSIN
  Box-Cox                               BOXCOX
  exponential                           EXP
  logarithm                             LOG
  logit                                 LOGIT
  raises variables to specified power   POWER
  transforms to ranks                   RANK
  noniterative smoothing spline         SMOOTH

Optimal transformations
  linear                                LINEAR
  monotonic, ties preserved             MONOTONE
  monotonic B-spline                    MSPLINE
  optimal scoring                       OPSCORE
  B-spline                              SPLINE
  monotonic, ties not preserved         UNTIE

Other transformations
  identity, no transformation           IDENTITY
  iterative smoothing spline            SSPLINE
You can use any transformation with either dependent or independent variables (except the SMOOTH transformation, which can be used only with independent variables, and BOXCOX, which can be used only with dependent variables). However, the variable expansions are usually more appropriate for independent variables.

The transform is followed by a variable (or list of variables) enclosed in parentheses. Optionally, depending on the transform, the parentheses can also contain t-options, which follow the variables and a slash. For example,

   model log(y)=class(x);

finds a LOG transformation of Y and performs a CLASS expansion of X.

   model identity(y) = spline(x1 x2 / nknots=3);
The preceding statement finds SPLINE transformations of X1 and X2. The NKNOTS= t-option used with the SPLINE transformation specifies three knots. The IDENTITY(Y) transformation specifies that Y is not to be transformed.

The rest of this section provides syntax details for members of the four families of transformations. The t-options are discussed in the "Transformation Options (t-options)" section on page 4564.

Variable Expansions
The TRANSREG procedure performs variable expansions before iteration begins. Variable expansions expand the original variables into a typically larger set of new variables. The original variables are those that are listed in parentheses after transform, and they are sometimes referred to by the name of the transform. For example, in CLASS(X1 X2), X1 and X2 are sometimes referred to as CLASS expansion variables or simply CLASS variables, and the expanded variables are referred to as dummy variables. Similarly, in POINT(Dim1 Dim2), Dim1 and Dim2 are sometimes referred to as POINT variables. The resulting variables are not transformed by the iterative algorithms after the initial preprocessing. Observations with missing values for these types of variables are excluded from the analysis.

The POINT, EPOINT, and QPOINT variable expansions are used in preference mapping analyses (also called PREFMAP, external unfolding, ideal point regression) (Carroll 1972) and for response surface regressions. These three expansions create circular, elliptical, and quadratic response or preference surfaces (see the "Point Models" section on page 4605 and Example 75.5). The CLASS variable expansion is used for main-effects ANOVA.

The following list provides syntax and details for the variable expansion transforms.

BSPLINE
BSP
   expands each variable to a B-spline basis. You can specify the DEGREE=, KNOTS=, NKNOTS=, and EVENLY t-options with the BSPLINE expansion. When DEGREE=n (3 by default) with k knots (0 by default), n + k + 1 variables are created. In addition, the original variable appears in the OUT= data set before the ID variables. For example, BSPLINE(X) expands X into X_0 X_1 X_2 X_3 and outputs X as well. The X_: variables contain the B-spline (which are the same basis vectors that the SPLINE and MSPLINE transformations use internally). The columns of the BSPLINE expansion sum to a column of ones, so an implicit intercept model is fit when the BSPLINE expansion is specified. If you specify the BSPLINE expansion for more than one variable, the model is less than full rank. See the section "SPLINE, BSPLINE, and PSPLINE Comparisons" on page 4614. Variables following BSPLINE must be numeric, and they are typically continuous.

CLASS
CLA
   expands the variables to a set of dummy variables. For example, CLASS(X1 X2) is used for a simple main-effects model, CLASS(X1 | X2) fits a main-effects and interactions model, and CLASS(X1|X2|X3|X4@2 X1*X2*X3) creates all main effects, all two-way interactions, and one three-way interaction. See the "Model Statement Usage" section on page 4592 for information on how to use the operators @, *, and | in PROC TRANSREG. To determine class membership, PROC TRANSREG uses the values of the formatted variables. Variables following CLASS can be either character or numeric; numeric variables should be discrete.

EPOINT
EPO
   expands the variables for an elliptical response surface regression or for an elliptical ideal point regression. Specify the COORDINATES o-option to output PREFMAP ideal elliptical point model coordinates to the OUT= data set. Each axis of the ellipse (or ellipsoid) is oriented in the same direction as one of the variables. The EPOINT expansion creates a new variable for each original variable. The value of each new variable is the square of each observed value for the corresponding parenthesized variable. The regression analysis then uses both sets of variables (original and squared). Variables following EPOINT must be numeric, and they are typically continuous.

POINT
POI
   expands the variables for a circular response surface regression or for a circular ideal point regression. Specify the COORDINATES o-option to output PREFMAP ideal point model coordinates to the OUT= data set. The POINT expansion creates a new variable having a value for each observation that is the sum of squares of all the POINT variables. This new variable is added to the set of variables and is used in the regression analysis. For more on ideal point regression, refer to Carroll (1972). Variables following POINT must be numeric, and they are typically continuous.

PSPLINE
PSP
   expands each variable to a piecewise polynomial basis. You can specify the DEGREE=, KNOTS=, NKNOTS=, and EVENLY t-options with PSPLINE. When DEGREE=n (3 by default) with k knots (0 by default), n + k variables are created. In addition, the original variable appears in the OUT= data set before the ID variables. For example, PSPLINE(X / NKNOTS=1) expands X into X_1 X_2 X_3 X_4 and outputs X as well. Unlike BSPLINE, an intercept is not implicit in the columns of PSPLINE. Refer to Smith (1979) for a good introduction to piecewise polynomial splines. Also see the section "SPLINE, BSPLINE, and PSPLINE Comparisons" on page 4614. Variables following PSPLINE must be numeric, and they are typically continuous.

QPOINT
QPO
   expands the variables for a quadratic response surface regression or for a quadratic ideal point regression. Specify the COORDINATES o-option to output PREFMAP quadratic ideal point model coordinates to the OUT= data set. For m QPOINT variables, m(m+1)/2 new variables are created containing the squares and crossproducts of the original variables. The regression analysis uses both sets (original and crossed). Variables following QPOINT must be numeric, and they are typically continuous.
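As a sketch of a point model (variable names hypothetical), the following requests a circular ideal point regression of a preference rating on two perceptual dimensions and writes the ideal point coordinates to the OUT= data set via the COORDINATES o-option:

   *---A minimal sketch; rating, dim1, and dim2 are hypothetical---;
   proc transreg;
      model identity(rating) = point(dim1 dim2);
      output out=ideal coordinates;   /* ideal point coordinates in OUT= */
   run;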
Nonoptimal Transformations

Like variable expansions, nonoptimal transformations are computed before the iterative algorithm begins. Nonoptimal transformations create a single new transformed variable that replaces the original variable. The new variable is not transformed by the subsequent iterative algorithms (except for a possible linear transformation with missing value estimation).

The following list provides syntax and details for nonoptimal variable transformations.

ARSIN
ARS
   finds an inverse trigonometric sine transformation. Variables following ARSIN must be numeric, in the interval (−1.0 ≤ X ≤ 1.0), and they are typically continuous.

BOXCOX
BOX
   finds a Box-Cox transformation of the specified variables (see the "Box-Cox Transformations" section on page 4595 and Example 75.6). The BOXCOX transformation can be used only with dependent variables. The ALPHA=, CLL=, CONVENIENT, GEOMETRICMEAN, LAMBDA=, and PARAMETER= t-options can be used with the BOXCOX transformation. Variables following BOXCOX must be numeric, and they are typically continuous.
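For instance, a sketch of a Box-Cox request (hypothetical variables y and x) that searches a grid of power parameters and rounds the estimate to a convenient value might look like this:

   *---A minimal sketch; y and x are hypothetical---;
   proc transreg;
      model boxcox(y / lambda=-2 to 2 by 0.25 convenient) = identity(x);
      output;
   run;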
EXP
   exponentiates variables (the variable X is transformed to a^X). To specify the value of a, use the PARAMETER= t-option. By default, a is the mathematical constant e = 2.718.... Variables following EXP must be numeric, and they are typically continuous.

LOG
   transforms variables to logarithms (the variable X is transformed to log_a(X)). To specify the base of the logarithm, use the PARAMETER= t-option. The default is a natural logarithm with base e = 2.718.... Variables following LOG must be numeric and positive, and they are typically continuous.

LOGIT
   finds a logit transformation on the variables. The logit of X is log(X/(1−X)). Unlike other transformations, LOGIT does not have a three-letter abbreviation. Variables following LOGIT must be numeric, in the interval (0.0 < X < 1.0), and they are typically continuous.

POWER
POW
   raises variables to a specified power (the variable X is transformed to X^a). You must specify the power parameter a by specifying the PARAMETER= t-option following the variables:

      power(variable / parameter=number)
   You can use POWER for squaring variables (PARAMETER=2), reciprocal transformations (PARAMETER=−1), square roots (PARAMETER=0.5), and so on. Variables following POWER must be numeric, and they are typically continuous.
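As a sketch of the PARAMETER= t-option in a model (hypothetical variables y and x), the following takes a base-10 logarithm of the dependent variable and a reciprocal of the independent variable:

   *---A minimal sketch; y and x are hypothetical---;
   proc transreg;
      model log(y / parameter=10) = power(x / parameter=-1);
      output;
   run;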
RANK
RAN
   transforms variables to ranks. Ranks are averaged within ties. The smallest input value is assigned the smallest rank. Variables following RANK must be numeric.

SMOOTH
SMO
   is a noniterative smoothing spline transformation. You can specify the smoothing parameter with either the SM= or the PARAMETER= t-option. The default smoothing parameter is SM=0. Variables following SMOOTH must be numeric, and they are typically continuous. The SMOOTH transformation can be used only with independent variables. For more information, see the "Smoothing Splines" section on page 4596.
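A sketch of a SMOOTH request (hypothetical variables y and x) with a SAS/GRAPH-style smoothing parameter of 50:

   *---A minimal sketch; y and x are hypothetical---;
   proc transreg;
      model identity(y) = smooth(x / sm=50);
      output out=smoothed predicted;
   run;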
Optimal Transformations

Optimal transformations are iteratively derived. Missing values for these types of variables can be optimally estimated (see the "Missing Values" section on page 4599).

The following list provides syntax and details for optimal transformations.

LINEAR
LIN
   finds an optimal linear transformation of each variable. For variables with no missing values, the transformed variable is the same as the original variable. For variables with missing values, the transformed nonmissing values have a different scale and origin than the original values. Variables following LINEAR must be numeric.

MONOTONE
MON
   finds a monotonic transformation of each variable, with the restriction that ties are preserved. The Kruskal (1964) secondary least-squares monotonic transformation is used. This transformation weakly preserves order and category membership (ties). Variables following MONOTONE must be numeric, and they are typically discrete.

MSPLINE
MSP
   finds a monotonically increasing B-spline transformation with monotonic coefficients (de Boor 1978; de Leeuw 1986) of each variable. You can specify the DEGREE=, KNOTS=, NKNOTS=, and EVENLY t-options with MSPLINE. By default, PROC TRANSREG uses a quadratic spline. Variables following MSPLINE must be numeric, and they are typically continuous.

OPSCORE
OPS
   finds an optimal scoring of each variable. The OPSCORE transformation assigns scores to each class (level) of the variable. Fisher's (1938) optimal scoring method is used. Variables following OPSCORE can be either character or numeric; numeric variables should be discrete.
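As a sketch combining two optimal transformations (hypothetical variables y, region, and rating), the following optimally scores a nominal variable and monotonically transforms an ordinal one:

   *---A minimal sketch; y, region, and rating are hypothetical---;
   proc transreg;
      model identity(y) = opscore(region) monotone(rating);
      output;
   run;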
SPLINE
SPL
   finds a B-spline transformation (de Boor 1978) of each variable. By default, PROC TRANSREG uses a cubic polynomial transformation. You can specify the DEGREE=, KNOTS=, NKNOTS=, and EVENLY t-options with SPLINE. Variables following SPLINE must be numeric, and they are typically continuous.

UNTIE
UNT
   finds a monotonic transformation of each variable without the restriction that ties are preserved. The TRANSREG procedure uses the Kruskal (1964) primary least-squares monotonic transformation method. This transformation weakly preserves order but not category membership (it may untie some previously tied values). Variables following UNTIE must be numeric, and they are typically discrete.

Other Transformations

IDENTITY
IDE
   specifies variables that are not changed by the iterations. Typically, the IDENTITY transformation is used with a simple variable list, such as IDENTITY(X1-X5). However, you can also specify interaction terms. For example, IDENTITY(X1 | X2) creates X1, X2, and the product X1*X2; and IDENTITY(X1 | X2 | X3) creates X1, X2, X1*X2, X3, X1*X3, X2*X3, and X1*X2*X3. See the "Model Statement Usage" section on page 4592 for information on how to use the operators @, *, and | in PROC TRANSREG. The IDENTITY transformation is used for variables when no transformation and no missing data estimation are desired. However, the REFLECT t-option, the ADDITIVE a-option, and the TSTANDARD=Z and TSTANDARD=CENTER options can linearly transform all variables, including IDENTITY variables, after the iterations. Observations with missing values in IDENTITY variables are excluded from the analysis, and no optimal scores are computed for missing values in IDENTITY variables. Variables following IDENTITY must be numeric.

SSPLINE
SSP
   finds an iterative smoothing spline transformation of each variable. The SSPLINE transformation does not generally minimize squared error. You can specify the smoothing parameter with either the SM= t-option or the PARAMETER= t-option. The default smoothing parameter is SM=0. Variables following SSPLINE must be numeric, and they are typically continuous.
Transformation Options (t-options)

If you use a nonoptimal, optimal, or other transformation, you can use t-options, which specify additional details of the transformation. The t-options are specified within the parentheses that enclose variables and are listed after a slash. You can use t-options with both dependent and independent variables. For example,
   proc transreg;
      model identity(y)=spline(x / nknots=3);
      output;
   run;
The preceding statements find an optimal variable transformation (SPLINE) of the independent variable, and they use a t-option to specify the number of knots (NKNOTS=). The following is a more complex example:

   proc transreg;
      model mspline(y / nknots=3)=class(x1 x2 / effects);
      output;
   run;
These statements find a monotone spline transformation (MSPLINE with three knots) of the dependent variable and perform a CLASS expansion with effects coding of the independents.

The following sections discuss the t-options available for nonoptimal, optimal, and other transformations. The following table summarizes the t-options.

Table 75.2. t-options Available in the MODEL Statement
Task                                              Option

Nonoptimal transformation t-options
  uses original mean and variance                 ORIGINAL

Parameter t-options
  specifies miscellaneous parameters              PARAMETER=
  specifies smoothing parameter                   SM=

Spline t-options
  specifies the degree of the spline              DEGREE=
  spaces the knots evenly                         EVENLY
  exterior knots                                  EXKNOTS=
  specifies the interior knots or break points    KNOTS=
  creates n knots                                 NKNOTS=

CLASS variable t-options
  CLASS dummy variable name prefix                CPREFIX=
  requests a deviations-from-means coding         DEVIATIONS
  requests a deviations-from-means coding         EFFECTS
  CLASS dummy variable label prefix               LPREFIX=
  order of class variable levels                  ORDER=
  CLASS dummy variable label separators           SEPARATORS=
  controls reference levels                       ZERO=

BOXCOX t-options
  confidence interval alpha                       ALPHA=
  convenient lambda list                          CLL=
  use a convenient lambda                         CONVENIENT
  scale the transformation using geometric mean   GEOMETRICMEAN
  power parameter list                            LAMBDA=

Other t-options
  operations occur after the expansion            AFTER
  centers before the analysis begins              CENTER
  renames variables                               NAME=
  reflects the variable around the mean           REFLECT
  specifies transformation standardization        TSTANDARD=
  standardizes before the analysis begins         Z
Nonoptimal Transformation t-options

ORIGINAL ORI
matches the variable’s final mean and variance to the mean and variance of the original variable. By default, the mean and variance are based on the transformed values. The ORIGINAL t-option is available for all of the nonoptimal transformations.

Parameter t-options

PARAMETER=number PAR=number
specifies the transformation parameter. The PARAMETER= t-option is available for the BOXCOX, EXP, LOG, POWER, SMOOTH, and SSPLINE transformations. For BOXCOX, the parameter is the value to add to each value of the BOXCOX variable before a Box-Cox transformation. For EXP, the parameter is the value to be exponentiated; for LOG, the parameter is the base value; and for POWER, the parameter is the power. For SMOOTH and SSPLINE, the parameter is the raw smoothing parameter. (You can specify a SAS/GRAPH-style smoothing parameter with the SM= t-option.) The default for the PARAMETER= t-option for the BOXCOX transformation is 0, and for the LOG and EXP transformations it is e = 2.718 . . .. The default parameter for SMOOTH and SSPLINE is computed from SM=0. For the POWER transformation, you must specify the PARAMETER= t-option; there is no default.
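For example, the following sketch requests a base-10 logarithmic transformation; the data set A and the variables Y and X are hypothetical:

   proc transreg data=a;
      /* parameter= gives the base for LOG, so this is log base 10 of x */
      model identity(y) = log(x / parameter=10);
      output;
   run;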
SM=n

specifies a SAS/GRAPH-style smoothing parameter in the range 0 to 100. You can specify the SM= t-option only with the SMOOTH and SSPLINE transformations. The smoothness of the function increases as the value of the smoothing parameter increases. By default, SM=0.
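As a sketch of the SM= t-option (again with hypothetical data set and variable names), the following statements request a noticeably smoother fit than the SM=0 default:

   proc transreg data=a;
      /* sm= is the SAS/GRAPH-style smoothing parameter, 0 to 100 */
      model identity(y) = smooth(x / sm=50);
      output;
   run;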
Spline t-options

The following t-options are available with the SPLINE and MSPLINE optimal transformations and the PSPLINE and BSPLINE expansions.
DEGREE=n DEG=n
specifies the degree of the spline transformation. The degree must be a nonnegative integer. The defaults are DEGREE=3 for SPLINE, PSPLINE, and BSPLINE variables and DEGREE=2 for MSPLINE variables. The polynomial degree should be a small integer, usually 0, 1, 2, or 3. Larger values are rarely useful. If you have any doubt as to what degree to specify, use the default.

EVENLY EVE
is used with the NKNOTS= t-option to space the knots evenly. The differences between adjacent knots are constant. If you specify NKNOTS=k, k knots are created at minimum + i((maximum − minimum)/(k + 1)) for i = 1, . . . , k. For example, if you specify

   spline(X / nknots=2 evenly)
and the variable X has a minimum of 4 and a maximum of 10, then the two interior knots are 6 and 8. Without the EVENLY t-option, the NKNOTS= t-option places knots at percentiles, so the knots are not evenly spaced.
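A sketch that reproduces this example, assuming a hypothetical data set A in which X runs from 4 to 10:

   proc transreg data=a;
      /* two evenly spaced interior knots: 6 and 8 */
      model identity(y) = spline(x / nknots=2 evenly);
      output;
   run;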
EXKNOTS=number-list | n TO m BY p EXK=number-list | n TO m BY p

specifies exterior knots for SPLINE and MSPLINE transformations and BSPLINE expansions. Usually, this option is not needed; PROC TRANSREG automatically picks suitable exterior knots. The only time you need this option is when you want to ensure that exactly the same basis is used for different splines, for example, when applying coefficients from one spline transformation to a variable in a different data set (see “Scoring Spline Variables” at the end of Example 75.1). Specify one or two values. If the minimum EXKNOTS= value is less than the minimum data value, it is used as the exterior knot. If the maximum EXKNOTS= value is greater than the maximum data value, it is used as the exterior knot. Otherwise, these values are ignored. When EXKNOTS= is specified with the CENTER or Z t-options, the knots apply to the original variable, not to the centered or standardized variable. The B-spline transformations and expansions use a knot list consisting of exterior knots (values just smaller than the minimum), the specified (interior) knots, and exterior knots (values just larger than the maximum). You can use the DETAILS option to see all of these knots. Using different exterior knots gives different but equivalent B-spline bases. You can specify exterior knots with either the KNOTS= or EXKNOTS= t-option; however, for the BSPLINE expansion, the KNOTS= t-option creates extra all-zero basis columns, whereas the EXKNOTS= t-option gives you the correct basis.
KNOTS=number-list | n TO m BY p KNO=number-list | n TO m BY p
specifies the interior knots or break points. By default, there are no knots. The first time you specify a value in the knot list, it indicates a discontinuity in the nth (from DEGREE=n) derivative of the transformation function at the value of the knot. The second mention of a value indicates a discontinuity in the (n − 1)th derivative of the transformation function at the value of the knot. Knots can be repeated any number of times for decreasing smoothness at the break points, but the values in the knot list can never decrease. You cannot use the KNOTS= t-option with the NKNOTS= t-option. You should keep the number of knots small (see the section “Specifying the Number of Knots” on page 4613).

NKNOTS=n NKN=n
creates n knots, the first at the 100/(n + 1) percentile, the second at the 200/(n + 1) percentile, and so on. Knots are always placed at data values; there is no interpolation. For example, if NKNOTS=3, knots are placed at the twenty-fifth percentile, the median, and the seventy-fifth percentile. By default, NKNOTS=0. The NKNOTS= t-option must be ≥ 0. You cannot use the NKNOTS= t-option with the KNOTS= t-option. You should keep the number of knots small (see the section “Specifying the Number of Knots” on page 4613).

CLASS Variable t-options

CPREFIX=n | number-list CPR=n | number-list
specifies the number of first characters of a CLASS expansion variable’s name to use in constructing names for dummy variables. When CPREFIX= is specified as an a-option (see the description of the CPREFIX= a-option on page 4575) or an o-option, it specifies the default for all CLASS variables. When you specify CPREFIX= as a t-option, it overrides the default only for selected variables. A different CPREFIX= value can be specified for each CLASS variable by specifying the CPREFIX=number-list t-option, like the ZERO=formatted-value-list t-option.

DEVIATIONS DEV
EFFECTS EFF
requests a deviations-from-means coding of CLASS variables. The coded design matrix has values of 0, 1, and −1 for reference levels. This coding is referred to as “deviations-from-means,” “effects,” “center-point,” or “full-rank” coding.

LPREFIX=n | number-list LPR=n | number-list
specifies the number of first characters of a CLASS expansion variable’s label (or name if no label is specified) to use in constructing labels for dummy variables. When LPREFIX= is specified as an a-option (see the description of the LPREFIX= a-option
on page 4576) or an o-option, it specifies the default for all CLASS variables. When you specify LPREFIX= as a t-option, it overrides the default only for selected variables. A different LPREFIX= value can be specified for each CLASS variable by specifying the LPREFIX=number-list t-option, like the ZERO=formatted-value-list t-option.

ORDER=DATA | FREQ | FORMATTED | INTERNAL ORD=DAT | FRE | FOR | INT
specifies the order in which the CLASS variable levels are to be reported. The default is ORDER=INTERNAL. For ORDER=FORMATTED and ORDER=INTERNAL, the sort order is machine dependent. When ORDER= is specified as an a-option (see the description of the ORDER= a-option on page 4578) or as an o-option, it specifies the default ordering for all CLASS variables. When you specify ORDER= as a t-option, it overrides the default ordering only for selected variables. You can specify a different ORDER= value for each CLASS specification.

SEPARATORS='string-1' <'string-2'> SEP='string-1' <'string-2'>
specifies separators for creating CLASS expansion variable labels. By default, SEPARATORS=' ' ' * ' (“blank” and “blank asterisk blank”). When SEPARATORS= is specified as an a-option (see the description of the SEPARATORS= a-option on page 4579) or an o-option, it specifies the default separators for all CLASS variables. When you specify SEPARATORS= as a t-option, it overrides the default only for selected variables. You can specify a different SEPARATORS= value for each CLASS specification.

ZERO=FIRST | LAST | NONE | SUM ZER=FIR | LAS | NON | SUM
ZERO='formatted-value' <'formatted-value' ...>
is used with CLASS variables. The default is ZERO=LAST. The specification CLASS(variable / ZERO=FIRST) sets to missing the dummy variable for the first of the sorted categories, implying a zero coefficient for that category. The specification CLASS(variable / ZERO=LAST) sets to missing the dummy variable for the last of the sorted categories, implying a zero coefficient for that category. The specification CLASS(variable / ZERO='formatted-value') sets to missing the dummy variable for the category with a formatted value that matches 'formatted-value', implying a zero coefficient for that category. With ZERO=formatted-value-list, the first formatted value applies to the first variable in the specification, the second formatted value applies to the next variable that was not previously mentioned, and so on. For example, CLASS(A A*B B B*C C / ZERO='x' 'y' 'z') specifies that the reference level for A is 'x', for B is 'y', and for C is 'z'. With ZERO='formatted-value', the procedure first looks for exact matches between the formatted values and the specified value. If none are found, leading blanks are stripped from both and the values are compared again. If zero or two or more matches are found, warnings are issued. The specifications ZERO=FIRST, ZERO=LAST, and ZERO='formatted-value' are used for reference cell models. The Intercept parameter estimate is the marginal
mean for the reference cell, and the other marginal means are obtained by adding the intercept to the dummy variable coefficients. The specification CLASS(variable / ZERO=NONE) sets to missing none of the dummy variables. The columns of the expansion sum to a column of ones, so an implicit intercept model is fit. If you specify ZERO=NONE for more than one variable, the model is less than full rank. In the model MODEL IDENTITY(Y) = CLASS(X / ZERO=NONE), the coefficients are cell means. The specification CLASS(variable / ZERO=SUM) sets to missing none of the dummy variables, and the coefficients for the dummy variables created from the variable sum to 0. This creates a less-than-full-rank model, but the coefficients are uniquely determined due to the sum-to-zero constraint. In the presence of iterative transformations, hypothesis tests for ZERO=NONE and ZERO=SUM levels are not exact; they are liberal because a model with an explicit intercept is fit inside the iterations. There is no provision for adjusting the transformations while setting to 0 a parameter that is redundant given the explicit intercept and the other parameters.
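As a sketch of a reference cell model (the data set EXP and the variables are hypothetical), the following statements make the first sorted level of Treatment the reference level and write the coefficients and marginal means to the OUT= data set:

   proc transreg data=exp;
      model identity(y) = class(treatment / zero=first);
      output coefficients;   /* also outputs marginal means for CLASS expansions */
   run;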
Box-Cox t-options

The following t-options are available only with the BOXCOX transformation of the dependent variable (see the “Box-Cox Transformations” section on page 4595 and Example 75.6).

ALPHA=p ALP=p
specifies the Box-Cox alpha for the confidence interval for the power parameter. By default, ALPHA=0.05.

CLL=number-list
specifies the Box-Cox convenient lambda list. When the confidence interval for the power parameter includes one of the values in this list, PROC TRANSREG reports it and can optionally use the convenient power parameter instead of the more optimal power parameter. The default is CLL=1.0 0.0 0.5 -1.0 -0.5 2.0 -2.0 3.0 -3.0. By default, a linear transformation is preferred over log, square root, inverse, inverse square root, quadratic, inverse quadratic, cubic, and inverse cubic. If you specify the CONVENIENT t-option, then PROC TRANSREG uses the first convenient power parameter in the list that is in the confidence interval. For example, if the optimal power parameter is 0.25 and 0.0 is in the confidence interval but not 1.0, then the convenient power parameter is 0.0.

CONVENIENT CON
specifies that a power parameter from the CLL= t-option list is to be used for the final transformation instead of the LAMBDA= t-option value if a CLL= value is in the confidence interval. See the CLL= t-option for more information on its usage.
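For example, this sketch (hypothetical data set and variables) searches a fine grid of power parameters but settles on a convenient power such as 0 or 0.5 when one is in the confidence interval:

   proc transreg data=a;
      model boxcox(y / convenient lambda=-2 to 2 by 0.05) = identity(x);
      output;
   run;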
GEOMETRICMEAN GEO
divides the Box-Cox transformation by ẏ^(λ−1), where ẏ is the geometric mean of the variable to be transformed. This form of the Box-Cox transformation essentially converts the transformation back to the original units and hence allows direct comparison of the residual sums of squares for models with different power parameters.

LAMBDA=number-list LAM=number-list
specifies a list of Box-Cox power parameters. The default is LAMBDA=-3 TO 3 BY 0.25. PROC TRANSREG tries each power parameter in the list and picks the best one. However, when you specify the CONVENIENT t-option, a convenient power parameter from the CLL= list can be used instead when it is in the confidence interval. See the CLL= t-option for more information on its usage.

Other t-options

AFTER AFT
requests that certain operations occur after the expansion. This t-option affects the NKNOTS= t-option when the SPLINE or MSPLINE transformation is crossed with a CLASS specification. For example, if the original spline variable (1 2 3 4 5 6 7 8 9) is expanded into the three variables (1 2 3 0 0 0 0 0 0), (0 0 0 4 5 6 0 0 0), and (0 0 0 0 0 0 7 8 9), then, by default, NKNOTS=1 would use the overall median of 5 as the knot for all three variables. When you specify the AFTER t-option, the knots for the three variables are 2, 5, and 8. Note that the structural zeros are ignored when the internal knot list is created, but they are not ignored for the external knots. You can also specify the AFTER t-option with the RANK and SMOOTH transformations. The following specifications compute ranks and smooth within groups, after crossing, ignoring the structural zeros.

   class(x / zero=none) | rank(z / after)
   class(x / zero=none) | smooth(z / after)

CENTER CEN
centers the variables before the analysis begins (in contrast to the TSTANDARD=CENTER option, which centers after the analysis ends). The CENTER t-option can be used instead of running PROC STANDARD before PROC TRANSREG (see the “Centering” section on page 4675). When the KNOTS= t-option is specified with CENTER, the knots apply to the original variable, not to the centered variable. PROC TRANSREG will center the knots.

NAME=(variable-list) NAM=(variable-list)
renames variables as they are used in the MODEL statement. This t-option allows a variable to be used more than once. For example, if X is a character variable, then the following step stores both the original character variable X and a numeric variable XC that contains category numbers in the OUT= data set.
   proc transreg data=a;
      model identity(y) = opscore(x / name=(xc));
      output;
      id x;
   run;
With the CLASS and IDENTITY transformations, which allow interaction effects, the first name applies to the first variable in the specification, the second name applies to the next variable that was not previously mentioned, and so on. For example, IDENTITY(A A*B B B*C C / NAME=(G H I)) specifies that the new name for A is G, for B is H, and for C is I. The same assignment is used for the (not useful) specification IDENTITY(A A B B C C / NAME=(G H I)). For all transforms other than CLASS and IDENTITY (all those in which interactions are not supported), repeated variables are not handled specially. For example, SPLINE(A A B B C C / NAME=(A G B H C I)) creates six variables: a copy of A named A, another copy of A named G, a copy of B named B, another copy of B named H, a copy of C named C, and another copy of C named I.

REFLECT REF
reflects the transformation y = −(y − ȳ) + ȳ after the iterations are completed and before the final standardization and results calculations. This t-option is particularly useful with the dependent variable in a conjoint analysis. When the dependent variable consists of ranks with the most preferred combination assigned 1.0, the REFLECT t-option reflects the transformation so that positive utilities mean high preference. (See Example 75.2.)

TSTANDARD=CENTER | NOMISS | ORIGINAL | Z TST=CEN | NOM | ORI | Z
specifies the standardization of the transformed variables for the hypothesis tests and in the OUT= data set (see the “Centering” section on page 4675). By default, TSTANDARD=ORIGINAL. When TSTANDARD= is specified as an a-option (see the description of the TSTANDARD= a-option on page 4580) or an o-option, it determines the default standardization for all variables. When you specify TSTANDARD= as a t-option, it overrides the default standardization only for selected variables. You can specify a different TSTANDARD= value for each transformation. For example, to perform a redundancy analysis with standardized dependent variables, specify

   model identity(y1-y4 / tstandard=z) = identity(x1-x10);

Z
centers and standardizes the variables to variance one before the analysis begins (in contrast to the TSTANDARD=Z option, which standardizes after the analysis ends). The Z t-option can be used instead of running PROC STANDARD before PROC
TRANSREG (see the “Centering” section on page 4675). When the KNOTS= t-option is specified with Z, the knots apply to the original variable, not to the centered variable. PROC TRANSREG will standardize the knots.
Algorithm Options (a-options)

This section discusses the options that can appear in the PROC TRANSREG or MODEL statements as a-options. They are listed after the entire model specification and after a slash. For example,

   proc transreg;
      model spline(y / nknots=3)=log(x1 x2 / parameter=2) / nomiss maxiter=50;
      output;
   run;
In the preceding statements, NOMISS and MAXITER= are a-options. (SPLINE and LOG are transforms, and NKNOTS= and PARAMETER= are t-options.) The statements find a spline transformation with 3 knots on Y and a base 2 logarithmic transformation on X1 and X2. The NOMISS a-option excludes all observations with missing values, and the MAXITER= a-option specifies the maximum number of iterations.

Table 75.3. Options Available in the PROC TRANSREG or MODEL Statements
Task                                                      Option

Input data set
specifies input observation type                          TYPE=
restarts iterations                                       REITERATE

Specify method and control iterations
specifies minimum criterion change                        CCONVERGE=
specifies minimum data change                             CONVERGE=
specifies canonical dummy-variable initialization         DUMMY
specifies maximum number of iterations                    MAXITER=
specifies iterative algorithm                             METHOD=
specifies number of canonical variables                   NCAN=
specifies singularity criterion                           SINGULAR=

Control missing data handling
METHOD=MORALS fits each model individually                INDIVIDUAL
includes monotone special missing values                  MONOTONE=
excludes observations with missing values                 NOMISS
unties special missing values                             UNTIE=

Control intercept and CLASS variables
CLASS dummy variable name prefix                          CPREFIX=
CLASS dummy variable label prefix                         LPREFIX=
no intercept or centering                                 NOINT
order of class variable levels                            ORDER=
controls output of reference levels                       REFERENCE=
CLASS dummy variable label separators                     SEPARATORS=

Control displayed output
confidence limits alpha                                   ALPHA=
displays parameter estimate confidence limits             CL
displays model specification details                      DETAIL
displays iteration histories                              HISTORY
suppresses displayed output                               NOPRINT
suppresses the iteration histories                        SHORT
displays regression results                               SS2
displays ANOVA table                                      TEST
displays conjoint part-worth utilities                    UTILITIES

Control standardization
fits additive model                                       ADDITIVE
do not zero constant variables                            NOZEROCONSTANT
specifies transformation standardization                  TSTANDARD=
The following list provides details on these a-options.

ADDITIVE ADD
creates an additive model by multiplying the values of each independent variable (after the TSTANDARD= standardization) by that variable’s corresponding multiple regression coefficient. This process scales the independent variables so that the predicted-values variable for the final dependent variable is simply the sum of the final independent variables. An additive model is a univariate multiple regression model. As a result, the ADDITIVE a-option is not valid if METHOD=CANALS, or if METHOD=REDUNDANCY or METHOD=UNIVARIATE with more than one dependent variable.

ALPHA=number ALP=number
specifies the level of significance for all of the confidence limits. By default, ALPHA=0.05.
CCONVERGE=n CCO=n
specifies the minimum change in the criterion being optimized (squared multiple correlation for METHOD=MORALS and METHOD=UNIVARIATE, average squared multiple correlation for METHOD=REDUNDANCY, average squared canonical correlation for METHOD=CANALS) that is required to continue iterating. By default, CCONVERGE=0.0.
CL
requests confidence limits on the parameter estimates in the displayed output.

CONVERGE=n CON=n
specifies the minimum average absolute change in standardized variable scores that is required to continue iterating. By default, CONVERGE=0.00001. Average change is computed over only those variables that can be transformed by the iterations; that is, all LINEAR, OPSCORE, MONOTONE, UNTIE, SPLINE, MSPLINE, and SSPLINE variables and nonoptimal transformation variables with missing values.

CPREFIX=n CPR=n
specifies the number of first characters of a CLASS expansion variable’s name to use in constructing names for dummy variables. Dummy variable names are constructed from the first n characters of the CLASS expansion variable’s name and the first 32 − n characters of the formatted CLASS expansion variable’s value. For example, if the variable ClassVariable has values 1, 2, and 3, then, by default, the dummy variables are named ClassVariable1, ClassVariable2, and ClassVariable3. However, with CPREFIX=5, the dummy variables are named Class1, Class2, and Class3. When CPREFIX=0, dummy variable names are created entirely from the CLASS expansion variable’s formatted values. Valid values range from -1 to 31, where -1 indicates the default calculation and 0 to 31 are the number of prefix characters to use. The default, -1, sets n to 32 - min(32, max(2, fl)), where fl is the format length. When CPREFIX= is specified as an a-option or an o-option, it specifies the default for all CLASS variables. When you specify CPREFIX= as a t-option, it overrides the default only for selected variables.
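For example, assuming a data set A containing the ClassVariable variable described above, this sketch yields the dummy variable names Class1, Class2, and Class3:

   proc transreg data=a cprefix=5;
      /* zero=none outputs all three dummy variables */
      model identity(y) = class(ClassVariable / zero=none);
      output;
   run;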
DETAIL DET

reports on details of the model specification. For example, it reports the knots and coefficients for splines, reference levels for CLASS variables, Box-Cox results, and so on.

DUMMY DUM
provides a canonical dummy variable initialization. When there are no monotonicity constraints and there is only one canonical variable in each set, PROC TRANSREG (with the DUMMY a-option) can usually find the optimal solution in only one iteration. The initialization iteration is number 0, which is slower and uses more memory than other iterations. However, when there are no monotonicity constraints, when there is only one canonical variable in each set, and when there is enough available memory, specifying the DUMMY a-option can greatly decrease the amount of time required to find the optimal transformations. Furthermore, by solving for the transformations directly instead of iteratively, PROC TRANSREG avoids certain nonoptimal solutions.
HISTORY HIS
displays the iteration histories even when the NOPRINT a-option is specified.

INDIVIDUAL IND
fits each model for each dependent variable individually. This means, for example, that when INDIVIDUAL is specified, missing values in one dependent variable will not cause that observation to be deleted for the other models with the other dependent variables. In contrast, by default, missing values in any variable in any model can cause the observation to be deleted for all models. The INDIVIDUAL option can only be specified with METHOD=MORALS. This option also affects the order of the output. By default, the number of observations table is printed once at the beginning of the output. With INDIVIDUAL, a number of observations table appears for each model.

LPREFIX=n LPR=n
specifies the number of first characters of a CLASS expansion variable’s label (or name if no label is specified) to use in constructing labels for dummy variables. Dummy variable labels are constructed from the first n characters of the CLASS expansion variable’s name and the first 127 − n characters of the formatted CLASS expansion variable’s value. Valid values range from -1 to 127. Values of 0 to 127 specify the number of name or label characters to use. The default is -1, which specifies that PROC TRANSREG should pick a value depending on the length of the prefix and the formatted class value. When LPREFIX= is specified as an a-option or an o-option, it determines the default for all CLASS variables. When you specify LPREFIX= as a t-option, it overrides the default only for selected variables.

MAXITER=n MAX=n
specifies the maximum number of iterations (see the “Controlling the Number of Iterations” section on page 4601). By default, MAXITER=30. A specification of MAXITER=0 is allowed to save time when no transformations are requested.

METHOD=CANALS | MORALS | REDUNDANCY | UNIVARIATE MET=CAN | MOR | RED | UNI
specifies the iterative algorithm. By default, METHOD=UNIVARIATE, unless you specify options that cannot be handled by the UNIVARIATE algorithm. Specifically, the default is METHOD=MORALS for the following situations:
• if you specify LINEAR, OPSCORE, MONOTONE, UNTIE, SPLINE, MSPLINE, or SSPLINE transformations for the independent variables
• if you specify the ADDITIVE a-option with more than one dependent variable
• if you specify the IAPPROXIMATIONS o-option
CANALS
specifies canonical correlation with alternating least squares. This jointly transforms all dependent and independent variables to maximize the average of the first n squared canonical correlations, where n is the value of the NCAN= a-option.
MORALS
specifies multiple optimal regression with alternating least squares. This transforms each dependent variable, along with the set of independent variables, to maximize the squared multiple correlation.
REDUNDANCY

jointly transforms all dependent and independent variables to maximize the average of the squared multiple correlations (see the “Redundancy Analysis” section on page 4606).

UNIVARIATE
transforms each dependent variable to maximize the squared multiple correlation, while the independent variables are not transformed.
MONOTONE=two-letters MON=two-letters
specifies the first and last special missing value in the list of those special missing values to be estimated using within-variable order and category constraints. By default, there are no order constraints on missing value estimates. The two-letters value must consist of two letters in alphabetical order. For example, MONOTONE=DF means that the estimate of .D must be less than or equal to the estimate of .E, which must be less than or equal to the estimate of .F; no order constraints are placed on estimates of .– , .A through .C, and .G through .Z. For details, see the “Missing Values” section on page 4599.
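A sketch of this a-option (the data set A is hypothetical, and X is assumed to contain the special missing values .D, .E, and .F):

   proc transreg data=a monotone=df;
      /* the estimates of .D <= .E <= .F are ordered within x */
      model identity(y) = monotone(x);
      output;
   run;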
NCAN=n NCA=n

specifies the number of canonical variables to use in the METHOD=CANALS algorithm. By default, NCAN=1. The value of the NCAN= a-option must be ≥ 1. When canonical coefficients and coordinates are included in the OUT= data set, the NCAN= a-option also controls the number of rows of the canonical coefficient matrices in the data set. If you specify an NCAN= value larger than the minimum of the number of dependent variables and the number of independent variables, PROC TRANSREG displays a warning and sets the NCAN= a-option to the maximum allowable value.

NOINT NOI
omits the intercept from the OUT= data set and suppresses centering of data. The NOINT a-option is not allowed with iterative transformations since there is no provision for optimal scaling without an intercept. The NOINT a-option is allowed only when there is no implicit intercept and when all of the data in a BY group absolutely will not change during the iterations.
NOMISS NOM
excludes all observations with missing values from the analysis, but does not exclude them from the OUT= data set. If you omit the NOMISS a-option, PROC TRANSREG simultaneously computes the optimal transformations of the nonmissing values and estimates the missing values that minimize squared error. For details, see the “Missing Values” section on page 4599. Casewise deletion of observations with missing values occurs when the NOMISS a-option is specified, when there are missing values in expansions, when there are missing values in METHOD=UNIVARIATE independent variables, when there are weights less than or equal to 0, or when there are frequencies less than 1. Excluded observations are output with a blank value for the _TYPE_ variable, and they have a weight of 0. They do not contribute to the analysis but are scored and transformed as supplementary or passive observations. See the “Passive Observations” section on page 4605 for more information on excluded observations.

NOPRINT NOP
suppresses the display of all output unless you specify the HISTORY a-option. The NOPRINT a-option without the HISTORY a-option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, “Using the Output Delivery System.”

NOZEROCONSTANT NOZERO NOZ
specifies that constant variables are expected and should not be zeroed. By default, constant variables are zeroed. This option is useful when PROC TRANSREG is used to code experimental designs for discrete choice models (see the “Discrete Choice Experiments: DESIGN, NORESTORE, NOZERO” section on page 4660). When these designs are very large, it may be more efficient to use the DESIGN=n option. If attributes are constant within a block of n observations, you need to specify the NOZEROCONSTANT option to get the correct results. You can specify this option in the PROC TRANSREG, MODEL, and OUTPUT statements.

ORDER=DATA | FREQ | FORMATTED | INTERNAL ORD=DAT | FRE | FOR | INT
specifies the order in which the CLASS variable levels are to be reported. The default is ORDER=INTERNAL. For ORDER=FORMATTED and ORDER=INTERNAL, the sort order is machine dependent. When ORDER= is specified as an a-option or an o-option, it determines the default ordering for all CLASS variables. When you specify ORDER= as a t-option, it overrides the default ordering only for selected variables.

DATA
sorts by order of appearance in the input data set.
FORMATTED
sorts by formatted value.
FREQ
sorts by descending frequency count; levels with the most observations appear first.
INTERNAL
sorts by unformatted value.
REFERENCE=NONE | MISSING | ZERO REF=NON | MIS | ZER
specifies how reference levels of CLASS variables are to be treated. The options are REFERENCE=NONE, the default, in which reference levels are suppressed; REFERENCE=MISSING, in which reference levels are displayed and output with missing values; and REFERENCE=ZERO, in which reference levels are displayed and output with zeros. The REFERENCE= option can be specified in the PROC TRANSREG, MODEL, or OUTPUT statement, and it can be independently specified for the OUT= data set and the displayed output. When you specify it in only one statement, it sets the option for both the displayed output and the OUT= data set. REITERATE REI
enables the TRANSREG procedure to use previous transformations as starting points. The REITERATE a-option affects only variables that are iteratively transformed (specified as LINEAR, OPSCORE, MONOTONE, UNTIE, SPLINE, MSPLINE, and SSPLINE). For iterative transformations, the REITERATE a-option requests a search in the input data set for a variable that consists of the value of the TDPREFIX= or TIPREFIX= o-option followed by the original variable name. If such a variable is found, it is used to provide the initial values for the first iteration. The final transformation is a member of the transformation family defined by the original variable, not the transformation family defined by the initialization variable. See the section “Using the REITERATE Algorithm Option” on page 4602.
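A sketch of this two-step usage (hypothetical data set and variables; with the default TIPREFIX=T, the transformation of X is output as TX):

   proc transreg data=a;
      model identity(y) = spline(x);
      output out=step1;      /* step1 contains tx, the transformed x */
   run;
   proc transreg data=step1 reiterate;
      /* tx in step1 supplies the starting values for the first iteration */
      model identity(y) = spline(x);
      output out=step2;
   run;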
SEPARATORS='string-1' <'string-2'> SEP='string-1' <'string-2'>

specifies separators for creating CLASS expansion variable labels. By default, SEPARATORS=' ' ' * ' (“blank” and “blank asterisk blank”). The first value is used to separate variable names and values in interactions. The second value is used to separate interaction components. For example, the label for the dummy variable for the A=1 and B=2 cell is, by default, 'A 1 * B 2'. If SEPARATORS='=' 'x' is specified, then the label is 'A=1xB=2'. When SEPARATORS= is specified as an a-option or an o-option, it determines the default separators for all CLASS variables. When you specify SEPARATORS= as a t-option, it overrides the default only for selected variables.

SHORT SHO
suppresses the iteration histories.

SINGULAR=n SIN=n
specifies the largest value within rounding error of zero. By default, SINGULAR=1E−12. The TRANSREG procedure uses the value of the SINGULAR= a-option for checking 1 − R² when constructing full-rank matrices of predictor variables, checking denominators before dividing, and so on. PROC TRANSREG computes the regression coefficients by sweeping with rational pivoting.

SS2
produces a regression table based on Type II sums of squares. Tests of the contribution of each transformation to the overall model are displayed and output to the OUTTEST= data set when you specify the OUTTEST= option. When you specify the SS2 a-option, the TEST a-option is implied. See the section “Hypothesis Tests” on page 4615. You can suppress the variable labels in the regression tables by specifying the NOLABEL option in the OPTIONS statement.

TEST TES
generates an ANOVA table. PROC TRANSREG tests the null hypothesis that the vector of scoring coefficients for all of the transformations is zero. See the section “Hypothesis Tests” on page 4615.

TSTANDARD=CENTER | NOMISS | ORIGINAL | Z TST=CEN | NOM | ORI | Z
specifies the standardization of the transformed variables for the hypothesis tests and in the OUT= data set. By default, TSTANDARD=ORIGINAL. When TSTANDARD= is specified as an a-option or an o-option, it determines the default standardization for all variables. When you specify TSTANDARD= as a t-option, it overrides the default standardization only for selected variables.

CENTER
centers the output variables to mean zero, but the variances are the same as the variances of the input variables.
NOMISS
sets the means and variances of the transformed variables in the OUT= data set, computed over all output values that correspond to nonmissing values in the input data set, to the means and variances computed from the nonmissing observations of the original variables. The TSTANDARD=NOMISS specification is useful with missing data. When a variable is linearly transformed, the final variable contains the original nonmissing values and the missing value estimates. In other words, the nonmissing values are unchanged. If your data have no missing values, TSTANDARD=NOMISS and TSTANDARD=ORIGINAL produce the same results.
ORIGINAL
sets the means and variances of the transformed variables to the means and variances of the original variables. This is the default.
Z
standardizes the variables to mean zero, variance one.
The final standardization is affected by other options. If you also specify the ADDITIVE a-option, the TSTANDARD= option specifies an intermediate step in computing the final means and variances. The final independent variables, along with their means and standard deviations, are scaled by the regression coefficients, creating an additive model with all coefficients equal to one.
For nonoptimal variable transformations, the means and variances of the original variables are actually the means and variances of the nonlinearly transformed variables, unless you specify the ORIGINAL nonoptimal t-option in the MODEL statement. For example, if a variable X with no missing values is specified as LOG, then, by default, the final transformation of X is simply LOG(X), not LOG(X) standardized to the mean of X and variance of X.

TYPE='text'|name TYP='text'|name
specifies the valid value for the _TYPE_ variable in the input data set. If PROC TRANSREG finds an input _TYPE_ variable, it uses only observations with a _TYPE_ value that matches the TYPE= value. This enables a PROC TRANSREG OUT= data set containing coefficients to be used as input to PROC TRANSREG without requiring a WHERE statement to exclude the coefficients. If a _TYPE_ variable is not in the data set, all observations are used. The default is TYPE='SCORE', so if you do not specify the TYPE= a-option, only observations with _TYPE_='SCORE' are used. Do not confuse this option with the data set TYPE= option. The DATA= data set must be an ordinary SAS data set. PROC TRANSREG displays a note when it reads observations with blank values of _TYPE_, but it does not automatically exclude those observations. Data sets created by the TRANSREG and PRINQUAL procedures have blank _TYPE_ values for those observations that were excluded from the analysis due to nonpositive weights, nonpositive frequencies, or missing data. When these observations are read again, they are excluded for the same reason that they were excluded from their original analysis, not because their _TYPE_ value is blank.

UNTIE=two-letters UNT=two-letters
specifies the first and last special missing value in the list of those special missing values that are to be estimated with within-variable order constraints but no category constraints. The two-letters value must consist of two letters in alphabetical order. By default, there are category constraints but no order constraints on special missing value estimates. For details, see the “Missing Values” section on page 4599 and the “Optimal Scaling” section on page 4609.

UTILITIES UTI
produces a table of the part-worth utilities from a conjoint analysis. Utilities, their standard errors, and the relative importance of each factor are displayed and output to the OUTTEST= data set when you specify the OUTTEST= option. When you specify the UTILITIES a-option, the TEST a-option is implied. Refer to SAS Technical Report R-109, Conjoint Analysis Examples, for more information on conjoint analysis.
OUTPUT Statement

OUTPUT OUT=SAS-data-set < o-options > ;

The OUTPUT statement creates a new SAS data set that contains coefficients, marginal means, and information on original and transformed variables. The information on original and transformed variables composes the score partition of the data set; observations have _TYPE_='SCORE'. The coefficients and marginal means compose the coefficient partition of the data set; observations have _TYPE_='M COEFFI' or _TYPE_='MEAN'. Other values of _TYPE_ are possible; for details, see “_TYPE_ and _NAME_ Variables” later in this chapter. For details on data set structure, see the “Output Data Set” section on page 4617. To specify the data set, use the OUT= specification.

OUT=SAS-data-set
specifies the output data set for the data, transformed data, predicted values, residuals, scores, coefficients, and so on. When you use an OUTPUT statement but do not use the OUT= specification, PROC TRANSREG creates a data set and uses the DATAn convention. If you want to create a permanent SAS data set, you must specify a two-level name (refer to “SAS Files” in SAS Language Reference: Concepts and “Introduction to DATA Step Processing” in the SAS Procedures Guide for details). To control the contents of the data set and variable names, use one or more of the o-options. You can also specify these options in the PROC TRANSREG statement.
Output Options (o-options)

The following table provides a summary of options in the OUTPUT statement. These options include the OUT= option and all of the o-options.

Table 75.4. Options Available in the OUTPUT Statement
Task                                                      Option

Identify output data set
output data set                                           OUT=

Predicted values, residuals, scores
outputs canonical scores                                  CANONICAL
outputs individual confidence limits                      CLI
outputs mean confidence limits                            CLM
specifies design matrix coding                            DESIGN=
outputs leverage                                          LEVERAGE
does not restore missings                                 NORESTOREMISSING
suppresses output of scores                               NOSCORES
outputs predicted values                                  PREDICTED
outputs redundancy variables                              REDUNDANCY=
outputs residuals                                         RESIDUALS

Output data set replacement
replaces dependent variables                              DREPLACE
replaces independent variables                            IREPLACE
replaces all variables                                    REPLACE

Output data set coefficients
outputs coefficients                                      COEFFICIENTS
outputs ideal point coordinates                           COORDINATES
outputs marginal means                                    MEANS
outputs redundancy analysis coefficients                  MREDUNDANCY

Output data set variable name prefixes
dependent variable approximations                         ADPREFIX=
independent variable approximations                       AIPREFIX=
canonical dependent variables                             CDPREFIX=
conservative-individual-lower CL                          CILPREFIX=
canonical independent variables                           CIPREFIX=
conservative-individual-upper CL                          CIUPREFIX=
conservative-mean-lower CL                                CMLPREFIX=
conservative-mean-upper CL                                CMUPREFIX=
METHOD=MORALS untransformed dependent                     DEPENDENT=
liberal-individual-lower CL                               LILPREFIX=
liberal-individual-upper CL                               LIUPREFIX=
liberal-mean-lower CL                                     LMLPREFIX=
liberal-mean-upper CL                                     LMUPREFIX=
residuals                                                 RDPREFIX=
predicted values                                          PPREFIX=
redundancy variables                                      RPREFIX=
transformed dependents                                    TDPREFIX=
transformed independents                                  TIPREFIX=

Output data set macros
creates macro variables                                   MACRO

Control CLASS variables
controls output of reference levels                       REFERENCE=

Output data set details
dependent and independent approximations                  APPROXIMATIONS
canonical correlation coefficients                        CCC
canonical elliptical point coordinates                    CEC
canonical point coordinates                               CPC
canonical quadratic point coordinates                     CQC
approximations to transformed dependents                  DAPPROXIMATIONS
approximations to transformed independents                IAPPROXIMATIONS
elliptical point coordinates                              MEC
point coordinates                                         MPC
quadratic point coordinates                               MQC
multiple regression coefficients                          MRC
For the coefficients partition, the COEFFICIENTS, COORDINATES, and MEANS o-options provide the coefficients that are appropriate for your model. For more explicit control of the coefficient partition, use the options that control details and prefixes. The following list provides details on these options.

ADPREFIX=name ADP=name
specifies a prefix for naming the dependent variable predicted values. The default is ADPREFIX=P when you specify the PREDICTED o-option; otherwise, it is ADPREFIX=A. Specifying the ADPREFIX= o-option also implies the PREDICTED o-option, and the ADPREFIX= o-option is the same as the PPREFIX= o-option.

AIPREFIX=name AIP=name
specifies a prefix for naming the independent variable approximations. The default is AIPREFIX=A. Specifying the AIPREFIX= o-option also implies the IAPPROXIMATIONS o-option.

APPROXIMATIONS APPROX APP
is equivalent to specifying both the DAPPROXIMATIONS and the IAPPROXIMATIONS o-options. If METHOD=UNIVARIATE, then the APPROXIMATIONS o-option implies only the DAPPROXIMATIONS o-option.

CANONICAL CAN
outputs canonical variables to the OUT= data set. When METHOD=CANALS, the CANONICAL o-option is implied. The CDPREFIX= o-option specifies a prefix for naming the dependent canonical variables (default Cand), and the CIPREFIX= o-option specifies a prefix for naming the independent canonical variables (default Cani).

CCC
outputs canonical correlation coefficients to the OUT= data set.

CDPREFIX=name CDP=name
provides a prefix for naming the canonical dependent variables. The default is CDPREFIX=Cand. Specifying the CDPREFIX= o-option also implies the CANONICAL o-option.

CEC
outputs canonical elliptical point model coordinates to the OUT= data set.

CILPREFIX=name CIL=name
specifies a prefix for naming the conservative-individual-lower confidence limits. The default prefix is CIL. Specifying the CILPREFIX= o-option also implies the CLI o-option.
CIPREFIX=name CIP=name
provides a prefix for naming the canonical independent variables. The default is CIPREFIX=Cani. Specifying the CIPREFIX= o-option also implies the CANONICAL o-option.

CIUPREFIX=name CIU=name
specifies a prefix for naming the conservative-individual-upper confidence limits. The default prefix is CIU. Specifying the CIUPREFIX= o-option also implies the CLI o-option.

CLI
outputs individual confidence limits to the OUT= data set. The names of the confidence limits variables are constructed from the original dependent variable names and the prefixes specified in the following o-options: LILPREFIX= (default LIL for liberal individual lower), CILPREFIX= (default CIL for conservative individual lower), LIUPREFIX= (default LIU for liberal individual upper), and CIUPREFIX= (default CIU for conservative individual upper). When there are no monotonicity constraints, the liberal and conservative limits are the same.

CLM
outputs mean confidence limits to the OUT= data set. The names of the confidence limits variables are constructed from the original dependent variable names and the prefixes specified in the following o-options: LMLPREFIX= (default LML for liberal mean lower), CMLPREFIX= (default CML for conservative mean lower), LMUPREFIX= (default LMU for liberal mean upper), and CMUPREFIX= (default CMU for conservative mean upper). When there are no monotonicity constraints, the liberal and conservative limits are the same.
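For example, this sketch (hypothetical data set and variables) writes both individual and mean confidence limits to the OUT= data set:

   proc transreg data=a;
      model identity(y) = spline(x);
      /* limit variables use the LIL, CIL, LIU, CIU, LML, CML, LMU, and CMU prefixes */
      output out=limits cli clm;
   run;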
CMLPREFIX=name CML=name

specifies a prefix for naming the conservative-mean-lower confidence limits. The default prefix is CML. Specifying the CMLPREFIX= o-option also implies the CLM o-option.

CMUPREFIX=name CMU=name
specifies a prefix for naming the conservative-mean-upper confidence limits. The default prefix is CMU. Specifying the CMUPREFIX= o-option also implies the CLM o-option.

COEFFICIENTS COE
outputs either multiple regression coefficients or raw canonical coefficients to the OUT= data set. If you specify METHOD=CANALS (in the MODEL or PROC TRANSREG statement), then the COEFFICIENTS o-option outputs the first n canonical variables, where n is the value of the NCAN= a-option (specified in the MODEL or PROC TRANSREG statement). Otherwise, the COEFFICIENTS o-option includes multiple regression coefficients in the OUT= data set. In addition, when you specify the CLASS expansion for any independent variable, the COEFFICIENTS o-option also outputs marginal means.

COORDINATES COO
outputs either ideal point or vector model coordinates for preference mapping to the OUT= data set. When METHOD=CANALS, these coordinates are computed from canonical coefficients; otherwise, the coordinates are computed from multiple regression coefficients. For details, see the “Point Models” section on page 4605.

CPC
outputs canonical point model coordinates to the OUT= data set.

CQC
outputs canonical quadratic point model coordinates to the OUT= data set.

DAPPROXIMATIONS DAP
outputs the approximations of the transformed dependent variables to the OUT= data set. These are the target values for the optimal transformations. With METHOD=UNIVARIATE and METHOD=MORALS, the dependent variable approximations are the ordinary predicted values from the linear model. The names of the approximation variables are constructed from the ADPREFIX= o-option (default A) and the original dependent variable names. For ordinary predicted values, use the PREDICTED o-option instead of the DAPPROXIMATIONS o-option, since the PREDICTED o-option uses a more relevant prefix (“P” instead of “A”) and a more relevant variable label suffix (“Predicted Values” instead of “Approximations”).

DESIGN<=n> DES<=n>
specifies that your primary goal is design matrix coding, not analysis. Specifying the DESIGN o-option makes the procedure run faster. The DESIGN o-option sets the default method to UNIVARIATE and the default MAXITER= value to zero. It suppresses computing the regression coefficients, unless they are needed for some other option. Furthermore, when the DESIGN o-option is specified, the MODEL statement is not required to have an equal sign. When no MODEL statement equal sign is specified, all variables are considered independent variables, all options that require dependent variables are ignored, and the IREPLACE o-option is implied. You can use DESIGN=n for coding very large data sets, where n is the number of observations to code at one time. For example, to code a data set with a large number of observations, you can specify DESIGN=100 or DESIGN=1000 to process the data set in blocks of 100 or 1000 observations. If you specify the DESIGN o-option rather than DESIGN=n, PROC TRANSREG tries to process all observations at once, which will not work with very large data sets. Specify the NOZEROCONSTANT a-option with DESIGN=n to ensure that constant variables within blocks are not zeroed. See the section “Using the DESIGN Output Option” on page 4654 and the section “Discrete Choice Experiments: DESIGN, NORESTORE, NOZERO” on page 4660.
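A sketch of coding a large design in blocks (the data set BIG and the variables X1-X3 are hypothetical); note the MODEL statement with no equal sign and the NOZEROCONSTANT a-option discussed previously:

   proc transreg data=big design=1000 nozeroconstant;
      model class(x1-x3 / zero=none);
      output out=coded;
   run;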
DEPENDENT=name DEP=name
specifies the untransformed dependent variable for OUT= data sets with METHOD=MORALS when there is more than one dependent variable. The default is DEPENDENT=_DEPEND_.

DREPLACE DRE
replaces the original dependent variables with the transformed dependent variables in the OUT= data set. The names of the transformed variables in the OUT= data set correspond to the names of the original dependent variables in the input data set. By default, both the original dependent variables and transformed dependent variables (with names constructed from the TDPREFIX= (default T) o-option and the original dependent variable names) are included in the OUT= data set.

IAPPROXIMATIONS IAP
outputs the approximations of the transformed independent variables to the OUT= data set. These are the target values for the optimal transformations. The names of the approximation variables are constructed from the AIPREFIX= o-option (default A) and the original independent variable names. Specifying the AIPREFIX= o-option also implies the IAPPROXIMATIONS o-option. The IAPPROXIMATIONS o-option is not valid when METHOD=UNIVARIATE.

IREPLACE IRE
replaces the original independent variables with the transformed independent variables in the OUT= data set. The names of the transformed variables in the OUT= data set correspond to the names of the original independent variables in the input data set. By default, both the original independent variables and transformed independent variables (with names constructed from the TIPREFIX= o-option (default T) and the original independent variable names) are included in the OUT= data set.

LEVERAGE<=name> LEV<=name>
creates a variable with the specified name in the OUT= data set that contains leverages. Specifying the LEVERAGE o-option is equivalent to specifying LEVERAGE=Leverage.

LILPREFIX=name LIL=name
specifies a prefix for naming the liberal-individual-lower confidence limits. The default prefix is LIL. Specifying the LILPREFIX= o-option also implies the CLI o-option.

LIUPREFIX=name LIU=name
specifies a prefix for naming the liberal-individual-upper confidence limits. The default prefix is LIU. Specifying the LIUPREFIX= o-option also implies the CLI o-option.
LMLPREFIX=name LML=name
specifies a prefix for naming the liberal-mean-lower confidence limits. The default prefix is LML. Specifying the LMLPREFIX= o-option also implies the CLM o-option.

LMUPREFIX=name LMU=name
specifies a prefix for naming the liberal-mean-upper confidence limits. The default prefix is LMU. Specifying the LMUPREFIX= o-option also implies the CLM o-option.

MACRO(keyword=name...) MAC(keyword=name...)
creates macro variables. Most of the options available within the MACRO o-option are rarely needed. By default, the TRANSREG procedure creates a macro variable named _TRGIND with a complete list of independent variables created by the procedure. When the TRANSREG procedure is being used for design matrix creation prior to running a procedure without a CLASS statement, this macro provides a convenient way to use the results from PROC TRANSREG. For example, a PROC LOGISTIC step that uses a design matrix coded by PROC TRANSREG could use the following MODEL statement:

   model y=&_trgind;
The TRANSREG procedure, also by default, creates a macro variable named _TRGINDN, which contains the number of variables in the _TRGIND list. This macro variable could be used in an ARRAY statement as follows:

   array indvars[&_trgindn] &_trgind;
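Putting the two default macro variables together, a sketch of a coding step followed by a modeling step (the data set A and the variables are hypothetical):

   proc transreg data=a design;
      model class(x1 x2 / zero=none);
      output out=coded;
   run;
   proc logistic data=coded;
      model y = &_trgind;   /* the variable list created by PROC TRANSREG */
   run;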
See the section “Using the DESIGN Output Option” on page 4654 and the section “Discrete Choice Experiments: DESIGN, NORESTORE, NOZERO” on page 4660 for examples of using the default macro variables. The available keywords are as follows.

DN=name
specifies the name of a macro variable that contains the number of dependent variables. By default, a macro variable named _TRGDEPN is created. This is the number of variables in the DL= list and the number of macro variables created by the DV= and DE= specifications.
IN=name
specifies the name of a macro variable that contains the number of independent variables. By default, a macro variable named _TRGINDN is created. This is the number of variables in the IL= list and the number of macro variables created by the IV= and IE= specifications.
DL=name
specifies the name of a macro variable that contains the list of the dependent variables. By default, a macro variable named _TRGDEP is created. These are the variable names of the final transformed variables in the OUT= data set. For example, if there are three dependent variables, Y1–Y3, then _TRGDEP contains, by default, TY1 TY2 TY3 (or Y1 Y2 Y3 if you specify the REPLACE o-option).
IL=name
specifies the name of a macro variable that contains the list of the independent variables. By default, a macro variable named _TRGIND is created. These are the variable names of the final transformed variables in the OUT= data set. For example, if there are three independent variables, X1–X3, then _TRGIND contains, by default, TX1 TX2 TX3 (or X1 X2 X3 if you specify the REPLACE o-option).
DV=prefix
specifies a prefix for creating a list of macro variables, each of which contains one dependent variable name. For example, if there are three dependent variables, Y1–Y3, and you specify MACRO(DV=DEP), then three macro variables, DEP1, DEP2, and DEP3, are created, containing TY1, TY2, and TY3, respectively (or Y1, Y2, and Y3 if you specify the REPLACE o-option). By default, no list is created.
IV=prefix
specifies a prefix for creating a list of macro variables, each of which contains one independent variable name. For example, if there are three independent variables, X1–X3, and you specify MACRO(IV=IND), then three macro variables, IND1, IND2, and IND3, are created, containing TX1, TX2, and TX3, respectively (or X1, X2, and X3 if you specify the REPLACE o-option). By default, no list is created.
DE=prefix
specifies a prefix for creating a list of macro variables, each of which contains one dependent variable effect. This list shows the origin of each model term. Each effect consists of two or more parts, and each part consists of a value in 32 columns followed by a blank. For example, if you specify MACRO(DE=D), then a macro variable D1 is created for IDENTITY(Y). The D1 macro variable is shown next, but with the extra white space removed.

   4 TY IDENTITY Y
The first part is the number of parts (4), the second part is the transformed variable name, the third part is the transformation, and the last part is the input variable name. By default, no list is created.

IE=prefix
specifies a prefix for creating a list of macro variables, each of which contains one independent variable effect. This list shows the origin of each model term. Each effect consists of two or more parts, and each part consists of a value in 32 columns followed by a blank. For example, if you specify MACRO(IE=I), then three macro variables, I1, I2, and I3, are created for CLASS(X1 | X2) when both X1 and X2 have values of 1 and 2. These macro variables are shown next, but with extra white space removed.

5 Tx11    CLASS x1 1
5 Tx21    CLASS x2 1
8 Tx11x21 CLASS x1 1 CLASS x2 1

For CLASS variables, the formatted level appears after the variable name. The first two effects are the main effects, and the last is the interaction term. By default, no list is created.

MEANS MEA
outputs marginal means for CLASS variable expansions to the OUT= data set.

MEC

outputs multiple regression elliptical point model coordinates to the OUT= data set.

MPC

outputs multiple regression point model coordinates to the OUT= data set.

MQC

outputs multiple regression quadratic point model coordinates to the OUT= data set.

MRC

outputs multiple regression coefficients to the OUT= data set.

MREDUNDANCY MRE

outputs multiple redundancy analysis coefficients to the OUT= data set.

NORESTOREMISSING NORESTORE NOR

specifies that missing values should not be restored when the OUT= data set is created. By default, the coded CLASS variable contains a row of missing values for observations in which the CLASS variable is missing. When you specify the NORESTOREMISSING o-option, these observations contain a row of zeros instead. This is useful when the TRANSREG procedure is used to code experimental designs for discrete choice models and there is a constant alternative indicated by a missing value.

NOSCORES NOS
excludes original variables, transformed variables, predicted values, residuals, and scores from the OUT= data set. You can use the NOSCORES o-option with various other options to create an OUT= data set that contains only a coefficient partition (for example, a data set consisting entirely of coefficients and coordinates).
PREDICTED PRE P
outputs predicted values, which for METHOD=UNIVARIATE and METHOD=MORALS are the ordinary predicted values from the linear model, to the OUT= data set. The names of the predicted values’ variables are constructed from the PPREFIX= o-option (default P) and the original dependent variable names. Specifying the PPREFIX= o-option also implies the PREDICTED o-option.

PPREFIX=name PDPREFIX=name PDP=name
specifies a prefix for naming the dependent variable predicted values. The default is PPREFIX=P when you specify the PREDICTED o-option; otherwise, it is PPREFIX=A. Specifying the PPREFIX= o-option also implies the PREDICTED o-option, and the PPREFIX= o-option is the same as the ADPREFIX= o-option.

RDPREFIX=name RDP=name
specifies a prefix for naming the residual (dependent) variables in the OUT= data set. The default is RDPREFIX=R. Specifying the RDPREFIX= o-option also implies the RESIDUALS o-option.

REDUNDANCY<=STANDARDIZE | UNSTANDARDIZE> RED<=STA | UNS>
outputs redundancy variables to the OUT= data set, either standardized or unstandardized. Specifying the REDUNDANCY o-option is the same as specifying REDUNDANCY=STANDARDIZE. The results of the REDUNDANCY o-option depend on the TSTANDARD= option. You must specify TSTANDARD=Z to get results based on standardized data. The TSTANDARD= option controls how the data that go into the redundancy analysis are scaled, and REDUNDANCY=STANDARDIZE | UNSTANDARDIZE controls how the redundancy variables are scaled. The REDUNDANCY o-option is implied by METHOD=REDUNDANCY. The RPREFIX= o-option specifies a prefix (default Red) for naming the redundancy variables.

REFERENCE=NONE | MISSING | ZERO REF=NON | MIS | ZER
specifies how reference levels of CLASS variables are to be treated. The options are REFERENCE=NONE, the default, in which reference levels are suppressed; REFERENCE=MISSING, in which reference levels are displayed and output with missing values; and REFERENCE=ZERO, in which reference levels are displayed and output with zeros. The REFERENCE= option can be specified in the PROC TRANSREG, MODEL, or OUTPUT statement, and it can be independently specified for the OUT= data set and the displayed output. When you specify it in only one statement, it sets the option for both the displayed output and the OUT= data set.
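For example, the following sketch (the design data set name is hypothetical) writes the coded design to the OUT= data set with the reference levels displayed and output as zeros:

proc transreg design data=design1;
   model class(brand price);          /* default ZERO= creates reference levels */
   output out=coded reference=zero;   /* reference rows output as zeros         */
run;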
REPLACE REP
is equivalent to specifying both the DREPLACE and the IREPLACE o-options.

RESIDUALS RES R
outputs the differences between the transformed dependent variables and their predicted values. The names of the residual variables are constructed from the RDPREFIX= o-option (default R) and the original dependent variable names.

RPREFIX=name RPR=name
provides a prefix for naming the redundancy variables. The default is RPREFIX=Red. Specifying the RPREFIX= o-option also implies the REDUNDANCY o-option.

TDPREFIX=name TDP=name
specifies a prefix for naming the transformed dependent variables. By default, TDPREFIX=T. The TDPREFIX= o-option is ignored when you specify the DREPLACE o-option.

TIPREFIX=name TIP=name
specifies a prefix for naming the transformed independent variables. By default, TIPREFIX=T. The TIPREFIX= o-option is ignored when you specify the IREPLACE o-option.
WEIGHT Statement

WEIGHT variable ;

When you use a WEIGHT statement, a weighted residual sum of squares is minimized. The WEIGHT statement has no effect on degrees of freedom or number of observations, but the weights affect most other calculations. An observation is used in the analysis only if the value of the WEIGHT statement variable is greater than 0.
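A minimal sketch (the data set and variable names are hypothetical):

proc transreg data=survey;
   model identity(rating) = spline(price / nknots=3);
   weight w;   /* observations with w <= 0 or missing w are excluded */
   output out=results predicted;
run;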
Details

Model Statement Usage

MODEL < transform(dependents < / t-options >) < transform(dependents < / t-options >) . . . > = > transform(independents < / t-options >) < transform(independents < / t-options >) . . . > < / a-options > ;

Here are some examples of model statements:

• linear regression
model identity(y) = identity(x);
• a linear model with a nonlinear regression function

model identity(y) = spline(x / nknots=5);

• multiple regression

model identity(y) = identity(x1-x5);

• multiple regression with nonlinear transformations

model spline(y / nknots=3) = spline(x1-x5 / nknots=3);

• multiple regression with nonlinear but monotone transformations

model mspline(y / nknots=3) = mspline(x1-x5 / nknots=3);

• multivariate multiple regression

model identity(y1-y4) = identity(x1-x5);

• canonical correlation

model identity(y1-y4) = identity(x1-x5) / method=canals;

• redundancy analysis

model identity(y1-y4) = identity(x1-x5) / method=redundancy;

• preference mapping, vector model (Carroll 1972)

model identity(Attrib1-Attrib3) = identity(Dim1-Dim2);

• preference mapping, ideal point model (Carroll 1972)

model identity(Attrib1-Attrib3) = point(Dim1-Dim2);

• preference mapping, ideal point model, elliptical (Carroll 1972)

model identity(Attrib1-Attrib3) = epoint(Dim1-Dim2);

• preference mapping, ideal point model, quadratic (Carroll 1972)

model identity(Attrib1-Attrib3) = qpoint(Dim1-Dim2);

• metric conjoint analysis

model identity(Subj1-Subj50) = class(a b c d e f / zero=sum);

• nonmetric conjoint analysis

model monotone(Subj1-Subj50) = class(a b c d e f / zero=sum);

• main effects, two-way interaction

model identity(y) = class(a|b);

• less-than-full-rank model: main effects and two-way interaction are constrained to sum to zero

model identity(y) = class(a|b / zero=sum);

• main effects and all two-way interactions

model identity(y) = class(a|b|c@2);

• main effects and all two- and three-way interactions

model identity(y) = class(a|b|c);

• main effects and just B*C two-way interaction

model identity(y) = class(a b c b*c);

• seven main effects, three two-way interactions

model identity(y) = class(a b c d e f g a*b a*c a*d);

• deviations-from-means (effects or (1, 0, −1)) coding, with an A reference level of ’1’ and a B reference level of ’2’

model identity(y) = class(a|b / deviations zero=’1’ ’2’);

• cell-means coding (implicit intercept)

model identity(y) = class(a*b / zero=none);

• reference cell model

model identity(y) = class(a|b / zero=’1’ ’1’);

• reference line with change in line parameters

model identity(y) = class(a) | identity(x);

• reference curve with change in curve parameters

model identity(y) = class(a) | spline(x);

• separate curves and intercepts

model identity(y) = class(a / zero=none) | spline(x);

• quantitative effects with interaction

model identity(y) = identity(x1 | x2);

• separate quantitative effects with interaction within each cell

model identity(y) = class(a * b / zero=none) | identity(x1 | x2);
Box-Cox Transformations

The Box-Cox (1964) transformation has the form

   (y^λ − 1)/λ     λ ≠ 0
   log(y)          λ = 0

This family of transformations of the positive dependent variable y is controlled by the parameter λ. Transformations linearly related to square root, inverse, quadratic, cubic, and so on are all special cases. The limit as λ approaches 0 is the log transformation. More generally, Box-Cox transformations of the following form can be fit:

   ((y + c)^λ − 1)/(λg)    λ ≠ 0
   log(y + c)/g            λ = 0
By default, c = 0. The parameter c can be used to rescale y so that it is strictly positive. By default, g = 1. Alternatively, g can be ẏ^(λ−1), where ẏ is the geometric mean of y. The BOXCOX transformation in PROC TRANSREG can be used to perform a Box-Cox transformation of the dependent variable. You can specify a list of power parameters using the LAMBDA= transformation option. By default, LAMBDA=-3 TO 3 BY 0.25. The procedure chooses the optimal power parameter using a maximum likelihood criterion (Draper and Smith 1981, pp. 225–226). You can specify the PARAMETER=c transformation option when you want to shift the values of y, usually to avoid negatives. To divide by ẏ^(λ−1), specify the GEOMETRICMEAN transformation option. Here are some examples of usage of the LAMBDA= option:

model BoxCox(y / lambda=0) = identity(x1-x5);
model BoxCox(y / lambda=-2 to 2 by 0.1) = identity(x1-x5);
model BoxCox(y) = identity(x1-x5);
In the first example,

model BoxCox(y / lambda=0) = identity(x1-x5);
LAMBDA=0 specifies a Box-Cox transformation with a power parameter of 0. Since a single value of 0 was specified for LAMBDA=, there is no difference between the following models:

model BoxCox(y / lambda=0) = identity(x1-x5);
model log(y) = identity(x1-x5);
In the second example,

model BoxCox(y / lambda=-2 to 2 by 0.1) = identity(x1-x5);
there is a list of power parameters specified. This tells PROC TRANSREG to find a Box-Cox transformation before the usual iterations begin. PROC TRANSREG tries each power parameter in the list and picks the best transformation. A maximum likelihood approach (Draper and Smith 1981, pp. 225–226) is used. Note that this is quite different from TRANSREG’s usual approach of iteratively finding optimal transformations. It is analogous to SMOOTH, RANK, and the other nonoptimal transformations that are performed before the iterations begin. In the third example,

model BoxCox(y) = identity(x1-x5);
the default list of -3 TO 3 BY 0.25 is used. The procedure prints the optimal power parameter, a confidence interval on the power parameter (using the ALPHA= transformation option), a “convenient” power parameter (selected from the CLL= option list), and the log likelihood for each power parameter tried (see Example 75.6).
Smoothing Splines

You can use PROC TRANSREG to output to a SAS data set the same smoothing splines that the GPLOT procedure creates. The SMOOTH transformation is a noniterative transformation for smoothing splines. The smoothing parameter can be specified with either the SM= or the PARAMETER= o-option. The independent variable transformation (Tx in this case) contains the results. The GPLOT request y*x=2 with I=SM50 creates the same curve as Tx*x.

title 'Smoothing Splines';

data x;
   do x = 1 to 100 by 2;
      do rep = 1 to 3;
         y = log(x) + sin(x / 10) + normal(7);
         output;
      end;
   end;
run;

proc transreg;
   model identity(y) = smooth(x / sm=50);
   output;
run;

%let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;

proc gplot;
   axis1 minor=none label=(angle=90 rotate=0);
   axis2 minor=none;
   plot y*x=1 y*x=2 tx*x=3 / &opts overlay;
   symbol1 color=blue v=star i=none;
   symbol2 color=yellow v=none i=sm50;
   symbol3 color=cyan v=dot i=none;
run; quit;
Figure 75.5. Smoothing Spline Example 1
When you cross a SMOOTH variable with a CLASS variable, specify ZERO=NONE with the CLASS expansion and the AFTER t-option with the SMOOTH transformation so that separate functions are found within each group.

title2 'Two Groups';

data x;
   do x = 1 to 100;
      group = 1;
      do rep = 1 to 3;
         y = log(x) + sin(x / 10) + normal(7);
         output;
      end;
      group = 2;
      do rep = 1 to 3;
         y = -log(x) + cos(x / 10) + normal(7);
         output;
      end;
   end;
run;
proc transreg;
   model identity(y) = class(group / zero=none) | smooth(x / after sm=50);
   output out=curves;
run;

data curves2;
   set curves;
   if group1 = 0 then tgroup1x = .;
   if group2 = 0 then tgroup2x = .;
run;

%let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;

proc gplot;
   axis1 minor=none label=(angle=90 rotate=0);
   axis2 minor=none;
   plot y*x=1 tgroup1x*x=2 tgroup2x*x=2 / &opts overlay;
   symbol1 color=blue v=star i=none;
   symbol2 color=yellow v=none i=join;
run; quit;
Figure 75.6. Smoothing Spline Example 2
The SMOOTH transformation is valid only with independent variables; typically, it is used in models with a single independent and a single dependent variable. When there are multiple independent variables designated as SMOOTH, the TRANSREG procedure tries to smooth the ith independent variable using the ith dependent variable as a target. When there are more independent variables than dependent variables, the last dependent variable is reused as often as is necessary. For example, for the model
model identity(y1-y3) = smooth(x1-x5);
smoothing is based on the pairs (y1, x1), (y2, x2), (y3, x3), (y3, x4), and (y3, x5). The SMOOTH transformation is a noniterative transformation; smoothing occurs once per variable before the iterations begin. In contrast, SSPLINE provides an iterative smoothing spline transformation. It does not generally minimize squared error; hence, divergence is possible with SSPLINE.
Missing Values

PROC TRANSREG can estimate missing values, with or without category or monotonicity constraints, so that the regression model fit is optimized. Several approaches to missing data handling are provided. All observations with missing values in IDENTITY, CLASS, POINT, EPOINT, QPOINT, SMOOTH, PSPLINE, and BSPLINE variables are excluded from the analysis. When METHOD=UNIVARIATE (specified in the PROC TRANSREG or MODEL statement), observations with missing values in any of the independent variables are excluded from the analysis. When you specify the NOMISS a-option, observations with missing values in the other analysis variables are excluded. Otherwise, missing data are estimated, using variable means as initial estimates.

You can specify the LINEAR, OPSCORE, MONOTONE, UNTIE, SPLINE, MSPLINE, SSPLINE, LOG, LOGIT, POWER, ARSIN, BOXCOX, RANK, and EXP transformations in any combination with nonmissing values, ordinary missing values, and special missing values, as long as the nonmissing values in each variable have positive variance. No category or order restrictions are placed on the estimates of ordinary missing values. You can force missing value estimates within a variable to be identical by using special missing values (refer to “DATA Step Processing” in SAS Language Reference: Concepts). You can specify up to 27 categories of missing values, in which within-category estimates must be the same, by coding the missing values using ._ and .A through .Z.

You can also specify an ordering of some missing value estimates. You can use the MONOTONE= a-option in the PROC TRANSREG or MODEL statement to indicate a range of special missing values (a subset of the list from .A to .Z) with estimates that must be weakly ordered within each variable in which they appear. For example, if MONOTONE=AI, the nine classes, .A, .B, . . ., .I, are monotonically scored and optimally scaled just as MONOTONE transformation values are scored. In this case, category but not order restrictions are placed on the missing values ._ and .J through .Z. You can also use the UNTIE= a-option (in the PROC TRANSREG or MODEL statement) to indicate a range of special missing values with estimates that must be weakly ordered within each variable in which they appear but can be untied.

The missing value estimation facilities allow for partitioned or mixed-type variables. For example, a variable can be considered part nominal and part ordinal. Nominal classes of otherwise ordinal variables are coded with special missing values. This feature can be useful with survey research. The class “unfamiliar with the product” in the variable “Rate your preference for ’Brand X’ on a 1 to 9 scale, or if you are unfamiliar with the product, check ’unfamiliar with the product’” is an example. You
can code “unfamiliar with the product” as a special missing value, such as .A. The 1s to 9s can be monotonically transformed, while no monotonic restrictions are placed on the quantification of the “unfamiliar with the product” class. A variable specified for a LINEAR transformation, with special missing values and ordered categorical missing values, can be part interval, part ordinal, and part nominal. A variable specified for a MONOTONE transformation can have two independent ordinal parts. A variable specified for an UNTIE transformation can have an ordered categorical part and an ordered part without category restrictions. Many other mixes are possible.
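The following sketch (with hypothetical data) illustrates the survey situation just described: “unfamiliar with the product” responses, recorded here as the code 99, are recoded to the special missing value .A so that they form their own class while the 1 to 9 ratings are monotonically transformed:

data prefs;
   input rating x @@;
   /* 'unfamiliar with the product' becomes its own class,
      with no order restriction relative to the 1-9 ratings */
   if rating = 99 then rating = .A;
   datalines;
3 10  5 12  99 9  7 15  1 11  9 18  99 14  6 13
;

proc transreg data=prefs;
   model monotone(rating) = identity(x);
   output;
run;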
Missing Values, UNTIE, and Hypothesis Tests

The TRANSREG procedure has the ability to estimate missing data and monotonically transform variables while untying tied values. Estimates of ordinary missing values (.) may all be different. Analyses with UNTIE transformations, the UNTIE= a-option, and ordinary missing data estimation are all prone to degeneracy problems. Consider the following example. A perfect fit is found by collapsing all observations except the one with two missing values into a single value in Y and X1.

data x;
   input y x1 x2 @@;
   datalines;
1 3 7   8 3 9   1 8 6   . . 9   3 3 9
8 5 1   6 7 3   2 7 2   1 8 2   . 9 1
;

proc transreg dummy;
   model linear(y) = linear(x1 x2);
   output;
run;

proc print;
run;
Obs  _TYPE_  _NAME_   y       Ty    Intercept  x1  x2  TIntercept      Tx1  Tx2

  1  SCORE   ROW1     1   2.7680        1       3   7       1        5.1233    7
  2  SCORE   ROW2     8   2.7680        1       3   9       1        5.1233    9
  3  SCORE   ROW3     1   2.7680        1       8   6       1        5.1233    6
  4  SCORE   ROW4     .  12.5878        1       .   9       1       12.7791    9
  5  SCORE   ROW5     3   2.7680        1       3   9       1        5.1233    9
  6  SCORE   ROW6     8   2.7680        1       5   1       1        5.1233    1
  7  SCORE   ROW7     6   2.7680        1       7   3       1        5.1233    3
  8  SCORE   ROW8     2   2.7680        1       7   2       1        5.1233    2
  9  SCORE   ROW9     1   2.7680        1       8   2       1        5.1233    2
 10  SCORE   ROW10    .   2.7680        1       9   1       1        5.1233    1
Figure 75.7. Missing Values Example
Generally, the use of ordinary missing data estimation, the UNTIE transformation, and the UNTIE= a-option should be avoided, particularly with hypothesis tests. With these options, parameters are estimated based on only a single observation, and they can exert tremendous influence over the results. Each of these parameters has one model degree of freedom associated with it, so small or zero error degrees of freedom can also be a problem.
Controlling the Number of Iterations

Several a-options in the PROC TRANSREG or MODEL statement control the number of iterations performed. Iteration terminates when any one of the following conditions is satisfied:

• The number of iterations equals the value of the MAXITER= a-option.
• The average absolute change in variable scores from one iteration to the next is less than the value of the CONVERGE= a-option.
• The criterion change is less than the value of the CCONVERGE= a-option.

You can specify negative values for either convergence option if you wish to define convergence only in terms of the other option. The criterion change can become negative when the data have converged so that it is numerically impossible, within machine precision, to increase the criterion. Usually, a negative criterion change is the result of very small amounts of rounding error, since the algorithms are (usually) convergent. However, there are other cases where a negative criterion change is a sign of divergence, which is not necessarily an error. When you specify an SSPLINE transformation or the REITERATE or DUMMY a-option, divergence may be perfectly normal.

When there are no monotonicity constraints and there is only one canonical variable in each set, PROC TRANSREG (with the DUMMY a-option) can usually find the optimal solution in only one iteration. (There are no monotonicity constraints when the MONOTONE, MSPLINE, or UNTIE transformations and the UNTIE= and MONOTONE= a-options are not specified. There is only one canonical variable in each set when METHOD=MORALS or METHOD=UNIVARIATE, or when METHOD=REDUNDANCY with only one dependent variable, or when METHOD=CANALS and NCAN=1.) The initialization iteration is number 0. When there are no monotonicity constraints and there is only one canonical variable in each set, the next iteration shows no change and iteration stops. At least two iterations (0 and 1) are performed with the DUMMY a-option even if nothing changes in iteration 0. The MONOTONE, MSPLINE, and UNTIE variables are not transformed by the dummy variable initialization. Note that divergence with the DUMMY a-option, particularly in the second iteration, is not an error. The initialization iteration is slower and uses more memory than other iterations. However, for many models, specifying the DUMMY a-option can greatly decrease the amount of time required to find the optimal transformations. Furthermore, by solving for the transformations directly instead of iteratively, PROC TRANSREG avoids certain nonoptimal solutions.
You can increase the number of iterations to ensure convergence by increasing the value of the MAXITER= a-option and decreasing the value of the CONVERGE= a-option. Since the average absolute change in standardized variable scores seldom decreases below 1E−11, you should not specify a value for the CONVERGE= a-option less than 1E−8 or 1E−10. Most of the data changes occur during the first few iterations, but the data can still change after 50 or even 100 iterations. You can try different combinations of values for the CONVERGE= and MAXITER= a-options to ensure convergence without extreme overiteration. If the data do not converge with the default specifications, try CONVERGE=1E−8 and MAXITER=50, or CONVERGE=1E−10 and MAXITER=200. Note that you can specify the REITERATE a-option to start iterating where the previous analysis stopped.
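For example, following the first suggestion above (the data set name matches the surrounding examples):

proc transreg data=a converge=1E-8 maxiter=50;
   model mspline(y) = mspline(x1-x5);
   output out=b coefficients;
run;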
Using the REITERATE Algorithm Option

You can use the REITERATE a-option to perform additional iterations when PROC TRANSREG stops before the data have adequately converged. For example, suppose that you execute the following code:

proc transreg data=a;
   model mspline(y) = mspline(x1-x5);
   output out=b coefficients;
run;
If the transformations do not converge in the default 30 iterations, you can perform more iterations without repeating the first 30 iterations.

proc transreg data=b reiterate;
   model mspline(y) = mspline(x1-x5);
   output out=b coefficients;
run;
Note that a WHERE statement is not necessary to exclude the coefficient observations. They are automatically excluded because their _TYPE_ value is not SCORE. You can also use the REITERATE a-option to specify starting values other than the original values for the transformations. Providing alternate starting points may avoid local optima. Here are two examples.

proc transreg data=a;
   model rank(y) = rank(x1-x5);
   output out=b;
run;

proc transreg data=b reiterate; /* Use ranks as the starting point. */
   model mspline(y) = mspline(x1-x5);
   output out=c coefficients;
run;
data b;
   set a;
   array tx[6] ty tx1-tx5;
   do j = 1 to 6;
      tx[j] = normal(7);
   end;
run;

proc transreg data=b reiterate; /* Use a random starting point. */
   model mspline(y) = mspline(x1-x5);
   output out=c coefficients;
run;
Note that divergence with the REITERATE a-option, particularly in the second iteration, is not an error since the initial transformation is not required to be a valid member of the transformation family. When you specify the REITERATE a-option, the iteration does not terminate when the criterion change is negative during the first 10 iterations.
Avoiding Constant Transformations

There are times when the optimal scaling produces a constant transformed variable. This can happen with the MONOTONE, UNTIE, and MSPLINE transformations when the target is negatively correlated with the original input variable. It can happen with all transformations when the target is uncorrelated with the original input variable. When this happens, the procedure modifies the target to avoid a constant transformation. This strategy avoids certain nonoptimal solutions.

If the transformation is monotonic and a constant transformed variable results, the procedure multiplies the target by −1 and tries the optimal scaling again. If the transformation is not monotonic or if the multiplication by −1 did not help, the procedure tries using a random target. If the transformation is still constant, the previous nonconstant transformation is retained. When a constant transformation is avoided by any strategy, a message is displayed: “A constant transformation was avoided for name.”

With extreme collinearity, small amounts of rounding error might interact with the instability of the coefficients to produce target vectors that are not positively correlated with the original scaling. If a regression coefficient for a variable is zero, the formula for the target for that variable contains a zero divide. In a multiple regression model, after many iterations, one independent variable can be scaled the same way as the current scaling of the dependent variable, so the other independent variables have coefficients of zero. When the constant transformation warning appears, you should interpret your results with extreme caution, and recheck your model.
Constant Variables

Constant and almost constant variables are zeroed and ignored. As long as the dependent variable is not constant, PROC TRANSREG produces an iteration history table for all models, not just models in which the variables can change. When constant variables are expected and should not be zeroed, specify the NOZEROCONSTANT option.
Character OPSCORE Variables

Character OPSCORE variables are replaced by a numeric variable containing category numbers before the iterations, and the character values are discarded. Only the first eight characters are considered when determining category membership. If you want the original character variable in the output data set, give it a different name in the OPSCORE specification (OPSCORE(x / name=(x2))) and name the original variable in the ID statement (ID x;).
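That is (a sketch; the data set names are hypothetical):

proc transreg data=raw;
   model identity(y) = opscore(x / name=(x2));
   id x;   /* carries the original character variable to the OUT= data set */
   output out=scored;
run;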
Convergence and Degeneracies

When you specify the SSPLINE transformation, divergence is normal. The rest of this section assumes that you did not specify SSPLINE. For all the methods available in PROC TRANSREG, the algorithms are convergent, both in terms of the criterion being optimized and the parameters being estimated. The value of the criterion being maximized (squared multiple correlation, average squared multiple correlation, or average squared canonical correlation) can, theoretically, never decrease from one iteration to the next. The values of the parameters being solved for (the scores and weights of the transformed variables) become stable after sufficient iteration.

In practice, the criterion being maximized can decrease with overiteration. When the statistic has very nearly reached its maximum, further iterations might report a decrease in the criterion in the last few decimal places. This is a normal result of very small amounts of rounding error. By default, iteration terminates when this occurs because, by default, CCONVERGE=0.0. Specifying CCONVERGE=−1, an impossible change, turns off this check for convergence.

Even though the algorithms are convergent, they might not converge to a global optimum. Also, under extreme circumstances, the solution might degenerate. Because two points always form a straight line, the algorithms sometimes try to reach this degenerate optimum. This sometimes occurs when one observation is an ordinal outlier (when one observation has the extreme rank on all variables). The algorithm can reach an optimal solution that ties all other categories, producing two points. Similar results can occur when there are many missing values. More generally, whenever there are very few constraints on the scoring of one or more points, degeneracies can be a problem. In a well-behaved analysis, the maximum data change, average data change, and criterion change all decrease at a rapid rate with each iteration. When the rate of change increases for several iterations, the solution might be degenerating.
Implicit and Explicit Intercepts

Depending on several options, the model intercept is nonzero, zero, or implicit, or there is no intercept. Ordinarily, the model contains an explicit nonzero intercept, and the Intercept variable in the OUT= data set contains ones. When TSTANDARD=CENTER or TSTANDARD=Z is specified, the model contains an explicit, zero intercept and the Intercept variable contains zeros. When METHOD=CANALS, the model is fit with centered variables and the Intercept variable is set to missing.

If you specify CLASS with ZERO=NONE or BSPLINE for one or more independent variables, and TSTANDARD=NOMISS or TSTANDARD=ORIGINAL (the default), an implicit intercept model is fit. The intercept is implicit in a set of the independent variables since there exists a set of independent variables the sum of which is a column of ones. All statistics are mean corrected. The implicit intercept is not an option; it is implied by the model.

With METHOD=CANALS, the Intercept variable contains the canonical intercept for canonical coefficients observations: β̂₀ = ȳ′α̂ − x̄′β̂, where Yα̂ ≈ Xβ̂.
Passive Observations

Observations may be excluded from the analysis for several reasons; these include zero weight; zero frequency; missing values in variables designated IDENTITY, CLASS, POINT, EPOINT, QPOINT, SMOOTH, PSPLINE, or BSPLINE; and missing values with the NOMISS a-option specified. These observations are passive in that they do not contribute to determining transformations, R², sums of squares, degrees of freedom, and so on. However, some information can be computed for them. For example, if no independent variable values are missing, predicted values and redundancy variable values can both be computed. Residuals can be computed for observations with a nonmissing dependent and nonmissing predicted value. Canonical variables for dependent variables can be computed when no dependent variables are missing; canonical variables for independent variables can be computed when no independent variables are missing, and so on. Passive observations in the OUT= data set have a blank value for _TYPE_.
Point Models

The expanded set of independent variables generated from the POINT, EPOINT, and QPOINT expansions can be used to perform ideal point regressions (Carroll 1972) and compute ideal point coordinates for plotting in a biplot (Gabriel 1981). The three types of ideal point coordinates can all be described as transformed coefficients. Assume that m independent variables are specified in one of the three point expansions. Let b′ be a 1 × m row vector of coefficients for these variables and one of the dependent variables. Let R be a matrix created from the coefficients of the extra variables. When coordinates are requested with the MPC, MEC, or MQC o-options, b′ and R are created from multiple regression coefficients. When coordinates are requested with the CPC, CEC, or CQC o-options, b′ and R are created from canonical coefficients.
If you specify the POINT expansion in the MODEL statement, R is an m × m identity matrix times the coefficient for the sums of squares (_ISSQ_) variable. If you specify the EPOINT expansion, R is an m × m diagonal matrix of coefficients from the squared variables. If you specify the QPOINT expansion, R is an m × m symmetric matrix of coefficients from the squared variables on the diagonal and crossproduct variables off the diagonal. The MPC, MEC, MQC, CPC, CEC, and CQC ideal point coordinates are defined as −0.5b′R⁻¹.

When R is singular, the ideal point coordinates are infinitely far away and are set to missing, so you should try a simpler version of the model. The version that is simpler than the POINT model is the vector model, where no extra variables are created. In the vector model, designate all independent variables as IDENTITY. Then draw vectors from the origin to the COEFFICIENTS points. Typically, when you request ideal point coordinates, the MODEL statement should consist of a single transformation for the dependent variables (usually IDENTITY, MONOTONE, or MSPLINE) and a single expansion for the independent variables (one of POINT, EPOINT, or QPOINT).
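As a numerical sketch of the coordinate formula (the values of b′ and R below are hypothetical; in practice they come from the coefficient observations in the OUT= data set):

proc iml;
   /* b is the 1 x m row vector of coefficients for the m POINT
      variables; r is m x m, here an identity matrix times a
      hypothetical _ISSQ_ coefficient of -0.5 (POINT expansion case) */
   b = {2.0 -1.0};
   r = {-0.5 0, 0 -0.5};
   coordinates = -0.5 * b * inv(r);   /* -0.5 * b' * inv(R) */
   print coordinates;
quit;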
Redundancy Analysis

Redundancy analysis (Stewart and Love 1968) is a principal component analysis of multivariate regression predicted values. These first steps show the redundancy analysis results produced by PROC TRANSREG. The specification TSTANDARD=Z standardizes all variables to mean zero and variance one. METHOD=REDUNDANCY specifies redundancy analysis and outputs the redundancy variables to the OUT= data set. The MREDUNDANCY o-option outputs two sets of redundancy analysis coefficients to the OUT= data set.

title 'Redundancy Analysis';

data x;
   input y1-y3 x1-x4;
   datalines;
6  8  8 15 18 26 27
1 12 16 18  9 20  8
5  6 15 20 17 29 31
6  9 15 14 10 16 22
7  5 12 14  6 13  9
3  6  7  2 14 26 22
3  5  9 13 18 10 22
6  3 11  3 15 22 29
6  3  7 10 20 21 27
7  5  9  8 10 12 18
;

proc transreg data=x tstandard=z method=redundancy;
   model identity(y1-y3) = identity(x1-x4);
   output out=red mredundancy replace;
run;

proc print data=red(drop=Intercept);
   format _numeric_ 4.1;
run;
                                Redundancy Analysis

Obs  _TYPE_    _NAME_    y1    y2    y3    x1    x2    x3    x4  Red1  Red2  Red3

  1  SCORE     ROW1     0.5   0.6  -0.8   0.6   0.9   1.0   0.7   0.2  -0.5  -0.9
  2  SCORE     ROW2    -2.0   2.1   1.5   1.1  -1.0   0.1  -1.7   1.6  -1.5   0.4
  3  SCORE     ROW3     0.0  -0.1   1.2   1.4   0.7   1.5   1.2   1.0   0.8  -1.3
  4  SCORE     ROW4     0.5   1.0   1.2   0.4  -0.8  -0.5   0.1   0.5   1.7   0.1
  5  SCORE     ROW5     1.0  -0.4   0.3   0.4  -1.6  -1.0  -1.6   1.0   0.1   0.9
  6  SCORE     ROW6    -1.0  -0.1  -1.1  -1.6   0.1   1.0   0.1  -0.8  -0.9   1.4
  7  SCORE     ROW7    -1.0  -0.4  -0.6   0.2   0.9  -1.5   0.1  -1.0  -0.4  -1.3
  8  SCORE     ROW8     0.5  -1.2   0.0  -1.5   0.3   0.4   1.0  -1.2   0.8   0.7
  9  SCORE     ROW9     0.5  -1.2  -1.1  -0.3   1.3   0.2   0.7  -1.0  -0.9  -0.8
 10  SCORE     ROW10    1.0  -0.4  -0.6  -0.6  -0.8  -1.1  -0.4  -0.4   0.8   0.7
 11  M REDUND  Red1      .     .     .    0.7  -0.6   0.4  -0.1    .     .     .
 12  M REDUND  Red2      .     .     .    0.3  -1.5  -0.6   1.9    .     .     .
 13  M REDUND  Red3      .     .     .   -0.7  -0.7   0.3  -0.3    .     .     .
 14  R REDUND  x1        .     .     .     .     .     .     .    0.8  -0.0  -0.6
 15  R REDUND  x2        .     .     .     .     .     .     .   -0.6  -0.2  -0.7
 16  R REDUND  x3        .     .     .     .     .     .     .    0.1  -0.2  -0.1
 17  R REDUND  x4        .     .     .     .     .     .     .   -0.5   0.3  -0.5
Figure 75.8. Redundancy Analysis Example
The _TYPE_=’SCORE’ observations of the Red1–Red3 variables contain the redundancy variables. The nonmissing “M REDUND” values are coefficients for predicting the redundancy variables from the independent variables. The nonmissing “R REDUND” values are coefficients for predicting the independent variables from the redundancy variables. The following steps show how to generate the same results manually. The data set is standardized, predicted values are computed, and principal components of the predicted values are computed. The following statements produce the redundancy variables, shown in Figure 75.9:

proc standard data=x out=std m=0 s=1;
   title2 'Manually Generate Redundancy Variables';
run;

proc reg noprint data=std;
   model y1-y3 = x1-x4;
   output out=p p=ay1-ay3;
run; quit;

proc princomp data=p cov noprint std out=p;
   var ay1-ay3;
run;

proc print data=p(keep=Prin:);
   format _numeric_ 4.1;
run;
                 Redundancy Analysis
       Manually Generate Redundancy Variables

Obs    Prin1    Prin2    Prin3

  1      0.2     -0.5     -0.9
  2      1.6     -1.5      0.4
  3      1.0      0.8     -1.3
  4      0.5      1.7      0.1
  5      1.0      0.1      0.9
  6     -0.8     -0.9      1.4
  7     -1.0     -0.4     -1.3
  8     -1.2      0.8      0.7
  9     -1.0     -0.9     -0.8
 10     -0.4      0.8      0.7
Figure 75.9. Redundancy Analysis Example
The following statements produce the coefficients for predicting the redundancy variables from the independent variables, shown in Figure 75.10:

proc reg data=p outest=redcoef noprint;
   title2 'Manually Create Redundancy Coefficients';
   model Prin1-Prin3 = x1-x4;
run; quit;

proc print data=redcoef(keep=x1-x4);
   format _numeric_ 4.1;
run;
                Redundancy Analysis
      Manually Create Redundancy Coefficients

Obs     x1      x2      x3      x4

 1      0.7    -0.6     0.4    -0.1
 2      0.3    -1.5    -0.6     1.9
 3     -0.7    -0.7     0.3    -0.3
Figure 75.10. Redundancy Analysis Example
The following statements produce the coefficients for predicting the independent variables from the redundancy variables, shown in Figure 75.11:

proc reg data=p outest=redcoef2 noprint;
   title2 'Manually Create Other Coefficients';
   model x1-x4 = prin1-prin3;
run; quit;

proc print data=redcoef2(keep=Prin1-Prin3);
   format _numeric_ 4.1;
run;
              Redundancy Analysis
        Manually Create Other Coefficients

Obs    Prin1    Prin2    Prin3

 1       0.8     -0.0     -0.6
 2      -0.6     -0.2     -0.7
 3       0.1     -0.2     -0.1
 4      -0.5      0.3     -0.5
Figure 75.11. Redundancy Analysis Example
Optimal Scaling

An alternating least-squares optimal scaling algorithm can be divided into two major stages. The first stage estimates the parameters of the linear model. These parameters are used to create the predicted values or target for each variable that can be transformed. Each target minimizes squared error (as explained in the discussion of the algorithms in SAS Technical Report R-108). The definition of the target depends on many factors, such as whether a variable is independent or dependent, which algorithm is used (for example, regression, redundancy, CANALS, principal components), and so on. The definition of the target is independent of the transformation family you specify for the variable. However, the target values for a variable typically do not fit the prescribed transformation family for the variable. They might not have the right category structure; they might not have the right order; they might not be a linear combination of the columns of a B-spline basis; and so on.

The second major stage is optimal scaling. Optimal scaling can be defined as a possibly constrained, least-squares regression problem. When you specify an optimal transformation, or when missing data are estimated for any variable, the full representation of the variable is not simply a vector; it is a matrix with more than one column. The optimal scaling phase finds the vector that is a linear combination of the columns of this matrix and is closest to the target (in terms of minimum squared error), among those that do not violate any of the constraints imposed by the transformation family. Optimal scaling methods are independent of the data analysis method that generated the target.

In all cases, optimal scaling can be accomplished by creating a design matrix based on the original scaling of the variable and the transformation family specified for that variable. The optimally scaled variable is a linear combination of the columns of the design matrix. The coefficients of the linear combination are found using (possibly constrained) least squares. Many optimal scaling problems are solved without actually constructing design and projection matrices. The following two sections describe the algorithms used by PROC TRANSREG for optimal scaling. The first section discusses optimal scaling for OPSCORE, MONOTONE, UNTIE, and LINEAR transformations, including how missing values are handled. The second section addresses SPLINE and MSPLINE transformations.
OPSCORE, MONOTONE, UNTIE, and LINEAR Transformations

Two vectors of information are needed to produce the optimally scaled variable: the initial variable scaling vector x and the target vector y. For convenience, both vectors are first sorted on the values of the initial scaling vector. If you request an UNTIE transformation, the target vector is sorted within ties in the initial scaling vector. The normal SAS System collating sequence for missing and nonmissing values is used. Sorting simply allows constraints to be specified in terms of relations among adjoining coefficients. The sorting process partitions x and y into missing and nonmissing parts (x′ₘ x′ₙ)′ and (y′ₘ y′ₙ)′.

Next, PROC TRANSREG determines category membership. Every ordinary missing value (.) forms a separate category. (Three ordinary missing values form three categories.) Every special missing value within the range specified in the UNTIE= a-option forms a separate category. (If UNTIE=BC and there are three .B and two .C missing values, five categories are formed from them.) For all other special missing values, a separate category is formed for each different value. (If there are four .A missing values, one category is formed from them.) Each distinct nonmissing value forms a separate category for OPSCORE and MONOTONE transformations (1 1 1 2 2 3 form three categories). Each nonmissing datum forms a separate category for all other transformations (1 1 1 2 2 3 form six categories). Once category membership is determined, category means are computed. Here is an example:

x:                            (. . .A .A .B 1 1 1 2 2 3 3 3 4)′
y:                            (5 6  2  4  2 1 2 3 4 6 4 5 6 7)′
OPSCORE and MONOTONE means:   (5 6 3 2 2 5 5 7)′
other means:                  (5 6 3 2 1 2 3 4 6 4 5 6 7)′
The category means are the coefficients of a category indicator design matrix. The category means are the Fisher (1938) optimal scores. For MONOTONE and UNTIE transformations, order constraints are imposed on the category means for the nonmissing partition by merging categories that are out of order. The algorithm checks upward until an order violation is found, then averages downward until the order violation is averaged away. (The average of x̄₁ computed from n₁ observations and x̄₂ computed from n₂ observations is (n₁x̄₁ + n₂x̄₂)/(n₁ + n₂).) The MONOTONE algorithm (Kruskal 1964, secondary approach to ties) for this example, with means for the nonmissing values (2 5 5 7)′, would do the following checks: 2 < 5: OK, 5 = 5: OK, 5 < 7: OK. The means are in the proper order, so no work is needed. The UNTIE transformation (Kruskal 1964, primary approach to ties) uses the same algorithm on the means of the nonmissing values (1 2 3 4 6 4 5 6 7)′, but with different results for this example: 1 < 2: OK, 2 < 3: OK, 3 < 4: OK, 4 < 6: OK, 6 > 4: average
6 and 4 and replace 6 and 4 by the average. The new means of the nonmissing values are (1 2 3 4 5 5 5 6 7)′. The check resumes: 4 < 5: OK, 5 = 5: OK, 5 = 5: OK, 5 < 6: OK, 6 < 7: OK. If some of the special missing values are ordered, the upward checking, downward averaging method is applied to them also, independently of the other missing and nonmissing partitions. Once the means conform to any required category or order constraints, an optimally scaled vector is produced from the means. The following example results from a MONOTONE transformation.

x:        (. . .A .A .B 1 1 1 2 2 3 3 3 4)′
y:        (5 6  2  4  2 1 2 3 4 6 4 5 6 7)′
result:   (5 6  3  3  2 2 2 2 5 5 5 5 5 7)′
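The merging step can be sketched in SAS/IML as a minimal pool-adjacent-violators loop over the category means and counts from the UNTIE example above (this is an illustration, not the procedure's actual code):

proc iml;
   means  = {1 2 3 4 6 4 5 6 7};   /* category means                 */
   counts = {1 1 1 1 1 1 1 1 1};   /* observations per category      */
   i = 1;
   do while (i < ncol(means));
      if means[i] > means[i+1] then do;
         /* order violation: average downward, pooling the counts */
         m = (counts[i]*means[i] + counts[i+1]*means[i+1]) /
             (counts[i] + counts[i+1]);
         means[i]  = m;
         counts[i] = counts[i] + counts[i+1];
         means  = remove(means, i+1);
         counts = remove(counts, i+1);
         if i > 1 then i = i - 1;   /* recheck the previous pair */
      end;
      else i = i + 1;
   end;
   /* expanding the pooled means by their counts gives (1 2 3 4 5 5 5 6 7) */
   print means counts;
quit;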
The upward checking, downward averaging algorithm is equivalent to creating a category indicator design matrix, solving for least-squares coefficients with order constraints, then computing the linear combination of design matrix columns. For the optimal transformation LINEAR and for nonoptimal transformations, missing values are handled as just described. The nonmissing target values are regressed onto the matrix defined by the nonmissing initial scaling values and an intercept. In this example, the target vector yₙ = (1 2 3 4 6 4 5 6 7)′ is regressed onto the design matrix
   ( 1 1 1 1 1 1 1 1 1 )′
   ( 1 1 1 2 2 3 3 3 4 )
Although only a linear transformation is performed, the effect of a linear regression optimal scaling is not eliminated by the later standardization step (unless the variable has no missing values). In the presence of missing values, the linear regression is necessary to minimize squared error.
SPLINE and MSPLINE Transformations

The missing portions of variables subjected to SPLINE or MSPLINE transformations are handled the same way as for OPSCORE, MONOTONE, UNTIE, and LINEAR transformations (see the previous section). The nonmissing partition is handled by first creating a B-spline basis of the specified degree with the specified knots for the nonmissing partition of the initial scaling vector and then regressing the target onto the basis. The optimally scaled vector is a linear combination of the B-spline basis vectors, using least-squares regression coefficients. An algorithm for generating the B-spline basis is given in de Boor (1978, pp. 134–135). B-splines are both a computationally accurate and efficient way of constructing a basis for piecewise polynomials; however, they are not the most natural method of describing splines.

Consider an initial scaling vector x = (1 2 3 4 5 6 7 8 9)′ and a degree three spline with interior knots at 3.5 and 6.5. The B-spline basis for the transformation is the left matrix in Table 75.5, and the natural piecewise polynomial spline basis is the
right matrix. The two matrices span the same column space. The natural basis has an intercept, a linear term, a quadratic term, a cubic term, and two more terms since there are two interior knots. These terms are generated (for knot k and x element x) by the formula (x − k)³ × I(x > k). The indicator variable I(x > k) evaluates to 1.0 if x is greater than k and to 0.0 otherwise. If knot k had been repeated, there would be a (x − k)² × I(x > k) term also. Notice that the fifth column makes no contribution to the curve before 3.5, makes zero contribution at 3.5 (the transformation is continuous), and makes an increasing contribution beyond 3.5. The same pattern of results holds for the last term with knot 6.5. The coefficient of the fifth column represents the change in the cubic portion of the curve after 3.5. The coefficient of the sixth column represents the change in the cubic portion of the curve after 6.5.

Table 75.5. Spline Bases

              B-Spline Basis                    Piecewise Polynomial Splines

1.000 0.000 0.000 0.000 0     0          1  1   1    1     0        0
0.216 0.608 0.167 0.009 0     0          1  2   4    8     0        0
0.008 0.458 0.461 0.073 0     0          1  3   9   27     0        0
0     0.172 0.585 0.241 0.001 0          1  4  16   64     0.125    0
0     0.037 0.463 0.463 0.037 0          1  5  25  125     3.375    0
0     0.001 0.241 0.585 0.172 0          1  6  36  216    15.625    0
0     0     0.073 0.461 0.458 0.008      1  7  49  343    42.875    0.125
0     0     0.009 0.167 0.608 0.216      1  8  64  512    91.125    3.375
0     0     0.000 0.000 0.000 1.000      1  9  81  729   166.375   15.625
The numbers in the B-spline basis do not have a simple interpretation like the numbers in the natural piecewise polynomial basis. The B-spline basis has a diagonally banded structure. The band shifts one column to the right after every knot. The number of entries in each row that may potentially be nonzero is one greater than the degree. The elements within a row always sum to one. The B-spline basis is accurate because of the smallness of the numbers and the lack of extreme collinearity inherent in the natural polynomials. B-splines are efficient because PROC TRANSREG can take advantage of the sparseness of the B-spline basis when it accumulates crossproducts. The number of required multiplications and additions to accumulate the crossproduct matrix does not increase with the number of knots but does increase with the degree of the spline, so it is much more computationally efficient to increase the number of knots than to increase the degree of the polynomial. MSPLINE transformations are handled like SPLINE transformations except that constraints are placed on the coefficients to ensure monotonicity. When the coefficients of the B-spline basis are monotonically increasing, the transformation is monotonically increasing. When the polynomial degree is two or less, monotone coefficient splines, integrated splines (Winsberg and Ramsay 1980), and the general class of all monotone splines are equivalent.
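The two bases in Table 75.5 can be generated directly with the BSPLINE and PSPLINE expansions. The following sketch assumes that the DESIGN o-option, as in the design-coding usage referenced earlier in this chapter, accepts these expansions without a dependent variable; the data set name is hypothetical:

data grid;
   do x = 1 to 9;
      output;
   end;
run;

proc transreg design data=grid;
   model bspline(x / knots=3.5 6.5);   /* left matrix of Table 75.5 */
   output out=bbasis;
run;

proc transreg design data=grid;
   model pspline(x / knots=3.5 6.5);   /* right matrix, without the intercept column */
   output out=pbasis;
run;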
Specifying the Number of Knots

Keep the number of knots small (usually less than ten, although you can specify more). A degree three spline with nine knots, one at each decile, can closely follow a large variety of curves. Each spline transformation of degree p with q knots fits a model with p + q parameters. The total number of parameters should be much less than the number of observations. Usually in regression analyses, it is recommended that there be at least five or ten observations for each parameter in order to get stable results. For example, when spline transformations of degree three with nine knots are requested for six variables, the number of observations in the data set should be at least five or ten times 72 (since 6 × (3 + 9) is the total number of parameters). The overall model can also have a parameter for the intercept and one or more parameters for each nonspline variable in the model.

Increasing the number of knots gives the spline more freedom to bend and follow the data. Increasing the degree also gives the spline more freedom, but to a lesser extent. Specifying a large number of knots is much better than increasing the degree beyond three. When you specify NKNOTS=q for a variable with n observations, each of the q + 1 segments of the spline contains n/(q + 1) observations on the average. When you specify KNOTS=number-list, make sure that there is a reasonable number of observations in each interval.

The following statements find a cubic polynomial transformation of X and no transformation of Y:

proc transreg;
   model identity(Y) = spline(X);
   output;
run;
The following statements find a cubic spline transformation curve for X that consists of the weighted sum of a single constant, a single straight line, a quadratic curve for the portion of the variable less than 3.0, a different quadratic curve for the portion greater than 3.0 (since the 3.0 knot is repeated), and a different cubic curve for each of the intervals: (minimum to 1.5), (1.5 to 2.4), (2.4 to 3.0), (3.0 to 4.0), and (4.0 to maximum). The transformation is continuous everywhere, its first derivative is continuous everywhere, its second derivative is continuous everywhere except at 3.0, and its third derivative is continuous everywhere except at 1.5, 2.4, 3.0, and 4.0.

proc transreg;
   model identity(Y) = spline(X / knots=1.5 2.4 3.0 3.0 4.0);
   output;
run;
The following statements find a quadratic spline transformation that consists of a polynomial X_t = b₀ + b₁X + b₂X² for the range (X < 3.0) and a completely different polynomial X_t = b₃ + b₄X + b₅X² for the range (X > 3.0). The two curves are not required to be continuous at 3.0.
proc transreg;
   model identity(y) = spline(x / knots=3 3 3 degree=2);
   output;
run;
The following statements categorize Y into 10 intervals and find a step-function transformation. One aspect of this transformation family is unlike all other optimal transformation families. The initial scaling of the data does not fit the restrictions imposed by the transformation family. This is because the initial variable can be continuous, but a discrete step-function transformation is sought. Zero-degree spline variables are categorized before the first iteration.

proc transreg;
   model identity(Y) = spline(X / degree=0 nknots=9);
   output;
run;
The following statements find a continuous, piecewise linear transformation of X:

proc transreg;
   model identity(Y) = spline(X / degree=1 nknots=8);
   output;
run;
SPLINE, BSPLINE, and PSPLINE Comparisons

SPLINE is a transformation. It takes a variable as input and produces a transformed variable as output. Internally, with SPLINE, a B-spline basis is used to find the transformation, which is a linear combination of the columns of the B-spline basis. However, with SPLINE, the basis is not made available in any output.

BSPLINE is an expansion. It takes a variable as input and produces more than one variable as output. The output variables comprise the B-spline basis that is used internally by SPLINE.

PSPLINE is an expansion. It takes a variable as input and produces more than one variable as output. The difference between PSPLINE and BSPLINE is that PSPLINE produces a piecewise polynomial, whereas BSPLINE produces a B-spline. A matrix consisting of a piecewise polynomial basis and an intercept spans the same space as the B-spline matrix, but the basis vectors are quite different. The numbers in the piecewise polynomials can get quite large; the numbers in the B-spline basis range between 0 and 1. There are many more zeros in the B-spline basis.

Interchanging SPLINE, BSPLINE, and PSPLINE should have no effect on the fit of the overall model except for the fact that PSPLINE is much more prone to numerical problems. Similarly, interchanging a CLASS expansion and an OPSCORE transformation should have no effect on the fit of the overall model.
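For example, the following three specifications should yield the same overall model fit (a sketch; the PSPLINE version is the one most prone to numerical problems):

proc transreg data=a;
   model identity(y) = spline(x / nknots=3);    /* transformation                  */
run;

proc transreg data=a;
   model identity(y) = bspline(x / nknots=3);   /* B-spline expansion              */
run;

proc transreg data=a;
   model identity(y) = pspline(x / nknots=3);   /* piecewise polynomial expansion  */
run;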
Hypothesis Tests

The TRANSREG procedure has a set of options for testing hypotheses in models with a single dependent variable. The TEST a-option produces an ANOVA table. It tests the null hypothesis that the vector of coefficients for all of the transformations is zero. The SS2 a-option produces a regression table with Type II tests of the contribution of each transformation to the overall model. In some cases, exact tests are provided; in other cases, the tests are approximate, liberal, or conservative.

For two reasons it is typically not appropriate to test hypotheses by using the output from PROC TRANSREG as input to other procedures such as the REG procedure. First, PROC REG has no way of determining how many degrees of freedom were used for each transformation. Second, the Type II sums of squares for the tests of the individual regression coefficients are not correct for the transformation regression model since PROC REG, as it evaluates the effect of each variable, cannot change the transformations of the other variables. PROC TRANSREG uses the correct degrees of freedom and sums of squares.

In an ordinary univariate linear model, there is one parameter for each independent variable, including the intercept. In the transformation regression model, many of the “variables” are used internally in the bases for the transformations. Each basis column has one parameter or scoring coefficient, and each linearly independent column has one model degree of freedom associated with it. Coefficients applied to transformed variables, model coefficients, do not enter into the degrees of freedom calculations. They are by-products of the standardizations and can be absorbed into the transformations by specifying the ADDITIVE a-option. The word parameter is reserved for model and scoring coefficients that have a degree of freedom associated with them.

For expansions, there is one model parameter for each variable created by the expansion (except for all-missing CLASS columns and expansions that have an implicit intercept). Each IDENTITY variable has one model parameter. If there are m POINT variables, they expand to m + 1 variables and, hence, have m + 1 model parameters. For m EPOINT variables, there are 2m model parameters. For m QPOINT variables, there are m(m + 3)/2 model parameters. If a variable with m categories is designated CLASS, there are m − 1 parameters. For BSPLINE and PSPLINE variables of DEGREE=n with NKNOTS=k, there are n + k parameters. Note that one of the n + k + 1 BSPLINE columns and one of the m CLASS(variable / ZERO=NONE) columns are not counted due to the implicit intercept.

There are scoring parameters for missing values in nonexcluded observations. Each ordinary missing value (.) has one scoring parameter. Each different special missing value (._ and .A through .Z) within each variable has one scoring parameter. Missing values specified in the UNTIE= and MONOTONE= options follow the rules for UNTIE and MONOTONE transformations, which are described later in this chapter. For all nonoptimal transformations (LOG, LOGIT, ARSIN, POWER, EXP, RANK, BOXCOX), there is one parameter per variable in addition to any missing value scoring parameters.
For SPLINE, OPSCORE, and LINEAR transformations, the number of scoring parameters is the number of basis columns that are used internally to find the transformations, minus 1 for the intercept. The number of scoring parameters for SPLINE variables is the same as the number of model parameters for BSPLINE and PSPLINE variables. If DEGREE=n and NKNOTS=k, there are n + k scoring parameters. The number of scoring parameters for OPSCORE, SMOOTH, and SSPLINE variables is the same as the number of model parameters for CLASS variables. If there are m categories, there are m - 1 scoring parameters. There is one parameter for each LINEAR variable. For SPLINE, OPSCORE, LINEAR, MONOTONE, UNTIE, and MSPLINE transformations, missing value scoring parameters are computed as described previously for the nonoptimal transformations.

The number of scoring parameters for MONOTONE, UNTIE, and MSPLINE transformations is less precise than for SPLINE, OPSCORE, and LINEAR transformations. One way of handling a MONOTONE transformation is to treat it as if it were the same as an OPSCORE transformation. If there are m categories, there are m - 1 potential scoring parameters. However, there are typically fewer than m - 1 unique parameter estimates, since some of those m - 1 scoring parameter estimates may be tied during the optimal scaling to impose the order constraints. Imposing ties on the scoring parameter estimates is equivalent to fitting a model with fewer parameters. So there are two available scoring parameter counts: m - 1 and a smaller number that is determined during the analysis. Using m - 1 as the model degrees of freedom for MONOTONE variables (treating OPSCORE and MONOTONE transformations the same way) is conservative, since the MONOTONE scoring parameter estimates are more restricted than the OPSCORE scoring parameter estimates. Using the smaller count (the number of scoring parameter estimates that are different, minus 1 for the intercept) in the model degrees of freedom is liberal, since the data and the model together are being used to determine the number of parameters. PROC TRANSREG reports tests using both liberal and conservative degrees of freedom to provide lower and upper bounds on the "true" p-values.

For the UNTIE transformation, the conservative scoring parameter count is the number of distinct observations, whereas the liberal scoring parameter count is the number of scoring parameter estimates that are different, minus 1 for the intercept. Hence, when you specify UNTIE, conservative tests have zero error degrees of freedom unless there are replicated observations.

For MSPLINE variables of DEGREE=n and NKNOTS=k, the conservative scoring parameter count is n + k, whereas the liberal parameter count is the number of scoring parameter estimates that are different, minus 1 for the intercept. A liberal degrees-of-freedom count of 1 does not necessarily imply a linear transformation. It just implies that n plus k minus the number of ties imposed equals 1. An example of a one-degree-of-freedom nonlinear transformation is a two-piece linear transformation in which the slope of one piece is 0.

The number of scoring parameters is determined during each iteration. After the last iteration, enough information is available for the TEST a-option to produce an ANOVA table that reports the overall fit of the model. If you specify the SS2 a-option, further iterations are necessary to test the contribution of each transformation to the overall model.

The liberal tests do not compensate for over-parameterization. For example, requesting a spline transformation with k knots when a linear transformation will suffice results in "liberal" tests that are actually conservative, because too many degrees of freedom are being used for the transformations. Use as few knots as possible to avoid this problem.

In ordinary multiple regression, an F test of the null hypothesis that the coefficient for variable xj is zero can be constructed by comparing two linear models. One model is the full model with all parameters, and the other is a reduced model that has all parameters except the parameter for variable xj. The difference between the model sum of squares for the full model and the model sum of squares for the reduced model is the Type II sum of squares for the test of the null hypothesis that the coefficient for variable xj is 0. The numerator of the F test has one degree of freedom. The mean square error for the full model is the denominator of the F test of variable xj. Note that the estimates of the coefficients for the two models are not usually the same. When variable xj is removed, the coefficients for the other variables change to compensate for the removal of xj.

In a transformation regression model, the transformations of the other variables must be allowed to change, and the numerator degrees of freedom are not always one. It is not correct to simply let the model coefficients for the transformed variables change and apply the new model coefficients to the old transformations computed with the old scoring parameter estimates. In a transformation regression model, further iteration is needed to test each transformation, because all the scoring parameter estimates for other variables must be allowed to change to test the effect of variable xj. This can be quite time-consuming for a large model if the DUMMY a-option cannot be used to solve directly for the transformations.
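As a worked illustration of the parameter-counting rules above (a hedged sketch; the data set A and its variables are hypothetical), consider the model

   proc transreg data=A ss2;
      model identity(Y) = class(C)                        /* C has m=4 levels: m-1 = 3 parameters  */
                          pspline(X1 / degree=3 nknots=2) /* n + k = 3 + 2 = 5 model parameters    */
                          spline(X2 / degree=3 nknots=2); /* n + k = 3 + 2 = 5 scoring parameters  */
   run;

The regression table would then show 3 degrees of freedom for the CLASS expansion and 5 degrees of freedom for each of the spline terms, plus 1 for the intercept.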
Output Data Set

The OUT= output data set can contain a great deal of information; however, in most cases, the output data set contains a small portion of the entire range of available information and is organized for direct input into the %PLOTIT macro or graphical or analysis procedures. For information on the %PLOTIT macro, see Appendix B, "Using the %PLOTIT Macro."
Output Data Set Examples

The next section provides a complete list of the contents of the OUT= data set. However, before presenting complete details, this section provides three brief examples, illustrating some typical output data sets.

The first example shows the output data set from a two-way ANOVA model. The following statements produce Figure 75.12:

   title 'ANOVA Output Data Set Example';

   data ReferenceCell;
      input Y X1 $ X2 $;
      datalines;
   11  a  a
   12  a  a
   10  a  a
    4  a  b
    5  a  b
    3  a  b
    5  b  a
    6  b  a
    4  b  a
    2  b  b
    3  b  b
    1  b  b
   ;

   *---Fit Reference Cell Two-Way ANOVA Model---;
   proc transreg data=ReferenceCell;
      model identity(Y) = class(X1 | X2);
      output coefficients replace predicted residuals;
   run;

   *---Print the Results---;
   proc print;
   run;

   proc contents position;
      ods select position;
   run;
                        ANOVA Output Data Set Example

   Obs  _TYPE_    _NAME_   Y   PY  RY  Intercept  X1a  X2a  X1aX2a  X1  X2

     1  SCORE     ROW1    11   11   0      1      1.0    1      1   a   a
     2  SCORE     ROW2    12   11   1      1      1.0    1      1   a   a
     3  SCORE     ROW3    10   11  -1      1      1.0    1      1   a   a
     4  SCORE     ROW4     4    4   0      1      1.0    0      0   a   b
     5  SCORE     ROW5     5    4   1      1      1.0    0      0   a   b
     6  SCORE     ROW6     3    4  -1      1      1.0    0      0   a   b
     7  SCORE     ROW7     5    5   0      1      0.0    1      0   b   a
     8  SCORE     ROW8     6    5   1      1      0.0    1      0   b   a
     9  SCORE     ROW9     4    5  -1      1      0.0    1      0   b   a
    10  SCORE     ROW10    2    2   0      1      0.0    0      0   b   b
    11  SCORE     ROW11    3    2   1      1      0.0    0      0   b   b
    12  SCORE     ROW12    1    2  -1      1      0.0    0      0   b   b
    13  M COEFFI  Y        .    .   .      2      2.0    3      4
    14  MEAN      Y        .    .   .      .      7.5    8     11
                        ANOVA Output Data Set Example

                           The CONTENTS Procedure

                        Variables in Creation Order

           #  Variable   Type  Len  Label

           1  _TYPE_     Char    8
           2  _NAME_     Char   32
           3  Y          Num     8
           4  PY         Num     8  Y Predicted Values
           5  RY         Num     8  Y Residuals
           6  Intercept  Num     8  Intercept
           7  X1a        Num     8  X1 a
           8  X2a        Num     8  X2 a
           9  X1aX2a     Num     8  X1 a * X2 a
          10  X1         Char    8
          11  X2         Char    8

Figure 75.12. ANOVA Example Output Data Set Contents
The _TYPE_ variable indicates observation type: score, multiple regression coefficient (parameter estimates), and marginal means. The _NAME_ variable contains the default observation labels, "ROW1", "ROW2", and so on, and contains the dependent variable name (Y) for the remaining observations. If you specify an ID statement, _NAME_ contains the values of the first ID variable for score observations. The Y variable is the dependent variable, PY contains the predicted values, RY contains the residuals, and the variables Intercept through X1aX2a contain the design matrix. The X1 and X2 variables are the original CLASS variables.

The next example shows the contents of the output data set from fitting a curve through a scatter plot.

   title 'Output Data Set for Curve Fitting Example';

   data A;
      do X = 1 to 100;
         Y = log(X) + sin(X / 10) + normal(7);
         output;
      end;
   run;

   proc transreg;
      model identity(Y) = spline(X / nknots=9);
      output predicted out=B;
   run;

   proc contents position;
      ods select position;
   run;
These statements produce Figure 75.13.

                  Output Data Set for Curve Fitting Example

                           The CONTENTS Procedure

                        Variables in Creation Order

           #  Variable    Type  Len  Label

           1  _TYPE_      Char    8
           2  _NAME_      Char   32
           3  Y           Num     8
           4  TY          Num     8  Y Transformation
           5  PY          Num     8  Y Predicted Values
           6  Intercept   Num     8  Intercept
           7  X           Num     8
           8  TIntercept  Num     8  Intercept Transformation
           9  TX          Num     8  X Transformation

Figure 75.13. Predicted Values Example Output Data Set Contents
The OUT= data set contains _TYPE_ and _NAME_ variables. Since no coefficients or coordinates are requested, all observations are _TYPE_='SCORE'. The Y variable is the original dependent variable, TY is the transformed dependent variable, PY contains the predicted values, X is the original independent variable, and TX is the transformed independent variable. The data set also contains an Intercept and a transformed intercept TIntercept variable. (In this case, the transformed intercept is the same as the intercept. However, if you specify the TSTANDARD= and ADDITIVE options, these are not always the same.)
The next example shows the results from specifying METHOD=MORALS when there is more than one dependent variable.

   title 'METHOD=MORALS Output Data Set Example';

   data x;
      input Y1 Y2 X1 $ X2 $;
      datalines;
   11  1  a  a
   10  4  b  a
    5  2  a  b
    5  9  b  b
    4  3  c  c
    3  6  b  a
    1  8  a  b
   ;

   *---Fit Reference Cell Two-Way ANOVA Model---;
   proc transreg data=x noprint dummy;
      model spline(Y1 Y2) = opscore(X1 X2 / name=(N1 N2));
      output coefficients predicted residuals;
      id x1 x2;
   run;

   *---Print the Results---;
   proc print;
   run;

   proc contents position;
      ods select position;
   run;
These statements produce Figure 75.14.
                    METHOD=MORALS Output Data Set Example

 Obs  _DEPVAR_    _TYPE_    _NAME_  _DEPEND_  T_DEPEND_  P_DEPEND_  R_DEPEND_

   1  Spline(Y1)  SCORE     a          11      13.1600    11.1554    2.00464
   2  Spline(Y1)  SCORE     b          10       6.1931     6.8835   -0.69041
   3  Spline(Y1)  SCORE     a           5       2.4467     4.7140   -2.26724
   4  Spline(Y1)  SCORE     b           5       2.4467     0.4421    2.00464
   5  Spline(Y1)  SCORE     c           4       4.2076     4.2076    0.00000
   6  Spline(Y1)  SCORE     b           3       5.5693     6.8835   -1.31422
   7  Spline(Y1)  SCORE     a           1       4.9766     4.7140    0.26261
   8  Spline(Y1)  M COEFFI  Y1          .            .          .          .
   9  Spline(Y2)  SCORE     a           1      -0.5303    -0.5199   -0.01043
  10  Spline(Y2)  SCORE     b           4       5.5487     4.5689    0.97988
  11  Spline(Y2)  SCORE     a           2       3.8940     4.5575   -0.66347
  12  Spline(Y2)  SCORE     b           9       9.6358     9.6462   -0.01043
  13  Spline(Y2)  SCORE     c           3       5.6210     5.6210    0.00000
  14  Spline(Y2)  SCORE     b           6       3.5994     4.5689   -0.96945
  15  Spline(Y2)  SCORE     a           8       5.2314     4.5575    0.67390
  16  Spline(Y2)  M COEFFI  Y2          .            .          .          .

 Obs  Intercept  N1  N2  TIntercept       TN1       TN2  X1  X2

   1      1       0   0      1.0000   0.06711  -0.09384  a   a
   2      1       1   0      1.0000   1.51978  -0.09384  b   a
   3      1       0   1      1.0000   0.06711   1.32038  a   b
   4      1       1   1      1.0000   1.51978   1.32038  b   b
   5      1       2   2      1.0000   0.23932   1.32038  c   c
   6      1       1   0      1.0000   1.51978  -0.09384  b   a
   7      1       0   1      1.0000   0.06711   1.32038  a   b
   8      .       .   .     10.9253  -2.94071  -4.55475  Y1  Y1
   9      1       0   0      1.0000   0.03739  -0.09384  a   a
  10      1       1   0      1.0000   1.51395  -0.09384  b   a
  11      1       0   1      1.0000   0.03739   1.32038  a   b
  12      1       1   1      1.0000   1.51395   1.32038  b   b
  13      1       2   2      1.0000   0.34598   1.32038  c   c
  14      1       1   0      1.0000   1.51395  -0.09384  b   a
  15      1       0   1      1.0000   0.03739   1.32038  a   b
  16      .       .   .     -0.3119   3.44636   3.59024  Y2  Y2

Figure 75.14. METHOD=MORALS Rolled Output Data Set
                           The CONTENTS Procedure

                        Variables in Creation Order

        #  Variable    Type  Len  Label

        1  _DEPVAR_    Char   42  Dependent Variable Transformation(Name)
        2  _TYPE_      Char    8
        3  _NAME_      Char   32
        4  _DEPEND_    Num     8  Dependent Variable
        5  T_DEPEND_   Num     8  Dependent Variable Transformation
        6  P_DEPEND_   Num     8  Dependent Variable Predicted Values
        7  R_DEPEND_   Num     8  Dependent Variable Residuals
        8  Intercept   Num     8  Intercept
        9  N1          Num     8
       10  N2          Num     8
       11  TIntercept  Num     8  Intercept Transformation
       12  TN1         Num     8  N1 Transformation
       13  TN2         Num     8  N2 Transformation
       14  X1          Char    8
       15  X2          Char    8

Figure 75.14. (continued)
If you specify METHOD=MORALS with multiple dependent variables, PROC TRANSREG performs separate univariate analyses and stacks the results in the OUT= data set. For this example, the results of the first analysis are in the partition designated by _DEPVAR_='Spline(Y1)' and the results of the second analysis are in the partition designated by _DEPVAR_='Spline(Y2)'; these values are the transformation and dependent variable names. Each partition has _TYPE_='SCORE' observations for the variables and a _TYPE_='M COEFFI' observation for the coefficients. In this example, an ID variable is specified, so the _NAME_ variable contains the formatted values of the first ID variable.

Since both dependent variables have to go into the same column, the dependent variable is given a new name, _DEPEND_. The dependent variable transformation is named T_DEPEND_, the predicted values variable is named P_DEPEND_, and the residuals variable is named R_DEPEND_.

The independent variables are character OPSCORE variables. By default, PROC TRANSREG replaces character OPSCORE variables with category numbers and discards the original character variables. To avoid this, the input variables are renamed from X1 and X2 to N1 and N2, and the original X1 and X2 are added to the data set as ID variables. The N1 and N2 variables contain the initial values for the OPSCORE transformations, and the TN1 and TN2 variables contain the optimal scores. The data set also contains an Intercept and a transformed intercept TIntercept variable. The regression coefficients are in the transformation columns, which also contain the variables to which they apply.
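Because the stacked MORALS results are keyed by _DEPVAR_ and _TYPE_, one partition can be pulled out with an ordinary WHERE clause. A minimal sketch, assuming the OUTPUT statement in the example above had specified out=morals_out (the data set name is hypothetical):

   *--- keep only the score observations for the first dependent variable ---;
   data y1scores;
      set morals_out;
      where _depvar_ = 'Spline(Y1)' and _type_ = 'SCORE';
   run;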
Output Data Set Contents

This section presents the various matrices that can result from PROC TRANSREG processing and that appear in the OUT= data set. The exact contents of an OUT= data set depend on many options.
Table 75.6. PROC TRANSREG OUT= Data Set Contents

   _TYPE_    Contents                                Options, Default Prefix

   SCORE     dependent variables                     DREPLACE not specified
   SCORE     independent variables                   IREPLACE not specified
   SCORE     transformed dependent variables         default, TDPREFIX=T
   SCORE     transformed independent variables       default, TIPREFIX=T
   SCORE     predicted values                        PREDICTED, PPREFIX=P
   SCORE     residuals                               RESIDUALS, RDPREFIX=R
   SCORE     leverage                                LEVERAGE, LEVERAGE=Leverage
   SCORE     lower individual confidence limits      CLI, LILPREFIX=LIL, CILPREFIX=CIL
   SCORE     upper individual confidence limits      CLI, LIUPREFIX=LIU, CIUPREFIX=CIU
   SCORE     lower mean confidence limits            CLM, LMLPREFIX=LML, CMLPREFIX=CML
   SCORE     upper mean confidence limits            CLM, LMUPREFIX=LMU, CMUPREFIX=CMU
   SCORE     dependent canonical variables           CANONICAL, CDPREFIX=Cand
   SCORE     independent canonical variables         CANONICAL, CIPREFIX=Cani
   SCORE     redundancy variables                    REDUNDANCY, RPREFIX=Red
   SCORE     ID, CLASS, BSPLINE variables            ID, CLASS, BSPLINE
   SCORE     independent variables approximations    IAPPROXIMATIONS, IAPREFIX=A
   M COEFFI  multiple regression coefficients        COEFFICIENTS, MRC
   C COEFFI  canonical coefficients                  COEFFICIENTS, CCC
   MEAN      marginal means                          COEFFICIENTS, MEANS
   M REDUND  multiple redundancy coefficients        MREDUNDANCY
   R REDUND  multiple redundancy coefficients        MREDUNDANCY
   M POINT   point coordinates                       COORDINATES or MPC, POINT
   M EPOINT  elliptical point coordinates            COORDINATES or MEC, EPOINT
   M QPOINT  quadratic point coordinates             COORDINATES or MQC, QPOINT
   C POINT   canonical point coordinates             COORDINATES or CPC, POINT
   C EPOINT  canonical elliptical point coordinates  COORDINATES or CEC, EPOINT
   C QPOINT  canonical quadratic point coordinates   COORDINATES or CQC, QPOINT
The independent and dependent variables are created from the original input data. Several potential differences exist between these variables and the actual input data. An intercept variable can be added, new variables can be added for POINT, EPOINT, QPOINT, CLASS, IDENTITY, PSPLINE, and BSPLINE variables, and category numbers are substituted for character OPSCORE variables. These matrices are not always what is input to the first iteration. After the expanded data set is stored for inclusion in the output data set, several things happen to the data before they are input to the first iteration: column means are substituted for missing values; zero-degree SPLINE and MSPLINE variables are transformed so that the iterative algorithms get step function data as input, which conform to the zero-degree transformation family restrictions; and the nonoptimal transformations are performed.

Details for the UNIVARIATE Method

When you specify METHOD=UNIVARIATE (in the MODEL or PROC TRANSREG statement), PROC TRANSREG can perform several analyses, one for each dependent variable. While each dependent variable can be transformed, the independent variables are not transformed. The OUT= data set optionally contains all of the _TYPE_='SCORE' observations, optionally followed by coefficients or coordinates.

Details for the MORALS Method
When you specify METHOD=MORALS (in the MODEL or PROC TRANSREG statement), successive analyses are performed, one for each dependent variable. Each analysis transforms one dependent variable and the entire set of the independent variables. All information for the first dependent variable (scores, then, optionally, coefficients) appears first. Then all information for the second dependent variable (scores, then, optionally, coefficients) appears next. This arrangement is repeated for all dependent variables.

Details for the CANALS and REDUNDANCY Methods
For METHOD=CANALS and METHOD=REDUNDANCY (specified in either the MODEL or PROC TRANSREG statement), one analysis is performed that simultaneously transforms all dependent and independent variables. The OUT= data set optionally contains all of the – TYPE– =’SCORE’ observations, optionally followed by coefficients or coordinates.
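A minimal sketch contrasting the two layouts (the data set AB and its variables are hypothetical):

   proc transreg data=AB method=morals;   /* stacked: one partition per dependent variable */
      model spline(Y1 Y2) = spline(X1 X2);
   run;

   proc transreg data=AB method=canals;   /* one simultaneous analysis of all variables */
      model spline(Y1 Y2) = spline(X1 X2);
   run;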
Variable Names

As shown in the preceding examples, some variables in the output data set directly correspond to input variables and some are created. All original optimal and nonoptimal transformation variable names are unchanged.

The names of the POINT, QPOINT, and EPOINT expansion variables are also left unchanged, but new variables are created. When independent POINT variables are present, the sum-of-squares variable _ISSQ_ is added to the output data set. For each EPOINT and QPOINT variable, a new squared variable is created by appending "_2". For example, Dim1 and Dim2 are expanded into Dim1, Dim2, Dim1_2, and Dim2_2. In addition, for each pair of QPOINT variables, a new crossproduct variable is created by combining the two names, for example, Dim1Dim2.

The names of the CLASS variables are constructed from original variable names and levels. Lengths are controlled by the CPREFIX= a-option. For example, when X1 and X2 both have values of 'a' and 'b', CLASS(X1 | X2 / ZERO=NONE) creates X1 main effect variable names X1a X1b, X2 main effect variable names X2a X2b, and interaction variable names X1aX2a X1aX2b X1bX2a X1bX2b. PROC TRANSREG then uses these variable names when creating the transformed, predicted, and residual variable names by affixing the relevant prefix and possibly dropping extra characters.
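For example, the following sketch (with a hypothetical data set Coords containing Y, Dim1, and Dim2) adds the squared and crossproduct variables described above to the OUT= data set:

   proc transreg data=Coords;
      model identity(Y) = qpoint(Dim1 Dim2);
      output out=Expanded;
   run;
   *--- Expanded now also contains Dim1_2, Dim2_2, and Dim1Dim2 ---;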
METHOD=MORALS Variable Names
When you specify METHOD=MORALS and only one dependent variable is present, the output data set is structured exactly as if METHOD=REDUNDANCY (see the section "Details for the CANALS and REDUNDANCY Methods" on page 4625). When more than one dependent variable is present, the dependent variables are output in the variable _DEPEND_, transformed dependent variables are output in the variable T_DEPEND_, predicted values are output in the variable P_DEPEND_, and residuals are output in the variable R_DEPEND_. You can partition the data set into BY groups, one per dependent variable, by referring to the character variable _DEPVAR_, which contains the original dependent variable names and transformations.

Duplicate Variable Names
When the same name is generated from multiple variables in the OUT= data set, new names are created by appending '2', '3', or '4', and so on, until a unique name is created. For 32-character names, the last character is replaced with a numeric suffix until a unique name is created. For example, if there are two output variables that otherwise would be named X, then X and X2 are created instead. If there are two output variables that otherwise would be named ThisIsAThirtyTwoCharacterVarName, then ThisIsAThirtyTwoCharacterVarName and ThisIsAThirtyTwoCharacterVarNam2 are created instead.
OUTTEST= Output Data Set

The OUTTEST= data set contains hypothesis test results. The OUTTEST= data set always contains ANOVA results. When you specify the SS2 a-option, regression tables are also output. When you specify the UTILITIES a-option, conjoint analysis part-worth utilities are also output. The OUTTEST= data set has the following variables:
_DEPVAR_      is a 42-character variable that contains the dependent variable
              transformation and name.

_TYPE_        is an 8-character variable that contains the table type. The
              first character is "U" for univariate or "M" for multivariate.
              The second character is blank. The third character is "A" for
              ANOVA, "2" for Type II sum of squares, or "U" for UTILITIES.
              The fourth character is blank. The fifth character is "L" for
              liberal tests, "C" for conservative tests, or "U" for the
              usual tests.

Title         is an 80-character variable that contains the table title.

Variable      is a 42-character variable that contains the independent
              variable transformations and names for regression tables and
              blanks for ANOVA tables.

Coefficient   contains the multiple regression coefficients for regression
              tables and underscore special missing values for ANOVA tables.

Statistic     is a 24-character variable that contains the names for
              statistics in other variables, such as Value.

Value         contains multivariate test statistics and all other
              information that does not fit in one of the other columns,
              including R-Square, Dependent Mean, Adj R-Sq, and Coeff Var.
              Whenever Value is not an underscore special missing value,
              Statistic describes the contents of Value.

NumDF         contains numerator degrees of freedom for F tests.

DenDF         contains denominator degrees of freedom for F tests.

SSq           contains sums of squares.

MeanSquare    contains mean squares.

F             contains F statistics.

NumericP      contains the p-value for the F statistic, stored in a numeric
              variable.

P             is a 9-character variable that contains the formatted p-value
              for the F statistic, including the appropriate ~, <=, >=, or
              blank symbols.

LowerLimit    contains lower confidence limits on the parameter estimates.

UpperLimit    contains upper confidence limits on the parameter estimates.

StdError      contains standard errors. For SS2 and UTILITIES tables,
              standard errors are output for each coefficient with one
              degree of freedom.

Importance    contains the relative importance of each factor for
              UTILITIES tables.

Label         is a 256-character variable that contains variable labels.
There are several possible tables in the OUTTEST= data set corresponding to combinations of univariate and multivariate tests; ANOVA and regression results; and liberal, conservative, and the usual tests. Each table is composed of only a subset of the variables. Numeric variables contain underscore special missing values when they are not a column in a table. Ordinary missing values (.) appear in variables that are part of a table when a nonmissing value cannot be produced. For example, the F statistic is missing for a test with zero degrees of freedom.
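A minimal sketch of requesting and browsing an OUTTEST= data set (the data set htex is the artificial data set used in the hypothesis test examples later in this chapter; the name tests is arbitrary):

   proc transreg data=htex outtest=tests ss2;
      model identity(y) = spline(x1);
   run;

   proc print data=tests;
   run;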
Computational Resources

This section provides information on the computational resources required to use PROC TRANSREG. Let

   n = number of observations
   q = number of expanded independent variables
   r = number of expanded dependent variables
   k = maximum spline degree
   p = maximum number of knots
• More than 56(q + r) plus the maximum of the data matrix size, the optimal scaling work space, and the covariance matrix size bytes of array space are required. The data matrix size is 8n(q + r) bytes. The optimal scaling work space requires less than 8(6n + (p + k + 2)(p + k + 11)) bytes. The covariance matrix size is 4(q + r)(q + r + 1) bytes. (These formulas are evaluated for a sample problem in the sketch after this list.)

• PROC TRANSREG tries to store the original and transformed data in memory. If there is not enough memory, a utility data set is used, potentially resulting in a large increase in execution time. The amount of memory for the preceding data formulas is an underestimate of the amount of memory needed to handle most problems. These formulas give the absolute minimum amount of memory required. If a utility data set is used, and if memory can be used with perfect efficiency, then roughly the amount of memory stated previously is needed. In reality, most problems require at least two or three times the minimum.

• PROC TRANSREG sorts the data once. The sort time is roughly proportional to (q + r)n^(3/2).

• One regression analysis per iteration is required to compute model parameters (or two canonical correlation analyses per iteration for METHOD=CANALS). The time required for accumulating the crossproducts matrix is roughly proportional to n(q + r)^2. The time required to compute the regression coefficients is roughly proportional to q^3.

• Each optimal scaling is a multiple regression problem, although some transformations are handled with faster special-case algorithms. The number of regressors for the optimal scaling problems depends on the original values of the variable and the type of transformation. For each monotone spline transformation, an unknown number of multiple regressions is required to find a set of coefficients that satisfies the constraints. The B-spline basis is generated twice for each SPLINE and MSPLINE transformation for each iteration. The time required to generate the B-spline basis is roughly proportional to nk^2.
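The following DATA step sketch evaluates the minimum-memory formulas above for a hypothetical problem size (the values of n, q, r, k, and p are made up for illustration):

   data _null_;
      n = 1000; q = 10; r = 1; k = 3; p = 9;
      datamat = 8 * n * (q + r);                      /* data matrix: 88,000 bytes              */
      optwork = 8 * (6*n + (p+k+2)*(p+k+11));         /* optimal scaling work space: 50,576     */
      covmat  = 4 * (q + r) * (q + r + 1);            /* covariance matrix: 528 bytes           */
      minimum = 56*(q + r) + max(datamat, optwork, covmat);
      put 'Absolute minimum array space: ' minimum comma12. ' bytes';   /* 88,616 bytes */
   run;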
Solving Standard Least-Squares Problems

This section illustrates how to solve some ordinary least-squares problems and generalizations of those problems by formulating them as transformation regression problems. One problem involves finding linear and nonlinear regression functions in a scatter plot. The next problem involves simultaneously fitting two lines or curves through a scatter plot. The last problem involves finding the overall fit of a multi-way main-effects and interactions analysis-of-variance model.
Nonlinear Regression Functions

This example uses PROC TRANSREG in simple regression to find the optimal regression line, a nonlinear but monotone regression function, and a nonlinear nonmonotone regression function. A regression line can be found by specifying

   proc transreg;
      model identity(y) = identity(x);
      output predicted;
   run;
A monotone regression function (in this case, a monotonically decreasing regression function, since the correlation coefficient is negative) can be found by requesting an MSPLINE transformation of the independent variable, as follows.

   proc transreg;
      model identity(y) = mspline(x / nknots=9);
      output predicted;
   run;
The monotonicity restriction can be relaxed by requesting a SPLINE transformation of the independent variable, as shown next.

   proc transreg;
      model identity(y) = spline(x / nknots=9);
      output predicted;
   run;
In this example, it is not useful to plot the transformation TX, since TX is just an intermediate result used in finding a regression function through the original X and Y scatter plot.

The following statements provide a specific example of using the TRANSREG procedure for fitting nonlinear regression functions. These statements produce Figure 75.15 through Figure 75.18.

   title 'Linear and Nonlinear Regression Functions';

   *---Generate an Artificial Nonlinear Scatter Plot---;
   *---SAS/IML Software is Required for this Example---;
   proc iml;
      N = 500;
      X = (1:N)`;
      X = X/(N/200);
      Y = -((X/50)-1.5)##2 + sin(X/8) + sqrt(X)/5 + 2*log(X) + cos(X);
      X = X - X[:,];
      X = -X / sqrt(X[##,]/(n-1));
      Y = Y - Y[:,];
      Y = Y / sqrt(Y[##,]/(n-1));
      all = Y || X;
      create outset from all;
      append from all;
      quit;

   data A;
      set outset(rename=(col1=Y col2=X));
      if Y < -2 then Y = -2 + ranuni(7654321)/2;
      X1=X; X2=X; X3=X; X4=X;
   run;

   *---Predicted Values for the Linear Regression Line---;
   proc transreg data=A;
      title2 'A Linear Regression Line';
      model identity(Y)=identity(X);
      output out=A pprefix=L;
      id X1-X4;
   run;

   *---Predicted Values for the Monotone Regression Function---;
   proc transreg data=A;
      title2 'A Monotone Regression Function';
      model identity(Y)=mspline(X / nknots=9);
      output out=A pprefix=M;
      id X1-X4 LY;
   run;

   *---Predicted Values for the Nonmonotone Regression Function---;
   proc transreg data=A;
      title2 'A Nonmonotone Regression Function';
      model identity(Y)=spline(X / nknots=9);
      output out=A predicted;
      id X1-X4 LY MY;
   run;

   *---Plot the Results---;
   goptions goutmode=replace nodisplay;
   %let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
   * Depending on your goptions, these plot options may work better:
   * %let opts = haxis=axis2 vaxis=axis1 frame;
   proc gplot data=A;
      title;
      axis1 minor=none label=(angle=90 rotate=0) order=(-2 to 2 by 2);
      axis2 minor=none order=(-2 to 2 by 2);
      plot Y*X1=1 / &opts name='tregnl1';
      plot Y*X2=1 LY*X2=2 / overlay &opts name='tregnl2';
      plot Y*X3=1 MY*X3=2 / overlay &opts name='tregnl3';
      plot Y*X4=1 PY*X4=2 / overlay &opts name='tregnl4';
      symbol1 color=blue v=star i=none;
      symbol2 color=yellow v=none i=join;
      label X1 = 'Nonlinear Scatter Plot'
            X2 = 'Linear Regression, r**2 = 0.14580'
            X3 = 'Monotone Function, r**2 = 0.60576'
            X4 = 'Nonlinear Function, r**2 = 0.89634';
   run; quit;
   goptions display;

   proc greplay nofs tc=sashelp.templt template=l2r2;
      igout gseg;
      treplay 1:tregnl1 2:tregnl3 3:tregnl2 4:tregnl4;
   run; quit;
                 Linear and Nonlinear Regression Functions
                         A Linear Regression Line

                          The TRANSREG Procedure

      TRANSREG Univariate Algorithm Iteration History for Identity(Y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           1    0.00000    0.00000     0.14580                 Converged

   Algorithm converged.

Figure 75.15. A Linear Regression Line

                 Linear and Nonlinear Regression Functions
                      A Monotone Regression Function

                          The TRANSREG Procedure

        TRANSREG MORALS Algorithm Iteration History for Identity(Y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           1    0.62131    1.34209     0.14580
           2    0.00000    0.00000     0.60576      0.45995    Converged

   Algorithm converged.

Figure 75.16. A Monotone Regression Function

                 Linear and Nonlinear Regression Functions
                     A Nonmonotone Regression Function

                          The TRANSREG Procedure

        TRANSREG MORALS Algorithm Iteration History for Identity(Y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           1    0.83948    2.78984     0.14580
           2    0.00000    0.00000     0.89634      0.75054    Converged

   Algorithm converged.

Figure 75.17. A Nonmonotone Regression Function
Figure 75.18. Linear, Monotone, and Nonmonotone Regression Functions
The squared correlation is only 0.15 for the linear regression, showing that a simple linear regression model is not appropriate for these data. By relaxing the constraints placed on the regression line, the proportion of variance accounted for increases from 0.15 (linear) to 0.61 (monotone) to 0.90 (nonmonotone). Relaxing the linearity constraint allows the regression function to bend and more closely follow the right portion of the scatter plot. Relaxing the monotonicity constraint allows the regression function to follow the periodic portion of the left side of the plot more closely.

The nonlinear MSPLINE transformation is a quadratic spline with knots at the deciles. The nonlinear nonmonotonic SPLINE transformation is a cubic spline with knots at the deciles. Different knots and different degrees would produce slightly different results. The two nonlinear regression functions could be closely approximated by simpler piecewise linear regression functions. The monotone function could be approximated by a two-piece line with a single knot at the elbow. The nonmonotone function could be approximated by a six-piece function with knots at the five elbows.

With this type of problem (one dependent variable with no missing values that is not transformed and one independent variable that is nonlinearly transformed), PROC TRANSREG always iterates exactly twice (although only one iteration is necessary). The first iteration reports the R-square for the linear regression line and finds the optimal transformation of X. Since the data change in the first iteration, a second iteration is performed, which reports the R-square for the final nonlinear regression function, and zero data change. The predicted values, which are a linear function of the optimal transformation of X, contain the y-coordinates for the nonlinear regression function. The variance of the predicted values divided by the variance of Y is the R-square for the fit of the nonlinear regression function. When X is monotonically transformed, the transformation of X is always monotonically increasing, but the predicted values increase if the correlation is positive and decrease for negative correlations.
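The last point can be checked directly from the output data set. A minimal sketch, using the data set A and the variables Y and PY created by the preceding statements: the ratio of the variance of the predicted values to the variance of Y reproduces the R-square of the nonlinear fit.

   proc means data=A noprint;
      var Y PY;
      output out=v var=VarY VarPY;
   run;

   data _null_;
      set v;
      rsq = VarPY / VarY;   /* should be about 0.896 for the spline fit */
      put 'R-square for the nonlinear function: ' rsq 6.4;
   run;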
Simultaneously Fitting Two Regression Functions

One application of ordinary multiple regression is fitting two or more regression lines through a single scatter plot. With PROC TRANSREG, this application can easily be generalized to fit separate or parallel curves. To illustrate, consider a data set with two groups. The data set has a continuous independent variable X, a continuous dependent variable Y, and a group membership variable G that has the value 1 for one group and 2 for the other group. The following code shows how PROC TRANSREG can be used to fit two lines, curves, and monotone curves simultaneously through a scatter plot. You can use this code with an appropriate number-list for the KNOTS= t-option.

   proc transreg data=A dummy;
      title 'Parallel Lines, Separate Intercepts';
      model identity(Y)=class(G) identity(X);
      output predicted;
   run;

   proc transreg data=A;
      title 'Parallel Monotone Curves, Separate Intercepts';
      model identity(Y)=class(G) mspline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;

   proc transreg data=A dummy;
      title 'Parallel Curves, Separate Intercepts';
      model identity(Y)=class(G) spline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;

   proc transreg data=A;
      title 'Separate Slopes, Same Intercept';
      model identity(Y)=class(G / zero=none) * identity(X);
      output predicted;
   run;

   proc transreg data=A;
      title 'Separate Monotone Curves, Same Intercept';
      model identity(Y) = class(G / zero=none) *
                          mspline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;

   proc transreg data=A dummy;
      title 'Separate Curves, Same Intercept';
      model identity(Y) = class(G / zero=none) *
                          spline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;

   proc transreg data=A;
      title 'Separate Slopes, Separate Intercepts';
      model identity(Y) = class(G / zero=none) | identity(X);
      output predicted;
   run;

   proc transreg data=A;
      title 'Separate Monotone Curves, Separate Intercepts';
      model identity(Y) = class(G / zero=none) |
                          mspline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;

   proc transreg data=A dummy;
      title 'Separate Curves, Separate Intercepts';
      model identity(Y) = class(G / zero=none) |
                          spline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;
Since the variables X1 and X2 both have a large partition of zeros, the KNOTS= t-option is specified instead of the NKNOTS= t-option. The following example generates an artificial data set with two curves. In the interest of space, only the preceding separate curves, separate intercepts example is run.
   title 'Separate Curves, Separate Intercepts';

   data A;
      do X = -2 to 3 by 0.025;
         G = 1;
         Y = 8*(X*X + 2*cos(X*6)) + 15*normal(7654321);
         output;
         G = 2;
         Y = 4*(-X*X + 4*sin(X*4)) - 40 + 15*normal(7654321);
         output;
      end;
   run;

   proc transreg data=A dummy;
      model identity(Y) = class(G / zero=none) |
                          spline(X / knots=-1.5 to 2.5 by 0.5);
      output predicted;
   run;

   proc gplot;
      axis1 minor=none;
      axis2 minor=none label=(angle=90 rotate=0);
      symbol1 color=blue v=star i=none;
      symbol2 color=yellow v=dot i=none;
      plot Y*X=1 PY*X=2 / overlay frame cframe=ligr
                          haxis=axis1 vaxis=axis2 href=0 vref=0;
   run; quit;
The previous statements produce Figure 75.19 and Figure 75.20.

                    Separate Curves, Separate Intercepts

                          The TRANSREG Procedure

        TRANSREG MORALS Algorithm Iteration History for Identity(Y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           0    0.42724    4.48710     0.71020
           1    0.00000    0.00000     0.86604      0.15584    Converged

   Algorithm converged.
Figure 75.19. Fitting Models: Separate Curves, Separate Intercepts
Figure 75.20. Plot for the Separate Curves, Separate Intercepts Example
Unbalanced ANOVA without Dummy Variables

This example illustrates that an analysis of variance model can be formulated as a simple regression model with optimal scoring. The purpose of the example is to explain one aspect of how PROC TRANSREG works, not to propose an alternative way of performing an analysis of variance.

Finding the overall fit of a large, unbalanced analysis of variance model can be handled as an optimal scoring problem without creating large, sparse design matrices. For example, consider an unbalanced full main-effects and interactions ANOVA model with six factors. Assume that a SAS data set is created with factor-level indicator variables C1 through C6 and dependent variable Y. If each factor level consists of nonblank single characters, you can create a cell indicator in a DATA step with the statement

   x = compress(c1||c2||c3||c4||c5||c6);
The following statements optimally score X (using the OPSCORE transformation) and do not transform Y. The final R-square reported is the R-square for the full analysis of variance model.

   proc transreg;
      model identity(y) = opscore(x);
      output;
   run;
The R-square displayed by the preceding statements is the same as the R-square that would be reported by both of the following PROC GLM runs.

   proc glm;
      class x;
      model y = x;
   run;

   proc glm;
      class c1-c6;
      model y = c1|c2|c3|c4|c5|c6;
   run;
PROC TRANSREG optimally scores the classes of X, within the space of a single variable with values linearly related to the cell means, so the full ANOVA problem is reduced to a simple regression problem with an optimal independent variable. PROC TRANSREG requires only one iteration to find the optimal scoring of X but, by default, performs a second iteration, which reports no data changes.
Hypothesis Tests for Simple Univariate Models

If the dependent variable has one parameter (IDENTITY, LINEAR with no missing values, and so on) and if there are no monotonicity constraints, PROC TRANSREG fits univariate models, which can also be fit with a DATA step and PROC REG. This is illustrated with an artificial data set.

   data htex;
      do i = 0.5 to 10 by 0.5;
         x1 = log(i);
         x2 = sqrt(i) + sin(i);
         x3 = 0.05 * i * i + cos(i);
         y  = x1 - x2 + x3 + 3 * normal(7);
         x1 = x1 + normal(7);
         x2 = x2 + normal(7);
         x3 = x3 + normal(7);
         output;
      end;
   run;
Both PROC TRANSREG and PROC REG are run to fit the same polynomial regression model. The ANOVA and regression tables from PROC TRANSREG are displayed in Figure 75.21. The ANOVA and regression tables from PROC REG are displayed in Figure 75.22. The SHORT a-option is specified to suppress the iteration history.

   proc transreg data=htex ss2 short;
      title 'Fit a Polynomial Regression Model with PROC TRANSREG';
      model identity(y) = spline(x1);
   run;
            Fit a Polynomial Regression Model with PROC TRANSREG

                          The TRANSREG Procedure

                      Dependent Variable Identity(y)

                Number of Observations Read          20
                Number of Observations Used          20

                               Identity(y)
                          Algorithm converged.

                     Hypothesis Tests for Identity(y)

        Univariate ANOVA Table Based on the Usual Degrees of Freedom

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Pr > F

   Model               3      5.8365     1.94550       0.14    0.9329
   Error              16    218.3073    13.64421
   Corrected Total    19    224.1438

            Root MSE          3.69381    R-Square     0.0260
            Dependent Mean    0.85490    Adj R-Sq    -0.1566
            Coeff Var       432.07258

      Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II
                                   Sum of       Mean
   Variable    DF  Coefficient    Squares     Square  F Value  Pr > F

   Intercept    1    1.4612767    18.8971    18.8971     1.38  0.2565
   Spline(x1)   3   -0.3924013     5.8365     1.9455     0.14  0.9329
Figure 75.21. ANOVA and Regression Output from PROC TRANSREG

   data htex2;
      set htex;
      x1_1 = x1;
      x1_2 = x1 * x1;
      x1_3 = x1 * x1 * x1;
   run;

   proc reg;
      title 'Fit a Polynomial Regression Model with PROC REG';
      model y = x1_1-x1_3;
   run;
              Fit a Polynomial Regression Model with PROC REG

                            The REG Procedure
                              Model: MODEL1
                          Dependent Variable: y

                Number of Observations Read          20
                Number of Observations Used          20

                           Analysis of Variance

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Pr > F

   Model               3     5.83651     1.94550       0.14    0.9329
   Error              16   218.30729    13.64421
   Corrected Total    19   224.14380

            Root MSE          3.69381    R-Square     0.0260
            Dependent Mean    0.85490    Adj R-Sq    -0.1566
            Coeff Var       432.07258

                           Parameter Estimates

                    Parameter     Standard
   Variable    DF    Estimate        Error    t Value    Pr > |t|

   Intercept    1     1.22083      1.47163       0.83      0.4190
   x1_1         1     0.79743      1.75129       0.46      0.6550
   x1_2         1    -0.49381      1.50449      -0.33      0.7470
   x1_3         1     0.04422      0.32956       0.13      0.8949
Figure 75.22. ANOVA and Regression Output from PROC REG
The PROC TRANSREG regression table differs in several important ways from the parameter estimate table produced by PROC REG. The REG procedure displays standard errors and t statistics; PROC TRANSREG displays Type II sums of squares, mean squares, and F statistics. The difference arises because the numerator degrees of freedom are not always 1, so t tests are not uniformly appropriate. When the degrees of freedom for variable $x_j$ is 1, the following relationships hold between the standard errors $s_{\hat{\beta}_j}$ and the Type II sums of squares $SS_j$:

$$s_{\hat{\beta}_j} = \left( \hat{\beta}_j^2 / F_j \right)^{1/2}$$

and

$$SS_j = \hat{\beta}_j^2 \times MSE / s_{\hat{\beta}_j}^2$$

PROC TRANSREG does not provide tests of the individual terms that go into the transformation. (However, it could if BSPLINE or PSPLINE had been specified instead of SPLINE.) The test of SPLINE(X1) is the same as the test of the overall model.
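As a worked check of these relationships (not part of the original output; it uses the rounded x1_1 row of Figure 75.22 and the MSE of 13.64421 from the ANOVA table): $F_j = \hat{\beta}_j^2 / s_{\hat{\beta}_j}^2 = (0.79743)^2/(1.75129)^2 \approx 0.207$, which is the square of the displayed t value of 0.46 up to rounding, and $SS_j = \hat{\beta}_j^2 \times MSE / s_{\hat{\beta}_j}^2 = F_j \times MSE \approx 0.207 \times 13.64421 \approx 2.83$.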
The intercepts are different due to the different numbers of variables and their standardizations.

In the next example, both X1 and X2 are transformed in the first PROC TRANSREG step, and PROC TRANSREG is used instead of a DATA step to create the polynomials for PROC REG. Both PROC TRANSREG and PROC REG fit the same polynomial regression model. The output from PROC TRANSREG and PROC REG is in Figure 75.23.

   proc transreg data=htex ss2 dummy;
      title 'Two-Variable Polynomial Regression';
      model identity(y) = spline(x1 x2);
   run;

   proc transreg noprint data=htex maxiter=0;
      /* Use PROC TRANSREG to prepare input to PROC REG */
      model identity(y) = pspline(x1 x2);
      output out=htex2;
   run;

   proc reg;
      model y = x1_1-x1_3 x2_1-x2_3;
      test x1_1, x1_2, x1_3;
      test x2_1, x2_2, x2_3;
   run;
                     Two-Variable Polynomial Regression

                          The TRANSREG Procedure

                      Dependent Variable Identity(y)

                Number of Observations Read          20
                Number of Observations Used          20

        TRANSREG MORALS Algorithm Iteration History for Identity(y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           0    0.69502    4.73421     0.08252
           1    0.00000    0.00000     0.17287      0.09035    Converged

   Algorithm converged.

              Hypothesis Test Iterations Excluding Spline(x1)

        TRANSREG MORALS Algorithm Iteration History for Identity(y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           0    0.03575    0.32390     0.15097
           1    0.00000    0.00000     0.15249      0.00152    Converged

   Algorithm converged.

              Hypothesis Test Iterations Excluding Spline(x2)

        TRANSREG MORALS Algorithm Iteration History for Identity(y)

   Iteration    Average    Maximum                Criterion
      Number     Change     Change    R-Square       Change    Note
   ---------------------------------------------------------------------
           0    0.45381    1.43736     0.00717
           1    0.00000    0.00000     0.02604      0.01886    Converged

   Algorithm converged.
Figure 75.23. Two-Variable Polynomial Regression Output from PROC TRANSREG
                     Two-Variable Polynomial Regression

                          The TRANSREG Procedure

                     Hypothesis Tests for Identity(y)

        Univariate ANOVA Table Based on the Usual Degrees of Freedom

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Pr > F

   Model               6     38.7478     6.45796       0.45    0.8306
   Error              13    185.3960    14.26123
   Corrected Total    19    224.1438

            Root MSE          3.77640    R-Square     0.1729
            Dependent Mean    0.85490    Adj R-Sq    -0.2089
            Coeff Var       441.73431

      Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II
                                   Sum of       Mean
   Variable    DF  Coefficient    Squares     Square  F Value  Pr > F

   Intercept    1    3.5437125    35.2282    35.2282     2.47  0.1400
   Spline(x1)   3    0.3644562     4.5682     1.5227     0.11  0.9546
   Spline(x2)   3   -1.3551738    32.9112    10.9704     0.77  0.5315

Figure 75.23. (continued)
                     Two-Variable Polynomial Regression

                            The REG Procedure
                              Model: MODEL1
                          Dependent Variable: y

                           Analysis of Variance

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Pr > F

   Model               6    38.74775     6.45796       0.45    0.8306
   Error              13   185.39605    14.26123
   Corrected Total    19   224.14380

            Root MSE          3.77640    R-Square     0.1729
            Dependent Mean    0.85490    Adj R-Sq    -0.2089
            Coeff Var       441.73431

                           Parameter Estimates

                               Parameter     Standard
   Variable   Label      DF     Estimate        Error    t Value    Pr > |t|

   Intercept  Intercept   1     10.77824      7.55244       1.43      0.1771
   x1_1       x1 1        1      0.40112      1.81024       0.22      0.8281
   x1_2       x1 2        1      0.25652      1.66023       0.15      0.8796
   x1_3       x1 3        1     -0.11639      0.36775      -0.32      0.7567
   x2_1       x2 1        1    -14.07054     12.50521      -1.13      0.2809
   x2_2       x2 2        1      5.95610      5.97952       1.00      0.3374
   x2_3       x2 3        1     -0.80608      0.87291      -0.92      0.3726
Figure 75.23. (continued)
                     Two-Variable Polynomial Regression

                            The REG Procedure
                              Model: MODEL1

                   Test 1 Results for Dependent Variable y

                                 Mean
   Source           DF         Square    F Value    Pr > F

   Numerator         3        1.52272       0.11    0.9546
   Denominator      13       14.26123

                   Test 2 Results for Dependent Variable y

                                 Mean
   Source           DF         Square    F Value    Pr > F

   Numerator         3       10.97042       0.77    0.5315
   Denominator      13       14.26123
Figure 75.23. (continued)
There are three iteration histories: one for the overall model and two for the two independent variables. The first PROC TRANSREG iteration history shows the R-square of 0.17287 for the fit of the overall model. The second is for

   model identity(y) = spline(x2);
which excludes SPLINE(X1). The third is for

   model identity(y) = spline(x1);
which excludes SPLINE(X2). The difference between the first and second R-square, times the total sum of squares, is the model sum of squares for SPLINE(X1):

   (0.17287 - 0.15249) × 224.143800 = 4.568165

The difference between the first and third R-square, times the total sum of squares, is the model sum of squares for SPLINE(X2):

   (0.17287 - 0.02604) × 224.143800 = 32.911247

The TEST statement in PROC REG tests the null hypothesis that the vector of parameters for X1_1 X1_2 X1_3 is zero. This is the same test as the SPLINE(X1) test used by PROC TRANSREG. Similarly, the PROC REG test that the vector of parameters for X2_1 X2_2 X2_3 is zero is the same as the PROC TRANSREG SPLINE(X2) test. So for models with no monotonicity constraints and no dependent variable transformations, PROC TRANSREG provides little more than a different packaging of standard least-squares methodology.
Hypothesis Tests with Monotonicity Constraints

Now consider a model with monotonicity constraints. This model has no counterpart in PROC REG.

   proc transreg data=htex ss2 short;
      title 'Monotone Splines';
      model identity(y) = mspline(x1-x3 / nknots=3);
   run;
The SHORT a-option is specified to suppress the iteration histories. Two ANOVA tables are displayed: one using liberal degrees of freedom and one using conservative degrees of freedom. All sums of squares and R-square values are the same for both tables. What differ are the degrees of freedom and the statistics that are computed from them. The liberal test has 8 model degrees of freedom and 11 error degrees of freedom, whereas the conservative test has 15 model degrees of freedom and only 4 error degrees of freedom. The "true" p-value is between 0.8462 and 0.9997, so clearly you would fail to reject the null hypothesis. Unfortunately, results are not always this clear. See Figure 75.24.
                             Monotone Splines

                          The TRANSREG Procedure

                      Dependent Variable Identity(y)

                Number of Observations Read          20
                Number of Observations Used          20

                               Identity(y)
                          Algorithm converged.

                     Hypothesis Tests for Identity(y)

         Univariate ANOVA Table Based on Liberal Degrees of Freedom

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Liberal p

   Model               8     58.0534     7.25667       0.48    >= 0.8462
   Error              11    166.0904    15.09913
   Corrected Total    19    224.1438

            Root MSE          3.88576    R-Square     0.2590
            Dependent Mean    0.85490    Adj R-Sq    -0.2799
            Coeff Var       454.52581

      Univariate ANOVA Table Based on Conservative Degrees of Freedom

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Conservative p

   Model              15     58.0534     3.87022       0.09    <= 0.9997
   Error               4    166.0904    41.52261
   Corrected Total    19    224.1438

            Root MSE          6.44380    R-Square     0.2590
            Dependent Mean    0.85490    Adj R-Sq    -2.5197
            Coeff Var       753.74578

Figure 75.24. Monotone Spline Transformations
                             Monotone Splines

                          The TRANSREG Procedure

       Univariate Regression Table Based on Liberal Degrees of Freedom

                                   Type II
                                    Sum of       Mean
   Variable     DF  Coefficient    Squares     Square  F Value  Liberal p

   Intercept     1    4.8687676    54.7372    54.7372     3.63  >= 0.0834
   Mspline(x1)   2   -0.6886834    12.1943     6.0972     0.40  >= 0.6773
   Mspline(x2)   3   -1.8237319    46.3155    15.4385     1.02  >= 0.4199
   Mspline(x3)   3    0.8646155    24.6840     8.2280     0.54  >= 0.6616

    Univariate Regression Table Based on Conservative Degrees of Freedom

                                   Type II
                                    Sum of       Mean
   Variable     DF  Coefficient    Squares     Square  F Value  Conservative p

   Intercept     1    4.8687676    54.7372    54.7372     1.32  <= 0.3149
   Mspline(x1)   5   -0.6886834    12.1943     2.4389     0.06  <= 0.9959
   Mspline(x2)   5   -1.8237319    46.3155     9.2631     0.22  <= 0.9344
   Mspline(x3)   5    0.8646155    24.6840     4.9368     0.12  <= 0.9809
Figure 75.24. (continued)
Hypothesis Tests with Dependent Variable Transformations

PROC TRANSREG can also provide approximate tests of hypotheses when the dependent variable is transformed, but the output is more complicated. When a dependent variable has more than one degree of freedom, the problem becomes multivariate. Hypothesis tests are performed in the context of a multivariate linear model with the number of dependent variables equal to the number of scoring parameters for the dependent variable transformation.

The transformation regression model with a dependent variable transformation differs from the usual multivariate linear model in two important ways. First, the usual assumption of multivariate normality is always violated. This fact is simply ignored. This is one reason that all hypothesis tests in the presence of a dependent variable transformation should be considered approximate at best. Multivariate normality is assumed even though it is known that the assumption is violated.

The second difference concerns the usual multivariate test statistics: Pillai's Trace, Wilks' Lambda, Hotelling-Lawley Trace, and Roy's Greatest Root. The first three statistics are defined in terms of all the squared canonical correlations. Here, there is only one linear combination (the transformation) and, hence, only one squared canonical correlation of interest, which is equal to the R-square. It may seem that Roy's Greatest Root, which uses only the largest squared canonical correlation, is the only statistic of interest. Unfortunately, Roy's Greatest Root is very liberal and provides only a lower bound on the p-value. Approximate upper bounds are provided by adjusting the other three statistics for the one-linear-combination case. The Wilks' Lambda, Pillai's Trace, and Hotelling-Lawley Trace statistics are a conservative adjustment of the usual statistics.

These statistics are normally defined in terms of the squared canonical correlations, which are the eigenvalues of the matrix H(H + E)^{-1}, where H is the hypothesis sum-of-squares matrix and E is the error sum-of-squares matrix. Here the R-square is used for the first eigenvalue, and all other eigenvalues are set to 0, since only one linear combination is used. Degrees of freedom are computed assuming that all linear combinations contribute to the Lambda and Trace statistics, so the F tests for those statistics are conservative. The p-values for the liberal and conservative statistics provide approximate lower and upper bounds on p. In practice, the adjusted Pillai's Trace is very conservative, perhaps too conservative to be useful. Wilks' Lambda is less conservative, and the Hotelling-Lawley Trace seems to be the least conservative. The conservative statistics and the liberal Roy's Greatest Root provide a bound on the true p-value. Unfortunately, they sometimes report a bound of 0.0001 and 1.0000.

Here is an example with a dependent variable transformation.

   proc transreg data=htex ss2 dummy short;
      title 'Transform Dependent and Independent Variables';
      model spline(y) = spline(x1-x3);
   run;
The univariate results match Roy’s Greatest Root results. Clearly, the proper action is to fail to reject the null hypothesis. However, as stated previously, results are not always this clear. See Figure 75.25.
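To make the adjustment concrete, here is a small verification (not in the original text) of how the adjusted statistics in Figure 75.25 follow from the single nonzero squared canonical correlation $\rho_1^2 = R^2 = 0.494692$:

$$\Lambda = \prod_i (1 - \rho_i^2) = 1 - R^2 = 0.505308 \quad \text{(Wilks' Lambda)}$$

$$V = \sum_i \rho_i^2 = R^2 = 0.494692 \quad \text{(Pillai's Trace)}$$

$$U = \sum_i \frac{\rho_i^2}{1 - \rho_i^2} = \frac{R^2}{1 - R^2} = 0.978992 \quad \text{(Hotelling-Lawley Trace and Roy's Greatest Root)}$$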
              Transform Dependent and Independent Variables

                          The TRANSREG Procedure

                       Dependent Variable Spline(y)

                Number of Observations Read          20
                Number of Observations Used          20

                                Spline(y)
                          Algorithm converged.

                      Hypothesis Tests for Spline(y)

        Univariate ANOVA Table Based on the Usual Degrees of Freedom

                              Sum of        Mean
   Source             DF     Squares      Square    F Value    Liberal p

   Model               9    110.8822    12.32025       1.09    >= 0.4452
   Error              10    113.2616    11.32616
   Corrected Total    19    224.1438

   The above statistics are not adjusted for the fact that the dependent
   variable was transformed and so are generally liberal.

            Root MSE          3.36544    R-Square     0.4947
            Dependent Mean    0.85490    Adj R-Sq     0.0399
            Coeff Var       393.66234

    Adjusted Multivariate ANOVA Table Based on the Usual Degrees of Freedom

                 Dependent Variable Scoring Parameters=3
                           S=3    M=2.5    N=3

   Statistic                  Value    F Value    Num DF    Den DF            p

   Wilks' Lambda           0.505308       0.23        27    24.006    <= 0.9998
   Pillai's Trace          0.494692       0.22        27        30    <= 0.9999
   Hotelling-Lawley Trace  0.978992       0.26        27    11.589    <= 0.9980
   Roy's Greatest Root     0.978992       1.09         9        10    >= 0.4452

   The Wilks' Lambda, Pillai's Trace, and Hotelling-Lawley Trace statistics
   are a conservative adjustment of the normal statistics. Roy's Greatest
   Root is liberal. These statistics are normally defined in terms of the
   squared canonical correlations which are the eigenvalues of the matrix
   H*inv(H+E). Here the R-Square is used for the first eigenvalue and all
   other eigenvalues are set to zero since only one linear combination is
   used. Degrees of freedom are computed assuming all linear combinations
   contribute to the Lambda and Trace statistics, so the F tests for those
   statistics are conservative. The p values for the liberal and
   conservative statistics provide approximate lower and upper bounds on p.
   A liberal test statistic with conservative degrees of freedom and a
   conservative test statistic with liberal degrees of freedom yield at
   best an approximate p value, which is indicated by a "~" before the
   p value.
Figure 75.25. Transform Dependent and Independent Variables
              Transform Dependent and Independent Variables

                          The TRANSREG Procedure

      Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II
                                   Sum of       Mean
   Variable    DF  Coefficient    Squares     Square  F Value  Liberal p

   Intercept    1    6.9089087    117.452    117.452    10.37  >= 0.0092
   Spline(x1)   3   -1.0832321     32.493     10.831     0.96  >= 0.4504
   Spline(x2)   3   -2.1539191     45.251     15.084     1.33  >= 0.3184
   Spline(x3)   3    0.4779207     10.139      3.380     0.30  >= 0.8259

   The above statistics are not adjusted for the fact that the dependent
   variable was transformed and so are generally liberal.

   Adjusted Multivariate Regression Table Based on the Usual Degrees of Freedom

   Variable    Coefficient  Statistic                  Value  F Value  Num DF  Den DF          p

   Intercept     6.9089087  Wilks' Lambda           0.490920     2.77       3       8     0.1112
                            Pillai's Trace          0.509080     2.77       3       8     0.1112
                            Hotelling-Lawley Trace  1.036993     2.77       3       8     0.1112
                            Roy's Greatest Root     1.036993     2.77       3       8     0.1112

   Spline(x1)   -1.0832321  Wilks' Lambda           0.777072     0.24       9  19.621  <= 0.9840
                            Pillai's Trace          0.222928     0.27       9      30  <= 0.9787
                            Hotelling-Lawley Trace  0.286883     0.24       9  9.8113  <= 0.9784
                            Roy's Greatest Root     0.286883     0.96       3      10  >= 0.4504

   Spline(x2)   -2.1539191  Wilks' Lambda           0.714529     0.32       9  19.621  <= 0.9572
                            Pillai's Trace          0.285471     0.35       9      30  <= 0.9494
                            Hotelling-Lawley Trace  0.399524     0.33       9  9.8113  <= 0.9424
                            Roy's Greatest Root     0.399524     1.33       3      10  >= 0.3184

   Spline(x3)    0.4779207  Wilks' Lambda           0.917838     0.08       9  19.621  <= 0.9998
                            Pillai's Trace          0.082162     0.09       9      30  <= 0.9996
                            Hotelling-Lawley Trace  0.089517     0.07       9  9.8113  <= 0.9997
                            Roy's Greatest Root     0.089517     0.30       3      10  >= 0.8259

   These statistics are adjusted in the same way as the multivariate
   statistics above.
Figure 75.25. (continued)
Hypothesis Tests with One-Way ANOVA

One-way ANOVA models are fit with either an explicit or implicit intercept. In implicit intercept models, the ANOVA table of PROC TRANSREG is the correct table for a model with an intercept, and the regression table is the correct table for a model that does not have a separate explicit intercept. The PROC TRANSREG implicit intercept ANOVA table matches the PROC REG table when the NOINT a-option is not specified, and the PROC TRANSREG implicit intercept regression table matches the PROC REG table when the NOINT a-option is specified. The following code illustrates this relationship. See Figure 75.26 through Figure 75.27 for the results.

data oneway;
   input y x $;
   datalines;
0 a
1 a
2 a
7 b
8 b
9 b
3 c
4 c
5 c
;

proc transreg ss2 data=oneway short;
   title 'Implicit Intercept Model';
   model identity(y) = class(x / zero=none);
   output out=oneway2;
run;

proc reg data=oneway2;
   model y = xa xb xc;            /* Implicit Intercept ANOVA      */
   model y = xa xb xc / noint;    /* Implicit Intercept Regression */
run;
Implicit Intercept Model

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
x             3    a b c

Number of Observations Read    9
Number of Observations Used    9
Implicit Intercept Model

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2          74.00000       37.00000      37.00    0.0004
Error               6           6.00000        1.00000
Corrected Total     8          80.00000

Root MSE          1.00000    R-Square    0.9250
Dependent Mean    4.33333    Adj R-Sq    0.9000
Coeff Var        23.07692

Univariate Regression Table Based on the Usual Degrees of Freedom

                                 Type II Sum       Mean
Variable    DF    Coefficient     of Squares     Square    F Value    Pr > F    Label
Class.xa     1     1.00000000          3.000      3.000       3.00    0.1340    x a
Class.xb     1     8.00000000        192.000    192.000     192.00    <.0001    x b
Class.xc     1     4.00000000         48.000     48.000      48.00    0.0004    x c
Figure 75.26. Implicit Intercept Model (TRANSREG Procedure)
Implicit Intercept Model

The REG Procedure
Model: MODEL1
Dependent Variable: y

Number of Observations Read    9
Number of Observations Used    9

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2          74.00000       37.00000      37.00    0.0004
Error               6           6.00000        1.00000
Corrected Total     8          80.00000

Root MSE          1.00000    R-Square    0.9250
Dependent Mean    4.33333    Adj R-Sq    0.9000
Coeff Var        23.07692

NOTE: Model is not full rank. Least-squares solutions for the parameters are
      not unique. Some statistics will be misleading. A reported DF of 0 or B
      means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a
      linear combination of other variables as shown.

      xc = Intercept - xa - xb

Parameter Estimates

                                  Parameter    Standard
Variable     Label        DF       Estimate       Error    t Value    Pr > |t|
Intercept    Intercept     B        4.00000     0.57735       6.93      0.0004
xa           x a           B       -3.00000     0.81650      -3.67      0.0104
xb           x b           B        4.00000     0.81650       4.90      0.0027
xc           x c           0              0           .          .           .
Figure 75.27. Implicit Intercept Model (REG Procedure)
Implicit Intercept Model

The REG Procedure
Model: MODEL2
Dependent Variable: y

Number of Observations Read    9
Number of Observations Used    9

NOTE: No intercept in model. R-Square is redefined.

Analysis of Variance

Source               DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                 3         243.00000       81.00000      81.00    <.0001
Error                 6           6.00000        1.00000
Uncorrected Total     9         249.00000

Root MSE          1.00000    R-Square    0.9759
Dependent Mean    4.33333    Adj R-Sq    0.9639
Coeff Var        23.07692

Parameter Estimates

                             Parameter    Standard
Variable    Label    DF       Estimate       Error    t Value    Pr > |t|
xa          x a       1        1.00000     0.57735       1.73      0.1340
xb          x b       1        8.00000     0.57735      13.86      <.0001
xc          x c       1        4.00000     0.57735       6.93      0.0004
Figure 75.27. (continued)
Using the DESIGN Output Option

This example uses PROC TRANSREG and the DESIGN o-option to prepare an input data set with classification variables for the LOGISTIC procedure. The DESIGN o-option specifies that the goal is design matrix creation, not analysis. When you specify DESIGN, dependent variables are not required. The DEVIATIONS (or EFFECTS) t-option requests a deviations-from-means (1, 0, -1) coding of the classification variables, which is the same coding the CATMOD procedure uses. See Figure 75.28. PROC TRANSREG automatically creates a macro variable &_trgind that contains the list of independent variables created. This macro variable is used in the PROC LOGISTIC MODEL statement. See Figure 75.29. For comparison, the same analysis is also performed with PROC CATMOD. See Figure 75.30.

title 'Using PROC TRANSREG to Create a Design Matrix';

data a;
   do y = 1, 2;
      do a = 1 to 4;
         do b = 1 to 3;
            w = ceil(uniform(1) * 10 + 10);
            output;
         end;
      end;
   end;
run;

proc transreg data=a design;
   model class(a b / deviations);
   id y w;
   output;
run;

proc print;
   title2 'PROC TRANSREG Output Data Set';
run;

proc logistic;
   title2 'PROC LOGISTIC with Classification Variables';
   freq w;
   model y = &_trgind;
run;

proc catmod data=a;
   title2 'PROC CATMOD Should Produce the Same Results';
   model y = a b;
   weight w;
run;
Using PROC TRANSREG to Create a Design Matrix
PROC TRANSREG Output Data Set

Obs    _TYPE_    _NAME_    Intercept    a1    a2    a3    b1    b2    a    b    y     w
  1    SCORE     1                 1     1     0     0     1     0    1    1    1    12
  2    SCORE     1                 1     1     0     0     0     1    1    2    1    20
  3    SCORE     1                 1     1     0     0    -1    -1    1    3    1    14
  4    SCORE     1                 1     0     1     0     1     0    2    1    1    13
  5    SCORE     1                 1     0     1     0     0     1    2    2    1    20
  6    SCORE     1                 1     0     1     0    -1    -1    2    3    1    20
  7    SCORE     1                 1     0     0     1     1     0    3    1    1    16
  8    SCORE     1                 1     0     0     1     0     1    3    2    1    16
  9    SCORE     1                 1     0     0     1    -1    -1    3    3    1    11
 10    SCORE     1                 1    -1    -1    -1     1     0    4    1    1    11
 11    SCORE     1                 1    -1    -1    -1     0     1    4    2    1    19
 12    SCORE     1                 1    -1    -1    -1    -1    -1    4    3    1    16
 13    SCORE     2                 1     1     0     0     1     0    1    1    2    19
 14    SCORE     2                 1     1     0     0     0     1    1    2    2    11
 15    SCORE     2                 1     1     0     0    -1    -1    1    3    2    20
 16    SCORE     2                 1     0     1     0     1     0    2    1    2    13
 17    SCORE     2                 1     0     1     0     0     1    2    2    2    13
 18    SCORE     2                 1     0     1     0    -1    -1    2    3    2    17
 19    SCORE     2                 1     0     0     1     1     0    3    1    2    20
 20    SCORE     2                 1     0     0     1     0     1    3    2    2    13
 21    SCORE     2                 1     0     0     1    -1    -1    3    3    2    17
 22    SCORE     2                 1    -1    -1    -1     1     0    4    1    2    15
 23    SCORE     2                 1    -1    -1    -1     0     1    4    2    2    16
 24    SCORE     2                 1    -1    -1    -1    -1    -1    4    3    2    13
Figure 75.28. The PROC TRANSREG Design Matrix
Using PROC TRANSREG to Create a Design Matrix
PROC LOGISTIC with Classification Variables

The LOGISTIC Procedure

Model Information

Data Set                     WORK.DATA8
Response Variable            y
Number of Response Levels    2
Frequency Variable           w
Model                        binary logit
Optimization Technique       Fisher's scoring

Number of Observations Read     24
Number of Observations Used     24
Sum of Frequencies Read        375
Sum of Frequencies Used        375

Response Profile

Ordered                 Total
  Value    y        Frequency
      1    1              188
      2    2              187

Probability modeled is y=1.

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

             Intercept    Intercept and
Criterion         Only       Covariates
AIC            521.858          524.378
SC             525.785          547.939
-2 Log L       519.858          512.378

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        7.4799     5        0.1873
Score                   7.4312     5        0.1905
Wald                    7.3356     5        0.1969
Figure 75.29. PROC LOGISTIC Output
Using PROC TRANSREG to Create a Design Matrix
PROC LOGISTIC with Classification Variables

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

                             Standard          Wald
Parameter    DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept     1    -0.00040    0.1044        0.0000        0.9969
a1            1     -0.0802    0.1791        0.2007        0.6542
a2            1      0.2001    0.1800        1.2363        0.2662
a3            1     -0.1350    0.1819        0.5514        0.4578
b1            1     -0.2392    0.1500        2.5436        0.1107
b2            1      0.3433    0.1474        5.4223        0.0199

Odds Ratio Estimates

              Point         95% Wald
Effect     Estimate    Confidence Limits
a1            0.923      0.650     1.311
a2            1.222      0.858     1.738
a3            0.874      0.612     1.248
b1            0.787      0.587     1.056
b2            1.410      1.056     1.882

Association of Predicted Probabilities and Observed Responses

Percent Concordant     54.0    Somers' D    0.163
Percent Discordant     37.8    Gamma        0.177
Percent Tied            8.2    Tau-a        0.082
Pairs                 35156    c            0.581

Figure 75.29. (continued)
Using PROC TRANSREG to Create a Design Matrix
PROC CATMOD Should Produce the Same Results

The CATMOD Procedure

Data Summary

Response             y    Response Levels      2
Weight Variable      w    Populations         12
Data Set             A    Total Frequency    375
Frequency Missing    0    Observations        24

Population Profiles

Sample    a    b    Sample Size
     1    1    1             31
     2    1    2             31
     3    1    3             34
     4    2    1             26
     5    2    2             33
     6    2    3             37
     7    3    1             36
     8    3    2             29
     9    3    3             28
    10    4    1             26
    11    4    2             35
    12    4    3             29

Response Profiles

Response    y
       1    1
       2    2

Figure 75.30. PROC CATMOD Output
Using PROC TRANSREG to Create a Design Matrix
PROC CATMOD Should Produce the Same Results

The CATMOD Procedure

Maximum Likelihood Analysis

Maximum likelihood computations converged.

Maximum Likelihood Analysis of Variance

Source              DF    Chi-Square    Pr > ChiSq
Intercept            1          0.00        0.9969
a                    3          1.50        0.6823
b                    2          5.64        0.0597
Likelihood Ratio     6          2.81        0.8329

Analysis of Maximum Likelihood Estimates

                          Standard      Chi-
Parameter     Estimate       Error    Square    Pr > ChiSq
Intercept     -0.00040      0.1044      0.00        0.9969
a 1            -0.0802      0.1791      0.20        0.6542
  2             0.2001      0.1800      1.24        0.2662
  3            -0.1350      0.1819      0.55        0.4578
b 1            -0.2392      0.1500      2.54        0.1107
  2             0.3434      0.1474      5.42        0.0199
Figure 75.30. (continued)
Discrete Choice Experiments: DESIGN, NORESTORE, NOZERO

A discrete choice experiment is constructed consisting of four product brands, each available at three different prices: $1.49, $1.99, and $2.49. In addition, each choice set contains a constant "other" alternative available at $1.49. In the fifth choice set, price is constant. PROC TRANSREG is used to code the design, and the PHREG procedure fits the multinomial logit choice model (not shown). See http://www.sas.com/service/techsup/tnote/tnote_stat.html for more information on discrete choice modeling and the multinomial logit model. Look for the latest "Multinomial Logit, Discrete Choice Modeling" report.

title 'Choice Model Coding';

data design;
   array p[4];
   input p1-p4 @@;
   set = _n_;
   do brand = 1 to 4;
      price = p[brand];
      output;
   end;
   brand = .;
   price = 1.49;
   output;    /* constant alternative */
   keep set brand price;
   datalines;
1.49 1.99 1.49 1.99   1.99 1.99 2.49 1.49   1.99 1.49 1.99 1.49
1.99 1.49 2.49 1.99   1.49 1.49 1.49 1.49   2.49 1.49 1.99 2.49
1.49 1.49 2.49 2.49   2.49 2.49 1.49 1.49   1.49 2.49 2.49 1.99
2.49 2.49 2.49 1.49   1.99 2.49 1.49 2.49   2.49 1.99 2.49 2.49
2.49 1.49 1.49 1.99   1.49 1.99 1.99 1.49   2.49 1.99 1.99 1.99
1.99 1.99 1.49 2.49   1.99 2.49 1.99 1.99   1.49 2.49 1.99 2.49
;

proc transreg data=design design norestoremissing nozeroconstant;
   model class(brand / zero=none) identity(price);
   output out=coded;
   by set;
run;

proc print data=coded(firstobs=21 obs=25);
   var set brand &_trgind;
run;
In the interest of space, only the fifth choice set is displayed in Figure 75.31.

Choice Model Coding

Obs    set    brand    brand1    brand2    brand3    brand4    price
 21      5        1         1         0         0         0     1.49
 22      5        2         0         1         0         0     1.49
 23      5        3         0         0         1         0     1.49
 24      5        4         0         0         0         1     1.49
 25      5        .         0         0         0         0     1.49
Figure 75.31. The Fifth Choice Set
For the constant alternative (BRAND = .), the brand coding is a row of zeros due to the NORESTOREMISSING o-option, and PRICE is a constant $1.49 (instead of 0) due to the NOZEROCONSTANT a-option.

The data set was coded by choice set (BY set;). This is a small problem, but with very large problems, it may be necessary to restrict the number of observations that are coded at one time so that the procedure uses less time and memory. Coding by choice set is one option. When coding is performed after the data are merged in, coding by subject and choice set combinations is another option. Alternatively, you can specify DESIGN=n, where n is the number of observations to code at one time. For example, you can specify DESIGN=100 or DESIGN=1000 to process the data set in blocks of 100 or 1000 observations. Specify the NOZEROCONSTANT option to ensure that constant variables within blocks are not zeroed. When you specify DESIGN=n, or perform coding after the data are merged in, specify the dependent variable and any other variables needed for analysis as ID variables. A sketch of such a step follows.
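For example, the following step is a minimal sketch of coding in blocks of 1000 observations; the data set Chdata and the variables Subj and Choice are hypothetical stand-ins for a merged choice data set and its analysis variables:

proc transreg data=chdata design=1000 norestoremissing nozeroconstant;
   model class(brand / zero=none) identity(price);
   id subj set choice;    /* carry the analysis variables along */
   output out=coded;
run;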
ANOVA Codings

This set of examples illustrates several different ways to code the same two-way ANOVA model. Figure 75.32 displays the input data set.

title 'Two-way ANOVA Models';

data x;
   input a b @@;
   do i = 1 to 2;
      input y @@;
      output;
   end;
   drop i;
   datalines;
1 1 16 14   1 2 15 13   2 1  1  9
2 2 12 20   3 1 14  8   3 2 18 20
;

proc print label;
run;
Two-way ANOVA Models

Obs    a    b     y
  1    1    1    16
  2    1    1    14
  3    1    2    15
  4    1    2    13
  5    2    1     1
  6    2    1     9
  7    2    2    12
  8    2    2    20
  9    3    1    14
 10    3    1     8
 11    3    2    18
 12    3    2    20
Figure 75.32. Input Data Set
The following statements fit a cell-means model. See Figure 75.33 and Figure 75.34.
proc transreg data=x ss2 short;
   title2 'Cell-Means Model';
   model identity(y) = class(a * b / zero=none);
   output replace;
run;

proc print label;
run;
Two-way ANOVA Models
Cell-Means Model

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    12
Number of Observations Used    12
Implicit Intercept Model

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5          234.6667       46.93333       3.20    0.0946
Error               6           88.0000       14.66667
Corrected Total    11          322.6667

Root MSE           3.82971    R-Square    0.7273
Dependent Mean    13.33333    Adj R-Sq    0.5000
Coeff Var         28.72281

Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II Sum       Mean
Variable      DF    Coefficient    of Squares     Square    F Value    Pr > F    Label
Class.a1b1     1     15.0000000       450.000    450.000      30.68    0.0015    a 1 * b 1
Class.a1b2     1     14.0000000       392.000    392.000      26.73    0.0021    a 1 * b 2
Class.a2b1     1      5.0000000        50.000     50.000       3.41    0.1144    a 2 * b 1
Class.a2b2     1     16.0000000       512.000    512.000      34.91    0.0010    a 2 * b 2
Class.a3b1     1     11.0000000       242.000    242.000      16.50    0.0066    a 3 * b 1
Class.a3b2     1     19.0000000       722.000    722.000      49.23    0.0004    a 3 * b 2

Figure 75.33. Cell-Means Model
The parameter estimates are

$\hat{\mu}_{11} = \bar{y}_{11} = 15$
$\hat{\mu}_{12} = \bar{y}_{12} = 14$
$\hat{\mu}_{21} = \bar{y}_{21} = 5$
$\hat{\mu}_{22} = \bar{y}_{22} = 16$
$\hat{\mu}_{31} = \bar{y}_{31} = 11$
$\hat{\mu}_{32} = \bar{y}_{32} = 19$
Two-way ANOVA Models
Cell-Means Model

Obs    _TYPE_    _NAME_     y    Intercept    a1*b1    a1*b2    a2*b1    a2*b2    a3*b1    a3*b2    a    b
  1    SCORE     ROW1      16            .        1        0        0        0        0        0    1    1
  2    SCORE     ROW2      14            .        1        0        0        0        0        0    1    1
  3    SCORE     ROW3      15            .        0        1        0        0        0        0    1    2
  4    SCORE     ROW4      13            .        0        1        0        0        0        0    1    2
  5    SCORE     ROW5       1            .        0        0        1        0        0        0    2    1
  6    SCORE     ROW6       9            .        0        0        1        0        0        0    2    1
  7    SCORE     ROW7      12            .        0        0        0        1        0        0    2    2
  8    SCORE     ROW8      20            .        0        0        0        1        0        0    2    2
  9    SCORE     ROW9      14            .        0        0        0        0        1        0    3    1
 10    SCORE     ROW10      8            .        0        0        0        0        1        0    3    1
 11    SCORE     ROW11     18            .        0        0        0        0        0        1    3    2
 12    SCORE     ROW12     20            .        0        0        0        0        0        1    3    2

Figure 75.34. Cell-Means Model, Design Matrix
The following statements fit a reference cell model. The default reference level is the last cell, (3,2). See Figure 75.35 and Figure 75.36.

proc transreg data=x ss2 short;
   title2 'Reference Cell Model, (3,2) Reference Cell';
   model identity(y) = class(a | b);
   output replace;
run;

proc print label;
run;
Two-way ANOVA Models
Reference Cell Model, (3,2) Reference Cell

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    12
Number of Observations Used    12

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5          234.6667       46.93333       3.20    0.0946
Error               6           88.0000       14.66667
Corrected Total    11          322.6667

Root MSE           3.82971    R-Square    0.7273
Dependent Mean    13.33333    Adj R-Sq    0.5000
Coeff Var         28.72281

Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II Sum       Mean
Variable      DF    Coefficient    of Squares     Square    F Value    Pr > F    Label
Intercept      1     19.0000000       722.000    722.000      49.23    0.0004    Intercept
Class.a1       1     -5.0000000        25.000     25.000       1.70    0.2395    a 1
Class.a2       1     -3.0000000         9.000      9.000       0.61    0.4632    a 2
Class.b1       1     -8.0000000        64.000     64.000       4.36    0.0817    b 1
Class.a1b1     1      9.0000000        40.500     40.500       2.76    0.1476    a 1 * b 1
Class.a2b1     1     -3.0000000         4.500      4.500       0.31    0.5997    a 2 * b 1

Figure 75.35. Reference Cell Model, (3, 2) Reference Cell
The parameter estimates are

$\hat{\mu}_{32} = \bar{y}_{32} = 19$
$\hat{\alpha}_1 = \bar{y}_{12} - \bar{y}_{32} = 14 - 19 = -5$
$\hat{\alpha}_2 = \bar{y}_{22} - \bar{y}_{32} = 16 - 19 = -3$
$\hat{\beta}_1 = \bar{y}_{31} - \bar{y}_{32} = 11 - 19 = -8$
$\hat{\gamma}_{11} = \bar{y}_{11} - (\hat{\mu}_{32} + \hat{\alpha}_1 + \hat{\beta}_1) = 15 - (19 + (-5) + (-8)) = 9$
$\hat{\gamma}_{21} = \bar{y}_{21} - (\hat{\mu}_{32} + \hat{\alpha}_2 + \hat{\beta}_1) = 5 - (19 + (-3) + (-8)) = -3$

The structural zeros are

$\alpha_3 \equiv \beta_2 \equiv \gamma_{12} \equiv \gamma_{22} \equiv \gamma_{31} \equiv \gamma_{32} \equiv 0$
Two-way ANOVA Models
Reference Cell Model, (3,2) Reference Cell

Obs    _TYPE_    _NAME_     y    Intercept    a 1    a 2    b 1    a1*b1    a2*b1    a    b
  1    SCORE     ROW1      16            1      1      0      1        1        0    1    1
  2    SCORE     ROW2      14            1      1      0      1        1        0    1    1
  3    SCORE     ROW3      15            1      1      0      0        0        0    1    2
  4    SCORE     ROW4      13            1      1      0      0        0        0    1    2
  5    SCORE     ROW5       1            1      0      1      1        0        1    2    1
  6    SCORE     ROW6       9            1      0      1      1        0        1    2    1
  7    SCORE     ROW7      12            1      0      1      0        0        0    2    2
  8    SCORE     ROW8      20            1      0      1      0        0        0    2    2
  9    SCORE     ROW9      14            1      0      0      1        0        0    3    1
 10    SCORE     ROW10      8            1      0      0      1        0        0    3    1
 11    SCORE     ROW11     18            1      0      0      0        0        0    3    2
 12    SCORE     ROW12     20            1      0      0      0        0        0    3    2

Figure 75.36. Reference Cell Model, (3, 2) Reference Cell, Design Matrix
The following statements fit a reference cell model, but this time the reference level is the first cell, (1,1). See Figure 75.37 through Figure 75.38.

proc transreg data=x ss2 short;
   title2 'Reference Cell Model, (1,1) Reference Cell';
   model identity(y) = class(a | b / zero=first);
   output replace;
run;

proc print label;
run;
Two-way ANOVA Models
Reference Cell Model, (1,1) Reference Cell

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    12
Number of Observations Used    12

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5          234.6667       46.93333       3.20    0.0946
Error               6           88.0000       14.66667
Corrected Total    11          322.6667

Root MSE           3.82971    R-Square    0.7273
Dependent Mean    13.33333    Adj R-Sq    0.5000
Coeff Var         28.72281

Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II Sum       Mean
Variable      DF    Coefficient    of Squares     Square    F Value    Pr > F    Label
Intercept      1      15.000000       450.000    450.000      30.68    0.0015    Intercept
Class.a2       1     -10.000000       100.000    100.000       6.82    0.0401    a 2
Class.a3       1      -4.000000        16.000     16.000       1.09    0.3365    a 3
Class.b2       1      -1.000000         1.000      1.000       0.07    0.8027    b 2
Class.a2b2     1      12.000000        72.000     72.000       4.91    0.0686    a 2 * b 2
Class.a3b2     1       9.000000        40.500     40.500       2.76    0.1476    a 3 * b 2

Figure 75.37. Reference Cell Model, (1, 1) Reference Cell
The parameter estimates are

$\hat{\mu}_{11} = \bar{y}_{11} = 15$
$\hat{\alpha}_2 = \bar{y}_{21} - \bar{y}_{11} = 5 - 15 = -10$
$\hat{\alpha}_3 = \bar{y}_{31} - \bar{y}_{11} = 11 - 15 = -4$
$\hat{\beta}_2 = \bar{y}_{12} - \bar{y}_{11} = 14 - 15 = -1$
$\hat{\gamma}_{22} = \bar{y}_{22} - (\hat{\mu}_{11} + \hat{\alpha}_2 + \hat{\beta}_2) = 16 - (15 + (-10) + (-1)) = 12$
$\hat{\gamma}_{32} = \bar{y}_{32} - (\hat{\mu}_{11} + \hat{\alpha}_3 + \hat{\beta}_2) = 19 - (15 + (-4) + (-1)) = 9$

The structural zeros are

$\alpha_1 \equiv \beta_1 \equiv \gamma_{11} \equiv \gamma_{12} \equiv \gamma_{21} \equiv \gamma_{31} \equiv 0$
Two-way ANOVA Models
Reference Cell Model, (1,1) Reference Cell

Obs    _TYPE_    _NAME_     y    Intercept    a 2    a 3    b 2    a2*b2    a3*b2    a    b
  1    SCORE     ROW1      16            1      0      0      0        0        0    1    1
  2    SCORE     ROW2      14            1      0      0      0        0        0    1    1
  3    SCORE     ROW3      15            1      0      0      1        0        0    1    2
  4    SCORE     ROW4      13            1      0      0      1        0        0    1    2
  5    SCORE     ROW5       1            1      1      0      0        0        0    2    1
  6    SCORE     ROW6       9            1      1      0      0        0        0    2    1
  7    SCORE     ROW7      12            1      1      0      1        1        0    2    2
  8    SCORE     ROW8      20            1      1      0      1        1        0    2    2
  9    SCORE     ROW9      14            1      0      1      0        0        0    3    1
 10    SCORE     ROW10      8            1      0      1      0        0        0    3    1
 11    SCORE     ROW11     18            1      0      1      1        0        1    3    2
 12    SCORE     ROW12     20            1      0      1      1        0        1    3    2

Figure 75.38. Reference Cell Model, (1, 1) Reference Cell, Design Matrix
The following statements fit a deviations-from-means model. The default reference level is the last cell, (3,2). This coding is also called effects coding. See Figure 75.39 and Figure 75.40.

proc transreg data=x ss2 short;
   title2 'Deviations From Means, (3,2) Reference Cell';
   model identity(y) = class(a | b / deviations);
   output replace;
run;

proc print label;
run;
Two-way ANOVA Models
Deviations From Means, (3,2) Reference Cell

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    12
Number of Observations Used    12

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5          234.6667       46.93333       3.20    0.0946
Error               6           88.0000       14.66667
Corrected Total    11          322.6667

Root MSE           3.82971    R-Square    0.7273
Dependent Mean    13.33333    Adj R-Sq    0.5000
Coeff Var         28.72281

Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II Sum       Mean
Variable      DF    Coefficient    of Squares     Square    F Value    Pr > F    Label
Intercept      1     13.3333333       2133.33    2133.33     145.45    <.0001    Intercept
Class.a1       1      1.1666667          8.17       8.17       0.56    0.4837    a 1
Class.a2       1     -2.8333333         48.17      48.17       3.28    0.1199    a 2
Class.b1       1     -3.0000000        108.00     108.00       7.36    0.0349    b 1
Class.a1b1     1      3.5000000         73.50      73.50       5.01    0.0665    a 1 * b 1
Class.a2b1     1     -2.5000000         37.50      37.50       2.56    0.1609    a 2 * b 1

Figure 75.39. Deviations-From-Means Model, (3, 2) Reference Cell
The parameter estimates are

$\hat{\mu} = \bar{y} = 13.33333$
$\hat{\alpha}_1 = (\bar{y}_{11} + \bar{y}_{12})/2 - \bar{y} = (15 + 14)/2 - 13.33333 = 1.16667$
$\hat{\alpha}_2 = (\bar{y}_{21} + \bar{y}_{22})/2 - \bar{y} = (5 + 16)/2 - 13.33333 = -2.83333$
$\hat{\beta}_1 = (\bar{y}_{11} + \bar{y}_{21} + \bar{y}_{31})/3 - \bar{y} = (15 + 5 + 11)/3 - 13.33333 = -3$
$\hat{\gamma}_{11} = \bar{y}_{11} - (\bar{y} + \hat{\alpha}_1 + \hat{\beta}_1) = 15 - (13.33333 + 1.16667 + (-3)) = 3.5$
$\hat{\gamma}_{21} = \bar{y}_{21} - (\bar{y} + \hat{\alpha}_2 + \hat{\beta}_1) = 5 - (13.33333 + (-2.83333) + (-3)) = -2.5$

The structural zeros are

$\alpha_3 \equiv \beta_2 \equiv \gamma_{12} \equiv \gamma_{22} \equiv \gamma_{31} \equiv \gamma_{32} \equiv 0$
Two-way ANOVA Models
Deviations From Means, (3,2) Reference Cell

Obs    _TYPE_    _NAME_     y    Intercept    a 1    a 2    b 1    a1*b1    a2*b1    a    b
  1    SCORE     ROW1      16            1      1      0      1        1        0    1    1
  2    SCORE     ROW2      14            1      1      0      1        1        0    1    1
  3    SCORE     ROW3      15            1      1      0     -1       -1        0    1    2
  4    SCORE     ROW4      13            1      1      0     -1       -1        0    1    2
  5    SCORE     ROW5       1            1      0      1      1        0        1    2    1
  6    SCORE     ROW6       9            1      0      1      1        0        1    2    1
  7    SCORE     ROW7      12            1      0      1     -1        0       -1    2    2
  8    SCORE     ROW8      20            1      0      1     -1        0       -1    2    2
  9    SCORE     ROW9      14            1     -1     -1      1       -1       -1    3    1
 10    SCORE     ROW10      8            1     -1     -1      1       -1       -1    3    1
 11    SCORE     ROW11     18            1     -1     -1     -1        1        1    3    2
 12    SCORE     ROW12     20            1     -1     -1     -1        1        1    3    2

Figure 75.40. Deviations-From-Means Model, (3, 2) Reference Cell, Design Matrix
The following statements fit a deviations-from-means model, but this time the reference level is the first cell, (1,1). This coding is also called effects coding. See Figure 75.41 through Figure 75.42.

proc transreg data=x ss2 short;
   title2 'Deviations From Means, (1,1) Reference Cell';
   model identity(y) = class(a | b / deviations zero=first);
   output replace;
run;

proc print label;
run;
Two-way ANOVA Models
Deviations From Means, (1,1) Reference Cell

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    12
Number of Observations Used    12

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5          234.6667       46.93333       3.20    0.0946
Error               6           88.0000       14.66667
Corrected Total    11          322.6667

Root MSE           3.82971    R-Square    0.7273
Dependent Mean    13.33333    Adj R-Sq    0.5000
Coeff Var         28.72281

Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II Sum       Mean
Variable      DF    Coefficient    of Squares     Square    F Value    Pr > F    Label
Intercept      1     13.3333333       2133.33    2133.33     145.45    <.0001    Intercept
Class.a2       1     -2.8333333         48.17      48.17       3.28    0.1199    a 2
Class.a3       1      1.6666667         16.67      16.67       1.14    0.3274    a 3
Class.b2       1      3.0000000        108.00     108.00       7.36    0.0349    b 2
Class.a2b2     1      2.5000000         37.50      37.50       2.56    0.1609    a 2 * b 2
Class.a3b2     1      1.0000000          6.00       6.00       0.41    0.5461    a 3 * b 2

Figure 75.41. Deviations-From-Means Model, (1, 1) Reference Cell
The parameter estimates are

$\hat{\mu} = \bar{y} = 13.33333$
$\hat{\alpha}_2 = (\bar{y}_{21} + \bar{y}_{22})/2 - \bar{y} = (5 + 16)/2 - 13.33333 = -2.83333$
$\hat{\alpha}_3 = (\bar{y}_{31} + \bar{y}_{32})/2 - \bar{y} = (11 + 19)/2 - 13.33333 = 1.66667$
$\hat{\beta}_2 = (\bar{y}_{12} + \bar{y}_{22} + \bar{y}_{32})/3 - \bar{y} = (14 + 16 + 19)/3 - 13.33333 = 3$
$\hat{\gamma}_{22} = \bar{y}_{22} - (\bar{y} + \hat{\alpha}_2 + \hat{\beta}_2) = 16 - (13.33333 + (-2.83333) + 3) = 2.5$
$\hat{\gamma}_{32} = \bar{y}_{32} - (\bar{y} + \hat{\alpha}_3 + \hat{\beta}_2) = 19 - (13.33333 + 1.66667 + 3) = 1$

The structural zeros are

$\alpha_1 \equiv \beta_1 \equiv \gamma_{11} \equiv \gamma_{12} \equiv \gamma_{21} \equiv \gamma_{31} \equiv 0$
Two-way ANOVA Models
Deviations From Means, (1,1) Reference Cell

Obs    _TYPE_    _NAME_     y    Intercept    a 2    a 3    b 2    a2*b2    a3*b2    a    b
  1    SCORE     ROW1      16            1     -1     -1     -1        1        1    1    1
  2    SCORE     ROW2      14            1     -1     -1     -1        1        1    1    1
  3    SCORE     ROW3      15            1     -1     -1      1       -1       -1    1    2
  4    SCORE     ROW4      13            1     -1     -1      1       -1       -1    1    2
  5    SCORE     ROW5       1            1      1      0     -1       -1        0    2    1
  6    SCORE     ROW6       9            1      1      0     -1       -1        0    2    1
  7    SCORE     ROW7      12            1      1      0      1        1        0    2    2
  8    SCORE     ROW8      20            1      1      0      1        1        0    2    2
  9    SCORE     ROW9      14            1      0      1     -1        0       -1    3    1
 10    SCORE     ROW10      8            1      0      1     -1        0       -1    3    1
 11    SCORE     ROW11     18            1      0      1      1        0        1    3    2
 12    SCORE     ROW12     20            1      0      1      1        0        1    3    2

Figure 75.42. Deviations-From-Means Model, (1, 1) Reference Cell, Design Matrix
The following statements fit a less-than-full-rank model. The parameter estimates are constrained to sum to zero within each effect. See Figure 75.43 and Figure 75.44.

proc transreg data=x ss2 short;
   title2 'Less Than Full Rank Model';
   model identity(y) = class(a | b / zero=sum);
   output replace;
run;

proc print label;
run;
Two-way ANOVA Models
Less Than Full Rank Model

The TRANSREG Procedure

Dependent Variable Identity(y)

Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    12
Number of Observations Used    12

Identity(y)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Identity(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5          234.6667       46.93333       3.20    0.0946
Error               6           88.0000       14.66667
Corrected Total    11          322.6667

Root MSE           3.82971    R-Square    0.7273
Dependent Mean    13.33333    Adj R-Sq    0.5000
Coeff Var         28.72281

Univariate Regression Table Based on the Usual Degrees of Freedom

                                  Type II Sum       Mean
Variable      DF    Coefficient    of Squares     Square    F Value    Pr > F    Label
Intercept      1     13.3333333       2133.33    2133.33     145.45    <.0001    Intercept
Class.a1       1      1.1666667          8.17       8.17       0.56    0.4837    a 1
Class.a2       1     -2.8333333         48.17      48.17       3.28    0.1199    a 2
Class.a3       1      1.6666667         16.67      16.67       1.14    0.3274    a 3
Class.b1       1     -3.0000000        108.00     108.00       7.36    0.0349    b 1
Class.b2       1      3.0000000        108.00     108.00       7.36    0.0349    b 2
Class.a1b1     1      3.5000000         73.50      73.50       5.01    0.0665    a 1 * b 1
Class.a1b2     1     -3.5000000         73.50      73.50       5.01    0.0665    a 1 * b 2
Class.a2b1     1     -2.5000000         37.50      37.50       2.56    0.1609    a 2 * b 1
Class.a2b2     1      2.5000000         37.50      37.50       2.56    0.1609    a 2 * b 2
Class.a3b1     1     -1.0000000          6.00       6.00       0.41    0.5461    a 3 * b 1
Class.a3b2     1      1.0000000          6.00       6.00       0.41    0.5461    a 3 * b 2

The sum of the regression table DF's, minus one for the intercept, will be
greater than the model df when there are ZERO=SUM constraints.

Figure 75.43. Less-Than-Full-Rank Model
The parameter estimates are

$\hat{\mu} = \bar{y} = 13.33333$
$\hat{\alpha}_1 = (\bar{y}_{11} + \bar{y}_{12})/2 - \bar{y} = (15 + 14)/2 - 13.33333 = 1.16667$
$\hat{\alpha}_2 = (\bar{y}_{21} + \bar{y}_{22})/2 - \bar{y} = (5 + 16)/2 - 13.33333 = -2.83333$
$\hat{\alpha}_3 = (\bar{y}_{31} + \bar{y}_{32})/2 - \bar{y} = (11 + 19)/2 - 13.33333 = 1.66667$
$\hat{\beta}_1 = (\bar{y}_{11} + \bar{y}_{21} + \bar{y}_{31})/3 - \bar{y} = (15 + 5 + 11)/3 - 13.33333 = -3$
$\hat{\beta}_2 = (\bar{y}_{12} + \bar{y}_{22} + \bar{y}_{32})/3 - \bar{y} = (14 + 16 + 19)/3 - 13.33333 = 3$
$\hat{\gamma}_{11} = \bar{y}_{11} - (\bar{y} + \hat{\alpha}_1 + \hat{\beta}_1) = 15 - (13.33333 + 1.16667 + (-3)) = 3.5$
$\hat{\gamma}_{12} = \bar{y}_{12} - (\bar{y} + \hat{\alpha}_1 + \hat{\beta}_2) = 14 - (13.33333 + 1.16667 + 3) = -3.5$
$\hat{\gamma}_{21} = \bar{y}_{21} - (\bar{y} + \hat{\alpha}_2 + \hat{\beta}_1) = 5 - (13.33333 + (-2.83333) + (-3)) = -2.5$
$\hat{\gamma}_{22} = \bar{y}_{22} - (\bar{y} + \hat{\alpha}_2 + \hat{\beta}_2) = 16 - (13.33333 + (-2.83333) + 3) = 2.5$
$\hat{\gamma}_{31} = \bar{y}_{31} - (\bar{y} + \hat{\alpha}_3 + \hat{\beta}_1) = 11 - (13.33333 + 1.66667 + (-3)) = -1$
$\hat{\gamma}_{32} = \bar{y}_{32} - (\bar{y} + \hat{\alpha}_3 + \hat{\beta}_2) = 19 - (13.33333 + 1.66667 + 3) = 1$

The constraints are

$\alpha_1 + \alpha_2 + \alpha_3 \equiv \beta_1 + \beta_2 \equiv 0$
$\gamma_{11} + \gamma_{12} \equiv \gamma_{21} + \gamma_{22} \equiv \gamma_{31} + \gamma_{32} \equiv \gamma_{11} + \gamma_{21} + \gamma_{31} \equiv \gamma_{12} + \gamma_{22} + \gamma_{32} \equiv 0$
Two-way ANOVA Models
Less Than Full Rank Model

Obs  _TYPE_  _NAME_   y  Intercept  a1  a2  a3  b1  b2  a1*b1  a1*b2  a2*b1  a2*b2  a3*b1  a3*b2  a  b
  1  SCORE   ROW1    16          1   1   0   0   1   0      1      0      0      0      0      0  1  1
  2  SCORE   ROW2    14          1   1   0   0   1   0      1      0      0      0      0      0  1  1
  3  SCORE   ROW3    15          1   1   0   0   0   1      0      1      0      0      0      0  1  2
  4  SCORE   ROW4    13          1   1   0   0   0   1      0      1      0      0      0      0  1  2
  5  SCORE   ROW5     1          1   0   1   0   1   0      0      0      1      0      0      0  2  1
  6  SCORE   ROW6     9          1   0   1   0   1   0      0      0      1      0      0      0  2  1
  7  SCORE   ROW7    12          1   0   1   0   0   1      0      0      0      1      0      0  2  2
  8  SCORE   ROW8    20          1   0   1   0   0   1      0      0      0      1      0      0  2  2
  9  SCORE   ROW9    14          1   0   0   1   1   0      0      0      0      0      1      0  3  1
 10  SCORE   ROW10    8          1   0   0   1   1   0      0      0      0      0      1      0  3  1
 11  SCORE   ROW11   18          1   0   0   1   0   1      0      0      0      0      0      1  3  2
 12  SCORE   ROW12   20          1   0   0   1   0   1      0      0      0      0      0      1  3  2

Figure 75.44. Less-Than-Full-Rank Model, Design Matrix
Centering

You can use transformation options to center and standardize the variables in several ways. For example, this MODEL statement creates three independent variables: $x$, $x^2$, and $x^3$.

model identity(y) = pspline(x);
The variables are not centered. When the CENTER t-option is specified, the three independent variables are $x - \bar{x}$, $(x - \bar{x})^2$, and $(x - \bar{x})^3$.

model identity(y) = pspline(x / center);
Since operations such as squaring occur after the centering, the resulting variables are not always centered. The CENTER t-option is particularly useful with polynomials since centering before squaring and cubing can help reduce collinearity and numerical problems. For example, suppose one of your variables is year, with values all greater than 1900. Squaring and cubing without centering first will create variables that are all essentially perfectly correlated. When the TSTANDARD=CENTER t-option is specified, the three independent variables are $x - \bar{x}$, $x^2 - \overline{x^2}$, and $x^3 - \overline{x^3}$.

model identity(y) = pspline(x / tstandard=center);
In this case, the variables are squared and cubed and then centered.
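As a concrete sketch of the year example above (the data set Sales and the variables Revenue and Year are hypothetical), centering before the polynomial expansion keeps the squared and cubed terms from being nearly collinear:

proc transreg data=sales ss2;
   model identity(revenue) = pspline(year / center);  /* (year - mean), then squared and cubed */
   output out=centered;
run;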
Displayed Output

The display options control the amount of displayed output. The displayed output can contain

• an iteration history and convergence status table, by default
• an ANOVA table when the TEST, SS2, or UTILITIES a-option is specified
• a regression table when the SS2 a-option is specified
• conjoint analysis part-worth utilities when the UTILITIES a-option is specified
• model details when the DETAIL a-option is specified
• a multivariate ANOVA table when the dependent variable is transformed and the TEST or SS2 a-option is specified
• a multivariate regression table when the dependent variable is transformed and the SS2 a-option is specified
• liberal and conservative ANOVA, multivariate ANOVA, regression, and multivariate regression tables when there are MONOTONE, UNTIE, or MSPLINE transformations and the TEST or SS2 a-option is specified

A short sketch illustrating some of these options follows the list.
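For instance, a step like the following (a minimal sketch; the Tires data set from the conjoint examples later in this chapter is assumed) requests the ANOVA and regression tables as well as the part-worth utilities:

proc transreg data=tires ss2 utilities;
   model monotone(rank / reflect) = class(brand price life hazard / zero=sum);
run;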
ODS Table Names

PROC TRANSREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System." A brief example follows Table 75.7.

Table 75.7. ODS Tables Produced in PROC TRANSREG

ODS Table Name          Description                                Statement     Option
NObs                    Number of observations                     MODEL/PROC    TEST/SS2
ClassLevels             Class variable levels                      MODEL/PROC    TEST/SS2
ANOVA                   ANOVA                                      MODEL/PROC    TEST/SS2
LiberalANOVA            ANOVA, *1                                  MODEL/PROC    TEST/SS2
ConservANOVA            ANOVA, *1                                  MODEL/PROC    TEST/SS2
FitStatistics           Fit statistics like R-square               MODEL/PROC    TEST/SS2
LiberalFitStatistics    Fit statistics, *1                         MODEL/PROC    TEST/SS2
ConservFitStatistics    Fit statistics, *1                         MODEL/PROC    TEST/SS2
MVANOVA                 Multivariate ANOVA, *2                     MODEL/PROC    TEST/SS2
LiberalMVANOVA          Multivariate ANOVA, *1, *2                 MODEL/PROC    TEST/SS2
ConservMVANOVA          Multivariate ANOVA, *1, *2                 MODEL/PROC    TEST/SS2
Coef                    Regression results                         MODEL/PROC    SS2
LiberalCoef             Regression results, *1                     MODEL/PROC    SS2
ConservCoef             Regression results, *1                     MODEL/PROC    SS2
MVCoef                  Multivariate regression results, *2        MODEL/PROC    SS2
LiberalMVCoef           Multivariate regression results, *1, *2    MODEL/PROC    SS2
ConservMVCoef           Multivariate regression results, *1, *2    MODEL/PROC    SS2
Utilities               Conjoint analysis utilities                MODEL/PROC    UTILITIES
LiberalUtilities        Conjoint analysis utilities, *1            MODEL/PROC    UTILITIES
ConservUtilities        Conjoint analysis utilities, *1            MODEL/PROC    UTILITIES
BoxCox                  Box-Cox transformation results             MODEL         BOXCOX
Equation                Linear dependency equation                 MODEL/PROC    less-than-full-rank model
Details                 Model details                              MODEL/PROC    DETAIL
Univariate              Univariate iteration history               MODEL/PROC    METHOD=UNIVARIATE
MORALS                  MORALS iteration history                   MODEL/PROC    METHOD=MORALS
CANALS                  CANALS iteration history                   MODEL/PROC    METHOD=CANALS
Redundancy              Redundancy iteration history               MODEL/PROC    METHOD=REDUNDANCY
TestIterations          Hypothesis test iterations                 MODEL/PROC    SS2
ConvergenceStatus       Iteration history convergence status       MODEL         default
Footnotes               Iteration history footnotes                MODEL         default
SplineCoef              Spline coefficients                        MODEL         SPLINE/MSPLINE

*1. Liberal and conservative test tables are produced when a MONOTONE, UNTIE, or MSPLINE transformation is requested.
*2. Multivariate tables are produced when the dependent variable is iteratively transformed.
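For example, the following sketch (the input data set name mydata is a hypothetical placeholder) uses two of the names in Table 75.7, first to restrict the displayed output and then to capture the regression results in a data set:

ods select FitStatistics Coef;
ods output Coef=coefs;    /* capture the regression results table */

proc transreg data=mydata ss2;
   model identity(y) = spline(x);
run;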
Examples

Example 75.1. Using Splines and Knots

This example illustrates some properties of splines. Splines are curves, which are usually required to be continuous and smooth. Splines are usually defined as piecewise polynomials of degree n with function values and first n − 1 derivatives that agree at the points where they join. The abscissa values of the join points are called knots. The term "spline" is also used for polynomials (splines with no knots) and piecewise polynomials with more than one discontinuous derivative. Splines with no knots are generally smoother than splines with knots, which are generally smoother than splines with multiple discontinuous derivatives. Splines with few knots are generally smoother than splines with many knots; however, increasing the number of knots usually increases the fit of the spline function to the data. Knots give the curve freedom to bend to more closely follow the data. Refer to Smith (1979) for an excellent introduction to splines.

In this example, an artificial data set is created with a variable Y that is a discontinuous function of X. See the first plot in Output 75.1.7. Notice that the function has four unconnected parts, each of which is a curve. Notice too that there is an overall quadratic trend; that is, ignoring the shapes of the individual curves, at first the Y values tend to decrease as X increases, then the Y values tend to increase.

The first PROC TRANSREG analysis fits a linear regression model. The predicted values of Y given X are output and plotted to form the linear regression line. The R² for the linear regression is 0.10061, and it can be seen from the second plot in Output 75.1.7 that the linear regression model is not appropriate for these data. The following statements create the data set and perform the first PROC TRANSREG analysis. These statements produce Output 75.1.1.

title 'An Illustration of Splines and Knots';

* Create in Y a discontinuous function of X.
*
* Store copies of X in V1-V7 for use in PROC GPLOT.
* These variables are only necessary so that each plot
* can have its own x-axis label while putting four
* plots on a page.;

data A;
   array V[7] V1-V7;
   X=-0.000001;
   do I=0 to 199;
      if mod(I,50)=0 then do;
         C=((X/2)-5)**2;
         if I=150 then C=C+5;
         Y=C;
      end;
      X=X+0.1;
      Y=Y-sin(X-C);
      do J=1 to 7;
         V[J]=X;
      end;
      output;
   end;
run;

* Each of the PROC TRANSREG steps fits a different spline
* model to the data set created previously. The TRANSREG
* steps build up a data set with various regression
* functions. All of the functions are then plotted with
* the final PROC GPLOT step.
*
* The OUTPUT statements add new predicted values variables
* to the data set, while the ID statements save all of the
* previously created variables that are needed for the
* plots.;

proc transreg data=A;
   model identity(Y) = identity(X);
   title2 'A Linear Regression Function';
   output out=A pprefix=Linear;
   id V1-V7;
run;
Output 75.1.1. Fitting a Linear Regression Model with PROC TRANSREG

An Illustration of Splines and Knots
A Linear Regression Function

The TRANSREG Procedure

TRANSREG Univariate Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.10061                 Converged

Algorithm converged.
The second PROC TRANSREG analysis finds a degree two spline transformation with no knots, which is a quadratic polynomial. The spline is a weighted sum of a single constant, a single straight line, and a single quadratic curve. The R² increases from 0.10061, which is the linear fit value from before, to 0.40720. It can be seen from the third plot in Output 75.1.7 that the quadratic regression function does not fit any of the individual curves well, but it does follow the overall trend in the data. Since the overall trend is quadratic, a degree three spline with no knots (not shown) increases R² by only a small amount. The following statements perform the quadratic analysis and produce Output 75.1.2.

proc transreg data=A;
   model identity(Y) = spline(X / degree=2);
   title2 'A Quadratic Polynomial Regression Function';
   output out=A pprefix=Quad;
   id V1-V7 LinearY;
run;
Output 75.1.2. Fitting a Quadratic Polynomial

An Illustration of Splines and Knots
A Quadratic Polynomial Regression Function

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.82127    2.77121     0.10061
        2    0.00000    0.00000     0.40720      0.30659    Converged

Algorithm converged.
The next step uses the default degree of three, for a piecewise cubic polynomial, and requests knots at the known break points, X=5, 10, and 15. This requests a spline that is continuous, has continuous first and second derivatives, and has a third derivative that is discontinuous at 5, 10, and 15. The spline is a weighted sum of a single constant, a single straight line, a single quadratic curve, a cubic curve for the portion of X less than 5, a different cubic curve for the portion of X between 5 and 10, a different cubic curve for the portion of X between 10 and 15, and another cubic curve for the portion of X greater than 15. The new R² is 0.61730, and it can be seen from the fourth plot (in Output 75.1.7) that the spline is less smooth than the quadratic polynomial and follows the data more closely than the quadratic polynomial. The following statements perform this analysis and produce Output 75.1.3:

proc transreg data=A;
   model identity(Y) = spline(X / knots=5 10 15);
   title2 'A Cubic Spline Regression Function';
   title3 'The Third Derivative is Discontinuous at X=5, 10, 15';
   output out=A pprefix=Cub1;
   id V1-V7 LinearY QuadY;
run;
Output 75.1.3. Fitting a Piecewise Cubic Polynomial

An Illustration of Splines and Knots
A Cubic Spline Regression Function
The Third Derivative is Discontinuous at X=5, 10, 15

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.85367    3.88449     0.10061
        2    0.00000    0.00000     0.61730      0.51670    Converged

Algorithm converged.
The same model could be fit with a DATA step and PROC REG, as follows. (The output from the following code is not displayed.)

data B;                          /* A is the data set used for transreg */
   set a(keep=X Y);
   X1=X;                         /* X                                   */
   X2=X**2;                      /* X squared                           */
   X3=X**3;                      /* X cubed                             */
   X4=(X> 5)*((X-5)**3);         /* change in X**3 after 5              */
   X5=(X>10)*((X-10)**3);        /* change in X**3 after 10             */
   X6=(X>15)*((X-15)**3);        /* change in X**3 after 15             */
run;

proc reg;
   model Y=X1-X6;
run;
In the next step each knot is repeated three times, so the first, second, and third derivatives are discontinuous at X=5, 10, and 15, but the spline is required to be continuous at the knots. The spline is a weighted sum of the following:

• a single constant
• a line for the portion of X less than 5
• a quadratic curve for the portion of X less than 5
• a cubic curve for the portion of X less than 5
• a different line for the portion of X between 5 and 10
• a different quadratic curve for the portion of X between 5 and 10
• a different cubic curve for the portion of X between 5 and 10
• a different line for the portion of X between 10 and 15
• a different quadratic curve for the portion of X between 10 and 15
• a different cubic curve for the portion of X between 10 and 15
• another line for the portion of X greater than 15
• another quadratic curve for the portion of X greater than 15
• another cubic curve for the portion of X greater than 15

The spline is continuous since there is not a separate constant in the formula for the spline for each knot. Now the R² is 0.95542, and the spline closely follows the data, except at the knots. The following statements perform this analysis and produce Output 75.1.4:

proc transreg data=A;
   model identity(y) = spline(x / knots=5 5 5 10 10 10 15 15 15);
   title3 'First - Third Derivatives Discontinuous at X=5, 10, 15';
   output out=A pprefix=Cub3;
   id V1-V7 LinearY QuadY Cub1Y;
run;
Output 75.1.4. Piecewise Polynomial with Discontinuous Derivatives

An Illustration of Splines and Knots
A Cubic Spline Regression Function
First - Third Derivatives Discontinuous at X=5, 10, 15

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.92492    3.50038     0.10061
        2    0.00000    0.00000     0.95542      0.85481    Converged

Algorithm converged.
The same model could be fit with a DATA step and PROC REG, as follows. (The output from the following code is not displayed.)

data B;
   set a(keep=X Y);
   X1=X;                         /* X                       */
   X2=X**2;                      /* X squared               */
   X3=X**3;                      /* X cubed                 */
   X4=(X>5)   * (X- 5);          /* change in X    after 5  */
   X5=(X>10)  * (X-10);          /* change in X    after 10 */
   X6=(X>15)  * (X-15);          /* change in X    after 15 */
   X7=(X>5)   * ((X-5)**2);      /* change in X**2 after 5  */
   X8=(X>10)  * ((X-10)**2);     /* change in X**2 after 10 */
   X9=(X>15)  * ((X-15)**2);     /* change in X**2 after 15 */
   X10=(X>5)  * ((X-5)**3);      /* change in X**3 after 5  */
   X11=(X>10) * ((X-10)**3);     /* change in X**3 after 10 */
   X12=(X>15) * ((X-15)**3);     /* change in X**3 after 15 */
run;

proc reg;
   model Y=X1-X12;
run;
When the knots are repeated four times in the next step, the spline function is discontinuous at the knots and follows the data even more closely, with an R² of 0.99254. In this step, each separate curve is approximated by a cubic polynomial (with no knots within the separate polynomials). The following statements perform this analysis and produce Output 75.1.5:

proc transreg data=A;
   model identity(Y) = spline(X / knots=5 5 5 5 10 10 10 10
                                        15 15 15 15);
   title3 'Discontinuous Function and Derivatives';
   output out=A pprefix=Cub4;
   id V1-V7 LinearY QuadY Cub1Y Cub3Y;
run;
Output 75.1.5. Discontinuous Function and Derivatives

An Illustration of Splines and Knots
A Cubic Spline Regression Function
Discontinuous Function and Derivatives

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.90271    3.29184     0.10061
        2    0.00000    0.00000     0.99254      0.89193    Converged

Algorithm converged.
To solve this problem with a DATA step and PROC REG, you would need to create all of the variables in the preceding DATA step (the B data set for the piecewise polynomial with discontinuous third derivatives), plus the following three variables.

X13=(X > 5);     /* intercept change after 5  */
X14=(X > 10);    /* intercept change after 10 */
X15=(X > 15);    /* intercept change after 15 */
The last two steps use the NKNOTS= t-option to specify the number of knots but not their location. NKNOTS=4 places knots at the quintiles, while NKNOTS=9 places knots at the deciles. The spline and its first two derivatives are continuous. The R² values are 0.74450 and 0.95256. Even though the knots are placed in the wrong places, the spline can closely follow the data with NKNOTS=9. The following statements produce Output 75.1.6.

proc transreg data=A;
   model identity(Y) = spline(X / nknots=4);
   title3 'Four Knots';
   output out=A pprefix=Cub4k;
   id V1-V7 LinearY QuadY Cub1Y Cub3Y Cub4Y;
run;

proc transreg data=A;
   model identity(Y) = spline(X / nknots=9);
   title3 'Nine Knots';
   output out=A pprefix=Cub9k;
   id V1-V7 LinearY QuadY Cub1Y Cub3Y Cub4Y Cub4kY;
run;
Output 75.1.6. Specifying Number of Knots instead of Knot Location

An Illustration of Splines and Knots
A Cubic Spline Regression Function
Four Knots

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.90305    4.46027     0.10061
        2    0.00000    0.00000     0.74450      0.64389    Converged

Algorithm converged.

Output 75.1.6. (continued)

An Illustration of Splines and Knots
A Cubic Spline Regression Function
Nine Knots

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
------------------------------------------------------------------------
        1    0.94832    3.03488     0.10061
        2    0.00000    0.00000     0.95256      0.85196    Converged

Algorithm converged.
The following statements produce plots that show the data and fit at each step of the analysis. These statements produce Output 75.1.7.

goptions goutmode=replace nodisplay;
%let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
* Depending on your goptions, these plot options may work better:
* %let opts = haxis=axis2 vaxis=axis1 frame;

proc gplot data=A;
   title;
   axis1 minor=none label=(angle=90 rotate=0);
   axis2 minor=none;
   plot Y*X=1              /         &opts name='tregdis1';
   plot Y*V1=1 linearY*X=2 /overlay  &opts name='tregdis2';
   plot Y*V2=1 quadY*X=2   /overlay  &opts name='tregdis3';
   plot Y*V3=1 cub1Y*X=2   /overlay  &opts name='tregdis4';
   plot Y*V4=1 cub3Y*X=2   /overlay  &opts name='tregdis5';
   plot Y*V5=1 cub4Y*X=2   /overlay  &opts name='tregdis6';
   plot Y*V6=1 cub4kY*X=2  /overlay  &opts name='tregdis7';
   plot Y*V7=1 cub9kY*X=2  /overlay  &opts name='tregdis8';
   symbol1 color=blue v=star i=none;
   symbol2 color=yellow v=dot i=none;
   label V1 = 'Linear Regression'
         V2 = 'Quadratic Regression Function'
         V3 = '1 Discontinuous Derivative'
         V4 = '3 Discontinuous Derivatives'
         V5 = 'Discontinuous Function'
         V6 = '4 Knots'
         V7 = '9 Knots'
         Y  = 'Y'
         LinearY = 'Y'
         QuadY   = 'Y'
         Cub1Y   = 'Y'
         Cub3Y   = 'Y'
         Cub4Y   = 'Y'
         Cub4kY  = 'Y'
         Cub9kY  = 'Y';
run;
quit;
goptions display;

proc greplay nofs tc=sashelp.templt template=l2r2;
   igout gseg;
   treplay 1:tregdis1 2:tregdis3 3:tregdis2 4:tregdis4;
   treplay 1:tregdis5 2:tregdis7 3:tregdis6 4:tregdis8;
run;
quit;
Output 75.1.7. Plots Summarizing Analysis for Spline Example
Output 75.1.7. (continued)
These next steps show how to find optimal spline transformations of variables in one data set and apply the same transformations to variables in another data set. These steps produce two artificial data sets, in which the variable Y is a linear function of nonlinear transformations of the variables X, W, and Z.

title2 'Scoring Spline Variables';

data x;
   do i = 1 to 5000;
      w = normal(7);
      x = normal(7);
      z = normal(7);
      y = w * w + log(5 + x) + sin(z) + normal(7);
      output;
   end;
run;

data z;
   do i = 1 to 5000;
      w = normal(1);
      x = normal(1);
      z = normal(1);
      y = w * w + log(5 + x) + sin(z) + normal(1);
      output;
   end;
run;
First, you run PROC TRANSREG asking for spline transformations of the three independent variables. You must use the EXKNOTS= t-option, because you need to use the same knots, both interior and exterior, with both data sets. By default the exterior knots will be different if the minima and maxima are different in the two data sets, so you will get the wrong results if you do not specify the EXKNOTS= t-option with values less than the minima and greater than the maxima of the variables in the two data sets.

ods output splinecoef=c;

proc transreg data=x dum det ss2;
   model ide(y) = spl(w x z / knots=-1.5 to 1.5 by 0.5 exknots=-5 5);
   output out=d;
run;
The nonprinting “SplineCoef” table is output to a SAS data set. This data set contains the coefficients used to get the spline transformations and can be used to transform variables in other data sets. These coefficients are also in the details table. However, in the “SplineCoef” table they are in a form directly suitable for use with PROC SCORE. This next step reads a different input data set and generates an output data set with the B-spline basis for each of the variables. Note that the same interior and exterior knots are used in both the previous and the next steps.
proc transreg data=z design;
   model bspl(w x z / knots=-1.5 to 1.5 by 0.5 exknots=-5 5);
   output out=b;
run;
These next three steps score the B-spline bases created in the previous step using the coefficients generated in the first PROC TRANSREG step. PROC SCORE is run once for each SPLINE variable.

proc score data=b score=c out=o1(rename=(spline=bw w=nw));
   var w:;
run;

proc score data=b score=c out=o2(rename=(spline=bx x=nx));
   var x:;
run;

proc score data=b score=c out=o3(rename=(spline=bz z=nz));
   var z:;
run;
The next steps merge the three transformations with the original data and plot the results. The plots in Output 75.1.8 show that, in fact, the two transformations for each variable, original and scored, are the same function. Furthermore, PROC TRANSREG found the functional forms that were used to generate the data: quadratic for W, log for X, and sine for Z.

goptions goutmode=replace nodisplay;

data all;
   merge d(keep=w x z tw tx tz) o1(keep=nw bw)
         o2(keep=nx bx) o3(keep=nz bz);
run;

proc gplot data=all;
   title3 'Exterior Knots Specified - Curves are the Same';
   symbol1 color=blue v=none i=smooths;
   symbol2 color=red  v=none i=smooths;
   plot tw * w = 1 bw * nw = 2 / overlay name='tregspl1';
   plot tx * x = 1 bx * nx = 2 / overlay name='tregspl2';
   plot tz * z = 1 bz * nz = 2 / overlay name='tregspl3';
run;
quit;
goptions display;

proc greplay nofs tc=sashelp.templt template=l2r2;
   igout gseg;
   treplay 1:tregspl1 2:tregspl3 3:tregspl2;
run;
quit;
Output 75.1.8. Scoring Spline Variables Example
Example 75.2. Nonmetric Conjoint Analysis of Tire Data

This example uses PROC TRANSREG to perform a nonmetric conjoint analysis of tire preference data. Conjoint analysis decomposes rank-ordered evaluation judgments of products or services into components based on qualitative product attributes.
For each level of each attribute of interest, a numerical "part-worth utility" value is computed. The sum of the part-worth utilities for each product is an estimate of the utility for that product. The goal is to compute part-worth utilities such that the product utilities are as similar as possible to the original rank ordering. (This example is a greatly simplified introductory example.)

The stimuli for the experiment are 18 hypothetical tires. The stimuli represent different brands (Goodstone, Pirogi, Machismo),* prices ($69.99, $74.99, $79.99), expected tread life (50,000, 60,000, 70,000), and road hazard insurance plans (Yes, No). There are 3 × 3 × 3 × 2 = 54 possible combinations. From these, 18 combinations are selected that form an efficient experimental design for a main effects model. The combinations are then ranked from 1 (most preferred) to 18 (least preferred). In this simple example, there is one set of rankings. A real conjoint study would have many more.

* In real conjoint experiments, real brand names are used.

First, the FORMAT procedure is used to specify the meanings of the factor levels, which are entered as numbers in the DATA step along with the ranks. PROC TRANSREG is used to perform the conjoint analysis. A maximum of 50 iterations is requested. The specification Monotone(Rank / Reflect) in the MODEL statement requests that the dependent variable Rank be monotonically transformed and reflected so that positive utilities mean high preference. The variables Brand, Price, Life, and Hazard are designated as CLASS variables, and the part-worth utilities are constrained by ZERO=SUM to sum to zero within each factor. The UTILITIES a-option displays the conjoint analysis results.

The Importance column of the Utilities Table shows that price is the most important attribute in determining preference (57%), followed by expected tread life (18%), brand (15%), and road hazard insurance (10%). Looking at the Utilities Table for the maximum part-worth utility within each attribute, you see from the results that the most preferred combination is Pirogi brand tires, at $69.99, with a 70,000 mile expected tread life, and road hazard insurance. This product is not actually in the data set. The sum of the part-worth utilities for this combination is

$20.64 = 9.50 + 1.90 + 5.87 + 2.41 + 0.96$

The following statements produce Output 75.2.1:

title 'Nonmetric Conjoint Analysis of Ranks';

proc format;
   value BrandF  1 = 'Goodstone'
                 2 = 'Pirogi   '
                 3 = 'Machismo ';
   value PriceF  1 = '$69.99'
                 2 = '$74.99'
                 3 = '$79.99';
   value LifeF   1 = '50,000'
                 2 = '60,000'
                 3 = '70,000';
   value HazardF 1 = 'Yes'
                 2 = 'No ';
run;
data Tires;
   input Brand Price Life Hazard Rank;
   format Brand BrandF9. Price PriceF9. Life LifeF6. Hazard HazardF3.;
   datalines;
1 1 2 1  3
1 1 3 2  2
1 2 1 2 14
1 2 2 2 10
1 3 1 1 17
1 3 3 1 12
2 1 1 2  7
2 1 3 2  1
2 2 1 1  8
2 2 3 1  5
2 3 2 1 13
2 3 2 2 16
3 1 1 1  6
3 1 2 1  4
3 2 2 2 15
3 2 3 1  9
3 3 1 2 18
3 3 3 2 11
;

proc transreg maxiter=50 utilities short;
   ods select ConvergenceStatus FitStatistics Utilities;
   model monotone(Rank / reflect) =
         class(Brand Price Life Hazard / zero=sum);
   output ireplace predicted;
run;

proc print label;
   var Rank TRank PRank Brand Price Life Hazard;
   label PRank = 'Predicted Ranks';
run;
Output 75.2.1. Simple Conjoint Analysis

Nonmetric Conjoint Analysis of Ranks

The TRANSREG Procedure

Monotone(Rank)
Algorithm converged.

The TRANSREG Procedure Hypothesis Tests for Monotone(Rank)

Root MSE          0.49759    R-Square    0.9949
Dependent Mean    9.50000    Adj R-Sq    0.9913
Coeff Var         5.23783

Utilities Table Based on the Usual Degrees of Freedom

                               Standard    Importance
Label              Utility       Error     (% Utility Range)    Variable
Intercept           9.5000     0.11728                          Intercept

Brand Goodstone    -1.1718     0.16586        15.463            Class.BrandGoodstone
Brand Pirogi        1.8980     0.16586                          Class.BrandPirogi
Brand Machismo     -0.7262     0.16586                          Class.BrandMachismo

Price $69.99        5.8732     0.16586        56.517            Class.Price_69_99
Price $74.99       -0.5261     0.16586                          Class.Price_74_99
Price $79.99       -5.3471     0.16586                          Class.Price_79_99

Life 50,000        -1.2350     0.16586        18.361            Class.Life50_000
Life 60,000        -1.1751     0.16586                          Class.Life60_000
Life 70,000         2.4101     0.16586                          Class.Life70_000

Hazard Yes          0.9588     0.11728         9.659            Class.HazardYes
Hazard No          -0.9588     0.11728                          Class.HazardNo

The standard errors are not adjusted for the fact that the dependent variable
was transformed and so are generally liberal (too small).
Output 75.2.1. (continued)

                 Nonmetric Conjoint Analysis of Ranks

              Rank            Predicted
Obs   Rank    Transformation  Ranks      Brand      Price    Life    Hazard
  1      3       14.4462      13.9851    Goodstone  $69.99   60,000  Yes
  2      2       15.6844      15.6527    Goodstone  $69.99   70,000  No
  3     14        5.7229       5.6083    Goodstone  $74.99   50,000  No
  4     10        5.7229       5.6682    Goodstone  $74.99   60,000  No
  5     17        2.6699       2.7049    Goodstone  $79.99   50,000  Yes
  6     12        5.7229       6.3500    Goodstone  $79.99   70,000  Yes
  7      7       14.4462      15.0774    Pirogi     $69.99   50,000  No
  8      1       18.7699      18.7225    Pirogi     $69.99   70,000  No
  9      8       11.1143      10.5957    Pirogi     $74.99   50,000  Yes
 10      5       14.4462      14.2408    Pirogi     $74.99   70,000  Yes
 11     13        5.7229       5.8346    Pirogi     $79.99   60,000  Yes
 12     16        3.8884       3.9170    Pirogi     $79.99   60,000  No
 13      6       14.4462      14.3708    Machismo   $69.99   50,000  Yes
 14      4       14.4462      14.4307    Machismo   $69.99   60,000  Yes
 15     15        5.7229       6.1139    Machismo   $74.99   60,000  No
 16      9       11.1143      11.6166    Machismo   $74.99   70,000  Yes
 17     18        1.1905       1.2330    Machismo   $79.99   50,000  No
 18     11        5.7229       4.8780    Machismo   $79.99   70,000  No
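The Predicted Ranks column can be reproduced by the same kind of sum. For example, the following sketch (added here for illustration, not part of the original example) computes the predicted value for observation 1 (Goodstone, $69.99, 60,000, Yes) from the part-worth utilities in Output 75.2.1:

data _null_;
   /* intercept + Brand Goodstone + Price $69.99 + Life 60,000 + Hazard Yes */
   predicted = 9.5000 + (-1.1718) + 5.8732 + (-1.1751) + 0.9588;
   put 'Predicted (transformed) rank for observation 1: ' predicted 7.4;
run;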
Example 75.3. Metric Conjoint Analysis of Tire Data

This example, which is more detailed than the previous one, uses PROC TRANSREG to perform a metric conjoint analysis of tire preference data. Conjoint analysis can be used to decompose preference ratings of products or services into components based on qualitative product attributes. For each level of each attribute of interest, a numerical "part-worth utility" value is computed. The sum of the part-worth utilities for each product is an estimate of the utility for that product. The goal is to compute part-worth utilities such that the product utilities are as similar as possible to the original ratings.

Metric conjoint analysis, as shown in this example, fits an ordinary linear model directly to data assumed to be measured on an interval scale. Nonmetric conjoint analysis, as shown in Example 75.2, finds an optimal monotonic transformation of the original data before fitting an ordinary linear model to the transformed data.

This example has three parts. In the first part, an experimental design is created. In the second part, a DATA step creates descriptions of the stimuli for the experiment. The third part of the example performs the conjoint analyses.

The stimuli for the experiment are 18 hypothetical tires. The stimuli represent different brands (Goodstone, Pirogi, Machismo; in real conjoint experiments, real brand names would be used), prices ($69.99, $74.99, $79.99), expected tread life (50,000, 60,000, 70,000), and road hazard insurance plans (Yes, No).

For a conjoint study such as this, you need to create an experimental design with 3 three-level factors, 1 two-level factor, and 18 combinations or runs. The easiest way to get this design is with the %MktEx autocall macro. The %MktEx macro requires you to specify the number of levels of each of the four factors, followed by N=18, the number of runs. Specifying a random number seed, while not strictly necessary, helps ensure that the design is reproducible. The %MktLab macro assigns the actual factor names instead of the default names X1, X2, and so on, and it assigns formats to the factor levels. The %MktEval macro helps you evaluate the design. It shows how correlated or independent the factors are, how often each factor level appears in the design, how often each pair of levels occurs for every factor pair, and how often each product profile or run occurs in the design. See Kuhfeld (2003) for more information on these tools and their use in conjoint and choice modeling.

title 'Tire Study, Experimental Design';

proc format;
   value BrandF  1 = 'Goodstone'
                 2 = 'Pirogi   '
                 3 = 'Machismo ';
   value PriceF  1 = '$69.99'
                 2 = '$74.99'
                 3 = '$79.99';
   value LifeF   1 = '50,000'
                 2 = '60,000'
                 3 = '70,000';
   value HazardF 1 = 'Yes'
                 2 = 'No ';
run;

%mktex(3 3 3 2, n=18, seed=448)
%mktlab(vars=Brand Price Life Hazard, out=sasuser.TireDesign,
        statements=format Brand BrandF9. Price PriceF9.
                   Life LifeF6. Hazard HazardF3.)
%mkteval;

proc print data=sasuser.TireDesign; run;
The %MktEx macro output displayed in Output 75.3.1 shows that the design is 100% efficient, which means it is orthogonal and balanced. The %MktEval macro output displayed in Output 75.3.2 shows that all of the factors are uncorrelated or orthogonal, the design is balanced (each level occurs equally often), and every pair of factor levels occurs equally often (again showing that the design is orthogonal). The n-way frequencies show that each product profile occurs once (there are no duplicates). The design is shown in Output 75.3.3. The design is automatically randomized (the profiles were sorted into a random order, and the original levels are randomly reassigned). Orthogonality, balance, randomization, and other design concepts are discussed in detail in Kuhfeld (2003).
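If you want to verify the balance claim directly, a one-way PROC FREQ on the design (a small sketch added here, not part of the original example) should show each level of the three-level factors 6 times and each level of the two-level factor 9 times in the 18 runs:

proc freq data=sasuser.TireDesign;
   /* each level of Brand, Price, and Life should occur 6 times,
      and each level of Hazard 9 times, in the 18 runs */
   tables Brand Price Life Hazard;
run;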
Output 75.3.1. Tire Study, Design Efficiency

                 Tire Study, Experimental Design

                     Algorithm Search History

                          Current         Best
Design   Row,Col       D-Efficiency   D-Efficiency   Notes
----------------------------------------------------------
     1   Start           100.0000       100.0000     Tab
     1   End             100.0000

                 Tire Study, Experimental Design
                      The OPTEX Procedure

                    Class Level Information

                 Class    Levels    -Values-
                 x1            3    1 2 3
                 x2            3    1 2 3
                 x3            3    1 2 3
                 x4            2    1 2

                                                        Average Prediction
Design Number   D-Efficiency   A-Efficiency   G-Efficiency   Standard Error
----------------------------------------------------------------------------
            1       100.0000       100.0000       100.0000           0.6667
Output 75.3.2. Tire Study, Design Evaluation

        Canonical Correlations Between the Factors
   There are 0 Canonical Correlations Greater Than 0.316

           Brand    Price    Life    Hazard
Brand          1        0       0         0
Price          0        1       0         0
Life           0        0       1         0
Hazard         0        0       0         1

                 Summary of Frequencies
   There are 0 Canonical Correlations Greater Than 0.316

Frequencies
Brand           6 6 6
Price           6 6 6
Life            6 6 6
Hazard          9 9
Brand Price     2 2 2 2 2 2 2 2 2
Brand Life      2 2 2 2 2 2 2 2 2
Brand Hazard    3 3 3 3 3 3
Price Life      2 2 2 2 2 2 2 2 2
Price Hazard    3 3 3 3 3 3
Life Hazard     3 3 3 3 3 3
N-Way           1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Output 75.3.3. Tire Study, Design

Obs   Brand       Price    Life     Hazard
  1   Pirogi      $79.99   50,000   No
  2   Machismo    $79.99   70,000   No
  3   Machismo    $74.99   50,000   Yes
  4   Machismo    $74.99   50,000   No
  5   Goodstone   $74.99   70,000   Yes
  6   Pirogi      $69.99   70,000   Yes
  7   Goodstone   $69.99   50,000   Yes
  8   Machismo    $69.99   60,000   Yes
  9   Pirogi      $74.99   60,000   Yes
 10   Pirogi      $74.99   60,000   No
 11   Goodstone   $79.99   60,000   No
 12   Goodstone   $69.99   50,000   No
 13   Pirogi      $79.99   50,000   Yes
 14   Goodstone   $74.99   70,000   No
 15   Machismo    $69.99   60,000   No
 16   Machismo    $79.99   70,000   Yes
 17   Pirogi      $69.99   70,000   No
 18   Goodstone   $79.99   60,000   Yes
Next, the questionnaires are printed and given to the subjects, who are asked to rate the tires. The following statements produce Output 75.3.4. This output is abbreviated; the statements produce stimuli for all combinations.
data _null_;
   title;
   set sasuser.TireDesign;
   file print;
   if mod(_n_, 4) eq 1 then do;
      put _page_;
      put +55 'Subject ________';
      end;
   length hazardstring $ 7.;
   if put(hazard, hazardf3.) = 'Yes'
      then hazardstring = 'with';
      else hazardstring = 'without';
   s = 3 + (_n_ >= 10);
   put // _n_ +(-1) ') For your next tire purchase, '
          'how likely are you to buy this product?' //
       +s Brand 'brand tires at ' Price +(-1) ',' /
       +s 'with a ' Life 'tread life guarantee, ' /
       +s 'and ' hazardstring 'road hazard insurance.' //
       +s 'Definitely Would                         Definitely Would' /
       +s 'Not Purchase                             Purchase' //
       +s '1      2      3      4      5      6      7      8      9 ';
run;
Output 75.3.4. Conjoint Analysis, Stimuli Descriptions

                                                       Subject ________

1) For your next tire purchase, how likely are you to buy this product?

   Pirogi brand tires at $79.99,
   with a 50,000 tread life guarantee,
   and without road hazard insurance.

   Definitely Would                         Definitely Would
   Not Purchase                             Purchase

   1      2      3      4      5      6      7      8      9

2) For your next tire purchase, how likely are you to buy this product?

   Machismo brand tires at $79.99,
   with a 70,000 tread life guarantee,
   and without road hazard insurance.

   Definitely Would                         Definitely Would
   Not Purchase                             Purchase

   1      2      3      4      5      6      7      8      9

3) For your next tire purchase, how likely are you to buy this product?

   Machismo brand tires at $74.99,
   with a 50,000 tread life guarantee,
   and with road hazard insurance.

   Definitely Would                         Definitely Would
   Not Purchase                             Purchase

   1      2      3      4      5      6      7      8      9

4) For your next tire purchase, how likely are you to buy this product?

   Machismo brand tires at $74.99,
   with a 50,000 tread life guarantee,
   and without road hazard insurance.

   Definitely Would                         Definitely Would
   Not Purchase                             Purchase

   1      2      3      4      5      6      7      8      9
The third part of the example performs the conjoint analyses. The DATA step reads the data. Only the ratings are entered, one row per subject. Real conjoint studies have many more subjects than five. The TRANSPOSE procedure transposes this (5 x 18) data set into an (18 x 5) data set that can be merged with the factor-level data set sasuser.TireDesign. The next DATA step does the merge. The PRINT procedure displays the input data set. PROC TRANSREG fits the five individual conjoint models, one for each subject. The UTILITIES a-option displays the conjoint analysis results. The SHORT a-option suppresses the iteration histories, OUTTEST=Utils creates an output data set with all of the conjoint results, and the SEPARATORS= option requests that the labels constructed for each category contain two blanks between the variable name and the level value. The ODS SELECT statement is used to limit the displayed output. The MODEL statement specifies IDENTITY for the ratings, which specifies a metric conjoint analysis; the ratings are not transformed. The variables Brand, Price, Life, and Hazard are designated as CLASS variables, and the part-worth utilities are constrained to sum to zero within each factor. The following statements produce Output 75.3.5:

title 'Tire Study, Data Entry, Preprocessing';

data Results;
   input (c1-c18) (1.);
   datalines;
233279766526376493
124467885349168274
262189456534275794
184396375364187754
133379775526267493
;

*---Create an Object by Subject Data Matrix---;
proc transpose data=Results out=Results(drop=_name_) prefix=Subj;
run;

*---Merge the Factor Levels With the Data Matrix---;
data Both;
   merge sasuser.TireDesign Results;
run;

*---Print Input Data Set---;
proc print;
   title2 'Data Set for Conjoint Analysis';
run;

*---Fit Each Subject Individually---;
proc transreg data=Both utilities short outtest=Utils separators='  ';
   ods select TestsNote FitStatistics Utilities;
   title2 'Individual Conjoint Analyses';
   model identity(Subj1-Subj5) =
         class(Brand Price Life Hazard / zero=sum);
run;

The output contains two tables per subject, one with overall fit statistics and one with the conjoint analysis results.
Output 75.3.5. Conjoint Analysis

              Tire Study, Data Entry, Preprocessing
                Data Set for Conjoint Analysis

Obs  Brand      Price    Life    Hazard  Subj1  Subj2  Subj3  Subj4  Subj5
  1  Pirogi     $79.99   50,000  No        2      1      2      1      1
  2  Machismo   $79.99   70,000  No        3      2      6      8      3
  3  Machismo   $74.99   50,000  Yes       3      4      2      4      3
  4  Machismo   $74.99   50,000  No        2      4      1      3      3
  5  Goodstone  $74.99   70,000  Yes       7      6      8      9      7
  6  Pirogi     $69.99   70,000  Yes       9      7      9      6      9
  7  Goodstone  $69.99   50,000  Yes       7      8      4      3      7
  8  Machismo   $69.99   60,000  Yes       6      8      5      7      7
  9  Pirogi     $74.99   60,000  Yes       6      5      6      5      5
 10  Pirogi     $74.99   60,000  No        5      3      5      3      5
 11  Goodstone  $79.99   60,000  No        2      4      3      6      2
 12  Goodstone  $69.99   50,000  No        6      9      4      4      6
 13  Pirogi     $79.99   50,000  Yes       3      1      2      1      2
 14  Goodstone  $74.99   70,000  No        7      6      7      8      6
 15  Machismo   $69.99   60,000  No        6      8      5      7      7
 16  Machismo   $79.99   70,000  Yes       4      2      7      7      4
 17  Pirogi     $69.99   70,000  No        9      7      9      5      9
 18  Goodstone  $79.99   60,000  Yes       3      4      4      4      3
Output 75.3.5. (continued)

              Tire Study, Data Entry, Preprocessing
                 Individual Conjoint Analyses
                   The TRANSREG Procedure

            Hypothesis Tests for Identity(Subj1)

Root MSE          0.44721    R-Square    0.9783
Dependent Mean    5.00000    Adj R-Sq    0.9630
Coeff Var         8.94427

        Utilities Table Based on the Usual Degrees of Freedom

                              Standard   Importance
Label             Utility     Error      (% Utility Range)   Variable
Intercept          5.0000     0.10541                        Intercept
Brand Goodstone    0.3333     0.14907    17.857              Class.BrandGoodstone
Brand Pirogi       0.6667     0.14907                        Class.BrandPirogi
Brand Machismo    -1.0000     0.14907                        Class.BrandMachismo
Price $69.99       2.1667     0.14907    46.429              Class.Price_69_99
Price $74.99       0.0000     0.14907                        Class.Price_74_99
Price $79.99      -2.1667     0.14907                        Class.Price_79_99
Life 50,000       -1.1667     0.14907    28.571              Class.Life50_000
Life 60,000       -0.3333     0.14907                        Class.Life60_000
Life 70,000        1.5000     0.14907                        Class.Life70_000
Hazard Yes         0.3333     0.10541     7.143              Class.HazardYes
Hazard No         -0.3333     0.10541                        Class.HazardNo
Output 75.3.5. (continued)

              Tire Study, Data Entry, Preprocessing
                 Individual Conjoint Analyses
                   The TRANSREG Procedure

            Hypothesis Tests for Identity(Subj2)

Root MSE          0.50553    R-Square    0.9770
Dependent Mean    4.94444    Adj R-Sq    0.9608
Coeff Var        10.22410

        Utilities Table Based on the Usual Degrees of Freedom

                              Standard   Importance
Label             Utility     Error      (% Utility Range)   Variable
Intercept          4.9444     0.11915                        Intercept
Brand Goodstone    1.2222     0.16851    25.161              Class.BrandGoodstone
Brand Pirogi      -0.9444     0.16851                        Class.BrandPirogi
Brand Machismo    -0.2778     0.16851                        Class.BrandMachismo
Price $69.99       2.8889     0.16851    63.871              Class.Price_69_99
Price $74.99      -0.2778     0.16851                        Class.Price_74_99
Price $79.99      -2.6111     0.16851                        Class.Price_79_99
Life 50,000       -0.4444     0.16851     9.677              Class.Life50_000
Life 60,000        0.3889     0.16851                        Class.Life60_000
Life 70,000        0.0556     0.16851                        Class.Life70_000
Hazard Yes         0.0556     0.11915     1.290              Class.HazardYes
Hazard No         -0.0556     0.11915                        Class.HazardNo
Output 75.3.5. (continued)

              Tire Study, Data Entry, Preprocessing
                 Individual Conjoint Analyses
                   The TRANSREG Procedure

            Hypothesis Tests for Identity(Subj3)

Root MSE          0.50553    R-Square    0.9747
Dependent Mean    4.94444    Adj R-Sq    0.9570
Coeff Var        10.22410

        Utilities Table Based on the Usual Degrees of Freedom

                              Standard   Importance
Label             Utility     Error      (% Utility Range)   Variable
Intercept          4.9444     0.11915                        Intercept
Brand Goodstone    0.0556     0.16851    13.125              Class.BrandGoodstone
Brand Pirogi       0.5556     0.16851                        Class.BrandPirogi
Brand Machismo    -0.6111     0.16851                        Class.BrandMachismo
Price $69.99       1.0556     0.16851    22.500              Class.Price_69_99
Price $74.99      -0.1111     0.16851                        Class.Price_74_99
Price $79.99      -0.9444     0.16851                        Class.Price_79_99
Life 50,000       -2.4444     0.16851    58.125              Class.Life50_000
Life 60,000       -0.2778     0.16851                        Class.Life60_000
Life 70,000        2.7222     0.16851                        Class.Life70_000
Hazard Yes         0.2778     0.11915     6.250              Class.HazardYes
Hazard No         -0.2778     0.11915                        Class.HazardNo
Output 75.3.5. (continued)

              Tire Study, Data Entry, Preprocessing
                 Individual Conjoint Analyses
                   The TRANSREG Procedure

            Hypothesis Tests for Identity(Subj4)

Root MSE          0.92496    R-Square    0.9099
Dependent Mean    5.05556    Adj R-Sq    0.8468
Coeff Var        18.29596

        Utilities Table Based on the Usual Degrees of Freedom

                              Standard   Importance
Label             Utility     Error      (% Utility Range)   Variable
Intercept          5.0556     0.21802                        Intercept
Brand Goodstone    0.6111     0.30832    31.469              Class.BrandGoodstone
Brand Pirogi      -1.5556     0.30832                        Class.BrandPirogi
Brand Machismo     0.9444     0.30832                        Class.BrandMachismo
Price $69.99       0.2778     0.30832    10.490              Class.Price_69_99
Price $74.99       0.2778     0.30832                        Class.Price_74_99
Price $79.99      -0.5556     0.30832                        Class.Price_79_99
Life 50,000       -2.3889     0.30832    56.643              Class.Life50_000
Life 60,000        0.2778     0.30832                        Class.Life60_000
Life 70,000        2.1111     0.30832                        Class.Life70_000
Hazard Yes         0.0556     0.21802     1.399              Class.HazardYes
Hazard No         -0.0556     0.21802                        Class.HazardNo
Output 75.3.5. (continued)

              Tire Study, Data Entry, Preprocessing
                 Individual Conjoint Analyses
                   The TRANSREG Procedure

            Hypothesis Tests for Identity(Subj5)

Root MSE          0.34960    R-Square    0.9879
Dependent Mean    4.94444    Adj R-Sq    0.9794
Coeff Var         7.07062

        Utilities Table Based on the Usual Degrees of Freedom

                              Standard   Importance
Label             Utility     Error      (% Utility Range)   Variable
Intercept          4.9444     0.08240                        Intercept
Brand Goodstone    0.2222     0.11653     7.500              Class.BrandGoodstone
Brand Pirogi       0.2222     0.11653                        Class.BrandPirogi
Brand Machismo    -0.4444     0.11653                        Class.BrandMachismo
Price $69.99       2.5556     0.11653    56.250              Class.Price_69_99
Price $74.99      -0.1111     0.11653                        Class.Price_74_99
Price $79.99      -2.4444     0.11653                        Class.Price_79_99
Life 50,000       -1.2778     0.11653    30.000              Class.Life50_000
Life 60,000       -0.1111     0.11653                        Class.Life60_000
Life 70,000        1.3889     0.11653                        Class.Life70_000
Hazard Yes         0.2778     0.08240     6.250              Class.HazardYes
Hazard No         -0.2778     0.08240                        Class.HazardNo
The following statements summarize the results. Three tables are displayed: all of the importance values, the average importance, and the part-worth utilities. The first DATA step selects the importance information from the Utils data set. The final assignment statement stores just the variable name from the label, relying on the fact that the separator is two blanks. PROC TRANSPOSE creates the data set of importances, one row per subject, and PROC PRINT displays the results. The MEANS procedure displays the average importance of each attribute across the subjects. The next DATA step selects the part-worth utilities information from the Utils data set. PROC TRANSPOSE creates the data set of utilities, one row per subject, and PROC PRINT displays the results.

*---Gather the Importance Values---;
data Importance;
   set Utils(keep=_depvar_ Importance Label);
   if n(Importance);
   label = substr(label, 1, index(label, '  '));
run;

proc transpose out=Importance2(drop=_:);
   by _depvar_;
   id Label;
run;

proc print;
   title2 'Importance Values';
run;

proc means;
   title2 'Average Importance';
run;

*---Gather the Part-Worth Utilities---;
data Utilities;
   set Utils(keep=_depvar_ Coefficient Label);
   if n(Coefficient);
run;

proc transpose out=Utilities2(drop=_:);
   by _depvar_;
   id Label;
   idlabel Label;
run;

proc print label;
   title2 'Utilities';
run;
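The importance of an attribute is its part-worth utility range expressed as a percentage of the sum of all the utility ranges. The following sketch (added here for illustration, not part of the original example; the part-worth values are copied from the Subj1 table in Output 75.3.5) reproduces the Brand importance for the first subject:

data _null_;
   /* part-worth ranges for Subj1: maximum utility minus minimum utility */
   brand  = 0.6667 - (-1.0000);
   price  = 2.1667 - (-2.1667);
   life   = 1.5000 - (-1.1667);
   hazard = 0.3333 - (-0.3333);
   importance = 100 * brand / (brand + price + life + hazard);
   put 'Brand importance for Subj1 (percent): ' importance 6.3;
run;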
Output 75.3.6. Summary of Conjoint Analysis Results

              Tire Study, Data Entry, Preprocessing
                      Importance Values

Obs      Brand      Price       Life     Hazard
  1    17.8571    46.4286    28.5714    7.14286
  2    25.1613    63.8710     9.6774    1.29032
  3    13.1250    22.5000    58.1250    6.25000
  4    31.4685    10.4895    56.6434    1.39860
  5     7.5000    56.2500    30.0000    6.25000
Output 75.3.6. (continued)

              Tire Study, Data Entry, Preprocessing
                      Average Importance

                      The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
-----------------------------------------------------------------------
Brand       5    19.0223929     9.5065111     7.5000000    31.4685315
Price       5    39.9078099    22.6510962    10.4895105    63.8709677
Life        5    36.6034409    20.6028215     9.6774194    58.1250000
Hazard      5     4.4663562     2.8733577     1.2903226     7.1428571
-----------------------------------------------------------------------
Output 75.3.6. (continued)

              Tire Study, Data Entry, Preprocessing
                          Utilities

               Brand       Brand     Brand       Price     Price
Obs Intercept  Goodstone   Pirogi    Machismo    $69.99    $74.99
 1   5.00000    0.33333     0.66667  -1.00000    2.16667    0.00000
 2   4.94444    1.22222    -0.94444  -0.27778    2.88889   -0.27778
 3   4.94444    0.05556     0.55556  -0.61111    1.05556   -0.11111
 4   5.05556    0.61111    -1.55556   0.94444    0.27778    0.27778
 5   4.94444    0.22222     0.22222  -0.44444    2.55556   -0.11111

      Price      Life       Life      Life       Hazard    Hazard
Obs   $79.99     50,000     60,000    70,000     Yes       No
 1   -2.16667   -1.16667   -0.33333   1.50000    0.33333   -0.33333
 2   -2.61111   -0.44444    0.38889   0.05556    0.05556   -0.05556
 3   -0.94444   -2.44444   -0.27778   2.72222    0.27778   -0.27778
 4   -0.55556   -2.38889    0.27778   2.11111    0.05556   -0.05556
 5   -2.44444   -1.27778   -0.11111   1.38889    0.27778   -0.27778
Based on the importance values, price is the most important attribute for some of the respondents, while expected tread life is most important for others. On average, price is most important, followed closely by expected tread life; brand and road hazard insurance are less important. Goodstone is the most preferred brand for some of the respondents, and Pirogi for others. All respondents preferred a lower price over a higher price, a longer tread life, and road hazard insurance.
Example 75.4. Transformation Regression of Exhaust Emissions Data

In this example, the MORALS algorithm is applied to data from an experiment in which nitrogen oxide emissions from a single cylinder engine are measured for various combinations of fuel, compression ratio, and equivalence ratio. The data are provided by Brinkman (1981).

The equivalence ratio and nitrogen oxide variables are continuous and numeric, so spline transformations of these variables are requested. Each spline is degree three with nine knots (one at each decile) in order to allow PROC TRANSREG a great deal of freedom in finding transformations. The compression ratio variable has only five discrete values, so an optimal scoring is requested. The character variable Fuel is nominal, so it is designated as a classification variable. No monotonicity constraints are placed on any of the transformations. Observations with missing values are excluded with the NOMISS a-option.

The squared multiple correlation for the initial model is less than 0.25. PROC TRANSREG increases the R-square to over 0.95 by transforming the variables. The transformation plots show how each variable is transformed. The transformation of compression ratio (TCpRatio) is nearly linear. The transformation of equivalence ratio (TEqRatio) is nearly parabolic. It can be seen from this plot that the optimal transformation of equivalence ratio is nearly uncorrelated with the original scoring, which suggests that the large increase in R-square is due to this transformation. The transformation of nitrogen oxide (TNOx) is something like a log transformation.

These results suggest the parametric model

\[
\log(\mathrm{NOx}) = b_0 + b_1\,\mathrm{EqRatio} + b_2\,\mathrm{EqRatio}^2
                   + b_3\,\mathrm{CpRatio}
                   + \sum_j b_j\,\mathrm{class}_j(\mathrm{Fuel}) + \mathrm{error}.
\]
You can perform this analysis with PROC TRANSREG by using the following MODEL statement:

model log(NOx)= psp(EqRatio / deg=2) identity(CpRatio)
      class(Fuel / zero=first);

The LOG transformation computes the natural log. The PSPLINE expansion expands EqRatio into a linear term, EqRatio, and a squared term, EqRatio squared. A linear transformation of CpRatio and a dummy variable expansion of Fuel is requested with the first level as the reference level. These should provide a good parametric operationalization of the optimal transformations. The final model has an R-square of 0.91 (smaller than before since the model uses fewer degrees of freedom, but still quite good). The following statements produce Output 75.4.1 through Output 75.4.2:

title 'Gasoline Example';

data Gas;
   input Fuel :$8. CpRatio EqRatio NOx @@;
   label Fuel    = 'Fuel'
         CpRatio = 'Compression Ratio (CR)'
         EqRatio = 'Equivalence Ratio (PHI)'
         NOx     = 'Nitrogen Oxide (NOx)';
   datalines;
Ethanol  12.0 0.907 3.741  Ethanol  12.0 0.761 2.295
Ethanol  12.0 1.108 1.498  Ethanol  12.0 1.016 2.881
Ethanol  12.0 1.189 0.760  Ethanol   9.0 1.001 3.120
Ethanol   9.0 1.231 0.638  Ethanol   9.0 1.123 1.170
Ethanol  12.0 1.042 2.358  Ethanol  12.0 1.215 0.606
Ethanol  12.0 0.930 3.669  Ethanol  12.0 1.152 1.000
Ethanol  15.0 1.138 0.981  Ethanol  18.0 0.601 1.192
Ethanol   7.5 0.696 0.926  Ethanol  12.0 0.686 1.590
Ethanol  12.0 1.072 1.806  Ethanol  15.0 1.074 1.962
Ethanol  15.0 0.934 4.028  Ethanol   9.0 0.808 3.148
Ethanol   9.0 1.071 1.836  Ethanol   7.5 1.009 2.845
Ethanol   7.5 1.142 1.013  Ethanol  18.0 1.229 0.414
Ethanol  18.0 1.175 0.812  Ethanol  15.0 0.568 0.374
Ethanol  15.0 0.977 3.623  Ethanol   7.5 0.767 1.869
Ethanol   7.5 1.006 2.836  Ethanol   9.0 0.893 3.567
Ethanol  15.0 1.152 0.866  Ethanol  15.0 0.693 1.369
Ethanol  15.0 1.232 0.542  Ethanol  15.0 1.036 2.739
Ethanol  15.0 1.125 1.200  Ethanol   9.0 1.081 1.719
Ethanol   9.0 0.868 3.423  Ethanol   7.5 0.762 1.634
Ethanol   7.5 1.144 1.021  Ethanol   7.5 1.045 2.157
Ethanol  18.0 0.797 3.361  Ethanol  18.0 1.115 1.390
Ethanol  18.0 1.070 1.947  Ethanol  18.0 1.219 0.962
Ethanol   9.0 0.637 0.571  Ethanol   9.0 0.733 2.219
Ethanol   9.0 0.715 1.419  Ethanol   9.0 0.872 3.519
Ethanol   7.5 0.765 1.732  Ethanol   7.5 0.878 3.206
Ethanol   7.5 0.811 2.471  Ethanol  15.0 0.676 1.777
Ethanol  18.0 1.045 2.571  Ethanol  18.0 0.968 3.952
Ethanol  15.0 0.846 3.931  Ethanol  15.0 0.684 1.587
Ethanol   7.5 0.729 1.397  Ethanol   7.5 0.911 3.536
Ethanol   7.5 0.808 2.202  Ethanol   7.5 1.168 0.756
Indolene  7.5 0.831 4.818  Indolene  7.5 1.045 2.849
Indolene  7.5 1.021 3.275  Indolene  7.5 0.970 4.691
Indolene  7.5 0.825 4.255  Indolene  7.5 0.891 5.064
Indolene  7.5 0.710 2.118  Indolene  7.5 0.801 4.602
Indolene  7.5 1.074 2.286  Indolene  7.5 1.148 0.970
Indolene  7.5 1.000 3.965  Indolene  7.5 0.928 5.344
Indolene  7.5 0.767 3.834  Ethanol   7.5 0.749 1.620
Ethanol   7.5 0.892 3.656  Ethanol   7.5 1.002 2.964
82rongas  7.5 0.873 6.021  82rongas  7.5 0.987 4.467
82rongas  7.5 1.030 3.046  82rongas  7.5 1.101 1.596
82rongas  7.5 1.173 0.835  82rongas  7.5 0.931 5.498
82rongas  7.5 0.822 5.470  82rongas  7.5 0.749 4.084
82rongas  7.5 0.625 0.716  94%Eth    7.5 0.818 2.382
94%Eth    7.5 1.128 1.004  94%Eth    7.5 1.191 0.623
94%Eth    7.5 1.132 1.030  94%Eth    7.5 0.993 2.593
94%Eth    7.5 0.866 2.699  94%Eth    7.5 0.910 3.177
94%Eth   12.0 1.139 1.151  94%Eth   12.0 1.267 0.474
94%Eth   12.0 1.017 2.814  94%Eth   12.0 0.954 3.308
94%Eth   12.0 0.861 3.031  94%Eth   12.0 1.034 2.537
94%Eth   12.0 0.781 2.403  94%Eth   12.0 1.058 2.412
94%Eth   12.0 0.884 2.452  94%Eth   12.0 0.766 1.857
94%Eth    7.5 1.193 0.657  94%Eth    7.5 0.885 2.969
94%Eth    7.5 0.915 2.670  Ethanol  18.0 0.812 3.760
Ethanol  18.0 1.230 0.672  Ethanol  18.0 0.804 3.677
Ethanol  18.0 0.712  .     Ethanol  12.0 0.813 3.517
Ethanol  12.0 1.002 3.290  Ethanol   9.0 0.696 1.139
Ethanol   9.0 1.199 0.727  Ethanol   9.0 1.030 2.581
Ethanol  15.0 0.602 0.923  Ethanol  15.0 0.694 1.527
Ethanol  15.0 0.816 3.388  Ethanol  15.0 0.896  .
Ethanol  15.0 1.037 2.085  Ethanol  15.0 1.181 0.966
Ethanol   7.5 0.899 3.488  Ethanol   7.5 1.227 0.754
Indolene  7.5 0.701 1.990  Indolene  7.5 0.807 5.199
Indolene  7.5 0.902 5.283  Indolene  7.5 0.997 3.752
Indolene  7.5 1.224 0.537  Indolene  7.5 1.089 1.640
Ethanol   9.0 1.180 0.797  Ethanol   7.5 0.795 2.064
Ethanol  18.0 0.990 3.732  Ethanol  18.0 1.201 0.586
Methanol  7.5 0.975 2.941  Methanol  7.5 1.089 1.467
Methanol  7.5 1.150 0.934  Methanol  7.5 1.212 0.722
Methanol  7.5 0.859 2.397  Methanol  7.5 0.751 1.461
Methanol  7.5 0.720 1.235  Methanol  7.5 1.090 1.347
Methanol  7.5 0.616 0.344  Gasohol   7.5 0.712 2.209
Gasohol   7.5 0.771 4.497  Gasohol   7.5 0.959 4.958
Gasohol   7.5 1.042 2.723  Gasohol   7.5 1.125 1.244
Gasohol   7.5 1.097 1.562  Gasohol   7.5 0.984 4.468
Gasohol   7.5 0.928 5.307  Gasohol   7.5 0.889 5.425
Gasohol   7.5 0.827 5.330  Gasohol   7.5 0.674 1.448
Gasohol   7.5 1.031 3.164  Methanol  7.5 0.871 3.113
Methanol  7.5 1.026 2.551  Methanol  7.5 0.598 0.204
Indolene  7.5 0.973 5.055  Indolene  7.5 0.980 4.937
Indolene  7.5 0.665 1.561  Ethanol   7.5 0.629 0.561
Ethanol   9.0 0.608 0.563  Ethanol  12.0 0.584 0.678
Ethanol  15.0 0.562 0.370  Ethanol  18.0 0.535 0.530
94%Eth    7.5 0.674 0.900  Gasohol   7.5 0.645 1.207
Ethanol  18.0 0.655 1.900  94%Eth    7.5 1.022 2.787
94%Eth    7.5 0.790 2.645  94%Eth    7.5 0.720 1.475
94%Eth    7.5 1.075 2.147
;

*---Fit the Nonparametric Model---;
proc transreg data=Gas dummy test nomiss;
   model spline(NOx / nknots=9)=spline(EqRatio / nknots=9)
         opscore(CpRatio) class(Fuel / zero=first);
   title2 'Iteratively Estimate NOx, CPRATIO and EQRATIO';
   output out=Results;
run;

*---Plot the Results---;
goptions goutmode=replace nodisplay;
%let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
* Depending on your goptions, these plot options may work better:
* %let opts = haxis=axis2 vaxis=axis1 frame;

proc gplot data=Results;
   title;
   axis1 minor=none label=(angle=90 rotate=0);
   axis2 minor=none;
   symbol1 color=blue v=dot i=none;
   plot TCpRatio*CpRatio / &opts name='tregex1';
   plot TEqRatio*EqRatio / &opts name='tregex2';
   plot TNOx*NOx         / &opts name='tregex3';
run; quit;
goptions display;

proc greplay nofs tc=sashelp.templt template=l2r2;
   igout gseg;
   treplay 1:tregex1 2:tregex3 3:tregex2;
run; quit;

*-Fit the Parametric Model Suggested by the Nonparametric Analysis-;
proc transreg data=Gas dummy ss2 short nomiss;
   title 'Gasoline Example';
   title2 'Now fit log(NOx) = b0 + b1*EqRatio + b2*EqRatio**2 +';
   title3 'b3*CpRatio + Sum b(j)*Fuel(j) + Error';
   model log(NOx)= pspline(EqRatio / deg=2) identity(CpRatio)
         class(Fuel / zero=first);
   output out=Results2;
run;
Output 75.4.1. Transformation Regression Example: The Nonparametric Model

                        Gasoline Example
           Iteratively Estimate NOx, CPRATIO and EQRATIO
                     The TRANSREG Procedure

Dependent Variable Spline(NOx) (Nitrogen Oxide (NOx))

                    Class Level Information

Class   Levels   Values
Fuel         6   82rongas 94%Eth Ethanol Gasohol Indolene Methanol

Number of Observations Read   171
Number of Observations Used   169

      TRANSREG MORALS Algorithm Iteration History for Spline(NOx)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        0    0.48074    3.86778     0.24597
        1    0.00000    0.00000     0.95865     0.71267    Converged

Algorithm converged.

         Hypothesis Tests for Spline(NOx) (Nitrogen Oxide (NOx))

      Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF   Sum of Squares   Mean Square   F Value   Liberal p
Model              21         326.0946      15.52831    162.27   >= <.0001
Error             147          14.0674       0.09570
Corrected Total   168         340.1619

The above statistics are not adjusted for the fact that the dependent
variable was transformed and so are generally liberal.

Root MSE          0.30935    R-Square    0.9586
Dependent Mean    2.34593    Adj R-Sq    0.9527
Coeff Var        13.18661

Output 75.4.1. (continued)

 Adjusted Multivariate ANOVA Table Based on the Usual Degrees of Freedom
      Dependent Variable Scoring Parameters=12   S=12   M=4   N=67

Statistic                    Value    F Value   Num DF   Den DF           p
Wilks' Lambda             0.041355       2.05      252     1455   <= <.0001
Pillai's Trace            0.958645       0.61      252     1764   <=  1.0000
Hotelling-Lawley Trace    23.18089      12.35      252   945.01   <= <.0001
Roy's Greatest Root       23.18089     162.27       21      147   >= <.0001

The Wilks' Lambda, Pillai's Trace, and Hotelling-Lawley Trace statistics are a conservative adjustment of the normal statistics. Roy's Greatest Root is liberal. These statistics are normally defined in terms of the squared canonical correlations, which are the eigenvalues of the matrix H*inv(H+E). Here the R-Square is used for the first eigenvalue, and all other eigenvalues are set to zero since only one linear combination is used. Degrees of freedom are computed assuming all linear combinations contribute to the Lambda and Trace statistics, so the F tests for those statistics are conservative. The p values for the liberal and conservative statistics provide approximate lower and upper bounds on p. A liberal test statistic with conservative degrees of freedom and a conservative test statistic with liberal degrees of freedom yield at best an approximate p value, which is indicated by a "~" before the p value.
Output 75.4.2. Plots of Compression Ratio, Equivalence Ratio, and Nitrogen Oxide (transformation plots; not reproduced here)
Output 75.4.3. Transformation Regression Example: The Parametric Model

                        Gasoline Example
        Now fit log(NOx) = b0 + b1*EqRatio + b2*EqRatio**2 +
               b3*CpRatio + Sum b(j)*Fuel(j) + Error
                     The TRANSREG Procedure

Dependent Variable Log(NOx) (Nitrogen Oxide (NOx))

                    Class Level Information

Class   Levels   Values
Fuel         6   82rongas 94%Eth Ethanol Gasohol Indolene Methanol

Number of Observations Read   171
Number of Observations Used   169

Log(NOx)
Algorithm converged.

          Hypothesis Tests for Log(NOx) (Nitrogen Oxide (NOx))

      Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               8         79.33838      9.917298    213.09   <.0001
Error             160          7.44659      0.046541
Corrected Total   168         86.78498

Root MSE          0.21573    R-Square    0.9142
Dependent Mean    0.63130    Adj R-Sq    0.9099
Coeff Var        34.17294

      Univariate Regression Table Based on the Usual Degrees of Freedom

                                        Type II
                                         Sum of      Mean
Variable            DF  Coefficient     Squares    Square   F Value  Pr > F  Label
Intercept            1   -14.586532     49.9469   49.9469   1073.18  <.0001  Intercept
Pspline.EqRatio_1    1    35.102914     62.7478   62.7478   1348.22  <.0001  Equivalence Ratio (PHI) 1
Pspline.EqRatio_2    1   -19.386468     64.6430   64.6430   1388.94  <.0001  Equivalence Ratio (PHI) 2
Identity(CpRatio)    1     0.032058      1.4445    1.4445     31.04  <.0001  Compression Ratio (CR)
Class.Fuel94_Eth     1    -0.449583      1.3158    1.3158     28.27  <.0001  Fuel 94%Eth
Class.FuelEthanol    1    -0.414242      1.2560    1.2560     26.99  <.0001  Fuel Ethanol
Class.FuelGasohol    1    -0.016719      0.0015    0.0015      0.03  0.8584  Fuel Gasohol
Class.FuelIndolene   1     0.001572      0.0000    0.0000      0.00  0.9853  Fuel Indolene
Class.FuelMethanol   1    -0.580133      1.7219    1.7219     37.00  <.0001  Fuel Methanol
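As a quick arithmetic check on the parabolic shape (this observation is added here for clarity; it is not part of the original output), the fitted quadratic in EqRatio attains its maximum at

\[
\mathrm{EqRatio}^{*} = -\frac{b_1}{2\,b_2}
                     = -\frac{35.102914}{2 \times (-19.386468)} \approx 0.905,
\]

near the middle of the observed equivalence-ratio range, which matches the nearly parabolic transformation of EqRatio noted earlier.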
Example 75.5. Preference Mapping of Cars Data

This example uses PROC TRANSREG to perform a preference mapping (PREFMAP) analysis (Carroll 1972) of car preference data after a PROC PRINQUAL principal component analysis. The PREFMAP analysis is a response surface regression that locates ideal points for each dependent variable in a space defined by the independent variables.

The data are ratings obtained from 25 judges of their preference for each of 17 automobiles. The ratings were made on a zero (very weak preference) to nine (very strong preference) scale. These judgments were made in 1980 about that year's products. There are two character variables that indicate the manufacturer and model of each automobile. The data set also contains three ratings: miles per gallon (MPG), projected reliability (Reliability), and quality of the ride (Ride). These ratings are on a one (bad) to five (good) scale.

PROC PRINQUAL creates an OUT= data set containing standardized principal component scores (Prin1 and Prin2), along with the ID variables MODEL, MPG, Reliability, and Ride.

The first PROC TRANSREG step fits univariate regression models for MPG and Reliability. All variables are designated IDENTITY. A vector drawn in the plot of Prin1 and Prin2 from the origin to the point defined by an attribute's regression coefficients approximately shows how the cars differ on that attribute. Refer to Carroll (1972) for more information. The Prin1 and Prin2 columns of the TResult1 OUT= data set contain the car coordinates (_TYPE_='SCORE' observations) and the endpoints of the MPG and Reliability vectors (_TYPE_='M COEFFI' observations).

The second PROC TRANSREG step fits a univariate regression model with Ride designated IDENTITY, and Prin1 and Prin2 designated POINT. The POINT expansion creates an additional independent variable, _ISSQ_, which contains the sum of Prin1 squared and Prin2 squared. The OUT= data set TResult2 contains no _TYPE_='SCORE' observations, only ideal point (_TYPE_='M POINT') coordinates for Ride. The coordinates of both the vectors and the ideal points are output by specifying COORDINATES in the OUTPUT statement in PROC TRANSREG.

A vector model is used for MPG and Reliability because perfectly efficient and reliable cars do not exist in the data set, so the ideal points for MPG and Reliability are far removed from the plot of the cars. It is more likely that an ideal point for quality of the ride is in the plot, so an ideal point model is used for the ride variable. Refer to Carroll (1972) and Schiffman, Reynolds, and Young (1981) for discussions of the vector model and point models (including the EPOINT and QPOINT versions of the point model that are not used in this example).

The final DATA step combines the two output data sets and creates a data set suitable for the %PLOTIT macro. (For information on the %PLOTIT macro, see Appendix B, "Using the %PLOTIT Macro.") The plot contains one point per car and one point for each of the three ratings. The %PLOTIT macro options specify the input data set, how to handle anti-ideal points (described later), and where to draw horizontal and vertical reference lines. The DATATYPE= option specifies that the input data set contains results of a PREFMAP vector model and a PREFMAP ideal point model, which instructs the macro to draw vectors to _TYPE_='M COEFFI' observations and circles around _TYPE_='M POINT' observations.

An unreliable-to-reliable direction extends from the left and slightly below the origin to the right and slightly above the origin; the Japanese and European cars are rated, on average, as more reliable. A low-MPG to good-MPG direction extends from the top left of the plot to the bottom right; the smaller cars, on average, get better gas mileage. The ideal point for Ride is in the top of the plot, just right of center. Cars near the Ride ideal point tend to have a better ride than cars far away. It can be seen from the iteration history tables that none of these ratings perfectly fits the model, so all of the interpretations are approximate.

The Ride point is a "negative-negative" ideal point. The point models assume that small ratings mean the object (car) is similar to the rating name and that large ratings imply dissimilarity to the rating name. Because the opposite scoring is used, the interpretation of the Ride point must be reversed to a negative ideal point (bad ride). However, the coefficient for the _ISSQ_ variable is negative, so the interpretation is reversed again, back to the original interpretation. Anti-ideal points are taken care of in the %PLOTIT macro: specify ANTIIDEA=1 when large values are positive or ideal and ANTIIDEA=-1 when small values are positive or ideal.

The following statements produce Output 75.5.1 through Output 75.5.2:

title 'Preference Ratings for Automobiles Manufactured in 1980';

data CarPreferences;
   input Make $ 1-10 Model $ 12-22 @25 (Judge1-Judge25) (1.)
         MPG Reliability Ride;
   datalines;
Cadillac   Eldorado     8007990491240508971093809 3 2 4
Chevrolet  Chevette     0051200423451043003515698 5 3 2
Chevrolet  Citation     4053305814161643544747795 4 1 5
Chevrolet  Malibu       6027400723121345545668658 3 3 4
Ford       Fairmont     2024006715021443530648655 3 3 4
Ford       Mustang      5007197705021101850657555 3 2 2
Ford       Pinto        0021000303030201500514078 4 1 1
Honda      Accord       5956897609699952998975078 5 5 3
Honda      Civic        4836709507488852567765075 5 5 3
Lincoln    Continental  7008990592230409962091909 2 4 5
Plymouth   Gran Fury    7006000434101107333458708 2 1 5
Plymouth   Horizon      3005005635461302444675655 4 3 3
Plymouth   Volare       4005003614021602754476555 2 1 3
Pontiac    Firebird     0107895613201206958265907 1 1 5
Volkswagen Dasher       4858696508877795377895000 5 3 4
Volkswagen Rabbit       4858509709695795487885000 5 4 3
Volvo      DL           9989998909999987989919000 4 5 5
;

*---Compute Coordinates for a 2-Dimensional Scatter Plot of Cars---;
proc prinqual data=CarPreferences out=PResults(drop=Judge1-Judge25)
              n=2 replace standard scores;
   id Model MPG Reliability Ride;
   transform identity(Judge1-Judge25);
   title2 'Multidimensional Preference (MDPREF) Analysis';
run;

*---Compute Endpoints for MPG and Reliability Vectors---;
proc transreg data=PResults;
   model identity(MPG Reliability)=identity(Prin1 Prin2);
   output tstandard=center coordinates replace out=TResult1;
   id Model;
   title2 'Preference Mapping (PREFMAP) Analysis';
run;

*---Compute Ride Ideal Point Coordinates---;
proc transreg data=PResults;
   model identity(Ride)=point(Prin1 Prin2);
   output tstandard=center coordinates replace noscores out=TResult2;
   id Model;
run;

proc print;
run;

*---Combine Data Sets and Plot the Results---;
data plot;
   title3 'Plot of Automobiles and Ratings';
   set Tresult1 Tresult2;
run;

%plotit(data=plot, datatype=vector ideal, antiidea=1, href=0, vref=0);
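The ideal point coordinates reported for Ride can be read as the stationary point of the fitted quadratic response surface. The following identity is added here as a clarifying sketch (standard response-surface algebra, not a quotation from the procedure's documentation): if the fitted model is

\[
\widehat{\mathrm{Ride}} = b_0 + b_1\,\mathrm{Prin1} + b_2\,\mathrm{Prin2}
                        + q\,(\mathrm{Prin1}^2 + \mathrm{Prin2}^2),
\]

where \(q\) is the coefficient of _ISSQ_, then the stationary point is at

\[
\left( -\frac{b_1}{2q},\; -\frac{b_2}{2q} \right),
\]

a peak when \(q < 0\) and a pit when \(q > 0\). This is why the sign of the _ISSQ_ coefficient enters the interpretation of the Ride point discussed above.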
Output 75.5.1. Preference Ratings Example Output

        Preference Ratings for Automobiles Manufactured in 1980
            Multidimensional Preference (MDPREF) Analysis
                     The PRINQUAL Procedure

            PRINQUAL MTV Algorithm Iteration History

Iteration    Average    Maximum    Proportion     Criterion
   Number     Change     Change    of Variance      Change    Note
--------------------------------------------------------------------
        1    0.00000    0.00000        0.66946               Converged

Algorithm converged.
WARNING: The number of observations is less than or equal to the number
         of variables.
WARNING: Multiple optimal solutions may exist.

Output 75.5.1. (continued)

        Preference Ratings for Automobiles Manufactured in 1980
              Preference Mapping (PREFMAP) Analysis
                     The TRANSREG Procedure

   TRANSREG Univariate Algorithm Iteration History for Identity(MPG)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.57197               Converged

Algorithm converged.

   TRANSREG Univariate Algorithm Iteration History for Identity(Reliability)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.50859               Converged

Algorithm converged.

   TRANSREG Univariate Algorithm Iteration History for Identity(Ride)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.37797               Converged

Algorithm converged.

Output 75.5.1. (continued)

        Preference Ratings for Automobiles Manufactured in 1980
              Preference Mapping (PREFMAP) Analysis

Obs  _TYPE_   _NAME_  Ride  Intercept    Prin1      Prin2     _ISSQ_   Model
  1  M POINT  Ride      .        .      0.49461    2.46539   -0.17448  Ride

Output 75.5.2. Preference Ratings for Automobiles Manufactured in 1980 (the %PLOTIT display of cars, attribute vectors, and the Ride ideal point; not reproduced here)
Example 75.6. Box Cox

This example illustrates finding a Box-Cox transformation (see the "Box-Cox Transformations" section on page 4595) of some artificial data. Data were generated from the model

\[ y = e^{x + \varepsilon}, \qquad \varepsilon \sim N(0, 1) \]

The transformed data can be fit with a linear model

\[ \log(y) = x + \varepsilon \]

These statements produce Output 75.6.1:

title 'Basic Box-Cox Example';

data x;
   do x = 1 to 8 by 0.025;
      y = exp(x + normal(7));
      output;
   end;
run;

proc transreg data=x ss2 details;
   title2 'Defaults';
   model boxcox(y) = identity(x);
run;
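For reference, the family of transformations being searched is the Box-Cox family (Box and Cox 1964); the formula is added here for clarity:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \ne 0 \\[1ex]
\log y, & \lambda = 0
\end{cases}
\]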
Output 75.6.1. Basic Box-Cox Example, Default Output

                   Basic Box-Cox Example
                         Defaults
                  The TRANSREG Procedure

        Transformation Information for BoxCox(y)

Lambda      R-Square    Log Like
 -3.00          0.03    -4601.01
 -2.75          0.04    -4266.08
 -2.50          0.04    -3934.11
 -2.25          0.05    -3605.75
 -2.00          0.06    -3281.88
 -1.75          0.07    -2963.74
 -1.50          0.10    -2653.14
 -1.25          0.14    -2352.72
 -1.00          0.21    -2066.32
 -0.75          0.34    -1799.25
 -0.50          0.52    -1558.55
 -0.25          0.71    -1360.28
  0.00 +        0.79    -1275.31   <
  0.25          0.70    -1382.62
  0.50          0.51    -1589.03
  0.75          0.34    -1834.53
  1.00          0.22    -2105.88
  1.25          0.15    -2397.35
  1.50          0.11    -2704.64
  1.75          0.08    -3024.24
  2.00          0.06    -3353.38
  2.25          0.05    -3689.91
  2.50          0.04    -4032.18
  2.75          0.03    -4378.97
  3.00          0.03    -4729.37

< - Best Lambda
* - Confidence Interval
+ - Convenient Lambda
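The confidence interval flagged in the table is the usual profile-likelihood interval for the power parameter; the formula is added here for clarity (following Box and Cox 1964, not quoted from the procedure's output):

\[
\left\{ \lambda : \log L(\lambda) \;\ge\; \log L(\hat\lambda)
        - \tfrac{1}{2}\,\chi^2_{1,\,1-\alpha} \right\}
\]

For \(\alpha = 0.05\), \(\tfrac{1}{2}\chi^2_{1,0.95} \approx 1.92\), which is consistent with the CI Limit of -1277.2 (= -1275.3 - 1.9) reported in the details table in Output 75.6.2.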
PROC TRANSREG correctly selects the log transformation λ = 0, with a narrow confidence interval. The maximum of the log likelihood function is flagged with the less-than sign (<), and the convenient power parameter of λ = 0 in the confidence interval is flagged by the plus sign (+). The rest of the output is shown next in Output 75.6.2.
Output 75.6.2. Basic Box-Cox Example, Default Output

                   Basic Box-Cox Example
                         Defaults
                  The TRANSREG Procedure

Dependent Variable BoxCox(y)

Number of Observations Read   281
Number of Observations Used   281

     TRANSREG Univariate Algorithm Iteration History for BoxCox(y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.79064               Converged

Algorithm converged.

            Model Statement Specification Details

Type   DF   Variable      Description       Value
Dep     1   BoxCox(y)     Lambda Used       0
                          Lambda            0
                          Log Likelihood    -1275.3
                          Conv. Lambda      0
                          Conv. Lambda LL   -1275.3
                          CI Limit          -1277.2
                          Alpha             0.05
Ind     1   Identity(x)   DF                1

                Hypothesis Tests for BoxCox(y)

      Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF   Sum of Squares   Mean Square   F Value   Liberal p
Model               1         1145.884      1145.884   1053.66   >= <.0001
Error             279          303.421         1.088
Corrected Total   280         1449.305

The above statistics are not adjusted for the fact that the dependent
variable was transformed and so are generally liberal.

Root MSE          1.04285    R-Square    0.7906
Dependent Mean    4.49653    Adj R-Sq    0.7899
Coeff Var        23.19225    Lambda      0.0000

   Univariate Regression Table Based on the Usual Degrees of Freedom

                                   Type II Sum      Mean
Variable      DF   Coefficient      of Squares    Square   F Value   Liberal p
Intercept      1    0.01551366            0.01      0.01      0.01   >= 0.9185
Identity(x)    1    0.99578183         1145.88   1145.88   1053.66   >= <.0001

The above statistics are not adjusted for the fact that the dependent
variable was transformed and so are generally liberal.
This next example uses several options. The LAMBDA= option specifies power parameters sparsely from -2 to -0.5 and from 0.5 to 2, just to get the general shape of the log likelihood function in that region; between -0.5 and 0.5, more power parameters are tried. The CONVENIENT option is specified so that, if a power parameter like lambda = 1 or lambda = 0 is found in the confidence interval, it is used instead of the optimal power parameter. PARAMETER=2 is specified to add 2 to each y before performing the transformations. ALPHA=0.00001 specifies a wide confidence interval. These statements produce Output 75.6.3:

proc transreg data=x ss2 details;
   title2 'Several Options Demonstrated';
   model boxcox(y / lambda=-2 -1 -0.5 to 0.5 by 0.05 1 2
                convenient parameter=2 alpha=0.00001) = identity(x);
run;
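Since PARAMETER=2 adds 2 to each y before the transformation is applied, the family being searched here can be sketched as the shifted Box-Cox family (with c denoting the PARAMETER= value; this formula is added for clarity and is an interpretation of the option as described above, not quoted from the procedure's documentation):

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + c)^{\lambda} - 1}{\lambda}, & \lambda \ne 0 \\[1ex]
\log(y + c), & \lambda = 0
\end{cases}
\qquad c = 2
\]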
Output 75.6.3. Basic Box-Cox Example, Several Options Demonstrated

                   Basic Box-Cox Example
               Several Options Demonstrated
                  The TRANSREG Procedure

        Transformation Information for BoxCox(y)

Lambda      R-Square    Log Like
-2.000          0.22    -2583.73
-1.000          0.45    -1779.35
-0.500          0.67    -1439.82
-0.450          0.70    -1410.51
-0.400          0.72    -1382.74
-0.350          0.74    -1356.92
-0.300          0.76    -1333.59
-0.250          0.77    -1313.42
-0.200          0.79    -1297.21
-0.150          0.79    -1285.83   *
-0.100          0.80    -1280.09   <
-0.050          0.80    -1280.63   *
 0.000 +        0.79    -1287.71   *
 0.050          0.78    -1301.19
 0.100          0.76    -1320.56
 0.150          0.74    -1345.09
 0.200          0.72    -1373.99
 0.250          0.69    -1406.51
 0.300          0.65    -1442.02
 0.350          0.62    -1480.02
 0.400          0.58    -1520.13
 0.450          0.54    -1562.05
 0.500          0.50    -1605.57
 1.000          0.22    -2105.88
 2.000          0.06    -3320.36

< - Best Lambda
* - Confidence Interval
+ - Convenient Lambda
The results show that the optimal power parameter is -0.1 but 0 is in the confidence interval, hence a log transformation is chosen. The rest of the output is shown next in Output 75.6.4.
Output 75.6.4. Basic Box-Cox Example, Several Options Demonstrated

                   Basic Box-Cox Example
               Several Options Demonstrated
                  The TRANSREG Procedure

Dependent Variable BoxCox(y)

Number of Observations Read   281
Number of Observations Used   281

     TRANSREG Univariate Algorithm Iteration History for BoxCox(y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.79238               Converged

Algorithm converged.

            Model Statement Specification Details

Type   DF   Variable      Description       Value
Dep     1   BoxCox(y)     Lambda Used       0
                          Lambda            -0.1
                          Log Likelihood    -1280.1
                          Conv. Lambda      0
                          Conv. Lambda LL   -1287.7
                          CI Limit          -1289.9
                          Alpha             0.00001
                          Parameter         2
                          Options           Convenient Lambda Used
Ind     1   Identity(x)   DF                1

Output 75.6.4. (continued)

                Hypothesis Tests for BoxCox(y)

      Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF   Sum of Squares   Mean Square   F Value   Liberal p
Model               1          999.438      999.4381   1064.82   >= <.0001
Error             279          261.868        0.9386
Corrected Total   280         1261.306

The above statistics are not adjusted for the fact that the dependent
variable was transformed and so are generally liberal.

Root MSE          0.96881    R-Square    0.7924
Dependent Mean    4.61429    Adj R-Sq    0.7916
Coeff Var        20.99591    Lambda      0.0000

   Univariate Regression Table Based on the Usual Degrees of Freedom

                                   Type II Sum      Mean
Variable      DF   Coefficient      of Squares    Square   F Value   Liberal p
Intercept      1    0.42939328           8.746     8.746      9.32   >= 0.0025
Identity(x)    1    0.92997620         999.438   999.438   1064.82   >= <.0001

The above statistics are not adjusted for the fact that the dependent
variable was transformed and so are generally liberal.
The next part of this example shows how to make graphical displays of the Box-Cox transformation results. Plots include the log likelihood function with the confidence interval, the root mean squared error as a function of the power parameter, R-square as a function of the power parameter, the Box-Cox transformation of the variable y, the original scatter plot based on the untransformed data, and the new scatter plot based on the transformed data. Also, a condensed version of the log likelihood table with the confidence interval is printed. Here are the data:

title h=1 'Box-Cox Graphical Displays';

data x;
   input y x @@;
   datalines;
 10.0 3.0   72.6 8.3   59.7 8.1   20.1 4.8    90.1 9.8    1.1 0.9
 78.2 8.5   87.4 9.0    9.5 3.4    0.1 1.4     0.1 1.1   42.5 5.1
 57.0 7.5    9.9 1.9    0.5 1.0  121.1 9.9    37.5 5.9   49.5 6.7
  8.3 1.8    0.6 1.8   53.0 6.7  112.8 10.0   40.7 6.4    5.1 2.4
 73.3 9.5  122.4 9.9   87.2 9.4  121.2 9.9    23.1 4.3    7.1 3.5
 12.4 3.3    5.6 2.7  113.0 9.6  110.5 10.0    3.1 1.5   52.4 7.9
 80.4 8.1    0.6 1.6  115.1 9.1   15.9 3.1    56.5 7.3   85.4 9.8
 32.5 5.8   43.0 6.2    0.1 0.8   21.8 5.2    15.2 3.5    5.2 3.0
  0.2 0.8   73.5 8.2    4.9 3.2    0.2 0.3    69.0 9.2    3.6 3.5
  0.2 0.9  101.3 9.9   10.0 3.7   16.9 3.0    11.2 5.0    0.2 0.4
 80.8 9.4   24.9 5.7  113.5 9.7    6.2 2.1    12.5 3.2    4.8 1.8
 80.1 8.3   26.4 4.8   13.4 3.8   99.8 9.7    44.1 6.2   15.3 3.8
  2.2 1.5   10.3 2.7   13.8 4.7   38.6 4.5    79.1 9.8   33.6 5.8
  9.1 4.5   89.3 9.1    5.5 2.6   20.0 4.8     2.9 2.9   82.9 8.4
  7.0 3.5   14.5 2.9   16.0 3.7   29.3 6.1    48.9 6.3    1.6 1.9
 34.7 6.2   33.5 6.5   26.0 5.6   12.7 3.1     0.1 0.3   15.4 4.2
  2.6 1.8   58.6 7.9   81.2 8.1   37.2 6.9
;
The TRANSREG procedure is run to find the Box-Cox transformation. The lambda list is -2 TO 2 BY 0.01, which produces 401 lambdas. This many power parameters makes a nice graphical display with plenty of detail around the confidence interval. However, 401 values is a lot to print, so the usual Box-Cox transformation information table is excluded from the printed output. Instead, it is output to a SAS data set by using ODS so that a sample of it can be printed. Just the confidence interval and the rows corresponding to power parameters that are multiples of 0.5 are printed. Null labels are provided for the columns that need to be printed without headers. The details table is also output to a SAS data set by using ODS, since it contains information that will be incorporated into some of the plots. These statements produce Output 75.6.5:

* Fit Box-Cox model, output results to output data sets;
ods output boxcox=b details=d;
ods exclude boxcox;

proc transreg details data=x;
   model boxcox(y / convenient lambda=-2 to 2 by 0.01) = identity(x);
   output out=trans;
run;

proc print noobs label data=b(drop=rmse);
   title2 'Confidence Interval';
   where ci ne ' ' or abs(lambda - round(lambda, 0.5)) < 1e-6;
   label convenient = '00'x ci = '00'x;
run;
Output 75.6.5. Box-Cox Graphical Displays

                 Box-Cox Graphical Displays
                   The TRANSREG Procedure

     TRANSREG Univariate Algorithm Iteration History for BoxCox(y)

Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square      Change    Note
------------------------------------------------------------------------
        1    0.00000    0.00000     0.95396               Converged

Algorithm converged.

            Model Statement Specification Details

Type   DF   Variable      Description       Value
Dep     1   BoxCox(y)     Lambda Used       0.5
                          Lambda            0.46
                          Log Likelihood    -167.0
                          Conv. Lambda      0.5
                          Conv. Lambda LL   -168.3
                          CI Limit          -169.0
                          Alpha             0.05
                          Options           Convenient Lambda Used
Ind     1   Identity(x)   DF                1

Output 75.6.5. (continued)

                    Confidence Interval

Dependent     Lambda    R-Square    Log Like
BoxCox(y)      -2.00        0.14    -1030.56
BoxCox(y)      -1.50        0.17     -810.50
BoxCox(y)      -1.00        0.22     -602.53
BoxCox(y)      -0.50        0.39     -415.56
BoxCox(y)       0.00        0.78     -257.92
BoxCox(y)       0.41        0.95     -168.40   *
BoxCox(y)       0.42        0.95     -167.86   *
BoxCox(y)       0.43        0.95     -167.46   *
BoxCox(y)       0.44        0.95     -167.19   *
BoxCox(y)       0.45        0.95     -167.05   *
BoxCox(y)       0.46        0.95     -167.04   <
BoxCox(y)       0.47        0.95     -167.16   *
BoxCox(y)       0.48        0.95     -167.41   *
BoxCox(y)       0.49        0.95     -167.79   *
BoxCox(y)       0.50        0.95     -168.28   + *
BoxCox(y)       0.51        0.95     -168.89   *
BoxCox(y)       1.00        0.89     -253.09
BoxCox(y)       1.50        0.79     -345.35
BoxCox(y)       2.00        0.70     -435.01
These next steps extract information from the Box-Cox transformation and details tables and store the information in macro variables. The confidence interval limit from the details table provides a vertical axis reference line for the log likelihood plot. The convenient power parameter ('Lambda Used') is extracted from the footnote. The confidence interval is extracted from the confidence interval observations of the Box-Cox transformation table and will be used in the footnote and for horizontal axis reference lines in the log likelihood plot.

* Store values for reference lines;
data _null_;
   set d;
   if description = 'CI Limit'
      then call symput('vref', formattedvalue);
   if description = 'Lambda Used'
      then call symput('lambda', formattedvalue);
run;

data _null_;
   set b end=eof;
   where ci ne ' ';
   if _n_ = 1 then call symput('href1', compress(put(lambda, best12.)));
   if ci = '<' then call symput('href2', compress(put(lambda, best12.)));
   if eof    then call symput('href3', compress(put(lambda, best12.)));
run;
These steps plot the log likelihood, the root mean square error, and R-square. The input data set is the Box-Cox transformation table, which was output using ODS. These statements produce Output 75.6.6:

* Plot log likelihood, confidence interval;
axis1 label=(angle=90 rotate=0) minor=none;
axis2 minor=none;

proc gplot data=b;
   title2 'Log Likelihood';
   plot loglike * lambda / vref=&vref href=&href1 &href2 &href3
                           vaxis=axis1 haxis=axis2 frame cframe=ligr;
   footnote "Confidence Interval: &href1 - &href2 - &href3, "
            "Lambda = &lambda";
   symbol v=none i=spline c=blue;
   run;
   footnote;
   title2 'RMSE';
   plot rmse * lambda / vaxis=axis1 haxis=axis2 frame cframe=ligr;
   run;
   title2 'R-Square';
   plot rsquare * lambda / vaxis=axis1 haxis=axis2 frame cframe=ligr;
   axis1 order=(0 to 1 by 0.1) label=(angle=90 rotate=0) minor=none;
   run;
quit;
Output 75.6.6. Box-Cox Graphical Displays (plots of the log likelihood with the confidence interval, the root mean squared error, and R-square as functions of the power parameter; not reproduced here)
The optimal power parameter is 0.46, but since 0.5 is in the confidence interval, and since the CONVENIENT option was specified, the procedure chooses a square root transformation. The next steps plot the transformation of Y, the original scatter plot based on the untransformed data, and the new scatter plot based on the transformed data. The results are shown in Output 75.6.7. The input data set is the ordinary output data set from PROC TRANSREG. The transformation of the variable Y by default is Ty. axis1 label=(angle=90 rotate=0) minor=none; axis2 minor=none; proc gplot data=trans; title2 ’Transformation’; symbol i=splines v=star c=blue; plot ty * y / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; title2 ’Original Scatter Plot’; symbol i=none v=star c=blue; plot y * x / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; title2 ’Transformed Scatter Plot’; symbol i=none v=star c=blue; plot ty * x / vaxis=axis1 haxis=axis2 frame cframe=ligr; run; quit;
Output 75.6.7. Box-Cox Graphical Displays
The square root transformation makes the scatter plot essentially linear.
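One way to confirm this claim numerically (a minimal sketch, not part of the original example; the data set Check and the variable SqrtY are hypothetical names) is to compare the transformed variable Ty with the square root of y:

   data Check;
      set trans;           /* ordinary output data set from PROC TRANSREG */
      SqrtY = sqrt(y);     /* hypothetical comparison variable */
   run;

   proc corr data=Check;
      var ty SqrtY;        /* a correlation near 1 indicates Ty is
                              essentially sqrt(y) up to location and scale */
   run;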
Chapter 76
The TREE Procedure

Chapter Contents

OVERVIEW
GETTING STARTED
SYNTAX
   PROC TREE Statement
   BY Statement
   COPY Statement
   FREQ Statement
   HEIGHT Statement
   ID Statement
   NAME Statement
   PARENT Statement
DETAILS
   Missing Values
   Output Data Set
   Displayed Output
   ODS Table Names
EXAMPLES
   Example 76.1. Mammals' Teeth
   Example 76.2. Iris Data
REFERENCES
Chapter 76
The TREE Procedure

Overview

The TREE procedure produces a tree diagram, also known as a dendrogram or phenogram, using a data set created by the CLUSTER or VARCLUS procedure. The CLUSTER and VARCLUS procedures create output data sets that contain the results of hierarchical clustering as a tree structure. The TREE procedure uses the output data set to produce a diagram of the tree structure in the style of Johnson (1967), with the root at the top. Alternatively, the diagram can be oriented horizontally, with the root at the left. Any numeric variable in the output data set can be used to specify the heights of the clusters. PROC TREE can also create an output data set containing a variable to indicate the disjoint clusters at a specified level in the tree.

Tree diagrams are discussed in the context of cluster analysis by Duran and Odell (1974), Hartigan (1975), and Everitt (1980). Knuth (1973) provides a general treatment of tree diagrams in computer programming.

The literature on tree diagrams contains a mixture of botanical and genealogical terminology. The objects that are clustered are leaves. The cluster containing all objects is the root. A cluster containing at least two objects but not all of them is a branch. The general term for leaves, branches, and roots is node. If a cluster A is the union of clusters B and C, then A is the parent of B and C, and B and C are children of A. A leaf is thus a node with no children, and a root is a node with no parent. If every cluster has at most two children, the tree diagram is a binary tree. The CLUSTER procedure always produces binary trees. The VARCLUS procedure can produce tree diagrams with clusters that have many children.
Getting Started

The TREE procedure creates tree diagrams from a SAS data set containing the tree structure. You can create this type of data set with the CLUSTER or VARCLUS procedure.

In the following example, the VARCLUS procedure is used to divide a set of variables into hierarchical clusters and to create the SAS data set containing the tree structure. The TREE procedure then generates the tree diagrams.

The following data, from Hand et al. (1994), represent the amount of protein consumed from nine food groups for each of 25 European countries. The nine food groups are red meat (RedMeat), white meat (WhiteMeat), eggs (Eggs), milk (Milk), fish (Fish), cereal (Cereal), starch (Starch), nuts (Nuts), and fruits and vegetables (FruVeg). The following SAS statements create the data set Protein:
   data Protein;
      input Country $15. RedMeat WhiteMeat Eggs Milk
            Fish Cereal Starch Nuts FruVeg;
      datalines;
   Albania        10.1  1.4  0.5  8.9  0.2 42.3  0.6  5.5  1.7
   Austria         8.9 14.0  4.3 19.9  2.1 28.0  3.6  1.3  4.3
   Belgium        13.5  9.3  4.1 17.5  4.5 26.6  5.7  2.1  4.0
   Bulgaria        7.8  6.0  1.6  8.3  1.2 56.7  1.1  3.7  4.2
   Czechoslovakia  9.7 11.4  2.8 12.5  2.0 34.3  5.0  1.1  4.0
   Denmark        10.6 10.8  3.7 25.0  9.9 21.9  4.8  0.7  2.4
   E Germany       8.4 11.6  3.7 11.1  5.4 24.6  6.5  0.8  3.6
   Finland         9.5  4.9  2.7 33.7  5.8 26.3  5.1  1.0  1.4
   France         18.0  9.9  3.3 19.5  5.7 28.1  4.8  2.4  6.5
   Greece         10.2  3.0  2.8 17.6  5.9 41.7  2.2  7.8  6.5
   Hungary         5.3 12.4  2.9  9.7  0.3 40.1  4.0  5.4  4.2
   Ireland        13.9 10.0  4.7 25.8  2.2 24.0  6.2  1.6  2.9
   Italy           9.0  5.1  2.9 13.7  3.4 36.8  2.1  4.3  6.7
   Netherlands     9.5 13.6  3.6 23.4  2.5 22.4  4.2  1.8  3.7
   Norway          9.4  4.7  2.7 23.3  9.7 23.0  4.6  1.6  2.7
   Poland          6.9 10.2  2.7 19.3  3.0 36.1  5.9  2.0  6.6
   Portugal        6.2  3.7  1.1  4.9 14.2 27.0  5.9  4.7  7.9
   Romania         6.2  6.3  1.5 11.1  1.0 49.6  3.1  5.3  2.8
   Spain           7.1  3.4  3.1  8.6  7.0 29.2  5.7  5.9  7.2
   Sweden          9.9  7.8  3.5  4.7  7.5 19.5  3.7  1.4  2.0
   Switzerland    13.1 10.1  3.1 23.8  2.3 25.6  2.8  2.4  4.9
   UK             17.4  5.7  4.7 20.6  4.3 24.3  4.7  3.4  3.3
   USSR            9.3  4.6  2.1 16.6  3.0 43.6  6.4  3.4  2.9
   W Germany      11.4 12.5  4.1 18.8  3.4 18.6  5.2  1.5  3.8
   Yugoslavia      4.4  5.0  1.2  9.5  0.6 55.9  3.0  5.7  3.2
   ;
   run;
The data set Protein contains the character variable Country and the nine numeric variables representing the food groups. The $15. in the INPUT statement specifies that the variable Country is a character variable with a length of 15.

The following statements cluster the variables in the data set Protein. The OUTTREE= option creates an output SAS data set named Tree to contain the tree structure. The CENTROID option specifies the centroid clustering method, and the MAXCLUSTERS= option specifies that the largest number of clusters desired is four. The NOPRINT option suppresses the display of the output. The VAR statement specifies that all numeric variables (RedMeat--FruVeg) are used by the procedure.

   proc varclus data=Protein outtree=Tree centroid
                maxclusters=4 noprint;
      var RedMeat--FruVeg;
   run;
The output data set Tree, created by the OUTTREE= option in the previous statements, contains the following variables:
_NAME_       the name of the cluster
_PARENT_     the parent of the cluster
_NCL_        the number of clusters
_VAREXP_     the amount of variance explained by the cluster
_PROPOR_     the proportion of variance explained by the clusters at the
             current level of the tree diagram
_MINPRO_     the minimum proportion of variance explained by a cluster
_MAXEIGEN_   the maximum second eigenvalue of a cluster

The following statements produce a tree diagram of the clusters created by PROC VARCLUS:

   proc tree data=tree;
   run;

   proc tree data=tree lineprinter;
   run;
PROC TREE is invoked twice. In the first invocation, the tree diagram is presented using the default high resolution graphical output. In the second invocation, the LINEPRINTER option specifies line printer output. Figure 76.1 displays the default high resolution graphics version of the tree diagram.
Figure 76.1. High Resolution Tree Diagram from PROC TREE
Figure 76.2 displays the same information as Figure 76.1, using line printer output.

Figure 76.2. Line Printer Graphics Version of the Tree Diagram
(line printer dendrogram titled "Oblique Centroid Component Clustering," plotting Name of Variable or Cluster against Number of Clusters)
In both figures, the name of the cluster is displayed on the horizontal axis and the number of clusters is displayed on the vertical or height axis. As you look up from the bottom of the figures, clusters are progressively joined until a single, all-encompassing cluster is formed at the top (or root) of the diagram. Clusters exist at each level of the diagram. For example, at the level where the diagram indicates three clusters, the clusters are as follows:

• Cluster 1: RedMeat WhiteMeat Eggs Milk
• Cluster 2: Fish Starch
• Cluster 3: Cereal Nuts FruVeg
As you proceed up the diagram one level, the number of clusters is two. The clusters are

• Cluster 1: RedMeat WhiteMeat Eggs Milk Fish Starch
• Cluster 2: Cereal Nuts FruVeg

The following statements illustrate how you can specify the numeric variable defining the height of each node (cluster) in the tree. First, the AXIS1 statement is defined. The ORDER= option specifies the data values in the order in which they are to appear on the axis. Next, the TREE procedure is invoked. The HORIZONTAL option orients the tree diagram horizontally. The HAXIS= option specifies that the AXIS1 statement be used to customize the appearance of the horizontal axis. The HEIGHT statement specifies the variable _PROPOR_ (the proportion of variance explained) as the height variable.

   axis1 order=(0 to 1 by 0.2);

   proc tree data=Tree horizontal haxis=axis1;
      height _PROPOR_;
   run;
Figure 76.3. Horizontal Tree Diagram Using _PROPOR_ as the HEIGHT Variable
Figure 76.3 displays the tree diagram oriented horizontally, using the variable _PROPOR_ as the height variable. As you look from left to right in the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, represented by horizontal line segments. Each vertical line segment represents a point where leaves and branches are connected into progressively larger clusters. For example, three clusters are formed at the left-most point along the axis where three horizontal line segments exist. At that point, where a vertical line segment connects the Cereal-Nuts and FruVeg clusters, the proportion of variance explained is about 0.6 (_PROPOR_ = 0.6). At the next clustering level, the variables Fish and Starch are clustered with variables RedMeat through Milk, resulting in a total of two clusters. The proportion of variance explained is about 0.45 at that point.
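The partition at any level of Figure 76.3 can also be captured as data instead of read off the graph. A minimal sketch (Clusters is a hypothetical data set name; the options used here are described in the "Syntax" section below):

   proc tree data=Tree noprint out=Clusters nclusters=2;
      height _PROPOR_;
   run;

   proc print data=Clusters;
   run;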
Syntax

The TREE procedure is invoked by the following statements:

   PROC TREE < options > ;
      NAME variables ;
      HEIGHT variable ;
      PARENT variables ;
      BY variables ;
      COPY variables ;
      FREQ variable ;
      ID variable ;

If the input data set has been created by CLUSTER or VARCLUS, the only statement required is the PROC TREE statement. The BY, COPY, FREQ, HEIGHT, ID, NAME, and PARENT statements are described after the PROC TREE statement.
PROC TREE Statement

PROC TREE < options > ;

The PROC TREE statement starts the TREE procedure. The options that can appear in the PROC TREE statement are summarized in the following table.

Table 76.1. PROC TREE Statement Options

Task                       Options       Effect
Specify data sets          DATA=         specifies the input data set
                           DOCK=         does not count small clusters in OUT= data set
                           LEVEL=        defines disjoint clusters in OUT= data set
                           NCLUSTERS=    specifies the number of clusters in OUT= data set
                           OUT=          specifies the output data set
                           ROOT=         displays the root of a subtree

Specify cluster heights    HEIGHT=       specifies the variable for the height axis
                           DISSIMILAR    specifies that large values are far apart
                           SIMILAR       specifies that small values are close together

Display horizontal trees   HORIZONTAL    specifies that the height axis is horizontal

Control sort order         DESCENDING    reverses SORT order
                           SORT          sorts children by HEIGHT variable

Control displayed output   LIST          displays all nodes in the tree
                           NOPRINT       suppresses display of the tree
                           LINEPRINTER   displays tree using line printer style graphics

High resolution graphics   INC=          specifies the increment between tick values
                           MAXHEIGHT=    specifies the maximum value on axis
                           MINHEIGHT=    specifies the minimum value on axis
                           NTICK=        specifies the number of tick intervals
                           CFRAME=       specifies the color of the frame
                           DESCRIPTION=  specifies the catalog description
                           GOUT=         specifies the catalog name
                           HAXIS=        customizes horizontal axis
                           HORDISPLAY=   displays a horizontal tree with leaves on the right
                           HPAGES=       specifies the number of pages to expand tree horizontally
                           LINES=        specifies the line color and thickness, dots at the nodes
                           NAME=         specifies the name of graph in the catalog
                           VAXIS=        customizes vertical axis
                           VPAGES=       specifies the number of pages to expand tree vertically

Line printer graphics      INC=          specifies the increment between tick values
                           MAXHEIGHT=    specifies the maximum value on axis
                           MINHEIGHT=    specifies the minimum value on axis
                           NTICK=        specifies the number of tick intervals
                           PAGES=        specifies the number of pages
                           POS=          specifies the number of column positions
                           SPACES=       specifies the number of spaces between objects
                           TICKPOS=      specifies the number of column positions between ticks
                           FILLCHAR=     specifies the fill character between unjoined leaves
                           JOINCHAR=     specifies the character to display between joined leaves
                           LEAFCHAR=     specifies the character to represent clusters with no children
                           TREECHAR=     specifies the character to represent clusters with children
CFRAME=color
specifies a color for the frame, which is the rectangle bounded by the axes.

DATA=SAS-data-set
specifies the input data set defining the tree. If you omit the DATA= option, the most recently created SAS data set is used.

DESCENDING
DES
reverses the sorting order for the SORT option.

DESCRIPTION=entry-description
specifies a description for the graph in the GOUT= catalog. The default is "Proc Tree Graph Output."

DISSIMILAR
DIS
implies that the values of the HEIGHT variable are dissimilarities; that is, a large height value means that the clusters are very dissimilar or far apart. If neither the SIMILAR nor the DISSIMILAR option is specified, PROC TREE attempts to infer from the data whether the height values are similarities or dissimilarities. If PROC TREE cannot tell this from the data, it issues an error message and does not display a tree diagram.

DOCK=n
causes observations in the OUT= data set assigned to output clusters with a frequency of n or less to be given missing values for the output variables CLUSTER and CLUSNAME. If the NCLUSTERS= option is also specified, DOCK= also prevents clusters with a frequency of n or less from being counted toward the number of clusters requested by the NCLUSTERS= option. By default, DOCK=0.

FILLCHAR='c'
FC='c'
specifies the character to display between leaves that are not joined into a cluster. The character should be enclosed in single quotes. The default is a blank. The LINEPRINTER option must also be specified.

GOUT=
specifies the catalog in which the generated graph is stored. The default is WORK.GSEG.

HAXIS=AXISn
specifies the AXISn statement used to customize the appearance of the horizontal axis.

HEIGHT=name
H=name
specifies certain conventional variables to be used for the height axis of the tree diagram. For many situations, the only option you need is the HEIGHT= option. Valid values for name and their meanings are as follows:

HEIGHT | H
specifies the _HEIGHT_ variable.
LENGTH | L
defines the height of each node as its path length from the root. This can also be interpreted as the number of ancestors of the node.
MODE | M
specifies the _MODE_ variable.
NCL | N
specifies the _NCL_ (number of clusters) variable.
RSQ | R
specifies the _RSQ_ variable.
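For example, the following steps (a minimal sketch, assuming an OUTTREE= data set named tree from PROC CLUSTER or PROC VARCLUS) select two of these conventional height variables:

   proc tree data=tree height=n;       /* height axis is the _NCL_
                                          (number of clusters) variable */
   run;

   proc tree data=tree height=length;  /* height of each node is its
                                          path length from the root */
   run;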
See also the "HEIGHT Statement" section on page 4755, which can specify any variable in the input data set to be used for the height axis. In rare cases, you may need to specify either the DISSIMILAR option or the SIMILAR option.

HORDISPLAY=RIGHT
specifies that the graph is to be oriented horizontally, with the leaf nodes on the right side, when the HORIZONTAL option is also specified. By default, the leaf nodes are on the left side.

HORIZONTAL
HOR
orients the tree diagram with the height axis horizontal and the root at the left. The leaf nodes are on the side specified in the HORDISPLAY= option. If you do not specify the HORIZONTAL option, the height axis is vertical, with the root at the top. When the tree takes up more than one page and is viewed on a screen, horizontal orientation can make the tree diagram considerably easier to read.

HPAGES=n1
specifies that the original graph is to be enlarged to cover n1 pages. If you also specify the VPAGES=n2 option, the original graph is enlarged to cover n1 × n2 pages. For example, if HPAGES=2 and VPAGES=3, then the original graph is generated followed by 2 × 3 = 6 more graphs. In these six graphs, the original is enlarged by a factor of 2 in the horizontal direction and by a factor of 3 in the vertical direction. The graphs are generated in left-to-right and top-to-bottom order.

INC=n
specifies the increment between tick values on the height axis. If the HEIGHT variable is _NCL_, the default is usually 1, although a different value can be specified for consistency with other options. For any other HEIGHT variable, the default is some power of 10 times 1, 2, 2.5, or 5.
JOINCHAR='c'
JC='c'
specifies the character to display between leaves that are joined into a cluster. The character should be enclosed in single quotes. The default is X. The LINEPRINTER option must also be specified.

LEAFCHAR='c'
LC='c'
specifies a character to represent clusters having no children. The character should be enclosed in single quotes. The default is a period. The LINEPRINTER option must also be specified.

LEVEL=n
specifies the level of the tree defining disjoint clusters for the OUT= data set. The LEVEL= option also causes only clusters between the root and a height of n to be displayed. The clusters in the output data set are those that exist at a height of n on the tree diagram. For example, if the HEIGHT variable is _NCL_ (number of clusters) and LEVEL=5 is specified, then the OUT= data set contains five disjoint clusters. If the HEIGHT variable is _RSQ_ (R-square) and LEVEL=0.9 is specified, then the OUT= data set contains the smallest number of clusters that yields an R-square of at least 0.9.

LINEPRINTER
specifies that the generated report is to be displayed using line printer graphics.

LINES=( < options > )
enables you to specify both the color and the thickness of the lines. In addition, a dot can be drawn at each leaf node. Note that if the frame and the lines are specified to be the same color, PROC TREE selects a different color for the lines.

LIST
lists all the nodes in the tree, displaying the height, parent, and children of each node.

MAXHEIGHT=n
MAXH=n
specifies the maximum value displayed on the height axis.

MINHEIGHT=n
MINH=n
specifies the minimum value displayed on the height axis.

NAME=name
specifies the entry name for the generated graph in the GOUT= catalog. Note that each time another graph is generated with the same name, the name is modified by appending a number to make it unique.

NCLUSTERS=n
NCL=n
N=n
specifies the number of clusters desired in the OUT= data set. The number of clusters obtained may not equal the number specified if (1) there are fewer than n leaves in the tree, (2) there are more than n unconnected trees in the data set, (3) a multi-way tree does not contain a level with the specified number of clusters, or (4) the DOCK= option eliminates too many clusters. The NCLUSTERS= option uses the _NCL_ variable to determine the order in which the clusters are formed. If there is no _NCL_ variable, the height variable (as determined by the HEIGHT statement or HEIGHT= option) is used instead.

NTICK=n
specifies the number of tick intervals on the height axis. The default depends on the values of other options.

NOPRINT
suppresses the display of the tree. Specify the NOPRINT option if you want only to create an OUT= data set. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, "Using the Output Delivery System."

OUT=SAS-data-set
creates an output data set that contains one observation for each object in the tree or subtree being processed and variables called CLUSTER and CLUSNAME showing cluster membership at any specified level in the tree. If you specify the OUT= option, you must also specify either the NCLUSTERS= or LEVEL= option in order to define the output partition level. If you want to create a permanent SAS data set, you must specify a two-level name (refer to "SAS Data Files" in SAS Language Reference: Concepts).

PAGES=n
specifies the number of pages over which the tree diagram (from root to leaves) is to extend. The default is 1. The LINEPRINTER option must also be specified.

POS=n
specifies the number of column positions on the height axis. The default depends on the value of the PAGES= option, the orientation of the tree diagram, and the values specified by the PAGESIZE= and LINESIZE= options. The LINEPRINTER option must also be specified.

ROOT='name'
specifies the value of the NAME variable for the root of a subtree to be displayed if you do not want to display the entire tree. If you also specify the OUT= option, the output data set contains only objects belonging to the subtree specified by the ROOT= option.

SIMILAR
SIM
implies that the values of the HEIGHT variable are similarities; that is, a large height value means that the clusters are very similar or close together. If neither the SIMILAR nor the DISSIMILAR option is specified, PROC TREE attempts to infer from the data whether the height values are similarities or dissimilarities. If PROC TREE cannot tell this from the data, it issues an error message and does not display a tree diagram.
SORT
sorts the children of each node by the HEIGHT variable, in the order of cluster formation. See the DESCENDING option on page 4750.

SPACES=s
S=s
specifies the number of spaces between objects on the output. The default depends on the number of objects, the orientation of the tree diagram, and the values specified by the PAGESIZE= and LINESIZE= options. The LINEPRINTER option must also be specified.

TICKPOS=n
specifies the number of column positions per tick interval on the height axis. The default value is usually between 5 and 10, although a different value can be specified for consistency with other options.

TREECHAR='c'
TC='c'
specifies a character to represent clusters with children. The character should be enclosed in single quotes. The default is X. The LINEPRINTER option must also be specified.

VAXIS=AXISn
specifies that the AXISn statement be used to customize the appearance of the vertical axis.

VPAGES=n2
specifies that the original graph is to be enlarged to cover n2 pages. If you also specify the HPAGES=n1 option, the original graph is enlarged to cover n1 × n2 pages. For example, if HPAGES=2 and VPAGES=3, then the original graph is generated followed by 2 × 3 = 6 more graphs. In these six graphs, the original is enlarged by a factor of 2 in the horizontal direction and by a factor of 3 in the vertical direction. The graphs are generated in left-to-right and top-to-bottom order.
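Several of these options are commonly combined. The following sketch codes the LEVEL= example described above; it assumes an OUTTREE= data set named tree that contains an _RSQ_ variable, and part is a hypothetical output data set name:

   proc tree data=tree noprint out=part level=0.9;
      height _rsq_;     /* smallest number of clusters yielding an
                           R-square of at least 0.9 */
   run;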
BY Statement

BY variables ;

You can specify a BY statement with PROC TREE to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement.

• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the TREE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
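For instance, if the tree data set contained a grouping variable (here the variable Region, a hypothetical name), separate tree diagrams could be produced per group; a minimal sketch:

   proc sort data=tree out=treesrt;
      by Region;
   run;

   proc tree data=treesrt;
      by Region;      /* one tree diagram per value of Region */
   run;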
COPY Statement

COPY variables ;

The COPY statement specifies one or more character or numeric variables to be copied to the OUT= data set.
FREQ Statement

FREQ variable ;

The FREQ statement specifies one numeric variable that tells how many clustering observations belong to the cluster. If the FREQ statement is omitted, PROC TREE looks for a variable called _FREQ_ to specify the number of observations per cluster. If neither the FREQ statement nor the _FREQ_ variable is present, each leaf is assumed to represent one clustering observation, and the frequency for each internal node is found by summing the frequencies of its children.
HEIGHT Statement

HEIGHT variable ;

The HEIGHT statement specifies the name of a numeric variable to define the height of each node (cluster) in the tree. The height variable can also be specified by the HEIGHT= option in the PROC TREE statement. If both the HEIGHT statement and the HEIGHT= option are omitted, PROC TREE looks for a variable called _HEIGHT_. If the data set does not contain _HEIGHT_, PROC TREE looks for a variable called _NCL_. If _NCL_ is not found either, the height of each node is defined to be its path length from the root.
ID Statement

ID variable ;

The ID variable is used to identify the objects (leaves) in the tree on the output. The ID variable can be a character or numeric variable of any length. If the ID statement is omitted, the variable in the NAME statement is used instead. If both the ID and NAME statements are omitted, PROC TREE looks for a variable called _NAME_. If the _NAME_ variable is not found in the data set, PROC TREE issues an error message and stops. The ID variable is copied to the OUT= data set.
NAME Statement

NAME variables ;

The NAME statement specifies a character or numeric variable identifying the node represented by each observation. The NAME variable and the PARENT variable jointly define the tree structure. If the NAME statement is omitted, PROC TREE looks for a variable called _NAME_. If the _NAME_ variable is not found in the data set, PROC TREE issues an error message and stops.
PARENT Statement

PARENT variables ;

The PARENT statement specifies a character or numeric variable identifying the node in the tree that is the parent of each observation. The PARENT variable must have the same formatted length as the NAME variable. If the PARENT statement is omitted, PROC TREE looks for a variable called _PARENT_. If the _PARENT_ variable is not found in the data set, PROC TREE issues an error message and stops.
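The NAME and PARENT variables are all that is needed to define a tree by hand. The following sketch (not from the manual; all names are hypothetical) builds a five-node tree and displays it; the root is the observation whose parent value is missing:

   data small;
      input _name_ $ _parent_ $ _height_;
      if _parent_ = '.' then _parent_ = ' ';   /* blank parent marks the root */
      datalines;
   All  .    2
   AB   All  1
   A    AB   0
   B    AB   0
   C    All  0
   ;
   run;

   proc tree data=small;
   run;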
Details

Missing Values

An observation with a missing value for the NAME variable is omitted from processing. If the PARENT variable has a missing value but the NAME variable is present, the observation is treated as the root of a tree. A data set can contain several roots and, hence, several trees.

Missing values of the HEIGHT variable are set to upper or lower bounds determined from the nonmissing values under the assumption that the heights are monotonic with respect to the tree structure. Missing values of the FREQ variable are inferred from nonmissing values where possible; otherwise, they are treated as zero.
Output Data Set

The OUT= data set contains one observation for each leaf in the tree or subtree being processed. The variables are as follows:

• the BY variables, if any

• the ID variable, or the NAME variable if the ID statement is not used

• the COPY variables

• a numeric variable CLUSTER taking values from 1 to c, where c is the number of disjoint clusters. The cluster to which the first observation belongs is given the number 1, the cluster to which the next observation belongs that does not belong to cluster 1 is given the number 2, and so on.

• a character variable CLUSNAME giving the value of the NAME variable of the cluster to which the observation belongs

The CLUSTER and CLUSNAME variables are missing if the corresponding leaf has a nonpositive frequency.
Displayed Output

The displayed output from the TREE procedure includes the following:

• the names of the objects in the tree

• the height axis

• the tree diagram

A high-resolution graphics tree diagram is produced on the graphics device. The leaves are displayed at the bottom of the graph. Horizontal lines connect the leaves into branches, while the topmost horizontal line indicates the root.

If the LINEPRINTER option is specified, the root (the cluster containing all the objects) is indicated by a solid line of the character specified by the TREECHAR= option (the default character is 'X'). At each level of the tree, clusters are shown by unbroken lines of the TREECHAR= symbol with the FILLCHAR= symbol (the default is a blank) separating the clusters. The LEAFCHAR= symbol (the default character is a period) represents single-member clusters.

By default, the tree diagram is oriented with the height axis vertical and the object names at the top of the diagram. If the HORIZONTAL option is specified, then the height axis is horizontal and the object names are on the left.
ODS Table Names

PROC TREE assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 76.2. ODS Tables Produced in PROC TREE
ODS Table Name   Description                                     Statement   Option
Tree             Line-printer plot of the tree                   PROC        LINEPRINTER
TreeListing      Line-printer listing of all nodes in the tree   PROC        LIST
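Either table can be captured as a data set with the ODS OUTPUT statement; a minimal sketch (nodes is a hypothetical data set name):

   ods output TreeListing=nodes;   /* capture the LIST output */
   proc tree data=tree list;
   run;

   proc print data=nodes;
   run;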
Examples

Example 76.1. Mammals' Teeth

The following data give the numbers of different kinds of teeth for a variety of mammals. The mammals are clustered by average linkage using the CLUSTER procedure (Output 76.1.1). The PROC TREE statement uses the average-linkage distance as the height axis, which is the default, and creates a horizontal high-resolution graphics tree (Output 76.1.2).

   data teeth;
      title 'Mammals'' Teeth';
      input mammal $ 1-16 @21 (v1-v8) (1.);
      label v1='Right Top Incisors'
            v2='Right Bottom Incisors'
            v3='Right Top Canines'
            v4='Right Bottom Canines'
            v5='Right Top Premolars'
            v6='Right Bottom Premolars'
            v7='Right Top Molars'
            v8='Right Bottom Molars';
      datalines;
   Brown Bat           23113333
   Mole                32103333
   Silver Hair Bat     23112333
   Pigmy Bat           23112233
   House Bat           23111233
   Red Bat             13112233
   Pika                21002233
   Rabbit              21003233
   Beaver              11002133
   Groundhog           11002133
   Gray Squirrel       11001133
   House Mouse         11000033
   Porcupine           11001133
   Wolf                33114423
   Bear                33114423
   Raccoon             33114432
   Marten              33114412
   Weasel              33113312
   Wolverine           33114412
   Badger              33113312
   River Otter         33114312
   Sea Otter           32113312
   Jaguar              33113211
   Cougar              33113211
   Fur Seal            32114411
   Sea Lion            32114411
   Grey Seal           32113322
   Elephant Seal       21114411
   Reindeer            04103333
   Elk                 04103333
   Deer                04003333
   Moose               04003333
   ;
   run;

   options pagesize=60 linesize=110;

   proc cluster method=average std pseudo noeigen outtree=tree;
      id mammal;
      var v1-v8;
   run;

   proc tree graphics horizontal;
   run;
Output 76.1.1 displays the information on how the clusters are joined. For example, the cluster history shows that the observations Wolf and Bear form cluster 29, which is merged with Raccoon to form cluster 11.

Output 76.1.1. Output from PROC CLUSTER

                                Mammals' Teeth
                             The CLUSTER Procedure
                        Average Linkage Cluster Analysis

          The data have been standardized to mean 0 and variance 1
          Root-Mean-Square Total-Sample Standard Deviation =   1
          Root-Mean-Square Distance Between Observations   =   4

                                Cluster History
                                                                    Norm
   NCL  ---------Clusters Joined---------  FREQ    PSF   PST2    RMS Dist  Tie
    31  Beaver           Groundhog            2      .      .     0         T
    30  Gray Squirrel    Porcupine            2      .      .     0         T
    29  Wolf             Bear                 2      .      .     0         T
    28  Marten           Wolverine            2      .      .     0         T
    27  Weasel           Badger               2      .      .     0         T
    26  Jaguar           Cougar               2      .      .     0         T
    25  Fur Seal         Sea Lion             2      .      .     0         T
    24  Reindeer         Elk                  2      .      .     0         T
    23  Deer             Moose                2      .      .     0
    22  Pigmy Bat        Red Bat              2    281      .     0.2289
    21  CL28             River Otter          3    139      .     0.2292
    20  CL31             CL30                 4   83.2      .     0.2357    T
    19  Brown Bat        Silver Hair Bat      2   76.7      .     0.2357    T
    18  Pika             Rabbit               2   73.2      .     0.2357
    17  CL27             Sea Otter            3   67.4      .     0.2462
    16  CL22             House Bat            3   62.9    1.7     0.2859
    15  CL21             CL17                 6   47.4    6.8     0.3328
    14  CL25             Elephant Seal        3   45.0      .     0.3362
    13  CL19             CL16                 5   40.8    3.5     0.3672
    12  CL15             Grey Seal            7   38.9    2.8     0.4078
    11  CL29             Raccoon              3   38.0      .     0.423
    10  CL18             CL20                 6   34.5   10.3     0.4339
     9  CL12             CL26                 9   30.0    7.3     0.5071
     8  CL24             CL23                 4   28.7      .     0.5473
     7  CL9              CL14                12   25.7    7.0     0.5668
     6  CL10             House Mouse          7   28.3    4.1     0.5792
     5  CL11             CL7                 15   26.8    6.9     0.6621
     4  CL13             Mole                 6   31.9    7.2     0.7156
     3  CL4              CL8                 10   31.0   12.7     0.8799
     2  CL3              CL6                 17   27.8   16.1     1.0316
     1  CL2              CL5                 32      .   27.8     1.1938
Output 76.1.2. PROC TREE High-Resolution Graphics
As you look from left to right in the diagram in Output 76.1.2, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, and every vertical line connects leaves and branches into progressively larger clusters. For example, the five bats form a cluster at the 0.6 level, while the next cluster consists only of the mole. The observations Reindeer, Elk, Deer, and Moose form the next cluster at the 0.6 level, and the mammals Pika through House Mouse are in the fourth cluster. The observations Wolf, Bear, and Raccoon form the fifth cluster, while the last cluster contains the observations Marten through Elephant Seal.

The following statements create the same tree with line printer graphics in a vertical orientation; the tree is displayed in Output 76.1.3.

   proc tree lineprinter;
   run;
Output 76.1.3. PROC TREE with the LINEPRINTER Option
(line printer dendrogram titled "Average Linkage Cluster Analysis," plotting Name of Observation or Cluster on the horizontal axis against Average Distance Between Clusters on the vertical axis)
As you look up from the bottom of the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the top (or root) of the diagram. Clusters exist at each level of the diagram. For example, the unbroken line of Xs at the left-most side of the 0.6 level indicates that the five bats have formed a cluster. The next cluster is represented by a period because it contains only one mammal, Mole. Reindeer, Elk, Deer, and Moose form the next cluster, indicated by Xs again. The mammals Pika through House Mouse are in the fourth cluster. The observations Wolf, Bear, and Raccoon form the fifth cluster, while the last cluster contains the observations Marten through Elephant Seal.

The next statement sorts the clusters at each branch in order of formation and uses the number of clusters as the height axis. The resulting tree is displayed in Output 76.1.4.

   proc tree sort height=n horizontal;
   run;
Output 76.1.4. PROC TREE with SORT and HEIGHT= Options
Because the CLUSTER procedure always produces binary trees, the number of internal (root and branch) nodes in the tree is one less than the number of leaves. Therefore 31 clusters are formed from the 32 mammals in the input data set. These are represented by the 31 vertical line segments in the tree diagram, each at a different value along the horizontal axis. As you examine the tree from left to right, the first vertical line segment is where Beaver and Groundhog are clustered and the number of clusters is 31. The next cluster is formed from Gray Squirrel and Porcupine. The third contains Wolf and Bear. Note how the tree graphically displays the clustering order information that was presented in tabular form by the CLUSTER procedure in Output 76.1.1. The same clusters as in Output 76.1.2 and Output 76.1.3 can be seen at the six-cluster level of the tree diagram in Output 76.1.4, although the SORT and HEIGHT= options make them appear in a different order. The following statements create these six clusters and display them in Output 76.1.5. The PROC TREE statement produces no output but creates an output data set indicating the cluster to which each observation belongs at the six-cluster level in the tree.
Example 76.1. Mammals’ Teeth
   proc tree noprint out=part nclusters=6;
      id mammal;
      copy v1-v8;
   run;

   proc sort;
      by cluster;
   run;

   proc print label uniform;
      id mammal;
      var v1-v8;
      format v1-v8 1.;
      by cluster;
   run;

Output 76.1.5. PROC TREE OUT= Data Set

---------------------------------- CLUSTER=1 ----------------------------------

                  Right     Right     Right    Right    Right      Right      Right   Right
                  Top       Bottom    Top      Bottom   Top        Bottom     Top     Bottom
mammal            Incisors  Incisors  Canines  Canines  Premolars  Premolars  Molars  Molars

Beaver                1         1        0        0         2          1        3       3
Groundhog             1         1        0        0         2          1        3       3
Gray Squirrel         1         1        0        0         1          1        3       3
Porcupine             1         1        0        0         1          1        3       3
Pika                  2         1        0        0         2          2        3       3
Rabbit                2         1        0        0         3          2        3       3
House Mouse           1         1        0        0         0          0        3       3

---------------------------------- CLUSTER=2 ----------------------------------

                  Right     Right     Right    Right    Right      Right      Right   Right
                  Top       Bottom    Top      Bottom   Top        Bottom     Top     Bottom
mammal            Incisors  Incisors  Canines  Canines  Premolars  Premolars  Molars  Molars

Wolf                  3         3        1        1         4          4        2       3
Bear                  3         3        1        1         4          4        2       3
Raccoon               3         3        1        1         4          4        3       2

---------------------------------- CLUSTER=3 ----------------------------------

                  Right     Right     Right    Right    Right      Right      Right   Right
                  Top       Bottom    Top      Bottom   Top        Bottom     Top     Bottom
mammal            Incisors  Incisors  Canines  Canines  Premolars  Premolars  Molars  Molars

Marten                3         3        1        1         4          4        1       2
Wolverine             3         3        1        1         4          4        1       2
Weasel                3         3        1        1         3          3        1       2
Badger                3         3        1        1         3          3        1       2
Jaguar                3         3        1        1         3          2        1       1
Cougar                3         3        1        1         3          2        1       1
Fur Seal              3         2        1        1         4          4        1       1
Sea Lion              3         2        1        1         4          4        1       1
River Otter           3         3        1        1         4          3        1       2
Sea Otter             3         2        1        1         3          3        1       2
Elephant Seal         2         1        1        1         4          4        1       1
Grey Seal             3         2        1        1         3          3        2       2

---------------------------------- CLUSTER=4 ----------------------------------

                  Right     Right     Right    Right    Right      Right      Right   Right
                  Top       Bottom    Top      Bottom   Top        Bottom     Top     Bottom
mammal            Incisors  Incisors  Canines  Canines  Premolars  Premolars  Molars  Molars

Reindeer              0         4        1        0         3          3        3       3
Elk                   0         4        1        0         3          3        3       3
Deer                  0         4        0        0         3          3        3       3
Moose                 0         4        0        0         3          3        3       3

---------------------------------- CLUSTER=5 ----------------------------------

                  Right     Right     Right    Right    Right      Right      Right   Right
                  Top       Bottom    Top      Bottom   Top        Bottom     Top     Bottom
mammal            Incisors  Incisors  Canines  Canines  Premolars  Premolars  Molars  Molars

Pigmy Bat             2         3        1        1         2          2        3       3
Red Bat               1         3        1        1         2          2        3       3
Brown Bat             2         3        1        1         3          3        3       3
Silver Hair Bat       2         3        1        1         2          3        3       3
House Bat             2         3        1        1         1          2        3       3

---------------------------------- CLUSTER=6 ----------------------------------

                  Right     Right     Right    Right    Right      Right      Right   Right
                  Top       Bottom    Top      Bottom   Top        Bottom     Top     Bottom
mammal            Incisors  Incisors  Canines  Canines  Premolars  Premolars  Molars  Molars

Mole                  3         2        1        0         3          3        3       3
Example 76.2. Iris Data

Fisher's (1936) iris data gives sepal and petal dimensions for three different species of iris. The data are clustered by kth-nearest-neighbor density linkage using the CLUSTER procedure with K=8. Observations are identified by species (Setosa, Versicolor, or Virginica) in the tree diagram, which is oriented with the height axis horizontal. The following statements produce Output 76.2.1 and Output 76.2.2.

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;
   run;

   proc cluster data=iris method=twostage print=10
                outtree=tree k=8 noeigen;
      var SepalLength SepalWidth PetalLength PetalWidth;
      copy Species;
      id Species;
   run;

   options pagesize=60 linesize=110;

   proc tree data=tree horizontal lineprinter pages=1 maxh=10;
      id species;
   run;
The PAGES=1 option specifies that the tree diagram (from root to leaves) extends over one page. Since the HORIZONTAL option is also specified, the horizontal extent of the diagram is one page. The number of vertical pages required for the diagram is dictated by the number of leaves in the tree. The MAXH=10 option limits the values displayed on the height axis to a maximum of 10. This prunes the tree diagram so that only the portion from the leaves to level 10 is displayed. You can see this pruning effect in Output 76.2.2.

Output 76.2.1. Clustering of Fisher's Iris Data

                            Fisher (1936) Iris Data
                             The CLUSTER Procedure
                  Two-Stage Density Linkage Clustering, K = 8

         Root-Mean-Square Total-Sample Standard Deviation = 10.69224

                                Cluster History
                                           Normalized    Maximum Density in
                                             Fusion         Each Cluster
   NCL  --Clusters Joined--     FREQ        Density       Lesser    Greater   Tie
    10   CL11    Versicolor       48         0.2879       0.1479     8.3678
     9   CL13    Virginica        46         0.2802       0.2005     3.5156
     8   CL10    Virginica        49         0.2699       0.1372     8.3678
     7   CL8     Versicolor       50         0.2586       0.1372     8.3678
     6   CL9     Virginica        47         0.1412       0.0832     3.5156
     5   CL6     Virginica        48         0.107        0.0605     3.5156
     4   CL5     Virginica        49         0.0969       0.0541     3.5156
     3   CL4     Virginica        50         0.0715       0.0370     3.5156
     2   CL3     CL7             100         2.6277       3.5156     8.3678

3 modal clusters have been formed.

Output 76.2.2. Horizontal Tree for Fisher's Iris Data
(line printer dendrogram titled "Two-Stage Density Linkage Clustering," plotting Species on the vertical axis against Cluster Fusion Density on a horizontal axis that runs from 0 to 10)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Versicolor XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX........................................................... XXXXXXXXXXXXXXXXXXXXXXX Versicolor XXXXXXXXXXXXXXXXXXXXXXX.................................................................... XXXXXXXXXXXXXXXXXXXX Versicolor XXXXXXXXXXXXXXXXXXXX....................................................................... XXXXXXXXXXXXXXXXXXX Versicolor XXXXXXXXXXXXXXXXXXX........................................................................ XXXXXXXXXXXXXXXXX Versicolor XXXXXXXXXXXXXXXXX.......................................................................... XXXXXXXXXXXXXXXXX Versicolor XXXXXXXXXXXXXXXXX.......................................................................... XXXXXXXXXXXXXX Virginica XXXXXXXXXXXXXX............................................................................. XXXXXXXXXXXXX Versicolor XXXXXXXXXXXXX.............................................................................. XXXXXXXXXX
Example 76.2. Iris Data
4771
Versicolor XXXXXXXXXX................................................................................. XXXX Versicolor XXXX....................................................................................... XXXX Versicolor XXXX....................................................................................... XXX Versicolor XXX........................................................................................ Setosa XXXXXXXXXXXXXXXX........................................................................... XXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXX................................................................. XXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX............................................................ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX............................................... XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX............................................... XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX........................................... XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.......................................... XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX................................... XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
4772
Chapter 76. The TREE Procedure
Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.......................................... XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX............................................ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX............................................ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX........................................................ XXXXXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXXXXX..................................................................... XXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXX........................................................................ XXXXXXXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXXXXXXX........................................................................ XXXXXXXXXXXXXX Setosa XXXXXXXXXXXXXX............................................................................. XXXXXXXXX Setosa XXXXXXXXX.................................................................................. XXXXX Setosa XXXXX...................................................................................... XXXX Setosa XXXX.......................................................................................
Chapter 77
The TTEST Procedure

Chapter Contents

OVERVIEW
GETTING STARTED
   One-Sample t Test
   Comparing Group Means
SYNTAX
   PROC TTEST Statement
   BY Statement
   CLASS Statement
   FREQ Statement
   PAIRED Statement
   VAR Statement
   WEIGHT Statement
DETAILS
   Input Data Set of Statistics
   Missing Values
   Computational Methods
   Displayed Output
   ODS Table Names
EXAMPLES
   Example 77.1. Comparing Group Means Using Input Data Set of Summary Statistics
   Example 77.2. One-Sample Comparison Using the FREQ Statement
   Example 77.3. Paired Comparisons
REFERENCES
Chapter 77
The TTEST Procedure

Overview

The TTEST procedure performs t tests for one sample, two samples, and paired observations. The one-sample t test compares the mean of the sample to a given number. The two-sample t test compares the mean of the first sample minus the mean of the second sample to a given number. The paired observations t test compares the mean of the differences in the observations to a given number.

For one-sample tests, PROC TTEST computes the sample mean of the variable and compares it with a given number. Paired comparisons use the one-sample process on the differences between the observations. Paired comparisons can be made between many pairs of variables with one call to PROC TTEST.

For group comparisons, PROC TTEST computes sample means for each of two groups of observations and tests the hypothesis that the population means differ by a given amount. This latter analysis can be considered a special case of a one-way analysis of variance with two levels of classification.

The underlying assumption of the t test in all three cases is that the observations are random samples drawn from normally distributed populations. This assumption can be checked using the UNIVARIATE procedure; if the normality assumptions for the t test are not satisfied, you should analyze your data using the NPAR1WAY procedure. The two populations of a group comparison must also be independent. If they are not independent, you should question the validity of a paired comparison.

PROC TTEST computes the group comparison t statistic based on the assumption that the variances of the two groups are equal. It also computes an approximate t based on the assumption that the variances are unequal (the Behrens-Fisher problem). The degrees of freedom and probability level are given for each; Satterthwaite's (1946) approximation is used to compute the degrees of freedom associated with the approximate t. In addition, you can request the Cochran and Cox (1950) approximation of the probability level for the approximate t. The folded form of the F statistic is computed to test for equality of the two variances (Steel and Torrie 1980).

FREQ and WEIGHT statements are available. Data can be input in the form of observations or summary statistics. Summary statistics and their confidence intervals, and differences of means are output. For two-sample tests, the pooled-variance and a test for equality of variances are also produced.
Getting Started

One-Sample t Test

A one-sample t test can be used to compare a sample mean to a given value. This example, taken from Huntsberger and Billingsley (1989, p. 290), tests whether the mean length of a certain type of court case is 80 days, using 20 randomly chosen cases. The data are read by the following DATA step:

   title 'One-Sample t Test';
   data time;
      input time @@;
      datalines;
    43  90  84  87 116
    95  86  99  93  92
   121  71  66  98  79
   102  60 112 105  98
   ;
   run;

The only variable in the data set, time, is assumed to be normally distributed. The trailing at signs (@@) indicate that there is more than one observation on a line. The following code invokes PROC TTEST for a one-sample t test:

   proc ttest h0=80 alpha=0.1;
      var time;
   run;
The VAR statement indicates that the time variable is being studied, while the H0= option specifies that the mean of the time variable should be compared to the value 80 rather than the default null hypothesis of 0. The ALPHA=0.1 option requests 90% confidence intervals rather than the default 95% confidence intervals. The output is displayed in Figure 77.1.

   One-Sample t Test
   The TTEST Procedure

   Statistics
   Variable   N    Lower CL Mean   Mean    Upper CL Mean   Lower CL Std Dev   Std Dev   Upper CL Std Dev   Std Err   Minimum   Maximum
   time       20   82.447          89.85   97.253          15.2               19.146    26.237             4.2811    43        121

   T-Tests
   Variable   DF   t Value   Pr > |t|
   time       19   2.30      0.0329

Figure 77.1. One-Sample t Test Results
Summary statistics appear at the top of the output. The sample size (N), the mean and its confidence bounds (Lower CL Mean and Upper CL Mean), the standard deviation and its confidence bounds (Lower CL Std Dev and Upper CL Std Dev), and the standard error are displayed with the minimum and maximum values of the time variable. The test statistic, the degrees of freedom, and the p-value for the t test are displayed next; at the 10% α-level, this test indicates that the mean length of the court cases is significantly different from 80 days (t = 2.30, p = 0.0329).
Comparing Group Means

If you want to compare values obtained from two different groups, and if the groups are independent of each other and the data are normally distributed in each group, then a group t test can be used. Examples of such group comparisons include

• test scores for two third-grade classes, where one of the classes receives tutoring
• fuel efficiency readings of two automobile nameplates, where each nameplate uses the same fuel
• sunburn scores for two sunblock lotions, each applied to a different group of people
• political attitude scores of males and females

In the following example, the golf scores for males and females in a physical education class are compared. The sample sizes from each population are equal, but this is not required for further analysis. The data are read by the following statements:

   title 'Comparing Group Means';
   data scores;
      input Gender $ Score @@;
      datalines;
   f 75  f 76  f 80  f 77  f 80  f 77  f 73
   m 82  m 80  m 85  m 85  m 78  m 87  m 82
   ;
   run;
The dollar sign ($) following Gender in the INPUT statement indicates that Gender is a character variable. The trailing at signs (@@) enable the procedure to read more than one observation per line.

You can use a group t test to determine if the mean golf score for the men in the class differs significantly from the mean score for the women. If you also suspect that the distributions of the golf scores of males and females have unequal variances, then submitting the following statements invokes PROC TTEST with options to deal with the unequal variance case.

   proc ttest cochran ci=equal umpu;
      class Gender;
      var Score;
   run;
The CLASS statement contains the variable that distinguishes the groups being compared, and the VAR statement specifies the response variable to be used in calculations. The COCHRAN option produces p-values for the unequal variance situation using the Cochran and Cox (1950) approximation. Equal-tailed and uniformly most powerful unbiased (UMPU) confidence intervals for σ are requested by the CI= option. Output from these statements is displayed in Figure 77.2 through Figure 77.4.

   Comparing Group Means
   The TTEST Procedure

   Statistics
   Variable   Gender       N   Lower CL Mean   Mean     Upper CL Mean   Lower CL Std Dev   UMPU Lower CL Std Dev   Std Dev
   Score      f            7   74.504          76.857   79.211          1.6399             1.5634                  2.5448
   Score      m            7   79.804          82.714   85.625          2.028              1.9335                  3.1472
   Score      Diff (1-2)       -9.19           -5.857   -2.524          2.0522             2.0019                  2.8619

   Statistics (continued)
   Variable   Gender       UMPU Upper CL Std Dev   Upper CL Std Dev   Std Err   Minimum   Maximum
   Score      f            5.2219                  5.6039             0.9619    73        80
   Score      m            6.4579                  6.9303             1.1895    78        87
   Score      Diff (1-2)   4.5727                  4.7242             1.5298

Figure 77.2. Simple Statistics
Simple statistics for the two populations being compared, as well as for the difference of the means between the populations, are displayed in Figure 77.2. The Variable column denotes the response variable, while the Class column indicates the population corresponding to the statistics in that row. The sample size (N) for each population, the sample means (Mean), and lower and upper confidence bounds for the means (Lower CL Mean and Upper CL Mean) are displayed next. The standard deviations (Std Dev) are displayed as well, with equal-tailed confidence bounds in the Lower CL Std Dev and Upper CL Std Dev columns and UMPU confidence bounds in the UMPU Lower CL Std Dev and UMPU Upper CL Std Dev columns. In addition, the standard error of the mean and the minimum and maximum data values are displayed.

   T-Tests
   Variable   Method          Variances   DF     t Value   Pr > |t|
   Score      Pooled          Equal       12     -3.83     0.0024
   Score      Satterthwaite   Unequal     11.5   -3.83     0.0026
   Score      Cochran         Unequal     6      -3.83     0.0087

Figure 77.3. t Tests
The test statistics, associated degrees of freedom, and p-values are displayed in Figure 77.3. The Method column denotes which t test is being used for that row, and the Variances column indicates what assumption about variances is being made. The pooled test assumes that the two populations have equal variances and uses degrees of freedom $n_1 + n_2 - 2$, where $n_1$ and $n_2$ are the sample sizes for the two populations. The remaining two tests do not assume that the populations have equal variances. The Satterthwaite test uses the Satterthwaite approximation for degrees of freedom, while the Cochran test uses the Cochran and Cox approximation for the p-value.

   Equality of Variances
   Variable   Method     Num DF   Den DF   F Value   Pr > F
   Score      Folded F   6        6        1.53      0.6189

Figure 77.4. Tests of Equality of Variances
Examine the output in Figure 77.4 to determine which t test is appropriate. The "Equality of Variances" test results show that the assumption of equal variances is reasonable for these data (the folded F statistic $F' = 1.53$, with p = 0.6189). If the assumption of normality is also reasonable, the appropriate test is the usual pooled t test, which shows that the average golf scores for men and women are significantly different (t = −3.83, p = 0.0024). If the assumption of equality of variances is not reasonable, then either the Satterthwaite or the Cochran test should be used. The assumption of normality can be checked using PROC UNIVARIATE; if the assumption of normality is not reasonable, you should analyze the data with the nonparametric Wilcoxon Rank Sum test using PROC NPAR1WAY.
Syntax

The following statements are available in PROC TTEST.

   PROC TTEST < options > ;
      CLASS variable ;
      PAIRED variables ;
      BY variables ;
      VAR variables ;
      FREQ variable ;
      WEIGHT variable ;

No statement can be used more than once. There is no restriction on the order of the statements after the PROC statement.
PROC TTEST Statement

PROC TTEST < options > ;

The following options can appear in the PROC TTEST statement.

ALPHA=p
   specifies that confidence intervals are to be 100(1 − p)% confidence intervals, where 0 < p < 1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or 1 or more, an error message is printed.

CI=EQUAL
CI=UMPU
CI=NONE
   specifies whether a confidence interval is displayed for σ and, if so, what kind. The CI=EQUAL option specifies an equal-tailed confidence interval, and it is the default. The CI=UMPU option specifies an interval based on the uniformly most powerful unbiased test of H0: σ = σ0. The CI=NONE option requests that no confidence interval be displayed for σ. The values EQUAL and UMPU together request that both types of confidence intervals be displayed. If the value NONE is specified with one or both of the values EQUAL and UMPU, NONE takes precedence. For more information, see the "Confidence Interval Estimation" section later in this chapter.

COCHRAN
   requests the Cochran and Cox (1950) approximation of the probability level of the approximate t statistic for the unequal variances situation.

DATA=SAS-data-set
   names the SAS data set for the procedure to use. By default, PROC TTEST uses the most recently created SAS data set. The input data set can contain summary statistics of the observations instead of the observations themselves. The number, mean, and standard deviation of the observations are required for each BY group (one sample and paired differences) or for each class within each BY group (two samples). For more information on the DATA= option, see the "Input Data Set of Statistics" section later in this chapter.

H0=m
   requests tests against m instead of 0 in all three situations (one-sample, two-sample, and paired observation t tests). By default, PROC TTEST uses H0=0.
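As a brief sketch combining several of these options (reusing the scores data set from the "Comparing Group Means" example; the particular option values are illustrative only):

   proc ttest data=scores h0=0 alpha=0.1 ci=equal umpu cochran;
      class Gender;
      var Score;
   run;

This requests 90% confidence intervals, both equal-tailed and UMPU confidence intervals for σ, and the Cochran and Cox p-value in addition to the default output.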
BY Statement

BY variables ;

You can specify a BY statement with PROC TTEST to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement (a small sketch appears after this list).
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the TTEST procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure (in base SAS software).

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the SAS Procedures Guide.
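A minimal sketch of the first alternative, assuming a hypothetical data set exam that contains a grouping variable School in addition to Gender and Score:

   proc sort data=exam;
      by School;
   run;

   proc ttest data=exam;
      by School;        /* one two-sample analysis per school */
      class Gender;
      var Score;
   run;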
CLASS Statement

CLASS variable ;

A CLASS statement giving the name of the classification (or grouping) variable must accompany the PROC TTEST statement in the two independent sample cases. It should be omitted for the one-sample or paired comparison situations. If it is used without the VAR statement, all numeric variables in the input data set (except those appearing in the CLASS, BY, FREQ, or WEIGHT statement) are included in the analysis.

The class variable must have two, and only two, levels. PROC TTEST divides the observations into the two groups for the t test using the levels of this variable. You can use either a numeric or a character variable in the CLASS statement.

Class levels are determined from the formatted values of the CLASS variable. Thus, you can use formats to define group levels. Refer to the discussions of the FORMAT procedure, the FORMAT statement, formats, and informats in SAS Language Reference: Dictionary.
FREQ Statement

FREQ variable ;

The variable in the FREQ statement identifies a variable that contains the frequency of occurrence of each observation. PROC TTEST treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If the value is not an integer, only the integer portion is used. If the frequency value is less than 1 or is missing, the observation is not used in the analysis. When the FREQ statement is not specified, each observation is assigned a frequency of 1. The FREQ statement cannot be used if the DATA= data set contains statistics instead of the original observations.
PAIRED Statement

PAIRED PairLists ;

The PairLists in the PAIRED statement identify the variables to be compared in paired comparisons. You can use one or more PairLists. Variables or lists of variables are separated by an asterisk (*) or a colon (:). The asterisk requests comparisons between each variable on the left with each variable on the right. The colon requests comparisons between the first variable on the left and the first on the right, the second on the left and the second on the right, and so forth. The number of variables on the left must equal the number on the right when the colon is used. The differences are calculated by taking the variable on the left minus the variable on the right for both the asterisk and colon. A pair formed by a variable with itself is ignored.

Use the PAIRED statement only for paired comparisons. The CLASS and VAR statements cannot be used with the PAIRED statement. Examples of the use of the asterisk and the colon are shown in the following table.

   These PAIRED statements...    yield these comparisons
   PAIRED A*B;                   A-B
   PAIRED A*B C*D;               A-B and C-D
   PAIRED (A B)*(C D);           A-C, A-D, B-C, and B-D
   PAIRED (A B)*(C B);           A-C, A-B, and B-C
   PAIRED (A1-A2)*(B1-B2);       A1-B1, A1-B2, A2-B1, and A2-B2
   PAIRED (A1-A2):(B1-B2);       A1-B1 and A2-B2
VAR Statement

VAR variables ;

The VAR statement names the variables to be used in the analyses. One-sample comparisons are conducted when the VAR statement is used without the CLASS statement, while group comparisons are conducted when the VAR statement is used with a CLASS statement. If the VAR statement is omitted, all numeric variables in the input data set (except a numeric variable appearing in the BY, CLASS, FREQ, or WEIGHT statement) are included in the analysis. The VAR statement can be used with one- and two-sample t tests and cannot be used with the PAIRED statement.
WEIGHT Statement

WEIGHT variable ;

The WEIGHT statement weights each observation in the input data set by the value of the WEIGHT variable. The values of the WEIGHT variable can be nonintegral, and they are not truncated. Observations with negative, zero, or missing values for the WEIGHT variable are not used in the analyses. Each observation is assigned a weight of 1 when the WEIGHT statement is not used. The WEIGHT statement cannot be used with an input data set of summary statistics.
Details

Input Data Set of Statistics

PROC TTEST accepts data containing either observation values or summary statistics. It assumes that the DATA= data set contains statistics if it contains a character variable with name _TYPE_ or _STAT_. The TTEST procedure expects this character variable to contain the names of statistics. If both _TYPE_ and _STAT_ variables exist and are of type character, PROC TTEST expects _TYPE_ to contain the names of statistics including 'N', 'MEAN', and 'STD' for each BY group (or for each class within each BY group for two-sample t tests). If no 'N', 'MEAN', or 'STD' statistics exist, an error message is printed.

FREQ, WEIGHT, and PAIRED statements cannot be used with input data sets of statistics. BY, CLASS, and VAR statements are the same regardless of data set type.

For paired comparisons, see the _DIF_ values for the _TYPE_=T observations in output produced by the OUTSTATS= option in the PROC COMPARE statement (refer to the SAS Procedures Guide).
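As a minimal sketch of such an input data set (the data set name golfstat is arbitrary; the statistic values are taken from the golf-score example in the "Getting Started" section):

   data golfstat;
      input Gender $ _STAT_ $ Score;
      datalines;
   f N    7
   f MEAN 76.857
   f STD  2.5448
   m N    7
   m MEAN 82.714
   m STD  3.1472
   ;
   run;

   proc ttest data=golfstat;
      class Gender;
      var Score;
   run;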
Missing Values

An observation is omitted from the calculations if it has a missing value for either the CLASS variable, a PAIRED variable, or the variable to be tested. If more than one variable is listed in the VAR statement, a missing value in one variable does not eliminate the observation from the analysis of other nonmissing variables.
Computational Methods

The t Statistic

The form of the t statistic used varies with the type of test being performed. (A short numerical check of the one-sample form follows this list.)

• To compare an individual mean from a sample of size $n$ to a value $m$, use
$$ t = \frac{\bar{x} - m}{s/\sqrt{n}} $$
where $\bar{x}$ is the sample mean of the observations and $s^2$ is the sample variance of the observations.

• To compare $n$ paired differences to a value $m$, use
$$ t = \frac{\bar{d} - m}{s_d/\sqrt{n}} $$
where $\bar{d}$ is the sample mean of the paired differences and $s_d^2$ is the sample variance of the paired differences.

• To compare means from two independent samples with $n_1$ and $n_2$ observations to a value $m$, use
$$ t = \frac{(\bar{x}_1 - \bar{x}_2) - m}{s\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} $$
where $s^2$ is the pooled variance
$$ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$
and $s_1^2$ and $s_2^2$ are the sample variances of the two groups. The use of this t statistic depends on the assumption that $\sigma_1^2 = \sigma_2^2$, where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the two groups.
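As a quick numerical check of the one-sample form, the following DATA step sketch reproduces the t and p values of Figure 77.1 from that example's summary statistics:

   data _null_;
      n = 20; xbar = 89.85; s = 19.146; m = 80;   /* summary statistics from Figure 77.1 */
      t = (xbar - m) / (s / sqrt(n));             /* t = 2.30 */
      p = 2 * (1 - probt(abs(t), n - 1));         /* two-tailed p = 0.0329 */
      put t= p=;
   run;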
The Folded Form F Statistic

The folded form of the F statistic, $F'$, tests the hypothesis that the variances are equal, where
$$ F' = \frac{\max(s_1^2, s_2^2)}{\min(s_1^2, s_2^2)} $$
A test of $F'$ is a two-tailed F test because you do not specify which variance you expect to be larger. The p-value gives the probability of a greater F value under the null hypothesis that $\sigma_1^2 = \sigma_2^2$.
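A sketch of this computation, using the group standard deviations from the golf-score example (Figure 77.2); because $F' \ge 1$ by construction, the two-tailed p-value doubles the upper-tail probability:

   data _null_;
      s1 = 2.5448; n1 = 7;   /* females, from Figure 77.2 */
      s2 = 3.1472; n2 = 7;   /* males */
      fprime = max(s1**2, s2**2) / min(s1**2, s2**2);   /* F' = 1.53 */
      p = 2 * (1 - probf(fprime, n2 - 1, n1 - 1));      /* num df from the larger-variance
                                                           group (here group 2); p = 0.6189 */
      put fprime= p=;
   run;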
The Approximate t Statistic

Under the assumption of unequal variances, the approximate t statistic is computed as
$$ t' = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{w_1 + w_2}} $$
where
$$ w_1 = \frac{s_1^2}{n_1}, \qquad w_2 = \frac{s_2^2}{n_2} $$
The Cochran and Cox Approximation

The Cochran and Cox (1950) approximation of the probability level of the approximate t statistic is the value of p such that
$$ t' = \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2} $$
where $t_1$ and $t_2$ are the critical values of the t distribution corresponding to a significance level of p and sample sizes of $n_1$ and $n_2$, respectively. The number of degrees of freedom is undefined when $n_1 \neq n_2$. In general, the Cochran and Cox test tends to be conservative (Lee and Gurland 1975).
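As a sketch of the right-hand side for a single candidate significance level p (in practice the procedure finds the p at which this weighted critical value equals the observed $t'$; the sample sizes and standard deviations below are hypothetical):

   data _null_;
      n1 = 10; s1 = 2;   /* hypothetical group statistics */
      n2 = 20; s2 = 4;
      p  = 0.05;         /* candidate two-sided significance level */
      w1 = s1**2 / n1;  w2 = s2**2 / n2;
      t1 = tinv(1 - p/2, n1 - 1);
      t2 = tinv(1 - p/2, n2 - 1);
      tcrit = (w1*t1 + w2*t2) / (w1 + w2);   /* critical value of t' at level p */
      put tcrit=;
   run;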
Satterthwaite's Approximation

The formula for Satterthwaite's (1946) approximation for the degrees of freedom for the approximate t statistic is
$$ df = \frac{(w_1 + w_2)^2}{\dfrac{w_1^2}{n_1 - 1} + \dfrac{w_2^2}{n_2 - 1}} $$
Refer to Steel and Torrie (1980) or Freund, Littell, and Spector (1986) for more information.
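For example, this sketch reproduces the Satterthwaite degrees of freedom (11.5) reported in Figure 77.3 for the golf-score data:

   data _null_;
      s1 = 2.5448; n1 = 7;   /* from Figure 77.2 */
      s2 = 3.1472; n2 = 7;
      w1 = s1**2 / n1;
      w2 = s2**2 / n2;
      df = (w1 + w2)**2 / ( w1**2/(n1-1) + w2**2/(n2-1) );   /* df = 11.5 */
      put df=;
   run;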
Confidence Interval Estimation

The form of the confidence interval varies with the statistic for which it is computed. In the following confidence intervals involving means, $t_{1-\alpha/2,\,n-1}$ is the $100(1-\alpha/2)\%$ quantile of the t distribution with $n-1$ degrees of freedom. The confidence interval for

• an individual mean from a sample of size $n$ compared to a value $m$ is given by
$$ (\bar{x} - m) \pm t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} $$
where $\bar{x}$ is the sample mean of the observations and $s^2$ is the sample variance of the observations

• paired differences with a sample of size $n$ differences compared to a value $m$ is given by
$$ (\bar{d} - m) \pm t_{1-\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}} $$
where $\bar{d}$ and $s_d^2$ are the sample mean and sample variance of the paired differences, respectively

• the difference of two means from independent samples with $n_1$ and $n_2$ observations compared to a value $m$ is given by
$$ \left((\bar{x}_1 - \bar{x}_2) - m\right) \pm t_{1-\alpha/2,\,n_1+n_2-2}\; s\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} $$
where $s^2$ is the pooled variance
$$ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$
and where $s_1^2$ and $s_2^2$ are the sample variances of the two groups. The use of this confidence interval depends on the assumption that $\sigma_1^2 = \sigma_2^2$, where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the two groups.

The distribution of the estimated standard deviation of a mean is not symmetric, so alternative methods of estimating confidence intervals are possible. PROC TTEST computes two estimates. For both methods, the data are assumed to have a normal distribution with mean $\mu$ and variance $\sigma^2$, both unknown. The methods are as follows (a numerical sketch of the equal-tails interval follows this list):

• The default method, an equal-tails confidence interval, puts an equal amount of area ($\alpha/2$) in each tail of the chi-square distribution. An equal-tails test of $H_0\colon \sigma = \sigma_0$ has acceptance region
$$ \chi^2_{\alpha/2,\,n-1} \le \frac{(n-1)S^2}{\sigma_0^2} \le \chi^2_{1-\alpha/2,\,n-1} $$
which can be algebraically manipulated to give the following $100(1-\alpha)\%$ confidence interval for $\sigma^2$:
$$ \left( \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}},\; \frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}} \right) $$
In order to obtain a confidence interval for $\sigma$, the square root of each side is taken, leading to the following $100(1-\alpha)\%$ confidence interval:
$$ \left( \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}},\; \sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}}} \right) $$
• The second method yields a confidence interval derived from the uniformly most powerful unbiased test of $H_0\colon \sigma = \sigma_0$ (Lehmann 1986). This test has acceptance region
$$ c_1 \le \frac{(n-1)S^2}{\sigma_0^2} \le c_2 $$
where the critical values $c_1$ and $c_2$ satisfy
$$ \int_{c_1}^{c_2} f_n(y)\,dy = 1 - \alpha $$
and
$$ \int_{c_1}^{c_2} y\,f_n(y)\,dy = n(1 - \alpha) $$
where $f_n(y)$ is the density of the chi-squared distribution with $n$ degrees of freedom. This acceptance region can be algebraically manipulated to arrive at
$$ P\left( \frac{(n-1)S^2}{c_2} \le \sigma^2 \le \frac{(n-1)S^2}{c_1} \right) = 1 - \alpha $$
where $c_1$ and $c_2$ solve the preceding two integrals. To find the area in each tail of the chi-square distribution to which these two critical values correspond, solve $c_1 = \chi^2_{\alpha_1,\,n-1}$ and $c_2 = \chi^2_{1-\alpha_2,\,n-1}$ for $\alpha_1$ and $\alpha_2$; the resulting $\alpha_1$ and $\alpha_2$ sum to $\alpha$. Hence, a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is given by
$$ \left( \frac{(n-1)S^2}{\chi^2_{1-\alpha_2,\,n-1}},\; \frac{(n-1)S^2}{\chi^2_{\alpha_1,\,n-1}} \right) $$
In order to obtain a $100(1-\alpha)\%$ confidence interval for $\sigma$, the square root is taken of both terms, yielding
$$ \left( \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha_2,\,n-1}}},\; \sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha_1,\,n-1}}} \right) $$
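A sketch of the equal-tails interval for σ using the CINV chi-square quantile function; with the summary statistics and ALPHA=0.1 of the one-sample example, it reproduces the standard deviation bounds 15.2 and 26.237 shown in Figure 77.1:

   data _null_;
      n = 20; s = 19.146; alpha = 0.1;   /* from the court-case example */
      lower = sqrt( (n-1)*s**2 / cinv(1 - alpha/2, n-1) );   /* 15.2   */
      upper = sqrt( (n-1)*s**2 / cinv(alpha/2, n-1) );       /* 26.237 */
      put lower= upper=;
   run;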
Displayed Output

For each variable in the analysis, the TTEST procedure displays the following summary statistics for each group:

• the name of the dependent variable
• the levels of the classification variable
• N, the number of nonmissing values
• Lower CL Mean, the lower confidence bound for the mean
• the Mean or average
• Upper CL Mean, the upper confidence bound for the mean
• Lower CL Std Dev, the lower confidence bound for the standard deviation
• Std Dev, the standard deviation
• Upper CL Std Dev, the upper confidence bound for the standard deviation
• Std Err, the standard error of the mean
• the Minimum value, if the line size allows
• the Maximum value, if the line size allows
• upper and lower UMPU confidence bounds for the standard deviation, displayed if the CI=UMPU option is specified in the PROC TTEST statement

Next, the results of several t tests are given. For one-sample and paired observations t tests, the TTEST procedure displays

• t Value, the t statistic for testing the null hypothesis that the mean of the group is zero
• DF, the degrees of freedom
• Pr > |t|, the probability of a greater absolute value of t under the null hypothesis. This is the two-tailed significance probability. To compute the one-tailed significance probability, first determine whether large values of t are significant or small values are. Let p denote the significance probability for the two-tailed test. If large values of t are significant, then the one-tailed probability is p/2 if t ≥ 0, and is 1 − p/2 if t < 0. If small values of t are significant, then the one-tailed probability is 1 − p/2 if t ≥ 0, and is p/2 if t < 0. (A small sketch of this conversion follows the lists below.)

For two-sample t tests, the TTEST procedure displays all the items in the following list. You need to decide whether equal or unequal variances are appropriate for your data.

• Under the assumption of unequal variances, the TTEST procedure displays results using Satterthwaite's method. If the COCHRAN option is specified, the results for the Cochran and Cox approximation are also displayed.
   − t Value, an approximate t statistic for testing the null hypothesis that the means of the two groups are equal
   − DF, the approximate degrees of freedom
   − Pr > |t|, the probability of a greater absolute value of t under the null hypothesis. This is the two-tailed significance probability. The one-tailed probability is computed the same way as in a one-sample t test.
• Under the assumption of equal variances, the TTEST procedure displays results obtained by pooling the group variances.
   − t Value, the t statistic for testing the null hypothesis that the means of the two groups are equal
   − DF, the degrees of freedom
   − Pr > |t|, the probability of a greater absolute value of t under the null hypothesis. This is the two-tailed significance probability. The one-tailed probability is computed the same way as in a one-sample t test.
• PROC TTEST then gives the results of the test of equality of variances:
   − the $F'$ (folded) statistic (see "The Folded Form F Statistic" section earlier in this chapter)
   − Num DF and Den DF, the numerator and denominator degrees of freedom in each group
   − Pr > F, the probability of a greater $F'$ value. This is the two-tailed significance probability.
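As a small sketch of the one-tailed conversion described above, using the values from the one-sample example ("large values significant" corresponds to an upper one-sided alternative):

   data _null_;
      t = 2.30; p2 = 0.0329;        /* two-tailed results from Figure 77.1 */
      if t >= 0 then p1 = p2 / 2;   /* large values of t significant */
      else p1 = 1 - p2 / 2;
      put p1=;                      /* one-tailed p = 0.0165 */
   run;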
ODS Table Names

PROC TTEST assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."

Table 77.1. ODS Tables Produced in PROC TTEST

   ODS Table Name   Description                      Statement
   Equality         Tests for equality of variance   CLASS statement
   Statistics       Univariate summary statistics    by default
   TTests           t-tests                          by default
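For instance, a sketch that captures the t-test and equality-of-variances tables as data sets (the output data set names ttres and eqvar are arbitrary):

   ods output TTests=ttres Equality=eqvar;
   proc ttest data=scores;
      class Gender;
      var Score;
   run;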
Examples

Example 77.1. Comparing Group Means Using Input Data Set of Summary Statistics

The following example, taken from Huntsberger and Billingsley (1989), compares two grazing methods using 32 steers. Half of the steers are allowed to graze continuously while the other half are subjected to controlled grazing time. The researchers want to know if these two grazing methods impact weight gain differently. The data are read by the following DATA step.

   title 'Group Comparison Using Input Data Set of Summary Statistics';
   data graze;
      length GrazeType $ 10;
      input GrazeType $ WtGain @@;
      datalines;
   controlled 45   controlled 62   controlled 96   controlled 128
   controlled 120  controlled 99   controlled 28   controlled 50
   controlled 109  controlled 115  controlled 39   controlled 96
   controlled 87   controlled 100  controlled 76   controlled 80
   continuous 94   continuous 12   continuous 26   continuous 89
   continuous 88   continuous 96   continuous 85   continuous 130
   continuous 75   continuous 54   continuous 112  continuous 69
   continuous 104  continuous 95   continuous 53   continuous 21
   ;
   run;
The variable GrazeType denotes the grazing method: 'controlled' is controlled grazing and 'continuous' is continuous grazing. The dollar sign ($) following GrazeType makes it a character variable, and the trailing at signs (@@) tell the procedure that there is more than one observation per line. The MEANS procedure is invoked to create a data set of summary statistics with the following statements:

   proc sort;
      by GrazeType;
   proc means data=graze noprint;
      var WtGain;
      by GrazeType;
      output out=newgraze;
   run;

The NOPRINT option eliminates all output from the MEANS procedure. The VAR statement tells PROC MEANS to compute summary statistics for the WtGain variable, and the BY statement requests a separate set of summary statistics for each level of GrazeType. The OUTPUT OUT= statement tells PROC MEANS to put the summary statistics into a data set called newgraze so that it may be used in subsequent procedures. This new data set is displayed in Output 77.1.1 by using PROC PRINT as follows:

   proc print data=newgraze;
   run;
The _STAT_ variable contains the names of the statistics, and the GrazeType variable indicates which group the statistic is from.

Output 77.1.1. Output Data Set of Summary Statistics

   Group Comparison Using Input Data Set of Summary Statistics

   Obs   GrazeType    _TYPE_   _FREQ_   _STAT_   WtGain
     1   continuous     0        16     N         16.000
     2   continuous     0        16     MIN       12.000
     3   continuous     0        16     MAX      130.000
     4   continuous     0        16     MEAN      75.188
     5   continuous     0        16     STD       33.812
     6   controlled     0        16     N         16.000
     7   controlled     0        16     MIN       28.000
     8   controlled     0        16     MAX      128.000
     9   controlled     0        16     MEAN      83.125
    10   controlled     0        16     STD       30.535

The following code invokes PROC TTEST using the newgraze data set, as denoted by the DATA= option.

   proc ttest data=newgraze;
      class GrazeType;
      var WtGain;
   run;
The CLASS statement contains the variable that distinguishes between the groups being compared, in this case GrazeType. The summary statistics and confidence intervals are displayed first, as shown in Output 77.1.2.

Output 77.1.2. Summary Statistics

   The TTEST Procedure

   Statistics
   Variable   GrazeType    N    Lower CL Mean   Mean     Upper CL Mean   Lower CL Std Dev   Std Dev
   WtGain     continuous   16   57.171          75.188   93.204          .                  33.812
   WtGain     controlled   16   66.854          83.125   99.396          .                  30.535
   WtGain     Diff (1-2)        -31.2           -7.938   15.323          25.743             32.215

   Statistics (continued)
   Variable   GrazeType    Upper CL Std Dev   Std Err   Minimum   Maximum
   WtGain     continuous   .                  8.4529    12        130
   WtGain     controlled   .                  7.6337    28        128
   WtGain     Diff (1-2)   43.061             11.39
In Output 77.1.2, the Variable column states the variable used in computations and the Class column specifies the group for which the statistics are computed. For each class, the sample size, mean, standard deviation and standard error, and maximum and minimum values are displayed. The confidence bounds for the mean are also displayed; however, since summary statistics are used as input, the confidence bounds for the standard deviation of the groups are not calculated.

Output 77.1.3. t Tests

   T-Tests
   Variable   Method          Variances   DF     t Value   Pr > |t|
   WtGain     Pooled          Equal       30     -0.70     0.4912
   WtGain     Satterthwaite   Unequal     29.7   -0.70     0.4913

   Equality of Variances
   Variable   Method     Num DF   Den DF   F Value   Pr > F
   WtGain     Folded F   15       15       1.23      0.6981
Output 77.1.3 shows the results of tests for equal group means and equal variances. A group test statistic for the equality of means is reported for equal and unequal variances. Before deciding which test is appropriate, you should look at the test for equality of variances; this test does not indicate a significant difference in the two variances ($F' = 1.23$, p = 0.6981), so the pooled t statistic should be used. Based on the pooled statistic, the two grazing methods are not significantly different (t = −0.70, p = 0.4912). Note that this test assumes that the observations in both data sets are normally distributed; this assumption can be checked in PROC UNIVARIATE using the raw data.
Example 77.2. One-Sample Comparison Using the FREQ Statement

This example examines children's reading skills. The data consist of Degree of Reading Power (DRP) test scores from 44 third-grade children and are taken from Moore (1995, p. 337). Their scores are given in the following DATA step.

   title 'One-Mean Comparison Using FREQ Statement';
   data read;
      input score count @@;
      datalines;
   40 2   47 2   52 2   26 1   19 2
   25 2   35 4   39 1   26 1   48 1
   14 2   22 1   42 1   34 2   33 2
   18 1   15 1   29 1   41 2   44 1
   51 1   43 1   27 2   46 2   28 1
   49 1   31 1   28 1   54 1   45 1
   ;
   run;
The following statements invoke the TTEST procedure to test if the mean test score is equal to 30. The count variable contains the frequency of occurrence of each test score; this is specified in the FREQ statement.

   proc ttest data=read h0=30;
      var score;
      freq count;
   run;
The output, shown in Output 77.2.1, contains the results.

Output 77.2.1. TTEST Results

   One-Mean Comparison Using FREQ Statement
   The TTEST Procedure

   Statistics
   Variable   N    Lower CL Mean   Mean     Upper CL Mean   Lower CL Std Dev   Std Dev   Upper CL Std Dev   Std Err   Minimum   Maximum
   score      44   31.449          34.864   38.278          9.2788             11.23     14.229             1.693     14        54

   T-Tests
   Variable   DF   t Value   Pr > |t|
   score      43   2.87      0.0063
The SAS log states that 30 observations and two variables have been read. However, the sample size given in the TTEST output is N=44. This is due to specifying the count variable in the FREQ statement. The test is significant (t = 2.87, p = 0.0063) at the 5% level, thus you can conclude that the mean test score is different from 30.
Example 77.3. Paired Comparisons

When it is not feasible to assume that two groups of data are independent, and a natural pairing of the data exists, it is advantageous to use an analysis that takes the correlation into account. Utilizing this correlation results in higher power to detect existing differences between the means. The differences between paired observations are assumed to be normally distributed. Some examples of this natural pairing are

• pre- and post-test scores for a student receiving tutoring
• fuel efficiency readings of two fuel types observed on the same automobile
• sunburn scores for two sunblock lotions, one applied to the individual's right arm, one to the left arm
• political attitude scores of husbands and wives

In this example, taken from SUGI Supplemental Library User's Guide, Version 5 Edition, a stimulus is being examined to determine its effect on systolic blood pressure. Twelve men participate in the study. Their systolic blood pressure is measured both before and after the stimulus is applied. The following statements input the data:

   title 'Paired Comparison';
   data pressure;
      input SBPbefore SBPafter @@;
      datalines;
   120 128   124 131   130 131   118 127
   140 132   128 125   140 141   135 137
   126 118   130 132   126 129   127 135
   ;
   run;
The variables SBPbefore and SBPafter denote the systolic blood pressure before and after the stimulus, respectively. The statements to perform the test follow.

   proc ttest;
      paired SBPbefore*SBPafter;
   run;
The PAIRED statement is used to test whether the mean change in systolic blood pressure is significantly different from zero. The output is displayed in Output 77.3.1.
Output 77.3.1. TTEST Results

   Paired Comparison
   The TTEST Procedure

   Statistics
   Difference             N    Lower CL Mean   Mean     Upper CL Mean   Lower CL Std Dev   Std Dev   Upper CL Std Dev   Std Err   Minimum   Maximum
   SBPbefore - SBPafter   12   -5.536          -1.833   1.8698          4.1288             5.8284    9.8958             1.6825    -9        8

   T-Tests
   Difference             DF   t Value   Pr > |t|
   SBPbefore - SBPafter   11   -1.09     0.2992
The variables SBPbefore and SBPafter are the paired variables with a sample size of 12. The summary statistics of the difference are displayed (mean, standard deviation, and standard error) along with their confidence limits. The minimum and maximum differences are also displayed. The t test is not significant (t = −1.09, p = 0.2992), indicating that the stimulus did not significantly affect systolic blood pressure. Note that this test of hypothesis assumes that the differences are normally distributed. This assumption can be investigated using PROC UNIVARIATE with the NORMAL option. If the assumption is not satisfied, PROC NPAR1WAY should be used.
References

Best, D.I. and Rayner, C.W. (1987), "Welch's Approximate Solution for the Behrens-Fisher Problem," Technometrics, 29, 205–210.

Cochran, W.G. and Cox, G.M. (1950), Experimental Designs, New York: John Wiley & Sons, Inc.

Freund, R.J., Littell, R.C., and Spector, P.C. (1986), SAS System for Linear Models, 1986 Edition, Cary, NC: SAS Institute Inc.

Huntsberger, David V. and Billingsley, Patrick P. (1989), Elements of Statistical Inference, Dubuque, IA: Wm. C. Brown Publishers.

Lee, A.F.S. and Gurland, J. (1975), "Size and Power of Tests for Equality of Means of Two Normal Populations with Unequal Variances," Journal of the American Statistical Association, 70, 933–941.

Lehmann, E.L. (1986), Testing Statistical Hypotheses, New York: John Wiley & Sons, Inc.

Moore, David S. (1995), The Basic Practice of Statistics, New York: W. H. Freeman and Company.

Posten, H.O., Yeh, Y.Y., and Owen, D.B. (1982), "Robustness of the Two-Sample t Test Under Violations of the Homogeneity of Variance Assumption," Communications in Statistics, 11, 109–126.

Ramsey, P.H. (1980), "Exact Type I Error Rates for Robustness of Student's t Test with Unequal Variances," Journal of Educational Statistics, 5, 337–349.

Robinson, G.K. (1976), "Properties of Student's t and of the Behrens-Fisher Solution to the Two Mean Problem," Annals of Statistics, 4, 963–971.

SAS Institute Inc. (1986), SUGI Supplemental Library User's Guide, Version 5 Edition, Cary, NC: SAS Institute Inc.

Satterthwaite, F.W. (1946), "An Approximate Distribution of Estimates of Variance Components," Biometrics Bulletin, 2, 110–114.

Scheffe, H. (1970), "Practical Solutions of the Behrens-Fisher Problem," Journal of the American Statistical Association, 65, 1501–1508.

Steel, R.G.D. and Torrie, J.H. (1980), Principles and Procedures of Statistics, Second Edition, New York: McGraw-Hill Book Company.

Wang, Y.Y. (1971), "Probabilities of the Type I Error of the Welch Tests for the Behrens-Fisher Problem," Journal of the American Statistical Association, 66, 605–608.

Yuen, K.K. (1974), "The Two-Sample Trimmed t for Unequal Population Variances," Biometrika, 61, 165–170.
Chapter 78
The VARCLUS Procedure

Chapter Contents

OVERVIEW
GETTING STARTED
SYNTAX
   PROC VARCLUS Statement
   BY Statement
   FREQ Statement
   PARTIAL Statement
   SEED Statement
   VAR Statement
   WEIGHT Statement
DETAILS
   Missing Values
   Using PROC VARCLUS
   Output Data Sets
   Computational Resources
   Interpreting VARCLUS Procedure Output
   Displayed Output
   ODS Table Names
EXAMPLE
   Example 78.1. Correlations among Physical Variables
REFERENCES
Chapter 78
The VARCLUS Procedure

Overview

The VARCLUS procedure divides a set of numeric variables into disjoint or hierarchical clusters. Associated with each cluster is a linear combination of the variables in the cluster. This linear combination can be either the first principal component (the default) or the centroid component (if you specify the CENTROID option). The first principal component is a weighted average of the variables that explains as much variance as possible. See Chapter 58, "The PRINCOMP Procedure," for further details. Centroid components are unweighted averages of either the standardized variables (the default) or the raw variables (if you specify the COVARIANCE option).

PROC VARCLUS tries to maximize the variance that is explained by the cluster components, summed over all the clusters. The cluster components are oblique, not orthogonal, even when the cluster components are first principal components. In an ordinary principal component analysis, all components are computed from the same variables, and the first principal component is orthogonal to the second principal component and to each other principal component. In PROC VARCLUS, each cluster component is computed from a different set of variables than all the other cluster components. The first principal component of one cluster may be correlated with the first principal component of another cluster. Hence, PROC VARCLUS is a type of oblique component analysis.

As in principal component analysis, either the correlation or the covariance matrix can be analyzed. If correlations are used, all variables are treated as equally important. If covariances are used, variables with larger variances have more importance in the analysis.

PROC VARCLUS creates an output data set that can be used with the SCORE procedure to compute component scores for each cluster. A second output data set can be used by the TREE procedure to draw a tree diagram of hierarchical clusters.

The VARCLUS procedure can be used as a variable-reduction method. A large set of variables can often be replaced by the set of cluster components with little loss of information. A given number of cluster components does not generally explain as much variance as the same number of principal components on the full set of variables, but the cluster components are usually easier to interpret than the principal components, even if the latter are rotated.

For example, an educational test might contain fifty items. PROC VARCLUS can be used to divide the items into, say, five clusters. Each cluster can then be treated as a subtest, with the subtest scores given by the cluster components. If the cluster components are centroid components of the covariance matrix, each subtest score is simply the sum of the item scores for that cluster.
The VARCLUS algorithm is both divisive and iterative. By default, PROC VARCLUS begins with all variables in a single cluster. It then repeats the following steps:

1. A cluster is chosen for splitting. Depending on the options specified, the selected cluster has either the smallest percentage of variation explained by its cluster component (using the PROPORTION= option) or the largest eigenvalue associated with the second principal component (using the MAXEIGEN= option).

2. The chosen cluster is split into two clusters by finding the first two principal components, performing an orthoblique rotation (raw quartimax rotation on the eigenvectors; Harris and Kaiser, 1964), and assigning each variable to the rotated component with which it has the higher squared correlation.

3. Variables are iteratively reassigned to clusters to try to maximize the variance accounted for by the cluster components. You can require the reassignment algorithms to maintain a hierarchical structure for the clusters.

The procedure stops splitting when either

• the maximum number of clusters as specified by the MAXCLUSTERS= option is reached, or
• each cluster satisfies the stopping criteria specified by the PROPORTION= (percentage of variation explained) and/or the MAXEIGEN= (second eigenvalue) options.

By default, VARCLUS stops splitting when each cluster has only one eigenvalue greater than one, thus satisfying the most popular criterion for determining the sufficiency of a single underlying dimension.

The iterative reassignment of variables to clusters proceeds in two phases. The first is a nearest component sorting (NCS) phase, similar in principle to the nearest centroid sorting algorithms described by Anderberg (1973). In each iteration, the cluster components are computed, and each variable is assigned to the component with which it has the highest squared correlation. The second phase involves a search algorithm in which each variable is tested to see if assigning it to a different cluster increases the amount of variance explained. If a variable is reassigned during the search phase, the components of the two clusters involved are recomputed before the next variable is tested.

The NCS phase is much faster than the search phase but is more likely to be trapped by a local optimum. If principal components are used, the NCS phase is an alternating least-squares method and converges rapidly. The search phase can be very time consuming for a large number of variables. But if the default initialization method is used, the search phase is rarely able to substantially improve the results of the NCS phase, so the search takes few iterations. If random initialization is used, the NCS phase may be trapped by a local optimum from which the search phase can escape.
If centroid components are used, the NCS phase is not an alternating least-squares method and may not increase the amount of variance explained; therefore, it is limited, by default, to one iteration. You can have VARCLUS do the clustering hierarchically by restricting the reassignment of variables such that the clusters maintain a tree structure. In this case, when a cluster is split, a variable in one of the two resulting clusters can be reassigned to the other cluster resulting from the split but not to a cluster that is not part of the original cluster (the one that is split).
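As a brief illustration of the hierarchical restriction, the following sketch requests it directly with the HIERARCHY option; the data set name mydata and the variable list x1-x20 are placeholders, not from the original documentation:

   proc varclus data=mydata hierarchy;   /* mydata and x1-x20 are hypothetical */
      var x1-x20;
   run;

Specifying OUTTREE= (as in the example that follows) implies the same restriction.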
Getting Started

This example demonstrates how you can use the VARCLUS procedure to create hierarchical, unidimensional clusters of variables. The following data, from Hand et al. (1994), represent amounts of protein consumed from nine food groups for each of 25 European countries. The nine food groups are red meat (RedMeat), white meat (WhiteMeat), eggs (Eggs), milk (Milk), fish (Fish), cereal (Cereal), starch (Starch), nuts (Nuts), and fruits and vegetables (FruitVeg).

Suppose you want to simplify interpretation of the data by reducing the number of variables to a smaller set of variable cluster components. You can use the VARCLUS procedure for this type of variable reduction.

The following DATA step creates the SAS data set Protein:

data Protein;
   input Country $18. RedMeat WhiteMeat Eggs Milk
         Fish Cereal Starch Nuts FruitVeg;
   datalines;
Albania           10.1  1.4  0.5  8.9  0.2 42.3  0.6  5.5  1.7
Austria            8.9 14.0  4.3 19.9  2.1 28.0  3.6  1.3  4.3
Belgium           13.5  9.3  4.1 17.5  4.5 26.6  5.7  2.1  4.0
Bulgaria           7.8  6.0  1.6  8.3  1.2 56.7  1.1  3.7  4.2
Czechoslovakia     9.7 11.4  2.8 12.5  2.0 34.3  5.0  1.1  4.0
Denmark           10.6 10.8  3.7 25.0  9.9 21.9  4.8  0.7  2.4
E Germany          8.4 11.6  3.7 11.1  5.4 24.6  6.5  0.8  3.6
Finland            9.5  4.9  2.7 33.7  5.8 26.3  5.1  1.0  1.4
France            18.0  9.9  3.3 19.5  5.7 28.1  4.8  2.4  6.5
Greece            10.2  3.0  2.8 17.6  5.9 41.7  2.2  7.8  6.5
Hungary            5.3 12.4  2.9  9.7  0.3 40.1  4.0  5.4  4.2
Ireland           13.9 10.0  4.7 25.8  2.2 24.0  6.2  1.6  2.9
Italy              9.0  5.1  2.9 13.7  3.4 36.8  2.1  4.3  6.7
Netherlands        9.5 13.6  3.6 23.4  2.5 22.4  4.2  1.8  3.7
Norway             9.4  4.7  2.7 23.3  9.7 23.0  4.6  1.6  2.7
Poland             6.9 10.2  2.7 19.3  3.0 36.1  5.9  2.0  6.6
Portugal           6.2  3.7  1.1  4.9 14.2 27.0  5.9  4.7  7.9
Romania            6.2  6.3  1.5 11.1  1.0 49.6  3.1  5.3  2.8
Spain              7.1  3.4  3.1  8.6  7.0 29.2  5.7  5.9  7.2
Sweden             9.9  7.8  3.5 24.7  7.5 19.5  3.7  1.4  2.0
Switzerland       13.1 10.1  3.1 23.8  2.3 25.6  2.8  2.4  4.9
UK                17.4  5.7  4.7 20.6  4.3 24.3  4.7  3.4  3.3
USSR               9.3  4.6  2.1 16.6  3.0 43.6  6.4  3.4  2.9
W Germany         11.4 12.5  4.1 18.8  3.4 18.6  5.2  1.5  3.8
Yugoslavia         4.4  5.0  1.2  9.5  0.6 55.9  3.0  5.7  3.2
;
The data set Protein contains the character variable Country and the nine numeric variables representing the food groups. The $18. in the INPUT statement specifies that the variable Country is a character variable with a length of 18.

The following statements create the variable clusters:

   proc varclus data=Protein outtree=tree centroid maxclusters=4;
      var RedMeat--FruitVeg;
   run;
The DATA= option specifies the SAS data set Protein as input. The OUTTREE= option creates the output SAS data set Tree to contain the tree structure information. When you specify this option, you are implicitly requiring the clusters to be hierarchical rather than disjoint. The CENTROID option specifies the centroid method of clustering. This means that the calculated cluster components are the unweighted averages of the standardized variables. The MAXCLUSTERS=4 option specifies that no more than four clusters be computed. The VAR statement lists the numeric variables (RedMeat--FruitVeg) to be used in the analysis.

The results of this analysis are displayed in the following figures. Although PROC VARCLUS displays output for each step in the clustering process, the following figures display only the final analysis for four clusters. Figure 78.1 displays the final cluster summary.

              Oblique Centroid Component Cluster Analysis

                    Cluster Summary for 4 Clusters

                             Cluster    Variation   Proportion
   Cluster    Members      Variation    Explained    Explained
   ------------------------------------------------------------
      1          4             4         2.173024      0.5433
      2          2             2         1.650997      0.8255
      3          2             2         1.403853      0.7019
      4          1             1         1              1.0000

   Total variation explained = 6.227874   Proportion = 0.6920
Figure 78.1. Final Cluster Summary from the VARCLUS Procedure
For each cluster, Figure 78.1 displays the number of variables in the cluster, the cluster variation, the total explained variation, and the proportion of the total variance explained by the variables in the cluster. The variance explained by the variables in a cluster is similar to the variance explained by a factor in common factor analysis, but it includes contributions only from the variables in the cluster rather than from all variables.

The line labeled "Total variation explained" in Figure 78.1 gives the sum of the explained variation over all clusters. The final "Proportion" represents the total explained variation divided by the sum of cluster variation. This value, 0.6920, indicates that about 69% of the total variation in the data can be accounted for by the four cluster components.

Figure 78.2 shows how the variables are clustered. The first cluster represents animal protein (RedMeat, WhiteMeat, Eggs, and Milk), the second cluster contains the variables Cereal and Nuts, the third cluster is composed of the variables Fish and Starch, and the last cluster contains the single variable representing fruits and vegetables (FruitVeg).

              Oblique Centroid Component Cluster Analysis

                                4 Clusters

                                R-squared with
                              ------------------
                                Own       Next      1-R**2
   Cluster       Variable     Cluster    Closest     Ratio
   ---------------------------------------------------------
   Cluster 1     RedMeat       0.4375    0.1518     0.6631
                 WhiteMeat     0.6302    0.3331     0.5545
                 Eggs          0.7024    0.4902     0.5837
                 Milk          0.4288    0.2721     0.7847
   ---------------------------------------------------------
   Cluster 2     Cereal        0.8255    0.3983     0.2900
                 Nuts          0.8255    0.5901     0.4257
   ---------------------------------------------------------
   Cluster 3     Fish          0.7019    0.1365     0.3452
                 Starch        0.7019    0.3075     0.4304
   ---------------------------------------------------------
   Cluster 4     FruitVeg      1.0000    0.0578     0.0000
Figure 78.2. R-square Values from the VARCLUS Procedure
Figure 78.2 also displays the $R^2$ value of each variable with its own cluster and the $R^2$ value with its nearest cluster. The $R^2$ value for a variable with the nearest cluster should be low if the clusters are well separated. The last column displays the ratio $(1 - R^2_{\text{own}}) / (1 - R^2_{\text{nearest}})$ for each variable. Small values of this ratio indicate good clustering.

Figure 78.3 displays the cluster structure and the intercluster correlations. The structure table displays the correlation of each variable with each cluster component. The table of intercorrelations contains the correlations between the cluster components.
              Oblique Centroid Component Cluster Analysis

                          Cluster Structure

   Cluster          1           2           3           4
   --------------------------------------------------------
   RedMeat       0.66145    -0.38959     0.06450    -0.34109
   WhiteMeat     0.79385    -0.57715     0.04760    -0.06132
   Eggs          0.83811    -0.70012     0.30902    -0.04552
   Milk          0.65483    -0.52163     0.16805    -0.26096
   Fish         -0.08108    -0.36947     0.83781     0.26614
   Cereal       -0.58070     0.90857    -0.63111     0.04655
   Starch        0.41593    -0.55448     0.83781     0.08441
   Nuts         -0.76817     0.90857    -0.37089     0.37497
   FruitVeg     -0.24045     0.23197     0.20920     1.00000

                     Inter-Cluster Correlations

   Cluster          1           2           3           4
   --------------------------------------------------------
      1          1.00000    -0.74230     0.19984    -0.24045
      2         -0.74230     1.00000    -0.55141     0.23197
      3          0.19984    -0.55141     1.00000     0.20920
      4         -0.24045     0.23197     0.20920     1.00000
Figure 78.3. Cluster Correlations and Intercorrelations
PROC VARCLUS next displays the summary table of statistics for the cluster history (Figure 78.4). The first three columns give the number of clusters, the total variation explained by clusters, and the proportion of variation explained by clusters. As displayed in Figure 78.4, when the number of allowable clusters is two, the total variation explained is 3.9607, and the cumulative proportion of variation explained by two clusters is 0.4401. When the number of clusters increases to three, the proportion of explained variance increases to 0.5880. When four clusters are computed, the explained variation is 0.6920.

              Oblique Centroid Component Cluster Analysis

                Total        Proportion      Minimum      Minimum     Maximum
   Number     Variation     of Variation    Proportion   R-squared    1-R**2
     of       Explained      Explained      Explained      for a      Ratio for
   Clusters   by Clusters   by Clusters    by a Cluster   Variable    a Variable
   -------------------------------------------------------------------------------
      1        0.732343        0.0814         0.0814       0.0875
      2        3.960717        0.4401         0.3743       0.1007       1.0213
      3        5.291887        0.5880         0.5433       0.3928       0.7978
      4        6.227874        0.6920         0.5433       0.4288       0.7847
Figure 78.4. Final Cluster Summary Table from the VARCLUS Procedure
Figure 78.4 also displays the minimum proportion of variance explained by a cluster, the minimum $R^2$ for a variable, and the maximum $(1 - R^2)$ ratio for a variable. The last quantity is the ratio of the value $1 - R^2$ for a variable's own cluster to the value $1 - R^2$ for its nearest cluster.
The following statements produce a tree diagram of the cluster structure created by PROC VARCLUS. The AXIS1 statement suppresses the label for the vertical axis, which would otherwise be "Name of Variable or Cluster".

   axis1 label=none;
   proc tree data=tree horizontal vaxis=axis1;
      height _propor_;
   run;
Next, the TREE procedure is invoked using the SAS data set Tree, created by the OUTTREE= option in the preceding PROC VARCLUS statement. The HORIZONTAL option orients the tree diagram horizontally. The VAXIS option associates the vertical axis with the AXIS1 statement. The HEIGHT statement specifies the use of the variable _PROPOR_ (the proportion of variance explained) as the height variable.

Figure 78.5 shows how the clusters are created. The ordered variable names are displayed on the vertical axis. The horizontal axis displays the proportion of variance explained at each clustering level.
Figure 78.5. Horizontal Tree Diagram from PROC TREE
As you look from left to right in the diagram, objects and clusters are progressively joined until a single, all-encompassing cluster is formed at the right (or root) of the diagram. Clusters exist at each level of the diagram, and every vertical line connects leaves and branches into progressively larger clusters. For example, when the variables are formed into three clusters, one cluster contains the variables RedMeat, WhiteMeat, Eggs, and Milk; the second cluster contains the
variables Fish and Starch; the third cluster contains the variables Cereal, Nuts, and FruitVeg. The proportion of variance explained at that level is 0.5880 (from Figure 78.4). At the next stage of clustering, the third cluster is split as the variable FruitVeg forms the fourth cluster; the proportion of variance explained is 0.6920.
Syntax

The following statements are available in PROC VARCLUS.

   PROC VARCLUS < options > ;
      VAR variables ;
      SEED variables ;
      PARTIAL variables ;
      WEIGHT variable ;
      FREQ variable ;
      BY variables ;

Usually you need only the VAR statement in addition to the PROC VARCLUS statement. The following sections give detailed syntax information for each of the statements, beginning with the PROC VARCLUS statement. The remaining statements are listed in alphabetical order.
PROC VARCLUS Statement

PROC VARCLUS < options > ;

The PROC VARCLUS statement starts the VARCLUS procedure. By default, VARCLUS clusters the numeric variables in the most recently created SAS data set, starting with one cluster and splitting clusters until all clusters have at most one eigenvalue greater than one.

VARCLUS chooses a cluster to split based on two options: MAXEIGEN= and PROPORTION=.

1. If you specify either or both of these two options, then only the specified options affect the choice of the cluster to split.

2. If you specify neither of these options, the criterion for choice of cluster to split depends on the CENTROID option:

   (a) If you specify CENTROID, VARCLUS splits the cluster with the smallest percentage of variation explained by its cluster component, as if you had specified the PROPORTION= option.

   (b) If you do not specify CENTROID, VARCLUS splits the cluster with the largest eigenvalue associated with the second principal component, as if you had specified the MAXEIGEN= option.

The final number of clusters is controlled by three options: MAXCLUSTERS=, MAXEIGEN=, and PROPORTION=.
1. If you specify any of these three options, then only the options you specify affect the final number of clusters.

2. If you specify none of these options, VARCLUS continues to split clusters until the default splitting criterion is satisfied. The default splitting criterion depends on the CENTROID option:

   (a) If you specify CENTROID, the default splitting criterion is PROPORTION=0.75.

   (b) If you do not specify CENTROID, splitting is based on the MAXEIGEN= criterion, with a default depending on the COVARIANCE option:

      i. for analyzing a correlation matrix (no COVARIANCE option), the default value for MAXEIGEN= is one.

      ii. for analyzing a covariance matrix (using the COVARIANCE option), the default value for MAXEIGEN= is the average variance of the variables being clustered.

VARCLUS continues to split clusters until any of the following conditions holds:

• the number of clusters equals the value specified for MAXCLUSTERS=.
• no cluster qualifies for splitting according to the MAXEIGEN= or PROPORTION= criteria.
• a cluster was chosen for splitting, but after iteratively reassigning variables to clusters, one of the clusters has no members.
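As a sketch of how these rules combine in practice (the Protein data set is created in the "Getting Started" section; the particular cutoff values here are arbitrary illustrative choices, not recommendations):

   /* split while any cluster has a second eigenvalue above 0.7, */
   /* but never beyond five clusters                             */
   proc varclus data=Protein maxeigen=0.7 maxclusters=5;
      var RedMeat--FruitVeg;
   run;

Splitting stops as soon as no cluster has a second eigenvalue greater than 0.7, or when five clusters are reached, whichever comes first.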
Table 78.1 summarizes some of the options available in the PROC VARCLUS statement.

Table 78.1. Options Available in the PROC VARCLUS Statement

   Task                                 Options
   ------------------------------------------------------
   Specify data sets                    DATA=
                                        OUTSTAT=
                                        OUTTREE=

   Determine the number of clusters     MAXCLUSTERS=
                                        MINCLUSTERS=
                                        MAXEIGEN=
                                        PROPORTION=

   Specify cluster formation            CENTROID
                                        COVARIANCE
                                        HIERARCHY
                                        INITIAL=
                                        MAXITER=
                                        MAXSEARCH=
                                        MULTIPLEGROUP
                                        RANDOM=

   Control output                       CORR
                                        NOPRINT
                                        SHORT
                                        SIMPLE
                                        SUMMARY
                                        TRACE

   Omit intercept                       NOINT

   Specify divisor for variances        VARDEF=
The following list gives details on these options. The list is in alphabetical order.

CENTROID
   uses centroid components rather than principal components. You should specify centroid components if you want the cluster components to be unweighted averages of the standardized variables (the default) or the unstandardized variables (if you specify the COVARIANCE option). It is possible to obtain locally optimal clusterings in which a variable is not assigned to the cluster component with which it has the highest squared correlation. You cannot specify both the CENTROID and MAXEIGEN= options.

CORR
C
   displays the correlation matrix.
COVARIANCE
COV
   analyzes the covariance matrix instead of the correlation matrix. The COVARIANCE option causes variables with a large variance to have more effect on the cluster components than variables with a small variance.

DATA=SAS-data-set
   specifies the input data set to be analyzed. The data set can be an ordinary SAS data set or TYPE=CORR, UCORR, COV, UCOV, FACTOR, or SSCP. If you do not specify the DATA= option, the most recently created SAS data set is used. See Appendix A, "Special SAS Data Sets," for more information on types of SAS data sets.

HIERARCHY
HI
   requires the clusters at different levels to maintain a hierarchical structure. To draw a tree diagram, use the OUTTREE= option and the TREE procedure.

INITIAL=GROUP
INITIAL=INPUT
INITIAL=RANDOM
INITIAL=SEED
   specifies the method for initializing the clusters. If the INITIAL= option is omitted and the MINCLUSTERS= option is greater than 1, the initial cluster components are obtained by extracting the required number of principal components and performing an orthoblique rotation (raw quartimax rotation on the eigenvectors; Harris and Kaiser, 1964). The following list describes the values for the INITIAL= option:

   GROUP    obtains the cluster membership of each variable from an observation in the DATA= data set where the _TYPE_ variable has a value of "GROUP". In this observation, the variables to be clustered must each have an integer value ranging from one to the number of clusters. You can use this option only if the DATA= data set is a TYPE=CORR, UCORR, COV, UCOV, or FACTOR data set. You can use a data set created either by a previous run of PROC VARCLUS or in a DATA step.

   INPUT    obtains scoring coefficients for the cluster components from observations in the DATA= data set where the _TYPE_ variable has a value of "SCORE". You can use this option only if the DATA= data set is a TYPE=CORR, UCORR, COV, UCOV, or FACTOR data set. You can use scoring coefficients from the FACTOR procedure or a previous run of PROC VARCLUS, or you can enter other coefficients in a DATA step.

   RANDOM   assigns variables randomly to clusters.

   SEED     initializes each cluster component to be one of the variables named in the SEED statement. Each variable listed in the SEED statement becomes the sole member of a cluster, and the other variables are initially unassigned. If you do not specify the SEED statement, the first MINCLUSTERS= variables in the VAR statement are used as seeds.
MAXCLUSTERS=n
MAXC=n
   specifies the largest number of clusters desired. The default value is the number of variables. VARCLUS stops splitting clusters after the number of clusters reaches the value of the MAXCLUSTERS= option, regardless of what other splitting options are specified.

MAXEIGEN=n
   specifies that when choosing a cluster to split, VARCLUS should choose the cluster with the largest second eigenvalue, provided that its second eigenvalue is greater than the MAXEIGEN= value. The MAXEIGEN= option cannot be used with the CENTROID or MULTIPLEGROUP options.

   If you do not specify MAXEIGEN=, then:

   • If you specify PROPORTION=, CENTROID, or MULTIPLEGROUP, cluster splitting does not depend on the second eigenvalue.
   • Otherwise, if you specify MAXCLUSTERS=, the default value for MAXEIGEN= is zero.
   • Otherwise, the default value for MAXEIGEN= is either 1.0 if the correlation matrix is analyzed, or the average variance if the COVARIANCE option is specified.

   If you specify both MAXEIGEN= and MAXCLUSTERS=, the number of clusters will never exceed the value of the MAXCLUSTERS= option. If you specify both MAXEIGEN= and PROPORTION=, VARCLUS first looks for a cluster to split based on the MAXEIGEN= criterion. If no cluster meets that criterion, VARCLUS then looks for a cluster to split based on the PROPORTION= criterion.

MAXITER=n
   specifies the maximum number of iterations during the NCS phase. The default value is 1 if you specify the CENTROID option; the default is 10 otherwise.

MAXSEARCH=n
   specifies the maximum number of iterations during the search phase. The default is 1000 divided by the number of variables.

MINCLUSTERS=n
MINC=n
   specifies the smallest number of clusters desired. The default value is 2 for INITIAL=RANDOM or INITIAL=SEED; otherwise, VARCLUS begins with one cluster and tries to split it in accordance with the PROPORTION= or MAXEIGEN= options.
MULTIPLEGROUP
MG
   performs a multiple group component analysis (Harman 1976). You specify which variables belong to which clusters. No clusters are split, and no variables are reassigned to a different cluster. The input data set must be TYPE=CORR, UCORR, COV, UCOV, FACTOR, or SSCP and must contain an observation with _TYPE_="GROUP" defining the variable groups. Specifying the MULTIPLEGROUP option is equivalent to specifying all of the following options: INITIAL=GROUP, MINC=1, MAXITER=0, MAXSEARCH=0, PROPORTION=0, and MAXEIGEN= set to a very large number.

NOINT
   requests that no intercept be used; covariances or correlations are not corrected for the mean. If you specify the NOINT option, the OUTSTAT= data set is TYPE=UCORR.

NOPRINT
   suppresses displayed output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 14, "Using the Output Delivery System."

OUTSTAT=SAS-data-set
   creates an output data set to contain statistics including means, standard deviations, correlations, cluster scoring coefficients, and the cluster structure. If you want to create a permanent SAS data set, you must specify a two-level name. The OUTSTAT= data set is TYPE=UCORR if the NOINT option is specified. For more information on permanent SAS data sets, refer to "SAS Files" and "DATA Step Concepts" in SAS Language Reference: Concepts. For information on types of SAS data sets, see Appendix A, "Special SAS Data Sets."

OUTTREE=SAS-data-set
   creates an output data set to contain information on the tree structure that can be used by the TREE procedure to display a tree diagram. The OUTTREE= option implies the HIERARCHY option. See Example 78.1 for use of the OUTTREE= option. If you want to create a permanent SAS data set, you must specify a two-level name. For more information on permanent SAS data sets, refer to "SAS Files" and "DATA Step Concepts" in SAS Language Reference: Concepts.

PROPORTION=n
PERCENT=n
   specifies that when choosing a cluster to split, VARCLUS should choose the cluster with the smallest proportion of variation explained, provided that its proportion of variation explained is less than the PROPORTION= value. Values greater than 1.0 are considered to be percentages, so PROPORTION=0.75 and PERCENT=75 are equivalent. However, if you specify both MAXEIGEN= and PROPORTION=, VARCLUS first looks for a cluster to split based on the MAXEIGEN= criterion. If no cluster meets that criterion, VARCLUS then looks for a cluster to split based on the PROPORTION= criterion.
   If you do not specify PROPORTION=, then:

   • If you specify MAXEIGEN=, cluster splitting does not depend on the proportion of variation explained.
   • Otherwise, if you specify CENTROID and MAXCLUSTERS=, the default value for PROPORTION= is one.
   • Otherwise, if you specify CENTROID without MAXCLUSTERS=, the default value is PROPORTION=0.75 or PERCENT=75.
   • Otherwise, cluster splitting does not depend on the proportion of variation explained.

   If you specify both PROPORTION= and MAXCLUSTERS=, the number of clusters will never exceed the value of the MAXCLUSTERS= option.

RANDOM=n
   specifies a positive integer as a starting value for the pseudo-random number generator used with INITIAL=RANDOM. If you do not specify the RANDOM= option, the time of day is used to initialize the pseudo-random number sequence.

SHORT
   suppresses display of the cluster structure, scoring coefficient, and intercluster correlation matrices.

SIMPLE
S
   displays means and standard deviations.

SUMMARY
   suppresses all default displayed output except the final summary table.

TRACE
   lists the cluster to which each variable is assigned during the iterations.

VARDEF=DF
VARDEF=N
VARDEF=WDF
VARDEF=WEIGHT | WGT
   specifies the divisor to be used in the calculation of variances and covariances. The default value is VARDEF=DF. The values and associated divisors are displayed in the following table.

   Value          Divisor                      Formula
   ---------------------------------------------------------
   DF             degrees of freedom           n − i
   N              number of observations       n
   WDF            sum of weights minus one     (Σ wj) − 1
   WEIGHT | WGT   sum of weights               Σ wj

   In the preceding table, i = 0 if the NOINT option is specified, and i = 1 otherwise.
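For instance, a reproducible random initialization can be sketched as follows; the seed value 12345 and the MAXSEARCH= limit are arbitrary illustrative choices (the data set phys8 is defined in Example 78.1):

   proc varclus data=phys8 initial=random random=12345 maxsearch=50;
   run;

Because INITIAL=RANDOM can land in different local optima, such runs are often repeated with several seeds, keeping the solution that explains the most variance.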
BY Statement

BY variables ;

You can specify a BY statement with PROC VARCLUS to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data using the SORT procedure with a similar BY statement (see the sketch after this list).
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the VARCLUS procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
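The following is a minimal sketch of the PROC SORT approach; the data set name Scores, the BY variable Region, and the analysis variables x1-x10 are placeholders, not part of the original documentation:

   proc sort data=Scores;        /* sort by the BY variable first      */
      by Region;
   run;

   proc varclus data=Scores;     /* one clustering analysis per Region */
      by Region;
      var x1-x10;                /* hypothetical analysis variables    */
   run;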
FREQ Statement

FREQ variable ;

If a variable in your data set represents the frequency of occurrence for the other values in the observation, include the variable's name in a FREQ statement. The procedure then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation. If the value of the FREQ variable is less than 1, the observation is not used in the analysis. Only the integer portion of the value is used. The total number of observations is considered to be equal to the sum of the values of the FREQ variable.
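A minimal sketch of the FREQ statement; the data set Survey and the variables Count and Item1-Item20 are invented for illustration:

   proc varclus data=Survey;
      freq Count;            /* each observation is counted Count times */
      var Item1-Item20;
   run;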
PARTIAL Statement

PARTIAL variables ;

If you want to base the clustering on partial correlations, list the variables to be partialled out in the PARTIAL statement.
SEED Statement

SEED variables ;

The SEED statement specifies variables to be used as seeds to initialize the clusters. It is not necessary to use INITIAL=SEED if the SEED statement is present, but if any other INITIAL= option is specified, the SEED statement is ignored.
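For example, a hedged sketch that starts two clusters from the variables height and weight (the data set phys8 is defined in Example 78.1):

   proc varclus data=phys8;
      seed height weight;    /* height and weight each seed one cluster */
   run;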
VAR Statement

VAR variables ;

The VAR statement specifies the variables to be clustered. If you do not specify the VAR statement and do not specify TYPE=SSCP, all numeric variables not listed in other statements (except the SEED statement) are processed. The default VAR variable list does not include the variable INTERCEPT if the DATA= data set is TYPE=SSCP. If the variable INTERCEPT is explicitly specified in the VAR statement with a TYPE=SSCP data set, the NOINT option is enabled.
WEIGHT Statement

WEIGHT variable ;

If you want to specify relative weights for each observation in the input data set, place the weights in a variable in the data set and specify the name in a WEIGHT statement. This is often done when the variance associated with each observation is different and the values of the weight variable are proportional to the reciprocals of the variances. The WEIGHT variable can take nonintegral values. An observation is used in the analysis only if the value of the WEIGHT variable is greater than zero.
Details

Missing Values

Observations containing missing values are omitted from the analysis.
Using PROC VARCLUS

Default options for PROC VARCLUS often provide satisfactory results. If you want to change the final number of clusters, use the MAXCLUSTERS=, MAXEIGEN=, or PROPORTION= options. The MAXEIGEN= and PROPORTION= options usually produce similar results but occasionally cause different clusters to be selected for splitting. The MAXEIGEN= option tends to choose clusters with a large number of variables, while the PROPORTION= option is more likely to select a cluster with a small number of variables.
Execution time

PROC VARCLUS usually requires more computer time than principal factor analysis, but it can be faster than some of the iterative factoring methods. If you have more than 30 variables, you may want to reduce execution time by one or more of the following methods:

• Specify the MINCLUSTERS= and MAXCLUSTERS= options if you know how many clusters you want.
• Specify the HIERARCHY option.
• Specify the SEED statement if you have some prior knowledge of what clusters to expect.

If computer time is not a limiting factor, you may want to try one of the following methods to obtain a better solution:

• If the clustering algorithm has not converged, specify larger values for MAXITER= and MAXSEARCH=.
• Try several factoring and rotation methods with PROC FACTOR to use as input to PROC VARCLUS.
• Run PROC VARCLUS several times, specifying INITIAL=RANDOM.
Output Data Sets

OUTSTAT= Data Set

The OUTSTAT= data set is TYPE=CORR, and it can be used as input to the SCORE procedure or a subsequent run of PROC VARCLUS. The variables it contains are

• BY variables
• _NCL_, a numeric variable giving the number of clusters
• _TYPE_, a character variable indicating the type of statistic the observation contains
• _NAME_, a character variable containing a variable name or a cluster name, which is of the form CLUSn where n is the number of the cluster
• the variables that are clustered

The values of the _TYPE_ variable are listed in the following table.

Table 78.2. _TYPE_ Value and Statistic

   _TYPE_      Contents
   ----------------------------------------------------------------------
   MEAN        means
   STD         standard deviations
   USTD        uncorrected standard deviations, produced when the NOINT
               option is specified
   N           number of observations
   CORR        correlations
   UCORR       uncorrected correlation matrix, produced when the NOINT
               option is specified
   MEMBERS     number of members in each cluster
   VAREXP      variance explained by each cluster
   PROPOR      proportion of variance explained by each cluster
   GROUP       number of the cluster to which each variable belongs
   RSQUARED    squared multiple correlation of each variable with its
               cluster component
   SCORE       standardized scoring coefficients
   USCORE      scoring coefficients to be applied without subtracting the
               mean from the raw variables, produced when the NOINT
               option is specified
   STRUCTUR    cluster structure
   CCORR       correlations between cluster components

The observations with _TYPE_="MEAN", "STD", "N", and "CORR" have missing values for the _NCL_ variable. All other values of the _TYPE_ variable are repeated for each cluster solution, with different solutions distinguished by the value of the _NCL_ variable. If you want to use the OUTSTAT= data set with the SCORE procedure, you can use a DATA step to select observations with the _NCL_ variable missing or equal to the desired number of clusters.
   data Coef2;
      set Coef;
      /* keep the descriptive statistics (_NCL_ missing) and the */
      /* three-cluster solution                                  */
      if _ncl_ = . or _ncl_ = 3;
      drop _ncl_;
   run;

   proc score data=NewScore score=Coef2;
   run;
PROC SCORE standardizes the new data by subtracting the original variable means that are stored in the _TYPE_='MEAN' observations, and dividing by the original variable standard deviations from the _TYPE_='STD' observations. Then PROC SCORE multiplies the standardized variables by the coefficients from the _TYPE_='SCORE' observations to get the cluster scores.
OUTTREE= Data Set

The OUTTREE= data set contains one observation for each variable clustered plus one observation for each cluster of two or more variables, that is, one observation for each node of the cluster tree. The total number of output observations is between n and 2n − 1, where n is the number of variables clustered.

The variables in the OUTTREE= data set are

• BY variables, if any
• _NAME_, a character variable giving the name of the node. If the node is a cluster, the name is CLUSn where n is the number of the cluster. If the node is a single variable, the variable name is used.
• _PARENT_, a character variable giving the value of _NAME_ of the parent of the node. If the node is the root of the tree, _PARENT_ is blank.
• _LABEL_, a character variable giving the label of the node. If the node is a cluster, the label is CLUSn where n is the number of the cluster. If the node is a single variable, the variable label is used.
• _NCL_, the number of clusters.
• _VAREXP_, the total variance explained by the clusters at the current level of the tree.
• _PROPOR_, the total proportion of variance explained by the clusters at the current level of the tree.
• _MINPRO_, the minimum proportion of variance explained by a cluster component.
• _MAXEIG_, the maximum second eigenvalue of a cluster.
Computational Resources

Let

   n = number of observations
   v = number of variables
   c = number of clusters

It is assumed that, at each stage of clustering, the clusters all contain the same number of variables.

Time

The time required for PROC VARCLUS to analyze a given data set varies greatly depending on the number of clusters requested, the number of iterations in both the alternating least-squares and search phases, and whether centroid or principal components are used.

The time required to compute the correlation matrix is roughly proportional to $nv^2$. Default cluster initialization requires time roughly proportional to $v^3$. Any other method of initialization requires time roughly proportional to $cv^2$.

In the alternating least-squares phase, each iteration requires time roughly proportional to $cv^2$ if centroid components are used or

   $\left( c + 5\,\frac{v}{c^2} \right) v^2$

if principal components are used.

In the search phase, each iteration requires time roughly proportional to $v^3/c$ if centroid components are used or $v^4/c^2$ if principal components are used. The HIERARCHY option speeds up each iteration after the first split by as much as $c/2$.

Memory

The amount of memory, in bytes, needed by PROC VARCLUS is approximately

   $v^2 + 2vc + 20v + 15c$
Interpreting VARCLUS Procedure Output

Because PROC VARCLUS is a type of oblique component analysis, its output is similar to the output from the FACTOR procedure for oblique rotations. The scoring coefficients have the same meaning in both PROC VARCLUS and PROC FACTOR; they are coefficients applied to the standardized variables to compute component scores. The cluster structure is analogous to the factor structure containing the correlations between each variable and each cluster component. A cluster pattern is not displayed because it would be the same as the cluster structure, except that zeros would appear
in the same places in which zeros appear in the scoring coefficients. The intercluster correlations are analogous to interfactor correlations; they are the correlations among cluster components.

PROC VARCLUS also displays a cluster summary and a cluster listing. The cluster summary gives the number of variables in each cluster and the variation explained by the cluster component. The latter is similar to the variation explained by a factor but includes contributions from only the variables in that cluster rather than from all variables, as in PROC FACTOR. The proportion of variance explained is obtained by dividing the variance explained by the total variance of variables in the cluster. If the cluster contains two or more variables and the CENTROID option is not used, the second largest eigenvalue of the cluster is also displayed.

The cluster listing gives the variables in each cluster. Two squared correlations are calculated for each cluster. The column labeled "Own Cluster" gives the squared correlation of the variable with its own cluster component. This value should be higher than the squared correlation with any other cluster unless an iteration limit has been exceeded or the CENTROID option has been used. The larger the squared correlation is, the better. The column labeled "Next Closest" contains the next highest squared correlation of the variable with a cluster component. This value is low if the clusters are well separated. The column headed "1-R**2 Ratio" gives the ratio of one minus the "Own Cluster" $R^2$ to one minus the "Next Closest" $R^2$. A small "1-R**2 Ratio" indicates a good clustering.
Displayed Output

The following items are displayed for each cluster solution unless the NOPRINT or SUMMARY option is specified. The CLUSTER SUMMARY table includes

• the Cluster number
• Members, the number of members in the cluster
• Cluster Variation of the variables in the cluster
• Variation Explained by the cluster component. This statistic is based only on the variables in the cluster rather than on all variables.
• Proportion Explained, the result of dividing the variation explained by the cluster variation
• Second Eigenvalue, the second largest eigenvalue of the cluster. This is displayed if the cluster contains more than one variable and the CENTROID option is not specified.

PROC VARCLUS also displays

• Total variation explained, the sum across clusters of the variation explained by each cluster
• Proportion, the total explained variation divided by the total variation of all the variables
The cluster listing includes

• Variable, the variables in each cluster
• R-squared with Own Cluster, the squared correlation of the variable with its own cluster component; and R-squared with Next Closest, the next highest squared correlation of the variable with a cluster component. Own Cluster values should be higher than the $R^2$ with any other cluster unless an iteration limit is exceeded or you specify the CENTROID option. Next Closest should be a low value if the clusters are well separated.
• 1-R**2 Ratio, the ratio of one minus the value in the Own Cluster column to one minus the value in the Next Closest column. The occurrence of low ratios indicates well-separated clusters.

If the SHORT option is not specified, PROC VARCLUS also displays

• Standardized Scoring Coefficients, standardized regression coefficients for predicting cluster components from variables
• Cluster Structure, the correlations between each variable and each cluster component
• Inter-Cluster Correlations, the correlations between the cluster components

If the analysis includes partitions for two or more numbers of clusters, a final summary table is displayed. Each row of the table corresponds to one partition. The columns include

• Number of Clusters
• Total Variation Explained by Clusters
• Proportion of Variation Explained by Clusters
• Minimum Proportion (of variation) Explained by a Cluster
• Maximum Second Eigenvalue in a Cluster
• Minimum R-squared for a Variable
• Maximum 1-R**2 Ratio for a Variable
ODS Table Names

PROC VARCLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, "Using the Output Delivery System."
Table 78.3. ODS Tables Produced in PROC VARCLUS

   ODS Table Name      Description                         Option
   ------------------------------------------------------------------
   ClusterQuality      Cluster quality                     default
   ClusterStructure    Cluster structure                   default
   ClusterSummary      Cluster summary                     default
   ConvergenceStatus   Convergence status                  default
   Corr                Correlations                        CORR
   DataOptSummary      Data and options summary table      default
   InterClusterCorr    Inter-cluster correlations          default
   IterHistory         Iteration history                   TRACE
   RSquare             Cluster Rsq                         default
   SimpleStatistics    Simple statistics                   SIMPLE
   StdScoreCoef        Standardized scoring coefficients   default
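As an illustration of these names in use, here is a minimal sketch in which the output data set name Quality is an arbitrary choice (the data set phys8 is defined in Example 78.1):

   ods output ClusterQuality=Quality;   /* capture the cluster quality table */
   proc varclus data=phys8 maxc=8 summary;
   run;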
Example

Example 78.1. Correlations among Physical Variables

The following data are correlations among eight physical variables as given by Harman (1976). The first PROC VARCLUS run clusters on the basis of principal components, the second run clusters on the basis of centroid components. The third analysis is hierarchical, and the TREE procedure is used to display a tree diagram. The results of the analyses follow.

data phys8(type=corr);
   title 'Eight Physical Measurements on 305 School Girls';
   title2 'Harman: Modern Factor Analysis, 3rd Ed, p22';
   label height='Height'
         arm_span='Arm Span'
         forearm='Length of Forearm'
         low_leg='Length of Lower Leg'
         weight='Weight'
         bit_diam='Bitrochanteric Diameter'
         girth='Chest Girth'
         width='Chest Width';
   input _name_ $ 1-8
         (height arm_span forearm low_leg weight bit_diam girth width)(7.);
   _type_='corr';
   datalines;
height  1.0    .846   .805   .859   .473   .398   .301   .382
arm_span.846   1.0    .881   .826   .376   .326   .277   .415
forearm .805   .881   1.0    .801   .380   .319   .237   .345
low_leg .859   .826   .801   1.0    .436   .329   .327   .365
weight  .473   .376   .380   .436   1.0    .762   .730   .629
bit_diam.398   .326   .319   .329   .762   1.0    .583   .577
girth   .301   .277   .237   .327   .730   .583   1.0    .539
width   .382   .415   .345   .365   .629   .577   .539   1.0
;

proc varclus data=phys8;
run;
The PROC VARCLUS statement invokes the procedure. By default, PROC VARCLUS clusters on the basis of principal components.

Output 78.1.1. Principal Cluster Components: Cluster Summary

          Eight Physical Measurements on 305 School Girls
            Harman: Modern Factor Analysis, 3rd Ed, p22

            Oblique Principal Component Cluster Analysis

                    Cluster Summary for 1 Cluster

                          Cluster    Variation   Proportion     Second
   Cluster    Members    Variation   Explained    Explained   Eigenvalue
   ----------------------------------------------------------------------
      1          8           8        4.67288       0.5841      1.7710

   Total variation explained = 4.67288   Proportion = 0.5841

   Cluster 1 will be split.

                    Cluster Summary for 2 Clusters

                          Cluster    Variation   Proportion     Second
   Cluster    Members    Variation   Explained    Explained   Eigenvalue
   ----------------------------------------------------------------------
      1          4           4        3.509218      0.8773      0.2361
      2          4           4        2.917284      0.7293      0.4764

   Total variation explained = 6.426502   Proportion = 0.8033

                                2 Clusters

                             R-squared with
                           ------------------
                             Own       Next     1-R**2
   Cluster    Variable    Cluster    Closest     Ratio    Label
   ----------------------------------------------------------------------------
   Cluster 1  height       0.8777     0.2088    0.1545    Height
              arm_span     0.9002     0.1658    0.1196    Arm Span
              forearm      0.8661     0.1413    0.1560    Length of Forearm
              low_leg      0.8652     0.1829    0.1650    Length of Lower Leg
   ----------------------------------------------------------------------------
   Cluster 2  weight       0.8477     0.1974    0.1898    Weight
              bit_diam     0.7386     0.1341    0.3019    Bitrochanteric Diameter
              girth        0.6981     0.0929    0.3328    Chest Girth
              width        0.6329     0.1619    0.4380    Chest Width

   No cluster meets the criterion for splitting.
As displayed in Output 78.1.1, the cluster component (by default, the first principal component) explains 58.41% of the total variation in the 8 variables. The cluster is split because the second eigenvalue is greater than 1 (the default value of the MAXEIGEN= option). The two resulting cluster components explain 80.33% of the variation in the original variables. The cluster summary table shows that the variables height, arm_span, forearm, and low_leg have been assigned to the first cluster, and that the variables weight, bit_diam, girth, and width have been assigned to the second cluster.
Output 78.1.2. Standard Scoring Coefficients and Cluster Structure Table

            Oblique Principal Component Cluster Analysis

                   Standardized Scoring Coefficients

                                              Cluster
                                            1           2
   ----------------------------------------------------------
   height    Height                     0.266977    0.000000
   arm_span  Arm Span                   0.270377    0.000000
   forearm   Length of Forearm          0.265194    0.000000
   low_leg   Length of Lower Leg        0.265057    0.000000
   weight    Weight                     0.000000    0.315597
   bit_diam  Bitrochanteric Diameter    0.000000    0.294591
   girth     Chest Girth                0.000000    0.286407
   width     Chest Width                0.000000    0.272710

                          Cluster Structure

                                              Cluster
                                            1           2
   ----------------------------------------------------------
   height    Height                     0.936881    0.456908
   arm_span  Arm Span                   0.948813    0.407210
   forearm   Length of Forearm          0.930624    0.375865
   low_leg   Length of Lower Leg        0.930142    0.427715
   weight    Weight                     0.444281    0.920686
   bit_diam  Bitrochanteric Diameter    0.366201    0.859404
   girth     Chest Girth                0.304779    0.835529
   width     Chest Width                0.402430    0.795572
The standardized scoring coefficients in Output 78.1.2 show that each cluster component has similar scores for each of its associated variables. This suggests that the principal cluster component solution should be similar to the centroid cluster component solution, which follows in the next PROC VARCLUS run. The cluster structure table displays high correlations between the variables and their own cluster component. The correlations between the variables and the opposite cluster component are all moderate.

Output 78.1.3. Inter-Cluster Correlations

            Oblique Principal Component Cluster Analysis

                     Inter-Cluster Correlations

          Cluster           1            2
          -----------------------------------
             1           1.00000      0.44513
             2           0.44513      1.00000
The intercluster correlation table shows that the cluster components are moderately correlated with ρ = 0.44513.

In the following statements, the CENTROID option in the PROC VARCLUS statement specifies that cluster centroids be used as the basis for clustering.

   proc varclus data=phys8 centroid;
   run;
Output 78.1.4. Centroid Cluster Components: Cluster Summary

              Oblique Centroid Component Cluster Analysis

                    Cluster Summary for 1 Cluster

                          Cluster    Variation   Proportion
   Cluster    Members    Variation   Explained    Explained
   ----------------------------------------------------------
      1          8           8        4.631         0.5789

   Total variation explained = 4.631   Proportion = 0.5789

                    Cluster Summary for 2 Clusters

                          Cluster    Variation   Proportion
   Cluster    Members    Variation   Explained    Explained
   ----------------------------------------------------------
      1          4           4        3.509         0.8773
      2          4           4        2.91          0.7275

   Total variation explained = 6.419   Proportion = 0.8024

                                2 Clusters

                             R-squared with
                           ------------------
                             Own       Next     1-R**2
   Cluster    Variable    Cluster    Closest     Ratio    Label
   ----------------------------------------------------------------------------
   Cluster 1  height       0.8778     0.2075    0.1543    Height
              arm_span     0.8994     0.1669    0.1208    Arm Span
              forearm      0.8663     0.1410    0.1557    Length of Forearm
              low_leg      0.8658     0.1824    0.1641    Length of Lower Leg
   ----------------------------------------------------------------------------
   Cluster 2  weight       0.8368     0.1975    0.2033    Weight
              bit_diam     0.7335     0.1341    0.3078    Bitrochanteric Diameter
              girth        0.6988     0.0929    0.3321    Chest Girth
              width        0.6473     0.1618    0.4207    Chest Width
The first cluster component, which, in the centroid method, is an unweighted sum of the standardized variables, explains 57.89% of the variation in the data. This value is near the maximum possible variance explained, 58.41%, which is attained by the first principal component (Output 78.1.1). The centroid clustering algorithm splits the variables into the same two clusters created in the principal component method. Recall that this outcome was suggested by the similar standardized scoring coefficients in the principal cluster component solution. The default behavior in the centroid method is to split any cluster with less than 75% of the total cluster variance explained by the centroid component. In the next step, the second cluster, with a component that explains only 72.75% of the total variation of the cluster, is split. In the R-squared table for two clusters, the width variable has a weaker relation to its cluster than any other variable; in the three cluster solution this variable is in a cluster of its own.
Output 78.1.5. Standardized Scoring Coefficients

              Oblique Centroid Component Cluster Analysis

                   Standardized Scoring Coefficients

                                              Cluster
                                            1           2
   ----------------------------------------------------------
   height    Height                     0.266918    0.000000
   arm_span  Arm Span                   0.266918    0.000000
   forearm   Length of Forearm          0.266918    0.000000
   low_leg   Length of Lower Leg        0.266918    0.000000
   weight    Weight                     0.000000    0.293105
   bit_diam  Bitrochanteric Diameter    0.000000    0.293105
   girth     Chest Girth                0.000000    0.293105
   width     Chest Width                0.000000    0.293105
Each cluster component (Output 78.1.5) is an unweighted average of the cluster's standardized variables. Thus, the coefficients for each of the cluster's associated variables are identical in the centroid cluster component solution.

Output 78.1.6. Cluster Summary for Three Clusters

              Oblique Centroid Component Cluster Analysis

                    Cluster Summary for 3 Clusters

                          Cluster    Variation   Proportion
   Cluster    Members    Variation   Explained    Explained
   ----------------------------------------------------------
      1          4           4        3.509         0.8773
      2          3           3        2.383333      0.7944
      3          1           1        1              1.0000

   Total variation explained = 6.892333   Proportion = 0.8615
                                3 Clusters

                             R-squared with
                           ------------------
                             Own       Next     1-R**2
   Cluster    Variable    Cluster    Closest     Ratio    Label
   ----------------------------------------------------------------------------
   Cluster 1  height       0.8778     0.1921    0.1513    Height
              arm_span     0.8994     0.1722    0.1215    Arm Span
              forearm      0.8663     0.1225    0.1524    Length of Forearm
              low_leg      0.8658     0.1668    0.1611    Length of Lower Leg
   ----------------------------------------------------------------------------
   Cluster 2  weight       0.8685     0.3956    0.2175    Weight
              bit_diam     0.7691     0.3329    0.3461    Bitrochanteric Diameter
              girth        0.7482     0.2905    0.3548    Chest Girth
   ----------------------------------------------------------------------------
   Cluster 3  width        1.0000     0.4259    0.0000    Chest Width
The centroid method stops at the three cluster solution. As displayed in Output 78.1.6 and Output 78.1.7, the three centroid components account for 86.15% of the variability in the eight variables, and all cluster components account for at least 79.44% of the total variation in the corresponding cluster. Additionally, the smallest squared correlation between the variables and their own cluster component is 0.7482.
Output 78.1.7. Cluster Quality Table

              Oblique Centroid Component Cluster Analysis

                Total        Proportion      Minimum      Minimum     Maximum
   Number     Variation     of Variation    Proportion   R-squared    1-R**2
     of       Explained      Explained      Explained      for a      Ratio for
   Clusters   by Clusters   by Clusters    by a Cluster   Variable    a Variable
   -------------------------------------------------------------------------------
      1        4.631000        0.5789         0.5789       0.4306
      2        6.419000        0.8024         0.7275       0.6473       0.4207
      3        6.892333        0.8615         0.7944       0.7482       0.3548
Note that, if the PROPORTION= option were set to a value between 0.5789 (the proportion of variance explained in the 1-cluster solution) and 0.7275 (the minimum proportion of variance explained in the 2-cluster solution), PROC VARCLUS would stop at a two-cluster solution, and the centroid solution would find the same clusters as the principal components solution.

In the following statements, the MAXC= option computes all clustering solutions, from one to eight clusters. The SUMMARY option suppresses all output except the final cluster quality table, and the OUTTREE= option saves the results of the analysis to an output data set and forces the clusters to be hierarchical. The TREE procedure is invoked to produce a graphical display of the clusters.

   proc varclus data=phys8 maxc=8 summary outtree=tree;
   run;

   goptions ftext=swiss;
   axis2 label=(justify=left);
   axis1 order=(0.5 to 1.0 by 0.1);
   proc tree horizontal vaxis=axis2 haxis=axis1 lines=(width=2);
      height _propor_;
      id _label_;
   run;

Output 78.1.8. Hierarchical Clusters and the SUMMARY Option

            Oblique Principal Component Cluster Analysis

                Total       Proportion     Minimum      Maximum      Minimum     Maximum
   Number    Variation    of Variation    Proportion    Second      R-squared    1-R**2
     of      Explained     Explained      Explained    Eigenvalue     for a      Ratio for
   Clusters  by Clusters  by Clusters    by a Cluster  in a Cluster  Variable    a Variable
   -----------------------------------------------------------------------------------------
      1       4.672880       0.5841         0.5841      1.770983      0.3810
      2       6.426502       0.8033         0.7293      0.476418      0.6329       0.4380
      3       6.895347       0.8619         0.7954      0.418369      0.7421       0.3634
      4       7.271218       0.9089         0.8773      0.238000      0.8652       0.2548
      5       7.509218       0.9387         0.8773      0.236135      0.8652       0.1665
      6       7.740000       0.9675         0.9295      0.141000      0.9295       0.2560
      7       7.881000       0.9851         0.9405      0.119000      0.9405       0.2093
      8       8.000000       1.0000         1.0000      0.000000      1.0000       0.0000
The principal component method first separates the variables into the same two clusters that were created in the first PROC VARCLUS run. Note that, in creating the third cluster, the principal component method identifies the variable width. This is
the same variable that is put into its own cluster in the preceding centroid method example.

Output 78.1.9. TREE Diagram from PROC TREE
The tree diagram in Output 78.1.9 displays the cluster hierarchy. It is clear from the diagram that there are two, or possibly three, clusters present. However, the MAXC=8 option forces PROC VARCLUS to split the clusters until each variable is in its own cluster.
Chapter 79
The VARCOMP Procedure

Chapter Contents

OVERVIEW . . . . 4831
GETTING STARTED . . . . 4832
    Analyzing the Cure Rate of Rubber . . . . 4832
SYNTAX . . . . 4835
    PROC VARCOMP Statement . . . . 4835
    BY Statement . . . . 4835
    CLASS Statement . . . . 4836
    MODEL Statement . . . . 4836
DETAILS . . . . 4837
    Missing Values . . . . 4837
    Fixed and Random Effects . . . . 4837
    Negative Variance Component Estimates . . . . 4838
    Computational Methods . . . . 4838
    Displayed Output . . . . 4840
    ODS Table Names . . . . 4840
    Relationship to PROC MIXED . . . . 4841
EXAMPLE . . . . 4842
    Example 79.1. Using the Four Estimation Methods . . . . 4842
REFERENCES . . . . 4847
Chapter 79
The VARCOMP Procedure

Overview

The VARCOMP procedure handles general linear models that have random effects. Random effects are classification effects with levels that are assumed to be randomly selected from an infinite population of possible levels. PROC VARCOMP estimates the contribution of each of the random effects to the variance of the dependent variable.

A single MODEL statement specifies the dependent variables and the effects: main effects, interactions, and nested effects. The effects must be composed of class variables; no continuous variables are allowed on the right side of the equal sign. You can specify certain effects as fixed (nonrandom) by putting them first in the MODEL statement and indicating the number of fixed effects with the FIXED= option. An intercept is always fitted and assumed fixed. Except for the effects specified as fixed, all other effects are assumed to be random, and their contribution to the model can be thought of as an observation from a distribution that is normally and independently distributed.

The dependent variables are grouped based on the similarity of their missing values. Each group of dependent variables is then analyzed separately.

The columns of the design matrix X are formed in the same order in which the effects are specified in the MODEL statement. No reparameterization is done. Thus, the columns of X contain only 0s and 1s.

You can specify four methods of estimation in the PROC VARCOMP statement using the METHOD= option. They are TYPE1 (based on computation of Type I sum of squares for each effect), MIVQUE0, Maximum Likelihood (METHOD=ML), and Restricted Maximum Likelihood (METHOD=REML).
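As a minimal sketch of this setup (the data set and variable names here are hypothetical, not taken from this chapter's examples):

/* Hypothetical names throughout. FIXED=1 declares the first listed
   effect (Operator) fixed; Machine and Operator*Machine are random. */
proc varcomp data=assembly method=reml;
   class Operator Machine;
   model Time = Operator Machine Operator*Machine / fixed=1;
run;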
Getting Started

Analyzing the Cure Rate of Rubber

This example, using data from Hicks (1973), concerns an experiment to determine the sources of variability in cure rates of rubber. The goal of the experiment was to find out if the different laboratories contributed more to the variance of cure rates than did the different batches of raw materials. This information would be useful in trying to control the cure rate of the final product because it would provide insights into the sources of the variability in cure rates. The rubber used was cured at three temperatures, which were taken to be fixed. Three laboratories were chosen at random, and three different batches of raw material were tested at each combination of temperature and laboratory.

The following statements read the data into the SAS data set Cure.

title 'Analyzing the Cure Rate of Rubber';
data Cure;
   input Lab Temp Batch $ Cure @@;
   datalines;
1 145 A 18.6   1 145 A 17.0   1 145 A 18.7   1 145 A 18.7
1 145 B 14.5   1 145 B 15.8   1 145 B 16.5   1 145 B 17.6
1 145 C 21.1   1 145 C 20.8   1 145 C 21.8   1 145 C 21.0
1 155 A  9.5   1 155 A  9.4   1 155 A  9.5   1 155 A 10.0
1 155 B  7.8   1 155 B  8.3   1 155 B  8.9   1 155 B  9.1
1 155 C 11.2   1 155 C 10.0   1 155 C 11.5   1 155 C 11.1
1 165 A  5.4   1 165 A  5.3   1 165 A  5.7   1 165 A  5.3
1 165 B  5.2   1 165 B  4.9   1 165 B  4.3   1 165 B  5.2
1 165 C  6.3   1 165 C  6.4   1 165 C  5.8   1 165 C  5.6
2 145 A 20.0   2 145 A 20.1   2 145 A 19.4   2 145 A 20.0
2 145 B 18.4   2 145 B 18.1   2 145 B 16.5   2 145 B 16.7
2 145 C 22.5   2 145 C 22.7   2 145 C 21.5   2 145 C 21.3
2 155 A 11.4   2 155 A 11.5   2 155 A 11.4   2 155 A 11.5
2 155 B 10.8   2 155 B 11.1   2 155 B  9.5   2 155 B  9.7
2 155 C 13.3   2 155 C 14.0   2 155 C 12.0   2 155 C 11.5
2 165 A  6.8   2 165 A  6.9   2 165 A  6.0   2 165 A  5.7
2 165 B  6.0   2 165 B  6.1   2 165 B  5.0   2 165 B  5.2
2 165 C  7.7   2 165 C  8.0   2 165 C  6.6   2 165 C  6.3
3 145 A 19.7   3 145 A 18.3   3 145 A 16.8   3 145 A 17.1
3 145 B 16.3   3 145 B 16.7   3 145 B 14.4   3 145 B 15.2
3 145 C 22.7   3 145 C 21.9   3 145 C 19.3   3 145 C 19.3
3 155 A  9.3   3 155 A 10.2   3 155 A  9.8   3 155 A  9.5
3 155 B  9.1   3 155 B  9.2   3 155 B  8.0   3 155 B  9.0
3 155 C 11.3   3 155 C 11.0   3 155 C 10.9   3 155 C 11.4
3 165 A  6.7   3 165 A  6.0   3 165 A  5.0   3 165 A  4.8
3 165 B  5.7   3 165 B  5.5   3 165 B  4.6   3 165 B  5.4
3 165 C  6.6   3 165 C  6.5   3 165 C  5.9   3 165 C  5.8
;
The variables Lab, Temp, and Batch contain levels of laboratory, temperature, and batch, respectively. The Cure variable contains the response values.
The following SAS statements perform a restricted maximum-likelihood variance component analysis.

proc varcomp method=reml;
   class Temp Lab Batch;
   model Cure=Temp|Lab Batch(Lab Temp) / fixed=1;
run;
The FIXED=1 option indicates that the first factor, Temp, is fixed. The effect specification Temp|Lab is equivalent to putting the three terms Temp, Lab, and Temp*Lab in the model. Batch(Lab Temp) is equivalent to putting Batch(Temp*Lab) in the MODEL statement. The results of this analysis are displayed in Figure 79.1 through Figure 79.4.

                    Analyzing the Cure Rate of Rubber

               Variance Components Estimation Procedure

                      Class Level Information

Class    Levels    Values
Temp          3    145 155 165
Lab           3    1 2 3
Batch         3    A B C

Number of Observations Read    108
Number of Observations Used    108

Dependent Variable: Cure

Figure 79.1. Class Level Information
Figure 79.1 provides information about the variables used in the analysis and the number of observations and specifies the dependent variable.

                    Analyzing the Cure Rate of Rubber

               Variance Components Estimation Procedure

                           REML Iterations

Iteration    Objective        Var(Lab)        Var(Temp*Lab)    Var(Batch(Temp*Lab))    Var(Error)
0            13.4500060254    0.5094464340    0                2.4004888633            0.5787185225
1            13.0898262160    0.3194348317    0                2.0869636935            0.6016005334
2            13.0893125570    0.3176048001    0                2.0738906134            0.6026217204
3            13.0893125555    0.3176017115    0                2.0738685461            0.6026234568

Convergence criteria met.

Figure 79.2. Iteration History
The “REML Iterations” table, shown in Figure 79.2, displays the iteration history, which includes the value of the objective function associated with REML and the values of the variance components at each iteration.

                    Analyzing the Cure Rate of Rubber

               Variance Components Estimation Procedure

                           REML Estimates

Variance Component        Estimate
Var(Lab)                  0.31760
Var(Temp*Lab)             0
Var(Batch(Temp*Lab))      2.07387
Var(Error)                0.60262

Figure 79.3. REML Estimates
Figure 79.3 displays the REML estimates of the variance components.

                    Analyzing the Cure Rate of Rubber

               Variance Components Estimation Procedure

               Asymptotic Covariance Matrix of Estimates

                        Var(Lab)      Var(Temp*Lab)    Var(Batch(Temp*Lab))    Var(Error)
Var(Lab)                0.32452       0                -0.04998                1.0259E-12
Var(Temp*Lab)           0             0                0                       0
Var(Batch(Temp*Lab))    -0.04998      0                0.45042                 -0.0022417
Var(Error)              1.0259E-12    0                -0.0022417              0.0089668

Figure 79.4. Covariance Matrix for REML Estimates
The “Asymptotic Covariance Matrix of Estimates” table in Figure 79.4 displays the asymptotic covariance matrix of the REML estimates. The results of the analysis show that the variance attributable to Batch(Temp*Lab) (with a variance component of 2.0739) is considerably larger than the variance attributable to Lab (0.3176). Therefore, attempts to reduce the variability of cure rates should concentrate on improving the homogeneity of the batches of raw material used rather than standardizing the practices or equipment within the laboratories. Also, note that since the Batch(Temp*Lab) variance is considerably larger than the experimental error (Var(Error)=0.6026), the Batch(Temp*Lab) variability plays an important part in the overall variability of the cure rates.
Syntax

The following statements are available in PROC VARCOMP.
PROC VARCOMP < options > ;
   CLASS variables ;
   MODEL dependent = < effects > < / options > ;
   BY variables ;

Only one MODEL statement is allowed. The BY, CLASS, and MODEL statements are described after the PROC VARCOMP statement.
PROC VARCOMP Statement

PROC VARCOMP < options > ;

This statement invokes the VARCOMP procedure. You can specify the following options in the PROC VARCOMP statement.

DATA=SAS-data-set
   specifies the input SAS data set to use. If this option is omitted, the most recently created SAS data set is used.

EPSILON=number
   specifies the convergence value of the objective function for METHOD=ML or METHOD=REML. By default, EPSILON=1E−8.

MAXITER=number
   specifies the maximum number of iterations for METHOD=ML or METHOD=REML. By default, MAXITER=50.

METHOD=TYPE1 | MIVQUE0 | ML | REML
   specifies which of the four methods (TYPE1, MIVQUE0, ML, or REML) you want to use. By default, METHOD=MIVQUE0. For more information, see the “Computational Methods” section on page 4838.
BY Statement

BY variables ;

You can specify a BY statement with PROC VARCOMP to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. The variables are one or more variables in the input data set.

If your input data set is not sorted in ascending order, use one of the following alternatives.

• Sort the data using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the VARCOMP procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables using the DATASETS procedure (in base SAS software).
For more information on the BY statement, see the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, see the discussion in the SAS Procedures Guide.
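A hedged sketch of the usual pattern (all data set and variable names here are hypothetical):

/* Sort by the BY variable first, then request one variance
   component analysis per site. Names are hypothetical. */
proc sort data=assays;
   by Site;
run;

proc varcomp data=assays method=reml;
   class Batch Operator;
   model Purity = Batch Operator;
   by Site;
run;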
CLASS Statement

CLASS variables ;

The CLASS statement specifies the classification variables to be used in the analysis. All effects in the MODEL statement must be composed of effects that appear in the CLASS statement. Class variables can be either numeric or character; if they are character, only the first 16 characters are used.

Numeric class variables are not restricted to integers since a variable's format determines the levels. For more information, see the discussion of the FORMAT statement in SAS Language Reference: Dictionary.
MODEL Statement

MODEL dependent = < effects > < / option > ;

The MODEL statement gives the dependent variables and independent effects. If you specify more than one dependent variable, a separate analysis is performed for each one. The independent effects are limited to main effects, interactions, and nested effects; no continuous effects are allowed. All independent effects must be composed of effects that appear in the CLASS statement. Effects are specified in the VARCOMP procedure in the same way as described for the ANOVA procedure. Only one MODEL statement is allowed. Only one option is available in the MODEL statement.

FIXED=n
tells the VARCOMP procedure that the first n effects in the MODEL statement are fixed effects. The remaining effects are assumed to be random. By default, PROC VARCOMP assumes that all effects are random in the model. Keep in mind that if you use bar notation and, for example, specify Y=A|B / FIXED=2, then A*B is considered a random effect.
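For instance, the following hedged sketch (the data set name is hypothetical) shows the bar-notation case just described:

/* Y=A|B expands to A B A*B; FIXED=2 makes A and B fixed,
   so only A*B is treated as random. */
proc varcomp data=mydata;
   class A B;
   model Y = A|B / fixed=2;
run;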
Details

Missing Values

If an observation has a missing value for any variable used in the independent effects, then the analyses of all dependent variables omit this observation. An observation is deleted from the analysis of a given dependent variable if the observation's value for that dependent variable is missing. Note that a missing value in one dependent variable does not eliminate an observation from the analysis of the other dependent variables.

During processing, PROC VARCOMP groups the dependent variables on their missing values across observations so that sums of squares and cross products can be computed in the most efficient manner.
Fixed and Random Effects

Central to the idea of variance components models is the idea of fixed and random effects. Each effect in a variance components model must be classified as either a fixed or a random effect. Fixed effects arise when the levels of an effect constitute the entire population about which you are interested. For example, if a plant scientist is comparing the yields of three varieties of soybeans, then Variety would be a fixed effect, providing that the scientist was concerned about making inferences on only these three varieties of soybeans. Similarly, if an industrial experiment focused on the effectiveness of two brands of a machine, Machine would be a fixed effect only if the experimenter's interest did not go beyond the two machine brands.

On the other hand, an effect is classified as a random effect when you want to make inferences on an entire population, and the levels in your experiment represent only a sample from that population. Psychologists comparing test results between different groups of subjects would consider Subject as a random effect. Depending on the psychologists' particular interest, the Group effect might be either fixed or random. For example, if the groups are based on the sex of the subject, then Sex would be a fixed effect. But if the psychologists are interested in the variability in test scores due to different teachers, then they might choose a random sample of teachers as being representative of the total population of teachers, and Teacher would be a random effect. Note that, in the soybean example presented earlier, if the scientists are interested in making inferences on the entire population of soybean varieties and randomly choose three varieties for testing, then Variety would be a random effect.

If all the effects in a model (except for the intercept) are considered random effects, then the model is called a random effects model; likewise, a model with only fixed effects is called a fixed-effects model. The more common case, where some factors are fixed and others are random, is called a mixed model. In PROC VARCOMP, by default, effects are assumed to be random. You specify which effects are fixed by using the FIXED= option in the MODEL statement. In general, if an interaction or nested effect contains any effect that is random, then the interaction or nested effect should be considered as a random effect as well.
In the linear model, each level of a fixed effect contributes a fixed amount to the expected value of the dependent variable. What makes a random effect different is that each level of a random effect contributes an amount that is viewed as a sample from a population of normally distributed variables, each with mean 0, and an unknown variance, much like the usual random error term that is a part of all linear models. The estimate of the variance associated with the random effect is known as the variance component because it is measuring the part of the overall variance contributed by that effect. Thus, PROC VARCOMP estimates the variance of the random variables that are associated with the random effects in your model, and the variance components tell you how much each of the random factors contributes to the overall variability in the dependent variable.
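In symbols, the model just described can be written as follows; this restates the prose above using the notation of the "Computational Methods" section later in this chapter:

$$y = X_0 \beta + \sum_{i=1}^{n_r} X_i u_i + e, \qquad u_i \sim N(0, \sigma_i^2 I), \qquad e \sim N(0, \sigma_0^2 I)$$

so that the overall variability of the dependent variable decomposes as

$$\mathrm{Var}(y) = \sigma_0^2 I + \sum_{i=1}^{n_r} \sigma_i^2 X_i X_i'$$

with one variance component $\sigma_i^2$ for each random effect.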
Negative Variance Component Estimates

The variance components estimated by PROC VARCOMP should theoretically be nonnegative because they are assumed to represent the variance of a random variable. Nevertheless, when you are using METHOD=MIVQUE0 (the default) or METHOD=TYPE1, some estimates of variance components may become negative. (Due to the nature of the algorithms used for METHOD=ML and METHOD=REML, negative estimates are constrained to zero.) These negative estimates may arise for a variety of reasons:

• The variability in your data may be large enough to produce a negative estimate, even though the true value of the variance component is positive.
• Your data may contain outliers. Refer to Hocking (1983) for a graphical technique for detecting outliers in variance components models using the SAS System.
• A different model for interpreting your data may be appropriate. Under some statistical models for variance components analysis, negative estimates are an indication that observations in your data are negatively correlated. Refer to Hocking (1984) for further information about these models.

Assuming that you are satisfied that the model PROC VARCOMP is using is appropriate for your data, it is common practice to treat negative variance components as if they are zero.
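A hedged sketch of that practice follows, reusing the a, b, y naming of Example 79.1 later in this chapter; the Estimates table name comes from Table 79.1, and the assumption that the MIVQUE0 estimate column is named for the dependent variable follows the note after that table.

ods output Estimates=vcomp;
proc varcomp data=a method=mivque0;
   class a b;
   model y = a|b / fixed=1;
run;

data vcomp_adj;
   set vcomp;
   if y < 0 then y = 0;   /* treat a negative component estimate as zero */
run;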
Computational Methods

Four methods of estimation can be specified in the PROC VARCOMP statement using the METHOD= option. They are described in the following sections.
The Type I Method

This method (METHOD=TYPE1) computes the Type I sum of squares for each effect, equates each mean square involving only random effects to its expected value, and solves the resulting system of equations (Gaylor, Lucas, and Anderson 1970). The $[X'X \mid X'Y]$ matrix is computed and adjusted in segments whenever memory is not sufficient to hold the entire matrix.
The MIVQUE0 Method

Based on the technique suggested by Hartley, Rao, and LaMotte (1978), the MIVQUE0 method (METHOD=MIVQUE0) produces unbiased estimates that are invariant with respect to the fixed effects of the model and that are locally best quadratic unbiased estimates given that the true ratio of each component to the residual error component is zero. The technique is similar to TYPE1 except that the random effects are adjusted only for the fixed effects. This affords a considerable timing advantage over the TYPE1 method; thus, MIVQUE0 is the default method used in PROC VARCOMP. The $[X'X \mid X'Y]$ matrix is computed and adjusted in segments whenever memory is not sufficient to hold the entire matrix. Each element $(i,j)$ of the form

$$\mathrm{SSQ}(X_i' M X_j)$$

is computed, where

$$M = I - X_0 (X_0' X_0)^- X_0'$$

and where $X_0$ is part of the design matrix for the fixed effects, $X_i$ is part of the design matrix for one of the random effects, and SSQ is an operator that takes the sum of squares of the elements. For more information, refer to Rao (1971, 1972) and Goodnight (1978).
The Maximum Likelihood Method

The Maximum Likelihood method (METHOD=ML) computes maximum-likelihood estimates of the variance components; refer to Searle, Casella, and McCulloch (1992). The computing algorithm makes use of the W-transformation developed by Hemmerle and Hartley (1973). The procedure uses a Newton-Raphson algorithm, iterating until the log-likelihood objective function converges. The objective function for METHOD=ML is $\ln(|V|) + r' V^{-1} r$, where

$$V = \sigma_0^2 I + \sum_{i=1}^{n_r} \sigma_i^2 X_i X_i'$$

and where $\sigma_0^2$ is the residual variance, $n_r$ is the number of random effects in the model, $\sigma_i^2$ represents the variance components, $X_i$ is part of the design matrix for one of the random effects, and

$$r = y - X_0 (X_0' V^{-1} X_0)^- X_0' V^{-1} y$$

is the vector of residuals.
The Restricted Maximum Likelihood Method

The Restricted Maximum Likelihood Method (METHOD=REML) is similar to the maximum likelihood method, but it first separates the likelihood into two parts: one that contains the fixed effects and one that does not (Patterson and Thompson 1971). The procedure uses a Newton-Raphson algorithm, iterating until convergence is reached for the log-likelihood objective function of the portion of the likelihood that does not contain the fixed effects. Using notation from the earlier methods, the objective function for METHOD=REML is

$$\ln(|V|) + r' V^{-1} r + \ln(|X_0' V^{-1} X_0|)$$

Refer to Searle, Casella, and McCulloch (1992) for additional details.
Displayed Output

PROC VARCOMP displays the following items:

• Class Level Information for verifying the levels in your data
• the number of observations read from the data set and the number of observations used in the analysis
• for METHOD=TYPE1, an analysis-of-variance table with Source, DF, Type I Sum of Squares, Type I Mean Square, and Expected Mean Square, and a table of Type I variance component estimates
• for METHOD=MIVQUE0, the SSQ Matrix containing sums of squares of partitions of the $X'X$ crossproducts matrix adjusted for the fixed effects
• for METHOD=ML and METHOD=REML, the iteration history, including the objective function, as well as variance component estimates
• for METHOD=ML and METHOD=REML, the estimated Asymptotic Covariance Matrix of the variance components
• a table of variance component estimates
ODS Table Names

PROC VARCOMP assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, “Using the Output Delivery System.”
Table 79.1. ODS Tables Produced in PROC VARCOMP

ODS Table Name       Description                                      Statement
ANOVA                Type 1 analysis of variance                      METHOD=TYPE1
AsyCov               Asymptotic covariance matrix of estimates        METHOD=ML or REML
ClassLevels          Class level information                          default
ConvergenceStatus    Convergence status                               METHOD=ML or REML
DepVar               Dependent variable                               METHOD=TYPE1, REML, or ML
DependentInfo        Dependent variable info (multiple variables)     METHOD=MIVQUE0
Estimates            Variance component estimates                     default
IterHistory          Iteration history                                METHOD=ML or REML
NObs                 Number of observations                           default
SSCP                 Sum of squares matrix                            METHOD=MIVQUE0
In situations where multiple dependent variables are analyzed that differ in their missing value pattern, separate names for ANOVAn, AsyCovn, Estimatesn, IterHistoryn, and SSCPn tables are no longer required. The results are combined into a single output data set. For METHOD=TYPE1, ML, or REML, variable Dependent in the output data set identifies the dependent variable. For METHOD=MIVQUE0, a variable is added to the output data set for each dependent variable.
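As a hedged sketch of routing these tables to data sets with ODS OUTPUT (the model is the one from the "Getting Started" section; the output data set names are arbitrary):

ods output Estimates=RemlEst AsyCov=RemlCov IterHistory=RemlIter;
proc varcomp data=Cure method=reml;
   class Temp Lab Batch;
   model Cure=Temp|Lab Batch(Lab Temp) / fixed=1;
run;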
Relationship to PROC MIXED

The MIXED procedure effectively performs the same analyses as PROC VARCOMP and many others, including Type I, Type II, and Type III tests of fixed effects, confidence limits, customized contrasts, and least-squares means. Furthermore, continuous variables are permitted as both fixed and random effects in PROC MIXED, and numerous other covariance structures besides variance components are available.

To translate PROC VARCOMP code into PROC MIXED code, move all random effects to the RANDOM statement in PROC MIXED. For example, the syntax for the example in the “Getting Started” section on page 4832 is as follows:

proc mixed;
   class Temp Lab Batch;
   model Cure = Temp;
   random Lab Temp*Lab Batch(Lab Temp);
run;
REML is the default estimation method in PROC MIXED, and you can specify other methods using the METHOD= option.
Example

Example 79.1. Using the Four Estimation Methods

In this example, a and b are classification variables and y is the dependent variable. a is declared fixed, and b and a*b are random. Note that this design is unbalanced because the cell sizes are not all the same. PROC VARCOMP is invoked four times, once for each of the estimation methods. The data are from Hemmerle and Hartley (1973). The following statements produce Output 79.1.1.

data a;
   input a b y @@;
   datalines;
1 1 237   1 1 254   1 1 246
1 2 178   1 2 179
2 1 208   2 1 178   2 1 187
2 2 146   2 2 145   2 2 141
3 1 186   3 1 183
3 2 142   3 2 125   3 2 136
;

proc varcomp method=type1;
   class a b;
   model y=a|b / fixed=1;
run;

proc varcomp method=mivque0;
   class a b;
   model y=a|b / fixed=1;
run;

proc varcomp method=ml;
   class a b;
   model y=a|b / fixed=1;
run;

proc varcomp method=reml;
   class a b;
   model y=a|b / fixed=1;
run;

Output 79.1.1. VARCOMP Procedure: Method=TYPE1

               Variance Components Estimation Procedure

                      Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    16
Number of Observations Used    16

Dependent Variable: y
The “Class Level Information” table displays the levels of each variable specified in the CLASS statement. You can check this table to make sure the data are input correctly.

               Variance Components Estimation Procedure

                     Type 1 Analysis of Variance

Source             DF    Sum of Squares    Mean Square    Expected Mean Square
a                   2    11736             5868.218750    Var(Error) + 2.725 Var(a*b) + 0.1 Var(b) + Q(a)
b                   1    11448             11448          Var(Error) + 2.6308 Var(a*b) + 7.8 Var(b)
a*b                 2    299.041026        149.520513     Var(Error) + 2.5846 Var(a*b)
Error              10    786.333333        78.633333      Var(Error)
Corrected Total    15    24270             .              .
The Type I analysis of variance consists of sequentially partitioning the total sum of squares. The mean square is the sum of squares divided by the degrees of freedom, and the expected mean square is the expected value of the mean square under the mixed model. The “Q” notation in the expected mean squares refers to a quadratic form in parameters of the parenthesized effect.

               Variance Components Estimation Procedure

                          Type 1 Estimates

Variance Component    Estimate
Var(b)                1448.4
Var(a*b)              27.42659
Var(Error)            78.63333

The Type I estimates of the variance components result from solving the linear system of equations established by equating the observed mean squares to their expected values.
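As a worked check of that system using the numbers displayed above (the displayed values are rounded, so the arithmetic reproduces the estimates only to the precision shown): the Error row gives $\hat\sigma^2_{\mathrm{Error}} = 78.633333$ directly; the a*b row gives

$$\hat\sigma^2_{a*b} = \frac{149.520513 - 78.633333}{2.5846} \approx 27.4266$$

and the b row then gives

$$\hat\sigma^2_{b} = \frac{11448 - 78.633333 - 2.6308\,(27.4266)}{7.8} \approx 1448.4$$

matching the Type 1 Estimates table.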
Output 79.1.2. VARCOMP Procedure: Method=MIVQUE0

               Variance Components Estimation Procedure

                      Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    16
Number of Observations Used    16

The “Class Level Information” is the same as before.
               Variance Components Estimation Procedure

                        MIVQUE(0) SSQ Matrix

Source    b           a*b         Error       y
b         60.84000    20.52000    7.80000     89295.4
a*b       20.52000    20.52000    7.80000     30181.3
Error     7.80000     7.80000     13.00000    12533.5

The MIVQUE0 sums-of-squares matrix is displayed in the previous table.

               Variance Components Estimation Procedure

                         MIVQUE(0) Estimates

Variance Component    y
Var(b)                1466.1
Var(a*b)              -35.49170
Var(Error)            105.73660

The MIVQUE0 estimates result from solving the equations established by the MIVQUE0 SSQ matrix. Note that the estimate of the variance component for the interaction effect, Var(a*b), is negative for this example.

Output 79.1.3. VARCOMP Procedure: Method=ML

               Variance Components Estimation Procedure

                      Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    16
Number of Observations Used    16

Dependent Variable: y

The “Class Level Information” is the same as before.
               Variance Components Estimation Procedure

                   Maximum Likelihood Iterations

Iteration    Objective        Var(b)            Var(a*b)    Var(Error)
0            78.3850371200    1031.49070        0           74.3909717935
1            78.2637043807    732.3606453636    0           77.4011688154
2            78.2635471161    723.6867470850    0           77.5301774839
3            78.2635471152    723.6658365289    0           77.5304926877

Convergence criteria met.

The Newton-Raphson algorithm used by PROC VARCOMP requires three iterations to converge to the maximum likelihood estimates.

               Variance Components Estimation Procedure

                   Maximum Likelihood Estimates

Variance Component    Estimate
Var(b)                723.66584
Var(a*b)              0
Var(Error)            77.53049

The ML estimate of Var(a*b) is zero for this example, and the other two estimates are smaller than their Type I and MIVQUE0 counterparts.

               Variance Components Estimation Procedure

               Asymptotic Covariance Matrix of Estimates

              Var(b)        Var(a*b)    Var(Error)
Var(b)        537826.1      0           -107.33905
Var(a*b)      0             0           0
Var(Error)    -107.33905    0           858.71104

One benefit of using likelihood-based methods is that an approximate covariance matrix is available from the matrix of second derivatives evaluated at the ML solution. This covariance matrix is valid asymptotically and can be unreliable in small samples. Here the estimates of Var(b) and Var(Error) are negatively correlated, and the elements for Var(a*b) are set to zero because the estimate equals zero. Also, the very large variance for Var(b) indicates considerable uncertainty about that estimate, and one contributing explanation is that b has only two levels in this data set.
Output 79.1.4. VARCOMP Procedure: Method=REML

               Variance Components Estimation Procedure

                      Class Level Information

Class    Levels    Values
a             3    1 2 3
b             2    1 2

Number of Observations Read    16
Number of Observations Used    16

Dependent Variable: y

The “Class Level Information” is the same as before.

               Variance Components Estimation Procedure

                           REML Iterations

Iteration    Objective        Var(b)        Var(a*b)         Var(Error)
0            63.4134144942    1269.52701    0                91.5581191305
1            63.0446869787    1601.84199    32.7632417174    76.9355562461
2            63.0311530508    1468.82932    27.2258186561    78.7548276319
3            63.0311265148    1464.33646    26.9564053003    78.8431476502
4            63.0311265127    1464.36727    26.9588525177    78.8423898761

Convergence criteria met.

The REML optimization requires four iterations to converge.

               Variance Components Estimation Procedure

                           REML Estimates

Variance Component    Estimate
Var(b)                1464.4
Var(a*b)              26.95885
Var(Error)            78.84239

The REML estimates are all larger than the corresponding ML estimates (adjusting for potential downward bias) and are fairly similar to the Type I estimates.
               Variance Components Estimation Procedure

               Asymptotic Covariance Matrix of Estimates

              Var(b)        Var(a*b)      Var(Error)
Var(b)        4401703.8     1.29359       -273.39651
Var(a*b)      1.29359       3559.1        -502.85157
Var(Error)    -273.39651    -502.85157    1249.7
The Error variance component estimate is negatively correlated with the other two variance component estimates, and the estimated variances are all larger than their ML counterparts.
References

Gaylor, D.W., Lucas, H.L., and Anderson, R.L. (1970), “Calculation of Expected Mean Squares by the Abbreviated Doolittle and Square Root Methods,” Biometrics, 26, 641–655.

Goodnight, J.H. (1978), Computing MIVQUE0 Estimates of Variance Components, SAS Technical Report R-105, Cary, NC: SAS Institute Inc.

Goodnight, J.H. and Hemmerle, W.J. (1979), “A Simplified Algorithm for the W-Transformation in Variance Component Estimation,” Technometrics, 21, 265–268.

Hartley, H.O., Rao, J.N.K., and LaMotte, L. (1978), “A Simple Synthesis-Based Method of Variance Component Estimation,” Biometrics, 34, 233–244.

Hemmerle, W.J. and Hartley, H.O. (1973), “Computing Maximum Likelihood Estimates for the Mixed AOV Model Using the W-Transformation,” Technometrics, 15, 819–831.

Hicks, C.R. (1973), Fundamental Concepts in the Design of Experiments, New York: Holt, Rinehart and Winston, Inc.

Hocking, R.R. (1983), “A Diagnostic Tool for Mixed Models with Applications to Negative Estimates of Variance Components,” Proceedings of the Eighth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., 8, 711–716.

Hocking, R.R. (1984), The Analysis of Linear Models, Monterey, CA: Brooks-Cole Publishing Co.

Patterson, H.D. and Thompson, R. (1971), “Recovery of Inter-Block Information When Block Sizes Are Unequal,” Biometrika, 58, 545–554.

Rao, C.R. (1971), “Minimum Variance Quadratic Unbiased Estimation of Variance Components,” Journal of Multivariate Analysis, 1, 445–456.
Rao, C.R. (1972), “Estimation of Variance and Covariance Components in Linear Models,” Journal of the American Statistical Association, 67, 112–115.

Searle, S.R., Casella, G., and McCulloch, C.E. (1992), Variance Components, New York: John Wiley and Sons, Inc.
Chapter 80
The VARIOGRAM Procedure

Chapter Contents

OVERVIEW . . . . 4851
    Introduction to Spatial Prediction . . . . 4851
GETTING STARTED . . . . 4852
    Preliminary Spatial Data Analysis . . . . 4852
    Preliminary Variogram Analysis . . . . 4856
    Sample Variogram Computation and Plots . . . . 4861
SYNTAX . . . . 4864
    PROC VARIOGRAM Statement . . . . 4865
    COMPUTE Statement . . . . 4866
    COORDINATES Statement . . . . 4869
    DIRECTIONS Statement . . . . 4870
    VAR Statement . . . . 4870
DETAILS . . . . 4870
    Theoretical Semivariogram Models . . . . 4870
    Theoretical and Computational Details of the Semivariogram . . . . 4872
    Output Data Sets . . . . 4877
    Computational Resources . . . . 4881
EXAMPLE . . . . 4882
    Example 80.1. A Box Plot of the Square Root Difference Cloud . . . . 4882
REFERENCES . . . . 4885
Chapter 80
The VARIOGRAM Procedure

Overview

The VARIOGRAM procedure computes sample or empirical measures of spatial continuity for two-dimensional spatial data. These continuity measures are the regular semivariogram, a robust version of the semivariogram, and the covariance. The continuity measures are written to an output data set, allowing plotting or parameter estimation for theoretical semivariogram or covariance models. Both isotropic and anisotropic measures are available.

The VARIOGRAM procedure produces two additional output data sets that are useful in the analysis of pairwise distances in the original data. The OUTPAIR= data set contains one observation for each pair of points. The coordinates, distance, angle, and values of the analysis variables are written to this data set. The OUTDISTANCE= data set contains histogram information on the count of pairs within distance intervals, which is useful for determining unit lag distances.
Introduction to Spatial Prediction

Spatial prediction, in general, is any prediction method that incorporates spatial dependence. A simple and popular spatial prediction method is ordinary kriging. Ordinary kriging requires a model of the spatial continuity, or dependence. This is typically in the form of a covariance or semivariogram.

Spatial prediction, then, involves two steps. First, you model the covariance or semivariogram of the spatial process. This involves choosing both a mathematical form and the values of the associated parameters. Second, you use this dependence model in solving the kriging system at a specified set of spatial points, resulting in predicted values and associated standard errors.

SAS/STAT software has two procedures corresponding to these steps for spatial prediction of two-dimensional data. The VARIOGRAM procedure is used in the first step. By computing a sample estimate of the variogram or covariance, you can choose a theoretical model based on graphical or other means.
Getting Started

In activities such as reservoir estimation in mining, petroleum exploration, and environmental modeling of air and water pollution, it often happens that data on one or more quantities are available at given spatial locations, and the goal is to predict the measured quantities at unsampled locations. Often, these unsampled locations are on a regular grid, and the predictions are used to produce surface plots or contour maps.

A popular method of spatial prediction is ordinary kriging, which produces both predicted values and associated standard errors. Ordinary kriging requires the complete specification (the form and parameter values) of the spatial dependence of the spatial process in terms of a covariance or semivariogram model. Typically the semivariogram model is not known in advance and must be estimated, either visually or by some estimation method.

PROC VARIOGRAM computes the sample semivariogram, from which you can find a suitable theoretical semivariogram by visual methods. The following example goes through a typical problem to show how you can compute a sample variogram and determine an appropriate theoretical model.
Preliminary Spatial Data Analysis

The simulated data consist of coal seam thickness measurements (in feet) taken over an approximately square area. The coordinates are offsets from a point in the southwest corner of the measurement area, with the north and east distances in units of thousands of feet. First, the data are input.
data thick;
   input east north thick @@;
   datalines;
 0.7 59.6 34.1    2.1 82.7 42.2    4.7 75.1 39.5
 4.8 52.8 34.3    5.9 67.1 37.0    6.0 35.7 35.9
 6.4 33.7 36.4    7.0 46.7 34.6    8.2 40.1 35.4
13.3  0.6 44.7   13.3 68.2 37.8   13.4 31.3 37.8
17.8  6.9 43.9   20.1 66.3 37.7   22.7 87.6 42.8
23.0 93.9 43.6   24.3 73.0 39.3   24.8 15.1 42.3
24.8 26.3 39.7   26.4 58.0 36.9   26.9 65.0 37.8
27.7 83.3 41.8   27.9 90.8 43.3   29.1 47.9 36.7
29.5 89.4 43.0   30.1  6.1 43.6   30.8 12.1 42.8
32.7 40.2 37.5   34.8  8.1 43.3   35.3 32.0 38.8
37.0 70.3 39.2   38.2 77.9 40.7   38.9 23.3 40.5
39.4 82.5 41.4   43.0  4.7 43.3   43.7  7.6 43.1
46.4 84.1 41.5   46.7 10.6 42.6   49.9 22.1 40.7
51.0 88.8 42.0   52.8 68.9 39.3   52.9 32.7 39.2
55.5 92.9 42.2   56.0  1.6 42.7   60.6 75.2 40.1
62.1 26.6 40.1   63.0 12.7 41.8   69.0 75.6 40.1
70.5 83.7 40.9   70.9 11.0 41.7   71.5 29.5 39.8
78.1 45.5 38.7   78.2  9.1 41.7   78.4 20.0 40.8
80.5 55.9 38.7   81.1 51.0 38.6   83.8  7.9 41.6
84.5 11.0 41.5   85.2 67.3 39.4   85.5 73.0 39.8
86.7 70.4 39.6   87.2 55.7 38.8   88.1  0.0 41.6
88.4 12.1 41.3   88.4 99.6 41.2   88.8 82.9 40.5
88.9  6.2 41.5   90.6  7.0 41.5   90.7 49.6 38.9
91.5 55.4 39.0   92.9 46.8 39.1   93.4 70.9 39.7
94.8 71.5 39.7   96.2 84.3 40.3   98.2 58.2 39.5
;
It is instructive to see the locations of the measured points in the area where you want to perform spatial prediction. It is desirable to have these locations scattered evenly around the prediction area. If this is not the case, the prediction error might be unacceptably large where measurements are sparse. The following GPLOT procedure is useful in determining potential problems:
proc gplot data=thick;
   title 'Scatter Plot of Measurement Locations';
   plot north*east / frame cframe=ligr haxis=axis1 vaxis=axis2;
   symbol1 v=dot color=blue;
   axis1 minor=none;
   axis2 minor=none label=(angle=90 rotate=0);
   label east  = 'East'
         north = 'North';
run;
Figure 80.1. Scatter Plot of Measurement Locations
As Figure 80.1 indicates, while the locations are not ideally spread around the prediction area, there are not any large areas lacking measurements. You now can look at a surface plot of the measured variable, the thickness of coal seam, using the G3D procedure. This is a crucial step: any obvious surface trend has to be removed before you compute and estimate the model of spatial dependence (the semivariogram model).

proc g3d data=thick;
   title 'Surface Plot of Coal Seam Thickness';
   scatter east*north=thick / xticknum=5 yticknum=5
           grid zmin=20 zmax=65;
   label east  = 'East'
         north = 'North'
         thick = 'Thickness';
run;
Figure 80.2. Surface Plot of Coal Seam Thickness
Figure 80.2 shows the small-scale variation typical of spatial data, but there does not appear to be any surface trend. Hence, you can work with the original thickness data rather than residuals from a trend surface fit.
Preliminary Variogram Analysis

Recall that the goal of this example is spatial prediction. In particular, you would like to produce a contour map or surface plot on a regular grid of predicted values based on ordinary kriging. Ordinary kriging requires the complete specification of the spatial covariance or semivariogram.

You can use PROC VARIOGRAM, along with a DATA step and PROC GPLOT, to estimate visually a reasonable semivariogram model (both the form and associated parameters) for the thickness data. Before proceeding with this estimation, consider the formula for the empirical or experimental semivariogram $\gamma_z(h)$. Denote the coal seam thickness process by $\{Z(r), r \in D \subset \mathbb{R}^2\}$. You have measurements $(Z(r_i), i = 1, \ldots, 75)$. The standard formula for $\gamma_z(h)$ (isotropic case) is

$$2\gamma_z(h) = \frac{1}{|N(h)|} \sum_{N(h)} \left( Z(r_i) - Z(r_j) \right)^2$$

where $N(h)$ is given by

$$N(h) = \{\, i, j : |r_i - r_j| = h \,\}$$

and $|N(h)|$ is the number of such pairs $(i,j)$.

For actual data, it is unlikely that any pair $(i,j)$ would exactly satisfy $|r_i - r_j| = h$, so typically a range of pairwise distances, $|r_i - r_j| \in [h - \delta h, h + \delta h)$, is used to group pairs $(r_i, r_j)$ for a single term in the expression for $\gamma_z(h)$. Using this range, $N(h)$ is modified by

$$N(h, \delta h) = \{\, i, j : |r_i - r_j| \in [h - \delta h, h + \delta h) \,\}$$
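As an illustration of these formulas (this sketch is not part of the original example), the empirical semivariogram can be computed directly with PROC SQL. It assumes the thick data set read in above and uses a lag width of 7 with a tolerance of half the lag width, anticipating the LAGDISTANCE= value chosen later in this section; PROC VARIOGRAM makes this hand computation unnecessary.

/* Illustrative sketch only: empirical semivariogram by brute force.
   Each unordered pair of points is classified into the lag class
   nearest its distance (lag width 7, tolerance 3.5), and the
   semivariance is half the average squared difference per class. */
proc sql;
   create table gammahat as
   select round(sqrt((a.east-b.east)**2 + (a.north-b.north)**2) / 7) as lag,
          count(*) as npairs,
          sum((a.thick-b.thick)**2) / (2*count(*)) as semivariance
   from thick as a, thick as b
   where a.east < b.east or (a.east = b.east and a.north < b.north)
   group by calculated lag;
quit;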
PROC VARIOGRAM performs this grouping with two required options for variogram computation: the LAGDISTANCE= and MAXLAGS= options. The meaning of the required LAGDISTANCE= option is as follows. Classify all pairs of points into intervals according to their pairwise distance. The width of each distance interval is the LAGDISTANCE= value. The meaning of the required MAXLAGS= option is simply the number of intervals.

The problem is that a surface plot of the original data, or the scatter plot of the measurement locations, is not very helpful in determining the distribution of these pairwise distances; it is not clear what values to give to the LAGDISTANCE= and MAXLAGS= options.

You use PROC VARIOGRAM with the OUTDISTANCE= option to produce a modified histogram of the pairwise distances in order to find reasonable values for the LAGDISTANCE= and MAXLAGS= options. In the following analysis, you use the
NOVARIOGRAM option in the COMPUTE statement and the OUTDISTANCE= option in the PROC VARIOGRAM statement. You need the NOVARIOGRAM option to keep an error message from being issued due to the absence of the LAGDISTANCE= and MAXLAGS= options. The DATA step after the PROC VARIOGRAM statement computes the midpoint of each distance interval. This midpoint is then used in the GCHART procedure. Since the number of distance intervals is not specified by using the NHCLASSES= option in the COMPUTE statement, the default of 10 is used.
proc variogram data=thick outdistance=outd;
   compute novariogram;
   coordinates xc=east yc=north;
   var thick;
run;

title 'OUTDISTANCE= Data Set Showing Distance Intervals';
proc print data=outd;
run;

data outd;
   set outd;
   mdpt=round((lb+ub)/2,.1);
   label mdpt = 'Midpoint of Interval';
run;

axis1 minor=none;
axis2 minor=none label=(angle=90 rotate=0);

title 'Distribution of Pairwise Distances';
proc gchart data=outd;
   vbar mdpt / type=sum sumvar=count discrete frame cframe=ligr
               gaxis=axis1 raxis=axis2 nolegend;
run;
            OUTDISTANCE= Data Set Showing Distance Intervals

Obs    VARNAME    LAG    LB         UB         COUNT    PER
1      thick      0      0.000      6.969      45       0.01622
2      thick      1      6.969      20.907     263      0.09477
3      thick      2      20.907     34.845     383      0.13802
4      thick      3      34.845     48.783     436      0.15712
5      thick      4      48.783     62.720     495      0.17838
6      thick      5      62.720     76.658     525      0.18919
7      thick      6      76.658     90.596     412      0.14847
8      thick      7      90.596     104.534    179      0.06450
9      thick      8      104.534    118.472    35       0.01261
10     thick      9      118.472    132.410    2        0.00072
11     thick      10     132.410    146.348    0        0.00000
Figure 80.3. OUTDISTANCE= Data Set Showing Distance Intervals
Figure 80.4. Distribution of Pairwise Distances
For plotting and estimation purposes, it is desirable to have as many points as possible for the plot of $\gamma_z(h)$ against $h$. This corresponds to having as many distance intervals as possible, that is, having a small value for the LAGDISTANCE= option. However, a rule of thumb used in computing sample semivariograms is to use at least 30 point pairs in computing a single value of the empirical or experimental semivariogram.

If the LAGDISTANCE= value is set too small, there may be too few points in one or more of the intervals. On the other hand, if the LAGDISTANCE= value is set to a large value, the number of point pairs in the distance intervals may be much greater than that needed for estimation precision, thereby “wasting” point pairs at the expense of variogram points. Hence, there is a tradeoff between the number of distance intervals and the number of point pairs within each interval.

As discussed in the section “OUTDIST=SAS-data-set” on page 4878, the first few distance intervals, corresponding to lag 0 and lag 1, are typically the limiting intervals. This is particularly true for lag 0 since it is half the width of the remaining intervals. For the default of NHCLASSES=10, the lag 0 class contains 45 points, which is reasonably close to 30, but the lag 1 class contains 263 points.
If you rerun PROC VARIOGRAM with NHCLASSES=20, these numbers become 8 and 83 for lags 0 and 1, respectively. Because of the asymmetrical nature of lag 0, you are willing to violate the rule of thumb for the 0th lag. You will, however, have sufficient numbers in lag 1 and above.

proc variogram data=thick outdistance=outd;
   compute nhc=20 novariogram;
   coordinates xc=east yc=north;
   var thick;
run;

title 'OUTDISTANCE= Data Set Showing Distance Intervals';
proc print data=outd;
run;

data outd;
   set outd;
   mdpt=round((lb+ub)/2,.1);
   label mdpt = 'Midpoint of Interval';
run;

axis1 minor=none;
axis2 minor=none label=(angle=90 rotate=0);

title 'Distribution of Pairwise Distances';
proc gchart data=outd;
   vbar mdpt / type=sum sumvar=count discrete frame cframe=ligr
               gaxis=axis1 raxis=axis2 nolegend;
run;
            OUTDISTANCE= Data Set Showing Distance Intervals

Obs    VARNAME    LAG    LB         UB         COUNT    PER
1      thick      0      0.000      3.484      8        0.00288
2      thick      1      3.484      10.453     83       0.02991
3      thick      2      10.453     17.422     143      0.05153
4      thick      3      17.422     24.391     167      0.06018
5      thick      4      24.391     31.360     198      0.07135
6      thick      5      31.360     38.329     197      0.07099
7      thick      6      38.329     45.298     203      0.07315
8      thick      7      45.298     52.267     235      0.08468
9      thick      8      52.267     59.236     234      0.08432
10     thick      9      59.236     66.205     284      0.10234
11     thick      10     66.205     73.174     264      0.09514
12     thick      11     73.174     80.143     236      0.08505
13     thick      12     80.143     87.112     221      0.07964
14     thick      13     87.112     94.081     165      0.05946
15     thick      14     94.081     101.050    75       0.02703
16     thick      15     101.050    108.018    41       0.01477
17     thick      16     108.018    114.987    15       0.00541
18     thick      17     114.987    121.956    5        0.00180
19     thick      18     121.956    128.925    1        0.00036
20     thick      19     128.925    135.894    0        0.00000
21     thick      20     135.894    142.863    0        0.00000
Figure 80.5. OUTDISTANCE= Data Set Showing Distance Intervals
Figure 80.6. Distribution of Pairwise Distances
The length of the lag 1 class (3.484, 10.453) is 6.969. You round off and use LAGDISTANCE=7.0 in the COMPUTE statement.

The use of the MAXLAGS= option is more difficult. From Figure 80.5, note that up to a pairwise distance of 101, you have a sufficient number of pairs. With your choice of LAGDISTANCE=7.0, this yields a maximum number of lags $101/7 \approx 14$.

The problem with using the maximum lag value is that it includes pairs of points so far apart that they are likely to be independent. Using pairs of points that are independent adds nothing to the empirical semivariogram plot; they are essentially added noise.

If there is an estimate of correlation length, perhaps from a prior geologic study of a similar site, you can specify the MAXLAGS= value so that the maximum pairwise distance does not exceed two or three correlation lengths. If there is no estimate of correlation length, you can use the following rule of thumb: use 1/2 to 3/4 of the “diameter” of the region containing the data. A MAXLAGS= value of 10 is within this range. You now rerun PROC VARIOGRAM with these values.
Sample Variogram Computation and Plots

Using the values of LAGDISTANCE=7.0 and MAXLAGS=10 computed previously, rerun PROC VARIOGRAM without the NOVARIOGRAM option. Also, request a robust version of the semivariogram; then, plot both results against the pairwise distance of each class.

proc variogram data=thick outv=outv;
   compute lagd=7 maxlag=10 robust;
   coordinates xc=east yc=north;
   var thick;
run;

title 'OUTVAR= Data Set Showing Sample Variogram Results';
proc print data=outv label;
   var lag count distance variog rvario;
run;

data outv2;
   set outv;
   vari=variog;
   type = 'regular';
   output;
   vari=rvario;
   type = 'robust';
   output;
run;

title 'Standard and Robust Semivariogram for Coal Seam Thickness Data';
proc gplot data=outv2;
   plot vari*distance=type / frame cframe=ligr vaxis=axis2 haxis=axis1;
   symbol1 i=join l=1 c=blue   /* v=star   */;
   symbol2 i=join l=1 c=yellow /* v=square */;
   axis1 minor=none
         label=(c=black 'Lag Distance') /* offset=(3,3) */;
   axis2 order=(0 to 9 by 1) minor=none
         label=(angle=90 rotate=0 c=black 'Variogram') /* offset=(3,3) */;
run;
            OUTVAR= Data Set Showing Sample Variogram Results

       Lag Class Value         Number of         Average Lag                      Robust
Obs    (in LAGDIST= units)     Pairs in Class    Distance for Class    Variogram  Variogram
1      -1                      75                .                     .          .
2       0                      8                 2.5045                0.02937    0.01694
3       1                      85                7.3625                0.38047    0.19807
4       2                      142               14.1547               1.15158    0.98029
5       3                      169               21.0913               2.79719    3.01412
6       4                      199               27.9691               4.68769    4.86998
7       5                      199               35.1591               6.16018    6.15639
8       6                      205               42.2547               7.58912    8.05072
9       7                      232               48.7775               7.12506    7.07155
10      8                      244               56.1824               7.04832    7.62851
11      9                      285               62.9121               6.66298    8.02993
12     10                      262               69.8925               6.18775    7.92206
Figure 80.7. OUTVAR= Data Set Showing Sample Variogram Results
Figure 80.8. Standard and Robust Semivariogram for Coal Seam Thickness Data
Figure 80.8 shows first a slow, then a rapid rise from the origin, suggesting a Gaussian type form:

$$\gamma_z(h) = c_0 \left[ 1 - \exp\left( -\frac{h^2}{a_0^2} \right) \right]$$

See the section “Theoretical and Computational Details of the Semivariogram” on page 4872 for graphs of the standard semivariogram forms. By experimentation, you find that a scale of $c_0 = 7.5$ and a range of $a_0 = 30$ fits reasonably well for both the robust and standard semivariogram.

The following statements plot the sample and theoretical variograms:

data outv3;
   set outv;
   c0=7.5;
   a0=30;
   vari = c0*(1-exp(-distance*distance/(a0*a0)));
   type = 'Gaussian';
   output;
   vari = variog;
   type = 'regular';
   output;
   vari = rvario;
   type = 'robust';
   output;
run;

title 'Theoretical and Sample Semivariogram for Coal Seam Thickness Data';
proc gplot data=outv3;
   plot vari*distance=type / frame cframe=ligr vaxis=axis2 haxis=axis1;
   symbol1 i=join l=1 c=blue   /* v=star    */;
   symbol2 i=join l=1 c=yellow /* v=square  */;
   symbol3 i=join l=1 c=cyan   /* v=diamond */;
   axis1 minor=none
         label=(c=black 'Lag Distance') /* offset=(3,3) */;
   axis2 order=(0 to 9 by 1) minor=none
         label=(angle=90 rotate=0 c=black 'Variogram') /* offset=(3,3) */;
run;
Figure 80.9. Theoretical and Sample Semivariogram for Coal Seam Thickness Data
Figure 80.9 shows that the choice of a semivariogram model is adequate. You can use this Gaussian form and these particular parameters in PROC KRIGE2D to produce a contour plot of the kriging estimates and the associated standard errors.
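As a hedged sketch of that next step (the RADIUS= and GRID values here are illustrative assumptions; the SCALE=, RANGE=, and FORM= values are the ones just estimated — consult the KRIGE2D chapter for the full treatment):

proc krige2d data=thick outest=est;
   coordinates xc=east yc=north;
   predict var=thick radius=60;
   model scale=7.5 range=30 form=gauss;
   grid x=0 to 100 by 10 y=0 to 100 by 10;
run;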
Syntax

The following statements are available in PROC VARIOGRAM.
PROC VARIOGRAM options ;
   COMPUTE computation-options ;
   COORDINATES coordinate-variables ;
   DIRECTIONS directions-list ;
   VAR analysis-variables-list ;

The COMPUTE and COORDINATES statements are required. The following table outlines the options available in PROC VARIOGRAM classified by function.
Table 80.1. Options Available in the VARIOGRAM Procedure

Task                                                        Statement         Option
Data Set Options
  specify input data set                                    PROC VARIOGRAM    DATA=
  write spatial continuity measures                         PROC VARIOGRAM    OUTVAR=
  write distance histogram information                      PROC VARIOGRAM    OUTDISTANCE=
  write pairwise point information                          PROC VARIOGRAM    OUTPAIR=
Declaring the Role of Variables
  specify the analysis variables                            VAR
  specify the x, y coordinates in the DATA= data set        COORDINATES       XCOORD=, YCOORD=
Controlling Continuity Measure Computations
  specify the basic lag distance                            COMPUTE           LAGDISTANCE=
  specify the tolerance around the lag distance             COMPUTE           LAGTOLERANCE=
  specify the maximum number of lags in computations        COMPUTE           MAXLAGS=
  specify the number of angle classes                       COMPUTE           NDIRECTIONS=
  specify the angle tolerances for angle classes            COMPUTE           ANGLETOL=
  specify the bandwidths for angle classes                  COMPUTE           BANDWIDTH=
  compute robust semivariogram                              COMPUTE           ROBUST
  suppress computation of all continuity measures           COMPUTE           NOVARIOGRAM
Controlling Distance Histogram Data Set
  specify the distance histogram data set                   PROC VARIOGRAM    OUTDISTANCE=
  specify the number of histogram classes                   COMPUTE           NHCLASSES=
Controlling Pairwise Information Data Set
  specify the pairwise data set                             PROC VARIOGRAM    OUTPAIR=
  specify the maximum distance for the pairwise data set    COMPUTE           OUTPDISTANCE=
PROC VARIOGRAM Statement

PROC VARIOGRAM options ;

You can specify the following options in the PROC VARIOGRAM statement.

DATA=SAS-data-set
specifies a SAS data set containing the x and y coordinate variables and the VAR statement variables.

OUTDISTANCE=SAS-data-set
OUTDIST=SAS-data-set
OUTD=SAS-data-set
specifies a SAS data set in which to store summary distance information. This data set contains a count of all pairs of data points within a given distance interval. The number of distance intervals is controlled by the NHCLASSES= option in the COMPUTE statement. The OUTDISTANCE= data set is useful for plotting modified histograms of the count data for determining appropriate lag distances. See the section “OUTDIST=SAS-data-set” on page 4878 for details.

OUTPAIR=SAS-data-set
OUTP=SAS-data-set
specifies a SAS data set in which to store distance and angle information for each pair of points in the DATA= data set. This option should be used with caution when the DATA= data set is large. If $n$ denotes the number of observations in the DATA= data set, the OUTPAIR= data set contains $n(n-1)/2$ observations unless you restrict it with the OUTPDISTANCE= option in the COMPUTE statement; for the 75-point thick data set in “Getting Started,” for example, that is $75 \times 74 / 2 = 2775$ observations. The OUTPDISTANCE= option in the COMPUTE statement excludes pairs of points when the distance between the pairs exceeds the OUTPDISTANCE= value. See the section “OUTPAIR=SAS-data-set” on page 4881 for details.

OUTVAR=SAS-data-set
OUTVR=SAS-data-set
specifies a SAS data set in which to store the continuity measures. See the section “OUTVAR=SAS-data-set ” on page 4877 for details.
COMPUTE Statement COMPUTE computation-options ; The COMPUTE statement provides a number of options that control the computation of the semivariogram, the robust semivariogram, and the covariance. ANGLETOLERANCE=angle tolerance ANGLETOL=angle tolerance ATOL=angle tolerance
specifies the tolerance, in degrees, around the angles determined by the 180o NDIRECTIONS= specification. The default is 2×n , where nd is the d NDIRECTIONS= specification. See the section “Theoretical and Computational Details of the Semivariogram” on page 4872 for more detailed information. BANDWIDTH=bandwidth distance BANDW=bandwidth distance
specifies the bandwidth, or perpendicular distance cutoff for determining the angle class for a given pair of points. The distance classes define a series of cylindrically shaped areas, while the angle classes radially cut these cylindrically shaped areas. For a given angle class (θ1 − δθ1 , θ1 + δθ1 ), as you proceed out radially, the area encompassed by this angle class becomes larger. The BANDWIDTH= option restricts this area by excluding all points with a perpendicular distance from the line θ = θ1 that is greater than the BANDWIDTH= value. If you do not specify the BANDWIDTH= option, no restriction occurs. See Figure 80.15 on page 4876 for more detailed information.
COMPUTE Statement
DEPSILON=distance value DEPS=distance value
specifies the distance value for declaring that two distinct points are zero distance apart. Such pairs, if they occur, cause numeric problems. If you specify DEPSILON=ε, then pairs of points P1 and P2 for which the distance between them | P1 P2 |< ε are excluded from the continuity measure calculations. The default value of the DEPSILON= option is 100 times machine epsilon; this product is approximately 1E-10 on most computers. LAGDISTANCE=distance unit LAGDIST=distance unit LAGD=distance unit
specifies the basic distance unit defining the lags. For example, a specification of LAGDISTANCE=x results in lag distance classes that are multiples of x. For a given pair of points P1 and P2 , the distance between them, denoted | P1 P2 |, is calculated. If | P1 P2 |= x, then this pair is in the first lag class. If | P1 P2 |= 2x, then this pair is in the second lag class, and so on. For irregularly spaced data, the pairwise distances are unlikely to fall exactly on multiples of the LAGDISTANCE= value. A distance tolerance of δx is used to accommodate a spread of distances around multiples of x (the LAGTOLERANCE= option specifies the distance tolerance). For example, if | P1 P2 | is within x±δx, you would place this pair in the first lag class; if | P1 P2 | is within 2x ± δx, you would place this pair in the second lag class, and so on. You can determine the candidate values for the LAGDISTANCE= option by plotting or displaying the OUTDISTANCE= data set. A LAGDISTANCE= value is required unless you specify the NOVARIOGRAM option. See the section “Theoretical and Computational Details of the Semivariogram” on page 4872 for more details. LAGTOLERANCE=tolerance number LAGTOL=tolerance number LAGT=tolerance number
specifies the tolerance around the LAGDISTANCE= value for grouping distance pairs into lag classes. See the preceding description of the LAGDISTANCE= option for information on the use of the LAGTOLERANCE= option, and see the section “Theoretical and Computational Details of the Semivariogram” on page 4872 for more details. If you do not specify the LAGTOLERANCE= option, a default value of (1/2) times the LAGDISTANCE= value is used. MAXLAGS=number of lags MAXLAG=number of lags MAXL=number of lags
specifies the maximum number of lag classes used in constructing the continuity measures. This option excludes any pair of points P1 and P2 for which the distance
4867
4868
Chapter 80. The VARIOGRAM Procedure
between them, | P1 P2 |, exceeds the MAXLAGS= value times the LAGDISTANCE= value. You can determine candidate values for the MAXLAGS= option by plotting or displaying the OUTDISTANCE= data set. A MAXLAGS= value is required unless you specify the NOVARIOGRAM option. NDIRECTIONS=number of directions NDIR=number of directions ND=number of directions
specifies the number of angle classes to use in computing the continuity measures. This option is useful when there is potential anisotropy in the spatial continuity measures. Anisotropy occurs when the spatial continuity or dependence between a pair of points depends on the orientation or angle between the pair. Isotropy is the absence of this effect: the spatial continuity or dependence between a pair of points depends only on the distance between the points, not the angle. The angle classes formed from the NDIRECTIONS= option start from N–S and proceed clockwise. For example, NDIRECTIONS=3 produces three angle classes. In terms of compass points, these classes are centered at 0o (or its reciprocal 180o ), 60o (or its reciprocal 240o ), and 120o (or its reciprocal 300o ). For irregularly spaced data, the angles between pairs are unlikely to fall exactly in these directions, so an angle tolerance of δθ is used (the ANGLETOLERANCE= option specifies the angle tolo erance). If NDIRECTIONS=nd , the base angle is θ = 180 nd , and the angle classes are (kθ − δθ, kθ + δθ)
k = 0, . . . , nd − 1
If you do not specify the NDIRECTIONS= option, no angles are formed, and the spatial continuity measures are assumed to be isotropic. The NDIRECTIONS= option is useful for exploring possible anisotropy. The DIRECTIONS statement, described in the “DIRECTIONS Statement” section on page 4870, provides greater control over the angle classes. See the section “Theoretical and Computational Details of the Semivariogram” on page 4872 for more detailed information. NHCLASSES=number of histogram classes NHCLASS=number of histogram classes NHC=number of histogram classes
specifies the number of distance or histogram classes to write to the OUTDISTANCE= data set. The actual number of classes is one more than the NHCLASSES= value since a special lag 0 class is also computed. See the OUTDISTANCE= option on page 4865 and the section “OUTDIST=SAS-data-set ” on page 4878 for details. The default value of the NHCLASSES= option is 10. This option is ignored if you do not specify an OUTDISTANCE= data set.
COORDINATES Statement
NOVARIOGRAM
prevents the computation of the continuity measures. This option is useful for preliminary analysis when you require only the OUTDISTANCE= or OUTPAIR= data sets. OUTPDISTANCE=distance limit OUTPDIST=distance limit OUTPD=distance limit
specifies the cutoff distance for writing observations to the OUTPAIR= data set. If you specify OUTPDISTANCE=dmax , the distance | P1 P2 | between each pair of points P1 and P2 is checked against dmax . If | P1 P2 |> dmax , the observation for this pair is not written to the OUTPAIR= data set. If you do not specify the OUTPDISTANCE= option, all distinct pairs are written. This option is ignored if you do not specify an OUTPAIR= data set. ROBUST
requests that a robust version of the semivariogram be calculated in addition to the regular semivariogram and covariance.
COORDINATES Statement COORDINATES coordinate-variables ; The following two options give the names of the variables in the DATA= data set containing the values of the x and y coordinates of the data. Only one COORDINATES statement is allowed, and it is applied to all the analysis variables. In other words, it is assumed that all the VAR variables have the same x and y coordinates. XCOORD= (variable-name) XC= (variable-name)
gives the name of the variable containing the x coordinate of the data in the DATA= data set. YCOORD= (variable-name) YC= (variable-name)
gives the name of the variable containing the y coordinate of the data in the DATA= data set.
4869
4870
Chapter 80. The VARIOGRAM Procedure
DIRECTIONS Statement DIRECTIONS directions-list ; You use the DIRECTIONS statement to define angle classes. You can specify angle classes as a list of angles, separated by commas, with optional angle tolerances and bandwidths within parentheses following the angle. You must specify at least one angle. If you do not specify the optional angle tolerance, the default value of 45o is used. If you do not specify the optional bandwidth, no bandwidth is checked. If you specify a bandwidth, you must also specify an angle tolerance. For example, suppose you want to compute three separate semivariograms at angles θ1 = 0o , θ2 = 60o , and θ3 = 120o , with corresponding angle tolerance δθ1 = 22.5o , δθ2 = 12.5o , and δθ3 = 22.5o , with bandwidths 50 and 40 distance units on the first two angle classes and no bandwidth check on the last angle class. The appropriate DIRECTIONS statement is directions 0.0(22.5,50), 60.0(12.5,40),120(22.5);
VAR Statement VAR analysis-variables-list ; Use the VAR statement to specify the analysis variables. You can specify only numeric variables. If you do not specify a VAR statement, all numeric variables in the DATA= data set that are not in the COORDINATES statement are used.
Details Theoretical Semivariogram Models The VARIOGRAM procedure computes the sample, or experimental semivariogram. Prediction of the spatial process at unsampled locations by techniques such as ordinary kriging requires a theoretical semivariogram or covariance. It is necessary, then, to decide on a theoretical variogram based on the sample variogram. While there are methods of fitting semivariogram models, such as least squares, maximum likelihood, and robust methods (Cressie 1993, section 2.6), these techniques are not appropriate for data sets resulting in a small number of variogram points. Instead, a visual fit of the variogram points to a few standard models is often satisfactory. Even when there are sufficient variogram points, a visual check against a fitted theoretical model is appropriate (Hohn 1988, p. 25ff).
Theoretical Semivariogram Models
In some cases, a plot of the experimental semivariogram suggests that a single theoretical model is inadequate. Nested models, anisotropic models, and the nugget effect increase the scope of theoretical models available. All of these concepts are discussed in this section. The specification of the final theoretical model is provided by the syntax of PROC KRIGE2D. Note the general flow of investigation. After a suitable choice is made of the LAGDIST= and MAXLAGS= options and, possibly, the NDIR= option (or a DIRECTIONS statement), the experimental semivariogram is computed. Potential theoretical models, possibly incorporating nesting, anisotropy, and the nugget effect, are computed by a DATA step, then they are plotted against the experimental semivariogram and evaluated. A suitable theoretical model is thus found visually, and the specification of the model is used in PROC KRIGE2D. This flow is illustrated in Figure 80.10; also see the “Getting Started” section on page 4852 for an illustration in a simple case.
Pairwise Distance Distribution PROC VARIOGRAM using NHCLASS=, NOVAR options
no
Sufficient number of pairs in each lag class ? yes
Determine LAGDIST= and MAXLAG= values
Use PROC VARIOGRAM to compute and plot sample variogram
Use DATA step to plot sample and theoretical variograms
Select candidate variogram forms and parameters
Figure 80.10. Flowchart for Variogram Selection
4871
4872
Chapter 80. The VARIOGRAM Procedure
Theoretical and Computational Details of the Semivariogram The basic starting point in computing the semivariogram is the enumeration of pairs of points for the spatial data. Figure 80.11 shows a spatial domain in which a set of measurements are made at the indicated locations. Two points P1 and P2 , with coordinates (x1 , y1 ), (x2 , y2 ), are selected for illustration. A vector, or directed line segment, is drawn between these points. This pair is then categorized first by orientation of this directed line segment and then by its length. That is, the pair P1 P2 is placed into an angle and distance class.
o o
o o
o
o o o o
p2
o
o
o
o
o o o
P1
o
o o o
Figure 80.11. Selection of Points P1 and P2 in Spatial Domain
Angle Classification Suppose you specify NDIR=3 in the COMPUTE statement in PROC VARIOGRAM. This results in three angle classes defined by midpoint angles between 0o and 180o : 0o ± δθ, 60o ± δθ, and 120o ± δθ, where δθ is the angle tolerance. If you do not specify an angle tolerance using the ATOL= option in the COMPUTE statement, the following default value is used. δθ =
180o 2 × N DIR
Theoretical and Computational Details of the Semivariogram
For three classes, δθ = 30o . When the example directed line segment P1 P2 is superimposed on the coordinate system showing the angle classes, its angle, measured clockwise from north, is approximately 45o . In particular, it falls within [60o − δθ, 60o + δθ) = [30o , 90o ), the second angle class. See Figure 80.12. Ν 0ο ο 30
ο
60 δθ
W 270
δθ
o
E 90
o
ο 240
210
ο S 180 o
Figure 80.12. Selected Pair P1 P2 Falls within the Second Angle Class
Note that if the designated points P1 and P2 are labeled in the opposite order, the orientation is in a reciprocal direction, that is, approximately 225o for the point pair instead of approximately 45o . This does not affect angle class selection; the angle classes [60o − δθ, 60o + δθ) and [240o − δθ, 240o + δθ) are the same. If you specify an angle tolerance less than the default, for example, AT OL = 15o , some point pairs might be excluded. For example, the selected point pair P1 P2 in Figure 80.12, while closest to the 60o axis, might lie outside [60 − δθ, 60 + δθ) = [45o , 75o ). In this case, the point pair P1 P2 would be excluded from the variogram computation. On the other hand, you can specify an angle tolerance greater than the default. This can result in a point pair being counted in more than one angle class. This has a smoothing effect on the variogram and is useful when there is a small amount of data available. An alternative way to specify angle classes and angle tolerances is with the DIRECTIONS statement. The DIRECTIONS statement is useful when angle classes are not equally spaced. When you specify the DIRECTIONS statement, you should also specify the angle tolerance. The default value of the angle tolerance is 45o
4873
4874
Chapter 80. The VARIOGRAM Procedure
when a DIRECTIONS statement is used instead of the NDIRECTIONS= option in the COMPUTE statement. This may not be appropriate for a particular set of angle classes. See the “DIRECTIONS Statement” section on page 4870 for more details on the DIRECTIONS statement.
Distance Classification Next, the distance class for the point pair P1 P2 is determined. The directed line segment P1 P2 is superimposed on the coordinate system showing the distance or lag classes. These classes are determined by the LAGD= specification in the COMPUTE statement. Denoting the length of the line segment by | P1 P2 | and the LAGD value by ∆, the lag class L is determined by
| P1 P2 | +.5 L(P1 P2 ) = ∆
where bxc denotes the largest integer ≤ x. When the directed line segment P1 P2 is superimposed on the coordinate system showing the distance classes, it is seen to fall in the first lag class; see Figure 80.13 for an illustration for ∆ = 1.
N-0
o
lag 2
lag 1
lag tolerance lag 0 E - 90
o
lag distance 0
1
2
Figure 80.13. Selected Pair P1 P2 Falls within the First Lag Class
Because pairwise distances are positive, lag class zero is smaller than lag classes 1, · · · , M AXLAG − 1. For example, if you specify LAGD=1.0 and MAXLAG=10,
Theoretical and Computational Details of the Semivariogram
and you do not specify a LAGTOL= value in the COMPUTE statement in PROC VARIOGRAM, the ten lag classes generated by the preceding equation are [0, .5), [.5, 1.5), [1.5, 2.5), · · · , [8.5, 9.5) This is because the default lag tolerance is one-half the LAGD= value, resulting in no gaps between the distance class intervals. This is shown in Figure 80.14. lag 0
lag 1
lag 2
lag 3
o
h 1
2
3
Figure 80.14. Lag Distance Axis Showing Lag Classes
On the other hand, if you do specify a distance tolerance with the DTOL= option in the COMPUTE statement, a further check is performed to see if the point pair falls within this tolerance of the nearest lag. In the preceding example, if you specify LAGD=1.0 and MAXLAG=10 (as before) and also specify LAGTOL=0.25, the intervals become [0, 0.25), [0.75, 1.25), [1.75, 2.25), · · · , [8.75, 9.25) Note that this specification results in gaps in the lag classes; a point pair P1 P2 might fall, for example, in the interval | P1 P2 |∈ [1.25, 1.75) and hence be excluded from the semivariogram calculation. The maximum LAGTOL= value allowed is half the LAGD= value; no overlap of the distance classes is allowed.
Bandwidth Restriction Because the areal segments generated from the angle and distance classes increase in area as the lag distance increases, it is sometimes desirable to restrict this area (Duetsch and Journel 1992, p. 45). If you specify the BANDW= option in the COMPUTE statement, the lateral, or perpendicular, distance from the axis defining the angle classes is fixed. For example, suppose two points P3 , P4 are picked from the domain in Figure 80.11 and are superimposed on the grid defining distance and angle classes, as shown in Figure 80.15.
4875
4876
Chapter 80. The VARIOGRAM Procedure
N 0
o
P3P4 lag 4
2 x bandwidth lag 3 60o lag 2 lag 1 E 90
o
Figure 80.15. Selected Pair P3 P4 Falls Outside Bandwidth Limit
The endpoint of vector P3 P4 falls within the angle class around 60o and the 5th lag class; however, it falls outside the restricted area defined by the bandwidth. Hence, it is excluded from the semivariogram calculation. Finally, a pair Pi Pj that falls in a lag class larger than the value of the MAXLAG= option is excluded from the semivariogram calculation. From this description, it is clear that the number of pairs within each angle/distance class is strongly affected by the angle and lag tolerances. Since it is desirable to have the maximum number of point pairs within each class, the angle tolerance and the distance tolerance should usually be the default values.
Semivariogram Computation With the classification of a point pair Pi Pj into an angle/distance class, as shown in the preceding section, the semivariogram computation proceeds as follows. Denote all pairs Pi Pj belonging to angle class [θk − δθk , θk + δθk ) and distance class L = L(Pi Pj ) by N (θk , L). For example, in the preceding illustration, P1 P2 belongs to N (60o , 1). Let | N (θk , L) | denote the number of such pairs. Let Vi , Vj be the measured values at points Pi , Pj . The component of the standard (or method of moments) semivariogram
Output Data Sets
corresponding to angle/distance class N (θk , L) is given by 2γ(hk ) =
1 | N (θk , L) |
X
(Vi − Vj )2
Pi Pj ∈N (θk ,L)
where hk is the average distance in class N (θk , L); that is, hk =
1 | N (θk , L) |
X
| Pi Pj |
Pi Pj ∈N (θk ,L)
The robust version of the semivariogram, as suggested by Cressie (1993), is given by
2¯ γ (hk ) =
Ψ4 (hk ) 0.457 + 0.494/N (θk , L)
Ψ(hk ) =
1 N (θk , L)
where X
1
(Vi − Vj ) 2
Pi Pj ∈N (θk ,L)
This robust version of the semivariogram is computed when you specify the ROBUST option in the COMPUTE statement in PROC VARIOGRAM. PROC VARIOGRAM computes and writes to the OUTVAR= data set the quantities hk , θk , L, N (θk , L), γ(h), and γ¯ (h).
Output Data Sets The VARIOGRAM procedure produces three data sets: the OUTVAR=SAS-data-set, the OUTPAIR=SAS-data-set, and the OUTDIST=SAS-data-set. These data sets are described in the following sections.
OUTVAR=SAS-data-set The OUTVAR= data set contains the standard and robust versions of the sample semivariogram, the covariance, and other information at each lag class. The details of the computation of the variogram, the robust variogram, and the covariance is described in the section “Theoretical and Computational Details of the Semivariogram” on page 4872. The OUTVAR= data set contains the following variables: • ANGLE, which is the angle class value (clockwise from N–S) • ATOL, which is the angle tolerance for the lag/angle class • AVERAGE, which is the average variable value for the lag/angle class
4877
4878
Chapter 80. The VARIOGRAM Procedure • BANDW, which is the band width for the lag/angle class • COUNT, which is the number of pairs in the lag/angle class • COVAR, which is the covariance value for the lag/angle class • DISTANCE, which is the average lag distance for the lag/angle class • LAG, which is lag class value (in LAGDISTANCE= units) • RVARIO, which is the sample robust variogram value for the lag/angle class • VARIOG, which is the sample variogram value for the lag/angle class • VARNAME, which is the name of the current VAR= variable
The bandwidth variable, BANDW, is not included in the data set if no bandwidth specification is given in the COMPUTE statement or in a DIRECTIONS statement.
OUTDIST=SAS-data-set The OUTDIST= data set contains counts for a modified histogram showing the distribution of pairwise distances. The purpose of this data set is to enable you to make choices for the value of the LAGDISTANCE= option in the COMPUTE statement in subsequent runs of PROC VARIOGRAM. For plotting and estimation purposes, it is desirable to have as many points as possible for a variogram plot. However, a rule of thumb used in computing sample semivariograms is to use at least 30 points in each interval whenever possible. Hence, there is a lower limit to the value of the LAGDISTANCE= option. Since the distribution of pairwise distances is seldom known in advance, the information contained in the OUTDIST= data set enables you to choose, in an iterative fashion, a value for the LAGDISTANCE= parameter. The value you choose is a compromise between the number of pairs making up each variogram point and the number of variogram points. In some cases, the pattern of measured points may result in some lag or distance classes having a small number of pairs, while the remaining classes have a large number of pairs. By adjusting the value of the LAGDISTANCE= option to honor the rule of thumb (at least 30 pairs), you are “wasting” pairs in the other distance classes. One strategy for solving this problem is to use less than 30 pairs for these distance classes. Then, either delete the corresponding variogram points or use them and accept the increased uncertainty. Unfortunately, the deficient distance classes are usually those close to the origin (h = 0). This is the crucial portion of the experimental variogram curve for determining the form of the theoretical variogram and for detecting the presence of a nugget effect. Another alternative is to force distance classes to contain approximately the same number of pairs. This results in distance classes of unequal widths. While PROC VARIOGRAM does not produce such distance classes directly, the OUTPAIR= data set, described in the section “OUTPAIR=SAS-data-set ” on page 4881, contains information on all distinct pairs of points. You can use this data set,
Output Data Sets
along with the RANK procedure, to produce experimental variogram-based equal numbers of pairs in each distance class. To request an OUTDIST= data set, you specify the OUTDIST= data set in the PROC VARIOGRAM statement and the NOVARIOGRAM option in the COMPUTE statement. The NOVARIOGRAM option prevents any variogram or covariance computation from being performed. Computation of the Distribution Distance Classes
The simplest way of determining the distribution of pairwise distances is to determine the maximum distance hmax between pairs and divide this distance by some number N of intervals to produce distance classes of length δ = hmax N . The distance between each pair of points P1 , P2 , denoted | P1 P2 |, is computed, and the pair P1 P2 is counted in the kth distance class if | P1 P2 |∈ [(k − 1)δ, kδ) for k = 1, · · · , N . The actual computation is a slight variation of this. A bound, rather than the actual maximum distance, is computed. This bound is the length of the diagonal of a bounding rectangle for the data points. This bounding rectangle is found by using the maximum and minimum x and y coordinates, xmax , xmin , ymax , ymin , and forming the rectangle determined by the points (xmax , ymax ), (xmax , ymin ), (xmin , ymin ), (xmin , ymax ) See Figure 80.16 for an illustration of the bounding rectangle. y
y max
o o
o
o o
o o
o
o
o o o
o
o
o
o
o o
o
o
o o o o
o
o
o
y min
x x min
x max
Figure 80.16. Bounding Rectangle to Determine Maximum Pairwise Distance
The pairwise distance bound, denoted by hb , is given by h2b = (xmax − xmin )2 + (ymax − ymin )2
4879
4880
Chapter 80. The VARIOGRAM Procedure
Using hb , the interval (0, hb ] is divided into N + 1 subintervals, where N is the value of the NHCLASSES= option specified in the COMPUTE statement, or N = 10 if the NHCLASSES= option is not specified. The basic distance unit is h0 = hNb ; the distance intervals are centered on h0 , 2h0 , · · · , N h0 , with a distance tolerance of ± h20 . The extra subinterval is (0, h0 /2), corresponding to the 0th lag. It is half the length of the remaining subintervals, and it often contains the smallest number of pairs. This method of partitioning the interval (0, hb ] is identical to what is done when you actually compute the sample variogram. The lag classes corresponding to h0 =1 are shown in Figure 80.17. lag 0
lag 1
lag 2
lag 3
o
h 1
2
3
Figure 80.17. Lag Classes Corresponding to h0 = 1
By increasing or decreasing the value of the NHCLASSES= option, you can adjust the lag or distance class with the smallest count so that this count is around 30 or some other value that you judge appropriate. Once you determine an appropriate value for the NHCLASSES= option, you can use the width of the lag classes as a candidate value for the LAGDIST= option in the COMPUTE statement. The width of the lag classes is determined by the upper bound (UB) and lower bound (LB) variables. For example, read the observation from the OUTDIST= data set corresponding to lag 1 and compute the quantity UB-LB. Use this value for the LAGDIST= option in the COMPUTE statement. Note: Do not use the 0th lag class; it is half the length of the other intervals. Use lag 1 instead. Variables in the OUTDIST= data set
The following variables are written to the OUTDIST= data set: • COUNT, which is the number of pairs falling into this lag class • LAG, which is the lag class value • LB, which is the lower bound of the lag class interval • UB, which is the upper bound of the lag class interval • PER, which is the percent of all pairs falling in this lag class • VARNAME, which is the name of the current VAR= variable
Computational Resources
OUTPAIR=SAS-data-set The OUTPAIR= data set contains one observation for each distinct pair of points P1 , P2 in the original data set, unless you specify the OUTPDISTANCE= option in the COMPUTE statement. If you specify OUTPDISTANCE=Dmax in the COMPUTE statement, all pairs P1 , P2 in the original data set that satisfy the relation | P1 P2 |≤ Dmax are written to the OUTPAIR= data set. Note that the OUTPAIR= data set can be very large even for a moderately sized DATA= data set. For example, if the DATA= data set has NOBS=500, the OUTPAIR= data set has NOBS(NOBS − 1)/2 =124,750 if no OUTPDISTANCE= restriction is given in the COMPUTE statement. The OUTPAIR= data set contains information on the distance and orientation for each point pair, and you can use it for specialized continuity measure calculations. The OUTPAIR= data set contains the following variables: • AC, which is the angle class value • COS, which is the cosine of the angle between pairs • DC, which is the distance (lag) class • DISTANCE, which is the distance between pairs • V1, which is the variable value for the first point in the pair • V2, which is the variable value for the second point in the pair • VARNAME, which is the variable name for the current VAR variable • X1, which is the x coordinate of the first point in the pair • X2, which is the x coordinate of the second point in the pair • Y1, which is the y coordinate of the first point in the pair • Y2, which is the y coordinate of the second point in the pair
Computational Resources The computations of the VARIOGRAM procedure are basically binning: for each pair of observations in the input data set, a distance and angle class is determined and recorded. Let Nd denote the number of distance classes, Na denote the number of angle classes, and Nv denote the number of VAR variables. The memory requirements for these operations are proportional to Nd × Na × Nv . This is typically small. The CPU time required for the computations is proportional to the number of pairs of observations, or to N 2 × Nv , where N is the number of observations in the input data set.
4881
4882
Chapter 80. The VARIOGRAM Procedure
Example Example 80.1. A Box Plot of the Square Root Difference Cloud The Gaussian form chosen for the variogram in the “Getting Started” section on page 4852 is based on the consideration of the plots of the sample variogram. For the coal thickness data, the Gaussian form appears to be a reasonable choice. It can often happen, however, that a plot of the sample variogram shows so much scatter that no particular form is evident. The cause of this scatter can be one or more outliers in the pairwise differences of the measured quantities. A method of identifying potential outliers is discussed in Cressie (1993, section 2.2.2). This example illustrates how to use the OUTPAIR= data set from PROC VARIOGRAM to produce a square root difference cloud, which is useful in detecting outliers. For the spatial process Z(s), s ∈ R2 , the square root difference cloud for a particular direction e is given by 1
| Z(si + he) − Z(si ) | 2 for a given lag distance h. In the actual computation, all pairs of points P1 , P2 within a distance tolerance around h and an angle tolerance around the direction e are used. This generates a number of point pairs for each lag class h. The spread of these values gives an indication of outliers. Following the example in the “Getting Started” section on page 4852, this example uses a basic lag distance of 7 units, with a distance tolerance of 3.5, and a direction of N–S, with a 30o angle tolerance. First, input the data, then use PROC VARIOGRAM to produce an OUTPAIR= data set. Then use a DATA step to subset this data by choosing pairs within 30o of N–S. In addition, compute lag class and square root difference variables. Next, summarize the results using the MEANS procedure and present them in a box plot using the SHEWHART procedure. The box plot facilitates the detection of outliers. You can conclude from this example that there does not appear to be any outliers in the N–S direction for the coal seam thickness data. title ’Square Root Difference Cloud Example’; data thick; input east north thick @@; datalines; 0.7 59.6 34.1 2.1 82.7 42.2 4.7 75.1 4.8 52.8 34.3 5.9 67.1 37.0 6.0 35.7 6.4 33.7 36.4 7.0 46.7 34.6 8.2 40.1 13.3 0.6 44.7 13.3 68.2 37.8 13.4 31.3 17.8 6.9 43.9 20.1 66.3 37.7 22.7 87.6 23.0 93.9 43.6 24.3 73.0 39.3 24.8 15.1
39.5 35.9 35.4 37.8 42.8 42.3
Example 80.1. A Box Plot of the Square Root Difference Cloud 24.8 27.7 29.5 32.7 37.0 39.4 46.4 51.0 55.5 62.1 70.5 78.1 80.5 84.5 86.7 88.4 88.9 91.5 94.8 ;
26.3 83.3 89.4 40.2 70.3 82.5 84.1 88.8 92.9 26.6 83.7 45.5 55.9 11.0 70.4 12.1 6.2 55.4 71.5
39.7 41.8 43.0 37.5 39.2 41.4 41.5 42.0 42.2 40.1 40.9 38.7 38.7 41.5 39.6 41.3 41.5 39.0 39.7
26.4 27.9 30.1 34.8 38.2 43.0 46.7 52.8 56.0 63.0 70.9 78.2 81.1 85.2 87.2 88.4 90.6 92.9 96.2
58.0 90.8 6.1 8.1 77.9 4.7 10.6 68.9 1.6 12.7 11.0 9.1 51.0 67.3 55.7 99.6 7.0 46.8 84.3
36.9 43.3 43.6 43.3 40.7 43.3 42.6 39.3 42.7 41.8 41.7 41.7 38.6 39.4 38.8 41.2 41.5 39.1 40.3
26.9 29.1 30.8 35.3 38.9 43.7 49.9 52.9 60.6 69.0 71.5 78.4 83.8 85.5 88.1 88.8 90.7 93.4 98.2
65.0 47.9 12.1 32.0 23.3 7.6 22.1 32.7 75.2 75.6 29.5 20.0 7.9 73.0 0.0 82.9 49.6 70.9 58.2
37.8 36.7 42.8 38.8 40.5 43.1 40.7 39.2 40.1 40.1 39.8 40.8 41.6 39.8 41.6 40.5 38.9 39.7 39.5
proc variogram data=thick outp=outp; coordinates xc=east yc=north; var thick; compute novar; run; data sqroot; set outp; /*- Include only points +/- 30 degrees of N-S -------*/ where abs(cos) < .5; /*- Unit lag of 7, distance tolerance of 3.5 lag_class=int(distance/7 + .5000001); sqr_diff=sqrt(abs(v1-v2)); run; proc sort data=sqroot; by lag_class; run; proc means data=sqroot noprint n mean std; var sqr_diff; by lag_class; output out=msqrt n=n mean=mean std=std; run; title2 ’Summary of Results’; proc print data=msqrt; id lag_class; var n mean std; run;
-------*/
4883
4884
Chapter 80. The VARIOGRAM Procedure title ’Box Plot of the Square Root Difference Cloud’; proc shewhart data=sqroot; boxchart sqr_diff*lag_class / cframe=ligr haxis=axis1 vaxis=axis2; symbol1 v=dot c=blue height=3.5pct; axis1 minor=none; axis2 minor=none label=(angle=90 rotate=0); run;
Output 80.1.1. Summary of Results Square Root Difference Cloud Example Summary of Results lag_ class 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
n 5 31 55 58 63 61 75 85 84 115 82 85 68 38 7 3
mean 0.47300 0.77338 1.13908 1.51768 1.67858 1.66014 1.77999 1.69703 1.74687 1.70635 1.48100 1.19877 0.89765 0.84223 1.05653 1.35076
std 0.14263 0.41467 0.47604 0.51989 0.60494 0.70687 0.64590 0.75362 0.68785 0.57173 0.48105 0.47121 0.42510 0.44249 0.42548 0.11472
References
Output 80.1.2. Box Plot of the Square Root Difference Cloud
References Cressie, N.A.C. (1993), Statistics for Spatial Data, New York: John Wiley & Sons, Inc. Duetsch, C.V. and Journel, A.G. (1992), GSLIB: Geostatistical Software Library and User’s Guide, New York: Oxford University Press. Hohn, M.E. (1988), Geostatistics and Petroleum Geology, New York: Van Nostrand Reinhold.
4885
Appendix A
Special SAS Data Sets Appendix Contents INTRODUCTION TO SPECIAL SAS DATA SETS . . . . . . . . . . . . . 4889 SPECIAL SAS DATA SETS . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=CORR Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . Example A.1: A TYPE=CORR Data Set Produced by PROC CORR Example A.2: Creating a TYPE=CORR Data Set in a DATA Step . . TYPE=UCORR Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=COV Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=UCOV Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=SSCP Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . Example A.3: A TYPE=SSCP Data Set Produced by PROC REG . . TYPE=CSSCP Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=EST Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . Example A.4: A TYPE=EST Data Set Produced by PROC REG . . . TYPE=ACE Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=DISTANCE Data Sets . . . . . . . . . . . . . . . . . . . . . . . TYPE=FACTOR Data Sets . . . . . . . . . . . . . . . . . . . . . . . . TYPE=RAM Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=WEIGHT Data Sets . . . . . . . . . . . . . . . . . . . . . . . . TYPE=LINEAR Data Sets . . . . . . . . . . . . . . . . . . . . . . . . TYPE=QUAD Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . TYPE=MIXED Data Sets . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. 4893 . 4893 . 4895 . 4896 . 4897 . 4897 . 4898 . 4898 . 4898 . 4899 . 4899 . 4900 . 4900 . 4901 . 4901 . 4901 . 4901 . 4902 . 4902 . 4902
DEFINITIONAL FORMULAS . . . . . . . . . . . . . . . . . . . . . . . . . 4902
4888
Appendix A. Special SAS Data Sets
Appendix A
Special SAS Data Sets Introduction to Special SAS Data Sets All SAS/STAT procedures create SAS data sets. Any table generated by a procedure can be saved to a data set by using the Output Delivery System (ODS), and many procedures also have syntax to enable you to save other statistics to data sets. Some of these data sets are organized according to certain conventions that allow them to be read by a SAS/STAT procedure for further analysis. Such specially organized data sets are recognized by the TYPE= attribute of the data set. For example, the CORR procedure (refer to the SAS Procedures Guide) can create a data set with the attribute TYPE=CORR containing a correlation matrix. This TYPE=CORR data set can be read by the REG or FACTOR procedure, among others. If the original data set is large, using a special SAS data set in this way can save a great deal of computer time by avoiding the recomputation of the correlation matrix in each of several analyses. As another example, the REG procedure can create a TYPE=EST data set containing estimated regression coefficients. If you need to make predictions for new observations, you can have the SCORE procedure read both the TYPE=EST data set and a data set containing the new observations. PROC SCORE can then compute predicted values or residuals without repeating the entire regression analysis. See Chapter 64, “The SCORE Procedure,” for an example. A special SAS data set may contain different kinds of statistics. A special variable called – TYPE– is used to distinguish the various statistics. For example, in a TYPE=CORR data set, an observation in which – TYPE– =’MEAN’ contains the means of the variables in the analysis, and an observation in which – TYPE– =’STD’ contains the standard deviations. Correlations appear in observations with – TYPE– =’CORR’. Another special variable, – NAME– , is needed to identify the row of the correlation matrix. Thus, the correlation between variables X and Y would be given by the value of the variable X in the observation for which – TYPE– =’CORR’ and – NAME– =’Y’, or by the value of the variable Y in the observation for which – TYPE– =’CORR’ and – NAME– =’X’. You can create special SAS data steps directly in a DATA step by specifying the TYPE= option in parentheses after the data set name in the DATA statement. See Example A.2 on page 4896 for an example. The special data sets created by SAS/STAT procedures can generally be used directly by other procedures without modification. However, if you create an output data set with PROC CORR and use the NOCORR option to omit the correlation matrix from the OUT= data set, you need to set the TYPE= option either in parentheses following the OUT= data set name in the PROC CORR statement or in parentheses following
4890
Appendix A. Special SAS Data Sets
the DATA= option in any other procedure that recognizes the special TYPE= attribute. In either case, the TYPE= option should be set to COV, CSSCP, or SSCP according to what type of matrix is stored in the data set and what data set types are accepted as input by the other procedures you plan to use. If you do not follow these steps and you use the TYPE=CORR data set with no correlation matrix as input to another procedure, the procedure may issue an error message indicating that the correlation matrix is missing from the data set. If you use a DATA step with a SET statement to modify a special SAS data set, you must specify the TYPE= option in the DATA statement. The TYPE= attribute of the data set in the SET statement is not automatically copied to the data set being created. You can determine the TYPE= attribute of a data set by using the CONTENTS procedure (see Example A.1 on page 4895 and refer to the SAS Procedures Guide for details). Table A.1 summarizes the TYPE= data sets that can be used as input to SAS/STAT procedures and the TYPE= data sets that are created by SAS/STAT procedures. The essential parts of the statements each procedure uses to create its output data set or data sets are shown. Formulas useful for illustrating differences between corrected and uncorrected matrices in some special SAS data sets are shown in the “Definitional Formulas” section on page 4902. Table A.1. SAS/STAT Procedures and Types of Data Sets Output Data Sets Input Data Sets (TYPE=null or as Procedure TYPE= as shown∗ shown) Created by Statement and Specification ACECLUS INITIAL= INPUT= ACE PROC ACECLUS OUTSTAT= data set may be PROC ACECLUS OUT= of type: ACE, CORR, COV, SSCP, UCORR, UCOV ANOVA PROC ANOVA OUTSTAT= CALIS
CORR, COV, FACTOR, RAM, SSCP, UCORR, UCOV, WEIGHT
CORR COV EST UCORR UCOV RAM WEIGHT
PROC CALIS OUTSTAT= PROC CALIS COV OUTSTAT= PROC CALIS OUTEST= PROC CALIS NOINT OUTSTAT= PROC CALIS NOINT COV OUTSTAT= PROC CALIS OUTRAM= PROC CALIS OUTWGT=
CANCORR
CORR, COV, FACTOR, SSCP, UCORR, UCOV
CORR UCORR
PROC CANCORR OUTSTAT= PROC CANCORR NOINT OUTSTAT= PROC CANCORR OUT=
CANDISC
CORR, COV, SSCP, CSSCP
CORR
PROC CANDISC OUTSTAT= PROC CANDISC OUT=
CATMOD
EST
EST
RESPONSE / OUTEST= RESPONSE /OUT=
CLUSTER
DISTANCE
TREE
PROC CLUSTER OUTTREE=
Introduction to Special SAS Data Sets
4891
Table A.1. (continued)
Procedure CORRESP DISCRIM
Output Data Sets Input Data Sets (TYPE=null or as TYPE= as shown∗ shown)
CORR, COV, SSCP, CSSCP, LINEAR, QUAD, MIXED
DISTANCE FACTOR
ACE, CORR, COV, FACTOR, SSCP, UCORR, UCOV
Created by Statement and Specification PROC CORRESP OUTC= PROC CORRESP OUTF=
LINEAR QUAD MIXED CORR
PROC DISCRIM POOL=YES OUTSTAT= PROC DISCRIM POOL=NO OUTSTAT= PROC DISCRIM POOL=TEST OUTSTAT= PROC DISCRIM METHOD=NPAR OUTSTAT= PROC DISCRIM OUT= PROC DISCRIM OUTCROSS= PROC DISCRIM OUTD= PROC DISCRIM TESTOUT= PROC DISCRIM TESTOUTD=
DISTANCE
PROC DISTANCE OUT= PROC DISTANCE OUTSDZ=
FACTOR
PROC FACTOR OUTSTAT= PROC FACTOR OUT=
FASTCLUS
PROC FASTCLUS OUT= PROC FASTCLUS OUTSEED= PROC FASTCLUS OUTSTAT= PROC FASTCLUS MEAN=
FREQ
TABLES OUT= OUTPUT OUT=
GENMOD
OUTPUT OUT=
GLM
PROC GLM OUTSTAT= LSMEANS / OUT= OUTPUT OUT=
GLMMOD
PROC GLMMOD OUTDESIGN= PROC GLMMOD OUTPARM=
INBREED
PROC INBREED OUTCOV=
KRIGE2D
PROC KRIGE2D OUTEST= PROC KRIGE2D OUTNBHD=
LATTICE LIFEREG
EST
LIFETEST LOGISTIC
PROC LIFEREG OUTEST= OUTPUT OUT= PROC LIFETEST OUTSURV= PROC LIFETEST OUTTEST=
EST
PROC LOGISTIC OUTEST= OUTPUT OUT= MODEL / OUTROC=
4892
Appendix A. Special SAS Data Sets
Table A.1. (continued)
Procedure MDS
Output Data Sets Input Data Sets (TYPE=null or as TYPE= as shown∗ shown)
MIXED
MODECLUS
Created by Statement and Specification PROC MDS OUT= PROC MDS OUTFIT= PROC MDS OUTRES= MODEL OUTPRED= MODEL OUTPREDM= PRIOR OUT= PRIOR OUTG= PRIOR OUTGT=
DISTANCE
PROC MODECLUS OUT= PROC MODECLUS OUTCLUS= PROC MODECLUS OUTSUM=
MULTTEST
PROC MULTTEST OUT= PROC MULTTEST OUTPERM= PROC MULTTEST OUTSAMP=
NESTED NLIN
EST
NPAR1WAY
PROC NLIN OUTEST= OUTPUT OUT= OUTPUT OUT=
ORTHOREG
EST
PROC ORTHOREG OUTEST=
PHREG
EST
PROC PHREG OUTEST= BASELINE OUT= OUTPUT OUT=
PLAN
OUTPUT OUT=
PLS
OUTPUT OUT=
PRINCOMP
ACE, CORR, COV, EST, FACTOR, SSCP, UCORR, UCOV
CORR COV UCORR UCOV
PRINQUAL
PROC PRINQUAL OUT=
PROBIT REG
CORR, COV, SSCP, UCORR, UCOV
RSREG SCORE SIM2D
PROC PRINCOMP OUTSTAT= PROC PRINCOMP COV OUTSTAT= PROC PRINCOMP NOINT OUTSTAT= PROC PRINCOMP NOINT COV OUTSTAT= PROC PRINCOMP OUT=
EST
PROC PROBIT OUTEST= OUTPUT OUT=
EST SSCP
PROC REG OUTEST= PROC REG OUTSSCP= OUTPUT OUT= PROC RSREG OUT= RIDGE OUTR=
SCORE= data set can be of any type
PROC SCORE OUT= PROC SIM2D OUTSIM=
TYPE=CORR Data Sets
4893
Table A.1. (continued)
Procedure SURVEYSELECT
Output Data Sets Input Data Sets (TYPE=null or as TYPE= as shown∗ shown)
STDIZE STEPDISC
PROC STDIZE OUT= PROC STDIZE OUTSTAT= CORR, COV, SSCP, CSSCP
TRANSREG TREE
Created by Statement and Specification PROC SURVEYSELECT OUT= PROC SURVEYSELECT OUTSORT=
PROC TRANSREG OUTTEST= OUTPUT OUT= TREE
PROC TREE OUT=
TTEST VARCLUS
CORR, COV, FACTOR, SSCP, UCORR, UCOV
CORR UCORR TREE
PROC VARCLUS OUTSTAT= PROC VARCLUS NOINT OUTSTAT= PROC VARCLUS OUTTREE=
VARCOMP VARIOGRAM
PROC VARIOGRAM OUTDISTANCE= PROC VARIOGRAM OUTPAIR= PROC VARIOGRAM OUTVAR=
∗
If no TYPE= is shown, the procedure does not recognize any special data set types except possibly to issue an error message for inappropriate values of TYPE=.
Special SAS Data Sets TYPE=CORR Data Sets A TYPE=CORR data set usually contains a correlation matrix and possibly other statistics including means, standard deviations, and the number of observations in the original SAS data set from which the correlation matrix was computed. Using PROC CORR with an output data set option (OUTP=, OUTS=, OUTK=, OUTH=, or OUT=) produces a TYPE=CORR data set. (For a complete description of the CORR procedure, refer to the SAS Procedures Guide). The CALIS, CANCORR, CANDISC, DISCRIM, PRINCOMP, and VARCLUS procedures can also create a TYPE=CORR data set with additional statistics. A TYPE=CORR data set containing a correlation matrix can be used as input for the ACECLUS, CALIS, CANCORR, CANDISC, DISCRIM, FACTOR, PRINCOMP, REG, SCORE, STEPDISC, and VARCLUS procedures.
4894
Appendix A. Special SAS Data Sets
The variables in a TYPE=CORR data set are • the BY variable or variables, if a BY statement is used with the procedure • – TYPE– , a character variable of length eight with values identifying the type of statistic in each observation, such as ’MEAN’, ’STD’, ’N’, and ’CORR’ • – NAME– , a character variable with values identifying the variable with which a given row of the correlation matrix is associated • other variables that were analyzed by the CORR procedure or other procedures The usual values of the – TYPE– variable are as follows.
– TYPE– MEAN
Contents mean of each variable analyzed
STD
standard deviation of each variable
N
number of observations used in the analysis. PROC CORR records the number of nonmissing values for each variable unless the NOMISS option is used. If the NOMISS option is specified, or if the CALIS, CANCORR, CANDISC, PRINCOMP, or VARCLUS procedure is used to create the data set, observations with one or more missing values are omitted from the analysis, so this value is the same for each variable and provides the number of observations with no missing values. If a FREQ statement is used with the procedure that creates the data set, the number of observations is the sum of the relevant values of the variable in the FREQ statement. Procedures that read a TYPE=CORR data set use the smallest value in the observation with – TYPE– =’N’ as the number of observations in the analysis.
SUMWGT
sum of the observation weights if a WEIGHT statement is used with the procedure that creates the data set. The values are determined analogously to those of the – TYPE– =’N’ observation.
CORR
correlations with the variable named by the – NAME– variable
There may be additional observations in a TYPE=CORR data set depending on the particular procedure and options used. If you create a TYPE=CORR data set yourself, the data set need not contain the observations with – TYPE– =’MEAN’, ’STD’, ’N’, or ’SUMWGT’, unless you intend to use one of the discriminant procedures. Procedures assume that all of the means are 0.0 and that the standard deviations are 1.0 if this information is not in the TYPE=CORR data set. If – TYPE– =’N’ does not appear, most procedures assume that the number of observations is 10,000; significance tests and other statistics that depend on the number of observations are, of course, meaningless. In the CALIS and CANCORR procedures, you can use the EDF= option instead of including a – TYPE– =’N’ observation.
TYPE=CORR Data Sets
A correlation matrix is symmetric; that is, the correlation between X and Y is the same as the correlation between Y and X. The CALIS, CANCORR, CANDISC, CORR, DISCRIM, PRINCOMP, and VARCLUS procedures output the entire correlation matrix. If you create the data set yourself, you need to include only one of the two occurrences of the correlation between two variables; the other may be given a missing value. If you create a TYPE=CORR data set yourself, the – TYPE– and – NAME– variables are not necessary except for use with the discriminant procedures and PROC SCORE. If there is no – TYPE– variable, then all observations are assumed to contain correlations. If there is no – NAME– variable, the first observation is assumed to correspond to the first variable in the analysis, the second observation to the second variable, and so on. However, if you omit the – NAME– variable, you will not be able to analyze arbitrary subsets of the variables or list the variables in a VAR or MODEL statement in a different order.
Example A.1: A TYPE=CORR Data Set Produced by PROC CORR See Output A.1.1 for an example of a TYPE=CORR data set produced by the following SAS statements. Output A.1.2 displays partial output from the CONTENTS procedure, which indicates that the “Data Set Type” is ’CORR’. title ’Five Socioeconomic Variables’; data SocEcon; title2 ’Harman (1976), Modern Factor Analysis, 3rd ed’; input pop school employ services house; datalines; 5700 12.8 2500 270 25000 1000 10.9 600 10 10000 3400 8.8 1000 10 9000 3800 13.6 1700 140 25000 4000 12.8 1600 140 25000 8200 8.3 2600 60 12000 1200 11.4 400 10 16000 9100 11.5 3300 60 14000 9900 12.5 3400 180 18000 9600 13.7 3600 390 25000 9600 9.6 3300 80 12000 9400 11.4 4000 100 13000 ; proc corr noprint out=corrcorr; run; proc print data=corrcorr; run; proc contents data=corrcorr; run;
4895
4896
Appendix A. Special SAS Data Sets
Output A.1.1. A TYPE=CORR Data Set Produced by PROC CORR Five Socioeconomic Variables Harman (1976), Modern Factor Analysis, 3rd ed Obs
_TYPE_
1 2 3 4 5 6 7 8
MEAN STD N CORR CORR CORR CORR CORR
_NAME_
pop school employ services house
pop
school
employ
services
house
6241.67 3439.99 12.00 1.00 0.01 0.97 0.44 0.02
11.4417 1.7865 12.0000 0.0098 1.0000 0.1543 0.6914 0.8631
2333.33 1241.21 12.00 0.97 0.15 1.00 0.51 0.12
120.833 114.928 12.000 0.439 0.691 0.515 1.000 0.778
17000.00 6367.53 12.00 0.02 0.86 0.12 0.78 1.00
Output A.1.2. Contents of a TYPE=CORR Data Set The CONTENTS Procedure Data Set Name Member Type Engine Created Last Modified Protection Data Set Type Label
WORK.CORRCORR DATA V8 13:56 Wednesday, July 25, 2001 13:56 Wednesday, July 25, 2001 CORR Pearson Correlation Matrix
Observations Variables Indexes Observation Length Deleted Observations Compressed Sorted
8 7 0 56 0 NO NO
Example A.2: Creating a TYPE=CORR Data Set in a DATA Step This example creates a TYPE=CORR data set by reading a correlation matrix in a DATA step. Output A.2.2 shows the resulting data set. title ’Five Socioeconomic Variables’; data datacorr(type=corr); infile cards missover; type_=’corr’; input _name_ $ pop school employ services house; datalines; POP 1.00000 SCHOOL 0.00975 1.00000 EMPLOY 0.97245 0.15428 1.00000 SERVICES 0.43887 0.69141 0.51472 1.00000 HOUSE 0.02241 0.86307 0.12193 0.77765 1.00000 ; run; proc print data=datacorr; run;
TYPE=COV Data Sets
Output A.2.2. A TYPE=CORR Data Set Created by a DATA Step Five Socioeconomic Variables Obs 1 2 3 4 5
type_
_name_
corr corr corr corr corr
POP SCHOOL EMPLOY SERVICES HOUSE
pop
school
employ
services
1.00000 0.00975 0.97245 0.43887 0.02241
. 1.00000 0.15428 0.69141 0.86307
. . 1.00000 0.51472 0.12193
. . . 1.00000 0.77765
house . . . . 1
TYPE=UCORR Data Sets A TYPE=UCORR data set is almost identical to a TYPE=CORR data set, except that the correlations are uncorrected for the mean. The corresponding value of the – TYPE– variable is ’UCORR’ instead of ’CORR’. Uncorrected standard deviations are in observations with – TYPE– =’USTD’. A TYPE=UCORR data set can be used as input for every SAS/STAT procedure that uses a TYPE=CORR data set, except for the CANDISC, DISCRIM, and STEPDISC procedures. TYPE=UCORR data sets can be created by the CALIS, CANCORR, PRINCOMP, and VARCLUS procedures.
TYPE=COV Data Sets A TYPE=COV data set is similar to a TYPE=CORR data set except that it has – TYPE– =’COV’ observations containing covariances instead of or in addition to – TYPE– =’CORR’ observations containing correlations. The CALIS and PRINCOMP procedures create a TYPE=COV data set if the COV option is used. You can also create a TYPE=COV data set by using PROC CORR with the COV and NOCORR options and specifying the data set option TYPE=COV in parentheses following the name of the output data set. You can use only the OUTP= or OUT= options to create a TYPE=COV data set with PROC CORR. Another way to create a TYPE=COV data set is to read a covariance matrix in a data set, in the same manner as shown in Example A.2 on page 4896 for a TYPE=CORR data set. TYPE=COV data sets are used by the same procedures that use TYPE=CORR data sets.
4897
4898
Appendix A. Special SAS Data Sets
TYPE=UCOV Data Sets A TYPE=UCOV data set is similar to a TYPE=COV data set, except that the covariances are uncorrected for the mean. Also, the corresponding value of the – TYPE– variable is ’UCOV’ instead of ’COV’. A TYPE=UCOV data set can be used as input for every SAS/STAT procedure that uses a TYPE=COV data set, except for the CANDISC, DISCRIM, and STEPDISC procedures. TYPE=UCOV data sets can be created by the CALIS and PRINCOMP procedures.
TYPE=SSCP Data Sets A TYPE=SSCP data set contains an uncorrected sum of squares and crossproducts (SSCP) matrix. TYPE=SSCP data sets are produced by PROC REG when the OUTSSCP= option is specified in the PROC REG statement. You can also create a TYPE=SSCP data set by using PROC CORR with the SSCP option and specifying the data set option TYPE=SSCP in parentheses following the name of the OUTP= or OUT= data set. You can also create TYPE=SSCP data sets in a DATA step; in this case, TYPE=SSCP must be specified as a data set option. The variables in a TYPE=SSCP data set include those found in a TYPE=CORR data set. In addition, there is a variable called Intercept that contains crossproducts for the intercept (sums of the variables). The SSCP matrix is stored in observations with – TYPE– =’SSCP’, including a row with – NAME– =’Intercept’. PROC REG also outputs an observation with – TYPE– =’N’. PROC CORR includes observations with – TYPE– =’MEAN’ and – TYPE– =’STD’ as well. TYPE=SSCP data sets are used by the same procedures that use TYPE=CORR data sets.
Example A.3: A TYPE=SSCP Data Set Produced by PROC REG Output A.3.1 shows a TYPE=SSCP data set produced by PROC REG from the SocEcon data set created in Example A.1 on page 4895. proc reg data=SocEcon outsscp=regsscp; model house=pop school employ services / noprint; run; proc print data=regsscp; run;
TYPE=EST Data Sets
Output A.3.1. A TYPE=SSCP Data Set Produced by PROC REG Obs
_TYPE_
1 2 3 4 5 6 7
SSCP SSCP SSCP SSCP SSCP SSCP N
_NAME_
Intercept
pop
school
employ
services
house
12.0 74900.0 137.3 28000.0 1450.0 204000.0 12.0
74900 597670000 857640 220440000 10959000 1278700000 12
137.30 857640.00 1606.05 324130.00 18152.00 2442100.00 12.00
28000 220440000 324130 82280000 4191000 486600000 12
1450 10959000 18152 4191000 320500 30910000 12
204000 1278700000 2442100 486600000 30910000 3914000000 12
Intercept pop school employ services house
TYPE=CSSCP Data Sets A TYPE=CSSCP data set contains a corrected sum of squares and crossproducts (CSSCP) matrix. TYPE=CSSCP data sets are created by using the CORR procedure with the CSSCP option and specifying the data set option TYPE=CSSCP in parentheses following the name of the OUTP= or OUT= data set. You can also create TYPE=CSSCP data sets in a DATA step; in this case, TYPE=CSSCP must be specified as a data set option. The variables in a TYPE=CSSCP data set are the same as those found in a TYPE=SSCP data set, except that there is not a variable called Intercept or a row with – NAME– =’Intercept’. TYPE=CSSCP data sets are read by only the CANDISC, DISCRIM, and STEPDISC procedures.
TYPE=EST Data Sets A TYPE=EST data set contains parameter estimates. The CALIS, CATMOD, LIFEREG, LOGISTIC, NLIN, ORTHOREG, PHREG, PROBIT, and REG procedures create TYPE=EST data sets when the OUTEST= option is specified. A TYPE=EST data set produced by PROC LIFEREG, PROC ORTHOREG, or PROC REG can be used with PROC SCORE to compute residuals or predicted values. The variables in a TYPE=EST data set include • the BY variables, if a BY statement is used • – TYPE– , a character variable of length eight, that indicates the type of estimate. The values depend on which procedure created the data set. Usually a value of ’PARM’ or ’PARMS’ indicates estimated regression coefficients, and a value of ’COV’ or ’COVB’ indicates estimated covariances of the parameter estimates. Some procedures, such as PROC NLIN, have other values of – TYPE– for special purposes. • – NAME– , a character variable that contains the values of the names of the rows of the covariance matrix when the procedure outputs the covariance matrix of the parameter estimates. • variables that contain the parameter estimates, usually the same variables that appear in the VAR statement or in any MODEL statement. See Chapter
4899
4900
Appendix A. Special SAS Data Sets 19, “The CALIS Procedure,” Chapter 22, “The CATMOD Procedure,” and Chapter 50, “The NLIN Procedure,” for details on the variable names used in output data sets created by those procedures.
Other variables can be included depending on the particular procedure and options used.
Example A.4: A TYPE=EST Data Set Produced by PROC REG

Output A.4.1 shows the TYPE=EST data set produced by the following statements:

   proc reg data=SocEcon outest=regest covout;
      full:   model house=pop school employ services / noprint;
      empser: model house=employ services / noprint;
   run;

   proc print data=regest;
   run;
Output A.4.1. A TYPE=EST Data Set Produced by PROC REG

   Obs  _MODEL_  _TYPE_  _NAME_     _DEPVAR_   _RMSE_     Intercept       pop       school     employ   services  house

     1  full     PARMS              house     3122.03      -8074.21      0.65      2140.10      -2.92      27.81     -1
     2  full     COV     Intercept  house     3122.03  109408014.44  -9157.04  -9784744.54   20612.49  102764.89      .
     3  full     COV     pop        house     3122.03      -9157.04      2.32       852.86      -6.20      -5.20      .
     4  full     COV     school     house     3122.03   -9784744.54    852.86    907886.36   -2042.24   -9608.59      .
     5  full     COV     employ     house     3122.03      20612.49     -6.20     -2042.24      17.44       6.50      .
     6  full     COV     services   house     3122.03     102764.89     -5.20     -9608.59       6.50     202.56      .
     7  empser   PARMS              house     3789.96      15021.71       .           .         -1.94      53.88     -1
     8  empser   COV     Intercept  house     3789.96    5824096.19       .           .      -1915.99   -1294.94      .
     9  empser   COV     employ     house     3789.96      -1915.99       .           .          1.15      -6.41      .
    10  empser   COV     services   house     3789.96      -1294.94       .           .         -6.41     134.49      .
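As a sketch of the PROC SCORE usage mentioned at the start of this section, the following statements apply the stored estimates in regest to the original data. TYPE=PARMS selects the _TYPE_='PARMS' observations for scoring, and the PREDICT option requests predicted values (substitute RESIDUAL to get residuals instead); the dependent variable house is listed so that its -1 coefficient is available to PROC SCORE:

   proc score data=SocEcon score=regest out=pred type=parms predict;
      var pop school employ services house;
   run;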
TYPE=ACE Data Sets

A TYPE=ACE data set is created by the ACECLUS procedure, and it contains the approximate within-cluster covariance estimate, as well as eigenvalues and eigenvectors from a canonical analysis, among other statistics. It can be used as input to the ACECLUS procedure to initialize another execution of PROC ACECLUS. It can also be used to compute canonical variable scores with the SCORE procedure and as input to the FACTOR procedure, specifying METHOD=SCORE, to rotate the canonical variables. See Chapter 16, "The ACECLUS Procedure," for details.
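A minimal sketch of the scoring route, borrowing the SocEcon variables purely for illustration (PROC ACECLUS requires a PROPORTION= or THRESHOLD= value; 0.03 here is arbitrary):

   proc aceclus data=SocEcon outstat=acestat proportion=0.03;
      var pop school employ services house;
   run;

   proc score data=SocEcon score=acestat out=canscores;
      var pop school employ services house;
   run;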
TYPE=DISTANCE Data Sets

You can create a TYPE=DISTANCE data set containing distance or dissimilarity measures using the DISTANCE procedure. The proximity measures are stored as a lower triangular matrix or a square matrix in the OUT= data set (depending on the SHAPE= option). See Chapter 26, "The DISTANCE Procedure," for details. You can also create a TYPE=DISTANCE data set in a DATA step by reading or computing a lower triangular or symmetric matrix of dissimilarity values, such as a chart of mileage between cities, as shown in the sketch below. The number of observations must be equal to the number of variables used in the analysis. This type of data set is used as input by the CLUSTER and MODECLUS procedures. PROC CLUSTER ignores the upper triangular portion of a TYPE=DISTANCE data set and assumes that all main diagonal values are zero, even if they are missing. PROC MODECLUS uses the entire distance matrix and does not require the matrix to be symmetric. See Chapter 23, "The CLUSTER Procedure," and Chapter 47, "The MODECLUS Procedure," for examples and details.
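For instance, the following DATA step builds a small mileage matrix (the city names and distances are illustrative only) and clusters it; the upper triangle is left missing, which PROC CLUSTER ignores:

   data mileage(type=distance);
      input Atlanta Chicago Denver City $;
      datalines;
      0     .     .    Atlanta
    587     0     .    Chicago
   1212   920     0    Denver
   ;

   proc cluster data=mileage method=average outtree=tree;
      id City;
   run;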
TYPE=FACTOR Data Sets

A TYPE=FACTOR data set is created by PROC FACTOR when the OUTSTAT= option is specified. The CALIS, CANCORR, FACTOR, PRINCOMP, SCORE, and VARCLUS procedures can use TYPE=FACTOR data sets as input. The variables are the same as in a TYPE=CORR data set. The statistics include means, standard deviations, sample size, correlations, eigenvalues, eigenvectors, factor patterns, residual correlations, scoring coefficients, and others depending on the options specified. See Chapter 27, "The FACTOR Procedure," for details.

When the NOINT option is used with the OUTSTAT= option in PROC FACTOR, the value of the _TYPE_ variable is set to 'USCORE' instead of 'SCORE' to indicate that the scoring coefficients have not been corrected for the mean. If this data set is used with the SCORE procedure, the value of the _TYPE_ variable tells PROC SCORE whether or not to subtract the mean from the scoring coefficients.
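A minimal sketch of the round trip (the SCORE option asks PROC FACTOR to include scoring coefficients in the OUTSTAT= data set; the data set and variable names are carried over only for illustration):

   proc factor data=SocEcon n=2 score outstat=fact;
      var pop school employ services house;
   run;

   proc score data=SocEcon score=fact out=fscores;
      var pop school employ services house;
   run;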
TYPE=RAM Data Sets

The CALIS procedure creates and accepts as input a TYPE=RAM data set. This data set contains the model specification and the computed parameter estimates. A TYPE=RAM data set is intended to be reused as an input data set to specify good initial values in subsequent analyses by PROC CALIS. See Chapter 19, "The CALIS Procedure," for details.
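A hedged sketch of this reuse, with a one-equation LINEQS model on hypothetical variables X and Y in a hypothetical data set sample:

   proc calis data=sample outram=ram1;
      lineqs Y = beta1 X + e1;
      std    e1 = vare1,
             X  = varx;
   run;

   /* refit the same model, starting from the previous estimates */
   proc calis data=sample inram=ram1;
   run;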
TYPE=WEIGHT Data Sets

The CALIS procedure creates and accepts as input a TYPE=WEIGHT data set. This data set contains the weight matrix used in generalized, weighted, or diagonally weighted least-squares estimation. See Chapter 19, "The CALIS Procedure," for details.
TYPE=LINEAR Data Sets

A TYPE=LINEAR data set contains the coefficients of a linear function of the variables in observations with _TYPE_='LINEAR'. The DISCRIM procedure stores linear discriminant function coefficients in a TYPE=LINEAR data set when you specify METHOD=NORMAL (the default method), POOL=YES, and an OUTSTAT= data set; the data set can be used in a subsequent invocation of PROC DISCRIM to classify additional observations, as shown in the sketch below. Many other statistics can be included depending on the options used. See Chapter 25, "The DISCRIM Procedure," for details.
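For example (the data set names train and new and the variables Group, x1, and x2 are hypothetical):

   proc discrim data=train outstat=ldf method=normal pool=yes;
      class Group;
      var x1 x2;
   run;

   /* classify new observations using the stored coefficients */
   proc discrim data=ldf testdata=new testout=classified;
      class Group;
      var x1 x2;
   run;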
TYPE=QUAD Data Sets

A TYPE=QUAD data set contains the coefficients of a quadratic function of the variables in observations with _TYPE_='QUAD'. The DISCRIM procedure stores quadratic discriminant function coefficients in a TYPE=QUAD data set when you specify METHOD=NORMAL (the default method), POOL=NO, and an OUTSTAT= data set; the data set can be used in a subsequent invocation of PROC DISCRIM to classify additional observations. Many other statistics can be included depending on the options used. See Chapter 25, "The DISCRIM Procedure," for details.
TYPE=MIXED Data Sets

A TYPE=MIXED data set contains coefficients of either a linear or a quadratic function, or both if there are BY groups. The DISCRIM procedure produces a TYPE=MIXED data set when you specify METHOD=NORMAL (the default method), POOL=TEST, and an OUTSTAT= data set. See Chapter 25, "The DISCRIM Procedure," for details.
Definitional Formulas

This section contrasts corrected and uncorrected SSCP, COV, and CORR matrices by showing how these matrices can be computed. In the following formulas, assume that the data consist of two variables, X and Y, with n observations.
\[
\mathrm{SSCP} =
\begin{pmatrix}
n & \sum X & \sum Y \\
\sum X & \sum X^2 & \sum XY \\
\sum Y & \sum XY & \sum Y^2
\end{pmatrix}
\]

\[
\mathrm{CSSCP} =
\begin{pmatrix}
\sum (X-\bar{X})^2 & \sum (X-\bar{X})(Y-\bar{Y}) \\
\sum (X-\bar{X})(Y-\bar{Y}) & \sum (Y-\bar{Y})^2
\end{pmatrix}
\]

\[
\mathrm{COV} = \frac{\mathrm{CSSCP}}{n-1}
= \frac{1}{n-1}
\begin{pmatrix}
\sum (X-\bar{X})^2 & \sum (X-\bar{X})(Y-\bar{Y}) \\
\sum (X-\bar{X})(Y-\bar{Y}) & \sum (Y-\bar{Y})^2
\end{pmatrix}
\]

\[
\mathrm{UCOV} = \frac{1}{n}
\begin{pmatrix}
\sum X^2 & \sum XY \\
\sum XY & \sum Y^2
\end{pmatrix}
\]

\[
\mathrm{CORR} =
\begin{pmatrix}
1 & \dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} \\[2ex]
\dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} & 1
\end{pmatrix}
\]

\[
\mathrm{UCORR} =
\begin{pmatrix}
1 & \dfrac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} \\[2ex]
\dfrac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} & 1
\end{pmatrix}
\]
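These definitions are easy to verify numerically; the following PROC IML step is a minimal sketch that computes all six matrices for a small made-up sample:

   proc iml;
      x = {1 2, 2 3, 3 5, 4 4};          /* n=4 observations of X and Y */
      n = nrow(x);
      one = j(n, 1, 1);
      sscp  = (one || x)` * (one || x);  /* includes the Intercept row and column */
      dev   = x - one * x[:,];           /* deviations from the column means */
      csscp = dev` * dev;
      cov   = csscp / (n - 1);
      ucov  = (x` * x) / n;
      d     = inv(sqrt(diag(csscp)));    /* 1/sqrt of corrected sums of squares */
      corr  = d * csscp * d;
      du    = inv(sqrt(diag(x` * x)));
      ucorr = du * (x` * x) * du;
      print sscp csscp cov ucov corr ucorr;
   quit;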
Appendix B
Using the %PLOTIT Macro

Appendix Contents
OVERVIEW
%PLOTIT MACRO OPTIONS USED IN THIS BOOK
Overview

You can use the PLOT procedure to create low-resolution printer plots of labeled points. Alternatively, you can use the %PLOTIT macro to create high-resolution graphical scatter plots of labeled points. The %PLOTIT macro is designed to make it easy to display raw data, regression results, and results from the CORRESP, MDS, PRINCOMP, PRINQUAL, and TRANSREG procedures. You can use this macro to position labels; draw curves, vectors, and circles; and shade to show density or a third variable. You can also use the %PLOTIT macro to control the colors, sizes, fonts, and general appearance of the plots and to create contour plots for discriminant analysis.

The %PLOTIT macro is part of the SAS autocall library. If your site has installed the autocall libraries supplied by SAS Institute and uses the standard configuration of SAS software supplied by the Institute, you need only ensure that the SAS system option MAUTOSOURCE is in effect to begin using the autocall macros. For more information about autocall libraries, refer to SAS Macro Language: Reference, First Edition, 1997. The %PLOTIT macro can also be found at SAS Institute's Web site (http://www.sas.com). Refer to "Experimental Design and Choice Modeling Macros" and "Graphical Scatter Plots of Labeled Points" at http://support.sas.com/techsup/tnote/tnote_stat.html#market.
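For example, a minimal invocation might look like the following, where coor is assumed to be a coordinate data set such as an OUTC= data set from PROC CORRESP:

   options mautosource;   /* enable the autocall facility */

   %plotit(data=coor, datatype=corresp)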
%PLOTIT Macro Options Used in This Book

Most of the examples in this book that invoke the %PLOTIT macro are created with a specific set of options. The graphics are generated by using a special macro variable called plotitop. The code you see in the examples creates the color graphics that appear in the online (CD) version of the manual. A slightly different set of options and statements is used to create the black-and-white graphics that appear in the printed version of the book. To create the online (color) version of the graphic output, the plotitop variable is defined as follows:
   %let plotitop = gopts   = gsfmode = replace
                             gaccess = gsasfile
                             device  = gif
                             hsize   = 5.63 vsize = 3.5,
                   cback   = white,
                   cframe  = ligr,
                   color   = black,
                   colors  = red blue white,
                   options = noclip expand,
                   post    = myplot.gif;
To create the black-and-white version of the graphic output, which appears in the printed version of the manual, the plotitop variable is defined as follows:

   %let plotitop = gopts   = gsfmode = replace
                             gaccess = gsasfile
                             device  = pslepsf
                             hsize   = 5.63 vsize = 3.5,
                   cback   = white,
                   color   = black,
                   colors  = black,
                   options = noclip border expand,
                   post    = myplot.ps;
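Each example presumably passes this option list when it invokes the macro, along the following lines (the data set name coor is hypothetical). Because the commas in plotitop resolve before the macro call is parsed, the assignments are received as ordinary named parameters:

   %plotit(data=coor, datatype=corresp, &plotitop)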
For information on graphics options used in other examples, see Chapter 1, “Introduction.”
Subject Index A AB/BA crossover designs power and sample size (POWER), 3549 absolute level of measurement, definition DISTANCE procedure, 1250 absorption of effects ANOVA procedure, 434 GLM procedure, 1747, 1799 Accelerated failure time model NLMIXED procedure, 3128 accelerated failure time models LIFEREG procedure, 2083 ACECLUS procedure analyzing data in groups, 394, 412 between-cluster SSCP matrix, 387 clustering methods, 405 compared with other procedures, 391 computational resources, 410 controlling iterations, 389 decomposition of the SSCP matrix, 387 eigenvalues and eigenvectors, 398, 399, 406, 410–412 initial estimates, 389, 404 memory requirements, 410 missing values, 409 output data sets, 409 output table names, 412 time requirements, 411 within-cluster SSCP matrix, 387 active set methods NLMIXED procedure, 3093 actual power GLMPOWER procedure, 1946, 1947, 1953 POWER procedure, 3419, 3494, 3496 acturial estimate, See life-table estimate adaptive Gaussian quadrature NLMIXED procedure, 3084 additive models TRANSREG procedure, 4574 adjacent-category logits, See also response functions specifying in CATMOD procedure, 852 using (CATMOD), 868 adjacent-level contrasts, 448 ADJRSQ SURVEYREG procedure, 4380 adjusted degrees of freedom
MI procedure, 2562 MIANALYZE procedure, 2625 adjusted means See least-squares means, 1753 adjusted odds ratio, 1503 adjusted p-value MULTTEST procedure, 2935, 2956 adjusted R2 selection (REG), 3875 Adjusted R-square SURVEYREG procedure, 4387 adjusted residuals GENMOD procedure, 1670 adjusted treatment means LATTICE procedure, 2075 after expansion knots TRANSREG procedure, 4571 agglomerative hierarchical clustering analysis, 957 aggregates of residuals, 1718, 1725 AGK estimate STDIZE procedure, 4139 agreement, measures of, 1493 Akaike’s information criterion example (MIXED), 2780, 2794, 2823 LOGISTIC procedure, 2341 MIXED procedure, 2676, 2740, 2750 SURVEYLOGISTIC procedure, 4279 Akaike’s information criterion (finite sample corrected version) MIXED procedure, 2676, 2750 aliasing GENMOD procedure, 1620 aliasing structure GLM procedure, 1771 aliasing structure (GLM), 1897 alpha factor analysis, 1291, 1311 alpha level ANOVA procedure, 441 FREQ procedure, 1445, 1453 GLM procedure, 1755, 1765, 1771, 1775 GLMPOWER procedure, 1939 NPAR1WAY procedure, 3160 POWER procedure, 3488 REG procedure, 3816 SURVEYFREQ procedure, 4199 TRANSREG procedure, 4574 TTEST procedure, 4780 ALR algorithm GENMOD procedure, 1711
4910
Subject Index
alternating least squares MDS procedure, 2476 Alternating Logistic Regressions (ALR) GENMOD procedure, 1711 alternative hypothesis, 3488 analysis of contrasts SURVEYREG procedure, 4393 analysis of covariance examples (GLM), 1860 MODEL statements (GLM), 1785 power and sample size (GLMPOWER), 1956 analysis of covariation (NESTED), 2990 analysis of variance, See also ANOVA procedure See also TTEST procedure categorical data, 813 CATMOD procedure, 815 mixed models (GLM), 1882 MODEL statements (GLM), 1785 multivariate (ANOVA), 436 multivariate (CANDISC), 786 multivariate (GLM), 1759, 1823, 1824, 1868 nested design, 2985 one-way layout, example, 424 one-way tests (NPAR1WAY), 3165 one-way, variance-weighted, 445, 1769 power and sample size (GLMPOWER), 1930, 1951, 1956 power and sample size (POWER), 3438, 3442, 3443, 3513, 3536 quadratic response surfaces, 4045 repeated measures (CATMOD), 873 repeated measures (GLM), 1825, 1877, 1886 SURVEYREG procedure, 4387 three-way design (GLM), 1864 unbalanced (GLM), 1735, 1804, 1856 within-subject factors, repeated measurements, 449 analysis statements POWER procedure, 3420 Analysis style ODS styles, 333 analyst’s model MI procedure, 2563 analyzing data in groups, 3076 ACECLUS procedure, 394 CANCORR procedure, 763 FACTOR procedure, 1320 FASTCLUS procedure, 1395 MDS procedure, 2485 MODECLUS procedure, 2857, 2874 PRINCOMP procedure, 3607 SCORE procedure, 4073 STDIZE procedure, 4134 TREE procedure, 4754 VARCLUS procedure, 4813 Andersen-Gill model PHREG procedure, 3216, 3243, 3253 angle classes
VARIOGRAM procedure, 4866, 4868, 4870, 4872, 4873 angle tolerance VARIOGRAM procedure, 4866, 4868, 4870, 4872, 4873 anisotropic models (KRIGE2D), 2053–2056 models (VARIOGRAM), 4871 nugget effect (KRIGE2D), 2056 annotate global data set (REG), 3816 local data set (REG), 3844 annotating cdf plots, 3718 ipp plots, 3729 lpred plots, 3737 pplot plots, 2102 predicted probability plots, 3750 ANOVA codings (TRANSREG), 4662 SURVEYREG procedure, 4380, 4387, 4392 table (TRANSREG), 4580, 4615, 4650 ANOVA (row mean scores) statistic, 1502 ANOVA procedure absorption of effects, 434 alpha level, 441 at sign (@) operator, 452 balanced data, 423 bar (|) operator, 452 Bartlett’s test, 443 block diagonal matrices, 423 Brown and Forsythe’s test, 443 canonical analysis, 438 characteristic roots and vectors, 437 compared to other procedures, 1734 complete block design, 428 computational methods, 456 confidence intervals, 442 contrasts, 448 dependent variable, 423 disk space, 433 effect specification, 451 factor name, 446, 447 homogeneity of variance tests, 443 hypothesis tests, 450 independent variable, 423 interactive use, 454 interactivity and missing values, 455 introductory example, 424 level values, 447 Levene’s test for homogeneity of variance, 443 means, 440 memory requirements, 434, 456 missing values, 433, 455 model specification, 451 multiple comparison procedures, 440 multiple comparisons, 441–445 multivariate analysis of variance, 433, 436 O’Brien’s test, 443
Subject Index ODS graph names, 460 ODS table names, 458 orthonormalizing transformation matrix, 438 output data sets, 434, 455 pooling, automatic, 454 repeated measures, 446 sphericity tests, 449 SSCP matrix for multivariate tests, 437 transformations, 447, 448 transformations for MANOVA, 437 unbalanced data, caution, 423 Welch’s ANOVA, 445 WHERE statement, 454 Ansari-Bradley scores NPAR1WAY procedure, 3168 ante-dependence structure MIXED procedure, 2721 apparent error rate, 1163 approximate covariance, options (CALIS), 621 standard errors (CALIS), 576, 587, 648, 686 approximate Bayesian bootstrap MI procedure, 2543 approximate covariance estimation clustering, 387 arbitrary missing pattern MI procedure, 2539 arcsine-square root transform LIFETEST procedure, 2205 arcsine-square root transformation LIFETEST procedure, 2169, 2177 arrays NLMIXED procedure, 3074 association tests LIFETEST procedure, 2150, 2156, 2200 association, measures of FREQ procedure, 1474 asterisk (*) operator TRANSREG procedure, 4558 asymmetric data (MDS), 2484 asymmetric binary variable DISTANCE procedure, 1250 asymmetric lambda, 1474, 1482 asymptotic covariance CALIS procedure, 575, 645 MIXED procedure, 2674 asymptotic variances CALIS procedure, 647 asymptotically distribution free estimation CALIS procedure, 574, 646 at sign (@) operator ANOVA procedure, 452 CATMOD procedure, 866 GLM procedure, 1786 MIXED procedure, 2745, 2819 TRANSREG procedure, 4558 autocorrelation REG procedure, 3915
4911
autocorrelation function plot MI procedure, 2557 autoregressive structure example (MIXED), 2788 MIXED procedure, 2721 average linkage CLUSTER procedure, 966, 976 average relative increase in variance MIANALYZE procedure, 2627 average variance of means LATTICE procedure, 2074 axes labels, modifying examples, ODS Graphics, 368
B B-spline basis TRANSREG procedure, 4560, 4614 backward elimination LOGISTIC procedure, 2317, 2340 PHREG procedure, 3229, 3264 REG procedure, 3800, 3874 badness of fit MDS procedure, 2479, 2482, 2483, 2489, 2490 balanced data ANOVA procedure, 423 example, complete block, 1847 balanced design, 2985 balanced square lattice LATTICE procedure, 2069 banded Toeplitz structure MIXED procedure, 2721 bandwidth optimal (DISCRIM), 1162 selection (KDE), 2008 VARIOGRAM procedure, 4866, 4870, 4875 bar (|) operator ANOVA procedure, 452 CATMOD procedure, 865 GENMOD procedure, 1660 GLM procedure, 1786 MIXED procedure, 2743, 2745, 2819 TRANSREG procedure, 4558 Bartlett’s test ANOVA procedure, 443 GLM procedure, 1767, 1819 Base SAS software, 21 baseline survivor function (PHREG) confidence level, 3225 confidence limits, 3225, 3263 estimation method, 3226 standard error, 3225, 3263 Bayes estimation NLMIXED procedure, 3084 Bayes’ theorem DISCRIM procedure, 1157 LOGISTIC procedure, 2314, 2353 MI procedure, 2547 Bayesian analysis MIXED procedure, 2708
4912
Subject Index
Bayesian confidence intervals splines, 4513 Bayesian inference MI procedure, 2547 Behrens-Fisher problem, 4775 Bernoulli distribution NLMIXED procedure, 3077 best subset selection LOGISTIC procedure, 2308, 2317, 2341 PHREG procedure, 3229, 3265, 3279 between-cluster SSCP matrix ACECLUS procedure, 387 between-imputation covariance matrix MIANALYZE procedure, 2626 between-imputation variance MI procedure, 2561 MIANALYZE procedure, 2624 between-subject factors repeated measures, 1777, 1828 Bhapkar’s test, 932 bimodality coefficient CLUSTER procedure, 972, 984 bin-sort algorithm, 1389 binary distribution NLMIXED procedure, 3077 Binary Lance and Williams nonmetric coefficient DISTANCE procedure, 1276 binning KDE procedure, 2004 binomial distribution GENMOD procedure, 1652 NLMIXED procedure, 3077 binomial proportion test, 1484 examples, 1532 power and sample size (POWER), 3429, 3432, 3504–3506, 3541 bioequivalence, equivalence tests power and sample size (POWER), 3510, 3511, 3520, 3521, 3530, 3531, 3549 biological assay data, 3705, 3761 biplot PRINQUAL procedure, 3678 biquartimax method, 1291, 1317, 1318 biquartimin method, 1291, 1318 bivariate density estimation DISCRIM procedure, 1200 bivariate histogram KDE procedure, 2011 biweight kernel (DISCRIM), 1160 block diagonal matrices ANOVA procedure, 423 BLUE MIXED procedure, 2740 BLUP MIXED procedure, 2740 Bonferroni t test, 441, 1765, 1809 Bonferroni adjustment GLM procedure, 1754
MIXED procedure, 2688 MULTTEST procedure, 2939, 2956 bootstrap MI procedure, 2527 bootstrap adjustment MULTTEST procedure, 2938, 2939, 2957, 2968 boundary constraints, 3075 MIXED procedure, 2707, 2708, 2773 bounds NLMIXED procedure, 3075 Bowker’s test of symmetry, 1493, 1494 Box Cox Example TRANSREG procedure, 4721 Box Cox transformations TRANSREG procedure, 4595 box plot, defined, 483 Box plots MIXED procedure, 2762 box plots plots, ODS Graphics, 355 reading group summary statistics, 522 saving group summary statistics, 518, 519 box plots, clipping boxes, 497, 498 examples, 532, 533 box plots, labeling angles for, 503 points, 492 Box’s epsilon, 1829 box-and-whisker plots schematic, 539 side-by-side, 483 skeletal, 538 statistics represented, 485, 517 styles of, 522 BOXPLOT procedure continuous group variables, 524 missing values, 524 percentile computation, 523 branch and bound algorithm LOGISTIC procedure, 2341 PHREG procedure, 3265, 3279 Bray and Curtis coefficient DISTANCE procedure, 1276 Breslow method likelihood (PHREG), 3228, 3240 Breslow test, See Wilcoxon test for homogeneity Breslow-Day test, 1508 Brewer’s method SURVEYSELECT procedure, 4453 Brown and Forsythe’s test ANOVA procedure, 443 GLM procedure, 1767, 1819 Brown-Mood test NPAR1WAY procedure, 3167 Broyden-Fletcher-Goldfarb-Shanno update, 3073 Burt table CORRESP procedure, 1076
Subject Index
C calibration data set DISCRIM procedure, 1139, 1167 CALIS procedure approximate covariance, 621 approximate standard errors, 576, 587, 648, 686 asymptotic covariance, 575, 645 asymptotic variances, 647 asymptotically distribution free estimation, 574, 646 chi-square indices, 653 chi-square, adjusted, 655 chi-square, displaying, 684, 685 chi-square, reweighted, 655 coefficient of determination, 585 compared to MIXED procedure, 2665 computation method, Hessian matrix, 589 computation method, Lagrange multipliers, 589 computation method, standard errors, 589 computational problems, 678, 680, 681 computational problems, identification, 669, 680, 724, 726, 732, 741 constrained estimation, 565 constraints, 565, 609, 610, 675, 676 constraints, program statements, 565, 628, 630, 675 COSAN model, 552, 591 degrees of freedom, 573, 576, 590, 676 determination index, 657 displayed output options, 583, 621 disturbances, prefix, 601 EQS program, 555 estimation criteria, 646 estimation methods, 549, 574, 644–647 exogenous variables, 662 factor analysis model, 554, 606, 608 factor analysis model, COSAN statement, 593 factor analysis model, LINEQS statement, 602 factor analysis model, RAM statement, 598 factor loadings, 641 FACTOR procedure, 567, 572, 606, 679 factor rotation, 607 factor scores, 641, 643, 687 fit function, 644 goodness of fit, 644, 649, 652, 684 gradient, 634, 664, 665 Hessian matrix, 621, 622, 634, 647, 665 initial values, 550, 590, 595, 597, 602, 661 input data set, 630 kappa, 659 kurtosis, 549, 584, 588, 658, 660 latent variables, 549, 601, 625 likelihood ratio test, 653, 674 LINEQS model, 553, 601 LISREL model, 554 manifest variables, 549 matrix inversion, 647 matrix names, default, 608 matrix properties, COSAN model, 592
4913
MODEL procedure, 679 modification indices, 576, 584, 649, 673, 674, 687 optimization, 550, 577–581, 622, 664–666, 671, 672 optimization history, 668 optimization, initial values, 661, 666 optimization, memory problems, 666 optimization, termination criteria, 611, 615–620 output data sets, 634 output table names, 688 predicted model matrix, 643, 644, 663, 680, 683 prefix name, 594, 598, 602, 603 PRINCOMP procedure, 567 program statements, 628, 630 program statements, differentiation, 589 RAM model, 553, 596 reciprocal causation, 585 REG procedure, 679 residuals, 650 residuals, prefix, 601 SCORE procedure, 571, 586, 643 significance level, 588 simplicity functions, 607 singularity criterion, 590 singularity criterion, covariance matrix, 588, 590, 591 skewness, 658 squared multiple correlation, 657, 686 step length, 581 structural equation, 552, 585, 625, 658 SYSLIN procedure, 679 SYSNLIN procedure, 679 t value, 649, 686 test indices, constraints, 584 variable selection, 662 Wald test, probability limit, 590 CANALS method TRANSREG procedure, 4576 Canberra metric coefficient DISTANCE procedure, 1272 CANCORR procedure analyzing data in groups, 763 canonical coefficients, 751 canonical redundancy analysis, 751, 761 computational resources, 768 correction for means, 760 correlation, 760 eigenvalues, 765 eigenvalues and eigenvectors, 754, 769 examples, 753, 773 formulas, 765 input data set, 760 missing values, 765 OUT= data sets, 761, 766 output data sets, 761, 766 output table names, 771 OUTSTAT= data sets, 766 partial correlation, 761, 762, 764
4914
Subject Index
principal components, relation to, 765 regression coefficients, 759 semipartial correlation, 762 singularity checking, 762 squared multiple correlation, 762 squared partial correlation, 762 squared semipartial correlation, 762 statistical methods used, 752 statistics computed, 751 suppressing output, 761 weighted product-moment correlation coefficients, 764 CANDISC procedure computational details, 794 computational resources, 799 input data set, 795 introductory example, 785 Mahalanobis distance, 804 MANOVA, 786 memory requirements, 799 missing values, 794 multivariate analysis of variance, 786 ODS table names, 802 output data sets, 791, 796, 797 %PLOTIT macro, 785, 808 time requirements, 799 canonical analysis ANOVA procedure, 438 GLM procedure, 1760 repeated measurements, 448 response surfaces, 4046 canonical coefficients, 783 canonical component, 783 canonical correlation CANCORR procedure, 751 definition, 752 hypothesis tests, 751 TRANSREG procedure, 4584, 4593 canonical discriminant analysis, 783, 1139 canonical factor solution, 1297 canonical redundancy analysis CANCORR procedure, 751, 761 canonical variables, 783 ANOVA procedure, 438 TRANSREG procedure, 4584 canonical weights, 752, 783 cascaded density estimates MODECLUS procedure, 2873 case weight PHREG procedure, 3239 case-control studies odds ratio, 1488, 1503, 1504 PHREG procedure, 3217, 3228, 3280 casewise deletion PRINQUAL procedure, 3655 categorical data analysis, See CATMOD procedure categorical variable, 72 categorical variables,
See classification variables CATMOD parameterization, 845 CATMOD procedure analysis of variance, 815 at sign (@) operator, 866 AVERAGED models, 881 bar (|) operator, 865 cautions, 869, 870, 887 cell count data, 861 classification variables, 864 compared to other procedures, 815, 869, 870, 1792 computational method, 891–894 continuous variables, 864 continuous variables, caution, 869, 870 contrast examples, 919 contrasts, comparing with GLM, 833 convergence criterion, 842 design matrix, 847, 848 design matrix, REPEATED statement, 882 effect specification, 864 effective sample sizes, 887 estimation methods, 817 – F– specification, 840, 862 hypothesis tests, 888 input data sets, 813, 860 interactive use, 818, 828 introductory example, 818 iterative proportional fitting, 843 linear models, 814 log-linear models, 814, 870, 916, 919, 1616 logistic analysis, 815, 868, 933 logistic regression, 814, 869, 911 maximum likelihood estimation, 817 maximum likelihood estimation formulas, 895 memory requirements, 897 missing values, 860 MODEL statement, examples, 840 ordering of parameters, 880 ordering of populations, 863 ordering of responses, 863 ordinal model, 869 output data sets, 854, 866, 867 parameterization, comparing with GLM, 833 positional requirements for statements, 828 quasi-independence model, 919 regression, 815 repeated measures, 815, 850, 873, 925, 930, 933, 937 repeated measures, MODEL statements, 875 REPEATED statement, examples, 873 response functions, 836, 840, 852, 854–856, 859, 862, 901, 906, 944 RESPONSE – – keyword, 836, 839, 840, 842, 850, 864, 870, 873, 881, 882, 884, 888, 898 – RESPONSE– = option, 837, 851 restrictions on parameters, 859 sample survey analysis, 816
Subject Index sampling zeros and log-linear analyses, 871 sensitivity, 941 singular covariance matrix, 887 specificity, 941 time requirements, 897 types of analysis, 814, 864 underlying model, 816 weighted least squares, 817, 846 zeros, structural and sampling, 888, 919, 924 CDF, See cumulative distribution function cdf plots annotating, 3718 axes, color, 3718 font, specifying, 3719 options summarized by function, 3716, 3734 reference lines, options, 3719–3722 threshold lines, options, 3721 cdfplot PROBIT procedure, 3715 ceiling sample size GLMPOWER procedure, 1947 POWER procedure, 3419, 3496 cell count data, 1464 CATMOD procedure, 861 example (FREQ), 1527 cell of a contingency table, 72 cell-means coding TRANSREG procedure, 4569, 4594, 4662 censored data (LIFEREG), 2083 LIFETEST procedure, 2186 observations (PHREG), 3283 survival times (PHREG), 3215, 3281, 3283 values (PHREG), 3218, 3272 censoring, 2083 LIFEREG procedure, 2094 variable (PHREG), 3218, 3228, 3235, 3283 center-point coding TRANSREG procedure, 4568, 4668, 4670 centering TRANSREG procedure, 4571 centroid component, 4800 definition, 4799 centroid method CLUSTER procedure, 966, 976 chaining, reducing when clustering, 972 character OPSCORE variables PRINQUAL procedure, 3673 TRANSREG procedure, 4604 characteristic roots and vectors ANOVA procedure, 437 GLM procedure, 1759 Chebychev distance coefficient DISTANCE procedure, 1272 chi-square adjusted (CALIS), 655 displaying (CALIS), 684, 685 indices (CALIS), 653
4915
reweighted (CALIS), 655 chi-square test SURVEYFREQ procedure, 4216 chi-square tests examples (FREQ), 1530, 1535, 1538 FREQ procedure, 1469, 1470 power and sample size (POWER), 3429, 3432, 3457, 3462, 3505, 3506, 3524, 3541 chi-squared coefficient DISTANCE procedure, 1273 choice experiments TRANSREG procedure, 4660 Chromy’s method SURVEYSELECT procedure, 4448, 4452 Cicchetti-Allison weights, 1497 Cityblock distance coefficient DISTANCE procedure, 1272 class level MIXED procedure, 2678 SURVEYMEANS procedure, 4346 classification criterion DISCRIM procedure, 1139 error rate estimation (DISCRIM), 1163 classification level SURVEYREG procedure, 4391 classification table LOGISTIC procedure, 2314, 2352, 2353, 2422 classification variable SURVEYMEANS procedure, 4329, 4337, 4341 classification variables, 72 ANOVA procedure, 423, 451 CATMOD procedure, 864 GENMOD procedure, 1660 GLM procedure, 1784 GLMPOWER procedure, 1937, 1938 MIXED procedure, 2681 sort order of levels (GENMOD), 1625 SURVEYREG procedure, 4375 TRANSREG procedure, 4560, 4569 VARCOMP procedure, 4836 closing all destinations examples, ODS Graphics, 360 cluster centers, 1380, 1392 definition (MODECLUS), 2878 deletion, 1390 elliptical, 387 final, 1380 initial, 1380, 1381 mean, 1392 median, 1389, 1392 midrange, 1392 minimum distance separating, 1381 plotting (MODECLUS), 2878 seeds, 1380 SURVEYFREQ procedure, 4195 SURVEYLOGISTIC procedure, 4255 SURVEYMEANS procedure, 4329 SURVEYREG procedure, 4376
4916
Subject Index
cluster analysis disjoint, 1379 large data sets, 1379 robust, 1379, 1392 tree diagrams, 4743 cluster analysis (STDIZE) standardizing, 4143 CLUSTER procedure, See also TREE procedure algorithms, 986 association measure, 957 average linkage, 957 categorical data, 957 centroid method, 957 clustering methods, 957, 975 complete linkage, 957 computational resources, 986 density linkage, 957, 966 Euclidean distances, 957 F statistics, 972, 984 FASTCLUS procedure, compared, 957 flexible-beta method, 957, 967, 968, 981 hierarchical clusters, 957 input data sets, 969 interval scale, 988 kth-nearest-neighbor method, 957 maximum likelihood, 957, 967 McQuitty’s similarity analysis , 957 median method, 957 memory requirements, 986 missing values, 987 non-Euclidean distances, 957 output data sets, 971, 990 output table names, 994 pseudo F and t statistics, 972 ratio scale, 988 single linkage, 957 size, shape, and correlation, 988 test statistics, 968, 972, 973 ties, 987 time requirements, 986 TREE procedure, compared, 957 two-stage density linkage, 957 types of data sets, 957 using macros for many analyses, 1013 Ward’s minimum-variance method, 957 Wong’s hybrid method, 957 cluster sampling, 164 clustering, 957, See also CLUSTER procedure approximate covariance estimation, 387 average linkage, 966, 976 centroid method, 966, 976 complete linkage method, 966, 977 density linkage methods, 966, 967, 969, 970, 972, 977, 980, 982 disjoint clusters of variables, 4799 Gower’s method, 967, 981 hierarchical clusters of variables, 4799
maximum-likelihood method, 971, 980, 981 McQuitty’s similarity analysis, 967, 981 median method, 967, 981 methods affected by frequencies, 974 outliers in, 958, 972 penalty coefficient, 971 single linkage, 967, 982 smoothing parameters, 979 standardizing variables, 972 SURVEYFREQ procedure, 4203 transforming variables, 958 two-stage density linkage, 967 variables, 4799 Ward’s method, 967, 983 weighted average linkage, 967, 981 clustering and computing distance matrix Correlation coefficients, example, 1283 Jaccard coefficients, example, 1278 clustering and scaling DISTANCE procedure, example, 1253 MODECLUS procedure, 2856, 2857, 2874 STDIZE procedure, example, 4143 clustering criterion FASTCLUS procedure, 1379, 1391, 1392 clustering methods ACECLUS procedure, 405 FASTCLUS procedure, 1380, 1381 MODECLUS procedure, 2856, 2874 CMF, See cumulative mean function PHREG procedure, 3224 Cochran and Cox t approximation, 4775, 4780, 4785 Cochran’s Q test, 1493, 1499, 1548 Cochran-Armitage test for trend, 1490, 1543 continuity correction (MULTTEST), 2949 MULTTEST procedure, 2946, 2948, 2964 permutation distribution (MULTTEST), 2949 two-tailed test (MULTTEST), 2951 Cochran-Mantel-Haenszel statistics (FREQ), 1447, 1500, See also chi-square tests ANOVA (row mean scores) statistic, 1502 correlation statistic, 1501 examples, 1540 general association statistic, 1502 coefficient alpha (FACTOR), 1337 of contrast (SURVEYREG), 4392 of determination (CALIS), 585 of estimate (SURVEYREG), 4393 of relationship (INBREED), 1977 of variation (SURVEYMEANS), 4340 redundancy (TRANSREG), 4590 coefficient of variation SURVEYFREQ procedure, 4214 cohort studies, 1540 relative risk, 1489, 1507 collinearity REG procedure, 3895
Subject Index column proportions SURVEYFREQ procedure, 4212 combinations generating with PLAN procedure, 3358 combining inferences MI procedure, 2561 MIANALYZE procedure, 2624 common factor defined for factor analysis, 1292 common factor analysis common factor rotation, 1294 compared with principal component analysis, 1293 Harris component analysis, 1293 image component analysis, 1293 interpreting, 1293 salience of loadings, 1294 comparing groups (GLM), 1804 means (TTEST), 4775, 4789 variances (TTEST), 4775, 4784, 4789 comparisonwise error rate (GLM), 1809 complementary log-log model SURVEYLOGISTIC procedure, 4284 complete block design example (ANOVA), 428 example (GLM), 1847 complete linkage CLUSTER procedure, 966, 977 complete separation LOGISTIC procedure, 2339 SURVEYLOGISTIC procedure, 4277 completely randomized design examples, 461 components PLS procedure, 3367 compound symmetry structure example (MIXED), 2733, 2789, 2794 MIXED procedure, 2721 computational details Hessian matrix (CALIS), 589 KDE procedure, 2002 Lagrange multipliers (CALIS), 589 LIFEREG procedure, 2108 maximum likelihood method (VARCOMP), 4839 MIVQUE0 method (VARCOMP), 4839 MIXED procedure, 2772 restricted maximum likelihood method (VARCOMP), 4840 SIM2D procedure, 4109 standard errors (CALIS), 589 SURVEYLOGISTIC procedure, 4282 SURVEYREG procedure, 4384 Type I method (VARCOMP), 4838 VARCOMP procedure, 4838, 4842 computational problems CALIS procedure, 678 convergence (CALIS), 678
4917
convergence (FASTCLUS), 1390 convergence (MIXED), 2774 identification (CALIS), 669, 680, 724, 726, 732, 741 negative eigenvalues (CALIS), 681 negative R-square (CALIS), 681 NLMIXED procedure, 3098 overflow (CALIS), 678 singular predicted model (CALIS), 680 time (CALIS), 681 computational resources ACECLUS procedure, 410 CANCORR procedure, 768 CLUSTER procedure, 986 FACTOR procedure, 1335 FASTCLUS procedure, 1402 LIFEREG procedure, 2123 MODECLUS procedure, 2882 MULTTEST procedure, 2960 NLMIXED procedure, 3103 PRINCOMP procedure, 3611 ROBUSTREG procedure, 4012 VARCLUS procedure, 4818 concordant observations, 1474 conditional and unconditional simulation SIM2D procedure, 4091 conditional data MDS procedure, 2477 conditional distributions of multivariate normal random variables SIM2D procedure, 4108 conditional logistic regression LOGISTIC procedure, 2365 PHREG procedure, 3217, 3283 Conditional residuals MIXED procedure, 2764 confidence bands LIFETEST procedure, 2169, 2176, 2205 Confidence intervals LIFEREG procedure, 2115 confidence intervals confidence coefficient (GENMOD), 1637 fitted values of the mean (GENMOD), 1641, 1669 individual observation (RSREG), 4042, 4043 means (ANOVA), 442 means (RSREG), 4042, 4043 means, power and sample size (POWER), 3432, 3438, 3448, 3456, 3463, 3473, 3512, 3522, 3532, 3563 model confidence interval (NLIN), 3029 pairwise differences (ANOVA), 442 parameter confidence interval (NLIN), 3028 profile likelihood (GENMOD), 1640, 1666 profile likelihood (LOGISTIC), 2314, 2315, 2345 TTEST procedure, 4780 Wald (GENMOD), 1643, 1667 Wald (LOGISTIC), 2319, 2346
4918
Subject Index
Wald (SURVEYLOGISTIC), 4288 confidence intervals, FACTOR procedure, 1327 confidence level baseline survivor function (PHREG), 3225 SURVEYMEANS procedure, 4323 SURVEYREG procedure, 4373 confidence limits asymptotic (FREQ), 1475 baseline survivor function (PHREG), 3225, 3263 exact (FREQ), 1443 LIFETEST procedure, 2174, 2183, 2184, 2205 LOGISTIC procedure, 2350 MIXED procedure, 2674 SURVEYFREQ procedure, 4213 SURVEYMEANS procedure, 4340, 4342 SURVEYREG procedure, 4380 TRANSREG procedure, 4575, 4584, 4585, 4587, 4588 confidence limits, FACTOR procedure, 1327 configuration MDS procedure, 2471 conjoint analysis TRANSREG procedure, 4581, 4593, 4690, 4694 conjugate descent (NLMIXED), 3074 gradient (NLMIXED), 3072 gradient algorithm (CALIS), 577, 579–581, 665 connectedness method, See single linkage constant transformations avoiding (PRINQUAL), 3673 avoiding (TRANSREG), 4603 constant variables PRINQUAL procedure, 3673 TRANSREG procedure, 4578, 4604 constrained estimation CALIS procedure, 565 constraints boundary (CALIS), 565, 609, 675 boundary (MIXED), 2707, 2708 linear (CALIS), 565, 609, 676 modification indices (CALIS), 584 nonlinear (CALIS), 565, 610 ordered (CALIS), 675 program statements (CALIS), 565, 628, 630, 675 test indices (CALIS), 584 containment method MIXED procedure, 2693 contingency coefficient, 1469, 1474 contingency tables, 72, 1431, 1450, 4196 CATMOD procedure, 816 continuity-adjusted chi-square, 1469, 1471 continuous variables, 451, 1784 GENMOD procedure, 1660 continuous-by-class effects MIXED procedure, 2746 model parameterization (GLM), 1790 specifying (GLM), 1785 continuous-nesting-class effects
MIXED procedure, 2745 model parameterization (GLM), 1789 specifying (GLM), 1785 %CONTOUR macro DISCRIM procedure, 1201 contour plots plots, ODS Graphics, 324, 360 contrasts, 3076 comparing CATMOD and GLM, 833 GENMOD procedure, 1633 GLM procedure, 1749 MIXED procedure, 2681, 2685 power and sample size (GLMPOWER), 1934, 1937, 1950, 1951, 1956 power and sample size (POWER), 3438, 3439, 3442, 3513, 3536 repeated measurements (ANOVA), 447, 448 repeated measures (GLM), 1779 specifying (CATMOD), 831 SURVEYREG procedure, 4376, 4389 control comparing treatments to (GLM), 1807, 1812 control charts, 23 control sorting SURVEYSELECT procedure, 4443, 4445 converge in EM algorithm MI procedure, 2522 convergence criterion ACECLUS procedure, 404 CATMOD procedure, 842 GENMOD procedure, 1637, 1647 MDS procedure, 2478, 2480, 2481 MIXED procedure, 2674, 2675, 2749, 2775 profile likelihood (LOGISTIC), 2314 convergence in EM algorithm MI procedure, 2527 convergence in MCMC MI procedure, 2555, 2566 convergence problems MIXED procedure, 2774 NLMIXED procedure, 3099 convolution distribution (MULTTEST), 2950 KDE procedure, 2005 Cook’s D influence statistic, 1774, 4042 Cook’s D MIXED procedure, 2768 Cook’s D for covariance parameters MIXED procedure, 2768 Cook’s D plots plots, ODS Graphics, 353 CORR procedure, 21 correction for means CANCORR procedure, 760 correlated data GEE (GENMOD), 1611, 1672 correlated proportions, See McNemar’s test correlation
Subject Index CANCORR procedure, 760 estimates (MIXED), 2713, 2716, 2720, 2791 length (VARIOGRAM), 4860 matrix (GENMOD), 1638, 1656 matrix (REG), 3817 matrix, estimated (CATMOD), 842 principal components, 3610, 3612 range (KRIGE2D), 2034 correlation coefficients power and sample size (POWER), 3426, 3502, 3503 Correlation dissimilarity coefficient DISTANCE procedure, 1271 Correlation similarity coefficient DISTANCE procedure, 1271 correlation statistic, 1501 CORRESP procedure, 1069 adjusted inertias, 1102 algorithm, 1097 analyse des correspondances, 1069 appropriate scoring, 1069 Best variables, 1104 binary design matrix, 1083 Burt table, 1076, 1084 coding, 1085 COLUMN= option, use, 1099 computational resources, 1096 correspondence analysis, 1069 doubling, 1085 dual scaling, 1069 fuzzy coding, 1085, 1087 geometry of distance between points, 1100, 1112, 1118 homogeneity analysis, 1069 inertia, definition, 1070 input tables and variables, 1072, 1082, 1083 matrix decompositions, 1079, 1100 matrix formulas for statistics, 1103 memory requirements, 1097 missing values, 1077, 1088, 1092 multiple correspondence analysis (MCA), 1076, 1101, 1123 ODS graph names, 1109 optimal scaling, 1069 optimal scoring, 1069 OUTC= data set, 1094 OUTF= data set, 1095 output data sets, 1094 output table names, 1108 partial contributions to inertia table, 1103 %PLOTIT macro, 1070, 1118, 1128 PROFILE= option, use, 1099 quantification method, 1069 reciprocal averaging, 1069 ROW= option, use, 1099 scalogram analysis, 1069 supplementary rows and columns, 1080, 1102 syntax, abbreviations, 1073 TABLES statement, use, 1072, 1081, 1088
4919
time requirements, 1097 VAR statement, use, 1072, 1081, 1091 correspondence analysis CORRESP procedure, 1069 COSAN model CALIS procedure, 552, 591 specification, 560 structural model example (CALIS), 561 Cosine coefficient DISTANCE procedure, 1272 counting process PHREG procedure, 3241 covariance LATTICE procedure, 2075 parameter estimates (MIXED), 2674, 2676 parameter estimates, ratio (MIXED), 2680 parameters (MIXED), 2661 principal components, 3610, 3612 regression coefficients (SURVEYREG), 4392 SURVEYFREQ procedure, 4210 covariance coefficients, See INBREED procedure covariance matrix for parameter estimates (CATMOD), 842 for response functions (CATMOD), 842 GENMOD procedure, 1638, 1656 NLMIXED procedure, 3101, 3106 PHREG procedure, 3222, 3233, 3245 REG procedure, 3817 singular (CATMOD), 887 symmetric and positive definite (SIM2D), 4107 covariance parameter estimates MIXED procedure, 2750 Covariance similarity coefficient DISTANCE procedure, 1271 covariance structure analysis model, See COSAN model covariance structures examples (MIXED), 2723, 2782 MIXED procedure, 2664, 2721 covariates GLMPOWER procedure, 1938–1941, 1951, 1956 MIXED procedure, 2743 model parameterization (GLM), 1787 covarimin method, 1291, 1318 coverage displays FACTOR procedure, 1328 COVRATIO MIXED procedure, 2770 COVRATIO for covariance parameters MIXED procedure, 2770 COVRATIO statistic, 3899 COVTRACE MIXED procedure, 2770 COVTRACE for covariance parameters MIXED procedure, 2770 Cox regression analysis PHREG procedure, 3219
4920
Subject Index
semiparametric model (PHREG), 3215 Cramer’s V statistic, 1469, 1474 Cramer-von Mises test NPAR1WAY procedure, 3170 Crawford-Ferguson family, 1291 Crawford-Ferguson method, 1317, 1318 Crime Rates Data, example PRINCOMP procedure, 3619 cross validated density estimates MODECLUS procedure, 2872 cross validation DISCRIM procedure, 1163 PLS procedure, 3368, 3374, 3384 crossed effects design matrix (CATMOD), 877 GENMOD procedure, 1660 MIXED procedure, 2744 model parameterization (GLM), 1788 specifying (ANOVA), 451, 452 specifying (CATMOD), 864 specifying (GLM), 1784 crossover designs power and sample size (POWER), 3549 crossproducts matrix REG procedure, 3917 crosstabulation (SURVEYFREQ) tables, 4227 crosstabulation tables, 1431, 1450, 4196 SURVEYFREQ procedure, 4227 cubic clustering criterion, 970, 973 CLUSTER procedure, 968 cumulative baseline hazard function PHREG procedure, 3257 cumulative distribution function, 2114, 3705 LIFETEST procedure, 2149 cumulative logit model SURVEYLOGISTIC procedure, 4284 cumulative logits, See also response functions examples, (CATMOD), 869 specifying in CATMOD procedure, 853 using (CATMOD), 868 cumulative martingale residuals PHREG procedure, 3223, 3266, 3271 cumulative mean function, See mean function cumulative residuals, 1718, 1725 custom scoring coefficients, example SCORE procedure, 4086 customizing ODS graphs, 363, 369, 371, 373, 379 ODS styles, 345, 374, 376, 378 Czekanowski/Sorensen similarity coefficient DISTANCE procedure, 1276
D DATA step, 21 Davidon-Fletcher-Powell update, 3073 decomposition of the SSCP matrix
ACECLUS procedure, 387 Default style ODS styles, 333, 346 degrees of freedom CALIS procedure, 573, 576, 590, 676 FACTOR procedure, 1320 MI procedure, 2561 MIANALYZE procedure, 2625, 2627 models with class variables (GLM), 1791 NLMIXED procedure, 3062 SURVEYFREQ procedure, 4214 SURVEYMEANS procedure, 4340, 4344 SURVEYREG procedure, 4385 TRANSREG procedure, 4615 delete variables (REG), 3820 deleting observations REG procedure, 3903 dendritic method, See single linkage dendrogram, 4743 density estimation DISCRIM procedure, 1180, 1200 MODECLUS procedure, 2870 density function, See probability density function density linkage CLUSTER procedure, 966, 967, 969, 970, 972, 977, 980, 982 dependent effect, definition, 451 descriptive statistics, See also UNIVARIATE procedure LOGISTIC procedure, 2294 PHREG procedure, 3223 SURVEYMEANS procedure, 4347 design effect SURVEYFREQ procedure, 4215 SURVEYREG procedure, 4388 design matrix formulas (CATMOD), 894 generation in CATMOD procedure, 876 GENMOD procedure, 1661 GLMMOD procedure, 1909, 1917, 1918 linear dependence in (GENMOD), 1661 TRANSREG procedure, 4586 design of experiments, See experimental design design points, TPSPLINE procedure, 4499, 4517 design summary SURVEYREG procedure, 4390 design-adjusted chi-square test SURVEYFREQ procedure, 4216 determination index CALIS procedure, 657 deviance definition (GENMOD), 1615 GENMOD procedure, 1637 LOGISTIC procedure, 2308, 2316, 2354 PROBIT procedure, 3745, 3760 scaled (GENMOD), 1656
Subject Index deviance residuals GENMOD procedure, 1670 LOGISTIC procedure, 2360 PHREG procedure, 3234, 3258, 3302 deviations-from-means coding TRANSREG procedure, 4568, 4594, 4654, 4668, 4670 DFBETA statistics PHREG procedure, 3234, 3260 DFBETAS statistic (REG), 3900 DFBETAS statistics LOGISTIC procedure, 2360 DFFITS MIXED procedure, 2768 DFFITS statistic GLM procedure, 1774 REG procedure, 3899 diagnostic statistics REG procedure, 3896, 3897 diagnostics panels plots, ODS Graphics, 322, 379 diameter method, See complete linkage Dice coefficient DISTANCE procedure, 1276 difference between means confidence intervals, 442 dimension coefficients MDS procedure, 2471, 2472, 2477, 2482, 2483, 2488, 2489 dimensions MIXED procedure, 2678 direct effects design matrix (CATMOD), 879 specifying (CATMOD), 865 direct product structure MIXED procedure, 2721 discordant observations, 1474 discrete logistic model likelihood (PHREG), 3241 PHREG procedure, 3216, 3228, 3283 discrete variables, See classification variables DISCRIM procedure background, 1156 Bayes’ theorem, 1157 bivariate density estimation, 1200 calibration data set, 1139, 1167 classification criterion, 1139 computational resources, 1173 %CONTOUR macro, 1201 cross validation, 1163 density estimate, 1157, 1160, 1161, 1180, 1200 error rate estimation, 1163, 1165 input data sets, 1168, 1169 introductory example, 1140 kernel density estimates, 1191, 1212 memory requirements, 1174 missing values, 1156
4921
nonparametric methods, 1158 ODS table names, 1178 optimal bandwidth, selection, 1162 output data sets, 1170, 1171 parametric methods, 1157 %PLOT macro, 1182 %PLOTIT macro, 1200 posterior probability, 1158, 1160, 1180, 1200 posterior probability error rate, 1163, 1165 quasi-inverse, 1164 resubstitution, 1163 squared distance, 1157, 1158 test set classification, 1163 time requirements, 1174 training data set, 1139 univariate density estimation, 1180 discriminant analysis, 1139 canonical, 783, 1139 error rate estimation, 1140 misclassification probabilities, 1140 nonparametric methods, 1158 parametric methods, 1157 stepwise selection, 4157 discriminant function method MI procedure, 2544 discriminant functions, 784 disjoint clustering, 1379–1381 dispersion parameter estimation (GENMOD), 1614, 1657, 1665, 1666 GENMOD procedure, 1658 LOGISTIC procedure, 2354 PROBIT procedure, 3760 weights (GENMOD), 1650 displayed output SURVEYSELECT procedure, 4458 dissimilarity data MDS procedure, 2471, 2478, 2484 distance between clusters (FASTCLUS), 1398 classification (VARIOGRAM), 4874 data (FASTCLUS), 1379 data (MDS), 2471 Euclidean (FASTCLUS), 1380 distance data MDS procedure, 2478, 2484 DISTANCE data sets CLUSTER procedure, 969 distance measures available in DISTANCE procedure, See proximity measures DISTANCE procedure absent-absent match, asymmetric binary variable, 1250 absent-absent match, example, 1278 absolute level of measurement, definition, 1250 affine transformation, 1250 asymmetric binary variable, 1250 available levels of measurement, 1263 available options for the option list, 1264
4922
Subject Index
Binary Lance and Williams nonmetric coefficient, 1276 Bray and Curtis coefficient, 1276 Canberra metric coefficient, 1272 Chebychev distance coefficient, 1272 chi-squared coefficient, 1273 Cityblock distance coefficient, 1272 computing distances with weights, 1267 Correlation dissimilarity coefficient, 1271 Correlation similarity coefficient, 1271 Cosine coefficient, 1272 Covariance similarity coefficient, 1271 Czekanowski/Sorensen similarity coefficient, 1276 Dice coefficient, 1276 Dot Product coefficient, 1273 Euclidean distance coefficient, 1271 examples, 1251, 1278, 1283 extension of binary variable, 1251 formatted values, 1277 formulas for proximity measures, 1270 frequencies, 1268 functional summary, 1255 fuzz factor, 1257 Generalized Euclidean distance coefficient, 1272 Gower’s dissimilarity coefficient, 1271 Gower’s similarity coefficient, 1270 Hamann coefficient, 1274 Hamming distance coefficient, 1274 identity transformation, 1250 initial estimates for A-estimates, 1257 interval level of measurement, 1250 Jaccard dissimilarity coefficient, 1276 Jaccard similarity coefficient, 1276 Kulcynski 1 coefficient, 1276 Lance-Williams nonmetric coefficient, 1272 levels of measurement, 1249 linear transformation, 1250 log-interval level of measurement, 1250 many-to-one transformation, 1249 Minkowski L(p) distance coefficient, 1272 missing values, 1261, 1262, 1267, 1276 monotone increasing transformation, 1249 nominal level of measurement, 1249 nominal variable, 1251 normalization, 1261, 1263 one-to-one transformation, 1249 ordinal level of measurement, 1249 output data sets, 1261, 1277 Overlap dissimilarity coefficient, 1273 Overlap similarity coefficient, 1273 phi-squared coefficient, 1273 Power distance coefficient, 1272 power transformation, 1250 ratio level of measurement, 1250 Roger and Tanimoto coefficient, 1274 Russell and Rao similarity coefficient, 1276 scaling variables, 1251 Shape distance coefficient, 1271
Similarity Ratio coefficient, 1272 Simple Matching coefficient, 1274 Simple Matching dissimilarity coefficient, 1274 Size distance coefficient, 1271 Sokal and Sneath 1 coefficient, 1274 Sokal and Sneath 3 coefficient, 1275 Squared Correlation dissimilarity coefficient, 1272 Squared Correlation similarity coefficient, 1272 Squared Euclidean distance coefficient, 1271 standardization methods, 1265 standardization with frequencies, 1268 standardization with weights, 1269 standardization, default methods, 1251, 1265 standardization, example, 1253 standardization, mandatory, 1251 strictly increasing transformation, 1249 summary of options, 1255 symmetric binary variable, 1250 transforming ordinal variables to interval, 1250 transforming ordinal variables to iterval, 1261 weights, 1267, 1269 distance tolerance VARIOGRAM procedure, 4867 distributions Gompertz, 3705 logistic, 3705 normal, 3705 DOCUMENT destination, 361 examples, ODS Graphics, 361 ODS Graphics, 326, 335 DOCUMENT procedure, 361 document paths, 362 examples, ODS Graphics, 361 Documents window, 361 dollar-unit sampling SURVEYSELECT procedure, 4466 domain analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336, 4348 domain mean SURVEYMEANS procedure, 4343 domain statistics SURVEYMEANS procedure, 4342 domain total SURVEYMEANS procedure, 4343 DOT product (SCORE), 4065 Dot Product coefficient DISTANCE procedure, 1273 double arcsine test MULTTEST procedure, 2951 double dogleg algorithm (CALIS), 578, 579, 581, 665 method (NLMIXED), 3073 dual scaling CORRESP procedure, 1069 dummy variables TRANSREG procedure, 4560, 4569, 4586 dummy variables example
Subject Index TRANSREG procedure, 4654 Duncan’s multiple range test, 442, 1766, 1814 Duncan-Waller test, 445, 1769, 1815 error seriousness ratio, 443, 1768 multiple comparison (ANOVA), 464 Dunnett’s adjustment GLM procedure, 1754 MIXED procedure, 2688 Dunnett’s test, 1766, 1767, 1812 one-tailed lower, 442, 1766 one-tailed upper, 442 two-tailed, 442
E EBLUP MIXED procedure, 2703 EDF tests NPAR1WAY procedure, 3168 effect coding (TRANSREG), 4549, 4568, 4654, 4668, 4670 definition, 451, 864, 1784 name length (MIXED), 2678 specification (ANOVA), 451 specification (CATMOD), 864 specification (GENMOD), 1659 specification (GLM), 1784 testing (SURVEYREG), 4386, 4392 EFFECT Parameterization SURVEYLOGISTIC procedure, 4270 effect size power and sample size (POWER), 3485 effective sample size LIFETEST procedure, 2172, 2173 Efron method likelihood (PHREG), 3228, 3241 eigenvalues and eigenvectors ACECLUS procedure, 398, 399, 406, 410–412 CANCORR procedure, 754, 769 PRINCOMP procedure, 3595, 3610–3612 RSREG procedure, 4050 Einot and Gabriel’s multiple range test ANOVA procedure, 444 examples (GLM), 1851 GLM procedure, 1768, 1815 Ekblom-Newton algorithm FASTCLUS procedure, 1392 elementary linkage analysis, See single linkage EM algorithm MI procedure, 2536, 2566 empirical Bayes estimation NLMIXED procedure, 3084 empirical best linear unbiased prediction MIXED procedure, 2703 empirical distribution function tests (NPAR1WAY), 3168 empirical sandwich estimator MIXED procedure, 2676
4923
empty stratum SURVEYMEANS procedure, 4334, 4344 Epanechnikov kernel (DISCRIM), 1160 EQS program CALIS procedure, 555 equal precision bands LIFETEST procedure, 2169, 2178, 2205 equality of means (TTEST), 4775, 4789 of variances (TTEST), 4775, 4784, 4789 equamax method, 1291, 1317, 1318 equivalence tests power and sample size (POWER), 3432, 3438, 3448, 3456, 3463, 3472, 3510, 3511, 3520, 3521, 3530, 3531, 3549 error rate estimation DISCRIM procedure, 1163, 1165 discriminant analysis, 1140 error seriousness ratio Waller-Duncan test, 443, 1768 error sum of squares clustering method, See Ward’s method estimability GLM procedure, 1750 MIXED procedure, 2682 estimability checking GENMOD procedure, 1632 LOGISTIC procedure, 2300 SURVEYLOGISTIC procedure, 4258 TPHREG procedure, 4481 estimable functions checking (GLM), 1750 displaying (GLM), 1771 example (GLM), 1792 general form of, 1793 GLM procedure, 1751, 1752, 1758, 1773, 1792– 1794, 1796, 1797, 1803 MIXED procedure, 2702 printing (GLM), 1771 SURVEYREG procedure, 4378, 4393 estimated population marginal means, See least-squares means estimation dispersion parameter (GENMOD), 1614 maximum likelihood (GENMOD), 1655 mixed model (MIXED), 2737 regression parameters (GENMOD), 1614 estimation criteria CALIS procedure, 646 estimation methods CALIS procedure, 549, 574, 644–647 MIXED procedure, 2677 Euclidean distance coefficient DISTANCE procedure, 1271 Euclidean distances, 969, 971, 1158, 1380 clustering, 957 MDS procedure, 2472, 2477, 2488 Euclidean length STDIZE procedure, 4138
4924
Subject Index
event times PHREG procedure, 3215, 3218 events/trials format for response GENMOD procedure, 1636, 1653 exact logistic regression LOGISTIC procedure, 2300, 2369 exact method likelihood (PHREG), 3228, 3240 exact tests computational algorithms (FREQ), 1509 computational algorithms (NPAR1WAY), 3172 computational resources (FREQ), 1511 computational resources (NPAR1WAY), 3173 confidence limits, 1443 examples (NPAR1WAY), 3190, 3191 FREQ procedure, 1508, 1543 MONTE Carlo estimates (NPAR1WAY), 3174 network algorithm (FREQ), 1509 NPAR1WAY procedure, 3171 p-value, definitions, 1510, 3172 permutation test (MULTTEST), 2949 examples, ODS Graphics, 352 axes labels, modifying, 368 closing all destinations, 360 customizing ODS graphs, 363, 369, 371, 373, 379 customizing styles, 374, 376, 378 DOCUMENT destination, 361 DOCUMENT procedure, 361 editing templates, 365 excluding graphs, 352 graph axes, swapping, 371 graph fonts, modifying, 374, 379 graph names, 352, 361 graph sizes, modifying, 378 graph titles, modifying, 368 grid lines, modifying, 373 HTML output, 321, 324, 330, 352 HTML output, with tool tips, 354 Journal style, 358 LaTeX output, 358 line colors, modifying, 369 line patterns, modifying, 369 line thickness, modifying, 376 marker symbols, modifying, 369 multiple destinations, 360 PDF output, 360 presentations, 356 referring to graphs, 330 relative paths, 359 replaying output, 361, 363, 369 RTF output, 356, 360 saving templates, 368 selecting graphs, 330, 352, 361 style attributes, modifying, 374, 376, 378 style elements, modifying, 374, 376, 378 TEMPLATE procedure, 363, 369, 371, 373, 379 tick marks, modifying, 373 trace record, 330, 352, 363
excluded observations PRINQUAL procedure, 3657, 3674 TRANSREG procedure, 4605 excluding graphs examples, ODS Graphics, 352 ODS Graphics, 330 exemplary data set power and sample size (GLMPOWER), 1929, 1930, 1936, 1938, 1946, 1952 exogenous variables CALIS procedure, 662 path diagram (CALIS), 664 expected mean squares computing, types (GLM), 1835 random effects, 1833 expected trend MULTTEST procedure, 2951 expected weighted frequency SURVEYFREQ procedure, 4215 experimental design, 23, 1901, See also PLAN procedure aliasing structure (GLM), 1897 experimentwise error rate (GLM), 1809 explicit intercept TRANSREG procedure, 4605 exploratory data analysis, 22 exponential distribution GENMOD procedure, 1704 exponential semivariogram model KRIGE2D procedure, 2048, 2049 External studentization, 2763 external unfolding MDS procedure, 2471 extreme value distribution PROBIT procedure, 3757
F F statistics CLUSTER procedure, 972, 984 GENMOD procedure, 1668 factor defined for factor analysis, 1292 factor analysis compared to component analysis, 1291, 1292 factor analysis model CALIS procedure, 554, 606, 608 COSAN statement (CALIS), 593 identification (CALIS), 606 LINEQS statement (CALIS), 602 MATRIX statement (CALIS), 595 path diagram (CALIS), 599, 600 RAM statement (CALIS), 598 specification (CALIS), 560 structural model example (CALIS), 564 factor analytic structures MIXED procedure, 2721 factor loadings CALIS procedure, 641 factor parsimax method, 1291, 1317, 1318
FACTOR procedure CALIS procedure, 567, 572, 606 computational resources, 1335 coverage displays, 1328 degrees of freedom, 1320 Heywood cases, 1333 number of factors extracted, 1312 OUT= data sets, 1316 output data sets, 1297, 1316 simplicity functions, 1294, 1316, 1329 time requirements, 1331 variances, 1320 Factor rotation with FACTOR procedure, 1298 factor rotation methods, 1291 factor scores CALIS procedure, 641, 643, 687 displaying (CALIS), 687 factor scoring coefficients FACTOR procedure, 4065 SCORE procedure, 4065, 4076 factor structure, 1298 factors PLAN procedure, 3339, 3340, 3348 PLS procedure, 3367 failure time LIFEREG procedure, 2083 false discovery rate adjustment (MULTTEST), 2959 false discovery rate (MULTTEST) p-value adjustments, 2959 false negative, false positive rate LOGISTIC procedure, 2314, 2353, 2422 fast Fourier transform KDE procedure, 2007 MULTTEST procedure, 2950 FASTCLUS procedure algorithm for updating cluster seeds, 1392 bin-sort algorithm, 1389 cluster deletion, 1390 clustering criterion, 1379, 1391, 1392 clustering methods, 1380, 1381 compared to other procedures, 1403 computational problems, convergence, 1390 computational resources, 1402 controlling iterations, 1393 convergence criterion, 1390 distance, 1379, 1380, 1398 DRIFT option, 1380 Ekblom-Newton algorithm, 1392 homotopy parameter, 1390 imputation of missing values, 1391 incompatibilities, 1397 iteratively reweighted least-squares, 1391 Lp clustering, 1379, 1391 MEAN= data sets, 1394 memory requirements, 1402 Merle-Spath algorithm, 1392 missing values, 1380, 1381, 1391, 1393, 1397
Newton algorithm, 1392 OUT= data sets, 1398 outliers, 1379 output data sets, 1393, 1394, 1398 output table names, 1407 OUTSTAT= data set, 1394, 1400 random number generator, 1394 scale estimates, 1390, 1392, 1397, 1399, 1400 seed replacement, 1381, 1394 weighted cluster means, 1395 fiducial limits, 3712, 3713, 3759 finite differencing NLMIXED procedure, 3064, 3091 finite population correction (fpc) SURVEYLOGISTIC procedure, 4280 SURVEYMEANS procedure, 4334 finite population correction factor, 165 first canonical variable, 783 first-order method NLMIXED procedure, 3085 first-stage sampling unit, 165 Fisher combination adjustment (MULTTEST), 2959 Fisher combination (MULTTEST) p-value adjustments, 2959 Fisher exact test MULTTEST procedure, 2944, 2946, 2954, 2975 Fisher information matrix example (MIXED), 2796 MIXED procedure, 2750 Fisher’s exact test FREQ procedure, 1469, 1472, 1473 power and sample size (POWER), 3457, 3463, 3525 Fisher’s LSD test, 445, 1769 Fisher’s scoring method GENMOD procedure, 1642, 1656 LOGISTIC procedure, 2317, 2318, 2336 MIXED procedure, 2674, 2680, 2774 SURVEYLOGISTIC procedure, 4264, 4265, 4276 Fisher’s z test for correlation power and sample size (POWER), 3426, 3429, 3502, 3556 fit plots plots, ODS Graphics, 322, 354 fit statistics SURVEYREG procedure, 4390 fixed effects MIXED procedure, 2663 sum-to-zero assumptions, 1835 VARCOMP procedure, 4831, 4837 fixed-effects model VARCOMP procedure, 4837 fixed-effects parameters MIXED procedure, 2661, 2732 fixed-radius kernels MODECLUS procedure, 2870 Fleiss-Cohen weights, 1497
Fleming-Harrington Gρ test for homogeneity LIFETEST procedure, 2150, 2168 flexible-beta method CLUSTER procedure, 957, 967, 968, 981 floating point errors NLMIXED procedure, 3098 folded form F statistic, 4775, 4784 formatted values DISTANCE procedure, 1277 formulas CANCORR procedure, 765 forward selection LOGISTIC procedure, 2317, 2340 PHREG procedure, 3229, 3264 REG procedure, 3800, 3873 Forward-Dolittle transformation, 1794 fraction of missing information MI procedure, 2562 MIANALYZE procedure, 2625 fractional frequencies PHREG procedure, 3227 fractional sample size GLMPOWER procedure, 1947 POWER procedure, 3419, 3496 Frailty model example NLMIXED procedure, 3128 Freeman-Halton test, 1473 Freeman-Tukey test MULTTEST procedure, 2946, 2951, 2968 FREQ procedure alpha level, 1445, 1453 binomial proportion, 1484, 1532 Bowker’s test of symmetry, 1493 Breslow-Day test, 1508 cell count data, 1464, 1527 chi-square tests, 1469–1471, 1530, 1535 Cochran’s Q test, 1493 Cochran-Mantel-Haenszel statistics, 1540 computation time, limiting, 1445 computational methods, 1509 computational resources, 1511, 1513 contingency coefficient, 1469 contingency table, 1535 continuity-adjusted chi-square, 1469, 1471 correlation statistic, 1501 Cramer’s V statistic, 1469 default tables, 1450 displayed output, 1517 exact p-values, 1510 EXACT statement, used with TABLES, 1446 exact tests, 1443, 1508, 1543 Fisher’s exact test, 1469 Friedman’s chi-square statistic, 1546 gamma statistic, 1474 general association statistic, 1502 grouping variables, 1465 input data sets, 1441, 1464 kappa coefficient, 1497, 1498 Kendall’s tau-b statistic, 1474
lambda asymmetric, 1474 lambda symmetric, 1474 likelihood-ratio chi-square test, 1469 Mantel-Haenszel chi-square test, 1469 McNemar’s test, 1493 measures of association, 1474 missing values, 1466 Monte Carlo estimation, 1443, 1445, 1512 multiway tables, 1518, 1520, 1521 network algorithm, 1509 odds ratio, 1488, 1503, 1504 ODS table names, 1524 one-way frequency tables, 1469, 1470, 1517, 1518, 1530 order of variables, 1442 output data sets, 1446, 1514–1516, 1527, 1538 output variable prefixes, 1516 OUTPUT, used with TABLES or EXACT, 1449 overall kappa coefficient, 1493 Pearson chi-square test, 1469, 1471 Pearson correlation coefficient, 1474 phi coefficient, 1469 polychoric correlation coefficient, 1474 relative risk, 1489, 1503, 1507 row mean scores statistic, 1502 scores, 1468 simple kappa coefficient, 1493 Somers’ D statistics, 1474 Spearman rank correlation coefficient, 1474 statistical computations, 1468 stratified table, 1540 Stuart’s tau-c statistic, 1474 two-way frequency tables, 1470, 1471, 1535 uncertainty coefficients, 1474 weighted kappa coefficient, 1493 FREQ statement and RMSSTD statement (CLUSTER), 974 frequency tables, 1431, 1450, 4196 generating (CATMOD), 842, 845 input to CATMOD procedure, 861 one-way (FREQ), 1469, 1470, 1517, 1518, 1530 one-way (SURVEYFREQ), 4226 two-way (FREQ), 1470, 1471, 1535 frequency variable LOGISTIC procedure, 2304 PRINQUAL procedure, 3658 programming statements (PHREG), 3235 SURVEYLOGISTIC procedure, 4258 TRANSREG procedure, 4557 value (PHREG), 3227 Friedman’s chi-square statistic, 1546 full sibs mating INBREED procedure, 1981 full-rank coding TRANSREG procedure, 4568 furthest neighbor clustering, See complete linkage fuzzy coding CORRESP procedure, 1087
G G matrix MIXED procedure, 2663, 2713, 2732, 2733, 2816 G-G epsilon, 1829 G2 inverse NLIN procedure, 3025 Gabriel’s multiple-comparison procedure ANOVA procedure, 443 GLM procedure, 1767, 1811 GAM procedure comparing PROC GAM with PROC LOESS, 1596 Estimates from PROC GAM, 1579 generalized additive model with binary data, 1582 graphics, 1581 ODS graph names, 1581 ODS table names, 1580 Poisson regression analysis of component reliability, 1589 gamma distribution, 2083, 2097, 2111 GENMOD procedure, 1652 NLMIXED procedure, 3077 gamma statistic, 1474, 1476 Gaussian assumption SIM2D procedure, 4091 Gaussian distribution NLMIXED procedure, 3077 Gaussian random field SIM2D procedure, 4091 Gaussian semivariogram model KRIGE2D procedure, 2047, 2048 GCV function, TPSPLINE procedure, 4497, 4499, 4514, 4522 nonhomogeneous variance, 4514 GEE, See Generalized Estimating Equations Gehan test, See Wilcoxon test for homogeneity power and sample size (POWER), 3473, 3482, 3533 general association statistic, 1502 general distribution NLMIXED procedure, 3078 general linear structure MIXED procedure, 2721 generalized Crawford-Ferguson family, 1291 generalized Crawford-Ferguson method, 1317, 1318 generalized cross validation function (GCV), 4497, 4522 generalized cyclic incomplete block design generating with PLAN procedure, 3357 Generalized Estimating Equations (GEE), 1621, 1646, 1672, 1708, 1713 Generalized Euclidean distance coefficient DISTANCE procedure, 1272 generalized inverse, 2740 MIXED procedure, 2684
NLIN procedure, 3025 NLMIXED procedure, 3065 generalized least squares, See weighted least-squares estimation generalized linear model GENMOD procedure, 1612, 1613 theory (GENMOD), 1650 generalized logistic model SURVEYLOGISTIC procedure, 4284 generalized logits, See also response functions examples (CATMOD), 869 formulas (CATMOD), 893 specifying in CATMOD procedure, 853 using (CATMOD), 868 generation (INBREED) nonoverlapping, 1967, 1970, 1971 number, 1974 overlapping, 1967, 1969 variable, 1974 GENMOD procedure adjusted residuals, 1670 aliasing, 1620 bar (|) operator, 1660 binomial distribution, 1652 built-in link function, 1614 built-in probability distribution, 1614 classification variables, 1660 confidence intervals, 1637 confidence limits, 1636 continuous variables, 1660 contrasts, 1633 convergence criterion, 1637, 1647 correlated data, 1611, 1672 correlation matrix, 1638, 1656 correlations, least-squares means, 1636 covariance matrix, 1638, 1656 covariances, least-squares means, 1636 crossed effects, 1660 design matrix, 1661 deviance, 1637 deviance definition, 1615 deviance residuals, 1670 dispersion parameter, 1658 dispersion parameter estimation, 1614, 1665, 1666 dispersion parameter weights, 1650 effect specification, 1659 estimability, 1635 estimability checking, 1632 events/trials format for response, 1636, 1653 expected information matrix, 1656 exponential distribution, 1704 F statistics, 1668 Fisher’s scoring method, 1642, 1656 gamma distribution, 1652 GEE, 1611, 1621, 1646, 1672, 1708, 1711, 1713 Generalized Estimating Equations (GEE), 1611 generalized linear model, 1612, 1613
goodness of fit, 1656 gradient, 1655 Hessian matrix, 1655 information matrix, 1642 initial values, 1638, 1647 intercept, 1615, 1617, 1640 inverse Gaussian distribution, 1652 L matrices, 1635 Lagrange multiplier statistics, 1668 life data, 1701 likelihood residuals, 1670 linear model, 1612 linear predictor, 1611, 1612, 1618, 1661, 1689 link function, 1611, 1612, 1653 log-likelihood functions, 1654 log-linear models, 1616 logistic regression, 1697 main effects, 1660 maximum likelihood estimation, 1655 – MEAN– automatic variable, 1645 Model checking, 1718 model checking, 1725 multinomial distribution, 1653 multinomial models, 1671 negative binomial distribution, 1652 nested effects, 1660 Newton-Raphson algorithm, 1655 normal distribution, 1651 observed information matrix, 1656 offset, 1640, 1689 offset variable, 1617 ordinal data, 1704 output ODS graphics table names, 1695 output table names, 1693 overdispersion, 1659 Pearson residuals, 1670 Pearson’s chi-square, 1637, 1656, 1657 Poisson distribution, 1652 Poisson regression, 1616 polynomial effects, 1660 profile likelihood confidence intervals, 1640, 1666 programming statements, 1645 quasi-likelihood, 1659 raw residuals, 1669 regression parameters estimation, 1614 regressor effects, 1660 repeated measures, 1611, 1672 residuals, 1641, 1669, 1670 – RESP– automatic variable, 1645 scale parameter, 1653 scaled deviance, 1656 score statistics, 1668 singular contrast matrix, 1632 subpopulation, 1637 suppressing output, 1627 Type 1 analysis, 1615, 1665 Type 3 analysis, 1615, 1665 user-defined link function, 1634
variance function, 1614 Wald confidence intervals, 1643, 1667 working correlation matrix, 1647, 1648, 1672 – XBETA– automatic variable, 1645 Gentleman-Givens computational method, 3197 geometric anisotropy KRIGE2D procedure, 2053–2055 getting started ODS Graphics, 321 GLM Parameterization SURVEYLOGISTIC procedure, 4271 GLM procedure absorption of effects, 1747, 1799 aliasing structure, 1771, 1897 alpha level, 1755, 1765, 1771, 1775 at sign (@) operator, 1786 bar (|) operator, 1786 Bartlett’s test, 1767, 1819 Bonferroni adjustment, 1754 Brown and Forsythe’s test, 1767, 1819 canonical analysis, 1760 characteristic roots and vectors, 1759 compared to other procedures, 1734, 1776, 1833, 1885, 1901, 2664, 2985, 3197, 4033 comparing groups, 1804 computational method, 1840 computational resources, 1837 contrasts, 1749, 1779 covariate values for least-squares means, 1755 disk space, 1746 Dunnett’s adjustment, 1754 effect specification, 1784 error effect, 1759 estimability, 1750–1752, 1758, 1773, 1793, 1803 estimable functions, 1792 ESTIMATE specification, 1801 homogeneity of variance tests, 1767, 1818 Hsu’s adjustment, 1754 hypothesis tests, 1781, 1792 interactive use, 1787 interactivity and BY statement, 1748 interactivity and missing values, 1787, 1837 introductory example, 1735 least-squares means (LS-means), 1753 Levene’s test for homogeneity of variance, 1767, 1819 means, 1763 means versus least-squares means, 1804 memory requirements, reduction of, 1747 missing values, 1745, 1759, 1822, 1837 model specification, 1784 multiple comparisons, least-squares means, 1754, 1757, 1806, 1808 multiple comparisons, means, 1765–1769, 1806, 1808 multiple comparisons, procedures, 1763 multivariate analysis of variance, 1745, 1759, 1823
nonstandard weights for least-squares means, 1756 O'Brien's test, 1767 observed margins for least-squares means, 1756 ODS graph names, 1847 ODS table names, 1844 output data sets, 1773, 1840–1842 parameterization, 1787 positional requirements for statements, 1743 predicted population margins, 1753 Q effects, 1834 random effects, 1776, 1833 regression, quadratic, 1738 relation to GLMMOD procedure, 1909 repeated measures, 1777, 1825 Sidak's adjustment, 1754 simple effects, 1758 simulation-based adjustment, 1754 singularity checking, 1750, 1752, 1758, 1772 sphericity tests, 1780, 1829 SSCP matrix for multivariate tests, 1759 statistical assumptions, 1783 summary of features, 1733 tests, hypothesis, 1749 transformations for MANOVA, 1759 transformations for repeated measures, 1779 Tukey's adjustment, 1754 types of least-squares means comparisons, 1757 unbalanced analysis of variance, 1735, 1804, 1856 unbalanced design, 1735, 1804, 1833, 1856, 1882 weighted analysis, 1782 weighted means, 1820 Welch's ANOVA, 1769 WHERE statement, 1787 GLMMOD alternative TRANSREG procedure, 4586, 4654 GLMMOD procedure design matrix, 1909, 1917, 1918 input data sets, 1914 introductory example, 1909 missing values, 1917, 1918 ODS table names, 1918 output data sets, 1915, 1917, 1918 relation to GLM procedure, 1909 screening experiments, 1923 GLMPOWER procedure actual power, 1946, 1947, 1953 alpha level, 1939 analysis of variance, 1930, 1951, 1956 ceiling sample size, 1947 compared to other procedures, 1930, 3412 computational methods, 1949 contrasts, 1934, 1937, 1950, 1951, 1956 covariates, class and continuous, 1938–1941, 1951, 1956 displayed output, 1947
exemplary data set, 1929, 1930, 1936, 1938, 1946, 1952 fractional sample size, 1947 introductory example, 1930 nominal power, 1946, 1947, 1953 number-lists, 1945 ODS table names, 1948 plots, 1930, 1936, 1942 positional requirements for statements, 1936 sample size adjustment, 1946 summary of statements, 1936 value lists, 1945 global influence LD statistic (PHREG), 3234, 3260 LMAX statistic (PHREG), 3234, 3261 global kriging KRIGE2D procedure, 2034 global null hypothesis PHREG procedure, 3218, 3246, 3269 score test (PHREG), 3229, 3274 TPHREG procedure, 4486 Gompertz distribution, 3705 goodness of fit GENMOD procedure, 1656 Gower’s dissimilarity coefficient DISTANCE procedure, 1271 Gower’s method, See also median method CLUSTER procedure, 967, 981 Gower’s similarity coefficient DISTANCE procedure, 1270 gradient CALIS procedure, 634, 664, 665 GENMOD procedure, 1655 LOGISTIC procedure, 2343 MIXED procedure, 2675, 2749 SURVEYLOGISTIC procedure, 4287 Graeco-Latin square generating with PLAN procedure, 3345 graph axes, swapping examples, ODS Graphics, 371 graph fonts, modifying examples, ODS Graphics, 374, 379 graph names examples, ODS Graphics, 352, 361 ODS Graphics, 330 graph sizes, modifying examples, ODS Graphics, 378 graph template definitions ODS Graphics, 338, 342, 365 Q-Q plots, 366 graph template language ODS Graphics, 338, 342 TEMPLATE procedure, 342 graph titles, modifying examples, ODS Graphics, 368 graphics, See plots examples (REG), 3948
high-resolution plots (REG), 3840 keywords (REG), 3841 options (REG), 3843 saving output (MI), 2526 graphics catalog, specifying LIFEREG procedure, 2090 PROBIT procedure, 3712 graphics image files base file names, 335 file names, 335 file types, 334, 336, 350 index counter, 335 ODS Graphics, 334 PostScript, 337, 358 graphs, See plots Greenhouse-Geisser epsilon, 1829 grid lines, modifying examples, ODS Graphics, 373 grid search example (MIXED), 2796 group average clustering, See average linkage grouped-name-lists POWER procedure, 3490 grouped-number-lists POWER procedure, 3490 growth curve analysis example (CATMOD), 933 example (MIXED), 2733 GSK models, 817 GT2 multiple-comparison method, 444, 1769, 1811
H H-F epsilon, 1829 half-fraction design, analysis, 1895 half-width, confidence intervals, 3488 Hall-Wellner bands LIFETEST procedure, 2169, 2177, 2205 Hamann coefficient DISTANCE procedure, 1274 Hamming distance coefficient DISTANCE procedure, 1274 Hannan-Quinn information criterion MIXED procedure, 2676 Harris component analysis, 1291, 1293, 1311 Harris-Kaiser method, 1291, 1318 hat matrix, 3899 LOGISTIC procedure, 2359 hazard function baseline (PHREG), 3215, 3216 cumulative (PHREG), 3262 definition (PHREG), 3240 discrete (PHREG), 3228, 3240 LIFETEST procedure, 2149, 2214 PHREG procedure, 3215 rate (PHREG), 3284 ratio (PHREG), 3218, 3220 hazards ratio
confidence interval (PHREG), 3233 confidence limits (PHREG), 3233, 3247, 3269 confidence limits (TPHREG), 4487 estimate (PHREG), 3247, 3269, 3284 estimate (TPHREG), 4487 PHREG procedure, 3218 Hertzsprung-Russell Plot, example MODECLUS procedure, 2923 Hessian matrix CALIS procedure, 589, 621, 622, 634, 647, 665 GENMOD procedure, 1655 LOGISTIC procedure, 2317, 2343 MIXED procedure, 2674, 2675, 2680, 2707, 2749, 2750, 2774, 2775, 2786, 2796 NLMIXED procedure, 3066 SURVEYLOGISTIC procedure, 4264, 4287 Hessian scaling NLMIXED procedure, 3093 heterogeneity example (MIXED), 2792 MIXED procedure, 2714, 2717 heterogeneous AR(1) structure (MIXED), 2721 compound-symmetry structure (MIXED), 2721 covariance structures (MIXED), 2730 Toeplitz structure (MIXED), 2721 heteroscedasticity testing (REG), 3910 Heywood cases FACTOR procedure, 1333 hierarchical clustering, 967, 980, 4801 hierarchical design generating with PLAN procedure, 3353 hierarchical model example (MIXED), 2810 hierarchy LOGISTIC procedure, 2310 TPHREG procedure, 4475 histograms plots, ODS Graphics, 358 Hochberg adjustment (MULTTEST), 2959 Hochberg (MULTTEST) p-value adjustments, 2959 Hochberg’s GT2 multiple-comparison method, 444, 1769, 1811 Hommel adjustment (MULTTEST), 2959 Hommel (MULTTEST) p-value adjustments, 2959 homogeneity analysis CORRESP procedure, 1069 homogeneity of variance tests, 443, 1767, 1818 Bartlett’s test (ANOVA), 443 Bartlett’s test (GLM), 1767, 1819 Brown and Forsythe’s test (ANOVA), 443 Brown and Forsythe’s test (GLM), 1767, 1819 DISCRIM procedure, 1150 examples, 1893
Levene's test (ANOVA), 443 Levene's test (GLM), 1767, 1819 O'Brien's test (ANOVA), 443 O'Brien's test (GLM), 1767 Welch's ANOVA, 1819 homogeneity tests LIFETEST procedure, 2149, 2154, 2178, 2200 homotopy parameter FASTCLUS procedure, 1390 honestly significant difference test, 445, 1769, 1811, 1812 Hosmer-Lemeshow test LOGISTIC procedure, 2312, 2356 test statistic (LOGISTIC), 2357 Hotelling-Lawley trace, 437, 1759, 1828 Hotelling-Lawley-McKeon statistic MIXED procedure, 2717 Hotelling-Lawley-Pillai-Samson statistic MIXED procedure, 2718 Howe's solution, 1297 HSD test, 445, 1769, 1811, 1812 Hsu's adjustment GLM procedure, 1754 MIXED procedure, 2688 HTML destination ODS Graphics, 326, 335 HTML output examples, ODS Graphics, 321, 324, 330, 352 ODS Graphics, 326, 334, 336 HTML output, with tool tips examples, ODS Graphics, 354 Huynh-Feldt epsilon (GLM), 1829 structure (GLM), 1829 structure (MIXED), 2721 HYBRID option and FREQ statement (CLUSTER), 974 and other options (CLUSTER), 970, 972 PROC CLUSTER statement, 978 hypergeometric distribution (MULTTEST), 2954 variance (MULTTEST), 2949 hypothesis tests comparing adjusted means (GLM), 1758 contrasts (CATMOD), 831 contrasts (GLM), 1749 contrasts, examples (GLM), 1850, 1865, 1875 custom tests (ANOVA), 450 customized (GLM), 1781 exact (FREQ), 1443 for intercept (ANOVA), 446 for intercept (GLM), 1771 GLM procedure, 1792 incorrect hypothesis (CATMOD), 888 lack of fit (RSREG), 4046 MANOVA (GLM), 1824 mixed model (MIXED), 2742, 2751 multivariate (REG), 3910 nested design (NESTED), 2990
parametric, comparing means (TTEST), 4775, 4789 parametric, comparing variances (TTEST), 4775, 4789 random effects (GLM), 1777, 1833 REG procedure, 3832, 3858 repeated measures (GLM), 1827 TRANSREG procedure, 4615 Type I sum of squares (GLM), 1794 Type II sum of squares (GLM), 1796 Type III sum of squares (GLM), 1797 Type IV sum of squares (GLM), 1797
I id variables TRANSREG procedure, 4557 ideal point model TRANSREG procedure, 4593 ideal points TRANSREG procedure, 4717 identification variables, 3077 identity transformation PRINQUAL procedure, 3663 TRANSREG procedure, 4564 ill-conditioned data ORTHOREG procedure, 3197 image component analysis, 1291, 1293, 1311 image files, see graphics image files implicit intercept TRANSREG procedure, 4605 imputation methods MI procedure, 2539 imputation model MI procedure, 2565 imputation of missing values FASTCLUS procedure, 1391 imputer’s model MI procedure, 2563 INBREED procedure coancestry, computing, 1978 coefficient of relationship, computing, 1977 covariance coefficients, 1967, 1969, 1971, 1973, 1975, 1977 covariance coefficients matrix, output, 1973 first parent, 1975 full sibs mating, 1981 generation number, 1974 generation variable, 1974 generation, nonoverlapping, 1967, 1970, 1971 generation, overlapping, 1967, 1969 inbreeding coefficients, 1967, 1969, 1973, 1975, 1978 inbreeding coefficients matrix, output, 1973 individuals, outputting coefficients, 1973 individuals, specifying, 1971, 1975 initial covariance value, 1976 initial covariance value, assigning, 1973 initial covariance value, specifying, 1969
4932
Subject Index
kinship coefficient, 1977 last generation’s coefficients, output, 1973 mating, offspring and parent, 1981 mating, self, 1980 matings, output, 1975 monoecious population analysis, example, 1985 offspring, 1973, 1980 ordering observations, 1968 OUTCOV= data set, 1974, 1982 output table names, 1984 panels, 1982, 1989 pedigree analysis, 1967, 1968 pedigree analysis, example, 1987, 1989 population, monoecious, 1985 population, multiparous, 1973, 1977 population, nonoverlapping, 1974 population, overlapping, 1968, 1969, 1979 progeny, 1976, 1978, 1980, 1988 second parent, 1975 selective matings, output, 1975 specifying gender, 1971 theoretical correlation, 1977 unknown or missing parents, 1982 variables, unaddressed, 1976 incomplete block design generating with PLAN procedure, 3354, 3357 incomplete principal components REG procedure, 3818, 3828 independent variable defined (ANOVA), 423 individual difference models MDS procedure, 2471 INDSCAL model MDS procedure, 2471, 2477 inertia, definition CORRESP procedure, 1070 INEST= data sets LIFEREG procedure, 2121 ROBUSTREG procedure, 4011 inference mixed model (MIXED), 2741 space, mixed model (MIXED), 2681, 2682, 2685, 2781 infinite likelihood MIXED procedure, 2717, 2773, 2775 infinite parameter estimates LOGISTIC procedure, 2313, 2338 SURVEYLOGISTIC procedure, 4263, 4277 Influence diagnostics details MIXED procedure, 2765 influence diagnostics, details MIXED procedure, 2763 Influence plots MIXED procedure, 2760 influence statistics REG procedure, 3898 information criteria MIXED procedure, 2676 information matrix, 3756
expected (GENMOD), 1656 LIFEREG procedure, 2083, 2084, 2108 observed (GENMOD), 1656 initial covariance value assigning (INBREED), 1973 INBREED procedure, 1976 specifying (INBREED), 1969 initial estimates ACECLUS procedure, 404 LIFEREG procedure, 2108 initial seed SURVEYSELECT procedure, 4441 initial seeds FASTCLUS procedure, 1380, 1381, 1394 initial values CALIS procedure, 550, 588, 590, 595, 597, 602, 661 GENMOD procedure, 1638, 1647 LOGISTIC procedure, 2376 MDS procedure, 2480–2484, 2491 MIXED procedure, 2706 SURVEYLOGISTIC procedure, 4280 initialization random (PRINQUAL), 3673 TRANSREG procedure, 4602 input data set MI procedure, 2519, 2526, 2558 input data sets MIANALYZE procedure, 2620 inset LIFEREG procedure, 2092 PROBIT procedure, 3723 insets background color, 513, 515 background color of header, 513, 515 drop shadow color, 513 frame color, 513, 516 header text color, 513, 516 header text, specifying, 514, 516 positioning, details, 526–529 positioning, options, 513, 514, 516 suppressing frame, 514, 516 text color, 513, 516 instantaneous failure rate PHREG procedure, 3240 integral approximations NLMIXED procedure, 3070, 3084 intensity model, See Andersen-Gill model interaction effects MIXED procedure, 2744 model parameterization (GLM), 1788 quantitative (TRANSREG), 4594 specifying (ANOVA), 451, 452 specifying (CATMOD), 864 specifying (GLM), 1784 TRANSREG procedure, 4558, 4594 intercept GENMOD procedure, 1615, 1617, 1640
hypothesis tests for (ANOVA), 446 hypothesis tests for (GLM), 1771 MIXED procedure, 2743 model parameterization (GLM), 1787 no intercept (TRANSREG), 4577 Internal studentization, 2763 interpretation factor rotation, 1293 interpreting factors, elements to consider, 1294 interpreting output VARCLUS procedure, 4818 interval determination LIFETEST procedure, 2174 interval level of measurement DISTANCE procedure, 1250 interval variable, 72 intraclass correlation coefficient MIXED procedure, 2791 inverse confidence limits PROBIT procedure, 3712, 3761 inverse Gaussian distribution GENMOD procedure, 1652 inverse matrix of X′X SURVEYREG procedure, 4391 IPC analysis REG procedure, 3818, 3828, 3916 ipp plots annotating, 3729 axes, color, 3729 font, specifying, 3729 options summarized by function, 3726 reference lines, options, 3729–3733 threshold lines, options, 3732 ippplot PROBIT procedure, 3725 iterated factor analysis, 1291 iteration history NLMIXED procedure, 3067, 3104 iterations history (MIXED), 2749 history (PHREG), 3233, 3269 history (TPHREG), 4486 PRINQUAL procedure, 3667 restarting (PRINQUAL), 3656, 3673 restarting (TRANSREG), 4602 iterative proportional fitting estimation (CATMOD), 843 formulas (CATMOD), 896
J Jaccard dissimilarity coefficient DISTANCE procedure, 1276 Jaccard similarity coefficient DISTANCE procedure, 1276 joint selection probabilities SURVEYSELECT procedure, 4433 Jonckheere-Terpstra test, 1491 Journal style examples, ODS Graphics, 358
ODS styles, 332, 333, 346
K k-means clustering, 1379, 1380 k-sample tests, See homogeneity tests k-th-nearest neighbor, See also single linkage See also density linkage estimation (CLUSTER), 970, 972, 977 K= option and other options (CLUSTER), 969, 972 Kaplan-Meier estimate, See product-limit estimate kappa coefficient, 1495, 1496 tests, 1498 weights, 1497 KDE procedure bandwidth selection, 2008 binning, 2004 bivariate histogram, 2011 computational details, 2002 convolution, 2005 examples, 2012 fast Fourier transform, 2007 ODS graph names, 2010 options, 1996 output table names, 2009 Kendall’s tau-b statistic, 1474, 1477 Kenward-Roger method MIXED procedure, 2695 kernel density estimates DISCRIM procedure, 1158, 1159, 1191, 1212 KDE procedure, 1993 keyword-lists POWER procedure, 3490 Klotz scores NPAR1WAY procedure, 3168 knots PRINQUAL procedure, 3665, 3666 TRANSREG procedure, 4567, 4568, 4571, 4613, 4678 Kolmogorov-Smirnov test NPAR1WAY procedure, 3169 KRIGE2D procedure anisotropic models, 2053–2056 best linear unbiased prediction (BLUP), 2060 correlation range, 2034 discontinuity, 2051 effective range, 2047–2049 examples, 2062 exponential semivariogram model, 2048, 2049 Gaussian semivariogram model, 2047, 2048 geometric anisotropy, 2053–2055 global kriging, 2034 input data set, 2038 kriging with trend, 2058 local kriging, 2033, 2034 nested models, 2050, 2051
nugget effect, 2044, 2051, 2052 ordinary kriging, 2033, 2056–2060 OUTEST= data sets, 2060 OUTNBHD= data set, 2060, 2061 output data sets, 2039, 2060, 2061 power semivariogram model, 2049 range , 2047 sill, 2047 spatial continuity, 2033 spatial covariance, 2034 spatial data, 2056 spatial random fields, 2057 spherical semivariogram model, 2046, 2047 visual fit of the variogram, 2045 zonal anisotropy, 2055 kriging ordinary kriging (KRIGE2D), 2058 ordinary kriging (VARIOGRAM), 4851 with trend (KRIGE2D), 2058 kriging, ordinary VARIOGRAM procedure, 4852 Kronecker product structure MIXED procedure, 2721 Kruskal-Wallis test NPAR1WAY procedure, 3166 Kuiper test NPAR1WAY procedure, 3171 Kulcynski 1 coefficient DISTANCE procedure, 1276 kurtosis CALIS procedure, 549, 584, 588, 658, 660 displayed in CLUSTER procedure, 972
L L matrices MIXED procedure, 2681, 2687, 2742 label collision avoidance, 351 ODS Graphics, 351 lack of fit tests, 3712, 3759 RSREG procedure, 4046 lag functionality NLMIXED procedure, 3081 Lagrange multiplier covariance matrix, 3030 NLMIXED procedure, 3093 statistics (GENMOD), 1668 test statistics (LIFEREG), 2110 test, modification indices (CALIS), 584, 673, 674 lambda asymmetric, 1474, 1482 lambda symmetric, 1474, 1483 Lance-Williams flexible-beta method, See flexible-beta method Lance-Williams nonmetric coefficient DISTANCE procedure, 1272 latent variables CALIS procedure, 549, 601, 625 PLS procedure, 3367 latent vectors
PLS procedure, 3367 LATEX destination ODS Graphics, 326, 335 LaTeX output examples, ODS Graphics, 358 ODS Graphics, 334, 337 Latin square design ANOVA procedure, 472 generating with PLAN procedure, 3356 lattice design balanced square lattice (LATTICE), 2069 efficiency (LATTICE), 2071, 2075, 2078 partially balanced square lattice (LATTICE), 2069, 2076 rectangular lattice (LATTICE), 2069 lattice layouts ODS Graphics, 381 LATTICE procedure adjusted treatment means, 2075 ANOVA table, 2074 Block variable, 2069, 2072, 2073 compared to MIXED procedure, 2665 covariance, 2075 Group variable, 2069, 2072, 2073 lattice design efficiency, 2071, 2075 least significant differences, 2075 missing values, 2074 ODS table names, 2075 Rep variable, 2069, 2072, 2073 response variable, 2072 Treatmnt variable, 2069, 2072, 2073 variance of means, 2074 layout area ODS Graphics, 342 LD statistic PHREG procedure, 3234, 3260 leader algorithm, 1380 least significant differences LATTICE procedure, 2075 least-significant-difference test, 445, 1769 least-squares estimation LIFEREG procedure, 2108 least-squares means Bonferroni adjustment (GLM), 1754 Bonferroni adjustment (MIXED), 2688 BYLEVEL processing (MIXED), 2689 coefficient adjustment, 1822 compared to means (GLM), 1804 comparison types (GLM), 1757 comparison types (MIXED), 2690 construction of, 1820 covariate values (GLM), 1755 covariate values (MIXED), 2688 Dunnett’s adjustment (GLM), 1754 Dunnett’s adjustment (MIXED), 2688 examples (GLM), 1859, 1866 examples (MIXED), 2796, 2819 GLM procedure, 1753 Hsu’s adjustment (GLM), 1754
Hsu's adjustment (MIXED), 2688 mixed model (MIXED), 2687 multiple comparisons adjustment (GLM), 1754, 1757 multiple comparisons adjustment (MIXED), 2687 nonstandard weights (GLM), 1756 nonstandard weights (MIXED), 2690 observed margins (GLM), 1756 observed margins (MIXED), 2690 Sidak's adjustment (GLM), 1754 Sidak's adjustment (MIXED), 2688 simple effects (GLM), 1758, 1817 simple effects (MIXED), 2691 simulation-based adjustment (GLM), 1754 simulation-based adjustment (MIXED), 2688 Tukey's adjustment (GLM), 1754 Tukey's adjustment (MIXED), 2688 Lee-Wei-Amato model PHREG procedure, 3251, 3314 left truncation time PHREG procedure, 3229, 3263 less-than-full-rank model TRANSREG procedure, 4569, 4672 level of measurement MDS procedure, 2472, 2481 levels of measurement DISTANCE procedure, 1249 levels, of class variable, 1784 Levenberg-Marquardt algorithm CALIS procedure, 578, 581, 665 Levene's test for homogeneity of variance ANOVA procedure, 443 GLM procedure, 1767, 1819, 1893 Leverage MIXED Procedure, 2767 leverage, 1774 TRANSREG procedure, 4587 life data GENMOD procedure, 1701 life-table estimate LIFETEST procedure, 2149, 2186, 2211 lifereg analysis insets, 2092 LIFEREG procedure, 2083 accelerated failure time models, 2083 censoring, 2094 computational details, 2108 computational resources, 2123 Confidence intervals, 2115 failure time, 2083 INEST= data sets, 2121 information matrix, 2083, 2084, 2108 initial estimates, 2108 inset, 2092 Lagrange multiplier test statistics, 2110 least-squares estimation, 2108 log-likelihood function, 2084, 2108 log-likelihood ratio tests, 2084
main effects, 2108 maximum likelihood estimates, 2083 missing values, 2108 Newton-Raphson algorithm, 2083 OUTEST= data sets, 2121 output table names, 2124 predicted values, 2114 supported distributions, 2111 survival function, 2083, 2111 Tobit model, 2085, 2129 XDATA= data sets, 2122 LIFETEST procedure arcsine-square root transform, 2205 arcsine-square root transformation, 2169, 2177 association tests, 2150, 2156, 2180, 2191, 2200 censored, 2186 computational formulas, 2171 confidence bands, 2169, 2176 confidence limits, 2174, 2183, 2184 cumulative distribution function, 2149 effective sample size, 2172, 2173 equal precision bands, 2178, 2205 Fleming-Harrington Gρ test for homogeneity, 2150, 2168 Hall-Wellner bands, 2177, 2205 hazard function, 2149, 2214 homogeneity tests, 2149, 2154, 2178, 2200 interval determination, 2174 life-table estimate, 2149, 2172, 2186, 2209, 2211 likelihood ratio test for homogeneity, 2150, 2179 lineprinter plots, 2159 log-log transformation, 2170, 2177 log-rank test for association, 2150, 2180 log-rank test for homogeneity, 2150, 2168, 2178 logarithmic transformation, 2170, 2177 logit transformation, 2170, 2177 median residual time, 2186 missing stratum values, 2166, 2167 missing values, 2171 modified Peto-Peto test for homogeneity, 2150, 2168 ODS graph names, 2191 ODS table names, 2188 output data sets, 2183 partial listing, 2165 Peto-Peto test for homogeneity, 2150, 2168 probability density function, 2149, 2214 product-limit estimate, 2149, 2171, 2185, 2191 stratified tests, 2150, 2155, 2156, 2158, 2167, 2180, 2187 survival distribution function, 2149, 2171 Tarone-Ware test for homogeneity, 2150, 2168 traditional high-resolution graphics, 2159 trend tests, 2150, 2168, 2179, 2187 Wilcoxon test for association, 2150, 2180 Wilcoxon test for homogeneity, 2150, 2168, 2178 likelihood displacement
PHREG procedure, 3234, 3260 Likelihood distance MIXED procedure, 2770 likelihood function, 3756 likelihood ratio chi-square test, 3756, 3759 likelihood ratio test, 2780 Bartlett’s modification, 1150 CALIS procedure, 653, 674 example (MIXED), 2794 mixed model (MIXED), 2741, 2743 MIXED procedure, 2751 PHREG procedure, 3246, 3269 TPHREG procedure, 4486 likelihood ratio test for homogeneity LIFETEST procedure, 2150 likelihood residuals GENMOD procedure, 1670 likelihood-ratio chi-square test, 1469 power and sample size (POWER), 3457, 3463, 3525 likelihood-ratio test chi-square (FREQ), 1471 line colors, modifying examples, ODS Graphics, 369 line patterns, modifying examples, ODS Graphics, 369 line printer plots REG procedure, 3848, 3882 line thickness, modifying examples, ODS Graphics, 376 line-search methods NLMIXED procedure, 3066, 3067, 3096 linear discriminant function, 1139 linear equations model, See LINEQS model linear hypotheses PHREG procedure, 3217, 3238, 3247 linear model GENMOD procedure, 1612, 1613 linear models CATMOD procedure, 814 compared with log-linear models, 817 linear predictor GENMOD procedure, 1611, 1612, 1618, 1661, 1689 PHREG procedure, 3225, 3233, 3235, 3302, 3303 linear rank tests, See association tests linear regression TRANSREG procedure, 4592 linear structural relationship model, See LISREL model linear structure MIXED procedure, 2721 linear transformation PRINQUAL procedure, 3662 TRANSREG procedure, 4563, 4610 LINEQS model
CALIS procedure, 553, 601 specification, 560 structural model example (CALIS), 558, 562 link function built-in (GENMOD), 1614, 1639 GENMOD procedure, 1611, 1612, 1653 LOGISTIC procedure, 2281, 2312, 2334, 2344 SURVEYLOGISTIC procedure, 4243, 4263, 4273 user-defined (GENMOD), 1634 LISREL model CALIS procedure, 554 structural model example (CALIS), 559 LMAX statistic PHREG procedure, 3260 local influence DFBETA statistics (PHREG), 3234, 3260 score residuals (PHREG), 3234, 3259 weighted score residuals (PHREG), 3260 local kriging KRIGE2D procedure, 2034 LOESS procedure approximate degrees of freedom, 2246 automatic smoothing parameter selection, 2243 data scaling, 2239 direct fitting method, 2240 graphics, 2250 introductory example, 2220 iterative reweighting, 2242 kd trees and blending, 2241 local polynomials, 2242 local weighting, 2242 missing values, 2238 ODS graph names, 2250 output data sets, 2238 output table names, 2248 scoring data sets, 2248 statistical inference, 2243 log likelihood output data sets (LOGISTIC), 2294 log odds LOGISTIC procedure, 2347 SURVEYLOGISTIC procedure, 4289 log-interval level of measurement DISTANCE procedure, 1250 log-likelihood functions (GENMOD), 1654 log-likelihood function LIFEREG procedure, 2084, 2108 PROBIT procedure, 3756 log-likelihood ratio tests LIFEREG procedure, 2084 log-linear models CATMOD procedure, 814, 870, 1616 compared with linear models, 817 design matrix (CATMOD), 884 examples (CATMOD), 916, 919 GENMOD procedure, 1616 multiple populations (CATMOD), 872
one population (CATMOD), 871 log-linear variance model MIXED procedure, 2718 log-log transformation LIFETEST procedure, 2170, 2177 log-rank test PHREG procedure, 3219 log-rank test for association LIFETEST procedure, 2150 log-rank test for homogeneity LIFETEST procedure, 2150, 2168, 2178 power and sample size (POWER), 3473, 3481, 3533, 3561 logarithmic transformation LIFETEST procedure, 2170, 2177 logistic analysis CATMOD procedure, 815, 868 caution (CATMOD), 870 examples (CATMOD), 933 ordinal data, 815 logistic distribution, 2083, 2097, 2111, 3705 PROBIT procedure, 3757 LOGISTIC procedure Akaike's information criterion, 2341 Bayes' theorem, 2314 best subset selection, 2308 branch and bound algorithm, 2341 classification table, 2314, 2352, 2353, 2422 conditional logistic regression, 2365 confidence intervals, 2314, 2315, 2319, 2345, 2346 confidence limits, 2350 convergence criterion, 2308 customized odds ratio, 2328 descriptive statistics, 2294 deviance, 2308, 2316, 2354 DFBETAS diagnostic, 2360 dispersion parameter, 2354 displayed output, 2381 estimability checking, 2300 exact logistic regression, 2300, 2369 existence of MLEs, 2338 Fisher's scoring method, 2317, 2318, 2336 goodness of fit, 2308, 2316 gradient, 2343 hat matrix, 2359 Hessian matrix, 2317, 2343 hierarchy, 2310 Hosmer-Lemeshow test, 2312, 2356, 2357 infinite parameter estimates, 2313 initial values, 2376 introductory example, 2284 link function, 2281, 2312, 2334, 2344 log odds, 2347 maximum likelihood algorithms, 2336 missing values, 2329 model fitting criteria, 2341 model hierarchy, 2283, 2310 model selection, 2306, 2317, 2340
multiple classifications, 2315 Newton-Raphson algorithm, 2317, 2318, 2336, 2338 odds ratio confidence limits, 2308, 2315 odds ratio estimation, 2347 ODS graph names, 2390 ODS table names, 2386 output data sets, 2294, 2374, 2376, 2377 output ROC data sets, 2378 overdispersion, 2316, 2354, 2355 Pearson’s chi-square, 2308, 2316, 2354 predicted probabilities, 2350 prior event probability, 2314, 2353, 2422 profile likelihood convergence criterion, 2314 rank correlation, 2350 regression diagnostics, 2359 residuals, 2360 response level ordering, 2305, 2329, 2330 reverse response level ordering, 2290 ROC curve, 2314, 2357 Schwarz criterion, 2341 score statistics, 2343 selection methods, 2306, 2317, 2340 singular contrast matrix, 2300 subpopulation, 2308, 2316, 2355 testing linear hypotheses, 2327, 2358 Williams’ method, 2355 logistic regression, 3705, See also LOGISTIC procedure See also SURVEYLOGISTIC procedure CATMOD procedure, 814, 869 examples (CATMOD), 911 GENMOD procedure, 1613, 1697 logistic regression method MI procedure, 2546 logit transformation LIFETEST procedure, 2170, 2177 logits, See adjacent-category logits See also cumulative logits See also generalized logits loglogistic distribution, 2083, 2097, 2111 lognormal data power and sample size (POWER), 3434, 3437, 3438, 3450, 3455, 3456, 3465, 3472, 3509, 3511, 3519, 3521, 3529, 3531, 3552 lognormal distribution, 2083, 2097, 2111 long run times NLMIXED procedure, 3098 Longley data set, 3197 Lp clustering FASTCLUS procedure, 1379 Lp clustering FASTCLUS procedure, 1391 lpred plots annotating, 3737 axes, color, 3737 font, specifying, 3737 reference lines, options, 3737–3739, 3741
threshold lines, options, 3740 lpredplot PROBIT procedure, 3733 LR statistics MI procedure, 2555 LS-means, See least-squares means lsd (least significant differences) LATTICE procedure, 2075 LSD test, 1769
M MAC method PRINQUAL procedure, 3643, 3669 macros TRANSREG procedure, 4588 Mahalanobis distance, 791, 1158 CANDISC procedure, 804 main effects design matrix (CATMOD), 877 GENMOD procedure, 1660 LIFEREG procedure, 2108 MIXED procedure, 2744 model parameterization (GLM), 1788 specifying (ANOVA), 451, 452 specifying (CATMOD), 864 specifying (GLM), 1784 TRANSREG procedure, 4558, 4594 Mallows’ Cp selection REG procedure, 3875 manifest variables CALIS procedure, 549 Mann-Whitney-Wilcoxon test NPAR1WAY procedure, 3166 MANOVA, See multivariate analysis of variance CANDISC procedure, 786 Mantel-Haenszel chi-square test, 1469, 1472 Mantel-Haenszel test log-rank test (PHREG), 3219 MAR MI procedure, 2537, 2564 marginal probabilities, See also response functions specifying in CATMOD procedure, 853 Marginal residuals MIXED procedure, 2764 marker symbols, modifying examples, ODS Graphics, 369 martingale residuals PHREG procedure, 3234, 3258, 3302 matched comparisons, See paired comparisons Matern covariance structure MIXED procedure, 2721 mating offspring and parent (INBREED), 1981 self (INBREED), 1980 matrix
decompositions (CORRESP), 1079, 1100 factor, defined for factor analysis (FACTOR), 1292 inversion (CALIS), 647 multiplication (SCORE), 4065 names, default (CALIS), 608 notation, theory (MIXED), 2731 properties, COSAN model (CALIS), 592 matrix properties COSAN model (CALIS), 592 MATRIX statement (CALIS) factor analysis model, 595 maximum average correlation method PRINQUAL procedure, 3643, 3669 maximum likelihood algorithms (LOGISTIC), 2336 algorithms (SURVEYLOGISTIC), 4275 estimates (LIFEREG), 2083 estimates (LOGISTIC), 2338 estimates (SURVEYLOGISTIC), 4277 estimation (CATMOD), 817, 843, 895 estimation (GENMOD), 1655 hierarchical clustering (CLUSTER), 967, 971, 980, 981 NLMIXED procedure, 3048 maximum likelihood estimation mixed model (MIXED), 2738 maximum likelihood factor analysis, 1291, 1311 with FACTOR procedure, 1297, 1298 maximum method, See complete linkage maximum total variance method PRINQUAL procedure, 3643 MCAR MI procedure, 2537 MCF PHREG procedure, 3225 MCMC method MI procedure, 2547 MCMC monotone-data imputation MI procedure, 2565 McNemar’s test, 1493, 1494 power and sample size (POWER), 3443, 3447, 3448, 3516, 3517 McQuitty’s similarity analysis CLUSTER procedure, 967 MDFFITS MIXED procedure, 2768 MDFFITS for covariance parameters MIXED procedure, 2769 MDPREF analysis PRINQUAL procedure, 3678 MDS procedure alternating least squares, 2476 analyzing data in groups, 2485 asymmetric data, 2484 badness of fit, 2479, 2482, 2483, 2489, 2490 conditional data, 2477 configuration, 2471, 2482, 2483, 2489
convergence criterion, 2478, 2480, 2481 coordinates, 2482, 2483, 2488, 2489 data weights, 2487 dimension coefficients, 2471, 2472, 2477, 2482, 2483, 2488, 2489 dissimilarity data, 2471, 2478, 2484 distance data, 2471, 2478, 2484 Euclidean distances, 2472, 2477, 2488 external unfolding, 2471 individual difference models, 2471 INDSCAL model, 2471, 2477 initial values, 2480–2484, 2491 measurement level, 2472, 2481 metric multidimensional scaling, 2471 missing values, 2492 multidimensional scaling, 2471 nonmetric multidimensional scaling, 2471, 2472 normalization of the estimates, 2492 optimal transformations, 2472, 2481 output table names, 2497 partitions, 2477, 2488 plot of configuration, 2504, 2505 plot of dimension coefficients, 2504, 2506 plot of linear fit, 2503 %PLOTIT macro, 2474 plots, 2474 proximity data, 2471, 2478, 2484 residuals, 2483, 2487, 2488, 2491, 2503 similarity data, 2471, 2478, 2484 stress formula, 2479, 2480, 2489 subject weights, 2471, 2477 three-way multidimensional scaling, 2471 ties, 2485 transformations, 2472, 2481, 2482, 2484, 2487–2489 transformed data, 2491 transformed distances, 2491 unfolding, 2471 weighted Euclidean distance, 2472, 2477, 2488 weighted Euclidean model, 2471, 2477 weighted least squares, 2487 mean function PHREG procedure, 3224, 3226, 3244, 3245, 3253, 3255 mean per element SURVEYMEANS procedure, 4337 mean separation tests, See multiple comparison procedures mean survival time time limit (LIFETEST), 2164 MEAN= data sets FASTCLUS procedure, 1393, 1394 means ANOVA procedure, 440 compared to least-squares means (GLM), 1804 displayed in CLUSTER procedure, 972 GLM procedure, 1763 power and sample size (POWER), 3432, 3448, 3463, 3471, 3526
SURVEYMEANS procedure, 4337 weighted (GLM), 1820 MEANS procedure, 21 means, difference between independent samples, 4775, 4789 paired observations, 4775 measurement level MDS procedure, 2472, 2481 measures of agreement, 1493 median cluster, 1389, 1392 method (CLUSTER), 967, 981 median residual time LIFETEST procedure, 2186 median scores NPAR1WAY procedure, 3167 Medical Expenditure Panel Survey (MEPS) SURVEYLOGISTIC procedure, 4302 Mehta and Patel, network algorithm, 1509, 3172 memory requirements ACECLUS procedure, 410 CLUSTER procedure, 986 FACTOR procedure, 1335 FASTCLUS procedure, 1402 MIXED procedure, 2775 reduction of (ANOVA), 434 reduction of (GLM), 1747 VARCLUS procedure, 4818 memory usage SIM2D procedure, 4110 Merle-Spath algorithm FASTCLUS procedure, 1392 METHOD= specification PROC CLUSTER statement, 966 methods of estimation VARCOMP procedure, 4831, 4842 metric multidimensional scaling MDS procedure, 2471 MGV method PRINQUAL procedure, 3643 MI procedure adjusted degrees of freedom, 2562 analyst’s model, 2563 approximate Bayesian bootstrap, 2543 arbitrary missing pattern, 2539 autocorrelation function plot, 2557 Bayes’ theorem, 2547 Bayesian inference, 2547 between-imputation variance, 2561 bootstrap, 2527 combining inferences, 2561 converge in EM algorithm, 2522 convergence in EM algorithm, 2527 convergence in MCMC, 2555, 2566 degrees of freedom, 2561 discriminant function method, 2544 EM algorithm, 2536, 2566 fraction of missing information, 2562 imputation methods, 2539
imputation model, 2565 imputer’s model, 2563 input data set, 2519, 2526, 2558 introductory example, 2513 logistic regression method, 2546 LR statistics, 2555 MAR, 2537, 2564 MCAR, 2537 MCMC method, 2547 MCMC monotone-data imputation, 2565 missing at random, 2537, 2564 monotone missing pattern, 2538 multiple imputation efficiency, 2562 multivariate normality assumption, 2565 number of imputations, 2565 ODS graph names, 2567 ODS table names, 2566 output data sets, 2520, 2528, 2559 output parameter estimates, 2528 parameter simulation, 2564 predictive mean matching method, 2542 producing monotone missingness, 2552 propensity score method, 2543, 2565 random number generators, 2520 regression method, 2541, 2565 relative efficiency, 2562 relative increase in variance, 2562 saving graphics output, 2526 singularity, 2521 Summary of Issues in Multiple Imputation, 2564 suppressing output, 2520 syntax, 2517 time-series plot, 2556 total variance, 2561 transformation, 2533 within-imputation variance, 2561 worst linear function of parameters, 2556 MI procedure, EM statement output data sets, 2523 MIANALYZE procedure adjusted degrees of freedom, 2625 average relative increase in variance, 2627 between-imputation covariance matrix, 2626 between-imputation variance, 2624 combining inferences, 2624 degrees of freedom, 2625, 2627 fraction of missing information, 2625 input data sets, 2620 introductory example, 2610 multiple imputation efficiency, 2626 multivariate inferences, 2626 ODS table names, 2631 relative efficiency, 2626 relative increase in variance, 2625 syntax, 2613 testing linear hypotheses, 2618, 2628 total covariance matrix, 2627 total variance, 2625 within-imputation covariance matrix, 2626
within-imputation variance, 2624 minimum generalized variance method PRINQUAL procedure, 3643 Minkowski metric STDIZE procedure, 4138 Minkowski L(p) distance coefficient DISTANCE procedure, 1272 misclassification probabilities discriminant analysis, 1140 missing at random MI procedure, 2537, 2564 missing level combinations MIXED procedure, 2748 missing stratum values LIFETEST procedure, 2166, 2167 missing values ACECLUS procedure, 409 and interactivity (GLM), 1787 CANCORR procedure, 765 character (PRINQUAL), 3662 CLUSTER procedure, 987 DISTANCE procedure, 1261, 1262, 1267, 1276 FASTCLUS procedure, 1380, 1381, 1391, 1393, 1397 LIFEREG procedure, 2108 LIFETEST procedure, 2171 LOGISTIC procedure, 2329 MDS procedure, 2492 MODECLUS procedure, 2883 MULTTEST procedure, 2960 NPAR1WAY procedure, 3162 PHREG procedure, 3228, 3236, 3286 PRINCOMP procedure, 3608 PRINQUAL procedure, 3655, 3667, 3674 PROBIT procedure, 3755 SCORE procedure, 4074 STDIZE procedure, 4131, 4133 strata variables (PHREG), 3237 SURVEYFREQ procedure, 4205 SURVEYLOGISTIC procedure, 4251, 4268 SURVEYMEANS procedure, 4323, 4333, 4358 SURVEYREG procedure, 4382 SURVEYSELECT procedure, 4445 TRANSREG procedure, 4578, 4599, 4600, 4605 TREE procedure, 4756 VARCOMP procedure, 4837 mixed model unbalanced (GLM), 1882 VARCOMP procedure, 4837 mixed model (MIXED), See also MIXED procedure estimation, 2737 formulation, 2732 hypothesis tests, 2742, 2751 inference, 2741 inference space, 2681, 2682, 2685, 2781 least-squares means, 2687 likelihood ratio test, 2741, 2743 linear model, 2661
maximum likelihood estimation, 2738 notation, 2663 objective function, 2749 parameterization, 2743 predicted values, 2687 restricted maximum likelihood, 2779 theory, 2731 Wald test, 2741, 2786 mixed model equations example (MIXED), 2796 MIXED procedure, 2678, 2739 MIXED Procedure Leverage, 2767 PRESS Residual, 2766 PRESS Statistic, 2766 MIXED procedure, See also mixed model 2D geometric anisotropic structure, 2721 Akaike's information criterion, 2676, 2740, 2750 Akaike's information criterion (finite sample corrected version), 2676, 2750 ante-dependence structure, 2721 ARIMA procedure, compared, 2665 ARMA structure, 2721 assumptions, 2661 asymptotic covariance, 2674 at sign (@) operator, 2745, 2819 AUTOREG procedure, compared, 2665 autoregressive structure, 2721, 2788 banded Toeplitz structure, 2721 bar (|) operator, 2743, 2745, 2819 basic features, 2662 Bayesian analysis, 2708 between-within method, 2693 BLUE, 2740 BLUP, 2740, 2809 Bonferroni adjustment, 2688 boundary constraints, 2707, 2708, 2773 Box plots, 2762 BYLEVEL processing of LSMEANS, 2689 CALIS procedure, compared, 2665 Cholesky root, 2704, 2764, 2772 class level, 2678 classification variables, 2681 compound symmetry structure, 2721, 2733, 2789, 2794 computational details, 2772 computational order, 2773 Conditional residuals, 2764 confidence interval, 2686, 2713 confidence limits, 2674, 2685, 2689, 2692, 2713 containment method, 2693 continuous effects, 2714, 2715, 2717, 2721 continuous-by-class effects, 2746 continuous-nesting-class effects, 2745 contrasted SAS procedures, 1735, 1833, 1885, 2664, 2665, 2985 contrasts, 2681, 2685
convergence criterion, 2674, 2675, 2749, 2775 convergence problems, 2774 Cook’s D, 2768 Cook’s D for covariance parameters, 2768 correlation estimates, 2713, 2716, 2720, 2791 correlations of least-squares means, 2689 covariance parameter estimates, 2674, 2676, 2750 covariance parameter estimates, ratio, 2680 covariance parameters, 2661 covariance structures, 2664, 2721, 2723, 2782 covariances of least-squares means, 2689 covariate values for LSMEANS, 2688 covariates, 2743 COVRATIO, 2770 COVRATIO for covariance parameters, 2770 COVTRACE, 2770 COVTRACE for covariance parameters, 2770 CPU requirements, 2776 crossed effects, 2744 degrees of freedom, 2682, 2684–2687, 2690, 2693, 2705, 2742, 2747, 2751, 2774, 2809 DFFITS, 2768 dimensions, 2677, 2678 direct product structure, 2721 Dunnett’s adjustment, 2688 EBLUPs, 2715, 2740, 2801, 2817 effect name length, 2678 empirical best linear unbiased prediction, 2703 empirical sandwich estimator, 2676 estimability, 2682–2684, 2686, 2687, 2691, 2704, 2705, 2742, 2748 estimable functions, 2702 estimation methods, 2677 factor analytic structures, 2721 Fisher information matrix, 2750, 2796 Fisher’s scoring method, 2674, 2680, 2774 fitting information, 2750, 2751 fixed effects, 2663 fixed-effects parameters, 2661, 2705, 2732 fixed-effects variance matrix, 2705 function evaluations, 2677 G matrix, 2712 general linear structure, 2721 generalized inverse, 2684, 2740 gradient, 2675, 2749 grid search, 2706, 2796 growth curve analysis, 2733 Hannan-Quinn information criterion, 2676 Hessian matrix, 2674, 2675, 2680, 2707, 2749, 2750, 2774, 2775, 2786, 2796 heterogeneity, 2714, 2717, 2792 heterogeneous AR(1) structure, 2721 heterogeneous compound-symmetry structure, 2721 heterogeneous covariance structures, 2730 heterogeneous Toeplitz structure, 2721 hierarchical model, 2810 Hotelling-Lawley-McKeon statistic, 2717
Hotelling-Lawley-Pillai-Sampson statistic, 2718 Hsu’s adjustment, 2688 Huynh-Feldt structure, 2721 infinite likelihood, 2717, 2773, 2775 Influence diagnostics details, 2765 Influence plots, 2760 information criteria, 2676 initial values, 2706 input data sets, 2676 interaction effects, 2744 intercept, 2743 intercept effect, 2703, 2713 intraclass correlation coefficient, 2791 introductory example, 2665 iterations, 2677, 2749 Kenward-Roger method, 2695 Kronecker product structure, 2721 L matrices, 2681, 2687, 2742 LATTICE procedure, compared, 2665 least-square means, 2796 least-squares means, 2690, 2819 Likelihood distance, 2770 likelihood ratio test, 2751 linear structure, 2721 log-linear variance model, 2718 main effects, 2744 MAKE statement in Version 6, 2757 Marginal residuals, 2764 Matern covariance structure, 2721 matrix notation, 2731 MDFFITS, 2768 MDFFITS for covariance parameters, 2769 memory requirements, 2775 missing level combinations, 2748 mixed linear model, 2661 mixed model, 2732 mixed model equations, 2678, 2739, 2796 mixed model theory, 2731 model information, 2678 model selection, 2740 multilevel model, 2810 multiple comparisons of least-squares means, 2687, 2690 multiple tables, 2754 multivariate tests, 2717 nested effects, 2745 nested error structure, 2814 NESTED procedure, compared, 2665 Newton-Raphson algorithm, 2738 non-full-rank parameterization, 2664, 2718, 2747 nonstandard weights for LSMEANS, 2690 nugget effect, 2718 Oblique projector, 2767 observed margins for LSMEANS, 2690 ODS graph names, 2762 ODS Graphics, 2757 ODS table names, 2752 ordering of effects, 2679, 2746
over-parameterization, 2744 parameter constraints, 2707, 2773 parameterization, 2743 Pearson Residual, 2704 pharmaceutical stability, example, 2810 Plots of leave-one-out-estimates, 2760 plotting the likelihood, 2801 polynomial effects, 2743 power-of-the-mean model, 2718 predicted means, 2704 predicted value confidence intervals, 2692 predicted values, 2703, 2796 prior density, 2709 profiling residual variance, 2679, 2708, 2718, 2738, 2772 R matrix, 2716, 2720 random coefficients, 2788, 2810 random effects, 2663, 2712 random-effects parameter, 2715 random-effects parameters, 2662, 2732 regression effects, 2743 rejection sampling, 2711 repeated measures, 2662, 2716, 2782 Residual diagnostics details, 2763 residual method, 2694 Residual plots, 2758 residual variance tolerance, 2705 restricted maximum likelihood (REML), 2662 ridging, 2680, 2738 sandwich estimator, 2676 Satterthwaite method, 2694 Scaled Residual, 2705 Scaled residuals, 2764 Schwarz’s Bayesian information criterion, 2676, 2740, 2750 scoring, 2674, 2680, 2774 Sidak’s adjustment, 2688 simple effects, 2691 simulation-based adjustment, 2688 singularities, 2775 spatial anisotropic exponential structure, 2721 spatial covariance structures, 2722, 2723, 2730, 2774 split-plot design, 2734, 2777 standard linear model, 2663 statement positions, 2672 Studentized Residual, 2704 Studentized residuals, 2768 subject effect, 2683, 2715, 2721, 2776, 2782 summary of commands, 2672 sweep operator, 2768, 2772 table names, 2752 test components, 2702 Toeplitz structure, 2721, 2819 TSCSREG procedure, compared, 2665 Tukey’s adjustment, 2688 Type 1 estimation, 2677 Type 2 estimation, 2677 Type 3 estimation, 2677
Type I testing, 2696 Type II testing, 2696 Type III testing, 2696, 2751 unstructured R matrix, 2720 unstructured correlations, 2721 unstructured covariance matrix, 2721 VARCOMP procedure, example, 2795 variance components, 2662, 2721 variance ratios, 2707, 2714 Wald test, 2750, 2751 weighted LSMEANS, 2690 weighting, 2730 zero design columns, 2696 zero variance component estimates, 2774 MIXED procedure, MODEL statement Influence diagnostics, 2700 ML factor analysis and computer time, 1297 and confidence intervals, 1294, 1297, 1327 and multivariate normal distribution, 1297 and standard errors, 1297 modal clusters density estimation (CLUSTER), 970 modal region, definition, 2878 MODECLUS procedure analyzing data in groups, 2857, 2874 cascaded density estimates, 2873 clustering methods, 2856, 2874 clusters, definition, 2878 clusters, plotting, 2878 compared with other procedures, 2856 cross validated density estimates, 2872 density estimation, 2870 example using GPLOT procedure, 2916, 2923 example using TRACE option, 2927 example using TRANSPOSE procedure, 2912 fixed-radius kernels, 2870 functional summary, 2862 Hertzsprung-Russell Plot, example, 2923 JOIN option, discussion, 2880 modal region, 2878 neighborhood distribution function (NDF), definition, 2878 nonparametric clustering methods, 2855 output data sets, 2883 p-value computation, 2877 plotting samples from univariate distributions, 2889 population clusters, risks of estimating, 2877 saddle test, definition, 2879 scaling variables, 2856 significance tests, 2916 standardizing, 2856 summary of options, 2862 variable-radius kernels, 2870 model fit summary (REG), 3896 fitting criteria (LOGISTIC), 2341 fitting criteria (SURVEYLOGISTIC), 4279
hierarchy (LOGISTIC), 2283, 2310 hierarchy (TPHREG), 4473, 4475 information (MIXED), 2678 parameterization (GLM), 1787 specification (ANOVA), 451 specification (GLM), 1784 specification (NLMIXED), 3077 model assessment, 1718, 1725 PHREG procedure, 3223, 3265, 3271, 3318 model checking, 1718, 1725 model selection entry (PHREG), 3230 examples (REG), 3924 LOGISTIC procedure, 2306, 2317, 2340 MIXED procedure, 2740 PHREG procedure, 3216, 3229, 3264 REG procedure, 3800, 3873, 3876, 3877 removal (PHREG), 3230 modification indices CALIS procedure, 576, 649, 673 constraints (CALIS), 584 displaying (CALIS), 687 Lagrange multiplier test (CALIS), 584, 673, 674 Wald test (CALIS), 584, 674 modified Peto-Peto test for homogeneity LIFETEST procedure, 2150, 2168 modified ridit scores, 1469 monoecious population analysis example (INBREED), 1985 monotone regression function (TRANSREG), 4629 transformations (TRANSREG), 4593 monotone missing pattern MI procedure, 2538 monotonic transformation (PRINQUAL), 3662, 3663 transformation (TRANSREG), 4563, 4564, 4610 transformation, B-spline (PRINQUAL), 3662 transformation, B-spline (TRANSREG), 4563, 4611 Monte Carlo estimation FREQ procedure, 1443, 1445, 1512 NPAR1WAY procedure, 3174 Mood scores NPAR1WAY procedure, 3168 MORALS method TRANSREG procedure, 4576 mortality test MULTTEST procedure, 2952, 2972 MSE SURVEYREG procedure, 4388 MTV method PRINQUAL procedure, 3643 multicollinearity REG procedure, 3895 multidimensional preference analysis PRINQUAL procedure, 3678, 3688 multidimensional scaling
MDS procedure, 2471 metric (MDS), 2471 nonmetric (MDS), 2471, 2472 three-way (MDS), 2471 multilevel model example (MIXED), 2810 multilevel response, 3759 multinomial distribution (GENMOD), 1653 models (GENMOD), 1671 multiple classifications cutpoints (LOGISTIC), 2315 multiple comparison procedures, 1806, See also multiple comparisons of means See also multiple comparisons of least-squares means GLM procedure, 1763 multiple-stage tests, 1814 pairwise (GLM), 1807 recommendations, 1816 with a control (GLM), 1807, 1812 multiple comparisons adjustment (MIXED) least-squares means, 2687 multiple comparisons of least-squares means, See also multiple comparison procedures GLM procedure, 1754, 1757, 1808 interpretation, 1816 MIXED procedure, 2687, 2690 multiple comparisons of means, See also multiple comparison procedures ANOVA procedure, 440 Bonferroni t test, 441, 1765 Duncan’s multiple range test, 442, 1766 Dunnett’s test, 442, 1766, 1767 error mean square, 443, 1767 examples, 1847 Fisher’s LSD test, 445, 1769 Gabriel’s procedure, 443, 1767 GLM procedure, 1806, 1808 GT2 method, 444, 1769 interpretation, 1816 Ryan-Einot-Gabriel-Welsch test, 444, 1768 Scheffé’s procedure, 444, 1769 Sidak’s adjustment, 444, 1769 SMM, 444, 1769 Student-Newman-Keuls test, 444, 1769 Tukey’s studentized range test, 445, 1769 Waller-Duncan method, 443 Waller-Duncan test, 445, 1769 multiple correspondence analysis (MCA) CORRESP procedure, 1076, 1101, 1123 multiple destinations examples, ODS Graphics, 360 multiple imputation efficiency MI procedure, 2562 MIANALYZE procedure, 2626 multiple imputations analysis, 2511, 2609 multiple R-square SURVEYREG procedure, 4387
multiple redundancy coefficients TRANSREG procedure, 4590 multiple regression TRANSREG procedure, 4593 multiple tables MIXED procedure, 2754 multiple-stage tests, 1814, See multiple comparison procedures multiplicative hazards model, See Andersen-Gill model multistage sampling, 165 multivariate analysis of variance, 433, 436 CANDISC procedure, 786 examples (GLM), 1868 GLM procedure, 1745, 1759, 1823 hypothesis tests (GLM), 1824 partial correlations, 1824 multivariate general linear hypothesis, 1824 multivariate inferences MIANALYZE procedure, 2626 multivariate multiple regression TRANSREG procedure, 4593 multivariate normality assumption MI procedure, 2565 multivariate tests MIXED procedure, 2717 REG procedure, 3910 repeated measures, 1828 multiway tables SURVEYFREQ procedure, 4227 MULTTEST procedure adjusted p-value, 2935, 2956 Bonferroni adjustment, 2939, 2956 bootstrap adjustment, 2938, 2939, 2957 Cochran-Armitage test, 2946, 2948, 2951, 2964 computational resources, 2960 convolution distribution, 2950 displayed output, 2963 double arcsine test, 2951 expected trend, 2951 false discovery rate adjustment, 2959 fast Fourier transform, 2950 Fisher combination adjustment, 2959 Fisher exact test, 2944, 2946, 2954 Freeman-Tukey test, 2946, 2951, 2968 Hochberg adjustment, 2959 Hommel adjustment, 2959 introductory example, 2936 linear trend test, 2949 missing values, 2960 ODS table names, 2963 output data sets, 2961 p-value adjustments, 2935, 2956 permutation adjustment, 2942, 2957, 2975 Peto test, 2946, 2952, 2972 resampled data sets, 2962 Sidak’s adjustment, 2942, 2956 statistical tests, 2948 stepdown methods, 2957
strata weights, 2951 t test, 2946, 2955, 2968 Murthy’s method SURVEYSELECT procedure, 4454
N
name-lists POWER procedure, 3490 natural response rate, 3705, 3707, 3711, 3713 nearest centroid sorting, 1380 nearest neighbor method, See also single linkage DISCRIM procedure, 1158, 1161 negative binomial distribution GENMOD procedure, 1652 NLMIXED procedure, 3078 negative variance components VARCOMP procedure, 4838 neighborhood distribution function (NDF), definition MODECLUS procedure, 2878 Nelder-Mead simplex, 3073 nested design, 2985 error terms, 2991 generating with PLAN procedure, 3353 hypothesis tests (NESTED), 2990 nested effects design matrix (CATMOD), 878 GENMOD procedure, 1660 MIXED procedure, 2745 model parameterization (GLM), 1789 specifying (ANOVA), 451, 453 specifying (CATMOD), 864 specifying (GLM), 1785 nested error structure MIXED procedure, 2814 nested models KRIGE2D procedure, 2050, 2051 VARIOGRAM procedure, 4871 NESTED procedure analysis of covariation, 2990 compared to other procedures, 1735, 2665, 2985 computational method, 2991 input data sets, 2988 introductory example, 2986 missing values, 2990 ODS table names, 2994 random effects, 2990 unbalanced design, 2990 nested-by-value effects specifying (CATMOD), 865 network algorithm, 1509, 3172 Newman-Keuls’ multiple range test, 444, 1769, 1814 Newton algorithm FASTCLUS procedure, 1392 Newton-Raphson algorithm CALIS procedure, 578, 580, 581, 665 GENMOD procedure, 1655 iteration (PHREG), 3222 LIFEREG procedure, 2083
LOGISTIC procedure, 2317, 2318, 2336, 2338 method (PHREG), 3245 MIXED procedure, 2738 NLMIXED procedure, 3073 PROBIT procedure, 3756 SURVEYLOGISTIC procedure, 4264, 4265, 4277 NLIN procedure analytic derivatives, 3011, 3017 automatic derivatives, 3017 close-to-linear, 3019 confidence interval, 3013, 3014, 3028, 3029 convergence, 3022 convergence criterion, 3006 cross-referencing variables, 3010 debugging execution, 3008 derivatives, 3011, 3017 displayed output, 3006 G2 inverse, 3025 G4 inverse, 3008 Gauss iterative method, 3008 Gauss-Newton method, 3024, 3027 generalized inverse, 3025 gradient method, 3024, 3025 Hessian, 3026, 3030 Hougaard’s measure, 3008, 3019 imposing bounds, 3010 incompatibilities, 3031, 3032 initial values, 3015 iteratively reweighted least squares example, 3038 Lagrange multiplier, 3010 Lagrange multipliers, covariance matrix, 3030 Marquardt iterative method, 3008, 3024, 3027 maximum iterations, 3008 maximum subiterations, 3008 mean square error specification, 3009 missing values, 3020 model confidence interval, 3029 model.variable syntax, 3012 Newton iterative method, 3008, 3024, 3026 object convergence measure, 3003 output table names, 3033 parameter confidence interval, 3028 parameter covariance matrix, 3029 PPC convergence measure, 3003 predicted values, output, 3014 R convergence measure, 3003 residual values, output, 3014 retaining variables, 3011, 3016 RPC convergence measure, 3003 segmented model example, 3034 singularity criterion, 3009 skewness, 3004, 3008, 3019 SMETHOD=GOLDEN step size search, 3028 special variables, 3020 standard error, 3014 steepest descent method, 3024, 3025 step size search, 3028
troubleshooting, 3022 tuning display of iteration computation, 3009 weighted regression, 3021 NLMIXED procedure Accelerated failure time model, 3128 active set methods, 3093 adaptive Gaussian quadrature, 3084 additional estimates, 3076, 3106 alpha level, 3061 arrays, 3074 assumptions, 3083 Bernoulli distribution, 3077 binary distribution, 3077 binomial distribution, 3077 bounds, 3075 compared with other SAS procedures and macros, 3048 computational problems, 3098 computational resources, 3103 contrasts, 3076 convergence criteria, 3060, 3065, 3087 convergence problems, 3099 covariance matrix, 3061, 3101, 3106 degrees of freedom, 3062 empirical Bayes estimation, 3084 empirical Bayes options, 3062 finite differencing, 3064, 3091 first-order method, 3085 fit statistics, 3106 floating point errors, 3098 Frailty model example, 3128 functional convergence criteria, 3063 gamma distribution, 3077 Gaussian distribution, 3077 general distribution, 3078 generalized inverse, 3065 growth curve example, 3049 Hessian matrix, 3066 Hessian scaling, 3065, 3093 integral approximations, 3070, 3084 iteration history, 3067, 3104 lag functionality, 3081 Lagrange multiplier, 3093 line-search methods, 3066, 3067, 3096 logistic-normal example, 3053 long run times, 3098 maximum likelihood, 3048 negative binomial distribution, 3078 normal distribution, 3077, 3080 notation, 3083 ODS table names, 3107 optimization techniques, 3072, 3086 options summary, 3058 overflows, 3098 parameter estimates, 3106 parameter rescaling, 3098 parameter specification, 3078 pharmakokinetics example, 3107 Poisson distribution, 3078
Poisson-normal example, 3124 precision, 3101 prediction, 3079, 3102 probit-normal-binomial example, 3114 probit-normal-ordinal example, 3118 programming statements, 3081 projected gradient, 3093 projected Hessian, 3093 quadrature options, 3071 random effects, 3079 references, 3138 replicate subjects, 3080 singularity tolerances, 3072 sorting of input data set, 3062, 3079 stationary point, 3100 step length options, 3096 syntax summary, 3057 termination criteria, 3060, 3087 update methods, 3073 nominal level of measurement DISTANCE procedure, 1249 nominal power GLMPOWER procedure, 1946, 1947, 1953 POWER procedure, 3419, 3494, 3496 nominal variable DISTANCE procedure, 1251 nominal variables, 72, See also classification variables non-full-rank models REG procedure, 3893 non-full-rank parameterization MIXED procedure, 2664, 2718, 2747 noninferiority power and sample size (POWER), 3552 nonlinear mixed models (NLMIXED), 3047 regression functions (TRANSREG), 4593, 4628 transformations (TRANSREG), 4593 nonmetric multidimensional scaling MDS procedure, 2471, 2472 nonoptimal transformations PRINQUAL procedure, 3661 TRANSREG procedure, 4562 nonparametric clustering methods MODECLUS procedure, 2855 nonparametric discriminant analysis, 1158 nonparametric regression TPSPLINE procedure, 4497 nonparametric tests NPAR1WAY procedure, 3145 normal distribution, 2083, 2097, 2111, 3705 GENMOD procedure, 1651 NLMIXED procedure, 3077, 3080 PROBIT procedure, 3757 normal kernel (DISCRIM), 1159 normalization of the estimates MDS procedure, 2492 NOSQUARE option algorithms used (CLUSTER), 986
NPAR1WAY procedure alpha level, 3160 Ansari-Bradley scores, 3168 Brown-Mood test, 3167 compared to other procedures, 1735 computational methods, 3172 computational resources, 3173 Cramer-von Mises test, 3170 EDF tests, 3168 exact p-values, 3172 exact tests, 3171 introductory example, 3145 Klotz scores, 3168 Kolmogorov-Smirnov test, 3169 Kruskal-Wallis test, 3166 Kuiper test, 3171 Mann-Whitney-Wilcoxon test, 3166 median scores, 3167 missing values, 3162 Monte Carlo estimation, 3174 Mood scores, 3168 network algorithm, 3172 ODS table names, 3184 one-way ANOVA tests, 3165 output data set, 3161, 3175, 3176 permutation test, 3157, 3166 Pitman’s test, 3157, 3166 rank tests, 3163 Savage scores, 3167 scores, 3166 Siegel-Tukey scores, 3167 statistical computations, 3163 summary of commands, 3155 tied values, 3163 Van der Waerden scores, 3167 Wilcoxon scores, 3166 nugget effect KRIGE2D procedure, 2044, 2051, 2052 MIXED procedure, 2718 VARIOGRAM procedure, 4871 null hypothesis, 3488 number of imputations MI procedure, 2565 number-lists GLMPOWER procedure, 1945 POWER procedure, 3490
O
O’Brien’s test for homogeneity of variance ANOVA procedure, 443 GLM procedure, 1767 objective function mixed model (MIXED), 2749 oblimin method, 1291, 1318 oblique component analysis, 4799 Oblique projector MIXED procedure, 2767 oblique transformation, 1294, 1298 odds ratio
adjusted, 1503 Breslow-Day test, 1508 case-control studies, 1488, 1503, 1504 confidence limits (LOGISTIC), 2308, 2315 confidence limits (SURVEYLOGISTIC), 4262 customized (LOGISTIC), 2328 customized (SURVEYLOGISTIC), 4267 estimation (LOGISTIC), 2347 estimation (SURVEYLOGISTIC), 4288 logit estimate, 1504 Mantel-Haenszel estimate, 1503 power and sample size (POWER), 3457, 3462, 3523, 3524 ODS and templates, 274 compatibility with Version 6, 280 creating an output data set, 292 default behavior, 273 exclusion list, 276 interactive procedures, 277 NOPRINT option interaction, 279 ODS Graphics, 319 output formats, 273 output table names, 274 path names, 275 run-group processing, 277 selection list, 276, 289 Statistical Graphics Using ODS, 319 suppressing displayed output, 279 TEMPLATE procedure, 303, 309 templates, 278 trace record, 275 with Results window, 277 with SAS Explorer, 277 ODS destinations ODS Graphics, 334 ODS examples concatenating data sets, 299 creating an output data set, 294, 297 excluding output, 291 GLMMOD procedure, 1924 HTML output, 281, 282 links in HTML output, 308, 311 modifying templates, 307 ODS and the GCHART procedure, 312 ORTHOREG procedure, 3206 output table names, 284, 295 PLS procedure, 3394, 3396, 3405 run-group processing, 299 selecting output, 287 ODS graph names ANOVA procedure, 460 CORRESP procedure, 1109 GAM procedure, 1581 GLM procedure, 1847 KDE procedure, 2010 LIFETEST procedure, 2191 LOESS procedure, 2250 LOGISTIC procedure, 2390
MI procedure, 2567 MIXED procedure, 2762 PHREG procedure, 3271 PRINCOMP procedure, 3613 PRINQUAL procedure, 3677 REG procedure, 3923 ROBUSTREG procedure, 4014 ODS graph templates displaying templates, 339 editing templates, 340 graph definitions, 338, 342, 365 graph template language, 338, 342 locating templates, 339 reverting to default templates, 341, 369 saving templates, 340 style definitions, 338, 344 table definitions, 338 template definitions, 338 using customized templates, 341 ODS Graphics, 319 DOCUMENT destination, 326, 335 examples, 352 excluding graphs, 330 getting started, 321 graph names, 330 graph template definitions, 338, 342, 365 graph template language, 338, 342 graphics image files, 334 graphics image files, names, 335 graphics image files, PostScript, 337 graphics image files, types, 334, 336, 350 HTML destination, 326, 335 HTML output, 326, 334, 336 index counter, 335 introductory examples, 321, 324 label collision avoidance, 351 LATEX destination, 326, 335 LaTeX output, 334, 337 lattice layouts, 381 layout area, 342 MIXED procedure, 2757 ODS destinations, 334 overlaid layouts, 368, 373, 381 PCL destination, 326, 335 PDF destination, 326, 335 PDF output, 338 plot options, 324 PostScript output, 338 PS destination, 326, 335 referring to graphs, 330 requesting graphs, 321, 324 reseting index counter, 335 RTF destination, 326, 335 RTF output, 326, 338 saving graphics image files, 336 selecting graphs, 330 supported operating environments, 348 supported procedures, 348 viewing graphs, 327
ODS output destinations, 274, 276, 298 objects, 273 ODS output files, 326 ODS path, 340, 341, 346 ODS Statistical Graphics, see ODS Graphics ODS styles, 332 Analysis style, 333 attributes, 344 customizing, 345, 374, 376, 378 Default style, 333, 346 definitions, 344 elements, 344 Journal style, 332, 333, 346 Rtf style, 346 specifying, 332 specifying a default, 346 Statistical style, 333 ODS table names SURVEYLOGISTIC procedure, 4295 SURVEYREG procedure, 4394 SURVEYSELECT procedure, 4460 offset GENMOD procedure, 1640, 1689 offset variable GENMOD procedure, 1617 PHREG procedure, 3229 offspring INBREED procedure, 1973, 1980 one-sample t-test, 4775 power and sample size (POWER), 3412, 3432, 3437, 3508–3510 one-way ANOVA power and sample size (POWER), 3438, 3442, 3443, 3513, 3536 one-way ANOVA tests NPAR1WAY procedure, 3165 online documentation, 21 operations research, 23 optimal scaling (TRANSREG), 4609 scoring (PRINQUAL), 3662 scoring (TRANSREG), 4563, 4610 transformations (MDS), 2472, 2481 transformations (PRINQUAL), 3662 transformations (TRANSREG), 4563 optimization CALIS procedure, 550, 577, 622, 664 conjugate gradient (CALIS), 577, 579–581, 665 double dogleg (CALIS), 578, 579, 581, 665 history (CALIS), 668 initial values (CALIS), 661, 666 Levenberg-Marquardt (CALIS), 578, 581, 665 line search (CALIS), 580, 671 memory problems (CALIS), 666 Newton-Raphson (CALIS), 578, 580, 581, 665 nonlinear constraints (CALIS), 666 quasi-Newton (CALIS), 578–581, 665, 666
step length (CALIS), 672 techniques (NLMIXED), 3072, 3086 termination criteria (CALIS), 611, 615–620 trust region (CALIS), 578, 581, 665 update method (CALIS), 579 order of variables SURVEYFREQ procedure, 4193 order statistics, See RANK procedure ordering observations INBREED procedure, 1968 ordinal level of measurement DISTANCE procedure, 1249 ordinal model CATMOD procedure, 869 GENMOD procedure, 1704 ORDINAL Parameterization SURVEYLOGISTIC procedure, 4271 ordinal variable, 72 ordinal variables transformed to interval (RANKSCORE=), 1261 ordinary kriging KRIGE2D procedure, 2033, 2056–2060 ORTHEFFECT Parameterization SURVEYLOGISTIC procedure, 4272 orthoblique rotation, 4800 orthogonal polynomial contrasts, 448 orthogonal transformation, 1294, 1298 orthomax method, 1291, 1317 orthonormalizing transformation matrix ANOVA procedure, 438 GLM procedure, 1761 ORTHORDINAL Parameterization SURVEYLOGISTIC procedure, 4272 ORTHOREG procedure compared to other procedures, 3197 input data sets, 3201 introductory example, 3197 missing values, 3203 ODS table names, 3204 output data sets, 3202, 3203 ORTHOTHERM Parameterization SURVEYLOGISTIC procedure, 4272 ORTHPOLY Parameterization SURVEYLOGISTIC procedure, 4273 ORTHREF Parameterization SURVEYLOGISTIC procedure, 4273 OUT= data sets ACECLUS procedure, 409 CANCORR procedure, 761, 766 FACTOR procedure, 1316, 1325 FASTCLUS procedure, 1398 PRINCOMP procedure, 3609 SCORE procedure, 4075 TREE procedure, 4756 OUTDIST= data set VARIOGRAM procedure, 4851, 4865, 4878, 4880 OUTEST= data sets
KRIGE2D procedure, 2060 LIFEREG procedure, 2121 ROBUSTREG procedure, 4011 outliers FASTCLUS procedure, 1379 MODECLUS procedure, 2872 OUTNBHD= data set KRIGE2D procedure, 2060, 2061 OUTPAIR= data set VARIOGRAM procedure, 4851, 4866, 4881 output data set SCORE procedure, 4072, 4075 SURVEYMEANS procedure, 4321, 4345 output data sets ACECLUS procedure, 409 CALIS procedure, 634 CANCORR procedure, 761, 766 CLUSTER procedure, 971 FACTOR procedure, 1297, 1316, 1325 FASTCLUS procedure, 1393, 1394, 1398 KRIGE2D procedure, 2039, 2060, 2061 LIFETEST procedure, 2183 LOGISTIC procedure, 2374, 2376, 2377 MI procedure, 2520, 2528, 2559 MI procedure, EM statement, 2523 MODECLUS procedure, 2883 MULTTEST procedure, 2961, 2962 OUTCOV= data set (INBREED), 1974, 1982 PRINQUAL procedure, 3669 SIM2D procedure, 4099, 4110 SURVEYSELECT procedure, 4456 TREE procedure, 4756 VARCLUS procedure, 4811, 4816 output ODS graphics table names GENMOD procedure, 1695 PHREG procedure, 3271 output parameter estimates MI procedure, 2528 output ROC data sets LOGISTIC procedure, 2378 output table names ACECLUS procedure, 412 CALIS procedure, 688 CANCORR procedure, 771 CLUSTER procedure, 994 FASTCLUS procedure, 1407 GENMOD procedure, 1693 INBREED procedure, 1984 KDE procedure, 2009 LIFEREG procedure, 2124 MDS procedure, 2497 MODECLUS procedure, 2888 PRINCOMP procedure, 3613 PRINQUAL procedure, 3677 PROBIT procedure, 3765 ROBUSTREG procedure, 4012 SURVEYFREQ procedure, 4230 SURVEYLOGISTIC procedure, 4295 SURVEYMEANS procedure, 4349
SURVEYREG procedure, 4394 SURVEYSELECT procedure, 4460 TREE procedure, 4757 VARCLUS procedure, 4820 OUTQ= data set, 3071 OUTSIM= data set SIM2D procedure, 4110 OUTSTAT= data sets CANCORR procedure, 766 FACTOR procedure, 1325 OUTVAR= data set VARIOGRAM procedure, 4866, 4877 over-parameterization MIXED procedure, 2744 overall kappa coefficient, 1493, 1498 overdispersion GENMOD procedure, 1659 LOGISTIC procedure, 2316, 2354, 2355 PROBIT procedure, 3745 overflows NLMIXED procedure, 3098 overlaid layouts ODS Graphics, 368, 373, 381 Overlap dissimilarity coefficient DISTANCE procedure, 1273 overlap of data points LOGISTIC procedure, 2339 SURVEYLOGISTIC procedure, 4278 Overlap similarity coefficient DISTANCE procedure, 1273
P
P-P plots REG procedure, 3917, 3953 p-value adjustments Bonferroni (MULTTEST), 2939, 2956 bootstrap (MULTTEST), 2938, 2939, 2957, 2968 false discovery rate (MULTTEST), 2959 Fisher combination (MULTTEST), 2959 Hochberg (MULTTEST), 2959 Hommel (MULTTEST), 2959 MULTTEST procedure, 2935, 2956 permutation (MULTTEST), 2942, 2957, 2975 Sidak (MULTTEST), 2942, 2956, 2972 p-value computation MODECLUS procedure, 2877 painting line-printer plots REG procedure, 3889 paired comparisons, 4775, 4793 paired proportions, See McNemar’s test paired t test, 4782 power and sample size (POWER), 3448, 3455, 3518, 3519 paired-difference t test, 4775, See also paired t test pairwise comparisons GLM procedure, 1807
panels INBREED procedure, 1982, 1989 parameter constraints MIXED procedure, 2707, 2773 parameter estimates covariance matrix (CATMOD), 842 example (REG), 3877 NLMIXED procedure, 3106 PHREG procedure, 3220, 3222, 3223, 3269 REG procedure, 3919 TPHREG procedure, 4486, 4487 parameter rescaling NLMIXED procedure, 3098 parameter simulation MI procedure, 2564 parameter specification NLMIXED procedure, 3078 Parameterization SURVEYLOGISTIC procedure, 4270 parameterization CATMOD, 845 mixed model (MIXED), 2743 MIXED procedure, 2743 of models (GLM), 1787 parametric discriminant analysis, 1157 Pareto charts, 23 parsimax method, 1291, 1317, 1318 partial canonical correlation, 752 partial correlation CANCORR procedure, 761, 762, 764 principal components, 3612 partial correlations multivariate analysis of variance, 1824 power and sample size (POWER), 3426, 3429, 3502, 3503, 3556 partial least squares, 3367, 3380 partial likelihood PHREG procedure, 3215, 3216, 3240, 3241, 3243 partial listing product-limit estimate (LIFETEST), 2165 partial regression leverage plots REG procedure, 3901 partially balanced square lattice LATTICE procedure, 2069 partitions MDS procedure, 2477, 2488 passive observations PRINQUAL procedure, 3674 TRANSREG procedure, 4605 path diagram (CALIS) exogenous variables, 664 factor analysis model, 599, 600 structural model example, 555, 600 PCL destination ODS Graphics, 326, 335 PDF, See probability density function PDF destination
ODS Graphics, 326, 335 PDF output examples, ODS Graphics, 360 ODS Graphics, 338 Pearson chi-square test, 1469, 1471, 3712, 3759 power and sample size (POWER), 3457, 3462, 3524 Pearson correlation coefficient, 1474, 1479 Pearson correlation statistics power and sample size (POWER), 3426, 3502, 3503, 3556 Pearson Residual MIXED procedure, 2704 Pearson residuals GENMOD procedure, 1670 LOGISTIC procedure, 2360 Pearson’s chi-square GENMOD procedure, 1637, 1656, 1657 LOGISTIC procedure, 2308, 2316, 2354 PROBIT procedure, 3742, 3745, 3760 pedigree analysis example (INBREED), 1987, 1989 INBREED procedure, 1967, 1968 penalized least squares, TPSPLINE procedure, 4497, 4498, 4518 permutation adjustment (MULTTEST), 2942, 2957 generating with PLAN procedure, 3358 permutation (MULTTEST) p-value adjustments, 2942, 2957, 2975 permutation test NPAR1WAY procedure, 3157, 3166 Peto test MULTTEST procedure, 2946, 2952, 2972 Peto-Peto test for homogeneity LIFETEST procedure, 2150, 2168 Peto-Peto-Prentice, See Peto-Peto test for homogeneity pharmaceutical stability example (MIXED), 2810 pharmakokinetics example NLMIXED procedure, 3107 phenogram, 4743 phi coefficient, 1469, 1474 phi-squared coefficient DISTANCE procedure, 1273 PHREG procedure Andersen-Gill model, 3216, 3243, 3253 baseline hazard function, 3216 BASELINE statistics, 3224, 3225 branch and bound algorithm, 3265, 3279 Breslow likelihood, 3228 case weight, 3239 case-control studies, 3217, 3228, 3280 conditional logistic regression, 3217, 3283 continuous time scale, 3216, 3228, 3284 counting process, 3241 covariance matrix, 3222, 3233, 3245 Cox regression analysis, 3215, 3219
cumulative baseline hazard function, 3257 cumulative martingale residuals, 3223, 3266, 3271 DATA step statements, 3220, 3221, 3236, 3288 descriptive statistics, 3223 DFBETA statistics, 3234 discrete logistic model, 3216, 3228, 3283 disk space, 3222 displayed output, 3268 Efron likelihood, 3228 event times, 3215, 3218 exact likelihood, 3228 fractional frequencies, 3227 global influence, 3234, 3260 global null hypothesis, 3218, 3246, 3269 hazard function, 3215, 3240 hazards ratio, 3218, 3247 hazards ratio confidence interval, 3233 iteration history, 3233, 3269 Lee-Wei-Amato model, 3251, 3314 left truncation time, 3229, 3263 likelihood displacement, 3234, 3260 likelihood ratio test, 3246, 3269 linear hypotheses, 3217, 3238, 3247 linear predictor, 3225, 3233, 3235, 3302, 3303 local influence, 3234, 3260 log-rank test, 3219 Mantel-Haenszel test, 3219 mean function, 3224, 3226, 3244, 3245, 3253, 3255 missing values, 3228, 3236, 3286 missing values as strata, 3237 model assessment, 3223, 3265, 3271, 3318 model selection, 3216, 3229, 3230, 3264 ODS graph names, 3271 ODS table names, 3270 offset variable, 3229 output ODS graphics table names, 3271 OUTPUT statistics, 3234, 3235 parameter estimates, 3220, 3222, 3223, 3269 partial likelihood, 3215, 3216, 3240, 3241, 3243 Prentice-Williams-Peterson model, 3255 programming statements, 3220–3222, 3235, 3236, 3288 proportional hazards model, 3215, 3220, 3228 rate function, 3243, 3253 rate/mean model, 3243, 3253 recurrent events, 3216, 3224–3226, 3243 residual chi-square, 3231 residuals, 3234, 3235, 3257–3260, 3302 response variable, 3218, 3283 risk set, 3220, 3240, 3241, 3288 risk weights, 3229 score test, 3229, 3231, 3246, 3269, 3274, 3276 selection methods, 3216, 3229, 3264 singularity criterion, 3232 standard error, 3233, 3235, 3269 standard error ratio, 3269 standardized score process, 3267, 3271
step halving, 3245 strata variables, 3237 stratified analysis, 3216, 3237 survival distribution function, 3239 survival times, 3215, 3281, 3283 survivor function, 3215, 3225, 3235, 3239, 3240, 3261, 3299 ties, 3216, 3219, 3228, 3241, 3268 time-dependent covariates, 3216, 3220, 3222, 3224, 3233, 3236 Wald test, 3238, 3246, 3247, 3269, 3284 Wei-Lin-Weissfeld model, 3248 piecewise polynomial splines TRANSREG procedure, 4561, 4614 Pillai’s trace, 437, 1759, 1828 Pitman’s test NPAR1WAY procedure, 3157, 3166 PLAN procedure combinations, 3358 compared to other procedures, 3335 factor, selecting levels for, 3339, 3340 generalized cyclic incomplete block design, 3357 hierarchical design, 3353 incomplete block design, 3354, 3357 input data sets, 3339, 3343 introductory example, 3336 Latin square design, 3356 nested design, 3353 ODS table names, 3352 output data sets, 3339, 3343, 3346, 3347 permutations, 3358 random number generators, 3339 randomizing designs, 3347, 3351 specifying factor structures, 3348 split-plot design, 3352 treatments, specifying, 3345 using interactively, 3346 %PLOT macro DISCRIM procedure, 1182 plot options ODS Graphics, 324 %PLOTIT macro, 808 CORRESP procedure, 1070, 1118, 1128 DISCRIM procedure, 1200 MDS procedure, 2474 PRINCOMP procedure, 3615, 3617, 3625 PRINQUAL procedure, 3678, 3687 TRANSREG procedure, 4617, 4717, 4718 plots examples (REG), 3948 high-resolution (REG), 3840 keywords (REG), 3841 likelihood (MIXED), 2801 line printer (REG), 3848, 3882 MDS procedure, 2474 of configuration (MDS), 2504, 2505 of dimension coefficients (MDS), 2504, 2506 of linear fit (MDS), 2503
options (REG), 3843, 3844 power and sample size (GLMPOWER), 1930, 1936, 1942 power and sample size (POWER), 3412, 3420, 3421, 3483, 3566 plots, ODS Graphics box plots, 355 contour plots, 324, 360 Cook’s D plots, 353 diagnostics panels, 322, 379 fit plots, 322, 354 histograms, 358 Q-Q plots, 361, 363, 368, 370, 372, 373, 375, 377–379 residual plots, 322 scatter plots, 343, 351 surface plots, 324, 360 plotting samples from univariate distributions MODECLUS procedure, 2889 PLS procedure algorithms, 3375 centering, 3386 compared to other procedures, 3367 components, 3367 computation method, 3375 cross validation, 3368, 3384 cross validation method, 3374 examples, 3388 factors, 3367 factors, selecting the number of, 3370 introductory example, 3368 latent variables, 3367 latent vectors, 3367 missing values, 3376 ODS table names, 3388 outlier detection, 3398 output data sets, 3379 output keywords, 3379 partial least squares regression, 3367, 3380 predicting new observations, 3373 principal components regression, 3367, 3381 reduced rank regression, 3367, 3381 scaling, 3386 SIMPLS method, 3380 test set validation, 3384, 3400 point models TRANSREG procedure, 4605 Poisson distribution GENMOD procedure, 1652 NLMIXED procedure, 3078 Poisson regression GENMOD procedure, 1613, 1616 Poisson-normal example NLMIXED procedure, 3124 POLY Parameterization SURVEYLOGISTIC procedure, 4271 polychoric correlation coefficient, 1460, 1474, 1482 polynomial effects GENMOD procedure, 1660
Subject Index MIXED procedure, 2743 model parameterization (GLM), 1787 specifying (GLM), 1784 polynomial model GLMMOD procedure, 1909 POLYNOMIAL Parameterization SURVEYLOGISTIC procedure, 4271 polynomial regression REG procedure, 3804 polynomial-spline basis TRANSREG procedure, 4561, 4614 pooled stratum SURVEYREG procedure, 4389 pooled within-cluster covariance matrix definition, 388 population profile (CATMOD), 820 total (SURVEYLOGISTIC), 4252, 4280 total (SURVEYMEANS), 4326, 4334 total (SURVEYREG), 4374, 4382 population (INBREED) monoecious, 1985 multiparous, 1973, 1977 nonoverlapping, 1974 overlapping, 1968, 1969, 1979 population clusters risks of estimating (MODECLUS), 2877 population profile, 73 population total SURVEYFREQ procedure, 4194, 4204 posterior probability DISCRIM procedure, 1200 error rate estimation (DISCRIM), 1165 PostScript graphics image files, 358 PostScript output ODS Graphics, 338 power overview of power concepts (POWER), 3488 See GLMPOWER procedure, 1929 See POWER procedure, 3411 power curves, See plots Power distance coefficient DISTANCE procedure, 1272 POWER procedure AB/BA crossover designs, 3549 actual alpha, 3496 actual power, 3419, 3494, 3496 actual prob(width), 3496 analysis of variance, 3438, 3442, 3443, 3513, 3536 analysis statements, 3420 binomial proportion tests, 3429, 3432, 3504– 3506, 3541 bioequivalence, 3510, 3511, 3520, 3521, 3530, 3531, 3549 ceiling sample size, 3419, 3496 compared to other procedures, 1930, 3412
computational methods, 3498 computational resources, 3498 confidence intervals for means, 3432, 3438, 3448, 3456, 3463, 3473, 3512, 3522, 3532, 3563 contrasts, analysis of variance, 3438, 3439, 3442, 3513, 3536 correlated proportions, 3443, 3447, 3448, 3516, 3517 correlation, 3426, 3502, 3503, 3556 crossover designs, 3549 displayed output, 3496 effect size, 3485 equivalence tests, 3432, 3438, 3448, 3456, 3463, 3472, 3510, 3511, 3520, 3521, 3530, 3531, 3549 Fisher’s exact test, 3457, 3463, 3525 Fisher’s z test for correlation, 3426, 3429, 3502, 3556 fractional sample size, 3419, 3496 Gehan test, 3473, 3482, 3533 grouped-name-lists, 3490 grouped-number-lists, 3490 introductory example, 3412 keyword-lists, 3490 likelihood-ratio chi-square test, 3457, 3463, 3525 log-rank test for comparing survival curves, 3473, 3481, 3533, 3561 lognormal data, 3434, 3437, 3438, 3450, 3455, 3456, 3465, 3472, 3509, 3511, 3519, 3521, 3529, 3531, 3552 McNemar’s test, 3443, 3447, 3448, 3516, 3517 name-lists, 3490 nominal power, 3419, 3494, 3496 noninferiority, 3552 notation for formulas, 3498 number-lists, 3490 odds ratio, 3457, 3462, 3523, 3524 ODS table names, 3497 one-sample t test, 3412, 3432, 3437, 3508, 3509 one-way ANOVA, 3438, 3442, 3443, 3513, 3536 overview of power concepts, 3488 paired proportions, 3443, 3447, 3448, 3516, 3517 paired t test, 3448, 3455, 3518, 3519 partial correlation, 3426, 3429, 3502, 3503, 3556 Pearson chi-square test, 3457, 3462, 3524 Pearson correlation, 3426, 3429, 3502, 3503, 3556 plots, 3412, 3420, 3421, 3483, 3566 regression, 3421, 3425, 3500, 3556 relative risk, 3457, 3462, 3523, 3524 sample size adjustment, 3494 summary of analyses, 3488 summary of statements, 3420 survival analysis, 3473, 3481, 3533 t test for correlation, 3426, 3429, 3503
t tests, 3432, 3437, 3448, 3455, 3463, 3471, 3508, 3509, 3518, 3526, 3529, 3566 Tarone-Ware test, 3473, 3483, 3533 two-sample t test, 3415, 3463, 3471, 3472, 3526, 3527, 3529, 3566 value lists, 3490 z test, 3429, 3432, 3505, 3506 power semivariogram model KRIGE2D procedure, 2049 power-of-the-mean model MIXED procedure, 2718 PPC convergence measure, 3030 pplot plots annotating, 2102 axes, color, 2102 font, specifying, 2103 reference lines, options, 2103–2105, 2107 PPS, 161 PPS sampling SURVEYSELECT procedure, 4463 PPS sampling, with replacement SURVEYSELECT procedure, 4451 PPS sampling, without replacement SURVEYSELECT procedure, 4449 PPS sequential sampling SURVEYSELECT procedure, 4452 PPS systematic sampling SURVEYSELECT procedure, 4451 precision NLMIXED procedure, 3101 precision, confidence intervals, 3488 predicted means MIXED procedure, 2704 predicted model matrix CALIS procedure, 643, 644, 663 displaying (CALIS), 683 singular (CALIS), 680 predicted population margins GLM procedure, 1753 predicted probabilities LOGISTIC procedure, 2350 predicted probability plots annotating, 3750 axes, color, 3750 font, specifying, 3750 options summarized by function, 3748 reference lines, options, 3750–3752, 3754 threshold lines, options, 3753 predicted residual sum of squares RSREG procedure, 4042 predicted value confidence intervals MIXED procedure, 2692 predicted values example (MIXED), 2796 LIFEREG procedure, 2114 mixed model (MIXED), 2687 MIXED procedure, 2703 NLIN procedure, 3014 REG procedure, 3876, 3879
response functions (CATMOD), 845 prediction example (REG), 3924 NLMIXED procedure, 3079, 3102 predictive mean matching method MI procedure, 2542 predpplot PROBIT procedure, 3746 preference mapping TRANSREG procedure, 4593, 4717 preference models TRANSREG procedure, 4586 prefix name LINEQS statement (CALIS), 602 MATRIX statement (CALIS), 594 RAM statement (CALIS), 598 STD statement (CALIS), 603 preliminary clusters definition (CLUSTER), 978 using in CLUSTER procedure, 969 Prentice-Williams-Peterson model PHREG procedure, 3255 presentations examples, ODS Graphics, 356 PRESS residual MIXED Procedure, 2766 PRESS statistic, 1774 MIXED Procedure, 2766 RSREG procedure, 4042 prevalence test MULTTEST procedure, 2952, 2972 primary sampling unit (PSUs), 165 primary sampling units (PSUs) SURVEYLOGISTIC procedure, 4281 SURVEYMEANS procedure, 4335 SURVEYREG procedure, 4383 principal component analysis, 1291 compared with common factor analysis, 1293 PRINQUAL procedure, 3668 with FACTOR procedure, 1295 principal components, See also PRINCOMP procedure definition, 3595 interpreting eigenvalues, 3621 partialling out variables, 3608 properties of, 3595, 3596 regression (PLS), 3367, 3381 rotating, 3611 using weights, 3608 principal factor analysis with FACTOR procedure, 1296 PRINCOMP procedure CALIS procedure, 567 computational resources, 3611 correction for means, 3605 Crime Rates Data, example, 3619 DATA= data set, 3609 eigenvalues and eigenvectors, 3595, 3610–3612 examples, 3614, 3626
Subject Index input data set, 3605 ODS graph names, 3613 output data sets, 3605, 3609–3611 output table names, 3613 OUTSTAT= data set, 3609 %PLOTIT macro, 3615, 3617, 3625 replace missing values, example, 3626 SCORE procedure, 3611 suppressing output, 3605 weights, 3608 PRINQUAL procedure biplot, 3678 casewise deletion, 3655 character OPSCORE variables, 3673 constant transformations, avoiding, 3673 constant variables, 3673 excluded observations, 3657, 3674 frequency variable, 3658 identity transformation, 3663 iterations, 3656, 3667, 3673 knots, 3665, 3666 linear transformation, 3662 MAC method, 3643, 3669 maximum average correlation method, 3643, 3669 maximum total variance method, 3643 MDPREF analysis, 3678 MGV method, 3643 minimum generalized variance method, 3643 missing character values, 3662 missing values, 3655, 3667, 3674 monotonic B-spline transformation, 3662 monotonic transformation, 3662, 3663 MTV method, 3643 multidimensional preference analysis, 3678, 3688 nonoptimal transformations, 3661 ODS graph names, 3677 optimal scoring, 3662 optimal transformations, 3662 output data sets, 3669 output table names, 3677 passive observations, 3674 %PLOTIT macro, 3678, 3687 principal component analysis, 3668 random initializations, 3673 reflecting the transformation, 3666 renaming variables, 3666 reusing variables, 3666 smoothing spline transformation, 3663 spline t-options, 3665 spline transformation, 3662 standardization, 3672 transformation options, 3663 variable names, 3672 weight variable, 3667 prior density MIXED procedure, 2709 prior event probability
LOGISTIC procedure, 2314, 2353, 2422 probability density function LIFETEST procedure, 2149, 2214 probability distribution built-in (GENMOD), 1614, 1638 exponential family (GENMOD), 1650 user-defined (GENMOD), 1633 probability sampling, see also SURVEYSELECT procedure probit analysis insets, 3724 probit equation, 3705, 3759 probit model SURVEYLOGISTIC procedure, 4285 PROBIT procedure Abbot’s formula, 3755 binary response data, 3705, 3706, 3759 cdfplot, 3715 deviance, 3745, 3760 deviance statistic, 3759 dispersion parameter, 3760 extreme value distribution , 3757 goodness-of-fit, 3743, 3745 goodness-of-fit tests, 3712, 3743, 3759 inset, 3723 inverse confidence limits, 3761 ippplot, 3725 log-likelihood function, 3756 logistic distribution, 3757 lpredplot, 3733 maximum-likelihood estimates, 3705 missing values, 3755 models, 3759 multilevel response data, 3705, 3706, 3759 natural response rate, 3706 Newton-Raphson algorithm, 3756 normal distribution, 3757 output table names, 3765 overdispersion, 3745 Pearson chi-square, 3759 Pearson’s chi-square, 3742, 3745, 3760 predpplot, 3746 subpopulation, 3742, 3745, 3761 threshold response rate, 3706 tolerance distribution, 3761 probit-normal-binomial example NLMIXED procedure, 3114 probit-normal-ordinal example NLMIXED procedure, 3118 Procrustean method, 1291 Procrustes rotation, 1318 producing monotone missingness MI procedure, 2552 product-limit estimate LIFETEST procedure, 2149, 2171, 2185 profile likelihood confidence intervals GENMOD procedure, 1666 profile, population and response, 72, 73 CATMOD procedure, 820
profiling residual variance MIXED procedure, 2772 progeny INBREED procedure, 1976, 1978, 1980, 1988 programming statements constraints (CALIS), 565, 628, 630, 675 differentiation (CALIS), 589 GENMOD procedure, 1645 NLMIXED procedure, 3081 PHREG procedure, 3220, 3222, 3235, 3236 projected gradient NLMIXED procedure, 3093 projected Hessian NLMIXED procedure, 3093 promax method, 1291, 1318 propensity score method MI procedure, 2543, 2565 proportional hazards model assumption (PHREG), 3220 distribution (LIFEREG), 2111 PHREG procedure, 3215, 3228 proportional odds model SURVEYLOGISTIC procedure, 4284 proportional rates/means model, See rate/mean model proportions SURVEYFREQ procedure, 4210 prospective power, 1929, 3411 prospective study, 1540 proximity data MDS procedure, 2471, 2478, 2484 proximity measures available methods for computing (DISTANCE), 1257 formulas (DISTANCE), 1270 PS destination ODS Graphics, 326, 335 pseudo F and t statistics CLUSTER procedure, 972
Q
Q-Q plots graph template definitions, 366 plots, ODS Graphics, 361, 363, 368, 370, 372, 373, 375, 377–379 REG procedure, 3917, 3953 quadratic discriminant function, 1139 quadratic forms for fixed effects displaying (GLM), 1777 quadratic regression, 1738 quadrature options NLMIXED procedure, 3071 qualitative variables, 72, See classification variables REG procedure, 3943 quantal response data, 3705 quantification method
CORRESP procedure, 1069 quantile computation STDIZE procedure, 4119, 4139 quartimax method, 1291, 1317, 1318 quartimin method, 1291, 1319 quasi-complete separation LOGISTIC procedure, 2339 SURVEYLOGISTIC procedure, 4278 quasi-independence model, 919 quasi-inverse, 1164 quasi-likelihood GENMOD procedure, 1659 quasi-Newton, 3073 quasi-Newton algorithm CALIS procedure, 578–581, 665, 666
R
R convergence measure, 3030 R matrix MIXED procedure, 2663, 2732, 2733 R-notation, 1794 R-square statistic CLUSTER procedure, 972 LOGISTIC procedure, 2315, 2342 SURVEYLOGISTIC procedure, 4264, 4280 R2 improvement REG procedure, 3874, 3875 R2 selection REG procedure, 3875 R= option and other options (CLUSTER), 969, 970, 972 radius of sphere of support, 972 RAM model CALIS procedure, 553, 596 specification, 560 structural model example (CALIS), 557, 563 random coefficients example (MIXED), 2788, 2810 random effects expected mean squares, 1833 GLM procedure, 1776, 1833 MIXED procedure, 2663, 2712 NESTED procedure, 2990 NLMIXED procedure, 3079 VARCOMP procedure, 4831, 4837 random effects model, See also nested design VARCOMP procedure, 4837 random initializations TRANSREG procedure, 4602 random number generators MI procedure, 2520 PLAN procedure, 3339 random sampling, See also SURVEYSELECT procedure random-effects parameters MIXED procedure, 2662, 2732 randomization of designs using PLAN procedure, 3351
randomized complete block design example, 1847 range KRIGE2D procedure, 2047 rank correlation LOGISTIC procedure, 2350 SURVEYLOGISTIC procedure, 4292 rank order typal analysis, See complete linkage RANK procedure, 21 order statistics, 21 rank scores, 1469 rank tests NPAR1WAY procedure, 3163 Rao-Scott chi-square test SURVEYFREQ procedure, 4216 Rao-Scott likelihood ratio test SURVEYFREQ procedure, 4219 rate function PHREG procedure, 3243, 3253 rate/mean model PHREG procedure, 3243, 3253 ratio SURVEYMEANS procedure, 4330, 4339 ratio analysis SURVEYMEANS procedure, 4330, 4339, 4348 ratio level of measurement DISTANCE procedure, 1250 raw residuals GENMOD procedure, 1669 receiver operating characteristic LOGISTIC procedure, 2378 reciprocal averaging CORRESP procedure, 1069 reciprocal causation CALIS procedure, 585 rectangular lattice LATTICE procedure, 2069 rectangular table SURVEYMEANS procedure, 4324 recurrent events PHREG procedure, 3216, 3224–3226, 3243 reduced rank regression, 3367 PLS procedure, 3381 reduction notation, 1794 redundancy analysis CANCORR procedure, 752 standardized variables (TRANSREG), 4591 TRANSREG procedure, 4576, 4590, 4593, 4606 REF Parameterization SURVEYLOGISTIC procedure, 4272 reference level TRANSREG procedure, 4569 REFERENCE Parameterization SURVEYLOGISTIC procedure, 4272 reference structure, 1298 reference-cell coding TRANSREG procedure, 4569, 4594, 4664, 4666 references, 1288
referring to graphs examples, ODS Graphics, 330 ODS Graphics, 330 refitting models REG procedure, 3905 reflecting the transformation PRINQUAL procedure, 3666 TRANSREG procedure, 4572 REG procedure adding variables, 3819 adjusted R2 selection, 3875 alpha level, 3816 annotations, 3816, 3844 ANOVA table, 3918 autocorrelation, 3915 backward elimination, 3800, 3874 collinearity, 3895 compared to other procedures, 1735, 3197 computational methods, 3917 correlation matrix, 3817 covariance matrix, 3817 crossproducts matrix, 3917 delete variables, 3820 deleting observations, 3903 diagnostic statistics, 3896, 3897 dictionary of options, 3844 forward selection, 3800, 3873 graphics, 3923 graphics examples, 3948 graphics keywords and options, 3841, 3843 graphics plots, high-resolution, 3840 heteroscedasticity, testing, 3910 hypothesis tests, 3832, 3858 incomplete principal components, 3818, 3828 influence statistics, 3898 input data sets, 3860 interactive analysis, 3812, 3869 introductory example, 3800 IPC analysis, 3818, 3828, 3916 line printer plots, 3848, 3882 Mallows’ Cp selection, 3875 missing values, 3859 model fit summary statistics, 3896 model selection, 3800, 3873, 3876, 3877, 3924 multicollinearity, 3895 multivariate tests, 3910 new regressors, 3860 non-full-rank models, 3893 ODS graph names, 3923 ODS table names, 3920 output data sets, 3863, 3868 P-P plots, 3917, 3953 painting line-printer plots, 3889 parameter estimates, 3877, 3919 partial regression leverage plots, 3901 plot keywords and options, 3841, 3843, 3844 plots, high-resolution, 3840 polynomial regression, 3804 predicted values, 3876, 3879, 3924
Q-Q plots, 3917, 3953 qualitative variables, 3943 R2 improvement, 3874, 3875 R2 selection, 3875 refitting models, 3905 residual values, 3879 restoring weights, 3906 reweighting observations, 3903 ridge regression, 3818, 3829, 3848, 3916, 3956 singularities, 3917 stepwise selection, 3800, 3874 summary statistics, 3896 sweep algorithm, 3917 time series data, 3915 variance inflation factors (VIF), 3818, 3958 regression analysis (REG), 3799 CATMOD procedure, 815 examples (GLM), 1853 ill-conditioned data, 3197 MODEL statements (GLM), 1785 nonparametric, 4497 ORTHOREG procedure, 3197 partial least squares (PROC PLS), 3367, 3380 power and sample size (POWER), 3421, 3425, 3500, 3556 principal components (PROC PLS), 3367, 3381 quadratic (GLM), 1738 reduced rank (PROC PLS), 3367, 3381 semiparametric models, 4497, 4500 smoothing splines, 4497 regression coefficients CANCORR procedure, 759 covariance (SURVEYREG), 4392 SURVEYREG procedure, 4384, 4392 using with SCORE procedure, 4066 regression diagnostics LOGISTIC procedure, 2359 regression effects MIXED procedure, 2743 model parameterization (GLM), 1787 specifying (GLM), 1784 regression estimator SURVEYREG procedure, 4400, 4407 regression method MI procedure, 2541, 2565 regression parameter estimates, example SCORE procedure, 4081 regression table TRANSREG procedure, 4580 regressor effects GENMOD procedure, 1660 rejection sampling MIXED procedure, 2711 relative efficiency MI procedure, 2562 MIANALYZE procedure, 2626 relative increase in variance MI procedure, 2562
MIANALYZE procedure, 2625 relative paths, 337 examples, ODS Graphics, 359 relative risk, 1503 cohort studies, 1489 logit estimate, 1507 Mantel-Haenszel estimate, 1507 power and sample size (POWER), 3457, 3462, 3523, 3524 REML, See restricted maximum likelihood renaming and reusing variables PRINQUAL procedure, 3666 repeated measures ANOVA procedure, 446 CATMOD procedure, 815, 850, 873 contrasts (GLM), 1779 data organization (GLM), 1825 doubly-multivariate design, 1886 examples (CATMOD), 925, 930, 933, 937 examples (GLM), 1781, 1877 GEE (GENMOD), 1611, 1672 GLM procedure, 1777, 1825 hypothesis tests (GLM), 1827, 1830 MIXED procedure, 2662, 2716, 2782 more than one factor (ANOVA), 446, 449 more than one factor (GLM), 1830 multiple populations (CATMOD, 875 one population (CATMOD), 873 RESPONSE statement (CATMOD), 873 specifying factors (CATMOD), 851 transformations, 1830–1832 replaying output examples, ODS Graphics, 361, 363, 369 replicate subjects NLMIXED procedure, 3080 replicated sampling SURVEYSELECT procedure, 4438, 4460 requesting graphs ODS Graphics, 321, 324 resampled data sets MULTTEST procedure, 2962 reseting index counter ODS Graphics, 335 residual chi-square PHREG procedure, 3231 residual maximum likelihood (REML) MIXED procedure, 2738, 2779 Residual plots MIXED procedure, 2758 residual plots plots, ODS Graphics, 322 residuals and partial correlation (PRINCOMP), 3609 CALIS procedure, 650 deviance (PHREG), 3234, 3258, 3302 GENMOD procedure, 1641, 1669, 1670 LOGISTIC procedure, 2360 martingale (PHREG), 3234, 3258, 3302
MDS procedure, 2483, 2487, 2488, 2491, 2503 NLIN procedure, 3014 partial correlation (PRINCOMP), 3608 prefix (CALIS), 601 prefix (TRANSREG), 4591 REG procedure, 3879 Schoenfeld (PHREG), 3234, 3258 score (PHREG), 3234, 3259 weighted Schoenfeld (PHREG), 3235, 3259 weighted score (PHREG), 3260 residuals, details MIXED procedure, 2763 response functions (CATMOD) covariance matrix, 842 formulas, 892 identifying with FACTORS statement, 836 predicted values, 845 related to design matrix, 877, 879 variance formulas, 892 response level ordering LOGISTIC procedure, 2290, 2305, 2329, 2330 SURVEYLOGISTIC procedure, 4259, 4269, 4270 response profile, 73 CATMOD procedure, 820 response surfaces, 4033 canonical analysis, interpreting, 4046 covariates, 4050 experiments, 4045 plotting, 4048 ridge analysis, 4047 response variable, 451, 1784 PHREG procedure, 3218, 3235, 3283 sort order of levels (GENMOD), 1626 restoring weights REG procedure, 3906 restricted maximum likelihood MIXED procedure, 2662, 2738, 2779 restrictions of parameters (CATMOD), 859 resubstitution DISCRIM procedure, 1163 reticular action model, See RAM model retrospective power, 1929, 3411 reverse response level ordering LOGISTIC procedure, 2290, 2305, 2329, 2330 SURVEYLOGISTIC procedure, 4259, 4269, 4270 reweighting observations REG procedure, 3903 ridge analysis RSREG procedure, 4047 ridge regression REG procedure, 3818, 3829, 3848, 3916, 3956 ridging MIXED procedure, 2680, 2738 ridit scores, 1469 risk set
PHREG procedure, 3220, 3240, 3241, 3288 risk weights PHREG procedure, 3229 risks and risk differences, 1486 RMSSTD statement and FREQ statement (CLUSTER), 974 robust cluster analysis, 1379, 1392 estimators (STDIZE), 4138 ROBUSTREG procedure, 3971 computational resources, 4012 INEST= data sets, 4011 ODS graph names, 4014 OUTEST= data sets, 4011 output table names, 4012 syntax, 3982 ROC curve LOGISTIC procedure, 2314, 2357 Roger and Tanimoto coefficient DISTANCE procedure, 1274 Root MSE SURVEYREG procedure, 4388 rotating principal components, 3611 row mean scores statistic, 1502 row proportions SURVEYFREQ procedure, 4212 Roy’s maximum root, 437, 1759, 1828 RPC convergence measure, 3030 RSREG procedure coding variables, 4047, 4052 compared to other procedures, 1735, 4033 computational methods, 4050 confidence intervals, 4042, 4043 Cook’s D influence statistic, 4042 covariates, 4034 eigenvalues, 4050 eigenvectors, 4050 factor variables, 4034 input data sets, 4040, 4041 introductory example, 4034 missing values, 4048 ODS table names, 4054 output data sets, 4040, 4044, 4051, 4052 PRESS statistic, 4042 response variables, 4034 ridge analysis, 4047 RTF destination ODS Graphics, 326, 335 RTF output examples, ODS Graphics, 356, 360 ODS Graphics, 326, 338 Rtf style ODS styles, 346 Russell and Rao similarity coefficient DISTANCE procedure, 1276 Ryan’s multiple range test, 444, 1768, 1815 examples, 1851
S
S convergence measure, 3030 saddle test, definition MODECLUS procedure, 2879 salience of loadings, FACTOR procedure, 1294, 1327 Sampford’s method SURVEYSELECT procedure, 4455 sample design SURVEYFREQ procedure, 4203 sample selection methods SURVEYSELECT procedure, 4446 sample size CATMOD procedure, 887 overview of power concepts (POWER), 3488 See GLMPOWER procedure, 1929 See POWER procedure, 3411 SURVEYSELECT procedure, 4440 sample size adjustment GLMPOWER procedure, 1946 POWER procedure, 3494 sample survey analysis, ordinal data, 816 sampling fraction, 165 sampling frame, 164 sampling rate, 165 SURVEYFREQ procedure, 4194, 4204 SURVEYLOGISTIC procedure, 4251, 4280 SURVEYMEANS procedure, 4324, 4334 SURVEYREG procedure, 4374, 4382 SURVEYSELECT procedure, 4439 sampling unit, 164 sampling weight, 165 sampling zeros and log-linear analyses (CATMOD), 871 and structural zeros (CATMOD), 888 sandwich estimator MIXED procedure, 2676 SAS Companion, 327, 336, 348 SAS current folder, 327 SAS data set DATA step, 21 summarizing, 21, 22 SAS Registry, 327, 346 SAS Registry Editor, 346 SAS Results Viewer, 327, 336 SAS/ETS software, 22 SAS/GRAPH software, 22 SAS/IML software, 22 SAS/INSIGHT software, 22 SAS/OR software, 23 SAS/QC software, 23 Satterthwaite method MIXED procedure, 2694 Satterthwaite t test power and sample size (POWER), 3463, 3472, 3527 Satterthwaite’s approximation, 4775, 4785 testing random effects, 1835 Savage scores NPAR1WAY procedure, 3167 saving graphics image files
ODS Graphics, 336 saving templates examples, ODS Graphics, 368 sawtooth power function, 3541 scale estimates FASTCLUS procedure, 1390, 1392, 1397, 1399, 1400 scale parameter GENMOD procedure, 1653 Scaled Residual MIXED procedure, 2705 Scaled residuals MIXED procedure, 2764 scaling variables DISTANCE procedure, 1251 MODECLUS procedure, 2856 STDIZE procedure, 4143 scalogram analysis CORRESP procedure, 1069 scatter plots plots, ODS Graphics, 343, 351 Scheffé’s multiple-comparison procedure, 444, 1769, 1810 Schoenfeld residuals PHREG procedure, 3234, 3258 Schwarz criterion LOGISTIC procedure, 2341 SURVEYLOGISTIC procedure, 4279 Schwarz’s Bayesian information criterion example (MIXED), 2780, 2794, 2823 MIXED procedure, 2676, 2740, 2750 SCORE procedure CALIS procedure, 571, 586, 643 computational resources, 4075 examples, 4067, 4075 input data set, 4072 OUT= data sets, 4075 output data set, 4072, 4075 PRINCOMP procedure, 3611 regression parameter estimates from REG procedure, 4074 scoring coefficients, 4065 score residuals PHREG procedure, 3234, 3259 score statistics GENMOD procedure, 1668 LOGISTIC procedure, 2343 SURVEYLOGISTIC procedure, 4287 score test PHREG procedure, 3229, 3231, 3246, 3269, 3274, 3276 TPHREG procedure, 4486 score variables interpretation (SCORE), 4074 scores NPAR1WAY procedure, 3166 scoring MIXED procedure, 2674, 2680, 2774
scoring coefficients (SCORE), 4065
scree plot, 1336
screening design, analysis, 1895
screening experiments
    GLMMOD procedure, 1923
SDF, See survival distribution function
selecting graphs
    examples, ODS Graphics, 330, 352, 361
    ODS Graphics, 330
selection methods, See model selection
semiparametric model
    PHREG procedure, 3215
semipartial correlation
    CANCORR procedure, 762
    formula (CLUSTER), 984
semivariogram
    computation (VARIOGRAM), 4876, 4877
    empirical (or experimental) (VARIOGRAM), 4856, 4858
    robust (VARIOGRAM), 4861, 4869, 4877
sensitivity
    CATMOD procedure, 941
sequential random sampling
    SURVEYSELECT procedure, 4448
serpentine sorting
    SURVEYSELECT procedure, 4445
Shape distance coefficient
    DISTANCE procedure, 1271
Shewhart control charts, 23
Sidak (MULTTEST)
    p-value adjustments, 2942, 2956, 2972
Sidak's t test, 1769, 1810
Sidak's adjustment
    GLM procedure, 1754
    MIXED procedure, 2688
    MULTTEST procedure, 2942, 2956, 2972
Sidak's inequality, 444
Siegel-Tukey scores
    NPAR1WAY procedure, 3167
significance level
    CALIS procedure, 588
    entry (PHREG), 3230
    hazards ratio confidence interval (PHREG), 3233
    removal (PHREG), 3230, 3277
significance tests
    MODECLUS procedure, 2876, 2916
sill
    KRIGE2D procedure, 2047
SIM2D procedure
    Cholesky root, 4107
    computational details, 4109
    conditional and unconditional simulation, 4091
    conditional distributions of multivariate normal random variables, 4108
    examples, 4092, 4111
    Gaussian assumption, 4091
Gaussian random field, 4091 LU decomposition, 4106 memory usage, 4110 output data sets, 4099, 4110 OUTSIM= data set, 4110 quadratic form, 4108 simulation of spatial random fields, 4106–4109 similarity data MDS procedure, 2471, 2478, 2484 Similarity Ratio coefficient DISTANCE procedure, 1272 simple cluster-seeking algorithm, 1381 simple effects GLM procedure, 1758, 1817 MIXED procedure, 2691 simple kappa coefficient, 1493, 1495 Simple Matching coefficient DISTANCE procedure, 1274 Simple Matching dissimilarity coefficient DISTANCE procedure, 1274 simple random cluster sampling SURVEYREG procedure, 4397 simple random sampling SURVEYMEANS procedure, 4315 SURVEYREG procedure, 4365, 4395 SURVEYSELECT procedure, 4423, 4447 simplicity functions CALIS procedure, 607 FACTOR procedure, 1294, 1316, 1329 SIMPLS method PLS procedure, 3380 simulation of spatial random fields SIM2D procedure, 4106–4109 simulation-based adjustment GLM procedure, 1754 MIXED procedure, 2688 single linkage CLUSTER procedure, 967, 982 singularities MIXED procedure, 2775 REG procedure, 3917 singularity MI procedure, 2521 singularity checking CANCORR procedure, 762 GLM procedure, 1750, 1752, 1758, 1772 singularity criterion CALIS procedure, 590 contrast matrix (GENMOD), 1632 contrast matrix (LOGISTIC), 2300 contrast matrix (SURVEYLOGISTIC), 4258 contrast matrix (TPHREG), 4481 covariance matrix (CALIS), 588, 590, 591 information matrix (GENMOD), 1642 PHREG procedure, 3232 TRANSREG procedure, 4579 singularity level SURVEYREG procedure, 4377, 4379 singularity tolerances
NLMIXED procedure, 3072 Size distance coefficient DISTANCE procedure, 1271 skewness CALIS procedure, 658 displayed in CLUSTER procedure, 972 SMM multiple-comparison method, 444, 1769, 1811 smoothing parameter cluster analysis, 979 MODECLUS procedure, 2864, 2871 optimal (DISCRIM), 1162 smoothing parameter, default MODECLUS procedure, 2872 smoothing spline transformation PRINQUAL procedure, 3663 TRANSREG procedure, 4564, 4596 Sokal and Sneath 1 coefficient DISTANCE procedure, 1274 Sokal and Sneath 3 coefficient DISTANCE procedure, 1275 Somers’ D statistics, 1474, 1478 spacing STDIZE procedure, 4139 spatial anisotropic exponential structure MIXED procedure, 2721 spatial covariance structures examples (MIXED), 2723 MIXED procedure, 2722, 2730, 2774 spatial prediction VARIOGRAM procedure, 4851, 4852 Spearman rank correlation coefficient, 1474, 1480 specificity CATMOD procedure, 941 spherical semivariogram model KRIGE2D procedure, 2046, 2047 sphericity tests, 449, 1780, 1881 spline t-options PRINQUAL procedure, 3665 TRANSREG procedure, 4566 spline transformation PRINQUAL procedure, 3662 TRANSREG procedure, 4564, 4611 splines Bayesian confidence intervals, 4513 goodness of fit, 4514 regression model, 4497, 4513, 4518 thin-plate smoothing, 4497 TPSPLINE procedure, 4497 TRANSREG procedure, 4560, 4561, 4614, 4637, 4678, 4709 split-plot design ANOVA procedure, 451, 470, 472, 475 generating with PLAN procedure, 3352 MIXED procedure, 2734, 2777 square root difference cloud VARIOGRAM procedure, 4882 Squared Correlation dissimilarity coefficient DISTANCE procedure, 1272 Squared Correlation similarity coefficient
DISTANCE procedure, 1272 Squared Euclidean distance coefficient DISTANCE procedure, 1271 squared multiple correlation CALIS procedure, 657, 686 CANCORR procedure, 762 squared partial correlation CANCORR procedure, 762 squared semipartial correlation CANCORR procedure, 762 formula (CLUSTER), 984 SSCP matrix displaying, for multivariate tests, 438, 439 for multivariate tests, 437 for multivariate tests (GLM), 1759, 1761 stacking table SURVEYMEANS procedure, 4324 standard deviation CLUSTER procedure, 972 SURVEYMEANS procedure, 4341 standard error PHREG procedure, 3225, 3233, 3235, 3269 SURVEYMEANS procedure, 4338 TPHREG procedure, 4487 standard error of ratio SURVEYMEANS procedure, 4339 standard error ratio PHREG procedure, 3269 TPHREG procedure, 4487 standard linear model MIXED procedure, 2663 STANDARD procedure, 21 standardized values, 21 standardization comparisons between DISTANCE and STDIZE procedures, 1251 standardized score process PHREG procedure, 3267, 3271 standardizing cluster analysis (STDIZE), 4143 CLUSTER procedure, 972 MODECLUS procedure, 2856 raw data (SCORE), 4066 redundancy variables (TRANSREG), 4591 TRANSREG procedure, 4572 values (STANDARD), 21 values (STDIZE), 4119 star (*) operator TRANSREG procedure, 4558 stationary point NLMIXED procedure, 3100 statistic-keywords SURVEYMEANS procedure, 4326 statistical assumptions (GLM), 1783 quality control, 23 tests (MULTTEST), 2948 statistical computation SURVEYMEANS procedure, 4336
statistical computations
    SURVEYFREQ procedure, 4206
Statistical Graphics Using ODS, see ODS Graphics
Statistical style
    ODS styles, 333
STD option (MODECLUS), 2856
STD= option (DISTANCE), 1265
STDIZE procedure
    AGK estimate, 4139
    analyzing data in groups, 4134
    Andrew's wave estimate, 4139
    breakdown point and efficiency, 4138
    comparisons of quantile computation, PCTLMTD option, 4139
    computational methods, PCTLDEF option, 4140
    Euclidean length, 4138
    examples, 4119, 4143
    final output value, 4119
    formulas for statistics, 4138
    fuzz factor, 4130
    Huber's estimate, 4139
    initial estimates for A estimates, 4131
    input data set (METHOD=IN()), 4137
    methods resistant to clustering, 4138
    methods resistant to outliers, 4124, 4138
    Minkowski metric, 4138
    missing values, 4131, 4133, 4141
    normalization, 4131, 4133
    one-pass quantile computations, 4139
    OUT= data set, 4130, 4141
    output data sets, 4132, 4141
    output table names, 4142
    OUTSTAT= data set, 4141
    quantile computation, 4119, 4139
    robust estimators, 4138
    spacing, 4139
    standardization methods, 4119, 4136
    standardization with weights, 4135
    Tukey's biweight estimate, 4127, 4139
    tuning constant, 4127, 4137
    unstandardization, 4133
    weights, 4135
step halving
    PHREG procedure, 3245
step length
    CALIS procedure, 581
step length options
    NLMIXED procedure, 3096
STEPDISC procedure
    average squared canonical correlation, 4173
    computational resources, 4171
    input data sets, 4170
    introductory example, 4158
    memory requirements, 4171
    methods, 4157
    missing values, 4170
    ODS table names, 4174
    Pillai's trace, 4173
stepwise selection, 4159 time requirements, 4171 tolerance, 4173 Wilks’ lambda, 4173 stepdown methods GLM procedure, 1814 MULTTEST procedure, 2957, 2972 stepwise discriminant analysis, 4157 stepwise selection LOGISTIC procedure, 2317, 2341, 2391 PHREG procedure, 3229, 3265, 3272 REG procedure, 3800, 3874 STEPDISC procedure, 4159 stored data algorithm, 986 stored distance algorithms, 986 strata SURVEYFREQ procedure, 4196 SURVEYLOGISTIC procedure, 4266 SURVEYMEANS procedure, 4332 SURVEYREG procedure, 4381 strata variables PHREG procedure, 3237 programming statements (PHREG), 3235 strata weights MULTTEST procedure, 2951 stratification SURVEYFREQ procedure, 4196, 4203 SURVEYLOGISTIC procedure, 4266 SURVEYMEANS procedure, 4332, 4346 SURVEYREG procedure, 4381, 4391 stratified analysis FREQ procedure, 1431, 1450 PHREG procedure, 3216, 3237 stratified cluster sample SURVEYMEANS procedure, 4350 stratified sampling, 164 SURVEYMEANS procedure, 4318 SURVEYREG procedure, 4368, 4401 SURVEYSELECT procedure, 4425, 4444 stratified table example, 1540 stratified tests LIFETEST procedure, 2150, 2155, 2156, 2158, 2167, 2180, 2187 stratum collapse SURVEYREG procedure, 4388, 4411 stress formula MDS procedure, 2479, 2480, 2489 strip-split-plot design ANOVA procedure, 475 structural equation (CALIS) definition, 585, 625, 658 dependent variables, 625 models, 552 structural model example (CALIS), 555 COSAN model, 561 factor analysis model, 564 LINEQS model, 558, 562 LISREL model, 559
path diagram, 555, 600 RAM model, 557, 563 Stuart’s tau-c statistic, 1474, 1478 Student’s multiple range test, 444, 1769, 1814 Studentized maximum modulus pairwise comparisons, 444, 1769, 1811 Studentized Residual MIXED procedure, 2704 studentized residual, 1774, 3899 Studentized residuals external, 2768 internal, 2768 MIXED procedure, 2768 style attributes, modifying examples, ODS Graphics, 374, 376, 378 style elements, modifying examples, ODS Graphics, 374, 376, 378 subdomain analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336 subgroup analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336 subject effect MIXED procedure, 2683, 2715, 2721, 2776, 2782 subject weights MDS procedure, 2471, 2477 subpopulation GENMOD procedure, 1637 LOGISTIC procedure, 2308, 2316, 2355 PROBIT procedure, 3742, 3745, 3761 subpopulation analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336 sum-to-zero assumptions, 1835 summary of commands MIXED procedure, 2672 summary statistics REG procedure, 3896 sums of squares GLM procedure, 1772, 1773 Type II (GLM), 1773 Type II (TRANSREG), 4580 supported operating environments ODS Graphics, 348 supported procedures ODS Graphics, 348 suppressing output CANCORR procedure, 761 GENMOD procedure, 1627 MI procedure, 2520 surface plots plots, ODS Graphics, 324, 360 surface trend VARIOGRAM procedure, 4854 survey data analysis, 161 survey sampling, 161, see also SURVEYFREQ procedure
see also SURVEYMEANS procedure see also SURVEYREG procedure see also SURVEYSELECT procedure cluster sampling, 164 data analysis, 4185 descriptive statistics, 4315 finite population correction factor, 165 first-stage sampling unit, 165 fpc, 165 multistage sampling, 165 population, 164 population total, 165 PPS, 161 primary sampling units (PSUs), 165 regression analysis, 4365 sample selection, 4421 sampling fraction, 165 sampling frame, 164 sampling rate, 165 sampling unit, 164 sampling weight, 165 stratified sampling, 164 survey weight, 165 SURVEYMEANS procedure, 167 SURVEYREG procedure, 167 SURVEYSELECT procedure, 167 variance estimation, 166 survey weight, 165 SURVEYFREQ procedure, 159, 4185 alpha level, 4199 chi-square test, 4216 cluster, 4195 clustering, 4203 coefficient of variation, 4214 column proportions, 4212 confidence limits, 4213 covariance, 4210 crosstabulation tables, 4227 data summary table, 4225 default tables, 4197 degrees of freedom, 4214 design effect, 4215 design-adjusted chi-square test, 4216 displayed output, 4225 domain analysis, 4205 expected weighted frequency, 4215 introductory example, 4185 missing values, 4205 multiway tables, 4227 ODS table names, 4230 one-way frequency tables, 4226 order of variables, 4193 output table names, 4230 population total, 4194, 4204 proportions, 4210 Rao-Scott chi-square test, 4216 Rao-Scott likelihood ratio test, 4219 row proportions, 4212 sample design, 4203
    sampling rate, 4194, 4204
    statistical computations, 4206
    statistical test tables, 4229
    strata, 4196
    stratification, 4196, 4203
    stratum information table, 4225
    subdomain analysis, 4205
    subgroup analysis, 4205
    subpopulation analysis, 4205
    Taylor series method, 4206
    totals, 4209
    unequal weighting, 4204
    Wald chi-square test, 4221
    Wald log-linear chi-square test, 4223
SURVEYLOGISTIC procedure, 159
    Akaike's information criterion, 4279
    cluster, 4255
    computational details, 4282
    confidence intervals, 4288
    convergence criterion, 4261–4263
    customized odds ratio, 4267
    displayed output, 4292
    EFFECT Parameterization, 4270
    estimability checking, 4258
    existence of MLEs, 4277
    first-stage sampling rate, 4251
    Fisher's scoring method, 4264, 4265, 4276
    GLM Parameterization, 4271
    gradient, 4287
    Hessian matrix, 4264, 4287
    infinite parameter estimates, 4263
    initial values, 4280
    link function, 4243, 4263, 4273
    log odds, 4289
    maximum likelihood algorithms, 4275
    Medical Expenditure Panel Survey (MEPS), 4302
    missing values, 4251, 4268
    model fitting criteria, 4279
    Newton-Raphson algorithm, 4264, 4265, 4277
    odds ratio confidence limits, 4262
    odds ratio estimation, 4288
    ORDINAL Parameterization, 4271
    ORTHEFFECT Parameterization, 4272
    ORTHORDINAL Parameterization, 4272
    ORTHOTHERM Parameterization, 4272
    ORTHPOLY Parameterization, 4273
    ORTHREF Parameterization, 4273
    output table names, 4295
    Parameterization, 4270
    POLY Parameterization, 4271
    POLYNOMIAL Parameterization, 4271
    population total, 4252, 4280
    primary sampling units (PSUs), 4281
    rank correlation, 4292
    REF Parameterization, 4272
    REFERENCE Parameterization, 4272
    response level ordering, 4270
    reverse response level ordering, 4259, 4269
sampling rate, 4251, 4280 Schwarz criterion, 4279 score statistics, 4287 singular contrast matrix, 4258 strata, 4266 stratification, 4266 testing linear hypotheses, 4266, 4288 Variance Estimation, 4282 SURVEYMEANS procedure, 159, 4315 categorical variable, 4329, 4337, 4341 class level, 4346 cluster, 4329 coefficient of variation, 4340 confidence level, 4323 confidence limits, 4340, 4342 data summary, 4346 degrees of freedom, 4340, 4344 denominator variable, 4330 descriptive statistics, 4347 domain analysis, 4336, 4348 domain mean, 4343 domain statistics, 4342 domain total, 4343 domain variable, 4330 empty stratum, 4334, 4344 estimated frequency, 4341 estimated total, 4341 first-stage sampling rate, 4324 indicator variable, 4337 mean per element, 4337 means, 4337 missing values, 4323, 4333, 4358 numerator variable, 4330 ODS table names, 4349 output data set, 4321, 4345 output table names, 4349 population total, 4326, 4334 primary sampling units (PSUs), 4335 proportion estimation, 4341 ratio, 4330, 4339 ratio analysis, 4330, 4339, 4348 rectangular table, 4324 sampling rate, 4324, 4334 simple random sampling, 4315 stacking table, 4324 standard deviation of the total, 4341 standard error of ratio, 4339 standard error of the mean, 4338 statistic-keywords, 4326 statistical computation, 4336 strata, 4332 stratification, 4332, 4346 stratified cluster sample, 4350 stratified sampling, 4318 subdomain analysis, 4336 subgroup analysis, 4336 subpopulation analysis, 4336 t test, 4339 valid observation, 4346
variance of the mean, 4338 variance of the total, 4341 SURVEYREG procedure, 159, 4365 ADJRSQ, 4380 Adjusted R-square, 4387 analysis of contrasts, 4393 analysis of variance, 4387 ANOVA, 4380, 4387, 4392 classification level, 4391 classification variables, 4375 cluster, 4376 coefficients of contrast, 4392 coefficients of estimate, 4393 computational details, 4384 confidence level, 4373 confidence limits, 4380 contrasts, 4376, 4389 data summary, 4390 degrees of freedom, 4385 design effect, 4388 design summary, 4390 effect testing, 4386, 4392 estimable functions, 4378, 4393 first-stage sampling rate, 4374 inverse matrix of X0 X, 4391 missing values, 4382 MSE, 4388 multiple R-square, 4387 output data set, 4372, 4393 output table names, 4394 pooled stratum, 4389 population total, 4374, 4382 primary sampling units (PSUs), 4383 regression coefficients, 4384, 4392 regression estimator, 4400, 4407 Root MSE, 4388 sampling rate, 4374, 4382 simple random cluster sampling, 4397 simple random sampling, 4365, 4395 singularity level, 4377, 4379 strata, 4381 stratification, 4381, 4391 stratified sampling, 4368, 4401 stratum collapse, 4388, 4411 variance estimation, 4385 Wald test, 4386, 4389 X0 X matrix, 4391 SURVEYSELECT procedure, 167, 4421 Brewer’s method, 4453 certainty size measure, 4432 Chromy’s method, 4448, 4452 control sorting, 4443, 4445 displayed output, 4458 dollar-unit sampling, 4466 initial seed, 4441 introductory example, 4422 joint selection probabilities, 4433 maximum size measure, 4433 minimum size measure, 4436
missing values, 4445 Murthy’s method, 4454 output data sets, 4456 output table names, 4460 PPS sampling, 4463 PPS sampling, with replacement, 4451 PPS sampling, without replacement, 4449 PPS sequential sampling, 4452 PPS systematic sampling, 4451 replicated sampling, 4438, 4460 Sampford’s method, 4455 sample selection methods, 4446 sample size, 4440 sampling rate, 4439 sequential random sampling, 4448 serpentine sorting, 4445 simple random sampling, 4423, 4447 stratified sampling, 4425, 4444 systematic random sampling, 4448 unrestricted random sampling, 4447 survival analysis power and sample size (POWER), 3473, 3481, 3533 survival distribution function LIFETEST procedure, 2149, 2171, 2184 PHREG procedure, 3239 survival function LIFEREG procedure, 2083, 2111 survival models, parametric, 2083 survival times PHREG procedure, 3215, 3281, 3283 survivor function, See survival distribution function definition (PHREG), 3239 estimates (LOGISTIC), 2460 estimates (PHREG), 3225, 3235, 3261, 3299 PHREG procedure, 3215, 3235, 3240 sweep algorithm REG procedure, 3917 symmetric and positive definite (SIM2D) covariance matrix, 4107 symmetric binary variable DISTANCE procedure, 1250 syntax ROBUSTREG procedure, 3982 systematic random sampling SURVEYSELECT procedure, 4448
T

t statistic
    approximate, 4784
    for equality of means, 4783
t test
    MULTTEST procedure, 2946, 2955, 2968
    power and sample size (POWER), 3432, 3437, 3448, 3455, 3463, 3471, 3508, 3509, 3518, 3526
    SURVEYMEANS procedure, 4339
t test for correlation
    power and sample size (POWER), 3426, 3429, 3503
t value
    CALIS procedure, 649
    displaying (CALIS), 686
t-square statistic
    CLUSTER procedure, 972, 984
table names
    MIXED procedure, 2752
table scores, 1469
tables
    crosstabulation (SURVEYFREQ), 4227
    frequency and crosstabulation (FREQ), 1431, 1450
    frequency and crosstabulation (SURVEYFREQ), 4196
    multiway, 1518, 1520, 1521
    one-way frequency, 1517, 1518
    one-way frequency (SURVEYFREQ), 4226
    one-way, tests, 1469, 1470
    two-way, tests, 1470, 1471
TABLES statement, use
    CORRESP procedure, 1072
TABULATE procedure, 22
Tarone's adjustment
    Breslow-Day test, 1508
Tarone-Ware test for homogeneity
    LIFETEST procedure, 2150, 2168
    power and sample size (POWER), 3473, 3483, 3533
Taylor series method
    SURVEYFREQ procedure, 4206
Template Editor window, 340, 365
TEMPLATE procedure
    examples, ODS Graphics, 363, 369, 371, 373, 379
    graph template language, 342
template stores, 338
    Sashelp.Tmplmst, 338, 341, 345, 346, 365
    Sasuser.Templat, 340, 341, 346
    user-defined, 341, 342
templates
    displaying contents of template, 278
    in SASUSER library, 278
    modifying, 278, 303, 309
    style templates, 279
    table templates, 279
    TEMPLATE procedure, 278
Templates window, 339, 340, 345, 365
termination criteria (CALIS), See optimization
test components
    MIXED procedure, 2702
test indices
    constraints (CALIS), 584
test set classification
    DISCRIM procedure, 1163
test set validation
    PLS procedure, 3384
testing linear hypotheses LOGISTIC procedure, 2327, 2358 MIANALYZE procedure, 2618, 2628 SURVEYLOGISTIC procedure, 4266, 4288 tests, hypothesis examples (GLM), 1850 GLM procedure, 1749 tetrachoric correlation coefficient, 1460, 1482 theoretical correlation INBREED procedure, 1977 thin-plate smoothing splines, 4497 bootstrap, 4531, 4535 large data sets, 4527 theoretical foundation, 4511 three-way multidimensional scaling MDS procedure, 2471 threshold response rate, 3705 tick marks, modifying examples, ODS Graphics, 373 ties checking for in CLUSTER procedure, 971 MDS procedure, 2485 PHREG procedure, 3216, 3219, 3228, 3241, 3268 TPHREG procedure, 4486 time requirements ACECLUS procedure, 411 CLUSTER procedure, 986 FACTOR procedure, 1331, 1335 VARCLUS procedure, 4815, 4818 time series data REG procedure, 3915 time-dependent covariates PHREG procedure, 3216, 3220, 3222, 3224, 3233, 3236 time-series plot MI procedure, 2556 Tobit model LIFEREG procedure, 2085, 2129 Toeplitz structure example (MIXED), 2819 MIXED procedure, 2721 total covariance matrix MIANALYZE procedure, 2627 total variance MI procedure, 2561 MIANALYZE procedure, 2625 totals SURVEYFREQ procedure, 4209 TPHREG procedure displayed output, 4486 estimability checking, 4481 global null hypothesis, 4486 hierarchy, 4475 iteration history, 4486 likelihood ratio test, 4486 miscellaneous changes from PROC PHREG, 4485 model hierarchy, 4473, 4475
ODS table names, 4488 parameter estimates, 4486, 4487 score test, 4486 singular contrast matrix, 4481 standard error, 4487 standard error ratio, 4487 ties, 4486 Wald test, 4486 TPSPLINE procedure bootstrap, 4531, 4535 computational formulas, 4511 large data sets, 4527 nonhomogeneous variance, 4514 ODS table names, 4515 partial spline model, 4515 smoothing parameter, 4514 smoothing penalty, 4525 trace record examples, ODS Graphics, 330, 352, 363 trace W method, See Ward’s method traditional high-resolution graphics LIFETEST procedure, 2159 traditional high-resolution graphics (LIFETEST), 2159 catalog, 2161 description, 2160 global annotate, 2160 local annotate, 2162 training data set DISCRIM procedure, 1139 transformation MI procedure, 2533 transformation matrix orthonormalizing, 438, 1761 transformation options PRINQUAL procedure, 3663 TRANSREG procedure, 4564 transformation standardization TRANSREG procedure, 4572 transformations affine(DISTANCE), 1250 ANOVA procedure, 448 cluster analysis, 958 for multivariate ANOVA, 437, 1759 identity(DISTANCE), 1250 linear(DISTANCE), 1250 many-to-one(DISTANCE), 1249 MDS procedure, 2472, 2481, 2482, 2484, 2487– 2489 monotone increasing(DISTANCE), 1249 oblique, 1294, 1298 one-to-one(DISTANCE), 1249 orthogonal, 1294, 1298 power(DISTANCE), 1250 repeated measures, 1830–1832 strictly increasing(DISTANCE), 1249 transformations for repeated measures GLM procedure, 1779
transformed data MDS procedure, 2491 transformed distances MDS procedure, 2491 transforming ordinal variables to interval DISTANCE procedure, 1250 TRANSREG procedure additive models, 4574 algorithms, 4576 alpha level, 4574 ANOVA, 4650 ANOVA codings, 4662 ANOVA table, 4580, 4615 ANOVA table in OUTTEST= data set, 4626 asterisk (*) operator, 4558 at sign (@) operator, 4558 B-spline basis, 4560, 4614 bar (|) operator, 4558 Box Cox Example, 4721 Box Cox transformations, 4595 Box-Cox alpha, 4570 Box-Cox convenient lambda, 4570 Box-Cox convenient lambda list, 4570 Box-Cox geometric mean, 4571 Box-Cox lambda, 4571 Box-Cox parameter, 4566 CANALS method, 4576 canonical correlation, 4584, 4593 canonical variables, 4584 casewise deletion, 4578 cell-means coding, 4569, 4594, 4662 center-point coding, 4568, 4668, 4670 centering, 4675 character OPSCORE variables, 4604 choice experiments, 4660 CLASS variables, prefix, 4575 classification variables, 4560, 4569 coefficients, redundancy, 4590 confidence limits, 4575 confidence limits, individual, 4585 confidence limits, mean, 4585 confidence limits, prefix, 4584, 4585, 4587, 4588 conjoint analysis, 4581, 4593, 4690, 4694 constant transformations, avoiding, 4603 constant variables, 4578, 4604 degrees of freedom, 4615 dependent variable list, 4588 dependent variable name, 4587 design matrix, 4586 details of model, 4575 deviations-from-means coding, 4568, 4594, 4654, 4668, 4670 displaying iteration histories, 4576 dummy variables, 4560, 4569, 4586 dummy variables example, 4654 duplicate variable names, 4626 effect coding, 4549, 4568, 4654, 4668, 4670 excluded observations, 4605
Subject Index excluding nonscore observations, 4581 expansions, 4560 explicit intercept, 4605 frequency variable, 4557 full-rank coding, 4568 GLMMOD alternative, 4586, 4654 history, iteration, 4576 hypothesis tests, 4580, 4615 id variables, 4557 ideal point model, 4593 ideal points, 4717 identity transformation, 4564 implicit intercept, 4605 independent variable list, 4588 individaul model fitting, 4576 initialization, 4575 interaction effects, 4558, 4594 interactions, quantitative, 4594 intercept, 4605 intercept, none, 4577 iterations, 4601 iterations, maximum number of, 4576 iterations, restarting, 4579, 4602 knots, 4567, 4568, 4613, 4678 less-than-full-rank model, 4569, 4594, 4672 leverage, 4587 limiting displayed output, 4579 linear regression, 4592 linear transformation, 4563, 4610 macros, 4588 main effects, 4558, 4594 maximum redundancy analysis, 4576 METHOD=MORALS rolled output data set, 4622 METHOD=MORALS variable names, 4626 metric conjoint analysis, 4694 missing value restoration option, 4590 missing values, 4576, 4578, 4599, 4600, 4605 monotone regression function, 4629 monotone transformations, 4593 monotonic B-spline transformation, 4563, 4611 monotonic transformation, ties not preserved, 4564, 4610 monotonic transformation, ties preserved, 4563, 4610 MORALS dependent variable name, 4587 MORALS method, 4576 multiple redundancy coefficients, 4590 multiple regression, 4593 multivariate multiple regression, 4593 names of variables, 4571 nonlinear regression functions, 4593, 4628 nonlinear transformations, 4593 nonmetric conjoint analysis, 4690 nonoptimal transformations, 4562 optimal scaling, 4609 optimal scoring, 4563, 4610 optimal transformations, 4563 order of CLASS levels, 4569, 4578
OUT= data set, 4582, 4623 output table names, 4676 OUTTEST= data set, 4556 part-worth utilities, 4691 passive observations, 4605 piecewise polynomial splines, 4561, 4614 %PLOTIT macro, 4617, 4717, 4718 point models, 4605 polynomial-spline basis, 4561, 4614 predicted values, 4591 preference mapping, 4593, 4717 preference models, 4586 prefix, canonical variables, 4584, 4585 prefix, redundancy variables, 4592 prefix, residuals, 4591 redundancy analysis, 4576, 4590, 4592, 4593, 4606 redundancy analysis, standardization, 4591 reference level, 4569, 4579, 4591 reference-cell coding, 4569, 4594, 4664, 4666 regression table, 4580 regression table in OUTTEST= data set, 4626 reiteration, 4579, 4602 renaming and reusing variables, 4571 residuals, 4592 residuals, prefix, 4591 separate regression functions, 4633 short output, 4579 singularity criterion, 4579 smoothing spline transformation, 4564 smoothing splines, 4596 spline t-options, 4566 spline transformation, 4564, 4611 splines, 4560, 4561, 4614, 4637, 4678, 4709 standardization, redundancy variables, 4591 standardization, transformation, 4572, 4580 star (*) operator, 4558 transformation options, 4564 transformation standardization, 4572, 4580 Type II sums of squares, 4580 types of observations, 4581 utilities, 4581, 4691, 4694 utilities in OUTTEST= data set, 4626 variable list macros, 4588 variable names, 4625 vector preference models, 4586 weight variable, 4592 – TYPE– , 4623 treatments in a design specifying in PLAN procedure, 3345 tree diagram binary tree, 4743 branch, 4743 children, 4743 definitions, 4743 leaves, 4743 node, 4743 parent, 4743 root, 4743
tree diagrams cluster analysis, 4743 TREE procedure, 4743 missing values, 4756 OUT= data sets, 4756 output data sets, 4756 output table names, 4757 trend test, 1490, 1543 trend tests LIFETEST procedure, 2150, 2168, 2179, 2187 TRIM= option and other options (CLUSTER), 969, 970, 972 triweight kernel (DISCRIM), 1160 trust region (TR), 3073 trust region algorithm CALIS procedure, 578, 581, 665 TTEST procedure alpha level, 4780 Cochran and Cox t approximation, 4775, 4780, 4785 compared to other procedures, 1735 computational method, 4783 confidence intervals, 4780 input data set, 4783 introductory example, 4776 missing values, 4783 ODS table names, 4789 paired comparisons, 4775, 4793 paired t test, 4782 paired-difference t test, 4775 Satterthwaite’s approximation, 4775, 4785 uniformly most powerful unbiased test, 4786 Tucker and Lewis’s Reliability Coefficient, 1337 Tukey’s adjustment GLM procedure, 1754 MIXED procedure, 2688 Tukey’s studentized range test, 445, 1769, 1811, 1812 Tukey-Kramer test, 445, 1769, 1811, 1812 2D geometric anisotropic structure MIXED procedure, 2721 two-sample t-test, 4775, 4789 power and sample size (POWER), 3415, 3463, 3471, 3472, 3526, 3527, 3529, 3566 two-stage density linkage CLUSTER procedure, 967, 982 Type 1 analysis GENMOD procedure, 1615, 1665 Type 1 error, 3488 Type 1 error rate repeated multiple comparisons, 1808 Type 1 estimation MIXED procedure, 2677 Type 2 error, 3488 Type 2 estimation MIXED procedure, 2677 Type 3 analysis GENMOD procedure, 1615, 1665 Type 3 estimation MIXED procedure, 2677
Type H covariance structure, 1829 Type I sum of squares computing in GLM, 1835 displaying (GLM), 1772 estimable functions for, 1771 estimable functions for (GLM), 1794 examples, 1858 Type I testing MIXED procedure, 2696 Type II sum of squares computing in GLM, 1835 displaying (GLM), 1773 estimable functions for, 1771 estimable functions for (GLM), 1796 examples, 1858 Type II sums of squares TRANSREG procedure, 4580 Type II testing MIXED procedure, 2696 Type III sum of squares displaying (GLM), 1773 estimable functions for, 1771 estimable functions for (GLM), 1797 examples, 1858 Type III testing MIXED procedure, 2696, 2751 Type IV sum of squares computing in GLM, 1835 displaying (GLM), 1773 estimable functions for, 1771 estimable functions for (GLM), 1797 examples, 1858 TYPE= data sets FACTOR procedure, 1325
U

ultra-Heywood cases, FACTOR procedure, 1333
ultrametric, definition, 985
unbalanced data
    caution (ANOVA), 423
unbalanced design
    GLM procedure, 1735, 1804, 1833, 1856, 1882
    NESTED procedure, 2990
uncertainty coefficients, 1474, 1483, 1484
unequal weighting
    SURVEYFREQ procedure, 4204
unfolding
    MDS procedure, 2471
uniform kernel (DISCRIM), 1159
uniform-kernel estimation
    CLUSTER procedure, 972, 978
uniformly most powerful unbiased test
    TTEST procedure, 4786
unique factor
    defined for factor analysis, 1292
univariate distributions, example
    MODECLUS procedure, 2889
UNIVARIATE procedure, 22
univariate tests
    repeated measures, 1828, 1829
unknown or missing parents
    INBREED procedure, 1982
unrestricted random sampling
    SURVEYSELECT procedure, 4447
unsquared Euclidean distances, 969, 971
unstructured correlations
    MIXED procedure, 2721
unstructured covariance matrix
    MIXED procedure, 2721
unweighted least-squares factor analysis, 1291
unweighted pair-group clustering,
    See average linkage
    See centroid method
update methods
    NLMIXED procedure, 3073
UPGMA, See average linkage
UPGMC, See centroid method
utilities
    TRANSREG procedure, 4691, 4694
V

V matrix
    MIXED procedure, 2716
valid observation
    SURVEYMEANS procedure, 4346
value lists
    GLMPOWER procedure, 1945
    POWER procedure, 3490
Van der Waerden scores
    NPAR1WAY procedure, 3167
VAR statement, use
    CORRESP procedure, 1072
VARCLUS procedure,
    See also TREE procedure
    alternating least-squares, 4800
    analyzing data in groups, 4813
    centroid component, 4802, 4808
    cluster components, 4799
    cluster splitting, 4800, 4806, 4810, 4811
    cluster, definition, 4799
    computational resources, 4818
    controlling number of clusters, 4810
    eigenvalues, 4800, 4810
    how to choose options, 4815
    initializing clusters, 4809
    interpreting output, 4818
    iterative reassignment, 4800
    MAXCLUSTERS= option, using, 4815
    MAXEIGEN= option, using, 4815
    memory requirements, 4818
    missing values, 4814
    multiple group component analysis, 4811
    nearest component sorting phase, 4800
    number of clusters, 4800, 4806, 4810, 4811
    orthoblique rotation, 4800, 4809
    output data sets, 4811, 4816
output table names, 4820 OUTSTAT= data set, 4811, 4816 OUTTREE= data set, 4817 PROPORTION= option, using, 4815 search phase, 4800 splitting criteria, 4800, 4806, 4810, 4811 stopping criteria, 4806 time requirements, 4815, 4818 TYPE=CORR data set, 4816 VARCOMP procedure classification variables, 4836 compared to MIXED procedure, 2664, 2665 compared to other procedures, 1735 computational details, 4838 dependent variables, 4831, 4836 example (MIXED), 2795 fixed effects, 4831, 4837 fixed-effects model, 4837 introductory example, 4832 methods of estimation, 4831, 4842 missing values, 4837 mixed model, 4837 negative variance components, 4838 ODS table names, 4840 random effects, 4831, 4837 random-effects model, 4837 relationship to PROC MIXED, 4841 variability, 4832 variance component, 4838 variability VARCOMP procedure, 4832 variable (PHREG) censoring, 3218 variable importance for projection, 3396 variable list macros TRANSREG procedure, 4588 variable selection CALIS procedure, 662 discriminant analysis, 4157 variable-radius kernels MODECLUS procedure, 2870 variable-reduction method, 4799 variables, See also classification variables frequency (PRINQUAL), 3658 renaming (PRINQUAL), 3666 reusing (PRINQUAL), 3666 weight (PRINQUAL), 3667 variables, unaddressed INBREED procedure, 1976 variance component VARCOMP procedure, 4838 variance components, 2985 MIXED procedure, 2662, 2721 Variance Estimation SURVEYLOGISTIC procedure, 4282 variance estimation, 166 SURVEYREG procedure, 4385 variance function
GENMOD procedure, 1614 variance inflation factors (VIF) REG procedure, 3818, 3958 variance of means LATTICE procedure, 2074 variance of the mean SURVEYMEANS procedure, 4338 variance of the total SURVEYMEANS procedure, 4341 variance ratios MIXED procedure, 2707, 2714 variances FACTOR procedure, 1320 ratio of, 4775, 4784, 4789 variances, test for equal, 1150 varimax method, 1291, 1317, 1318 VARIOGRAM procedure angle classes, 4866, 4868, 4870, 4872, 4873 angle tolerance, 4866, 4868, 4870, 4872, 4873 anisotropic models, 4871 bandwidth, 4866, 4870, 4875 continuity measures, 4851 contour map, 4856 correlation length, 4860 cutoff distance, 4869 DATA= data set, 4865 distance classification, 4874 distance interval, 4856 empirical (or experimental) semivariogram, 4858 examples, 4852, 4882 histogram of pairwise distances, 4856 input data set, 4865 intervals, number of, 4856 kriging, ordinary, 4852 nested models, 4871 nugget effect, 4871 ordinary kriging, 4851 OUTDIST= data set, 4851, 4865, 4878, 4880 OUTPAIR= data set, 4851, 4866, 4881 output data sets, 4851, 4865, 4866, 4877–4881 OUTVAR= data set, 4866, 4877 pairwise distances, 4851, 4856, 4867 predicted values, 4852 semivariogram computation, 4876, 4877 semivariogram robust, 4877 semivariogram, empirical, 4856 semivariogram, robust, 4861, 4869, 4877 spatial continuity, 4851 spatial prediction, 4851, 4852 square root difference cloud, 4882 standard errors, 4852 surface plot, 4856 surface trend, 4854 vector preference models TRANSREG procedure, 4586 viewing graphs ODS Graphics, 327 VIF,
    See variance inflation factors
VIP, 3396
visual fit of the variogram
    KRIGE2D procedure, 2045
W

Wald chi-square test
    SURVEYFREQ procedure, 4221
Wald log-linear chi-square test
    SURVEYFREQ procedure, 4223
Wald test
    mixed model (MIXED), 2741, 2786
    MIXED procedure, 2750, 2751
    modification indices (CALIS), 584, 674
    PHREG procedure, 3238, 3246, 3247, 3269, 3284
    probability limit (CALIS), 590
    PROBIT procedure, 3756
    SURVEYREG procedure, 4386, 4389
    TPHREG procedure, 4486
Waller-Duncan test, 445, 1769, 1815
    error seriousness ratio, 443, 1768
    examples, 1851
    multiple comparison (ANOVA), 464
Wampler data set, 3208
Ward's minimum-variance method
    CLUSTER procedure, 967, 983
Wei-Lin-Weissfeld model
    PHREG procedure, 3248
Weibull distribution, 2083, 2097, 2111
weight variable
    PRINQUAL procedure, 3667
weighted average linkage
    CLUSTER procedure, 967, 981
weighted Euclidean distance
    MDS procedure, 2472, 2477, 2488
weighted Euclidean model
    MDS procedure, 2471, 2477
weighted kappa coefficient, 1493, 1496
weighted least squares
    CATMOD procedure, 817, 846
    formulas (CATMOD), 894
    MDS procedure, 2487
    normal equations (GLM), 1783
weighted means
    GLM procedure, 1820
weighted pair-group methods,
    See McQuitty's similarity analysis
    See median method
weighted product-moment correlation coefficients
    CANCORR procedure, 764
weighted Schoenfeld residuals
    PHREG procedure, 3235, 3259
weighted score residuals
    PHREG procedure, 3260
weighted-group method,
    See centroid method
weighting
    MIXED procedure, 2730
weighting variables
    FACTOR procedure, 1333
Welch t test
    power and sample size (POWER), 3463, 3472, 3527
Welch's ANOVA, 445, 1769
    homogeneity of variance tests, 1819
    using homogeneity of variance tests, 1893
Welsch's multiple range test, 444, 1768, 1815
    examples, 1851
WHERE statement
    GLM procedure, 1787
width, confidence intervals, 3488
Wilcoxon scores
    NPAR1WAY procedure, 3166
Wilcoxon test for association
    LIFETEST procedure, 2150
Wilcoxon test for homogeneity
    LIFETEST procedure, 2150, 2168, 2178
Wilks' criterion, 437, 1759
Wilks' lambda, 1828
Williams' method
    overdispersion (LOGISTIC), 2355
windowing environment, 327, 338
within-cluster SSCP matrix
    ACECLUS procedure, 387
within-imputation covariance matrix
    MIANALYZE procedure, 2626
within-imputation variance
    MI procedure, 2561
    MIANALYZE procedure, 2624
within-subject factors
    repeated measures, 1777, 1828
Wong's hybrid method
    CLUSTER procedure, 969, 978
working correlation matrix
    GENMOD procedure, 1647, 1648, 1672
worst linear function of parameters
    MI procedure, 2556
WPGMA,
    See McQuitty's similarity analysis
WPGMC,
    See median method
X

XDATA= data sets
    LIFEREG procedure, 2122

Y

Yule's Q statistic, 1476

Z

z scores
    TRANSREG procedure, 4572
z test
    power and sample size (POWER), 3429, 3432, 3505, 3506
zero variance component estimates
    MIXED procedure, 2774
zeros, structural and random
    FREQ procedure, 1499
zeros, structural and sampling
    CATMOD procedure, 888
    examples (CATMOD), 919, 924
zonal anisotropy
    KRIGE2D procedure, 2055
Syntax Index

A

AB option
    EXACT statement (NPAR1WAY), 3160
    OUTPUT statement (NPAR1WAY), 3162
    PROC NPAR1WAY statement, 3156
AB/BA crossover designs
    power and sample size (POWER), 3549
ABSCONV= option
    NLOPTIONS statement (CALIS), 615
    PROC NLMIXED statement, 3060
ABSENT= option
    PROC DISTANCE statement, 1256
    VAR statement, 1266
ABSFCONV option
    MODEL statement (LOGISTIC), 2308
    MODEL statement (SURVEYLOGISTIC), 4261
ABSFCONV= option
    NLOPTIONS statement (CALIS), 617
    PROC NLMIXED statement, 3060
ABSFTOL= option
    NLOPTIONS statement (CALIS), 617
ABSGCONV= option
    NLOPTIONS statement (CALIS), 617, 620
    PROC NLMIXED statement, 3060
ABSGTOL= option
    NLOPTIONS statement (CALIS), 617, 620
absolute level of measurement, definition
    DISTANCE procedure, 1250
ABSOLUTE option
    PROC ACECLUS statement, 404
    PROC MIXED statement, 2674, 2749
absorption of effects
    ANOVA procedure, 434
    GLM procedure, 1747, 1799
ABSXCONV= option
    NLOPTIONS statement (CALIS), 617
    PROC NLMIXED statement, 3060
ABSXTOL= option
    NLOPTIONS statement (CALIS), 617
Accelerated failure time model
    NLMIXED procedure, 3128
accelerated failure time models
    LIFEREG procedure, 2083
ACCRUALTIME= option
    TWOSAMPLESURVIVAL statement (POWER), 3474
ACECLUS procedure
    analyzing data in groups, 394, 412
    between-cluster SSCP matrix, 387
    clustering methods, 405
    compared with other procedures, 391
    computational resources, 410
    controlling iterations, 389
    decomposition of the SSCP matrix, 387
    eigenvalues and eigenvectors, 398, 399, 406, 410–412
    initial estimates, 389, 404
    memory requirements, 410
    missing values, 409
    output data sets, 409
    output table names, 412
    syntax, 402
    time requirements, 411
    within-cluster SSCP matrix, 387
ACECLUS procedure, BY statement, 407
ACECLUS procedure, FREQ statement, 408
ACECLUS procedure, PROC ACECLUS statement, 402
    ABSOLUTE option, 404
    CONVERGE= option, 404
    DATA= option, 404
    INITIAL= option, 404
    MAXITER= option, 405
    METHOD= option, 405
    METRIC= option, 405
    MPAIRS= option, 405
    N= option, 405
    NOPRINT option, 405
    OUT= option, 406
    OUTSTAT= option, 406
    P= option, 406
    PERCENT= option, 406
    PP option, 406
    PREFIX= option, 406
    PROPORTION= option, 406
    QQ option, 406
    SHORT option, 407
    SINGULAR= option, 407
    T= option, 407
    THRESHOLD= option, 407
ACECLUS procedure, VAR statement, 408
ACECLUS procedure, WEIGHT statement, 408
ACFPLOT option MCMC statement (MI), 2524, 2567 ACOV option MODEL statement (REG), 3823 active set methods NLMIXED procedure, 3093 ACTUAL option MODEL statement (RSREG), 4041 actual power GLMPOWER procedure, 1946, 1947, 1953 POWER procedure, 3419, 3494, 3496 acturial estimate, See life-table estimate adaptive Gaussian quadrature NLMIXED procedure, 3084 ADD statement, REG procedure, 3819 ADD= option PROC DISTANCE statement, 1256 PROC STDIZE statement, 4130 ADDCELL= option MODEL statement (CATMOD), 841 additive models TRANSREG procedure, 4574 ADDITIVE option MODEL statement (TRANSREG), 4574 adjacent-category logits, See also response functions specifying in CATMOD procedure, 852 using (CATMOD), 868 adjacent-level contrasts, 448 ADJBOUND= option MODEL statement (SURVEYLOGISTIC), 4265 ADJRSQ SURVEYREG procedure, 4380 ADJRSQ option MODEL statement (REG), 3824 MODEL statement (SURVEYREG), 4380 ADJUST= option LSMEANS statement (GLM), 1754 LSMEANS statement (MIXED), 2687 adjusted degrees of freedom MI procedure, 2562 MIANALYZE procedure, 2625 adjusted means See least-squares means, 1753 adjusted odds ratio, 1503 adjusted p-value MULTTEST procedure, 2935, 2956 adjusted R2 selection (REG), 3875 Adjusted R-square SURVEYREG procedure, 4387 adjusted residuals GENMOD procedure, 1670 adjusted treatment means LATTICE procedure, 2075 ADPREFIX= option OUTPUT statement (TRANSREG), 4584 after expansion knots TRANSREG procedure, 4571
AFTER option MODEL statement (TRANSREG), 4571 agglomerative hierarchical clustering analysis, 957 AGGREGATE= option MODEL statement (GENMOD), 1637 MODEL statement (LOGISTIC), 2308 MODEL statement (PROBIT), 3742 aggregates of residuals, 1718, 1725 AGK estimate STDIZE procedure, 4139 AGREE option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1447 TABLES statement (FREQ), 1453 TEST statement (FREQ), 1463 agreement, measures of, 1493 AIC option MODEL statement (REG), 3824 PLOT statement (REG), 3844 AIPREFIX option OUTPUT statement (TRANSREG), 4584 AJCHI option OUTPUT statement (FREQ), 1447 Akaike’s information criterion example (MIXED), 2780, 2794, 2823 LOGISTIC procedure, 2341 MIXED procedure, 2676, 2740, 2750 SURVEYLOGISTIC procedure, 4279 Akaike’s information criterion (finite sample corrected version) MIXED procedure, 2676, 2750 ALG= option PRIOR statement (MIXED), 2710 ALGORITHM= option PROC PLS statement, 3375 aliasing GENMOD procedure, 1620 ALIASING option MODEL statement (GLM), 1771, 1897 aliasing structure GLM procedure, 1771 aliasing structure (GLM), 1897 – ALL– effect MANOVA statement (ANOVA), 437 MANOVA statement, H= option (GLM), 1759 ALL option MODEL statement (LOESS), 2232 MODEL statement (REG), 3824 NLOPTIONS statement (CALIS), 621 OUTPUT statement (FREQ), 1447 PROC CALIS statement, 584 PROC CANCORR statement, 759 PROC CANDISC statement, 790 PROC CORRESP statement, 1074 PROC DISCRIM statement, 1147 PROC FACTOR statement, 1309 PROC MODECLUS statement, 2865 PROC REG statement, 3816 PROC STEPDISC statement, 4166
Syntax Index TABLES statement (FREQ), 1453 ALLLABEL= option BOXPLOT procedure, 492 ALLOBS option PAINT statement (REG), 3837 REWEIGHT statement (REG), 3856 ALOGIT function RESPONSE statement (CATMOD), 852 alpha factor analysis, 1291, 1311 alpha level ANOVA procedure, 441 FREQ procedure, 1445, 1453 GLM procedure, 1755, 1765, 1771, 1775 GLMPOWER procedure, 1939 NPAR1WAY procedure, 3160 POWER procedure, 3488 REG procedure, 3816 SURVEYFREQ procedure, 4199 TRANSREG procedure, 4574 TTEST procedure, 4780 ALPHA= option BASELINE statement (PHREG), 3225 CONTRAST statement (CATMOD), 832 CONTRAST statement (LOGISTIC), 2300 CONTRAST statement (SURVEYLOGISTIC), 4257 CONTRAST statement (TPHREG), 4481 ESTIMATE statement (GENMOD), 1634 ESTIMATE statement (MIXED), 2685 ESTIMATE statement (NLMIXED), 3077 EXACT statement (FREQ), 1445 EXACT statement (LOGISTIC), 2301 EXACT statement (NPAR1WAY), 3160 LSMEANS statement (GENMOD), 1635 LSMEANS statement (GLM), 1755 LSMEANS statement (MIXED), 2688 MEANS statement (ANOVA), 441 MEANS statement (GLM), 1765 MODEL statement (CATMOD), 842 MODEL statement (GAM), 1567 MODEL statement (GENMOD), 1637 MODEL statement (GLM), 1771 MODEL statement (LIFEREG), 2096 MODEL statement (LOESS), 2232 MODEL statement (LOGISTIC), 2308 MODEL statement (PHREG), 3233 MODEL statement (REG), 3824 MODEL statement (ROBUSTREG), 3989 MODEL statement (SURVEYLOGISTIC), 4262 MODEL statement (TPSPLINE), 4508 MODEL statement (TRANSREG), 4570, 4574 MULTREG statement (POWER), 3422 ONECORR statement (POWER), 3427 ONESAMPLEFREQ statement (POWER), 3430 ONESAMPLEMEANS statement (POWER), 3434 ONEWAYANOVA statement (POWER), 3439 OUTPUT statement (GLM), 1775 OUTPUT statement (LOGISTIC), 2322
4977
PAIREDFREQ statement (POWER), 3444 PAIREDMEANS statement (POWER), 3449 POWER statement (GLMPOWER), 1939 PREDICT statement (NLMIXED), 3079 PROC FACTOR statement, 1309 PROC GLM statement, 1745 PROC LIFETEST statement, 2160 PROC LOGISTIC statement, 2290 PROC MI statement, 2519 PROC MIANALYZE statement, 2614 PROC MIXED statement, 2674 PROC NLMIXED statement, 3061 PROC REG statement, 3816 PROC SURVEYLOGISTIC statement, 4250 PROC SURVEYMEANS statement, 4323 PROC SURVEYREG statement, 4373 PROC TTEST statement, 4780 RANDOM statement (MIXED), 2713 RANDOM statement (NLMIXED), 3080 SCORE statement (LOGISTIC), 2324 SURVIVAL statement (LIFETEST), 2168, 2191 TABLES statement (FREQ), 1453 TABLES statement (SURVEYFREQ), 4199 TWOSAMPLEFREQ statement (POWER), 3458 TWOSAMPLEMEANS statement (POWER), 3465 TWOSAMPLESURVIVAL statement (POWER), 3475 ALPHAECV= option PROC CALIS statement, 588 ALPHAINIT= option REPEATED statement (GENMOD), 1646 ALPHAP= option MODEL statement (MIXED), 2692 ALPHAQT= option PROC LIFETEST statement, 2160 ALPHARMS= option PROC CALIS statement, 588 ALR algorithm GENMOD procedure, 1711 ALTERNATE= option PROC MDS statement, 2476 alternating least squares MDS procedure, 2476 Alternating Logistic Regressions (ALR) GENMOD procedure, 1711 alternative hypothesis, 3488 AM option PROC MODECLUS statement, 2865 analysis of contrasts SURVEYREG procedure, 4393 analysis of covariance examples (GLM), 1860 MODEL statements (GLM), 1785 power and sample size (GLMPOWER), 1956 analysis of covariation (NESTED), 2990 analysis of variance, See also ANOVA procedure
4978
Syntax Index
See also TTEST procedure categorical data, 813 CATMOD procedure, 815 mixed models (GLM), 1882 MODEL statements (GLM), 1785 multivariate (ANOVA), 436 multivariate (CANDISC), 786 multivariate (GLM), 1759, 1823, 1824, 1868 nested design, 2985 one-way layout, example, 424 one-way tests (NPAR1WAY), 3165 one-way, variance-weighted, 445, 1769 power and sample size (GLMPOWER), 1930, 1951, 1956 power and sample size (POWER), 3438, 3442, 3443, 3513, 3536 quadratic response surfaces, 4045 repeated measures (CATMOD), 873 repeated measures (GLM), 1825, 1877, 1886 SURVEYREG procedure, 4387 three-way design (GLM), 1864 unbalanced (GLM), 1735, 1804, 1856 within-subject factors, repeated measurements, 449 analysis statements POWER procedure, 3420 Analysis style ODS styles, 333 analyst’s model MI procedure, 2563 analyzing data in groups, 3076 ACECLUS procedure, 394 CANCORR procedure, 763 FACTOR procedure, 1320 FASTCLUS procedure, 1395 MDS procedure, 2485 MODECLUS procedure, 2857, 2874 PRINCOMP procedure, 3607 SCORE procedure, 4073 STDIZE procedure, 4134 TREE procedure, 4754 VARCLUS procedure, 4813 Andersen-Gill model PHREG procedure, 3216, 3243, 3253 angle classes VARIOGRAM procedure, 4866, 4868, 4870, 4872, 4873 angle tolerance VARIOGRAM procedure, 4866, 4868, 4870, 4872, 4873 ANGLE= option MODEL statement (KRIGE2D), 2042 SIMULATE statement (SIM2D), 4102 ANGLETOLERANCE= option COMPUTE statement (VARIOGRAM), 4866 anisotropic models (KRIGE2D), 2053–2056 models (VARIOGRAM), 4871 nugget effect (KRIGE2D), 2056
annotate global data set (REG), 3816 local data set (REG), 3844 ANNOTATE= option PLOT statement (BOXPLOT), 492 PLOT statement (REG), 3844 PROC BOXPLOT statement, 487 PROC LIFETEST statement, 2160 PROC REG statement, 3816 annotating cdf plots, 3718 ipp plots, 3729 lpred plots, 3737 pplot plots, 2102 predicted probability plots, 3750 ANOVA codings (TRANSREG), 4662 SURVEYREG procedure, 4380, 4387, 4392 table (TRANSREG), 4580, 4615, 4650 ANOVA (row mean scores) statistic, 1502 ANOVA option MODEL statement (SURVEYREG), 4380 PROC CANDISC statement, 790 PROC DISCRIM statement, 1147 PROC NPAR1WAY statement, 3156 ANOVA procedure absorption of effects, 434 alpha level, 441 at sign (@) operator, 452 balanced data, 423 bar (|) operator, 452 Bartlett’s test, 443 block diagonal matrices, 423 Brown and Forsythe’s test, 443 canonical analysis, 438 characteristic roots and vectors, 437 compared to other procedures, 1734 complete block design, 428 computational methods, 456 confidence intervals, 442 contrasts, 448 dependent variable, 423 disk space, 433 effect specification, 451 factor name, 446, 447 homogeneity of variance tests, 443 hypothesis tests, 450 independent variable, 423 interactive use, 454 interactivity and missing values, 455 introductory example, 424 level values, 447 Levene’s test for homogeneity of variance, 443 means, 440 memory requirements, 434, 456 missing values, 433, 455 model specification, 451 multiple comparison procedures, 440 multiple comparisons, 441–445
Syntax Index multivariate analysis of variance, 433, 436 O’Brien’s test, 443 ODS graph names, 460 ODS table names, 458 orthonormalizing transformation matrix, 438 output data sets, 434, 455 pooling, automatic, 454 repeated measures, 446 sphericity tests, 449 SSCP matrix for multivariate tests, 437 syntax, 432 transformations, 447, 448 transformations for MANOVA, 437 unbalanced data, caution, 423 Welch’s ANOVA, 445 WHERE statement, 454 ANOVA procedure, ABSORB statement, 434 ANOVA procedure, BY statement, 435 ANOVA procedure, CLASS statement, 435 TRUNCATE option, 436 ANOVA procedure, FREQ statement, 436 ANOVA procedure, MANOVA statement, 436 – ALL– effect, 437 CANONICAL option, 438 E= option, 437 H= option, 437 INTERCEPT effect, 437 M= option, 437 MNAMES= option, 438 MSTAT= option, 438 ORTH option, 438 PREFIX= option, 438 PRINTE option, 438 PRINTH option, 439 SUMMARY option, 439 ANOVA procedure, MEANS statement, 440 ALPHA= option, 441 BON option, 441 CLDIFF option, 442 CLM option, 442 DUNCAN option, 442 DUNNETT option, 442 DUNNETTL option, 442 DUNNETTU option, 442 E= option, 443 GABRIEL option, 443 HOVTEST option, 443 KRATIO= option, 443 LINES option, 444 LSD option, 444, 445 NOSORT option, 444 REGWQ option, 444 SCHEFFE option, 444 SIDAK option, 444 SMM option, 444 SNK option, 444 TUKEY option, 445 WALLER option, 445 WELCH option, 445
4979
ANOVA procedure, MODEL statement, 445 INTERCEPT option, 446 NOUNI option, 446 ANOVA procedure, PROC ANOVA statement, 432 DATA= option, 433 MANOVA option, 433 MULTIPASS option, 433 NAMELEN= option, 433 NOPRINT option, 433 ORDER= option, 433 OUTSTAT= option, 434 ANOVA procedure, REPEATED statement, 446 CANONICAL option, 448 CONTRAST keyword, 448 E= effects, 450 factor specification, 447 H= effects, 450 HELMERT keyword, 448 IDENTITY keyword, 448 MEAN keyword, 448 MSTAT= option, 448 NOM option, 449 NOU option, 449 POLYNOMIAL keyword, 448 PRINTE option, 449 PRINTH option, 449 PRINTM option, 449 PRINTRV option, 449 PROFILE keyword, 448 SUMMARY option, 449 ANOVA procedure, TEST statement, 450 Ansari-Bradley scores NPAR1WAY procedure, 3168 ante-dependence structure MIXED procedure, 2721 ANTIALIAS= option ODS GRAPHICS statement, 349 AOV option PROC NESTED statement, 2988 apparent error rate, 1163 approximate covariance, options (CALIS), 621 standard errors (CALIS), 576, 587, 648, 686 approximate Bayesian bootstrap MI procedure, 2543 approximate covariance estimation clustering, 387 APPROXIMATIONS option OUTPUT statement (TRANSREG), 4584 PROC PRINQUAL statement, 3653 APREFIX= option PROC PRINQUAL statement, 3652 arbitrary missing pattern MI procedure, 2539 arcsine-square root transform LIFETEST procedure, 2205 arcsine-square root transformation LIFETEST procedure, 2169, 2177 ARRAY statement
4980
Syntax Index
NLMIXED procedure, 3074 arrays NLMIXED procedure, 3074 ARSIN transformation MODEL statement (TRANSREG), 4562 TRANSFORM statement (PRINQUAL), 3661 ASINGULAR= option NLOPTIONS statement (CALIS), 614 PROC CALIS statement, 588 PROC NLMIXED statement, 3061 ASSESS statement GENMOD procedure, 1627 PHREG procedure, 3223 association tests LIFETEST procedure, 2150, 2156, 2200 association, measures of FREQ procedure, 1474 asterisk (*) operator TRANSREG procedure, 4558 ASYCORR option PROC MIXED statement, 2674 ASYCOV option PROC MIXED statement, 2674, 2796 ASYCOV= option PROC CALIS statement, 575 asymmetric data (MDS), 2484 asymmetric binary variable DISTANCE procedure, 1250 asymmetric lambda, 1474, 1482 asymptotic covariance CALIS procedure, 575, 645 MIXED procedure, 2674 asymptotic variances CALIS procedure, 647 asymptotically distribution free estimation CALIS procedure, 574, 646 AT MEANS option LSMEANS statement (MIXED), 2688 AT option LSMEANS statement (GLM), 1755, 1822 LSMEANS statement (MIXED), 2688, 2689 at sign (@) operator ANOVA procedure, 452 CATMOD procedure, 866 GLM procedure, 1786 MIXED procedure, 2745, 2819 TRANSREG procedure, 4558 AUGMENT option PROC CALIS statement, 572 autocorrelation REG procedure, 3915 autocorrelation function plot MI procedure, 2557 autoregressive structure example (MIXED), 2788 MIXED procedure, 2721 average linkage CLUSTER procedure, 966, 976
AVERAGE option PROC INBREED statement, 1973 TEST statement (PHREG), 3238 average relative increase in variance MIANALYZE procedure, 2627 average variance of means LATTICE procedure, 2074 AVERAGED option MODEL statement (CATMOD), 842 axes labels, modifying examples, ODS Graphics, 368
B B option MODEL statement (REG), 3824 PROC CANCORR statement, 759 B-spline basis TRANSREG procedure, 4560, 4614 backward elimination LOGISTIC procedure, 2317, 2340 PHREG procedure, 3229, 3264 REG procedure, 3800, 3874 badness of fit MDS procedure, 2479, 2482, 2483, 2489, 2490 balanced data ANOVA procedure, 423 example, complete block, 1847 balanced design, 2985 balanced square lattice LATTICE procedure, 2069 banded Toeplitz structure MIXED procedure, 2721 BANDMAX= option SURVIVAL statement (LIFETEST), 2169, 2191 BANDMAXTIME= option SURVIVAL statement (LIFETEST), 2169 BANDMIN= option SURVIVAL statement (LIFETEST), 2169, 2191 BANDMINTIME= option SURVIVAL statement (LIFETEST), 2169 bandwidth optimal (DISCRIM), 1162 selection (KDE), 2008 VARIOGRAM procedure, 4866, 4870, 4875 BANDWIDTH= option COMPUTE statement (VARIOGRAM), 4866 bar (|) operator ANOVA procedure, 452 CATMOD procedure, 865 GENMOD procedure, 1660 GLM procedure, 1786 MIXED procedure, 2743, 2745, 2819 TRANSREG procedure, 4558 Bartlett’s test ANOVA procedure, 443 GLM procedure, 1767, 1819 Base SAS software, 21 BASELINE statement PHREG procedure, 3224
Syntax Index baseline survivor function (PHREG) confidence level, 3225 confidence limits, 3225, 3263 estimation method, 3226 standard error, 3225, 3263 Bayes estimation NLMIXED procedure, 3084 Bayes’ theorem DISCRIM procedure, 1157 LOGISTIC procedure, 2314, 2353 MI procedure, 2547 Bayesian analysis MIXED procedure, 2708 Bayesian confidence intervals splines, 4513 Bayesian inference MI procedure, 2547 BCORR option PROC CANDISC statement, 790 PROC DISCRIM statement, 1147 PROC STEPDISC statement, 4166 BCOV option PROC CANDISC statement, 790 PROC DISCRIM statement, 1147 PROC MIANALYZE statement, 2614 PROC STEPDISC statement, 4166 TEST statement (MIANALYZE), 2619 BDATA= option PRIOR statement (MIXED), 2710 BDCHI option OUTPUT statement (FREQ), 1447 BDT option TABLES statement (FREQ), 1453 Behrens-Fisher problem, 4775 BENZECRI option PROC CORRESP statement, 1074 Bernoulli distribution NLMIXED procedure, 3077 best subset selection LOGISTIC procedure, 2308, 2317, 2341 PHREG procedure, 3229, 3265, 3279 BEST= option MODEL statement (LOGISTIC), 2308 MODEL statement (PHREG), 3229 MODEL statement (REG), 3824 PARMS statement (NLMIXED), 3078 PROC NLIN statement, 3006 BETA= option PROC CLUSTER statement, 968 between-cluster SSCP matrix ACECLUS procedure, 387 between-imputation covariance matrix MIANALYZE procedure, 2626 between-imputation variance MI procedure, 2561 MIANALYZE procedure, 2624 between-subject factors repeated measures, 1777, 1828 Bhapkar’s test, 932
BIASKUR option PROC CALIS statement, 588 BIASTEST option PROC ROBUSTREG statement, 3987 BIC option MODEL statement (REG), 3824 PLOT statement (REG), 3844 bimodality coefficient CLUSTER procedure, 972, 984 bin-sort algorithm, 1389 binary distribution NLMIXED procedure, 3077 Binary Lance and Williams nonmetric coefficient DISTANCE procedure, 1276 BINARY option PROC CORRESP statement, 1074 binning KDE procedure, 2004 binomial distribution GENMOD procedure, 1652 NLMIXED procedure, 3077 BINOMIAL option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1447 TABLES statement (FREQ), 1453, 1532 TEST statement (MULTTEST), 2947 binomial proportion test, 1484 examples, 1532 power and sample size (POWER), 3429, 3432, 3504–3506, 3541 BINOMIALC option TABLES statement (FREQ), 1454 BINS= option PROC FASTCLUS statement, 1389 bioequivalence, equivalence tests power and sample size (POWER), 3510, 3511, 3520, 3521, 3530, 3531, 3549 biological assay data, 3705, 3761 biplot PRINQUAL procedure, 3678 biquartimax method, 1291, 1317, 1318 biquartimin method, 1291, 1318 BIVAR statement KDE procedure, 1997 bivariate density estimation DISCRIM procedure, 1200 bivariate histogram KDE procedure, 2011 BIVSTATS option BIVAR statement, 1998 biweight kernel (DISCRIM), 1160 block diagonal matrices ANOVA procedure, 423 BLOCKLABELPOS= option PLOT statement (BOXPLOT), 493 BLOCKLABTYPE= option PLOT statement (BOXPLOT), 493 BLOCKPOS= option
PLOT statement (BOXPLOT), 493 BLOCKREP option PLOT statement (BOXPLOT), 493 BLUE MIXED procedure, 2740 BLUP MIXED procedure, 2740 BON option MEANS statement (ANOVA), 441 MEANS statement (GLM), 1765 Bonferroni t test, 441, 1765, 1809 Bonferroni adjustment GLM procedure, 1754 MIXED procedure, 2688 MULTTEST procedure, 2939, 2956 BONFERRONI option PROC MULTTEST statement, 2939, 2956 bootstrap MI procedure, 2527 bootstrap adjustment MULTTEST procedure, 2938, 2939, 2957, 2968 BOOTSTRAP option MCMC statement (MI), 2527 PROC MULTTEST statement, 2937, 2939, 2957, 2968 boundary constraints, 3075 MIXED procedure, 2707, 2708, 2773 BOUNDARY option PROC MODECLUS statement, 2865 bounds NLMIXED procedure, 3075 BOUNDS statement CALIS procedure, 609 NLIN procedure, 3010 NLMIXED procedure, 3075 Bowker’s test of symmetry, 1493, 1494 Box Cox Example TRANSREG procedure, 4721 Box Cox transformations TRANSREG procedure, 4595 box plot, defined, 483 Box plots MIXED procedure, 2762 box plots plots, ODS Graphics, 355 reading group summary statistics, 522 saving group summary statistics, 518, 519 box plots, clipping boxes, 497, 498 examples, 532, 533 box plots, labeling angles for, 503 points, 492 Box’s epsilon, 1829 box-and-whisker plots schematic, 539 side-by-side, 483 skeletal, 538 statistics represented, 485, 517 styles of, 522
BOX= option PROC BOXPLOT statement, 487 BOXCONNECT= option PLOT statement (BOXPLOT), 493 BOXCOX transformation TRANSFORM statement (MI), 2533 BOXPLOT procedure continuous group variables, 524 missing values, 524 percentile computation, 523 syntax, 487 BOXPLOT procedure, BY statement, 516 BOXPLOT procedure, ID statement, 517 BOXPLOT procedure, INSET statement, 511 CFILL= option, 513 CFILLH= option, 513 CFRAME= option, 513 CHEADER= option, 513 CSHADOW= option, 513 CTEXT= option, 513 DATA option, 513 FONT= option, 513 FORMAT= option, 513 HEADER= option, 514 HEIGHT= option, 514 NOFRAME option, 514 POSITION= option, 514, 526–528 REFPOINT= option, 514 BOXPLOT procedure, INSETGROUP statement, 514 CFILL= option, 515 CFILLH= option, 515 CFRAME= option, 516 CHEADER= option, 516 CTEXT= option, 516 FONT= option, 516 FORMAT= option, 516 HEADER= option, 516 HEIGHT= option, 516 NOFRAME option, 516 POSITION= option, 516 BOXPLOT procedure, PLOT statement, 488 ALLLABEL= option, 492 ANNOTATE= option, 492 BLOCKLABELPOS= option, 493 BLOCKLABTYPE= option, 493 BLOCKPOS= option, 493 BLOCKREP option, 493 BOX= data set, 520 BOXCONNECT= option, 493 BOXSTYLE= option, 493, 538 BOXWIDTH= option, 495 BOXWIDTHSCALE= option, 495, 543 BWSLEGEND option, 495 CAXIS= option, 495 CBLOCKLAB= option, 495 CBLOCKVAR= option, 495 CBOXES= option, 496 CBOXFILL= option, 496 CCLIP= option, 496
CCONNECT= option, 497 CCOVERLAY= option, 497 CFRAME= option, 497 CGRID= option, 497 CHREF= option, 497 CLABEL= option, 497 CLIPFACTOR= option, 497, 533 CLIPLEGEND= option, 497 CLIPLEGPOS= option, 497 CLIPSUBCHAR= option, 498 CLIPSYMBOL= option, 498 CLIPSYMBOLHT= option, 498 CONTINUOUS option, 498 COVERLAY= option, 498 COVERLAYCLIP= option, 498 CTEXT= option, 498 CVREF= option, 499 DATA= data set, 519 DESCRIPTION= option, 499 ENDGRID option, 499 FONT= option, 499 GRID= option, 499 HAXIS= option, 499 HEIGHT= option, 500 HISTORY= data set, 521, 522 HMINOR= option, 500 HOFFSET= option, 500 HREF= option, 500 HREFLABELS= option, 501 HREFLABPOS= option, 501 HTML= option, 501 IDCOLOR= option, 501 IDCTEXT= option, 502 IDFONT= option, 502 IDHEIGHT= option, 502 IDSYMBOL= option, 502 INTERVAL= option, 502 LABELANGLE= option, 503 LBOXES= option, 503 LENDGRID= option, 503 LGRID= option, 504 LHREF= option, 504 LOVERLAY= option, 504 LVREF= option, 504 MAXPANELS= option, 504 MISSBREAK option, 504 NAME= option, 504 NLEGEND option, 505 NOBYREF option, 505 NOCHART option, 505 NOFRAME option, 505 NOHLABEL option, 505 NOOVERLAYLEGEND option, 505 NOSERIFS option, 505 NOTCHES option, 505, 542 NOTICKREP option, 506 NOVANGLE option, 506 NPANELPOS= option, 506 OUTBOX= data set, 518
OUTBOX= option, 506 OUTHISTORY= data set, 519 OUTHISTORY= option, 507 OVERLAY= option, 507 OVERLAYCLIPSYM= option, 507 OVERLAYCLIPSYMHT= option, 507 OVERLAYHTML= option, 507 OVERLAYID= option, 507 OVERLAYLEGLAB= option, 507 OVERLAYSYM= option, 507 OVERLAYSYMHT= option, 508 PAGENUM= option, 508 PAGENUMPOS= option, 508 PCTLDEF= option, 508 REPEAT option, 508 SKIPHLABELS= option, 509 SYMBOLLEGEND= option, 509 SYMBOLORDER= option, 509 TOTPANELS= option, 509 TURNHLABELS option, 509 VAXIS= option, 509 VFORMAT= option, 510 VMINOR= option, 510 VOFFSET= option, 510 VREF= option, 510 VREFLABELS= option, 510 VREFLABPOS= option, 511 VZERO option, 511 WAXIS= option, 511 WGRID= option, 511 WOVERLAY= option, 511 BOXPLOT procedure, plot statement OUTHIGHHTML= option, 507 OUTLOWHTML= option, 507 BOXPLOT procedure, plot statements INTSTART= option, 503 BOXPLOT procedure, PROC BOXPLOT statement, 487 ANNOTATE= option, 487 BOX= option, 487 DATA= option, 487 GOUT= option, 487 BOXSTYLE= option PLOT statement (BOXPLOT), 493 BOXWIDTH= option PLOT statement (BOXPLOT), 495 BOXWIDTHSCALE= option PLOT statement (BOXPLOT), 495 branch and bound algorithm LOGISTIC procedure, 2341 PHREG procedure, 3265, 3279 Bray and Curtis coefficient DISTANCE procedure, 1276 Breslow method likelihood (PHREG), 3228, 3240 Breslow test, See Wilcoxon test for homogeneity Breslow-Day test, 1508 Brewer’s method
SURVEYSELECT procedure, 4453 Brown and Forsythe’s test ANOVA procedure, 443 GLM procedure, 1767, 1819 Brown-Mood test NPAR1WAY procedure, 3167 Broyden-Fletcher-Goldfarb-Shanno update, 3073 BSPLINE transformation MODEL statement (TRANSREG), 4560 BSSCP option PROC CANDISC statement, 791 PROC DISCRIM statement, 1147 PROC STEPDISC statement, 4166 BUCKET= option MODEL statement (LOESS), 2232 Burt table CORRESP procedure, 1076 BWM= option BIVAR statement, 1998 UNIVAR statement, 1999 BWSLEGEND option PLOT statement (BOXPLOT), 495 BY statement ANOVA procedure, 435 BOXPLOT procedure, 516 CALIS procedure, 626 CANDISC procedure, 793 CATMOD procedure, 830 CORRESP procedure, 1080 DISCRIM procedure, 1153 FACTOR procedure, 1320 FREQ procedure, 1443 GAM procedure, 1564 GENMOD procedure, 1628 GLM procedure, 1747 GLMMOD procedure, 1915 INBREED procedure, 1974 KDE procedure, 2001 LATTICE procedure, 2072 LIFEREG procedure, 2091 LIFETEST procedure, 2165 LOESS procedure, 2230 LOGISTIC procedure, 2294 MDS procedure, 2485 MI procedure, 2521 MIANALYZE procedure, 2616 MIXED procedure, 2680 MODECLUS procedure, 2869 MULTTEST procedure, 2943 NESTED procedure, 2988 NLIN procedure, 3011 NLMIXED procedure, 3076 NPAR1WAY procedure, 3158 ORTHOREG procedure, 3202 PHREG procedure, 3226 PLS procedure, 3377 PRINCOMP procedure, 3607 PRINQUAL procedure, 3658 REG procedure, 3819
ROBUSTREG procedure, 3988 RSREG procedure, 4040 STEPDISC procedure, 4168 SURVEYFREQ procedure, 4195 SURVEYLOGISTIC procedure, 4252 SURVEYMEANS procedure, 4328 SURVEYREG procedure, 4375 TPSPLINE procedure, 4507 TRANSREG procedure, 4556 TREE procedure, 4754 TTEST procedure, 4780 VARCLUS procedure, 4813 VARCOMP procedure, 4835 BYLEVEL option LSMEANS statement (GLM), 1756, 1823 LSMEANS statement (MIXED), 2689, 2691 BYOUT option MODEL statement (RSREG), 4041
C C option PROC CANCORR statement, 760 C= option OUTPUT statement (LOGISTIC), 2321 TRANSFORM statement (MI), 2534 CA option TEST statement (MULTTEST), 2948, 2964 calibration data set DISCRIM procedure, 1139, 1167 CALIS procedure, 566 approximate covariance, 621 approximate standard errors, 576, 587, 648, 686 asymptotic covariance, 575, 645 asymptotic variances, 647 asymptotically distribution free estimation, 574, 646 chi-square indices, 653 chi-square, adjusted, 655 chi-square, displaying, 684, 685 chi-square, reweighted, 655 coefficient of determination, 585 compared to MIXED procedure, 2665 computation method, Hessian matrix, 589 computation method, Lagrange multipliers, 589 computation method, standard errors, 589 computational problems, 678, 680, 681 computational problems, identification, 669, 680, 724, 726, 732, 741 constrained estimation, 565 constraints, 565, 609, 610, 675, 676 constraints, program statements, 565, 628, 630, 675 COSAN model, 552, 591 degrees of freedom, 573, 576, 590, 676 determination index, 657 displayed output options, 583, 621 disturbances, prefix, 601 EQS program, 555 estimation criteria, 646
estimation methods, 549, 574, 644–647 exogenous variables, 662 factor analysis model, 554, 606, 608 factor analysis model, COSAN statement, 593 factor analysis model, LINEQS statement, 602 factor analysis model, RAM statement, 598 factor loadings, 641 FACTOR procedure, 567, 572, 606, 679 factor rotation, 607 factor scores, 641, 643, 687 fit function, 644 goodness of fit, 644, 649, 652, 684 gradient, 634, 664, 665 Hessian matrix, 621, 622, 634, 647, 665 initial values, 550, 590, 595, 597, 602, 661 input data set, 630 kappa, 659 kurtosis, 549, 584, 588, 658, 660 latent variables, 549, 601, 625 likelihood ratio test, 653, 674 LINEQS model, 553, 601 LISREL model, 554 manifest variables, 549 matrix inversion, 647 matrix names, default, 608 matrix properties, COSAN model, 592 MODEL procedure, 679 modification indices, 576, 584, 649, 673, 674, 687 optimization, 550, 577–581, 622, 664–666, 671, 672 optimization history, 668 optimization, initial values, 661, 666 optimization, memory problems, 666 optimization, termination criteria, 611, 615–620 output data sets, 634 output table names, 688 predicted model matrix, 643, 644, 663, 680, 683 prefix name, 594, 598, 602, 603 PRINCOMP procedure, 567 program statements, 628, 630 program statements, differentiation, 589 RAM model, 553, 596 reciprocal causation, 585 REG procedure, 679 residuals, 650 residuals, prefix, 601 SCORE procedure, 571, 586, 643 significance level, 588 simplicity functions, 607 singularity criterion, 590 singularity criterion, covariance matrix, 588, 590, 591 skewness, 658 squared multiple correlation, 657, 686 step length, 581 structural equation, 552, 585, 625, 658 syntax, 566 SYSLIN procedure, 679
SYSNLIN procedure, 679 t value, 649, 686 test indices, constraints, 584 variable selection, 662 Wald test, probability limit, 590 CALIS procedure, BOUNDS statement, 609 CALIS procedure, BY statement, 626 CALIS procedure, COSAN statement, 591 CALIS procedure, COV statement, 604 CALIS procedure, FACTOR statement, 606 COMPONENT option, 607 HEYWOOD option, 607 N= option, 607 NORM option, 607 RCONVERGE= option, 607 RITER= option, 607 ROTATE= option, 607 CALIS procedure, FREQ statement, 627 CALIS procedure, LINCON statement, 609 CALIS procedure, LINEQS statement, 601 CALIS procedure, MATRIX statement, 593 CALIS procedure, NLINCON statement, 610 CALIS procedure, NLOPTIONS statement, 611 ABSCONV= option, 615 ABSFCONV= option, 617 ABSFTOL= option, 617 ABSGCONV= option, 617, 620 ABSGTOL= option, 617, 620 ABSXCONV= option, 617 ABSXTOL= option, 617 ALL option, 621 ASINGULAR= option, 614 CFACTOR= option, 621 COVSING= option, 614 DAMPSTEP= option, 622 FCONV2= option, 618, 620 FCONV= option, 614, 617 FDIGITS= option, 618 FSIZE= option, 618 FTOL2= option, 618, 620 FTOL= option, 614, 617 G4= option, 613 GCONV2= option, 619 GCONV= option, 614, 618, 620 GTOL2= option, 619 GTOL= option, 614, 618, 620 HESCAL= option, 622 INSTEP= option, 614 LCDEACT= option, 623 LCEPSILON= option, 623 LCSINGULAR= option, 623 LINESEARCH= option, 614 LSPRECISION= option, 614 MAXFUNC= option, 614, 616 MAXITER= option, 614, 616 MAXTIME= option, 616 MINITER= option, 616 MSINGULAR= option, 614 NOEIGNUM option, 623
NOHLF option, 621 OMETHOD= option, 613 PALL option, 621 PCRPJAC option, 621 PHESSIAN option, 621 PHISTORY option, 621 PINIT option, 621 PJTJ option, 621 PNLCJAC option, 621 PRINT option, 622 RADIUS= option, 614 RESTART= option, 623 SALPHA= option, 614 SINGULAR= option, 614 TECHNIQUE= option, 613 UPDATE= option, 613 VERSION= option, 624 VSINGULAR= option, 615 XCONV= option, 619 XSIZE= option, 619 XTOL= option, 619 CALIS procedure, PARAMETERS statement, 624 CALIS procedure, PARTIAL statement, 627 CALIS procedure, PROC CALIS statement, 568 ALL option, 584 ALPHAECV= option, 588 ALPHARMS= option, 588 ASINGULAR= option, 588 ASYCOV= option, 575 AUGMENT option, 572 BIASKUR option, 588 CORR option, 584 COVARIANCE option, 572 COVSING= option, 588 DATA= option, 570 DEMPHAS= option, 588 DFE= option, 572 DFR= option, 572 DFREDUCE= option, 576 EDF= option, 572 ESTDATA= option, 570 FCONV= option, 580 FDCODE option, 589 FTOL= option, 580 G4= option, 576 GCONV= option, 580 GTOL= option, 580 HESSALG= option, 589 INEST= option, 570 INRAM= option, 570 INSTEP= option, 581 INVAR= option, 570 INWGT= option, 570 KURTOSIS option, 584 LINESEARCH= option, 580 LSPRECISION= option, 581 MAXFUNC= option, 582 MAXITER= option, 582 METHOD= option, 574
MODIFICATION option, 584 MSINGULAR= option, 590 NOADJDF option, 590 NOBS= option, 572 NODIAG option, 576 NOINT option, 573 NOMOD option, 584 NOPRINT option, 584 NOSTDERR option, 587 OM= option, 577 OMETHOD= option, 577 OUTEST= option, 570 OUTJAC option, 571 OUTRAM= option, 571 OUTSTAT= option, 571 OUTVAR= option, 570 OUTWGT= option, 571 PALL option, 584 PCORR option, 584 PCOVES option, 585 PDETERM option, 585 PESTIM option, 585 PINITIAL option, 585 PJACPAT option, 586 PLATCOV option, 586 PREDET option, 586 PRIMAT option, 586 PRINT option, 586 PRIVEC option, 586 PSHORT option, 587 PSUMMARY option, 587 PWEIGHT option, 587 RADIUS= option, 583 RANDOM= option, 590 RDF= option, 572 RESIDUAL= option, 587 RIDGE= option, 573 SALPHA= option, 583 SHORT option, 587 SIMPLE option, 587 SINGULAR= option, 590 SLMW= option, 590 SMETHOD= option, 580 SPRECISION= option, 581, 583 START= option, 590 STDERR option, 587 SUMMARY option, 587 TECHNIQUE= option, 577 TOTEFF option, 587 UCORR option, 573 UCOV option, 573 UPDATE= option, 579 VARDEF= option, 573 VSINGULAR= option, 591 WPENALTY= option, 576 WRIDGE= option, 577 CALIS procedure, RAM statement, 596 CALIS procedure, STD statement, 603 CALIS procedure, STRUCTEQ statement, 625
CALIS procedure, VAR statement, 627 CALIS procedure, VARNAMES statement, 625 CALIS procedure, WEIGHT statement, 627 CAN option PROC DISCRIM statement, 1147 CANALS method TRANSREG procedure, 4576 Canberra metric coefficient DISTANCE procedure, 1272 CANCORR procedure analyzing data in groups, 763 canonical coefficients, 751 canonical redundancy analysis, 751, 761 computational resources, 768 correction for means, 760 correlation, 760 eigenvalues, 765 eigenvalues and eigenvectors, 754, 769 examples, 753, 773 formulas, 765 input data set, 760 missing values, 765 OUT= data sets, 761, 766 output data sets, 761, 766 output table names, 771 OUTSTAT= data sets, 766 partial correlation, 761, 762, 764 principal components, relation to, 765 regression coefficients, 759 semipartial correlation, 762 singularity checking, 762 squared multiple correlation, 762 squared partial correlation, 762 squared semipartial correlation, 762 statistical methods used, 752 statistics computed, 751 suppressing output, 761 syntax, 757 weighted product-moment correlation coefficients, 764 CANCORR procedure, BY statement, 763 CANCORR procedure, FREQ statement, 764 CANCORR procedure, PARTIAL statement, 764 CANCORR procedure, PROC CANCORR statement, 757 ALL option, 759 B option, 759 C option, 760 CLB option, 760 CORR option, 760 CORRB option, 760 DATA= option, 760 EDF= option, 760 INT option, 760 NCAN= option, 760 NOINT option, 760 NOPRINT option, 761 OUT= option, 761 OUTSTAT= option, 761
PCORR option, 761 PROBT option, 761 RDF= option, 761 RED option, 761 REDUNDANCY option, 761 S option, 761 SEB option, 761 SHORT option, 761 SIMPLE option, 761 SING= option, 762 SINGULAR= option, 762 SMC option, 762 SPCORR option, 762 SQPCORR option, 762 SQSPCORR option, 762 STB option, 762 T option, 762 VDEP option, 762 VN= option, 762 VNAME= option, 762 VP= option, 762 VPREFIX= option, 762 VREG option, 763 WDEP option, 763 WN= option, 763 WNAME= option, 763 WP= option, 763 WPREFIX= option, 763 WREG option, 762 CANCORR procedure, VAR statement, 764 CANCORR procedure, WEIGHT statement, 764 CANCORR procedure, WITH statement, 764 CANDISC procedure computational details, 794 computational resources, 799 input data set, 795 introductory example, 785 Mahalanobis distance, 804 MANOVA, 786 memory requirements, 799 missing values, 794 multivariate analysis of variance, 786 ODS table names, 802 output data sets, 791, 796, 797 %PLOTIT macro, 785, 808 syntax, 789 time requirements, 799 CANDISC procedure, BY statement, 793 CANDISC procedure, CLASS statement, 793 CANDISC procedure, FREQ statement, 793 CANDISC procedure, PROC CANDISC statement, 789 ALL option, 790 ANOVA option, 790 BCORR option, 790 BCOV option, 790 BSSCP option, 791 DATA= option, 791 DISTANCE option, 791
NCAN= option, 791 NOPRINT option, 791 OUT= option, 791 OUTSTAT= option, 791 PCORR option, 791 PCOV option, 791 PREFIX= option, 792 PSSCP option, 792 SHORT option, 792 SIMPLE option, 792 SINGULAR= option, 792 STDMEAN option, 792 TCORR option, 792 TCOV option, 792 TSSCP option, 792 WCORR option, 792 WCOV option, 792 WSSCP option, 793 CANDISC procedure, VAR statement, 794 CANDISC procedure, WEIGHT statement, 794 canonical analysis ANOVA procedure, 438 GLM procedure, 1760 repeated measurements, 448 response surfaces, 4046 canonical coefficients, 783 canonical component, 783 canonical correlation CANCORR procedure, 751 definition, 752 hypothesis tests, 751 TRANSREG procedure, 4584, 4593 canonical discriminant analysis, 783, 1139 canonical factor solution, 1297 CANONICAL option MANOVA statement (ANOVA), 438 MANOVA statement (GLM), 1760 OUTPUT statement (TRANSREG), 4584 PROC DISCRIM statement, 1147 REPEATED statement (ANOVA), 448 REPEATED statement (GLM), 1780 canonical redundancy analysis CANCORR procedure, 751, 761 canonical variables, 783 ANOVA procedure, 438 TRANSREG procedure, 4584 canonical weights, 752, 783 CANPREFIX= option PROC DISCRIM statement, 1147 CANPRINT option MTEST statement (REG), 3833 CASCADE= option PROC MODECLUS statement, 2865 cascaded density estimates MODECLUS procedure, 2873 case weight PHREG procedure, 3239 case-control studies odds ratio, 1488, 1503, 1504
PHREG procedure, 3217, 3228, 3280 casewise deletion PRINQUAL procedure, 3655 categorical data analysis, See CATMOD procedure categorical variable, 72 categorical variables, See classification variables CATMOD parameterization, 845 CATMOD procedure analysis of variance, 815 at sign (@) operator, 866 AVERAGED models, 881 bar (|) operator, 865 cautions, 869, 870, 887 cell count data, 861 classification variables, 864 compared to other procedures, 815, 869, 870, 1792 computational method, 891–894 continuous variables, 864 continuous variables, caution, 869, 870 contrast examples, 919 contrasts, comparing with GLM, 833 convergence criterion, 842 design matrix, 847, 848 design matrix, REPEATED statement, 882 effect specification, 864 effective sample sizes, 887 estimation methods, 817 _F_ specification, 840, 862 hypothesis tests, 888 input data sets, 813, 860 interactive use, 818, 828 introductory example, 818 iterative proportional fitting, 843 linear models, 814 log-linear models, 814, 870, 916, 919, 1616 logistic analysis, 815, 868, 933 logistic regression, 814, 869, 911 maximum likelihood estimation, 817 maximum likelihood estimation formulas, 895 memory requirements, 897 missing values, 860 MODEL statement, examples, 840 ordering of parameters, 880 ordering of populations, 863 ordering of responses, 863 ordinal model, 869 output data sets, 854, 866, 867 parameterization, comparing with GLM, 833 positional requirements for statements, 828 quasi-independence model, 919 regression, 815 repeated measures, 815, 850, 873, 925, 930, 933, 937 repeated measures, MODEL statements, 875 REPEATED statement, examples, 873
response functions, 836, 840, 852, 854–856, 859, 862, 901, 906, 944 _RESPONSE_ keyword, 836, 839, 840, 842, 850, 864, 870, 873, 881, 882, 884, 888, 898 _RESPONSE_= option, 837, 851 restrictions on parameters, 859 sample survey analysis, 816 sampling zeros and log-linear analyses, 871 sensitivity, 941 singular covariance matrix, 887 specificity, 941 syntax, 827 time requirements, 897 types of analysis, 814, 864 underlying model, 816 weighted least squares, 817, 846 zeros, structural and sampling, 888, 919, 924 CATMOD procedure, BY statement, 830 CATMOD procedure, CONTRAST statement, 831 ALPHA= option, 832 ESTIMATE= option, 832 CATMOD procedure, DIRECT statement, 835 CATMOD procedure, FACTORS statement, 836 PROFILE= option, 837 _RESPONSE_= option, 837 TITLE= option, 837 CATMOD procedure, LOGLIN statement, 839 TITLE= option, 839 CATMOD procedure, MODEL statement, 840 ADDCELL= option, 841 ALPHA= option, 842 AVERAGED option, 842 CLPARM option, 842 CORRB option, 842 COV option, 842 COVB option, 842 DESIGN option, 842 EPSILON= option, 842 FREQ option, 842 GLS option, 846 ITPRINT option, 842 MAXITER= option, 842 MISS= option, 844 MISSING= option, 844 ML option, 843 NODESIGN option, 845 NOINT option, 845 NOITER option, 845 NOPARM option, 845 NOPREDVAR option, 845 NOPRINT option, 845 NOPROFILE option, 845 NORESPONSE option, 845 ONEWAY option, 845 PARAM= option, 845 PRED= option, 845 PREDICT option, 845 PROB option, 846 PROFILE option, 846
_RESPONSE_ keyword, 836, 839, 840, 842, 850, 864, 870, 873, 881, 882, 884, 888, 898 TITLE= option, 846 WLS option, 846 XPX option, 846 ZERO= option, 846 ZEROES= option, 846 ZEROS= option, 846 CATMOD procedure, POPULATION statement, 848 CATMOD procedure, PROC CATMOD statement, 829 DATA= option, 829 NAMELEN= option, 829 NOPRINT option, 829 ORDER= option, 829 CATMOD procedure, REPEATED statement, 850 PROFILE= option, 851 _RESPONSE_= option, 851 TITLE= option, 852 CATMOD procedure, RESPONSE statement, 852 ALOGIT function, 852 CLOGIT function, 853 JOINT function, 853 LOGIT function, 853 MARGINAL function, 853 MEAN function, 853 OUT= option, 854 OUTEST= option, 854 READ function, 853 TITLE= option, 854 CATMOD procedure, RESTRICT statement, 859 CATMOD procedure, WEIGHT statement, 860 CAXIS= option PLOT statement (BOXPLOT), 495 PLOT statement (REG), 3844 CBAR= option OUTPUT statement (LOGISTIC), 2321 CBLOCKLAB= option PLOT statement (BOXPLOT), 495 CBLOCKVAR= option PLOT statement (BOXPLOT), 495 CBOXES= option PLOT statement (BOXPLOT), 496 CBOXFILL= option PLOT statement (BOXPLOT), 496 CCC option OUTPUT statement (TRANSREG), 4584 PROC CLUSTER statement, 968 CCLIP= option PLOT statement (BOXPLOT), 496 CCONF= option MCMC statement (MI), 2525 CCONNECT= option MCMC statement (MI), 2529 PLOT statement (BOXPLOT), 497 CCONVERGE= option MODEL statement (TRANSREG), 4574 PROC PRINQUAL statement, 3653 CCOVERLAY= option
PLOT statement (BOXPLOT), 497 CDF, See cumulative distribution function CDF keyword OUTPUT statement (LIFEREG), 2100 cdf plots annotating, 3718 axes, color, 3718 font, specifying, 3719 options summarized by function, 3716, 3734 reference lines, options, 3719–3722 threshold lines, options, 3721 cdfplot PROBIT procedure, 3715 CDFPLOT statement, See PROBIT procedure, CDFPLOT statement options summarized by function, 3716 PROBIT procedure, 3715 CDPREFIX= option OUTPUT statement (TRANSREG), 4584 CEC option OUTPUT statement (TRANSREG), 4584 ceiling sample size GLMPOWER procedure, 1947 POWER procedure, 3419, 3496 cell count data, 1464 CATMOD procedure, 861 example (FREQ), 1527 cell of a contingency table, 72 cell-means coding TRANSREG procedure, 4569, 4594, 4662 CELLCHI2 option PROC CORRESP statement, 1074 TABLES statement (FREQ), 1454 CENSCALE option PROC PLS statement, 3374 censored data (LIFEREG), 2083 LIFETEST procedure, 2186 observations (PHREG), 3283 survival times (PHREG), 3215, 3281, 3283 values (PHREG), 3218, 3272 CENSORED keyword OUTPUT statement (LIFEREG), 2100 CENSOREDSYMBOL= option PROC LIFETEST statement, 2160 censoring, 2083 LIFEREG procedure, 2094 variable (PHREG), 3218, 3228, 3235, 3283 CENTER option MODEL statement (TRANSREG), 4571 PROC MULTTEST statement, 2939 center-point coding TRANSREG procedure, 4568, 4668, 4670 CENTER= option RIDGE statement (RSREG), 4043 centering TRANSREG procedure, 4571 centroid component, 4800
definition, 4799 centroid method CLUSTER procedure, 966, 976 CENTROID option PROC VARCLUS statement, 4808 CERTSIZE= option PROC SURVEYSELECT statement, 4432 CFACTOR= option NLOPTIONS statement (CALIS), 621 PROC NLMIXED statement, 3061 CFRAME= option MCMC statement (MI), 2525, 2529 PLOT statement (BOXPLOT), 497 PLOT statement (REG), 3844 PROC TREE statement, 4750 CGRID= option BOXPLOT procedure, 497 CHAIN= option MCMC statement (MI), 2526 chaining, reducing when clustering, 972 CHANGE= option PROC PRINQUAL statement, 3653 character OPSCORE variables PRINQUAL procedure, 3673 TRANSREG procedure, 4604 characteristic roots and vectors ANOVA procedure, 437 GLM procedure, 1759 Chebychev distance coefficient DISTANCE procedure, 1272 chi-square adjusted (CALIS), 655 displaying (CALIS), 684, 685 indices (CALIS), 653 reweighted (CALIS), 655 chi-square test SURVEYFREQ procedure, 4216 chi-square tests examples (FREQ), 1530, 1535, 1538 FREQ procedure, 1469, 1470 power and sample size (POWER), 3429, 3432, 3457, 3462, 3505, 3506, 3524, 3541 chi-squared coefficient DISTANCE procedure, 1273 CHIF option PROC ROBUSTREG statement, 3986, 3987 CHISQ option CONTRAST statement (MIXED), 2684 EXACT statement (FREQ), 1444, 1535 MODEL statement (MIXED), 2692 OUTPUT statement (FREQ), 1447 TABLES statement (FREQ), 1454, 1470, 1535 TABLES statement (SURVEYFREQ), 4199 CHISQ1 option TABLES statement (SURVEYFREQ), 4199 CHOCKING= option PLOT statement (REG), 3845 choice experiments TRANSREG procedure, 4660
Syntax Index CHREF= option PLOT statement (BOXPLOT), 497 PLOT statement (REG), 3845 Chromy’s method SURVEYSELECT procedure, 4448, 4452 CI= option ONESAMPLEMEANS statement (POWER), 3434 PAIREDMEANS statement (POWER), 3449 PROC TTEST statement, 4780 TWOSAMPLEMEANS statement (POWER), 3465 Cicchetti-Allison weights, 1497 CICONV= option MODEL statement (GENMOD), 1637 CILPREFIX= option OUTPUT statement (TRANSREG), 4584 CIPREFIX= option OUTPUT statement (TRANSREG), 4585 Cityblock distance coefficient DISTANCE procedure, 1272 CIUPREFIX= option OUTPUT statement (TRANSREG), 4585 CK= option PROC MODECLUS statement, 2865 CL option ESTIMATE statement (MIXED), 2685 LSMEANS statement (GENMOD), 1636 LSMEANS statement (GLM), 1756 LSMEANS statement (MIXED), 2689 MODEL statement (GENMOD), 1637 MODEL statement (LOGISTIC), 2319 MODEL statement (MIXED), 2692 MODEL statement (TRANSREG), 4575 RANDOM statement (MIXED), 2713 TABLES statement (FREQ), 1455 TABLES statement (SURVEYFREQ), 4199 CL= option PROC MIXED statement, 2674 CLABEL= option BOXPLOT procedure, 497 class level MIXED procedure, 2678 SURVEYMEANS procedure, 4346 CLASS statement ANOVA procedure, 435 CANDISC procedure, 793 DISCRIM procedure, 1154 GAM procedure, 1565 GENMOD procedure, 1629 GLM procedure, 1748 GLMMOD procedure, 1916 GLMPOWER procedure, 1937 INBREED procedure, 1974 LIFEREG procedure, 2092 LOGISTIC procedure, 2295 MI procedure, 2522 MIANALYZE procedure, 2617 MIXED procedure, 2681, 2748
4991
MULTTEST procedure, 2943 NESTED procedure, 2989 ORTHOREG procedure, 3202 PLS procedure, 3378 ROBUSTREG procedure, 3989 STEPDISC procedure, 4169 SURVEYLOGISTIC procedure, 4253 SURVEYMEANS procedure, 4329 SURVEYREG procedure, 4375 TPHREG procedure, 4477 TTEST procedure, 4781 VARCOMP procedure, 4836 CLASS transformation MODEL statement (TRANSREG), 4560 classification criterion DISCRIM procedure, 1139 error rate estimation (DISCRIM), 1163 classification level SURVEYREG procedure, 4391 classification table LOGISTIC procedure, 2314, 2352, 2353, 2422 classification variable SURVEYMEANS procedure, 4329, 4337, 4341 classification variables, 72 ANOVA procedure, 423, 451 CATMOD procedure, 864 GENMOD procedure, 1660 GLM procedure, 1784 GLMPOWER procedure, 1937, 1938 MIXED procedure, 2681 sort order of levels (GENMOD), 1625 SURVEYREG procedure, 4375 TRANSREG procedure, 4560, 4569 VARCOMP procedure, 4836 CLASSVAR= option PROC MIANALYZE statement, 2615 CLB option MODEL statement (REG), 3825 PROC CANCORR statement, 760 CLDIFF option MEANS statement (ANOVA), 442 MEANS statement (GLM), 1766 CLEAR option PLOT statement (REG), 3849 CLI option MODEL statement (GLM), 1771 MODEL statement (REG), 3825 OUTPUT statement (TRANSREG), 4585 CLINE= option PLOT statement (REG), 3845 CLIPFACTOR= option BOXPLOT procedure, 497, 533 CLIPLEGEND= option BOXPLOT procedure, 497 CLIPLEGPOS= option BOXPLOT procedure, 497 CLIPSUBCHAR= option BOXPLOT procedure, 498 CLIPSYMBOL= option
4992
Syntax Index
BOXPLOT procedure, 498 CLIPSYMBOLHT= option BOXPLOT procedure, 498 CLL= option MODEL statement (TRANSREG), 4570 CLM option MEANS statement (ANOVA), 442 MEANS statement (GLM), 1766 MODEL statement (GLM), 1771 MODEL statement (LOESS), 2232 MODEL statement (REG), 3825 OUTPUT statement (TRANSREG), 4585 PROC GAM statement, 1581 SCORE statement (LOESS), 2237 SCORE statement (LOGISTIC), 2324 CLODDS option MODEL statement (SURVEYLOGISTIC), 4262 CLODDS= option MODEL statement (LOGISTIC), 2308 CLOGIT function RESPONSE statement (CATMOD), 853 closing all destinations examples, ODS Graphics, 360 CLPARM option MODEL statement (CATMOD), 842 MODEL statement (GLM), 1771 MODEL statement (SURVEYLOGISTIC), 4262 MODEL statement (SURVEYREG), 4380 CLPARM= option MODEL statement (LOGISTIC), 2309 CLTYPE= option BASELINE statement (PHREG), 3225 EXACT statement (LOGISTIC), 2302 cluster centers, 1380, 1392 definition (MODECLUS), 2878 deletion, 1390 elliptical, 387 final, 1380 initial, 1380, 1381 mean, 1392 median, 1389, 1392 midrange, 1392 minimum distance separating, 1381 plotting (MODECLUS), 2878 seeds, 1380 SURVEYFREQ procedure, 4195 SURVEYLOGISTIC procedure, 4255 SURVEYMEANS procedure, 4329 SURVEYREG procedure, 4376 cluster analysis disjoint, 1379 large data sets, 1379 robust, 1379, 1392 tree diagrams, 4743 cluster analysis (STDIZE) standardizing, 4143 CLUSTER procedure, See also TREE procedure
algorithms, 986 association measure, 957 average linkage, 957 categorical data, 957 centroid method, 957 clustering methods, 957, 975 complete linkage, 957 computational resources, 986 density linkage, 957, 966 Euclidean distances, 957 F statistics, 972, 984 FASTCLUS procedure, compared, 957 flexible-beta method, 957, 967, 968, 981 hierarchical clusters, 957 input data sets, 969 interval scale, 988 kth-nearest-neighbor method, 957 maximum likelihood, 957, 967 McQuitty’s similarity analysis , 957 median method, 957 memory requirements, 986 missing values, 987 non-Euclidean distances, 957 output data sets, 971, 990 output table names, 994 pseudo F and t statistics, 972 ratio scale, 988 single linkage, 957 size, shape, and correlation, 988 syntax, 966 test statistics, 968, 972, 973 ties, 987 time requirements, 986 TREE procedure, compared, 957 two-stage density linkage, 957 types of data sets, 957 using macros for many analyses, 1013 Ward’s minimum-variance method, 957 Wong’s hybrid method, 957 CLUSTER procedure, BY statement, 973 CLUSTER procedure, COPY statement, 973 CLUSTER procedure, FREQ statement, 974 CLUSTER procedure, ID statement, 974 CLUSTER procedure, PROC CLUSTER statement, 966 BETA= option, 968 CCC option, 968 DATA= option, 969 DIM= option, 969 HYBRID option, 969 K= option, 970 MODE= option, 970 NOEIGEN option, 970 NOID option, 970 NONORM option, 971 NOPRINT option, 971 NOSQUARE option, 971 NOTIE option, 971 OUTTREE= option, 971
Syntax Index PENALTY= option, 971 PRINT= option, 971 PSEUDO= option, 972 R= option, 972 RMSSTD option, 972 RSQUARE option, 972 SIMPLE option, 972 STANDARD option, 972 TRIM= option, 972 CLUSTER procedure, RMSSTD statement, 974 CLUSTER procedure, VAR statement, 975 cluster sampling, 164 CLUSTER statement SURVEYFREQ procedure, 4195 SURVEYLOGISTIC procedure, 4255 SURVEYMEANS procedure, 4329 SURVEYREG procedure, 4376 CLUSTER= option PROC FASTCLUS statement, 1390 PROC MODECLUS statement, 2865 clustering, 957, See also CLUSTER procedure approximate covariance estimation, 387 average linkage, 966, 976 centroid method, 966, 976 complete linkage method, 966, 977 density linkage methods, 966, 967, 969, 970, 972, 977, 980, 982 disjoint clusters of variables, 4799 Gower’s method, 967, 981 hierarchical clusters of variables, 4799 maximum-likelihood method, 971, 980, 981 McQuitty’s similarity analysis, 967, 981 median method, 967, 981 methods affected by frequencies, 974 outliers in, 958, 972 penalty coefficient, 971 single linkage, 967, 982 smoothing parameters, 979 standardizing variables, 972 SURVEYFREQ procedure, 4203 transforming variables, 958 two-stage density linkage, 967 variables, 4799 Ward’s method, 967, 983 weighted average linkage, 967, 981 clustering and computing distance matrix Correlation coefficients, example, 1283 Jaccard coefficients, example, 1278 clustering and scaling DISTANCE procedure, example, 1253 MODECLUS procedure, 2856, 2857, 2874 STDIZE procedure, example, 4143 clustering criterion FASTCLUS procedure, 1379, 1391, 1392 clustering methods ACECLUS procedure, 405 FASTCLUS procedure, 1380, 1381 MODECLUS procedure, 2856, 2874
4993
CLWT option TABLES statement (SURVEYFREQ), 4199 CMALLOWS= option PLOT statement (REG), 3845 CMF, See cumulative mean function PHREG procedure, 3224 CMH option OUTPUT statement (FREQ), 1447 TABLES statement (FREQ), 1455 CMH1 option OUTPUT statement (FREQ), 1447 TABLES statement (FREQ), 1455 CMH2 option OUTPUT statement (FREQ), 1447 TABLES statement (FREQ), 1455 CMHCOR option OUTPUT statement (FREQ), 1447 CMHGA option OUTPUT statement (FREQ), 1447 CMHRMS option OUTPUT statement (FREQ), 1447 CMLPREFIX= option OUTPUT statement (TRANSREG), 4585 CMUPREFIX= option OUTPUT statement (TRANSREG), 4585 CNEEDLES= option MCMC statement (MI), 2525 COCHQ option OUTPUT statement (FREQ), 1448 Cochran and Cox t approximation, 4775, 4780, 4785 COCHRAN option PROC TTEST statement, 4780 Cochran’s Q test, 1493, 1499, 1548 Cochran-Armitage test for trend, 1490, 1543 continuity correction (MULTTEST), 2949 MULTTEST procedure, 2946, 2948, 2964 permutation distribution (MULTTEST), 2949 two-tailed test (MULTTEST), 2951 Cochran-Mantel-Haenszel statistics (FREQ), 1447, 1500, See also chi-square tests ANOVA (row mean scores) statistic, 1502 correlation statistic, 1501 examples, 1540 general association statistic, 1502 CODING= option MODEL statement (GENMOD), 1637 COEF= option PROC MDS statement, 2477 coefficient alpha (FACTOR), 1337 of contrast (SURVEYREG), 4392 of determination (CALIS), 585 of estimate (SURVEYREG), 4393 of relationship (INBREED), 1977 of variation (SURVEYMEANS), 4340 redundancy (TRANSREG), 4590 coefficient of variation
4994
Syntax Index
SURVEYFREQ procedure, 4214 COEFFICIENTS option OUTPUT statement (TRANSREG), 4585 cohort studies, 1540 relative risk, 1489, 1507 COL option TABLES statement (SURVEYFREQ), 4200 COLLECT option PLOT statement (REG), 3849 COLLIN option MODEL statement (REG), 3825 collinearity REG procedure, 3895 COLLINOINT option MODEL statement (REG), 3825 column proportions SURVEYFREQ procedure, 4212 COLUMN= option PROC CORRESP statement, 1075 combinations generating with PLAN procedure, 3358 combining inferences MI procedure, 2561 MIANALYZE procedure, 2624 common factor defined for factor analysis, 1292 common factor analysis common factor rotation, 1294 compared with principal component analysis, 1293 Harris component analysis, 1293 image component analysis, 1293 interpreting, 1293 salience of loadings, 1294 COMMONAXES option PROC GAM statement, 1581 COMOR option EXACT statement (FREQ), 1444 comparing groups (GLM), 1804 means (TTEST), 4775, 4789 variances (TTEST), 4775, 4784, 4789 comparisonwise error rate (GLM), 1809 complementary log-log model SURVEYLOGISTIC procedure, 4284 complete block design example (ANOVA), 428 example (GLM), 1847 complete linkage CLUSTER procedure, 966, 977 complete separation LOGISTIC procedure, 2339 SURVEYLOGISTIC procedure, 4277 completely randomized design examples, 461 COMPONENT option FACTOR statement (CALIS), 607 PROC GAM statement, 1581 components
PLS procedure, 3367 compound symmetry structure example (MIXED), 2733, 2789, 2794 MIXED procedure, 2721 COMPRESS option PROC FREQ statement, 1441 computational details Hessian matrix (CALIS), 589 KDE procedure, 2002 Lagrange multipliers (CALIS), 589 LIFEREG procedure, 2108 maximum likelihood method (VARCOMP), 4839 MIVQUE0 method (VARCOMP), 4839 MIXED procedure, 2772 restricted maximum likelihood method (VARCOMP), 4840 SIM2D procedure, 4109 standard errors (CALIS), 589 SURVEYLOGISTIC procedure, 4282 SURVEYREG procedure, 4384 Type I method (VARCOMP), 4838 VARCOMP procedure, 4838, 4842 computational problems CALIS procedure, 678 convergence (CALIS), 678 convergence (FASTCLUS), 1390 convergence (MIXED), 2774 identification (CALIS), 669, 680, 724, 726, 732, 741 negative eigenvalues (CALIS), 681 negative R-square (CALIS), 681 NLMIXED procedure, 3098 overflow (CALIS), 678 singular predicted model (CALIS), 680 time (CALIS), 681 computational resources ACECLUS procedure, 410 CANCORR procedure, 768 CLUSTER procedure, 986 FACTOR procedure, 1335 FASTCLUS procedure, 1402 LIFEREG procedure, 2123 MODECLUS procedure, 2882 MULTTEST procedure, 2960 NLMIXED procedure, 3103 PRINCOMP procedure, 3611 ROBUSTREG procedure, 4012 VARCLUS procedure, 4818 COMPUTE statement VARIOGRAM procedure, 4866 concordant observations, 1474 CONDITION= option PROC MDS statement, 2477 conditional and unconditional simulation SIM2D procedure, 4091 conditional data MDS procedure, 2477
Syntax Index conditional distributions of multivariate normal random variables SIM2D procedure, 4108 conditional logistic regression LOGISTIC procedure, 2365 PHREG procedure, 3217, 3283 Conditional residuals MIXED procedure, 2764 CONF option PLOT statement (REG), 3845 CONFBAND= option SURVIVAL statement (LIFETEST), 2169 confidence bands LIFETEST procedure, 2169, 2176, 2205 Confidence intervals LIFEREG procedure, 2115 confidence intervals confidence coefficient (GENMOD), 1637 fitted values of the mean (GENMOD), 1641, 1669 individual observation (RSREG), 4042, 4043 means (ANOVA), 442 means (RSREG), 4042, 4043 means, power and sample size (POWER), 3432, 3438, 3448, 3456, 3463, 3473, 3512, 3522, 3532, 3563 model confidence interval (NLIN), 3029 pairwise differences (ANOVA), 442 parameter confidence interval (NLIN), 3028 profile likelihood (GENMOD), 1640, 1666 profile likelihood (LOGISTIC), 2314, 2315, 2345 TTEST procedure, 4780 Wald (GENMOD), 1643, 1667 Wald (LOGISTIC), 2319, 2346 Wald (SURVEYLOGISTIC), 4288 confidence intervals, FACTOR procedure, 1327 confidence level baseline survivor function (PHREG), 3225 SURVEYMEANS procedure, 4323 SURVEYREG procedure, 4373 confidence limits asymptotic (FREQ), 1475 baseline survivor function (PHREG), 3225, 3263 exact (FREQ), 1443 LIFETEST procedure, 2174, 2183, 2184, 2205 LOGISTIC procedure, 2350 MIXED procedure, 2674 SURVEYFREQ procedure, 4213 SURVEYMEANS procedure, 4340, 4342 SURVEYREG procedure, 4380 TRANSREG procedure, 4575, 4584, 4585, 4587, 4588 confidence limits, FACTOR procedure, 1327 configuration MDS procedure, 2471 CONFTYPE= option SURVIVAL statement (LIFETEST), 2169, 2191 conjoint analysis
4995
TRANSREG procedure, 4581, 4593, 4690, 4694 conjugate descent (NLMIXED), 3074 gradient (NLMIXED), 3072 gradient algorithm (CALIS), 577, 579–581, 665 connectedness method, See single linkage constant transformations avoiding (PRINQUAL), 3673 avoiding (TRANSREG), 4603 constant variables PRINQUAL procedure, 3673 TRANSREG procedure, 4578, 4604 constrained estimation CALIS procedure, 565 constraints boundary (CALIS), 565, 609, 675 boundary (MIXED), 2707, 2708 linear (CALIS), 565, 609, 676 modification indices (CALIS), 584 nonlinear (CALIS), 565, 610 ordered (CALIS), 675 program statements (CALIS), 565, 628, 630, 675 test indices (CALIS), 584 CONTAIN option MODEL statement (MIXED), 2692, 2693 containment method MIXED procedure, 2693 CONTENTS= option TABLES statement (FREQ), 1455 CONTGY option OUTPUT statement (FREQ), 1448 contingency coefficient, 1469, 1474 contingency tables, 72, 1431, 1450, 4196 CATMOD procedure, 816 continuity-adjusted chi-square, 1469, 1471 CONTINUITY= option TEST statement (MULTTEST), 2947 CONTINUOUS option PLOT statement (BOXPLOT), 498 continuous variables, 451, 1784 GENMOD procedure, 1660 continuous-by-class effects MIXED procedure, 2746 model parameterization (GLM), 1790 specifying (GLM), 1785 continuous-nesting-class effects MIXED procedure, 2745 model parameterization (GLM), 1789 specifying (GLM), 1785 %CONTOUR macro DISCRIM procedure, 1201 contour plots plots, ODS Graphics, 324, 360 CONTRAST keyword REPEATED statement (ANOVA), 448 CONTRAST option REPEATED statement (GLM), 1779, 1830 CONTRAST statement
4996
Syntax Index
CATMOD procedure, 831 GENMOD procedure, 1631 GLM procedure, 1749 GLMPOWER procedure, 1937 LOGISTIC procedure, 2297 MIXED procedure, 2681 MULTTEST procedure, 2944 NLMIXED procedure, 3076 SURVEYLOGISTIC procedure, 4255 SURVEYREG procedure, 4376 TPHREG procedure, 4479 CONTRAST= option ONEWAYANOVA statement (POWER), 3439 contrasts, 3076 comparing CATMOD and GLM, 833 GENMOD procedure, 1633 GLM procedure, 1749 MIXED procedure, 2681, 2685 power and sample size (GLMPOWER), 1934, 1937, 1950, 1951, 1956 power and sample size (POWER), 3438, 3439, 3442, 3513, 3536 repeated measurements (ANOVA), 447, 448 repeated measures (GLM), 1779 specifying (CATMOD), 831 SURVEYREG procedure, 4376, 4389 control comparing treatments to (GLM), 1807, 1812 control charts, 23 CONTROL keyword OUTPUT statement (LIFEREG), 2100 control sorting SURVEYSELECT procedure, 4443, 4445 CONTROL statement NLIN procedure, 3011 SURVEYSELECT procedure, 4443 CONVENIENT option MODEL statement (TRANSREG), 4570 converge in EM algorithm MI procedure, 2522 CONVERGE option EM statement (MI), 2522 MODEL statement (TRANSREG), 4575 CONVERGE= option MCMC statement (MI), 2527 MODEL statement (GENMOD), 1637 MODEL statement (LIFEREG), 2096 PROC ACECLUS statement, 404 PROC FACTOR statement, 1309 PROC FASTCLUS statement, 1390 PROC MDS statement, 2478 PROC NLIN statement, 3006 PROC PRINQUAL statement, 3653 REPEATED statement (GENMOD), 1647 TABLES statement (FREQ), 1456 convergence criterion ACECLUS procedure, 404 CATMOD procedure, 842 GENMOD procedure, 1637, 1647
MDS procedure, 2478, 2480, 2481 MIXED procedure, 2674, 2675, 2749, 2775 profile likelihood (LOGISTIC), 2314 convergence in EM algorithm MI procedure, 2527 convergence in MCMC MI procedure, 2555, 2566 CONVERGENCE option PROC ROBUSTREG statement, 3984, 3987 convergence problems MIXED procedure, 2774 NLMIXED procedure, 3099 CONVERGEOBJ= option PROC NLIN statement, 3007 CONVERGEPARM= option PROC NLIN statement, 3007 CONVF option PROC MIXED statement, 2675, 2749 CONVG option PROC MIXED statement, 2675, 2749 CONVG= option MODEL statement (LIFEREG), 2096 CONVH option PROC MIXED statement, 2675, 2749 CONVH= option MODEL statement (GENMOD), 1638 convolution distribution (MULTTEST), 2950 KDE procedure, 2005 Cook’s D influence statistic, 1774, 4042 Cook’s D MIXED procedure, 2768 Cook’s D for covariance parameters MIXED procedure, 2768 Cook’s D plots plots, ODS Graphics, 353 COOKD keyword OUTPUT statement (GLM), 1774 COORDINATES option OUTPUT statement (TRANSREG), 4586 COORDINATES statement KRIGE2D procedure, 2039 SIM2D procedure, 4099 VARIOGRAM procedure, 4869 COPY statement DISTANCE procedure, 1268 TREE procedure, 4755 CORE option PROC MODECLUS statement, 2865 CORR option LSMEANS statement (GENMOD), 1636 LSMEANS statement (MIXED), 2689 PROC CALIS statement, 584 PROC CANCORR statement, 760 PROC FACTOR statement, 1309 PROC NLMIXED statement, 3061 PROC REG statement, 3817 PROC VARCLUS statement, 4808 CORR procedure, 21
Syntax Index CORR= option ONECORR statement (POWER), 3427 PAIREDMEANS statement (POWER), 3450 REPEATED statement (GENMOD), 1648 CORRB option MODEL statement (CATMOD), 842 MODEL statement (GENMOD), 1638 MODEL statement (LIFEREG), 2097 MODEL statement (LOGISTIC), 2309 MODEL statement (MIXED), 2692 MODEL statement (PHREG), 3233 MODEL statement (REG), 3825 MODEL statement (ROBUSTREG), 3989 MODEL statement (SURVEYLOGISTIC), 4262 PROC CANCORR statement, 760 REPEATED statement (GENMOD), 1647 CORRECT=NO option PROC NPAR1WAY statement, 3156 correction for means CANCORR procedure, 760 correlated data GEE (GENMOD), 1611, 1672 correlated proportions, See McNemar’s test correlation CANCORR procedure, 760 estimates (MIXED), 2713, 2716, 2720, 2791 length (VARIOGRAM), 4860 matrix (GENMOD), 1638, 1656 matrix (REG), 3817 matrix, estimated (CATMOD), 842 principal components, 3610, 3612 range (KRIGE2D), 2034 correlation coefficients power and sample size (POWER), 3426, 3502, 3503 Correlation dissimilarity coefficient DISTANCE procedure, 1271 Correlation similarity coefficient DISTANCE procedure, 1271 correlation statistic, 1501 CORRELATIONS option PROC PRINQUAL statement, 3653 CORRESP procedure, 1069 adjusted inertias, 1102 algorithm, 1097 analyse des correspondances, 1069 appropriate scoring, 1069 Best variables, 1104 binary design matrix, 1083 Burt table, 1076, 1084 coding, 1085 COLUMN= option, use, 1099 computational resources, 1096 correspondence analysis, 1069 doubling, 1085 dual scaling, 1069 fuzzy coding, 1085, 1087
4997
geometry of distance between points, 1100, 1112, 1118 homogeneity analysis, 1069 inertia, definition, 1070 input tables and variables, 1072, 1082, 1083 matrix decompositions, 1079, 1100 matrix formulas for statistics, 1103 memory requirements, 1097 missing values, 1077, 1088, 1092 multiple correspondence analysis (MCA), 1076, 1101, 1123 ODS graph names, 1109 optimal scaling, 1069 optimal scoring, 1069 OUTC= data set, 1094 OUTF= data set, 1095 output data sets, 1094 output table names, 1108 partial contributions to inertia table, 1103 %PLOTIT macro, 1070, 1118, 1128 PROFILE= option, use, 1099 quantification method, 1069 reciprocal averaging, 1069 ROW= option, use, 1099 scalogram analysis, 1069 supplementary rows and columns, 1080, 1102 syntax, 1072 syntax, abbreviations, 1073 TABLES statement, use, 1072, 1081, 1088 time requirements, 1097 VAR statement, use, 1072, 1081, 1091 CORRESP procedure, BY statement, 1080 CORRESP procedure, ID statement, 1080 CORRESP procedure, PROC CORRESP statement, 1073 ALL option, 1074 BENZECRI option, 1074 BINARY option, 1074 CELLCHI2 option, 1074 COLUMN= option, 1075 CP option, 1075 CROSS= option, 1075 DATA= option, 1075 DEVIATION option, 1075 DIMENS= option, 1075 EXPECTED option, 1076 FREQOUT option, 1076 GREENACRE option, 1076 MCA option, 1076 MCA= option, 1101 MININERTIA= option, 1077 MISSING option, 1077 NOCOLUMN= option, 1077 NOPRINT option, 1077 NOROW= option, 1077 NVARS= option, 1077 OBSERVED option, 1078 OUTC= option, 1078 OUTF= option, 1078
4998
Syntax Index
PRINT= option, 1078 PROFILE= option, 1078 ROW= option, 1079 RP option, 1079 SHORT option, 1079 SINGULAR= option, 1079 SOURCE option, 1079 UNADJUSTED option, 1079 CORRESP procedure, SUPPLEMENTARY statement, 1080 CORRESP procedure, TABLES statement, 1081 CORRESP procedure, VAR statement, 1081 CORRESP procedure, WEIGHT statement, 1082 correspondence analysis CORRESP procedure, 1069 CORRW option REPEATED statement (GENMOD), 1647 CORRXY= option POWER statement (GLMPOWER), 1939 COSAN model CALIS procedure, 552, 591 specification, 560 structural model example (CALIS), 561 COSAN statement, CALIS procedure, 591 Cosine coefficient DISTANCE procedure, 1272 counting process PHREG procedure, 3241 COV option LSMEANS statement (GENMOD), 1636 LSMEANS statement (GLM), 1756 LSMEANS statement (MIXED), 2689 MCMC statement (MI), 2524, 2529 MODEL statement (CATMOD), 842 PROC LATTICE statement, 2072 PROC NLMIXED statement, 3061 PROC PRINCOMP statement, 3604 COV statement, CALIS procedure, 604 COVAR option PROC INBREED statement, 1973 COVAR= option MODEL statement (RSREG), 4042 covariance LATTICE procedure, 2075 parameter estimates (MIXED), 2674, 2676 parameter estimates, ratio (MIXED), 2680 parameters (MIXED), 2661 principal components, 3610, 3612 regression coefficients (SURVEYREG), 4392 SURVEYFREQ procedure, 4210 covariance coefficients, See INBREED procedure covariance matrix for parameter estimates (CATMOD), 842 for response functions (CATMOD), 842 GENMOD procedure, 1638, 1656 NLMIXED procedure, 3101, 3106 PHREG procedure, 3222, 3233, 3245 REG procedure, 3817
singular (CATMOD), 887 symmetric and positive definite (SIM2D), 4107 COVARIANCE option PROC ROBUSTREG statement, 3984, 3986 PROC CALIS statement, 572 PROC FACTOR statement, 1310 PROC LATTICE statement, 2072 PROC PRINCOMP statement, 3604 PROC PRINQUAL statement, 3653 PROC ROBUSTREG statement, 3987 PROC VARCLUS statement, 4809 covariance parameter estimates MIXED procedure, 2750 Covariance similarity coefficient DISTANCE procedure, 1271 covariance structure analysis model, See COSAN model covariance structures examples (MIXED), 2723, 2782 MIXED procedure, 2664, 2721 covariates GLMPOWER procedure, 1938–1941, 1951, 1956 MIXED procedure, 2743 model parameterization (GLM), 1787 COVARIATES= option BASELINE statement (PHREG), 3224 covarimin method, 1291, 1318 COVB option MODEL statement (CATMOD), 842 MODEL statement (GENMOD), 1638 MODEL statement (LIFEREG), 2097 MODEL statement (LOGISTIC), 2309 MODEL statement (MIXED), 2692 MODEL statement (PHREG), 3233 MODEL statement (REG), 3825 MODEL statement (ROBUSTREG), 3989 MODEL statement (SURVEYLOGISTIC), 4262 MODEL statement (SURVEYREG), 4380 REPEATED statement (GENMOD), 1647 COVB= option PROC MIANALYZE statement, 2614 COVBI option MODEL statement (MIXED), 2693 COVER= option PROC FACTOR statement, 1310 coverage displays FACTOR procedure, 1328 COVERLAY= option PLOT statement (BOXPLOT), 498 COVERLAYCLIP= option PLOT statement (BOXPLOT), 498 COVM option PROC PHREG statement, 3221 COVOUT option PROC LIFEREG statement, 2090 PROC LOGISTIC statement, 2290 PROC PHREG statement, 3221 PROC PROBIT statement, 3711
Syntax Index PROC REG statement, 3817 PROC ROBUSTREG statement, 3983 COVRATIO MIXED procedure, 2770 COVRATIO for covariance parameters MIXED procedure, 2770 COVRATIO keyword OUTPUT statement (GLM), 1774 COVRATIO statistic, 3899 COVS option PROC PHREG statement, 3222 COVSANDWICH option PROC PHREG statement, 3222 COVSING= option NLOPTIONS statement (CALIS), 614 PROC CALIS statement, 588 PROC NLMIXED statement, 3061 COVTEST option PROC MIXED statement, 2676, 2750 COVTRACE MIXED procedure, 2770 COVTRACE for covariance parameters MIXED procedure, 2770 Cox regression analysis PHREG procedure, 3219 semiparametric model (PHREG), 3215 CP option MODEL statement (REG), 3825 PLOT statement (REG), 3845 PROC CORRESP statement, 1075 CPC option OUTPUT statement (TRANSREG), 4586 CPREFIX= option CLASS statement (LOGISTIC), 2295 CLASS statement (SURVEYLOGISTIC), 4253 CLASS statement (TPHREG), 4477 MODEL statement (TRANSREG), 4568, 4575 CPUCOUNT option MODEL statement (ROBUSTREG), 3991 CQC option OUTPUT statement (TRANSREG), 4586 CR= option PROC MODECLUS statement, 2865 Cramer’s V statistic, 1469, 1474 Cramer-von Mises test NPAR1WAY procedure, 3170 CRAMV option OUTPUT statement (FREQ), 1448 Crawford-Ferguson family, 1291 Crawford-Ferguson method, 1317, 1318 CREF= option MCMC statement (MI), 2525 Crime Rates Data, example PRINCOMP procedure, 3619 CRITMIN= option PROC MDS statement, 2482 CROSS option PROC MODECLUS statement, 2865 cross validated density estimates
MODECLUS procedure, 2872 cross validation DISCRIM procedure, 1163 PLS procedure, 3368, 3374, 3384 CROSS= option PROC CORRESP statement, 1075 crossed effects design matrix (CATMOD), 877 GENMOD procedure, 1660 MIXED procedure, 2744 model parameterization (GLM), 1788 specifying (ANOVA), 451, 452 specifying (CATMOD), 864 specifying (GLM), 1784 CROSSLIST option PROC DISCRIM statement, 1147 PROC MODECLUS statement, 2865 TABLES statement (FREQ), 1456 CROSSLISTERR option PROC DISCRIM statement, 1147 crossover designs power and sample size (POWER), 3549 crossproducts matrix REG procedure, 3917 crosstabulation (SURVEYFREQ) tables, 4227 crosstabulation tables, 1431, 1450, 4196 SURVEYFREQ procedure, 4227 CROSSVALIDATE option PROC DISCRIM statement, 1147 CRPANEL option ASSESS statement (PHREG), 3224 CSTEP option PROC ROBUSTREG statement, 3985 CSYMBOL= option MCMC statement (MI), 2525, 2529 CTABLE option MODEL statement (LOGISTIC), 2309 CTEXT= option PLOT statement (BOXPLOT), 498 PLOT statement (REG), 3845 cubic clustering criterion, 970, 973 CLUSTER procedure, 968 CUMCOL option TABLES statement (FREQ), 1457 cumulative baseline hazard function PHREG procedure, 3257 cumulative distribution function, 2114, 3705 LIFETEST procedure, 2149 cumulative logit model SURVEYLOGISTIC procedure, 4284 cumulative logits, See also response functions examples, (CATMOD), 869 specifying in CATMOD procedure, 853 using (CATMOD), 868 cumulative martingale residuals PHREG procedure, 3223, 3266, 3271 cumulative mean function,
4999
5000
Syntax Index
See mean function cumulative residuals, 1718, 1725 CURVE= option TWOSAMPLESURVIVAL statement (POWER), 3475 custom scoring coefficients, example SCORE procedure, 4086 customizing ODS graphs, 363, 369, 371, 373, 379 ODS styles, 345, 374, 376, 378 CUTOFF option MODEL statement (ROBUSTREG), 3989 CUTOFF= option PROC MDS statement, 2478 CV option TABLES statement (SURVEYFREQ), 4200 CV= option ONESAMPLEMEANS statement (POWER), 3434 PAIREDMEANS statement (POWER), 3450 PROC PLS statement, 3374 TWOSAMPLEMEANS statement (POWER), 3465 CVALS= option OUTPUT statement (PLAN), 3344 CVREF= option PLOT statement (BOXPLOT), 499 PLOT statement (REG), 3846 CVTEST= option PROC PLS statement, 3375 CVWT option TABLES statement (SURVEYFREQ), 4200 Czekanowski/Sorensen similarity coefficient DISTANCE procedure, 1276
D D option MODEL statement (RSREG), 4042 PROC NPAR1WAY statement, 3156 DAMPSTEP option, 3097 PROC NLMIXED statement, 3061 DAMPSTEP= option NLOPTIONS statement (CALIS), 622 DAPPROXIMATIONS option OUTPUT statement (TRANSREG), 4586 DATA step, 21 DATA= option PARMS statement (NLMIXED), 3078 PRIOR statement (MIXED), 2710 PROC ACECLUS statement, 404 PROC ANOVA statement, 433 PROC BOXPLOT statement, 487 PROC CALIS statement, 570 PROC CANCORR statement, 760 PROC CANDISC statement, 791 PROC CATMOD statement, 829 PROC CLUSTER statement, 969 PROC CORRESP statement, 1075 PROC DISCRIM statement, 1148
PROC DISTANCE statement, 1256 PROC FACTOR statement, 1310 PROC FASTCLUS statement, 1390 PROC FREQ statement, 1441 PROC GAM statement, 1564 PROC GENMOD statement, 1625 PROC GLM statement, 1745 PROC GLMMOD statement, 1914 PROC GLMPOWER statement, 1936 PROC INBREED statement, 1973 PROC KDE statement, 1996 PROC KRIGE2D statement, 2038 PROC LATTICE statement, 2072 PROC LIFEREG statement, 2090 PROC LIFETEST statement, 2160 PROC LOGISTIC statement, 2290 PROC MDS statement, 2478 PROC MI statement, 2519 PROC MIANALYZE statement, 2615 PROC MIXED statement, 2676 PROC MODECLUS statement, 2865 PROC MULTTEST statement, 2940 PROC NESTED statement, 2988 PROC NLIN statement, 3007 PROC NLMIXED statement, 3062 PROC NPAR1WAY statement, 3156 PROC ORTHOREG statement, 3201 PROC PHREG statement, 3222 PROC PLS statement, 3375 PROC PRINCOMP statement, 3605 PROC PRINQUAL statement, 3653 PROC PROBIT statement, 3712 PROC REG statement, 3817 PROC ROBUSTREG statement, 3983 PROC RSREG statement, 4040 PROC SCORE statement, 4072 PROC SIM2D statement, 4099 PROC STDIZE statement, 4130 PROC STEPDISC statement, 4166 PROC SURVEYFREQ statement, 4193 PROC SURVEYLOGISTIC statement, 4250 PROC SURVEYMEANS statement, 4323 PROC SURVEYREG statement, 4373 PROC SURVEYSELECT statement, 4433 PROC TPSPLINE statement, 4507 PROC TRANSREG statement, 4556 PROC TREE statement, 4750 PROC TTEST statement, 4780 PROC VARCLUS statement, 4809 PROC VARCOMP statement, 4835 PROC VARIOGRAM statement, 4865 SCORE statement (GAM), 1569 SCORE statement (LOGISTIC), 2324 SCORE statement (TPSPLINE), 4510 Davidon-Fletcher-Powell update, 3073 DDF= option MODEL statement (MIXED), 2693 TABLES statement (SURVEYFREQ), 4200 DDFM= option
MODEL statement (MIXED), 2693 DECIMALS= option PROC MDS statement, 2478 decomposition of the SSCP matrix ACECLUS procedure, 387 Default style ODS styles, 333, 346 DEFAULT= option UNITS statement (LOGISTIC), 2328 UNITS statement (SURVEYLOGISTIC), 4268 DEFF option MODEL statement (SURVEYREG), 4380 TABLES statement (SURVEYFREQ), 4200 DEFFBOUND= option MODEL statement (SURVEYLOGISTIC), 4265 DEGREE= option MODEL statement (LOESS), 2232 MODEL statement (TRANSREG), 4567 TRANSFORM statement (PRINQUAL), 3665 degrees of freedom CALIS procedure, 573, 576, 590, 676 FACTOR procedure, 1320 MI procedure, 2561 MIANALYZE procedure, 2625, 2627 models with class variables (GLM), 1791 NLMIXED procedure, 3062 SURVEYFREQ procedure, 4214 SURVEYMEANS procedure, 4340, 4344 SURVEYREG procedure, 4385 TRANSREG procedure, 4615 DELETE statement, REG procedure, 3820 delete variables (REG), 3820 DELETE= option PROC FASTCLUS statement, 1390 deleting observations REG procedure, 3903 DEMPHAS= option PROC CALIS statement, 588 dendritic method, See single linkage dendrogram, 4743 density estimation DISCRIM procedure, 1180, 1200 MODECLUS procedure, 2870 density function, See probability density function density linkage CLUSTER procedure, 966, 967, 969, 970, 972, 977, 980, 982 DENSITY= option PROC MODECLUS statement, 2866 dependent effect, definition, 451 DEPENDENT= option OUTPUT statement (TRANSREG), 4587 DEPONLY option MEANS statement (GLM), 1766 DEPSILON= option COMPUTE statement (VARIOGRAM), 4867 DER option
PREDICT statement (NLMIXED), 3079 DER statement NLIN procedure, 3011 DESCENDING option CLASS statement (GENMOD), 1629 CLASS statement (LOGISTIC), 2295 CLASS statement (SURVEYLOGISTIC), 4253 CLASS statement (TPHREG), 4477 MODEL statement, 2305, 4259 PROC LOGISTIC statement, 2290 PROC TREE statement, 4750 DESCRIPTION= option PLOT statement (BOXPLOT), 499 PLOT statement (GLMPOWER), 1945 PLOT statement (POWER), 3487 PLOT statement (REG), 3846 PROC LIFETEST statement, 2160 PROC TREE statement, 4750 descriptive statistics, See also UNIVARIATE procedure LOGISTIC procedure, 2294 PHREG procedure, 3223 SURVEYMEANS procedure, 4347 design effect SURVEYFREQ procedure, 4215 SURVEYREG procedure, 4388 design matrix formulas (CATMOD), 894 generation in CATMOD procedure, 876 GENMOD procedure, 1661 GLMMOD procedure, 1909, 1917, 1918 linear dependence in (GENMOD), 1661 TRANSREG procedure, 4586 design of experiments, See experimental design DESIGN option MODEL statement (CATMOD), 842 design points, TPSPLINE procedure, 4499, 4517 design summary SURVEYREG procedure, 4390 design-adjusted chi-square test SURVEYFREQ procedure, 4216 DESIGN= option OUTPUT statement (TRANSREG), 4586 DETAIL option MODEL statement (TRANSREG), 4575 DETAILS option MODEL statement (LOESS), 2232 MODEL statement (LOGISTIC), 2309 MODEL statement (PHREG), 3230 MODEL statement (REG), 3825 MTEST statement (REG), 3833 PROC PLS statement, 3375 determination index CALIS procedure, 657 deviance definition (GENMOD), 1615 GENMOD procedure, 1637 LOGISTIC procedure, 2308, 2316, 2354
PROBIT procedure, 3745, 3760 scaled (GENMOD), 1656 deviance residuals GENMOD procedure, 1670 LOGISTIC procedure, 2360 PHREG procedure, 3234, 3258, 3302 DEVIANCE statement, GENMOD procedure, 1633, 1646 DEVIATION option PROC CORRESP statement, 1075 TABLES statement (FREQ), 1457 DEVIATIONS option MODEL statement (TRANSREG), 4568 deviations-from-means coding TRANSREG procedure, 4568, 4594, 4654, 4668, 4670 DF= option CONTRAST statement (MIXED), 2684 CONTRAST statement (NLMIXED), 3076 ESTIMATE statement (MIXED), 2686 ESTIMATE statement (NLMIXED), 3077 LSMEANS statement (MIXED), 2690 MODEL statement (SURVEYREG), 4380 MODEL statement (TPSPLINE), 4508 PREDICT statement (NLMIXED), 3079 PROC NLMIXED statement, 3062 RANDOM statement (NLMIXED), 3080 DFBETA statistics PHREG procedure, 3234, 3260 DFBETAS statistic (REG), 3900 DFBETAS statistics LOGISTIC procedure, 2360 DFBETAS= option OUTPUT statement (LOGISTIC), 2321 DFBW option PROC MIXED statement, 2676 DFE= option PROC CALIS statement, 572 DFFITS MIXED procedure, 2768 DFFITS keyword OUTPUT statement (GLM), 1774 DFFITS statistic GLM procedure, 1774 REG procedure, 3899 DFMETHOD= option MODEL statement (LOESS), 2233 DFMETHOD=APPROX(Cutoff= ) option MODEL statement (LOESS), 2233 DFMETHOD=APPROX(Quantile= ) option MODEL statement (LOESS), 2233 DFR= option PROC CALIS statement, 572 DFREDUCE= option PROC CALIS statement, 576 diagnostic statistics REG procedure, 3896, 3897 DIAGNOSTICS option MODEL statement (ROBUSTREG), 3989
diagnostics panels plots, ODS Graphics, 322, 379 DIAHES option PROC NLMIXED statement, 3062 diameter method, See complete linkage Dice coefficient DISTANCE procedure, 1276 DIFCHISQ= option OUTPUT statement (LOGISTIC), 2322 DIFDEV= option OUTPUT statement (LOGISTIC), 2322 DIFF option LSMEANS statement (GENMOD), 1636 LSMEANS statement (MIXED), 2690 difference between means confidence intervals, 442 DIM= option PROC CLUSTER statement, 969 DIMENS= option PROC CORRESP statement, 1075 dimension coefficients MDS procedure, 2471, 2472, 2477, 2482, 2483, 2488, 2489 DIMENSION= option PROC MDS statement, 2478 PROC MODECLUS statement, 2866 dimensions MIXED procedure, 2678 direct effects design matrix (CATMOD), 879 specifying (CATMOD), 865 DIRECT option MODEL statement (LOESS), 2233 direct product structure MIXED procedure, 2721 DIRECT statement, CATMOD procedure, 835 DIRECTIONS statement VARIOGRAM procedure, 4870 discordant observations, 1474 DISCPROPDIFF= option PAIREDFREQ statement (POWER), 3444 DISCPROPORTIONS= option PAIREDFREQ statement (POWER), 3444 DISCPROPRATIO= option PAIREDFREQ statement (POWER), 3445 discrete logistic model likelihood (PHREG), 3241 PHREG procedure, 3216, 3228, 3283 discrete variables, See classification variables DISCRIM option MONOTONE statement (MI), 2531 DISCRIM procedure background, 1156 Bayes’ theorem, 1157 bivariate density estimation, 1200 calibration data set, 1139, 1167 classification criterion, 1139
computational resources, 1173 %CONTOUR macro, 1201 cross validation, 1163 density estimate, 1157, 1160, 1161, 1180, 1200 error rate estimation, 1163, 1165 input data sets, 1168, 1169 introductory example, 1140 kernel density estimates, 1191, 1212 memory requirements, 1174 missing values, 1156 nonparametric methods, 1158 ODS table names, 1178 optimal bandwidth, selection, 1162 output data sets, 1170, 1171 parametric methods, 1157 %PLOT macro, 1182 %PLOTIT macro, 1200 posterior probability, 1158, 1160, 1180, 1200 posterior probability error rate, 1163, 1165 quasi-inverse, 1164 resubstitution, 1163 squared distance, 1157, 1158 syntax, 1145 test set classification, 1163 time requirements, 1174 training data set, 1139 univariate density estimation, 1180 DISCRIM procedure, BY statement, 1153 DISCRIM procedure, CLASS statement, 1154 DISCRIM procedure, FREQ statement, 1154 DISCRIM procedure, ID statement, 1154 DISCRIM procedure, PRIORS statement, 1154 DISCRIM procedure, PROC DISCRIM statement, 1145 ALL option, 1147 ANOVA option, 1147 BCORR option, 1147 BCOV option, 1147 BSSCP option, 1147 CAN option, 1147 CANONICAL option, 1147 CANPREFIX= option, 1147 CROSSLIST option, 1147 CROSSLISTERR option, 1147 CROSSVALIDATE option, 1147 DATA= option, 1148 DISTANCE option, 1148 K= option, 1148 KERNEL= option, 1148 LIST option, 1148 LISTERR option, 1148 MAHALANOBIS option, 1148 MANOVA option, 1148 METHOD= option, 1148 METRIC= option, 1149 NCAN= option, 1149 NOCLASSIFY option, 1149 NOPRINT option, 1149 OUT= option, 1149
OUTCROSS= option, 1149 OUTD= option, 1150 OUTSTAT= option, 1150 PCORR option, 1150 PCOV option, 1150 POOL= option, 1150 POSTERR option, 1150 PSSCP option, 1151 R= option, 1151 SHORT option, 1151 SIMPLE option, 1151 SINGULAR= option, 1151 SLPOOL= option, 1151 STDMEAN option, 1152 TCORR option, 1152 TCOV option, 1152 TESTDATA= option, 1152 TESTLIST option, 1152 TESTLISTERR option, 1152 TESTOUT= option, 1152 TESTOUTD= option, 1152 THRESHOLD= option, 1152 TSSCP option, 1152 WCORR option, 1153 WCOV option, 1153 WSSCP option, 1153 DISCRIM procedure, TESTCLASS statement, 1155 DISCRIM procedure, TESTFREQ statement, 1155 DISCRIM procedure, TESTID statement, 1156 DISCRIM procedure, VAR statement, 1156 DISCRIM procedure, WEIGHT statement, 1156 discriminant analysis, 1139 canonical, 783, 1139 error rate estimation, 1140 misclassification probabilities, 1140 nonparametric methods, 1158 parametric methods, 1157 stepwise selection, 4157 discriminant function method MI procedure, 2544 discriminant functions, 784 disjoint clustering, 1379–1381 dispersion parameter estimation (GENMOD), 1614, 1657, 1665, 1666 GENMOD procedure, 1658 LOGISTIC procedure, 2354 PROBIT procedure, 3760 weights (GENMOD), 1650 displayed output SURVEYSELECT procedure, 4458 DISPLAYINIT option MCMC statement (MI), 2526 DISSIMILAR option PROC TREE statement, 4750 dissimilarity data MDS procedure, 2471, 2478, 2484 DIST = option MODEL statement (GAM), 1567 DIST= option
MODEL statement (GENMOD), 1638 ONECORR statement (POWER), 3427 ONESAMPLEMEANS statement (POWER), 3434 PAIREDFREQ statement (POWER), 3445 PAIREDMEANS statement (POWER), 3450 TWOSAMPLEMEANS statement (POWER), 3465 distance between clusters (FASTCLUS), 1398 classification (VARIOGRAM), 4874 data (FASTCLUS), 1379 data (MDS), 2471 Euclidean (FASTCLUS), 1380 distance data MDS procedure, 2478, 2484 DISTANCE data sets CLUSTER procedure, 969 distance measures available in DISTANCE procedure, See proximity measures DISTANCE option PROC CANDISC statement, 791 PROC DISCRIM statement, 1148 PROC FASTCLUS statement, 1390 DISTANCE procedure absent-absent match, asymmetric binary variable, 1250 absent-absent match, example, 1278 absolute level of measurement, definition, 1250 affine transformation, 1250 asymmetric binary variable, 1250 available levels of measurement, 1263 available options for the option list, 1264 Binary Lance and Williams nonmetric coefficient, 1276 Bray and Curtis coefficient, 1276 Canberra metric coefficient, 1272 Chebychev distance coefficient, 1272 chi-squared coefficient, 1273 Cityblock distance coefficient, 1272 computing distances with weights, 1267 Correlation dissimilarity coefficient, 1271 Correlation similarity coefficient, 1271 Cosine coefficient, 1272 Covariance similarity coefficient, 1271 Czekanowski/Sorensen similarity coefficient, 1276 Dice coefficient, 1276 Dot Product coefficient, 1273 Euclidean distance coefficient, 1271 examples, 1251, 1278, 1283 extension of binary variable, 1251 formatted values, 1277 formulas for proximity measures, 1270 frequencies, 1268 functional summary, 1255 fuzz factor, 1257 Generalized Euclidean distance coefficient, 1272 Gower’s dissimilarity coefficient, 1271
Gower’s similarity coefficient, 1270 Hamann coefficient, 1274 Hamming distance coefficient, 1274 identity transformation, 1250 initial estimates for A-estimates, 1257 interval level of measurement, 1250 Jaccard dissimilarity coefficient, 1276 Jaccard similarity coefficient, 1276 Kulcynski 1 coefficient, 1276 Lance-Williams nonmetric coefficient, 1272 levels of measurement, 1249 linear transformation, 1250 log-interval level of measurement, 1250 many-to-one transformation, 1249 Minkowski L(p) distance coefficient, 1272 missing values, 1261, 1262, 1267, 1276 monotone increasing transformation, 1249 nominal level of measurement, 1249 nominal variable, 1251 normalization, 1261, 1263 one-to-one transformation, 1249 ordinal level of measurement, 1249 output data sets, 1261, 1277 Overlap dissimilarity coefficient, 1273 Overlap similarity coefficient, 1273 phi-squared coefficient, 1273 Power distance coefficient, 1272 power transformation, 1250 ratio level of measurement, 1250 Roger and Tanimoto coefficient, 1274 Russell and Rao similarity coefficient, 1276 scaling variables, 1251 Shape distance coefficient, 1271 Similarity Ratio coefficient, 1272 Simple Matching coefficient, 1274 Simple Matching dissimilarity coefficient, 1274 Size distance coefficient, 1271 Sokal and Sneath 1 coefficient, 1274 Sokal and Sneath 3 coefficient, 1275 Squared Correlation dissimilarity coefficient, 1272 Squared Correlation similarity coefficient, 1272 Squared Euclidean distance coefficient, 1271 standardization methods, 1265 standardization with frequencies, 1268 standardization with weights, 1269 standardization, default methods, 1251, 1265 standardization, example, 1253 standardization, mandatory, 1251 strictly increasing transformation, 1249 summary of options, 1255 symmetric binary variable, 1250 transforming ordinal variables to interval, 1250, 1261 weights, 1267, 1269 DISTANCE procedure, BY statement, 1268 DISTANCE procedure, COPY statement, 1268 DISTANCE procedure, FREQ statement, 1268 DISTANCE procedure, ID statement, 1268
DISTANCE procedure, PROC DISTANCE statement, 1255 ABSENT= option, 1256 ADD= option, 1256 DATA= option, 1256 FUZZ= option, 1257 INITIAL= option, 1257 METHOD= option, 1257 MULT= option, 1261 NOMISS, 1261 NORM option, 1261 OUT= option, 1261 OUTSDZ= option, 1261 PREFIX= option, 1261 RANKSCORE= option, 1261 REPLACE, 1262 REPONLY, 1262 SHAPE= option, 1262 SNORM option, 1263 UNDEF= option, 1263 VARDEF= option, 1263 DISTANCE procedure, VAR statement ABSENT= option, 1266 MISSING= option, 1267 ORDER= option, 1267 WEIGHTS= option, 1267 DISTANCE procedure, WGT statement, 1269 distance tolerance VARIOGRAM procedure, 4867 DISTANCE= option MODEL statement (TPSPLINE), 4509 DISTRIBUTION= option MODEL statement (LIFEREG), 2097 distributions Gompertz, 3705 logistic, 3705 normal, 3705 DIVISOR= option ESTIMATE statement (GLM), 1752 ESTIMATE statement (MIXED), 2686 ESTIMATE statement (SURVEYREG), 4379 DK= option PROC MODECLUS statement, 2866 DOCK= option PROC MODECLUS statement, 2866 PROC TREE statement, 4750 DOCUMENT destination, 361 examples, ODS Graphics, 361 ODS Graphics, 326, 335 DOCUMENT procedure, 361, 362 document paths, 362 examples, ODS Graphics, 361 LIST statement, 362 REPLAY statement, 362 Documents window, 361 dollar-unit sampling SURVEYSELECT procedure, 4466 domain analysis SURVEYFREQ procedure, 4205
SURVEYMEANS procedure, 4336, 4348 domain mean SURVEYMEANS procedure, 4343 DOMAIN statement SURVEYMEANS procedure, 4330 domain statistics SURVEYMEANS procedure, 4342 domain total SURVEYMEANS procedure, 4343 DOT product (SCORE), 4065 Dot Product coefficient DISTANCE procedure, 1273 double arcsine test MULTTEST procedure, 2951 double dogleg algorithm (CALIS), 578, 579, 581, 665 method (NLMIXED), 3073 DR= option PROC MODECLUS statement, 2866 DREPLACE option OUTPUT statement (TRANSREG), 4587 DRIFT option PROC FASTCLUS statement, 1390 DROPSQUARE= option MODEL statement (LOESS), 2233 DSCALE MODEL statement (GENMOD), 1642 dual scaling CORRESP procedure, 1069 DUMMY option MODEL statement (TRANSREG), 4575 PROC PRINQUAL statement, 3654 dummy variables TRANSREG procedure, 4560, 4569, 4586 dummy variables example TRANSREG procedure, 4654 DUNCAN option MEANS statement (ANOVA), 442 MEANS statement (GLM), 1766 Duncan’s multiple range test, 442, 1766, 1814 Duncan-Waller test, 445, 1769, 1815 error seriousness ratio, 443, 1768 multiple comparison (ANOVA), 464 DUNNETT option MEANS statement (ANOVA), 442 MEANS statement (GLM), 1766 Dunnett’s adjustment GLM procedure, 1754 MIXED procedure, 2688 Dunnett’s test, 1766, 1767, 1812 one-tailed lower, 442, 1766 one-tailed upper, 442 two-tailed, 442 DUNNETTL option MEANS statement (ANOVA), 442 MEANS statement (GLM), 1766 DUNNETTU option MEANS statement (ANOVA), 442 MEANS statement (GLM), 1767
DW option MODEL statement (REG), 3826 DWPROB option MODEL statement (REG), 3826
E E option CONTRAST statement (GENMOD), 1632 CONTRAST statement (GLM), 1749 CONTRAST statement (LOGISTIC), 2300 CONTRAST statement (MIXED), 2684 CONTRAST statement (SURVEYLOGISTIC), 4257 CONTRAST statement (SURVEYREG), 4377 CONTRAST statement (TPHREG), 4481 ESTIMATE statement (GENMOD), 1634 ESTIMATE statement (GLM), 1752 ESTIMATE statement (MIXED), 2686 ESTIMATE statement (SURVEYREG), 4379 LSMEANS statement (GENMOD), 1636 LSMEANS statement (GLM), 1756 LSMEANS statement (MIXED), 2690 MODEL statement (GLM), 1771 MODEL statement (MIXED), 2695 TEST statement (PHREG), 3238 E1 option MODEL statement (GLM), 1771 MODEL statement (MIXED), 2695 E2 option MODEL statement (GLM), 1771 MODEL statement (MIXED), 2695 E3 option MODEL statement (GLM), 1771 MODEL statement (MIXED), 2696 E4 option MODEL statement (GLM), 1771 E= effects REPEATED statement (ANOVA), 450 E= option CONTRAST statement (GLM), 1750 MANOVA statement (ANOVA), 437 MANOVA statement (GLM), 1759 MEANS statement (ANOVA), 443 MEANS statement (GLM), 1767 REPEATED statement (GLM), 1782 EARLY option PROC MODECLUS statement, 2866 EBLUP MIXED procedure, 2703 EBOPT option PROC NLMIXED statement, 3062 EBSSFRAC option PROC NLMIXED statement, 3062 EBSSTOL option PROC NLMIXED statement, 3062 EBSTEPS option PROC NLMIXED statement, 3062 EBSUBSTEPS option PROC NLMIXED statement, 3062
EBTOL option PROC NLMIXED statement, 3062 EBZSTART option PROC NLMIXED statement, 3062 ECORR option PROC NLMIXED statement, 3063 ECORRB option REPEATED statement (GENMOD), 1647 ECOV option PROC NLMIXED statement, 3063 ECOVB option REPEATED statement (GENMOD), 1647 EDER option PROC NLMIXED statement, 3063 EDF option MODEL statement (REG), 3826 OUTPUT statement (NPAR1WAY), 3162 PLOT statement (REG), 3846 PROC NPAR1WAY statement, 3157 PROC REG statement, 3817 EDF tests NPAR1WAY procedure, 3168 EDF= option PROC CALIS statement, 572 PROC CANCORR statement, 760 PROC MIANALYZE statement, 2615 EFF option PROC ROBUSTREG statement, 3986, 3988 effect coding (TRANSREG), 4549, 4568, 4654, 4668, 4670 definition, 451, 864, 1784 name length (MIXED), 2678 specification (ANOVA), 451 specification (CATMOD), 864 specification (GENMOD), 1659 specification (GLM), 1784 testing (SURVEYREG), 4386, 4392 EFFECT Parameterization SURVEYLOGISTIC procedure, 4270 effect size power and sample size (POWER), 3485 EFFECT= modifier INFLUENCE option, MODEL statement (MIXED), 2697 effective sample size LIFETEST procedure, 2172, 2173 EFFECTS option MODEL statement (TRANSREG), 4568 EFFECTVAR= option PROC MIANALYZE statement, 2614 Efron method likelihood (PHREG), 3228, 3241 eigenvalues and eigenvectors ACECLUS procedure, 398, 399, 406, 410–412 CANCORR procedure, 754, 769 PRINCOMP procedure, 3595, 3610–3612 RSREG procedure, 4050 EIGENVECTORS option
PROC FACTOR statement, 1310 Einot and Gabriel’s multiple range test ANOVA procedure, 444 examples (GLM), 1851 GLM procedure, 1768, 1815 Ekblom-Newton algorithm FASTCLUS procedure, 1392 elementary linkage analysis, See single linkage EM algorithm MI procedure, 2536, 2566 EM statement MI procedure, 2522 empirical Bayes estimation NLMIXED procedure, 3084 empirical best linear unbiased prediction MIXED procedure, 2703 empirical distribution function tests (NPAR1WAY), 3168 EMPIRICAL option MIXED, 2676 empirical sandwich estimator MIXED procedure, 2676 empty stratum SURVEYMEANS procedure, 4334, 4344 ENDGRID option PLOT statement (BOXPLOT), 499 ENTRYTIME= option MODEL statement (PHREG), 3229 Epanechnikov kernel (DISCRIM), 1160 EPOINT transformation MODEL statement (TRANSREG), 4561 EPSILON = option MODEL statement (GAM), 1567 EPSILON= option MODEL statement (CATMOD), 842 PROC MDS statement, 2479 PROC PLS statement, METHOD=PLS option, 3376 PROC PLS statement, MISSING=EM option, 3376 PROC VARCOMP statement, 4835 EPSSCORE = option MODEL statement (GAM), 1567 EQCONS= option PARMS statement (MIXED), 2707 EQKAP option OUTPUT statement (FREQ), 1448 EQS program CALIS procedure, 555 equal precision bands LIFETEST procedure, 2169, 2178, 2205 equality of means (TTEST), 4775, 4789 of variances (TTEST), 4775, 4784, 4789 equamax method, 1291, 1317, 1318 equivalence tests
power and sample size (POWER), 3432, 3438, 3448, 3456, 3463, 3472, 3510, 3511, 3520, 3521, 3530, 3531, 3549 EQWKP option OUTPUT statement (FREQ), 1448 ERR= option MODEL statement (GENMOD), 1638 error rate estimation DISCRIM procedure, 1163, 1165 discriminant analysis, 1140 error seriousness ratio Waller-Duncan test, 443, 1768 error sum of squares clustering method, See Ward’s method ESTDATA= option PROC CALIS statement, 570 estimability GLM procedure, 1750 MIXED procedure, 2682 estimability checking GENMOD procedure, 1632 LOGISTIC procedure, 2300 SURVEYLOGISTIC procedure, 4258 TPHREG procedure, 4481 estimable functions checking (GLM), 1750 displaying (GLM), 1771 example (GLM), 1792 general form of, 1793 GLM procedure, 1751, 1752, 1758, 1773, 1792– 1794, 1796, 1797, 1803 MIXED procedure, 2702 printing (GLM), 1771 SURVEYREG procedure, 4378, 4393 ESTIMATE option EXACT statement (LOGISTIC), 2301 ESTIMATE statement GENMOD procedure, 1633 GLM procedure, 1751, 1801 MIXED procedure, 2685 NLMIXED procedure, 3077 SURVEYREG procedure, 4378 ESTIMATE= option CONTRAST statement (CATMOD), 832 CONTRAST statement (LOGISTIC), 2300 CONTRAST statement (SURVEYLOGISTIC), 4258 CONTRAST statement (TPHREG), 4481 estimated population marginal means, See least-squares means ESTIMATES modifier INFLUENCE option, MODEL statement (MIXED), 2697 estimation dispersion parameter (GENMOD), 1614 maximum likelihood (GENMOD), 1655 mixed model (MIXED), 2737 regression parameters (GENMOD), 1614 estimation criteria
CALIS procedure, 646 estimation methods CALIS procedure, 549, 574, 644–647 MIXED procedure, 2677 ETYPE option LSMEANS statement (GLM), 1756 ETYPE= option CONTRAST statement (GLM), 1750 MANOVA statement (GLM), 1760 MEANS statement (GLM), 1767 TEST statement (GLM), 1782 Euclidean distance coefficient DISTANCE procedure, 1271 Euclidean distances, 969, 971, 1158, 1380 clustering, 957 MDS procedure, 2472, 2477, 2488 Euclidean length STDIZE procedure, 4138 EVENLY option MODEL statement (TRANSREG), 4567 TRANSFORM statement (PRINQUAL), 3665 event times PHREG procedure, 3215, 3218 events/trials format for response GENMOD procedure, 1636, 1653 EVENTSYMBOL= option PROC LIFETEST statement, 2161 exact logistic regression LOGISTIC procedure, 2300, 2369 exact method likelihood (PHREG), 3228, 3240 EXACT option OUTPUT statement (FREQ), 1448 EXACT statement FREQ procedure, 1443 LOGISTIC procedure, 2300 NPAR1WAY procedure, 3159 exact tests computational algorithms (FREQ), 1509 computational algorithms (NPAR1WAY), 3172 computational resources (FREQ), 1511 computational resources (NPAR1WAY), 3173 confidence limits, 1443 examples (NPAR1WAY), 3190, 3191 FREQ procedure, 1508, 1543 MONTE Carlo estimates (NPAR1WAY), 3174 network algorithm (FREQ), 1509 NPAR1WAY procedure, 3171 p-value, definitions, 1510, 3172 permutation test (MULTTEST), 2949 EXACTONLY option PROC LOGISTIC statement, 2290 EXACTOPTIONS option PROC LOGISTIC statement, 2291 examples, ODS Graphics, 352 axes labels, modifying, 368 closing all destinations, 360 customizing ODS graphs, 363, 369, 371, 373, 379
customizing styles, 374, 376, 378 DOCUMENT destination, 361 DOCUMENT procedure, 361 editing templates, 365 excluding graphs, 352 graph axes, swapping, 371 graph fonts, modifying, 374, 379 graph names, 352, 361 graph sizes, modifying, 378 graph titles, modifying, 368 grid lines, modifying, 373 HTML output, 321, 324, 330, 352 HTML output, with tool tips, 354 Journal style, 358 LaTeX output, 358 line colors, modifying, 369 line patterns, modifying, 369 line thickness, modifying, 376 marker symbols, modifying, 369 multiple destinations, 360 PDF output, 360 presentations, 356 referring to graphs, 330 relative paths, 359 replaying output, 361, 363, 369 RTF output, 356, 360 saving templates, 368 selecting graphs, 330, 352, 361 style attributes, modifying, 374, 376, 378 style elements, modifying, 374, 376, 378 TEMPLATE procedure, 363, 369, 371, 373, 379 tick marks, modifying, 373 trace record, 330, 352, 363 excluded observations PRINQUAL procedure, 3657, 3674 TRANSREG procedure, 4605 excluding graphs examples, ODS Graphics, 352 ODS Graphics, 330 exemplary data set power and sample size (GLMPOWER), 1929, 1930, 1936, 1938, 1946, 1952 EXKNOTS= option MODEL statement (TRANSREG), 4567 exogenous variables CALIS procedure, 662 path diagram (CALIS), 664 EXP option ESTIMATE statement (GENMOD), 1634 EXP transformation MODEL statement (TRANSREG), 4562 TRANSFORM statement (MI), 2534 TRANSFORM statement (PRINQUAL), 3661 expected mean squares computing, types (GLM), 1835 random effects, 1833 EXPECTED option MODEL statement (GENMOD), 1638 PROC CORRESP statement, 1076
Syntax Index TABLES statement (FREQ), 1457 TABLES statement (SURVEYFREQ), 4200 expected trend MULTTEST procedure, 2951 expected weighted frequency SURVEYFREQ procedure, 4215 experimental design, 23, 1901, See also PLAN procedure aliasing structure (GLM), 1897 experimentwise error rate (GLM), 1809 EXPEST option MODEL statement (LOGISTIC), 2309 MODEL statement (SURVEYLOGISTIC), 4262 explicit intercept TRANSREG procedure, 4605 exploratory data analysis, 22 exponential distribution GENMOD procedure, 1704 exponential semivariogram model KRIGE2D procedure, 2048, 2049 External studentization, 2763 external unfolding MDS procedure, 2471 extreme value distribution PROBIT procedure, 3757
F – F– specification MODEL statement (CATMOD), 840, 862 F statistics CLUSTER procedure, 972, 984 GENMOD procedure, 1668 factor defined for factor analysis, 1292 factor analysis compared to component analysis, 1291, 1292 factor analysis model CALIS procedure, 554, 606, 608 COSAN statement (CALIS), 593 identification (CALIS), 606 LINEQS statement (CALIS), 602 MATRIX statement (CALIS), 595 path diagram (CALIS), 599, 600 RAM statement (CALIS), 598 specification (CALIS), 560 structural model example (CALIS), 564 factor analytic structures MIXED procedure, 2721 factor loadings CALIS procedure, 641 factor parsimax method, 1291, 1317, 1318 FACTOR procedure, 1307 CALIS procedure, 567, 572, 606 computational resources, 1335 coverage displays, 1328 degrees of freedom, 1320 Heywood cases, 1333 number of factors extracted, 1312 OUT= data sets, 1316
5009
output data sets, 1297, 1316 simplicity functions, 1294, 1316, 1329 syntax, 1307 time requirements, 1331 variances, 1320 FACTOR procedure, BY statement, 1320 FACTOR procedure, FREQ statement, 1321 FACTOR procedure, PARTIAL statement, 1322 FACTOR procedure, PRIORS statement, 1322 FACTOR procedure, PROC FACTOR statement, 1308 ALL option, 1309 ALPHA= option, 1309 CONVERGE= option, 1309 CORR option, 1309 COVARIANCE option, 1310 COVER= option, 1310 DATA= option, 1310 EIGENVECTORS option, 1310 FLAG= option, 1310 FUZZ= option, 1310 GAMMA= option, 1310 HEYWOOD option, 1311 HKPOWER= option, 1311 MAXITER= option, 1311 METHOD= option, 1311 MINEIGEN= option, 1312 MSA option, 1312 NFACTORS= option, 1313 NOBS= option, 1313 NOCORR option, 1313 NOINT option, 1313 NOPRINT option, 1313 NOPROMAXNORM option, 1313 NORM= option, 1313 NPLOT= option, 1314 OUT= option, 1314 OUTSTAT= option, 1314 PLOT option, 1314 POWER= option, 1314 PREPLOT option, 1314 PREROTATE= option, 1314 PRINT option, 1314 PRIORS= option, 1315 PROPORTION= option, 1316 RANDOM= option, 1316 RCONVERGE= option, 1316 REORDER option, 1316 RESIDUALS option, 1316 RITER= option, 1316 ROTATE= option, 1317 ROUND option, 1319 SCORE option, 1319 SCREE option, 1319 SE option, 1319 SIMPLE option, 1319 SINGULAR= option, 1319 TARGET= option, 1319 TAU= option, 1320 ULTRAHEYWOOD option, 1320
5010
Syntax Index
VARDEF= option, 1320 WEIGHT option, 1320 FACTOR procedure, VAR statement, 1322 FACTOR procedure, WEIGHT statement, 1322 Factor rotation with FACTOR procedure, 1298 factor rotation methods, 1291 factor scores CALIS procedure, 641, 643, 687 displaying (CALIS), 687 factor scoring coefficients FACTOR procedure, 4065 SCORE procedure, 4065, 4076 factor specification REPEATED statement (GLM), 1778 FACTOR statement, CALIS procedure, 606 factor structure, 1298 factor-value-settings option OUTPUT statement (PLAN), 3343 factors PLAN procedure, 3339, 3340, 3348 PLS procedure, 3367 FACTORS statement CATMOD procedure, 836 PLAN procedure, 3340 failure time LIFEREG procedure, 2083 false discovery rate adjustment (MULTTEST), 2959 false discovery rate (MULTTEST) p-value adjustments, 2959 false negative, false positive rate LOGISTIC procedure, 2314, 2353, 2422 fast Fourier transform KDE procedure, 2007 MULTTEST procedure, 2950 FAST option MODEL statement (LOGISTIC), 2309 FASTCLUS procedure algorithm for updating cluster seeds, 1392 bin-sort algorithm, 1389 cluster deletion, 1390 clustering criterion, 1379, 1391, 1392 clustering methods, 1380, 1381 compared to other procedures, 1403 computational problems, convergence, 1390 computational resources, 1402 controlling iterations, 1393 convergence criterion, 1390 distance, 1379, 1380, 1398 DRIFT option, 1380 Ekblom-Newton algorithm, 1392 homotopy parameter, 1390 imputation of missing values, 1391 incompatibilities, 1397 iteratively reweighted least-squares, 1391 Lp clustering, 1379, 1391 MAXCLUSTERS= option, 1381 MEAN= data sets, 1394
memory requirements, 1402 Merle-Spath algorithm, 1392 missing values, 1380, 1381, 1391, 1393, 1397 Newton algorithm, 1392 OUT= data sets, 1398 outliers, 1379 output data sets, 1393, 1394, 1398 output table names, 1407 OUTSTAT= data set, 1394, 1400 RADIUS= option, 1381 random number generator, 1394 scale estimates, 1390, 1392, 1397, 1399, 1400 seed replacement, 1381, 1394 syntax, 1388 weighted cluster means, 1395 FASTCLUS procedure, BY statement, 1395 FASTCLUS procedure, FREQ statement, 1396 FASTCLUS procedure, ID statement, 1396 FASTCLUS procedure, PROC FASTCLUS statement, 1388 BINS= option, 1389 CLUSTER= option, 1390 CONVERGE= option, 1390 DATA= option, 1390 DELETE= option, 1390 DISTANCE option, 1390 DRIFT option, 1390 HC= option, 1390 HP= option, 1390 IMPUTE option, 1391 INSTAT= option, 1391 IRLS option, 1391 L= option, 1391 LEAST= option, 1391 LIST option, 1393 MAXCLUSTERS= option, 1388 MAXITER= option, 1393 MEAN= option, 1393 NOMISS option, 1393 NOPRINT option, 1394 OUT= option, 1394 OUTITER option, 1394 OUTS= option, 1394 OUTSEED= option, 1394 OUTSTAT= option, 1394 RADIUS= option, 1388 RANDOM= option, 1394 REPLACE= option, 1394 SEED= option, 1394 SHORT option, 1395 STRICT= option, 1395 SUMMARY option, 1395 VARDEF= option, 1395 FASTCLUS procedure, VAR statement, 1397 FASTCLUS procedure, WEIGHT statement, 1397 FCONV2= option NLOPTIONS statement (CALIS), 618, 620 PROC NLMIXED statement, 3063 FCONV= option
Syntax Index MODEL statement (LOGISTIC), 2310 MODEL statement (SURVEYLOGISTIC), 4262 NLOPTIONS statement (CALIS), 614, 617 PROC CALIS statement, 580 PROC NLMIXED statement, 3063 FD= option PROC NLMIXED statement, 3064 FDCODE option PROC CALIS statement, 589 FDHESSIAN= option PROC NLMIXED statement, 3064 FDIGITS= option, 3092 NLOPTIONS statement (CALIS), 618 PROC NLMIXED statement, 3065 FDR option PROC MULTTEST statement, 2940, 2959 FIADJUST option PROC ROBUSTREG statement, 3985 fiducial limits, 3712, 3713, 3759 FILE= option ODS HTML statement, 326 ODS PDF statement, 338, 360 ODS RTF statement, 338 FILLCHAR= option PROC TREE statement, 4750 finite differencing NLMIXED procedure, 3064, 3091 finite population correction (fpc) SURVEYLOGISTIC procedure, 4280 SURVEYMEANS procedure, 4334 finite population correction factor, 165 first canonical variable, 783 first-order method NLMIXED procedure, 3085 first-stage sampling unit, 165 Fisher combination adjustment (MULTTEST), 2959 Fisher combination (MULTTEST) p-value adjustments, 2959 Fisher exact test MULTTEST procedure, 2944, 2946, 2954, 2975 Fisher information matrix example (MIXED), 2796 MIXED procedure, 2750 FISHER option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1448 TABLES statement (FREQ), 1457 TEST statement (MULTTEST), 2944, 2954, 2975 Fisher’s exact test FREQ procedure, 1469, 1472, 1473 power and sample size (POWER), 3457, 3463, 3525 Fisher’s LSD test, 445, 1769 Fisher’s scoring method GENMOD procedure, 1642, 1656 LOGISTIC procedure, 2317, 2318, 2336 MIXED procedure, 2674, 2680, 2774
5011
SURVEYLOGISTIC procedure, 4264, 4265, 4276 Fisher’s z test for correlation power and sample size (POWER), 3426, 3429, 3502, 3556 FISHER– C option PROC MULTTEST statement, 2940, 2959 fit plots plots, ODS Graphics, 322, 354 fit statistics SURVEYREG procedure, 4390 FIT= option PROC MDS statement, 2479 FITSTAT option SCORE statement (LOGISTIC), 2325 fixed effects MIXED procedure, 2663 sum-to-zero assumptions, 1835 VARCOMP procedure, 4831, 4837 fixed-effects model VARCOMP procedure, 4837 fixed-effects parameters MIXED procedure, 2661, 2732 fixed-radius kernels MODECLUS procedure, 2870 FIXED= option MODEL statement (VARCOMP), 4836 FLAG= option PROC FACTOR statement, 1310 FLAT option PRIOR statement (MIXED), 2710 Fleiss-Cohen weights, 1497 Fleming-Harrington Gρ test for homogeneity LIFETEST procedure, 2150, 2168 flexible-beta method CLUSTER procedure, 957, 967, 968, 981 floating point errors NLMIXED procedure, 3098 FLOW option PROC NLIN statement, 3008 PROC NLMIXED statement, 3065 folded form F statistic, 4775, 4784 FOLLOWUPTIME= option TWOSAMPLESURVIVAL statement (POWER), 3476 FONT= option PLOT statement (BOXPLOT), 499 FORM= option MODEL statement (KRIGE2D), 2042 SIMULATE statement (SIM2D), 4102 FORMAT= option TABLES statement (FREQ), 1457 formatted values DISTANCE procedure, 1277 FORMCHAR= option PROC FREQ statement, 1441 PROC LIFETEST statement, 2161 FORMULA= option PROC MDS statement, 2479
5012
Syntax Index
formulas CANCORR procedure, 765 forward selection LOGISTIC procedure, 2317, 2340 PHREG procedure, 3229, 3264 REG procedure, 3800, 3873 Forward-Dolittle transformation, 1794 fraction of missing information MI procedure, 2562 MIANALYZE procedure, 2625 fractional frequencies PHREG procedure, 3227 fractional sample size GLMPOWER procedure, 1947 POWER procedure, 3419, 3496 Frailty model example NLMIXED procedure, 3128 Freeman-Halton test, 1473 Freeman-Tukey test MULTTEST procedure, 2946, 2951, 2968 FREQ option MODEL statement (CATMOD), 842 FREQ procedure alpha level, 1445, 1453 binomial proportion, 1484, 1532 Bowker’s test of symmetry, 1493 Breslow-Day test, 1508 cell count data, 1464, 1527 chi-square tests, 1469–1471, 1530, 1535 Cochran’s Q test, 1493 Cochran-Mantel-Haenszel statistics, 1540 computation time, limiting, 1445 computational methods, 1509 computational resources, 1511, 1513 contingency coefficient, 1469 contingency table, 1535 continuity-adjusted chi-square, 1469, 1471 correlation statistic, 1501 Cramer’s V statistic, 1469 default tables, 1450 displayed output, 1517 exact p-values, 1510 EXACT statement, used with TABLES, 1446 exact tests, 1443, 1508, 1543 Fisher’s exact test, 1469 Friedman’s chi-square statistic, 1546 gamma statistic, 1474 general association statistic, 1502 grouping variables, 1465 input data sets, 1441, 1464 kappa coefficient, 1497, 1498 Kendall’s tau-b statistic, 1474 lambda asymmetric, 1474 lambda symmetric, 1474 likelihood-ratio chi-square test, 1469 Mantel-Haenszel chi-square test, 1469 McNemar’s test, 1493 measures of association, 1474 missing values, 1466
Monte Carlo estimation, 1443, 1445, 1512 multiway tables, 1518, 1520, 1521 network algorithm, 1509 odds ratio, 1488, 1503, 1504 ODS table names, 1524 one-way frequency tables, 1469, 1470, 1517, 1518, 1530 order of variables, 1442 output data sets, 1446, 1514–1516, 1527, 1538 output variable prefixes, 1516 OUTPUT, used with TABLES or EXACT, 1449 overall kappa coefficient, 1493 Pearson chi-square test, 1469, 1471 Pearson correlation coefficient, 1474 phi coefficient, 1469 polychoric correlation coefficient, 1474 relative risk, 1489, 1503, 1507 row mean scores statistic, 1502 scores, 1468 simple kappa coefficient, 1493 Somers’ D statistics, 1474 Spearman rank correlation coefficient, 1474 statistical computations, 1468 stratified table, 1540 Stuart’s tau-c statistic, 1474 syntax, 1440 two-way frequency tables, 1470, 1471, 1535 uncertainty coefficients, 1474 weighted kappa coefficient, 1493 FREQ procedure, BY statement, 1443 FREQ procedure, EXACT statement, 1443 AGREE option, 1444 ALPHA= option, 1445 BINOMIAL option, 1444 CHISQ option, 1444, 1535 COMOR option, 1444 FISHER option, 1444 JT option, 1444 KAPPA option, 1444 LRCHI option, 1444 MAXTIME= option, 1445 MC option, 1445 MCNEM option, 1444 MEASURES option, 1444 MHCHI option, 1444 N= option, 1445 OR option, 1444, 1535 PCHI option, 1444 PCORR option, 1444 POINT option, 1445 SCORR option, 1444 SEED= option, 1445 TREND option, 1444, 1543 WTKAP option, 1444 FREQ procedure, OUTPUT statement, 1446 AGREE option, 1447 AJCHI option, 1447 ALL option, 1447 BDCHI option, 1447
Syntax Index BINOMIAL option, 1447 CHISQ option, 1447 CMH option, 1447 CMH1 option, 1447 CMH2 option, 1447 CMHCOR option, 1447 CMHGA option, 1447 CMHRMS option, 1447 COCHQ option, 1448 CONTGY option, 1448 CRAMV option, 1448 EQKAP option, 1448 EQWKP option, 1448 EXACT option, 1448 FISHER option, 1448 GAMMA option, 1448 JT option, 1448 KAPPA option, 1448 KENTB option, 1448 LAMCR option, 1448 LAMDAS option, 1448 LAMRC option, 1448 LGOR option, 1448 LGRRC1 option, 1448 LGRRC2 option, 1448 LRCHI option, 1448, 1539 MCNEM option, 1448 MEASURES option, 1448 MHCHI option, 1448 MHOR option, 1448 MHRRC1 option, 1448 MHRRC2 option, 1448 N option, 1448 NMISS option, 1448 OR option, 1449 OUT= option, 1446 PCHI option, 1449, 1539 PCORR option, 1449 PHI option, 1449 PLCORR option, 1449 RDIF1 option, 1449 RDIF2 option, 1449 RELRISK option, 1449 RISKDIFF option, 1449 RISKDIFF1 option, 1449 RISKDIFF2 option, 1449 RRC1 option, 1449 RRC2 option, 1449 RSK1 option, 1449 RSK11 option, 1449 RSK12 option, 1449 RSK2 option, 1449 RSK21 option, 1449 RSK22 option, 1449 SCORR option, 1449 SMDCR option, 1449 SMDRC option, 1449 STUTC option, 1449 TREND option, 1449
TSYMM option, 1449 U option, 1449 UCR option, 1449 URC option, 1449 WTKAP option, 1449 FREQ procedure, PROC FREQ statement, 1441 COMPRESS option, 1441 DATA= option, 1441 FORMCHAR= option, 1441 NLEVELS option, 1442 NOPRINT option, 1442 ORDER= option, 1442 PAGE option, 1443 FREQ procedure, TABLES statement, 1450 ALL option, 1453 ALPHA= option, 1453 BDT option, 1453 BINOMIAL option, 1453, 1532 BINOMIALC option, 1454 CELLCHI2 option, 1454 CHISQ option, 1454, 1470, 1535 CL option, 1455 CMH option, 1455 CMH1 option, 1455 CMH2 option, 1455 CONTENTS= option, 1455 CONVERGE= option, 1456 CROSSLIST option, 1456 CUMCOL option, 1457 DEVIATION option, 1457 EXPECTED option, 1457 FISHER option, 1457 FORMAT= option, 1457 JT option, 1457 LIST option, 1457 MAXITER= option, 1458 MEASURES option, 1458 MISSING option, 1458 MISSPRINT option, 1458 NOCOL option, 1458 NOCUM option, 1458 NOFREQ option, 1458 NOPERCENT option, 1458 NOPRINT option, 1459 NOROW option, 1459 NOSPARSE option, 1459 NOWARN option, 1459 option, 1453 OUT= option, 1459 OUTCUM option, 1459 OUTEXPECT option, 1459, 1527 OUTPCT option, 1460 PLCORR option, 1460 PRINTKWT option, 1460 RELRISK option, 1460, 1535 RISKDIFF option, 1460 RISKDIFFC option, 1460 SCORES= option, 1461, 1547 SCOROUT option, 1461
Syntax Index
SPARSE option, 1461, 1527 TESTF= option, 1462, 1470 TESTP= option, 1462, 1470, 1530 TOTPCT option, 1462 TREND option, 1462, 1543 FREQ procedure, TEST statement, 1462 AGREE option, 1463 GAMMA option, 1463 KAPPA option, 1463 KENTB option, 1463 MEASURES option, 1463 PCORR option, 1463 SCORR option, 1463 SMDCR option, 1463, 1543 SMDRC option, 1463 STUTC option, 1463 WTKAP option, 1463 FREQ procedure, WEIGHT statement, 1463 ZEROS option, 1464 FREQ statement and RMSSTD statement (CLUSTER), 974 ANOVA procedure, 436 CALIS procedure, 627 CANDISC procedure, 793 DISCRIM procedure, 1154 DISTANCE procedure, 1268 FACTOR procedure, 1321 GAM procedure, 1565 GENMOD procedure, 1634 GLM procedure, 1752 GLMMOD procedure, 1916 KDE procedure, 2001 LOGISTIC procedure, 2303 MI procedure, 2523 MODECLUS procedure, 2870 MULTTEST procedure, 2945 NPAR1WAY procedure, 3161 PHREG procedure, 3227 PRINCOMP procedure, 3607 PRINQUAL procedure, 3658 REG procedure, 3820 STDIZE procedure, 4135 STEPDISC procedure, 4169 SURVEYLOGISTIC procedure, 4258 TPSPLINE procedure, 4507 TRANSREG procedure, 4557 TREE procedure, 4755 TTEST procedure, 4781 VARCLUS procedure, 4813 FREQOUT option PROC CORRESP statement, 1076 frequency tables, 1431, 1450, 4196 generating (CATMOD), 842, 845 input to CATMOD procedure, 861 one-way (FREQ), 1469, 1470, 1517, 1518, 1530 one-way (SURVEYFREQ), 4226 two-way (FREQ), 1470, 1471, 1535 frequency variable LOGISTIC procedure, 2304
PRINQUAL procedure, 3658 programming statements (PHREG), 3235 SURVEYLOGISTIC procedure, 4258 TRANSREG procedure, 4557 value (PHREG), 3227 Friedman’s chi-square statistic, 1546 FSIZE= option NLOPTIONS statement (CALIS), 618 PROC NLMIXED statement, 3065 FT option TEST statement (MULTTEST), 2951, 2968 FTOL2= option NLOPTIONS statement (CALIS), 618, 620 FTOL= option NLOPTIONS statement (CALIS), 614, 617 PROC CALIS statement, 580 full sibs mating INBREED procedure, 1981 full-rank coding TRANSREG procedure, 4568 FULLX option MODEL statement (MIXED), 2689, 2696 furthest neighbor clustering, See complete linkage FUZZ= option PROC DISTANCE statement, 1257 PROC FACTOR statement, 1310 PROC STDIZE statement, 4130 fuzzy coding CORRESP procedure, 1087 FWDLINK statement, GENMOD procedure, 1634, 1646 FWLS= option PROC ROBUSTREG statement, 3983
G G matrix MIXED procedure, 2663, 2713, 2732, 2733, 2816 G option RANDOM statement (MIXED), 2713 G-G epsilon, 1829 G2 inverse NLIN procedure, 3025 G4 option PROC NLIN statement, 3008 G4= option NLOPTIONS statement (CALIS), 613 PROC CALIS statement, 576 PROC NLMIXED statement, 3065 GABRIEL option MEANS statement (ANOVA), 443 MEANS statement (GLM), 1767 Gabriel’s multiple-comparison procedure ANOVA procedure, 443 GLM procedure, 1767, 1811 GAM procedure, 1564 comparing PROC GAM with PROC LOESS, 1596
Syntax Index Estimates from PROC GAM, 1579 generalized additive model with binary data, 1582 graphics, 1581 ODS graph names, 1581 ODS table names, 1580 Poisson regression analysis of component reliability, 1589 syntax, 1564 GAM procedure, BY statement, 1564 GAM procedure, CLASS statement, 1565 GAM procedure, FREQ statement, 1565 GAM procedure, ID statement, 1565 GAM procedure, MODEL statement, 1566 ALPHA= option, 1567 DIST= option, 1567 EPSILON= option, 1567 EPSSCORE= option, 1567 ITPRINT option, 1567 MAXITER= option, 1567 MAXITSCORE= option, 1567 METHOD= option, 1568 NOTEST option, 1568 GAM procedure, OUTPUT statement, 1568 OUT= option, 1568 GAM procedure, PROC GAM statement, 1564 CLM option, 1581 COMMONAXES option, 1581 COMPONENT option, 1581 DATA= option, 1564 PLOTS= option, 1581 UNPACKPANELS option, 1581 GAM procedure, SCORE statement, 1569 DATA= option, 1569 OUT= option, 1569 gamma distribution, 2083, 2097, 2111 GENMOD procedure, 1652 NLMIXED procedure, 3077 GAMMA option OUTPUT statement (FREQ), 1448 TEST statement (FREQ), 1463 gamma statistic, 1474, 1476 GAMMA= option PROC FACTOR statement, 1310 Gaussian assumption SIM2D procedure, 4091 Gaussian distribution NLMIXED procedure, 3077 Gaussian random field SIM2D procedure, 4091 Gaussian semivariogram model KRIGE2D procedure, 2047, 2048 GC option RANDOM statement (MIXED), 2713 GCI option RANDOM statement (MIXED), 2713 GCONV2= option NLOPTIONS statement (CALIS), 619 GCONV= option
5015
MODEL statement (LOGISTIC), 2310 MODEL statement (SURVEYLOGISTIC), 4262 NLOPTIONS statement (CALIS), 614, 618, 620 PROC CALIS statement, 580 PROC NLMIXED statement, 3065 GCONVERGE= option PROC MDS statement, 2480 GCORR option RANDOM statement (MIXED), 2713 GCV function, TPSPLINE procedure, 4497, 4499, 4514, 4522 nonhomogeneous variance, 4514 GDATA= option RANDOM statement (MIXED), 2713 GEE, See Generalized Estimating Equations Gehan test, See Wilcoxon test for homogeneity power and sample size (POWER), 3473, 3482, 3533 GENDER statement, INBREED procedure, 1975 general association statistic, 1502 general distribution NLMIXED procedure, 3078 general linear structure MIXED procedure, 2721 generalized Crawford-Ferguson family, 1291 generalized Crawford-Ferguson method, 1317, 1318 generalized cross validation function (GCV), 4497, 4522 generalized cyclic incomplete block design generating with PLAN procedure, 3357 Generalized Estimating Equations (GEE), 1621, 1646, 1672, 1708, 1713 Generalized Euclidean distance coefficient DISTANCE procedure, 1272 generalized inverse, 2740 MIXED procedure, 2684 NLIN procedure, 3025 NLMIXED procedure, 3065 generalized least squares, See weighted least-squares estimation generalized linear model GENMOD procedure, 1612, 1613 theory (GENMOD), 1650 generalized logistic model SURVEYLOGISTIC procedure, 4284 generalized logits, See also response functions examples (CATMOD), 869 formulas (CATMOD), 893 specifying in CATMOD procedure, 853 using (CATMOD), 868 generation (INBREED) nonoverlapping, 1967, 1970, 1971 number, 1974 overlapping, 1967, 1969 variable, 1974 GENMOD procedure
5016
Syntax Index
adjusted residuals, 1670 aliasing, 1620 bar (|) operator, 1660 binomial distribution, 1652 built-in link function, 1614 built-in probability distribution, 1614 classification variables, 1660 confidence intervals, 1637 confidence limits, 1636 continuous variables, 1660 contrasts, 1633 convergence criterion, 1637, 1647 correlated data, 1611, 1672 correlation matrix, 1638, 1656 correlations, least-squares means, 1636 covariance matrix, 1638, 1656 covariances, least-squares means, 1636 crossed effects, 1660 design matrix, 1661 deviance, 1637 deviance definition, 1615 deviance residuals, 1670 dispersion parameter, 1658 dispersion parameter estimation, 1614, 1665, 1666 dispersion parameter weights, 1650 effect specification, 1659 estimability, 1635 estimability checking, 1632 events/trials format for response, 1636, 1653 expected information matrix, 1656 exponential distribution, 1704 F statistics, 1668 Fisher’s scoring method, 1642, 1656 gamma distribution, 1652 GEE, 1611, 1621, 1646, 1672, 1708, 1711, 1713 Generalized Estimating Equations (GEE), 1611 generalized linear model, 1612, 1613 goodness of fit, 1656 gradient, 1655 Hessian matrix, 1655 information matrix, 1642 initial values, 1638, 1647 intercept, 1615, 1617, 1640 inverse Gaussian distribution, 1652 L matrices, 1635 Lagrange multiplier statistics, 1668 life data, 1701 likelihood residuals, 1670 linear model, 1612 linear predictor, 1611, 1612, 1618, 1661, 1689 link function, 1611, 1612, 1653 log-likelihood functions, 1654 log-linear models, 1616 logistic regression, 1697 main effects, 1660 maximum likelihood estimation, 1655 – MEAN– automatic variable, 1645 Model checking, 1718
model checking, 1725 multinomial distribution, 1653 multinomial models, 1671 negative binomial distribution, 1652 nested effects, 1660 Newton-Raphson algorithm, 1655 normal distribution, 1651 observed information matrix, 1656 offset, 1640, 1689 offset variable, 1617 ordinal data, 1704 output ODS graphics table names, 1695 output table names, 1693 overdispersion, 1659 Pearson residuals, 1670 Pearson’s chi-square, 1637, 1656, 1657 Poisson distribution, 1652 Poisson regression, 1616 polynomial effects, 1660 profile likelihood confidence intervals, 1640, 1666 programming statements, 1645 quasi-likelihood, 1659 raw residuals, 1669 regression parameters estimation, 1614 regressor effects, 1660 repeated measures, 1611, 1672 residuals, 1641, 1669, 1670 – RESP– automatic variable, 1645 scale parameter, 1653 scaled deviance, 1656 score statistics, 1668 singular contrast matrix, 1632 subpopulation, 1637 suppressing output, 1627 syntax, 1624 Type 1 analysis, 1615, 1665 Type 3 analysis, 1615, 1665 user-defined link function, 1634 variance function, 1614 Wald confidence intervals, 1643, 1667 working correlation matrix, 1647, 1648, 1672 – XBETA– automatic variable, 1645 GENMOD procedure, ASSESS statement, 1627 GENMOD procedure, BY statement, 1628 GENMOD procedure, CLASS statement, 1629 DESCENDING option, 1629 MISSING option, 1629 ORDER= option, 1629 PARAM= option, 1630 REF= option, 1630 TRUNCATE option, 1631 GENMOD procedure, CONTRAST statement, 1631 E option, 1632 SINGULAR= option, 1632 WALD option, 1633 GENMOD procedure, DEVIANCE statement, 1633, 1646 GENMOD procedure, ESTIMATE statement
Syntax Index ALPHA= option, 1634 E option, 1634 EXP option, 1634 GENMOD procedure, FREQ statement, 1633, 1634 GENMOD procedure, FWDLINK statement, 1634, 1646 GENMOD procedure, INVLINK statement, 1635, 1646 GENMOD procedure, LSMEANS statement, 1635 ALPHA= option, 1635 CL option, 1636 CORR option, 1636 COV option, 1636 DIFF option, 1636 E option, 1636 GENMOD procedure, MODEL statement, 1636 AGGREGATE= option, 1637 ALPHA= option, 1637 CICONV= option, 1637 CL option, 1637 CODING= option, 1637 CONVERGE= option, 1637 CONVH= option, 1638 CORRB option, 1638 COVB option, 1638 DIST= option, 1638 ERR= option, 1638 EXPECTED option, 1638 INITIAL= option, 1638 INTERCEPT= option, 1639 ITPRINT option, 1639 LINK= option, 1639 LRCI option, 1640 MAXIT= option, 1640 NOINT option, 1640 NOSCALE option, 1640 OBSTATS option, 1640 OFFSET= option, 1640 PRED option, 1641 PREDICTED option, 1641 RESIDUALS option, 1641 SCALE= option, 1642 SCORING= option, 1642 SINGULAR= option, 1642 TYPE1 option, 1642 TYPE3 option, 1643 WALD option, 1643 WALDCI option, 1643 XVARS option, 1643 GENMOD procedure, OUTPUT statement, 1643 keyword= option, 1644 OUT= option, 1644 GENMOD procedure, PROC GENMOD statement, 1625 DATA= option, 1625 NAMELEN= option, 1625 ORDER= option, 1625 RORDER= option, 1626
5017
GENMOD procedure, REPEATED statement, 1621, 1646 ALPHAINIT= option, 1646 CONVERGE= option, 1647 CORR= option, 1648 CORRB option, 1647 CORRW option, 1647 COVB option, 1647 ECORRB option, 1647 ECOVB option, 1647 INITIAL= option, 1647 INTERCEPT= option, 1647 LOGOR= option, 1647 MAXITER= option, 1648 MCORRB option, 1648 MCOVB option, 1648 MODELSE option, 1648 RUPDATE= option, 1648 SORTED option, 1648 SUBCLUSTER= option, 1648 SUBJECT= option, 1646 TYPE= option, 1648 V6CORR option, 1649 WITHIN= option, 1649 WITHINSUBJECT= option, 1649 YPAIR= option, 1649 ZDATA= option, 1649 ZROW= option, 1650 GENMOD procedure, SCWGT statement, 1650 GENMOD procedure, VARIANCE statement, 1650 GENMOD procedure, WEIGHT statement, 1650 Gentleman-Givens computational method, 3197 geometric anisotropy KRIGE2D procedure, 2053–2055 GEOMETRICMEAN option MODEL statement (TRANSREG), 4571 getting started ODS Graphics, 321 GI option RANDOM statement (MIXED), 2714 GLM Parameterization SURVEYLOGISTIC procedure, 4271 GLM procedure absorption of effects, 1747, 1799 aliasing structure, 1771, 1897 alpha level, 1755, 1765, 1771, 1775 at sign (@) operator, 1786 bar (|) operator, 1786 Bartlett’s test, 1767, 1819 Bonferroni adjustment, 1754 Brown and Forsythe’s test, 1767, 1819 canonical analysis, 1760 characteristic roots and vectors, 1759 compared to other procedures, 1734, 1776, 1833, 1885, 1901, 2664, 2985, 3197, 4033 comparing groups, 1804 computational method, 1840 computational resources, 1837 contrasts, 1749, 1779
covariate values for least-squares means, 1755 disk space, 1746 Dunnett’s adjustment, 1754 effect specification, 1784 error effect, 1759 estimability, 1750–1752, 1758, 1773, 1793, 1803 estimable functions, 1792 ESTIMATE specification, 1801 homogeneity of variance tests, 1767, 1818 Hsu’s adjustment, 1754 hypothesis tests, 1781, 1792 interactive use, 1787 interactivity and BY statement, 1748 interactivity and missing values, 1787, 1837 introductory example, 1735 least-squares means (LS-means), 1753 Levene’s test for homogeneity of variance, 1767, 1819 means, 1763 means versus least-squares means, 1804 memory requirements, reduction of, 1747 missing values, 1745, 1759, 1822, 1837 model specification, 1784 multiple comparisons, least-squares means, 1754, 1757, 1806, 1808 multiple comparisons, means, 1765–1769, 1806, 1808 multiple comparisons, procedures, 1763 multivariate analysis of variance, 1745, 1759, 1823 nonstandard weights for least-squares means, 1756 O’Brien’s test, 1767 observed margins for least-squares means, 1756 ODS graph names, 1847 ODS table names, 1844 output data sets, 1773, 1840–1842 parameterization, 1787 positional requirements for statements, 1743 predicted population margins, 1753 Q effects, 1834 random effects, 1776, 1833 regression, quadratic, 1738 relation to GLMMOD procedure, 1909 repeated measures, 1777, 1825 Sidak’s adjustment, 1754 simple effects, 1758 simulation-based adjustment, 1754 singularity checking, 1750, 1752, 1758, 1772 sphericity tests, 1780, 1829 SSCP matrix for multivariate tests, 1759 statistical assumptions, 1783 summary of features, 1733 syntax, 1742 tests, hypothesis, 1749 transformations for MANOVA, 1759 transformations for repeated measures, 1779 Tukey’s adjustment, 1754
types of least-squares means comparisons, 1757 unbalanced analysis of variance, 1735, 1804, 1856 unbalanced design, 1735, 1804, 1833, 1856, 1882 weighted analysis, 1782 weighted means, 1820 Welch’s ANOVA, 1769 WHERE statement, 1787 GLM procedure, ABSORB statement, 1747 GLM procedure, BY statement, 1747 GLM procedure, CLASS statement, 1748 TRUNCATE option, 1749 GLM procedure, CONTRAST statement, 1749 E option, 1749 E= option, 1750 ETYPE= option, 1750 INTERCEPT effect, 1749, 1752 SINGULAR= option, 1750 GLM procedure, ESTIMATE statement, 1751 DIVISOR= option, 1752 E option, 1752 SINGULAR= option, 1752 GLM procedure, FREQ statement, 1752 GLM procedure, ID statement, 1753 GLM procedure, LSMEANS statement, 1753 ADJUST= option, 1754 ALPHA= option, 1755 AT option, 1755, 1822 BYLEVEL option, 1823 BYLEVEL options, 1756 CL option, 1756 COV option, 1756 E option, 1756 ETYPE option, 1756 NOPRINT option, 1756 OBSMARGINS option, 1756, 1822 OM option, 1756, 1822 OUT= option, 1757 PDIFF option, 1757 SINGULAR option, 1758 SLICE= option, 1758 STDERR option, 1758 TDIFF option, 1758 GLM procedure, MANOVA statement, 1759 – ALL– effect (H= option), 1759 CANONICAL option, 1760 E= option, 1759 ETYPE= option, 1760 H= option, 1759 HTYPE= option, 1760 INTERCEPT effect (H= option), 1759 M= option, 1759 MNAMES= option, 1760 MSTAT= option, 1761 ORTH option, 1761 PREFIX= option, 1760 PRINTE option, 1761 PRINTH option, 1761
Syntax Index SUMMARY option, 1761 GLM procedure, MEANS statement, 1763, 1820 ALPHA= option, 1765 BON option, 1765 CLDIFF option, 1766 CLM option, 1766 DEPONLY option, 1766 DUNCAN option, 1766 DUNNETT option, 1766 DUNNETTL option, 1766 DUNNETTU option, 1767 E= option, 1767 ETYPE= option, 1767 GABRIEL option, 1767 GT2 option, 1767 HOVTEST option, 1767, 1818 HTYPE= option, 1768 KRATIO= option, 1768 LINES option, 1768 LSD option, 1768 NOSORT option, 1768 REGWQ option, 1768 SCHEFFE option, 1769 SIDAK option, 1769 SMM option, 1769 SNK option, 1769 T option, 1769 TUKEY option, 1769 WALLER option, 1769 WELCH option, 1769 GLM procedure, MODEL statement, 1770 ALIASING option, 1771, 1897 ALPHA= option, 1771 CLI option, 1771 CLM option, 1771 CLPARM option, 1771 E option, 1771 E1 option, 1771 E2 option, 1771 E3 option, 1771 E4 option, 1771 I option, 1772 INTERCEPT option, 1771 INVERSE option, 1772 NOINT option, 1772 NOUNI option, 1772 P option, 1772 SINGULAR= option, 1772 SOLUTION option, 1772 SS1 option, 1772 SS2 option, 1773 SS3 option, 1773 SS4 option, 1773 TOLERANCE option, 1773 XPX option, 1773 ZETA= option, 1773 GLM procedure, OUTPUT statement, 1773 ALPHA= option, 1775 COOKD keyword, 1774
COVRATIO keyword, 1774 DFFITS keyword, 1774 H keyword, 1774 keyword= option, 1773 LCL keyword, 1774 LCLM keyword, 1774 OUT= option, 1775 PREDICTED keyword, 1774 PRESS keyword, 1774 RESIDUAL keyword, 1774 RSTUDENT keyword, 1774 STDI keyword, 1774 STDP keyword, 1774 STDR keyword, 1774 STUDENT keyword, 1774 UCL keyword, 1775 UCLM keyword, 1775 GLM procedure, PROC GLM statement, 1745 ALPHA= option, 1745 DATA= option, 1745 MANOVA option, 1745 MULTIPASS option, 1746 NAMELEN= option, 1746 NOPRINT option, 1746 ORDER= option, 1746, 1803 OUTSTAT= option, 1747 GLM procedure, RANDOM statement, 1776 Q option, 1777, 1834 TEST option, 1777 GLM procedure, REPEATED statement, 1777 CANONICAL option, 1780 CONTRAST option, 1779, 1830 E= option, 1782 factor specification, 1778 H= option, 1782 HELMERT option, 1779, 1831 HTYPE= option, 1780 IDENTITY option, 1779, 1886 MEAN option, 1779, 1780, 1832 MSTAT= option, 1780 NOM option, 1780 NOU option, 1780 POLYNOMIAL option, 1779, 1831, 1878 PRINTE option, 1780, 1829 PRINTH option, 1780 PRINTM option, 1780 PRINTRV option, 1781 PROFILE option, 1779, 1832 SUMMARY option, 1781 GLM procedure, TEST statement, 1781 ETYPE= option, 1782 HTYPE= option, 1782 GLM procedure, WEIGHT statement, 1782 GLMMOD alternative TRANSREG procedure, 4586, 4654 GLMMOD procedure design matrix, 1909, 1917, 1918 input data sets, 1914 introductory example, 1909
missing values, 1917, 1918 ODS table names, 1918 output data sets, 1915, 1917, 1918 relation to GLM procedure, 1909 screening experiments, 1923 syntax, 1913 GLMMOD procedure, BY statement, 1915 GLMMOD procedure, CLASS statement, 1916 TRUNCATE option, 1916 GLMMOD procedure, FREQ statement, 1916 GLMMOD procedure, MODEL statement, 1917 NOINT option, 1917 GLMMOD procedure, PROC GLMMOD statement, 1914 DATA= option, 1914 NAMELEN= option, 1914 NOPRINT option, 1914 ORDER= option, 1914 OUTDESIGN= option, 1915, 1918 OUTPARM= option, 1915, 1917 PREFIX= option, 1915 ZEROBASED option, 1915 GLMMOD procedure, WEIGHT statement, 1916 GLMPOWER procedure actual power, 1946, 1947, 1953 alpha level, 1939 analysis of variance, 1930, 1951, 1956 ceiling sample size, 1947 compared to other procedures, 1930, 3412 computational methods, 1949 contrasts, 1934, 1937, 1950, 1951, 1956 covariates, class and continuous, 1938–1941, 1951, 1956 displayed output, 1947 exemplary data set, 1929, 1930, 1936, 1938, 1946, 1952 fractional sample size, 1947 introductory example, 1930 nominal power, 1946, 1947, 1953 number-lists, 1945 ODS table names, 1948 plots, 1930, 1936, 1942 positional requirements for statements, 1936 sample size adjustment, 1946 summary of statements, 1936 syntax, 1935 value lists, 1945 GLMPOWER procedure, CLASS statement, 1937 GLMPOWER procedure, CONTRAST statement, 1937 SINGULAR= option, 1937 GLMPOWER procedure, MODEL statement, 1938 GLMPOWER procedure, PLOT statement, 1942 DESCRIPTION= option, 1945 INTERPOL= option, 1942 KEY= option, 1942 MARKERS= option, 1943 MAX= option, 1943 MIN= option, 1943
  NAME= option, 1945
  NPOINTS= option, 1943
  STEP= option, 1943
  VARY option, 1944
  X= option, 1944
  XOPTS= option, 1944
  Y= option, 1944
  YOPTS= option, 1944
GLMPOWER procedure, POWER statement, 1939
  ALPHA= option, 1939
  CORRXY= option, 1939
  NCOVARIATES= option, 1940
  NFRACTIONAL option, 1940
  NTOTAL= option, 1940
  OUTPUTORDER= option, 1940
  POWER= option, 1941
  PROPVARREDUCTION= option, 1941
  STDDEV= option, 1941
GLMPOWER procedure, PROC GLMPOWER statement, 1936
  DATA= option, 1936
  PLOTONLY= option, 1936
GLMPOWER procedure, WEIGHT statement, 1938
global influence
  LD statistic (PHREG), 3234, 3260
  LMAX statistic (PHREG), 3234, 3261
global kriging
  KRIGE2D procedure, 2034
global null hypothesis
  PHREG procedure, 3218, 3246, 3269
  score test (PHREG), 3229, 3274
  TPHREG procedure, 4486
GLS option
  MODEL statement (CATMOD), 846
GMSEP option
  MODEL statement (REG), 3826
  PLOT statement (REG), 3846
Gompertz distribution, 3705
goodness of fit
  GENMOD procedure, 1656
GOUT= option
  MCMC statement (MI), 2526
  PROC BOXPLOT statement, 487
  PROC LIFEREG statement, 2090
  PROC LIFETEST statement, 2161
  PROC PROBIT statement, 3712
  PROC REG statement, 3817
  PROC TREE statement, 4750
Gower’s dissimilarity coefficient
  DISTANCE procedure, 1271
Gower’s method
  See also median method
  CLUSTER procedure, 967, 981
Gower’s similarity coefficient
  DISTANCE procedure, 1270
GPATH= option
  ODS HTML statement, 336, 349
  ODS LATEX statement, 337, 359
gradient
  CALIS procedure, 634, 664, 665
  GENMOD procedure, 1655
  LOGISTIC procedure, 2343
  MIXED procedure, 2675, 2749
  SURVEYLOGISTIC procedure, 4287
Graeco-Latin square
  generating with PLAN procedure, 3345
graph axes, swapping
  examples, ODS Graphics, 371
graph fonts, modifying
  examples, ODS Graphics, 374, 379
graph names
  examples, ODS Graphics, 352, 361
  ODS Graphics, 330
graph sizes, modifying
  examples, ODS Graphics, 378
graph template definitions
  ODS Graphics, 338, 342, 365
  Q-Q plots, 366
graph template language
  ODS Graphics, 338, 342
  TEMPLATE procedure, 342
graph titles, modifying
  examples, ODS Graphics, 368
graphics
  See plots
  examples (REG), 3948
  high-resolution plots (REG), 3840
  keywords (REG), 3841
  options (REG), 3843
  saving output (MI), 2526
graphics catalog, specifying
  LIFEREG procedure, 2090
  PROBIT procedure, 3712
graphics image files
  base file names, 335
  file names, 335
  file types, 334, 336, 350
  index counter, 335
  ODS Graphics, 334
  PostScript, 337, 358
GRAPHICS statement
  LOGISTIC procedure, 2389
graphs, See plots
GREENACRE option
  PROC CORRESP statement, 1076
Greenhouse-Geisser epsilon, 1829
grid lines, modifying
  examples, ODS Graphics, 373
grid search
  example (MIXED), 2796
GRID statement
  KRIGE2D procedure, 2040
  SIM2D procedure, 4100
GRID= option
  PLOT statement (BOXPLOT), 499
  PRIOR statement (MIXED), 2710
GRIDDATA= option
  GRID statement (KRIGE2D), 2040
  GRID statement (SIM2D), 4100
GRIDL= option
  BIVAR statement, 1998
  UNIVAR statement, 1999
GRIDT= option
  PRIOR statement (MIXED), 2711
GRIDU= option
  BIVAR statement, 1998
  UNIVAR statement, 1999
group average clustering, See average linkage
GROUP option
  CONTRAST statement (MIXED), 2684
  ESTIMATE statement (MIXED), 2686
GROUP= option
  RANDOM statement (MIXED), 2714
  REPEATED statement (MIXED), 2717
  STRATA statement (LIFETEST), 2167
grouped-name-lists
  POWER procedure, 3490
grouped-number-lists
  POWER procedure, 3490
GROUPLOSS= option
  TWOSAMPLESURVIVAL statement (POWER), 3476
GROUPLOSSEXPHAZARDS= option
  TWOSAMPLESURVIVAL statement (POWER), 3476
GROUPMEANS= option
  ONEWAYANOVA statement (POWER), 3440
  TWOSAMPLEMEANS statement (POWER), 3466
GROUPMEDLOSSTIMES= option
  TWOSAMPLESURVIVAL statement (POWER), 3476
GROUPMEDSURVTIMES= option
  TWOSAMPLESURVIVAL statement (POWER), 3476
GROUPNAMES= option
  MODEL statement (REG), 3826
GROUPNS= option
  ONEWAYANOVA statement (POWER), 3440
  TWOSAMPLEFREQ statement (POWER), 3458
  TWOSAMPLEMEANS statement (POWER), 3466
  TWOSAMPLESURVIVAL statement (POWER), 3477
GROUPPROPORTIONS= option
  TWOSAMPLEFREQ statement (POWER), 3458
GROUPSTDDEVS= option
  TWOSAMPLEMEANS statement (POWER), 3466
GROUPSURVEXPHAZARDS= option
  TWOSAMPLESURVIVAL statement (POWER), 3477
GROUPSURVIVAL= option
  TWOSAMPLESURVIVAL statement (POWER), 3477
GROUPWEIGHTS= option
  ONEWAYANOVA statement (POWER), 3440
  TWOSAMPLEFREQ statement (POWER), 3458
  TWOSAMPLEMEANS statement (POWER), 3466
  TWOSAMPLESURVIVAL statement (POWER), 3477
growth curve analysis
  example (CATMOD), 933
  example (MIXED), 2733
GSK models, 817
GT2 multiple-comparison method, 444, 1769, 1811
GT2 option
  MEANS statement (GLM), 1767
GTOL2= option
  NLOPTIONS statement (CALIS), 619
GTOL= option
  NLOPTIONS statement (CALIS), 614, 618, 620
  PROC CALIS statement, 580
H

H keyword
  OUTPUT statement (GLM), 1774
H option
  PROC ROBUSTREG statement, 3985
H-F epsilon, 1829
H= effects
  REPEATED statement (ANOVA), 450
H= option
  MANOVA statement (ANOVA), 437
  MANOVA statement (GLM), 1759
  OUTPUT statement (LOGISTIC), 2322
  OUTPUT statement (NLIN), 3013
  REPEATED statement (GLM), 1782
half-fraction design, analysis, 1895
half-width, confidence intervals, 3488
HALFWIDTH= option
  ONESAMPLEMEANS statement (POWER), 3434
  PAIREDMEANS statement (POWER), 3450
  TWOSAMPLEMEANS statement (POWER), 3466
Hall-Wellner bands
  LIFETEST procedure, 2169, 2177, 2205
Hamann coefficient
  DISTANCE procedure, 1274
Hamming distance coefficient
  DISTANCE procedure, 1274
Hannan-Quinn information criterion
  MIXED procedure, 2676
Harris component analysis, 1291, 1293, 1311
Harris-Kaiser method, 1291, 1318
hat matrix, 3899
  LOGISTIC procedure, 2359
HAXIS= option
  PLOT statement (BOXPLOT), 499
  PLOT statement (REG), 3846
  PROC TREE statement, 4751
hazard function
  baseline (PHREG), 3215, 3216
  cumulative (PHREG), 3262
  definition (PHREG), 3240
  discrete (PHREG), 3228, 3240
  LIFETEST procedure, 2149, 2214
  PHREG procedure, 3215
  rate (PHREG), 3284
  ratio (PHREG), 3218, 3220
HAZARDRATIO= option
  TWOSAMPLESURVIVAL statement (POWER), 3477
hazards ratio
  confidence interval (PHREG), 3233
  confidence limits (PHREG), 3233, 3247, 3269
  confidence limits (TPHREG), 4487
  estimate (PHREG), 3247, 3269, 3284
  estimate (TPHREG), 4487
  PHREG procedure, 3218
HC= option
  PROC FASTCLUS statement, 1390
HEIGHT statement
  TREE procedure, 4755
HEIGHT= option
  PLOT statement (BOXPLOT), 500
  PROC TREE statement, 4751
HELMERT keyword
  REPEATED statement (ANOVA), 448
HELMERT option
  REPEATED statement (GLM), 1779, 1831
Hertzsprung-Russell Plot, example
  MODECLUS procedure, 2923
HESCAL= option
  NLOPTIONS statement (CALIS), 622
  PROC NLMIXED statement, 3065
HESS option
  PROC NLMIXED statement, 3066
HESSALG= option
  PROC CALIS statement, 589
Hessian matrix
  CALIS procedure, 589, 621, 622, 634, 647, 665
  GENMOD procedure, 1655
  LOGISTIC procedure, 2317, 2343
  MIXED procedure, 2674, 2675, 2680, 2707, 2749, 2750, 2774, 2775, 2786, 2796
  NLMIXED procedure, 3066
  SURVEYLOGISTIC procedure, 4264, 4287
Hessian scaling
  NLMIXED procedure, 3093
heterogeneity
  example (MIXED), 2792
  MIXED procedure, 2714, 2717
heterogeneous
  AR(1) structure (MIXED), 2721
  compound-symmetry structure (MIXED), 2721
  covariance structures (MIXED), 2730
  Toeplitz structure (MIXED), 2721
heteroscedasticity
  testing (REG), 3910
Heywood cases
  FACTOR procedure, 1333
HEYWOOD option
  FACTOR statement (CALIS), 607
  PROC FACTOR statement, 1311
hierarchical clustering, 967, 980, 4801
hierarchical design
  generating with PLAN procedure, 3353
hierarchical model
  example (MIXED), 2810
hierarchy
  LOGISTIC procedure, 2310
  TPHREG procedure, 4475
HIERARCHY option
  PROC VARCLUS statement, 4809
HIERARCHY= option
  MODEL statement (LOGISTIC), 2310
  MODEL statement (TPHREG), 4475
histograms
  plots, ODS Graphics, 358
HISTORY option
  MODEL statement (TRANSREG), 4576
HKPOWER= option
  PROC FACTOR statement, 1311
HLM option
  REPEATED statement (MIXED), 2717
HLPS option
  REPEATED statement (MIXED), 2718
HM option
  PROC MODECLUS statement, 2866
HMINOR= option
  PLOT statement (BOXPLOT), 500
HOC option
  PROC MULTTEST statement, 2940, 2959
Hochberg adjustment (MULTTEST), 2959
Hochberg (MULTTEST)
  p-value adjustments, 2959
Hochberg’s GT2 multiple-comparison method, 444, 1769, 1811
HOFFSET= option
  PLOT statement (BOXPLOT), 500
HOLD= option
  PARMS statement (MIXED), 2707
HOLM option
  PROC MULTTEST statement, 2942
HOM option
  PROC MULTTEST statement, 2940
Hommel adjustment (MULTTEST), 2959
Hommel (MULTTEST)
  p-value adjustments, 2959
HOMMEL option
  PROC MULTTEST statement, 2959
homogeneity analysis
  CORRESP procedure, 1069
homogeneity of variance tests, 443, 1767, 1818
  Bartlett’s test (ANOVA), 443
  Bartlett’s test (GLM), 1767, 1819
  Brown and Forsythe’s test (ANOVA), 443
  Brown and Forsythe’s test (GLM), 1767, 1819
  DISCRIM procedure, 1150
  examples, 1893
  Levene’s test (ANOVA), 443
  Levene’s test (GLM), 1767, 1819
  O’Brien’s test (ANOVA), 443
  O’Brien’s test (GLM), 1767
  Welch’s ANOVA, 1819
homogeneity tests
  LIFETEST procedure, 2149, 2154, 2178, 2200
homotopy parameter
  FASTCLUS procedure, 1390
honestly significant difference test, 445, 1769, 1811, 1812
HORDISPLAY= option
  PROC TREE statement, 4751
HORIZONTAL option
  PROC TREE statement, 4751
Hosmer-Lemeshow test
  LOGISTIC procedure, 2312, 2356
  test statistic (LOGISTIC), 2357
Hotelling-Lawley trace, 437, 1759, 1828
Hotelling-Lawley-McKeon statistic
  MIXED procedure, 2717
Hotelling-Lawley-Pillai-Samson statistic
  MIXED procedure, 2718
HOUGAARD option
  PROC NLIN statement, 3008
HOVTEST option
  MEANS statement (ANOVA), 443
  MEANS statement (GLM), 1767, 1818
Howe’s solution, 1297
HP= option
  PROC FASTCLUS statement, 1390
HPAGES= option
  PROC TREE statement, 4751
HPLOTS= option
  PLOT statement (REG), 3849
HPROB= option
  PROC PROBIT statement, 3712
HREF= option
  PLOT statement (BOXPLOT), 500
  PLOT statement (REG), 3846
HREFLABELS= option
  PLOT statement (BOXPLOT), 501
HREFLABPOS= option
  PLOT statement (BOXPLOT), 501
HSD test, 445, 1769, 1811, 1812
Hsu’s adjustment
  GLM procedure, 1754
  MIXED procedure, 2688
HSYMBOL= option
  MCMC statement (MI), 2525, 2529
HTML destination
  ODS Graphics, 326, 335
HTML output
  examples, ODS Graphics, 321, 324, 330, 352
  ODS Graphics, 326, 334, 336
HTML output, with tool tips
  examples, ODS Graphics, 354
HTML= option
  PLOT statement (BOXPLOT), 501
HTYPE= option
  MANOVA statement (GLM), 1760
  MEANS statement (GLM), 1768
  MODEL statement (MIXED), 2696
  REPEATED statement (GLM), 1780
  TEST statement (GLM), 1782
Huynh-Feldt
  epsilon (GLM), 1829
  structure (GLM), 1829
  structure (MIXED), 2721
HYBRID option
  and FREQ statement (CLUSTER), 974
  and other options (CLUSTER), 970, 972
  PROC CLUSTER statement, 969, 978
hypergeometric
  distribution (MULTTEST), 2954
  variance (MULTTEST), 2949
hypothesis tests
  comparing adjusted means (GLM), 1758
  contrasts (CATMOD), 831
  contrasts (GLM), 1749
  contrasts, examples (GLM), 1850, 1865, 1875
  custom tests (ANOVA), 450
  customized (GLM), 1781
  exact (FREQ), 1443
  for intercept (ANOVA), 446
  for intercept (GLM), 1771
  GLM procedure, 1792
  incorrect hypothesis (CATMOD), 888
  lack of fit (RSREG), 4046
  MANOVA (GLM), 1824
  mixed model (MIXED), 2742, 2751
  multivariate (REG), 3910
  nested design (NESTED), 2990
  parametric, comparing means (TTEST), 4775, 4789
  parametric, comparing variances (TTEST), 4775, 4789
  random effects (GLM), 1777, 1833
  REG procedure, 3832, 3858
  repeated measures (GLM), 1827
  TRANSREG procedure, 4615
  Type I sum of squares (GLM), 1794
  Type II sum of squares (GLM), 1796
  Type III sum of squares (GLM), 1797
  Type IV sum of squares (GLM), 1797
I

I option
  MODEL statement (GLM), 1772
  MODEL statement (REG), 3827
IAPPROXIMATIONS option
  OUTPUT statement (TRANSREG), 4587
IC option
  PROC MIXED statement, 2676
ID statement
  BOXPLOT procedure, 517
  CORRESP procedure, 1080
  DISCRIM procedure, 1154
  DISTANCE procedure, 1268
  GAM procedure, 1565
  GLM procedure, 1753
  LOESS procedure, 2231
  MDS procedure, 2486
  MIXED procedure, 2687
  MODECLUS procedure, 2870
  NLIN procedure, 3012
  NLMIXED procedure, 3077
  PHREG procedure, 3227
  PRINQUAL procedure, 3659
  REG procedure, 3821
  ROBUSTREG procedure, 3989
  RSREG procedure, 4041
  SURVEYSELECT procedure, 4443
  TPSPLINE procedure, 4508
  TRANSREG procedure, 4557
  TREE procedure, 4755
id variables
  TRANSREG procedure, 4557
IDCOLOR= option
  PLOT statement (BOXPLOT), 501
IDCTEXT= option
  PLOT statement (BOXPLOT), 502
ideal point model
  TRANSREG procedure, 4593
ideal points
  TRANSREG procedure, 4717
identification variables, 3077
IDENTITY keyword
  REPEATED statement (ANOVA), 448
IDENTITY option
  REPEATED statement (GLM), 1779, 1886
IDENTITY transformation
  MODEL statement (TRANSREG), 4564
  TRANSFORM statement (PRINQUAL), 3663
identity transformation
  PRINQUAL procedure, 3663
  TRANSREG procedure, 4564
IDFONT= option
  PLOT statement (BOXPLOT), 502
IDHEIGHT= option
  PLOT statement (BOXPLOT), 502
IDSYMBOL= option
  PLOT statement (BOXPLOT), 502
IFACTOR= option
  PRIOR statement (MIXED), 2711
ill-conditioned data
  ORTHOREG procedure, 3197
image component analysis, 1291, 1293, 1311
image files, see graphics image files
IMAGEFMT= option
  ODS GRAPHICS statement, 350
IMAGENAME= option
  ODS GRAPHICS statement, 350
implicit intercept
  TRANSREG procedure, 4605
imputation methods
  MI procedure, 2539
imputation model
  MI procedure, 2565
imputation of missing values
  FASTCLUS procedure, 1391
IMPUTE option
  PROC FASTCLUS statement, 1391
IMPUTE= option
  MCMC statement (MI), 2526
imputer’s model
  MI procedure, 2563
IN option
  PLOT statement (REG), 3846
INAV= option
  PROC MDS statement, 2480
INBREED procedure
  coancestry, computing, 1978
  coefficient of relationship, computing, 1977
  covariance coefficients, 1967, 1969, 1971, 1973, 1975, 1977
  covariance coefficients matrix, output, 1973
  first parent, 1975
  full sibs mating, 1981
  generation number, 1974
  generation variable, 1974
  generation, nonoverlapping, 1967, 1970, 1971
  generation, overlapping, 1967, 1969
  inbreeding coefficients, 1967, 1969, 1973, 1975, 1978
  inbreeding coefficients matrix, output, 1973
  individuals, outputting coefficients, 1973
  individuals, specifying, 1971, 1975
  initial covariance value, 1976
  initial covariance value, assigning, 1973
  initial covariance value, specifying, 1969
  kinship coefficient, 1977
  last generation’s coefficients, output, 1973
  mating, offspring and parent, 1981
  mating, self, 1980
  matings, output, 1975
  monoecious population analysis, example, 1985
  offspring, 1973, 1980
  ordering observations, 1968
  OUTCOV= data set, 1974, 1982
  output table names, 1984
  panels, 1982, 1989
  pedigree analysis, 1967, 1968
  pedigree analysis, example, 1987, 1989
  population, monoecious, 1985
  population, multiparous, 1973, 1977
  population, nonoverlapping, 1974
  population, overlapping, 1968, 1969, 1979
  progeny, 1976, 1978, 1980, 1988
  second parent, 1975
  selective matings, output, 1975
  specifying gender, 1971
  syntax, 1972
  theoretical correlation, 1977
  unknown or missing parents, 1982
  variables, unaddressed, 1976
INBREED procedure, BY statement, 1974
INBREED procedure, CLASS statement, 1974
INBREED procedure, GENDER statement, 1975
INBREED procedure, MATINGS statement, 1975
INBREED procedure, PROC INBREED statement, 1973
  AVERAGE option, 1973
  COVAR option, 1973
  DATA= option, 1973
  IND option, 1973
  INDL option, 1973
  INIT= option, 1973
  MATRIX option, 1973
  MATRIXL option, 1973
  NOPRINT option, 1973
  OUTCOV= option, 1973
INBREED procedure, VAR statement, 1975
INC= option
  PROC TREE statement, 4751
INCLUDE= option
  MODEL statement (LOGISTIC), 2311
  MODEL statement (PHREG), 3230
  MODEL statement (REG), 3827
  MODEL statement (TPHREG), 4475
  PROC STEPDISC statement, 4166
incomplete block design
  generating with PLAN procedure, 3354, 3357
incomplete principal components
  REG procedure, 3818, 3828
IND option
  PROC INBREED statement, 1973
independent variable
  defined (ANOVA), 423
individual difference models
  MDS procedure, 2471
INDIVIDUAL option
  MODEL statement (TRANSREG), 4576
INDL option
  PROC INBREED statement, 1973
INDSCAL model
  MDS procedure, 2471, 2477
inertia, definition
  CORRESP procedure, 1070
INEST= data sets
  LIFEREG procedure, 2121
  ROBUSTREG procedure, 4011
INEST= option
  MCMC statement (MI), 2526
  PROC CALIS statement, 570
  PROC LIFEREG statement, 2090
  PROC LOGISTIC statement, 2292
  PROC PROBIT statement, 3712
  PROC ROBUSTREG statement, 3983
  PROC SURVEYLOGISTIC statement, 4251
inference
  mixed model (MIXED), 2741
  space, mixed model (MIXED), 2681, 2682, 2685, 2781
infinite likelihood
  MIXED procedure, 2717, 2773, 2775
infinite parameter estimates
  LOGISTIC procedure, 2313, 2338
  SURVEYLOGISTIC procedure, 4263, 4277
influence diagnostics, details
  MIXED procedure, 2763, 2765
INFLUENCE option
  MODEL statement (LOGISTIC), 2311
  MODEL statement (MIXED), 2696
  MODEL statement (REG), 3827
influence plots
  MIXED procedure, 2760
influence statistics
  REG procedure, 3898
INFO option
  PROC MIXED statement, 2677
information criteria
  MIXED procedure, 2676
information matrix, 3756
  expected (GENMOD), 1656
  LIFEREG procedure, 2083, 2084, 2108
  observed (GENMOD), 1656
INHESSIAN option
  PROC NLMIXED statement, 3066
INIT= option
  PROC INBREED statement, 1973
INITEST option
  PROC ROBUSTREG statement, 3988
INITH option
  PROC ROBUSTREG statement, 3988
initial covariance value
  assigning (INBREED), 1973
  INBREED procedure, 1976
  specifying (INBREED), 1969
initial estimates
  ACECLUS procedure, 404
  LIFEREG procedure, 2108
INITIAL option
  EM statement (MI), 2522
initial seed
  SURVEYSELECT procedure, 4441
initial seeds
  FASTCLUS procedure, 1380, 1381, 1394
initial values
  CALIS procedure, 550, 588, 590, 595, 597, 602, 661
  GENMOD procedure, 1638, 1647
  LOGISTIC procedure, 2376
  MDS procedure, 2480–2484, 2491
  MIXED procedure, 2706
  SURVEYLOGISTIC procedure, 4280
INITIAL= option
  MCMC statement (MI), 2527
  MODEL statement (GENMOD), 1638
  MODEL statement (LIFEREG), 2098
  PROC ACECLUS statement, 404
  PROC DISTANCE statement, 1257
  PROC MDS statement, 2481
  PROC STDIZE statement, 4131
  PROC VARCLUS statement, 4809
  REPEATED statement (GENMOD), 1647
initialization
  random (PRINQUAL), 3673
  TRANSREG procedure, 4602
INITITER= option
  PROC PRINQUAL statement, 3654
INMODEL= option
  PROC LOGISTIC statement, 2293
input data set
  MI procedure, 2519, 2526, 2558
input data sets
  MIANALYZE procedure, 2620
INRAM= option
  PROC CALIS statement, 570
inset
  LIFEREG procedure, 2092
  PROBIT procedure, 3723
INSET statement
  BOXPLOT procedure, 511
  LIFEREG procedure, 2092
  PROBIT procedure, 3723
INSETGROUP statement
  BOXPLOT procedure, 514
insets
  background color, 513, 515
  background color of header, 513, 515
  drop shadow color, 513
  frame color, 513, 516
  header text color, 513, 516
  header text, specifying, 514, 516
  positioning, details, 526–529
  positioning, options, 513, 514, 516
  suppressing frame, 514, 516
  text color, 513, 516
instantaneous failure rate
  PHREG procedure, 3240
INSTAT= option
  PROC FASTCLUS statement, 1391
INSTEP= option, 3097
  NLOPTIONS statement (CALIS), 614
  PROC CALIS statement, 581
  PROC NLMIXED statement, 3066
INT option
  PROC CANCORR statement, 760
integral approximations
  NLMIXED procedure, 3070, 3084
intensity model, See Andersen-Gill model
interaction effects
  MIXED procedure, 2744
  model parameterization (GLM), 1788
  quantitative (TRANSREG), 4594
  specifying (ANOVA), 451, 452
  specifying (CATMOD), 864
  specifying (GLM), 1784
  TRANSREG procedure, 4558, 4594
intercept
  GENMOD procedure, 1615, 1617, 1640
  hypothesis tests for (ANOVA), 446
  hypothesis tests for (GLM), 1771
  MIXED procedure, 2743
  model parameterization (GLM), 1787
  no intercept (TRANSREG), 4577
INTERCEPT effect
  CONTRAST statement (GLM), 1749, 1752
  MANOVA statement (ANOVA), 437
  MANOVA statement, H= option (GLM), 1759
INTERCEPT option
  MODEL statement (ANOVA), 446
  MODEL statement (GLM), 1771
  MODEL statement (MIXED), 2702
  MODEL statement (PLS), 3378
INTERCEPT= option
  MODEL statement (GENMOD), 1639
  MODEL statement (LIFEREG), 2099
  REPEATED statement (GENMOD), 1647
internal studentization, 2763
INTERP= option
  MODEL statement (LOESS), 2234
INTERPOL= option
  PLOT statement (GLMPOWER), 1942
  PLOT statement (POWER), 3483
interpretation
  factor rotation, 1293
interpreting factors, elements to consider, 1294
interpreting output
  VARCLUS procedure, 4818
interval determination
  LIFETEST procedure, 2174
interval level of measurement
  DISTANCE procedure, 1250
interval variable, 72
INTERVAL= option
  PLOT statement (BOXPLOT), 502
INTERVALS= option
  PROC LIFETEST statement, 2161
  SURVIVAL statement (LIFETEST), 2168
intraclass correlation coefficient
  MIXED procedure, 2791
INTSTART= option
  BOXPLOT procedure, 503
INVAR statement, MDS procedure, 2486
INVAR= option
  PROC CALIS statement, 570
inverse confidence limits
  PROBIT procedure, 3712, 3761
inverse Gaussian distribution
  GENMOD procedure, 1652
inverse matrix of X′X
  SURVEYREG procedure, 4391
INVERSE option
  MODEL statement (GLM), 1772
  MODEL statement (SURVEYREG), 4380
INVERSECL option
  PROC PROBIT statement, 3712
INVLINK statement, GENMOD procedure, 1635, 1646
INWGT= option
  PROC CALIS statement, 570
IPC analysis
  REG procedure, 3818, 3828, 3916
IPLOTS option
  MODEL statement (LOGISTIC), 2311
ipp plots
  annotating, 3729
  axes, color, 3729
  font, specifying, 3729
  options summarized by function, 3726
  reference lines, options, 3729–3733
  threshold lines, options, 3732
ippplot
  PROBIT procedure, 3725
IPPPLOT statement
  options summarized by function, 3726
  PROBIT procedure, 3725
IREPLACE option
  OUTPUT statement (TRANSREG), 4587
IRLS option
  PROC FASTCLUS statement, 1391
ITDETAILS option
  PROC MIXED statement, 2677
  PROC NLMIXED statement, 3067
ITER= modifier
  INFLUENCE option, MODEL statement (MIXED), 2698
ITER= option
  PROC MDS statement, 2481
iterated factor analysis, 1291
iteration history
  NLMIXED procedure, 3067, 3104
iterations
  history (MIXED), 2749
  history (PHREG), 3233, 3269
  history (TPHREG), 4486
  PRINQUAL procedure, 3667
  restarting (PRINQUAL), 3656, 3673
  restarting (TRANSREG), 4602
ITERATIONS= option
  MODEL statement (LOESS), 2234
iterative proportional fitting
  estimation (CATMOD), 843
  formulas (CATMOD), 896
ITPRINT option
  EM statement (MI), 2523
  MCMC statement (MI), 2527
  MODEL statement, 3990
  MODEL statement (CATMOD), 842
  MODEL statement (GAM), 1567
  MODEL statement (GENMOD), 1639
  MODEL statement (LIFEREG), 2099
  MODEL statement (LOGISTIC), 2312
  MODEL statement (PHREG), 3233
  MODEL statement (SURVEYLOGISTIC), 4263
  PROC ROBUSTREG statement, 3983
J

Jaccard dissimilarity coefficient
  DISTANCE procedure, 1276
Jaccard similarity coefficient
  DISTANCE procedure, 1276
JEFFREYS option
  PRIOR statement (MIXED), 2710
JOIN= option
  PROC MODECLUS statement, 2866
JOINCHAR= option
  PROC TREE statement, 4752
JOINT function
  RESPONSE statement (CATMOD), 853
JOINT option
  EXACT statement (LOGISTIC), 2301
joint selection probabilities
  SURVEYSELECT procedure, 4433
JOINTONLY option
  EXACT statement (LOGISTIC), 2302
Jonckheere-Terpstra test, 1491
Journal style
  examples, ODS Graphics, 358
  ODS styles, 332, 333, 346
JP option
  MODEL statement (REG), 3827
  PLOT statement (REG), 3846
JT option
  EXACT statement (FREQ), 1444
  OUTPUT statement (FREQ), 1448
  TABLES statement (FREQ), 1457
JTPROBS option
  PROC SURVEYSELECT statement, 4433
K

k-means clustering, 1379, 1380
k-sample tests, See homogeneity tests
k-th-nearest neighbor
  See also single linkage
  See also density linkage
  estimation (CLUSTER), 970, 972, 977
K0 option
  PROC ROBUSTREG statement, 3988
K= option
  and other options (CLUSTER), 969, 972
  PROC CLUSTER statement, 970
  PROC DISCRIM statement, 1148
  PROC MODECLUS statement, 2867
Kaplan-Meier estimate, See product-limit estimate
kappa coefficient, 1495, 1496
  tests, 1498
  weights, 1497
KAPPA option
  EXACT statement (FREQ), 1444
  OUTPUT statement (FREQ), 1448
  TEST statement (FREQ), 1463
KDE, 1993
KDE procedure, 1993
  bandwidth selection, 2008
  binning, 2004
  bivariate histogram, 2011
  computational details, 2002
  convolution, 2005
  examples, 2012
  fast Fourier transform, 2007
  ODS graph names, 2010
  options, 1996
  output table names, 2009
  syntax, 1996
KDE procedure, BIVAR statement, 1997
  BIVSTATS option, 1998
  BWM= option, 1998
  GRIDL= option, 1998
  GRIDU= option, 1998
  LEVELS= option, 1998
  NGRID= option, 1998
  OUT= option, 1998
  PLOTS= option, 2010
  UNISTATS option, 1999
KDE procedure, BY statement, 2001
KDE procedure, FREQ statement, 2001
KDE procedure, PROC KDE statement, 1996
  DATA= option, 1996
KDE procedure, UNIVAR statement, 1999
  BWM= option, 1999
  GRIDL= option, 1999
  GRIDU= option, 1999
  METHOD= option, 1999
  NGRID= option, 2000
  OUT= option, 2000
  PLOTS= option, 2010
  UNISTATS option, 2000
KDE procedure, WEIGHT statement, 2002
KEEP= modifier
  INFLUENCE option, MODEL statement (MIXED), 2699
Kendall’s tau-b statistic, 1474, 1477
KENTB option
  OUTPUT statement (FREQ), 1448
  TEST statement (FREQ), 1463
Kenward-Roger method
  MIXED procedure, 2695
kernel density estimates
  DISCRIM procedure, 1158, 1159, 1191, 1212
  KDE procedure, 1993
KERNEL= option
  PROC DISCRIM statement, 1148
KEY= option
  PLOT statement (GLMPOWER), 1942
  PLOT statement (POWER), 3483
keyword-lists
  POWER procedure, 3490
keyword= option
  BASELINE statement (PHREG), 3224
  OUTPUT statement (GENMOD), 1644
  OUTPUT statement (GLM), 1773
  OUTPUT statement (LIFEREG), 2100
  OUTPUT statement (PHREG), 3234
  OUTPUT statement (REG), 3834
  OUTPUT statement (ROBUSTREG), 3991
KLOTZ option
  EXACT statement (NPAR1WAY), 3160
  OUTPUT statement (NPAR1WAY), 3162
  PROC NPAR1WAY statement, 3157
Klotz scores
  NPAR1WAY procedure, 3168
knots
  PRINQUAL procedure, 3665, 3666
  TRANSREG procedure, 4567, 4568, 4571, 4613, 4678
KNOTS= option
  MODEL statement (TRANSREG), 4568
  TRANSFORM statement (PRINQUAL), 3665
Kolmogorov-Smirnov test
  NPAR1WAY procedure, 3169
KRATIO= option
  MEANS statement (ANOVA), 443
  MEANS statement (GLM), 1768
KRIGE2D procedure, 2033
  anisotropic models, 2053–2056
  best linear unbiased prediction (BLUP), 2060
  correlation range, 2034
  discontinuity, 2051
  effective range, 2047–2049
  examples, 2062
  exponential semivariogram model, 2048, 2049
  Gaussian semivariogram model, 2047, 2048
  geometric anisotropy, 2053–2055
  global kriging, 2034
  input data set, 2038
  kriging with trend, 2058
  local kriging, 2033, 2034
  nested models, 2050, 2051
  nugget effect, 2044, 2051, 2052
  ordinary kriging, 2033, 2056–2060
  OUTEST= data sets, 2060
  OUTNBHD= data set, 2060, 2061
  output data sets, 2039, 2060, 2061
  power semivariogram model, 2049
  range, 2047
  sill, 2047
  spatial continuity, 2033
  spatial covariance, 2034
  spatial data, 2056
  spatial random fields, 2057
  spherical semivariogram model, 2046, 2047
  syntax, 2037
  visual fit of the variogram, 2045
  zonal anisotropy, 2055
KRIGE2D procedure, COORDINATES statement, 2039
  XCOORD= option, 2039
  YCOORD= option, 2039
KRIGE2D procedure, GRID statement, 2040
  GRIDDATA= option, 2040
  X= option, 2040
  XCOORD= option, 2040
  Y= option, 2040
  YCOORD= option, 2040
KRIGE2D procedure, MODEL statement, 2042
  ANGLE= option, 2042
  FORM= option, 2042
  MDATA= option, 2043
  NUGGET= option, 2044
  RANGE= option, 2044
  RATIO= option, 2044
  SCALE= option, 2044
  SINGULAR= option, 2044
KRIGE2D procedure, PREDICT statement, 2041
  MAXPOINTS= option, 2041
  MINPOINTS= option, 2041
  NODECREMENT option, 2041
  NOINCREMENT option, 2041
  NUMPOINTS= option, 2042
  RADIUS= option, 2042
  VAR= option, 2042
KRIGE2D procedure, PROC KRIGE2D statement, 2038
  DATA= option, 2038
  OUTEST= option, 2039
  OUTNBHD= option, 2039
  SINGULARMSG= option, 2039
kriging
  ordinary kriging (KRIGE2D), 2058
  ordinary kriging (VARIOGRAM), 4851
  with trend (KRIGE2D), 2058
kriging, ordinary
  VARIOGRAM procedure, 4852
Kronecker product structure
  MIXED procedure, 2721
Kruskal-Wallis test
  NPAR1WAY procedure, 3166
KS option
  EXACT statement (NPAR1WAY), 3160
Kuiper test
  NPAR1WAY procedure, 3171
Kulczynski 1 coefficient
  DISTANCE procedure, 1276
kurtosis
  CALIS procedure, 549, 584, 588, 658, 660
  displayed in CLUSTER procedure, 972
KURTOSIS option
  PROC CALIS statement, 584
L

L matrices
  MIXED procedure, 2681, 2687, 2742
L95 option
  MODEL statement (RSREG), 4042
L95= option
  OUTPUT statement (NLIN), 3013
L95M option
  MODEL statement (RSREG), 4042
L95M= option
  OUTPUT statement (NLIN), 3013
L= option
  PROC FASTCLUS statement, 1391
label collision avoidance, 351
  ODS Graphics, 351
LABELANGLE= option
  BOXPLOT procedure, 503
lack of fit tests, 3712, 3759
  RSREG procedure, 4046
LACKFIT option
  MODEL statement (LOGISTIC), 2312
  MODEL statement (RSREG), 4042
  PROC PROBIT statement, 3712
lag functionality
  NLMIXED procedure, 3081
LAGDISTANCE= option
  COMPUTE statement (VARIOGRAM), 4867
Lagrange multiplier
  covariance matrix, 3030
  NLMIXED procedure, 3093
  statistics (GENMOD), 1668
  test, modification indices (CALIS), 584, 673, 674
LAGTOLERANCE= option
  COMPUTE statement (VARIOGRAM), 4867
lambda asymmetric, 1474, 1482
lambda symmetric, 1474, 1483
LAMBDA0= option
  MODEL statement (TPSPLINE), 4509
LAMBDA= option
  MODEL statement (TPSPLINE), 4509
  MODEL statement (TRANSREG), 4571
  TRANSFORM statement (MI), 2534
LAMCR option
  OUTPUT statement (FREQ), 1448
LAMDAS option
  OUTPUT statement (FREQ), 1448
LAMRC option
  OUTPUT statement (FREQ), 1448
Lance-Williams flexible-beta method, See flexible-beta method
Lance-Williams nonmetric coefficient
  DISTANCE procedure, 1272
LANNOTATE= option
  PROC LIFETEST statement, 2162
latent variables
  CALIS procedure, 549, 601, 625
  PLS procedure, 3367
latent vectors
  PLS procedure, 3367
LATEX destination
  ODS Graphics, 326, 335
LaTeX output
  examples, ODS Graphics, 358
  ODS Graphics, 334, 337
Latin square design
  ANOVA procedure, 472
  generating with PLAN procedure, 3356
lattice design
  balanced square lattice (LATTICE), 2069
  efficiency (LATTICE), 2071, 2075, 2078
  partially balanced square lattice (LATTICE), 2069, 2076
  rectangular lattice (LATTICE), 2069
lattice layouts
  ODS Graphics, 381
LATTICE procedure, 2072
  adjusted treatment means, 2075
  ANOVA table, 2074
  Block variable, 2069, 2072, 2073
  compared to MIXED procedure, 2665
  covariance, 2075
  Group variable, 2069, 2072, 2073
  lattice design efficiency, 2071, 2075
  least significant differences, 2075
  missing values, 2074
  ODS table names, 2075
  Rep variable, 2069, 2072, 2073
  response variable, 2072
  syntax, 2072
  Treatmnt variable, 2069, 2072, 2073
  variance of means, 2074
LATTICE procedure, BY statement, 2072
LATTICE procedure, PROC LATTICE statement, 2072
  COV option, 2072
  COVARIANCE option, 2072
  DATA= option, 2072
LATTICE procedure, VAR statement, 2073
layout area
  ODS Graphics, 342
LBOXES= option
  PLOT statement (BOXPLOT), 503
LCDEACT= option
  NLOPTIONS statement (CALIS), 623
  PROC NLMIXED statement, 3067
LCEPSILON= option
  NLOPTIONS statement (CALIS), 623
  PROC NLMIXED statement, 3067
LCL keyword
  OUTPUT statement (GLM), 1774
LCLM keyword
  OUTPUT statement (GLM), 1774
LCOMPONENTS option
  MODEL statement (MIXED), 2702
LCONF= option
  MCMC statement (MI), 2525
LCONNECT= option
  MCMC statement (MI), 2529
LCSINGULAR= option
  NLOPTIONS statement (CALIS), 623
  PROC NLMIXED statement, 3067
LD statistic
  PHREG procedure, 3234, 3260
LDATA= option
  RANDOM statement (MIXED), 2714
  REPEATED statement (MIXED), 2718
leader algorithm, 1380
LEAFCHAR= option
  PROC TREE statement, 4752
least significant differences
  LATTICE procedure, 2075
least-significant-difference test, 445, 1769
least-squares estimation
  LIFEREG procedure, 2108
least-squares means
  Bonferroni adjustment (GLM), 1754
  Bonferroni adjustment (MIXED), 2688
  BYLEVEL processing (MIXED), 2689
  coefficient adjustment, 1822
  compared to means (GLM), 1804
  comparison types (GLM), 1757
  comparison types (MIXED), 2690
  construction of, 1820
  covariate values (GLM), 1755
  covariate values (MIXED), 2688
  Dunnett’s adjustment (GLM), 1754
  Dunnett’s adjustment (MIXED), 2688
  examples (GLM), 1859, 1866
  examples (MIXED), 2796, 2819
  GLM procedure, 1753
  Hsu’s adjustment (GLM), 1754
  Hsu’s adjustment (MIXED), 2688
  mixed model (MIXED), 2687
  multiple comparisons adjustment (GLM), 1754, 1757
  multiple comparisons adjustment (MIXED), 2687
  nonstandard weights (GLM), 1756
  nonstandard weights (MIXED), 2690
  observed margins (GLM), 1756
  observed margins (MIXED), 2690
  Sidak’s adjustment (GLM), 1754
  Sidak’s adjustment (MIXED), 2688
  simple effects (GLM), 1758, 1817
  simple effects (MIXED), 2691
  simulation-based adjustment (GLM), 1754
  simulation-based adjustment (MIXED), 2688
  Tukey’s adjustment (GLM), 1754
  Tukey’s adjustment (MIXED), 2688
LEAST= option
  PROC FASTCLUS statement, 1391
Lee-Wei-Amato model
  PHREG procedure, 3251, 3314
left truncation time
  PHREG procedure, 3229, 3263
LEGEND= option
  PLOT statement (REG), 3846
LENDGRID= option
  PLOT statement (BOXPLOT), 503
less-than-full-rank model
  TRANSREG procedure, 4569, 4672
level of measurement
  MDS procedure, 2472, 2481
LEVEL= option
  PROC MDS statement, 2481
  PROC TREE statement, 4752
levels of measurement
  DISTANCE procedure, 1249
levels, of class variable, 1784
LEVELS= option
  BIVAR statement, 1998
Levenberg-Marquardt algorithm
  CALIS procedure, 578, 581, 665
Levene’s test for homogeneity of variance
  ANOVA procedure, 443
  GLM procedure, 1767, 1819, 1893
leverage, 1774
  MIXED procedure, 2767
  TRANSREG procedure, 4587
LEVERAGE keyword
  OUTPUT statement (ROBUSTREG), 3991
LEVERAGE option
  MODEL statement, 3990
LEVERAGE= option
  OUTPUT statement (TRANSREG), 4587
LGOR option
  OUTPUT statement (FREQ), 1448
LGRID= option
  PLOT statement (BOXPLOT), 504
LGRRC1 option
  OUTPUT statement (FREQ), 1448
LGRRC2 option
  OUTPUT statement (FREQ), 1448
LHREF= option
  PLOT statement (BOXPLOT), 504
  PLOT statement (REG), 3846
life data
  GENMOD procedure, 1701
life-table estimate
  LIFETEST procedure, 2149, 2186, 2211
lifereg analysis
  insets, 2092
LIFEREG procedure, 2083
  accelerated failure time models, 2083
  censoring, 2094
  computational details, 2108
  computational resources, 2123
  confidence intervals, 2115
  failure time, 2083
  INEST= data sets, 2121
  information matrix, 2083, 2084, 2108
  initial estimates, 2108
  inset, 2092
  Lagrange multiplier test statistics, 2110
  least-squares estimation, 2108
  log-likelihood function, 2084, 2108
  log-likelihood ratio tests, 2084
  main effects, 2108
  maximum likelihood estimates, 2083
  missing values, 2108
  Newton-Raphson algorithm, 2083
  OUTEST= data sets, 2121
  output table names, 2124
  predicted values, 2114
  supported distributions, 2111
  survival function, 2083, 2111
  syntax, 2089
  Tobit model, 2085, 2129
  XDATA= data sets, 2122
LIFEREG procedure, BY statement, 2091
LIFEREG procedure, CLASS statement, 2092
LIFEREG procedure, INSET statement, 2092
  keywords, 2092
LIFEREG procedure, MODEL statement, 2094
  ALPHA= option, 2096
  CONVERGE= option, 2096
  CONVG= option, 2096
  CORRB option, 2097
  COVB option, 2097
  DISTRIBUTION= option, 2097
  INITIAL= option, 2098
  INTERCEPT= option, 2099
  ITPRINT option, 2099
  MAXITER= option, 2099
  NOINT option, 2099
  NOLOG option, 2099
  NOSCALE option, 2099
  NOSHAPE1 option, 2099
  SCALE= option, 2099
  SHAPE1= option, 2099
  SINGULAR= option, 2099
LIFEREG procedure, OUTPUT statement, 2100
  CDF keyword, 2100
  CENSORED keyword, 2100
  CONTROL keyword, 2100
  keyword= option, 2100
  OUT= option, 2100
  PREDICTED keyword, 2101
  QUANTILES keyword, 2101
  STD_ERR keyword, 2101
  XBETA keyword, 2102
LIFEREG procedure, PPLOT statement
  ANNOTATE= option, 2102
  CAXIS= option, 2102
  CCENSOR option, 2102
  CENBIN, 2102
  CENCOLOR option, 2102
  CENSYMBOL option, 2103
  CFIT= option, 2103
  CFRAME= option, 2103
  CGRID= option, 2103
  CHREF= option, 2103
  CTEXT= option, 2103
  CVREF= option, 2103
  DESCRIPTION= option, 2103
  FONT= option, 2103
  HCL, 2103
  HEIGHT= option, 2103
  HLOWER= option, 2104
  HOFFSET= option, 2104
  HREF= option, 2104
  HREFLABELS= option, 2104
  HREFLABPOS= option, 2104
  HUPPER= option, 2104
  INBORDER option, 2104
  INTERTILE option, 2104
  ITPRINTEM option, 2104
  JITTER option, 2105
  LFIT option, 2105
  LGRID option, 2105
  LHREF= option, 2105
  LVREF= option, 2105
  MAXITEM= option, 2105
  NAME= option, 2105
  NOCENPLOT option, 2105
  NOCONF option, 2105
  NODATA option, 2105
  NOFIT option, 2105
  NOFRAME option, 2105
  NOGRID option, 2105
  NOHLABEL option, 2105
  NOHTICK option, 2106
  NOPOLISH option, 2106
  NOVLABEL option, 2106
  NOVTICK option, 2106
  NPINTERVALS option, 2106
  PCTLIST option, 2106
  PLOWER= option, 2106
  PPOS option, 2106
  PPOUT option, 2106
  PRINTPROBS option, 2106
  PROBLIST option, 2107
  PUPPER= option, 2106
  ROTATE option, 2107
  SQUARE option, 2107
  TOLLIKE option, 2107
  TOLPROB option, 2107
  VAXISLABEL= option, 2107
  VREF= option, 2107
  VREFLABELS= option, 2107
  VREFLABPOS= option, 2107
  WAXIS= option, 2107
  WFIT= option, 2107
  WGRID= option, 2108
  WREFL= option, 2108
LIFEREG procedure, PROBPLOT statement, 2102
LIFEREG procedure, PROC LIFEREG statement, 2090
  COVOUT option, 2090
  DATA= option, 2090
  GOUT= option, 2090
  INEST= option, 2090
  NAMELEN= option, 2090
  NOPRINT option, 2090
  ORDER= option, 2090
  OUTEST= option, 2091
  XDATA= option, 2091
LIFEREG procedure, WEIGHT statement, 2108
LIFETEST procedure
  arcsine-square root transform, 2205
  arcsine-square root transformation, 2169, 2177
  association tests, 2150, 2156, 2180, 2191, 2200
  censored, 2186
  computational formulas, 2171
  confidence bands, 2169, 2176
  confidence limits, 2174, 2183, 2184
  cumulative distribution function, 2149
  effective sample size, 2172, 2173
  equal precision bands, 2178, 2205
  Fleming-Harrington Gρ test for homogeneity, 2150, 2168
  Hall-Wellner bands, 2177, 2205
  hazard function, 2149, 2214
  homogeneity tests, 2149, 2154, 2178, 2200
  interval determination, 2174
  life-table estimate, 2149, 2172, 2186, 2209, 2211
  likelihood ratio test for homogeneity, 2150, 2179
  lineprinter plots, 2159
  log-log transformation, 2170, 2177
  log-rank test for association, 2150, 2180
  log-rank test for homogeneity, 2150, 2168, 2178
  logarithmic transformation, 2170, 2177
  logit transformation, 2170, 2177
  median residual time, 2186
  missing stratum values, 2166, 2167
  missing values, 2171
  modified Peto-Peto test for homogeneity, 2150, 2168
  ODS graph names, 2191
  ODS table names, 2188
  output data sets, 2183
  partial listing, 2165
  Peto-Peto test for homogeneity, 2150, 2168
  probability density function, 2149, 2214
  product-limit estimate, 2149, 2171, 2185, 2191
  stratified tests, 2150, 2155, 2156, 2158, 2167, 2180, 2187
  survival distribution function, 2149, 2171
  Tarone-Ware test for homogeneity, 2150, 2168
  traditional high-resolution graphics, 2159
  trend tests, 2150, 2168, 2179, 2187
  Wilcoxon test for association, 2150, 2180
  Wilcoxon test for homogeneity, 2150, 2168, 2178
LIFETEST procedure, BY statement, 2165
LIFETEST procedure, FREQ statement, 2166
LIFETEST procedure, ID statement, 2166
LIFETEST procedure, PROC LIFETEST statement, 2158
  ALPHA= option, 2160
  ALPHAQT= option, 2160
  ANNOTATE= option, 2160
  CENSOREDSYMBOL= option, 2160
  DATA= option, 2160
  DESCRIPTION= option, 2160
  EVENTSYMBOL= option, 2161
  FORMCHAR= option, 2161
  GOUT= option, 2161
  INTERVALS= option, 2161
  LANNOTATE= option, 2162
  LINEPRINTER option, 2162
  MAXTIME= option, 2162
  METHOD= option, 2162
  MISSING option, 2163
  NINTERVAL= option, 2163
  NOCENSPLOT option, 2163
  NOPRINT option, 2163
  NOTABLE option, 2163
  OUTSURV= option, 2163
  OUTTEST= option, 2163
  PLOTS= option, 2164
  REDUCEOUT option, 2164
  SINGULAR= option, 2164
  TIMELIM= option, 2164
  TIMELIST= option, 2165
  WIDTH= option, 2165
LIFETEST procedure, STRATA statement, 2166
  GROUP= option, 2167
  MISSING option, 2167
  NODETAIL option, 2167
  NOTEST option, 2167
  TEST= option, 2168
  TREND option, 2168
LIFETEST procedure, SURVIVAL statement, 2168
  ALPHA= option, 2168, 2191
  BANDMAX= option, 2169, 2191
  BANDMAXTIME= option, 2169
  BANDMIN= option, 2169, 2191
  BANDMINTIME= option, 2169
  CONFBAND= option, 2169
  CONFTYPE= option, 2169, 2191
  INTERVALS= option, 2168
  MAXTIME= option, 2191
  OUT= option, 2170
  PLOTS= option, 2190
  REDUCEOUT option, 2168
  STDERR option, 2170
  TIMELIST= option, 2168
LIFETEST procedure, TEST statement, 2170
LIFETEST procedure, TIME statement, 2171
likelihood displacement
  PHREG procedure, 3234, 3260
likelihood distance
  MIXED procedure, 2770
likelihood function, 3756
likelihood ratio chi-square test, 3756, 3759
likelihood ratio test, 2780
  Bartlett’s modification, 1150
  CALIS procedure, 653, 674
  example (MIXED), 2794
  mixed model (MIXED), 2741, 2743
  MIXED procedure, 2751
  PHREG procedure, 3246, 3269
  TPHREG procedure, 4486
likelihood ratio test for homogeneity
  LIFETEST procedure, 2150
likelihood residuals
  GENMOD procedure, 1670
likelihood-ratio chi-square test, 1469
  power and sample size (POWER), 3457, 3463, 3525
likelihood-ratio test
  chi-square (FREQ), 1471
LILPREFIX= option
  OUTPUT statement (TRANSREG), 4587
LINCON statement, CALIS procedure, 609
line colors, modifying
  examples, ODS Graphics, 369
line patterns, modifying
  examples, ODS Graphics, 369
line printer plots
  REG procedure, 3848, 3882
line thickness, modifying
  examples, ODS Graphics, 376
line-search methods
  NLMIXED procedure, 3066, 3067, 3096
linear discriminant function, 1139
linear equations model, See LINEQS model
linear hypotheses
  PHREG procedure, 3217, 3238, 3247
linear model
  GENMOD procedure, 1612, 1613
linear models
  CATMOD procedure, 814
  compared with log-linear models, 817
linear predictor
  GENMOD procedure, 1611, 1612, 1618, 1661, 1689
  PHREG procedure, 3225, 3233, 3235, 3302, 3303
linear rank tests, See association tests
linear regression
  TRANSREG procedure, 4592
linear structural relationship model, See LISREL model
linear structure
  MIXED procedure, 2721
LINEAR transformation
  MODEL statement (TRANSREG), 4563
  TRANSFORM statement (PRINQUAL), 3662
linear transformation
  PRINQUAL procedure, 3662
  TRANSREG procedure, 4563, 4610
LINEPRINTER option
  PROC LIFETEST statement, 2162
  PROC REG statement, 3817
  PROC TREE statement, 4752
LINEQS model
  CALIS procedure, 553, 601
  specification, 560
  structural model example (CALIS), 558, 562
LINEQS statement, CALIS procedure, 601
LINES option
  MEANS statement (ANOVA), 444
  MEANS statement (GLM), 1768
LINES= option
  PROC TREE statement, 4752
LINESEARCH= option, 3096
  NLOPTIONS statement (CALIS), 614
  PROC CALIS statement, 580
  PROC NLMIXED statement, 3067
link function
  built-in (GENMOD), 1614, 1639
  GENMOD procedure, 1611, 1612, 1653
  LOGISTIC procedure, 2281, 2312, 2334, 2344
  SURVEYLOGISTIC procedure, 4243, 4263, 4273
  user-defined (GENMOD), 1634
LINK= option
  MODEL statement (GENMOD), 1639
  MODEL statement (LOGISTIC), 2312
  MODEL statement (SURVEYLOGISTIC), 4263
LISREL model
  CALIS procedure, 554
  structural model example (CALIS), 559
LIST option
  PROC DISCRIM statement, 1148
  PROC FASTCLUS statement, 1393
  PROC MODECLUS statement, 2867
  PROC NLIN statement, 3008
  PROC NLMIXED statement, 3068
  PROC TREE statement, 4752
  STRATA statement (SURVEYFREQ), 4196
  STRATA statement (SURVEYLOGISTIC), 4266
  STRATA statement (SURVEYMEANS), 4332
  STRATA statement (SURVEYREG), 4381
  TABLES statement (FREQ), 1457
LISTALL option
  PROC NLIN statement, 3008
LISTCODE option
  PROC NLIN statement, 3008
  PROC NLMIXED statement, 3068
LISTDEP option
  PROC NLIN statement, 3008
LISTDER option
  PROC NLIN statement, 3008
LISTERR option
  PROC DISCRIM statement, 1148
LIUPREFIX= option
  OUTPUT statement (TRANSREG), 4587
LLINE= option
  PLOT statement (REG), 3846
LMAX statistic
  PHREG procedure, 3260
LMLPREFIX= option
  OUTPUT statement (TRANSREG), 4588
local influence
  DFBETA statistics (PHREG), 3234, 3260
  score residuals (PHREG), 3234, 3259
  weighted score residuals (PHREG), 3260
local kriging
  KRIGE2D procedure, 2034
LOCAL option
  PROC MODECLUS statement, 2867
LOCAL= option
  REPEATED statement (MIXED), 2718
LOCALW option
  REPEATED statement (MIXED), 2719
LOESS procedure
  approximate degrees of freedom, 2246
  automatic smoothing parameter selection, 2243
  data scaling, 2239
  direct fitting method, 2240
  graphics, 2250
  introductory example, 2220
  iterative reweighting, 2242
  kd trees and blending, 2241
  local polynomials, 2242
  local weighting, 2242
  missing values, 2238
  ODS graph names, 2250
  output data sets, 2238
  output table names, 2248
  scoring data sets, 2248
  statistical inference, 2243
LOESS procedure, BY statement, 2230
LOESS procedure, ID statement, 2231
LOESS procedure, MODEL statement, 2231
  ALL option, 2232
  ALPHA= option, 2232
  BUCKET= option, 2232
  CLM= option, 2232
  DEGREE= option, 2232
  DETAILS option, 2232
  DFMETHOD= option, 2233
  DFMETHOD=APPROX(Cutoff= ) option, 2233
  DFMETHOD=APPROX(Quantile= ) option, 2233
  DIRECT option, 2233
  DROPSQUARE= option, 2233
  INTERP= option, 2234
  ITERATIONS= option, 2234
  RESIDUAL option, 2234
  SCALE= option, 2234
  SCALEDINDEP option, 2234
  SELECT= option, 2234
  SMOOTH= option, 2236
  STD option, 2236
  T option, 2236
  TRACEL option, 2236
LOESS procedure, PROC LOESS statement, 2230
  PLOTS option, 2250
  PLOTS(MAXPOINTS=) option, 2250
  UNPACKPANELS option, 2250
LOESS procedure, SCORE statement, 2236
  CLM option, 2237
  PRINT option, 2237
  SCALEDINDEP option, 2237
  STEPS option, 2237
LOESS procedure, WEIGHT statement, 2237
log likelihood
  output data sets (LOGISTIC), 2294
log odds
  LOGISTIC procedure, 2347
  SURVEYLOGISTIC procedure, 4289
CLM option, 2237 PRINT option, 2237 SCALEDINDEP option, 2237 STEPS option, 2237 LOESS procedure, WEIGHT statement, 2237 log likelihood output data sets (LOGISTIC), 2294 log odds LOGISTIC procedure, 2347 SURVEYLOGISTIC procedure, 4289 LOG option MCMC statement (MI), 2525, 2529 PROC PROBIT statement, 3713 LOG transformation MODEL statement (TRANSREG), 4562 TRANSFORM statement (MI), 2534 TRANSFORM statement (PRINQUAL), 3661 log-interval level of measurement DISTANCE procedure, 1250 log-likelihood functions (GENMOD), 1654 log-likelihood function LIFEREG procedure, 2084, 2108 PROBIT procedure, 3756 log-likelihood ratio tests LIFEREG procedure, 2084 log-linear models CATMOD procedure, 814, 870, 1616 compared with linear models, 817 design matrix (CATMOD), 884 examples (CATMOD), 916, 919 GENMOD procedure, 1616 multiple populations (CATMOD), 872 one population (CATMOD), 871 log-linear variance model MIXED procedure, 2718 log-log transformation LIFETEST procedure, 2170, 2177 log-rank test PHREG procedure, 3219 log-rank test for association LIFETEST procedure, 2150 log-rank test for homogeneity LIFETEST procedure, 2150, 2168, 2178 power and sample size (POWER), 3473, 3481, 3533, 3561 LOG10 option PROC PROBIT statement, 3713 logarithmic transformation LIFETEST procedure, 2170, 2177 LOGDETH option PARMS statement (MIXED), 2707 logistic analysis CATMOD procedure, 815, 868 caution (CATMOD), 870 examples (CATMOD), 933 ordinal data, 815 logistic distribution, 2083, 2097, 2111, 3705 PROBIT procedure, 3757
LOGISTIC option
  MONOTONE statement (MI), 2532
LOGISTIC procedure, 2289
  Akaike’s information criterion, 2341
  Bayes’ theorem, 2314
  best subset selection, 2308
  branch and bound algorithm, 2341
  classification table, 2314, 2352, 2353, 2422
  conditional logistic regression, 2365
  confidence intervals, 2314, 2315, 2319, 2345, 2346
  confidence limits, 2350
  convergence criterion, 2308
  customized odds ratio, 2328
  descriptive statistics, 2294
  deviance, 2308, 2316, 2354
  DFBETAS diagnostic, 2360
  dispersion parameter, 2354
  displayed output, 2381
  estimability checking, 2300
  exact logistic regression, 2300, 2369
  existence of MLEs, 2338
  Fisher’s scoring method, 2317, 2318, 2336
  goodness of fit, 2308, 2316
  gradient, 2343
  hat matrix, 2359
  Hessian matrix, 2317, 2343
  hierarchy, 2310
  Hosmer-Lemeshow test, 2312, 2356, 2357
  infinite parameter estimates, 2313
  initial values, 2376
  introductory example, 2284
  link function, 2281, 2312, 2334, 2344
  log odds, 2347
  maximum likelihood algorithms, 2336
  missing values, 2329
  model fitting criteria, 2341
  model hierarchy, 2283, 2310
  model selection, 2306, 2317, 2340
  multiple classifications, 2315
  Newton-Raphson algorithm, 2317, 2318, 2336, 2338
  odds ratio confidence limits, 2308, 2315
  odds ratio estimation, 2347
  ODS graph names, 2390
  ODS table names, 2386
  output data sets, 2294, 2374, 2376, 2377
  output ROC data sets, 2378
  overdispersion, 2316, 2354, 2355
  Pearson’s chi-square, 2308, 2316, 2354
  predicted probabilities, 2350
  prior event probability, 2314, 2353, 2422
  profile likelihood convergence criterion, 2314
  rank correlation, 2350
Schwarz criterion, 2341 score statistics, 2343 selection methods, 2306, 2317, 2340 singular contrast matrix, 2300 subpopulation, 2308, 2316, 2355 syntax, 2289 testing linear hypotheses, 2327, 2358 Williams’ method, 2355 LOGISTIC procedure, BY statement, 2294 LOGISTIC procedure, CLASS statement, 2295 CPREFIX= option, 2295 DESCENDING option, 2295 LPREFIX= option, 2295 MISSING option, 2296 ORDER= option, 2296 PARAM= option, 2296 REF= option, 2297 TRUNCATE option, 2297 LOGISTIC procedure, CONTRAST statement, 2297 ALPHA= option, 2300 E option, 2300 ESTIMATE= option, 2300 SINGULAR= option, 2300 LOGISTIC procedure, EXACT statement, 2300 ALPHA= option, 2301 CLTYPE= option, 2302 ESTIMATE option, 2301 JOINT option, 2301 JOINTONLY option, 2302 MIDPFACTOR= option, 2302 ONESIDED option, 2302 OUTDIST= option, 2302 LOGISTIC procedure, FREQ statement, 2303 LOGISTIC procedure, GRAPHICS statement, 2389 LOGISTIC procedure, MODEL statement, 2304 ABSFCONV option, 2308 AGGREGATE= option, 2308 ALPHA= option, 2308 BEST= option, 2308 CL option, 2319 CLODDS= option, 2308 CLPARM= option, 2309 CORRB option, 2309 COVB option, 2309 CTABLE option, 2309 DESCENDING option, 2305 DETAILS option, 2309 EXPEST option, 2309 FAST option, 2309 FCONV= option, 2310 GCONV= option, 2310 HIERARCHY= option, 2310 INCLUDE= option, 2311 INFLUENCE option, 2311 IPLOTS option, 2311 ITPRINT option, 2312 LACKFIT option, 2312 LINK= option, 2312 MAXFUNCTION= option, 2313
  MAXITER= option, 2313
  MAXSTEP= option, 2313
  NOCHECK option, 2313
  NODESIGNPRINT= option, 2313
  NODUMMYPRINT= option, 2313
  NOFIT option, 2314
  NOINT option, 2313
  NOLOGSCALE option, 2314
  OFFSET= option, 2314
  ORDER= option, 2305
  OUTROC= option, 2314
  PARMLABEL option, 2314
  PEVENT= option, 2314
  PLCL option, 2314
  PLCONV= option, 2314
  PLRL option, 2315
  PPROB= option, 2315
  RIDGING= option, 2315
  RISKLIMITS option, 2315
  ROCEPS= option, 2315
  RSQUARE option, 2315
  SCALE= option, 2316
  SELECTION= option, 2317
  SEQUENTIAL option, 2317
  SINGULAR= option, 2317
  SLENTRY= option, 2317
  SLSTAY= option, 2317
  START= option, 2318
  STB option, 2318
  STOP= option, 2318
  STOPRES option, 2318
  TECHNIQUE= option, 2318
  WALDCL option, 2319
  WALDRL option, 2315
  XCONV= option, 2319
LOGISTIC procedure, OUTPUT statement, 2319
  ALPHA= option, 2322
  C= option, 2321
  CBAR= option, 2321
  DFBETAS= option, 2321
  DIFCHISQ= option, 2322
  DIFDEV= option, 2322
  H= option, 2322
  LOWER= option, 2320
  OUT= option, 2320
  PREDICTED= option, 2320
  PREDPROBS= option, 2320
  RESCHI= option, 2322
  RESDEV= option, 2322
  STDXBETA= option, 2321
  UPPER= option, 2321
  XBETA= option, 2321
LOGISTIC procedure, PROC LOGISTIC statement, 2290
  ALPHA= option, 2290
  COVOUT option, 2290
  DATA= option, 2290
  DESCENDING option, 2290
  EXACTONLY option, 2290
  EXACTOPTIONS option, 2291
  INEST= option, 2292
  INMODEL= option, 2293
  NAMELEN= option, 2293
  NOCOV option, 2293
  NOPRINT option, 2293
  OUTDESIGN= option, 2293
  OUTDESIGNONLY option, 2293
  OUTEST= option, 2294
  OUTMODEL= option, 2294
  SIMPLE option, 2294
LOGISTIC procedure, SCORE statement
  ALPHA= option, 2324
  CLM option, 2324
  DATA= option, 2324
  FITSTAT option, 2325
  OUT= option, 2325
  OUTROC= option, 2325
  PRIOR= option, 2325
  PRIOREVENT= option, 2325
  ROCEPS= option, 2325
LOGISTIC procedure, TEST statement, 2327
  PRINT option, 2327
LOGISTIC procedure, UNITS statement, 2328
  DEFAULT= option, 2328
LOGISTIC procedure, WEIGHT statement, 2328
  NORMALIZE option, 2329
logistic regression, 3705
  See also LOGISTIC procedure
  See also SURVEYLOGISTIC procedure
  CATMOD procedure, 814, 869
  examples (CATMOD), 911
  GENMOD procedure, 1613, 1697
logistic regression method
  MI procedure, 2546
LOGIT function
  RESPONSE statement (CATMOD), 853
LOGIT transformation
  MODEL statement (TRANSREG), 4562
  TRANSFORM statement (MI), 2534
  TRANSFORM statement (PRINQUAL), 3661
logit transformation
  LIFETEST procedure, 2170, 2177
logits
  See adjacent-category logits
  See also cumulative logits
  See also generalized logits
LOGLIN statement
  CATMOD procedure, 839
loglogistic distribution, 2083, 2097, 2111
LOGNLAMBDA0= option
  MODEL statement (TPSPLINE), 4509
LOGNLAMBDA= option
  MODEL statement (TPSPLINE), 4509
lognormal data
  power and sample size (POWER), 3434, 3437, 3438, 3450, 3455, 3456, 3465, 3472, 3509, 3511, 3519, 3521, 3529, 3531, 3552
lognormal distribution, 2083, 2097, 2111
LOGNOTE option
  PROC MIXED statement, 2677
  PROC NLMIXED statement, 3068
LOGNOTE= option
  PRIOR statement (MIXED), 2711
LOGOR= option
  REPEATED statement (GENMOD), 1647
LOGRBOUND= option
  PRIOR statement (MIXED), 2711
long run times
  NLMIXED procedure, 3098
Longley data set, 3197
LOWER= option
  ONESAMPLEMEANS statement (POWER), 3434
  OUTPUT statement (LOGISTIC), 2320
  PAIREDMEANS statement (POWER), 3450
  TWOSAMPLEMEANS statement (POWER), 3466
LOWERB= option
  PARMS statement (MIXED), 2707
LOWERTAILED option
  ESTIMATE statement (MIXED), 2686
  TEST statement (MULTTEST), 2947
Lp clustering
  FASTCLUS procedure, 1379, 1391
lpred plots
  annotating, 3737
  axes, color, 3737
  font, specifying, 3737
  reference lines, options, 3737–3739, 3741
  threshold lines, options, 3740
lpredplot
  PROBIT procedure, 3733
LPREDPLOT statement
  options summarized by function, 3734
  PROBIT procedure, 3733
LPREFIX= option
  CLASS statement (LOGISTIC), 2295
  CLASS statement (SURVEYLOGISTIC), 4253
  CLASS statement (TPHREG), 4477
  MODEL statement (TRANSREG), 4568, 4576
LR statistics
  MI procedure, 2555
LRCHI option
  EXACT statement (FREQ), 1444
  OUTPUT statement (FREQ), 1448, 1539
LRCHISQ option
  TABLES statement (SURVEYFREQ), 4200
LRCHISQ1 option
  TABLES statement (SURVEYFREQ), 4200
LRCI option
  MODEL statement (GENMOD), 1640
LREF= option
  MCMC statement (MI), 2525
LS-means, See least-squares means
lsd (least significant differences)
  LATTICE procedure, 2075
LSD option
  MEANS statement (ANOVA), 444, 445
  MEANS statement (GLM), 1768
LSD test, 1769
LSMEANS statement
  GLM procedure, 1753
  MIXED procedure, 2687
LSPRECISION= option
  NLOPTIONS statement (CALIS), 614
  PROC CALIS statement, 581
  PROC NLMIXED statement, 3068
LVREF= option
  PLOT statement (BOXPLOT), 504
  PLOT statement (REG), 3846
M

M= option
  MANOVA statement (ANOVA), 437
  MANOVA statement (GLM), 1759
  MODEL statement (TPSPLINE), 4509
MAC method
  PRINQUAL procedure, 3643, 3669
MACRO option
  OUTPUT statement (TRANSREG), 4588
macros
  TRANSREG procedure, 4588
Mahalanobis distance, 791, 1158
  CANDISC procedure, 804
MAHALANOBIS option
  PROC DISCRIM statement, 1148
main effects
  design matrix (CATMOD), 877
  GENMOD procedure, 1660
  LIFEREG procedure, 2108
  MIXED procedure, 2744
  model parameterization (GLM), 1788
  specifying (ANOVA), 451, 452
  specifying (CATMOD), 864
  specifying (GLM), 1784
  TRANSREG procedure, 4558, 4594
MAKE statement
  MIXED procedure, 2757
Mallows’ Cp selection
  REG procedure, 3875
manifest variables
  CALIS procedure, 549
Mann-Whitney-Wilcoxon test
  NPAR1WAY procedure, 3166
MANOVA
  See multivariate analysis of variance
  CANDISC procedure, 786
MANOVA option
  PROC ANOVA statement, 433
  PROC DISCRIM statement, 1148
  PROC GLM statement, 1745
MANOVA statement
  ANOVA procedure, 436
  GLM procedure, 1759
Mantel-Haenszel chi-square test, 1469, 1472
Mantel-Haenszel test
  log-rank test (PHREG), 3219
MAR
  MI procedure, 2537, 2564
MARGINAL function
  RESPONSE statement (CATMOD), 853
marginal probabilities
  See also response functions
  specifying in CATMOD procedure, 853
marginal residuals
  MIXED procedure, 2764
marker symbols, modifying
  examples, ODS Graphics, 369
MARKERS= option
  PLOT statement (GLMPOWER), 1943
  PLOT statement (POWER), 3484
martingale residuals
  PHREG procedure, 3234, 3258, 3302
matched comparisons, See paired comparisons
Matern covariance structure
  MIXED procedure, 2721
mating
  offspring and parent (INBREED), 1981
  self (INBREED), 1980
MATINGS statement, INBREED procedure, 1975
matrix
  decompositions (CORRESP), 1079, 1100
  factor, defined for factor analysis (FACTOR), 1292
  inversion (CALIS), 647
  multiplication (SCORE), 4065
  names, default (CALIS), 608
  notation, theory (MIXED), 2731
  properties, COSAN model (CALIS), 592
MATRIX option
  PROC INBREED statement, 1973
matrix properties
  COSAN model (CALIS), 592
MATRIX statement
  CALIS procedure, 593
  MDS procedure, 2486
MATRIX statement (CALIS)
  factor analysis model, 595
MATRIXL option
  PROC INBREED statement, 1973
MAX= option
  PLOT statement (GLMPOWER), 1943
  PLOT statement (POWER), 3484
  PREDICT statement (KRIGE2D), 2041
MAXCLUSTERS= option
  PROC FASTCLUS statement, 1388
  PROC MODECLUS statement, 2867
  PROC VARCLUS statement, 4810
MAXEIGEN= option
  PROC VARCLUS statement, 4810
MAXFUNC= option
  NLOPTIONS statement (CALIS), 614, 616
  PROC CALIS statement, 582
  PROC MIXED statement, 2677
  PROC NLMIXED statement, 3069
MAXFUNCTION= option
  MODEL statement (LOGISTIC), 2313
MAXHEIGHT= option
  PROC TREE statement, 4752
maximum average correlation method
  PRINQUAL procedure, 3643, 3669
maximum likelihood
  algorithms (LOGISTIC), 2336
  algorithms (SURVEYLOGISTIC), 4275
  estimates (LIFEREG), 2083
  estimates (LOGISTIC), 2338
  estimates (SURVEYLOGISTIC), 4277
  estimation (CATMOD), 817, 843, 895
  estimation (GENMOD), 1655
  hierarchical clustering (CLUSTER), 967, 971, 980, 981
  NLMIXED procedure, 3048
maximum likelihood estimation
  mixed model (MIXED), 2738
maximum likelihood factor analysis, 1291, 1311
  with FACTOR procedure, 1297, 1298
maximum method, See complete linkage
MAXIMUM option
  RIDGE statement (RSREG), 4044
maximum total variance method
  PRINQUAL procedure, 3643
MAXIMUM= option
  PROC MI statement, 2519
MAXIT= option
  MODEL statement (GENMOD), 1640
MAXITER= option
  EM statement (MI), 2523
  MCMC statement (MI), 2527
  MODEL statement (CATMOD), 842
  MODEL statement (GAM), 1567
  MODEL statement (LIFEREG), 2099
  MODEL statement (LOGISTIC), 2313
  MODEL statement (PHREG), 3232
  MODEL statement (SURVEYLOGISTIC), 4263
  MODEL statement (TRANSREG), 4576
  NLOPTIONS statement (CALIS), 614, 616
  PROC ACECLUS statement, 405
  PROC CALIS statement, 582
  PROC FACTOR statement, 1311
  PROC FASTCLUS statement, 1393
  PROC MDS statement, 2481
  PROC MIXED statement, 2677
  PROC NLIN statement, 3008
  PROC NLMIXED statement, 3069
  PROC PLS statement, METHOD=PLS option, 3375
  PROC PLS statement, MISSING=EM option, 3376
PROC PRINQUAL statement, 3654 PROC ROBUSTREG statement (ROBUSTREG), 3984, 3987, 3988 PROC VARCLUS statement, 4810 PROC VARCOMP statement, 4835 REPEATED statement (GENMOD), 1648 TABLES statement (FREQ), 1458 MAXITSCORE = option MODEL statement (GAM), 1567 MAXLAGS= option COMPUTE statement (VARIOGRAM), 4867 MAXPANELS= option PLOT statement (BOXPLOT), 504 MAXSEARCH= option PROC VARCLUS statement, 4810 MAXSIZE= option PROC SURVEYSELECT statement, 4433 MAXSTEP option MODEL statement (REG), 3827 MAXSTEP= option MODEL statement (LOGISTIC), 2313 MODEL statement (PHREG), 3230 PROC NLMIXED statement, 3069 PROC STEPDISC statement, 4166 MAXSUBIT= option PROC NLIN statement, 3008 MAXTIME= option EXACT statement (FREQ), 1445 EXACT statement (NPAR1WAY), 3160 NLOPTIONS statement (CALIS), 616 PROC LIFETEST statement, 2162 PROC NLMIXED statement, 3070 SURVIVAL statement (LIFETEST), 2191 MC option EXACT statement (FREQ), 1445 EXACT statement (NPAR1WAY), 3160 MCA option PROC CORRESP statement, 1076 MCA= option, PROC CORRESP statement, 1101 MCAR MI procedure, 2537 MCF PHREG procedure, 3225 MCMC method MI procedure, 2547 MCMC monotone-data imputation MI procedure, 2565 MCMC statement MI procedure, 2523 MCNEM option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1448 McNemar’s test, 1493, 1494 power and sample size (POWER), 3443, 3447, 3448, 3516, 3517 MCONVERGE= option PROC MDS statement, 2481 MCORRB option REPEATED statement (GENMOD), 1648
MCOVB option REPEATED statement (GENMOD), 1648 McQuitty’s similarity analysis CLUSTER procedure, 967 MDATA= option MODEL statement (KRIGE2D), 2043 SIMULATE statement (SIM2D), 4102 MDFFITS MIXED procedure, 2768 MDFFITS for covariance parameters MIXED procedure, 2769 MDPREF analysis PRINQUAL procedure, 3678 MDPREF option PROC PRINQUAL statement, 3654 MDS procedure alternating least squares, 2476 analyzing data in groups, 2485 asymmetric data, 2484 badness of fit, 2479, 2482, 2483, 2489, 2490 conditional data, 2477 configuration, 2471, 2482, 2483, 2489 convergence criterion, 2478, 2480, 2481 coordinates, 2482, 2483, 2488, 2489 data weights, 2487 dimension coefficients, 2471, 2472, 2477, 2482, 2483, 2488, 2489 dissimilarity data, 2471, 2478, 2484 distance data, 2471, 2478, 2484 Euclidean distances, 2472, 2477, 2488 external unfolding, 2471 individual difference models, 2471 INDSCAL model, 2471, 2477 initial values, 2480–2484, 2491 measurement level, 2472, 2481 metric multidimensional scaling, 2471 missing values, 2492 multidimensional scaling, 2471 nonmetric multidimensional scaling, 2471, 2472 normalization of the estimates, 2492 optimal transformations, 2472, 2481 output table names, 2497 partitions, 2477, 2488 plot of configuration, 2504, 2505 plot of dimension coefficients, 2504, 2506 plot of linear fit, 2503 %PLOTIT macro, 2474 plots, 2474 proximity data, 2471, 2478, 2484 residuals, 2483, 2487, 2488, 2491, 2503 similarity data, 2471, 2478, 2484 stress formula, 2479, 2480, 2489 subject weights, 2471, 2477 syntax, 2475 three-way multidimensional scaling, 2471 ties, 2485 transformations, 2472, 2481, 2482, 2484, 2487– 2489 transformed data, 2491
transformed distances, 2491 unfolding, 2471 weighted Euclidean distance, 2472, 2477, 2488 weighted Euclidean model, 2471, 2477 weighted least squares, 2487 MDS procedure, BY statement, 2485 MDS procedure, ID statement, 2486 MDS procedure, INVAR statement, 2486 MDS procedure, MATRIX statement, 2486 MDS procedure, PROC MDS statement, 2476 ALTERNATE= option, 2476 COEF= option, 2477 CONDITION= option, 2477 CONVERGE= option, 2478 CRITMIN= option, 2482 CUTOFF= option, 2478 DATA= option, 2478 DECIMALS= option, 2478 DIMENSION= option, 2478 EPSILON= option, 2479 FIT= option, 2479 FORMULA= option, 2479 GCONVERGE= option, 2480 INAV= option, 2480 INITIAL= option, 2481 ITER= option, 2481 LEVEL= option, 2481 MAXITER= option, 2481 MCONVERGE= option, 2481 MINCRIT= option, 2482 NEGATIVE option, 2482 NONORM option, 2482 NOPHIST option, 2482 NOPRINT option, 2482 NOULB option, 2482 OCOEF option, 2482 OCONFIG option, 2482 OCRIT option, 2482 OTRANS option, 2482 OUT= option, 2482 OUTFIT= option, 2483 OUTITER option, 2482 OUTRES= option, 2483 OVER= option, 2483 PCOEF option, 2483 PCONFIG option, 2483 PDATA option, 2483 PFINAL option, 2483 PFIT option, 2483 PFITROW option, 2483 PINAVDATA option, 2483 PINEIGVAL option, 2483 PINEIGVEC option, 2483 PININ option, 2484 PINIT option, 2484 PITER option, 2484 PTRANS option, 2484 RANDOM= option, 2484 RIDGE= option, 2484
SHAPE= option, 2484 SIMILAR= option, 2484 SINGULAR= option, 2485 UNTIE option, 2485 MDS procedure, VAR statement, 2486 MDS procedure, WEIGHT statement, 2487 MEAN function RESPONSE statement (CATMOD), 853 mean function PHREG procedure, 3224, 3226, 3244, 3245, 3253, 3255 MEAN keyword REPEATED statement (ANOVA), 448 MEAN option MCMC statement (MI), 2524, 2529 REPEATED statement (GLM), 1779, 1780, 1832 TEST statement (MULTTEST), 2955, 2968 mean per element SURVEYMEANS procedure, 4337 mean separation tests, See multiple comparison procedures MEAN statement SIM2D procedure, 4105 mean survival time time limit (LIFETEST), 2164 MEAN= data sets FASTCLUS procedure, 1393, 1394 MEAN= option ONESAMPLEMEANS statement (POWER), 3434 PROC FASTCLUS statement, 1393 MEANDIFF= option PAIREDMEANS statement (POWER), 3450 TWOSAMPLEMEANS statement (POWER), 3467 MEANRATIO= option PAIREDMEANS statement (POWER), 3450 TWOSAMPLEMEANS statement (POWER), 3467 means ANOVA procedure, 440 compared to least-squares means (GLM), 1804 displayed in CLUSTER procedure, 972 GLM procedure, 1763 power and sample size (POWER), 3432, 3448, 3463, 3471, 3526 SURVEYMEANS procedure, 4337 weighted (GLM), 1820 MEANS option OUTPUT statement (TRANSREG), 4590 MEANS procedure, 21 MEANS statement ANOVA procedure, 440 GLM procedure, 1763 means, difference between independent samples, 4775, 4789 paired observations, 4775 measurement level
MDS procedure, 2472, 2481 measures of agreement, 1493 MEASURES option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1448 TABLES statement (FREQ), 1458 TEST statement (FREQ), 1463 MEC option OUTPUT statement (TRANSREG), 4590 median cluster, 1389, 1392 method (CLUSTER), 967, 981 MEDIAN option EXACT statement (NPAR1WAY), 3160 OUTPUT statement (NPAR1WAY), 3162 PROC NPAR1WAY statement, 3157 median residual time LIFETEST procedure, 2186 median scores NPAR1WAY procedure, 3167 Medical Expenditure Panel Survey (MEPS) SURVEYLOGISTIC procedure, 4302 Mehta and Patel, network algorithm, 1509, 3172 memory requirements ACECLUS procedure, 410 CLUSTER procedure, 986 FACTOR procedure, 1335 FASTCLUS procedure, 1402 MIXED procedure, 2775 reduction of (ANOVA), 434 reduction of (GLM), 1747 VARCLUS procedure, 4818 memory usage SIM2D procedure, 4110 Merle-Spath algorithm FASTCLUS procedure, 1392 METHOD= < ( options )> PROC ROBUSTREG statement, 3984 METHOD= option BASELINE statement (PHREG), 3226 MODEL statement (GAM), 1568 MODEL statement (TRANSREG), 4576 ONESAMPLEFREQ statement (POWER), 3430 OUTPUT statement (PHREG), 3235 PAIREDFREQ statement (POWER), 3445 PROC ACECLUS statement, 405 PROC CALIS statement, 574 PROC DISCRIM statement, 1148 PROC DISTANCE statement, 1257 PROC FACTOR statement, 1311 PROC LIFETEST statement, 2162 PROC MIXED statement, 2677, 2783 PROC MODECLUS statement, 2867 PROC NLIN statement, 3008 PROC NLMIXED statement, 3070 PROC PLS statement, 3375 PROC PRINQUAL statement, 3654 PROC STDIZE statement, 4131 PROC STEPDISC statement, 4166
PROC SURVEYSELECT statement, 4434 PROC VARCOMP statement, 4835 UNIVAR statement, 1999 METHOD= specification PROC CLUSTER statement, 966 methods of estimation VARCOMP procedure, 4831, 4842 metric multidimensional scaling MDS procedure, 2471 METRIC= option PROC ACECLUS statement, 405 PROC DISCRIM statement, 1149 MGV method PRINQUAL procedure, 3643 MHCHI option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1448 MHOR option OUTPUT statement (FREQ), 1448 MHRRC1 option OUTPUT statement (FREQ), 1448 MHRRC2 option OUTPUT statement (FREQ), 1448 MI procedure adjusted degrees of freedom, 2562 analyst’s model, 2563 approximate Bayesian bootstrap, 2543 arbitrary missing pattern, 2539 autocorrelation function plot, 2557 Bayes’ theorem, 2547 Bayesian inference, 2547 between-imputation variance, 2561 bootstrap, 2527 combining inferences, 2561 converge in EM algorithm, 2522 convergence in EM algorithm, 2527 convergence in MCMC, 2555, 2566 degrees of freedom, 2561 discriminant function method, 2544 EM algorithm, 2536, 2566 fraction of missing information, 2562 imputation methods, 2539 imputation model, 2565 imputer’s model, 2563 input data set, 2519, 2526, 2558 introductory example, 2513 logistic regression method, 2546 LR statistics, 2555 MAR, 2537, 2564 MCAR, 2537 MCMC method, 2547 MCMC monotone-data imputation, 2565 missing at random, 2537, 2564 monotone missing pattern, 2538 multiple imputation efficiency, 2562 multivariate normality assumption, 2565 number of imputations, 2565 ODS graph names, 2567 ODS table names, 2566
output data sets, 2520, 2528, 2559 output parameter estimates, 2528 parameter simulation, 2564 predictive mean matching method, 2542 producing monotone missingness, 2552 propensity score method, 2543, 2565 random number generators, 2520 regression method, 2541, 2565 relative efficiency, 2562 relative increase in variance, 2562 saving graphics output, 2526 singularity, 2521 Summary of Issues in Multiple Imputation, 2564 suppressing output, 2520 syntax, 2517 time-series plot, 2556 total variance, 2561 transformation, 2533 within-imputation variance, 2561 worst linear function of parameters, 2556 MI procedure, BY statement, 2521 MI procedure, CLASS statement, 2522 MI procedure, EM statement, 2522 CONVERGE option, 2522 INITIAL= option, 2522 ITPRINT option, 2523 MAXITER= option, 2523 OUT= option, 2523 OUTEM= option, 2523 OUTITER= option, 2523 output data sets, 2523 XCONV option, 2522 MI procedure, FREQ statement, 2523 MI procedure, MCMC statement, 2523 ACFPLOT option, 2524, 2567 BOOTSTRAP option, 2527 CCONF= option, 2525 CCONNECT= option, 2529 CFRAME= option, 2525, 2529 CHAIN= option, 2526 CNEEDLES= option, 2525 CONVERGE= option, 2527 COV option, 2524, 2529 CREF= option, 2525 CSYMBOL= option, 2525, 2529 DISPLAYINIT option, 2526 GOUT= option, 2526 HSYMBOL= option, 2525, 2529 IMPUTE= option, 2526 INEST= option, 2526 INITIAL= option, 2527 ITPRINT option, 2527 LCONF= option, 2525 LCONNECT= option, 2529 LOG option, 2525, 2529 LREF= option, 2525 MAXITER= option, 2527 MEAN option, 2524, 2529 NBITER= option, 2528
NITER= option, 2528 NLAG= option, 2525 OUTEST= option, 2528 OUTITER= option, 2528 PRIOR= option, 2528 START= option, 2528 SYMBOL= option, 2525, 2529 TIMEPLOT option, 2529, 2567 TITLE= option, 2525, 2529 WCONF= option, 2526 WCONNECT= option, 2530 WLF option, 2524, 2529, 2530 WNEEDLES= option, 2526 WREF= option, 2526 XCONV= option, 2527 MI procedure, MONOTONE statement, 2530 DISCRIM option, 2531 LOGISTIC option, 2532 PROPENSITY option, 2533 REG option, 2532 REGPMM option, 2532 REGPREDMEANMATCH option, 2532 REGRESSION option, 2532 MI procedure, PROC MI statement, 2518 ALPHA= option, 2519 DATA= option, 2519 MAXIMUM= option, 2519 MINIMUM= option, 2519 MINMAXITER= option, 2519 MU0= option, 2519 NIMPUTE= option, 2520 NOPRINT option, 2520 OUT= option, 2520 ROUND= option, 2520 SEED option, 2520 SIMPLE, 2521 SINGULAR option, 2521 THETA0= option, 2519 MI procedure, TRANSFORM statement, 2533 BOXCOX transformation, 2533 C= option, 2534 EXP transformation, 2534 LAMBDA= option, 2534 LOG transformation, 2534 LOGIT transformation, 2534 POWER transformation, 2534 MI procedure, VAR statement, 2534 MIANALYZE procedure adjusted degrees of freedom, 2625 average relative increase in variance, 2627 between-imputation covariance matrix, 2626 between-imputation variance, 2624 combining inferences, 2624 degrees of freedom, 2625, 2627 fraction of missing information, 2625 input data sets, 2620 introductory example, 2610 multiple imputation efficiency, 2626 multivariate inferences, 2626
ODS table names, 2631 relative efficiency, 2626 relative increase in variance, 2625 syntax, 2613 testing linear hypotheses, 2618, 2628 total covariance matrix, 2627 total variance, 2625 within-imputation covariance matrix, 2626 within-imputation variance, 2624 MIANALYZE procedure, BY statement, 2616 MIANALYZE procedure, CLASS statement, 2617 MIANALYZE procedure, MODELEFFECTS statement, 2617 MIANALYZE procedure, PROC MIANALYZE statement, 2614 ALPHA= option, 2614 BCOV option, 2614 CLASSVAR= option, 2615 COVB= option, 2614 DATA= option, 2615 EDF= option, 2615 EFFECTVAR= option, 2614 MU0= option, 2616 MULT option, 2615 PARMINFO= option, 2615 PARMS= option, 2615 TCOV option, 2616 THETA0= option, 2616 WCOV option, 2616 XPXI= option, 2616 MIANALYZE procedure, STDERR statement, 2617 MIANALYZE procedure, TEST statement, 2618 BCOV option, 2619 MULT option, 2619 TCOV option, 2619 WCOV option, 2619 MIDPFACTOR= option EXACT statement (LOGISTIC), 2302 MIN= option PLOT statement (GLMPOWER), 1943 PLOT statement (POWER), 3484 MINC= option PROC VARCLUS statement, 4810 MINCLUSTERS= option PROC VARCLUS statement, 4810 MINCRIT= option PROC MDS statement, 2482 MINEIGEN= option PROC FACTOR statement, 1312 MINHEIGHT= option PROC TREE statement, 4752 minimum generalized variance method PRINQUAL procedure, 3643 MINIMUM option RIDGE statement (RSREG), 4044 MINIMUM= option PROC MI statement, 2519 MININERTIA= option PROC CORRESP statement, 1077
MINITER= option NLOPTIONS statement (CALIS), 616 PROC NLMIXED statement, 3070 Minkowski metric STDIZE procedure, 4138 Minkowski L(p) distance coefficient DISTANCE procedure, 1272 MINMAXITER= option PROC MI statement, 2519 MINPOINTS= option PREDICT statement (KRIGE2D), 2041 MINSIZE= option PROC SURVEYSELECT statement, 4436 misclassification probabilities discriminant analysis, 1140 MISS= option MODEL statement (CATMOD), 844 MISSBREAK option PLOT statement (BOXPLOT), 504 missing at random MI procedure, 2537, 2564 missing level combinations MIXED procedure, 2748 MISSING option CLASS statement (GENMOD), 1629 CLASS statement (LOGISTIC), 2296 CLASS statement (TPHREG), 4477 PROC CORRESP statement, 1077 PROC LIFETEST statement, 2163 PROC NPAR1WAY statement, 3157 PROC SURVEYFREQ statement, 4193 PROC SURVEYLOGISTIC statement, 4251 PROC SURVEYMEANS statement, 4323 STRATA statement (LIFETEST), 2167 STRATA statement (PHREG), 3237 TABLES statement (FREQ), 1458 missing stratum values LIFETEST procedure, 2166, 2167 missing values ACECLUS procedure, 409 and interactivity (GLM), 1787 CANCORR procedure, 765 character (PRINQUAL), 3662 CLUSTER procedure, 987 DISTANCE procedure, 1261, 1262, 1267, 1276 FASTCLUS procedure, 1380, 1381, 1391, 1393, 1397 LIFEREG procedure, 2108 LIFETEST procedure, 2171 LOGISTIC procedure, 2329 MDS procedure, 2492 MODECLUS procedure, 2883 MULTTEST procedure, 2960 NPAR1WAY procedure, 3162 PHREG procedure, 3228, 3236, 3286 PRINCOMP procedure, 3608 PRINQUAL procedure, 3655, 3667, 3674 PROBIT procedure, 3755 SCORE procedure, 4074
STDIZE procedure, 4131, 4133 strata variables (PHREG), 3237 SURVEYFREQ procedure, 4205 SURVEYLOGISTIC procedure, 4251, 4268 SURVEYMEANS procedure, 4323, 4333, 4358 SURVEYREG procedure, 4382 SURVEYSELECT procedure, 4445 TRANSREG procedure, 4578, 4599, 4600, 4605 TREE procedure, 4756 VARCOMP procedure, 4837 MISSING= option MODEL statement (CATMOD), 844 PROC PLS statement, 3376 PROC STDIZE statement, 4131 VAR statement, 1267 MISSPRINT option TABLES statement (FREQ), 1458 mixed model unbalanced (GLM), 1882 VARCOMP procedure, 4837 mixed model (MIXED), See also MIXED procedure estimation, 2737 formulation, 2732 hypothesis tests, 2742, 2751 inference, 2741 inference space, 2681, 2682, 2685, 2781 least-squares means, 2687 likelihood ratio test, 2741, 2743 linear model, 2661 maximum likelihood estimation, 2738 notation, 2663 objective function, 2749 parameterization, 2743 predicted values, 2687 restricted maximum likelihood, 2779 theory, 2731 Wald test, 2741, 2786 mixed model equations example (MIXED), 2796 MIXED procedure, 2678, 2739 MIXED Procedure Leverage, 2767 PRESS Residual, 2766 PRESS Statistic, 2766 MIXED procedure, 2672, See also mixed model 2D geometric anisotropic structure, 2721 Akaike's information criterion, 2676, 2740, 2750 Akaike's information criterion (finite sample corrected version), 2676, 2750 ante-dependence structure, 2721 ARIMA procedure, compared, 2665 ARMA structure, 2721 assumptions, 2661 asymptotic covariance, 2674 at sign (@) operator, 2745, 2819 AUTOREG procedure, compared, 2665
autoregressive structure, 2721, 2788 banded Toeplitz structure, 2721 bar (|) operator, 2743, 2745, 2819 basic features, 2662 Bayesian analysis, 2708 between-within method, 2693 BLUE, 2740 BLUP, 2740, 2809 Bonferroni adjustment, 2688 boundary constraints, 2707, 2708, 2773 Box plots, 2762 BYLEVEL processing of LSMEANS, 2689 CALIS procedure, compared, 2665 Cholesky root, 2704, 2764, 2772 class level, 2678 classification variables, 2681 compound symmetry structure, 2721, 2733, 2789, 2794 computational details, 2772 computational order, 2773 Conditional residuals, 2764 confidence interval, 2686, 2713 confidence limits, 2674, 2685, 2689, 2692, 2713 containment method, 2693 continuous effects, 2714, 2715, 2717, 2721 continuous-by-class effects, 2746 continuous-nesting-class effects, 2745 contrasted SAS procedures, 1735, 1833, 1885, 2664, 2665, 2985 contrasts, 2681, 2685 convergence criterion, 2674, 2675, 2749, 2775 convergence problems, 2774 Cook’s D, 2768 Cook’s D for covariance parameters, 2768 correlation estimates, 2713, 2716, 2720, 2791 correlations of least-squares means, 2689 covariance parameter estimates, 2674, 2676, 2750 covariance parameter estimates, ratio, 2680 covariance parameters, 2661 covariance structures, 2664, 2721, 2723, 2782 covariances of least-squares means, 2689 covariate values for LSMEANS, 2688 covariates, 2743 COVRATIO, 2770 COVRATIO for covariance parameters, 2770 COVTRACE, 2770 COVTRACE for covariance parameters, 2770 CPU requirements, 2776 crossed effects, 2744 degrees of freedom, 2682, 2684–2687, 2690, 2693, 2705, 2742, 2747, 2751, 2774, 2809 DFFITS, 2768 dimensions, 2677, 2678 direct product structure, 2721 Dunnett’s adjustment, 2688 EBLUPs, 2715, 2740, 2801, 2817 effect name length, 2678 empirical best linear unbiased prediction, 2703
empirical sandwich estimator, 2676 estimability, 2682–2684, 2686, 2687, 2691, 2704, 2705, 2742, 2748 estimable functions, 2702 estimation methods, 2677 factor analytic structures, 2721 Fisher information matrix, 2750, 2796 Fisher’s scoring method, 2674, 2680, 2774 fitting information, 2750, 2751 fixed effects, 2663 fixed-effects parameters, 2661, 2705, 2732 fixed-effects variance matrix, 2705 function evaluations, 2677 G matrix, 2712 general linear structure, 2721 generalized inverse, 2684, 2740 gradient, 2675, 2749 grid search, 2706, 2796 growth curve analysis, 2733 Hannan-Quinn information criterion, 2676 Hessian matrix, 2674, 2675, 2680, 2707, 2749, 2750, 2774, 2775, 2786, 2796 heterogeneity, 2714, 2717, 2792 heterogeneous AR(1) structure, 2721 heterogeneous compound-symmetry structure, 2721 heterogeneous covariance structures, 2730 heterogeneous Toeplitz structure, 2721 hierarchical model, 2810 Hotelling-Lawley-McKeon statistic, 2717 Hotelling-Lawley-Pillai-Sampson statistic, 2718 Hsu’s adjustment, 2688 Huynh-Feldt structure, 2721 infinite likelihood, 2717, 2773, 2775 Influence diagnostics details, 2765 INFLUENCE option, 2696 Influence plots, 2760 information criteria, 2676 initial values, 2706 input data sets, 2676 interaction effects, 2744 intercept, 2743 intercept effect, 2703, 2713 intraclass correlation coefficient, 2791 introductory example, 2665 iterations, 2677, 2749 Kenward-Roger method, 2695 Kronecker product structure, 2721 L matrices, 2681, 2687, 2742 LATTICE procedure, compared, 2665 least-square means, 2796 least-squares means, 2690, 2819 Likelihood distance, 2770 likelihood ratio test, 2751 linear structure, 2721 log-linear variance model, 2718 main effects, 2744 MAKE statement in Version 6, 2757 Marginal residuals, 2764
Matern covariance structure, 2721 matrix notation, 2731 MDFFITS, 2768 MDFFITS for covariance parameters, 2769 memory requirements, 2775 missing level combinations, 2748 mixed linear model, 2661 mixed model, 2732 mixed model equations, 2678, 2739, 2796 mixed model theory, 2731 model information, 2678 model selection, 2740 multilevel model, 2810 multiple comparisons of least-squares means, 2687, 2690 multiple tables, 2754 multivariate tests, 2717 nested effects, 2745 nested error structure, 2814 NESTED procedure, compared, 2665 Newton-Raphson algorithm, 2738 non-full-rank parameterization, 2664, 2718, 2747 nonstandard weights for LSMEANS, 2690 nugget effect, 2718 Oblique projector, 2767 observed margins for LSMEANS, 2690 ODS graph names, 2762 ODS Graphics, 2757 ODS table names, 2752 ordering of effects, 2679, 2746 over-parameterization, 2744 parameter constraints, 2707, 2773 parameterization, 2743 Pearson Residual, 2704 pharmaceutical stability, example, 2810 Plots of leave-one-out-estimates, 2760 plotting the likelihood, 2801 polynomial effects, 2743 power-of-the-mean model, 2718 predicted means, 2704 predicted value confidence intervals, 2692 predicted values, 2703, 2796 prior density, 2709 profiling residual variance, 2679, 2708, 2718, 2738, 2772 R matrix, 2716, 2720 random coefficients, 2788, 2810 random effects, 2663, 2712 random-effects parameter, 2715 random-effects parameters, 2662, 2732 regression effects, 2743 rejection sampling, 2711 repeated measures, 2662, 2716, 2782 Residual diagnostics details, 2763 residual method, 2694 Residual plots, 2758 residual variance tolerance, 2705 restricted maximum likelihood (REML), 2662
Syntax Index ridging, 2680, 2738 sandwich estimator, 2676 Satterthwaite method, 2694 Scaled Residual, 2705 Scaled residuals, 2764 Schwarz’s Bayesian information criterion, 2676, 2740, 2750 scoring, 2674, 2680, 2774 Sidak’s adjustment, 2688 simple effects, 2691 simulation-based adjustment, 2688 singularities, 2775 spatial anisotropic exponential structure, 2721 spatial covariance structures, 2722, 2723, 2730, 2774 split-plot design, 2734, 2777 standard linear model, 2663 statement positions, 2672 Studentized Residual, 2704 Studentized residuals, 2768 subject effect, 2683, 2715, 2721, 2776, 2782 summary of commands, 2672 sweep operator, 2768, 2772 syntax, 2672 table names, 2752 test components, 2702 Toeplitz structure, 2721, 2819 TSCSREG procedure, compared, 2665 Tukey’s adjustment, 2688 Type 1 estimation, 2677 Type 2 estimation, 2677 Type 3 estimation, 2677 Type I testing, 2696 Type II testing, 2696 Type III testing, 2696, 2751 unstructured R matrix, 2720 unstructured correlations, 2721 unstructured covariance matrix, 2721 VARCOMP procedure, example, 2795 variance components, 2662, 2721 variance ratios, 2707, 2714 Wald test, 2750, 2751 weighted LSMEANS, 2690 weighting, 2730 zero design columns, 2696 zero variance component estimates, 2774 MIXED procedure, BY statement, 2680 MIXED procedure, CLASS statement, 2681, 2748 TRUNCATE option, 2681 MIXED procedure, CONTRAST statement, 2681 CHISQ option, 2684 DF= option, 2684 E option, 2684 GROUP option, 2684 SINGULAR= option, 2684 SUBJECT option, 2685 MIXED procedure, ESTIMATE statement, 2685 ALPHA= option, 2685 CL option, 2685
DF= option, 2686 DIVISOR= option, 2686 E option, 2686 GROUP option, 2686 LOWERTAILED option, 2686 SINGULAR= option, 2686 SUBJECT option, 2686 UPPERTAILED option, 2686 MIXED procedure, ID statement, 2687 MIXED procedure, LSMEANS statement, 2687, 2796 ADJUST= option, 2687 ALPHA= option, 2688 AT MEANS option, 2688 AT option, 2688, 2689 BYLEVEL option, 2689, 2691 CL option, 2689 CORR option, 2689 COV option, 2689 DF= option, 2690 DIFF option, 2690 E option, 2690 OBSMARGINS option, 2690 PDIFF option, 2690, 2691 SINGULAR= option, 2691 SLICE= option, 2691 MIXED procedure, MAKE statement, 2757 MIXED procedure, MODEL statement, 2692 ALPHAP= option, 2692 CHISQ option, 2692 CL option, 2692 CONTAIN option, 2692, 2693 CORRB option, 2692 COVB option, 2692 COVBI option, 2693 DDF= option, 2693 DDFM= option, 2693 E option, 2695 E1 option, 2695 E2 option, 2695 E3 option, 2696 FULLX option, 2689, 2696 HTYPE= option, 2696 Influence diagnostics, 2700 INFLUENCE option, 2696 INTERCEPT option, 2702 LCOMPONENTS option, 2702 NOCONTAIN option, 2703 NOINT option, 2703, 2743 NOTEST option, 2703 ORDER= option, 2747 OUTP= option, 2796 OUTPRED= option, 2703 OUTPREDM= option, 2704 RESIDUAL option, 2704, 2764 SINGCHOL= option, 2704 SINGRES= option, 2705 SINGULAR= option, 2704 SOLUTION option, 2705, 2747 VCIRY option, 2705, 2764
XPVIX option, 2705 XPVIXI option, 2705 ZETA= option, 2705 MIXED procedure, MODEL statement, INFLUENCE option EFFECT=, 2697 ESTIMATES, 2697 ITER=, 2698 KEEP=, 2699 SELECT=, 2699 SIZE=, 2699 MIXED procedure, PARMS statement, 2706, 2796 EQCONS= option, 2707 HOLD= option, 2707 LOGDETH option, 2707 LOWERB= option, 2707 NOBOUND option, 2707 NOITER option, 2708 NOPROFILE option, 2708 OLS option, 2708 PARMSDATA= option, 2708 PDATA= option, 2708 RATIOS option, 2708 UPPERB= option, 2708 MIXED procedure, PRIOR statement, 2708 ALG= option, 2710 BDATA= option, 2710 DATA= option, 2710 FLAT option, 2710 GRID= option, 2710 GRIDT= option, 2711 IFACTOR= option, 2711 JEFFREYS option, 2710 LOGNOTE= option, 2711 LOGRBOUND= option, 2711 NSAMPLE= option, 2711 NSEARCH= option, 2711 OUT= option, 2711 OUTG= option, 2711 OUTGT= option, 2711 PSEARCH option, 2711 PTRANS option, 2712 SEED= option, 2712 SFACTOR= option, 2712 TDATA= option, 2712 TRANS= option, 2712 UPDATE= option, 2712 MIXED procedure, PROC MIXED statement, 2674 ABSOLUTE option, 2674, 2749 ALPHA= option, 2674 ASYCORR option, 2674 ASYCOV option, 2674, 2796 CL= option, 2674 CONVF option, 2675, 2749 CONVG option, 2675, 2749 CONVH option, 2675, 2749 COVTEST option, 2676, 2750 DATA= option, 2676 DFBW option, 2676
IC option, 2676 INFO option, 2677 ITDETAILS option, 2677 LOGNOTE option, 2677 MAXFUNC= option, 2677 MAXITER= option, 2677 METHOD= option, 2677, 2783 MMEQ option, 2678, 2796 MMEQSOL option, 2678, 2796 NAMELEN= option, 2678 NOBOUND option, 2678 NOCLPRINT option, 2678 NOINFO option, 2678 NOITPRINT option, 2679 NOPROFILE option, 2679, 2738 ORD option, 2679 ORDER= option, 2679, 2744 RATIO option, 2680, 2750 RIDGE= option, 2680 SCORING= option, 2680 SIGITER option, 2680 UPDATE option, 2680 MIXED procedure, RANDOM statement, 2664, 2712, 2777 ALPHA= option, 2713 CL option, 2713 G option, 2713 GC option, 2713 GCI option, 2713 GCORR option, 2713 GDATA= option, 2713 GI option, 2714 GROUP= option, 2714 LDATA= option, 2714 NOFULLZ option, 2714 RATIOS option, 2714 SOLUTION option, 2715 SUBJECT= option, 2683, 2715 TYPE= option, 2715 V option, 2716 VC option, 2716 VCI option, 2716 VCORR option, 2716 VI option, 2716 MIXED procedure, REPEATED statement, 2664, 2716, 2783 GROUP= option, 2717 HLM option, 2717 HLPS option, 2718 LDATA= option, 2718 LOCAL= option, 2718 LOCALW option, 2719 NONLOCALW option, 2720 R option, 2720 RC option, 2720 RCI option, 2720 RCORR option, 2720 RI option, 2720 SSCP option, 2720
SUBJECT= option, 2721 TYPE= option, 2721 MIXED procedure, WEIGHT statement, 2730 ML factor analysis and computer time, 1297 and confidence intervals, 1294, 1297, 1327 and multivariate normal distribution, 1297 and standard errors, 1297 ML option MODEL statement (CATMOD), 843 MMEQ option PROC MIXED statement, 2678, 2796 MMEQSOL option PROC MIXED statement, 2678, 2796 MNAMES= option MANOVA statement (ANOVA), 438 MANOVA statement (GLM), 1760 modal clusters density estimation (CLUSTER), 970 modal region, definition, 2878 MODE= option PROC CLUSTER statement, 970 PROC MODECLUS statement, 2867 MODECLUS procedure analyzing data in groups, 2857, 2874 cascaded density estimates, 2873 clustering methods, 2856, 2874 clusters, definition, 2878 clusters, plotting, 2878 compared with other procedures, 2856 cross validated density estimates, 2872 density estimation, 2870 example using GPLOT procedure, 2916, 2923 example using TRACE option, 2927 example using TRANSPOSE procedure, 2912 fixed-radius kernels, 2870 functional summary, 2862 Hertzsprung-Russell Plot, example, 2923 JOIN option, discussion, 2880 modal region, 2878 neighborhood distribution function (NDF), definition, 2878 nonparametric clustering methods, 2855 output data sets, 2883 p-value computation, 2877 plotting samples from univariate distributions, 2889 population clusters, risks of estimating, 2877 saddle test, definition, 2879 scaling variables, 2856 significance tests, 2916 standardizing, 2856 summary of options, 2862 syntax, 2862 variable-radius kernels, 2870 MODECLUS procedure, BY statement, 2869 MODECLUS procedure, FREQ statement, 2870 MODECLUS procedure, ID statement, 2870
MODECLUS procedure, PROC MODECLUS statement, 2862 ALL option, 2865 AM option, 2865 BOUNDARY option, 2865 CASCADE= option, 2865 CK= option, 2865 CLUSTER= option, 2865 CORE option, 2865 CR= option, 2865 CROSS option, 2865 CROSSLIST option, 2865 DATA= option, 2865 DENSITY= option, 2866 DIMENSION= option, 2866 DK= option, 2866 DOCK= option, 2866 DR= option, 2866 EARLY option, 2866 HM option, 2866 JOIN= option, 2866 K= option, 2867 LIST option, 2867 LOCAL option, 2867 MAXCLUSTERS= option, 2867 METHOD= option, 2867 MODE= option, 2867 NEIGHBOR option, 2868 NOPRINT option, 2868 NOSUMMARY option, 2868 OUT= option, 2868 OUTCLUS= option, 2868 OUTLENGTH= option, 2868 OUTSUM= option, 2868 POWER= option, 2868 R= option, 2869 SHORT option, 2869 SIMPLE option, 2869 STANDARD option, 2869 SUM option, 2869 TEST option, 2869 THRESHOLD= option, 2869 TRACE option, 2869 MODECLUS procedure, VAR statement, 2870 model fit summary (REG), 3896 fitting criteria (LOGISTIC), 2341 fitting criteria (SURVEYLOGISTIC), 4279 hierarchy (LOGISTIC), 2283, 2310 hierarchy (TPHREG), 4473, 4475 information (MIXED), 2678 parameterization (GLM), 1787 specification (ANOVA), 451 specification (GLM), 1784 specification (NLMIXED, 3077 model assessment, 1718, 1725 PHREG procedure, 3223, 3265, 3271, 3318 model checking, 1718, 1725 model selection
entry (PHREG), 3230 examples (REG), 3924 LOGISTIC procedure, 2306, 2317, 2340 MIXED procedure, 2740 PHREG procedure, 3216, 3229, 3264 REG procedure, 3800, 3873, 3876, 3877 removal (PHREG), 3230 MODEL statement ANOVA procedure, 445 CATMOD procedure, 840 GAM procedure, 1566 GENMOD procedure, 1636 GLM procedure, 1770 GLMMOD procedure, 1917 GLMPOWER procedure, 1938 KRIGE2D procedure, 2042 LIFEREG procedure, 2094 LOESS procedure, 2231 LOGISTIC procedure, 2304 MIXED procedure, 2692 NLIN procedure, 3012 NLMIXED procedure, 3077 ORTHOREG procedure, 3203 PHREG procedure, 3227 PLS procedure, 3378 REG procedure, 3821 ROBUSTREG procedure, 3989 RSREG procedure, 4041 SURVEYLOGISTIC procedure, 4258 SURVEYREG procedure, 4379 TPHREG procedure, 4474 TPSPLINE procedure, 4508 TRANSREG procedure, 4557 VARCOMP procedure, 4836 MODEL= option MULTREG statement (POWER), 3422 ONECORR statement (POWER), 3427 MODELEFFECTS statement MIANALYZE procedure, 2617 MODELFONT option PLOT statement (REG), 3846 MODELHT option PLOT statement (REG), 3847 MODELLAB option PLOT statement (REG), 3847 MODELSE option REPEATED statement (GENMOD), 1648 modification indices CALIS procedure, 576, 649, 673 constraints (CALIS), 584 displaying (CALIS), 687 Lagrange multiplier test (CALIS), 584, 673, 674 Wald test (CALIS), 584, 674 MODIFICATION option PROC CALIS statement, 584 modified Peto-Peto test for homogeneity LIFETEST procedure, 2150, 2168 modified ridit scores, 1469 Modifiers of INFLUENCE option
MODEL statement (MIXED), 2696 monoecious population analysis example (INBREED), 1985 monotone regression function (TRANSREG), 4629 transformations (TRANSREG), 4593 monotone missing pattern MI procedure, 2538 MONOTONE statement MI procedure, 2530 MONOTONE transformation MODEL statement (TRANSREG), 4563 TRANSFORM statement (PRINQUAL), 3662 MONOTONE= option MODEL statement (TRANSREG), 4577 PROC PRINQUAL statement, 3654 monotonic transformation (PRINQUAL), 3662, 3663 transformation (TRANSREG), 4563, 4564, 4610 transformation, B-spline (PRINQUAL), 3662 transformation, B-spline (TRANSREG), 4563, 4611 Monte Carlo estimation FREQ procedure, 1443, 1445, 1512 NPAR1WAY procedure, 3174 MOOD option EXACT statement (NPAR1WAY), 3160 OUTPUT statement (NPAR1WAY), 3162 PROC NPAR1WAY statement, 3157 Mood scores NPAR1WAY procedure, 3168 MORALS method TRANSREG procedure, 4576 mortality test MULTTEST procedure, 2952, 2972 MPAIRS= option PROC ACECLUS statement, 405 MPC option OUTPUT statement (TRANSREG), 4590 MQC option OUTPUT statement (TRANSREG), 4590 MRC option OUTPUT statement (TRANSREG), 4590 MREDUNDANCY option OUTPUT statement (TRANSREG), 4590 MSA option PROC FACTOR statement, 1312 MSE SURVEYREG procedure, 4388 MSE option MODEL statement (REG), 3827 PLOT statement (REG), 3847 MSINGULAR= option NLOPTIONS statement (CALIS), 614 PROC CALIS statement, 590 PROC NLMIXED statement, 3070 MSPLINE transformation MODEL statement (TRANSREG), 4563
Syntax Index TRANSFORM statement (PRINQUAL), 3662 MSTAT= option MANOVA statement (ANOVA), 438 MANOVA statement (GLM), 1761 REPEATED statement (ANOVA), 448 REPEATED statement (GLM), 1780 MTEST statement REG procedure, 3832 MTV method PRINQUAL procedure, 3643 MU0= option PROC MI statement, 2519 PROC MIANALYZE statement, 2616 MULT option PROC MIANALYZE statement, 2615 TEST statement (MIANALYZE), 2619 MULT= option PROC DISTANCE statement, 1261 PROC STDIZE statement, 4131 multicollinearity REG procedure, 3895 multidimensional preference analysis PRINQUAL procedure, 3678, 3688 multidimensional scaling MDS procedure, 2471 metric (MDS), 2471 nonmetric (MDS), 2471, 2472 three-way (MDS), 2471 multilevel model example (MIXED), 2810 multilevel response, 3759 multinomial distribution (GENMOD), 1653 models (GENMOD), 1671 MULTIPASS option PROC ANOVA statement, 433 PROC GLM statement, 1746 PROC PHREG statement, 3222 multiple classifications cutpoints (LOGISTIC), 2315 multiple comparison procedures, 1806, See also multiple comparisons of means See also multiple comparisons of least-squares means GLM procedure, 1763 multiple-stage tests, 1814 pairwise (GLM), 1807 recommendations, 1816 with a control (GLM), 1807, 1812 multiple comparisons adjustment (MIXED) least-squares means, 2687 multiple comparisons of least-squares means, See also multiple comparison procedures GLM procedure, 1754, 1757, 1808 interpretation, 1816 MIXED procedure, 2687, 2690 multiple comparisons of means, See also multiple comparison procedures ANOVA procedure, 440
Bonferroni t test, 441, 1765 Duncan’s multiple range test, 442, 1766 Dunnett’s test, 442, 1766, 1767 error mean square, 443, 1767 examples, 1847 Fisher’s LSD test, 445, 1769 Gabriel’s procedure, 443, 1767 GLM procedure, 1806, 1808 GT2 method, 444, 1769 interpretation, 1816 Ryan-Einot-Gabriel-Welsch test, 444, 1768 Scheffé’s procedure, 444, 1769 Sidak’s adjustment, 444, 1769 SMM, 444, 1769 Student-Newman-Keuls test, 444, 1769 Tukey’s studentized range test, 445, 1769 Waller-Duncan method, 443 Waller-Duncan test, 445, 1769 multiple correspondence analysis (MCA) CORRESP procedure, 1076, 1101, 1123 multiple destinations examples, ODS Graphics, 360 multiple imputation efficiency MI procedure, 2562 MIANALYZE procedure, 2626 multiple imputations analysis, 2511, 2609 multiple R-square SURVEYREG procedure, 4387 multiple redundancy coefficients TRANSREG procedure, 4590 multiple regression TRANSREG procedure, 4593 multiple tables MIXED procedure, 2754 multiple-stage tests, 1814, See multiple comparison procedures MULTIPLEGROUP option PROC VARCLUS statement, 4811 multiplicative hazards model, See Andersen-Gill model multistage sampling, 165 multivariate analysis of variance, 433, 436 CANDISC procedure, 786 examples (GLM), 1868 GLM procedure, 1745, 1759, 1823 hypothesis tests (GLM), 1824 partial correlations, 1824 multivariate general linear hypothesis, 1824 multivariate inferences MIANALYZE procedure, 2626 multivariate multiple regression TRANSREG procedure, 4593 multivariate normality assumption MI procedure, 2565 multivariate tests MIXED procedure, 2717 REG procedure, 3910 repeated measures, 1828 multiway tables
SURVEYFREQ procedure, 4227 MULTREG statement POWER procedure, 3421 MULTTEST procedure, 2939 adjusted p-value, 2935, 2956 Bonferroni adjustment, 2939, 2956 bootstrap adjustment, 2938, 2939, 2957 Cochran-Armitage test, 2946, 2948, 2951, 2964 computational resources, 2960 convolution distribution, 2950 displayed output, 2963 double arcsine test, 2951 expected trend, 2951 false discovery rate adjustment, 2959 fast Fourier transform, 2950 Fisher combination adjustment, 2959 Fisher exact test, 2944, 2946, 2954 Freeman-Tukey test, 2946, 2951, 2968 Hochberg adjustment, 2959 Hommel adjustment, 2959 introductory example, 2936 linear trend test, 2949 missing values, 2960 ODS table names, 2963 output data sets, 2961 p-value adjustments, 2935, 2956 permutation adjustment, 2942, 2957, 2975 Peto test, 2946, 2952, 2972 resampled data sets, 2962 Sidak’s adjustment, 2942, 2956 statistical tests, 2948 stepdown methods, 2957 strata weights, 2951 syntax, 2939 t test, 2946, 2955, 2968 MULTTEST procedure, BY statement, 2943 MULTTEST procedure, CLASS statement, 2943 TRUNCATE option, 2944 MULTTEST procedure, CONTRAST statement, 2944 MULTTEST procedure, FREQ statement, 2945 MULTTEST procedure, PROC MULTTEST statement, 2939 BONFERRONI option, 2939, 2956 BOOTSTRAP option, 2937, 2939, 2957, 2968 CENTER option, 2939 DATA= option, 2940 FDR option, 2940, 2959 FISHER– C option, 2940, 2959 HOC option, 2940, 2959 HOLM option, 2942 HOM option, 2940 HOMMEL option, 2959 NOCENTER option, 2940 NOPRINT option, 2940 NOTABLES option, 2940 NOZEROS option, 2940 NSAMPLE= option, 2940 ORDER= option, 2940, 2975 OUT= option, 2941, 2961
OUTPERM= option, 2941, 2962, 2964 OUTSAMP= option, 2941, 2962, 2968 PDATA= option, 2941 PERMUTATION option, 2942, 2957, 2964, 2975 PVALS option, 2942 SEED= option, 2942 SIDAK option, 2942, 2956, 2972 STEPBON option, 2942 STEPBOOT option, 2942 STEPPERM option, 2942 STEPSID option, 2942, 2972 MULTTEST procedure, STRATA statement, 2945 WEIGHT= option, 2946, 2951 MULTTEST procedure, TEST statement, 2946 BINOMIAL option, 2947 CA option, 2948, 2964 CONTINUITY= option, 2947 FISHER option, 2944, 2954, 2975 FT option, 2951, 2968 LOWERTAILED option, 2947 MEAN option, 2955, 2968 PERMUTATION= option, 2947, 2949, 2950, 2964 PETO option, 2952, 2972 TIME= option, 2947 UPPERTAILED option, 2947 Murthy’s method SURVEYSELECT procedure, 4454
N
N option OUTPUT statement (FREQ), 1448 N= option EXACT statement (FREQ), 1445 EXACT statement (NPAR1WAY), 3160 FACTOR statement (CALIS), 607 PROC ACECLUS statement, 405 PROC PRINCOMP statement, 3605 PROC PRINQUAL statement, 3655 PROC SURVEYLOGISTIC statement, 4252 PROC SURVEYMEANS statement, 4326 PROC SURVEYREG statement, 4374 NAME statement TREE procedure, 4756 name-lists POWER procedure, 3490 NAME= option MODEL statement (TRANSREG), 4571 PLOT statement (BOXPLOT), 504 PLOT statement (GLMPOWER), 1945 PLOT statement (POWER), 3487 PLOT statement (REG), 3847 PROC TREE statement, 4752 TRANSFORM statement (PRINQUAL), 3666 NAMELEN= option PROC ANOVA statement, 433 PROC CATMOD statement, 829 PROC GENMOD statement, 1625
PROC GLM statement, 1746 PROC GLMMOD statement, 1914 PROC LIFEREG statement, 2090 PROC LOGISTIC statement, 2293 PROC MIXED statement, 2678 PROC PROBIT statement, 3713 PROC ROBUSTREG statement, 3983 PROC SURVEYLOGISTIC statement, 4251 NARROW option PROC SIM2D statement, 4099 natural response rate, 3705, 3707, 3711, 3713 NBEST option PROC ROBUSTREG statement, 3986 NBITER= option MCMC statement (MI), 2528 NCAN= option MODEL statement (TRANSREG), 4577 PROC CANCORR statement, 760 PROC CANDISC statement, 791 PROC DISCRIM statement, 1149 NCLUSTERS= option PROC TREE statement, 4752 NCOVARIATES= option POWER statement (GLMPOWER), 1940 NDIRECTIONS= option COMPUTE statement (VARIOGRAM), 4868 nearest centroid sorting, 1380 nearest neighbor method, See also single linkage DISCRIM procedure, 1158, 1161 negative binomial distribution GENMOD procedure, 1652 NLMIXED procedure, 3078 NEGATIVE option PROC MDS statement, 2482 negative variance components VARCOMP procedure, 4838 NEIGHBOR option PROC MODECLUS statement, 2868 neighborhood distribution function (NDF), definition MODECLUS procedure, 2878 Nelder-Mead simplex, 3073 nested design, 2985 error terms, 2991 generating with PLAN procedure, 3353 hypothesis tests (NESTED), 2990 nested effects design matrix (CATMOD), 878 GENMOD procedure, 1660 MIXED procedure, 2745 model parameterization (GLM), 1789 specifying (ANOVA), 451, 453 specifying (CATMOD), 864 specifying (GLM), 1785 nested error structure MIXED procedure, 2814 nested models KRIGE2D procedure, 2050, 2051 VARIOGRAM procedure, 4871
NESTED procedure analysis of covariation, 2990 compared to other procedures, 1735, 2665, 2985 computational method, 2991 input data sets, 2988 introductory example, 2986 missing values, 2990 ODS table names, 2994 random effects, 2990 syntax, 2988 unbalanced design, 2990 NESTED procedure, BY statement, 2988 NESTED procedure, CLASS statement, 2989 TRUNCATE option, 2989 NESTED procedure, PROC NESTED statement, 2988 AOV option, 2988 DATA= option, 2988 NESTED procedure, VAR statement, 2989 nested-by-value effects specifying (CATMOD), 865 network algorithm, 1509, 3172 Newman-Keuls’ multiple range test, 444, 1769, 1814 Newton algorithm FASTCLUS procedure, 1392 Newton-Raphson algorithm CALIS procedure, 578, 580, 581, 665 GENMOD procedure, 1655 iteration (PHREG), 3222 LIFEREG procedure, 2083 LOGISTIC procedure, 2317, 2318, 2336, 2338 method (PHREG), 3245 MIXED procedure, 2738 NLMIXED procedure, 3073 PROBIT procedure, 3756 SURVEYLOGISTIC procedure, 4264, 4265, 4277 NFAC= option PROC PLS statement, 3376 NFACTORS= option PROC FACTOR statement, 1313 NFRACTIONAL option MULTREG statement (POWER), 3422 ONECORR statement (POWER), 3427 ONESAMPLEMEANS statement (POWER), 3434 ONEWAYANOVA statement (POWER), 3440 PAIREDFREQ statement (POWER), 3445 PAIREDMEANS statement (POWER), 3451 POWER statement (GLMPOWER), 1940 TWOSAMPLEFREQ statement (POWER), 3459 TWOSAMPLEMEANS statement (POWER), 3467 TWOSAMPLESURVIVAL statement (POWER), 3477 NFRACTIONAL= option ONESAMPLEFREQ statement (POWER), 3430 NFULLPREDICTORS= option MULTREG statement (POWER), 3423
NGRID= option BIVAR statement, 1998 UNIVAR statement, 2000 NHCLASSES= option COMPUTE statement (VARIOGRAM), 4868 NIMPUTE= option PROC MI statement, 2520 NINTERVAL= option PROC LIFETEST statement, 2163 NITER= option MCMC statement (MI), 2528 PROC PLS statement, 3374 NKNOTS= option MODEL statement (TRANSREG), 4568 TRANSFORM statement (PRINQUAL), 3666 NLAG= option MCMC statement (MI), 2525 NLEGEND option PLOT statement (BOXPLOT), 505 NLEVELS option PROC FREQ statement, 1442 NLIN procedure analytic derivatives, 3011, 3017 automatic derivatives, 3017 close-to-linear, 3019 confidence interval, 3013, 3014, 3028, 3029 convergence, 3022 convergence criterion, 3006 cross-referencing variables, 3010 debugging execution, 3008 derivatives, 3011, 3017 displayed output, 3006 G2 inverse, 3025 G4 inverse, 3008 Gauss iterative method, 3008 Gauss-Newton method, 3024, 3027 generalized inverse, 3025 gradient method, 3024, 3025 Hessian, 3026, 3030 Hougaard’s measure, 3008, 3019 imposing bounds, 3010 incompatibilities, 3031, 3032 initial values, 3015 iteratively reweighted least squares example, 3038 Lagrange multiplier, 3010 Lagrange multipliers, covariance matrix, 3030 Marquardt iterative method, 3008, 3024, 3027 maximum iterations, 3008 maximum subiterations, 3008 mean square error specification, 3009 missing values, 3020 model confidence interval, 3029 model.variable syntax, 3012 Newton iterative method, 3008, 3024, 3026 object convergence measure, 3003 output table names, 3033 parameter confidence interval, 3028 parameter covariance matrix, 3029
PPC convergence measure, 3003 predicted values, output, 3014 R convergence measure, 3003 residual values, output, 3014 retaining variables, 3011, 3016 RPC convergence measure, 3003 segmented model example, 3034 singularity criterion, 3009 skewness, 3004, 3008, 3019 SMETHOD=GOLDEN step size search, 3028 special variables, 3020 standard error, 3014 steepest descent method, 3024, 3025 step size search, 3028 syntax, 3004 troubleshooting, 3022 tuning display of iteration computation, 3009 weighted regression, 3021 NLIN procedure, BOUNDS statement, 3010 NLIN procedure, BY statement, 3011 NLIN procedure, CONTROL statement, 3011 NLIN procedure, DER statement, 3011 NLIN procedure, ID statement, 3012 NLIN procedure, MODEL statement, 3012 NLIN procedure, OUTPUT statement, 3013 H= option, 3013 L95= option, 3013 L95M= option, 3013 OUT= option, 3013 PARMS= option, 3013 PREDICTED= option, 3014 RESIDUAL= option, 3014 SSE= option, 3014 STDI= option, 3014 STDP= option, 3014 STDR= option, 3014 STUDENT= option, 3014 U95= option, 3014 U95M= option, 3014 WEIGHT= option, 3014 NLIN procedure, PARAMETERS statement, 3014 NLIN procedure, PROC NLIN statement, 3005 BEST= option, 3006 CONVERGE= option, 3006 CONVERGEOBJ= option, 3007 CONVERGEPARM= option, 3007 DATA= option, 3007 FLOW option, 3008 G4 option, 3008 HOUGAARD option, 3008 LIST option, 3008 LISTALL option, 3008 LISTCODE option, 3008 LISTDEP option, 3008 LISTDER option, 3008 MAXITER= option, 3008 MAXSUBIT= option, 3008 METHOD= option, 3008 NOHALVE option, 3009
Syntax Index NOITPRINT option, 3009 NOPRINT option, 3009 OUTEST= option, 3009 PRINT option, 3009 RHO= option, 3009 SAVE option, 3009 SIGSQ= option, 3009 SINGULAR= option, 3009 SMETHOD= option, 3009 TAU= option, 3010 TRACE option, 3010 XREF option, 3010 NLIN procedure, program statements, 3016 NLIN procedure, RETAIN statement, 3016 NLINCON statement, CALIS procedure, 610 NLMIXED procedure, 3057 Accelerated failure time model, 3128 active set methods, 3093 adaptive Gaussian quadrature, 3084 additional estimates, 3076, 3106 alpha level, 3061 arrays, 3074 assumptions, 3083 Bernoulli distribution, 3077 binary distribution, 3077 binomial distribution, 3077 bounds, 3075 compared with other SAS procedures and macros, 3048 computational problems, 3098 computational resources, 3103 contrasts, 3076 convergence criteria, 3060, 3065, 3087 convergence problems, 3099 covariance matrix, 3061, 3101, 3106 degrees of freedom, 3062 empirical Bayes estimation, 3084 empirical Bayes options, 3062 finite differencing, 3064, 3091 first-order method, 3085 fit statistics, 3106 floating point errors, 3098 Frailty model example, 3128 functional convergence criteria, 3063 gamma distribution, 3077 Gaussian distribution, 3077 general distribution, 3078 generalized inverse, 3065 growth curve example, 3049 Hessian matrix, 3066 Hessian scaling, 3065, 3093 integral approximations, 3070, 3084 iteration history, 3067, 3104 lag functionality, 3081 Lagrange multiplier, 3093 line-search methods, 3066, 3067, 3096 logistic-normal example, 3053 long run times, 3098 maximum likelihood, 3048
negative binomial distribution, 3078 normal distribution, 3077, 3080 notation, 3083 ODS table names, 3107 optimization techniques, 3072, 3086 options summary, 3058 overflows, 3098 parameter estimates, 3106 parameter rescaling, 3098 parameter specification, 3078 pharmakokinetics example, 3107 Poisson distribution, 3078 Poisson-normal example, 3124 precision, 3101 prediction, 3079, 3102 probit-normal-binomial example, 3114 probit-normal-ordinal example, 3118 programming statements, 3081 projected gradient, 3093 projected Hessian, 3093 quadrature options, 3071 random effects, 3079 references, 3138 replicate subjects, 3080 singularity tolerances, 3072 sorting of input data set, 3062, 3079 stationary point, 3100 step length options, 3096 syntax, 3057 syntax summary, 3057 termination criteria, 3060, 3087 update methods, 3073 NLMIXED procedure, ARRAY statement, 3074 NLMIXED procedure, BOUNDS statement, 3075 NLMIXED procedure, BY statement, 3076 NLMIXED procedure, CONTRAST statement, 3076 DF= option, 3076 NLMIXED procedure, ESTIMATE statement, 3077 ALPHA= option, 3077 DF= option, 3077 NLMIXED procedure, ID statement, 3077 NLMIXED procedure, MODEL statement, 3077 NLMIXED procedure, PARMS statement, 3078 BEST= option, 3078 DATA= option, 3078 NLMIXED procedure, PREDICT statement, 3079 ALPHA= option, 3079 DER option, 3079 DF= option, 3079 NLMIXED procedure, PROC NLMIXED statement ABSCONV= option, 3060 ABSFCONV= option, 3060 ABSGCONV= option, 3060 ABSXCONV= option, 3060 ALPHA= option, 3061 ASINGULAR= option, 3061 CFACTOR= option, 3061 CORR option, 3061 COV option, 3061
COVSING= option, 3061 DAMPSTEP option, 3061 DATA= option, 3062 DF= option, 3062 DIAHES option, 3062 EBOPT option, 3062 EBSSFRAC option, 3062 EBSSTOL option, 3062 EBSTEPS option, 3062 EBSUBSTEPS option, 3062 EBTOL option, 3062 EBZSTART option, 3062 ECORR option, 3063 ECOV option, 3063 EDER option, 3063 FCONV2= option, 3063 FCONV= option, 3063 FD= option, 3064 FDHESSIAN= option, 3064 FDIGITS= option, 3065 FLOW option, 3065 FSIZE= option, 3065 G4= option, 3065 GCONV= option, 3065 HESCAL= option, 3065 HESS option, 3066 INHESSIAN option, 3066 INSTEP= option, 3066 ITDETAILS option, 3067 LCDEACT= option, 3067 LCEPSILON= option, 3067 LCSINGULAR= option, 3067 LINESEARCH= option, 3067 LIST option, 3068 LISTCODE option, 3068 LOGNOTE option, 3068 LSPRECISION= option, 3068 MAXFUNC= option, 3069 MAXITER= option, 3069 MAXSTEP= option, 3069 MAXTIME= option, 3070 METHOD= option, 3070 MINITER= option, 3070 MSINGULAR= option, 3070 NOAD option, 3071 NOADSCALE option, 3071 OPTCHECK option, 3071 OUTQ= option, 3071 QFAC option, 3071 QMAX option, 3071 QPOINTS option, 3071 QSCALEFAC option, 3071 QTOL option, 3071 RESTART option, 3072 SEED option, 3072 SINGCHOL= option, 3072 SINGHESS= option, 3072 SINGSWEEP= option, 3072 SINGVAR option, 3072
START option, 3072 TECHNIQUE= option, 3072, 3073 TRACE option, 3073 UPDATE= option, 3073, 3074 VSINGULAR= option, 3074 XCONV= option, 3074 XSIZE= option, 3074 NLMIXED procedure, RANDOM statement, 3079 ALPHA= option, 3080 DF= option, 3080 OUT= option, 3080 NLMIXED procedure, REPLICATE statement, 3080 NLOPTIONS statement, CALIS procedure, 611 NMARKERS= option PROC STDIZE statement, 4131 NMAX= option PROC SURVEYSELECT statement, 4437 NMIN= option PROC SURVEYSELECT statement, 4437 NMISS option OUTPUT statement (FREQ), 1448 NOAD option PROC NLMIXED statement, 3071 NOADJDF option PROC CALIS statement, 590 NOADSCALE option PROC NLMIXED statement, 3071 NOANOVA option MODEL statement (RSREG), 4042 NOBOUND option PARMS statement (MIXED), 2707 PROC MIXED statement, 2678 NOBS= option PROC CALIS statement, 572 PROC FACTOR statement, 1313 NOBYREF option PLOT statement (BOXPLOT), 505 NOCENSPLOT option PROC LIFETEST statement, 2163 NOCENTER option PROC MULTTEST statement, 2940 PROC PLS statement, 3376 NOCHART option BOXPLOT procedure, 505 NOCHECK option MODEL statement (LOGISTIC), 2313 MODEL statement (SURVEYLOGISTIC), 4263 PROC PRINQUAL statement, 3655 NOCLASSIFY option PROC DISCRIM statement, 1149 NOCLPRINT option PROC MIXED statement, 2678 NOCODE option MODEL statement (RSREG), 4042 NOCOL option TABLES statement (FREQ), 1458 NOCOLLAPSE option STRATA statement (SURVEYREG), 4381 NOCOLLECT option
Syntax Index PLOT statement (REG), 3850 NOCOLUMN= option PROC CORRESP statement, 1077 NOCONTAIN option MODEL statement (MIXED), 2703 NOCORR option PROC FACTOR statement, 1313 NOCOV option PROC LOGISTIC statement, 2293 NOCUM option TABLES statement (FREQ), 1458 NOCVSTDIZE option PROC PLS statement, 3376 NODECREMENT option PREDICT statement (KRIGE2D), 2041 NODESIGN option MODEL statement (CATMOD), 845 NODESIGNPRINT= option MODEL statement (LOGISTIC), 2313 MODEL statement (SURVEYLOGISTIC), 4264 MODEL statement (TPHREG), 4476 NODETAIL option STRATA statement (LIFETEST), 2167 NODIAG option PROC CALIS statement, 576 NODUMMYPRINT= option MODEL statement (LOGISTIC), 2313 MODEL statement (SURVEYLOGISTIC), 4264 MODEL statement (TPHREG), 4476 NOEIGEN option PROC CLUSTER statement, 970 NOEIGNUM option NLOPTIONS statement (CALIS), 623 NOFILL option CONTRAST statement (SURVEYREG), 4377 ESTIMATE statement (SURVEYREG), 4379 NOFIT option MODEL statement (LOGISTIC), 2314 MODEL statement (PHREG), 3229 NOFRAME option PLOT statement (BOXPLOT), 505 NOFREQ option TABLES statement (FREQ), 1458 TABLES statement (SURVEYFREQ), 4201 NOFULLZ option RANDOM statement (MIXED), 2714 NOGOODFIT option MODEL statement (ROBUSTREG), 3990 NOHALVE option PROC NLIN statement, 3009 NOHLABEL option PLOT statement (BOXPLOT), 505 NOHLF option NLOPTIONS statement (CALIS), 621 NOID option PROC CLUSTER statement, 970 NOINCREMENT option PREDICT statement (KRIGE2D), 2041 NOINFO option
5057
PROC MIXED statement, 2678 NOINT option MODEL statement (CATMOD), 845 MODEL statement (GENMOD), 1640 MODEL statement (GLM), 1772 MODEL statement (GLMMOD), 1917 MODEL statement (LIFEREG), 2099 MODEL statement (LOGISTIC), 2313 MODEL statement (MIXED), 2703, 2743 MODEL statement (ORTHOREG), 3203 MODEL statement (REG), 3827 MODEL statement (ROBUSTREG), 3990 MODEL statement (SURVEYLOGISTIC), 4264 MODEL statement (SURVEYREG), 4380 MODEL statement (TRANSREG), 4577 MULTREG statement (POWER), 3423 PROC CALIS statement, 573 PROC CANCORR statement, 760 PROC FACTOR statement, 1313 PROC PRINCOMP statement, 3605 PROC VARCLUS statement, 4811 NOITER option MODEL statement (CATMOD), 845 PARMS statement (MIXED), 2708 NOITPRINT option PROC MIXED statement, 2679 PROC NLIN statement, 3009 NOLINE option PLOT statement (REG), 3847 NOLIST option PAINT statement (REG), 3838 REWEIGHT statement (REG), 3856 NOLOG option MODEL statement (LIFEREG), 2099 NOLOGSCALE option MODEL statement (LOGISTIC), 2314 NOM option REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1780 NOMEAN option BASELINE statement (PHREG), 3226 nominal level of measurement DISTANCE procedure, 1249 nominal power GLMPOWER procedure, 1946, 1947, 1953 POWER procedure, 3419, 3494, 3496 nominal variable DISTANCE procedure, 1251 nominal variables, 72, See also classification variables NOMISS option MODEL statement (TRANSREG), 4578 PROC DISTANCE statement, 1261 PROC FASTCLUS statement, 1393 PROC PRINQUAL statement, 3655 PROC STDIZE statement, 4131 NOMOD option PROC CALIS statement, 584 NOMODEL option
5058
Syntax Index
PLOT statement (REG), 3847 non-full-rank models REG procedure, 3893 non-full-rank parameterization MIXED procedure, 2664, 2718, 2747 noninferiority power and sample size (POWER), 3552 nonlinear mixed models (NLMIXED), 3047 regression functions (TRANSREG), 4593, 4628 transformations (TRANSREG), 4593 NONLOCALW option REPEATED statement (MIXED), 2720 nonmetric multidimensional scaling MDS procedure, 2471, 2472 nonoptimal transformations PRINQUAL procedure, 3661 TRANSREG procedure, 4562 NONORM option PROC CLUSTER statement, 971 PROC MDS statement, 2482 nonparametric clustering methods MODECLUS procedure, 2855 nonparametric discriminant analysis, 1158 nonparametric regression TPSPLINE procedure, 4497 nonparametric tests NPAR1WAY procedure, 3145 NOOPTIMAL option MODEL statement (RSREG), 4042 NOOVERLAYLEGEND option PLOT statement (BOXPLOT), 505 NOPARM option MODEL statement (CATMOD), 845 NOPERCENT option TABLES statement (FREQ), 1458 TABLES statement (SURVEYFREQ), 4201 NOPHIST option PROC MDS statement, 2482 NOPREDVAR option MODEL statement (CATMOD), 845 NOPRINT option FACTOR statement (PLAN), 3340 LSMEANS statement (GLM), 1756 MODEL statement (CATMOD), 845 MODEL statement (REG), 3827 MODEL statement (RSREG), 4042 MODEL statement (TRANSREG), 4578 PROC ACECLUS statement, 405 PROC ANOVA statement, 433 PROC CALIS statement, 584 PROC CANCORR statement, 761 PROC CANDISC statement, 791 PROC CATMOD statement, 829 PROC CLUSTER statement, 971 PROC CORRESP statement, 1077 PROC DISCRIM statement, 1149 PROC FACTOR statement, 1313 PROC FASTCLUS statement, 1394
PROC FREQ statement, 1442 PROC GLM statement, 1746 PROC GLMMOD statement, 1914 PROC INBREED statement, 1973 PROC LIFEREG statement, 2090 PROC LIFETEST statement, 2163 PROC LOGISTIC statement, 2293 PROC MDS statement, 2482 PROC MI statement, 2520 PROC MODECLUS statement, 2868 PROC MULTTEST statement, 2940 PROC NLIN statement, 3009 PROC NPAR1WAY statement, 3157 PROC ORTHOREG statement, 3201 PROC PHREG statement, 3222 PROC PLS statement, 3377 PROC PRINCOMP statement, 3605 PROC PRINQUAL statement, 3655 PROC PROBIT statement, 3713 PROC REG statement, 3817 PROC RSREG statement, 4040 PROC SURVEYSELECT statement, 4437 PROC TREE statement, 4753 PROC VARCLUS statement, 4811 RIDGE statement (RSREG), 4044 TABLES statement (FREQ), 1459 TABLES statement (SURVEYFREQ), 4201 NOPROFILE option MODEL statement (CATMOD), 845 PARMS statement (MIXED), 2708 PROC MIXED statement, 2679, 2738 NOPROMAXNORM option PROC FACTOR statement, 1313 NORESPONSE option MODEL statement (CATMOD), 845 NORESTOREMISSING option OUTPUT statement (TRANSREG), 4590 NORM option FACTOR statement (CALIS), 607 PROC DISTANCE statement, 1261 PROC STDIZE statement, 4131 NORM= option PROC FACTOR statement, 1313 normal distribution, 2083, 2097, 2111, 3705 GENMOD procedure, 1651 NLMIXED procedure, 3077, 3080 PROBIT procedure, 3757 normal kernel (DISCRIM), 1159 normalization of the estimates MDS procedure, 2492 NORMALIZE option WEIGHT statement (LOGISTIC), 2329 WEIGHT statement (PHREG), 3239 NOROW option TABLES statement (FREQ), 1459 NOROW= option PROC CORRESP statement, 1077 NOSCALE option MODEL statement (GENMOD), 1640
MODEL statement (LIFEREG), 2099 PROC PLS statement, 3377 NOSCORES option OUTPUT statement (TRANSREG), 4590 NOSERIFS option PLOT statement (BOXPLOT), 505 NOSHAPE1 option MODEL statement (LIFEREG), 2099 NOSORT option MEANS statement (ANOVA), 444 MEANS statement (GLM), 1768 PROC SURVEYLOGISTIC statement, 4251 NOSPARSE option TABLES statement (FREQ), 1459 TABLES statement (SURVEYFREQ), 4201 NOSQUARE option algorithms used (CLUSTER), 986 PROC CLUSTER statement, 969, 971 NOSTAT option PLOT statement (REG), 3847 NOSTD option PROC SCORE statement, 4072 TABLES statement (SURVEYFREQ), 4201 NOSTDERR option PROC CALIS statement, 587 NOSUMMARY option PROC MODECLUS statement, 2868 PROC PHREG statement, 3222 PROC SURVEYFREQ statement, 4193 NOTABLE option PROC LIFETEST statement, 2163 NOTABLES option PROC MULTTEST statement, 2940 NOTCHES option PLOT statement (BOXPLOT), 505 NOTEST option MODEL statement (GAM), 1568 MODEL statement (MIXED), 2703 STRATA statement (LIFETEST), 2167 NOTICKREP option PLOT statement (BOXPLOT), 506 NOTIE option PROC CLUSTER statement, 971 NOTOTAL option TABLES statement (SURVEYFREQ), 4201 NOTRUNCATE option FREQ statement (PHREG), 3227 NOU option REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1780 NOULB option PROC MDS statement, 2482 NOUNI option MODEL statement (ANOVA), 446 MODEL statement (GLM), 1772 NOVANGLE option PLOT statement (BOXPLOT), 506 NOVARIOGRAM option COMPUTE statement (VARIOGRAM), 4869
NOWARN option TABLES statement (FREQ), 1459 NOWT option TABLES statement (SURVEYFREQ), 4201 NOZEROCONSTANT option OUTPUT statement (TRANSREG), 4578 NOZEROS option PROC MULTTEST statement, 2940 NP option PLOT statement (REG), 3847 NPAIRS= option PAIREDFREQ statement (POWER), 3445 PAIREDMEANS statement (POWER), 3451 NPANELPOS= option PLOT statement (BOXPLOT), 506 NPAR1WAY procedure alpha level, 3160 Ansari-Bradley scores, 3168 Brown-Mood test, 3167 compared to other procedures, 1735 computational methods, 3172 computational resources, 3173 Cramer-von Mises test, 3170 EDF tests, 3168 exact p-values, 3172 exact tests, 3171 introductory example, 3145 Klotz scores, 3168 Kolmogorov-Smirnov test, 3169 Kruskal-Wallis test, 3166 Kuiper test, 3171 Mann-Whitney-Wilcoxon test, 3166 median scores, 3167 missing values, 3162 Monte Carlo estimation, 3174 Mood scores, 3168 network algorithm, 3172 ODS table names, 3184 one-way ANOVA tests, 3165 output data set, 3161, 3175, 3176 permutation test, 3157, 3166 Pitman’s test, 3157, 3166 rank tests, 3163 Savage scores, 3167 scores, 3166 Siegel-Tukey scores, 3167 statistical computations, 3163 summary of commands, 3155 syntax, 3155 tied values, 3163 Van der Waerden scores, 3167 Wilcoxon scores, 3166 NPAR1WAY procedure, BY statement, 3158 NPAR1WAY procedure, EXACT statement, 3159 AB option, 3160 ALPHA= option, 3160 KLOTZ option, 3160 KS option, 3160 MAXTIME= option, 3160
MC option, 3160 MEDIAN option, 3160 MOOD option, 3160 N= option, 3160 POINT option, 3161 SAVAGE option, 3160 SCORES= option, 3160 SEED= option, 3161 ST option, 3160 VW option, 3160 WILCOXON option, 3160 NPAR1WAY procedure, FREQ statement, 3161 NPAR1WAY procedure, OUTPUT statement, 3161 AB option, 3162 EDF option, 3162 KLOTZ option, 3162 MEDIAN option, 3162 MOOD option, 3162 OUT= option, 3161 output statistics, 3161 SAVAGE option, 3162 SCORES= option, 3162 ST option, 3162 VW option, 3162 WILCOXON option, 3162 NPAR1WAY procedure, PROC NPAR1WAY statement, 3155 AB option, 3156 ANOVA option, 3156 CORRECT=NO option, 3156 D option, 3156 DATA= option, 3156 EDF option, 3157 KLOTZ option, 3157 MEDIAN option, 3157 MISSING option, 3157 MOOD option, 3157 NOPRINT option, 3157 SAVAGE option, 3157 SCORES= option, 3157 ST option, 3157 VW option, 3158 WILCOXON option, 3158 NPAR1WAY procedure, VAR statement, 3162 NPARTIALVARS= option ONECORR statement (POWER), 3427 NPATHS= option ASSESS statement (PHREG), 3224 NPERGROUP= option ONEWAYANOVA statement (POWER), 3440 TWOSAMPLEFREQ statement (POWER), 3459 TWOSAMPLEMEANS statement (POWER), 3467 TWOSAMPLESURVIVAL statement (POWER), 3478 NPLOT= option PROC FACTOR statement, 1314 NPOINTS= option
PLOT statement (GLMPOWER), 1943 PLOT statement (POWER), 3485 NREDUCEDPREDICTORS= option MULTREG statement (POWER), 3423 NREP option PROC ROBUSTREG statement, 3986, 3987 NSAMPLE= option PRIOR statement (MIXED), 2711 PROC MULTTEST statement, 2940 NSEARCH= option PRIOR statement (MIXED), 2711 NSUBINTERVAL= option TWOSAMPLESURVIVAL statement (POWER), 3478 NTEST= option PROC PLS statement, 3374 NTESTPREDICTORS= option MULTREG statement (POWER), 3423 NTICK= option PROC TREE statement, 4753 NTOTAL= option MULTREG statement (POWER), 3423 ONECORR statement (POWER), 3427 ONESAMPLEFREQ statement (POWER), 3430 ONESAMPLEMEANS statement (POWER), 3435 ONEWAYANOVA statement (POWER), 3440 POWER statement (GLMPOWER), 1940 TWOSAMPLEFREQ statement (POWER), 3459 TWOSAMPLEMEANS statement (POWER), 3467 TWOSAMPLESURVIVAL statement (POWER), 3478 nugget effect KRIGE2D procedure, 2044, 2051, 2052 MIXED procedure, 2718 VARIOGRAM procedure, 4871 NUGGET= option MODEL statement (KRIGE2D), 2044 SIMULATE statement (SIM2D), 4104 null hypothesis, 3488 NULLCONTRAST= option ONEWAYANOVA statement (POWER), 3441 NULLCORR= option ONECORR statement (POWER), 3428 NULLDIFF= option PAIREDMEANS statement (POWER), 3451 TWOSAMPLEMEANS statement (POWER), 3467 NULLDISCPROPRATIO= option PAIREDFREQ statement (POWER), 3445 NULLMEAN= option ONESAMPLEMEANS statement (POWER), 3435 NULLODDSRATIO= option TWOSAMPLEFREQ statement (POWER), 3459 NULLPROPORTION= option
ONESAMPLEFREQ statement (POWER), 3430 NULLPROPORTIONDIFF= option TWOSAMPLEFREQ statement (POWER), 3459 NULLRATIO= option PAIREDMEANS statement (POWER), 3451 TWOSAMPLEMEANS statement (POWER), 3467 NULLRELATIVERISK= option TWOSAMPLEFREQ statement (POWER), 3459 number of imputations MI procedure, 2565 number-lists GLMPOWER procedure, 1945 POWER procedure, 3490 NUMPOINTS= option PREDICT statement (KRIGE2D), 2042 NUMREAL= option SIMULATE statement (SIM2D), 4101 NVALS= option OUTPUT statement (PLAN), 3344 NVARS= option PROC CORRESP statement, 1077
O O’Brien’s test for homogeneity of variance ANOVA procedure, 443 GLM procedure, 1767 objective function mixed model (MIXED), 2749 oblimin method, 1291, 1318 oblique component analysis, 4799 Oblique projector MIXED procedure, 2767 oblique transformation, 1294, 1298 OBSERVED option PROC CORRESP statement, 1078 OBSMARGINS option LSMEANS statement (GLM), 1756, 1822 LSMEANS statement (MIXED), 2690 OBSTATS option MODEL statement (GENMOD), 1640 OCOEF option PROC MDS statement, 2482 OCONFIG option PROC MDS statement, 2482 OCRIT option PROC MDS statement, 2482 odds ratio adjusted, 1503 Breslow-Day test, 1508 case-control studies, 1488, 1503, 1504 confidence limits (LOGISTIC), 2308, 2315 confidence limits (SURVEYLOGISTIC), 4262 customized (LOGISTIC), 2328 customized (SURVEYLOGISTIC), 4267 estimation (LOGISTIC), 2347 estimation (SURVEYLOGISTIC), 4288
logit estimate, 1504 Mantel-Haenszel estimate, 1503 power and sample size (POWER), 3457, 3462, 3523, 3524 ODDSRATIO= option TWOSAMPLEFREQ statement (POWER), 3459 ODS and templates, 274 compatibility with Version 6, 280 creating an output data set, 292 default behavior, 273 exclusion list, 276 interactive procedures, 277 NOPRINT option interaction, 279 ODS Graphics, 319 output formats, 273 output table names, 274 path names, 275 run-group processing, 277 selection list, 276, 289 Statistical Graphics Using ODS, 319 suppressing displayed output, 279 TEMPLATE procedure, 303, 309 templates, 278 trace record, 275 with Results window, 277 with SAS Explorer, 277 ODS _ALL_ CLOSE statement, 360 ODS destinations ODS Graphics, 334 ODS DOCUMENT statement, 361 ODS examples concatenating data sets, 299 creating an output data set, 294, 297 excluding output, 291 GLMMOD procedure, 1924 HTML output, 281, 282 links in HTML output, 308, 311 modifying templates, 307 ODS and the GCHART procedure, 312 ORTHOREG procedure, 3206 output table names, 284, 295 PLS procedure, 3394, 3396, 3405 run-group processing, 299 selecting output, 287 ODS EXCLUDE statement, 276, 291, 330, 354 PERSIST option, 291, 292 ODS graph names ANOVA procedure, 460 CORRESP procedure, 1109 GAM procedure, 1581 GLM procedure, 1847 KDE procedure, 2010 LIFETEST procedure, 2191 LOESS procedure, 2250 LOGISTIC procedure, 2390 MI procedure, 2567 MIXED procedure, 2762
PHREG procedure, 3271 PRINCOMP procedure, 3613 PRINQUAL procedure, 3677 REG procedure, 3923 ROBUSTREG procedure, 4014 ODS graph templates displaying templates, 339 editing templates, 340 graph definitions, 338, 342, 365 graph template language, 338, 342 locating templates, 339 reverting to default templates, 341, 369 saving templates, 340 style definitions, 338, 344 table definitions, 338 template definitions, 338 using customized templates, 341 ODS Graphics, 319 DOCUMENT destination, 326, 335 examples, 352 excluding graphs, 330 getting started, 321 graph names, 330 graph template definitions, 338, 342, 365 graph template language, 338, 342 graphics image files, 334 graphics image files, names, 335 graphics image files, PostScript, 337 graphics image files, types, 334, 336, 350 HTML destination, 326, 335 HTML output, 326, 334, 336 index counter, 335 introductory examples, 321, 324 label collision avoidance, 351 LATEX destination, 326, 335 LaTeX output, 334, 337 lattice layouts, 381 layout area, 342 MIXED procedure, 2757 ODS destinations, 334 overlaid layouts, 368, 373, 381 PCL destination, 326, 335 PDF destination, 326, 335 PDF output, 338 plot options, 324 PostScript output, 338 PS destination, 326, 335 referring to graphs, 330 requesting graphs, 321, 324 resetting index counter, 335 RTF destination, 326, 335 RTF output, 326, 338 saving graphics image files, 336 selecting graphs, 330 supported operating environments, 348 supported procedures, 348 viewing graphs, 327 ODS GRAPHICS statement, 321, 324, 349 ANTIALIAS= option, 349
IMAGEFMT= option, 336, 337, 350, 354 IMAGENAME= option, 335, 350 OFF, 322, 349 ON, 322, 349 options, 349 RESET option, 335, 350 ODS HTML statement, 283, 291, 322 FILE= option, 326 GPATH= option, 336, 349 NEWFILE option, 313 PATH= option, 336, 349 STYLE= option, 332, 375, 378, 379 URL= suboption, 337 ODS LATEX statement, 337 GPATH= option, 337, 359 PATH= option, 337, 359 STYLE= option, 358 URL= suboption, 359 ODS LISTING statement, 361 ODS MARKUP statement TAGSET= option, 337 ODS output destinations, 274, 276, 298 objects, 273 ODS output files, 326 ODS OUTPUT statement, 292, 294, 296, 299 data set options, 297, 298 MATCH_ALL option, 299 PERSIST option, 299, 302 ODS path, 340, 341, 346 ODS PATH statement, 279, 341 RESET option, 342 SHOW option, 340 ODS PDF statement, 338 FILE= option, 338, 360 ID= option, 360 ODS RTF statement, 326, 356, 360 FILE= option, 338 ODS SELECT statement, 276, 287, 330, 331, 353 ODS SHOW statement, 279, 288, 292 ODS Statistical Graphics, see ODS Graphics ODS styles, 332 Analysis style, 333 attributes, 344 customizing, 345, 374, 376, 378 Default style, 333, 346 definitions, 344 elements, 344 Journal style, 332, 333, 346 Rtf style, 346 specifying, 332 specifying a default, 346 Statistical style, 333 ODS table names SURVEYLOGISTIC procedure, 4295 SURVEYREG procedure, 4394 SURVEYSELECT procedure, 4460
ODS TRACE statement, 275, 284, 286, 296, 330, 339, 352, 363 listing interleaved with trace record, 285 LISTING option, 275, 284, 331 offset GENMOD procedure, 1640, 1689 offset variable GENMOD procedure, 1617 PHREG procedure, 3229 OFFSET= option MODEL statement (GENMOD), 1640 MODEL statement (LOGISTIC), 2314 MODEL statement (PHREG), 3229 MODEL statement (SURVEYLOGISTIC), 4264 offspring INBREED procedure, 1973, 1980 OLS option PARMS statement (MIXED), 2708 OM option LSMEANS statement (GLM), 1756, 1822 OM= option PROC CALIS statement, 577 OMETHOD= option NLOPTIONS statement (CALIS), 613 PROC CALIS statement, 577 one-sample t-test, 4775 power and sample size (POWER), 3412, 3432, 3437, 3508–3510 one-way ANOVA power and sample size (POWER), 3438, 3442, 3443, 3513, 3536 one-way ANOVA tests NPAR1WAY procedure, 3165 ONECORR statement POWER procedure, 3426 ONESAMPLEFREQ statement POWER procedure, 3429 ONESAMPLEMEANS statement POWER procedure, 3432 ONESIDED option EXACT statement (LOGISTIC), 2302 ONEWAY option MODEL statement (CATMOD), 845 ONEWAYANOVA statement POWER procedure, 3438 online documentation, 21 operations research, 23 OPSCORE transformation MODEL statement (TRANSREG), 4563 TRANSFORM statement (PRINQUAL), 3662 OPTC option PROC PROBIT statement, 3711, 3713 OPTCHECK option PROC NLMIXED statement, 3071 optimal scaling (TRANSREG), 4609 scoring (PRINQUAL), 3662 scoring (TRANSREG), 4563, 4610 transformations (MDS), 2472, 2481
transformations (PRINQUAL), 3662 transformations (TRANSREG), 4563 optimization CALIS procedure, 550, 577, 622, 664 conjugate gradient (CALIS), 577, 579–581, 665 double dogleg (CALIS), 578, 579, 581, 665 history (CALIS), 668 initial values (CALIS), 661, 666 Levenberg-Marquardt (CALIS), 578, 581, 665 line search (CALIS), 580, 671 memory problems (CALIS), 666 Newton-Raphson (CALIS), 578, 580, 581, 665 nonlinear constraints (CALIS), 666 quasi-Newton (CALIS), 578–581, 665, 666 step length (CALIS), 672 techniques (NLMIXED), 3072, 3086 termination criteria (CALIS), 611, 615–620 trust region (CALIS), 578, 581, 665 update method (CALIS), 579 OPTION statement ROBUSTREG procedure, 3989 options CDFPLOT statement (PROBIT), 3715 IPPPLOT statement (PROBIT), 3726 LPREDPLOT statement (PROBIT), 3734 PREDPPLOT statement (PROBIT), 3747 OR option EXACT statement (FREQ), 1444, 1535 OUTPUT statement (FREQ), 1449 ORD option PROC MIXED statement, 2679 order of variables SURVEYFREQ procedure, 4193 order statistics, See RANK procedure ORDER= option CLASS statement (GENMOD), 1629 CLASS statement (LOGISTIC), 2296 CLASS statement (SURVEYLOGISTIC), 4253 CLASS statement (TPHREG), 4477 MODEL statement, 2305, 4260 MODEL statement (MIXED), 2747 MODEL statement (TRANSREG), 4569, 4578 OUTPUT statement (PHREG), 3235 PROC ANOVA statement, 433 PROC CATMOD statement, 829 PROC FREQ statement, 1442 PROC GENMOD statement, 1625 PROC GLM statement, 1746, 1803 PROC GLMMOD statement, 1914 PROC LIFEREG statement, 2090 PROC MIXED statement, 2679, 2744 PROC MULTTEST statement, 2940, 2975 PROC ORTHOREG statement, 3201 PROC PROBIT statement, 3713 PROC ROBUSTREG statement, 3983 PROC SURVEYFREQ statement, 4193 PROC SURVEYMEANS statement, 4323 VAR statement, 1267
ORDERED option OUTPUT statement (PLAN), 3344 PROC PLAN statement, 3339 ordering observations INBREED procedure, 1968 ordinal level of measurement DISTANCE procedure, 1249 ordinal model CATMOD procedure, 869 GENMOD procedure, 1704 ORDINAL Parameterization SURVEYLOGISTIC procedure, 4271 ordinal variable, 72 ordinal variables transformed to interval (RANKSCORE= ), 1261 ordinary kriging KRIGE2D procedure, 2033, 2056–2060 ORIGINAL option MODEL statement (TRANSREG), 4566 TRANSFORM statement (PRINQUAL), 3664 ORTH option MANOVA statement (ANOVA), 438 MANOVA statement (GLM), 1761 ORTHEFFECT Parameterization SURVEYLOGISTIC procedure, 4272 orthoblique rotation, 4800 orthogonal polynomial contrasts, 448 orthogonal transformation, 1294, 1298 orthomax method, 1291, 1317 orthonormalizing transformation matrix ANOVA procedure, 438 GLM procedure, 1761 ORTHORDINAL Parameterization SURVEYLOGISTIC procedure, 4272 ORTHOREG procedure compared to other procedures, 3197 input data sets, 3201 introductory example, 3197 missing values, 3203 ODS table names, 3204 output data sets, 3202, 3203 syntax, 3200 ORTHOREG procedure, BY statement, 3202 ORTHOREG procedure, CLASS statement, 3202 TRUNCATE option, 3203 ORTHOREG procedure, MODEL statement, 3203 NOINT option, 3203 ORTHOREG procedure, PROC ORTHOREG statement, 3201 DATA= option, 3201 NOPRINT option, 3201 ORDER= option, 3201 OUTEST= option, 3202 SINGULAR= option, 3202 ORTHOREG procedure, WEIGHT statement, 3203 ORTHOTHERM Parameterization SURVEYLOGISTIC procedure, 4272 ORTHPOLY Parameterization SURVEYLOGISTIC procedure, 4273
ORTHREF Parameterization SURVEYLOGISTIC procedure, 4273 OTRANS option PROC MDS statement, 2482 OUT= data sets ACECLUS procedure, 409 CANCORR procedure, 761, 766 FACTOR procedure, 1316, 1325 FASTCLUS procedure, 1398 PRINCOMP procedure, 3609 SCORE procedure, 4075 TREE procedure, 4756 OUT= option BASELINE statement (PHREG), 3224 BIVAR statement, 1998 EM statement (MI), 2523 LSMEANS statement (GLM), 1757 OUTPUT statement (FREQ), 1446 OUTPUT statement (GAM), 1568 OUTPUT statement (GENMOD), 1644 OUTPUT statement (GLM), 1775 OUTPUT statement (LIFEREG), 2100 OUTPUT statement (LOGISTIC), 2320 OUTPUT statement (NLIN), 3013 OUTPUT statement (NPAR1WAY), 3161 OUTPUT statement (PHREG), 3233 OUTPUT statement (PLAN), 3343 OUTPUT statement (REG), 3834 OUTPUT statement (ROBUSTREG), 3990 OUTPUT statement (TPSPLINE), 4510 OUTPUT statement (TRANSREG), 4582 PRIOR statement (MIXED), 2711 PROC ACECLUS statement, 406 PROC CANCORR statement, 761 PROC CANDISC statement, 791 PROC DISCRIM statement, 1149 PROC DISTANCE statement, 1261 PROC FACTOR statement, 1314 PROC FASTCLUS statement, 1394 PROC MDS statement, 2482 PROC MI statement, 2520 PROC MODECLUS statement, 2868 PROC MULTTEST statement, 2941, 2961 PROC PRINCOMP statement, 3605 PROC PRINQUAL statement, 3655 PROC RSREG statement, 4040 PROC SCORE statement, 4072 PROC STDIZE statement, 4132 PROC SURVEYSELECT statement, 4437 PROC TREE statement, 4753 RANDOM statement (NLMIXED), 3080 RESPONSE statement (CATMOD), 854 SCORE statement (GAM), 1569 SCORE statement (LOGISTIC), 2325 SCORE statement (TPSPLINE), 4510 SURVIVAL statement (LIFETEST), 2170 TABLES statement (FREQ), 1459 UNIVAR statement, 2000 OUTALL option
PROC SURVEYSELECT statement, 4437 OUTBOX= option BOXPLOT procedure, 506 OUTC= option PROC CORRESP statement, 1078 OUTCLUS= option PROC MODECLUS statement, 2868 OUTCOV= option PROC INBREED statement, 1973 OUTCROSS= option PROC DISCRIM statement, 1149 OUTCUM option TABLES statement (FREQ), 1459 OUTD= option PROC DISCRIM statement, 1150 OUTDESIGN= option PROC GLMMOD statement, 1915, 1918 PROC LOGISTIC statement, 2293 OUTDESIGNONLY option PROC LOGISTIC statement, 2293 OUTDIST= data set VARIOGRAM procedure, 4851, 4865, 4878, 4880 OUTDIST= option EXACT statement (LOGISTIC), 2302 OUTDISTANCE= option PROC VARIOGRAM statement, 4865 OUTEM= option EM statement (MI), 2523 OUTEST= data sets KRIGE2D procedure, 2060 LIFEREG procedure, 2121 ROBUSTREG procedure, 4011 OUTEST= option MCMC statement (MI), 2528 PROC CALIS statement, 570 PROC KRIGE2D statement, 2039 PROC LIFEREG statement, 2091 PROC LOGISTIC statement, 2294 PROC NLIN statement, 3009 PROC ORTHOREG statement, 3202 PROC PHREG statement, 3222 PROC PROBIT statement, 3714 PROC REG statement, 3817 PROC ROBUSTREG statement, 3984 RESPONSE statement (CATMOD), 854 OUTEXPECT option TABLES statement (FREQ), 1459, 1527 OUTF= option PROC CORRESP statement, 1078 OUTFIT= option PROC MDS statement, 2483 OUTG= option PRIOR statement (MIXED), 2711 OUTGT= option PRIOR statement (MIXED), 2711 OUTHIGHHTML= option BOXPLOT procedure, 507 OUTHISTORY= option
BOXPLOT procedure, 507 OUTHITS option PROC SURVEYSELECT statement, 4438 OUTITER option PROC FASTCLUS statement, 1394 PROC MDS statement, 2482 OUTITER= option EM statement (MI), 2523 MCMC statement (MI), 2528 OUTJAC option PROC CALIS statement, 571 OUTLENGTH= option PROC MODECLUS statement, 2868 OUTLIER keyword OUTPUT statement (ROBUSTREG), 3991 outliers FASTCLUS procedure, 1379 MODECLUS procedure, 2872 OUTLOWHTML= option BOXPLOT procedure, 507 OUTMODEL= option PROC LOGISTIC statement, 2294 OUTNBHD= data set KRIGE2D procedure, 2060, 2061 OUTNBHD= option PROC KRIGE2D statement, 2039 OUTP= option MODEL statement (MIXED), 2796 OUTPAIR= data set VARIOGRAM procedure, 4851, 4866, 4881 OUTPAIR= option PROC VARIOGRAM statement, 4866 OUTPARM= option PROC GLMMOD statement, 1915, 1917 OUTPCT option TABLES statement (FREQ), 1460 OUTPDISTANCE= option COMPUTE statement (VARIOGRAM), 4869 OUTPERM= option PROC MULTTEST statement, 2941, 2962, 2964 OUTPRED= option MODEL statement (MIXED), 2703 OUTPREDM= option MODEL statement (MIXED), 2704 output data set SCORE procedure, 4072, 4075 SURVEYMEANS procedure, 4321, 4345 output data sets ACECLUS procedure, 409 CALIS procedure, 634 CANCORR procedure, 761, 766 CLUSTER procedure, 971 FACTOR procedure, 1297, 1316, 1325 FASTCLUS procedure, 1393, 1394, 1398 KRIGE2D procedure, 2039, 2060, 2061 LIFETEST procedure, 2183 LOGISTIC procedure, 2374, 2376, 2377 MI procedure, 2520, 2528, 2559 MI procedure, EM statement, 2523
MODECLUS procedure, 2883 MULTTEST procedure, 2961, 2962 OUTCOV= data set (INBREED), 1974, 1982 PRINQUAL procedure, 3669 SIM2D procedure, 4099, 4110 SURVEYSELECT procedure, 4456 TREE procedure, 4756 VARCLUS procedure, 4811, 4816 output ODS graphics table names GENMOD procedure, 1695 PHREG procedure, 3271 output parameter estimates MI procedure, 2528 output ROC data sets LOGISTIC procedure, 2378 OUTPUT statement FREQ procedure, 1446 GAM procedure, 1568 GENMOD procedure, 1643 GLM procedure, 1773 LIFEREG procedure, 2100 LOGISTIC procedure, 2319 NLIN procedure, 3013 NPAR1WAY procedure, 3161 PHREG procedure, 3233 PLAN procedure, 3343 PLS procedure, 3379 REG procedure, 3833 ROBUSTREG procedure, 3990 TPSPLINE procedure, 4510 TRANSREG procedure, 4582 output table names ACECLUS procedure, 412 CALIS procedure, 688 CANCORR procedure, 771 CLUSTER procedure, 994 FASTCLUS procedure, 1407 GENMOD procedure, 1693 INBREED procedure, 1984 KDE procedure, 2009 LIFEREG procedure, 2124 MDS procedure, 2497 MODECLUS procedure, 2888 PRINCOMP procedure, 3613 PRINQUAL procedure, 3677 PROBIT procedure, 3765 ROBUSTREG procedure, 4012 SURVEYFREQ procedure, 4230 SURVEYLOGISTIC procedure, 4295 SURVEYMEANS procedure, 4349 SURVEYREG procedure, 4394 SURVEYSELECT procedure, 4460 TREE procedure, 4757 VARCLUS procedure, 4820 OUTPUTORDER= option MULTREG statement (POWER), 3423 ONECORR statement (POWER), 3428 ONESAMPLEFREQ statement (POWER), 3430
ONESAMPLEMEANS statement (POWER), 3435 ONEWAYANOVA statement (POWER), 3441 PAIREDFREQ statement (POWER), 3445 PAIREDMEANS statement (POWER), 3451 POWER statement (GLMPOWER), 1940 TWOSAMPLEFREQ statement (POWER), 3459 TWOSAMPLEMEANS statement (POWER), 3468 TWOSAMPLESURVIVAL statement (POWER), 3478 OUTQ= data set, 3071 OUTQ= option PROC NLMIXED statement, 3071 OUTR= option RIDGE statement (RSREG), 4044 OUTRAM= option PROC CALIS statement, 571 OUTRES= option PROC MDS statement, 2483 OUTROC= option MODEL statement (LOGISTIC), 2314 SCORE statement (LOGISTIC), 2325 OUTS= option PROC FASTCLUS statement, 1394 OUTSAMP= option PROC MULTTEST statement, 2941, 2962, 2968 OUTSDZ= option PROC DISTANCE statement, 1261 OUTSEB option MODEL statement (REG), 3828 PROC REG statement, 3817 OUTSEED option PROC SURVEYSELECT statement, 4438 OUTSEED= option PROC FASTCLUS statement, 1394 OUTSIM= data set SIM2D procedure, 4110 OUTSIM= option PROC SIM2D statement, 4099 OUTSIZE option PROC SURVEYSELECT statement, 4438 OUTSORT= option PROC SURVEYSELECT statement, 4438 OUTSSCP= option PROC REG statement, 3818 OUTSTAT= data sets CANCORR procedure, 766 FACTOR procedure, 1325 OUTSTAT= option PROC ACECLUS statement, 406 PROC ANOVA statement, 434 PROC CALIS statement, 571 PROC CANCORR statement, 761 PROC CANDISC statement, 791 PROC DISCRIM statement, 1150 PROC FACTOR statement, 1314 PROC FASTCLUS statement, 1394
PROC GLM statement, 1747 PROC PRINCOMP statement, 3605 PROC STDIZE statement, 4132 PROC VARCLUS statement, 4811 OUTSTB option MODEL statement (REG), 3828 PROC REG statement, 3818 OUTSUM= option PROC MODECLUS statement, 2868 OUTSURV= option PROC LIFETEST statement, 2163 OUTTEST= option PROC LIFETEST statement, 2163 PROC TRANSREG statement, 4556 OUTTREE= option PROC CLUSTER statement, 971 PROC VARCLUS statement, 4811 OUTVAR= data set VARIOGRAM procedure, 4866, 4877 OUTVAR= option PROC CALIS statement, 570 PROC VARIOGRAM statement, 4866 OUTVIF option MODEL statement (REG), 3828 PROC REG statement, 3818 OUTWGT= option PROC CALIS statement, 571 over-parameterization MIXED procedure, 2744 OVER= option PROC MDS statement, 2483 overall kappa coefficient, 1493, 1498 overdispersion GENMOD procedure, 1659 LOGISTIC procedure, 2316, 2354, 2355 PROBIT procedure, 3745 overflows NLMIXED procedure, 3098 overlaid layouts ODS Graphics, 368, 373, 381 Overlap dissimilarity coefficient DISTANCE procedure, 1273 overlap of data points LOGISTIC procedure, 2339 SURVEYLOGISTIC procedure, 4278 Overlap similarity coefficient DISTANCE procedure, 1273 OVERLAY option PLOT statement (REG), 3847, 3850 OVERLAY= option PLOT statement (BOXPLOT), 507 OVERLAYCLIPSYM= option BOXPLOT procedure, 507 OVERLAYCLIPSYMHT= option BOXPLOT procedure, 507 OVERLAYHTML= option PLOT statement (BOXPLOT), 507 OVERLAYID= option BOXPLOT procedure, 507
OVERLAYLEGLAB= option PLOT statement (BOXPLOT), 507 OVERLAYSYM= option PLOT statement (BOXPLOT), 507 OVERLAYSYMHT= option PLOT statement (BOXPLOT), 508
P
P option MODEL statement (GLM), 1772 MODEL statement (REG), 3828 P-P plots REG procedure, 3917, 3953 p-value adjustments Bonferroni (MULTTEST), 2939, 2956 bootstrap (MULTTEST), 2938, 2939, 2957, 2968 false discovery rate (MULTTEST), 2959 Fisher combination (MULTTEST), 2959 Hochberg (MULTTEST), 2959 Hommel (MULTTEST), 2959 MULTTEST procedure, 2935, 2956 permutation (MULTTEST), 2942, 2957, 2975 Sidak (MULTTEST), 2942, 2956, 2972 p-value computation MODECLUS procedure, 2877 P= option PROC ACECLUS statement, 406 PAGE option PROC FREQ statement, 1443 PROC SURVEYFREQ statement, 4194 PAGENUM= option PLOT statement (BOXPLOT), 508 PAGENUMPOS= option PLOT statement (BOXPLOT), 508 PAGES= option PROC TREE statement, 4753 PAINT statement REG procedure, 3835 painting line-printer plots REG procedure, 3889 paired comparisons, 4775, 4793 paired proportions, See McNemar's test PAIRED statement TTEST procedure, 4782 paired t test, 4782 power and sample size (POWER), 3448, 3455, 3518, 3519 paired-difference t test, 4775, See also paired t test PAIREDCVS= option PAIREDMEANS statement (POWER), 3452 PAIREDFREQ statement POWER procedure, 3443 PAIREDMEANS statement POWER procedure, 3448 PAIREDMEANS= option PAIREDMEANS statement (POWER), 3452
PAIREDSTDDEVS= option PAIREDMEANS statement (POWER), 3452 pairwise comparisons GLM procedure, 1807 PALL option NLOPTIONS statement (CALIS), 621 PROC CALIS statement, 584 panels INBREED procedure, 1982, 1989 PARAM= option CLASS statement (GENMOD), 1630 CLASS statement (LOGISTIC), 2296 CLASS statement (SURVEYLOGISTIC), 4254 CLASS statement (TPHREG), 4478 MODEL statement (CATMOD), 845 parameter constraints MIXED procedure, 2707, 2773 parameter estimates covariance matrix (CATMOD), 842 example (REG), 3877 NLMIXED procedure, 3106 PHREG procedure, 3220, 3222, 3223, 3269 REG procedure, 3919 TPHREG procedure, 4486, 4487 parameter rescaling NLMIXED procedure, 3098 parameter simulation MI procedure, 2564 parameter specification NLMIXED procedure, 3078 PARAMETER= option MODEL statement (TRANSREG), 4566 TRANSFORM statement (PRINQUAL), 3664 Parameterization SURVEYLOGISTIC procedure, 4270 parameterization CATMOD, 845 mixed model (MIXED), 2743 MIXED procedure, 2743 of models (GLM), 1787 PARAMETERS statement CALIS procedure, 624 NLIN procedure, 3014 parametric discriminant analysis, 1157 PARENT statement TREE procedure, 4756 Pareto charts, 23 PARMINFO= option PROC MIANALYZE statement, 2615 PARMLABEL option MODEL statement (LOGISTIC), 2314 MODEL statement (SURVEYLOGISTIC), 4264 PARMS statement MIXED procedure, 2706, 2796 NLMIXED procedure, 3078 PARMS= option OUTPUT statement (NLIN), 3013 PROC MIANALYZE statement, 2615 PARMSDATA= option
PARMS statement (MIXED), 2708 parsimax method, 1291, 1317, 1318 partial canonical correlation, 752 partial correlation CANCORR procedure, 761, 762, 764 principal components, 3612 partial correlations multivariate analysis of variance, 1824 power and sample size (POWER), 3426, 3429, 3502, 3503, 3556 partial least squares, 3367, 3380 partial likelihood PHREG procedure, 3215, 3216, 3240, 3241, 3243 partial listing product-limit estimate (LIFETEST), 2165 PARTIAL option MODEL statement (REG), 3828 partial regression leverage plots REG procedure, 3901 PARTIAL statement CALIS procedure, 627 FACTOR procedure, 1322 PRINCOMP procedure, 3608 VARCLUS procedure, 4813 PARTIALCORR= option MULTREG statement (POWER), 3424 partially balanced square lattice LATTICE procedure, 2069 PARTIALR2 option MODEL statement (REG), 3828 partitions MDS procedure, 2477, 2488 passive observations PRINQUAL procedure, 3674 TRANSREG procedure, 4605 path diagram (CALIS) exogenous variables, 664 factor analysis model, 599, 600 structural model example, 555, 600 PATH= option ODS HTML statement, 336, 349 ODS LATEX statement, 337, 359 PC option MODEL statement (REG), 3828 PLOT statement (REG), 3847 PCHI option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1449, 1539 PCL destination ODS Graphics, 326, 335 PCOEF option PROC MDS statement, 2483 PCOMIT= option MODEL statement (REG), 3828 PROC REG statement, 3818 PCONFIG option PROC MDS statement, 2483 PCORR option
EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1449 PROC CALIS statement, 584 PROC CANCORR statement, 761 PROC CANDISC statement, 791 PROC DISCRIM statement, 1150 PROC STEPDISC statement, 4167 TEST statement (FREQ), 1463 PCORR1 option MODEL statement (REG), 3829 PCORR2 option MODEL statement (REG), 3829 PCOV option PROC CANDISC statement, 791 PROC DISCRIM statement, 1150 PROC STEPDISC statement, 4167 PCOVES option PROC CALIS statement, 585 PCRPJAC option NLOPTIONS statement (CALIS), 621 PCTLDEF= option PLOT statement (BOXPLOT), 508 PROC STDIZE statement, 4132 PCTLMTD option PROC STDIZE statement, 4132 PCTLPTS option PROC STDIZE statement, 4132 PDATA option PROC MDS statement, 2483 PDATA= option PARMS statement (MIXED), 2708 PROC MULTTEST statement, 2941 PDETERM option PROC CALIS statement, 585 PDF, See probability density function PDF destination ODS Graphics, 326, 335 PDF output examples, ODS Graphics, 360 ODS Graphics, 338 PDIFF option LSMEANS statement (GLM), 1757 LSMEANS statement (MIXED), 2690, 2691 Pearson chi-square test, 1469, 1471, 3712, 3759 power and sample size (POWER), 3457, 3462, 3524 Pearson correlation coefficient, 1474, 1479 Pearson correlation statistics power and sample size (POWER), 3426, 3502, 3503, 3556 Pearson Residual MIXED procedure, 2704 Pearson residuals GENMOD procedure, 1670 LOGISTIC procedure, 2360 Pearson's chi-square GENMOD procedure, 1637, 1656, 1657 LOGISTIC procedure, 2308, 2316, 2354
PROBIT procedure, 3742, 3745, 3760 pedigree analysis example (INBREED), 1987, 1989 INBREED procedure, 1967, 1968 penalized least squares, TPSPLINE procedure, 4497, 4498, 4518 PENALTY= option PROC CLUSTER statement, 971 PERCENT= option PROC ACECLUS statement, 406 PROC VARCLUS statement, 4811 PERFORMANCE statement ROBUSTREG procedure, 3991 permutation adjustment (MULTTEST), 2942, 2957 generating with PLAN procedure, 3358 permutation (MULTTEST) p-value adjustments, 2942, 2957, 2975 PERMUTATION option PROC MULTTEST statement, 2942, 2957, 2964, 2975 permutation test NPAR1WAY procedure, 3157, 3166 PERMUTATION= option TEST statement (MULTTEST), 2947, 2949, 2950, 2964 PESTIM option PROC CALIS statement, 585 PETO option TEST statement (MULTTEST), 2952, 2972 Peto test MULTTEST procedure, 2946, 2952, 2972 Peto-Peto test for homogeneity LIFETEST procedure, 2150, 2168 Peto-Peto-Prentice, See Peto-Peto test for homogeneity PEVENT= option MODEL statement (LOGISTIC), 2314 PFINAL option PROC MDS statement, 2483 PFIT option PROC MDS statement, 2483 PFITROW option PROC MDS statement, 2483 PH option ASSESS statement (PHREG), 3223 pharmaceutical stability example (MIXED), 2810 pharmacokinetics example NLMIXED procedure, 3107 phenogram, 4743 PHESSIAN option NLOPTIONS statement (CALIS), 621 phi coefficient, 1469, 1474 PHI option OUTPUT statement (FREQ), 1449 phi-squared coefficient DISTANCE procedure, 1273 PHISTORY option
NLOPTIONS statement (CALIS), 621 PHREG procedure Andersen-Gill model, 3216, 3243, 3253 baseline hazard function, 3216 BASELINE statistics, 3224, 3225 branch and bound algorithm, 3265, 3279 Breslow likelihood, 3228 case weight, 3239 case-control studies, 3217, 3228, 3280 conditional logistic regression, 3217, 3283 continuous time scale, 3216, 3228, 3284 counting process, 3241 covariance matrix, 3222, 3233, 3245 Cox regression analysis, 3215, 3219 cumulative baseline hazard function, 3257 cumulative martingale residuals, 3223, 3266, 3271 DATA step statements, 3220, 3221, 3236, 3288 descriptive statistics, 3223 DFBETA statistics, 3234 discrete logistic model, 3216, 3228, 3283 disk space, 3222 displayed output, 3268 Efron likelihood, 3228 event times, 3215, 3218 exact likelihood, 3228 fractional frequencies, 3227 global influence, 3234, 3260 global null hypothesis, 3218, 3246, 3269 hazard function, 3215, 3240 hazards ratio, 3218, 3247 hazards ratio confidence interval, 3233 iteration history, 3233, 3269 Lee-Wei-Amato model, 3251, 3314 left truncation time, 3229, 3263 likelihood displacement, 3234, 3260 likelihood ratio test, 3246, 3269 linear hypotheses, 3217, 3238, 3247 linear predictor, 3225, 3233, 3235, 3302, 3303 local influence, 3234, 3260 log-rank test, 3219 Mantel-Haenszel test, 3219 mean function, 3224, 3226, 3244, 3245, 3253, 3255 missing values, 3228, 3236, 3286 missing values as strata, 3237 model assessment, 3223, 3265, 3271, 3318 model selection, 3216, 3229, 3230, 3264 ODS graph names, 3271 ODS table names, 3270 offset variable, 3229 output ODS graphics table names, 3271 OUTPUT statistics, 3234, 3235 parameter estimates, 3220, 3222, 3223, 3269 partial likelihood, 3215, 3216, 3240, 3241, 3243 Prentice-Williams-Peterson model, 3255 programming statements, 3220–3222, 3235, 3236, 3288 proportional hazards model, 3215, 3220, 3228
rate function, 3243, 3253 rate/mean model, 3243, 3253 recurrent events, 3216, 3224–3226, 3243 residual chi-square, 3231 residuals, 3234, 3235, 3257–3260, 3302 response variable, 3218, 3283 risk set, 3220, 3240, 3241, 3288 risk weights, 3229 score test, 3229, 3231, 3246, 3269, 3274, 3276 selection methods, 3216, 3229, 3264 singularity criterion, 3232 standard error, 3233, 3235, 3269 standard error ratio, 3269 standardized score process, 3267, 3271 step halving, 3245 strata variables, 3237 stratified analysis, 3216, 3237 survival distribution function, 3239 survival times, 3215, 3281, 3283 survivor function, 3215, 3225, 3235, 3239, 3240, 3261, 3299 syntax, 3221 ties, 3216, 3219, 3228, 3241, 3268 time-dependent covariates, 3216, 3220, 3222, 3224, 3233, 3236 Wald test, 3238, 3246, 3247, 3269, 3284 Wei-Lin-Weissfeld model, 3248 PHREG procedure, ASSESS statement, 3223 CRPANEL option, 3224 NPATHS= option, 3224 PH option, 3223 PROPORTIONALHAZARDS option, 3223 RESAMPLE option, 3224 RESAMPLE= option, 3224 SEED= option, 3224 VAR= option, 3223 PHREG procedure, BASELINE statement, 3224 ALPHA= option, 3225 CLTYPE= option, 3225 COVARIATES= option, 3224 keyword= option, 3224 METHOD= option, 3226 NOMEAN option, 3226 OUT= option, 3224 PHREG procedure, BY statement, 3226 PHREG procedure, FREQ statement, 3227 NOTRUNCATE option, 3227 PHREG procedure, ID statement, 3227 PHREG procedure, MODEL statement, 3227 ALPHA= option, 3233 BEST= option, 3229 CORRB option, 3233 COVB option, 3233 DETAILS option, 3230 ENTRYTIME= option, 3229 INCLUDE= option, 3230 ITPRINT option, 3233 MAXITER= option, 3232 MAXSTEP= option, 3230
NOFIT option, 3229 OFFSET= option, 3229 RISKLIMITS option, 3233 SELECTION= option, 3229 SEQUENTIAL option, 3230 SINGULAR= option, 3232 SLENTRY= option, 3230 SLSTAY= option, 3230 START= option, 3230 STOP= option, 3231 STOPRES option, 3231 TIES= option, 3228 PHREG procedure, OUTPUT statement, 3233 keyword= option, 3234 METHOD= option, 3235 ORDER= option, 3235 OUT= option, 3233 PHREG procedure, PROC PHREG statement, 3221 COVM option, 3221 COVOUT option, 3221 COVS option, 3222 COVSANDWICH option, 3222 DATA= option, 3222 MULTIPASS option, 3222 NOPRINT option, 3222 NOSUMMARY option, 3222 OUTEST= option, 3222 SIMPLE option, 3223 PHREG procedure, STRATA statement, 3237 MISSING option, 3237 PHREG procedure, TEST statement, 3238 AVERAGE option, 3238 E option, 3238 PRINT option, 3239 PHREG procedure, WEIGHT statement, 3239 NORMALIZE option, 3239 piecewise polynomial splines TRANSREG procedure, 4561, 4614 Pillai's trace, 437, 1759, 1828 PINAVDATA option PROC MDS statement, 2483 PINEIGVAL option PROC MDS statement, 2483 PINEIGVEC option PROC MDS statement, 2483 PININ option PROC MDS statement, 2484 PINIT option NLOPTIONS statement (CALIS), 621 PROC MDS statement, 2484 PINITIAL option PROC CALIS statement, 585 PITER option PROC MDS statement, 2484 Pitman's test NPAR1WAY procedure, 3157, 3166 PJACPAT option PROC CALIS statement, 586 PJTJ option
NLOPTIONS statement (CALIS), 621 PLAN procedure combinations, 3358 compared to other procedures, 3335 factor, selecting levels for, 3339, 3340 factor-value-setting specification, 3343, 3344 generalized cyclic incomplete block design, 3357 hierarchical design, 3353 incomplete block design, 3354, 3357 input data sets, 3339, 3343 introductory example, 3336 Latin square design, 3356 nested design, 3353 ODS table names, 3352 output data sets, 3339, 3343, 3346, 3347 permutations, 3358 random number generators, 3339 randomizing designs, 3347, 3351 specifying factor structures, 3348 split-plot design, 3352 syntax, 3339 treatments, specifying, 3345 using interactively, 3346 PLAN procedure, FACTOR statement NOPRINT option, 3340 PLAN procedure, FACTORS statement, 3340 PLAN procedure, OUTPUT statement, 3343 CVALS= option, 3344 factor-value-settings option, 3343 NVALS= option, 3344 ORDERED option, 3344 OUT= option, 3343 RANDOM option, 3344 PLAN procedure, PROC PLAN statement, 3339 ORDERED option, 3339 SEED option, 3339 PLAN procedure, TREATMENTS statement, 3345 PLATCOV option PROC CALIS statement, 586 PLCL option MODEL statement (LOGISTIC), 2314 PLCONV= option MODEL statement (LOGISTIC), 2314 PLCORR option OUTPUT statement (FREQ), 1449 TABLES statement (FREQ), 1460 %PLOT macro DISCRIM procedure, 1182 PLOT option PROC FACTOR statement, 1314 plot options ODS Graphics, 324 PLOT statement BOXPLOT procedure, 488 GLMPOWER procedure, 1942 POWER procedure, 3483 REG procedure, 3839 %PLOTIT macro, 808
CORRESP procedure, 1070, 1118, 1128 DISCRIM procedure, 1200 MDS procedure, 2474 PRINCOMP procedure, 3615, 3617, 3625 PRINQUAL procedure, 3678, 3687 TRANSREG procedure, 4617, 4717, 4718 PLOTONLY= option PROC GLMPOWER statement, 1936 PROC POWER statement, 3421 plots examples (REG), 3948 high-resolution (REG), 3840 keywords (REG), 3841 likelihood (MIXED), 2801 line printer (REG), 3848, 3882 MDS procedure, 2474 of configuration (MDS), 2504, 2505 of dimension coefficients (MDS), 2504, 2506 of linear fit (MDS), 2503 options (REG), 3843, 3844 power and sample size (GLMPOWER), 1930, 1936, 1942 power and sample size (POWER), 3412, 3420, 3421, 3483, 3566 PLOTS option PROC LOESS statement, 2250, 3923 PLOTS(MAXPOINTS=) option PROC LOESS statement, 2250 PROC REG statement, 3923 plots, ODS Graphics box plots, 355 contour plots, 324, 360 Cook’s D plots, 353 diagnostics panels, 322, 379 fit plots, 322, 354 histograms, 358 Q-Q plots, 361, 363, 368, 370, 372, 373, 375, 377–379 residual plots, 322 scatter plots, 343, 351 surface plots, 324, 360 PLOTS= option BIVAR statement, 2010 PROC GAM statement, 1581 PROC LIFETEST statement, 2164 SURVIVAL statement (LIFETEST), 2190 UNIVAR statement, 2010 plotting samples from univariate distributions MODECLUS procedure, 2889 PLRL option MODEL statement (LOGISTIC), 2315 PLS procedure algorithms, 3375 centering, 3386 compared to other procedures, 3367 components, 3367 computation method, 3375 cross validation, 3368, 3384 cross validation method, 3374
examples, 3388 factors, 3367 factors, selecting the number of, 3370 introductory example, 3368 latent variables, 3367 latent vectors, 3367 missing values, 3376 ODS table names, 3388 outlier detection, 3398 output data sets, 3379 output keywords, 3379 partial least squares regression, 3367, 3380 predicting new observations, 3373 principal components regression, 3367, 3381 reduced rank regression, 3367, 3381 scaling, 3386 SIMPLS method, 3380 syntax, 3374 test set validation, 3384, 3400 PLS procedure, BY statement, 3377 PLS procedure, CLASS statement, 3378 TRUNCATE option, 3378 PLS procedure, MODEL statement, 3378 INTERCEPT option, 3378 SOLUTION option, 3379 PLS procedure, OUTPUT statement, 3379 PLS procedure, PROC PLS statement, 3374 ALGORITHM= option, 3375 CENSCALE option, 3374 CV= option, 3374 CVTEST= option, 3375 DATA= option, 3375 DETAILS option, 3375 METHOD= option, 3375 MISSING= option, 3376 NFAC= option, 3376 NITER= option, 3374 NOCENTER option, 3376 NOCVSTDIZE option, 3376 NOPRINT option, 3377 NOSCALE option, 3377 NTEST= option, 3374 PVAL= option, 3375 SEED= option, 3374, 3375 STAT= option, 3375 VARSCALE option, 3377 PLS procedure, PROC PLS statement, METHOD=PLS option EPSILON= option, 3376 MAXITER= option, 3375 PLS procedure, PROC PLS statement, MISSING=EM option EPSILON= option, 3376 MAXITER= option, 3376 PNLCJAC option NLOPTIONS statement (CALIS), 621 point models TRANSREG procedure, 4605 POINT option
EXACT statement (FREQ), 1445 EXACT statement (NPAR1WAY), 3161 POINT transformation MODEL statement (TRANSREG), 4561 Poisson distribution GENMOD procedure, 1652 NLMIXED procedure, 3078 Poisson regression GENMOD procedure, 1613, 1616 Poisson-normal example NLMIXED procedure, 3124 POLY Parameterization SURVEYLOGISTIC procedure, 4271 polychoric correlation coefficient, 1460, 1474, 1482 polynomial effects GENMOD procedure, 1660 MIXED procedure, 2743 model parameterization (GLM), 1787 specifying (GLM), 1784 POLYNOMIAL keyword REPEATED statement (ANOVA), 448 polynomial model GLMMOD procedure, 1909 POLYNOMIAL option REPEATED statement (GLM), 1779, 1831, 1878 POLYNOMIAL Parameterization SURVEYLOGISTIC procedure, 4271 polynomial regression REG procedure, 3804 polynomial-spline basis TRANSREG procedure, 4561, 4614 POOL= option PROC DISCRIM statement, 1150 pooled stratum SURVEYREG procedure, 4389 pooled within-cluster covariance matrix definition, 388 population profile (CATMOD), 820 total (SURVEYLOGISTIC), 4252, 4280 total (SURVEYMEANS), 4326, 4334 total (SURVEYREG), 4374, 4382 population (INBREED) monoecious, 1985 multiparous, 1973, 1977 nonoverlapping, 1974 overlapping, 1968, 1969, 1979 population clusters risks of estimating (MODECLUS), 2877 population profile, 73 POPULATION statement, CATMOD procedure, 848 population total SURVEYFREQ procedure, 4194, 4204 POS= option PROC TREE statement, 4753 posterior probability DISCRIM procedure, 1200 error rate estimation (DISCRIM), 1165
POSTERR option PROC DISCRIM statement, 1150 PostScript graphics image files, 358 PostScript output ODS Graphics, 338 power overview of power concepts (POWER), 3488 See GLMPOWER procedure, 1929 See POWER procedure, 3411 power curves, See plots Power distance coefficient DISTANCE procedure, 1272 POWER procedure AB/BA crossover designs, 3549 actual alpha, 3496 actual power, 3419, 3494, 3496 actual prob(width), 3496 analysis of variance, 3438, 3442, 3443, 3513, 3536 analysis statements, 3420 binomial proportion tests, 3429, 3432, 3504– 3506, 3541 bioequivalence, 3510, 3511, 3520, 3521, 3530, 3531, 3549 ceiling sample size, 3419, 3496 compared to other procedures, 1930, 3412 computational methods, 3498 computational resources, 3498 confidence intervals for means, 3432, 3438, 3448, 3456, 3463, 3473, 3512, 3522, 3532, 3563 contrasts, analysis of variance, 3438, 3439, 3442, 3513, 3536 correlated proportions, 3443, 3447, 3448, 3516, 3517 correlation, 3426, 3502, 3503, 3556 crossover designs, 3549 displayed output, 3496 effect size, 3485 equivalence tests, 3432, 3438, 3448, 3456, 3463, 3472, 3510, 3511, 3520, 3521, 3530, 3531, 3549 Fisher’s exact test, 3457, 3463, 3525 Fisher’s z test for correlation, 3426, 3429, 3502, 3556 fractional sample size, 3419, 3496 Gehan test, 3473, 3482, 3533 grouped-name-lists, 3490 grouped-number-lists, 3490 introductory example, 3412 keyword-lists, 3490 likelihood-ratio chi-square test, 3457, 3463, 3525 log-rank test for comparing survival curves, 3473, 3481, 3533, 3561
lognormal data, 3434, 3437, 3438, 3450, 3455, 3456, 3465, 3472, 3509, 3511, 3519, 3521, 3529, 3531, 3552 McNemar’s test, 3443, 3447, 3448, 3516, 3517 name-lists, 3490 nominal power, 3419, 3494, 3496 noninferiority, 3552 notation for formulas, 3498 number-lists, 3490 odds ratio, 3457, 3462, 3523, 3524 ODS table names, 3497 one-sample t test, 3412, 3432, 3437, 3508, 3509 one-way ANOVA, 3438, 3442, 3443, 3513, 3536 overview of power concepts, 3488 paired proportions, 3443, 3447, 3448, 3516, 3517 paired t test, 3448, 3455, 3518, 3519 partial correlation, 3426, 3429, 3502, 3503, 3556 Pearson chi-square test, 3457, 3462, 3524 Pearson correlation, 3426, 3429, 3502, 3503, 3556 plots, 3412, 3420, 3421, 3483, 3566 regression, 3421, 3425, 3500, 3556 relative risk, 3457, 3462, 3523, 3524 sample size adjustment, 3494 summary of analyses, 3488 summary of statements, 3420 survival analysis, 3473, 3481, 3533 syntax, 3420 t test for correlation, 3426, 3429, 3503 t tests, 3432, 3437, 3448, 3455, 3463, 3471, 3508, 3509, 3518, 3526, 3529, 3566 Tarone-Ware test, 3473, 3483, 3533 two-sample t test, 3415, 3463, 3471, 3472, 3526, 3527, 3529, 3566 value lists, 3490 z test, 3429, 3432, 3505, 3506 POWER procedure, MULTREG statement, 3421 ALPHA= option, 3422 MODEL= option, 3422 NFRACTIONAL option, 3422 NFULLPREDICTORS= option, 3423 NOINT option, 3423 NREDUCEDPREDICTORS= option, 3423 NTESTPREDICTORS= option, 3423 NTOTAL= option, 3423 OUTPUTORDER= option, 3423 PARTIALCORR= option, 3424 POWER= option, 3424 RSQUAREDIFF= option, 3424 RSQUAREFULL= option, 3424 RSQUAREREDUCED= option, 3425 TEST= option, 3425 POWER procedure, ONECORR statement, 3426 ALPHA= option, 3427 CORR= option, 3427 DIST= option, 3427 MODEL= option, 3427 NFRACTIONAL option, 3427
NPARTIALVARS= option, 3427 NTOTAL= option, 3427 NULLCORR= option, 3428 OUTPUTORDER= option, 3428 POWER= option, 3428 TEST= option, 3428 POWER procedure, ONESAMPLEFREQ statement, 3429 ALPHA= option, 3430 METHOD= option, 3430 NFRACTIONAL= option, 3430 NTOTAL= option, 3430 NULLPROPORTION= option, 3430 OUTPUTORDER= option, 3430 POWER= option, 3431 PROPORTION= option, 3431 SIDES= option, 3431 TEST= option, 3431 POWER procedure, ONESAMPLEMEANS statement, 3432 ALPHA= option, 3434 CI= option, 3434 CV= option, 3434 DIST= option, 3434 HALFWIDTH= option, 3434 LOWER= option, 3434 MEAN= option, 3434 NFRACTIONAL option, 3434 NTOTAL= option, 3435 NULLMEAN= option, 3435 OUTPUTORDER= option, 3435 POWER= option, 3435 PROBTYPE= option, 3436 PROBWIDTH= option, 3436 SIDES= option, 3436 STDDEV= option, 3436 TEST= option, 3437 UPPER= option, 3437 POWER procedure, ONEWAYANOVA statement, 3438 ALPHA= option, 3439 CONTRAST= option, 3439 GROUPMEANS= option, 3440 GROUPNS= option, 3440 GROUPWEIGHTS= option, 3440 NFRACTIONAL option, 3440 NPERGROUP= option, 3440 NTOTAL= option, 3440 NULLCONTRAST= option, 3441 OUTPUTORDER= option, 3441 POWER= option, 3441 SIDES= option, 3441 STDDEV= option, 3442 TEST= option, 3442 POWER procedure, PAIREDFREQ statement, 3443 ALPHA= option, 3444 DISCPROPDIFF= option, 3444 DISCPROPORTIONS= option, 3444 DISCPROPRATIO= option, 3445
ALPHA= option, 3444 DISCPROPDIFF= option, 3444 DISCPROPORTIONS= option, 3444 DISCPROPRATIO= option, 3445 DIST= option, 3445 METHOD= option, 3445 NFRACTIONAL option, 3445 NPAIRS= option, 3445 NULLDISCPROPRATIO= option, 3445 OUTPUTORDER= option, 3445 POWER= option, 3446 REFPROPORTION= option, 3446 SIDES= option, 3446 TEST= option, 3446 TOTALPROPDISC= option, 3447 POWER procedure, PAIREDMEANS statement, 3448 ALPHA= option, 3449 CI= option, 3449 CORR= option, 3450 CV= option, 3450 DIST= option, 3450 HALFWIDTH= option, 3450 LOWER= option, 3450 MEANDIFF= option, 3450 MEANRATIO= option, 3450 NFRACTIONAL option, 3451 NPAIRS= option, 3451 NULLDIFF= option, 3451 NULLRATIO= option, 3451 OUTPUTORDER= option, 3451 PAIREDCVS= option, 3452 PAIREDMEANS= option, 3452 PAIREDSTDDEVS= option, 3452 POWER= option, 3452 PROBTYPE= option, 3452 PROBWIDTH= option, 3453 SIDES= option, 3453 STDDEV= option, 3453 TEST= option, 3454 UPPER= option, 3454 POWER procedure, PLOT statement, 3483 DESCRIPTION= option, 3487 INTERPOL= option, 3483 KEY= option, 3483 MARKERS= option, 3484 MAX= option, 3484 MIN= option, 3484 NAME= option, 3487 NPOINTS= option, 3485 STEP= option, 3485 VARY option, 3485 X= option, 3485 XOPTS= option, 3486 Y= option, 3487 YOPTS= option, 3487 POWER procedure, PROC POWER statement, 3421 PLOTONLY= option, 3421 POWER procedure, TWOSAMPLEFREQ statement, 3457 ALPHA= option, 3458 GROUPNS= option, 3458 GROUPPROPORTIONS= option, 3458
GROUPWEIGHTS= option, 3458 NFRACTIONAL option, 3459 NPERGROUP= option, 3459 NTOTAL= option, 3459 NULLODDSRATIO= option, 3459 NULLPROPORTIONDIFF= option, 3459 NULLRELATIVERISK= option, 3459 ODDSRATIO= option, 3459 OUTPUTORDER= option, 3459 POWER= option, 3460 PROPORTIONDIFF= option, 3460 REFPROPORTION= option, 3460 RELATIVERISK= option, 3460 SIDES= option, 3461 TEST= option, 3461 POWER procedure, TWOSAMPLEMEANS statement, 3463 ALPHA= option, 3465 CI= option, 3465 CV= option, 3465 DIST= option, 3465 GROUPMEANS= option, 3466 GROUPNS= option, 3466 GROUPSTDDEVS= option, 3466 GROUPWEIGHTS= option, 3466 HALFWIDTH= option, 3466 LOWER= option, 3466 MEANDIFF= option, 3467 MEANRATIO= option, 3467 NFRACTIONAL option, 3467 NPERGROUP= option, 3467 NTOTAL= option, 3467 NULLDIFF= option, 3467 NULLRATIO= option, 3467 OUTPUTORDER= option, 3468 POWER= option, 3468 PROBTYPE= option, 3468 PROBWIDTH= option, 3469 SIDES= option, 3469 STDDEV= option, 3469 TEST= option, 3470 UPPER= option, 3470 POWER procedure, TWOSAMPLESURVIVAL statement, 3473 ACCRUALTIME= option, 3474 ALPHA= option, 3475 CURVE= option, 3475 FOLLOWUPTIME= option, 3476 GROUPLOSS= option, 3476 GROUPLOSSEXPHAZARDS= option, 3476 GROUPMEDLOSSTIMES= option, 3476 GROUPMEDSURVTIMES= option, 3476 GROUPNS= option, 3477 GROUPSURVEXPHAZARDS= option, 3477 GROUPSURVIVAL= option, 3477 GROUPWEIGHTS= option, 3477 HAZARDRATIO= option, 3477 NFRACTIONAL option, 3477 NPERGROUP= option, 3478
NSUBINTERVAL= option, 3478 NTOTAL= option, 3478 OUTPUTORDER= option, 3478 POWER= option, 3479 REFSURVEXPHAZARD= option, 3479 REFSURVIVAL= option, 3479 SIDES= option, 3479 TEST= option, 3480 TOTALTIME= option, 3480 power semivariogram model KRIGE2D procedure, 2049 POWER statement GLMPOWER procedure, 1939 POWER transformation MODEL statement (TRANSREG), 4562 TRANSFORM statement (MI), 2534 TRANSFORM statement (PRINQUAL), 3661 power-of-the-mean model MIXED procedure, 2718 POWER= option MULTREG statement (POWER), 3424 ONECORR statement (POWER), 3428 ONESAMPLEFREQ statement (POWER), 3431 ONESAMPLEMEANS statement (POWER), 3435 ONEWAYANOVA statement (POWER), 3441 PAIREDFREQ statement (POWER), 3446 PAIREDMEANS statement (POWER), 3452 POWER statement (GLMPOWER), 1941 PROC FACTOR statement, 1314 PROC MODECLUS statement, 2868 TWOSAMPLEFREQ statement (POWER), 3460 TWOSAMPLEMEANS statement (POWER), 3468 TWOSAMPLESURVIVAL statement (POWER), 3479 PP option PROC ACECLUS statement, 406 PPC convergence measure, 3030 pplot plots annotating, 2102 axes, color, 2102 font, specifying, 2103 reference lines, options, 2103–2105, 2107 PPREFIX option OUTPUT statement (TRANSREG), 4591 PPROB= option MODEL statement (LOGISTIC), 2315 PPS, 161 PPS sampling SURVEYSELECT procedure, 4463 PPS sampling, with replacement SURVEYSELECT procedure, 4451 PPS sampling, without replacement SURVEYSELECT procedure, 4449 PPS sequential sampling SURVEYSELECT procedure, 4452 PPS systematic sampling
SURVEYSELECT procedure, 4451 PR2ENTRY= option PROC STEPDISC statement, 4167 PR2STAY= option PROC STEPDISC statement, 4167 precision NLMIXED procedure, 3101 precision, confidence intervals, 3488 PRED option MODEL statement (GENMOD), 1641 PLOT statement (REG), 3847 PRED= option MODEL statement (CATMOD), 845 PREDET option PROC CALIS statement, 586 PREDICT option MODEL statement (CATMOD), 845 MODEL statement (RSREG), 4042 PROC SCORE statement, 4072 PREDICT statement NLMIXED procedure, 3079 PREDICT statement (KRIGE2D), 2041 PREDICTED keyword OUTPUT statement (GLM), 1774 OUTPUT statement (LIFEREG), 2101 OUTPUT statement (ROBUSTREG), 3991 predicted means MIXED procedure, 2704 predicted model matrix CALIS procedure, 643, 644, 663 displaying (CALIS), 683 singular (CALIS), 680 PREDICTED option MODEL statement (GENMOD), 1641 OUTPUT statement (TRANSREG), 4591 predicted population margins GLM procedure, 1753 predicted probabilities LOGISTIC procedure, 2350 predicted probability plots annotating, 3750 axes, color, 3750 font, specifying, 3750 options summarized by function, 3748 reference lines, options, 3750–3752, 3754 threshold lines, options, 3753 predicted residual sum of squares RSREG procedure, 4042 predicted value confidence intervals MIXED procedure, 2692 predicted values example (MIXED), 2796 LIFEREG procedure, 2114 mixed model (MIXED), 2687 MIXED procedure, 2703 NLIN procedure, 3014 REG procedure, 3876, 3879 response functions (CATMOD), 845 PREDICTED= option
Syntax Index OUTPUT statement (LOGISTIC), 2320 OUTPUT statement (NLIN), 3014 prediction example (REG), 3924 NLMIXED procedure, 3079, 3102 predictive mean matching method MI procedure, 2542 predpplot PROBIT procedure, 3746 PREDPPLOT statement options summarized by function, 3748 PROBIT procedure, 3746 PREDPROBS= option OUTPUT statement (LOGISTIC), 2320 preference mapping TRANSREG procedure, 4593, 4717 preference models TRANSREG procedure, 4586 prefix name LINEQS statement (CALIS), 602 MATRIX statement (CALIS), 594 RAM statement (CALIS), 598 STD statement (CALIS), 603 PREFIX= option MANOVA statement (ANOVA), 438 MANOVA statement (GLM), 1760 PROC ACECLUS statement, 406 PROC CANDISC statement, 792 PROC DISTANCE statement, 1261 PROC GLMMOD statement, 1915 PROC PRINCOMP statement, 3606 PROC PRINQUAL statement, 3655 preliminary clusters definition (CLUSTER), 978 using in CLUSTER procedure, 969 Prentice-Williams-Peterson model PHREG procedure, 3255 PREPLOT option PROC FACTOR statement, 1314 PREROTATE= option PROC FACTOR statement, 1314 presentations examples, ODS Graphics, 356 PRESS keyword OUTPUT statement (GLM), 1774 PRESS option MODEL statement (REG), 3829 MODEL statement (RSREG), 4042 PROC REG statement, 3818 PRESS residual MIXED Procedure, 2766 PRESS statistic, 1774 MIXED Procedure, 2766 RSREG procedure, 4042 prevalence test MULTTEST procedure, 2952, 2972 primary sampling unit (PSUs), 165 primary sampling units (PSUs) SURVEYLOGISTIC procedure, 4281
5077
SURVEYMEANS procedure, 4335 SURVEYREG procedure, 4383 PRIMAT option PROC CALIS statement, 586 principal component analysis, 1291 compared with common factor analysis, 1293 PRINQUAL procedure, 3668 with FACTOR procedure, 1295 principal components, See also PRINCOMP procedure definition, 3595 interpreting eigenvalues, 3621 partialling out variables, 3608 properties of, 3595, 3596 regression (PLS), 3367, 3381 rotating, 3611 using weights, 3608 principal factor analysis with FACTOR procedure, 1296 PRINCOMP procedure CALIS procedure, 567 computational resources, 3611 correction for means, 3605 Crime Rates Data, example, 3619 DATA= data set, 3609 eigenvalues and eigenvectors, 3595, 3610–3612 examples, 3614, 3626 input data set, 3605 ODS graph names, 3613 output data sets, 3605, 3609–3611 output table names, 3613 OUTSTAT= data set, 3609 %PLOTIT macro, 3615, 3617, 3625 replace missing values, example, 3626 SCORE procedure, 3611 suppressing output, 3605 syntax, 3604 weights, 3608 PRINCOMP procedure, BY statement, 3607 PRINCOMP procedure, FREQ statement, 3607 PRINCOMP procedure, PARTIAL statement, 3608 PRINCOMP procedure, PROC PRINCOMP statement, 3604 COV option, 3604 COVARIANCE option, 3604 DATA= option, 3605 N= option, 3605 NOINT option, 3605 NOPRINT option, 3605 OUT= option, 3605 OUTSTAT= option, 3605 PREFIX= option, 3606 SING= option, 3606 SINGULAR= option, 3606 STANDARD option, 3606 STD option, 3606 VARDEF= option, 3606 PRINCOMP procedure, VAR statement, 3608 PRINCOMP procedure, WEIGHT statement, 3608
5078
Syntax Index
PRINQUAL procedure biplot, 3678 casewise deletion, 3655 character OPSCORE variables, 3673 constant transformations, avoiding, 3673 constant variables, 3673 excluded observations, 3657, 3674 frequency variable, 3658 identity transformation, 3663 iterations, 3656, 3667, 3673 knots, 3665, 3666 linear transformation, 3662 MAC method, 3643, 3669 maximum average correlation method, 3643, 3669 maximum total variance method, 3643 MDPREF analysis, 3678 MGV method, 3643 minimum generalized variance method, 3643 missing character values, 3662 missing values, 3655, 3667, 3674 monotonic B-spline transformation, 3662 monotonic transformation, 3662, 3663 MTV method, 3643 multidimensional preference analysis, 3678, 3688 nonoptimal transformations, 3661 ODS graph names, 3677 optimal scoring, 3662 optimal transformations, 3662 output data sets, 3669 output table names, 3677 passive observations, 3674 %PLOTIT macro, 3678, 3687 principal component analysis, 3668 random initializations, 3673 reflecting the transformation, 3666 renaming variables, 3666 reusing variables, 3666 smoothing spline transformation, 3663 spline t-options, 3665 spline transformation, 3662 standardization, 3672 syntax, 3651 transformation options, 3663 variable names, 3672 weight variable, 3667 PRINQUAL procedure, BY statement, 3658 PRINQUAL procedure, FREQ statement, 3658 PRINQUAL procedure, ID statement, 3659 PRINQUAL procedure, PROC PRINQUAL statement, 3651 APPROXIMATIONS option, 3653 APREFIX= option, 3652 CCONVERGE= option, 3653 CHANGE= option, 3653 CONVERGE= option, 3653 CORRELATIONS option, 3653 COVARIANCE option, 3653
DATA= option, 3653 DUMMY option, 3654 INITITER= option, 3654 MAXITER= option, 3654 MDPREF option, 3654 METHOD= option, 3654 MONOTONE= option, 3654 N= option, 3655 NOCHECK option, 3655 NOMISS option, 3655 NOPRINT option, 3655 OUT= option, 3655 PREFIX= option, 3655 REFRESH= option, 3656 REITERATE option, 3656 REPLACE option, 3656 SCORES option, 3656 SINGULAR= option, 3656 STANDARD option, 3656 TPREFIX= option, 3657 TSTANDARD= option, 3657 TYPE= option, 3657 UNTIE= option, 3658 PRINQUAL procedure, TRANSFORM statement, 3659 ARSIN transformation, 3661 DEGREE= option, 3665 EVENLY option, 3665 EXP transformation, 3661 IDENTITY transformation, 3663 KNOTS= option, 3665 LINEAR transformation, 3662 LOG transformation, 3661 LOGIT transformation, 3661 MONOTONE transformation, 3662 MSPLINE transformation, 3662 NAME= option, 3666 NKNOTS= option, 3666 OPSCORE transformation, 3662 ORIGINAL option, 3664 PARAMETER= option, 3664 POWER transformation, 3661 RANK transformation, 3662 REFLECT option, 3666 SM= option, 3665 SPLINE transformation, 3662 SSPLINE transformation, 3663 TSTANDARD= option, 3666 UNTIE transformation, 3663 PRINQUAL procedure, WEIGHT statement, 3667 PRINT option MTEST statement (REG), 3833 NLOPTIONS statement (CALIS), 622 PROC CALIS statement, 586 PROC FACTOR statement, 1314 PROC NLIN statement, 3009 SCORE statement (LOESS), 2237 TEST statement (LOGISTIC), 2327 TEST statement (PHREG), 3239
TEST statement (REG), 3858 TEST statement (SURVEYLOGISTIC), 4267 PRINT statement, REG procedure, 3851 PRINT= option PROC CLUSTER statement, 971 PROC CORRESP statement, 1078 PRINTE option MANOVA statement (ANOVA), 438 MANOVA statement (GLM), 1761 REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1780, 1829 PRINTH option MANOVA statement (ANOVA), 439 MANOVA statement (GLM), 1761 REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1780 PRINTKWT option TABLES statement (FREQ), 1460 PRINTM option REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1780 PRINTRV option REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1781 prior density MIXED procedure, 2709 prior event probability LOGISTIC procedure, 2314, 2353, 2422 PRIOR statement MIXED procedure, 2708 PRIOR= option MCMC statement (MI), 2528 SCORE statement (LOGISTIC), 2325 PRIOREVENT= option SCORE statement (LOGISTIC), 2325 PRIORS statement DISCRIM procedure, 1154 FACTOR procedure, 1322 PRIORS= option PROC FACTOR statement, 1315 PRIVEC option PROC CALIS statement, 586 PROB option MODEL statement (CATMOD), 846 probability density function LIFETEST procedure, 2149, 2214 probability distribution built-in (GENMOD), 1614, 1638 exponential family (GENMOD), 1650 user-defined (GENMOD), 1633 probability sampling, See also SURVEYSELECT procedure PROBIT, 3705 probit analysis insets, 3724 probit equation, 3705, 3759 probit model SURVEYLOGISTIC procedure, 4285 PROBIT procedure, 3705
Abbot’s formula, 3755 binary response data, 3705, 3706, 3759 cdfplot, 3715 deviance, 3745, 3760 deviance statistic, 3759 dispersion parameter, 3760 extreme value distribution, 3757 goodness-of-fit, 3743, 3745 goodness-of-fit tests, 3712, 3743, 3759 inset, 3723 inverse confidence limits, 3761 ippplot, 3725 log-likelihood function, 3756 logistic distribution, 3757 lpredplot, 3733 maximum-likelihood estimates, 3705 missing values, 3755 models, 3759 multilevel response data, 3705, 3706, 3759 natural response rate, 3706 Newton-Raphson algorithm, 3756 normal distribution, 3757 output table names, 3765 overdispersion, 3745 Pearson chi-square, 3759 Pearson’s chi-square, 3742, 3745, 3760 predpplot, 3746 subpopulation, 3742, 3745, 3761 syntax, 3710 threshold response rate, 3706 tolerance distribution, 3761 PROBIT procedure, BY statement, 3714 PROBIT procedure, CDFPLOT statement, 3715 ANNOTATE= option, 3718 CAXIS= option, 3718 CFIT= option, 3718 CFRAME= option, 3718 CGRID= option, 3719 CHREF= option, 3719 CLABBOX= option, 3719 CTEXT= option, 3719 CVREF= option, 3719 DESCRIPTION= option, 3719 FONT= option, 3719 HAXIS= option, 3719 HEIGHT= option, 3719 HLOWER= option, 3720 HOFFSET= option, 3720 HREF= option, 3720 HREFLABELS= option, 3720 HREFLABPOS= option, 3720 HUPPER= option, 3720 INBORDER option, 3720 LEVEL option, 3720 LFIT option, 3721 LGRID option, 3721 LHREF= option, 3721 LVREF= option, 3721 NAME= option, 3721
NOFIT option, 3721 NOFRAME option, 3721 NOGRID option, 3721 NOHLABEL option, 3721 NOHTICK option, 3721 NOTHRESH option, 3721 NOVLABEL option, 3721 NOVTICK option, 3721 options, 3715 THRESHLABPOS= option, 3721 VAR= option, 3715 VAXIS= option, 3722 VAXISLABEL= option, 3722 VLOWER= option, 3722 VREF= option, 3722 VREFLABELS= option, 3722 VREFLABPOS= option, 3722 VUPPER= option, 3722 WAXIS= option, 3723 WFIT= option, 3723 WGRID= option, 3723 WREFL= option, 3723 PROBIT procedure, CLASS statement, 3723 PROBIT procedure, INSET statement, 3723, 3724 keywords, 3724 PROBIT procedure, IPPPLOT statement, 3725 ANNOTATE= option, 3729 CAXIS= option, 3729 CFIT= option, 3729 CFRAME= option, 3729 CGRID= option, 3729 CHREF= option, 3729 CTEXT= option, 3729 CVREF= option, 3729 DESCRIPTION= option, 3729 FONT= option, 3729 HAXIS= option, 3730 HEIGHT= option, 3730 HLOWER= option, 3730 HOFFSET= option, 3730 HREF= option, 3730 HREFLABELS= option, 3730 HREFLABPOS= option, 3730 HUPPER= option, 3730 INBORDER option, 3731 LFIT option, 3731 LGRID option, 3731 LHREF= option, 3731 LVREF= option, 3731 NAME= option, 3731 NOCONF option, 3731 NODATA option, 3731 NOFIT option, 3731 NOFRAME option, 3731 NOGRID option, 3731 NOHLABEL option, 3732 NOHTICK option, 3732 NOTHRESH option, 3732 NOVLABEL option, 3732
NOVTICK option, 3732 options, 3726 THRESHLABPOS= option, 3732 VAR= option, 3725 VAXIS= option, 3732 VAXISLABEL= option, 3732 VLOWER= option, 3732 VREF= option, 3732 VREFLABELS= option, 3733 VREFLABPOS= option, 3733 VUPPER= option, 3733 WAXIS= option, 3733 WFIT= option, 3733 WGRID= option, 3733 WREFL= option, 3733 PROBIT procedure, LPREDPLOT statement, 3733 ANNOTATE= option, 3737 CAXIS= option, 3737 CFIT= option, 3737 CFRAME= option, 3737 CGRID= option, 3737 CHREF= option, 3737 CTEXT= option, 3737 CVREF= option, 3737 DESCRIPTION= option, 3737 FONT= option, 3737 HAXIS= option, 3738 HEIGHT= option, 3738 HLOWER= option, 3738 HOFFSET= option, 3738 HREF= option, 3738 HREFLABELS= option, 3738 HREFLABPOS= option, 3738 HUPPER= option, 3738 INBORDER option, 3739 LEVEL option, 3739 LFIT option, 3739 LGRID option, 3739 LHREF= option, 3739 LVREF= option, 3739 NAME= option, 3739 NOCONF option, 3739 NODATA option, 3739 NOFIT option, 3740 NOFRAME option, 3740 NOGRID option, 3740 NOHLABEL option, 3740 NOHTICK option, 3740 NOTHRESH option, 3740 NOVLABEL option, 3740 NOVTICK option, 3740 options, 3734 THRESHLABPOS= option, 3740 VAR= option, 3733 VAXIS= option, 3740 VAXISLABEL= option, 3740 VLOWER= option, 3740 VREF= option, 3741 VREFLABELS= option, 3741
VREFLABPOS= option, 3741 VUPPER= option, 3741 WAXIS= option, 3741 WFIT= option, 3741 WGRID= option, 3741 WREFL= option, 3741 PROBIT procedure, MODEL statement, 3741 AGGREGATE= option, 3742 ALPHA= option, 3743 CONVERGE option, 3743 CORRB option, 3743 COVB option, 3743 DISTRIBUTION= option, 3743 HPROB= option, 3743 INITIAL option, 3744 INTERCEPT= option, 3744 INVERSECL option, 3744 ITPRINT option, 3744 MAXITER= option, 3745 NOINT option, 3745 SCALE= option, 3745 SINGULAR= option, 3745 PROBIT procedure, OUTPUT statement, 3745 PROBIT procedure, PREDPLOT statement LEVEL option, 3752 PROBIT procedure, PREDPPLOT statement, 3746 ANNOTATE= option, 3750 CAXIS= option, 3750 CFIT= option, 3750 CFRAME= option, 3750 CGRID= option, 3750 CHREF= option, 3750 CTEXT= option, 3750 CVREF= option, 3750 DESCRIPTION= option, 3750 FONT= option, 3750 HAXIS= option, 3751 HEIGHT= option, 3751 HLOWER= option, 3751 HOFFSET= option, 3751 HREF= option, 3751 HREFLABELS= option, 3751 HREFLABPOS= option, 3751 HUPPER= option, 3751 INBORDER option, 3752 LFIT option, 3752 LGRID option, 3752 LHREF= option, 3752 LVREF= option, 3752 NAME= option, 3752 NOCONF option, 3752 NODATA option, 3752 NOFIT option, 3753 NOFRAME option, 3753 NOGRID option, 3753 NOHLABEL option, 3753 NOHTICK option, 3753 NOTHRESH option, 3753 NOVLABEL option, 3753
NOVTICK option, 3753 options, 3747 THRESHLABPOS= option, 3753 VAR= option, 3747 VAXIS= option, 3753 VAXISLABEL= option, 3753 VLOWER= option, 3753 VREF= option, 3754 VREFLABELS= option, 3754 VREFLABPOS= option, 3754 VUPPER= option, 3754 WAXIS= option, 3754 WFIT= option, 3754 WGRID= option, 3754 WREFL= option, 3754 PROBIT procedure, PROC PROBIT statement, 3711 COVOUT option, 3711 DATA= option, 3712 GOUT= option, 3712 HPROB= option, 3712 INEST= option, 3712 INVERSECL option, 3712 LACKFIT option, 3712 LOG option, 3713 LOG10 option, 3713 NAMELEN= option, 3713 NOPRINT option, 3713 OPTC option, 3711, 3713 ORDER= option, 3713 OUTEST= option, 3714 XDATA= option, 3714 PROBIT procedure, WEIGHT statement, 3754 probit-normal-binomial example NLMIXED procedure, 3114 probit-normal-ordinal example NLMIXED procedure, 3118 PROBPLOT statement LIFEREG procedure, 2102 PROBT option PROC CANCORR statement, 761 PROBTYPE= option ONESAMPLEMEANS statement (POWER), 3436 PAIREDMEANS statement (POWER), 3452 TWOSAMPLEMEANS statement (POWER), 3468 PROBWIDTH= option ONESAMPLEMEANS statement (POWER), 3436 PAIREDMEANS statement (POWER), 3453 TWOSAMPLEMEANS statement (POWER), 3469 PROC ACECLUS statement, See ACECLUS procedure PROC ANOVA statement, See ANOVA procedure PROC BOXPLOT statement, See BOXPLOT procedure PROC CALIS statement,
See CALIS procedure PROC CANCORR statement, See CANCORR procedure PROC CANDISC statement, See CANDISC procedure PROC CATMOD statement, See CATMOD procedure PROC CLUSTER statement, See CLUSTER procedure PROC CORRESP statement, See CORRESP procedure PROC DISCRIM statement, See DISCRIM procedure PROC DISTANCE statement, See DISTANCE procedure PROC FACTOR statement, See FACTOR procedure PROC FASTCLUS statement, See FASTCLUS procedure PROC FREQ statement, See FREQ procedure PROC GAM statement, See GAM procedure PROC GENMOD statement, See GENMOD procedure PROC GLM statement, See GLM procedure PROC GLMMOD statement, See GLMMOD procedure PROC GLMPOWER statement, See GLMPOWER procedure PROC INBREED statement, See INBREED procedure PROC KDE statement, See KDE procedure PROC KRIGE2D statement, See KRIGE2D procedure PROC LATTICE statement, See LATTICE procedure PROC LIFEREG statement, See LIFEREG procedure PROC LIFETEST statement, See LIFETEST procedure PROC LOESS statement, See LOESS procedure PROC LOGISTIC statement, See LOGISTIC procedure PROC MDS statement, See MDS procedure PROC MI statement, See MI procedure PROC MIANALYZE statement, See MIANALYZE procedure PROC MIXED statement, See MIXED procedure PROC MODECLUS statement, See MODECLUS procedure PROC MULTTEST statement, See MULTTEST procedure
PROC NESTED statement, See NESTED procedure PROC NLIN statement, See NLIN procedure PROC NPAR1WAY statement, See NPAR1WAY procedure PROC ORTHOREG statement, See ORTHOREG procedure PROC PHREG statement, See PHREG procedure PROC PLAN statement, See PLAN procedure PROC PLS statement, See PLS procedure PROC POWER statement, See POWER procedure PROC PRINCOMP statement, See PRINCOMP procedure PROC PRINQUAL statement, See PRINQUAL procedure PROC PROBIT statement, See PROBIT procedure PROC REG statement, See REG procedure PROC ROBUSTREG statement, See ROBUSTREG procedure PROC RSREG statement, See RSREG procedure PROC SCORE statement, See SCORE procedure PROC SIM2D statement, See SIM2D procedure PROC STDIZE statement, See STDIZE procedure PROC STEPDISC statement, See STEPDISC procedure PROC SURVEYFREQ statement, 4193, See SURVEYFREQ procedure PROC SURVEYLOGISTIC statement, See SURVEYLOGISTIC procedure PROC SURVEYMEANS statement, See SURVEYMEANS procedure PROC SURVEYREG statement, See SURVEYREG procedure PROC SURVEYSELECT statement, 4430, See SURVEYSELECT procedure PROC TPHREG statement, See TPHREG procedure PROC TPSPLINE statement, See TPSPLINE procedure PROC TRANSREG statement, See TRANSREG procedure PROC TREE statement, See TREE procedure PROC TTEST statement, 4780, See TTEST procedure PROC VARCLUS statement, See VARCLUS procedure PROC VARCOMP statement,
See VARCOMP procedure PROC VARIOGRAM statement, See VARIOGRAM procedure Procrustean method, 1291 Procrustes rotation, 1318 producing monotone missingness MI procedure, 2552 product-limit estimate LIFETEST procedure, 2149, 2171, 2185 PROFILE keyword REPEATED statement (ANOVA), 448 profile likelihood confidence intervals GENMOD procedure, 1666 PROFILE option MODEL statement (CATMOD), 846 REPEATED statement (GLM), 1779, 1832 profile, population and response, 72, 73 CATMOD procedure, 820 PROFILE= option FACTORS statement (CATMOD), 837 PROC CORRESP statement, 1078 REPEATED statement (CATMOD), 851 profiling residual variance MIXED procedure, 2772 progeny INBREED procedure, 1976, 1978, 1980, 1988 programming statements constraints (CALIS), 565, 628, 630, 675 differentiation (CALIS), 589 GENMOD procedure, 1645 NLMIXED procedure, 3081 PHREG procedure, 3220, 3222, 3235, 3236 projected gradient NLMIXED procedure, 3093 projected Hessian NLMIXED procedure, 3093 promax method, 1291, 1318 PROPENSITY option MONOTONE statement (MI), 2533 propensity score method MI procedure, 2543, 2565 proportion estimation SURVEYMEANS procedure, 4341 PROPORTION= option ONESAMPLEFREQ statement (POWER), 3431 PROC ACECLUS statement, 406 PROC FACTOR statement, 1316 PROC VARCLUS statement, 4811 proportional hazards model assumption (PHREG), 3220 distribution (LIFEREG), 2111 PHREG procedure, 3215, 3228 proportional odds model SURVEYLOGISTIC procedure, 4284 proportional rates/means model, See rate/mean model PROPORTIONALHAZARDS option ASSESS statement (PHREG), 3223 PROPORTIONDIFF= option
TWOSAMPLEFREQ statement (POWER), 3460 proportions SURVEYFREQ procedure, 4210 PROPVARREDUCTION= option POWER statement (GLMPOWER), 1941 prospective power, 1929, 3411 prospective study, 1540 proximity data MDS procedure, 2471, 2478, 2484 proximity measures available methods for computing (DISTANCE), 1257 formulas (DISTANCE), 1270 PS destination ODS Graphics, 326, 335 PSCALE MODEL statement (GENMOD), 1642 PSEARCH option PRIOR statement (MIXED), 2711 pseudo F and t statistics CLUSTER procedure, 972 PSEUDO= option PROC CLUSTER statement, 972 PSHORT option PROC CALIS statement, 587 PSPLINE transformation MODEL statement (TRANSREG), 4561 PSSCP option PROC CANDISC statement, 792 PROC DISCRIM statement, 1151 PROC STEPDISC statement, 4167 PSTAT option PROC STDIZE statement, 4133 PSUMMARY option PROC CALIS statement, 587 PTRANS option PRIOR statement (MIXED), 2712 PROC MDS statement, 2484 PVAL= option PROC PLS statement, 3375 PVALS option PROC MULTTEST statement, 2942 PWEIGHT option PROC CALIS statement, 587
Q Q option RANDOM statement (GLM), 1777, 1834 Q-Q plots graph template definitions, 366 plots, ODS Graphics, 361, 363, 368, 370, 372, 373, 375, 377–379 REG procedure, 3917, 3953 QFAC option PROC NLMIXED statement, 3071 QMAX option PROC NLMIXED statement, 3071 QPOINT transformation
MODEL statement (TRANSREG), 4561 QPOINTS option PROC NLMIXED statement, 3071 QQ option PROC ACECLUS statement, 406 QSCALEFAC option PROC NLMIXED statement, 3071 QTOL option PROC NLMIXED statement, 3071 quadratic discriminant function, 1139 quadratic forms for fixed effects displaying (GLM), 1777 quadratic regression, 1738 quadrature options NLMIXED procedure, 3071 qualitative variables, 72, See classification variables REG procedure, 3943 quantal response data, 3705 quantification method CORRESP procedure, 1069 quantile computation STDIZE procedure, 4119, 4139 QUANTILES keyword OUTPUT statement (LIFEREG), 2101 quartimax method, 1291, 1317, 1318 quartimin method, 1291, 1319 quasi-complete separation LOGISTIC procedure, 2339 SURVEYLOGISTIC procedure, 4278 quasi-independence model, 919 quasi-inverse, 1164 quasi-likelihood GENMOD procedure, 1659 quasi-Newton, 3073 quasi-Newton algorithm CALIS procedure, 578–581, 665, 666
R R convergence measure, 3030 R matrix MIXED procedure, 2663, 2732, 2733 R option MODEL statement (REG), 3829 REPEATED statement (MIXED), 2720 R-notation, 1794 R-square statistic CLUSTER procedure, 972 LOGISTIC procedure, 2315, 2342 SURVEYLOGISTIC procedure, 4264, 4280 R2 improvement REG procedure, 3874, 3875 R2 selection REG procedure, 3875 R= option and other options (CLUSTER), 969, 970, 972 PROC CLUSTER statement, 972 PROC DISCRIM statement, 1151 PROC MODECLUS statement, 2869
PROC SURVEYLOGISTIC statement, 4251 PROC SURVEYMEANS statement, 4324 PROC SURVEYREG statement, 4374 radius of sphere of support, 972 RADIUS= option NLOPTIONS statement (CALIS), 614 PREDICT statement (KRIGE2D), 2042 PROC CALIS statement, 583 PROC FASTCLUS statement, 1388 RIDGE statement (RSREG), 4044 RAM model CALIS procedure, 553, 596 specification, 560 structural model example (CALIS), 557, 563 RAM statement, CALIS procedure, 596 random coefficients example (MIXED), 2788, 2810 random effects expected mean squares, 1833 GLM procedure, 1776, 1833 MIXED procedure, 2663, 2712 NESTED procedure, 2990 NLMIXED procedure, 3079 VARCOMP procedure, 4831, 4837 random effects model, See also nested design VARCOMP procedure, 4837 random initializations TRANSREG procedure, 4602 random number generators MI procedure, 2520 PLAN procedure, 3339 RANDOM option OUTPUT statement (PLAN), 3344 random sampling, see also SURVEYSELECT procedure RANDOM statement GLM procedure, 1776 MIXED procedure, 2712 NLMIXED procedure, 3079 random-effects parameters MIXED procedure, 2662, 2732 RANDOM= option PROC CALIS statement, 590 PROC FACTOR statement, 1316 PROC FASTCLUS statement, 1394 PROC MDS statement, 2484 PROC VARCLUS statement, 4812 randomization of designs using PLAN procedure, 3351 randomized complete block design example, 1847 range KRIGE2D procedure, 2047 RANGE= option MODEL statement (KRIGE2D), 2044 SIMULATE statement (SIM2D), 4104 rank correlation LOGISTIC procedure, 2350
Syntax Index SURVEYLOGISTIC procedure, 4292 rank order typal analysis, See complete linkage RANK procedure, 21 order statistics, 21 rank scores, 1469 rank tests NPAR1WAY procedure, 3163 RANK transformation MODEL statement (TRANSREG), 4563 TRANSFORM statement (PRINQUAL), 3662 RANKSCORE= option PROC DISTANCE statement, 1261 Rao-Scott chi-square test SURVEYFREQ procedure, 4216 Rao-Scott likelihood ratio test SURVEYFREQ procedure, 4219 rate function PHREG procedure, 3243, 3253 rate/mean model PHREG procedure, 3243, 3253 RATE= option PROC SURVEYFREQ statement, 4194 PROC SURVEYLOGISTIC statement, 4251 PROC SURVEYMEANS statement, 4324 PROC SURVEYREG statement, 4374 ratio SURVEYMEANS procedure, 4330, 4339 ratio analysis SURVEYMEANS procedure, 4330, 4339, 4348 ratio level of measurement DISTANCE procedure, 1250 RATIO option PROC MIXED statement, 2680, 2750 RATIO statement SURVEYMEANS procedure, 4330 RATIO= option MODEL statement (KRIGE2D), 2044 SIMULATE statement (SIM2D), 4104 RATIOS option PARMS statement (MIXED), 2708 RANDOM statement (MIXED), 2714 raw residuals GENMOD procedure, 1669 RC option REPEATED statement (MIXED), 2720 RCI option REPEATED statement (MIXED), 2720 RCONVERGE= option FACTOR statement (CALIS), 607 PROC FACTOR statement, 1316 RCORR option REPEATED statement (MIXED), 2720 RDF= option PROC CALIS statement, 572 PROC CANCORR statement, 761 RDIF1 option OUTPUT statement (FREQ), 1449 RDIF2 option
5085
OUTPUT statement (FREQ), 1449 RDPREFIX= option OUTPUT statement (TRANSREG), 4591 READ function RESPONSE statement (CATMOD), 853 receiver operating characteristic LOGISTIC procedure, 2378 reciprocal averaging CORRESP procedure, 1069 reciprocal causation CALIS procedure, 585 rectangular lattice LATTICE procedure, 2069 rectangular table SURVEYMEANS procedure, 4324 recurrent events PHREG procedure, 3216, 3224–3226, 3243 RED option PROC CANCORR statement, 761 reduced rank regression, 3367 PLS procedure, 3381 REDUCEOUT option PROC LIFETEST statement, 2164 SURVIVAL statement (LIFETEST), 2168 reduction notation, 1794 redundancy analysis CANCORR procedure, 752 standardized variables (TRANSREG), 4591 TRANSREG procedure, 4576, 4590, 4593, 4606 REDUNDANCY option PROC CANCORR statement, 761 REDUNDANCY= option OUTPUT statement (TRANSREG), 4591 REF Parameterization SURVEYLOGISTIC procedure, 4272 REF= option CLASS statement (GENMOD), 1630 CLASS statement (LOGISTIC), 2297 CLASS statement (SURVEYLOGISTIC), 4254 CLASS statement (TPHREG), 4478 reference level TRANSREG procedure, 4569 REFERENCE Parameterization SURVEYLOGISTIC procedure, 4272 reference structure, 1298 reference-cell coding TRANSREG procedure, 4569, 4594, 4664, 4666 REFERENCE= option MODEL statement (TRANSREG), 4579 OUTPUT statement (TRANSREG), 4591 references, 1288 referring to graphs examples, ODS Graphics, 330 ODS Graphics, 330 REFINE option PROC ROBUSTREG statement, 3987 REFIT statement, REG procedure, 3852 refitting models REG procedure, 3905
REFLECT option MODEL statement (TRANSREG), 4572 TRANSFORM statement, 3666 reflecting the transformation PRINQUAL procedure, 3666 TRANSREG procedure, 4572 REFPROPORTION= option PAIREDFREQ statement (POWER), 3446 TWOSAMPLEFREQ statement (POWER), 3460 REFRESH= option PROC PRINQUAL statement, 3656 REFSURVEXPHAZARD= option TWOSAMPLESURVIVAL statement (POWER), 3479 REFSURVIVAL= option TWOSAMPLESURVIVAL statement (POWER), 3479 REG option MONOTONE statement (MI), 2532 REG procedure adding variables, 3819 adjusted R2 selection, 3875 alpha level, 3816 annotations, 3816, 3844 ANOVA table, 3918 autocorrelation, 3915 backward elimination, 3800, 3874 collinearity, 3895 compared to other procedures, 1735, 3197 computational methods, 3917 correlation matrix, 3817 covariance matrix, 3817 crossproducts matrix, 3917 delete variables, 3820 deleting observations, 3903 diagnostic statistics, 3896, 3897 dictionary of options, 3844 forward selection, 3800, 3873 graphics, 3923 graphics examples, 3948 graphics keywords and options, 3841, 3843 graphics plots, high-resolution, 3840 heteroscedasticity, testing, 3910 hypothesis tests, 3832, 3858 incomplete principal components, 3818, 3828 influence statistics, 3898 input data sets, 3860 interactive analysis, 3812, 3869 introductory example, 3800 IPC analysis, 3818, 3828, 3916 line printer plots, 3848, 3882 Mallows’ Cp selection, 3875 missing values, 3859 model fit summary statistics, 3896 model selection, 3800, 3873, 3876, 3877, 3924 multicollinearity, 3895 multivariate tests, 3910 new regressors, 3860
non-full-rank models, 3893 ODS graph names, 3923 ODS table names, 3920 output data sets, 3863, 3868 P-P plots, 3917, 3953 painting line-printer plots, 3889 parameter estimates, 3877, 3919 partial regression leverage plots, 3901 plot keywords and options, 3841, 3843, 3844 plots, high-resolution, 3840 polynomial regression, 3804 predicted values, 3876, 3879, 3924 Q-Q plots, 3917, 3953 qualitative variables, 3943 R2 improvement, 3874, 3875 R2 selection, 3875 refitting models, 3905 residual values, 3879 restoring weights, 3906 reweighting observations, 3903 ridge regression, 3818, 3829, 3848, 3916, 3956 singularities, 3917 stepwise selection, 3800, 3874 summary statistics, 3896 sweep algorithm, 3917 syntax, 3813 time series data, 3915 variance inflation factors (VIF), 3818, 3958 REG procedure, ADD statement, 3819 REG procedure, BY statement, 3819 REG procedure, DELETE statement, 3820 REG procedure, FREQ statement, 3820 REG procedure, ID statement, 3821 REG procedure, MODEL statement, 3821 ACOV option, 3823 ADJRSQ option, 3824 AIC option, 3824 ALL option, 3824 ALPHA= option, 3824 B option, 3824 BEST= option, 3824 BIC option, 3824 CLB option, 3825 CLI option, 3825 CLM option, 3825 COLLIN option, 3825 COLLINOINT option, 3825 CORRB option, 3825 COVB option, 3825 CP option, 3825 DETAILS option, 3825 DW option, 3826 DWPROB option, 3826 EDF option, 3826 GMSEP option, 3826 GROUPNAMES= option, 3826 I option, 3827 INCLUDE= option, 3827 INFLUENCE option, 3827
JP option, 3827 MAXSTEP option, 3827 MSE option, 3827 NOINT option, 3827 NOPRINT option, 3827 OUTSEB option, 3828 OUTSTB option, 3828 OUTVIF option, 3828 P option, 3828 PARTIAL option, 3828 PARTIALR2 option, 3828 PC option, 3828 PCOMIT= option, 3828 PCORR1 option, 3829 PCORR2 option, 3829 PRESS option, 3829 R option, 3829 RIDGE= option, 3829 RMSE option, 3829 RSQUARE option, 3829 SBC option, 3829 SCORR1 option, 3830 SCORR2 option, 3830 SELECTION= option, 3800, 3830 SEQB option, 3830 SIGMA= option, 3830 SINGULAR= option, 3830 SLENTRY= option, 3831 SLSTAY= option, 3831 SP option, 3831 SPEC option, 3831 SS1 option, 3831 SS2 option, 3831 SSE option, 3831 START= option, 3831 STB option, 3831 STOP= option, 3832 TOL option, 3832 VIF option, 3832 XPX option, 3832 REG procedure, MTEST statement, 3832 CANPRINT option, 3833 DETAILS option, 3833 PRINT option, 3833 REG procedure, OUTPUT statement, 3833 keyword= option, 3834 OUT= option, 3834 REG procedure, PAINT statement, 3835 ALLOBS option, 3837 NOLIST option, 3838 RESET option, 3838 STATUS option, 3838 SYMBOL= option, 3838 UNDO option, 3838 REG procedure, PLOT statement, 3839 AIC option, 3844 ANNOTATE= option, 3844 BIC option, 3844 CAXIS= option, 3844 CFRAME= option, 3844 CHOCKING= option, 3845 CHREF= option, 3845 CLEAR option, 3849 CLINE= option, 3845 CMALLOWS= option, 3845 COLLECT option, 3849 CONF option, 3845 CP option, 3845 CTEXT= option, 3845 CVREF= option, 3846 DESCRIPTION= option, 3846 EDF option, 3846 GMSEP option, 3846 HAXIS= option, 3846 HPLOTS= option, 3849 HREF= option, 3846 IN option, 3846 JP option, 3846 LEGEND= option, 3846 LHREF= option, 3846 LLINE= option, 3846 LVREF= option, 3846 MODELFONT option, 3846 MODELHT option, 3847 MODELLAB option, 3847 MSE option, 3847 NAME= option, 3847 NOCOLLECT option, 3850 NOLINE option, 3847 NOMODEL option, 3847 NOSTAT option, 3847 NP option, 3847 OVERLAY option, 3847, 3850 PC option, 3847 PRED option, 3847 RIDGEPLOT option, 3848 SBC option, 3848 SP option, 3848 SSE option, 3848 STATFONT option, 3848 STATHT option, 3848 summary of options, 3841, 3843 SYMBOL= option, 3850 USEALL option, 3848 VAXIS= option, 3848 VPLOTS= option, 3851 VREF= option, 3848 REG procedure, PRINT statement, 3851 REG procedure, PROC REG statement, 3815 ALL option, 3816 ALPHA= option, 3816 ANNOTATE= option, 3816 CORR option, 3817 COVOUT option, 3817 DATA= option, 3817 EDF option, 3817 GOUT= option, 3817 LINEPRINTER option, 3817
CFRAME= option, 3844 CHOCKING= option, 3845 CHREF= option, 3845 CLEAR option, 3849 CLINE= option, 3845 CMALLOWS= option, 3845 COLLECT option, 3849 CONF option, 3845 CP option, 3845 CTEXT= option, 3845 CVREF= option, 3846 DESCRIPTION= option, 3846 EDF option, 3846 GMSEP option, 3846 HAXIS= option, 3846 HPLOTS= option, 3849 HREF= option, 3846 IN option, 3846 JP option, 3846 LEGEND= option, 3846 LHREF= option, 3846 LLINE= option, 3846 LVREF= option, 3846 MODELFONT option, 3846 MODELHT option, 3847 MODELLAB option, 3847 MSE option, 3847 NAME= option, 3847 NOCOLLECT option, 3850 NOLINE option, 3847 NOMODEL option, 3847 NOSTAT option, 3847 NP option, 3847 OVERLAY option, 3847, 3850 PC option, 3847 PRED option, 3847 RIDGEPLOT option, 3848 SBC option, 3848 SP option, 3848 SSE option, 3848 STATFONT option, 3848 STATHT option, 3848 summary of options, 3841, 3843 SYMBOL= option, 3850 USEALL option, 3848 VAXIS= option, 3848 VPLOTS= option, 3851 VREF= option, 3848 REG procedure, PRINT statement, 3851 REG procedure, PROC REG statement, 3815 ALL option, 3816 ALPHA= option, 3816 ANNOTATE= option, 3816 CORR option, 3817 COVOUT option, 3817 DATA= option, 3817 EDF option, 3817 GOUT= option, 3817 LINEPRINTER option, 3817
5087
NOPRINT option, 3817 OUTEST= option, 3817 OUTSEB option, 3817 OUTSSCP= option, 3818 OUTSTB option, 3818 OUTVIF option, 3818 PCOMIT= option, 3818 PLOTS option, 3923 PLOTS(MAXPOINTS=) option, 3923 PRESS option, 3818 RIDGE= option, 3818 RSQUARE option, 3819 SIMPLE option, 3819 SINGULAR= option, 3819 TABLEOUT option, 3819 UNPACKPANELS option, 3923 USSCP option, 3819 REG procedure, REFIT statement, 3852 REG procedure, RESTRICT statement, 3852 REG procedure, REWEIGHT statement, 3854 ALLOBS option, 3856 NOLIST option, 3856 RESET option, 3856 STATUS option, 3857 UNDO option, 3857 WEIGHT= option, 3857 REG procedure, TEST statement, 3858 PRINT option, 3858 REG procedure, VAR statement, 3859 REG procedure, WEIGHT statement, 3859 REGPMM option MONOTONE statement (MI), 2532 REGPREDMEANMATCH option MONOTONE statement (MI), 2532 regression analysis (REG), 3799 CATMOD procedure, 815 examples (GLM), 1853 ill-conditioned data, 3197 MODEL statements (GLM), 1785 nonparametric, 4497 ORTHOREG procedure, 3197 partial least squares (PROC PLS), 3367, 3380 power and sample size (POWER), 3421, 3425, 3500, 3556 principal components (PROC PLS), 3367, 3381 quadratic (GLM), 1738 reduced rank (PROC PLS), 3367, 3381 semiparametric models, 4497, 4500 smoothing splines, 4497 regression coefficients CANCORR procedure, 759 covariance (SURVEYREG), 4392 SURVEYREG procedure, 4384, 4392 using with SCORE procedure, 4066 regression diagnostics LOGISTIC procedure, 2359 regression effects MIXED procedure, 2743
model parameterization (GLM), 1787 specifying (GLM), 1784 regression estimator SURVEYREG procedure, 4400, 4407 regression method MI procedure, 2541, 2565 REGRESSION option MONOTONE statement (MI), 2532 regression parameter estimates, example SCORE procedure, 4081 regression table TRANSREG procedure, 4580 regressor effects GENMOD procedure, 1660 REGWQ option MEANS statement (ANOVA), 444 MEANS statement (GLM), 1768 REITERATE option MODEL statement (TRANSREG), 4579 PROC PRINQUAL statement, 3656 rejection sampling MIXED procedure, 2711 relative efficiency MI procedure, 2562 MIANALYZE procedure, 2626 relative increase in variance MI procedure, 2562 MIANALYZE procedure, 2625 relative paths, 337 examples, ODS Graphics, 359 relative risk, 1503 cohort studies, 1489 logit estimate, 1507 Mantel-Haenszel estimate, 1507 power and sample size (POWER), 3457, 3462, 3523, 3524 RELATIVERISK= option TWOSAMPLEFREQ statement (POWER), 3460 RELRISK option OUTPUT statement (FREQ), 1449 TABLES statement (FREQ), 1460, 1535 REML, See restricted maximum likelihood renaming and reusing variables PRINQUAL procedure, 3666 REORDER option PROC FACTOR statement, 1316 REP= option PROC SURVEYSELECT statement, 4438 REPEAT option PLOT statement (BOXPLOT), 508 repeated measures ANOVA procedure, 446 CATMOD procedure, 815, 850, 873 contrasts (GLM), 1779 data organization (GLM), 1825 doubly-multivariate design, 1886 examples (CATMOD), 925, 930, 933, 937
Syntax Index examples (GLM), 1781, 1877 GEE (GENMOD), 1611, 1672 GLM procedure, 1777, 1825 hypothesis tests (GLM), 1827, 1830 MIXED procedure, 2662, 2716, 2782 more than one factor (ANOVA), 446, 449 more than one factor (GLM), 1830 multiple populations (CATMOD, 875 one population (CATMOD), 873 RESPONSE statement (CATMOD), 873 specifying factors (CATMOD), 851 transformations, 1830–1832 REPEATED statement ANOVA procedure, 446 CATMOD procedure, 850 GENMOD procedure, 1621, 1646 GLM procedure, 1777 MIXED procedure, 2716, 2783 REPLACE option OUTPUT statement (TRANSREG), 4592 PROC DISTANCE statement, 1262 PROC PRINQUAL statement, 3656 PROC STDIZE statement, 4133 REPLACE= option PROC FASTCLUS statement, 1394 replaying output examples, ODS Graphics, 361, 363, 369 REPLICATE statement NLMIXED procedure, 3080 replicate subjects NLMIXED procedure, 3080 replicated sampling SURVEYSELECT procedure, 4438, 4460 REPONLY option PROC DISTANCE statement, 1262 PROC STDIZE statement, 4133 requesting graphs ODS Graphics, 321, 324 RESAMPLE option ASSESS statement (PHREG), 3224 RESAMPLE= option ASSESS statement (PHREG), 3224 resampled data sets MULTTEST procedure, 2962 RESCHI= option OUTPUT statement (LOGISTIC), 2322 RESDEV= option OUTPUT statement (LOGISTIC), 2322 RESET option ODS GRAPHICS statement, 350 PAINT statement (REG), 3838 REWEIGHT statement (REG), 3856 reseting index counter ODS Graphics, 335 residual chi-square PHREG procedure, 3231 RESIDUAL keyword OUTPUT statement (GLM), 1774 OUTPUT statement (ROBUSTREG), 3991
5089
residual maximum likelihood (REML) MIXED procedure, 2738, 2779 RESIDUAL option MIXED procedure, MODEL statement, 2764 MODEL statement (LOESS), 2234 MODEL statement (MIXED), 2704 MODEL statement (RSREG), 4043 PROC SCORE statement, 4072 Residual plots MIXED procedure, 2758 residual plots plots, ODS Graphics, 322 RESIDUAL= option OUTPUT statement (NLIN), 3014 PROC CALIS statement, 587 residuals and partial correlation (PRINCOMP), 3609 CALIS procedure, 650 deviance (PHREG), 3234, 3258, 3302 GENMOD procedure, 1641, 1669, 1670 LOGISTIC procedure, 2360 martingale (PHREG), 3234, 3258, 3302 MDS procedure, 2483, 2487, 2488, 2491, 2503 NLIN procedure, 3014 partial correlation (PRINCOMP), 3608 prefix (CALIS), 601 prefix (TRANSREG), 4591 REG procedure, 3879 Schoenfeld (PHREG), 3234, 3258 score (PHREG), 3234, 3259 weighted Schoenfeld (PHREG), 3235, 3259 weighted score (PHREG), 3260 RESIDUALS option MODEL statement (GENMOD), 1641 OUTPUT statement (TRANSREG), 4592 PROC FACTOR statement, 1316 residuals, details MIXED procedure, 2763 response functions (CATMOD), 852, 854–856, 859, 862, 901, 906, 944 covariance matrix, 842 formulas, 892 identifying with FACTORS statement, 836 predicted values, 845 related to design matrix, 877, 879 variance formulas, 892 RESPONSE – – keyword MODEL statement (CATMOD), 836, 839, 840, 842, 850, 864, 870, 873, 881, 882, 884, 888, 898 response level ordering LOGISTIC procedure, 2290, 2305, 2329, 2330 SURVEYLOGISTIC procedure, 4259, 4269, 4270 response profile, 73 CATMOD procedure, 820 RESPONSE statement CATMOD procedure, 852 response surfaces, 4033
5090
Syntax Index
canonical analysis, interpreting, 4046 covariates, 4050 experiments, 4045 plotting, 4048 ridge analysis, 4047 response variable, 451, 1784 PHREG procedure, 3218, 3235, 3283 sort order of levels (GENMOD), 1626 RESPONSE – – = option FACTORS statement (CATMOD), 837 – RESPONSE– = option REPEATED statement (CATMOD), 851 RESTART option PROC NLMIXED statement, 3072 RESTART= option NLOPTIONS statement (CALIS), 623 restoring weights REG procedure, 3906 RESTRICT statement CATMOD procedure, 859 REG procedure, 3852 restricted maximum likelihood MIXED procedure, 2662, 2738, 2779 restrictions of parameters (CATMOD), 859 resubstitution DISCRIM procedure, 1163 RETAIN statement NLIN procedure, 3016 reticular action model, See RAM model retrospective power, 1929, 3411 reverse response level ordering LOGISTIC procedure, 2290, 2305, 2329, 2330 SURVEYLOGISTIC procedure, 4259, 4269, 4270 REWEIGHT statement, REG procedure, 3854 reweighting observations REG procedure, 3903 RHO= option PROC NLIN statement, 3009 RI option REPEATED statement (MIXED), 2720 ridge analysis RSREG procedure, 4047 ridge regression REG procedure, 3818, 3829, 3848, 3916, 3956 RIDGE statement RSREG procedure, 4043 RIDGE= option MODEL statement (REG), 3829 PROC CALIS statement, 573 PROC MDS statement, 2484 PROC MIXED statement, 2680 PROC REG statement, 3818 RIDGEPLOT option PLOT statement (REG), 3848 ridging MIXED procedure, 2680, 2738
RIDGING= option MODEL statement (LOGISTIC), 2315 MODEL statement (SURVEYLOGISTIC), 4264 ridit scores, 1469 risk set PHREG procedure, 3220, 3240, 3241, 3288 risk weights PHREG procedure, 3229 RISKDIFF option OUTPUT statement (FREQ), 1449 TABLES statement (FREQ), 1460 RISKDIFF1 option OUTPUT statement (FREQ), 1449 RISKDIFF2 option OUTPUT statement (FREQ), 1449 RISKDIFFC option TABLES statement (FREQ), 1460 RISKLIMITS option MODEL statement (LOGISTIC), 2315 MODEL statement (PHREG), 3233 risks and risk differences, 1486 RITER= option FACTOR statement (CALIS), 607 PROC FACTOR statement, 1316 RMSE option MODEL statement (REG), 3829 RMSSTD option PROC CLUSTER statement, 972 RMSSTD statement and FREQ statement (CLUSTER), 974 robust cluster analysis, 1379, 1392 estimators (STDIZE), 4138 ROBUST option COMPUTE statement (VARIOGRAM), 4869 ROBUSTREG procedure, 3971 computational resources, 4012 INEST= data sets, 4011 ODS graph names, 4014 OUTEST= data sets, 4011 output table names, 4012 syntax, 3982 ROBUSTREG procedure, BY statement, 3988 ROBUSTREG procedure, CLASS statement, 3989 ROBUSTREG procedure, ID statement, 3989 ROBUSTREG procedure, MODEL statement, 3989 ALPHA= option, 3989 CORRB option, 3989 COVB option, 3989 CPUCOUNT option, 3991 CUTOFF option, 3989 DIAGNOSTICS option, 3989 ITPRINT option, 3990 LEVERAGE option, 3990 NOGOODFIT option, 3990 NOINT option, 3990 SINGULAR= option, 3990 ROBUSTREG procedure, OPTIONS statement, 3989 ROBUSTREG procedure, OUTPUT statement, 3990
Syntax Index keyword= option, 3991 LEVEARAGE keyword, 3991 OUT= option, 3990 OUTLIER keyword, 3991 PREDICTED keyword, 3991 RESIDUAL keyword, 3991 SRESIDUAL keyword, 3991 STD– ERR keyword, 3991 XBETA keyword, 3991 ROBUSTREG procedure, PERFORMANCE statement, 3991 ROBUSTREG procedure, PROC ROBUSTREG statement, 3983 BIASTEST option, 3987 CHIF option, 3986, 3987 CONVERGENCE option, 3984, 3987 COVARIANCE option, 3984, 3986 COVOUT option, 3983 CSTEP option, 3985 DATA= option, 3983 EFF option, 3986, 3988 H option, 3985 IADJUST option, 3985 INEST= option, 3983 INITEST option, 3988 INITH option, 3988 ITPRINT option, 3983 K0 option, 3988 MAXITER= option, 3984, 3987, 3988 NAMELEN= option, 3983 NBEST option, 3986 NREP option, 3986, 3987 ORDER= option, 3983 OUTEST= option, 3984 PLOT= option, 4014 REFINE option, 3987 SCALE option, 3984 SUBANALYSIS option, 3986 SUBGROUPSIZE option, 3986 SUBSETSIZE option, 3987 TOLERANCE option, 3987 WEIGHTFUNCTION option, 3985 WLS= option, 3983 ROBUSTREG procedure, TEST statement, 3992 ROBUSTREG procedure, WEIGHT statement, 3992 ROBUSTREG procedure,PROC ROBUSTREG statement COVARIANCE option, 3987 SEED option, 3984 ROC curve LOGISTIC procedure, 2314, 2357 ROCEPS= option MODEL statement (LOGISTIC), 2315 SCORE statement (LOGISTIC), 2325 Roger and Tanimoto coefficient DISTANCE procedure, 1274 Root MSE SURVEYREG procedure, 4388 ROOT= option
5091
PROC TREE statement, 4753 RORDER= option PROC GENMOD statement, 1626 ROTATE= option FACTOR statement (CALIS), 607 PROC FACTOR statement, 1317 rotating principal components, 3611 ROUND option PROC FACTOR statement, 1319 ROUND= option PROC MI statement, 2520 row mean scores statistic, 1502 ROW option TABLES statement (SURVEYFREQ), 4201 row proportions SURVEYFREQ procedure, 4212 ROW= option PROC CORRESP statement, 1079 Roy’s maximum root, 437, 1759, 1828 RP option PROC CORRESP statement, 1079 RPC convergence measure, 3030 RPREFIX= option OUTPUT statement (TRANSREG), 4592 RRC1 option OUTPUT statement (FREQ), 1449 RRC2 option OUTPUT statement (FREQ), 1449 RSK1 option OUTPUT statement (FREQ), 1449 RSK11 option OUTPUT statement (FREQ), 1449 RSK12 option OUTPUT statement (FREQ), 1449 RSK2 option OUTPUT statement (FREQ), 1449 RSK21 option OUTPUT statement (FREQ), 1449 RSK22 option OUTPUT statement (FREQ), 1449 RSQUARE option MODEL statement (LOGISTIC), 2315 MODEL statement (REG), 3829 MODEL statement (SURVEYLOGISTIC), 4264 PROC CLUSTER statement, 972 PROC REG statement, 3819 RSQUAREDIFF= option MULTREG statement (POWER), 3424 RSQUAREFULL= option MULTREG statement (POWER), 3424 RSQUAREREDUCED= option MULTREG statement (POWER), 3425 RSREG procedure coding variables, 4047, 4052 compared to other procedures, 1735, 4033 computational methods, 4050 confidence intervals, 4042, 4043 Cook’s D influence statistic, 4042 covariates, 4034
eigenvalues, 4050 eigenvectors, 4050 factor variables, 4034 input data sets, 4040, 4041 introductory example, 4034 missing values, 4048 ODS table names, 4054 output data sets, 4040, 4044, 4051, 4052 PRESS statistic, 4042 response variables, 4034 ridge analysis, 4047 syntax, 4039 RSREG procedure, BY statement, 4040 RSREG procedure, ID statement, 4041 RSREG procedure, MODEL statement, 4041 ACTUAL option, 4041 BYOUT option, 4041 COVAR= option, 4042 D option, 4042 L95 option, 4042 L95M option, 4042 LACKFIT option, 4042 NOANOVA option, 4042 NOCODE option, 4042 NOOPTIMAL option, 4042 NOPRINT option, 4042 PREDICT option, 4042 PRESS option, 4042 RESIDUAL option, 4043 U95 option, 4043 U95M option, 4043 RSREG procedure, PROC RSREG statement, 4039 DATA= option, 4040 NOPRINT option, 4040 OUT= option, 4040 RSREG procedure, RIDGE statement, 4043 CENTER= option, 4043 MAXIMUM option, 4044 MINIMUM option, 4044 NOPRINT option, 4044 OUTR= option, 4044 RADIUS= option, 4044 RSREG procedure, WEIGHT statement, 4044 RSTUDENT keyword OUTPUT statement (GLM), 1774 RTF destination ODS Graphics, 326, 335 RTF output examples, ODS Graphics, 356, 360 ODS Graphics, 326, 338 Rtf style ODS styles, 346 RUPDATE= option REPEATED statement (GENMOD), 1648 Russell and Rao similarity coefficient DISTANCE procedure, 1276 Ryan’s multiple range test, 444, 1768, 1815 examples, 1851
S S convergence measure, 3030 S option PROC CANCORR statement, 761 saddle test, definition MODECLUS procedure, 2879 salience of loadings, FACTOR procedure, 1294, 1327 SALPHA= option NLOPTIONS statement (CALIS), 614 PROC CALIS statement, 583 Sampford’s method SURVEYSELECT procedure, 4455 sample design SURVEYFREQ procedure, 4203 sample selection methods SURVEYSELECT procedure, 4446 sample size CATMOD procedure, 887 overview of power concepts (POWER), 3488 See GLMPOWER procedure, 1929 See POWER procedure, 3411 SURVEYSELECT procedure, 4440 sample size adjustment GLMPOWER procedure, 1946 POWER procedure, 3494 sample survey analysis, ordinal data, 816 sampling fraction, 165 sampling frame, 164 sampling rate, 165 SURVEYFREQ procedure, 4194, 4204 SURVEYLOGISTIC procedure, 4251, 4280 SURVEYMEANS procedure, 4324, 4334 SURVEYREG procedure, 4374, 4382 SURVEYSELECT procedure, 4439 sampling unit, 164 sampling weight, 165 sampling zeros and log-linear analyses (CATMOD), 871 and structural zeros (CATMOD), 888 SAMPRATE= option PROC SURVEYSELECT statement, 4439 SAMPSIZE= option PROC SURVEYSELECT statement, 4440 sandwich estimator MIXED procedure, 2676 SAS Companion, 327, 336, 348 SAS current folder, 327 SAS data set DATA step, 21 summarizing, 21, 22 SAS Registry, 327, 346 SAS Registry Editor, 346 SAS Results Viewer, 327, 336 SAS/ETS software, 22 SAS/GRAPH software, 22 SAS/IML software, 22 SAS/INSIGHT software, 22 SAS/OR software, 23 SAS/QC software, 23
Satterthwaite method MIXED procedure, 2694 Satterthwaite t test power and sample size (POWER), 3463, 3472, 3527 Satterthwaite’s approximation, 4775, 4785 testing random effects, 1835 SAVAGE option EXACT statement (NPAR1WAY), 3160 OUTPUT statement (NPAR1WAY), 3162 PROC NPAR1WAY statement, 3157 Savage scores NPAR1WAY procedure, 3167 SAVE option PROC NLIN statement, 3009 saving graphics image files ODS Graphics, 336 saving templates examples, ODS Graphics, 368 sawtooth power function, 3541 SBC option MODEL statement (REG), 3829 PLOT statement (REG), 3848 scale estimates FASTCLUS procedure, 1390, 1392, 1397, 1399, 1400 SCALE option PROC ROBUSTREG statement, 3984 scale parameter GENMOD procedure, 1653 SCALE= option MODEL statement (GENMOD), 1642 MODEL statement (KRIGE2D), 2044 MODEL statement (LIFEREG), 2099 MODEL statement (LOESS), 2234 MODEL statement (LOGISTIC), 2316 MODEL statement (PROBIT), 3745 SIMULATE statement (SIM2D), 4104 Scaled Residual MIXED procedure, 2705 Scaled residuals MIXED procedure, 2764 SCALEDINDEP option MODEL statement (LOESS), 2234 Score statement (LOESS), 2237 scaling variables DISTANCE procedure, 1251 MODECLUS procedure, 2856 STDIZE procedure, 4143 scalogram analysis CORRESP procedure, 1069 scatter plots plots, ODS Graphics, 343, 351 Scheffé’s multiple-comparison procedure, 444, 1769, 1810 SCHEFFE option MEANS statement (ANOVA), 444 MEANS statement (GLM), 1769
Schoenfeld residuals PHREG procedure, 3234, 3258 Schwarz criterion LOGISTIC procedure, 2341 SURVEYLOGISTIC procedure, 4279 Schwarz’s Bayesian information criterion example (MIXED), 2780, 2794, 2823 MIXED procedure, 2676, 2740, 2750 SCORE option PROC FACTOR statement, 1319 SCORE procedure CALIS procedure, 571, 586, 643 computational resources, 4075 examples, 4067, 4075 input data set, 4072 OUT= data sets, 4075 output data set, 4072, 4075 PRINCOMP procedure, 3611 regression parameter estimates from REG procedure, 4074 scoring coefficients, 4065 syntax, 4071 SCORE procedure, BY statement, 4073 SCORE procedure, ID statement, 4074 SCORE procedure, PROC SCORE statement, 4072 DATA= option, 4072 NOSTD option, 4072 OUT= option, 4072 PREDICT option, 4072 RESIDUAL option, 4072 SCORE= option, 4072 TYPE= option, 4072 SCORE procedure, VAR statement, 4074 score residuals PHREG procedure, 3234, 3259 SCORE statement LOESS procedure, 2236 SCORE statement, GAM procedure, 1569 SCORE statement, TPSPLINE procedure, 4510 score statistics GENMOD procedure, 1668 LOGISTIC procedure, 2343 SURVEYLOGISTIC procedure, 4287 score test PHREG procedure, 3229, 3231, 3246, 3269, 3274, 3276 TPHREG procedure, 4486 score variables interpretation (SCORE), 4074 SCORE= option PROC SCORE statement, 4072 scores NPAR1WAY procedure, 3166 SCORES option PROC PRINQUAL statement, 3656 SCORES= option EXACT statement (NPAR1WAY), 3160 OUTPUT statement (NPAR1WAY), 3162 PROC NPAR1WAY statement, 3157
TABLES statement (FREQ), 1461, 1547 scoring MIXED procedure, 2674, 2680, 2774 scoring coefficients (SCORE), 4065 SCORING= option MODEL statement (GENMOD), 1642 PROC MIXED statement, 2680 SCOROUT option TABLES statement (FREQ), 1461 SCORR option EXACT statement (FREQ), 1444 OUTPUT statement (FREQ), 1449 TEST statement (FREQ), 1463 SCORR1 option MODEL statement (REG), 3830 SCORR2 option MODEL statement (REG), 3830 SCREE option PROC FACTOR statement, 1319 scree plot, 1336 screening design, analysis, 1895 screening experiments GLMMOD procedure, 1923 SCWGT statement GENMOD procedure, 1650 SDF, See survival distribution function SE option PROC FACTOR statement, 1319 SEB option PROC CANCORR statement, 761 SEED option PROC MI statement, 2520 PROC NLMIXED statement, 3072 PROC PLAN statement, 3339 PROC ROBUSTREG statement (ROBUSTREG), 3984 SEED statement VARCLUS procedure, 4814 SEED= option ASSESS statement (PHREG), 3224 EXACT statement (FREQ), 1445 EXACT statement (NPAR1WAY), 3161 PRIOR statement (MIXED), 2712 PROC FASTCLUS statement, 1394 PROC MULTTEST statement, 2942 PROC PLS statement, 3374, 3375 PROC SURVEYSELECT statement, 4441 SIMULATE statement (SIM2D), 4104 SELECT= modifier INFLUENCE option, MODEL statement (MIXED), 2699 SELECT= option MODEL statement (LOESS), 2234 SELECTALL option PROC SURVEYSELECT statement, 4442 selecting graphs examples, ODS Graphics, 330, 352, 361 ODS Graphics, 330
selection methods, See model selection SELECTION= option MODEL statement (LOGISTIC), 2317 MODEL statement (PHREG), 3229 MODEL statement (REG), 3830 REG procedure, MODEL statement, 3800 semiparametric model PHREG procedure, 3215 semipartial correlation CANCORR procedure, 762 formula (CLUSTER), 984 semivariogram computation (VARIOGRAM), 4876, 4877 empirical (or experimental) (VARIOGRAM), 4856, 4858 robust (VARIOGRAM), 4861, 4869, 4877 sensitivity CATMOD procedure, 941 SEPARATORS= option MODEL statement (TRANSREG), 4569, 4579 SEQB option MODEL statement (REG), 3830 SEQUENTIAL option MODEL statement (LOGISTIC), 2317 MODEL statement (PHREG), 3230 sequential random sampling SURVEYSELECT procedure, 4448 serpentine sorting SURVEYSELECT procedure, 4445 SFACTOR= option PRIOR statement (MIXED), 2712 Shape distance coefficient DISTANCE procedure, 1271 SHAPE1= option MODEL statement (LIFEREG), 2099 SHAPE= option PROC DISTANCE statement, 1262 PROC MDS statement, 2484 Shewhart control charts, 23 SHORT option MODEL statement (TRANSREG), 4579 PROC ACECLUS statement, 407 PROC CALIS statement, 587 PROC CANCORR statement, 761 PROC CANDISC statement, 792 PROC CORRESP statement, 1079 PROC DISCRIM statement, 1151 PROC FASTCLUS statement, 1395 PROC MODECLUS statement, 2869 PROC STEPDISC statement, 4167 PROC VARCLUS statement, 4812 Sidak (MULTTEST) p-value adjustments, 2942, 2956, 2972 SIDAK option MEANS statement (ANOVA), 444 MEANS statement (GLM), 1769 PROC MULTTEST statement, 2942, 2956, 2972 Sidak’s t test, 1769, 1810
Syntax Index Sidak’s adjustment GLM procedure, 1754 MIXED procedure, 2688 MULTTEST procedure, 2942, 2956, 2972 Sidak’s inequality, 444 SIDES= option ONESAMPLEFREQ statement (POWER), 3431 ONESAMPLEMEANS statement (POWER), 3436 ONEWAYANOVA statement (POWER), 3441 PAIREDFREQ statement (POWER), 3446 PAIREDMEANS statement (POWER), 3453 TWOSAMPLEFREQ statement (POWER), 3461 TWOSAMPLEMEANS statement (POWER), 3469 TWOSAMPLESURVIVAL statement (POWER), 3479 Siegel-Tukey scores NPAR1WAY procedure, 3167 SIGITER option PROC MIXED statement, 2680 SIGMA= option MODEL statement (REG), 3830 significance level CALIS procedure, 588 entry (PHREG), 3230 hazards ratio confidence interval (PHREG), 3233 removal (PHREG), 3230, 3277 significance tests MODECLUS procedure, 2876, 2916 SIGSQ= option PROC NLIN statement, 3009 sill KRIGE2D procedure, 2047 SIM2D procedure, 4091 Cholesky root, 4107 computational details, 4109 conditional and unconditional simulation, 4091 conditional distributions of multivariate normal random variables, 4108 examples, 4092, 4111 Gaussian assumption, 4091 Gaussian random field, 4091 LU decomposition, 4106 memory usage, 4110 output data sets, 4099, 4110 OUTSIM= data set, 4110 quadratic form, 4108 simulation of spatial random fields, 4106–4109 syntax, 4097 SIM2D procedure, COORDINATES statement, 4099 XCOORD= option, 4100 YCOORD= option, 4100 SIM2D procedure, GRID statement, 4100 GRIDDATA= option, 4100 X= option, 4100 XCCORD= option, 4101
Y= option, 4100 YCOORD= option, 4101 SIM2D procedure, MEAN statement, 4105 SIM2D procedure, PROC SIM2D statement, 4099 DATA= option, 4099 NARROW option, 4099 OUTSIM= option, 4099 SIM2D procedure, SIMULATE statement, 4101 ANGLE= option, 4102 FORM= option, 4102 MDATA= option, 4102 NUGGET= option, 4104 NUMREAL= option, 4101 RANGE= option, 4104 RATIO= option, 4104 SCALE= option, 4104 SEED= option, 4104 SINGULAR= option, 4104 VAR= option, 4101 SIMILAR option PROC TREE statement, 4753 SIMILAR= option PROC MDS statement, 2484 similarity data MDS procedure, 2471, 2478, 2484 Similarity Ratio coefficient DISTANCE procedure, 1272 simple cluster-seeking algorithm, 1381 simple effects GLM procedure, 1758, 1817 MIXED procedure, 2691 simple kappa coefficient, 1493, 1495 Simple Matching coefficient DISTANCE procedure, 1274 Simple Matching dissimilarity coefficient DISTANCE procedure, 1274 SIMPLE option PROC CALIS statement, 587 PROC CANCORR statement, 761 PROC CANDISC statement, 792 PROC CLUSTER statement, 972 PROC DISCRIM statement, 1151 PROC FACTOR statement, 1319 PROC LOGISTIC statement, 2294 PROC MI statement, 2521 PROC MODECLUS statement, 2869 PROC PHREG statement, 3223 PROC REG statement, 3819 PROC STEPDISC statement, 4167 PROC VARCLUS statement, 4812 simple random cluster sampling SURVEYREG procedure, 4397 simple random sampling SURVEYMEANS procedure, 4315 SURVEYREG procedure, 4365, 4395 SURVEYSELECT procedure, 4423, 4447 simplicity functions CALIS procedure, 607 FACTOR procedure, 1294, 1316, 1329
SIMPLS method PLS procedure, 3380 SIMULATE statement (SIM2D), 4101 simulation of spatial random fields SIM2D procedure, 4106–4109 simulation-based adjustment GLM procedure, 1754 MIXED procedure, 2688 SING= option PROC CANCORR statement, 762 PROC PRINCOMP statement, 3606 SINGCHOL= option MODEL statement (MIXED), 2704 PROC NLMIXED statement, 3072 SINGHESS= option PROC NLMIXED statement, 3072 single linkage CLUSTER procedure, 967, 982 SINGRES= option MODEL statement (MIXED), 2705 SINGSWEEP= option PROC NLMIXED statement, 3072 SINGULAR option CONTRAST statement (GLM), 1750 LSMEANS statement (GLM), 1758 PROC MI statement, 2521 SINGULAR= option CONTRAST statement (GENMOD), 1632 CONTRAST statement (GLMPOWER), 1937 CONTRAST statement (LOGISTIC), 2300 CONTRAST statement (MIXED), 2684 CONTRAST statement (SURVEYLOGISTIC), 4258 CONTRAST statement (SURVEYREG), 4377 CONTRAST statement (TPHREG), 4481 ESTIMATE statement (GLM), 1752 ESTIMATE statement (MIXED), 2686 ESTIMATE statement (SURVEYREG), 4379 LSMEANS statement (MIXED), 2691 MODEL statement (GENMOD), 1642 MODEL statement (GLM), 1772 MODEL statement (KRIGE2D), 2044 MODEL statement (LIFEREG), 2099 MODEL statement (LOGISTIC), 2317 MODEL statement (MIXED), 2704 MODEL statement (PHREG), 3232 MODEL statement (REG), 3830 MODEL statement (ROBUSTREG), 3990 MODEL statement (SURVEYLOGISTIC), 4264 MODEL statement (TRANSREG), 4579 NLOPTIONS statement (CALIS), 614 PROC ACECLUS statement, 407 PROC CALIS statement, 590 PROC CANCORR statement, 762 PROC CANDISC statement, 792 PROC CORRESP statement, 1079 PROC DISCRIM statement, 1151 PROC FACTOR statement, 1319 PROC LIFETEST statement, 2164
PROC MDS statement, 2485 PROC NLIN statement, 3009 PROC ORTHOREG statement, 3202 PROC PRINCOMP statement, 3606 PROC PRINQUAL statement, 3656 PROC REG statement, 3819 PROC STEPDISC statement, 4167 SIMULATE statement (SIM2D), 4104 singularities MIXED procedure, 2775 REG procedure, 3917 singularity MI procedure, 2521 singularity checking CANCORR procedure, 762 GLM procedure, 1750, 1752, 1758, 1772 singularity criterion CALIS procedure, 590 contrast matrix (GENMOD), 1632 contrast matrix (LOGISTIC), 2300 contrast matrix (SURVEYLOGISTIC), 4258 contrast matrix (TPHREG), 4481 covariance matrix (CALIS), 588, 590, 591 information matrix (GENMOD), 1642 PHREG procedure, 3232 TRANSREG procedure, 4579 singularity level SURVEYREG procedure, 4377, 4379 singularity tolerances NLMIXED procedure, 3072 SINGULARMSG= option PROC KRIGE2D statement, 2039 SINGVAR option PROC NLMIXED statement, 3072 Size distance coefficient DISTANCE procedure, 1271 SIZE statement SURVEYSELECT procedure, 4443 SIZE= modifier INFLUENCE option, MODEL statement (MIXED), 2699 skewness CALIS procedure, 658 displayed in CLUSTER procedure, 972 SKIPHLABELS= option PLOT statement (BOXPLOT), 509 SLENTRY= option MODEL statement (LOGISTIC), 2317 MODEL statement (PHREG), 3230 MODEL statement (REG), 3831 PROC STEPDISC statement, 4167 SLICE= option LSMEANS statement (GLM), 1758 LSMEANS statement (MIXED), 2691 SLMW= option PROC CALIS statement, 590 SLPOOL= option PROC DISCRIM statement, 1151 SLSTAY= option
Syntax Index MODEL statement (LOGISTIC), 2317 MODEL statement (PHREG), 3230 MODEL statement (REG), 3831 PROC STEPDISC statement, 4167 SM= option MODEL statement (TRANSREG), 4566 TRANSFORM statement, 3665 SMC option PROC CANCORR statement, 762 SMDCR option OUTPUT statement (FREQ), 1449 TEST statement (FREQ), 1463, 1543 SMDRC option OUTPUT statement (FREQ), 1449 TEST statement (FREQ), 1463 SMETHOD= option PROC CALIS statement, 580 PROC NLIN statement, 3009 SMM multiple-comparison method, 444, 1769, 1811 SMM option MEANS statement (ANOVA), 444 MEANS statement (GLM), 1769 SMOOTH transformation MODEL statement (TRANSREG), 4563 SMOOTH= option MODEL statement (LOESS), 2236 smoothing parameter cluster analysis, 979 MODECLUS procedure, 2864, 2871 optimal (DISCRIM), 1162 smoothing parameter, default MODECLUS procedure, 2872 smoothing spline transformation PRINQUAL procedure, 3663 TRANSREG procedure, 4564, 4596 SNK option MEANS statement (ANOVA), 444 MEANS statement (GLM), 1769 SNORM option PROC DISTANCE statement, 1263 PROC STDIZE statement, 4133 Sokal and Sneath 1 coefficient DISTANCE procedure, 1274 Sokal and Sneath 3 coefficient DISTANCE procedure, 1275 SOLUTION option MODEL statement (GLM), 1772 MODEL statement (MIXED), 2705, 2747 MODEL statement (PLS), 3379 MODEL statement (SURVEYREG), 4380 RANDOM statement (MIXED), 2715 Somers’ D statistics, 1474, 1478 SORT option PROC TREE statement, 4754 SORT= option PROC SURVEYSELECT statement, 4442 SORTED option REPEATED statement (GENMOD), 1648 SOURCE option
PROC CORRESP statement, 1079 SOURCE statement TEMPLATE procedure, 339, 345, 367 SP option MODEL statement (REG), 3831 PLOT statement (REG), 3848 SPACES= option PROC TREE statement, 4754 spacing STDIZE procedure, 4139 SPARSE option TABLES statement (FREQ), 1461, 1527 spatial anisotropic exponential structure MIXED procedure, 2721 spatial covariance structures examples (MIXED), 2723 MIXED procedure, 2722, 2730, 2774 spatial prediction VARIOGRAM procedure, 4851, 4852 SPCORR option PROC CANCORR statement, 762 Spearman rank correlation coefficient, 1474, 1480 SPEC option MODEL statement (REG), 3831 specificity CATMOD procedure, 941 spherical semivariogram model KRIGE2D procedure, 2046, 2047 sphericity tests, 449, 1780, 1881 spline t-options PRINQUAL procedure, 3665 TRANSREG procedure, 4566 SPLINE transformation MODEL statement (TRANSREG), 4564 TRANSFORM statement (PRINQUAL), 3662 spline transformation PRINQUAL procedure, 3662 TRANSREG procedure, 4564, 4611 splines Bayesian confidence intervals, 4513 goodness of fit, 4514 regression model, 4497, 4513, 4518 thin-plate smoothing, 4497 TPSPLINE procedure, 4497 TRANSREG procedure, 4560, 4561, 4614, 4637, 4678, 4709 split-plot design ANOVA procedure, 451, 470, 472, 475 generating with PLAN procedure, 3352 MIXED procedure, 2734, 2777 SPRECISION= option PROC CALIS statement, 581, 583 SQPCORR option PROC CANCORR statement, 762 SQSPCORR option PROC CANCORR statement, 762 square root difference cloud VARIOGRAM procedure, 4882 Squared Correlation dissimilarity coefficient
DISTANCE procedure, 1272 Squared Correlation similarity coefficient DISTANCE procedure, 1272 Squared Euclidean distance coefficient DISTANCE procedure, 1271 squared multiple correlation CALIS procedure, 657, 686 CANCORR procedure, 762 squared partial correlation CANCORR procedure, 762 squared semipartial correlation CANCORR procedure, 762 formula (CLUSTER), 984 SRESIDUAL keyword OUTPUT statement (ROBUSTREG), 3991 SS1 option MODEL statement (GLM), 1772 MODEL statement (REG), 3831 SS2 option MODEL statement (GLM), 1773 MODEL statement (REG), 3831 MODEL statement (TRANSREG), 4580 SS3 option MODEL statement (GLM), 1773 SS4 option MODEL statement (GLM), 1773 SSCP matrix displaying, for multivariate tests, 438, 439 for multivariate tests, 437 for multivariate tests (GLM), 1759, 1761 SSCP option REPEATED statement (MIXED), 2720 SSE option MODEL statement (REG), 3831 PLOT statement (REG), 3848 SSE= option OUTPUT statement (NLIN), 3014 SSPLINE transformation MODEL statement (TRANSREG), 4564 TRANSFORM statement (PRINQUAL), 3663 ST option EXACT statement (NPAR1WAY), 3160 OUTPUT statement (NPAR1WAY), 3162 PROC NPAR1WAY statement, 3157 STACKING option PROC SURVEYMEANS statement, 4324 stacking table SURVEYMEANS procedure, 4324 standard deviation CLUSTER procedure, 972 SURVEYMEANS procedure, 4341 standard error PHREG procedure, 3225, 3233, 3235, 3269 SURVEYMEANS procedure, 4338 TPHREG procedure, 4487 standard error of ratio SURVEYMEANS procedure, 4339 standard error ratio PHREG procedure, 3269
TPHREG procedure, 4487 standard linear model MIXED procedure, 2663 STANDARD option PROC CLUSTER statement, 972 PROC MODECLUS statement, 2869 PROC PRINCOMP statement, 3606 PROC PRINQUAL statement, 3656 STANDARD procedure, 21 standardized values, 21 standardization comparisons between DISTANCE and STDIZE procedures, 1251 standardized score process PHREG procedure, 3267, 3271 standardizing cluster analysis (STDIZE), 4143 CLUSTER procedure, 972 MODECLUS procedure, 2856 raw data (SCORE), 4066 redundancy variables (TRANSREG), 4591 TRANSREG procedure, 4572 values (STANDARD), 21 values (STDIZE), 4119 star (*) operator TRANSREG procedure, 4558 START option PROC NLMIXED statement, 3072 START= option MCMC statement (MI), 2528 MODEL statement (LOGISTIC), 2318 MODEL statement (PHREG), 3230 MODEL statement (REG), 3831 MODEL statement (TPHREG), 4475 PROC CALIS statement, 590 PROC STEPDISC statement, 4167 STAT= option PROC PLS statement, 3375 STATFONT option PLOT statement (REG), 3848 STATHT option PLOT statement (REG), 3848 stationary point NLMIXED procedure, 3100 statistic-keywords SURVEYMEANS procedure, 4326 statistical assumptions (GLM), 1783 quality control, 23 tests (MULTTEST), 2948 statistical computation SURVEYMEANS procedure, 4336 statistical computations SURVEYFREQ procedure, 4206 Statistical Graphics Using ODS, see ODS Graphics Statistical style ODS styles, 333 STATS option
PROC SURVEYSELECT statement, 4443 STATUS option PAINT statement (REG), 3838 REWEIGHT statement (REG), 3857 STB option MODEL statement (LOGISTIC), 2318 MODEL statement (REG), 3831 MODEL statement (SURVEYLOGISTIC), 4264 PROC CANCORR statement, 762 STD option MODEL statement (LOESS), 2236 PROC PRINCOMP statement, 3606 STD option (MODECLUS), 2856 STD statement, CALIS procedure, 603 STD= option (DISTANCE), 1265 STD_ERR keyword OUTPUT statement (LIFEREG), 2101 OUTPUT statement (ROBUSTREG), 3991 STDDEV= option ONESAMPLEMEANS statement (POWER), 3436 ONEWAYANOVA statement (POWER), 3442 PAIREDMEANS statement (POWER), 3453 POWER statement (GLMPOWER), 1941 TWOSAMPLEMEANS statement (POWER), 3469 STDERR option LSMEANS statement (GLM), 1758 PROC CALIS statement, 587 SURVIVAL statement (LIFETEST), 2170 STDERR statement MIANALYZE procedure, 2617 STDI keyword OUTPUT statement (GLM), 1774 STDI= option OUTPUT statement (NLIN), 3014 STDIZE procedure AGK estimate, 4139 analyzing data in groups, 4134 Andrew’s wave estimate, 4139 breakdown point and efficiency, 4138 comparisons of quantile computation, PCTLMTD option, 4139 computational methods, PCTLDEF option, 4140 Euclidean length, 4138 examples, 4119, 4143 final output value, 4119 formulas for statistics, 4138 fuzz factor, 4130 Huber’s estimate, 4139 initial estimates for A estimates, 4131 input data set (METHOD=IN()), 4137 methods resistant to clustering, 4138 methods resistant to outliers, 4124, 4138 Minkowski metric, 4138 missing values, 4131, 4133, 4141 normalization, 4131, 4133 one-pass quantile computations, 4139 OUT= data set, 4130, 4141
output data sets, 4132, 4141 output table names, 4142 OUTSTAT= data set, 4141 quantile computation, 4119, 4139 robust estimators, 4138 spacing, 4139 standardization methods, 4119, 4136 standardization with weights, 4135 syntax, 4129 Tukey’s biweight estimate, 4127, 4139 tuning constant, 4127, 4137 unstandardization, 4133 weights, 4135 STDIZE procedure, BY statement, 4134 STDIZE procedure, FREQ statement, 4135 STDIZE procedure, LOCATION statement, 4135 STDIZE procedure, PROC STDIZE statement, 4129 ADD= option, 4130 DATA= option, 4130 FUZZ= option, 4130 INITIAL= option, 4131 METHOD= option, 4131 MISSING= option, 4131 MULT= option, 4131 NMARKERS= option, 4131 NOMISS option, 4131 NORM option, 4131 OUT= option, 4132 OUTSTAT= option, 4132 PCTLDEF= option, 4132 PCTLMTD option, 4132 PCTLPTS option, 4132 PSTAT option, 4133 REPLACE option, 4133 REPONLY option, 4133 SNORM option, 4133 UNSTD option, 4133 VARDEF option, 4133 STDIZE procedure, SCALE statement, 4135 STDIZE procedure, VAR statement, 4135 STDIZE procedure, WGT statement, 4135 STDMEAN option PROC CANDISC statement, 792 PROC DISCRIM statement, 1152 PROC STEPDISC statement, 4168 STDP keyword OUTPUT statement (GLM), 1774 STDP= option OUTPUT statement (NLIN), 3014 STDR keyword OUTPUT statement (GLM), 1774 STDR= option OUTPUT statement (NLIN), 3014 STDXBETA= option OUTPUT statement (LOGISTIC), 2321 step halving PHREG procedure, 3245 step length CALIS procedure, 581
step length options NLMIXED procedure, 3096 STEP= option PLOT statement (GLMPOWER), 1943 PLOT statement (POWER), 3485 STEPBON option PROC MULTTEST statement, 2942 STEPBOOT option PROC MULTTEST statement, 2942 STEPDISC procedure average squared canonical correlation, 4173 computational resources, 4171 input data sets, 4170 introductory example, 4158 memory requirements, 4171 methods, 4157 missing values, 4170 ODS table names, 4174 Pillai’s trace, 4173 stepwise selection, 4159 syntax, 4165 time requirements, 4171 tolerance, 4173 Wilks’ lambda, 4173 STEPDISC procedure, BY statement, 4168 STEPDISC procedure, CLASS statement, 4169 STEPDISC procedure, FREQ statement, 4169 STEPDISC procedure, PROC STEPDISC statement, 4165 ALL option, 4166 BCORR option, 4166 BCOV option, 4166 BSSCP option, 4166 DATA= option, 4166 INCLUDE= option, 4166 MAXSTEP= option, 4166 METHOD= option, 4166 PCORR option, 4167 PCOV option, 4167 PR2ENTRY= option, 4167 PR2STAY= option, 4167 PSSCP option, 4167 SHORT option, 4167 SIMPLE option, 4167 SINGULAR= option, 4167 SLENTRY= option, 4167 SLSTAY= option, 4167 START= option, 4167 STDMEAN option, 4168 STOP= option, 4168 TCORR option, 4168 TCOV option, 4168 TSSCP option, 4168 WCORR option, 4168 WCOV option, 4168 WSSCP option, 4168 STEPDISC procedure, VAR statement, 4169 STEPDISC procedure, WEIGHT statement, 4169 stepdown methods
GLM procedure, 1814 MULTTEST procedure, 2957, 2972 STEPPERM option PROC MULTTEST statement, 2942 STEPS option SCORE statement (LOESS), 2237 STEPSID option PROC MULTTEST statement, 2942, 2972 stepwise discriminant analysis, 4157 stepwise selection LOGISTIC procedure, 2317, 2341, 2391 PHREG procedure, 3229, 3265, 3272 REG procedure, 3800, 3874 STEPDISC procedure, 4159 STOP= option MODEL statement (LOGISTIC), 2318 MODEL statement (PHREG), 3231 MODEL statement (REG), 3832 MODEL statement (TPHREG), 4475 PROC STEPDISC statement, 4168 STOPRES option MODEL statement (LOGISTIC), 2318 MODEL statement (PHREG), 3231 stored data algorithm, 986 stored distance algorithms, 986 strata SURVEYFREQ procedure, 4196 SURVEYLOGISTIC procedure, 4266 SURVEYMEANS procedure, 4332 SURVEYREG procedure, 4381 STRATA statement MULTTEST procedure, 2945 SURVEYFREQ procedure, 4196 SURVEYLOGISTIC procedure, 4266 SURVEYMEANS procedure, 4332 SURVEYREG procedure, 4381 SURVEYSELECT procedure, 4444 STRATA statement, PHREG procedure, 3237 strata variables PHREG procedure, 3237 programming statements (PHREG), 3235 strata weights MULTTEST procedure, 2951 stratification SURVEYFREQ procedure, 4196, 4203 SURVEYLOGISTIC procedure, 4266 SURVEYMEANS procedure, 4332, 4346 SURVEYREG procedure, 4381, 4391 stratified analysis FREQ procedure, 1431, 1450 PHREG procedure, 3216, 3237 stratified cluster sample SURVEYMEANS procedure, 4350 stratified sampling, 164 SURVEYMEANS procedure, 4318 SURVEYREG procedure, 4368, 4401 SURVEYSELECT procedure, 4425, 4444 stratified table example, 1540
stratified tests LIFETEST procedure, 2150, 2155, 2156, 2158, 2167, 2180, 2187 stratum collapse SURVEYREG procedure, 4388, 4411 stress formula MDS procedure, 2479, 2480, 2489 STRICT= option PROC FASTCLUS statement, 1395 strip-split-plot design ANOVA procedure, 475 STRUCTEQ statement, CALIS procedure, 625 structural equation (CALIS) definition, 585, 625, 658 dependent variables, 625 models, 552 structural model example (CALIS), 555 COSAN model, 561 factor analysis model, 564 LINEQS model, 558, 562 LISREL model, 559 path diagram, 555, 600 RAM model, 557, 563 Stuart’s tau-c statistic, 1474, 1478 STUDENT keyword OUTPUT statement (GLM), 1774 Student’s multiple range test, 444, 1769, 1814 STUDENT= option OUTPUT statement (NLIN), 3014 Studentized maximum modulus pairwise comparisons, 444, 1769, 1811 Studentized Residual MIXED procedure, 2704 studentized residual, 1774, 3899 Studentized residuals external, 2768 internal, 2768 MIXED procedure, 2768 STUTC option OUTPUT statement (FREQ), 1449 TEST statement (FREQ), 1463 style attributes, modifying examples, ODS Graphics, 374, 376, 378 style elements, modifying examples, ODS Graphics, 374, 376, 378 STYLE= option, 332 ODS HTML statement, 332, 375, 378, 379 ODS LATEX statement, 358 SUBANALYSIS option PROC ROBUSTREG statement, 3986 SUBCLUSTER= option REPEATED statement (GENMOD), 1648 subdomain analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336 subgroup analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336 SUBGROUP statement
SURVEYMEANS procedure, 4330 SUBGROUPSIZE option PROC ROBUSTREG statement, 3986 subject effect MIXED procedure, 2683, 2715, 2721, 2776, 2782 SUBJECT option CONTRAST statement (MIXED), 2685 ESTIMATE statement (MIXED), 2686 subject weights MDS procedure, 2471, 2477 SUBJECT= option RANDOM statement (MIXED), 2683, 2715 REPEATED statement (GENMOD), 1646 REPEATED statement (MIXED), 2721 subpopulation GENMOD procedure, 1637 LOGISTIC procedure, 2308, 2316, 2355 PROBIT procedure, 3742, 3745, 3761 subpopulation analysis SURVEYFREQ procedure, 4205 SURVEYMEANS procedure, 4336 SUBSETSIZE option PROC ROBUSTREG statement, 3987 SUM option PROC MODECLUS statement, 2869 sum-to-zero assumptions, 1835 summary of commands MIXED procedure, 2672 SUMMARY option MANOVA statement (ANOVA), 439 MANOVA statement (GLM), 1761 PROC CALIS statement, 587 PROC FASTCLUS statement, 1395 PROC VARCLUS statement, 4812 REPEATED statement (ANOVA), 449 REPEATED statement (GLM), 1781 summary statistics REG procedure, 3896 sums of squares GLM procedure, 1772, 1773 Type II (GLM), 1773 Type II (TRANSREG), 4580 SUPPLEMENTARY statement CORRESP procedure, 1080 supported operating environments ODS Graphics, 348 supported procedures ODS Graphics, 348 suppressing output CANCORR procedure, 761 GENMOD procedure, 1627 MI procedure, 2520 surface plots plots, ODS Graphics, 324, 360 surface trend VARIOGRAM procedure, 4854 survey data analysis, 161 survey sampling, 161,
see also SURVEYFREQ procedure see also SURVEYMEANS procedure see also SURVEYREG procedure see also SURVEYSELECT procedure cluster sampling, 164 data analysis, 4185 descriptive statistics, 4315 finite population correction factor, 165 first-stage sampling unit, 165 fpc, 165 multistage sampling, 165 population, 164 population total, 165 PPS, 161 primary sampling units (PSUs), 165 regression analysis, 4365 sample selection, 4421 sampling fraction, 165 sampling frame, 164 sampling rate, 165 sampling unit, 164 sampling weight, 165 stratified sampling, 164 survey weight, 165 SURVEYMEANS procedure, 167 SURVEYREG procedure, 167 SURVEYSELECT procedure, 167 variance estimation, 166 survey weight, 165 SURVEYFREQ procedure, 159, 4185 alpha level, 4199 chi-square test, 4216 cluster, 4195 clustering, 4203 coefficient of variation, 4214 column proportions, 4212 confidence limits, 4213 covariance, 4210 crosstabulation tables, 4227 data summary table, 4225 default tables, 4197 degrees of freedom, 4214 design effect, 4215 design-adjusted chi-square test, 4216 displayed output, 4225 domain analysis, 4205 expected weighted frequency, 4215 introductory example, 4185 missing values, 4205 multiway tables, 4227 ODS table names, 4230 one-way frequency tables, 4226 order of variables, 4193 output table names, 4230 population total, 4194, 4204 proportions, 4210 Rao-Scott chi-square test, 4216 Rao-Scott likelihood ratio test, 4219 row proportions, 4212
sample design, 4203 sampling rate, 4194, 4204 statistical computations, 4206 statistical test tables, 4229 strata, 4196 stratification, 4196, 4203 stratum information table, 4225 subdomain analysis, 4205 subgroup analysis, 4205 subpopulation analysis, 4205 syntax, 4192 Taylor series method, 4206 totals, 4209 unequal weighting, 4204 Wald chi-square test, 4221 Wald log-linear chi-square test, 4223 SURVEYFREQ procedure, BY statement, 4195 SURVEYFREQ procedure, CLUSTER statement, 4195 SURVEYFREQ procedure, PROC SURVEYFREQ statement, 4193 DATA= option, 4193 MISSING option, 4193 NOSUMMARY option, 4193 ORDER= option, 4193 PAGE option, 4194 RATE= option, 4194 TOTAL= option, 4194 SURVEYFREQ procedure, STRATA statement, 4196 LIST option, 4196 SURVEYFREQ procedure, TABLES statement, 4196 ALPHA= option, 4199 CHISQ option, 4199 CHISQ1 option, 4199 CL option, 4199 CLWT option, 4199 COL option, 4200 CV option, 4200 CVWT option, 4200 DDF= option, 4200 DEFF option, 4200 EXPECTED option, 4200 LRCHISQ option, 4200 LRCHISQ1 option, 4200 NOFREQ option, 4201 NOPERCENT option, 4201 NOPRINT option, 4201 NOSPARSE option, 4201 NOSTD option, 4201 NOTOTAL option, 4201 NOWT option, 4201 ROW option, 4201 TESTP= option, 4202 VAR option, 4202 VARWT option, 4202 WCHISQ option, 4202 WLLCHISQ option, 4202 WTFREQ option, 4202 SURVEYFREQ procedure, WEIGHT statement, 4203
SURVEYLOGISTIC procedure, 159, 4250 Akaike’s information criterion, 4279 cluster, 4255 computational details, 4282 confidence intervals, 4288 convergence criterion, 4261–4263 customized odds ratio, 4267 displayed output, 4292 EFFECT Parameterization, 4270 estimability checking, 4258 existence of MLEs, 4277 first-stage sampling rate, 4251 Fisher’s scoring method, 4264, 4265, 4276 GLM Parameterization, 4271 gradient, 4287 Hessian matrix, 4264, 4287 infinite parameter estimates, 4263 initial values, 4280 link function, 4243, 4263, 4273 log odds, 4289 maximum likelihood algorithms, 4275 Medical Expenditure Panel Survey (MEPS), 4302 missing values, 4251, 4268 model fitting criteria, 4279 Newton-Raphson algorithm, 4264, 4265, 4277 odds ratio confidence limits, 4262 odds ratio estimation, 4288 ORDINAL Parameterization, 4271 ORTHEFFECT Parameterization, 4272 ORTHORDINAL Parameterization, 4272 ORTHOTHERM Parameterization, 4272 ORTHPOLY Parameterization, 4273 ORTHREF Parameterization, 4273 output table names, 4295 Parameterization, 4270 POLY Parameterization, 4271 POLYNOMIAL Parameterization, 4271 population total, 4252, 4280 primary sampling units (PSUs), 4281 rank correlation, 4292 REF Parameterization, 4272 REFERENCE Parameterization, 4272 response level ordering, 4270 reverse response level ordering, 4259, 4269 sampling rate, 4251, 4280 Schwarz criterion, 4279 score statistics, 4287 singular contrast matrix, 4258 strata, 4266 stratification, 4266 syntax, 4250 testing linear hypotheses, 4266, 4288 Variance Estimation, 4282 SURVEYLOGISTIC procedure, BY statement, 4252 SURVEYLOGISTIC procedure, CLASS statement, 4253 CPREFIX= option, 4253 DESCENDING option, 4253
LPREFIX= option, 4253 ORDER= option, 4253 PARAM= option, 4254 REF= option, 4254 SURVEYLOGISTIC procedure, CLUSTER statement, 4255 SURVEYLOGISTIC procedure, CONTRAST statement, 4255 ALPHA= option, 4257 E option, 4257 ESTIMATE= option, 4258 SINGULAR= option, 4258 SURVEYLOGISTIC procedure, FREQ statement, 4258 SURVEYLOGISTIC procedure, MODEL statement, 4258 ABSFCONV option, 4261 ADJBOUND= option, 4265 ALPHA= option, 4262 CLODDS option, 4262 CLPARM option, 4262 CORRB option, 4262 COVB option, 4262 DEFFBOUND= option, 4265 DESCENDING option, 4259 EXPEST option, 4262 FCONV= option, 4262 GCONV= option, 4262 ITPRINT option, 4263 LINK= option, 4263 MAXITER= option, 4263 NOCHECK option, 4263 NODESIGNPRINT= option, 4264 NODUMMYPRINT= option, 4264 NOINT option, 4264 OFFSET= option, 4264 ORDER= option, 4260 PARMLABEL option, 4264 RIDGING= option, 4264 RSQUARE option, 4264 SINGULAR= option, 4264 STB option, 4264 TECHNIQUE= option, 4265 VADJUST= option, 4265 XCONV= option, 4265 SURVEYLOGISTIC procedure, PROC SURVEYLOGISTIC statement, 4250 ALPHA= option, 4250 DATA= option, 4250 INEST= option, 4251 MISSING option, 4251 N= option, 4252 NAMELEN= option, 4251 NOSORT option, 4251 R= option, 4251 RATE= option, 4251 TOTAL= option, 4252 SURVEYLOGISTIC procedure, STRATA statement, 4266
LIST option, 4266 SURVEYLOGISTIC procedure, TEST statement, 4266 PRINT option, 4267 SURVEYLOGISTIC procedure, UNITS statement, 4267 DEFAULT= option, 4268 SURVEYMEANS procedure, 159, 4315 categorical variable, 4329, 4337, 4341 class level, 4346 cluster, 4329 coefficient of variation, 4340 confidence level, 4323 confidence limits, 4340, 4342 data summary, 4346 degrees of freedom, 4340, 4344 denominator variable, 4330 descriptive statistics, 4347 domain analysis, 4336, 4348 domain mean, 4343 domain statistics, 4342 domain total, 4343 domain variable, 4330 empty stratum, 4334, 4344 estimated frequency, 4341 estimated total, 4341 first-stage sampling rate, 4324 indicator variable, 4337 mean per element, 4337 means, 4337 missing values, 4323, 4333, 4358 numerator variable, 4330 ODS table names, 4349 output data set, 4321, 4345 output table names, 4349 population total, 4326, 4334 primary sampling units (PSUs), 4335 proportion estimation, 4341 ratio, 4330, 4339 ratio analysis, 4330, 4339, 4348 rectangular table, 4324 sampling rate, 4324, 4334 simple random sampling, 4315 stacking table, 4324 standard deviation of the total, 4341 standard error of ratio, 4339 standard error of the mean, 4338 statistic-keywords, 4326 statistical computation, 4336 strata, 4332 stratification, 4332, 4346 stratified cluster sample, 4350 stratified sampling, 4318 subdomain analysis, 4336 subgroup analysis, 4336 subpopulation analysis, 4336 syntax, 4322 t test, 4339 valid observation, 4346
variance of the mean, 4338 variance of the total, 4341 SURVEYMEANS procedure, BY statement, 4328 SURVEYMEANS procedure, CLASS statement, 4329 SURVEYMEANS procedure, CLUSTER statement, 4329 SURVEYMEANS procedure, DOMAIN statement, 4330 SURVEYMEANS procedure, PROC SURVEYMEANS statement, 4323 ALPHA= option, 4323 DATA= option, 4323 MISSING option, 4323 N= option, 4326 ORDER= option, 4323 R= option, 4324 RATE= option, 4324 STACKING option, 4324 TOTAL= option, 4326 SURVEYMEANS procedure, RATIO statement, 4330 SURVEYMEANS procedure, STRATA statement, 4332 LIST option, 4332 SURVEYMEANS procedure, VAR statement, 4332 SURVEYMEANS procedure, WEIGHT statement, 4333 SURVEYREG procedure, 159, 4365 ADJRSQ, 4380 Adjusted R-square, 4387 analysis of contrasts, 4393 analysis of variance, 4387 ANOVA, 4380, 4387, 4392 classification level, 4391 classification variables, 4375 cluster, 4376 coefficients of contrast, 4392 coefficients of estimate, 4393 computational details, 4384 confidence level, 4373 confidence limits, 4380 contrasts, 4376, 4389 data summary, 4390 degrees of freedom, 4385 design effect, 4388 design summary, 4390 effect testing, 4386, 4392 estimable functions, 4378, 4393 first-stage sampling rate, 4374 inverse matrix of X'X, 4391 missing values, 4382 MSE, 4388 multiple R-square, 4387 output data set, 4372, 4393 output table names, 4394 pooled stratum, 4389 population total, 4374, 4382 primary sampling units (PSUs), 4383 regression coefficients, 4384, 4392
regression estimator, 4400, 4407 Root MSE, 4388 sampling rate, 4374, 4382 simple random cluster sampling, 4397 simple random sampling, 4365, 4395 singularity level, 4377, 4379 strata, 4381 stratification, 4381, 4391 stratified sampling, 4368, 4401 stratum collapse, 4388, 4411 syntax, 4373 variance estimation, 4385 Wald test, 4386, 4389 X'X matrix, 4391 SURVEYREG procedure, BY statement, 4375 SURVEYREG procedure, CLASS statement, 4375 SURVEYREG procedure, CLUSTER statement, 4376 SURVEYREG procedure, CONTRAST statement, 4376 E option, 4377 NOFILL option, 4377 SINGULAR= option, 4377 SURVEYREG procedure, ESTIMATE statement, 4378 DIVISOR= option, 4379 E option, 4379 NOFILL option, 4379 SINGULAR= option, 4379 SURVEYREG procedure, MODEL statement, 4379 ADJRSQ option, 4380 ANOVA option, 4380 CLPARM option, 4380 COVB option, 4380 DEFF option, 4380 INVERSE option, 4380 NOINT option, 4380 SOLUTION option, 4380 VADJUST= option, 4381 XPX option, 4381 SURVEYREG procedure, MODEL statement (SURVEYREG) DF= option, 4380 SURVEYREG procedure, PROC SURVEYREG statement, 4373 ALPHA= option, 4373 DATA= option, 4373 N= option, 4374 R= option, 4374 RATE= option, 4374 TOTAL= option, 4374 TRUNCATE option, 4374 SURVEYREG procedure, STRATA statement, 4381 LIST option, 4381 NOCOLLAPSE option, 4381 SURVEYREG procedure, WEIGHT statement, 4382 SURVEYSELECT procedure, 167, 4421 Brewer’s method, 4453 certainty size measure, 4432 Chromy’s method, 4448, 4452
control sorting, 4443, 4445 displayed output, 4458 dollar-unit sampling, 4466 initial seed, 4441 introductory example, 4422 joint selection probabilities, 4433 maximum size measure, 4433 minimum size measure, 4436 missing values, 4445 Murthy’s method, 4454 output data sets, 4456 output table names, 4460 PPS sampling, 4463 PPS sampling, with replacement, 4451 PPS sampling, without replacement, 4449 PPS sequential sampling, 4452 PPS systematic sampling, 4451 replicated sampling, 4438, 4460 Sampford’s method, 4455 sample selection methods, 4446 sample size, 4440 sampling rate, 4439 sequential random sampling, 4448 serpentine sorting, 4445 simple random sampling, 4423, 4447 stratified sampling, 4425, 4444 syntax, 4430 systematic random sampling, 4448 unrestricted random sampling, 4447 SURVEYSELECT procedure, CONTROL statement, 4443 SURVEYSELECT procedure, ID statement, 4443 SURVEYSELECT procedure, PROC SURVEYSELECT statement, 4430 CERTSIZE= option, 4432 DATA= option, 4433 JTPROBS option, 4433 MAXSIZE= option, 4433 METHOD= option, 4434 MINSIZE= option, 4436 NMAX= option, 4437 NMIN= option, 4437 NOPRINT option, 4437 OUT= option, 4437 OUTALL option, 4437 OUTHITS option, 4438 OUTSEED option, 4438 OUTSIZE option, 4438 OUTSORT= option, 4438 REP= option, 4438 SAMPRATE= option, 4439 SAMPSIZE= option, 4440 SEED= option, 4441 SELECTALL option, 4442 SORT= option, 4442 STATS option, 4443 SURVEYSELECT procedure, SIZE statement, 4443 SURVEYSELECT procedure, STRATA statement, 4444
survival analysis power and sample size (POWER), 3473, 3481, 3533 survival distribution function LIFETEST procedure, 2149, 2171, 2184 PHREG procedure, 3239 survival function LIFEREG procedure, 2083, 2111 survival models, parametric, 2083 survival times PHREG procedure, 3215, 3281, 3283 survivor function, See survival distribution function definition (PHREG), 3239 estimates (LOGISTIC), 2460 estimates (PHREG), 3225, 3235, 3261, 3299 PHREG procedure, 3215, 3235, 3240 sweep algorithm REG procedure, 3917 SYMBOL= option MCMC statement (MI), 2525, 2529 PAINT statement (REG), 3838 PLOT statement (REG), 3850 SYMBOLLEGEND= option PLOT statement (BOXPLOT), 509 SYMBOLORDER= option PLOT statement (BOXPLOT), 509 symmetric and positive definite (SIM2D) covariance matrix, 4107 symmetric binary variable DISTANCE procedure, 1250 syntax ROBUSTREG procedure, 3982 systematic random sampling SURVEYSELECT procedure, 4448
T T option MEANS statement (GLM), 1769 MODEL statement (LOESS), 2236 PROC CANCORR statement, 762 t statistic approximate, 4784 for equality of means, 4783 t test MULTTEST procedure, 2946, 2955, 2968 power and sample size (POWER), 3432, 3437, 3448, 3455, 3463, 3471, 3508, 3509, 3518, 3526 SURVEYMEANS procedure, 4339 t test for correlation power and sample size (POWER), 3426, 3429, 3503 t value CALIS procedure, 649 displaying (CALIS), 686 t-square statistic CLUSTER procedure, 972, 984 T= option
PROC ACECLUS statement, 407 table names MIXED procedure, 2752 table scores, 1469 TABLEOUT option PROC REG statement, 3819 tables crosstabulation (SURVEYFREQ), 4227 frequency and crosstabulation (FREQ), 1431, 1450 frequency and crosstabulation (SURVEYFREQ), 4196 multiway, 1518, 1520, 1521 one-way frequency, 1517, 1518 one-way frequency (SURVEYFREQ), 4226 one-way, tests, 1469, 1470 two-way, tests, 1470, 1471 TABLES statement CORRESP procedure, 1081 FREQ procedure, 1450 SURVEYFREQ procedure, 4196 TABLES statement, use CORRESP procedure, 1072 TABULATE procedure, 22 TARGET= option PROC FACTOR statement, 1319 Tarone’s adjustment Breslow-Day test, 1508 Tarone-Ware test for homogeneity LIFETEST procedure, 2150, 2168 power and sample size (POWER), 3473, 3483, 3533 TAU= option PROC FACTOR statement, 1320 PROC NLIN statement, 3010 Taylor series method SURVEYFREQ procedure, 4206 TCORR option PROC CANDISC statement, 792 PROC DISCRIM statement, 1152 PROC STEPDISC statement, 4168 TCOV option PROC CANDISC statement, 792 PROC DISCRIM statement, 1152 PROC MIANALYZE statement, 2616 PROC STEPDISC statement, 4168 TEST statement (MIANALYZE), 2619 TDATA= option PRIOR statement (MIXED), 2712 TDIFF option LSMEANS statement (GLM), 1758 TDPREFIX= option OUTPUT statement (TRANSREG), 4592 TECHNIQUE= option MODEL statement (LOGISTIC), 2318 MODEL statement (SURVEYLOGISTIC), 4265 NLOPTIONS statement (CALIS), 613 PROC CALIS statement, 577 PROC NLMIXED statement, 3072
Template Editor window, 340, 365 TEMPLATE procedure, 343 examples, ODS Graphics, 363, 369, 371, 373, 379 graph template language, 342 SOURCE statement, 339, 345, 367 TEMPLATE procedure, SOURCE statement FILE= option, 340 template stores, 338 Sashelp.Tmplmst, 338, 341, 345, 346, 365 Sasuser.Templat, 340, 341, 346 user-defined, 341, 342 templates displaying contents of template, 278 in SASUSER library, 278 modifying, 278, 303, 309 style templates, 279 table templates, 279 TEMPLATE procedure, 278 Templates window, 339, 340, 345, 365 termination criteria (CALIS), See optimization test components MIXED procedure, 2702 test indices constraints (CALIS), 584 TEST option MODEL statement (TRANSREG), 4580 PROC MODECLUS statement, 2869 RANDOM statement (GLM), 1777 test set classification DISCRIM procedure, 1163 test set validation PLS procedure, 3384 TEST statement ANOVA procedure, 450 FREQ procedure, 1462 GLM procedure, 1781 LOGISTIC procedure, 2327 MIANALYZE procedure, 2618 MULTTEST procedure, 2946 PHREG procedure, 3238 REG procedure, 3858 ROBUSTREG procedure, 3992 SURVEYLOGISTIC procedure, 4266 TEST= option MULTREG statement (POWER), 3425 ONECORR statement (POWER), 3428 ONESAMPLEFREQ statement (POWER), 3431 ONESAMPLEMEANS statement (POWER), 3437 ONEWAYANOVA statement (POWER), 3442 PAIREDFREQ statement (POWER), 3446 PAIREDMEANS statement (POWER), 3454 STRATA statement (LIFETEST), 2168 TWOSAMPLEFREQ statement (POWER), 3461 TWOSAMPLEMEANS statement (POWER), 3470
TWOSAMPLESURVIVAL statement (POWER), 3480 TESTCLASS statement, DISCRIM procedure, 1155 TESTDATA= option PROC DISCRIM statement, 1152 TESTF= option TABLES statement (FREQ), 1462, 1470 TESTFREQ statement, DISCRIM procedure, 1155 TESTID statement, DISCRIM procedure, 1156 testing linear hypotheses LOGISTIC procedure, 2327, 2358 MIANALYZE procedure, 2618, 2628 SURVEYLOGISTIC procedure, 4266, 4288 TESTLIST option PROC DISCRIM statement, 1152 TESTLISTERR option PROC DISCRIM statement, 1152 TESTOUT= option PROC DISCRIM statement, 1152 TESTOUTD= option PROC DISCRIM statement, 1152 TESTP= option TABLES statement (FREQ), 1462, 1470, 1530 TABLES statement (SURVEYFREQ), 4202 tests, hypothesis examples (GLM), 1850 GLM procedure, 1749 tetrachoric correlation coefficient, 1460, 1482 theoretical correlation INBREED procedure, 1977 THETA0= option PROC MI statement, 2519 PROC MIANALYZE statement, 2616 thin-plate smoothing splines, 4497 bootstrap, 4531, 4535 large data sets, 4527 theoretical foundation, 4511 three-way multidimensional scaling MDS procedure, 2471 threshold response rate, 3705 THRESHOLD= option PROC ACECLUS statement, 407 PROC DISCRIM statement, 1152 PROC MODECLUS statement, 2869 tick marks, modifying examples, ODS Graphics, 373 TICKPOS= option PROC TREE statement, 4754 ties checking for in CLUSTER procedure, 971 MDS procedure, 2485 PHREG procedure, 3216, 3219, 3228, 3241, 3268 TPHREG procedure, 4486 TIES= option MODEL statement (PHREG), 3228 time requirements ACECLUS procedure, 411 CLUSTER procedure, 986
FACTOR procedure, 1331, 1335 VARCLUS procedure, 4815, 4818 time series data REG procedure, 3915 time-dependent covariates PHREG procedure, 3216, 3220, 3222, 3224, 3233, 3236 time-series plot MI procedure, 2556 TIME= option TEST statement (MULTTEST), 2947 TIMELIM= option PROC LIFETEST statement, 2164 TIMELIST= option PROC LIFETEST statement, 2165 SURVIVAL statement (LIFETEST), 2168 TIMEPLOT option MCMC statement (MI), 2529, 2567 TIPREFIX option OUTPUT statement (TRANSREG), 4592 TITLE= option FACTORS statement (CATMOD), 837 LOGLIN statement (CATMOD), 839 MCMC statement (MI), 2525, 2529 MODEL statement (CATMOD), 846 REPEATED statement (CATMOD), 852 RESPONSE statement (CATMOD), 854 Tobit model LIFEREG procedure, 2085, 2129 Toeplitz structure example (MIXED), 2819 MIXED procedure, 2721 TOL option MODEL statement (REG), 3832 TOLERANCE option MODEL statement (GLM), 1773 PROC ROBUSTREG statement, 3987 total covariance matrix MIANALYZE procedure, 2627 total variance MI procedure, 2561 MIANALYZE procedure, 2625 TOTAL= option PROC SURVEYFREQ statement, 4194 PROC SURVEYLOGISTIC statement, 4252 PROC SURVEYMEANS statement, 4326 PROC SURVEYREG statement, 4374 TOTALPROPDISC= option PAIREDFREQ statement (POWER), 3447 totals SURVEYFREQ procedure, 4209 TOTALTIME= option TWOSAMPLESURVIVAL statement (POWER), 3480 TOTEFF option PROC CALIS statement, 587 TOTPANELS= option PLOT statement (BOXPLOT), 509 TOTPCT option
TABLES statement (FREQ), 1462 TPHREG procedure displayed output, 4486 estimability checking, 4481 global null hypothesis, 4486 hierarchy, 4475 iteration history, 4486 likelihood ratio test, 4486 miscellaneous changes from PROC PHREG, 4485 model hierarchy, 4473, 4475 ODS table names, 4488 parameter estimates, 4486, 4487 score test, 4486 singular contrast matrix, 4481 standard error, 4487 standard error ratio, 4487 syntax, 4474 ties, 4486 Wald test, 4486 TPHREG procedure, CLASS statement, 4477 CPREFIX= option, 4477 DESCENDING option, 4477 LPREFIX= option, 4477 MISSING option, 4477 ORDER= option, 4477 PARAM= option, 4478 REF= option, 4478 TPHREG procedure, CONTRAST statement, 4479 ALPHA= option, 4481 E option, 4481 ESTIMATE= option, 4481 SINGULAR= option, 4481 TPHREG procedure, MODEL statement, 4474 HIERARCHY= option, 4475 INCLUDE= option, 4475 NODESIGNPRINT= option, 4476 NODUMMYPRINT= option, 4476 START= option, 4475 STOP= option, 4475 TPHREG procedure, PROC TPHREG statement, 4474 TPREFIX= option PROC PRINQUAL statement, 3657 TPSPLINE procedure bootstrap, 4531, 4535 computational formulas, 4511 large data sets, 4527 nonhomogeneous variance, 4514 ODS table names, 4515 partial spline model, 4515 smoothing parameter, 4514 smoothing penalty, 4525 syntax, 4506 TPSPLINE procedure, BY statement, 4507 TPSPLINE procedure, FREQ statement, 4507 TPSPLINE procedure, ID statement, 4508 TPSPLINE procedure, MODEL statement, 4508 ALPHA= option, 4508
DF= option, 4508 DISTANCE= option, 4509 LAMBDA0= option, 4509 LAMBDA= option, 4509 LOGNLAMBDA0= option, 4509 LOGNLAMBDA= option, 4509 M= option, 4509 TPSPLINE procedure, OUTPUT statement, 4510 OUT= option, 4510 TPSPLINE procedure, PROC TPSPLINE statement, 4507 DATA= option, 4507 TPSPLINE procedure, SCORE statement, 4510 DATA= option, 4510 OUT= option, 4510 TRACE option PROC MODECLUS statement, 2869 PROC NLIN statement, 3010 PROC NLMIXED statement, 3073 PROC VARCLUS statement, 4812 trace record examples, ODS Graphics, 330, 352, 363 trace W method, See Ward’s method TRACEL option MODEL statement (LOESS), 2236 traditional high-resolution graphics LIFETEST procedure, 2159 traditional high-resolution graphics (LIFETEST), 2159 catalog, 2161 description, 2160 global annotate, 2160 local annotate, 2162 training data set DISCRIM procedure, 1139 TRANS= option PRIOR statement (MIXED), 2712 TRANSFORM statement MI procedure, 2533 TRANSFORM statement (PRINQUAL) ARSIN transformation, 3661 DEGREE= option, 3665 EVENLY option, 3665 EXP transformation, 3661 IDENTITY transformation, 3663 KNOTS= option, 3665 LINEAR transformation, 3662 LOG transformation, 3661 LOGIT transformation, 3661 MONOTONE transformation, 3662 MSPLINE transformation, 3662 NAME= option, 3666 NKNOTS= option, 3666 OPSCORE transformation, 3662 ORIGINAL option, 3664 PARAMETER= option, 3664 POWER transformation, 3661 RANK transformation, 3662
SPLINE transformation, 3662 SSPLINE transformation, 3663 TSTANDARD= option, 3666 UNTIE transformation, 3663 TRANSFORM statement, PRINQUAL procedure, 3659 transformation MI procedure, 2533 transformation matrix orthonormalizing, 438, 1761 transformation options PRINQUAL procedure, 3663 TRANSREG procedure, 4564 transformation standardization TRANSREG procedure, 4572 transformations affine(DISTANCE), 1250 ANOVA procedure, 448 cluster analysis, 958 for multivariate ANOVA, 437, 1759 identity(DISTANCE), 1250 linear(DISTANCE), 1250 many-to-one(DISTANCE), 1249 MDS procedure, 2472, 2481, 2482, 2484, 2487– 2489 monotone increasing(DISTANCE), 1249 oblique, 1294, 1298 one-to-one(DISTANCE), 1249 orthogonal, 1294, 1298 power(DISTANCE), 1250 repeated measures, 1830–1832 strictly increasing(DISTANCE), 1249 transformations for repeated measures GLM procedure, 1779 transformed data MDS procedure, 2491 transformed distances MDS procedure, 2491 transforming ordinal variables to interval DISTANCE procedure, 1250 TRANSREG procedure additive models, 4574 algorithms, 4576 alpha level, 4574 ANOVA, 4650 ANOVA codings, 4662 ANOVA table, 4580, 4615 ANOVA table in OUTTEST= data set, 4626 asterisk (*) operator, 4558 at sign (@) operator, 4558 B-spline basis, 4560, 4614 bar (|) operator, 4558 Box Cox Example, 4721 Box Cox transformations, 4595 Box-Cox alpha, 4570 Box-Cox convenient lambda, 4570 Box-Cox convenient lambda list, 4570 Box-Cox geometric mean, 4571 Box-Cox lambda, 4571
Box-Cox parameter, 4566 CANALS method, 4576 canonical correlation, 4584, 4593 canonical variables, 4584 casewise deletion, 4578 cell-means coding, 4569, 4594, 4662 center-point coding, 4568, 4668, 4670 centering, 4675 character OPSCORE variables, 4604 choice experiments, 4660 CLASS variables, prefix, 4575 classification variables, 4560, 4569 coefficients, redundancy, 4590 confidence limits, 4575 confidence limits, individual, 4585 confidence limits, mean, 4585 confidence limits, prefix, 4584, 4585, 4587, 4588 conjoint analysis, 4581, 4593, 4690, 4694 constant transformations, avoiding, 4603 constant variables, 4578, 4604 degrees of freedom, 4615 dependent variable list, 4588 dependent variable name, 4587 design matrix, 4586 details of model, 4575 deviations-from-means coding, 4568, 4594, 4654, 4668, 4670 displaying iteration histories, 4576 dummy variables, 4560, 4569, 4586 dummy variables example, 4654 duplicate variable names, 4626 effect coding, 4549, 4568, 4654, 4668, 4670 excluded observations, 4605 excluding nonscore observations, 4581 expansions, 4560 explicit intercept, 4605 frequency variable, 4557 full-rank coding, 4568 GLMMOD alternative, 4586, 4654 history, iteration, 4576 hypothesis tests, 4580, 4615 id variables, 4557 ideal point model, 4593 ideal points, 4717 identity transformation, 4564 implicit intercept, 4605 independent variable list, 4588 individaul model fitting, 4576 initialization, 4575 interaction effects, 4558, 4594 interactions, quantitative, 4594 intercept, 4605 intercept, none, 4577 iterations, 4601 iterations, maximum number of, 4576 iterations, restarting, 4579, 4602 knots, 4567, 4568, 4613, 4678 less-than-full-rank model, 4569, 4594, 4672
leverage, 4587 limiting displayed output, 4579 linear regression, 4592 linear transformation, 4563, 4610 macros, 4588 main effects, 4558, 4594 maximum redundancy analysis, 4576 METHOD=MORALS rolled output data set, 4622 METHOD=MORALS variable names, 4626 metric conjoint analysis, 4694 missing value restoration option, 4590 missing values, 4576, 4578, 4599, 4600, 4605 monotone regression function, 4629 monotone transformations, 4593 monotonic B-spline transformation, 4563, 4611 monotonic transformation, ties not preserved, 4564, 4610 monotonic transformation, ties preserved, 4563, 4610 MORALS dependent variable name, 4587 MORALS method, 4576 multiple redundancy coefficients, 4590 multiple regression, 4593 multivariate multiple regression, 4593 names of variables, 4571 nonlinear regression functions, 4593, 4628 nonlinear transformations, 4593 nonmetric conjoint analysis, 4690 nonoptimal transformations, 4562 optimal scaling, 4609 optimal scoring, 4563, 4610 optimal transformations, 4563 order of CLASS levels, 4569, 4578 OUT= data set, 4582, 4623 output table names, 4676 OUTTEST= data set, 4556 part-worth utilities, 4691 passive observations, 4605 piecewise polynomial splines, 4561, 4614 %PLOTIT macro, 4617, 4717, 4718 point models, 4605 polynomial-spline basis, 4561, 4614 predicted values, 4591 preference mapping, 4593, 4717 preference models, 4586 prefix, canonical variables, 4584, 4585 prefix, redundancy variables, 4592 prefix, residuals, 4591 redundancy analysis, 4576, 4590, 4592, 4593, 4606 redundancy analysis, standardization, 4591 reference level, 4569, 4579, 4591 reference-cell coding, 4569, 4594, 4664, 4666 regression table, 4580 regression table in OUTTEST= data set, 4626 reiteration, 4579, 4602 renaming and reusing variables, 4571 residuals, 4592
residuals, prefix, 4591 separate regression functions, 4633 short output, 4579 singularity criterion, 4579 smoothing spline transformation, 4564 smoothing splines, 4596 spline t-options, 4566 spline transformation, 4564, 4611 splines, 4560, 4561, 4614, 4637, 4678, 4709 standardization, redundancy variables, 4591 standardization, transformation, 4572, 4580 star (*) operator, 4558 syntax, 4553 transformation options, 4564 transformation standardization, 4572, 4580 Type II sums of squares, 4580 types of observations, 4581 utilities, 4581, 4691, 4694 utilities in OUTTEST= data set, 4626 variable list macros, 4588 variable names, 4625 vector preference models, 4586 weight variable, 4592 _TYPE_, 4623 TRANSREG procedure, BY statement, 4556 TRANSREG procedure, FREQ statement, 4557 TRANSREG procedure, ID statement, 4557 TRANSREG procedure, MODEL statement, 4557 ADDITIVE option, 4574 AFTER option, 4571 ALPHA= option, 4570, 4574 ARSIN transformation, 4562 Box-Cox transformation, 4562 BSPLINE transformation, 4560 CCONVERGE= option, 4574 CENTER option, 4571 CL option, 4575 CLASS transformation, 4560 CLL= option, 4570 CONVENIENT option, 4570 CONVERGE option, 4575 CPREFIX= option, 4568, 4575 DEGREE= option, 4567 DETAIL option, 4575 DEVIATIONS option, 4568 DUMMY option, 4575 EFFECTS option, 4568 EPOINT transformation, 4561 EVENLY option, 4567 EXKNOTS= option, 4567 EXP transformation, 4562 GEOMETRICMEAN option, 4571 HISTORY option, 4576 IDENTITY transformation, 4564 INDIVIDUAL option, 4576 KNOTS= option, 4568 LAMBDA= option, 4571 LINEAR transformation, 4563 LOG transformation, 4562
LOGIT transformation, 4562 LPREFIX= option, 4568, 4576 MAXITER= option, 4576 METHOD= option, 4576 MONOTONE transformation, 4563 MONOTONE= option, 4577 MSPLINE transformation, 4563 NAME= option, 4571 NCAN= option, 4577 NKNOTS= option, 4568 NOINT option, 4577 NOMISS option, 4578 NOPRINT option, 4578 OPSCORE transformation, 4563 ORDER= option, 4569, 4578 ORIGINAL option, 4566 PARAMETER= option, 4566 POINT transformation, 4561 POWER transformation, 4562 PSPLINE transformation, 4561 QPOINT transformation, 4561 RANK transformation, 4563 REFERENCE= option, 4579 REFLECT option, 4572 REITERATE option, 4579 SEPARATORS= option, 4569, 4579 SHORT option, 4579 SINGULAR= option, 4579 SM= option, 4566 SMOOTH transformation, 4563 SPLINE transformation, 4564 SS2 option, 4580 SSPLINE transformation, 4564 TEST option, 4580 TSTANDARD= option, 4572, 4580 TYPE= option, 4581 UNTIE transformation, 4564 UNTIE= option, 4581 UTILITIES option, 4581 Z option, 4572 ZERO= option, 4569 TRANSREG procedure, OUTPUT statement, 4582 ADPREFIX= option, 4584 AIPREFIX option, 4584 APPROXIMATIONS option, 4584 CANONICAL option, 4584 CCC option, 4584 CDPREFIX= option, 4584 CEC option, 4584 CILPREFIX= option, 4584 CIPREFIX= option, 4585 CIUPREFIX= option, 4585 CLI option, 4585 CLM option, 4585 CMLPREFIX= option, 4585 CMUPREFIX= option, 4585 COEFFICIENTS option, 4585 COORDINATES option, 4586 CPC option, 4586
CQC option, 4586 DAPPROXIMATIONS option, 4586 DEPENDENT= option, 4587 DESIGN= option, 4586 DREPLACE option, 4587 IAPPROXIMATIONS option, 4587 IREPLACE option, 4587 LEVERAGE= option, 4587 LILPREFIX= option, 4587 LIUPREFIX= option, 4587 LMLPREFIX= option, 4588 MACRO option, 4588 MEANS option, 4590 MEC option, 4590 MPC option, 4590 MQC option, 4590 MRC option, 4590 MREDUNDANCY option, 4590 NORESTOREMISSING option, 4590 NOSCORES option, 4590 NOZEROCONSTANT option, 4578 OUT= option, 4582 PPREFIX option, 4591 PREDICTED option, 4591 RDPREFIX= option, 4591 REDUNDANCY= option, 4591 REFERENCE= option, 4591 REPLACE option, 4592 RESIDUALS option, 4592 RPREFIX= option, 4592 TDPREFIX= option, 4592 TIPREFIX option, 4592 TRANSREG procedure, PROC TRANSREG statement, 4553 DATA= option, 4556 OUTTEST= option, 4556 TRANSREG procedure, WEIGHT statement, 4592 treatments in a design specifying in PLAN procedure, 3345 TREATMENTS statement PLAN procedure, 3345 tree diagram binary tree, 4743 branch, 4743 children, 4743 definitions, 4743 leaves, 4743 node, 4743 parent, 4743 root, 4743 tree diagrams cluster analysis, 4743 TREE procedure, 4743 missing values, 4756 OUT= data sets, 4756 output data sets, 4756 output table names, 4757 syntax, 4748 TREE procedure, BY statement, 4754
TREE procedure, COPY statement, 4755 TREE procedure, FREQ statement, 4755 TREE procedure, HEIGHT statement, 4755 TREE procedure, ID statement, 4755 TREE procedure, NAME statement, 4756 TREE procedure, PARENT statement, 4756 TREE procedure, PROC TREE statement, 4748 CFRAME= option, 4750 DATA= option, 4750 DESCENDING option, 4750 DESCRIPTION= option, 4750 DISSIMILAR option, 4750 DOCK= option, 4750 FILLCHAR= option, 4750 GOUT= option, 4750 HAXIS= option, 4751 HEIGHT= option, 4751 HORDISPLAY= option, 4751 HORIZONTAL option, 4751 HPAGES= option, 4751 INC= option, 4751 JOINCHAR= option, 4752 LEAFCHAR= option, 4752 LEVEL= option, 4752 LINEPRINTER option, 4752 LINES= option, 4752 LIST option, 4752 MAXHEIGHT= option, 4752 MINHEIGHT= option, 4752 NAME= option, 4752 NCLUSTERS= option, 4752 NOPRINT option, 4753 NTICK= option, 4753 OUT= option, 4753 PAGES= option, 4753 POS= option, 4753 ROOT= option, 4753 SIMILAR option, 4753 SORT option, 4754 SPACES= option, 4754 TICKPOS= option, 4754 TREECHAR= option, 4754 VAXIS= option, 4754 VPAGES= option, 4754 TREECHAR= option PROC TREE statement, 4754 TREND option EXACT statement (FREQ), 1444, 1543 OUTPUT statement (FREQ), 1449 STRATA statement (LIFETEST), 2168 TABLES statement (FREQ), 1462, 1543 trend test, 1490, 1543 trend tests LIFETEST procedure, 2150, 2168, 2179, 2187 TRIM= option and other options, 969 and other options (CLUSTER), 969, 970, 972 PROC CLUSTER statement, 969, 972 triweight kernel (DISCRIM), 1160
TRUNCATE option CLASS statement, 2681 CLASS statement (ANOVA), 436 CLASS statement (GENMOD), 1631 CLASS statement (GLM), 1749 CLASS statement (GLMMOD), 1916 CLASS statement (LOGISTIC), 2297 CLASS statement (MULTTEST), 2944 CLASS statement (NESTED), 2989 CLASS statement (ORTHOREG), 3203 CLASS statement (PLS), 3378 PROC SURVEYREG statement, 4374 trust region (TR), 3073 trust region algorithm CALIS procedure, 578, 581, 665 TSSCP option PROC CANDISC statement, 792 PROC DISCRIM statement, 1152 PROC STEPDISC statement, 4168 TSTANDARD= option MODEL statement (TRANSREG), 4572, 4580 PROC PRINQUAL statement, 3657 TRANSFORM statement (PRINQUAL), 3666 TSYMM option OUTPUT statement (FREQ), 1449 TTEST procedure alpha level, 4780 Cochran and Cox t approximation, 4775, 4780, 4785 compared to other procedures, 1735 computational method, 4783 confidence intervals, 4780 input data set, 4783 introductory example, 4776 missing values, 4783 ODS table names, 4789 paired comparisons, 4775, 4793 paired t test, 4782 paired-difference t test, 4775 Satterthwaite’s approximation, 4775, 4785 syntax, 4779 uniformly most powerful unbiased test, 4786 TTEST procedure, BY statement, 4780 TTEST procedure, CLASS statement, 4781 TTEST procedure, FREQ statement, 4781 TTEST procedure, PAIRED statement, 4782 TTEST procedure, PROC TTEST statement, 4780 ALPHA= option, 4780 CI= option, 4780 COCHRAN option, 4780 DATA= option, 4780 TTEST procedure, VAR statement, 4782 TTEST procedure, WEIGHT statement, 4783 Tukey’s adjustment GLM procedure, 1754
MIXED procedure, 2688 Tukey’s studentized range test, 445, 1769, 1811, 1812 Tukey-Kramer test, 445, 1769, 1811, 1812 TURNHLABELS option PLOT statement (BOXPLOT), 509 2D geometric anisotropic structure MIXED procedure, 2721 two-sample t-test, 4775, 4789 power and sample size (POWER), 3415, 3463, 3471, 3472, 3526, 3527, 3529, 3566 two-stage density linkage CLUSTER procedure, 967, 982 TWOSAMPLEFREQ statement POWER procedure, 3457 TWOSAMPLEMEANS statement POWER procedure, 3463 TWOSAMPLESURVIVAL statement POWER procedure, 3473 Type 1 analysis GENMOD procedure, 1615, 1665 Type 1 error, 3488 Type 1 error rate repeated multiple comparisons, 1808 Type 1 estimation MIXED procedure, 2677 Type 2 error, 3488 Type 2 estimation MIXED procedure, 2677 Type 3 analysis GENMOD procedure, 1615, 1665 Type 3 estimation MIXED procedure, 2677 Type H covariance structure, 1829 Type I sum of squares computing in GLM, 1835 displaying (GLM), 1772 estimable functions for, 1771 estimable functions for (GLM), 1794 examples, 1858 Type I testing MIXED procedure, 2696 Type II sum of squares computing in GLM, 1835 displaying (GLM), 1773 estimable functions for, 1771 estimable functions for (GLM), 1796 examples, 1858 Type II sums of squares TRANSREG procedure, 4580 Type II testing MIXED procedure, 2696 Type III sum of squares displaying (GLM), 1773 estimable functions for, 1771 estimable functions for (GLM), 1797 examples, 1858 Type III testing MIXED procedure, 2696, 2751 Type IV sum of squares
computing in GLM, 1835 displaying (GLM), 1773 estimable functions for, 1771 estimable functions for (GLM), 1797 examples, 1858 TYPE1 option MODEL statement (GENMOD), 1642 TYPE3 option MODEL statement (GENMOD), 1643 TYPE= data sets FACTOR procedure, 1325 TYPE= option MODEL statement (TRANSREG), 4581 PROC PRINQUAL statement, 3657 PROC SCORE statement, 4072 RANDOM statement (MIXED), 2715 REPEATED statement (GENMOD), 1648 REPEATED statement (MIXED), 2721
U U option OUTPUT statement (FREQ), 1449 U95 option MODEL statement (RSREG), 4043 U95= option OUTPUT statement (NLIN), 3014 U95M option MODEL statement (RSREG), 4043 U95M= option OUTPUT statement (NLIN), 3014 UCL keyword OUTPUT statement (GLM), 1775 UCLM keyword OUTPUT statement (GLM), 1775 UCORR option PROC CALIS statement, 573 UCOV option PROC CALIS statement, 573 UCR option OUTPUT statement (FREQ), 1449 ultra-Heywood cases, FACTOR procedure, 1333 ULTRAHEYWOOD option PROC FACTOR statement, 1320 ultrametric, definition, 985 UNADJUSTED option PROC CORRESP statement, 1079 unbalanced data caution (ANOVA), 423 unbalanced design GLM procedure, 1735, 1804, 1833, 1856, 1882 NESTED procedure, 2990 uncertainty coefficients, 1474, 1483, 1484 UNDEF= option PROC DISTANCE statement, 1263 UNDO option PAINT statement (REG), 3838 REWEIGHT statement (REG), 3857 unequal weighting SURVEYFREQ procedure, 4204
unfolding MDS procedure, 2471 uniform kernel (DISCRIM), 1159 uniform-kernel estimation CLUSTER procedure, 972, 978 uniformly most powerful unbiased test TTEST procedure, 4786 unique factor defined for factor analysis, 1292 UNISTATS option BIVAR statement, 1999 UNIVAR statement, 2000 UNITS statement, LOGISTIC procedure, 2328 UNITS statement, SURVEYLOGISTIC procedure, 4267 UNIVAR statement KDE procedure, 1999 univariate distributions, example MODECLUS procedure, 2889 UNIVARIATE procedure, 22 univariate tests repeated measures, 1828, 1829 unknown or missing parents INBREED procedure, 1982 UNPACKPANELS option PROC GAM statement, 1581 PROC LOESS statement, 2250 PROC REG statement, 3923 unrestricted random sampling SURVEYSELECT procedure, 4447 unsquared Euclidean distances, 969, 971 UNSTD option PROC STDIZE statement, 4133 unstructured correlations MIXED procedure, 2721 unstructured covariance matrix MIXED procedure, 2721 UNTIE option PROC MDS statement, 2485 UNTIE transformation MODEL statement (TRANSREG), 4564 TRANSFORM statement (PRINQUAL), 3663 UNTIE= option MODEL statement (TRANSREG), 4581 PROC PRINQUAL statement, 3658 unweighted least-squares factor analysis, 1291 unweighted pair-group clustering, See average linkage See centroid method update methods NLMIXED procedure, 3073 UPDATE option PROC MIXED statement, 2680 UPDATE= option NLOPTIONS statement (CALIS), 613 PRIOR statement (MIXED), 2712 PROC CALIS statement, 579 PROC NLMIXED statement, 3073 UPGMA,
See average linkage UPGMC, See centroid method UPPER= option ONESAMPLEMEANS statement (POWER), 3437 OUTPUT statement (LOGISTIC), 2321 PAIREDMEANS statement (POWER), 3454 TWOSAMPLEMEANS statement (POWER), 3470 UPPERB= option PARMS statement (MIXED), 2708 UPPERTAILED option ESTIMATE statement (MIXED), 2686 TEST statement (MULTTEST), 2947 URC option OUTPUT statement (FREQ), 1449 URL= suboption ODS HTML statement, 337 ODS LATEX statement, 359 USEALL option PLOT statement (REG), 3848 USSCP option PROC REG statement, 3819 utilities TRANSREG procedure, 4691, 4694 UTILITIES option MODEL statement (TRANSREG), 4581
V V matrix MIXED procedure, 2716 V option RANDOM statement (MIXED), 2716 V6CORR option REPEATED statement (GENMOD), 1649 VADJUST= option MODEL statement (SURVEYLOGISTIC), 4265 MODEL statement (SURVEYREG), 4381 valid observation SURVEYMEANS procedure, 4346 value lists GLMPOWER procedure, 1945 POWER procedure, 3490 Van der Waerden scores NPAR1WAY procedure, 3167 VAR option TABLES statement (SURVEYFREQ), 4202 VAR statement CALIS procedure, 627 CANDISC procedure, 794 CORRESP procedure, 1081 DISCRIM procedure, 1156 FACTOR procedure, 1322 INBREED procedure, 1975 LATTICE procedure, 2073 MDS procedure, 2486 MI procedure, 2534 MODECLUS procedure, 2870
NESTED procedure, 2989 NPAR1WAY procedure, 3162 PRINCOMP procedure, 3608 REG procedure, 3859 STDIZE procedure, 4135 STEPDISC procedure, 4169 SURVEYMEANS procedure, 4332 TTEST procedure, 4782 VARCLUS procedure, 4814 VARIOGRAM procedure, 4870 VAR statement, use CORRESP procedure, 1072 VAR= option ASSESS statement (PHREG), 3223 CDFPLOT statement (PROBIT), 3715 IPPPLOT statement (PROBIT), 3725 LPREDPLOT statement (PROBIT), 3733 PREDICT statement (KRIGE2D), 2042 PREDPPLOT statement (PROBIT), 3747 SIMULATE statement (SIM2D), 4101 VARCLUS procedure, See also TREE procedure alternating least-squares, 4800 analyzing data in groups, 4813 centroid component, 4802, 4808 cluster components, 4799 cluster splitting, 4800, 4806, 4810, 4811 cluster, definition, 4799 computational resources, 4818 controlling number of clusters, 4810 eigenvalues, 4800, 4810 how to choose options, 4815 initializing clusters, 4809 interpreting output, 4818 iterative reassignment, 4800 MAXCLUSTERS= option, using, 4815 MAXEIGEN= option, using, 4815 memory requirements, 4818 missing values, 4814 multiple group component analysis, 4811 nearest component sorting phase, 4800 number of clusters, 4800, 4806, 4810, 4811 orthoblique rotation, 4800, 4809 output data sets, 4811, 4816 output table names, 4820 OUTSTAT= data set, 4811, 4816 OUTTREE= data set, 4817 PROPORTION= option, using, 4815 search phase, 4800 splitting criteria, 4800, 4806, 4810, 4811 stopping criteria, 4806 syntax, 4806 time requirements, 4815, 4818 TYPE=CORR data set, 4816 VARCLUS procedure, BY statement, 4813 VARCLUS procedure, FREQ statement, 4813 VARCLUS procedure, PARTIAL statement, 4813 VARCLUS procedure, PROC VARCLUS statement, 4806
CENTROID option, 4808 CORR option, 4808 COVARIANCE option, 4809 DATA= option, 4809 HIERARCHY option, 4809 INITIAL= option, 4809 MAXCLUSTERS= option, 4810 MAXEIGEN= option, 4810 MAXITER= option, 4810 MAXSEARCH= option, 4810 MINC= option, 4810 MINCLUSTERS= option, 4810 MULTIPLEGROUP option, 4811 NOINT option, 4811 NOPRINT option, 4811 OUTSTAT= option, 4811 OUTTREE= option, 4811 PERCENT= option, 4811 PROPORTION= option, 4811 RANDOM= option, 4812 SHORT option, 4812 SIMPLE option, 4812 SUMMARY option, 4812 TRACE option, 4812 VARDEF= option, 4812 VARCLUS procedure, SEED statement, 4814 VARCLUS procedure, VAR statement, 4814 VARCLUS procedure, WEIGHT statement, 4814 VARCOMP procedure, 4834 classification variables, 4836 compared to MIXED procedure, 2664, 2665 compared to other procedures, 1735 computational details, 4838 dependent variables, 4831, 4836 example (MIXED), 2795 fixed effects, 4831, 4837 fixed-effects model, 4837 introductory example, 4832 methods of estimation, 4831, 4842 missing values, 4837 mixed model, 4837 negative variance components, 4838 ODS table names, 4840 random effects, 4831, 4837 random-effects model, 4837 relationship to PROC MIXED, 4841 syntax, 4834 variability, 4832 variance component, 4838 VARCOMP procedure, BY statement, 4835 VARCOMP procedure, CLASS statement, 4836 VARCOMP procedure, MODEL statement, 4836 FIXED= option, 4836 VARCOMP procedure, PROC VARCOMP statement, 4835 DATA= option, 4835 EPSILON= option, 4835 MAXITER= option, 4835 METHOD= option, 4835
VARDEF option PROC STDIZE statement, 4133 VARDEF= option PROC CALIS statement, 573 PROC DISTANCE statement, 1263 PROC FACTOR statement, 1320 PROC FASTCLUS statement, 1395 PROC PRINCOMP statement, 3606 PROC VARCLUS statement, 4812 variability VARCOMP procedure, 4832 variable (PHREG) censoring, 3218 variable importance for projection, 3396 variable list macros TRANSREG procedure, 4588 variable selection CALIS procedure, 662 discriminant analysis, 4157 variable-radius kernels MODECLUS procedure, 2870 variable-reduction method, 4799 variables, See also classification variables frequency (PRINQUAL), 3658 renaming (PRINQUAL), 3666 reusing (PRINQUAL), 3666 weight (PRINQUAL), 3667 variables, unaddressed INBREED procedure, 1976 variance component VARCOMP procedure, 4838 variance components, 2985 MIXED procedure, 2662, 2721 Variance Estimation SURVEYLOGISTIC procedure, 4282 variance estimation, 166 SURVEYREG procedure, 4385 variance function GENMOD procedure, 1614 variance inflation factors (VIF) REG procedure, 3818, 3958 variance of means LATTICE procedure, 2074 variance of the mean SURVEYMEANS procedure, 4338 variance of the total SURVEYMEANS procedure, 4341 variance ratios MIXED procedure, 2707, 2714 VARIANCE statement, GENMOD procedure, 1650 variances FACTOR procedure, 1320 ratio of, 4775, 4784, 4789 variances, test for equal, 1150 varimax method, 1291, 1317, 1318 VARIOGRAM procedure, 4851 angle classes, 4866, 4868, 4870, 4872, 4873 angle tolerance, 4866, 4868, 4870, 4872, 4873
anisotropic models, 4871 bandwidth, 4866, 4870, 4875 continuity measures, 4851 contour map, 4856 correlation length, 4860 cutoff distance, 4869 DATA= data set, 4865 distance classification, 4874 distance interval, 4856 empirical (or experimental) semivariogram, 4858 examples, 4852, 4882 histogram of pairwise distances, 4856 input data set, 4865 intervals, number of, 4856 kriging, ordinary, 4852 nested models, 4871 nugget effect, 4871 ordinary kriging, 4851 OUTDIST= data set, 4851, 4865, 4878, 4880 OUTPAIR= data set, 4851, 4866, 4881 output data sets, 4851, 4865, 4866, 4877–4881 OUTVAR= data set, 4866, 4877 pairwise distances, 4851, 4856, 4867 predicted values, 4852 semivariogram computation, 4876, 4877 semivariogram robust, 4877 semivariogram, empirical, 4856 semivariogram, robust, 4861, 4869, 4877 spatial continuity, 4851 spatial prediction, 4851, 4852 square root difference cloud, 4882 standard errors, 4852 surface plot, 4856 surface trend, 4854 syntax, 4864 VARIOGRAM procedure, COMPUTE statement, 4866 ANGLETOLERANCE= option, 4866 BANDWIDTH= option, 4866 DEPSILON= option, 4867 LAGDISTANCE= option, 4867 LAGTOLERANCE= option, 4867 MAXLAGS= option, 4867 NDIRECTIONS= option, 4868 NHCLASSES= option, 4868 NOVARIOGRAM option, 4869 OUTPDISTANCE= option, 4869 ROBUST option, 4869 VARIOGRAM procedure, COORDINATES statement, 4869 XCOORD= option, 4869 YCOORD= option, 4869 VARIOGRAM procedure, DIRECTIONS statement, 4870 VARIOGRAM procedure, PROC VARIOGRAM statement, 4865 DATA= option, 4865 OUTDISTANCE= option, 4865
OUTPAIR= option, 4866 OUTVAR= option, 4866 VARIOGRAM procedure, VAR statement, 4870 VARNAMES statement, CALIS procedure, 625 VARSCALE option PROC PLS statement, 3377 VARWT option TABLES statement (SURVEYFREQ), 4202 VARY option PLOT statement (GLMPOWER), 1944 PLOT statement (POWER), 3485 VAXIS= option PLOT statement (BOXPLOT), 509 PLOT statement (REG), 3848 PROC TREE statement, 4754 VC option RANDOM statement (MIXED), 2716 VCI option RANDOM statement (MIXED), 2716 VCIRY option MIXED procedure, MODEL statement, 2764 MODEL statement (MIXED), 2705 VCORR option RANDOM statement (MIXED), 2716 VDEP option PROC CANCORR statement, 762 vector preference models TRANSREG procedure, 4586 VERSION= option NLOPTIONS statement (CALIS), 624 VFORMAT= option BOXPLOT procedure, 510 VI option RANDOM statement (MIXED), 2716 viewing graphs ODS Graphics, 327 VIF, See variance inflation factors VIF option MODEL statement (REG), 3832 VIP, 3396 visual fit of the variogram KRIGE2D procedure, 2045 VMINOR= option PLOT statement (BOXPLOT), 510 VN= option PROC CANCORR statement, 762 VNAME= option PROC CANCORR statement, 762 VOFFSET= option PLOT statement (BOXPLOT), 510 VP= option PROC CANCORR statement, 762 VPAGES= option PROC TREE statement, 4754 VPLOTS= option PLOT statement (REG), 3851 VPREFIX= option PROC CANCORR statement, 762
VREF= option
    PLOT statement (BOXPLOT), 510
    PLOT statement (REG), 3848
VREFLABELS= option
    PLOT statement (BOXPLOT), 510
VREFLABPOS= option
    PLOT statement (BOXPLOT), 511
VREG option
    PROC CANCORR statement, 763
VSINGULAR= option
    NLOPTIONS statement (CALIS), 615
    PROC CALIS statement, 591
    PROC NLMIXED statement, 3074
VW option
    EXACT statement (NPAR1WAY), 3160
    OUTPUT statement (NPAR1WAY), 3162
    PROC NPAR1WAY statement, 3158
VZERO option
    PLOT statement (BOXPLOT), 511
W
Wald chi-square test
    SURVEYFREQ procedure, 4221
Wald log-linear chi-square test
    SURVEYFREQ procedure, 4223
WALD option
    CONTRAST statement (GENMOD), 1633
    MODEL statement (GENMOD), 1643
Wald test
    mixed model (MIXED), 2741, 2786
    MIXED procedure, 2750, 2751
    modification indices (CALIS), 584, 674
    PHREG procedure, 3238, 3246, 3247, 3269, 3284
    probability limit (CALIS), 590
    PROBIT procedure, 3756
    SURVEYREG procedure, 4386, 4389
    TPHREG procedure, 4486
WALDCI option
    MODEL statement (GENMOD), 1643
WALDCL option
    MODEL statement (LOGISTIC), 2319
WALDRL option
    MODEL statement (LOGISTIC), 2315
WALLER option
    MEANS statement (ANOVA), 445
    MEANS statement (GLM), 1769
Waller-Duncan test, 445, 1769, 1815
    error seriousness ratio, 443, 1768
    examples, 1851
    multiple comparison (ANOVA), 464
Wampler data set, 3208
Ward’s minimum-variance method
    CLUSTER procedure, 967, 983
WAXIS= option
    PLOT statement (BOXPLOT), 511
WCHISQ option
    TABLES statement (SURVEYFREQ), 4202
WCONF= option
    MCMC statement (MI), 2526
WCONNECT= option
    MCMC statement (MI), 2530
WCORR option
    PROC CANDISC statement, 792
    PROC DISCRIM statement, 1153
    PROC STEPDISC statement, 4168
WCOV option
    PROC CANDISC statement, 792
    PROC DISCRIM statement, 1153
    PROC MIANALYZE statement, 2616
    PROC STEPDISC statement, 4168
    TEST statement (MIANALYZE), 2619
WDEP option
    PROC CANCORR statement, 763
Wei-Lin-Weissfeld model
    PHREG procedure, 3248
Weibull distribution, 2083, 2097, 2111
WEIGHT option
    PROC FACTOR statement, 1320
WEIGHT statement
    CALIS procedure, 627
    CANDISC procedure, 794
    CATMOD procedure, 860
    CORRESP procedure, 1082
    DISCRIM procedure, 1156
    FACTOR procedure, 1322
    FREQ procedure, 1463
    GENMOD procedure, 1650
    GLM procedure, 1782
    GLMMOD procedure, 1916
    GLMPOWER procedure, 1938
    KDE procedure, 2002
    LIFEREG procedure, 2108
    LOESS procedure, 2237
    LOGISTIC procedure, 2328
    MDS procedure, 2487
    MIXED procedure, 2730
    ORTHOREG procedure, 3203
    PHREG procedure, 3239
    PRINCOMP procedure, 3608
    PRINQUAL procedure, 3667
    REG procedure, 3859
    ROBUSTREG procedure, 3992
    RSREG procedure, 4044
    STEPDISC procedure, 4169
    SURVEYFREQ procedure, 4203
    SURVEYMEANS procedure, 4333
    SURVEYREG procedure, 4382
    TRANSREG procedure, 4592
    TTEST procedure, 4783
    VARCLUS procedure, 4814
weight variable
    PRINQUAL procedure, 3667
WEIGHT= option
    OUTPUT statement (NLIN), 3014
    REWEIGHT statement (REG), 3857
    STRATA statement (MULTTEST), 2946, 2951
weighted average linkage
    CLUSTER procedure, 967, 981
weighted Euclidean distance
    MDS procedure, 2472, 2477, 2488
weighted Euclidean model
    MDS procedure, 2471, 2477
weighted kappa coefficient, 1493, 1496
weighted least squares
    CATMOD procedure, 817, 846
    formulas (CATMOD), 894
    MDS procedure, 2487
    normal equations (GLM), 1783
weighted means
    GLM procedure, 1820
weighted pair-group methods,
    See McQuitty’s similarity analysis
    See median method
weighted product-moment correlation coefficients
    CANCORR procedure, 764
weighted Schoenfeld residuals
    PHREG procedure, 3235, 3259
weighted score residuals
    PHREG procedure, 3260
weighted-group method,
    See centroid method
WEIGHTFUNCTIONT option
    PROC ROBUSTREG statement, 3985
weighting
    MIXED procedure, 2730
weighting variables
    FACTOR procedure, 1333
WEIGHTS= option
    VAR statement, 1267
WELCH option
    MEANS statement (ANOVA), 445
    MEANS statement (GLM), 1769
Welch t test
    power and sample size (POWER), 3463, 3472, 3527
Welch’s ANOVA, 445, 1769
    homogeneity of variance tests, 1819
    using homogeneity of variance tests, 1893
Welsch’s multiple range test, 444, 1768, 1815
    examples, 1851
WGRID= option
    PLOT statement (BOXPLOT), 511
WGT statement
    DISTANCE procedure, 1269
    STDIZE procedure, 4135
WHERE statement
    ANOVA procedure, 454
    GLM procedure, 1787
width, confidence intervals, 3488
WIDTH= option
    PROC LIFETEST statement, 2165
WILCOXON option
    EXACT statement (NPAR1WAY), 3160
    OUTPUT statement (NPAR1WAY), 3162
    PROC NPAR1WAY statement, 3158
Wilcoxon scores
    NPAR1WAY procedure, 3166
Wilcoxon test for association
    LIFETEST procedure, 2150
Wilcoxon test for homogeneity
    LIFETEST procedure, 2150, 2168, 2178
Wilks’ criterion, 437, 1759
Wilks’ lambda, 1828
Williams’ method
    overdispersion (LOGISTIC), 2355
windowing environment, 327, 338
within-cluster SSCP matrix
    ACECLUS procedure, 387
within-imputation covariance matrix
    MIANALYZE procedure, 2626
within-imputation variance
    MI procedure, 2561
    MIANALYZE procedure, 2624
within-subject factors
    repeated measures, 1777, 1828
WITHIN= option
    REPEATED statement (GENMOD), 1649
WITHINSUBJECT= option
    REPEATED statement (GENMOD), 1649
WLF option
    MCMC statement (MI), 2524, 2529, 2530
WLLCHISQ option
    TABLES statement (SURVEYFREQ), 4202
WLS option
    MODEL statement (CATMOD), 846
WN= option
    PROC CANCORR statement, 763
WNAME= option
    PROC CANCORR statement, 763
WNEEDLES= option
    MCMC statement (MI), 2526
Wong’s hybrid method
    CLUSTER procedure, 969, 978
working correlation matrix
    GENMOD procedure, 1647, 1648, 1672
worst linear function of parameters
    MI procedure, 2556
WOVERLAY= option
    PLOT statement (BOXPLOT), 511
WP= option
    PROC CANCORR statement, 763
WPENALTY= option
    PROC CALIS statement, 576
WPGMA,
    See McQuitty’s similarity analysis
WPGMC,
    See median method
WPREFIX= option
    PROC CANCORR statement, 763
WREF= option
    MCMC statement (MI), 2526
WREG option
    PROC CANCORR statement, 762
WRIDGE= option
    PROC CALIS statement, 577
WSSCP option
    PROC CANDISC statement, 793
    PROC DISCRIM statement, 1153
    PROC STEPDISC statement, 4168
WTFREQ option
    TABLES statement (SURVEYFREQ), 4202
WTKAP option
    EXACT statement (FREQ), 1444
    OUTPUT statement (FREQ), 1449
    TEST statement (FREQ), 1463
X
X= option
    GRID statement (KRIGE2D), 2040
    GRID statement (SIM2D), 4100
    PLOT statement (GLMPOWER), 1944
    PLOT statement (POWER), 3485
XBETA keyword
    OUTPUT statement (LIFEREG), 2102
    OUTPUT statement (ROBUSTREG), 3991
XBETA= option
    OUTPUT statement (LOGISTIC), 2321
XCONV option
    EM statement (MI), 2522
XCONV= option
    MCMC statement (MI), 2527
    MODEL statement (LOGISTIC), 2319
    MODEL statement (SURVEYLOGISTIC), 4265
    NLOPTIONS statement (CALIS), 619
    PROC NLMIXED statement, 3074
XCOORD= option
    COORDINATES statement (KRIGE2D), 2039
    COORDINATES statement (SIM2D), 4100
    COORDINATES statement (VARIOGRAM), 4869
    GRID statement (KRIGE2D), 2040
    GRID statement (SIM2D), 4101
XDATA= data sets
    LIFEREG procedure, 2122
XDATA= option
    PROC LIFEREG statement, 2091
    PROC PROBIT statement, 3714
XOPTS= option
    PLOT statement (GLMPOWER), 1944
    PLOT statement (POWER), 3486
XPVIX option
    MODEL statement (MIXED), 2705
XPVIXI option
    MODEL statement (MIXED), 2705
XPX option
    MODEL statement (CATMOD), 846
    MODEL statement (GLM), 1773
    MODEL statement (REG), 3832
    MODEL statement (SURVEYREG), 4381
XPXI= option
    PROC MIANALYZE statement, 2616
XREF option
    PROC NLIN statement, 3010
XSIZE= option
    NLOPTIONS statement (CALIS), 619
    PROC NLMIXED statement, 3074
XTOL= option
    NLOPTIONS statement (CALIS), 619
XVARS option
    MODEL statement (GENMOD), 1643
Y
Y= option
    GRID statement (KRIGE2D), 2040
    GRID statement (SIM2D), 4100
    PLOT statement (GLMPOWER), 1944
    PLOT statement (POWER), 3487
YCOORD= option
    COORDINATES statement (KRIGE2D), 2039
    COORDINATES statement (SIM2D), 4100
    COORDINATES statement (VARIOGRAM), 4869
    GRID statement (KRIGE2D), 2040
    GRID statement (SIM2D), 4101
YOPTS= option
    PLOT statement (GLMPOWER), 1944
    PLOT statement (POWER), 3487
YPAIR= option
    REPEATED statement (GENMOD), 1649
Yule’s Q statistic, 1476
Z
Z option
    MODEL statement (TRANSREG), 4572
z scores
    TRANSREG procedure, 4572
z test
    power and sample size (POWER), 3429, 3432, 3505, 3506
ZDATA= option
    REPEATED statement (GENMOD), 1649
zero variance component estimates
    MIXED procedure, 2774
ZERO= option
    MODEL statement (CATMOD), 846
    MODEL statement (TRANSREG), 4569
ZEROBASED option
    PROC GLMMOD statement, 1915
ZEROES= option
    MODEL statement (CATMOD), 846
ZEROS option
    WEIGHT statement (FREQ), 1464
zeros, structural and random
    FREQ procedure, 1499
zeros, structural and sampling
    CATMOD procedure, 888
    examples (CATMOD), 919, 924
ZEROS= option
    MODEL statement (CATMOD), 846
ZETA= option
    MODEL statement (GLM), 1773
    MODEL statement (MIXED), 2705
zonal anisotropy
    KRIGE2D procedure, 2055
ZROW= option
    REPEATED statement (GENMOD), 1650
Your Turn

If you have comments or suggestions about SAS/STAT® 9.1 User’s Guide, please send them to us on a photocopy of this page or send us electronic mail.

For comments about this book, please return the photocopy to

    SAS Publishing
    SAS Campus Drive
    Cary, NC 27513
    E-mail: [email protected]

For suggestions about the software, please return the photocopy to

    SAS Institute Inc.
    Technical Support Division
    SAS Campus Drive
    Cary, NC 27513
    E-mail: [email protected]