QUANTITATIVE DATA ANALYSIS
Doing Social Research to Test Ideas
DONALD J. TREIMAN
Copyright © John Wiley & Sons, Inc. All rights reserved.
Published by Jossey-Bass, San Francisco, CA 94103, www.josseybass.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted under the United States Copyright Act, without either the prior written permission of the publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ, fax 201-748-6008, or online at www.wiley.com/go/permissions.

Readers should be aware that Internet Web sites offered as citations or sources for further information may have changed or disappeared between the time this was written and when it is read.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Jossey-Bass books and products are available through most bookstores. To contact Jossey-Bass directly call our Customer Care Department within the United States at (800) 956-7739, outside the United States at (317) 572-3986, or via fax at (317) 572-4002.

Jossey-Bass also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data
Treiman, Donald J.
Quantitative data analysis : doing social research to test ideas / Donald J. Treiman.
p. cm.
1. Social sciences--Research--Statistical methods. 2. Sociology--Research--Statistical methods. 3. Sociology--Statistical methods--Computer programs. 4. Social sciences--Statistical methods--Computer programs. 5. Stata. I. Title.
HA29.T675 2008
300.72'4-dc22

Printed in the United States of America
FIRST EDITION
PB Printing 10 9 8 7 6 5 4 3 2 1
CONTENTS

Tables, Figures, Exhibits, and Boxes
Preface
The Author
Introduction

1 CROSS-TABULATIONS
What This Chapter Is About
Introduction to the Book via a Concrete Example
Cross-Tabulations
What This Chapter Has Shown

2 MORE ON TABLES
What This Chapter Is About
The Logic of Elaboration
Suppressor Variables
Additive and Interaction Effects
Direct Standardization
A Final Note on Statistical Controls Versus Experiments
What This Chapter Has Shown

3 STILL MORE ON TABLES
What This Chapter Is About
Reorganizing Tables to Extract New Information
When to Percentage a Table "Backwards"
Cross-Tabulations in Which the Dependent Variable Is Represented by a Mean
Index of Dissimilarity
Writing About Cross-Tabulations
What This Chapter Has Shown

4 ON THE MANIPULATION OF DATA BY COMPUTER
What This Chapter Is About
Introduction
How Data Files Are Organized
Transforming Data
What This Chapter Has Shown
Appendix 4.A Doing Analysis Using Stata
Tips on Doing Analysis Using Stata
Some Particularly Useful Stata 10.0 Commands

5 INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
What This Chapter Is About
Introduction
Quantifying the Size of a Relationship: Regression Analysis
Assessing the Strength of a Relationship: Correlation Analysis
The Relationship Between Correlation and Regression Coefficients
Factors Affecting the Size of Correlation (and Regression) Coefficients
Correlation Ratios
What This Chapter Has Shown

6 INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
What This Chapter Is About
Introduction
A Worked Example: The Determinants of Literacy in China
Dummy Variables
A Strategy for Comparisons Across Groups
A Bayesian Alternative for Comparing Models
Independent Validation
What This Chapter Has Shown

7 MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS
What This Chapter Is About
Nonlinear Transformations
Testing the Equality of Coefficients
Trend Analysis: Testing the Assumption of Linearity
Linear Splines
Expressing Coefficients as Deviations from the Grand Mean (Multiple Classification Analysis)
Other Ways of Representing Dummy Variables
Decomposing the Difference Between Two Means
What This Chapter Has Shown

8 MULTIPLE IMPUTATION OF MISSING DATA
What This Chapter Is About
Introduction
A Worked Example: The Effect of Cultural Capital on Educational Attainment in Russia
What This Chapter Has Shown

9 SAMPLE DESIGN AND SURVEY ESTIMATION
What This Chapter Is About
Survey Samples
Conclusion
What This Chapter Has Shown

10 REGRESSION DIAGNOSTICS
What This Chapter Is About
Introduction
A Worked Example: Societal Differences in Status Attainment
Robust Regression
Bootstrapping and Standard Errors
What This Chapter Has Shown

11 SCALE CONSTRUCTION
What This Chapter Is About
Introduction
Validity
Reliability
Scale Construction
Errors-in-Variables Regression
What This Chapter Has Shown

12 LOG-LINEAR ANALYSIS
What This Chapter Is About
Introduction
Choosing a Preferred Model
Parsimonious Models
A Bibliographic Note
What This Chapter Has Shown
Appendix 12.A Derivation of the Effect Parameters
Appendix 12.B Introduction to Maximum Likelihood Estimation
Mean of a Normal Distribution
Log-Linear Parameters

13 BINOMIAL LOGISTIC REGRESSION
What This Chapter Is About
Introduction
Relation to Log-Linear Analysis
A Worked Logistic Regression Example: Predicting Prevalence of Armed Threats
A Second Worked Example: Schooling Progression Ratios in Japan
A Third Worked Example (Discrete-Time Hazard-Rate Models): Age at First Marriage
A Fourth Worked Example (Case-Control Models): Who Was Appointed to a Nomenklatura Position in Russia?
What This Chapter Has Shown
Appendix 13.A Some Algebra for Logs and Exponents
Appendix 13.B Introduction to Probit Analysis

14 MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION
What This Chapter Is About
Multinomial Logit Analysis
Ordinal Logistic Regression
Tobit Regression (and Allied Procedures) for Censored Dependent Variables
Other Models for the Analysis of Limited Dependent Variables
What This Chapter Has Shown

15 IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING
What This Chapter Is About
Introduction
Fixed Effects Models for Continuous Variables
Random Effects Models for Continuous Variables
A Worked Example: The Determinants of Income in China
Fixed Effects Models for Binary Outcomes
A Bibliographic Note
What This Chapter Has Shown

16 FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
What This Chapter Is About
Research Design Issues
The Importance of Probability Sampling
A Final Note: Good Professional Practice
What This Chapter Has Shown

Appendix A: Data Descriptions and Download Locations for the Data Used in This Book
Appendix B: Survey Estimation with the General Social Survey
References
Index

TABLES

1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.
1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.
1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.
1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.
1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.
1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).
2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
2.7. Percentage Accepting the Scientific View of Evolution by Age.
2.8. Percentage Distribution of Educational Attainment by Religion.
2.9. Percentage Distribution of Age by Religion.
2.10. Joint Probability Distribution of Education and Age.
2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Sex (Percentage Bases in Parentheses).
2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
2.14. Mean Number of Chinese Characters Known (Out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in Distribution of Education; N = 6,081).
3.1. Frequency Distribution of Acceptance of Abortion by Religion and Education, U.S. Adults, 1965 (N = 1,368).
3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.4. Means and Standard Deviations of Income in 1979 by Education and Gender, U.S. Adults, 1980.
3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.6. Percentage Distribution Over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).
5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.
6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).
6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.
6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).
7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.
7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).
7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).
7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F = 12.48; p < .0000).
7.8. Means, Standard Deviations, and Correlations for Variables Included in a Model of Educational Attainment for U.S. Adults, 1990 to 2004, by Race (Blacks Above the Diagonal, Non-Blacks Below).
7.9. Coefficients of a Model of Educational Attainment, for Blacks and Non-Blacks, U.S. Adults, 1990 to 2004.
7.10. Decomposition of the Difference in the Mean Years of School Completed by Non-Blacks and Blacks, U.S. Adults, 1990 to 2004.
8.1. Descriptive Statistics for the Variables Used in the Analysis, Russian Adults Age Twenty-Two to Sixty-Nine in 1993 (N = 4,685).
8.2. Comparison of Coefficients for a Model of Educational Attainment Estimated from a Casewise-Deleted Data Set [C] (N = 2,661) and from a Multiply Imputed Data Set [M] (N = 4,685), Russian Adults Age Twenty-Two to Sixty-Nine in 1993.
9.1. Portion of a Table of Random Numbers.
9.2. … of the Total Population Residing in Each of the Ten Largest Cities in California, 1990.
9.3. Design Effects for Selected Statistics, Samples of 3,000 with Clustering (50 Counties as Primary Sampling Units, 2 Villages or Neighborhoods per County, and 30 Adults Age 20 to 69 per Village or Neighborhood), With and Without Stratification, by Level of Education.
9.4. Determinants of the Number of Chinese Characters Correctly Identified on a 10-Item Test, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
9.5. Coefficients for Models of the Determinants of Income, U.S. Adult Women, 1994, Under Various Design Assumptions (N = 1,015).
9.6. Coefficients of a Model of Educational Attainment, U.S. Adults, 1990 to 2004 (N = 15,932).
10.1. Coefficients for Models of the Determinants of the Strength of the Occupation-Education Connection in Eighteen Nations.
11.1. Values of Cronbach's Alpha for Multiple-Item Scales with Various Combinations of the Number of Items and the Average Correlation Among Items.
11.2. Factor Loadings for Abortion Acceptance Items Before Rotation.
11.3. Abortion Factor Loadings After Varimax Rotation.
11.4. Means, Standard Deviations, and Correlations Among Variables Included in Models of the Acceptance of Legal Abortion, U.S. Adults, 1984 (N = 1,459).
11.5. Coefficients of Two Models Predicting Acceptance of Abortion, U.S. Adults, 1984.
11.6. Mean Score on the ISEI by Level of Education, Chinese Males Age Twenty to Sixty-Nine, 1996.
11.7. Coefficients of a Model of the Determinants of Political Conservatism Estimated by Conventional OLS and Errors-in-Variables Regression, U.S. Adults, 1984 (N = 1,294).
12.1. Frequency Distribution of Program by Sex in a Graduate Course.
12.2. Frequency Distribution of Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies.
12.3. Models of the Relationship Between Technology, Political Integration, and Level of Stratification in Ninety-Two Societies.
12.4. Percentage Distribution of Expected Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies (Expected Frequencies from Model 7 Are Percentaged).
12.5. Frequency Distribution of Whether "A Communist Should Be Allowed to Speak in Your Community" by Schooling, Region, and Age, U.S. Adults, 1977 (N = 1,478).
12.6. Goodness-of-Fit Statistics for Log-Linear Models of the Associations Among Whether a Communist Should Be Allowed to Speak in Your Community, Age, Region, and Education, U.S. Adults, 1977.
12.7. Expected Percentage (from Model 8) Agreeing That "A Communist Should Be Allowed to Speak in Your Community" by Education, Age, and Region, U.S. Adults, 1977.
12.8. Frequency Distribution of Voting by Race, Education, and Voluntary Association Membership.
12.9. Frequency Distribution of Occupation by Father's Occupation, Chinese Adults, 1996.
12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9.
12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School, 1996.
13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
13.2. Goodness-of-Fit Statistics for Various Models Predicting the Prevalence of Armed Threat to U.S. Adults, 1973 to 1994.
13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
13.5. Effect Parameters for Model 3 of Table 13.4.
13.6. Odds Ratios for a Model Predicting the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables.
13.7. Coefficients for a Model of Determinants of Nomenklatura Membership, Russia, 1988.
13.B.1. Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p Values in Italic.)
14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998 (N = 2,443).
14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.
14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults, 2000.
14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Are Significant at .001 or Beyond.)
15.1. Socioeconomic Characteristics of Chinese Adults by Size of Place of Residence, 1996.
15.2. Comparison of OLS and FE Estimates for a Model of the Determinants of Family Income, Chinese RMB, 1996 (N = 5,342).
15.3. Comparison of OLS and FE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)

FIGURES

2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.
2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.
2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.
2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
4.1. An IBM Punch Card.
5.1. Scatter Plot of Years of Schooling by Father's Years of Schooling (Hypothetical Data, N = 10).
5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling.
5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined.
5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) … Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation, a Parabola Symmetrical to the X-Axis.
5.5. The Effect of a Single Deviant Case (High Leverage Point).
5.6. Truncating Distributions Reduces Correlations.
5.7. The Effect of Aggregation on Correlations.
6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). (Note: the female line does not extend beyond 16 because there are no females in the sample with postgraduate education.)
6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).
7.1. The Relationship Between 2003 Income and Age, U.S. Adults Age Twenty to Sixty-Four in 2004 (N = 1,573).
7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7; N = 1,459).
7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 19… Through 1998 (Linear Trend and Annual Means; N = 21,464).
7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).
7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). Predicted Values from a Linear Spline with a Knot at 1947.
7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086).
7.10. Figure 7.9 Rescaled to Show the Entire Range of the Y-Axis.
10.1. Four Scatter Plots with Identical Lines.
10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).
10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).
10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989).
10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D.
10.7. Added-Variable Plots for Treiman and Yip's Equation 7.
10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.
10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.
10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries.
11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556).
13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994.
13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994.
13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552).
14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258).
16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability).
16.2. Blau and Duncan's Basic Model of the Process of Stratification.

EXHIBITS

4.1. Illustration of How Data Files Are Organized.
4.2. A Codebook Corresponding to Exhibit 4.1.

BOXES

Open-Ended Questions
Samuel A. Stouffer
Technical Points on Table 1.1
Technical Points on Table 1.2
Technical Points on Table 1.3
Technical Points on Table 1.4
Technical Points on Table 1.5
Technical Points on Table 1.6
Paul Lazarsfeld
Hans Zeisel
Direct Standardization in Earlier Survey Research
Stata -do- Files and -log- Files
The Weakness of Matching and a Useful Fix
Technical Points on Table 3.3
Substantive Points on Table 3.3
A Historical Note on Social Science Computer Packages
Herman Hollerith
The Way Things Were
Treating Missing Values as If They Were Not
People Generally Like to Respond to (Well-Designed and Well-Administered) Surveys
Why Use the "Least Squares" Criterion to Determine the Best-Fitting Line?
Karl Pearson
A Useful Computational Formula for r
A "Real Data" Example of the Effect of Truncating the Distribution
A Useful Computational Formula for η²
Multicollinearity
Reminder Regarding the Variance of Dichotomous Variables
A Formula for Computing R² from Correlations
Adjusted R²
Always Present Descriptive Statistics
Technical Point on Table 6.2
Why You Should Include the Entire Sample in Your Analysis
Getting p-Values via Stata
Using Stata to Compare the Goodness-of-Fit of Regression Models
R. A. (Ronald Aylmer) Fisher
How to Test the Significance of the Difference Between Two Coefficients
Alternative Ways to Estimate BIC
Why the Relationship Between Income and Age Is Curvilinear
A Trick to Reduce Collinearity
In Some Years of the GSS, Only a Subset of Respondents Was Asked Certain Questions
An Alternative Specification of Spline Functions
Why Black versus Non-Black Is Better Than White versus Nonwhite for Social Analysis in the United States
A Comment on Credit in Science
Why Pairwise Deletion Should Be Avoided
Technical Details on the Variables
Telephone Surveys
Mail Surveys
Web Surveys
Philip M. Hauser
A Superior Sampling Procedure
Sources of Nonresponse
Leslie Kish
How the Chinese Stratified Sample Used in the Design Experiments Was Constructed
Weighting Data in Stata
Limitations of the Stata 10.0 Survey Estimation Procedure
An Alternative to Survey Estimation
How to Downweight Sample Size in Stata
Ways to Assess Reliability
Why the SAT and GRE Tests Include Several Hundred Items
Transforming Variables so That "High" Has a Consistent Meaning
Constructing Scales from Incomplete Information
In Log-Linear Analysis "Interaction" Simply Means "Association"
L² Defined
Other Software for Estimating Log-Linear Models
Maximum Likelihood Estimation
Probit Analysis
Technical Point on Table 13.1
Limitations of Wald Tests
Smoothing Distributions
Estimating Generalized Ordered Logit Models with Stata
James Tobin
Panel Surveys in the Public Domain
Otis Dudley Duncan
Sewell Wright
Ask a Foreigner to Do It
George Peter Murdock
In the United States, Publicly Funded Studies Must Be Made Available to the Research Community
An "Available from Author" Archive

PREFACE

This is a book about how to conduct theoretically informed quantitative social research; that is, social research to test ideas. It derives from a course for graduate students in sociology and other social sciences and social science-based professional schools (public health, education, social welfare, urban planning, and so on) that I have been teaching at UCLA for some thirty years. The course has evolved as quantitative methods in the social sciences have advanced; early versions of the course were based on the first half of this book (through Chapter Seven), with additional materials added over the years. Interestingly, I have been able to retain the same format (a twenty-week course with one three-hour lecture per week and a weekly exercise, culminating in a term paper written during the last four weeks of the course) from the outset, which is, I suppose, a tribute to the increasing level of preparation and quantitative competence of graduate students in the social sciences. The book owes much to lively class discussions over the years, of often subtle and complex methodological points.

By the end of the book, you should know how to make substantive sense of a body of quantitative data. That is, you should be well prepared to produce publishable papers in your field, as well as first-rate dissertation chapters. Of course, there is always more to learn. In the final chapter (Chapter Sixteen), I discuss advanced topics that go beyond what can be covered in a first course in data analysis.

The focus is on the analysis of data from representative samples of well-defined populations, although some exceptions are considered. The populations can consist of almost anything (people, formal organizations, societies, occupations, pottery shards, or whatever); the analytic issues are essentially the same. Data collection procedures are mentioned only in passing. There simply is not enough space in an already lengthy book to do justice to both data analysis and data collection. Thus, you will need to look elsewhere for systematic instruction on data collection procedures. A strong case can be made that you should do this after rather than before a course on data analysis because the main problem in designing a data collection effort is deciding what to collect, which means you first need to know how you will conduct your analysis. An alternative method of learning about the practical details of data collection is to become an apprentice (unpaid, if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step by step, even when your presence is a nuisance.

This book covers a variety of techniques, including tabular analysis, log-linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, factor-based and other techniques for scale construction, and fixed- and random-effects models as a way to make causal inferences. But this is not a statistics book; the emphasis is on using these procedures to draw substantive conclusions about how the social world works. Accordingly, the book is designed for a course to be taken after a first-year graduate statistics course in the social sciences. Although there are many equations in the book, this is because it is
necessa.ry to understandhow statisticalprocedureswork to usethernintelligently. Because the emphasisis on applications,there are many worked examples,often adaptedfrom my own research.In addition to data from samplesurveysI haveconducted,I also rely heavily on the GeneralSocial Survey,an omnibussurveydesignedfor use by the research community and also for teaching.Appendix A describesthe main data sets used for the substantiveexamplesand provides information on how to obtain them; they are all availablewithout cost. The only prerequisitesfor successfuluseof this book are a prior graduateJevelsocial sciencestatisticscourse,a willingnessto think carefullyandwork hard,andthe ability to do high school algebraeither rememberedor relearned.With only a handful of exceptions (referencesat one or two points to calculus and to matrix algebra),no mathematics beyondhigh school algebrais used.If your high schoolalgebrais rusty, you can find good reviews in Helen Walker, Mathematics Essential for Elementary St,,tistics, and W. L. Bashaw,Mathematicsfor Statistics.These books have been around forever. Although more recent equivalentsprobably exist, school algebra has not changed,so it hardly matters.Copiesof thesebooksarereadily availableat amazon . com, andprobably many otherplacesaswell. The statisticalsoftwarepackageusedin this book is Srara(release10). Downloadable commandfiles (do files in Stata'sterminology),files of results(1og files), and ancillary computer files used in the computations are available at wwwjosseybass. conr/golquantitativedataanalysis Often the details underlying particular computationsare only found in the downloadable do  and  1og  files, so be sureto downloadandstudythemcarefully.Thesefiles will be updatedasnew releasesof Statabecomeavailable. I use Statain my teachingand in this book becauseit has very rapidly becomethe statistical packageof choicein leadingsociologyand economicsdepartments. 
This is not accidental. Stata is a fast and efficient package that includes most of the statistical procedures of interest to social scientists, and new commands are being added at a rapid pace. Although many statistical packages are available, the three leading contenders currently are Stata, SPSS, and SAS. As software, Stata is clearly superior to SPSS: it is faster, more accurate, and includes a wider range of applications. SAS, although very powerful, is not nearly as intuitive as Stata and is more difficult to learn (and to teach). Nonetheless, this book can be readily used in conjunction with either SPSS or SAS, simply by translating the syntax of the Stata do files. (I have done something like this, exploiting Allison's excellent, but SAS-based, exposition of fixed- and random-effects models [Allison 2005] by writing the corresponding Stata code.)
FOR INSTRUCTORS
Some notes on how I have used these materials in teaching may be helpful to you as you design your own course. As noted previously, the course on which this book is based runs for two quarters (twenty weeks). I have offered one three-hour lecture per week and have assigned an exercise every week. When I first taught the course, I read these exercises myself, but as
enrollments have increased, I have enjoyed the services of a T.A. (chosen from among students who had done well in the course in previous years), who assists students with the details of computing and statistics and also reads and comments on the exercises. In recent years, I have offered seventeen lectures and have assigned exercises for all but the last weeks, with the final month of the course devoted to producing two drafts of a term paper. I read the first drafts and write comments, in an attempt to emulate a journal submission process. Thus, in my course, everyone gets a "revise and resubmit" response. I encourage students to develop their term papers in the course of doing the exercises and to complete their drafts in the two weeks after the last exercise is due.
The initial exercises are designed to lead students in a guided way through the mechanics of analysis, and some of the later exercises do this as well. But the exercises increasingly take a free form: "carry out an analysis like that presented in the book." Illustrative answers are provided for those exercises that involve definitive answers, that is, that are similar to statistics problem sets.
The course syllabus, weekly exercises, and illustrative answers to those exercises for which I have written illustrative answers are available for downloading from www.josseybass.com/go/quantitativedataanalysis
ACKNOWLEDGMENTS
As noted earlier, this book has been developed in interaction with many cohorts of graduate students at UCLA who have wrestled with each of the chapters included here and have revealed troubles in the exposition, sometimes by way of explicit comments and sometimes via displays of confusion. The book would not exist without them, as I never imagined myself writing a textbook, and so I owe them great thanks. One in particular, Pamela Stoddard, literally caused the book to be published in its current form by mentioning in the course of a chance airplane conversation with Andrew Pasternack, a Jossey-Bass acquisitions editor, that her professor was thinking of publishing the chapters he used as a course text. Andy contacted me, and the rest is history.
The course on which this book is based first came into being through collaboration with my colleague Jonathan Kelley, when he was a visiting professor at UCLA in the ... The first exercise is borrowed from him, and the general thrust of the course, especially the first half, owes much to him.
My colleague, Bill Mason, recently retired from the UCLA Sociology and Statistics departments, has been my statistical guru for many years. Often I have turned to him for insights into difficult statistical issues. And much that I have learned about topics that were not part of the curriculum when I was a graduate student has been from sitting in on advanced statistics courses offered by Bill. Another colleague, Rob Mare, has been helpful in much the same way. My new colleague, Jennie Brand, who took over my quantitative data analysis course in the fall of 2008, has read the entire manuscript and has offered helpful suggestions. Finally, the book has benefited greatly from very careful readings by a group of about 100 Chinese students, to whom I gave a special version of the course in an intensive summer session at Beijing University in July 2008. They caught
many errors that had gone unnoticed and raised often subtle points that resulted in the reworking of selected portions of the text.
My understanding of research design and statistical issues, especially concerning causality and threats to causal inference, has benefited greatly from the weekly seminar of the California Center for Population Research, which brings together sociologists, economists, and other social scientists to listen to, and comment on, presentations of work in progress, mainly by visitors from other campuses. The lively and wide-ranging discussion has been something of a floating tutorial, a realization of what I have imagined academic life could and should be like.
Finally, my wife, Judith Herschman, has displayed endless patience, only occasionally asking, "When are you going to finally publish your methods book?"
THE AUTHOR
Donald J. Treiman is distinguished professor of sociology at the University of California at Los Angeles (UCLA) and was until recently director of UCLA's California Center for Population Research. He has a BA from Reed College (1962) and an MA and PhD from the University of Chicago (1967). As a graduate student at Chicago, he spent most of his time at the National Opinion Research Center (NORC), where he gained valuable training and experience in survey research. He then taught at the University of Wisconsin, where he decided that he really was a social demographer at heart, and made the Center for Demography and Ecology his intellectual home. From Wisconsin, he moved to Columbia University and then, in 1975, to UCLA, where he has been ever since, albeit with extended sojourns elsewhere, as staff director of a study committee at the National Academy of Sciences/National Research Council (1978-1981) and fellowship years at the U.S. Bureau of the Census (1987-1988), the Center for Advanced Study in the Behavioral and Social Sciences (1992-1993), and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (1996-1997).
Professor Treiman started his career as a student of social stratification and status attainment, particularly from a cross-national perspective, and this has remained a continuing interest. He and his Dutch colleague, Harry Ganzeboom, have been engaged in a long-term cross-national project to analyze variations in the status attainment process across nations throughout the world over the course of the twentieth century. To date, they have compiled an archive of more than 300 sample surveys from more than 50 nations, ranging through the last half of the century. In addition to his comparative project, Professor Treiman has conducted large-scale national probability sample surveys in South Africa (1991-1994), Eastern Europe (1993-1994), and China (1996), all concerned with various aspects of social inequality.
His current research has moved in a more demographic direction. He has a national probability sample survey currently in progress in China, which focuses on the determinants, dynamics, and consequences of internal migration.
INTRODUCTION
It is not uncommon for statistics courses taken by graduate students in the social sciences to be treated essentially as mathematics courses, with substantial emphasis on derivations and proofs. Even when empirical examples are used, which they frequently are because
OVERVIEW OF CHAPTERS
This book begins at the beginning, with the most basic approach to analyzing nonexperimental data: percentage tables. Chapters One through Three describe the logic of cross-tabulations and provide many technical details on how to produce attractive (because they are clear and easy to read) tables. The two central ideas in these chapters are deciding in which direction to percentage tables and understanding statistical controls. It turns out that the first of these is difficult for some students, much more difficult than coping with complex mathematical formulas, which we do later in the book. Thus, even if you think you already know all there is to know about percentage tables, I encourage you to pay careful attention to these chapters. Doing so will pay great dividends.
Chapter Four is an introduction to computing. In this chapter, I show how data are organized for analysis by computer and how analysis is conducted using statistical
software. I also provide hints for using Stata, the statistical package used in this book. However, the chapter is written in such a way that it also can serve as an introduction to other statistical packages, such as SPSS and SAS.
Chapters Five through Seven consider ordinary least squares correlation and regression, the workhorse of statistical analysis in the social sciences. These procedures provide a way of quantifying the relationship between some quantitative outcome and its determinants: for example, how much of a difference in income should we expect for people who differ by a given number of years in their level of schooling, holding constant other confounding factors? They also provide a way of assessing how good our prediction is: for example, how much of the variability in income can be attributed to differences in education, gender, race, and so on. Chapter Five focuses on two-variable correlation and regression to get the logic straight and to consider some common errors in interpretation of correlation and regression statistics. Chapter Six considers multiple regression, which is used when there are several predictors of a particular outcome, and introduces the idea of "dummy" or dichotomous variables, which require special treatment. Making use of dummy variables and "interaction terms," I offer a strategy for assessing whether social processes differ across population groups, a frequent question in the social sciences. Chapter Seven offers a variety of tricks that permit relatively refined hypotheses to be tested within a regression framework.
Most data sets analyzed by social scientists are plagued by missing data: information on particular variables that is missing for specific individuals. Chapter Eight reviews ways to cope with missing data, culminating in a demonstration of how to do multiple imputation of missing data, the current state-of-the-art approach.
Chapter Nine takes up the issue of sampling and its implications for statistical analysis. Whereas the previous chapters assumed simple random sampling, most general population samples are actually complex, multistage samples. Correctly analyzing data from such samples requires that we take account of the "clustering" of observations when we compute standard errors. This chapter introduces survey estimation procedures, which do this.
There are many pitfalls to regression that can trap the unwary. As noted, these are first discussed (briefly) in Chapter Five. Chapter Ten gives them a fuller treatment, through the introduction of what are known as regression diagnostics. These procedures provide protection against the possibility of making false inferences from regression results.
Chapter Eleven shows why and how to construct multiple-item scales, focusing principally on factor-based scaling but also introducing effect-proportional scaling. Often we want to study concepts for which no one item in a questionnaire provides an adequate measure, for example, "level of living," "liberalism," "Type A personality," and "depression." Summary measures, or scales, based on several items usually provide variables that are both more reliable and more valid than single items. This chapter shows how to create such scales and how to use them.
Chapters Twelve through Fourteen provide techniques for considering limited dependent variables. Ordinary least squares regression is designed to handle outcome
variables that can be treated as continuous, such as income, years of schooling, and so on. But many outcome variables of interest to social scientists are dichotomous (for example, whether people vote, marry, have been victimized by crime, and so on) and others are polytomous (political affiliation in a multiparty system, occupational category, type of university attended, and so on). Log-linear analysis and logistic regression are techniques for dealing with limited dependent variables. Chapter Twelve considers log-linear analysis, a technique for making rigorous inferences about the relationships among a set of polytomous variables, that is, inferences about the degree and pattern of associations among cross-tabulated variables. In this sense, log-linear analysis provides a way of doing statistical inference about the kinds of tables we consider in Chapters One through Three. Chapter Thirteen introduces binary logistic regression, an appropriate technique for analyzing dichotomous outcomes, and then shows how to use this technique to handle several kinds of cases: progression ratios, where what is being studied are the factors affecting whether people move through a series of steps, say from one level of schooling to the next; discrete-time hazard-rate models, where what is being studied is the likelihood that an event (say, first marriage) occurs at a point in time (say, a given age); and case-control models, which provide a way of studying the likelihood of rare events such as contracting diseases, gaining elite occupations, and so on. Chapter Fourteen shows how to study still other limited dependent variables: unordered polytomous variables, such as type of place of residence, via multinomial logistic regression; ordinal outcomes, in which the order of categories is known but not the distance between categories, such as some attitude scales (are you "very happy," "somewhat happy," or "not too happy"), via ordinal logistic regression; and "censored" variables, where the range of a scale is truncated, for example, an income variable with the top category "$100,000 per year or more," via tobit regression.
When using nonexperimental data, it frequently is difficult to definitively establish that one variable causes another because they both could depend on still a third variable, often unmeasured. Chapter Fifteen provides a class of techniques, known as fixed-effects and random-effects models, for dealing with such problems when one has suitable data: either panel data, in which data are available for the same individuals at more than one point in time, or clustered data, in which observations are available for more than one individual in a family, school, community, or other unit. When appropriate data are available, this is a very powerful approach.
The final chapter (Chapter Sixteen) considers techniques that are beyond what I have been able to cover in this book and beyond what usually can be covered in a first-year graduate course in quantitative data analysis. Many of these techniques, now widely used in economics, are ways of coping with various versions of the endogeneity problem, the possibility that unmeasured variables affect both predictors and outcomes, resulting in biased estimates. Fixed- and random-effects models provide one way of dealing with such problems, but many other techniques are available, which are reviewed in Chapter Sixteen. I also briefly introduce structural equation modeling, a technique for dealing with complex social processes in which an outcome variable is a predictor variable for another outcome. For example, in status attainment analysis, we would want to study
how the social status of parents affects education, how parental status and education affect the status of the first job, and so on. The brief introduction to these advanced techniques is intended to provide guidance in pursuing more advanced training in quantitative analysis. I conclude the chapter with advice on good research practice: hints on how to improve the quality of your work and how to save time and energy in the bargain.
CHAPTER ONE
CROSS-TABULATIONS
WHAT THIS CHAPTER IS ABOUT
In this chapter, we start with an introduction to the elements of quantitative analysis, the topics to be covered in this book. Then we deal with the most basic of all quantitative analyses: cross-tabulations or percentage tables. (Strictly speaking, not all percentage tables are cross-tabulations because we can percentage univariate distributions. But the emphasis of this chapter will be on how to percentage tables involving the simultaneous tabulation of two or more variables.) Although the procedures are basic, they are not trivial. There are clear principles for deciding how to percentage cross-tabulations. We will cover these principles and also their exceptions. In the course of doing this, we will consider the logic of causal argument. Then we will consider other ways, besides percentage tables, of summarizing univariate and multivariate distributions of data, as well as ways of assessing the relative size of associations between pairs of variables, controlling for or holding constant other variables. Take this chapter seriously, even if you have encountered percentage tables before and think you know a lot about them. In my experience, getting right the logic of how to percentage a table proves to be very difficult for many students, much more difficult than seemingly fancier procedures, such as multiple regression.
You will notice that many of the examples in the first three chapters are quite old, drawn from studies conducted as far back as the 1960s. This is because at that time tabular analysis was the "state of the art," the technique used in most of the articles published in the leading journals. Thus, by going back to the older research literature, I have been able to find particularly clear applications of tabular procedures.
Quantitative Data Analysis: Doing Social Research to Test Ideas
INTRODUCTION TO THE BOOK VIA A CONCRETE EXAMPLE
In 1967, Gary Marx published an article in the American Sociological Review titled "Religion: Opiate or inspiration of civil rights militancy among Negroes?" (Marx 1967a; see also Marx 1967b). The title expressed two competing ideas about how religiosity among Blacks might have affected their militancy regarding civil rights. One possibility was that religious people would be less militant than nonreligious people because religion gave them an otherworldly rather than this-worldly orientation, and established religious institutions have generally had a stake in the status quo and hence a conservative orientation. The other possibility was that they would be more militant because the Black churches were a major locus of civil rights militancy, and religion is an important source of universal humanistic values. Of course, a third possibility was that there would be no connection between religiosity and militancy.
Suppose that we want to decide which of these ideas is correct. How can we do this? One way, which is the focus of our interest here, would be to ask a probability sample of Blacks how religious they are and how militant with respect to civil rights they are, and then to cross-tabulate the answers to determine the relative likelihood, or probability, that religious and nonreligious people say they are militant. If religious people are less likely to give militant responses than are nonreligious people, the evidence would support the first possibility; if religious people are more likely to give militant responses, the evidence would favor the second possibility; and if there is no difference in the relative likelihoods of religious and nonreligious people giving militant responses, the evidence would favor the third possibility. Of course, evidence favoring an idea does not definitely prove it. I will say more about this later.
This seemingly simple example contains all of the elements that we will be dealing with in this book and that a researcher needs to take account of to arrive at a meaningful and believable answer to any research question. Let us consider the elements one by one.
First, the idea: is religion an opiate or inspiration of civil rights militancy? Without an idea, the manipulation of data is pointless. As you will see repeatedly, the nature of the idea a researcher wants to test will dictate the kind of data chosen and the manipulations performed. Without an idea, it is impossible to decide what to do, and the researcher will be tempted to try to do everything and be at a loss to choose from among the various things he or she has done. Ideas to be tested are generally called hypotheses; they also will be referred to here and in what follows as theories. A theory need not be either grandiose or abstract to be labeled as such. Any idea about what causes what, or why and how two variables are associated, is a theory.
Second is the information, or data, needed to test the idea or hypothesis (or theory). In this book, we will be concerned with data drawn from probability samples of populations. A population is any definable collection of things. Mostly we will be concerned with populations of people, such as the population of the United States. But social scientists are also interested in populations of organizations, cities, occupations, and so on. A probability sample is a subset of the population selected in such a way that the probability that a given individual in the population will be included in the sample is known. Only by using a probability sample is it possible to make inferences from the characteristics of the sample to the characteristics of the population from which the sample is drawn.
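The comparison of relative likelihoods described in the religiosity example can be sketched in a few lines of code. This is only an illustration under stated assumptions: it uses Python rather than the Stata used in this book, the counts are invented (they are not Marx's data), and the function names `percent_militant` and `favored_possibility` are hypothetical. A real analysis would also have to ask whether an observed difference is larger than sampling error would produce.

```python
from collections import Counter

# Hypothetical (religiosity, militancy) answers, invented for illustration.
# These are NOT Marx's data.
responses = (
    [("religious", "militant")] * 30
    + [("religious", "not militant")] * 70
    + [("nonreligious", "militant")] * 45
    + [("nonreligious", "not militant")] * 55
)


def percent_militant(rows):
    """Percentage giving a militant response within each religiosity group."""
    totals = Counter(group for group, _ in rows)
    militant = Counter(group for group, answer in rows if answer == "militant")
    return {group: 100.0 * militant[group] / totals[group] for group in totals}


def favored_possibility(pct):
    """Name the possibility that the observed percentages favor."""
    difference = pct["religious"] - pct["nonreligious"]
    if difference < 0:
        return "first: religious people are less militant"
    if difference > 0:
        return "second: religious people are more militant"
    return "third: no connection"


percentages = percent_militant(responses)
verdict = favored_possibility(percentages)
print(percentages, verdict)
```

Note that the percentages are computed within each category of religiosity, so that the two groups can be compared even if they differ in size.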
Thus, if we observe a given result in a probability sample, we can infer within a specified range what the likely result will be in the population.
The sample used by Marx is actually quite complex, consisting of a probability sample of Blacks living in metropolitan areas outside the South, plus four special-purpose probability samples of Blacks living in Chicago, New York, Atlanta, and Birmingham. The total number of respondents from the non-Southern urban sample plus the four special samples is 1,119, and Marx treats the combined sample as representative of urban Blacks in the United States. This is not, in fact, entirely legitimate. Later we will explore how to weight complex samples to make them truly representative of the populations from which they are drawn. Evaluation of the sample used in an analysis is an important part of the data analyst's task. But for now, we will go along with Marx in treating his sample as a probability sample of U.S. urban Blacks.
When our ideas are about the behavior or attitudes of people, a standard way of collecting data is to ask a probability sample chosen from an appropriate population to tell us about their behavior and attitudes by answering a set of specific questions. That is, we survey the sample by asking each individual in the sample a set of questions and recording the responses. In most sample surveys, the possible responses are preselected, and the person being surveyed, the respondent, is asked to choose the best response from a list (but see the boxed comment on open-ended questions). For example, one of the questions Marx asked was
What would you say about the civil rights demonstrations over the last few years: that they have helped Negroes a great deal, helped a little, hurt a little, or hurt a great deal?
Helped a great deal 1
Helped a little 2
Hurt a little 3
Hurt a great deal 4
Don't know 5
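As a rough sketch of how preselected responses become numbers, the fragment below maps the five response categories of the question above to their numeric codes and tabulates a percentage distribution. It is written in Python rather than Stata, the four answers are invented for illustration, and the names `encode` and `percentage_distribution` are hypothetical.

```python
from collections import Counter

# Response categories and numeric codes as listed for the question above.
CODES = {
    "Helped a great deal": 1,
    "Helped a little": 2,
    "Hurt a little": 3,
    "Hurt a great deal": 4,
    "Don't know": 5,
}


def encode(answers):
    """Convert each verbal answer to its numeric code."""
    return [CODES[answer] for answer in answers]


def percentage_distribution(codes):
    """Percentage of respondents with each code (0.0 for unused codes)."""
    counts = Counter(codes)
    n = len(codes)
    return {code: 100.0 * counts[code] / n for code in sorted(CODES.values())}


# Four invented answers, for illustration only.
answers = ["Helped a little", "Helped a great deal", "Helped a little", "Don't know"]
coded = encode(answers)
distribution = percentage_distribution(coded)
print(coded, distribution)
```

Only the codes are stored for each respondent; the verbal labels live in a separate lookup table.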
OPEN-ENDED QUESTIONS
Occasionally, questions are asked without a preselected list of responses. Such open-ended questions are used when the possible responses cannot all be listed on a questionnaire or when the researcher doesn't have a very good idea of what the possible responses will be. Open-ended questions must be coded, that is, converted into a standard set of response categories, as an editing operation in the course of data preparation. This is very time consuming and expensive and is avoided whenever possible. Still, some items must be asked in an open-ended format. Both in the decennial census and in many contemporary surveys in the United States, for example, a series of three open-ended questions typically is asked to elicit information necessary to classify respondents according to standard detailed (three-digit) classifications of occupation and industry.
Each response, or response category, has a number associated with it, known as a code. The codes are what are actually recorded when the data are prepared for analysis and are what is used to manipulate data by computer. Sometimes respondents will refuse to answer a question or, in a self-administered questionnaire, will choose more than one response. Sometimes an interviewer will forget to record a response or will record it in an ambiguous way. For these reasons, a special code is usually designated to indicate nonresponses or uncodable responses. For example, such a code might be assigned to nonresponses to the preceding question when the data are being prepared for analysis (this topic is discussed a bit later). How to deal with nonresponses, or missing data, is one of the perennial problems of the survey analyst, and we will devote a great deal of attention to this question.
The term variable refers to each set of response categories and the associated codes. A machine-readable data set (stored on disks, CD-ROMs, and the like) contains the codes for each individual in the sample corresponding to the response categories for the variables included in the data set. Suppose, for example, the earlier question on whether civil rights demonstrations have helped Negroes is included in a survey. Suppose, also, that the first respondent in the sample answered "helped a little." The data set would then include a "2" in the location for this variable in the record for the first individual. To know exactly what is included in a data set, and where in the data set it is located, a codebook is prepared and used as a map to the data set. The materials necessary to carry out the sort of analysis dealt with in this book are a data set, a codebook for the data set, and documentation that describes the sample. We will not be concerned with problems of data collection or the preparation of a machine-readable data set, except in passing. These topics require full treatment in their own right, and we will not have time for them here.
A nominal variable consists of a set of mutually exclusive and collectively exhaustive categories. Religious affiliation is an example of such a variable. For example, we might have the following response categories and codes:
Protestant 1
Catholic 2
Jewish 3
Other 4
None 5
No answer 9
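The relationship among codes, a codebook, and missing data can be sketched as follows. This is an illustrative Python fragment, not the Stata workflow used in this book: the codebook is a plain dictionary, the five records are invented, the code 4 for "Other" is an assumption (that code is not legible in the list above), and the names `label` and `valid_percentages` are hypothetical.

```python
from collections import Counter

# A tiny codebook for religious affiliation. The code 4 for "Other" is an
# assumption made for this sketch.
RELIGION_CODEBOOK = {
    1: "Protestant",
    2: "Catholic",
    3: "Jewish",
    4: "Other",
    5: "None",
    9: "No answer",
}
MISSING_CODES = {9}  # codes to be treated as missing data


def label(code):
    """Return the category label, or None if the code marks missing data."""
    return None if code in MISSING_CODES else RELIGION_CODEBOOK[code]


def valid_percentages(codes):
    """Percentage distribution computed over nonmissing responses only."""
    valid = [code for code in codes if code not in MISSING_CODES]
    counts = Counter(valid)
    return {RELIGION_CODEBOOK[c]: 100.0 * counts[c] / len(valid) for c in sorted(counts)}


# Five invented records, for illustration only.
records = [1, 1, 2, 5, 9]
n_missing = sum(1 for code in records if code in MISSING_CODES)
result = valid_percentages(records)
print(result, n_missing)
```

Because the categories are nominal, the numeric codes here serve only as names; no arithmetic on them would be meaningful.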
Note that no order is implied among the responses: no response is "better" or "higher" than any other. The variable simply provides a way of classifying people into religious groups. Note, further, that every individual in the survey has a code, even those who did not answer the question. This is accomplished by including a residual category. In well-designed variables, categories are always mutually exclusive and collectively exhaustive, that is, each individual is assigned one and only one code. (In Chapter Eight we will consider ways of coding missing data.)
Ordinal variables have an additional property: they can be arranged in an order along some dimension, such as quantity, value, or level. The question on civil rights demonstrations cited earlier is an ordinal variable, in which the dimension on which the response categories are ordered is helpfulness to Negroes. Strictly speaking, this is true only of the first four categories. As is often the case with variables you will encounter in surveys, the "don't know" response and the implicit "no answer" category are not self-evidently ordered with respect to the other categories. A plausible argument can be made that a "don't know" response is in between "helped a little" and "hurt a little," that is, that it is a neutral rather than either a positive or a negative response. To treat it this way, we would recode the variable so that the code for "don't know" falls between the codes for "helped a little" and "hurt a little." When such recodes are undertaken, they should be described as part of the write-up of the analysis. Nonresponse is a different matter. The reasons for nonresponse are so varied, including simple error, that there is no way to predict how the nonrespondents would have responded had they done so. Therefore, it probably would be wisest to treat nonresponses as missing data.
An important feature of ordinal variables is that they include no information about the distance between categories. For example, we do not know whether the difference between the judgment that civil rights demonstrations "hurt a little" and the judgment that they "helped a little" is greater or smaller than the difference between the judgment that they "helped a little" and that they "helped a great deal." For this reason, some statisticians and social researchers argue that we should not assign numerical values to the categories of such a variable and should use only the order property. This is not the position taken here. We will mainly consider two kinds of statistics, those appropriate for nominal variables and those appropriate for interval and ratio variables, and will largely ignore statistics specifically designed for ordinal variables. There are several reasons for this. First, ordinal statistics are much less widely used than parametric statistics (statistics that make assumptions, for example, that error is normally distributed) and are less mathematically tractable. Second, there are many alternatives for doing the same thing and little consensus among researchers about which ordinal statistic to use. Third, many ordinal statistics involve implicit assumptions that are just as restrictive as the assumptions underlying parametric statistics. For example, it can be shown that Spearman's rank order correlation (an ordinal statistic) is identical to the product-moment (Pearson) correlation (the conventional parametric correlation coefficient) when interval or ratio variables are converted to ranks. In effect, then, the Spearman rank order correlation assumes an equal distance between each category rather than making no assumptions about the distance between categories. Thus, little is gained by using ordinal statistics. Further discussions can be found in Davis 1971, and Hildebrand and others 1977.
Interval variables and ratio variables are similar in that the distance between categories is meaningful. Not only can we say that one category is higher than another (on some dimension) but also how much higher. Such variables legitimately can be manipulated with standard arithmetic operations: addition, subtraction, multiplication, and division.
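The equivalence noted above between Spearman's rank order correlation and the Pearson correlation of ranks can be checked numerically. The sketch below is in Python rather than Stata, uses five invented pairs of values (labeled schooling and income purely for illustration), and assumes there are no tied values, which is the condition under which the textbook shortcut formula $1 - 6\sum d^2 / (n(n^2-1))$ applies. The function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties assumed."""
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented values, for illustration only.
schooling = [8, 12, 16, 10, 14]
income = [20, 35, 90, 22, 30]

rho = spearman(schooling, income)
r_of_ranks = pearson(ranks(schooling), ranks(income))
print(rho, r_of_ranks)
```

The two quantities agree because converting to ranks replaces the original values with equally spaced integers, which is exactly the equal-distance assumption described in the text.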
SAMUEL A. STOUFFER (1900-1960) was an early leader in the development of survey research. He was born in Sac City, Iowa, and earned a B.A. from Morningside College; earned an M.A. in literature at Harvard; served three years as an editor of the Sac City Sun, the newspaper of his father; and then began graduate studies in sociology at the University of Chicago, completing his Ph.D. in 1930. While at Chicago he studied with William F. Ogburn, who introduced him to statistics despite his initial hostility to the subject. He studied statistical methods and mathematics intensively at Chicago and subsequently at the University of London, where he worked with Karl Pearson, among others (see the biographical sketch of Pearson in Chapter Five).
Stouffer held academic appointments in statistics and sociology at Wisconsin, Chicago, and Harvard. He was a skilled research administrator, leading a number of major projects: a Social Science Research Council project to evaluate the influence of the Depression; during World War II, a study of soldiers for the Defense Department, which resulted in the classic publication, The American Soldier; and a study of the anticommunist hysteria of the McCarthy era, funded by the Ford Foundation's Fund for the Republic, which resulted in Communism, Conformity, and Civil Liberties (1955). When he died rather unexpectedly at age sixty after a brief illness, he was in the process of developing for the Population Council a program of research on developing nations. He also played an important role in developing statistical standards in the U.S. Bureau of the Budget. A hallmark of Stouffer's work was his commitment to using empirical data and quantitative methods to rigorously test ideas about social processes, which makes it fitting that a posthumous collection of his papers is titled Social Research to Test Ideas (1962).
CrossTabulations
arions ror rhem rhe [r dr#i #i.*#ilder
,fl"T:H,"":f =E::m#':.':,"J"'fi ;i""..",.,,".;:oj:;::;;'# i:l::iiiiird;J:;::$:: iF::;Tiri;fl lE*'"',*'n:"'ff
lffii,tij$1lr","1ff ***,$+T.;",#n:r
"*::i*'T #l;'tril*fd1#;ltti*Ttfi'trJ,i
lq'jf"f* vanous forms of
ffi'tuffi*,,*j:.l: h g
:,pinion, is rhe povemm.
i*,ii )ii,';'A,:':Kf", ".
* washinston pushinsintesration toosrowtoo
,rantb workhard. ct I t aheadiustas e&Jilyas anyone else'@isagree.) thould sp"rd .or" ti*" and less time demonstratinS' (Disagree.) tc nuh I wou, u" *or" no'o'mg part in civil righrs demonstrations.(Disagree.) wt like to s"" *or o"^ol,'* tions or lessdemonstrations? (More.) ot+ne^horta oot hoJ
ffi1*';;":;:::;:;;":::::":"::,',;;:"X:T;i ft
,"t
OOerty shouldnot haveto sell
to Negroestf he doesn wsnt b.
rtf*tf*#r."r::%#.qf =*#tr{*1df #$,dTr*:
Quantitative Data Analysis: Doing Social Research to Test Ideas
The third element in any quantitative analysis is the model, the way we organize and manipulate data to assess our idea or hypothesis. The model has two components: the choice of statistical procedure and how the variables in our analysis are related.
;::T,'.x','l;:J*,'.",llfutii::li*:::*.rd:i ;j:;
witr, our nypotr,e,e.."d;;;#r::Jffi ',.TH,""r'":l'XT:1,"H;,il:"#:i*:i
Here we use cross-tabulations of militancy by religiosity, with the introduction of successive control variables, which are discussed later. Our hypothesis is that a higher percentage of the nonreligious than of the religious will be militant or, because we have competing hypotheses, that a lower percentage of the nonreligious will be militant. Later in the book we will deal with more sophisticated statistical models, mostly variants of the general linear model, but the logic remains unchanged. How we actually carry out cross-tabulation analysis is the topic of the next section.

CROSS-TABULATIONS
To determine whether religious Blacks are more likely (or less likely) to be militant than nonreligious Blacks, perhaps the most straightforward approach is to cross-tabulate militancy by religiosity, that is, to count the frequency of persons with each combination of religiosity and militancy. Doing so yields the following joint frequency distribution.
TABLE 1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.

Religiosity              Militant   Nonmilitant   Total
Very religious                 61           169     230
Somewhat religious            160           372     532
Not very religious             87           108     195
Not at all religious           25            11      36
Total                         333           660     993

Source: Adapted from Marx (1967a, Table 6).
TECHNICAL POINTS ON TABLE 1.1

1) The total number of cases (individuals) in the table is given in the lower-right cell. Note that this is fewer than the number of cases in the sample. The difference is due to missing data; that is, some respondents did not answer all the questions needed to construct the religiosity and militancy scales. Later, we will deal extensively with missing data problems. For the present, ignore the missing data and treat the sample as if it consists of 993 respondents.

2) The interior of the table gives the joint frequency distribution. The variables and response categories are given in the table stubs.

3) Always check a table by adding up the entries of each row and confirming that they match the corresponding marginal, for example, 61 + 169 = 230, and so on; adding up the entries of each column and confirming that they match the corresponding marginal, for example, 61 + 160 + 87 + 25 = 333; and adding up the row marginals and the column marginals and confirming that the sum of each corresponds to the table total. It is easy to introduce errors, especially in copying tables.
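The consistency checks described in these technical points can be sketched in a few lines of Python. This is only an illustration: the function name `check_marginals` and the dictionary layout are assumptions made for the example, and the frequencies are those of Table 1.1.

```python
# Cell frequencies from Table 1.1: (militant, nonmilitant) per religiosity level.
cells = {
    "Very religious": (61, 169),
    "Somewhat religious": (160, 372),
    "Not very religious": (87, 108),
    "Not at all religious": (25, 11),
}
# Marginals as printed in the table.
row_totals = {
    "Very religious": 230,
    "Somewhat religious": 532,
    "Not very religious": 195,
    "Not at all religious": 36,
}
column_totals = (333, 660)
table_total = 993


def check_marginals(cells, row_totals, column_totals, table_total):
    """Return a list of discrepancies; an empty list means the table is consistent."""
    problems = []
    # Each row of cells must add up to its printed row total.
    for label, counts in cells.items():
        if sum(counts) != row_totals[label]:
            problems.append(f"row '{label}': {sum(counts)} != {row_totals[label]}")
    # Each column of cells must add up to its printed column total.
    for j, expected in enumerate(column_totals):
        observed = sum(counts[j] for counts in cells.values())
        if observed != expected:
            problems.append(f"column {j}: {observed} != {expected}")
    # Both sets of marginals must add up to the table total.
    if sum(row_totals.values()) != table_total:
        problems.append("row marginals do not sum to the table total")
    if sum(column_totals) != table_total:
        problems.append("column marginals do not sum to the table total")
    return problems


problems = check_marginals(cells, row_totals, column_totals, table_total)
print(problems or "Table 1.1 is internally consistent")
```

Running the same function on a hand-copied table is a cheap way to catch transcription errors before anyone else does.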
From this table, can we decide whether religiosity favors or inhibits militancy? Not very easily. To do so, we would need to determine the relative probability that people of each level of religiosity are militant. If the probability of militancy increases with religiosity, we would conclude that religiosity promotes militancy; if the probability of militancy decreases with religiosity, we would conclude that religiosity inhibits militancy.
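As a rough sketch of what "the probability that people of each level of religiosity are militant" means in practice, the proportions can be computed directly from the frequencies of Table 1.1. The variable names below are illustrative assumptions, not part of the original analysis.

```python
# (number militant, number of cases) for each religiosity level, from Table 1.1.
militant_and_base = {
    "Very religious": (61, 230),
    "Somewhat religious": (160, 532),
    "Not very religious": (87, 195),
    "Not at all religious": (25, 36),
}

# Conditional proportion: militants divided by all cases at that religiosity level.
proportion_militant = {
    level: militant / base
    for level, (militant, base) in militant_and_base.items()
}

for level, p in proportion_militant.items():
    print(f"{level:<22} {p:.3f}")
```

Comparing these proportions across religiosity levels, rather than comparing raw frequencies, is what allows a judgment about whether militancy rises or falls with religiosity.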
TABLE 1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.

Very Religious   Somewhat Religious   Not Very Religious   Not at All Religious
TECHNICAL POINTS ON TABLE 1.2

1) Always include the percentage totals (the row of 100%s). Although this may seem redundant and a waste of space, it makes it immediately clear to the reader in which direction you have percentaged the table. When the percentage totals are omitted, the reader may have to add up several rows or columns to figure it out. Using percentage signs on the top row of numbers and again on the Total row also clearly indicates to the reader that this is a percentage table.

2) Whole percentages are precise enough. There is no point in being more precise in the presentation of data than the accuracy of the data warrants. Moreover, fractions of percentages are usually uninteresting. It is hard to imagine anyone wanting to know that 37.44 percent of women and 41.87 percent of men do something; it is sufficient to note that 37 percent of women and 42 percent of men do it. Incidentally, a convenient rounding rule is to round to the even number. Thus, 37.50 becomes 38, but 36.50 becomes 36. Of course, 36.51 becomes 37 and 37.49 also becomes 37. You only want to report more than whole percentages if you have a distribution with many categories and are concerned about rounding error.

3) Always include the number of cases on which the percentages are based (that is, the denominator for the percentages). This enables the reader to reconstruct the entire table of frequencies (within the limits of rounding error) and hence to reorganize the data into a different form. Note that Table 1.2 contains all of the information that Table 1.1 contains because you can reconstruct Table 1.1 from Table 1.2: 27 percent of 230 is 62.1, which rounds to 62 (within rounding error of 61), and so on. Customarily, percentage bases are placed in parentheses to clearly identify them and to help them stand out from the remainder of the table.
4) Sometimes it is useful to include a Total column, as I have done here, and sometimes not. The choice should be based on substantive considerations. In the present case, about one third of the total sample is militant (as defined by Marx); hence, the marginal distribution for the dependent variable is reported here. Recall from page 7 that "militants" are those who gave militant responses to at least six of the eight items in the militancy scale. We now see that about one third of the sample did so. Obviously, if we defined as militant all those who gave at least five militant responses, the percentage militant would be higher.

5) No convention dictates that tables must be arranged so that the percentages run down, that is, so that each column totals to 100 percent. In Table 1.2, the categories of the dependent variable form the rows, and the categories of the independent variable form the columns. If it is more convenient to reverse this, so that the categories of the independent variable form the rows, this is perfectly acceptable. The only caveat is that within each category of the independent variable, the percentage distribution across the categories of the dependent variable must total to 100 percent. Thus, if the categories of the dependent variable form the columns, the table should be percentaged across each row.
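Points 2 and 3 lend themselves to a short sketch. Python's built-in `round()` happens to send exact ties to the even number, which matches the rounding rule described above; the helper names `whole_percent` and `consistent_frequencies` are illustrative assumptions, and the percentages are computed here from the Table 1.1 frequencies rather than copied from a printed table.

```python
def whole_percent(part, base):
    """Whole-number percentage; round() sends exact ties to the even number."""
    return round(100 * part / base)


# The rounding rule from point 2.
examples = {x: round(x) for x in (37.50, 36.50, 36.51, 37.49)}
print(examples)

# (militant, base) pairs from Table 1.1, including the total column.
table_1_1 = {
    "Very religious": (61, 230),
    "Somewhat religious": (160, 532),
    "Not very religious": (87, 195),
    "Not at all religious": (25, 36),
    "Total": (333, 993),
}
percent_militant = {k: whole_percent(m, n) for k, (m, n) in table_1_1.items()}
print(percent_militant)


def consistent_frequencies(percent, base):
    """All frequencies that would be reported as `percent` of `base`."""
    return [n for n in range(base + 1) if whole_percent(n, base) == percent]


# Point 3: a percentage plus its base pins down the frequency within rounding error.
candidates = consistent_frequencies(27, 230)
print(candidates)
```

The list of candidate frequencies makes the phrase "within the limits of rounding error" concrete: the reported percentage and its base narrow the original count to a handful of adjacent values.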
The Direction to Percentage the Table

Note that the direction in which this table is percentaged is not at all arbitrary but rather is determined by the nature of the hypothesis being tested. The question being addressed is whether religiosity promotes or hinders militancy. In this formulation, religiosity is presumed to influence, cause, or determine militancy, not the other way around. (One could imagine a hypothesis that assumed the opposite: we might suppose that militants would tend to lose interest in religion as their civil rights involvement consumed their passions. But that is not the idea being tested here.) The variable being determined, influenced, or caused is known as the dependent variable, and the variables that are doing the causing, determining, or influencing are known as independent, or predictor, variables. The choice of causal order is always a matter of theory and cannot be determined from the data. The choice of causal order then dictates the way the table is constructed. Tables should almost always (an exception will be presented later) be constructed to express the conditional probability of being in each of the categories of the dependent variable given that an individual is in a particular category of the independent variable(s). (Do not let the fact that the table is expressed in percentages and the rule is expressed in probabilities confuse
l*l$#iq_1"",.:fi i:,fi,,f,H1, ,i*s1:1i*1",__;:Tffi :,T:*l%*;.""d,*1,:,i i**qii;ttil,TittrL';tff #l i**"x,:;{ilr#,;:xn:ii:r#:j,:i^: ru;,m;::*fr TiJ#;[ *#",'T; fflj..':,jTi:ili??1"iif."," ffi'l'ffi*:,lnri:;nll; ;ry:*:.1"i;:1iil:11*t;"#i.Ixli:'f
:txxT:r*,r#fr :ffi:::t;?HTtH:1H$::" ff::i.;iffi i Control Variables Jhus
far, we have determined ti
,,ffi ?::1;:J", mh*::H,"ili:j.{##:HI""3!:,J,,T#j:
a rtrongrheory,lri p.Ji.,.aT'it
religiositycauses peopleto belessrnilirant.lf wehad
:;i#lqfr#;#$ltfi ffi fi r+l.t!;t#:;:'. ,ffi: iff jffijjff l1;1u,"nr.,.,h;Gffi uJ:[; frf :i:*[.{;ffiiifi
*:ry]i',;^:,:**i.:. I T,[TJ'.;*? ?ff d#if,::1lf,H "f,. ;;;;il::iJ#liiifi
:,.:il;::f
il""n*"j,9,;;?:j#nfja*J
How can we test this possibility?
j:,lffiTrnXi*ereducationdoesinracrredr "*r'ilhff .ffi1,
j*fu31"*"*l=*"^*i1:': ;::::*ltln:: :i."ll$".:;*t"Tfl;*:"., ;iHl?il :ilff:.ilT i;fi :ffii::#:lT:{
:; ;:,:l',f
;:r**:L'f;*i**T##.,:ff ;ff d#kl'lJi:#f i#*"::* ir is. whar ,",r0 y,"1. ,ff::JH il:ru:::n""1,* p",".n,ug.o ,r. ^,.ningliyou ,"rJiI.
We need to determine whether education increases militancy by creating
..";?ffi;:"*i#:'rfifi }:lff:T;"J$ $,",f ::ffi ,"ji?lil"Jj;T",T
percent of those with high school education, and fully 53 percent of those with college education are militant. Another way of putting this is to say that a positive association exists between education and militancy: as education increases, the probability of militancy increases.
TABLE 1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.

                         Educational Attainment
Religiosity              Grammar School   High School   College
Very religious
Not at all religious
Total

Source: Adapted from Marx (1967a, Table 6).
TABLE 1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.

                         Educational Attainment
Militancy                Grammar School   High School   College
Militant
Nonmilitant
Total
TECHNICAL POINTS ON TABLE 1.3

1) Sometimes your percentages will not total to exactly 100 percent due to rounding error. Deviations of one percentage point (99 to 101) are acceptable. Larger deviations probably indicate computational error and should be carefully checked.

2) Note how the title is constructed. It states what the table is (a percentage distribution), which variables are included (the convention is to list the dependent variable first), what the sample is (urban Negroes in the U.S.), and the date of data collection (1964). The table should always contain sufficient information to enable one to read it without referring to the text. Thus, the title and variable headings should be clear and complete; if there is insufficient space to do this, it should be done in footnotes to the table.

3) In the interpretation of percentage distributions, comparing the extreme categories and ignoring the middle categories is usually sufficient. Thus, we noted that the proportion "very religious" decreases with education, and the proportion "not at all religious" increases with education. Similar assertions about how the middle categories ("somewhat religious" and "not very religious") vary with education are awkward because they may draw from or contribute to categories on either side. For example, the percentage "not very religious" among those with a college education might be larger if either the percentage "somewhat religious" or the percentage "not at all religious" were smaller. But one shift would indicate a more religious college-educated population, and the other shift would indicate a less religious college-educated population. Hence, the "not very religious" row cannot be interpreted alone, and usually little is said about the interior rows of a table. On the other hand, it is important to present the data so that the reader can see that you have not masked important details and to allow the reader to reorganize the table by collapsing categories (discussed later).

4) In dealing with scaled variables, such as religiosity, you should not make much of the relative size of the percentages within each distribution; that is, comparisons should be made across the categories of the independent variable, not across the categories of the dependent variable. In the present case, it is legitimate to note that those with a grammar school education are more likely to be very religious than are those who are better educated, but it is not legitimate to assert that more than half those with a grammar school education are somewhat religious. The reason for this is that the scale is only an ordinal scale; the categories do not carry an absolute value. How religious is "very religious"? All we know is that it is more religious than "somewhat religious." In consequence, it is easy to change the distribution simply by combining categories. Suppose, for example, we summed the top two rows and called the resulting category "religious." In this case, 88 percent of those with grammar school education would be shown as "religious." Consider how this would change the assertions we would make about this sample if we took the category labels seriously.
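The rounding check in point 1 can be sketched as follows. Because the example needs a full set of frequencies, it uses the religiosity marginals of Table 1.1 (230, 532, 195, 36) as illustrative input; the function names and the `tolerance` parameter are assumptions made for the example.

```python
def percent_distribution(frequencies):
    """Whole-number percentage distribution of a list of frequencies."""
    total = sum(frequencies)
    return [round(100 * f / total) for f in frequencies]


def total_is_acceptable(percents, tolerance=1):
    """True when the rounded percentages sum to 100 within the tolerance (99 to 101)."""
    return abs(sum(percents) - 100) <= tolerance


# Religiosity marginals from Table 1.1: very, somewhat, not very, not at all religious.
religiosity = percent_distribution([230, 532, 195, 36])
print(religiosity, sum(religiosity), total_is_acceptable(religiosity))
```

Here the rounded percentages sum to 101, a deviation attributable to rounding alone; a sum such as 90 or 110 would instead point to a computational or copying error.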
TECHNICAL POINTS ON TABLE 1.4

1) When you are presenting several tables involving the same data, always check the consistency of your tables by comparing numbers across the tables wherever possible. For example, the number of cases in Table 1.4 should be identical to that in Table 1.3.
Because educated urban Blacks are both less likely to be religious and more likely to be militant than are their less educated counterparts, it is possible that the observed association between religiosity and (non)militancy is determined entirely by their mutual dependence on education and that there is no connection between militancy and religiosity among people who are equally well educated. If this proves true, we would say that education explains the association between religiosity and militancy and that the association is spurious because it does not arise from a causal connection between the variables.

To test this possibility, we study the relation between militancy and religiosity within categories of education by creating a three-variable cross-tabulation of militancy by religiosity by education. Such a table can be set up in two different ways. The first is shown in Table 1.5, and the second in Table 1.6.
TABLE 1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.

             Grammar School     High School        College
Militancy    V    S    N        V    S    N        V    S    N

Source: Adapted from Marx (1967a, Table 6).
V = very religious; S = somewhat religious; N = not very religious or not at all religious.
TECHNICAL POINTS ON TABLE 1.5

1) In this sort of table, education is the control variable. The table is set up to show the relationship between militancy and religiosity within categories of education, that is, "controlling for education" or (synonymously) "holding education constant" or "net of education." The control variable should always be put on the outside of the tabulation so that it changes most slowly. This format facilitates reading the table because it puts the numbers being compared in adjacent columns. (Sometimes we want to study the relationship of each of two independent variables to a dependent variable, in each case controlling for the other. In such cases, we still make only one table and construct it in whatever way makes it easiest to read. If our dependent variable is dichotomous or can be treated as dichotomous, we set up the table in the format of Table 1.6.)

2) Note that the "not very religious" and "not at all religious" categories were combined. This is often referred to as collapsing categories. Collapsing is usually done when there would be too few cases to produce reliable results for some categories. In the present case, as we know from Table 1.1 or 1.2, there are thirty-six people who are not at all religious. Dividing them on the basis of educational attainment would produce too few cases in each group to permit reliable estimates of the percent militant. Hence, they were combined with the adjacent group, "not very religious." An additional reason for collapsing categories is that too much detail makes it difficult for the reader to absorb the table; it helps to reduce the number of categories presented. On the other hand, if categories of the independent variable differ in terms of their distribution on the dependent variable, combining the categories will mask important distinctions. A fine balance must be struck between clarity and precision, which is why constructing tables is an art.
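The trade-off in point 2 can be illustrated with the two-variable frequencies of Table 1.1. The sketch below collapses the two least religious categories; the helper name `collapse` and the merged label are assumptions made for the example.

```python
# (militant, base) per religiosity level, from Table 1.1.
table = {
    "Very religious": (61, 230),
    "Somewhat religious": (160, 532),
    "Not very religious": (87, 195),
    "Not at all religious": (25, 36),
}


def collapse(table, labels, new_label):
    """Merge the rows named in `labels` into a single row called `new_label`."""
    merged = {k: v for k, v in table.items() if k not in labels}
    merged[new_label] = (
        sum(table[k][0] for k in labels),  # militants in the merged category
        sum(table[k][1] for k in labels),  # percentage base of the merged category
    )
    return merged


collapsed = collapse(
    table,
    ["Not very religious", "Not at all religious"],
    "Not very or not at all religious",
)
for label, (militant, base) in collapsed.items():
    print(f"{label:<34} {round(100 * militant / base):>3}%  ({base})")
```

The merged category has a much larger base (231 rather than 195 and 36), which is the gain in reliability; the cost is that the separate percentages for the two original categories are no longer visible.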
From Table 1.5, we see that religiosity continues to inhibit militancy even when education is controlled, although the differences in percent militant among religiosity categories tend to be smaller than in Table 1.2, where education is not controlled. (In the next chapter, we will discuss a procedure for calculating the size of the reduction in an association resulting from the introduction of a control variable.) Among those with grammar school education, 17 percent of the very religious and 32 percent of the not religious are militant; the corresponding percentages for those with high school education are 34 and 47, and for those with college education are 38 and 68. Thus we conclude that education does not completely account for the inverse association between religiosity and militancy.
At this point, we have to decide whether to continue the search for additional explanatory variables. Our decision usually will be based on a combination of substantive and technical considerations. If we have grounds for believing that some other factor might account both for religiosity and militancy, net of education, we probably would want to control for that factor as well. Note, however, that the power of additional factors to explain the association between two original variables (here religiosity and militancy) will depend on their association with previously introduced control variables. To the extent that additional variables are highly correlated with variables already introduced, they will have little impact on the association. This is an extremely important point that will recur in the context of multiple regression analysis. Be sure you understand it thoroughly.

Consider age. What relation would you expect age to have to religiosity and to militancy?
Pause to Think About This

Religiosity is likely positively associated with age (that is, older people tend to be more religious) and militancy is inversely associated with age (younger people tend to be more militant). Hence, we might expect the association between religiosity and militancy to be a spurious function of age. That is, within age categories, there may be no association between religiosity and militancy.

What, however, of the relation between age and education? In fact, from knowledge about the secular trend in education among Blacks, we would expect younger Blacks to be substantially better educated than older Blacks. To the extent this is true, age and education are likely to have similar effects on the association between religiosity and militancy. Hence, introducing age as a control variable in addition to education is not likely to reduce the association between religiosity and militancy by much, relative to the effect of education alone.

Apart from theoretical and logical considerations (is a variable theoretically relevant, and is it going to add anything to the explanation?), there is a straightforward technical reason for limiting the number of variables included in a single cross-tabulation: we quickly run out of cases. Most sample surveys include a few hundred to a few thousand cases. We already have seen that a three-variable cross-tabulation required that we collapse two of the religiosity categories. A four-variable cross-tabulation of the same data is likely to yield so many small percentage bases as to make the results extremely unreliable. The difficulty in studying more than about three variables at a time in a cross-tabulation provides a strong motivation to use some form of regression analysis instead. A substantial fraction of the chapters to follow will be devoted to the elaboration of regression-based procedures.
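A back-of-the-envelope sketch shows how quickly the percentage bases shrink. It assumes the 993 cases of Table 1.1, three religiosity and three education categories as in Table 1.5, and, purely hypothetically, an age variable with three categories; the function name is likewise an assumption.

```python
from math import prod


def average_percentage_base(n_cases, categories_per_independent_variable):
    """Average number of cases per combination of independent-variable categories."""
    return n_cases / prod(categories_per_independent_variable)


n_cases = 993

# Religiosity (3 collapsed categories) by education (3 categories), as in Table 1.5.
three_variable = average_percentage_base(n_cases, [3, 3])

# Adding a hypothetical three-category age variable as a further control.
four_variable = average_percentage_base(n_cases, [3, 3, 3])

print(f"average base, three-variable table: {three_variable:.1f}")
print(f"average base, four-variable table:  {four_variable:.1f}")
```

These are averages only. Because cases are never spread evenly across categories, some percentage bases in the four-variable table would be far smaller than the average suggests.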
Table 1.5 also enables us to assess the effect of education on militancy, controlling for religiosity, by comparing corresponding columns in each of the three panels. Thus, we note that, among those who are very religious, 17 percent of the grammar school educated are militant compared to 34 percent of the high school educated and 38 percent of the college educated; among those who are somewhat religious, the corresponding percentages are 22, 32, and 48; and among those who are not religious, they are 32, 47, and 68. Hence, we conclude that, at any given level of religiosity, the better educated are more militant.
TABLE 1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).

                         Educational Attainment
Religiosity              Grammar School   High School   College

Source: Table 1.5.
TECHNICAL POINTS ON TABLE 1.6

1) Each pair of entries gives the percentage of people who have a trait and the percentage base, or denominator, of the ratio from which the percentage was computed. Thus, the entry in the upper-left corner indicates that 17 percent of the 108 very religious grammar-school-educated people in the sample are militant. From this table, we can reconstruct any of the preceding five tables (but with the two least religious categories collapsed into one), within the limits of rounding error. Try to do this to confirm that you understand the relationships among these tables.
This requires a fairly tedious comparison, however, skipping around the table to locate the appropriate cells. When the dependent variable is dichotomous, that is, has only two response categories, a much more succinct table format is possible and is preferred. Table 1.6 contains exactly the same information as Table 1.5, but the information is arranged in a more succinct way. Tables like Table 1.6 are known as three-dimensional tables. Compare Tables 1.5 and 1.6. You will see that they contain exactly the same information: all the additional numbers in Table 1.5 are redundant. Moreover, Table 1.6 is much easier to read because we can see the effect of religiosity on militancy, holding constant education, simply by reading down the columns, and the effect of education on militancy, holding constant religiosity, simply by reading across the rows.
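To make "reading down the columns" and "reading across the rows" concrete, the sketch below stores the nine percent-militant figures quoted in the text in the three-dimensional layout and compares the extreme categories in each direction. The percentage bases are omitted, and the variable names are illustrative assumptions.

```python
EDUCATION = ["Grammar school", "High school", "College"]

# Percent militant as quoted in the text, indexed [religiosity][education].
percent_militant = {
    "Very religious": [17, 34, 38],
    "Somewhat religious": [22, 32, 48],
    "Not religious": [32, 47, 68],
}

# Reading across a row: effect of education, holding religiosity constant
# (college minus grammar school).
education_effect = {
    religiosity: row[-1] - row[0]
    for religiosity, row in percent_militant.items()
}

# Reading down a column: effect of religiosity, holding education constant
# (not religious minus very religious).
religiosity_effect = {
    level: percent_militant["Not religious"][j] - percent_militant["Very religious"][j]
    for j, level in enumerate(EDUCATION)
}

print("Education effect within religiosity levels:", education_effect)
print("Religiosity effect within education levels:", religiosity_effect)
```

Both sets of differences are positive, which is the pattern the text describes: at each level of religiosity the better educated are more militant, and at each level of education the not religious are more militant than the very religious.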
WHAT THIS CHAPTER HAS SHOWN

In this chapter, we have seen an initial idea formulated into a research problem, an appropriate sample chosen, a survey conducted, and a set of variables created and combined into scales to represent the concepts of interest to the researcher. We then considered how to construct a percentage table that shows the relationship between two variables, with special attention to determining in which direction to percentage tables using the concept of conditional probability distributions: the probability distribution over categories of the dependent variable computed separately for each category of the independent variable(s). This is the most difficult concept in the chapter and one you should make sure you completely understand.

The other important concept you need to understand fully is the idea of statistical controls, also known as controlling for or holding constant confounding variables, to determine whether relationships hold within categories of the control variable(s). Finally, we considered various technical issues regarding the construction and presentation of tables. The aim of the game is to construct attractive, easy-to-read tables.

In the next chapter, we continue our discussion of cross-tabulations, considering various ways of analyzing tables with more than two variables and, more generally, the logic of multivariate analysis.
CHAPTER

MORE ON TABLES

WHAT THIS CHAPTER IS ABOUT

In this chapter we expand our understanding of how to deal with cross-tabulations, both substantively and technically. First, we continue our consideration of the logic of elaboration, that is, the introduction of additional variables to an analysis; second, we consider a special situation known as a suppressor effect, when the influences of two independent variables offset each other; third, we consider how variables combine to produce particular effects, drawing a distinction between additive and interaction effects; fourth, we see how to assess the effect of a single independent variable in a multivariate percentage table while controlling for the effects of the other independent variables via direct standardization; and finally we consider the distinction between experiments and statistical controls.
THE LOGIC OF ELABORATION

In traditional treatments of survey research methods (for example, Lazarsfeld 1955; Zeisel 1985), it was customary to make a distinction between two situations in which a third variable completely or partially accounts for the association between two other variables: spurious associations and associations that can be accounted for by an intervening variable or variables. The distinction between the two is that when a control variable (Z) is temporally or causally prior to an independent variable (X) and dependent variable (Y), and when the control variable completely or partly explains the association between the independent and dependent variable, we infer that there is no causal connection or only a weak causal connection between the independent and dependent variables. However, when the control variable intervenes temporally or causally between the independent and dependent variables, we would not claim that there is no causal relationship between the independent and dependent variables but rather that the intervening variable explains, or helps explain, how the independent variable exerts its effect on the dependent variable. In the previous chapter we considered spurious associations. In this chapter we revisit spurious associations and also consider the effect of intervening variables.
Spurious Association

Consider the three variables, X, Y, and Z. Suppose that you had observed an association between X and Y and suspected that it might be completely explained by the dependence of both X and Y on Z. (For a substantive example, recall the hypothesis in the previous chapter that the negative relation between religiosity and militancy was due to the dependence of both on education: Blacks with more education were both less religious and more militant.) Such a hypothesis might be diagrammed as shown in Figure 2.1. Causal diagrams of this sort are used for purposes of explication throughout the book. They are extensively used in path analysis, which is a way of representing and algebraically manipulating structural equation models that was widely used in the 1970s but is less frequently encountered now (see additional discussion of structural equation models and path analysis in Chapter Sixteen). My use of such models is purely heuristic. Nonetheless, I use them in such a way as to be conceptually complete. Hence, the paths from x to X (pXx) and from y to Y (pYy) indicate that other factors besides Z influence X and Y.

Now, if the association between X and Y within categories of Z were very small or nonexistent, we would regard the association between X and Y as entirely explained by their mutual dependence on Z. However, this generally does not happen; recall, for example, that the negative association between religiosity and militancy did not disappear when education was held constant. We ordinarily do not restrict ourselves to an all-or-nothing hypothesis of spuriousness except in the exceptional case where we have a very strong theory requiring that a particular relation be completely spurious; rather, we ask what the association is between X and Y controlling for Z (and what the association is between Z and Y controlling for X). The logic of our analysis can be diagrammed as shown in Figure 2.2. To state the same point differently, rather than assuming that the causal connection between X and Y is zero and determining whether our assumption is correct, we estimate the relation between X and Y holding constant Z and determine its size, which, of course, may be zero, in which case Figure 2.1 and Figure 2.2 are identical.
FIGURE 2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.

FIGURE 2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X, and Both Z and X Affect Y).
Intervening Variables

Now let us consider the intervening variable case. Suppose we think two variables, X and Y, are associated only because X causes Z and Z causes Y. An example might be the relation between a father's occupation, son's education, and son's income. Suppose we expect the two-variable association between X and Y (sometimes called the zero-order association, short for zero-order partial association, that is, no partial association) to be positive, but think that this is due entirely to the fact that the father's occupational status influences the son's education and that the son's education influences the son's income; we think there is no direct influence of the father's occupational status on the son's income, only the indirect influence through the son's education. This sort of claim can be diagrammed as shown in Figure 2.3.

But, as before, unless we have a very strong theory that depends on there being no direct connection between X and Y, we probably would inspect the data to determine the influence of X on Y, holding constant the intervening variable Z, and would also determine the influence of Z on Y, holding constant the antecedent variable X. This can be diagrammed as shown in Figure 2.4. If the net, or partial, association between X and Y proves to be zero, we would conclude that a chain model of the kind described in Figure 2.3 describes the data. Otherwise, we would simply assess the strength and nature of both associations, between X and Y and between Z and Y (and, for completeness, the zero-order association between X and Z).

Notice the similarity between Figure 2.2 and Figure 2.4. With respect to the ultimate dependent variable, Y, the two models are identical. The only difference has to do with the specification that Z causes X or that X causes Z. There is still another possibility: X and Z cause Y, but no claim is made regarding the causal relation between X and Z. This can be diagrammed as shown in Figure 2.5.
FIGURE 2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.

FIGURE 2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
FIGURE 2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.

In almost all of the analyses we undertake, including cross-tabulations of the kind we are concerned with at present, multivariate models in an ordinary least squares regression framework, and log-linear and logistic analogs to regression for categorical dependent variables, the models, or theories, represented by Figures 2.2, 2.4, and 2.5 will be empirically indistinguishable with respect to the dependent variable, Y. The distinction among them thus must rest with the ideas of the researcher, not with the data. From the standpoint of data manipulation, all three models require assessing the net effect of each of two variables on a third variable, that is, the effect of each independent variable holding constant the other independent variable. Obviously, the same ideas can be generalized to situations involving more than three variables.
SUPPRESSOR VARIABLES
One final idea needs to be discussed here, the notion of suppressor variables. Thus far, we have dealt with situations in which we suspected that an observed association between two variables was due to the effect of a third, either as an antecedent or an intervening variable. Situations can arise, however, in which there appears to be no association between two variables when, in fact, there is a causal connection. This happens when some other variable is related to the two variables in such a way that it suppresses the observed zero-order association, specifically, when one independent variable has opposite effects on another independent variable and on the dependent variable, and the two independent variables have opposite effects on the dependent variable. Such situations can be diagrammed as shown in Figure 2.6.
For example, suppose you are interested in the relations among education, income, and fertility. On theoretical grounds, you might expect the following: education will have a positive effect on income; holding constant income, education will have a negative effect on fertility (the idea being that educated people want to do more for their children and regard children as more expensive than do poorly educated people; hence at any given level of income, they have fewer children); holding constant education, the higher the income, the higher the number of children (the idea being that children are generally regarded as desirable so that at any given level of the perceived cost of children, those with more to spend, that is, with higher income, will have more children). These relationships are represented in Figure 2.6, where X = level of education, Z = income, and Y = number of
children. The interesting thing about this diagram is that it implies that the gross, or zero-order, relationships between X and Y and between Z and Y will be smaller than the net, or first-order partial, relationships and might even be zero, depending on the relative size of the associations among the three variables. To see how this happens, consider the relationship between education and fertility. We have posited that educated people tend to have more income, and at any given level of education, higher-income people tend to have more children. Hence, so far, the relation between education and fertility would be expected to be positive. But we also have posited that at any given income level, better-educated people tend to have fewer children. So we have a positive causal path and a negative causal path at work at once, and the effect of each is to offset, or suppress, the other effect, so that the zero-order relationship between education and fertility is reduced.

Figure 2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
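The offsetting of a positive and a negative path can be illustrated with a small calculation. The sketch below assumes a linear model with standardized variables, in which the zero-order association between X and Y equals the direct effect of X on Y plus the product of the effects along the indirect path through Z. The three coefficients are hypothetical values chosen for illustration; they are not estimates from any data set discussed here.

```python
def zero_order_association(direct_xy, effect_xz, effect_zy):
    """Zero-order X-Y association implied by a direct path from X to Y
    plus an indirect path X -> Z -> Y (standardized linear model assumed)."""
    indirect = effect_xz * effect_zy
    return direct_xy + indirect


# Hypothetical standardized coefficients (illustrative only).
education_to_fertility = -0.20  # net effect of education on fertility (negative)
education_to_income = 0.50      # effect of education on income (positive)
income_to_fertility = 0.40      # net effect of income on fertility (positive)

gross = zero_order_association(
    education_to_fertility, education_to_income, income_to_fertility
)

# The net (direct) effect is -0.20, but the positive indirect path
# (0.50 * 0.40 = 0.20) offsets it, so the gross association is about zero.
print(f"net effect:   {education_to_fertility:+.2f}")
print(f"gross effect: {gross:+.2f}")
```

With these particular values the two paths cancel exactly; with other values the gross association would merely be smaller in magnitude than the net effect.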
ADDITIVE AND INTERACTION EFFECTS
We now consider interaction effects, situations in which the effect of one variable on another is contingent on the value of a third variable. To see this clearly, consider Table 2.1. This table shows that in 1965 educational attainment had no effect on acceptance of abortion among Catholics but that among Protestants, the greater the education, the greater the percentage accepting abortion. Thus, Catholics and Protestants with 8th grade education or less were about equally likely to believe that legal abortion should be permitted under specified circumstances but, among those with more education, Protestants were substantially more likely to accept abortion than were Catholics. Among the college educated, the difference between the religious groups is fully twenty points: about 31 percent of Catholics and 51 percent of Protestants believed that abortion should be permitted.
This kind of result is called an interaction effect. Religion and educational attainment interact to produce a result different from what each would produce alone. That is, the relationship between education and acceptance of abortion differs for Catholics and Protestants, and the relationship between religion and acceptance of abortion differs by education. Situations in which the relationship between two variables depends on the value of a third, as it does here, are known as interactions. In the older survey analysis literature (for example, Lazarsfeld 1955; Zeisel 1985), interactions are sometimes called specifications. Religion specifies the relationship between education and beliefs about abortion: acceptance of abortion increases with education among Protestants but not among Catholics.
Table 2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
Educational Attainment
Religion
8th Grade or Less
Some High School
29% (287)
(250)
High School Graduate
Some College or More
Protestant
36%
43%
51%
(225)
Source: Rossi (1966).
Note: Non-Christians omitted.
Where we do not have interaction effects we have additive effects (or no effects). Suppose, for example, that instead of the numbers in Table 2.1, we had the numbers shown in Table 2.2. What would this show? We could say two things: (1) The effect of religion on acceptance of abortion is the same at all levels of education. That is, the difference between the percentage of Catholics and Protestants who would permit abortion is 10 percent at each level of education. (2) The effect of education on acceptance of abortion is the same for Catholics and Protestants. For example, the difference between the percentage who would permit abortion among those with some high school education and those who are high school graduates is 10 percent for both Catholics and Protestants. Similarly, the difference between the percentage who would permit abortion among those with a high school education and among those with at least some college is 20 percent for both Catholics and Protestants.
Table 2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
8th Grade or Less
Some High School
High School Graduate
Some College or More
The reason this table is additive is that the effects of each variable add together to produce the final result. It is as if the probability of any individual in the sample accepting abortion is at least .3 (so we could add .3 to every cell in the table); the probability of a Protestant accepting abortion is .1 greater than the probability of a Catholic accepting abortion (so we could add .1 to all the cells containing Protestants); the probability of someone with some high school accepting abortion is .05 greater than the probability of someone with an 8th grade education doing so (so we would add .05 to the cells for those with some high school); the probability of those who are high school graduates accepting abortion is .15 greater than the probability of those with an 8th grade education doing so (so we would add .15 to the cells for those with high school degrees); and the probability of those with some college accepting abortion is .35 higher than the probability of those with an 8th grade education doing so (so we would add .35 to the cells for those with at least some college). This would produce the results we see in Table 2.2 (after we convert proportions to percentages by multiplying each number by 100).
By contrast, it is not possible to add up the effect of each variable in a table containing interactions because the effect of each variable depends on the value of the other independent variable or variables. Many relationships of interest to social scientists involve interactions, especially with gender and to some extent with race; but it is also true that many relationships are additive. Adequate theoretical work has not yet been done to allow us to specify very well in advance which relationships we would expect to be additive and which relationships we would expect to involve interactions. Later you will see more sophisticated ways to distinguish additive effects from interactions and to deal with various kinds of interactions via log-linear analysis and regression analysis.
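The additive construction just described can be written out as a short sketch. It uses only the increments stated in the text (.3 for everyone, .1 for Protestants, and .05, .15, and .35 for the three higher education categories); the function and variable names are illustrative choices, not part of the original analysis.

```python
BASE = 0.30  # probability added to every cell

RELIGION_EFFECT = {"Catholic": 0.00, "Protestant": 0.10}

EDUCATION_EFFECT = {
    "8th grade or less": 0.00,
    "Some high school": 0.05,
    "High school graduate": 0.15,
    "Some college or more": 0.35,
}


def additive_table(base, row_effects, column_effects):
    """Build a table of percentages in which each cell is the sum of a
    base value, a row effect, and a column effect (no interaction)."""
    return {
        row: {
            column: round(100 * (base + row_effect + column_effect))
            for column, column_effect in column_effects.items()
        }
        for row, row_effect in row_effects.items()
    }


table = additive_table(BASE, RELIGION_EFFECT, EDUCATION_EFFECT)

for religion, cells in table.items():
    print(religion, cells)

# In an additive table the Protestant-Catholic gap is identical
# at every level of education.
gaps = [
    table["Protestant"][level] - table["Catholic"][level]
    for level in EDUCATION_EFFECT
]
print("Protestant minus Catholic, by education:", gaps)
```

Because every cell is a simple sum, the religion gap is constant across education levels and the education gaps are constant across religions, which is exactly what distinguishes an additive table from one containing an interaction.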
DIRECT STANDARDIZATION
Often we want to assess the relationship between two variables controlling for additional variables. Although we have seen how to assess partial relationships, that is, relationships between two variables within categories of one or several control variables, it would be helpful to have a way of constructing a single table that shows the average relationship between two variables net of, that is, controlling for, the effects of other variables. Direct standardization provides a way of doing this. Note that this technique has other names, for example, covariate adjustment. However, the technique is most widely used in demographic research, so I use the term by which it is known in demography, direct standardization. It is important to understand that, even though the same term is used, this procedure has no relationship to standardizing variables to create a common metric. We will consider this subject in Chapter Five.
Example 1: Religiosity by Militancy Among U.S. Urban Blacks
The procedure is most easily explained in the context of a concrete example. Thus we revisit the analysis shown in Tables 1.2 through 1.6 of Chapter One (slightly modified). Recall that we were interested in whether the relationship between militancy and religiosity among Blacks in the United States could be explained by the fact that better educated Blacks tend to be both less religious and more militant. Because education does not completely explain the association between militancy and religiosity, it would be useful to have a way of showing the association remaining after the effect of education has been removed. We can do this by getting an adjusted percentage militant for each religiosity category, which we do by computing a weighted average of the percent militant across education categories within each religiosity category but with the weights taken from the overall frequency distribution of education in the sample. (Alternatively, because they are mathematically identical, we can compute the weighted sum, using as weights the proportion of cases in each category.) By doing this, we construct a hypothetical table showing what the relationship between religiosity and militancy would be if all religiosity groups had the same distribution of education. It is in this precise sense that we can say we are showing the association between religiosity and militancy net of the effect of education. As noted earlier, this procedure is known as direct standardization or covariate adjustment. Note that the weights need not be constructed from the overall distribution in the table. Any other set of weights could be applied as well.
For example, if we wanted to assess the association between religiosity and militancy on the assumption that Blacks had the same distribution of education as Whites, we would treat Whites as the standard population and use the White distribution across educational categories (derived from some external source) as the weights. We will see two examples of this strategy a bit later in the chapter.
Now let us construct a militancy-by-religiosity table adjusted, or standardized, for education, to see how the procedure works. We do this from the data in Table 1.6. First, we derive the standard distribution, the overall distribution of education. Because there are 993 cases in the table (= 108 + ... + 49), and there are 353 (= 108 + 201 + 44) people with a grammar school education, the proportion with a grammar school education is .356 (= 353/993). Similarly, the proportion with a high school education is .508, and the proportion with a college education is .137. These are our weights. Then to get the adjusted, or standardized, percent militant among the very religious, we take the weighted sum of the percent militant across the three education groups that subdivide the "very religious" category (that is, the figures in the top row of the table): 17%*.356 + 34%*.508 + 38%*.137 = 29%. To get the adjusted percent militant among the "somewhat religious," we apply the same weights to the percentages in the second row in the table: 22%*.356 + 32%*.508 + 48%*.137 = 31%. Finally, to get the adjusted percent militant among the "not very or not at all religious," we do the same for the third row of the table, which yields 45 percent. We can then compare these percentages to the corresponding percentages for the zero-order relationship between religiosity and militancy (that is, not controlling for education). The comparison is shown in Table 2.3. (The Stata do file used to carry out the computations, using the command dstdize, and the log file that shows the results, are available as downloadable files from the publisher, Jossey-Bass/Wiley (www.josseybass.com/go/quantitativedataanalysis), as are similar files for the remaining worked examples in the chapter. Because we have not yet begun computing, it probably is best to note the availability of this material and return to it later unless you are already familiar with Stata.)

Table 2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
Percent Militant
Percent Militant Adjusted for Education
Percentage spread
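The weighted sums for the first two religiosity categories can be reproduced with a few lines of code. This is a minimal sketch in Python rather than the Stata dstdize procedure mentioned in the text; the weights and education-specific percentages are the ones quoted in the worked example, and the dictionary and function names are illustrative.

```python
# Standard distribution: proportion of the whole sample at each level
# of education, as given in the text.
weights = {"grammar school": 0.356, "high school": 0.508, "college": 0.137}

# Percent militant within each education group, for two religiosity
# categories (figures quoted in the worked example).
percent_militant = {
    "very religious": {"grammar school": 17, "high school": 34, "college": 38},
    "somewhat religious": {"grammar school": 22, "high school": 32, "college": 48},
}


def directly_standardize(rates, standard_weights):
    """Weighted sum of category-specific rates, using a standard
    distribution as the weights."""
    return sum(rates[level] * standard_weights[level] for level in standard_weights)


adjusted = {
    category: directly_standardize(rates, weights)
    for category, rates in percent_militant.items()
}

for category, value in adjusted.items():
    print(f"{category}: {value:.1f} percent militant, adjusted for education")
```

Rounded to whole percentages, the two results are 29 and 31, matching the figures in the text. The same function would serve for any other standard distribution, for example one taken from an external population, simply by passing a different set of weights.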
STATA do FILES AND log FILES
In Stata, do files are files of commands, and log files record the results of executing do files. As you will see in Chapter Four, the management of data analysis is complex and is much facilitated by the creation of do files, which are efficient and also provide a permanent record of what you have done to produce each tabulation or coefficient. Anyone who has tried to replicate an analysis performed several years or even several months earlier will appreciate the value of having an exact record of the computations used to generate each result.
When presenting data of this sort, it is sometimes useful to compare the range in the percentage positive (in this case, the percent militant) across categories of the independent variable, with and without controls. In Table 2.3 we note that the difference in the percent militant between the least and most religious categories is twenty-one points whereas, when education is controlled, the difference is reduced to sixteen points, a 24 percent reduction (= 1 - 16/21). In some sense, then, we can say that education "explains" about a quarter of the relationship between religiosity and militancy. We need to be cautious about making computations of this sort and only employ them when they are helpful in making the analysis clear. Nor does it make much sense to compute a "spread" or "range" in the percentages if the relationship between religiosity and militancy is not monotonic (that is, if the percentage militant does not increase, or at least not decrease, as religiosity declines).
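The reduction in the percentage spread is a one-line calculation, sketched here with the two spreads given in the text (twenty-one points without controls, sixteen points with education controlled). The function name is an illustrative choice.

```python
def proportion_of_spread_explained(spread_without_controls, spread_with_controls):
    """Share of the zero-order percentage spread that disappears
    once the control variable is held constant."""
    return 1 - spread_with_controls / spread_without_controls


reduction = proportion_of_spread_explained(21, 16)
print(f"{reduction:.3f}")  # roughly a quarter of the original spread
```

As the text cautions, a summary of this kind is only meaningful when the percentages change monotonically across the categories of the independent variable.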
plEE_qT_ tN EARLTER STANDARDTZATTON
}.f,*nS l':':ffiffil:f"i:::i:i N ;*::u,lxli:lifi
to a "weighted net percentage difference" or "weighted net percentage spread." The really useful part of the procedure is the computation of adjusted, or standardized, rates. The subsequent computation of percentage differences or percentage spreads is only sometimes useful, as a way of summarizing the effect of control variables.
Example 2: Belief That Humans Evolved from Animals (Direct Standardization with Two or More Control Variables)
Sometimes we want to adjust, or standardize, our data by more than one control variable at a time, to get a summary of the effect of some variable on another when two or more other variables are held constant. Consider, for example, acceptance of the scientific theory of evolution. In 1993, 1994, and 2000, the N
Table 2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
Definitely not true
Perhaps the recent increase in the proportion of the population that adheres to fundamentalist religious beliefs, especially fundamentalist Protestant views in which the Bible is taken as literally true, accounts for this outcome. To see whether this is so, in Table 2.5 I cross-tabulated acceptance of the scientific view of evolution (measured by endorsement of the statement that descent from other animals is "definitely true") by religious denomination, making a distinction between "fundamentalist" and "denominational" Protestants. (For want of better information, I simply dichotomized Protestant denominations on the basis of the proportion of their members in the sample who believe that "The Bible is the actual word of God and is to be taken literally, word for word." Denominations for which at least 50 percent of respondents gave this response ("other" Protestants and all Baptists except members of the "American Baptist Church in the U.S.A." and "Baptists, don't know which") were coded as fundamentalist; all other Protestant denominations were coded as Denominational Protestants.) Unfortunately, the same distinction, between religious subgroups with and without a literal belief in the holy scriptures of their faith, cannot be made for non-Protestants given the way the data were originally coded in the GSS.
Although there are substantial differences among religious groups in their acceptance of the scientific view of evolution, the fundamentalist-denominational split among Protestants does not seem to be central to the explanation because there is only a 4 percent difference between the two groups. Interestingly, non-Christians appear to be much more willing than Christians to accept an evolutionary perspective, and Catholics appear to be more willing than Protestants to do so. Given these patterns, it could well be that the observed religious differences are, at least in part, spurious.
In particular, educational differences among religious groups, with Jews particularly well educated and fundamentalist Protestants particularly poorly
Table 2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).

                              Percentage Accepting the Evolution of
                              Humans from Animals as "Definitely True"      N
Denominational Protestants    11.8                                          (1,222)
Catholics                     17.8                                          (858)
Other Christians               5.6                                          (18)
Other religion                23.6                                          (123)
No religion                   32.5                                          (391)
educated, might partly account for religious differences in acceptance of the scientific view. Similarly, age differences among religious groups (the young are particularly likely to reject religion) might provide part of the explanation as well.
To consider these possibilities, we need to determine, first, whether acceptance of the scientific explanation of human evolution varies by age and education and, if so, whether religious groups differ with respect to their age and education. Tables 2.6 and 2.7 provide the necessary information regarding the first question, and Tables 2.8 and 2.9 provide the corresponding information regarding the second question.
Unsurprisingly, endorsement of the statement that humans evolved from other animals as "definitely true" increases sharply with education, as we see in Table 2.6, rising from its lowest level among those with no more than a high school education to 36 percent of those with postgraduate education. It is also true that younger people are more likely to endorse the scientific explanation of evolution than are older respondents (see Table 2.7): 18 percent of those under age fifty, compared to 7 percent of those seventy and over, say that it is "definitely true" that humans evolved from other animals.
As expected, Table 2.8 shows that Jews are by far the best educated religious group, followed by other non-Christian groups, and that Fundamentalist Protestants and Other Christians are the least well educated. Also as expected, Table 2.9 shows that those without religion tend to be young. However, members of "other" religious groups also tend disproportionately to be young, perhaps because they are mainly immigrants.
Table 2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
Percentage
Some college    11.9

Table 2.7. Percentage Accepting the Scientific View of Evolution by Age.
Percentage
50–69    13.5    (889)
These results suggest that differences among religious groups with respect to age and education might, indeed, explain part of the observed difference in acceptance of the scientific view of evolution.
To see to what extent age and educational differences among religious groups account for religious group differences in acceptance of evolution, we can directly standardize the religion/evolution-beliefs relationship for education and age. We do this by determining the joint distribution of the entire sample with respect to age and education and then use the proportion in each age-by-education category as weights with which to compute, separately for each religious group, the weighted average of the age-by-education-specific percentages accepting a scientific view of evolution. By doing this, we treat each religious group as if it had exactly the same joint distribution with respect to age and education as did the entire sample. This procedure thus adjusts the percentage of each religious group that endorses the scientific perspective on evolution to remove the effect of religious group differences in the joint distribution of age and education.
Table 2.8. Percentage Distribution of Educational Attainment by Religion.

                              High School   Some      College    Post-
                              or Less       College   Graduate   Graduate   Total    N
Denominational Protestants    47.6          26.5      15.2       10.6
Other Christians              61.1          16.7      16.7        5.6       100.0    (18)
Jews                          15.7          21.7      31.3       31.3       100.0    (83)
Other religion                36.6          25.2      19.5       18.7       100.0    (123)
Total                         47.6          25.6      15.3       11.6       100.1    (3,663)
Table 2.9. Percentage Distribution of Age by Religion.

                     18–49    50–69    70+     Total    N
Protestants          59.8     25.0     15.1    99.9
Other Christians     88.9     11.1      0.0    99.9     (18)
Jews                                   11.0    100.1
Other religion       83.7     13.8      2.4    99.9     (123)
To get the necessary weights, we simply cross-tabulate age by education and express the number of people in each cell of the table as a proportion of the total. These proportions are shown in Table 2.10.
We then tabulate the percentage accepting the scientific position on evolution by religion, age, and education. These percentages are shown in Table 2.11. Note that many of these percentages are based on very few cases. This means that they are not very precise, in the sense that they are subject to large sampling variability. We could collapse the education and age categories still further, but that would ignore substantial within-category heterogeneity. As always, there is a balance to be struck between sampling precision and substantive sensibility or, in terminology we will adopt later, between reliability and validity. In the present case, I might have been better advised to take a more conservative approach, especially because the cell-specific percentages bounce around a lot (exactly as we would expect given the large degree of sampling variability), which makes the differences in the resulting standardized percentages somewhat less clear cut. On the other hand, the weights are very small for the cells based on few cases, which minimizes their contribution to the overall percentages.
Finally, to get the adjusted, or standardized, coefficients, we sum the weighted percentages, where the weights are the proportions in Table 2.10. For example, the adjusted, or directly standardized, percentage of Fundamentalist Protestants who accept the evolutionary viewpoint as "definitely true" is, within rounding error,
9.7 = 5.7*.274 + 3.8*.184 + 15.7*.110 + 29.3*.080 + 4.9*.126 + 3.3*.056 + 25.0*.032 + 40.9*.029 + 3.3*.076 + 7.7*.015 + 10.0*.012 + 16.7*.007
The remaining standardized percentages are derived in the same way. They are shown in Table 2.12, with the observed percentages repeated from Table 2.5 to make comparisons easier.

Table 2.10. Joint Probability Distribution of Education and Age.
18–49    50–69    70+    Total
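The twelve-term sum for Fundamentalist Protestants can be checked with a short sketch. The percentages and weights are the pairs that appear in the worked sum above. Their arrangement into three age groups by four education levels follows the order of the terms and is an assumption made here for readability; the result does not depend on it, because each percentage stays paired with its own weight.

```python
EDUCATION_LEVELS = [
    "High school or less",
    "Some college",
    "College graduate",
    "Postgraduate",
]

# Proportion of the entire sample in each age-by-education cell
# (the weights used in the worked sum; cell labels are assumed).
weights = {
    "18-49": [0.274, 0.184, 0.110, 0.080],
    "50-69": [0.126, 0.056, 0.032, 0.029],
    "70+": [0.076, 0.015, 0.012, 0.007],
}

# Percentage of Fundamentalist Protestants in each cell who say that
# evolution from other animals is "definitely true".
fundamentalist_protestants = {
    "18-49": [5.7, 3.8, 15.7, 29.3],
    "50-69": [4.9, 3.3, 25.0, 40.9],
    "70+": [3.3, 7.7, 10.0, 16.7],
}


def standardize(cell_percentages, cell_weights):
    """Sum of cell-specific percentages, each multiplied by the share of
    the standard population falling in that cell."""
    total = 0.0
    for age_group, group_weights in cell_weights.items():
        for percentage, weight in zip(cell_percentages[age_group], group_weights):
            total += percentage * weight
    return total


adjusted = standardize(fundamentalist_protestants, weights)
print(f"{adjusted:.1f}")  # 9.7
```

Note that cells with very small weights, such as the oldest and best educated, contribute little to the total even when their percentages are large, which is the point made above about cells based on few cases.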
Table 2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Education (Percentage Bases in Parentheses).
Fundamentalist Protestants
Denominational Protestants
Catholic
Other Christians
Jewish
Other
None
[0.0] (2)
[28.6] (14)
11.5 (26)
29.3 (82)
Age 18–49
Some college
3.8 (183)
10.' !
(208)
20.1 (159)
22.7 (22)
[0.0] (4)
6.8
20.0
[21.4]
(6)
(14)
3.3 (60)
9.0 (78)
(44)
(1)
College graduate
25.0 (24)
21.3 (47)
20.0 (25)
(0)
[20.0] (5)
[25.0] (4)
Postgraduate education
40.9 (22)
20.0
32.0 (25)
(0)
[37.5] (8)
[25.0]
Some college
[0.0]
[25.0] (4)
(4)
[41.7] (12)
[66.7] (12)
Age 70 or more
[7.7] (13)
[0.0] (1)
[100.0] (2)
Note: Percentages based on fewer than twenty cases (shown in square brackets) should be interpreted with caution.
Table 2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
Percentage Accepting Scientific View of Evolution as "Definitely True"
Percentage Accepting Scientific View, Standardized by Age and Education
Fundamentalist Protestants
Denominational Protestants
(968)
12.2
(1,222) (858)
Catholics
Other Christians
2.1
9,r5
(18) (83) (123)
{o religion
As you can see, despite the association between religious group affiliation and, respectively, age and education, and the association of age and education with acceptance of an evolutionary account of human origins, standardizing for these variables has relatively little impact on religious group differences in acceptance of the claim that humans evolved from other animals. The one exception is the nonreligious, whose support for a scientific view of evolution appears to be due, in part, to their relatively young age. Despite minor shifts in the expected direction for Fundamentalist Protestants and for Jews and those with other religions, the dominant pattern is one of religious group differences in acceptance of an evolutionary view of the origins of mankind that are not a simple reflection of religious differences in age and education but presumably reflect, instead, the theological differences that distinguish religious categories.
Example 3: Occupational Status by Race in South Africa
Now let us consider another example: the extent to which racial differences in occupational attainment in South Africa can be explained by racial differences in education (the data are from the Survey of Economic Opportunity and Achievement in South Africa, conducted in the early 1990s [Treiman, Lewin, and Lu 2006]; the Stata do and log files for the worked example are available as downloadable files; for information on the data set and how to obtain
it, see Appendix A). From the left-hand panel of Table 2.13, it is evident that there are strong differences in occupational attainment by race. Non-Whites, especially Blacks, are substantially less likely to be managerial, professional, or technical workers than are Whites and are substantially more likely to be semiskilled or unskilled manual workers. Moreover, Blacks are far more likely to be unemployed than are members of any other group. It is also well known that substantial racial differences exist in educational attainment in South Africa, with Whites by far the best educated, followed, in order, by Asians (who in South Africa are mainly descendants of people brought as indentured workers from the Indian subcontinent), Coloureds (mixed-race persons), and Blacks (these are the racial categories conventionally used in South Africa); and also that in South Africa, as elsewhere, occupational attainment depends to a considerable degree on educational attainment (Treiman, McKeever, and Fodor 1996). Under these circumstances, we might suspect that racial differences in occupational attainment can be largely explained by racial differences in educational attainment. Indeed, this is what Treiman, McKeever, and Fodor (1996) found using the International Socioeconomic Status Index (ISEI) (Ganzeboom, de Graaf, and Treiman 1992; Ganzeboom and Treiman 1996) as an index of occupational attainment. However, it also is possible that access to certain types of occupations, such as professional and technical positions, depends heavily upon education, whereas access to others, such as managerial positions, may be denied on the basis of race to those who are educationally qualified.
To determine to what extent, and for which occupation categories, racial differences in access can be explained by racial differences in education, I adjusted (directly standardized) the relationship between race and occupational status by education. Here I used the White distribution of education, computed from the weighted data, as the standard distribution to determine what occupational distributions for each of the non-White groups might be expected were they able to upgrade their levels of educational attainment so that they had the same distributions across schooling levels as did Whites.
The results are shown in panel B of Table 2.13. They are quite instructive. Bringing the other racial groups to the White distribution of education (and assuming that doing so would not affect the relationship between education and occupational attainment within each group), racial differences in the likelihood of being a professional would entirely disappear. Indeed, Blacks would be slightly more likely than members of the other groups to become professionals. By contrast, the percentage of each race group in the managerial category would remain essentially unchanged, suggesting that it is not education but rather norms about who is permitted to supervise whom that account for the racial disparity in this category. The remaining large changes apply to only one or two of the three non-White groups: Asians would not be very substantially affected except for a reduction in the proportion semiskilled; Coloureds would increase the proportion in technical jobs and reduce the proportion in semiskilled and unskilled manual jobs and farm labor; and Blacks would increase the proportion in clerical jobs and reduce the proportion in all manual categories.
If all four racial groups had the same educational distribution as Whites, the dissimilarity (measured by Δ; see Chapter Three) between the occupational distributions of Whites and Asians would be reduced by about 30 percent (from 29.2 to 20.5) as would the dissimilarity in the occupational distributions of Whites and Coloureds (from 37.9 to 26.5), whereas the dissimilarity in the occupational distributions of Whites
Table 2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20–69, Early 1990s (Percentages Shown Without Controls and also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
WithoutControls
Adjusted for Education Black
White
Asian
Coloured
Black
13.7
7.0
3.3
13.2
13.6
11.9
16.5
7.2
13.4
5.8
7.2
13.0
9.1
8.5
3.5
2.2
9.4
2.5
2_6
14.4
18.6
18.8
19.3
8.2
13.0
9.7
6.2
16.6
20.1
100.0
99.1
99.9
100.0
100.0
20.5
26.5
46.1
Dissimilarity from Whites (Δ)
100.2
99.9
100.0
29.2
37.9
52.4
The standard population is the educational distribution of the White male population computed from the survey data weighted to census distributions of region by urban versus rural residence.
Because the coding of occupation data to detailed occupations had not been completed when this table was prepared, I have included a category "occupation unknown."
Δ = 1/2 the sum of the absolute values of the differences between the percentage of Whites and the percentage of the racial group in each occupation category. See Chapter Three for further exposition of this index.
and Blacks would be reduced only about 12 percent (from 52.4 to 46.1). The substantial remaining dissimilarity in the occupational distributions of the four race groups net of education suggests that Treiman, McKeever, and Fodor's (1996) conclusion that education largely explains occupational status differences between race groups in South Africa does not tell the whole story.
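The percentage reductions quoted above follow directly from the Δ values reported for Table 2.13. A minimal sketch of the arithmetic, using Python purely for illustration (the function names are hypothetical and do not come from the text):

```python
def index_of_dissimilarity(pct_a, pct_b):
    """Delta: half the sum of absolute differences between two
    percentage distributions defined over the same categories."""
    if len(pct_a) != len(pct_b):
        raise ValueError("distributions must cover the same categories")
    return 0.5 * sum(abs(a - b) for a, b in zip(pct_a, pct_b))


def percent_reduction(before, after):
    """Percentage by which a quantity falls when it moves from before to after."""
    return 100.0 * (before - after) / before


# Delta relative to Whites: (without controls, adjusted for education),
# as quoted in the text.
deltas = {
    "Asian": (29.2, 20.5),
    "Coloured": (37.9, 26.5),
    "Black": (52.4, 46.1),
}

reductions = {
    group: round(percent_reduction(before, after))
    for group, (before, after) in deltas.items()
}
print(reductions)
```

The rounded reductions agree with the "about 30 percent" and "about 12 percent" figures in the text.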
Example 4: Level of Literacy by Urban Versus Rural Residence in China

Now consider a final example: the relationship among education, urban residence, and degree of literacy in the People's Republic of China. In a 1996 national sample of the adult population (Treiman, Walder, and Li 1996), respondents were asked to identify ten Chinese characters (see Appendix A regarding the properties of this data set and how to obtain access to it). The number of correct identifications is interpreted as indicating the degree of literacy (Treiman 2007a). Obviously, literacy would be expected to increase with education. Moreover, I would expect the urban population to score better on the character recognition task just because urban respondents tend to get more schooling than do rural respondents. The question of interest here is whether educational differences between the rural and urban population entirely explain the observed mean difference in the number of characters correctly identified, which is 2.8 (as shown in Table 2.14).

To determine this, I adjusted (directly standardized) the urban and rural means by assuming that both populations have the same distribution of education: the distribution for the entire adult population of China, computed from the weighted data. Note that in this example it is not percentages that are standardized but rather means. The procedure is identical in both cases, although if this is done by computer (using Stata), a special adjustment needs to be made to the data to overcome a limitation in the Stata command: the requirement that the numerators of the "rates" to be standardized (what Stata calls charvar) be integers. To see how to do this, consult the Stata do and log files for this example, which are included in the set of downloadable files for this chapter.

TABLE 2.14. Mean Number of Chinese Characters Known (out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in the Distribution of Education; N = 6,081).
                   Without Controls   Adjusted for Education   N
Rural residents    2.0                2.4                      (3,002)

The standard population is the entire population of China age 20-69, computed from the survey data weighted to reflect differential sampling rates for the rural and urban populations and to correct for variations in household size. Nine cases for which data on education were missing were omitted.
The results are quite straightforward and require little comment. When education is standardized, the urban-rural gap in the mean number of characters correctly identified is reduced from 2.8 to 1.6. Thus, about 43 percent (= 1 - 1.6/2.8) of the urban-rural difference in vocabulary knowledge is explained by rural-urban differences in the level of educational attainment.

Although the four examples presented here all standardize for education, this is purely coincidental. Many other uses of direct standardization are imaginable. For example, it probably would be possible to explain higher crime rates among early twentieth-century immigrants to the United States than among natives simply by standardizing for age and sex. Immigrants were disproportionately young males, and young males are known to have higher crime rates than any other age-sex combination.
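The logic of directly standardizing means can be sketched in a few lines. This is only an illustration in Python, not the Stata procedure the text refers to: the education categories, the category-specific means, and the standard distribution below are hypothetical values invented for the example, and the function names are assumptions. Only the gaps of 2.8 and 1.6 come from the text.

```python
def directly_standardize(category_means, standard_shares):
    """Weighted average of category-specific means, using the shares of a
    chosen standard population as weights."""
    total = sum(standard_shares.values())
    return sum(category_means[c] * standard_shares[c] for c in standard_shares) / total


def share_explained(gap_without_controls, gap_adjusted):
    """Proportion of the original gap that disappears after adjustment."""
    return 1.0 - gap_adjusted / gap_without_controls


# Hypothetical mean scores within education categories (NOT survey values).
urban_means = {"primary or less": 2.0, "junior high": 4.0, "senior high or more": 6.0}
rural_means = {"primary or less": 1.0, "junior high": 3.0, "senior high or more": 5.0}

# Hypothetical standard distribution of education (shares sum to 1).
standard = {"primary or less": 0.5, "junior high": 0.3, "senior high or more": 0.2}

urban_adjusted = directly_standardize(urban_means, standard)
rural_adjusted = directly_standardize(rural_means, standard)
print(urban_adjusted - rural_adjusted)  # adjusted gap under the hypothetical inputs

# The gaps reported in the text: 2.8 without controls, 1.6 adjusted.
print(round(100 * share_explained(2.8, 1.6)))
```

Because both groups are weighted by the same standard distribution, any gap that remains after adjustment cannot be due to differences in the distribution of education.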
A FINAL NOTE ON STATISTICAL CONTROLS VERSUS EXPERIMENTS

In describing the logic of cross-tabulations, I have been describing the logic of nonexperimental data analysis in general. True experiments are relatively uncommon in social research, although they are widely used in psychological research and increasingly in microeconomics (for a very nice example of the latter, see Thomas and others [2004]). A true experiment is a situation in which the objects of the experiment are randomly divided into two or more groups, and one group is exposed to some treatment while the other group is not, or several groups are exposed to different treatments. If the groups then differ in some outcome variable, the differences can be attributed to the differences in treatments. In such cases we can unambiguously establish that the treatment caused the difference in outcomes (although we may not know the exact mechanism involved). (Of course, this claim holds only when differences between the experimental and control groups are not inadvertently introduced by the investigators as a consequence of design flaws or of failure to rigorously adhere to the randomized trial design. For a classic discussion of such problems, see Campbell and Stanley [1966] or a shorter version by Campbell [1957] that contains the core of the Campbell and Stanley material.)

When experiments are undertaken in fields such as chemistry, sampling is not ordinarily a consideration because it can be confidently assumed that any batch of a chemical will behave like any other batch of the same chemical; only when things go wrong do chemists tend to question that assumption. In the social and behavioral and many of the biological sciences, by contrast, it cannot be assumed that one subject is just like another subject. Hence, in experiments in these fields, subjects are randomly assigned to treatment groups. In this way, it becomes possible to assess whether group differences in outcomes are larger than would be likely to occur by chance because of sampling variability. If so, we can say, subject only to the uncertainty of statistical inference, that the difference in treatments caused the difference in outcomes.

In the social sciences, random assignment of subjects to treatment groups is often, in fact usually, impossible for several reasons. First, both ethical and practical considerations limit the kind of experimentation that can be done on human subjects. For example, it would be neither ethical nor practically possible to determine whether one sort of
schooling was pedagogically superior to another by randomly assigning children to different schools and several years later determining their level of educational achievement.

In addition, many phenomena of interest to social scientists are simply not experimentally manipulable, even in principle. The propensity for in-group solidarity to increase in wartime, for example, is not something that can be experimentally confirmed, nor can the proposition that social stratification is more pronounced in sedentary agricultural societies than in hunter-gatherer societies.

Occasionally, "natural experiments" can be analyzed. Natural experiments are situations in which different individuals are exposed to different circumstances, and it can be reasonably assumed that the circumstance to which individuals are exposed is essentially random. A very nice example of such an analysis is the test by Almond (2006) of the "fetal origins hypothesis." He showed convincingly that individuals in utero during the few months in which the 1918 flu pandemic was raging suffered reduced educational attainment, increased rates of physical disability, and lower income in midlife relative to those in utero in the few months preceding and following the epidemic. Because there is no basis for expecting the exact month of conception to be correlated with vulnerability to the flu virus, the conditions of a natural experiment were well satisfied in this elegant analysis. Natural experiments have become increasingly popular in economics as the limitations of various statistical fixes to correct for "sample selection bias" have become more evident. We will return to this issue in the final chapter. (For additional examples of natural experiments that are well worth reading, see Campbell and Ross [1968], Berelson [1979], Sloan and others [1988], and the papers cited in Chapter Sixteen.)

Given the limited possibilities for experimentation in the social sciences, we resort to a variety of statistical controls of the sort discussed here and later. These procedures share a common logic: they are all designed to hold constant some variable or variables so that the net effect of a given variable on a given outcome can be assessed.
THE WEAKNESS OF MATCHING AND A USEFUL FIX

Sometimes survey analysts attempt to simulate random assignment by matching comparison groups on some set of variables. In its original form, this practice was inherently unsatisfactory. When attempting to match on all potentially relevant factors, it is difficult to avoid running out of cases. Moreover, no matter how many variables are used in the match, it is always possible that the experimental and control groups differ on some nonmatched factor that is correlated with the experimental outcome. However, combining matching with statistical controls can be a useful strategy, especially when the adequacy of the match is summarized via a "propensity score" (Rosenbaum and Rubin 1983). For recent treatments of propensity score matching, see Smith (1997), Becker and Ichino (2002), Abadie and others (2004), Brand (2006), Brand and Halaby (2006), and Becker and Caliendo (2007). Harding (2002) is an instructive application. Propensity score matching is also discussed in Chapter Sixteen.
Compared to experiments, statistical controls have two fundamental limitations, which make it impossible to definitively prove any causal theory (although definitive disproof is possible). First, no matter how many control variables we introduce, we can never be sure that the remaining net relationship is a true causal relationship and not the spurious result of some yet-to-be-introduced variable.

Second, although we speak of holding constant some variable, or set of variables, what we usually do in practice is simply reduce the within-group variability for these variables. This is particularly obvious when we are dealing with cross-tabulations because we generally divide the sample into a small set of categories. In what sense, for example, can we be said to "hold education constant" when our categories consist of those with less than a high school education, those with some sort of high school experience, and those with some sort of college experience? Although the within-category variability in educational attainment obviously is smaller than the total variability in the sample as a whole, it is still substantial. Hence, if two other variables both depend on educational attainment, they are likely to be correlated within educational categories as gross as these, as well as across educational categories. As you will see in more detail later, using interval or ratio variables in a regression framework will not solve the problem but merely transform it. Although the within-category variability generally will be reduced, the very parsimony in the expression of relationships between variables that regression procedures permit will generally result in some distortion of the true complexities of such relationships (discontinuities, nonlinearities, and so on), only some of which can be represented succinctly.

Our only salvation is adequate theory. Because we can seldom definitively establish causal relationships by reference to data, we need to build up a body of theory that consists of a set of plausible, mutually consistent, empirically verified propositions. Although we cannot definitively prove causal relations, we can determine whether our data are consistent with our theories; if so, we can say that the proposition is tentatively empirically verified. We are in a stronger position when it comes to disproof. If our data are inconsistent with our theory, that usually is sufficient grounds for rejecting the theory, although we need to be sensitive to the possibility that there are omitted variables that would change our conclusions if they were included in our cross-tabulation or model. In short, to maintain a theory it is necessary but not sufficient that the data be as predicted by the theory. Because consistency is necessary to maintain the theory, inconsistency is sufficient to require us to reject it, provided we can be confident that we have not omitted important variables. (On the other hand, as Alfred North Whitehead is supposed to have said, never let data stand in the way of a good theory. If the theory is sufficiently strong, you might want to question the data. I will have more to say about this later, in a discussion of concepts and indicators.)
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have considered the logic of multivariate statistical analysis and its application to cross-tabulations involving three or more variables. The notion of an interaction effect, a situation in which the effect of one independent variable depends on the
value or level of one or more other independent variables, was introduced. This is a very important idea in statistical analysis, and so you should be sure that you understand it thoroughly. We also considered suppressor effects, situations in which the effect of one independent variable offsets the effect of another independent variable because the two effects have opposite signs. In such situations, the failure to include both variables in the model can lead to an understatement of the true relationships between the included variable and the dependent variable. We then turned to direct standardization (sometimes called covariate adjustment), a procedure for purging a relationship of the effect of a particular variable or variables. Direct standardization can be thought of as a procedure for creating "counterfactual" or "what if" relationships: for example, what would be the relationship between religiosity and militancy if we adjusted for the fact that well-educated Blacks in the 1960s tended to be both more religious and less militant than less-well-educated Blacks? Having discussed the logic of direct standardization, we considered several technical aspects of the procedure to see how to standardize data starting not only from tables but also from individual records; to standardize percentage distributions; and to standardize means. We concluded by considering the limitations of statistical controls in contrast to randomized experiments.

In the following chapter, we complete our initial discussion of cross-tabulation tables by considering how to extract new information from published tables; then note the one circumstance in which it makes sense to percentage a table "backwards"; touch for the first but not for the last time on how to handle missing data; consider cross-tabulation tables in which the cell entries are means; present a measure of the similarity of percentage distributions, the Index of Dissimilarity (Δ); and end with some comments about how to write about cross-tabulations.
CHAPTER

STILL MORE ON TABLES

WHAT THIS CHAPTER IS ABOUT

In this chapter we wrap up our discussion of cross-tabulations for now. After spending some time learning to love the computer (a very brief time, actually) and then delving into the mysteries of regression equations, we will return to cross-tabulations and discuss procedures for making inferences about relations embodied in them via log-linear analysis.

We begin this chapter with a discussion of how to extract new information from published tables; then note the one circumstance in which it makes sense to percentage a table "backwards"; touch for the first but not for the last time on how to handle missing data; consider cross-tabulation tables in which the cell entries are means; present a measure of the similarity of percentage distributions, the Index of Dissimilarity (Δ); and end with some comments about how to write about cross-tabulations.
REORGANIZING TABLES TO EXTRACT NEW INFORMATION
Oflenwhenworkjngwithpublish^eddara. reading research papers. thardarahadbeenpresented andsoon.we wish differenrly. so."r;,o.:, ,;ffi;;"",r#i,i"r,,"" i, presenred ro
:Jffj:,n:ffiilT:r;J;.,..:I[T",ff*'J# "oo*"0 ffi,3il:'.T:;ti,"]i;li CollapsingDimensions
rerarionship q:oss berween acceprance #?ff: lil ilTfi: 'il'il'illt ll i:only orabonion a tablesuchas Table2.I ln itup,", rwo. rio"*
couldyou *r*i,rr.',*"
[_,xTti:nHy,:,:'#[l$L:,i.:fJJ:i::trJ thepercentage frequencies; iabr".ir i"ati"i : r p"."'*i.i'bo "jf "#jaDle
:
: Tl ^1i]*
" fi .',';*','.ff',T1!::f i ;ffi,fi:::n;il{*nl3:i"",".".11".T:f ;=,*'":
;L?ur::JilTiln :n*"H:;H[tr,il#::!f""titfu i,*l*"y.'i y
::i,:r^.j,",t "
;ql:,fr,::",ff :"t, Hfii]ff l!,i:;jruf,*:;
"iifi fr:';,1t"'#'r:tilf ,uooron. ,11,1zortrorp'or*,rll,ulffi rhe weishr€d il:f il:H""lffi:itrff1,., jp"1lT..,,;:J'il:i#JjlrompuLing ",,"r*
'jLfjtrj #I;fxq#jfi ,1*i*,*#;i.lll;:,*n;qifi compuungthe full tableof frequenciesis, firsr,
tharir provi.", ,
,O*J'*O
"n
the accuracy
ffi*i*X;.L:T.Jyi;'J,'.'j:H,",J:l:,?B:ceorAbortion EduGation
Source: Table 2.1.
Catholic
Protestant
of your computations and, second, that it permits other tabulations to be constructed, for example, the zero-order relation between education and acceptance of abortion.

Although many other examples could be given, they all follow the same logic. You should get in the habit of manipulating tables to extract information from them. Not only is it a useful skill but it also gives you a better understanding of how tables are constructed.
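The general logic of collapsing a dimension (convert percentages back to frequencies, add the frequencies across the categories being combined, then re-percentage) can be sketched as follows. This is an illustration in Python only; the percentages and case counts are hypothetical and are not the values of Table 2.1, and the function name is an assumption.

```python
def collapse(cells):
    """Combine subgroups into one group.

    cells: list of (percent_positive, n) pairs, one per subgroup.
    Returns (percent_positive, n) for the combined group, which is the
    n-weighted average of the subgroup percentages.
    """
    total_n = sum(n for _, n in cells)
    positives = sum(pct / 100.0 * n for pct, n in cells)  # back to frequencies
    return 100.0 * positives / total_n, total_n


# Hypothetical figures: percent accepting, with category sizes, within
# three education levels for one religious group.
one_group_by_education = [(40.0, 50), (60.0, 100), (80.0, 50)]

pct, n = collapse(one_group_by_education)
print(pct, n)
```

Working through the frequencies rather than averaging the percentages directly guards against the common error of giving small and large subgroups equal weight.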
Collapsing Categories to Represent New Concepts
Sometimes we want to view a variable in a manner entirely different from that envisioned by the original investigator, in which case we may want to reorder the categories. We already have seen one example of this, in our discussion of how to treat "no answer" in our consideration of nominal variables in Chapter One. "No answer" may be thought of as a neutral response and hence as lying between the least positive and the least negative response; or "no answer" might be thought of as not on the continuum at all, and hence best treated as missing data.

Another example can be drawn from the U.S. Congress. In the late 1970s, the New York Times, the Washington Post, and similar rags took to calling conservative Democrats "boll weevils" and liberal Republicans "gypsy moths" (fads come and go; you never hear these terms anymore). Suppose we were conducting a study of members of the U.S. House of Representatives and initially classified every member into one of the following four categories:

1. Standard Republicans
2. Gypsy moths
3. Boll weevils
4. Standard Democrats

This four-category classification can be collapsed into three distinct two-category classifications, each of which represents a different theoretical construct. If we were interested in studying party politics and wanted to know which party controlled the House, we would combine category 1 with category 2, and combine category 3 with category 4:

Republicans: Standard Republicans, Gypsy moths
Democrats: Boll weevils, Standard Democrats
If we were interested in distinguishing between liberals and conservatives, we would combine category 1 with category 3 and combine category 2 with category 4:

Conservatives: Standard Republicans, Boll weevils
Liberals: Gypsy moths, Standard Democrats
If we were interested in studying party loyalty and wanted to know what proportion of congressmen are party loyalists, we would combine category 1 with category 4 and combine category 2 with category 3:

Party loyalists: Standard Republicans, Standard Democrats
Crossovers: Boll weevils, Gypsy moths
The point of all this is that nothing is sacrosanct about the way a variable is originally constructed. You can and should recode variables freely to get the best representation of the concept you are interested in studying.

A very important corollary of this point is that when you are designing or executing a data collection effort, you should always conserve as much detail as possible. In the early days of survey research, the technology of data manipulation encouraged researchers to pack as many variables as possible onto one IBM card; hence highly aggregated classifications were adopted to save space (and the tedium of manipulation). The technology has changed. Today there is, with one exception, no reason not to preserve as much detail as possible in the initial coding of your variables. (The exception is that you need to design your data collection instrument in a way that minimizes respondent, interviewer, and coder error. For example, in a survey with data collection done by face-to-face interview, a lengthy and complicated coding scheme for a variable is likely to increase interviewer error.) You never know when you will get a new idea that will require recoding one or several variables; and if you lack imagination, the next user of the same data may not. Every experienced survey analyst has felt great frustration on countless occasions because detail that was initially collected was not preserved in the coded data. Aggregating variables is an easy computer operation; disaggregating variables is impossible, at least without going back to the original questionnaires, and usually not then either.
WHEN TO PERCENTAGE A TABLE BACKWARDS

There is one exception to the rule that tables should be percentaged so that the categories of the dependent variable add to 100 percent. This is when the sample is not representative of the population "at risk" of falling into the various categories of the dependent variable. Sometimes samples are stratified on the dependent variable rather than the independent variable or variables; that is, sometimes they are chosen on the basis of their value on the dependent variable. Various hard-to-find populations are typically sampled in this way: convicted criminals, university students, political activists, cancer patients, and so on.

TABLE 3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
Father's Occupation
Professional
Other
Total%
. . 1q :.. i '1q0% l 28
18
100y.
15
57
2a
1000/o
24
35
41
100%
',:bel laureates
Senators
Employed males
1900
1910
1920
Source: Adapted from Zuckerman (1977, 64).
For example, Table 3.2 shows the social origins of various American elites. In this case the table is percentaged to show the distribution of fathers' occupations for each of a number of elite groups, and also for the U.S. labor force as a whole for selected years roughly corresponding to when the fathers were in the labor force. The point of the table is, of course, to show that elites come from elite origins: much higher percentages of the members of these elites are from professional or managerial origins than would be expected if their fathers' occupations corresponded to the distribution of professionals and managers in the labor force. The table is percentaged in this direction, contrary to the usual rule that percentages express the conditional probability of some outcome given some causal or antecedent condition, because it is constructed from information obtained from samples of elites (plus some general labor force data), and therefore is not representative of the social origins of the population. It would not be sensible to use data from a representative sample of the population to study the likelihood that the children of professionals become Supreme Court justices, Nobel laureates, and so on, because we would virtually never find any cases with these outcomes unless we obtained data
from the entire population; the outcomes are simply too rare. Thus, in such cases, we rely on response-based samples, and percentage the table so as to show the distribution of the independent variable for each response category: in the present case, the social origins of various elites compared to the general population.

CROSS-TABULATIONS IN WHICH THE DEPENDENT VARIABLE IS REPRESENTED BY A MEAN

When the dependent variable is an interval or ratio variable, it often is useful to display the means of the dependent variable within categories formed by a cross-classification of the independent variables. For example, suppose you are interested in the relationship among education, gender, and earnings, perhaps because you suspect that women get smaller returns on their education than do men. Table 3.3 shows the mean annual income in 1979 for full-time workers by level of educational attainment and gender, computed from the 1980 NORC General Social Survey.

TABLE 3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
Collegegraduate
27,227 (46)
11,789
16,288 (131)
10,324 (10s)
'13,536 (236]'
t 1,135 (246)
16,654
20,4r5 (380)
(3s)
20,s12 ( 81)
\ozo)
TECHNICAL POINTS ON TABLE 3.3

1. Note that the format of this table is identical to that of Table 1.6 from Chapter One, except that percentages are presented in Table 1.6, and means are presented here. The tables are read in the same way.

2. In this table levels of educational attainment are presented in descending order. Either descending or ascending order is appropriate; the choice should depend on which makes the discussion easier.

3. Note that this table includes only 626 cases, out of a total sample of 1,468. This reflects the fact that many individuals do not work full time, particularly women, and also that information on education and income is missing for some individuals. Sometimes it is useful to catalog the missing cases, especially when there are many missing cases or when their distribution has substantive importance. In such cases, a footnote can be added to the table or an addition made at the bottom of the table, for example:

   Number of cases in table                        626
   No information on income                         57
   No information on education                       1
   No information on education and income            1
   Total working full time                         685
   Men not working full time                       235
   Women not working full time                     549
   Total in full sample                          1,469

The tabulation shows 1,469 although there are 1,468 cases in the sample because of rounding error. Because of errors in the execution of a "split-ballot" procedure in the 1980 GSS, the data have to be weighted to be representative of the population (Davis, Smith, and Marsden 2007). We will consider weighting issues in Chapter Nine.

Even when you do not present the information shown in the tabulation, it is wise to compile it for yourself, as a check on your computations. In fact, in the course of creating the preceding accounting of missing cases, I discovered a computing error I had made that resulted in incorrect numbers in Table 3.3 (since corrected).

An alternative way to display these data, which would make the point of the table more immediately evident to the reader, is to show in the rightmost column female means as a percentage of male means, rather than the total means. Table-making is an art, and the aim of the game is to make the message as clear and easy to understand as possible.
Inspecting Table 3.3, you see that in 1980, women earned much less than equally well-educated men, although for both men and women income tended to increase as the level of education increased. The gender difference in incomes is striking: on average, women earned just over half of what men did, and the best educated women (those with postgraduate training) earned less on average than did the least educated men (those who did not complete high school).

To provide an easily grasped comparison of male and female average incomes for each level of education, we can compute the ratio of female to male means. Ordinarily, these would simply be included in an additional column in the table or as a substitute for the total column.
Educational Attainment     Mean Female Income Expressed as a Percentage of Male Mean Income
Postgraduate training      44
College graduate           43
Some college               68
High school graduate       63
Less than 12 years         53
Total                      55
The computations here are just the ratios multiplied by 100, which yield the female means expressed as percentages of the male means. They show that within education categories, women on average earn between two-fifths and two-thirds what men do. You might be curious whether things have changed since 1980. To find out, you can construct the same table from a more recent GSS.
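The ratio computation is simple enough to check by hand or by machine. In the Python sketch below, the pairing of male and female means with education rows is an assumption: it is inferred from the values that survive in Table 3.3 and from the ratios reported above, and only three rows could be paired this way. The names used are hypothetical.

```python
# (male mean, female mean) in 1979 dollars. Row assignments are inferred,
# not taken from an intact copy of Table 3.3.
means = {
    "College graduate": (27227, 11789),
    "High school graduate": (16288, 10324),
    "Total": (20415, 11135),
}


def female_as_percent_of_male(male_mean, female_mean):
    """Female mean expressed as a percentage of the male mean."""
    return 100.0 * female_mean / male_mean


ratios = {
    row: round(female_as_percent_of_male(male, female))
    for row, (male, female) in means.items()
}
print(ratios)
```

Under that assumed pairing, the rounded ratios reproduce the 43, 63, and 55 percent figures listed for these three rows.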
SUBSTANTIVE POINTS ON TABLE 3.3

The ratio of female to male incomes shown here (55 percent) is somewhat lower than the ratio typically estimated from census data (for example, Treiman and Hartmann 1981, 16), which is about 60 percent. The discrepancy may reflect differences in the definition of full-time workers. Most of the computations based on census (or Current Population Survey [CPS]) data define "full-time year-round" workers as those employed at least thirty-five hours in the week preceding the survey and employed at least fifty weeks in the previous year. The GSS question, by contrast, asks whether people were working in the previous week and if so, how many hours, or, if they had a job but were not working in the previous week, how many hours they usually worked. It may be that the GSS table
includes a substantial number of people who did not work full time in the previous year and therefore had lower incomes than those employed full time, whereas these people would be excluded from computations based on census or CPS data. Because women tend to have more unstable employment histories than men, it is probable that those included in the GSS but not the census definition of "full time" would be mainly women, which would lower the GSS ratio relative to ratios computed from census or CPS data.

Note that there is a certain amount of slipperiness to the analysis using either the GSS or the census definitions of full-time workers: information on hours worked per week at the time of the survey is related to income computed for the previous calendar year. There is no help for this because the alternative is to ask about hours per week typically worked last year, which is bound to be highly error-prone, or to ask about current salary or wage, which is also highly error-prone because income is highly variable over the course of the year. The convention, which is the convention because it is thought to yield the best data, is to ask about hours worked in the past week but to ask the weeks worked and income questions with respect to the past calendar year.

Another possible reason for the discrepancy between the GSS and census estimates of the ratio of female to male incomes is that the GSS figures are subject to substantial sampling error. We will take up statistical inference in survey analysis in Chapter Nine.

The point of this note is to emphasize that whenever your results differ from those
reported by others, especially those that are widely cited, it is important to attempt to account for the differences as best you can, and to eliminate candidate explanations that prove to be incorrect. Your papers should be filled with comments of this sort; they give the reader confidence that you have thought through the issues and are aware of what is going on in your data and in the literature.
Differences from Information on Missing Data
Note that the catalog of sources of missing data presented in the technical note on Table 3.3 can be combined with information in the table to get an approximate estimate of differences in labor force participation rates. The row marginal of the table tells us that there are 380 males and 246 females employed full time for whom complete information is available. From the information in the technical note, we see that there are 235 males and 549 females who are not employed full time. If we are willing to ignore the 59 people who are employed full time but for whom information is missing on education or income, we can estimate that 62 (= [380/(380 + 235)] × 100) percent of the males in the sample and 31 (= [246/(246 + 549)] × 100) percent of the females in the sample were employed full time during the week of the survey. Of course, because we have the data, we could get these estimates directly and would not have to ignore the 59 missing cases. But if we had only the published table and the accounting of sources of missing data, we
56
DataAnalysis: Quantitative DoingSocialResearch to Testldeas
could use them to estimate labor force participation rates, even though the table was not prcsentedwith this in mind.
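The arithmetic behind these two estimates is easy to check by machine. The following is a minimal sketch in Python (not the Stata used later in the book); the function name `percent_full_time` is an illustrative label, and the counts are the ones cited in the text.

```python
def percent_full_time(full_time, not_full_time):
    """Percentage of a group employed full time, from the two counts."""
    return 100 * full_time / (full_time + not_full_time)


# Counts cited in the text: full-time workers with complete information
# (from the table marginals) and people not employed full time (from the
# technical note on missing data).
men = percent_full_time(380, 235)
women = percent_full_time(246, 549)

print(round(men), round(women))  # 62 31
```

The 59 full-time workers with missing education or income are left out of both numerators, exactly as in the text, so the estimates are slightly low.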
Still Another Way of Presenting the Same Data

Sometimes it is useful to present standard deviations as well as means in tables such as Table 3.3. When you need to present standard deviations as well as means, a useful way to avoid overcrowding your tables is to present several panels, as in Table 3.4. The point of presenting the standard deviations is both to enable the reader to do statistical inference computations from the data in the table (the standard deviations are needed to compute confidence intervals for tests of the significance of the difference between means) and to provide substantive information. For example, it is informative to note from the rightmost column that the heterogeneity in income is more than three times as great for men with post-graduate training as for women with post-graduate training, a ratio that is much larger than for any of the other levels of education. This gives us a hint as to why the average income of women with post-graduate training is so low: unlike their male counterparts, some of whom get extremely high-paying jobs, these women appear to be locked into a set of jobs with a very narrow range of incomes. We could take this further by investigating the properties of such jobs, but we will not do so here.

A serious shortcoming in the comparison of means across groups is that means, unlike medians, are sensitive to outliers (extreme observations). Thus, for example, the inclusion of a few very high-income people in a sample can substantially affect the computed means. This is equally a problem when the data are coded into a set of categories with a top code for incomes higher than some value, as is the case for the income measures used in the GSS. In 1980 the top code for income was $50,000. To compute a mean, a value has to be assigned to each category. This is not much of a problem for most categories; it is conventional, and reasonably accurate, to simply assign the midpoint of the range included. For example, the bottom category, "under $1,000," would be assigned $500, and so on.

But for the top code, any decision is likely to be arbitrary. One possibility is to use a Pareto transformation to estimate the mean value of the top code (Miller 1966, 215-220), but this depends on rather strong assumptions regarding the shape of the distribution. In the analysis shown here I thus, rather arbitrarily, assigned $62,500 to the top code. Had I assigned, say, $75,000, the male-female income differences for well-educated people would have been larger, and the male standard deviations would have been larger as well.

In the case of skewed (asymmetrical) distributions where one tail is longer than the other, of which income is perhaps the most common example, it makes more sense to compute medians for descriptive purposes, although for analytic purposes most analysts resort to a transformation of income, usually by taking the natural log of income, because medians are algebraically intractable. Table 3.5 is the equivalent of Table 3.3 except that medians are substituted for means. (If an analyst wants an analog to a standard deviation, the interquartile range is commonly used.) In this case the means and the medians yield similar interpretations, but often this is not the case.
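To see why the value assigned to the top code matters for means but not for medians, consider the following Python sketch. The category values and counts are invented for illustration and are not GSS figures; only the two candidate top-code values, $62,500 and $75,000, come from the text. The median here is a crude category-level median (the value assigned to the category containing the middle case).

```python
def grouped_mean(values, counts):
    """Mean of grouped data, with one assigned value per category."""
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)


def grouped_median(values, counts):
    """Value assigned to the category that contains the middle case."""
    half = sum(counts) / 2
    running = 0
    for value, count in zip(values, counts):
        running += count
        if running >= half:
            return value


# Hypothetical income categories (assigned midpoints) and frequencies.
midpoints = [500, 5000, 12500, 22500, 37500]
counts = [5, 20, 40, 20, 10, 5]  # the last count is the top-coded category

for top_value in (62500, 75000):
    values = midpoints + [top_value]
    print(top_value, grouped_mean(values, counts), grouped_median(values, counts))
# 62500 17400.0 12500
# 75000 18025.0 12500
```

Raising the top-code value moves the mean but leaves the median where it was, because the middle case does not fall in the top category.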
TABLE 3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).

Education               Men             Women           Total
College graduate        23,750 (46)     11,250 (35)     18,750 (81)
High school graduate    16,250 (131)     9,000 (105)    11,250 (236)
Total                   16,250 (380)    11,250 (246)    13,750 (626)

[The remaining education rows, including one beginning "Less than", are illegible in the source.]
INDEX OF DISSIMILARITY

Thus far we have studied the association between two or more variables by comparing percentages, means, or medians across categories of the independent variable or variables. As we have noted already, there are situations in which this strategy does not yield particularly informative results. In particular, when there are large numbers of categories in a distribution, comparing the conditional percentages in any one category ignores most of the information in the table.

Suppose you are interested in knowing whether the labor force is more segregated by sex or by race. You might investigate this by cross-tabulating occupation by sex and race, as in Table 3.6. Visually, the table is of little help: it is not obvious whether the distributions of the two racial groups or the two gender groups are more similar. To decide this, you can compute the Index of Dissimilarity (Δ), given by

\[ \Delta = \frac{1}{2} \sum_{i=1}^{n} \lvert P_i - Q_i \rvert \tag{3.1} \]

where P_i equals the percentage of cases in the ith category of the first distribution and Q_i equals the percentage of cases in the ith category of the second distribution. This index can be interpreted as the percentage of cases in one distribution that would have to be shifted among categories to make the two distributions identical. If the two distributions are identical, Δ will of course be 0. If they are completely dissimilar, as would be, for example, the distribution of students by gender in an all-girls school and an all-boys school, Δ will be 100.

From Table 3.6 we can compute Δ for each pair of columns. For example, the Δ for White males and White females (which gives us the extent of occupational segregation by sex among Whites) is computed as 42.1 = (|15.6 - 16.4| + |14.9 - 6.8| + ... + |1.5 - 0.9|)/2. In the present case, four of the six comparisons are of interest:
Occupational segregation by sex among
    Whites                 42.1
    Blacks and others      41.3
Occupational segregation by race among
    Men                    24.3
    Women                  18.2
From these computations, we see that more than 40 percent of White women would have to change their major occupation group to make the occupational distribution of White females identical to that of White males, and similarly for Black and other women relative to Black men (note that the coefficient is symmetrical, so we could as easily discuss the extent of change required of males to make their distributions similar to those of females). By contrast, less than one-quarter of Black males would have to change major occupation groups to make the Black male distribution identical to the White male distribution, and among women, the corresponding proportion is less than one-fifth. Thus, we conclude that occupational segregation by sex is much greater than occupational segregation by race. Although it is not common to report tests of significance for Δ, it is possible to do so. (See Johnson and Farley [1985] and Ransom [2000] for discussions of the sampling distribution of Δ.)

One important limitation of the Index of Dissimilarity is that it tends to increase as the number of categories increases (Δ cannot get smaller if the categories of a distribution are disaggregated into a larger number of categories; it can only get larger or remain unchanged). Hence, comparisons of Δs are legitimate only when they are computed from distributions based on identical categories. For example, it would not be legitimate to use Δ as a measure of the degree of occupational sex segregation in different countries because occupational classifications tend to differ from country to country (unless, of course, the distributions were recoded to a standard classification, for example, the International Standard Classification of Occupations [International Labour Office 1969, 1990] or some aggregation of this classification).
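The computation of Δ, and its dependence on the level of aggregation, can be sketched in a few lines of Python. The function name and the three-category distributions below are invented for illustration; they are not the Table 3.6 figures.

```python
def index_of_dissimilarity(p, q):
    """Half the sum of absolute differences between two percentage distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must use the same categories")
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


# Completely dissimilar distributions: students by gender (girls, boys)
# in an all-girls school and an all-boys school.
print(index_of_dissimilarity([100, 0], [0, 100]))  # 100.0

# Hypothetical three-category distributions (each sums to 100).
group_a = [40, 30, 30]
group_b = [30, 40, 30]
print(index_of_dissimilarity(group_a, group_b))  # 10.0

# Collapsing the first two categories hides the difference entirely.
collapsed_a = [group_a[0] + group_a[1], group_a[2]]
collapsed_b = [group_b[0] + group_b[1], group_b[2]]
print(index_of_dissimilarity(collapsed_a, collapsed_b))  # 0.0
```

The last two results illustrate the limitation just described: the same pair of groups yields a Δ of 10 with three categories and 0 with two, so values of Δ are comparable only when the categories are identical.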
TABLE 3.6. Percentage Distribution over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).

[Table body largely illegible in the source. Column headings: White Men, Black and Other Men, White Women, Black and Other Women. Legible row labels: Managers and administrators; Clerical workers; Other service workers; Farm laborers and supervisors; N (thousands). Legible N entries: (50,721) and (5,779).]

Source: Adapted from Treiman and Hartmann (1981, 26).
WRITING ABOUT CROSS-TABULATIONS

In writing about cross-tabulations, or for that matter, quantitative relationships of any kind, the aim of the game is clarity, not elegance. You should try to say enough about what the table shows to guide the reader through it but not so much as to confuse or bore. Strive for economy of prose. Hemingway is a good model. Among quantitative social scientists, Nathan Keyfitz (who adds charm to simplicity) and Paul Lazarsfeld (a good example because his native language was German, not English, and he was said to write his drafts a half dozen times or more before being satisfied with the prose) are worth emulating. Robert Merton is not a quantitative sociologist but is a good negative role model anyway. He is excessively ornate and uses erudition to finesse sticky points. Too many social scientists are simply turgid.

Howard Becker's book, Writing for Social Scientists (1986), is a wonderful primer on writing good social science, but he does not pay much attention to writing about quantitative data. However, two recent books by Jane Miller (2004, 2005) do this very well, providing much useful advice. It would be well worth your time to consult both of these texts, the first of which focuses on cross-tabulations and the second on multivariate models. The following are some specific pointers for writing about the sort of data we are concerned with here:

- Describe tables mainly in terms of their substantive implications. Cite numbers only as much as is necessary to make clear what the table shows, and then state the conclusions the numbers lead you to. The point of presenting data is to test ideas, so the data should be discussed in terms of their implications for the ideas (hypotheses) being tested. Simply citing the numbers is not sufficient. On the other hand, you need to cite enough numbers to guide the reader through the table because most readers, including most professional social scientists, are more or less illiterate when it comes to reading tables.

- Strive for simplicity. Try to state your argument and describe your conclusions in terms your ancient grandmother or your cousin the appliance salesman would understand. There is no virtue in obscurity. Obscurity and profundity are not synonyms; obscurity and confusion are, at least in this context. As our brethren in the physical sciences know, truly elegant explanations are almost always simple.

- Avoid phrases that add no meaning. For example, instead of "We now investigate what inference we can make as to whether A might be said to have an effect on B," write "Does A affect B?"

- Avoid passive constructions. "It is found that X is related to Y" tells us no more than "X is related to Y." Avoid "A scale of support for U.S. foreign policy was constructed." Who constructed it, God? Write "I constructed a scale of support for U.S. foreign policy" or "I used the University of Michigan Internationalism Scale to measure support for U.S. foreign policy."

- Avoid jargon when it does not help. Note that I did not suggest avoiding jargon altogether. Jargon, the technical terms of a particular discipline or craft, has a clear function: economy. Use jargon terms when they enable you to convey a point in a sentence that otherwise would require a paragraph. But if ordinary
[...]
[...] your writing is not [...] the wall next to your word processor, for comfort [...]:

- Don't get it right, get it written.

- Write, don't read.

- Don't let the perfect become the enemy of the possible.

- Have the courage to be simpleminded.

- Anything worth doing is worth doing superficially (with thanks to John Tukey).

- The last 10 percent of the work takes half the time. The first 10 percent of the work also takes half the time.

- You can't write the second draft until you have written the first draft.

- Write honest first drafts. (Show your friends your first drafts, not your fifth drafts passed off as first drafts. It is much more efficient to get others to tell you what is wrong, and what is right, with your prose than to try to figure it out for yourself.)

- There is no such thing as good writing, only good rewriting.

- Accept criticism gracefully, even though it feels like rape, castration, or some similar violation of your person; it happens to everybody, and everybody feels the same way.
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to extract new information from published tables. Then we noted the one circumstance in which it makes sense to percentage a table "backwards": when we analyze data derived from "response-based" samples, samples stratified on the dependent variable. We saw why it is necessary to provide information on cases in the sample but excluded from a table, and how to do this. We considered how to construct and interpret cross-tabulations in which the cell entries are means (and standard deviations). We learned how to compute the Index of Dissimilarity (Δ), a measure of the similarity of percentage distributions. And we considered how to write about cross-tabulations.

All of our work so far has been based on paper-and-pencil operations, involving at most a hand calculator. In the next chapter we enter the world of modern social research by learning how to construct cross-tabulations from data on individuals via computer software designed for statistical analysis, focusing on the statistical package Stata, which we will use in the remainder of this book.
CHAPTER FOUR

ON THE MANIPULATION OF DATA BY COMPUTER

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider how to manipulate data by computer to produce cross-tabulations. The same logic of data manipulation will apply when we get to regression analysis, so this chapter serves also as an introduction to statistical analysis by computer. We consider how data files (of the kind that are of interest in this book) are organized and how to extract data from them; we consider ways to transform variables to make them represent the concepts of interest to us; and we again address the nagging problem of how to handle missing data.
INTRODUCTION

Most statistical analysis by socia[l ...] [...] new commands [...] programs [...] academic users. Stata is rapidly becoming [...]
A HISTORICAL NOTE ON SOCIAL SCIENCE COMPUTER PACKAGES

[...] written by computer [...] as introductions to [...] the original manual is no [...] universities [...] As social scientists [...] It is not an easy language to teach or to learn. Fortunately, Stata, which [...] even with very large data sets (for example, a 1 percent sample of the [...] census). It is capable of doing most of the [...] modern data [...] are generally simple and straightforward [...] Appendix 4A [...] provides tips for carrying out data analysis using Stata.
[...] are available both for Unix platforms and for PCs. Of the three, Stata is the [...] nearly identical across platforms. There are, of course, many other statistical packages as well. Many of these packages are worth exploring, but only after you have mastered the material in this book. As you will discover, the logic of data analysis by computer is fairly standard, although the command [...] differs somewhat from program to program. Having once mastered the basic logic, it is easy to apply it to other data sets and other statistical package computer programs.

HOW DATA FILES ARE ORGANIZED

[...] way to think of the organization of data in a computer is to imagine a matrix in which the rows are cases and the columns (or sets of columns) are variables. Specifically, consider a data set that contains 257 variables (but 422 columns of data because [...]) [...] (hence 1,609 rows of data). For the sake of concreteness, think of the data set as containing information from a representative sample of the U.S. population. In such a data set the information might be organized as in Exhibit 4.1.

To manipulate these data, we need a map to the data set, which tells us where in the [...] particular information is located and what the information means. Such a map is known as a codebook. In the present example, we might have a codebook something like the one shown in Exhibit 4.2.

Armed with the information in the codebook, we now know exactly what the data set contains. It consists of one record per respondent for each of 1,609 respondents. [...] also generally provide information about the characteristics of the [...] on which the data set is based [...]
EXHIBIT 4.2. A Codebook Corresponding to Exhibit 4.1.

Variable
Number    Column    Variable Name    Variable Label and Code
1         1-4       IDNO             [...]
2         5         SEX              Sex of respondent
                                       1  Male
                                       2  Female
3         6-7       [...]            Age (exact year)
                                       99  99 or older
...
257       422       POLICY           The policies of the president are
                                       1  Wonderful
                                       2  OK
                                       3  Not so hot
                                       4  God awful
                                       5  Who knows and who cares
                                       6  No answer or uncodable
[...] computer readable and are known as files. In the present example, the first four columns give the identification number for each respondent. Usually this is of little interest at the data analysis stage, but it is vitally necessary to keep track of the data and is crucial if ever we want to add additional data to the file: for example, if we have conducted another survey of the same respondents and want to merge the data from the two surveys, or if we want to supplement interview responses with information from organizational records, and so on. Column 5 gives the sex of the respondent, columns 6 and 7 give the age, and column 422 gives the response to a question about the policies of the president.

Using the response categories indicated in the codebook, we see that the first respondent is a twenty-seven-year-old male who thinks the president's policies are god awful and that the second respondent is a forty-one-year-old female who thinks the president's policies are wonderful. The third respondent is a woman for whom no information is available regarding either her age or her judgment of presidential policies. Perhaps she refused to answer these questions in the interview or gave nonsensical responses, or perhaps there was some sort of editing error that destroyed the information; in any event, it is unavailable to the data analyst. (Note that there is no "n/a" code for sex. It is rare to find a "no answer" code for sex, at least in interview surveys, because the interviewer usually records this information.)

Some codebooks give the frequency distribution (the marginals) for each variable. This is a very useful practice, and if you construct a codebook, you should include the marginals (the codebook command in Stata accomplishes this). Their inclusion permits better initial judgments as to suitable cutting points for variables as well as a standard against which to check your computer output for accuracy. It is very easy to make mistakes when specifying computer runs, so you should check each run for consistency with previous runs and the marginals.

Suppose we wanted to ascertain whether men and women differ in their support for presidential policies. To do this, we might cross-tabulate the presidential policy question by sex, percentaging the table so that the judgment of presidential policies is the dependent variable. Thus we have to instruct the computer where to find each variable, to do the cross-tabulation, and to percentage the table in the appropriate direction. We also have to instruct the computer what to do about the "no answer" category in the presidential policy variable.

There are two ways to specify how to locate data in a file, and computer programs differ as to whether either or both is permissible. Some programs use instructions that point to particular columns in the file: for example, "cross-tabulate column 422 by column 5." More commonly, programs require that the analyst first specify where in the data set each variable is located and then use variable names to command particular manipulations: for example, "The variable SEX is in column 5 and the variable POLICY is in column 422. Cross-tabulate POLICY by SEX." A variant of this approach is to require a map numbering the variables sequentially and specifying their location, for example:

Variable    Columns
001         1-4
002         5
003         6-7
...
257         422
"Cross-tabulate VAR257 by VAR002." In most current programs, including Stata, SAS, and SPSS, such maps are created in the course of creating system files; as part of the preparation of the file, variable names (usually restricted to eight characters, although no longer so in Stata beginning with Version 6.0), variable labels, and value labels (indicating the meaning of each response category) are attached to the file, and variables are then identified by name. In instructions to the program, which are known as commands, the analyst uses the names of the variables and need not be concerned about their location in the file. For example, the Stata command

    tab policy sex, col

instructs the computer to cross-tabulate POLICY (the row variable) by SEX (the column variable) and compute column percentages. Note that in Stata, variable names are case sensitive. Thus Stata regards sex, SEX, and Sex as three different variables. (Although in this book variable names appear in ALL CAPS, to make it easier to distinguish variable names from other words in a sentence, in my Stata command files [do files; see the following discussion] I always use lowercase names to avoid extra typing and the error that accompanies it.)
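What a statistical package does behind such a command can be sketched in Python. This is an illustration of the logic only, not Stata's implementation. The column locations come from the codebook described in the text; the records themselves are invented (the first three follow the respondents described above), blanks for a missing age are an assumption, and dropping "no answer" cases is just one possible treatment of missing data.

```python
from collections import Counter

# Column locations (1-based, inclusive) as given in the codebook in the text.
LAYOUT = {"idno": (1, 4), "sex": (5, 5), "age": (6, 7), "policy": (422, 422)}
RECORD_LENGTH = 422
NO_ANSWER = "6"  # "No answer or uncodable" on the policy item


def make_record(idno, sex, age, policy):
    """Build a hypothetical 422-column record; columns not listed stay blank."""
    columns = [" "] * RECORD_LENGTH
    fields = {"idno": f"{idno:04d}", "sex": sex, "age": age, "policy": policy}
    for name, text in fields.items():
        start, end = LAYOUT[name]
        assert len(text) == end - start + 1, f"{name} has the wrong width"
        columns[start - 1:end] = text
    return "".join(columns)


def read_field(record, name):
    """Extract one variable from a record by its column location."""
    start, end = LAYOUT[name]
    return record[start - 1:end]


def column_percentages(records, row_var, col_var, exclude_rows=()):
    """Cross-tabulate row_var by col_var and percentage within each column."""
    counts = Counter()
    for record in records:
        row = read_field(record, row_var)
        if row in exclude_rows:
            continue  # assumed treatment of missing data: drop the case
        counts[(row, read_field(record, col_var))] += 1
    column_totals = Counter()
    for (_, col), n in counts.items():
        column_totals[col] += n
    return {(row, col): 100 * n / column_totals[col]
            for (row, col), n in counts.items()}


# Made-up respondents: (id, sex, age, policy).
respondents = [
    (1, "1", "27", "4"), (2, "2", "41", "1"), (3, "2", "  ", "6"),
    (4, "1", "35", "1"), (5, "2", "52", "4"), (6, "1", "60", "4"),
    (7, "2", "33", "1"), (8, "1", "48", "4"), (9, "2", "29", "1"),
]
records = [make_record(*r) for r in respondents]
percentages = column_percentages(records, "policy", "sex",
                                 exclude_rows={NO_ANSWER})
for (policy, sex), pct in sorted(percentages.items()):
    print(f"policy={policy} sex={sex}: {pct:.1f}%")
```

Percentaging within columns (here, within each sex) treats the policy judgment as the dependent variable, which is the direction the text calls for.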
A Digression on Card Decks and Card-Image Computer Files

Computers began to be used extensively in the social sciences in the mid-1960s but did not become ubiquitous until the 1970s. As a consequence, [...] data sets of interest were created for use with pre-computer analytic technology, specifically with machinery that reads IBM punch cards (see Figure 4.1). Although the logic of data organization is similar to that used for analysis by computer, the technology dictated several important differences. Whereas there is in principle no limit to the number of variables that can be included in a single computer record (although there are limitations as to how many variables a program can handle), an IBM card contains eighty columns. Because the machinery for manipulating IBM cards could handle only one card at a time (such machines [...] unit record equipment, where the record was one card [...] length), there was a premium on packing as many variables as possible onto a single card.

A card data set consists of one or more cards per respondent. For example, to represent all of the data contained in our illustrative 257-variable, 422-column data set would require 6 cards per respondent (= 422/80, rounded up) for each of 1,609 respondents, or 9,654 cards. The information shown for the first respondent might be represented on an IBM card as in Figure 4.1, where the response to presidential policies is contained in column 64, but otherwise the columns correspond to those in Exhibit 4.1.

[FIGURE 4.1. An IBM Punch Card.]

An analyst wanting to cross-tabulate responses to presidential policies by sex would pass the deck through a counter-sorter, which would physically divide the deck into two subdecks by reading the holes punched in a designated column, column 5 in this case. Cards with a "1" punch would fall into the 1 pocket in the sorter, and cards with a "2" punch would fall into the 2 pocket. Each of these subdecks would then be passed through the machine a second time, and the distribution of punches in column 64 would be counted and displayed for the analyst to copy by hand onto paper. These counts would generate the bivariate frequency distribution of judgments regarding presidential policy by sex, which would then be percentaged in the usual way (using a desk calculator).

This technology had several important consequences for data organization and data analysis. First, it discouraged the use of statistical methods other than cross-tabulations because all it could do was generate the counts needed as input to statistical procedures; the algebraic manipulation still had to be carried out by hand. Second, it discouraged the retention of detailed information; there was, indeed, a great premium on squeezing the response categories into a single column if at all possible because a two-column variable was tedious to manipulate (it required much more card handling because the variable had to be sorted on the first digit, and then each of the resulting categories had to be sorted on the second digit) and produced more detail than could be used effectively in a cross-tabulation. This resulted in the use of what are known as zone punches, the locations on an IBM card above the numerical columns, which also were used for "+" and "-" (sometimes called "x" and "y" punches), and also the use of blanks (no punch) as a meaningful category. Thus, for example, it would be unlikely for age to be represented by single years in data sets designed for use with unit record technology; rather, a set of age categories would be predesignated. Third, in the interest of getting as many variables as possible on a single card (because it was impossible to include in a single tabulation variables located on different cards), some analysts resorted to putting more than one variable into a single column. Consider variables 2 and 257 in the preceding example (Exhibit 4.1). Because there are only two possible responses to the sex item and six to the presidential policy item, they could be included in a single column simply by using punches 4-9 for the presidential policy response categories. A device on the counter-sorter machine made it possible to suppress some punches and sort on others. Columns of this kind were known as multiple-punched columns.

All of these devices for packing as much data as possible into a single IBM card played havoc when the shift to data analysis by computer occurred. Because most computer programs were designed to recode data from one set of symbols to another, the simple cases were those in which zone punches and blanks were used as meaningful categories. A much more difficult problem arose when columns were multiple-punched. Such cases usually required extensive specialized computer programming to convert them into computer-readable form.

Even after computers became widely available for social research, data sets were often initially prepared in machine-readable form on IBM cards using a keypunch machine and then read into computers and transferred to storage media such as computer tape; only relatively recently have keypunch machines been replaced by work stations that permit keying data directly into a computer file. Hence, many existing data sets, including the NORC GSS well into the 1990s, are organized as card-image records. That is, they are represented in computer storage media as a series of eighty-column records for each respondent. Typically, the first three or four columns contain the respondent identification number and column 80 contains the record number. This organization of data has no consequences for analysis, but it does affect the way the computer is instructed to read the data. The specific details vary depending on the program you use, but you should be aware of this alternative mode of data organization, in addition to the specification of one long record per respondent with which we began this discussion.
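Reading card-image data amounts to regrouping the eighty-column records by respondent and then addressing each variable by record number and column. The Python sketch below assumes a hypothetical layout (identification number in columns 1-4, record number in column 80, sex in column 5 and presidential policy in column 64 of the first card); the helper names and the card contents are invented for illustration.

```python
import math
from collections import defaultdict

CARD_WIDTH = 80


def cards_needed(n_columns, card_width=CARD_WIDTH):
    """Number of cards per respondent needed to hold n_columns of data."""
    return math.ceil(n_columns / card_width)


def make_card(respondent_id, record_number, punches):
    """Build a hypothetical card image; punches maps column number to character."""
    columns = [" "] * CARD_WIDTH
    columns[0:4] = respondent_id          # columns 1-4: identification number
    columns[79] = str(record_number)      # column 80: record number
    for column, char in punches.items():
        columns[column - 1] = char
    return "".join(columns)


def group_cards(card_images):
    """Collect each respondent's cards, ordered by the record number in column 80."""
    decks = defaultdict(dict)
    for card in card_images:
        decks[card[0:4]][int(card[79])] = card
    return {rid: [cards[k] for k in sorted(cards)] for rid, cards in decks.items()}


def read_column(deck, record_number, column):
    """Read one column; assumes record numbers run 1, 2, ... with no gaps."""
    return deck[record_number - 1][column - 1]


# The arithmetic from the text: 422 columns need 6 cards per respondent.
print(cards_needed(422), cards_needed(422) * 1609)  # 6 9654

# Invented two-card decks for two respondents, deliberately out of order.
cards = [
    make_card("0002", 2, {10: "3"}),
    make_card("0001", 1, {5: "1", 64: "4"}),
    make_card("0002", 1, {5: "2", 64: "1"}),
    make_card("0001", 2, {10: "7"}),
]
decks = group_cards(cards)
print(read_column(decks["0001"], 1, 5), read_column(decks["0001"], 1, 64))  # 1 4
```

Once the cards are regrouped in this way, the analysis itself proceeds exactly as it would with one long record per respondent.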
THE WAY THINGS WERE

Before electronic data entry terminals became available, command files also were [...] wrote out his command file [...] a separate IBM card (by a keypunch operator or, in the case of underfunded graduate students, by the analyst). The resulting "deck" of IBM cards was then transported to the university central computing center and either submitted to a clerk or fed directly into a card reader. Eventually the command file was executed ("the job ran"), often after a delay of several hours, and the printed output plus the box of cards were returned to the analyst. If there were errors, the entire process was repeated. This technology limited the number of computer runs to two or three per day, which made the completion of any particular analysis a very time-consuming proposition by current standards, but did at least have the salutary feature of allowing more time to think while waiting for the job to run.
TRANSFORMING DATA

As noted several times in previous chapters, data are not always initially represented in a form that is suitable to our resea[rch ...] [...] data transformations, and each o[f ...] procedures for accomplish[ing ...] Facility at transforming variables to a form that expresses theoretical concepts is an important skill of the quantitative data [...]
Rxoding Recodingis the term usedfor changing the values of a variable to a different set of val_ rs. Recodinghasmanyuses,someof which we havealreadyseen. one is to collapsecategoriesof a variableinto a smauernumberof catesories.for :rample, whenI createdthe leftmostcolumnof rhble 2.3 from Table1.1.To see how this rccedure works, let us considerthis examplein detail. I startedwith a reliqiositv scale of the following categones: omposed l. 2. L +.
Very religious Somewhatreligious Nol veryreligious Not at all religious
(For the moment,ignore the possibility of missing data.)To combine the last two cate_ gies. I simply changed,or recoded,category4 to category3, which yields a new variable: l. Very religious ?. Somewhatreligious 3. Not religious \lthough somecomputerprogramspermit a variable to be .,written over,,_that is, to 5e replacedwith a new variablethis is very poor practice. Rather, you should createa ..s containingthe transformedvalues.The reasonfor this shouldbe obvious: 'ariable :[dr to protect againsterror and to permit you to transform a variable more than once in te samecomputer run, you should preservethe original coding of a variable as well as .u' recodedor otherwisetransformedversionsofthe variable.Typically,stadsticarpackrse !omputerprograms operate linebyline; each line of code operareson the data in rbareverform they appearafterthe previousoperation.Hence,it ii all too easyto trans_ trm a variable and then inadvertently transform it again, unlessa new variable is created n rbe courseof the transfomation ,\ second use of recoding is to redefine a variable by creating a new set of ::riesoriesrepresentinga new dimension.you have seenan example of this also, in :sr discussionof property spacein ChapterThree. Recall our classification of U.S. :.]ngressmen lnto
1. Standard Republicans
2. Gypsy moths
3. Boll weevils
4. Standard Democrats
To create a classification according to party membership, we can recode 2 to 1 and 3 to 4, yielding a new variable with values of 1 (= Republican) and 4 (= Democrat). To create a classification of congressmen as liberal or conservative, we can recode 2 to 4 and 3 to 1, again yielding a new variable with values of 1 (= conservative) and 4 (= liberal).
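The two recodes just described might be sketched in Stata as follows. This is only an illustration under stated assumptions: the four-case data set and the variable names faction, party, and ideology are hypothetical, invented for the example. The generate() option of -recode- writes each result to a new variable, so the original coding is preserved.

```stata
* Hypothetical data: one case per category of the four-way classification.
clear
input id faction
1 1
2 2
3 3
4 4
end

* Party: gypsy moths (2) join Republicans (1); boll weevils (3) join Democrats (4).
recode faction (2 = 1) (3 = 4), generate(party)

* Ideology: gypsy moths (2) become liberal (4); boll weevils (3) become conservative (1).
recode faction (2 = 4) (3 = 1), generate(ideology)

* The original variable is untouched, so each new variable can be checked against it.
tabulate faction party
tabulate faction ideology
assert party == 1 if inlist(faction, 1, 2)
assert party == 4 if inlist(faction, 3, 4)
assert ideology == 1 if inlist(faction, 1, 3)
assert ideology == 4 if inlist(faction, 2, 4)
```

Because both recodes start from the unaltered faction variable, the order in which they are run does not matter, which would not be true had the first recode overwritten the original.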
Note, however, that when variables are recoded to dichotomies, the convention is to code one category 1 and the other 0 and to name the variable for the category coded 1; for example, a variable named "Republican," in which 1 and 2 in the original variable are coded 1, and 3 and 4 in the original variable are coded 0. As we will see in later chapters, the 0-1 coding convention facilitates the use of dichotomous variables in both OLS and logistic regression.
A third use of the recode operation is to assign scale scores to the categories of a variable. For example, we might have a variable measuring educational attainment, which is initially coded as follows:
 1. No schooling
 2. 1-4 years of elementary school
 3. 5-7 years of elementary school
 4. 8 years of elementary school
 5. 1-3 years of secondary school
 6. 4 years of secondary school
 7. 1-3 years of college
 8. 4 years of college
 9. 5 or more years of college
10. No information
For many purposes, it is useful to treat years of school completed as a ratio variable. By doing so, it is possible to compute the mean number of years of school completed by various subgroups of the population, to use years of school completed in regression equations, and so on. To do this, we might recode the original variable by assigning the midpoint or another estimate of the years of school completed by individuals in each category:

Original Code    Recode
 1                0
 2                2.5
 3                6
 4                8
 5               10
 6               12
 7               14
 8               16
 9               18
10               -1
In making recodes of this sort, it is important to justify your choice of values rather than assigning them arbitrarily. For example, the decision to assign "18 years" to the category "5 or more years of college" rather than 17 years or 19 years must be justified, not simply asserted.
Note the special treatment of category 10, "no information." In carrying out the analysis, we want either to exclude this category or otherwise give it special treatment. We thus give it a special code, which we can either define as missing data (see the discussion later in the chapter) or otherwise modify. It is convenient to use negative numbers to flag categories that we are going to treat as missing data because doing so minimizes the likelihood of inadvertently treating them as substantively meaningful. (A useful alternative, available in Stata, is to use the code "." to specify missing values when we have no need to distinguish between different types of missing value, and to use the codes ".a", ".b", . . . ".z" when we want to distinguish different types of nonresponse; again, see the discussion of missing data later in the chapter.) For example, suppose we recoded the "no information" category as 99. If we subsequently decided to analyze those with at least some college education, we might instruct the computer to select all cases with years of school completed greater than or equal to 14, forgetting that category 99 means no information. This, of course, would result in the inclusion in the highest education category of cases for whom education is not known along with the college educated.
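The recode to years of schooling, and the hazard of a large positive "no information" code, can be sketched in Stata. The five cases and the variable names educcat, educyrs, and educ99 are hypothetical and serve only to make the example runnable; the recode values are those in the table above.

```stata
* Hypothetical cases covering categories 1, 5, 7, 9, and 10 ("no information").
clear
input id educcat
1 1
2 5
3 7
4 9
5 10
end

* Assign the scale scores from the table; "no information" gets -1.
recode educcat (1 = 0) (2 = 2.5) (3 = 6) (4 = 8) (5 = 10) (6 = 12) ///
    (7 = 14) (8 = 16) (9 = 18) (10 = -1), generate(educyrs)

* Selecting "at least some college" is safe with a negative flag ...
count if educyrs >= 14
assert r(N) == 2

* ... but not if "no information" had been coded 99 instead.
generate educ99 = educyrs
replace educ99 = 99 if educyrs == -1
count if educ99 >= 14
assert r(N) == 3
```

The second count includes the case with unknown education, which is exactly the error the negative code is meant to guard against.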
TREATING MISSING VALUES AS IF THEY WERE NOT
A published analysis by Jasso found that the frequency of intercourse per month increases with a wife's age, contrary to all expectations! Alas, as discovered by Kahn and Udry (1986, 736), she failed to notice four outliers, cases erroneously coded 88 rather than 99, the specified missing data code. When these four cases are omitted, the positive effect of wife's age disappears. Kahn and Udry also omitted four other outliers, prompting a lively response from Jasso (1986) about what should be regarded as an outlier. We will return to a discussion of outliers when we consider regression diagnostics in Chapter Ten.
A final use of the recode operation is to convert data from old surveys that use zone punches and blanks into a form that permits numerical manipulation. This typically can be done by reading the data in an alphanumeric format and converting them to a floating-point decimal format.
Arithmetic Transformations
Sometimes we want to transform variables by performing arithmetic operations on them. Such transformations will be particularly important when we get to regression analysis because it is sometimes possible to represent nonlinear relationships by linear equations involving nonlinear variables. For example, it is well known that the relationship between income and age is curvilinear: income increases up to a certain age and then declines. This relationship can be represented by constructing a regression equation of the following form:

$Y = a + b(A) + c(A^2)$    (4.1)

that is, income (= Y) is taken to be a linear function of age and the square of age. To estimate this equation, we need to create a new variable, the square of age. So we simply define a new variable,

AGESQ = AGE*AGE    (4.2)

and then regress Y on AGE and AGESQ. Statistical packages provide extensive transformation capabilities, using arithmetic operators or any of a number of specialized functions, such as the square root.
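As a rough Stata sketch of Equations 4.1 and 4.2: the five data points below are made up purely so that the commands run, and the variable names age, income, and agesq are assumptions for the illustration.

```stata
* Made-up data in which income rises with age and then declines.
clear
input age income
25 20000
35 32000
45 38000
55 36000
65 28000
end

* Equation 4.2: create the square of age as a new variable.
generate agesq = age*age
assert agesq == age^2

* Equation 4.1: regress income on age and the square of age.
regress income age agesq
```

With a curve that rises and then falls, one would expect a positive coefficient on age and a negative coefficient on agesq, but the estimates from five invented cases carry no substantive meaning.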
Contingency Transformations
A final way to transform variables is to use "if" specifications in your commands. "If" specifications are an alternative to recode commands and are more flexible in some respects because they make it possible to specify contingent relationships involving more than one variable. For example, if we wanted to distinguish those who were upwardly mobile, that is, who had jobs of higher prestige than those of their fathers, we could do so by specifying the following: if PRESTIGE is greater than PRESTIGE OF FATHER, construct a new variable, MOBILITY, and give it the value 1; otherwise, give it the value 0. Although the syntax of the computer command required to do this will vary depending on the program used, the logic is, as usual, straightforward: a new variable is created, scored 1 for those individuals who are upwardly mobile and scored 0 otherwise (where "upward mobility" is defined as having an occupation of higher prestige than the occupation of one's father).
Another kind of contingency transformation is to create a variable consisting of a count of the number of responses to a specified set of other variables that meet specific criteria. For example, we might create a scale of acceptance of abortion by counting the number of "pro-choice" ("accepting") responses to a set of questions about the circumstances under which abortion should be permitted.
Contingency statements are used not only to transform variables but also to select subsamples for analysis. For example, if we were interested in studying completed fertility, we might want to restrict the analysis to the subsample of women whose childbearing is complete. In some packages this is accomplished by selecting the subsample and then doing all the subsequent operations only on the subsample. Other packages, such as Stata, specify the subsample as part of each command, although the former approach is possible in Stata as well.
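The three uses of contingency statements described in this section might look like this in Stata. The data and the variable names (prestige, paprestige, and the three abortion items abany, abrape, abpoor, each coded 1 for an "accepting" response) are hypothetical. Note that Stata treats a missing value as larger than any number, so the comparison is restricted to cases with both prestige scores present.

```stata
* Hypothetical data; case 4 lacks father's prestige.
clear
input id prestige paprestige abany abrape abpoor
1 60 40 1 1 1
2 35 50 0 1 0
3 45 45 0 0 0
4 50 .  1 1 0
end

* Score 1 if upwardly mobile, 0 otherwise; leave missing if either score is missing.
generate mobility = prestige > paprestige if !missing(prestige, paprestige)

* Count the number of "accepting" responses across the three items.
generate accept = (abany == 1) + (abrape == 1) + (abpoor == 1)

* Select a subsample as part of a command with an -if- qualifier.
summarize prestige if mobility == 1

* Verify the logic case by case.
assert mobility == 1 if id == 1
assert mobility == 0 if inlist(id, 2, 3)
assert missing(mobility) if id == 4
assert accept == 3 if id == 1
assert accept == 2 if id == 4
```

Without the -if- qualifier on the first -generate-, case 4 would be scored 0 rather than missing, silently treating an unknown comparison as "not mobile."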
Missing Data
Often, substantive information on certain variables is missing from a data set. The sources of missing data are nearly endless. In data sets derived from interviews with a sample of people, the information may never have been elicited from the respondent, either in error or as a matter of design (some questions are "not applicable," for example, spouse's education for the never married; and sometimes questions are asked of random subsets of respondents to increase the length of the questionnaire without increasing the respondent burden; the GSS often does this). The respondent may have refused to answer certain questions, may have responded to some questions by claiming not to know the answer or not to have an opinion, or may have given logically inconsistent answers (for example, responding "never married" to a question on marital status but providing an answer to a question on "age at first marriage"). Interviewers may have failed to record responses or may have recorded them incorrectly. Errors may have been introduced in the process of preparing data for analysis, as when narrative responses are incorrectly assigned to code categories by coding clerks or when correctly assigned codes are incorrectly keyed in the course of data entry. Similar problems plague other sorts of data sets. Bureaucratic records are often incomplete and frequently contain inconsistent information.
PEOPLE GENERALLY LIKETO RESPOND TO (WELLE[
R":::"f ),:?.,1T"?",Y:',,:;fP#:)::"lT"tP
written. By and large, people are flattered that they are asked their opinions and asked to talk about themselves. There is a famous story from the lore of survey analysis about the Indianapolis Fertility Survey, one of the earliest surveys that asked explicitly about sexual behavior. One of the analysts went out with considerable trepidation to conduct a pretest of the questionnaire, not knowing how women would respond to "intimate" questions. As it happened, the interview went off without a hitch until the very end, when the interviewer got to the routine demographic questions and asked the respondent her age, at which the lady drew herself up indignantly and said, "Now you're getting personal!" The exception to the general willingness to respond is with respect to information that people fear might put them in jeopardy, such as income, which they suspect might find its way to the tax authorities.
In high-quality surveys, great pains are taken to minimize the extent of error. In the course of readying data sets for analysis, they are cleaned, that is, edited to identify and if possible correct illegal codes (codes not corresponding to valid response categories) and logically impossible combinations of codes. For example, when a respondent who claims never to have been married gives his age at first marriage, sometimes it is possible to decide which is the correct and which the incorrect response by inspecting other responses given by the same individual. When this is not possible, the respondent might be contacted and asked to resolve the inconsistency.
t,#i;T'r,$::
the.editing. process aswelascorrected. For
\x!ni:':!::!in
:"1*:iff",1,J:,:",?ffj'#. lii:'::twl'i;: ffi#:HTi;'*?"#f
nousewrves, takenonastemporary employ"".oytrr"C"nrur*nu#u,.."o,'""t"d,, returnsin whicha woman's maritarstaturi"". r"i r"L1*"i"Joi',""",*"n ""n.u, *.."."0oa"0 responseto''married."
rhelikelihooa, or"ou,.,", t auiin
i;rtTfl::r*i;ffilitatus
":J ffiff:ffiXi; ltiiii;itJffitril3:h""1 lT.":'3;*{Ti*"fid,Tttr
ladies_isnot supposed to occur,butit does. io* o,"onrtn . ln thecourseof,h" codes are assigned to.thevariou. "di ti., ornon.ou.ttiu;;;p.^:r:i:::1t:,explicit "u,"go
jf Lff : *Jfr *':::"Hl*#H:"jJ:#l;xf;p il"i:'t";ffi il#,i"f ll;: ;,::S;#:il,inT:::j.:,;ff
ffi ilF:';'"''otri".io*,Jiu;;;",ffi todisri n;;;;;;#J#iJ:l:liffi g::tgie '"'i"il:ffi ffi :n',;:"'f,"n#l
.,rH.: .:ilfr ::li ffi ff :q;:*r *th:in#ft .ifi i hFT #f t fi r't""i*,np.,r,r".Xo"i"g*ru" ::::TflT,i.TfiX1;lioiLiiilof,r responses. rr
l..Tf ":""h"o,"_"*",".#lii^*;iilT:""##ilff l1,,ffiffi ,ffi.,:,$:
caregories ro Represent N;;
c"r*i oetartm thecodingof nonsubstanhve responses.
in the sec_
"nrprer r;il ii rJ'r",lo..ruo, to preserve
Analyzing Surveys with Missing Data
Presuming that the data are coded in such a way as to preserve
all relevant distinctions, 1 ,. o"n'" r'o'tot .ui ffiffiJ:H.':"#Tjf*TlT,l'"t ordecisions *:' "n""il",^' whichresponses o beregaroel' stroulo assubstanrively
interestinq
y,1,s^Y.YTli":
an6 "4""',. what todo"aioui t"*r''.",i"'#T.Tr[T:"r}iii"'''T[]fl::T?,T#on?
example here. Anorher case thatarisesr..qr*ttyin tauui;;;"rrl
,Tfi:,T::"J## t*"'"; qi'*i'''i".llJiu,r ..,".ore ;'l":ffi"'ffii,:in (mean;T:il+i':, " .rr,i,i,ooiu'go;il'];;Hili,ili.T::f ,:fnT:riii:Tffi[::Hil,:J; I nusrr youarestudying theadulrpopulatio" i]j,"J,li",.rr. your
reler.totheen'ireadultpopulation, tabtesshould notjust to "r*. whi*r ,"Jnrr"t.. The sorutionin casels simplyto create this a residual .other...and,o in"foO",t In thetablebulnor bother[o discussit. tt is inctuded_ 91tegory. foi,r,. ,*l ;;;""J:_rch. amongolher "f otrrei ror
il1l:."il":fijii"'tra;:,:i;;J:,,,: T."ure trr"rp""in" n,fri"1r;;ff;:#j*
,t" A more
ffiiis_burisnot discussed
of risidualcatesories generally "terogeneitv
difficult problem aris more variabres'' *; ffi;;;;nf, :1i1,.:$*::lx'm:jHfrta,."j"ffi #
or their income. Again, one alternative is to include a "no answer" category in each row and column of the table. If there are many missing cases, this is wise. If there are only a few missing cases, the increased size of the table probably is not warranted. In this case it is sufficient simply to report how many cases are missing, in a footnote to the table.
When our variables are continuous, we must either exclude missing values from the analysis or in some way impute the values. Chapter Eight is entirely devoted to the treatment of missing data.
Most statistical package programs allow the analyst to specify which codes are to be treated as missing values (and indeed require it in the sense that any codes not specified as missing values are included in the computation whether you intend it or not). Typically, statistical package programs are not completely consistent across procedures (commands) in the way they handle missing data, so it is very important to understand exactly what each procedure does and to design your analysis accordingly. In designing any analysis, you must know how the procedure will treat each logically possible code in your data, including in particular those codes you designate as missing values; otherwise you inevitably will get into trouble.
In the example on education discussed earlier, "no information" was assigned a code of -1. When computing a mean, we ordinarily would declare -1 to be a missing value for education. In SPSS syntax, missing values are explicitly declared: "missing values educ (-1)"; in Stata, as noted previously, missing values may be excluded automatically by assigning one of several "missing value" codes, or may be explicitly excluded from a procedure by limiting the sample with an if qualifier: . . . if educ~=-1 (that is, if EDUC is not equal to -1). These statements tell the computer to omit all individuals for whom education is coded -1 (or assigned the missing value code) from the computation of the mean. Neglecting to so inform the computer results in an incorrect mean because any individuals who are coded in the data as having -1 years of schooling are included in the computation. Errors of this sort are very common, which is why it is imperative to check and recheck the logic of your commands. A useful check is to work through the logic of your computer commands line by line for specified values of your original variables to see how the computer transforms them at each step in the process. You will make some surprising discoveries.
One of the things that typically happens to novice data analysts is that they do some computation and discover that their computer printout shows no cases or a very small number of cases. Usually this turns out to be the result of a logical error in the specification of data transformations. For example, consider an income variable originally coded in a set of categories representing ranges of income, for example, 1 = under $3,000 per year, 2 = $3,000 to $4,999, and so on, but where 97, 98, and 99 are used to specify various kinds of nonresponses. If the analyst recodes the income categories to the midpoints of their ranges, for example, recodes 1 to 1,500, 2 to 4,000, and so on, but then forgets this and specifies as missing values all codes greater than or equal to 97, all the cases will be excluded because all cases for which income was reported have been recoded to values in the thousands of dollars, that is, greater than 97. If you do not think this will happen to you, wait until you try it! It happens to all of us. The trick is to catch logically similar, but more subtle, errors before you construct entire theoretical edifices upon them.
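Both cautions in this section, the -1 education code and the income midpoint error, can be traced with a small Stata sketch. The four cases and the variable names educ, inccat, and income are hypothetical. The sketch uses -mvdecode-, which converts designated numeric codes to Stata missing values; this is one possible approach, not necessarily the one used elsewhere in this book.

```stata
* Hypothetical data: educ uses -1 for "no information"; inccat uses 97-99 for nonresponse.
clear
input id educ inccat
1 12 1
2 16 2
3 -1 97
4 8  99
end

* Option 1: exclude the -1 code with an -if- qualifier on each command.
summarize educ if educ != -1
assert r(N) == 3

* Option 2: convert -1 once to the missing-value code .a.
mvdecode educ, mv(-1 = .a)
summarize educ
assert r(N) == 3

* Missing values are larger than any number, so selections on a range must still exclude them.
count if educ >= 14
assert r(N) == 2
count if educ >= 14 & !missing(educ)
assert r(N) == 1

* The income error: after recoding to midpoints, every reported income exceeds 97,
* so selecting "income below 97" leaves no cases at all.
recode inccat (1 = 1500) (2 = 4000), generate(income)
count if income < 97
assert r(N) == 0

* Treat the nonresponse codes by their exact values instead.
mvdecode income, mv(97 98 99)
count if !missing(income)
assert r(N) == 2
```

Working through the four cases by hand, line by line, is the pencil-and-paper check recommended above; the -assert- commands simply make the same check part of the file.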
WHAT THIS CHAPTER HAS SHOWN
fi::ffi:iff ffiff"#HT.: :i,ffilx"::.;:'*;:T ili:i"i'fifJtlTl?'tff
data manipulation and the treatment of missing data. The chapter thus serves as a foundation that should make it easy to learn any statistical package program: Stata, which we will use for the remainder of the book, SPSS, SAS, or any other package. In the next chapter we turn to the general linear model, with a gentle introduction via a discussion of bivariate correlation and regression.
APPENDIX 4.A DOING ANALYSIS USING STATA
TIPS ON DOING ANALYSIS USING STATA
This appendix offers some simple tips that will greatly enhance the ease and efficiency with which you use Stata for analysis. In addition, the appendix lists some particularly useful commands that are easily overlooked.
Do Everything with -do- Files
You should from the outset develop the habit of carrying out all your analysis by creating command files, known in Stata parlance as -do- files. Doing this has two major advantages: it makes it easy to repeat your analysis until you get it right, and it makes it easy to document your work. Keeping a log of your analysis is not an adequate substitute (although, of course, you must create a -log- file to save your output) because a log faithfully records all of your errors and false steps, making it difficult to follow the direct path to successful execution and tedious to repeat your analysis. Here is an example of part of one of my -do- files, shown to suggest a standard format you might want to adopt. I use this set of commands at the beginning of each -do- file I create. The commands in the file are shown in Courier New type, and my comments in square brackets are shown in Times New Roman type (the standard font for the text).

capture log close
[This command closes any -log- file (see the next command) it finds open. The -capture- prefix to a command is very useful because it instructs Stata not to stop if an "error" is encountered, which would be the case if it could not find a -log- file to close.]

log using class.log, replace
[This command tells Stata to keep a file of commands and the results of the commands, called a "-log- file," and to replace any previous versions of the -log- file. The -replace- part of the command is crucial because otherwise when you execute the -do- file, fix an error, and try to execute it again, Stata will complain that a previous version of the -log- file exists.]

#delimit;
[This command tells Stata to end all subsequent commands whenever a ";" is encountered. I find this the most convenient way to handle long lines. The default in Stata is to regard a carriage return (the computer command that ends a line) as the end of the command, which means that, unless the carriage return is "commented out" (see below), commands are restricted to one line. Of course, the line may be very long, extending well beyond the width of a page, but this makes your file difficult to read.]

version 10.0;
[This command tells Stata for which version of Stata the file was created. Stata always permits old -do- files to run on more recent versions of the software, if the version is specified.]

set more 1;
[This command tells Stata not to stop at the end of every page of the output. When executing a -do- file, you want the program to run completely without stopping. The way to inspect the output is to read the -log- file.]

clear;
[This command clears any data left over from a previous attempt to execute the program or any other Stata command. Stata is good about warning you against inadvertently destroying data you have created. But the fact that Stata warns you means that you need a way to override the warning, which is what this command does.]

program drop _all;
[This command drops any existing programs that you might have created in a previous execution of the -do- file. Failing to do this causes Stata to stop if you have included any programs in your -do- file.]

set mem 100m;
[This command tells Stata to reserve 100 MB of memory. Space permitting, Stata reads all data into memory and does its analysis on these data, which is why it is so fast. If you specify too little memory, Stata will complain that it has no room to add variables or cases.]

*CLASS.DO (DJT initiated 5/19/99, last revised 2/4/08);
[I always name my -do- file, and because I work often with others, indicate the author, the initial date of creation, and the last date of revision. This is very useful in identifying different versions of the same -do- file, which might exist because my coauthors and I have both revised the same file, or because I have made a revision on my office computer and have forgotten to update the version on my home computer, and so on. Note that comments are distinguished from commands by an asterisk in the first column.]

*This -do- file creates computations for a paper on literacy in China.;
[I always include a description of the analysis the -do- file is carrying out. Because I often work on a paper over an extended period, this description is extremely helpful in jogging my memory and helping me to locate the correct file.]

use d:\china\survey\data\china07.dta;
[This command loads the data into memory. The remainder of the -do- file then consists of commands that perform various operations on the data and produce various computations.]

log close;
[This command closes the -log- file so that it can be opened by my editor or word processor.]
The basic
procedure for c (1)openanew'nreiil;;ilTljil:.ffi ;iil:.*,ffij*:;,*::;.:li!1.,; editor),remembering rhar_do
ntes_must inctuoJtil (2) inserta front end.of thesortjust outlined i"fr^f "rr"rJl:,0o,, uflruy,lopy ,r"rn 1*t i', nl""l"* Iile to my current nle to minimizetypjng):(3) create a set of conmanas,to ou, the first task; "_ry
jTJTti,:ffihr; li"T,l'1'il:.i""T":'J.'l?"j'#::,j; ;::#:,rjr# there surely will be an eror most of.th" ti_"1; fSl 6gg;".t
t" thi eOitor;
correct
the 16.y ffiE::"Hili:?lilljff;3'j?":ffi:*iflJ:"Jd':";;i;""'"u"n,ein
T::1.l,ry"+';'H:::,;1",'"1i.'#:J::l?',:1
of howyougotrheresults showninrfr"_i""_ ni" _d (2)canbe ll:::,i"*.d .H"n. rerunat maywantto doif, as j1i_ tou rr"pp"i.=, Tl ,oT#r.r"", _ error in rhe "ri.,
iiftl{ii.ililfinT::ti::"J,"##!,'**i::r,i::* ffT."JJii"
you submit a paper for publication_ and get an invitation to ,t;;ie
and resubmit,,the
tion, :r'."il"".p;;";i;;;ri"oli""o.'n""..u.f ff:Tiff,:TJ,"*:HX:""::ff do file'rh" uuuituuinf
ota do firewiualsogreatl,.rJ.ffJi:;::"""Tt"*vour
Do Extensive Checks of Your Work
It is extremely easy to make errors, of both a logical and a clerical kind, when doing computer-based data analysis. The only way to protect yourself from happily making up stories about results produced in error is to compulsively check your work. You can do this in two ways. First, check the logic of each set of data transformation commands by working through as a pencil-and-paper operation how each value of a variable being transformed is affected by each command. Second, tabulate or summarize each new variable and actually look at the output. You will be surprised how many errors you discover by taking these two simple steps!
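The second check, tabulating each new variable against its source, might look like the following in Stata. The example reuses the religiosity recode from earlier in the chapter; the four-case data set and the variable names relig and relig3 are hypothetical.

```stata
* Hypothetical data: one case per category of the original religiosity scale.
clear
input relig
1
2
3
4
end

* Collapse "not very" (3) and "not at all" (4) religious into one category.
recode relig (4 = 3), generate(relig3)

* Look at the result: every original value should land where intended.
tabulate relig relig3, missing
summarize relig3

* Optionally make the check mechanical, so the -do- file stops if the recode is wrong.
assert relig3 == 3 if inlist(relig, 3, 4)
assert relig3 == relig if relig < 3
```

The cross-tabulation does the work of the pencil-and-paper check for every value at once, and the missing option ensures that missing values, if any, appear in the table rather than dropping out silently.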
Document Your -do- File Exhaustively
You should make extensive notes in your -do- file about the purpose of each set of commands and the underlying logic, especially in the case of data transformations. Including comments summarizing the outcome of each set of commands makes it clear why I carry out the next step of the analysis. The -do- file then becomes a document summarizing my entire analysis. I cannot emphasize strongly enough the importance of adequate documentation. It is typical in our field to work on several problems at once and to return to a problem after months or even years. In addition, the editorial review process often takes a very long time. If you have not documented your work, you may have a great deal of trouble remembering why you have done what you have done. This is inefficient and can be highly embarrassing, as when a journal editor asks you to do some additional analysis, and you have no idea why you made particular computations, much less what the chain of reasoning was, and cannot reproduce the previous results. This happens much more often than most of us want to admit.
Include "Side" Computations in Your -do- File
This is a corollary to the point about exhaustive documentation made in the previous section. We often do "side" computations in the course of writing papers to make points or as illustrations to the text, for example, computing the ratio of two coefficients in a table we have made or computing a correlation coefficient between two variables listed in some other publication. The way to make your -do- file a comprehensive document of all your computations is to use Stata, rather than a hand calculator or spreadsheet, to do the work or, at minimum, to include both the data and the results as comments in your -do- file. More than once, I have produced a paper with a well-documented -do- file but have failed to include side computations in the -do- file, and then have discovered months later that I had no idea how the side coefficients reported in the paper were derived.

Rerun Your -do- File as a Final Check
At the point you have completed a paper and are about to submit it for a course, for posting in an online paper series, or for publication, you should make a point of executing your -do- file in a single step and then checking every coefficient in the paper against the corresponding coefficients in your resulting -log- file. You likely will be startled to discover how many discrepancies there are. Because -do- files often are developed over an extended period of time and often are executed in pieces, it is extremely easy for inconsistencies to creep in. If you have a -do- file that will run from beginning to end, without interruption, and will produce every result you report in the paper, you will have met the gold standard for documentation. You also will be a happy camper months or years later when you need to make a minor change that affects many results. You will discover that the change usually can be made in a matter of minutes, although updating your tables by hand is usually a much more tedious business.
Make Active Use of the Stata Manual
The only way to become facile at any statistical program, including Stata, is to make a point of continuously improving your skills. Each time you are unsure how to carry out a task, look for a solution in the manual. You will find the improvement in facility very rewarding. After you become reasonably facile at Stata, you should then take advantage of Stata's -net- commands, which link you to the Stata user community and the most up-to-date applications. Of course, to use the -net- commands you must be connected to the Internet.
SOME PARTICULARLY USEFUL STATA 10.0 COMMANDS
Here is a list of key data manipulation and utility commands. It is to your advantage to study the descriptions of these commands in the Stata manual, in addition to reading through the User's Guide. The time you spend gaining familiarity with these commands, and with the logic of Stata procedures, will be more than repaid by improvements in the efficiency of your work. I have included few of the commands for carrying out estimation procedures because they will be introduced in later chapters.
adjust       Gets adjusted values for means and proportions.
append       Combines two data sets with identical variables but different observations. (See also merge.)
by           Repeats a Stata command on subsets of data.
capture      Captures return code (that is, allows Stata to continue whether the condition is true or not).
cd           Changes directory.
codebook     Produces a codebook describing the data.
collapse     Produces aggregate statistics, such as means, for subsets of data. Particularly useful for making graphs. Similar to the "aggregate" command in SPSS.
compress     Compresses variables to make a data set smaller but without altering the logical character of any variable. Useful when your data will not fit into memory.
count        Gives the number of observations satisfying specified conditions. -count- without a suffix gives the number of observations in the data set.
#delimit     Changes the delimit character.
describe     Describes the contents of a data set.
dir          Displays file names in the current directory.
display      Substitutes for a hand calculator.
do           Executes commands from a -do- file.
drop         Drops variables or observations from the file.
edit         Allows you to edit your file cell by cell. Useful in inspecting the content of your file or correcting errors in the file.
egen         Extensions to the generate command.
encode       Permits recoding of string variables to numeric variables.
foreach      Repeats Stata command for a list of items (variables, values, or other entities). Similar to "do repeat" in SPSS but more powerful.
forvalues    Repeats Stata command for a set of consecutive values.
generate     Creates or changes the contents of a variable.
help         Obtains online help. (See also search and net search.)
infile       Reads data into Stata. (See also infix and insheet.)
input        Inputs data from the keyboard.
inspect      Useful summary of numerical variables, particularly when you are not familiar with a data set. Reports the number of negative, zero, and positive values; the number of integers and nonintegers; the number of unique values; and the number of missing values; and produces a small histogram.
keep         Keeps variables or observations in the file, that is, drops everything not specified. Useful when you want to create a new file containing a small subset of variables.
label        Creates or modifies value and variable labels.
list         Lists values of variables.
log          Creates a log of your session.
mark         Marks observations for inclusion (a way of maintaining consistency regarding missing values throughout an analysis).
merge        Merges two data sets with corresponding observations but different variables. (See also append.)
net          Installs and manages user-written additions from the net.
net search   Searches the Internet for installable commands.
numlabel     Combines numerical values with labels so that both are displayed. This often is very convenient.
notes        Puts notes into data set.
order        Reorders variables in a data set.
predict      Obtains predictions after any estimation command.
preserve     Preserves data. Use this before a command that will alter the data set, such as collapse. Then use restore to restore the preserved data set.
quietly      Performs Stata command without showing intermediate steps.
recode       Recodes variables.
rename       Renames variables.
replace
neplacesvaluesof a variablewjth
reshape restore #review
new variablesif specjfied conditionsare
.,'atand vice o.:lJ:I"';ffi:lffiJ:T.'o" versa lil' hevtewsprevious commands.Halr
save search s or t summari ze t abf e ltabulat e update us e
version xi
j;i:iil:::: :Hl:T :Htrt j# ff:::i,"::Y"xi:: ij._
Savesa data set. Searches Statadocumentation for
**"Pa* cu ar' m*l'",;:::;#ii:J"il;: 1;j;4m; {orlconrinuous
;;;ffi ;,;il:i;;Tri6 t'* produces one_ andr*"_*ri "fi
variabjes
1;];1';"""'* :;::j: :;;1i;[,T#it# $IT""j i:11,';,1'i,;,"J ;ffi:iffi;n*#:#r.:it "i"' :"J .#iFHl1:t*1 ". """*"xi:ii;n:;:iJI,:,"ii
5pecifies whkh versio",, ,,"u
"OOi,
ffi ;ij:"expans,on,;#;JTi,:"'rl:"J;:i:T;ffi:
ms are
CHAPTER
INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
WHAT THIS CHAPTER IS ABOUT
So far we have been dealing with procedures for analyzing categorical data. We now turn to a powerful body of techniques that can be applied when the dependent variable is an interval or ratio variable: ordinary least-squares regression and correlation analysis. In this chapter we deal with the two-variable case, where we have a dependent variable and a single independent variable, to illustrate the logic. In the following two chapters we deal with multiple regression, which is used when we want to explore the effects of several independent variables on a dependent variable, the typical case in social science research.
INTRODUCTION

Suppose we have a set of data arrayed like this:

Father's Years of Schooling    Respondent's Years of Schooling
             2                                4
            12                               10
             4                                8
            13                               13
             6                                9
             6                                4
             8                               13
             4                                6
             8                                6
            10                               11
What can we say about the relationship between father's education and respondent's education? Not much. Visual inspection of the listed values is not very informative. However, if we plot the two variables in two-dimensional space, the relationship is revealed. When you inspect the plot (Figure 5.1), it is immediately evident that the children of highly educated fathers tend to be highly educated themselves. In this situation, we say that the father's and the respondent's education are positively correlated.

[Figure 5.1: scatter plot of years of schooling by father's years of schooling; the caption is illegible in the source.]

Although we can see that the father's and respondent's education are positively related, we want to quantify the relationship. First, we want a way to describe the character of the relationship between the father's and respondent's years of schooling. How large a difference in the dependent variable, years of schooling, would we expect on average for a person whose father's schooling (the independent variable) differs by one unit (one year)? What level of schooling would we expect, or predict, on average for a person, given that we know how much schooling his or her father has? Second, we want a way to characterize the strength of the relation, or correlation, between the respondent's and father's years of schooling. Can we get a precise prediction of the respondent's level of education from the father's level of education or only an approximate one?
QUANTIFYING THE SIZE OF A RELATIONSHIP: REGRESSION ANALYSIS
The conventional and simplest way to describe the character of the relationship between two variables is to put a straight line through the points that "best" summarizes the average relationship between the two variables. Recall from school algebra that straight lines are represented by an equation of the form

$$Y = a + b(X) \tag{5.1}$$

where a is the intercept (the value of Y when the value of X is zero) and b is the slope (the change in Y for each unit change in X). Figure 5.2 shows the coefficients a and b for our example involving years of education ($E$) and father's years of education ($E_F$). The figure is a graphic representation of the equation:

$$\hat{E} = 3.38 + .687(E_F) \tag{5.2}$$

FIGURE 5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling. (Horizontal axis: father's years of schooling.)
Here $\hat{E}$ indicates the expected number of years of school completed by people with each level of father's years of schooling ($E_F$) on the assumption that the relationship is linear, that is, that each increase in the father's education produces a given increase in the respondent's education regardless of the initial level; 3.38 is the intercept, that is, the expected years of schooling for people whose fathers had no schooling at all; and .687 is the slope, that is, the expected increase in years of schooling for each one-year increase in the father's schooling. From this equation, we would predict that those whose fathers have 10 years of schooling would have 10.25 years of schooling because 3.38 + 10 × .687 = 10.25. Similarly, we would predict that the children of university graduates would have 2.75 more years of schooling, on average, than the children of high school graduates because .687 × (16 − 12) = 2.75. Calculating the value of the dependent variable in a regression equation for given values of the independent variable is known as evaluating the equation.
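The two predictions above can be checked with a few lines of code. This is only a minimal sketch in Python (the chapter's own computations are done in Stata); the function name `predict_schooling` is a hypothetical helper, and the coefficients are the rounded values reported in Equation 5.2.

```python
# Rounded coefficients from Equation 5.2.
INTERCEPT = 3.38   # expected schooling when the father has no schooling
SLOPE = 0.687      # expected gain per additional year of father's schooling


def predict_schooling(father_years):
    """Evaluate the regression equation for a given value of father's schooling."""
    return INTERCEPT + SLOPE * father_years


# Expected schooling for those whose fathers have 10 years of schooling.
pred_10 = predict_schooling(10)

# Expected difference between children of university graduates (16 years)
# and children of high school graduates (12 years).
gap_16_vs_12 = predict_schooling(16) - predict_schooling(12)

print(round(pred_10, 2), round(gap_16_vs_12, 2))
```

Note that the gap depends only on the slope, because the intercept cancels when two predictions are subtracted.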
FIGURE 5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined. (Horizontal axis: father's years of schooling.)
WHY USE THE "LEAST SQUARES" CRITERION TO DETERMINE THE BEST-FITTING LINE?

Note that "least squares" is not the only plausible criterion of "best fit." An intuitively more appealing criterion is to minimize the sum of the absolute deviations of observed values from expected values. Absolute values are mathematically intractable, however, whereas sums of squares have convenient algebraic properties, which is probably why the inventors of regression analysis hit upon the criterion of minimizing the sum of squared errors. The consequence is that observations with unusually large deviations from the typical pattern of association can strongly affect regression estimates; because the deviations are squared, such observations have the greatest weight. The presence of atypical observations, known in this context as high leverage points, can therefore produce quite misleading results. We will discuss this point further in the upcoming paragraphs and in Chapter Ten.

It can be shown, via algebra or calculus, that the following formulas for the slope and intercept satisfy the least-squares criterion:

$$b = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} \tag{5.3}$$

$$a = \bar{Y} - b(\bar{X}) = \frac{\sum Y}{N} - b\,\frac{\sum X}{N} \tag{5.4}$$

ASSESSING THE STRENGTH OF A RELATIONSHIP: CORRELATION ANALYSIS

Now that we have seen how regression lines are derived and how they are interpreted, we need to assess how good the prediction is. Our criterion for goodness of prediction or goodness of fit is the fraction or proportion of the variance in the dependent variable that can be attributed to variance in the independent variable. We define

$$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2 / N}{\sum (Y - \bar{Y})^2 / N} \tag{5.5}$$

That is, $r^2$, which is just the square of the Pearson correlation coefficient, is equal to 1 minus the ratio of the variance around the regression line to the variance around the mean of the dependent variable. (The Pearson correlation coefficient is, of course, the correlation coefficient you have encountered in introductory statistics courses. It has the
advantage of ranging from −1 to +1 depending on whether two variables move together or in opposite directions; $r^2$, by contrast, cannot be negative because it is a square.) When the regression line predicts no better than the mean of the dependent variable (that is, when knowing the value of the independent variable does not improve the prediction, so that the mean of the dependent variable is the least-squares prediction of each value), the ratio is 1 and $r^2 = 0$; this case is illustrated in panel (a) of Figure 5.4. When the value of the dependent variable can be predicted perfectly from the value of the independent variable, the ratio is 0 and $r^2 = 1$; this case is illustrated in panel (b). Note that the correlation coefficient measures only the linear relationship between two variables. In panel (c), for example, it is obvious that the two variables are perfectly related, but the relationship is curvilinear, and the least-squares line simply reproduces the mean of the dependent variable, so that the linear correlation is zero. The same correlation may be associated with very different configurations of data. Linear regression provides an adequate summary of a relationship only when a straight line represents the character of the relationship. When it fails to do so, additional variables need to be included in the model. You will see how to do this in the next chapter.

FIGURE 5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) Perfect Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation (a Parabola Symmetrical to the X-Axis). (The fitted line shown in panel (c) is $\hat{Y} = 5.76 + 0(X) = 5.76$.)

Returning to our example about intergenerational continuity in educational attainment, we note that $r^2 = .536$, which tells us that the variance around the regression line is about half the size of the variance around the mean of the dependent variable, and therefore that about half of the variance in educational attainment is explained by the corresponding variability in father's education. As social science results go, this is a very high correlation.
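A USEFUL COMPUTATIONAL FORMULA FOR r

Following is a useful computational formula for the correlation coefficient, r, which comes in handy when you have to do hand calculations:

$$r = \frac{\operatorname{cov}(X,Y)}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}} = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$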
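To make the least-squares and correlation formulas concrete, here is a minimal Python sketch that applies the standard computational formulas for the slope, intercept, and Pearson correlation to the ten observations listed at the start of the chapter, and then computes the proportion of variance explained as 1 minus the ratio of residual to total variation. The function name `ols_summary` is a hypothetical helper; the book's own calculations are done in Stata.

```python
from math import sqrt

# The ten observations from the chapter's running example.
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def ols_summary(x, y):
    """Return (a, b, r, r2) for the least-squares regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)

    # Slope and intercept from the computational formulas.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n

    # Pearson correlation from the computational formula.
    r = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )

    # Proportion of variance explained: 1 - (residual SS / total SS).
    mean_y = sum_y / n
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_resid / ss_total
    return a, b, r, r2


a, b, r, r2 = ols_summary(father, respondent)
print(round(a, 2), round(b, 3), round(r2, 3))
```

The squared computational-formula correlation and the variance-ratio definition agree, which is a useful check when doing such calculations by hand.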
THE RELATIONSHIP BETWEEN CORRELATION AND REGRESSION COEFFICIENTS
Suppose we were to standardize our variables before computing the regression of Y on X, by, for each variable, subtracting the mean from the value of each observation and dividing by the standard deviation. Doing this produces new variables with mean = 0 and standard deviation = 1. Then we would have a regression equation of the form

$$\hat{y} = \beta(x) \tag{5.6}$$
(The convention adopted here, which is widely but not universally used, is to represent standardized variables by lowercase Latin symbols and the coefficients of standardized variables by Greek rather than Latin symbols.) There is no intercept because the regression line must necessarily pass through the mean of each variable, which for standardized variables is the (0,0) point. We interpret β as indicating the number of standard deviations by which we would expect two observations to differ on Y that differ by one standard deviation on X. (This follows directly from the fact that for standardized variables, the standard deviation is one. Thus, one standard deviation on X is one unit on x; and the same for Y and y.) It can be shown, through a simple manipulation of the algebraic computational formulas for the coefficients, that in the two-variable case, r = β. It is also true that r is invariant under linear transformations. (A linear transformation is one in which a variable is multiplied [or divided] by a constant and/or a constant is added [or subtracted]. Consider two variables, Y and Y′, with Y′ = a + b(Y). In this case, $r_{XY} = r_{XY'}$.) Because standardization is itself a linear transformation, the correlation between a standardized variable and its unstandardized counterpart is necessarily perfect. A convenient pair of formulas for moving between b and β (which also holds for multiple regression coefficients) is
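$$\beta = b\left(\frac{s_X}{s_Y}\right); \qquad b = \beta\left(\frac{s_Y}{s_X}\right) \tag{5.7}$$

$$a = \bar{Y} - b(\bar{X}) \tag{5.8}$$

where $s_X$ and $s_Y$ are the standard deviations of X and Y, respectively.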
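The relationships just described can be verified numerically. The following Python sketch, using the chapter's ten observations, checks three things: that rescaling the slope by the ratio of standard deviations gives the correlation, that regressing standardized scores on standardized scores gives the same value, and that the correlation is unchanged by a linear transformation of one variable. The helper names and the particular transformation (32 + 1.8Y) are illustrative assumptions, not taken from the text.

```python
from math import sqrt

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def mean(v):
    return sum(v) / len(v)


def sd(v):
    """Standard deviation (dividing by N, as in the chapter's formulas)."""
    m = mean(v)
    return sqrt(sum((vi - m) ** 2 for vi in v) / len(v))


def slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / sqrt(
        sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)
    )


def standardize(v):
    m, s = mean(v), sd(v)
    return [(vi - m) / s for vi in v]


r = pearson_r(father, respondent)

# Metric slope rescaled by the ratio of standard deviations.
beta_from_b = slope(father, respondent) * sd(father) / sd(respondent)

# Slope estimated directly on standardized variables.
beta_direct = slope(standardize(father), standardize(respondent))

# An arbitrary linear transformation of the dependent variable.
transformed = [32 + 1.8 * y for y in respondent]
r_transformed = pearson_r(father, transformed)

print(round(r, 3), round(beta_from_b, 3), round(beta_direct, 3), round(r_transformed, 3))
```

All four printed values coincide, which is the two-variable result that the standardized coefficient equals r.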
FACTORS AFFECTING THE SIZE OF CORRELATION (AND REGRESSION) COEFFICIENTS

Now that we see how to interpret correlation and regression coefficients, we need to consider potential troubles: factors that affect the size of coefficients in ways that may lead to incorrect interpretation and false inferences by the unwary.
Outliers and Leverage Points

As noted, correlation and regression statistics are very sensitive to observations that deviate substantially from the typical pattern. This is a consequence of the least-squares criterion: because "errors" (differences between observed and predicted values on the dependent variable) are squared, the larger the error, the more it will contribute to the sum of squared errors relative to its absolute size. Thus, correlation coefficients can be substantially affected by a few deviant observations, with regression slopes pulled strongly toward them, producing misleading results. To see this, consider the following example, illustrated in Figure 5.5. Suppose that in our example about intergenerational educational transmission, the fourth case had values (13,0) (shown as a solid circle surrounded by an open circle) instead of (13,13) (shown as an open circle). That is, suppose that in the fourth case the child of a man with thirteen years of schooling had no education instead of thirteen years of schooling, perhaps because the child was mentally impaired. The alteration of just one point, from (13,13) to (13,0), dramatically changes the regression line and misrepresents the typical relationship between the father's and respondent's education, making it appear that there is no relationship at all (the regression equation for the ten points with (13,0) as the fourth value is $\hat{E} = 6.74 + .0491(E_F)$; $r^2 = .002$).

FIGURE 5.5. The Effect of a Single Deviant Case (High Leverage Point). (Legend: 10 points, with (13,13); 9 points, omitting (13,13); 10 points, with (13,0).)

This example illustrates the condition under which deviant cases are influential, that is, have high "leverage." This is when points are far away from the center of the multivariate distribution. Outliers close to the center of the distribution, for example, the (8,13) point in Figure 5.5, have less influence because, although they can pull the regression line up or down, they have relatively little effect on the slope. We will consider this distinction further in Chapter Ten.

The most straightforward solution is to omit the offending case. When this is done, the regression line through the remaining nine points is very close to the regression line through ten points with (13,13). However, this generally is an undesirable practice because it creates the temptation to start "cleaning up" the data by omitting whatever cases tend to fall far from the regression surface.
Two better strategies, which will be elaborated in Chapters Seven and Ten, are (1) to think carefully about whether the outliers might have been generated by a different process from the remainder of the data and, when you suspect that possibility, to explicitly model the process; or (2) to use a robust regression procedure that downweights large outliers. Fortunately, the damage done by outliers diminishes as sample sizes increase. However, even with large samples extreme outliers can be distorting, for example, incomes in the millions of dollars. One simple way to deal with extreme values on univariate distributions is to truncate the distribution, for example, in the United States in 2006 by specifying $150,000 for incomes of $150,000 or above (this is what the GSS does; in 2006, just over 2 percent of the GSS sample had incomes this high); but this creates its own problems, as we will see next. A better way, which you will see in Chapter Fourteen, is to use interval regression (an elaboration of tobit regression) to correctly specify the category values.
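The leverage example can be reproduced directly from the ten observations. The sketch below, in Python rather than the Stata used for the book's own computations, fits the two-variable regression three times: with the original fourth case (13,13), with the deviant fourth case (13,0), and with the fourth case omitted. The function name `fit_line` is a hypothetical helper.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def fit_line(x, y):
    """Return (intercept, slope, r2) for the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2


# 1. The original ten points, with (13, 13) as the fourth case.
original = fit_line(father, respondent)

# 2. The same ten points with the fourth case changed to (13, 0).
deviant_y = list(respondent)
deviant_y[3] = 0
deviant = fit_line(father, deviant_y)

# 3. Nine points, omitting the fourth case altogether.
omitted = fit_line(father[:3] + father[4:], respondent[:3] + respondent[4:])

for label, (a, b, r2) in [("original", original), ("deviant", deviant), ("omitted", omitted)]:
    print(f"{label:9s} intercept={a:.2f} slope={b:.4f} r2={r2:.3f}")
```

The single altered point sits far from the center of the distribution of father's schooling, which is why it flattens the slope so sharply; dropping it leaves a line close to the original one.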
Truncation

Analysts are sometimes tempted to divide their study population into subgroups on the basis of values on the independent or dependent variable or on variables substantially correlated with the independent or dependent variable. For example, an analyst who suspects that income depends more heavily on education among those with nonmanual occupations than among those with manual occupations might attempt to test this hypothesis by correlating income with education separately for nonmanual and manual workers. This is a bad idea because income is correlated with occupational status; thus, dividing the population on the basis of occupational status will truncate the distribution of the dependent variable, which, all else equal, will reduce the size of the correlation. Moreover, if one subgroup, say manual workers, has a smaller variance with respect to income than does the other subgroup, say nonmanual workers (and this is likely to be true in most societies), the size of the correlation will be more substantially reduced for manual than for nonmanual workers, thus leading the analyst to mistakenly believe that the hypothesis is confirmed.

To see this, consider a highly stylized example, shown as Figure 5.6. To keep the example simple, imagine that all manual workers in the sample have less than seven years of schooling and that all nonmanual workers have more than seven years of schooling. Note that in the example, there is exactly the same income return to an additional year of education for nonmanual and manual workers. Note further that each point is an equal distance from the regression line. Now, suppose the correlation between income and education were computed separately for manual and nonmanual workers. The correlation for both groups would be smaller than the correlation computed over the total sample, and the correlation would be smaller for manual than for nonmanual workers.

FIGURE 5.6. Truncating Distributions Reduces Correlations. (Horizontal axis: years of schooling; vertical axis: income.)

This follows directly from Equation 5.5 because, from the way the example was constructed, the variance around the regression line is identical in all three cases, but the variance around the mean of the dependent variable is smaller for nonmanual workers than for the total sample and smaller for manual workers than for nonmanual workers. Although, for the sake of clarity, the example is highly stylized, the principle holds generally: when distributions are truncated the correlation tends to be reduced. This, by the way, is the main reason GRE scores are weak predictors of grades in graduate school courses: graduate departments do not admit people with low GREs, thereby truncating the distribution of GRE scores. But this does not imply that GRE scores should be ignored in the admissions process, as statistically illiterate professors argue from time to time.
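The stylized truncation argument can be illustrated with a small constructed data set. The sketch below is an assumption-laden illustration, not the data behind Figure 5.6: the slope, intercept, vertical offset, and the ranges of schooling (0 to 6 years for manual workers, 8 to 16 for nonmanual workers) are all hypothetical values chosen so that every point lies the same vertical distance from a common line.

```python
from math import sqrt

# Hypothetical parameters for a stylized data set (not the book's figure data).
INTERCEPT = 2500   # income at zero years of schooling
SLOPE = 1000       # income return to one more year of schooling, same in both groups
OFFSET = 1500      # every point lies this far above or below the line


def make_group(years):
    """Two cases per year of schooling: one above and one below the line."""
    xs, ys = [], []
    for x in years:
        for sign in (1, -1):
            xs.append(x)
            ys.append(INTERCEPT + SLOPE * x + sign * OFFSET)
    return xs, ys


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


manual_x, manual_y = make_group(range(0, 7))          # fewer than seven years
nonmanual_x, nonmanual_y = make_group(range(8, 17))   # more than seven years

r_manual = pearson_r(manual_x, manual_y)
r_nonmanual = pearson_r(nonmanual_x, nonmanual_y)
r_total = pearson_r(manual_x + nonmanual_x, manual_y + nonmanual_y)

print(round(r_manual, 3), round(r_nonmanual, 3), round(r_total, 3))
```

Because the residual variation is identical in all three computations, the differences among the three correlations come entirely from differences in the spread of the variables, which is the point of the stylized example.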
A "REAL DATA" EXAMPLE OF THE EFFECT OF TRUNCATING THE DISTRIBUTION

Analyzing the U.S. sample for the Political Action: An Eight Nation Study, 1973-1976 (Barnes and Kaase 1979) some years ago, I was puzzled to discover an extremely low correlation between education and income (less than .1, whereas in U.S. surveys the typical correlation between these two variables is on the order of .3). Further investigation revealed that the low end of both the education and income distributions were severely truncated, presumably as a result of inadequacies in either the sampling or the field work procedures. When the data were weighted to reproduce the bivariate distribution of education and income observed in the U.S. census for 1980 (the year closest to the survey), the estimated correlation approximated that typically found in U.S. surveys.
Regression Toward the Mean

The consequences of truncation actually are worse than just suggested, because of a phenomenon known as "regression toward the mean." When two measurements are made at different points in time, for example, pretest and posttest measurements in a randomized experiment or scores on the GRE, it is typical to observe that those cases with high values on the first observation tend, on average, to have lower values on the second observation, and that those cases with low values on the first observation tend to have higher values on the second observation. That is, both the high and the low values move toward (or "regress toward") the mean. This is true even when there is no change in the true value between the two measurements. The reason for this is that observed measurements consist of two components: a true score and a component representing error in measurement of the underlying true score.

For example, consider the GRE. The observed score for each individual can be thought of as consisting of a component measuring the candidate's "true" (or underlying or constant) ability to do the kind of work measured by the test and a random component comprised of variations in the exact questions asked in that administration of the test, the candidate's level of energy and mental acuity, level of confidence (Steele 1997), and so on. It then follows that those who have high scores in any given administration of the test will disproportionately include those who have high positive random components, and those who have low scores will disproportionately include those who have low random components. But because the second component is random, those who have high random components on the first test will tend, on average, to have lower random components on the second test, and those who have low random components on the first test will tend, on average, to have higher random components on the second test. The result is that the correlation between the two tests will be less than perfect and also that the regression coefficient relating the second to the first test will be less than 1.0. This is true even if the means and standard deviations of the two tests are identical.

An important implication of this result is that a researcher who targets for special intervention a low-scoring group (those who did poorly on a practice GRE, those with low grade point averages, and so on) will be bound to conclude, incorrectly, that the intervention was successful. Of course, if that same researcher chose the high-scoring group for the same intervention, he or she would be forced to conclude that the intervention was completely unsuccessful; indeed, that it was counterproductive. All of this is a simple consequence of analyzing a nonrandom subset of the original sample.

Exactly the same phenomenon, measurement error, has the effect of lowering the correlation between separate phenomena, for example, education and income, the heights of fathers and sons, and so on. This kind of observation is what led Francis Galton, one of the founders of correlation and regression analysis, to conclude in the late nineteenth century that a natural phenomenon of intergenerational transmission was a "reversion" (or "regression") toward "mediocrity"; hence the term "regression analysis" to describe the linear prediction procedure discussed here. But what Galton failed to notice is that there is also, and for exactly the same reason, a tendency for values near the mean to move away from the mean. The result is that the variance of the predicted (but not the observed) values and the slope of the regression line are reduced in proportion to the complement of the correlation between the variables. (For a book-length treatment of this topic, see Campbell and Kenny [1999].)
Aggregation

Students who have spent some time studying the behavior of populations of individuals usually conclude that we live in a stochastic world in which nothing is very strongly related to anything else. For example, in the United States, typically about 10 percent of the variance in income can be attributed to variance in education ($r \approx .3 \Rightarrow r^2 \approx .09$). Students are then puzzled when they discover that seemingly comparable correlations computed over aggregates, for example, the correlation between mean education and mean income for the detailed occupational categories used by the U.S. Bureau of the Census, tend to be far larger (in the present example, $r \approx .7 \Rightarrow r^2 \approx .49$). Why is this so? The explanation is simple. When correlations are computed over averages or other summary measures, a great deal of individual variability tends to "average out." In the extreme case, where there are only two aggregate categories, the correlation between the means for the two categories will necessarily be 1.0, as you can see in Figure 5.7 (where the large circle represents the mean height and weight for women, and the large triangle represents the mean height and weight for men); but the principle holds for more than two categories as well.

FIGURE 5.7. The Effect of Aggregation on Correlations. (Horizontal axis: height in inches; vertical axis: weight.)
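The extreme two-category case is easy to demonstrate. In the Python sketch below the heights and weights are invented for illustration (they are not the data plotted in Figure 5.7); the point is only that the individual-level correlation is well below 1, while the correlation computed over the two group means is 1 apart from floating-point rounding.

```python
from math import sqrt

# Hypothetical (height in inches, weight in pounds) pairs; invented for illustration.
women = [(60, 115), (62, 140), (63, 120), (64, 150), (66, 130)]
men = [(66, 150), (68, 185), (70, 160), (71, 200), (73, 170)]


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def group_mean(cases):
    """Mean height and mean weight for one group."""
    heights = [h for h, _ in cases]
    weights = [w for _, w in cases]
    return sum(heights) / len(heights), sum(weights) / len(weights)


# Correlation over all ten individuals.
everyone = women + men
r_individuals = pearson_r([h for h, _ in everyone], [w for _, w in everyone])

# Correlation over the two aggregate points (one per group).
means = [group_mean(women), group_mean(men)]
r_means = pearson_r([m[0] for m in means], [m[1] for m in means])

print(round(r_individuals, 3), round(r_means, 3))
```

With only two aggregate points a straight line fits them exactly, so all of the within-group variability that lowers the individual-level correlation has been averaged away.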
CORRELATION RATIOS

So far we have been discussing cases where we have two interval or ratio variables. Sometimes, however, we want to assess the strength of the association between a categorical variable and an interval or ratio variable. For example, we might be interested in whether religious groups differ in their acceptance of abortion. Or we might be interested in whether ethnic groups differ in their average income. The obvious way to answer these questions is to compute the mean score on an abortion attitudes index for each religious group or the mean income for each ethnic group. But if we discover that the means differ substantially enough to be of interest, we still are left with the question of how strong the relationship is. To determine this we can compute an analog to the (squared) correlation coefficient, known as the (squared) correlation ratio, $\eta^2$ (eta squared). $\eta^2$ is defined as

$$\eta^2 = 1 - \frac{\text{Variance around the subgroup means}}{\text{Variance around the grand mean}} = 1 - \frac{\text{Within-group sum of squares}}{\text{Total sum of squares}} = 1 - \frac{\sum_j \sum_i (Y_{ij} - \bar{Y}_{.j})^2}{\sum_j \sum_i (Y_{ij} - \bar{Y}_{..})^2} \tag{5.9}$$

where Y is the dependent variable, there are j groups, and i cases within each group. Thus, $\bar{Y}_{.j}$ is the mean of Y for group j, and $\bar{Y}_{..}$ is the grand mean of Y. From Equation 5.9, it is evident that if all the groups have the same mean on the dependent variable, knowing which group a case falls into explains nothing; the variance around the subgroup means equals the variance around the grand mean, and $\eta^2 = 0$. At the other extreme, if the groups differ in their means, and if all cases within each group have the same value on the dependent variable (that is, there is no within-group variance), then the ratio of the within-group sum of squares to the total sum of squares is 0, and $\eta^2 = 1$. From this we see that $\eta^2$, like $r^2$, is a proportional reduction in variance measure.

Let us explore the religion and abortion acceptance example with some actual data. In 2006 (and for most years since 1972) the GSS asked seven questions about the acceptability of abortion under various circumstances:

... should [it] be possible for a woman to obtain a legal abortion ...
- if there is a strong chance of serious defect in the baby?
- if she is married and does not want any more children?
- if the woman's own health is seriously endangered by the pregnancy?
- if the family has a very low income and cannot afford any more children?
- if she became pregnant as a result of rape?
- if she is not married and does not want to marry the man?
- if the woman wants it for any reason?
From these items I constructed a scale by counting the positive responses, excluding all cases with any missing data. The scale thus ranges from 0 to 7. Table 5.1 shows the mean number of positive responses by religion. All those who specified religions other than Protestant, Catholic, or Jewish or said they had no religion were included in the "Other and None" category. From the table, it is evident that Jews and other non-Christians are much more accepting of abortion than are Christians (Protestants and Catholics). But how important is religion in accounting for acceptance of abortion? To see this, we compute $\eta^2 = .070$. (The Stata computations to create Table 5.1 and to obtain $\eta^2$ are shown in the downloadable do and log files for the chapter.)
TABLE 5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.

Religion    Mean Number of Positive Responses    Standard Deviation

[Table body illegible in the source. Surviving fragments: the row labels "Catholics" and "Other" and the values 2.5 and 2.2, which cannot be matched to rows.]
Clearly, religious affiliation does not explain much of the variance in abortion attitudes. How can this be, given the substantial size of the mean differences? The answer is simple. Jews and "Others" differ substantially from Protestants and, especially, Catholics in their acceptance of abortion. But these groups are quite small, especially Jews. Hence, no matter how deviant they are from the overall average, they are unlikely to have much impact; when more than half of the population is included in one group, as is the case here with Protestants, a large fraction of the variance in abortion acceptance is bound to be within-group variance rather than between-group variance.

A second use of the correlation ratio is to test assumptions of linearity. We will take this up in Chapter Seven.
A USEFUL COMPUTATIONAL FORMULA FOR η²

A formula to compute $\eta^2$ by hand from frequency or percentage distributions is

$$\eta^2 = 1 - \frac{\sum_j \left[\sum_i f_{ij} X_i^2 - \dfrac{\left(\sum_i f_{ij} X_i\right)^2}{\sum_i f_{ij}}\right]}{\sum_j \sum_i f_{ij} X_i^2 - \dfrac{\left(\sum_j \sum_i f_{ij} X_i\right)^2}{\sum_j \sum_i f_{ij}}}$$

where there are j groups and i categories of the dependent variable, which in this case is designated by X. So $X_i$ is the score for the ith category (of the jth group, although the category scores are the same for all groups), and $f_{ij}$ is the number of cases in the ith category among cases of the jth group. Notice the difference from Equation 5.9, where the i refers to individuals rather than to categories of the dependent variable.
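As a check on the logic of computing the correlation ratio from a frequency distribution, the following Python sketch computes it two ways: from individual cases, as one minus the ratio of the within-group sum of squares to the total sum of squares, and from category frequencies, using the standard sum-of-squares shortcut. The frequency table, group labels, and function names are hypothetical and invented purely for illustration; they are not the GSS figures.

```python
# Hypothetical frequency table: number of cases at each scale score, by group.
scores = [0, 1, 2, 3]
freq = {
    "group A": [6, 3, 1, 0],
    "group B": [1, 2, 4, 3],
}


def eta_squared_from_cases(groups):
    """1 - (within-group SS / total SS), computed from individual cases."""
    all_cases = [y for g in groups for y in g]
    grand_mean = sum(all_cases) / len(all_cases)
    total_ss = sum((y - grand_mean) ** 2 for y in all_cases)
    within_ss = 0.0
    for g in groups:
        g_mean = sum(g) / len(g)
        within_ss += sum((y - g_mean) ** 2 for y in g)
    return 1 - within_ss / total_ss


def eta_squared_from_freq(scores, freq):
    """Same quantity computed from category scores and frequencies."""
    n_total = sum(sum(fs) for fs in freq.values())
    sum_fx = sum(f * x for fs in freq.values() for f, x in zip(fs, scores))
    sum_fx2 = sum(f * x * x for fs in freq.values() for f, x in zip(fs, scores))
    total_ss = sum_fx2 - sum_fx ** 2 / n_total

    within_ss = 0.0
    for fs in freq.values():
        n_j = sum(fs)
        fx_j = sum(f * x for f, x in zip(fs, scores))
        fx2_j = sum(f * x * x for f, x in zip(fs, scores))
        within_ss += fx2_j - fx_j ** 2 / n_j
    return 1 - within_ss / total_ss


# Expand the frequency table into one list of individual scores per group.
cases_by_group = [
    [x for f, x in zip(fs, scores) for _ in range(f)] for fs in freq.values()
]

eta2_cases = eta_squared_from_cases(cases_by_group)
eta2_freq = eta_squared_from_freq(scores, freq)
print(round(eta2_cases, 4), round(eta2_freq, 4))
```

The two routes give the same answer because a frequency-weighted sum over categories is just a compact way of summing over the individual cases.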
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have considered simple (two-variable) ordinary least-squares (OLS) regression and correlation. [The remainder of this summary is illegible in the source; the surviving fragments mention how regression coefficients are affected by outliers.]
CHAPTER

INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider the central technique for dealing with the most typical social science problem: understanding how some outcome is affected by several determining variables that are correlated with each other. We begin with a conceptual overview of multiple correlation and regression, and then continue with a worked example to illustrate how to interpret regression coefficients. We then turn to consideration of the special properties of categorical independent variables, which can be included in multiple regression equations as a set of dichotomous ("dummy") variables, one for each category of the original variable (except that to enable estimation of the equation, one category must be represented only implicitly). In the course of our discussion of dummy variables, we develop a strategy for comparing groups that enables us to determine whether whatever social process we are investigating operates in the same way for two or more subsegments of the population: males and females, ethnic categories, and so on. We conclude with an alternative way of choosing a preferred model, the Bayesian Information Coefficient (BIC).
INTRODUCTION For most social sciencepurposes,the two variable regressions we encounteredin the previous.chapter.arenot very interesting,exceptas a baselineagainstwhich to compare modelsinvolving severalindependentvariabies.Sucn moO"t J" ttr" fbcusof this chap_ ter Here we generalizethe twovariable procedureto many variables.That is, we predicr some (interval or ratio) dependentvadable from a ser of iniependent vanables.The logic rs exactly the sameas in the caseof twovariabre regression, excepl that we are estimatrng an equationin many dimensions. Let us first consider the case where we have two independentvariables. Extending the tenobservationexample from the previous chapter, ,rp'p"r"r"" *i"t ,frat education dependsnot only on the father,s educaiion but also^on th" irui". ot ,iUUngs.The argu_ melt.is that the more siblingsone has,the lessattention on. i"""iu.. f.orn one,sparents /all elseequal),and hence,in consequence, the lesswell one doesin schooland, there_ fore, the lesseducationone obtains,on average(for examplesof studiesof sibship_size effectsin rheresearchlirerarure,seeDown"y tlsgsl, N4_uffi i06 , L"[2005], andLu and Treiman [2008]). Suppose,further, that we have informution on utf tn"" uariablesfor our sampleof ten cases: Father's Yearsof Schooling 2 12 4 13 6 6 8 4 8 10
Respondent'sYears of Schooling 4 10 8 13 9 4 13 6 6 11
Number of Siblings 3 3 4 0 2 5 3 4 3 4
Note that the first two columns are simply repeated from the example in the previous chapter (see page 88). To test our hypothesis that the number of siblings negatively affects educational attainment, we would estimate an equation of the form:

$$\hat{E} = a + b(E_F) + c(S) \tag{6.1}$$

(Note that I use generic symbols, for example, X and Y, to indicate variables in equations of a general form, but mnemonic symbols, for example, E, E_F, and S, to indicate variables in equations that refer to specific concrete examples. I find it much easier to keep track of what is in my equation when I use mnemonic symbols for variables.)

Introduction to Multiple Correlation and Regression (Ordinary Least Squares)
Equations such as Equation 6.1 are known as multiple regression equations. In multiple regression equations the coefficients associated with each variable measure the expected difference in the dependent variable associated with a one-unit difference in the given independent variable, holding constant each of the other independent variables. So in the present case, the coefficient associated with the number of siblings tells us the expected difference in educational attainment for each additional sibling among those whose fathers have exactly the same years of education. Correspondingly, the coefficient associated with the father's education tells us the expected difference in years of education for those whose fathers differ by one year in their education but who have exactly the same number of siblings. In the three-variable case (that is, when we have only two independent variables), but not when we have more variables, we can construct a geometric representation that illustrates the sense in which we are holding constant one variable and estimating the net effect of the other.

In multiple regression, as in two-variable regression, we use the least-squares criterion to find the "best" equation; that is, we find the equation that minimizes the sum of squared errors of prediction. However, whereas in bivariate regression we think in terms of the deviation between each observed point and a line, in multiple regression the analog is the deviation between each observed point and a geometric surface in k-dimensional space, where k = 1 + the number of independent variables. Thus, where there are two independent variables, the least-squares criterion minimizes the sum of squared deviations of each observation from a plane, as shown in Figure 6.1.
FIGURE 6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
Metric Regression Coefficients
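The coefficients associated with each independent variable are known as metric regression coefficients, or net regression coefficients (or sometimes raw or unstandardized regression coefficients, to distinguish them from standardized regression coefficients, about which you will learn more later). In the present case, the estimated regression equation is

$$\hat{E} = 6.26 + .564(E_F) - .640(S) \tag{6.2}$$

This equation tells us that a person who had no siblings and whose father had no education would be expected to have 6.26 years of schooling; that, among those with the same number of siblings, each additional year of the father's education is associated with an expected increase of .564 of a year in the respondent's own schooling; and that, among those whose fathers have the same education, each additional sibling is associated with an expected reduction of .640 of a year of schooling.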
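The coefficients of Equation 6.2 can be checked directly from the ten hypothetical cases. The following is a minimal sketch in Python using only the standard library; the function and variable names are illustrative assumptions and are not taken from the text, which relies on Stata. It solves the least-squares equations for two predictors using sums of cross-products of deviations from the means.

```python
# Ten hypothetical cases from the text.
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # father's years of schooling (E_F)
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # respondent's years of schooling (E)
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # number of siblings (S)


def mean(values):
    return sum(values) / len(values)


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def ols_two_predictors(y, x1, x2):
    """Least-squares intercept and slopes for y = a + b*x1 + c*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b = (s1y * s22 - s2y * s12) / det
    c = (s2y * s11 - s1y * s12) / det
    a = mean(y) - b * mean(x1) - c * mean(x2)
    return a, b, c


a, b, c = ols_two_predictors(resp, father, sibs)
print(f"intercept = {a:.2f}, father's schooling = {b:.3f}, siblings = {c:.3f}")
```

Rounded to the precision used in the text, the three values agree with the intercept and the two net coefficients reported in Equation 6.2.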
Note that the coefficient associated with the father's education in Equation 6.2 is smaller than the corresponding coefficient in the two-variable equation estimated in the previous chapter. This is because the father's education and the number of siblings are correlated (r = -.503 in this example): those whose fathers were poorly educated tend to come from large families. Thus, in the two-variable equation the observed effect of the father's education on the respondent's education partly reflects the effect of the number of siblings. Equation 6.2 takes account of (controls for) this association and gives the effect of the father's education net of the number of siblings. The implication of this result is important: if a model omits a variable that affects the dependent variable and is correlated with an independent variable included in the model, the coefficient of the included variable will be biased; that is, it will overstate or understate the true causal relation between the given independent variable and the dependent variable (except in the limiting case that the omitted variable is uncorrelated with the other variables in the equation). This is known as specification error or omitted variable bias.
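The size of the bias from omitting the number of siblings can be worked out exactly for the ten hypothetical cases. The sketch below (Python, standard library only; all names are illustrative assumptions) relies on a standard algebraic identity for least squares with two predictors: the two-variable slope equals the net slope plus the net coefficient of the omitted variable multiplied by the slope from regressing the omitted variable on the included one.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # father's years of schooling
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # respondent's years of schooling
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # number of siblings


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


s_ff, s_ss, s_fs = cross(father, father), cross(sibs, sibs), cross(father, sibs)
s_fe, s_se = cross(father, resp), cross(sibs, resp)

# Net (multiple regression) coefficients, as in Equation 6.2.
det = s_ff * s_ss - s_fs ** 2
b_net = (s_fe * s_ss - s_se * s_fs) / det
c_net = (s_se * s_ff - s_fe * s_fs) / det

# Two-variable slope of respondent's schooling on father's schooling alone.
b_gross = s_fe / s_ff

# Slope from regressing the omitted variable (siblings) on father's schooling,
# and the correlation between the two independent variables.
d = s_fs / s_ff
r_fs = s_fs / (s_ff * s_ss) ** 0.5

print(f"two-variable slope  = {b_gross:.3f}")
print(f"net slope           = {b_net:.3f}")
print(f"bias (c_net * d)    = {c_net * d:.3f}")
print(f"r(father, siblings) = {r_fs:.3f}")
```

Because the siblings coefficient is negative and siblings are negatively related to the father's schooling, the product is positive, so the two-variable slope overstates the net effect of the father's education.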
Some analysts present a series of successively more complete multiple regression equations and discuss changes in the size of specific coefficients resulting from the inclusion of additional variables. This is a sensible strategy under one specific condition: when the analyst wants to consider how the effect of one variable is modified by the inclusion of another variable (or variables). That is, in a manner analogous to the search for spurious or intervening relationships in Chapters Two and Three, the analyst might want to investigate whether a particular relationship is wholly or partly explained by another factor. For example, it is observed that Southerners tend to be less tolerant of social deviants than are persons living outside the South. However, the analyst may want to assess the possibility that this relationship is entirely (or largely) spurious, arising from the fact that Southerners tend to be less well educated and less urban than others, and that education and urban residence increase tolerance. In this case it would be appropriate to present two models, one regressing tolerance on Southern residence and a second regressing tolerance on Southern residence, education, and size of place, and then to discuss the reduction in the size of the coefficient associated with Southern residence that occurs when education and size of place are added to the equation.

However, absent specific hypotheses regarding spurious or mediating effects, there is no point in estimating successive equations (except for models involving sets of dummy variables, discussed in the next section, or variables that alter functional forms, discussed in the next chapter); rather, all relevant variables should be included in a single regression equation. However, even in this case the analyst should present a table of zero-order (two-variable) correlation coefficients between pairs of variables, plus means and standard deviations for all interval and continuous variables and percentage distributions for all categorical variables. These descriptive statistics help the reader to understand the properties of the variables being analyzed. In addition, as noted earlier, the zero-order correlations provide a baseline for assessing the size of net effects when other variables are controlled.
Testing the Significance of Individual Coefficients
It is conventional to compute and report the standard error of the coefficient of each independent variable, although, as you will soon see, standard errors have limited utility in the case of dummy variables or interaction terms. The convention is to interpret coefficients at least twice the size of their standard error as statistically significant. This convention arises from the fact that the sampling distribution of regression coefficients follows a t-distribution and that, with 60 d.f. (where the degrees of freedom is computed as N - k - 1, with k the number of independent variables), t = 2.00 defines the 95 percent confidence interval around the value b = 0. It is important to understand that the t-statistics indicate the significance of each coefficient net of the effect of all other coefficients in the model. Thus, when several highly correlated variables are included in the model, it is possible that no one of them is significantly different from zero, although as a group they are significant (see also the following boxed comment on multicollinearity).

Some analysts estimate regression models involving several independent variables, drop the variables with nonsignificant coefficients (this is known as trimming the regression equation), and reestimate the model, on the ground that to leave coefficients in the model that have nonsignificant effects biases the estimates of the other variables. However, other analysts argue that the best estimate of the dependent variable is obtained by including all possible predictors, even those for which the difference from zero cannot be established with high confidence. The latter strategy is preferable because it provides the best point estimate based on a set of variables that the analyst has an a priori basis for suspecting affect the outcome.
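The "twice the standard error" convention described above can be checked numerically. The sketch below is an illustration only (Python, standard library; the function names are assumptions, and a statistics package would normally be used instead). It integrates the density of the t-distribution by Simpson's rule to find the probability that t falls between -2.00 and +2.00.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log(1 + t * t / df))


def central_probability(limit, df, steps=2000):
    """P(-limit < t < limit), by Simpson's rule (steps must be even)."""
    h = 2 * limit / steps
    total = t_pdf(-limit, df) + t_pdf(limit, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(-limit + i * h, df)
    return total * h / 3


coverage_60 = central_probability(2.0, 60)   # N - k - 1 = 60
coverage_10 = central_probability(2.0, 10)   # a much smaller sample

print(f"60 d.f.: {coverage_60:.4f}")
print(f"10 d.f.: {coverage_10:.4f}")
```

With 60 degrees of freedom the interval covers very nearly 95 percent of the distribution. With only 10 degrees of freedom it covers noticeably less, so the rule of thumb is too lenient for very small samples.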
MULTICOLLINEARITY
When independent variables are highly correlated, a condition known as multicollinearity, regression coefficients tend to have large standard errors and to be rather unstable, in the sense that quite small changes in the distribution of the data can produce large changes in the size of the coefficients. As Fox notes (1991, 11; see also Fox 1997, 337-366), the inflation in the sampling variance of the coefficient of an independent variable, j, due to multicollinearity, is given by $1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination (discussed later in this chapter) associated with the regression of variable j on the remaining independent variables; this is known as the variance inflation factor, and the standard error is inflated by its square root. It can be computed in Stata by using the estat vif command after the regress command. (See Fox and Monette [1992] for a generalization to sets of independent variables, such as a set of dummy variables, or a variable and its square; see also the discussion in Chapter Seven of this book, in the section on "Nonlinear Transformations.")

Clearly, the independent variables must be quite highly correlated for multicollinearity to be an important problem. For example, if $R_j^2$ = .75, the error variance will be quadrupled, and the standard error will thus be doubled. Because values of $R_j^2$ as large as .75 are quite uncommon, multicollinearity is not often a problem in the social sciences, mainly arising in situations in which alternative measures of the same underlying concept are included in a single model and most commonly when aggregated data, such as properties of occupations, cities, or nations, are analyzed. In such situations, a reasonable solution is to combine the measures into a multiple-item scale (see Chapter Eleven).

Some analysts attempt to minimize multicollinearity by employing what is known as stepwise regression, in which variables are selected into (or out of) a model one at a time, in the order that produces the greatest increment (or the smallest decrement) in the size of the R². Such methods are generally misguided, both because they are completely atheoretical and because the order in which variables are selected can be quite arbitrary, given the previously noted instability in regression coefficients when variables are highly correlated.

Standardized Coefficients
A question that naturally arises when there are multiple determinants of some dependent variable is which determinant has the greatest impact. We cannot directly compare the coefficients associated with each independent variable because they typically are expressed in different metrics. Is the consequence of a difference of one year of schooling completed by the father greater or smaller than the consequence of a difference of one sibling? Although the question can, of course, be answered (as we saw earlier, the cost of each additional sibling is somewhat greater than the gain from each year of the father's education), the answer does not tell us which variable has the stronger effect on the dependent variable because the variance in the number of siblings is much smaller than the variance in the father's years of schooling. If it is not obvious why the size of the variance matters, consider the effect of education and income on the value of the car a person drives. Suppose that for a sample of U.S. adults, we estimate such an equation and obtain the following:
$$\hat{V} = 15{,}000 + .5(I) - 500(E) \tag{6.3}$$
We would hardly want to conclude from this that the effect of education is 1,000 times as large as the effect of income, or to measure income in units of $100 and then to conclude that the effect of education is 10 times that of income. Actually, the equation indicates that a year of education reduces the (expected) value of a person's car by $500, net of income, whereas a $1,000 increment in income increases the (expected) value of a person's car by $500, net of education. In this precise sense, a year of education exactly offsets $1,000 in income. However, a more general way to compare regression coefficients is to transform them into a common metric.

The conventional way this is done is to express the relationship between the dependent and independent variables in terms of standardized variables, that is, variables transformed by subtracting the mean and dividing by the standard deviation. Because such variables all have standard deviation = 1, the regression coefficients associated with standardized variables indicate the number of standard deviations of difference on the dependent variable expected for a one standard deviation difference on the independent variable, net of the effects of all other independent variables. In the present example, the equation with standardized coefficients, that is, the standardized counterpart to Equation 6.2, is

$$\hat{e} = .601(e_F) - .260(s) \tag{6.4}$$
(Reminder: As noted in the previous chapter, there is no intercept because standardized variables all have mean = 0 and a regression surface must pass through the mean of each variable.) From inspection of the coefficients in Equation 6.4, we conclude that the father's education has a greater effect on educational attainment than does the number of siblings: a greater effect in the precise sense that a one standard deviation difference in the father's years of schooling implies an expected difference of .60 of a standard deviation in the respondent's years of schooling, whereas a one standard deviation difference in the number of siblings implies only a .26 standard deviation expected difference in the respondent's years of schooling.

Note that in practice we do not ordinarily standardize the variables and recompute the regression equation but rather instruct the software to report standardized coefficients (usually in addition to metric coefficients). Because standardized coefficients often are not reported, particularly in the economics literature, we also can make use of the relation

$$\beta_{YX} = b_{YX}\left(\frac{s_X}{s_Y}\right)$$

that is, the fact that the standardized coefficient relating independent variable X to dependent variable Y is equal to the metric coefficient multiplied by the ratio of the standard deviations of the independent and dependent variables, to convert metric to standardized coefficients (or vice versa). (Recall Equations 5.7 and 5.8.)

There is some controversy regarding standardized regression coefficients. The conventional wisdom in sociology and other social sciences is that they are useful for the purpose just described (to assess the relative effect size of each of a set of independent variables in determining some outcome) but that they are inappropriate for assessing the relative effect size of a given variable in different populations, precisely because the standardized coefficients will differ if the relative standard deviations differ in the populations being compared. For example, suppose we wanted to compare the effect of the number of siblings on education among Blacks and Whites in the United States. Suppose, further, that the metric regression coefficient relating years of schooling to the number of siblings is identical for Blacks and Whites, that the standard deviation of years of schooling for Blacks and Whites is identical, but that the standard deviation of number of siblings is larger for Blacks than for Whites. Under these circumstances the standardized coefficient relating the number of siblings to years of schooling would also be larger for Blacks than for Whites. (This follows directly from the mathematical relation between the standardized and metric regression coefficients shown in Equations 5.7 and 5.8.) Would we really want to conclude that the number of siblings has a stronger effect for Blacks than for Whites in determining how much schooling they get if the "cost" (in terms of years of schooling) of each additional sibling is identical for Blacks and Whites? Probably not. However, there are those (for example, Hargens 1976) who argue that it would be meaningful to say that sibship size matters more for Blacks precisely because there is more variability in the sibship sizes of Black families.

Additional light may be shed on this point by comparing the interpretation of standardized and unstandardized coefficients within a single sample. In an analysis of educational attainment in a 1962 U.S. nationally representative sample, Beverly Duncan (1965, 60, 65) showed that the cost of coming from a nonintact family was very high: about a year of schooling, net of a large number of other factors. However, the standardized coefficient relating education to family intactness was relatively weak, .09, far from the largest standardized coefficient. How are these results to be reconciled? The fact is that there is nothing inconsistent about them. The metric coefficient indicates that for the relatively few persons from nonintact families in 1962, the cost was very substantial. But the standardized coefficient indicates that family intactness was not a very important determinant of variance in educational attainment, precisely because only a relatively small fraction of the sample was from nonintact families. Given the near absence of variability in family intactness, this variable could not explain much of the variability in educational attainment.
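The conversion between metric and standardized coefficients, and the reason standardized coefficients can differ across populations even when metric coefficients are identical, can be illustrated with a short sketch (Python, standard library only). The first part uses the ten hypothetical cases and the metric coefficients of Equation 6.2. The two-group figures in the second part are invented purely for illustration and do not come from any data set.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # father's years of schooling
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # respondent's years of schooling
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # number of siblings


def sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardize(b, x, y):
    """Standardized coefficient = metric coefficient * (s_x / s_y)."""
    return b * sd(x) / sd(y)


# Metric coefficients taken from Equation 6.2.
beta_father = standardize(0.564, father, resp)
beta_sibs = standardize(-0.640, sibs, resp)
print(f"father's schooling: {beta_father:.3f}   siblings: {beta_sibs:.3f}")

# Hypothetical two-group comparison: identical metric coefficient and
# identical s.d. of schooling, but a different s.d. of number of siblings.
metric_b = -0.30          # assumed, for illustration only
sd_schooling = 3.0        # assumed, same in both groups
beta_group_a = metric_b * 2.0 / sd_schooling   # s.d. of siblings = 2.0
beta_group_b = metric_b * 3.0 / sd_schooling   # s.d. of siblings = 3.0
print(f"group A: {beta_group_a:.2f}   group B: {beta_group_b:.2f}")
```

The first line of output reproduces, within rounding, the coefficients of Equation 6.4. The second shows that the group with more variability in sibship size has the larger standardized coefficient in absolute value, even though each additional sibling "costs" the same amount of schooling in both groups.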
THE VARIANCE OF DICHOTOMOUS VARIABLES
As you will recall, the standard deviation of a dichotomous variable scored 0 or 1 is $\sqrt{p(1-p)}$, where p is the proportion of cases that are "positive" (whichever category is so defined). Thus, the more skewed the distribution, that is, the further it departs from an even split, the smaller the standard deviation and hence the smaller the standardized coefficient. Because for dichotomous variables the size of standardized coefficients depends not only on the size of the metric coefficient but also on the proportion of the sample that is "positive" with respect to the attribute, it is generally advisable to interpret the standardized coefficients for such variables with caution.
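How strongly the split of a dichotomous variable affects its standardized coefficient can be seen with a small calculation. In this sketch (Python, standard library only) the metric coefficient and the standard deviation of the dependent variable are held fixed at arbitrary, assumed values; only the proportion "positive" changes.

```python
import math


def sd_dichotomy(p):
    """Standard deviation of a 0/1 variable with proportion p scored 1."""
    return math.sqrt(p * (1 - p))


metric_b = 1.0      # assumed metric coefficient, for illustration only
sd_outcome = 3.0    # assumed s.d. of the dependent variable

betas = {}
for p in (0.50, 0.30, 0.10, 0.05):
    betas[p] = metric_b * sd_dichotomy(p) / sd_outcome
    print(f"p = {p:.2f}  s.d. = {sd_dichotomy(p):.3f}  standardized b = {betas[p]:.3f}")
```

The standardized coefficient shrinks steadily as the distribution departs from an even split, although the metric coefficient never changes.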
Coefficient of Determination (R²)
How well does Equation 6.2 explain the variance in educational attainment? We determine this via an exact analogy to r², known as R², or the coefficient of determination, which tells us the proportion of variance in the dependent variable explained by the entire set of independent variables. Just as for r², R² = 1 - the ratio of the error variance (the variance around the regression surface) to the total variance in the dependent variable.
A FORMULA FOR COMPUTING R² FROM CORRELATIONS
A convenient formula for computing R² from a matrix of correlations and standardized regression coefficients is

$$R^2_{Y \cdot 12 \ldots k} = \sum_{i=1}^{k} r_{Yi}\,\beta_{Yi}$$

That is, R² can be computed as the sum of the products of the correlations between each of the independent variables and the dependent variable and the corresponding standardized regression coefficients.
ADJUSTED R²
When the number of variables included in a model is large relative to the number of cases in the sample, the explained variance is necessarily large because the amount of information used in the explanation approaches the amount of information to be explained. To correct for this, most computer programs report the "Adjusted R²" as well as the ordinary R². The formula for Adjusted R² is

$$R^2_{\text{Adj}} = 1 - (1 - R^2)\left(\frac{N - 1}{N - k - 1}\right)$$

where N is the number of cases and k is the number of independent variables. It is clear that as k approaches N, Adjusted R² gets small; indeed, it can become negative. The problem of overfitting the data only arises when samples are quite small; but in such cases the Adjusted R² should be taken seriously. However, the ordinary R² should be used in tests of the significance of the increment in R² (discussed later in the chapter, in the section "A Strategy for Comparisons Across Groups").
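Both formulas can be applied to the ten hypothetical cases used earlier in the chapter. The sketch below (Python, standard library only; names are illustrative assumptions) computes the zero-order correlations, derives the two standardized coefficients from them using the standard two-predictor formulas, and then computes R² and Adjusted R².

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # father's years of schooling
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # respondent's years of schooling
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # number of siblings


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def corr(x, y):
    """Zero-order (Pearson) correlation."""
    return cross(x, y) / math.sqrt(cross(x, x) * cross(y, y))


r_ef = corr(resp, father)    # respondent's with father's schooling
r_es = corr(resp, sibs)      # respondent's schooling with siblings
r_fs = corr(father, sibs)    # between the two independent variables

# Standardized coefficients for the two-predictor case.
beta_f = (r_ef - r_es * r_fs) / (1 - r_fs ** 2)
beta_s = (r_es - r_ef * r_fs) / (1 - r_fs ** 2)

# R-squared as the sum of products of correlations and standardized coefficients.
r2 = r_ef * beta_f + r_es * beta_s

# Adjusted R-squared with N cases and k independent variables.
n, k = len(resp), 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R2 = {r2:.3f}, Adjusted R2 = {r2_adj:.3f}")
```

With only ten cases and two predictors the adjustment is substantial, which illustrates why the Adjusted R² deserves attention when samples are small.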
rt ';::::::,i;:,:::::::7::;:,:,,;:;;,:,",!,!1,:,!i:ii1;JJ? ru! :,1. p, ex :"r; ";::::;:;::ni^::xr*_ ana,. n.r,he f!i;;;:1:J:: r:c, I fl
'i:,::!vr;w:t:::i;L!,:;:i':ii"*,xx: qr
tr j1ii:."::,,u,",'u,i)uu",J"':::::::::::l! uod ;,! "i!i!!li!l?;il'"',!ff ::r:::::!:"il:::;:;:";,;,,;;y ;;3:P;;:,:;:::;f Gc
Standard Error of Estimate (Root MSE)
#H::,ffili,ffi1'"ffi *,:"xrx:hl:::::,:ljffiT.di"l1ffffi ,
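$$\text{s.e.e.} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - k - 1}} \tag{6.5}$$

where N is the number of cases and k is the number of independent variables.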
*l;u;ry,nnr.*:x*h:tf.q*n",x'"i.fi1,".;:"T,1
ri.iifil:l:JsTx;::ml':*,*ri] [T*?]}"":Tff F,*.l'"T:T"fl jjj"gl ffi,;:fr; s*HTtrJ#T".,H:,n::Hly"i:Tlt,iT;:::,ffi ..,pr",,rsp".",i"i#J#:1i"F:1llT.j:dT,fl :;a#:;ffi J*:Ttr"lT
:triff"it;ffi:il::t$;r:1.;il,*'stohlewithiir.si";'il;,ih"regressionsurrace isthatthe'r.e.e. isnotsensitive totherelative size
r ,r3ixl:iltJ,t#f#Tff"il::2
A WORKED EXAMPLE: THE DETERMINANTS OF LITERACY IN CHINA
Let us work through the present
:j*#g*#iX,;:'ffr#X*1'i:'{itiqff i;,::"T:']"J;r",'#:;?JJi,:l ll,:f;i."li:"rif :*:::*nf,*j:,,,1"",S:il1llf I :"fr:ijFi$'tfil 3",T,:.S:*ffi:ii"il#,TJ,iffl:T"'ii:ml,'"',1;:l bedownloaded fromthecourseweb site).A surveyof thepoputution lg96 or tr," flT
ji:;l'*;.i:*::l#,#*iil#$h*I.*'.:"trq;x.: '{;fli scioor'iog " H"J"fniil::llXltfl#l'::':ase inw",t".nution.. infacr,measured interms ofthemrmber ; "h,';;;;;;g,,)"4'd]:1]i!"#:r;:,:$T withadditionar
ting ally and
I,::?:,."":,#;;;;;;;lT::dj/*iffi #:$i,:#$Til"_",".1T:::[
:,#;,i;:",,.*'..J::l}if j:t# :d{Tiin::i3lJff ilJ"ff #..lTi,."1il i g n *. pr iy'in.'r.** anda ...1; *, :X, ;:liffi
:t^:',1*r
:i'i, wtren,t. ramitr r".oJ"il", *ri;X:;t
of
l:
T:.'j't*
measu
",J readingwasimponant re rowhich in the
,iliffi:::ll.fi ,jJHT:tl *#1T:":{i:fi ryTi",:*:1$XilllJJl:iilf jffi[::,:: [":T##'.T:.
ffi
:xJ:n*k*.1:#yr,:1,nfu:ffi:
;:xTlr..?lJ::f#f"ffJ:;T*,:::,*.libraries aremore rikery toexrsr andtocontain
.*fi**$Nn****r
ffi*,mffiffi*
ii#trp,;* $$$$"}i#*ffihlrf
h'*fr iffi fr *"i*jiilt*tti*"*inltffi'f i;:Tfti,id",T"},..;;,.x",f;Jm S;1:;:#i'"",T"'.ffi
ru*#;Hj:,#ffiJ;,J,[.;,# #},[#r;,rn #Tn.li
Nf":#:{t*i*?iiil.i,ii;i"r:f *y.1,tr3r!,,,.,A "fi,n l,:.m r::,l,..;:rt::"riiiliffii[Ti, *:;l] *:l in tt'. o",t ;ffi#Tffi:fff i1lil^:*" ouine ro,"",u,Jl'uuo"' "0",t "*ry,uo npm#:: il:'#;fu ff;:il' il#*ftp;i,*il"1 :
.,l":x I:Hil:ffi:":*l} i{"""J:##: :::r:h##T: }ff I
TABLE 6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).

Variable                                          Mean
V: Number of characters correctly identified(a)   3.60
E: Years of schooling                             6.47
E_F: Father's years of schooling                  3.07
N: Nonmanual occupation(b)                        .177
U: Urban origin(b)                                .180
M: Male(b)                                        .558
C: Level of cultural capital(c)                   .227

(a) Items, in increasing order of difficulty, are yiwan (ten thousand), xingming (full name), liangshi (grain), hanshu (function), diaozhuo (carve), sinüe (wreak havoc or wanton massacre), chuanmiu (erroneous), qimao (octogenarian), chichu (walk slowly), and taotie (glutton).
(b) Variables N, U, and M are dichotomies, scored 1 for those in the category and scored 0 for those not in the category.
(c) The scale is the mean of standardized scores for five variables measuring the behavior of parents when the respondents were age fourteen: the number of books in the home, the presence of children's magazines in the home, the frequency with which parents read a newspaper, the frequency with which parents read serious nonfiction, and the presence of an atlas in the home. If information was missing for an item, that item was excluded from the average. The resulting scale was transformed to a 0-1 metric, that is, the lowest score is 0 and the highest score is 1.
Table 6.2 confirms the importance of years of schooling because the standardized coefficients for years of schooling in both models are far larger than the standardized coefficients for any other variable. Each additional year of schooling produces an expected increase of about .4 in the number of characters correctly identified, net of all other factors. This means, for example, that a university graduate (sixteen years of schooling) would be expected to identify about two more characters than would an otherwise similar vocational or technical school graduate (eleven years of schooling).
TABLE 6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).(a)

Variable                                   Model 1        Model 2
Metric regression coefficients
  E_F: Father's years of schooling         .030 (.006)    .009 (.007)
  U: Urban origin                          .255 (.053)    .177 (.054)
Standardized regression coefficients
  E_F: Father's years of schooling         .049           .015

(a) All variables are significant at or beyond the .001 level except for father's education in Model 2 (p = .195).
TECHNICAL POINT ON TABLE 6.2
Note that both models in Table 6.2 are based on exactly the same cases, the number of cases shown in Table 6.1. A common error analysts make is to present successive models based on different cases, that is, for each model, all the cases for which complete information is available for the variables included in that model. This is ill advised because it makes it impossible to determine whether differences in the coefficients for successive models are due to the inclusion of additional variables or are due to variation in the samples. Moreover, formal comparisons of the increment in explained variance resulting from the inclusion of additional variables (presented in the next section) are not correct unless the models are based on the same cases. Stata has a command that makes it easy to ensure that all models being compared are based on the same cases.
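The idea of restricting every model to a common set of cases can be sketched in a few lines. The example below is only an illustration in Python (standard library only); the records and variable names are invented, and it is not the Stata command referred to above. It keeps a case only if it has complete information on every variable used in any of the models to be compared.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"score": 5, "schooling": 9, "father_schooling": 6, "cultural_capital": 0.40},
    {"score": 3, "schooling": 6, "father_schooling": 0, "cultural_capital": None},
    {"score": 7, "schooling": 12, "father_schooling": None, "cultural_capital": 0.55},
    {"score": 2, "schooling": 3, "father_schooling": 0, "cultural_capital": 0.10},
]

model_1_vars = ["score", "schooling", "father_schooling"]
model_2_vars = model_1_vars + ["cultural_capital"]


def complete_cases(rows, variables):
    """Rows with no missing value on any of the named variables."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


# Cases available to each model if it is estimated separately.
n_model_1_alone = len(complete_cases(records, model_1_vars))
n_model_2_alone = len(complete_cases(records, model_2_vars))

# Common estimation sample: complete on the union of both variable lists.
union_vars = sorted(set(model_1_vars) | set(model_2_vars))
common_sample = complete_cases(records, union_vars)

print(n_model_1_alone, n_model_2_alone, len(common_sample))
```

Estimated separately, the two models here would rest on different numbers of cases; estimating both on the common sample removes that source of difference between their coefficients.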
Model 1 predicts the number of characters identified from all variables except "cultural capital." In this model all coefficients are significant at the .001 level. Net of other factors, those with nonmanual occupations score about a quarter of a point higher than manual workers, those from urban origins score about a quarter of a point higher than those from rural origins, and males score more than a third of a point higher than females. Clearly, all these effects are real but, with the exception of education, are modest in size. Interestingly, the father's years of schooling significantly increases knowledge of characters, net of all other factors, although the effect is very small (the expected difference between those with the most educated and least educated fathers is only about half a character; precisely, .54 = .030*18). Together, the factors in Model 1 explain more than two-thirds of the variance in vocabulary knowledge, which is a very strong relationship. Also, the standard error of estimate for Model 1, 1.25, tells us that 95 percent of the actual vocabulary scores lie within 2.45 points (= 1.96*1.25) of the regression surface. It is instructive to note how large the error is. Even with a very high R², by social science standards, the cases are distributed over nearly half the range of the dependent variable. This suggests the need to exercise considerable caution in interpreting regression estimates.
The intercept, .579, is interpreted as the expected vocabulary score for those with a score of zero on each of the independent variables, that is, for rural-origin females working at manual jobs without any schooling whose fathers had no schooling. This is not a very meaningful value. Although in China there are people who fit this description, in many nations a person with 0 scores on all variables would be beyond the range of the observed data. To achieve a meaningful intercept, it often is useful to reexpress the continuous independent variables as deviations from their mean. If this is done, the intercept is then interpretable as the expected value on the dependent variable for people who are at the mean with respect to each of the continuous variables (and, of course, have scores of 0 with respect to each dichotomous variable). In the present case such a reexpression would give us the expected vocabulary score, in this case 3.30, for rural females working at manual jobs (the 0 values on each of the dichotomous variables) who have average education and whose fathers have average education. Note that such a reexpression of the independent variables has no effect on the regression coefficients, the standard errors, the R², or the standard error of estimate. Only the intercept is affected.

Model 2 includes "cultural capital" as an additional factor. The associated coefficient indicates that, net of all other factors, people raised in households with the highest cultural capital, that is, maximally involved with reading, score almost a full point higher in their knowledge of vocabulary than do people raised in households minimally involved with reading. Although the explained variance is significantly increased (in the next section we consider how to assess the significance of the increment in R²), the increase is hardly important from a substantive point of view. What is important is that the introduction of "cultural capital" reduces the effect of father's education to nonsignificance. This makes clear the reason why, net of the respondent's own education, knowledge of vocabulary is enhanced by the father's education: households with educated fathers tend to be more involved in reading than other households. After the "cultural capital" of the household is taken into account, the father's education has no additional effect on vocabulary skill. The "cultural capital" variable also reduces the size of the "urban origin" effect, which indicates that part of the advantage of urban origins is the tendency of urban households to be more involved with reading than are otherwise similar rural households. None of the other coefficients is much affected by the introduction of cultural capital.
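The claim that reexpressing continuous independent variables as deviations from their means changes only the intercept can be verified on the ten hypothetical cases from the beginning of the chapter (the Chinese survey data are not reproduced here). This is a minimal sketch in Python using only the standard library; the function names are illustrative assumptions.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # father's years of schooling
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # respondent's years of schooling
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # number of siblings


def mean(values):
    return sum(values) / len(values)


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def fit(y, x1, x2):
    """Least-squares intercept and slopes for y = a + b*x1 + c*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b = (s1y * s22 - s2y * s12) / det
    c = (s2y * s11 - s1y * s12) / det
    return mean(y) - b * mean(x1) - c * mean(x2), b, c


# Original metric.
a_raw, b_raw, c_raw = fit(resp, father, sibs)

# Independent variables reexpressed as deviations from their means.
father_dev = [x - mean(father) for x in father]
sibs_dev = [x - mean(sibs) for x in sibs]
a_dev, b_dev, c_dev = fit(resp, father_dev, sibs_dev)

print(f"raw:      a = {a_raw:.2f}, b = {b_raw:.3f}, c = {c_raw:.3f}")
print(f"centered: a = {a_dev:.2f}, b = {b_dev:.3f}, c = {c_dev:.3f}")
```

The slopes are identical in the two fits. The intercept of the centered fit equals the mean of the dependent variable, the expected value for a case at the mean of both independent variables.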
Graphic Representation of Results

Sometimes, for ease of exposition, it is useful to graph the net relationship implied by the model between a given independent variable and the dependent variable. This is easy to do. The trick here is to simplify the estimation equation by substituting the means, or other appropriate values, for the independent variables except the one of interest and collecting them into the constant. This yields the expected value on the dependent variable at each level of the independent variable, holding constant all other independent variables at the specified values. The same procedure can be extended to show separate graphs for each category of a categorical variable, for example, if we were interested in how the relationship between literacy scores and years of schooling implied by Model 2 differed for males and females. For continuous variables the mean is a good choice of the value to substitute into the equation. For dichotomous variables we could substitute either the mean or some suitable value, for example, nonmanual workers from urban origins. Of course, for dichotomous variables, the mean is just the proportion that is "positive" with respect to the variable. Thus, if we substitute means for dichotomous variables, we are not evaluating the equation for any actual person (after all, one cannot be 18 percent urban or 56 percent male); rather, we are evaluating what are, in some sense, the typical circumstances of the population.

To see how the procedure works, let us evaluate the equation in two ways: for nonmanual workers from urban origins, and for the mean values of these variables. In each case, we evaluate the equation separately for males and females to create graphs that show separate lines for males and females. Considering first an equation evaluated for nonmanual workers from urban origins, we have, for females
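$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(1) + .177(1) + .385(0) + .866(.227) \\
&= 1.158 + .393(E)
\end{aligned}
\tag{6.6}
$$

and, for males,

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(1) + .177(1) + .385(1) + .866(.227) \\
&= 1.543 + .393(E)
\end{aligned}
\tag{6.7}
$$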
Having arrived at a pair of bivariate equations, differing only by a constant (= .385, the coefficient associated with "male"), we can simply graph the equations. Figure 6.2 shows the graph, which makes clear the relative magnitude of the education and gender effects net of all other determinants of vocabulary knowledge in China. Clearly, education is far more important than gender, although within levels of education there is a small difference favoring males.

Now, suppose that instead of evaluating the equation for nonmanual workers from urban origins, we evaluated the equation at the means of each of the independent variables (except, of course, education and gender, because we want to display the effects of these two variables). Our equation for females is then

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(.177) + .177(.180) + .385(0) + .866(.227) \\
&= .839 + .393(E)
\end{aligned}
\tag{6.8}
$$

and for males is

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(.177) + .177(.180) + .385(1) + .866(.227) \\
&= 1.224 + .393(E)
\end{aligned}
\tag{6.9}
$$
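The collapsing step in Equations 6.6 through 6.9 is mechanical, so it is easy to script. The sketch below is a minimal illustration, not the author's procedure: it uses the rounded coefficients printed in those equations, so the constants agree with the text only to within rounding (about .001). The dictionary keys and function names are hypothetical, and the means plugged in for the nonmanual and urban dummies (.177 and .18) are read off the equations and should be treated as assumptions.

```python
# Rounded coefficients as printed in Equations 6.6 through 6.9.
COEF = {
    "intercept": 0.546,
    "education": 0.393,
    "father_education": 0.009,
    "nonmanual": 0.21,
    "urban": 0.177,
    "male": 0.385,
    "cultural_capital": 0.866,
}

FATHER_EDUCATION_MEAN = 3.07   # mean years of father's schooling
CULTURAL_CAPITAL_MEAN = 0.227  # mean of the cultural capital scale


def collapse(nonmanual, urban, male):
    """Fold every term except education into one constant.

    Returns (constant, slope) of the implied bivariate equation.
    """
    constant = (
        COEF["intercept"]
        + COEF["father_education"] * FATHER_EDUCATION_MEAN
        + COEF["nonmanual"] * nonmanual
        + COEF["urban"] * urban
        + COEF["male"] * male
        + COEF["cultural_capital"] * CULTURAL_CAPITAL_MEAN
    )
    return constant, COEF["education"]


def expected_score(constant, slope, years):
    """Expected number of characters identified at a given level of schooling."""
    return constant + slope * years


scenarios = {
    "female, nonmanual, urban": collapse(1, 1, 0),
    "male, nonmanual, urban": collapse(1, 1, 1),
    "female, dummies at means": collapse(0.177, 0.18, 0),  # assumed means
    "male, dummies at means": collapse(0.177, 0.18, 1),    # assumed means
}

for label, (constant, slope) in scenarios.items():
    print(f"{label}: {constant:.3f} + {slope:.3f}(E)")
```

Evaluating `expected_score` over a range of years for each scenario gives the points needed to draw the lines.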
Note that the only differences between Equations 6.6 and 6.7 and Equations 6.8 and 6.9 are in the intercepts and also that the difference in the intercepts between each pair of equations is identical. Thus, a graph of Equations 6.8 and 6.9 would be almost identical to Figure 6.2 except that both lines would be shifted down. For this reason I do not bother to show a graph of Equations 6.8 and 6.9. Whether to substitute means or other specific values when evaluating an equation is a matter of judgment that should be decided by the analyst's substantive concerns.

FIGURE 6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). [Horizontal axis: years of school completed.]
Note: The female line does not extend beyond 16 because there are no females in the sample with postgraduate education.
DUMMY VARIABLES

Often situations arise in which we want to analyze the role of categorical variables, such as religious affiliation, marital status, or political party membership, in determining some outcome. Moreover, typically we want to combine categorical variables with interval variables, to study the effect of each controlling for the other. Thus, we need a way to include categorical variables within a regression framework.

To see how this is done, let us revisit the problem we considered in the final section of Chapter Five, on correlation ratios. Recall that we were interested in the relation between religious affiliation and acceptance of abortion, and we analyzed this by estimating the mean number of positive (accepting) responses to a seven-item scale for each of four religious groups (Protestants, Catholics, Jews, and those with other or no religion) from the 2006 General Social Survey (GSS) data. Here we explore a similar substantive problem, but this time using data from the 1974 GSS, because the results for that year are particularly clear-cut and hence more suitable for exposition of the method. (As an exercise, you might want to carry out a similar analysis using the 2006 data.) We start by converting the religious denomination variable into a set of four dichotomous variables, one for each religious group, with each variable scored 1 for persons with that religion and scored 0 otherwise. That is, we define a set of new variables (see the downloadable -do- or -log- file):

$R_1$ = 1 if the respondent is Protestant, and = 0 otherwise
$R_2$ = 1 if the respondent is Catholic, and = 0 otherwise
$R_3$ = 1 if the respondent is Jewish, and = 0 otherwise
$R_4$ = 1 if the respondent has another religion, no religion, or failed to respond, and = 0 otherwise
Variables of this kind are known as dichotomous or dummy variables. Using these variables, we can estimate a multiple regression equation of the form:

$$
\hat{A} = a + \sum_{i=2}^{4} b_i R_i = a + b_2 R_2 + b_3 R_3 + b_4 R_4
\tag{6.10}
$$
where $\hat{A}$ is the number of "pro-choice" responses, that is, positive responses to questions about the circumstances under which legal abortions should be permitted (in 1974 six such questions were asked, so the scale ranges from 0 to 6), and the $R_i$ are as specified in the preceding paragraph.

Note that it is necessary to omit one category from the regression equation to avoid a linear dependency (the situation in which any one independent variable is an exact function of other independent variables); because of the way dummy variables are constructed, with each individual scored 1 on one dummy variable and 0 on all the remaining variables in the set, knowing the value on all but one of the dummy variables allows perfect prediction of the value of the remaining dummy variable. In such situations OLS equations cannot be estimated. Any category may be omitted, but because, as we will see, the coefficients of the dummy variables included in the equation are interpreted as deviations from the value for the omitted, or reference, category, it is best to choose the reference category on substantive grounds: the category against which the analyst wants to contrast other categories. The only exception to the substantive criterion is that very small categories should not be chosen as the omitted category because doing so may create a near-linear dependency among the remaining categories, which could result in unstable numerical estimates of the coefficients. Estimating Equation 6.10, we have
$$
\hat{A} = 3.98 - .33(R_2) + 1.61(R_3) + .88(R_4); \quad R^2 = .045
\tag{6.11}
$$
Now, let us derive the predicted values for each category:

For Protestants:

$$
\hat{A} = a + b_2(0) + b_3(0) + b_4(0) = a = 3.98
\tag{6.12}
$$

For Catholics:

$$
\hat{A} = a + b_2(1) + b_3(0) + b_4(0) = a + b_2 = 3.98 - .33 = 3.65
\tag{6.13}
$$

For Jews:

$$
\hat{A} = a + b_2(0) + b_3(1) + b_4(0) = a + b_3 = 3.98 + 1.61 = 5.59
\tag{6.14}
$$

For Other and None:

$$
\hat{A} = a + b_2(0) + b_3(0) + b_4(1) = a + b_4 = 3.98 + .88 = 4.86
\tag{6.15}
$$
From Equations 6.12 through 6.15, it is evident that the intercept, $a$, gives the expected value for the omitted, or reference, category and that the coefficients associated with each of the dummy variables, the $b_i$, give the difference in the expected value between that category and the omitted, or reference, category. Thus Protestants have an expected score of 3.98; that is, on average they find abortion acceptable for about four of the six conditions. Catholics, by contrast, have an expected score of 3.65, .33 less than Protestants. Jews on average endorse nearly all six items (precisely, 5.59), 1.61 more than Protestants. Finally, the remainder of the sample, the residual category "Other and None," falls midway between Protestants and Jews in their average level of acceptance of abortion.
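The logic of Equations 6.12 through 6.15 can be sketched in a few lines: code each respondent on the three included dummies, then add the intercept to whichever coefficient is switched on. This is only an illustration using the coefficients reported for Equation 6.11; the category labels and function names are hypothetical.

```python
CATEGORIES = ["Protestant", "Catholic", "Jewish", "Other/None"]
REFERENCE = "Protestant"  # omitted category

# Estimates reported for Equation 6.11
INTERCEPT = 3.98
B = {"Catholic": -0.33, "Jewish": 1.61, "Other/None": 0.88}


def dummy_code(religion):
    """Return 0/1 indicators for every category except the reference."""
    if religion not in CATEGORIES:
        raise ValueError(f"unknown category: {religion}")
    return {c: int(religion == c) for c in CATEGORIES if c != REFERENCE}


def predicted_acceptance(religion):
    """Expected number of pro-choice responses for a religious group."""
    dummies = dummy_code(religion)
    return INTERCEPT + sum(B[c] * dummies[c] for c in B)


for group in CATEGORIES:
    print(group, dummy_code(group), round(predicted_acceptance(group), 2))
```

Because the reference group scores 0 on every dummy, its prediction is the intercept alone; each other group's prediction is the intercept plus its own coefficient.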
WHY YOU SHOULD INCLUDE THE ENTIRE SAMPLE IN YOUR ANALYSIS

Although the "Other and None" category is a catchall residual category and is thus rather uninteresting, it is desirable to include such categories in the analysis rather than restricting the analysis to those with "interpretable" religions. The reason for this is that we ordinarily want to generalize to an entire population, not a subset of the population with definable characteristics. If we omit the residual category from the analysis, our estimates of the average population characteristics are likely to be biased, and, what is worse, biased in unknown ways; [...] may also be biased. See the discussion of [...].
Note that the $R^2$ is identical to the correlation ratio, $\eta^2$, that we encountered in the previous chapter. Moreover, the predicted values on the abortion attitudes scale are just the means for each religious group shown in Table 5.1. This follows from the fact that, absent any additional information, the mean is the least-squares prediction of the value of an observation. Thus, the "least-squares" estimates for each religious category are just the subgroup means. So far we seem to have no more than a complicated approach to estimating subgroup means and correlation ratios.

The real value of dummy variables is when they are used in combination with other variables to test the effect of group membership on the dependent variable net of the effects of other variables and also to assess the effect of group membership on the relationship between other variables and the dependent variable (and the effect of other variables on the relationship between group membership and the dependent variable), that is, to assess interactions between the group categories and other variables. To see this, let us continue with our example. Suppose we are interested in assessing the effect of education on acceptance of abortion. Suppose that in addition we want to assess the possibility that religious groups differ in their acceptance of abortion and, moreover, in the relation between education and abortion acceptance, with Catholics tending to oppose abortion regardless of their education, Jews tending to accept abortion regardless of their education, and the remaining two groups becoming more accepting as their education increases. To test these claims, we estimate three successively more complicated regression equations:
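$$
\hat{A} = a + bE
\tag{6.16}
$$

$$
\hat{A} = a' + b'E + \sum_{i=2}^{4} c'_i R_i
\tag{6.17}
$$

$$
\hat{A} = a'' + b''E + \sum_{i=2}^{4} c''_i R_i + \sum_{i=2}^{4} d''_i R_i E
\tag{6.18}
$$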
The first model (Equation 6.16) posits an effect of education but no effect of religion. This model assumes that all religious groups are alike in their acceptance of abortion. The second model (Equation 6.17) posits an across-the-board, or constant, difference between religious groups in their acceptance of abortion but assumes that the relation between education and acceptance of abortion is the same for all religious groups. The third model (Equation 6.18) posits an interaction between education and religion in the acceptance of abortion, or to put it differently, assumes that the religious groups differ in the way education affects abortion acceptance. (The conventional way to represent an interaction in a regression framework is to construct a variable that is the product of the two [or more] variables among which an interaction is posited, although other nonlinear functional forms are sometimes posited.)
A STRATEGY FOR COMPARISONS ACROSS GROUPS

Our first task is to decide among the models represented by Equations 6.16 through 6.18. In this situation, where we are assessing whether groups differ with respect to some social process, we would generally prefer the most parsimonious model (except when we have a strong theoretical reason for positing differences between groups or, as noted in the section on Metric Regression Coefficients earlier in the chapter, when we suspect the possibility of omitted variable bias). That is, in general we should prefer a more complicated model only if it does a significantly better job of explaining variability in our dependent variable, acceptance of abortion. We decide on the preferred model by comparing the variance explained by each model. If a more complicated model explains a significantly larger amount of variance in the dependent variable, we accept that model; if it does not, we accept the simpler model. (This is the classical, or frequentist, approach. The next section provides an alternative approach to model assessment based on Bayesian notions: a comparison of BICs.)

We begin by comparing the first and third models. That is, we contrast a model that assumes that there are no religious differences in acceptance of abortion but only educational differences with a model that assumes that the relation between education and acceptance of abortion differs across religious groups. To assess the significance of the difference in $R^2$, we compute an F-ratio:
$$
F = \frac{(R_B^2 - R_A^2)/m}{(1 - R_B^2)/(N - k - 1)}
\tag{6.19}
$$

where $R_B^2$ is the variance explained by the larger model (here, Equation 6.18); $R_A^2$ is the variance explained by the smaller model (here, Equation 6.16); $N$ is the number of cases; $k$ is the total number of independent variables in the larger model; $m$ is the difference in the number of independent variables between the larger and smaller model; the numerator degrees of freedom = $m$; and the denominator degrees of freedom = $N - k - 1$. In our numerical example we have

$$
F = \frac{(.097 - .053)/6}{(1 - .097)/(1{,}481 - 7 - 1)} = 11.96
\tag{6.20}
$$
with 6 and 1,473 degrees of freedom. To determine whether this F-ratio is significant, we find the p-value corresponding to the numerical value of the F-ratio with the specified numerator and denominator degrees of freedom. If our p-value is smaller than some critical value (.05 is conventional), we reject the null hypothesis (Model 1) in favor of the alternative hypothesis (Model 3). In the present case that is what we are led to do because $F_{(6;\,1,473)} = 11.96$, which implies $p < .0001$.
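The arithmetic of Equations 6.19 and 6.20 can be reproduced with a small helper. The sketch below is illustrative only: the function name `f_ratio` is hypothetical, and it stops at the F-ratio and its degrees of freedom because the Python standard library has no F distribution for converting the ratio to a p-value.

```python
def f_ratio(r2_large, r2_small, n, k, m):
    """F-ratio for the increment in R-squared between two nested models.

    r2_large: R-squared of the larger model
    r2_small: R-squared of the smaller model
    n:        number of cases
    k:        number of independent variables in the larger model
    m:        number of additional variables in the larger model
    Returns (F, numerator df, denominator df).
    """
    df_num = m
    df_den = n - k - 1
    f = ((r2_large - r2_small) / df_num) / ((1 - r2_large) / df_den)
    return f, df_num, df_den


# Model 3 (education, three dummies, three interactions) versus Model 1
# (education only), using the R-squared values reported in the text.
f, df1, df2 = f_ratio(r2_large=0.097, r2_small=0.053, n=1481, k=7, m=6)
print(f"F({df1}, {df2}) = {f:.2f}")
```

The same function serves for any pair of hierarchical models; only the two $R^2$ values, $k$, and $m$ change.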
GETTING p-VALUES VIA STATA

Until fairly recently, p-values were usually looked up in printed tables of the F-distribution; this is now something of an old-fashioned approach. Stata provides a set of built-in statistical tables, including a table of probabilities associated with specific F-ratios. The probability of a given F-ratio can be computed by executing the command -display fprob(df_1, df_2, F)-, where df_1 is the numerator degrees of freedom, df_2 is the denominator degrees of freedom, and F is the calculated F-ratio.
USING STATA TO COMPARE THE GOODNESS-OF-FIT OF REGRESSION MODELS

The F-test for the increment in $R^2$ is equivalent to the Wald test that the coefficients for the set of variables that are included in the larger model but not in the smaller model are not significantly different from zero. Thus software that implements the Wald test as a post-estimation command (for example, -test- and -testparm- in Stata) can be used to carry out the F-test. It also can be shown that when a single variable is added to a regression equation, the t-ratio for the additional variable equals the square root of the F-ratio for the increment in $R^2$, and the t- and F-ratios have identical probability distributions. Thus, when two equations differ by a single variable, they can be contrasted simply by inspecting the significance of the t-ratio, which is routinely provided as part of the regression output.
Having determined that we cannot posit a single model of the relation between education and abortion acceptance for all religious groups, we next investigate whether it is necessary to posit religious group differences in the relation between education and abortion acceptance or whether there are simply across-the-board differences between the religious groups in their acceptance of abortion but a similar relation between education and abortion acceptance for all groups; that is, we ask whether both the slopes and intercepts differ across religious groups or whether only the intercepts differ. To answer this question, we contrast the $R^2$ for Model 3 (Equation 6.18) and Model 2 (Equation 6.17), estimating an F-ratio using Equation 6.19. For our current numerical example, we get
$$
F = \frac{(.097 - .089)/3}{(1 - .097)/(1{,}481 - 7 - 1)} = 4.35
\tag{6.21}
$$
rith 3 and 1,473degreesof freedom.BecauseF3.,n r,  4.35,whichimpliesp : .0046,we qrect thenull hypothesisthatthe relationshipbeiweeneducationandabortionacceDtance
't:*
r^r tv [\
I
QuantitativeDataAnalysis:Doing SociaJResearch to Testtdeas
(1890_j962)wasa Britjsh stdtistician wtn a stronginterest in biology(hewasa founder,w,th Sewa Wfight_see the biosketch o: Wrightin chaptersixteen_and l.B.s.Hald"n",ot tt eoretoaipoj.rlarion genetrcs). He was respon, siblefof majoradvances in experimental design, introaudng ih;iot,onof rundo, urrignment o, cases to different treatments andshowjng ho* to ,re of vaijance, whrchheinventecl_the Fdistribution "nulysr:, is nam€dafterF.herto utt"t, tt . .ontrioutioi otlacn ot reverat racto*in dete._ mininganoutcome' a procedure thatgreatry enhanced thepoweroiexperrmentar designs. Hearsc invenied theconcept oflhe maximum iikerihood andmademajorciontributions to statisticar procedurcsforassessing sma'sampres Histextstat^f/Lal Method,fo, ner"urcn work"rs,firstpubrishec in 1925'wasverywideJy used, espe'iairy asa handbook forthedesijnandanarysis of experimenrs andranthroughfoufteenedtons, tnetatest published in 1970 Fisher wasbornjn London,thesonof an urt O"at"tunOuraioneer.Hewasa precociotrs s.u.ent,wrnningthe NeerdMedar(a cornpetitive essayin mathematics) at Harrowschoore: the ageof sixteen. (Because of hispooreyesight, n" *rr'iri"*, ," mathematics withoutthe aidof paperandpen,whjch
#
asopposed tou,ins urs"boi.d;uu",loi;,T_::::1,"";H:.[1T:",rJ"il::ff:n
mathematical resultswithoul
ics atcambridse, rr.";r;;'"'"rHt"?,"":1,::r}:ffi:,,11:;t::'i::ff'ru;*:l::
entrst in thearmyduringWorldWartbut was rejected becauie of rrrspooreyesrgnt, andthe. spentseverar yearsteachingmalhematics in secondary school.at tne enciof ttrewar,he wa: ofiereda positionat the Galt(
*
but because o{hssril;J; ;:,f::l,:]"fi :*:Jiilffi::liT::,T,ll :,.i:n:i culturalexperimental il: station(RothamsteO, *n*" n" o",nn appointedprofes_
t
sorof Eugenics at university "rn.,rud"r*u co,ege Londonin '1933and thento the Baltourchair of Genetic,. jn 1943.After r at Cambridge
his rire as asenior,","";;,iiilnl,iil.J#;"n:,/;'L,ii.lrlilii illlffir"::::Organjzation in Adelaide, jmportunt Australia. Fishers .ontriOrfions
lo oorngenetjcs andst;_ rrstrcs areemphasized bythe remarkof theweJlknown statjstici"n, , i"on.rC I Savage (1976): occasionally meetgeneticists who askmewhetherit istrue,nu,af]" i** n"n"ocrstR.A. Fishewasd'soan,noortdnt slatislictan
;
I
:
:j5#:1""i"::ii,llt,_?:::.""f.
burrhat.rhe groups difrerin rheiracross_the br.,
ffi::liJ;T::.?::'::Ti:Y::l'.:xii:"*'i""r'ipo't"i''i'i'ii,.'*ili'li nr r,,,, u", r,.a,,, i"",il. f;;;il;.:il,li,iillJllI::l#." ;::l.,ljl1.1,,, of religion educution drd .ho.ion ilcLeprrnl.e j"i i',::l:j::l::' li'''tron\hrp bcr\r. .rirr.r,,acrossreligious groups.Thus,rn sum,our:: Ierredmo.letis nna+h.i ""."* ..^,. .. .lt":tt
rrrrsrul duecr it'orrton attlr,. and that the effect of education vartes by religion (and. necessarily as well, that the ei:, ol relision varies variec by h., education). ^.r,,^.:^..
lt!
introduction to Multiprecorreration and Regression (ordinarvLeastsquares) 127
TABLE 6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.

Variable          Model 1    Model 2         Model 3         Model 3'
[...]
Catholic                     -.373 (.111)    1.059 (.455)    -.371 (.111)
Jewish                       1.341 (.282)    [...]           [...]
Other or None                .747 (.184)     [...]           .702 (.187)
[...]
Intercept         [...]      [...]           [...]           [...]

Note: Model 3' is identical to Model 3 except that in Model 3' years of schooling (education) is expressed as a deviation from the mean years of schooling.
The conventional practice is to report the estimated coefficients for each model, not merely the preferred model. These are shown in Table 6.3. Let us see how to interpret each of these models. Model 1 is just a two-variable regression equation of the sort we encountered in the previous chapter; nothing further needs to be said here.

As we have noted, Model 2 posits the same relationship between education and abortion acceptance for all religious groups but across-the-board differences between religious groups in the level of acceptance of abortion among those who have a given level of education. What is meant by an "across-the-board" difference is clarified by writing out Equation 6.17 separately for each religious group. For Protestants, we have
$$
\hat{A} = a + b(E)
\tag{6.22}
$$

For Catholics, we have

$$
\hat{A} = a + b(E) + c_2 = (a + c_2) + b(E)
\tag{6.23}
$$

For Jews, we have

$$
\hat{A} = a + b(E) + c_3 = (a + c_3) + b(E)
\tag{6.24}
$$

For Others, we have

$$
\hat{A} = a + b(E) + c_4 = (a + c_4) + b(E)
\tag{6.25}
$$
From Equations 6.22 through 6.25 it is evident that Model 2 (Equation 6.17) implies that the religious groups differ in their intercepts but not in the slopes relating education to abortion acceptance. If Model 2 were our preferred model, we could conclude that each year of education resulted, on average, in a .125 increase in abortion acceptance for people of all religions, so that, for example, college graduates (sixteen years of schooling) would be expected to accept abortion for one reason more than would those of the same religion with only a primary school education (eight years of schooling). And we would expect Jews to agree that abortion ought to be permitted for 1.3 reasons more than Protestants, on average, and Catholics for .4 reasons fewer than Protestants, on average. In short, interpretation of the coefficients for Model 2 is straightforward, and the net effects of education and of religious group membership can be assessed separately.

However, although the size of the coefficient for each religious group can be interpreted individually, it generally is not meaningful to assess the significance of individual coefficients because each coefficient indicates the difference between the expected value for the given category and the expected value for the omitted category net of all other factors. Therefore, a significant t-ratio merely indicates that a coefficient is significantly different from the implied coefficient of zero for the omitted category, and which coefficients are shown as significant in one's computer output is entirely dependent upon the choice of omitted, or reference, category. Thus, the appropriate procedure is to assess the significance of the entire set of dummy variables representing a given categorical variable by computing an F-test of the increment in $R^2$ for models including and excluding the set of dummy variables corresponding to a single classification (or equivalently, the Wald test that the set of coefficients is jointly = 0).
HOW TO TEST THE SIGNIFICANCE OF THE DIFFERENCE BETWEEN TWO COEFFICIENTS

There may be occasions in which an analyst wants to assess the significance of the difference between two specific categories of a dummy variable classification. In this case, it is possible to make use of the formula:

$$
t = \frac{b_i - b_j}{\left(\mathrm{var}(b_i) + \mathrm{var}(b_j) - 2\,\mathrm{cov}(b_i, b_j)\right)^{1/2}}
$$

where $b_i$ and $b_j$ are the two coefficients being compared. Most statistical packages permit the estimation of the variance-covariance matrix of coefficients. Of course, in these days of high-speed computing, it probably is easier to simply reestimate the model, redefining the reference category. Stata provides an even easier way to compare coefficients, by computing a Wald test that $b_i = b_j$.
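The formula for the t-ratio of a difference between two coefficients can be written as a short function. In the sketch below the coefficients and standard errors are the Jewish and Catholic entries as they appear for Model 2 in Table 6.3; the covariance between the two estimates is not reported in the text, so the value used here is purely hypothetical and chosen only to make the example run. The function name is also hypothetical.

```python
import math


def t_for_difference(b_i, b_j, var_i, var_j, cov_ij):
    """t-ratio for the difference between two coefficients.

    var_i and var_j are sampling variances (squared standard errors);
    cov_ij is the sampling covariance of the two estimates.
    """
    standard_error = math.sqrt(var_i + var_j - 2 * cov_ij)
    return (b_i - b_j) / standard_error


b_jewish, se_jewish = 1.341, 0.282       # Model 2 entries in Table 6.3
b_catholic, se_catholic = -0.373, 0.111  # Model 2 entries in Table 6.3
cov_hypothetical = 0.004                 # NOT from the text; illustration only

t = t_for_difference(
    b_jewish,
    b_catholic,
    se_jewish ** 2,
    se_catholic ** 2,
    cov_hypothetical,
)
print(f"t = {t:.2f}")
```

In practice the variances and covariance would be read from the estimated variance-covariance matrix of the coefficients rather than typed in by hand.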
When interaction terms are involved, the requirements are even more stringent. Not only must the significance of all of the associated coefficients be assessed simultaneously, but the coefficients themselves must be interpreted together rather than individually. Consider Model 3, which includes interaction terms between education and religious group membership. It helps to write out Equation 6.18 separately for each religious group. For Protestants, we have
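$$
\hat{A} = a + b(E) = 2.18 + .155(E)
\tag{6.26}
$$

For Catholics, we have

$$
\begin{aligned}
\hat{A} &= a + b(E) + c_2 + d_2(E) \\
&= (a + c_2) + (b + d_2)E \\
&= (2.18 + 1.06) + (.155 - .121)E \\
&= 3.24 + .034(E)
\end{aligned}
\tag{6.27}
$$

For Jews, we have

$$
\begin{aligned}
\hat{A} &= a + b(E) + c_3 + d_3(E) \\
&= (a + c_3) + (b + d_3)E \\
&= (2.18 + 3.20) + (.155 - .140)E \\
&= 5.38 + .015(E)
\end{aligned}
\tag{6.28}
$$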
For Others, we have

$$
\begin{aligned}
\hat{A} &= a + b(E) + c_4 + d_4(E) \\
&= (a + c_4) + (b + d_4)E \\
&= (2.18 + .53) + (.155 + .014)E \\
&= 2.71 + .169(E)
\end{aligned}
\tag{6.29}
$$

Again, it is evident from Equations 6.26 through 6.29 that Equation 6.18 allows both the slopes and the intercepts to vary between groups. The coefficients associated with the dummy variables, the $c_i$, indicate the differences in the intercept between the reference category and each of the explicitly included categories, whereas the coefficients associated with the interaction terms, the $d_i$, indicate the differences in the effect of education (the slope) between the reference category and each of the explicitly included categories.

From Equations 6.26 through 6.29 it should be clear that for equations of the form of Equation 6.18, no overall summary of the effect of a variable (in this case education or religious group membership) is possible but only the effect of each combination of education and religious group membership. Specifically, the coefficient .155 in Model 3 of Table 6.3 (and in Equation 6.26) does not refer to the overall effect of education but rather to the effect of education among Protestants; and so on for the other coefficients.

Because Equation 6.18 is a saturated model (it includes all possible interactions among the independent variables; in the present case, all possible interactions between education and the religious group categories), it is mathematically equivalent to estimating separate equations for each religious group. Equations 6.26 through 6.29 show this equivalence: the coefficients resulting from rewriting Equation 6.18 as Equations 6.26 through 6.29 and collecting terms are identical to the coefficients obtained by estimating the equation separately for each religious group (you might want to persuade yourself of this by carrying out the computation both ways). The advantage of estimating Equation 6.18 is that it permits an explicit test of the hypothesis that the groups differ, through the F-test shown as Equation 6.19.
Because the conversion of Equation 6.18 to Equations 6.26 through 6.29 is a fairly tedious hand operation, especially when the number of variables is large, we often estimate equations of the form of Equation 6.18 to obtain the $R^2$ required for the F-test (or the coefficients required for the Wald test) but then estimate separate equations for each group and report them in a separate table.

From Table 6.3, and even from Equations 6.26 through 6.29, it is difficult to interpret the relationships among religion, education, and abortion attitudes. With equations of this sort, graphing the relationships is often useful. Figure 6.3, which can be constructed in the same way as Figure 6.2 (see the downloadable -do- or -log- file for details), shows the level of abortion acceptance expected from education and religious group membership. Inspecting the graph, it is evident that Jews are highly accepting of abortion regardless of their level of education, that Catholics are relatively unaccepting of abortion regardless of their level of education, and that abortion acceptance varies strongly by education for Protestants and "Others," with the poorly educated similar to Catholics and the well educated similar to Jews.
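The "tedious hand operation" of turning the interaction model into one line per group amounts to adding each group's dummy coefficient to the intercept and its interaction coefficient to the slope. The sketch below does this for the rounded Model 3 coefficients used in Equations 6.26 through 6.29; the names are hypothetical, and the resulting points are what one would plot to draw a figure like Figure 6.3.

```python
# Rounded Model 3 coefficients as used in Equations 6.26 through 6.29.
A = 2.18    # intercept (Protestants, the reference category)
B = 0.155   # education slope for Protestants
C = {"Catholic": 1.06, "Jewish": 3.20, "Other": 0.53}       # dummy effects
D = {"Catholic": -0.121, "Jewish": -0.140, "Other": 0.014}  # interactions


def group_line(group):
    """Return (intercept, slope) of the education line for one group."""
    if group == "Protestant":
        return A, B
    return A + C[group], B + D[group]


def expected_acceptance(group, years):
    """Expected number of pro-choice responses at a given level of schooling."""
    intercept, slope = group_line(group)
    return intercept + slope * years


for group in ["Protestant", "Catholic", "Jewish", "Other"]:
    intercept, slope = group_line(group)
    points = [round(expected_acceptance(group, e), 2) for e in (8, 12, 16)]
    print(f"{group}: {intercept:.2f} + {slope:.3f}(E); at 8, 12, 16 years: {points}")
```

Note that the slope for each non-reference group is simply the sum $b + d_i$, so a small interaction coefficient means a line nearly parallel to the Protestant line.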
FIGURE 6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481). [Horizontal axis: years of school completed; separate lines for Protestants, Catholics, Jews, and Other/none.]

Reexpressing Variables as Deviations from Their Means

Apart from graphing the relations among variables, as in Figure 6.3, we can use one other device to render the coefficients in models containing interaction terms more readily interpretable: we can reexpress the continuous variables as deviations from their means, just as we saw in the earlier example predicting knowledge of Chinese characters. The advantage of doing this here is that the group effects (the main effects of the dummy variables for groups) can then be interpreted as indicating the expected differences among groups with respect to the dependent variable for persons who are at the average with respect to the interval-level independent variable or variables.

In the present context, this means reexpressing years of school completed by subtracting the sample mean from each observation. (The reexpressed coefficients are shown in Table 6.3 as Model 3'.) The intercept then gives the expected value on the abortion scale among Protestants with average education (where the mean is computed over the entire sample, not just Protestants). The coefficients associated with each of the dummy variables then give the difference in the expected level of pro-choice sentiment between Protestants and the specified category among persons with average education. Note that the slope associated with years of schooling (which, as noted, gives the effect of schooling for Protestants) is unchanged, as are the coefficients associated with the interaction terms; only the group intercepts change. However, the interpretation of the coefficients is greatly facilitated: we see that among those with average education, Protestants endorse about 4 of the 6 abortion items on average, Jews an additional 1.5 items, "Others" an additional .7 items, and Catholics about .4 items fewer. We also see that each additional year of schooling increases the expected endorsements of Protestants by .155, and of those without religion by about the same amount because the difference in the slopes is only .014; by contrast, education matters little for Jews and Catholics because the deviations from the Protestant slope are negative and almost as large as the Protestant slope.
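Why centering changes only the intercept and the dummy coefficients follows from substituting $E = E^* + \bar{E}$ into the interaction model: the intercept becomes $a + b\bar{E}$, each dummy coefficient becomes $c_i + d_i\bar{E}$, and the slopes are untouched. The sketch below applies that algebra to the rounded Model 3 coefficients. The sample mean of schooling is not stated in this part of the text; the value of 11.8 years used here is an assumption chosen for illustration, so the results should be expected to match the reported Model 3' values only approximately.

```python
# Rounded Model 3 coefficients (education in raw years).
A = 2.18
B = 0.155
C = {"Catholic": 1.06, "Jewish": 3.20, "Other": 0.53}
D = {"Catholic": -0.121, "Jewish": -0.140, "Other": 0.014}

MEAN_EDUCATION = 11.8  # ASSUMED sample mean; not given in the text


def recenter(a, b, c, d, mean):
    """Coefficients after expressing education as a deviation from `mean`.

    Substituting E = E* + mean gives:
      new intercept  = a + b * mean
      new dummy c_i  = c_i + d_i * mean
    The slope b and the interaction terms d_i are unchanged.
    """
    new_a = a + b * mean
    new_c = {group: c[group] + d[group] * mean for group in c}
    return new_a, b, new_c, dict(d)


a_dev, b_dev, c_dev, d_dev = recenter(A, B, C, D, MEAN_EDUCATION)

print(f"Protestants at average education: {a_dev:.2f}")
for group, value in c_dev.items():
    print(f"{group} minus Protestants at average education: {value:+.2f}")
```

Under this assumed mean the sketch gives roughly 4.0 for Protestants and differences of about -.4, +1.5, and +.7 for Catholics, Jews, and Others, in line with the interpretation given in the text.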
Testing Additional Hypotheses: Constraining Coefficients to Zero or to Equality

Inspecting Figure 6.3, we might be led to infer that education has no effect on abortion attitudes for Catholics and Jews and the same effect for Protestants and "Others." How can we formally test the correctness of our inference? We can do this by estimating an equation of the form:

$$
\hat{A} = a + \sum_{i=1,3,4} b_i R_i + c(ER_1 + ER_4)
\tag{6.30}
$$
where, in this case, Catholics are the omitted category. To see how this equation represents the particular hypothesis of interest, we can again write out Equation 6.30 separately for each religious group.

For Protestants:

$$
\hat{A} = (a + b_1) + c(E)
\tag{6.31}
$$

For Catholics:

$$
\hat{A} = a
\tag{6.32}
$$

For Jews:

$$
\hat{A} = a + b_3
\tag{6.33}
$$

For Others:

$$
\hat{A} = (a + b_4) + c(E)
\tag{6.34}
$$
As is evident from inspection of Equations 6.31 through 6.34, under the specification of Equation 6.30 each religious group differs in the intercept; the slope relating education to abortion acceptance is zero for Catholics and Jews; and the slope is identical for Protestants and "Others." To test whether this constrained specification is an adequate representation of the data, we cannot compute the increment in $R^2$ for Model 3 relative to the constrained model (Equation 6.30) because the two models do not stand in a hierarchical relationship to each other: there is no main effect for education in the constrained model. So, what to do? Fortunately, a solution is available.
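The constraints in Equation 6.30 are imposed entirely through how the regressors are built: education enters only through a single variable that equals years of schooling for Protestants and "Others" and zero for everyone else. The sketch below constructs that row of regressors and evaluates predictions; the coefficient values are invented solely to show the pattern (flat lines for Catholics and Jews, a shared slope for the other two groups) and are not estimates from the text.

```python
def constrained_row(religion, years):
    """Regressors for Equation 6.30, with Catholics as the omitted category.

    Returns [1, R1, R3, R4, E*R1 + E*R4].
    """
    r1 = int(religion == "Protestant")
    r3 = int(religion == "Jewish")
    r4 = int(religion == "Other")
    pooled_education = years * r1 + years * r4  # one slope shared by two groups
    return [1, r1, r3, r4, pooled_education]


def predict(coefs, religion, years):
    """Predicted value given coefficients ordered [a, b1, b3, b4, c]."""
    row = constrained_row(religion, years)
    return sum(coef * x for coef, x in zip(coefs, row))


# HYPOTHETICAL coefficients [a, b1, b3, b4, c], for illustration only.
coefs = [3.6, -1.6, 1.9, -1.0, 0.15]

for religion in ["Protestant", "Catholic", "Jewish", "Other"]:
    low = predict(coefs, religion, 8)
    high = predict(coefs, religion, 16)
    print(f"{religion}: 8 years -> {low:.2f}, 16 years -> {high:.2f}")
```

Because the education term is a single summed variable, its one coefficient is forced to be the common slope for Protestants and "Others," while the absence of any education term for Catholics and Jews forces their slopes to zero.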
f=
turl
N, ' l!W5.1 :ef:r
.u,r rd !E
GfuI Fnlt
(OrdinaryLeastSquares) 133 Introduction to Multiplecorrelationand Regression )nl\ ion'
1lon Io s
i.10t
?re€pa
6.i I t
6.32)
6.33)
6.3,1) ration duca:al for 4uate :lative ) each o do?
A BAYESIAN ALTERNATIVE FOR COMPARING MODELS

We can exploit an alternative way of contrasting models, the Bayesian Information Criterion (BIC), introduced into the sociological literature by the statistician Adrian Raftery for log-linear analysis (1986) and generalized to a variety of applications in an important article in Sociological Methodology (1995a; see also the critical comment by Gelman and Rubin [1995], the appreciative comment by Hauser [1995], and Raftery's reply to both [1995b], and also the February 1999 issue of Sociological Methods and Research, which is devoted entirely to an assessment of BIC). In a sense BIC operates on the opposite principle from classical tests of significance. It is a likelihood ratio measure that tells us which model is most likely to be true given the data (for a brief introduction to maximum likelihood estimation, see Appendix 12.B); classical inference, by contrast, tells us how likely it is that the observed data could have been generated by sampling error given that some theoretical model (the null hypothesis) is true.

BIC has three important advantages over the F-test introduced previously. First, unlike the F-ratio, BIC can be used to compare nonhierarchical models. Any two models purporting to describe the same phenomenon can be contrasted. Second, BIC builds in a correction for large samples, whereas if the sample is large enough virtually any increment in R² will be significant, no matter how small and substantively unimportant. A larger increment in R² is required to generate a particular BIC value for large samples than would be required for small samples. Thus, BIC reflects the conventional advice to choose a smaller probability value when the sample is large. Third, BIC penalizes large models. That is, if it takes the introduction of many additional variables to generate much of an increase in R², BIC is more likely than the F-test to lead us to prefer the simpler model. There are several specific ways to calculate BIC, depending on the particular statistic being analyzed. To compare regression models, we can use Raftery's Equation 26:

\[ BIC_k = N[\ln(1 - R_k^2)] + p_k[\ln(N)] \tag{6.35} \]

where R²_k is the value of R² for Model k, p_k is the number of independent variables for Model k, and N = the number of cases being analyzed. A negative value of BIC indicates that a specified model is more likely to be true than the baseline model of no association between the independent variables and the dependent variable. To compare two models, we estimate BIC for each of them and choose the model with the more negative BIC. Raftery (1995a, Table 6) gives a rule of thumb for comparing BICs: a BIC difference of 0 to 2 constitutes "weak" evidence for the superiority of one model over another; a difference of 2 to 6 constitutes "positive" evidence; a difference of 6 to 10 constitutes "strong" evidence; and a difference of >10 constitutes "very strong" evidence. However, because BIC tends to increase as the sample size increases, Raftery's rule of thumb is most appropriate for relatively small samples.

ALTERNATIVE WAYS TO ESTIMATE BIC

Even for a given statistic, there are alternative versions of BIC. I prefer Raftery's formulas because they build in a comparison to a baseline model. Thus, I have written a small -do- file, -bicreg.do-, to calculate BIC following Raftery:

    *BICREG.DO (Updated version for Stata.)
    *Compute BIC from saved results from regression.
    *Drop any earlier copy of the variable; -capture- suppresses the error if none exists.
    capture drop bic
    *e(N) = number of cases, e(r2) = R-squared, e(df_m) = number of independent variables.
    gen bic = e(N)*ln(1 - e(r2)) + e(df_m)*ln(e(N))
    *Note: BIC is the same for all observations.
    *List BIC for any observation.
    list bic in 1

Invoke -bicreg- immediately following your regression command. However, Stata 10.0 now offers BIC as a postestimation statistic. To have Stata calculate BIC without using my -do- file, invoke the command -estat ic- immediately following your regression command. The numerical value of each BIC will differ from mine, but the difference between the BICs for alternative models will be identical regardless of which version of BIC is calculated.

To see how BIC is used, let us compute BIC values for the three models shown in Table 6.3. For Model 1, we have

\[ BIC_1 = 1{,}481 \times \ln(1 - .053) + 1 \times \ln(1{,}481) = -73.4 \tag{6.36} \]

For Model 2, we have

\[ BIC_2 = 1{,}481 \times \ln(1 - .089) + 4 \times \ln(1{,}481) = -108.4 \tag{6.37} \]

For Model 3, we have

\[ BIC_3 = 1{,}481 \times \ln(1 - .097) + 7 \times \ln(1{,}481) = -100.3 \tag{6.38} \]
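The arithmetic in Equations 6.36 through 6.38 can be sketched in a few lines of Python. This is an illustration of Equation 6.35 only, not a replacement for the author's -bicreg.do-; the function name `bic_from_r2` is hypothetical, and the R² values are the rounded figures quoted in the text.

```python
import math


def bic_from_r2(r2, n_predictors, n_cases):
    """BIC relative to the no-association baseline (Equation 6.35)."""
    return n_cases * math.log(1.0 - r2) + n_predictors * math.log(n_cases)


# (R-squared, number of independent variables) as quoted for the 1974 data.
models = {
    "Model 1": (0.053, 1),
    "Model 2": (0.089, 4),
    "Model 3": (0.097, 7),
}

bics = {name: bic_from_r2(r2, p, 1481) for name, (r2, p) in models.items()}

# The preferred model is the one with the most negative BIC.
preferred = min(bics, key=bics.get)

for name, value in bics.items():
    print(f"{name}: BIC = {value:.1f}")
print("Preferred:", preferred)
```

Because the inputs are R² values rounded to three places, the results differ from the published figures by up to about half a point, but the ordering of the models is the same.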
From a comparison of the BICs for the three models, we are led to conclude that the data are most consistent with Model 2, which posits the same effect of education on abortion attitudes for all religious groups and an across-the-board difference (that is, a difference that holds at each level of education) in abortion acceptance for the various religious groups. From the size of the BIC differences, we conclude that the data "very strongly" favor Model 2 over Model 1 and "strongly" favor Model 2 over Model 3.

Note that these results are inconsistent with the results we obtained previously through a comparison of R²s via an F-test. What are we to make of this? There is no definitive answer. My advice is, first, go with theory. If you have a theoretical reason to prefer one model over the other, choose that one. This advice is consistent with one of Weakliem's (1999) criticisms of BIC, that BIC assumes a "unit prior": "BIC is an approximation of the Bayes factor, which involves a comparison of the posterior likelihood of models, where the posterior likelihood is simply the product of the data likelihood and the researcher's prior. The researcher then chooses that model with the greatest likelihood; that is, the model that has the highest probability of being the true model given the researcher's priors and the data" (Winship 1999a, 356). If there is no clear reason to expect a departure from the null hypothesis, a "unit prior" (which amounts to saying we have little information about the likely outcome) is appropriate. But if we have strong theoretical reasons to expect a relationship, BIC can be too conservative. In this case, classical inference would seem to be the preferred tool unless we were to modify BIC in ways that go beyond this course. We will discuss likelihoods in Chapters Twelve and Thirteen.

Absent a strong theory, go for parsimony, which is what BIC generally does. In the present case, I would be inclined to prefer Model 3 because I think there are good reasons to expect Catholics and Jews to have consistent reactions to abortion regardless of their level of education (Catholics because abortion is prohibited by the Church and Jews because, still in 1974 even if less so today, the Jewish community was socially liberal, and Jews lacking education tended to be immigrants who had the values of educated persons) and to expect Protestants and Others to be more accepting if they are better educated (because of the increasing sophistication that education brings). But if I did not have a strong, coherent explanation for the religious difference, I would then prefer Model 2.

We can, of course, also compute BIC for the constrained model derived from the data:

\[ BIC_{\text{constrained}} = 1{,}481 \times \ln(1 - .096) + 4 \times \ln(1{,}481) = -121.0 \]

which is more negative than the BIC for any of Models 1 through 3 and thus "very strongly" suggests that, for these data, the constrained model is to be preferred.
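Raftery's rule of thumb for BIC differences lends itself to a small helper. The sketch below is only an illustration: the function name `raftery_grade` is hypothetical, and the handling of differences that fall exactly on a boundary (2, 6, or 10) is an assumption, since the rule as quoted does not say which category they belong to.

```python
def raftery_grade(bic_a, bic_b):
    """Grade the evidence for whichever model has the more negative BIC.

    Cut points follow the rule of thumb quoted from Raftery (1995a, Table 6).
    Assumption: a difference exactly at a cut point goes to the higher grade,
    except that "very strong" requires a difference greater than 10.
    """
    diff = abs(bic_a - bic_b)
    if diff < 2:
        return "weak"
    if diff < 6:
        return "positive"
    if diff <= 10:
        return "strong"
    return "very strong"


# BIC values as reported in the text for the 1974 data.
print(raftery_grade(-108.4, -73.4))    # Model 2 vs. Model 1
print(raftery_grade(-108.4, -100.3))   # Model 2 vs. Model 3
print(raftery_grade(-121.0, -108.4))   # constrained model vs. Model 2
```

The three comparisons reproduce the verbal grades used in the text: "very strong," "strong," and "very strong," respectively.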
INDEPENDENT VALIDATION

Note that I said that for these data the constrained model is to be preferred. This is because we arrived at a new preferred model based on our inspection of the data rather than from a priori theory. Thus, we are vulnerable to the possibility that we are simply capitalizing on sampling error. To arrive at a definitive preference for the constrained model, we need to show that it is the preferred model in an independent data set. If our sample size permitted, we would want to carry out all of our exploratory analysis using half of the data and then to reestimate our final model (and its competitors) using the other half of the data. The GSS provides a close approximation to this ideal because it repeats identical questions in successive surveys conducted using the same sampling procedures. Thus, it is reasonable to treat adjacent surveys as independent samples drawn from the same population, at least for phenomena not subject to short-term fluctuation. The implication of this is that we can carry out all of our exploratory analysis for one year and then use the data from the previous or subsequent year to validate our conclusions.
TABLE 6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).

[Table body illegible in the source. The rows recoverable from it are Model 3 and, under "Contrasts," Model 3 vs. Model 1, Model 3 vs. Model 2, and Constrained vs. Model 2.]
Here we can exploit the GSS in just this way, reestimating the four models of pro-choice attitudes using data from the 1973 GSS. Insofar as we can assume that abortion attitudes did not change in the population between 1973 and 1974, reestimating the models using the data from 1973 constitutes an independent test of the claim that the "constrained" model is the preferred model. Table 6.4 shows BIC and R² values based on the 1973 data for all four models and contrasts between models wherever meaningful and appropriate. The outcomes are, in fact, just the same as for 1974: Model 3 is preferred to Model 1 and Model 2 by the criteria of classical statistical inference, whereas by the BIC criterion Model 2 is preferred to Model 3; and by the BIC criterion the constrained model is the most preferred. Thus, we can conclude that our preference for the constrained model, derived from inspection of the data, is valid.
WHAT THIS CHAPTER HAS SHOWN

In this chapter you have learned how to carry out multiple regression and correlation analysis and how to interpret the resulting coefficients, considering a worked example on the determinants of literacy in China. We then focused on the manipulation of dummy variables (sets of dichotomous variables that represent categorical variables), including especially interactions between dummy variables and other variables, using as a worked example [...]

CHAPTER 7

MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS

WHAT THIS CHAPTER IS ABOUT

This chapter presents various "tricks" for dealing in a multiple regression framework with specific analytic problems faced by social researchers. The Stata -do- and -log- files for all the worked examples in this chapter are available as downloadable files. Specifically, we consider nonlinear transformations of both dependent and independent variables; ways to test the equality of coefficients within an equation; how to assess the assumption of linearity in a relationship, with a trend analysis as a worked example; how to construct and interpret linear splines as a way of representing abrupt changes in slopes; alternative ways of expressing dummy variable coefficients; and a procedure for decomposing the difference between two means.
NONLINEAR TRANSFORMATIONS

Often when doing regression analysis, we have reason to suspect that the relationship between particular independent variables and the dependent variable is nonlinear. Hence, an estimate of the linear relationship between the independent and dependent variables would not properly represent the relationship in the sample under study. You have seen an example of this kind in (c) of Figure 5.4 in Chapter Five, which shows a perfect parabolic relationship between two variables but which produces a slope and correlation of zero when estimated by a linear regression equation. Fortunately, there is a simple solution to problems of this kind: you can transform one or more variables so that the dependent variable is a linear function of the independent variables. Here are several examples, together with some interpretive tricks.

Curvilinear Relationships: Age and Income

In cross-sectional data income commonly increases with age up to a point in the middle of the career and then begins to fall. A reasonable way to represent this is to estimate an equation of the form

\[ \hat{Y} = a + b(A) + c(A^2) \tag{7.1} \]

where Ŷ = annual income, A = age, and A² = A*A. In the 2004 General Social Survey (GSS), the estimated values for this equation are (for people age 20 to 64 with information on personal income; N = 1,573; the open-ended upper interval, $110,000 per year or more, was recoded to $150,000; the remaining income intervals were recoded to their midpoints):

\[ \hat{Y} = -49{,}139 + 3{,}777(A) - 35.95(A^2); \quad R^2 = .084 \tag{7.2} \]

which can be represented graphically, as shown in Figure 7.1.
WHY THE RELATIONSHIP BETWEEN INCOME AND AGE IS CURVILINEAR

There are several possible explanations for the curvilinearity of the relationship between income and age, of which the two major ones are the following:

• Economists argue that productivity increases with age up to a point and then falls; sociologists sometimes make similar arguments but also point out that various institutional factors, such as the greater difficulty older workers have in returning to work after layoffs, result in the same observed pattern.

• The cross-sectional observation may simply be an artifact of a cohort progression of earnings, with successive cohorts earning more at any given age than their seniors, and the earnings of all workers continuing to rise throughout the career.
FIGURE 7.1. The Relationship Between 2003 Income and Age, U.S. Adults Twenty to Sixty-Four in 2004 (N = 1,573).

Inspecting the graph, we see that income peaks in the early fifties at roughly $50,000 per year. Without the graph, however, this is difficult to discern because the coefficients of Equation 7.2 do not lend themselves to direct interpretation. It is possible, however, to reexpress the equation in a form that has a more straightforward interpretation. It can be shown that in the equation

\[ \hat{Y} = m + c(F - A)^2 \tag{7.3} \]

where m = a − b²/4c and F = −b/2c (with the coefficients on the right side taken from Equation 7.1), m is the maximum income, and F is the age at which the maximum income is attained. In the present case, the numerical estimates are

\[ \hat{Y} = 50{,}066 - 35.95(52.53 - A)^2; \quad R^2 = .084 \tag{7.4} \]
Equations 7.2 and 7.4, of course, yield the same graph because they are equivalent expressions. But Equation 7.4 also tells us precisely what the maximum income is ($50,066) and that this peak is attained between fifty-two and fifty-three years of age (precisely, 52.53). An equivalent transformation is possible for equations containing additional independent variables. Consider an equation of the form

\[ \hat{Y} = a + b(A) + c(A^2) + d(Z) \tag{7.5} \]

where Z is some other independent variable, and the remaining variables are as before. We could then represent the relation between A and Y, net of Z, by substituting the mean of Z, \(\bar{Z}\), so that

\[ \hat{Y} = (a + d\bar{Z}) + b(A) + c(A^2) \tag{7.6} \]

or, equivalently,

\[ \hat{Y} = m + c(F - A)^2 \tag{7.7} \]

where, in this case,

\[ m = (a + d\bar{Z}) - b^2/4c \tag{7.8} \]

and F is as before.
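The reexpression of a quadratic in terms of its turning point is easy to automate. The following Python sketch applies m = a − b²/4c and F = −b/2c; the function name `vertex_form` is hypothetical, and the intercept used in the example (−49,139) is assumed here because it is the value implied by the other coefficients together with Equation 7.4.

```python
def vertex_form(a, b, c):
    """Return (m, F) such that a + b*A + c*A**2 equals m + c*(F - A)**2."""
    if c == 0:
        raise ValueError("no squared term: the relationship is linear")
    turning_point = -b / (2.0 * c)       # F: value of A at the maximum or minimum
    extreme_value = a - b ** 2 / (4.0 * c)  # m: expected value of Y at that point
    return extreme_value, turning_point


# Coefficients of the age-income equation (intercept assumed, see the lead-in).
a, b, c = -49139, 3777, -35.95
m, F = vertex_form(a, b, c)
print(f"peak income of about {m:,.0f} at age {F:.2f}")

# Both forms give the same prediction at any age, for example age 30.
age = 30
original = a + b * age + c * age ** 2
reexpressed = m + c * (F - age) ** 2
```

For an equation with an additional covariate Z, as in Equation 7.5, the same function can be called with the intercept replaced by a + d times the mean of Z, which is what Equation 7.8 amounts to.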
Semilog Transformations: Income

A useful transformation when predicting income is the semilog transformation; that is, instead of predicting income, we predict the natural log of income. This has three advantages.

First, economic theories about what generates income tend to make predictions in terms of log income. Specifically, human capital theory takes income as determined by an investment process (Mincer and Polachek 1974). Hence, insofar as we take such theories seriously or are interested in testing them seriously, we probably should predict income in its log form.

Second, income tends to be distributed lognormally in the United States and other advanced industrial societies, so the log of income is distributed normally, a convenient property.

Third, and most important, when the dependent variable is in (natural) log form, the metric regression coefficients can be interpreted as indicating approximately the proportional increase in the dependent variable associated with a one-unit increase in the independent variable, for b less than about 0.2. To see this, consider the equation

\[ \ln(\hat{Y}) = a + b(X) \tag{7.9} \]

Now consider two individuals who differ by one unit with respect to X: that is, X₂ = X₁ + 1. Then

\[ \ln(\hat{Y}_1) = a + b(X_1) \tag{7.10} \]

and

\[ \ln(\hat{Y}_2) = a + b(X_2) \tag{7.11} \]

So, subtracting,

\[ \ln(\hat{Y}_2) - \ln(\hat{Y}_1) = (a - a) + b(X_2 - X_1) = b \tag{7.12} \]

But we know from the properties of logs that

\[ \ln(\hat{Y}_2) - \ln(\hat{Y}_1) = \ln(\hat{Y}_2/\hat{Y}_1) \tag{7.13} \]

So we have

\[ \ln(\hat{Y}_2/\hat{Y}_1) = b \tag{7.14} \]

Then, exponentiating both sides (that is, making each an exponent of e), we have

\[ \hat{Y}_2/\hat{Y}_1 = e^b \tag{7.15} \]

Now let us look at the relationship of b to e^b, for various values of b.

      b      e^b          b      e^b
    0.01    1.01       -0.01    0.99
    0.05    1.05       -0.05    0.95
    0.10    1.11       -0.10    0.90
    0.15    1.16       -0.15    0.86
    0.20    1.22       -0.20    0.82
    0.30    1.35       -0.30    0.74
    0.40    1.49       -0.40    0.67
    0.50    1.65       -0.50    0.61
We see that for b less than about |0.2|, b is a good approximation to the expected proportional increase in Y for a one-unit increase in X. For larger values of b, b underestimates the proportional increase in Y.

To see how to interpret such results, consider the effect of education and hours worked on ln(income), by sex, using the 2004 GSS. We estimate a model of the form

\[ \ln(\hat{I}) = a + b(E) + c(H) + d(M) \tag{7.16} \]

where I = income in 2003, E = years of school completed, H = hours worked per week, and M = 1 for males and = 0 for females. (Note that although the present analysis is restricted to people with incomes, it is common to add a small constant, say 1, to the value of the dependent variable to ensure that zero values are not dropped; such transformed variables are known as "started logs" [Tukey 1977]. See the discussion of tobit analysis in Chapter 14 for an alternative way of dealing with zero values.) The estimated equation, based on 1,459 cases with complete data, is

\[ \ln(\hat{I}) = 7.41 + .125(E) + .0207(H) + .335(M); \quad R^2 = .257 \tag{7.17} \]

This equation tells us that each year of schooling would be expected to increase income by about 12 percent, within gender categories, among those working an equal number of hours per week. Correspondingly, each additional hour worked per week would be expected to increase income by 2.1 percent, within gender categories, among those with equal education. Finally, among those with the same education working an equal number of hours, men would be expected to earn about 40 percent more than women. Here the coefficient understates the male advantage because e^.335 = 1.398. This reminds us that b may only be directly interpreted as indicating the expected percentage increase for b < |0.2|. For larger bs, we should actually calculate the exponent.

Negative coefficients have the same interpretation. For example, a coefficient of −0.05 indicates that a one-unit increase in the independent variable would be expected to yield a 5 percent decrease in the dependent variable; that is, the expected value of the dependent variable would be 95 percent as large. Also, for b < −0.2, the percent loss will be smaller than implied by the coefficient. So, again, we should compute the exponentiated value.

Note that the equation expresses a linear relationship between the independent variables and the natural log of income, not income itself. This is evident from inspection of a graph of the relationship between education and ln(income), evaluated separately for males and females at the mean number of hours worked per week by all workers, males and females combined (42.67). The relationship is, of course, linear and the expected values for the two sexes differ only by a constant, as shown in Figure 7.2.

However, when we graph the expected relationship between income and education, the relationship is curvilinear and the lines are no longer parallel (Figure 7.3).
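The advice to "actually calculate the exponent" can be followed with a one-line function. This Python sketch converts a semilog coefficient into the exact expected percentage change implied by Equation 7.15; the function name `percent_change` is hypothetical, and the coefficients are those of Equation 7.17.

```python
import math


def percent_change(b):
    """Exact expected percentage change in Y for a one-unit increase in X
    when ln(Y) is the dependent variable: 100 * (e**b - 1)."""
    return 100.0 * (math.exp(b) - 1.0)


# Coefficients from the estimated ln(income) equation.
coefficients = {"years of schooling": 0.125, "hours worked": 0.0207, "male": 0.335}

for name, b in coefficients.items():
    print(f"{name}: b = {b}, exact change = {percent_change(b):.1f} percent")

# A negative coefficient: the loss is slightly smaller than the coefficient suggests.
loss = percent_change(-0.05)
```

The exact figure for the male coefficient is just under 40 percent, matching the text; for small coefficients such as that for hours worked, the exact figure and the coefficient itself are nearly indistinguishable.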
FIGURE 7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined = 42.7 (N = 1,459).

FIGURE 7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
Suppose we have an equation involving a logged dependent variable and a squared term among the independent variables. How do we interpret the coefficients? Consider an equation of the form

\[ \ln(\hat{Y}) = a + b(X) + c(X - k)^2 \tag{7.18} \]

One way to interpret this equation is to find the first derivative of ln(Ŷ) with respect to X and to evaluate it for appropriate values of X. From first-year calculus, we have

\[ \frac{d\,\ln(\hat{Y})}{dX} = b + 2cX - 2ck \tag{7.19} \]

Some analysts hesitate to include a variable and its square in the same equation because such variables tend to be highly correlated. To reduce the correlation between a variable and its square, analysts sometimes subtract a constant before squaring. It can be shown that subtracting b/2, where b is the slope of the regression of X² on X, renders X and (X − b/2)² orthogonal (see Treiman and Roos 1983, 621).

However, the derivative gives the rate of change at a point rather than the effect of a one-unit increase in X. To obtain the latter, we can proceed as before, comparing two individuals who differ by one unit with respect to X, so that X₂ = X₁ + 1. Then

\[ \ln(\hat{Y}_1) = a + b(X_1) + c(X_1 - k)^2 \tag{7.20} \]

and

\[ \ln(\hat{Y}_2) = a + b(X_2) + c(X_2 - k)^2 \tag{7.21} \]

\[ \phantom{\ln(\hat{Y}_2)} = a + b(X_1 + 1) + c((X_1 + 1) - k)^2 \tag{7.22} \]

Then, subtracting Equation 7.20 from Equation 7.22, we have

\[ \ln(\hat{Y}_2) - \ln(\hat{Y}_1) = b + 2cX_1 - 2ck + c \tag{7.23} \]

But because

\[ \ln(\hat{Y}_2) - \ln(\hat{Y}_1) = \ln(\hat{Y}_2/\hat{Y}_1) \tag{7.24} \]

if we exponentiate both sides of Equation 7.23, we have

\[ \hat{Y}_2/\hat{Y}_1 = e^{(b + 2cX_1 - 2ck + c)} \tag{7.25} \]

Thus, Equation 7.25 gives the expected proportional increase in Y for a one-unit increase in X, evaluated at X₁. So, if we set X₁ equal to its mean, Equation 7.25 gives the proportional increase in Y for two individuals, one of whom is at the mean on variable X and the other one unit above it.
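Equation 7.25 can be checked numerically against a direct comparison of two predicted values. In the Python sketch below, the function names and all coefficient values are hypothetical, chosen only to illustrate the algebra; none of them comes from the text.

```python
import math


def predicted_log_y(x, a, b, c, k):
    """ln(Y-hat) under the specification ln(Y) = a + b*X + c*(X - k)**2."""
    return a + b * x + c * (x - k) ** 2


def proportional_change(x1, b, c, k):
    """Ratio Y2/Y1 for a one-unit increase in X starting at x1 (Equation 7.25)."""
    return math.exp(b + 2 * c * x1 - 2 * c * k + c)


# Hypothetical coefficients and evaluation point, for illustration only.
a, b, c, k = 8.0, 0.10, -0.002, 12.0
x1 = 10.0

# Direct route: exponentiate the difference between the two predicted logs.
direct = math.exp(predicted_log_y(x1 + 1, a, b, c, k) - predicted_log_y(x1, a, b, c, k))
# Shortcut route: plug x1 into Equation 7.25.
shortcut = proportional_change(x1, b, c, k)

print(f"direct = {direct:.6f}, shortcut = {shortcut:.6f}")
```

Note that the intercept a drops out of the ratio, so the expected proportional change depends only on b, c, k, and the starting value of X.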
lobility
Effects
i:rpose we want to testthe Durkheimianhypothesisthat extremesocialmobility, either rrn ard or downward,leadsto anomie.If we are willing to considerthe effectof upward rc.l downwardmobility as symrnetrical,we might estimatean equationofthe form A : ct+ b(P)+ c(P) + d(P  PF)2
(7.26)
r:ere A = the scoreon an anomiescale,P, : the prestigeof the respondent'sfather,s :csupation,and P : the prestigeof the respondent'soccupation.(Note that this specifi:=on of the hypothesisassumesthat it applies Io intergenerationalmobility, that :cupational mobility is a good indicatorof socialmobility andprestigea good measure :i.:rcupational status,and that extrememobility shouldbe most heavily weightedby the difference.In a substantiveanalysis,all of theseassumptions needto bejusnring :i3d explicitly,not merelypresentedwithoutjustification.)A significantlypositivecoef:crent d indicatesthat anomie increasesas the discrepancybetweenrespondent'sand :.\er's occupationalprestigeincreases,controlling for the level of both respondent's r:l tather'sprestige.Thus d indicatesthe effect of mobility per se, controlling for the ::ect of statuslevel. It is necessaryto control for statuslevel becauseanomiemay be :Jtedto origin or destinationstatusentirelyapartfrom any effectof mobility. Of course,many othertransformations of variablescanbe usedto representdifferent r\ial processes. For someexamples,seeGoldberger(1968,Chapter8), Treiman(1970), cd Stoltzenberg(1974, 1975).
TESTING THE EQUALITY OF COEFFICIENTS

Sometimes situations arise in which we want to determine whether two coefficients within the same equation are of equal size. You have already seen an example in the previous chapter, in the discussion of Equation 6.30. Here we consider an additional example. Suppose we are interested in assessing the effect of parental education on respondent's education and, in particular, in deciding whether the mother's or the father's education has a stronger effect. The hypothesis that educational transmission through the mother is stronger than through the father arises from the observation that mothers spend more time with their children than do fathers and hence are putatively more important socializing agents. The alternative hypothesis, that the father's education has a stronger effect, derives from the claim that the father's socioeconomic characteristics largely determine the family's socioeconomic status. Because education involves opportunity costs, it may well be that those whose fathers are poorly educated will be more likely to leave school early to switch from being a drain on the family financial resources to being an economic contributor to the family.

Armed with these two competing hypotheses, we might then estimate the regression of years of school completed on father's and mother's years of school completed. From the 1980 GSS I estimated an equation of the form

\[ \hat{E} = a + b(E_F) + c(E_M) \tag{7.27} \]

where E = respondent's years of schooling, E_F = father's years of schooling, and E_M = mother's years of schooling. (I chose the 1980 GSS data to illustrate how to test the significance of an apparent true difference. The 2004 data yield virtually identical coefficients for mother's and father's education. Assessing cross-temporal trends in the relative effects of parental education and the reasons for such trends might yield an interesting paper.) Estimating Equation 7.27, with N = 985, yields

\[ \hat{E} = 7.87 + .209(E_F) + .269(E_M); \quad R^2 = .313 \tag{7.28} \]
This result appears to support the claim that mother's education has a somewhat stronger effect on educational attainment than does father's education. It is possible, however, that this result arises simply from sampling variability. How can we find out? The trick is to force the coefficients for mother's and father's education to be equal and then to assess whether the R² for the unconstrained equation (7.28) is significantly larger than the R² for the constrained equation. We constrain the coefficients to equality by estimating an equation of the form:

\[ \hat{E} = a + b(E_P) \tag{7.29} \]

where E_P = E_F + E_M. Thus, we have

\[ \hat{E} = a + b(E_P) = a + b(E_F + E_M) = a + b(E_F) + b(E_M) \tag{7.30} \]

Note that defining a variable as the sum of the years of schooling of the mother and father is equivalent, with respect to testing the hypothesis, to defining a variable as the mean of the years of schooling of the mother and father. If the mean were specified, the coefficient would simply double in size. In the present case the sum is more readily interpretable because it retains the metric of the separate measures for mother and father. Estimating the equation, we get

\[ \hat{E} = 7.93 + .236(E_P); \quad R^2 = .312 \tag{7.31} \]

Next we compare the two models. First, we do an F-test of the equality of the coefficients, which is equivalent to testing the significance of the increment in R². This can be done very easily in Stata by using the -test- command. In fact, we don't even need to construct the constrained model. We simply issue the command -test paeduc=maeduc- after estimating the unconstrained model. This yields an F-value of 1.40, which is not significant (p = .236); note that because we have a two-tailed test (we have hypotheses expecting either mother or father to have greater influence), the conventional significance level required to reject the null hypothesis is .025. An alternative way of comparing the models is, of course, to compare the BICs for the two models. To do this we need to estimate the constrained model. The two BICs, estimated in the usual way, are

    Unconstrained:  -355.4
    Constrained:    -360.9

In this case both the BIC and the F-test favor the constrained model of no difference. This general strategy can be applied to a wide variety of substantive problems.
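The F-test of an equality constraint is the familiar test of the increment in R² between a constrained and an unconstrained model. The Python sketch below implements the standard formula; the function name `f_for_r2_increment` is hypothetical, and the R² values and sample size in the example are invented for illustration, not taken from the GSS analysis.

```python
def f_for_r2_increment(r2_full, r2_restricted, k_full, k_restricted, n):
    """F-ratio for the increment in R-squared between nested regressions.

    k_full and k_restricted are the numbers of independent variables in the
    unconstrained and constrained models; n is the number of cases.
    Degrees of freedom: (k_full - k_restricted) and (n - k_full - 1).
    """
    if k_full <= k_restricted:
        raise ValueError("the full model must have more independent variables")
    numerator = (r2_full - r2_restricted) / (k_full - k_restricted)
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# Hypothetical example: forcing two coefficients to be equal removes one
# independent variable, so the test has 1 and n - 3 degrees of freedom.
f_value = f_for_r2_increment(r2_full=0.300, r2_restricted=0.298,
                             k_full=2, k_restricted=1, n=1000)
print(f"F = {f_value:.2f} with 1 and {1000 - 2 - 1} degrees of freedom")
```

In practice the R² values should be carried to more decimal places than are usually printed, because the increment being tested is often smaller than the rounding error in three-digit values.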
TREND ANALYSIS: TESTING THE ASSUMPTION OF LINEARITY

As the GSS has matured, it has become an increasingly valuable resource for the study of cross-temporal trends. Because many questions have been asked in exactly the same way since the first GSS was conducted in 1972, it is possible to pool the data from all years to study a variety of trends. Moreover, if no cross-temporal variations are detected, the data for all years can be treated as a sample of the U.S. population in the late twentieth century to generate sufficient cases to study relatively small subsets of the population. The simplest trend model (apart from a model of no change over time) is that there is a linear trend over time with respect to the outcome of interest. As a first step, it is useful to contrast such a model with a model that posits year-to-year variations in the outcome, what Sorokin many years ago (1927) described as "trendless fluctuations." To do this, we estimate two models:

\[ \hat{Y} = a + bT \tag{7.32} \]

\[ \hat{Y} = a' + b'T + \sum_j c_j Z_j \tag{7.33} \]

where T is a linear representation of time (here, the year of the survey), and the Z_j are dummy variables for each year the survey was conducted; note that two dummy variables must be omitted because the linear term uses up one degree of freedom. We then compare the two models in the usual way, via an F-test of the significance of the increment in R² and a comparison of BIC values. A convenient way to do the first in Stata is to estimate Equation 7.33 and then to test the hypothesis that all the Z_j are zero, via a Wald test using Stata's -test- command. (Note that Equation 7.33 is simply a different parameterization of an equation in which the linear term is omitted and only the dummy variables are included. The coefficients will, of course, differ. But the predicted values, R², and BIC will be identical.) If we conclude that a simple linear trend does not fit the data, we might then posit either a model with a smooth curve, by introducing a squared term for T, or a model that tries to model particular historical events by grouping years into historically meaningful groups and identifying each group (less one) via a dummy variable, or a spline model (see the section "Linear Splines" later in the chapter). Because the variance explained by Equation 7.33 is the maximum possible from any representation of time (measured in years), the R² associated with Equation 7.33 can be used as a standard against which to assess, in substantive rather than strictly statistical terms, how close various sociologically motivated constrained models come to fully explaining temporal variation in the dependent variable.

Although, to simplify the exposition, I have not included any variables in the model other than time, a model actually posited by a researcher typically would include a number of covariates (other independent variables) and also, perhaps, interactions between the covariates and the variables representing time. Exactly the same logic would apply to such an analysis as to the simpler analysis just described; the logic is also identical to the dummy variable approach to the assessment of group differences described in the previous chapter (although here the "groups" are years or, if warranted by the analysis, multiyear historical periods).
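Setting up the predictors for Equation 7.33 requires one dummy variable per survey year, less two. The Python sketch below builds them from a list of survey years; the function name `trend_design` is hypothetical, and the choice to omit the first and last survey years is an assumption made for illustration (any two years could be omitted).

```python
def trend_design(years):
    """Build the predictors for a linear trend plus year-dummy model.

    Returns the linear time variable T (the survey year itself) and a dict
    mapping each retained survey year to its 0/1 dummy variable. Two years
    (here the first and the last) are omitted: one as the reference category
    and one because the linear term uses up a degree of freedom.
    """
    distinct = sorted(set(years))
    if len(distinct) < 3:
        raise ValueError("need at least three survey years to add year dummies")
    retained = distinct[1:-1]
    dummies = {year: [1 if obs == year else 0 for obs in years] for year in retained}
    return list(years), dummies


# Five hypothetical respondents drawn from four survey years.
survey_years = [1974, 1975, 1977, 1978, 1975]
T, Z = trend_design(survey_years)
print("T:", T)
for year, column in Z.items():
    print(year, column)
```

With four survey years the function returns two dummy variables, so the model with dummies uses three degrees of freedom for time, the same number as a model with a dummy for every year but one and no linear term.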
Predicting Variation in Gender Role Attitudes over Time: A Worked Example

Four items on attitudes regarding gender role equality were asked in most years of the GSS between 1974 and 1998. The four variables are shown here with the percentage endorsing the pro-equality position, pooled over all years in which all four questions were asked:

■ Do you agree or disagree with this statement? Women should take care of running their homes and leave running the country up to men (74 percent disagree).
■ Do you approve or disapprove of a married woman earning money in business or industry if she has a husband capable of supporting her? (77 percent approve).
■ If your party nominated a woman for President, would you vote for her if she were qualified for the job? (84 percent say yes).
■ Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women (63 percent disagree).
To form a gender-equality scale, I simply summed the pro-equality responses for the four items, excluding all people to whom the questions were not asked and treating other nonresponses as negative values. The point of treating "don't know" and similar responses as negative values rather than excluding them is to save cases. But this would not be wise if there were not substantive grounds for doing so; in this case, it seemed reasonable to me to treat "don't know" as something other than a clear-cut endorsement of gender equality.
IN SOME YEARS OF THE GSS, ONLY A SUBSET OF RESPONDENTS WAS ASKED CERTAIN QUESTIONS

Users of the GSS need to be aware that to increase the number of items that can be included in the GSS each year, some items are asked only of subsets of the sample. A convenient way to exclude people who were not asked the questions is to use the Stata rmiss option under the egen command to count the number of missing data responses and then to exclude people missing data on all items included in a scale. However, in the current analysis I excluded all those who lacked responses on any of the four items because some, but not all, of the questions were asked in some years.
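The scale construction and exclusion rule just described can be sketched as follows. This is an illustration only: the response codes (`PRO`, `ANTI`, `DONT_KNOW`, `NOT_ASKED`) and the function name `equality_scale` are hypothetical and do not correspond to actual GSS variable codes.

```python
# Hypothetical response codes, for illustration only (not GSS codes).
NOT_ASKED = None    # item was not administered to this respondent
DONT_KNOW = "dk"    # "don't know" or a similar nonresponse
PRO, ANTI = "pro", "anti"


def equality_scale(items):
    """Count pro-equality answers (0-4); None if any item was not asked."""
    if any(answer is NOT_ASKED for answer in items):
        return None  # respondent is excluded from the analysis
    # "Don't know" is kept but does not count as a pro-equality response.
    return sum(1 for answer in items if answer == PRO)


respondents = [
    [PRO, PRO, PRO, PRO],
    [PRO, DONT_KNOW, ANTI, PRO],
    [PRO, NOT_ASKED, PRO, PRO],
]
scores = [equality_scale(r) for r in respondents]
print(scores)
```

The third respondent is dropped because one item was not asked, whereas the second is retained with the "don't know" counted as a non-endorsement.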
Estimating equations such as Equations 7.32 and 7.33 suggests significant nonlinearities in attitudes regarding gender inequality. The increment in $R^2$ implies F = 3.54 with 11 and 21,448 d.f., which has a probability of less than 0.0001. However, the BIC for the linear trend model is more negative than the BIC for the annual variability model (the BICs are, respectively, -959 and -871), suggesting that a linear trend is more likely given the data. Because BIC and classical inference yield contradictory results, a sensible next step is to graph annual variations in the mean level of support for gender equality, to see whether there is any obvious pattern to the nonlinearity. If substantively sensible deviations from linearity are observed, the annual variation model might be accepted, or a new model, aggregating years into historically meaningful periods, might be posited (keeping in mind the dangers of modifying your hypotheses based upon inspection of the data; see the discussion of this issue at the end of Chapter Six), or a smooth curve or spline function might be fitted to the data.

Figure 7.4 shows both the linear trend line and annual variations in the mean. Inspecting the graph, it appears that deviations from linearity are neither large nor systematic. Given this, I am inclined to accept a linear trend model as the most parsimonious representation of the data, despite the F-test results. The linear trend is, in fact, quite substantial, implying an increase of 0.81 (= .0338*(1998 - 1974)) over the quarter of a century for which we have data; this is about 20 percent of the range of the scale and is about two-thirds of the standard deviation of the scale scores. Apparently, support for gender equality has been increasing modestly but steadily throughout the closing years of the twentieth century.

FIGURE 7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464). [Plot of the mean scale score by year of survey, showing the linear trend line and the mean for each year.]

From a technical point of view, it may be helpful to compare the estimates implied by the two alternative ways of representing departures from linearity: Equation 7.33 and the alternative specification that does not include a linear term for year. When the linear term is included, two dummy variable categories are dropped, rather than one, because the linear term uses up one degree of freedom. However, the two procedures produce identical results, as is evident from inspection of Table 7.1.

Unfortunately, there is no simple correspondence between the coefficients in equations of the form of Equation 7.33 and deviations from the predictions of the linear equation. If you want to show annual departures from linearity, you need to construct a new variable, which is the difference between the predicted values for each year from Equation 7.32 and Equation 7.33. This is easily accomplished in Stata using the foreach or forvalues command.
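The construction of annual departures from linearity can be sketched in Python as well. The example below is a minimal illustration on simulated data (the survey years, sample size, and coefficients are invented), and it assumes a model with no covariates other than time, in which case the predicted value from the annual-dummies model for a given year is simply that year's mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with hypothetical survey years.
survey_years = np.array([1974, 1975, 1977, 1978, 1982, 1983, 1985])
year = rng.choice(survey_years, size=1500)
y = 2.0 + 0.03 * (year - 1974) + rng.normal(0.0, 1.0, size=year.size)

# Linear trend model (the analogue of Equation 7.32), fitted by OLS.
t = (year - 1974).astype(float)
X = np.column_stack([np.ones(t.size), t])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

departures = {}  # annual prediction minus linear prediction, by year
counts = {}      # number of cases in each year
for yr in survey_years:
    in_year = year == yr
    annual_pred = y[in_year].mean()       # prediction of the annual model
    linear_pred = a + b * (yr - 1974)     # prediction of the linear model
    departures[int(yr)] = annual_pred - linear_pred
    counts[int(yr)] = int(in_year.sum())

for yr in departures:
    print(yr, round(departures[yr], 3))
```

A useful check on such a variable is that the departures, weighted by the number of cases in each year, sum to zero, because the linear model's residuals sum to zero.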
TABLE 7.1. [Title and layout are garbled in the source. The legible entries are predicted-value computations from Equation 7.33 for selected survey years.]

1975: a + b(1975) + c_1975 = -71.68578 + 0.0375872(1975) + 0.0403799 = 2.5893
1977: a + b(1977) + c_1977 = -71.68578 + 0.0375872(1977) - 0.1136418 = 2.5105

LINEAR SPLINES

Sometimes we encounter situations in which we believe that the relationship between two variables changes abruptly at some point on the distribution of the independent variable, so that neither a linear nor a curvilinear representation of the relationship is adequate. For example, alcohol consumption may have no impact on health below some threshold, whereas above the threshold health declines in a linear way as alcohol consumption increases. Temporal trends also may abruptly change, as a result of policy changes or cataclysmic events such as depressions, wars, revolutions, and so on. In cases of this kind, it is useful to represent the relationships via a set of connected line segments, known as linear splines.

A Worked Example: Trends in Educational Attainment over Time in the United States

Consider changes in the average level of education over time. Figure 7.5 presents a scatter plot relating educational attainment to year of birth, estimated from the GSS. To create this graph, I combined data from all years between 1972 and 2004, dropped those born prior to 1900 because of the very small sample sizes, and dropped those less than age twenty-five at the time of the survey because many people do not complete their schooling until their mid-twenties. The graph is "jittered" to make it readable and is shown for a 5 percent sample of cases. There appears to be an increase in education over time, but its form is difficult to discern: is it linear or is the trend better represented by some other functional form?

To discover how the increase occurred, it helps to plot the mean years of schooling for each birth year. Figure 7.6 shows such a plot, made with the same specifications as the scatter plot. Inspecting the graph, it appears that the average education increased in a more or less linear way for those born between 1900 and 1947 but then leveled off. Because the graph bounces around a bit, probably because of the relatively small number of cases for each birth year, it might be better to smooth it with a three-year moving average plot (Figure 7.7; see the do and log files). Inspecting this graph, we come to the same conclusion: there is a fairly abrupt change in the trend, with those born in the first half of the twentieth century (precisely, until 1947) experiencing a fairly steady year-by-year increase in their schooling, but those born in 1947 or later experiencing no change at all.

FIGURE 7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).

This suggests that the trend in educational attainment is appropriately represented by a linear spline with a knot at 1947, where "knot" refers to the point at which the slope changes. This specification can be represented by an equation of the form:

$$\hat{E} = a + b_1(B_1) + b_2(B_2) \qquad (7.34)$$
where $B_1$ = the year of birth for those born in 1947 or earlier and = 1947 otherwise, and $B_2$ = the year of birth - 1947 for those born after 1947 and = 0 otherwise. More generally, a spline function relating $Y$ to $X$ with segments $\nu_1, \ldots, \nu_{n+1}$ and knots at $k_1, k_2, \ldots, k_n$ can be represented by

$$\hat{Y} = a + b_1(\nu_1) + b_2(\nu_2) + \cdots + b_{n+1}(\nu_{n+1}) \qquad (7.35)$$

where $\nu_1 = \min(X, k_1)$, $\nu_2 = \max(\min(X - k_1, k_2 - k_1), 0)$, ..., $\nu_{n+1} = \max(X - k_n, 0)$ (see Panis [1994]; the entry for Stata's mkspline command [StataCorp 2007]; and Greene [2008]). Each slope coefficient is then the slope of the specified line segment. We can see this concretely by going back to our example, Equation 7.34, and evaluating the equation separately for those born in 1947 or earlier and those born after 1947. For those born in 1947 or earlier, we have

$$\hat{E} = a + b_1(B) + b_2(0) = a + b_1(B) \qquad (7.36)$$

and for those born later than 1947, we have

$$\hat{E} = a + b_1(1947) + b_2(B - 1947) = (a + 1947b_1) + b_2(B - 1947) \qquad (7.37)$$

FIGURE 7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).

FIGURE 7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
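The construction of the spline variables in Equation 7.35 can be sketched in Python. This is a minimal illustration rather than a reimplementation of Stata's mkspline; the function name `linear_spline_basis` is hypothetical, and the knots are assumed to be supplied in ascending order.

```python
import numpy as np


def linear_spline_basis(x, knots):
    """Return the columns v_1 ... v_(n+1) of Equation 7.35.

    x is a sequence of values and knots an ascending list k_1 ... k_n.
    """
    x = np.asarray(x, dtype=float)
    knots = list(knots)
    cols = [np.minimum(x, knots[0])]                  # v_1 = min(X, k_1)
    for lo, hi in zip(knots[:-1], knots[1:]):
        # Middle segments: max(min(X - k_(i-1), k_i - k_(i-1)), 0)
        cols.append(np.maximum(np.minimum(x - lo, hi - lo), 0.0))
    cols.append(np.maximum(x - knots[-1], 0.0))       # v_(n+1) = max(X - k_n, 0)
    return np.column_stack(cols)


# One knot at 1947, as in Equation 7.34: columns are B_1 and B_2.
birth_year = np.array([1940, 1947, 1950])
basis = linear_spline_basis(birth_year, [1947])
print(basis)
```

For a person born in 1950 the two columns are 1947 and 3, so the expected value is a + 1947b_1 + 3b_2, which matches Equation 7.37. The columns of the basis always sum to the original variable.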
Notice that the intercept in Equation 7.37 is just the expected level of education for those born in 1947 and that $b_2$ gives the slope for those born after 1947. Thus, the expected level of education for those born in 1948 is just the expected level of education for those born in 1947 plus $b_2$; for those born in 1949 it is the expected level of education for those born in 1947 plus $2b_2$; and so on.

Estimating Equation 7.34 from the pooled 1972-2004 GSS data yields the coefficients in Table 7.2. By inspecting the BICs for three models (the spline model, a linear trend model, and a model that allows the expected level of schooling to vary year by year) it is evident that the linear spline model is to be preferred. Note, however, that a comparison of $R^2$s indicates that by the criterion of classical inference, the model positing year-by-year variation in the level of schooling fits significantly better than the spline model. I am inclined to discount this result because it has no theoretical justification, is clearly inferior by the BIC criterion, and occurs simply as a consequence of the large sample size. Thus, I accept the linear spline model as the preferred model.

TABLE 7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).

[Most of the table body is garbled in the source. Legible entry: slope for birth years 1947-1979 = .0092 (s.e. = .0024). The model-comparison panel is not legible.]

AN ALTERNATIVE SPECIFICATION OF SPLINE FUNCTIONS

An alternative specification represents the slope of each line segment as a deviation from the slope of the previous line segment. In this specification, a different set of new variables is constructed. Suppose there are $k$ knots, that $X$ is the original variable, and that $\nu_1, \ldots, \nu_{(k+1)}$ are the constructed variables. Then

$\nu_1 = X$
$\nu_2 = X - k_1$ if $X > k_1$; = 0 otherwise
...
$\nu_{(k+1)} = X - k_k$ if $X > k_k$; = 0 otherwise

To see this concretely, consider the present example, specifying a knot at birth year 1947 in the trend in educational attainment. We would estimate the equation $\hat{E} = a + b_1(\nu_1) + b_2(\nu_2)$, where $\nu_1$ = birth year (= $X$), and $\nu_2 = X - 1947$ if $X > 1947$ and = 0 otherwise. Then for those born in 1947 or earlier,

$$\hat{E} = a + b_1(X) + b_2(0) = a + b_1(X)$$

while for those born later than 1947

$$\hat{E} = a + b_1(X) + b_2(X - 1947)$$

Thus, for those born in 1948, the expected level of education is given by $(a + 1948b_1) + b_2$; for those born in 1949 it is $(a + 1949b_1) + 2b_2$; and so on. From this, it is evident that $b_2$ gives the deviation of the slope from the slope for the previous line segment. For useful discussions of these methods, see Smith (1979) and Gould (1993).

The coefficients for the line segments indicate that for people born in 1947 or earlier, there is an expected increase of .086 years of schooling for each successive birth cohort. Thus, people born twelve years apart would be expected to differ on average by about a year of schooling. However, for people born in 1947 or later, there is essentially no trend in educational attainment; the coefficient .0092 implies that it would take about a century for average schooling to increase by one year. This is a somewhat surprising result, especially because there have been substantial increases in schooling among disadvantaged minorities [...] and also among women. However, as Mare (1995) notes, educationally disadvantaged proportions of the population have grown over time relative to the White majority. Disaggregation of the trend would be worthwhile but is not pursued here; it would make an interesting paper.

The graph implied by the coefficients for the linear spline model is shown in Figure 7.8, together with a 2 percent random sample of observations from each cohort (reduced from 5 percent to 2 percent to make it easier to see the shape of the spline). In this figure the jitter feature in Stata is used to make it clear where in the graph there is the greatest density of points.
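The relationship between the two spline parameterizations can be checked numerically. The sketch below uses simulated data (the sample size, coefficients, and noise level are invented, and birth year is expressed as years since 1900 purely for numerical convenience) to show that the two specifications give identical predicted values and that the second coefficient of the alternative specification equals the difference between the two segment slopes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: x is birth year minus 1900, with a knot at 47 (i.e., 1947).
x = rng.integers(0, 80, size=3000).astype(float)
knot = 47.0
y = (8.0 + 0.09 * np.minimum(x, knot) + 0.01 * np.maximum(x - knot, 0.0)
     + rng.normal(0.0, 3.0, size=x.size))

ones = np.ones(x.size)
above = np.maximum(x - knot, 0.0)

# Segment-slope form (as in Equation 7.34): columns 1, B_1, B_2.
X_segment = np.column_stack([ones, np.minimum(x, knot), above])
# Alternative form: columns 1, X, and X - knot where X exceeds the knot.
X_marginal = np.column_stack([ones, x, above])

b_seg, *_ = np.linalg.lstsq(X_segment, y, rcond=None)
b_mar, *_ = np.linalg.lstsq(X_marginal, y, rcond=None)

print("segment slopes:      ", b_seg[1:].round(4))
print("slope and deviation: ", b_mar[1:].round(4))
```

Because X equals B_1 + B_2, the two design matrices span the same space, which is why the fit is unchanged and only the interpretation of the second coefficient differs.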
FIGURE 7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5; Scatter Plot Shown for 2 Percent Sample). Predicted Values from a Linear Spline with a Knot at 1947.
A Second Worked Example, with a Discontinuity: Quality of Education in China Before, During, and After the Cultural Revolution

The typical use of spline functions is to estimate equations such as the one just discussed, in which all points are connected but the slope changes at specified points ("knots"). However, there are occasions in which we may want to posit discontinuous functions. The Chinese Cultural Revolution is such a case. It can be argued that the disruption of social order at the beginning of the Cultural Revolution in 1966 was so massive that it is inappropriate to assume any continuity in trends. Deng and Treiman (1997) make just such an argument with respect to trends in educational reproduction. They argue that there was then a gradual "return to normalcy" so that changes resulting from the end of the Cultural Revolution in 1977 were not nearly as sharp and were appropriately represented by a knot in a spline function rather than a break in the trend line.

Here we consider another consequence of the Cultural Revolution, the quality of education received (the example is adapted from Treiman [2007a]). Although primary schools remained open throughout the Cultural Revolution, higher level schools were shut down for varying periods: most secondary schools were closed for two years, from 1966 to 1968, and most universities and other tertiary-level institutions were closed for six years, from 1966 to 1972. Moreover, it was widely reported that even when the schools were open, little conventional instruction was offered: rather, school hours were taken up with political meetings and political indoctrination. Rigorous academic instruction was not fully reinstituted until 1977, after the death of Mao. Under the circumstances, we might well suspect that, quite apart from deficits in the amount of schooling acquired by those who were unfortunate enough to be of school age during the Cultural Revolution period, those cohorts also experienced deficits in the quality of schooling compared to those who obtained an equal amount of schooling before or after the Cultural Revolution.

To test this hypothesis, we can exploit the ten-item character recognition test administered to a national sample of Chinese adults that was also analyzed in Chapter Six (see Table 6.2). As before, I take the number of characters correctly identified as a measure of literacy and hypothesize that, net of years of school completed, people who turned age eleven during the Cultural Revolution would be able to recognize fewer characters than people who turned eleven before or after the Cultural Revolution period. Moreover, following Deng and Treiman (1997), I posit a discontinuity in the scores at the beginning but not at the end of the period. To do this, I estimate an equation of the form:

$$\hat{C} = a + b_1(B_1) + b_2(B_2) + b_3(B_3) + c_1(D_1) + d_1(E) \qquad (7.38)$$

where $\hat{C}$ is the expected number of characters recognized; $E$ is years of school completed; $B_1$ = year of birth (last two digits) if born prior to or in 1955 and = 55 if born subsequent to 1955; $B_2$ = 0 if born prior to 1956, = year of birth - 55 if born between 1956 and 1967, inclusive, and = 67 - 55 if born subsequent to 1967; $B_3$ = 0 if born prior to or in 1967 and = year of birth - 67 for those born after 1967; and $D_1$ = 0 for those born prior to or in 1955 and = 1 for those born after 1955. Note that the difference between this and Equation 7.35 is that I include a dummy variable to distinguish those born after 1955 from those born earlier; this is what permits the line segments to be discontinuous at 1955. If I were to have posited a discontinuity at 1967 as well, the equation would be the mathematical equivalent to estimating three separate equations, for the period before, during, and after the Cultural Revolution, in each case predicting the number of characters recognized from years of schooling and year of birth. The advantage of equations such as Equation 7.38 is that they permit the specification of alternative models within a coherent framework and by so doing permit us to select between models.

Estimating this equation yields the results shown for Model 4 in Tables 7.3 and 7.4. As in the previous example, I contrast my theory-driven specification with other possibilities: that there is a simple linear trend in the data; that there are year-by-year variations; that there are knots at both the beginning and the end of the Cultural Revolution, but no discontinuities; that there are discontinuities at both the beginning and the end of the Cultural Revolution; and, for the three spline functions, that there is a curvilinear relationship between year and knowledge of characters during the Cultural Revolution period.
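The birth-year variables defined for Equation 7.38 can be sketched as a small Python function. This is an illustration of the verbal definitions only; the function name `cultural_revolution_terms` is hypothetical, and birth years are given as the last two digits, as in the text.

```python
def cultural_revolution_terms(birth_year):
    """Return (B1, B2, B3, D1) for a two-digit birth year.

    B1, B2, B3 are the spline segments with knots at 55 and 67;
    D1 is the dummy that permits a discontinuity at 1955.
    """
    b1 = min(birth_year, 55)                      # slope through 1955
    b2 = min(max(birth_year - 55, 0), 67 - 55)    # slope for 1956-1967
    b3 = max(birth_year - 67, 0)                  # slope after 1967
    d1 = 1 if birth_year > 55 else 0              # shift for those born after 1955
    return b1, b2, b3, d1


for yob in (50, 55, 56, 60, 67, 70):
    print(yob, cultural_revolution_terms(yob))
```

Dropping `d1` from the model would reduce it to an ordinary connected spline with knots at 1955 and 1967; adding a second dummy for those born after 1967 would give the two-discontinuity alternative.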
TABLE 7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).

[The table body is garbled in the source and cannot be reconstructed.]
TABLE 7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).

                                   Model 4        Model 5        Model 7
Years of schooling                 .443 (.000)    .443 (.000)    .444 (.000)
s.e.e. (root mean square error)    1.29           1.29           1.29

[The remaining rows are garbled in the source and cannot be aligned with confidence. Their labels are: trend, 1955 or earlier (age 11 in 1965 or earlier); trend, 1956-1967 (age 11 in 1966-1977); trend, 1968 or later (age 11 in 1978 or later); discontinuity at 1955; discontinuity at 1967; curvilinear trend, 1956-67; and R².]
Comparison of the BICs suggests that three models (my hypothesized model, a model that in addition to a discontinuity at the beginning of the Cultural Revolution allows the trend during the Cultural Revolution period to be curvilinear, and a model positing discontinuities at both the beginning and the end of the Cultural Revolution) are about equally likely given the data, albeit with weak evidence favoring the single-knot model, and that all three are strongly to be preferred over all other models.
Again, BIC and classical inference yield contradictory results because the two alternative models fit significantly better (at the 0.01 level) than does the originally hypothesized model. Here I am in a bit of a quandary as to which model to prefer. I have already stated a basis for positing a single discontinuity, plus a knot at the end of the Cultural Revolution. However, another analyst might favor a two-discontinuity model, on the ground that the curricular reform in 1977 that restored the primacy of academic subjects was radical enough to posit a discontinuity at the end as well as at the beginning of the Cultural Revolution. A third analyst might argue that a linear specification of trends, especially in times of great social disruption, is too restrictive and that it makes more sense to posit a curvilinear effect of time during the Cultural Revolution period. In Treiman (2007a), I presented the model positing a discontinuity at 1955, a knot at 1967, and a curve between 1955 and 1967 (see Figure 7.4 in that paper). However, the truth is that there is no clear basis for preferring any one of the three, except for the evidence provided by BIC, which suggests that the originally hypothesized model is slightly more likely than the others given the data.

Again, my suggestion is, go with theory. If you have a theoretical basis for one specification over the others, that is the one to feature; but, at the same time, you must be honest about the fact that alternative specifications fit nearly equally well. In fact, the optimal approach is to present all three models and invite the reader to choose among them. A warning: if you do this, you probably will have to fight with journal editors, who are always trying to get authors to reduce the length of papers, and perhaps with reviewers, who sometimes seem to want definitive conclusions even when the evidence is ambiguous.
The estimated coefficients for all three models are shown in Table 7.4. In all three models each additional year of schooling results in nearly half a point improvement in the number of characters identified. However, the coefficients associated with trends over time are relatively difficult to interpret. Again, this is an instance in which graphing the relationship helps. Figure 7.9 shows, for each of the three preferred models, the predicted number of characters recognized for people with twelve years of schooling, that is, who have completed high school. Although the three graphs appear to be quite different, they all show a decline of about half a point in the number of characters identified for those who were age eleven during the early years of the Cultural Revolution period, relative to those with the same level of schooling who turned eleven before and after the Cultural Revolution. Thus, despite the difficulty in choosing among alternative specifications, together they strongly suggest that the quality of education declined during the Cultural Revolution. People who acquired their middle school (junior high school) education during the Cultural Revolution, in effect, lost a year of schooling; that is, they displayed knowledge of vocabulary equivalent to those with one year less schooling who were educated before and after the Cultural Revolution.

Still, we should be cautious in our interpretation of Figure 7.9, where the Cultural Revolution effect appears to be quite large because of the way the data are graphed (with the y-axis ranging from 5.3 to 6.7 characters recognized). Indeed, Figure 7.10, in which the y-axis ranges from 0 to 10, suggests a rather different story: a very modest decline in the number of characters recognized. It is quite reasonable to report figures such as Figure 7.9 to make the differences among the models clear, but in such cases the responsible analyst will call the reader's attention to the range of the y-axis to avoid misinterpretation.

FIGURE 7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086). [Three panels, each plotting the predicted number of characters recognized by year of birth: discontinuity at 1955, knot at 1967; discontinuities at 1955 and 1967; discontinuity at 1955, knot at 1967, curve 1955-1967.]

FIGURE 7.10. Figure 7.9 Rescaled to Show the Entire Range of the y-Axis.
EXPRESSING COEFFICIENTS AS DEVIATIONS FROM THE GRAND MEAN (MULTIPLE CLASSIFICATION ANALYSIS)

The conventional way of treating categorical independent variables is the approach presented in the previous chapter: omit one category and interpret the remaining coefficients as deviations from the expected value for the omitted category. Sometimes, especially when we have a large number of categories, it is preferable to express our coefficients as deviations from the mean of the dependent variable. We can do this by transforming the coefficients, making use of the following relationships:

$$a_{ij} = b_{ij} + c_i, \qquad c_i = -\sum_j p_{ij} b_{ij} \qquad (7.39)$$

where the $a_{ij}$ are the coefficients for the jth category of the ith predictor, expressed as deviations from the mean of the dependent variable; the $b_{ij}$ are the corresponding coefficients associated with the dummy variables; the $c_i$ are adjustment coefficients that constrain the weighted sum of the coefficients associated with the categories of each predictor to zero; and the $p_{ij}$ are the proportion of total cases falling in the jth category of the ith predictor (Andrews et al. 1973, 45-47).

To see how these coefficients work, consider the relationship between religious denomination and tolerance. The analysis task includes two elements:

■ To assess to what extent and in what way religious groups differ in their tolerance of the antireligious
■ To assess to what extent the observed differences between religious groups can be attributed to the fact that they differ with respect to education and Southern residence, because these variables are known to affect tolerance (with the more educated and non-Southern residents more tolerant than others)
I start by estimating two regression equations in the usual way (one with only the dummy variables for religious groups and one also including education and Southern residence) using pooled data from the 2000, 2002, and 2004 GSS; I pool three years of data to increase the sample size because some religious categories are quite small, and the tolerance questions were asked of only a subset of respondents in each year. The results are shown in the left-hand panel of Table 7.5. I then reexpress these coefficients as deviations from the mean of the dependent variable using Equation 7.39. The rightmost panel of the table shows the reexpressed coefficients.

Ordinarily you would not present both sets of coefficients, but would choose one form or the other: either a dummy variable representation or a multiple-classification representation. I present them both together here so that you can see the relationships between the coefficients.
TABLE 7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).

[The table body is garbled in the source. Its columns are Dummy Variable Coefficients (Deviations from Omitted Category), Model 1 and Model 2, and MCA Coefficients (Deviations from Grand Mean), Model 1 and Model 2. Baptists are the omitted category, with dummy variable coefficients of 0.000 in both models.]

Note: Since p-values are not readily computed for the MCA coefficients, and are not particularly meaningful for the dummy variable coefficients because they indicate the significance of the difference from the omitted category, they are not shown here.
Note that the difference between coefficients is the same in both versions. In Model 1, for example, the difference between the tolerance scores of Methodists and Baptists is 0.395: 0.395 - 0 = -0.027 - (-0.422). Similarly, in Model 2 the difference is 0.298: 0.298 - 0 = -0.100 - (-0.398).

What do the reexpressed coefficients tell us? I think they are easier to interpret. First consider Model 1. From this model we see that Baptists are considerably less tolerant than the average respondent whereas Jews and those without religion are considerably more tolerant than average, and Lutherans and "Others" are somewhat more tolerant than average. However, these differences, especially the high tolerance of Jews, are somewhat explained by religious differences in education and region of residence because, in general, the deviations from the mean decline when these two factors are controlled.

Southern residents are somewhat below average in tolerance, net of religious affiliation and education, whereas non-Southern residents are slightly above average in tolerance. Non-Southern residents will necessarily be closer to the overall mean because there are substantially more of them, and the weighted average of the coefficients (with weights corresponding to the proportion of the sample in the category) must sum to zero.

The coefficients for the religious groups and for Southerners versus non-Southerners are sometimes referred to as "adjusted group differences," where the adjustment refers to the fact that other variables in the model are controlled.

The slope coefficient for education does not change, but the scaling of the educational variable does. In the reexpressed ("MCA") representation, education is expressed as a deviation from its mean, in this case 13.4. Finally, the intercept in the reexpressed representation is just the mean of the dependent variable, the level of tolerance.
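The transformation in Equation 7.39 is simple enough to sketch directly. In the Python example below the coefficients and proportions are hypothetical illustrative values, not the Table 7.5 estimates, and `mca_coefficients` is a made-up helper name; the omitted category is included in the list with a coefficient of zero.

```python
def mca_coefficients(b, p):
    """Reexpress dummy-variable coefficients as deviations from the grand mean.

    b: coefficients for every category of one predictor, with 0.0 for the
       omitted category; p: proportion of cases in each category.
    """
    c = -sum(p_j * b_j for p_j, b_j in zip(p, b))   # adjustment coefficient
    return [b_j + c for b_j in b]                   # a_j = b_j + c


# Hypothetical three-category predictor; the first category is omitted.
b = [0.0, 0.395, 0.80]
p = [0.30, 0.20, 0.50]
a = mca_coefficients(b, p)
print([round(a_j, 3) for a_j in a])
```

Two properties noted in the text can be verified from the result: the proportion-weighted sum of the reexpressed coefficients is zero, and the difference between any two categories is the same as in the dummy-variable representation.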
OTHER WAYS OF REPRESENTING DUMMY VARIABLES

Three other ways of representing the effects of categorical variables sometimes can facilitate interpretation. Two of them, effect coding and contrast coding, require representing the categories of the classification in a different way from conventional dummy variable coding (see Cohen and Cohen [1975, 172-210], Hardy [1993, 64-75], and Fox [1997, 206-11]). A third, which I label sequential effects, involves manipulating the output. None of these alternative ways of expressing the effects of a categorical variable alters the contribution of the categorical variable to the explained variance; that is, the $R^2$ is unaffected. All they do is reparameterize the effects, and so the only reason for using any of them is to make interpretation of the relationships in the data clearer.

To see how these alternatives work, let us consider a new problem: the effect of occupation and education on knowledge of vocabulary in the United States, using data from the 2004 GSS. The GSS includes a ten-item vocabulary test, a detailed classification of current occupation, and a measure of years of school completed. For the purpose of this example, I have collapsed the detailed occupational classification into four categories: upper nonmanual (managers and professionals), lower nonmanual (technicians, sales occupations, and administrative support occupations), upper manual (precision production, craft, and repair occupations), and lower manual (all other categories: service occupations; agricultural occupations; and operators, fabricators, and laborers). I expect that, net of current occupation, vocabulary scores will increase with years of schooling.
More interestingly, I expect vocabulary scores to increase with occupational status, that is, that lower manual workers, upper manual workers, lower nonmanual workers, and upper nonmanual workers will have increasingly high vocabulary scores, on average, net of years of schooling. The argument is that symbolic manipulation plays an increasingly large role in work as one moves up the occupational hierarchy, so that verbal skills are much more strongly reinforced and enhanced in high-status occupations than in low-status occupations. (Of course, in a serious analysis, I would need to consider the possibility that those with better verbal skills, relative to their education, would be more likely to end up in high-status occupations.)

The conventional approach to representing these effects is to estimate an equation of the form
$$\hat{V} = a + b(E) + \sum_i c_i O_i \tag{7.40}$$
where $V$ is the vocabulary score, $E$ is the years of school completed, and the $O_i$ are the occupation categories with, say, $O_1 = 1$ for lower manual workers and $= 0$ otherwise, ..., $O_4 = 1$ for upper nonmanual workers and $= 0$ otherwise. The top panel of Table 7.6 shows the "design matrix" (also known as the "model matrix") for the conventional coding of dummy variables to represent the separate effects of each occupation category; the resulting coefficients are shown in the top panel of Table 7.7. As you can see, there are no surprises here. As expected, vocabulary knowledge increases with education and also increases monotonically with occupational status. However, let us now see how to represent the effects of occupational status in other, mathematically equivalent, ways.
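The conventional coding just described can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce Table 7.6: the category order, the function name `dummy_row`, and the choice of lower manual workers as the omitted category are all assumptions of the sketch.

```python
# Category order is an assumption of this sketch; the text names the four groups.
CATEGORIES = ["lower manual", "upper manual", "lower nonmanual", "upper nonmanual"]


def dummy_row(category, omitted="lower manual"):
    """Return the 0/1 codes for one respondent: one column per non-omitted category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    included = [c for c in CATEGORIES if c != omitted]
    return [1 if category == c else 0 for c in included]


# Design matrix: one row per category, one column per included category
# (here the columns stand for upper manual, lower nonmanual, upper nonmanual).
design = {c: dummy_row(c) for c in CATEGORIES}
for name, codes in design.items():
    print(f"{name:16s} {codes}")
```

The omitted category is the one whose row is all zeros; changing the `omitted` argument changes the reference group without changing the fit of the model.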
Effect Coding

One possibility is to highlight the effect of each occupation category by contrasting it with the unweighted average of the effects of the other categories. If we include a single categorical variable in the model, represented by a set of $k - 1$ trichotomous variables each coded -1 for the omitted category, 1 for the $i$th category, and 0 otherwise (refer to the second panel of Table 7.6), the resulting regression coefficients give the difference between the mean on the dependent variable for the specific category and the unweighted average of the means for all categories. That is,

$$b_i = \bar{Y}_i - \bar{Y} \tag{7.41}$$

where $\bar{Y}$ is the unweighted average of the sample means for each category, averaged over all categories of the classification; $\bar{Y}$ is also the intercept of the equation. The coefficient for the omitted category is just the negative sum of the coefficients for the $k - 1$ explicitly included categories, although in these days of high-speed computing, it usually is easier to simply change the omitted category. When other variables are included in the model, the same relationships hold except that now we have adjusted means.

As noted, the coding of categories that produces this outcome is shown in the second panel of Table 7.6. The reference category is coded -1 on each of the indicator variables, and each of the other categories is coded 1 on its own indicator variable and 0 on the remaining indicator variables. This contrasts the reference category and each of the other categories while minimizing the influence of the remaining categories.

[Table 7.6 about here. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details). Columns: Lower Manual, Upper Manual, Lower Nonmanual, Upper Nonmanual. Panels: conventional dummy variable coding, effect coding, contrast coding.]

Inspecting the coefficients in the second panel of Table 7.7, we see that lower manual workers have substantially lower than average scores; upper manual workers have somewhat lower than average scores; lower nonmanual workers have somewhat higher than average scores; and upper nonmanual workers have substantially higher than average scores. Note that the differences between occupation categories are identical (within rounding error) in the two parameterizations and that neither the effect of education nor the R² is affected. This reparameterization is likely to be most useful when the categorical variable includes a large number of response categories, no one of which is a particularly useful reference category. Note also the contrast between this parameterization and that discussed in the previous section, which shows coefficients as deviations from the weighted average of the subgroup means. Here the coefficients are deviations from the unweighted average. Both are appropriate, and each may be useful under certain circumstances.

Table 7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F(3, 1752) = 12.48; p < .0000).

                                      Coefficient   Standard Error   p-Value
Conventional dummy variable coding    [rows not legible]
Effect coding                         [rows not legible]
Contrast coding
  C1                                  0.529
  C2                                  0.226
  C3                                  0.244         0.120            0.042
  Intercept                           2.482         0.239            0.000
Sequential coefficients
  Education                           0.277         0.018            0.000
  S2                                  0.226         0.154            0.142
  S3                                  0.295         0.154            0.056
  S4                                  0.243         0.119            0.042
  Intercept                           2.105         0.220            0.000
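The claim that effect-coded coefficients are deviations from the unweighted average of the category means can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the scores are made-up numbers, there are no covariates, and `solve`, `ols`, and `effect_row` are hypothetical helper names written for this example (a small normal-equations solver stands in for a regression package).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(x_rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(k)]
    return solve(xtx, xty)


def effect_row(category, k, omitted):
    """Intercept plus k-1 codes: -1 for the omitted category, 1 for own category, 0 otherwise."""
    included = [c for c in range(k) if c != omitted]
    if category == omitted:
        return [1] + [-1] * (k - 1)
    return [1] + [1 if category == c else 0 for c in included]


# Hypothetical scores for four categories of unequal size (made-up numbers).
data = {0: [4, 5, 6, 5], 1: [6, 6], 2: [6, 7, 8], 3: [8, 9]}
rows = [effect_row(c, 4, omitted=0) for c, ys in data.items() for _ in ys]
y = [v for ys in data.values() for v in ys]

coefs = ols(rows, y)  # [intercept, b for category 1, category 2, category 3]
means = {c: sum(ys) / len(ys) for c, ys in data.items()}
unweighted = sum(means.values()) / len(means)

print("intercept:", coefs[0], "unweighted mean of category means:", unweighted)
print("omitted-category effect:", -sum(coefs[1:]))
```

Because the categories differ in size, the unweighted average of the four category means differs from the overall sample mean, which is what distinguishes this parameterization from the weighted ("MCA") version discussed earlier.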
Contrast Coding

Sometimes we may want to compare the effects of subsets of variables. For example, we may want to contrast nonmanual and manual workers, and then to contrast the two nonmanual categories and the two manual categories. We can do this by constructing a set of contrasts of the subgroup means. That is, we form a set of contrasts of the form

$$C = \sum_i a_i \bar{Y}_i \tag{7.42}$$

subject to the constraints that the $a_i$ sum to zero; that $k - 1$ contrasts be formulated to represent $k$ categories; and that the codes for each pair of contrasts be linearly independent, or, to put it differently, that the contrast codes be orthogonal, which condition is satisfied when, for each pair of contrasts, the sum of the products of the codes = 0.

A set of contrast codes is shown in the third panel of Table 7.6. Note that they satisfy the three constraints mentioned in the previous paragraph: three contrast variables are used to represent the four occupation categories; each row sums to zero; and the sum of the products of the codes = 0 in all cases (for example, for $C_1$ and $C_2$ we have .5*1 + .5*(-1) + (-.5)*0 + (-.5)*0 = 0; and similarly for $C_1$ and $C_3$, and for $C_2$ and $C_3$). This coding, plus a little further computation on the regression output, explained next, yields the coefficients shown in the third panel of Table 7.7.

Note that the intercept gives the unweighted mean of the category means, just as for effect coding, but the coefficients of the indicator variables have a somewhat different interpretation, which requires a little additional manipulation. Each contrast, $j$, gives the difference in the unweighted means of the means for the categories in the two groups being contrasted and is computed by

$$C_j = b_j \, \frac{n_{j1} + n_{j2}}{(n_{j1})(n_{j2})} \tag{7.43}$$

where $n_{j1}$ is the number of categories included in the first group, $n_{j2}$ is the number of categories included in the second group, and $b_j$ is the regression coefficient for the $j$th contrast-coded dummy variable. Note that the standard errors also must be multiplied by the same factor as the regression coefficients.

Inspecting the contrast coefficients shown in the third panel of Table 7.7, we see that the manual groups average about a half point below the nonmanual groups in their mean vocabulary scores, which is highly significant; that upper manual workers average about a quarter point higher than lower manual workers in their vocabulary scores, but that this difference is significant only at the .16 level, which gives us little confidence that there is a true difference between these categories; and that upper nonmanual workers likewise average about a quarter point higher than lower nonmanual workers, and that this difference is significant at the 0.04 level, which means that we can have modest confidence that there is a true difference between these categories.
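The constraints on the contrast codes, and the rescaling in Equation 7.43, can be illustrated as follows. This is a sketch under assumptions: the codes for $C_1$ and $C_2$ follow the worked product in the text, the codes for $C_3$ and the category order are assumed, and the category means are made-up numbers. With orthogonal codes and a saturated model, each coefficient can be computed directly as the sum of code times mean divided by the sum of squared codes.

```python
# Assumed category order: lower manual, upper manual, lower nonmanual, upper nonmanual.
CODES = {
    "C1": [0.5, 0.5, -0.5, -0.5],  # manual versus nonmanual
    "C2": [1, -1, 0, 0],           # within manual (C3 below is an assumed analogue)
    "C3": [0, 0, 1, -1],           # within nonmanual
}


def satisfies_constraints(codes, tol=1e-12):
    """Each contrast sums to zero and every pair of contrasts is orthogonal."""
    vectors = list(codes.values())
    sums_zero = all(abs(sum(v)) < tol for v in vectors)
    orthogonal = all(
        abs(sum(a * b for a, b in zip(v, w))) < tol
        for i, v in enumerate(vectors)
        for w in vectors[i + 1:]
    )
    return sums_zero and orthogonal


def rescale(b, n1, n2):
    """Equation 7.43: convert a contrast-coded coefficient to a difference in means."""
    return b * (n1 + n2) / (n1 * n2)


# Hypothetical category means (made-up numbers, not the GSS estimates).
means = [5.0, 5.2, 5.5, 5.8]

# With orthogonal codes, each coefficient is sum(code * mean) / sum(code ** 2).
b = {name: sum(c * m for c, m in zip(v, means)) / sum(c * c for c in v)
     for name, v in CODES.items()}

contrasts = {
    "C1": rescale(b["C1"], 2, 2),  # two manual versus two nonmanual categories
    "C2": rescale(b["C2"], 1, 1),
    "C3": rescale(b["C3"], 1, 1),
}
print(satisfies_constraints(CODES), contrasts)
```

In this sketch the rescaling factor is 1 for $C_1$ and 2 for $C_2$ and $C_3$, so only the within-group contrasts change when Equation 7.43 is applied.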
Sequential Coefficients

One additional way of presenting coefficients for categorical variables is sometimes helpful. When the underlying dimension is ordinal, or we want to treat it as ordinal, it may be useful to reexpress the coefficients as indicating the difference between each category and the preceding category. To do this is a simple matter of estimating the equation using conventional dummy variable coding but then subtracting the preceding coefficient from each coefficient. If we have $k$ categories and omit the first one, then $b_2$ remains unchanged ($b'_2 = b_2 - 0$); $b'_3 = b_3 - b_2$; and so on. The appropriate standard errors for each coefficient are then the standard errors of the difference from the preceding coefficient. Again, the standard error for $b_2$ remains unchanged, and the standard errors of the remaining variables may either be computed by hand from the variance-covariance matrix of coefficients using the denominator of the formula shown in the boxed note, "How to Test the Significance of the Difference Between Two Coefficients," in Chapter Six, or computed simply by reestimating the regression equation, successively altering the omitted category.

Here we see (in the fourth panel of Table 7.7) that the differences between adjacent occupation categories are each about one-quarter of a point on the vocabulary scale and that all but the first of the differences are significant at conventional levels. Note also that the contrasts between the first and second categories (lower and upper manual workers) and between the third and fourth categories (lower and upper nonmanual workers) are identical (within limits of rounding error) to contrasts 2 and 3 of the previous panel, which, of course, must be so because the same categories are being contrasted in both cases.
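The hand computation described above can be sketched as follows. The standard error of a difference between two coefficients is taken here to be the square root of the sum of their variances minus twice their covariance, which is the usual formula and is assumed to be the one in the Chapter Six note; the coefficients and the variance-covariance matrix are made-up numbers, and `sequential` is a hypothetical helper name.

```python
import math


def sequential(coefs, vcov):
    """Reexpress conventional dummy coefficients b2..bk (first category omitted)
    as differences from the preceding category, with standard errors."""
    out = []
    for j, b in enumerate(coefs):
        if j == 0:
            # The first included category is compared with the omitted one: unchanged.
            diff, var = b, vcov[0][0]
        else:
            diff = b - coefs[j - 1]
            var = vcov[j][j] + vcov[j - 1][j - 1] - 2 * vcov[j][j - 1]
        out.append((diff, math.sqrt(var)))
    return out


# Hypothetical coefficients and variance-covariance matrix (made-up numbers).
coefs = [0.2, 0.5, 0.8]
vcov = [
    [0.04, 0.01, 0.01],
    [0.01, 0.03, 0.01],
    [0.01, 0.01, 0.02],
]
for diff, se in sequential(coefs, vcov):
    print(f"difference = {diff:.3f}, standard error = {se:.4f}")
```

Reestimating the equation with a different omitted category, as the text suggests, should give the same differences and standard errors without any hand calculation.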
DECOMPOSING THE DIFFERENCE BETWEEN TWO MEANS

A common problem in social research is to account for why two (or more) groups differ with respect to their average score or value on some variable. For example, we may observe that Blacks and non-Blacks differ with respect to their average earnings and may wonder how this comes about. In particular, we may wonder to what extent the difference arises from group differences in their "assets," the traits that enhance earnings, and to what extent it arises because the groups get different "rates of return" to their assets, that is, some groups gain a greater advantage from any level of assets than do other groups. Consider education, for example. To what extent do earnings differences between Blacks and non-Blacks arise because Blacks tend to be less well educated than non-Blacks, and to what extent do they arise because Blacks get a lower return to their education than do non-Blacks?

A natural way to investigate the determinants of any outcome of interest is to regress the outcome on a set of suspected determinants and note the relative size of the coefficients associated with each independent variable. A natural extension of this approach for the purpose of comparing two groups is to compute parallel regressions for the two groups of interest, to subtract one equation from the other, and to note the size of the resulting differences. Consider the following equations:
$$\hat{Y}_1 = a_1 + \sum_i b_{i1} X_{i1} \tag{7.44}$$

and

$$\hat{Y}_2 = a_2 + \sum_i b_{i2} X_{i2} \tag{7.45}$$

which represent some model with $I$ independent variables, estimated separately for Group 1 and Group 2. Because each least-squares equation holds exactly at the group means, we can also write

$$\bar{Y}_1 = a_1 + \sum_i b_{i1} \bar{X}_{i1} \tag{7.46}$$

$$\bar{Y}_2 = a_2 + \sum_i b_{i2} \bar{X}_{i2} \tag{7.47}$$
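Then, taking the difference between (7.46) and (7.47), we have

$$\bar{Y}_1 - \bar{Y}_2 = \Big(a_1 + \sum_i b_{i1}\bar{X}_{i1}\Big) - \Big(a_2 + \sum_i b_{i2}\bar{X}_{i2}\Big) = (a_1 - a_2) + \sum_i (b_{i1} - b_{i2})\bar{X}_{i2} + \sum_i b_{i2}(\bar{X}_{i1} - \bar{X}_{i2}) + \sum_i (b_{i1} - b_{i2})(\bar{X}_{i1} - \bar{X}_{i2}) \tag{7.48}$$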
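For readers who want to check the equality, the algebra can be worked out one variable at a time. The following is a sketch of that check, expanding the three terms of the decomposition for a single variable $i$ (the notation matches the group means, slopes, and intercepts defined above):

```latex
\begin{aligned}
&(b_{i1}-b_{i2})\bar{X}_{i2} + b_{i2}(\bar{X}_{i1}-\bar{X}_{i2})
  + (b_{i1}-b_{i2})(\bar{X}_{i1}-\bar{X}_{i2}) \\
&\quad = b_{i1}\bar{X}_{i2} - b_{i2}\bar{X}_{i2}
  + b_{i2}\bar{X}_{i1} - b_{i2}\bar{X}_{i2}
  + b_{i1}\bar{X}_{i1} - b_{i1}\bar{X}_{i2} - b_{i2}\bar{X}_{i1} + b_{i2}\bar{X}_{i2} \\
&\quad = b_{i1}\bar{X}_{i1} - b_{i2}\bar{X}_{i2}.
\end{aligned}
```

Summing over the variables and adding $(a_1 - a_2)$ recovers the simple difference between the two group means.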
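(You can work out the equality for yourself. It is easier if you start with the expanded equation and derive the simple difference.)

Equation 7.48 also can be written as

$$\bar{Y}_1 - \bar{Y}_2 = (a_1 - a_2) + \sum_i (b_{i1} - b_{i2})\bar{X}_{i1} + \sum_i b_{i1}(\bar{X}_{i1} - \bar{X}_{i2}) + \sum_i (b_{i1} - b_{i2})(\bar{X}_{i2} - \bar{X}_{i1}) \tag{7.49}$$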
(Again, you can convince yourself of this by working out the algebra.)

Equations 7.48 and 7.49 represent alternative decompositions of the difference between two means into the difference between the intercepts, the slopes, the means on the independent variables, and the interactions between the differences in slopes and differences in means. In Equation 7.48, Group 2 is used as a standard. Hence, the effect of the difference in slopes is evaluated at the mean for Group 2, and the effect of the difference in means is evaluated with respect to the slope for Group 2. In Equation 7.49, Group 1 is taken as the standard. These equations generally yield different answers, and there usually is no obvious way to decide between them. Hence, it is a good practice to present both sets of decompositions, as I do here. Differences in interpretation associated with use of the two standards will be discussed shortly.

In both these decompositions the coefficients representing the effect of the difference in means and the interaction are unchanged when a constant is added to or subtracted from the independent variables, but the coefficients representing the effect of the difference in intercepts and the difference in the rate of return to the independent variables do depend on the scaling of the variables (Jones and Kelley 1984). For this reason, it generally is advisable to combine these two terms. Doing so yields three components. From Equation 7.48 we have
$\bar{Y}_1 - \bar{Y}_2$
    Actual: Observed group difference.

$\sum_i b_{i2}(\bar{X}_{i1} - \bar{X}_{i2})$
    Composition: Portion due to differences in assets.

$(a_1 - a_2) + \sum_i (b_{i1} - b_{i2})\bar{X}_{i2}$
    Rate: Portion due to differences in the rates of return to assets (that is, the difference remaining if assets were equalized).

$\sum_i (b_{i1} - b_{i2})(\bar{X}_{i1} - \bar{X}_{i2})$
    Interaction: Portion due to valuing the differences in assets at the Group 2 rate of return rather than the Group 1 rate of return.

Equation 7.49 can, of course, be reorganized in the same way. Note that in Equations 7.48 and 7.49, the interaction terms have the same absolute value but opposite sign, which follows from the definition of the interaction term.
A Worked Example: Factors Affecting Racial Differences in Educational Attainment
Now let us consider a substantive problem to see what to do with these coefficients. Suppose we are interested in studying the factors affecting racial differences in the level of education completed. It is well known that on average Blacks have less education than do members of other races. Data from the GSS show that over the period 1990 to 2004, Blacks averaged about a year less in schooling than did others (12.8 years compared to 13.7 years for others). What factors might account for this difference?

To study this question, it is necessary to obtain a large enough sample of Blacks to yield reliable estimates. Although no one year of the GSS has enough Black respondents, pooling all years from 1990 to 2004, excluding 2002 (in which the race question was asked in a nonstandard way and thus the data are not comparable to those from other years), yields a sample of 2,105 Blacks with complete information on all variables. I thus pooled data from these years of the GSS, divided the sample into Blacks and non-Blacks, and estimated for each sample the regression of years of school completed on mother's years of schooling, number of siblings, and whether the respondent resided in the South at age sixteen.

I chose these variables for study because they are known to affect educational attainment. Mother's education is a measure of family cultural capital and is superior to father's education in the Black population due to the relatively large number of female-headed households: the higher the level of mother's education, the higher the expected level of respondent's education. The number of siblings is an indicator of the share of parental resources that can be devoted to any single child: the larger the number of siblings, the lower the expected level of educational attainment. Southern residence at age sixteen is an indicator of inferior schooling: those who grew up in the South are expected to obtain less education than those who grew up in other parts of the country.
WHY BLACK VERSUS NON-BLACK IS BETTER THAN WHITE VERSUS NON-WHITE FOR SOCIAL ANALYSIS IN THE UNITED STATES

When studying racial differences in the United States, it is reasonable to divide the population into Blacks and non-Blacks. Non-Blacks, of course, will be mainly White (in the GSS from 1990 to 2004, 94 percent of non-Blacks are White). Including "Others" (those who are neither Black nor White and, in fact, are mostly Asian) with Whites has little effect on the estimates but has the advantage of retaining the entire population rather than arbitrarily studying most but not all of the population. Including "Others" with Blacks makes less sense, both because "Others" are more similar to Whites with respect to most social characteristics and because they would constitute a larger fraction of the "non-White" population than of the "non-Black" population, thus making the category less homogeneous.
The question at issue is to what extent the observed nearly one-year difference in the average education of Blacks and non-Blacks is due to racial differences in the average level of mother's schooling, the average number of siblings, and the probability of living in the South, and to what extent the difference is due to the lower rates of return to Blacks from having educated mothers, coming from small families, and living outside the South. I start by estimating an equation of the form

$$\hat{E} = a + b(E_M) + c(S) + d(R) \tag{7.50}$$

separately for Blacks and non-Blacks, where $E$ = years of school completed; $E_M$ = years of school completed by the mother; $S$ = number of siblings; and $R$ = 1 if the respondent lived in the South at age sixteen and = 0 otherwise.
A COMMENT ON CREDIT IN SCIENCE

In Stata the decomposition is carried out using an ado file, oaxaca, which can be downloaded from the Web: type "net search oaxaca", then click the entry for oaxaca. The name of the ado file is a telling reflection of the sociology of science. The decomposition technique was introduced by the demographer Evelyn Kitagawa in 1955 and was elaborated in a number of ways over the years by demographers and sociologists (see the section on "Additional Reading on Decomposing the Difference Between Means" later in the chapter). However, it was only when an economist, Ronald Oaxaca (1973), used the procedure that it gained general currency among economists. It is now known as the "Oaxaca decomposition" or the "Blinder-Oaxaca decomposition," due to a somewhat clearer exposition by another economist, Alan Blinder (1973).
Table 7.8 shows the means, standard deviations, and correlations among the variables included in the equation, separately for Blacks and non-Blacks. From the table we see that Blacks come from much larger families, are much more likely to have been raised in the South, and that both respondents and their mothers average nearly a year less schooling than non-Blacks. Table 7.9 shows the regression estimates, and Table 7.10 shows the decomposition.

For Blacks in the 1990 to 2004 GSS pooled sample, the estimated values for Equation 7.50 are

$$\hat{E} = 11.09 + .220(E_M) - .071(S) - .512(R) \tag{7.51}$$

whereas for non-Blacks the estimated values are

$$\hat{E} = 10.76 + .308(E_M) - .138(S) - .488(R) \tag{7.52}$$

Table 7.9 gives the regression coefficients for the two equations, together with standard errors. This table shows that the main differences between the determinants of education for Blacks and non-Blacks are, first, that the cost of coming from a large family is substantially greater for non-Blacks than for Blacks and, second, that the advantage of mother's education is greater for non-Blacks than for Blacks. Interestingly, the effect of growing up in the South differs little for the two races, which represents an important change from the past. Finally, less than a fifth of the variance in education is explained by the three variables in the model for non-Blacks and less than a sixth for Blacks, both substantially less than in the past.

[Table 7.8 about here. Means, Standard Deviations, and Correlations for Variables Included in a Model of Educational Attainment for U.S. Adults, 1990 to 2004, by Race (Blacks Above the Diagonal, Non-Blacks Below). Variables: (E) years of school; (E_M) mother's years of school; (S) number of siblings; (R) lived in South at 16.]

Table 7.9. Coefficients of a Model of Educational Attainment for Blacks and Non-Blacks, U.S. Adults, 1990 to 2004.

Metric Coefficients (Standard Errors)
                             Blacks             Non-Blacks
Mother's years of school      0.220              0.308
Number of siblings           -0.071 (0.016)     -0.138 (0.008)
Lived in South at 16         -0.512             -0.488
Constant                     11.09 (0.22)       10.76

The coefficients in Table 7.9 do not, however, permit a formal comparison of the determinants of educational attainment for Blacks and others. To see this we turn to Table 7.10, which gives a decomposition of the almost one-year difference in the average years of school completed by Blacks and non-Blacks. Decomposition 1, which takes Blacks as the standard, is constructed from Equation 7.48, whereas Decomposition 2, which takes non-Blacks as the standard, is constructed from Equation 7.49. In both cases non-Blacks are taken as Group 1, and Blacks as Group 2. So the decomposition is of the approximately one-year advantage in the average schooling of non-Blacks compared with Blacks.

Both decompositions suggest that differences in assets (the fact that non-Blacks have better-educated mothers, fewer siblings, and are less likely to live in the South) are more important than differences in returns to assets. But the two decompositions differ in the contribution they assign to differences in maternal education and number of siblings, both of which are more important in Decomposition 2 than in Decomposition 1. The reason for this is straightforward: when Blacks are taken as the standard, that is, when the Black/non-Black difference in slopes is evaluated at the Black mean, the difference in expected values is smaller in both cases than when the difference in slopes is evaluated at the non-Black mean. (To convince yourself of this, sketch a graph of the two slopes for each of the two variables.) Finally, the interaction terms have relatively little importance in this decomposition because of the offsetting effects of number of siblings and Southern residence.

Table 7.10. Decomposition of the Difference in the Mean Years of School Completed.

                                        Decomposition 1      Decomposition 2
                                        (Black Standard)     (Non-Black Standard)
Total difference                         0.89
Differences in assets
  Mother's education                     0.17
  Number of siblings                     0.11
  Lived in South at 16                   0.15
  Total due to differences in assets     0.44
Differences in returns to assets
  Mother's education                     0.93
  Number of siblings                    -0.34
  Lived in South at 16                   0.01
  Intercept                             -0.33
  Total due to differences in returns    0.28                 0.46
Interactions
  Mother's education                     0.07                -0.07
  Number of siblings                     0.11                -0.11
  Lived in South at 16                  -0.01                 0.01
  Total due to interactions              0.17                -0.17
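The arithmetic behind the two decompositions can be sketched directly from the estimated equations for the two groups. This is an illustration rather than the computation used for the published table: the function name `decompose` is hypothetical, and the group means used below (11.4, 3.46, and 0.274 for non-Blacks; 10.6, 4.96, and 0.559 for Blacks) are assumptions of this sketch taken to be the Table 7.8 means. Because the inputs are rounded, the components can differ from the published ones in the second decimal place.

```python
def decompose(a1, b1, x1, a2, b2, x2, standard=2):
    """Split the Group 1 minus Group 2 difference in predicted means into
    composition (assets), rate (returns plus intercepts), and interaction."""
    keys = list(b1)
    if standard == 2:  # Group 2 as the standard
        composition = sum(b2[k] * (x1[k] - x2[k]) for k in keys)
        rate = (a1 - a2) + sum((b1[k] - b2[k]) * x2[k] for k in keys)
        interaction = sum((b1[k] - b2[k]) * (x1[k] - x2[k]) for k in keys)
    else:              # Group 1 as the standard
        composition = sum(b1[k] * (x1[k] - x2[k]) for k in keys)
        rate = (a1 - a2) + sum((b1[k] - b2[k]) * x1[k] for k in keys)
        interaction = sum((b1[k] - b2[k]) * (x2[k] - x1[k]) for k in keys)
    return composition, rate, interaction


# Group 1 = non-Blacks, Group 2 = Blacks; coefficients from the estimated equations.
a_nb, b_nb = 10.76, {"mother": 0.308, "siblings": -0.138, "south": -0.488}
a_bl, b_bl = 11.09, {"mother": 0.220, "siblings": -0.071, "south": -0.512}
# Group means: assumed values for this sketch (see the lead-in).
x_nb = {"mother": 11.4, "siblings": 3.46, "south": 0.274}
x_bl = {"mother": 10.6, "siblings": 4.96, "south": 0.559}

d1 = decompose(a_nb, b_nb, x_nb, a_bl, b_bl, x_bl, standard=2)  # Black standard
d2 = decompose(a_nb, b_nb, x_nb, a_bl, b_bl, x_bl, standard=1)  # non-Black standard

predicted_gap = (a_nb + sum(b_nb[k] * x_nb[k] for k in b_nb)) - (
    a_bl + sum(b_bl[k] * x_bl[k] for k in b_bl))

for label, d in (("Black standard", d1), ("Non-Black standard", d2)):
    print(label, [round(v, 2) for v in d], "sum =", round(sum(d), 2))
```

Whichever standard is chosen, the three components sum to the same gap in predicted means, and the interaction term changes sign, as the text notes.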
Additional Reading on Decomposing the Difference Between Means

For a good sense of how to carry out more complicated decompositions and what the interpretative issues are, read the papers by Duncan (1968), Winsborough and Dickinson (1971), Kaufman (1983), Treiman and Roos (1983), Jones and Kelley (1984), Kraus (1986), Treiman and Lee (1996), and Treiman, McKeever, and Fodor (1996).
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have covered various elaborations of multiple regression procedures that give us improved ability to represent social processes and thus to test ideas about how the social world works. Specifically, we have considered nonlinear transformations of both dependent and independent variables; ways to test the equality of coefficients within an equation; how to assess the assumption of linearity in a relationship; how to construct and interpret linear splines to represent abrupt changes in slopes; alternative ways of expressing dummy variable coefficients; and a procedure for decomposing the difference between two means. Several of the worked examples have focused on trends over time, which gives us a model for how to use multiple regression procedures to study social change.

In the next chapter we return to perhaps the most vexing problem in nonexperimental social research, missing data on some but not all variables, and consider what is currently regarded as the gold standard for dealing with missing data: multiple imputation of missing values.
CHAPTER

MULTIPLE IMPUTATION OF MISSING DATA

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider issues involved in the treatment of missing data, review various methods for handling missing data, and see how to use a state-of-the-art procedure for imputing missing data to create a complete data set, the method of multiple imputation. For a very useful overview of imputation methods, including multiple imputation, see Paul and others (2008), upon which this discussion draws heavily. Other useful reviews of the literature on missing data treatments include Anderson, Basilevsky, and Hum (1983), Little (1992), Brick and Kalton (1996), and Nordholt (1998).
INTRODUCTION

Missing data is a vexing problem in social research. It is both common and difficult to manage. Most survey items include nonresponse categories: respondents do not know the answers to some questions or refuse to answer; interviewers inadvertently skip questions or record invalid codes; errors are made in keying data; and so on. Administrative data, hospital records, and other sorts of data have similar problems: invalid or missing responses to particular items. Where information is missing because it is not applicable to particular respondents (for example, age at marriage for the never married), there is no problem; the analytic sample is simply defined as those "at risk" of the event. But in the remaining cases, in which in principle there could be a response, we need special procedures to cope with missing information.

In the statistics literature on missing data (Rubin 1987; Little and Rubin 2002), a distinction is made between three conditions: missing completely at random (MCAR), the condition in which missing responses to a particular variable are independent of the values of any other variable in the explanatory model and of the true value of the variable in question; missing at random (MAR), the condition in which missingness is independent of the true value of the variable in question but not of at least some of the other variables in the explanatory model; and missing not at random (MNAR) or, alternatively, nonignorable (NI), in which missingness depends on the true value of the variable in question and, possibly, on other variables as well.
Note that these distinctions refer to net effects. Thus, for example, if the probability that data are missing on the father's education is independent of the true value of the father's education after account is taken of the respondent's education but depends on the respondent's education, the data would satisfy the MAR condition. The fact that the typology refers to net rather than gross effects is very important because otherwise it is difficult to think of variables that satisfy the MAR condition. For example, it is likely that missingness on the father's education is correlated with the true value of the father's education simply because the father's and the respondent's education are correlated, and lack of knowledge of the father's education is greater among the poorly educated.

Unfortunately, at least in cross-sectional data, there is no way to empirically determine whether missingness is independent of the true value of the variable; this must be defended on theoretical grounds. Although it is likely that missingness is seldom completely independent of the true value of the variable, there are cases in which it is plausible to assume that it is largely independent, at least net of the other variables in the explanatory model. These are the cases that concern us here.

The NI condition is often discussed under the rubric of sample selection bias, the situation where the sample is selected on the basis of variables correlated with the dependent variable. This topic is beyond what can be included in this book (but see Chapter Sixteen for a brief introduction). Accessible discussions of the issues involved in sample selection bias and possible corrections can be found in Berk and Ray (1982), Berk (1983), Breen (1996), and Stolzenberg and Relles (1997).

Next we review a number of procedures for dealing with missing data, culminating in a discussion of Bayesian multiple imputation, the current gold standard, and presentation of a worked example using this method.
Casewise Deletion
The most commonly used method for dealing with missing data (which we have employed so far in this book) is simply to drop all cases with any missing data on the variables included in the analysis. If data mainly are missing completely at random, due to recording, keying, or coding errors, or omission by design (the question is asked only of a random subset of the sample), the main cost is to reduce the sample size. This is bad enough because often the reduction in sample size is quite dramatic. For example, Clark and Altman (2003) reported a study of prognosis of ovarian cancer in which missing data on 10 covariates reduced the sample size by 56 percent, from 1,189 to 518.
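Casewise deletion itself is easy to express in code, which also makes the cost in sample size visible. The following is a minimal sketch with made-up records; the function name `casewise_delete` and the variable names are hypothetical, and missing responses are assumed to be stored as `None`.

```python
def casewise_delete(records, variables):
    """Keep only the records with no missing value (None) on any analysis variable."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]


# Hypothetical records; None marks a missing response.
sample = [
    {"educ": 12, "father_educ": 10, "income": 30},
    {"educ": 16, "father_educ": None, "income": 55},
    {"educ": 9, "father_educ": 8, "income": None},
    {"educ": 14, "father_educ": 12, "income": 41},
]
complete = casewise_delete(sample, ["educ", "father_educ", "income"])
loss = 100 * (len(sample) - len(complete)) / len(sample)
print(len(complete), f"cases kept; {loss:.0f}% of cases dropped")

# The reduction reported by Clark and Altman (2003): from 1,189 cases to 518.
clark_altman_loss = 100 * (1189 - 518) / 1189
print(f"{clark_altman_loss:.0f}% of cases dropped")
```

Note that a case is dropped if it is missing on any one of the analysis variables, which is why the loss grows quickly as covariates are added.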
WHY PAIRWISE DELETION SHOULD BE AVOIDED

Sometimes, to avoid substantial reductions in their sample size, analysts base their analysis on "pairwise-present" correlations, that is, correlations computed from all data available for each pair of variables. This is a bad idea because it can produce inconsistent, and often uninterpretable, results, especially when hierarchical models are contrasted, of the kind discussed in the section of Chapter Six on "A Strategy for Comparisons Across Groups."
However, the problem usually is much worse because data are not missing completely at random. Rather, the presence or absence of data on particular variables tends to depend on the value of other variables. For example, as noted previously, poorly educated people are less likely to know about their family histories, and hence their parents' characteristics, than are well-educated people; the refusal to answer certain kinds of questions, for example, those invoking political attitudes, may vary with political party affiliation; self-employed businessmen may refuse to divulge their income for fear that the information will wind up in the hands of the tax authorities; and so on. In such cases, coefficients estimated using casewise deletion are generally biased. Thus, to simply omit missing data is to risk seriously distorting our analysis.

Case deletion (also known as listwise deletion) is also appropriate when the model is perfectly specified, and the value of the dependent variable is not affected by the missingness of data on any of the independent variables (Paul and others 2008). But perfectly specified models are virtually unknown in the social sciences. The mean imputation with dummy variables method discussed a bit later provides a test of the dependence of the dependent variable on the missingness of the independent variable or variables; but we still are left with the problem of imperfect model specification. One circumstance in which case deletion is appropriate is when a question is asked only of a random subset of a sample because then the subset is still a probability sample of the population. But even here, there usually is a heavy cost to pay in terms of reduction in sample size.
Weighted Casewise Deletion
A similar approach, which is possible where the population distribution of some variables is known or can be accurately estimated (for example, from a census or high-quality survey), is to drop cases with any missing data but then to weight (or reweight) the sample so that it reflects the population distribution with respect to known variables, such as age, sex, ethnicity, education, and geographical distribution. The U.S. Census Bureau and a number of sample survey houses do this to correct their sample surveys for differential nonresponse, but the method also has been used to correct for item nonresponse. If the substantive model is perfectly specified, the method can result in unbiased estimates, albeit with inflated standard errors. In addition, weights that depart substantially from unity also inflate the standard errors. (Stata's pweight option provides correct standard errors when this method is used, but the standard errors typically will be inflated relative to standard errors for unweighted data.) However, because our models are virtually never perfectly specified, the validity of the procedure depends on how closely perfect specification is approximated, which requires a judgment on the part of the analyst.
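The reweighting step can be sketched as follows. This is a minimal illustration, not the procedure of any particular survey organization: the strata, the population shares, and the function names are all assumptions introduced here. The effective-sample-size calculation uses Kish's approximation, which is not discussed in the text but gives a rough sense of how unequal weights inflate standard errors.

```python
from collections import Counter


def poststratification_weights(strata, population_shares):
    """Weight each retained case by (population share / sample share) of its stratum."""
    n = len(strata)
    counts = Counter(strata)
    return [population_shares[s] / (counts[s] / n) for s in strata]


def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical complete cases left after casewise deletion, by education stratum
complete_cases = ["low", "low"] + ["high"] * 6
# Hypothetical population shares, for example from a census
census_shares = {"low": 0.5, "high": 0.5}

weights = poststratification_weights(complete_cases, census_shares)
weighted_low_share = (
    sum(w for w, s in zip(weights, complete_cases) if s == "low") / sum(weights)
)
n_eff = effective_sample_size(weights)

print(f"weighted share of 'low' stratum: {weighted_low_share:.2f}")
print(f"nominal n = {len(weights)}, effective n = {n_eff:.1f}")
```

In this invented example the weighted sample reproduces the population shares, but the eight retained cases carry roughly the information of six equally weighted cases, which is the sense in which weights that depart from unity inflate standard errors.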
Mean Substitution
Various methods for imputing missing data (rather than dropping cases) have been proposed. (The mean substitution methods provide a way to generate complete data with respect to the explanatory variables. In these methods, the dependent variable is not imputed; doing so would amount to artificially inflating the strength of the association by adding cases on the regression line.) Early studies often simply substituted the mean or mode of the nonmissing values, but this procedure is now regarded as entirely inadequate because doing so without further correction produces biased coefficients in regression models even under the MCAR condition (Little 1992) and also produces downwardly biased standard deviations of each distribution containing imputed data and hence downwardly biased standard errors and confidence intervals of calculated statistics.

Another approach, which has been widely used in the social sciences, is the missing indicator method: for each independent variable with substantial missing data, the mean (or some other constant) is substituted, and a dummy variable, scored 1 if a value has been substituted and scored 0 otherwise, is added to the regression equation. An advantage of the method is that it provides a test of the MCAR assumption: if any of the dummy variables has a (significantly) nonzero coefficient, the data are not MCAR. Cohen and Cohen (1975, 274), early proponents of this method, claim that it corrects for the nonrandomness of missing data. However, Jones (1996) has shown that it and related methods (for example, adding a category for missing data when a categorical variable has been converted to a set of dummy variables) produce biased estimates.
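The mechanics of the missing indicator method can be sketched in a few lines. The function name and the data are hypothetical; the sketch shows only how the substituted variable and its companion dummy variable are constructed, and it does not repair the bias that Jones (1996) documents.

```python
def missing_indicator(values):
    """Return (filled, dummy): the mean of the observed values is substituted
    for each missing value (None), and dummy is 1 where a value was substituted."""
    observed = [v for v in values if v is not None]
    mean_observed = sum(observed) / len(observed)
    filled = [mean_observed if v is None else v for v in values]
    dummy = [1 if v is None else 0 for v in values]
    return filled, dummy


# Hypothetical father's years of schooling, with two nonresponses
father_schooling = [8, None, 12, 10, None, 14]
filled, missing_flag = missing_indicator(father_schooling)

print(filled)        # mean-substituted variable
print(missing_flag)  # dummy variable to add to the regression equation
```

Both the filled variable and the dummy variable would then enter the regression; a significantly nonzero coefficient on the dummy variable is the evidence against MCAR described above.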
A final mean substitution method is conditional mean imputation, in which missing values are replaced by predicted values from the regression of the variable to be imputed (for the subset of cases with observations on that variable) on other variables in the data set; this is the method implemented by Stata 10.0 in its impute command. This method also results in (typically downwardly) biased coefficients and underestimated standard errors.

All the mean imputation methods suffer from the problem of overfitting. Because missing data are replaced by a predicted value, the completed data set does not adequately represent the uncertainty in the process being studied, that is, the error component for each individual. This is manifest in standard errors that are too small, even in cases where the coefficients themselves are unbiased (see Rubin and Schenker 1986).
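The downward bias in standard deviations produced by simple mean substitution is easy to verify numerically. The values below are invented for illustration: five observed values, plus five missing cases that are all assigned the observed mean.

```python
import statistics

observed = [8.0, 10.0, 12.0, 14.0, 16.0]   # hypothetical nonmissing values
n_missing = 5                               # hypothetical number of missing cases

mean_observed = statistics.mean(observed)
completed = observed + [mean_observed] * n_missing   # mean substitution

sd_observed = statistics.stdev(observed)
sd_completed = statistics.stdev(completed)

print(f"mean: observed {mean_observed:.2f}, completed {statistics.mean(completed):.2f}")
print(f"SD:   observed {sd_observed:.3f}, completed {sd_completed:.3f}")
```

The mean is unchanged, but every imputed case sits exactly at the mean and so adds nothing to the sum of squared deviations while increasing the number of cases. The standard deviation of the completed variable is therefore too small, and so are any standard errors computed from it.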
Hot Deck Imputation
This is the method used by the U.S. Census Bureau to construct complete-data public use files. The sample is divided into strata (similar to the strata used in the weighted casewise deletion and conditional mean imputation methods). Then each missing value within a stratum is replaced with a value randomly drawn (with replacement) from the observed cases within the stratum. As a result, within each stratum the distribution of values for the imputed cases is (within the limits of sampling error) identical to the distribution of values for the observed cases. When the imputation model is correctly specified (that is, when all variables correlated with the missingness of a variable are used to impute the missing values), this method produces unbiased coefficients but biased standard errors. It also tends to perform poorly when a large fraction of cases have at least one missing value (Royston 2004, 228).
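The hot deck procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the Census Bureau's production method: the function name, the record layout, and the data are hypothetical, and donors are assumed to be drawn with replacement from the observed cases in the same stratum.

```python
import random


def hot_deck_impute(records, stratum_key, target_key, rng):
    """Return a copy of records in which each missing (None) value of target_key
    is replaced by a value drawn at random, with replacement, from the observed
    values of target_key in the same stratum."""
    donors = {}
    for rec in records:
        if rec[target_key] is not None:
            donors.setdefault(rec[stratum_key], []).append(rec[target_key])

    completed = []
    for rec in records:
        new_rec = dict(rec)  # leave the original record untouched
        if new_rec[target_key] is None:
            # A stratum with no observed cases would raise KeyError here.
            new_rec[target_key] = rng.choice(donors[new_rec[stratum_key]])
        completed.append(new_rec)
    return completed


# Hypothetical data: income is missing for one case in each stratum
records = [
    {"region": "urban", "income": 30},
    {"region": "urban", "income": 50},
    {"region": "urban", "income": None},
    {"region": "rural", "income": 10},
    {"region": "rural", "income": None},
    {"region": "rural", "income": 20},
]
rng = random.Random(2009)  # fixed seed so that the sketch is reproducible
completed = hot_deck_impute(records, "region", "income", rng)

for rec in completed:
    print(rec)
```

Because each imputed value is an actual observed value from the same stratum, the imputed cases reproduce the within-stratum distribution of the observed cases, apart from sampling error.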
Bayesian Multiple Imputation
This method was introduced by Rubin. Little and Rubin (2002) give an exposition of the method; Schafer (1997, 1999) provides more accessible introductions, as does Allison (2001).

The essence of multiple imputation is that, for each variable with missing data, values are predicted from other variables in the data set, and values drawn at random from the predicted distribution are substituted for the missing values. Because variables with missing data may be among the predictors for another variable with missing data, the process is repeated several times. The result is a complete data set, with the missing data imputed. The entire procedure is carried out several times to create several such complete data sets; the conventional number of data sets is five, but there is some evidence that a larger number may be helpful.
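Each of these data sets is then analyzed in the usual way, and the resulting coefficients are averaged or otherwise combined, using what have become known as "Rubin's rules" (Rubin 1987, 76). This method produces unbiased coefficients and, by taking account of the additional uncertainty created by the imputation process, unbiased standard errors. Specifically, the standard error of a coefficient based on M imputations is given by

$$
\mathrm{s.e.}(\bar{b}) = \sqrt{\frac{1}{M}\sum_{k=1}^{M} s_k^2 + \left(1 + \frac{1}{M}\right)\left(\frac{1}{M-1}\right)\sum_{k=1}^{M}\left(b_k - \bar{b}\right)^2} \tag{8.1}
$$

where $b_k$ and $s_k$ are the coefficient and its standard error estimated from the kth imputed data set, and $\bar{b}$ is the average of the $b_k$.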
That is, the standard error is estimated from the average of the squared standard errors based on each imputation (the leftmost term), which captures the uncertainty in the estimate within each imputation, plus a component for the variation in the estimated coefficients across imputations, which captures the uncertainty introduced by the imputation procedure.

For this procedure to produce proper imputations, two conditions must be satisfied: (1) that the analyst do a good job of predicting missingness, and (2) that if, in the substantive model, missingness is correlated with the outcome variable, the outcome variable be included in the imputation model.

Software that implements multiple imputation procedures has been written for Stata by Royston (2004, 2005a, 2005b, 2007), building on earlier work by Van Buuren, Boshuizen, and Knook (1999); to download the necessary ado files from within Stata (connected to the network), type lookup ice and click the fourth entry, "sj74." (See also useful guides to using Royston's ado files written by Academic Technology Services of the University of California at Los Angeles [UCLA]; they will appear in response to the same lookup command.) Royston's software makes the process less tedious than it used to be. Nonetheless, implementing multiple imputation can add considerable complexity to your analysis. The difficult and time-consuming part of the work is to choose the predictor variables used to produce estimates of the missing values of each variable that has any missing values.
The essence of the procedure is to specify which variables to include, to create appropriate transformations (dummy variables and interactions), and to specify the relationships among the variables. These details are part of Royston's command -ice-, which should be built using the dryrun option so that the logic can be tested before beginning what can be a lengthy computation. The imputation is then carried out, and a data set is saved, consisting of multiple copies of the original data, each of which has complete data because the missing values are imputed. However, in each of the completed data sets, the imputed values generally will differ. This multiple-copy, or multiply imputed, data set can then be used to carry out any analysis, using the command -micombine-. This command carries out the specified estimation procedure, for example, multiple regression, using each of the imputed data sets and then combines the resulting coefficients to produce a single coefficient (usually the average of the coefficients estimated from each of the completed data sets) and a standard error that takes account of the additional uncertainty introduced by the imputation procedure (see Equation 8.1).

Typically, construction of the imputed data set is computationally intensive (in the worked example discussed next, it took about 3.5 minutes on my home computer, which has a 2.92 GHz processor), but analysis using the imputed data set is nearly as fast as analysis using a single data set, typically a matter of seconds. As you increase the number of imputations, the time required to create the imputed data sets increases arithmetically. As you add variables to be imputed, the time required increases at a faster rate. For example, approximately doubling the number of variables to be imputed increased the time by a factor of four.

Perhaps the best way to convey what is entailed in using multiple imputation to create and analyze complete data sets is to carry out an example. This is what follows.
The do and log files that produced the example are available for downloading. These files contain, just before the imputation step, a discussion of how to specify the -ice- command.
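The combining step described by Equation 8.1 can be sketched in a few lines of Python. This is not the code of Royston's -micombine- command; the function name and the numbers are assumptions made for illustration, and only the arithmetic of Rubin's rules for a single coefficient is shown.

```python
import math


def combine_rubin(estimates, std_errors):
    """Combine one coefficient estimated on M imputed data sets.

    Returns (combined coefficient, combined standard error), where the squared
    standard error is the average within-imputation variance plus (1 + 1/M)
    times the between-imputation variance of the coefficients."""
    m = len(estimates)
    b_bar = sum(estimates) / m
    within = sum(se ** 2 for se in std_errors) / m
    between = sum((b - b_bar) ** 2 for b in estimates) / (m - 1)
    total_variance = within + (1 + 1 / m) * between
    return b_bar, math.sqrt(total_variance)


# Hypothetical results for one coefficient from M = 5 imputed data sets
coefs = [1.0, 1.2, 0.8, 1.1, 0.9]
ses = [0.1, 0.1, 0.1, 0.1, 0.1]

b_combined, se_combined = combine_rubin(coefs, ses)
print(f"combined coefficient = {b_combined:.3f}, standard error = {se_combined:.3f}")
```

In this invented example each imputed data set yields a standard error of 0.1, but the coefficients vary across imputations, so the combined standard error is larger than any single-imputation standard error. That difference is the additional uncertainty introduced by the imputation procedure.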
A WORKED EXAMPLE: THE EFFECT OF CULTURAL CAPITAL ON EDUCATIONAL ATTAINMENT IN RUSSIA
There is increasing evidence from many nations that the extent to which parents are engaged with the written word, measured by the number of books in the household when the respondent was growing up, is at least as important (and perhaps more important) a determinant of educational attainment as is the amount of formal schooling attained by parents (Evans and others 2005). The reasoning is straightforward: what matters about parental education is not the credentials it brings but the way it affects family life and child rearing. In households where reading is an important activity, children often learn to read at home, enjoy reading, and become good at it, all of which improves their ability to meet the demands of formal schooling. Thus, they tend to do well in school and in consequence to continue their education to advanced levels.

In this example I investigate whether books in the childhood home were important for educational attainment in Russia. I chose Russia for the worked example both because the number of books is probably a good indicator of family reading habits in Russia, because the cost of books was very low during the Soviet period (my data pertain to adults surveyed in 1993, just after the collapse of the Soviet Union), and because, as a result of massive casualties during the Second World War, there is an unusually large amount of missing data on parental characteristics in the Russian data set, a national probability sample of 5,002 Russian adults age twenty and over (see Appendix A for details on the data and how to obtain them; see also Treiman and Szelényi 1993, Treiman 1994). The sample is restricted to those age twenty to sixty-nine to avoid understatement of educational attainment by those still in school (fewer than 1 percent of twenty-year-olds were still in school) and differential mortality and morbidity (rendering people unavailable to be interviewed) among those age seventy and older. This reduces the sample to 4,685. In the present analysis, the data are not weighted, although weights are permissible in -ice-, because weighting introduces an additional complication in comparing results obtained from a casewise-deleted data set and a multiply imputed data set.
Creating the Substantive Model
I first specify a conventional educational attainment model:

$$
\hat{E} = a + b(E_P) + c(E_D) + \sum_{i=2}^{6} d_i(O_i) + e(C) + f(S) + g(M) + h(B) + i(E_P B) \tag{8.2}
$$

where E is the number of years of schooling of the respondent; $E_P$ is the sum of the years of schooling of the father and the mother; $E_D$ is the difference in the number of years of schooling of the father and the mother; the $O_i$ are dummy variables corresponding to the father's occupational category when the respondent was age fourteen; C is year of birth ("cohort"), which captures any effects of the secular increase in education in Russia over the course of the twentieth century; S is the number of siblings, which is known to negatively affect educational attainment (Maralani 2004; Lu 2005; Lu and Treiman 2008); M is scored 1 for males and 0 for females to test the possibility that the two sexes differ in their average education, something that is true in some places but not others; and, finally, B is an ordinal scale measuring the number of books in the household when the respondent was age fourteen (the categories are none, 1 or 2, around 10, around 20, around 50, around 100, around 200, around 500, and 1,000 or more), and $E_P B$ is a product term that captures a possible interaction between parental education and the number of books. I would expect the size of the home library to be more important when parents are less educated, on the ground that well-educated parents are likely to provide school-relevant skills whether or not there is a family culture of reading but that this is less likely when the parents are poorly educated. That is, I would expect that parental education and parental involvement in reading are to some extent substitutes.
TECHNICAL DETAILS ON THE VARIABLES
■ Parental education. I specify the sum and difference of the years of schooling of each parent rather than simply including the years of schooling of each parent as a separate variable. It can be shown that the two specifications are mathematically equivalent and that either can be derived from the other. But the specification I used is more readily interpretable because it gives the overall effect of parental education plus any additional effect resulting from a difference in the educational level of the parents.
■ Father's occupation. The occupational categories are from the six-category version of the Erikson-Goldthorpe-Portocarero (EGP) occupational class scheme, modified by Ganzeboom and Treiman (1996).
■ Number of books. I explored three specifications of this variable: the ordinal scale, midpoints of the number of books indicated by each category, and the natural log of the midpoint scale. Interestingly, the ordinal scale produced the best fit, probably because the log scale excessively diminished the effect of increases in large home libraries.
The problem for this analysis is that many of the variables in the model have substantial fractions of missing data. Table 8.1 shows the percentage of cases missing for each variable. If I were to simply drop all cases with missing data, I would be left with only 57 percent of the sample (2,661 cases). Moreover, because it is probable that missingness is correlated with other variables in the model, I would be analyzing a nonrandom subset of the original sample, thus completely undercutting the validity of any claim that the analysis characterizes the educational attainment process in late-twentieth-century Russia. Evidence that missingness is not random is to be found in a comparison of the means and standard deviations based on the complete-data (casewise deletion) sample (N = 2,661) and the corresponding statistics computed over all observations available for each variable: in the complete-data subsample, the means of the socioeconomic status variables are generally higher, and the standard deviations are generally smaller, than when computations are based on all observations available for each variable. Thus, I turn to multiple imputation of the missing data to create a valid complete data set.
TABLE 8.1. Descriptive Statistics for the Variables Used in the Analysis, Russian Adults Age Twenty to Sixty-Nine in 1993 (N = 4,685).
[Table body not recoverable from the source text. Columns: mean (all observations; casewise deletion, N = 2,661), SD (all observations; casewise deletion, N = 2,661), number of nonmissing observations, and % missing.]
Creating the Imputation Model
For each of the variables with any missing data, it is necessary to specify an imputation model, that is, a model predicting values on the variable from the cases for which observations are available. Van Buuren, Boshuizen, and Knook (1999, 687) suggest that although in principle the larger the number of variables in the imputation model the better, in practice (to avoid multicollinearity and computational problems), it is best to limit the predictor set to fifteen to twenty-five variables. They propose as criteria for inclusion:

1. Include (as predictors for each variable with missing data) all variables that will be included in the substantive (complete-data) model.
2. In addition, include (as predictors for a given variable with missing data) all variables thought to affect the missingness of that variable. Such variables can be identified by examining the association between missingness and candidate variables. If the association is not zero or close to zero, include the candidate variable.
3. In addition, include (as predictors for a given variable with missing data) all variables that are strong predictors of that variable. Such variables can be identified by examining the association between the given variable and candidate variables for cases in which the given variable is observed.
4. Remove from sets (2) and (3) those variables that themselves have substantial amounts of missing data.

An intermediate step that I skip in the present exposition is to confirm that the data are not MCAR by predicting missingness from other variables in the data set; if some of the coefficients are nonzero, we have evidence that the data are not MCAR. However, there is no way of deciding empirically whether they are MAR or NI. For each variable, missingness is dichotomous, so the appropriate estimation tool is binary logistic regression. However, because we will not cover this technique until Chapter Thirteen, this part of the worked example is omitted.

In the present case, we need to impute missing data for all variables in Table 8.1 except gender and year of birth (which have no missing data).
Following the criteria of Van Buuren and his colleagues, my imputation model for the variables included in the substantive model is

$$
\begin{aligned}
E &= f(E_P, E_D, O_i, C, S, M, B, E_P B) \\
E_P &= f(E, E_D, O_i, C, S, M, B) \\
E_D &= f(E, E_P, O_i, C, S, M, B) \\
O &= f(E, E_P, E_D, C, S, M, B) \\
S &= f(E, E_P, E_D, O_i, C, M, B) \\
B &= f(E, E_P, E_D, O_i, C, S, M)
\end{aligned}
$$
where the variables are those included in the substantive model defined in Equation 8.2. I do not have to restrict myself to variables included in the substantive model. Rather, following Van Buuren, Boshuizen, and Knook (1999), I might well have chosen additional variables, which predict the independent variables in the model or predict their missingness or do both; generally, this would be advisable. However, in the interest of keeping the example from becoming too complex, I settled for the prediction equations shown.

The -ice- command permits the specification of several different estimation models (OLS for continuous variables, and also binary, ordinal, and multinomial logistic regression for categorical variables). Because we have not yet considered logistic regression, I ask you to take it on faith that they are the appropriate techniques for dealing with these kinds of variables; these techniques will be exposited in Chapters Thirteen and Fourteen. As it happens, all variables to be imputed are continuous except the father's occupation categories, which are imputed using multinomial logistic regression.

A very useful feature of -ice- is its ability to handle "passively imputed" variables, that is, variables such as interaction terms and sets of dummy variables that are mathematical transformations of other variables, some of which may include missing data. See Royston (2005a, 191-195) for a description of the procedure, and the downloadable files for the chapter for a detailed discussion of how to specify the -ice- command.

Comparing Casewise Deletion and Multiple Imputation Results
Table 8.2 shows regression coefficients, standard errors, and t- and p-values for two models, one estimated using casewise deletion and the other estimated from a multiply imputed data set using Royston's command -micombine-. Although the results are not greatly different, they do lead to different substantive conclusions regarding three of the twelve variables in the model: if we accept the conventional .05 level of significance, we would conclude that all variables in the casewise deletion model are significant, with the exception of the father's employment in a routine nonmanual job, which results in equal educational chances for offspring as does the father's employment in a managerial or professional job. In particular, we would conclude that it is beneficial if the mother is well educated relative to the father (because, net of the average level of parental education, the more the father's education exceeds the mother's, the lower the educational attainment of offspring). We also would conclude that, as expected, those from large families get less education, and also that males get less schooling than do females. However, none of these three coefficients attains the .05 level of significance in the imputed data.

Interestingly, the size of the home library is the most important variable in the model, as indicated by the standardized coefficients, shown in the rightmost column. But, as predicted, its importance diminishes as parental education increases, as is evident from the negative coefficient for the interaction term.
TABLE 8.2. Comparison of Coefficients for a Model of Educational Attainment Estimated from a Casewise-Deleted Data Set [C] (N = 2,661) and from a Multiply Imputed Data Set [M] (N = 4,685), Russian Adults Age Twenty to Sixty-Nine in 1993.
[Table body not recoverable from the source text. Rows include parents' years of school (difference), father's occupation (professionals and managers omitted), year of birth, number of siblings, and male; columns report the coefficient, standard error, t, and p for the casewise-deleted [C] and multiply imputed [M] data sets, with standardized coefficients in the rightmost column.]
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have considered different kinds of missing data, drawing upon a distinction between data that are "missing completely at random" (MCAR), data that are "missing at random" (MAR), and data that are "missing not at random" (MNAR or NI). We explored the properties of each of these missing data types and then considered a number of procedures for handling missing data, including listwise deletion and various methods for the imputation of missing values. We determined that most of the procedures reviewed produce biased coefficients in prediction models. This circumstance motivated consideration of multiple imputation procedures, in which missing data are imputed several times and the results of each imputation are combined. Multiple imputation offers the best chance of producing results that are free from bias. We then considered, through a worked example (the role of cultural capital in educational attainment in Russia), how to implement multiple imputation procedures, using software written by the British medical statistician Royston.

Thus far we have carried out statistical inference on the assumption that our data were drawn from a simple random sample. However, most surveys, such as the GSS, are not based on simple random samples but rather on complex multistage probability samples. In the next chapter we consider various sample designs and see how to get correct standard errors when we are using data based on stratified or clustered multistage probability samples or both.
SAMPLE DESIGN AND SURVEY ESTIMATION

WHAT THIS CHAPTER IS ABOUT
Thus far we have treated the issue of statistical inference as if we were analyzing simple random samples and our data conformed to the distributional properties assumed by ordinary least squares (OLS) regression. Neither condition is likely to hold in practice. Thus, now that we are comfortable with the manipulation and interpretation of regression models, it is time to expand our analytic toolkit to make correct inferences about data based on the kinds of complex samples typically used in national surveys. We also will want to consider how to identify and, if possible, correct for anomalous features of our data. As we will see, these two topics are fairly closely related.

I begin with a description of types of samples used in survey research. I then discuss the problem of statistical inference created by complex sample designs. Chapter Ten then considers various diagnostic procedures for OLS regression models and some ways of correcting problems revealed by the diagnostic procedures.
SURVEY SAMPLES
As we know from elementary statistics, to generalize from a sample to a population we need some sort of probability sample. For our purposes, there are three basic kinds of probability samples:

■ Simple random samples, in which every individual in the population has an equal chance of being included in the sample. (The equal probability of selection condition defines a sample as random.)
■ Multistage probability samples. These are nothing more than complex random samples in which units are randomly sampled and then subunits of the sampled units are randomly sampled, and so on. Examples include area probability samples in which, say, cities and counties are randomly sampled, then blocks within areas, then households within blocks, then persons within households; and school samples in which, say, school districts are sampled, then schools within districts, then classrooms within schools, then pupils within classrooms.
■ Stratified probability samples, which are also complex random samples. In stratified samples the population is divided into strata on the basis of certain characteristics (race, sex, place of residence, and so on). A probability sample is drawn within each stratum, with the strata often sampled at different rates, for example, with Blacks sampled at a higher rate than non-Blacks to ensure that there are enough Blacks for analysis.
Simple Random Samples
Let us first consider simple random samples. To draw a random sample requires a list of every individual in the population and a randomizing procedure to select a fraction of the individuals in the population. A typical way of drawing a random sample before the age of computers was to consult a list of random numbers. Table 9.1 shows a small portion of such a list.

TABLE 9.1. Portion of a Table of Random Numbers.

10480  15011  01536  02011  81647  91646
22368  46573  25595  85393  30995  89198
24130  48360  22527  97265  76393  64809
42167  93093  06243  61680  07856  16376

Suppose we wanted to draw a random sample of 10 people out of a class of 40, using a table of random numbers such as Table 9.1. We would list the 40 people in the class sequentially, from 1 to 40. Then we would work through the table of random numbers, selecting each person whose number comes up and sampling without replacement, until 10 people had been chosen. Such a sample is random because, by virtue of the selection procedure, every person in the class has an equal chance of being chosen.

An alternative is a systematic sample, in which a starting case is chosen at random and then every kth case is chosen from the list; in our example, every fourth student in the class.
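With a computer, the table of random numbers is replaced by a pseudorandom number generator. The sketch below draws a sample of 10 from a class of 40 in two ways: a simple random sample without replacement, and a systematic sample that takes every fourth student after a random start. The function names and the seed are assumptions made for illustration.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n distinct units; every unit has the same chance of selection."""
    return rng.sample(population, n)  # sampling without replacement


def systematic_sample(population, n, rng):
    """Pick a random start within the first interval, then take every kth unit."""
    interval = len(population) // n          # 40 // 10 = 4: every fourth student
    start = rng.randrange(interval)          # random start among the first k units
    return [population[start + i * interval] for i in range(n)]


rng = random.Random(1993)          # fixed seed so that the sketch is reproducible
students = list(range(1, 41))      # the class list, numbered 1 to 40

srs = simple_random_sample(students, 10, rng)
systematic = systematic_sample(students, 10, rng)

print("simple random sample:", sorted(srs))
print("systematic sample:   ", systematic)
```

In the systematic sample only the starting point is random, so the design gives each student a one-in-four chance of selection but allows only four distinct samples; it behaves like a simple random sample only when the order of the list is unrelated to the variables under study.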
Hu lti stageprobabi Iity Sa ples m
5rmpterandomsamplingis pracdcr nmpterelisr of thepopularioni, l,:*I,':t
circumslancesin panicular.
whena "Tt* ecentrar rocation. b!,,'";#;:;'ii:,jj"#1.'i$:,:",ir"ilf":ffi n i"r.r.,,fiiff#
Gi..ii.'
'urelqordrofeurp aq plnoMlr 'asnol qlnur e sr 'Sar4Uno: ,{ueu.r ul ualqord :aq1o 1ou serqaldu.les'sauoqdalata^Pqsalels }o 1O patPaJla eq sat^ap 6utuaalts pa run aq] ursploqasnoq lle]sourlpalurs'llrls serllnlr#rp ^^au auoqdalalaldrllnul Clra e) pue 'saurqrpu]xe] 'souotldlla) ]o uorleta]tlotdaql sploLlasnoq ro, ]snlpe o] pue sauoqdelalssaulsnq]no uaalls o] pssl^apeq ]snul sajnpa)old laAaMoH '6urlplpll6lp tuopueJ,Lo sueeuAq'pasnaq uP) 6urlduesuropuPr'aldr)u d ul 'asne)aqsMal^ qlrmueql smarnralut auoqdqI^ Jarsee stIllptauab6utldules ralurploqasnoquosradur 'uorlesranuot auoqde e 6uropaq ol ulel) saDuabeoutla)reulalaleulosaluts olur aldoadMprp o1alr^epp se ^a^dns ueql pleMol Jo] plau aql pareanbseq 6uqa>peuta1a1 Aller)adsa'qlrpasarIaruns a1euir1r6el are stuapuodser a1soq,(l6ursear:ut leqt sl s^arunsauoqdalalqll/!\ Illn)$+lp lPul+V 'sraMawalurauolldola]lualadLuo)put] otrlln)tj+tpst ]t 'pnole
l_C
ftrr!s [! Earg t
J)y.
l+ a'!fr c.Er:5; t 6.1" ;r cL9J}a"E P*i
! s rr: ?1F:t\
pS,u 1: 7:[a
ur era uP ur la^oalon an6tle] luapuodsataztutl 6urpearte pallr)s arp eldoad Ma] Ltrf,rq^^ pue 'auoqdalalaql ta,lo suorl aq lsnuJs^^ar^ralur ueql ra]loqs a)p]olalel uru]ol sMarrualur sanbxalduJollse o] pue lodder qsrlqelsaol llnlr+,rparou.rsr1rpupqraqlo aLlluO 1)loM ol lue]lnlal aJesla^ at^lalurpue)Joopaqt raMsuPol luelfnlat ate aldoadataqmspooqloqq6tau palpo ut ssor.llsP Ll)ns'uos ro spooqloqqbtau auru)q6rqur osle pup 's6urplnq ^]rnlas ssa)fp o1alqrssod srlr leql st slallns auoqd ere sploqasnoq ur ssa)re ol fua^ rad leql llnlt11rp ql6ual auoqd e lo] 0E,g auresaql +o,!\at^Jalut a1a1 1oa6eluenperoleuraql '}so) L!orj Uedv ltuJapete outpPal ]noqe qlr^^ parpduJol']uapuodsarrad OO€$]noqe slsol sraluar ^aruns lnoqauo a)eJo}a)P+ aq] jo auo ,{q papnpuor aldures&tltqeqoldleuol}eue ul Mal^lalul p'salels palrunaql ul Illuornl s^ahraluta)e]olale+ueLll ss"1 ".,",{aql"rnrr"O Sl ^llso) euoqdatat pasn ares^a^rns
Bcr.r:r:=! €.:::5]; 3.E
LPti ts
l:
:r:
,,ra eI =rtr':E
'ilv5t
D6 l
^t6urspal]ur ^tapri\^
n
Ft: [\F I,T.f fm;{,Li'uurxr:qr F U:S ail o1€\:'id F F:4ntq f @oJ:qq  umIPJrSpfl 4 ':trL'L=q F Urqllld
;flEl
FlsIII ]s gr1:orSq:1P 't:rq+ls iue op,6PJ U?q F [uoI\ sluE_s .D
SAlnUnS
lNOHdlllI
IC
case of national samples of the U.S. population in which face-to-face residential interviews are conducted, that is, interviews in which the interviewer goes to the home of the respondent. First, as just noted, there is no national register of the U.S. population, which makes it impossible to draw a simple random sample of the population. Second, even if it were possible to draw such a sample, it would be prohibitively expensive to travel to the homes of the selected respondents, who would almost certainly be scattered throughout the nation. Thus both sampling and fieldwork considerations lead us to devise multistage probability samples for national household surveys.

Such samples are created in stages (hence the name multistage probability samples). In the first stage, primary sampling units (cities, counties, and so on) are drawn at random, with probability proportional to size (PPS), that is, in such a way that the chance of each city being selected is proportional to the size of its population. Suppose we want to draw a national sample of two thousand people and decide to do this by choosing one hundred primary sampling units (PSUs) and conducting twenty interviews in each place. Obviously, we could not simply list all the cities in the country and randomly choose some of them: this would give people in small towns a much higher chance of being included in the sample than people in large cities. For example, if we randomly chose Los Angeles,
Santa Monica, or Beverly Hills (the latter two are small cities in Los Angeles County) and then randomly selected twenty people in the chosen city (assuming we had a list of all residents), any person in Santa Monica or Beverly Hills would have a much higher chance of being included in the sample than would any person in Los Angeles.

So instead we group cities into strata on the basis of their size and randomly sample cities within strata, at a rate proportional to their size. For example, we might group the largest cities into a stratum, large cities into a second stratum, medium-sized cities into a third stratum, and so on. Suppose the population of the largest group averaged two million, the population of the second group one million, the population of the third group five hundred thousand, and so on. We might then randomly choose every city in the first group, every other city in the second group, every fourth city in the third group, and so on. If we then interviewed the same number of people from each selected city, every person in the country would have an (approximately) equal chance of being included:

    20/(2 million) = 1/100,000
    20/(1 million/0.5) = 1/100,000
    20/(500,000/0.25) = 1/100,000

and so on.
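The arithmetic behind this equal-chance claim can be checked with a short sketch. It is written in Python rather than the Stata used elsewhere in the book, and the stratum labels and the `inclusion_probability` helper are illustrative names assumed for this example, not part of the original text. The sketch multiplies the chance that a city is chosen by the chance that a resident of a chosen city is interviewed.

```python
from fractions import Fraction

INTERVIEWS_PER_CITY = 20  # same number of interviews in every selected city

# Illustrative strata: (label, average city population, fraction of cities sampled).
STRATA = [
    ("largest cities", 2_000_000, Fraction(1, 1)),  # every city is chosen
    ("large cities", 1_000_000, Fraction(1, 2)),    # every other city
    ("medium cities", 500_000, Fraction(1, 4)),     # every fourth city
]


def inclusion_probability(city_population, city_fraction,
                          interviews=INTERVIEWS_PER_CITY):
    """P(person sampled) = P(city chosen) * P(person chosen | city chosen)."""
    return city_fraction * Fraction(interviews, city_population)


probabilities = {
    label: inclusion_probability(population, fraction)
    for label, population, fraction in STRATA
}

for label, p in probabilities.items():
    print(f"{label}: {p.numerator}/{p.denominator}")
```

Exact fractions are used so that the three products can be compared without rounding error; each works out to 1/100,000, matching the figures above.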
MAIL SURVEYS

Mail surveys are generally undesirable because they tend to [...] of the survey (through registered letters, telegrams, phone calls, and so on). Jonathan Kelley and Mariah Evans have achieved amazingly high response rates in mail surveys carried out in Australia, on the order of 65 percent, by doing extensive follow-up. They also show that nonrespondents to their surveys are essentially no different from respondents (Evans and Kelley 2004, Chapter 20). Such surveys require a sampling frame that includes addresses. This is impossible in the United States but possible in countries that have registration systems, such as Australia, where voting registration is required. Noncitizens are excluded, but the sampling frame is good otherwise.

Another disadvantage of mail surveys is that one cannot ask complex questions or questions that are contingent on responses to previous questions; respondents have difficulty following the logic of complex contingencies, known as filters. On the other hand, one can ask questions with relatively long lists of alternatives because people can handle more alternatives when they can read and refer back to them than when the items are read to them.

A final limitation of mail surveys is that they are vulnerable to being completed by committee, that is, by several members of the household consulting one another. For many topics this poses little difficulty and may actually be advantageous, as, for example, when life histories are solicited; but where independent responses are required, this is a serious shortcoming.
WEB SURVEYS

In recent years Web-based surveys have become increasingly widely used. In some respects Web surveys are like mail surveys in that they eliminate the interviewer and require a respondent to decide to participate and to complete the survey without the benefit of persuasion by a live person, which, when practiced by a skilled interviewer, can overcome trepidation, boredom, irritation, and other impediments to completing the interview. On the other hand, for the computer literate they are easier to complete than paper-and-pencil questionnaires, at least when they are well designed. They also have the advantage over all other modes in permitting complex filters, in which questions are included or omitted depending on responses to previous questions. In both face-to-face and telephone surveys, filters are used, but they are vulnerable to interviewer error. In paper-and-pencil surveys, using filters is difficult because respondent error is likely.

With respect to sample bias, Web surveys today face the same limitations as telephone surveys did in the United States in the first half of the twentieth century: a strong socioeconomic bias in computer access and computer literacy. In addition, there is no known sampling frame of Web addresses that corresponds to a population of people. Moreover, given the current flood of spam and concerted attempts to intercept it through spam filters, efforts to secure responses from a random sample of Web addresses will likely fail. Hence the usefulness of Web-based surveys is likely to be restricted to situations in which there is a well-specified sampling frame (such as a list of members of an organization) and the ability to address survey questionnaires to named individuals with suitable appeals and inducements to respond and assiduous follow-up efforts to convert nonresponses to responses.
The problem with this method is that the strata may be quite heterogeneous. For example, suppose all cities with populations of one million or more are included in the first stratum. Then, if cities were simply chosen at random, residents of Los Angeles would have only one-third the chance of being included in the sample as residents of San Diego, since the population of Los Angeles is about three times the population of San Diego.

To avoid this problem an alternative procedure is often used: within each stratum, units are sampled PPS. To accomplish this, all the units are arrayed in order according to their size, and the total population is cumulated. Then numbers are drawn at random and units are chosen that include the randomly drawn numbers. For example, suppose we want to sample PPS five of the ten largest cities in California as PSUs, so we can interview one hundred people per PSU. (Here, because of the large variance in the sizes of the cities, it makes sense either to sample with replacement or to divide Los Angeles, and perhaps San Diego, into portions and treat each portion as a separate city. I have done the former.) Table 9.2 shows the population (here, according to the 1990 census), the cumulative population when cities are arrayed by size, and the percentage of the total population of the ten cities residing in each city.
TABLE 9.2. The Population Size, Cumulative Population Size, and Percentage of the Total Population Residing in Each of the Ten Largest Cities in California, 1990.

City             1990 Population   Cumulative Population   Percentage of Total Population of the 10 Cities
Los Angeles          3,485,398           3,485,398                 43.2
San Diego            1,110,549           4,595,947                 13.8
San Jose               782,225           5,378,172                  9.7
San Francisco          723,959           6,102,131                  9.0
Long Beach             429,433           6,531,564                  5.3
Oakland                372,242           6,903,806                  4.6
Sacramento             369,365           7,273,171                  4.6
Fresno                 354,202           7,627,373                  4.4
Riverside              226,505           7,853,878                  2.8
Stockton               210,943           8,064,821                  2.6

Now we need to choose some random numbers. Going to a convenient random number table at the back of [...] and arbitrarily deciding [...] through ninth [...], we proceed as follows:

    [...]         Beyond the range; ignore
    4,204,805     Choose San Diego (since 4,204,805 falls within the range 3,485,399 to 4,595,947)
    1,168,953     Choose Los Angeles
    [...]         Choose Los Angeles again
    [...]         Choose Los Angeles again
    6,574,717     Choose Oakland
Note that Los Angeles is chosen three of the five times. (Of course, since the population of Los Angeles is 43 percent of the total population of the ten largest cities in California, we would expect Los Angeles to be chosen about two out of five times on average if we repeated the sampling procedure many times.) We would thus divide Los Angeles into three equally sized sections and treat each of them as a primary sampling unit, together with San Diego and Oakland. By sampling in this way, and repeating the process for smaller units within each primary sampling unit, we ensure that every individual living in the ten cities has an approximately equal chance of being included in the sample, precisely because the chance of the city being included is exactly proportional to the size of the city.

Note that I say "approximately equal." This is because the multistage selection process introduces "lumpiness." Here, for example, each primary sampling unit represents exactly 20 percent of the population, but each city does not contain an exact multiple of 20 percent of the population. Although there always will be some lumpiness, the larger the number of sampling units at each stage, the smaller the problem becomes.

Typically a survey house will use the same primary sampling units repeatedly. For example, the National Opinion Research Center (NORC) changes its primary sampling units every ten years, when the new census data are available (these are needed to determine the population size). NORC does this because it maintains a staff of interviewers in each primary sampling unit and wants to avoid the expense of recruiting and training a new set of interviewers for each survey. The part of a sampling design that is fixed in advance and maintained over time is known as the sampling frame.
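The cumulative-total procedure used in the California example can be sketched in a few lines of Python (not the Stata used elsewhere in the book). This is a minimal illustration under stated assumptions: the populations are the 1990 figures from Table 9.2, while the function names `city_for_number` and `draw_pps_sample` and the use of Python's pseudo-random generator in place of a printed random number table are choices made here for illustration only.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# 1990 populations of the ten cities, arrayed by size (Table 9.2).
CITIES = [
    ("Los Angeles", 3_485_398),
    ("San Diego", 1_110_549),
    ("San Jose", 782_225),
    ("San Francisco", 723_959),
    ("Long Beach", 429_433),
    ("Oakland", 372_242),
    ("Sacramento", 369_365),
    ("Fresno", 354_202),
    ("Riverside", 226_505),
    ("Stockton", 210_943),
]
NAMES = [name for name, _ in CITIES]
CUMULATIVE = list(accumulate(population for _, population in CITIES))
TOTAL = CUMULATIVE[-1]


def city_for_number(number):
    """Return the city whose cumulative range contains `number`.

    Numbers outside 1..TOTAL are beyond the range and yield None,
    which corresponds to ignoring the draw.
    """
    if not 1 <= number <= TOTAL:
        return None
    # First city whose cumulative population is at least `number`.
    return NAMES[bisect_left(CUMULATIVE, number)]


def draw_pps_sample(n_units, rng):
    """Draw n_units PSUs with replacement, probability proportional to size."""
    return [city_for_number(rng.randint(1, TOTAL)) for _ in range(n_units)]


print(city_for_number(4_204_805))  # San Diego
print(city_for_number(1_168_953))  # Los Angeles
print(city_for_number(6_574_717))  # Oakland
print(draw_pps_sample(5, random.Random(0)))
```

Because sampling is with replacement, a large city such as Los Angeles can be returned more than once, which is why the text then splits it into equally sized sections.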
PHILIP M. HAUSER (1909-1994) was a demographer who spent his entire academic career at the University of Chicago, earning his BA in 1929, his MA in 1933, and his PhD in 1938, all in sociology. He made important organizational and academic contributions to the social sciences, working at the U.S. Bureau of the Census from 1939 to 1947, serving first as Assistant Chief Statistician of the Population Census and eventually as Assistant Director (and as Acting Director from 1949 to 1950). At the Bureau he played a major role in creating the 20 percent sample long form, used in the 1940 census for the first time, as well as methods to reduce the undercount, particularly of Blacks. At Chicago he published on many topics. Perhaps most notable among his publications was a study of mortality differentials by race and class (Kitagawa and Hauser 1973). He also established the University of Chicago Population Research Center and served as its director for thirty years, training more than a hundred PhDs, many of them from developing nations. He is perhaps the only person to have served as president of three major professional associations in the social sciences: the American Sociological Association, the American Statistical Association, and the Population Association of America.
When sampling large, geographically diverse populations, the selection process typically is repeated several times, for successively smaller units. For example, in a 1996 national sample survey of China (Treiman 1998), we divided the country into urban and rural sectors. Then, within each sector, we sampled counties (or their urban equivalents) with probability proportional to size. Within each of the chosen counties, we sampled townships (or zipcode-sized districts of cities ["streets"]), with probability proportional to size. Then within each of the chosen townships we sampled villages (or neighborhoods of cities), with probability proportional to size.

Once small geographical units are identified (for example, villages in rural China or districts or neighborhoods of cities) there are four standard alternatives for choosing individuals to be interviewed:

• Random selection from a population register
• Random selection from a list of addresses (household samples) and further selection within households
• Random walk procedures (another way of selecting households)
• Quota selection
Population Register Samples

In countries that maintain registers of the population (for example, Eastern Europe and China), it is common to randomly sample individuals meeting the study criteria (usually simply those falling within some age range) directly from the population register. This is a very good method because it allows strong control from the office; that is, it makes it very difficult for the interviewers to cheat by filling out the questionnaires themselves. A simple control procedure is to ask the respondent for the exact date of birth. This information typically is in the population register but will be unknown to the interviewer. Thus interviewers cannot make up an answer to this question from their kitchen table.

There are three (related) potential disadvantages to using population registers to draw samples. First, if the register is not kept up to date, it will miss those who tend to move around a lot. Second, often people are officially registered in one place (for example, their home village) but are away working somewhere else for an extended period. Thus they will be interviewed in neither place because it usually is extremely expensive to track them down. This is a major problem in China, where 25 percent of the population of Beijing, and comparable proportions in other cities, is "floating," working in the city but registered in a village. To obtain better records for official statistics (and also, indeed mainly, to maintain tight social control of the population), the Chinese government began in 1994 to require that people residing in a place for more than three months register as "temporary residents"; nonetheless, many people fail to register. A third disadvantage to basing samples on population registers is that the registers are virtually always restricted to the de jure population rather than the de facto population. So a large resident alien population, like Germany's Gastarbeiter (guest workers), will be excluded. This can result in rather odd samples. For example, German samples typically have far too few male unskilled workers because unskilled jobs are almost always done by Gastarbeiter.

Random Samples of Households and Further Selection within Households

In the United States and other countries lacking population registers, the problem is to create a list of people to be sampled within each of the small geographic units chosen. This typically is done in three stages: by enumerating households, sampling them, and then, as part of the interviewing process, randomly choosing one person (or more) per household to be interviewed.
Households are enumerated (listed) by fieldwork staff who walk through the area, locating and recording every occupied dwelling unit. In suburban neighborhoods full of single-family houses, this is pretty easy, although one still has to be careful to include mother-in-law apartments and such. In places where people live in garages, rooms in the backs of shops, and other informal dwellings, it can be very difficult. (Contemporary urban China is such a case. For an account of challenges faced by those trying to conduct sample surveys in such environments, see Treiman, Mason, and others [2006].) It can also be difficult to get into security buildings and gated neighborhoods. This is a problem not only at the listing stage but at the interviewing stage as well.

Once the list is compiled, a random sample of dwelling units is drawn and interviewers are sent to conduct the interviews. The next problem is to randomly select one or more people within the household for interviewing. This is done by the interviewer, who lists all the residents of the household who meet the criteria and randomly selects one (or sometimes more, depending on the design of the study), using what is known as a Kish table (after the sampling statistician Leslie Kish) or similar methods (see Gaziano [2005] for a review of within-household respondent selection techniques). Suppose, for example, that the interviewer is instructed to interview a person aged eighteen to sixty-nine. The interviewer lists all household members between eighteen and sixty-nine and then chooses one by referring to a table of random numbers or using some other device, such as choosing the person whose birthday is closest to the interview day.

Household samples have the advantage of capturing the de facto population, that is, the population actually living in a place. But they have three important disadvantages. First, it is fairly easy to cheat, interviewing whoever happens to be available rather than returning to complete an interview with a person chosen but not available. Interviewers are supposed to make a specified number of attempts to complete the interview (typically three) before abandoning the attempt. I picked up cheating in a survey I did in South Africa in the early 1990s by noting that 97 percent of Blacks were interviewed on the first try, a completely unbelievable proportion. (By contrast, about 80 percent of the White, Asian, and Coloured interviews were completed on the first try.) To make it possible to discover such problems, it is a very good idea to build information on the interviewing process into the data collection, for example, by having the interviewer record the date and time of each attempt to complete an interview and the outcome of the attempt, and also by collecting information on at least the age and sex of each household member and including this information in the analytic data set, which permits the analyst to compare the distribution of completed cases to the distribution of household members. In my South Africa study I used such information to determine that men had been undersampled and was able to get the survey house to collect a supplementary sample of men.

A second disadvantage of household samples is that they are not true probability samples of the population, because people in large households have a smaller probability of being selected than do people in small households. For example, in [...], 34 percent of U.S. households included only one adult, 54 percent had two adults, and the remaining 12 percent had three or more adults (data from the GSS). Obviously the chance of an adult in a single-adult household being included in a sample is twice as great as the chance of an adult in a two-adult household being included.
SampleDesignand SurveyEstimatjon 205 E ATAL
i full or irl*r. r fu6 e PfrErI
mfu: n a.lio In E.{ lnIETcf e hreL Rla:E I.D ls
ztqe t,I ili_t:d. TT:E.
& rtr Er
ry !r AI ir
b tt
n :d 4
We typically convert household samples to person samples by weighting the data by the number of eligible people in the household, normalized to retain the original sample size. This is very easy to do in Stata. For example, suppose that the target population is adults, that we wish to correct for the number of adults in the household, and that we have a count of the number of adults in each household. In Stata we simply specify [pweight = adults] (or, depending on the command, [aweight = adults]). Suppose the average number of adults per household was 2.0. Then a household with four adults would get a weight of 2, whereas a household with one adult would get a weight of 0.5. Also, the mean weight would be 1 and the sum of the weights would be N, the number of cases in the sample.

As it happens, weighting the GSS to take account of differential household size makes little difference for most variables, which means that the analysis we have done in previous chapters using the GSS is for the most part not far off the mark. Still, it is important to get it right. Moreover, sometimes correcting for differential household size does matter, for example, when we consider family income. Correctly weighting for differential household size increases the estimate of family income by about 10 percent in the 2002 U.S. GSS (for the evidence, see Part 1 of downloadable file "ch09.do"):

    unweighted mean: $50,102
    mean weighted by household size: $54,880

A third disadvantage of household samples is the increasing difficulty in securing high response rates. In Eastern Europe during the communist period it was common to complete more than 90 percent of the attempted interviews, but response rates dropped with the fall of communism. The same is true in China, where response rates once exceeded 90 percent but have been falling steadily, especially in urban areas, where people increasingly live in high-rise apartment buildings with restricted access. The GSS typically gets about a 75 percent response rate, and other U.S. surveys do worse, which creates a strong possibility that respondents will be a nonrandom subset of the target population. In the GSS, for example, men are usually undersampled relative to women because of differential nonresponse (Smith 1979). Any population estimate in which men and women differ (which is often the case with attitude items) will be biased.
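The household-size weighting described above (weights proportional to the number of adults, rescaled to average 1) can be illustrated outside Stata with a small Python sketch. The data are invented for illustration, and the helper names `household_size_weights` and `weighted_mean` are assumptions of this example rather than anything in the book's do-files.

```python
def household_size_weights(adults_per_household):
    """Weight each respondent by the number of adults in the household,
    rescaled so that the weights average 1 (and so sum to the sample size)."""
    mean_adults = sum(adults_per_household) / len(adults_per_household)
    return [adults / mean_adults for adults in adults_per_household]


def weighted_mean(values, weights):
    """Mean of `values` with each case counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: one respondent per household, mean household size 2.0.
adults = [1, 4, 2, 1, 2, 2]
family_income = [20_000, 90_000, 60_000, 25_000, 55_000, 50_000]  # invented

weights = household_size_weights(adults)
unweighted = sum(family_income) / len(family_income)
weighted = weighted_mean(family_income, weights)

print(weights)
print(unweighted, weighted)
```

In this made-up sample the respondent from the four-adult household receives a weight of 2 and those from one-adult households a weight of 0.5, as in the text; because larger households here also report higher family income, the weighted mean exceeds the unweighted one.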
A SUPERIOR SAMPLING PROCEDURE

A superior alternative to [...] and recording the age, sex, and other identifying characteristics of each resident, and then sampling directly from the list of eligible individuals. This approach radically increases costs but is far more accurate than simply listing addresses, because household sizes tend to vary substantially and because, especially in crowded neighborhoods, there often are "doors behind doors," that is, separate households that would be missed without interviewing local residents and inquiring about the presence of such households.
Two ways of improving coverage are typically used: drawing a sample somewhat larger than the target number of completed interviews, to offset nonresponses; and substitution by survey interviewers of a new case, typically from the same small area, when an interview cannot be completed. Both methods increase the number of completed interviews but do nothing to overcome biases that are due to the differential availability of potential respondents.

Random Walk Samples

Random walk samples are a variant of household samples. Within each small area, the interviewer is instructed to start at a particular location (a particular street intersection) and to proceed in a specified way, taking every nth address (or even varying the interval according to a schedule of random numbers) and turning in a specified direction at each intersection. This amounts to doing the address listing on the fly. This is not a desirable method because, in addition to the other weaknesses of household samples, it results in difficult-to-find dwelling units being overlooked even by honest interviewers. Also, cheating is even easier than with conventional household samples because enumeration, household selection, and interviewing are all done by the same person; and typically there is little or no documentation of the potential sample, only those actually interviewed. It is used because it is less expensive than population register sampling and conventional household sampling. In the first two years of the GSS (1972 and 1974), a random walk procedure combined with a quota sample was used.

Quota Samples

A quota sample is a sample in which the interviewer is instructed to obtain information on a given number of people with specified characteristics: females under forty, females forty and over, working women, and so on. Often quota procedures are combined with multistage probability sampling: small areas are selected using multistage probability sampling methods, and then, within each small area, the interviewer is instructed to obtain interviews to fulfill specified quotas.

In general, quota samples are not a good idea, for two reasons. First, they do not meet the conditions permitting valid statistical inference: they are not a probability sample of any population. Second, they typically produce a biased sample of the population they purport to represent, overrepresenting the kind of people who tend to be available when interviewing is carried out. Still, carefully controlled quota sampling can be useful under conditions in which probability sampling is prohibitively difficult, because in such circumstances coverage of the population might actually be better.
Stratified Probability Samples

Multistage probability samples are sometimes stratified, that is, designed to treat various segments of the population as if they are separate populations. For example, an initial distinction might be made between urban and rural areas, with separate samples drawn from the urban portion and the rural portion. The main reason for stratifying a sample is to ensure that a sufficient number of cases are drawn from each stratum to permit analysis. For example, to get estimates of some phenomenon for each state in the United States it would be necessary to stratify a national sample by state because otherwise only a small number of respondents, or perhaps none at all, would be likely to be chosen from small states. A second reason for using a stratified sample design is to minimize the effect of clustering, a point discussed in more detail later in the chapter.
SOURCES OF NONRESPONSE

The main reason for nonresponse is failure to start the interview because the interviewer cannot contact the target household (as sometimes happens in gated communities and high-rise apartments), because no one is home, or because the householder refuses to answer the door. For this reason high-quality survey operations often attempt to contact targeted households by mail to explain the survey and pave the way for the interviewer. Once contacted, relatively few people refuse to be interviewed (although refusals are increasing, especially in urban areas), and almost no one terminates an interview after it starts.
DESIGN EFFECTS
The fact that national sample surveys generally are based on multistage area probability samples creates a problem: standard statistical packages, which assume random sampling, tend to understate the true extent of sampling error in the data. The reason for this is that when observations are clustered (drawn from a few selected sampling points), for many variables the within-cluster variance tends to be smaller than the variance across the population as a whole. This in turn implies that the between-cluster variance (the variance of the cluster means, which gives the standard error for clustered samples) is inflated relative to the variance of the same variable computed from a simple random sample drawn from the same population. Reduced within-cluster variance, especially with respect to sociodemographic variables, is typical within the small areas that make up the third stage of multistage probability samples: areas of a few blocks tend to be more homogeneous with respect to education, age, race, and so on than the population of the entire country. The result is that when we use statistical procedures based on the assumption of simple random sampling, our computed standard errors typically are too small. What we need to do is to take account not only of the variance among individuals within a cluster but of the variance between clusters. This is what survey estimation procedures do. (For a useful introduction to such procedures, especially as implemented in Stata, see Eltinge and Sribney [1996]. However, note that Stata's survey estimation procedures have greatly expanded since that paper was published: they are now capable of handling multistage designs with more than two levels, and survey versions of many more estimation procedures are available.)
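The inflation of sampling error under clustering can be seen in a small simulation. The sketch below is in Python rather than Stata and is only an illustration under invented assumptions: a synthetic population of 200 equal-sized clusters in which half of the variance lies between clusters, a clustered design of 10 clusters with 10 people each, and a simple random sample of the same size. It estimates the ratio of the two sampling variances of the mean, the quantity this chapter calls the design effect; none of the function names or parameter values come from the book.

```python
import random
from statistics import mean, pvariance


def build_population(rng, n_clusters=200, cluster_size=50):
    """Synthetic population: each value = shared cluster effect + individual noise."""
    population = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0, 1)
        population.append(
            [cluster_effect + rng.gauss(0, 1) for _ in range(cluster_size)]
        )
    return population


def clustered_sample_mean(rng, population, n_clusters=10, per_cluster=10):
    """Mean from a two-stage sample: random clusters, then random people within them."""
    chosen = rng.sample(population, n_clusters)
    values = [v for cluster in chosen for v in rng.sample(cluster, per_cluster)]
    return mean(values)


def simulated_design_effect(seed=12345, replicates=500, n=100):
    """Ratio of the variance of the clustered-sample mean to that of an SRS mean."""
    rng = random.Random(seed)
    population = build_population(rng)
    everyone = [v for cluster in population for v in cluster]
    clustered = [clustered_sample_mean(rng, population) for _ in range(replicates)]
    srs = [mean(rng.sample(everyone, n)) for _ in range(replicates)]
    return pvariance(clustered) / pvariance(srs)


deff_estimate = simulated_design_effect()
print(round(deff_estimate, 2))
```

With this much within-cluster homogeneity the ratio comes out well above 1, so standard errors computed as if the clustered data were a simple random sample would be too small. Real survey software estimates the same quantity from a single sample using the design information rather than by repeated resampling.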
To illustrate what can happen to our standard errors when we take account of design effects (the fact that we have a clustered sample), I draw upon some sampling experiments conducted in the course of designing my 1996 national sample survey in China (Treiman and others 1998). Because this survey was to be conducted by sending interviewers from Beijing to each sampling point, cost was a strong incentive to minimize the number of sampling points. However, since China is a very heterogeneous country, it was possible that a highly clustered sample would produce an unacceptably high level of sampling error. To estimate the potential damage that could result from clustering, we conducted some analysis using a 1:100 sample of the 1990 Census of China.

Although we carried out several experiments, I draw upon only a subset to illustrate the potential problem of clustering: a three-stage design for a rural sample. The first stage consisted of fifty counties, chosen randomly with probability proportional to size. In the second stage two villages within each county were chosen randomly with probability proportional to size. In the third stage thirty people between ages twenty and sixty-nine were chosen at random within each village. Altogether, this design created a sample of three thousand people. We also drew a corresponding sample from the urban population. To assess whether the clustered samples produce larger sampling variability than would corresponding random samples of the same population, we computed several statistics summarizing features of the Chinese population and estimated the design effect (deff) for each statistic. Deff is the ratio of the variance calculated taking the clustered sample design into account to the estimated sampling variance from a hypothetical survey of the same size with observations collected from a simple random sample. It also can be thought of as a factor for the sample size; thus
LESLIE KISH (1910-2000) was one of the leading survey statisticians of the twentieth century, publishing the pioneering monograph Survey Sampling (1965), which became the standard for the field. He made major contributions to the development of inference procedures for complex samples and other applications. (Kish invented the deff and deft statistics.) He also helped to found the Institute for Social Research at the University of Michigan and to design its sample. Born in Poprad, now in Slovakia, of Hungarian parentage, he came with his family to the United States in 1925. His father died shortly thereafter, so he completed his BA in mathematics in night school at City College in New York, studying while helping support his mother and siblings. He also took two years off to fight the fascists in Spain as a member of the International Brigade. After completing his BA he moved to Washington, D.C., where he worked first at the Census Bureau and then at the Department of Agriculture. He then again volunteered for military service, this time in the U.S. Army. In 1947 he moved to the University of Michigan, where, in addition to helping found the Institute for Social Research, and teaching, he completed his MS and PhD. He remained at Michigan for the rest of his life.
[...] error we would obtain from a simple random sample (Kish 1965, 259). [...] the leftmost panel of Table 9.3 [...]
Stratification to Offset the Effect of Clustering

[...] and powerful feature of the statistics of sampling is that, under certain conditions, we can more or less completely [...] as well. This homogeneity can be [...] later in this chapter [...]
Table 9.3. Design Effects for Selected Statistics, Samples of 3,000 with Clustering (50 Counties as Primary Sampling Units, 2 Villages or Neighborhoods per County, and 30 Adults Age 20 to 69 per Village or Neighborhood), With and Without Stratification by Level of Education.

                                           Without Stratification    With Stratification
Statistic                        Sample    Coefficient    Deff       Coefficient    Deff
Mean years of schooling          Urban     8.41           13.43      [illegible]    [illegible]
                                 Rural     5.49           8.22       5.61           0.87
Mean age                         Urban     38.06          2.69       38.21          0.96
                                 Rural     38.55          1.73       38.71          0.99
Mean ISEI score                  Urban     33.35          5.68       33.44          [illegible]
                                 Rural     24.02          2.44       24.61          0.91
Percent with local registration  Urban     95.37          6.08       94.61          0.87
                                 Rural     [illegible]    2.13       99.30          0.96
Percent employed                 Urban     81.07          4.19       81.13          0.93
                                 Rural     88.97          [illegible] 87.47         0.93

Regression of ISEI on years of schooling
  int.                           Urban     [illegible]    4.70       [illegible]    0.92
                                 Rural     18.91          2.81       17.45          0.96
  b                              Urban     2.42           6.36       2.71           0.90
                                 Rural     0.93           1.80       1.28           0.94

Regression of ISEI on years of schooling, age, and sex
  int.                           Urban     16.23          [illegible] 13.57         0.97
                                 Rural     27.85          1.67       22.88          0.95
  b (years of schooling)         Urban     2.23           3.31       2.54           0.94
                                 Rural     0.49           1.70       0.89           0.94
  b (age)                        Urban     0.08           2.75       0.08           0.95
                                 Rural     0.20           1.50       0.14           0.97
  b (sex)                        Urban     [illegible]    1.43       2.81           0.99
                                 Rural     2.31           1.43       4.14           0.98

Note: This is the 1:100 sample of the 1990 Census of China, first version.
HOW THE CHINESE STRATIFIED SAMPLE USED IN THE DESIGN EXPERIMENTS WAS CONSTRUCTED
Because stratified samples are just multistage probability samples, with each stratum treated as a separate sample, the procedure for creating such samples is similar to that described earlier, using as an example California cities. To create the Chinese sample, we first divided all county-level units (counties, county-level cities, and districts of large cities) into an urban and a rural sector, using the data from the 1990 Chinese census. We treated these sectors as two separate populations: the population of urban China and the population of rural China. Consider the population of rural China first, which consisted of about 2,400 counties. We arrayed these counties in the order of the proportion of the adult population with at least a lower middle school education. We then divided the counties into twenty-five strata of approximately equal size so that counties totaling approximately 4 percent of the population were included in each stratum. We then chose two counties from each stratum, with probability proportional to size, picking the first one at random and the second one systematically by adding half the population of the stratum to the original number and picking the county within which the sum fell, wrapping as necessary. The remaining stages were sampled PPS in the usual way. We then created the urban sample in the same way.
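The within-stratum selection rule described in the box can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to draw the Chinese sample: the function names `select_at` and `pick_two_pps` and the county populations are hypothetical, and populations are treated as integers laid end to end along a number line.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def select_at(populations, first_point):
    """Return the indices of the two units picked within one stratum.

    populations: positive integer unit sizes, in listing order (hypothetical).
    first_point: integer in [0, total), playing the role of the random start.
    """
    total = sum(populations)
    cum = list(accumulate(populations))  # running population totals
    # Second point: half the stratum population further on, wrapping around.
    second_point = (first_point + total // 2) % total
    # bisect_right finds the unit whose cumulative range contains the point.
    return tuple(bisect_right(cum, p) for p in (first_point, second_point))


def pick_two_pps(populations, rng=random):
    """Draw the random start, then apply the systematic rule."""
    return select_at(populations, rng.randrange(sum(populations)))


counties = [100, 200, 300, 400]  # hypothetical county populations in one stratum
print(select_at(counties, 50))   # start falls in county 0; 550 falls in county 2
print(select_at(counties, 700))  # start falls in county 3; 1200 wraps to 200, county 1
```

Because each unit occupies a stretch of the number line equal to its population, larger units are proportionally more likely to contain either point. Note that in this sketch a unit holding more than half of the stratum population could be hit by both points.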
As noted earlier in the chapter, a second reason for using stratified samples is to sample different subpopulations at different rates. We did this in the Chinese sample. Although for convenience I have presented the Chinese data as if they consisted of two separate samples (an urban sample and a rural sample), the urban-rural distinction may be thought of simply as a second stratification variable. However, because China was about 75 percent rural at the time the survey was conducted, we sampled the urban population at three times the rate at which we sampled the rural population in order to achieve urban and rural samples of the same size, which we wanted for our analysis. The same strategy was used in the 1982 and 1987 GSS to achieve a sample of the Black population of the United States large enough to sustain a separate analysis of Blacks and non-Blacks.
Weighting
When portions of the population are sampled at different rates, the sample is, of course, no longer representative of the entire population. Thus any statistics computed over the entire sample will be biased. For example, if we naively computed the mean level of education in the Chinese sample, we would overstate the true level of education in the Chinese population since the urban population, which was oversampled relative to the rural population, is much better educated. A similar naive computation using the 1982 or 1987 GSS, which oversampled Blacks, would understate the level of education in the population given the lower level of education of Blacks compared with non-Blacks. To correct for such distortions, we weight the data proportionally to the inverse of the sampling rate. For example, in the 1996 Chinese survey, which includes (approximately) three thousand rural cases and three thousand urban cases, to correct for the fact that the urban population was sampled at three times the rate of the rural population we would assign a weight, w_u, to the urban population and a weight, w_r, to the rural population, where w_r = 3w_u. Note that we would not want to simply assign a weight of 0.33 to the urban population and a weight of 1.0 to the rural population since this would result in a weighted sample size of 4,000, whereas the true sample size is 6,000. Rather, we would adjust the data back to the original sample size by dividing the initial weight by the mean weight, 0.67. (This is, of course, just what we have done to convert household samples to person samples.) Thus we would create a new variable (weight) that has the value 0.5 for urban cases and the value 1.5 for rural cases. This yields a weighted sample size of 6,000 (which is identical to the unweighted sample size) and a weighted sample size of 1,500 urban cases and 4,500 rural cases, which corresponds to their relative population sizes. Then we can compute unbiased summary statistics for the entire population. Note, however, that this procedure overstates the reliability of rural responses (there are actually only 3,000 rural respondents, but we are treating the data as if there were 4,500) and similarly understates the reliability of urban responses.
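The weight arithmetic in this paragraph can be checked with a short Python sketch. The function names and the group means used for the weighted-mean illustration (9.0 and 6.0 years of schooling) are hypothetical values chosen for the example, not figures from the Chinese survey; only the sampling rates and case counts come from the text.

```python
def normalized_weights(relative_rates, sample_counts):
    """Inverse-sampling-rate weights, rescaled so their mean over all cases is 1."""
    raw = {g: 1.0 / r for g, r in relative_rates.items()}
    n_total = sum(sample_counts.values())
    mean_raw = sum(raw[g] * sample_counts[g] for g in raw) / n_total
    return {g: raw[g] / mean_raw for g in raw}


def weighted_mean(group_means, weights, sample_counts):
    """Mean over the whole sample when every case in a group shares one weight."""
    num = sum(weights[g] * sample_counts[g] * group_means[g] for g in group_means)
    den = sum(weights[g] * sample_counts[g] for g in group_means)
    return num / den


rates = {"urban": 3.0, "rural": 1.0}     # urban sampled at three times the rural rate
counts = {"urban": 3000, "rural": 3000}  # approximate case counts from the text
w = normalized_weights(rates, counts)
print(w)                                 # approximately urban 0.5, rural 1.5

# Weighted case counts: about 1,500 urban and 4,500 rural, 6,000 in total.
print({g: w[g] * counts[g] for g in counts})

# Hypothetical group means, for illustration only.
means = {"urban": 9.0, "rural": 6.0}
print(weighted_mean(means, w, counts))   # about 6.75, versus a naive mean of 7.5
```

The naive mean of 7.5 overstates the population value because the better-educated urban group makes up half of the sample but only a quarter of the weighted total.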
WEIGHTING DATA IN STATA
Weights can be included in Stata computations by adding a weight specification in brackets to the command. For example, to regress y on x in a sample with a weight variable named WT, we would issue the following Stata command: reg y x [pweight=wt]. Stata permits several kinds of weights; see the User's Guide (StataCorp 2007) for details. In general, probability weights (pweights) are the appropriate choice for stratified probability samples, and these weights are used in Stata's survey estimation commands. However, Stata does not permit probability weights for all commands, and it requires that frequency weights be integers. I thus recommend that, in the relatively rare situations in which it is appropriate to weight data but not to do survey estimation (survey estimation procedures are discussed later in the chapter), analytic weights (aweights) be used whenever probability weights are not permitted. Stata automatically normalizes pweights and aweights to the unweighted total number of cases included in the analysis, which makes it unnecessary for the analyst to carry out this step.
Sometimes more complex weights are devised. For example, in the Chinese case we first corrected for differential household size by using the number of adults in the household as our first weight. Then we devised a weight to correct for oversampling the urban population. We then multiplied the two weights together to achieve an overall weight, which is appropriate since each weight is normed to a mean of 1.0, which is another way of saying that the sum of the weighted data is identical to the sum of the unweighted data.
As noted in the previous chapter, some survey houses construct a complex set of weights to take account of differential nonresponse. That is, they weight the data so that the distribution of key variables (geographic location, sex, age, education, and so on) in the sample conforms to the distribution in a standard population, such as the census. (This procedure is implemented in Stata 10.0 using the poststrata() and postweight() options in the svyset command.) This can be useful when nonresponse rates differ substantially across population groups of interest, but it also is potentially misleading since it assumes that nonrespondents are identical to respondents within the groups formed by the n-way cross-tabulation of the variables used to create the weights.

The use of weights is somewhat controversial. Some argue that you should never weight your data but rather should include in your analysis all the variables used to devise the weights. The claim is that weights sweep problems under the rug, masking effects that should be explicitly modeled. There is much to be said for this position. Certainly, urban-rural distinctions are crucial in China, and racial distinctions are crucial in the United States. Thus, generally it will be far more informative to explicitly describe the urban-rural distinction in China or race in the United States, plus appropriate interactions with other variables, in one's analysis than to weight the data and ignore these distinctions. However, from a practical point of view weighting is sometimes unavoidable, particularly in the computation of descriptive statistics. If we want to accurately estimate educational attainment in China, we do need to weight the data to reflect the oversample of the better-educated urban population; and so on. In addition, it is sometimes tedious to model nuisance effects, that is, factors that might affect the outcome but that are not central to the substantive analysis. The effect of differential household size is such an example. Thus a case can be made for weighting the data to correct for such effects without focusing on them. Of course, the counter is that either they are unimportant, in which case weights are unnecessary, or they are important, in which case they should be modeled explicitly.

Perhaps the most important point to make about weights is that it is imperative that the analyst fully understand the weighting scheme used in the data being analyzed. Often weights are quite complex, and just as often weighting schemes are badly documented. Although it often takes a good deal of effort, full understanding of weighting schemes can save a great deal of trouble and considerable embarrassment arising from errors in the analysis down the road. In general, whenever you begin to use a new data set, you should try to obtain as much documentation as possible about the sample design and execution and then, of course, read it.

Survey Estimation Using Stata
To get correct estimates of standard errors from multistage samples, we need to use estimation procedures specifically designed for such samples. Stata provides a set of survey estimation commands to estimate standard errors for many common statistics, including means, proportions, OLS regression coefficients, and logistic regression coefficients. These commands make it possible to take account of both clustering and stratification at each level of a multistage sample, albeit with some restrictions.
LIMITATIONS OF THE STATA 10.0 SURVEY ESTIMATION PROCEDURE
Although Stata 10.0 is much improved over previous versions in its ability to correctly estimate standard errors and design effects for multistage samples, one important limitation remains: because of the way Stata estimates standard errors, the default for strata with only one sampling unit is to report missing standard errors. Stata 10.0 provides three alternatives, which are helpful if only an occasional stratum contains just one sampling unit, although in this case Stata recommends that the offending units be combined with others (Survey Data, 154 [StataCorp 2007]). The alternatives are inappropriate when, by design, each stratum contains only one sampling unit. (Note that in Stata's implementation "the sampling units from a given stage pose as strata for the next sampling stage" [Survey Data, 154 (StataCorp 2007)].) This is the design used by the GSS, which has one PSU per stratum, and is the design used in the 1996 Chinese survey analyzed here, in which one township was sampled per county.
The solution I adopt is to ignore the stage containing only one unit per stratum; but this understates the degree of clustering. For example, in the Chinese case two villages per county were sampled, but both were drawn from a single township; ignoring the township level results in this aspect of clustering not being taken into account. Although not optimal, this solution strikes me as better than ignoring subsidiary stages altogether, which is what Stata did before release 9.0.
To show the effect of using survey estimation procedures, I first repeat the analysis, presented in Chapter Seven, of the determinants of knowledge of Chinese characters, using survey estimation procedures. I then follow with an analysis of race differences in income among U.S. women to show how to do survey estimation for subsamples. I conclude with an analysis of race differences in education in the United States (the same example I used to discuss the decomposition of differences in means in the previous chapter) to show how to do survey estimation when combining several years of the GSS (or, by extension, other data sets). See Appendix A for descriptions of both the Chinese data and the GSS, and see Appendix B for a discussion of how to do survey estimation using the GSS.
A Worked Example: Literacy in China
What follows is a comparison of the regression estimates and standard errors derived two ways: by using survey estimation procedures, and by assuming that the data were from a simple random sample, as we did in Chapter Seven (see Table 9.4). The 1996 Chinese survey analyzed here used a design similar to the design of the sampling experiments described earlier in the chapter, except that in the sample survey we sampled one township per county and two villages per township. (See Appendix A for details on how to access the documentation for the survey, including information on the sample design [Appendix D of the documentation] and how to obtain the data.)
Table 9.4. Determinants of the Number of Chinese Characters Correctly Identified on a 10-Item Test, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
Unweighted
Weighted

li L
i
DesignBased
l
D
S.e .
b
s.e.
b
0.378 0.006
0.393
0.006
0.393
0 .0 0 2 0 .0 0 6
0.009 0.007 0.009
s.e.
Deff
Meff
Meff ,.
:J
!l[:h
0.0.!o 2.ss
2.g3
1.5:
0.007
 .45
1.0.
0.206 0_046 0.211 0.057 0.216 0.055 1.0.t 1.42
0.9;
0.281 0.045 0.177 0.054 0]72
 2/
0.050 0.88
fls
1.21
0. o:
1.80
1.12
r:l
lillr
0.366 0.037 0.385 0.044 0.385 0.049 1.70
.; iiii.:
0.759 0.101 0.866 0.118 0.872 0.129 1.53 1.64
1.14 ll
0.040 0.546 0.039 0.544 0.060 3.25 R2
0 .6 8 1
0.687
2.31
1.5:
*:'
0.688 *t
s.e.e.
1.2 4
1.24
All variables are significant at or beyond the .001 level except for father's education. For the unweighted data, p = .690. For the weighted data, p = .195, and for the design-based analysis, p = [illegible].
Stata requires that information regarding the properties of the data be set before specifying estimation commands. Once this is done, using the svyset command, estimation is carried out in the usual way, except that the survey version of the estimation command is substituted for the nonsurvey version. The specific commands used to do survey estimation for the Chinese literacy example are shown in downloadable file "ch09.do" (Part 2). See also the log file "ch09.log" for the output.
The Stata 10.0 survey estimation commands provide four design effect statistics: meff, the misspecification effect; deff, the classic design effect statistic developed by Kish (1965) and discussed earlier in the chapter; and meft and deft, which are the approximate square roots of meff and deff. Of these I find the first two the most useful. These coefficients are reported in Table 9.4, which also includes three estimates of the determinants of literacy in China: regression coefficients assuming unweighted simple random sampling (first panel, "Unweighted"), weighted estimates that take no account of clustering or stratification (second panel, "Weighted"), and design-based estimates (third panel, "Design Based"). Finally, the table shows an additional design statistic, which I call meff_w.

Meff
Meff is the ratio of the sampling variance (the square of the standard error) computed using the design-based estimation command to the sampling variance computed on the assumption of unweighted simple random sampling. Meff thus informs us just how badly we would err in our estimates of sampling variance were we to naively compute statistics without taking account either of clustering or of differential sampling rates, as we have done in previous chapters; for the current example, these are the computations shown in the first two columns of Table 9.4. Note that as specified by the definition of meff, in the first row meff = 2.93 = 0.010²/0.006² (or, precisely, from the downloadable log file, 0.00954217²/0.00557672²), where the ratio is formed by the squared standard error estimated using design-based estimation divided by the squared standard error assuming simple unweighted random sampling. Sometimes, as in this example, the underestimation of the sampling variability can be substantial; thus it would be completely inappropriate to use naive estimating procedures for the Chinese data.
Nor is it sufficient to weight your data but to ignore clustering and stratification, as the computations in the rightmost column of Table 9.4 demonstrate. This coefficient, meff_w, gives the ratio of the design-based sampling variance to the sampling variance estimated by weighting the data but not taking account of clustering or stratification. (Because this coefficient is not among the Stata options, since I created it for heuristic purposes, it must be computed by hand, unless you program Stata to do it for you. See downloadable file "ch09.do," Part 2, to see how I used Stata to do the computations.) As is evident, the variance estimates can be quite different. Thus once again we see the importance of taking account of the sampling design in order to get correct estimates of the standard errors of our coefficients.
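The meff arithmetic quoted above is easy to verify. The following Python sketch simply squares the ratio of the two standard errors given in the text; the function name `variance_ratio` is an assumption introduced for illustration and is not a Stata command.

```python
from math import sqrt


def variance_ratio(se_design, se_naive):
    """Ratio of two sampling variances, each the square of a standard error."""
    return (se_design / se_naive) ** 2


# Standard errors for the first row of Table 9.4, as quoted from the log file.
meff = variance_ratio(0.00954217, 0.00557672)
print(round(meff, 2))        # 2.93
print(round(sqrt(meff), 2))  # 1.71, the corresponding ratio of standard errors

# The rounded standard errors printed in the table give only a rough value.
print(round(variance_ratio(0.010, 0.006), 2))  # 2.78
```

The gap between 2.93 and 2.78 arises only because the table reports standard errors to three decimal places, which is why the text gives the more precise figures from the log file.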
Deff
As noted earlier in the chapter, deff is the ratio of the design-based estimate of the sampling variance of a statistic that has been collected under a complex survey design to the estimated sampling variance from a hypothetical survey of the same size with observations collected through simple random sampling. Thus meff differs from deff in that it gives the ratio of the sampling variances obtained from our data under two conditions: (1) when we use design-based estimation to account for clustering and sample weights and (2) when we ignore clustering and weights and estimate statistics appropriate for simple unweighted random samples; deff, by contrast, gives the ratio of the design-based sampling variance to the sampling variance that we would expect if we had actually carried out a survey using a simple random sample. In this sense meff is mainly of didactic value because it reveals the consequences of naive estimation.

Deff can be thought of as a variance inflator, indicating the extent to which the sampling variance is inflated because of the clustering of the observations in the sample. Because the standard error is a function of the square root of the sample size, deff also can be thought of as indicating how much larger than a simple random sample a sample based on the clustered design would have to be for both samples to yield standard errors of the same size. In the present case, despite our best efforts to stratify the sample, we still have a relatively large deff for years of school completed: 2.99. This implies that with respect to the measurement of years of schooling, our clustered sample of about 6,000 cases has the precision of a simple random sample of about 2,000 cases. Although this is a great improvement over the design effects of 8.22 and 13.43 that we obtained for the rural and urban samples in our design experiment based on the 1990 Chinese census, it still is quite large. Fortunately, none of the remaining variables in the model has design effects nearly as large (although the intercept does).

In the course of carrying out analysis of an existing survey there is, in fact, little reason to compute deff because deff provides information that is useful primarily in designing a new survey (as in the Chinese census analysis discussed earlier).
Rather, for samples for which we have adequate information on the design, we simply carry out our analysis in the standard way, but using the survey estimation commands rather than commands that assume simple random samples. Unfortunately, such information often (indeed, usually) is not included in survey documentation, especially for older surveys. In such cases there is a next-best approach. You can approximate design effects by treating your sample as somewhat smaller than it actually is, by weighting your data by 0.75 or 0.67 or 0.50 (this is easily accomplished, either by creating a weight variable = 0.75, 0.67, or 0.50, or whatever you judge the reciprocal of the design effect to be; or by multiplying any existing weight variable by your judgment of the reciprocal of the design effect). Weighting by 0.75 is tantamount to assuming that each statistic in your analysis has a design effect of 1.33 (= 1/0.75). Because, as we have seen, design effects can vary substantially, this is hardly an optimal solution, but it is superior to blithely assuming that the multistage probability samples upon which almost all survey data are based are as precise as simple random samples, which is what we do when we make no correction for design effects. In the GSS the design effect is typically about 1.5 for attitude items and about 1.75 for sociodemographic items, which tend to vary more across clusters (Davis and Smith 1992). So we could get an approximation to the correct standard errors by weighting our sample by the reciprocal of the design effect, for example, weighting the GSS by 0.57 (= 1/1.75), to be conservative. However, in recent years the GSS has included the variable SAMPCODE, which permits the use of survey estimation procedures. The GSS uses a complex design, which, moreover, has changed somewhat every decade with the shift to a new sampling frame. Hence the correct use of survey estimation procedures for trend analysis is a somewhat cumbersome business even when, as is reasonable, years are treated as strata. However, for the analysis of a single year of the GSS, the task is somewhat easier. (See Appendix B for a discussion of how the sample design of the GSS has changed over time and the implications of the changes for how to do survey estimation using the GSS.)
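The relationships among the design effect, the equivalent simple random sample size, and the reciprocal weight used for downweighting can be laid out in a small Python sketch. The function names are hypothetical; the numbers are the ones quoted in this chapter.

```python
from math import sqrt


def effective_n(n, deff):
    """Size of the simple random sample with the same precision as n clustered cases."""
    return n / deff


def downweight(deff):
    """Constant weight that shrinks the apparent sample size by the design effect."""
    return 1.0 / deff


def se_inflation(deff):
    """Factor by which a simple-random-sample standard error understates the true one."""
    return sqrt(deff)


print(round(effective_n(6000, 2.99)))   # 2007, that is, about 2,000 cases
print(round(effective_n(15932, 1.82)))  # 8754
print(round(downweight(1.75), 2))       # 0.57, the conservative GSS weight
print(round(1 / 0.75, 2))               # 1.33, the design effect implied by weighting by 0.75
print(round(se_inflation(2.99), 2))     # 1.73
```

A single constant weight of this kind applies the same assumed design effect to every statistic, which is why the text treats it as a rough approximation rather than a substitute for survey estimation.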
AN ALTERNATIVE TO SURVEY ESTIMATION
If there is no survey estimation command for the procedure you want to use, it may be possible to estimate the model with the robust and cluster options. This will produce standard errors that are (within rounding error) identical to the estimates produced by the survey estimation command when no strata are specified. That is, the robust and cluster options take account of clustering but not of stratification. In general, but not always, failure to take account of stratification will produce larger standard errors.
This approach may make it possible to provide a partial correction for clustering even when information on the sample design is not available in the survey documentation. Because almost all large population surveys are clustered on the basis of geography, you may be able to use geographic place identifiers as a cluster variable. In addition, you may have a data set that includes information on households and also on several individuals within a household. In such cases you can treat the household identifier as a cluster variable (in addition to any geographic identifier in the data set).
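To see why treating a place or household identifier as a cluster variable changes the standard errors, consider a simplified Python sketch of a cluster-robust standard error for a sample mean. This is an illustration under stated assumptions (equal weights, no strata, and a G/(G-1) small-sample factor), not a reproduction of Stata's internal computations; the function names and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def naive_se_mean(y):
    """Standard error of the mean assuming simple random sampling."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return sqrt(s2 / n)


def cluster_se_mean(y, cluster):
    """Standard error of the mean that sums deviations within clusters first."""
    n = len(y)
    ybar = sum(y) / n
    totals = defaultdict(float)
    for v, c in zip(y, cluster):
        totals[c] += v - ybar  # cluster total of deviations from the mean
    g = len(totals)
    var = (g / (g - 1)) * sum(t * t for t in totals.values()) / n ** 2
    return sqrt(var)


# Hypothetical data: three clusters whose members resemble one another.
y = [1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9]
place = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
print(round(naive_se_mean(y), 3))           # 0.985
print(round(cluster_se_mean(y, place), 3))  # 2.309
```

When every case is its own cluster, the two functions return the same value, so the difference in the example is due entirely to the homogeneity of cases within clusters.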
HOW TO DOWNWEIGHT SAMPLE SIZE IN STATA
When information on the sample design is not available, you may want to downweight the sample size to reflect an approximate design effect by inflating the standard errors. To accomplish this in Stata, use the [iweight] specification, which creates weights that are not renormed to the sample size. Note that using the [iweight] specification is the equivalent of using [aweights] rather than [pweights], but using [aweights] when doing model estimation is in general incorrect and typically will produce smaller standard errors than when [pweights] are used. So this clearly is a suboptimal solution. It thus is generally well worth the (often considerable) effort to determine the actual sample design and to obtain the variables necessary to implement survey estimation procedures that correctly reflect complex sampling designs, even if it means imposing upon original investigators who have gone on to other research and do not want to be bothered trying to document something they paid little attention to in the first place.
Analysis of Subpopulations: Effect of Education and Race on Income Among Women
One special feature of the design-based estimation procedure in Stata needs to be highlighted: when analysis is restricted to a subset of the data, it is inappropriate simply to exclude cases not meeting the selection criterion. The reason for this is that the sample design features pertain to the entire sample, not the subset selected for analysis. Stata correctly handles analysis of subsamples through the subpop option in the estimation command (see Part 3 of the downloadable files "ch09.do" and "ch09.log"). To illustrate the use of this option and to further illustrate how the use of survey estimation can change substantive conclusions, I here carry out a simple analysis, using the 1994 GSS, of the effect of education and race on income among women.

The 1994 GSS is a stratified multistage sample. The units for the first stage were 2,489 U.S. metropolitan areas and nonmetropolitan counties divided into 100 strata with one PSU per stratum selected at random with probability proportional to size; then, within PSUs, 384 second-stage units (groups of blocks) were chosen PPS, and in some instances a third-stage selection was made as well. However, the documentation for the GSS identified only the PSUs, using the variable SAMPCODE. Because, as noted, Stata does not permit the specification of one PSU per stratum, I set the PSU but not the strata, and I treated the analysis as including only one sampling stage. This procedure probably underestimates the true standard errors but is the best option given the documentation available.

Table 9.5 shows the coefficients and standard errors for three models, each estimated three ways: treating the sample as if it were a simple random sample of the population; weighting the sample to correct for differential household size; and taking account of the clustering created by the first stage of the multistage design used in the GSS. In these models education is expressed as a deviation from the mean years of schooling of women in the sample.

Considering first the contrasts among models, it is evident that adjusting for differential probabilities of selection resulting from differential household sizes has a nontrivial effect on the results. The estimation that treats the data as an unweighted simple random sample (panel I) yields a significant increment in R² for Model 3 in contrast to Model 1, leading us to conclude that the determinants of income differ for Black and non-Black women. By contrast, neither the weighted (panel II) nor survey (panel III) estimation procedure yields the same conclusion. From the weighted and design-based estimates, we would be led to accept the null hypothesis of no racial difference in returns to education for women. Here is a case where taking account of the imprecision caused by treating household samples as if they are person samples changes a substantive conclusion in an important way.

An alternative way to do this without weighting the data would be to introduce a set of dummy variables for the number of adults in the household, plus interactions between the set of dummy variables and, respectively, race and education, and perhaps three-way interactions as well. Unless the focus of the analysis is on how race and education differences vary by the number of adults in the household, this alternative strikes me as excessively complex and tedious. I think the example makes it clear why
Table 9.5. Coefficients for Models of the Determinants of Income, U.S. Adult Women, 1994, Under Various Design Assumptions (N = 1,015).
Education b
5.e.
Interaction
Intercept
p
..r ng simplerandomsample
IttLi3
2,548
205
.OO0 1O
1,419 .gg4 't,.755
r 'ningweightedrandomsample
:amrf9 wetghtedand clustered sample
'.,.,r:2
2,656
33a .A0O1,251 1,772 .480
!k,:: 3
2,419
26S .000
18,001
(l\/lode :_:_:sts 3 versus Model1)
4 29 8
d.f.(l)
d.f.(2)
2
1,O11
1.22 .2951
2 .1 4 .1 2 1 2
Note: b is the net regression coefficient; s.e. is the standard error of the coefficient; p is the associated probability; d.f.(1) and d.f.(2) are the numerator and denominator degrees of freedom; and F is the value of the F-statistic for the contrast between models.
in general, we want to do survey estimation when we have the information to be able to do so.

Note also that the R²s for corresponding models in panels II and III are identical even though the standard errors differ. This follows from the definition of R² as a function of the ratio of variance around the regression surface to variance around the mean of the dependent variable. Because the point estimates are the same for panels II and III, the R²s are also the same, even though in panel III the point estimates have wider confidence intervals.

Note also that I have not shown BIC estimates. Although it is legitimate to compute BIC for simple random samples, as we did in Chapters Six and Seven, BIC is not appropriate for weighted or clustered samples. For such designs, pseudo-likelihood functions are estimated, which may be substantially different from true likelihoods and may even vary in a nonmonotonic way across nested models. Thus neither likelihood ratio tests nor BIC, of which the log likelihood is a component, should be used to compare models for weighted or clustered data. Rather, Wald statistics, implemented in Stata as the test and svytest commands, should be used. (For a discussion of maximum likelihood estimation, which is used by most of the procedures we will explore in Chapters Twelve through Fifteen, see Appendix 12.B.)
Combining GSS Data Sets for Multiple Years
Previously I suggested that under some circumstances it is useful to merge several samples drawn from the same population into a single data set. In particular, if it can be assumed that a social process is consistent over time, it would be reasonable to combine GSS samples drawn in different years to increase the number of cases. I did this in the worked example in Chapter Seven, decomposing the difference between two means. Here I use the same data in a slightly modified way to study racial (non-Black versus Black) differences in educational attainment over the period 1990 through 2004. The point of the present exercise is to illustrate what is entailed in combining several data sets (see downloadable file "ch09.do," Part 4, for the Stata code). In carrying out this analysis, I treat year as the stratum variable, on the ground that the sample for each year is fixed. I then manipulate the data a bit to create a weight variable that is consistent across years. (See the downloadable file for details on this process.) Having appropriately weighted the data, I carry out survey estimation in the usual way. The results are shown in Table 9.6.

For our present purposes both the deff and meff coefficients are instructive. The largest deff tells us that in our estimation of the coefficient for Southern origins, we have the same efficiency as a random sample of 8,754 (= 15,932/1.82). Of course, since our sample is so large (because we have combined data from eight GSS samples), we still have the equivalent of a very large sample. The meff coefficients also are large, especially for mother's years of schooling. This once again suggests that naive analysis that takes no account of weighting or of clustering can be misleading, although again the very large size of the sample protects us. Although the results are substantively interesting, I forgo further commentary on them since it would largely repeat the discussion in Chapter Seven.
Table 9.6. Coefficients of a Model of Educational Attainment, U.S. Adults, 1990 to 2004 (N = 15,932).
Predictor Variable
Coef.
Mother's years of
5.e,
Meff
0.288
.0i0
2.27
0.133
.010
1.59
0.531
.065
o.434
.331
1.31
1.75
.023
1.32
1.7A
' I.13
1.20
5_CC
'u:er of siblings hrern rUE'5
residence,
ia:. nE<'.nothe13 Er trj school 3 a:. jiblings
0.057
.021
l[{Southern
0.006
.1 5 4
1.47
10.961
.1 3 5
2.26
E€CE
lL
0.'182
CONCLUSION
The lesson to be drawn from each of the analyses in this chapter is that we are likely to underestimate the degree of sampling variability if we fail to take account of and correct for the fact that large sample surveys typically use multistage designs that result in substantial clustering of observations. Note that this is true not only of area probability samples but also of organizationally based samples, such as samples of students (often selected by first selecting schools, then classrooms, then individuals within classrooms), hospital or clinic patients, and so on. Survey estimation procedures should be used for such surveys as well.

Even when complete information on the sampling design is unavailable, which is unfortunately all too common, it sometimes is possible to approximate the design by using information on the place where the interview was conducted, since almost all surveys are clustered by place. Analysts are well advised to explore their data sets for information that will enable them to approximate the sampling design, and then to use the design-based procedures available in Stata, to avoid overstating the reliability of their findings. Understating the sampling error, and thus increasing the chance of Type I error (rejecting the null hypothesis when it is true), is the usual consequence of treating multistage samples as if they were simple random samples.

Most of the standard statistical procedures we deal with in this book have survey-based versions, and these should be used whenever they are available. For procedures not yet included in the package of survey-based estimation commands, it may be possible to approximate survey-based estimation by using [pweights] and the cluster option, along the lines suggested in this chapter. Also, where there is only one sampling stage and no information on any stratum variables, the cluster option used with non-survey estimation procedures yields results identical to the survey estimation procedures discussed in this chapter (except when a subpopulation is analyzed, in which case survey estimation procedures should be used).
WHAT THIS CHAPTER HAS SHOWN
This chapter has taken us from "textbook" analysis, in which simple random sampling is assumed, to the kinds of samples actually used in social research and the implications of such sample designs for statistical analysis. We reviewed the main types of samples, with particular focus on multistage probability samples; the idea of stratifying samples, both to reduce sampling error and to gain sufficient cases for small subgroups of the population to permit analysis; and the conditions under which it is necessary or desirable to compute weighted estimates. We then turned to survey estimation, a set of procedures for correctly estimating standard errors that take account of design features of the sample, particularly clustering of cases. Finally, we considered how to interpret two statistics, deff and meff, that quantify the effect of departures from random samples for sampling error.
CHAPTER TEN
REGRESSION DIAGNOSTICS

WHAT THIS CHAPTER IS ABOUT
In this chapter we consider ways of identifying, and under some circumstances correcting, troublesome features of our data that might lead to incorrect inferences. We do this by reanalyzing one of my published papers as a way of seeing how to apply and interpret various regression diagnostic tools.
Apart from not adjusting our standard error estimates to take account of complex sample designs of the kind we considered in the previous chapter, there are other ways we can be led to incorrect inferences. Even if we are completely attentive to the complexity of our sample, we may err either because we have specified the wrong model or because our sample includes anomalous observations, topics we briefly touched on in an earlier chapter. Here I give these topics extended treatment.
I treat these possibilities together both because, in at least some cases, the same set of observations may be thought of either as anomalous with respect to some posited process or as conforming to some other process that can be captured by introducing additional variables or changing the functional form of one or more predictors, and also because the same methods can be used to detect and correct both sets of problems. First we consider regression diagnostics, a set of procedures for detecting troubles. Then we consider robust regression, an approach to correcting a subset of these troubles. This discussion relies on Fox (1997, Chapters 11-12) and the discussion of various regression diagnostics in the Stata 10.0 section "Regress Postestimation" (StataCorp 2007). I recommend these sources for further study.
INTRODUCTION
To illustrate the kinds of troubles that can befall the naive analyst who is inattentive to the properties of his or her data, consider the four scatter plots shown in Figure 10.1. These plots were contrived to produce the same regression estimate (slope and intercept), the same correlation between the variables, and the same standard error of the regression coefficients. However, only plot (a) is reasonably summarized by a linear regression line. Plot (b) shows a curvilinear relationship. Plot (c) shows a linear relationship with one value that distorts what is otherwise a perfect linear relationship. Plot (d) shows a data set with variance in X and the slope relating Y to X created entirely by a single point (where X is the variable on the horizontal axis and Y is the variable on the vertical axis). Clearly there is a cautionary lesson in this: it is a very good idea to visually inspect the relationships among variables to ensure that your specified model adequately captures the true relationships in your data.

FIGURE 10.1. Four Scatter Plots with Identical Lines. Source: Anscombe 1973, 19-20.

Apart from these examples we need to be sensitive to still other ways our regression models may fail to adequately capture relationships observed in our data. In particular, important variables may be omitted from our model, as illustrated in Figure 10.2. Here it is obvious that the regression of Y on X is misleading, because the three middle observations have expected values of Y three points higher than those to the left and right. It should be evident that an equation of the form

$$Y = a + b(X) + c(Z) \tag{10.1}$$

where Z is scored 1 for the three middle observations and is scored 0 otherwise, perfectly predicts Y.

FIGURE 10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).

Visual inspection of a scatter plot such as in Figure 10.2, or a component-plus-residual plot, discussed later in this chapter, can sometimes reveal the need to include additional variables, although empirical examples are not usually so clear cut.
Still another potential problem is heteroscedasticity, unequal error variance around the regression surface at different predicted values, which results in inaccurate standard errors of regression coefficients. Heteroscedasticity is fairly common because in many cases the variance of observations increases with the mean. Fortunately, modest violations (with the largest error variances less than ten times as large as the smallest) have little effect on the standard errors. Still, we need to check for large violations.
To detect anomalous relationships in our data in the case of simple two-variable models, we can simply plot the relationship between X and Y as in Figures 10.1 and 10.2. However, for multiple regression equations, zero-order scatter plots between each of the independent variables and the dependent variable are likely to miss important anomalies. Thus we need to exploit a set of additional procedures known collectively as regression diagnostics.
Nonetheless, it is useful to start with a very simple example, if only to illustrate how serious the problem can be in actual research situations. Suppose an analyst fails to notice that missing data are represented by very large codes (recall the boxed comment "Treating Missing Values as if They Were Not" in an earlier chapter). Consider the relationship between number of siblings and years of school completed in the 1994 GSS.
For both SIBS and EDUC, code 98 = "Don't know" and code 99 = "No answer." If we naively assumed that the data were complete and correlated the two variables, we would conclude that the amount of education obtained is unrelated to the number of siblings, because r = .006. Excluding the missing data from both variables yields a more plausible estimate: r = -.246.
What can we do, apart from simply being alert and careful, to protect ourselves against such an error? The first and most obvious step is simply to make and inspect a scatter plot of the relationship between the two variables. As we have seen in Figure 10.1, such scatter plots are enormously instructive, not only in revealing gross errors such as the inclusion of missing value codes but also in indicating other anomalies in the data: curvilinear relationships, discontinuities, patterns that suggest the possibility of omitted variables, and heteroscedasticity. Figure 10.3 shows a plot of the relationship between number of siblings and years of school completed, in the 1994 GSS.

FIGURE 10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).

This plot immediately reveals trouble and would do so just as clearly if the numerical value of the "NA" category, 99, were shown. Inspecting the plot, we see that the missing value codes need to be excluded. Doing so results in the plot shown in Figure 10.4, with the regression line included. Note that the new plot is based on 2,975 cases, a reduction of only 17 cases, but the regression estimate differs substantially. Because so few cases are missing, we need not be concerned about imputing the missing data (recall the discussion in Chapter Eight). However, even after omitting missing cases, we still need to be concerned about the possibility that the regression estimate is unduly influenced by the relatively small number of respondents with many siblings. The following section addresses how to assess such a possibility.

FIGURE 10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
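The effect of leaving missing-value codes in the data can be reproduced on a small scale. The Python sketch below is an illustration under stated assumptions: the ten pairs of values are invented (they are not GSS data), the helper names are hypothetical, and only the codes 98 and 99 come from the text. It computes the correlation first naively and then after dropping any pair in which either variable carries a missing-value code.

```python
from math import sqrt

MISSING_CODES = {98, 99}  # "Don't know" and "No answer", as in the text


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def drop_missing(xs, ys, codes=MISSING_CODES):
    """Keep only the pairs in which neither value is a missing-value code."""
    kept = [(x, y) for x, y in zip(xs, ys) if x not in codes and y not in codes]
    return [x for x, _ in kept], [y for _, y in kept]


# Invented illustration data: eight valid cases plus two with missing codes.
sibs = [0, 1, 2, 3, 4, 5, 6, 8, 99, 98]
educ = [16, 16, 14, 13, 12, 12, 11, 9, 16, 99]

r_naive = pearson_r(sibs, educ)                 # codes treated as real values
r_clean = pearson_r(*drop_missing(sibs, educ))  # codes excluded first

print(f"naive r = {r_naive:.3f}, cleaned r = {r_clean:.3f}")
```

In these invented data two contaminated cases are enough to reverse the sign of a strongly negative correlation, which is why tabulating or plotting each variable before correlating it is worth the few seconds it takes.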
A WORKED EXAMPLE: SOCIETAL DIFFERENCES IN STATUS ATTAINMENT
The failure to omit missing cases in the preceding example is a particularly blatant error, easy to note and easy to fix. Sometimes, however, errors are more subtle. Thus we need a set of procedures for detecting anomalies in our data. Regression diagnostic procedures are at present not very well systematized. There are many graphical methods and tests, often doing more or less the same thing, and considerable confusion about nomenclature (the same procedures are called by different names, and the same names are used for different procedures). I have illustrated a subset of these procedures that seem useful, favoring those that are or easily can be implemented in Stata. (For a useful exposition of regression diagnostic procedures, see Bollen and Jackman [1990].)
As a concrete example of how to carry out regression diagnostic procedures, I reanalyze an article I completed with a former graduate student, Kam-Bor Yip (Treiman and Yip 1989). In this multilevel analysis we were interested in how macrosocial characteristics affect the process of status attainment. For a very simple model predicting men's occupational status from their fathers' occupational status and their own education in eighteen nations, we hypothesized that the effect of the son's education should be stronger, and the effect of the father's occupational status weaker, in more industrialized countries and in countries with less income inequality and less educational inequality in the father's generation.
The first step, after converting all of our data to a common metric, was to estimate the micromodel separately for each nation. The second step was to predict the size of the regression coefficients resulting from the first step, using measures of industrialization and inequality. This sort of two-step estimation procedure, although statistically suboptimal, is conceptually clear. (For statistically optimal multilevel procedures, see Raudenbush and Bryk [2002], and for a brief introduction to multilevel analysis, see the discussion in Chapter Sixteen.)
Here I reanalyze the results shown in Equation 7 of Treiman and Yip:

$$b_E = -.20(EI) - .19(II) + .31(D); \quad R^2 = .55; \quad \text{Adj. } R^2 = .46 \tag{10.2}$$

where b_E is the metric regression coefficient relating occupational status to education in each microequation, EI is a measure of educational inequality, II is a measure of income inequality, and D is a measure of economic development. The coefficients estimated for Equation 10.2 are expressed in standard form. However, regression diagnostic procedures operate on metric coefficients. Here is the corresponding equation, with coefficients expressed in metric form:

$$b_E = 2.02 - .35(EI) - .32(II) + .30(D) \tag{10.3}$$
Although regression diagnostic procedures are helpful for samples of all sizes, they are particularly useful for analyses based on small numbers of observations because such samples are particularly vulnerable to the undue influence of one or a few extreme observations.
Downloadable files "ch10.do" and "ch10.log" show the Stata -do- and -log- files for my reanalysis; you should study these along with the text because many details are provided only in the commentary in these files.
Preliminaries
As you should do in any reanalysis of published results, I start by trying to replicate the published figures. The Stata -log- file shows a listing of the data set, various summary statistics, and estimates for the regression equation reported in the published article. All agree with the corresponding figures in the published article. This is not always the case. A surprisingly large number of published articles contain errors: coefficients that do not correspond to estimates derived from data sets that are listed by the authors or are available from archives. Sometimes this is because the authors have dropped cases or transformed variables without informing the reader, but sometimes the authors have simply made mistakes. Given the ease of email communication, it often is possible to clear up such problems relatively painlessly and is certainly worth the effort.
The first time I tried to replicate the published equation I got an absurd minimum value for the educational inequality measure and regression estimates that disagreed with the published estimates. It turned out that the explanation was simple: the scanning operation that input the data from the published article (which was written many years ago, before I started systematically keeping logs of my work) recorded 69 rather than 0.69 for Britain, and I failed to notice this when I proofread the file. It is not worth the space, or your time, to detail my effort to detect the source of and correct the problem, but the lesson is clear: devise as many checks as possible and study them carefully at each step before proceeding.
There also is a lesson here regarding good professional practice: in your publications always describe your procedures in sufficient detail to make it possible for a competent analyst to exactly replicate your coefficients given only your paper and your original data set. Doing so is not only a matter of courtesy; it will help you discover your own errors before they are published and become vulnerable to snide correction by some graduate student looking for a quick publication. Whenever you produce a paper for publication (or even a semipublication such as a deposit on a Web page or a submission as a term paper or dissertation chapter), your last step before submission should be to rerun your complete -do- file and check every coefficient in your paper against the output of the -do- file. You will be surprised how many errors you find!
Leverage
Having established that I can replicate the published results, I now consider whether they provide a reasonable representation of the relationships in the data. I start by considering whether any observations have particularly high leverage, where leverage refers to the difference between the value or values of the independent variable or variables for a particular observation, and the mean or centroid of the values for all observations. Plot (d) in Figure 10.1 illustrates this case. The observation with a score of nineteen on the horizontal axis has high leverage. Such observations are troublesome because they may have undue influence on the regression slope; obviously, in (d) of Figure 10.1 the slope would be infinite except for the single high-leverage point.
A conventional measure of leverage is the diagonal elements of the hat-matrix, which provides a scale-free measure of the distance of individual observations from the centroid. Computing the hat-matrix for the eighteen nations in our data set (search for "hat" in the downloadable file "ch10.do"), we note that India has an unusually large value, four times the mean hat-value. This suggests the possibility that India is unduly influencing the regression estimates.
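For a regression with a single predictor the hat-values can be written out directly, which makes the idea of leverage concrete. The sketch below is an illustration in Python rather than the Stata code in "ch10.do"; the function name is hypothetical. It uses the horizontal-axis values of Anscombe's fourth data set, the one shown in plot (d) of Figure 10.1: ten observations at 8 and one at 19.

```python
def hat_values(xs):
    """Diagonal of the hat-matrix for a one-predictor regression with an
    intercept: h_i = 1/n + (x_i - mean)^2 / sum of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


# X values of Anscombe's fourth data set: ten 8s and a single 19.
x = [8] * 10 + [19]

h = hat_values(x)
mean_h = sum(h) / len(h)  # always (k + 1)/n, here 2/11

for value, leverage in zip(x, h):
    print(value, round(leverage, 3), round(leverage / mean_h, 2))
```

The point at 19 has a hat-value of 1, the largest value possible, which is the numerical counterpart of the statement that it alone determines the slope; each of the other ten points has a hat-value of 0.1.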
Before acting on this possibility, however, we need to further explore the data. Our next step is to discover whether there are any extreme outliers, observations far from the regression surface. To do this we need to adjust for the fact that observations with high leverage tend to have small residuals precisely because the least-squares property pulls the regression surface toward such observations. The studentized residual (E*) provides such an adjustment by basing each residual on a regression equation estimated with the observation omitted. The studentized residual is attractive because it follows a t-distribution with N - k - 2 degrees of freedom (where N is the number of observations and k is the number of independent variables), which makes it possible to assess the statistical significance of specific residuals.
However, because we usually do not have a priori hypotheses regarding particular observations, we need to adjust our tests of significance for simultaneous inference. A simple way to do this is to make a Bonferroni adjustment by dividing our desired probability value (conventionally .025 for a two-tailed test) by the number of possible comparisons, which in this case is the number of observations. Thus the procedure for the analysis here is to compute studentized residuals and to identify outliers as unlikely to have arisen by chance if the p-value is less than .025/18 = .00139. As it happens, none of the outliers is statistically significant, because the largest studentized residual, for Denmark, is 3.349, with 18 - 3 - 2 = 13 d.f., which implies a p-value of .00523 (search for "estu" in the downloadable file "ch10.do").
It probably would be unwise to take such tests of significance too seriously, especially given the very small sample. Fox (1997, 280) argues that studentized residuals greater than two in absolute value are worthy of concern. This suggests that we need to further consider Denmark (E* = 3.35) and perhaps India (E* = 1.9).
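The Bonferroni arithmetic in this passage can be verified without a statistics package. The following Python sketch is an illustration only (Stata reports the p-value directly). It obtains the two-tailed p-value by numerically integrating the t density with Simpson's rule; the function names and the number of integration steps are my own choices, while N = 18, k = 3, the .025 criterion, and Denmark's studentized residual of 3.349 come from the text.

```python
from math import gamma, pi, sqrt


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return norm * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, using Simpson's rule on [0, |t|] (steps even)."""
    upper = abs(t)
    width = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * width, df)
    central_area = 2 * total * width / 3  # area between -|t| and +|t|
    return 1 - central_area


n_obs = 18        # nations
n_predictors = 3  # EI, II, and D
df = n_obs - n_predictors - 2   # 13
cutoff = 0.025 / n_obs          # Bonferroni-adjusted criterion

p_denmark = two_tailed_p(3.349, df)
print(df, round(cutoff, 5), round(p_denmark, 5), p_denmark < cutoff)
```

The computed p-value is close to the .00523 reported in the text and is larger than the adjusted criterion of .00139, so Denmark is not a statistically significant outlier by this test.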
Influence
Measures that take simultaneous account of both leverage and outliers are known as influence statistics. Several relatively similar measures are available; here we focus on Cook's Distance measure (Cook's D), which is a scale-free summary measure of how a set of regression coefficients changes when each observation is omitted. Taking 4/N as a cutoff point for Cook's D, we note that only India is exceptionally influential, with the United States marginally so (search for "cooksd" in the downloadable file "ch10.do").
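To see how Cook's D combines a residual with a hat-value, here is a minimal Python sketch for the one-predictor case. It is an illustration, not the author's Stata code: the data are invented (nine points near a straight line and one point far from the others on both axes), the function name is hypothetical, and the formula used is the standard one, D_i = [e_i^2 / (p s^2)] [h_i / (1 - h_i)^2], where p is the number of estimated coefficients and s^2 is the residual variance.

```python
def cooks_distances(xs, ys):
    """Cook's D for each observation in a one-predictor OLS regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    n_coef = 2  # intercept and slope
    s2 = sum(e * e for e in residuals) / (n - n_coef)  # residual variance
    hats = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    return [
        (e * e / (n_coef * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, hats)
    ]


# Invented data: the last case is far from the rest on X and off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 5.0]

d = cooks_distances(x, y)
cutoff = 4 / len(x)  # the 4/N rule of thumb used in the text
flagged = [i for i, value in enumerate(d) if value > cutoff]
print(flagged)  # [9]
```

Only the last case exceeds 4/N here. An observation needs both a sizable residual and substantial leverage to produce a large Cook's D, which is why the measure can single out a different case from the one with the largest residual.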
Plots for Assessing Influence
Despite our focus thus far on numerical summary measures, a generally more useful approach to diagnosing regression ills is to plot the relationships among various indicators.
Two useful plots that combine measures of leverage and residuals are the leverage-versus-residual-squared plot (the -lvr2plot- command in Stata) and the studentized-residual-versus-hat plot weighted by Cook's D, proposed by Fox (1997, 285) and easily implemented in Stata (to see how I did this, consult the downloadable file "ch10.do"). Figures 10.5 and 10.6 show these plots. The point of squaring the residuals in Figure 10.5 is to indicate the influence of the outliers, because the regression procedure minimizes the sum of squared errors. Still, Figure 10.6 seems to do a better job of revealing the overall influence of specific observations. Clearly India stands out from the remaining observations. Denmark, however, is the largest outlier.
Added-Variable Plots
Our next task is to try to discover any systematic relationships among the variables that might account for either the large residuals or the highly influential observations. A good way to do this is to construct added-variable plots, also known as partial-regression leverage plots or simply partial-regression plots. Such plots provide a two-dimensional analog to the kind of scatter plot with a regression line through it that we can construct in the simple regression case. Added-variable plots do this by showing a plot of the relationship between two sets of residuals: (1) the residuals from a regression of the dependent variable on all variables except one and (2) the residuals from the regression of the variable omitted from (1) on the remaining independent variables.

FIGURE 10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989). Note: The horizontal and vertical lines are the means of the two variables being plotted.

FIGURE 10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D. Note: The horizontal line is at the mean hat-value, and the vertical line is at zero.

FIGURE 10.7. Added-Variable Plots for Treiman and Yip's Equation 7.

Graph (a) in Figure 10.7, assessing the effect of educational inequality (EI), suggests that India is highly influential; it has very high educational inequality relative to its income inequality and level of industrialization, and it also displays a stronger effect of education on occupation than would be expected from its levels of income inequality and industrialization. Interestingly, the plot reveals that if India were removed or downweighted, the slope relating educational inequality to the level of occupational returns to education would increase. Denmark, by contrast, has unusually low educational inequality relative to its income inequality and industrialization, but it has a much stronger education-occupation connection than would be expected from its position on the other two variables, so the omission or downweighting of Denmark would decrease the effect of educational inequality. Graph (b), assessing the effect of income inequality (II), reveals that only Denmark is a large outlier. Otherwise, the plot is fairly unremarkable. Graph (c), assessing the effect of industrialization (D), shows the United States to be a high-leverage observation, with a very high level of industrialization relative to its level of educational and income inequality. Because the United States is below the regression line, its omission would increase the slope.

FIGURE 10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
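The construction behind an added-variable plot can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the author's procedure: there are only two predictors (so that "the remaining independent variables" is a single variable and each step is a simple regression), the data are invented so that Y is exactly 1 + 2(X1) + 3(X2), and the function names are hypothetical. With more predictors each residualizing step would be a multiple regression.

```python
def residualize(target, control):
    """Residuals from an OLS regression of target on control (with intercept)."""
    n = len(target)
    mean_c = sum(control) / n
    mean_t = sum(target) / n
    slope = (
        sum((c - mean_c) * (t - mean_t) for c, t in zip(control, target))
        / sum((c - mean_c) ** 2 for c in control)
    )
    return [(t - mean_t) - slope * (c - mean_c) for c, t in zip(control, target)]


def slope_of(ex, ey):
    """Slope of the line through the residual pairs (both have mean zero)."""
    return sum(a * b for a, b in zip(ex, ey)) / sum(a * a for a in ex)


# Invented data in which Y = 1 + 2*X1 + 3*X2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Points of the added-variable plot for X1:
e_y = residualize(y, x2)    # vertical axis: Y net of X2
e_x1 = residualize(x1, x2)  # horizontal axis: X1 net of X2

print(round(slope_of(e_x1, e_y), 6))  # 2.0
```

The slope through the plotted residuals equals the partial coefficient for X1 in the full regression. That equality is why an observation that stands apart in an added-variable plot is one that pulls on that particular coefficient.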
Residual-Versus-Fitted Plots and Formal Tests for Patterns in the Data
A second test, Stata's -ovtest- command, assesses the possibility of omitted variables by testing whether the fit of the model is improved when the second through fourth powers of the fitted values are added to the equation. Given the small sample size, I take the p-value of .08 resulting from this test as suggesting the possibility of omitted variables.
Component-plus-residual plots are useful in revealing the functional form of relationships and, by extension, the possibility of omitted variables. Such plots differ from added-variable plots because they add back the linear component of the partial relationship between Y and X to the least-squares residuals, which may include an unmodeled nonlinear component. Figure 10.9 shows such plots for our data, using the "augmented" version available in Stata (search for "acprplot" in the downloadable file "ch10.do").

FIGURE 10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.

The plots in Figure 10.9 continue to show Denmark as a large outlier. But otherwise they do not appear disorderly; and with one exception I can think of no omitted variables. The exception derives from work by Müller and Shavit (1998) that suggests that the education-occupation connection is especially strong in nations with well-developed vocational education systems and especially weak in nations with poorly developed vocational education systems. In our data Denmark, Germany, Austria, and the Netherlands have especially strong vocational education systems, and the United States, Japan, and Ireland have very weak vocational education systems. The relationship found by Müller and Shavit seems to hold in our data, with the nations with strong vocational education systems above the regression line and the nations with weak vocational education systems below the regression line. This result suggests adding the strength of the vocational education system as a predictor. To do this I add two dummy variables to distinguish the three sets of nations (strong, weak, and neither especially strong nor especially weak vocational education systems). I then reestimate Equation 7, which yields the coefficients shown in the second column of Table 10.1 (for convenience, Column 1 shows the metric coefficients from Treiman and Yip's original Equation 7, that is, those shown in Equation 10.3 of this chapter); the remaining columns show various additional estimates discussed in the following paragraphs.
The specification shown in Column 2 produces a better representation of the determinants of the strength of the education-occupation connection in the eighteen nations studied here than does the original specification. The adjusted R² increases substantially and, as expected from the pattern of residuals, the coefficients for strong and weak vocational educational systems have the expected signs. (I discuss the standard errors later in this chapter.)
However, the question remains as to whether the results are still substantially driven by India and Denmark. To determine this I repeated all the diagnostic procedures discussed previously with the new equation. The Stata log contains the commands I used, but in the interest of saving space and avoiding tedium, I have not shown the resulting plots and will not discuss the results except to note that India continues to be a high-leverage point and Denmark continues to be a large outlier, although the diagnostic indicators for both are somewhat less extreme than the corresponding indicators just reviewed.
TABLE 10.1. Coefficients for Models of the Determinants of the Strength of the Occupation-Education Connection in Eighteen Nations.

                           Panel 1: 18 Observations                 Panel 2: 17 Observations (Expanded Model)
                           (1) Original Model     (2) Expanded Model   (3) OLS            (4) Robust
Variable                   (Metric Coefficients)  (OLS Estimates)                         Regression
Educational inequality     -0.354 (0.532)         -0.292 (0.565)       -0.821 (2.268)     …
Income inequality          -0.320 (0.299)         -0.342 (0.324)       -0.321 (1.594)     …
Industrialization           0.299 (0.275)          0.287 (0.275)        0.208 (1.449)     …
Strong voc. ed. system                             0.836 (0.410)        0.707 (0.644)     …
Weak voc. ed. system                              -0.403 (0.414)       -0.476 (2.518)     …
Intercept                   2.021 (0.222)          1.899 (0.251)        1.814 (0.631)     …
R²                          .553                   .762                 .792
Adjusted R²                 .457                   .662                 .698

Note: Bootstrapped standard errors are shown in parentheses. The corresponding asymptotic standard errors are …, and 0.176 for Column 1; … for Column 2; …, 0.313, and 0.174 for Column 3; and …, 0.267, 0.185, 0.330, 0.362, and 0.189 for Column 4.
ROBUST REGRESSION
So what to do? Because we have no clear basis for modifying or omitting particular observations, nor for transforming our variables to a different functional form, we need an alternative way of handling outliers and high-leverage points. One alternative is robust estimation, which does not in general discard observations but rather downweights them, giving less influence to highly influential observations. Robust estimators are attractive because they are nearly as efficient as ordinary least squares estimators when the error distribution is normal and are much more efficient when the errors are heavy-tailed, as is typical with high-leverage points and outliers. There are, however, several robust estimators, and there are no clear-cut rules for knowing which to apply in what circumstances. The best advice is to explore your data as thoroughly as time and energy permit. (For further details on robust estimation, consult Fox [1997, 405-414; 2002], Berk [1990], and Hamilton [1992a; 1992b, 207-211].)
One class of robust estimators, known as M estimators, works by downweighting observations with large residuals. It does this by performing successive regressions, each one (after the first) downweighting each observation according to the absolute size of the residual from the previous iteration. Different M estimators are defined by how much weight they give to residuals of various sizes, which can be shown graphically as objective functions. The objective functions of three well-known M estimators are shown in (a), (b), and (c) of Figure 10.10. The OLS objective function ([a] of Figure 10.10) increases quadratically, as it must given that OLS regression minimizes the sum of squared residuals. The Huber function ([b] of Figure 10.10) gives small weight to small residuals but weights larger residuals as a linear function of their size. The bisquare objective function ([c] of Figure 10.10) gives sharply increasing weight to medium-sized residuals but then flattens out so that all large residuals have equal weight.
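The three objective functions just described, and the case weights they imply at each iteration, can be written down compactly. The Python sketch below uses the standard textbook forms of the OLS, Huber, and bisquare functions. The tuning constants 1.345 and 4.685 are conventional choices for residuals expressed in units of a robust estimate of their spread; they are an assumption of this illustration, not values given in the text, and the function names are hypothetical.

```python
HUBER_K = 1.345     # conventional Huber tuning constant (assumed)
BISQUARE_K = 4.685  # conventional bisquare tuning constant (assumed)


def ols_objective(e):
    """Least squares: the contribution grows with the square of the residual."""
    return 0.5 * e * e


def huber_objective(e, k=HUBER_K):
    """Quadratic for small residuals, linear beyond the tuning constant k."""
    a = abs(e)
    return 0.5 * e * e if a <= k else k * a - 0.5 * k * k


def bisquare_objective(e, k=BISQUARE_K):
    """Rises for moderate residuals, then is constant beyond k."""
    if abs(e) > k:
        return k * k / 6
    return (k * k / 6) * (1 - (1 - (e / k) ** 2) ** 3)


def huber_weight(e, k=HUBER_K):
    """Case weight implied by the Huber function: 1, then declining as k/|e|."""
    a = abs(e)
    return 1.0 if a <= k else k / a


def bisquare_weight(e, k=BISQUARE_K):
    """Case weight implied by the bisquare function: zero beyond k."""
    return (1 - (e / k) ** 2) ** 2 if abs(e) <= k else 0.0


for e in (0.5, 1.0, 3.0, 10.0):
    print(e, ols_objective(e), round(huber_objective(e), 3),
          round(bisquare_objective(e), 3),
          round(huber_weight(e), 3), round(bisquare_weight(e), 3))
```

The weights show the practical difference between the two robust functions: Huber weights shrink gradually but never reach zero, so a severe outlier always retains some influence, whereas bisquare weights drop to zero for any residual beyond the tuning constant.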
Because Huber weights deal poorly with severe outliers (whereas biweights sometimes fail to converge or produce multiple solutions), Stata's implementation of robust regression first omits any observations with very large influence (Cook's D > 1), uses Huber weights until the solutions converge, and then uses biweights until the solutions again converge. Because of the way it is defined, robust regression takes account only of outliers but not of high-leverage observations with small residuals. For some problems this can be a major limitation.

FIGURE 10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.

Panel 2 of Table 10.1 shows (in Column 4) robust regression estimates for the elaborated model of the education-occupation connection we have been studying. There is no robust regression estimate in Panel 1 because the procedure dropped India at the outset because of its large Cook's D. Column 3 shows the corresponding OLS estimates with India omitted. Interestingly, the OLS and robust regression estimates differ very little in Panel 2, with the exception of the effect of strong vocational education, which is reduced in the robust estimate because Denmark, with its large residual, is downweighted. The agreement between different estimators does not always hold and should not be taken as an indication that robust estimation is unnecessary. However, the stability of the estimates under different estimation procedures gives us added confidence in them.
By contrast, the omission of India strongly affects the educational inequality coefficient, increasing it by more than a factor of two. The coefficient for strong vocational education is modestly reduced, and the coefficient for industrialization is even more modestly reduced. A reasonable conclusion is that India departs from the general relationship between industrialization, inequality, and the education-occupation connection, and that we may properly set India aside for separate consideration.
BOOTSTRAPPING AND STANDARD ERRORS
[...] both ordinary least squares and robust regression [...]. If the errors are not normally distributed, the distribution of [...] is asymptotically normal, that is, [...] where N is the number of observations and k is the number of [...]. One way around this problem is to bootstrap the standard errors [...].

[...] the population from which our sample of eighteen nations is drawn does not actually exist since we took all nations for which data were available. Thus we need to resort to an approximation.

Bootstrapping approximates resampling by taking the observed sample as a proxy for the population and repeatedly sampling, with replacement, observations from the observed sample. Thus, in our current example, we would randomly draw (with replacement) a first sample of eighteen cases from our eighteen observations, say, Norway, the Netherlands, India, Ireland, Austria, United States, Finland, Philippines, Denmark, Italy, Taiwan, Sweden, India, Ireland, Finland, Denmark, Denmark, Taiwan. Note that England, Germany, Hungary, Japan, Northern Ireland, and Poland do not fall into the sample; Austria, Italy, the Netherlands, Norway, the Philippines, Sweden, and the United States are included once; Finland, Ireland, India, and Taiwan are included twice; and Denmark is included three times. From this sample we would estimate our regression equation and record the coefficients. Then we would draw a second sample with replacement, a third, and so on. The result is, for each coefficient, a distribution of values equal in size to the number of samples we have drawn. From this distribution we estimate the standard error as the standard deviation of the distribution. (For further discussion of bootstrapping, see Fox [1997, 493-514], Stine [1990], Hamilton [1992a; 1992b, 313-325], and the entry for bootstrap in the Stata 10.0 manual.)

This method provides a good estimate of the standard error of a statistic, provided that the sample in fact represents the population from which it is drawn and that the resulting distributions are approximately normal. With very small samples with outliers and high-leverage points, as in our case, there tends to be high variability from sample to sample. Thus it is wise to draw many samples to get stable estimates of the sampling distribution.
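The resampling logic just described can be written compactly. What follows is a minimal sketch in Python with NumPy rather than the Stata code in "ch10.do"; the function names, the simulated data, and the choice of OLS as the estimator are assumptions made purely for illustration.

```python
import numpy as np

def ols_coefs(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_se(X, y, estimator=ols_coefs, n_reps=2000, seed=12345):
    """Resample cases with replacement, re-estimate, and take the standard
    deviation of each coefficient across replications."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_reps, X.shape[1]))
    for r in range(n_reps):
        idx = rng.integers(0, n, size=n)   # n cases drawn with replacement
        draws[r] = estimator(X[idx], y[idx])
    return draws.std(axis=0, ddof=1), draws

# Simulated data (hypothetical): y = 1 + 2x + noise, fifty cases.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])

se, draws = bootstrap_se(X, y)
print("bootstrap standard errors:", se)
```

Because `draws` holds every replication, the same array can be used to inspect the shape of each coefficient's distribution, which matters when the distributions are far from normal.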
In the present case I drew 2,000 samples to estimate the standard errors in each of the columns of Table 10.1 (see the section on "Bootstrapped Standard Errors" in the downloadable file "ch10.do"). I experimented with smaller numbers of samples but got unsatisfactory variability across trials in my estimates of the standard errors. With 2,000 replications the standard errors seem to be reasonably stable but hardly normally distributed (see Figure 10.11). The outliers in these distributions derive from the random omission or multiple presence of high-leverage observations. (With seventeen observations sampled with replacement, the probability of a given country being excluded from a particular sample is 36 percent; more precisely, 0.357 = (1 - 1/N)^N = (1 - 1/17)^17.) Note that the standard errors are sometimes much larger than the corresponding asymptotic standard errors reported in the note to the table, especially those for the educational inequality measure. This result alerts us to the danger of naively accepting computed standard errors from general purpose statistical programs, especially when working with small samples. On the other hand, as noted previously, it is unclear in the present case that much should be made of the standard errors, given that our "sample" is both very small and hardly a probability sample of any population. It is reasonable to tentatively accept the estimated model, specifically the robust regression estimates for seventeen societies reported in Column 4 of Table 10.1, which have far smaller standard errors than do the corresponding OLS estimates. Nonetheless, we must note that the results are only suggestive and require confirmation with more and better data before being regarded as definitive.
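The exclusion probability quoted in the parenthetical above is easy to check. The short Python sketch below (the function name is hypothetical) evaluates (1 - 1/N)^N for seventeen cases and compares it with its large-sample limit, 1/e.

```python
import math

def prob_excluded(n):
    """Chance that a given case is absent from a bootstrap sample of size n
    drawn with replacement from n cases."""
    return (1.0 - 1.0 / n) ** n

p17 = prob_excluded(17)
print(round(p17, 3), round(math.exp(-1), 3))
```

The probability rises toward 1/e as N grows, so in samples of any size roughly a third of the cases are missing from a typical bootstrap replication.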
FIGURE 10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries. (Panels: Income inequality, Educational inequality, Strong vocational system, Weak vocational system, Intercept.)
Note: These are the bootstrapped coefficients for Column 4 of Table 10.1.

However, when we have genuine probability samples of larger populations, the standard errors and the confidence intervals they imply assume much greater importance. The calculation of appropriate confidence intervals for bootstrapped statistics is an unsettled and ongoing area of statistical research. Stata provides four different 95 percent confidence intervals based on different assumptions. There is considerable controversy as to which of these estimates provides the best coverage of the true standard error. But the weight of the evidence to date seems to support bias-corrected estimates, and this is the default in Stata.
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have seen how to check our data for anomalous observations and violations of the assumptions underlying OLS regression, how to use the information obtained to generate new hypotheses, how to use robust regression procedures to achieve estimates with smaller standard errors, and how to obtain standard errors in situations in which the assumption that the errors are normally distributed cannot be sustained. The main lesson of this chapter is that much can be learned by graphing relationships in the data. Indeed, often the best way to understand your data is to graph what you think you are observing. The results are often surprising, and usually informative.

CHAPTER 11

SCALE CONSTRUCTION

WHAT THIS CHAPTER IS ABOUT
In this chapter we see how to improve both the validity and reliability of measurement by constructing multiple-item scales. We consider three ways to construct scales: additive scaling, factor-based scaling, and effect-proportional scaling. We also consider two variants of regression analysis: errors-in-variables regression, which corrects for unreliability of measurement, and seemingly unrelated regression, which is used to compare regression equations with (some or all of) the same independent variables but different dependent variables.
INTRODUCTION
In social research we often wish to study the relationship among concepts for which we have no direct and exact measures. Examples include, from social stratification, class, status, and power; from attitude research, anomie, alienation, and authoritarianism; and from political sociology, liberalism versus conservatism. For concepts of this kind, it is difficult to imagine that any single measure of what people believe or how they behave would adequately reflect the concept. Suppose, for example, that we are interested in differentiating among members of Congress on the basis of how liberal their voting record is. We would hardly be content to measure "liberal voting" by choosing a single vote (say, support for foreign aid) and declaring those who voted for it to be liberals and those who voted against it to be conservatives. For any particular vote, factors besides "liberalism" or "conservatism" may come into play: objection to the particular language of the legislation, the need to discharge a political debt, the judgment that in tight times funds are better spent on social welfare at home, and so on. Although extraneous factors may affect any particular vote, we would still expect that on average "liberals" would be more likely to support foreign aid, domestic welfare, civil liberties, voting rights, affirmative action, and such, than would "conservatives." (Of course, we might want to refine our concept, differentiating domains of liberalism or conservatism, for example, social versus fiscal policy, and internationalism versus isolationism. But the same basic point holds: any one item tends to be a poor measure of an underlying concept because extraneous factors may affect the response to a single item.) Therefore a useful strategy for constructing operational indicators of underlying concepts is to create multiple-item scales. That is, we take the average of the responses to each of a set of items thought to reflect an underlying concept as indicating or measuring the strength of the concept. Multiple-item scales should satisfy two criteria: they should be valid, and they should be reliable.
VALIDITY
An indicator is valid if it measures what it is supposed to measure, that is, if it adequately measures the underlying concept. Unfortunately, there is in general no technical way to evaluate the validity of a scale, although, as we will see later in the chapter when we discuss factor-based scaling, we can gain confidence in the validity of a scale by determining that it has the relationships to other variables that we would expect on theoretical grounds. Assessment of validity is mainly a matter of constructing an appropriate theoretical link for the relationship between the concept and its indicator or indicators, and also between the concept and other variables.

Many of the most important arguments in science are about the validity of measurement. This is as true in the physical and biological sciences as in the social sciences. Indeed, I recommend a fascinating account of a search for life on Mars (Burgess 1978) that includes a vivid portrayal of the ongoing dispute between the "pro-life" camp and the "anti-life" camp as to whether particular indicators being sent back by the Mars Lander could validly be interpreted as evidence of the presence of life on Mars.

The first requirement for devising a valid measure is to be clear about what you are trying to measure. This is not as obvious as it sounds. More often than not our concepts are formulated rather vaguely. Just what do we mean by "social class," for example? If we take a Marxist approach and define class by the "relationship to the means of production," we have merely shifted the problem, because we then have to say exactly what we mean by the relationship to the means of production. If we take a Weberian approach and define class by "market position," we have exactly the same problem.

Anyone who thinks I am constructing a straw man is advised to look at the writings of Erik Olin Wright and his followers, who are to be commended for trying to do serious quantitative research within a Marxist framework (see, for example, Wright and others [...], Wright 1985, and Wright and Martin 1987). A good portion of the writings of Wright and his group is preoccupied with the validity of alternative indicators.

Even seemingly straightforward variables often have the same difficulties. Just what underlying concept are we trying to measure when we devise a scale of educational attainment: skill, knowledge, credentials, values, conformity to external demands, or still something else? In principle our theory, as represented in the specification of the concept of interest, should dictate our choice of indicators. For example, if we are interested in the gatekeeping function of education in channeling access to particular kinds of jobs, we may want to measure educational attainment by the highest degree one obtains. If we regard schooling as enhancing cognitive skills, we may wish simply to count the number of years of schooling a person obtains.

Sometimes, of course, we are restricted to extant data and must work in the other direction, constructing an argument about what underlying concept is represented by the measure we have at hand. In either event clarity is crucial in your own mind, and on the written page as well, regarding what concepts your indicators measure. (For a brief introduction to different types of validity, see Carmines and Zeller [1979, 17-26].)
RELIABILITY
Reliability refers to consistency in measurement. Different measures of the same concept, or the same measurements repeated over time, should produce the same results. For example, if one individual scores high and another individual scores low on a measure of interracial tolerance, we would like to get the same difference between the two individuals if we used a different (but equally valid) measure of interracial tolerance; to the extent the measures yield similar results, we say that both measurements are reliable. Also, if the same respondent were asked his attitude at two points in time, we would like to get the same result (assuming he has not changed his attitude).

From this definition it is easy to see why multiple-item scales are generally more reliable than single-item scales. When responses are averaged over a set of items, each of which measures the same underlying dimension, the idiosyncratic reasons that individuals respond in particular ways to particular questions tend to get "averaged out." Of course, this is true only to the extent that each item in a scale reflects the same underlying dimension (the conceptual variable). If an item is capturing some other underlying dimension instead of, or in addition to, the one of interest to the analyst, it will undercut the reliability (and validity) of a scale. For example, suppose responses to a question about willingness to have people of a different race as neighbors reflected differences in economic anxiety, with some people rejecting potential neighbors not because of racial intolerance but for fear (rightly or wrongly) of a reduction in property values. We would not want to include such an item in a scale of racial tolerance because it would tend to make the scale less reliable, with scale scores determined to some extent by whether we happened to include a lot of people with economic anxieties or only a few such people.

An important reason for creating reliable scales is that, all else equal, unreliable scales tend to have lower correlations with other variables. This follows from the fact that unreliable scales contain a lot of "noise." We might think of scales as having a "true" component and an "error" component. The "true" component is represented by the correlation of the observed measurement with the true underlying dimension; the size of this correlation gives the reliability of the scale. The "error" component, the portion uncorrelated with the underlying dimension, reflects idiosyncratic determinants of the observed measurement. From this definition of reliability, it follows that the lower the reliability of each of two measures, the lower the correlation between their observed values relative to the true correlation between the underlying dimensions. Formally, we can estimate the "true" correlation between variables by knowing their observed correlation and the reliability of each variable. The true correlation is given by

\[ \rho_{X_t Y_t} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}} \tag{11.1} \]

where $\rho_{X_t Y_t}$ is the correlation between the true scores, $r_{XY}$ is the observed correlation between X and Y, and $r_{XX}$ and $r_{YY}$ are the reliability coefficients for X and Y, respectively. Equation 11.1 is also referred to as a formula for correcting for attenuation caused by unreliability; $\rho_{X_t Y_t}$ is the correlation between X and Y corrected for attenuation. For example, if two scales each have a reliability of .7 and an observed correlation between them of .3, the correlation corrected for attenuation will be $.3/\sqrt{(.7)(.7)} = .43$. Clearly, correlations can be strongly affected by the reliability of the component variables.
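The correction for attenuation is a one-line computation. The Python sketch below (the function name is an assumption for illustration) reproduces the worked figure just given: two scales with reliabilities of .7 and an observed correlation of .3.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Observed correlation divided by the square root of the product of the
    two reliability coefficients (Equation 11.1)."""
    return r_xy / math.sqrt(rel_x * rel_y)

corrected = correct_for_attenuation(0.3, 0.7, 0.7)
print(round(corrected, 2))
```

Lowering either reliability shrinks the denominator, so the same observed correlation implies a larger true correlation.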
RELIABILITY
There are several ways to measure reliability:

• Test-retest reliability is the correlation between scores of a scale administered at two points in time.

• Alternate-forms reliability is the correlation between two different scales thought to measure the same underlying dimension.

• Internal-consistency reliability is a function of the correlation among the items in a scale. Cronbach's alpha, discussed in the following paragraphs, is an internal-consistency measure.
[...] and others [1972, 1979] [...] later in this chapter [...] this concept.

Reliability, or internal consistency, of a scale depends on two factors: the number of items and the average correlation among items. Specifically,

\[ \alpha = \frac{N\bar{r}}{1 + \bar{r}(N - 1)} \tag{11.2} \]

where N is the number of items and $\bar{r}$ is the average correlation among items. In Table 11.1 [...] respectively [...]. With an average correlation of .25, scales composed of at least seven items have a reliability of at least .7 [...].

WHY THE SAT AND GRE TESTS INCLUDE SEVERAL [...]
The relationship between reliability and the number of items in a scale makes clear why examinations such as the SAT and GRE comprise several [...] while the test is taken, and also, of course, the degree of preparation.
TABLE 11.1. Values of Cronbach's Alpha for Multiple-Item Scales with Various Combinations of the Number of Items and the Average Correlation Among Items.

Number of items    Average correlation among items
                   .09        .25
1                  .09        .25
2                  .17        .40
3                  .23        .50
4                  .28        .57
5                  .33        .62
6                  .37        .67
7                  .41        .70
8                  .44        .73
9                  .47        .75
10                 .50        .77
20                 .66        .87
50                 .83        .94
100                .91        .97
200                .95        .99
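The entries in Table 11.1 follow directly from Equation 11.2. A minimal Python sketch (the function name is hypothetical) shows how alpha grows with the number of items for average inter-item correlations of .09 and .25.

```python
def cronbach_alpha(n_items, mean_r):
    """Cronbach's alpha from the number of items and the average
    inter-item correlation (Equation 11.2)."""
    return n_items * mean_r / (1.0 + mean_r * (n_items - 1))

for n in (1, 2, 7, 10, 20, 200):
    low = cronbach_alpha(n, 0.09)
    high = cronbach_alpha(n, 0.25)
    print(f"{n:>3}  {low:.2f}  {high:.2f}")
```

The function can also be inverted by trial to ask how many items are needed to reach a target reliability for a given average correlation.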
SCALE CONSTRUCTION
In this chapter we consider three strategies to create multiple-item scales: additive scaling, factor-based scaling, and effect-proportional scaling. (For a brief general introduction to scaling, see McIver and Carmines [1981]. For a recent extended treatment, see Netemeyer, Bearden, and Sharma [2003]. A classic but still useful treatment is Nunnally and Bernstein [1984].)
Additive Scaling
The simplest way to create a multiple-item scale is simply to sum or average the scores of each of the component items, which is what we have been doing up to now. Where items are dichotomous, this amounts to counting the number of positive responses. Where the items themselves constitute scales (for example, continuous variables for education or income, or attitude items ranging from "strongly agree" to "strongly disagree"), we ordinarily standardize the variables before averaging (by subtracting the mean and dividing by the standard deviation). If we fail to do this, the item that has the largest variance will have the greatest weight in the resulting scale. The effect of the variance on the weight is easy to see by considering what would happen if a researcher decided to make a socioeconomic status (SES) scale by combining education and income, and did so simply by summing, for each respondent, the number of years of school completed and the annual income. He may think he has an SES scale, but what he actually has is an income scale with a very slight amount of noise, because education typically ranges from zero to twenty years (and, in the United States, effectively from eight to twenty years) whereas income ranges in the tens of thousands of dollars. By dividing each variable by its standard deviation, the analyst gives each variable equal weight in determining the overall scale score. (I first came to appreciate this point many years ago in graduate school when a professor told me that he and another member of the faculty effectively controlled who passed and who failed the collectively graded PhD exams. They did so simply by using all of the hundred points in the scale the faculty had devised for scoring exams, while most of their colleagues gave failing examinations a score of fifty or so.)
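The education-plus-income example can be made concrete with simulated data. In the Python sketch below the data-generating numbers and all variable names are invented for illustration; only the logic (the raw sum is dominated by income, the standardized average weights both items equally) comes from the text.

```python
import numpy as np

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Simulated respondents (hypothetical values).
rng = np.random.default_rng(1)
n = 500
education = rng.integers(8, 21, size=n).astype(float)   # years of schooling
income = 20000.0 + 3000.0 * education + rng.normal(0.0, 15000.0, size=n)

raw_sum = education + income                             # naive "SES scale"
ses = (standardize(education) + standardize(income)) / 2.0

print("raw sum vs. income:      ", round(corr(raw_sum, income), 4))
print("standardized vs. income: ", round(corr(ses, income), 4))
print("standardized vs. educ.:  ", round(corr(ses, education), 4))
```

After standardizing, the scale correlates equally with each of its two components, which is what "equal weight" means in practice.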
The trouble with simple additive scales of the sort just described is that the items included may or may not reflect a single underlying dimension. A scale with a heterogeneous set of items runs the risk of being both invalid, because in addition to what the analyst thinks the scale is measuring, it is also measuring something else, and unreliable, because at least some of the items are weakly or even negatively correlated.

Factor-Based Scaling
How can we determine whether the items we propose to include in a scale reflect a single dimension? First we identify a set of candidate items that we believe measure a single underlying concept. Then we empirically investigate two questions: (1) Do the items all "hang together" as a whole, or do one or more items turn out to be empirically distinct from (in the sense of having low correlations with) the remaining items, even though we thought they reflected the same conceptual domain? If so, we must reject the offending items. (2) Does each item have approximately the same net relationship to the dependent variable of interest? If not, the deviant items should not be used because this is again evidence that they do not measure the same concept (or that they measure other concepts besides the one of interest). Assessing the second question is a simple matter of regressing the dependent variable on the set of tentatively selected components of the scale, plus additional control variables where appropriate. When the scale will be used as a dependent variable, the correlations between the component items and the independent variables should be inspected. In both situations we are looking for evidence that the candidate items all have relationships of the same sign and approximately the same magnitude.
Education, occupational status, and income are good examples of items that are positively correlated but that tend to have quite different net effects on various dependent variables. For example, fertility is known to be negatively related to education net of income but positively related to income net of education. Similarly, various measures of tolerance tend to be positively related to education net of income but unrelated or negatively related to income net of education. For this reason the common practice of constructing scales of socioeconomic status should be avoided, and each of the component variables should be included as a separate predictor of the dependent variable of interest.

A useful procedure for deciding whether items "hang together" is to submit them to a factor analysis. Factor analysis (or more precisely, exploratory factor analysis) is a procedure for empirically determining whether a set of observed correlations can, with reasonable accuracy, be thought of as reflecting, or as generated by, a small number of hypothetical underlying factors. Factor analysis is a well-developed set of techniques, with many variations. However, this chapter is concerned not with the intricacies of factor analysis but with its use as a tool in scale construction. For our present purposes, the optimal procedure is to use principal factor analysis with iterations and a varimax rotation and then to inspect the rotated factor matrix. The varimax rotation rotates the factor matrix in such a way as to maximize the contrast between factors, which is what we want when we are trying to determine whether we can find distinctive subsets of items within a larger set of candidate items. We then choose the items that have high loadings on one factor and low loadings on the remaining factor or factors. A rule of thumb for "high" is loadings of .5 or more (which are consistent with correlations of about .5^2 = .25 or higher).
TRANSFORMING VARIABLES SO THAT "HIGH" HAS A CONSISTENT MEANING
In the context of factor analysis, "high" refers to the absolute value of a factor loading. We would thus regard a loading less than or equal to -.5 or greater than or equal to .5 as high. It is important to appreciate, however, that a high negative loading implies that a variable is negatively related to the underlying concept. For this reason, it is desirable to transform all variables so they conceptually run in the same direction (that is, so that a high value on the variable indicates a high level of the underlying dimension, from which it then follows that all the indicators should be positively correlated). For example, consider the GSS items SPKCOM ("Suppose this admitted Communist wanted to make a speech in your community. Should he be allowed to speak, or not?") and COLCOM ("Suppose he is teaching in a college. Should he be fired, or not?"). Clearly, a positive response to the first item and a negative response to the second item both indicate support for civil liberties. So to make the interpretation of the factor analysis less confusing, it would be desirable to reverse the scaling of the second item. This can be accomplished easily by transforming the original variable, X, into a reverse-scaled variable, X', using the relation X' = (k + 1) - X where there are k response categories. Similar transformations are helpful in any kind of multivariate analysis.
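The reverse-scaling relation X' = (k + 1) - X is simple enough to verify by hand, but a short Python sketch makes the behavior explicit. The function name and the five-category example are assumptions for illustration; the relation presumes categories coded 1 through k.

```python
def reverse_scale(values, k):
    """Reverse an item coded 1..k so that high and low scores trade places."""
    return [(k + 1) - v for v in values]

original = [1, 2, 3, 4, 5]
reversed_item = reverse_scale(original, k=5)
print(reversed_item)
```

Applying the transformation twice returns the original coding, which is a convenient check that the right value of k was used.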
We then choose those items that meet both criteria (high loadings on the factor and similar relationships to the dependent variable) and combine them into a single scale by standardizing them (subtracting the mean and dividing by the standard deviation) and averaging them. These procedures typically produce scales with a mean near zero and a range from some messy negative number (around -2 or -3) to some equally messy positive number. For convenience of exposition, it is useful to convert the scale to a range extending from zero to one because then the coefficient associated with the scale gives the expected (net) difference on the dependent variable between cases with the lowest and highest scores on the scale. Such a conversion is easy to accomplish, by solving two equations in two unknowns as you did in school algebra:

\[ \begin{aligned} 1 &= a + b(\max) \\ 0 &= a + b(\min) \end{aligned} \tag{11.3} \]

where "max" is the maximum value of a scale, S, in the data, and "min" is the minimum value of S in the data. This yields a and b, which you then use to transform S into a new scale, S', as follows:

\[ S' = a + b(S) \tag{11.4} \]
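Solving Equation 11.3 gives b = 1/(max - min) and a = -b(min). The Python sketch below applies these to a small made-up scale; the function name and the data values are hypothetical.

```python
import numpy as np

def unit_range_coefs(s):
    """Solve 1 = a + b*max and 0 = a + b*min for a and b (Equation 11.3)."""
    lo, hi = np.min(s), np.max(s)
    b = 1.0 / (hi - lo)
    a = -b * lo
    return a, b

s = np.array([-2.7, -0.4, 0.0, 1.1, 3.2])   # a standardized scale (made up)
a, b = unit_range_coefs(s)
s_new = a + b * s                            # Equation 11.4
print(s_new)
```

Because the transformation is linear, it changes the metric of the regression coefficient for the scale but not its significance or the fit of the model.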
CONSTRUCTING SCALES FROM INCOMPLETE INFORMATION
When you construct multiple-item scales, it often is useful to compute scale scores even when information on some items is missing. This reduces the number of missing cases. For example, if I am constructing a five-item scale, I might compute the average if data are present for at least three of the five items. This is easy to accomplish in Stata by using the rowmean function of the egen command to compute the mean and the rowmiss function to count the number of missing items, replacing the scale score with the missing value code if the number of missing items exceeds your chosen limit: in the present example, if more than two of the five items have missing values.
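The same rule can be sketched outside Stata. The Python function below is an illustrative analogue, not the Stata code: it averages whatever items are present in each row and sets the score to missing when more than a chosen number of items are missing. The function name and the example rows are assumptions.

```python
import numpy as np

def scale_with_missing(items, max_missing):
    """Row mean over non-missing items; missing (NaN) when too many are absent."""
    items = np.asarray(items, dtype=float)
    missing = np.isnan(items)
    n_missing = missing.sum(axis=1)                      # analogue of rowmiss
    n_valid = items.shape[1] - n_missing
    totals = np.where(missing, 0.0, items).sum(axis=1)
    means = totals / np.maximum(n_valid, 1)              # analogue of rowmean
    return np.where(n_missing > max_missing, np.nan, means)

nan = np.nan
responses = [
    [1, 2, 3, 4, 5],        # complete
    [1, nan, 3, nan, 5],    # two items missing: still scored
    [nan, nan, nan, 4, 5],  # three items missing: set to missing
]
scores = scale_with_missing(responses, max_missing=2)
print(scores)
```

If the items have been standardized first, averaging the available items keeps the scores of incomplete cases on the same metric as those of complete cases.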
If several factors emerge from the factor analysis, we can, of course, construct several scales. Here the problem of validity looms again. Because we ordinarily start with a set of candidate items that a priori we think measure a single underlying concept, we are on the safest ground if only one factor emerges. If more than one factor emerges, we are forced to consider what concept each factor is measuring. Working from indicators to concepts carries the very real danger that our sociological imagination will get the better of us and that we will invent a concept to explain a set of correlations that reflect sampling error rather than some underlying reality. The danger is compounded if we forget that we have invented the concept to explain the data and start treating it as if it has an independent reality, that is, if we reify our concept. To be sure that we have actually discovered some underlying reality, we should replicate the items and the scale in some independent data set (perhaps by using a random half of our sample to develop our scales and fit our models and then using the other random half of the sample to verify the adequacy of both scales and models). Unfortunately, this seldom is done, because we usually want larger samples no matter how large our sample is. However, the GSS provides such opportunities because an analysis developed using data from one year often can be replicated using data from the preceding or following year. I strongly encourage this kind of independent validation.

Readers familiar with factor analysis may wonder why I suggest choosing a set of candidate items, weighting them equally, and averaging them, in contrast to constructing a scale by using the factor scores as weights. The reason is that using the factor scores maximizes the association between the hypothetical underlying concept and the constructed scale in the sample. That is, it capitalizes on sampling variability. The result is that the correlations between a scale constructed in this way and other variables are likely to be substantially smaller if the same analysis is replicated using a different data set. By contrast, the factor-based scaling procedure, in which the items are equally weighted, is much less subject to cross-sample shrinkage. In this sense factor-based scales are more reliable than are scales constructed using factor scores as weights.
A Worked Example: Religiosity and Abortion Attitudes (Again)
Abortion has become an increasingly salient and emotionally charged issue in recent years. Fundamentalist religious groups (and others) oppose abortion as "murder" while feminists (and others) defend the right of women to control their own bodies. Despite the sharp polarization of opinion regarding abortion, most Americans evidently support the availability of legal abortion under at least some circumstances. Many people find abortion acceptable for medical or therapeutic reasons but not for reasons of personal preference or convenience. Considering the theological underpinning of the "right to life" movement (that a fetus is a person and hence that abortion is tantamount to murder), we might expect strongly religious people to adamantly oppose abortion for personal preference reasons but to be less opposed to abortion for therapeutic reasons, when the "rights" of the fetus must be weighed against the health and safety of the mother. Those who are less religious, by contrast, might be expected to make less of a distinction between the acceptability of abortion for personal preference and therapeutic reasons. If these suppositions are correct, we would expect religiosity to have a weaker effect on attitudes regarding therapeutic abortion than on attitudes regarding abortion for personal preference reasons.

To test this hypothesis, I use data from the 1984 GSS, a representative sample of 1,473 adult Americans. (See downloadable files "ch11.do" and "ch11.log" for estimation details.) I use the 1984 survey because it contains items suitable for constructing a scale of religiosity (discussed later). Specifically, I compare the coefficients in two regression equations:

\[ T = a + b(F) + c(E) \tag{11.5} \]

\[ P = a' + b'(F) + c'(E) \tag{11.6} \]
where T, P, and F are, respectively, scales of the acceptability of abortion for therapeutic reasons, the acceptability of abortion for personal preference reasons, and religiosity (fundamentalism). E is years of school completed, introduced as a control variable because it is known that acceptance of abortion increases with education and that religious fundamentalism is negatively correlated with education in the United States.

The three scales were constructed by factor analyzing items thought to represent the dimension being measured, eliminating items with low factor loadings, converting each item to standard score form, and averaging items. To facilitate interpretation of the regression coefficients, the resulting scales were then transformed so that each had a range of zero (for the lowest level of religiosity and the lowest acceptance of abortion) to one (for the highest level of religiosity and the highest acceptance of abortion).

Candidate items for the scale of religious fundamentalism included the following:

1. ATTEND: How often do you attend religious services? (Range: never . . . several times a week).
2. POSTLIFE: Do you believe there is a life after death? (no, yes).
3. PRAY: About how often do you pray? (Range: never . . . several times a day).
4. RELITEN: Would you call yourself a strong [religion named by respondent in response to question on religious preference] or not a strong [preference]? (not very strong; somewhat strong [volunteered] or don't know or no answer; strong).
5. BIB: Alternative versions of this question were asked of two-thirds and one-third of the sample, respectively:
   I. Which of these statements comes closest to describing your feelings about the Bible?
      a. The Bible is the actual word of God and is to be taken literally, word for word.
      b. The Bible is the inspired word of God but not everything in it should be taken literally, word for word.
      c. The Bible is an ancient book of fables, legends, history, and moral precepts recorded by men.
   II. Here are four statements about the Bible, and I'd like you to tell me which is closest to your own view:
      a. The Bible is God's word and all it says is true.
      b. The Bible was written by men inspired by God, but it contains some human errors.
      c. The Bible is a good book because it was written by wise men, but God had nothing to do with it.
      d. The Bible was written by men who lived so long ago that it is worth little today.
Versions I and II were combined, except that [...] combined with category (c) from version I, to create a new variable, NEWBIB. Before the items were factor analyzed, [...] "don't know" and "no answer" responses [...] the number of cases available [...]. The items were then factor analyzed using iterated principal factoring and varimax rotation. A single dominant factor emerged, which explained 86 percent of the total variance, with loadings after rotation:

ATTEND POSTLIFE PRAY RELITEN NEWBIB
.787 .573 .654 .260
Given the pattem of factor loadingg it appearsfrom simple inspection that a threo_
jili','xf j:ffi:l1"T"1?:H:'::l+fi:,1's!di"" sca.rerhatincrudesennowzwnreiin";;ff#;l:;,;lT;%;.#;;*lt",T ::"#ffi "".".",iab,s\a,
ll1l'. r,li. ::Til:':,]:Til:l;,'l'ff ":T,": nrylv "g4q,,,,'"
,";:lfii,ii:,.',H;3"ffi1[tnl]. ;1;111ag,r'oery,i',ffi r"f;r= with much rower r".i.i#i,r.,u*i, liJ:ilfi i".iG ffi:*:fili:Ti:t1t""s
;'J.lffi"HJTff.n:"*.."##::,r :,..]:x"qhisr,L"oi,e,.u;;::;ff
jtr1H ffiffi [r".T:#'.i,ffi ilT"3#,*:::+m1;r;*:if"f
f,t1ri.Hffif:aft Jfr li*:J:"'ilJi:''.T#:::',i"l: abortion, Tocreatescale,."u.*Uirii
. ine ,"u* it ..,i;i; ,'
';;:#:;:;,f:;#::;:
ffiffi:H:::::Xi#::
I factor analyzed the following items, which ask whether or not "you think it should be possible for a pregnant woman to obtain a legal abortion":

1. ABDEFECT: If there is a strong chance of serious defect in the baby?
2. ABNOMORE: If she is married and does not want any more children?
3. ABHLTH: If the woman's own health is seriously endangered by the pregnancy?
4. ABPOOR: If the family has a very low income and cannot afford any more children?
5. ABRAPE: If she became pregnant as a result of rape?
6. ABSINGLE: If she is not married and does not want to marry the man?
7. ABANY: If the woman wants it for any reason?

In each case the possible responses were "Yes," "No," "Don't know," and "No answer." "Don't know" and "No answer" were coded as intermediate between "Yes" and "No." Although, as indicated previously, I hypothesized that there are distinctive responses to abortion for therapeutic and personal preference reasons, I nonetheless factor analyzed both subsets of items together to confirm empirically that the two sets of items do in fact behave distinctively. Two nontrivial factors were extracted, which together explained 96 percent of the variance in the items. Table 11.2 shows the loadings before rotation.

As is evident, all seven items load strongly on Factor 1, but some loadings are positive and some are negative on Factor 2. The pattern of positive and negative loadings on Factor 2 suggests that these items can be subdivided into two distinct clusters. Table 11.3 shows the result of executing a varimax rotation, a rotation of the axes of the factor matrix that maximizes the distinction between factors.
Table 11.2. Factor Loadings for Abortion Acceptance Items Before Rotation.
Factor 1
ABNOMORE
ABPOOR
Factor 2
*.263
.8 3 1
.183 .412
.869
 .249
Table 11.3. Abortion Factor Loadings After Varimax Rotation.
Factor 1
ABNOMORE
Factor 2
.880
ABPOOR
ABSINGIE
.876
.217
Inspecting these loadings, you see that, as hypothesized, two factors underlie abortion attitudes. ABNOMORE, ABPOOR, ABSINGLE, and ABANY all load strongly on Factor 1 (shown in bold) and weakly on Factor 2, whereas the remaining three items load strongly on Factor 2 (shown in bold) and weakly on Factor 1. These two sets of items correspond to the a priori distinction I made between abortion for personal preference reasons (Factor 1) and abortion for therapeutic reasons (Factor 2).

Figure 11.1 demonstrates that the unrotated and rotated factor structures are simply mathematical transformations of one another and do nothing to change the relationships among the variables. The rotation merely presents the results in a form that makes them more readily interpretable. As noted previously, in the unrotated matrix (solid axes), all items load positively on Factor 1, but on Factor 2 some items have positive loadings and some items have negative loadings. After I rotate the axes 30 degrees counterclockwise (to the dashed lines), all items have positive loadings on both factors, but four (the personal preference reasons) load strongly on the first factor and weakly on the second factor, while three (the "therapeutic" reasons) load weakly on the first factor and strongly on the second factor.

Given these results, two separate scales are warranted. I therefore constructed a scale of acceptance of abortion for personal preference reasons, using the four items loading strongly on Factor 1, and a scale of acceptance of abortion for therapeutic reasons, using the three items loading strongly on Factor 2. In each case the items were converted to standard form and averaged. I computed averages if valid responses were available for at least three of the four personal preference items and at least two of the three therapeutic items. Again, the

Figure 11.1. Loadings of the Seven Abortion Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise. (Solid lines: unrotated axes; dashed lines: rotated axes.)
scales were transformed to range from zero to one, with one indicating high acceptance of abortion.

The second criterion for scale validity is whether the component items all bear approximately the same relationship to the other variables in the analysis. Ideally, one would assess both the zero-order and net relationships between the component items and the dependent variables. Here, however, the dependent variables are the two abortion attitudes scales. Thus I assess the consistency of the relationships simply by inspecting the correlations among each of the components of all three scales plus the remaining independent variable, education. These correlations are shown in downloadable file "ch11.log." All of the components of each scale show consistency with respect to sign and gross similarity with respect to magnitude in their correlations with the remaining variables. Thus I conclude that combining these items into scales as I have done is appropriate.

Table 11.4 shows the means, standard deviations, and correlations among the three scales and years of school completed, and Table 11.5 shows the coefficients estimated for Equations 11.5 and 11.6. Not surprisingly, the mean for the scale of acceptance of therapeutic abortion is much higher than the mean for the scale of acceptance of abortion for personal preference reasons. (Because each scale is calibrated by converting the lowest score in the sample to zero and the highest score in the sample to one, comparison of the means across scales is not, strictly speaking, legitimate. However, the means do indicate where the typical respondent falls relative to the most accepting and least accepting respondents for each category of abortion, and hence can be used to compare the relative acceptance of the two types of abortion.)
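The scale-building steps described in this section (standardize each item, average the items for respondents with enough valid answers, rescale to the zero-to-one range) can be sketched in a few lines. This is only an illustration under stated assumptions: the responses are hypothetical, and the function names and the `min_valid` threshold are placeholders rather than anything taken from the book's downloadable files.

```python
import math

def standardize(values):
    """Convert one item's scores (None = missing) to z-scores."""
    valid = [v for v in values if v is not None]
    mean = sum(valid) / len(valid)
    sd = math.sqrt(sum((v - mean) ** 2 for v in valid) / (len(valid) - 1))
    return [None if v is None else (v - mean) / sd for v in values]

def build_scale(items, min_valid):
    """Average the z-scored items per respondent, then rescale to 0..1."""
    z_items = [standardize(item) for item in items]
    raw = []
    for answers in zip(*z_items):
        valid = [z for z in answers if z is not None]
        # Score a respondent only if enough items were answered.
        raw.append(sum(valid) / len(valid) if len(valid) >= min_valid else None)
    scored = [r for r in raw if r is not None]
    low, high = min(scored), max(scored)
    return [None if r is None else (r - low) / (high - low) for r in raw]

# Hypothetical answers of six respondents to three items (1 = yes, 0 = no).
items = [
    [1, 1, 0, 1, 0, None],
    [1, 0, 0, 1, None, None],
    [1, 1, 0, None, 0, 1],
]
scale = build_scale(items, min_valid=2)
print(scale)
```

The last respondent answered only one item and therefore receives no score, which mirrors the rule of requiring a minimum number of valid responses before averaging.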
As predicted, acceptance of abortion for reasons of personal preference is somewhat more strongly socially structured than is acceptance of abortion for therapeutic reasons. The $R^2$ for the former is .182, compared with .136 for the latter. Moreover, both of the coefficients are substantially larger for the personal preference equation than for the therapeutic equation, indicating that both education and religiosity have a greater effect on attitudes regarding abortion for personal preference reasons than regarding abortion for therapeutic reasons. However, the standardized effect of religiosity is about equally strong for both sets of abortion reasons, whereas the standardized effect of education is much stronger for personal preference abortion.
Seemingly Unrelated Regression
A formal test of whether corresponding coefficients differ significantly in the two equations is available through Zellner's seemingly unrelated regression procedure, implemented in Stata as -sureg-. This procedure simultaneously estimates models containing some or all of the same independent variables but different dependent variables. When the independent variables are identical across models, the coefficients and standard errors are identical to those from separately estimated equations, but -sureg- provides two additional kinds of information: an estimate of the correlation between residuals from each equation and a test of the significance of the difference between corresponding coefficients. In the present case, the correlation between residuals is .38, which tells us that whatever factors other than education and religiosity lead to acceptance of abortion for therapeutic reasons also tend (modestly) to lead to acceptance of abortion for personal preference reasons. The tests of the equality of corresponding coefficients reveal that, as hypothesized, the coefficients for education and religiosity are significantly larger in the personal preference equation than in the therapeutic equation. (See downloadable file "ch11.do" for details on how to implement -sureg-.)
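To make the logic of the cross-equation test concrete, here is a minimal sketch for the special case described above, in which both equations have identical regressors. It is not a reimplementation of -sureg-: the data are simulated stand-ins, the variable names are hypothetical, and the Wald test relies on the standard result that, with identical regressors, the covariance of the stacked coefficients is the Kronecker product of the residual covariance matrix and $(X'X)^{-1}$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Simulated stand-ins for a constant, education, and religiosity (not survey data).
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
shared = rng.normal(size=n)  # omitted cause common to both outcomes
y1 = X @ np.array([0.5, 0.10, -0.20]) + shared + rng.normal(size=n)
y2 = X @ np.array([0.3, 0.30, -0.20]) + shared + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b1 = XtX_inv @ X.T @ y1  # equation-by-equation OLS
b2 = XtX_inv @ X.T @ y2
e1, e2 = y1 - X @ b1, y2 - X @ b2
resid_corr = np.corrcoef(e1, e2)[0, 1]  # correlation between residuals

# Residual covariance matrix across the two equations.
sigma = np.cov(np.vstack([e1, e2]), ddof=X.shape[1])

def equality_test(k):
    """Wald test that coefficient k is the same in both equations."""
    var_diff = (sigma[0, 0] + sigma[1, 1] - 2 * sigma[0, 1]) * XtX_inv[k, k]
    chi2 = (b1[k] - b2[k]) ** 2 / var_diff
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.
    return chi2, p

print(round(resid_corr, 2), equality_test(1), equality_test(2))
```

In the simulation the second coefficient truly differs across equations and the third does not, so only the first test should reject reliably. The positive residual correlation matters here: it shrinks the variance of the difference, which is why separate regressions cannot supply this test.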
Effect-Proportional Scaling
A special kind of scaling problem arises when we have an independent variable that has a nonlinear relationship to the dependent variable of interest. In Chapter Seven I discussed procedures for assessing whether relationships are nonlinear and for representing nonlinear relationships by changing the functional form of equations. One possibility I discussed was to represent nonlinear relationships by converting variables into sets of categories and studying the relationship between category membership and the outcome variable. In this section I describe an extension of categorical representations of variables: effect-proportional scaling, which is available in situations in which the dependent variable has a clear metric. (For an example of a research use of effect-proportional scaling, see Treiman and Terrell [1975].)

Suppose, for example, that we are interested in the relationship between educational attainment and occupational status in a nation with a multitrack school system. We might well expect that in such systems occupational attainment depends not only on the amount of schooling but on the type of schooling completed. How to represent the effect of schooling in a succinct way becomes a difficult problem in such situations. We could, of course, estimate and report the coefficients for a typology of type-by-extent of schooling, but this is likely to require the presentation of many coefficients. An alternative would be to go a step further and scale educational categories in terms of their effect on occupational status.

From a technical point of view, this is very simple. We estimate the relationship between occupational status (measured, say, by the International Socioeconomic Index of occupations [ISEI] [Ganzeboom, de Graaf, and Treiman 1992; Ganzeboom and Treiman 1996]) and a set of dummy variables corresponding to our typology of type-by-extent of schooling, and then we form a new education variable in which each category of the typology is assigned its predicted occupational status.

Doing this maximizes the correlation between educational attainment and occupational status (no other scaling of education would produce a higher correlation, given the same set of categories), and, of course, the correlation is identical to the correlation ratio. Thus the interpretation of the education variable becomes "the highest level of education achieved, calibrated in terms of its average occupational status return." So long as the analyst is candid with the reader that this is what has been done, there can be no objection. The clear advantage of the procedure is that it allows educational attainment to be included succinctly in subsequent analysis and thus permits assessment of how the relationship between educational attainment and occupational status is affected by other factors, and how the relationship differs across subpopulations, for example, by ethnicity.

Here is an example of the construction and use of such a scale. (No log file is provided for this worked example because no new computing techniques are introduced.) In the 1996 Chinese survey analyzed earlier (in Chapters Six, Seven, and Nine; see Appendix A for details on the data set and how to obtain it), education was solicited with a question that included the categories shown in Table 11.6. Although, with the exception of the last two categories, the classification appears to form an ordinal scale of increasing education, it is not evident whether the scale has a monotonic relationship to occupational status. In fact it does not, as can be seen from the means on the ISEI shown in Table 11.6. In particular, vocational and technical middle school graduates tend to achieve substantially higher occupational status than do academic upper middle school graduates who do not go on to university.

I thus created a new education variable in which each category was assigned the mean ISEI score shown in Table 11.6. (A convenient way to do this in Stata is to regress ISEI on the education categories and get predicted values from the regression. The $R^2$ associated with this regression, .372, is, of course, just the square of the correlation ratio that we encountered in Chapter Five, $\eta^2$.) This scale can then be used in other analyses. For example, we might wish to assess the dependence of occupational status on education and father's occupational status for several nations, including China, to assess national similarities and differences in the relative importance of achievement and ascription in occupational status attainment.
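The mechanics of effect-proportional scaling can be sketched as follows. The category labels and ISEI scores are hypothetical and deliberately tiny; they are not the 1996 Chinese data. The sketch assigns each category its mean outcome (which is what the predicted values from a dummy-variable regression would be) and then confirms that the squared correlation between the new scale and the outcome equals the correlation ratio $\eta^2$.

```python
import math
from statistics import mean

# Hypothetical (education category, ISEI score) pairs.
data = [
    ("primary", 20), ("primary", 24),
    ("academic upper middle", 30), ("academic upper middle", 34),
    ("vocational middle", 45), ("vocational middle", 51),
    ("university", 64), ("university", 68),
]

# Step 1: mean outcome within each category.
by_category = {}
for category, isei in data:
    by_category.setdefault(category, []).append(isei)
category_mean = {c: mean(scores) for c, scores in by_category.items()}

# Step 2: the new education variable is the category mean for each case.
scaled_education = [category_mean[c] for c, _ in data]
outcome = [isei for _, isei in data]

def pearson_r(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    ss_a = sum((u - ma) ** 2 for u in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(ss_a * ss_b)

r = pearson_r(scaled_education, outcome)

# Correlation ratio: between-category sum of squares over total sum of squares.
grand_mean = mean(outcome)
eta_squared = (sum((s - grand_mean) ** 2 for s in scaled_education)
               / sum((v - grand_mean) ** 2 for v in outcome))

print(round(r ** 2, 4), round(eta_squared, 4))
```

Because the scale values are the category means, the two printed numbers agree, which is the sense in which no other scaling of the same categories can yield a higher correlation with the outcome.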
ERRORS-IN-VARIABLES REGRESSION

As noted previously, unreliable measurement generally produces weaker measured effects. Thus when variables are measured with differential reliability, the multivariate structure of relationships can be substantially distorted. Because attitude variables often have low
Table 11.6. Mean Score on the ISEI by Level of Education, Chinese Males Age Twenty to Sixty-Nine, 1996.
Level of Education
Illiterate
Mean l S E l 1a.2 ' :' i l .:::
113 " ' r"
Canread
16.0
E
Upper middle (also specialized)
35.5
272
tt
" fi v eb i g " S pec ia l i z ei d n ,c l u d i n g
61.0
111
65.1
65
Imperial degree holder (xiucai, juren)
30.5
[* il
Other
39.0
!t
Total
28.5
G
2,413
reliability, analyses including such variables can often be misleading. A way of correcting this problem, when measures of reliability are available, is to correct correlations for attenuation caused by unreliability. The Stata command -eivreg- (errors-in-variables regression) does this conveniently. The analyst supplies an estimate of the reliability of each variable, and the command makes the adjustment and carries out the regression estimation. (If no estimate is supplied, the variable is assumed to be measured with perfect reliability.)
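The idea behind the correction can be illustrated with a small simulation. This is a sketch of the classical errors-in-variables adjustment, in which the share of each predictor's observed variance attributed to measurement error, (1 − reliability), is removed from the diagonal of the cross-product matrix before solving for the coefficients. It is an assumption that this matches the internals of -eivreg- in every detail, and all data and names below are simulated or hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x2 = rng.normal(size=n)                                   # measured without error
x1 = 0.5 * x2 + rng.normal(scale=np.sqrt(0.75), size=n)   # true score, variance 1
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)              # true coefficients: 1.0, 0.5

reliability = np.array([0.66, 1.0])
# Choose the error variance so that var(true) / var(observed) = .66 for x1.
noise_var = (1 - reliability[0]) / reliability[0]
w1 = x1 + rng.normal(scale=np.sqrt(noise_var), size=n)    # observed, error-laden x1

W = np.column_stack([w1, x2])
W = W - W.mean(axis=0)   # center so that no intercept is needed
yc = y - y.mean()

WtW = W.T @ W
b_ols = np.linalg.solve(WtW, W.T @ yc)                    # attenuated estimates

# Remove the error-variance share from each predictor's sum of squares.
S = np.diag((1 - reliability) * np.diag(WtW))
b_eiv = np.linalg.solve(WtW - S, W.T @ yc)                # corrected estimates

print(np.round(b_ols, 2), np.round(b_eiv, 2))
```

In this setup the uncorrected coefficient of the unreliable predictor is pulled well below its true value of 1.0, while the coefficient of the perfectly measured predictor is pushed above its true value of 0.5. The correction moves both back toward the values used to generate the data, which is the kind of redistribution of effects the text describes.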
To show how this procedure works and what consequences it can have, I here present an analysis of the effect of abortion attitudes and religiosity (the three scales created previously) plus race, region of residence, an interaction between race and region, and the natural log of income on political conservatism.

From the previous analysis we have the reliability of the three scales. I take the reliability of the income measure, .8, from Jencks and others (1979, Table A2.13) and assume that race and region of residence are measured without error. Table 11.7 shows the results of OLS estimation without correcting for unreliability of measurement, and errors-in-variables estimation that does correct for unreliability. Because -eivreg- does not permit correction of the standard errors for clustering and requires aweights (or fweights), I carried out both the conventional and the errors-in-variables regressions with the same specifications.
Table 11.7. Coefficients of a Model of the Determinants of Political Conservatism Estimated by Conventional OLS and Errors-in-Variables Regression, U.S. Adults, 1984 (N = 1,294).
Conventional OLS
s.e.
p
0.692
0.170
.000
0.282
0.091
.OO2
b ReliEosity preference Personal aDonron
o
ErrorsinVariables s.e. P
1.066
o.2s7
0oOi
0.220
0.113
.051
I
.816
0.172
o.425 DI
0.063
0.079
s .e .e .
1.24
1.23
I i
The effect of adjusting for differential reliability is dramatic: the coefficient associated with religiosity increases by 54 percent. In addition, the coefficients associated with acceptance of therapeutic abortion and with income increase slightly, and the coefficient associated with acceptance of abortion for personal preference reasons decreases slightly. These results indicate clearly how the relative effects of variables in a multiple regression can be distorted if the variables are measured with differential reliability, as these are. (Recall that the reliabilities of the religiosity, therapeutic abortion, income, and personal preference abortion measures are, respectively, .66, .78, .80, and .93.)

Note that with one exception, all the coefficients are about what we would expect: political conservatism increases with religiosity, with income, and, for non-Blacks, with Southern residence (although this last effect is only marginally significant) and decreases as acceptance of both kinds of abortion increases. The unexpected result is that acceptance of therapeutic abortion is a much stronger predictor of political conservatism than is acceptance of abortion for personal preference reasons. From the analysis I presented [...]
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen why multiple-item scales are advantageous: they improve reliability of measurement. We have considered two ways to construct such scales that go beyond simple counts of the kind used in previous chapters. In this chapter we focused mainly on factor-based scaling, which provides a means of purging a scale of items that do not reflect the same underlying dimension as the remaining items or that reflect other dimensions in addition to the main dimension. We also considered effect-proportional scaling, which is useful in establishing a metric for a set of categories by scaling the items according to their effect on some criterion variable. Finally, we considered two extensions of OLS regression: errors-in-variables regression, which corrects regression coefficients for attenuation caused by unreliability, something that can alter our substantive conclusions when variables in a model are measured with differential reliability; and seemingly unrelated regression, which provides a means for comparing models with different dependent variables but (at least some of) the same independent variables.
LOG-LINEAR ANALYSIS

WHAT THIS CHAPTER IS ABOUT

Log-linear analysis is a technique for making inferences about the presence of particular relationships in cross-classification tables. The first three chapters of this book were devoted to percentage tables. In those chapters we spent considerable time on cross-tabulations, developing rules of thumb for deciding how large a difference between two percentages had to be before we were willing to take it seriously, how to detect interactions among variables, and so on. Log-linear analysis provides a way of formalizing the analysis of cross-tabulations, permitting an assessment of whether relationships observed in a cross-tabulation constructed from sample data are likely to exist in the population from which the sample is drawn and also providing a way of describing the relationships. In this chapter we first consider how to fit a log-linear model to multiway tables, to get the mechanics straight. We then move on to more parsimonious models for two-way tables, using the study of intergenerational occupational mobility as our main substantive example, although these techniques can be applied in many other contexts as well. Additional expositions of log-linear analysis can be found in Knoke and Burke (1980) and in Powers and Xie (2000, chap. 4). This chapter draws heavily on Powers and Xie.
INTRODUCTION

In one sense the model-fitting aspect of log-linear analysis is nothing more than a generalization of the $\chi^2$ (chi-square) test for the independence of two variables. Recall that in the usual (Pearson) $\chi^2$ test, the observed frequencies in each cell are contrasted with a model of perfect independence, in which the expected frequencies in each cell are simply the product of the marginal frequencies divided by the total number of cases in the table. The size of $\chi^2$ then depends on the extent to which the observed frequencies depart from the frequencies expected from the model of independence.

This approach can be generalized to more complex relationships, albeit with a change in the formula. For a bivariate frequency distribution we can write a general formula for expected cell frequencies:

$$F_{ij} = \eta\,\tau_i^X\,\tau_j^Y\,\tau_{ij}^{XY} \tag{12.1}$$

where $\eta$ (eta) is the geometric mean of the cell frequencies (the geometric mean of $n$ values is the $n$th root of their product); $\tau_i^X$ is the "effect parameter" for the $i$th category of the $X$ variable ($\tau$ is pronounced "tau"); $\tau_j^Y$ is similarly defined; and $\tau_{ij}^{XY}$ is the effect parameter for the "interaction" of the $i$th category of $X$ and the $j$th category of $Y$.
In Log-Linear Analysis "Interaction" Simply Means "Association"

Note that in the log-linear literature the term "interaction" is used for what is called "association" in the older literature on cross-tabulation tables. It is important to recognize that it is not the same as what is called an interaction in both the older tabular literature and in the literature on multiple regression. In those literatures "interaction" refers to the situation in which the relationship between two variables depends on the value of one or more other variables.
The relationship expressed by Equation 12.1 can be shown to hold when the $\tau$'s are defined as functions of odds ratios (see Appendix 12.A). The odds of an observation being in a given category of a variable are just the ratio of the frequency of observations in the category to the frequency of observations not in it. Thus in a class of 20 men and 10 women, the odds of a student in the class being a man are 20/10 = 2:1 ("two to one").

Analyzing the data in Table 12.1, we see that the ratio of the odds of being a letters and science (LS) student, given that one is male, to the odds of being an LS student, given that one is female, is (1/11)/(9/9) = 1/11. So the odds that a man is an LS student are one-eleventh the odds for a woman (and, of course, the odds for a woman are eleven times those for a man). Odds ratios vary around unity; if the odds of being an LS student were the same for males and females, the odds ratio would be 1.0. An odds ratio of less than 1.0 indicates, in this case, that the odds of being an LS student are smaller for males than for females, whereas an odds ratio of greater than 1.0 indicates that the odds of being an LS student are greater for males than for females.
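The odds and odds-ratio arithmetic above is easy to check by hand or in code. The sketch below uses exact fractions. The four cell counts are an assumption: they are inferred from the odds-ratio expression in the text and are consistent with the goodness-of-fit statistics reported later for Table 12.1, and the helper name `odds` is hypothetical.

```python
from fractions import Fraction

def odds(in_category, not_in_category):
    """Odds = frequency in the category / frequency not in it."""
    return Fraction(in_category, not_in_category)

# Class of 20 men and 10 women: odds of being a man.
class_odds_man = odds(20, 10)

# Assumed cell counts for Table 12.1 (program by sex).
counts = {
    ("male", "LS"): 1, ("male", "management"): 11,
    ("female", "LS"): 9, ("female", "management"): 9,
}
odds_ls_male = odds(counts[("male", "LS")], counts[("male", "management")])
odds_ls_female = odds(counts[("female", "LS")], counts[("female", "management")])
odds_ratio = odds_ls_male / odds_ls_female

print(class_odds_man, odds_ls_male, odds_ls_female, odds_ratio)
```

Swapping the roles of the two groups inverts the ratio (1/11 becomes 11), which is why odds ratios are often analyzed on the log scale, where the two directions differ only in sign.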
Table 12.1. Frequency Distribution of Program by Sex in a Graduate Course.

                         Male    Female    Total
Letters and science         1         9       10
Management                 11         9       20
Total                      12        18       30
Now suppose we take the natural log of both sides of Equation 12.1. This gives us

$$\ln(F_{ij}) = \ln(\eta\,\tau_i^X\,\tau_j^Y\,\tau_{ij}^{XY}) = \ln(\eta) + \ln(\tau_i^X) + \ln(\tau_j^Y) + \ln(\tau_{ij}^{XY}) \tag{12.2}$$

which has a log-linear form; that is, the left side of the equation is a linear function of logs of the quantities on the right side of the equation, hence the term log-linear. Equation 12.2 is sometimes (for example by Leo Goodman [1972, 1043], one of the creators of the method) expressed as

$$G_{ij} = \theta + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY} \tag{12.3}$$

where the $\lambda$s (lambdas) are (natural) logs of the $\tau$'s, $\theta$ (theta) is the log of $\eta$, and $G$ is the log of $F$. An alternative notation, used by Powers and Xie (2000, 107), is

$$\ln F_{ij} = \ln(\tau) + \ln(\tau_i^R) + \ln(\tau_j^C) + \ln(\tau_{ij}^{RC}) = \mu + \mu_i^R + \mu_j^C + \mu_{ij}^{RC} \tag{12.4}$$

where $\mu$ (mu) $= \ln(\tau)$, and so on. An even more convenient notation, which we also will use, is [XY], which implies that the model of interest includes the explicitly specified relation and all of the lower order effects. This is sometimes called the fitted marginals notation. Equations 12.1 through 12.4 can easily be generalized to more than two variables, as we will see in the following section.
CHOOSING A PREFERRED MODEL

In Equation 12.1 the observed cell frequencies, $f_{ij}$, are exactly equal to the predicted cell frequencies, $F_{ij}$, and thus are perfectly predicted because all possible effect parameters are present in the model. Hence Equation 12.1 is known as a saturated model.
interesr. ordinariry, il,TJ:iff ,ffi;[ffiT.tff ] t orIinre
:::ilf i?nJ:l.#t*t***l*'{}*tfi*J#i** .j*#:,":",,".diiT"",ill,:&f,f""
","+*:""i:"ffi
:orlrnonis.,"""r.r.'l^iii,fi "_j::m 1li:.i,+.,i"{"r*"'J:.:,U.r;:mf to*t?j?iu l"T::,T:ffi:9;*il1ffiffl,: ,,u,.*", ermretr l:u*nuutv 'to'.it
ltt':,T##'#i*ktT{1ffj,".,:jT;'#1ff"1:i::i iJiIil",'"H 'ffi;ilTf":ln'ffi';l*t :'":ifl :"'ffi l:ff?i:ffi :*'":f#,l";'#ffi Model SetectionBasedon
Goodnessof Fit
approach, usins thedara rrom rabre I2.1.rlrc
.;itill*illfi ffi1'Iffir*"Jrlff*t;,y:.^rflg ;""fi 3.H*i::*::. "#i#i.i:ii!:!ilH:#;ff tr*:1.lt"ii3,i,l,,ifiT:i;;*lhnti".:m
:_; j:T::;"'"il;;#;u':ilff;, r, urown ff;;::;#,:\,iJ,:fil,1:: a,r rV ' F.l Jt I i :t i = t\ ri j )
(12.'
whereF is, as notedpreviouslv. t
jl;i:dj:,f:lTilfl "H fj{fy"::,:"i:;;'::i,:_TH:rjt" "t
.,#,. jtil_til*;1i::1rtr. ,,;mn*+:#1fr I p;411,11i::;iill.x,",:#:f;:t*
ff,?#"ff ;:trflhnff :?T:".'*:,*;::r[iff :l;lury,.o_,r,o q r{ = 1/r{ r{ = 1/ r{ ,f{
=,{{:1/r{{
(12.6t
=11,x,r
Because for the simplest model we estimate only $\eta$, we have three remaining (residual) degrees of freedom. As expected, the fit to Table 12.1 is poor: $L^2 = 10.96$, which implies that the observed distribution would occur by chance only about 1 percent of the time if the cell sizes were equal in the population (precisely, p = .012). So we conclude that this model does not fit the data; that is, we reject the null hypothesis that the cell sizes are equal. (For details on how to estimate such models, see the worked example on anticommunist sentiment later in the chapter and also the downloadable files "ch12_1.do" and "ch12_1.log.")

To be sure, the "population" in this example is problematic because we are presumably studying the characteristics of all individuals enrolled in a particular course, and so we might think of these individuals as constituting the population rather than a sample. However, we might regard the individuals enrolled in the course at any given time as a sample of all possible sets of individuals ever enrolled in the course, and hence generalize from the particular set of observations to what we might expect "for this course over the long run" or for "courses like this." This sort of use of statistical inference is in fact quite common in research practice (see the discussion of the concept of superpopulation in Chapter Sixteen).
$L^2$ Defined

It can be shown that $L^2$ is minus twice the difference
*:"tnim:*:Llrii**:ffi n:mN ":H:iT;?:;t::il::i:if ikelihood estimation and a definitionof the likelihood.)
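The two fit statistics reported for Table 12.1 can be reproduced with the usual likelihood-ratio formula, $L^2 = 2\sum f \ln(f/F)$, where $f$ is an observed and $F$ an expected frequency. The sketch below does this for the equal-cell-size model and for the independence model taken up in the next paragraph. The cell counts are an assumption (they are inferred from the odds ratio given earlier and chosen because they reproduce the reported statistics), and `chi2_sf` is a hypothetical helper giving the upper-tail chi-square probability for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, integer d.f."""
    h = x / 2
    if df % 2 == 0:
        return math.exp(-h) * sum(h ** k / math.factorial(k)
                                  for k in range(df // 2))
    tail = sum(h ** (k - 0.5) / math.gamma(k + 0.5)
               for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * tail

def l_squared(observed, expected):
    """Likelihood-ratio chi-square: 2 * sum of f * ln(f / F)."""
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Assumed counts: male LS, male management, female LS, female management.
observed = [1, 11, 9, 9]
n = sum(observed)

# Model with equal cell sizes: every expected frequency is n / 4 (3 d.f.).
l2_equal = l_squared(observed, [n / 4] * 4)

# Independence model [X][Y]: row total * column total / n (1 d.f.).
row_totals = [observed[0] + observed[1], observed[2] + observed[3]]
col_totals = [observed[0] + observed[2], observed[1] + observed[3]]
expected_indep = [r * c / n for r in row_totals for c in col_totals]
l2_indep = l_squared(observed, expected_indep)

print(round(l2_equal, 2), round(chi2_sf(l2_equal, 3), 3))
print(round(l2_indep, 2), round(chi2_sf(l2_indep, 1), 3))
```

With these counts the two models give $L^2$ values of about 10.96 and 6.35, each with a p-value of about .012, matching the figures quoted in the text.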
Next we might test the possibility that the two variables $X$ and $Y$ are independent, so that the cell frequencies are simply a function of the marginal distributions. We would write this: [X][Y]. In this case we are estimating three parameters, $\eta$, $\tau^X$, and $\tau^Y$, and constraining $\tau^{XY}$ to 1.0, so we have 1 degree of freedom. In this case $L^2 = 6.35$, so once again the fit is poor (p = .012, only coincidentally the same as for the previous model), indicating that there is an association between $X$ and $Y$: we cannot predict the cell frequencies simply from the marginals.

To obtain a good fit in this example, it is necessary to estimate all four parameters, which uses up all degrees of freedom (and hence, as we have noted, ensures a perfect fit). We write this as [XY]. Note that in this exposition we are dealing with hierarchical models, which means that every higher-order relationship implicitly contains all lower-order relationships. Hence [XY] implies [η][X][Y][XY]. We will return to this point later in the chapter.

So far we have done nothing that could not be done with the usual $\chi^2$ test for independence. However, the same procedures apply to cross-tabulations containing more than two variables, and also to polytomies as well as dichotomies. Consider Table 12.2,
Table 12.2. Frequency Distribution of Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies.

                 No Metalworking          Metalworking at Least
                 Stateless    State       Stateless    State

Source: Computed from Murdock and Provost (1973).
a cross-tabulation of level of stratification by level of political integration and level of technology among ninety-two societies.

In the data-dredging approach to log-linear analysis, it is common to posit an initial, or baseline, model of complete independence among the variables in the table: in the present case, the model of no association between technology [T], political integration [P], and stratification [S]. We do this by fitting the model [T][P][S]. For this model, $L^2 = 84.68$, with 7 degrees of freedom; the goodness-of-fit statistics for this model, and several others, are shown in Table 12.3. Clearly, this model does not fit the data (p < .0001). But we will nonetheless make use of it momentarily.

We might next posit an association, or interaction, between level of political integration and degree of stratification and assume that neither of these variables is related to the level of technology. That is, we fit [T][PS] (Model 2 in Table 12.3). This model posits that the observed cell frequencies can be accounted for (within the limits of sampling error) by the univariate distribution of level of technology and the bivariate distribution of the degree of stratification and the level of political integration. Estimating this model yields $L^2 = 41.54$, with 5 d.f.

Although the large $L^2$ tells us that the model does not provide an accurate fit to the data (p < .0001), we might still want to know whether the prediction is improved relative to the baseline model of complete independence. To see this, we subtract one $L^2$ from the other and similarly subtract the degrees of freedom, and then we get the p-value associated with the new $L^2$ and new d.f. It also is common to show the $L^2$ for each of the subsequent models as a proportion of the $L^2$ for the baseline model, $L^2/L^2_B$; to show the index of dissimilarity, $\Delta$, between the observed frequencies and the frequencies expected under the model; and also to show BIC. The differences in these measures can be computed as well; the differences between Models 1 and 2 are shown in Table 12.3. All of these computations provide information on the goodness of fit of the models and the improvement in goodness of fit realized by positing successive elaborations of a model.
Table 12.3. Models of the Relationship Between Technology, Political Integration, and Level of Stratification in Ninety-Two Societies.
Model
BIC
d.f.
L'ILtr
h t ; t:P5l
41.54
30.4
h IPItrsl NU .:,.TPl
60.48
F
irPltTsl
6
h
FsltPsl
m,llTsllPsl
0.60
3
.4Ol
 10.6
.03
5.3
2
.739
8.4
.01
2.5
I'
r!nus (2)
,l  us(8)
2.34
, r :heprobability Io(L, = 43.14,with2 d.f.lt canbeobtained fromthecomplernent of TE :':cabilityreturned bytheStatachi2function. r'r :we reverse thesignaftersubtraction sothata negative B/Cndicates animprovement infit. : the probabilityof l,':  43.2.wfih 2 d.f.,is lessthan .0000,we concludethat IrN'*ir,:_J an ccauseassociationbetween political integration and stratificationsignificantly r[@rr'..s the fit of the model. Similarly, the differencein B1C tells us that the second mni:Ei! much more likely than the first, given the data (althoughneitheris as likely as M .iJated modelbecauseboth BlCs arepositive). ;: canget a quantitativeestimateof the extentof improvementin the fit of a model !'w :e rwo remainingsetsof coefficients.From the ratio of the s. we seethatpositing n i:i\:iation betweenthe degreeof stratificationand the level' of political integration ogru: thelack of fit of the modelto the databy abouthalf relativeto the baselinemodel m ::ulete independence amongthe threevariables. ::a11y. from the rightmostcolumnof Table 12.3we notethat the modelof complete urct:dence misclassifiesabout42 percentof the casesin the table(that is. 42 percent m rr :aseswould haveto shift categoriesfor the expecteddistributionto be identicalto
270
QuantitativeData Analysis:Doing SocialResearch to Testtdeas
the observed distribution (recall the discussion of the Index of Dissimilarity in Chapter Three), whereas the second model misclassifies only 30 percent of the cases.

Because model [T][SP] does not fit the data well, we would evaluate still other models, searching for the most parsimonious model that does fit well. Table 12.3 shows goodness-of-fit statistics for eight models (all the logically possible models except the saturated model and the model that assumes all cells have the same frequency). Continuing to examine the coefficients in Table 12.3, we see that Model 7, [TS][PS], fits the data quite well. This model posits that both the level of technology and the level of political integration are associated with the degree of stratification but that the level of technology and the level of political integration are unrelated, net of their association with stratification. It misclassifies only about 5 percent of the cases in the table and also reduces the baseline $L^2$ by 97 percent (= 100 × (1.0 − 0.03)).

Although Model 8, which posits that each pair of variables is associated, provides an even better fit, it might be argued that it overfits the data. The penultimate model, [TS][PS], fits nearly as well and would be my choice as the final model on the ground of parsimony, especially because the improvement in fit between Model 7 and Model 8 is not significant (2.94 − 0.60 = 2.34; p = .126).

Note that the use of tests of significance in this context is the opposite of their usual role as a decision rule for rejecting a null hypothesis; here we want to decide whether to accept a null hypothesis, that is, a model. Accordingly, we would like to minimize Type II ($\beta$) error (the probability of accepting a null hypothesis when it is false) rather than Type I ($\alpha$) error (the probability of rejecting a null hypothesis when it is true). Unfortunately, there is no direct way to do this, and so we must settle for a computation of Type I error. A useful rule of thumb is to accept a model if $\alpha$ is greater than 0.2. However, the larger the sample size, the smaller $\alpha$ will tend to be, so for very large samples we may wish to accept a model even when $\alpha$ is quite small. As we will see momentarily, BIC offers an alternative and more satisfactory method of model selection.
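The contrast between Model 7 and Model 8 can be checked by hand. What follows is a minimal sketch in Python, not part of the Stata workflow; the function names `chi2_sf` and `lr_contrast_p` are illustrative. It assumes the difference in $L^2$ between two nested models is referred to a chi-square distribution, with 1 degree of freedom assumed for this contrast (which is consistent with the reported p-value).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return total


def lr_contrast_p(l2_restricted, l2_general, df_diff):
    """p-value for the improvement in fit of the more general nested model."""
    return chi2_sf(l2_restricted - l2_general, df_diff)


# Model 7 (L2 = 2.94) against Model 8 (L2 = 0.60); 1 d.f. is an assumption.
print(round(lr_contrast_p(2.94, 0.60, 1), 3))
```

A large p-value here means the extra term in the more general model buys little, which is the sense in which the simpler model is "accepted."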
One additional coefficient is shown in Table 12.3, BIC, the Bayesian Information Criterion (Raftery 1986, 1995a, 1995b), which we first encountered in Chapter Six. Recall BIC's definition:

$$\mathrm{BIC} = -2[\ln(B)]$$

where $B$ is the ratio of the (unknown) probability of some model, $M$, being true to the (unknown) probability of the saturated model being true, given the data. For log-linear models, BIC is estimated by

$$\mathrm{BIC} = L^2 - (\mathrm{d.f.})[\ln(N)]$$

where $L^2$ is the likelihood ratio $\chi^2$ for Model $M$; d.f. is the residual degrees of freedom for Model $M$; and $N$ is the number of cases in the table. When BIC is negative, Model $M$ is preferred to the saturated model. When several models are compared, the model with the most negative BIC is most preferred because it has the greatest likelihood of being true given the data. Here, Model 7 is more likely than Model 8 given the data. Combining the information obtained from the $L^2$ and BIC contrasts of Models 7 and 8, we see that Model 7 is to be preferred.

The real value of BIC is in the comparison of models for very large samples because, when the sample is large, often no model (except, of course, the saturated model) fits the data by conventional standards. When that happens, BIC is of great use in helping us to choose among models. For this reason BIC has become the conventional method for choosing alternative models in log-linear analysis. An additional advantage of BIC, noted in Chapter Six, is that it can be used to compare nonnested models.

Theory-Based Model Selection

A second approach to model selection is to contrast models that represent alternative hypotheses about relationships among variables, that is, to do theory-driven rather than data-dredging model selection. For example, we might ask whether the association between the degree of stratification and the level of political integration can be explained by their mutual dependence on the level of technology. If the answer is yes, we would expect model [TP][TS] to fit the data, because this model implies that the observed frequencies in the table are generated by an association between technology and political integration and an association between technology and stratification in the absence of an association between political integration and stratification. As we see in Table 12.3 (Model 5), this model does not fit the data, because $L^2$ = 21.88 with 4 d.f. (p < .000). Hence we reject the hypothesis.
Parameters
As is shown in Appendix 12.A, the parameters associated with the interaction terms in log-linear models (for example, the interaction parameter in Equation 12.1) can be interpreted to indicate the direction and strength of associations in cross-tabulation tables. Note, however, that the parameters for two-way interactions involving dichotomies are shown relative to geometric means of the expected frequencies. When more than two-way interactions or more than two categories are involved, the interpretation becomes more complex. Moreover, by default Stata uses a "dummy-variable" parameterization. When a dummy-variable parameterization is used, the parameters for two-way interactions give the odds ratios (or log odds) for the specified categories relative to the reference categories.

Because the effect parameters are not very straightforward, most analysts use log-linear analysis to test hypotheses about the presence or absence of particular associations (interactions) in the table but then discuss the table in terms of percentage differences, which are much more familiar to ordinary readers. This is particularly so when the software used to estimate the models shows the parameters in dummy-variable form (that is, as deviations from an omitted category that is set to 0 in the log form or to 1 in the multiplicative form), because coefficients expressed in dummy-variable form are difficult to interpret in the log-linear context.

My recommendation is that you use log-linear modeling when you want to test hypotheses about relationships in cross-tabulation tables because it is an extremely powerful tool for doing this job. However, once you settle on a preferred
TABLE 12.4. Percentage Distribution of Expected Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies (Expe
State
Metalworking at Least Stateless
State
Egalitarian
78.1
33.2
55.
Statusdistinctions only
20.5
31.2
46.2
15.7
1.4
29.6
20.7
41.7
Total
100.0
100.0
100.0
100.0
N
(30.6)
(1s.4)
{12.4)
(33.6
Two or more classes
I
LO
model, I suggest you interpret either the observed distribution or the expected distribution implied by the model. The point of percentaging the expected rather than the observed frequencies is that unsystematic variability is removed; however, you should be sensitive to the possibility that deviations of observed from expected cell frequencies may reflect relationships not adequately captured by the model.

Table 12.4 shows the percentage distribution of level of stratification by level of political integration and level of technology implied by Model 7, which posits an association between the level of technology and the degree of stratification and between the level of political integration and the degree of stratification but not between the level of technology and the level of political integration. Because the model fits well, the distribution of expected percentages closely parallels what we would have found had we percentaged Table 12.2. As we see, within levels of technology, state societies tend to have more complex stratification systems than stateless societies and, within levels of political integration, societies with metalworking technology tend to have more complex stratification systems than societies lacking metalworking technology. (One limitation of this approach is that the marginal frequencies in the expected table generally do not match those in the corresponding table of observations. For a method that recovers the marginal distributions, see Kaufman and Schervish [1986].)
Another Worked Example: Anticommunist Sentiment

The optimal way to carry out log-linear analysis using Stata is to use the -glm- (generalized linear model) command, which permits the estimation of a wide variety of linear models. Indeed, as should be evident from Equation 12.2, log-linear analysis is just a specific case of the familiar linear model, in which the dependent variable is the natural log of the number of cases in a cell of a multiway cross-tabulation and the independent
TABLE 12.5. Frequency Distribution of Whether "A Communist Should Be Allowed to Speak in Your Community" by Schooling, Region, and Age, U.S. Adults, 1977 (N = 1,478).

                                               Communist Speaker (C)
Age (A)          Region (R)    Schooling (S)    Allow    Not Allow
39 or younger    South         No college          72         71
                               College             55         22
                 Non-South     No college         161         92
                               College            157         25
40 or older      South         No college          65        162
                               College             23         23
                 Non-South     No college         197        214
                               College            107         32
variables are dummy variables for the categories that make up the variables included in the cross-tabulation. Although a user-written Stata -ado- file (Judson 1992, 1993) can be used to do hierarchical log-linear analysis, the advantage of using -glm- is twofold: it offers the linear model framework, and all of the Stata post-estimation commands are available. To show how to carry out log-linear analysis using the -glm- command, I reanalyze Table 10 from Knoke and Burke (1980); a comparison of my results with theirs may provide additional insight.

Suppose we are interested in the relationship between age (thirty-nine and younger versus forty and older), region of residence (South versus non-South), schooling (some college versus high school or less), and tolerance for civil liberties, as measured by a question on whether a communist should be allowed to give a speech in your community. The four-way frequency distribution of these variables, based on data from the 1977 GSS, is shown in Table 12.5.

Analysis Strategy

The first step in carrying out a log-linear analysis of Table 12.5 is to specify a baseline model. Because my interest is in the effect of age, region, and schooling on tolerance of communists, a reasonable baseline model is [C][ARS]. That is, I fit the
three-variable relationship among age, region, and schooling exactly, but I assume that none of these variables is related to tolerance of communist speakers. As a second step, I posit [CA][CR][CS][ARS]. That is, I continue to fit the three-variable relationship exactly and in addition posit effects of each of the independent variables on tolerance of communist speakers ("interactions" between each of age, region, and schooling, respectively, and tolerance of communist speakers). If my second model yields a good fit, I then try to simplify the model by omitting specific two-variable interactions. If my second model does not yield a good fit, I explore more complicated models by fitting various three-variable interactions involving tolerance of communists plus pairs of the independent variables.

Implementation

To carry out the analysis using -glm- in Stata, I first read in the contents of Table 12.5 as a data set, where each cell is an observation and the variables are the response categories for each variable plus an additional variable that gives the count in each cell. I thus create a data set, call it "knoke.raw":

    1 1 1 1  72
    1 1 1 2  71
    1 1 2 1  55
    1 1 2 2  22
    1 2 1 1 161
    1 2 1 2  92
    1 2 2 1 157
    1 2 2 2  25
    2 1 1 1  65
    2 1 1 2 162
    2 1 2 1  23
    2 1 2 2  23
    2 2 1 1 197
    2 2 1 2 214
    2 2 2 1 107
    2 2 2 2  32

and then read the data into Stata with the following command:

    infile a r s c count using knoke.raw, clear
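As a cross-check outside Stata, the same sixteen cells can be fit by iterative proportional fitting, which reproduces the fitted counts of a hierarchical log-linear model from the margins named in its bracket notation. The sketch below is illustrative only: the function names are hypothetical, and the meaning of the 1/2 codes (1 = younger, South, no college, allow) is an assumption based on the order of rows in Table 12.5.

```python
import math

# Cell counts keyed by (a, r, s, c), using the same 1/2 codes as knoke.raw.
# Assumed coding: a 1 = younger; r 1 = South; s 1 = no college; c 1 = allow.
COUNTS = {
    (1, 1, 1, 1): 72,  (1, 1, 1, 2): 71,  (1, 1, 2, 1): 55,  (1, 1, 2, 2): 22,
    (1, 2, 1, 1): 161, (1, 2, 1, 2): 92,  (1, 2, 2, 1): 157, (1, 2, 2, 2): 25,
    (2, 1, 1, 1): 65,  (2, 1, 1, 2): 162, (2, 1, 2, 1): 23,  (2, 1, 2, 2): 23,
    (2, 2, 1, 1): 197, (2, 2, 1, 2): 214, (2, 2, 2, 1): 107, (2, 2, 2, 2): 32,
}
A, R, S, C = 0, 1, 2, 3  # position of each variable within a cell key


def margin_totals(table, margin):
    """Sum a table over all variables not listed in `margin`."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[v] for v in margin)
        totals[key] = totals.get(key, 0.0) + value
    return totals


def ipf(observed, margins, max_iter=1000, tol=1e-10):
    """Fitted counts for the hierarchical model defined by `margins`."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(max_iter):
        largest_change = 0.0
        for margin in margins:
            obs = margin_totals(observed, margin)
            fit = margin_totals(fitted, margin)
            for cell in fitted:
                key = tuple(cell[v] for v in margin)
                new_value = fitted[cell] * obs[key] / fit[key]
                largest_change = max(largest_change,
                                     abs(new_value - fitted[cell]))
                fitted[cell] = new_value
        if largest_change < tol:
            break
    return fitted


def l2(observed, fitted):
    """Likelihood-ratio chi-square comparing observed and fitted counts."""
    return 2.0 * sum(n * math.log(n / fitted[cell])
                     for cell, n in observed.items() if n > 0)


baseline = ipf(COUNTS, [(C,), (A, R, S)])                  # [C][ARS]
model8 = ipf(COUNTS, [(A, R, S), (A, C), (R, C), (S, C)])  # [ARS][AC][RC][SC]
print(round(l2(COUNTS, baseline), 2), round(l2(COUNTS, model8), 2))
```

Percentaging `model8` within each age, region, and schooling combination gives expected percentages of the kind the chapter tabulates; if the counts above are transcribed correctly, the printed $L^2$ values should agree with those produced by -glm-.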
Recall that the baseline model [C][ARS] is a shorthand way of representing the model [C][A][R][S][AR][AS][RS][ARS]. Thus, I need to specify each of the terms in the model. Because the Stata command to create product terms for categorical variables, -xi-, does not permit more than two-way products, I take advantage of a user-written -ado- command, -desmat- (Hendrickx 1999, 2000, 2001a, 2001b), to specify the required variables. (See the downloadable files "ch12_1.do" and "ch12_1.log" for details.) Also, because -glm- does not provide all the coefficients shown in Table 12.3 and produces an incorrect estimate of BIC (given the way I have specified the problem, -glm- counts as cases the number of cells in the table rather than the number of people in the sample), I have
written a small routine, named for "goodness of fit," and a terse version of it, to compute these statistics; both are included in the downloadable file for this chapter.

As with other Stata estimation commands, -glm- works with options. Because it can handle many kinds of linear models, you must specify which model is wanted: specifying a Poisson distribution for the dependent variable yields a log-linear model.

The -glm- command generates the coefficients shown in the first line of Table 12.6 (Model 1). I then repeat the process for a model that allows each of the independent variables to be related to C but not to interact with one another in their relation to C, that is, [ARS][AC][RC][SC]. These commands generate the coefficients shown on the bottom line of Table 12.6 (Model 8). Clearly, this model fits the data well by all criteria, indeed so well as to suggest that a simpler model might also fit. I therefore estimate the models that yield the remaining coefficients in the table. Inspecting these statistics, we see that none of these models fits the data adequately. Given this, I settle on [ARS][AC][RC][SC] as my preferred model. Evidently, age, region,
TABLE 12.6. Goodness-of-Fit Statistics for Log-Linear Models of the Associations Among Whether a Communist Should Be Allowed to Speak in Your Community, Age, Region, and Education, U.S. Adults, 1977.
"a"f,r, Lz
d.f.
BIC
L'ILtr
A
N1 r:;q r t
15_1 :]:;S]IA C]
o/t
1
.69
10.7
a_2
B :_:,sllRcl c::lll(at
87.75
6
.000
44.0
.44
; \)lrALt'KLt
84.j2
5
.000
48.2
.42
a : ; s l l A cl[5C]
48.69
5
.000
12.2
44.74
5
.oo0
4.2
.22
1.7
re.:lslJAcl[RC][sC] 2.92
4
.571
_26.3
01
1.5
r .:xsllRc][sc]
7.8
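The BIC values in Table 12.6 can be recomputed from $L^2$, the residual degrees of freedom, and the sample size. The following minimal Python sketch (the function name `bic` is illustrative) uses two rows whose values are legible in the table, together with N = 1,478 from Table 12.5. The last line shows what happens when the 16 cells, rather than the 1,478 people, are counted as cases, which is the problem with the BIC reported by -glm- that is described in the text.

```python
import math


def bic(l2, df, n):
    """BIC of a log-linear model relative to the saturated model."""
    return l2 - df * math.log(n)


N_PEOPLE = 1478  # sample size reported for Table 12.5
N_CELLS = 16     # number of cells in the 2 x 2 x 2 x 2 table

print(round(bic(2.92, 4, N_PEOPLE), 1))   # Model 8: -26.3
print(round(bic(87.75, 6, N_PEOPLE), 1))  # a 6-d.f. model: 44.0
print(round(bic(2.92, 4, N_CELLS), 1))    # cells miscounted as cases
```

Because the penalty term grows with ln(N), using the number of cells understates the penalty for each degree of freedom used and makes every model look less attractive relative to the saturated model than it should.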
TABLE 12.7. Expected Percentage (from Model 8) Agreeing That "A Communist Should Be Allowed to Speak in Your Community" by Education, Age, and Region, U.S. Adults, 1977.
Age
Region
No College
College
::;;; : :::ri:I;::;: t.::.:..
::t*t*i (253)
:::
(182) 567 (46',)
47.8 (411)
74.3
(13s)
Note: Cell frequencies are shown in parentheses.
and education all affect attitudes regarding communist speakers. To see what these effects are, I percentage the table of frequencies predicted by this model. (To see how to do this, consult the downloadable files "ch12_1.do" and "ch12_1.log.") These percentages are shown in Table 12.7. The table clearly shows that, controlling for each of the other factors, those who are better educated, younger, and non-Southern are more likely to support the right of a communist to give a speech. In each comparison, the percentage differences always go in the same direction and are quite substantial.

The attitudes reported here are from thirty years ago, during the height of the Cold War. It would be of interest to determine whether the same pattern holds today. To do this within a log-linear framework you would need to construct a second data set, based on recent data (for example, the 2006 GSS), to append the second data set to the first, to add an additional variable (T for "time"), and then to assess whether it is necessary to posit an effect of time (or of interactions between time and any of the two-variable associations) to adequately represent the data. That is, you would estimate [ARS][AC][RC][SC], [ARS][AC][RC][SC][T], and [ARS][ACT][RCT][SCT], and perhaps some intermediate models, and compare their goodness of fit. If none of the more elaborate models produced a better fit than [ARS][AC][RC][SC], which is just Model 8 replicated for the pooled data, you would conclude that attitudes regarding the rights of communists have not changed between 1977 and 2006. If [ARS][AC][RC][SC][T] emerged as the preferred model, you would conclude that there had been an across-the-board change (presumably an increase) in support for the civil liberties of communists. If [ARS][ACT][RCT][SCT] emerged as the preferred model, you would conclude that the structure of the relationships between age, region, and education, respectively, and support for the civil rights of communists
changed between
1977 and 200 ::r;tude that the wourd strir;,#;:xl;, Tffi::ffi1T:"#ffiffi,:f,Tr""ffii;Jou
Doing Log-Linear Analysis with Polytomous Variables
orcommunists, ff:i",''**r* ;rn$.#*l ri'hts ffjiifi :'"rTtr{.il".Hffi ;#?:;'"":T"i"H1?ffi
L";;Tffi;[*i:ff;;#:1rr".'o*"u".i',rfi
::
:.*fi ,",ilir:?ffi ilHi"rHt Fir:*,##fifu #:ir;ilffi rnc ;"r"^n* il..'.l;fiiliiiiiJ;'l.l#::y lllociatign if':T'ili"Jll;
race, and membership are dichotomous variables, but education is a three-category variable. It follows that we create two dummy variables, s2 (= 1 for high school graduates; = 0 otherwise) and s3 (= 1 for those with some college; = 0 otherwise), with those lacking high school graduation as the omitted category. Suppose we are interested in
a modelin whichrace,
educarion, and exampr"'" aono,"u." *3:;itll,"lt;fff';""t::?.iJ$il theprevious abour the membership re.cr ion amongth'e;;,
and hence
.*v,nrrysrrv_i,r jd;ffi ",:jiilffi ',}iil.il*1fi'X"rl'":",1il11#*"# ;;;,1:,", I .:,nd
I fit this model with the -glm- command:

    glm count r s2 s3 m rs2 rs3 rm s2m s3m rs2m rs3m v vr vs2 vs3 vm, family(poisson)

where each of the compound variables is a product term, for example, rs2 = r*s2 (see the downloadable files for this chapter).
r'o,i;,,""r,jl ".,;.,',o."o"" ffij'l*#:;:il;l"fiii;;:?" too"p'"rulv rur provides u.i"rii"i".l *"r,i"i"1,*,?J:Tffii'_'#lJil'jffi""#:i'*1. . t9 LogLinear Analysis with tndividual_Level Data
,r:'T,lixlf ffi;;ffi ililtff il#"#$:i:'":1tti*i,:ilffi t; [email protected] lisstanins rrom ;,;;ff;#Ti{*"*
*ljH ;:l$fff,"il1'nJ"_:y, ::,"::. [email protected]:nd.(Downtoadable file,,chlr_t.do,, shows ,rr"J"iarrl*""irJ"':in,r_r.,on.,,., MONIOUSMODETS r.:r we lave dealtwith _"0"r. *1:r.r;ri
global associarion. or absenceof global
n"_.,.".._.,,i.1,d :;,i";,ffj;1i..?3i**1,":r arhypotheses rike totesr regarding ,n" ":1,:r",.brt.n, ,,"_,",i"'_"lt :er tables can be described ,7,.rii[ll?J,l,,llili"l5;,,::] by relatlvely simple models that generate the observed
TABLE 12.8. Frequency Distribution of Voting by Race, Education, and Voluntary Association Membership.
Didn't Vote
Race
n2 1. t
White
6
12
18 6C
Oneor more
24
lC
Source: Adapted from Knoke and Burke (1980, Table 3).
pattern of frequencies in the table. The development of such models to describe patterns of intergenerational occupational mobility has been a lively enterprise over the past thirty years or so, but the formal models developed in this context have applications far beyond the study of social mobility (for example, Radelet and Pierce 1985; Schwartz and Mare 2005; Roberts and Chick 2007; Domanski 2008). Still, it is convenient to illustrate these models in the context of mobility analysis. (See downloadable files "ch12_2.do" and "ch12_2.log" for details on the Stata procedures used to estimate the models in the remainder of the chapter.)

It is helpful to begin by deriving a general expression for log odds ratios. Recall Equation 12.4, which gives the natural log of expected frequencies for a two-variable table as a function of a set of $\mu$ parameters. From Equation 12.4 we can write an expression for the log odds ratio of the expected frequencies for cells formed from any pair of rows ($i$ and $i'$) and columns ($j$ and $j'$) in a two-variable table:

$$
\begin{aligned}
\log\theta &= \log\frac{F_{ij}/F_{ij'}}{F_{i'j}/F_{i'j'}} = \log\frac{F_{ij}F_{i'j'}}{F_{ij'}F_{i'j}} \\
&= \log F_{ij} + \log F_{i'j'} - \log F_{ij'} - \log F_{i'j} \\
&= (\mu + \mu_i^R + \mu_j^C + \mu_{ij}^{RC}) + (\mu + \mu_{i'}^R + \mu_{j'}^C + \mu_{i'j'}^{RC}) \\
&\quad - (\mu + \mu_i^R + \mu_{j'}^C + \mu_{ij'}^{RC}) - (\mu + \mu_{i'}^R + \mu_j^C + \mu_{i'j}^{RC}) \\
&= \mu_{ij}^{RC} + \mu_{i'j'}^{RC} - \mu_{ij'}^{RC} - \mu_{i'j}^{RC}
\end{aligned}
\tag{12.9}
$$
When dummy-variable coding is used, as in Stata's -glm- command, and $i'$ and $j'$ are the reference categories, the right side of Equation 12.9 simplifies to $\mu_{ij}^{RC}$, which makes clear that the interaction parameters represent the log odds ratios for each cell relative to the omitted categories (ordinarily the first row and first column).

Note that to uniquely identify the coefficients, it is necessary to impose constraints. Two different constraints, or "normalizations," are typically used. One is effect coding (used in Equation 12.6 and Appendix 12.A), which expresses coefficients as deviations from the grand total by requiring that the log-form coefficients for each variable sum to zero. The other constraint, dummy-variable coding, codes one category (in Stata, the first category) of each variable as zero.

In the fully saturated model there is a unique coefficient for each cell of the table except, with dummy-variable coding, the cells in the first row and first column. This model can be represented by the following design matrix (for a seven-by-seven table):

    1  1  1  1  1  1  1
    1  2  3  4  5  6  7
    1  8  9 10 11 12 13
    1 14 15 16 17 18 19     = full_dm
    1 20 21 22 23 24 25
    1 26 27 28 29 30 31
    1 32 33 34 35 36 37

Note that a design matrix is simply a variable, with one value per cell, that imposes equality constraints on some subset of cells: all cells with the same value are constrained to have equal coefficients. This design matrix specifies that all the coefficients for the first row and first column are equal; in fact, they are (implicitly) zero by virtue of the dummy-variable coding. None of the remaining coefficients is constrained to be equal. This model uses all the available information, and the observed counts in each table are fit exactly.

Note that in Stata's -glm- command, the specification

    xi: glm count i.X i.Y i.full_dm, family(poisson)
ffi lf ,]i,;f;.';::T.T:rrrT"ibutionoroccuparionbvFatherr Respondentt
Prof. cadre cler. iJes
in 1996 Ser.
117
Man.
810
Agric.
2,765
produces results identical to the usual way of specifying the saturated model:

    xi: glm count i.X*i.Y, family(poisson)

That is, -glm- creates a design matrix like that of "full_dm" when the interaction is specified.
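The rule behind the saturated design matrix is simple enough to generate mechanically. The sketch below is a hedged illustration in Python rather than Stata; the function name `full_design_matrix` is hypothetical. It assigns the code 1 to every cell in the first row and first column and a distinct code to every other cell, reading across rows.

```python
def full_design_matrix(k):
    """Design matrix for the saturated model of a k-by-k table.

    Cells in the first row and first column share code 1 (the reference
    cells under dummy-variable coding); every other cell gets its own code.
    """
    dm = [[1] * k for _ in range(k)]
    code = 2
    for i in range(1, k):
        for j in range(1, k):
            dm[i][j] = code
            code += 1
    return dm


for row in full_design_matrix(7):
    print(" ".join(f"{value:2d}" for value in row))
```

Flattened row by row, the result is a single variable with one value per cell, which is how a design matrix would be attached to a data set that has one observation per cell.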
,,iil:,T::: yifi:r:il::;i{_j_", ffi a; :.{,i# fi :T}j,T,,: fi *o "?, ^ oo.','lnt"o',"" ; il:::"'ffi':J oio.no;^ ;"i*t:'ff o,. il::::i"t.," women ^i,,.
*.** women roincrease ro I havepooledmen increase ,f,. rhe ."ln","""#, *,0r. L?l juffi;":,fii:"jllf,iJffi:::,..::.# ,TLifll. ,reparatety. s,.iar
two.wayrabre:tharrr,"*"."_*uuini""ffi ll,: ;'i;3#:r::,il1#:',ily?":."j:::::l':: orathrceway ,ab,e i* tt"!:ibrTiiy f::i:;';?f :i,THtr ii!:X":i;yffi j i.'io8i,i '*"iJ:,ffi r,i'^." ro,esr "1,::::1r:1 i",r, the
nrsr condi*".,.il;, ff 1  ,#;.1ff;,ji;liir,l;a,. ""il1,J:fi ttri:L'Ttli:l il;;ffi ,fr ner o:h
ilflruffiff
,''r.,",,"iJ,.l and women {rhr ?olr,"y,oun.,"" G
nand R=;;;"oJ;l'":#ffi f :,nee .il"l.:H:nm: ;# l':"i?:.rufi ffi,rffi ,:i,ri;'f lffi*,l,,TtH1ffi
.#.iiff +*:f 11X11*"
is only marginally significant. Given the relatively large size of the sample, I am inclined to focus on the BIC rather than the p-value and conclude that the first condition is satisfied.

To test the second condition, I contrast a model (call this Model B) that omits the interaction between sex and father's occupation, that is, [SR][FR], against Model A. The substantive argument for this is that in China, where almost all women are in the labor force, we should expect no difference in the distribution of father's occupation for employed men and women. To contrast the two models, I take the difference in the $L^2$ and the difference in the degrees of freedom to get the p-value for the improvement of goodness of fit resulting from the addition of [SF] and also get the difference in BIC values. Although the fit of Model A is significantly better by classical standards (p = .019; $L^2_B - L^2_A$ = 67.18 − 52.03 = 15.15; d.f.$_B$ − d.f.$_A$ = 42 − 36 = 6), Model B is more likely given the data (BIC$_B$ − BIC$_A$ = −285.9 − [−250.6] = −35.3). Again, I am inclined to put more weight on the BIC difference and conclude that the second condition is satisfied. Thus I am willing to pool men and women for the subsequent analysis, which effectively doubles the sample size.

Table 12.10 shows the coefficients for the saturated model (see "ch12_2.do" to see how these coefficients were computed using Stata). As we have seen, these coefficients are not readily interpreted directly. However, in the present case it may be of interest to contrast particular cells in the table. For example, we might ask about the relative chances of the child of an agricultural worker becoming an agricultural worker instead of a manual worker compared to the corresponding odds for the child of a manual worker. From Equation 12.9 it is evident that the log odds ratio can be computed as

$$\log\theta = \mu_{77}^{RC} + \mu_{66}^{RC} - \mu_{76}^{RC} - \mu_{67}^{RC} = 2.756 + 1.567 - 1.088 - .801 = 2.434 \tag{12.10}$$

which implies that the relative odds are 11.40 (= $e^{2.434}$); that is, the children of agricultural workers are more than eleven times more likely to become agricultural workers themselves, rather than becoming manual workers, than are the children of manual workers. Similarly, the odds that the child of a professional will become a professional instead of becoming a cadre, compared to the corresponding odds for a child of a cadre, are

$$\log\theta = \mu_{11}^{RC} + \mu_{22}^{RC} - \mu_{12}^{RC} - \mu_{21}^{RC} = 0 + .627 - 0 - 0 = .627 \tag{12.11}$$

which implies that the relative odds are 1.87 (= $e^{0.627}$). Clearly, in China (as elsewhere) the "inheritance" of farm occupations relative to inflow from the children of manual workers is much stronger than the inheritance of professional occupations relative to inflow from the children of cadres.
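The two contrasts just described are direct applications of Equation 12.9 and can be reproduced in a few lines. This is a sketch under stated assumptions: the function name `log_odds_ratio` is illustrative; categories are numbered 1 (professional) through 7 (agricultural), with manual workers as 6; the assignment of the two subtracted values to cells (7,6) and (6,7) is assumed (the sum is unaffected); and the fourth value is taken as .801 because that reproduces the reported total of 2.434.

```python
import math


def log_odds_ratio(mu, i, i_ref, j, j_ref):
    """Equation 12.9: log odds ratio from interaction parameters.

    `mu` maps (row, column) pairs to interaction parameters.
    """
    return (mu[(i, j)] + mu[(i_ref, j_ref)]
            - mu[(i, j_ref)] - mu[(i_ref, j)])


# Farm versus manual (values quoted in Equation 12.10).
farm_manual = {(7, 7): 2.756, (6, 6): 1.567, (7, 6): 1.088, (6, 7): 0.801}
# Professional versus cadre (values quoted in Equation 12.11); parameters in
# the first row and first column are zero under dummy-variable coding.
prof_cadre = {(1, 1): 0.0, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 0.627}

farm = log_odds_ratio(farm_manual, 7, 6, 7, 6)
prof = log_odds_ratio(prof_cadre, 1, 2, 1, 2)
print(round(farm, 3), round(math.exp(farm), 2))  # 2.434 11.4
print(round(prof, 3), round(math.exp(prof), 2))  # 0.627 1.87
```

Exponentiating converts each log odds ratio into the relative odds discussed in the text.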
Topological, or Levels, Models

Having shown how to interpret the interaction coefficients, I next address whether the table can be simplified. In particular, given the lack of differentiation between sales and
TABLE 12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9.
Father's Occupation When R Age 14
Prof.
Cadre
Respondent,sOccupation in 1996 Clex Sales Ser.
1.213
0.169
0.054 15qq1
1.489
6. Manual workers
1.595
Man.
Agric.
,0.100
o.341
0.384
0.058 0.607
service workers in the Chinese economy, I suspect that these two categories might reasonably be collapsed
c.u.invorui'g.iieil;#o;;iKTliff ,"'ffiTili:r*;ffi ,*:f,:";#. 1 111111 ::54456 , 
6
:!2814v15rc :, : ? ! 1rt td
t z z 2 3 2 4 )q 2 s2 6
9
9
14
14
l9
,g
l0
ti = ss_dm
1 5 i; ZO
21
16.06,.with 11drbecause oryrwentynve orrhe
ffi:nT:'#ffif;;"::ii:i.,r
d^:53),';;'";ffi ;:ni_",,,1T#il;";#:lt3i;fi *:3,:l?;"J,"_? asasevenbvseven rable. r,il.;;';il;e
ffiT;:$:i^lell
*"t*t;H#;l':f
subsequent anarysis
ff i:::::.:":llTpyr:*'voushourdkeepinmindthis you are trying to decide .'"t
"t,Tfl Eegones ofa tabl". Th" o.o".d.tn"never
to
; J'#':"':".?1T"':::*ffi :"3'lTii: Tf,:Ti:,'#"'fi,'ff:* SliT;i:;" "tt"t
"otffi
ceilsofa tabre ashaving *#?Ji,,tix1?ill;:}f,ffjt:iu":.:f panicurar identicar
m.*".pL.,',""ir"#;'il;?,1'r"r.j:;:ff :.T:lTH,:;,,[:n".J*r""ir"]ir, QnsilndependenceModels
*1'#::i
;rT:T;:,ftlf#j
if georr.e areabre tofreethemserves fromthesociar
u:n*::ll.* *lhlt""lxiliiT:""Ff,1,;:i*t..';i:t1:?fJ:ffi
(onthe hpothesis couuo,"a .i*_oulJ*.,11!lil'o."ifi [Uffi:Hffi:1""}.;fffi1ff:
diagonal cells of the table but otherwise forces all interaction parameters to be identical:

    2 1 1 1 1 1
    1 3 1 1 1 1
    1 1 4 1 1 1     = diag_dm
    1 1 1 5 1 1
    1 1 1 1 6 1
    1 1 1 1 1 7
As we can see from the first row of the second panel of Table 12.11, this model is a huge improvement over the independence model, which is the baseline model in Table 12.11. Although it does not fit by classical standards, it is more likely than the saturated model and misclassifies about 2 percent of the cases. Still, other models might fit even better.
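The diagonal design matrix follows an even simpler rule than the saturated one: every off-diagonal cell shares the code 1, and each diagonal cell receives its own code. A brief illustrative sketch in Python (the function name `diagonal_dm` is hypothetical):

```python
def diagonal_dm(k):
    """Design matrix that fits each diagonal cell of a k-by-k table exactly.

    Off-diagonal cells share code 1; diagonal cell i gets code i + 1
    (counting categories from 1).
    """
    return [[i + 2 if i == j else 1 for j in range(k)] for i in range(k)]


for row in diagonal_dm(6):
    print(" ".join(str(value) for value in row))
```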
TABLE 12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
.
B,c
L'?lLl
.000
869
1.000
': fi
000
109
054
:i
nj 2
.58/
' :w '
L'
d.f.
p
1080
25 20
58.8 oJ+
14
451
24
.000
249
.418
urban hukou Line!rbylinear,
157
24
000
45.2
145
:g
Linearbylinear, lSElr urbanhukou
150
23
.000
* 43.8
.139
I !i
324
16
.000
190
,300
^^n
14
Row_andcolumne{fectsll (RC)
^^n
:
6
.098
 106
.050
,;
.020
'_:
117
.432
' :
 11a
.031
':
 14
Diagonalcellsfitted exactly
.000
Quasiiridependence
.016
62.4
Quasisymmetry
21.a
10
Crossinqs
)o I
16
n)t
Uniformassociation
34.5
18
.011
l)tl Llnearoylrnear,
33.7
18
A1A
urbanhukou Linearbylinear,
37.2
18
.005
114
.034
.Linearbylinear, t) + uroannuKou
33.7
17
n6q
 10q
,031
RowandcolumneffectsI
10.3
10
.415
9.0
10
Row and columneffectsll (Rc)
1nq
73.4
_ ' a
Quasi-Symmetry Models

An important issue in social mobility research is whether, net of any shift in the marginals, the relative odds of upward and downward mobility between corresponding categories are symmetrical. The following design matrix specifies this model for the six-by-six table:

    2  1  1  1  1  1
    1  3  8  9 10 11
    1  8  4 12 13 14     = qs_dm
    1  9 12  5 15 16
    1 10 13 15  6 17
    1 11 14 16 17  7
As we see in Table 12.11, this model fits slightly better than the quasi-independence model by the likelihood ratio standard but not nearly so well by the BIC standard.
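The quasi-symmetry design matrix can be generated from its defining property: cell (i, j) and cell (j, i) carry the same code. The sketch below is illustrative (the function name `quasi_symmetry_dm` is hypothetical) and assumes, as in the matrix above, that the first row and first column are left at the reference code 1 apart from the diagonal cell.

```python
def quasi_symmetry_dm(k):
    """Design matrix imposing symmetric interactions on a k-by-k table."""
    dm = [[1] * k for _ in range(k)]
    for i in range(k):
        dm[i][i] = i + 2            # each diagonal cell gets its own code
    code = k + 2
    for i in range(1, k):
        for j in range(i + 1, k):
            dm[i][j] = dm[j][i] = code  # mirror-image cells share a code
            code += 1
    return dm


for row in quasi_symmetry_dm(6):
    print(" ".join(f"{value:2d}" for value in row))
```

A quick way to confirm the symmetry constraint is to check that the matrix equals its own transpose.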
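Crossings Models

Suppose we were to take the occupational categories in our six-by-six table as representing social classes, with boundaries that constitute barriers to mobility. Suppose further that, in an analogy to movement across physical space, it is necessary to "cross" each barrier between adjacent classes to achieve mobility between nonadjacent classes. We can represent this model (following Powers and Xie 2000, 117) as

$$F_{ij} = \tau\,\tau_i^R\,\tau_j^C\,\nu_{ij}^{RC} \tag{12.12}$$

where

$$
\nu_{ij}^{RC} =
\begin{cases}
\prod_{u=j}^{i-1}\nu_u & \text{for } i > j \\
\prod_{u=i}^{j-1}\nu_u & \text{for } i < j \\
\delta_i & \text{for } i = j
\end{cases}
$$

This specification implies the following interaction parameters for the cells of the six-by-six table (with the diagonal cells fitted exactly):

$$
\begin{pmatrix}
\delta_1 & \nu_1 & \nu_1\nu_2 & \nu_1\nu_2\nu_3 & \nu_1\nu_2\nu_3\nu_4 & \nu_1\nu_2\nu_3\nu_4\nu_5 \\
\nu_1 & \delta_2 & \nu_2 & \nu_2\nu_3 & \nu_2\nu_3\nu_4 & \nu_2\nu_3\nu_4\nu_5 \\
\nu_1\nu_2 & \nu_2 & \delta_3 & \nu_3 & \nu_3\nu_4 & \nu_3\nu_4\nu_5 \\
\nu_1\nu_2\nu_3 & \nu_2\nu_3 & \nu_3 & \delta_4 & \nu_4 & \nu_4\nu_5 \\
\nu_1\nu_2\nu_3\nu_4 & \nu_2\nu_3\nu_4 & \nu_3\nu_4 & \nu_4 & \delta_5 & \nu_5 \\
\nu_1\nu_2\nu_3\nu_4\nu_5 & \nu_2\nu_3\nu_4\nu_5 & \nu_3\nu_4\nu_5 & \nu_4\nu_5 & \nu_5 & \delta_6
\end{pmatrix}
$$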
These parameters can be estimated by summing six design matrices, one for each crossings parameter plus one for the diagonal design matrix (diag_dm), and taking antilogs. The corresponding model that does not fit the diagonal exactly is estimated in the same way except that the diagonal design matrix is omitted. Here are the five design matrices for the crossings parameters:

    cr1_dm         cr2_dm         cr3_dm         cr4_dm         cr5_dm
    0 1 1 1 1 1    0 0 1 1 1 1    0 0 0 1 1 1    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    0 0 1 1 1 1    0 0 0 1 1 1    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    0 0 0 1 1 1    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    1 1 1 0 0 0    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    1 1 1 0 0 0    1 1 1 1 0 0    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    1 1 1 0 0 0    1 1 1 1 0 0    1 1 1 1 1 0
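The five crossings design matrices listed above follow one rule: the matrix for barrier u has a 1 in every cell whose origin and destination lie on opposite sides of the boundary between categories u and u + 1. A minimal Python sketch of that rule (the function name `crossings_dm` is hypothetical):

```python
def crossings_dm(k, u):
    """Design matrix for crossing barrier u (between categories u and u + 1).

    A cell (i, j), with categories numbered from 1, is coded 1 when moving
    from i to j requires crossing barrier u, and 0 otherwise.
    """
    return [[1 if min(i, j) <= u < max(i, j) else 0
             for j in range(1, k + 1)]
            for i in range(1, k + 1)]


for u in range(1, 6):
    print(f"barrier {u}")
    for row in crossings_dm(6, u):
        print(" ".join(str(value) for value in row))
```

Summing the five matrices cell by cell gives the number of barriers crossed in each cell, which is zero on the diagonal and |i − j| elsewhere.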
As we see in Table 12.11, the crossings model fits better than any of the other models we have reviewed so far. Interestingly, fitting the diagonal cells exactly degrades the fit slightly by the BIC standard, presumably because the diagonal cells are captured well by the crossings parameters, and the additional degrees of freedom used by fitting the diagonal exactly are penalized by BIC. The crossings parameters for the simpler crossings model are

    Crossing     1       2       3       4       5
    Parameter    0.138   0.002   0.203   0.228   1.033

Clearly, by far the most difficult transition (crossing) is between farm and nonfarm occupations (specifically, manual occupations); this is true everywhere, and China is no exception. Interestingly, the least difficult transition is between cadre and clerical occupations. Again, this is no particular surprise, because in China no sharp distinction is made between clerical and administrative tasks and the brightest and best of the clerical staff are often tapped to become cadres. The known intragenerational mobility pattern may well carry over to intergenerational mobility, with clerical positions seen as reasonable starting points for the children of administrative cadres and cadre positions as attainable upward mobility goals for the children of clerical workers. Finally, this result could be due to the fact that the analysis here combines males and females. It could well be that the daughters of cadres disproportionately tend to become clerical workers.

Uniform Association Models
When the categories of a table are ordered, it is possible to estimate more parsimonious models than are available for nominal categories. The simplest such model assumes that the difference between each pair of adjacent categories is equal, so that the scale for each variable can be represented by consecutive integers. That is,
$$\log F_{ij} = \mu + \mu_i^R + \mu_j^C + \beta ij \tag{12.13}$$

where the strength of the association between the row level and the column level is captured by $\beta$. From this it follows that the log odds ratio between row categories $i$ and $i'$ and column categories $j$ and $j'$ is just

$$\log\theta = \beta(i - i')(j - j') \tag{12.14}$$
Table 12.11 shows goodness-of-fit statistics for the uniform association model with and without the main diagonal fitted exactly. As you can see, when the diagonal cells are not fit exactly, the uniform association model fits very badly. The reason for this is simple: people disproportionately tend to remain in the same occupational category as their fathers. This tendency is captured when the diagonal cells are estimated exactly. When the diagonal cells are estimated exactly, the uniform association model fits well. It yields $\beta$ = .046. From Equation 12.14 we can see that this implies, for example, that the log odds that the child of a professional will become a professional rather than a farmer, relative to the corresponding log odds for the child of a farmer, is .046(1 − 6)(1 − 6) = 1.150; $e^{1.150}$ = 3.158. That is, the odds that the child of a professional will become a professional rather than a farmer are more than three times the corresponding odds for the child of a farmer. This is a rather low odds ratio, which is consistent with the general sense that intergenerational mobility in China is easier than in most other nations.

Linear-by-Linear Association Models
Now suppose we have more information than simply a rank order of categories, for example, socioeconomic status scores. We can then estimate a linear-by-linear association model, where the scale scores are substituted for the category indexes. That is, instead of Equation 12.13, we have

$$\log F_{ij} = \mu + \mu_i^R + \mu_j^C + \beta x_i y_j \tag{12.15}$$

with the log odds ratio given by

$$\log\theta = \beta(x_i - x_{i'})(y_j - y_{j'}) \tag{12.16}$$

Estimating this model for the Chinese data, with occupation categories scored by their mean occupational status (ISEI; see Ganzeboom, De Graaf, and Treiman 1992),
we achieve a model that fits marginally better than the uniform association model, by the BIC criterion. For this model, $B = .000483$. Thus for the same categories as in the uniform association example, we have $.000483(16.2 - 63.7)(16.2 - 63.7) = 1.090$; $e^{1.090} = 2.974$. We are hereby led to a qualitatively similar conclusion: the odds that the child of a professional will become a professional rather than a farmer are about three times as large as the corresponding odds for the child of a farmer.

Note that it is possible to include more than one scaling of the categories of a table to represent different concepts. Table 12.11 shows goodness-of-fit statistics for two additional linear-by-linear models, one of which scales occupations by the proportion of incumbents who have permanent urban registration (urban hukou) and the other of which uses both the ISEI and urban registration measures. As it happens, neither fits as well as the ISEI and uniform association models. However, if we wished to assess the log odds ratio using, say, the model that includes both measures, we would simply apply Equation 12.16 to both variables and compute the sum. (For a well-known application of this kind of model, see Hout 1984.)
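The two odds-ratio calculations above follow the same recipe, so they can be checked with a few lines of code. The following is only a sketch: the function name `log_odds_ratio` is made up for illustration, and the coefficients and scores are the ones quoted in the text.

```python
import math

def log_odds_ratio(b, row_a, row_b, col_a, col_b):
    """Equations 12.14 and 12.16: B times the product of the row and column score differences."""
    return b * (row_a - row_b) * (col_a - col_b)

# Uniform association: categories scored by consecutive integers
# (1 = professional, 6 = farmer), with B = .046 as reported in the text.
uniform = log_odds_ratio(0.046, 1, 6, 1, 6)

# Linear-by-linear association: the two scores used in the text's
# calculation (16.2 and 63.7), with B = .000483.
linear = log_odds_ratio(0.000483, 16.2, 63.7, 16.2, 63.7)

for name, value in (("uniform", uniform), ("linear-by-linear", linear)):
    print(f"{name}: log odds ratio = {value:.3f}, odds ratio = {math.exp(value):.3f}")
```

Because the same pair of categories is compared on rows and columns, the two score differences are equal and their product is positive whichever category is listed first.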
Row Effects (and Column Effects) Models
Sometimes we are confident that one variable can be scored with an integer scale (that is, that the difference between each pair of adjacent categories is the same) but we are uncertain about how to order the other variable. In such cases we can estimate the unknown scores. In this model the expected frequencies are given by

$$\log F_{ij} = \mu + \mu_i^X + \mu_j^Y + j\,\phi_i \qquad (12.17)$$

where the $j$ index the categories of one variable and the $\phi_i$ are the estimated scale scores for the other variable. The log odds ratio is given by

$$\log\theta = (\phi_i - \phi_{i'})(j - j') \qquad (12.18)$$
As an example of a situation in which these conditions might hold, consider the relationship between size of place of origin and educational attainment, for the 1996 Chinese survey we have been using. Table 12.12 shows the bivariate frequency distribution for adults not currently attending school. In constructing this table, I have collapsed education so that the categories represent approximate three-year intervals in median schooling. The size-of-place categories are from the official administrative hierarchy of China, which strongly affects the flow of resources to places. Thus, in addition to the general advantage of urban residence for educational attainment (greater exposure to the written word and such), we would expect educational attainment to be greater for places high in the administrative hierarchy because such places are the beneficiaries of more resources from the central government.

The row-effects model fits well ($BIC = -135$, $\Delta = 2.96$) although not by classical inference ($p < .000$). But contrary to my expectation, the estimated scores
TABLE 12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School. Columns (level of schooling): None, Lower Primary, Upper Primary, Lower Middle, Upper Middle, Tertiary, Total. [Cell frequencies not recoverable.]
for size of place of origin suggest a nonmonotonic relationship to education. The estimates are

Village                  0.00
Town                     0.36
County seat              0.74
County-level city        0.86
Prefecture-level city    0.73
Provincial capital       1.01
Province-level city      0.98
According to this model, people from county-level cities (medium cities) get somewhat more education than do people from prefecture-level cities, although it would be unwise to make too much of this because the confidence intervals overlap (the 95 percent confidence interval is 0.71 to 1.01 for county-level cities and 0.63 to 0.84 for prefecture-level cities).
Column-effects models are formally identical to row-effects models, but with the role of rows and columns reversed. A column-effects model of the relationship between size of place at age fourteen and educational attainment does not fit as well as the corresponding row-effects model ($BIC = -108$, $\Delta = 2.98$, and $p < .000$), which suggests that the assumption of equal scale differences between adjacent size-of-place categories is probably incorrect. This is hardly surprising given the deviation from equal differences in the estimated coefficients for size-of-place categories in the row-effects model and, especially, the nonmonotonicity of the scores relative to my a priori ordering.

Row-and-Column-Effects Model I
Another analytic possibility is to treat both the row-effects and column-effects scores as unknown quantities to be estimated. However, in this case it is important to have the correct ordering of both the row and column categories because the results are not invariant under different orderings. For the Chinese example we have been exploring (the relationship between the size of the place of origin and educational attainment) this creates a bit of a dilemma. Is it better to reorder the size-of-place categories according to the scale scores derived from the row-effects model or to retain the a priori ordering derived from the Chinese administrative hierarchy? One possibility
[...] $BIC = -152$, $p = .304$, and $\Delta = 1.20$, compared with [values illegible] using the a priori categories). For the row-and-column-effects model with the reordered categories, the scale scores are as follows:

Size of place (reordered)    Score     Level of schooling    Score
Village                       0.00     No schooling           0.00
Town                          [...]    Lower primary          [...]
Prefecture-level city         [...]    Upper primary          1.67
County seat                   [...]    Lower middle           [...]
County-level city            -3.10     Upper middle           3.84
Province-level city          -4.00     Tertiary               4.80
Provincial capital           -4.95
Formally, the row-and-column-effects model (often called Row-and-Column-Effects Model I to distinguish it from a log-multiplicative model also proposed by Goodman [1979] and known as Row-and-Column-Effects Model II, which we will discuss in the next section) is given by

$$\log F_{ij} = \mu + \mu_i^X + \mu_j^Y + j\,\phi_i + i\,\phi_j \qquad (12.19)$$

with the log odds ratio given by

$$\log\theta = (\phi_i - \phi_{i'})(j - j') + (\phi_j - \phi_{j'})(i - i') \qquad (12.20)$$
Thus, for example, from Equation 12.20 we can calculate the log odds ratio of a tertiary versus an upper primary education for a person raised in a provincial capital compared to a person raised in a village as $\log\theta = (-4.95 - 0)(6 - 3) + (4.80 - 1.67)(7 - 1) = 3.93$, which implies that the odds ratio is 50.9 ($= e^{3.93}$). That is, the odds of people obtaining a tertiary education rather than a primary education are more than fifty times as great for those living in provincial capitals as for those living in rural villages. When people from Chinese rural villages make it to university, they are overcoming stupendous odds.
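The arithmetic behind this comparison can be sketched as follows. The function name is hypothetical, and the signs of the scores are an assumption: they are the ones needed to reproduce the result of 3.93 reported in the text (a negative score for provincial capital, positive scores for the schooling levels).

```python
import math

def rc1_log_odds_ratio(phi_row_a, phi_row_b, i_a, i_b,
                       phi_col_a, phi_col_b, j_a, j_b):
    """Equation 12.20: (phi_i - phi_i')(j - j') + (phi_j - phi_j')(i - i')."""
    return ((phi_row_a - phi_row_b) * (j_a - j_b)
            + (phi_col_a - phi_col_b) * (i_a - i_b))

log_theta = rc1_log_odds_ratio(
    phi_row_a=-4.95, phi_row_b=0.0,   # provincial capital vs. village (row scores)
    i_a=7, i_b=1,                     # their positions in the reordered categories
    phi_col_a=4.80, phi_col_b=1.67,   # tertiary vs. upper primary (column scores)
    j_a=6, j_b=3,                     # their positions among the schooling levels
)
odds_ratio = math.exp(log_theta)
print(f"log odds ratio = {log_theta:.2f}, odds ratio = {odds_ratio:.1f}")
```

Note that the category positions enter the formula directly, which is why the result changes if the categories are reordered.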
Row-and-Column-Effects Model II (the RC or Log-Multiplicative Model)
As I noted in the previous section, a serious limitation of Row-and-Column-Effects Model I is that correct estimation of the scale scores depends on correctly ordering the categories. For this reason an alternative model proposed by Goodman (1979), Row-and-Column-Effects Model II (also called the RC model or the Log-Multiplicative model), which is invariant under any ordering of categories, and which estimates scale scores from the data, has become much more widely used. In this model the expected frequencies are calculated as

$$\log F_{ij} = \mu + \mu_i^X + \mu_j^Y + \phi_i\,\phi_j \qquad (12.21)$$

and the log odds ratios as

$$\log\theta = (\phi_i - \phi_{i'})(\phi_j - \phi_{j'}) \qquad (12.22)$$
An alternative parameterization of Equation 12.21, which includes a term for the overall strength of association in the table (particularly useful for comparisons between groups, which I do not cover here) is

$$\log F_{ij} = \mu + \mu_i^X + \mu_j^Y + B\,\phi_i\,\phi_j \qquad (12.23)$$

with the log odds ratios given as

$$\log\theta = B(\phi_i - \phi_{i'})(\phi_j - \phi_{j'}) \qquad (12.24)$$
For the data shown in Table 12.12, estimation of Equation 12.23 yields a very good fit: $p = .140$ and $BIC = -147.3$. Interestingly, the estimated scale scores preserve the order of the row-and-column-effect scores reported earlier:

Size of place              Score     Level of schooling    Score
Village                    0.00      No schooling          0.00
Town                       0.42      Lower primary         0.14
Prefecture-level city      0.76      Upper primary         0.17
County seat                0.82      Lower middle          0.50
County-level city          0.91      Upper middle          0.80
Province-level city        1.00      Tertiary              1.00
Provincial capital         1.04
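Scale scores like these translate into log odds ratios through Equation 12.24. The sketch below assumes the association parameter $B = 4.17$ that the text reports for this model; the dictionary and function names are illustrative only.

```python
import math

# Estimated RC-model scale scores, as listed in the table above.
place_scores = {
    "Village": 0.00, "Town": 0.42, "Prefecture-level city": 0.76,
    "County seat": 0.82, "County-level city": 0.91,
    "Province-level city": 1.00, "Provincial capital": 1.04,
}
schooling_scores = {
    "No schooling": 0.00, "Lower primary": 0.14, "Upper primary": 0.17,
    "Lower middle": 0.50, "Upper middle": 0.80, "Tertiary": 1.00,
}
B = 4.17  # association parameter reported in the text for this model

def rc2_log_odds_ratio(place_a, place_b, school_a, school_b):
    """Equation 12.24: B (phi_i - phi_i')(phi_j - phi_j')."""
    return (B
            * (place_scores[place_a] - place_scores[place_b])
            * (schooling_scores[school_a] - schooling_scores[school_b]))

log_theta = rc2_log_odds_ratio("Provincial capital", "Village", "Tertiary", "Upper primary")
odds_ratio = math.exp(log_theta)
print(f"log odds ratio = {log_theta:.2f}, odds ratio = {odds_ratio:.1f}")
```

Unlike Model I, only the estimated scores enter the calculation, so the ordering of the categories in the table plays no role.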
In China, size of place of origin appears to be very strongly associated with educational attainment, reflected in the association parameter $B = 4.17$. Moreover, the greatest gap is between rural villages and any urban place, with the next largest gap between towns and prefecture-level cities. Making the same comparison as for the row-and-column-effects model, from Equation 12.24 we can calculate the log odds ratio of a tertiary versus an upper primary education for a person raised in a provincial capital compared to a person raised in a village as $\log\theta = 4.17(1.04 - 0)(1.00 - 0.17) = 3.60$, which implies that the odds ratio is 36.6 ($= e^{3.60}$). That is, the RC model implies that the odds of people obtaining a tertiary education versus a primary education are about thirty-seven times as great for those living in provincial capitals as for those living in rural villages. Although the odds ratio implied by this model is not as large as that implied by the row-and-column-effects model (which yields an odds ratio of fifty-one), it is still extremely large.

Although in this example the scaling of size-of-place categories was reasonably close to my a priori assumptions, and the rank ordering of the education categories was exactly as I anticipated, there is nothing in the method that guarantees such a close correspondence. Because the scale scores are computed to maximize the association between the row and column variables, they provide a test of the correctness of a priori assumptions. We can see this clearly by estimating an RC model for the Chinese intergenerational occupational mobility table analyzed earlier. In contrast to the typical outcome in Western nations (Ganzeboom, Luijkx, and Treiman 1989), the resulting scale scores for China deviate very substantially from my a priori ordering of occupation categories based on their socioeconomic position (perhaps because our data include both males and females whereas most research on occupational mobility for other nations, including that carried out by Ganzeboom, Luijkx, and Treiman [1989] and also Wu and Treiman's 2007 analysis of these data, is restricted to males). The following coefficients are from a model with the diagonal blocked.
                             Father's      Respondent's
                             Occupation    Occupation
Professionals                  0.00          0.00
Cadres                        27.68          0.27
Clerical workers              13.76          0.18
Sales and service workers     12.97          0.77
Manual workers                 2.33          0.87
Agricultural workers           1.00          1.00
Clearly, the children of cadres are much more likely than other offspring to move into high-status positions. By contrast, the children of professionals are hardly protected at all from downward mobility, which may reflect the rather heterogeneous character of the category; it includes village accountants and school teachers and many technical positions that do not require tertiary education. The scale scores for respondents' occupations are somewhat more orderly, revealing a sharp manual-nonmanual divide, although mobility into the professions from all sorts of origins appears easier than mobility into clerical or cadre positions. Wu and Treiman (2007) also obtain distinctive results, albeit not as extreme as these, in their male-only analysis and argue that their results reflect a distinctive Chinese institution, the residential registration system, which makes the children of rural nonagricultural workers vulnerable to downward mobility into agriculture but also creates an extreme upward mobility route into the professions for the bright children of peasants.

The contrast between these results and results from the corresponding Row-and-Column-Effects Model I is instructive:
                             Father's      Respondent's
                             Occupation    Occupation
Professionals                  0.00          0.00
Cadres                         0.66          0.25
Clerical workers               0.86          0.51
Sales and service workers      1.15          0.92
Manual workers                 1.33          1.29
Agricultural workers           1.66          1.53

The row-and-column-effects model gives orderly results, consistent with my a priori ordering of categories. Thus, an analyst might be tempted to settle for this model because by the likelihood ratio criterion it has by far the best fit among all the models estimated in this chapter except for the RC model (see Table 12.11), although it does not have the most negative BIC. However, the row-and-column-effects model is clearly incorrect, even though it is nearly as likely as the RC model by the BIC standard.

From the RC model we can calculate the relative odds that the child of a professional will become a professional rather than a farmer, compared with the corresponding odds for the child of a farmer. Because the association coefficient, $B$, for the Chinese mobility table is .0455 (another indicator of the lack of association in the table), from Equation 12.24 we have $\log\theta = .0455(0 - 1)(0 - 1) = 0.0455$, which implies that the odds ratio is 1.047 ($= e^{.0455}$). Apparently, the odds that the children of professionals will follow their fathers' footsteps rather than going into the fields are hardly larger than the odds that the children of farmers will become professionals rather than following their fathers into the fields.

This is a very different result from what we calculated from the uniform and linear-by-linear association models, and it brings home in a dramatic way the importance of deciding on the right model before making inferences. (It is also quite different from the corresponding result from Row-and-Column-Effects Model I, which implies that the children of professionals are about twice as likely [because $e^{0.65} = 1.92$] as the children of peasants to become professionals rather than peasants.) Nonetheless, here we might be well advised to settle for the linear-by-linear association model, in which mobility is a function of differences in status between occupation categories, on the ground that it has the most negative BIC.

Extensions
It is quite possible to extend the parsimonious models presented here to more than two variables. The most common application is to compare two-variable tables across contexts
(time periods, nations, ethnic groups, and so on), but more general extensions are also possible. Many of these procedures are discussed in the literature briefly reviewed in the following section.
A BIBLIOGRAPHIC NOTE
A number of treatments of log-linear analysis are available, ranging from those intended for social scientists with limited mathematical backgrounds to full-blown treatises in mathematical statistics. The most accessible treatments include those by Davis (19[...]), Knoke and Burke (1980), Gilbert (1981), and Powers
rt979r.Cioggr l'oS),.C..q, ffi'lir"r;;il6t;.
SOFTWARE FOR ESTIMATING LOG-LINEAR MODELS
GLIM is available for purchase from http://www.nag.co.uk/stats/GDGE_soft.asp. The worked examples appearing in Goodman and Hout (1998) can be downloaded as two Microsoft Office Excel 97 workbook files from the Carnegie Mellon University Statistics Department's StatLib: http://lib.stat.cmu.edu/DOS/general. The files, "mobility.xls" and "voting.xls," include the raw data, GLIM results, and graphical displays from the examples presented in the article.

Vermunt's (1997) software, lem, and the accompanying documentation (Vermunt 1997) can be downloaded free of charge. The easiest way to find the download site is to query a search engine for "home page jeroen vermunt." The documentation is very cryptic, but the software comes with many worked examples that can easily be modified.

Pisati's Stata -ado- file to estimate uniform layer-effect models can be downloaded from within Stata (connected to the Internet) by typing "net search pisati" and then clicking "sg142 from http://www.stata.com/stb/stb55."

John Hendrickx has written an -ado- file, -rc2-, that estimates the RC model (Equation 12.23). To download -rc2- from within Stata, type "net search rc2." Then click rc2 from http://fmwww.bc.edu/RePEc/bocode/r and follow the instructions. (I thank Maarten Buis for pointing me to Hendrickx's program.)
LogLinearAnalysir295 mLl6:.1984),Sobel,Hout, and Duncan (1985),Yamaguchi(1987),Becker and Clogg ll&[9 r.Mare(1991),Xie (1992),EriksonandGoldthorpe(1992a,1992b),Hout andHauser +ilN!91r. GoodmanandHout (1998),Fu (2001),Pisati(2001),andParkandSmits(2005). Tbe 1998 paper by Goodmanand Hout is particulady valuablefor analystswho mir ro comparelogJinear modelsacrosscontexts.Goodmanand Hout estimatedtheir usingGLIM, a powerfulBridsh competitorto Stata.Thesemodels,andvirtually .rrherloglinearor logmultiplicativemodel, can be estimatedusing lem developed JEroenVermunt at Tilburg University, the Netherlands.A subsetof the models disby Goodmanand Hout can be estimatedusing a Stata ado file by Pisati 1t: seealsoYamaguchi(1987)andXie (1992),who originally proposedversionsof models,
Applications of log-linear modeling to substantive problems other than social mobility can be located by searching Sociological Abstracts or other bibliographic databases. (I got 310 hits searching Sociological Abstracts for "log-linear" [and variants "loglinear" and "log linear"] as a key word on 24 November 2007.)
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have seen how to use log-linear analysis to test hypotheses regarding the presence or absence of associations among variables in multiway tables. These tools give us a powerful way of testing hypotheses pertaining to percentage tables. In addition, we have seen how to apply various models to parsimoniously summarize patterns of association in two-way tables, and to determine which of several alternative models fits best. Although the substantive examples for discussing parsimonious models were drawn from studies of social mobility, the topic that has driven most model development, these models can be applied to a wide variety of substantive problems.
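APPENDIX 12.A DERIVATION OF THE EFFECT PARAMETERS
To see what the $\eta$ and $\tau$s are, consider the saturated model for a two-by-two table. Recall from Equation 12.1 that the expected frequencies in each cell of a table can be expressed as a function of $\eta$ and the $\tau$s:

$$\begin{aligned}
F_{11} &= \eta\,\tau_1^X\,\tau_1^Y\,\tau_{11}^{XY} \\
F_{12} &= \eta\,\tau_1^X\,\tau_2^Y\,\tau_{12}^{XY} \\
F_{21} &= \eta\,\tau_2^X\,\tau_1^Y\,\tau_{21}^{XY} \\
F_{22} &= \eta\,\tau_2^X\,\tau_2^Y\,\tau_{22}^{XY}
\end{aligned} \qquad (12.A.1)$$

Multiplying one of the equations by the other three and simplifying (recalling the relations among the $\tau$s shown in Equation 12.5), we have

$$\eta = (F_{11}F_{12}F_{21}F_{22})^{1/4} \qquad (12.A.2)$$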
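Thus $\eta$ is just the geometric mean of the expected cell frequencies (the geometric mean of a set of $n$ numbers is the $n$th root of their product). In this sense, $\eta$ is a scale factor; all it does is take account of the fact that tables have different average numbers of cases per cell.

Next we express the row effect as a function of the cell frequencies. We do this by writing the product of the two conditional odds as a function of $\eta$s and $\tau$s and simplifying:

$$\left(\frac{F_{11}}{F_{21}}\right)\left(\frac{F_{12}}{F_{22}}\right) = \left[\frac{\eta\,\tau_1^X\tau_1^Y\tau_{11}^{XY}}{\eta\,\tau_2^X\tau_1^Y\tau_{21}^{XY}}\right]\left[\frac{\eta\,\tau_1^X\tau_2^Y\tau_{12}^{XY}}{\eta\,\tau_2^X\tau_2^Y\tau_{22}^{XY}}\right] = (\tau_1^X)^4 \qquad (12.A.3)$$

And so

$$\tau_1^X = \left[(F_{11}/F_{21})(F_{12}/F_{22})\right]^{1/4} = \left[(F_{11}F_{12})/(F_{21}F_{22})\right]^{1/4} \qquad (12.A.4)$$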
That is, we see from Equation 12.A.4 that $\tau_1^X$ is a function of the product of the two conditional odds. But we can get a more readily interpretable alternative expression by multiplying both the numerator and the denominator of the second line of 12.A.4 by $(F_{11}F_{12})^{1/4}$ and simplifying. This yields

$$\tau_1^X = \frac{(F_{11}F_{12})^{1/2}}{(F_{11}F_{12}F_{21}F_{22})^{1/4}} \qquad (12.A.5)$$
From Equation 12.A.5 we see that $\tau_1^X$, the effect parameter for the first row, is just the ratio of the average size of the cells in the first row to the average size of all cells in the table (where by averages I mean geometric means). Thus $\tau$s larger than one indicate that a disproportionately large share of all the cases in the table is in the first row and $\tau$s smaller than one indicate that a disproportionately small share of all the cases in the table is in the first row. In a similar way we can derive corresponding expressions for the effect parameter associated with the second row and also the effect parameters for columns.

Finally, we can derive interpretable expressions for the interaction effect parameters. To see this, we write the expected odds ratio, $(F_{11}/F_{21})/(F_{12}/F_{22})$, as a function of $\eta$s and $\tau$s and simplify as we did earlier:

$$\frac{F_{11}/F_{21}}{F_{12}/F_{22}} = \frac{(\eta\,\tau_1^X\tau_1^Y\tau_{11}^{XY})/(\eta\,\tau_2^X\tau_1^Y\tau_{21}^{XY})}{(\eta\,\tau_1^X\tau_2^Y\tau_{12}^{XY})/(\eta\,\tau_2^X\tau_2^Y\tau_{22}^{XY})} \qquad (12.A.6)$$

This yields

$$(\tau_{11}^{XY})^4 = \frac{F_{11}/F_{21}}{F_{12}/F_{22}} \qquad (12.A.7)$$
and

$$\tau_{11}^{XY} = \left[\frac{F_{11}/F_{21}}{F_{12}/F_{22}}\right]^{1/4} \qquad (12.A.8)$$
Thus $\tau_{11}^{XY}$ is a function of the ratio of the two conditional odds. Once again we can get a more readily interpretable expression by multiplying both the numerator and the denominator of the right side of Equation 12.A.8 by the geometric mean of the expected frequencies, $(F_{11}F_{12}F_{21}F_{22})^{1/4}$, and simplifying:

$$\tau_{11}^{XY} = \frac{(F_{11}F_{22})^{1/2}}{(F_{11}F_{12}F_{21}F_{22})^{1/4}} \qquad (12.A.9)$$
From Equation 12.A.9 we see that the interaction effect parameter, $\tau_{11}^{XY}$, is just the ratio of the average size of the two diagonal cells to the average size of all cells. If this ratio is greater than one, there is a positive association (or interaction, in log-linear terms) between X and Y. If $\tau_{11}^{XY}$ is smaller than one, there is a negative association between X and Y (assuming that Category 1 is the "positive" value in each case). In a similar way we can derive expressions for the other interaction effect parameters.

These relationships can be generalized beyond the two-by-two case, but that is beyond the scope of the present discussion. Those wishing to pursue this topic should consult the sources listed in the Bibliographic Note section of this chapter.
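The geometric-mean interpretation in this appendix is easy to verify numerically. The following sketch uses made-up expected frequencies for a two-by-two table (the numbers and variable names are illustrative assumptions), computes the parameters from Equations 12.A.2, 12.A.5, and 12.A.9, and confirms that they reproduce the cells.

```python
import math

# Hypothetical expected frequencies for a two-by-two table (illustrative only).
F = {(1, 1): 40.0, (1, 2): 10.0, (2, 1): 20.0, (2, 2): 30.0}

def geometric_mean(values):
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))

eta = geometric_mean(F.values())                      # Equation 12.A.2
tau_x1 = geometric_mean([F[1, 1], F[1, 2]]) / eta     # Equation 12.A.5 (first row)
tau_y1 = geometric_mean([F[1, 1], F[2, 1]]) / eta     # column analogue of 12.A.5
tau_xy11 = geometric_mean([F[1, 1], F[2, 2]]) / eta   # Equation 12.A.9 (diagonal cells)

def fitted(i, j):
    """Rebuild a cell from the parameters, using tau_2 = 1 / tau_1 on each dimension."""
    tx = tau_x1 if i == 1 else 1 / tau_x1
    ty = tau_y1 if j == 1 else 1 / tau_y1
    txy = tau_xy11 if i == j else 1 / tau_xy11
    return eta * tx * ty * txy

for cell, freq in F.items():
    print(cell, freq, round(fitted(*cell), 6))
```

With these numbers the first-row parameter is below one (the first row holds fewer cases than the second) and the interaction parameter is above one (the diagonal cells are larger than the off-diagonal cells).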
APPENDIX 12.B INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood estimation is one of several methods used to obtain parameter estimates for the models presented in this and the following chapters. The principle is straightforward, although both the underlying mathematics and the computational procedures are often quite complex and go beyond what is dealt with in this book. For good introductions to the topic, see King (1989), Eliason (1993), Long (1997, 25-33, 52-61), and Powers and Xie (2000, Appendix B).

Suppose we observe a random sample of values on some variable, $x_1, x_2, \ldots, x_n$, drawn independently from a population distribution $f(x_1, x_2, \ldots, x_n \mid \theta)$ governed by an unknown parameter $\theta$. We may then ask what is the probability of obtaining the observed sample for any given value of $\theta$. This is the likelihood of the sample. What we want to do is to find the value for $\theta$ that maximizes the likelihood of the sample; this is the maximum likelihood estimate of $\theta$. More generally, maximum likelihood estimation consists of procedures for finding estimates of unknown parameters that maximize the likelihood of the observed data; the resulting parameter estimates are known as maximum likelihood estimates.

Maximum likelihood estimation involves two steps: determining the likelihood function, which expresses the probability of the observed data as a function of the unknown
parameters; and maximizing the likelihood function. We can write a general expression for the likelihood function:

$$\Lambda(\theta) = \prod_{i=1}^{N} f(x_i; \theta) \qquad (12.B.1)$$

where $\theta$ is a column vector of unknown parameters; note that there may be only one unknown parameter, in which case $\theta$ is a scalar. Equation 12.B.1 holds because the observations are assumed to be independent, which means that their joint distribution may be written as the product of the individual marginal distributions. However, because products are mathematically intractable, we convert Equation 12.B.1 into log form, which we can do because the relationship between a variable and its log is monotonic. Thus we have

$$\lambda(\theta) = \ln\left[\prod_{i=1}^{N} f(x_i; \theta)\right] = \sum_{i=1}^{N} \ln f(x_i; \theta) \qquad (12.B.2)$$

Now we find the values of $\theta$ (denoted $\hat\theta$) that maximize the log likelihood; because of the monotonic relationship, these also maximize the likelihood.
MEAN OF A NORMAL DISTRIBUTION
Consider a simple case. Suppose we want to find the maximum likelihood estimate of the mean, $\mu$, of a normally distributed population with known variance $\sigma^2$. Because the likelihood for a single observation is

$$L(\mu, \sigma^2; x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right] \qquad (12.B.3)$$

it follows (from Equations 12.B.1 and 12.B.2) that the log likelihood of the sample is

$$\lambda = \sum_{i=1}^{N} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right] = -N \ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad (12.B.4)$$

However, we can disregard the leftmost term on the right side of the equation because it does not depend on the $x_i$. We also can discard the factor $1/(2\sigma^2)$ because $\sigma^2$ is assumed known. This leaves us with the kernel of the log likelihood:

$$-\sum_{i=1}^{N}(x_i - \mu)^2 \qquad (12.B.5)$$
To find the value of $\mu$ that maximizes the log likelihood, we take the first derivative of Equation 12.B.5 with respect to $\mu$:

$$\frac{\partial}{\partial\mu}\left[-\sum_{i=1}^{N}(x_i - \mu)^2\right] = 2\sum_{i=1}^{N}(x_i - \mu) \qquad (12.B.6)$$

Setting Equation 12.B.6 to zero and solving for $\mu$ yields the maximum likelihood estimate:

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (12.B.7)$$

This is a maximum because the second derivative of Equation 12.B.5 with respect to $\mu$, $-2N$, is negative. That is, the maximum likelihood estimate of the population mean is simply the sample mean.
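The result for the normal mean can be confirmed by brute force. The sketch below uses a simulated sample (the seed, the sample size, and the population values are arbitrary assumptions) and checks that the sample mean both zeroes the first derivative and beats every other candidate on a grid.

```python
import random

random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(200)]  # simulated data, illustrative only

def kernel(mu, xs):
    """Kernel of the log likelihood (Equation 12.B.5): minus the sum of squared deviations."""
    return -sum((x - mu) ** 2 for x in xs)

def first_derivative(mu, xs):
    """Derivative of the kernel with respect to mu (Equation 12.B.6)."""
    return 2 * sum(x - mu for x in xs)

mu_hat = sum(sample) / len(sample)  # Equation 12.B.7: the sample mean

# Evaluate the kernel on a grid of candidate values centered on the sample mean.
grid = [mu_hat + step / 100 for step in range(-200, 201)]
best = max(grid, key=lambda mu: kernel(mu, sample))

print(f"sample mean = {mu_hat:.4f}")
print(f"derivative at the sample mean = {first_derivative(mu_hat, sample):.2e}")
print(f"grid value with the largest kernel = {best:.4f}")
```

A grid search is used here only because it makes the idea of "maximizing" concrete; for this model the closed-form solution makes any search unnecessary.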
LOG-LINEAR PARAMETERS
The same principle applies to finding maximum likelihood estimates of log-linear parameters. Consider the independence model for a two-by-two table, in which the expected frequencies are given by

$$\log F_{ij} = \mu + \mu_i^X + \mu_j^Y \qquad (12.B.8)$$

Let $y_i$ be the observed count in the $i$th cell of the table, and let the dummy variables $x_1$ and $x_2$ indicate the row and column category of the cell.
The independence model can then be written as a model for counts:

$$m_i = \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}) \qquad (12.B.9)$$

where $m_i$ is the expected frequency in the $i$th cell. Under Poisson sampling, the kernel of the log likelihood is

$$\lambda(\beta) = \sum_i (y_i \log m_i - m_i) \qquad (12.B.10)$$

and so we need to maximize Equation 12.B.10. Because the model is nonlinear, an iterative solution is required in which we repeatedly update the estimates of the $\beta$s using the first and second derivatives of Equation 12.B.10 with respect to $\beta$. For our purposes, it is not necessary to consider further how the estimation is actually carried out. For additional details, see Eliason (1993); Gould and Sribney (1999); and Powers and Xie (2000, Appendix B).
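For readers curious about what such an iterative solution looks like, here is a minimal Newton-Raphson sketch for the independence model in a two-by-two table. It is an illustration under stated assumptions, not a description of how Stata or any other package proceeds: the cell counts are made up, and the helper `solve` is a small hand-written linear solver included only to keep the example self-contained.

```python
import math

# Hypothetical observed counts for the four cells, coded with dummy
# variables x1 (second row) and x2 (second column).
cells = [(0, 0, 30.0), (0, 1, 20.0), (1, 0, 15.0), (1, 1, 35.0)]  # (x1, x2, y)
X = [(1.0, x1, x2) for x1, x2, _ in cells]   # leading 1.0 is the constant term
y = [count for _, _, count in cells]

def expected(beta):
    """Equation 12.B.9: m_i = exp(b0 + b1*x1 + b2*x2)."""
    return [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]

def solve(a, b):
    """Solve the small linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

beta = [math.log(sum(y) / len(y)), 0.0, 0.0]  # start from equal expected counts
for _ in range(25):
    m = expected(beta)
    # First derivatives of Equation 12.B.10 with respect to each beta.
    gradient = [sum((yi - mi) * row[k] for yi, mi, row in zip(y, m, X)) for k in range(3)]
    # Negative of the matrix of second derivatives.
    information = [[sum(mi * row[k] * row[c] for mi, row in zip(m, X)) for c in range(3)]
                   for k in range(3)]
    step = solve(information, gradient)
    beta = [b + s for b, s in zip(beta, step)]
    if max(abs(s) for s in step) < 1e-10:
        break

fitted = expected(beta)
print([round(v, 4) for v in fitted])
```

At convergence the fitted counts equal the familiar (row total × column total) / N values, which is a convenient check that the iterations found the maximum.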
BINOMIAL LOGISTIC REGRESSION

WHAT THIS CHAPTER IS ABOUT
This chapter introduces binomial logistic regression, a technique for estimating models with a dichotomous dependent variable. We start by considering the relationship of binomial logistic regression to log-linear analysis and then see, by studying a worked example, how to estimate and interpret logistic regression models. We then consider three additional worked examples to expand the applications of binary logistic regression to progression and similar models, to discrete-time hazard-rate models, and to case-control designs.
INTRODUCTION
Often social scientists are confronted with the need to analyze categorical dependent variables: whether people vote, for whom they vote, their degree of agreement with a particular attitude, their choice of occupation, and so on. Although, as we have seen, regression procedures can easily handle categorical independent variables, they are not appropriate for categorical dependent variables, even dichotomies. In the case of dichotomous dependent variables, the assumptions of multiple regression, including in particular that errors of prediction are normally distributed, break down badly, often yielding seriously misleading results; moreover, predicted values often lie outside the logically possible range (zero to one). For these reasons a variety of procedures have been developed for dealing with dichotomous dependent variables, of which one of the most common is logit analysis or (synonymously) logistic regression, which uses maximum likelihood estimation. Logistic regression can be readily extended to handle dependent variables with more than two categories (multinomial logistic regression) and ordered sets of categories (ordered logistic regression). We will consider these two extensions in the next chapter. But we start with binomial logistic regression.
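The point about predicted values falling outside the zero-to-one range is easy to see with a toy example. The sketch below fits an ordinary least-squares line to a made-up dichotomous outcome (the data are purely illustrative) and lists the predictions that are not valid probabilities.

```python
# Hypothetical data: one predictor and a dichotomous (0/1) outcome.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least-squares slope and intercept for a single predictor.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]

# Predictions below 0 or above 1 cannot be interpreted as probabilities.
out_of_range = [(xi, round(p, 3)) for xi, p in zip(x, predicted) if p < 0 or p > 1]
print(out_of_range)
```

In this example the fitted line dips below zero at the low end of the predictor and rises above one at the high end, which is exactly the problem that modeling the log odds avoids.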
MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood estimation refers to a framework for estimating parameters of statistical models. The principle, which underlies estimation of log-linear and logistic regression models, is to find the value of the parameter that maximizes the likelihood of observing the sample data. (See Appendix 12.B for a brief overview of maximum likelihood estimation; King [1989], Eliason [1993], Long [1997, 25-33, 52-61], and Powers and Xie [2000, Appendix B] for accessible introductions to the topic; and Gould and Sribney [1999] for a technical discussion of how to do likelihood estimation in Stata.)
PROBIT ANALYSIS
An alternative to logistic regression, more widely used in economics than in sociology, is probit analysis. The two procedures generally yield similar results, and the choice between them is largely a matter of professional convention. (See Appendix 13.B for a brief introduction to probit analysis.)
Binomial logistic regression is a procedure for predicting, from a set of independent variables, the log odds that individuals will be in each of two categories of a dichotomous dependent variable. The formula for logistic regression is

$$\ln\left(\frac{F_{1 \mid X_1 \ldots X_K}}{F_{2 \mid X_1 \ldots X_K}}\right) = a + \sum_{k=1}^{K} b_k X_k \qquad (13.1)$$

for $K$ independent variables, $X_k$, where the $a$ and $b_k$ are coefficients analogous to OLS regression coefficients, and the dependent variable is the natural log of the expected odds of being in category 1 of the dependent variable rather than in category 2, conditional on the values of the independent variables. Thus, logistic regression is a specific case of the generalized linear model.

It is also true (and can be easily shown by dividing through by $N$) that the log odds of the expected conditional frequency distribution of the dichotomous dependent variable equals the log of the ratio of the expected probabilities of being in each of the two categories:
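$$\ln\left(\frac{F_{1 \mid X_1 \ldots X_K}}{F_{2 \mid X_1 \ldots X_K}}\right) = \ln\left(\frac{(p = 1)}{(p = 2)}\right) = \ln\left(\frac{(p = 1)}{1 - (p = 1)}\right) \qquad (13.2)$$

The dependent variable (the log odds) is known as the logit. As we have just seen, logits may be expressed either in terms of frequencies or in terms of probabilities.

RELATION TO LOG-LINEAR ANALYSIS
The relationship of the logit specification to log-linear analysis is straightforward, as can be seen with the help of a little algebra. Consider a log-linear model in which there are three variables, $Y$, $A$, and $B$, and regard $Y$ as the dependent variable. Now consider the saturated model relating the three variables. Under this model, expected cell frequencies are estimated as

$$\ln(F_{ijk}^{YAB}) = \theta + \lambda_i^Y + \lambda_j^A + \lambda_k^B + \lambda_{ij}^{YA} + \lambda_{ik}^{YB} + \lambda_{jk}^{AB} + \lambda_{ijk}^{YAB} \qquad (13.3)$$

Now, because the dependent variable, $Y$, is dichotomous, we can easily derive the log odds of being in category 1 (rather than in category 2) of $Y$ from Equation 13.3 (for those who are rusty on school algebra, some algebraic relationships involving logarithms and exponents are shown in Appendix 13.A):

$$\begin{aligned}
\ln(F_{1jk}^{YAB}/F_{2jk}^{YAB}) &= \ln(F_{1jk}^{YAB}) - \ln(F_{2jk}^{YAB}) \\
&= (\theta + \lambda_1^Y + \lambda_j^A + \lambda_k^B + \lambda_{1j}^{YA} + \lambda_{1k}^{YB} + \lambda_{jk}^{AB} + \lambda_{1jk}^{YAB}) \\
&\quad - (\theta + \lambda_2^Y + \lambda_j^A + \lambda_k^B + \lambda_{2j}^{YA} + \lambda_{2k}^{YB} + \lambda_{jk}^{AB} + \lambda_{2jk}^{YAB}) \\
&= (\lambda_1^Y - \lambda_2^Y) + (\lambda_{1j}^{YA} - \lambda_{2j}^{YA}) + (\lambda_{1k}^{YB} - \lambda_{2k}^{YB}) + (\lambda_{1jk}^{YAB} - \lambda_{2jk}^{YAB})
\end{aligned} \qquad (13.4)$$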
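But because the $\lambda$s must sum to zero across each dimension, $\lambda_1^Y = -\lambda_2^Y$, and so on. So we have

$$\ln(F_{1jk}^{YAB}/F_{2jk}^{YAB}) = 2\lambda_1^Y + 2\lambda_{1j}^{YA} + 2\lambda_{1k}^{YB} + 2\lambda_{1jk}^{YAB} \qquad (13.5)$$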
n(rff" lrif'):z^i +2^i"+2^:{+2^:i" rather thn of the den11d9nt lariable In short, the log odds of being in one category
theusql"cl"1,"::f:^':T'::Tlit""::"fffi ofiwice sum *l ei"*uv,r'" *'"T:;"i111'^3]: alone variables' "ir'"' ;uiiitiltT;"#r,,oi',i."1"0"p""0"* :l ::t::5 thecoefficient),fl.expressin*theass nJolr,"rraturated modelsNotethat t* or urca dropsour "il" uarrutl"t'itpvariables' ino"p"no"nt theindependent the between donbetween don
"itr'" "**Jlo:t:.)il"'li in publishedt
tegressions Thus we can carry out uinodd logi*tic :t'h":::i
dt+]tl,ttl^"1"::li:,t"::":::':::X"i: .^rtt'ir i.t;r ;logrinear "."rv*iJ"'3 coet!Lrlsareexpressad wtr"nttre^togunear #lffi;*;;il;;i; ;iff:l#
froni a referencecategorywith an impt dummyvariablefo.aut$tut t', u' i"niations coefflcientof zero. '"',qrit"tgrtlrg;,u"t""" ** *"tt Arl'uuts''u6" anatvsisandlogisticregress.ionaremarhemati:illli::::,":'"Trtt as a specialcaseof 1oglinearanallris separateorigins.Logit analysis.wz o"u"roped as de9e1{ent which one (dichotomous.)vattaor" is regarde! 1n." 1"11t'":}'t;:,::t::S
to *"1 o*"r"'p"a bv statisticiansand^econometricians l"#;::;;#;;;;i* d"t"tg*t:"""u:::l::1":::::T:3"1:1*:l dichotomous that L*'**"**;ifo pruurerun ule problems wirh the wrur independentvariables'' handlecontinuous regression.Therefore it was devek th"g ll9:,11lT,i9:':,"':t::'i:?,Tiri; the *" statrstri'al to gouu'Lruuuvuu" introduction a a sood "*;s with sociologicit examples'seeLong [1997] treatm For and Lemeshow[2000]. text' seeLong andFreese[2006]') PowersandXie [2000].For a Stataoriented
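The algebra relating log-linear coefficients to logits (Equations 13.3 through 13.5) can be checked numerically. The sketch below uses a made-up 2 x 2 x 2 table of expected frequencies and computes the saturated-model coefficients under the sum-to-zero constraints by averaging log frequencies; the table values and function names are assumptions for illustration.

```python
import math
from statistics import mean

# Hypothetical expected frequencies F[y, a, b] for a 2 x 2 x 2 table.
F = {
    (1, 1, 1): 40.0, (1, 1, 2): 25.0, (1, 2, 1): 30.0, (1, 2, 2): 10.0,
    (2, 1, 1): 20.0, (2, 1, 2): 35.0, (2, 2, 1): 15.0, (2, 2, 2): 45.0,
}
log_f = {cell: math.log(freq) for cell, freq in F.items()}

def avg(y=None, a=None, b=None):
    """Mean log frequency over all cells matching whichever categories are fixed."""
    return mean(value for (cy, ca, cb), value in log_f.items()
                if y in (None, cy) and a in (None, ca) and b in (None, cb))

theta = avg()  # grand mean of the log frequencies

# Saturated-model coefficients under sum-to-zero (effect) coding.
def lam_y(y): return avg(y=y) - theta
def lam_a(a): return avg(a=a) - theta
def lam_b(b): return avg(b=b) - theta
def lam_ya(y, a): return avg(y=y, a=a) - theta - lam_y(y) - lam_a(a)
def lam_yb(y, b): return avg(y=y, b=b) - theta - lam_y(y) - lam_b(b)
def lam_ab(a, b): return avg(a=a, b=b) - theta - lam_a(a) - lam_b(b)
def lam_yab(y, a, b):
    return (log_f[y, a, b] - theta - lam_y(y) - lam_a(a) - lam_b(b)
            - lam_ya(y, a) - lam_yb(y, b) - lam_ab(a, b))

checks = []
for a in (1, 2):
    for b in (1, 2):
        log_odds = math.log(F[1, a, b] / F[2, a, b])             # left side of 13.5
        twice_sum = 2 * (lam_y(1) + lam_ya(1, a) + lam_yb(1, b)  # right side of 13.5
                         + lam_yab(1, a, b))
        checks.append((log_odds, twice_sum))
        print(f"A={a}, B={b}: log odds = {log_odds:.4f}, twice the sum = {twice_sum:.4f}")
```

The coefficient for the association between the two independent variables, `lam_ab`, is needed to compute the three-way term but never appears in the final sum, which mirrors the way it drops out of Equation 13.4.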
A WORKED LOGISTIC REGRESSION EXAMPLE: PREDICTING PREVALENCE OF ARMED THREATS
Suppose we are interested in what determines the likelihood that a person has ever been threatened with a gun. Moreover, suppose we are interested in ascertaining whether the prevalence of armed threats has changed over time. (Investigating this latter question provides an occasion for demonstrating how to make comparisons using the GSS.) We might suspect that males are more likely to have experienced armed threats than are females. Not only has some fraction of the male population been in combat, unlike women (until very recently), but men also tend to be more likely to be involved in [...]. Second, we might expect the likelihood of being threatened to be negatively correlated with socioeconomic status (SES), owing both to residential segregation and to status differences in leisure activities. For convenience, I take educational attainment as an indicator of socioeconomic status because, unlike occupational status and income, it is essentially fixed over the adult life course and is interpretable equivalently for men and women. Third, it is likely that Blacks are subject to more armed threats than are members of other racial groups, net of SES, given patterns of residential discrimination that force even middle-class Blacks to live in high-crime neighborhoods. Finally, claims about the breakdown of civility in America would suggest that the prevalence of armed threats has been increasing over time.

Data to assess these possibilities are available in the GSS. In most years from 1973 to 1994 respondents were asked, "Have you ever been threatened with a gun, or shot at?" In addition, the sex, race (White, Black, or Other), and education (years of school completed, ranging from 0 to 20) of each respondent were ascertained. I omitted 5,031 cases in which the gun question was not answered (mostly because in several years the question was asked only of a subsample of respondents), an additional [...] cases in which information on education was missing, and an additional 16 cases in which information on the number of adults in the household (used to construct the weight variable) was missing. These procedures yielded an effective sample of 19,260 people for the years 1973 through 1994. I carried out the analysis using survey estimation and treating each year as a stratum. (For estimation details, see Appendix B and the downloadable files "ch13_1.do" and "ch13_1.log".)

Table 13.1 confirms that a substantially higher percentage of males than of females and a moderately higher percentage of Blacks than members of other races have ever been threatened by a gun. It is difficult to see a consistent pattern with respect to either educational attainment or year, but it is possible that each variable suppresses the effect of the other because education has been increasing over time.
TECHNICAL POINT ON TABLE 13.1

Note that in Table 13.1 the percentages are based on weighted frequencies but the unweighted percentage bases are shown. I weighted the data to take account of differential household size, to adjust for an oversample of Blacks in 1987, and to equalize the contribution of each year (see downloadable file "ch13_1.do" for details). For descriptive statistics, it is necessary to use the weighted data to get correct estimates for the population. But it is desirable to show the unweighted N's to reveal to the reader the actual number of cases on which each computation is based.
My first task is to choose a preferred model. Table 13.2 shows goodness-of-fit statistics for five models. Model 1 is a baseline model, positing that sex, race, and education affect the odds of being threatened by a gun. Model 2 in addition posits a linear trend in the (log) odds of being threatened, net of the effects of sex, race, and education. If the likelihood of being threatened has been increasing over time, the coefficient associated with year should be positive. Model 3 posits year-to-year variation around any linear trend in the (log) odds of being threatened. Models 1, 2, and 3 stand in a hierarchical relation to each other. Model 4 posits that the log odds of being threatened depend on sex, race, and education; that the log odds increase over time in a linear fashion; and that sex and race interact, the hypothesis being that gender differences in the likelihood of being threatened will be smaller for Blacks than for others because, owing to
TABLE 13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).

Note: Percentages are based on weighted frequencies; see the box "Technical Point on Table 13.1." The percentage bases shown are unweighted frequencies.
residential discrimination, Blacks are more likely than others to live in dangerous neighborhoods, and hence Black women are particularly vulnerable to being threatened. Model 5 extends the same argument to include an interaction between race and education, positing a smaller effect of education on the odds of gun threat for Blacks than for others because of the residential vulnerability of even well-educated Blacks.

If the GSS were a simple random sample of the U.S. population, it would be possible to compare nested models by using the likelihood ratio $\chi^2$s, or $L^2$s (reported in Stata output from the logistic command as LR chi2); $L^2$ is defined as twice the difference between the model log likelihood and the log likelihood for a model with no independent variables. If we were analyzing a simple random sample, the $L^2$s would be distributed approximately as $\chi^2$, as would the difference between any pair of $L^2$s. In such cases we could assess whether one model fits significantly better than another by assessing the significance of the difference between two $L^2$s, with degrees of freedom calculated as the difference in the degrees of freedom associated with the two models. However, when we use weighted or clustered data, or design-based estimation procedures, what Stata estimates is actually
$R^2 = 1 - L_1/L_0$, where $L_0$ is the log likelihood for a constant-only model (that is, a model with no independent variables), and $L_1$ is the log likelihood for the estimated model. Obviously, if the dependent variable is perfectly explained by the set of independent variables, $L_1 = 0$ and pseudo-$R^2 = 1$, and if the independent variables explain nothing, $L_1 = L_0$ and pseudo-$R^2 = 0$. Thus the pseudo-$R^2$ gives a sense of how well a model does. However, in the case of weighted or clustered data, pseudo-log likelihoods are estimated, pseudo-log likelihoods can increase rather than decrease for more complete models, and hence the pseudo-$R^2$s can decrease, which makes little sense. More generally, when pseudo-log likelihoods are estimated, there is no simple relationship between changes in pseudo-log likelihoods and improvements in the goodness of fit, so the pseudo-$R^2$s are uninterpretable. For the same reason, BIC is inappropriate for design-based estimation because it also is based on a comparison of log likelihoods. (For random samples, BIC for logistic regression is estimated by $-L^2 + (d.f.)[\ln(N)]$. The signs are opposite to those for Equation 12.8 because here the comparison of interest is not with a saturated model but with a baseline model in which predictions are based on the intercept alone.) When we have data based on complex samples, as in the current case, survey estimation (implemented by the Stata command svy: logistic) is the best available tool, with models compared through adjusted Wald tests.
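For the simple-random-sample case only, the three fit statistics just described can be computed directly from two log likelihoods. The sketch below is illustrative: the function names and the log-likelihood values are hypothetical, and, as the text stresses, none of these quantities is interpretable when pseudo-log likelihoods from weighted or clustered data are used.

```python
import math

def likelihood_ratio_chi2(ll_model, ll_null):
    """L-squared: twice the difference between the model log likelihood
    and the log likelihood of the constant-only model."""
    return 2.0 * (ll_model - ll_null)

def pseudo_r2(ll_model, ll_null):
    """Pseudo-R-squared = 1 - L1/L0."""
    return 1.0 - ll_model / ll_null

def bic_vs_baseline(l2, df, n):
    """BIC relative to the intercept-only baseline: -L2 + d.f. * ln(N).
    More negative values favor the model over the baseline."""
    return -l2 + df * math.log(n)

# Hypothetical values, for illustration only.
ll_null = -1000.0    # constant-only model
ll_model = -900.0    # model with 3 independent variables
n_cases = 2000

l2 = likelihood_ratio_chi2(ll_model, ll_null)
print("L2        =", l2)
print("pseudo-R2 =", round(pseudo_r2(ll_model, ll_null), 3))
print("BIC       =", round(bic_vs_baseline(l2, 3, n_cases), 1))
```

A model that explains nothing has `ll_model == ll_null`, which gives a pseudo-R-squared of zero and an L-squared of zero, matching the limiting cases described above.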
LIMITATIONS OF WALD TESTS

Statistical inference for complex samples is at present an unsettled issue. As we saw in Chapter Nine, when the clustering of observations typical of multistage probability samples is ignored, the standard errors of statistics may be substantially biased: they typically are underestimated but in some instances may be overestimated. But the proposed corrections have their own limitations, both theoretically and practically. In particular, Wald tests are known to have poor properties, which may produce misleading results (Gould and Sribney 1999, 78); and as noted, BIC is not available for weighted or clustered samples. The optimal solution may be to treat clustered samples in a multilevel context, estimating either fixed or random effects models (Mason 2001), which can be done in Stata using the xt or gee command; these procedures go beyond what can be covered in this book, but see Chapter Sixteen for a brief introduction to multilevel analysis. Although even now much that is published, even in leading journals, simply ignores complex sample designs and treats data as if they were generated by random sampling procedures, this is generally inappropriate and can lead to incorrect inferences. For the present, I suggest for logistic regression in its various forms that when you have data that are weighted or clustered you carry out your estimation using Stata's survey estimation commands and rely on adjusted Wald tests for model selection. Be cautious, however, in your interpretation and explore alternative specifications. Only where you have a true, unweighted, random sample should you use the logistic command and likelihood ratio test (lrtest). Further, whenever possible, eschew weighting in favor of including the variables used to create the weights in the model.
Inspecting the Wald-test statistics in the bottom panel of Table 13.2, we see that Model 2 fits better than Model 1, but no model fits significantly better than Model 2. I thus conclude that the likelihood of armed assault depends on gender, race, and education and also changes over time in a linear way. To see the nature of these relationships, we examine the coefficients in Table 13.3. I also have included the coefficients for Model 4 in Table 13.3, even though Model 4 is only a marginally significant improvement over Model 2 (p = .092). I do this to illustrate how to deal with interaction terms in the context of logistic regression.
TABLE 13.3. Effect Parameters for Models 2 and 4 of Table 13.2.

Note: Black versus non-Black. Non-Blacks consist of "Whites" and "Others."
There are two alternative (but mathematically equivalent) ways to interpret logistic regression coefficients: to consider the additive effects of each independent variable on the log odds of the dependent variable, and to consider the multiplicative effects of each independent variable on the odds of the dependent variable. Consider first the log-odds effects. We can interpret the contributions to log odds just as we interpret the coefficients in an OLS regression equation: a one-unit difference in the independent variable results in a difference of $b$ units in the log odds, holding constant all other variables. Thus, for example, in Model 2 of Table 13.3 the coefficient for males gives the expected difference between the log odds of males and of females being threatened by a gun, holding constant race, education, and the year of the survey. And, all else equal, the log odds of having been threatened in 1994 are about 0.21 greater than in 1973 (precisely, 0.2121 = 0.0101*[1994 − 1973]).

Although the interpretation is straightforward, log odds are not very intuitively meaningful. Hence, a more appealing possibility is to interpret the effects on the odds: a one-unit difference in the independent variable multiplies the odds by $e^{b}$. To see this, start from the logistic regression equation:

$$
\ln\bigl(\text{odds}(Y = 1)\bigr) = a + \sum_{k=1}^{K} b_k X_k
\tag{13.6}
$$

Exponentiating both sides of the equation, we get

$$
\text{odds}(Y = 1) = e^{\,a + \sum_{k=1}^{K} b_k X_k} = e^{a} \prod_{k=1}^{K} e^{b_k X_k}
\tag{13.7}
$$
That is, the odds of being in category 1 are given by the product of the exponentiated terms. The exponentiated coefficients thus can be interpreted as contributions to odds ratios: $e^{b}$ is the factor by which the expected odds are multiplied for each one-unit difference in the independent variable, holding constant all other independent variables. Thus, for example, in Model 2 the expected odds of males being threatened by a gun are 4.15 times the odds of females being threatened, holding constant race, education, and the year of the survey. Similarly, the expected net odds of having been threatened in 1994 are about 1.24 times the corresponding odds in 1973 (because $e^{0.2121} = 1.236$).

What do these results mean substantively? We can conclude that, net of other factors, the odds of males ever having been threatened are about four times as great as for females; the expected odds of having been threatened decline slightly with increasing education (the odds of being threatened for those with at least a BA are about 14 percent less than for those with only an eighth-grade education) but increase modestly over time, as we have seen; and the odds of Blacks having been threatened in any year are more than 1.5 times as great as for non-Blacks of the same sex with the same amount of education (precisely, $1.56 = e^{0.4461}$).

Now let us consider Model 4. Note that the coefficients for education and year hardly change. Thus we can restrict ourselves to the interpretation of the coefficients for MALE and BLACK and their interaction. A convenient way to see how to interpret these coefficients is to evaluate the equation for fixed values of YEAR and EDUCATION. Let us choose 1994 and twenty years of education as our values for these variables, to assess the effect of race and sex on the probability of ever having been threatened by a gun. We thus have a new intercept: $a' = a + b_E \cdot 20 + b_Y \cdot 94 = -2.9037 - 0.0191 \cdot 20 + 0.0101 \cdot 94 = -2.3363$ (where $b_E$ is the coefficient for education and $b_Y$ is the coefficient for year of survey). Then we write out the expected log odds of having experienced a gun threat (for convenience, call this $G$) by race and sex (where $b_M$ is the coefficient for MALE, $b_B$ is the coefficient for BLACK, and $b_{BM}$ is the coefficient for the interaction term).

For non-Black females we have

$$
G = a' = -2.3363
\tag{13.8}
$$

For Black females we have

$$
G = a' + b_B = -2.3363 + 0.5690 = -1.7673
\tag{13.9}
$$

For non-Black males we have

$$
G = a' + b_M = -2.3363 + 1.4543 = -0.8820
\tag{13.10}
$$

For Black males we have

$$
G = a' + b_B + b_M + b_{BM} = -2.3363 + 0.5690 + 1.4543 - 0.2125 = -0.5255
\tag{13.11}
$$
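The four sums for Model 4 are easy to reproduce. The sketch below is illustrative rather than the author's code: the coefficient values are those appearing in the worked arithmetic, with the signs of the intercept, the education coefficient, and the interaction term inferred from that arithmetic, and it assumes survey year is coded as a two-digit number (94 for 1994). The constant and function names are hypothetical.

```python
import math

# Model 4 coefficients as used in the worked arithmetic
# (signs and two-digit year coding are assumptions).
A = -2.9037             # intercept
B_EDUC = -0.0191        # years of schooling
B_YEAR = 0.0101         # survey year, coded 73..94
B_MALE = 1.4543
B_BLACK = 0.5690
B_BLACK_MALE = -0.2125  # interaction of BLACK and MALE

def log_odds(male, black, educ=20, year=94):
    """Expected log odds (G) of ever having been threatened by a gun."""
    return (A + B_EDUC * educ + B_YEAR * year
            + B_MALE * male + B_BLACK * black
            + B_BLACK_MALE * male * black)

groups = {
    "non-Black females": (0, 0),
    "Black females": (0, 1),
    "non-Black males": (1, 0),
    "Black males": (1, 1),
}

for label, (male, black) in groups.items():
    g = log_odds(male, black)
    # Exponentiating the log odds gives the expected odds.
    print(f"{label:18s} G = {g:8.4f}   odds = {math.exp(g):.4f}")
```

Changing the `educ` or `year` defaults shifts all four log odds by the same amount, which is why the race and sex contrasts do not depend on the particular values chosen.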
We see, both from the coefficients in Table 13.3 and from the sums just shown, that the expected log odds of being threatened are 1.45 larger for non-Black males than for non-Black females; that the expected log odds of being threatened are 0.57 larger for Black females than for non-Black females; but that Black males do not face the full double jeopardy of being male and Black, because their expected log odds are 0.21 less than the sum of the MALE coefficient and the BLACK coefficient. Or, to put it differently, the gender difference is greater for non-Blacks than for Blacks, and the race difference is greater for females than for males. These results are as hypothesized except that the interaction is too weak for us to have much confidence in it.

Again, the interpretation is easier if we consider the odds ratios rather than the logits. One way to do this is simply to take the antilog of the logits we just computed (the $G$s). Doing this, we see that the expected odds of ever having been threatened by a gun among people with twenty years of schooling in 1994 are 0.10 for non-Black females ($= e^{-2.3363}$), 0.17 for Black females, 0.41 for non-Black males, and 0.59 for Black males. Note that the odds ratios are just what are shown in the rightmost column of Table 13.3 (within the limits of rounding error): the odds of non-Black males having been threatened are 4.3 times as large as the odds of non-Black females having been threatened (0.4140/0.0967 = 4.2813 ≈ 4.2817); the odds of Black males having been threatened are about 3.5 times as large as the odds of Black females having been threatened (0.5913/0.1708 = 3.4619 ≈ 3.4618 = 4.2817*0.8085); and so on. We can see this most clearly by writing out the odds just as we did for the logits. For non-Black females we have
$$
e^{G} = e^{a'} = 0.0967
\tag{13.12}
$$

For Black females we have

$$
e^{G} = e^{a'} e^{b_B} = (0.0967)(1.7665) = 0.1708
\tag{13.13}
$$

For non-Black males we have

$$
e^{G} = e^{a'} e^{b_M} = (0.0967)(4.2817) = 0.4140
\tag{13.14}
$$

For Black males we have

$$
e^{G} = e^{a'} e^{b_B} e^{b_M} e^{b_{BM}} = (0.0967)(1.7665)(4.2817)(0.8085) = 0.5913
\tag{13.15}
$$

One other coefficient is sometimes useful: the percentage change in the odds, given by $100(e^{b} - 1)$. For example, from Model 2 in Table 13.3 we would conclude that, all else equal, the odds of Blacks having ever been threatened or shot at are 56 percent greater than the corresponding odds for non-Blacks because 100(1.56 − 1) = 56.

However, even though odds ratios are readily interpreted, expected odds are still not particularly intuitive. Thus it would be useful to convert expected odds into percentages. For example, in the present case it would be helpful to get the expected percentage of individuals in each race-by-sex group who have ever been threatened, net of education and survey year. That is, we would like to get the adjusted percentages implied by the model so that we can assess percentage differences between race and sex groups, controlling for education and year of survey. We can do this by making use of the relationship
$$
Pct(Y) = 100\left[\frac{x}{x + 1}\right]
\tag{13.16}
$$

where $x$ is the odds of $Y$ for specified values of the independent variables. Note that because the relationship between the odds and the percentages is nonlinear, we need to choose specific values of the independent variables for which we wish to make the conversion. Here I use the same values for which we evaluated Model 4; that is, I get expected percentages by race and sex among people with twenty years of education in 1994. For example, for non-Black women, we have $Pct(Y) = 100[0.0967/(0.0967 + 1)] = 8.8$. The corresponding percentages are, respectively, for Black women, 14.6; for non-Black men, 29.3; and for Black men, 37.2. If we wished, we could, of course, construct an entire table of such expected percentages for various values of education and year of survey. Doing this requires a fair amount of hand calculation. However, in conjunction with their text on Stata procedures for handling limited dependent variables, Long and Freese (2006) developed a set of Stata ado files that automate the computation of these and other statistics for interpreting logistic regression coefficients. (Those wishing to explore these files should start with Long's web page: http://www.indiana.edu/jslsoc. Follow the links for Long and Freese's book.)
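The hand calculation can be delegated to a few lines of code. The following sketch (with hypothetical function names, and not a substitute for the Long and Freese ado files) converts the four expected odds into expected percentages and also computes the percentage change in the odds for a single coefficient.

```python
import math

def odds_to_percent(odds):
    """Convert expected odds into an expected percentage: 100 * x / (x + 1)."""
    return 100.0 * odds / (odds + 1.0)

def percent_change_in_odds(b):
    """Percentage change in the odds per unit of the variable: 100 * (e^b - 1)."""
    return 100.0 * (math.exp(b) - 1.0)

# Expected odds for people with twenty years of education in 1994 (Model 4).
expected_odds = {
    "non-Black women": 0.0967,
    "Black women": 0.1708,
    "non-Black men": 0.4140,
    "Black men": 0.5913,
}

for label, odds in expected_odds.items():
    print(f"{label:16s} {odds_to_percent(odds):5.1f} percent")

# Model 2 coefficient for BLACK, as reported in the text.
print(round(percent_change_in_odds(0.4461)), "percent greater odds for Blacks")
```

Because the conversion is nonlinear, the same odds ratio implies different percentage-point gaps at different baseline odds, which is why the independent variables must be fixed at specific values first.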
A SECOND WORKED EXAMPLE: SCHOOLING PROGRESSION RATIOS IN JAPAN

In the educational stratification literature an important hypothesis is that the dependence of educational attainment on the social status of one's parents decreases as education increases. This hypothesis has been operationally specified in terms of progression ratios from one level of schooling to the next (Mare 1980, 1981). That is, we can ask what affects the odds that those at any given level of education go on to the next level: that primary school graduates enter secondary school, that those who enter secondary school graduate, that secondary school graduates go on to college or university, and so on. Once we specify the problem in this way, it is evident that it is a logistic regression problem, but one of a particular kind. The distinctive feature of this sort of problem is that any individual may make several transitions. It also should be evident that the formal structure of the problem is identical to that of many nonreversible transitions; for example, in criminology, from arrest to arraignment to trial to conviction to sentencing; in medical research, the transition through various stages of a disease; and so on. We tackle problems with this sort of formal structure by pooling data for all transitions into a single data set and then analyzing not a sample of people but rather a sample of transitions.

To see how this is done, consider an analysis of trends in educational attainment in Japan, carried out by Treiman and Yamaguchi (1993). Here, to illustrate the method,
I present only the portion of our analysis concerned with the transition from middle school to higher secondary school and from higher secondary school to college or university in Japan. The data set included 1,320 men who completed their education during the postwar period. Because education up to the middle school level is compulsory in Japan, all 1,320 middle school graduates were "at risk" of obtaining at least some higher secondary education. Of these, 1,056 did so and hence were at risk of continuing on to college or university. Pooling those at risk of making the first transition with those at risk of making the second, we have 2,376 (= 1,320 + 1,056) transition possibilities to study. For each of these cases, we create a dummy variable, SUCCESS (S), scored 1 if the transition was made and 0 otherwise. We distinguish the two transitions by a dummy variable, TRANSITION (T), scored 1 for the transition from higher secondary to tertiary education, and 0 otherwise. We then estimate a series of logistic regression equations in which the dependent variable is the natural log of the odds of successfully making a transition and the independent variables are the transition variable, measures of the status of parents, year of birth (to study trends), and various interactions among the variables. Table 13.4
TABLE 13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).

Columns: Model, $L^2$, d.f., BIC.

Source: Adapted from Treiman and Yamaguchi (1993, Table 10.4).
Note: All models and contrasts are significant at the .000 level except the (4) versus (3) contrast, which is significant at the .223 level.
TABLE 13.5. Effect Parameters for Model 3 of Table 13.4.

Source: Adapted from Treiman and Yamaguchi (1993, Table 10.5); Treiman and Yamaguchi did not report standard errors.
shows goodness-of-fit statistics for various models of the educational transition process, and Table 13.5 shows effect parameters for the preferred model. (This analysis was carried out before design-based estimation was generally available. Thus no account was taken of the clustering of the sample. In addition to the usual clustering by sampling unit typical of national surveys, transition-ratio models are clustered by person because the transitions made by any one person are hardly independent. Thus in addition to any other adjustments for clustered samples, the observations for each individual should be treated as nonindependent.)

From Table 13.4 we see that Model 3 fits best according to both the likelihood ratio test and BIC. The model posits that the effect of social origins varies across transitions and also that the odds of making the two transitions change over time (but the effect of social origins does not change over time). From the point of view of our a priori hypothesis, that the effect of social origins declines with successive transitions, the contrast between Models 2 and 1 is particularly noteworthy. Model 1 posits that the odds of moving to the next higher level of education are affected by the social status of one's parents (specifically, parents' education and father's occupational status, measured by prestige) but that the relationship is the same regardless of which transition is considered. Model 2, by contrast, posits both that the odds of making the transition depend on which transition is considered and
that the relationship between social origins and the odds of making a transition depends on which transition is considered. Model 2 represents our a priori hypothesis.

As we see, Model 2 is far more likely than Model 1 given the data, but Model 3, which also posits a temporal shift in the odds of making each transition, is still more likely. Thus we have preliminary support for our hypothesis, but we also have evidence that the transition process had changed over time. (This point is further explored in the paper but need not concern us here.)

In retrospect, the contrasts presented by Treiman and Yamaguchi are not wholly satisfactory. It would have been better to include a model intermediate between Model 1 and Model 2; that is, a model that posits a difference in the odds of making successive transitions but with the effect of social origins constrained to equality across transitions. The difficulty is that we do not know whether Model 2 is superior to Model 1 because the odds of making the transition vary across transitions, or because the effect of social origins varies across transitions, or both. The same point can be made with respect to the effect of birth year: a model intermediate between Model 2 and Model 3 would have been desirable.

Actually, all that the coefficients in Table 13.4 tell us is that a model that posits different effects of social origins for different transitions is more likely than a model that posits the same effect. To pin down our claim, we need to inspect the effect parameters, reported in Table 13.5, to be sure that they have the predicted sign.
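The pooled data set of transitions used in this example can be sketched as follows. This is an illustration of the pooling logic only, not the code used in the original analysis; the input field names (`entered_secondary`, `entered_tertiary`) and the toy records are hypothetical, while S and T follow the SUCCESS and TRANSITION dummies defined for this example.

```python
def stack_transitions(people):
    """Turn a list of person records into a list of transition records.

    Each person contributes one record per transition for which he was
    "at risk". T = 0 marks the transition from middle school to higher
    secondary school; T = 1 marks the transition from higher secondary
    school to college or university. S = 1 if the transition was made.
    """
    records = []
    for person in people:
        # Middle school is compulsory, so everyone is at risk of transition 1.
        records.append({"id": person["id"], "T": 0,
                        "S": int(person["entered_secondary"])})
        # Only those who entered higher secondary school are at risk
        # of transition 2.
        if person["entered_secondary"]:
            records.append({"id": person["id"], "T": 1,
                            "S": int(person["entered_tertiary"])})
    return records

# Hypothetical toy data: three men with different schooling histories.
toy_people = [
    {"id": 1, "entered_secondary": False, "entered_tertiary": False},
    {"id": 2, "entered_secondary": True, "entered_tertiary": False},
    {"id": 3, "entered_secondary": True, "entered_tertiary": True},
]

for record in stack_transitions(toy_people):
    print(record)
```

In the same way, 1,320 middle school graduates of whom 1,056 entered higher secondary school yield 1,320 + 1,056 = 2,376 transition records. Because one person can contribute two records, the records are not independent, which is the clustering-by-person issue raised above.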
Table 13.5 shows the effect parameters associated with the preferred model. Note that I do not report the standard errors or p-values for individual coefficients. Because all of the "main effects" in the model also appear in "interaction terms," the appropriate way to assess the effect of a single dimension is to contrast models with and without the variables representing that dimension. I have done this in Table 13.4, but only for selected contrasts rather than every possible pair of models. (Raftery discusses S-Plus software that makes it possible to choose the most likely model given the data from among all possible models involving a given set of variables. Interested readers should consult Raftery [1995a].)

Note that the treatment of standard errors in Table 13.5 contrasts with Table 13.3, where I do show the standard errors and p-values. The difference is that Table 13.3 shows only one interaction, so the p-value associated with the interaction term indicates the significance of the difference in the fit of models including and not including the interaction term. Where a model includes both variables for which individual significance tests are meaningful and variables for which they are not because they are confounded by interactions (or other transformations such as squared terms), the usual practice is to report all significance tests and p-values. It might be preferable, however, only to report significance statistics when they are meaningful, in order to preclude incorrect interpretation.

This model suggests that the process of moving from one level of education to the next in Japan is about as we would expect it to be: the odds of making each transition vary positively with parents' education and with the status of the father's occupation. Of greater interest are the coefficients of the interaction terms T*E and T*P.
These are both negative, which indicates that, as hypothesized, in these data the effect of social origins on going on to the next level of education is weaker for the transition from higher secondary school to university than for the transition from middle school to higher secondary school. Each year of average parental education increases the odds of making the first transition, from middle school to higher secondary school, by about 40 percent (because $e^{0.3480} = 1.416$) but increases the odds of the second transition, from secondary school to university, by only about 35 percent (because $e^{(0.3480 - 0.0503)} = 1.347$). Thus, for example, all else equal, the odds that a son of a university graduate will go on to higher secondary school are more than 11 times as great as the corresponding odds for the son of a middle school graduate (because $1.416^{(16-9)} = 11.414$). By contrast, among those who managed to get into higher secondary school, the odds that the son of a university graduate will go on to university are only eight times as great as the corresponding odds for the son of a middle school graduate (because $[(1.416)(0.951)]^{(16-9)} = 8.030$). Similarly, the net effect of each unit increment in the prestige of the father's occupation is to increase the odds of the first transition by about 6 percent (because $e^{0.0569} = 1.059$) but to increase the odds of the second transition by only 4 percent (because $e^{(0.0569 - 0.0180)} = 1.040$). Thus, for example, the net odds of the son of a shopkeeper (prestige score = 42) making the transition from middle school to higher secondary school are more than twice as great as the net odds of the son of a factory worker (prestige score = 29) making the transition (because $1.059^{(42-29)} = 2.107$). But the net odds of a shopkeeper's son making the transition from secondary to tertiary education are only about 66 percent greater than those for a factory worker's son (because $1.040^{(42-29)} = 1.665$). The effects of year of birth and of the interaction between transition and year of birth can be interpreted in a similar way.

As a reminder, the interpretation of contributions to log odds in models involving interaction terms is exactly the same as in ordinary least-squares regression (see Chapter Six): the appropriate coefficients are added. However, as we saw in the first worked example, exponentiated coefficients (contributions to odds ratios) are not added but rather multiplied. Thus, for example, the coefficient for parental education is 0.3480 for the first transition and 0.2977 (= 0.3480 − 0.0503) for the second transition. The corresponding exponentiated coefficients are 1.4162 for the first transition and 1.3468 (= 1.4162*0.9509) for the second. Of course, $1.3468 = e^{0.2977}$.
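The add-on-the-log-scale, multiply-on-the-odds-scale rule can be verified with the two parental education coefficients reported for this example. The sketch below is illustrative; the constant and function names are hypothetical.

```python
import math

B_E = 0.3480    # parents' education, first transition (T = 0)
B_TE = -0.0503  # interaction of TRANSITION and parents' education

def education_logit_effect(transition):
    """Contribution of one year of parental education to the log odds.
    Coefficients are added on the log-odds scale."""
    return B_E + B_TE * transition

def education_odds_multiplier(transition):
    """The same effect on the odds scale, built by multiplying
    exponentiated coefficients rather than adding them."""
    return math.exp(B_E) * math.exp(B_TE) ** transition

for t in (0, 1):
    print("transition", t + 1,
          "logit effect:", round(education_logit_effect(t), 4),
          "odds multiplier:", round(education_odds_multiplier(t), 4))
```

Exponentiating the summed coefficients and multiplying the exponentiated coefficients give the same answer, so either route can be used; what cannot be done is to add the exponentiated coefficients.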
A THIRD WORKED EXAMPLE (DISCRETE-TIME HAZARD-RATE MODELS): AGE AT FIRST MARRIAGE

One of the most powerful uses of binary logistic regression procedures is to estimate discrete-time hazard-rate models, sometimes called event history models. Hazard-rate models are those for the rate at which events occur or the likelihood that an event will occur at a specified time. There is a well-developed statistical technology for estimating such models, most of which is beyond the scope of this book. However, for a particular class of these models, in which time is treated as a set of discrete values and the interest is in estimating the likelihood that an event occurs in each period of time, conventional binomial logistic regression procedures can be used once the data are appropriately arranged. Indeed, as we will see, discrete-time hazard-rate models are formally equivalent to the educational transition model we just discussed.

The basic procedure is to create a person-period data set by stacking replicates of the original data set for each period for which each individual is "at risk" of the event occurring.
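A minimal sketch of the stacking step, using first marriage as the event and one record per year of age, may make the procedure concrete. It is written in Python rather than Stata, and the function name, field names, and the starting age of 15 are illustrative assumptions, not the contents of the downloadable do-file.

```python
def person_period_records(person_id, current_age, age_married=None, first_age=15):
    """Expand one person into one record per year of age at risk.

    The dependent variable `married` is 0 for every year before the event
    and 1 in the year the event occurs. Years after the event are never
    created, and neither are ages the person has not yet reached, so
    censored (never-married) people simply contribute a run of zeros.
    """
    if age_married is None:
        last_age = current_age
    else:
        last_age = min(age_married, current_age)
    rows = []
    for age in range(first_age, last_age + 1):
        rows.append({"id": person_id, "age": age,
                     "married": int(age_married == age)})
    return rows

# A person aged 40 who married at 22: records for ages 15 through 22.
ever_married = person_period_records(1, current_age=40, age_married=22)

# A never-married person aged 30: records for ages 15 through 30, all zeros.
never_married = person_period_records(2, current_age=30)

print(len(ever_married), ever_married[-1])
print(len(never_married), never_married[-1])
```

The stacked records from all respondents are then analyzed with an ordinary binomial logistic regression of `married` on age and the other independent variables.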
BinomialLogistic Regression 3.19 Fu erample'supposewe areinterestedin estimating the likelihoodthatrndividualsmarry agTsay, at eachyearof agefrom 15ti 36. we.un aoif,,, o1...r.utin,s 111lfied a new &r set consistingof one observationfor eachperson fo. ea.h yeur ot ugeat which the F{tr hasnot yet married,plus the ageat which thepersonmarjJ if he or sheaid. up to ,u. includingthe individual'scurrentage.The dependentua.iaUteis a Orcf,otomy, scored I r de personmaried at that aqeand scored 0 0iherwise.r'o. individuars, [h lependentvariabletakeson the value "u"t.u.a 0 fo, eactrage,fro. ug"'ii u*, ,n" yearbefore b. married,andis scored1 for the.ageat which they malry. 6bseruadorrsrepresendng r.lequent agesare droppedfromthe data set, becauseonc" they marry,peopleare no rE.r 'at risk"of (first) marriage.ror nevermarried individuals,ihe dependentvariable ur,ciired0 for all years,from age 15 utr)to their currentug". eg."'g..r,"r than their cur_ u :eeare dropped from the data set becausethey obvioustya.re n"otat rist of marriage fu:_:estheyhavenot yet reached.We thenanalyze this dutu."iln ttl u.uut *ay, estimat_ llg :.binomiallogistic regressionequanon. ,\r thispoint you may be wonderingwhy we go to all this fusswhenit would be easy r t3ar age at first mariage as a ratio variableand simply ou, un OLS regression nft rse at first marriageasthe dependentvariable. "u.ry This'rnight be a reasonaUte procedure ff r: had asampleof personsold enoughto no longerbe i, .irt oi rnu_ug". However, h lpically is not the case,becausewe usually analyzerepresentativesamplesof a lutiation and thus include adultsof all ages,someof *fro_'fruu" no, yet mamed but nfl lo so in the future. Thesecasesarecensored becausewe have stoppedobserving &n: '* hile they are still at risk for the event.Under thesecircumsiances OLS regression 5ars misleadingresuhswhereasdrscrete{imehazard_rate modelsgive conect estrmates ,d : lrtelihood of marying ar eachage fo. those*ho a." 
at risk because they have reached that age without ever having married.
To illustrate the practical procedures for carrying out such an analysis, I use the 1994 GSS to estimate the likelihood of marrying for the first time as a function of age, mother's education, sex, and race (Blacks versus non-Blacks). Given marital norms in the late twentieth-century United States, we would expect the likelihood of marrying to increase with age up to the mid-twenties but then to decline, and we also would expect males to marry later than females. We would expect those from well-educated families (measured by mother's education) to marry later, in part because they are likely to delay marriage until they complete their education (although for some people marriage affects the likelihood of continuing in school). Finally, we would expect Blacks to be less likely to marry than non-Blacks, both because of the socioeconomic position of Blacks and because of racial differences in norms regarding childbearing: Blacks are less likely to be coerced into marriage by their families in the case of unanticipated pregnancies.
Downloadable files "ch13_2.do" and "ch13_2.log" show the specific commands I used to carry out the analysis, together with comments. Because I have extensively documented the Stata commands, little additional comment is needed here. The one novel feature of the Stata setup is the use of the command that creates the person-year data set, shown with results in the Stata log file. This command converts data from wide to long form; that is, in this case from a file of people to a file of person-years, where
there is one observation for each person for each year at which he or she remains unmarried, plus the year of marriage. This is a very efficient way to create a data set for discrete-time hazard-rate analysis, which requires only a few lines of code. Study my setup plus the relevant section of the Stata manual on this command to be sure you understand the logic.
I began by defining the risk set as including ages 15 through 56, because (after excluding a small number of cases with missing values on the independent variables) no one in the sample married for the first time before age 15 or after age 56. I then estimated an equation of the form
$$\ln\left[\frac{W}{1-W}\right] = a + \sum_{i=16}^{56} b_i A_i \qquad (13.17)$$
age' andtrt where W is the probability of first marriage,conditionalon a respondent's regressn This category' A, are dummy variables for age at risk, with 15 the omitted the lqn (converted from pioduced the expectedprobabilitiesshown in the Statalog the figure' ile trc estimatedin Equationt3.tZ1 anOgraphedin Figure 13 l Inspecting that the right tail is not very orderly. thosein fu In pariicular, the proba'bility of first matriage appearsto increaseamong why thl" r it clear fi1e makes 1og40s ani 50s. lnspection of the downloadableStata Thus one or lro marriedhas everyone so: by the time peopleare in thet late 30s, almost is someoh graph The at risk u,e a nonnegligibleproportion of all those becac "onr,i perfect is prediction .iug". misleaiing in anothe,*uy u. *"I b""uuse all agesfor which
.1 3 .1 2 .1',r .1 0 b .u 6
;.0 7
c06 i. u4
21
25
29
37 33 Age at nsk
41
45
4e
)3
)1
rts&irifi'13.1.ExpectedProbabilityof Marrying for the First Timeby Age at Risk,U.S.Adults, 1994(N = 1,s56).
BinomialLogisticRegression
EE; [email protected] E S]Ur6_
r
:untr
@:.[
m: tE Ie: +lu : i:'gd:" r:
rg
[ :: T 6 r :J
r: .rneat risk married at that age(37, 42, '14tbrough 48, 50 through 55) are droppedfrom & equationand hencefrorn the graph. The lessonhere is that very sparsedata may give msleadingresults.Beforecontinuingwith the analysis,I droppedall agesgreaterthan36. I next reestimatedthe model,predictingthe probabilityof marriageat discreteages nb.!g an equationsimilar to Equation13.17)and then substituteda fourthdegreepolyrmial for discreteyearsto fit a smoothcuwe to age at risk. (I decidedthat a fourth&gree polynomial,estimatedbY
$$\ln\left[\frac{W}{1-W}\right] = a + bA + cA^{2} + dA^{3} + eA^{4} \qquad (13.18)$$
was required by testing the significance of successive powers of risk age.) The two curves are shown as Figures 13.2 and 13.3. Visual comparison suggests that they are quite similar, although a formal test of significance reveals significant discrepancies at specific ages. When I determined this, I had to consider whether to continue with a discrete or smooth representation of age at risk. I opted for a discrete representation, which more faithfully represents the data, although a smooth representation of age at risk, which is far more parsimonious, also would have been reasonable.
I then proceeded to estimate two additional models: including first the other variables I hypothesized to affect age at marriage (sex, race, and mother's education) and then interactions between these three variables and age at risk. Wald tests, for all interactions and for interactions with each of the main effects (shown in the Stata log), made it clear that the model including interactions is the preferred model; all tests are significant at beyond the .000 level. Thus, the likelihood of marrying at each age varies by sex, race, and mother's education. Table 13.6 shows contributions to odds ratios, which are the antilogs of the coefficients estimated from an equation of the form
$$\ln\left[\frac{W}{1-W}\right] = a + bE + cM + dB + \sum_{i \neq 25} e_i A_i + \sum_{i \neq 25} f_i A_i E + \sum_{i \neq 25} g_i A_i M + \sum_{i \neq 25} h_i A_i B \qquad (13.19)$$
where $W$ is the probability of marrying given that one is at risk; $E$ is the number of years of school completed by the respondent's mother, expressed as a deviation from the sample mean; $M$ is scored 1 for males and 0 for females; $B$ is scored 1 for Blacks and 0 for non-Blacks; and the $A_i$ are dummy variables for age at risk, with age 25 the reference category.
In Table 13.6 the first column, labeled "Main Effect," shows the expected odds of marrying for non-Black females whose mothers' years of schooling are at the sample mean, expressed in ratio to the effect for those age 25. The remaining three columns show interactions of age at risk with mother's education, sex, and race, except that the coefficients for these variables at age 25 are the main effects. These odds ratios can be used to make any comparison of interest. For example, among women who have never married by age 21, the odds of Blacks marrying at that age are about three-fifths the odds for
FIGURE 13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994. (Horizontal axis: age at risk.)

FIGURE 13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994. (Horizontal axis: age at risk.)
r .l ':1,
l/\lfl Mothor'$
t
nt ]tl!k, s"r, t trlala lt tr[ h' r Mrrrl.l Fr e.ll. IlrrU ih! I lkollh.ttt.l rl Mat tltgo lrotlr Aga Variables' Other Edu(dtlon, wlth Int.!ractlons Botween Age at Risk and the   t,
lld(o'
'rn'l
lnteraction with Sex (Male)
Race(Black)
9847
3. '156
20
3. '108
22
0.786
23
a.765
24
0.918
category)b 25 (reference.
'1.000
26
0.498 (Continued)
SMOOTHING DISTRIBUTIONS
Smoothing refers to a class of techniques for making the general shape of a distribution clear by removing "noise," that is, deviations from the underlying trend that result from sampling error or idiosyncratic factors. Perhaps the simplest smoother is a moving average. A moving average is the average value of several consecutive data points. Consider the worked example in this section. A three-year moving average of the expected probability of marriage at each age would be constructed by first taking the average of the expected probabilities for ages fifteen, sixteen, and seventeen; then the average of the expected probabilities for ages sixteen, seventeen, and eighteen; and so on. At the time the age-at-first-marriage example was created, the Stata subcommand -ma- ("moving average") was available within the -egen- command. However, this subcommand is no longer documented in Stata 10 (although it still works), and has been replaced by -smooth-, which generates medians of the included points rather than means. Another smoother available in Stata is -lowess-.
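The three-year moving average described in the box can be sketched in Python as follows. The probabilities are made-up values used only to show the arithmetic, and the function is an illustration rather than a reproduction of Stata's -ma- subcommand, which may treat the end points differently.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values.

    The result has len(values) - window + 1 entries: the first is the mean of
    values[0:window], the second the mean of values[1:window + 1], and so on.
    """
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical expected probabilities for ages fifteen through nineteen.
probabilities = [0.01, 0.02, 0.06, 0.04, 0.08]
smoothed = moving_average(probabilities)
print(smoothed)
```

The first smoothed value averages ages fifteen through seventeen, the second ages sixteen through eighteen, and the third ages seventeen through nineteen, so the "float and bounce" of the individual years is damped.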
non-Blacks (precisely, 0.591 = 0.190*3.108). Among 30-year-old never-married people, the odds of marrying in that year among those whose mothers are college graduates are only 10 percent higher than the odds for those of the same race and sex whose mothers are high school graduates (precisely, 1.094 = (0.918*1.114)^4).
Despite the usefulness of Table 13.6 for making specific contrasts, the overall pattern implied by the coefficients is difficult to discern. Again, graphs help. Figures 13.4 and 13.5 show three-year moving averages of the expected probability of first marriage by age at risk, separately for Blacks and non-Blacks. In each graph, separate lines are shown for males and females whose mothers had twelve and sixteen years of schooling (as a convenient way of visually representing the effect of mother's education). Moving averages are shown because there is a great deal of "float and bounce" for individual years, which is evident from inspection of the coefficients in Table 13.6. (See the downloadable file "ch13_2.do" for details on how the moving averages were constructed.)
Inspecting Figures 13.4 and 13.5, we see that marriage rates for Blacks differ substantially from those for non-Blacks, with Blacks much less likely than non-Blacks to marry at all. Moreover, non-Black females (especially those whose mothers have only a high school education) marry at disproportionately high rates at ages nineteen through twenty-five; non-Black males marry a bit later and with less concentration in a short period. Black marriage rates, by contrast, are spread out over a much longer period, but there is an upsurge in marriage rates for males in their thirties, especially those whose mothers are high school educated. For both Blacks and non-Blacks, males tend to marry later than females, with male rates exceeding those of females beginning around age thirty. Finally, among all race-by-sex groups, those whose mothers are high school graduates are more likely to marry than are those whose mothers are college graduates.
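The two contrasts computed from Table 13.6 follow one rule: with interactions in the model, the odds ratio at a given age is the product of the relevant factors, and a difference of several years of schooling raises the per-year factor to that power. A minimal Python check of the arithmetic, using only the four numbers quoted in the text (the variable names are mine):

```python
# Race contrast among never-married women at age 21: product of the two
# factors quoted in the text.
race_odds_ratio_age_21 = 0.190 * 3.108

# Mother's education contrast at age 30: the per-year factor is the product
# of the two quoted factors; college versus high school is a four-year gap.
per_year_education_factor = 0.918 * 1.114
college_vs_high_school_age_30 = per_year_education_factor ** 4

print(round(race_odds_ratio_age_21, 3))          # about three-fifths
print(round(college_vs_high_school_age_30, 3))   # roughly 10 percent higher
```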
If I were preparing these results for publication, I would present only a subset of the rather large set of tables and graphs we have just marched through. The intent here, of
FIGURE 13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994. (Lines: Females (12), Males (12), Females (16), Males (16); horizontal axis: age at risk.)

FIGURE 13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994. (Lines: Females (12), Males (12), Females (16), Males (16); horizontal axis: age at risk.)
course, is to provide alternatives for you to consider when presenting your own analyses.
Examples of the application of discrete-time hazard-rate models include Astone and others (2000), Dawson (2000), Lewis and Oppenheimer (2000), and Sweeney (2002).
A FOURTH WORKED EXAMPLE (CASE-CONTROL MODELS): WHO WAS APPOINTED TO A NOMENKLATURA POSITION IN RUSSIA?
When a dependent variable is a rare event, it is inefficient to draw a representative sample of the population at risk for the event, because the sample size would have to be extremely large to obtain enough "positive" cases to analyze. This is a frequent occurrence in epidemiological research, where the events of interest are diseases, but it also occurs in the social sciences. For example, if we are interested in studying what determines who gets elected to Congress, we could hardly do this by drawing a representative sample of the population and looking for the congressmen in it. We have similar problems in studying crime victimization, homosexuality, and various other relatively uncommon phenomena. One solution to this problem is to sample on the dependent variable (that is, to draw a sample of congressmen, criminals, or homosexuals), collect information on that sample, collect corresponding information on a representative sample of the population that has not experienced the rare event (becoming congressmen, criminals, or homosexuals), combine the two samples, and model the odds of experiencing the rare event. This is known as case-control sampling in the epidemiological literature (for an excellent review of the statistical procedures involved, see Breslow [1996]).
Case-control sampling exploits the fact that odds ratios are invariant under shifts in the distribution of the data. This extremely important feature of odds ratios makes it possible to combine samples with very different distributions on the independent and dependent variables in order to model rare events. This is not possible with OLS regression because OLS coefficients are affected by the distributions of the variables in the model.
To see how case-control procedures work in practice, let us consider what factors
affect the odds of becoming a member of the Russian political elite at the end of the Communist era. From Social Stratification in Eastern Europe after 1989 (Treiman and Szelényi 1993), we have two representative samples from Russia: a probability sample of the general population (N = 5,002) and a random sample of persons who were in nomenklatura positions as of January 1988 (N = 850). (See Appendix A for a description of the data and information on how to obtain them.) Nomenklatura positions were those that required the approval of the Central Committee of the Communist Party. They ranged from very high government officials (for example, members of the Politburo) down to heads of sensitive organizations, for example, rectors of universities, editors in chief of newspapers, and heads of large industrial enterprises.
The general population sample departs in two ways from compliance with the assumptions underlying case-control sampling, but neither deviation is important from a practical standpoint. First, it is a probability sample of the 1993 population rather than the 1988 population. However, the sampling frame is based on the 1989 census, and the sample therefore probably represents the 1988 population nearly as well as it does
Before turning to interpretation of the results, we should note the one difference between case-control analysis and ordinary binomial logistic regression: in case-control analysis the intercept is not meaningful. This should be obvious from the fact that the intercept in logistic regression indicates the proportion of the sample that is "positive" with respect to the dependent variable. However, in case-control designs this proportion is fixed by the sample design, and thus the coefficient adds no information.
Inspecting the coefficients in Table 13.7, we see very large effects and few surprises. Each year of schooling increases the odds of becoming a member of the nomenklatura by more than 70 percent. Thus, all else equal, university graduates (who typically have 15 years of schooling in Russia) are more than 15 times as likely as high school graduates (with 10 years of schooling) to be appointed to nomenklatura positions (precisely, 15.32 = 1.726^(15-10)). The effect of gender is astronomical: males are more than 17 times as likely as females to be appointed to nomenklatura posts. The effect of age is also extremely strong: all else equal, the odds of being appointed to a nomenklatura position increase about 14 percent per year. Thus, for example, a 50-year-old is more than 7 times as likely to secure a nomenklatura position as is a 35-year-old (precisely, 7.23 = 1.141^(50-35)). Perhaps more interesting, the effect of social origins, even among those equally well educated, is far from trivial. Coming from a family in which one's father was a member of the Communist Party improves one's chances of a nomenklatura appointment by about half, all else equal. Also, each year of father's schooling increases the odds of nomenklatura appointment by about 11 percent (this in the worker's paradise!), so that the offspring of the university-educated intelligentsia (15 years of school) are about three times as likely as the offspring of those with only a primary education to secure nomenklatura appointments, irrespective of their own educational achievement (precisely, 2.94 = 1.114^(15-5)).
Alone among the variables we have considered, father's occupational status has no impact on the odds of appointment to a nomenklatura post.
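Each of the contrasts above comes from one rule: an odds multiplier for a one-unit difference, raised to the number of units separating the two groups. A short Python check of the three worked figures, using only the multipliers quoted in the text (the helper name is illustrative):

```python
def odds_multiplier(per_unit_multiplier, units):
    """Odds multiplier implied by a difference of `units` on a predictor."""
    return per_unit_multiplier ** units


# University (15 years) versus high school (10 years) graduates.
schooling_contrast = odds_multiplier(1.726, 15 - 10)

# A 50-year-old versus a 35-year-old.
age_contrast = odds_multiplier(1.141, 50 - 35)

# Father with 15 versus 5 years of schooling.
father_schooling_contrast = odds_multiplier(1.114, 15 - 5)

print(round(schooling_contrast, 2),
      round(age_contrast, 2),
      round(father_schooling_contrast, 2))
```

Because the multipliers compound, a modest per-year effect (14 percent for age) becomes a sevenfold difference over fifteen years.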
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have seen how to estimate and interpret binary logistic regression models, which are widely used to model dichotomous outcomes such as whether people vote, are employed, or are members of a particular organization. We have seen that although the estimation procedures are quite different, the interpretation of the coefficients of such models is similar to that of OLS regression, except that the coefficients represent net effects of each independent variable on the log odds of an outcome.
Because log odds are not intuitive quantities, we have considered two nonlinear transformations to more readily interpretable coefficients, odds and expected probabilities, and have also seen how to graph net relationships, a form of regression standardization for logistic regression. Finally, we considered three extensions of the basic logistic regression model: education progression ratios, discrete-time hazard-rate models, and case-control models. A notable feature of logistic regression models is that they are invariant with respect to the distributions of variables in the sample, which is what makes case-control procedures legitimate in the logistic regression context but not in the OLS regression context.
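The invariance property can be illustrated with a small numerical sketch. The counts below are entirely hypothetical: a "population" in which the event is rare, and a case-control sample that keeps every case but only 5 percent of the non-cases. The odds ratio is unchanged by the resampling, whereas a difference in proportions (the quantity an OLS regression on a dichotomy would estimate) is not.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (cases_exposed / cases_unexposed) / (controls_exposed / controls_unexposed)


def risk_difference(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Proportion 'positive' among the exposed minus that among the unexposed."""
    exposed = cases_exposed / (cases_exposed + controls_exposed)
    unexposed = cases_unexposed / (cases_unexposed + controls_unexposed)
    return exposed - unexposed


# Hypothetical population counts (rare event).
population = (40, 10, 4000, 6000)
# Case-control sample: all cases, 5 percent of the controls.
sample = (40, 10, 200, 300)

print(odds_ratio(*population), odds_ratio(*sample))
print(risk_difference(*population), risk_difference(*sample))
```

In this sketch both odds ratios equal 6, while the difference in proportions grows from under 0.01 in the population to about 0.13 in the case-control sample.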
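APPENDIX 13.A SOME ALGEBRA FOR LOGS AND EXPONENTS
For those who have forgotten their school algebra, here are some useful relationships involving natural logarithms and antilogs (exponents):

$$
\begin{aligned}
e^{\ln(X)} &= X \\
\ln(X \cdot Y) &= \ln(X) + \ln(Y) \\
\ln(X / Y) &= \ln(X) - \ln(Y) \\
X \cdot Y &= e^{\ln(X)}\, e^{\ln(Y)} = e^{\ln(X) + \ln(Y)} \\
e^{X + Y} &= e^{X} \cdot e^{Y} \\
e^{X - Y} &= e^{X} / e^{Y} \\
\ln(X^{P}) &= P \cdot \ln(X) \\
X^{P} &= \left[e^{\ln(X)}\right]^{P} = e^{P \cdot \ln(X)}
\end{aligned}
$$

Note that $Y = \ln(X)$ and $X = e^{Y}$ are equivalent.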
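Readers who prefer to convince themselves numerically can check these identities with a few lines of Python. The values of X, Y, and P below are arbitrary positive numbers chosen for the check.

```python
import math

X, Y, P = 3.0, 5.0, 2.5  # arbitrary positive values

# Each entry pairs an identity with the result of testing it numerically.
checks = {
    "exp(ln X) = X": math.isclose(math.exp(math.log(X)), X),
    "ln(X*Y) = ln X + ln Y": math.isclose(math.log(X * Y), math.log(X) + math.log(Y)),
    "ln(X/Y) = ln X - ln Y": math.isclose(math.log(X / Y), math.log(X) - math.log(Y)),
    "X*Y = exp(ln X + ln Y)": math.isclose(X * Y, math.exp(math.log(X) + math.log(Y))),
    "exp(X+Y) = exp(X)*exp(Y)": math.isclose(math.exp(X + Y), math.exp(X) * math.exp(Y)),
    "exp(X-Y) = exp(X)/exp(Y)": math.isclose(math.exp(X - Y), math.exp(X) / math.exp(Y)),
    "ln(X^P) = P*ln X": math.isclose(math.log(X ** P), P * math.log(X)),
    "X^P = exp(P*ln X)": math.isclose(X ** P, math.exp(P * math.log(X))),
}

for identity, holds in checks.items():
    print(identity, holds)
```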
APPENDIX 13.B INTRODUCTION TO PROBIT ANALYSIS
As noted at the beginning of this chapter, an alternative to logistic regression as a model for predicting binary responses is the probit model, which is defined as

$$\Pr(Y = 1 \mid \mathbf{x}) = \Phi(\boldsymbol{\beta}'\mathbf{x}) = \Phi\left(\beta_0 + \sum_{k=1}^{K} \beta_k x_k\right) \qquad (13.B.1)$$

where $\Phi$ is the standard cumulative normal distribution and there are $K$ predictor variables. From this definition it is evident that the $\beta$s are z-scores, and the associated probabilities are determined by finding the area under the normal curve corresponding to a particular z-score. This can be done by invoking Stata's -normal- function.
Consider the example used in the chapter to illustrate the interpretation of logistic regression models: the determinants of the likelihood of being threatened by a gun or being shot at. Table 13.B.1 shows the probit coefficients corresponding to the logistic coefficients shown for Models 2 and 4 in Table 13.3. Note that the probit and logit models yield similar conclusions except that in Model 4 the interaction term is marginally
significant when estimated using a logit model, and not even that when estimated using a probit model.
Because probits are z-scores, they indicate the expected change in standard deviation units in the latent dependent variable (what StataCorp [2007, Reference I-P, 620] calls "the probit index"), resulting from a one-unit change in the associated predictor variable. It is a property of probits, in common with logits, that the variance of the latent

TABLE 13.B.1. Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
Independent Variable | b | Standard Error | p | Marginal Effect
o.8022
\i::
0.01' 11 0.0062 0.2586
i : : ..
r::':ept
 1.709s
0397
.000
.1A20
.000 .1154
''=: ..ed probability :: 4
,u; :
0 .8 1 26
.0038
.003
0.0062
.0022
.004
o.2994
.0545
.000
0.0806
.o721
.264
.1810
.000
0 .0 1 1 4
l
..'* . ^.'''
$i:,.L4ale
.o729
1.7117
variable changes as additional variables are introduced into a model. This means that it is not appropriate to compare corresponding probit coefficients across models with and without mediating variables, as we do with metric OLS coefficients. Rather, we must standardize the latent dependent variable by dividing by the standard deviation of its distribution. This produces Y*-standardized coefficients, which can then be directly compared across models
*ith differing lar,11t:*: :::riT:?.TlT inthe rogitcoerncients ""..*'"o*l*t ;'it;';;nou'aiutionotoroinal #;.i"' ill::.:".Xliff
va{P:] f:1li11"'*'j"T*:1. metrics. 11:?' ,r'" inEquation GL..'iN"" the probiishaveintrinsic metrics'thel "'"1c;;;';;:'ffi; t;;;ardized
Probits are typically difficult to interpret. Thus they usually are interpreted by evaluating the expected probability of a positive outcome for a given configuration of values of the predictor variables, or by interpreting the marginal effect of a change in each predictor variable on the probability of a positive outcome.
'T*"
Il,tt':,:11*".'^:"*:T,:?:'i"#i'T l"'iio rt'"wort"oexampte in Moder4 ri bvthelogitcoerncients ;;;;;i*pr'ed
tormaqrotilnodet'::,c19li"i1""tlh: piouuuliti"' the*1"1"* Til:T:i"H using probabilities ffi;##;;n;;''t" orschooli:4 wig lwenlv vears **"on rorp1op1e ffi*?ffi'ffiti#,it"ltlrt et:::":",Y#t:ul,i: of samevalues
".Ji#;J;;'il;';; :.fi;;";;,l":naine
theprobitequationiorthe 1994.To evaluate
:
= : a +.bE+20+ o' .bv4e4 'r'11" #.###1ffi;:#;il;;;'"""p" o.miz,sq isthe = r.:sos uurr+'zu  "toi rwhere'bu *'bli::':T1?T:::jljTl?, ffiifrd; s b" is the Probit coefficient for Yea
t*"y;l
Then we write out the expected:
tlfg:,::*:":"1;:'1,:.fiitl b" istneloeffici"nt byraceandsex(where themusins tem) andtransrorm ;;;;;
t'""tion
             Non-Blacks                Blacks
Females      $\Phi(a')$                $\Phi(a' + b_B)$
Males        $\Phi(a' + b_M)$          $\Phi(a' + b_B + b_M + b_{BM})$

Substituting the numerical values of these coefficients, we have

             Non-Blacks                                  Blacks
Females      Φ(-1.3569) = 0.0874                         Φ(-1.3569 + 0.2994) = Φ(-1.0575) = 0.1451
Males        Φ(-1.3569 + 0.8126) = Φ(-0.5443) = 0.2931   Φ(-1.3569 + 0.8126 + 0.2994 - 0.0806) = Φ(-0.3255) = 0.3724
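The four probabilities can be reproduced outside Stata. The sketch below computes the standard cumulative normal from Python's `math.erf`, standing in for Stata's -normal- function, and uses the coefficient values shown in the table above; the variable names are mine.

```python
import math


def normal_cdf(z):
    """Standard cumulative normal distribution, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


a_prime = -1.3569       # intercept evaluated at the chosen values of the other variables
b_male = 0.8126
b_black = 0.2994
b_black_male = -0.0806  # interaction term

expected = {
    "non-Black females": normal_cdf(a_prime),
    "Black females": normal_cdf(a_prime + b_black),
    "non-Black males": normal_cdf(a_prime + b_male),
    "Black males": normal_cdf(a_prime + b_male + b_black + b_black_male),
}

for group, probability in expected.items():
    print(f"{group}: {probability:.4f}")
```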
predicr= extremelycloseto the percentages NoG that, multiplied by 100, theseare
for n"t B11"Il:T:t:i:t:"t:"::Hi":l:= thelogit model,whichare,respectively' rolc" r etu't men'37'2' (seetheparagraph
ff;:?;
ffi:;#;;;d;'';
Equation13.16.)
Now let us consider the marginal effect. We might ask how big a change in the probability we could expect for a small change in a particular independent variable. However, because the relationship between the probit index and the probability is nonlinear, the answer depends on the values of the independent variables at which we evaluate the change. Unless we have a reason for doing otherwise, evaluating the marginal effect of each variable relative to the expected value when all independent variables are set at their means would seem most reasonable, and this is the approach Stata takes for continuous variables. However, there is an exception: it makes little sense to evaluate marginal changes in dummy variables relative to their means. A better approach for dummy variables is to compute the discrete change, the difference in the expected probability for those scored 1 and 0 on the dummy variable, with all other variables (including any other dummy variables in the equation) set at their means. Thus, for example, we would want to know the expected difference in the probability of males and females having been threatened, among people who are at the mean with respect to the other variables. For continuous variables, however, we want to know the effect of a small change relative to the mean for all variables. Thus for continuous variables the marginal effect is defined as the slope of the probability function at the mean, extrapolated to a unit increase.
The marginal effects for Model 2 are shown in the rightmost column of Table 13.B.1. Note that I do not show marginal effects for Model 4. This is because when we have interaction terms, the effects of the variables included in the interaction cannot be separated. Thus when we have a model involving interactions, it is best to evaluate the probabilities for various combinations of variables, as in the logit example.
The first thing to note is the predicted probability, 0.1753, which tells us the expected probability that the average person in the data set has ever been threatened by a gun or shot at. It is reassuring that the predicted value is close to the observed value: 19.5 percent of the sample has been threatened. This gives us confidence in the correctness of the model.
Now note the marginal effect for males. Because sex is a dichotomous variable, this coefficient gives the difference in the expected probability of having ever been threatened for males and females who are at the mean with respect to the other characteristics included in the model; among such people, males are predicted to be 21 percentage points more likely than females to have experienced a gun threat. We also see that, at the mean, a one-year increase in schooling would be expected to reduce the probability of having been threatened by 0.0029. What would, say, a ten-year increase in schooling bring? Note that here we cannot simply extrapolate the marginal effect. For example, it is not correct to say that a ten-year increase in schooling would result in a 0.029 decrease in the expected proportion having been threatened. Rather, we need to compare the cumulative normal transformations at the mean and at the mean plus ten years:
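$$
\begin{aligned}
&\Phi\big(\beta_0 + \beta_M M + \beta_E (E + 10) + \beta_Y Y + \beta_B B\big) - \Phi\big(\beta_0 + \beta_M M + \beta_E E + \beta_Y Y + \beta_B B\big) \\
&\quad = \Phi\big(-1.710 + 0.802 \times 0.451 - 0.0111 \times (12.39 + 10) + 0.0062 \times 84.47 + 0.259 \times 0.111\big) \\
&\qquad - \Phi\big(-1.710 + 0.802 \times 0.451 - 0.0111 \times 12.39 + 0.0062 \times 84.47 + 0.259 \times 0.111\big) \\
&\quad = .1482 - .1753 = -.0272
\end{aligned}
\qquad (13.B.2)
$$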
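The contrast between the marginal effect and the discrete change in Equation 13.B.2 can be checked numerically. The sketch below uses the coefficients and means that appear in the equation; the variable names are mine, and the normal distribution functions are built from Python's standard library rather than Stata's -normal-.

```python
import math


def normal_cdf(z):
    """Standard cumulative normal distribution, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density, the slope of Phi at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


b_schooling = -0.0111
mean_schooling = 12.39
# Contribution of the intercept and of the other three variables at their means.
other_terms = -1.710 + 0.802 * 0.451 + 0.0062 * 84.47 + 0.259 * 0.111

index_at_mean = other_terms + b_schooling * mean_schooling
index_plus_ten = other_terms + b_schooling * (mean_schooling + 10)

p_at_mean = normal_cdf(index_at_mean)
p_plus_ten = normal_cdf(index_plus_ten)
discrete_change = p_plus_ten - p_at_mean

# Marginal effect of schooling: slope of the probability function at the mean.
marginal_effect = normal_pdf(index_at_mean) * b_schooling

print(round(p_at_mean, 4), round(p_plus_ten, 4), round(discrete_change, 4))
print(round(marginal_effect, 4), round(10 * marginal_effect, 4))
```

Multiplying the marginal effect by ten overstates the decline slightly, which is the point of the warning against simple extrapolation.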
FIGURE 13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients. (Horizontal axis: coefficient (b), from -5 to 5.)

A final point to note is that the logit and probit models have similar shapes, except that probit coefficients more quickly reach probabilities asymptotically close to zero or one than do logit coefficients, as is evident from Figure 13.B.1. For this reason, logit models are more sensitive when dealing with rare events or with predicted probabilities close to zero or one. But with this exception, the two models almost always yield similar substantive conclusions.
For further discussion of the binomial probit model, see Petersen (1985), Long (1997, 40-84), Powers and Xie (2000, Chapter 3), Long and Freese (2006), Wooldridge (2006, 583-595), and the -probit-, -probit postestimation-, -svy: probit-, and -svy: probit postestimation- entries in StataCorp (2007). For an interesting application see Manski and Wise (1983).
The Stata commands used to create the worked example for the probit model and the output are shown as the last part of downloadable files "ch13_1.do" and "ch13_1.log."
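The difference in the tails of the two curves is easy to see numerically. This sketch evaluates the cumulative normal (probit) and logistic (logit) functions at a few values across the range from -5 to 5; the particular values chosen are arbitrary.

```python
import math


def normal_cdf(z):
    """Probability associated with a probit value z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_cdf(z):
    """Probability associated with a logit value z."""
    return 1.0 / (1.0 + math.exp(-z))


# Tabulate both curves at a few points across the plotted range.
for z in (-4, -2, 0, 2, 4):
    print(f"{z:>3}  probit: {normal_cdf(z):.5f}  logit: {logistic_cdf(z):.5f}")
```

Both functions give 0.5 at zero, but at -4 the cumulative normal is already below 0.0001 while the logistic is still near 0.018, which is the sense in which probit values reach the extremes more quickly.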
CHAPTER 14
MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION

WHAT THIS CHAPTER IS ABOUT
In this chapter we consider models for three additional types of limited dependent variables:
- categorical variables with more than two categories, for which multinomial logistic regression is appropriate
- ordinal variables, for which ordinal logistic regression is appropriate
- truncated, or censored, dependent variables, where observations are not observed below or above some level, for which tobit regression is appropriate
In each case we see how the model is specified and then work through an illustrative substantive analysis.
MULTINOMIAL LOGIT ANALYSIS
Sometimes we wish to analyze categorical dependent variables with more than two categories. In this case, we have available a natural extension of binomial logistic regression: multinomial logistic regression. The procedure involves simultaneously estimating a set of logistic regression equations of the form

$$
\begin{aligned}
\ln\left[\frac{P(Y = 1)}{P(Y = 0)}\right] &= a_1 + \sum_{k} b_{1k} X_k \\
\ln\left[\frac{P(Y = 2)}{P(Y = 0)}\right] &= a_2 + \sum_{k} b_{2k} X_k \\
&\;\;\vdots \\
\ln\left[\frac{P(Y = m)}{P(Y = 0)}\right] &= a_m + \sum_{k} b_{mk} X_k
\end{aligned}
\qquad (14.1)
$$

Here, one category of the dependent variable is omitted and becomes the reference category. The estimation procedure yields, for a set of m + 1 categories of some dependent variable, m logistic regression equations, each of which predicts the log odds of a case falling into a specific category rather than into the reference category (here designated Y = 0). Note, however, that although the interpretation is similar to the binomial case, the estimation procedure is not equivalent to estimating a set of binomial logistic regression equations in which the odds of being in a particular category versus not being in that category are predicted. In general, the estimates will differ and the binomial estimates will be incorrect.
This can easily be appreciated by imagining that we are interested in what factors
determine whether, in 1988 Poland, a person was a Communist Party official, a Communist Party member but not an official, or neither a member nor an official. If we estimated a binomial logistic regression predicting ordinary party membership (without office holding) and another logistic regression equation predicting party office holding, we would be in trouble with respect to the first equation because the negative category (not an ordinary party member) would include those who were neither party members nor officials and also those who were party officials. In consequence, the resulting coefficients would be misleading. For example, it is likely that a coefficient relating education to party membership would be very weak because party officials are likely to be better educated than ordinary members, whereas party members are likely to be better educated than nonmembers.
The appropriate way to handle this problem would be to estimate a multinomial logistic regression model with three categories: nonmember, ordinary member, and official. Doing so would result in two equations, one contrasting ordinary members versus nonmembers and the other contrasting officials versus nonmembers, which are
interpreted in the ordinary way. An alternative would be to do a sequential logit analysis in which first membership versus nonmembership is modeled, and then office holding versus ordinary membership is modeled for party members only. The choice between these alternatives would depend on how the process of becoming a party member or a party official occurs. (See the brief discussion at the end of the chapter in the section on "Other Models.")
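To see how a set of equations, each contrasting one category with the reference category, translates into expected probabilities for all categories, consider the following Python sketch. The log odds used are hypothetical values for the three-category party example (nonmember as the reference category), not estimates from any data set.

```python
import math


def multinomial_probabilities(log_odds_vs_reference):
    """Convert m log odds (each category versus the reference) into m + 1 probabilities.

    The reference category has log odds of zero relative to itself, so its
    antilog is 1. Each probability is a category's antilog divided by the
    sum of all the antilogs.
    """
    antilogs = [1.0] + [math.exp(value) for value in log_odds_vs_reference]
    total = sum(antilogs)
    return [antilog / total for antilog in antilogs]


# Hypothetical predicted log odds for one person:
# ordinary member versus nonmember, and official versus nonmember.
log_odds = [math.log(0.25), math.log(0.05)]
p_nonmember, p_member, p_official = multinomial_probabilities(log_odds)

print(round(p_nonmember, 3), round(p_member, 3), round(p_official, 3))
```

The three probabilities sum to one, and the ratio of each to the reference-category probability recovers the odds implied by the corresponding equation.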
Worked Example: Foreign Language Competence in the Czech Republic
To see how this procedure works in practice, let us analyze the factors that account for competence in English and Russian in the Czech Republic. The data used here were collected in 1993 from a representative national probability sample of 5,496 Czechs age twenty to sixty-nine, as part of the survey Social Stratification in Eastern Europe After 1989 (Treiman and Szelényi 1993; see Appendix A for details on this survey and how to obtain the data set and documentation). Here we consider four groups:
- those who speak neither English nor Russian
- those who speak English but not Russian
- those who speak Russian but not English
- those who speak both languages
To be classed as a speaker of a language, a respondent had to report that he speaks the language "fairly well" or "very well"; those who reported that they speak the language "only a little" or "not at all" or who failed to answer the question were classified as nonspeakers of the language. Because the survey was conducted in Czech, everyone spoke Czech. A few may also have spoken a second language other than Russian or English, but this possibility is not analyzed here.
My expectation is that professionals and technicians would be more likely than other occupation groups to speak English because English is now the international language of science, technology, and scholarship, and hence the ability to speak English is important for professional advancement. Those who were ever Communist Party members, and especially those who were government or party officials, would be more likely than other occupation groups to speak Russian because Russian was necessary for political advancement in the Eastern Bloc. It is less clear whether or to what extent being a manager would increase the odds of speaking English (perhaps necessary for international business dealings) or Russian (perhaps necessary for Eastern Bloc dealings).
To identify those who potentially needed Russian for their careers, I classify respondents by their 1988 occupation and create four dummy variables for 1988 occupation, each scored 1 for those in the category and scored 0 otherwise: officials, other managers, professionals and technicians, and others. (This variable was constructed by recoding the version of ISCO 88 shown in Treiman [1994, Appendix C]. "Officials" include codes 1000 to 1166, "other managers" include codes 1200 to 1320, "professionals and technicians" include codes 2000 to 3480, and "others" include codes 4000 to 9333. Those
not reporting an occupation in 1988 were excluded from the analysis.) In addition to these variables, I include education as a control variable because it is clear that those who are better educated are more likely to speak foreign languages in general.

The data were weighted to adjust for differential household size and to bring sample characteristics into conformity with population distributions (see Treiman 1994, Section I.G, for details). However, standard errors were not adjusted for clustering. In the sample design, census tracts were divided into eight strata, and households were randomly sampled within strata. Because the stratum identification is not given in the documentation, there is no alternative but to treat the sample as a simple (weighted) random sample. Given the probable lack of systematic association between census tracts and other characteristics, the lack of adjustment for stratification is likely to be of little consequence. The results are reported in Table 14.1 for the 3,945 people with a job in 1988 for whom complete information was available. (Downloadable file "ch14_1.log" shows the Stata log for the analysis, and "ch14_1.do" shows the -do- file used to obtain the results.)

Inspecting the coefficients in Table 14.1, we see that, as expected, the odds of speaking either Russian or English, or both, improved substantially with education. The odds multipliers in the second panel tell us that each additional year of schooling increased the odds of speaking Russian by 25 percent, the odds of speaking English by 36 percent, and the odds of speaking both languages by 51 percent, all in contrast to speaking neither Russian nor English. Thus, for example, net of other factors, the odds that a Czech university graduate could speak Russian but not English (in contrast to speaking neither Russian nor English) are nearly two and one half times as high as the odds that a high school graduate could do so (because 1.248^(16-12) = 2.43). The odds that a university graduate could speak English but not Russian are more than three times the odds for a high school graduate (because 1.363^(16-12) = 3.45), and the odds that a university graduate could speak both Russian and English are more than five times the odds for a high school graduate (because 1.508^(16-12) = 5.17).
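The arithmetic behind these education contrasts can be checked with a few lines of Python. This is only a sketch: the three odds multipliers are copied from Table 14.1, while the dictionary and function names are illustrative and not part of the book's Stata files.

```python
# Odds multipliers (e^b) for years of schooling, copied from Table 14.1.
# The reference outcome is speaking neither Russian nor English.
SCHOOLING_MULTIPLIER = {"Russian only": 1.248, "English only": 1.363, "Both": 1.508}


def odds_ratio_for_gap(multiplier, years):
    """Return the multiplicative change in the odds for a schooling gap of `years`."""
    return multiplier ** years


# University graduates (16 years) versus high school graduates (12 years).
GAP = 16 - 12
ratios = {
    outcome: odds_ratio_for_gap(multiplier, GAP)
    for outcome, multiplier in SCHOOLING_MULTIPLIER.items()
}

for outcome, ratio in ratios.items():
    print(f"{outcome}: odds ratio = {ratio:.2f}")
```

Because the model is linear in the log odds, a four-year gap multiplies the odds by the one-year multiplier raised to the fourth power, which is why small per-year effects compound into large differences between graduates.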
Note that we are not restricted to comparisons with the omitted reference category. By subtracting the coefficients for the log odds (or, alternatively, taking the ratio of the odds multipliers), we can compare the categories for which we have explicit coefficients. For example, each year of school increases the odds that a Czech could speak English instead of Russian by about 9 percent (because e^(0.3096 - 0.2213) = 1.363/1.248 = 1.092). Hence, the odds that a university graduate could speak English and not Russian (rather than Russian and not English) are more than 40 percent greater than the corresponding odds for high school graduates (because e^(4(0.3096 - 0.2213)) = (1.363/1.248)^4 = 1.42). (Note that in contrast to our usual rule of thumb that three significant digits are sufficient, it probably is best to report four digits for the coefficients because they often are used in subsequent calculations. Too much rounding error is introduced when only three digits are reported, so that the mathematical relationships implied by the coefficients shown in downloadable file "ch14_1.log" appear no longer to hold.)

Continuing with our substantive comparisons, note that, as expected, membership in the Communist Party increased the odds of speaking Russian and decreased the odds of speaking English but had no impact on the odds of speaking both Russian and English. All else equal, the odds that Communist Party members spoke Russian but not English were
TABLE 14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Each cell shows the coefficient, the standard error in parentheses, and the p-value. The reference outcome is speaking neither language.)

Variable                              Russian                   English                    Both

Logits (b)
Years of school completed             0.2213 (.0247) .000       0.3096 (.0404) .000        0.4107 (.0429) .000
Ever a Communist Party member?        0.3020 (.1488) .042       -0.8965 (.3332) .007       0.0484 (.2778) .862
Government or CP official in 1988     1.5591 (.7169) .030       -28.2975 (.6097) .000      -28.3602 (.7039) .000
Other manager in 1988                 0.9941 (.2725) .000       0.8010 (.4844) .098        0.8534 (.5330) .109
Professional in 1988                  0.9943 (.1548) .000       1.124 (.2990) .000         1.3856 (.3577) .000
Intercept                             -5.5378 (.3021) .000      -8.1541 (.5036) .000       -10.1965 (.5866) .000

Odds multipliers (e^b)
Years of school completed             1.248                     1.363                      1.508
Ever a Communist Party member?        1.353                     0.408                      1.050
Government or CP official in 1988     …                         …                          …
Other manager in 1988                 2.702                     …                          …
Professional in 1988                  …                         …                          …
about a third higher than the odds that they spoke neither language, whereas the odds that Communist Party members spoke English but not Russian were only about 40 percent as great as the odds that they spoke neither language. Thus the odds that Communist Party members spoke Russian but not English are more than three times as great as the odds that they spoke English but not Russian (because e^(0.3020 - (-0.8965)) = 1.353/0.408 = 3.316). The same is true of service as a government or Communist Party official. Here, as expected, officials were nearly five times as likely to speak Russian (in contrast to speaking neither Russian nor English) as were those who were neither managers nor professionals or technicians (recall that the reference category is all other occupations). The odds that government officials spoke English or both Russian and English are effectively zero, which they should be because not one of the sixteen officials in the sample spoke English.

Finally, we see that being a professional or technician in 1988 roughly triples the odds of speaking Russian only or English only, and quadruples the odds of speaking both English and Russian, relative to speaking neither English nor Russian. By contrast, being a manager in 1988 nearly triples the odds of speaking Russian only, relative to speaking neither. But the effects of being a manager on the odds of speaking English or of speaking both English and Russian were both somewhat smaller than the effect of being a manager on the odds of speaking Russian. Also, the coefficients are only marginally significant, at about the 0.1 level.

Although for this example I settled on a single model in advance, model selection for multinomial logit models is carried out in exactly the same way as for binomial logit models: by taking the ratio of the difference in L²s (Model χ²) to the difference in the degrees of freedom for any two models, to determine whether one model fits the data significantly better than the other model (but recall that this procedure is not appropriate when robust estimation is used, that is, when the data are weighted or clustered; rather, a Wald test should be used to compare models).
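The comparisons between non-reference outcomes discussed in this example all rest on one operation: exponentiating the difference between two logits. A minimal Python sketch follows; the coefficients are those reported for Table 14.1, and the helper name `contrast` is an illustrative assumption rather than anything in the book's files.

```python
import math

# Logits from Table 14.1 (reference outcome: speaks neither language).
B_SCHOOLING = {"Russian only": 0.2213, "English only": 0.3096}
B_PARTY_MEMBER = {"Russian only": 0.3020, "English only": -0.8965}


def contrast(coefs, numerator, denominator):
    """Odds multiplier for one outcome versus another: exp of the logit difference."""
    return math.exp(coefs[numerator] - coefs[denominator])


# Each year of schooling: English only versus Russian only.
per_year = contrast(B_SCHOOLING, "English only", "Russian only")
# University versus high school graduates (a four-year gap).
four_years = per_year ** 4
# Party members: Russian only versus English only.
party = contrast(B_PARTY_MEMBER, "Russian only", "English only")

print(f"per year: {per_year:.3f}, four years: {four_years:.2f}, party: {party:.2f}")
```

Working from the four-digit logits rather than the rounded odds multipliers gives very slightly different third decimals, which illustrates the point made above about rounding error.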
Independence of Irrelevant Alternatives

In the multinomial logit model, the relative odds of being in two categories are assumed to be independent of the other alternatives included in the model. This follows from Equation 14.1, from which we can derive the difference in log odds for two categories, d and c, as
\[
\ln\left[\frac{\Pr(Y=d)}{\Pr(Y=c)}\right] = (a_d - a_c) + \sum_k (b_{dk} - b_{ck}) X_k \tag{14.2}
\]
Note that only the two categories being compared enter the equation. If, however, the relative odds do depend on what the alternatives are, the model produces misleading estimates. To see this clearly, consider McFadden's (1974) well-known example of transportation choice. Suppose people can travel to work by bus or by car and that half choose to go by car and half by bus. Now suppose a competing bus company establishes buses with the same routes and schedule, so we no longer have, say, only blue buses but also red buses. Presumably, the half that traveled by car would continue to do so, but the half that traveled by bus would divide equally between the red and blue buses, taking whichever bus showed up first at the bus stop. Thus the odds ratio for car versus blue-bus ridership would change from 1:1 to 2:1, violating the assumption of the model.

Now consider another example. Suppose there are two restaurants in a neighborhood, a Mexican and an Italian restaurant, and that the Mexican restaurant gets 60 percent of the total business. Then a new Chinese restaurant opens in the neighborhood and draws off 20 percent of the business of the Mexican restaurant and 20 percent of the business of the Italian restaurant. The Mexican restaurant's share of the total is now 48 percent, and the Italian restaurant's share of the total is 32 percent. Here the independence-of-irrelevant-alternatives (IIA) assumption holds because 60/40 = 48/32 = 3/2.

Because the multinomial model is misleading when the IIA assumption is violated, McFadden suggests that multinomial (and conditional) logistic regression models should be estimated only when the outcome categories "can plausibly be assumed to be distinct and weighed independently in the eyes of each decision maker" (1974, 113). A formal test of the IIA property is available, implemented in Stata 10.0 as -suest- ("seemingly unrelated estimation," a generalization of an earlier command, -hausman-).
The -suest- test compares models that do and do not include presumably irrelevant outcomes. If the resulting parameters for the restricted and unrestricted models are similar, the additional outcomes can be assumed to be irrelevant. Applying these ideas to our current example, we might ask whether the odds that people speak English are affected by including "Russian" as an alternative in the model. In this case the test strongly suggests that the IIA condition is not satisfied. Thus we might consider estimating a sequential logit model in which we successively consider two questions: whether a respondent speaks either Russian or English versus speaking neither language, and, for each of the two subsets of respondents (those speaking Russian and those speaking English), whether they speak the other language as well.

For further discussion of the IIA assumption and its consequences, see McFadden (1974), Hausman and McFadden (1984), Hoffman and Duncan (1988), Zhang and Hoffman (1993), Long (1997, 182-184), Powers and Xie (2000, 245-247), Long and Freese (2006), and the -suest- and -hausman- entries in StataCorp (2007). Additional examples of the application of multinomial logit models include Aly and Shields (1991), Haynes and Jacobs (1994), Tomaskovic-Devey and Skaggs (1999), and Breen and Jonsson (2000).
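The bus and restaurant examples used to motivate the IIA assumption can be verified numerically. The sketch below is illustrative only: the market shares are the ones given in the text, and the function names are assumptions introduced here. It checks whether the odds between two existing alternatives survive the arrival of a third. It is not a substitute for the formal -suest- test.

```python
def odds(share_a, share_b):
    """Odds of choosing alternative a rather than alternative b."""
    return share_a / share_b


def iia_holds(odds_before, odds_after, tol=1e-9):
    """True when the odds between two alternatives are unchanged by a new entrant."""
    return abs(odds_before - odds_after) < tol


# Restaurants: the Chinese entrant draws 20 percent of each incumbent's business.
mexican_before, italian_before = 0.60, 0.40
mexican_after, italian_after = mexican_before * 0.8, italian_before * 0.8
restaurants_ok = iia_holds(
    odds(mexican_before, italian_before), odds(mexican_after, italian_after)
)

# Buses: the red bus takes half of the blue-bus riders and none of the car riders.
car_before, blue_before = 0.50, 0.50
car_after, blue_after = 0.50, 0.25
buses_ok = iia_holds(odds(car_before, blue_before), odds(car_after, blue_after))

print("restaurants satisfy IIA:", restaurants_ok)
print("buses satisfy IIA:", buses_ok)
```

The contrast between the two cases is the whole point: a new alternative that draws proportionally from every existing alternative leaves the pairwise odds intact, whereas a close substitute for only one alternative does not.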
ORDINAL LOGISTIC REGRESSION

Often in the social sciences we have ordinal dependent variables, where the response categories can be ordered on some dimension but where the distance between categories is unknown. Most attitude variables are of this sort. For example, if people are asked to say how happy they are, and the response categories include "very happy," "pretty happy," and "not too happy," there is no ambiguity in assuming that those who say they are "pretty happy" are less happy than those who say they are "very happy" and are more happy than those who say they are "not too happy." However, there is no basis for assuming that the distance between "not too happy" and "pretty happy" is the same as the distance between "pretty happy" and "very happy." Many other attitude scales have similar properties. In such cases we could predict the scale score using ordinary least squares regression. However, to do so would be tantamount to assuming that the distance between response categories is uniform. (For a fuller discussion of this and other points, see Winship and Mare [1984].)

An alternative is to estimate an ordinal logit equation, which makes use of the ordinal property of the response categories on the dependent variable but makes no assumption at all about the relative distances between categories. The basic assumption of the ordinal logit model is that there is an unobserved continuous dependent variable, Y*, which is a linear function of a set of independent variables:

\[
Y^{*} = a + \sum_j b_j X_j + \varepsilon \tag{14.3}
\]
However, what is observed is a set of ordered categories, Y = 1, ..., m, such that

\[
Y = i \quad \text{if } k_{i-1} \le Y^{*} < k_i \tag{14.4}
\]

where the k_i are "cutting points" on the unobserved, or latent, underlying variable. Now, because we observe Y = 1 when Y* < k_1, observe Y = 2 when k_1 ≤ Y* < k_2, and so on, it follows that

\[
\Pr(Y = i \mid X) = \Pr(k_{i-1} \le Y^{*} < k_i \mid X) \tag{14.5}
\]

Substituting from Equation 14.3 and imposing the constraint that a = 0, which is necessary to identify the equation, we have

\[
\Pr(Y = i \mid X) = \Pr\Bigl(k_{i-1} \le \sum_j b_j X_j + \varepsilon < k_i \Bigm| X\Bigr) \tag{14.6}
\]

Then, subtracting within the inequality and noting that the probability that a random variable falls between two values is the difference between the cumulative distribution functions evaluated at these values, we have

\[
\Pr(Y = i \mid X) = \frac{1}{1 + e^{-k_i + \sum_j b_j X_j}} - \frac{1}{1 + e^{-k_{i-1} + \sum_j b_j X_j}} \tag{14.7}
\]
That is, the expected probability that an observation will have a particular value is the difference between the probability associated with reaching the upper-bound cutting point and the probability associated with reaching the lower-bound cutting point, where these probabilities are estimated from logistic functions known as cumulative logits because they give the log odds of reaching each cutting point. (Note that for the extreme categories one of the terms of Equation 14.7 drops out or becomes 1, because the cumulative probability is 0 at the lower extreme and 1 at the upper extreme.)
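Equation 14.7 translates directly into code, which may make the role of the cutting points easier to see. The following Python sketch is an illustration under stated assumptions: the cut points and the value of the linear predictor are hypothetical numbers chosen for the example, not estimates from any model in this chapter.

```python
import math


def logistic_cdf(z):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(xb, cut_points):
    """Pr(Y = i | X) for every ordered category, following Equation 14.7.

    `xb` is the linear predictor (the sum of b_j * X_j, with a = 0) and
    `cut_points` is an increasing list of the m - 1 cutting points.
    """
    # Cumulative probability of falling below each cutting point; the
    # final 1.0 closes off the highest category.
    cumulative = [logistic_cdf(k - xb) for k in cut_points] + [1.0]
    probabilities = []
    lower = 0.0  # cumulative probability below the lowest category
    for upper in cumulative:
        probabilities.append(upper - lower)
        lower = upper
    return probabilities


# Hypothetical values: six cutting points define seven ordered categories.
HYPOTHETICAL_CUTS = [-2.0, -1.0, -0.3, 0.4, 1.0, 2.2]
example = category_probabilities(0.6778, HYPOTHETICAL_CUTS)
print([round(p, 3) for p in example])
```

Each probability is a difference between adjacent cumulative logistic probabilities, so the full set necessarily sums to one, and a larger linear predictor shifts probability toward the higher categories.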
A Worked Example: Political Party Identification in the United States, 1998

Consider the following substantive problem. Suppose we wish to assess what factors lead people to place themselves toward the Republican rather than toward the Democratic end of a scale of political party identification. Here is the item and the ordinal response categories in the 1998 GSS:

    Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?
    IF REPUBLICAN OR DEMOCRAT: Would you call yourself a strong (Republican/Democrat) or not a strong (Republican/Democrat)?
    IF INDEPENDENT, NO PREFERENCE, OR OTHER: Do you think of yourself as closer to the Republican or Democratic Party?
This set of questions and responses yielded seven response categories:

• Strong Democrat
• Not strong Democrat
• Independent, near Democrats
• Independent
• Independent, near Republicans
• Not strong Republican
• Strong Republican
On the ground that the Republican Party is increasingly the party of nonurban, affluent, non-Black males, especially those from the South, I predict the score on a continuum underlying the listed response categories from the following variables:

• size of place (people living in large [population more than 250,000] central cities of Standard Metropolitan Statistical Areas [SMSAs], other people living in SMSAs, and people living outside SMSAs)
• income (with categories recoded to their midpoints and the open-ended upper category, $110,000 and over, recoded to $150,000)
• gender (male versus female)
• region of residence (South versus other)
• race (Black versus non-Black)
Survey estimation procedures were used to take account of clustering and household size. The Stata commands that carry out this analysis are shown in downloadable file "ch14_2.do" and the results are shown in "ch14_2.log."

A property of ordinal logistic regression (which applies also to binomial logistic regression models of the sort discussed in the previous chapter) is that the variance of the latent variable presumed to underlie the observed outcome variable changes as variables are added to the prediction equation. Thus, it is not appropriate to directly compare corresponding coefficients across models, as is commonly done with OLS models (see Chapter Six). Rather, the latent dependent variable must first be standardized. To illustrate how to carry out such a standardization and how to interpret the resulting coefficients, I estimate two models: Model 1, which omits race, and Model 2, which includes race.
First consider Model 1, shown in the left panel of Table 14.2. Here we see that all the variables have the expected signs (a positive sign means a shift toward Republican identification). However, Southern residence is not at all significant. (To assess the significance of the two "urban" coefficients, I use Stata's -test- command in the usual way. Doing so, I conclude that the urban distinctions are significant at well beyond conventional levels.) Now consider Model 2. Once race is included in the model, the effect of Southern residence becomes marginally significant (at the .048 level). This is what we would expect, considering that Blacks are more likely than non-Blacks to reside in the South (53 percent of Blacks versus 33 percent of non-Blacks) and are also much more likely to identify as Democrats (63 percent of Blacks versus 30 percent of non-Blacks). When race is not included in the model, the large fraction of Southern Black Democrats suppresses the positive effect of Southern residence on Republican leaning. Once we control for race, this effect emerges clearly.

Converting the Logits to Y*-Standardized Form

Inspecting the coefficients, it appears that the inclusion of race in the model dramatically increases the effect of Southern residence, from .050 to .187. However, this comparison is inappropriate because the variance of the latent "Republicanism" variable changes when additional variables are included in the model. Thus, before comparing coefficients, it is necessary to standardize the coefficients. Although there are several ways to do this, a particularly appealing approach is to standardize only the latent dependent variable, so that the resulting (Y*-standardized) coefficients indicate the expected change in the standard deviation of the latent variable for a one-unit change in the independent variable. An important advantage of Y*-standardization over full standardization is that, as we saw in Chapter Six, fully standardized coefficients are not appropriate for categorical variables because for such variables they are affected by the relative size of the category as well as by the size of the metric effect.

An additional reason for standardizing the coefficients, even when we do not want to compare corresponding coefficients across models, is that the latent dependent variable
TABLE 14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
[Two panels, Model 1 and Model 2, each with columns b, Standard Error, p, and Y*-Std. Coeff. Most cell values are not legible in this copy. The Model 2 row for Black reads b = -1.414, standard error = 0.164, p = .000, Y*-standardized coefficient = -.423.]
has no intrinsic metric, which makes the size of the unstandardized coefficients meaningless. (Recall from Equation 14.3 that the coefficients shown in Table 14.2 represent the expected change on the unobserved, or latent, variable, Y*, for a one-unit change in each of the independent variables, holding constant all other independent variables.) However, because it is possible to estimate the variance of Y*, we can divide the coefficients by the standard deviation of Y* to get semistandardized, that is, Y*-standardized, coefficients, which are then interpreted as the number of standard deviations of expected difference in Y* for two individuals who differ by one unit on the given independent variable. That is,

\[
b_j^{S_{Y^{*}}} = \frac{b_j}{\sigma_{Y^{*}}} \tag{14.8}
\]

where b_j is the coefficient associated with the jth variable and σ_{Y*} is the standard deviation of Y*. To estimate the variance of Y*, I follow Long (1997, 129):

\[
\operatorname{var}(Y^{*}) = B'VB + \operatorname{var}(\varepsilon) \tag{14.9}
\]

where B is a vector of logits, V is the variance-covariance matrix of the independent variables, and var(ε) is π²/3. (See the downloadable file "ch14_2.do" for how to estimate these coefficients, which are reported in the rightmost column of each panel of Table 14.2.)

Consider Model 2. As we see, net of other factors, Blacks are nearly a half standard deviation lower than non-Blacks with respect to Republican orientation. No other variable has nearly as strong an impact. In particular, although positive, the effect of Southern residence is weak, only about half as strong as the effect of gender and about a third as strong as the effect of nonmetropolitan residence. Family income also has only a modest effect. Two individuals would have to differ in income by about $184,000 per year, net of all other factors, to be about as far apart in Republican tendencies as are Blacks and non-Blacks who are identical in other respects (precisely, 0.423/0.023 = 18.39, with income measured in units of $10,000).

Getting Predicted Percentages

Another way to assess the magnitude of the effects is to evaluate the prediction equation for particular values of the independent variables. To do this we need to take account both of the coefficients associated with each of the independent variables and of the ancillary parameters, the cut points. For example, from Equation 14.7 we can estimate (from the Model 2 coefficients) the probability that a non-Black man earning $40,000 to $50,000 per year and living in the central city of an SMSA outside the South is categorized as a "strong Democrat":
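\[
\Pr(Y = 1 \mid X) = \frac{1}{1 + e^{-k_1 + .0764(4.5) + .334}} \tag{14.10}
\]

Similarly, the probability that such a person is a "not strong Democrat" is

\[
\Pr(Y = 2 \mid X) = \frac{1}{1 + e^{-k_2 + .0764(4.5) + .334}} - \frac{1}{1 + e^{-k_1 + .0764(4.5) + .334}} \tag{14.11}
\]

where k_1 and k_2 are the first two estimated cut points, .0764 is the income coefficient (with income of $45,000 expressed as 4.5), and .334 is the coefficient for males.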
Although these probabilities may be computed by hand, it is easy to have Stata do the work. The trick is to use the -predict- command to get the predicted distribution for persons with the desired profile of characteristics (see the Stata log). Table 14.3 shows the predicted distribution of party identification for Black and non-Black males earning $40,000 to $50,000 per year and living in central cities of SMSAs. (Of course, I could equally well estimate predicted probability distributions for any combination of characteristics. Indeed, it is possible to get estimates for combinations of variables not found in the sample by creating a new data set containing these combinations. See the discussion of -predict- in StataCorp 2007.) As we see, non-Blacks are substantially more Republican than otherwise similar Blacks.

Constructing Odds Ratios

Still another way to assess the net effect of an independent variable is to compute its contribution to the ratio of the odds of being below any given value in the ordinal scale to the odds of being at or above that value. Because of the way the logits are derived, their contributions to odds ratios are constant regardless of the cutting point, and it can be shown (Long 1997, 139) that they are equal to e^(-b_j) for the jth independent variable. Thus, for example, the ratio of the odds that males and females will be strong Democrats versus less than strong Democrats is just e^(-0.334) = 0.72; or, put more naturally, net of other factors the odds that women are strong Democrats (rather than anything closer to Republicans) are about 40 percent greater than the corresponding odds for men (precisely, 1.39 = 1/0.72). Similarly, the odds that women are any kind of Democrat (compared to Independents plus Republicans) are about 40 percent greater than the odds for men.

Comparisons to Other Estimating Procedures: -gologit2-

As we have just seen, an important constraint embedded in the -ologit- estimation procedure is what is known as the proportional odds assumption: that the explanatory variables have the same effect on the odds that the dependent variable is below any dividing point.
On the face of it, there is often little reason to assume that the odds are proportional. Why should we assume, for example, that gender has the same effect in distinguishing strong Democrats from all others and in distinguishing anyone who is Democratic-leaning from Independents and Republicans, and the same for each other independent variable? A user-written -ado- file, -gologit2- (for Generalized Ordered Logit Model), relaxes this assumption, allowing the odds to vary across cutting points. Reestimating Model 2 of Table 14.2 using -gologit2- rather than -ologit- yields the coefficients shown in Table 14.4. As we can see, the effects of each variable differ substantially from category to category. For example, Southern residence distinguishes neither strong Democrats nor both kinds of Democrats from those who lean more toward Republicans, nor does it distinguish strong Republicans from others; but it does significantly affect the remaining distinctions. Similarly, nonmetropolitan residence matters rather more in the middle of the distribution than at either extreme. Still, the pattern of distinctions doesn't appear to be very systematic.
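The proportional odds property can be demonstrated numerically: under the ordinal logit model, the male-to-female ratio of the odds of falling below a cutting point is the same at every cutting point and equals e^(-b). In the Python sketch below, the male coefficient (.334) and the income term (.0764 × 4.5) come from the text, whereas the cut points are hypothetical values chosen only for illustration.

```python
import math

B_MALE = 0.334            # Model 2 logit for males, as used in the text
XB_FEMALE = 0.0764 * 4.5  # linear predictor for an otherwise identical woman
XB_MALE = XB_FEMALE + B_MALE

# Hypothetical cutting points; the estimated values are in the Stata log.
HYPOTHETICAL_CUTS = [-2.0, -1.0, -0.3, 0.4, 1.0, 2.2]


def odds_below(cut_point, xb):
    """Odds of falling below a cutting point, from the cumulative logistic."""
    p = 1.0 / (1.0 + math.exp(-(cut_point - xb)))
    return p / (1.0 - p)


# Ratio of male to female odds at each cutting point.
ratios = [
    odds_below(k, XB_MALE) / odds_below(k, XB_FEMALE) for k in HYPOTHETICAL_CUTS
]

for k, r in zip(HYPOTHETICAL_CUTS, ratios):
    print(f"cut point {k:5.1f}: odds ratio = {r:.3f}")
```

The cut point cancels out of every ratio, which is exactly the constraint that -gologit2- relaxes by letting each cutting point have its own set of coefficients.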
TABLE 14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
[Columns: Black, Non-Black. Rows: the seven party identification categories, Strong Democrat through Strong Republican, plus Total (0.999 for Blacks, 1.001 for non-Blacks). Most cell values are not legible in this copy; two surviving values, .095 (Independent) and .053 (Not strong Republican), cannot be assigned to a column.]
In my judgment, because the gologit model is rather more complex than the ologit model, two criteria need to be satisfied to justify substituting -gologit2- for -ologit- estimates: first, that the proportional odds assumption be shown to be inadequate; and second, that the coefficients for the gologit model be interpretable and informative. To determine whether the proportional odds assumption is inadequate, we estimate the gologit model and test the equality of corresponding coefficients for each of the cutting points. In the present case we reject the null hypothesis that the coefficients are equal (χ² with 30 d.f. = 147; p < .0001). However, I am hard pressed to arrive at a coherent interpretation of the variations in the coefficients across cutting points. I would
ESTIMATING GENERALIZED ORDERED LOGIT MODELS

The generalized ordered logit estimator used here is the version enhanced by Williams (2006). Williams's -ado- file, -gologit2-, can be downloaded from within Stata. Type "net search gologit2", click the first entry, and then select "Click here to install."
TABLE 14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
[Only two of the panels are legible in this copy; values for the remaining panels are not recoverable.]

                                              b        Standard Error     p

Democrat or Democratic-leaning Independent versus other
Annual income ($10,000s)                    0.0582        0.0111        .000
Residence in SMSA, not in large center      0.469         0.135         .001
Residence outside SMSA                      0.700         0.182         .000
Male                                        0.239         0.096         .013
Southern residence                          0.238         0.117         .043
Black                                      -1.579         0.201         .000
Intercept                                  -0.591         0.158         .000

… versus Republican-leaning Independent or Republican
Annual income ($10,000s)                    0.0941        0.0127        .000
Residence in SMSA, not in large center      0.574         0.142         .000
Residence outside SMSA                      0.820         0.182         .000
Male                                        0.383         0.093         .000
Southern residence                          0.345         0.114         .002
Black                                      -1.724         0.225         .000
Intercept                                  -1.696         0.111         .000
thus be inclined to settle for the conventional ordinal logit model on the grounds of parsimony.

Ordinary Least Squares as an Alternative

Finally, we could treat the dependent variable as an interval variable and estimate an ordinary least squares equation. This amounts to assuming that the distance between any pair of adjacent categories is identical. As it turns out, in the present case the coefficients yielded by the OLS model, shown in Table 14.5, are quite similar to those yielded by the ologit model. Thus, we might be as well served by simply estimating an OLS model, which is much simpler to estimate and to interpret than is the ologit model. The difficulty is that unless we carry out the analysis both ways we really do not know whether the results will be similar in any particular instance. Thus, a reasonable strategy is, indeed, to carry out the analysis both ways and, if the results prove to be similar, to present the OLS results but to add a note indicating that you did the analysis both ways and got similar results. Of course, if the results differ enough to affect the conclusions, the ordinal logit model is to be preferred over OLS because it is less restrictive; that is, because it does not assume that the categories are equidistant.
TOBIT REGRESSION (AND ALLIED PROCEDURES) FOR CENSORED DEPENDENT VARIABLES

Often we have dependent variables that are censored in the sense that the recorded values do not represent the entire range of the true underlying variable. The classic case is that studied by the economist James Tobin (1958), hence the name tobit regression (coined by econometrician Arthur Goldberger when he described "Tobin's probit"), where a consumer good was purchased if the desire was high enough, with "desire" measured by the dollar amount spent on the good. From this definition of "desire," it is evident that the measure is "censored" at zero, because all those not making the purchase are recorded as having "zero" desire, whereas in reality some might have been close to making it and might have done so had the price been a little lower. Others might have had no desire at all and would never have made the purchase regardless of the price, and still others might have wavered in between. That is, there actually is variability in the relative desire of those recorded as having zero desire.

An underlying variable is censored in many other situations as well. The classic case is where many values are below a threshold that would lead to action; for example, the number of extramarital affairs (Fair 1978), the number of infant deaths experienced by mothers (Wood and Lovell 1992), the number of killings committed by police in different jurisdictions (Jacobs and O'Brien 1998), the number of arrests after release from prison (Witte 1980), the number of scientific publications (Stephan and Levin 1992), the number of protests in a nation (Walton and Ragin 1990), and the number of hours worked per year (Rosen 1976, Keeley and others 1978, Quester and Greene 1982). But we also can imagine other kinds of cases: attitude variables that fail to offer enough options, income recorded in categories with a top code that is too low, censoring that occurs because the length of time to an event is analyzed only for those to whom the event has occurred
TABLE 14.5. Effect Parameters for an Ordinary Least Squares Regression Model of Political Party Identification, U.S. Adults, 1998.

                                              b        Standard Error     p

Annual income ($10,000s)                    0.0803        .0100         .000
Residence in SMSA, not in large center      0.411         .100          .000
Residence outside SMSA                      0.620         .141          .000
Male                                        0.337         .082          .000
Southern residence                          0.212         .096          .029
Black                                      -1.386         .146          .000
Intercept                                   1.981         .123          .000
JAMES TOBIN (1918-2002), a Nobel Laureate in Economics (1981), is best known among social scientists for his procedure for estimating models with censored dependent variables. But his major work, for which he won the Nobel Prize, was his analysis of financial markets and their relations to consumption and investment decisions, employment, production, and prices. He made major contributions to the analysis of how households and firms actually determine the composition of their assets, developing what is known as "portfolio selection theory." The result was a description and analysis of financial markets and flows in the economy.

Tobin grew up in a liberal household in Champaign, Illinois, where his father was a journalist who worked as publicity director for the University of Illinois athletic program and his mother was a social worker. He attended the university's laboratory high school where, as he notes in his Nobel lecture, he cast the only vote for Roosevelt in a 1932 presidential straw vote. He did his undergraduate and graduate work in economics at Harvard, where he earned a PhD in 1947, his graduate studies having been interrupted, first by a job in Washington, D.C., and then by service in the Navy as an officer on a destroyer. After three years as a Harvard Junior Fellow (a very prestigious fellowship), which he used in part to study econometric developments he had missed during the war, he then spent his entire academic career at Yale.
Smith, and Nord 1990). Other substantive applications include Mare and Chen ; Saltzman (1987); Roncek (1992); and Treno, Alaniz, and Gruenewald (2000).
The Tobit Model

An obvious question is what to do in the case where we have censored observations, for example, observations scored zero (or some other constant) when we think there is variability in the true underlying value of the censored observations. One solution is to simply carry out an OLS regression of the entire data set. But this produces inconsistent estimates (Long 1997, 188-190). Another solution is to discard the censored cases and carry out an OLS estimation of the relationship in the noncensored cases, for example, determinants of how many hours people work among those who work at least some hours. But this approach, which amounts to truncating the distribution, also produces inconsistent estimates (Long 1997, 188-190). Tobin's solution was to divide observations into two sets: uncensored and censored observations. Formally, for observed values of dependent variable $Y$ censored at some value $\tau$, we have

$$
Y_i =
\begin{cases}
Y_i^* = a + \sum_k b_k x_{ik} + e_i & \text{if } Y_i^* > \tau \\
\tau_Y & \text{if } Y_i^* \le \tau
\end{cases}
\tag{14.12}
$$

That is, the observed value of $Y$ is equal to the "true" value of $Y$, $Y^*$, if $Y^*$ is above the value at which observations are censored and is equal to some constant value (usually, but not necessarily, the value at which observations are censored) if the true value is at or below the value at which observations are censored. For the first set, estimates are derived in the same way as in ordinary least squares estimation. For the second set, it is possible to estimate the probability that an observation is censored, conditional on the values of the independent variables, and to use this probability to estimate the likelihood. These estimates are then combined to produce expected values for all observations, conditional on the values of the independent variables:

$$
E(Y_i \mid x_i) = \left[\Pr(\text{uncensored} \mid x_i) \times E(Y_i \mid Y_i > \tau, x_i)\right] + \left[\Pr(\text{censored} \mid x_i) \times \tau_Y\right]
\tag{14.13}
$$

where $x_i$ stands for $a + \sum_k b_k x_{ik}$. For an accessible exposition of the mathematics involved, see Long (1997, Chapter 7).
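The two pieces of the tobit likelihood, and the expected value in Equation 14.13, can be sketched in a few lines. This is only an illustration under stated assumptions: the errors are taken to be normal with standard deviation `sigma`, censoring is from below only, and the function names (`tobit_loglik`, `expected_observed`) are hypothetical rather than anything from the book's Stata files.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(y, xb, sigma, tau=0.0):
    """Log-likelihood of a left-censored tobit model (assumes normal errors).

    y  : observed outcomes; values at or below tau are treated as censored
    xb : linear predictions a + sum_k b_k * x_ik, one per observation
    """
    total = 0.0
    for yi, mu in zip(y, xb):
        if yi > tau:
            # Uncensored case: ordinary normal density, as in OLS.
            total += math.log(norm_pdf((yi - mu) / sigma) / sigma)
        else:
            # Censored case: probability that the latent value is <= tau.
            total += math.log(norm_cdf((tau - mu) / sigma))
    return total

def expected_observed(mu, sigma, tau=0.0, tau_y=0.0):
    """Equation 14.13 under normal errors, for one linear prediction mu."""
    alpha = (mu - tau) / sigma
    p_uncensored = norm_cdf(alpha)
    # Mean of the latent variable among uncensored cases.
    e_uncensored = mu + sigma * norm_pdf(alpha) / p_uncensored
    return p_uncensored * e_uncensored + (1.0 - p_uncensored) * tau_y
```

In practice the coefficients and `sigma` would be chosen to maximize `tobit_loglik`; the sketch shows only how each observation contributes, not the optimization itself.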
The tobit model has been extended and generalized in a number of ways:

• to allow for right censoring and both left and right censoring (that is, at low values and high values of a distribution)
• to allow for the possibility that different observations are censored at different values (for example, income when several years of the GSS are pooled)
• to allow for situations in which an underlying continuous variable is coded into a set of categories (in many surveys income is coded this way)
• to correctly estimate effects where observations are truncated
• to deal with sample selection problems
In the following section I provide a worked example that illustrates many of these extensions. (For estimation details, see the Stata downloadable files "ch14_3.do" and "ch14_3.log.")

A Worked Example: Frequency of Sex
The 2000 GSS included the question "About how often did you have sex during the last twelve months?" The response categories, with the codes assigned to them, are shown in Table 14.6. Clearly, these data are censored both below and above. Those who have not had sex at all in the last year include those who have never ever had sex and those who have simply been unlucky in the past year, with others in between. At the other extreme, "more than three times a week" could mean four times a week, or five times a week, and may understate the prowess of newlyweds and other sexual athletes. Finally, some categories include a range, which might or might not be optimally represented by the midpoint.

[Table 14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults. Columns: response category, midpoint, lower bound, upper bound. Surviving row labels include "2 or 3 times a month" and "2 or 3 times a week."]

To illustrate the effect of censoring, let us consider a simple model in which frequency of sex is predicted from age, gender, and marital status (currently married versus not). In fact, in this and most analyses involving age, it would be better to include a squared term. However, I do not do so just yet because including only linear terms makes the exposition easier. Table 14.7 shows the results for four estimates:
• ordinary least squares estimates with the categories coded at their midpoints but with an arbitrary top code of 208 for "more than 3 times a week" (= 52 × 4)
• tobit estimates with censoring from below
• tobit estimates with censoring both below and above
• interval regression estimates with censoring both below and above
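Why the choice among these estimators matters can be seen in a small simulation. The sketch below is not the GSS analysis: the data are generated artificially, the true slope `b = 2.0`, the censoring point of zero, and all function names are assumptions made for illustration. It compares the OLS slope on the latent variable with the OLS slope on the censored outcome and on the truncated sample (censored cases discarded).

```python
import random

def ols_slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, a=0.0, b=2.0, seed=1):
    """Simulate a latent outcome censored from below at zero."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    latent = [a + b * xi + rng.gauss(0, 1) for xi in x]
    # Censoring: everything at or below zero is recorded as zero.
    observed = [max(0.0, v) for v in latent]
    # Truncation: censored cases are dropped altogether.
    kept = [(xi, v) for xi, v in zip(x, latent) if v > 0]
    return {
        "latent": ols_slope(x, latent),
        "censored": ols_slope(x, observed),
        "truncated": ols_slope([p[0] for p in kept], [p[1] for p in kept]),
    }

slopes = simulate()
```

In this setup both the censored and the truncated slopes fall well short of the true slope, which is the pattern the tobit estimator is designed to correct.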
[Table 14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Significant at .001 or Beyond.) Columns: Model 1, OLS; Model 2, Tobit, Left Censored; Model 3, Tobit, Left and Right Censored; Model 4, Interval, Left and Right Censored.]

Comparing the coefficients in the two left columns, we see that the effect of censoring from below is severe. Failure to take proper account of such censoring results in an underestimate of the effect of marital status on frequency of sex by about half and a very substantial underestimation of the effects of age and of being male. Interestingly, taking account of censoring from above as well as below hardly changes the coefficients, suggesting that marital status, age, and gender have little impact on the probability of being extremely sexually active. Inspection of the probability of censorship from above confirms this supposition: even among the most sexually active group, young married men, no more than about 15 percent have sex more than three times per week. By contrast, there is great variability by marital status, sex, and especially age in the probability of never having had sex in the last year, ranging from about 3 percent of young married men to about 90 percent of elderly unmarried women.

Apart from the probabilities, three predictions are of interest: the linear prediction from the model, the censored prediction, and the truncated prediction. Graphs of the predicted values for Model 4 are shown in Figure 14.1 by age, for married women. The linear prediction is the prediction from the model, which tells us that, net of other factors, the frequency of sex per year declines by about 2.3 occasions per year of age. The graph tells us that for married women the frequency of sex declines to less than once a year at about age seventy. Although negative observed values make no sense, the linear prediction gives the values of a latent, or underlying, variable. We can think of this variable as the propensity for sex, which declines steadily with age (because, of course, we have modeled the frequency of sex as a linear function of age).

[Figure 14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552). Horizontal axis: age.]

The censored prediction equals the latent prediction when the dependent variable is observed and equals the censoring value when the dependent variable is censored. (Somewhat confusingly, Stata calls censored predictions the "ystar" option, although y* is usually taken to indicate the latent variable, as it is in Equation 14.12.) Thus in this case, we assume that 0 and 208 are true values for those in the lowest and highest categories. By construction, censored predictions must fall within the range of the uncensored observations.

The truncated prediction is defined only for those observations that are not censored. In this case the truncated prediction gives the predicted frequency of sex among those who had any sex at all in the last year. Note that neither the censored prediction nor the truncated prediction is linear. Thus, these predictions must be evaluated at specific levels of the independent variables. Most commonly we will be interested in the linear prediction.

Now that we see how to interpret tobit coefficients, let us extend the analysis slightly to make it more substantively plausible. I do this by adding a squared term for age and studying interactions between age, gender, and marital status. As it happens, it is not necessary to posit three-way interactions among marital status, gender, and, respectively, age and age squared; a model positing the three-way interactions does not fit significantly better than a model with the two sets of two-way interactions, between gender and, respectively, age and age squared, and between marital status and, respectively, age and age squared. The coefficients for this model are shown in downloadable file "ch14_3.log." Because they are difficult to interpret directly, I have graphed (in Figure 14.2) the relationship between age and the frequency of sex for each gender-marital status combination. Inspecting the graph, we see that (no surprise) married people have more active sex lives than do currently unmarried people of the same age and gender, and that sexual activity declines at an increasing rate with age.
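The linear, censored, and truncated predictions discussed above can be computed from a linear prediction and a residual standard deviation. The sketch below assumes normal errors and censoring limits of 0 and 208, as in the worked example; the intercept of 150 and the value `sigma = 60` are hypothetical numbers chosen for illustration (only the slope of about 2.3 occasions per year of age comes from the text), and the function name `predictions` is likewise an assumption.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predictions(mu, sigma, lower=0.0, upper=208.0):
    """Linear, censored, and truncated predictions for a two-limit tobit.

    mu is the linear (latent) prediction; normal errors are assumed.
    """
    a = (lower - mu) / sigma
    b = (upper - mu) / sigma
    p_low = norm_cdf(a)                 # probability of censoring from below
    p_high = 1.0 - norm_cdf(b)          # probability of censoring from above
    p_mid = norm_cdf(b) - norm_cdf(a)   # probability of being uncensored
    # Expected value among uncensored cases only (truncated prediction).
    truncated = mu + sigma * (norm_pdf(a) - norm_pdf(b)) / p_mid
    # Censored cases are assigned the limits themselves (censored prediction).
    censored = lower * p_low + upper * p_high + p_mid * truncated
    return {"linear": mu, "censored": censored, "truncated": truncated}

# Hypothetical latent line: 150 - 2.3 * age, evaluated at age 70.
at_seventy = predictions(150.0 - 2.3 * 70, sigma=60.0)
```

At ages where the linear prediction is negative, the censored and truncated predictions remain positive, which is why the three curves in a figure of this kind diverge at older ages.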
[Figure 14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258). Horizontal axis: age. Lines: currently married men, currently married women, not married men, not married women.]
Interestingly, in both marital status categories, men report more active sex lives than do women of the same age and marital status. The reason for the gender discrepancy within marital status categories is not completely clear but probably reflects a tendency for men to overreport and (or) for women to underreport their sexual activity. Note that, considering only heterosexual activity, both the average number of sexual encounters and the average number of partners must be identical for males and females. Thus there clearly is biased reporting; differential nonresponse (for example, the likelihood that some women with many sexual partners, for example, prostitutes, are underrepresented in the GSS); or more reported homosexual activity among men than among women.

Married men and women both average about one partner (precisely, 1.03 and 0.9[?]), which suggests that for both married men and married women their spouse is usually their only partner, which in turn would imply that the average number of sexual encounters should be the same for currently married men and women, adjusting for the three-year average difference in age. However, inspection of Figure 14.2 shows a difference larger than can be explained by the age gap (if the age gap were the full explanation, the lines would be parallel for married men and women, and a line segment of three years' length drawn to the left of the male line and parallel to the x-axis should just touch the female line). This suggests the possibility that either or both married men and married women distort their reports of the frequency of sexual activity in a socially desirable direction, men claiming sexual prowess and women claiming sexual modesty. The likelihood of distortion is substantially greater among the unmarried: unmarried men on average report about twice as many partners in the last year as do unmarried women (1.85 compared to .90), which, given that this discrepancy is far too large to be accounted for by differences in homosexual activity, suggests the possibility that unmarried men and women distort both the number of partners and the frequency of sexual activity in the socially desirable direction.
Another possibility is that the propensity for women to be younger than their male partners partly accounts for the gender difference in reported sexual activity among the unmarried. Adjudicating among these possibilities would require more analysis than is warranted here.
OTHER MODELS FOR THE ANALYSIS OF LIMITED DEPENDENT VARIABLES

This introduction by no means exhausts the variety of procedures available for the analysis of limited dependent variables. Stata 10.0 includes commands to carry out a number of procedures, including

• Conditional logistic regression and mixed models, where outcomes depend on features of the outcomes as well as on characteristics of the individuals. For examples, see Boskin (1974), Hoffman and Duncan (1988), White and L:ae (1998), and Yanovitzky and Cappella (2001).
• Nested logistic regression, which extends conditional logit analysis by dividing outcomes into a hierarchy of levels. For examples, see Cameron (2000), Soopramanien and Johnes (2001), and South and Baumer (2001).
• Probit regression, an alternative to logistic regression. For a brief introduction, see Appendix 13.B.
• Poisson regression, used to model counts, the number of occurrences of an event. A classic example is von Bortkiewicz's 1898 study of the number of soldiers kicked to death by horses in the Prussian army. Applications in the social sciences include Long (1990), Greenberg (1991), Rasler (1996), Chattopadhyay and others (2006), and Weitoff and others (2008). The definitive statistical treatment of Poisson regression is Cameron and Trivedi (1998).
Long (1997), Hosmer and Lemeshow (2000), and Powers and Xie (2000) provide excellent introductions to many of these procedures that, with a bit of diligence, are accessible to social scientists who have a modest statistical background. Long and Freese (2006) provide a guide to using the procedures in Stata. For a useful overview, see Gould (2000).
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate models for three types of limited dependent variables: ordinal variables, for which ordinal logit analysis is the appropriate method; nominal variables, for which multinomial logit analysis is the appropriate method; and censored variables (where values above or below some cutting point are not observed), for which tobit modeling is the appropriate method.
CHAPTER 15

IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider two closely related techniques for coping with omitted variable bias. Recall from Chapter Six that omitted variable bias occurs when we have failed to include in our model variables that affect the outcome and that are correlated with one or more of the predictor variables. The techniques discussed in this chapter for estimating unbiased coefficients are known as fixed effects and random effects models. These models use information on the same individuals from two or more time points or information on two or more individuals within groups (families, schools, firms, communities, or similar groups) to purge the estimating equation of all characteristics, measured or unmeasured, that are constant over time or constant within groups. The result is that the estimated effects of the characteristics we are able to measure are not biased by unobserved time-invariant factors. For useful introductions to these techniques, see Allison (2005) and Wooldridge (2006, Chapters 13 and 14), both of which I draw on in this chapter.
INTRODUCTION

As we have seen at many points in this book, the nonexperimental methods we have been studying are vulnerable to omitted variable bias: the possibility that unmeasured factors affect both the predictor and outcome variables. In this case the coefficients we estimate through OLS or logistic regression will be incorrect. To appreciate this most fully it is helpful to contrast the linear model approaches we have been studying with randomized experiments. In the classic randomized experiment, individuals are randomly assigned to two groups; members of the treatment group are exposed to some sort of intervention while members of the control group are not, and differences in one or more outcomes are measured. (This design can be generalized to include several different treatment groups, but the logic remains unchanged.) Because the treatment and control groups are, within the limits of sampling error, identical on average in their pretreatment attributes (or, to put the same point differently, receipt of treatment is uncorrelated with pretreatment attributes of individuals), any difference in average outcomes may be assumed to be caused by the treatment.
With linear model approaches, we attempt to approximate randomized experiments by statistically controlling for as many confounding factors (that is, factors correlated with both the predictor and outcome variables) as possible. For example, if we observe that men earn more than women, we might wonder whether this is due to discrimination. However, before we accepted such a conclusion we would want to consider whether, at least in part, the pay gap is due to the fact that men are more likely to have technical training, to enter high-paying fields, to have more work experience, and to work longer hours. We would then statistically control for these variables and assess the effect of gender on earnings among people who are identical with respect to the control variables. If we still found a gender difference in pay, we might then be willing to attribute the remaining difference to discrimination. However, we would be vulnerable to the claim that we had not included other crucial factors that could result in pay differences. For example, women may not bargain as effectively as men and may therefore accept jobs at lower pay levels than men. If we omit a measure of bargaining prowess (or if we measure bargaining prowess imperfectly so that true bargaining prowess remains partly unmeasured), then any effect of prowess would be captured by the error term. However, if bargaining prowess is correlated with gender, the assumption of OLS (and other linear models) that the error is uncorrelated with the predictor variables would be violated, producing biased coefficients.

So what can we do? It turns out that if we have measurements on the same individuals for at least two points in time, we can get unbiased estimates of the effects of variables that, at least for some individuals being studied, change over time. We do this by predicting change in our outcome variable from changes in our predictor variables, which has the effect of purging from our prediction equation those factors, measured and unmeasured, that do not change over time. But there is no such thing as a free lunch. The cost of this method, known as fixed effects (FE) modeling, is twofold: (1) We are unable to estimate the "main effects" of predictors that do not vary over time for individuals, for example, sex and race (although we are able to estimate interaction effects involving such variables and variables that do change over time, a point to which we will return in the context of our gender pay gap example, and we also are able to estimate effects that change over time for time-constant variables). (However, recent work by Bollen and Brand [2008] has shown how, with suitable assumptions about unobserved latent factors, it is possible to obtain effects of time-constant predictors within a structural equation modeling (SEM) framework, a set of techniques briefly discussed in the next chapter.) (2) When we are analyzing limited dependent variables, we usually will have a substantial reduction in sample size because in FE logistic regression, individuals who do not change over time on the outcome variable are dropped from the analysis. However, under some circumstances, and with some additional assumptions, we can recover our sample size by resorting to what is known as random effects (RE) modeling. We will consider this approach later in the chapter.
FIXED EFFECTS MODELS FOR CONTINUOUS VARIABLES

To see how FE works for continuous outcome variables, let us write a prediction equation:

$$
y_{it} = \mu_t + \beta x_{it} + \gamma z_i + \alpha_i + \varepsilon_{it}; \quad i = 1, \ldots, n; \quad t = 1, \ldots, T
\tag{15.1}
$$

where $y_{it}$ is the value of the outcome variable for the $i$th individual at time $t$; $\mu_t$ is an intercept that is allowed to vary with time; $x_{it}$ is a vector of variables that vary both over individuals and, for each individual, over time; $z_i$ is a vector of variables that vary over individuals but, for each individual, not over time; $\alpha_i$ represents unmeasured differences between individuals, that is, differences not accounted for by the $\gamma z_i$, that are fixed over time; and $\varepsilon_{it}$ represents idiosyncratic factors that vary both over time and across individuals.

To simplify the discussion, assume that $T = 2$, although the same conclusions hold when $T > 2$. Now suppose we simply pooled observations from the two time points and estimated our outcome through OLS. Clearly, insofar as omitted variables are correlated with the variables in the model (as in our example involving bargaining prowess), doing this will produce biased estimates because the fundamental assumption of OLS, that the error term (which in this case is the sum $\alpha_i + \varepsilon_{it}$, because $\alpha_i$ is unobserved) is uncorrelated with the predictor variables, will be violated.
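The Fundamental FE Equation

However, suppose we write separate equations for each time period and subtract one from the other. Subtracting

$$
y_{i1} = \mu_1 + \beta x_{i1} + \gamma z_i + \alpha_i + \varepsilon_{i1}
$$

from

$$
y_{i2} = \mu_2 + \beta x_{i2} + \gamma z_i + \alpha_i + \varepsilon_{i2}
\tag{15.2}
$$

yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.3}
$$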
Notice that both $\gamma z_i$, the effect of the time-constant predictor variables, and $\alpha_i$ have been differenced out of Equation 15.3, which is why equations of this sort are known as first-difference equations. Equation 15.3 is purged of all unmeasured factors that are constant over time, as well as any measured factors that are constant over time. Thus Equation 15.3 solves the omitted variable bias problem, assuming that there are no unobserved factors whose effects change over time; this is a nontrivial assumption. It also requires that there be at least some observations for which $x_{i1}$ and $x_{i2}$ differ and that $x_{i1}$ and $x_{i2}$ not be perfectly correlated, ruling out age, for example, as a candidate predictor. Finally, it requires that the observed predictor variables be uncorrelated with the idiosyncratic error terms, $\varepsilon_{i1}$ and $\varepsilon_{i2}$, at both time points: that is, that the observed predictor variables be strictly exogenous. Crucially, the observed predictors must not depend on the outcome observed at an earlier time point.
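How differencing removes a time-constant confounder can be illustrated with simulated two-wave data. In the sketch below everything is an assumption made for illustration: the true slope `beta = 1.0`, a period intercept shift of 0.5, an unmeasured `alpha` that is correlated with the predictor by construction, and the function names. It compares pooled OLS with the first-difference estimator of Equation 15.3.

```python
import random

def ols_slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_panel(n=5000, beta=1.0, seed=2):
    """Two-wave panel with an unmeasured, time-constant confounder alpha."""
    rng = random.Random(seed)
    x1, x2, y1, y2 = [], [], [], []
    for _ in range(n):
        alpha = rng.gauss(0, 1)            # fixed over time, never observed
        xa = alpha + rng.gauss(0, 1)       # predictor at time 1, correlated with alpha
        xb = alpha + rng.gauss(0, 1)       # predictor at time 2
        x1.append(xa)
        x2.append(xb)
        y1.append(0.0 + beta * xa + alpha + rng.gauss(0, 1))
        y2.append(0.5 + beta * xb + alpha + rng.gauss(0, 1))
    # Pooled OLS ignores alpha, which is absorbed into the error term.
    pooled = ols_slope(x1 + x2, y1 + y2)
    # First differences: alpha cancels when time 1 is subtracted from time 2.
    dx = [b - a for a, b in zip(x1, x2)]
    dy = [b - a for a, b in zip(y1, y2)]
    return pooled, ols_slope(dx, dy)

pooled_slope, fd_slope = simulate_panel()
```

In this setup the pooled slope is inflated by the correlation between the predictor and `alpha`, while the first-difference slope stays close to the true value.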
Allowing the Slopes of the Xs to Vary

Notice that in Equation 15.3 it is assumed that the effects of the predictor variables, the xs, are constant over time. This assumption can be tested by estimating a first-difference equation in which the slopes of the xs are allowed to vary. To see this, consider the following pair of equations:

$$
y_{i1} = \mu_1 + \beta_1 x_{i1} + \gamma z_i + \alpha_i + \varepsilon_{i1}
$$

and

$$
y_{i2} = \mu_2 + \beta_2 x_{i2} + \gamma z_i + \alpha_i + \varepsilon_{i2}
\tag{15.4}
$$

Here, subtracting the first equation of 15.4 from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta_2 (x_{i2} - x_{i1}) + (\beta_2 - \beta_1) x_{i1} + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.5}
$$

That is, to test the hypothesis that the slope of any of the xs differs over time, we include both the time 1 variable and the difference score. Then, if the coefficient for the time 1 variable differs significantly from zero, we conclude that the slopes are not equal and can get the value of the time 1 slope by subtracting the coefficient for the time 1 variable from the coefficient for the difference score.
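The recipe just described (regress the change in the outcome on both the difference score and the time 1 variable) can be checked on simulated data. The sketch below is illustrative only: the slopes `beta1 = 1.0` and `beta2 = 1.5`, the data-generating process, and the helper `ols_two` (a closed-form two-predictor OLS) are all assumptions.

```python
import random

def ols_two(x1, x2, y):
    """OLS slopes for y on two predictors (intercept included implicitly)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def simulate(n=5000, beta1=1.0, beta2=1.5, seed=3):
    """Two waves in which the slope of x changes from beta1 to beta2."""
    rng = random.Random(seed)
    diff_x, time1_x, diff_y = [], [], []
    for _ in range(n):
        alpha = rng.gauss(0, 1)               # unmeasured, fixed over time
        xa = alpha + rng.gauss(0, 1)
        xb = alpha + rng.gauss(0, 1)
        ya = 0.0 + beta1 * xa + alpha + rng.gauss(0, 1)
        yb = 0.5 + beta2 * xb + alpha + rng.gauss(0, 1)
        diff_x.append(xb - xa)
        time1_x.append(xa)
        diff_y.append(yb - ya)
    return ols_two(diff_x, time1_x, diff_y)

coef_diff, coef_time1 = simulate()
# Time 1 slope = coefficient on the difference score minus coefficient on x at time 1.
time1_slope = coef_diff - coef_time1
```

Here the coefficient on the difference score estimates the time 2 slope and the coefficient on the time 1 variable estimates the change in slope, matching Equation 15.5.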
Testing Whether the Effects of the Time-Invariant Variables Vary over Time

We also can allow the coefficients for the time-invariant variables to change over time. To see this, consider the following pair of equations:

$$
y_{i1} = \mu_1 + \beta x_{i1} + \gamma_1 z_i + \alpha_i + \varepsilon_{i1}
$$

and

$$
y_{i2} = \mu_2 + \beta x_{i2} + \gamma_2 z_i + \alpha_i + \varepsilon_{i2}
\tag{15.6}
$$

Subtracting the first equation of 15.6 from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + (\gamma_2 - \gamma_1) z_i + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.7}
$$

From Equation 15.7 we see that it is possible to assess the claim that the effects of the zs do not vary over time, by testing the significance of the coefficients associated with the z variables. Note that these coefficients do not show the effects of the zs but rather the differences in the effects of the zs between time 2 and time 1.
Interactions Between Time-Constant and Time-Varying Variables

As noted previously, we generally cannot get the effect of time-constant variables from the FE model (but see Bollen and Brand [2008]). However, we can get the effect of the interaction of the time-constant variables with the time-varying variables, the xs. To see this, consider the following pair of equations:

$$
y_{i1} = \mu_1 + \beta x_{i1} + \gamma z_i + \delta x_{i1} z_i + \alpha_i + \varepsilon_{i1}
$$

$$
y_{i2} = \mu_2 + \beta x_{i2} + \gamma z_i + \delta x_{i2} z_i + \alpha_i + \varepsilon_{i2}
\tag{15.8}
$$

Subtracting the first equation of 15.8 from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + \delta z_i (x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.9}
$$

For example, returning to the effect of gender on income, the FE model does not allow an assessment of the role of gender in creating income differences. However, it does allow us to determine, say, gender differences in the effect of changes in performance evaluation scores on changes in income. Suppose gender is coded 1 for males and 0 for females and x (now designating a single variable rather than a vector of variables) is a performance evaluation measure. Then we would have:

$$
\begin{aligned}
\text{for females:}\quad y_{i2} - y_{i1} &= (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1}) \\
\text{for males:}\quad y_{i2} - y_{i1} &= (\mu_2 - \mu_1) + (\beta + \delta)(x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})
\end{aligned}
\tag{15.10}
$$
More than Two Time Points

When we have three or more measurements per individual (which we increasingly do because a number of multiple-wave data sets are now in the public domain) there are several possibilities, of which two are simple extensions of the methods we have just discussed. Consider each of these.

First, we may analyze two waves at a time, computing first differences between successive waves. This approach has the limitation that, unless we turn to more advanced least squares methods, we cannot get a single set of coefficients for all waves combined, but rather $T - 1$ sets of coefficients for $T$ waves of data. Thus the successive waves approach is likely to be of greatest interest when the number of waves is small, say three or four.
An alternative is to pool the data over waves; then, for each variable in the data set, compute the average value over waves for each individual, subtract that individual's average from each wave-specific value, and use the resulting deviation scores in a conventional OLS regression. That is, compute

$$
\bar{y}_i = \frac{1}{n_i} \sum_t y_{it}, \qquad \bar{x}_i = \frac{1}{n_i} \sum_t x_{it}
\tag{15.11}
$$

where $n_i$ is the number of observations for person $i$. Then, for each variable, subtract the mean from the observed value:

$$
y^*_{it} = y_{it} - \bar{y}_i \quad \text{and} \quad x^*_{it} = x_{it} - \bar{x}_i
\tag{15.12}
$$

This yields an equation of the form

$$
y^*_{it} = \sum_t \mu_t D_t + \beta x^*_{it} + \varepsilon^*_{it}
\tag{15.13}
$$

where the $D_t$ are dummy variables that allow the intercept to differ by time period. Notice that the $z_i$ and $\alpha_i$ drop out of the equation because they are constant within individuals, so that their deviations from the individual means are zero. Equation 15.13 can be estimated through OLS, except that the standard errors must be adjusted for the degrees of freedom used up in computing the individual means. The reason is that Equation 15.13 is equivalent to including a dummy variable for each individual in the sample but, instead of estimating the coefficients of the dummy variables, working with deviation scores. Such equations can be estimated with Stata's xt commands. The various elaborations of FE models discussed earlier in the chapter also can be adapted to the analysis of more than two waves and need not be further discussed.
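The mean-deviation approach can be sketched for a balanced panel as follows. The simulated data, the true slope `beta = 1.0`, the period intercepts, and the function names are assumptions for illustration. One further assumption should be flagged: rather than entering the time dummies of Equation 15.13 explicitly, the sketch also subtracts wave means (and adds back the grand mean), which gives the same slope when every person is observed in every wave.

```python
import random

def within_slope(x, y):
    """Fixed effects slope from deviation scores (balanced panel assumed).

    x, y: lists of per-person lists, one value per wave.
    """
    n, waves = len(x), len(x[0])

    def two_way_demean(v):
        person = [sum(row) / waves for row in v]
        wave = [sum(v[i][t] for i in range(n)) / n for t in range(waves)]
        grand = sum(person) / n
        # Subtract person means (removes alpha_i and z_i) and wave means
        # (stands in for the period dummies D_t); add back the grand mean.
        return [v[i][t] - person[i] - wave[t] + grand
                for i in range(n) for t in range(waves)]

    dx, dy = two_way_demean(x), two_way_demean(y)
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

def simulate(n=2000, waves=4, beta=1.0, seed=5):
    """Balanced panel with period effects and a confounding alpha."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        alpha = rng.gauss(0, 1)
        xs = [alpha + rng.gauss(0, 1) for _ in range(waves)]
        ys = [0.5 * t + beta * xs[t] + alpha + rng.gauss(0, 1)
              for t in range(waves)]
        x.append(xs)
        y.append(ys)
    return x, y

panel_x, panel_y = simulate()
fe_slope = within_slope(panel_x, panel_y)
```

The sketch returns only the coefficient; standard errors from a naive OLS on deviation scores would still need the degrees-of-freedom adjustment.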
Fixing Effects Across Individuals Rather Than over Time

So far we have discussed FE methods for data on the same individuals at more than one point in time. But the same methods can be applied when we have data on more than one individual within each group. Consider the relationship between education and income. This relationship may be due, in part, to characteristics of families that promote both success in school and success in the job market. One way to control for such unobserved characteristics of families would be to compare siblings. Ashenfelter and Krueger (1994) carried out such an analysis, relating differences in the level of education between siblings to differences in their incomes. They found that the effects of education were in fact slightly stronger than in a conventional analysis controlling for gender, age, and race.
PANEL SURVEYS IN THE PUBLIC DOMAIN

Major U.S. panel studies of interest to social scientists include

• Panel Study of Income Dynamics (PSID): http://psidonline.isr.umich.edu
• National Longitudinal Surveys (NLS): http://www.bls.gov/nls
• Wisconsin Longitudinal Study (WLS): http://www.ssc.wisc.edu/wlsresearch
• Health and Retirement Study (HRS): http://hrsonline.isr.umich.edu
• National Longitudinal Study of Adolescent Health (Add Health): http://www.cpc.unc.edu/addhealth

Important foreign panel studies include

• China Health and Nutrition Survey (CHNS): http://www.cpc.unc.edu/china
• German Socio-Economic Panel Study (SOEP): http://www.diw.de/english/sop/index.html
• Indonesia Family Life Survey (IFLS): http://www.rand.org/labor/FLS/IFLS
• Mexican Family Life Survey (MXFLS): http://www.radix.uia.mx/ennvih/main.php?lang=en
• Mexican Health and Aging Study (MHAS): http://www.mhas.pop.upenn.edu/english/home.htm

Many additional panel surveys more or less comparable to the PSID are listed at http://psidonline.isr.umich.edu/Guide/PanelStudies.aspx.
Now consider a second example. In an analysis of Indonesian data, Frankenberg and Mason (1995) studied the effect of maternal education on behaviors conducive to children's health, including sanitation and hygiene practices such as the source and treatment of drinking water, waste disposal practices, and so on. However, in developing nations such as Indonesia, both a mother's level of education and the possibility of easily obtaining safe water or protecting against contamination from human waste tend to vary across communities, depending on their level of development. In this situation one would want to purge the association between maternal education and child health-related practices of the confounding influences of community characteristics. This is what Frankenberg and Mason did by fixing community characteristics and relating differences in health practices to differences in maternal education among women in the same communities. In this way they were able to show a causal effect of maternal education on behavior conducive to child health.
linitations of Fixed Effects Approachesand Cautions to Keepin Mind Lte all other statistical procedures, FE approachescarry a set of assumptions and rquirements.When theseare violated,FE coefficientsmay be worse(morebiased)than
370
DataAnalysis: DoingSocialResearch to Testldeas Quantitative
simply poolingdataandobtainingOLS estimates.Unfortunately,oftentheseassumptr:rr areunteslable. Herearesomecautions; If unmeasuredeffectsdo changeover time (or, in the crosssectionalapir:2. tion just discussed,do vary acrossindividuals),FE estimationdoesnot .:' *t the bias problem. It is thus necessaryto think carefully about whethe: & assumptionof timeconstantunmeasuredeffects is tenable.The samel'.r holds even more strongly for family or community fixed effectsone h. u assumethat noneof the unmeasuredfactorsaffectingthe outcomevaries3!::1ii individuals within families or communities.This is often dubious,espe,:' within families. To convinceyourself of this, think of recentU.S. presic:m andtheir ne'erdowellsiblings;or simplyconsidervariationsamongsib::. in families you know. Could such differencesaccountfor differencesir l: kinds of outcomesstudiedwith family FE models?This is a crucial que:::a (Of course,unmeasuredeffectsthat changer =often ignoredby researchers. time alsobias OLS coefficients.So resortingto OLS regressionin suchca:::.:i no solution.) The predictorvariablesmustbe strictlyexogenous, conditionalon theunobse:.= ,.r. variables.That is, we mustassumethat oncewe controlfor the unobserved ables,tiere is no remainingcorrelationbetweenthe predictorvariablesanc 1 idiosyncraticenors, the X,s and the e,s. One commonway strict exogeneir.:violatedis when one or more of the predictorvariablesdependson the our.i,T variablemeasuredat a previouspoint in time. For example,if we were stuc.'l}r how the crimeraterespondsto changesin the sizeofthe policeforce,andthe :r of the police force were determined by the crime rate in the previous year* strictexogeneityassumptionwould be violated. 
Relativeto variabilityin the outcomevariable,theremustbe sufficientvariab:r, over time in the predictorvariables(or acrossindividualsin the crosssecdt':a FE approach).What is sufficient?This is difficult to quantify.Still, it is obr::'rr that predictorvariablesthat hardly vary can havelittle impact on the outc.. just asin OLS analysisonecannotpredicta variablefrom a constantandu t .3r a poorjob of trying to predicta variablefrom a nearconstant.
A corollary of the previous point is that variables that differ only by a linear transformation over time are regarded as unchanged over time. Thus, for example, age cannot be included in an over-time FE analysis because age at time 2 is identical to age at time 1 plus a constant. It then follows that variables that differ over time by a near-linear transformation create problems.
The predictor variables must be reliably measured. As Wooldridge notes, "Differencing a poorly measured regressor reduces its variation relative to its correlation with the differenced error caused by classical measurement error, resulting in a potentially sizable bias" (2006, 475).
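The point about linear transformations can be checked with a few lines of code. The following is a minimal sketch in Python, using invented two-wave data (the variable names and values are hypothetical, not taken from any survey discussed here): differencing leaves age with no variation at all, whereas a variable that genuinely changes retains variation that can be used for estimation.

```python
# Hypothetical two-wave panel: (age, weekly hours worked) for four people.
wave1 = [(30, 40.0), (45, 35.0), (52, 20.0), (27, 38.0)]
wave2 = [(31, 42.0), (46, 30.0), (53, 25.0), (28, 38.0)]


def first_differences(t1, t2):
    """Return per-person changes (time 2 minus time 1) for each variable."""
    return [tuple(b - a for a, b in zip(r1, r2)) for r1, r2 in zip(t1, t2)]


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


diffs = first_differences(wave1, wave2)
d_age = [d[0] for d in diffs]      # everyone ages by exactly one year
d_hours = [d[1] for d in diffs]    # hours change by different amounts

age_var = variance(d_age)      # zero: nothing left to estimate an effect from
hours_var = variance(d_hours)  # positive: usable within-person variation

print("change in age:", d_age, "variance:", age_var)
print("change in hours:", d_hours, "variance:", hours_var)
```

Because the differenced age variable is a constant, it is perfectly collinear with the intercept of the differenced equation, which is why its effect cannot be estimated.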
Improving Causal Inference: Fixed Effects and Random Effects Modeling
RANDOM-EFFECTS MODELS FOR CONTINUOUS VARIABLES
Because FE models do not allow us to assess the size of time-invariant variables (or, in family, organizational, or community applications, variables that are invariant across individuals within units), there has been a strong incentive to find models that do yield such estimates. Among these, a frequently used approach is the random effects (RE) model. Like the FE model, the RE model can be written by starting with Equation 15.1. However, the assumptions are different. Whereas the FE model assumes that the $\alpha_i$ represent a set of fixed parameters, which are purged from the model by differencing, the RE model assumes that each $\alpha_i$ is a normally distributed random variable with a mean of zero and constant variance and that it is independent of $z_i$, $x_{it}$, and $\varepsilon_{it}$. This is a strong assumption.
Fortunately, it can be tested, using a test proposed by Hausman (1978). The strategy is to estimate corresponding FE and RE models and to compare the similarity of the coefficients using the Hausman test. If the null hypothesis of no difference is not rejected, we can conclude that the independence of the $\alpha_i$ is supported, which means that the RE model yields unbiased coefficients. Because the RE model yields estimates of the effects of the $z_i$, the RE model is to be preferred if the independence assumption is satisfied. If it is not satisfied, we must settle for the FE model and forgo estimates of the effects of the $z_i$. The Hausman test is quite restrictive and often does not support the RE model. Bollen and Brand (2008) offer a range of alternative statistics for comparing FE and RE models and also procedures for forming hybrid models. Bollen and Brand's procedures are based on structural equation modeling, which is beyond what is covered in this book but is briefly discussed in the next chapter.
How can we estimate the RE model? The details are beyond what can be considered here, but it is possible to sketch the general approach. Because, by assumption, $\alpha_i$ is uncorrelated with the explanatory variables, the coefficients of these variables could be consistently estimated from a single cross section. However, doing so would ignore at least half of the data (or more, for more than two time periods). Pooling the data and estimating the coefficients through OLS also would yield consistent estimates. However, neither procedure yields the correct standard errors. The reason for this is that the errors will be serially correlated over time. We can easily see this by replacing the two error terms in Equation 15.1 with a single term for the composite error:
$$u_{it} = \alpha_i + \varepsilon_{it} \tag{15.14}$$
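Because $\alpha_i$ is included in the composite error for each time period, the $u_{it}$ are serially correlated over time, with the correlation given by
$$\mathrm{Corr}(u_{it}, u_{is}) = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}, \quad t \neq s \tag{15.15}$$
where $\sigma_\alpha^2 = \mathrm{Var}(\alpha_i)$ and $\sigma_\varepsilon^2 = \mathrm{Var}(\varepsilon_{it})$. However, it is possible to derive a generalized least squares transformation that eliminates the serial correlation in the errors. Defining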
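$$\lambda = 1 - \left[\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_\alpha^2}\right]^{1/2} \tag{15.16}$$
we can write
$$y_{it} - \lambda\bar{y}_i = \beta_0(1 - \lambda) + \beta_1(x_{it1} - \lambda\bar{x}_{i1}) + \cdots + \beta_k(x_{itk} - \lambda\bar{x}_{ik}) + (u_{it} - \lambda\bar{u}_i) \tag{15.17}$$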
Note the similarity [...] the size of the transformation depends on $\sigma_\varepsilon^2$, $\sigma_\alpha^2$, and $T$. [...] $\lambda$ [...] estimated (which can be done [...] need not concern us), Equation 15.17 can be estimated through OLS from the pooled time data to yield consistent coefficients and the correct standard errors.
Finally, we can see the relationship between FE and RE by rewriting the error term in Equation 15.17 as
$$u_{it} - \lambda\bar{u}_i = (1 - \lambda)\alpha_i + \varepsilon_{it} - \lambda\bar{\varepsilon}_i \tag{15.18}$$
[...] approaches 0, a larger fraction [...] by definition, the bias increases.
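To make the role of the GLS weight concrete, here is a minimal sketch in Python. It assumes the weight is $\lambda = 1 - [\sigma_\varepsilon^2 / (\sigma_\varepsilon^2 + T\sigma_\alpha^2)]^{1/2}$ and that the composite errors have serial correlation $\sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_\varepsilon^2)$; the function names and the variance values are hypothetical, chosen only for illustration.

```python
import math


def re_lambda(var_alpha, var_eps, n_periods):
    """GLS weight: 1 - sqrt(var_eps / (var_eps + T * var_alpha))."""
    return 1.0 - math.sqrt(var_eps / (var_eps + n_periods * var_alpha))


def error_correlation(var_alpha, var_eps):
    """Serial correlation of the composite error u = alpha + epsilon."""
    return var_alpha / (var_alpha + var_eps)


def quasi_demean(series, lam):
    """Subtract lam times the unit's own mean from each of its observations."""
    mean = sum(series) / len(series)
    return [v - lam * mean for v in series]


# Hypothetical variance components and number of time periods.
lam = re_lambda(var_alpha=1.0, var_eps=3.0, n_periods=4)
rho = error_correlation(var_alpha=1.0, var_eps=3.0)
print("lambda:", round(lam, 3), "error correlation:", rho)

# Limiting cases: no unit-level variance means no transformation (pooled OLS);
# a weight of 1 means full demeaning, which is the FE transformation.
print(quasi_demean([1.0, 3.0], re_lambda(0.0, 1.0, 2)))
print(quasi_demean([1.0, 3.0], 1.0))
```

The two limiting cases show why RE can be thought of as lying between pooled OLS and FE: the closer the weight is to 1, the more of each unit's mean is removed and the less of the unobserved unit effect remains in the error.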
A WORKED EXAMPLE: THE DETERMINANTS OF INCOME IN CHINA
To see how to estimate and interpret FE and RE models for continuous dependent variables, I consider the determinants of family income in China. In China, as elsewhere, income differs substantially across communities. Communities higher in the Chinese urban hierarchy (a seven-category classification that ranges from rural villages to province-level cities: Beijing, Chongqing, [...]) tend to have higher average income. But they also have populations that are better endowed with both the [...] differences [...] on the one hand, and family income, on the other, simply reflects the community [...] labor market conditions [...] for example, the tendency of the [...] to disproportionately move to [...].
[...] I use [...] survey [...] national probability sample of Chinese [...] used in previous chapters. The sample [...] one hundred rural villages and one hundred urban neighborhoods [...] about thirty households. (See Appendix [...] for details on the study design and for information on how to obtain the data.)
TABLE 15.1. Socioeconomic [...]
                      [...]   Mean Occ. Status   Median Family Income (a)   Median Fam. Inc. per Worker (a)
Township or town        8.2        31.9                  9,000                        4,000
County-level city      10.4        49.8                 10,680                        5,000
Provincial capital     10.5        44.1                 13,000                        6,000
[...]                   7.6        29.9                  7,000                        2,775
[...]                   410        [...]                 6,090 (b)
(a) The survey was conducted during the summer of 1996. The income question was "Now [...] from all sources, [...] was your family income in the past year?" During the relevant period (mid-1995 to mid-1996), the [...] was worth $0.12, with hardly any fluctuation.
(b) [...] data are missing for size of place or years of schooling. For the remaining columns, missing data on [...] variable are excluded.
In this analysis I predict family income (in RMB) from the education, occupational status, and age of the respondent or spouse (whichever is higher), the number of people in the household who are employed, and whether the household is engaged in various types of family enterprises. (Because in the survey used here no variable identifies the head of the household, I used the higher of the respondent's and spouse's characteristics as a proxy for the characteristics of the household head. This variable will be incorrect insofar as our respondents are other relatives of the household head, for example, adult children or siblings. In a serious analysis, I would develop a more sophisticated decision rule for deciding who is the head or how to characterize the socioeconomic status of the household. But for our present purposes, this proxy is adequate.) We would, of course, expect the education and occupational status of household members to affect household income. In addition, household income is likely to increase with age, which can be taken as a proxy for experience. I have no clear predictions regarding the effect of engaging in [...] production, agricultural sidelines, or nonagricultural sidelines, but I suspect that these aspects of entrepreneurship affect income.
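The logic of a community fixed-effects analysis of this kind can be sketched in a few lines of Python. The example below is illustrative only: the data are invented (two hypothetical communities, one predictor), and the function `within_slope` is a name introduced here, not a routine from any statistical package. It estimates the FE slope by demeaning both variables within each community and contrasts it with the pooled OLS slope, which also absorbs between-community differences.

```python
from collections import defaultdict


def within_slope(rows):
    """FE slope for one predictor: regress group-demeaned y on group-demeaned x.

    rows is a list of (group, x, y) tuples.
    """
    groups = defaultdict(list)
    for g, x, y in rows:
        groups[g].append((x, y))
    sxy = sxx = 0.0
    for members in groups.values():
        mx = sum(x for x, _ in members) / len(members)
        my = sum(y for _, y in members) / len(members)
        for x, y in members:
            sxy += (x - mx) * (y - my)
            sxx += (x - mx) ** 2
    return sxy / sxx


def pooled_slope(rows):
    """Ordinary OLS slope, obtained by treating all rows as one group."""
    return within_slope([("all", x, y) for _, x, y in rows])


# Hypothetical data: (community, years of schooling, income in 1,000s of RMB).
# Income is 0.5 per year of schooling plus a community effect (2 vs. 8).
data = [
    ("village", 4, 4.0), ("village", 6, 5.0), ("village", 8, 6.0),
    ("city", 10, 13.0), ("city", 12, 14.0), ("city", 14, 15.0),
]

fe = within_slope(data)
ols = pooled_slope(data)
print("within-community (FE) slope:", fe)
print("pooled OLS slope:", round(ols, 3))
```

In these invented data the better-educated community also has the higher community effect, so the pooled slope overstates the within-community return to schooling; the FE slope recovers it because the community effect is differenced out.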
[...] FE analysis [...] families tend to earn substantially [...] are valid only if the unobserved $\alpha_i$ are independent of [...]. Can we do even [...] regression. However, as noted previously, [...] such as their position in the urban [...] observed variables [...] over time or across individuals within units, the $z_i$ [...] idiosyncratic errors. [...] model estimates are biased, [...] we must settle [...] FE [...] to permit much of an inference. [...] the standard errors are too [...]
$$\ln\left(\frac{p_{it}}{1 - p_{it}}\right) = \mu_t + \beta x_{it} + \gamma z_i + \alpha_i, \quad i = 1, \ldots, n; \; t = 1, 2 \tag{15.19}$$
where $p_{it}$ is the probability that $y_{it} = 1$ rather than 0, and the remaining terms are defined as in Equation 15.1. In addition, we need to assume that within individuals, $y_{i1}$ and $y_{i2}$ are independent. Then it follows that
$$\begin{aligned} \Pr(y_{i1} = 0, y_{i2} = 0) &= (1 - p_{i1})(1 - p_{i2}) \\ \Pr(y_{i1} = 1, y_{i2} = 0) &= p_{i1}(1 - p_{i2}) \\ \Pr(y_{i1} = 0, y_{i2} = 1) &= (1 - p_{i1})p_{i2} \\ \Pr(y_{i1} = 1, y_{i2} = 1) &= p_{i1}p_{i2} \end{aligned} \tag{15.20}$$
Because our goal is to estimate $\mu_t$ and $\beta$ while controlling for the time-invariant covariates, we use only variation within individuals to estimate these parameters. Thus, because individuals for whom the outcome variable $y_{it}$ does not change between time 1 and time 2 contribute no information, we drop them from the sample. We are left with the two middle rows of Equation 15.20. We take the log of the ratio of these probabilities to get an equation that "differences out" the $z$ and $\alpha$:
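$$\ln\left[\frac{\Pr(y_{i1} = 0, y_{i2} = 1)}{\Pr(y_{i1} = 1, y_{i2} = 0)}\right] = \ln(1 - p_{i1}) + \ln p_{i2} - \ln p_{i1} - \ln(1 - p_{i2}) = \ln\left(\frac{p_{i2}}{1 - p_{i2}}\right) - \ln\left(\frac{p_{i1}}{1 - p_{i1}}\right) \tag{15.21}$$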
Substituting the right-hand terms of Equation 15.19, we have
$$\ln\left[\frac{\Pr(y_{i1} = 0, y_{i2} = 1)}{\Pr(y_{i1} = 1, y_{i2} = 0)}\right] = (\mu_2 - \mu_1) + \beta(x_{i2} - x_{i1}) \tag{15.22}$$
Notice that the outcome variable in Equation 15.22 is the equivalent of the log odds of a "positive" outcome at time 2 for those whose outcomes differ between time 1 and time 2. Thus Equation 15.22 reduces to a conventional binomial logistic regression equation in which the predictor variables are the difference scores for the $x$s. However, because Equation 15.22 is estimated only for individuals whose outcome has changed, there is usually a large reduction in the sample size relative to the full sample. Keep this limitation in mind when interpreting FE logistic regression results.
FE estimation can be generalized to permit observations on more than two time points per person (or more than two people within a unit) by resorting to conditional maximum likelihood estimation. That is, when there are more than two observations per unit, the problem becomes a conditional logistic regression analysis. (The algebra involved is sketched in Allison [2005, 57-59]; see also the entry for the -clogit- command in StataCorp [2007] and the references cited there.)
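The way the time-constant terms drop out of the two-period FE logit can be verified numerically. The sketch below, in Python, assumes the logit specification $\mu_t + \beta x_{it} + \gamma z_i + \alpha_i$ and the independence of the two outcomes within individuals; all parameter values are arbitrary and hypothetical. It computes the log of the ratio of the two "changer" probabilities and compares it with $(\mu_2 - \mu_1) + \beta(x_{i2} - x_{i1})$.

```python
import math


def logistic(v):
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def changer_log_odds(mu1, mu2, beta, gamma, x1, x2, z, alpha):
    """Log of Pr(y1=0, y2=1) / Pr(y1=1, y2=0) under within-person independence."""
    p1 = logistic(mu1 + beta * x1 + gamma * z + alpha)
    p2 = logistic(mu2 + beta * x2 + gamma * z + alpha)
    pr_01 = (1.0 - p1) * p2
    pr_10 = p1 * (1.0 - p2)
    return math.log(pr_01 / pr_10)


# Arbitrary illustrative values; none come from the analyses in this chapter.
params = dict(mu1=-0.3, mu2=0.1, beta=0.8, x1=2.0, x2=3.5)
lhs = changer_log_odds(gamma=1.5, z=1.0, alpha=-2.0, **params)
rhs = (params["mu2"] - params["mu1"]) + params["beta"] * (params["x2"] - params["x1"])

# Changing the time-constant terms leaves the log odds unchanged.
lhs_other = changer_log_odds(gamma=-4.0, z=3.0, alpha=5.0, **params)

print(round(lhs, 6), round(rhs, 6), round(lhs_other, 6))
```

The three printed values agree, which is the sense in which the time-constant covariate and the unobserved individual effect are "differenced out": only the change in the intercept and the change in the time-varying predictor remain.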
RANDOM-EFFECTS MODELS FOR BINARY OUTCOMES
As in the continuous case, we can estimate RE binary logistic regression models as an alternative to FE models. RE models for binary outcomes not only have the advantage of allowing estimates of the effects of variables that are constant across observations within units but also are not restricted to observations for which the outcome varies over time. However, logistic regression RE models require the same assumptions about unobserved effects as in the continuous case: that they have an expected value of zero, are normally distributed with constant variance, and are independent of both the time-varying and time-constant observed variables. As in the continuous case, the assumption of independence can be tested with a Hausman test.
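For readers who want to see the arithmetic behind a Hausman comparison, the following Python sketch covers the simplest case only: a single coefficient, for which the statistic is the squared difference between the FE and RE estimates divided by the difference in their sampling variances, referred to a chi-square distribution with 1 degree of freedom. This is a simplification offered for illustration (tests on several coefficients at once require the full covariance matrices), and the coefficient values and standard errors below are hypothetical.

```python
import math


def hausman_1df(b_fe, se_fe, b_re, se_re):
    """Hausman statistic and p-value for one coefficient (chi-square, 1 df)."""
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        raise ValueError("FE variance must exceed RE variance for this form of the test")
    h = (b_fe - b_re) ** 2 / var_diff
    # For 1 df, Pr(chi-square > h) equals Pr(|Z| > sqrt(h)) = erfc(sqrt(h / 2)).
    p = math.erfc(math.sqrt(h / 2.0))
    return h, p


# Hypothetical estimates: similar coefficients, FE less precise than RE.
h, p = hausman_1df(b_fe=0.40, se_fe=0.20, b_re=0.35, se_re=0.12)
print("Hausman statistic:", round(h, 4), "p-value:", round(p, 3))
```

A large p-value, as in this invented example, means the null hypothesis that the FE and RE coefficients are equal is not rejected, which is the condition under which the RE estimates may be interpreted.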
A WORKED EXAMPLE WITH A BINARY OUTCOME: THE EFFECT OF MIGRATION ON SCHOOL ENROLLMENT AMONG SOUTH AFRICAN BLACKS
To illustrate how to derive and interpret FE and RE models for binary outcomes, I present a portion of the analysis Lu and Treiman (2007) carried out in their study of the effect of labor migration and remittances on children's schooling in South Africa. As a consequence of apartheid-era restrictions on residential rights, many South African Blacks were forced to live in rural "homelands" carved out of the least productive land in the nation. As a result, many people, mainly men but sometimes women, left their families behind and sought employment in "White" South Africa. In a majority of cases labor migrants sent remittances home to their families.
The question Lu and Treiman addressed was whether remittances benefited the children left behind, by improving the odds that they would enroll in school. It might be argued that the extra income provided by remittances increases the likelihood of school enrollment. But suppose parents are committed to keeping their children in school and hence decide to go out for work to make this possible. That is, suppose that the same unmeasured characteristics of families determine both the migration decision and the school enrollment decision. If this is the case, the coefficient relating remittances to school enrollment will be biased. However, an over-time FE analysis can control for this (and all other unmeasured characteristics that can be assumed to be constant over time).
Using data for the South African Black population (which constitutes 78 percent of the total population) from the September 2002 and September 2003 South African Labour Force Survey, Lu and Treiman studied changes in school enrollment between 2002 and 2003 as a function of changes in the migration and remittance status of the household and other time-varying household attributes (household income, the highest level of education attained by household members, the number of children in the household, whether the household was female headed, and the year of the survey). In addition, they included the age of the child as a predictor variable. Although this variable is regarded as time constant because the time-2 value is an exact linear transformation of the time-1 value ($\text{age}_2 = \text{age}_1 + 1$), recall from Equation 15.7 that such variables can be included in an FE equation to assess the possibility that the effects differ over time; the coefficients associated with such
variables give the difference in the size of the effect for time 2 relative to time 1. [...] we might suspect that for South African Blacks the effect of children's age on school enrollment has gone down year by year in the postapartheid period as schooling has become more readily available.
Table 15.3 shows three sets of estimates: for an FE model, for an RE model estimated on the reduced sample of those who changed enrollment status, as required by the FE model, and for an RE model estimated on the full sample rather than the [...]. (See downloadable files [...] and "ch152.log" for details on how the analysis was carried out.)
Consider the FE results first. They provide substantial support for the central hypothesis of the study: when labor migrants send remittances home, the odds that children enroll in school increase by 50 percent, net of other factors, relative to the odds for children in nonmigrant households (precisely, [...]). However, it is important to be clear regarding exactly what has been demonstrated. Recall that what we are predicting is the odds of school enrollment for those whose school enrollment status changed between 2002 and 2003, which is only 20 percent of the sample of all South African Black children. Thus the finding applies to this selected subgroup only. Moreover, for this subgroup, [...] much effect on the odds of enrollment. Interestingly, the difference between 2002 and 2003 in the effect of a child's age on school enrollment is negative, which indicates that, as hypothesized, age mattered slightly less in 2003 than in 2002.
The next step is to estimate an RE model corresponding to the FE model shown in the first two columns. Recall that RE effects are legitimately interpreted only if the unmeasured effects are independent of the measured effects and the idiosyncratic errors. To determine whether this is the case, we perform a Hausman test of the similarity of the corresponding coefficients in the FE and RE models. Because we are interested only in the similarity of the estimates of migration-remittance effects, we restrict the Hausman test to the two migration-remittances dummy variables. We cannot reject the null hypothesis (p = 0.4[...]) and thus conclude that the FE and RE analyses yield similar results with respect to migration-remittance status. This allows us to interpret the RE model.
It turns out that the results are similar to those for the FE model. Of course, this is necessarily so for the two migration-remittance [...] because otherwise the hypothesis of similarity would have been rejected by the Hausman test. For the restricted sample used for the FE analysis, the RE estimates show that the odds that children living in households receiving remittances will attend school are more than 40 percent larger than the corresponding odds for those living in households without [...]. However, in the RE analysis [...] include [...] time-constant variables: gender and urban residence. Interestingly, males appear to be more likely than females to be enrolled in 2003. But place of residence (urban versus rural) has no effect.
However, the question remains as to whether an RE model estimated for the full sample, rather than being restricted to the 20 percent of the sample whose enrollment [...], would yield similar results. To determine this, we estimate the RE model for [...] South African Black children. Again, to determine whether the full-sample model is acceptable, we compare the consistency of coefficients for this model and [...] FE model. And again we fail to reject the null hypothesis (p = .72).
TABLE 15.3. Comparison of FE and RE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)
Column groups, in order: Fixed Effects; Random Effects (Restricted Sample); Random Effects (Full Sample).
Female-headed household    0.195  .167    4.121  .183    0.022  .798    0.045  .000    0.014  .788
Survey year (2003)         1.74   .000    3.67   .000    1.54   .000
Age of child               0.029  .000    0.019  .000    0.063  .000
[...]                      0.169  .050    0.210  .000    [...]    0.003  .974    0.330  .000    [...]
Still again, we are led to the same qualitative conclusion regarding the effect of remittances: they increase the odds of enrollment by nearly 45 percent. However, for the full sample all other effects are significant as well, with two exceptions: households in
which there are migrants but not remittances are indistinguishable from those in which there are no migrants, and children in female-headed households are no more and no less likely than [...] to be enrolled in school. Moreover, most of the effects are larger than for the other two models. The odds of school enrollment increase monotonically with the education level of the household; the effect of household income is twice as large as in the [...]; [...] children [...] positive effect on school enrollment; and males and urban residents are more likely [...] than are females or rural residents.
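Statements such as "the odds increase by 50 percent" come from exponentiating logistic regression coefficients. The following minimal Python sketch shows the conversion; the coefficient values are hypothetical and are not taken from Table 15.3.

```python
import math


def percent_change_in_odds(b):
    """Percentage change in the odds implied by a logit coefficient b."""
    return (math.exp(b) - 1.0) * 100.0


# Hypothetical logit coefficients, for illustration only.
for b in (0.405, 0.37, -0.03):
    print(b, "->", round(percent_change_in_odds(b), 1), "percent")
```

A coefficient of about 0.405 corresponds to odds roughly 1.5 times as large, that is, a 50 percent increase; small negative coefficients correspond to small percentage decreases in the odds.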
A BIBLIOGRAPHIC NOTE
[...] models can be found in Chamberlain ([...]), Ronning (1995), and [...]. [...] Krueger (1994), Frankenberg [...] (2005), Hotz and [...], [...] (2001), [...] (2005), [...] (2006), [...] (2007), and Lu and Treiman (2007).
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have considered two techniques, fixed effects (FE) and random effects (RE) [...] differences across [...] (or, in the cross-sectional case, [...] individuals within groups: families, communities, and so on). The assumptions underlying each method were discussed, and two worked examples were presented, one involving a continuous dependent variable and the other involving a dichotomous dependent variable, because the two cases [...] different [...]. FE models for dichotomous [...] can be estimated only for a subset of the sample because [...] for which the outcome variable does not differ over time, or across members of a group, must be dropped from the analysis. FE models do not permit estimation of the effects of either measured or unmeasured time-constant variables, whereas RE models do. For these two reasons, when the assumptions of the RE model are met, the RE model is to be preferred.
FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
WHAT THIS CHAPTER IS ABOUT
In this chapter I review various aspects of research design, some of which we have encountered in previous chapters and some of which are new. In the course of doing this, I also briefly discuss a number of advanced statistical techniques and procedures, which you will need to complete the "tool kit" required for state-of-the-art data analysis in the social sciences, and which you are now in a position to tackle, given the foundation provided by what has been covered in this book. I then comment on the importance of probability sampling and ways to think about populations. I conclude with some advice about good research practice.
RESEARCH DESIGN ISSUES
In this section I consider some issues regarding appropriate analytic designs to answer research questions using nonexperimental data.
Comparisons Are the Essence
Not infrequently term paper proposals take the following form: I want to study caregivers, and I have found a sample of caregivers to use for my analysis; or, I want to [...] program instituted in a school, and I have a sample of students [...] that school. The problem with these proposals is that you cannot study a constant. [...] want to know, for example, whether caregivers are particularly prone to depression, you need a sample of caregivers and noncaregivers. Similarly, if you want to know what factors lead people to migrate, you need a sample of both those who do and those who do not migrate. And if you want to evaluate the efficacy of a program, you need a sample of places where the program has and has not been implemented (or data before and after implementation, although there are special problems in making comparisons over time, discussed later in the chapter). This is an extremely simple point, but it is often ignored in data collection efforts. If, for example, you have a sample only of migrants or delinquents or caregivers, all you can do is study internal variations among different types of migrants or delinquents or caregivers, which presumably is not what you are really interested in.
[...] sampled only the population of interest, you are forced to rely on [...] studies to make comparisons, which often entails trying to compare not quite comparable data. Sometimes in such situations, researchers compare their data for a special population to patterns assumed to hold for some standard population. For example, a recent study of schooling available to migrant children in Beijing (Chen and Liang [...]) was based on a survey of migrant households with school-age children. From [...] the researchers calculated and reported the proportion of such children not enrolled in school. The implicit contrast, and it is only implicit, is that all nonmigrant Beijing children attend school. But there is no particular reason to presume this. Indeed, [...] the assumptions social scientists make about their own societies prove to be false. Thus explicitly comparative data are strongly to be preferred.
If comparisons are the essence of analysis, the obvious next question is, what sorts of comparisons are appropriate for what purposes?
Comparisons Across Populations, Population Subgroups, and Historical Periods. A common research question in the social sciences is whether population subgroups (males versus females, [...], and so on) differ with respect to some outcome and the factors determining that outcome. We saw in Chapter Six, in the section "A Strategy for Comparisons Across Groups," how to approach this kind of analytic question. Here I briefly review the strategy. [...] whether a relationship between a set of X predictors and some outcome variable Y holds for all subgroups of a population or whether it differs across subgroups, we estimate three prediction equations, which may be OLS equations or, as
appropriate for some other linear model (for example, some form of logistic regression):
$$\hat{Y} = a + \sum_{i=1}^{I} b_i X_i \tag{16.1}$$
$$\hat{Y} = a' + \sum_{i=1}^{I} b'_i X_i + \sum_{j=2}^{J} c_j G_j \tag{16.2}$$
$$\hat{Y} = a'' + \sum_{i=1}^{I} b''_i X_i + \sum_{j=2}^{J} c'_j G_j + \sum_{i=1}^{I} \sum_{j=2}^{J} d_{ij} X_i G_j \tag{16.3}$$
In Equations 16.1 through 16.3, the $X_i$ are predictor variables and the $G_j$ are population subgroups, with each subgroup (except the first) represented by a dummy variable coded 1 for those in the subgroup and coded 0 otherwise. We then contrast Model 3 (Equation 16.3) against Model 1 (Equation 16.1) to determine whether we need to posit different relationships between the $X_i$ and $Y$ for the different groups. (We do this by assessing the significance of the increment in $R^2$ or, equivalently, assessing whether the $c'_j$ and the $d_{ij}$ in Model 3 are collectively not significantly different from zero.) If Model 3 fits significantly better than Model 1, we conclude that the social process being studied differs among the groups and ask the subsidiary question: is the difference only in the intercepts, or is it in the slopes as well? (We do this by assessing the significance of the increment in $R^2$ between Model 3 and Model 2 or, equivalently, assessing whether the $d_{ij}$ in Model 3 are collectively not significantly different from zero.) Note that this strategy is only appropriate when the groups can be taken to be exogenous to the outcome under study, which holds for gender, ethnicity, and so on. When selection into the group is correlated with the outcome, net of the other predictor variables in the model, the assumption of OLS regression that the predictor variables are uncorrelated with the error is violated, and endogenous switching regression procedures, discussed later in the chapter, must be used to yield unbiased estimates of effects.
If Model 3, or Model 2, proves to be the preferred model, it is then possible to decompose the differences between groups in their average outcomes, using the procedures for decomposing differences in means discussed in Chapter Seven. Note that the decomposition procedure was discussed in Chapter Seven in the context of OLS regression. The same procedure can be used to decompose differences in logged dependent variables (see Treiman and Roos [1983, 636-640] for an example) or in log odds, albeit without quite the same intuitive appeal.
A variant on the strategy for assessing group differences is to start with an equation of the form
$$\hat{Y} = a + \sum_{j=2}^{J} c_j G_j \tag{16.4}$$
Because for Equation 16.4 the predicted values for $Y$ are simply the means of $Y$ for each subgroup, the question being addressed by contrasting Equation 16.3 with Equation 16.4 (or Equation 16.2 with Equation 16.4) is to what extent group differences with respect to the outcome can be explained by group differences in the other predictor variables.
Exactly the same procedures can be used to make comparisons over time. For example, we might want to know whether the relationship between political attitudes (liberalism versus conservatism) and acceptance of abortion was the same in the 1970s, when Roe v. Wade was first decided, and in the 2000s, when opposition to abortion apparently became obligatory for Republican presidential candidates. In this case, time becomes the G variable and political attitudes the X variable in Equations 16.1 through 16.3. And of course, the same logic holds for comparisons of changes over time in group differences, albeit with an increase in complexity because of the need to consider three-way interactions. For example, the analysis of the interaction between education and religious denomination in determining acceptance of abortion in 1974, which I used in Chapter Six to introduce the strategy for group comparisons, could be replicated in 2006 to assess how the "abortion wars" over the last thirty-two years have affected attitudes.
Some cross-temporal comparisons are vulnerable to estimation problems stemming from the fact that data from different time periods may not be independent. This is true of aggregate measures such as the average level of schooling. The value of such a variable computed for, say, the United States in 2005 will hardly differ from the value for [...] because both computations are based on more or less the same population. Thus the two observations are not independent. Procedures for coping with the nonindependence of observations, known as autocorrelation, and with other special features of time-series data are well developed; see the Stata manual Time Series [TS] (StataCorp 2007). Time-series procedures are widely used in economics. Another kind of data, widely used in other social science fields as well as economics, derives from panel studies, in which the same individuals are surveyed two or more times, typically several months or years apart. Data with this structure provide one means of carrying out FE and RE analysis of the kind discussed in the previous chapter. These and other techniques for dealing with the nonindependence of observations are known as XT (cross-sectional time series) models. Such models go beyond what we have been able to consider in this book. For standard introductions, consult the Stata 10.0 manual Longitudinal/Panel Data [XT] (StataCorp 2007) and texts by Sayrs (1989), Wooldridge (2002), Hsiao (2003), Baltagi (2005), and Greene (2008). The Sayrs text is quite accessible, the Greene text reasonably so, and the other three fairly formidable.
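The nonindependence of successive aggregate observations can be made concrete by computing a lag-1 autocorrelation. The Python sketch below uses one common definition (the sum of cross-products of adjacent deviations from the mean, divided by the total sum of squared deviations); both series are invented for illustration and do not represent real schooling data.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: adjacent cross-products over total sum of squares."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


# Hypothetical yearly averages of schooling: each year closely resembles the last.
slow_trend = [12.0, 12.1, 12.2, 12.3, 12.4, 12.5]
# Hypothetical series in which each value reverses the previous one.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r_trend = lag1_autocorrelation(slow_trend)
r_alt = lag1_autocorrelation(alternating)
print("slowly changing series:", round(r_trend, 3))
print("alternating series:", round(r_alt, 3))
```

A clearly positive value for the slowly changing series reflects the situation described above: adjacent years carry largely the same information, so treating them as independent observations overstates how much evidence the data contain.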
Both cross-sectional comparisons and cross-temporal comparisons may, of course, be extended to more than two comparisons (more than two groups or more than two time points), and groups may be subpopulations of a single nation or populations of different nations. For examples of the latter, see Erikson and Goldthorpe (1987a, 1987b).
The reason for carrying out cross-population or cross-temporal comparisons is to test some hypothesis, or hypotheses, about how populations or population subgroups differ or time periods vary. If you have a priori hypotheses, this is a reasonable strategy. However, you are vulnerable to the counterclaim that the differences you posit and observe are spurious, because they reflect differences between groups or over time that affect both the
independent and dependent variables. Binary comparisons are particularly vulnerable to such claims, because any number of other factors could account for the differences. Indeed, any graduate student in the social sciences can invent a post hoc explanation for any observed difference! If you do not believe me, try this simple test on your friends: invent a finding about some society or population they do not know about or, better yet, report a finding but reverse the sign or otherwise change the outcome. Then wait to see what plausible explanation they give you. I used to do this at cocktail parties when I discovered that everybody thought that my finding of essential invariance in occupational prestige hierarchies around the world (Treiman 1977) was a case of documenting the obvious. I started telling people that prestige hierarchies were quite different in, say, Russia, and got all sorts of interesting explanations about why it was obvious that this had to be so (even though it was not!). Comparisons of three groups (time points) are far more constraining, and comparisons of still more groups (time points) even more so.

As a case in point, consider historical comparisons. Nee (1989, 1996) has argued that the shift toward a market economy in China reduced the power of cadres and enhanced the power of "direct producers." The difficulty, as Walder points out in a critique (1996, 1064), is that many things change over time:

    The problem with time [as a measure] is that many other changes conceptually distinct from the spread of markets, and which may also affect the distribution of power and income, also occur through time and at different rates in different regions. Some emerging market economies grow rapidly while others do not; state policy may provide windfall gains to grain producers in one period only; private enterprise may thrive in some regions but remain marginal in others; capital may be highly concentrated in some regions, more dispersed or absent in others. All of these processes affect the distribution of power and income; any time-dependent measure of market allocation must carefully control for them.

This difficulty, sometimes referred to as the "too many degrees of freedom" problem, because there are too many plausible explanations for whatever is observed, is generic to two-case comparisons, both cross-temporal and cross-sectional. For this reason, comparisons of a small number of cases are more helpful in demonstrating similarity than in explaining differences. Sometimes it is helpful to show that a finding in one society or at one point in time also holds in a different time and place. If so, we can have more confidence that we have identified a general phenomenon and not just an idiosyncratic result.

By contrast, consider a test of the "fetal origins" hypothesis (Barker 1998) by Almond (2006). The hypothesis states that adverse events experienced by pregnant women can have long-term consequences for their offspring. Almond studied this claim by analyzing the consequences of the 1918 flu pandemic for educational attainment, occupational status, income, disability, and other outcomes measured in the 1960, 1970, and 1980 censuses. He finds strong effects, one of which is shown in Figure 16.1, male disability rates in 1980 by quarter of birth in 1918 through 1920. Because disability rates were elevated only for those who were in utero during the pandemic and returned to the trend line for
those born later, we can rule out the possibility that some other, unmeasured, change coincided with the onset of the pandemic. More precisely, any alternative explanation would have to show a pattern of timing that coincided exactly with the pandemic, which in this case is not remotely plausible.

FIGURE 16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability). [Plot not reproduced; horizontal axis: quarter of birth.]
Source: Almond 2006, Figure 2.

Natural Experiments

Analyses of this kind are particularly compelling because they constitute natural experiments. As we have seen, the difficulty with almost all nonexperimental work is the possibility that we have omitted variables that affect both the outcomes and the predictor variables, thus biasing our estimates. Natural experiments reduce or eliminate omitted-variable bias by focusing on natural events that plausibly can be argued to be distributed randomly in the population. Because the 1918 pandemic struck without warning in October 1918 and had largely dissipated by early 1919, it is reasonable to regard those in utero during the months of the pandemic as a treatment group and those in utero just before and just after these months as a control group. Because there are no plausible differences between these groups, except that one had the misfortune to be in utero at the time of the pandemic, it is reasonable to infer that differences in outcomes are due to exposure to the pandemic. Of course, not all pregnant mothers became infected. But we know that about one-third of childbearing-age women did become infected, which is a large enough group to reveal differences in outcomes if they exist. Almond's paper, which also exploits state-to-state variations in the severity of the pandemic, is a model of how to do analysis of this kind.
For other examples of natural experiments, albeit some more persuasive than others in how thoroughly they overcome potential omitted-variable bias, see Deng and Treiman (1997), Ansolabehere, Snyder, and Stewart (2000), Abadie and Gardeazabal (2003), Lassen (2005), Oster (2005), Treiman (2007a; see also the discussion of this example in Chapter Seven), and Lu and Treiman (2008).

Multilevel Analysis

When you have many comparison groups (many time points, many nations, and so on), it makes sense to shift from treating each group as a discrete point (by including a set of dummy variables representing groups) to scoring each group with respect to various dimensions (for example, characterizing nations by their level of economic development, the degree of urbanization, and so on). The optimal way to do this is to carry out the analysis at two or more levels. In the latter case, macrosocial "contexts" are defined (for example, classrooms or schools, or both, in educational studies; societies in cross-national studies; birth cohorts or historical periods in cross-temporal comparisons; and so on). Then a microequation (one representing some social process) is estimated separately for each context, and variations in the coefficients representing the microprocess are predicted from characteristics of the contexts.

For example, suppose you wish to test the hypothesis that the negative effect of the number of siblings on educational attainment is stronger where school fees as a fraction of total family incomes are higher. Here is the typical setup for such an analysis:
$$Y_{ij} = a_j + b_j X_{ij} + e_{ij} \tag{16.4}$$

$$a_j = \gamma_{00} + \gamma_{01} G_j + u_{0j} \tag{16.5}$$

$$b_j = \gamma_{10} + \gamma_{11} G_j + u_{1j} \tag{16.6}$$

where j = 1, 2, ..., J denotes contexts, or level-2 units (school systems in the example), and i = 1, 2, ..., n_j denotes individuals within contexts (level-1 units). The level-2 equations assert that the intercepts and slopes of the level-1 equations vary over contexts as linear functions of G (or, in the example, that the [negative] slope associated with the number of siblings is greater [in absolute value] when school fees are higher; in this example I have provided no hypothesis regarding the intercepts of the level-1 equations, though it might be plausible to expect that the level of educational attainment is lower when school fees are high). Of course, the level-1 equations may include more than one X and the level-2 equations may include more than one G.

To appreciate the substantive payoff of multilevel analysis, consider a 1989 paper by Lee and Bryk that analyzed the effect of differences in the academic organization of high schools on the distribution of achievement in standardized mathematics tests. This paper by one of the authors of the standard text on multilevel analysis (Raudenbush and Bryk [2002]) not only provides a compelling substantive example but lays out the technical issues in a very clear way. Using data from the High School and Beyond study (10,187 students in 160 high schools), Lee and Bryk showed that achievement scores
tend to be highest and differentiation among students by race and socioeconomic status lowest in schools with a standardized academic curriculum required of all students, in contrast to "shopping mall" schools with a wide variety of courses and many electives. They cite this difference in the organization of the curriculum as the main reason Catholic schools tend to be more successful, a widely noted but not previously understood phenomenon.

There are many ways to carry out a multilevel analysis, including the meta-analytic approach used by Treiman and Yip (1989) in the example presented in Chapter Ten to illustrate regression diagnostic procedures. The technical details go beyond what we have been able to consider in this book. Good introductions include a paper by DiPrete and Forristal (1994), which emphasizes substantive applications and provides a good sense of what this body of techniques can be used to do. A paper by Mason (2001) focuses on various multilevel analysis procedures as a way of coping with observations that are not independent because they are clustered within higher-level units (for example, individuals in families, pupils in classrooms). Several book-length treatments also address multilevel analysis, of which Raudenbush and Bryk (2002) is the standard but is technically demanding, as is Goldstein (2003); Snijders and Bosker (1999) may be more accessible. For additional substantive examples, see Entwisle and Mason (1985) on the relationship between level of socioeconomic development and fertility; DiPrete and Grusky (1990a, 1990b, 1990c) on temporal variations in socioeconomic attainment in the United States; and Sampson, Raudenbush, and Earls (1997) on the role of neighborhood efficacy in reducing crime.
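The logic of the two-level setup described above can be illustrated with a small simulation. The following sketch, in Python with NumPy, is only one simple way to proceed and is not the procedure used in any of the studies cited: it fits the level-1 equation by OLS within each context and then regresses the estimated slopes on the context characteristic G. The data are simulated, and all names and numerical values (the number of contexts, the "true" coefficients, the reading of G as school fees) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

J, n = 200, 50                       # hypothetical: 200 contexts, 50 individuals in each
G = rng.uniform(0.0, 1.0, size=J)    # context trait, e.g. school fees as a share of income

# Level-2 equations (assumed true values): intercepts and slopes depend on G.
a = 10.0 - 1.0 * G + rng.normal(0.0, 0.5, size=J)
b = -0.2 - 0.6 * G + rng.normal(0.0, 0.1, size=J)

slopes = np.empty(J)
for j in range(J):
    x = rng.poisson(3.0, size=n).astype(float)           # number of siblings
    y = a[j] + b[j] * x + rng.normal(0.0, 1.0, size=n)    # years of schooling
    # Level-1 equation: OLS of y on x within context j.
    X = np.column_stack([np.ones(n), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    slopes[j] = coef[1]

# Level-2 equation for the slopes: regress the estimated slopes on G.
Z = np.column_stack([np.ones(J), G])
g10, g11 = np.linalg.lstsq(Z, slopes, rcond=None)[0]
print(f"slope when G = 0: {g10:.3f}; change in slope per unit of G: {g11:.3f}")
```

A negative coefficient on G in the slope equation corresponds to the hypothesis in the text: the sibling penalty grows as school fees rise. Dedicated multilevel software estimates both levels jointly and weights each context by the precision of its estimates, a refinement this two-step sketch ignores.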
Endogeneity, Sample Selection Bias, and Other Threats to Correct Causal Inference

In this section I discuss several closely related circumstances that require special treatment to avoid biased estimates. I also provide brief introductions to some of the standard solutions not addressed in previous chapters.

Treatment Effects

Endogeneity refers to situations in which one or more of the predictor variables is correlated with unobserved variables that affect the outcome. Because the effects of unobserved variables are relegated to the error term, the coefficients of predictor variables correlated with such unobserved variables are biased. A common situation in which endogeneity problems may arise is when we wish to assess the effect of a "treatment" that itself depends on unmeasured factors that affect the outcome. For example, if in a developing nation midwives are placed in villages with the worst health outcomes, an assessment of the effect of midwives on health that fails to control for this nonrandom assignment will be downwardly biased (see Frankenberg and Thomas [2001] for an analysis of Indonesian data that uses a fixed effects [difference-in-difference] analysis, of the kind discussed in the previous chapter, to derive correct estimates). Similarly, if workers less able to command high wages (for unmeasured reasons) are more likely to join a union, OLS estimates of the effect of union membership (here regarded as the "treatment") on wages will be downwardly biased. For an interesting example of how to carry out an
analysis using a treatment-effects approach, see Brand's 2006 study of the effect of job displacement on the quality of subsequent jobs.

The most obvious way to correct for endogeneity problems is to measure all the factors thought to affect the outcome. We encountered this idea in our consideration of ordinary least-squares regression in Chapter Six, when we discussed the presentation of several models, with successively more variables, to assess how newly introduced variables mediate the effects of variables already in the model. From this we see that endogeneity bias is a form of omitted-variable bias.

However, it is not always possible to measure all the potential influences on an outcome, either because we are reanalyzing already collected data or because the analyst may not be able to identify a priori all the potential influences on an outcome that are correlated with variables explicitly measured (for example, all the factors that might both lead individuals to join a union and be correlated with their capacity to command high wages as individual employees). Thus we need ways to correct for endogeneity bias (and its close cousin, sample selection bias). We have already covered one approach, fixed effects or random effects modeling, which is possible when we have measurements for individuals at more than one point in time or measurements for different individuals within groups (for example, families or classrooms or communities). When such data are not available, several other analytic strategies may be considered, all of which go beyond what it has been possible to cover in this book. For useful discussions of what is entailed in establishing causality, see Holland (1986), together with comments by Rubin, Cox, Glymour, and Granger, and Holland's rejoinder; and Winship and Morgan (1999).
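The claim that endogeneity bias is a form of omitted-variable bias can be checked with a small simulation of the union example. The sketch below, in Python with NumPy, assumes an unmeasured trait (called ability here) that raises wages and lowers the chance of joining a union; the variable names, the assumed union premium, and every other number are hypothetical. It compares the OLS estimate that omits the trait with the estimate that would be obtained if the trait could be measured.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

ability = rng.normal(size=n)          # unmeasured in real data; available only in a simulation
# Workers with lower unmeasured ability are more likely to join a union.
union = (rng.normal(size=n) - ability > 0).astype(float)
true_effect = 0.15                    # assumed union wage premium (log points)
log_wage = 2.0 + true_effect * union + 0.30 * ability + rng.normal(0.0, 0.5, size=n)

def ols(y, *cols):
    """Least-squares coefficients of y on a constant and the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(log_wage, union)[1]            # ability is relegated to the error term
full = ols(log_wage, union, ability)[1]    # ability is included as a regressor

print(f"true effect {true_effect:.3f}  omitting ability {naive:.3f}  controlling ability {full:.3f}")
```

Under these assumptions the estimate that omits the trait falls well below the true premium, the downward bias described in the text, and the bias disappears once the trait is controlled.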
Instrumental Variables Regression

An approach to coping with endogeneity that is popular among economists is instrumental variable (IV) estimation. If a variable (Z) can be found that is uncorrelated with unobserved variables (u) that affect the outcome (Y), is correlated with the variable (X) in the model thought to be correlated with the unobserved variables, and is conditionally unrelated to the outcome variables net of the effect of both the observed and unobserved variables, Z can be used as an instrument for X to yield relatively unbiased estimates of the effect of X on Y.

For example, consider a 1990 paper by Angrist studying the effect of service in the military during the Vietnam War on lifetime earnings. The difficulty with estimating an OLS equation is that the decision to join the military might well have been correlated with unmeasured factors that affect earnings. Angrist exploited the fact that for much of the war period, a lottery system was used to determine who would be drafted into service. Although there were many exceptions, the increased probability of being drafted for those with low numbers makes the assignment of lottery numbers a kind of natural experiment: one's lottery number was correlated with the likelihood of serving but not with other factors related both to service and to subsequent income. Thus the lottery number is a good instrument for adjusting the effect of Vietnam veteran status on income.

Another situation where IV estimation may be helpful is where the causal order is ambiguous. Suppose we observe that women who work are less likely to be depressed than women who do not work. Can we conclude that employment protects against depression? Perhaps. But the causal order might go the other way: depressed women may be less
likely to seek or retain employment. One way to address this problem would be to find an instrument for employment. A reasonable choice might be whether the mother worked. It is known that the daughters of mothers who worked are more likely to work themselves. But there is no particular reason to believe that, net of her own employment, a woman's mother's employment affects the likelihood that the woman herself experiences depression (the example is from Ettner [2004]). Thus, mother's employment would satisfy the conditions for an instrumental variable.

A final circumstance in which IV approaches can be helpful is to estimate simultaneous equation models or, as they are sometimes called, reciprocal causation models. Wooldridge (2002, 555) provides a useful example: in a sample of cities, we might expect the murder rate to depend on the size of the police force (the more police per capita, the lower the expected per capita murder rate). But we might also expect the size of the police force to depend on the murder rate: the higher the (anticipated) murder rate, the greater the incentive to increase the size of the police force. Because we observe only the equilibrium condition (a particular murder rate and a police force of a particular size), specifying a simultaneous equation model amounts to asking the counterfactual question: What would be the murder rate if the size of the police force were different? What would be the size of the police force if the murder rate were different? IV methods provide a way of estimating such models.

Another case that usually involves reciprocal causation and hence might be handled by IV methods (or by structural equation models of the kind discussed later) is when one attitude is thought to affect another but they are measured at the same time.
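A minimal simulation may help fix the logic of IV estimation. The sketch below, in Python with NumPy, assumes a binary instrument z (in the spirit of a lottery draw), a binary treatment x that depends on both z and an unobserved factor u, and an outcome y that depends on x and u. All names and numerical values are hypothetical, and the example is not a reconstruction of any study cited here. It compares a naive OLS estimate with a two-stage least-squares estimate computed by hand.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta = -0.15                                    # assumed true effect of x on y

u = rng.normal(size=n)                          # unobserved factor (never available in practice)
z = rng.integers(0, 2, size=n).astype(float)    # instrument: random, unrelated to u
# Treatment depends on the instrument and, negatively, on the unobserved factor.
x = (0.7 * z - 0.5 * u + rng.normal(size=n) > 0.35).astype(float)
y = 10.0 + beta * x + 0.5 * u + rng.normal(0.0, 0.5, size=n)

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
naive = ols(y, np.column_stack([const, x]))[1]

# Stage 1: predict the treatment from the instrument.
Z = np.column_stack([const, z])
x_hat = Z @ ols(x, Z)
# Stage 2: regress the outcome on the predicted treatment.
iv = ols(y, np.column_stack([const, x_hat]))[1]

print(f"true {beta:.3f}  naive OLS {naive:.3f}  2SLS {iv:.3f}")
```

The naive estimate absorbs the influence of u, whereas the two-stage estimate uses only the variation in x induced by z. The point estimate from this manual procedure is the standard two-stage least-squares estimate, but the standard errors reported by an ordinary second-stage regression would be wrong; dedicated IV routines make the necessary adjustment.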
Rather than assuming that one attitude somehow causally precedes the other, it usually is more sensible to treat the two either as both dependent on a third variable (seemingly unrelated regression, encountered in Chapter Eleven, can be helpful in such cases) or as having reciprocal effects.

The difficulty with IV estimation is that it often is difficult to find good instrumental variables, and poor instruments often produce results worse (more biased) than using no instruments at all. For a good introduction to IV estimation and its dangers, see Wooldridge (2006, Chapters 15 and 16). Other useful references include Baum (2006), the entry for the ivregress command in StataCorp (2007), and Greene (2008).

Sample Selection Bias

Sample selection bias arises when unmeasured factors correlated with an outcome determine whether an individual is included in the sample. For example, a woman may enter the labor force only if she can command a reasonably high wage. Thus selection into the sample (people in the labor force) is nonrandom but depends on unmeasured characteristics correlated with the outcome variable (wages). Analyzing only those with wages then results in biased estimates.

Consider another example. Many surveys in China are restricted to the de jure urban population. As Wu and Treiman (2007) have shown, in such studies estimates of intergenerational mobility are overstated because those of rural origins who obtain urban registration are not a random sample of the rural population but rather are disproportionately the "best and the brightest," who have experienced long-range upward social mobility. If the entire population is included in the analysis, estimates of intergenerational mobility are much more modest.
Heckman Selection Model

A standard approach to correcting for sample selection bias (in cases where it is not possible to redefine the population as Wu and Treiman did) is to use a Heckman correction (see Heckman 1979). The procedure involves predicting (using a binary probit equation) the probability of being in the sample (or, equivalently, of having an observed outcome), calculating the expected error for each observation, and using these errors as regressors in an equation predicting the outcome of interest. See Winship and Mare (1992) for a very clear exposition of this and other models for sample selection bias, and see Dubin and Rivers (1989) for an extension of these procedures to models with binary outcomes.

The Stata entry for the heckman command (StataCorp 2007) offers another very clear example and exposition of the method, using the canonical example, women's earnings. In the example, earnings (for women who have earnings) are predicted from education and age, and the probability of having earnings is predicted from marital status, the number of children at home, education, and age (and implicitly, through the inclusion of education and age, which predict the outcome, of the expected wage itself). Note that the assumption here is that marital status and the number of children at home do not affect earnings but only the probability of having earnings. We might well question this assumption because married women, and particularly women with children at home, may choose to take lower-paying jobs that more readily accommodate their dual careers as workers and mothers.

This example thus reveals a major limitation of the procedure. To yield robust results, the predictors in the selection equation should strongly affect the probability of being selected but should have no net effect on the outcome. (Heckman corrections can be made even when there are no such variables, by relying on the functional form of the equation to identify the model. However, the results are often neither robust nor substantively compelling.)
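The two steps of the correction described above can be sketched in a few lines. The code below, in Python with NumPy and the standard library, is a simplified illustration on simulated data rather than a substitute for a dedicated routine: it estimates a probit selection equation by Fisher scoring, computes the expected error term (the inverse Mills ratio) for each selected case, and adds it as a regressor in the outcome equation. The variable names (educ, kids, wage), the helper functions, and all numerical values are assumptions; in particular, kids is assumed to affect selection but not wages, which is exactly the kind of exclusion restriction the text warns is hard to defend.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
_erf = np.vectorize(math.erf)

def norm_cdf(v):
    return 0.5 * (1.0 + _erf(v / math.sqrt(2.0)))

def norm_pdf(v):
    return np.exp(-0.5 * v ** 2) / math.sqrt(2.0 * math.pi)

def probit(s, W, iters=25):
    """Probit coefficients by Fisher scoring (hypothetical helper)."""
    g = np.zeros(W.shape[1])
    for _ in range(iters):
        eta = W @ g
        p = np.clip(norm_cdf(eta), 1e-10, 1.0 - 1e-10)
        phi = norm_pdf(eta)
        w = phi / (p * (1.0 - p))
        score = W.T @ ((s - p) * w)
        info = W.T @ (W * (phi * w)[:, None])
        g = g + np.linalg.solve(info, score)
    return g

# Simulated data: the errors of the selection and outcome equations are correlated.
educ = rng.normal(size=n)
kids = rng.poisson(1.0, size=n).astype(float)
rho, sigma = 0.7, 1.0
e_sel = rng.normal(size=n)
e_out = sigma * (rho * e_sel + math.sqrt(1.0 - rho ** 2) * rng.normal(size=n))
works = 0.8 + 0.5 * educ - 0.8 * kids + e_sel > 0
wage = 1.0 + 0.5 * educ + e_out            # treated as observed only when works is True

# Step 1: probit for being in the sample, then the inverse Mills ratio.
W = np.column_stack([np.ones(n), educ, kids])
g = probit(works.astype(float), W)
index = W[works] @ g
mills = norm_pdf(index) / norm_cdf(index)

# Step 2: outcome equation for the selected cases, with the Mills ratio as a regressor.
X = np.column_stack([np.ones(works.sum()), educ[works], mills])
b_heck = np.linalg.lstsq(X, wage[works], rcond=None)[0]
b_naive = np.linalg.lstsq(X[:, :2], wage[works], rcond=None)[0]

print(f"education effect: naive {b_naive[1]:.3f}, corrected {b_heck[1]:.3f} (assumed truth 0.5)")
```

In this setup the coefficient on the Mills ratio estimates the product of the error correlation and the outcome error's standard deviation. The standard errors from the second-step regression are not correct as they stand, which is one reason to rely on a purpose-built command for real analyses.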
Variables that affect selection but not the outcome are often difficult to find. Note the similarity to IV estimation discussed previously.

For instructive applications of corrections for sample selection bias, see Mare and Winship's 1984 study of employment trends for young Black and White men; Hagan's studies of factors influencing the severity of punishment for convicted criminals (Peterson and Hagan 1984; Hagan and Parker 1985; Zatz and Hagan 1985); Manski and Wise's (1983) study of the determinants of graduation from college; and Hardy's (1989) study of occupational mobility in the nineteenth century based on matching data across censuses, which takes account of selection due to deaths, emigration, and name changes.

Endogenous Switching Regression

Note that the Heckman procedure also can be used to analyze endogenous treatment effects, as an alternative to IV estimation. However, a separate command, treatreg, is also available in Stata, in addition to heckman.

The problem of an endogenous treatment effect (that is, where there is a nonzero correlation between assignment to a "treatment" group and unmeasured factors affecting the outcome) can in turn be generalized to the case in which the parameters of a model linking treatments to outcomes differ across treatment groups and assignment to treatment groups is endogenous. For example, Gerber (2000) asks whether the fact that former Communist Party members do better in post-Soviet Russia than do others is due to
residual social capital (the fact that connections continue to favor former party members) or rather to unmeasured factors that affect both the likelihood that people became party members during the Soviet era and their earnings in the post-Soviet period.

This kind of problem can be addressed using methods that are similar to those for treatment effects and sample selection problems, specifically, endogenous switching regression models. Endogenous switching regression models are used in situations where one outcome is observed if a selection variable Z = 0, but a different outcome is observed if Z = 1. Using this method, Gerber concludes that the advantage enjoyed by former communists is due entirely to unmeasured characteristics associated with becoming a member of the Communist Party and that there is no lingering effect of Soviet-era social or political capital. (See also the critique by Rona-Tas and Guseva [2001] and the rejoinder by Gerber [2001].)

Good descriptions of the technique and of how to implement it can be found in Mare and Winship (1988) and Powers (1993). For additional applications see Willis and Rosen (1979); Gamoran and Mare (1989); Long (1990); Sakamoto and Chen (1991); Manski and others (1992); Tienda and Wilson (1992); Powers and Ellison (1995); Smock, Manning, and Gupta (1999); Hofmeyr and Lucas (2001); Lichter, McLaughlin, and Ribar (2002); Sousa-Poza (2004); and Prouteau and Wolff (2006).

Propensity Score Matching

Another threat to correct causal inference occurs when the predictor variable of interest occurs only rarely in the sample and is highly correlated with other independent variables. For example, what is the effect of attending an elite university on subsequent occupational status? The usual way of approaching such a question is to carry out a multiple regression of occupational status on attendance at elite versus other universities plus a set of variables controlling for family background, high school performance, and so on.
The difficulty is that attending an elite university tends to be so highly correlated with the control variables that controlling for confounding factors fails to hold them constant, because there are few people with low values on the control variables who attend elite universities. Apart from the conceptual problem this creates about the meaning of "holding constant," there is a serious statistical problem: "unbalanced treatments" tend to inflate standard errors (Rosenbaum and Rubin 1983, 48), making problematic the rejection of the null hypothesis of no effect. To cope with this problem, analysts sometimes resort to matching pairs of cases that differ with respect to the variable of interest (the "treatment" variable) but that are identical on a set of covariates. However, as Smith notes (1997, 326-327), until recently matching studies often have been resisted on the ground that they involve "throwing away" a lot of data. Moreover, it often is difficult to find good matches for more than a small number of variables because for a linear increase in the number of covariates there is a geometric increase in the number of matches required.

However, advances in the statistical theory of matching (the seminal article is by Rosenbaum and Rubin [1983]) have led to the development of a procedure that replaces the large set of discrete matches required by classical matching procedures with a propensity score, a scalar summary of the degree of similarity between cases with respect to a large number of covariates. The procedure involves predicting the treatment variable
from covariates and then matching each "treatment" case with the control case that has the nearest propensity score (or sometimes with several control cases; see Morgan and Winship [2007] for a useful discussion of the technical issues involved). The resulting sample is then analyzed in one of several ways: focusing on outcome differences between matched treatment and control cases, ignoring the unmatched cases; stratifying the sample into strata with similar propensity scores and comparing outcomes within strata (for an interesting application, see Brand and Xie [2007]); or using the propensity score directly in a regression equation to get an estimate of the effect of the treatment net of the propensity to be in the "treatment" group. The essential insight is that by comparing cases that have a similar propensity to be in the treatment group, we create a quasi-experiment. That is, we can think of matched cases as being, in effect, randomly assigned to either the treatment or the control group because they have the same probability of being in either group, given their covariates.

Consider the example presented by Smith (1997) in his illuminating exegesis of matching methods. He was interested in comparing the mortality rate in two types of hospitals, ordinary hospitals (N = 5,053) and "magnet" hospitals (N = 39), hospitals with organizational practices that enhanced their reputations as good places to practice nursing. Contrasting an OLS analysis with a propensity-matching procedure, he showed that the two methods yielded similar estimates of the difference in mortality rates in the two types of hospitals, but the latter method had far smaller standard errors, yielding a statistically significant reduction in mortality in the magnet hospitals compared to ordinary hospitals, a conclusion not yielded by the OLS analysis because of the large standard errors resulting from the unbalanced design.
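The matching procedure just described can be sketched as follows. The code, in Python with NumPy, uses simulated data and is not a reconstruction of Smith's analysis or of any Stata routine: it estimates the propensity score with a logistic regression fitted by Newton's method, matches each treated case to the control case with the nearest score (with replacement), and averages the outcome differences across matched pairs. The single covariate, the helper function, and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
true_effect = 1.0                               # assumed effect of the treatment

x = rng.normal(size=n)                          # covariate, e.g. family background
p_treat = 1.0 / (1.0 + np.exp(-(-2.5 + x)))     # treatment is rare and depends on x
t = (rng.uniform(size=n) < p_treat).astype(float)
y = 2.0 + true_effect * t + 2.0 * x + rng.normal(size=n)

def logit_fit(t, X, iters=25):
    """Logistic regression coefficients by Newton's method (hypothetical helper)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        b = b + np.linalg.solve(hess, grad)
    return b

# Step 1: predict treatment from the covariates to obtain the propensity score.
X = np.column_stack([np.ones(n), x])
score = 1.0 / (1.0 + np.exp(-(X @ logit_fit(t, X))))

# Step 2: match each treated case to the control case with the nearest score.
treated = np.flatnonzero(t == 1)
controls = np.flatnonzero(t == 0)
c_sorted = controls[np.argsort(score[controls])]
c_scores = score[c_sorted]
pos = np.searchsorted(c_scores, score[treated])
left = np.clip(pos - 1, 0, len(c_scores) - 1)
right = np.clip(pos, 0, len(c_scores) - 1)
use_left = np.abs(score[treated] - c_scores[left]) <= np.abs(c_scores[right] - score[treated])
match = np.where(use_left, c_sorted[left], c_sorted[right])

# Step 3: compare outcomes within matched pairs.
att = np.mean(y[treated] - y[match])
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"naive difference {naive:.3f}  matched estimate {att:.3f}  assumed truth {true_effect:.3f}")
```

The matched estimate applies to the treated cases (an average effect of treatment on the treated). As with any matching on observed covariates, nothing in this sketch protects against differences between groups on variables that were not measured.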
There is by now a substantial literature on both the statistical theory underlying propensity score matching and practical procedures for implementing the method. The 1997 Smith paper is a good place to start and also has a useful bibliography. Becker and Ichino (2002), Abadie and others (2004), and Becker and Caliendo (2007) discuss the implementation of propensity score matching in Stata. Dehejia and Wahba (2002) and Brand and Halaby (2006) provide useful evaluations and worked examples. Harding (2002) is a particularly instructive application. For other applications, see Berk and Newton (1985), … and others (1995), Keating and others (2001), Lu and others (2001), Morgan (2001), … and Smith (2003), Lundquist (2004), and Cohen (2005). One limitation of propensity score matching is that it may not balance unobserved covariates. Thus if you suspect that this is a problem, you will need to resort to one of the methods discussed here or in the previous chapter that are specifically designed to handle such problems.
Structural Equation Models

Structural equation modeling (SEM) is a technique (or, more precisely, a set of techniques) that permits the estimation of systems of equations, often involving unmeasured or latent constructs. Consider a simple example, Blau and Duncan's (1967, 170) classic model of status attainment, shown in Figure 16.2. When we think about how occupational status is transmitted from one generation to the next, it becomes evident that this is a multistep process: men whose fathers are well educated and have high-status jobs tend to achieve more schooling; those who achieve high levels of schooling tend to obtain
high-status first jobs (but their social origins may also help); and those who have high-status first jobs are likely to be able to parlay them into high-status current jobs (but their education and even their social origins may continue to matter). The various "paths" by which fathers' occupational status is transmitted to their sons are shown in the figure, which is known as a "path diagram." The paths can be represented by a set of equations predicting each of the outcomes in turn. The relationships among the equations can be explored to yield insights regarding the relative importance of different paths linking two variables. Moreover, under some circumstances (typically, if the size of particular coefficients is fixed, usually but not necessarily at zero, or two or more coefficients are constrained to be equal; that is, if the model is overidentified), the goodness of fit of the model can be assessed.

FIGURE 16.2. Blau and Duncan's Basic Model of the Process of Stratification. [Path diagram not reproduced; variables shown include father's education, father's occupation, respondent's education, and first job.]
Source: Blau and Duncan 1967, 170.

In the model just discussed, there is only one indicator for each variable. However, often the analyst has available repeated measures or a set of measures thought to represent a single underlying or latent construct. In such cases, it is possible to use SEMs to assess and correct for measurement error. See Bielby, Hauser, and Featherman (1977) and Hauser, Tsai, and Sewell (1983) for two early but instructive examples. Still another use of SEMs is to estimate processes involving reciprocal causation, even involving latent variables. See Duncan, Haller, and Portes (1968) for such an example. (Note that the applications just cited are all very old. This is not due to a lack of recent work [the recent literature is vast] but rather to the fact that the early work was much more explicit
FinalThoughtsand Future Directions:ResearchDesignand Interpretationtssues .:195
?J ,"ou T;":tQ " ."1L?.,'ii.'*llll;'jj"j3
:'
statisticiansociologist LeoGoodman"the most importantquantitativesociologistin the wor d in the lattef half of ihe twentiethcentury" Duncanwas responsb e for ntroducingpath analysis(a versionof structuralequationmodels)inlo sociology.He usedpath analysisasthe technical appdratusto reconceptualize intergeneralionalsocialmobility as a multistepprocessin which statusattributes(suchas education,occupationalstatus,and ncome)are modeledas dependingnot only on parentalstatusbut aisoon the priorstatusesof individuals. Duncanalso contributedimportantly to our understanding of racialdifferences rn socioeconomic attainment,spatialand racialinequalllies withincities,and,laleIn hiscareer, attitudemeasurement. Althoughlackingadvanced mathematical training,Duncanprobablymadebetteruseof the statistical toolsat hisdisposal than anyolhersocialscientisi, throughthe combination of an unusualabil;tyto think through a problemin advanceand greatclarityabout how to representsociological models.lt is stril.ing,and telling,that becauseof thenideasin stalistical extant rulesgoverningaccessto Current PopulationSurveydata, all of the tabulationsand estirnalesin Duncan'slandmarkbook TheAmerlcanOccupationa! Structure(Btauand Duncan 1967)were specified in advance, withoutthe analysts havingseena s ngiecoefticient. InteF (1984), estingly, Duncanhimsel{regardedhis latebook, NotesonSocial Measurement as his most importantcontribution,a judgmeht not widelysharedby the many researchers strongly contrioJtons. 'rflrerceo by hissJbstant.ve yearsin Stillwater, Sornin Nocona, Texas, Duncanspentmostof hrsprecollege Oklahoma, professor lvherehisfather,OtisDurantDuncan,alsoa sociologist, was a at OklahomaState UniversitfDuncandid his undergraduatework at LouisianaStaieUniversity, obtainedan l\lA at the University of Minnesota. 
servedthreeyearsin the U.S.Armydur ng WorldWar ll, and ihen completedhisPhDat the University of Chicagoin 1949.Hetaughtat Pennsylvania State Jnive's;ty, the Universitv ol Wi:consin,lhe UlversiLyof Chicago,rne Urrversiry ol Micnigan, :he Unlversityol Arizona,and the Universityof Cali{o.niaat SantaBarbara.Durcan enloyeda secondcareeras a composerof electronicmusicand was famousamong peoplewho had no oedthat he wdsa distinguisl'ed socrasc'enlist.
. i the models being estimated than much of the literature that followed. after struc* equation modeling became widely used. Thus for didactic purposes the early papers : lore useful.) TIle strategy for estimating SEMs is to exploit the lact that the posited relationships  rg the vadables (observed and latent) inplies a particul;rr covariance structure (that  .et of relationships among the variances and covariances of the observed variables), ,  , r is why the technique is sometimes called covariance stntcture npdeling. Goodness : is assessedby comparing the covariance structure implied by the nrodel with the rrnce stmcture obserued in the data set beins analvzed.
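To make the covariance-structure idea concrete, here is a minimal Python sketch. It is not Blau and Duncan's analysis: the variable names (origin, schooling, job) and all path coefficients are hypothetical, chosen only so that every variable has unit variance. It computes the correlations implied by a recursive path model and compares them with an "observed" matrix when the direct origin-to-job path is fixed at zero.

```python
def implied_cov(order, paths, disturbance_var):
    """Covariance matrix implied by a recursive path model.

    order: variable names in causal order
    paths: {outcome: {predictor: coefficient}} (hypothetical values below)
    disturbance_var: {variable: variance of its disturbance}
    """
    # Express every variable as a weighted sum of the uncorrelated disturbances.
    loadings = {}
    for v in order:
        row = {v: 1.0}
        for p, b in paths.get(v, {}).items():
            for e, w in loadings[p].items():
                row[e] = row.get(e, 0.0) + b * w
        loadings[v] = row
    cov = {}
    for a in order:
        for b in order:
            cov[a, b] = sum(
                loadings[a].get(e, 0.0) * loadings[b].get(e, 0.0) * disturbance_var[e]
                for e in order
            )
    return cov


ORDER = ["origin", "schooling", "job"]

# "Observed" correlations, generated here from a model with a direct origin -> job path.
full = implied_cov(
    ORDER,
    {"schooling": {"origin": 0.5}, "job": {"origin": 0.2, "schooling": 0.4}},
    {"origin": 1.0, "schooling": 0.75, "job": 0.72},
)

# Overidentified model: the direct origin -> job path is fixed at zero.
restricted = implied_cov(
    ORDER,
    {"schooling": {"origin": 0.5}, "job": {"schooling": 0.5}},
    {"origin": 1.0, "schooling": 0.75, "job": 0.75},
)

# Residuals: observed minus implied; nonzero entries signal misfit.
residuals = {pair: full[pair] - restricted[pair] for pair in full}
for (a, b), r in residuals.items():
    if a < b and abs(r) > 1e-9:
        print(f"{a}-{b}: observed {full[a, b]:.2f}, implied {restricted[a, b]:.2f}")
```

In this toy case the restricted model reproduces every correlation except the one between origin and job (0.40 observed versus 0.25 implied), which is the kind of discrepancy a goodness-of-fit assessment summarizes. Real SEM software estimates the coefficients and supplies a formal test; this sketch only illustrates the comparison.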
... it is a procedure that enables the analyst to explore the implications of whatever model the analyst posits on a priori grounds. Thus structural equation modeling is best seen as an interpretative procedure, with the added feature that in some cases it is possible to determine whether a particular model is consistent with observed data. Used properly in this way, SEM can be a valuable tool. (The best introduction remains the 1989 text by Bollen, which, although somewhat demanding, is intended for and accessible to social scientists. See also a collection of papers on technical issues, edited by Bollen and Long [1993]; Bollen and Curran's 2006 book using SEMs to estimate latent curve models; and Bollen and Brand's 2008 paper using SEMs to estimate random and fixed effects models.)
THE IMPORTANCE OF PROBABILITY SAMPLING
To generalize from a sample to a population (which is what social scientists are almost always interested in doing, whether we admit it or not), it is necessary to sample cases from the population of interest in such a way that each individual in the population has a known probability of being included in the sample. Only under this circumstance do the principles of statistical inference apply. Nonetheless, many studies violate this principle, drawing "convenience" or "casual" samples. Chinese social surveys are particularly egregious in this respect, often sampling a set of provinces or cities that are said to be typical of particular types of places; this is true of even high-quality surveys such as the Chinese Health and Nutrition Survey (Henderson and others 1994). The difficulty is that there is no way of knowing to what extent and in what ways the chosen places are indeed similar to the places that are not chosen but are purported to be represented by the chosen places. In sum, samples of "typical" places are no substitute for probability samples. It is well worth the extra cost (in the sampling effort and, often, in the fieldwork) to design a sample in such a way that it can be generalized to the population of interest.
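A small illustration of why known inclusion probabilities matter may help. The Python sketch below uses a hypothetical population (900 rural and 100 urban residents, with 10 cases sampled from each stratum) and made-up schooling values. Because each case's probability of selection is known, it can be weighted by the reciprocal of that probability to recover a population estimate; with a convenience sample no such probabilities exist, so no such correction is available.

```python
def weighted_mean(values, inclusion_probs):
    """Estimate a population mean, weighting each case by 1 / P(inclusion)."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical stratified sample: (stratum, years of schooling)
sample = [("rural", 4.0)] * 10 + [("urban", 10.0)] * 10
prob = {"rural": 10 / 900, "urban": 10 / 100}  # known selection probabilities

values = [y for _, y in sample]
naive = sum(values) / len(values)  # ignores the design
design_based = weighted_mean(values, [prob[s] for s, _ in sample])
print(naive, design_based)
```

The unweighted mean overstates schooling because urban residents are oversampled; the design-based estimate restores each stratum to its population share.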
ASK A FOREIGNER TO DO IT
Social scientists are notoriously bad at characterizing their own societies. A case in point: in my 1996 Chinese survey, it proved impossible to do the fieldwork in one county-level urban district because of opposition from local officials. Instead of asking me to provide a substitute place from the same stratum (recall from Chapter Nine that there were twenty-five urban strata, based on the level of education in the population), my Chinese colleagues simply substituted another district from the same city that they said was very similar to the omitted district. However, it turned out that whereas the omitted district was in the eighteenth stratum, the substitute was in the twenty-third stratum, clearly a violation of the stratified sampling design. The truth is that if you want a clearheaded characterization of a society, you should ask a foreigner to render it. This essential point was understood by the Carnegie Corporation, which commissioned the Swedish economist and sociologist Gunnar Myrdal to head a study of race relations in the United States in the 1930s. The result was the classic monograph An American Dilemma (Myrdal 1944, vi-vii).


Still another example can be found in institutionally based studies, for example, studies of hospitals, clinics, and their catchment areas, which are often used in public health research. The justification for what amount to convenience samples is that the particular hospitals or clinics being studied are representative of all similar places.

When is it legitimate to invoke the concept of a superpopulation? I suggest that when data for a population exist from which a probability sample can be drawn, convenience samples are a poor substitute and do not meet current scientific standards, claims of graduate student poverty, lack of time, and so on, notwithstanding. However, when the population is unknown and unknowable, as in the case of Murdock and Provost's ethnographic sample, use of the available data and generalization to a superpopulation are legitimate. In the case of single cross-sectional surveys based on probability samples of the population at the time of the survey, we are on firm ground in characterizing the society as it was at the time of the survey but are increasingly on shaky ground as we try to generalize over time. It can be done, but it must be justified.

Pooling Data from Multiple Surveys
Invoking the concept of a superpopulation has considerable practical use when it can be justified. A particularly compelling application is when comparable data are available over time, as in the U.S. GSS and other repeated cross-sections. If it can be shown that relationships among the variables of interest do not vary over time, data from several years may be pooled to increase the size of the sample available for analysis. This can be a particularly useful strategy when any one year yields insufficient data to sustain reliable comparisons, for example of race differences in the United States. The basic test is a variant on the strategy for group comparisons discussed earlier in this chapter and also in Chapter Six (see also the discussion of trend analysis in Chapter Seven). There are two steps. First, estimate an equation of the form
$$\hat{Y} = a + \sum_{i=1}^{I} b_i X_i + \sum_{j=2}^{J} c_j S_j + \sum_{i=1}^{I} \sum_{j=2}^{J} d_{ij} X_i S_j \tag{16.7}$$
where the $X_i$ are predictor variables and the $S_j$ are cross-sectional replicates of the survey (with the first omitted to avoid linear dependency). Second, test whether the $c_j$ and the $d_{ij}$ are collectively zero. If so, you can conclude that all the samples are drawn from a single population and happily proceed to pool your data. But even if there are year-to-year variations in the level of $Y$ (significant differences among the $c_j$) or in the relationship of one or more of the $X$s to $Y$ (significant differences among the $d_{ij}$), you may still wish to pool your data but include the dummy variables and interaction terms necessary to capture the ways the social process you are studying changes over time. This has the advantage of permitting a formal analysis of change and increasing statistical power for estimating the relationships that do not vary over time. (For some recent examples of the use of this strategy, see Barkan and Greenwood [2003], Chen and Guilkey [2003], Pow... [2003], Fitzgerald and Ribar [2004], Kelly and Kelly [2005], and Tavits [2005].)

An alternative use of repeated cross-sections, which has much to recommend it, is to use data from one survey to develop a preferred model, modifying the model in light of relationships unanticipated by your theory but observed in the data. Then estimate your preferred model using data from a replicated cross-section, such as one conducted in the preceding or following year (recall the discussion of this strategy in Chapter Seven).

A final possibility, in cases where information is collected for more than one individual within a household (either by interviewing more than one household member or by asking a respondent about the characteristics of other household members), is to expand the sample by treating each individual for whom information is available as a separate case. However, in such cases it is necessary to take account of the fact that observations are not independent, by adjusting for clustering within households using survey estimation procedures or by adopting the kind of multilevel modeling discussed by Mason (2001), cited earlier. Moreover, when information is available for a restricted set of others, for example, spouses, it is important to be attentive to the consequences, for example, by carrying out a sensitivity analysis of the differences in conclusions yielded by a sample of married people and a sample of all adults (see the discussion of sensitivity analysis in the next section).
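The two-step pooling test described earlier in this section (estimate the model with survey dummies and interactions, then test whether those terms are collectively zero) can be sketched as follows. This is a minimal illustration in plain Python rather than Stata, with a single predictor for brevity; the function names, the deterministic "noise," and the two hypothetical survey replicates are all assumptions made for the example. In practice you would use your statistical package's regression and joint-test commands.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def rss(rows, y):
    """Residual sum of squares from an OLS fit of y on the columns in rows."""
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))


def pooling_f(x, y, survey):
    """F statistic for H0: all survey dummies (c_j) and interactions (d_ij) are zero."""
    ids = sorted(set(survey))[1:]  # first replicate omitted to avoid linear dependency
    restricted = [[1.0, xi] for xi in x]
    full = [
        [1.0, xi] + [float(s == j) for j in ids] + [xi * (s == j) for j in ids]
        for xi, s in zip(x, survey)
    ]
    rss_r, rss_f = rss(restricted, y), rss(full, y)
    q = 2 * len(ids)               # number of coefficients constrained to zero
    df = len(y) - len(full[0])     # residual degrees of freedom, full model
    return ((rss_r - rss_f) / q) / (rss_f / df), q, df


# Two hypothetical replicates with six cases each
x = [0, 1, 2, 3, 4, 5] * 2
noise = [0.1, -0.2, 0.15, -0.05, 0.1, -0.1] * 2
survey = [1] * 6 + [2] * 6
y_stable = [1 + 2 * xi + e for xi, e in zip(x, noise)]
y_changed = [1 + (2 if s == 1 else 3) * xi + e for xi, e, s in zip(x, noise, survey)]

print(pooling_f(x, y_stable, survey))   # F near zero: pooling is defensible
print(pooling_f(x, y_changed, survey))  # large F: the slope differs across surveys
```

The F statistic would be referred to an F distribution with the returned degrees of freedom; a large value indicates that the dummies and interactions should be retained if the data are pooled.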
A FINAL NOTE: GOOD PROFESSIONAL PRACTICE
Now that we have considered various issues facing quantitative data analysts and have had a brief introduction to advanced techniques worthy of study, I close by offering several principles of good professional practice, the things that make a difference between mediocre and superior quantitative data analysis. These principles are simple and available to any analyst; their application does not require particular brilliance or insight or mathematical facility. But attention to them is sure to improve the quality of your work.

Understand the Properties of Your Data
Whether working with data you acquired from another analyst or data from an archive or data you have collected yourself, you should thoroughly understand how the data were created and also should explore their properties. Pay particular attention to the sample design to determine whether survey estimation is possible and how to implement it. For the same reason, you need to understand how any weight variables were constructed and how to use them. Often the weights provided by the original investigators are poorly documented. It is entirely appropriate to write to the investigators to ask them how they constructed their weights. You should not regard this as an imposition; one of the responsibilities of those who make their data available for public use is to provide adequate documentation.

IN THE UNITED STATES, PUBLICLY FUNDED STUDIES MUST BE MADE AVAILABLE TO THE RESEARCH COMMUNITY
It is now a requirement of both the National Science Foundation (NSF) and the National Institutes of Health (NIH) that sample surveys funded by these agencies be made available for public use in a timely way. The current NIH policy reads, "NIH endorses the sharing of final research data . . . and expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers. 'Timely release and sharing' is defined as no later than the acceptance for publication of the main findings from the final data set" (http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part7.htm#_Toc54600131, accessed December 9, 2007). The NSF policy statement is less precise but conveys the same principle: "NSF expects . . . investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages awardees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable" (http://www.nsf.gov/pubs/2001/gc101/gc101rev1.pdf, accessed December 9, 2007). Providing adequate documentation is part of the requirement.

You also should calculate and inspect univariate frequency distributions for every variable in the data set, or at least every variable pertinent to your analysis. This is easy to do by using Stata's codebook command. With respect to each variable, ask yourself whether the observed distribution is plausible, given what you know about the population being studied. It is surprising how informative the inspection of univariate distributions can be. The next step is to create cross-tabulations or tables of means that show the association between each of your central dependent variables and all the candidate predictor variables. This too can be extremely informative, revealing both deficiencies in the data and deficiencies in your a priori assumptions.

I still recall, with some embarrassment, an incident forty-five years ago when I was a beginning graduate student at the University of Chicago. I worked as a research assistant at the National Opinion Research Center (NORC), and Peter Rossi was the director of NORC. I ran into him one evening as he was leaving the building and carrying a great stack of computer printout, cross-tabs from the study we were working on. I made some snide remark about why should we bother with cross-tabs now that we could do regressions by computer, and he gave me a withering look and said something like, "Live and learn, kid." Of course, he was completely correct. There is a lot to be learned by getting a feel for the data before rushing to estimate fancy, or even not-so-fancy, models.
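For readers not working in Stata, the inspection steps just described can be sketched in a few lines of Python. This is only an illustration: the records, the variable names (educ, vote), and the helper functions are hypothetical, and a statistical package will do the same job with less effort. Note that missing values are counted explicitly, since an implausible amount of missing data is one of the first things such a check reveals.

```python
from collections import Counter


def frequencies(rows, var):
    """Univariate frequency distribution, counting missing values explicitly."""
    return Counter(row.get(var) for row in rows)


def crosstab(rows, row_var, col_var):
    """Cell counts for a two-way cross-tabulation."""
    return Counter((row.get(row_var), row.get(col_var)) for row in rows)


# Hypothetical records; None marks a missing value
data = [
    {"educ": "low", "vote": "yes"},
    {"educ": "low", "vote": "no"},
    {"educ": "high", "vote": "yes"},
    {"educ": "high", "vote": "yes"},
    {"educ": None, "vote": "no"},
]

print(frequencies(data, "educ"))
print(crosstab(data, "educ", "vote"))
```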
Explore Alternatives to Your a Priori Hypotheses
One of the features of truly strong research papers is that the author anticipates and explores all of the alternative explanations for the observed phenomenon or relationship that a critic might propose. In nonexperimental work the search for alternative explanations often amounts to assessing the possibility of spurious association due to the failure to include variables that affect both the independent and dependent variables in the model. Thus you need to ask yourself, is there an alternative explanation for the association I observe? In particular, might some other variable be causing both the outcome I observe and the values of my predictor variables? Then, if possible, include the candidate variables in your model, or do a side analysis (even using a different data set) to investigate the association of these variables with variables already in your model.
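A quick way to see what "spurious association" means numerically is the first-order partial correlation, which removes the linear effect of a third variable from the association between two others. The Python sketch below uses made-up correlations; the function name is an assumption for illustration, and in real work you would simply add the candidate variable to your model.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical case 1: z fully accounts for the x-y association
print(partial_corr(0.25, 0.5, 0.5))

# Hypothetical case 2: part of the x-y association survives controlling for z
print(partial_corr(0.50, 0.5, 0.5))
```

In the first case the association vanishes once the common cause is controlled; in the second it shrinks from 0.50 to about 0.33 but does not disappear.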
A nice example of the use of this strategy is a paper by Miller (2007) that explores whether granting women the vote early in the twentieth century resulted in an increase in public health spending and hence a reduction in child mortality. He finds strong evidence in support of his claim. But he recognizes that before accepting the causal argument he needs to rule out the possibility that suffrage legislation was endogenous, that is, a consequence of factors that also resulted in an increase in public health spending. He thus devotes a section of his paper to various "validity tests" designed to rule out the possibility of alternative explanations for his results.

Where the strategy just mentioned is not possible because the potential confounding variables have not been observed, it may be possible to rule them out by noting that if they were the true cause, the predicted result would differ from what is observed. For example, in a paper analyzing the determinants of literacy in China (Treiman 2007a), I argued that the contrasting patterns for nonmanual and manual workers support "the hypothesis that nonmanual work [...] work suppresses it" (146). However, before accepting this conclusion, I had to rule out the possibility that the age differences resulted from historical changes in China that produced differences in literacy by cohort. I did this by pointing out that if the quality of education increased (decreased) over time, we would expect an increase (decrease) in literacy for both manual and nonmanual workers rather than the observed pattern. Similarly, I ruled out the possibility that as the nonmanual sector grew, the average quality of both nonmanual and manual workers declined.

Still another possibility, available under some circumstances, is to sweep unmeasured potential confounders out of the analysis by estimating fixed or random effects models, as we did in the previous chapter. Still another possibility is to adjust for the effect of potential confounders by using one of the methods for coping with endogeneity and sample selection bias discussed earlier in this chapter.
Conduct Sensitivity Analysis
Another way to gain confidence, and to inspire confidence in your readers, is to show that your results are robust. Try different functional forms by which you represent relationships in a general linear model framework, different cutting points when analyzing categorical variables, and, more generally, different ways of representing your concepts. Like consideration of potential omitted variable bias, this sort of exploration also may require going beyond the data set being analyzed. See, for example, Treiman and Roos (1983), who assess the adequacy of the standard proxy for labor force experience (= age minus years of schooling minus six) by drawing on two alternative data sets that include both the proxy measure and actual labor force experience measures.

Your hope, of course, is that different specifications yield similar results. But even if they do not, you need not be discouraged. The goal is not to "prove" a hypothesis but to discover how the world works; sometimes this means concluding that our data are not informative because our results are not robust to alternative specifications.
Hout and Hauser (1992) critiqued Erikson and Goldthorpe's important comparative study of social mobility, The Constant Flux (1992b), showing that Erikson and Goldthorpe's results are not robust to changes in the model specification, in the statistical procedure used, or in the level of aggregation of their occupational classification. See also Erikson and Goldthorpe's (1992a) response. The exchange provides an illuminating example of why it is important to carry out sensitivity analysis yourself before a critic does it for you. For a striking example of a tendentious and sloppily argued analysis that was thoroughly demolished by the long knives of critics, see Herrnstein and Murray (1994) and important critiques by Heckman (1995), Fischer and others (1996), and Hauser and Huang (1997).

One useful approach is to "bracket" your results, reporting not a point estimate but a range of estimates derived under different assumptions. For example, if it is not clear to you whether, in an attitude scale, a "don't know" response should be coded "missing" or given the middle value, intermediate between a positive and a negative attitude, try it both ways and assess (and, of course, report) the results.
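The "don't know" example can be sketched directly. The following Python fragment is a minimal illustration, assuming a hypothetical five-point agreement scale with "don't know" stored as None; the function name and data are invented for the example. It computes the mean score under both codings and reports the resulting range.

```python
from statistics import mean


def bracket(responses, scale_midpoint):
    """Mean attitude score under two codings of "don't know" (given here as None)."""
    dropped = mean(r for r in responses if r is not None)
    midpoint = mean(scale_midpoint if r is None else r for r in responses)
    low, high = sorted((dropped, midpoint))
    return {"dk_missing": dropped, "dk_midpoint": midpoint, "range": (low, high)}


# Hypothetical 1-5 agreement scale with two "don't know" responses
answers = [5, 4, 4, 5, None, 2, None, 4]
print(bracket(answers, scale_midpoint=3))
```

The same pattern extends to model coefficients: estimate the model under each defensible coding and report the span of estimates rather than a single number.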
Document Your Work
You should carry out all your analysis using command files (-do- files in Stata) and producing a log of your commands and results each time you execute your command file (a -log- file in Stata). Moreover, you should use extensive comments in your command files, saying for each bit of analysis what you are doing and why you are doing it. In my own work I go further, adding comments on the results.

This practice has several advantages. First, it provides a record of what you have done. The process of research production in the social sciences, from initial idea to published paper, often covers a period of several years. Even if you are an efficient person who does one thing at a time and thus are able to execute your analysis from start to finish in a matter of a few weeks, you then have to submit your paper to a journal, which typically will take several months to get back to you, often with a request for revision and resubmission that entails doing additional analysis. At this point you do not want to be in the embarrassing position of not remembering exactly how you carried out the computations to produce the statistics that appear in your tables and graphs and, worse still, not being able to replicate them. If you have a well-documented command file, you will be able to figure out what you have done, and why.

Moreover, you will be able to modify your analysis and create a new set of computations efficiently. Suppose, for example, that the referees suggest that you control for an additional variable. This is a trivial task if you have an existing command file. You simply add the variable to your model and execute the command file. This is far preferable to redoing an entire section of your analysis.

Finally, you will make it possible for others to replicate, or challenge, your work, by archiving your log file so that it is available on demand. You may be tempted to obscure the details of your work so that no one else can discover errors in it. But this is not how science progresses; far better to be clear (even if wrong) than vague. If you are clear about your procedures, you make it possible for others to exactly replicate what you have done and perhaps to figure out how to do it better. Remember, the aim of the game is to advance our collective understanding of social structure and process.

Of course, the gold standard for the production of research papers is that they contain all of the information necessary to exactly replicate the research. Your goal should be to document your work thoroughly enough so that if you handed your paper and your data set to a competent analyst, he or she could reproduce every number in your paper. As laudable as this goal is, however, it tends to be frustrated by journal editors who insist on shortening papers by omitting technical detail. So in addition to describing your technical procedures as clearly as possible in your paper, archiving your log file is very good professional practice.
AN "AVAILABLE FROM AUTHOR" ARCHIVE
Because claims in published papers that additional materials are "available from author" usually prove to be false, at least after a few months, the California Center for Population Research (CCPR) at UCLA recently implemented a mechanism by which additional materials, for example, -do- and -log- files, can be attached to papers posted in its Population Working Paper archive. Other research centers are to be encouraged to do the same.

Do a Last Check for Errors
The last thing you should do before you submit a paper for publication (or as a term paper or a dissertation chapter, or post it in a working paper series) is to execute your command file and then to check every single number in your paper against the corresponding numbers in your log file. You will be amazed at the number of discrepancies you find. Because producing a professional paper is typically a lengthy process, it is extremely easy for inconsistencies to creep in. Your goal should be to produce a single command file that contains all the computations required for an analysis. Even in cases where you are analyzing more than one data set, you would be well advised to incorporate all your commands into a single file. In this way, you create a single document that produces and explains all of your work. You also minimize the chance that portions of the analysis will fail to be documented or that the documentation will be lost. For the same reason you should incorporate side computations, even hand computations, into your command file. (In Stata the display command accommodates this by functioning like a hand calculator.)

The standard to be emulated, at least partly, is the lab notebook conventionally kept in a chemistry lab. Lab notebooks record the conditions under which an experiment was conducted, including the temperature and humidity of the room, whether a reagent was spilled on the floor that day (together with the exact time and description of what was spilled and where), and the outcome of each procedure, whether successful or not. We need not go that far. Nothing much is gained by recording the errors we made in the process of getting our file to execute. But we should record analytic dead ends, hypotheses that did not pan out, assumptions that proved to be incorrect, and so on. You will find such commentary enormously helpful when you return to analysis after an absence of months or years, which, as I have noted, is not an uncommon gap. Moreover, by documenting and archiving your analytic dead ends, you may help other analysts.
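The final check of paper against log can be partly mechanized. The sketch below is one hypothetical way to do it in Python, assuming you have typed the numbers from your tables into one dictionary and extracted the corresponding values from your log into another; the labels, values, and function name are all invented for illustration.

```python
def find_discrepancies(reported, recomputed, decimals=3):
    """Return labels whose published value differs from the rounded log value."""
    problems = {}
    for label, published in reported.items():
        if label not in recomputed:
            problems[label] = (published, None)  # nothing in the log to check against
        elif round(recomputed[label], decimals) != round(published, decimals):
            problems[label] = (published, recomputed[label])
    return problems


# Hypothetical table entries and the corresponding values from a log file
paper = {"b_educ": 0.412, "b_age": 0.035, "r2": 0.287}
log = {"b_educ": 0.41193, "b_age": 0.0305, "r2": 0.28744}
print(find_discrepancies(paper, log))
```

Here the coefficient labeled b_age would be flagged, the kind of transposition error that creeps in when tables are retyped by hand.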
WHAT THIS CHAPTER HAS SHOWN
In this chapter I have reviewed some general points regarding good research design; have briefly introduced a number of advanced statistical techniques and procedures, which you should pursue in further course work or independent study; have emphasized the value of probability sampling; and have concluded with some advice about good research practice. On the basis of the material we have covered in this book, you are well prepared to do high-quality and rigorous analysis of sample survey and other data. But you should not stop here, because statistical methodology in the social sciences is advancing rapidly, and a first course in data analysis is no longer sufficient to master state-of-the-art techniques, many of which I reviewed in this chapter. I thus urge you to think of this book as the beginning of a career-long commitment to continually expand your tool kit, just as I have done in the more than forty years since completing my PhD. If an old dog like me can learn new tricks, so can you! Have fun!
APPENDIX
DATA DESCRIPTIONS AND DOWNLOAD LOCATIONS FOR THE DATA USED IN THIS BOOK
This appendix describes all of the data sets used to create the worked examples in the book. A common feature of all these surveys is that they are household samples, which means that the data need to be weighted by the reciprocal of the number of adults in the household to convert them into person samples (see the discussion of this issue in Chapter Nine). They are all based on probability samples of households, and the details of the design are given in other sources, indicated in the references included in this appendix.
CHINA
The 1996 survey, Life Histories and Social Change in Contemporary China (Treiman, Walder, and Li 2006), was conducted by faculty and students of People's University, Beijing, with funding from the U.S. National Science Foundation (SBR-9423453), the Ford Foundation-Beijing, and the Luce Foundation. The principal investigators were Donald J. Treiman (UCLA), Andrew G. Walder (Stanford), and Qiang Li (then at People's University and now at Qinghua University, Beijing). The survey collected extensive information on respondents' socioeconomic characteristics and educational, occupational, and family histories, and also information on their spouses, parents, children, and other family members.
The survey was based on a stratified national probability sample of the population of China age twenty to sixty-nine, yielding 6,090 cases plus a special sample of 383 village leaders (village cadres). Details of the sample design are given in Treiman (1998).
The data and accompanying documentation can be downloaded from the UCLA Social Science Data Archive at http://www.sscnet.ucla.edu/issr/da. Click Catalog, Index, Asia-China, and then Life Histories and Social Change in Contemporary China, 1996.
EASTERN EUROPE
The survey of Social Stratification in Eastern Europe after 1989 (Szelényi and Treiman 1994) consists of six general population surveys, based on probability samples of the adult populations of Bulgaria, the Czech Republic, Hungary, Poland, Russia, and Slovakia, with all the surveys conducted in 1993 except the Polish survey, which was carried out in 1994, and all surveys using an essentially identical questionnaire. Each survey sampled approximately 5,000 adults using a multistage stratified national probability sample design (except that the Polish sample was smaller, approximately 3,500 adults). Details on the survey design can be found in Treiman (1994). These surveys were funded by the U.S. National Science Foundation (SES-9111722 and SBR-93103[?]5), the U.S. National Council for Soviet and Eastern European Research (806-29), the Dutch National Science Foundation (NWO), and various Eastern European governmental agencies. The principal investigators were Ivan Szelényi and Donald J. Treiman, at that time both at UCLA.
The focus of the data collection was on the effect of the collapse of communism on life chances. Extensive information was collected on respondents' socioeconomic characteristics and educational, occupational, residential, and family histories, and also information on their spouses, parents, children, and other family members. In addition, a great deal of political information was collected, as well as information that permitted a contrast between 1988 and 1993.
The data and accompanying documentation can be downloaded from the UCLA Social Science Data Archive at http://www.sscnet.ucla.edu/issr/da. Click Catalog, Index, Europe-Bulgaria, and then Social Stratification in Eastern Europe after 1989: General Population Survey.
The survey of elites that was carried out in each of the six nations at the same time as the general population survey and analyzed in Chapter Thirteen is not currently available for public distribution due to the difficulty of protecting the confidentiality of respondents. The difficulty with an elite survey, of course, is that individuals are fairly readily identifiable from details of their biographies.
I DataDescriptions and DownloadLocations for the DataUsedin ThisBook 409
SOUTHAFRICA
The Survey of Socioeconomic Opportunities and Achievement in South Africa (Treiman, Moeno, and Schlemmer 1994) is a multistage national probability sample survey of all races in "greater South Africa," carried out in the early 1990s in several stages between 1991 and 1994. Greater South Africa refers to what was historically, and is currently, the South African nation; that is, it includes the "TVBC States" that at the time of data collection were nominally independent puppet states hived off by the apartheid regime (see Treiman [2007b] for a brief history). The sample consists of a general population sample of 8,714 adults and a Black elite sample of 372 adults. The principal investigators were Donald J. Treiman and two South African sociologists, Sylvia N. Moeno and Lawrence Schlemmer. See Treiman, Lewin, and Lu (2006) for details on the survey design.

Extensive information was collected on respondents' socioeconomic characteristics and educational, occupational, residential, and family histories, and also information on their spouses, parents, children, and other family members.

The data and accompanying documentation can be downloaded from the UCLA Social Science Data Archive at http://www.sscnet.ucla.edu/issr/da. Click Catalog, Index, Africa-South Africa, and then Survey of Socioeconomic Opportunities and Achievement.
GENERAL SOCIAL SURVEY

Funded by the U.S. National Science Foundation, the General Social Survey (GSS; Davis, Smith, and Marsden 2007) is a repeated cross-sectional survey, with data collected from a national multistage probability sample of U.S. adults: about 1,500 people approximately each year from 1972 through 1991 and then, beginning in 1994, about 3,000 people every other year. The principal investigators are James A. Davis and Tom W. Smith and, in recent years, Peter V. Marsden. Appendix B provides details on the sample design as it changed over the years.

The GSS is intended to be an all-purpose survey, to permit analysis of the attitudes, behavior, and characteristics of the U.S. population by those who cannot afford the massive resources required to mount a national probability sample survey. As it has matured, it has become an increasingly valuable vehicle for the study of social change, especially in attitudes. The strategy of the GSS is to repeat a substantial portion of the questionnaire year after year to permit the analysis of changes over time but also to incorporate new questions that are responsive to changing conditions and concerns.

The data may be downloaded from the National Opinion Research Center at the University of Chicago, the producer of the GSS: http://www.norc.org/GSS+Website/ It is also possible to do data analysis using this site, without actually downloading the data. An alternative site, which also permits both data analysis and downloading, is the SDA Archive at the University of California-Berkeley: http://sda.berkeley. For information regarding how to download or purchase the documentation, see http://www.gss.norc.org.
APPENDIX B

SURVEY ESTIMATION WITH THE GENERAL SOCIAL SURVEY

The General Social Survey (GSS) uses a stratified multistage probability sample (see the description in Davis, Smith, and Marsden [2007, Appendix A], and the sources cited there), which means that correct estimates of standard errors require survey estimation procedures. Unfortunately, the GSS documentation is not complete, presumably to maintain confidentiality; only the primary sampling units (PSUs) are identified, by the SAMPCODE variable; neither the strata nor the secondary sampling units are identified. This is unfortunate because it precludes exploiting Stata's procedures for adjusting for multistage stratified sampling. Also, information is not provided that would permit a finite population correction, although this limitation is not important, given the large number of units in the population sampled at each stage. Moreover, the sample design has changed over the years, with a new sampling frame created each decade based on the decennial census and additional major changes in 1976, from a block quota design to a full probability sample design; in 2004, with the introduction of a partial list-based sample drawn from the U.S. Postal Service address list, an aggressive effort to convert a subset of initial nonrespondents, and postenumeration adjustments for differential nonresponse; and in 2006, with the introduction of a Spanish language sample. Finally, in 1982 and 1987 Blacks were oversampled. These changes complicate pooling data across years.
ANALYZING DATA FROM A SINGLE YEAR
This section describes how to analyze data from a single year of the GSS; the following section offers suggestions for analyzing data pooled from more than one year.
The 1972 to 1976 Block Quota Samples
In the early years of the GSS, probability sampling was carried out down to the block level. Then, within each block, interviewers traveled around the block in a specified way until a quota of people with particular characteristics had been interviewed. In some of these years, half the interviews were conducted using the block quota method and half using full probability methods. Comparison of the two procedures concluded that the block quota samples underrepresented fully employed men and overrepresented people living in [...]. Although procedures for doing statistical inference with quota samples exist (see, for example, Stephan and McCarthy 1958), and design effects of about 1.5 have been reported (Davis, Smith, and Marsden 2007), a reasonable approach is to treat the block quota samples as if they are probability samples and to use the same survey estimation procedures as for the true probability samples.

For the block quota samples, you could weight the observed distributions to match the 1970 census by means of a postenumeration adjustment. Rather than doing this, simply include gender and employment status in your analysis [...]. If you need summary statistics, one possibility is to inflate the number of fully employed males by weighting and to report both the original and the inflated values, thereby bracketing the range within which the statistics fall.
The 1977 to 2002 Surveys, Except 1982 and 1987

With the exception of the 1982 and 1987 surveys, which included oversamples of Blacks, all of the 1977 to 2002 surveys are full probability samples and can be treated in a standard way. You need to adjust for the fact that the GSS, like most household surveys, is a sample of households rather than people. But the eligible population consists of adults (people age eighteen and over) who are capable of responding to an interview. Because households are randomly sampled within small areas but only one randomly chosen adult per household is interviewed, adults living in households with many adults have a smaller chance of being included in the sample than do adults living in households with few adults. A reasonable way to convert the sample of households into a sample of people is to weight each respondent by the ratio of the number of adults in the household to the mean number of adults in all households in the sample. This can be accomplished in Stata by constructing a household weight variable, HHWT:

    egen adultm = mean(adults)
    gen hhwt = adults/adultm

and then weighting your data by this variable. (In fact, because Stata renorms probability weights to the original sample size, you can simply use the ADULTS variable as your weight variable unless you are using this variable as a component in a more complex weight variable, as in the next section, in which case you should use HHWT as the component.)

Although the GSS is a multistage sample with two, and for some PSUs three, stages, the GSS only provides information on the primary sampling units (metropolitan areas and nonmetropolitan counties) and no information on strata (based on region, place, and race/ethnicity). This means that we can go only part way to adjusting for clustering in the GSS sample design. Here are the Stata commands that will accomplish this using the GSS PSU variable, SAMPCODE:

    svyset sampcode [pweight=adults]

or

    svyset sampcode [pweight=hhwt]

Note that this command also adjusts for differential household size.
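The household-size adjustment is simple arithmetic, and it can be useful to check it outside Stata. The following is a minimal Python sketch, assuming a made-up list of adult counts and outcome values (the data and the function names are illustrative and are not part of the GSS). It builds an HHWT-style weight and confirms that a weighted mean is the same whether the raw adult count or the normalized weight is used, which is why ADULTS alone can serve as a probability weight.

```python
def household_weights(adults):
    """Return adults / mean(adults) for each respondent (the HHWT construction)."""
    mean_adults = sum(adults) / len(adults)
    return [a / mean_adults for a in adults]


def weighted_mean(values, weights):
    """Weighted mean; unchanged if every weight is multiplied by a constant."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: number of adults in each respondent's household
# and some outcome measured on the respondent.
adults = [1, 2, 2, 3, 2]
outcome = [10.0, 12.0, 8.0, 15.0, 11.0]

hhwt = household_weights(adults)
mean_hhwt = sum(hhwt) / len(hhwt)            # normalized weights average to 1
est_hhwt = weighted_mean(outcome, hhwt)      # weighting by the HHWT-style weight
est_adults = weighted_mean(outcome, adults)  # weighting by the raw adult count
```

The two estimates agree because the normalized weight is just the adult count divided by a constant, and a constant rescaling of the weights cancels out of a weighted mean.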
The 1982 and 1987 Surveys with Oversamples of Blacks

If you are analyzing the 1982 or 1987 surveys and want to compute descriptive statistics, you need to adjust for the fact that Blacks were oversampled. To adjust for both the oversample of Blacks and differential household size, create a new weight variable that is the product of the OVERSAMP weight variable provided by the GSS and the weight variable you constructed to correct for differential household size, that is,

    gen newwt = hhwt*oversamp

(Note that the mean of this new variable is 1.0.) Then set your data for survey analysis:

    svyset sampcode [pweight=newwt]
The 2004 and 2006 Surveys
new sampling procedurethat exploits IiE In the 2004 GSS,NORC introduced a radically Ly tne U S' tostat Service' which in lffr4 availability of a list of addresses For areas covered br dr Jntuin"a io filoit"tt"uttuigh ?99J). covered Tpercent of househorot to small areasin essen'a p".iJ S"*i* fit,' it was possibleto go directly from PSUs s""onO innovation was an aggressiveeficm from the PSU to th" t"rtiy tupiingonit 'q' to respondents.The secondinnovaDi! to convert a random half of initiar noiuespondents data were weighted to make them represenrF necessitaFda changein the way the GS3 & had rc be weighted by twice the weight of tive of the populationtfre convertedcases I44SSadiustsboth ,ir onty t uti."." tollowed up. The variable ;;G'"ilJ;r" size' thisandfor differentialhousehold r* data' this variab9 i: nan€d ]1T:::: Note that in the originut u""'on ot th" 2004 variableappearstwrce' F i; the 19722006cumulativeflle' this I. ,h; ;ill;";td p't for all years; for 2004 and 2006 the !1TSSfor years 2004 and 2006 and as I4TSSALL files earlier vou a: *::i1l:l:h" ur" iA.nti"ul. Thus, depending on which ;;; othernan (orwhatever
ffi;;:;ilil;;;
orwrsstoNEI{I4T to nn*n" twssn'o+
to have a comparableweight variable :ir you give to your consmlcrco w"igftit*i*f"l utt tiflrt;o or poole
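When files from different years are stacked, the renaming described here amounts to choosing one weight per respondent according to survey year. A minimal Python sketch of that selection rule follows, under stated assumptions: the function name comparable_weight and its arguments are hypothetical, and the rule simply restates this appendix (the household weight through 2002, the household weight times OVERSAMP in 1982 and 1987, WTSS in 2004 and 2006). It is not taken from the GSS documentation.

```python
def comparable_weight(year, hhwt=None, oversamp=1.0, wtss=None):
    """Pick a single weight so that pooled files share one weight column.

    Assumption (illustrative only): callers supply hhwt and oversamp for
    surveys through 2002 and wtss for the 2004 and 2006 surveys.
    """
    if year in (2004, 2006):
        # WTSS already reflects household size and nonrespondent follow-up.
        if wtss is None:
            raise ValueError("wtss is required for 2004 and 2006")
        return wtss
    if hhwt is None:
        raise ValueError("hhwt is required for surveys before 2004")
    if year in (1982, 1987):
        # Years with an oversample of Blacks: household weight times OVERSAMP.
        return hhwt * oversamp
    return hhwt


# Hypothetical respondents from three different survey years.
w_1982 = comparable_weight(1982, hhwt=1.5, oversamp=0.5)
w_1990 = comparable_weight(1990, hhwt=0.5, oversamp=0.5)
w_2004 = comparable_weight(2004, wtss=2.0)
```

Keeping the rule in one place makes it easy to see which years rely on a constructed weight and which rely on a weight supplied with the data.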
The FORMWT Variable
somequestionswere askedonly of a sur In someyears(1978, 1980,and 19821985)' was to administer the questionsto a rand('= t"pi" # t".pttOents Althoughthe intent : (Sittt and Petersol 1986)'Thus' the GSS offer' subset,this was not alwayst"uti'"J h: is that you not use this variable corection weight,FORMWT'Uy i"to*tn"naution
with the Generalsocialsurvey 415 SurveyEstimation do multiple imputation (see Chapter Eight) to create a complete data set that all respondents.
SURVEYS FROM MORE THAN ONE YEAR

If you want to pool surveys from more than one year, it is reasonable to treat YEAR as the stratum variable because the surveys from each year are independent and YEAR is fixed. The Stata command to accomplish this is

    svyset sampcode [pweight=newwt], strata(year)
REFERENCES

Abadie, Alberto, and Javier Gardeazabal. 2003. The economic costs of conflict: A case study of the Basque country. American Economic Review 93(1):113-132.
Abadie, Alberto, David Drukker, Jane Leber Herr, and Guido W. Imbens. 2004. Implementing matching estimators for average treatment effects in Stata. Stata Journal 4(3):290-311.
Agresti, Alan. 2002. Categorical data analysis. 2nd ed. New York: Wiley-Interscience.
Allison, Paul D. 2001. Missing data. Sage university papers series on quantitative applications in the social sciences. Thousand Oaks, CA: Sage.
Allison, Paul D. 2005. Fixed effects regression methods for longitudinal data using SAS. Cary, NC: SAS Institute Inc.
Almond, Douglas. 2006. Is the 1918 influenza pandemic over? Long-term