Table of Contents

Preface
Acknowledgments
Ch. 1 Statistical Methods and Linguistics
Ch. 2 Qualitative and Quantitative Models of Speech Translation
Ch. 3 Study and Implementation of Combined Techniques for Automatic Extraction of Terminology
Ch. 4 Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to a Statistical Word Grouping System
Ch. 5 The Automatic Construction of a Symbolic Parser via Statistical Techniques
Ch. 6 Combining Linguistic with Statistical Methods in Automatic Speech Understanding
Ch. 7 Exploring the Nature of Transformation-Based Learning
Ch. 8 Recovering from Parser Failures: A Hybrid Statistical and Symbolic Approach
Contributors
Index
Preface
The chapters in this book come out of a workshop held at the 32nd Annual Meeting of the Association for Computational Linguistics, at New Mexico State University in Las Cruces, New Mexico, on 1 July 1994. The purpose of the workshop was to provide a forum in which to explore combined symbolic and statistical approaches in computational linguistics.

To many researchers, the mere notion of combining approaches to the study of language seems anathema. Indeed, in the past it appeared necessary to choose between two radically different research agendas, studying two essentially different kinds of data. On the one hand, we find cognitively motivated theories of language in the tradition of generative linguistics, with introspective data as primary evidence. On the other, we find approaches motivated by empirical coverage, with collections of naturally occurring data as primary evidence. Each approach has its own kinds of theory, methodology, and criteria for success.

Although underlying philosophical differences go back much further, the genesis of generative grammar in the late 1950s and early 1960s drew attention to the issues of concern in this book. At that time, there was a thriving quantitative linguistics community, in both the United States and Europe, that had originated following World War II in the surge of development of sophisticated quantitative approaches to scientific problems [Shannon and Weaver, 1949]. These quantitative approaches were built on the foundation of observable data as the primary source of evidence. The appearance of generative grammar [Chomsky, 1957], with its emphasis on intuitive grammaticality judgments, led to confrontation with the existing quantitative approaches, and the rift between the two communities, arising from firmly held opinions on both sides, prevented productive interaction. Computational approaches to language grew up during this feud, with much of computational linguistics dominated by the theoretical perspective of generative grammar, hostile to quantitative methods,
and much of the speech community dominated by statistical information theory, hostile to theoretical linguistics.

Although a few natural language processing (NLP) groups persisted in taking a probabilistic approach in the 1970s and 1980s, the rule-governed, theory-driven approach dominated the field, even among the many industrial teams working on NLP (e.g., [Woods and Kaplan, 1971; Petrick, 1971; Grosz, 1983]). The influence of the linguists' generative revolution on NLP projects was overwhelming. Statistical or even simply quantitative notions survived in this environment only as secondary considerations, included for the purpose of optimization but rarely thought of as an integral part of a system's core design. At the same time, speech processing grew more mature, building on an information-theoretic tradition that emphasized the induction of statistical models from training data (e.g., [Bahl et al., 1983; Flanagan, 1972]).

For quite some time, the two communities continued with little to say to each other. However, in the late 1980s and early 1990s, the field of NLP underwent a radical shift. Fueled partly by the agenda of the Defense Advanced Research Projects Agency (DARPA), a major source of American funding for both speech and natural language processing, and partly by the dramatic increase worldwide in the availability of electronic texts, the two communities found themselves in close contact. The result for computational linguists was that long-standing problems in their domain - for example, identifying the syntactic category of the words in a sentence, or resolving prepositional phrase ambiguity in parsing - were tackled using the same sorts of statistical methods prevalent in speech recognition, often with some success. The specific techniques varied, but all were founded upon the idea of inducing the knowledge necessary to solve a problem by statistically analyzing large corpora of naturally occurring text, rather than building in such knowledge in the form of symbolic rules.

Initially, the interest in corpus-based statistical methods rekindled all the old controversies - rationalist vs. empiricist philosophies, theory-driven vs. data-driven methodologies, symbolic vs. statistical techniques (e.g., see discussion in [Church and Mercer, 1993]). The Balancing Act workshop we held in 1994 was planned when the rhetoric was at its height, at a time when it seemed to us that, even if some people were working on common ground, not enough people were talking about it. The field of computational linguistics is now settling down somewhat: for the most part, researchers have become less unwaveringly adversarial over ideological questions, and have instead begun to focus on the search for a coherent combination of approaches.

Why have things changed? First, there is an increasing realization, within each community, that achieving core goals may require expertise possessed by
the other. Quantitative approaches add robustness and coverage to traditionally brittle and narrow symbolic natural language systems, permitting, for example, the automated or semiautomated acquisition of lexical knowledge (e.g., terminology, names, translation equivalents). At the same time, quantitative approaches are critically dependent on underlying assumptions about the nature of the data, and more people are concluding that pushing applications to the next level of performance will require quantitative models that are linguistically better informed; inductive statistical methods perform better in the face of limited data when they are biased with accurate prior knowledge.

A second source of change is the critical computational resources not widely available when quantitative methods were last in vogue. Fast computers, cheap disk space, CD-ROMs for distributing data, and funded data-collection initiatives have become the rule rather than the exception. The Brown Corpus of American English, Francis and Kucera's landmark project of the 1960s [Kucera and Francis, 1967], now has companions that are larger, that are annotated in more linguistic detail, and that consist of data from multiple languages (e.g., [ICAME, 1996; LDC, 1996]).

Third, there is a general push toward applications that work with language in a broad, real-world context, rather than within the narrow domains of traditional symbolic NLP systems. With the advent of such broad-coverage applications, language technology is positioned to help satisfy some real demands in the marketplace: large-vocabulary speech recognition has become a daily part of life for many people unable to use a computer keyboard [Wilpon, 1994], rough automatic translation of unrestricted text is finding its way into on-line services, and locating full-text information on the World Wide Web has become a priority [Foley and Pitkow, 1994]. Applications of this kind are faced with unpredictable input from users who are unfamiliar with the technology and its limitations, which makes their tasks harder; on the other hand, users are adjusting to less than perfect results. All these considerations - coverage, robustness, acceptability of graded performance - call out for systems that take advantage of large-scale quantitative methods.

Finally, the resurgence of interest in methods grounded in empiricism is partly the result of an intellectual pendulum swinging back in that direction. Thus, even independent of applications, we see more of a focus, on the scientific side of computational linguistics, on the properties of naturally occurring data, the objective and quantitative evaluation of hypotheses against such data, and the construction of models that explicitly take variability and uncertainty into account. Developments of this kind also have parallels in related areas such as sociolinguistics [Sankoff, 1978] and psycholinguistics; in the latter
field, for example, probabilistic models of on-line sentence processing treat frequency effects, and more generally weighted probabilistic interactions, as fundamental to the description of on-line performance, in much the same way that empiricists see the probabilistic nature of language as fundamental to its description (e.g., [Tabossi et al., 1992]; also see [Ferstl, 1993]). This book focuses on the trend toward empirical methods as it bears on the engineering side of NLP, but we believe that trend will also continue to have important implications for the study of language as a whole.

The premise of this book is that there is no necessity for a polar division. Indeed, one of our goals in this book is to change that perception. We hold that there is in fact no contradiction or defection involved in combining approaches. Rather, combining "symbolic" and "statistical" approaches to language is a kind of balancing act in which the symbolic and the statistical are properly thought of as parts, both essential, of a unified whole.

The complementary nature of the contribution of these seemingly discrepant approaches is not as contradictory as it seems. An obvious fact that is often forgotten is that every use of statistics is based upon a symbolic model. No matter what the application, statistics are founded upon an underlying probability model, and that model is, at its core, symbolic and algebraic rather than continuous and quantitative. For language, in particular, the natural units of manipulation in any statistical model are discrete constructs such as phoneme, morpheme, word, and so forth, as well as discrete relationships among these constructs such as surface adjacency or predicate-argument relationships. Regardless of the details of the model, the numerical probabilities are simply meaningless except in the context of the model's symbolic underpinnings. On this view, there is no such thing as a purely "statistical" method. Even hidden Markov models, the exemplar of statistical methods inherited from the speech community, are based upon an algebraic description of language that amounts to an assumption of finite-state generative capacity. Conversely, symbolic underpinnings alone are not enough to capture the variability inherent in naturally occurring linguistic data, its resistance to inflexible, terse characterizations. In short, the essence of the balancing act can be found in the opening chapters of any elementary text on probability theory: the core, symbolic underpinnings of a probability model reflect those constraints and assumptions that must be built in, and form the basis for a quantitative side that reflects uncertainty, variability, and gradedness of preferences.

The aim of this book is to explore the balancing act that must take place when symbolic and statistical approaches are brought together - it contains foreshadowings of powerful partnerships in the making between the more
linguistically motivated approaches within the tradition of generative grammar and the more empirically driven approaches from the tradition of information theory. Research of this kind requires basic choices: What knowledge will be represented symbolically and how will it be obtained? What assumptions underlie the statistical model? What principles motivate the symbolic model? What is the researcher gaining by combining approaches? These questions, and the metaphor of the balancing act, provide a unifying theme to the contributions in this volume.

References
L. R. Bahl, F. Jelinek, and R. L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5:179-190.

Noam Chomsky. 1957. Syntactic Structures. The Hague, Mouton.

Kenneth W. Church and Robert Mercer. 1993. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1-24.

James L. Flanagan. 1972. Speech Analysis, Synthesis and Perception, 2nd edition. New York, Springer-Verlag.

Evelyn Ferstl. 1993. The role of lexical information and discourse context in syntactic processing: a review of psycholinguistic studies. Cognitive Science Technical Report 93-03, University of Colorado at Boulder.

Jim Foley and James Pitkow, editors. 1994. Research Priorities for the World-Wide Web: Report of the NSF workshop sponsored by the Information, Robotics, and Intelligent Systems Division. National Science Foundation, October 1994.

Barbara Grosz. 1983. TEAM: a transportable natural language interface system. In Proceedings of the Conference on Applied Natural Language Processing. Association for Computational Linguistics, Morristown, N.J., February 1983.

ICAME. 1996. ICAME corpus collection. World Wide Web page. http://nora.hd.uib.no/corpora.html.

H. Kucera and W. Francis. 1967. Computational Analysis of Present-Day American English. Providence, R.I., Brown University Press.

LDC. 1996. Linguistic Data Consortium (LDC) home page. World Wide Web page, June 1996. http://www.cis.upenn.edu/~ldc/.

Stanley R. Petrick. 1971. Transformational analysis. In Randall Rustin, editor, Natural Language Processing. New York, Algorithmics Press.

David Sankoff. 1978. Linguistic Variation: Models and Methods. New York, Academic Press.

Claude E. Shannon and Warren Weaver. 1949. The Mathematical Theory of Communication. Urbana, University of Illinois Press.
Patrizia Tabossi, Michael Spivey-Knowlton, Ken McRae, and Michael Tanenhaus. 1992. Semantic effects on syntactic ambiguity resolution: evidence for a constraint-based resolution process. Attention and Performance, 15:598-615.

Jay G. Wilpon. 1994. Applications of voice-processing technology in telecommunications. In David B. Roe and Jay G. Wilpon, editors, Voice Communications Between Humans and Machines. Washington, D.C., National Academy of Sciences, National Academy Press.

W. A. Woods and R. Kaplan. 1971. The lunar sciences natural language information system. Technical Report 2265. Cambridge, Mass., Bolt, Beranek, and Newman.
Acknowledgments
Since time is the one immaterial object which we cannot influence - neither speed up nor slow down, add to nor diminish - it is an imponderably valuable gift.
- Maya Angelou, Wouldn't Take Nothing for My Journey Now

Many people have given of their time in the preparation of this book, from its first stages as a workshop to its final stages of publication. The first set of people to thank are those who submitted papers to the original workshop. We had an overwhelming number of very high-quality papers submitted, and regretted not being able to run a longer workshop on the topic. The issue of combining approaches is clearly part of the research agenda of many computational linguists, as demonstrated by these submissions. Those authors who have chapters in this book have revised them several times in response to two rounds of reviews, and we are grateful to them for their efforts. Our anonymous reviewers for these chapters gave generously of their time - each article was reviewed several times, and each reviewer did a careful and thorough job - and although they are anonymous to the outside world, their generosity is known to themselves, and we thank them quietly. We also deeply acknowledge the time Amy Pierce of The MIT Press gave to us at many stages along the way. Her vision and insight have been a gift, and she has contributed to ensuring the excellent quality of the final set of chapters in the book.

We are grateful to the Association for Computational Linguistics (ACL) for its support and supportiveness, especially its financial commitment to the Balancing Act workshop - and doubly pleased that the attendance at the workshop helped us give something tangible back to the organization. The ACL 1994 conference and post-conference workshops were held at New Mexico State University, with local arrangements handled efficiently and graciously by Janyce Wiebe; we thank her for her time and generosity of spirit. We would also like to thank Sun Microsystems Laboratories for its supportiveness
throughout the process of organizing the workshop and putting together this book, especially Cookie Callahan of Sun Labs. Finally, we thank Richard Sproat and Evelyne Tzoukermann for their valuable discussions.

We have enjoyed the experience of working on this topic, since the challenge of combining ways of looking at the world intrigued each of us independently before we joined forces to run the workshop and subsequently edit this book. One of us has training in theoretical linguistics, and has slowly been converted to an understanding of the role of performance data even in linguistic theory. The other had formal education in computer science, with healthy doses of linguistics and psychology, and so arrived in the field with both points of view. We are finding that more and more of our colleagues are coming to believe that maintaining a balance among several approaches to language analysis and understanding is an act worth pursuing.
Chapter 1
Statistical Methods and Linguistics
Steven Abney
In general, the view of the linguist toward the use of statistical methods was shaped by the division that took place in the late 1950s and early 1960s between the language engineering community (e.g., [Yngve, 1954]) and the linguistics community (e.g., [Chomsky, 1964]). When Chomsky outlined the three levels of adequacy - observational, descriptive, and explanatory - much of what was in progress in the computational community of the time was labeled as either observational or descriptive, with relatively little or no impact on the goal of producing an explanatorily adequate theory of language. The computational linguist was said to deal just with performance, while the goal of linguistics is to understand competence. This point of view was highly influential then and persists to this day as a set of a priori assumptions about the nature of computational work on language.

Abney's chapter revisits and challenges these assumptions, with the goal of illustrating to the linguist what the rationale might be for the computational linguist in pursuing statistical analyses. He reviews several key areas of linguistics, specifically language acquisition, language change, and language variation, showing how statistical models reveal essential data for theory building and testing. Although these areas have typically used statistical modeling, Abney goes further by addressing the central figure of generative grammar: the adult monolingual speaker. He argues that statistical methods are of great interest even to the theoretical linguist, because the issues they bear on are in fact linguistic issues, basic to an understanding of human language. Finally, Abney defends the provocative position that a weighted grammar is the correct model for explanation of several central questions in linguistics, such as the nature of parameter setting and degrees of grammaticality. - Eds.
In the space of the last 10 years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. In 1996, no one can profess to be a computational linguist without a passing knowledge of statistical methods. HMMs are as de rigueur as LR tables, and anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL banquet.

More seriously, statistical techniques have brought significant advances in broad-coverage language processing. Statistical methods have made real progress possible on a number of issues that had previously stymied attempts to liberate systems from toy domains: issues that include disambiguation, error correction, and the induction of the sheer volume of information requisite for handling unrestricted text. And the sense of progress has generated a great deal of enthusiasm for statistical methods in computational linguistics.

However, this enthusiasm has not been catching in linguistics proper. It is always dangerous to generalize about linguists, but I think it is fair to say that most linguists are either unaware of (and unconcerned about) trends in computational linguistics, or hostile to current developments. The gulf in basic assumptions is simply too wide, with the result that research on the other side can only seem naive, ill-conceived, and a complete waste of time and money.

In part the difference is a difference of goals. A large part of computational linguistics focuses on practical applications, and is little concerned with human language processing. Nonetheless, at least some computational linguists aim to advance our scientific understanding of the human language faculty by better understanding the computational properties of language. One of the most interesting and challenging questions about human language computation is just how people are able to deal so effortlessly with the very issues that make processing unrestricted text so difficult. Statistical methods provide the most promising current answers, and as a result the excitement about statistical methods is also shared by those in the cognitive reaches of computational linguistics.

In this chapter, I would like to communicate some of that excitement to fellow linguists, or at least, perhaps, to make it comprehensible. There is no denying that there is a culture clash between theoretical and computational linguistics that serves to reinforce mutual prejudices. In caricature, computational linguists believe that by throwing more cycles and more raw text into their statistical black box, they can dispense with linguists altogether, along with their fanciful Rube Goldberg theories about exotic linguistic phenomena. The linguist objects that, even if those black boxes make you oodles of money on speech recognizers and machine-translation programs (which they do not), they fail to advance our understanding. I will try to explain how statistical methods just might contribute to understanding of the sort that linguists are after.
This paper, then, is essentially an apology, in the old sense of apology. I wish to explain why we would do such a thing as to use statistical methods, and why they are not really such a bad thing, maybe not even for linguistics proper.
I think the most compelling, though least well-developed, arguments for statistical methods in linguistics come from the areas of language acquisition, language variation, and language change.

Language Acquisition Under standard assumptions about the grammar, we would expect the course of language development to be characterized by abrupt changes, each time the child learns or alters a rule or parameter of the grammar. If, as seems to be the case, changes in child grammar are actually reflected in changes in relative frequencies of structures that extend over months or more, it is hard to avoid the conclusion that the child has a probabilistic or weighted grammar in some form. The form that would perhaps be least offensive to mainstream sensibilities is a grammar in which the child "tries out" rules for a time. During the trial period, both the new and old versions of a rule coexist, and the probability of using one or the other changes with time, until the probability of using the old rule finally drops to zero. At any given point, in this picture, a child's grammar is a stochastic (i.e., probabilistic) grammar.

An aspect of this little illustration that bears emphasizing is that the probabilities are added to a grammar of the usual sort. A large part of what is meant by "statistical methods" in computational linguistics is the study of stochastic grammars of this form: grammars obtained by adding probabilities in a fairly transparent way to "algebraic" (i.e., nonprobabilistic) grammars. Stochastic grammars of this sort do not constitute a rejection of the underlying algebraic grammars, but a supplementation. This is quite different from some uses to which statistical models (most prominently, neural networks) are put, in which attempts are made to model some approximation of linguistic behavior with an undifferentiated network, with the result that it is difficult or impossible to relate the network's behavior to a linguistic understanding of the sort embodied in an algebraic grammar. (It should, however, be pointed out that the problem with such applications does not lie with neural nets, but with the unenlightening way they are sometimes put to use.)

Language Change Similar comments apply, on a larger scale, to language change. If the units of change are as algebraic grammars lead us to expect - rules or parameters or the like - we would expect abrupt changes. We might
" " expectsomeoneto go down to the local pub oneevening, order Ale!, andbe servedan eel instead, becausethe Great Vowel Shift happenedto him a day too early.1 In fact, linguistic changesthat are attributed to rule changesor changes of parameter settings take place gradually, over considerable stretches of time measuredin decadesor centuries. It is more realistic to assumethat the languageof a speechcommunity is a stochasticcompositeof the languagesof the individual speakers , describedby a stochasticgrammar. " " In the stochastic community grammar, the probability of a given construction reflects the relative proportion of speakerswho use the constructionin question. Languagechangeconsistsin shifts in relative frequency of construction (rules, parametersettings, etc.) in the community. If we think of speechcommunities as populations of grammarsthat vary within certain bounds, and if we think of languagechangeas involving gradualshifts in the center of balanceof the grammarpopulation, then statisticalmodels are of immediateapplicability [Tabor, 1994] . In this picture, we might still continueto assumethat an adult monolingual es a particular algebraicgrammar, and that stochasticgrammars speakerpossess areonly relevantfor the descriptionof communitiesof varyinggrammars. However, we must at leastmake allowancefor the fact that individuals routinely comprehendthe languageof their community, with all its variance. This rather suggeststhat at leastthe grammarusedin languagecomprehensionis stochastic . I returnto this issuebelow.
Language Variation There are two senses of language variation I have in mind here: dialectology, on the one hand, and typology, on the other. It has been suggested that some languages consist of a collection of dialects that blend smoothly one into the other, to the point that the dialects are more or less arbitrary points in a continuum. For example, Tait describes Inuit as "a fairly unbroken chain of dialects, with mutual intelligibility limited to proximity of contact, the furthest extremes of the continuum being unintelligible to each other" [Tait, 1994, p. 3]. To describe the distribution of Latin-American native languages, Kaufman defines a language complex as "a geographically continuous zone that contains linguistic diversity greater than that found within a single language ..., but where internal linguistic boundaries similar to those that separate clearly discrete languages are lacking" [Kaufman, 1994, p. 31]. The continuousness of changes with geographic distance is consistent with the picture of a speech community with grammatical variance, as sketched above. With geographic distance,
the mix of frequency of usage of various constructions changes, and a stochastic grammar of some sort is an appropriate model [Kessler, 1995].

Similar comments apply in the area of typology, with a twist. Many of the universals of language that have been identified are statistical rather than absolute, including rough statements about the probability distribution of language features ("head-initial and head-final languages are about equally frequent") or conditional probability distributions ("postpositions in verb-initial languages are more common than prepositions in verb-final languages") [Hawkins, 1983, 1990]. There is as yet no model of how this probability distribution comes about, that is, how it arises from the statistical properties of language change. Which aspects of the distribution are stable, and which would be different if we took a sample of the world's languages 10,000 years ago or 10,000 years hence? There is now a vast body of mathematical work on stochastic processes and the dynamics of complex systems (which includes, but is not exhausted by, work on neural nets), much of which is of immediate relevance to these questions.

In short, it is plausible to think of all of these issues - language acquisition, language change, and language variation - in terms of populations of grammars, whether those populations consist of grammars of different speakers or sets of hypotheses a language learner entertains. When we examine populations of grammars varying within bounds, it is natural to expect statistical models to provide useful tools.
2 Adult Monolingual Speakers

But what about an adult monolingual speaker? Ever since Chomsky, linguistics has been firmly committed to the idealization to an adult monolingual speaker in a homogeneous speech community. Do statistical models have anything to say about language under that idealization?

In a narrow sense, I think the answer is probably not. Statistical methods bear mostly on issues that are outside the scope of interest of current mainstream linguistics. In a broader sense, though, I think that says more about the narrowness of the current scope of interest than about the linguistic importance of statistical methods. Statistical methods are of great linguistic interest because the issues they bear on are linguistic issues, and essential to an understanding of what human language is and what makes it tick. We must not forget that the idealizations that Chomsky made were an expedient, a way of managing the vastness of our ignorance. One aspect of language is its algebraic properties, but that is only one aspect of language, and certainly not the only important
aspect. Also important are the statistical properties of language communities. And stochastic models are also essential for understanding language production and comprehension, particularly in the presence of variation and noise. (I focus here on comprehension, though considerations of language production have also provided an important impetus for statistical methods in computational linguistics [Smadja, 1989, 1991].)

To a significant degree, I think linguistics has lost sight of its original goal, and turned Chomsky's expedient into an end in itself. Current theoretical syntax gives a systematic account of a very narrow class of data, judgments about the well-formedness of sentences for which the intended structure is specified, where the judgments are adjusted to eliminate gradations of goodness and other complications. Linguistic data other than structure judgments are classified as "performance" data, and the adjustments that are performed on structure-judgment data are deemed to be corrections for "performance effects." Performance is considered the domain of psychologists, or at least, not of concern to linguists.

The term performance suggests that the things that the standard theory abstracts away from or ignores are a natural class; they are data that bear on language processing but not language structure. But in fact a good deal that is labeled "performance" is not computational in any essential way. It is more accurate to consider performance to be negatively defined: it is whatever the grammar does not account for. It includes genuinely computational issues, but a good deal more that is not. One issue I would like to discuss in some detail is the issue of grammaticality and ambiguity judgments about sentences as opposed to structures. These judgments are no more or less computational than judgments about structures, but it is difficult to give a good account of them with grammars of the usual sort; they seem to call for stochastic, or at least weighted, grammars.

2.1 Grammaticality and Ambiguity

Consider the following:

(1) a. The a are of I
    b. The cows are grazing in the meadow
    c. John saw Mary

The question is the status of these examples with respect to grammaticality and ambiguity. The judgments here, I think, are crystal clear: (1a) is word salad, and (1b) and (c) are unambiguous sentences.

In point of fact, (1a) is a grammatical noun phrase, and (1b) and (c) are ambiguous, the nonobvious reading being as a noun phrase. Consider: an are is a measure of area, as in a hectare is a hundred ares, and letters of the alphabet
" may be usedasnounsin English( Writtenon the sheetwasa singlelowercase " " a, As describedin section2, paragraph b . . ." ). Thus ( 1a) hasa structurein which are andI are headnouns, anda is a modifier of are. This analysiseven becomesperfectlynaturalin the following scenario.Imaginewe aresurveyors, and that we havemappedout a pieceof land into large segments , designated with capitalletters, andsubdividedinto one-are subsegments , designatedwith lowercaseletters. ThenThea are of I is a perfectlynaturaldescriptionfor a particular parcelon our map. As for ( 1b), are is againthe headnoun, cowsis a premodifier, andgrazingin the meadowis a postmodifier. It might be objectedthat plural nounscannotbe nominalpremodifiers, but in fact they often are: considerthe bondsmarket, a securitiesexchange , he is vicepresidentandmediadirector, an in-homehealth care servicesprovider, Hartford ' s claims division, the financial -services industry, its line of systemsmanagementsoftware. (Severalof theseexamples areextractedfrom the Wall StreetJournal.) It mayseemthatexamples( la ) and(b) areillustrativeonly of a trivial andartificial problemthat arisesbecauseof a rare usageof a commonword. But the " " problemis not trivial: withoutanaccountof rareusage, we haveno way of distinguishing betweengenuineambiguitiesandthesespuriousambiguities.Alternatively one , might objectthat if onedoesnot know that are hasa readingasa noun, thenare is actuallyunambiguousin one' s idiolect, and ( la ) is genuinely . But in thatcasethequestionbecomeswhy a hectareis a hundred ungrammatical aresis notjudgedequallyungrammaticalby speakersof theidiolectin question. Further, ( lc ) illustratesthattherareusageis not anessentialfeatureof examples (a) and (b). Sawhasa readingasa noun, which may be lessfrequentthan the verb reading, but is hardly a rareusage. Propernounscanmodify (Gatling gun) and be modified by (TyphoidMary) commonnouns. Hence, John saw Mary hasa readingas a noun phrase, referring to the Mary who is associated with a kind of sawcalleda Johnsaw. It may be objectedthat constructionslike Gatling gun and TyphoidMary belongto the lexicon, not thegrammar, but howeverthat maybe, they arecompletely productive. I may not know whatCohenequations,theRussiahouse, or ' Abney sentencesare, but if not, then the denotataof Cohens equations, the ' Russianhouse, or thosesentencesof Abney s are surely equally unfamiliar.2 2. There are also syntactic groundsfor doubt about the assumptionthat noun-noun modificationbelongsto the lexicon. Namely. adjectivescanintervenebetweenthemodifying noun and the headnoun. (Examplesare given later in this section.) If adjective modification belongsto the syntax, and if thereare no discontinuouswords or movement of piecesof lexical items, thenat leastsomemodificationof nounsby nounsmust tllke placein the syntax.
Likewise I may not know who pegleg Pete refers to, or riverboat Sally, but that does not make the constructions any less grammatical or productive.

The problem is epidemic, and it snowballs as sentences grow longer. One often hears in computational linguistics about completely unremarkable sentences with hundreds of parses, and that is in fact no exaggeration. Nor is it merely the undesired consequence of having a poor grammar. If one examines the analyses, one finds that they are generally extremely implausible, and often do considerable violence to "soft" constraints like heaviness constraints or the number and sequence of modifiers, but no one piece of the structure is outright ungrammatical.

To illustrate, consider this sentence, drawn more or less at random from a book (Quine's Word and Object) drawn more or less at random from my shelf:

(2) In a general way such speculation is epistemologically relevant, as suggesting how organisms maturing and evolving in the physical environment we know might conceivably end up discoursing of abstract objects as we do [Quine, 1960, p. 123].

One of the many spurious structures this sentence might receive is the following:

(3)
[Tree diagram (3): the original shows a full parse tree for the spurious analysis discussed below, with a PP-AdjP-Absolute sequence of sentential premodifiers (In a general way, epistemologically relevant, such speculation as suggesting how ...) and might analyzed as a head noun bearing stacked pre- and postmodifiers (conceivably end up, of abstract ...). The tree itself does not survive text extraction.]
There are any number of criticisms one can direct at this structure, but I believe none of them are fatal. It might be objected that the PP-AdjP-Absolute sequence of sentential premodifiers is illegitimate, but each is individually fine, and there is no hard limit on stacking them. One can even come up with relatively good examples with all three modifiers, e.g., [PP on the beach] [AdjP naked
as jaybirds] [Absolute waves lapping against the shore] the wild boys carried out their bizarre rituals.

Another point of potential criticism is the question of licensing the elided sentence after how. In fact its content could either be provided from preceding context or from the rest of the sentence, as in though as yet unable to explain how, astronomers now know that stars develop from specks of grit in giant oysters.

Might is taken here as a noun, as in might and right. The AP conceivably end up may be a bit mysterious: end up is here an adjectival, as in we turned the box end up. Abstract is unusual as a mass noun, but can in fact be used as one, as, for example, in the article consisted of three pages of abstract and only two pages of actual text.

One might object that the NP headed by might is bad because of the multiple postmodifiers, but in fact there is no absolute constraint against stacking nominal postmodifiers, and good examples can be constructed with the same structure: marlinespikes, business end up, sprinkled with tabasco sauce, can be a powerful deterrent against pigeons. Even the commas are not absolutely required. The strength of preference for them depends on how heavy the modifiers are: cf. strength judiciously applied increases the effectiveness of diplomacy, a cup of peanuts unshelled in the stock adds character.3

In short, the structure (3) seems to be best characterized as grammatical, though it violates any number of parsing preferences and is completely absurd. One might think that one could eliminate ambiguities by turning some of the dispreferences into absolute constraints. But attempting to eliminate unwanted readings that way is like squeezing a balloon: every dispreference that is turned into an absolute constraint to eliminate undesired structures has the unfortunate side effect of eliminating the desired structure for some other sentence. No matter how difficult it is to think up a plausible example that violates the constraint, some writer has probably already thought one up by accident, and we will improperly reject his sentence as ungrammatical if we turn the dispreference into an absolute constraint.

To illustrate: if a noun is premodified by both an adjective and another noun, standard grammars require the adjective to come first, inasmuch as the noun adjoins to the head N, but the adjective adjoins higher, to N-bar. It is not easy to think up good examples that violate this constraint. Perhaps the reader would care to try before reading the examples in the footnote.4

3. Cf. this passage from Tolkien: "Their clothes were mended as well as their bruises their tempers and their hopes. Their bags were filled with food and provisions light to carry but strong to bring them over the mountain passes." [Tolkien, 1966, p. 61]

4. Maunder climatic cycles, ice-core climatological records, a Kleene-star transitive closure, Precambrian era solar activity, highland igneous formations.
Not only is my absurd analysis (3) arguably grammatical, there are many, many equally absurd analyses to be found. For example, general could be a noun (the army officer) instead of an adjective, or evolving in could be analyzed as a particle verb, or the physical could be a noun phrase (a physical exam) - not to mention various attachment ambiguities for coordination and modifiers, giving a multiplicative effect. The consequence is considerable ambiguity for a sentence that is perceived to be completely unambiguous.

Now perhaps it seems I am being perverse, and I suppose I am. But it is a perversity that is implicit in grammatical descriptions of the usual sort, and it emerges unavoidably as soon as we systematically examine the structures that the grammar assigns to sentences. Either the grammar assigns too many structures to sentences like (2), or it incorrectly predicts that examples like three pages of abstract or a cup of peanuts unshelled in the stock have no well-formed structure.

To sum up, there is a problem with grammars of the usual sort: their predictions about grammaticality and ambiguity are simply not in accord with human perceptions. The problem of how to identify the correct structure from among the in-principle possible structures provides one of the central motivations for the use of weighted grammars in computational linguistics. A weight is assigned to each aspect of structure permitted by the grammar, and the weight of a particular analysis is the combined weight of the structural features that make it up. The analysis with the greatest weight is predicted to be the perceived analysis for a given sentence.

Before describing in more detail how weighted grammars contribute to a solution to the problem, though, let me address an even more urgent issue: is this even a linguistic problem?
2.2 Is This Linguistics?

Under the usual assumptions, the fact that the grammar predicts grammaticality and ambiguity where none is perceived is not a linguistic problem. The usual opinion is that perception is a matter of performance, and that grammaticality alone does not predict performance; we must also include nonlinguistic factors like plausibility and parsing preferences and maybe even probabilities.

Grammaticality and Acceptability The implication is that perceptions of grammaticality and ambiguity are not linguistic data, but performance data. This stance is a bit odd - are not grammaticality judgments perceptions? And what do we mean by "performance data"? It would be one thing if we were
talking about data that clearly have to do with the course of linguistic computation, data like response times and reading times, or regressive eye movement frequencies, or even more outlandish things like positron emission tomographic scans or early receptor potential traces. But human perceptions (judgments, intuitions) about grammaticality and ambiguity are classic linguistic data. What makes the judgments concerning examples (1a-c) performance data? All linguistic data are the result of little informal psycholinguistic experiments that linguists perform on themselves, and the experimental materials are questions of the form "Can you say this?" "Does this mean this?" "Is this ambiguous?" "Are these synonymous?"

Part of the answer is that the judgments about examples (1a-c) are judgments about sentences alone rather than about sentences with specified structures. The usual sort of linguistic judgment is a judgment about the goodness of a particular structure, and example sentences are only significant as bearers of the structure in question. If any choice of words and any choice of context can be found that makes for a good sentence, the structure is deemed to be good. The basic data are judgments about structured sentences in context - that is, sentences plus a specification of the intended structure and intended context - but these basic data are used only grouped in sets of structured contextualized sentences having the same (possibly partial) structure. Such a set is defined to be good just in case any structured contextualized sentence it contains is judged to be good. Hence a great deal of linguists' time is spent in trying to find some choice of words and some context to get a clear positive judgment, to show that a structure of interest is good.

As a result, there is actually no intent that the grammar predict - that is, generate - individual structured-sentence judgments. For a given structured sentence, the grammar only predicts whether there is some sentence with the same structure that is judged to be good. For the examples (1), then, we should say that the structure

[NP the [N a] [N are] [PP of [N I]]]

is indeed grammatical in the technical sense, since it is acceptable in at least one context, and since every piece of the structure is attested in acceptable sentences.

The grouping of data by structure is not the only way that standard grammars fail to predict acceptability and ambiguity judgments. Judgments are rather smoothly graded, but goodness according to the grammar is all or nothing. Discrepancies between grammar and data are ignored if they involve sentences containing center embedding, parsing preference violations,
" garden-patheffects, or in generalif their badnesscanbe ascribedto processing " S complexity. Grammar and Computation The differencebetweensttucturejudgments " data" in somesense andsbingjudgmentsis not thatthe formerare competence " " andthe latterare performancedata. Rather, the distinctionrestson a working assumptionabouthow the dataareto be explained, namely, that the dataarea result of the interactionof grammaticalconstraintswith computationalconstraints . Certainaspectsof thedataareassumedto bereflectionsof grammatical constraints , andeverythingelseis ascribedto failuresof the processorto translate grammaticalconstraintstransparentlyinto behavior, whetherbecauseof memorylimits or heuristicparsingstrategiesor whateverobscuremechanisms of judgments.We arejustified in ignoringthoseaspectsof the creategradedness . to the idiosyncraciesof the processor we ascribe that data . But this distinctiondoesnot hold up underscrutiny Dividing the humanlanguage capacityinto grammarand processoris only a mannerof speaking, a . It is naiveto expectthe of way dividing things up for theoreticalconvenience to to division correspond anymeaningfulphysiologlogical grammar/processor , onefunctioning ical division say, two physically separateneuronalassemblies es asa storeof grammarrules and the other as an activedevicethat access the grammar-rule store in the course of its operation. And even if we did , we have believein a physiologicaldivision betweengrammarand processor with not a distinction no evidenceat all to supportthat belief; it is any empirical content. A coupleof examplesmight clarify why I say that the grammar/processor . Grammarsand syntacticsttucdistinction is only for theoreticalconvenience , but turesareusedto describecomputerlanguagesaswell ashumanlanguages typical compilersdo not accessgrammarrules or consttuctparsetrees. At the level of descriptionof the operationof the compiler, grammar-rules andparse" treesexist only " virtually asabstractdescriptionsof the courseof the compu5. In addition, therearepropertiesof grammaticalityjudgmentsof a different sort that arenot beingmodeled, propertiesthat arepoorly understoodandsomewhatworrisome. arisenot infrequentlyamongjudges - it is moreoften the casethan not Disagreements that I disagreewith at leastsomeof thejudgmentsreportedin syntaxpapers,andI think seemto changewith changingtheoretical my experienceis not unusual. Judgments " : a sentencethat sounds" not too good whenoneexpectsit to be bad may assumptions ' " " . Andjudg in the if a sound not too bad grammarchangesone s expectations change mentschangewith exposure.Someconstructionsthat soundterrible on a first exposure improveconsiderably with time.
computation being performed. What is separately characterized as, say, grammar vs. parsing strategy at the logical level is completely intermingled at the level of compiler operation.

At the other extreme, the constraints that probably have the strongest computational flavor are the parsing strategies that are considered to underlie garden-path effects. But it is equally possible to characterize parsing preferences in grammatical terms. For example, the low attachment strategy can be characterized by assigning a cost to structures of the form [X_{i+1} X_i Y Z] proportional to the depth of the subtree Y. The optimal structure is the one with the least cost. Nothing depends on how trees are actually computed: the characterization is only in terms of the shapes of trees.

If we wish to make a distinction between competence and computation, an appropriate distinction is between what is computed and how it is computed. By this measure, most "performance" issues are not computational issues at all. Characterizing the perceptions of grammaticality and ambiguity described in the previous section does not necessarily involve any assumptions about the computations done during sentence perception. It only involves characterizing the set of structures that are perceived as belonging to a given sentence. That can be done, for example, by defining a weighted grammar that assigns costs to trees, and specifying a constant C such that only structures whose cost is within distance C of the best structure are predicted to be perceived. How the set thus defined is actually computed during perception is left completely open. We may think of competence vs. performance in terms of knowledge vs. computation, but that is merely a manner of speaking. What is really at issue is an idealization of linguistic data for the sake of simplicity.
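To make this kind of purely declarative characterization concrete, here is a minimal Python sketch (my own illustration, not part of the original text). It scores trees only by their shape, using a toy low-attachment penalty of the sort just described, and then picks out the structures whose cost falls within a constant C of the best; the tree encoding, the cost values, and the function names are all assumptions made for the example, and nothing is said about how the candidate parses are computed.

    # A minimal sketch, assuming a toy tree type: costs are defined over tree
    # shapes alone, and the "perceived" parses are those within C of the best.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Tree:
        label: str
        children: List["Tree"] = field(default_factory=list)

    def depth(t: Tree) -> int:
        """Depth of a subtree: a leaf has depth 0."""
        return 0 if not t.children else 1 + max(depth(c) for c in t.children)

    def low_attachment_cost(t: Tree) -> float:
        """Penalize configurations of the form [X' -> X Y Z] in proportion to the
        depth of Y, i.e. prefer attaching heavy material low rather than high."""
        cost = 0.0
        if len(t.children) >= 3:
            cost += depth(t.children[1])   # the Y subtree in the schema above
        return cost + sum(low_attachment_cost(c) for c in t.children)

    def perceived(parses: List[Tree], C: float) -> List[Tree]:
        """All candidate parses whose cost lies within C of the best parse's cost.
        How these parses would be computed during perception is left open."""
        costs = [low_attachment_cost(p) for p in parses]
        best = min(costs)
        return [p for p, c in zip(parses, costs) if c - best <= C]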
The Frictionless Plane, Autonomy and Isolation Appeal is often made to an analogy between competence and frictionless planes in mechanics. Syntacticians focus on the data that they believe to contain the fewest complicating factors, and "clean up" the data to remove what they believe to be remaining complications that obscure simple, general principles of language. That is proper and laudable, but it is important not to lose sight of the original problem, and not to mistake complexity for irrelevancy. The test of whether the simple principles we think we have found actually have explanatory power is how well they fare in making sense of the larger picture. There is always the danger that the simple principles we arrive at are artifacts of our data selection and data adjustment. For example, it is sometimes remarked how marvelous it is that a biological system like language should be so discrete and clean, but in fact there is abundant gradedness and variability in the original data; the evidence for the discreteness and cleanness of language seems to be mostly evidence we ourselves have planted.

It has long been emphasized that syntax is autonomous. The doctrine is older than Chomsky;6 for example, Tesniere [Tesniere, 1959, p. 42] writes "... la syntaxe n'a a chercher sa propre loi qu'en elle-meme. Elle est autonome" ("syntax has to seek its law only in itself; it is autonomous"). To illustrate that structure cannot be equated with meaning, he presents the sentence pair:

le signal vert indique la voie libre ("the green signal indicates the open track")
le silence vertebral indispose la voile licite ("the vertebral silence indisposes the licit sail")

The similarity to Chomsky's later but more famous pair

revolutionary new ideas appear infrequently
colorless green ideas sleep furiously

is striking. But autonomy is not the same as isolation. Syntax is autonomous in the sense that it cannot be reduced to semantics; well-formedness is not identical to meaningfulness. But syntax in the sense of an algebraic grammar is only one piece in an account of language, and it stands or falls on how well it fits into the larger picture.

6. The cited work was completed before Tesniere's death in 1954, though it was not published until 1959.
The Holy Grail The larger picture, and the ultimate goal of linguistics, is to describe language in the sense of that which is produced in language production, comprehended in language comprehension, acquired in language acquisition, varies, in aggregate, in language variation, and changes in language change.

I have always taken the Holy Grail of generative linguistics to be to characterize a class of models, each of which represents a particular (potential or actual) human language L, and characterizes a speaker of L by defining the class of sentences a speaker of L produces and the structures that a speaker of L perceives for sentences; in short, by predicting the linguistic data that characterize a speaker of L.

A "Turing test" for a generative model would be something like the following. If we use the model to generate sentences at random, the sentences produced are judged by humans to be clearly sentences of the language - to "sound natural." And in the other direction, if humans judge a sentence (or
nonsentence) to have a particular structure, the model should also assign precisely that structure to the sentence.
Natural languages are such that these tests cannot be passed by an unweighted grammar. An unweighted grammar distinguishes only between grammatical and ungrammatical structures, and that is not enough. "Sounding natural" is a matter of degree. What we must mean by "randomly generating natural-sounding sentences" is that sentences are weighted by the degree to which they sound natural, and we sample sentences with a probability that accords with their weight. Moreover, the structure that people assign to a sentence is the structure they judge to have been intended by the speaker, and that judgment is also a matter of degree. It is not enough for the grammar to define the set of structures that could possibly belong to the sentence; the grammar should predict which structures humans actually perceive, and what the relative weights are in cases where humans are uncertain about which structure the speaker intended.

The long and little of it is, weighted grammars (and other species of statistical methods) characterize language in such a way as to make sense of language production, comprehension, acquisition, variation, and change. These are linguistic issues, and not computational issues, a fact that is obscured by labeling everything "performance" that is not accounted for by algebraic grammars. What is really at stake with "competence" is a provisional simplifying assumption, or an expression of interest in certain subproblems of linguistics. There is certainly no indicting an expression of interest, but it is important not to lose sight of the larger picture.
3 How Statistics Helps

Accepting that there are divergences between theory and data - for example, the divergence between predicted and perceived ambiguity - and accepting that this is a linguistic problem, and that it is symptomatic of the incompleteness of standard grammars, how does adding weights or probabilities help make up the difference?

Disambiguation As already mentioned, the problem of identifying the correct parse - the parse that humans perceive - among the possible parses is a central application of stochastic grammars in computational linguistics. The problem of defining which analysis is correct is not a computational problem, however; the computational problem is describing an algorithm to compute the correct parse. There are a variety of approaches to the problem of defining the
correct parse. A stochastic context-free grammar provides a simple illustration. Consider the sentence John walks, and the grammar

(4)  1. S  -> NP V     .7
     2. S  -> NP       .3
     3. NP -> N        .8
     4. NP -> N N      .2
     5. N  -> John     .6
     6. N  -> walks    .4
     7. V  -> walks   1.0
According to grammar (4), John walks has two analyses, one as a sentence and one as a noun phrase. (The rule S -> NP represents an utterance consisting of a single noun phrase.) The numbers in the rightmost column represent the weights of rules. The weight of an analysis is the product of the weights of the rules used in its derivation. In the sentential analysis of John walks, the derivation consists of rules 1, 3, 5, 7, so the weight is (.7)(.8)(.6)(1.0) = .336. In the noun-phrase analysis, the rules 2, 4, 5, 6 are used, so the weight is (.3)(.2)(.6)(.4) = .0144. The weight for the sentential analysis is much greater, predicting that it is the one perceived.7 More refined predictions can be obtained by hypothesizing that an utterance is perceived as ambiguous if the next-best case is not too much worse than the best. If "not too much worse" is interpreted as a ratio of, say, not more than 2:1, we predict that John walks is perceived as unambiguous, as the ratio between the weights of the parses is 23:1.

7. The hypothesis that only the best structure (or possibly, structures) are perceptible is somewhat similar to current approaches to syntax in which grammaticality is defined as optimal satisfaction of constraints or maximal economy of derivation. But I will not hazard a guess here about whether that similarity is significant or mere happenstance.
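The arithmetic in this example is easy to reproduce mechanically. The short Python sketch below (my own illustration, not from the chapter) encodes the weights of grammar (4), multiplies the weights of the rules used in each derivation, and applies the 2:1 criterion for perceived ambiguity; the rule numbering follows (4).

    # A minimal sketch: scoring the two derivations of "John walks" under
    # grammar (4) and applying the 2:1 ambiguity criterion from the text.
    weights = {1: 0.7, 2: 0.3, 3: 0.8, 4: 0.2, 5: 0.6, 6: 0.4, 7: 1.0}

    derivations = {
        "sentence (S -> NP V)": [1, 3, 5, 7],   # John walks as a sentence
        "noun phrase (S -> NP)": [2, 4, 5, 6],  # John walks as a noun phrase
    }

    def weight(rules):
        w = 1.0
        for r in rules:
            w *= weights[r]
        return w

    scores = {name: weight(rules) for name, rules in derivations.items()}
    best, second = sorted(scores.values(), reverse=True)[:2]

    print(scores)                    # {'sentence ...': 0.336, 'noun phrase ...': 0.0144}
    print("ratio %.1f : 1" % (best / second))
    print("ambiguous" if best / second <= 2 else "perceived as unambiguous")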
Degrees of Grammaticality Gradations of acceptability are not accommodated in algebraic grammars: a structure is either grammatical or not. The idea of degrees of grammaticality has been entertained from time to time, and some classes of ungrammatical structures are informally considered to be "worse" than others (most notably, Empty Category Principle (ECP) violations vs. subjacency violations). But such degrees of grammaticality as have been considered have not been accorded a formal place in the theory. Empirically, acceptability judgments vary widely across sentences with a given structure, depending on lexical choices and other factors. Factors that cannot be reduced to a binary grammaticality distinction are either poorly modeled or ignored in standard syntactic accounts.

Degrees of grammaticality arise as uncertainty in answering the question "Can you say X?" or perhaps more accurately, "If you said X, would you feel you had made an error?" As such, they reflect degrees of error in speech production. The null hypothesis is that the same measure of goodness is used in both speech production and speech comprehension, though it is actually an open question. At any rate, the measure of goodness that is important for speech comprehension is not degree of grammaticality alone, but a global measure that combines degrees of grammaticality with at least naturalness and structural preference (i.e., "parsing strategies").

We must also distinguish degrees of grammaticality, and indeed global goodness, from the probability of producing a sentence. Measures of goodness and probability are mathematically similar enhancements to algebraic grammars, but goodness alone does not determine probability. For example, for an infinite language, probability must ultimately decrease with length, though arbitrarily long sentences may be perfectly good.

Perhaps one reason that degrees of grammaticality have not found a place in standard theory is the question of where the numbers come from, if we permit continuous degrees of grammaticality. The answer to where the numbers come from is parameter estimation. Parameter estimation is well understood for a number of models of interest, and can be seen psychologically as part of what goes on during language acquisition.

Naturalness It is a bit difficult to say precisely what I mean by naturalness. A large component is plausibility, but not plausibility in the sense of world knowledge; rather, plausibility in the sense of selectional preferences, that is, semantic sortal preferences that predicates place on their arguments. Another important component of naturalness is not semantic, though, but simply "how you say it." This is what has been called collocational knowledge, like the fact that one says strong tea and powerful car, but not vice versa [Smadja, 1991], or that you say thick accent in English, but starker Akzent ("strong accent") in German.

Though it is difficult to define just what naturalness is, it is not difficult to recognize it. If one generates text at random from an explicit grammar plus lexicon, the shortcomings of the grammar are immediately obvious in the unnatural, even if not ungrammatical, sentences that are produced. It is also clear that naturalness is not at all the same thing as meaningfulness. For example, I think it is clear that differential structure is more natural than differential child,
even though I could not say what a differential structure might be. Or consider the following examples that were in fact generated at random from a grammar:

(5) a. matter-like, complete, alleged strips
       a stratigraphic, dubious scattering
       a far alternative shallow model
    b. indirect photographic-drill sources
       earlier stratigraphically precise minimums
       Europe's cyclic existence

All these examples are about on a par as concerns meaningfulness, but I think the (b) examples are rather more natural than the (a) examples.

Collocations and selectional restrictions have been two important areas of application of statistical methods in computational linguistics. Questions of interest have been both how to include them in a global measure of goodness, and how to induce them distributionally [Resnik, 1993], both as a tool for investigations, and as a model of human learning.

Structural Preferences Structural preferences, or parsing strategies, have already been mentioned. A preference for the "longest match" is one example. The example

(6) The emergency crews hate most is domestic violence

is a garden path because of a strong preference for the longest initial NP, The emergency crews, rather than the correct alternative, The emergency. (The correct interpretation is: The emergency [that crews hate most] is domestic violence.) The longest-match preference plays an important role in the dispreference for the structure (3) that we examined earlier.

As already mentioned, these preferences can be seen as structural preferences, rather than parsing preferences. They interact with the other factors we have been examining in a global measure of goodness. For example, in (6), an even longer match, The emergency crews hate, is actually possible, but it violates the dispreference for having plural nouns as nominal modifiers.

Error Tolerance A remarkable property of human language comprehension is its error tolerance. Many sentences that an algebraic grammar would simply classify as ungrammatical are actually perceived to have a particular structure. A simple example is we sleeps, a sentence whose intended structure is obvious, albeit ungrammatical. In fact, an erroneous structure may actually be preferred to a grammatical analysis; consider
(7) Thanks for all you help.

which I believe is preferentially interpreted as an erroneous version of Thanks for all your help. However, there is a perfectly grammatical analysis: Thanks for all those who you help. We can make sense of this phenomenon by supposing that a range of error-correction operations are available, though their application imposes a certain cost. This cost is combined with the other factors we have discussed, to determine a global goodness, and the best analysis is chosen. In (7), the cost of error correction is apparently less than the cost of the alternative in unnaturalness or structural dispreference. Generally, error detection and correction are a major selling point for statistical methods. They were primary motivations for Shannon's noisy channel model [Shannon, 1948], which provides the foundation for many computational linguistic techniques.
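The noisy-channel view of this trade-off can be made concrete with a small sketch. The Python fragment below is purely illustrative: the candidate interpretations, their prior weights, and the channel probabilities are invented numbers, not estimates from data. It picks the interpretation that maximizes prior times channel probability, which is how the cost of an error correction gets weighed against the goodness of the corrected form:

  # Noisy-channel scoring: choose the intended form i maximizing
  # P(i) * P(observed | i).  All numbers are made up for illustration.
  observed = "thanks for all you help"

  candidates = {
      # intended form:                       (prior P(i), channel P(observed | i))
      "thanks for all your help":            (1e-6,  0.01),   # cheap one-word correction
      "thanks for all those who you help":   (1e-9,  1e-4),   # grammatical but unnatural
      "thanks for all you help":             (1e-10, 0.98),   # take the string at face value
  }

  def score(entry):
      prior, channel = entry
      return prior * channel

  for form, entry in sorted(candidates.items(), key=lambda kv: -score(kv[1])):
      print("%8.2e  %s" % (score(entry), form))
  best = max(candidates, key=lambda i: score(candidates[i]))
  print("preferred interpretation:", best)   # the corrected 'your help' reading wins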
Learning on the Fly Not only is the language that one is exposed to full of errors, it is produced by others whose grammars and lexica vary from one's own. Frequently, sentences that one encounters can only be analyzed by adding new constructions or lexical entries. For example, when the average person hears a hectare is a hundred ares, they deduce that are is a noun, and succeed in parsing the sentence. But there are limits to learning on the fly, just as there are limits to error correction. Learning on the fly does not help one parse the a are of I.

Learning on the fly can be treated much like error correction. The simplest approach is to admit a space of learning operations (for example, assigning a new part of speech to a word, adding a new subcategorization frame to a verb, etc.) and assign a cost to applications of the learning operations. In this way it is conceptually straightforward to include learning on the fly in a global optimization.

People are clearly capable of error correction and learning on the fly; they are highly desirable abilities given the noise and variance in the typical linguistic environment. They greatly exacerbate the problem of picking out the intended parse for a sentence, because they explode the candidate space even beyond the already large set of candidates that the grammar provides. To explain how it is nonetheless possible to identify the intended parse, there is no serious alternative to the use of weighted grammars.

Lexical Acquisition A final factor that exacerbates the problem of identifying the correct parse is the sheer richness of natural language grammars and lexica. A goal of earlier linguistic work, and one that is still a central goal of the linguistic work that goes on in computational linguistics, is to develop grammars that assign a reasonable syntactic structure to every sentence of English, or as nearly every sentence as possible. This is not a goal that is currently much in fashion in theoretical linguistics. Especially in Government-Binding theory (GB), the development of large fragments has long since been abandoned in favor of the pursuit of deep principles of grammar.

The scope of the problem of identifying the correct parse cannot be appreciated by examining behavior on small fragments, however deeply analyzed. Large fragments are not just small fragments several times over; there is a qualitative change when one begins studying large fragments. As the range of constructions that the grammar accommodates increases, the number of undesired parses for sentences increases dramatically.

In-breadth studies also give a different perspective on the problem of language acquisition. When one attempts to give a systematic account of phrase structure, it becomes clear just how many little facts there are that do not fall out from grand principles, but just have to be learned. The simple, general principles in these cases are not principles of syntax, but principles of acquisition. Examples are the complex constraints on sequencing of prenominal elements in English, or the syntax of date expressions (Monday June the 4th, Monday June 4, *Monday June the 4, *June 4 Monday), or the syntax of proper names (Greene County Sheriff's Deputy Jim Thurmond), or the syntax of numeral expressions.

The largest piece of what must be learned is the lexicon. If parameter-setting views of syntax acquisition are correct, then learning the syntax (which in this case does not include the low-level messy bits discussed in the previous paragraph) is actually almost trivial. The really hard job is learning the lexicon.

Acquisition of the lexicon is a primary area of application for distributional and statistical approaches to acquisition. Methods have been developed for the acquisition of parts of speech [Brill, 1993; Schütze, 1993], terminological noun compounds [Bourigault, 1992], collocations [Smadja, 1991], support verbs [Grefenstette, 1995], subcategorization frames [Brent, 1991; Manning, 1993], selectional restrictions [Resnik, 1993], and low-level phrase structure rules [Finch, 1993; Smith and Witten, 1993]. These distributional techniques do not so much compete with parameter setting as a model of acquisition, as much as complement it, by addressing issues that parameter-setting accounts pass over in silence. Distributional techniques are also not adequate alone as models of human acquisition; whatever the outcome of the syntactic vs. semantic bootstrapping debate, children clearly do make use of situations and meaning to learn language. But the effectiveness of distributional techniques indicates at least that they might account for a component of human language learning.
4 Objections
There are a couple of general objections to statistical methods that may be lurking in the backs of readers' minds that I would like to address. First is the sentiment that, however relevant and effective statistical methods may be, they are no more than an engineer's approximation, not part of a proper scientific theory. Second is the nagging doubt: did not Chomsky debunk all this ages ago?

4.1 Are Stochastic Models Only for Engineers?

One might admit that one can account for parsing preferences by a probabilistic model, but insist that a probabilistic model is at best an approximation, suitable for engineering but not for science. On this view, we do not need to talk about degrees of grammaticality, or preferences, or degrees of plausibility. Granted, humans perceive only one of the many legal structures for a given sentence, but the perception is completely deterministic. We need only give a proper account of all the factors affecting the judgment. Consider the example:
Yesterday three shots were fired at Humberto Calvados, personal assistant to the famous tenor Enrique Felicidad, who was in Paris attending to unspecified personal matters.

Suppose for argument's sake that 60% of readers take the tenor to be in Paris, and 40% take the assistant to be in Paris. Or more to the point, suppose a particular informant, John Smith, chooses the low attachment 60% of the time when encountering sentences with precisely this structure (in the absence of an informative context), and high attachment 40% of the time. One could still insist that no probabilistic decision is being made, but rather that there are lexical and semantic differences that we have inappropriately conflated across sentences with "precisely this structure," and if we take account of these other effects, we end up with a deterministic model after all. A probabilistic model is only a stopgap in absence of an account of the missing factors: semantics, pragmatics, what topics I have been talking to other people about lately, how tired I am, whether I ate breakfast this morning.

By this species of argument, stochastic models are practically always a stopgap approximation. Take stochastic queueing theory, for example, by which one can give a probabilistic model of how many trucks will be arriving at given depots in a transportation system. One could argue that if we could just model everything about the state of the trucks and the conditions of the roads, the location of every nail that might cause a flat, and every drunk driver that might
cause an accident, then we could in principle predict deterministically how many trucks will be arriving at any depot at any time, and there is no need of stochastic queueing theory. Stochastic queueing theory is only an approximation in lieu of information that it is impractical to collect.

But this argument is flawed. If we have a complex deterministic system, and if we have access to the initial conditions in complete detail, so that we can compute the state of the system unerringly at every point in time, a simpler stochastic description may still be more insightful. To use a dirty word, some properties of the system are genuinely emergent, and a stochastic account is not just an approximation; it provides more insight than identifying every deterministic factor. Or to use a different dirty word, it is a reductionist error to reject a successful stochastic account and insist that only a more complex, lower-level, deterministic model advances scientific understanding.

4.2 Chomsky v. Shannon

In one's introductory linguistics course, one learns that Chomsky disabused the field once and for all of the notion that there was anything of interest to statistical models of language. But one usually comes away a little fuzzy on the question of what, precisely, he proved.

The arguments of Chomsky's that I know are from "Three Models for the Description of Language" [Chomsky, 1956] and Syntactic Structures [Chomsky, 1957] (essentially the same argument repeated in both places), and from the Handbook of Mathematical Psychology, chapter 13 [Miller and Chomsky, 1963]. I think the first argument in Syntactic Structures is the best known. It goes like this:

  It is fair to assume that neither sentence (1) [colorless green ideas sleep furiously] nor (2) [furiously sleep ideas green colorless], nor indeed any part of these sentences, has ever occurred in an English discourse. . . . Yet (1), though nonsensical, is grammatical, while (2) is not. [Chomsky, 1957, p. 16]

This argument only goes through if we assume that if the frequency of a sentence or "part" is zero in a training sample, its probability is zero. But in fact, there is quite a literature on how to estimate the probabilities of events that do not occur in the sample, and in particular how to distinguish real zeros from zeros that just reflect something that is missing by chance.
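A standard illustration from that literature is additive (Laplace) smoothing. The sketch below is only that, an illustration: the toy corpus and the add-one constant are arbitrary choices of mine, and real systems use better estimators such as Good-Turing or back-off methods. The point it makes is simply that an event unseen in the sample receives a small but nonzero probability rather than zero:

  # Add-one (Laplace) estimates for word bigrams: unseen events get a
  # small nonzero probability instead of zero.
  from collections import Counter

  corpus = "colorless green ideas sleep furiously green ideas sleep".split()
  bigrams = Counter(zip(corpus, corpus[1:]))
  unigrams = Counter(corpus)
  vocab = len(set(corpus))

  def p_add_one(w1, w2):
      # P(w2 | w1) with add-one smoothing over the observed vocabulary
      return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

  print(p_add_one("green", "ideas"))   # seen bigram: relatively high
  print(p_add_one("ideas", "green"))   # unseen bigram: small, but not zero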
Chomsky also gives a more general argument:

  If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between order of approximation and grammaticalness. [Chomsky, 1957, p. 17]

Because for any n, there are sentences with grammatical dependencies spanning more than n words, no nth-order statistical approximation can sort out the grammatical from the ungrammatical examples. In a word, you cannot define grammaticality in terms of probability.

It is clear from context that "statistical approximation to English" is a reference to nth-order Markov models, as discussed by Shannon. Chomsky is saying that there is no way to choose n and ε such that, for all sentences s,

  grammatical(s) ↔ P_n(s) > ε

where P_n(s) is the probability of s according to the best nth-order approximation to English.

But Shannon himself was careful to call attention to precisely this point: that for any n, there will be some dependencies affecting the well-formedness of a sentence that an nth-order model does not capture. The point of Shannon's approximations is that, as n increases, the total mass of ungrammatical sentences that are erroneously assigned nonzero probability decreases. That is, we can in fact define grammaticality in terms of probability, as follows:

  grammatical(s) ↔ lim_{n→∞} P_n(s) > 0

A third variant of the argument appears in the Handbook. There Chomsky states that parameter estimation is impractical for an nth-order Markov model where n is large enough to give a "reasonable fit to ordinary usage." He emphasizes that the problem is not just an inconvenience for statisticians, but renders the model untenable as a model of human language acquisition: we cannot seriously propose that a child learns the values of so many parameters in a childhood lasting only 10^8 seconds.

This argument is also only partially valid. If it takes at least a second to estimate each parameter, and parameters are estimated sequentially, the argument is correct. But if parameters are estimated in parallel, say, by a high-dimensional iterative or gradient-pursuit method, all bets are off. Nonetheless, I think even the most hardcore statistical types are willing to admit that Markov models represent a brute-force approach, and are not an adequate basis for psychological models of language processing. However, the inadequacy of Markov models is not that they are statistical, but that they are statistical versions of finite-state automata. Each of Chomsky's arguments turns on the fact that Markov models are finite-state, not on the fact that they are stochastic. None of his criticisms are applicable to stochastic models generally. More sophisticated stochastic models do exist: stochastic context-free grammars are well understood, and stochastic versions of Tree Adjoining
Grammar [Resnik, 1992], GB [Fordham and Crocker, 1994], and HPSG [Brew, 1995] have been proposed.

In fact, probabilities make Markov models more adequate than their nonprobabilistic counterparts, not less adequate. Markov models are surprisingly effective, given their finite-state substrate. For example, they are the workhorse of speech recognition technology. Stochastic grammars can also be easier to learn than their nonstochastic counterparts. For example, though Gold [Gold, 1967] showed that the class of context-free grammars is not learnable, Horning [Horning, 1969] showed that the class of stochastic context-free grammars is learnable.

In short, Chomsky's arguments do not bear at all on the probabilistic nature of Markov models, only on the fact that they are finite-state. His arguments are not by any stretch of the imagination a sweeping condemnation of statistical methods.

5 Conclusion
In closing, let me repeat the main line of argument as concisely as I can.

Statistical methods, by which I mean primarily weighted grammars and distributional induction methods, are clearly relevant to language acquisition, language change, language variation, language generation, and language comprehension. Understanding language in this broad sense is the ultimate goal of linguistics.

The issues to which weighted grammars apply, particularly as concerns perception of grammaticality and ambiguity, one may be tempted to dismiss as "performance" issues. However, the set of issues labeled "performance" are not essentially computational, as one is often led to believe. Rather, "competence" represents a provisional narrowing and simplification of data in order to understand the algebraic properties of language. "Performance" is a misleading term for everything else. Algebraic methods are inadequate for understanding many important properties of human language, such as the measure of goodness that permits one to identify the correct parse out of a large candidate set in the face of considerable noise.

Many other properties of language, as well, that are mysterious given unweighted grammars, properties such as the gradualness of rule learning, the gradualness of language change, dialect continua, and statistical universals, make a great deal more sense if we assume weighted or stochastic grammars.

There is a huge body of mathematical techniques that computational linguists have begun to tap, yielding tremendous progress on previously intransigent problems. The focus in computational linguistics has admittedly been on tech-
nology. But the same techniques promise progress on issues concerning the nature of language that have remained mysterious for so long. The time is ripe to apply them.

Acknowledgments
I thank Tilman Hoehle, Graham Katz, Marc Light, and Wolfgang Sternefeld for their comments on an earlier draft of this chapter. All errors and outrageous opinions are, of course, my own.

References
Didier Bourigault. Surface grammatical analysis for the extraction of terminological noun phrases. In COLING-92, Vol. 3, pp. 977-981, 1992.
Michael R. Brent. Automatic acquisition of subcategorization frames from untagged, free-text corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 209-214, 1991.
Chris Brew. Stochastic HPSG. In Proceedings of EACL-95, 1995.
Eric Brill. Transformation-Based Learning. Ph.D. thesis, University of Pennsylvania, Philadelphia, 1993.
Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, IT-2(3):113-124, 1956. New York, Institute of Radio Engineers.
Noam Chomsky. Syntactic Structures. The Hague, Mouton, 1957.
Noam Chomsky. The logical basis of linguistic theory. In Horace Lunt, editor, Proceedings of the Ninth International Congress of Linguists, pp. 914-978, The Hague, Mouton, 1964.
Steven Paul Finch. Finding Structure in Language. Ph.D. thesis, University of Edinburgh, 1993.
Andrew Fordham and Matthew Crocker. Parsing with principles and probabilities. In The Balancing Act: Combining Symbolic and Statistical Approaches to Language, 1994.
E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447-474, 1967.
Gregory Grefenstette. Corpus-based method for automatic identification of support verbs for nominalizations. In EACL-95, 1995.
John A. Hawkins. Word Order Universals. New York, Academic Press, 1983.
John A. Hawkins. A parsing theory of word order universals. Linguistic Inquiry, 21(2):223-262, 1990.
James Jay Horning. A Study of Grammatical Inference. Ph.D. thesis, Stanford (Computer Science), 1969.
Terrence Kaufman. The native languages of Latin America: general remarks. In Christopher Moseley and R. E. Asher, editors, Atlas of the World's Languages, pp. 31-33, London, Routledge, 1994.
Brett Kessler. Computational dialectology in Irish Gaelic. In EACL-95, 1995.
Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. In 31st Annual Meeting of the Association for Computational Linguistics, pp. 235-242, 1993.
George A. Miller and Noam Chomsky. Finitary models of language users. In R. D. Luce, R. Bush, and E. Galanter, editors, Handbook of Mathematical Psychology, chapter 13. New York, Wiley, 1963.
Philip Resnik. Probabilistic Tree-Adjoining Grammar as a framework for statistical natural language processing. In COLING-92, pp. 418-424, 1992.
Philip Resnik. Selection and Information. Ph.D. thesis, University of Pennsylvania, Philadelphia, 1993.
Hinrich Schütze. Part-of-speech induction from scratch. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 251-258, 1993.
Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3-4):379-423, 623-656, 1948.
Frank Smadja. Microcoding the lexicon for language generation. In Uri Zernik, editor, Lexical Acquisition: Using On-Line Resources to Build a Lexicon. Cambridge, Mass.: The MIT Press, 1989.
Frank Smadja. Extracting Collocations from Text. An Application: Language Generation. Ph.D. thesis, Columbia University, New York, 1991.
Tony C. Smith and Ian H. Witten. Language inference from function words. Manuscript, University of Calgary and University of Waikato, January 1993.
Whitney Tabor. Syntactic Innovation: A Connectionist Model. Ph.D. thesis, Stanford University, 1994.
Mary Tait. North America. In Christopher Moseley and R. E. Asher, editors, Atlas of the World's Languages, pp. 3-30, London, Routledge, 1994.
Lucien Tesnière. Eléments de Syntaxe Structurale. 2nd edition. Paris: Klincksieck, 1959.
J. R. R. Tolkien. The Hobbit. Boston, Houghton Mifflin, 1966.
Willard van Orman Quine. Word and Object. Cambridge, Mass., The MIT Press, 1960.
Victor Yngve. Language as an error correcting code. Technical Report 33:XV, Quarterly Report of the Research Laboratory of Electronics, The Massachusetts Institute of Technology, April 15, 1954, pp. 73-74.
Chapter 2 Qualitative and Quantitative Models of Speech Translation
Hiyan Alshawi
Alshawi achieves two goals in this chapter. First, he challenges the notion that the identification of a statistical-symbolic distinction in language processing is an instance of the empirical vs. rational debate. Second, Alshawi proposes models for speech translation that retain aspects of qualitative design while moving toward incorporating quantitative aspects for structural dependency, lexical transfer, and linear order.

On the topic of the place of the statistical-symbolic distinction in natural language processing, Alshawi points to the fact that rule-based approaches are becoming increasingly probabilistic. However, at the same time, since language is symbolic by nature, the notion of building a "purely" statistical model may not be meaningful. Alshawi suggests that the basis for the contrast is in fact a distinction between qualitative systems dealing exclusively with combinatoric constraints, and quantitative systems dealing with the computation of numerical functions.

Of course, the problem still remains of how and where to introduce quantitative modeling into language processing. Alshawi proposes a model to do just this, specifically for the language translation task. The design reflects the conventional qualitative transfer approach, that is, starting with a logic-based grammar and lexicon to produce a set of logical forms that are then filtered, passed to a translation component, and then given to a generation component mapping logical forms to surface syntax, which is then fed to the speech synthesizer. Alshawi then methodically analyzes which of these steps would be improved by the introduction of quantitative modeling. His step-by-step analysis, considering specific ways to improve the basic qualitative model, illustrates the variety of possibilities still to be explored in achieving the optimal balance among types of models. - Eds.
1 Introduction
In recent years there has been a resurgence of interest in statistical approaches to natural language processing. Such approaches are not new, witness the statistical approach to machine translation suggested by Weaver [1955], but the current level of interest is largely due to the success of applying hidden Markov models and N-gram language models in speech recognition. This success was directly measurable in terms of word recognition error rates, prompting language processing researchers to seek corresponding improvements in performance and robustness.

A speech translation system, which by necessity combines speech and language technology, is a natural place to consider combining the statistical and conventional approaches, and much of this chapter describes probabilistic models of structural language analysis and translation. My aim is to provide an overall model for translation with the best of both worlds. Various factors lead us to conclude that a lexicalist statistical model with dependency relations is well suited to this goal. As well as this quantitative approach, we consider a constraint, logic-based approach and try to distinguish characteristics that we wish to preserve from those that are best replaced by statistical models. Although perhaps implicit in many conventional approaches to translation, a characterization in logical terms of what is being done is rarely given, so we attempt to make that explicit here, more or less from first principles.

Before proceeding, I first examine some fashionable distinctions in section 2 in order to clarify the issues involved in comparing these approaches. I argue that the important distinction is not so much a rational-empirical or symbolic-statistical distinction but rather a qualitative-quantitative one. This is followed in section 3 by discussion of the logic-based model, in section 4 by the overall quantitative model, in section 5 by monolingual models, in section 6 by translation models, and, in section 7, some conclusions. I concentrate throughout on what information about language and translation is coded and how it is expressed as logical constraints in one model or statistical parameters in the other.

At Bell Laboratories, we have built a speech translation system with the same underlying motivation as the quantitative model presented here. Although the quantitative model used in that system is different from the one presented here, they can both be viewed as statistical models of dependency grammar. In building the system, we had to address a number of issues that are beyond the scope of this chapter, including parameter estimation and the development of efficient search algorithms.
2 Qualitative and Quantitative Models
One contrast often taken for granted is the identification of a statistical-symbolic distinction in language processing as an instance of the empirical vs. rational debate. I believe this contrast has been exaggerated, though historically it has had some validity in terms of accepted practice. Rule-based approaches have become more empirical in a number of ways: First, a more empirical approach is being adopted to grammar development whereby the rule set is modified according to its performance against corpora of natural text (e.g. [Taylor et al., 1989]). Second, there is a class of techniques for learning rules from text, a recent example being [Brill, 1993]. Conversely, it is possible to imagine building a language model in which all probabilities are estimated according to intuition without reference to any real data, giving a probabilistic model that is not empirical.

Most language processing labeled as statistical involves associating real-number-valued parameters with configurations of symbols. This is not surprising given that natural language, at least in written form, is explicitly symbolic. Presumably, classifying a system as symbolic must refer to a different set of (internal) symbols, but even this does not rule out many statistical systems modeling events involving nonterminal categories and word senses. Given that the notion of a symbol, let alone an "internal symbol," is itself a slippery one, it may be unwise to build our theories of language, or even the way we classify different theories, on this notion.

Instead, it would seem that the real contrast driving the shift toward statistics in language processing is a contrast between qualitative systems dealing exclusively with combinatoric constraints, and quantitative systems that involve computing numerical functions. This bears directly on the problems of brittleness and complexity that discrete approaches to language processing share with, for example, reasoning systems based on traditional logical inference. It relates to the inadequacy of the dominant theories in linguistics to capture "shades" of meaning or degrees of acceptability which are often recognized by people outside the field as important inherent properties of natural language. The qualitative-quantitative distinction can also be seen as underlying the difference between classification systems based on feature specifications, as used in unification formalisms [Shieber, 1986], and clustering based on a variable degree of granularity (e.g. [Pereira et al., 1993]).

It seems unlikely that these continuously variable aspects of fluent natural language can be captured by a purely combinatoric model. This naturally leads
to the question of how best to introduce quantitative modeling into language processing. It is not, of course, necessary for the quantities of a quantitative model to be probabilities. For example, we may wish to define real-valued functions on parse trees that reflect the extent to which the trees conform to, say, minimal attachment and parallelism between conjuncts. Such functions have been used in tandem with statistical functions in experiments on disambiguation (e.g. [Alshawi and Carter, 1994]). Another example is connection strengths in neural network approaches to language processing, though it has been shown that certain networks are effectively computing probabilities [Richard and Lippmann, 1991].

Nevertheless, probability theory does offer a coherent and relatively well-understood framework for selecting between uncertain alternatives, making it a natural choice for quantitative language processing. The case for probability theory is strengthened by a well-developed empirical methodology in the form of statistical parameter estimation. There is also the strong connection between probability theory and the formal theory of information and communication, a connection that has been exploited in speech recognition, for example, using the concept of entropy to provide a motivated way of measuring the complexity of a recognition problem [Jelinek et al., 1992].

Even if probability theory remains, as it currently is, the method of choice in making language processing quantitative, this still leaves the field wide open in terms of carving up language processing into an appropriate set of events for probability theory to work with. For translation, a very direct approach using parameters based on surface positions of words in source and target sentences was adopted in the Candide system [Brown et al., 1990]. However, this does not capture important structural properties of natural language. Nor does it take into account generalizations about translation that are independent of the exact word order in source and target sentences. Such generalizations are, of course, central to qualitative structural approaches to translation (e.g. [Isabelle and Macklovitch, 1986; Alshawi et al., 1992]).

The aim of the quantitative language and translation models presented in sections 5 and 6 is to employ probabilistic parameters that reflect linguistic structure without discarding rich lexical information or making the models too complex to train automatically. In terms of a traditional classification, this would be seen as a "hybrid symbolic-statistical" system because it deals with linguistic structure. From our perspective, it can be seen as a quantitative version of the logic-based model because both models attempt to capture similar information (about the organization of words into phrases and relations holding between these phrases or their referents), though the tools of modeling are substantially different.
3 Dissecting a Logic-Based System

We now consider a hypothetical speech translation system in which the language processing components follow a conventional qualitative transfer design. Although hypothetical, this design and its components are similar to those used in existing database query [Rayner and Alshawi, 1992] and translation systems [Alshawi et al., 1992]. More recent versions of these systems have been gradually taking on a more quantitative flavor, particularly with respect to choosing between alternative analyses, but our hypothetical system will be more purist in its qualitative approach.

The overall design is as follows. We assume that a speech recognition subsystem delivers a list of text strings corresponding to transcriptions of an input utterance. These recognition hypotheses are passed to a parser which applies a logic-based grammar and lexicon to produce a set of logical forms, specifically formulas in first order logic corresponding to possible interpretations of the utterance. The logical forms are filtered by contextual and word-sense constraints, and one of them is passed to the translation component. The translation relation is expressed by a set of first order axioms which are used by a theorem prover to derive a target language logical form that is equivalent (in some context) to the source logical form. A grammar for the target language is then applied to the target form, generating a syntax tree whose fringe is passed to a speech synthesizer. Taking the various components in turn, we make a note of undesirable properties that might be improved by quantitative modeling.

3.1 Analysis and Generation

A grammar, expressed as a set of syntactic rules (axioms) Gsyn and a set of semantic rules (axioms) Gsem, is used to support a relation form holding between strings s and logical forms φ expressed in first order logic:

  Gsyn ∪ Gsem ⊨ form(s, φ)

The relation form is many-to-many, associating a string with linguistically possible logical form interpretations. In the analysis direction, we are given s and search for logical forms φ, while in generation we search for strings s given φ. For analysis and generation, we are treating strings s and logical forms φ as object-level entities. In interpretation and translation, we will move down from this meta-level reasoning to reasoning with the logical forms as propositions.

The list of text strings handed by the recognizer to the parser can be assumed to be ordered in accordance with some acoustic scoring scheme internal to the
recognizer. The magnitude of the scores is ignored by our qualitative language processor; it simply processes the hypotheses one at a time until it finds one for which it can produce a complete logical form interpretation that passes grammatical and interpretation constraints, at which point it discards the remaining hypotheses. Clearly, discarding the acoustic score and taking the first hypothesis that satisfies the constraints may lead to an interpretation that is less plausible than one derivable from a hypothesis further down in the recognition list. But there is no point in processing these later hypotheses since we will be forced to select one interpretation essentially at random.

Syntax The syntactic rules in Gsyn relate "category" predicates c0, c1, c2 holding of a string and two spanning substrings (we limit the rules here to two daughters for simplicity):

  c0(s0) ← daughters(s0, s1, s2) ∧ c1(s1) ∧ c2(s2) ∧ (s0 = concat(s1, s2))

(Here, and subsequently, variables like s0 and s1 are implicitly universally quantified.) Gsyn also includes lexical axioms for particular strings w consisting of single words:

  c1(w), ..., cm(w)

For a feature-based grammar, these rules can include conjuncts constraining the values, a1, a2, ..., of discrete-valued functions f on the strings:

  f(w) = a1,  f(s0) = f(s1)

The main problem here is that such grammars have no notion of a degree of grammatical acceptability: a sentence is either grammatical or ungrammatical. For small grammars this means that perfectly acceptable strings are often rejected; for large grammars we get a vast number of alternative trees, so the chance of selecting the correct tree for simple sentences can get worse as the grammar coverage increases. There is also the problem of requiring increasingly complex feature sets to describe idiosyncrasies in the lexicon.
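The purely combinatoric character of such rules can be made concrete with a small Python sketch. This is an illustration only: the toy categories, rules, and lexicon below are invented, not taken from the systems cited above. It implements the rule schema directly, and the problem noted in the text is visible in the output: the answer is strictly yes or no, with no way to say that one analysis is better than another.

  # Unweighted, purely combinatoric recognition with binary rules:
  # c0(s0) holds if s0 = concat(s1, s2) where c1(s1) and c2(s2) hold.
  RULES = [            # (c0, c1, c2)
      ("S",  "NP", "VP"),
      ("NP", "Det", "N"),
      ("VP", "V",  "NP"),
  ]
  LEXICON = {          # lexical axioms c(w)
      "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
  }

  def has_category(words, cat):
      if len(words) == 1:
          return cat in LEXICON.get(words[0], set())
      return any(
          c0 == cat and has_category(words[:i], c1) and has_category(words[i:], c2)
          for c0, c1, c2 in RULES
          for i in range(1, len(words))
      )

  print(has_category("the dog saw the cat".split(), "S"))   # True
  print(has_category("dog the saw cat the".split(), "S"))   # False: no degrees, just yes/no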
Semantics Semantic grammar axioms belonging to Gsem specify a "composition" function g for deriving a logical form for a phrase from those for its subphrases:

  form(s0, g(φ1, φ2)) ← daughters(s0, s1, s2) ∧ c1(s1) ∧ c2(s2) ∧ c0(s0) ∧ form(s1, φ1) ∧ form(s2, φ2)
The interpretation rules for strings bottom out in a set of lexical semantic rules associating words with predicates (p1, p2, ...) corresponding to "word senses." For a particular word and syntactic category, there will be a (small, possibly empty) finite set of such word-sense predicates:

  ci(w) → form(w, pi1)
  ...
  ci(w) → form(w, pin)

First order logic was assumed as the semantic representation language because it comes with well-understood, if not very practical, inferential machinery for constraint solving. However, applying this machinery requires making logical forms fine-grained to a degree often not warranted by the information the speaker of an utterance intended to convey. An example of this is explicit scoping, which leads (again) to large numbers of alternatives which the qualitative model has difficulty choosing between. Also, many natural language sentences cannot be expressed in first order logic without resort to elaborate formulas requiring complex semantic composition rules. These rules can be simplified by using a higher order logic but at the expense of even less practical inferential machinery.

In applying the grammar in generation we are faced with the problem of balancing over- and undergeneration by tweaking grammatical constraints, there being no way to prefer fully grammatical target sentences over more marginal ones. Qualitative approaches to grammar tend to emphasize the ability to capture generalizations as the main measure of success in linguistic modeling. This might explain why producing appropriate lexical collocations is rarely addressed seriously in these models, even though lexical collocations are important for fluent generation. The study of collocations for generation fits in more naturally with statistical techniques, as illustrated by Smadja and McKeown [1990].

3.2 Interpretation

In the logic-based model, interpretation is the process of identifying, from the possible interpretations φ of s for which form(s, φ) holds, one that is supported by the context, that is, one for which

  S ∪ R ∪ A ⊨ φ

Here, we have separated the context into a contingent set of contextual propositions S and a set R of (monolingual) "meaning postulates," or selectional restrictions, that constrain the word sense predicates in all contexts. A is a set of
assumptions sufficient to support the interpretation given S and R. In other words, this is "interpretation as abduction" [Hobbs et al., 1988], since abduction, not deduction, is needed to arrive at the assumptions A.

The most common types of meaning postulates in R are those for restriction, hyponymy, and disjointness, expressed as follows:

  restriction:   p1(x1, x2) → p2(x1)
  hyponymy:      p2(x) → p3(x)
  disjointness:  ¬(p3(x) ∧ p4(x))

Although there are compilation techniques (e.g. [Mellish, 1988]) which allow selectional constraints stated in this fashion to be implemented efficiently, the scheme is problematic in other respects. To start with, the assumption of a small set of senses for a word is at best awkward because it is difficult to arrive at an optimal granularity for sense distinctions. Disambiguation with selectional restrictions expressed as meaning postulates is also problematic because it is virtually impossible to devise a set of postulates that will always filter all but one alternative. We are thus forced to underfilter and make an arbitrary choice between remaining alternatives.
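The underfiltering problem is easy to see in a toy setting. The Python sketch below is illustrative only: the sense predicates and postulates are invented, and encoding postulates as set membership tests is my own simplification of the first order statements above. One verb's restriction narrows the candidates to a single sense, while the other's leaves both alive, forcing an arbitrary choice:

  # Selectional restrictions as meaning postulates, applied as a filter
  # over candidate word senses.  Hypernym sets stand in for hyponymy postulates.
  HYPERNYMS = {
      "bank_institution": {"organization", "physical_object"},
      "bank_riverside":   {"location", "physical_object"},
  }
  # Restriction postulates: the verb sense requires its object to fall
  # under the given sortal predicate.
  RESTRICTIONS = {
      "dredge_1": "location",         # dredge the bank (of the river)
      "visit_1":  "physical_object",  # visit the bank (either reading)
  }

  def surviving_senses(verb_sense, noun_senses):
      required = RESTRICTIONS[verb_sense]
      return [s for s in noun_senses if required in HYPERNYMS[s]]

  candidates = ["bank_institution", "bank_riverside"]
  print(surviving_senses("dredge_1", candidates))  # ['bank_riverside']: filtered to one
  print(surviving_senses("visit_1", candidates))   # both survive: underfiltering, so the
                                                   # system must pick one arbitrarily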
3.3 Logic-Based Translation

In both the quantitative and qualitative models we take a transfer approach to translation. We do not depend on interlingual symbols, but instead map a representation with constants associated with the source language into a corresponding expression with constants from the target language. For the qualitative model, the operable notion of correspondence is based on logical equivalence, and the constants are source word sense predicates p1, p2, ... and target sense predicates q1, q2, .... More specifically, we will say the translation relation between a source logical form φs and a target logical form φt holds if we have

  B ∪ S ∪ A′ ⊨ (φs ↔ φt)

where B is a set of monolingual and bilingual meaning postulates, and S is a set of formulas characterizing the current context. A′ is a set of assumptions that includes the assumptions A which supported φs. Here bilingual meaning postulates are first order axioms relating source and target sense predicates. A typical bilingual postulate for translating between p1 and q1 might be of the form:

  p5(x1) → (p1(x1, x2) ↔ q1(x1, x2))

The need for the assumptions A′ arises when a source language word is vaguer than its possible translations in the target language, so different choices
of target words will correspond to translations under different assumptions. For example, the condition p5(x1) above might be proved from the input logical form, or it might need to be assumed.

In the general case, finding solutions (i.e. A′, φt pairs) for the abductive schema is an undecidable theorem-proving problem. This can be alleviated by placing restrictions on the form of meaning postulates and input formulas and using heuristic search methods. Although such an approach was applied with some success in a limited-domain system translating logical forms into database queries [Rayner and Alshawi, 1992], it is likely to be impractical for language translation with tens of thousands of sense predicates and related axioms.

Setting aside the intractability issue, this approach does not offer a principled way of choosing between alternative solutions proposed by the prover. One would like to prefer solutions with "minimal" sets of assumptions, but it is difficult to find motivated definitions for this minimization in a purely qualitative framework.
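A toy sketch may help fix the idea of conditioned transfer and minimal assumption sets. The Python fragment below is purely illustrative: the sense predicates, the wall/Wand/Mauer postulates, and the use of assumption-set size as the preference criterion are my own simplifications, not part of the systems cited. Each postulate maps a source sense to a target sense, possibly under a condition that must either be found in the context or added as an assumption:

  # Toy transfer with bilingual postulates of the form
  #   condition -> (source_sense <-> target_sense)
  # A condition not available in the context becomes an assumption.
  POSTULATES = [
      # (source sense, target sense, condition or None)
      ("wall_1",  "Wand_1",  "interior_1"),   # interior wall reading
      ("wall_1",  "Mauer_1", "exterior_1"),   # exterior wall reading
      ("house_1", "Haus_1",  None),
  ]

  def transfer(source_form, context):
      """Return (target_form, assumptions) pairs, fewest assumptions first."""
      solutions = []
      for src, tgt, cond in POSTULATES:
          if src in source_form:
              assumptions = set() if (cond is None or cond in context) else {cond}
              solutions.append(((source_form - {src}) | {tgt}, assumptions))
      return sorted(solutions, key=lambda sol: len(sol[1]))

  source_form = {"wall_1"}
  print(transfer(source_form, context={"interior_1"}))  # Wand wins: no assumptions needed
  print(transfer(source_form, context=set()))           # both need one assumption: a tie,
                                                         # which a qualitative model cannot break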
4 Quantitative Model Components
4.1 Moving to a Quantitative Model

In moving to a quantitative architecture, we propose to retain many of the basic characteristics of the qualitative model:

• A transfer organization with analysis, transfer, and generation components
• Monolingual models that can be used for both analysis and generation
• Translation models that exclusively code contrastive (cross-linguistic) information
• Hierarchical phrases capturing recursive linguistic structure

Instead of feature-based syntax trees and first order logical forms we will adopt a simpler, monostratal representation that is more closely related to those found in dependency grammars (e.g. [Hudson, 1984]). Dependency representations have been used in large scale qualitative machine translation systems, notably by McCord [1988]. The notion of a lexical "head" of a phrase is central to these representations because they concentrate on relations between such lexical heads. In our case, the dependency representation is monostratal in that the relations may include ones normally classified as belonging to syntax, semantics, or pragmatics.

One salient property of our language model is that it is strongly lexical: it consists of statistical parameters associated with relations between lexical
items and the number and ordering of dependents of lexical heads. This lexical anchoring facilitates statistical training and sensitivity to lexical variation and collocations. In order to gain the benefits of probabilistic modeling, we replace the task of developing large rule sets with the task of estimating large numbers of statistical parameters for the monolingual and translation models. This gives rise to a new cost tradeoff in human annotation and judgment vs. barely tractable fully automatic training. It also necessitates further research on lexical similarity and clustering (e.g. [Pereira et al., 1993; Dagan et al., 1993]) to improve parameter estimation from sparse data.

We should emphasize at the outset that the quantitative model presented below is not a way of augmenting a logic-based system by associating probabilities or costs with the axioms or rules of that model. Instead, the parameters of the quantitative model encode all the structural and preference information necessary to apply the model.

4.2 Translation via Lexical Relation Graphs

The model associates phrases with relation graphs. A relation graph is a directed labeled graph consisting of a set of relation edges. Each edge has the form of an atomic proposition

  r(wi, wj)
where r is a relation symbol, wi is the lexical head of a phrase, and wj is the lexical head of another phrase (typically a subphrase of the phrase headed by wi). The nodes wi and wj are word occurrences representable by a word and an index, the indices uniquely identifying particular occurrences of the words in a discourse or corpus. The set of relation symbols is open-ended, but the first argument of the relation is always interpreted as the head and the second as the dependent with respect to this relation. The relations in the models for the source and target languages need not be the same, or even overlap. To keep the language models simple, we will mainly restrict ourselves here to dependency graphs that are trees with unordered siblings. In particular, phrases will always be contiguous strings of words and dependents will always be heads of subphrases.

Ignoring algorithmic issues relating to compactly representing and efficiently searching the space of alternative hypotheses, the overall design of the quantitative system is as follows. The speech recognizer produces a set of word-position hypotheses (perhaps in the form of a word lattice) corresponding to a set of string hypotheses for the input. The source language model is used to compute a set of possible relation graphs, with associated probabilities, for
each string hypothesis. A probabilistic graph translation model then provides, for each source relation graph, the probabilities of deriving corresponding graphs with word occurrences from the target language. These target graphs include all the words of possible translations of the utterance hypotheses but do not specify the surface order of these words. Probabilities for different possible word orderings are computed according to ordering parameters which form part of the target language model.

In the following subsection we explain how the probabilities for these various processing stages are combined to select the most likely target word sequence. This word sequence can then be handed to the speech synthesizer. For tighter integration between generation and synthesis, information about the derivation of the target utterance can also be passed to the synthesizer.

4.3 Integrated Statistical Model

The probabilities associated with phrases in the above description are computed according to the statistical models for analysis, translation, and generation. In this subsection we show the relationship between these models to arrive at an overall statistical model of speech translation. We are not considering training issues in this paper, though a number of now familiar techniques ranging from methods for unsupervised maximum likelihood estimation to direct estimation using fully annotated data are applicable.

The objects involved in the overall model are as follows (we omit target speech synthesis under the assumption that it proceeds deterministically from a target language word string):
• As: (acoustic evidence for) source language speech
• Ws: source language word string
• Wt: target language word string
• Cs: source language relation graph
• Ct: target language relation graph
Given a spoken input in the source language, we wish to find a target language string that is the most likely translation of the input. We are thus interested in the conditional probability of Wt given As. This conditional probability can be expressed as follows (cf. [Chang and Su, 1993]):

  P(Wt | As) = Σ_{Ws,Cs,Ct} P(Ws | As) P(Cs | Ws, As) P(Ct | Cs, Ws, As) P(Wt | Ct, Cs, Ws, As)
We now apply some simplifying independence assumptions concerning relation graphs: specifically, that their derivation from word strings is independent of acoustic information; that their translation is independent of the original words and acoustics involved; and that target word string generation from target relation edges is independent of the source language representations. The extent to which these (Markovian) assumptions hold depends on the extent to which relation edges represent all the relevant information for translation. In particular it means they should express aspects of surface form relevant to meaning, such as topicalization, as well as predicate-argument structure. In any case, the simplifying assumptions give the following:

  P(Wt | As) = Σ_{Ws,Cs,Ct} P(Ws | As) P(Cs | Ws) P(Ct | Cs) P(Wt | Ct)

This can be rewritten with two applications of Bayes's rule:

  P(Wt | As) = (1/P(As)) Σ_{Ws,Cs,Ct} P(As | Ws) P(Ws | Cs) P(Cs) P(Ct | Cs) P(Wt | Ct)

Since As is given, 1/P(As) is a constant which can be ignored in finding the maximum of P(Wt | As). Determining Wt that maximizes P(Wt | As) therefore involves the following factors:
• P(As | Ws): source language acoustics
• P(Ws | Cs): source language generation
• P(Cs): source content relations
• P(Ct | Cs): source-to-target transfer
• P(Wt | Ct): target language generation
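The decomposition can be read directly as a scoring procedure over recognition, analysis, transfer, and generation hypotheses. The Python sketch below is illustrative only: the hypothesis names and probability tables are invented placeholders, and a real system would search a word lattice and a large space of relation graphs rather than enumerate a handful of items. For each candidate target string it sums the product of the five factors over the other variables and keeps the best-scoring string:

  # Choose the Wt maximizing the sum over Ws, Cs, Ct of
  # P(As|Ws) P(Ws|Cs) P(Cs) P(Ct|Cs) P(Wt|Ct).  All numbers are invented.
  import itertools

  acoustic = {"ws1": 0.6, "ws2": 0.4}                       # P(As | Ws) from the recognizer
  src_gen  = {("ws1", "cs1"): 0.5, ("ws2", "cs2"): 0.7}     # P(Ws | Cs)
  content  = {"cs1": 0.3, "cs2": 0.1}                       # P(Cs)
  transfer = {("cs1", "ct1"): 0.8, ("cs2", "ct2"): 0.9}     # P(Ct | Cs)
  tgt_gen  = {("wt1", "ct1"): 0.6, ("wt2", "ct2"): 0.5}     # P(Wt | Ct)

  scores = {}
  for wt in ["wt1", "wt2"]:
      total = 0.0
      for ws, cs, ct in itertools.product(acoustic, content, ["ct1", "ct2"]):
          total += (acoustic[ws] * src_gen.get((ws, cs), 0.0) * content[cs]
                    * transfer.get((cs, ct), 0.0) * tgt_gen.get((wt, ct), 0.0))
      scores[wt] = total

  print(scores)                        # wt1 scores about 0.0432, wt2 about 0.0126
  print(max(scores, key=scores.get))   # wt1 is the preferred target string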
We assume that the speech recognizer provides acoustic scores proportional to P(As | Ws) (or logs thereof). Such scores are normally computed by speech recognition systems, although they are usually also multiplied by word-based language model probabilities P(Ws), which we do not require in this application context. Our approach to language modeling, which covers the content analysis and language generation factors, is presented in section 5, and the transfer probabilities fall under the translation model of section 6.

Finally note that by another application of Bayes's rule we can replace the two factors P(Cs) P(Ct | Cs) by P(Ct) P(Cs | Ct) without changing other parts of the model. This latter formulation allows us to apply constraints imposed by the target language model to filter inappropriate possibilities suggested by analysis and transfer. In some respects this is similar to Dagan and Itai's
approach [1994] to word sense disambiguation using statistical associations in a second language.
5 Language Models

5.1 Language Production Model

Our language model can be viewed in terms of a probabilistic generative process based on the choice of lexical "heads" of phrases and the recursive generation of subphrases and their ordering. For this purpose, we can define the headword of a phrase to be the word that most strongly influences the way the phrase may be combined with other phrases. This notion has been central to a number of approaches to grammar for some time, including theories like dependency grammar [Hudson, 1984] and HPSG [Pollard and Sag, 1987]. More recently, the statistical properties of associations between words, and more particularly heads of phrases, have become an active area of research (e.g. [Chang et al., 1992; Hindle and Rooth, 1993]).

The language model factors the statistical derivation of a sentence with word string W as follows:

  P(W) = Σ_C P(C) P(W | C)

where C ranges over relation graphs. The content model, P(C), and generation model, P(W | C), are components of the overall statistical model for spoken language translation given earlier. This decomposition of P(W) can be viewed as first deciding on the content of a sentence, formulated as a set of relation edges according to a statistical model for P(C), and then deciding on word order according to P(W | C).

Of course, this decomposition simplifies the realities of language production in that real language is always generated in the context of some situation S (real or imaginary), so a more comprehensive model would be concerned with P(C | S), that is, language production in context. This is less important, however, in the translation setting since we produce Ct in the context of a source relation graph Cs and we assume the availability of a model for P(Ct | Cs).

5.2 Content Derivation Model

The model for deriving the relation graph of a phrase is taken to consist of choosing a lexical head h0 for the phrase (what the phrase is "about") followed by a series of node "expansion" steps. An expansion step takes a node and chooses a possibly empty set of edges (relation labels and ending nodes)
starting from that node. Here we consider only the case of relation graphs that are trees with unordered siblings.

To start with, let us take the simplified case where a headword h has no optional or duplicated dependents (i.e., exactly one for each relation). There will be a set of edges

  E(h) = {r1(h, w1), r2(h, w2), ..., rk(h, wk)}

corresponding to the local tree rooted at h with dependent nodes w1 ... wk. The set of relation edges for the entire derivation is the union of these local edge sets. To determine the probability of deriving a relation graph C for a phrase headed by h0 we make use of parameters ("dependency parameters")

  P(r(h, w) | h, r)

for the probability, given a node h and a relation r, that w is an r-dependent of h. Under the assumption that the dependents of a head are chosen independently from one another, the probability of deriving C is:
  P(C) = P(Top(h0)) Π_{r(h,w)∈C} P(r(h, w) | h, r)

where P(Top(h0)) is the probability of choosing h0 to start the derivation.

If we now remove the assumption made earlier that there is exactly one r-dependent of a head, we need to elaborate the derivation model to include choosing the number of such dependents. We model this by parameters

  P(N(r, n) | h)

that is, the probability that head h has n r-dependents. We will refer to this as a "detail" parameter. Our previous assumption amounted to stating that this probability was always 1 for n = 1 or for n = 0. Detail parameters allow us to model, for example, the number of adjectival modifiers of a noun or the degree to which a particular argument of a verb is optional. The probability of an expansion of h giving rise to local edges E(h) is now:

  P(E(h) | h) = Π_r P(N(r, nr) | h) k(nr) Π_{1≤i≤nr} P(r(h, w_i^r) | h, r)

where r ranges over the set of relation labels and h has nr r-dependents w_1^r ... w_nr^r; k(nr) is a combinatoric constant for taking account of the fact that we are not distinguishing permutations of the dependents (e.g., there are nr! permutations of the r-dependents of h if these dependents are all distinct). So if h0 is the root of a tree C, we have
P(C) = P(Top(h0)) Π_{h ∈ heads(C)} P(E_C(h) | h)

where heads(C) is the set of nodes in C and E_C(h) is the set of edges headed by h in C.

The above formulation is only an approximation for relation graphs that are not trees, because the independence assumptions which allow the dependency parameters to be simply multiplied together no longer hold for the general case. Dependency graphs with cycles do arise as the most natural analyses of certain linguistic constructions, but calculating their probabilities on a node-by-node basis as above may still provide probability estimates that are accurate enough for practical purposes.

5.3 Generation Model

We now return to the generation model P(W | C). As mentioned earlier, since C includes the words in W and a set of relations between them, the generation model is concerned only with surface order. One possibility is to use "bi-relation" parameters for the probability that an r_i-dependent immediately follows an r_j-dependent. This approach is problematic for our overall statistical model because such parameters are not independent of the "detail" parameters specifying the number of r-dependents of a head.

We therefore adopt the use of "sequencing" parameters, these being probabilities of particular orderings of dependents given that the multiset of dependency relations is known. We let the identity relation e stand for the head itself. Specifically, we have parameters

P(s | M(s))

where s is a sequence of relation labels including an occurrence of e and M(s) is the multiset for this sequence. For a head h in a relation graph C, let s_WCh be the sequence of dependent relations induced by a particular word string W generated from C. We now have

P(W | C) = Π_{h ∈ W} (1 / Π_r nr!) P(s_WCh | M(s_WCh))

where h ranges over all the heads in C, and nr is the number of occurrences of r in s_WCh, assuming that all orderings of the nr r-dependents are equally likely. We can thus use these sequencing parameters directly in our overall model.

To summarize, our monolingual models are specified by:

. Topmost head parameters P(Top(h))
. Dependency parameters P(r(h, w) | h, r)
. Detail parameters P(N(r, n) | h)
. Sequencing parameters P(s | M(s))
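To make the roles of these parameters concrete, here is a minimal sketch (Python; the tree, the words, and all parameter values are invented for illustration and are not part of the authors' system) that computes P(C) and P(W | C) for a tiny dependency tree under the independence assumptions above. Detail parameters for relations with zero dependents are ignored for brevity.

```python
import math
from collections import Counter

# Hypothetical parameter tables; in a real system these would be estimated from data.
P_top    = {"show": 0.01}                              # P(Top(h))
P_dep    = {("obj", "show", "flights"): 0.05,          # P(r(h, w) | h, r)
            ("mod", "flights", "cheap"): 0.10}
P_detail = {("show", "obj", 1): 0.6,                   # P(N(r, n) | h)
            ("flights", "mod", 1): 0.3}
P_seq    = {("e", "obj"): 0.9,                         # P(s | M(s)); e stands for the head
            ("mod", "e"): 0.8}

def local_expansion_prob(head, deps):
    """P(E(h) | h) for one head and its list of (relation, dependent word) edges."""
    prob = 1.0
    counts = Counter(r for r, _ in deps)
    for r, n in counts.items():                        # detail parameters
        prob *= P_detail[(head, r, n)]
    for r, w in deps:                                  # dependency parameters
        prob *= P_dep[(r, head, w)]
    return prob

def content_prob(tree):
    """P(C) for a dependency tree given as (head, [(relation, subtree), ...])."""
    head, _ = tree
    prob = P_top[head]
    stack = [tree]
    while stack:
        h, ds = stack.pop()
        prob *= local_expansion_prob(h, [(r, sub[0]) for r, sub in ds])
        stack.extend(sub for _, sub in ds)
    return prob

def order_prob(label_sequences):
    """P(W | C): one relation-label sequence (containing 'e') per head; the sequencing
    probability is split equally among permutations of identical relations."""
    prob = 1.0
    for seq in label_sequences:
        counts = Counter(r for r in seq if r != "e")
        perms = math.prod(math.factorial(n) for n in counts.values())
        prob *= P_seq[tuple(seq)] / perms
    return prob

# "show cheap flights": show --obj--> flights --mod--> cheap
tree = ("show", [("obj", ("flights", [("mod", ("cheap", []))]))])
print(content_prob(tree) * order_prob([("e", "obj"), ("mod", "e")]))
```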
The overall model splits the contributions of content P(C) and ordering P(W | C). However, we may also want a model for P(W) on its own, for example, for pruning speech-recognition hypotheses. Combining our content and ordering models we get:

P(W) = Σ_C P(C) P(W | C)
     = Σ_C P(Top(h_C)) Π_{h ∈ W} [ P(s_WCh | h) Π_{r(h,w) ∈ E_C(h)} P(r(h, w) | h, r) ]

The parameters P(s | h) can be derived by combining sequencing parameters with the detail parameters for h.

6 Translation Model

6.1 Mapping Relation Graphs

As already mentioned, the translation model defines mappings between relation graphs Cs for the source language and Ct for the target language. A direct (though incomplete) justification of translation via relation graphs may be based on a simple referential view of natural language semantics. Thus nominals and their modifiers pick out entities in a (real or imaginary) world; verbs and their modifiers refer to actions or events in which the entities participate in roles indicated by the edge relations. On this view, the purpose of the translation mapping is to determine a target language relation graph that provides the best approximation to the referential function induced by the source relation graph. We call this approximating referential equivalence.

This referential view of semantics is not adequate for taking account of much of the complexity of natural language, including many aspects of quantification, distributivity, and modality. This means it cannot capture some of the subtleties that a theory based on logical equivalence might be expected to. On the other hand, when we proposed a logic-based approach as our qualitative model, we had to restrict it to a simple first-order logic anyway for computational reasons, and even then it did not appear to be practical. Thus using the more impoverished lexical relations representation may not be costing us much in practice.

One aspect of the representation that is particularly useful in the translation application is its convenience for partial or incremental representation of content:
we can refine the representation by the addition of further edges. A fully specified denotation of the meaning of a sentence is rarely required for translation, and as we pointed out when discussing logic representations, a complete specification may not have been intended by the speaker. Although we have not provided a denotational semantics for sets of relation edges, we anticipate that this will be possible along the lines developed in monotonic semantics [Alshawi and Crouch, 1992].

6.2 Translation Parameters

To be practical, a model for P(Ct | Cs) needs to decompose the source and target graphs Cs and Ct into subgraphs small enough that subgraph translation parameters can be estimated. We do this with the help of "node alignment relations" between the nodes of these graphs. These alignment relations are similar in some respects to the alignments used by Brown et al. [1990] in their surface translation model. The translation probability is then the sum of probabilities over the different alignments of the translation mapping.

For our quantitative design, we adopt a simple model in which lexical and relation (structural) probabilities are assumed to be independent. In this model the alignment relations are functions from the word occurrence nodes of Ct to the word occurrences of Cs. The idea is that f(vj) = wi means that the source word occurrence wi "gave rise" to the target word occurrence vj. The inverse relation f^-1 need not be a function, allowing different numbers of words in the source and target sentences.

We decompose P(Ct, f | Cs) into "lexical" and "structural" probabilities as follows:

P(Ct, f | Cs) = P(Nt, f | Ns) P(Et | Nt, f, Cs)

where Nt and Ns are the node sets for Ct and Cs respectively, and Et is the set of edges for the target graph.

The first factor P(Nt, f | Ns) is the lexical component in that it does not take into account any of the relations in the source graph Cs. This lexical component is the product of alignment probabilities for each node of Ns:

P(Nt, f | Ns) = Π_{wi ∈ Ns} P(f^-1(wi) = {v1 ... vn} | wi)
That is, the probability that f maps exactly the (possibly empty) subset {v1 ... vn} of Nt to wi. These sets are assumed to be disjoint for different source graph nodes, so we can replace the factors in the above product with parameters:

P(M | w)

where w is a source language word and M is a multiset of target language words.

We will derive a target set of edges Et of Ct by k derivation steps which partition the set of source edges Es into subgraphs S1 ... Sk. These subgraphs give rise to disjoint sets of relation edges T1 ... Tk which together form Et. The structural component of our translation model will be the sum of derivation probabilities for such an edge set Et.

For simplicity, we assume here that the source graph Cs is a tree. This is consistent with our earlier assumptions about the source language model. We take our partitions of the source graph to be the edge sets for local trees. This ensures that the partitioning is deterministic, so the probability of a derivation is the product of the probabilities of derivation steps. More complex models with larger partitions rooted at a node are possible, but these require additional parameters for partitioning.

For the simple model it remains to specify derivation step probabilities. The probability of a derivation step is given by parameters of the form:

P(Ti' | Si', fi')

where Si' and Ti' are unlabeled graphs and fi' is a node alignment function from Ti' to Si'. Unlabeled graphs are just like our relation edge graphs except that the nodes are not labeled with words (the edges still have relation labels). To apply a derivation step we need a notion of graph matching that respects edge labels: g is an isomorphism (modulo node labels) from a graph G to a graph H if g is a one-one and onto function from the nodes of G to the nodes of H such that r(a, b) ∈ G iff r(g(a), g(b)) ∈ H.

The derivation step with parameter P(Ti' | Si', fi') is applicable to the source edges Si, under the alignment f, giving rise to the target edges Ti if (i) there is an isomorphism hi from Si' to Si, (ii) there is an isomorphism gi from Ti to Ti', and (iii) for any node v of Ti it must be the case that hi(fi'(gi(v))) = f(v). This last condition ensures that the target graph partitions join up in a way that is compatible with the node alignment f.
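As a hedged illustration of the lexical component only (the structural component, with its graph isomorphisms, is omitted), the following sketch computes the product of P(M | w) factors for an invented source fragment and alignment; nothing here is taken from the chapter's actual implementation.

```python
from collections import defaultdict

# Hypothetical lexical parameters P(M | w): the probability that source word w gives
# rise to exactly the multiset M of target words (multisets encoded as sorted tuples).
P_lex = {
    "ne": {(): 0.4, ("not",): 0.5},
    "va": {("goes",): 0.7, ("going", "is"): 0.2},
}

def lexical_component(source_words, alignment):
    """P(Nt, f | Ns): product over source word occurrences of P(M | w), where M is the
    multiset of target word occurrences that f maps back to that source occurrence.
    `alignment` maps each target occurrence (index, word) to a source word index."""
    inverse = defaultdict(list)
    for (t_index, t_word), s_index in alignment.items():
        inverse[s_index].append(t_word)
    prob = 1.0
    for s_index, w in enumerate(source_words):
        m = tuple(sorted(inverse.get(s_index, [])))
        prob *= P_lex[w].get(m, 0.0)
    return prob

# Invented fragment: two source words give rise to three target words.
source = ["ne", "va"]
alignment = {(0, "is"): 1, (1, "going"): 1, (2, "not"): 0}
print(lexical_component(source, alignment))   # 0.5 * 0.2 = 0.1
```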
The factoring of the translation model into these lexical and structural components means that it will overgenerate, because these aspects are not independent in translation between real natural languages. It is therefore appropriate to filter translation hypotheses by rescoring according to the version of the overall statistical model that included the factors P(Ct)P(Cs | Ct), so that the target language model constrains the output of the translation model. Of course, in this case we need to model the translation relation in the "reverse" direction. This can be done in a parallel fashion to the forward direction described above.
7 Conclusions

Our qualitative and quantitative models have a similar overall structure, and there are clear parallels between the factoring of logical constraints and statistical parameters, for example, monolingual postulates and dependency parameters, bilingual postulates and translation parameters. The parallelism would have been closer if we had adopted ID/LP-style rules [Gazdar et al., 1985] in the qualitative model. However, we argued in section 3 that our qualitative model suffered from lack of robustness, from having only the crudest means for choosing between competing hypotheses, and from being computationally intractable for large vocabularies.

The quantitative model is in a much better position to cope with these problems. It is less brittle because statistical associations have replaced constraints (featural, selectional, etc.) that must be satisfied exactly. The probabilistic models give us a systematic and well-motivated way of ranking alternative hypotheses. Computationally, the quantitative model lets us escape from the undecidability of logic-based reasoning. Because this model is highly lexical, we can hope that the input words will allow effective pruning by limiting the number of search paths having significantly high probabilities.

We retained some of the basic assumptions about the structure of language when moving to the quantitative model. In particular, we preserved the notion of hierarchical phrase structure. Relations motivated by dependency grammar made it possible to do this without giving up the sensitivity to lexical collocations which underpins simple statistical models like N-grams. The quantitative model also reduced overall complexity in terms of the sets of symbols used. In addition to words, it only required symbols for dependency relations, whereas the qualitative model required symbol sets for linguistic categories and features, and a set of word sense symbols. Despite their apparent importance to translation, the quantitative system can avoid the use of word-sense symbols (and the problems of granularity they give rise to) by exploiting statistical associations between words in the target language to filter implicit sense choices.

Finally, here is a summary of our reasons for combining statistical methods with dependency representations in our language and translation models:

. Inherent lexical sensitivity of dependency representations, facilitating parameter estimation
. Quantitative preference based on probabilistic derivation and translation
. Incremental or partial specification of the content of utterances, particularly useful in translation
. Decomposition of complex utterances through recursive linguistic structure

These factors suggest that dependency grammar will play an increasingly important role as language processing systems seek to combine both structural and collocational information.

Acknowledgments
I am grateful to Fernando Pereira, Mike Riley, and Ido Dagan for valuable discussions on the issues addressed in this paper. Fernando Pereira and Ido Dagan also provided helpful comments on a draft of the paper.

References

H. Alshawi and D. Carter. Training and Scaling Preference Functions for Disambiguation. Computational Linguistics, 20: 635-648, 1994.

H. Alshawi, D. Carter, B. Gamback, and M. Rayner. Swedish-English QLF Translation. In H. Alshawi, editor, The Core Language Engine. Cambridge, Mass.: The MIT Press, 1992.

H. Alshawi and R. Crouch. Monotonic Semantic Interpretation. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, Del., 1992.

E. Brill. Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 259-265. Columbus, Ohio, 1993.

P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics, 16: 79-85, 1990.

J. Chang, Y. Luo, and K. Su. GPSM: A Generalized Probabilistic Semantic Model for Ambiguity Resolution. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 177-192. Newark, Del., 1992.

J. Chang and K. Su. A Corpus-Based Statistics-Oriented Transfer and Generation Model for Machine Translation. In Proceedings of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation, Kyoto, Japan, 1993.

I. Dagan and A. Itai. Word Sense Disambiguation Using a Second Language Monolingual Corpus. Computational Linguistics, 20: 563-596, 1994.

I. Dagan, S. Marcus, and S. Markovitch. Contextual Word Similarity and Estimation from Sparse Data. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 164-171. Columbus, Ohio, 1993.

G. Gazdar, E. Klein, G. K. Pullum, and I. A. Sag. Generalised Phrase Structure Grammar. Oxford, Blackwell, 1985.

D. Hindle and M. Rooth. Structural Ambiguity and Lexical Relations. Computational Linguistics, 19: 103-120, 1993.

J. R. Hobbs, M. Stickel, P. Martin, and D. Edwards. Interpretation as Abduction. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pp. 95-103. Buffalo, 1988.

R. A. Hudson. Word Grammar. Oxford, Blackwell, 1984.

P. Isabelle and E. Macklovitch. Transfer and MT Modularity. In Proceedings of the 11th International Conference on Computational Linguistics, pp. 115-117. Bonn, Germany, 1986.

F. Jelinek, R. L. Mercer, and S. Roukos. Principles of Lexical Language Modeling for Speech Recognition. In S. Furui and M. M. Sondhi, editors, Advances in Speech Signal Processing. New York, Marcel Dekker, 1992.

M. McCord. A Multi-Target Machine Translation System. In Proceedings of the International Conference on Fifth Generation Computer Systems, pp. 1141-1149. Tokyo, 1988.

C. S. Mellish. Implementing Systemic Classification by Unification. Computational Linguistics, 14: 40-51, 1988.

C. J. Pollard and I. A. Sag. Information-Based Syntax and Semantics: Volume 1, Fundamentals. Stanford, Calif., Center for the Study of Language and Information, 1987.

F. Pereira, N. Tishby, and L. Lee. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 183-190. Columbus, Ohio, 1993.

M. Rayner and H. Alshawi. Deriving Database Queries from Logical Forms by Abductive Definition Expansion. In Proceedings of the Third Conference on Applied Natural Language Processing. Trento, Italy, 1992.

M. D. Richard and R. P. Lippmann. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities. Neural Computation, 3: 461-483, 1991.

S. M. Shieber. An Introduction to Unification-Based Approaches to Grammar. Stanford, Calif., Center for the Study of Language and Information, 1986.
Chapter 3
Study and Implementation of Combined Techniques for Automatic Extraction of Terminology

Beatrice Daille
The acquisition of terminology for particular domains has long been a significant problem in natural language processing, requiring a great deal of manual effort. Statistical techniques hold out the promise of identifying likely candidates for domain-specific terminology, but analyzing word co-occurrences in text often uncovers collocations that are statistically significant but terminologically irrelevant: for example, frozen forms, collocations more characteristic of the language in general than the particular domain, and the like. Assessing statistical significance is certainly part of the problem; for example, Dunning [1993] argues convincingly that many commonly used statistics, such as the chi-square test, are based on assumptions that do not generally hold for text. One might conjecture, however, that the problem lies not only in the statistics per se, but in deciding what counts as a co-occurrence.

In her "Study and Implementation of Combined Techniques for Automatic Extraction of Terminology," Beatrice Daille takes this conjecture seriously, and explores a method in which the co-occurrences of interest are defined in terms of surface syntactic relationships rather than proximity of words or tags within a fixed window (compare with [Smadja, 1993], for example). She finds that filtering based on even shallow a priori linguistic knowledge proves useful; in addition, she investigates a number of alternative statistics (simple frequency, mutual information, likelihood ratio, etc.) in order to identify which of them is best for the purpose of identifying lexical patterns that constitute domain-specific terminology.

The approach to combining linguistic and statistical methods taken in Daille's work, using shallow syntactic relationships to define the co-occurrences over which statistical methods operate, is quite general, and has also proved useful in work other than terminology extraction, notably in word clustering and automatic thesaurus generation (e.g., [Grefenstette, 1994; Hindle, 1990; Pereira et al., 1993]; also see Hatzivassiloglou, chapter 4). - Eds.
1 Introduction
A terminology bank contains the vocabulary of a technical domain: terms, which refer to its concepts. Building a terminological bank requires a lot of time and both linguistic and technical knowledge. The issue at stake is the automatic extraction of the terminology of a specific domain from a corpus. Current research on extracting terminology uses either linguistic specifications or statistical approaches. Concerning the former, [Bourigault, 1994] has proposed a program which extracts automatically from a corpus sequences of lexical units whose morphosyntax characterizes maximal technical noun phrases. This list of sequences is given to a terminologist to be checked. For the latter, several works ([Lafon, 1984; Church and Hanks, 1990; Calzolari and Bindi, 1990; Smadja and McKeown, 1990]) have shown that statistical scores are useful to extract collocations from corpora. The main problem with one or the other approach is the "noise": indeed, morphosyntactic criteria are not sufficient to isolate terms, and collocations extracted by statistical methods belong to various types of associations: functional, semantic, thematic, or uncharacterizable ones.

Our goal is to use statistical scores for extracting domain-specific collocations only and to forget about the other types of collocations. We proceed in two steps: first, by applying a linguistic filter which selects candidates from the corpus; then by applying statistical scores ranking these candidates and selecting the scores that fit our purpose best, in other words, scores that concentrate their high values on terms and their low values on co-occurrences that are not terms.
2 Linguistic Data

First, we study the linguistic specifications on the nature of terms in the technical domain of telecommunications for French. Then, taking these results into account, we present the linguistic method and the program that extract and count the candidate terms.

2.1 Linguistic Specifications

Terms are mainly multiword units of nominal type. They could be considered as a subclass of nominal compounds that inherit morphological and syntactic properties which have been stressed by studies on nominal compounding ([Gross et al., 1986; Noally, 1990], etc.). To be more precise, the structure of terms belongs to well-known morphosyntactic structures such as N ADJ, N1 de
N2, etc., and fits the general typology of French compounds elaborated by [Mathieu-Colas, 1988]. Some graphic indications (hyphen), morphological indications (restrictions in inflection), and syntactic ones (lack of article inside the structure) could also be good clues that a noun phrase is a term. But these properties are not discriminatory, contrary to the semantic property of terms: their referential monosemy. A term is a label that refers to a concept, and in the best of all possible worlds, a term refers uniquely to one and only one concept within a given subject field, and this independently of its textual context. Terminologists themselves agree that they encounter difficulty in defining exactly how to choose and delimit terms, since this semantic criterion relies mainly upon intuition. So, we have reinforced the criterion of unique referent with that of unique translation. A term, as it refers to a unique and universal concept, is ideally always translated by another such term in another language (though terms may have free or even contextually determined variants).

We have manually extracted French terms following these criteria from our bilingual corpus, available in both French and English, the Satellite Communication Handbook (SCH), containing 200,000 words in each language. The translations of these terms mostly possess the morphosyntactic structures of English compounds too. Then, we classified terms according to their length; the length of a term is defined as the number of main items it contains (main items are nouns, adjectives, adverbs, etc.; neither prepositions nor determiners are main items). From this classification, it appears that terms of length 2 are by far the most frequent ones. Because statistical methods demand a good representation in number of samples, we decided to extract in a first round only terms of length 2, which we will call base-terms, and which matched a list of previously determined patterns:

N ADJ             station terrienne (Earth station)
N1 de (DET) N2    zone de couverture (coverage zone)
N1 a (DET) N2     reflecteur a grille (grid reflector)
N1 PREP N2        liaison par satellite (satellite link)
N1 N2             diode tunnel (tunnel diode)

Of course, terms exist whose length is greater than 2. But the majority of terms of length greater than 2 are created recurrently from base-terms. We have distinguished three operations that lead to a term of length 3 from a term of length 1 or 2: "overcomposition," modification, and coordination. We will illustrate these operations with a few examples where the base-terms appear inside brackets:

1. Overcomposition
Two kinds of overcomposition have been pointed out: overcomposition by juxtaposition and overcomposition by substitution.
(a) Juxtaposition
A term obtained by juxtaposition is built with at least one base-term whose structure will not be altered. The example below illustrates the juxtaposition of a base-term and a simple noun:
N1 PREP1 [N2 PREP2 N3]   modulation par [deplacement de phase] ([phase shift] keying)
(b) Substitution
Given a base-term, one of its main items is substituted by a base-term whose head is this main item. For example, in the N1 PREP1 N2 structure, N1 is substituted by a base-term of N1 PREP2 N3 structure to create a term of N1 PREP2 N3 PREP1 N2 structure:
reseau a satellites + reseau de transit -> reseau de transit a satellites (satellite transit network).
We note in the above example that the structure of reseau a satellites (satellite network) is altered.

2. Modification
Modifiers that could generate a new term from a base-term appear either inside or after it.
(a) Insertion of modifiers
Adjectives and adverbs are the usual modifiers that could be inserted inside a base-term structure: adjectives in the N1 PREP (DET) N2 structure and adverbs in the N ADJ one:
liaisons multiples par satellite (multiple [satellite links])
reseaux entierement numeriques (all [digital networks])
(b) Post-modification
Adjectives and adverbial prepositional phrases of PREP ADJ N structure are the main modifiers that lead to the creation of new terms: post-modifying adjectives can modify any kind of base-term, for example [station terrienne] brouilleuse (interfering [earth station]). Adverbial prepositional phrases modify either simple nouns or base-terms (in this case the length of the term is equal to 4):
amplificateur(s) [a faible bruit] ([low noise] amplifier(s)), [interface(s) usager-reseau] [a usage multiple] ([multipurpose] [user-network interface(s)]).

3. Coordination
Coordination is a rather complex syntactic phenomenon (term coordination has been studied in [Jacquemin, 1991]) and seldom generates new terms. Let us examine a rare example of a term of length 3 obtained by coordination:
N1 de N3 + N2 de N3 -> N1 et N2 de N3
assemblage de paquets + desassemblage de paquets -> assemblage et desassemblage de paquets (packet assembly/disassembly)

It is difficult to determine whether a modified or overcomposed base-term is or is not a term. Take, for example, bande laterale unique (single sideband): bande laterale (sideband) is a base-term of structure N ADJ, and unique (single) a very common post-modifying adjective in French. The fact that bande laterale unique is a term is indicated by the presence of the abbreviation BLU (SSB). As abbreviations are not introduced for all terms, the right way is surely to first extract base-terms, that is, bande laterale (sideband). Once you have extracted base-terms, you can easily extract terms of length greater than 2 from the corpus, at least post-modified base-terms and base-terms overcomposed by juxtaposition.

But even if we have decided to extract only base-terms (length 2), we have to take into account their variations, or at least some of them. Variants of base-terms are classified under the following categories:

1. Graphic and orthographic variants
By graphic variants, we mean either the use or not of capitalized letters (Service national or service national ((D/d)omestic service)), or the presence or not of a hyphen inside the N1 N2 structure (mode paquet or mode-paquet (packet(-)mode)).
Orthographic variants concern the N1 PREP N2 structure. For this structure, the number of N2 is generally fixed, either singular or plural. However, we have encountered some exceptions: reseau(x) a satellite, reseau(x) a satellites (satellite network(s)).

2. Morphosyntactic variants
Morphosyntactic variants refer to the presence or not of an article before the N2 in the N1 PREP N2 structure: ligne d'abonne, lignes de l'abonne (subscriber lines); to the optional character of the preposition: tension helice, tension d'helice (helix voltage); and to a synonymy relation between two base-terms of different structures: for example, N ADJ and N1 a N2: reseau commute, reseau a commutation (switched network).

3. Elliptical variants
A base-term of length 2 could be called up by an elliptical form: for example, debit, which is used instead of debit binaire (bit rate).

After this linguistic investigation, we concentrate on terms of length 2 (base-terms), which seem by far the most frequent ones. Moreover, the majority of terms whose length is greater than 2 are built from base-terms. A statistical approach requires a good sampling, which base-terms provide. To filter base-terms from the corpus, we use their morphosyntactic structures. For this task, we need a tagged corpus where each item comes with its part of speech and its lemma. The part of speech is used to filter and the lemma to obtain an optimal sampling. We have used the stochastic tagger and the lemmatizer of the Scientific Center of IBM-France developed by the speech recognition team ([Derouault, 1985; El-Beze, 1993]).
2.2 Linguistic Filters

We now face a choice: we can either isolate general collocations using statistics and then apply linguistic filters to retain only morphosyntactic sequences that characterize base-terms, or apply, first, linguistic filters, and then statistics. It is the latter strategy that has been adopted; indeed, the former has already been proposed by [Smadja, 1993] and does not seem adequate to take into account term variations. Computing statistics first involves collecting the possible collocates using a window of an arbitrary size. The size of the window is generally a compromise between a small and a large window: if you take a small window size, you miss many occurrences, mainly morphosyntactic variants such as base-terms modified by several inserted modifiers, very frequent in French, and multiple coordinated base-terms; if you take a longer one, you obtain occurrences that do not refer to the same conceptual entity, many ill-formed sequences that do not characterize terms, and, thus, frequency counts may be wrong. Using linguistic filters based on part-of-speech tags therefore appears to be the best solution. Moreover, as the patterns that characterize base-terms can be described by regular expressions, the use of finite automata seems a natural way to extract and count the occurrences of the candidate base-terms.

The frequency counts of the occurrences of the candidate terms are crucial, as they are the parameters of the statistical scores. A wrong frequency count implies wrong or irrelevant values of the statistical scores. The objective is to optimize the count of base-term occurrences and to minimize the count of
incorrect occurrences. Graphic, orthographic, and morphosyntactic variants of base-terms (except synonymic variants) are taken into account, as well as some syntactic variations that affect the base-term structure: coordination and insertion of modifiers. Coordination of two base-terms rarely leads to the creation of a new term of length greater than 2, so it is reasonable to think that the sequence equipements de modulation et de demodulation (modulation and demodulation equipment) is equivalent to the sequence equipement de modulation et equipement de demodulation (modulation equipment and demodulation equipment). Insertion of modifiers inside a base-term structure does not raise problems, except when this modifier is an adjective inserted inside an N1 PREP N2 structure. Let us examine the sequence antenne parabolique de reception (parabolic receiving antenna). This sequence could be a term of length 3 (obtained either by overcomposition or by modification) or a modified base-term, namely, antenne de reception, modified by the inserted adjective parabolique. On the one hand, we do not want to extract terms of length greater than 2, but on the other hand, it is not possible to ignore adjective insertion. So, we have chosen to accept insertion of the adjective inside the N1 PREP N2 structure. This choice implies the extraction of terms of length 3 of N1 ADJ PREP N2 structure that are considered as terms of length 2. However, such cases are rare, and the majority of N1 ADJ PREP N2 sequences refer to an N1 PREP N2 base-term modified by an adjective.

Each occurrence of a base-term is counted equally; we consider that there is equiprobability of the appearance of the term in the corpus. The occurrences of morphological sequences which characterize base-terms are classified under pairs: a pair is composed of two main items in a fixed order and collects all the sequences where the two lemmas of the pair appear in one of the allowed morphosyntactic patterns; for example, the sequences ligne d'abonne, lignes de l'abonne (subscriber lines), and ligne numerique d'abonne (digital subscriber line) are each one occurrence of the pair (ligne, abonne). If we have the coordinated sequence lignes et services d'abonne (subscriber lines and services), we count one occurrence for the pair (ligne, abonne) and one occurrence for the pair (service, abonne). Our program scans the corpus and counts and extracts collocations whose syntax characterizes base-terms; a minimal sketch of this kind of pattern-based extraction is given below. Under each pair, we find all the different occurrences found with their frequencies and their location in the corpus (file, sentence, item). This program runs fast: for example, it took 2 minutes to extract 8,000 pairs from our corpus SCH (200,000 words) for the structure N1 PREP (DET) N2 on a Sparcstation ELC under SunOS Release 4.1.3.
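The following sketch (hypothetical Python; Daille's system used finite automata over the IBM-France tagger output, not this code) illustrates the idea of matching base-term patterns over a tagged, lemmatized sentence and counting candidate pairs.

```python
import re
from collections import defaultdict

# A tagged sentence as (lemma, part-of-speech) pairs; tags are invented for illustration.
sentence = [("ligne", "N"), ("numerique", "ADJ"), ("de", "PREP"),
            ("le", "DET"), ("abonne", "N")]

# Encode the tag sequence as one letter per token so patterns can be regular expressions:
# N = noun, A = adjective, P = preposition, D = determiner.
tag_codes = {"N": "N", "ADJ": "A", "PREP": "P", "DET": "D"}
tag_string = "".join(tag_codes.get(tag, "x") for _, tag in sentence)

# Two of the allowed patterns: N ADJ, and N1 (ADJ) PREP (DET) N2
# (the optional inserted adjective inside N1 PREP N2 is accepted, as discussed above).
patterns = [re.compile("NA"), re.compile("NA?PD?N")]

pair_counts = defaultdict(int)
for pattern in patterns:
    for match in pattern.finditer(tag_string):
        span = sentence[match.start():match.end()]
        main_items = [lemma for lemma, tag in span if tag in ("N", "ADJ")]
        # The pair is the first and last main item of the match; for the
        # N1 ADJ PREP (DET) N2 case this keeps (N1, N2), treating the inserted
        # adjective as a modifier. All candidates are counted, terms or not.
        pair_counts[(main_items[0], main_items[-1])] += 1

print(dict(pair_counts))
# {('ligne', 'numerique'): 1, ('ligne', 'abonne'): 1}
```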
Results presented in table 3.1 summarize the frequencies of co-occurrence pairs extracted from the two corpora, SCH and CBB (Communication Blue Book, 800,000 words).

Table 3.1
Extraction results

Corpora              N ADJ     N1 (PREP (DET)) N2
SCH
  1 occurrence       3,144      6,834
  2 occurrences        655      1,503
  > 2 occurrences      684      1,616
  total              4,483      9,953
CBB
  1 occurrence       5,201     12,167
  2 occurrences      1,507      3,481
  > 2 occurrences    2,113      6,288
  total              8,821     21,936
In order to manually check whether the candidate base-terms we have identified are indeed terms when seen in context, and that we have not missed any, we have developed a shell program which inserts opening and closing brackets around the occurrences of the candidate pairs throughout the corpora. A sample of the CBB corpus where candidate pairs are bracketed is the following:

582822 on peut montrer que toute [2 [1 onde radioelectrique ]1 a [3 polarisation ]2 elliptique ]3 peut etre considere comme la somme de deux [4 composantes orthogonales ]4 , par exemple de deux [6 [5 ondes a [7 polarisation ]5 rectiligne ]7 perpendiculaires ]6 ou d'une [8 onde a [10 [9 polarisation ]8 circulaire ]9 levogyre ]10 et d'une [11 onde a [13 [12 polarisation ]11 circulaire ]12 dextrogyre ]13 .

582823 une [2 [1 caracteristique importante ]1 du [3 diagramme ]2 de [4 rayonnement ]3 d'une antenne ]4 ( notamment lorsque celle-ci est mise en oeuvre dans un [5 systeme de [6 reutilisation ]5 des frequences ]6 [7 par double polarisation ]7 ) est sa [8 purete de polarisation ]8 .

The brackets are indexed to determine the right opening and closing bracket of a candidate term; indeed, some candidate terms can also appear inside more complex candidate terms:
in the above example, the N ADJ sequence (onde radioelectrique) inside the N1 ADJ PREP N2 sequence (onde radioelectrique a polarisation) corresponds to a term of structure N1 PREP N2 (onde a polarisation) modified by the inserted adjective (radioelectrique). Twenty bracketed sentences taken at random have been checked, and only two occurrences of a researched pattern were missing:

. The first miss belongs to the N ADJ pattern: the adjective brouilleuse appears inside parentheses and is separated from the noun composante by another item itself inside parentheses: une [5 composante contrapolaire ]5 (x) ( brouilleuse ).
. The second miss also concerns the N ADJ pattern: in the sequence les polarisations quasi circulaires, the pair (polarisation, circulaire) is not recognized. This error comes from the tagger: indeed, the N ADJ finite-state machine asks for an agreement between the noun and the adjective, and in this sequence polarisation is well tagged (feminine substantive in the plural inflection) but circulaire badly (adjective in the masculine and plural inflection). So, as there is no agreement between the adjective and the noun, this occurrence is not taken into account.

These misses show that:
1. It is not possible to encode in the finite-state machine all the marginal possible sequences where a candidate base-term appears.
2. The accuracy of this base-term extraction program relies upon the accuracy of the tagger.

Now that we have obtained a set of pairs, each pair representing a candidate base-term, we apply statistical scores to distinguish terms from non-terms among the candidates.

3 Lexical Statistics
The problem to solve now is to discover which statistical score is the best to isolate terms among our list of candidates. So, we compute several measures: frequencies, association criteria, Shannon diversity, and distance scores. All these measures cannot be used for the same purpose: frequencies are the parameters of the association criteria, association criteria propose a conceptual sort of the pairs, and Shannon diversity and distance measures are not discriminatory scores but provide other types of information.
3.1 Frequencies and Association Criteria

From a statistical point of view, the two lemmas of a pair could be considered as two qualitative variables whose link has to be tested. A contingency table is defined for each pair (Li, Lj):

                       Lj     Lj' with j' != j
Li                     a      b
Li' with i' != i       c      d

where:
a stands for the frequency of pairs involving both Li and Lj;
b stands for the frequency of pairs involving Li and Lj' with j' != j;
c stands for the frequency of pairs involving Li' with i' != i and Lj; and
d stands for the frequency of pairs involving Li' and Lj'.

The statistical literature proposes many scores which can be used to test the strength of the bond between the two variables of a contingency table. Some are well known, such as the association ratio, close to the concept of mutual information introduced by [Church and Hanks, 1990]:

IM = log2 ( a / ((a + b)(a + c)) )                                          (1)

the φ^2 coefficient introduced by [Gale and Church, 1991]:

φ^2 = (ad - bc)^2 / ((a + b)(a + c)(b + d)(c + d))                          (2)

or the Loglike coefficient introduced by [Dunning, 1993]:

Loglike = a log a + b log b + c log c + d log d
          - (a + b) log (a + b) - (a + c) log (a + c)
          - (b + d) log (b + d) - (c + d) log (c + d)
          + (a + b + c + d) log (a + b + c + d)                             (3)
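For concreteness, here is a small Python sketch of these three scores computed from the contingency counts a, b, c, d; the counts in the example call are invented, not taken from the SCH corpus.

```python
import math

def association_scores(a, b, c, d):
    """Association ratio (IM), phi^2, and Dunning's log-likelihood score
    for one pair, computed from its 2x2 contingency counts."""
    im = math.log2(a / ((a + b) * (a + c)))                                   # equation (1)
    phi2 = (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))     # equation (2)

    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0                              # convention 0 log 0 = 0

    loglike = (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
               - xlogx(a + b) - xlogx(a + c)
               - xlogx(b + d) - xlogx(c + d)
               + xlogx(a + b + c + d))                                        # equation (3)
    return im, phi2, loglike

print(association_scores(a=110, b=16, c=23, d=8000))
```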
A property of these scores is that their values increase with the strength of the bond of the lemmas. We have tried out several scores (more than 10), including IM, φ^2, and Loglike, and we have sorted the pairs following the score value. Each score proposes a conceptual sort of the pairs. This sort, however, could put at the top of the list compounds that belong to general language rather than to the telecommunications domain. Since we want to obtain a list
of telecommunication terms, it is essential to evaluate the correlation between the score values and the pairs and to find out which scores are the best to extract terminology. Therefore, we compared the values obtained for each score to a reference list of the domain. We obtained a list of over 6,000 French terms from EURODICAUTOM, the terminology data bank of the EEC, telecommunications section, which was developed by experts. We performed the evaluation on 2,200 French pairs (only pairs which appear at least twice in the corpus have been retained) of N1 de (DET) N2 structure, the most frequent and common French term structure, extracted from our corpus SCH (200,000 words). To limit the size of the reference list, we retained the intersection between our list of candidates and the EURODICAUTOM list, 1,200 pairs, thus getting rid of terms which we would not find in our corpus anyway, even if they belong to this technical domain. We assume that the reference list, 1,200 pairs of our list of 2,200 candidate pairs, is as complete as possible, so that base-terms that we might identify in our corpus are indeed found in the reference list.

Each score yields a list where the candidates are sorted according to the decreasing score value. We have divided this list into equivalence classes which generally contain 50 successive pairs. The results of a score are represented graphically by a histogram in which the x-axis represents the different classes and the y-axis the ratio of good pairs. If all the pairs in a class belong to the reference list, we obtain the maximum ratio of 1; if none of the pairs appears in the reference list, the minimum ratio of 0 is reached. The ideal score should assign high (low) values to good (bad) pairs, that is, candidates which belong (do not belong) to the reference list; in other words, the histogram of the ideal score should assign to equivalence classes containing the high (low) values of the score a ratio close to 1 (0). We are not going to present here all the histograms obtained (see [Daille, 1994]). All of them show a general trend that confirms that the score values increase with the strength of the bond of the lemmas. However, the growth is more or less clear, with more or less sharp variations. The most beautiful histogram is that of the simple frequency of the pair (see figure 3.1). This histogram shows that the more frequent the pair is, the more likely the pair is a term. Frequency is the most significant score for detecting terms of a technical domain.
Figure 3.1 Frequency histogram.
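A sketch of this histogram-style evaluation (Python; the candidate pairs, scores, and reference list below are invented placeholders):

```python
def class_ratios(scored_pairs, reference, class_size=50):
    """Sort candidate pairs by decreasing score, split them into equivalence classes
    of `class_size` successive pairs, and return, for each class, the ratio of pairs
    that belong to the reference term list."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    ratios = []
    for start in range(0, len(ranked), class_size):
        block = ranked[start:start + class_size]
        good = sum(1 for pair, _ in block if pair in reference)
        ratios.append(good / len(block))
    return ratios

# Invented miniature example: 4 candidates, 2 of which are in the reference list.
scored = [(("largeur", "bande"), 1328.0), (("cas", "transmission"), 310.0),
          (("bande", "base"), 745.0), (("type", "antenne"), 12.0)]
reference = {("largeur", "bande"), ("bande", "base")}
print(class_ratios(scored, reference, class_size=2))   # [1.0, 0.0]
```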
This result contradicts numerous results of lexical resources, which claim that association criteria are more significant than frequency: for example, all the most frequent pairs whose terminological status is undoubted share low values of association ratio [equation (1)], as, for example, reseau a satellites (satellite network) IM = 2.57, liaison par satellite (satellite link) IM = 2.72, circuit telephonique (telephone circuit) IM = 3.32, station spatiale (space station) IM = 1.17, etc. The remaining problem with the sort proposed by frequency is that it very quickly integrates bad candidates, that is, pairs which are not terms. So, we have preferred to elect the Loglike coefficient [equation (3)] as the best score. Indeed, the Loglike coefficient, which is a real statistical test, takes into account the pair frequency but accepts very little noise for high values. To give an element of comparison, the first bad candidate with frequency for the general pattern N1 (PREP (DET)) N2 is the pair (cas, transmission), which appears in 56th place; this pair, which is also the first bad candidate with Loglike, appears in 176th place. We give in table 3.2 the topmost 11 French pairs sorted by the Loglike coefficient (Logl); Nbc is the number of occurrences of the pair and IM the value of the association ratio.
Table 3.2
Topmost pairs of N1 (PREP (DET)) N2 structure

Pair                              Most frequent sequence                                          Logl   Nbc   IM
(largeur, bande)                  largeur de bande (197) (bandwidth)                              1328   223   5.74
(temperature, bruit)              temperature de bruit (110) (noise temperature)                   777   126   6.18
(bande, base)                     bande de base (142) (baseband)                                   745   145   5.52
(amplificateur, puissance)        amplificateur(s) de puissance (137) (power amplifier)            728   137   5.66
(temps, propagation)              temps de propagation (93) (propagation delay)                    612    94   6.69
(reglement, radiocommunication)   reglement des radiocommunications (60) (radio regulation)        521    60   8.14
(produit, intermodulation)        produit(s) d'intermodulation (61) (intermodulation product)      458    61   7.45
(taux, erreur)                    taux d'erreur (70) (error ratio)                                  420    70   6.35
(mise, oeuvre)                    mise en oeuvre (47) (implementation)                              355    47   7.49
(telecommunication, satellite)    telecommunication(s) par satellite (88) (satellite communications) 353   99   4.09
(bilan, liaison)                  bilan de liaison (37) (link budget)                               344    55   6.42

3.2 Diversity

Diversity, introduced by [Shannon, 1949], characterizes the marginal distribution of the lemmas of a pair through the range of pairs. Its computation uses a contingency table of length n; we give below as an example the contingency table that is associated with the pairs of the N ADJ structure:
Ni \ Adjj     progressif           porteur           ...     Total
onde              19                  4              ...     nb(onde, .)
cornet             9                  0              ...     nb(cornet, .)
...               ...                ...             ...     ...
Total         nb(., progressif)   nb(., porteur)     ...     nb(., .)
The line counts nb_i., which are found in the rightmost column, represent the distribution of the adjectives with regard to a given noun. The column counts nb_.j, which are found on the last line, represent the distribution of the nouns with regard to a given adjective. These distributions are called the "marginal distributions" of the nouns and the adjectives for the N ADJ structure. Diversity is computed for each lemma appearing in a pair, using the formulas:

H_i. = nb_i. log nb_i. - Σ_{j=1..n} nb_ij log nb_ij                          (4)

H_.j = nb_.j log nb_.j - Σ_{i=1..n} nb_ij log nb_ij

For example, using the contingency table of the N ADJ structure above, the diversity of the noun onde is equal to:

H(onde, .) = nb(onde, .) log nb(onde, .) - (nb(onde, progressif) log nb(onde, progressif) + nb(onde, porteur) log nb(onde, porteur) + ...)

We note H1 the diversity of the first lemma of a pair, and H2 the diversity of the second lemma. We take into account the diversity normalized by the number of occurrences of the pairs:

h_i = H_i. / nb_i.

h_j = H_.j / nb_.j

The normalized diversities h1 and h2 are defined from H1 and H2. The normalized diversity provides interesting information about the distribution of the pair lemmas in the set of pairs. A lemma with a high diversity means that it appears in several pairs in equal proportion; conversely, a lemma that appears only in one pair has a zero diversity (the minimal value), and this whatever the frequency of the pair. High values of h1 applied to the pairs of N ADJ structure characterize nouns that could be seen as keywords of the domain: reseau (network), signal, antenne (antenna), satellite. Conversely, high values of h2 applied to the pairs of N ADJ structure characterize adjectives that do not take part in base-terms, such as necessaire (necessary), suivant (following), important, different (various), tel (such), etc. The pairs with a zero diversity on one of their lemmas receive high values of association ratio and other association criteria, and a nondefinite value of the Loglike coefficient. However, the diversity is more precise because it indicates whether the two lemmas appear only together, as for (ocean, indien) (indian ocean) (H1 = h1 = H2 = h2 = 0),
or, if not, which of the two lemmas appears only with the other, as for (reseau, maille) (mesh network) (H2 = h2 = 0), where the adjective maille appears only with reseau, or for (codeur, ideal) (ideal coder) (H1 = h1 = 0), where the noun codeur appears only with the adjective ideal. Other examples are: (ile, salomon) (Solomon island), (helium, gazeux) (helium gas), (suppresseur, echo) (echo suppressor). These pairs collect many frozen compounds and collocations of the current language. In future work, we will investigate how to incorporate the good results provided by diversity into an automatic extraction algorithm.
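A sketch of the diversity computation (Python; the pair counts are invented and only loosely echo the onde/cornet table above):

```python
import math
from collections import defaultdict

# Invented pair counts nb_ij for the N ADJ structure: (noun, adjective) -> frequency.
pair_counts = {("onde", "progressif"): 19, ("onde", "porteur"): 4,
               ("cornet", "progressif"): 9, ("reseau", "maille"): 7}

def diversities(pair_counts):
    """H and normalized h for each noun (first lemma) and adjective (second lemma)."""
    noun_rows, adj_cols = defaultdict(dict), defaultdict(dict)
    for (noun, adj), nb in pair_counts.items():
        noun_rows[noun][adj] = nb
        adj_cols[adj][noun] = nb

    def h_of(row):
        total = sum(row.values())                     # marginal count nb_i. (or nb_.j)
        H = total * math.log(total) - sum(nb * math.log(nb) for nb in row.values())
        return H, H / total                           # (diversity, normalized diversity)

    nouns = {n: h_of(row) for n, row in noun_rows.items()}
    adjs = {a: h_of(col) for a, col in adj_cols.items()}
    return nouns, adjs

nouns, adjs = diversities(pair_counts)
print(adjs["maille"])   # (0.0, 0.0): maille occurs only with reseau, so H = h = 0
print(nouns["onde"])    # positive: onde combines with several adjectives
```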
3.3 Distance Measures

French base-terms often accept modifications of their internal structure, as has been demonstrated previously. Each time an occurrence of a pair is extracted and counted, two distances are computed: the number of items, Dist, and the number of main items, MDist, which occur between the two lemmas. Then, for each pair, the mean and the variance of the number of items and main items are computed. The variance formula is:

V(X) = (1/n) Σ_i (x_i - x̄)^2        σ(X) = √V(X)

The distance measures bring interesting information concerning the morphosyntactic variations of the base-terms, but they do not allow making a decision on the status of term or non-term of a candidate. A pair that has no distance variation, whatever the distance, may or may not be a term; we give now some examples of pairs which have no distance variation and which are not terms: paire de signal (a pair of signals), type d'antenne (a type of antenna), organigramme de la figure (diagram of the figure), etc. We illustrate below how the distance measures allow attributing to a pair its elementary type automatically, for example, either N1 N2, N1 PREP N2, N1 PREP DET N2, or N1 ADJ PREP (DET) N2 for the general N1 (PREP (DET)) N2 structure (a small sketch of this typing step follows the examples):

1. Pairs with no distance variation, V(X) = 0
(a) N1 N2: Dist = 2, MDist = 2
. liaison semaphore, liaisons semaphores (common signaling link(s))
. canal support, canaux support, canaux supports (bearer channel)
(b) N1 PREP N2: Dist = 3, MDist = 2
. accuse(s) de reception (acknowledgement of receipt)
. refroidissement a air, refroidissement par air (cooling by air)
(c) N1 PREP DET N2: Dist = 4, MDist = 2
. sensibilite au bruit (susceptibility to noise)
. reconnaissance des signaux (signal recognition)
(d) N1 ADJ PREP N2: Dist = 4, MDist = 3
. reseau local de lignes, reseaux locaux de lignes (local line networks)
. service fixe par satellite
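The sketch referred to above (hypothetical Python) assigns an elementary type from constant Dist and MDist values; the mapping simply encodes the cases listed in the examples.

```python
def elementary_type(dist, mdist, var_dist=0.0):
    """Assign the elementary structure of a pair from its (constant) distances.
    Only meaningful when the distance shows no variation (V(X) = 0)."""
    if var_dist != 0.0:
        return None                       # distance varies: no single elementary type
    table = {(2, 2): "N1 N2",
             (3, 2): "N1 PREP N2",
             (4, 2): "N1 PREP DET N2",
             (4, 3): "N1 ADJ PREP N2"}
    return table.get((dist, mdist), "unknown")

print(elementary_type(3, 2))   # 'N1 PREP N2', e.g., accuse de reception
```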
Conclusion
We have presented a combined approach for automatic term extraction. Starting from a first selection of lemma pairs representing candidate terms from a morphosyntactic point of view, we have applied and evaluated several statistical scores. The results were surprising: most association criteria (e.g., mutual information) did not give good results, contrary to frequency. This bad behavior of the association criteria could be explained by the introduction of the linguistic filters. We can note in any event that frequency undoubtedly characterizes terms, contrary to association criteria, which select in their high values frozen compounds belonging to general language. However, we preferred to elect the Loglike criterion rather than frequency as the best score. The latter takes into account the frequency of the pairs but provides a conceptual sort of high accuracy. Our system, which uses finite automata, allows us to improve the results of the extraction of lexical resources and to demonstrate the efficiency of incorporating linguistics in a statistical system. This method has
been extended to bilingual terminology extraction using aligned corpora [Daille et al., 1994].

Acknowledgments

I thank the IBM-France team, and in particular Eric Gaussier and Jean-Marc Lange, for the tagged and lemmatized version of the French corpus and for their evaluation of statistics, and Owen Rambow for his review of the manuscript. Research was supported by the European Commission and IBM-France, through the ET-10/63 project.

References

Didier Bourigault. Acquisition de terminologie. PhD thesis, EHESS, France, 1994.

Nicoletta Calzolari and Remo Bindi. Acquisition of lexical information from a large textual Italian corpus. In Proceedings of the Thirteenth International Conference on Computational Linguistics, Helsinki, 1990.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1): 22-29, 1990.

Beatrice Daille. Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, University of Paris 7, 1994.

Beatrice Daille, Eric Gaussier, and Jean-Marc Lange. Towards automatic extraction of monolingual and bilingual terminology. In Proceedings of the Fifteenth International Conference on Computational Linguistics, COLING-94, Kyoto, Japan, 1994.

Anne-Marie Derouault. Modelisation d'une langue naturelle pour la disambiguation des chaines phonetiques. PhD thesis, University of Paris VII, 1985.

Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1): 61-76, 1993.

Marc El-Beze. Les Modeles de Langage Probabilistes: Quelques Domaines d'Applications. Habilitation a diriger les recherches [thesis required in France to be a professor], University of Paris-Nord, 1993.

William A. Gale and Kenneth W. Church. Concordances for parallel texts. In Proceedings of the Seventh Annual Conference of the UW Centre for the New OED and Text Research, Using Corpora, pp. 40-62, Oxford, 1991.

Gregory Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer, 1994.

Gaston Gross, Jacques Chaurand, Robert Vives, Michel Mathieu-Colas, and Pierre Billy. Typologie des noms composes. Technical report, A.T.P. Nouvelles recherches sur le langage, University of Paris 13, Villetaneuse, 1986.

D. Hindle. Noun classification from predicate-argument structures. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pp. 268-275, Pittsburgh, Penna., 1990.
Christian Jacquemin. Transformations des noms composes. PhD thesis, University of Paris 7, 1991.

Pierre Lafon. Depouillements et Statistiques en Lexicometrie. Geneva, Slatkine-Champion, 1984.

Michel Mathieu-Colas. Typologie des noms composes. Technical Report 7, Programme de recherches coordonnees "Informatique et linguistique," University of Paris 13, Paris, France, 1988.

Michele Noally. Le substantif epithete. Paris, PUF, 1990.

Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, June 1993.

C. E. Shannon. A Mathematical Theory of Communication. Urbana: University of Illinois Press, 1949.

Frank Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1): 143-177, 1993.

Frank A. Smadja and Kathleen R. McKeown. Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pp. 252-259, 1990.
Chapter 4
Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to a Statistical Word Grouping System

Vasileios Hatzivassiloglou
Is linguistic knowledge useful for a particular task, and what kinds of linguistic knowledge furnish the most benefit? Whether in the affirmative or in the negative, the answer to these questions often comes from a researcher's unsystematic past experience, or, equally often, from intellectual biases, rather than careful exploration of the alternatives.

Hatzivassiloglou's chapter, "Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to a Statistical Word Grouping System," represents an exemplary model of how such a careful exploration can be done. Like Daille (chapter 3) and others, Hatzivassiloglou adopts an approach in which linguistic knowledge is used to define the space of lexical co-occurrences, and then straightforward statistical methods are applied to the resulting co-occurrence frequencies. What distinguishes this work from most other efforts of this kind, however, is its attention to methodology: the variables of interest are carefully motivated and defined, a complete experimental design is used, and observed differences are evaluated in terms of statistical significance. Moreover, Hatzivassiloglou explicitly considers the cost of incorporating linguistic knowledge into his system, both in development and in on-line performance. Since the most successful combinations of linguistic and statistical approaches still rely on relatively shallow methods, this is an underappreciated topic; for real applications, deciding whether deeper linguistic analysis is worthwhile will require not only a rigorous analysis of performance, as illustrated here, but also a careful consideration of whether the benefits of added linguistic sophistication balance out the costs. - Eds.
1 Introduction
Historically, pure statistical, corpus-based approaches were among the first ones that were applied to natural language processing problems in general and lexical knowledge acquisition problems in particular. Such efforts took place very early: in the 1950s and early 1960s a considerable amount of research work was under way addressing problems such as machine translation and lexicon and thesaurus construction with purely statistical methods. However, the lack of sophisticated statistical models and powerful hardware led to rather disappointing results, and the statistical approach was largely abandoned in the 1970s in favor of knowledge-based, artificial intelligence (AI) approaches.

Yet, pure knowledge-based approaches, which use knowledge collected by human experts and entered in the system as an external source, also seem insufficient for providing an adequate solution to natural language problems. Such systems are effective when the domain and the vocabulary are tightly controlled, but fail to scale up in more general settings where knowledge acquisition becomes a major bottleneck. When this limitation of current knowledge-based systems became apparent, interest in statistical methods was renewed in the mid- and late 1980s, and continues unabated until now. This renaissance of statistical methods was particularly helped by recent successes of such "knowledge-poor" methods in problems such as part-of-speech tagging [Church, 1988; Kupiec, 1992; Cutting et al., 1992] and speech recognition [Waibel and Lee, 1990].

Once more, however, systems based on statistical methods alone have not been totally successful in solving most of the natural language processing problems they addressed. Consequently, a number of researchers have turned to combining linguistic, hand-encoded knowledge with statistical techniques as a means to improve overall performance, since it is reasonable to expect that the combined approach will potentially offer significantly better performance over either methodology alone. The interaction of statistical and linguistics-based components in a system is thus a topic of considerable current interest. However, when we consider such a hybrid system, several questions arise. Perhaps the most pressing ones are whether the linguistic knowledge actually helps in improving performance, and if so, whether the improvement is worth the effort needed to implement the linguistic modules and incorporate them in the system. In most cases answers to these questions, if supplied at all, are derived from intuitive beliefs and anecdotal evidence, rather than from a rigorous, quantitative comparison.
This chapter supplements these intuitive beliefs with actual evaluation data, obtained when several symbolic, linguistics-based modules were integrated in a statistical system. As the basis for our comparative analysis, we used a system we previously developed for the separation of adjectives into semantic groups [Hatzivassiloglou and McKeown, 1993]. We identified several different types of shallow linguistic knowledge that can be efficiently introduced into our system. We evaluated the system with and without each such feature, obtaining an estimate of each feature's positive or negative contribution to the overall performance. By matching cases where all system parameters are the same except for one feature, we assess the statistical significance of the differences found. Also, a statistical model of the system's performance in terms of the active features for each run offers a view of the contributions of features from a different angle, contrasting the significance of linguistic features (or other modeled system parameters) against one another.

Our analysis of the experimental results showed that many forms of linguistic knowledge make a significant positive contribution to the performance of the system. Other statistical systems that address word classification problems do not emphasize the use of linguistic knowledge and do not deal with a specific word class [Brown et al., 1992], or do not exploit as much linguistic knowledge as we do [Pereira et al., 1992]. As a result, a coarser classification is usually produced. In contrast, by limiting the system's input to adjectives, we can take advantage of specific syntactic relationships and additional filtering procedures that apply only to particular word classes. These sources of linguistic knowledge in turn provide the extra edge for discriminating among the adjectives at the semantic level.

In what follows, we briefly review our adjective grouping system, and discuss how the lexical semantic knowledge extracted by it can be used in a variety of natural language processing applications. We then present several different types of shallow linguistic knowledge that can be, and have been, efficiently introduced into this system. In section 6 we give the results of our evaluation of the system's performance on different combinations of features (linguistic modules) and analyze their statistical significance. We also estimate the relative importance of the linguistic modules and we measure the overall effect linguistic knowledge has in the word grouping system. We then proceed to discuss the cost of incorporating linguistic knowledge in the statistical system, and conclude by presenting arguments in favor of the relevance of these results to statistical approaches for other natural language processing problems.
2 Overview of the Adjective Grouping System
2.1 Definitions
Our adjective grouping system [Hatzivassiloglou and McKeown, 1993] starts with a set of adjectives to be clustered into groups of semantically related words. In linguistics, there is a long tradition of work associated with "semantic relatedness" and "groups of semantically related words." Trier [1934] proposed that the vocabulary of a natural language is organized in groups that he called lexical fields (Wortfelder), which represent conceptual fields (Sinnfelder), the latter being sets of closely related concepts. Trier's lexical fields closely correspond to the "groups of semantically related words" that our system aims to find. The work of Trier focused more on the diachronic and cross-linguistic aspect of lexical semantics, but a considerable number of linguists followed and extended his approach; see [Lehrer, 1974] for a comprehensive treatment of the theory of lexical and semantic fields. Unfortunately, the theory is mostly based on intuitive judgments; as Lyons [1977, p. 277] remarks,

What is lacking so far . . . is a more explicit formulation of the criteria which define a lexical field . . . The majority of lexical fields are not so neatly structured or as clearly separated as Trier originally suggested.
Given the lack of a formal definition of semantic groups, we will use an informal, intuitive definition for semantic relatedness, and treating semantic relatedness as an equivalence relation in the vocabulary, we will then define the semantic groups as the equivalence classes induced by this relation. What we mean by "semantically related" is that words belonging to the same group should consistently express a closeness in their meaning, for example, by being synonyms, antonyms, complementary terms, hyponyms, special cases of the same superordinate concept (co-hyponyms), or terms describing the same property. Informal criteria such as the above have been frequently used in linguistics. As in many cases where an informal definition is used, results based on this definition are open to criticism, such as the remarks by Lyons above, and there is disagreement on what exactly is the correct answer, given the same input. Correspondingly, there is difficulty in interpreting the results, and special care is needed during evaluation. Nevertheless, humans are able to judge the quality of semantic groupings, even if they frequently disagree and cannot fully externalize the criteria they use. Furthermore, groupings produced independently by several humans tend to agree to a degree that cannot be expected by chance, thus establishing that the organization of words into semantic classes is not arbitrary or artificial.
2.2 Algorithm and Implementation
Our system generally operates on a set of words from a particular syntactic class, using distributional criteria to measure the semantic similarity between these words. In other words, the semantic relatedness between words X and Y is measured on the basis of the similarity of the co-occurrence patterns of these words with words from other informative syntactic classes. The relevant classes of words are selected so that semantic constraints are expected to apply between them and the words from the original set that co-occur with them. In the version of the system used in the experiments described in this chapter, we group adjectives and we use nouns that are modified by the adjectives as this second informative syntactic class. We are currently extending the system to group nouns on the basis of co-occurring adjectives and verbs that participate in appropriate syntactic relationships with these nouns.

The system has access to a text corpus, which has been automatically tagged with part-of-speech information. All lexical semantic knowledge for the grouping task is extracted from the corpus; no semantic information about the adjectives or any other words is available to the system. The system operates by extracting modified nouns for each adjective, and, optionally, pairs of adjectives that we can expect to be semantically unrelated on linguistic grounds. The latter are adjectives that either modify the same noun in the same noun phrase (e.g., big red truck) or one of them modifies the other (e.g., light blue scarf); see [Hatzivassiloglou and McKeown, 1993] and especially [Hatzivassiloglou, 1995a] for a detailed analysis of this phenomenon. The estimated distribution of modified nouns for each adjective is represented as a vector, and a similarity score is assigned to each possible pair of adjectives. This is based on Kendall's τ, a nonparametric, robust estimator of correlation [Kendall, 1938, 1975].

Kendall's τ compares the two vectors by repeatedly comparing two pairs of their corresponding observations. Formally, if (X_i, Y_i) and (X_j, Y_j) are two pairs of frequencies for the adjectives X and Y that we are interested in grouping, on the nouns i and j respectively, we call these pairs concordant if X_i > X_j and Y_i > Y_j, or if X_i < X_j and Y_i < Y_j. If X_i > X_j but Y_i < Y_j, or if X_i < X_j and Y_i > Y_j, the two pairs are discordant. In general, if the distributions of the two random variables X and Y across the various modified nouns are similar, we expect a large number of concordances, and consequently a small number of discordances, since the total number of pairs of observations is fixed for the two variables. This is justified by the fact that a large number of concordances indicates that when one of the variables takes a "large" value the other also takes a "large" value on the corresponding observation; of course, "large" is a term that is interpreted in a relative manner for each variable.
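To make the vector comparison concrete, the sketch below (illustrative Python, not code from the chapter; the adjectives and counts are invented) counts concordant and discordant pairs of observations for two adjectives' noun-frequency vectors, skipping noun positions where both frequencies are zero, in the spirit of the asymmetric variant described here. The formal definition of τ in terms of these counts follows.

```python
def concordance_counts(x, y):
    """Count concordant and discordant pairs of observations for two
    co-occurrence frequency vectors x and y (one entry per modified noun),
    ignoring positions where both frequencies are zero."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if xi != 0 or yi != 0]
    concordant = discordant = 0
    for i in range(len(obs)):
        for j in range(i + 1, len(obs)):
            dx = obs[i][0] - obs[j][0]
            dy = obs[i][1] - obs[j][1]
            if dx * dy > 0:        # both differences point the same way
                concordant += 1
            elif dx * dy < 0:      # differences point in opposite ways
                discordant += 1
            # ties (dx == 0 or dy == 0) are left to the tie correction
    return concordant, discordant

# Hypothetical frequency vectors over the same ordered list of nouns.
freq_hostile    = [12, 0, 3, 5, 0, 1]
freq_unfriendly = [10, 1, 2, 7, 0, 0]
print(concordance_counts(freq_hostile, freq_unfriendly))
```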
Kendall's τ is defined as p_c − p_d, where p_c and p_d are the probabilities of observing a concordance or discordance respectively. It ranges from −1 to +1, with +1 indicating complete concordance, −1 complete discordance, and 0 no correlation between X and Y. We use an unbiased estimator of τ, namely the difference of the corresponding estimated probabilities (p̂_c − p̂_d), which incorporates a correction for ties, that is, pairs of observations where X_i = X_j or Y_i = Y_j [Kendall, 1975, p. 75]. The estimator is also made asymmetric by ignoring observations for which the noun frequency is zero for both X and Y. For more details on the similarity measure and its advantages for our task, see [Hatzivassiloglou, 1995a].

Using the computed similarity scores and, optionally, the established relationships of non-relatedness, a nonhierarchical clustering method [Späth, 1985] assigns the adjectives to groups in a way that maximizes the within-group similarity (and therefore also maximizes the between-group dissimilarity). The system is given the number of groups to form as an input parameter.[1] The clustering algorithm operates in an iterative manner, starting from a random partition of the adjectives. An objective function Φ is used to score the current clustering. Each adjective is considered in turn and all possible moves of that adjective to another cluster are considered. The move that leads to the largest improvement in the value of Φ is executed, and the cycle continues through the set of words until no more improvements to the value of Φ are possible. Note that in this way a word may be moved several times before its final group is determined. This is a hill-climbing method and therefore is guaranteed to converge in finite time, but it may lead to a local minimum of Φ, inferior to the global minimum that corresponds to the optimal solution. To alleviate this problem, the partitioning algorithm is called repeatedly[2] with different random starting partitions and the best solution produced from these runs is kept. Figure 4.1 shows an example clustering produced by our system for one of the adjective sets analyzed in this chapter.

1. Determining this number from the data is probably the hardest problem in cluster analysis in general; see [Kaufman and Rousseeuw, 1990]. However, a reasonably good value for this parameter can be selected for our problem using heuristic methods.
2. In the current implementation, 50 times for each value of the number of clusters parameter.
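The clustering step can be sketched as follows. This is hypothetical code, not the chapter's implementation: it simply maximizes the sum of pairwise similarities within groups as a stand-in for the objective function Φ, and the words and similarity values are invented.

```python
import random

def cluster(words, sim, k, restarts=50, seed=0):
    """Nonhierarchical clustering by hill climbing: move one word at a time to
    the group that most improves the objective, restarting from several random
    partitions and keeping the best result.  sim[a][b] is the similarity of
    words a and b; the objective rewards within-group similarity."""
    rng = random.Random(seed)

    def objective(assign):
        score = 0.0
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                if assign[a] == assign[b]:
                    score += sim[a][b]
        return score

    best_assign, best_score = None, float("-inf")
    for _ in range(restarts):
        assign = {w: rng.randrange(k) for w in words}
        improved = True
        while improved:
            improved = False
            for w in words:
                current = assign[w]
                best_move, base = current, objective(assign)
                for g in range(k):
                    if g == current:
                        continue
                    assign[w] = g            # tentatively move w to group g
                    if objective(assign) > base:
                        base, best_move = objective(assign), g
                assign[w] = best_move        # keep the best move found
                if best_move != current:
                    improved = True
        score = objective(assign)
        if score > best_score:
            best_assign, best_score = dict(assign), score
    return best_assign

# Toy example with invented similarity scores.
words = ["hostile", "unfriendly", "fat", "slim"]
sim = {a: {b: 0.0 for b in words} for a in words}
sim["hostile"]["unfriendly"] = sim["unfriendly"]["hostile"] = 0.9
sim["fat"]["slim"] = sim["slim"]["fat"] = 0.8
print(cluster(words, sim, k=2))
```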
6. generous outrageous unreasonable
7. endless protracted
8. plain
9. hostile unfriendly
10. delicate fragile unstable
11. affluent impoverished prosperous
12. brilliant clever energetic smart stupid
13. communist leftist
14. astonishing meager vigorous
15. catastrophic disastrous harmful
16. dry exotic wet
17. chaotic turbulent
18. confusing misleading
19. dismal gloomy
20. dual multiple pleasant
21. fat slim
22. affordable inexpensive
23. abrupt gradual stunning
24. flexible lenient rigid strict stringent

Figure 4.1
Example clustering found by the system using all linguistic modules.
2.3 Evaluation
In many natural language processing applications, evaluation is performed either by using internal evaluation measures (such as perplexity [Brown et al., 1992]) or by having human judges score the system's output. However, the first approach produces results that depend on the adopted model, while the second approach frequently introduces bias and inflation of the scores, especially when the "correct" answer is not well defined (as is the case with most natural language processing problems). To address these deficiencies of traditional evaluation approaches, we employ model solutions constructed by humans independently of the system's proposed solution. The humans receive the list of adjectives that are to be clustered, a description of the domain, and general instructions about the task. To avoid introducing bias in the evaluation, the instructions do not include low-level details such as the number of clusters or specific tests for deciding whether any two words should be in the same group.

In order to compare two partitions of the same set of words, we convert each partition to a list of "yes/no" decisions as follows: We view each possible pair of the words in the set as a decision point, with a "yes" answer in the current partition if the two adjectives are placed in the same group and a "no" answer otherwise. Then, we can apply the standard information retrieval measures precision and recall [Frakes and Baeza-Yates, 1992] to measure how close any two such lists of decisions are, or more precisely, how similar one such list is to another which is considered as the reference model. Considering the answers in the reference list as the correct ones, precision is defined as the percentage of correct "yes" answers reported in the list that is evaluated over the total number of "yes" answers in that list. Similarly, recall is defined as the percentage
of correct" yes" answersin the testedlist over the total numberof (by definition " " , correct) yes answersin the referencelist. The two evaluationmeasuresdefinedaboveratecomplementaryaspectsof the correctnessof the evaluatedpartition. In order to perform comparisons betweendifferent variantsof the grouping system, correspondingto the use of different combinationsof linguistic modules, we needto convert this pair of scoresto a single number. For this purposewe usethe F-measurescore [ Van Rijsbergen, 1979], which producesa number betweenprecision and recall that is larger when the two measuresareclosetogether, and thusfavors partitionsthat arebalancedin the two typesof errors(falsepositivesandfalse negatives). Placing equal weight on precision and recall, the F -measureis definedas
Up to this point, we have considered comparisons of one partition of words against another such partition. However, given the considerable disagreement that exists between groupings of the same set of words produced by different humans, we decided to incorporate multiple models in the evaluation. Previously, multiple models have been used indirectly to construct a single "best" or most representative model, which is then used in the evaluation [Gale et al., 1992a; Passonneau and Litman, 1993]. Although this approach reduces the problems caused by relying on a single model, it does not allow the differences between the models to be reflected in the evaluation scores. Consequently, we developed an evaluation method that uses multiple models simultaneously, directly reflects the degree of disagreement between the models in the produced scores, and automatically weighs the importance of each decision point according to the homogeneity of the answers for it in the multiple models. We have extended the information retrieval measures of precision, recall, fallout, and F-measure for this purpose; the mathematical formulation of the generalized measures is given in [Hatzivassiloglou and McKeown, 1993] and [Hatzivassiloglou, 1995a]. In the experiments reported in this chapter, we employ eight or nine human-constructed models for each adjective set. We base our comparisons on, and report, the generalized F-measure scores. In addition, since the correct number of groupings is something that the system cannot yet determine (and, incidentally, something that human evaluators disagree about), we run the system for the five cases in the range −2 to +2 around the average number of clusters employed by the humans and average the results. This smoothing operation
prevents an accidental high or low score being reported when a small variation in the number of clusters produces very different scores.

It should be noted here that the scores reported should not be interpreted as linear percentages. The problem of interpreting the scores is exacerbated in our context because of the structural constraints imposed by the clustering and the presence of multiple models. Even the best clustering that could be produced would not receive a score of 100, because of the disagreement among humans on what is the correct answer; applying the same evaluation method to score each model constructed by humans for the three adjective sets used in this comparative study against the other human-constructed models leads to an average score of 60.44 for the human evaluators. To clarify the meaning of the scores, we accompany them with lower and upper bounds for each adjective set we examine. These bounds are obtained by the performance of a system that creates random groupings (averaged over many runs) and by the average score of the human-produced partitions when evaluated against the other human-produced models respectively.
3 Motivation
3.1 Applications
The output of the word grouping system that we described in the previous section is used as the basis for the further processing of the retrieved groups: the classification of groups into scalar and nonscalar ones, the identification of synonyms and antonyms within each semantic group, the labeling of words as positive or negative within a scale, and the ordering of scalar terms according to semantic strength. In this way, the grouping system is used as the first part of a larger system for corpus-based computational lexicography, which in turn produces information useful for a variety of natural language processing applications. We briefly list below some of these applications:

• The organization of words into semantic groups can be exploited in statistical language modeling, by pooling together the estimates for the various words in each group [Sadler, 1989; Hindle, 1990; Brown et al., 1992]. This approach significantly reduces the sparseness of the data, especially for low-frequency words.

• A study of medical case histories and reports has shown that frequently physicians use multiple modifiers for the same term that are incompatible (e.g., they are synonyms, contradict each other, or one represents a specialized case of the other) [Moore, 1993]. Given the technical character of these words, it is
quite hard for non-specialists to edit these incorrect expressions, or even to identify such problematic cases. But the output of the word grouping system, which identifies semantically related words, can be used to flag occurrences of incompatible modifiers.

• Knowledge of the synonyms and antonyms of particular words can be used during both understanding and generation of text. Such knowledge can help with the handling of unknown words during understanding and increase the paraphrasing power of a generation system.

• Knowledge of semantic polarity (positive or negative status with respect to a norm) can be combined with corpus-based collocation extraction tools [Smadja, 1993] to automatically produce entries for the lexical functions used in Meaning-Text Theory for text generation [Mel'cuk and Pertsov, 1987]. For example, if the collocation extraction tool identifies the phrase hearty eater as a recurrent one, then knowing that hearty is a positive term enables the assignment of hearty to the lexical function MAGN (standing for magnify), that is, MAGN(eater) = hearty.

• The relative semantic strength of scalar adjectives directly correlates with the argumentative force of the adjectives in the text. Consequently, the relative semantic strength information can be used in language understanding to properly interpret the meaning of scalar words and in generation to select the appropriate word to lexicalize a semantic concept with the desired argumentative force [Elhadad, 1991].

• Scalar words obey pragmatic constraints, for example, scalar implicature [Levinson, 1983; Hirshberg, 1985]. If the position of the word on the scale is known, the system can draw the implied pragmatic inferences during text analysis, or use them for appropriate lexical choice decisions during generation. In particular, such information can be used for the proper analysis and generation of negative expressions. For example, not hot usually means warm, but not warm usually means cold.

3.2 The Need for Automatic Methods for Word Grouping
In recent years, the importance of lexical semantic knowledge for language processing has become recognized. Some of the latest dictionaries designed for human use include explicit lexical semantic links; for example, the COBUILD dictionary [Sinclair, 1987] explicitly lists synonyms, antonyms, and superordinates for many word entries. WordNet [Miller et al., 1990] is perhaps the best-known example of a large lexical database compiled by lexicographers specifically for computational applications, and it has been used in several natural language systems (e.g., [Resnik, 1993; Resnik and Hearst, 1993; Knight
and Luk, 1994; Basili et al., 1994]). Yet, WordNet and the machine-readable versions of dictionaries and thesauri still suffer from a number of disadvantages when compared with the alternative of an automatic, corpus-based approach:

• All entries must be encoded by hand, which represents significant manual effort.

• Changes to lexical entries may necessitate the careful examination and potential revision of other related entries to maintain the consistency of the database.

• Many types of lexical semantic knowledge are not present in current dictionaries or in WordNet. Most dictionaries emphasize the syntactic features of words, such as part of speech, number, and form of complement. Even when dictionary designers try to focus on the semantic component of lexical knowledge, the results have not yet been fully satisfactory. For example, neither COBUILD nor WordNet includes information about scalar semantic strength.

• The lexical information is not specific to any domain. Rather, the entries attempt to capture what applies to the language at large, or represent specialized senses in a disjunctive manner. Note that semantic lexical knowledge is most sensitive to domain changes. Unlike syntactic constraints, semantic features tend to change as the word is used in a different way in different domains. For example, our word grouping system identified preferred as the word most closely semantically related to common; this association may seem peculiar at first glance, but is indeed a correct one for the domain of stock market reports and financial information from which the training material was collected.

• Time-varying information, that is, the currency of words, compounds, and collocations, is not adjusted automatically.

• The validity of any particular entry depends on the assumptions made by the particular lexicographer(s) who compiled that entry. In contrast, an automatic system can be more thorough and impartial, since it bases its decisions on actual examples drawn from the corpus.

An automatic corpus-based system for lexical knowledge extraction such as our word grouping system offsets these disadvantages of static human-constructed knowledge bases by automatically adapting to the domain sublanguage. Its disadvantage is that while it offers potentially higher recall, it is generally less precise than knowledge bases carefully constructed by human lexicographers. This disadvantage can be alleviated if the output of the automatic system is modified by human experts in a post-editing phase.
4 Linguistic Features and Alternative Values for Them
We have identified several sources of symbolic, linguistic knowledge that can be incorporated in the word grouping system, augmenting the basic statistical component. Each such source represents a parameter of the system, that is, a feature that can be present or absent or more generally take a value from a predefined set. In this section we present first one of these parameters that can take several values, namely the method of extracting data from the corpus, and then several other binary-valued features.
4.1 Extracting Data from the Corpus
When the word-clustering system partitions adjectives in groups of semantically related ones, it determines the distribution of related (modified) nouns for each adjective and eventually the similarity between adjectives from pairs of the form (adjective, modified noun) that have been observed in the corpus. Direct information about semantically unrelated adjectives (in the form of appropriate adjective-adjective pairs) can also be collected from the corpus. Therefore, a first parameter of the system and a possible dimension for comparisons is the method employed to identify such pairs in free text.

There are several alternative models for this task of data collection, with different degrees of linguistic sophistication. A first model is to use no linguistic knowledge at all:[3] we collect for each adjective of interest all words that fall within a window of some predetermined size. Naturally, no negative data (adjective-adjective pairs) can be collected with this method. However, the method can be implemented easily and does not require the identification of any linguistic constraints, so it is completely general. It has been used for diverse problems such as machine translation and sense disambiguation [Gale et al., 1992b; Schütze, 1992].

3. Aside from the concept of a word, which is usually approximated by defining any string of characters separated by white space or punctuation marks as a word.

A second model is to restrict the words collected to the same sentence as the adjective of interest and to the word class(es) that we expect on linguistic grounds to be relevant to adjectives. For our application, we collect all nouns in the vicinity of an adjective without leaving the current sentence. We assume that these nouns have some relationship with the adjective and that semantically different adjectives will exhibit different collections of such nouns. This model requires only part-of-speech information (to identify nouns) and a method of detecting sentence boundaries. It uses a window of fixed length to
define the neighborhood of each adjective. Again, negative knowledge such as pairs of semantically unrelated adjectives cannot be collected with this model. Nevertheless, it has also been widely used, for example, for collocation extraction [Smadja, 1993] and sense disambiguation [Liddy and Paik, 1992].

Since we are interested in nouns modified by adjectives, a third model is to collect a noun immediately following an adjective, assuming that this implies a modification relationship. Pairs of consecutive adjectives, which are necessarily semantically unrelated, can also be collected.

Up to this point we have successively restricted the collected pairs on linguistic grounds, so that less but more accurate data are collected. For the fourth model, we extend the simple rule given above, using linguistic information to catch more valid pairs without sacrificing accuracy. We employ a pattern matcher that retrieves any sequence of one or more adjectives followed by any sequence of zero or more nouns. These sequences are then analyzed with heuristics based on linguistics to obtain pairs.

The regular expression and pattern matching rules of the previous model can be extended further, forming a grammar for the constructs of interest. This approach can detect more pairs, and at the same time address known problematic cases not detected by the previous models.

We implemented the above five data extraction models, using typical window sizes for the first two methods (50 and 5 on each side of the window respectively) which have been found appropriate for other problems before. Unfortunately, the first model proved to be excessively demanding in resources for our comparative experiments,[4] so we dropped it from further consideration and use the second model as the baseline of minimal linguistic knowledge. For the fifth model, we developed a finite-state grammar for noun phrases which is able to handle both predicative and attributive modification of nouns, conjunctions of adjectives, adverbial modification of adjectives, quantifiers, and apposition of adjectives to nouns or other adjectives.[5] A detailed description of this grammar and its implementation can be found in [Hatzivassiloglou, 1995b].

4. For example, 12,287,320 word pairs in a 151 MB file were extracted for the 21 adjectives in our smallest test set. Other researchers have also reported similar problems of excessive resource demands with this "collect all neighbors" model [Gale et al., 1992b].
5. For efficiency reasons we did not consider a more powerful formalism.
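A minimal version of the third extraction model, collecting a noun that immediately follows an adjective plus pairs of consecutive adjectives as negative evidence, might look like the following (hypothetical tag names and sentence; the chapter's fourth and fifth models use a richer pattern matcher and a finite-state grammar instead):

```python
def extract_pairs(tagged_sentence):
    """From a list of (word, part-of-speech) tuples, collect adjective-noun
    pairs (positive evidence) and adjacent adjective-adjective pairs
    (negative evidence, i.e. adjectives assumed semantically unrelated)."""
    adj_noun, adj_adj = [], []
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            adj_noun.append((w1, w2))
        elif t1 == "ADJ" and t2 == "ADJ":
            adj_adj.append((w1, w2))
    return adj_noun, adj_adj

# Invented tagged sentence for illustration.
sentence = [("the", "DET"), ("big", "ADJ"), ("red", "ADJ"),
            ("truck", "NOUN"), ("was", "VERB"), ("slow", "ADJ")]
print(extract_pairs(sentence))
```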
4.2 Other Linguistic Features
In addition to the data extraction method, we identified three other areas where linguistic knowledge can be introduced in our system. First, we can employ morphology to convert plural nouns to the corresponding singular ones and adjectives in comparative or superlative degree to their base form. This conversion combines counts of similar pairs, thus raising the expected and estimated frequencies of each pair in any statistical model.

Another potential application of symbolic knowledge is the use of a spell-checking procedure to eliminate typographical errors from the corpus. We implemented this component using the UNIX spell program and associated word list, with extensions for hyphenated compounds. Unfortunately, since a fixed and domain-independent word list is used for this process, some valid but overspecialized words may be discarded too.

Finally, we have identified several potential sources of additional knowledge that can be extracted from the corpus (e.g., conjunctions of adjectives) and can supplement the primary similarity relationships. In this comparison study we implemented and considered the significance of one of these knowledge sources, namely the negative examples offered by adjective-adjective pairs where the two adjectives have been observed in a syntactic relationship that strongly indicates semantic unrelatedness.
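The effect of the morphology module, folding together counts for inflected variants, can be illustrated with a toy normalization table (the table and word pairs below are invented; the actual module performs real morphological analysis rather than a lookup):

```python
from collections import Counter

# Hypothetical normalization table: plural nouns to singular,
# comparative/superlative adjectives to their base form.
NORMALIZE = {"trucks": "truck", "problems": "problem",
             "bigger": "big", "biggest": "big"}

def normalize(word):
    return NORMALIZE.get(word, word)

raw_pairs = [("big", "truck"), ("bigger", "trucks"),
             ("biggest", "truck"), ("severe", "problems")]

counts = Counter((normalize(a), normalize(n)) for a, n in raw_pairs)
print(counts)   # the three 'big truck' variants merge into one count of 3
```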
5 The Comparison Experiments
In the previous section we identified four parameters of the system, the effects of which we want to analyze. But in addition to these parameters, which can be directly varied and have several predetermined possible values, other variables can affect the performance of the system.

First, the performance of the system depends naturally on the adjective set that is to be clustered. Presumably, variations in the adjective set can be modeled by several parameters, such as the size of the set, the number of semantic groups in it, and the strength of semantic relatedness among its members, plus several parameters describing the properties of the adjectives in the set in isolation, such as frequency, specificity, etc.

A second variable that affects the clustering is the corpus that is used as the main knowledge source, through the observed co-occurrence patterns. Again, the effects of different corpora can be separated into several factors, for example, the size of the corpus, its generality, the genre of the texts, etc.

Since in these experiments we are interested in quantifying the effect of the linguistic knowledge in the system, or more precisely of the linguistic knowledge that we can explicitly control through the four parameters discussed above, we did not attempt to model in detail the various factors entering the system as a result of the choice of adjective set and corpus. However, we are
interested in measuring the effects of the linguistic parameters in a wide range of contexts. Therefore, we included in our experiment model two additional parameters, representing the corpus and the adjective set used.

We used the 1987 Wall Street Journal articles from the ACL-DCI (Association for Computational Linguistics-Data Collection Initiative) as our corpus. We selected four subcorpora to study the relationship of corpus size with linguistic feature effects: subcorpora of 330,000 words, 1 million words, 7 million words, and 21 million words (the last consisting of the entire 1987 corpus) were selected as representative. Each selected subcorpus contained the selected subcorpora of smaller sizes, and was constructed by sampling across the whole range of the entire corpus at regular intervals. Since we use subsets of the same corpus, we are essentially modeling the corpus size parameter only.

For each corpus, we analyzed three different sets of adjectives, listed in figures 4.2, 4.3, and 4.4. The first of these adjective test sets was selected from a similar corpus, contains 21 words of varying frequencies that all associate strongly with a particular noun (problem), and was analyzed in [Hatzivassiloglou and McKeown, 1993]. The second set (43 adjectives) was selected with the constraint that it contain high-frequency adjectives (more than 1000 occurrences in the 21-million-word corpus). The third set (62 adjectives) satisfies the opposite constraint, containing adjectives of relatively low frequency (between 50 and 250). Figure 4.1 on page 73 shows a typical grouping found by our system for the third set of adjectives, when the entire corpus and all linguistic modules were used.

These three sets of adjectives represent various characteristics of the adjective sets that the system may be called on to cluster. First, they explicitly represent increasing sizes of the grouping problem. The second and third sets also contrast the independent frequencies of their member adjectives. Furthermore, the less frequent adjectives of the third set tend to be more specific than the more frequent ones.
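One way to obtain such nested subcorpora (a sketch under the assumption that the corpus is available as a list of articles; the strides are illustrative and chosen so that each divides the previous one, which guarantees the nesting property) is to take every k-th article for decreasing k:

```python
def nested_subcorpora(articles, strides=(64, 16, 4, 1)):
    """Build nested subcorpora by taking every k-th article for each stride k.
    Because each stride divides the previous one, every smaller sample is
    contained in all larger ones, and each sample spans the whole corpus."""
    return {k: articles[::k] for k in strides}

corpus = [f"article-{i}" for i in range(1, 129)]   # stand-in for WSJ articles
samples = nested_subcorpora(corpus)
for k, sample in sorted(samples.items(), reverse=True):
    print(f"every {k}th article -> {len(sample)} articles")
```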
antitrust, big, economic, financial, foreign, global, international, legal, little, major, mechanical, new, old, political, potential, real, serious, severe, staggering, technical, unexpected

Figure 4.2
Test set 1: Adjectives strongly associated with the word problem.

annual, big, chief, commercial, current, daily, different, difficult, easy, final, future, hard, high, important, initial, international, likely, local, low, military, modest, national, negative, net, new, next, old, past, positive, possible, pre-tax, previous, private, public, quarterly, recent, regional, senior, significant, similar, small, strong, weak

Figure 4.3
Test set 2: High-frequency adjectives.

abrupt, affluent, affordable, astonishing, brilliant, capitalist, catastrophic, chaotic, clean, clever, communist, confusing, deadly, delicate, dirty, disastrous, dismal, dry, dual, dumb, endless, energetic, exotic, fat, fatal, flexible, fragile, generous, gloomy, gradual, harmful, hazardous, hostile, impoverished, inexpensive, insufficient, leftist, lenient, meager, misleading, multiple, outrageous, plain, pleasant, prosperous, protracted, rigid, scant, slim, smart, socialist, strict, stringent, stunning, stupid, toxic, turbulent, unfriendly, unreasonable, unstable, vigorous, wet

Figure 4.4
Test set 3: Low- to medium-frequency adjectives.
The human evaluators reported that the task of classification was easier for the third set, and their models exhibited about the same degree of agreement for the second and third sets, although the third set is significantly larger.

By including the parameters "corpus size" and "adjective set," we have six parameters that we can vary in the experiments. Any remaining factors affecting the performance of the system are modeled as random noise,[6] so statistical methods are used to evaluate the effects of the selected parameters. The six chosen parameters are completely orthogonal, with the exception that parameter "negative knowledge" must have the value "not used" when parameter "extraction model" has the value "nouns in vicinity." In order to avoid introducing imbalance in the experiment, we constructed a complete designed experiment [Hicks, 1982] for all the (4 × 2 − 1) × 2 × 2 × 4 × 3 = 336 valid combinations.[7]

6. Including some limited in extent but truly random effects from our nondeterministic clustering algorithm.
7. Recall that a designed experiment is complete when at least one trial, or run, is performed for every valid combination of the modeled predictors.
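The count of valid runs can be checked mechanically (a small sketch; the value labels simply mirror the parameter settings named in the text):

```python
from itertools import product

extraction = ["nouns in vicinity", "observed pairs", "pattern matching", "parsing"]
negative   = ["used", "not used"]
morphology = ["yes", "no"]
spelling   = ["yes", "no"]
corpus     = ["330K", "1M", "7M", "21M"]
adjectives = ["set 1", "set 2", "set 3"]

valid = [combo for combo in product(extraction, negative, morphology,
                                    spelling, corpus, adjectives)
         # negative knowledge cannot be used with the "nouns in vicinity" model
         if not (combo[0] == "nouns in vicinity" and combo[1] == "used")]
print(len(valid))   # 336
```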
6 Experimental Results
6.1 Average Effect of Each Linguistic Parameter
Presenting the scores obtained in each of the 336 individual experiments performed, which correspond to all valid combinations of the six modeled parameters, is both too demanding in space and not especially illuminating. Instead, we present several summary measures. We measured the effect of each particular setting of each linguistic parameter of section 4 by averaging the scores obtained in all experiments where that particular parameter had that particular value. In this way, table 4.1 summarizes the differences in the performance of the system caused by each parameter. Because of the complete design of the experiment, each value in table 4.1 is obtained in runs that are identical to the runs used for estimating the other values of the same parameter except for the difference in the parameter itself.[8]

8. The slight asymmetry in parameters "extraction model" and "negative knowledge" is accounted for by leaving out non-matching runs.
Table 4.1
Average F-measure scores for each value of each linguistic feature

Parameter                    Value                Average score
Extraction model             Parsing                      30.29
                             Pattern matching             28.88
                             Observed pairs               27.87
                             Nouns in vicinity            22.36
Morphology                   Yes                          28.60
                             No                           27.53
Spell-checking               Yes                          28.12
                             No                           28.00
Use of negative knowledge    Yes                          29.40
                             No                           28.63
Table 4.1 shows that there is indeed improvement with the introduction of any of the proposed linguistic features, or with the use of a linguistically more sophisticated extraction model. To assess the statistical significance of these differences, we compared each run for a particular value of a parameter with the corresponding identical (except for that parameter) run for a different value of the parameter. Each pair of values for a parameter produces in this way a set of paired observations. On each of these sets, we performed a sign test [Gibbons and Chakraborti, 1992] of the null hypothesis that there is no real difference in the system's performance between the two values, that is, that any observed difference is due to chance. We counted the number of times that the first of the two compared values led to superior performance relative to the second, distributing ties equally between the two cases as is the standard practice in classifier induction and evaluation. Under the null hypothesis, the number of times that the first value performs better follows the binomial distribution with parameter p = 0.5. Table 4.2 gives the results of these tests along with the probabilities that the same or more extreme results would be encountered by chance. We can see from the table that all types of linguistic knowledge except spell-checking have a beneficial effect that is statistically significant at, or below, the 0.1% level.
Table 4.2
Statistical analysis of the difference in performance offered by each linguistic feature
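The sign test used for these comparisons can be reproduced in a few lines (a sketch with an invented win count; the chapter's actual counts and probabilities are those summarized in table 4.2):

```python
from math import comb

def sign_test_p(successes, n):
    """Two-sided sign test: probability, under the binomial(n, 0.5) null
    hypothesis, of a result at least as extreme as `successes` wins out of
    n paired comparisons (ties are split evenly beforehand)."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: a feature wins 80 of 96 paired runs.
print(sign_test_p(80, 96))
```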
6.2 Comparison Among Linguistic Features
In order to measure the significance of the contribution of each linguistic feature relative to the other linguistic features, we fitted a linear regression model [Draper and Smith, 1981] to the data. We use the six parameters of the experiments as the predictors, and the F-measure score of the corresponding clustering (averaged over five adjacent values of the number-of-clusters parameter, as explained in section 2.3) as the response variable. In such a model the response R is assumed to be a linear function (weighted sum) of the predictors Vi, that is,

    R = β0 + Σ_{i=1}^{n} βi Vi        (1)
where Vi is the i-th predictor and βi is its corresponding weight. Table 4.3 shows the weights found by the fitting process for the experimental data collected for all valid combinations of the six parameters that we model. These weights indicate by their absolute magnitude and sign how important each predictor is and whether it contributes positively or negatively to the final result.
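A regression of this form can be fitted with ordinary least squares. The sketch below uses invented toy data rather than the chapter's 336 runs, encoding the binary features as 0/1 and the categorical ones as contrast dummies against the base case, mirroring equation (1):

```python
import numpy as np

# Each row: [corpus size in Mwords, pairs, sequences, parser, morphology,
#            spell-checking, set 2, set 3, negative knowledge].
# The extraction and adjective-set dummies are contrasts against the base
# case ("nouns in vicinity" and adjective set 1).  Rows respect the
# constraint that negative knowledge needs a model other than the base one.
X = np.array([
    [21.00, 0, 0, 1, 1, 0, 0, 0, 1],
    [ 7.00, 1, 0, 0, 0, 1, 1, 0, 0],
    [ 1.00, 0, 1, 0, 1, 0, 0, 1, 1],
    [21.00, 0, 0, 0, 0, 0, 0, 0, 0],
    [ 7.00, 0, 0, 1, 1, 1, 0, 1, 0],
    [ 1.00, 1, 0, 0, 0, 0, 1, 0, 1],
    [21.00, 0, 1, 0, 1, 1, 0, 0, 0],
    [ 7.00, 0, 0, 0, 0, 0, 0, 1, 0],
    [ 1.00, 0, 0, 1, 0, 1, 1, 0, 1],
    [21.00, 1, 0, 0, 1, 0, 0, 1, 0],
    [ 0.33, 0, 1, 0, 0, 1, 0, 0, 1],
    [ 7.00, 0, 0, 0, 1, 0, 1, 0, 0],
], dtype=float)
y = np.array([45.0, 38.0, 20.0, 39.0, 35.0, 24.0,
              41.0, 18.0, 22.0, 33.0, 19.0, 30.0])   # invented F-measure scores

design = np.hstack([np.ones((len(X), 1)), X])        # prepend intercept column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)    # least-squares fit
for name, b in zip(["intercept", "corpus (Mwords)", "pairs", "sequences",
                    "parser", "morphology", "spell-checking", "set 2 vs 1",
                    "set 3 vs 1", "negative knowledge"], coef):
    print(f"{name:20s} {b: .4f}")
```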
Table 4.3
Fitted coefficients for the linear regression model that contrasts the effects of various parameters in overall system performance

Variable                                               Weight
Intercept                                              18.7997
Corpus size (in millions of words)                      0.9417
Extraction method (pairs vs. nouns in vicinity)         5.1307
Extraction method (sequences vs. nouns in vicinity)     6.1418
Extraction method (parser vs. nouns in vicinity)        7.5423
Morphology                                              0.5371
Spell-checking                                          0.0589
Adjective set (2 vs. 1)                                 2.5996
Adjective set (3 vs. 1)                               -11.4882
Use of negative knowledge                               0.3838
Numerical values such as the corpus size enter equation (1) directly as predictors, so table 4.3 indicates that each additional million words of training text increases the performance of the system by 0.9417 on average. For binary features, the weights in table 4.3 indicate the increase in the system's performance when the feature is present, so the introduction of morphology improves the system's performance by 0.5371 on average. The different possible values of the categorical variables "adjective set" and "extraction model" are encoded as contrasts with a base case; the weights associated with each such value show the change in score for the indicated value in contrast to the base case (adjective set 1 and the minimal linguistic knowledge represented by extraction model "nouns in vicinity," respectively). For example, using the finite-state parser instead of the "nouns in vicinity" model improves the score by 7.5423 on average, while going from adjective set 2 to adjective set 3 decreases the score by 2.5996 − (−11.4882) = 14.0878 on average. Finally, the intercept β0 gives a baseline performance of a minimal system that uses the base case for each parameter; the effects of corpus size are to be added to this.

From table 4.3 we can see that the data extraction model has a significant effect on the quality of the produced clustering, and among the linguistic parameters is the most important one. Increasing the size of the corpus also significantly increases the score. The adjective set that is clustered also has a major influence on the score, with rarer adjectives leading to worse clusterings. Note, however, that these are average effects, taken over a wide range of different settings for the system. In particular, while the system produces bad partitions for
adjective set 3 when the corpus is small, when the largest corpus (21 million words) is used the partitions produced for test set 3 are equal in quality to or better than the partitions produced for the other two sets with the same corpus. The two linguistic features "morphology" and "negative knowledge" have less pronounced although still significant effects, while spell-checking offers minimal improvement that probably does not justify the effort of implementing the module and the cost of activating it at run-time.

6.3 Overall Effect of Linguistic Knowledge
Up to this point we have described averages of scores or of score differences, taken over many combinations of features that are orthogonal to the one studied. These averages are good for establishing the existence of a performance difference caused by the different values of each feature, across all possible combinations of the other features. They are not, however, representative of the performance of the system in a particular setting of parameters, nor are they suitable for describing the difference between features quantitatively, since they are averages taken over widely differing settings of the system's parameters. In particular, the inclusion of very small corpora drives the average scores down, as we have confirmed by computing averages separately for each value of the corpus size parameter.
Table 4.4

                                     Adjective set 1    Adjective set 2    Adjective set 3
Random partitions                     9.66 (17.90%)      6.21 ( 9.66%)      3.80 ( 6.03%)
No linguistic components active      24.51 (45.41%)     38.51 (59.92%)     33.21 (52.66%)
All linguistic components active     39.06 (72.36%)     44.73 (69.60%)     46.17 (73.20%)
Humans                               53.98              64.27              63.07

Two versions of the system (with all or none of the linguistic modules active) are contrasted with the performance of a random classifier and that of the humans. The scores were obtained on the 21-million-word corpus, using a smoothing window of three adjacent values of the number of clusters parameter centered at the average value for that parameter in the human-prepared models. We also show the percentage of the score of the humans that is attained by the random classifier and each version of the system.
To give a feeling of how important the introduction of linguistic knowledge is quantitatively, we compare in table 4.4 the results obtained with the full corpus of 21 million words for the two cases of having all or none of the linguistic components active. The scores obtained by a random system that produces partitions of the adjectives with no knowledge except the number of groups are included as a lower bound. These estimates are obtained after averaging the scores of 20,000 such random partitions for each adjective set. The average scores that each human model receives when compared with all other human models are also included, as an estimate of the maximum score that can be achieved by the system. That maximum depends on the disagreement between models for each adjective set. For these measurements we use a smaller smoothing window of size 3 instead of 5, which is fairer to the system when its performance is compared with the humans. We also give in figure 4.5 the grouping produced by the system for adjective set 3 using the entire 21-million-word corpus but without any of the linguistic modules active. This partition is to be contrasted with the one given in figure 4.1 on page 73, which was produced from the same corpus and with the same number of clusters, but with all the linguistic modules active.
1. catastrophic harmful
2. dry wet
3. lenient rigid strict stringent
4. communist leftist
5. clever
6. abrupt chaotic disastrous gradual turbulent vigorous
7. affluent affordable inexpensive prosperous
8. outrageous
9. capitalist socialist
10. dismal gloomy pleasant
11. generous insufficient meager scant slim
12. delicate fragile
13. brilliant energetic
14. dual multiple stupid
15. hazardous toxic unreasonable unstable
16. plain
17. confusing
18. flexible hostile protracted unfriendly
19. endless
20. clean dirty impoverished
21. deadly fatal
22. astonishing misleading stunning
23. dumb fat smart
24. exotic

Figure 4.5
Partition with 24 clusters produced by the system for the adjective test set 3 of figure 4.4, using the entire 21-million-word corpus and no linguistic modules.
7 Cost of Incorporating the Linguistic Knowledge in the System
The cost of incorporating the linguistics-based modules in the system is not prohibitive. The effort needed to implement all the linguistic modules was about 5 person-months, in contrast with 7 person-months needed to develop the basic statistical system. Most of this time was spent in designing and implementing the finite-state grammar that is used for extracting adjective-noun and adjective-adjective pairs [Hatzivassiloglou, 1995b].

Furthermore, the run-time overhead caused by the linguistic modules is not significant. Each linguistic module takes from 1 to 7 minutes on a Sun SparcStation 10 to process a million items (words or pairs of words, as appropriate for the module), and all except the negative knowledge module need process a corpus only once, reusing the same information for different problem instances (word sets). This should be compared to the approximately 15 minutes needed by the statistical component for grouping about 40 adjectives.
8 Generalizing to Other Applications
In section 6 we showed that the introduction of linguistic knowledge in the word grouping system results in a performance difference that is not only statistically observable but also quantitatively significant (cf. table 4.4). We believe that these positive results should also apply to other corpus-based natural language processing systems that employ statistical methods.

Many statistical approaches share the same basic methodology with our system: a set of words is preselected, related words are identified in a corpus, the frequencies of the words and of pairs of related words are estimated, and a statistical model is used to make predictions for the original words. Across applications, there are differences in what words are selected, how related words are defined, and what kinds of predictions are made. Nevertheless, the basic components stay the same. For example, in the adjective grouping application the original words are the adjectives and the predictions are their groups; in machine translation, the predictions are the translations of the words in the source language text; in sense disambiguation, the predictions are the senses assigned to the words of interest; in part-of-speech tagging or in classification, the predictions are the tags or classes assigned to each word. Because of this underlying similarity, the comparative analysis presented in this chapter is relevant to all these problems.
For a concrete example, consider the case of collocation extraction that has been addressed with statistical methods in the past. Smadja [1993] describes a system that initially uses the "nouns in vicinity" extraction model to collect co-occurrence information about words, and then identifies collocations on the basis of distributional criteria. A later component filters the retrieved collocations, removing the ones where the participating words are not used consistently in the same syntactic relationship. This post-processing stage doubles the precision of the system. We believe that using from the start a more sophisticated extraction model to collect these pairs of related words will have similar positive effects. Other linguistic components, such as a morphology module that combines frequency counts, should also improve the performance of that system. In this way, we can benefit from linguistic knowledge without having to use a separate filtering process after expending the effort to collect the collocations.

Similarly, the sense disambiguation problem is typically attacked by comparing the distribution of the neighbors of a word's occurrence to prototypical distributions associated with each of the word's senses [Gale et al., 1992b; Schütze, 1992]. Usually, no explicit linguistic knowledge is used in defining these neighbors, which are taken as all words appearing within a window of fixed width centered at the word being disambiguated.[10] Many words unrelated to the word of interest are collected in this way. In contrast, identifying appropriate word classes that can be expected on linguistic grounds to convey significant information about the original word should increase the performance of the disambiguation system. Such classes might be modified nouns for adjectives, nouns in a subject or object position for verbs, etc. As we have shown in section 6, less but more accurate information increases the quality of the results.

An interesting topic is the identification of parallels of the linguistic modules that have been designed with the present system in mind for these applications, at least for those modules which, unlike morphology, are not ubiquitous. Negative knowledge, for example, improves the performance of our system, supplementing the positive information provided by adjective-noun pairs. It could be useful for other systems as well if an appropriate application-dependent method of extracting such information is identified.

10. Although some researchers have used limited linguistic knowledge in selecting, processing, and classifying these neighbors; see, for example, [Hearst, 1991] and [Yarowsky, 1994].
9 Conclusions and Future Work
We have shown that all linguistic features considered in this study had a positive contribution to the performance of the system. Except for spell-checking, all these contributions were both statistically significant and large enough to make a difference in practical situations. The cost of incorporating the linguistics-based modules in the system is not prohibitive, both in terms of development time and in terms of actual run-time overhead. Furthermore, the results can be expected to generalize to a wide variety of corpus-based systems for different applications.

We should note here that in our comparative experiments we have focused on analyzing the benefits of symbolic knowledge that is readily available and can be efficiently incorporated into the system. We have avoided using lexical semantic knowledge because it is not generally available and because its use would defeat the very purpose of the word grouping system. However, on the basis of the measurable performance difference offered by the shallow linguistic knowledge we studied, it is reasonable to conjecture that deeper linguistic knowledge, if it becomes readily accessible, would probably increase the performance of a hybrid system even more.

In the future, we plan to extend the results discussed in this chapter with an analysis of the dependence of the effects of each parameter on the values of the other parameters. We are currently stratifying the experimental data obtained to study trends in the magnitude of parameter effects as other parameters vary in a controlled manner, and we will examine the interactions with corpus size and specificity of clustered adjectives. Preliminary results indicate that the importance of linguistic knowledge remains high even with large corpora, showing that we cannot offset the advantages of linguistic knowledge just by increasing the corpus size. We plan to investigate these trends and interactions with extended experiments in the future.
Acknowledgments
This work was supported jointly by the Advanced Research Projects Agency and the Office of Naval Research under grant N00014-89-J-1782, by the Office of Naval Research under grant N00014-95-1-0745, and by the National Science Foundation under grant GER-90-24069. It was performed under the auspices of the Columbia University CAT in High Performance Computing and Communications in Health Care, a New York State Center for Advanced Technology
supported by the New York State Science and Technology Foundation. Any opinions, findings, conclusions, or recommendations expressed in this publication are mine and do not necessarily reflect the views of the New York State Science and Technology Foundation. I thank Kathy McKeown, Jacques Robin, the anonymous reviewers, and the Balancing Act workshop organizers and editors of this book for providing useful comments on earlier versions of the chapter.

References
Roberto Basili, Maria Teresa Pazienza, and Paola Velardi. The Noisy Channel and the Braying Donkey. In Proceedings of the ACL Workshop The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pp. 21-28, Las Cruces, New Mexico, July 1994. Association for Computational Linguistics.

Peter F. Brown, Vincent J. della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4):467-479, 1992.

Kenneth W. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143, Austin, Texas, February 1988.

Douglas R. Cutting, Julian M. Kupiec, Jan O. Pedersen, and Penelope Sibun. A Practical Part-of-Speech Tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140, Trento, Italy, April 1992.

Norman R. Draper and Harry Smith. Applied Regression Analysis, 2nd edition. New York, Wiley, 1981.

Michael Elhadad. Generating Adjectives to Express the Speaker's Argumentative Intent. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), pp. 98-104, Anaheim, California, July 1991. American Association for Artificial Intelligence.

William B. Frakes and Ricardo Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, N.J., Prentice Hall, 1992.
William A. Gale, Kenneth W. Church, and David Yarowsky. Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs. In Proceedings of the 30th Annual Meeting of the ACL, pp. 249-256, Newark, Del., June 1992. Association for Computational Linguistics.

William A. Gale, Kenneth W. Church, and David Yarowsky. Work on Statistical Methods for Word Sense Disambiguation. In Probabilistic Approaches to Natural Language: Papers from the 1992 Fall Symposium, pp. 54-60, Cambridge, Massachusetts, October 1992. Menlo Park, Calif., American Association for Artificial Intelligence, AAAI Press, 1992.

Jean Dickinson Gibbons and Subhabrata Chakraborti. Nonparametric Statistical Inference, 3rd edition. New York, Marcel Dekker, 1992.

Vasileios Hatzivassiloglou. Automatic Retrieval of Semantic and Scalar Word Groups from Free Text. Technical Report CUCS-018-95, New York, Columbia University, 1995.

Vasileios Hatzivassiloglou. Retrieving Adjective-Noun, Adjective-Adjective, and Adjective-Adverb Syntagmatic Relationships from Corpora: Extraction via a Finite-State Grammar, Heuristic Selection, and Morphological Processing. Technical Report CUCS-019-95, New York, Columbia University, 1995.

Vasileios Hatzivassiloglou and Kathleen McKeown. Towards the Automatic Identification of Adjectival Scales: Clustering Adjectives According to Meaning. In Proceedings of the 31st Annual Meeting of the ACL, pp. 172-182, Columbus, Ohio, June 1993. Association for Computational Linguistics.

Marti A. Hearst. Noun Homograph Disambiguation Using Local Context in Large Text Corpora. In Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research: Using Corpora, Oxford, 1991.

Charles R. Hicks. Fundamental Concepts in the Design of Experiments, 3rd edition. New York, Holt, Rinehart, and Winston, 1982.

Donald Hindle. Noun Classification from Predicate-Argument Structures. In Proceedings of the 28th Annual Meeting of the ACL, pp. 268-275, Pittsburgh, June 1990. Association for Computational Linguistics.

Julia B. Hirshberg. A Theory of Scalar Implicature. PhD thesis, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, 1985.

Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York, Wiley, 1990.

Maurice G. Kendall. A New Measure of Rank Correlation. Biometrika, 30:81-93, 1938.

Maurice G. Kendall. Rank Correlation Methods, 4th edition. London, Griffin, 1975.

Kevin Knight and Steve K. Luk. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), vol. 1, pp. 773-778, Seattle, July-August 1994. American Association for Artificial Intelligence.

Julian M. Kupiec. Robust Part-of-Speech Tagging Using a Hidden Markov Model. Computer Speech and Language, 6:225-242, 1992.

Adrienne Lehrer. Semantic Fields and Lexical Structure. Amsterdam, North Holland, 1974.

Stephen C. Levinson. Pragmatics. Cambridge, England, Cambridge University Press, 1983.

Elizabeth D. Liddy and Woojin Paik. Statistically-Guided Word Sense Disambiguation. In Probabilistic Approaches to Natural Language: Papers from the 1992 Fall Symposium, pp. 98-107, Cambridge, Massachusetts, October 1992. Menlo Park, Calif., American Association for Artificial Intelligence, AAAI Press, 1992.

John Lyons. Semantics, vol. 1. Cambridge, England, Cambridge University Press, 1977.

Igor A. Mel'cuk and Nikolaj V. Pertsov. Surface Syntax of English: A Formal Model within the Meaning-Text Framework. Amsterdam, Benjamins, 1987.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An On-Line Lexical Database. International Journal of Lexicography (special issue), 3(4):235-312, 1990.

Johanna D. Moore. Personal communication. June 1993.

Rebecca J. Passonneau and Diane J. Litman. Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of the 31st Annual Meeting of the ACL, pp. 148-155, Columbus, Ohio, June 1993. Association for Computational Linguistics.

Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the ACL, pp. 183-190, Columbus, Ohio, June 1993. Association for Computational Linguistics.

Philip Resnik. Semantic Classes and Syntactic Ambiguity. In Proceedings of the ARPA Workshop on Human Language Technology, pp. 278-283, Plainsboro, N.J., March 1993. ARPA Software and Intelligent Systems Technology Office, San Francisco, Morgan Kaufmann, 1993.

Philip Resnik and Marti A. Hearst. Structural Ambiguity and Conceptual Relations. In Proceedings of the ACL Workshop on Very Large Corpora, pp. 58-64, Columbus, Ohio, June 1993. Association for Computational Linguistics.

Victor Sadler. Working with Analogical Semantics: Disambiguation Techniques in DLT. Dordrecht, The Netherlands, Foris Publications, 1989.

Hinrich Schütze. Word Sense Disambiguation With Sublexical Representations. In Workshop Notes, AAAI-92 Workshop on Statistically-Based NLP Techniques, pp. 109-113, San Jose, Calif., July 1992. American Association for Artificial Intelligence.

John M. Sinclair (editor in chief). Collins COBUILD English Language Dictionary. London, Collins, 1987.

Frank Smadja. Retrieving Collocations from Text: Xtract. Computational Linguistics, 19(1):143-177, March 1993.

Helmuth Späth. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Chichester, West Sussex, England, Ellis Horwood, 1985.

Jost Trier. Das sprachliche Feld. Eine Auseinandersetzung. Neue Jahrbücher für Wissenschaft und Jugendbildung, 10:428-449, 1934.

C. J. van Rijsbergen. Information Retrieval, 2nd edition. London, Butterworths, 1979.

Alex Waibel and Kai-Fu Lee, editors. Readings in Speech Recognition. San Mateo, Calif., Morgan Kaufmann, 1990.

David Yarowsky. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the ACL, pp. 88-95, Las Cruces, N.M., June 1994. Association for Computational Linguistics.
Chapter 5 The Automatic Construction of a Symbolic Parser via Statistical Techniques
Shyam Kapur and Robin Clark
At the core of contemporary generative syntax is the premise that all languages obey a set of universal principles, and syntactic variation among languages is confined to a finite number of parameters. On this model, a child's acquisition of the syntax of his or her native language depends on identifying the correct parameter settings for that language based on observation, for example, determining whether to form questions by placing question words at the beginning of the sentence (e.g., English: Who did Mary say John saw) or leaving them syntactically in situ (e.g., Chinese: Mary said John saw who). Prevalent work on parameter setting focuses on the way that observed events in the child's input might "trigger" settings for parameters (e.g. [Manzini and Wexler, 1987]), to the exclusion of inductive or distributional analyses.

In "The Automatic Construction of a Symbolic Parser via Statistical Techniques", Kapur and Clark build on their previous work on proving learnability results in a stochastic setting [Kapur, 1991] and exploring the complexity of parameter setting in the face of realistic assumptions about how parameters interact [Clark, 1992]. Here they combine forces to present a learning model in which distributional evidence plays a critical role, while still adhering to an orthodox, symbolic view of language acquisition consistent with the Chomskian paradigm. Notably, they validate their approach by means of an implemented model, tested on naturally occurring data of the kind available to child language learners. -Eds.
1 Motivation

We report on the progress we have made toward developing a robust "self-constructing" parsing device that uses indirect negative evidence [Kapur and Bilardi, 1991] to set its parameters. Generally, by parameter we mean any point of variation between languages; that is, a property on which two languages may
differ. Thus, the relative placement of an object with respect to the verb, a determiner with respect to a noun, the difference between prepositional and postpositional languages, and the presence of long-distance anaphors like Japanese "zibun" and Icelandic "sig" are all parameters. A self-constructing parsing device would be exposed to an input text consisting of simple unpreprocessed sentences. On the basis of this text, the device would induce indirect negative evidence in support of some one parsing device located in the parameter space.

The development of a self-constructing parsing system would have a number of practical and theoretical benefits. First, such a parsing device would reduce the development costs of new parsers. At the moment, grammars must be developed by hand, a technique requiring a significant investment in money and man-hours. If a basic parser could be developed automatically, costs would be reduced significantly, even if the parser required some fine-tuning after the initial automatic learning procedure. Second, a parser capable of self-modification is potentially more robust when confronted with novel or semigrammatical input. This type of parser would have applications in information retrieval as well as language instruction and grammar correction. As far as linguistic theory is concerned, the development of a parser capable of self-modification would give us considerable insight into the formal properties of complex systems as well as the twin problems of language learnability and language acquisition, the research problems that have provided the foundation of generative grammar.

Given a linguistic parameter space, the problem of locating a target language somewhere in the space on the basis of a text consisting of only grammatical sentences is far from trivial. Clark [1990, 1992] has shown that the complexity of the problem is potentially exponential because the relationship between the points of variation and the actual data can be quite indirect and tangled. Since, given n parameters, there are 2^n possible parsing devices, enumerative search through the space is clearly impossible. Because each datum may be successfully parsed by a number of different parsing devices within the space and because the surface properties of grammatical strings underdetermine the properties of the parsing device which must be fixed by the learning algorithm, standard deductive machine learning techniques are as complex as a brute enumerative search [Clark, 1992, 1994a]. In order to solve this problem, robust techniques that can rapidly eliminate inferior hypotheses must be developed.

We propose a learning procedure that unites symbolic computation with statistical tools. Historically, symbolic techniques have proved to be a versatile tool in natural language processing. These techniques have the disadvantage of being both brittle (easily broken by new input or user error) and costly (as
grammars are extended to handle new constructions, development becomes more difficult owing to the complexity of rule interactions within the grammar). Statistical techniques have the advantage of robustness, although the resulting grammars may lack the intuitive clarity found in symbolic systems. We propose to fuse the symbolic and statistical techniques, a development we view as both inevitable and welcome; the resulting system will use statistical learning techniques to output a symbolic parsing device. We view this development as providing a nice middle ground between the problems of overtraining vs. undertraining. That is, statistical approaches to learning often tend to overfit the training set of data. Symbolic approaches, on the other hand, tend to behave as though they were undertrained (breaking down on novel input) since the grammar tends to be compact. Combining statistical techniques with symbolic parsing would give the advantage of obtaining relatively compact descriptions (symbolic processing) with robustness (statistical learning) that is not overtuned to the training set.

We believe that our approach not only provides a new technique for obtaining robust parsers in natural language systems but also provides a partial explanation for child language acquisition. Traditionally, in either of these separate fields of inquiry, two widely different approaches have been pursued. One of them is largely statistical and heavily data-driven; the other is largely symbolic and theory-driven. Neither approach has proved exceptionally successful in either field. Our approach not only bridges the symbolic and statistical approaches but also tries to bring closer the two disparate fields of inquiry.

We claim that the final outcome of the learning process is a grammar that is not simply some predefined template with slots that have been filled in but rather crucially a product of the process itself. The result of setting a parameter to a certain value involves not just the fixing of that parameter but also a potential reorganization of the grammar to reflect the new parameter's values. The final result must not only be a parser consistent with the parameter values but one that is also self-modifiable and, furthermore, one that can modify itself along one of many directions depending on the subsequent input. Exactly for this reason, our solution remains relevant to a purely engineering approach to parser building: the parser builder cannot simply look up the parameter values in a table. In fact, parameter setting has to be a part, even just a small part, of the parser construction process. If this were not the case, we probably would have had little difficulty in building excellent parsers for individual languages. Equally, the notion of self-modification is of enormous interest to linguistic typologists and diachronic linguists. In particular, a careful study of self-modification would place substantive limits on linguistic variation and on
the ways in which languages could, in principle, change over time. The information-theoretic analysis of linguistic variation is still in its infancy, but it promises to provide an important theoretical tool for linguists. (See [Clark and Roberts, in preparation] for applications to linguistic typology and diachronic change.)

As far as child language acquisition is concerned, viewing the parameter-setting problem in an information-theoretic light seems to be the best perspective one can put together for this problem [Clark, 1994b; Kapur and Clark, in press]. Linguistic representations carry information, universal grammar encodes information about all natural languages, and the linguistic input from the target language must carry information about the target language in some form. The task of the learner can be viewed as that of efficiently and accurately decoding the information contained in the input in order to have enough information to build the grammar for the target language.

To date, the information-theoretic principles underlying the entire process have not received adequate attention. For example, the most commonly considered learning algorithm is one that simply moves from one parameter setting to another parameter setting based only on failure in parsing. That such an algorithm is entirely implausible empirically is one issue; in addition, it can be shown that one of the fastest ways for this algorithm to converge is to take a random walk in the parameter space, which is clearly grossly inefficient. Such an approach is also inconsistent with a maxim true about much of learning: "We learn at the edge of what we already know." Furthermore, in no sense would one be able to maintain that there is a monotonic increase in the information the child has about the target language in any real sense. We know from observation and experimentation that children's learning appears to be largely monotonic and fairly uniform across children. Finally and most important, these algorithms fail to account for how certain information is necessary for children's learning to proceed from stage n to stage n + 1. Just as some background information is necessary for children's learning to proceed from stage 0 (the initial state) to stage 1, there is good reason to believe that there must be some background + acquired information that must be crucial to take the child from stage n to stage n + 1. In the algorithms we consider, we provide arguments that the child can proceed from one stage to the next only because at the earlier stage the child has been able to acquire enough information to be able to build enough structure. This, in turn, is necessary to in fact efficiently extract further information from the input to learn further.

The restrictive learning algorithms that we consider here allow the process of information extraction from a plausible input text to be investigated in both
complete formal and computational detail. We hope to show that our work is leading the way to establish precisely the information-theoretic details of the entire learning process, from what the initial state needs to be to what can be learned and how. For example, another aspect in which previous attempts have been deficient is in their varying degrees of assumptions about what information the child has access to when in the initial state. We feel that the most logical approach is to assume that the child has no access to any information unless it can be argued that without some information, learning would be impossible or at least infeasible.

Some psycholinguistic consequences of our proposal appear to be empirically valid. For example, it has been observed that in relatively free word order languages such as Russian, the child first stabilizes on some word order, although not the same word order across children. Another linguistic and psycholinguistic consequence of this proposal is that there is no need to stipulate markedness or initial preset values. Extensional relationships between languages and their purported consequences, such as the Subset Principle, are irrelevant. Furthermore, triggers need not be single utterances; statistical properties of the corpus may trigger parameter values.

2 Preliminaries

In this section, we first list some parameters that give some idea of the kinds of variations between languages that we hope our system is capable of handling. We then illustrate why parameter setting is difficult by standard methods. This provides some additional explanation for the failure so far in developing a truly universal parameterized parser.
2.1 Linguistic Parameters

Naturally, a necessary preliminary to our work is to specify a set of parameters that will serve as a testing ground for the learning algorithm. This set of parameters must be embedded in a parsing system so that the learning algorithm can be tested against data sets that approximate the kind of input that parsing devices are likely to encounter in real-world applications.

Our goal, then, will be to first develop a prototype. We do not require that the prototype accept any arbitrarily selected language or that the coverage of the prototype parser be complete in any given language. Instead, we will develop a prototype with coverage that extends to some basic structures that any language learning device must account for, plus some structures that have proved difficult for various learning theories. In particular, given an already
existing parser, we will extend its coverage by parameterizing it, as described below. Our initial set of parameters will include the following points of variation:
1. Relative order of specifiers and heads: This parameter covers the placement of determiners relative to nouns, the relative position of the subject, and the placement of certain VP-modifying adverbs.

2. Relative order of heads and complements: This parameter deals with the placement of nominal and adjectival complements relative to the verb (VO or OV orders), as well as the choice between prepositions and postpositions.

3. Scrambling: Some languages allow (relatively) free word order. For example, German has rules for displacing definite NPs and clauses out of their canonical positions. Japanese allows relatively free ordering of NPs and postpositional phrases so long as the verbal complex remains clause-final. Other languages allow even freer word orders. We will focus on German and Japanese scrambling, bearing in mind that the model should be extendible to other types of scrambling.

4. Relative placement of negative markers and verbs: Languages vary as to where they place negative markers, like English not. English places its negative marker after the first tensed auxiliary, thus forcing do-insertion when there is no other auxiliary, whereas Italian places negation after the tensed verb. French uses discontinuous elements like ne ... pas or ne ... plus, which are wrapped around the tensed verb or occur as continuous elements in infinitivals. The proper treatment of negation will require several parameters, given the range of variation.

5. Root word order changes: In general, languages allow for certain word order changes in root clauses but not in embedded clauses. An example of a root word order change is subject-auxiliary inversion in English, which occurs in root questions (Did John leave? vs. *I wonder did John leave?). Another example would be inversion of the subject clitic with the tensed verb in French (Quelle pomme a-t-il mangée ["Which apple did he eat?"]) and the process of subject postposition and PP preposition in English (A man walked into the room vs. Into the room walked a man).

6. Rightward dislocation: This includes extraposition structures in English (That John is late amazes me. vs. It amazes me that John is late.), presentational there structures (A man was in the park. vs. There was a man in the park.), and stylistic inversion in French (Quelle piste Marie a-t-elle choisie?
["What path has Marie chosen?"]). Each of these constructions presents unique problems so that the entire data set is best handled by a system of interacting parameters.

7. Wh-movement vs. wh-in situ: Languages vary in the way they encode wh-questions. English obligatorily places one and only one wh-phrase (e.g., who or which picture) in first position. In French the wh-phrase may remain in place (in situ) although it may also form wh-questions as in English. Polish allows wh-phrases to be stacked at the beginning of the question.

8. Exceptional case marking, structural case marking: These parameters have little obvious effect on word order, but involve the treatment of infinitival complements. Thus, exceptional case marking and structural case marking allow for the generation of the order "V[+tense] NP VP[-tense]", where V[+tense] is a tensed verb and VP[-tense] is a VP headed by a verb in the infinitive. Both parameters involve the semantic relations between the NP and the infinitival VP as well as the treatment of case marking. These relations are reflected in constituent structure rather than word order and thus pose an interesting problem for the learning algorithm.

9. Raising and control: In the case of raising verbs and control verbs, the learner must correctly categorize verbs that occur in the same syntactic frame into two distinct groups based on semantic relations as reflected in the distribution of elements (e.g., idiom chunks) around the verbs.

10. Long- and short-distance anaphora: Short-distance anaphors, like "himself" in English, must be related to a coreferential NP within a constrained local domain. Long-distance anaphors (Japanese "zibun", Korean "caki") must also be related to a coreferential NP, but this NP need not be contained within the same type of local domain as in the short-distance case.

The above sampling of parameters has the virtue of being both small (and therefore possible to implement relatively quickly) and posing interesting learnability problems which will appropriately test our learning algorithm. Although the above list can be described succinctly, the set of possible targets will be large and a simple enumerative search through the possible targets will not be efficient.
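To make the size of the search space concrete, here is a minimal illustration (not from the chapter; the parameter names are only shorthand for the list above) of a candidate grammar represented as a vector of binary parameter values; with n parameters there are 2^n such candidates to consider.

    # Hypothetical encoding of the ten parameters above as '+'/'-' values.
    PARAMETERS = [
        "spec_head_order", "head_complement_order", "scrambling",
        "neg_placement", "root_word_order_changes", "rightward_dislocation",
        "wh_movement", "exceptional_case_marking", "raising_vs_control",
        "long_distance_anaphora",
    ]

    def grammar(settings):
        # Map a tuple of '+'/'-' values onto the parameter names.
        assert len(settings) == len(PARAMETERS)
        return dict(zip(PARAMETERS, settings))

    print(2 ** len(PARAMETERS))   # 1024 candidate parsing devices for 10 parameters
    print(grammar(("+", "-", "+", "-", "+", "-", "+", "-", "+", "-")))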
2.2 Complexities of Parameter Setting

Theories based on the principles-and-parameters (P&P) paradigm hypothesize that languages share a central core of universal properties and that language variation can be accounted for by appeal to a finite number of points of variation, the so-called parameters. The parameters themselves may take on only a
finite number of possible values, prespecified by Universal Grammar (UG). A fully specified P&P theory would account for language acquisition by hypothesizing that the learner sets parameters to the appropriate values by monitoring the input stream for "triggering data"; triggers are sentences which cause the learner to set a particular parameter to a particular value. For example, the imperative in (1) is a trigger for the order V(erb) O(bject):

(1) Kiss grandma.

under the hypothesis that the learner analyzes grandma as the patient of kissing and is predisposed to treat patients as structural objects.

Notice that trigger-based parameter setting presupposes that for each parameter p and each value v the learner can identify the appropriate trigger in the input stream. This is the problem of trigger detection. That is, given a particular input item, the learner must be able to recognize whether or not it is a trigger and, if so, what parameter and value it is a trigger for. Similarly, the learner must be able to recognize that a particular input datum is not a trigger for a certain parameter even though it may share many properties with a trigger. In order to make the discussion more concrete, consider the following example:

(2) a. John_i thinks that Mary likes him_i.
    b. *John thinks that Mary_j likes her_j.

English allows pronouns to be coreferent with a c-commanding nominal just in case that nominal is not contained within the same local syntactic domain as the pronoun; this is a universal property of pronouns and would seem to present little problem to the learner.

Note, however, that some languages, including Chinese, Icelandic, Japanese, and Korean, allow for long-distance anaphors. These are elements which are obligatorily coreferent with another nominal in the sentence, but which may be separated from that nominal by several clause boundaries. Thus, the following example from Icelandic is grammatical even though the anaphor sig is separated from its antecedent Jon by a clause boundary [Anderson, 1986]:

(3) Jón_i segir að María elski sig_i/hann_i
    John says that Mary loves self/him
    John says that Mary loves him.

Thus, UG includes a parameter that allows some languages to have long-distance anaphors and that, perhaps, fixes certain other properties of this class of anaphora.
Note that the example in (3) is of the same structure as the pronominal example in (2a). A learner whose target is English must not take examples like (2a) as a trigger for the long-distance anaphor parameter; what prevents the learner from being deceived? Why doesn't the learner conclude that English him is comparable to Icelandic sig? We would argue that the learner is sensitive to distributional evidence. For example, the learner is aware of examples like (4):

(4) John_i likes him_j.

where the pronoun is not coreferential with anything else in the sentence. The existence of (4) implies that him cannot be a pure anaphor, long-distance or otherwise. Once the learner is aware of this distributional property of him, he or she can correctly rule out (2a) as a potential trigger for the long-distance anaphor parameter.

Distributional evidence, then, is crucial for parameter setting; no theory of parameter setting can avoid statistical properties of the input text. How far can we push the statistical component of parameter setting? In this chapter, we suggest that statistically based algorithms can be exploited to set parameters involving phenomena as diverse as word order, particularly verb-second constructions, and cliticization, the difference between free pronouns and proclitics. The work reported here can be viewed as providing the basis for a theory of trigger detection; it seeks to establish a theory of the connection between the raw input text and the process of parameter setting.
3 Parameter-Setting Proposal

Let us suppose that there are n binary parameters each of which can take one of two values ('+' or '-') in a particular natural language. The core of a natural language L is uniquely defined once all the n parameters have been assigned a value. Consider a random division of the parameters into some m groups. Let us call these groups P1, P2, ..., Pm. The Parameter-Setting Machine (PSM) goes about setting all the parameters within the first group P1 concurrently, as

1. Parameters can be looked at as fixed points of variation among languages. From a computational point of view, two different values of a parameter may simply correspond to two different bits of code in the parser. We are not committed to any particular scheme for the translation from a tuple of parameter values to the corresponding language. However, the sorts of parameters we consider have been listed in the previous section.
sketched below. After these parameters have been fixed, the machine next tries to set the parameters in group P2 in similar fashion, and so on.

1. All parameters are unset initially, that is, there are no preset values. The parser is organized to only obey all the universal principles. At this stage, utterances from any possible natural language are accommodated with equal ease, but no sophisticated structure can be built.
2. Both the values of each of the parameters pi in P1 are "competing" to establish themselves.
3. Corresponding to pi, a pair of hypotheses are generated, say H+i and H-i.
4. Next, these hypotheses are tested on the basis of input evidence.
5. If H-i fails or H+i succeeds, set pi's value to '+'. Otherwise, set pi's value to '-'.
3.1 Formal Analysis of the PSM

We next consider a particular instantiation of the hypotheses and their testing. The way we have in mind involves constructing suitable window sizes during which the algorithm is sensitive to occurrence as well as non-occurrence of specific phenomena. Regular failure of a particular phenomenon to occur in a suitable window is one natural, robust kind of indirect negative evidence. For example, the pair of hypotheses may be:

1. Hypothesis H+i: Expect not to observe phenomena from a fixed set O-i of phenomena which support the parameter value '-'.
2. Hypothesis H-i: Expect not to observe phenomena from a fixed set O+i of phenomena which support the parameter value '+'.

Let wi and ki be two small numbers. Testing the hypothesis H+i involves the following procedure:

1. A window of wi sentences is constructed and a record is maintained of whether or not a phenomenon from within the set O-i occurred among those wi sentences.
2. This construction of the window is repeated ki different times and a tally ci is made of the fraction of times the phenomena occurred at least once in the duration of the window.
3. The hypothesis H+i succeeds if and only if the ratio of ci to ki is less than 0.5.

Note that the phenomena under scrutiny are assumed to be such that the parser is always capable of analyzing (to whatever extent necessary) the input. This is because in our view the parser consists of a fixed, core program whose
behavior can be modified by selecting from among a finite set of "flags" (the parameters). Therefore, even if not all of the flags have been set to the correct values, the parser is such that it can at least partially represent the input. Thus, the parser is always capable of analyzing the input. Also, there is no need to explicitly store any input evidence. Suitable window sizes can be constructed during which the algorithm is sensitive to occurrence as well as non-occurrence of specific phenomena. By using windows, just the relevant bit of information from the input is extracted and maintained. (For detailed argumentation that this is a reasonable theoretical argument, see [Kapur and Bilardi, 1991; Kapur, 1993].) Note also that we have only sketched and analyzed a particular, simple version of our algorithm. In general, a whole range of window sizes may be used and this may be governed by the degree to which the different hypotheses have earned corroboration. (For some ideas along this direction in a more general setting, see [Kapur, 1991; Kapur and Bilardi, 1992].)

3.2 Order in Which Parameters Get Set

Note that in our approach certain parameters get set more quickly than others. These are the ones that are expressed very frequently. It is possible that these parameters also make the information extraction quicker or more efficient, for example, by enabling structure building so that other parameters can be set. If our proposal is right, then, for example, the word-order parameters, which are presumably the very first ones to be set, must be set based on a very primitive parser capable of handling any natural language. At this early stage, it may be that word and utterance boundaries cannot be reliably recognized and the lexicon is quite rudimentary. Furthermore, the only accessible property in the input stream may be the linear word order. Another particular difficulty with setting word-order parameters is that the surface order of constituents in the input does not necessarily reflect the underlying word order. For example, even though Dutch and German are SOV languages, there is a preponderance of SVO forms in the input due to the V2 (verb-second) phenomenon. The finite verb in root clauses moves to the second position and then the first position can be occupied by the subject, objects (direct or indirect), adverbials, or prepositional phrases. As we shall see, it is important to note that if the subject is not in the first position in a V2 language, it is most likely in the first position to the right of the verb. Finally, it has been shown by Gibson and Wexler [1992] that the parameter space created by the head-direction parameters along with the V2 parameter has local maxima, that is, incorrect parameter settings from which the learner can never escape.
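The window-based test of section 3.1 can be sketched in a few lines. This is an illustration rather than the authors' implementation; it assumes a predicate is available that tells whether an utterance exhibits a phenomenon from the relevant set (O-i or O+i), and it uses the window size and repetition count suggested later for the V2 case.

    import random

    def test_h_plus(text, exhibits_o_minus, w=300, k=10):
        # Window-based test of hypothesis H+i (section 3.1): H+i succeeds when
        # phenomena from O-i appear in fewer than half of the k windows.
        c = 0
        for _ in range(k):
            start = random.randrange(max(1, len(text) - w + 1))
            window = text[start:start + w]
            if any(exhibits_o_minus(utterance) for utterance in window):
                c += 1
        return c / k < 0.5

    def set_parameter(text, exhibits_o_minus, exhibits_o_plus, w=300, k=10):
        # Step 5 of the PSM: '+' if H-i fails or H+i succeeds, '-' otherwise.
        h_plus = test_h_plus(text, exhibits_o_minus, w, k)
        h_minus = test_h_plus(text, exhibits_o_plus, w, k)
        return "+" if (not h_minus) or h_plus else "-"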
3.3 Computational Analysis of the PSM
3.3.1 V2 Parameter

In this section, we summarize results we have obtained which show that word-order parameters can plausibly be set in our model.2 The key concept we use is that of entropy, an information-theoretic statistical measure of the randomness of a random variable. The entropy H(X) of a random variable X, measured in bits, is -Σ_x p(x) log p(x). To give a concrete example, the outcome of a fair coin has an entropy of -(.5 * log(.5) + .5 * log(.5)) = 1 bit. If the coin is not fair and has .9 chance of heads and .1 chance of tails, then the entropy is around .5 bit. There is less uncertainty with the unfair coin: it is most likely going to turn up heads. Entropy can also be thought of as the number of bits on the average required to describe a random variable. Entropy of one variable, say X, conditioned on another, say Y, denoted as H(X | Y), is a measure of how much better the first variable can be predicted when the value of the other variable is known.

We considered the possibility that by investigating the behavior of the entropy of positions in the neighborhood of verbs in a language, word order characteristics of that language may be discovered.3 For a V2 language, we expect that there will be more entropy to the left of the verb than to its right, that is, the position to the left will be less predictable than the one to the right. We first show that using a simple distributional analysis technique based on the five verbs the algorithm is assumed to know, another 15 words, most of which turn out to be verbs, can readily be obtained.

Consider the input text as generating tuples of the form (v, d, w), where v is one of the top 20 words (most of which are verbs), d is either the position to the left of the verb or to the right, and w is the word at that position.4 V, D, and W are the corresponding random variables.

2. Preliminary results obtained with Eric Brill were presented at the 1993 Georgetown Roundtable on Language and Linguistics: Presession on Corpus-based Linguistics.
3. In the competition model for language acquisition [MacWhinney, 1987], the child considers cues to determine properties of the language, but while these cues are reinforced in a statistical sense the cues themselves are not information-theoretic in the way that ours are. In some recent discussion of triggering, Niyogi and Berwick [1993] formalize parameter setting as a Markov process. Crucially, there again the statistical assumption on the input is merely used to ensure that convergence is likely and triggers are simple sentences.
4. We thank Steve Abney for suggesting this formulation to us.
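For concreteness, the two coin entropies quoted in section 3.3.1 work out as

\[
-(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1\ \text{bit}, \qquad
-(0.9\log_2 0.9 + 0.1\log_2 0.1) \approx 0.47\ \text{bit}.
\]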
The procedure for setting the V2 parameter is the following: If H(W | V, D = left) > H(W | V, D = right) then +V2 else -V2. On each of the nine languages on which it has been possible to test our algorithm, the correct result was obtained. (Only the last three languages in table 5.1 are V2 languages.) Furthermore, in almost all cases, paired t tests showed that the results were statistically significant. The amount (only 3000 unannotated utterances) and the quality of the input (unstructured caretaker speech input from the CHILDES database subcorpus [MacWhinney, 1991]), and the computational resources needed for parameter setting to succeed are psychologically plausible. Further tests were successfully conducted in order to establish both the robustness and the simplicity of this learning algorithm. It is also clear that once the value of the V2 parameter has been correctly set, the input is far more revealing with regard to other word-order parameters and they too can be set using similar techniques.

In order to make clear how this procedure fits into our general parameter-setting proposal, we spell out what the hypotheses are. In the case of the V2 parameter, the two hypotheses are not separately necessary, since one hypothesis is the exact complement of the other. So the hypothesis H+ may be as shown.

Hypothesis H+: Expect not to observe that the entropy to the left of the verbs is lower than that to the right.

The window size that may be used could be around 300 utterances and the number of repetitions needs to be around 10. Our previous results provide empirical support that this should suffice.

Table 5.1
The conditional entropy results

Language    H(W | V, D = left)    H(W | V, D = right)
English          4.22                  4.26
French           3.91                  5.09
Italian          4.91                  5.33
Polish           4.09                  5.78
Tamil            4.01                  5.04
Turkish          3.69                  4.91
Dutch            4.84                  3.61
Danish           4.42                  4.24
German           5.55                  4.97
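A minimal sketch of the entropy comparison behind table 5.1 is given below. It is an illustration, not the authors' code; it assumes the (v, d, w) tuples described above have already been extracted from the input and estimates the two conditional entropies with relative-frequency counts.

    import math
    from collections import Counter

    def conditional_entropy(tuples, side):
        # Estimate H(W | V, D = side) in bits from (v, d, w) tuples.
        pair_counts = Counter((v, w) for v, d, w in tuples if d == side)
        verb_counts = Counter(v for v, d, w in tuples if d == side)
        total = sum(verb_counts.values())
        h = 0.0
        for (v, w), c in pair_counts.items():
            p_joint = c / total               # P(v, w) given D = side
            p_w_given_v = c / verb_counts[v]  # P(w | v) given D = side
            h -= p_joint * math.log2(p_w_given_v)
        return h

    def set_v2(tuples):
        # +V2 if the position to the left of the verb is less predictable.
        left = conditional_entropy(tuples, "left")
        right = conditional_entropy(tuples, "right")
        return "+V2" if left > right else "-V2"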
By assuming that besides knowing a few verbs, as before, the algorithm also recognizes some of the first and second person pronouns of the language, we can not only determine aspects of the pronoun system (see section 3.3.2) but also get information about the V2 parameter. The first step of learning is the same as above; that is, the learner acquires additional verbs based on distributional analysis. We expect that in the V2 languages (Dutch and German), the pronouns will appear more often immediately to the right of the verb than to the left. For French, English, and Italian, exactly the reverse is predicted. Our results (2-1 or better ratio in the predicted direction) confirm these predictions.5

3.3.2 Clitic Pronouns

We now show that our techniques can lead to straightforward identification and classification of clitic pronouns.6 In order to correctly set the parameters governing the syntax of pronominals, the learner must distinguish clitic pronouns from free and weak pronouns as well as sort all pronoun systems according to their proper case system (e.g., nominative pronouns, accusative pronouns). Furthermore, the learner must have some reliable method for identifying the presence of clitic pronouns in the input stream. The algorithm we report, which is also based on the observation of entropies of positions in the neighborhood of pronouns, not only distinguishes accurately between clitic and freestanding pronouns but also successfully sorts clitic pronouns into linguistically natural classes.

It is assumed that the learner knows a set of first and second person pronouns. The learning algorithm computes the entropy profile for three positions to the left and right of the pronouns (H(W | P = p) for the six different positions), where the p's are the individual pronouns. These profiles are then compared and those pronouns that have similar profiles are clustered together. Interestingly, it turns out that the clusters are syntactically appropriate categories. In French, for example, based on the Pearson correlation coefficients we could deduce that the object clitics me and te, the subject clitics je and tu, the non-clitics moi and toi, and the ambiguous pronouns nous and vous are most closely related only to the other element in their own class.

5. We also verified that the object clitics in French were not primarily responsible for the correct result.
6. Preliminary results were presented at the Berne workshop on L1- and L2-acquisition of clause-internal rules: scrambling and cliticization in January 1994.
        VOUS   TOI    MOI    ME     JE     TU     TE     NOUS
VOUS    1
TOI     0.62   1
MOI     0.57   0.98   1
ME      0.86   0.24   0.17   1
JE      0.28   0.89   0.88   -0.02  1
TU      0.41   0.94   0.94   0.09   0.97   1
TE      0.88   0.39   0.30   0.95   0.16   0.24   1
NOUS    0.91   0.73   0.68   0.82   0.53   0.64   0.87   1
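A sketch of how such a correlation table can be produced (again an illustration, not the authors' implementation): given a tokenized corpus, build each pronoun's conditional-entropy profile over the six positions and correlate the profiles.

    import math
    from collections import Counter

    POSITIONS = [-3, -2, -1, 1, 2, 3]   # three positions on each side of the pronoun

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def entropy_profile(tokens, pronoun):
        # H(W | P = pronoun) for each relative position around the pronoun.
        dists = {pos: Counter() for pos in POSITIONS}
        for i, tok in enumerate(tokens):
            if tok == pronoun:
                for pos in POSITIONS:
                    j = i + pos
                    if 0 <= j < len(tokens):
                        dists[pos][tokens[j]] += 1
        return [entropy(dists[pos]) for pos in POSITIONS]

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

    # Pronouns whose profiles correlate highly are clustered together, e.g.:
    # pearson(entropy_profile(tokens, "je"), entropy_profile(tokens, "tu"))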
In fact, the entropy signature for the ambiguous pronouns can be analyzed as a mathematical combination of the signatures for the conflated forms. To distinguish clitics from non-clitics we use the measure of stickiness (the proportion of times they are sticking to the verbs compared to the times they are two or three positions away). These results are quite good. The stickiness is as high as 54% to 55% for the subject clitics; non-clitics have stickiness no more than 17%.

The results can be seen most dramatically if we chart the conditional entropy of positions around the pronoun in question. Figure 5.1 shows the unambiguous freestanding pronouns moi and toi.

Figure 5.1 Entropy conditioned on position (free-standing unambiguous pronouns: moi, toi).

Compare moi, the freestanding pronoun, with the other first person pronouns je (the nominative clitic) and me (the non-nominative clitic). The freestanding pronoun moi is systematically less informative about its surrounding environment, corresponding to a slightly flatter curve than either je or me. This distinction in the slopes of the curves is also apparent if we compare the curve associated with toi against the curves associated with tu (nominative) and te (non-nominative) in figure 5.2; toi has the gentlest curve. This suggests that the learner could distinguish clitic pronouns from freestanding pronouns by checking for sharp drops in conditional entropy around the pronoun; clitics should stand out as having relatively sharp curves.

Note that we have three distinct curves in figure 5.3. We have already discussed the difference between clitic and freestanding pronouns. Do nominative and non-nominative clitics sort out by our method? Figure 5.3 suggests they might since je has a sharp dip in conditional entropy to its right, whereas me has a sharp dip to its left. Consider figure 5.4 where the conditional entropy of positions around je, tu, and on have been plotted. We have
included on with the first and second person clitics since it is often used as a first person plural pronoun in colloquial French. All three are unambiguously nominative clitic pronouns. Note that their curves are basically identical, showing a sharp dip in conditional entropy one position to the right of the clitic.

Figure 5.2 Entropy conditioned on position (second person pronouns: vous, toi, tu, te).

Figure 5.5 shows the non-nominative clitic pronouns me and te. Once again, the curves are essentially identical, with a dip in entropy one position to the left of the clitic. The position to the left of the clitic will tend to be part of the subject (often a clitic pronoun in the sample we considered). Nevertheless, it is
clear that the learner will have evidence to partition the clitic pronouns on the basis of where the dip in entropy occurs.
Figure 5.3 Entropy conditioned on position (first person pronouns: moi, me, je).

Let us turn, finally, to the interesting cases of nous and vous. These pronouns are unusual in that they are ambiguous between freestanding and clitic pronouns and, furthermore, may occur as either nominative or non-nominative clitics. We would expect them, therefore, to distinguish themselves from the other pronouns. If we consider the curve associated with vous in figure 5.2, it is immediately apparent that it has a fairly gentle slope, as one would expect of a freestanding pronoun. Nevertheless, the conditional entropy of
vous is rather low both to its right and its left, a property we associate with clitics; in fact, its conditional entropy is systematically lower than the unambiguous clitics tu and te, although this fact may be due to our sample. Figure 5.6 compares the conditional entropy of positions surrounding vous and nous. Once again, we see that nous and vous are associated with very similar curves.

Figure 5.4 Entropy conditioned on position (nominative clitic pronouns: je, tu, on).

Summarizing, we have seen that conditional entropy can be used to distinguish freestanding and clitic pronouns. This solves at least part of the learner's problem in that the method can form the basis for a practical algorithm for
detecting the presence of clitics in the input stream. Furthermore, we have seen that conditional entropy can be used to break pronouns into further classes like nominative and non-nominative. The learner can use these calculations as a robust, noise-resistant means of setting parameters. Thus, at least part of the problem of trigger detection has been answered. The input is such that the learner can detect certain systematic cues and exploit them in determining grammatical properties of the target. At the very least, the learner could use these cues to form a "rough sketch" of the target grammar, allowing the learner to bootstrap its way to a full-fledged grammatical system.
Figure 5.5 Entropy conditioned on position (non-nominative clitic pronouns: me, te).
The Dutch clitic system is far more complicated than the French pronoun system (see, e.g., [Zwart, 1993]). Even so, our entropy calculations made some headway toward classifying the pronouns. We are able to distinguish the weak and strong subject pronouns. Since even the strong subject pronouns in Dutch tend to stick to their verbs very closely and two clitics can come next to each other, the raw stickiness measure seems to be inappropriate. Although the Dutch case is problematic owing to the effects of V2 and scrambling, we are in the process of treating these phenomena and anticipate that the pronoun calculations in Dutch will sort out properly once the influence of these other word-order processes are factored in appropriately.
Figure 5.6 Entropy conditioned on position (ambiguous pronouns: vous, nous).
4 Conclusions

It needs to be emphasized that in our statistical procedure there is a mechanism available to the learning mechanism by which it can determine when it has seen enough input to reliably determine the value of a certain parameter. (Such means are nonexistent in any trigger-based error-driven learning theory.) In principle at least, the learning mechanism can determine the variance in the quantity of interest as a function of the text size and then know when enough text has been seen to be sure that a certain parameter has to be set in a particular way.
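One way to make such a stopping criterion concrete (an illustration, not a proposal from the chapter): the tally ci/ki of section 3.1 is a binomial proportion, so its sampling variability shrinks as more windows are seen, and the learner can stop once 0.5 falls safely outside the interval:

\[
\hat{p} = \frac{c_i}{k_i}, \qquad
\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1-\hat{p})}{k_i}}, \qquad
\text{stop once } |\hat{p} - 0.5| > 2\,\mathrm{SE}(\hat{p}).
\]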
We are currently extending the results we have obtained to other parameters and other languages. We are convinced that the word-order parameters [e.g., (1) and (2)] should be fairly easy to set and amenable to an information-theoretic analysis along the lines sketched earlier. Scrambling also provides a case where calculations of entropy should provide an immediate solution to the parameter-setting problem. Note however that both scrambling and V2 interact in an interesting way with the basic word-order parameters; a learner may be potentially misled by both scrambling and V2 into missetting the basic word-order parameters since both parameters can alter the relationship between heads, their complements, and their specifiers. Parameters involving adverb placement, extraposition, and wh-movement should be relatively more challenging to the learning algorithm given the relatively low frequency with which adverbs are found in adult speech to children. These cases provide good examples which motivate the use of multiple trials by the learner. The interaction between adverb placement and head movement, then, will pose an interesting problem for the learner since the two parameters are interdependent; what the learner assumes about adverb placement is contingent on what it assumes about head placement and vice versa.

Acknowledgments

We firstly thank two anonymous referees for some very useful comments. We are also indebted to Isabella Barbier, Eric Brill, Bob Frank, Aravind Joshi, Barbara Lust, and Philip Resnik along with the audience at the Balancing Act workshop at the Annual Meeting of the Association for Computational Linguistics for comments on various parts of this chapter.

References

Anderson, S. 1986. The typology of anaphoric dependencies: Icelandic (and other) reflexives. In L. Hellan and K. Christensen, editors, Topics in Scandinavian Syntax, pp. 65-88. Dordrecht, The Netherlands, D. Reidel.

Robin Clark. 1990. Papers on learnability and natural selection. Technical Report 1, Université de Genève, Département de Linguistique générale et de linguistique française, Faculté des Lettres, CH-1211, Genève 4, 1990. Technical Reports in Formal and Computational Linguistics.

Robin Clark. 1992. The selection of syntactic knowledge. Language Acquisition, 2(2): 83-149.

Robin Clark. 1994a. Hypothesis formation as adaptation to an environment: Learnability and natural selection. In Barbara Lust, Magui Suñer, and Gabriella Hermon, editors, Syntactic Theory and First Language Acquisition: Crosslinguistic Perspectives. Presented at the 1992 symposium on Syntactic Theory and First Language Acquisition: Cross-Linguistic Perspectives at Cornell University, Ithaca. Hillsdale, N.J., Erlbaum.

Robin Clark. 1994b. Kolmogorov complexity and the information content of parameters. Technical Report. Philadelphia, Institute for Research in Cognitive Science, University of Pennsylvania, 1994.

Robin Clark and Ian Roberts. In preparation. Complexity is the Engine of Variation, manuscript. University of Pennsylvania, Philadelphia, and University of Wales, Bangor.

Edward Gibson and Kenneth Wexler. 1992. Triggers. Presented at GLOW. Linguistic Inquiry, 25, pp. 407-454.

Shyam Kapur. 1991. Computational Learning of Languages. Ph.D. thesis, Cornell University. Computer Science Department Technical Report 91-1234.

Shyam Kapur. 1993. How much of what? Is this what underlies parameter setting? In Proceedings of the 25th Stanford University Child Language Research Forum.

Shyam Kapur. 1994. Some applications of formal learning theory results to natural language acquisition. In Barbara Lust, Magui Suñer, and Gabriella Hermon, editors, Syntactic Theory and First Language Acquisition: Crosslinguistic Perspectives. Presented at the 1992 symposium on Syntactic Theory and First Language Acquisition: Cross-Linguistic Perspectives at Cornell University. Hillsdale, N.J., Erlbaum.

Shyam Kapur and Gianfranco Bilardi. 1992. Language learning from stochastic input. In Proceedings of the Fifth Conference on Computational Learning Theory. San Mateo, Calif., Morgan Kaufmann.

Shyam Kapur and Robin Clark. In press. The Automatic Identification and Classification of Clitic Pronouns. Presented at the Berne workshop on L1- and L2-Acquisition of Clause-Internal Rules: Scrambling and Cliticization, January 1994.

R. Manzini and K. Wexler. Parameters, binding theory, and learnability. Linguistic Inquiry, 18: 413-444, 1987.

Brian MacWhinney. 1987. The competition model. In Brian MacWhinney, editor, Mechanisms of Language Acquisition. Hillsdale, N.J., Erlbaum.

Brian MacWhinney. 1991. The CHILDES Project: Tools for Analyzing Talk. Hillsdale, N.J., Erlbaum.

Partha Niyogi and Robert C. Berwick. 1993. Formalizing triggers: A learning model for finite spaces. Technical Report A.I. Memo No. 1449. Cambridge, Mass., Massachusetts Institute of Technology. Also Center for Biological Computational Learning, Whitaker College Paper No. 86.

C. Jan-Wouter Zwart. 1993. Notes on clitics in Dutch. In Lars Hellan, editor, Clitics in Germanic and Slavic, pp. 119-155. Eurotyp Working Papers, Theme Group 8, vol. 4, University of Tilburg, The Netherlands.
Chapter 6 Combining Linguistic with Statistical Methods in Automatic Speech Understanding
Patti Price
Speech understanding is an application that consists of two major components: the natural language processing component, which has traditionally been based on algebraic or symbolic approaches, and the speech recognition component, which has traditionally used statistical approaches. Price reviews the culture clash that has resulted as these two areas have been linked into the larger speech understanding task. Her position is that balancing the symbolic and the statistical will yield results that neither community could achieve alone.

Price points out that the best performing speech recognition systems have been based on statistical pattern matching techniques. At the same time, the most fully developed natural language analysis systems of the 1970s and 1980s were rule-based, using symbolic logic, and often requiring large sets of handcrafted rules. When these two were put together in the late 1980s, most notably in the United States in the context of projects funded by the Defense Advanced Research Projects Agency (DARPA), the result was to have an effect on both communities. Initially, that effect tended to be the fostering of skepticism, as shown in Price's table 6.1, but increasingly the result has been a tendency to combine symbolic with statistical and engineering approaches. Price concludes her thoughtful review by presenting some of the challenges in achieving the balance and some of the compromises required by both the speech and natural language processing communities in order to reach their shared goal. -Eds.
1 Introduction: The Cultural Gap
This chapter presents an overview of automatic speech understanding techniques that combine symbolic approaches with statistical pattern matching methods. The two major component technologies in speech understanding arise from different cultural heritages: natural language (NL) understanding technology has traditionally used algebraic or symbolic approaches, and speech recognition
technology has traditionally used statistical approaches. Integration of these technologies in speech understanding requires a "balancing act" that addresses cultural and technical differences among the component technologies and their representatives.

As argued in Price and Ostendorf [1995], representatives of symbolic approaches and of approaches based on statistical pattern matching may view each other with some suspicion. Psychologists and linguists, representing symbolic approaches, may view automatic algorithms as "uninteresting collections of ad hoc ungeneralizable methods for limited domains." The automatic speech recognition community, on the other hand, may argue that automatic speech recognition should not be modeled after human speech recognition; since the tasks and goals of machines are very different from those of humans, the methods should also be different. Thus, in this view, symbolic approaches are "uninteresting collections of ad hoc ungeneralizable methods for limited domains." The same words may be used, but mean different things, as indicated in table 6.1.

It is the thesis of this chapter that balancing the symbolic and the statistical approaches can yield results that neither community alone could achieve because:

• Statistical approaches alone may tend to ignore the important fact that spoken language is a social mechanism evolved for communication among entities whose biological properties constrain the possibilities. Mechanisms that are successful for machines are likely to share many properties with those successful for people, and much of our knowledge of human properties is expressed in symbolic form.

• Symbolic techniques alone may not be powerful enough to model complex human behavior; statistical approaches have many valuable traits to be leveraged.

Table 6.1
Cross-cultural mini-lexicon

                    Linguists                                    Engineers
Uninteresting       Provides no explanation of                   Provides no useful applications.
                    cognitive processes.
Ad hoc              Without theoretical motivation.              Must be provided by hand.
Ungeneralizable     "Techniques that help you climb a tree       Expense of knowledge engineering
                    may not help you get to the moon."           prohibits assessing new or more
                                                                 complex domains.
After a brief historical survey (section 2), this chapter surveys the fields of speech recognition (section 3), of NL understanding (section 4), and of their integration (section 5), and concludes with a discussion of current challenges (section 6).

2 Historical Considerations
Activity and results in automatic speech understanding have increased in recent years. The DARPA (Defense Advanced Research Projects Agency) program merger of two previously independent programs (speech and NL) has had a profound impact. Previously, the speech recognition program focused on the automatic transcription of speech, whereas the NL understanding program focused on interpreting the meanings of typed input.

In the DARPA speech understanding program of the 1970s (see, e.g. [Klatt, 1977]), artificial intelligence (AI) was a relatively new field full of promise. Systems were developed by separating knowledge sources along traditional linguistic divisions: for example, acoustic phonetics, phonology, morphology, lexical access, syntax, semantics, discourse. The approach was largely symbolic and algebraic; rules were devised, measurements were made, thresholds were set, and decisions resulted. A key weakness of the approach proved to be the number of modules and the decision-making process. When each module is forced to make irrevocable decisions without interaction with other modules, errors can only propagate; a seven-stage serial process in which each module is 90% accurate has an overall accuracy of less than 50% (0.9^7 is about 0.48). As statistical pattern matching techniques were developed and performed significantly better than the symbolic approaches with significantly less research investment, the funding focus and the research community's activities shifted.

The differences in performance between the two approaches during the 1970s could be viewed as a lesson for both symbolic and statistical approaches: making irrevocable decisions early (before considering more knowledge sources) can severely degrade performance. Statistical models provide a convenient mechanism for such delayed decision-making, and subsequent hardware and algorithmic developments enabled the consideration of increasingly larger sets of hypotheses. Although statistical models are certainly not the only tool for investigating speech and language, they do provide several important features:

• They can be trained automatically (provided there are data), which facilitates porting to new domains and uses.

• They can provide a systematic and convenient mechanism for combining multiple knowledge sources.
• They can express the more continuous properties of speech and language (e.g., prosody, vowel changes, and other sociolinguistic processes).

• They facilitate use of large corpora, which is important since the more abstract linguistic units are relatively rare compared to the phones modeled in speech recognition; hence large corpora are needed to provide enough instances to be modeled.

• They provide a means for assessing incomplete knowledge.

• They can provide a means for acquiring knowledge about speech and language.

The advantages summarized above are further elaborated in Price and Ostendorf [1995]. The biggest disadvantage of statistical models may be lack of familiarity to those more comfortable with symbolic approaches. The following sections outline how cultural and technical challenges are being met through a variety of approaches to speech, NL understanding, and their integration.

3 Speech Recognition Overview
For several years, the best performing speech recognition systems have been based on statistical pattern matching techniques [Pallett et al., 1990; Pallett, 1991; Pallett et al., 1992, 1993, 1994, 1995]. The most commonly used method is probably hidden Markov models (HMMs) (see, e.g. [Bahl et al., 1983; Rabiner, 1989; Picone, 1990]), although there is significant work using other pattern matching techniques (see, e.g. [Ostendorf and Roukos, 1989; Zue et al., 1992]), including neural network-based approaches (see, e.g. [Hampshire and Waibel, 1990]) and hybrid HMM-neural network approaches (see, e.g. [Abrash et al., 1994]). One can think of the symbolic components as representing our knowledge, and of the statistical components as representing our ignorance. The words, phones, and states chosen for the model are manipulated symbolically. Statistical methods are used to estimate automatically those aspects we cannot or do not want to model explicitly. Typically, development of recognition systems involves several issues. Samples are outlined below.

Feature Selection and Extraction If the raw speech waveform is simply sampled in time and amplitude, there is far too much data; some feature extraction is needed. The most common features extracted are cepstral coefficients (derived from a spectral analysis), and derivatives of these coefficients. Although there has been some incorporation of knowledge of the human auditory system into feature extraction work, little has been done since the 1970s in
implementing linguistically motivated features (e.g., high, low, front, back) in a recognition system. (See, however, the work of Ken Stevens and colleagues for significant work in this area not yet incorporated in automatic speech recognition systems [Stevens et al., 1992].) A representation of phones in terms of a small set of features has several advantages in speech recognition: fewer parameters could be better estimated given a fixed corpus; phones that are rare or unseen in the corpus could be estimated on the basis of the more frequently occurring features that compose them; and since features tend to change more slowly than phones, it is possible that sampling in time could be less frequent.

Acoustic and Phonetic Modeling  A Markov model represents the probabilities of sequences of units, for example, words or sounds. The "hidden" Markov model, in addition, models the uncertainty of the current "state." By analogy with speech production, and using phones as states, the mechanism can be thought of as modeling two probabilities associated with each phone: the probability of the acoustics given the phone (to model the variability in the realization of phones), and the probability of transition to another phone given the current phone. Though some HMMs are used this way, most systems use states that are smaller than a phone (e.g., first, middle, and last part of a phone). Such models have more parameters, and hence can provide greater detail. Adding skips and loops to the states can model the temporal variability of the realization of phones. Given the model, parameters are estimated automatically from a corpus of data. Thus, models can be "tuned" to a particular (representative) sample, an important attribute for porting to new domains.

Model Inventory  Although many systems model phones, or phones conditioned on the surrounding phonetic context, others claim improved performance through the selection of units or combination of units determined automatically or semiautomatically (see, e.g., [Bahl et al., 1991]). The SRI system combines phone models based on a hierarchy of linguistic contexts differing in detail, combined as a function of the amount of training data for each (see [Butzberger et al., 1992]).

Distributions  In the HMM formulation, the state output distributions have been a topic of research interest. Generally speaking, modeling more detail improves performance, but requires more parameters to estimate, which in turn requires more data for robust estimation. Methods have been developed to reduce the number of parameters to estimate without degrading accuracy, some
of which include constraints based on phonetics. See examples in [Kimball and Ostendorf, 1993] and [Digalakis and Murveit, 1994].

Pronunciation Modeling  Individual HMMs for phones can be concatenated to model words. Linguistic knowledge, perhaps in the form of a dictionary or of rules, typically determines the sequence of phones that make up a word. Linguistic knowledge in the form of phonological rules can be used to model possible variations in pronunciation, such as the flap or stop realization of /t/. For computational efficiency (at the expense of storage), additional pronunciations can be added to the dictionary. This solution is not ideal for the linguist, since different pronunciations of the same word are treated as totally independent even though they may share all but one or two phones. It is also not an ideal engineering solution, since recognition accuracy may be lost depending on the implementation, since words with more pronunciations may be disfavored relative to those with few pronunciations. The work of Cohen (e.g., [Cohen et al., 1987]; Cohen, 1989) and others (see, e.g., [Withgott and Chen, 1993]) addresses some of these issues, but this area could likely benefit greatly from a better integration of symbolic knowledge with statistical models.
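The following is a minimal sketch, under invented assumptions, of the dictionary-expansion idea just described: a phonological rule (here, flapping of intervocalic /t/) is applied to a base pronunciation to generate additional dictionary entries. The phone symbols and the vowel set are illustrative only, not those of any system cited above.

```python
# A minimal sketch of expanding dictionary pronunciations with a phonological
# rule: /t/ between vowels may surface as a flap ("dx" in ARPAbet-style
# notation). Phone inventory and example entry are invented.
VOWELS = {"aa", "ae", "ah", "ax", "eh", "ih", "iy", "uw"}

def flap_variants(phones):
    """Return the base pronunciation plus variants with intervocalic /t/ flapped."""
    variants = [list(phones)]
    for i in range(1, len(phones) - 1):
        if phones[i] == "t" and phones[i - 1] in VOWELS and phones[i + 1] in VOWELS:
            flapped = [list(v) for v in variants]
            for v in flapped:
                v[i] = "dx"
            variants.extend(flapped)
    return [" ".join(v) for v in variants]

print(flap_variants(["l", "ae", "t", "ax", "r"]))  # "latter": base form and flapped form
```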
Language Modeling  Any method that can be used to constrain the sequence of occurring words can be thought of as a language model. Modeling sequences of words the way word pronunciations are typically modeled (i.e., a dictionary of all possible sequences) is not a solution a linguist or an engineer would propose (except for the most constrained applications). A simple alternative is to model all words in parallel and add a loop from the end to the beginning, where one of the "words" is the "end-of-sentence" word so that the sentences are not infinitely long. Of course, this simple model has the disadvantage of assuming that the ends of all words are equivalent (the same state). This model assumes that at each point in an utterance, all words are equally likely, which is not true of any human language. Alternatively, Markov models can be used to estimate the likelihoods of words given the previous word (or N words or word classes), based on a training corpus of sentence transcriptions. Except for the intuition that some sequences are more likely than others, little linguistic knowledge is used. That intuition is difficult to call "linguistic" since, although there may be some recognition of doubtful cases, grammaticality has traditionally been a binary decision for many linguists. This will likely change as linguists begin to look at spontaneous speech data. Statistical modeling of linguistically relevant relationships (e.g., number agreement of subject and verb; or co-occurrences of adjectives with nouns, which may be an arbitrary
number of words away from each other) is a growing area of interest. For examples, see the numerous papers on this topic in the (D)ARPA, Eurospeech, and International Conference on Spoken Language Processing (ICSLP) proceedings over the past several years.

Search  Given the acoustic models, the language models, and the input speech, the role of the recognizer is to search through all possible hypotheses and find the best (most likely) string of words. As the acoustic and language models become more detailed they become larger, and this can be an enormous task, even with increasing computational power. Significant effort has been spent on managing this search. Recent innovations have involved schemes for making multiple passes, using coarser models at first to narrow the search and progressively more detailed models to further narrow the pruned search space (see, e.g., [Murveit et al., 1993; Nguyen et al., 1993]). Typically, more extensive linguistic knowledge is more expensive to compute and is saved for later stages. The "N-best" approaches used for integration of speech and natural language (see section 5) have also been used to improve speech recognition.
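As a small illustration of the Markov word-sequence model described in the Language Modeling paragraph above, the sketch below estimates word-bigram probabilities from sentence transcriptions; add-one smoothing stands in for the more careful estimation used in real systems, and the tiny training sentences are invented.

```python
# A minimal sketch of a word-bigram language model trained from transcriptions.
# Smoothing and data are illustrative only.
from collections import defaultdict
import math

def train_bigram(sentences):
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, word)] += 1
    vocab = len({w for s in sentences for w in s.split()} | {"</s>"})
    def logprob(prev, word):
        # add-one smoothed estimate of P(word | prev)
        return math.log((bigram[(prev, word)] + 1) / (unigram[prev] + vocab))
    return logprob

lp = train_bigram(["show me flights to boston", "show flights from denver"])
print(lp("show", "me"), lp("show", "flights"))
```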
4 Natural Language Understanding

Traditional approaches to NL understanding have been based in symbolic logic, using rule-based approaches typically involving large sets of handcrafted rules. However, since the first joint meeting of the speech and NL communities in 1989, the number of papers and the range of topics addressed using statistical methods have steadily increased. At the last two meetings, the category of statistical language modeling and methods received the most abstracts and was one of the most popular sessions.

In the merger of speech with NL, the traditional computational linguistic approach of covering a set of linguistically interesting examples was put to a severe test in the attempt to cover, in a limited domain, a set of utterances produced by people engaged in problem-solving tasks. Several new sources of complexity were introduced: the move to an empirically based approach (covering a seemingly endless number of "simple" things became more important than covering the "interesting," but more rare, complex phenomena), the separation of test and training materials (adding rules to cover phenomena observed in the training corpus may or may not affect coverage on an independent test corpus), the nature of spontaneous speech (which has a different, and perhaps more creative, structure than written language, previously the focus of much NL work), and recovery from errors that can occur in recognition or by the fact
that talkers do not always produce perfectly fluent, well-formed utterances. Many of the advantages of statistical approaches (as outlined above) are appropriate for dealing with these issues. The growing tendency to combine symbolic with statistical and engineering approaches, based on recent papers, is represented in several research areas, described below.

Lexicon  Although speech recognition components usually use a lexicon, lexical tools in NL are more complex than lists of words and pronunciations. Different formalisms store different types and formats of information, including, for example, morphological derivations, part-of-speech information, and syntactic and semantic constraints on combinations with other words. Recently, there has been work in using statistical information in lexical work. See, for example, the use of sense frequencies for word sense disambiguation in [Miller et al., 1994].

Grammar  An NL grammar has traditionally been a set of rules devised by observation of or intuitions concerning patterns in a language or sublanguage. Typically, such grammars have either accepted a sentence or rejected it, although grammars that degrade more gracefully in the face of spontaneous speech and recognition errors are being developed (see, e.g., [Hindle, 1992]). Based on the grammar used, the goal of parsing is to retrieve or assign a structure to a string of words for use by a later stage of processing. Traditionally, parsers have worked deterministically on a single string of input. When parsers were faced with typed input, aside from the occasional typographical error, the intended words were not in doubt. The merger of NL with speech recognition has forced NL components to consider speech disfluencies, novel syntactic constructions, and recognition errors. The indeterminacy of the input and the need to analyze various types of ill-formed input have led to an increased use of statistical methods. The (D)ARPA, Eurospeech, and ICSLP proceedings of recent years contain several examples of combining linguistic and statistical components in grammars, parsers, and part-of-speech taggers.

Interpretation  Interpretation is the stage at which a representation of meaning is constructed, and may occur at different stages in different systems. Of course, this representation is not of much use without a "back-end" that can use the representation to perform an appropriate response, for example, retrieve a set of data from a database, ask for more information, etc. This stage is typically purely symbolic, though likelihoods or scores of plausibility may be used. See also the work on sense disambiguation mentioned above. Some work has
been devoted to probabilistic semantic grammars (see [Seneff, 1992]) and to "hidden understanding" (see [Miller et al., 1995]).

5 Integration of Speech Recognition and Natural Language Understanding

The integration of speech with NL has several important advantages: (1) to NL understanding, speech recognition can bring prosodic information, information important for syntax and semantics but not well represented in text; (2) NL can bring to speech recognition several knowledge sources (e.g., syntax and semantics) not previously used (N-grams model only local constraints, and largely ignore systematic constraints such as number agreement); (3) for both, the integration affords the possibility of many more applications than could otherwise be envisioned, and the acquisition of new techniques and knowledge bases not previously represented.

Although there are many advantages, integration of speech and NL gives rise to some new challenges, including integration strategies, the effective use in NL of a new source of information from speech (prosody, in particular), and the handling of spontaneous speech effects. Prosody and disfluencies are especially important issues in the integration of speech and NL since the evidence for them is distributed throughout all linguistic levels, from phonetic to at least the syntactic and semantic levels. Integration strategies, prosody, and disfluencies are described briefly below (an elaboration appears in [Price, 1995]).

Integration  There is much evidence that human speech understanding involves the integration of a great variety of knowledge sources, and in speech recognition tighter integration of components has consistently led to improved performance. However, as grammatical coverage increases, standard NL techniques can become computationally difficult and provide less constraint for speech. On the other hand, a simple integration by concatenation is suboptimal because any speech recognition errors are propagated to the NL system and the speech system cannot take advantage of the NL knowledge sources. In the face of cultural and technical difficulties with tight integration and the limitations of a simple concatenation, "N-best" integration has become popular: the connection between speech and NL can be strictly serial, but fragility problems are mitigated by the fact that speech outputs not one but many hypotheses. The NL component can then use other knowledge sources to determine the best-scoring hypothesis. The (D)ARPA, Eurospeech, and ICSLP proceedings over the past several years contain several examples of the N-best approach. In addition, the
special issue of Speech Communication on spoken dialogue [Shirai and Furui, 1994] contains several contributions on this topic.
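The sketch below illustrates the N-best strategy in schematic form: the recognizer proposes several scored hypotheses, and an NL component rescores them with its own knowledge source. The scoring functions, weight, and hypotheses are stand-ins invented for this example, not those of any particular system.

```python
# A schematic sketch of N-best integration: rescore recognizer hypotheses
# with an NL knowledge source. All scores and examples are invented.
def nl_score(words):
    """Toy NL knowledge source: reward hypotheses the parser/semantics accept."""
    return 5.0 if words.startswith("show me") else 0.0

def rescore_nbest(nbest, nl_weight=1.0):
    # nbest: list of (recognizer_log_score, word_string), best-first
    return max(nbest, key=lambda h: h[0] + nl_weight * nl_score(h[1]))

nbest = [(-42.0, "show knee flights to boston"),
         (-43.5, "show me flights to boston")]
print(rescore_nbest(nbest)[1])  # the NL score overturns the recognizer's first choice
```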
Prosody  Prosody can be defined as the suprasegmental information in speech; that is, information that cannot be localized to a specific sound segment, or information that does not change the segmental identity of speech segments. Prosodic information is not generally available in text-based systems, except insofar as punctuation may indicate some prosodic information. Prosody can provide information about syntactic structure, discourse, and emotion and attitude. A survey of combining statistical with linguistic methods in prosody appears in [Price and Ostendorf, 1995].

Spontaneous Speech  The same acoustic attributes that indicate much of the prosodic structure (pitch and duration patterns) are also very common in aspects of spontaneous speech that seem to be more related to the speech planning process than to the structure of the utterance. Disfluencies are common in normal speech. However, modeling of speech disfluencies is only beginning (see [Shriberg et al., 1992; Lickley, 1994; Shriberg, 1994]). The distribution of disfluencies is not random, and may be a part of the communication itself. Although disfluencies tend to be less frequent in human-computer interactions than in human-human interactions, as people become increasingly comfortable with human-computer interactions and concentrate more on the task at hand than on monitoring their speech, disfluencies can be expected to increase.
6 Current Challenges

Although progress has been made in recent years in balancing symbolic with statistical methods in speech and language research, important challenges remain. A few of the challenges for speech recognition, for NL, and for their integration are outlined below.
6.1 Speech Recognition Challenges

Some of our knowledge, perhaps much of our knowledge, about speech has not been incorporated in automatic speech recognition systems. For example, the notion of a prototype and distance from a prototype (see, e.g., [Massaro, 1987; Kuhl, 1990]), which seems to explain much data from speech perception (and other areas of perception), is not well modeled in the current speech recognition frameworks. A person who has not been well understood tends to change his or her speech style so as to be better understood. This may involve
speaking more loudly or more clearly, changing the phrasing, or perhaps even leaving pauses between words. These changes may help in human-human communication, but in typical human-machine interactions, they result in forms that are more difficult for the machine to interpret. The concept of a prototype in machine recognition could lead to more robust recognition technology. That is, the maximum-likelihood approaches common in statistical methods for speech recognition miss a crucial aspect of language: the role of contrast. A given linguistic entity (e.g., phone) is characterized not just by what it is but also by what it is not, that is, the system of contrast in which it is involved. Thus, hyperarticulation may aid communication over noisy telephone lines for humans, but may decrease the performance of recognizers trained on a corpus in which this style of speech is rare or missing. The results can be disastrous for applications, since when a recognizer misrecognizes, a common reaction is to hyperarticulate ([Shriberg et al., 1992]).

Although many factors affect how well a system will perform, examining recent benchmark evaluations can give an idea of the relative difficulty of various aspects of speech (see, e.g., [Pallett et al., 1995]). Such areas might be able to take advantage of increased linguistic knowledge. For example, the variance across the talkers used in the test set was greater than the variance across the systems tested. Further, the various systems tested had the highest error rates for the same three talkers, who were the fastest talkers in the set. These observations could be taken as evidence that variability in pronunciation, at least insofar as fast speech is concerned, is not yet well modeled.

6.2 Natural Language Challenges

Results in NL understanding have been more resistant to quantification than those in speech recognition; people agree more on what string of words was said than on what those words mean. Evaluation is important to scientific progress, but how do we evaluate an understanding system if we are unable to agree on what it means to understand? In the DARPA community, this question has been postponed somewhat by agreeing to evaluate on the answer returned from a database. Trained annotators examine the string of words (NL input) and use a database extraction tool to extract the minimum and maximum accepted set of tuples from the evaluation database. A "comparator" then automatically determines whether a given answer is within the minimum and maximum allowed. The community is not, however, content with the current expense and limitations of the evaluation method described above, and is investing significant resources in finding a better solution. Key to much of the debate is the cultural gap: engineers are uncomfortable with evaluation measures that cannot be
automated (forgetting the role of the annotator in the current process); linguists are uncomfortable with evaluations that are not diagnostic; and, of course, neither side wants significant resources to go to evaluation that would otherwise go to research.
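A minimal sketch of the comparator logic just described follows: an answer, treated as a set of database tuples, is accepted if it contains everything in the minimal reference set and nothing outside the maximal one. The tuples are invented for illustration.

```python
# A minimal sketch of a min/max answer comparator. Reference tuples are invented.
def answer_accepted(answer, min_tuples, max_tuples):
    answer, min_tuples, max_tuples = set(answer), set(min_tuples), set(max_tuples)
    # accept iff the answer covers the minimal set and stays within the maximal set
    return min_tuples <= answer <= max_tuples

min_ref = {("UA101", "7:00")}
max_ref = {("UA101", "7:00"), ("UA101", "Boston", "7:00")}
print(answer_accepted({("UA101", "7:00")}, min_ref, max_ref))   # True
print(answer_accepted({("DL22", "9:00")}, min_ref, max_ref))    # False
```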
6.3 Integration Challenges

In fact, most of this chapter has addressed the challenge of integrating speech with NL, and much of the challenge has been argued to be related to cultural differences as much as to technical demands. As argued in Price and Ostendorf [1995], the increasingly popular classification and regression trees, or decision trees (see, e.g., [Breiman et al., 1984]), appear to be a particularly useful tool in bridging the cultural and technical gap in question. In this formalism, the speech researcher or linguist can specify [...]
Acknowledgments

I thank Mari Ostendorf for her useful comments on the manuscript. I gratefully acknowledge the support of DARPA/ONR Contract ONR N00014-90-C-0085,
and DARPA/NSF funding through NSF Grant IRI-8905249. The opinions expressed are mine and not necessarily those of the funding agencies.

References

Abrash, V., M. Cohen, H. Franco, and I. Arima (1994). Incorporating Linguistic Features in a Hybrid HMM/MLP Speech Recognizer. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 62.8.1-4.

Bahl, L., P. de Souza, P. Gopalakrishnan, D. Nahamoo, and M. Picheny (1991). Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 264-269.

Bahl, L., F. Jelinek, and R. L. Mercer (1983). A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 2, 179-190.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Monterey, Calif., Wadsworth and Brooks/Cole Advanced Books and Software.

Butzberger, J., H. Murveit, E. Shriberg, and P. Price (1992). Spontaneous Speech Effects in Large Vocabulary Speech Recognition Applications. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 339-343.

Cohen, M. (1989). Phonological Structures for Speech Recognition. Ph.D. thesis, Department of Computer Science, University of California at Berkeley. Ann Arbor, University of Michigan Microfilms.

Cohen, M., G. Baldwin, J. Bernstein, H. Murveit, and M. Weintraub (1987). Studies for an Adaptive Recognition Lexicon. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 49-55.

Digalakis, V., and H. Murveit (1994). An Algorithm for Optimizing the Degree of Tying in a Large Vocabulary Hidden Markov Model Based Speech Recognizer. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 54.2.1-4.

Hampshire, J., and A. Weibel (1990). Connectionist Architectures for Multi-Speaker Phoneme Recognition. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, Morgan Kaufmann.

Hindle, D. (1992). An Analogical Parser for Restricted Domains. Proceedings of the ARPA Human Language Technology Workshop, pp. 150-154.

Hirschberg, J. (1993). Pitch Accent in Context: Predicting Prominence from Text. Artificial Intelligence, 63: 305-340.

Kimball, O., and M. Ostendorf (1993). On the Use of Tied-Mixture Distributions. Proceedings of the ARPA Human Language Technology Workshop, pp. 102-107.

Klatt, D. (1977). Review of the ARPA Speech Understanding Project. Journal of the Acoustical Society of America, 62: 1345-1366.

Kuhl, P. (1990). Towards a New Theory of the Development of Speech Perception. Proceedings of the International Conference on Spoken Language Processing, 2, pp. 745-748.
Lickley, R. J. (1994). Detecting Disfluency in Spontaneous Speech. Doctoral dissertation, University of Edinburgh, Scotland.

Massaro, D. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Hillsdale, N.J., Erlbaum.

Miller, S., M. Bates, and R. Schwartz (1995). Recent Progress in Hidden Understanding Models. Proceedings of the ARPA Human Language Technology Workshop, pp. 276-280.

Miller, G., M. Chodorow, S. Landes, C. Leacock, and R. Thomas (1994). Using a Semantic Concordance for Sense Identification. Proceedings of the ARPA Human Language Technology Workshop, pp. 240-243.

Murveit, H., J. Butzberger, V. Digalakis, and M. Weintraub (1993). Large Vocabulary Dictation Using SRI's DECIPHER Speech Recognition System: Progressive Search Techniques. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. II-319-322.

Nguyen, L., R. Schwartz, F. Kubala, and P. Placeway (1993). Search Algorithms for Software-Only Real-Time Recognition with Very Large Vocabularies. Proceedings of the ARPA Human Language Technology Workshop, pp. 91-95.

Ostendorf, M., and S. Roukos (1989). A Stochastic Segment Model for Phoneme-Based Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, December, pp. 1857-1869.

Ostendorf, M., and N. Veilleux (1994). A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location. Computational Linguistics, Vol. 20, No. 1, pp. 27-54.

Pallett, D. (1991). DARPA Resource Management and ATIS Benchmark Test Poster Session. Proceedings of the Speech and Natural Language Workshop, Morgan Kaufmann, pp. 49-58.

Pallett, D., N. Dahlgren, J. Fiscus, W. Fisher, J. Garofolo, and B. Tjaden (1992). DARPA February 1992 ATIS Benchmark Test Results. Proceedings of the Speech and Natural Language Workshop, San Mateo, Calif., Morgan Kaufmann, pp. 15-27.

Pallett, D., J. Fiscus, W. Fisher, and J. Garofolo (1993). Benchmark Tests for the DARPA Spoken Language Program. Proceedings of the Human Language Technology Workshop, San Francisco, Morgan Kaufmann, pp. 7-18.

Pallett, D., J. Fiscus, W. Fisher, J. Garofolo, B. Lund, A. Martin, and M. Przybocki (1995). 1994 Benchmark Tests for the ARPA Spoken Language Program. Proceedings of the Human Language Technology Workshop, Morgan Kaufmann, pp. 5-36.

Pallett, D., J. Fiscus, W. Fisher, J. Garofolo, B. Lund, and M. Przybocki (1994). 1993 Benchmark Tests for the ARPA Spoken Language Program. Proceedings of the Human Language Technology Workshop, San Francisco, Morgan Kaufmann, in press.

Pallett, D., W. Fisher, J. Fiscus, and J. Garofolo (1990). DARPA ATIS Test Results. Proceedings of the Speech and Natural Language Workshop, Morgan Kaufmann, pp. 114-121.
Picone, J. (1990). Continuous Speech Recognition Using Hidden Markov Models. IEEE ASSP Magazine, pp. 26-41.

Price, P. (1995). Spoken Language Understanding. In R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, editors, Survey of the State of the Art in Human Language Technology. Center for Spoken Language Understanding, Oregon Graduate Institute, pp. 48-55.

Price, P., and M. Ostendorf (1995). Combining Linguistic with Statistical Methods in Modeling Prosody. In J. L. Morgan and K. Demuth, editors, Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition. Hillsdale, N.J., Erlbaum.

Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77, 2, 257-286.

Seneff, S. (1992). TINA: A Natural Language System for Spoken Language Applications. Computational Linguistics, 18: 61-86.

Shirai, K., and S. Furui, guest editors (1994). Special Issue on Spoken Dialogue. Speech Communication, 15(3-4).

Shriberg, E., J. Bear, and J. Dowding (1992). Automatic Detection and Correction of Repairs in Human-Computer Dialog. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 419-424.

Shriberg, E., E. Wade, and P. Price (1992). Human-Machine Problem Solving Using Spoken Language (SLS) Systems: Factors Affecting Performance and User Satisfaction. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 49-54.

Shriberg, E. E. (1994). Preliminaries to a Theory of Speech Disfluencies. Doctoral dissertation, Stanford University, Stanford, Calif.

Stevens, K., S. Manuel, S. Shattuck-Hufnagel, and S. Liu (1992). Implementation of a Model for Lexical Access Based on Features. In Proceedings of the 1992 International Conference on Spoken Language Processing, Banff, vol. 1, pp. 499-502.

Wang, M., and J. Hirschberg (1991). Predicting Intonational Boundaries Automatically from Text: The ATIS Domain. Proceedings of the DARPA Speech and Natural Language Workshop, pp. 378-383.

Withgott, M., and F. Chen (1993). Computational Models of American Speech. CSLI Lecture Notes No. 32.

Zue, V., J. Glass, D. Goddeau, D. Goodine, L. Hirschman, H. Leung, M. Phillips, J. Polifroni, and S. Seneff (1992). The MIT ATIS System: February 1992 Progress Report. Proceedings of the ARPA Human Language Technology Workshop, pp. 84-88.
Chapter 7
Exploring the Nature of Transformation-Based Learning
Lance A. Ramshaw and Mitchell P. Marcus
Transformation-based, error-driven learning [Brill, 1993b] is a corpus-based method that has provided interesting results on a variety of tasks; the most notable of these is its application to part-of-speech tagging, owing to both the quality of the results and the existence of a publicly available implementation of the algorithm [Brill, 1995]. The overall approach can be distinguished from most other forms of corpus-based learning by its simplicity, and by the fact that it is based on counting but not on explicit probability estimates.

In "Exploring the Nature of Transformation-Based Learning," Ramshaw and Marcus provide a detailed investigation of the algorithm's properties: how it differs from statistical models based on explicit probability estimates such as hidden Markov models; how it relates to work in probabilistic machine learning such as decision trees; and why it exhibits less of a tendency to overtrain than other supervised learning techniques. In the process, they elucidate their view of transformation-based learning as a "compromise" method involving both a statistical and a symbolic component. Crucially, and unlike most other hybrid approaches, the quantitative component of the algorithm makes its contribution only during training: the learner produces rules that are purely symbolic, interpretable, and modifiable. --Eds.
1 Introduction
Purely statistical methods like hidden Markov models (HMMs) have been applied with considerable success to linguistic problems like part-of-speech tagging. Recently, however, hybrid approaches have also begun to be explored that combine elements of symbolic knowledge with statistical learning algorithms. Such mixed models derive part of their power from the initial specification, determined by linguistic principles, of which factors should be included in the statistical search space. They also use representations that are more accessible
to symbolic interpretation than is true of the purely statistical models. These models thus make it possible to bring linguistic knowledge to bear both in defining and tuning the model and in analyzing its results. This chapter focuses on transformation-based learning, which is one of these new compromise methods, and explores how it differs from existing, more purely statistical approaches.

Eric Brill in his thesis [Brill, 1993b] proposed "transformation-based error-driven learning" as a novel method for statistically deriving linguistic models from corpora. The technique has since been applied in various domains including part-of-speech tagging [Brill, 1992, 1994], building phrase structure trees [Brill, 1993a], text chunking [Ramshaw and Marcus, 1995], and resolving prepositional phrase attachment ambiguity [Brill and Resnik, 1994]. The method works by learning a sequence of symbolic rules that characterize important contextual factors and using those rules to predict a most likely value. The search for such factors only requires counting the instances of various sets of events that actually occur in a training corpus; the method is thus able to survey a larger space of possible contextual factors than could easily be captured by a statistical model that required explicit probability estimates for every possible combination of factors. Brill's results on part-of-speech tagging suggest that the method can achieve performance comparable to that of the HMM techniques widely used for that task, while also providing more compact and perspicuous models. Roche and Schabes [1995] have also shown that such models can be implemented as finite-state transducers whose speed is dominated by the access time of mass storage devices.

We have explored this new technique through a series of instrumented part-of-speech tagging experiments, using as data the tagged Brown Corpus [Francis and Kucera, 1979] and a tagged Septuagint Greek version of the first five books of the Bible [CATSS, 1991]. After briefly explaining the transformation-based learning approach and describing a new, fast implementation technique, this chapter uses the results of these tests to explore the differences between this technique and purely statistical models like HMMs. We also compare transformation-based learning with other partially symbolic methods like decision trees and decision lists, which are similar to it in that they survey a wide space of possible factors, initially identified using symbolic knowledge, in order to select factors to add to the model. These contrasts highlight the kinds of applications for which transformation-based learning is especially suited, and help to explain how it manages to largely avoid the difficulties with overtraining that affect the other approaches. We also describe a way of recording the dependencies between rules in the learned sequence that may be useful for further analysis.
2 Brill's Approach
As shown schematically in figure 7.1, transformation-based learning begins with a small, supervised training corpus, for which the correct tags are known. The first step is to use some baseline heuristic to select an initial "current guess" tag for each word (ignoring for the moment the known correct answers). In the part-of-speech tagging application, a plausible baseline heuristic might be to assign to each known word whose part of speech is ambiguous whatever tag is most often correct for that word in the training corpus, and to tag all unknown words as nouns. (Brill's results point out that performance on unknown words is a crucial factor for part-of-speech tagging systems. His system is therefore organized in two separate transformation-based training passes, with one important purpose of the first pass being exactly to predict the part of speech of unknown words. However, because the focus in these experiments is on understanding the mechanism rather than on comparative performance, the simple but unrealistic assumption of a closed vocabulary is made.)
Figure 7.1 Learning a transformation-based model.
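The following is a minimal sketch of the baseline heuristic described above: each known word receives its most frequent training-corpus tag, and unknown words default to the noun tag. The tiny training data are invented; only the heuristic itself follows the chapter.

```python
# A minimal sketch of the baseline tagging heuristic. Training pairs are invented.
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # most frequent tag seen for each known word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, lexicon, default="NN"):
    return [lexicon.get(w, default) for w in words]

lexicon = train_baseline([("the", "AT"), ("can", "MD"), ("can", "NN"), ("can", "MD")])
print(baseline_tag(["the", "can", "rusted"], lexicon))  # ['AT', 'MD', 'NN']
```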
The method then learns a series of transformational rules that iteratively improve those initial baseline guesses. The space of possible rules that the algorithm searches is defined by a set of rule templates that select particular characteristics of particular words in the neighborhood of a given word as the grounds for changing the current tag at that location. For part-of-speech tagging, the rule templates typically involve either the actual words or the tags currently assigned to words within a few positions on each side of the location to be changed. The rule templates used in these experiments involve up to two of the currently assigned tags on each side of the tag being changed; they include [- C A/B - -] (change tag A to tag B if the previous tag is C) and [- - A/B C D] (change A to B if the following two tags are C and D). During training, instantiated rules like [- DET V/N - -] are built by matching these templates against the training corpus.

A set of such templates combined with the given part-of-speech tag set (and vocabulary, if the rule patterns also refer directly to the words) defines a large space of possible rules; the training process operates by using some ranking function to select at each step some rule judged likely to improve the current tag assignment. Brill suggests the simple ranking function of choosing (one of) the rule(s) that makes the largest net improvement in the current training set tag assignments. Note that applying a rule at a location can have a positive effect (changing the current tag assignment from incorrect to correct), a negative one (from correct to some incorrect value), or can be a neutral move (from one incorrect tag to another). Rules with the largest positive-minus-negative score cause the largest net benefit. In each training cycle, one such rule is selected and applied to the training corpus and then the scoring and selection process is repeated on the newly transformed corpus. This process is continued either until no beneficial rule can be found, or until the degree of improvement becomes less than some specified threshold. The process of scoring rule candidates is tractable despite the huge space of possible rules because rules that never apply positively can be ignored.

The final model is thus an ordered sequence of pattern-action rules. As shown in figure 7.2, it is used for prediction on a test corpus by beginning with the predictions of the baseline heuristic and then applying the transformational rules in order.

In our test runs, seven templates were used during training to define the space of possible rules: three templates testing the tags of the immediate, next, and both neighbors to the left; three similar templates looking to the right; and a seventh template that tests the tags of the immediate left and right neighbors.
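As a simplified sketch of the scoring just described, the code below counts positive, negative, and neutral changes for candidate rules of the single previous-tag template and selects the rule with the largest net benefit. The data structures and tag sequences are invented and much simpler than Brill's implementation.

```python
# A simplified sketch of ranking candidate rules by net benefit.
# A rule (A, B, C) means: change tag A to tag B if the previous tag is C.
def score_rule(rule, current, correct):
    a, b, prev_c = rule
    pos = neg = neut = 0
    for i in range(1, len(current)):
        if current[i] == a and current[i - 1] == prev_c:
            if correct[i] == b:
                pos += 1        # an incorrect tag would become correct
            elif correct[i] == a:
                neg += 1        # a correct tag would become incorrect
            else:
                neut += 1       # one incorrect tag exchanged for another
    return pos - neg, (pos, neg, neut)

current = ["AT", "TO", "VB", "AT", "TO", "NN"]
correct = ["AT", "IN", "VB", "AT", "IN", "NN"]
candidates = [("TO", "IN", "AT"), ("VB", "NN", "TO")]
best = max(candidates, key=lambda r: score_rule(r, current, correct)[0])
print(best, score_rule(best, current, correct))
```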
Figure 7.2 Applying a transformation-based model.

The first 10 rules learned from a training run across a 50K-word sample of the Brown Corpus are listed in figure 7.3; they closely replicate Brill's original results [Brill, 1993b], allowing for the fact that his tests used more templates, including templates like "if any one of the three previous tags is A."

In later work, Brill [1994] improved the performance of his transformation-based tagger by adding rule templates (not duplicated here) that were sensitive not just to the tags of neighboring words but also to the actual lexical items used, and by having the system learn a separate, initial set of rules to improve the system's baseline guess as to the tags of words that never appeared in the training data. His comparative results indicate that this approach can achieve a level of performance on part-of-speech tagging that is at least on a par with that of the HMM approaches that are frequently used [Jelinek, 1985; Church, 1988; DeRose, 1988; Cutting et al., 1992], as well as showing promise for other applications. The resulting model, encoded as a list of rules, is also typically more compact and for many purposes more easily interpretable than a table of HMM probabilities.
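For completeness, here is a minimal sketch of prediction with a learned rule sequence, as depicted in figure 7.2: starting from the baseline tags, each transformation is applied, in the order learned, wherever its pattern matches. The rules shown are invented and use the same previous-tag template as the sketch above.

```python
# A minimal sketch of applying a learned transformation sequence at test time.
def apply_rules(tags, rules):
    tags = list(tags)
    for a, b, prev_c in rules:          # rules in the order they were learned
        for i in range(1, len(tags)):
            if tags[i] == a and tags[i - 1] == prev_c:
                tags[i] = b
    return tags

print(apply_rules(["AT", "TO", "VB"], [("TO", "IN", "AT")]))  # ['AT', 'IN', 'VB']
```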
Figure 7.3 First 10 rules learned on Brown Corpus sample.
3 An Incremental Algorithm

It is worthwhile noting first that it is possible in some circumstances to significantly speed up the straightforward algorithm described above. An improvement in our experiments of almost two orders of magnitude was achieved by using an incremental approach that maintains lists of pointers to link rules with the sites in the training corpus where they apply, rather than scanning the corpus from scratch each time. The improvement is particularly noticeable in the later stages of training, when the rules being learned typically affect only one or two sites in the training corpus. Note, however, that the linked lists in this incremental approach do require a significant amount of storage space. Depending on the number of possible rules generated by a particular combination of rule templates and training corpus, space constraints may not permit this optimization.

Incrementalizing the algorithm requires maintaining a list for each rule generated of those sites in the corpus where it applies, and a list for each site of the rules that apply there. Once one of the highest-scoring rules is selected, its list of site pointers is first used to make the appropriate changes in the current tag values in the corpus. After making the changes, that list is used again in order to update other rule pointers that may have been affected by them. It suffices to check each site within the span of the largest defined rule template from each changed site, testing to see whether all of its old rule links are still active,
and whether any new rules now apply at that site. This algorithm is shown in figure 7.4. Note that after the initial setup it is only necessary to rescan the corpus when updating uncovers a rule that has not previously had any positive effect.
// Records for locations in the corpus, called "sites",
// include a linked list of the rules that apply at that site.
// Records for rules include score components (positive, negative, and neutral)
// and a linked list of the sites at which the rule applies.
// A hash table stores all rules that apply positively anywhere in the training.

scan corpus using templates, making hash table entries for positive rules
scan corpus again to identify negative and neutral sites for those rules
loop
    high_rule := some rule with maximum score
    if high_rule.score <= 0 then exit loop
    output rule trace
    for each change_site on high_rule.site_list do
        apply high_rule at change_site by changing current tag
    unseen_rules := {}
    for each change_site on high_rule.site_list do
        for each test_site in the neighborhood of change_site do
            new_rules_list := Nil
            for each template do
                if template applies at test_site
                    then add resulting rule to new_rules_list
            for each rule in test_site.rules_list - new_rules_list do
                remove connection between rule and test_site
            for each rule in new_rules_list - test_site.rules_list do
                if rule in hash table
                    then make new connection between rule and test_site
                    else unseen_rules := unseen_rules + {rule}
    if unseen_rules != {} then
        add unseen_rules to hash table
        for each site in corpus do
            for each rule in unseen_rules do
                if rule applies at site then
                    make connection between rule and site
                    adjust appropriate rule score (positive, negative, or neutral)
end loop

Figure 7.4 Incremental version of transformation-based learning algorithm.
4 The Effects of Iterative Transformations
Proceeding now to analysis of the method, many of the unusual features of transformation-based learning, when compared to methods like HMMs or decision trees, are directly related to its iterative character, that it learns rules not to assign tags de novo but to improve at each step on some current tag assignment. As we shall see in this section, this iterative character allows the technique to leverage new rules on the results of earlier ones. However, while this leveraging does provide a way for rules to interact with one another, the degree of that interaction is limited to that which can be mediated by current tag assignments. The iterative character of the method also plays a role in the analysis of different possible rule-ranking metrics and in explaining the limited amount of overtraining that this method seems to encounter.

It will frequently be helpful during the analysis to compare transformation-based learning with the more purely statistical HMM method, and with decision trees and decision lists. Decision trees [Breiman et al., 1984; Quinlan, 1993] are an established method for inducing quasi-symbolic, compact, and interpretable models. Black et al. [1992] have explored the use of decision trees for part-of-speech tagging, citing results that modestly outperformed an HMM model. If the occurrence of a particular tag in a given context is termed an event, their method constructed a decision tree with binary queries at each node that partitioned the set of events into leaves that were as indicative as possible of some particular tag. There are interesting similarities and differences between this approach and transformation-based learning. Yarowsky's use of decision lists for learning linguistic models [Yarowsky, 1993, 1994, 1995] results in a method which, like transformation-based learning, involves making selections from among a large number of possible rules or factors. However, while transformation-based learning works from a set of current guesses, choosing the rule that will most improve the score of the training set, and then repeating that process on the transformed corpus, the decision list technique is noniterative; it simply selects at each ambiguous site the single factor that provides the strongest evidence regarding the correct answer for that site.

4.1 Leveraging on Previous Rules

Some useful evidence for predicting the part of speech of a word in a corpus certainly comes from the identities of the neighboring words used within some window. However, it would also be useful to know the currently predicted tags for those neighboring words, since the tag-assignment problems for nearby words in a corpus are not independent. Since transformation-based learning is
iterative, its rules can take account of the current best guesses for neighboring tags, and thus this method seems particularly well adapted to input that is inherently a sequence of such interrelated problem instances. Because the occurrence patterns for correct tags do depend in part on the unknown part-of-speech values at neighboring locations, it seems useful to allow the rule patterns to be based at each point on the system's best current guess for those values.

With HMM models [Rabiner, 1990], the Viterbi decoding procedure, which uses dynamic programming to efficiently determine the optimal path through the model, automatically allows the model's choice of tags at nearby locations to influence one another. In fact, that influence can be more fine-grained with HMMs than with transformation-based learning, since the HMM model is working with actual probability estimates rather than a single current best guess. However, because HMMs require a fully specified probabilistic model, they are typically more limited in terms of the contextual features represented. Decision trees, on the other hand, are similar to transformation-based learning in being able to scan a large range of contextual features. However, they are traditionally applied to independent problem instances encoded as vectors of measurements for the various possibly relevant factors, and it is not clear in this approach how to allow the decisions at adjacent sites to influence each other. Black et al. [1992] included a leftward context of correct tags in their definition of the events from which their trees were learned, and thus their approach did allow for a one-way, left-to-right influence between sites. Magerman's decision-tree parser [Magerman, 1994, 1995] pushed that even further by exploring multiple paths through the decision space, and allowing each choice to depend on the choices that had been made so far along that particular path. However, it is difficult to exploit both leftward and rightward dependence when using decision trees, since changes in neighboring tag predictions could then force the recomputation of previous predicate splits higher in the tree.

Breaking the tag prediction process up into a series of rules that can each be applied immediately to the entire corpus is a simple scheme that allows the system to base its future learning on the improved estimates of neighborhood tags resulting from the operation of earlier rules. In a rule-based system without that sort of leverage, later rules would have to resolve an ambiguity at a neighboring location as part of a single rule pattern for the primary site, using as evidence only cases where the two occur together, while the iterative approach allows the system to use the best current guess for the neighboring site as part of the evidence for the choice at the primary site.

Intuitively, leveraging thus does appear to add significantly to the power of the rules. It remains an open question how much later rules in the sequences
actually learned do depend on earlier ones, a point that is addressed further in section 5.
4.2 Limited Rule Dependence

While the iterative nature of this method does permit later rules to depend on the tag assignments made by earlier ones, that is the only way in which rules in the sequence can depend on one another, so that the overall character of the technique is still primarily one of independent rules. In this way, the approach is quite different from HMMs, for example, which construct a single monolithic model.

It is interesting to compare transformation-based learning in this regard with decision trees, since both employ models of similar granularity. In the building of a decision tree, an elementary predicate is selected at each step to split a former leaf node, meaning that the new predicate is applied only to those training instances associated with that particular branch of the tree. The two new leaves thus created can be seen as embodying two new classification rules, each one covering exactly the subset of instances that classify to it; each rule's pattern thus includes all of the predicates inherited down that branch of the tree. In the transformation-based learning approach, on the other hand, new rules are generated by applying the templates directly to the entire corpus. There is no corresponding inheritance of earlier predicates down the branches of a tree; the only effect that earlier rules can have on later learning is through changing current tag assignments in the corpus.

It may be helpful in understanding this limited rule dependence mediated by tag assignments to consider a system where the rule templates would be tested each time against the current tag of the word to be changed, but where the rest of the rule pattern would be matched against the initial baseline tags at those locations, rather than the current tags. Earlier rules could then affect later ones at a particular location only by changing the current tag assignment for that location itself. The firing of a rule at a location would make those rules that specify that new tag value as their central pattern element potentially applicable, while disabling those rules whose patterns specify the former tag; the training set at any time during training would thus in effect be partitioned for purposes of rule application into at most as many classes as there are tags. Such a system can be pictured as a lattice with one column for each tag assignment and with a single slanting arc at each generation that moves some corpus locations from one column to another, an architecture that is reminiscent of the pylon structures used by Bahl et al. [1989] to represent the restricted forms of binary queries used in their decision tree approach to language modeling for speech recognition. While
a path in a normal decision tree can encode an arbitrary amount of information in its branching, the paths in a transformation-based system must merge as often as they branch, restricting the amount of information that can be encoded and thus forcing the rules to be more independent.

Because of this limitation on transformation-based learning in terms of the connections between rules that can be constructed during training, any complex predicates that are going to be available must be built into the rule templates. A decision-tree learner, on the other hand, can be provided with elementary predicates; it will construct more complex combinations as it builds the tree. However, this additional power of decision-tree learners must be balanced against both their relative difficulty in taking account of dependence between ambiguities at neighboring training sites and their constant fragmentation of the training set which, as noted below, appears to make them more subject to overtraining. If a transformation-based learner can be provided with an adequate initial set of templates, it has the advantages both of being sensitive to the mutual effects of tag choices at neighboring sites and of avoiding fragmentation of the training set.

4.3 Rule-Ranking Metrics

The iterative nature of transformation-based learning also means that alternative ranking metrics for selecting the best rule in each pass can significantly affect the path taken through the space of possible rule sequence models. The maximum net benefit metric that Brill proposes ranks rules by the improvement they cause on the training set, which is also the resubstitution estimate (a biased one) of the improvement that they would cause on the test set. That is certainly a reasonable choice, and if only one rule were going to be applied, it would be a very persuasive one. However, when selecting early rules in a sequence, it may be helpful to take into account that later rules will have the chance to try to fix any negative changes that this rule causes.

We thus explored a number of alternative metrics that seem to differ from the net benefit metric primarily in focusing more on rules that make many positive changes, and giving less weight to avoiding negative changes. We tried using likelihood ratio scores [Yarowsky, 1993; Dunning, 1993] and J-scores [Smyth and Goodman, 1992], as well as more ad hoc functions such as using a candidate rule's positive score minus some fraction of its negative score. However, while these alternative metrics did sometimes seem to produce better-performing rule sequences, we did not find differences large enough or consistent enough to justify any claim of superiority in general. We hope to continue to explore these matters further.
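The toy comparison below illustrates how the choice of ranking metric can change which rule is selected: the net-benefit metric is contrasted with one of the ad hoc alternatives mentioned above, a positive count minus a fraction of the negative count. The counts and the fraction are invented.

```python
# Toy comparison of rule-ranking metrics over invented (pos, neg, neut) counts.
def net_benefit(pos, neg, neut):
    return pos - neg

def discounted_negatives(pos, neg, neut, frac=0.5):
    return pos - frac * neg

candidates = {"rule_a": (40, 35, 2), "rule_b": (20, 5, 0)}
for metric in (net_benefit, discounted_negatives):
    best = max(candidates, key=lambda r: metric(*candidates[r]))
    print(metric.__name__, best)
# net_benefit favors rule_b; discounting negatives shifts preference toward
# rules that make many positive changes, like rule_a.
```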
Different ranking metrics have also been proposed for choosing which leaves to split in decision trees. A simple approach that is similar to the net benefit metric is to choose on the basis of the immediate improvement in the score on the training set, but a number of other approaches have been explored that pay more attention to the distribution of items among the leaves, using either a diversity index or some information-theoretic measure based on the conditional entropy of the truth given the tree's predictions [Breiman et al., 1984; Buntine, 1992; Quinlan and Rivest, 1989; Quinlan, 1993]. Because successive splits in a decision tree can combine to synthesize a more complex predicate, each proposed split needs to be ranked partly with respect to what future splits it makes possible. It may even be useful to split a node in such a way that both of the new leaves would be assigned the same category as the parent; although such a rule does not change the training set score, it may make it easier for later rules to isolate particular subsets of those sites. In the transformation-based learning approach, however, a rule that did not change tags would have no effect, since earlier rules can only affect later ones by changing tags. This limited dependence between rules in transformation-based learning suggests that there would be less payoff here from a ranking metric that factored in the effect of each proposed rule on future rules.
4.4 Overtraining

One of the interesting features of transformation-based learning in comparison to other tagging models is a surprising degree of resistance to overtraining (or "overfitting"). For example, figure 7.5 shows the graph of percent correct on both the training set (dotted line) and the test set (solid line) as a function of the number of rules applied for a typical part-of-speech training run on 120K words of Greek text. The training set performance naturally improves monotonically, given the nature of the algorithm, but the surprising feature of that graph is that the test set performance seems to rise monotonically to a plateau, except for minor noise. This seems to be true for most of our transformation-based training runs, in marked contrast to similar graphs for decision trees or neural net classifiers or for the iterative EM training of HMM taggers on unsupervised data, where performance on the test set initially improves, but later significantly degrades.

Figure 7.5 Training set (dotted line) and test set (solid line) performance on Greek corpus as a function of the number of rules applied.

Many of these learning methods typically encounter overtraining to an extent significant enough to motivate them to explore special techniques for controlling it. Schaffer [1993] usefully clarified the issues involved, pointing out that techniques to limit overtraining by pruning a decision tree or rule sequence are actually a form of bias in favor of simpler models.
They are useful exactly in those cases where the choice being predicted is likely to be in fact a fairly simple function of the measured features, but where the learning process is capable of exploring very complex models. The later stages of decision-tree training, for example, modify the model in ways that do make it fit the training set more closely, but which also, because it becomes a more complex model, make it inherently less likely to be true of text in general. Thus there is no one best technique for avoiding overtraining in general; the choice depends on the kinds of problems being modeled.

Within this context, the surprising fact about transformation-based learning is that the rule sequences learned typically do not experience significant overtraining. Experiments suggest that this is at least partly due to the knowledge embodied in the templates. When a part-of-speech training run is supplied with "relevant" templates, ones that identify significant and predictive tag patterns in the data, as in figure 7.5, one gets an "improve to plateau" test-set curve. "Irrelevant" templates, however, can lead to overtraining. Figure 7.6 shows that noticeable overtraining results from using just a single such template, in this case one that tested the tags of the words five positions to the left and right, which seem likely to be largely uncorrelated with the tag at the central location.

Figure 7.7, where this single irrelevant template is combined with the seven normal templates, shows that most of the overtraining in such cases happens late in the training process, when most of the useful relevant templates have already been applied. At that stage, as always, the templates are applied in each pass to each remaining incorrectly tagged site, generating candidate rules.
Figure 7.6 Training with one irrelevant template on Greek corpus.
Each rule naturally succeeds at the site that proposed it, but with respect to the rest of the training set, the changes proposed by rules at this stage are largely random, and are thus likely to do more harm than good when applied elsewhere, especially since most of the assigned tags in the corpus at this stage are correct. Thus if the rule's pattern matches elsewhere in the training set, it is quite likely that the effect there will be negative, so that the unhelpful rule will not be learned. Thus the presence of relevant templates supplies an important degree of protection against overtraining from any irrelevant templates, both by reducing the number of incorrect sites that are left late in training and by raising the percentage already correct, which makes it more likely that bad rules will be filtered out. The same applies, of course, to relevant and irrelevant instances of mixed templates, which is the usual case.

Most of the overtraining will thus come from patterns that match only once in the training set, at their generating instance. Under these assumptions, note that applying a score threshold greater than 1 can significantly reduce the overtraining risk, just as decision trees sometimes control that risk by applying a threshold to the entropy gain required before splitting a node. Brill's system uses a score threshold of 2 as the default, thus gaining additional protection against overtraining, while our experimental runs have been exhaustive, in order to better understand the mechanism.
Figure 7.7
Training with seven relevant templates and one irrelevant template on Greek corpus.

A large training set means a larger number of incorrect sites that might engender overtrained rules, but also a better chance of finding other instances of those rule patterns and thus filtering them out. The combination of those factors appears to cause the risk of overtraining for a particular irrelevant template to first rise and then fall with increasing training set size, as the initial effect of increased exposure is later overcome by that of increased filtering from further occurrences of the patterns.

In comparing this with decision trees, the key contrast is that the filtering effect there decreases as training proceeds. The splitting predicates are applied to increasingly small fragments of the training set, so that the chance of filtering also decreases. (With few counterexample points left in the region that the new rule will split, it becomes more likely that an irrelevant predicate will incorrectly appear to provide a useful split.) But since transformation-based learning continues to score its essentially independent rules against the entire training set, the protection of filtering against overtraining remains stronger.

It is worth adding that further experiments indicated that the degree of resistance of transformation-based learning to overtraining in these tests may be highly dependent on the degree of similarity between the training and test data. In the experiments reported above, the training and test data were separated by randomly selecting sentences, and, as mentioned before, the dictionary that was used was drawn from the entire corpus, including the test set, so that there also were no unknown words in the test material.
In later tests, which did allow for unknown words or where the training and test material were not taken from the same corpus, it appeared that even training runs using relevant templates could encounter significant overtraining. Differences between training and test material both reduce the chance that positive changes in the training set will be reflected in the test set and also weaken the filtering effect which uses the training set to protect against rules that are likely to do harm on the test. These factors may explain why the resistance of transformation-based learning to overtraining is dependent on close similarity between the training and test data.
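To make the rule-scoring and filtering mechanism discussed in this section concrete, the following is a minimal sketch of largest-net-benefit rule selection with a score threshold. It is an illustrative reconstruction, not Brill's actual code; the site and rule attributes used here (matches, new_tag, current_tag, true_tag) are assumptions made for the example.

```python
def net_benefit(rule, corpus):
    """Score a candidate rule against the entire training corpus: +1 for each
    site where it would fix a wrong tag, -1 for each site where it would
    break a tag that is already correct."""
    score = 0
    for site in corpus:                          # corpus: iterable of tagging sites (assumed)
        if rule.matches(site):                   # the rule's template pattern matches here
            if site.current_tag != site.true_tag and rule.new_tag == site.true_tag:
                score += 1
            elif site.current_tag == site.true_tag and rule.new_tag != site.current_tag:
                score -= 1
    return score

def select_next_rule(candidate_rules, corpus, threshold=2):
    """Largest-net-benefit selection; a threshold of 2 filters out rules whose
    only support is their single generating instance."""
    scored = [(net_benefit(rule, corpus), rule) for rule in candidate_rules]
    if not scored:
        return None
    best_score, best_rule = max(scored, key=lambda pair: pair[0])
    return best_rule if best_score >= threshold else None
```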
5 Tracking Rule Dependence
As noted earlier, since later rules in a sequence do sometimes depend on tag assignment changes made by earlier rules, it would be interesting to be able to characterize and quantify those rule dependencies. We have therefore added code that generates dependency trees showing the earlier rule applications (if any) that each rule depends on. For example, the dependency tree in figure 7.8 from the Brown Corpus data shows a case where the last rule that applied at this particular site (the bottom line in the figure, representing the root of the tree), which changed JJ to RB, depended on earlier rules that changed the previous site (relative position -1) to VBN and the following one (position +1) to DT. (The final number on each line tells on what pass that rule was learned. Also note that while recorded internally as trees, these structures actually represent dependency DAGs, since one rule application may be an ancestor of another along more than one path.)

All sites are initially assigned a null dependency tree representing the baseline heuristic choice. The application of a rule causes a new tree to be built, with a new root node, whose children are the current dependency trees for those locations referenced by the rule pattern. At the end of the training run, the final dependency trees for all the sites are sorted by size, and structurally similar trees that show the same rules applied in the same relative pattern are grouped together. Those classes of trees are then sorted by frequency and output along with the list of rules learned.
Figure 7.8
Sample dependency tree from Brown Corpus data.
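The tree-building procedure just described can be summarized in a few lines. This is an illustrative sketch rather than the code actually added to the system; the rule attributes (id, referenced_sites) are assumptions made for the example.

```python
class DepNode:
    """One node in a dependency tree: the rule that applied at a site, the pass
    on which that rule was learned, and the trees it depended on."""
    def __init__(self, rule_id, pass_no, children):
        self.rule_id = rule_id
        self.pass_no = pass_no
        self.children = children

def init_dependency_trees(sites):
    # Every site starts with a null tree, representing the baseline heuristic choice.
    return {site: None for site in sites}

def record_application(dep_trees, rule, site, pass_no):
    """When a rule fires at a site, build a new root whose children are the
    current dependency trees of the locations referenced by the rule's pattern."""
    referenced = rule.referenced_sites(site)      # e.g., relative positions -1, 0, +1
    children = [dep_trees[s] for s in referenced if dep_trees.get(s) is not None]
    dep_trees[site] = DepNode(rule.id, pass_no, children)
```

Because the same subtree can become a child of more than one later node, these structures are, as noted above, really dependency DAGs rather than trees.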
Certain common patterns of rule dependency can be noted in the resulting trees. A correction pattern results when one rule makes an overly general change, which affects not only appropriate sites but also inappropriate ones, so that a later rule in the sequence undoes part of the earlier effect. One example of this type from our Brown Corpus run can be seen in figure 7.9. Here the first rule was the more general one that changed PP$ to PPO whenever it follows VBD. While that rule was generally useful, it overshot in some cases, causing the later learning of a correction rule that changed PPO back to PP$ after RB VBD.

A chaining pattern occurs in cases where a change ripples across a context, as in figure 7.10. The first rule to apply here (21) changed QL to AP in relative position +2. That change enabled the RB to QL rule (181) at position +1, and together those two changes enabled the root rule (781). Note that this two-step rule chain has allowed this rule to depend indirectly on a current tag value that is further away than could be sensed in a single rule, given the current maximum template width.

This dependency tree output also shows something of the overall degree and nature of rule interdependence. The trees for a run on 50K words of the Brown Corpus bear out that rule dependencies, at least in the part-of-speech tagging application, are limited. Of a total of 3395 sites changed during training, only 396 had dependency trees with more than one node, and even the most frequent class of structurally similar trees appeared only four times. Thus the great majority of the learning in this case came from templates that applied in one step directly to the baseline tags, with leveraging being involved in only about 12% of the changes.
Figure 7.9
Sample correction-type dependency tree from Brown Corpus data.
Figure 7.10
Sample chaining-type dependency tree from Brown Corpus data.
The relatively small amount of interaction found between the rules also suggests that the order in which the rules are applied may not be a major factor in the success of the method for this particular application, and initial experiments tend to bear this out. Figure 7.5 earlier showed the performance of a training run on Greek text, where the rule to apply on each pass was selected using the largest net benefit metric that Brill proposes. Note that, on this Greek corpus, the initial baseline performance level of choosing the most frequent training set tag for each word is already quite good; performance on both sets further improves during training, with most of the improvement occurring in the first few passes. In comparison, figure 7.11 shows the results for a training run where the next rule at each step was randomly selected from among all rules that had a net positive effect of any size. While the progress is more gradual, both the training and test curves reach very close to the same maxima under these conditions as they do when the largest net benefit rule is chosen at each step. Note that it does take more rules to reach those levels, since the random ranking presumably often ends up selecting more specific rules that are actually subsumed by more general ones not chosen till later. Thus at least for this task, where there is little rule dependence, the choice of rule-ranking metric does not seem to have much effect on the final performance achieved. The largest net benefit ranking criterion is still a useful one, of course, if one wants to find a short initial subsequence of rules that achieves the bulk of the improvement.
Figure 7.11
Training and test set performance on Greek, random-rule choice.
6 Future Work
The general analysis of transformation-based learning presented here is based on part-of-speech tagging experiments. Within that domain, it would be useful to quantify more clearly how transformation-based learning performs compared with other methods. It would be particularly interesting to see how it compares with traditional decision-tree methods when applied to the same corpora and making use of the same factors; such experiments would better illuminate the tradeoffs between the ability to combine predicates into more complex rules on the one hand and the ability to leverage partial results and resist overtraining on the other. It would also be useful to run tests similar to those presented here on overtraining risk and on rule dependence, using data from other domains, especially domains where the degree of rule dependence would be expected to be greater. Further exploration of the connections between transformation-based learning and decision trees and decision lists may also suggest other approaches, perhaps blends of the two, that would work better in some circumstances.

Within transformation-based learning itself, further work is required to determine whether other ranking schemes for selecting the next rule to apply might be able to improve on the simple maximum net benefit heuristic. It may also be possible to control for the remaining risk of overtraining in a more sensitive way than with a simple threshold. Selective pruning like that used with decision trees is one possible approach, and deleted estimation [Jelinek and Mercer, 1980] or other cross-validation techniques are also worth trying, though any technique that involves selecting particular rules from a sequence or merging two different sequences would have to deal with the hidden dependencies between rules. One goal for collecting the dependency tree data is to make it possible to prune or restructure rule sequences, using the recorded dependencies to maintain the consistency among remaining rules.

7 Conclusions

While transformation-based learning uses a simple, statistical search process to automatically select the rules in the rule-sequence model that it generates, symbolic linguistic knowledge also plays an important role. The fact that symbolic knowledge is required to specify the template patterns of relevant factors to be considered, as well as the fact that the resulting model is encoded as intelligible symbolic rules, makes this simple and powerful new mechanism
for capturing the patterns in linguistic data an interesting compromise method to explore. The iterative nature of the approach turns out to be a key factor that sets it apart from other methods, in that it provides a limited way in which later rules can be leveraged on the results of earlier ones.

The technique has much in common with decision trees, especially in its ability to automatically select at each stage from a large space of possible factors the predicate or rule that appears to be most useful. One important difference is that decision trees can synthesize complex rules from elementary predicates by inheritance, while transformation-based learning must prespecify in the templates essentially the full space of possible rules. However, as long as the template set can be made rich enough to cover the patterns likely to be found in the data, this restriction in power may not cause too great a reduction in performance, and it brings two important benefits in return: first, breaking the model up into independent rules makes it possible to apply them to the whole corpus as they are learned, allowing the rules to leverage off the best estimates regarding their surroundings; and second, since the independent rules continue to be scored against the whole training corpus, a substantial measure of protection against overtraining is gained.

References
Bahl, Lalit R., Peter F. Brown, Peter V. de Souza, and Robert L. Mercer. 1989. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37: 1001-1008.

Black, Ezra, Fred Jelinek, John Lafferty, Robert Mercer, and Salim Roukos. 1992. Decision tree models applied to the labeling of text with parts-of-speech. In Speech and Natural Language Workshop Proceedings, pp. 117-121. Morgan Kaufmann, San Mateo, Calif.

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Pacific Grove, Calif., Wadsworth & Brooks/Cole.

Brill, Eric. 1992. A simple rule-based part of speech tagger. In Proceedings of the DARPA Speech and Natural Language Workshop, 1992.

Brill, Eric. 1993a. Automatic grammar induction and parsing free text: a transformation-based approach. In Proceedings of the DARPA Speech and Natural Language Workshop, 1993, pp. 237-242.

Brill, Eric. 1993b. A Corpus-Based Approach to Language Learning. Ph.D. thesis, University of Pennsylvania, Philadelphia.

Brill, Eric. 1994. Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 722-727.

Brill, Eric. 1995. Transformation-based tagger, version 1.14. Programs: ftp://blaze.cs.jhu.edu/pub/brill/RULE_BASED_TAGGER-V.1.14.tar.Z.
Brill, Eric and Philip Resnik. 1994. A rule-based approach to prepositional attachment disambiguation. In Proceedings of the Sixteenth International Conference on Computational Linguistics, Association for Computational Linguistics, Morristown, NJ.

Buntine, Wray. 1992. Learning classification trees. Statistics and Computing, 2: 63-73.

CATSS. 1991. Produced by Computer-Assisted Tools for Septuagint Studies, available through the University of Pennsylvania's Center for Computer Analysis of Texts.

Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing. Association for Computational Linguistics, Morristown, NJ.

Cutting, D., J. Kupiec, J. Pederson, and P. Sibun. 1992. A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing. Association for Computational Linguistics, Morristown, NJ.

DeRose, Steven J. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14: 31-39.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19: 61-74.

Francis, W. Nelson and Henry Kucera. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers. Technical Report. Providence, R.I., Department of Linguistics, Brown University.

Jelinek, F. 1985. Markov source modeling of text generation. In J. K. Skwirzynski, editor, The Impact of Processing Techniques on Communication. Dordrecht, Nijhoff.

Jelinek, F. and R. L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data, pp. 381-397. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. Amsterdam, North-Holland.

Magerman, David M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University, Stanford, Calif.

Magerman, David M. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, pp. 276-283.

Quinlan, J. Ross. 1993. C4.5: Programs for Machine Learning. San Francisco, Calif., Morgan Kaufmann.

Quinlan, J. Ross, and Ronald L. Rivest. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80: 227-248.

Rabiner, Lawrence R. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition. Los Altos, Calif., Morgan Kaufmann. Originally published in Proceedings of the IEEE in 1989.

Ramshaw, Lance A. and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the ACL Third Workshop on Very Large Corpora, pp. 82-94.
Resnik, Philip S. 1993. Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. thesis, University of Pennsylvania, Philadelphia. Institute for Research in Cognitive Science Report No. 93-42.

Roche, Emmanuel and Yves Schabes. 1995. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 21: 227-253.

Schaffer, Cullen. 1993. Overfitting avoidance as bias. Machine Learning, 10: 153-178.

Smyth, Padhraic and Rodney Goodman. 1992. An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4: 301-316.

Weischedel, Ralph, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. 1993. Coping with ambiguity and unknown words through probabilistic methods. Computational Linguistics, 19: 359-382.

Yarowsky, David. 1993. One sense per collocation. In Human Language Technology, Proceedings of the DARPA Workshop, pp. 266-277. San Francisco, Morgan Kaufmann.

Yarowsky, David. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 88-95.

Yarowsky, David. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189-196.
Chapter 8
Recovering from Parser Failures: A Hybrid Statistical and Symbolic Approach

Most linguistic analyses assume an ideal speaker-hearer, who will utter only grammatical sentences or grammatical sentence fragments. But spontaneous speech is full of extragrammatical phenomena, such as false starts, words not listed in the lexicon, and ungrammatical constructions. Although it is unclear whether this type of data should be accounted for in a linguistic theory, there is no question that the computational linguist needs to account for such possibilities. Similarly, the question of whether such data falls inside the purview of a linguistic theory of competence or only in a performance model is debatable (see Abney, chapter 1). Rose and Waibel provide a detailed system description of an approach to recovering from parser failures that builds on an effective hybrid model. The repair module uses symbolic information as the basis of the feature-structure conceptual representation built during analysis from input into interlingua. Statistical information is used to optimize the computation of ways partial analyses can fit together to create a valid structure for input into the interlingua. Rose and Waibel argue that by drawing upon both statistical and symbolic sources of information, the repair module can constrain its repair predictions to those which are both likely (a statistical notion) and meaningful (a symbolic notion). The complete description of their working system, along with evaluation of results, provides a convincing example of the effectiveness of balancing approaches. - Eds.
1 Introduction

Natural language processing of spontaneous speech is particularly difficult because it contains false starts, out-of-vocabulary words, and ungrammatical constructions. Because of this, it is unreasonable to hope to be able to write a grammar that will cover all of the phenomena that a parser is likely to encounter
in a practical speech translation system. In this chapter we describe an implementation of a hybrid statistical and symbolic approach for recovering from parser failures in the context of a speech-to-speech translation system of significant scope (vocabulary size of 996, word recognition accuracy 60%, grammar size on the order of 2000 rules). The domain which the current system focuses on is the scheduling domain, where two speakers attempt to set up a meeting over the telephone.

Because this is an interlingua-based translation system, the goal of the analysis stage of the translation process is to map the utterance in the source language onto a feature-structure representation called an interlingua which represents meaning in a language-independent way. If the parser cannot derive a complete analysis for an utterance, it derives a partial parse by skipping over portions of the utterance in order to find a subset which can parse. It also returns an analysis for the skipped portions which can be used to rebuild the meaning of the utterance. The goal of the repair module is to interactively reconstruct the meaning of the full utterance by generating predictions about the way the fragments can fit together and checking them with the user. In this way it negotiates with the user in order to recover the meaning of the user's utterance.

The repair module described in this chapter uses both symbolic and statistical information to reconstruct the speaker's meaning from the partial analysis which the parser produces. It generates predictions based on constraints from a specification of the interlingua representation and from mutual information statistics extracted from a corpus of naturally occurring scheduling dialogues. Mutual information is intuitively a measure of how strongly associated two concepts are. Although the syntactic structure of the input utterance certainly plays an important role in determining the meaning of an utterance, it is possible with the use of the interlingua specification to reason about the meaning of an utterance when only partial structural information is available. This can be accomplished by fitting the partial feature structures together against the mold of the interlingua specification. During the parsing process, two structural representations are generated: one is a treelike structure generated from the structure of the context-free portion of the parsing grammar rules; the other is a feature structure generated from the unification portion of the parsing grammar rules. There is a many-to-one mapping between tree-structures and feature-structures. Both of these structures are important in the repair process.

The repair process is analogous in some ways to fitting pieces of a puzzle into a mold that contains receptacles for particular shapes. The interlingua specification is like the mold with receptacles of different shapes, making it possible to compute all of the ways partial analyses can fit together in order to create a structure that is valid for the given interlingua. But the number of ways
it is possible to do this is so numerous that the brute force method is computationally intractable. (The search space is theoretically infinite, since the interlingua representation is recursive.) Mutual information statistics are used to guide the search. These mutual information statistics encode regularities in the types of fillers which tend to occur in particular slots and which feature-structures associated with particular non-terminal symbols in the parsing grammar tend to be used in a particular way in the interlingua representation. By drawing upon both statistical and symbolic sources of information, the repair module can constrain its repair predictions to those which are both likely and meaningful.

One advantage of the design of this module is that it draws upon information sources that were already part of the system before the introduction of the repair module. Most of the additional information which the module needs was trained automatically using statistical techniques. The advantage of such a design is that the module can be easily ported to different domains with minimal additional effort. Another strength is that the statistical model the repair module makes use of continually adapts during use. This is desirable in a statistical approach in order to overcome problems with unbalanced training sets or training sets that are too small, leading to overfitting.
2 Motivation
The overwhelming majority of research in symbolic approaches to handling ill-formed input has focused on flexible parsing strategies. Hobbs and co-workers [2], McDonald [5], Carbonell and Hayes [1], Ward and co-workers [6], Lehman [4], and Lavie and Tomita [3] have all developed types of flexible parsers. Hobbs et al. and McDonald each employ grammar-specific heuristics which are suboptimal since they fall short of being completely general. Ward and Carbonell take a pattern-matching approach which is not specific to any particular grammar, but the structure of the output representation is not optimal for an application where the output representation is distinct from the structure of the parse, for example, a feature-structure, as in an interlingua-based machine translation system.

Both Lehman and Lavie take an approach that is independent of any particular grammar and makes it possible to generate an output representation distinct from the structure of the parse. Lehman's least-deviant-first parser can accommodate a wide range of repairs of parser failures. But, as it adds new rules to its grammar in order to accommodate idiosyncratic language patterns, it quickly becomes intractable for multiple users. Also, because it does not
make use of any statistical regularities, it has to rely on heuristics to determine which repair to try first. Lavie's approach is a variation on Tomita's Generalized LR parser, which can identify and parse the maximal subset of the utterance that is grammatical according to its parsing grammar. He uses a statistical model to rank parses in order to deal with the extraordinary amount of ambiguity associated with flexible parsing algorithms. His solution is a general one. The weakness of this approach is that part of the original meaning of the utterance may be thrown away with the portions of the utterance that were skipped in order to find a subset which can parse.

From a different angle, Gorin [2] has demonstrated that it is possible to successfully build speech applications with a purely statistical approach. He makes use of statistical correlations between features in the input and the output which purely symbolic approaches do not in general make use of. The evidence provided by each feature combines in order to calculate the output which has the most cumulative evidence. In Gorin's approach, the goal is not to derive any sort of structural representation of the input utterance. It is merely to map the set of words in the input utterance onto some system action. If the goal is to map the input onto a meaning representation, as is the case in an interlingua-based machine translation project, the task is more complex. The set of possible meaning representations, even in a relatively small domain such as scheduling, is so large that such an approach does not seem practical in its pure form. But if the input features encode structural and semantic information, the same idea can be used to generate repair hypotheses.

The repair module described in this chapter builds upon Lavie and Tomita's and Gorin's approaches, reconstructing the meaning of the original utterance by combining the fragments returned from the parser, and making use of statistical regularities to naturally determine which combination to try first. In our approach we have attempted to abstract away from any particular grammar in order to develop a module that can be easily ported to other domains and other languages. Our approach allows the system to recover from parser failures and adapt without adding any extra rules to the grammar, allowing it to accommodate multiple users without becoming intractable.

The repair module was tested on two corpora with 129 sentences each. One corpus contains 129 transcribed sentences from spontaneous scheduling dialogues. These sentences were transcribed just as they were spoken, so they contain all of the false starts and ungrammaticalities of natural speech. The other corpus contains the output from the speech recognizer from reading the transcribed sentences. So it contains all of the difficulties of the transcribed corpus in addition to speech recognition errors. Given a maximum of 10 questions to ask the user, it can raise the accuracy of the parser (point value derived
from automatically comparing generated feature-structures to hand-coded ones) from 52% to 64% on speech data and from 68% to 78% on transcribed data. Given a maximum of 25 questions, it can raise the accuracy to 72% on speech data and 86% on transcribed data.
3 Symbolic Information
The system which this repair module was designed for is an interlingua-based machine translation system. This means that the goal of the analysis stage is to map the input utterance onto a language-independent representation of meaning called an interlingua. Currently, the parsing grammar that is used is a semantic grammar which maps the input utterance directly onto the interlingua representation. Although the goal of an interlingua is to be language-independent, most interlinguas are domain-dependent. Although this may seem like a disadvantage, it actually makes it possible for domain knowledge to be used to constrain the set of meaningful interlingua structures for that domain, which is particularly useful for constraining the set of possible repairs that can be hypothesized. The domain which the current system focuses on is the scheduling domain, where two speakers attempt to set up a meeting over the phone.

The interlingua is a hierarchical feature-structure representation. Each level of an interlingua structure contains a frame name which indicates which concept is represented at that level, such as *busy or *free. Each frame is associated with a set of slots which can be filled either by an atomic value or by another feature-structure. At the top level, additional slots are added for the sentence type and the speech act. Sentence type roughly corresponds to mood, that is, *state is assigned to declarative sentences and *query-if is assigned to yes/no questions. The speech act indicates what function the utterance performs in the discourse context. (See the sample interlingua structure in figure 8.1.)

((speech-act (*multiple* *state-constraint *reject))
 (sentence-type *state)
 (frame *busy)
 (who ((frame *i)))
 (when ((frame *special-time)
        (name week)
        (specifier (*multiple* all-range next)))))

Figure 8.1
Sample interlingua representation returned by the parser for "I'm busy all next week."
The interlingua specification determines the set of possible interlingua structures. This specification is one of the key symbolic knowledge sources used for generating repair hypotheses. It is composed of BNF-like rules which specify subsumption relationships between types of feature-structures (figure 8.2), or between types of feature-structures and feature-structure specifications (figure 8.3).

A feature-structure specification is a feature-structure whose slots are filled in with types rather than with atomic values or feature-structures. Feature-structure specifications are the leaves of the subsumption hierarchy of interlingua specification types. Because the interlingua representation is defined independently of the repair module, this approach extends to other feature-structure-based meaning representations as well.
<TEMPORAL> = <SIMPLE-TIME>
             <SPECIAL-TIME>
             <RELATIVE-TIME>
             <EVENT-TIME>
             <TIME-LIST>

Figure 8.2
Sample interlingua specification rule for expressing a subsumption relationship between the type <TEMPORAL> and more specific temporal types.

<BUSY> = ((frame *busy)
          (topic <FRAME>)
          (who <FRAME>)
          (degree [DEGREE]))

Figure 8.3
Sample interlingua specification rule for expressing a subsumption relationship between the type <BUSY> and the feature-structure specification for the frame *busy.
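As an illustration of how such a specification can be consulted programmatically, here is a minimal sketch in which rules like the one in figure 8.3 are encoded as a table and used to enumerate the open slots of a partial feature-structure. The slot inventory shown is an assumption made for the example, not the system's full specification.

```python
# Illustrative encoding of interlingua specification rules: each frame type maps
# to the slots it licenses and the specification type required of each filler.
SPEC = {
    "*busy": {"who": "<FRAME>", "when": "<TEMPORAL>", "topic": "<FRAME>"},
    "*free": {"who": "<FRAME>", "when": "<TEMPORAL>"},
}

def open_slots(fstruct):
    """Return (slot, required_type) pairs licensed by the specification for this
    frame but not yet filled in the partial feature-structure."""
    licensed = SPEC.get(fstruct.get("frame"), {})
    return [(slot, t) for slot, t in licensed.items() if slot not in fstruct]

# Example: a hypothesized *free frame whose who slot is already filled still
# has an open when slot that a <TEMPORAL> chunk could fill.
partial = {"frame": "*free", "who": {"frame": "*i"}}
print(open_slots(partial))      # [('when', '<TEMPORAL>')]
```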
4 Statistical Knowledge

Intuitively, repair hypotheses are generated by computing the mutual information between semantic grammar non-terminal symbols and types in the interlingua specification, and also between slot-type pairs and types that are likely to be fillers of that slot. Mutual information is roughly a measure of how strongly associated two concepts are. It is defined by the following formula:

log[ P(c_k | v_m) / P(c_k) ]

where c_k is the kth element of the input vector and v_m is the mth element of the output vector.

Based on Gorin's approach, statistical knowledge in our repair module is stored in a set of networks with weights corresponding to the mutual information between an input unit and an output unit. Gorin's network formalism is appealing because it can be trained both off-line with examples and on-line during use. Another positive aspect of Gorin's mutual information network architecture is that rather than provide a single hypothesis about the correct output, it provides a ranked set of hypotheses, so if the user indicates that it made the wrong decision, it has a natural way of determining what to try next. It is also possible to introduce new input units at any point in the training process. This allows the system to learn new words during use, which will be explained in more detail below. In the limit, this gives the system the additional ability to handle nil parses.

Our implementation of the repair module has code for generating and training five instantiations of Gorin's network architecture, each used in a different way in the repair process.

The first network is used for generating a set of hypothesized types for chunks with feature-structures that have no type in the interlingua specification. The parse associated with these chunks is most commonly a single symbol dominating a single word. This occurs when the parser skips over a word or set of words which cannot be parsed into a feature structure covered by the interlingua specification, as is the case with unknown words. The symbol is used to compute a ranked set of types the symbol is likely to map onto based on how much mutual information it has with each output node. In the case that this is a new symbol which the net has no information about yet, it will return a ranked list of types based on how frequently those types are the correct output. This effect falls naturally out of the activation function. Activation of an output node is calculated by summing the mutual information between each of the activated input nodes and the output node in question. Finally a bias is
added, which is the proportion of iterations in which the output node in question was selected as the correct output node during training. Because a new symbol will have the same mutual information with each of the output nodes, the only thing that will distinguish one from another is the bias.

The second network is used for calculating what types are likely fillers for particular frame-slot pairs, for example, a slot associated with a particular frame. This is used for generating predictions about likely types of fillers which could be inserted in the current interlingua structure. This information can help the repair module interpret chunks with uncertain types in a top-down fashion.

The third network is similar to the first network except that it maps collections of parser non-terminal symbols onto types in the interlingua specification. It is used for guessing likely top-level semantic frames for sentences and for building larger chunks out of collections of smaller ones.

The fourth network is similar to the third except instead of mapping collections of parser non-terminal symbols onto types in the interlingua specification, it maps them onto sentence types (see discussion on interlingua representation). This is used for guessing the sentence type after a new top-level semantic frame has been selected.

The fifth and final network maps a boolean value onto a ranked set of frame-slot pairs. This is used for generating a ranked list of slots that are likely to be filled. This network complements the second network. A combination of these two networks yields a list of slots likely to be filled along with the types they are likely to be filled with.

Our implementation of the mutual information networks allows for a mask to filter out irrelevant hypotheses so that only the outputs that are potentially relevant at a given time will be returned.
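To make the ranking mechanism concrete, here is a minimal sketch of a Gorin-style mutual information network as described above. The count bookkeeping and the handling of unseen symbols are our own simplifications for illustration; the actual system may differ in detail.

```python
import math
from collections import defaultdict

class MINetwork:
    """Toy mutual information network: ranks output units by summed mutual
    information with the active input units, plus a prior bias."""
    def __init__(self):
        self.joint = defaultdict(int)      # counts of (input, output) co-occurrence
        self.in_count = defaultdict(int)
        self.out_count = defaultdict(int)
        self.total = 0

    def train(self, inputs, output):
        """On-line update from one confirmed (inputs, output) example."""
        self.total += 1
        self.out_count[output] += 1
        for c in inputs:
            self.in_count[c] += 1
            self.joint[(c, output)] += 1

    def mutual_information(self, c, v):
        # log[ P(c | v) / P(c) ]; unseen pairs contribute 0, so a brand-new input
        # symbol leaves the ranking to the bias term, as described in the text.
        if self.joint[(c, v)] == 0 or self.in_count[c] == 0:
            return 0.0
        p_c_given_v = self.joint[(c, v)] / self.out_count[v]
        p_c = self.in_count[c] / self.total
        return math.log(p_c_given_v / p_c)

    def rank(self, inputs, mask=None):
        """Rank outputs by summed MI with the active inputs plus bias; an optional
        mask restricts attention to the currently relevant outputs."""
        candidates = mask if mask is not None else list(self.out_count.keys())
        def activation(v):
            bias = self.out_count[v] / max(self.total, 1)
            return sum(self.mutual_information(c, v) for c in inputs) + bias
        return sorted(candidates, key=activation, reverse=True)
```

A separate instance of such a network, with its own input and output units, would correspond to each of the five uses described above.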
5 The Repair Process: Detailed Description
In this section we give a detailed high-level description of the operation of the Repair Module.

5.1 System Architecture

The heart of the Repair Module (figure 8.4) is the Hypothesis Generation Module, whose purpose is to generate repair hypotheses, which are instructions for reconstructing the speaker's meaning by performing operations on the Chunk Structure of the parse. The Chunk Structure represents the relationships between the partial analysis and the analysis for each skipped segment of the utterance (see figure 8.5).
Figure 8.4
System architecture (components: *Skipping GLR Parser, Repair Module, Question Generation Module, Interlingua Update Module).
The Initialization Module builds this structure from the fragmented analysis returned by the parser. It inserts this structure into the Dynamic Repair Memory structure, which serves as a blackboard for communication between modules. The Dynamic Repair Memory also contains slots for the current repair hypothesis and the status of that hypothesis, that is, test, pass, fail. There are essentially four types of repair hypotheses that the Hypothesis Generation Module can generate. These are guessing the top-level semantic frame for the interlingua structure of the sentence, guessing the sentence type, combining chunks into larger chunks, and inserting chunks into the current interlingua structure.

The Hypothesis Generation Module has access to eight different strategies for generating repair hypotheses. The strategy determines which of the four types of hypotheses it should generate on each iteration. A metastrategy selects which strategy to employ in a given case.
Speaker's Utterance: Tuesday afternoon the ninth would be okay for me though.
Speech Hypothesis from the Recognizer: Tuesday afternoon the ninth be okay for me that.
Partial Analysis:
((sentence-type *fragment)
 (when ((frame *simple-time)
        (time-of-day afternoon)
        (day-of-week Tuesday)
        (day 9))))
Paraphrase of partial analysis: Tuesday afternoon the ninth
Skipped Portions:
1. ((value be))
2. ((frame *free) (who ((frame *i))) (good-bad +))
3. ((frame *that))

Figure 8.5
Sample partial parse.
Once the hypothesis is generated, it is sent to the Question Generation Module, which generates a question for the user to check whether the hypothesis is correct. After the user responds, the status of the hypothesis is noted in the Dynamic Repair Memory, and if the response was positive, the Interlingua Update Module makes the specified repair and updates the Dynamic Repair Memory structure. It is the Interlingua Update Module which uses these hypotheses to actually make the repairs to derive the complete meaning representation for the utterance from the partial analysis and the analysis for the skipped portions.

If the status indicates that the speaker's response was negative, the Hypothesis Generation Module will suggest an alternative repair hypothesis, which is possible since the mutual information nets return a ranked list of predictions rather than a single one. In this way the repair module negotiates with the speaker about what was meant until an acceptable interpretation can be constructed (figure 8.6). When the goal returns positive, the networks are reinforced with the new information so they can improve their performance over time.
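The negotiation cycle just described can be summarized as a simple loop. The sketch below is an illustrative outline only; the collaborating functions stand in for the modules of figure 8.4 and are not the system's actual interfaces.

```python
def repair_loop(interlingua, skipped_chunks, generate_hypotheses, ask_user,
                apply_repair, reinforce, has_open_slots, max_questions=10):
    """Negotiate with the user until an acceptable interpretation is built or
    the question budget is exhausted."""
    questions_asked = 0
    while questions_asked < max_questions and has_open_slots(interlingua):
        accepted = False
        # Ranked hypotheses come from the mutual information networks,
        # constrained by the interlingua specification.
        for hypothesis in generate_hypotheses(interlingua, skipped_chunks):
            if questions_asked >= max_questions:
                break
            questions_asked += 1
            if ask_user(hypothesis):                 # Question Generation Module
                interlingua = apply_repair(interlingua, hypothesis)  # Interlingua Update Module
                reinforce(hypothesis)                # on-line reinforcement of the networks
                accepted = True
                break
        if not accepted:
            break                                    # no acceptable hypothesis for this state
    return interlingua
```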
5.2 The Three Questions

The eight strategies are generated by all possible ways of selecting either top-down or bottom-up as the answer to three questions.
Interlingua Representation:
((sentence-type *state)
 (frame *free)
 (who ((frame *i)))
 (when ((frame *simple-time)
        (time-of-day afternoon)
        (day-of-week Tuesday)
        (day 9))))
Paraphrase: I am free Tuesday afternoon the ninth.

Figure 8.6
Complete meaning representation after repair.

The first question is, "What will be the top-level semantic frame?" The top-down approach is to keep the partial analysis returned by the parser as the top-level structure, thereby accepting the top-level frame in the partial analysis returned by the parser as representing the gist of the meaning of the sentence. Strategies 1 through 4 use the top-down approach. The bottom-up approach is to assume that the partial analysis returned by the parser is merely a portion of the meaning of the sentence which should fit into a slot inside of some other top-level semantic frame. This is the case in the example in figure 8.5. Strategies 5 through 8 use the bottom-up approach. If bottom-up is selected, a new top-level semantic frame is chosen by taking the set of all parser non-terminal symbols in the tree-structure for the partial analysis and from each skipped segment and computing the mutual information between that set and each interlingua specification type. This gives it a ranked set of possible types for the top-level interlingua structure. The interlingua specification rule for the selected type would then become the template for fitting in the information extracted from the partial analysis as well as from the skipped portions of the utterance (figure 8.7). If a new top-level frame was guessed, then a new sentence type must also be guessed. Similar to guessing a top-level frame, it computes the mutual information between the same set of parser non-terminal symbols and the set of sentence types.

The second question is, "How will constituents be built?" The top-down approach is to assume that a meaningful constituent to insert into the current interlingua structure for the sentence can be found by simply looking at available chunks and portions of those chunks (figure 8.8). Strategies 1, 2, 5, and 6 use the top-down approach.
Question: What will be the top-level structure?
Answer: Try Bottom-Up.
Hypothesis: (top-level-frame (frame-name *free))
Question: Is your sentence mainly about someone being free?
User Response: Yes.
New Current Interlingua Structure: ((frame *free))
Skipped Portions:
1. ((value be))
2. ((frame *free) (who ((frame *i))) (good-bad +))
3. ((frame *that))
4. ((frame *simple-time) (time-of-day afternoon) (day-of-week Tuesday) (day 9))
Figure 8.7
The first question.

The bottom-up approach is to assume that a meaningful chunk can be constructed by combining chunks into larger chunks which incorporate their meaning. The process of generating predictions about how to combine chunks into larger chunks is similar to guessing a top-level frame from the utterance except that only the parser non-terminal symbols for the segments in question are used to make the computation. Strategies 3, 4, 7, and 8 use the bottom-up approach.

The third question is, "What will drive the search process?" The bottom-up approach is to generate predictions of where to insert chunks by looking at the chunks themselves and determining where in the interlingua structure they might fit in. Strategies 1, 3, 5, and 7 use the bottom-up approach (figure 8.9). The top-down approach is to look at the interlingua structure, determine what slot is likely to be filled in, and look for a chunk which might fill that slot. Strategies 2, 4, 6, and 8 use the top-down approach (figure 8.10).

The difference between these strategies is primarily in the ordering of hypotheses. But there is also some difference in the breadth of the search space. The bottom-up approach will only generate hypotheses about chunks which it has. And if there is some doubt about what the type of a chunk is, only a finite number of possibilities will be tested, and none of these may match something which can be inserted into one of the available slots. The top-down approach generates its predictions based on what is likely to fit into available slots in the current interlingua structure. It first tries to find a likely filler which matches a chunk that has a definite type, but in the absence of this eventuality, it will assume that a chunk with no specific type is whatever type it guesses can fit
Question: How will constituents be built?
Answer: Try Top-Down.
Available Chunks:
1. ((value be))
2. ((frame *free) (who ((frame *i))) (good-bad +))
3. ((frame *that))
4. ((frame *simple-time) (time-of-day afternoon) (day-of-week Tuesday) (day 9))
Constituents:
1. ((frame *simple-time) (time-of-day afternoon) (day-of-week Tuesday) (day 9))
2. ((frame *free) (who ((frame *i))) (good-bad +))
3. ((frame *i))
4. ((frame *that))
5. ((value be))

Figure 8.8
The second question.
Question: What will drive the search process?
Answer: Try Bottom-Up.
Current Constituent:
((frame *simple-time) (time-of-day afternoon) (day-of-week Tuesday) (day 9))
Hypothesis:
(frame-slot (frame-name *free)
            (when ((frame *simple-time) (time-of-day afternoon) (day-of-week Tuesday) (day 9))))
Question: Is Tuesday afternoon the ninth the time of being free in your sentence?
User Response: Yes.
New Current Interlingua Structure:
((sentence-type *state)
 (frame *free)
 (when ((frame *simple-time)
        (time-of-day afternoon)
        (day-of-week Tuesday)
        (day 9))))

Figure 8.9
The third question - part 1.
Question: What will drive the search process?
Answer: Try Top-Down.
Current Slot: who
Hypothesis:
(frame-slot (frame-name *free)
            (who ((frame *i))))
Question: Is it "I" who is being free in your sentence?
User Response: Yes.
New Current Interlingua Structure:
((sentence-type *state)
 (frame *free)
 (who ((frame *i)))
 (when ((frame *simple-time)
        (time-of-day afternoon)
        (day-of-week Tuesday)
        (day 9))))

Figure 8.10
The third question - part 2.
into a slot. And if the user confirms that this slot should be filled with this type, it will learn the mapping between the symbols in that chunk and that type. Learning new words is more likely to occur with the top-down approach than with the bottom-up approach.

The metastrategy answers these questions, selecting the strategy to employ at a given time. Once a strategy is selected, it continues until it either makes a repair or cannot generate any more questions given the current state of the Dynamic Repair Memory. Also, once the first question is answered, it is never asked again, since once the top-level frame is confirmed, it can be depended upon to be correct.

The metastrategy attempts to answer the first question at the beginning of the search process. If the whole input utterance parses, or the parse quality indicated by the parser is good and the top-level frame guessed as most likely by the mutual information nets matches the one chosen by the parser, it assumes it should take the top-down approach. If the parse quality is bad, it assumes it should guess a new top-level frame, but it does not remove the current top-level frame from its list of possible top-level frames. In all other cases, it confirms with the user whether the top-level frame selected by the parser is the correct one, and if it is not, then it proceeds through its list of hypotheses until it locates the correct top-level frame.

Currently, the metastrategy always answers the second question the same way. Preliminary results indicated that in the great majority of cases, the repair
module was more effective when it took the top-down approach. It is most often the case that the chunks that are needed can be located within the structures of the chunks returned by the parser without combining them. And even when it is the case that chunks should be combined in order to form a chunk that fits into the current interlingua structure, the same effect can be generated by mapping the top-level structure of the would-be combined chunk onto an available chunk with an uncertain type and then inserting the would-be constituent chunks into this hypothesized chunk later. Preliminary tests indicated that the option of combining chunks only yielded an increase in accuracy in about 1% of the 129 cases tested. Nevertheless, it would be ideal for the metastrategy to sense when it is likely to be useful to take this approach, no matter how infrequent. This is a direction for future research.

The third question is answered by taking the bottom-up approach early, considering only chunks with a definite type, and then using a top-down approach for the duration of the repair process for the current interlingua structure.

The final task of the metastrategy is for it to decide when to stop asking questions. Currently it does this when there are no open slots or it has asked some arbitrary maximum number of questions. An important direction for future research is to find a better way of doing this. Currently, the repair module asks primarily useful questions (yielding an increase in accuracy) early (within the first 5 or 10 questions) and then proceeds to ask a lot of irrelevant questions. But we have not found an optimal maximum number of questions. If the number of questions is too small, it will not be able to learn some new input patterns and sometimes fails to recover information it would have been able to recover had it been allowed to ask a few more questions. But if the number is too large, it is unnecessarily annoying for the user, particularly in cases where the important information was recovered early in the process.
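The metastrategy's handling of the first question, described above, can be paraphrased in code as follows. The parse attributes (complete, quality, top_frame) and the confirm_with_user callback are names invented for this illustration.

```python
def choose_top_level_frame(parse, ranked_frames, confirm_with_user):
    """Metastrategy for the first question. ranked_frames is the ranked list
    produced by the mutual information network."""
    # Top-down: if the whole utterance parsed, or parse quality is good and the
    # network's top guess agrees with the parser, keep the parser's frame.
    if parse.complete or (parse.quality == "good"
                          and ranked_frames and ranked_frames[0] == parse.top_frame):
        return parse.top_frame
    # Otherwise negotiate. With anything but bad quality, confirm the parser's
    # frame first; a bad parse still keeps that frame among the candidates.
    candidates = list(ranked_frames)
    if parse.top_frame not in candidates:
        candidates.append(parse.top_frame)
    if parse.quality != "bad":
        candidates.remove(parse.top_frame)
        candidates.insert(0, parse.top_frame)
    for frame in candidates:
        if confirm_with_user(frame):
            return frame
    return None
```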
5.3 User Interaction

User interaction is an essential part of our approach. The ideal in speech-to-speech translation has been direct throughput from input speech to output speech. But this leaves the speaker with no idea of what the system understood from what was said or what is ultimately communicated to the other speaker. This is particularly a problem with flexible parsing techniques, where the parser must take some liberties in finding a parse for ill-formed input. An example interaction is displayed in figure 8.11. The interlingua representations for the example sentence before and after repair are displayed in figure 8.12.
SYSTEM: Please enter a sentence (or quit):
USER: I am afraid I will be attending the funeral of my precious canary next week.
SYSTEM: Is your sentence mainly about an interjection?
USER: No.
SYSTEM: Is your sentence mainly about being available?
USER: No.
SYSTEM: Is your sentence mainly about someone being busy?
USER: Yes.
SYSTEM: Is your sentence a short but complete sentence?
USER: No.
SYSTEM: Is your sentence a statement?
USER: Yes.
SYSTEM: Is the next week in focus in your expression of someone being busy?
USER: Yes.
SYSTEM: Is it I who is being busy in your sentence?
USER: Yes.
SYSTEM: Is a class the reason for being busy in your sentence?
USER: No.
SYSTEM: Is being available the reason for being busy in your sentence?
USER: No.
Figure 8.11
Example interaction.
The sample interaction in figure 8.11 was collected during system testing when a user unfamiliar with the system was given the task of communicating to the system that he would not be available for a meeting the following week. The parser was unable to parse most of his utterance. The parse before repair is displayed in figure 8.12. The portion of his utterance which the parser returned an analysis for completely misses most of the content of the user's utterance. Through a series of yes/no questions, the repair module was able to extract the essential meaning of the user's utterance from the partial parse. The purpose of the first three questions in the sample interaction is for the repair module to determine what the top-level semantic frame of the meaning representation of the sentence should be. With the next two questions the repair module determines the sentence type of the sentence. The last four questions are for determining what the slot fillers for the representation should be. The resulting interlingua representation can be found in figure 8.12.
Parse before repair:
((sentence-type *fixed-expression)
 (frame *interject)
 (adverb ((frame *adverb) (type unfortunately))))
Paraphrase: "Unfortunately"

Parse after repair:
((sentence-type *state)
 (frame *busy)
 (who ((frame *i)))
 (topic ((frame *special-time)
         (name week)
         (specifier (*multiple* definite next)))))
Paraphrase: "The next week I don't have time."
Figure 8.12
Interlingua structures for example before and after repair.

Because our Hypothesis Generation Module makes hypotheses about local repairs, the questions generated focus on local information in the meaning representation of the sentence. For instance, rather than confirm global meaning representations, as in, "Did you mean to say X?", it confirms local information, as in, "Is two o'clock the time of being busy in your sentence?", which confirms that the representation for "two o'clock" should be inserted into the when slot in the *busy frame. Because the repair module generates local repair hypotheses, with each decision building upon the result of the last successful hypothesis, through trial and error it is forced to ask a large number of very tedious questions. An important current direction of research is exploring how to minimize this burden on the user while maximizing the information the repair module can extract from the user's utterances, through the use of a genetic programming approach to maximize the mutual information over a single global repair hypothesis made up of a combination of local repair hypotheses.
6 Quantitative Evaluation

The repair module was tested on two corpora with 129 sentences each. One corpus contains 129 transcribed sentences from spontaneous scheduling dialogues as described above. The other corpus contains the output from the speech recognizer from reading the transcribed sentences. So it contains all of the difficulties of the transcribed corpus in addition to speech-recognition errors. The performance of the repair module was compared with a baseline process on an additional corpus of 113 sentences. These results indicate that the
performance of the repair module improves as the number of questions increases, that its performance generalizes to different data sets, and that the metastrategy consistently achieves better performance than any of the single strategies as well as the baseline comparison process.

The repair module was evaluated automatically with no human intervention
whatever. Prior to the evaluation, a human coder hand-coded ideal interlingua representations for each of the 129 sentences. The human-computer interaction component of the repair module was simulated by having the computer match each proposed repair against the ideal interlingua structure for the corresponding sentence to test whether it would make the current "in-progress" version of the interlingua representation internal to the repair module more like the ideal one. Each of these tests counted as one question. If the match indicated that the proposed repair was a good one, a "yes" answer was assumed; otherwise a "no" answer was assumed. If the answer was "yes," the repair module made the hypothesized repair. After each question, the possibly updated internal interlingua representation was matched with the ideal one to calculate a point value. The matching process was carried out recursively, first comparing the top-level frame, and if it was the same, comparing for each slot the fillers in the ideal structure with the corresponding ones in the internal structure. For each matching frame or atom, one point was assigned. The total possible was computed by counting the total number of frames and atoms in the ideal representation. From this, a percentage correct could be calculated at each stage of the repair process to track the improvement of the quality of the internal representation per question asked.

Figure 8.13 displays the relative performance of the eight strategies compared with the metastrategy on speech data. Figure 8.14 displays the relative performance of the strategies on the transcribed data. Note that the metastrategy consistently achieves a better performance than any of the single strategies and that its performance improves as the number of questions allowed increases. Given a maximum of 10 questions to ask the user, the repair module can raise the accuracy of the parser from 52% to 64% on speech data and from 68% to 78% on transcribed data. Given a maximum of 25 questions, it can raise the accuracy to 72% on speech data and 86% on transcribed data.
Figure 8.13
Results from all strategies on speech data.

Figure 8.14
Results from all strategies on transcribed data.
All on Data Strategies Speech 0.75 -0 ---MetaStrategy -. Comb Strat .1 & 5 + 0.7 Comb a.--". Comb .Strat Strat .2 7.--..-K 3& . . & 6 .Strat Comb .4&8-.-.6-.~ II0.65 --> .-. .)0 . E . x ; ' " . x , . ) E ' ~- 06 . . x ; . ~ . 0 _ ~-0-0-.0-.A ~.-,0 )(".-.0 0-..A .0,.0 . . ~ , ) ( ' . '-6j . , ~ . A . ' ~ ' . . " , . 0 -, _ 6 . . .."-~",.0 ~ A ' -, .o ' , ' . A . ~ " ~ ) C , . , , ("-..~ x;-~ o"'-...6.-~ o-.A -"~ '-'--)o c :A ' ~ . 6 6 j 0.55 :. . 0.50 5 10 15 20 25 Number of Questions Recoveringfrom ParserFailures: A Hybrid StatisticalandSymbolicApproach
Figure 8.15 Results from metastrategy compared with baseline. [Plot: accuracy against number of questions for the metastrategy and the baseline comparison process.]

It then continued the process by recursively constructing fillers for each of the slots, using the Huffman tree for the type corresponding to the slot. This process corresponds to an extreme version of the repair process not using any information from the parser whatever. The results of comparing the metastrategy with the comparison process on the additional data set can be found in figure 8.15. Note that the metastrategy is consistently better than the comparison process.
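The baseline's cost in questions can be illustrated with a small sketch: build a Huffman tree over the possible values of one interlingua type and count the internal nodes visited on the way to the target value, each of which corresponds to one question. The frame names and frequencies below are invented for the example; the chapter does not give its actual trees.

```python
import heapq
import itertools

def huffman_tree(freqs):
    """Build a Huffman tree over the values of one interlingua type.
    Leaves are value names; internal nodes are (left, right) pairs."""
    tiebreak = itertools.count()  # prevents heapq from ever comparing two trees
    heap = [(freq, next(tiebreak), value) for value, freq in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    return heap[0][2]

def questions_to_reach(tree, value):
    """Number of yes/no questions (Huffman-tree nodes visited) needed to pin down `value`."""
    if not isinstance(tree, tuple):
        return 0 if tree == value else None
    for branch in tree:
        below = questions_to_reach(branch, value)
        if below is not None:
            return 1 + below
    return None

# Hypothetical frequencies for the top-level semantic frames.
frame_freqs = {"give-information": 40, "request-information": 25, "accept": 20, "reject": 15}
tree = huffman_tree(frame_freqs)
print(questions_to_reach(tree, "give-information"))  # 1 question for the most frequent frame
print(questions_to_reach(tree, "reject"))            # 3 questions for the least frequent frame
```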
Qualitative Evaluation
The repair module's ability to make repairs of various sorts is determined by two parameters which guide the search process. The first is the maximum number of questions it is allowed to ask, and the second is what is called the bandwidth. Bandwidth is used in two different ways in the current implementation of the repair module. It specifies how long the list of likely types assigned to a chunk with an uncertain type will be, and how many potential slots will be considered when a top-down approach is selected for driving the search process. This parameter is set for practical reasons, so that the repair module will spend the majority of its time considering hypotheses that are truly likely.
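To make the effect of the bandwidth concrete, the following minimal sketch simply cuts off the statistical model's ranked hypotheses after the bandwidth best ones; the type names and scores are hypothetical, not taken from the chapter.

```python
def apply_bandwidth(scored_hypotheses, bandwidth):
    """Keep only the `bandwidth` most likely hypotheses for further consideration.
    `scored_hypotheses` maps a candidate (a likely type for an uncertain chunk,
    or a potential slot in top-down search) to a score from the statistical model."""
    ranked = sorted(scored_hypotheses.items(), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:bandwidth]]

# Hypothetical scores for the type of one unparsed chunk.
type_scores = {"time": 0.41, "location": 0.30, "attitude": 0.17, "speech-act": 0.12}
print(apply_bandwidth(type_scores, bandwidth=2))  # ['time', 'location']
```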
If the maximum number of questions were infinite, it would not be important to have such a parameter. In practice, with a small finite number of questions, it is important. It forces the repair module to sample a larger area within its search space more shallowly instead of looking at a very small area in depth. In the future, bandwidth and maximum number of questions will be replaced with a confidence model.

The problem with setting a bandwidth is that it makes certain potential repairs impossible and affects the repair module's ability to learn. If the correct hypothesis in a particular case is extremely unlikely according to the statistical model, it will most likely not be ranked within the bandwidth, so it will never be considered. The maximum number of questions is similar. In some cases, where the correct hypothesis is extremely unlikely, the repair module will not be able to home in on it within its maximum number of questions. In the case that the repair module is faced with the task of mapping new words onto a less frequently encountered concept, if it never manages to get to the correct hypotheses within its maximum number of questions, it will not be able to learn that mapping.

With an infinite bandwidth and an infinite number of questions, the repair module would have the ability to make any repair. This is intuitive because, even with no information from the parse, by cycling through the possible top-level frames and sentence types, it would eventually arrive at the correct combination. From there it knows from the interlingua specification what the possible fillers for each of the slots are, and by cycling through all of those possibilities it would eventually arrive at the correct interpretation.

It is not practical to allow the repair module to operate this way, however. First of all, it is much more practical to make use of information from the parse where it is available (see figure 8.15). In some cases, this means relying upon the parser to produce reliable results. If the repair module confirmed every piece of information in the feature structure returned by the parser, it would ask far too many useless questions. On the other hand, if it took an all-or-nothing approach, it might throw out potentially useful portions of the analysis from the parser. Currently, it only checks to see if the top-level semantic frame returned by the parser is correct. If it is, it keeps the whole partial analysis returned by the parser; otherwise it starts from scratch, using portions of the parser's partial analysis wherever possible. If it keeps the whole partial analysis, however, it may retain portions of it which are not correct. With a more reliable confidence model, this compromise would not be necessary.
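The compromise just described, keeping the parser's whole partial analysis only when its top-level semantic frame is confirmed, might be sketched roughly as follows. The nested frame/slot dictionaries and the `confirm` callback (one yes/no question to the user, or a match against the ideal interlingua in the automatic evaluation) are assumptions for illustration, not the chapter's actual implementation.

```python
def initialize_repair(parser_analysis, confirm):
    """Decide how much of the parser's partial analysis to trust before repair begins."""
    top_frame = parser_analysis.get("frame")
    if confirm(f"Is the top-level frame '{top_frame}' correct?"):
        # Keep the whole partial analysis, at the risk of retaining incorrect pieces.
        return parser_analysis
    # Otherwise start from scratch, keeping the parser's chunks as candidate
    # building blocks for the search process.
    salvaged_chunks = list(parser_analysis.get("slots", {}).values())
    return {"frame": None, "slots": {}, "candidates": salvaged_chunks}

# Example: the simulated evaluation would answer by matching against the ideal structure.
analysis = {"frame": "give-information", "slots": {"topic": "meeting"}}
print(initialize_repair(analysis, confirm=lambda question: True))
```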
One additional shortcut we have taken is not to allow the repair module to postulate date representations which the parser is not able to identify as date material. This is because there is no such thing as a "most likely date," so it is unclear how to build a statistical model which could make useful predictions about date fillers outside of context, especially since there are an infinite number of possible date representations.
Conclusions and Current Directions

This chapter describes an approach to interactive repair of fragmented parses in the context of a speech-to-speech translation project of significant scale. It makes it possible to use symbolic knowledge sources to the extent that they are available and uses statistical knowledge to fill in the gaps. This gives it the ability to keep the preciseness of symbolic approaches wherever possible as well as the robustness of statistical approaches wherever symbolic knowledge sources are not available. It is a general approach which applies regardless of how degraded the input is, even if the sentence completely fails to parse.

The primary weakness of this approach is that it relies too heavily on user interaction. One goal of current research is to reduce this burden on the user. Current directions include exploring a genetic programming technique for generating global repair hypotheses, eliminating the need for many tedious questions such as the ones described in this chapter, as well as exploring the use of discourse and domain knowledge for the purpose of eliminating hypotheses that do not make sense.

Acknowledgments
C.P.R. offers special thanks to her co-author Alex Waibel for advising her master's research and also to her two other master's committee members, Lori Levin and David Evans, for making helpful contributions.

The research described in this chapter was sponsored by the Department of Naval Research, grant #N00014-93-1-0806. The ideas described here do not necessarily reflect the position or policy of the government, and no official endorsement should be inferred.

References

J. G. Carbonell and P. J. Hayes. Recovery strategies for parsing extragrammatical language. Technical Report 84-87, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1984.

J. R. Hobbs, D. E. Appelt, and J. Bear. Robust processing of real-world natural-language texts. Third Conference on Applied NLP, Trento, Italy, 1992.
A. Lavie and M. Tomita. GLR* - an efficient noise-skipping parsing algorithm. Presented at the 3rd International Workshop on Parsing Technologies, Tilburg, The Netherlands, 1993.

J. F. Lehman. Self-Extending Natural Language Interfaces. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1989.

D. McDonald. The interplay of syntactic and semantic node labels in partial parsing. Presented at the 3rd International Workshop on Parsing Technologies, 1993.

M. Woszczyna, N. Coccaro, A. Eisele, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Sloboda, M. Tomita, J. Tsutsumi, N. Waibel, A. Waibel, and W. Ward. Recent advances in JANUS: A speech translation system. ARPA Human Language Technology Workshop, Plainsboro, New Jersey, 1993.
Contributors
Steven Abney, University of Tübingen, Tübingen, Germany
Hiyan Alshawi, AT&T Bell Laboratories, Murray Hill, New Jersey
Robin Clark, University of Pennsylvania, Philadelphia, Pennsylvania
Béatrice Daille, TALANA, University of Paris, Paris, France
Vasileios Hatzivassiloglou, Columbia University, New York, New York
Shyam Kapur, James Cook University of North Queensland, Townsville, Australia
Mitchell P. Marcus, University of Pennsylvania, Philadelphia, Pennsylvania
Patti Price, SRI International, Menlo Park, California
Lance A. Ramshaw, University of Pennsylvania, Philadelphia, Pennsylvania, and Bowdoin College, Brunswick, Maine
Carolyn Penstein Rose, Carnegie Mellon University, Pittsburgh, Pennsylvania
Alex H. Waibel, Carnegie Mellon University, Pittsburgh, Pennsylvania
Index
Abbreviations, 53
Abduction, 34
Acceptability, 10. See also Grammaticality; Performance
  degrees of, 29, 32
ACL. See Association for Computational Linguistics
ACL-DCI corpus, 81
Acoustic phonetics, 121
Adjective-adjective pairs, 78
Adjectives, 52, 68-72, 79-89
  scalar and nonscalar, 75-76
Adverbs, 52, 79
AI. See Artificial intelligence
Algebraic grammar, 3, 14-15
Algebraic model. See Algebraic grammar
Aligned corpora, 65
Ambiguity, 7-11, 15-16, 24, 160. See also Disambiguation; Word sense
  attachment, 10
  judgments, 6
Anaphora, 101-102
Artificial intelligence (AI), 68, 121
Association for Computational Linguistics (ACL), vii, xii, 81
Bayes's rule, 38
Brown corpus, 136, 139, 150-151
Candide system, 30
Case marking
  exceptional, 101
  structural, 101
CHILDES database, 107
Chinese, 102
Chomsky, vii, 1, 5, 21-24, 95. See also Government-Binding theory; Principles and parameters
Clitics, 100, 103, 108-115
Clustering, 29, 36, 71-72, 75, 78, 80, 85-86
COBUILD dictionary, 76-77
Cognitive processes, 120
Collocations, 18, 20, 33, 36, 45, 49-50, 54-55, 63, 77, 79, 90
Competence, 1, 12-13, 15, 24, 157. See also Performance
  and computation, 13
Compilers, 12
Constraints
  combinatoric, 29
  computational, 12
  grammatical, 12
Coordination, 53
Cross-linguistic information, 35
Cultural gap and differences, 119, 130
DARPA. See Defense Advanced Research Projects Agency
Database query, 31, 35
Dates, syntax of, 20
Decision trees, 130, 136, 142-143
Defense Advanced Research Projects Agency (DARPA), 119, 121
Degrees of acceptability. See Acceptability, degrees of
Degrees of grammaticality. See Grammaticality, degrees of
Dependency grammar, 35, 39, 45-46
Dependency graphs, 36
Dependency parameters, 42-43
Dialectology, 4
Disambiguation, 2, 15, 30, 34, 39, 78-79, 89-90. See also Ambiguity
Discourse, 121, 128, 161
Distributional criteria, 71
Distributional evidence, 101
Distributional techniques, 20, 24
Do-insertion, 100
Domain knowledge, 161
Dutch, 105, 108, 114
English, 17, 51, 100, 101, 102, 108
Entropy, 30, 106-115, 146
Eurospeech, 105, 125
Evaluation, 72-75, 129-130
  qualitative, 176-178
  quantitative, 173-176
Explanatory power, 13
Extraposition, 100
F-measure, 74, 84
Feature-based grammar, 32
Feature-structure representation, 158, 161
Finite automata, 54, 64
Finite state grammar, 79, 86
Finite state transducers, 136
French, 14, 51-64, 100, 101, 108, 110, 114
Garden-path effects, 13
Generative grammar, vii, xi, 96
Generative linguistics, vii
Genre, 80
German, 17, 100, 105, 108
Gorin's network formalism, 163
Government-Binding theory, 20, 24. See also Chomsky
Grammars. See Algebraic grammar; Dependency grammar; Feature-based grammar; Generative grammar; Stochastic grammar; Tree-adjoining grammar; Weighted grammar
Grammaticality, 6, 10-11, 22-24. See also Acceptability; Competence; Performance
  degrees of, 16-17, 20
Greek (Septuagint), 136, 152
Head (of a phrase), 35-36, 39, 100, 105
Hidden Markov models (HMM), x, 2, 28, 122-123, 135, 139, 142. See also Markov models
Hudson, dependency grammar, 35
Hybrid HMM-neural network approaches, 122
Hyperarticulation, 129
Icelandic, 96, 102
ICSLP. See International Conference on Spoken Language Processing
ID/LP rules, 45
Ill-formed input, 126, 159
Information retrieval, 96
Information retrieval measures
  precision, 73-74
  recall, 73-74, 77
Information theory, viii, 98
Interlingua, 34, 158, 161, 170 (fig.), 171
International Conference on Spoken Language Processing (ICSLP), 125
Interpretation, 126-127
Italian, 108
J-score, 145
Japanese, 96, 100, 101, 102
Jelinek, 30
Kendall's τ, 71-72
Korean, 101, 102
Language acquisition, 1, 3, 14, 20, 23-24, 95-97
  error correction in, 19
Language change, 1, 3-4, 14, 24
Language comprehension, 6, 14, 17, 24
  error tolerance in, 18
Language generation, 24, 33, 35, 37, 39, 41-42, 76
Language instruction, 96
Language production, 6, 14, 17
Language understanding, 76, 125-126
Language universals, 5
Language variation, 1, 3-4, 14, 24
Learnability, 95-96, 101, 135
Learning
  probabilistic, 135
  transformation-based, 135
Lexical access, 121
Lexical language model, 35
Lexical probabilities, 43
Lexical relations
  antonymy (antonym), 70, 75-76
  disjointness, 34
  hyponymy, 34, 70
  restriction, 34
  synonymy (synonym), 53, 70, 75-76
Lexical semantics, 70, 76
Lexical transfer (in translation), 27
Lexicography, 75
Lexicon, 6n2, 32, 126
Likelihood estimation, 124. See also Maximum likelihood estimation
Likelihood ratio, 49
Linear regression model, 84
Logic, first order, 31, 42
Logic-based approaches, 28, 31
Machine translation, 36-45, 78, 89. See also Transfer approach; Speech translation; Interlingua
Markedness, 99
Markov models, 23, 24, 123. See also Hidden Markov models
Markov process, 106n3
Maximum likelihood estimation, 37, 129, 145. See also Likelihood ratio
Mood, 161
Morpheme, x
Morphology, 80, 87, 90, 121, 126
Mutual information, 49, 58, 158-159, 163, 166
N-best approach, 125, 127
N-gram language models, 28, 45, 127
Negative markers, 100
Neural networks, 3, 30, 122
Noun phrases, 7, 16, 50-51, 79
Noun-noun (nominal compounds), 6n2, 20, 49, 63, 77. See also Compounds
Numeral expressions, syntax of, 20
Overtraining, 146-150
Parameter estimation, 17, 23, 30, 36
Parameter setting, 4, 20, 95, 102
Parsing
  bottom-up, 168
  preferences, 13
  strategies, 18
  top-down, 167
Part of speech
  acquisition of, 20
  information, 126
  tagging, 54, 57, 68, 89, 135-136 (see also Ambiguity; Disambiguation)
Perception, 13
Performance, 1, 6, 11-13, 15, 157. See also Acceptability; Competence; Grammaticality
Perplexity, 72
Phoneme, x
Phonetics. See Acoustic phonetics
Phonological rules, 124
Phonology, 121
Phrase structure grammars (HPSG), 24, 39
Phrase structure rules, acquisition of, 20
Phrase structure trees, 136
Polish, 101
Precision. See Information retrieval measures
Predicate argument relations, x, 38
Prepositional phrases, 52
  ambiguity and attachment, viii, 136
Presentational there, 100
Principles and parameters, 101
Probabilistic approach, viii, 3. See also Stochastic grammars
Pronunciation, 124
Proper names, syntax of, 20
Prosody, 122, 127-128
Psycholinguistics, ix
Punctuation, 128
Quantitative methods and approaches, vii, ix. See also Probabilistic approach; Stochastic grammar
Raising and control, 101
Recall. See Information retrieval measures
Repair process (speech), 164-171
Rightward dislocation, 100
Rule-based approach, 29
Rule dependence, 150-153
Rule-ranking metric, 142, 145-146
Russian, 99
Selectional restrictions, 18, 20, 33-34
Semantic classes, 70
Semantics, 121, 160
  denotational, 43
  monotonic, 43
  referential, 42
  rules and axioms, 31-34
Shannon, 22-25, 57
  diversity, 60-63
  noisy channel model, 19
Smoothing, 74, 88. See also Sparse data
Sociolinguistics, ix, 122
Sparse data, 36, 75
Spectral analysis, 122
Speech
  disfluencies, 126-127
  false starts, 157
  spontaneous, 128, 157
Speech act, 161
Speech processing, viii
Speech recognition, ix, 24, 28, 30-31, 36, 68, 119-120, 128-129
Speech synthesis, 31, 37
Speech translation, 27, 158
Speech understanding, 119
Speech waveform, 122
Speech-to-speech translation. See Speech translation
Spoken language, 120
Stochastic grammar, 3, 24
Stylistic inversion, 100
Subcategorization, 20
Subject-auxiliary inversion, 100
Subset principle, 99
Support verbs, 20
Surface order, 41
Syntax, 121
  autonomous, 14
  theoretical, 6
Terminology, 49, 65
Thesaurus
  construction, 68
  generation, 49
Tomita's generalized LR parser, 160
Transfer approach (in translation), 27, 34
Translation. See Machine translation
Tree-adjoining grammar, 24
Turing test (for generative model), 14
Typology, 4-5, 97
Unification, 29, 158
Universal grammar, 102
User interaction, 171-173
Variability, 130
Variation
  lexical, 36
  linguistic, 97
Verb second construction, 103, 105, 106-108
Viterbi procedure, 143
Vowel changes, 122
Weaver (and machine translation), 28
Weighted grammar, 10, 15, 19, 24
Wh-movement, 101
Word order, 37, 99-100, 105
Word sense, 34, 39, 45, 126. See also Ambiguity; Disambiguation
WordNet, 77
World Wide Web, ix