Hearing – From Sensory Processing to Perception
B. Kollmeier G. Klump V. Hohmann U. Langemann M. Mauermann S. Uppenkamp J. Verhey (Eds.)
Hearing – From Sensory Processing to Perception With 224 Figures
Prof. Dr. Birger Kollmeier, Prof. Dr. Georg Klump, Dr. Volker Hohmann, Dr. Ulrike Langemann, Dr. Manfred Mauermann, Dr. Stefan Uppenkamp, Dr. Jesko Verhey
Fakultät V, Institut für Physik, Carl-von-Ossietzky Universität, 26111 Oldenburg, Germany
Library of Congress Control Number: 2007928331 ISBN: 978-3-540-73008-8 Springer Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permissions for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Editor: Dr. Dieter Czeschlik, Heidelberg, Germany Desk editor: Dr. Jutta Lindenborn, Heidelberg, Germany Cover design: WMXDesign GmbH, Heidelberg, Germany Production and typesetting: SPi Printed on acid-free paper SPIN 11915300
Preface
The current book presents the written contributions to a kind of “World summit on hearing research”, i.e., the “International Symposium on Hearing” (ISH 2006), which was held in Cloppenburg, a small northern German town close to Oldenburg and Bremen, in August 2006. The International Symposium on Hearing has been held approximately every three years in Europe since 1969. The participants come from groups mostly in Europe and in the USA that focus on a wide range of topics in research on auditory system function. It is a hallmark of this truly interdisciplinary meeting to bring together well-known researchers specializing in psychophysics, physiology and models of hearing. This connection stimulates the discussion on the physiological mechanisms underlying perception and provides the basis for a better understanding of auditory function. Modelling approaches complement the experimental studies and serve as a framework for interpreting the results and developing new experimental paradigms.
The main themes of the current meeting are at the focus of interest in hearing research. The physiological representation of the temporal and the spectral structure of stimuli on different levels of the auditory system is a pervasive topic of the studies presented at the meeting, helping us to understand the perception of modulation patterns, pitch and signal intensity. Our knowledge of the physiological mechanisms of binaural processing in mammals is developing further, providing an improved basis for understanding spatial hearing. How the different stimulus features are integrated into auditory scene analysis and which physiological mechanisms allow the formation of auditory objects is another unifying theme linking researchers focussing on modeling, physiology and psychophysics. Finally, the topics of speech perception and the limitations of auditory perception resulting from hearing disorders were discussed on the basis of our understanding of the physiology of the auditory system.
The chapters of this volume, the proceedings of the “14th International Symposium on Hearing”, provide an up-to-date status of the field of hearing research. We hope that it will stimulate further discussion and will also enable newcomers to the field to access the newest developments in our understanding of auditory system function and auditory perception.
The organizers of the ISH 2006 and editors of this book are affiliated with the Universität Oldenburg where one of the largest European centres for
hearing research is located. Institutional support for the ISH 2006 was therefore provided by:
● Kompetenzzentrum HörTech (i.e., national centre of competence for hearing aid system technology, located in the “house of hearing” in Oldenburg)
● Sonderforschungsbereich/Transregio “Das aktive Gehör” (Oldenburg/Magdeburg, i.e., collaborative research center “the active auditory system”, supported by DFG)
● Internationales Graduiertenkolleg (international research training site) “neurosensory science, systems, and applications” (Oldenburg/Groningen, supported by DFG and NWO)
Further financial support was kindly provided by Widex A/S and Siemens Audiologische Technik (SAT). The organizers wish to thank these institutions and all individuals who made the ISH 2006 an unforgettable event.
Oldenburg, December 2006
Birger Kollmeier, Georg Klump, Volker Hohmann, Ulrike Langemann, Manfred Mauermann, Stefan Uppenkamp, and Jesko Verhey
List of participants (and key to photograph)
Bahmer, Andreas (49); Beutelmann, Rainer (89); Bleeck, Stefan (13); Carlyon, Bob (80); Carney, Laurel H. (91); Carr, Catherine E. (16); Chait, Maria (17); Chen, Hsi-Pin (73); Christiansen, Thomas Ulrich (45); Colburn, Steve (1); de Cheveigné, Alan (28); Demany, Laurent (82); Dietz, Matthias (26); Divenyi, Pierre (36); Dooling, Robert J. (43); Duifhuis, Hendrikus (88); Egorova, Marina (77); El Hilali, Mounya (81); Emiroglu, Suzan (4); Englitz, Bernhard (22); Ernst, Stephan (25); Ewert, Stephan D. (19); Festen, Joost M. (59); Garre, Susanne (71); Ghitza, Oded (46); Gleich, Otto (34); Goossens, Tom (87); Goupell, Matthew Joseph (66);
Goverts, Theo (11); Greenberg, Steven (75); Grimault, Nicolas (2); Hage, Steffen R. (63); Hall, Deborah A. (67); Hancock, Kenneth E. (31); Hansen, Hans (50); Hartmann, William M. (74); Heinz, Michael G. (32); Heise, Stephan (93); Henning, G. Bruce (53); Hohmann, Volker (10); Junius, Dirk (24); Kashino, Makio (39); Klinge, Astrid (72); Klump, Georg (21); Kohlrausch, Armin (35); Kollmeier, Birger (9); Langemann, Ulrike (7); Langers, Dave R.M. (83); Langner, Gerald (15); Leek, Marjorie R. (33); Leijon, Arne (62); Long, Glenis (23); Lopez-Poveda, Enrique A. (30); Lüddemann, Helge (37); Lütkenhöner, Bernd (68); Marquardt, Torsten (90); Mauermann, Manfred (6); McAlpine, David (27); Meddis, Raymond (40); Meyer, Julia (55); Micheyl, Christophe (94); Narins, Peter M. (84); Neher, Tobias (54); Nelson, Paul (61); Palmer, Alan R. (78); Patterson, Roy D. (14); Plack, Christopher J. (42);
Pressnitzer, Daniel (69); Riedel, Helmut (60); Roberts, Brian (47); Rupp, Andre (48); Schimmel, Othmar (86); Schmidt, Erik (12); Schoffelen, Rick (92); Shackleton, Trevor M. (64); Shamma, Shihab A. (79); Shinn-Cunningham, Barbara (3); Simon, Jonathan Z. (18); Siveke, Ida (52); Strahl, Stefan (76); Trahiotis, Constantine (65); Tsuzaki, Minoru (20); Unoki, Masashi (38); Uppenkamp, Stefan (5); van Beurden, Maarten F. B. (44); van de Par, Steven (57); Verhey, Jesko Lars (8); Watkins, Anthony (56); Weber, Reinhard (51); Wiegrebe, Lutz (70); Winter, Ian Michael (29); Yasin, Ifat (41); Yost, William A. (85); Young, Eric D. (58)
Not in photograph: Bilsen, Frans A.; Culling, John F.; Delgutte, Bertrand; Devore, Sasha; Ihlefeld, Antje; Kaernbach, Christian; Seeber, Bernhard U.; van Dijk, Pim; Tollin, Daniel J.
Contents
Part I
Cochlea/Peripheral Processing
1
Influence of Neural Synchrony on the Compound Action Potential, Masking, and the Discrimination of Harmonic Complexes in Several Avian and Mammalian Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 OTTO GLEICH, MARJORIE LEEK, AND ROBERT DOOLING
2
A Nonlinear Auditory Filterbank Controlled by Sub-band Instantaneous Frequency Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11 VOLKER HOHMANN AND BIRGER KOLLMEIER
3
Estimates of Tuning of Auditory Filter Using Simultaneous and Forward Notched-noise Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19 MASASHI UNOKI, RYOTA MIYAUCHI, AND CHIN-TUAN TAN
4
A Model of Ventral Cochlear Nucleus Units Based on First Order Intervals . . .27 STEFAN BLEECK AND IAN WINTER
5
The Effect of Reverberation on the Temporal Representation of the F0 of Frequency Swept Harmonic Complexes in the Ventral Cochlear Nucleus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35 MARK SAYLES, BERT SCHOUTEN, NEIL J. INGHAM, AND IAN M. WINTER
6
Spectral Edges as Optimal Stimuli for the Dorsal Cochlear Nucleus . . . . . . . . . .43 SHARBA BANDYOPADHYAY, ERIC D. YOUNG, AND LINA A. J. REISS
7
Psychophysical and Physiological Assessment of the Representation of High-frequency Spectral Notches in the Auditory Nerve . . . . . . . . . . . . . . . . .51 ENRIQUE A. LOPEZ-POVEDA, ANA ALVES-PINTO, AND ALAN R. PALMER
Part II
Pitch
8
Spatio-Temporal Representation of the Pitch of Complex Tones in the Auditory Nerve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61 LEONARDO CEDOLIN AND BERTRAND DELGUTTE
9
Virtual Pitch in a Computational Physiological Model . . . . . . . . . . . . . . . . . . . . .71 RAY MEDDIS AND LOWEL O’MARD
10
Searching for a Pitch Centre in Human Auditory Cortex . . . . . . . . . . . . . . . . . .83 DEB HALL AND CHRISTOPHER PLACK
11
Imaging Temporal Pitch Processing in the Auditory Pathway . . . . . . . . . . . . . .95 ROY D. PATTERSON, ALEXANDER GUTSCHALK, ANNEMARIE SEITHER-PREISLER, AND KATRIN KRUMBHOLZ
Part III
Modulation
12
Spatiotemporal Encoding of Vowels in Noise Studied with the Responses of Individual Auditory-Nerve Fibers . . . . . . . . . . . . . . . . . . . . . .107 MICHAEL G. HEINZ
13
Role of Peripheral Nonlinearities in Comodulation Masking Release . . . . . .117 JESKO L. VERHEY AND STEPHAN M.A. ERNST
14
Neuromagnetic Representation of Comodulation Masking Release in the Human Auditory Cortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 ANDRÉ RUPP, LIORA LAS, AND ISRAEL NELKEN
15
Psychophysically Driven Studies of Responses to Amplitude Modulation in the Inferior Colliculus: Comparing Single-Unit Physiology to Behavioral Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133 PAUL C. NELSON AND LAUREL H. CARNEY
16
Source Segregation Based on Temporal Envelope Structure and Binaural Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .143 STEVEN VAN DE PAR, OTHMAR SCHIMMEL, ARMIN KOHLRAUSCH, AND JEROEN BREEBAART
17
Simulation of Oscillating Neurons in the Cochlear Nucleus: A Possible Role for Neural Nets, Onset Cells, and Synaptic Delays . . . . . . . . .155 ANDREAS BAHMER AND GERALD LANGNER
18
Forward Masking: Temporal Integration or Adaptation? . . . . . . . . . . . . . . . . .165 STEPHAN D. EWERT, OLE HAU, AND TORSTEN DAU
19
The Time Course of Listening Bands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .175 PIERRE DIVENYI AND ADAM LAMMERT
Part IV
Animal Communication
20
Frogs Communicate with Ultrasound in Noisy Environments . . . . . . . . . . . . .185 PETER M. NARINS, ALBERT S. FENG, AND JUN-XIAN SHEN
21
The Olivocochlear System Takes Part in Audio-Vocal Interaction . . . . . . . . .191 STEFFEN R. HAGE, UWE JÜRGENS, AND GÜNTER EHRET
22
Neural Representation of Frequency Resolution in the Mouse Auditory Midbrain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .199 MARINA EGOROVA , INNA VARTANYAN, AND GUENTER EHRET
23
Behavioral and Neural Identification of Birdsong under Several Masking Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207 BARBARA G. SHINN-CUNNINGHAM, VIRGINIA BEST, MICHEAL L. DENT, FREDERICK J. GALLUN, ELIZABETH M. MCCLAINE, RAJIV NARAYAN, EROL OZMERAL, AND KAMAL SEN
Part V
Intensity Representation
24
Near-Threshold Auditory Evoked Fields and Potentials are In Line with the Weber-Fechner Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215 BERND LÜTKENHÖNER, JAN-STEFAN KLEIN, AND ANNEMARIE SEITHER-PREISLER
25
Brain Activation in Relation to Sound Intensity and Loudness . . . . . . . . . . . .227 DAVE LANGERS, WALTER BACKES, AND PIM VAN DIJK
26
Duration Dependency of Spectral Loudness Summation, Measured with Three Different Experimental Procedures . . . . . . . . . . . . . . . . . . . . . . . . . .237 MAARTEN F.B. VAN BEURDEN AND WOUTER A. DRESCHLER
Part VI
Scene Analysis
27
The Correlative Brain: A Stream Segregation Model . . . . . . . . . . . . . . . . . . . . .247 MOUNYA ELHILALI AND SHIHAB SHAMMA
28
Primary Auditory Cortical Responses while Attending to Different Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .257 PINGBO YIN, LING MA, MOUNYA ELHILALI, JONATHAN FRITZ, AND SHIHAB SHAMMA
29
Hearing Out Repeating Elements in Randomly Varying Multitone Sequences: A Case of Streaming? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .267 CHRISTOPHE MICHEYL, SHIHAB A. SHAMMA, AND ANDREW J. OXENHAM
30
The Dynamics of Auditory Streaming: Psychophysics, Neuroimaging, and Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .275 MAKIO KASHINO, MINAE OKADA, SHIN MIZUTANI, PETER DAVIS, AND HIROHITO M. KONDO
31
Auditory Stream Segregation Based on Speaker Size, and Identification of Size-Modulated Vowel Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .285 MINORU TSUZAKI, CHIHIRO TAKESHIMA, TOSHIO IRINO, AND ROY D. PATTERSON
32
Auditory Scene Analysis: A Prerequisite for Loudness Perception . . . . . . . . .295 NICOLAS GRIMAULT, STEPHEN MCADAMS, AND JONT B. ALLEN
33
Modulation Detection Interference as Informational Masking . . . . . . . . . . . .303 STANLEY SHEFT AND WILLIAM A. YOST
34
A Paradoxical Aspect of Auditory Change Detection . . . . . . . . . . . . . . . . . . . . .313 LAURENT DEMANY AND CHRISTOPHE RAMOS
35
Human Auditory Cortical Processing of Transitions Between ‘Order’ and ‘Disorder’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .323 MARIA CHAIT, DAVID POEPPEL, AND JONATHAN Z. SIMON
36
Wideband Inhibition Modulates the Effect of Onset Asynchrony as a Grouping Cue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .333 BRIAN ROBERTS, STEPHEN D. HOLMES, STEFAN BLEECK, AND IAN M. WINTER
37
Discriminability of Statistically Independent Gaussian Noise Tokens and Random Tone-Burst Complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .343 TOM GOOSSENS, STEVEN VAN DE PAR, AND ARMIN KOHLRAUSCH
38
The Role of Rehearsal and Lateralization in Pitch Memory . . . . . . . . . . . . . . .353 CHRISTIAN KAERNBACH, KATHRIN SCHLEMMER, CHRISTINA ÖFFL, AND SANDRA ZACH
Part VII
Binaural Hearing
39
Interaural Correlation and Loudness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .359 JOHN F. CULLING AND BARRIE A. EDMONDS
40
Interaural Phase and Level Fluctuations as the Basis of Interaural Incoherence Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .369 MATTHEW J. GOUPELL AND WILLIAM M. HARTMANN
41
Logarithmic Scaling of Interaural Cross Correlation: A Model Based on Evidence from Psychophysics and EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .379 HELGE LÜDDEMANN, HELMUT RIEDEL, AND BIRGER KOLLMEIER
42
A Physiologically-Based Population Rate Code for Interaural Time Differences (ITDs) Predicts Bandwidth-Dependent Lateralization . . . . . . . . .389 KENNETH E. HANCOCK
43
A p-Limit for Coding ITDs: Neural Responses and the Binaural Display . . . .399 DAVID MCALPINE, SARAH THOMPSON, KATHARINA VON KRIEGSTEIN, TORSTEN MARQUARDT, TIMOTHY GRIFFITHS, AND ADENIKE DEANE-PRATT
44
A p-Limit for Coding ITDs: Implications for Binaural Models . . . . . . . . . . . .407 TORSTEN MARQUARDT AND DAVID MCALPINE
45
Strategies for Encoding ITD in the Chicken Nucleus Laminaris . . . . . . . . . . .417 CATHERINE CARR AND CHRISTINE KÖPPL
46
Interaural Level Difference Discrimination Thresholds and Virtual Acoustic Space Minimum Audible Angles for Single Neurons in the Lateral Superior Olive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .425 DANIEL J. TOLLIN
47
Responses in Inferior Colliculus to Dichotic Harmonic Stimuli: The Binaural Integration of Pitch Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .435 TREVOR M. SHACKLETON, LIANG-FA LIU, AND ALAN R. PALMER
48
Level Dependent Shifts in Auditory Nerve Phase Locking Underlie Changes in Interaural Time Sensitivity with Interaural Level Differences in the Inferior Colliculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .447 ALAN R. PALMER, LIANG-FA LIU, AND TREVOR M. SHACKLETON
49
Remote Masking and the Binaural Masking-Level Difference . . . . . . . . . . . . .457 G. BRUCE HENNING, IFAT YASIN, AND CAROLINE WITTON
50
Perceptual and Physiological Characteristics of Binaural Sluggishness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .467 IDA SIVEKE, STEPHAN D. EWERT, AND LUTZ WIEGREBE
51
Precedence-Effect with Cochlear Implant Simulation . . . . . . . . . . . . . . . . . . . .475 BERNHARD U. SEEBER AND ERVIN HAFTER
52
Enhanced Processing of Interaural Temporal Disparities at High-Frequencies: Beyond Transposed Stimuli . . . . . . . . . . . . . . . . . . . . . . . . .485 LESLIE R. BERNSTEIN AND CONSTANTINE TRAHIOTIS
53
Models of Neural Responses to Bilateral Electrical Stimulation . . . . . . . . . . .495 H. STEVEN COLBURN, YOOJIN CHUNG, YI ZHOU, AND ANDREW BRUGHERA
54
Neural and Behavioral Sensitivities to Azimuth Degrade with Distance in Reverberant Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .505 SASHA DEVORE, ANTJE IHLEFELD, BARBARA G. SHINN-CUNNINGHAM, AND BERTRAND DELGUTTE
Part VIII
Speech and Learning
55
Spectro-temporal Processing of Speech – An Information-Theoretic Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .517 THOMAS U. CHRISTIANSEN, TORSTEN DAU, AND STEVEN GREENBERG
56
Articulation Index and Shannon Mutual Information . . . . . . . . . . . . . . . . . . . .525 ARNE LEIJON
57
Perceptual Compensation for Reverberation: Effects of ‘Noise-Like’ and ‘Tonal’ Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .533 ANTHONY WATKINS AND SIMON MAKIN
58
Towards Predicting Consonant Confusions of Degraded Speech . . . . . . . . . .541 O. GHITZA, D. MESSING, L. DELHORNE, L. BRAIDA, E. BRUCKERT, AND M. SONDHI
59
The Influence of Masker Type on the Binaural Intelligibility Level Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .551 S. THEO GOVERTS, MARIEKE DELREUX , JOOST M. FESTEN, AND TAMMO HOUTGAST
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .559
1 Influence of Neural Synchrony on the Compound Action Potential, Masking, and the Discrimination of Harmonic Complexes in Several Avian and Mammalian Species
OTTO GLEICH¹, MARJORIE LEEK², AND ROBERT DOOLING³
1 Introduction
An important goal of comparative auditory research is to understand the relationship between structure, mechanisms, and function. The ears of mammals and birds are quite different along many dimensions, but the hearing abilities are remarkably similar on a variety of psychoacoustic tasks (Dooling et al. 2000). However, tests involving temporal fine structure now show interesting differences between birds and humans that may permit a more penetrating analysis of the role of structural and mechanical variation among species in the processing of complex sounds. One major difference between birds and mammals related to substantial differences in cochlear dimensions is the frequency dependent cochlear response delay. In this chapter we analyze how physiological responses and psychoacoustic measures of masking and discrimination may be accounted for by an interaction of species-specific cochlear response delay and the time distribution of harmonic frequencies within these complexes.
2 Methods
2.1 Stimuli
Stimuli for the studies reviewed here were harmonic complexes with equal-amplitude components and component phases selected to produce complexes with monotonically increasing or decreasing frequency across each fundamental period. Complete descriptions of these stimuli may be found in Leek et al. (2005) and in Lauer et al. (2006). These complexes, generally called “Schroeder complexes”, have component frequencies from 0.2 to 5 kHz and a fundamental frequency of 100 Hz. Variants of the original Schroeder-phase algorithm (Schroeder 1970) include a scalar, C, ranging from −1.0 to +1.0 in steps of 0.1, that serves to increase or decrease the rate of change of frequency across each fundamental period.
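For readers who want to reproduce such stimuli, the following sketch (Python/NumPy) generates an equal-amplitude harmonic complex with Schroeder-type component phases. The phase rule phi_n = C·π·n(n+1)/N is one commonly used formulation and is assumed here, since the chapter itself does not spell out the exact expression; the fundamental frequency, the component range and the range of C follow the text above.

    import numpy as np

    def schroeder_complex(C, f0=100.0, fmin=200.0, fmax=5000.0, fs=44100, dur=0.26):
        # Equal-amplitude harmonic complex (0.2-5 kHz, f0 = 100 Hz) with
        # Schroeder-type phases; C is the scalar from -1.0 to +1.0.
        n_lo = int(np.ceil(fmin / f0))            # lowest harmonic number
        n_hi = int(np.floor(fmax / f0))           # highest harmonic number
        N = n_hi - n_lo + 1                       # number of components
        t = np.arange(int(dur * fs)) / fs
        x = np.zeros_like(t)
        for n in range(n_lo, n_hi + 1):
            phi = C * np.pi * n * (n + 1) / N     # assumed Schroeder-type phase formula
            x += np.cos(2 * np.pi * n * f0 * t + phi)
        return x / np.max(np.abs(x))              # normalise to unit peak

    # one waveform per scalar value, in 0.1 steps as in Fig. 1
    waveforms = {round(c, 1): schroeder_complex(round(c, 1))
                 for c in np.arange(-1.0, 1.01, 0.1)}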
¹ ENT Department, University of Regensburg, Germany, [email protected]
² National Center for Rehabilitative Auditory Research, Portland VA Medical Center, USA, [email protected]
³ Department of Psychology, University of Maryland, USA, [email protected]
2.2 Physiological Measures
The procedures to record evoked cochlear potentials in response to harmonic complexes are described in detail in Dooling et al. (2001). The stimulus waveforms were those shown in Fig. 1, as well as inverted versions used to cancel the cochlear microphonic response and isolate the compound action potential (CAP) as a measure of neural synchronization. The stimulus level used for the CAP measurements was set to 70 dB SPL. Physiological data were collected from three budgerigars, two canaries, one zebra finch, four gerbils and two guinea pigs.
2.3 Frequency-Specific Cochlear Delay
The cochlear delay functions were derived as best-fit power functions from scatter plots of published data relating response delay to frequency.
Fig. 1 Waterfall display of three periods of the waveform for harmonic complexes with a fundamental frequency of 100 Hz created by systematically varying the scalar C in 0.1-steps from −1.0 to +1.0 as indicated by the number at the right side of each waveform. The gray lines in each trace indicate the variation of instantaneous frequency between 0.2 and 5.0 kHz over time. The greater the slope of these lines, the more rapid are the within-period frequency sweeps
These data come predominantly from auditory nerve fiber recordings in birds (Sachs et al. 1974; Gleich and Narins 1988), guinea pig (Palmer and Russell 1986) and gerbil (Schmiedt and Zwislocki 1977) and have been corrected by 1 ms to account for neural delay. Additional bird data came from mechanical measurements of basilar membrane response latency in pigeon (Gummer et al. 1987). Human data are based on the derived ABR data shown in Fig. 3d of Schoonhoven et al. (2001) and frequency-specific wave V latency data presented in Table 1 of Donaldson and Ruth (1993), adjusted by 5.3 ms. The resulting best-fit power functions relating frequency to cochlear delay for the different species are: human, y = 3.4138x^(−0.7396); gerbil, y = 0.502x^(−1.5836); guinea pig, y = 1.6394x^(−0.7496); and bird, y = 0.6813x^(−0.6121), with x representing frequency in kHz and y the delay in ms.
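As a sketch of how these fitted delay functions can be used, for example to compute the spread of arrival times across the 0.2–5 kHz component range, the coefficients above can be tabulated directly; the code below is plain Python and only restates the equations given in the text.

    # Best-fit power functions from Sect. 2.3: delay in ms = a * (f in kHz) ** b
    DELAY_PARAMS = {
        "human":      (3.4138, -0.7396),
        "gerbil":     (0.502,  -1.5836),
        "guinea pig": (1.6394, -0.7496),
        "bird":       (0.6813, -0.6121),
    }

    def cochlear_delay_ms(f_khz, species):
        a, b = DELAY_PARAMS[species]
        return a * f_khz ** b

    # delay difference between the lowest (0.2 kHz) and highest (5 kHz) component
    for species in DELAY_PARAMS:
        spread = cochlear_delay_ms(0.2, species) - cochlear_delay_ms(5.0, species)
        print(f"{species}: {spread:.2f} ms")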
3 Results
3.1 CAP Amplitude as a Function of Scalar Value
Mean CAP amplitudes are illustrated in Fig. 2 as a function of the scalar value, demonstrating a species-specific variation of the CAP amplitude. Negative scalars are on average associated with higher CAP amplitudes than positive scalars, consistent with the notion that upward frequency sweeps tend to “compensate” cochlear delay and cause a higher degree of synchronization compared to downward sweeps (i.e., positive scalars). A prediction for humans, which will be described in the next section, is illustrated as the thick black line in Fig. 2.
Fig. 2 Mean CAP amplitude as a function of scalar value is shown for bird, gerbil and guinea pig, with the number of animals in each group indicated in the legend. The thick black line shows a prediction for CAP amplitude in humans based on the regression line described in the next section
3.2 Cochlear Activation: Interaction of Stimulus-Related Frequency Timing and Cochlear Delay
Figure 3 illustrates that cochlear activation over stimulus periods varies considerably between species and scalars. The difference in the duration of cochlear activation by one stimulus period is more pronounced in mammals due to the long response delays at low frequencies, and the difference between positive and negative scalars decreases for scalar values close to 0. A stimulus perfectly compensating cochlear delay would result in synchronized cochlear activation across frequencies and an activation function represented by a vertical line in Fig. 3. A high degree of synchronization should result in a maximized CAP amplitude (Dau et al. 2000). Obviously, none of the harmonic stimuli perfectly compensates cochlear delay in the species studied. To obtain a quantitative measure of the degree of synchronization we determined the maximum frequency range activated by a single period within a 0.5-ms time window (i.e. around the steepest portion of the cochlear activation functions shown in Fig. 3) as a function of the scalar value for the different species. Since the frequency representation in all species can be regarded as roughly logarithmic, we used the maximally synchronized cochlear region expressed as octaves for this comparison. Figure 4 demonstrates that all species show a maximum synchronization for negative scalars of −0.1 or −0.2.
Fig. 3 Cochlear activation over three stimulus periods for four different scalars in human, gerbil, guinea pig and bird. The dotted line indicates the frequency timing within the stimulus
Fig. 4 The left panel shows the cochlear region responding within a 0.5-ms time window, expressed in octaves, as a function of the scalar value. The right panel plots the CAP amplitude derived from the mean curves in Fig. 2 as a function of the corresponding synchronized octaves of a given scalar shown in the left panel
Overall, birds show a higher degree of synchronization when expressed in octaves. If the frequency scale is converted to physical location on the sensory epithelium, the maximally synchronized region is 9 mm for a scalar of −0.2 in humans (almost one third of the organ of Corti) and 2 mm for a scalar of −0.1 in birds (corresponding to 70% of the basilar papilla). The shape of the curves illustrating the synchronized octaves as a function of scalar value (Fig. 4, left panel) is similar to the shape of the mean CAP-amplitude curves (Fig. 2). The right panel in Fig. 4 demonstrates a highly significant correlation between the physiologically determined CAP and the synchronized cochlear region responding within 0.5 ms.
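A minimal sketch of this synchronization measure is given below. It assumes, following the text, that the instantaneous frequency of the complex moves linearly between 0.2 and 5 kHz within each period, that the sweep occupies a fraction |C| of the 10-ms period (an interpretation of Fig. 1, not an equation stated in the chapter), and that each frequency is additionally delayed by the species-specific power-law cochlear delay of Sect. 2.3; it then slides a 0.5-ms window over the resulting arrival times and reports the largest octave span covered.

    import numpy as np

    def synchronized_octaves(C, a, b, f0=100.0, fmin=0.2, fmax=5.0, win_ms=0.5):
        # a, b: cochlear-delay coefficients (delay in ms = a * f_kHz ** b),
        # e.g. a = 3.4138, b = -0.7396 for the human fit in Sect. 2.3.
        period_ms = 1000.0 / f0
        f = np.linspace(fmin, fmax, 2000)                    # frequency axis in kHz
        frac = (f - fmin) / (fmax - fmin)
        if C < 0:                                            # upward sweep
            t_stim = abs(C) * period_ms * frac
        elif C > 0:                                          # downward sweep
            t_stim = abs(C) * period_ms * (1.0 - frac)
        else:                                                # cosine phase: impulse-like
            t_stim = np.zeros_like(f)
        arrival = t_stim + a * f ** b                        # stimulus timing plus cochlear delay
        best = 0.0
        for t0 in np.arange(arrival.min(), arrival.max(), 0.01):
            sel = f[(arrival >= t0) & (arrival < t0 + win_ms)]
            if sel.size > 1:
                best = max(best, np.log2(sel.max() / sel.min()))
        return best

    # e.g. maximally synchronized region for the human fit at C = -0.2
    print(synchronized_octaves(-0.2, 3.4138, -0.7396))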
4 Discussion
Based on vertebrate cochlear frequency representation, harmonic complexes with within-period frequency sweeps from low to high (negative complexes) can be expected to synchronize neural responses better than those with downward sweeping instantaneous frequencies (positive complexes) because they “compensate” cochlear delays (see also Dau et al. 2000). This is consistent with the general observation that CAP amplitudes for negative scalars tend to be higher than those in response to positive scalars (Fig. 2). Since frequency within a given period of the harmonic complex varies linearly over time (Fig. 1) and the cochlear delay shows a highly non-linear variation with frequency, the interaction results in a complex pattern of cochlear activation over consecutive stimulus periods (Fig. 3). The right diagram in Fig. 4 demonstrates that the analysis of the temporal interaction of an acoustic stimulus and cochlear delay looking at an arbitrarily selected 0.5-ms time window allows reasonable predictions of neural synchronization and CAP amplitude. The scalar-dependent variation of cochlear synchronization (Fig. 4) or CAP amplitude (Fig. 2) differs substantially from the pattern of scalar-dependent
variation in the degree of masking reported by Lauer et al. (2006). In an attempt to derive a measure from the cochlear activation analysis (Fig. 2) that might be used to predict the scalar dependent degree of masking, we calculated the time of cochlear activation within one period of the harmonic complex for the frequency range between 2.6 and 3.0 kHz around the signal frequency of 2.8 kHz used by Lauer et al. (2006). The hypothesis is that longer cochlear activation around the signal frequency will cause more masking compared to shorter activation. Fig. 5 shows that there is a good correlation for positive and negative scalars in birds. Data for negative scalars in humans are similar to data from birds, but masking by positive scaled complexes appears independent of the duration of cochlear activation. In order to assess whether the differences across species regarding scalar discrimination might be reconciled by taking cochlear activation into account, the data from the Leek et al. (2005) study were replotted as a function of estimates of the difference in total duration of cochlear activation by one period of the standard and the corresponding scaled complex (Fig. 6).
Fig. 5 The diagram shows masked threshold (taken from Lauer et al. 2006) as a function of cochlear activation for the frequency range between 2.6 and 3.0 kHz. Open symbols: negative scalars, filled symbols: positive scalars
Fig. 6 The probability for a correct discrimination as a function of the absolute difference in the duration of cochlear activation between the standard (−1, +1 and 0) and the scaled complexes for humans (left diagram) and birds (right diagram)
Discriminability of these scaled harmonic complexes from either cosine-phase (i.e., scalar = 0) or from scalars of ±1.0 generally improves when the difference between the standard and the signal in the duration of cochlear activation increases (Fig. 6).
5 Conclusions
These results show that several physiological and behavioral measures of the processing of harmonic complexes are remarkably similar across a number of very diverse species when considered in terms of two simple parameters: species-specific cochlear response delay and the time distribution of harmonic frequencies within the harmonic complex. Variation in CAP amplitude across harmonic complexes correlates well with the spatial extent of cochlear activation. The duration of cochlear activation around probe frequency is consistent with masking data from birds but does not explain the reduced masking seen in humans with positive scaled harmonic complexes. Discriminability of harmonic complexes is generally related to differences in the duration of cochlear activation except for birds discriminating negative scalars from a cosine background. Acknowledgments. Supported by NIH Grants DC-00198 to RJD and DC-00626 to MRL.
References
Dau T, Wegner O, Mellert V, Kollmeier B (2000) Auditory brainstem responses with optimized chirp signals compensating basilar-membrane dispersion. J Acoust Soc Am 107:1530–1540
Donaldson GS, Ruth RA (1993) Derived band auditory brain-stem response estimates of traveling wave velocity in humans. I: Normal-hearing subjects. J Acoust Soc Am 93:940–951
Dooling RJ, Lohr B, Dent ML (2000) Hearing in birds and reptiles. In: Dooling RJ, Fay RR, Popper AN (eds) Comparative hearing: birds and reptiles. Springer, Berlin Heidelberg New York, pp 308–359
Dooling RJ, Dent ML, Leek MR, Gleich O (2001) Masking by harmonic complexes in three species of birds: psychophysical thresholds and cochlear responses. Hear Res 152:159–172
Gleich O, Narins PM (1988) The phase response of primary auditory afferents in a songbird. Hear Res 32:81–92
Gummer AW, Smolders JW, Klinke R (1987) Basilar membrane motion in the pigeon measured with the Mossbauer technique. Hear Res 29:63–92
Lauer AM, Dooling RJ, Leek MR, Lentz JJ (2006) Phase effects in masking by harmonic complexes in birds. J Acoust Soc Am 119:1251–1259
Leek MR, Dooling RJ, Gleich O, Dent ML (2005) Discrimination of temporal fine structure by birds and mammals. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing. Springer Science+Business Media, pp 471–477
Palmer AR, Russell IJ (1986) Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hear Res 24:1–15
Sachs MB, Young ED, Lewis RH (1974) Discharge patterns of single fibers in the pigeon auditory nerve. Brain Res 70:431–447
Schmiedt RA, Zwislocki JJ (1977) Comparison of sound-transmission and cochlear-microphonic characteristics in Mongolian gerbil and guinea pig. J Acoust Soc Am 61:133–149
Schoonhoven R, Prijs VF, Schneider S (2001) DPOAE group delays versus electrophysiological measures of cochlear delay in normal human ears. J Acoust Soc Am 109:1503–1512
Schroeder M (1970) Synthesis of low-peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inf Theory 16:85–89
Comment by Kohlrausch
In your Fig. 5, you analyze masking properties of scaled Schroeder-phase complexes by your measure of cochlear activation (basilar-membrane response synchrony across a certain frequency range) in the spectral region of the signal. I wonder whether this property of the stimulus is primarily responsible for the amount of masking. The influence of masker phase on masking properties in such conditions has been explained quite successfully by the peakiness of the on-channel masker waveform after going through the inner-ear filter at the signal frequency (see the original publications by Smith et al. 1986 and Kohlrausch and Sander 1995, but also the recent paper by Lauer et al. 2006, all in JASA). According to this explanation, the Schroeder-phase masker producing the least amount of masking is the one for which the phase curvature is similar (and opposite) to that of the relevant inner-ear filter. Thus, masking depends primarily on the phase characteristic of an individual point of the basilar membrane, which is a priori independent of the place-dependent cochlear delay. For humans, the psychophysical data by Lauer et al. and by Lentz and Leek (2001) suggest that around 3 kHz, scalar values between +0.5 and +1 result in the least effective masker. For the bird condition, the least effective masker is one with a scalar value close to zero, i.e. a zero- (or sine-) phase masker. This has led Lauer et al. to the conclusion that, around 3 kHz, the phase curvature of the corresponding inner-ear filter in birds is a factor 4 to 8 smaller than the curvature in humans. Introducing a curvature in the stimulus phase spectrum (i.e., increasing the scalar from 0 to either +1 or −1) will for birds have two effects. First, and in my view most important, the energy at the output of the 2.8-kHz filter will be smeared out over a longer portion of each masker period than for the zero-phase complex, leading to an increase in masking. Second, the synchrony across frequency will be reduced, because the frequency-dependent delays in the Schroeder-phase stimuli will be much larger than the place-dependent delay in the bird inner ear. For humans, on the other hand, the smearing out of the energy at the output of the corresponding inner-ear filter will only occur for negative scalar values (because stimulus phase curvature and filter phase curvature will add up to increase the resulting curvature, leading to a flat temporal envelope), but not for positive scalars (at least up to +1), for which the phase characteristics compensate each other to a certain extent.
Such a view, based on within-channel masker waveforms, agrees with all experimental data for both humans and birds shown in the left panel of Fig. 5.
References
Kohlrausch A, Sander A (1995) Phase effects in masking related to dispersion in the inner ear. II. Masking period patterns of short targets. J Acoust Soc Am 97:1817–1829
Lentz JJ, Leek MR (2001) Psychophysical estimates of cochlear phase response: masking by harmonic complexes. J Assoc Res Otolaryngol 2:408–422
Smith BK, Sieben UK, Kohlrausch A, Schroeder MR (1986) Phase effects in masking related to dispersion in the inner ear. J Acoust Soc Am 80:1631–1637
Reply
We are aware of, and agree with, the explanations reviewed by Dr. Kohlrausch regarding the masking data, and our data analysis is not inconsistent with the within-channel views. We were looking for a general analysis of cochlear activation patterns that could be related to various aspects of data on perception and processing of harmonic complexes. These included questions of synchronization (CAP), masking, and discrimination. The masking analyses of cochlear activation across a rough estimate of critical band around the probe frequency provide an alternative explanation for data obtained in birds that is consistent with within-channel masking in humans. These analyses reconcile masking differences across species except for the release from masking when the phase spectrum of the masker compensates for the phase characteristic of the sensory epithelium (as pointed out by Kohlrausch). They also are useful (if not perfect) explanations of the amplitudes of the compound action potentials and some aspects of discrimination across the complexes. Our goal was to find a physiological mechanism that would support all these experimental findings.
Comment by Lütkenhöner
Did the scalar C affect only the amplitude of the compound action potential (CAP) or did you also observe changes in shape? In that case, it might be useful to consider alternative measures of the response magnitude, for example the area under the dominant CAP peak.
Reply
The scalar not only affected the amplitude, but also the shape of the CAP waveform, as illustrated for a set of typical examples in Fig. A. Despite these
Fig. A Typical CAP waveforms in response to selected scaled harmonic complexes, collapsed across 10 periods from a gerbil (black lines) and a zebra finch (gray line). The inset in each diagram illustrates one period of the waveform of the corresponding harmonic complex
changes in the shape of the waveform, the peak-to-peak amplitude appears to be a useful measure for the present analysis of synchronized cochlear activation (see also Fig. 4).
2 A Nonlinear Auditory Filterbank Controlled by Sub-band Instantaneous Frequency Estimates
VOLKER HOHMANN AND BIRGER KOLLMEIER
1 Introduction
Functional models of basilar membrane motion have a long tradition and a wide range of applications. They usually take as input the stapes vibration and provide the excitation pattern of the inner hair cells as an output. Even though the design of these models and the psychophysical data put into the model design are based on simple signals (e.g. sinusoids and two-tone complexes), these models have the advantage of also being applicable to complex sounds (such as speech). Hence, they describe the degree to which we understand the response of the human peripheral auditory system to everyday sounds. While transmission line and coupled elements models (such as, e.g., Duifhuis et al. 1985; Talmadge et al. 1998) are primarily used to describe the “effective” influence of physical parameters and mechanical properties on the basilar membrane response, filterbank models are primarily used to describe the “effective” signal processing properties of the basilar membrane at a fixed position on the BM. While single-filter approaches (both linear and – in more refined models – nonlinear filters) have been used in the past, dual resonance filter approaches (see, e.g., Goldstein 1988; Meddis et al. 2001) have been suggested more recently. They explicitly model the approximately linear response to input frequencies remote from the best frequency separately from the nonlinear, compressive response to frequencies close to the best frequency. This approach has the advantage of adequately describing the frequency-selective gain and instantaneous compression. However, it does not correctly describe suppression phenomena for configurations with high frequency separation between suppressor and suppressee: While psychoacoustic and physiological data show an increase of suppression of up to 2.5 dB per dB with increasing suppressor level in low-side suppression, typical dual resonance filter models can only show a suppression rate of less than 1 dB per dB suppressor level. This originates from
Medizinische Physik, Fakultät V, Institut für Physik, Carl von Ossietzky Universität, Oldenburg, Germany, [email protected], [email protected]
the fact that in these models the increase in suppression with suppressor level is directly coupled to the amount of compression in the on-frequency nonlinear filter. For this reason, the current paper describes a new approach that extends the idea of a dual resonance filter by including a control of the nonlinear filter gain by the sub-band instantaneous frequency. This approach enables us to control to a certain extent the amount of suppression separately from the frequency-dependent gain characteristics. The main idea is to appropriately model the observation that the auditory system yields an increasingly linear response with less gain to an on-frequency component as soon as an increasing level of off-frequency components falls into the respective on-frequency filter. Using this approach, two-tone suppression data with suppressor frequencies well below the on-frequency component (low-side suppression) will be considered in this paper as well as the different behavior of on- and off-frequency masking in simultaneous and non-simultaneous masking conditions.
2 Description of the Model
The key feature of the model introduced here is the extraction of the instantaneous frequency, which is well defined for AM/FM signals, e.g., signals after peripheral bandpass filtering. The instantaneous frequency (IF) can be derived efficiently within a few waveforms from the analytical signal s(t) by computing the derivative of its phase. The assumption is that the deviation of the instantaneous frequency from the best frequency of the channel determines the amount of gain and compression. The hypothetical physiological mechanism may be the instantaneous-frequency-dependent direction of outer hair cell stereocilia deflection. According to Wersäll et al. (1965) the direction of outer hair cell deflection differs between on-frequency excitation and off-frequency excitation at the same place on the basilar membrane, with the sensitivity of the OHC being highest for the direction of motion induced by an on-frequency signal (see also Duifhuis 1976). Hence, the gain and compression for a given best frequency is modeled to depend on the difference between instantaneous frequency and the best frequency as given exemplarily in Fig. 1. The gain characteristic is obtained from a typical BM gain response curve obtained for ∆IF = 0 ERB (no deviation between best frequency and instantaneous frequency). With increasing difference, the gain characteristic is modeled to become less compressive and to achieve less gain at low input levels, i.e., gain(dB) = gain_exp(∆IF) × dbgain(L), where the gain exponent gain_exp decreases linearly with ∆IF from one to zero and dbgain(L) describes the standard on-frequency compressive input/output characteristic as a function of input level L. The complete design of one frequency channel of the model is given in Fig. 2. The dual-resonance approach is achieved with a wide, linear filter F1 in combination with a more frequency-selective filter F2 which is followed by
Fig. 1 I/O-characteristics, i.e., output level as a function of input level for deviations of the instantaneous frequency from best frequency (∆IF-values) of 0, −0.8, −1.7 and −2 ERB. At ∆IF=−2 ERB and below, response is linear and gain is zero
Fig. 2 Block diagram of one frequency-channel of the model. F1: wide band filter, F2: narrow band filter, IC: instantaneous compression, IF: instantaneous frequency estimation
a nonlinear instantaneous compression circuit (IC, right column in Fig. 2) which adds its output to the output of F1. The control of this compression is achieved by a feed-forward simulation of the dual-resonance filter with fixed IC (middle column in Fig. 2) which is taken as the input to the instantaneous frequency estimation (IF). The resulting difference ∆IF is used to compute
the gain_exp(∆IF) (see above), which alters the gain characteristic of the IC as shown in Fig. 1. Note that instead of employing two separate IC blocks for the feedforward structure in Fig. 2, only one IC block would be sufficient for a feedback structure where the output of the complete filter is used as input to the IF-circuit. Whereas such a feedback control might be simpler and more physiologically plausible, it would produce numerical instabilities. In addition, the chosen structure from Fig. 2 has the advantage that the compressed on-frequency components are used as input to the IF, which is a necessary prerequisite for predicting the correct two-tone suppression characteristic (see below). The implementation of the model employs linear bandpass filters (F1, F2) which were chosen as double exponential filters parameterized by the respective centre frequency and the lower and upper slope in dB per octave (F1: lower slope 12 dB/octave, upper slope −48 dB/octave; F2: lower slope 60 dB/octave and upper slope −60 dB/octave). The filters were implemented as FFT-based minimum phase filters with a length of 1000 samples at a sampling frequency of 22.05 kHz, using a complex output to approximate the analytical signals. A linear distribution of centre frequencies was obtained on an ERB scale with two filters per ERB. The instantaneous compression stage operates on a sample-by-sample basis on the Hilbert envelope (as in Herzke and Hohmann 2005) and uses the I/O-characteristics as sketched in Fig. 1. These parameters were selected in order to fit best to a variety of psychophysical masking data. Specifically, the parameters were fitted to best reproduce the two-tone suppression data, and the upward spread of masking data were then predicted using these fixed parameter settings.
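The following sketch illustrates the core of one channel as described in this and the preceding paragraphs: the instantaneous frequency is estimated from the phase derivative of the analytic signal, its deviation from the channel's best frequency (in ERB) scales the gain exponent linearly from one towards zero, and the resulting gain is applied sample by sample to the Hilbert envelope. The band-pass filters F1 and F2 and the feed-forward control path of Fig. 2 are omitted; the 2-ERB range of the gain exponent and the placeholder dbgain function are illustrative assumptions, not parameters taken from the paper.

    import numpy as np
    from scipy.signal import hilbert

    def if_controlled_gain(x, fs, fc, erb_fc, dbgain, exponent_span=2.0):
        # x: output of the narrow filter F2 for the channel centred at fc (Hz);
        # erb_fc: ERB of that channel in Hz; dbgain: on-frequency I/O gain in dB.
        analytic = hilbert(x)                                     # complex analytic signal
        envelope = np.abs(analytic)
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.gradient(phase) * fs / (2.0 * np.pi)       # instantaneous frequency in Hz
        d_if = np.abs(inst_freq - fc) / erb_fc                    # deviation from best frequency in ERB
        gain_exp = np.clip(1.0 - d_if / exponent_span, 0.0, 1.0)  # 1 on-frequency, 0 at 2 ERB off
        level_db = 20.0 * np.log10(np.maximum(envelope, 1e-12))
        gain_db = gain_exp * dbgain(level_db)                     # gain(dB) = gain_exp(dIF) * dbgain(L)
        return x * 10.0 ** (gain_db / 20.0)

    def dbgain(level_db, max_gain=40.0, slope=0.4):
        # placeholder compressive on-frequency characteristic (illustrative only)
        return np.maximum(0.0, max_gain - slope * np.maximum(level_db, 0.0))

In a complete model this stage would sit between F2 and the summation with the F1 output shown in Fig. 2.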
3 Results
3.1 Two-tone Suppression Data
Figure 3a shows psycho-acoustical two-tone suppression data obtained by Duifhuis (1980) where the suppressor level (L2 at 400 Hz) is given on the abscissa while the pulsation threshold of a tone (1 kHz) that achieves the same “internal level” as the suppressed tone (1 kHz) is given on the ordinate. Parameter of the curve is the level L1 of the suppressee. With increasing suppressor level, the “effective” suppressee level (given here by the level of the equivalent pulsation threshold) drops at a very high rate (approximately −2.5 dB per dB suppressor level) as soon as the suppressor level exceeds a suppression threshold. In addition, the suppression threshold increases at a slope of approximately 4 dB in suppressee level per dB suppressor level. This high slope can be taken as an indicator of the effective compression of the suppressee in the control channel (such as the middle column in Fig. 2), assuming the suppressor is processed linearly. With even higher suppressor
Fig. 3 Upper panel (a): psychoacoustical two-tone suppression data from Duifhuis (1980). Lower panel (b): model simulations, showing the relative excitation level in dB as a function of suppressor level in dB
levels, the “effective” on-frequency excitation determining the pulsation threshold is dominated by the suppressor which explains the curve increase at the right side of the graph. Figure 3b shows the corresponding model output derived for the same stimuli and approximate levels as given for the data plot (20 dB suppressee level is missing). The plot shows the output of the on-frequency channel at 1000 Hz for suppressee levels of 30, 40, 50 and 60 dB, respectively, as a function of suppressor level and referenced to an on-frequency signal. To generate the model data, the suppressor level increased linearly from 20 dB to 100 dB within 2 s and the instantaneous output level inversely transformed across the on-frequency compressive I/O-characteristics is plotted. This inverse transformation is necessary, because the pulsation threshold is
measured with a reference on-frequency test tone that is transformed compressively. Obviously, the general pattern is consistent with the data given in Fig. 3a, while the fine structure of the total output level is generated by interference between suppressor and suppressee. The strong dips in the curves between 75 and 85 dB suppressor level, respectively, are due to interference in the signal path, i.e., where suppressor and suppressee level are approximately the same at the output. The modulation close to suppression threshold is due to interference between suppressor and suppressee in the control channel, which leads to a modulation of instantaneous frequency and subsequently of the overall gain.
3.2 Upward Spread of Masking
Figure 4a shows spectral masking data from Oxenham and Plack (1998) obtained with a narrowband noise as masker at different masker levels (given at the abscissa) and a short sinusoidal tone as signal to be detected either in a simultaneous masking condition (filled symbols) or a non-simultaneous masking condition (i.e., forward masking, open symbols). Squares denote on-frequency masking (masker and test tone centered at 4 kHz), where simultaneous and non-simultaneous masking data coincide quite well, whereas circles denote the off-frequency masking conditions (masker centered at 2.4 kHz, test tone at 4 kHz) where the simultaneous masking condition shows a much higher slope in masked threshold as a function of masker level than the non-simultaneous condition. The difference between these curves represents the suppression of the 4-kHz tone in the simultaneous condition (i.e., the level has to be increased considerably in order to achieve detection), whereas no suppression is exerted from the 2.4-kHz masker in the non-simultaneous condition. Figure 4b shows the respective prediction from the model described above: To predict detection data, the output of the 4-kHz channel was monitored and the detection threshold was assumed as soon as the output level for masker plus test tone exceeded the output level for the masker alone by 1 dB. For predicting the threshold in quiet at the left-end side of the plot, an appropriate threshold criterion was assumed. For comparison, the estimation for the non-simultaneous off-frequency condition is given (lower solid line), which was achieved by finding those test tone levels in the test tone-only condition that yield the same output level as the masker-alone condition in the 4-kHz channel. In addition, the on-frequency condition for simultaneous and non-simultaneous masking is plotted, which was derived in the same way and which yields a 1:1 characteristic (upper solid curve). Obviously, the model predicts the average subjects' data from Oxenham and Plack (1998) quite accurately even though the model parameters were not fit to this particular experimental condition. This underlines that the implementation of the suppression mechanism proposed here seems to be an adequate model of suppression effects in humans.
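The 1-dB detection criterion described above can be turned into a threshold search by a simple bisection on the test-tone level; the sketch below assumes a stand-in function channel_output_level(x, fs) that returns the output level (in dB) of the on-frequency channel of the full model, which is not reimplemented here.

    def predicted_threshold(masker, tone_unit, channel_output_level, fs,
                            lo_db=0.0, hi_db=100.0, criterion_db=1.0, tol_db=0.1):
        # tone_unit: test-tone waveform scaled to the 0-dB reference level.
        ref = channel_output_level(masker, fs)                  # masker-alone output level
        while hi_db - lo_db > tol_db:
            mid = 0.5 * (lo_db + hi_db)
            probe = masker + tone_unit * 10.0 ** (mid / 20.0)
            if channel_output_level(probe, fs) - ref >= criterion_db:
                hi_db = mid                                     # detectable: threshold is lower
            else:
                lo_db = mid
        return 0.5 * (lo_db + hi_db)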
Fig. 4 Upper panel (a): psychoacoustical upward spread of masking data from Oxenham and Plack (1998), plotted as signal level (dB SPL) against masker level (dB SPL). Lower panel (b): model simulations (test tone level at threshold in dB against masker level)
4 Discussion
The model introduced here is just a prototype implementation of the newly developed idea of an instantaneous frequency detector controlling the gain and compression of specific filterbank channels or positions on the basilar membrane, respectively. In order to model exactly a variety of psychophysical and physiological data with the correct amount of suppression, frequency-specific gain and compression, a comparatively complicated model structure had to be assumed (cf. Fig. 2) where a feedback structure was avoided in order to maintain computational stability. It is not yet clear, however, if this structure or an alternative structure using basically the same elements is most appropriate for predicting also physiological experiments in the same way. Nevertheless, the basic idea of the instantaneous frequency extraction and its control of filterbank channels can be generalized and simplified for applications outside
“pure” modeling in psychoacoustics and physiology. In speech processing, for example, an instantaneous-frequency-controlled filterbank shows a distinct enhancement of the most “relevant” frequency components of speech and may hence be used for efficient dynamic compression of speech and other wideband signals for hearing-impaired listeners. A key feature of the instantaneous frequency approach is the possibility to extract the instantaneous frequency within a few periods of the channel's center frequency using the model assumption of an AM/FM signal. This may be useful for decomposing complex signals into “important” spectral components and “unimportant” ones and may thus help to derive better speech and audio coding strategies based on auditory models.
Acknowledgments. Work supported by BMBF and DFG (Sonderforschungsbereich “Das aktive Gehör”).
References
Duifhuis H (1976) Cochlear nonlinearity and second filter: possible mechanism and implications. J Acoust Soc Am 59:408–423
Duifhuis H (1980) Level effects in psychophysical two-tone suppression. J Acoust Soc Am 67:914–927
Duifhuis H, Hoogstraten HW, van Netten SM, Diependaal RJ, Bialek W (1985) Modelling the cochlear partition with coupled Van der Pol oscillators. In: Allen JB, Hall JL, Hubbard AE, Neely ST, Tubis A (eds) Peripheral auditory mechanisms. Springer, Berlin Heidelberg New York, pp 290–297
Goldstein JL (1988) Updating cochlear driven models of auditory perception: a new model for nonlinear auditory frequency analysing filters. In: Elsendoorn BAG, Bouma H (eds) Working models of human perception. Academic Press, London, pp 19–58
Herzke T, Hohmann V (2005) Effects of instantaneous multi-band dynamic compression on speech intelligibility. EURASIP JASP 2005(18):3034–3043
Meddis R, O'Mard LP, Lopez-Poveda EA (2001) A computational algorithm for computing nonlinear auditory frequency selectivity. J Acoust Soc Am 109:2852–2861
Oxenham AJ, Plack CJ (1998) Suppression and the upward spread of masking. J Acoust Soc Am 104:3500–3510
Talmadge C, Tubis A, Long GR, Piskorski P (1998) Modeling otoacoustic emission and hearing threshold fine structures in humans. J Acoust Soc Am 104:1517–1543
Wersäll J, Flock A, Lundquist P-G (1965) Structural basis for directional sensitivity in cochlear and vestibular sensory receptors. Cold Spring Harbor Symp Quant Biol 30:115–132
3 Estimates of Tuning of Auditory Filter Using Simultaneous and Forward Notched-noise Masking
MASASHI UNOKI, RYOTA MIYAUCHI, AND CHIN-TUAN TAN
1 Introduction
The frequency selectivity of an auditory filter system is often conceptualized as a bank of bandpass auditory filters. Over the past 30 years, many simultaneous masking experiments using notched-noise maskers have been done to define the shape of the auditory filters (e.g., Glasberg and Moore 1990; Patterson and Nimmo-Smith 1980; Rosen and Baker 1994). The studies of Glasberg and Moore (2000) and Baker and Rosen (2006) are notable inasmuch as they measured the human auditory filter shape over most of the range of frequencies and levels encountered in everyday hearing. The advantage of using notched-noise masking is that one can avoid off-frequency listening and investigate filter asymmetry. However, the derived filter shapes are also affected by the effects of suppression. The tunings of auditory filters derived from data collected in forward masking experiments were apparently sharper than those derived from simultaneous masking experiments, especially when the signal levels are low. The tuning of a filter is commonly believed to be affected by cochlear nonlinearity such as the effect of suppression. In past studies, the tunings of auditory filters derived from simultaneous masking data were wider than those of filters derived from nonsimultaneous (forward) masking data (Moore and Glasberg 1978; Glasberg and Moore 1982; Oxenham and Shera 2003). Heinz et al. (2002) showed that tuning is generally sharpest when stimuli are at low levels and that suppression may affect tuning estimates more at high characteristic frequencies (CFs) than at low CFs. If the suggestion of Heinz et al. (2002) holds, i.e., if the effect of suppression changes with frequency, comparing the filter bandwidths derived from simultaneous and forward masking experiments would indicate this. In this study we attempt to estimate filter tunings using both simultaneous and forward masking experiments with a notched-noise masker to investigate how the effects of suppression affect estimates of frequency selectivity across signal frequencies, signal levels, notch conditions (symmetric and asymmetric), and signal delays. This study extends the study of Unoki and Tan (2005).
School of Information Science, Japan Advanced Institute of Science and Technology, Japan,
[email protected],
[email protected],
[email protected]
2 Simultaneous and Forward Masking with Notched-Noise Masker

2.1 Methods
A diagram of the stimulus used in our masking experiments is shown in Fig. 1. The signal frequencies (fc) were 0.5, 1.0, 2.0, and 4.0 kHz. The notched-noise masker consisted of two bands of white noise, each with a bandwidth fixed at 0.4 × fc. Under five conditions, the notch was placed symmetrically about fc; the values of ∆fc/fc under these conditions were 0.0, 0.1, 0.2, 0.3, and 0.4 (Fig. 1a). Under four asymmetric conditions, the combinations of the lower and upper ∆fc/fc were (0.3, 0.1), (0.4, 0.2), (0.1, 0.3), and (0.2, 0.4), as shown in Fig. 1b,c. In the masking experiments we used three time conditions: the onset interval between the notched-noise masker and the probe was 150, 300, or 305 ms, labeled A, B, and C in Fig. 1. Time condition A corresponded to simultaneous masking, while B and C corresponded to forward masking. At a fixed probe level, Ps (10, 20, and 30 dB SL), the masker levels, N0, at the masked thresholds were measured for a brief 10-ms signal (5-ms raised-cosine ramps, no steady state) in the presence of a 300-ms masker gated with 15-ms raised-cosine ramps. Fifteen normal-hearing listeners, aged 21–33, participated in the experiments. Six, seven, and six of them participated in the experiments with time conditions A, B, and C, respectively; four participated under two conditions. The absolute thresholds of all subjects, measured with a standard audiometric tone test using a RION AA-72B audiometer, were 15 dB HL or less for both ears at octave frequencies between 0.125 and 8.0 kHz. All subjects were given at least 2 h of practice.
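For illustration, the stimulus geometry described above can be sketched in a few lines of Python/NumPy. The sampling rate, the FFT-based band-noise synthesis, and the function names below are assumptions made for this sketch; they are not the TDT-based implementation used in the experiments.

```python
import numpy as np

def ramp(n, nr):
    """Raised-cosine onset/offset ramps of nr samples each."""
    w = np.ones(n)
    r = 0.5 * (1 - np.cos(np.pi * np.arange(nr) / nr))
    w[:nr], w[-nr:] = r, r[::-1]
    return w

def band_noise(fs, dur, f_lo, f_hi, rng):
    """White noise restricted to [f_lo, f_hi] by zeroing rFFT bins outside the band."""
    n = int(dur * fs)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, 1 / fs)
    spec[(f < f_lo) | (f > f_hi)] = 0.0
    return np.fft.irfft(spec, n)

def notched_noise_and_probe(fc, d_lo, d_hi, fs=48000, rng=None):
    """300-ms notched-noise masker (two 0.4*fc-wide bands, 15-ms ramps) and a
    10-ms probe at fc (5-ms raised-cosine ramps, no steady state).
    d_lo and d_hi are the lower and upper values of delta_fc/fc."""
    rng = rng or np.random.default_rng()
    masker = band_noise(fs, 0.3, fc * (1 - 0.4 - d_lo), fc * (1 - d_lo), rng) \
           + band_noise(fs, 0.3, fc * (1 + d_hi), fc * (1 + d_hi + 0.4), rng)
    masker *= ramp(len(masker), int(0.015 * fs))
    t = np.arange(int(0.010 * fs)) / fs
    probe = np.sin(2 * np.pi * fc * t) * ramp(len(t), int(0.005 * fs))
    return masker, probe   # mixed at the levels and onset asynchrony required
```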
Fig. 1 Stimulus shape and position used in the notched-noise masking experiments: (a) symmetric and (b, c) asymmetric notch conditions (frequency axis), and the three time conditions A, B, and C (time axis)
All stimuli were generated digitally at a sampling frequency of 48 kHz and presented via a Tucker-Davis Technologies (TDT) System III real-time processor (TDT RP2). The masker and signal were separately attenuated by two programmable attenuators (TDT PA5) before they were mixed (TDT SM5) and passed through a headphone buffer (TDT HB7) for presentation. The stimuli were presented monaurally to the subjects in a double-walled sound-attenuating booth via an Etymotic Research ER2 insert earphone. The levels of the stimuli were verified using a B&K 4152 Artificial Ear Simulator with a 2-cm³ coupler (B&K DB 0138) and a B&K 2231 Modular Precision Sound Level Meter. Masked thresholds were measured using a three-alternative forced-choice (3AFC) three-down one-up procedure that tracks the 79.4% point on the psychometric function (Levitt 1970). Three stimulus intervals were presented sequentially in each trial, with a 500-ms inter-stimulus interval. Subjects were required to identify the interval that carried the probe signal using numbered push-buttons on a response box. Feedback was provided after each trial by lighting the LED corresponding to the correct interval on the response box. A run was terminated after twelve reversals. The step size was 5 dB for the first four reversals and 2 dB thereafter. The threshold was defined as the mean masker level at the last eight reversals. All data in which the masker level at threshold was over 90 dB SPL were eliminated because they were affected by the compression effect.
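The adaptive procedure described above can be sketched as follows; this is a minimal illustration of a three-down one-up track with the masker level as the adaptive variable, and the simulated observer and bookkeeping details are assumptions rather than the authors' implementation.

```python
import random

def adaptive_masker_track(run_trial, start_level, n_reversals=12):
    """Three-down one-up track on the masker level: after three consecutive
    correct responses the masker level is raised (task made harder), after one
    incorrect response it is lowered; converges near 79.4% correct (Levitt 1970).
    run_trial(level) runs one 3AFC trial and returns True if correct.
    Threshold = mean masker level at the last eight reversals."""
    level, step = start_level, 5.0            # 5 dB, then 2 dB after 4 reversals
    correct_run, direction, reversals = 0, 0, []
    while len(reversals) < n_reversals:
        if run_trial(level):
            correct_run += 1
            if correct_run < 3:
                continue
            correct_run, move = 0, +1
        else:
            correct_run, move = 0, -1
        if direction and move != direction:    # a reversal of track direction
            reversals.append(level)
            if len(reversals) == 4:
                step = 2.0
        direction = move
        level += move * step
    return sum(reversals[-8:]) / 8.0

# Simulated observer: performance falls towards chance (1/3) as the masker rises
def fake_trial(level, true_threshold=60.0):
    p = 1 / 3 + (2 / 3) / (1 + 10 ** ((level - true_threshold) / 5))
    return random.random() < p

print(adaptive_masker_track(fake_trial, start_level=40.0))
```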
2.2 Results and Discussion
The mean masked thresholds for signal frequencies of 1.0 and 2.0 kHz in the three time conditions are plotted in Fig. 2 for each signal level. Those for 0.5 and 4.0 kHz are omitted here, but the trends in all the data were similar. The abscissae of the plots in this figure show the smaller of the two values of ∆fc/fc. The circles denote the mean masked thresholds under the symmetric notched-noise conditions (Fig. 1a). The triangles pointing to the right denote the mean masked thresholds under the asymmetric notched-noise conditions in which ∆fc/fc for the upper noise band was 0.2 greater than that for the lower noise band (Fig. 1b), and the triangles pointing to the left denote those under the asymmetric conditions in which ∆fc/fc for the lower noise band was 0.2 greater than that for the upper noise band (Fig. 1c). We found that the masked threshold increased as the notch width was increased. We also found that the thresholds under the Fig. 1c conditions were consistently higher than those under the Fig. 1b conditions, suggesting that the auditory filter shapes were asymmetric, with a steeper high-frequency slope. The slopes of the growth-of-masking functions (the change in masker level at threshold per dB change in signal level, from 10 to 30 dB SL) for 1.0 and 2.0 kHz under the three time conditions (A, B, and C) are shown in Fig. 3.
Fig. 2 Mean masked thresholds (masker level at threshold, dB SPL) as a function of the relative notch width ∆fc/fc in the masking experiments with the three time conditions A, B, and C, for a–c 1 kHz and d–f 2 kHz. Signal levels were 10, 20, and 30 dB SL
Fig. 3 Mean slope of growth of masking function under three conditions: simultaneous (A) and forward masking (B and C) for a 1.0 kHz and b 2.0 kHz
The thick, medium, and thin solid lines show the slopes under the symmetric notch conditions with ∆fc/fc of 0.0, 0.1, and 0.2. The dotted and dashed lines show the slopes under the two asymmetric conditions (Fig. 1b,c) whose smaller ∆fc/fc was 0.1. The slope under one of the asymmetric-notch conditions is consistently greater than under the other, and the slope in C is greater than those in A and B. These results suggest that filter nonlinearity such as compression became more evident as the signal was delayed across the three time conditions (A, B, and C), and that the decaying lower notched-noise components might still contribute suppressive masking in condition C.
3 Estimation of the Filter Tuning
The most common method for estimating auditory filter shape is the roex filter model based on the power spectrum model of masking; its current form was proposed by Glasberg and Moore (2000). This model accounts precisely for simultaneous masking and may be used to estimate the filter shape from forward masking as a pilot test. However, it does not suitably account for forward masking with a complex or noise masker, because it cannot deal separately with excitatory and suppressive masking (Wojtczak and Viemeister 2004). As an alternative, we used the parallel roex filter (Unoki et al. 2006) to estimate the filter shape and tuning under the three time conditions. Because this model consists of a passive tail roex and an active tip roex with the schematic I/O function used by Glasberg and Moore (2000), it can deal with the above problem. The internal level, Pprx, is determined as the output of the passive tail filter (t), and the active tip filter (p) then varies with this level. The parallel roex filter is characterized by five parameters (tl, tu, pl, pu, and Gmax). Two further non-filter parameters (efficiency, K, and absolute threshold, P0) are used in the power spectrum model. These parameters are represented as functions of the normalized ERBN-rate, Ef = ERBN-rate(f)/ERBN-rate(1 kHz) − 1, and were determined by applying the refined fitting procedure of Unoki et al. (2006) to the masking data from the three time conditions. The fitting procedure also included the effect of transmission through the outer and middle ear (the MidEar correction; Glasberg and Moore 2000) and the effect of off-frequency listening (Patterson and Nimmo-Smith 1980). As a revision to this procedure, we incorporated a decay function (a leaky integrator) into the level estimator, because the masker level (N0) reaching the signal position decays considerably in forward masking (B and C in Fig. 1), whereas it is constant at the signal position in simultaneous masking (A in Fig. 1). The reductions of the masker level under time conditions B and C were 16.3 dB and 42.0 dB according to the decay function; these values were then used in the power spectrum model. The parameters were optimized by minimizing the root-mean-square (rms) error between the masked thresholds and the predicted thresholds. The optimized values of the five parameters of the parallel roex auditory filters and the rms errors, fitted to the masking data collected under the three time conditions, are shown in Table 1. The thresholds predicted using the parallel roex filter are plotted in Fig. 2 (solid lines for the symmetric conditions, dashed lines for the asymmetric conditions of Fig. 1b, and dotted lines for those of Fig. 1c).
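For readers unfamiliar with the power-spectrum model underlying this fit, the following minimal Python/NumPy sketch predicts the masker spectrum level at threshold for a notched noise from a filter weighting function. It deliberately uses a single two-sided roex(p) weighting with a fixed efficiency K instead of the parallel roex filter, and omits the MidEar correction, off-frequency listening, and the leaky-integrator decay; all parameter values are illustrative.

```python
import numpy as np

def roex_weight(f, fc, pl, pu):
    """Two-sided roex(p) power weighting: (1 + p*g) * exp(-p*g), g = |f - fc| / fc."""
    g = np.abs(f - fc) / fc
    p = np.where(f < fc, pl, pu)
    return (1 + p * g) * np.exp(-p * g)

def predicted_masker_level(fc, d_lo, d_hi, pl, pu, K, Ps_db, bw=0.4):
    """Masker spectrum level (dB) at threshold from the power spectrum model:
    the probe is detected when its power equals K times the masker power
    passing through the filter."""
    f = np.linspace(1.0, 3 * fc, 30000)
    df = f[1] - f[0]
    in_masker = ((f >= fc * (1 - bw - d_lo)) & (f <= fc * (1 - d_lo))) | \
                ((f >= fc * (1 + d_hi)) & (f <= fc * (1 + d_hi + bw)))
    passed = np.sum(roex_weight(f, fc, pl, pu)[in_masker]) * df   # per unit N0
    n0 = 10 ** (Ps_db / 10) / (K * passed)                        # linear power
    return 10 * np.log10(n0)

# Illustrative use: predicted threshold rises as a symmetric notch widens at 1 kHz
for d in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(d, round(predicted_masker_level(1000.0, d, d, pl=34.0, pu=26.0,
                                          K=1.0, Ps_db=30.0), 1))
```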
Table 1 Filter coefficients of parameters and rms errors in fit

Condition           tl    tu    Gmax             pl                pu                rms (dB)
A. Simultaneous     10.8  81.6  29.1 − 0.988Ef   33.8 + 0.006Pprx  25.8              2.38
B. Forward (0 ms)   11.8  82.7  24.1 − 7.11Ef    26.2 + 0.149Pprx  48.8 − 0.140Pprx  2.80
C. Forward (5 ms)   9.05  132   19.8 + 0.121Ef   34.8 − 0.050Pprx  72.8 − 0.227Pprx  2.27
Fig. 4 Auditory filter shapes with center frequencies between 0.5 and 4.0 kHz, derived from mean threshold data in three masking experiments (A, B, and C)
The shapes of the derived filters centered at the signal frequencies of 0.5, 1.0, 2.0, and 4.0 kHz are plotted in Fig. 4 as a function of the signal level (10 and 30 dB SL). The parallel roex filters fit both the simultaneous and the forward masking data very well. Under the three time conditions, we found that the skirts of the filters on the higher-frequency side for B and C are somewhat steeper than those for A, whereas the tail slopes on the lower-frequency side for B and C are somewhat shallower than those for A; remaining lower notched-noise components may account for this.
Table 2 Means of the filter bandwidths (Hz) of the parallel roex filter at lower levels

Condition \ fc (Hz)               500   1000   2000   4000
ERBN (Glasberg and Moore 1990)     79    133    241    456
A. Simultaneous masking            81    136    248    471
B. Forward masking (no silence)    66    112    204    392
C. Forward masking (5 ms delay)    61     97    175    330
The mean equivalent rectangular bandwidths (ERBs) of the derived auditory filter shapes shown in Fig. 4 under the three conditions are given in Table 2. The results show that the tuning of the filters derived from forward masking (B and C) is somewhat sharper than that derived from simultaneous masking (A). The ratios of ERBN (Glasberg and Moore 1990) to the ERBs in C for 0.5–4.0 kHz are 1.30, 1.37, 1.38, and 1.38. The tuning of the filters derived from forward masking became sharper as the signal frequency was increased and/or the signal was delayed (A, B, and C). In addition, when the signal level was increased in dB SL, the ERBs estimated from the forward masking data remained sharper.
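The ERBs reported in Table 2 can be computed from any derived filter shape by numerical integration of its weighting function; a small sketch (with an illustrative symmetric roex(p) filter, not the fitted parallel roex parameters) is given below.

```python
import numpy as np

def erb_of_filter(freqs, weights):
    """Equivalent rectangular bandwidth of a filter magnitude response:
    the width of a rectangle with the same peak and the same area."""
    return np.trapz(weights, freqs) / np.max(weights)

# Illustrative only: ERB of a symmetric roex(p) filter at 1 kHz
fc, p = 1000.0, 30.0
f = np.linspace(1.0, 4000.0, 40000)
g = np.abs(f - fc) / fc
w = (1 + p * g) * np.exp(-p * g)
print(erb_of_filter(f, w))   # analytically 4 * fc / p, about 133 Hz
```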
4 Conclusions
We estimated filter tuning using both simultaneous and forward masking with a notched-noise masker as a function of signal frequency (0.5, 1.0, 2.0, and 4.0 kHz), signal level (10, 20, and 30 dB SL), notched-noise condition (five symmetric and four asymmetric), and time condition (A, B, and C in Fig. 1). Auditory filter shapes were derived under these conditions using the parallel roex filter. The results suggest that the tunings of the auditory filters derived from the forward masking data were considerably sharper than those derived from the simultaneous masking data. The tunings of the auditory filters became sharper as the center frequency was increased (ratios of 1.30 to 1.38). However, the difference between the tunings derived from the two masking methods tended to be smaller at lower center frequencies than at higher center frequencies. This may reflect remaining effects of suppression due to the decayed lower notched-noise components below the signal frequency. Acknowledgments. This work was supported by special coordination funds for promoting science and technology (supporting young researchers with fixed-term appointments).
References Baker RJ, Rosen S (2006) Auditory filter nonlinearity across frequency using simultaneous notched-noise masking. J Acoust Soc Am 119:454–462 Glasberg BR, Moore BCJ (1982) Auditory filter shapes in forward masking as a function of level. J Acoust Soc Am 71:946–949 Glasberg BR, Moore BCJ (1990) Derivation of auditory filter shapes from notched-noise data. Hear Res 47:103–138 Glasberg BR, Moore BCJ (2000) Frequency selectivity as a function of level and frequency measured with uniformly exciting noise. J Acoust Soc Am 108:2318–2328 Heinz MG, Colburn HS, Carney LH (2002) Quantifying the implications of nonlinear cochlear tuning for auditory filter estimates. J Acoust Soc Am 111:996–1101 Levitt H (1970) Transformed up-down methods in psychoacoustics. J Acoust Soc Am 49:467–477 Moore BCJ, Glasberg BR (1978) Psychophysical tuning curves measured in simultaneous and forward masking. J Acoust Soc Am 63:524–532 Oxenham AJ, Shera CA (2003) Estimates of human cochlear tuning at low levels using forward and simultaneous masking. J Assoc Res Otolaryngol 4:541–554 Patterson RD, Nimmo-Smith I (1980) Off-frequency listening and auditory filter asymmetry. J Acoust Soc Am 67:229–245 Rosen S, Baker RJ (1994) Characterising auditory filter nonlinearity. Hear Res 73:231–243 Unoki M, Tan C-T (2005) Estimates of auditory filter shape using simultaneous and forward notched-noise masking. Forum Acust Budapest 1497–1502 Unoki M, Irino T, Glasberg BR, Moore BJC, Patterson RD (2006) Comparison of the roex and gammachirp filters as representations of the auditory filter. J Acoust Soc Am 120:1474–1492 Wojtczak M, Viemeister NF (2004) Mechanisms of forward masking. J Acoust Soc Am 115:2599(A)
4 A Model of Ventral Cochlear Nucleus Units Based on First Order Intervals STEFAN BLEECK1 AND IAN WINTER2
1 Introduction
In the presence of a constant stimulus the arrival time of a spike has little meaning except in the context of the time of the preceding spike. Electrophysiological studies reveal a surprising degree of variability in neurons' firing behaviour, which can be described by the probability distribution of the intervals between consecutive spikes, the inter-spike intervals (ISIs). A histogram (ISIH) can be constructed to represent the ISI distribution (ISID), and a probability density function (PDF) of the intervals can be modelled by fitting it to the histogram. The continuous nature of the PDF is attractive since it avoids an arbitrary resolution of spike quantisation. There is no generally accepted PDF to model ISIHs. Interval distributions that have been fitted, with more or less success, to ISIHs obtained with constant stimuli include the normal, log-normal, exponential, gamma, Weibull, bimodal and multimodal distributions. With the exception of the normal distribution, these functions are asymmetric (due to their positive skewness), reflecting the tendency of slower firing neurons to show greater variability. These functions are all empirical fits and cannot be fully justified physiologically – exponential PDFs (which result from a Poisson process) can fit only parts of the curve, and gamma functions (which result from a coincidence detector) ignore the decay times of EPSPs – and consequently there are many other possible candidates for empirical fitting functions. Neuronal processing, and all models based on physiological and biophysical grounds, includes many noise sources that jitter the occurrence of spikes. These noise sources include the number and size of input synapses, the precise arrival times of input spikes, and the location of synapses on dendrites and the thickness of the dendrites, which lead to differences in the amplitude, width and arrival time of EPSPs and IPSPs at the axon hillock. It is very difficult, if not impossible, to measure the influence of all these noise sources experimentally; therefore we have to rely on assumptions and simulations to investigate them in detail.
1 Institute of Sound and Vibration Research, University of Southampton, UK, [email protected]
2 Department of Physiology, Development and Neuroscience, University of Cambridge, UK, [email protected]
The idea that is developed in this chapter provides a possible solution to this problem. We demonstrate here that the huge number of noise sources can actually reduce the complexity of the system behaviour. The “central limit theorem” states that the sum of many independent identically distributed random variables will be distributed normally, independent of the original distribution. The corresponding theorem for our argument was formulated by Fisher and Tippett (1928) as the “central limit theorem of extreme value distributions”: the extreme values of many independent identically distributed random variables will take the form of an “extreme value distribution” (EVD), independent of the original distribution. The EVD is a valid model under the following conditions: the membrane potential is affected in a complicated manner by many random processes; the event in which the membrane potential exceeds the threshold can be regarded as “extreme” when the threshold is high compared to the noisy resting potential, as can be shown by intracellular recordings (Calvin and Stevens 1967); and the resting potential and the underlying probability density function for spike generation must be stationary. To apply the extreme value theorem we do not have to know or assume any (physiologically motivated) model of the underlying spike-generating process. The resulting spike-generating process then takes on the form of a Markov process in which the probability of the next spike explicitly depends on its immediate predecessor but is conditionally independent of all other preceding spikes (Tuckwell 1988). The shape of the PDF of such a process is determined by the EVD.
2 Methods
All physiological recordings were obtained from the cochlear nucleus of the anaesthetized guinea pig (Cavia porcellus). The procedures are described in detail in Bleeck et al. (2006). The normalised cumulative first-order interval distribution of half of all intervals between stimulus onset and offset was fitted by a maximum-likelihood method with the generalised extreme value distribution:

F(x | a, b, ζ) = exp{ −[1 + ζ((log(x) − a)/b)]^(−1/ζ) }
where x is the time since the last spike (the ISI), a its location, b the scale and ζ the shape of the distribution. The shape parameter ζ governs the tail behaviour of the distribution, and the sub-families defined by ζ > 0, ζ → 0 and ζ < 0 correspond, respectively, to the Fréchet, Gumbel and Weibull families, whose cumulative distributions are shown in Fig. 1.
Fig. 1 The three families of extreme value distributions that constitute the generalized extreme value distribution. Note the different tail behaviour
The quality of the fitting procedure was tested by performing a Kolmogorov–Smirnov test at the 5% significance level. The other half of the intervals (those not used for the fit) served as the comparison distribution for the resulting fitting function.
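A minimal sketch of this fitting step, assuming Python with SciPy: half of the log-intervals are fitted with a GEV by maximum likelihood and the held-out half is used for the Kolmogorov–Smirnov check. The random split and the sign convention (SciPy's shape parameter c corresponds to −ζ) are the only details added beyond the text.

```python
import numpy as np
from scipy import stats

def fit_log_gev(isis_ms, rng=None):
    """Fit a GEV to log inter-spike intervals (maximum likelihood) using half
    of the intervals, and Kolmogorov-Smirnov-test the fit on the other half."""
    rng = rng or np.random.default_rng(0)
    x = np.log(np.asarray(isis_ms, dtype=float))
    rng.shuffle(x)
    half = len(x) // 2
    fit_part, test_part = x[:half], x[half:]
    c, loc, scale = stats.genextreme.fit(fit_part)   # SciPy shape c = -zeta
    ks = stats.kstest(test_part, stats.genextreme(c, loc=loc, scale=scale).cdf)
    return {"zeta": -c, "a": loc, "b": scale, "ks_pvalue": ks.pvalue}
```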
3 Results
In total, 1867 ISIHs from 877 units were investigated. A subset of these (323 ISIHs from 184 units) was also classified by one of the authors (IMW) using conventional criteria such as PSTH shape and regularity analysis (Young et al. 1988). The eight classification categories were: sustained chopper (CS), transient chopper (CT), wide chopper (CW), primary-like (PL), primary-like with notch (PN), onset (ON), low frequency (LF) and unusual (UN). Figure 2 shows data for six example units. The panels for each unit show a) the post-stimulus time histogram (PSTH), b) the first-order interval distribution (FOID), where the black line represents the GEV fit to the data, c) the normalised cumulative FOID and the fit, d) the hazard function P_spike = F′/(1−F), which is the probability that a spike occurs after that interval, together with the hazard function calculated from the fitting parameters, and e) a reconstruction of the PSTH using the hazard function. Neuronal responses to pure tones were simulated for 250 sweeps. At every point in time the probability of a spike is calculated as a Markov process given the fitted location, scale and shape parameters (a, b, ζ) and the time of the last spike.
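The hazard function and the Markov-style reconstruction described above can be sketched as follows (Python/SciPy). The 0.1-ms time step and the renewal-type simulation loop are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def hazard(t_ms, zeta, a, b):
    """Hazard P_spike = F'/(1 - F) for a log-domain GEV interval distribution,
    evaluated at time t_ms since the last spike."""
    gev = stats.genextreme(-zeta, loc=a, scale=b)    # SciPy shape c = -zeta
    x = np.log(t_ms)
    F = gev.cdf(x)
    f = gev.pdf(x) / t_ms                            # density of t when log(t) ~ GEV
    return f / (1.0 - F)

def simulate_spikes(zeta, a, b, duration_ms=50.0, dt=0.1, rng=None):
    """Markov (renewal) simulation: the spike probability in each time bin
    depends only on the time elapsed since the previous spike."""
    rng = rng or np.random.default_rng()
    spikes, last, t = [], -dt, 0.0
    while t < duration_ms:
        p = hazard(max(t - last, dt), zeta, a, b) * dt
        if rng.random() < min(p, 1.0):
            spikes.append(t)
            last = t
        t += dt
    return spikes
```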
Fig. 2 PSTH (time in ms), FOID, recovery function, hazard function (time interval in ms) and reconstruction (time in ms) for six example neurons of different types. The “Chop S” unit (characterized by very regular firing) has a relatively sharp FOID; accordingly its recovery function is steep. The “Chop T” unit (less regular firing pattern than Chop S) has a less steep recovery function, and its hazard function is complex. “Primary-like” (similar to an auditory nerve fibre) and “Primary-like with notch” (characterized by a distinct notch after the initial peak) units show slower recovery and almost linear hazard functions; their FOIDs are very broad. The “Onset” unit (with a low sustained rate) has a delayed recovery function and a hazard function that rises slowly at the beginning and then very steeply. “Unusual” units have characteristics that prevent them from being classified as one of the other types; the example shown here has recovery properties that place it between Chopper and Onset
3.1 Population Overview
Figure 3 shows the distribution of mean and standard deviation for fits to 1867 ISIHs from 877 units. Each histogram is represented by a symbol. The open symbols in the foreground indicate the identity of the classified units; the grey circles in the background represent all other units, which were not classified. The classified units mainly occupy different, although overlapping, areas. An “automatic classification” algorithm assigns each unit to the response type whose mean parameter values are closest. The percentages of units classified identically by the model and by the experimenter were: 100% CS, 93% CT, 100% CW, 73% PL, 60% PN and 71% ON. Most “unusual” units lie in areas not occupied by the major classes. The Kolmogorov–Smirnov test reveals that the logarithmic GEV fit was successful in 43% of all cases. This is significantly (p < 0.01) better than the other distributions tested (GEV linear, gamma logarithmic and linear, Weibull logarithmic and linear, exponential, normal logarithmic and linear). Owing to adaptation during the stimulus, the exact shape of the fitting function changes over its duration; if only intervals during part of the stimulus are considered (e.g. the first 10 ms), the number of successfully fitted units rises to 78%.
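The automatic classification can be read as a nearest-centroid rule in the space of fitted parameters; a minimal sketch with hypothetical centroid values (not taken from this chapter) is shown below.

```python
import numpy as np

def nearest_type(unit_params, class_means):
    """Assign a unit to the response type whose mean fitted parameters are
    closest in the (e.g. mean, standard deviation) parameter space."""
    names = list(class_means)
    centres = np.array([class_means[n] for n in names])
    d = np.linalg.norm(centres - np.asarray(unit_params), axis=1)
    return names[int(np.argmin(d))]

# Hypothetical centroids for two unit types (values are NOT from this chapter)
means = {"CS": (1.2, 0.15), "PL": (1.8, 0.55)}
print(nearest_type((1.3, 0.2), means))   # -> CS
```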
Fig. 3 Population plot of the means and variances of all fits. Each symbol indicates the fit of the recovery function of one unit. The open symbols correspond to classified types, and the grey circles in the background represent all units. The mean and standard deviation of each population are indicated by the black lines
3.2 The Neuron's Journey Through Parameter Space
During the course of the 50-ms stimulus the neuron's behaviour, and hence the fitting parameters, change due to adaptive processes (Fairhall et al. 2001). Not only the ISIs during the stimulus but also the latency of the first spike after stimulus onset and the firing pattern without a stimulus (spontaneous activity) can be described adequately using a GEV fit. The number of neurons that can be fitted successfully in this way is comparable to the interstimulus fit. The firing probability of a neuron during a stimulus (and also as a function of stimulus parameters such as level) can therefore be traced in the parameter space spanned by the fitting parameters (data not shown).
4 Discussion
Since the coding capacity of a spike train may be reflected in both the profile and the asymmetric dispersion of the ISIH, an information approach can be used to overcome the limitations of traditional second-moment statistics. Recently, inter-spike intervals have formed the basis for quantifying the coding capacity of spike trains, and numerical methods have been developed based on logarithmic (Bhumbra and Dyball 2004) and linear (Reeke and Coop 2004) intervals. These measures suffer, however, from the absence of a generally accepted form of the distribution and must be calculated numerically for every case. It would therefore be extremely useful to have a general model to describe ISIHs, as this would allow direct access to the information content of spike trains. Distinct cell types in the VCN can be associated with almost unique temporal response patterns (Rhode et al. 1983). However, we show here a method to represent neuronal responses in the VCN within overlapping parameter areas. The continuous filling of the space in the parameter map indicates that the different neuron classes have overlapping features. We therefore argue that it might be useful to reconsider unit classification as continuous in the descriptive parameter space rather than as distinct classes.
5 Summary
The automatic algorithm allows classification of a neuron with good accuracy. Most of the errors made by the algorithm are between unit types that are also similar by visual inspection. Additionally, the automatic classifier is faster: usually 40–100 intervals are enough to obtain a stable classification (data not shown). Analysed conventionally, many units are classified as unusual because they do not fit completely into the classic classification scheme.
The GEV fitting reveals that many of these neurons are actually highly regular in terms of their firing patterns and might be well described in a continuous parameter space. The spike probability at each time can be described by the hazard function as a function of the two main parameters of the ISI distribution. The change of the ISI distribution can be visualised as a journey through parameter space. The advantage of using the extreme value distribution is that the outcome is determined by the statistical laws of large numbers, independent of the actual distributions and properties of the underlying processes. It is nevertheless still possible to infer certain neuronal characteristics from the results. For instance, the shape parameter ζ is largely constant under manipulations such as amplitude changes and adaptation (data not shown). The location a and the scale b, on the other hand, are determined by the stimulus: the louder the stimulus, the smaller a; the longer its duration, the bigger b. Acknowledgements. Supported by a grant to the second author from the BBSRC.
References Bhumbra GS, Dyball REJ (2004) Measuring spike coding in the rat supraoptic nucleus. J Physiol (London) 555(1):281–296 Bleeck S, Sayles M, Ingham NJ, Winter IM (2006) The time course of recovery from suppression and facilitation from single units in the mammalian cochlear nucleus. Hear Res 212:176–184 Calvin WH, Stevens CF (1967) Synaptic noise as a source of variability in the interval between action potentials. Science 155(764):842–844 Fairhall AL, Lewen GD, Bialek W, de Ruyter Van Steveninck RR (2001) Efficiency and ambiguity in an adaptive neural code. Nature 412:787–792 Fisher RA, Tippett LHC (1928) Limiting forms of the frequency distribution of the largest or smallest member of a sample. Proc Cambridge Philos Soc 24:180–190 Reeke GN, Coop AD (2004) Estimating the temporal interval entropy of neuronal discharge. Neural Comput 16(5):941–970 Rhode WS, Oertel D, Smith PH (1983) Physiological response properties of cells labelled intracellularly with horseradish peroxidase in cat ventral cochlear nucleus. J Comp Neurol 213(4):448–463 Tuckwell HC (1988) Introduction to theoretical neurobiology: vol 2 – Nonlinear and stochastic theories. Cambridge University Press Young ED, Robert JM, Shofner WP (1988) Regularity and latency of units in ventral cochlear nucleus – implications for unit classification and generation of response properties. J Neurophysiol 60:1–29
5 The Effect of Reverberation on the Temporal Representation of the F0 of Frequency Swept Harmonic Complexes in the Ventral Cochlear Nucleus MARK SAYLES1, BERT SCHOUTEN2, NEIL J. INGHAM1, AND IAN M. WINTER1
1 Introduction
When listening in an enclosed space, part of the sound travels to the listener directly from its source. In addition, the listener receives multiple delayed and attenuated copies of the sound as it is reflected from the room's surfaces, an effect referred to as reverberation. This series of reflections has a filtering effect, introducing distortion in both the spectral and temporal domains. Spectral transitions are smeared in time, and the introduction of slowly decaying “tails” effectively applies a low-pass filter to the temporal envelope. Since each reflection is added back to the original “direct” sound with random phase, envelope periodicity tends to be disrupted. Although reverberation is often exploited as a means of delivering the necessary sound level to audiences in auditoria, it has long been acknowledged that the resulting distortion can have a deleterious effect on the intelligibility of complex time-varying stimuli such as speech (Knudsen 1929; Santon 1976; Nábělek et al. 1989). Human psychophysical studies (Culling et al. 1994) have demonstrated that the combination of relatively mild reverberation and fundamental-frequency (F0) modulation at rates commonly found in speech disrupts listeners' ability to exploit differences in F0 to perceptually segregate two competing sound sources. Despite the obvious importance of reverberation for the intelligibility of speech, to our knowledge this has yet to be explored from a physiological perspective. As a first attempt at understanding the effects of reverberation on the representation of complex sounds in the mammalian auditory system, we have recorded the responses of single units in the ventral cochlear nucleus to the F0 of frequency-swept harmonic complexes with and without reverberation. Here we show that in many single units in the ventral cochlear nucleus (VCN) reverberation degrades the temporal representation of F0. Only units responding to resolved harmonics are able to maintain a representation of the F0 through all reverberation conditions. These results are consistent with the hypothesis that reverberation disrupts the phase relationship between unresolved harmonics sufficiently to abolish temporal responses to envelope periodicity. It is suggested that, under reverberant conditions, temporal information is limited to frequency channels containing resolved harmonics.
1 Centre for the Neural Basis of Hearing, The Physiological Laboratory, Downing Street, Cambridge, UK,
[email protected],
[email protected],
[email protected]
2 Universiteit Utrecht, The Netherlands,
[email protected].
2 Methods
Detailed methods describing the surgical approach and recording techniques can be found elsewhere (e.g. Bleeck et al. 2006) and will be briefly described below.
2.1 Physiology and Recording
The data reported in this chapter were recorded from pigmented guinea pigs anaesthetised with urethane (1.0 g/kg i.p.). Supplementary analgesia was provided by Hypnorm (fentanyl/fluanisone) (1.0 ml/kg i.m.). Additional doses of urethane and Hypnorm were given when required. The surgical preparation and stimulus presentation took place in a sound attenuating chamber (IAC). All animals were tracheotomised and core temperature was maintained at 38 ºC with a thermostatically controlled heating blanket. The electrocardiogram and respiration rate were monitored throughout and on signs of suppressed respiration the animal was artificially ventilated. Recordings were made using tungsten-in-glass microelectrodes (Merrill and Ainsworth 1972). Broadband noise was used as a search stimulus. Upon isolation of a single unit, estimates of best frequency (BF) and threshold were made using audio-visual criteria. Single units were classified according to peri-stimulus time histogram shape in response to suprathreshold BF tonebursts, interspike interval distribution and discharge regularity.
2.2 Complex Stimuli
The stimuli were linearly swept harmonic complex tones summed in cosine phase (Fig. 1). The complexes were constructed from harmonics 1–20, with starting F0 frequencies spaced at 1/3-octave intervals between 100 and 400 Hz and ending F0 frequencies one octave above the starting frequency. Sweep duration was 500 ms. Each sweep was convolved with impulse responses provided to us by Dr Tony Watkins (University of Reading). These were recorded at 0.32, 0.63, 1.25, 2.5, 5.0 and 10.0 m from a sound source in a corridor measuring approximately 2 m wide and 35 m long, with a ceiling height of 3.4 m, and are the same impulse responses as used in previous psychophysical studies of reverberation (Watkins 2005).
Fig. 1 Waveforms and spectrograms of harmonic complex sweeps with F0 100–200 Hz
In this chapter we show only responses to the dry condition and the 32, 125, and 500 cm distance conditions. All stimuli were normalised after convolution for equal rms voltage and were presented monaurally, typically with 25 repetitions and a 1-s repetition period. All stimuli were gated on and off with 1-ms cos² ramps.
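A minimal Python/NumPy sketch of this stimulus construction is given below; the sampling rate and the exact ramp implementation are assumptions, and the commented line indicates where a measured room impulse response h would be applied.

```python
import numpy as np

def harmonic_sweep(f0_start, f0_end, dur=0.5, fs=48000, n_harm=20):
    """Harmonic complex (harmonics 1..n_harm, cosine phase) whose F0 sweeps
    linearly from f0_start to f0_end over dur seconds, with 1-ms cos^2 ramps."""
    t = np.arange(int(dur * fs)) / fs
    f0 = f0_start + (f0_end - f0_start) * t / dur       # instantaneous F0
    phase = 2 * np.pi * np.cumsum(f0) / fs              # integrated phase
    x = sum(np.cos(k * phase) for k in range(1, n_harm + 1))
    ramp = np.ones_like(x)
    n = int(0.001 * fs)
    ramp[:n] = np.sin(0.5 * np.pi * np.arange(n) / n) ** 2
    ramp[-n:] = ramp[:n][::-1]
    return x * ramp / np.sqrt(np.mean(x ** 2))          # normalise to unit rms

# Reverberant version: convolve with a measured room impulse response h
# (e.g. one of the corridor responses described above), then re-normalise:
# y = np.convolve(harmonic_sweep(100, 200), h); y /= np.sqrt(np.mean(y ** 2))
```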
2.3 Analyses
Spikes were analysed using a 50-ms windowed segment of the response with the analysis window slid in 12.5-ms steps. From these windowed spike train segments we calculated an all-order interspike interval distribution between pairs of non-identical sweeps. Such shuffled all-order interval histograms (referred to as shuffled autocorrelograms (SAC)) have been used previously to show temporal responses to broadband noise in auditory nerve fibres (Louage et al. 2004).
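A minimal Python/NumPy sketch of this analysis is given below, with spike times in seconds stored as one array per stimulus presentation; the normalisation used by Louage et al. (2004) is omitted for brevity.

```python
import numpy as np

def shuffled_autocorrelogram(trains, max_lag=0.02, binwidth=50e-6):
    """All-order interspike-interval histogram across pairs of *different*
    presentations (a shuffled autocorrelogram, SAC), for spike times falling
    inside one analysis window."""
    edges = np.arange(0.0, max_lag + binwidth, binwidth)
    counts = np.zeros(len(edges) - 1)
    for i, a in enumerate(trains):
        for j, b in enumerate(trains):
            if i == j:
                continue                      # shuffle: skip within-train intervals
            d = np.abs(np.subtract.outer(a, b)).ravel()
            counts += np.histogram(d, bins=edges)[0]
    return edges[:-1], counts

def windowed_sacs(trains, t0=0.0, t1=0.5, win=0.05, step=0.0125):
    """Slide a 50-ms window in 12.5-ms steps and compute one SAC per window."""
    starts = np.arange(t0, t1 - win + 1e-9, step)
    return [(s, shuffled_autocorrelogram([t[(t >= s) & (t < s + win)]
                                          for t in trains])) for s in starts]
```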
3 Results
The preliminary results reported in this paper come from the responses of 67 units in the VCN (8 Primary-Like, 15 Transient Chopper, 29 Sustained Chopper, 6 Onset (On-I and On-L) and 9 Low-BF phase-locking units). In most cases a range of frequency sweeps was played. Examples from three units are shown below. The plots show windowed normalised SAC functions. In Fig. 2 we show the responses of a low-BF unit to a harmonic complex sweep with an F0 transition from 100 to 200 Hz over 500 ms. There is a clear peak in panel A at delays corresponding to the F0 throughout the response. There are also obvious peaks at shorter delays appropriate for the second and third harmonics. The unit BF is 352 Hz; thus initially the third harmonic of the 100- to 200-Hz sweep would be dominant within the unit's filter, with the second harmonic rapidly taking over as the dominant component. Approximately mid-way through the response the unit seems to respond predominantly to the fundamental component. With increasing levels of reverberation (Fig. 2B–D) the temporal representation of the stimulus remains clearly visible as peaks in the SAC, although the responses to the third and second harmonics appear to be spread to later time points. We have observed similar responses from nine units with BFs in the range 200–500 Hz. The responses of a Transient Chopper unit (BF = 2.34 kHz) are shown in Fig. 3. Again there is a clear representation of the stimulus in the dry condition, with peaks in the SAC corresponding to the stimulus fundamental. The peaks become much sharper as the stimulus periodicity enters this unit's range of preferred periodicities. Because the BF of this unit is 2.34 kHz, it is responding to envelope periodicity resulting from beating between several unresolved harmonics within the peripheral filter. There is still a representation of the F0 transition in the 32-cm condition, although at 125 cm and 500 cm this representation has disappeared, with the SACs tending towards unity, indicating the presence of uncorrelated spike times. This response pattern is typical of our population of units of all response types (except Onset-I) with BFs above approximately 1 kHz.
Low-frequency phase-locking unit (1270011), BF = 352 Hz. Left: SACs, binwidth = 50 µs. Below: PSTH, binwidth = 0.2 ms; 250 presentations of a 50-ms tone at BF, 50 dB above unit threshold
Fig. 2 Windowed normalised SACs of the response to a harmonic complex sweep with F0 100–200 Hz from a low-BF unit: A the response to the dry stimulus; B–D responses to the same stimulus after convolution with impulse responses at the source-to-receiver distances indicated in each panel. The scale along the ordinate in each case indicates the time through the stimulus.
The responses from a unit classified as Onset-I are shown in Fig. 4. This unit exhibits an almost perfect representation of the F0 transition in the dry condition. However, the addition of even very mild reverberation (32 cm) results in the almost complete abolition of a driven response. By 500 cm the raster plots in Fig. 4B show only spikes at stimulus onset. We have observed two units classified as On-I with this response pattern. Onset-L units give responses similar to those shown for the transient chopper neuron in Fig. 3.
4 Discussion
We have shown that units responding to low-numbered resolved harmonics (Fig. 2) maintain a representation of the F0 in their interspike intervals even with relatively severe reverberation. In contrast the same representation in units responding to envelope modulation appears to be degraded with even relatively mild reverberation.
Transient Chopper unit (1275006), BF = 2.34 kHz, chopping freq. = 500 Hz. Below: PSTH, binwidth = 0.2 ms; 250 presentations of a 50-ms tone at BF, 50 dB above unit threshold
Fig. 3 Same format as Fig. 2 but for a Transient Chopper unit in response to the 100 to 200-Hz sweep
Fig. 4 A Windowed SAC function from the responses of an On-I unit (BF = 3.41 kHz) to a harmonic complex sweep stimulus with an F0 of 250–500 Hz. B Dot raster plots for the responses to the dry condition (top) and the three reverberation conditions
The breakdown of the temporal response in high-frequency channels is likely due to the randomisation of the phase relationships between unresolved partials of the complex. This is further supported by the fact that low-frequency channels showing responses to resolved components appear more resistant to the effects of reverberation, with the main effect in these units being the smearing of the acoustic spectrum through time. Our On-I units' responses to the reverberant conditions are similar to those in response to random-phase harmonic complexes (Evans and Zhao 1998). A major attraction of temporal theories of pitch processing (based largely on the autocorrelation approach) is that the same neuronal operation (i.e. counting of coincidences) applies equally well to both resolved and unresolved partials of a complex. Others have argued for a two-mechanism hypothesis, invoking a pattern-recognition scheme for resolved regions and temporal processing for unresolved regions (e.g. Shackleton and Carlyon 1994). Despite the wealth of psychophysical evidence concerning the perception of vowels and other speech sounds under reverberant conditions, no study to our knowledge has specifically addressed the effect of reverberation on pitch perception when listening to either resolved or unresolved harmonics. Our results suggest that in the presence of reverberation the use of temporal information may be limited to frequency channels containing resolved harmonics. These results are interesting in the light of psychophysical evidence that, in the presence of reverberation and a modulated F0 contour, listeners' ability to perceptually segregate two competing sound sources with different F0s is compromised. In order to segregate sounds on the basis of an F0 difference it is necessary for a central processor to estimate the pitch of at least one of the competing sources, in order either to enhance the target sound or to cancel the interfering sound (de Cheveigné et al. 1995). Based on our results it seems likely that if the listener is making use of higher, unresolved, harmonics to estimate the pitch of the interfering sound in the presence of reverberation, the cancellation of this sound would be difficult. Acknowledgements. This work was supported by the BBSRC and the Wellcome Trust. We thank the Frank Edward Elmore and James Baird funds of the Cambridge MB/PhD programme for supporting one of the authors, MS.
References Bleeck S, Sayles M, Ingham NJ, Winter IM (2006) The time course of recovery from suppression and facilitation from single units in the mammalian cochlear nucleus. Hear Res 212:176–184 Culling JF, Summerfield Q, Marshall DH (1994) Effects of simulated reverberation on the use of binaural cues and fundamental-frequency differences for separating concurrent vowels. Speech Commun 14:71–95 de Cheveigné A, McAdams S, Laroche J, Rosenberg M (1995) Identification of concurrent harmonic and inharmonic vowels: a test of the theory of harmonic cancellation and enhancement. J Acoust Soc Am 97:3736–3748
Evans EF, Zhao W (1998) Periodicity coding of the fundamental frequency of harmonic complexes: physiological and pharmacological study of onset units in the ventral cochlear nucleus. In: Psychophysical and physiological advances in hearing. Proceedings of the 11th international symposium on hearing, 1997. Whurr, London Knudsen VO (1929) The hearing of speech in auditoriums. J Acoust Soc Am 1:56–82 Louage DH, van der Heijden M, Joris PX (2004) Temporal properties of responses to broadband noise in the auditory nerve. J Neurophysiol 91:2051–2065 Merrill EG, Ainsworth A (1972) Glass-coated platinum tipped tungsten microelectrodes. Med Biol Eng 10:662–672 Nábělek AK, Letowski TR, Tucker FM (1989) Reverberant overlap- and self-masking in consonant identification. J Acoust Soc Am 86:1259–1265 Santon F (1976) Numerical prediction of echograms and of the intelligibility of speech in rooms. J Acoust Soc Am 59:1399–1405 Shackleton TM, Carlyon RP (1994) The role of resolved and unresolved harmonics in pitch perception and frequency modulation discrimination. J Acoust Soc Am 95:3529–3540 Watkins AJ (2005) Perceptual compensation for effects of reverberation in speech identification. J Acoust Soc Am 118:249–262
Comment by Langner Your results may be explained by the way pitch information is mapped in the inferior colliculus (IC). As we have shown by single unit recordings as well as by functional mapping with the 2-Deoxyglucose method, a harmonic sound is represented by a column of activated neurons parallel to the tonotopic gradient. The activation of low frequency neurons is due to resolved harmonics, possible distortion products, and across frequency activation in the IC. The activation of neurons with higher CFs requires periodicity coding (according to my model by a cross-correlation of envelope periodicity with resolved harmonics). If you cut the lowest harmonics from the stimulus, the neuronal column is still (partly) activated, because periodicity coding works for higher harmonics. If you destroy periodicity information by reverberation, the column is (partly) activated by resolved harmonics alone. In either case, pitch remains encoded in the same way, by the same neuronal column in the IC.
6 Spectral Edges as Optimal Stimuli for the Dorsal Cochlear Nucleus SHARBA BANDYOPADHYAY1, ERIC D. YOUNG1, AND LINA A. J. REISS2
1 Introduction
The principal neurons of the dorsal cochlear nucleus (DCN) form one of several parallel pathways through the brainstem from the cochlear nucleus to the inferior colliculus (Rouiller 1997). Unlike the neurons of the ventral cochlear nucleus, DCN principal cells give strongly non-linear responses to sound (Nelken et al. 1997; Yu and Young 2000), meaning that models of DCN neurons often do not predict the responses to complex sounds. Such nonlinearity is typical of auditory neurons (e.g. Eggermont et al. 1983; Machens et al. 2004) and poses difficulties for studies of the representation of sound in the brain, because it is not possible to obtain a comprehensive view of the representation of sound by such nonlinear neurons. In the case of the DCN, information about function has been provided by behavioral experiments in which the nucleus or its output tract was lesioned (e.g. May 2000), leading to deficits in sound localization. In addition, the DCN receives inputs from various non-auditory sources, including the somatosensory system (Davis et al. 1996; Shore 2005), and these appear to be specifically related to the position of the external ear in cats (Kanold and Young 2001). These results are consistent with the finding that DCN neurons in the cat respond sensitively with inhibition to the acoustic notches in the head-related transfer functions of the cat external ear (reviewed in Young and Davis 2001). Together, these data suggest a role for the DCN in sound localization, especially localization based on spectral cues.
2 Spectral Notches and Spectral Edges
However, stimuli like acoustic notches and head-related transfer functions are complex, with multiple components (Fig. 1A); it is unclear exactly which components are important to DCN responses. Two approaches to this question are taken in this chapter: first, responses to notches with systematic variation of the notch width and center frequency suggest that the upper-frequency edge of the notch is the important aspect (Middlebrooks 1992; Reiss and Young 2005); second, a new approach to finding the optimal stimulus for a neuron is used to show that a rising spectral edge located at the neuron's best frequency (BF) is often the optimal stimulus for a DCN principal cell.
1 Biomedical Engineering and Center for Hearing and Balance, Johns Hopkins University, Baltimore, USA,
[email protected],
[email protected]
2 Speech Pathology and Audiology, University of Iowa, Iowa City, USA,
[email protected]
Fig. 1 A Cat head-related transfer function. The bracket at 10 kHz shows 0.5 oct. B Model of the tone response maps of DCN type IV neurons. C Response map of a DCN type IV neuron. Dark gray is excitatory, light gray is inhibitory. Contours are iso-rate, spaced at 12 spikes/s. Responses within 5 spikes/s of spontaneous rate are suppressed. Line at top marks the BF
The responses of a DCN principal cell (type IV neuron) to tones are summarized in the response map in Fig. 1C. As for all the data in this paper, these data are from a well-isolated single neuron in an unanesthetized, decerebrate cat. Such neurons are excited by frequencies near the best frequency (BF) at low sound levels (near 7 kHz at 0 dB SPL) and usually also at frequencies above BF (≈10 kHz) at higher sound levels (arrow). Other excitatory areas may be present, but the inhibitory area centered on BF and the second inhibitory area above BF (>10 kHz) are characteristic. The model in Fig. 1B provides an explanation for DCN response maps (Blum and Reed 1998). It consists of excitatory
input from auditory nerve fibers (ANF, dark gray) with strong inhibitory inputs from so-called type II neurons (light gray) and weaker inhibition from a second source (WBI; Nelken and Young 1994). The BF of the type II inhibitory input is shifted to a frequency below the neuron’s (and the ANF’s) BF and the inhibitory input has a higher threshold (Voigt and Young 1990), resulting in the major excitatory and inhibitory features in the response map. Figure 1A shows a typical cat head-related transfer function with a spectral notch positioned at the BF of the model (dashed line). This is the spectrum at the eardrum for a broadband noise presented in free field from 15˚ azimuth, −15º elevation. Because of the offset in their BFs, this notch would activate the excitatory input without activating the inhibitory input, leading to a strong response from the model. These features of the response map suggest that the upper edge of a notch might be a strong stimulus for DCN type IV neurons. Responses to spectral notches of a type IV neuron are shown in Fig. 2 in the form of discharge rate (ordinate) as the notch is moved in frequency (abscissa; Reiss and Young 2005). The abscissa is the frequency of the rising edge of the stimulus in each case. Figure 2A shows that notches of various
Fig. 2 Rate responses of a DCN neuron to notches (A, B) or noise bands (C) moved in frequency. Abscissae show the frequency of the rising edge of the notch or band. Passbands were 30 dB above stopbands. Sound levels are passband spectrum level, dB re 20 µPa/√Hz. The spectra giving maximum rate are shown above the plots. The horizontal dashed line is the average spontaneous rate, the vertical dashed line the BF
widths (see the legend) produce a strong excitatory response when the upper-frequency edge of the notch is near BF (vertical dashed line) and inhibition when the notch is centered on BF (when the upper edge frequency is just above BF). This pattern of response remains across a range of sound levels (Fig. 2B, for 1/2 octave notches). It is also observed when the stimulus is a noise band, but in this case is associated with the lower-frequency edge of the band, on the abscissa in Fig. 2C.
3 Finding the Optimal Stimulus
A useful approach to understanding the sensory representation by nonlinear neurons is to search for optimal stimuli (e.g. deCharms et al. 1998; O'Connor et al. 2005). The characteristics of a neuron's optimal stimulus provide a functional definition of the signal processing being done by the neuron. The optimum can be the stimulus giving the highest discharge rate or it can be the stimulus about which the neuron provides the most information in some sense. In this chapter the optimum is the maximum discharge rate. The problem of finding the optimum is not well defined and must usually be limited to some class of stimuli. Here, the stimulus class is random spectral shape stimuli (RSS; Yu and Young 2000; Young and Calhoun 2005) and the optimum spectral shape is sought. RSS stimuli consist of sums of random-phase tones spaced at 1/64 octave intervals over a several-octave frequency range; the tones are gathered into sets of 8 in 1/8 octave bins. The total power in each bin, in dB relative to a reference stimulus, varies pseudo-randomly with an approximately Gaussian distribution and a standard deviation of 1.5–12 dB. These stimuli have minimal envelope fluctuations and the effects of the temporal envelope are not considered. Figure 3A shows examples of the spectra of RSS stimuli. The optimization proceeds by changing the spectral shape iteratively, guided by the Fisher information matrix F of the responses (Cover and Thomas 1991). The i–jth term of the Fisher matrix is

F_ij = E[ (∂ ln p(r; q)/∂q_i) (∂ ln p(r; q)/∂q_j) ]     (1)
where p(r; q) is the pdf of the discharge rate r given the stimulus parameters q, the amplitudes (dB) of the stimulus in the 11 RSS bins centered on BF. F_ij is the sensitivity of the neuron's rate response to simultaneous changes in the stimulus amplitude in the ith and jth bins, in the sense that the inverse of the Fisher matrix is the covariance matrix of the minimum-variance unbiased estimator of q based on r (the Cramér-Rao bound). The Fisher matrix can be computed from rate data using the following approximation (Johnson et al. 2001):

D(p(r; q + δq) || p(r; q)) ≈ (1/(2 ln 2)) δq^T F δq     (2)
Spectral Edges as Optimal Stimuli for the Dorsal Cochlear Nucleus
B
Level, dB atten.
A
Largest eigenvector
C
47
*
40
*
* *
50 −0.5
0
0.5
−0.5
0
0.5
1
D
−0.5
0
30
0.5
*
0.5 40
0 −0.5
50
−1
−0.5
0
0.5
−0.5
0
0.5
* −0.5
0
0.5
Octaves re BF Rate (sp/s)
E 200
100
0
−8
−4
0
4
8
−8
−4
0
4
8
Eigenvector multiplier, dB
Fig. 3A–E Finding the optimal stimulus shape. The abscissae in A–D are frequency, in octaves re BF. The ordinate scale in D is level in dB attenuation as in A,B
where D( ) is the Kullback–Leibler (KL) distance between the pdfs of the rate response to stimulus vectors q + δq and q, and the approximation is good for small δq. The change δq in the stimulus that gives the largest change in the KL distance in Eq. (2) is parallel to the eigenvector emax with the largest eigenvalue, i.e. δqmax = Aemax, where A is a constant. It can be shown from a model of RSS responses that, for small δq, this also gives the largest change in discharge rate. Thus the rate optimization proceeds by estimating F in the vicinity of a reference stimulus q, then finding the δq that produces the largest rate change by empirically finding the value of A (limited to ±8 dB) such that δqmax = Aemax gives the largest rate change. The reference stimulus is then changed to q + δqmax and the process is repeated. The process terminates when the reference q is a rate maximum, as judged from a local quadratic model of the dependence of rate on δq. This process is done on-line and typically requires ~1 h and three iterations. The Fisher matrix is estimated from rate responses r to a large number of different perturbations δq around the reference stimulus, giving many simultaneous linear algebraic equations like Eq. (2) with the terms of F as the unknowns. The KL distance is computed from the mean rates, assuming that r is Poisson.
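A minimal numerical sketch of this procedure is given below (Python/NumPy): the Fisher matrix is estimated by least squares from Poisson-based KL distances for random perturbations, and the reference spectrum is then moved along the largest eigenvector by the step (within ±8 dB) that maximises the measured rate. The function rate_fn, which returns the neuron's mean rate for a given vector of bin amplitudes, is hypothetical, and the number of probes and other details differ from the on-line procedure.

```python
import numpy as np

def estimate_fisher(rate_fn, q_ref, n_probe=200, sigma_db=3.0, rng=None):
    """Least-squares estimate of F from rate responses to random perturbations
    dq, using D(p(r; q+dq) || p(r; q)) ~ dq' F dq / (2 ln 2) with Poisson rates."""
    rng = rng or np.random.default_rng()
    n = len(q_ref)
    iu = np.triu_indices(n)
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)   # off-diagonal terms appear twice
    r0 = rate_fn(q_ref)
    rows, rhs = [], []
    for _ in range(n_probe):
        dq = rng.normal(0.0, sigma_db, n)        # perturbation in dB per bin
        r1 = rate_fn(q_ref + dq)
        # KL distance (bits) between Poisson counts with means r1 and r0
        kl = r1 * np.log2(r1 / r0) + (r0 - r1) / np.log(2)
        rows.append(np.outer(dq, dq)[iu] * scale)
        rhs.append(2 * np.log(2) * kl)
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    F = np.zeros((n, n))
    F[iu] = coef
    return F + np.triu(F, 1).T                   # symmetrise

def optimisation_step(rate_fn, q_ref, amps_db=np.linspace(-8, 8, 9)):
    """Move q_ref along the largest eigenvector of F by the amount that gives
    the highest discharge rate (limited to +/-8 dB)."""
    F = estimate_fisher(rate_fn, q_ref)
    w, v = np.linalg.eigh(F)
    e_max = v[:, np.argmax(w)]
    rates = [rate_fn(q_ref + a * e_max) for a in amps_db]
    return q_ref + amps_db[int(np.argmax(rates))] * e_max
```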
Figure 3 shows the steps in computing an optimum shape. The left column shows the first iteration step. The initial reference q was an RSS stimulus with 0 dB in all bins, shown by the horizontal line in Fig. 3A. The dashed line shows an example of an RSS perturbation δq. The eigenvector emax is shown in Fig. 3C and the discharge rates for stimuli with spectral shape q + Aemax are shown in Fig. 3E, plotted as a function of A. The open circle shows the rate in response to the 0-dB reference stimulus and the filled circle shows the maximum rate over the 16-dB range tested. The error bars are the SD of 10 repetitions of each stimulus. The second column shows the second iteration. The reference in this case (solid line in Fig. 3A) has the shape of emax from the first iteration. emax and the rates for δq = Aemax are shown in Fig. 3C,E as before. In this case the maximum rate occurred at the spectral shape shown in Fig. 3B. This stimulus is a rate maximum for all directions δq and so terminated the iterations. Close inspection shows that the maximum rate after the second iteration was slightly smaller than the maximum rate after the first iteration. This occurred because of a systematic rate change in the neuron, sometimes seen in DCN principal cells; essentially, the rate decreased by 18% during the first iteration, as shown by rates in response to a control stimulus (not shown). Thus the rate maximum after the second iteration was indeed an overall rate maximum at that time. The optimization process only constrains the amplitudes at frequencies to which the neuron is sensitive. The bins marked by asterisks in Fig. 3B account for 80% of the rate change across a set of RSS stimuli. Those are also the bins that changed significantly during the iteration; note that the remaining, non-asterisked, bins stayed near their initial values. Thus the optimal stimulus should be considered to consist of the four asterisked bins.
4 The Optimal Stimulus for DCN Neurons
The optimal stimulus for the example neuron in Fig. 3B is a rising spectral slope centered on BF. Figure 3D shows the outcome of the optimization process for a second type IV neuron, whose optimal stimulus is a sharp spectral edge at BF. The results of the optimization process thus correspond to the organization of excitatory and inhibitory inputs in Fig. 1 and the rate peaks observed in Fig. 2. It is important to emphasize that not all DCN principal cells show the notch edge sensitivity of the examples shown here, presumably because of different arrangements of the inhibitory inputs (Reiss and Young 2005). The method of Sect. 3 provides a general way to find optimal spectral shapes that is applicable to neurons in all parts of the auditory system. It is fast and can be made faster by initiating the search with a reference stimulus that produces the highest discharge rate across an RSS set. Its major limitation is that it does not incorporate temporal aspects of the stimulus.
Acknowledgements. This work was supported by NIH grants DC00115 and DC05211.
References
Blum JJ, Reed MC (1998) Effects of wide band inhibitors in the dorsal cochlear nucleus. II. Model calculations of the responses to complex tones. J Acoust Soc Am 103:2000–2009
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York
Davis KA, Miller RL, Young ED (1996) Effects of somatosensory and parallel-fiber stimulation on neurons in dorsal cochlear nucleus. J Neurophysiol 76:3012–3024
deCharms RC, Blake DT, Merzenich MM (1998) Optimizing sound features for cortical neurons. Science 280:1439–1443
Eggermont JJ, Aertsen AMHJ, Johannesma PIM (1983) Prediction of the responses of auditory neurons in the midbrain of the grass frog based on the spectro-temporal receptive field. Hearing Res 10:191–202
Johnson DH, Gruner CM, Baggerly K, Seshagiri C (2001) Information-theoretic analysis of the neural code. J Comput Neurosci 10:47–69
Kanold PO, Young ED (2001) Proprioceptive information from the pinna provides somatosensory input to cat dorsal cochlear nucleus. J Neurosci 21:7848–7858
Machens CK, Wehr MS, Zador AM (2004) Linearity of cortical receptive fields measured with natural sounds. J Neurosci 24:1089–1100
May BJ (2000) Role of the dorsal cochlear nucleus in the sound localization behavior of cats. Hearing Res 148:74–87
Middlebrooks JC (1992) Narrow-band sound localization related to external ear acoustics. J Acoust Soc Am 92:2607–2624
Nelken I, Young ED (1994) Two separate inhibitory mechanisms shape the responses of dorsal cochlear nucleus type IV units to narrowband and wideband stimuli. J Neurophysiol 71:2446–2462
Nelken I, Kim PJ, Young ED (1997) Linear and non-linear spectral integration in type IV neurons of the dorsal cochlear nucleus: II. Predicting responses using non-linear methods. J Neurophysiol 78:800–811
O’Connor KN, Petkov CI, Sutter ML (2005) Adaptive stimulus optimization for auditory cortical neurons. J Neurophysiol 94:4051–4067
Reiss LAJ, Young ED (2005) Spectral edge sensitivity in neural circuits of the dorsal cochlear nucleus. J Neurosci 25:3680–3691
Rouiller EM (1997) Functional organization of the auditory pathways. In: Ehret G, Romand R (eds) The central auditory system. Oxford University Press, New York, pp 3–96
Shore SE (2005) Multisensory integration in the dorsal cochlear nucleus: unit responses to acoustic and trigeminal ganglion stimulation. Eur J Neurosci 21:3334–3348
Voigt HF, Young ED (1990) Cross-correlation analysis of inhibitory interactions in dorsal cochlear nucleus. J Neurophysiol 64:1590–1610
Young ED, Calhoun BM (2005) Nonlinear modeling of auditory-nerve rate responses to wideband stimuli. J Neurophysiol 94:4441–4454
Young ED, Davis KA (2001) Circuitry and function of the dorsal cochlear nucleus. In: Oertel D, Popper AN, Fay RR (eds) Integrative functions in the mammalian auditory pathway. Springer, Berlin Heidelberg New York, pp 160–206
Yu JJ, Young ED (2000) Linear and nonlinear pathways of spectral information transmission in the cochlear nucleus. PNAS 97:11780–11786
Comment by Langner
According to your Fig. 1, the spectral notches in cat head-related transfer functions show up around 10 kHz, which would suggest a functional role for units with an inhibitory area close to or at their CF around 10 kHz. However,
the tuning curves of type IV neurons are similar not only around 10 kHz but for all center frequencies. Therefore my question is: What is your opinion about the functional role of type IV neurons outside the 10-kHz range?
Reply
We have noted previously that type IV neurons with notch sensitivity do not seem to be limited to BFs where the cat’s ear shows spectral notches (Young and Davis 2001, Fig. 5.13). The present chapter, along with the results of Lina Reiss (Reiss and Young 2005), provides an alternative view of DCN notch sensitivity as sensitivity to rising frequency edges. During the meeting, an interesting suggestion was made by B. Delgutte: because the acoustic environment is usually low-pass in its spectral content, DCN neurons may be tuned to unusual acoustic features which are high-pass, by contrast to the usual spectra. This corresponds well to our previous suggestions that the DCN may serve to detect potentially important acoustic events and report them to the auditory system (Nelken and Young 1996).
References
Nelken I, Young ED (1996) Why do cats need a dorsal cochlear nucleus? Rev Clin Basic Pharmacol 7:199–220
Reiss LAJ, Young ED (2005) Spectral edge sensitivity in neural circuits of the dorsal cochlear nucleus. J Neurosci 25:3680–3691
Young ED, Davis KA (2001) Circuitry and function of the dorsal cochlear nucleus. In: Oertel D, Popper AN, Fay RR (eds) Integrative functions in the mammalian auditory pathway. Springer, Berlin Heidelberg New York, pp 160–206
7 Psychophysical and Physiological Assessment of the Representation of High-frequency Spectral Notches in the Auditory Nerve ENRIQUE A. LOPEZ-POVEDA1, ANA ALVES-PINTO1, AND ALAN R. PALMER2
1 Introduction
Destructive interference between sound waves within the pinna produces notches in the stimulus spectrum at the eardrum. Some of these notches have a center frequency that depends strongly on the relative vertical angle between the sound source and the listener (e.g. Lopez-Poveda and Meddis 1996). Therefore, it is not surprising that they constitute useful cues for judging sound source elevation (reviewed by Carlile et al. 2005).
The auditory nerve (AN) is the only transmission path of acoustic information to the brain. Single fibers encode the physical characteristics of the sound in at least two ways: in their discharge rate and in the time at which their spikes occur (reviewed by Lopez-Poveda 2005). Because spectral notches due to the pinna occur at frequencies beyond the cut-off of phase locking, the common view is that the AN representation of these notches must be based on the discharge rate alone, i.e. temporal representations do not contribute (Rice et al. 1995). In other words, the brain would infer the stimulus spectrum from a representation of the discharge rate of the population of AN fibers as a function of their characteristic frequencies (CFs). This representation is known as the rate profile.
On the other hand, evidence exists that the apparent quality of the rate-profile representation of high-frequency spectral notches degrades as the sound pressure level (SPL) of the stimulus increases (Rice et al. 1995; Lopez-Poveda 1996). Almost certainly this is due to the low threshold and the narrow dynamic range of AN fibers with high spontaneous rate (HSR), which are the majority, and to the progressive broadening of their frequency tuning with increasing SPL. Although low-spontaneous rate (LSR) fibers have higher thresholds and wider dynamic ranges, they are a minority. Furthermore, the broadening of basilar membrane tuning at high levels makes it unlikely that they can convey high-frequency spectral notches in their rate profile equally well at low and high levels (Carlile and Pralong 1994; Lopez-Poveda 1996).
1 Instituto de Neurociencias de Castilla y León, Universidad de Salamanca, Avda. Alfonso X El Sabio s/n, 37007 Salamanca, Spain,
[email protected],
[email protected] 2 MRC Institute of Hearing Research, University Park, Nottingham, NG7 2RD, UK,
[email protected]
Consistent with this, one would expect that discriminating between a flat-spectrum noise and a similar noise with a spectral notch centered at a high frequency (say 8 kHz) would become increasingly difficult as the overall stimulus level increases. However, contrary to this expectation we have shown (Alves-Pinto and Lopez-Poveda 2005) that the ability to discriminate between flat-spectrum and notched noise stimuli is a nonmonotonic function of level for the majority of listeners. Specifically, discrimination is more difficult at levels around 70–80 dB SPL than at lower or higher levels.
Here we report on our efforts at understanding the nature of this paradoxical result. Our approach consists of predicting the limits of psychophysical performance in the spectral discrimination task of Alves-Pinto and Lopez-Poveda (2005) based on the statistical analysis of experimental AN responses. The results contradict the common view that high-frequency spectral notches are conveyed to the central auditory system in the AN rate profile. Instead, they suggest that spike rates over narrow time windows almost certainly convey useful information for discriminating between noise bursts with and without high-frequency spectral notches.
2 Methods
The activity of guinea pig AN fibers was recorded in response to the same bursts of broadband (20–16,000 Hz) frozen noise that we used in our previous psychophysical study. Two types of noise were considered. One had a completely flat spectrum while the spectrum of the other had a rectangular notch between 6000 and 8000 Hz with a depth of 3 dB re. noise spectrum level. Responses were measured for overall noise levels between 40 and 100 dB SPL in steps of 10 dB. The noise bursts had a duration of 110 ms, including a 10-ms rise ramp (no fall ramp was applied), and were presented every 880 ms. Details on the noise generation procedure are given elsewhere (Alves-Pinto and Lopez-Poveda 2005).
Responses were recorded for a sample of 106 AN fibers (from 16 animals) with CFs spanning a range from 1000 to 19,000 Hz. Of the fibers, 31 had spontaneous rates less than 18 spikes/s. The method for recording physiological responses was virtually identical to that described in Palmer et al. (1985). The response of any given fiber was measured at least five times for each stimulus condition.
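For illustration, a minimal sketch of how the two frozen-noise stimuli could be constructed; the sampling rate and the spectral-domain synthesis are assumptions, not the authors' original generation procedure:

```python
import numpy as np

fs = 48000                              # sampling rate (Hz); assumed value
dur, ramp = 0.110, 0.010                # 110-ms burst, 10-ms rise ramp, no fall ramp
n = int(fs * dur)

rng = np.random.default_rng(0)          # fixed seed -> "frozen" noise
freqs = np.fft.rfftfreq(n, 1.0 / fs)
phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)
spec = np.exp(1j * phases)              # unit magnitude: completely flat spectrum

band = (freqs >= 20) & (freqs <= 16000)            # broadband 20-16,000 Hz
flat_spec = np.where(band, spec, 0.0)
notch = (freqs >= 6000) & (freqs <= 8000)
notched_spec = flat_spec * np.where(notch, 10.0 ** (-3.0 / 20.0), 1.0)  # 3-dB notch

def to_waveform(s):
    x = np.fft.irfft(s, n)
    nr = int(fs * ramp)
    env = np.ones(n)
    env[:nr] = np.linspace(0.0, 1.0, nr)           # 10-ms rise ramp only
    return x * env

flat_noise = to_waveform(flat_spec)
notched_noise = to_waveform(notched_spec)
```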
2.1 Statistical Analysis of Auditory Nerve Responses
The psychophysical just-noticeable difference (JND) in a given stimulus parameter, ∆αJND, can be predicted from the instantaneous discharge rate of the population of AN fibers as follows (Siebert 1970; Heinz et al. 2001):

$$\Delta\alpha_{\mathrm{JND}} = \left\{ \sum_i \int_0^T \frac{1}{r_i(t,\alpha)} \left[ \frac{\partial r_i(t,\alpha)}{\partial \alpha} \right]^2 dt \right\}^{-1/2} \qquad (1)$$
where t denotes time, and ri(t, α) the instantaneous discharge rate of the i-th fiber in response to the stimulus with parameter α. In our context, α corresponds to the notch depth. Hence, Eq. (1) allows predicting the threshold notch depth for discriminating between a flat-spectrum noise and a noise with a spectral notch based on the experimental AN responses.
Equation (1) was derived on the assumption that the times at which AN spikes occur follow a Poisson distribution (i.e., that spikes occur at times that are independent of each other). Furthermore, it was derived on the assumption that psychophysical discrimination thresholds reflect optimal use of every bit of information available in the activity of the population of fibers. Neither of these two conditions applies here (see Heinz et al. 2001); thus we do not expect the resulting ∆αJND values to match the psychophysical thresholds directly. However, it is reasonable to assume that the error in using Eq. (1) for predicting the psychophysical thresholds will be similar for all SPLs. Therefore, Eq. (1) remains useful for predicting the shape of the threshold notch depth vs level function, as reported previously by us (Alves-Pinto and Lopez-Poveda 2005).
It is noteworthy that Eq. (1) predicts the threshold notch depth for spectral discrimination using the instantaneous discharge rate of the population of AN fibers. This contrasts with the rate-place model described in the Introduction, which only considers the information conveyed in the overall discharge rate of the fibers.
For obvious reasons, in applying Eq. (1) we had to consider a discrete version of the instantaneous discharge rate, ri(∆t, α), rather than the continuous-time ri(t, α). Note that ri(∆t, α) may be interpreted as a mean-rate post-stimulus time histogram for a bin width duration of ∆t. Instead of deciding on an arbitrary value for ∆t, we computed ∆αJND for different bin widths (or sampling periods), ∆t, from 0.333 to 110 ms. Note that in the extreme case that ∆t equals the stimulus duration, the resulting ∆αJND corresponds to performance based on a rate-place code only.
In Eq. (1), the term between square brackets denotes the change in instantaneous discharge rate for an incremental change in parameter α. It was calculated as the instantaneous difference in discharge rate between the flat-spectrum (α = 0 dB) and the notched (α = 3 dB) noises. ∆αJND becomes unrealistically equal to zero when the discharge rate of any fiber is equal to zero for any bin. To prevent this artifactual result, a small, arbitrary constant of 0.1 spikes/s was added to the measured discharge rate in all bins of all fibers. The actual value of this constant did not alter the results significantly.
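A minimal sketch of the discrete computation just described, assuming binned mean rates for the two stimuli are already available; the array names and the finite-difference derivative are illustrative, not the authors' analysis code:

```python
import numpy as np

def predicted_jnd(rates_flat, rates_notched, bin_width_s,
                  delta_alpha_db=3.0, rate_floor=0.1):
    """rates_flat, rates_notched: assumed arrays of shape (n_fibers, n_bins) holding
    mean discharge rates (spikes/s) per PSTH bin for the 0-dB and 3-dB notch stimuli."""
    # Small constant prevents zero-rate bins from collapsing the JND to zero.
    r = rates_flat + rate_floor
    # Finite-difference approximation to the partial derivative in Eq. (1).
    dr_dalpha = (rates_notched - rates_flat) / delta_alpha_db
    # Discrete version of Eq. (1): sum over fibers and time bins.
    information = np.sum((dr_dalpha ** 2 / r) * bin_width_s)
    return information ** -0.5
```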
3 Results
The results are illustrated in Fig. 1. The series denoted by the open circles (left ordinate) illustrates the ∆αJND values based on the experimental AN responses. These will be hereinafter referred to as the physiological JNDs.
Fig. 1 Predicted threshold notch depth values from auditory nerve responses (circles, left ordinate: physiological JND in dB) for different bin widths (as denoted by the numbers next to each trace, in ms). Also shown for comparison is an example psychophysical function (squares, right ordinate: psychophysical JND in dB) taken from Alves-Pinto and Lopez-Poveda (2005). The abscissa is level (dB SPL)
Each series illustrates the results for a different bin width, ∆t, as indicated by the numbers next to each trace. The series denoted by filled squares (right ordinate) illustrates a particular example of a psychophysical threshold notch depth vs level function taken from Alves-Pinto and Lopez-Poveda (2005; Fig. 3, listener S1). Notice that the scales of both ordinate axes are logarithmic (after Alves-Pinto and Lopez-Poveda 2005) and span a comparable range of values in relative terms, but not in absolute terms. In general, for any given SPL, the physiological-JND values increase as the sampling period ∆t increases, suggesting that discrimination benefits from the information conveyed by the timing of spike occurrences. The most striking result is that the shape of the physiological-JND vs level functions varies considerably depending on the time window ∆t. Only for ∆t values within the range from 4 to 9 ms are the physiological-JND functions nonmonotonic with a peak at or around 80 dB SPL, thus resembling the shape of the psychophysical threshold notch depth vs level function (squares). In absolute terms, however, the physiological-JND values are about two orders of magnitude lower than the psychophysical ones (for the listener considered in Fig. 1). This may reflect differences in cochlear
processing between human and guinea pig, and/or that humans do not behave as “optimal” spectral discriminators; otherwise the absolute values would match.
The shape of the psychophysical threshold notch depth vs level function varies among listeners (Alves-Pinto and Lopez-Poveda 2005). Similarly, the shape of the physiological-JND vs level function depends on the value of the bin width ∆t (Fig. 1). Kendall’s τ correlation coefficient (Press et al. 1992) was used to quantify the degree of correlation between the shape of the psychophysical function for each of the five listeners (S1 to S5) considered by Alves-Pinto and Lopez-Poveda (2005) and the physiological-JND vs level functions for different values of ∆t. Figure 2 shows the ∆t values (circles; left ordinate) that yielded the highest correlations for each listener, as well as the corresponding correlation values (squares; right ordinate). The degree of correlation varies considerably across listeners, but the ∆t that yields the highest correlations is similar across listeners (mean ± s.d. = 8.66 ± 0.36 ms).
Fig. 2 Binwidths in ms (circles, left ordinate) for which maximum correlation occurred between the shapes of the physiological and the psychophysical threshold notch depth vs level functions for the five listeners (S1–S5) considered by Alves-Pinto and Lopez-Poveda (2005). Squares (right ordinate) illustrate the actual degree of correlation. Two asterisks denote highly significant (p < 0.01) correlations
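As an illustration of this analysis, a minimal sketch of finding the bin width whose physiological-JND-vs-level function best matches a listener's psychophysical function; the data containers are assumptions, not the authors' code:

```python
from scipy.stats import kendalltau

def best_binwidth(physio_jnd_by_dt, psycho_jnd):
    """physio_jnd_by_dt: dict mapping binwidth (ms) to physiological JNDs over levels;
    psycho_jnd: the listener's psychophysical thresholds at the same levels."""
    correlations = {}
    for dt, physio_jnd in physio_jnd_by_dt.items():
        # Kendall's tau compares the shapes (rank order) of the two functions.
        tau, p_value = kendalltau(physio_jnd, psycho_jnd)
        correlations[dt] = tau
    dt_best = max(correlations, key=correlations.get)
    return dt_best, correlations[dt_best]
```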
In Fig. 1, the series for ∆t equal to the stimulus duration (110 ms, top trace) shows the predicted physiological-JND values based on the overall average discharge rate alone. The shape of this function clearly differs from that of the psychophysical function and matches overall the prediction of the rate-only theory. That is, threshold notch depths are lowest for overall levels around 60 dB SPL (corresponding to a spectrum level of 18 dB SPL) and increase progressively with increasing SPL. The level for which the physiological JND is lowest corresponds to an effective level of approximately 28 dB SPL (assuming a fiber with a CF = 8000 Hz and an effective bandwidth of 1000 Hz), which approximately falls at the center of the dynamic range of HSR fibers.
Figure 3 compares the physiological-JND vs level functions (for ∆t = 8.33 ms) for three cases: using the information conveyed by all 106 fibers (circles) – this is the case considered so far; using only the information conveyed by the 75 fibers with SRs ≥ 18 spikes/s (HSR, triangles); and using only the information from the 31 fibers with SRs < 18 spikes/s (LSR, squares). Differences exist between the functions. Nonetheless, the physiological JND is a nonmonotonic function of SPL in all three cases and a peak occurs at 80 dB SPL.
Fig. 3 Physiological JND (in dB) vs level (dB SPL) based on the information conveyed by fiber groups of different types (HSR, LSR, and HSR+LSR)
4 Discussion
We have previously suggested that the nonmonotonic shape of the psychophysical threshold notch depth vs level function reflects the existence of two fiber types (HSR and LSR) with different thresholds and dynamic ranges (Alves-Pinto and Lopez-Poveda 2005; Alves-Pinto et al. 2005), and that the peak around 80 dB SPL indicates the transition between their dynamic ranges. This interpretation is, however, almost certainly wrong, as the predicted threshold notch depth vs level function is nonmonotonic with a peak at 80 dB SPL even when the two types of fibers are considered separately (Fig. 3).
Our previously suggested interpretation was based on the premise that the spectral notch must be encoded in the AN rate profile. The present results suggest that this premise is almost certainly false. Indeed, the results of Fig. 1 argue against the view that high-frequency spectral notches must be encoded in the average rate profile of AN fibers and suggest, instead, that the discharge rate over narrow time windows conveys useful information for discriminating between flat-spectrum and notched noise stimuli. The results also suggest that humans somehow sample the discharge rate of AN fibers in non-overlapping time windows of approximately 8.6 ms (Fig. 2).
Three important questions arise now: 1) what is the temporal code in question; 2) how is it generated; and 3) how does it relate to the predicted 8.6-ms sampling period? The answer to these questions requires further analysis of the AN responses and we can only speculate at present. The effective drive to any AN fiber is a half-wave rectified, low-pass filtered version of the basilar membrane (BM) response waveform at its corresponding place in the cochlea. With broadband noise stimulation, this can be described as a randomly amplitude-modulated carrier with a carrier frequency near the fiber’s CF, where the range of modulation frequencies is limited by the bandwidth of the cochlear filter (Louage et al. 2004) or the cut-off of phase locking. The bandwidth of BM filters, and thus the range of modulation frequencies, increases with increasing SPL. Similarly, the phase of the BM response waveform depends on the filter bandwidth and thus on the stimulus SPL. AN fibers can phase-lock to the envelope of BM excitation even at high levels, when their discharge rate is at saturation (Cooper et al. 1993). Fibers with CFs near the notch frequency certainly “see” a different level compared with those with CFs well away from it. It is, therefore, possible that spectral discrimination is based on detecting either the range of modulation frequencies or the phase differences implicit in AN spike trains (or both). On the basis of this conjecture, the psychophysical threshold notch depth vs level functions reported by Alves-Pinto and Lopez-Poveda (2005) would reflect the “dynamic range” of envelope-following rather than of the discharge rate of AN fibers.
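To make the "randomly amplitude-modulated carrier" picture concrete, a minimal illustration (not part of the authors' analysis) in which a Butterworth band-pass filter stands in for a cochlear filter and the Hilbert envelope shows the modulation that a fiber could follow; the filter type, CF and bandwidth are assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 48000                                   # sampling rate (Hz); assumed value
rng = np.random.default_rng(1)
noise = rng.standard_normal(fs)              # 1 s of broadband noise
cf, bw = 7000.0, 1000.0                      # assumed CF near the notch, and bandwidth

# A Butterworth band-pass filter is a crude stand-in for the cochlear filter at this CF.
sos = butter(4, [cf - bw / 2, cf + bw / 2], btype="bandpass", fs=fs, output="sos")
bm_like = sosfiltfilt(sos, noise)            # stand-in for the BM response waveform

# Hilbert envelope: the random amplitude modulation that a fiber could follow
# even when its average discharge rate is saturated.
envelope = np.abs(hilbert(bm_like))
```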
5 Conclusions
High-frequency spectral notches are not encoded in the auditory nerve rate profile, as is commonly thought. Instead, they are encoded by mean rate measures taken over quite short (~8.6 ms) time windows.
Acknowledgments. Work supported by FIS (PI02/203 and G03/203) and PROFIT (CIT-3900002005-4). We thank Trevor Shackleton, Ray Meddis, and Gerald Langner for their support and suggestions.
References
Alves-Pinto A, Lopez-Poveda EA (2005) Detection of high-frequency spectral notches as a function of level. J Acoust Soc Am 118:2458–2469
Alves-Pinto A, Lopez-Poveda EA, Palmer R (2005) Auditory nerve encoding of high-frequency spectral information. Lect Notes Comp Sci 3561:223–232
Carlile S, Pralong D (1994) The location-dependent nature of perceptually salient features of the human head-related transfer function. J Acoust Soc Am 95:3445–3459
Carlile S, Martin R, McAnally K (2005) Spectral information in sound localization. Int Rev Neurobiol 70:399–434
Cooper NP, Robertson D, Yates GK (1993) Cochlear nerve fiber responses to amplitude modulated stimuli: variations with spontaneous rate and other response characteristics. J Neurophysiol 70:370–386
Heinz MG, Colburn HS, Carney LH (2001) Evaluating auditory performance limits: I. One-parameter discrimination using a computational model for the auditory nerve. Neural Comp 13:2273–2316
Lopez-Poveda EA (1996) The physical origin and physiological coding of pinna-based spectral cues. Doctoral dissertation. Loughborough University, UK
Lopez-Poveda EA (2005) Spectral processing by the peripheral auditory system: facts and models. Int Rev Neurobiol 70:7–48
Lopez-Poveda EA, Meddis R (1996) A physical model of sound diffraction and reflections in the human concha. J Acoust Soc Am 100:3248–3259
Louage DH, van der Heijden M, Joris PX (2004) Temporal properties of responses to broadband noise in the auditory nerve. J Neurophysiol 91:2051–2065
Palmer AR, Winter IM, Darwin CJ (1985) The representation of steady-state vowel sounds in the temporal discharge patterns of guinea-pig cochlear-nerve and primary-like cochlear-nucleus neurones. J Acoust Soc Am 79:100–113
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientific computing. Cambridge University Press, New York
Rice JJ, Young ED, Spirou GA (1995) Auditory-nerve encoding of pinna-based spectral cues: rate representation of high-frequency stimuli. J Acoust Soc Am 97:1764–1776
Siebert WM (1970) Frequency discrimination in the auditory system: place or periodicity mechanisms? Proc IEEE 58:723–730
Comment by Heinz
The notch depth used to approximate the partial derivative in Eq. (1) based on your physiological data was 3 dB, while many of the predicted all-information JNDs shown in Fig. 1 were much smaller (down to 0.0025 dB). This raises the question of whether auditory-nerve discharge rate varies linearly with notch depth over the range from 0 to 3 dB. Ideally, the derivative would be approximated with a notch depth smaller than (or equal to) the JNDs you are predicting. Have you investigated whether using a smaller notch depth would affect your predictions in terms of either absolute value or, more importantly, the trends vs level? Perhaps this could most easily be checked using a computational auditory nerve model.
Reply
You are right that ideally the notch depth used to compute the partial derivative in Eq. (1) should be smaller than (or comparable to) the predicted JNDs. We were limited by experimental conditions. However, the predicted JNDs vary depending on the number of fibers and the stimulus duration. We have applied the same analysis to fewer fibers (Fig. 3), to shorter (20-ms) stimuli (results not shown), and to deeper notches (6 and 9 dB; results not shown). In all these cases, the functions were nonmonotonic with a peak at around 80 dB SPL only for binwidths between 4 and 9 ms. Therefore, it is unlikely that the shape of the JND vs level function (which is more important than the actual JND value) varies much depending on the notch depth used to compute the rate increment in Eq. (1).
8 Spatio-Temporal Representation of the Pitch of Complex Tones in the Auditory Nerve LEONARDO CEDOLIN1,2 AND BERTRAND DELGUTTE1,2,3
1 Introduction
Although pitch is a fundamental auditory percept that plays an important role in music, speech, and auditory scene analysis, the neural codes and mechanisms for pitch perception are still poorly understood. In a previous study (Cedolin and Delgutte 2005), we tested the effectiveness of two classic representations for the pitch of harmonic complex tones at the level of the auditory nerve (AN) in cat: a rate-place representation based on resolved harmonics and a temporal representation based on pooled interspike-interval distributions (a.k.a. autocorrelation). Both representations supported precise pitch estimation in the F0 range of cat vocalizations (500-1000 Hz), but neither was entirely consistent with human psychophysical data. Specifically, the rate-place representation failed to predict the existence of an upper limit for the pitch of missing-F0 complex tones (Moore 1973). The rate-place representation also degrades rapidly with increasing sound level, in contrast to the relatively robust pitch discrimination performance. The interval representation did not account for the greater salience of pitch based on resolved harmonics compared to pitch based on unresolved harmonics (Carlyon and Shackleton 1994). Here, we investigate an alternative, “spatio-temporal” neural representation of pitch, which may combine the advantages and overcome the limitations of the traditional rate-place and interval representations.
1.1 Spatio-Temporal Representation of Pitch
Physiological and modeling studies have shown that the phase of basilar membrane motion in response to a pure tone varies rapidly with cochlear place near the place tuned to the tone frequency (Pfeiffer and Kim 1975).
1 Eaton-Peabody Laboratory, Massachusetts Eye and Ear Infirmary, [email protected]
2 Speech and Hearing Bioscience and Technology Program, Harvard-M.I.T. Division of Health Sciences and Technology,
[email protected]
3 Research Laboratory of Electronics, M.I.T.
At frequencies within the range of phase-locking, this rapid spatial change in phase is reflected in the timing of AN spike discharges, thus generating a spatio-temporal cue to the frequency of the pure tone which can in principle be extracted by a neural mechanism sensitive to the relative timing of spikes from adjacent cochlear locations (Shamma 1985). For harmonic complex tones, such rapid changes in phase are expected to occur at each of the spatial locations tuned to a resolved harmonic, thereby providing “spatio-temporal” cues to pitch that could serve as input to a harmonic template mechanism.
Figure 1 shows the response of a peripheral auditory model (Zhang et al. 2001) to a missing-F0 harmonic complex tone. The latency of the resulting traveling wave varies more rapidly for CFs near low-order harmonics than for CFs in between two harmonics (white broken line in Fig. 1). To extract these spatio-temporal cues, we compute the spatial derivative of the response pattern (a point-by-point difference between adjacent rows in Fig. 1), then integrate the absolute value of the derivative over time. This “mean absolute spatial derivative” (MASD) simulates a lateral inhibitory mechanism operating upon the spatio-temporal pattern of AN activity (Shamma 1985). The MASD shows local maxima at CFs corresponding to the frequencies of Harmonics 2-6, while the average discharge rate (Ravg), obtained by integrating the response at each CF over time, is largely saturated at this stimulus level. Thus, these model results suggest that spatio-temporal pitch cues may persist at levels at which the rate-place representation fails due to the saturation of AN fiber responses. A major goal of the present study is to test this prediction physiologically.
Fig. 1 Spatio-temporal response of peripheral auditory model (Zhang et al. 2001) to a harmonic complex tone with missing F0 (200 Hz). Filter bandwidths were set to match human psychophysical masking data (Moore and Glasberg 1990).
Fig. 2 Left: Spatio-temporal response pattern of Zhang et al. (2001) model to a harmonic complex tone (missing F0 = 500 Hz). Right: Model responses for a single cochlear place (CF = 1500 Hz) to a series of complex tones with varying F0 (333-1000 Hz). Note the normalized time scale and the inverted F0 scale. Filter bandwidths were set to match physiological data from cat AN (Carney and Yin 1988)
1.2 Scaling Invariance in Cochlear Mechanics
Measuring the entire spatio-temporal response pattern of the AN for a complex-tone stimulus as in Fig. 1 would be extremely difficult, because it would require a very fine, regular and extensive sampling of CFs. We overcame this hurdle by applying the principle of local scaling invariance in cochlear mechanics (Zweig 1976). Scaling invariance implies that the spatio-temporal response pattern to a complex tone with a given F0 can be inferred from the responses at a single cochlear place (fixed CF) to a series of complex tones with varying F0, if cochlear place and time are expressed in dimensionless units CF/F0 (harmonic number) and t × F0 (normalized time), respectively.
Figure 2 shows the model spatio-temporal response pattern to a complex tone with fixed F0 (left) next to the model response pattern at a fixed CF to a series of complex tones with varying F0 (right). The harmonic number CF/F0 varies from 1.5 to 4.5 in both cases. The two response patterns are nearly indistinguishable and both Ravg and MASD, computed as in Fig. 1, show nearly identical features, thus justifying our method for inferring the spatio-temporal response from the responses of a single AN fiber.
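A minimal sketch of the bookkeeping this implies, assuming period histograms of one fiber for several F0s are available; the data layout is an assumption, not the authors' code:

```python
def inferred_spatiotemporal_pattern(period_histograms, cf_hz):
    """period_histograms: dict mapping F0 (Hz) to (time_s, rate) arrays for one fiber
    (an assumed layout).  Returns rows labelled by dimensionless coordinates."""
    rows = []
    for f0, (time_s, rate) in period_histograms.items():
        harmonic_number = cf_hz / f0      # CF/F0 plays the role of cochlear place
        normalized_time = time_s * f0     # t * F0 plays the role of time
        rows.append((harmonic_number, normalized_time, rate))
    # Ordering by harmonic number gives a pattern analogous to the right panel of Fig. 2.
    return sorted(rows, key=lambda row: row[0])
```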
2 Methods
Methods for recording from auditory-nerve fibers in anesthetized cats were as described by Cedolin and Delgutte (2005). Stimuli were harmonic complex tones with missing F0s. For each fiber, the F0 range was chosen such that the harmonic number CF/F0 varied from
1.5 to 4.5 in order to capture low-order harmonics likely to be resolved. Each complex tone was composed of Harmonics 2 to 20, all of equal amplitude and in cosine phase. Each of the F0 steps lasted 200 ms and was presented 40 times. The sound pressure level of each harmonic was initially set at 10-15 dB above rate threshold for a pure tone at CF. When possible, the stimulus level was then systematically varied over a 20-30 dB range. Period histograms were constructed in response to each complex tone and displayed as a function of both normalized time and CF/F0 for each fiber. Ravg and MASD were computed from the resulting response pattern as for Fig. 1.
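A minimal sketch of how Ravg and the MASD could be computed from such a response pattern; the matrix layout is an assumption, not the authors' analysis code:

```python
import numpy as np

def rate_and_masd(resp, dt):
    """resp: assumed array of shape (n_cf, n_time) with rows ordered by CF (or by
    harmonic number CF/F0), columns by (normalized) time, values in spikes/s."""
    # Average discharge rate: average the response at each CF over time.
    r_avg = resp.mean(axis=1)
    # Mean absolute spatial derivative: point-by-point difference between adjacent
    # rows (adjacent CFs), rectified and integrated over time.
    spatial_derivative = np.diff(resp, axis=0)
    masd = np.abs(spatial_derivative).sum(axis=1) * dt
    return r_avg, masd
```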
3 Results
Figure 3 shows the responses to complex tones for three AN fibers with different CFs. For the low-CF fiber (700 Hz, A), the response latency varies uniformly with harmonic number, indicating that cochlear frequency selectivity is insufficient to resolve harmonics at this CF; as a result, neither rate-place nor spatio-temporal pitch cues can be detected when examining Ravg and MASD. The spatio-temporal response pattern of the fiber with an intermediate CF (2150 Hz, B) shows non-uniform variations in response latency with harmonic number qualitatively similar to those predicted by the model of Fig. 2. The latency varies rapidly at integer harmonic numbers, while it changes more slowly between integers. As a result, the MASD shows local maxima at Harmonics 2, 3 and 4, thus providing evidence for spatio-temporal cues to pitch in the AN response. Ravg also shows peaks at integer harmonic numbers, although they are less pronounced than for the MASD. For the high-CF fiber (4.3 kHz, C), the spatio-temporal response pattern shows no evidence of phase locking to individual harmonics. As a result, the MASD is basically flat. In contrast, Ravg shows pronounced peaks at Harmonics 2-6, consistent with the improvement in relative frequency selectivity at higher CFs (Cedolin and Delgutte 2005). Thus, because the spatio-temporal representation depends on phase locking, it predicts an upper F0-limit to pitch which is consistent with psychophysical observations but is not seen in the rate-place representation.
We have hypothesized that the spatio-temporal representation of pitch may remain effective at stimulus levels where the rate-place representation breaks down. Figure 4 shows results for one AN fiber at 3 different levels (10, 25 and 40 dB re. threshold). At the low level (A), Harmonics 2, 3, and possibly 4 appear as distinct peaks in both Ravg and MASD. At the intermediate level (B), Ravg begins to show signs of saturation as only Harmonic 2 is apparent. In contrast, strong latency cues to Harmonics 2, 3, and possibly 4 are still present in the spatio-temporal response pattern, resulting in corresponding prominent peaks in MASD. At the highest level (C), Ravg is completely saturated, while peaks at Harmonics 2 and 3 are still easily detectable in the MASD.
Fig. 3 Responses of 3 AN fibers with different CFs to series of harmonic complex tones. The stimuli were at 10, 20 and 30 dB, respectively, above each fiber’s threshold
This example supports our hypothesis that the spatio-temporal representation is more robust with respect to stimulus level than the rate-place representation. Intuitively, the more pronounced the oscillations in Ravg and MASD, the better individual harmonics are resolved and therefore the stronger the pitch cues. To quantify this intuition, we fit a damped cosinusoidal function of harmonic number separately to Ravg and MASD, then use the area between the top and bottom envelopes of the fitted curve as a measure of the strength of the pitch representation. Since Ravg and MASD have different units, we express this area relative to the typical standard deviation of the data points,
making this metric analogous to a d'. We call this metric the harmonic strength of the MASD or the Ravg. To compare the strengths of the two pitch representations, we define a metric called “normalized strength difference” (NSD) as the difference between the harmonic strength of the MASD and that of the Ravg, divided by their sum. NSDs take values between −1 and 1, with positive values indicating that MASD provides stronger pitch cues than Ravg.
Fig. 4 Spatio-temporal response pattern of an AN fiber (CF = 1920 Hz) to a series of harmonic complex tones at 3 different stimulus levels. Pure-tone threshold at CF was 25 dB SPL
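A minimal sketch of these metrics under stated assumptions (the damped-cosine parameterization and the fitting defaults are illustrative choices, not the authors' exact procedure):

```python
import numpy as np
from scipy.optimize import curve_fit

def damped_cosine(x, baseline, amplitude, decay, phase):
    # Oscillates with period 1 in harmonic number, with an exponentially decaying envelope.
    return baseline + amplitude * np.exp(-decay * x) * np.cos(2.0 * np.pi * (x - phase))

def harmonic_strength(harmonic_number, y, y_sd):
    """y: Ravg or MASD sampled against harmonic number; y_sd: typical SD of the points."""
    p0 = [float(np.mean(y)), float(np.std(y)), 0.1, 0.0]
    params, _ = curve_fit(damped_cosine, harmonic_number, y, p0=p0, maxfev=10000)
    _, amplitude, decay, _ = params
    # Area between the top and bottom envelopes of the fitted curve, expressed
    # relative to the typical standard deviation of the data points.
    gap = 2.0 * np.abs(amplitude) * np.exp(-decay * harmonic_number)
    dx = np.diff(harmonic_number)
    area = np.sum(0.5 * (gap[:-1] + gap[1:]) * dx)
    return area / y_sd

def normalized_strength_difference(s_masd, s_ravg):
    # Positive values mean the MASD carries the stronger pitch cues.
    return (s_masd - s_ravg) / (s_masd + s_ravg)
```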
Figure 5 shows normalized strength differences against CF for 3 different level ranges for our data set. For CFs between 1 and 3 kHz, the strengths of the two representations are comparable at low levels (A). In this CF range, the NSDs tend to be positive at moderate levels (B), indicating that MASD better represents resolved harmonics than Ravg. This tendency is even more pronounced at high levels (C), where virtually all NSDs are positive. Thus, the spatio-temporal representation is more robust at higher stimulus levels than the rate-place representation in this CF range. For CFs above 3 kHz, NSDs tend to take large negative values at all levels, indicating that rate cues to resolved harmonics are stronger than latency cues. This result is most likely caused by the degradation of phase-locking with increasing frequency of the resolved harmonics near the CF. Overall, the spatio-temporal pitch representation is most effective for CFs between 1 and 3 kHz.
Fig. 5 Scatter plots of normalized strength difference (NSD) of MASD re. Ravg against CF for 3 different level ranges (expressed in dB re. pure-tone threshold at CF). Each point shows data for one AN fiber. Plots only include measurements for which MASD and Ravg oscillated sufficiently against harmonic number to reliably fit a damped cosine curve
4 Discussion
We found that robust spatio-temporal cues to resolved harmonics are available in the response of AN fibers whose CFs are high enough for harmonics to be sufficiently resolved (above ~1 kHz in cat), but below the limit (~3 kHz) above which phase-locking is significantly degraded.
To translate the CF range where the spatio-temporal representation is effective into a range of stimulus F0s, we rely again on the scaling invariance principle. Since we almost always selected the F0 range for each fiber so that the harmonic number CF/F0 varied from 1.5 to 4.5, the fiber CF was on average 3 times greater than the F0. Hence, the 1-3 kHz CF range in which the spatio-temporal representation works best approximately corresponds to an F0 range from 300 Hz to 1 kHz, which covers the entire range of cat vocalizations. These limits are approximate as they depend on the signal-to-noise ratio of our measurements from single fibers.
What might be the corresponding F0 range in humans? Because the upper limit is determined by neural phase locking, and there are no strong reasons to assume that the frequency dependence of phase-locking greatly differs among mammalian species, it may be similar in humans. On the other hand
if, as argued by Shera et al. (2002) (but see Ruggero et al. 2005), cochlear frequency selectivity is 2-3 times sharper in humans than in cats, the 300-Hz lower F0-limit in cats might translate to about 100 Hz in humans. If so, the F0 range where the spatio-temporal representation works best in humans would encompass most of the range of the human voice.
The proposed spatio-temporal representation of pitch seems to overcome some of the main limitations of the traditional rate-place and interspike-interval representations in accounting for the main trends in human psychophysics. Unlike the rate-place representation, it predicts the existence of an upper F0 limit to the perception of the pitch of missing-F0 complex tones, and it remains effective at high stimulus levels; unlike the interspike-interval representation, its strength depends strongly on harmonic resolvability, and it does not require long neural delays or precise neural oscillators for which there is little physiological evidence.
A key question is whether the spatio-temporal pitch cues available in the AN are extracted in the central nervous system. In principle, the spatio-temporal cues could be extracted by a neural mechanism that (1) receives inputs from auditory nerve fibers with neighboring CFs and (2) is sensitive to differences in the timing of these inputs. Two such mechanisms are lateral inhibition (Shamma 1985) and cross-frequency coincidence detection (Carney 1990), which would likely produce local maxima and local minima, respectively, in discharge rate at locations corresponding to the harmonics, where the rapid phase changes would cause inputs with neighboring CFs to be less coincident. Since evidence for both mechanisms exists in the cochlear nucleus, this site seems to be a promising focus for future studies.
Acknowledgments. This work was supported by NIH grants DC 02258 and 05209.
References
Carlyon RP and Shackleton TM (1994) Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms? J Acoust Soc Am 95.
Carney LH and Yin TCT (1988) Temporal coding of resonances by low-frequency auditory nerve fibers: single-fiber responses and a population model. J Neurophysiol 60.
Carney LH (1990) Sensitivities of cells in the anteroventral cochlear nucleus of cat to spatiotemporal discharge patterns across primary afferents. J Neurophysiol 64.
Cedolin L and Delgutte B (2005) Pitch of complex tones: Rate-place and interspike interval representations in the auditory nerve. J Neurophysiol 94: 347–362.
Glasberg BR and Moore BCJ (1990) Derivation of auditory filter shapes from notched-noise data. Hear Res 47: 103-138.
Moore BCJ (1973) Some experiments relating to the perception of complex tones. Q J Exp Psychol 25: 451-475.
Pfeiffer RR and Kim DO (1975) Cochlear nerve fiber responses: Distribution along the cochlear partition. J Acoust Soc Am 58: 867-965.
Shera CA, Guinan JJ, Jr., and Oxenham AJ (2002) Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. Proc Natl Acad Sci USA 99.
Shamma SA (1985) Speech processing in the auditory system. I: The representation of speech sounds in the responses of the auditory nerve. J Acoust Soc Am 78: 1612-1621.
Ruggero M and Temchin AN (2005) Unexceptional sharpness of frequency tuning in the human cochlea. Proc Natl Acad Sci USA 102: 18614-18619.
Zhang X, Heinz MG, Bruce IC, and Carney LH (2001) A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. J Acoust Soc Am 109: 648-670.
Zweig G (1976) Basilar membrane motion. Cold Spr Harb Symp Quant Biol 40: 619-633.
Comment by Greenberg
Could you describe how your model would handle the so-called “dominance region” for pitch? Ritsma (1967) and Plomp (1967) (as well as others) have shown that the spectral region generating the strongest sensation of pitch often varies as a function of fundamental frequency (f0). For an f0 of 500 Hz, the dominant harmonics are the second and third, while for an f0 of 100 Hz the sixth and seventh harmonics are dominant. In your model, the frequency resolution of the auditory periphery is logarithmically constant across frequency, which makes it difficult to accommodate findings such as those reported by Ritsma and Plomp. In other words, your model would seem to predict that the strongest pitch would be generated by a certain set of harmonics regardless of fundamental frequency (up to the limit of musical pitch). One way (of several) to resolve this issue would be to assume that the frequency selectivity of the auditory periphery is not constant Q (as in your model) but varies in a manner consistent with the tuning characteristics of auditory nerve fibers (e.g., Evans, 1975). In such studies, the Q10 dB of fibers varies between 0.5 for units with characteristic frequencies below 800 Hz to approximately 2 for the spectral region of 1.5 kHz (the upper limit of the dominance region).
References
Evans, E. (1975) The cochlear nerve and cochlear nucleus. In Handbook of Sensory Physiology. W. D. Keidel (ed.). Heidelberg: Springer, pp. 1-109.
Plomp, R. (1967) Pitch of complex tones. Journal of the Acoustical Society of America 41: 1526-1533.
Ritsma, R. (1967) Frequencies dominant in the perception of pitch of complex sounds. Journal of the Acoustical Society of America 42: 191-198.
Reply
Since our paper primarily reports single-unit data from the cat auditory nerve, the increase in cochlear frequency selectivity (as measured by Q) with CF is naturally included. We only assumed constant-Q filtering (as prescribed
by scaling invariance) for the specific purpose of estimating the spatial derivative from the responses of a single AN fiber to a set of complex tones with varying F0. Since F0 was varied over only 1.6 octave, and the dependence of Q on CF is very gradual, the bandwidth errors resulting from this assumption are small. Specifically, the differences between the neural bandwidths (measured from reverse correlation functions by Carney and Yin, 1988) and those predicted from the constant-Q assumption never exceeded ±11% for any CF within the range investigated.
We don’t see how the increase in Q with CF could explain how the dominance region for pitch depends on F0 because higher-order harmonics are increasingly well resolved as F0 increases due to the increase in Q, while, psychophysically, low-order harmonics are increasingly dominant at higher F0s (Moore et al. 1985; Dai, 2000). The spatio-temporal representation offers a solution to this problem because, by requiring phase locking to the harmonics, it imposes an upper frequency limit (~3000 Hz) to which harmonics can contribute to pitch. Since higher-order harmonics will increasingly exceed this limit as F0 increases, pitch estimation from spatio-temporal cues has to rely increasingly on low-order harmonics, consistent with the psychophysics.
It is more difficult to account for the psychophysical observation that the dominant harmonics are not always the lowest ones for low F0s. In our data, the lowest harmonic present (Harmonic 2) is always the most prominent in the spatio-temporal representation. However, for each fiber, we selected stimulus levels for our complex tones relative to the pure-tone threshold at CF; this procedure effectively equalizes the middle-ear transfer function, which would otherwise attenuate low-frequency harmonics. Since the perceptual dominance of a harmonic increases with its relative amplitude (Moore et al., 1985), low-order harmonics may be sufficiently attenuated by the middle ear at low F0s that they can no longer contribute to pitch. Note that the dominant harmonics at low F0s vary substantially between studies depending on the method used (Ritsma, 1967; Plomp, 1967; Moore et al. 1985; Dai, 2000), and there can be large intersubject differences within the same study (Moore et al. 1985). For example, for one of the subjects of Moore et al. (1985), the fundamental was dominant for F0 = 200 Hz.
References
Carney, LH, and Yin, TCT (1988). Temporal coding of resonances by low-frequency auditory-nerve fibers: Single-fiber responses and a population model. J Neurophysiol. 60: 1653-1677.
Dai, H. (2000). On the relative influence of individual harmonics on pitch judgment. J Acoust Soc Am 107: 953-959.
Moore, BCJ, and Glasberg BR (1985). Relative dominance of individual partials in determining the pitch of complex tones. J Acoust Soc Am 77: 1853-1860.
Plomp, R. (1967) Pitch of complex tones. J Acoust Soc Am 41: 1526-1533.
Ritsma, R. (1967) Frequencies dominant in the perception of pitch of complex sounds. J Acoust Soc Am 42: 191-198.
9 Virtual Pitch in a Computational Physiological Model RAY MEDDIS AND LOWEL O’MARD
1 Introduction
There are many different explanations of the origin of virtual pitches, and these are often categorized as either ‘spectral’ or ‘temporal’. This chapter addresses a group of hypotheses in the ‘temporal’ category. These theories assume that virtual pitch arises from temporal regularity or periodicities in sounds, and that these regularities can be characterized using statistical methods such as autocorrelation. This approach makes many quantitatively detailed and often successful predictions concerning the outcome of a wide range of virtual pitch experiments.
The status of autocorrelation models remains controversial, however, because they appear to be physiologically implausible. There are no structures in the auditory brainstem that look capable of carrying out the delay and multiply operations required by autocorrelation. This is a major impediment to a general acceptance of an autocorrelation-type model of virtual pitch perception. We address this issue by showing that a model using physiologically plausible components can behave in some important respects like autocorrelation and can simulate a number of the most important virtual pitch phenomena.
This model is an expanded version of an older computational model that was originally developed to simulate the response of single units in the auditory brainstem to sinusoidally amplitude modulated (SAM) tones (Hewitt and Meddis 1993, 1994). In those studies it was shown that the model is able to simulate appropriate modulation transfer functions (MTFs) in cochlear nucleus (CN) neurons when stimulated using SAM tones. It can also simulate appropriate rate MTFs in single inferior colliculus (IC) neurons. An earlier study has already shown that the response of the model CN units is a successful simulation of the neuron’s response to broadband pitch stimuli (Wiegrebe and Meddis 2004).
An additional cross-channel processing stage has been added to allow information to be aggregated across channels. This is essential for pitch processing because the stimuli with the clearest pitch consist of harmonics that
Department of Psychology, Essex University, Colchester, CO4 3SQ, UK,
[email protected]
are resolved by the periphery. The overall architecture of the model is the same as that of an autocorrelation model (Meddis and O’Mard 1997). It consists of three stages: 1) peripheral segregation of sound into frequency bands, 2) extraction of periodicities on a within-channel basis, and 3) aggregation of periodicity information across BF-channels. The novelty in the model lies in the way in which periodicity is extracted; using physiologically plausible circuits rather than an artificial mathematical device.
2 The Model
The model contains thousands of individual components but is modular in structure (Fig. 1). The basic building block of the system is a module consisting of a cascade of three stages: 1) auditory nerve (AN) fibers, 2) CN units, and 3) an IC unit. Each module has a single IC cell receiving input from 10 CN units, all with the same saturated chopping rate. Each CN unit receives input from 30 AN fibers, all with the same BF. All modules are identical except for BF and the saturated firing rate of the CN units. Within a module, it is the saturated firing rate of the CN units that determines the selectivity of the IC rate response to periodicity. The CN units are modeled on CN chopper units that chop at a fixed rate in response to moderately intense acoustic stimulation.
Fig. 1 A A single constituent module. A module consists of 10 CN units feeding one IC unit. Each CN unit receives input from 30 AN same-BF fibers. Within a module, all CN units have the same saturated firing (chopping) rate. B Arrangement of modules. There are 30 modules within each BF channel, each with a different characteristic firing rate. There are 40 channels with BFs ranging from 100 to 10,000 Hz. A stage-4 unit receives one input from each channel from modules with the same saturated chopping rate
These modules are the same as those described in Hewitt and Meddis (1994). The extended model replicates this core module many times within a single BF channel using different chopping rates in different modules. There are 10 CN units in each block and, within a single channel, there are 30 blocks, each characterised by its chopping rate. This arrangement is replicated across 40 BF channels, making a total of 12,000 CN units and 1200 IC units.
This multi-rate, multi-BF architecture provides the basis for a further extension of the model, a fourth stage where periodicity information is aggregated across BF channels. This pitch-extraction stage is added to the model in the form of an array of hypothetical ‘stage 4’ units called the ‘stage 4 profile’. Each stage 4 unit receives input from one IC unit from each BF channel, where all the contributing IC units have the same best modulation frequency and, hence, the same CN chopping rate.
The within-module details are largely unchanged from the original modelling studies. Improvements to the detail of the AN model have also been included in order to be consistent with recent modelling work from this laboratory. The AN section of the model has, for example, been recently updated and is fully described in Meddis (2006). The parameters of the dual resonance nonlinear (DRNL) filterbank are based on human psychophysical data (Lopez-Poveda and Meddis 2001).
3 Implementation Details
Stage 1: auditory periphery. The basilar membrane was modelled as 40 channels whose best frequencies (BFs) were equally spaced on a log scale across the range 100–10,000 Hz. This was implemented as an array of dual resonance nonlinear (DRNL; Meddis et al. 2001) filters. All fibers were modelled identically, except for stochasticity, as high spontaneous rate (HSR) fibers with thresholds at 0 dB SPL at 1 kHz. The output of the auditory periphery was a stochastic stream of independent spike events in each auditory nerve fiber.
Stage 2: CN units. These are implemented as modified MacGregor cells (MacGregor 1987) and are fully described in Hewitt and Meddis (1993) and Meddis (2006). The intrinsic (saturated) chopping rate of the CN units is determined in the model by the potassium recovery time constant (τGk). This time constant is varied systematically across the array of units in such a way as to produce 30 different chopping rates equally spaced between 60 and 350 spikes/s. The appropriate values of τGk were determined with an empirically derived formula, τGk = rate^(−1.441), based on the pure tone response at 65 dB SPL. Time constants varied between 2.8 ms (60 Hz) and 0.22 ms (350 Hz).
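A minimal numerical check of this relation (the units, seconds and spikes/s, are an assumption consistent with the quoted end points; this is illustrative, not the authors' code):

```python
import numpy as np

chop_rates = np.linspace(60.0, 350.0, 30)   # 30 chopping rates, 60-350 spikes/s
tau_gk = chop_rates ** -1.441               # potassium recovery time constants (s)

# Roughly 2.7 ms and 0.22 ms, close to the values quoted in the text.
print(tau_gk[0] * 1e3, tau_gk[-1] * 1e3)
```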
Stage 3: IC units. These are described in full in Hewitt and Meddis (1994) and are implemented here using the same MacGregor algorithm as used for the CN units. A single IC unit receives input from 10 CN units. It is a critical (and speculative) feature of the model that each IC unit receives input only from CN units with the same intrinsic chopping rate. The thresholds of the IC units are set to require coincidental input from many CN units.
Stage 4 units. These units receive input from 40 IC units (one per BF channel). All inputs to a single unit have the same rate-MTF as determined by the intrinsic chopping rate of the CN units feeding the IC unit. It is assumed that each spike input to the stage 4 unit provokes an action potential. Therefore, stage 4 units are not coincidence detectors but simply counters of all the spikes occurring in their feeder units. There are 30 stage 4 units, one for each CN rate. The output of the model is, therefore, an array of 30 spike counts called the ‘stage-4 profile’.
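A minimal sketch of the stage-4 profile computation implied by this description; the array layout is an assumption:

```python
import numpy as np

def stage4_profile(ic_spike_counts):
    """ic_spike_counts: assumed array of shape (40 BF channels, 30 chopping rates)
    holding the spike count of each IC unit for one stimulus."""
    # Each stage-4 unit simply counts every spike from its 40 same-rate IC inputs,
    # so summing over BF channels yields the 30-element stage-4 profile.
    return np.asarray(ic_spike_counts).sum(axis=0)
```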
4 Evaluation
4.1 Pitch of a Harmonic Tone
Figure 2A shows the stage 4 profile in response to three ‘missing fundamental’ harmonic tones composed of harmonics 3 through 8 presented at 70 dB SPL for 500 ms. There is no spectral energy in the signals at their fundamental frequencies (F0 = 150, 200 and 250 Hz). The profiles shift systematically to the right as F0 increases. Figure 2B shows the profile for the same F0s where the tone is composed of harmonics 13–18 only. The effect of changing F0 is similar in that the profile shifts to the right as F0 increases. Despite differences in the overall shape, it is clear that the profiles in both figures discriminate easily among the three different pitches. The left-to-right upward slope in Fig. 2A is explained in terms of the intrinsic chopping rates of the different modules. High chopping rates in the CN units give rise to more activity in those IC units that receive their input. It is
Fig. 2 Stage 4 rate profile for three 500-ms harmonic tones with F0=150, 200 and 250 Hz presented at 70 dB SPL. The x-axis is the saturated rate of firing of the CN units at the base of the processing chain. The y-axis is the rate of firing of the stage 4 units: A harmonics 3-8; B harmonics 13–18
significant that the stimuli with unresolved harmonics (Fig. 2B) do not show this continued upward slope. The horizontal nature of the function suggests that the stage 4 rate is not reflecting the intrinsic chopping rate of the CN units. In contrast, for higher F0 the plateau at the right-hand end of the function is higher and reflects the frequency of the envelope of the stimulus. The plateau is not present for resolved harmonics because the signal in the individual channels does not have a pronounced envelope.
4.2 Inharmonic Tones
Patterson and Wightman (1976) showed that the pitch of a harmonic complex remained strong when the complex was made inharmonic by shifting the frequency of all components by an equal amount. The heard pitch of the complex shifts by a fraction of the shift of the individual components. This pitch shift is large when the stimulus is composed of low harmonics but small for complexes composed of only high harmonics. For example, Moore and Moore (2003) showed that a complex composed of three resolved harmonics would show a pitch shift of 8% when the individual harmonics were all shifted by 24%. On the other hand, a complex of three unresolved harmonics showed little measurable pitch shift. This was true for F0s of 50, 100, 200 and 400 Hz. Moore and Moore’s stimuli were used in the following demonstration which used 70-dB SPL, 500-ms tones consisting of either three resolved harmonics (3, 4, 5) or three unresolved harmonics (13, 14, 15) with F0 = 200 Hz. Pitch shifts were produced by shifting all component tones by either 0, 24, or 48% and generating a stage 4 profile for each. The resulting stage 4 profiles are shown in Fig. 3. The profiles for the resolved harmonics change with the shift. Shifting the frequencies of the unresolved
Fig. 3 Stage 4 rate profiles for shifted harmonic stimuli (F0 = 200 Hz). Shifts applied equally to all harmonics are 0, 24 or 48 Hz: A harmonics 4, 5 and 6 (resolved); B harmonics 13, 14 and 15 (unresolved). Shifting the harmonics has a larger effect when the harmonics are resolved
harmonics (Fig. 3B) had little effect, however. Qualitatively at least, this replicates the results of Moore and Moore. When all harmonics are shifted by the same amount, the envelope of the signal is unchanged because it depends on the spacing between the harmonics, which remains constant. The model reflects the unchanged periodicity of the envelope of the stimulus with unresolved harmonics in Fig. 3B. On the other hand, the model reflects the changing values of the resolved components when resolved stimuli are used (Fig. 3A).
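For reference, here is a minimal sketch (Python; illustrative only: the text does not specify component phases or amplitudes, so equal-amplitude cosine-phase components are assumed, and the shift value below is a placeholder) of the harmonic and frequency-shifted complexes used in Sects. 4.1 and 4.2:

```python
import numpy as np

def complex_tone(f0, harmonics, shift_hz=0.0, dur=0.5, fs=48000):
    """Sum of equal-amplitude cosine-phase components at n*f0 + shift_hz."""
    t = np.arange(int(dur * fs)) / fs
    x = sum(np.cos(2 * np.pi * (n * f0 + shift_hz) * t) for n in harmonics)
    return x / len(harmonics)

# Sect. 4.1: 'missing fundamental' tones, harmonics 3-8, F0 = 150, 200 or 250 Hz.
low_harmonic_tones = {f0: complex_tone(f0, range(3, 9)) for f0 in (150, 200, 250)}

# Sect. 4.2: every component shifted by the same amount, so the spacing between
# components (and hence the envelope period) is unchanged while the components
# themselves become inharmonic.
resolved_shifted = complex_tone(200, (3, 4, 5), shift_hz=48.0)      # shift value illustrative
unresolved_shifted = complex_tone(200, (13, 14, 15), shift_hz=48.0)
```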
4.3 Iterated Ripple Noise
Iterated ripple noise (IRN) is of particular interest in the study of pitch because it produces a clear pitch percept but does not have the pronounced periodic envelope that is typical of harmonic and inharmonic tone complexes. When IRN is created by adding white noise to itself after a delay (d), the perceived pitch is typically matched to a pure tone or harmonic complex whose fundamental frequency is 1/d. The strength of the pitch percept is also proportional to the number of times the delay-and-add process is repeated. The model was evaluated using stimuli constructed with delays of 6.67, 5 and 4 ms and a gain of 1. These stimuli have pitches around 150, 200 and 250 Hz (the reciprocals of the delays). When only three iterations are used, a clear shift in the stage 4 rate profile can be seen (Fig. 4A). When the number of iterations is increased, the differences become more obvious (Fig. 4B). These profiles can be compared with those produced with harmonic stimuli in Fig. 2 where the perceived pitches are the same. The comparison between IRN with 16 iterations in Fig. 4B and harmonic tones consisting of resolved components (Fig. 2A) is the clearest. However, the IRN profiles do not contain a plateau at high chopping rates previously seen in Fig. 2B. This is significant because the plateau signifies a response to the envelope of the stimulus and IRN stimuli do not have an envelope related to the perceived pitch.
Fig. 4 Stage 4 rate profiles produced by the model in response to iterated ripple noise stimuli with iteration delays corresponding to pitches of 150, 200 and 250 Hz: A 3 iterations; B 16 iterations
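As an illustration of the delay-and-add construction described above, here is a minimal sketch (Python; one common 'add-same' form of the iteration, with parameter values chosen to match the stimuli in this section, and level calibration omitted):

```python
import numpy as np

def iterated_ripple_noise(delay_s, n_iter, gain=1.0, dur=0.5, fs=48000, rng=None):
    """Delay-and-add IRN: repeatedly add a delayed, scaled copy of the running signal to itself."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(int(dur * fs))
    d = int(round(delay_s * fs))
    for _ in range(n_iter):
        y = x.copy()
        y[d:] += gain * x[:-d]      # add the signal to itself after delay d
        x = y
    return x / np.max(np.abs(x))    # simple normalisation

# Delays of 6.67, 5 and 4 ms give pitches near 150, 200 and 250 Hz (pitch ~ 1/delay);
# 3 vs 16 iterations corresponds to the weak and strong profiles of Fig. 4A and 4B.
irn_weak = iterated_ripple_noise(1.0 / 150.0, n_iter=3, gain=1.0)
irn_strong = iterated_ripple_noise(1.0 / 150.0, n_iter=16, gain=1.0)
```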
5 Discussion
The aim of this study was to show that a model using only physiological components could share some of the properties previously shown to characterize cross-channel autocorrelation models. The mathematical model and the new physiological model already have a great deal in common. Both use an auditory-periphery first stage to represent auditory nerve spike activity. Both extract periodicity information on a within-channel basis. Both models accumulate information across channels to produce a periodicity profile. They differ primarily in the mechanism used to determine periodicities. In the physiological model, periodicity detection uses CN units working together with their target IC units. This replaces the delay-and-multiply operations of the autocorrelation method. The physiological mechanism depends on the synchronization properties of the model CN chopper units. When the choppers are driven by a stimulus periodicity that coincides with their intrinsic driven firing rate, all the CN units with the same firing rate will begin to fire in synchrony with the stimulus and with each other. This periodicity may originate equally from low frequency pure tones or from modulations of a carrier frequency. It is this synchrony that drives the receiving IC units. This is not exactly the same as autocorrelation, however, and it is likely that differences in the detail of the mathematical and physiological systems will produce some differences in the predictions they make. So far, however, the evaluation has not shown any major differences. It would be premature to claim that the model described above is a complete pitch model. There are more pitch phenomena than those considered here and the new model needs to be tested on a much wider range of stimuli. Indeed, some stimuli produce a pitch that is claimed not to be predicted by existing autocorrelation models. These are matters for further study. Nevertheless, the project has demonstrated that a cross-channel autocorrelation model can be simulated, to a first approximation, by a physiological model and is worthy of further consideration.
References Hewitt MJ, Meddis R (1993) Regularity of cochlear nucleus stellate cells: a computational modelling study. J Acoust Soc Am 93:3390–3399 Hewitt MJ, Meddis R (1994) A computer model of amplitude-modulation sensitivity of single units in the inferior colliculus. J Acoust Soc Am 95:2145–2159 Lopez-Poveda EA, Meddis R (2001) A human nonlinear cochlear filterbank. J Acoust Soc Am 110:3107–3118 MacGregor RJ (1987) Neural and brain modeling. Academic Press, San Diego Meddis R (2006) Auditory-nerve first-spike latency and auditory absolute threshold: a computer model. J Acoust Soc Am 119:406–417 Meddis R, O’Mard LP (1997) A unitary model of pitch perception. J Acoust Soc Am 102:1811–1820
Meddis R, O’Mard LP, Lopez-Poveda EA (2001) A computational algorithm for computing nonlinear auditory frequency selectivity. J Acoust Soc Am 109:2852–2861 Moore GA, Moore BCJ (2003) Perception of the low pitch of frequency-shifted complexes. J Acoust Soc Am 113:977–985 Patterson RD, Wightman FL (1976) Residue pitch as a function of component spacing. J Acoust Soc Am 59:1450–1459 Wiegrebe L, Meddis R (2004) The representation of periodic sounds in simulated sustained chopper units of the ventral cochlear nucleus. J Acoust Soc Am 115:1207–1218
Comment by Kollmeier Your neural circuit seems to implement a modulation filterbank in a way similar to that in the dissertation by Ulrike Dicke (Neural models of modulation frequency analysis in the auditory system, Universität Oldenburg, 2003, download at http://docserver.bis.uni-oldenburg.de/publikationen/dissertation/2004/dicneu03/dicneu03.html) and Dicke et al. (2006). However, in your model the modulation tuning is critically dependent on the time constant of the model chopper unit, which may not be a stable physiological quantity of a single cell and is also unlikely to vary over several octaves (as would be required to predict physiological and psychoacoustical data). In contrast, the Dicke approach uses a neural circuit that does not employ a continuously changing temporal parameter to obtain different best modulation frequencies (BMFs) of the IC modulation bandpass units. Instead, different BMFs are obtained by varying the number of input units projecting onto different bandpass units. What evidence do you have that the chopper unit time constant is a stable and scalable property across different cells, as opposed to the assumption that the modulation tuning is a network property rather than a property of a single neuron? References Dicke U, Ewert S, Dau T, Kollmeier B (2006) A neural circuit transforming temporal periodicity information into a rate-based representation in the mammalian auditory system (submitted)
Reply We agree that there are aspects of the CN model that need further investigation. For example, it might be that the intrinsic chopping rate of a unit is controlled by some other factor such as the number of input AN fibers or inhibitory modulation. We are currently investigating this issue. We are also investigating the question of whether a full range of chopping frequencies is necessary to simulate the pitch results. These are open questions.
Comment by Nelson This comment concerns the continued use of cochlear nucleus (CN) chopper neurons as fundamental components of the neural circuitry in simulations of responses to AM in the inferior colliculus (IC). Two sets of empirical observations appear inconsistent with the assumptions of these models, such as the one described here (originally suggested by Hewitt and Meddis 1994). First, real VCN chopper neurons do not typically exhibit maximum synchrony at stimulus AM rates equal to their inherent chopping rate. The correlation between synchrony best modulation frequency (BMF) and chopping frequency is weak (Frisina et al. 1990), but a one-to-one correspondence between these two metrics is an inherent feature of the Hewitt and Meddis model. The second (more important) piece of physiological data is related to the assumed tight coupling between CN synchrony tuning and IC rate tuning to AM in the model: in Hewitt and Meddis’ implementation, IC rate-BMFs are equal to their input (CN chopper) synchrony-BMFs. This is not borne out in the data either, because the range of CN chopper synchrony-BMFs (~150–700 Hz; Rhode and Greenberg 1994; Frisina et al. 1990) does not match the distribution of IC cell rate-BMFs (~1–150 Hz; Krishna and Semple 2000). An alternative physiologically based model (Nelson and Carney 2004) can extract periodicity information in the form of band-pass rate-MTFs without the use of an intermediate population of chopper neurons. Instead, temporal interactions between excitation and inhibition underlie rate tuning and enhanced synchronization at the level of the model IC cells. References Frisina RD, Smith RL, Chamberlain SC (1990) Encoding of amplitude modulation in the gerbil cochlear nucleus. I. A hierarchy of enhancement. Hear Res 44:99–122 Hewitt MJ, Meddis R (1994) A computer model of amplitude-modulation sensitivity of single units in the inferior colliculus. J Acoust Soc Am 95:2145–2159 Krishna BS, Semple MN (2000) Auditory temporal processing: responses to sinusoidally amplitude-modulated tones in the inferior colliculus. J Neurophysiol 84:255–273 Nelson PC, Carney LH (2004) A phenomenological model of peripheral and central neural responses to amplitude-modulated tones. J Acoust Soc Am 116:2173–2186 Rhode WS, Greenberg S (1994) Encoding of amplitude modulation in the cochlear nucleus of the cat. J Neurophysiol 71:1797–1825
Reply We are aware of the excellent model of Nelson and Carney and also of the model presented at this conference by Bahmer and Langner. It is important to stress that either of these two models might be substituted for our own in stages two and three of the pitch model in so far as they simulate IC responses
to amplitude modulated sounds. Our choice of an old IC model was motivated primarily by simplicity, convenience and familiarity. The main thrust of our paper, of course, does not concern the choice of IC model. Rather it makes the point that large arrays of these units are sufficient to simulate a range of psychophysical pitch results. When our IC model was developed 15 years ago, the decision to use sustained chopper units in the CN was made in the full knowledge of the physiology referred to in the comment. While the choice of this type of unit was not intuitively obvious, it was successful in the sense that it simulated the data (in which we put our trust). Whether the model is the correct one or not is an empirical issue that remains undecided. It is hoped that the availability of at least three competing models will stimulate further physiological enquiry. Comment by Greenberg Would your model be consistent with complex pitch discrimination (by human listeners) on the order of 0.2–0.5% for spectrally non-overlapping harmonics (Wightman 1981)? References Wightman FL (1981) Pitch perception: an example of auditory pattern recognition. In: Getty DJ, Howard JH Jr (eds) Auditory and visual pattern recognition. Hillsdale, NJ: Lawrence Erlbaum, pp 3–25
Reply At present, I am doubtful whether a physiological model can produce that level of precision. However, I am not aware of any reason, in principle, why it should not work given enough units and computing power. Comment by Carlyon As you point out, your model is physiologically realizable and shares many properties with autocorrelation models. A weakness of the most popular autocorrelation model (Meddis and O’Mard 1997) is that it fails to account for the effects of resolvability, independently of the frequencies of the harmonics used to convey pitch information (Carlyon 1998; Bernstein and Oxenham 2005). That is, for a given F0 it predicts poorer discrimination for high-numbered than for low-numbered harmonics, but does not capture the interaction between F0 and frequency region observed in the
psychophysical literature (Shackleton and Carlyon 1994). Your Fig. 3 shows that it can capture the former finding; does it do any better than Meddis and O’Mard (1997) on the latter? References Bernstein JGW, Oxenham AJ (2005) An autocorrelation model with place dependence to account for the effect of harmonic number on fundamental frequency discrimination. J Acoust Soc Am 117:3816–3831 Carlyon RP (1998) Comments on “A unitary model of pitch perception” [J Acoust Soc Am 102:1811–1820 (1997)]. J Acoust Soc Am 104:1118–1121 Meddis R, O’Mard L (1997) A unitary model of pitch perception. J Acoust Soc Am 102:1811–1820 Shackleton TM, Carlyon RP (1994) The role of resolved and unresolved harmonics in pitch perception and frequency modulation discrimination. J Acoust Soc Am 95:3529–3540
Reply While the Shackleton and Carlyon (1994) study is frequently quoted in this context, it should be treated with caution with respect to establishing the status of autocorrelation as a reliable mathematical predictor of pitch percepts. That study used pitch matches between simultaneously presented tones, allowing for possible complex perceptual interactions between the tones. It would have been more convincing if the pitch matches had been made between non-simultaneous tones. Unfortunately this experiment has not been tried and, as a consequence, it would not be wise to abandon autocorrelation approaches on the basis of this one difficult-to-interpret study. We accept that the Bernstein and Oxenham (2005) study was a successful challenge to the original autocorrelation architecture where each possible lag was given equal weight in each channel. Fortunately, the authors were able to solve the problem satisfactorily by changing the weights applied to different lags according to the CF of the channel. It is likely that a similar weighting function might work in the physiological model. We also urge caution with respect to the concept of ‘resolvability’ in this context. It harks back to a simpler age when it was enough to characterise the auditory periphery as a bank of narrowly tuned linear bandpass filters. Nonlinearity in both the electrical and mechanical responses of the cochlea has caused us to say ‘Goodbye’ to all that. What might be resolved near threshold is not resolved at higher listening levels, while pitch remains largely insensitive to level.
10 Searching for a Pitch Centre in Human Auditory Cortex DEB HALL1 AND CHRISTOPHER PLACK2
1 Introduction
Recent data from human fMRI (Barrett and Hall 2006; Hall et al. 2005; Patterson et al. 2002; Penagos et al. 2004) and primate electrophysiological (Bendor and Wang 2005) studies have suggested that a region near the anterolateral border of primary auditory cortex may be involved in pitch processing. Collectively, these findings present strong support for a single central region of pitch selectivity. However, for a brain region to be referred to as a general pitch centre, its response profile should satisfy a number of criteria:
i) Pitch selectivity: responses to the pitch-evoking stimulus should be greater than to a control stimulus that does not evoke a pitch percept, but is matched as closely as possible with respect to acoustic features;
ii) Pitch constancy: selective responses should occur for all pitch-evoking stimuli, whatever their spectral, temporal or binaural characteristics and irrespective of whether there is spectral energy at the fundamental frequency (F0) (Tramo et al. 2005);
iii) Covariation with salience: the response magnitude should covary with pitch salience; and
iv) Elimination of peripheral phenomena: it must be possible to discount the contribution of peripheral effects, such as cochlear distortions (McAlpine 2004), to the pitch-evoked response.
In the current study we sought evidence for a pitch centre in humans that complies with these criteria. We combined psychophysical measurements of frequency and fundamental frequency difference limens (FDL) and fMRI measurements of the cortical response to five different pitch-evoking stimuli.
2 Methods
2.1 Stimuli
Five different pitch-evoking stimuli were generated:
(i) PT: Pure tone consisting of a 200-Hz pure tone and a Gaussian noise bandpass filtered between 500 Hz and 2 kHz
1 MRC Institute of Hearing Research, University Park, Nottingham, UK, [email protected]
2 Department of Psychology, Lancaster University, Lancaster, UK, [email protected]
(ii) WB: Wideband complex consisting of the harmonics of a 200-Hz F0 added in cosine phase and lowpass filtered at 2 kHz
(iii) Res: Resolved complex consisting of the harmonics of a 200-Hz F0 added in cosine phase and bandpass filtered between 1 and 2 kHz, together with a Gaussian noise masker lowpass filtered at 1 kHz to reduce the effect of combination tones
(iv) Unres: Unresolved complex consisting of the harmonics of a 100-Hz F0 added in alternating sine and cosine phase and bandpass filtered between 1 and 2 kHz, again with a Gaussian noise masker lowpass filtered at 1 kHz
(v) Huggins: Huggins pitch consisting of a Gaussian noise lowpass filtered at 2 kHz and presented diotically, except for a frequency region from 190 to 210 Hz (200 Hz±10%). This region was given a progressive phase shift, linear in frequency between 0 and 2π, in the left ear only.
Each of these five stimuli has a pitch equivalent to that of a pure tone at 200 Hz. For the fMRI experiment, a control stimulus consisting of a Gaussian noise lowpass filtered at 2 kHz was generated. With the exception of Huggins, stimuli were presented diotically. For the behavioural experiments the overall level in each ear was fixed at 83 dB SPL, and the “average” spectrum level was held constant at 50 dB (re. 2 × 10⁻⁵ N/m²). In other words, the noise, when present, had a spectrum level of 50 dB, the pure tone had a level of 77 dB SPL [50 + 10 log10(500)], the harmonics of the 200-Hz complexes had a level of 73 dB SPL, and the harmonics of the 100-Hz complex had a level of 70 dB SPL. For the behavioural experiment, the stimuli had a total duration of 200 ms including 10 ms onset and offset ramps. For the fMRI experiment, the stimuli were 500 ms, including 10 ms onset and offset ramps. These were repeated in a 15.5-s sequence, with 50-ms gaps between each stimulus. The sound levels delivered in the scanner were 87–88 dB SPL, measured at the ear.
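As an illustration of how the Huggins stimulus (v) can be constructed, here is a minimal sketch (Python; the FFT-domain construction is our choice, and the level calibration and onset/offset ramps described above are omitted):

```python
import numpy as np

def huggins_pitch(dur=0.5, fs=24000, f_lo=190.0, f_hi=210.0, cutoff=2000.0, rng=None):
    """Diotic lowpass noise; the left ear receives a progressive 0..2*pi phase shift,
    linear in frequency, across the f_lo..f_hi transition region."""
    rng = rng or np.random.default_rng()
    n = int(dur * fs)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, 1.0 / fs)
    spec[f > cutoff] = 0.0                        # lowpass filter at 2 kHz
    right = np.fft.irfft(spec, n)                 # right ear: unmodified noise
    phase = np.zeros_like(f)
    band = (f >= f_lo) & (f <= f_hi)
    phase[band] = 2.0 * np.pi * (f[band] - f_lo) / (f_hi - f_lo)
    phase[f > f_hi] = 2.0 * np.pi                 # a full cycle, i.e. no net change above the region
    left = np.fft.irfft(spec * np.exp(1j * phase), n)
    return left, right

left, right = huggins_pitch()  # interaural decorrelation near 200 Hz evokes the pitch
```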
2.2 Subjects
We recruited 16 normally hearing listeners (≤25 dB HL between 250 Hz and 6 kHz) from the university population. Their mean age was 24.5 years, ranging from 18 to 41, and the group comprised seven females and nine males. A majority of listeners were musically trained, with only two (#10 and #14) unable to read music or play an instrument. All listeners except one (#03) were strongly right-handed. The study was approved by the University Medical School Ethics Committee and written informed consent was obtained from all participants.
2.3 Pitch Discrimination
Pitch discrimination was measured using a two-down, one-up, adaptive procedure that estimates the 71% correct point on the psychometric function (Levitt 1971). The discrimination task was pitch direction (“in which interval was the
pitch higher?”). On each trial there were two observation intervals. The frequency, fundamental frequency or (in the case of Huggins) centre frequency of the phase-shifted region of the standard was fixed to produce a nominal pitch corresponding to 200 Hz. The pitch of the comparison was greater than this. The frequency difference between the standard and comparison intervals was varied using a geometric step size of 2 for the first four reversals and 1.414 thereafter. In each block, 16 reversals were measured and the threshold was taken as the geometric mean of the last 12. Five such estimates were made for each condition, and the final estimate was taken as the geometric mean of the last four. Two of the subjects (#10 and #12) could not hear the Huggins pitch and had thresholds greater than 100%. The thresholds for these subjects were assumed to be 100% for the purpose of subsequent analysis.
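A minimal sketch (Python; the simulated listener and starting value are placeholders, not part of the study) of the two-down, one-up track described above:

```python
import numpy as np

def two_down_one_up(trial, start_delta=0.20, n_reversals=16):
    """2-down/1-up adaptive track converging on ~71% correct. Geometric steps of 2
    until the fourth reversal and 1.414 thereafter; the block threshold is the
    geometric mean of the last 12 of 16 reversals, as described in the text."""
    delta, n_correct, direction = start_delta, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        step = 2.0 if len(reversals) < 4 else 1.414
        if trial(delta):                 # trial(delta) -> True if the response was correct
            n_correct += 1
            if n_correct == 2:           # two correct in a row -> make the task harder
                n_correct = 0
                if direction == +1:      # track changed direction -> reversal
                    reversals.append(delta)
                delta /= step
                direction = -1
        else:                            # any incorrect response -> make the task easier
            n_correct = 0
            if direction == -1:
                reversals.append(delta)
            delta *= step
            direction = +1
    return float(np.exp(np.mean(np.log(reversals[-12:]))))

# Toy listener: reliably hears differences above 2%, guesses otherwise (illustrative only).
rng = np.random.default_rng(1)
block_threshold = two_down_one_up(lambda d: d > 0.02 or rng.random() < 0.5)
```

Five such block thresholds per condition would then be combined as described, taking the geometric mean of the last four.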
2.4 fMRI Protocol
Scanning was performed on a Philips 3-T Intera using an 8-channel SENSE receiver head coil. For each listener, we first acquired a 4.5-min anatomical scan (1 mm³ resolution) of the whole head. Functional scans consisted of 20 slices taken in an oblique-axial plane, with a voxel size of 3 mm³. The anatomical scan was used to position the functional scan centrally on Heschl’s gyrus (HG). We took care also to include the superior temporal plane and superior temporal sulcus and to exclude the eyes. Functional scanning used a SENSE factor of 2 to reduce image distortions and a SofTone factor of 2 to reduce the background scanner noise level by 9 dB. Scans were collected at regular 8-s intervals, with the stimulus presented predominantly in the quiet periods between each scan. The functional experiment consisted of one 40-min listening session. In total it included 44 scans for each stimulus type and an additional 46 silent baseline scans, with the order of conditions randomised. Listeners were requested to attend to the sounds and to listen out for the pitch, but were not required to perform any task. Analysis of the imaging data was conducted using SPM2 (www.fil.ion.ucl.ac.uk/spm) separately for each listener. Pre-processing steps included within-subject realignment and spatial normalization. For each subject, normalized images were up-sampled to a voxel resolution of 2 mm³ and smoothed by 4 mm FWHM. This procedure meets the smoothness assumptions of SPM without compromising much of the original spatial resolution, thus preserving the precise mapping between structure and function.
3 Results
3.1 Pitch Discrimination
The geometric means of the pitch discrimination thresholds across subjects are shown in Fig. 1. Performance was best for WB and worst for Huggins. Interestingly, thresholds for Res and Unres were similar. It is usually reported
Fig. 1 Discrimination thresholds across the group of 16 listeners (geometric mean and standard error)
that thresholds for unresolved harmonics are substantially higher than those for resolved harmonics (Shackleton and Carlyon 1994).
3.2 Pitch Activation
Our first analysis confirmed that all listeners produced reliable sound-related activation (P < 0.05, FWE corrected) encompassing the primary auditory cortex on HG, posterior non-primary regions on lateral HG and the planum temporale, and anterior non-primary regions on the planum polare (PP). Pitch selectivity. To determine regions of pitch selectivity, we contrasted each pitch condition against the control Gaussian noise condition for individual listeners. For exploratory analyses, we used a very lenient criterion for false positives (P < 0.01, uncorrected for multiple testing). An overall map of pitch selectivity was generated by summing all pitch-related activations across the group. The white areas in Fig. 2 illustrate the spread of the pitch response that was present in at least 2/80 cases (5 pitch conditions × 16 listeners). The greatest overlap occurred in the left planum temporale at the x, y, z co-ordinate −58, −30, 12 mm, shown by the black dot. Even at this point activation was overlapping in only 11/80 cases, suggesting considerable variability across listeners. Despite the very relaxed statistical thresholding, our pitch-related activation did not extend across the anterolateral area (defined by the traced region in Fig. 2). It lay mostly posterior to HG, in the planum temporale. Although unexpected, this result is not wholly inconsistent with previous literature. Even though the peak activity typically lies in anterolateral HG, our own studies have shown that a pitch-evoking iterated rippled noise also engaged an anterior portion of the planum temporale (planned
Fig. 2 Coronal, sagittal and axial views positioned through the point of the most frequent pitchrelated activity (black dot) and showing the extent of all pitch-related activation (white areas). Activations are overlaid onto the mean anatomical image for the group
Fig. 3 View across the supratemporal plane illustrating the extent of pitch-related activation (white areas) in each of the pitch conditions. Those voxels plotted in white reached significance (P<0.01, uncorrected) for at least two listeners. Black dots indicate the significant peaks of activation occurring within each listener (P<0.01, FWE corrected for the volume of the auditory cortex)
comparison C in Barrett and Hall 2006). Furthermore, Penagos et al. (2004) also showed that resolved complex tones produced differential activation in posterior and lateral sites that were separate from those in anterolateral HG. Pitch constancy. To explore the question of pitch constancy, we repeated the above procedure but generated separate maps for the five pitch-evoking stimuli (Fig. 3). The white areas represent activity present in at least 2/16 listeners. Although there were differences in the precise pattern, all five pitch conditions produced significant auditory activation. Planum temporale was most widely activated by the WB condition. Remarkably, the Huggins pitch also evoked a significant response, even though this pitch is generated using very different acoustic cues from the other types of pitch and has the greatest FDL. The group maps in Figs. 2 and 3 conceal the fact that pitch-related activation rarely occurred at exactly the same point in the auditory cortex across
Fig. 4 The pitch centre in three listeners; position is shown using x, y, z co-ordinates and each listener’s anatomical image
the different listeners. The spatial consistency was much more striking within listeners, however. Our data would remain compatible with the notion of a pitch centre if pitch constancy were to be confirmed (i.e. if a significant response to all five pitch-evoking stimuli occurred in any one listener). Given that separate contrasts are thresholded at P < 0.01, the probability of this occurring by chance is very small (P < 10⁻¹⁰). Most of our listeners (N=10) did indeed produce conjoint activation for at least four of the pitch contrasts (P < 5 × 10⁻⁸). However, two observations differ from those predicted by previous literature. First, the location of the pitch centre varied a great deal from listener to listener (Fig. 4). In seven listeners (e.g. #14 and #15) it fell in different portions of the planum temporale, in one listener (#13) it fell in PP and in two listeners it was elsewhere. Second, the magnitude of the response within the pitch centre was unrelated to the perceptual salience of the pitch that had been measured psychophysically.
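The chance probabilities quoted above follow from a simple binomial calculation, assuming five independent contrasts each thresholded at P < 0.01; a quick check:

```python
from math import comb

p = 0.01
p_all_five = p ** 5                                    # all 5 contrasts significant by chance: 1e-10
p_at_least_four = comb(5, 4) * p**4 * (1 - p) + p**5   # ~4.95e-8, i.e. just under 5e-8
print(p_all_five, p_at_least_four)
```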
4 Conclusion
To our knowledge, this is the first study that has sought to identify a pitch centre whose response satisfies the criterion of pitch constancy across a range of different pitch-evoking stimuli. In most listeners, we found small regions within posterior non-primary auditory cortex that responded selectively to the pitch-evoking stimuli, even to the Huggins pitch stimulus, which evokes the weakest percept. This is the first time that a cortical response to a binaural pitch has been reported in humans. The two surprising caveats to our findings were that i) this apparent site for pitch processing occurred in different places in different listeners and ii) the
response did not vary consistently as a function of the pitch salience. Neither of these observations can be easily reconciled with current models of pitch coding and its neural representation within the auditory cortex. We were unable to find evidence for either pitch selectivity or pitch constancy in the anterolateral area of the human auditory cortex. Nevertheless, our data highlight the importance of other non-primary regions in pitch coding, a finding that has been reported, but perhaps not emphasized, by other researchers (e.g. Penagos et al. 2004). Our stringent criteria for inclusion could account for our failure to replicate previous findings. Not only was our pitch-related response required to generalize across the different pitch-evoking stimuli – it also had to be significantly greater than for a control noise matched for spectral energy, and it had to occur reliably for pitch-evoking stimuli that contained a noise masker around the missing F0 (Res and Unres conditions). Not all of these conditions have been met before. The present data perhaps raise more questions than they answer about the neural substrate of pitch processing. Apart from the questions that we have already raised, it is important to gain a better understanding about how these pitch computations are affected by sound level, especially for those signals presented at high levels in the MR scanner. We hope this brief report will stimulate further neuroimaging and electrophysiological investigations to address these issues. Acknowledgements. This work was supported by the Medical Research Council of the UK and a Knowledge Transfer Grant from the RNID.
References Barrett DJK, Hall DA (2006) Response preferences for ‘what’ and ‘where’ in human nonprimary auditory cortex. NeuroImage (in press) Bendor D, Wang X (2005) The neuronal representation of pitch in primate auditory cortex. Nature 436:1161–1165 Hall DA, Barrett DJK, Akeroyd MA, Summerfield AQ (2005) Cortical representations of temporal structure in sound. J Neurophysiol 94:3181–3191 Levitt H (1971) Transformed up-down methods in psychoacoustics. J Acoust Soc Am 49:467–477 McAlpine D (2004) Neural sensitivity to periodicity in the inferior colliculus: evidence for the role of cochlear distortions. J Neurophysiol 92:1295–1311 Patterson RD, Uppenkamp S, Johnsrude IS, Griffiths TD (2002) The processing of temporal pitch and melody information in auditory cortex. Neuron 36:767–776 Penagos H, Melcher JR, Oxenham AJ (2004) A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. J Neurosci 24:6810–6815 Shackleton TM, Carlyon RP (1994) The role of resolved and unresolved harmonics in pitch perception and frequency modulation discrimination. J Acoust Soc Am 95:3529–3540 Tramo MJ, Cariani PA, Koh CK, Makris N, Braida LD (2005) Neurophysiology and neuroanatomy of pitch perception: Auditory cortex. Ann NY Acad Sci 1060:148–174
Comment by Yost I liked your use of Huggins pitch to probe for pitch processing centres for many of the same reasons you stated in your paper. I have two questions. 1. Did you attempt to control for the fact that the Huggins stimulus condition produces both a pitch at the region of the interaural phase shift and a shift in the lateralized position of the pitch image? That is, an fMRI response might be due to the pitch and/or the laterality associated with the Huggins pitch stimulus. 2. Given the very weak pitch strength of Huggins pitch, I was surprised to see that the fMRI response to this stimulus was one of the strongest. Do you have an explanation for why the fMRI response was so strong for the Huggins pitch stimulus?
Reply We did not make any attempt to control for the lateralisation cues in our Huggins stimulus, and we think that this would be hard to do, partly because the percept is not consistent across listeners. All our stimuli were compromised to some extent, since it is impossible to produce a pitch stimulus without introducing spectral, temporal, or spatial features that might be identified by a cortical mechanism not specific to pitch. Our approach was to try to find a single cortical locus that responded to all our stimuli, and hence might be a candidate locus for the common feature of pitch. The response to Huggins was distributed in a similar manner to our other pitch stimuli. We did not find any relationship between response size and salience and we have no explanation at present for why this is the case.
Comment by de Cheveigné I really like this study. The level of rigor and care is refreshing, and the sobering results are possibly more exciting than those of less controlled studies. In your talk you mention that your pitch-producing stimuli failed to activate the areas that have been identified as a ‘pitch centre’ in studies that used ‘iterated ripple noise’ (IRN) stimuli. IRN is physically similar to a random-phase harmonic complex (for large number of iterations), and it evokes a clear pitch, so the discrepancy is puzzling. This comment points to some properties of IRN that might possibly explain the paradox. IRN is obtained by delaying noise by multiples of a time interval T, and adding up the delayed signals. An IRN of order N is the sum of N copies of the same noise with delays of 0 to N−1 times the interval (‘period’). IRN is quasi-periodic: the difference of one period to the next amounts to only about
1/N times the power (the rest being identical between periods). Over several periods the difference gets larger, and after N periods the waveform has been completely ‘renewed’. For large N, the period-to-period difference is small and the renewal slow, and so the stimulus is very much like a periodic tone, with a period derived from a ‘chunk’ of noise. The delay-and-add process also affects the spectrum. For N=2 (‘repetition noise’) the long-term power spectrum is shaped as a raised cosine with peaks at multiples of 1/T. For large N the peaks become narrower and in the limit of large N the spectrum (calculated over any finite window) becomes similar to the line-spectrum of a harmonic complex tone, but with two qualifications: the phase of each harmonic is drawn from a uniform distribution on [0, 2π], and its amplitude is drawn from a Rayleigh distribution. In other words, IRN has an irregular spectral envelope, somewhat akin to that of a ‘vowel’. This envelope fluctuates slowly over time (more slowly for larger N) and this may induce perceptible fluctuations in the timbre of the stimulus over the duration of a stimulus, or from repetition to repetition. The non-flat spectral envelope of IRN, or its evolution over time, could set it apart from other pitch-producing stimuli, and sensitivity to these aspects could explain why a ‘pitch centre’ would respond only to IRN. IRN is sometimes presented as an ideal stimulus for pitch studies, combining the virtues of white noise (lack of spectral structure) with a temporal structure sufficient to evoke pitch. This is incorrect: IRN offers spectral cues that are as clear as its temporal cues, and its pitch is salient. They may be made unusable by high-pass filtering and masking of combination tones, but IRN does not differ significantly in this respect from a complex tone. The number of iterations is convenient to manipulate pitch salience, but a similar effect may be obtained by adding a controlled amount of noise to a complex tone. IRN offers a non-flat spectral envelope, akin to that derived by sampling a period-length chunk of ongoing noise, but this does not make its timbre ‘noiselike’: for real noise this spectral shape would fade instantly, whereas for IRN it persists over relatively long periods of time. In this respect, IRN is akin to a harmonic source exciting a resonator with a random, slowly fluctuating transfer function. These properties covary with those that determine pitch and pitch strength, and with the random choice of noise, and thus the IRN stimulus is hard to control parametrically. It is not clear that the peculiar properties of IRN are an advantage, and thus one can question its systematic use in studies of pitch. Reply We were concerned that our pitch-evoking stimuli produced a different response pattern to that seen in previous fMRI studies using IRN. We followed the experiment described in the paper with a second experiment in which we presented IRN (10-ms delay, 16 iterations) to a subset of the original listeners.
Our results were consistent with the earlier studies, showing activation in anterolateral HG, and also planum temporale. The IRN effect in planum temporale was broadly consistent with that produced by our other pitch stimuli. Hence, it is possible that there is some feature of IRN not found in our other pitch stimuli that produces activation in anterolateral HG. Thank you for your excellent summary of the spectro-temporal features of IRN that may underlie this effect. After reading your comment, we passed our stimuli through a cochlear model. Although the peaks spaced at frequency intervals of 1/delay were not resolved (since they correspond to harmonic numbers of 10–20), broader slowly-varying spectral features were clearly present in the model output produced by the IRN stimulus compared to that produced by the noise control. Comment on de Cheveigné’s Comment to Hall and Plack by Yost and Patterson It would probably be useful to separate imaging studies from psychoacoustical studies when discussing the utility of using IRN to study pitch processing. In fMRI, it is necessary to use a subtractive method to help ensure that the stimulus feature of interest is the one that leads to an increased BOLD signal. As a result, it is important to control for all relevant stimulus features if many exist. In this regard Alain’s concern about the multiple stimulus features of IRN is relevant. However, the major IRN feature is the temporal regularity in the temporal fine structure, which is highly correlated with psychophysically measured pitch and pitch strength, and there is good evidence that this temporal regularity is processed by the auditory system (e.g., Yost et al. 1998; Patterson et al. 2000; Krumbholz et al. 2003). In some psychophysical studies, such as pitch matching, it seems unlikely that the variables mentioned in Alain’s comment would play a role in the pitch matching, given the robust pitch of IRN. His points may pertain to discrimination experiments, such as those often used in pitch-strength measurements. That is, IRN features other than those thought to control pitch strength may have an influence on estimates of pitch strength. However, here it is useful to consider both the perceptual salience of different IRN stimulus features and what we currently know about auditory processing. It appears that Alain is concerned that changes in ‘timbre’ associated with the delay-and-add process may provide discrimination cues that might confound the results, especially when this process is iterated. Alain uses the very large N (number of iterations) case to illustrate his points. The effects he describes decrease as N decreases, and in many IRN studies N is relatively small at 8 or fewer. IRN pitch does not change with N, but pitch strength does. However, even for N = 1, the pitch strength of IRN is substantial and any timbre differences are subtle. Alain does not mention the role that the temporal envelope may play in pitch processing. IRN for N of 8 or fewer has a flatter envelope than all of
the other stimuli used to study pitch. So, the use of IRN suggests that envelope cues may not be sufficient, and may not be necessary, for complex pitch processing. We certainly encourage the generation of other stimuli to probe pitch processing, as suggested in Alain’s comment. However, until these other stimuli are specified and are shown to be better in some way than IRN, we do not believe that the use of IRN should be discontinued as one of the stimuli used to study pitch processing. We agree that more needs to be done to determine which neural centres are involved in complex pitch processing. References Krumbholz K, Patterson RD, Nobbe A, Fastl H (2003) Microsecond temporal resolution in monaural hearing without spectral cues? J Acoust Soc Am 113:2790–2800 Patterson RD, Yost WA, Handel S, Datta JA (2000) The perceptual tone/noise ratio of merged iterated rippled noises. J Acoust Soc Am 107:1578–1588 Yost WA, Patterson RD, Sheft S (1998) The role of the envelope in processing iterated rippled noise. J Acoust Soc Am 104:2349–2361
11 Imaging Temporal Pitch Processing in the Auditory Pathway ROY D. PATTERSON1, ALEXANDER GUTSCHALK2, ANNEMARIE SEITHER-PREISLER3, AND KATRIN KRUMBHOLZ4
1 Introduction
Physiological studies of temporal pitch processing suggest that the processing of temporal regularity begins in the brainstem (e.g., Palmer and Winter 1992), implying that there is a hierarchy of temporal pitch processing in the auditory pathway, as would be expected from computational models of auditory perception (e.g., Patterson et al. 1995; Pressnitzer et al. 2001). This chapter reports a series of brain imaging studies designed to search for evidence of the hierarchy.
2 Imaging Temporal Pitch Processing with PET
There is an early positron emission tomography (PET) study of temporal pitch processing by Griffiths et al. (1998), who used Regular Interval Sounds (RIS) (Yost et al. 1996) to produce a spectrally balanced set of stimuli with varying pitch strength. A delay-and-add technique is used to produce a concentration of one time interval in what is otherwise a broadband noise. As the degree of regularity increases, the hiss of the noise dies away and a pitch at the delay increases in strength to the point where it dominates the perception. With appropriate high-pass filtering, these RIS produce essentially uniform excitation across the tonotopic dimension of neural activity in the auditory pathway (see Fig. 1 of Patterson et al. 2002). RIS are useful in imaging because they enable one to generate sets of spectrally matched stimuli that enhance the sensitivity of perceptual contrasts in functional imaging. A brief comparison of spectral and temporal models of pitch for brain imaging is presented
1 Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, UK, [email protected]
2 Department of Neurology, University of Heidelberg, Heidelberg, Germany, [email protected]
3 Experimental Audiology, Münster University Hospital, Münster and CSS Institut für Psychologie, Karl-Franzens-Universität, Graz, Germany, [email protected]
4 MRC Institute of Hearing Research, University Park, Nottingham, UK, [email protected]
in Griffiths et al. (1998). In separate conditions, subjects were presented with sequences of stimuli having different, but fixed, values of pitch strength, and brain activation was observed to increase with pitch strength in the anterior region of auditory cortex referred to as Heschl’s gyrus. When conditions with varying pitch value were contrasted with fixed-pitch conditions, differential activation was observed in regions of the temporal lobe clearly anterior and posterior to auditory cortex. The results were interpreted as evidence of a hierarchy of pitch processing in the auditory pathway. The power of PET experiments is severely limited, however, by the amount of radiation that a subject can be exposed to in a year, and so PET has largely been replaced by fMRI and MEG for brain imaging in the auditory system.
3 Imaging Temporal Pitch Processing with fMRI
Neural tissue draws oxygen from the blood when it is active, and functional Magnetic Resonance Imaging (fMRI) can be used to measure neural activation through the blood-oxygen-level-dependent (BOLD) response. It is a noninvasive technique; however, it does have three important limitations: the scanner is very noisy, the subject has to stay very still, and it is difficult to control stimulus fidelity. We begin by describing how these problems are managed.
3.1 Managing fMRI Constraints
The most obvious limitation for auditory studies is that of the very loud noise that MR scanners make during structural and fMRI image acquisition (such noise can exceed 130 dB; Palmer et al. 1998). One widely used set of techniques overcomes the influence of scanner noise on stimulus presentation during fMRI by temporally separating EPI scanner noise from the experimental sounds, taking advantage of the fact that the peak of the hemodynamic response lags the stimulus by several seconds (e.g., Hall et al. 1999). Image acquisition in such ‘sparse-imaging’ designs should be as rapid as possible (under 3 s), so that activation within a scan is not contaminated by scanner noise, and follows a silent interval during which experimental sounds are presented. This silent period is often quite long (7–15 s), which ensures that activation data is uncontaminated by scanner noise; however, such procedures are time consuming. Accordingly, efforts are now being made to minimize scanner noise by redesigning the pulse sequence to modify gradient motion and, thus, gradient noise. Any movement of the structures imaged, either within a scan or between scans, will blur the image and reduce sensitivity by averaging voxels that have stimulus-related signal with those that do not. The subcortical nuclei of the auditory system, the CN, SOC and IC, move vertically with the pulsing of
the cardiac cycle across distances of up to half their diameter. So, if images are taken at random points of the cardiac cycle, the same point in the imaged volume would actually contain data from different brain structures on different scans, and it might include regions containing ventricular fluid. Fortunately, the temporal resolution of data capture in fMRI is relatively good (as opposed to the hemodynamic response); the data obtained in a single scan comes from a duration of about 20 ms, and so image acquisitions can be synchronized to a particular point in the cardiac cycle – a procedure referred to as ‘cardiac gating’ (Guimaraes et al. 1998). If the scan orientation is axial and the scan proceeds from bottom to top up the auditory pathway, then the scans of the brainstem structures occur shortly after the cardiac trigger in a part of the cycle where the position is predictable and the rate of motion is minimal. Guimaraes et al. (1998) described how this technique was used to image the response to a sinusoid in the CN and IC. Finally, there is the problem of stimulus fidelity; the large magnetic field of the scanner precludes the use of headphones with metallic coils, which would disrupt image quality, especially since auditory cortex is close to the ear canal. One solution is to conduct the sound to the subject via plastic tubes from distant speakers, but this restricts the frequency response. These problems prompted the development of magnet-friendly headsets with carbon-fibre leads and either piezo-electric transducers or electrostatic transducers. Both types of transducer can present stimuli with reasonable bandwidth and fidelity close to the entrance to the ear canal, and they can be mounted in relatively flat, circum-aural ear cups that provide substantial attenuation from the scanner noise.
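Purely as an illustration of the timing logic (the function and variable names below are our own, and the numbers are placeholders except where the text gives them), a sparse, cardiac-gated acquisition cycle might be scheduled like this:

```python
def schedule_cycle(t0, silent_period=8.0, cardiac_triggers=()):
    """One sparse-imaging cycle: present the sounds during the silent period, then
    start the (<3 s) volume acquisition at the first cardiac trigger after the
    silence, so that every volume is acquired at the same phase of the cardiac cycle."""
    stim_window = (t0, t0 + silent_period)   # scanner silent, experimental sounds presented
    scan_start = min(t for t in cardiac_triggers if t >= stim_window[1])
    return stim_window, scan_start

# Illustrative numbers: ~0.9-s heart period, one cycle starting at t = 0 s.
triggers = [0.9 * k for k in range(100)]
print(schedule_cycle(0.0, silent_period=8.0, cardiac_triggers=triggers))
```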
3.2 Searching for Temporal Pitch Activation
With correct management of scanner noise, motion artifacts and stimulus fidelity in hand, it became possible to image the auditory pathway with fMRI. Griffiths et al. (2001) conducted a study with RI sounds, and showed that with cardiac gating, sparse imaging and 48 repetitions of each condition, fMRI was sufficiently sensitive to image all of the monaural, subcortical nuclei of the auditory pathway simultaneously. Contrasts between the activation produced by RIS and spectrally matched noise confirmed that temporal pitch processing begins in subcortical structures (CN and IC). At the same time, a contrast between sounds with varying pitch and fixed pitch revealed that changing pitch does not produce more activation than fixed pitch in these regions. The results were interpreted as confirming that pitch processing begins in the brainstem but is not completed there, as suggested in Griffiths et al. (1998). The processing of pitch and melody by cortical regions in this study was reported in Patterson et al. (2002). The anterior-most transverse temporal gyrus of Heschl (HG) is the landmark for primary auditory cortex (Morosan
et al. 2001). Both the RIS and the noise produced more activation than silence, bilaterally, in two large clusters of voxels centred in the region on and behind HG, and the individual sound conditions all produced very similar patterns of activation. In PAC, when any sound condition was contrasted with any other, there was no residual activity. The obvious interpretation is that PAC is fully engaged by the processing of any complex sound. When the fixed-pitch condition was contrasted with noise, there was differential activation in antero-lateral HG bilaterally, just outside PAC, which was interpreted as a sign of temporal pitch processing. Finally, when activity in the melody conditions was contrasted with that in the fixed-pitch condition, it revealed differential activity in the superior temporal gyrus (STG) below HG, and in planum polare (PP) anterior to HG. Moreover, activity was more pronounced in the right hemisphere. Melody produced about the same level of activity as fixed pitch in HG itself, suggesting that the al-HG region is involved in determining the pitch value and pitch strength, rather than the contour of pitch change across a sequence of notes. Penagos et al. (2004) extended these results using harmonic complex tones with and without resolved harmonics. With a 3-Tesla scanner, they were able to measure activation in the CN and IC, as well as PAC. There was a correlate of pitch salience in al-HG, which was interpreted as evidence of a pitch hierarchy with pitch-specific processing in al-HG. Warren et al. (2003) contrasted the chroma and height dimensions of pitch and found that, whereas chroma changes produced more activation in al-HG, pitch height changes produced more activation in PT directly behind al-HG. Recently, Bendor and Wang (2005) demonstrated the presence of cells in marmoset cortex that were sensitive to the low pitch of complex tones. The cells were in an area adjacent to PAC that Bendor and Wang argue is homologous to al-HG in humans.
4 Imaging Temporal Pitch Processing with MEG
Magnetoencephalography (MEG) measures the strength and direction of postsynaptic activity in pyramidal cells of cortex running parallel to the scalp. The main advantage of MEG for the investigation of auditory function is that it has millisecond temporal resolution, so it can follow the temporal dynamics of auditory processing. There is a small, positive deflection of the auditory evoked field (AEF) with a latency in the range 50–90 ms referred to as the P1m, or P50m. However, it does not change amplitude or latency when the pitch strength of a tone is varied in isolation (Gutschalk et al. 2004a). The subsequent negative deflection associated with stimulus change is the most prominent part of the AEF; it appears in the interval between 80 and 150 ms post stimulus onset, and it is referred to as the N1m or N100m. It is a complex response that is generally assumed to represent the aggregate activity of multiple sources in auditory cortex, involved in processing different properties of sound onset.
Forss et al. (1993) showed that the latency of the N1m elicited by a regular click train is inversely related to the pitch of the sound, which suggests that the generators of the N1m are involved in pitch processing. However, as Näätänen and Picton (1987) pointed out, the N1m can be elicited by the onset of almost any kind of sound. So, while it is the case that the latency of the N1m varies with pitch, the response is fundamentally confounded with a large onset response to the energy of the sound. To isolate the pitch component of the N1m, Krumbholz et al. (2003) and Rupp et al. (2005) developed continuous stimulation techniques in which the sound begins with a noise and then, after the initial N1m has passed and the AEF has settled into a sustained response, the fine structure of the noise is regularized without changing the energy or the longer term spectral distribution of the energy. There is a marked perceptual change at the transition from noise to RIS, and it is accompanied by a prominent negative deflection in the magnetic field, referred to as the pitch onset response (POR). The inverse transition, from a RIS to noise, is much more difficult to detect (Uppenkamp et al., 2004; Rupp et al. 2005), and produces virtually no deflection of the AEF. Krumbholz et al. (2003) showed that the latency of the POR varies inversely with the pitch of the RIS, and the magnitude of the response increases with pitch strength. The source of the POR was located in the antero-lateral portion of HG close to the pitch region identified with fMRI by Patterson et al. (2002). The notes of music and the vowels of speech produce sustained pitch perceptions, and when the duration is 100 ms, or more, they elicit a surface negative sustained field (SF) that rises after the N1m and continues to the end of the sound. Gutschalk et al. (2002) recorded the SF evoked by regular and irregular click trains. By contrasting the activity produced by regular and irregular conditions, they were able to dissociate activity associated with temporal regularity from activity associated with stimulus intensity. Two sources just lateral to PAC were isolated in each hemisphere. The more anterior, located in al-HG, was particularly sensitive to temporal pitch and largely insensitive to stimulus intensity. The more posterior, in PT just behind al-HG, was sensitive to intensity and largely insensitive to pitch. This double dissociation shows that al-HG also produces a sustained pitch response (SPR). The generators of the POR and SPR, on the one hand, and the components of the N1m and the sustained field that are indifferent to regularity, on the other hand, appear to arise from differentiable, but overlapping sites (Gutschalk et al. 2004a). The existence of a SPR as well as a POR in al-HG has now been confirmed in a succession of MEG studies (Gutschalk et al. 2004a, b, 2006; Seither-Preisler et al. 2004, 2006a, b). The latencies of the POR and the SPR are both surprisingly long. The peak latency of the POR is about 120 ms plus four times the ‘period’ of the RI sound (Krumbholz et al. 2003); the SPR appears in the source wave between 200 and 300 ms post regularity onset, and rises to its sustained level over 100–200 ms. Several of the groups have modeled the temporal dynamics of
the POR and SPR (Gutschalk et al. 2004a, b; Krumbholz et al. 2003; Rupp et al. 2005; Seither-Preisler et al. 2006b), either qualitatively or quantitatively, using the auditory image model (AIM) (Patterson et al. 1995). The results consistently show that the latencies of the POR and SPR are substantially longer than the latencies that would be predicted with AIM for the build-up of the pitch ridge in the auditory image.
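As a rough numerical illustration of the latency relation quoted above (the helper name is ours; the 120-ms offset and the factor of four are simply taken from the relation reported by Krumbholz et al. 2003):

# Reported relation: POR peak latency ~ 120 ms plus four times the period of the RI sound.
def por_latency_ms(pitch_hz):
    period_ms = 1000.0 / pitch_hz
    return 120.0 + 4.0 * period_ms

for pitch_hz in (50, 100, 200):
    print(pitch_hz, "Hz ->", por_latency_ms(pitch_hz), "ms")
# 50 Hz -> 200 ms, 100 Hz -> 160 ms, 200 Hz -> 140 ms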
5 Conclusions
Brain imaging with PET and fMRI has been used to locate activity associated with temporal pitch processing on Heschl’s gyrus just antero-lateral to PAC. MEG has been used to reveal the temporal dynamics of the processing. The results suggest that both the POR and SPR reflect the measurement of pitch value and strength which, according to theory, occur relatively late in the pitch hierarchy.
References
Bendor D, Wang X (2005) The neuronal representation of pitch in primate auditory cortex. Nature 436:1161–1165
Forss N, Mäkelä JP, McEvoy L, Hari R (1993) Temporal integration and oscillatory response of the human auditory cortex revealed by evoked magnetic fields to click trains. Hear Res 68:89–96
Griffiths TD, Büchel C, Frackowiak RSJ, Patterson RD (1998) Analysis of temporal structure in sound by the brain. Nature Neurosci 1:422–427
Griffiths TD, Uppenkamp S, Johnsrude I, Josephs O, Patterson RD (2001) Encoding of the temporal regularity of sound in the human brainstem. Nature Neurosci 4:633–637
Guimares A, Melcher J, Talavage T, Baker J, Ledden P, Rosen B, Kiang N, Fullerton B, Weisskoff R (1998) Imaging subcortical activity in humans. Hum Brain Map 6:33–41
Gutschalk A, Patterson RD, Rupp A, Uppenkamp S, Scherg M (2002) Sustained magnetic fields reveal separate sites for sound level and temporal regularity in human auditory cortex. NeuroImage 15:207–216
Gutschalk A, Patterson RD, Scherg M, Uppenkamp S, Rupp A (2004a) Temporal dynamics of pitch in human auditory cortex. NeuroImage 22:755–766
Gutschalk A, Patterson RD, Uppenkamp S, Scherg M, Rupp A (2004b) Recovery and refractoriness of auditory evoked fields after gaps in click trains. Eur J Neurosci 20:3141–3147
Gutschalk A, Patterson RD, Scherg M, Uppenkamp S, Rupp A (2006) The effect of context on the sustained pitch response in human auditory cortex. Cerebral Cortex (in press)
Hall D, Haggard M, Akeroyd M, Palmer A, Summerfield A, Elliott M, Gurney E, Bowtell R (1999) “Sparse” temporal sampling in auditory fMRI. Hum Brain Map 7:213–223
Krumbholz K, Patterson RD, Seither-Preisler A, Lammertmann C, Lütkenhöner B (2003) Neuromagnetic evidence for a pitch processing centre in Heschl’s gyrus. Cerebral Cortex 13:765–772
Morosan P, Rademacher J, Schleicher A, Amunts K, Schormann T, Zilles K (2001) Human primary auditory cortex: subdivisions and mapping into a spatial reference system. NeuroImage 13:684–701
Näätänen R, Picton TW (1987) The N1 wave of the human electric and magnetic response to sound: a review and an analysis of the component structure. Psychophysiology 24:375–425
Palmer AR, Winter IM (1992) Cochlear nerve and cochlear nucleus responses to the fundamental frequency of voiced speech sounds and harmonic complex tones. In: Cazals Y, Demany L, Horner K (eds) Auditory physiology and perception. Pergamon, Oxford, pp 231–239
Palmer AR, Bullock DC, Chambers JD (1998) A high-output, high-quality sound system for use in auditory fMRI. NeuroImage 7:S359
Patterson RD, Allerhand M, Giguère C (1995) Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform. J Acoust Soc Am 98:1890–1894
Patterson RD, Uppenkamp S, Johnsrude I, Griffiths TD (2002) The processing of temporal pitch and melody information in auditory cortex. Neuron 36:767–776
Penagos H, Melcher JR, Oxenham AJ (2004) A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. J Neurosci 24:6810–6815
Pressnitzer D, Patterson RD, Krumbholz K (2001) The lower limit of melodic pitch. J Acoust Soc Am 109:2074–2084
Rupp A, Uppenkamp S, Bailes J, Gutschalk A, Patterson RD (2005) Time constants in temporal pitch extraction: a comparison of psychophysical and neuromagnetic data. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Proc 13th ISH: Auditory signal processing: physiology, psychoacoustics and models. Dourdan, France, pp 119–125
Seither-Preisler A, Krumbholz K, Patterson RD, Seither A, Lütkenhöner B (2004) Interaction between the neuromagnetic responses to sound energy onset and pitch onset suggests common generators. Eur J Neurosci 19:3073–3080
Seither-Preisler A, Patterson RD, Krumbholz K, Seither S, Lütkenhöner B (2006a) Evidence of pitch processing in the N100m component of the auditory evoked field. Hear Res 213:88–98
Seither-Preisler A, Patterson RD, Krumbholz K, Seither S, Lütkenhöner B (2006b) From noise to pitch: transient and sustained responses of the auditory evoked field. Hear Res (in press)
Uppenkamp S, Bailes J, Patterson RD (2004) How long does a sound have to be to produce a temporal pitch? Proc 18th International Congress on Acoustics, Kyoto, vol I, pp 869–870
Warren JD, Uppenkamp S, Patterson RD, Griffiths TD (2003) Separating pitch chroma and pitch height in the human brain. Proc Natl Acad Sci USA 100:10038–10042
Yost WA, Patterson RD, Sheft S (1996) A time-domain description for the pitch strength of iterated rippled noise. J Acoust Soc Am 99:1066–1078
Comment by Chait
In your MEG experiments you interpret responses to transitions between irregular and regular click-tone sequences, or between white noise and IRN, as reflecting the activation of a pitch-related area. However, these transitions can also be interpreted as transitions between “irregular” and “regular” stimuli. As we show in our talk on Monday, similar responses and asymmetries are obtained for transitions between irregular and regular signals that do not evoke a pitch percept. Measurable magnetic fields originate from EPSCs in the apical dendrites of tens of thousands of simultaneously active cells. Arguably, the processing of a feature such as pitch should not require the simultaneous and synchronized activation of tens of thousands of cells. Computations that are more likely to evoke the observed MEG responses might be related to object analysis, notification of change, attention switching, etc. (Gutschalk et al. 2004).
Conceivably, such global processes may involve the synchronous activation of many cells as a method of notification across mechanisms and brain areas that something new and potentially behaviorally relevant has occurred in the environment.
Reply
For us, the question is basically ‘How and where does the auditory system process the temporal regularity in a sound to produce the perceptions associated with temporal regularity?’ An extended example concerning the perception of click-trains is presented in Patterson et al. (1992). The example is used to motivate the strobed temporal integration mechanism in the Auditory Image Model (AIM) of perception. Briefly, for rates less than 10 clicks per second (cps), we hear individual events, independent of whether the train is regular or irregular. For rates greater than 40 cps, we hear a continuous sound, which has a strong pitch if the train is regular, and no pitch if it is not. The problem for auditory models is to explain the perceptual transition from individual clicks to click-tones, and the perception of flutter in the region of 16 Hz. Similar questions concerning regularity arise with many stimuli, such as those of Chait, Poeppel and Simon in this volume, as the rate of transitions rises from 1 to 64 per second.
The question for brain imaging is whether the processing of slow click trains, which leads to the perception of a stream of separate events, occurs in the same neural structure as the processing of fast click trains, which leads to the perception of a continuous tone. MEG source waves indicate that isolated clicks produce transient N1m responses in Planum Temporale (PT) and that the strength decreases as the click rate increases above 1 cps. Irregular click trains with rates greater than 40 cps produce a single N1m in PT at stimulus onset, and a sustained response in PT thereafter (Gutschalk et al. 2002, 2004). Regular click trains with rates greater than 40 cps produce what we have referred to as a Pitch Onset Response (POR) and a Sustained Pitch Response (SPR), in a region of Heschl’s gyrus a little anterior and lateral to primary auditory cortex (PAC). It would be interesting to know the location of the sources associated with the responses reported by Chait, Poeppel and Simon in this volume.
When MEG techniques are sufficiently developed, it would be interesting to track the response to regular and irregular CTs as the click rate decreases from 64 to 1 cps, and the pitch fades out of the perception. In this regard, it would be interesting to extend these experiments concerning the lower limit of pitch to include three stimuli that produce continuous stimulation and repress the N1m. The stimuli are RIS (Krumbholz et al. 2003; Seither-Preisler et al. 2004), repeated frozen noise (Limbert and Patterson 1982) and AABB noise (Wiegrebe et al. 1998), all of which produce a strong pitch when the repetition rate is above about 40 Hz, but which produce the perception of noise with an ambiguous
repeating feature when the repetition rate is less than 32 Hz. These stimuli should excite the pitch centre when the rate is over 40 Hz, but they are unlikely to produce a repeating N1m at lower rates, even when the rate is as low as 1 cps, because they produce continuous stimulation. The question is how the neural centres in lateral HG and PT interact as the pitch fades from the sound. We agree with the postulate in the second paragraph that the responses we are measuring with MEG represent processes that might better be thought of in terms of their role in the definition and segregation of streams of auditory events with coherent features over time (Cooke 2006), although we would be more inclined to think of these processes as identifying and segregating sound sources rather than auditory objects.
References
Cooke M (2006) A glimpsing model of speech perception in noise. J Acoust Soc Am 119:1562–1573
Gutschalk A, Patterson RD, Rupp A, Uppenkamp S, Scherg M (2002) Sustained magnetic fields reveal separate sites for sound level and temporal regularity in human auditory cortex. NeuroImage 15:207–216
Gutschalk A, Patterson RD, Scherg M, Uppenkamp S, Rupp A (2004) Temporal dynamics of pitch in human auditory cortex. NeuroImage 22:755–766
Krumbholz K, Patterson RD, Seither-Preisler A, Lammertmann C, Lütkenhöner B (2003) Neuromagnetic evidence for a pitch processing centre in Heschl’s gyrus. Cerebral Cortex 13:765–772
Limbert C, Patterson RD (1982) Tapping to repeated noise. J Acoust Soc Am 71:S38
Patterson RD, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M (1992) Complex sounds and auditory images. In: Cazals Y, Demany L, Horner K (eds) Auditory physiology and perception. Proceedings of the 9th International Symposium on Hearing. Pergamon, Oxford, pp 429–446
Seither-Preisler A, Krumbholz K, Patterson RD, Seither A, Lütkenhöner B (2004) Interaction between the neuromagnetic responses to sound energy onset and pitch onset suggests common generators. Eur J Neurosci 19:3073–3080
Wiegrebe L, Patterson RD, Demany L, Carlyon RP (1998) Temporal dynamics of pitch strength in regular interval noises. J Acoust Soc Am 104:2307–2313
Comment by Hall
In your talk, you suggest that the pitch-related computations performed by lateral Heschl’s gyrus might relate to the across-frequency channel averaging implemented in the autocorrelation model of temporal pitch. If this is the case, then one might expect the same region to be engaged by other analyses of fine temporal structure that also require the computation of a summary correlogram, such as the analysis of interaural correlation. Some of our recent data suggest that this is unlikely because lateral Heschl’s gyrus responded little to the degree of interaural correlation in the noise, while it did respond strongly to the degree of monaural temporal regularity. Would you like to comment?
Reply
As a matter of fact, the Auditory Image Model (Patterson et al. 1995) does not use autocorrelation to compute pitch (Patterson and Irino 1998), nor does it use cross-correlation to compute laterality (Patterson et al. 2006), and I doubt that the auditory system does either, because correlation processes (auto- and cross-) are expansive in magnitude, symmetric in time, and extremely inefficient. That said, the main issue here is the cross-channel computations that might be used to summarise laterality or pitch information, and whether AIM would predict that cross-channel computations for laterality and pitch both occur in auditory cortex, and in the same neural structure. Although it is an intriguing hypothesis, the answer would appear to be no. The mechanism recently proposed for binaural processing (Patterson et al. 2006) involves a coincidence gate that really should be in the brainstem to minimize temporal distortion of the ITD information, which is in the tens-of-microseconds range. The coincidence gate mechanism is assumed to precede the strobed temporal integration (STI) mechanism (Patterson 1994) used to construct the time-interval histograms that constitute the auditory image (Patterson et al. 1995). The pitch calculation is reviewed in Krumbholz et al. (2003), which also addresses the differences between STI and autocorrelation at the end of the Discussion. The important thing for the present discussion is that the cross-channel pitch computation is applied to the auditory image after it is constructed (see, for example, Krumbholz et al. 2005), which probably means that it is performed farther along in the auditory pathway. Nevertheless, the original hypothesis of Hall et al. (2005), that the two cross-channel mechanisms might reside in the same region of auditory cortex, seems reasonable and very much worth testing because, if true, it would require drastic restructuring of time-interval models like AIM. Moreover, it led to the discovery of an interaction between pitch salience and sound source location in Heschl’s sulcus. I agree that the interaction suggests that this region of auditory cortex may be involved in the integration of acoustic features as a prelude to source identification, and this is very intriguing.
References
Hall DA, Barrett DJK, Akeroyd MA, Summerfield AQ (2005) Cortical representations of temporal structure in sound. J Neurophysiol 94:3181–3191
Krumbholz K, Patterson RD, Nobbe A, Fastl H (2003) Microsecond temporal resolution in monaural hearing without spectral cues? J Acoust Soc Am 113:2790–2800
Krumbholz K, Bleeck S, Patterson RD, Senokozlieva M, Seither-Preisler A, Lütkenhöner B (2005) The effect of cross-channel synchrony on the perception of temporal regularity. J Acoust Soc Am 118:946–954
Patterson RD (1994) The sound of a sinusoid: time-interval models. J Acoust Soc Am 96:1419–1428
Patterson RD, Irino T (1998) Auditory temporal asymmetry and autocorrelation. In: Palmer A, Rees A, Summerfield Q, Meddis R (eds) Psychophysical and physiological advances in hearing. Proceedings of the 11th International Symposium on Hearing. Whurr, London, pp 554–562
Patterson RD, Allerhand M, Giguère C (1995) Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform. J Acoust Soc Am 98:1890–1894
Patterson RD, Anderson TR, Francis K (2006) Binaural auditory images for noise-resistant speech recognition. In: Ainsworth W, Greenberg S (eds) Listening to speech. LEA, pp 257–269
12 Spatiotemporal Encoding of Vowels in Noise Studied with the Responses of Individual Auditory-Nerve Fibers
MICHAEL G. HEINZ
Department of Speech, Language, and Hearing Sciences and Weldon School of Biomedical Engineering, Purdue University, [email protected]
1 Introduction
The neural basis for robust speech perception exhibited by human listeners (e.g., across sound levels or background noises) remains unknown. The encoding of spectral shape based on auditory-nerve (AN) discharge rate degrades significantly at high sound levels, particularly in high spontaneous-rate (SR) fibers (Sachs and Young 1979). However, continued support for rate coding has come from the observations that robust spectral coding occurs in some low-SR fibers for vowels in quiet and that rate-difference profiles provide enough information to account for behavioral discrimination of vowels (Conley and Keilson 1995; May, Huang, Le Prell, and Hienz 1996). Despite this support, it is clear that temporal codes are more robust than rate (Young and Sachs 1979), especially in noise (Delgutte and Kiang 1984; Sachs, Voigt, and Young 1983). Sachs et al. (1983) showed that rate coding in low-SR fibers was significantly degraded at a moderate signal-to-noise ratio for which human perception is robust. In contrast, temporal coding based on the average-localized-synchronized-rate (ALSR) remained robust. Although temporal coding based on ALSR is often shown to be robust, evidence for neural mechanisms to decode these cues is limited. Spatiotemporal mechanisms have been proposed for decoding these types of cues (e.g., Carney, Heinz, Evilsizer, Gilkey, and Colburn 2002; Deng and Geisler 1987; Shamma 1985). However, the detailed evaluation of spatiotemporal mechanisms has been limited primarily to modeling studies due to difficulties associated with the large population responses that are required to study spatiotemporal coding (e.g., see Palmer 1990). For example, Deng and Geisler (1987) used a transmission-line-based AN model to suggest that spectral coding based on the peak cross-correlation between adjacent best-frequency (BF) channels was robust in the presence of background noise. In the present study, spectral coding of vowels in noise based on rate, ALSR, and a simple cross-BF coincidence detection scheme is evaluated from the responses of single AN fibers. By using data from a single AN fiber, many of the difficulties associated with large-population studies are eliminated.
2 Methods
AN recordings were made from pentobarbital-anesthetized cats using standard methods (see Heinz and Young 2004). Spikes were measured with 10-µs resolution. Each fiber was characterized using an automated tuning-curve algorithm to determine BF, Q10, and SR. All vowels were created using a cascade synthesizer and were scaled versions of the vowel /eh/, which has its first two formants at F1=0.5 kHz and F2=1.7 kHz, with the intermediate trough at T1=1.2 kHz. To maintain F0 within the voice-pitch range for each BF, a baseline steady-state vowel was resynthesized for each AN fiber. The baseline vowel had F0=75 Hz and was created with F2 at BF. The other formant frequencies and all bandwidths were scaled based on the frequency shift from the nominal F2 value for /eh/. The baseline vowel and a baseline broadband noise token were both 400 ms in duration and were sampled at 33,000 Hz. The vowel-in-noise conditions with F1 and T1 near BF were produced via changes in sampling rate for the vowel and noise. Signal-to-noise ratio in dB was defined as the difference between overall vowel level and noise level within the frequency range from 0 Hz to the trough between the 3rd and 4th formant of the baseline vowel. Spectral coding was evaluated based on individual-neuron responses in a manner similar to the spectrum manipulation procedure (SMP), which was developed to study rate-based spectral coding (e.g., May et al. 1996). In the SMP, spectral coding is evaluated by comparing responses to vowels with formants and troughs placed at BF via changes in sampling rate. The slope of the discharge rate as a function of vowel feature level is used to quantify spectral coding, with robust coding indicated by a constant slope across vowel level (or SNR). Although the SMP is useful for evaluating rate coding, changes in the temporal waveform with changes in sampling rate do not allow spatiotemporal coding to be evaluated. The spectro-temporal manipulation procedure (STMP) was developed to study spatiotemporal coding by estimating the responses of several neurons with nearby BFs to the same stimulus waveform using the responses of a single neuron to different stimuli with a spectral feature shifted to frequencies near BF (Heinz 2005). Based on a neuron with BF0 and a vowel with F1=BF0, the response of a neuron with BF slightly above or below BF0 to the same vowel was estimated from the response of the BF0 neuron to a vowel whose feature was shifted by the corresponding frequency ratio (again via a change in sampling rate), with the temporal response scaled by the same factor.
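The rescaling step at the heart of the STMP can be sketched as follows; this is our reading of the procedure (Heinz 2005), based on cochlear scaling invariance, and the exact rescaling used in the study may differ:

# Interpretation of the STMP rescaling: the response of a hypothetical fiber with
# BF = shift * BF0 to the standard vowel is approximated by the recorded BF0 fiber's
# response to a vowel whose frequencies (via the sampling rate) were divided by 'shift',
# with the measured spike times then divided by the same factor.
import numpy as np

def effective_bf_spikes(spike_times_s, shift):
    """spike_times_s: spikes of the BF0 fiber to the frequency-shifted vowel.
    shift: BF_effective / BF0, e.g. 2 ** 0.05 for +0.05 octaves."""
    return np.asarray(spike_times_s) / shift

shift = 2 ** 0.05
rescaled = effective_bf_spikes([0.0123, 0.0251, 0.0384], shift)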
A physiologically based spatiotemporal mechanism was evaluated by computing shuffled cross-correlograms (SCCs) between responses at different effective BFs (Joris 2003; Joris, Van de Sande, Louage, and van der Heijden 2006). The SCCs were used as a model of a cross-BF coincidence detector with two AN-fiber inputs responding to the same vowel waveform. The SCC value at each delay represents the discharge rate (spikes/sec) of a coincidence detector with the corresponding delay between inputs. SCCs were computed using a 50-µsec binwidth. The SCC provides an efficient method to predict responses of simple monaural cross-BF coincidence detectors based on AN responses.
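A minimal sketch of such an SCC computation is given below; the normalization to spikes per second follows the description above, but the exact normalization and delay range used in the study (Joris 2003; Joris et al. 2006) may differ:

# Shuffled cross-correlogram between two sets of spike trains (one per repetition).
import numpy as np

def scc(trains_a, trains_b, binwidth=50e-6, max_delay=25e-3, duration=0.4):
    """trains_a, trains_b: lists of spike-time arrays (s), one per stimulus repetition."""
    edges = np.arange(-max_delay, max_delay + binwidth, binwidth)
    counts = np.zeros(len(edges) - 1)
    n_pairs = 0
    for a in trains_a:                      # every repetition at effective BF 'a'
        for b in trains_b:                  # against every repetition at effective BF 'b'
            diffs = np.subtract.outer(np.asarray(a), np.asarray(b)).ravel()
            counts += np.histogram(diffs, bins=edges)[0]
            n_pairs += 1
    # coincidences per second of stimulus per trial pair, i.e. the rate of a model
    # coincidence detector with the corresponding internal delay
    rate = counts / (n_pairs * duration)
    delays = edges[:-1] + binwidth / 2
    return delays, rate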
3 Results
3.1 Choice of Vowel and Noise Levels
For each AN fiber, vowel and noise levels were chosen to focus the STMP analysis on conditions where rate coding degraded as noise level was increased. A vowel level was chosen for which there was good rate coding in quiet (rate to F1 > rate to T1; Fig. 1A). This was generally chosen near the middle of the T1 dynamic range. Rate as a function of noise level (or decreasing SNR) was then measured for F1 and T1 at that vowel level (panel B). Three SNRs (across a 20-dB range) were chosen to cover the range over which rate coding degraded, with the middle SNR typically chosen at the level where the rate to F1 and T1 became close (Fig. 1B). Figure 1 shows an example of a fiber with a low SR that had robust spectral coding in quiet. However, the addition of noise degraded the spectral coding in this fiber to the point where the rate to F1 and T1 were equal. The complete degradation of rate coding as noise increased was true of every fiber studied.
Fig. 1 A Rate-level functions for F1 and T1 placed at BF. Dotted vertical line indicates vowel level chosen. B. Rate-level functions for the 50-dB SPL vowel in noise as a function of decreasing SNR. Dotted vertical lines indicate SNRs used for STMP. Fiber BF=0.96 kHz; Thresh.=8 dB SPL; SR=1.4 sp/sec
3.2 Predicted Spatiotemporal Response Patterns
Based on the vowel and noise levels chosen (Fig. 1), the STMP was used to predict the spatiotemporal responses of 10 effective BFs near the AN fiber’s BF to F1 and T1 in 4 noise conditions (3 SNRs and in quiet). The data shown in Fig. 2 represent 20 repetitions of the 80 conditions studied (10 effective BFs × 2 features × 4 noise levels), which were presented in an interleaved manner. The spatiotemporal responses to F1 (top panel) show synchrony capture to F1 in both conditions across all but the highest BFs. The responses of BFs near T1 (bottom panel) show a significant response to F0 in the quiet condition, which disappears in noise (Delgutte and Kiang 1984).
Fig. 2 Spatiotemporal patterns in response to F1 (top) and T1 (bottom) at 0.96 kHz (thickest line), predicted from an AN fiber with BF0=0.96 kHz. Left panels show period histograms for each of 10 effective BFs for the in-quiet and middle SNR condition (4 dB). Vowel level = 50 dB SPL. Labels and short vertical lines at the bottom of the period histograms indicate the temporal periods (from time 0) associated with various vowel features. Middle panels show rate as a function of BF. Horizontal dotted lines and labels indicate the location of vowel features. Right panels show synchrony coefficient to F1 (top) and T1 (bottom) -only significant values shown. Octave shifts re BF0: −.4, −.25, −.15, −.05, 0, .05, .15, .25, .5, .75
The effective BF near 1.4 kHz is near F2 and shows the expected response to F2. These predicted patterns are consistent with many of the properties reported in previous population studies (e.g., Delgutte and Kiang 1984; Sachs et al. 1983; Young and Sachs 1979).
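The synchrony coefficients shown in Fig. 2 are, in essence, vector strengths; a generic implementation might look as follows (the Rayleigh criterion noted in the comment is a common significance test, not necessarily the one used in this study):

# Synchrony coefficient (vector strength) of spike times to a target frequency.
import numpy as np

def vector_strength(spike_times_s, freq_hz):
    phases = 2.0 * np.pi * freq_hz * np.asarray(spike_times_s)
    vs = np.abs(np.mean(np.exp(1j * phases)))
    rayleigh = 2.0 * len(spike_times_s) * vs ** 2   # > ~13.8 corresponds to p < 0.001
    return vs, rayleigh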
3.3 Cross-BF Coincidence Functions
Figure 3 shows SCC functions computed from the predicted spatiotemporal responses (Fig. 2), which represent the discharge rate of a model coincidence detecting neuron with two inputs from AN fibers with effective BFs 0.05 octaves above and below each vowel feature. The periodic nature of the responses to F1 can be seen both in quiet and in noise. In contrast, a strong temporal representation of F0 in the fibers near T1 can be seen in quiet, but not in noise. These SCC functions are an example of the precise cross-correlation analyses that are possible with the STMP approach. Because the effective BFs are created by changes in sampling rate, the BF difference is known exactly and is controllable by the experimenter. In contrast, population studies are limited by BF sampling issues and by any inaccuracies in estimating the two BFs.
Fig. 3 SCC functions between effective BFs 0.05 octaves above and below F1 (top) and T1 (bottom). In-quiet: left panels; In-noise (SNR = 4 dB): right panels
3.4 The Effect of Noise Level on Coding Schemes
The quantification of spectral coding in terms of SMP functions is shown in the left panels of Fig. 4. Average rate was compared between conditions with F1 and T1 at the AN fiber BF. The ALSR metric was computed from the predicted spatiotemporal responses (Fig. 2), across effective BFs within ± 0.25 octaves of the vowel feature. One simple cross-BF coincidence metric was taken as the value of the SCC function (Fig. 3) at the characteristic delay (i.e., maximum coincidence rate closest to zero delay). This metric was motivated by the cross-correlation model of Deng and Geisler (1987). The robustness of spectral coding was evaluated based on the slopes of the SMP functions as a function of noise level (right panels). Although this AN fiber had excellent rate-based spectral coding for the 50-dB SPL vowel in quiet, rate coding degraded significantly as noise was added. The SMP slope for rate dropped to one half the in-quiet slope at an SNR of 10 dB.
Fig. 4 Spectral coding versus noise level. The left panels show rate, ALSR, and the coincidence metrics versus vowel feature level (F1=50 dB SPL, T1=20 dB SPL) for each noise condition. The right panels show the SMP-function slopes versus noise level. The in-quiet values are shown by filled squares. The vertical dotted lines indicate the SNR at which each metric degraded to one half the SMP slope value in quiet
Fig. 5 SNR at which spectral coding degraded. Down (up) arrows on filled symbols indicate the lowest (highest) SNR measured for AN fibers for which SMP slope never (always) fell below half the in-quiet value (e.g., ALSR data in Fig. 4). Squares: high-SR; Triangles: low/med-SR
In contrast, coding based on ALSR was not affected over this noise range. Spectral coding based on this implementation of a cross-BF coincidence detector degraded with noise level, with the F1 response being slightly higher than the T1 response in quiet and for the lowest noise level, but equal at the two highest noise levels. This was the general pattern observed across most units for the cross-BF coincidence SMP functions. A comparison of the relative degradation of each spectral coding scheme with noise is shown in Fig. 5 for the AN fibers for which full STMP data were collected over this range of noise levels. All data presented in Fig. 5 represent fibers from which at least 11 repetitions of all 80 conditions were measured (average of 17 reps). All AN fibers studied showed similar trends, including those for which fewer repetitions were measured (not shown). For each fiber, the SNR was computed at which spectral coding (as quantified by the SMP slope) degraded to half the value in quiet (see dotted lines in Fig. 4). The rate and cross-BF coincidence schemes both typically degraded at positive SNRs, whereas spectral coding based on ALSR degraded for negative SNRs. The 20-dB range of noise levels was chosen in most cases to cover the range over which rate coding degraded (Fig. 1), and thus it was often the case that ALSR did not degrade over the range of noise levels studied (solid symbols in Fig. 5).
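The degradation metric used in Figs. 4 and 5 (the SMP slope, and the SNR at which it falls to half its in-quiet value) can be sketched as follows; the variable names and the interpolation are illustrative, since the study does not state how the half-slope SNR was determined:

# SMP slope between the F1 and T1 conditions, and the SNR at which it drops to half
# its in-quiet value (linear interpolation across the measured SNRs).
import numpy as np

def smp_slope(metric_f1, metric_t1, level_f1_db=50.0, level_t1_db=20.0):
    return (metric_f1 - metric_t1) / (level_f1_db - level_t1_db)

def half_slope_snr(snrs_db, slopes, slope_in_quiet):
    """snrs_db: measured SNRs; slopes: SMP slope at each SNR (assumed to fall with noise)."""
    order = np.argsort(slopes)
    return float(np.interp(0.5 * slope_in_quiet,
                           np.asarray(slopes)[order], np.asarray(snrs_db)[order]))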
4 Discussion
Spectral coding based on rate degraded in all fibers as noise level increased, even low-SR fibers with robust encoding in quiet. The ALSR metric was more robust, often remaining unaffected at SNRs 20 dB lower than rate. The one simple cross-BF coincidence mechanism evaluated here also was much less
robust than ALSR, despite its similarity to the Deng and Geisler (1987) and Shamma (1985) models. It is possible that the most robust spatiotemporal information exists between adjacent BFs that are not centered exactly at the formant frequency. Another factor that requires further study is the confounding effect that differences in average discharge rate have on the ability of a coincidence detector to decode differences in across-BF temporal patterns. Differences in the ALSR values and SCC function shapes between formant and trough responses in noise indicate that robust spatiotemporal information does exist for spectral coding of vowels. Alternative spatiotemporal mechanisms for decoding this spectral information can be evaluated based on AN data using the STMP and SCC functions. Spatiotemporal mechanisms have also recently been proposed to be useful for the detection of tones in noise (Carney et al. 2002) and for pitch coding of complex tones (Cedolin and Delgutte 2007).
Acknowledgments. Supported by NIH grant R03DC007348. Data were collected in the lab of Eric Young, who also provided invaluable support. Diana Ma helped with data collection.
References
Carney, L.H., Heinz, M.G., Evilsizer, M.E., Gilkey, R.H. and Colburn, H.S. (2002) Auditory phase opponency: A temporal model for masked detection at low frequencies. Acustica – Acta Acustica 88, 334-347.
Cedolin, L. and Delgutte, B. (2007) Spatio-temporal representation of the pitch of complex tones in the auditory nerve. In B. Kollmeier, G. Klump, V. Hohmann, U. Langemann, M. Mauermann, S. Uppenkamp and J. Verhey (eds.), Hearing – From Sensory Processing to Perception. Springer Verlag, Berlin, pp. 61-70.
Conley, R.A. and Keilson, S.E. (1995) Rate representation and discriminability of second formant frequencies for /ε/-like steady-state vowels in cat auditory nerve. J. Acoust. Soc. Am. 98, 3223-3234.
Delgutte, B. and Kiang, N.Y. (1984) Speech coding in the auditory nerve: V. Vowels in background noise. J. Acoust. Soc. Am. 75, 908-918.
Deng, L. and Geisler, C.D. (1987) A composite auditory model for processing speech sounds. J. Acoust. Soc. Am. 82, 2001-2012.
Heinz, M.G. (2005) Spectral coding based on cross-frequency coincidence detection of auditory-nerve responses. Assoc. for Res. in Otolaryngology Abstracts 28, 27.
Heinz, M.G. and Young, E.D. (2004) Response growth with sound level in auditory-nerve fibers after noise-induced hearing loss. J. Neurophysiol. 91, 784-795.
Joris, P.X. (2003) Interaural time sensitivity dominated by cochlea-induced envelope patterns. J. Neurosci. 23, 6345-6350.
Joris, P.X., Van de Sande, B., Louage, D.H. and van der Heijden, M. (2006) Binaural and cochlear disparities. Proc. Natl. Acad. Sci. USA 103, 12917-12922.
May, B.J., Huang, A., Le Prell, G. and Hienz, R.D. (1996) Vowel formant frequency discrimination in cats: Comparison of auditory nerve representations and psychophysical thresholds. Aud. Neurosci. 3, 135-162.
Palmer, A.R. (1990) The representation of the spectra and fundamental frequencies of steady-state single- and double-vowel sounds in the temporal discharge patterns of guinea pig cochlear-nerve fibers. J. Acoust. Soc. Am. 88, 1412-1426.
Sachs, M.B. and Young, E.D. (1979) Encoding of steady-state vowels in the auditory nerve: Representation in terms of discharge rate. J. Acoust. Soc. Am. 66, 470-479.
Sachs, M.B., Voigt, H.F. and Young, E.D. (1983) Auditory nerve representation of vowels in background noise. J. Neurophysiol. 50, 27-45.
Shamma, S.A. (1985) Speech processing in the auditory system. II: Lateral inhibition and the central processing of speech evoked activity in the auditory nerve. J. Acoust. Soc. Am. 78, 1622-1632.
Young, E.D. and Sachs, M.B. (1979) Representation of steady-state vowels in the temporal aspects of the discharge patterns of populations of auditory-nerve fibers. J. Acoust. Soc. Am. 66, 1381-1403.
Zhang, X., Heinz, M.G., Bruce, I.C. and Carney, L.H. (2001) A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. J. Acoust. Soc. Am. 109, 648-670.
Comment by Ghitza
Your study is restricted to measurements in anaesthetized cats, with the efferent system not operating. For awake cats, it may very well be that a rate strategy is sufficient to provide the degree of robustness-to-noise adequate for predicting human performance (e.g. in perceiving degraded speech; see “Towards predicting consonant confusions of degraded speech”, Ghitza et al., this volume).
Reply
Rate-based coding may be more robust in awake cats with a functioning efferent system. The key question is the degree to which the efferent system improves rate-based coding. The present data suggest that the efferent system needs to improve the robustness of rate-based coding by 20 dB to match temporal-based coding. Modelling studies, such as Ghitza et al. (this volume), are important for demonstrating the potential of the efferent system to improve speech understanding in noise. Unfortunately, experimental data to quantify the true extent to which the efferent system improves speech understanding in noise do not exist. The most relevant data are from May, LePrell, and Sachs (1998), who showed limited evidence suggesting that rate-based coding of vowels in quiet by high-SR primary-like units in the ventral cochlear nucleus (VCN) is more robust in awake cats than in barbiturate-anesthetized cats. However, overall rate-based coding of vowels in the VCN was quite similar between awake and anesthetized cats, suggesting that the effects of anaesthesia on vowel coding in the auditory periphery are small.
References
May, B.J., Le Prell, G.S. and Sachs, M.B. (1998) Vowel representations in the ventral cochlear nucleus of the cat: Effects of level, background noise, and behavioral state. J. Neurophysiol. 79, 1755-1767.
13 Role of Peripheral Nonlinearities in Comodulation Masking Release
JESKO L. VERHEY AND STEPHAN M.A. ERNST
AG Neurosensorik, Institut für Physik, Fakultät V, Carl von Ossietzky Universität Oldenburg, 26111 Oldenburg, Germany, [email protected], [email protected]
1 Introduction
The detection of a signal in the presence of a masker at the signal frequency (on-frequency masker, OFM) is enhanced when one or more additional off-frequency maskers (flanking band, FB) are presented, but only if the FB and OFM are coherently modulated. This phenomenon, known as comodulation masking release (CMR), has been traditionally attributed to across-channel processing. However, it was also argued that part of the effect might be due to the processing within the auditory channel at the signal frequency. Within-channel effects were usually discussed in relation to a possible excitatory interaction of FB, OFM and signal within the auditory filter at the signal frequency. However, the FB might also suppress the excitation evoked by the OFM (Oxenham and Plack 1998). Suppression can also be regarded as a within-channel cue, since it is an effect related to the nonlinear response of the auditory filter centred at the signal frequency (Ernst and Verhey 2005). The first two experiments of the present study investigated the role of suppression in CMR experiments with large spectral distances between OFM and FB. CMR was measured with various combinations of level and centre frequencies of OFM and FB. In order to determine the amount of CMR due to nonlinear properties of the basilar membrane, the data are simulated with a suppression model. The model is a modified version of the model proposed by Plack et al. (2002). They showed that a combination of the dual-resonance nonlinear (DRNL) filter (Meddis et al. 2001) and a temporal window (TW, e.g. Oxenham 2001) was able to describe two-tone suppression as observed in psychoacoustical experiments. In addition to the simulations with a within-channel model, an experiment was performed that was hypothesized to distinguish between within-channel and across-channel processes in CMR experiments (Dau et al. 2005). Grose and Hall (1993) showed that onset asynchrony can abolish CMR. Dau et al. (2005) extended the experiment of Grose and Hall (1993) using different spectral distances between the OFM and the FBs. They
showed the same effect as Grose and Hall (1993) for large spectral distances between FB and OFM but they found no effect for a small spectral distance between the masker components. They argued that for the small spectral distance CMR is due to a within-channel process on an early stage of the auditory pathway. This process would not be affected by the detrimental effect of onset asynchrony on the perceptual fusion of the masker components, which is presumably a higher level process. The third experiment investigates if the CMR for large frequency separations between FB and OFM is robust against onset asynchrony when the simulations with the suppression model indicate that the CMR is a consequence of the peripheral nonlinearity.
2 Suppression Model
The first stage of the model is a combined outer and middle ear filter as used in Breebaart et al. (2001). A low-level noise is added to the output of the filter to approximate the threshold in quiet. The following two stages of the model (DRNL and temporal window) are essentially the same as proposed in Plack et al. (2002). The filter is divided into a linear pathway and a nonlinear pathway. The linear pathway consists of a gammatone filter followed by a low-pass filter. The non-linear pathway consists of a gammatone filter, a compressive nonlinearity and a second gammatone filter. The nonlinear pathway has a gain relative to the linear pathway. The input is processed in parallel through both pathways and then added. In general, all DRNL parameters were taken from Plack et al. (2002). In contrast to Plack et al. (2002), a fourth order gammatone filter was used in the linear pathway and for the second filter of the nonlinear pathway. The output is squared and then passed through the temporal window. The window comprises three exponential functions, one to describe backward masking, and two to describe forward masking. All parameters for the temporal window were taken from Oxenham (2001), which showed the best fit between his data and predictions with the same compression values for the nonlinearity as used by Plack et al. (2002). The decision variable is the quotient of the maximum intensity of the whole temporal window output from the masker plus signal interval and the maximum intensity of the two masker-only intervals. To determine a threshold the decision variable has to exceed the parameter k. The value of k was set to 1.2 in order to match measured and simulated threshold for the reference condition. To determine the threshold with the model the same procedure was used as in the experiment. The final threshold estimate was taken as the mean of 10 threshold estimates. The first stage, the decision device and the derivation of the thresholds differ from the model proposed by Plack et al. (2002) but proved to be necessary for the stochastic stimuli used in the present study.
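A minimal sketch of the decision stage described above is given below; the DRNL and temporal-window stages are assumed to exist as a front end producing an intensity-like output over time, and the function and variable names are placeholders rather than the published model code:

# Decision stage: the maximum of the temporal-window output in the masker-plus-signal
# interval, divided by the larger maximum of the two masker-only intervals, must exceed k.
import numpy as np

K = 1.2  # criterion matched to the reference-condition threshold

def signal_detected(tw_signal_interval, tw_masker_1, tw_masker_2, k=K):
    """Each argument: temporal-window output (intensity over time) for one interval."""
    decision = np.max(tw_signal_interval) / max(np.max(tw_masker_1), np.max(tw_masker_2))
    return decision > k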
3 Methods
All stimuli were generated digitally with a sampling rate of 44.1 kHz, D/A converted (RME ADI-8 DS) and amplified (Tucker-Davis HB7). The stimuli were presented to both ears through headphones (Sennheiser hd580 for the first and third experiments and Sennheiser hda200 for the second experiment). The frequency of the sinusoidal signal differed in the experiments. The signal duration was 250 ms including 50-ms raised-cosine ramps, and the signal was temporally centred in the masker. Depending on the masking condition, the masker was composed of one or two 20-Hz-wide noise bands. If not specified otherwise, the masker duration was 500 ms including 50-ms raised-cosine ramps. Each noise band was created by multiplying a sinusoidal carrier with a 10-Hz low-pass-filtered noise extending down to 2 Hz. For each stimulus presentation new noise bands were computed. The signal threshold was determined for two conditions. The signal was always in phase with the sinusoidal carrier of the masker. In the comodulated condition, the masker was composed of an OFM and an FB with the same envelope, obtained by using the same low-pass filtered noise for OFM and FB. In the reference condition, only the OFM was present. The centre frequency of the noise bands and their levels differed in the experiments. A three-alternative forced-choice procedure with adaptive signal-level adjustment was used to determine the masked threshold of the sinusoidal signal. In general, the three intervals in a trial were separated by gaps of 500 ms. The signal was added to one of these intervals. Subjects had to indicate which of the intervals contained the signal. Visual feedback was provided after each response. The signal level was adjusted according to a two-down one-up rule to estimate the 70.7% detection threshold. The initial step size was 8 dB. After every second reversal the step size was halved until a step size of 2 dB was reached. The run was then continued for another six reversals. From the level of these last six reversals the mean was calculated and used as an estimate of the threshold. Four threshold estimates were collected for each condition. The final threshold value for that condition was taken as the mean of the four threshold estimates. Normal-hearing subjects, varying in age from 23 to 35 years, participated in the experiments. All subjects had thresholds ≤15 dB HL (ISO 8253-1, 1989) at octave frequencies from 0.125 to 8.0 kHz. They completed practice trials in CMR experiments before data collection.
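A sketch of the masker-band recipe described above is given below. The FFT-based band limitation of the modulator noise is only one possible implementation, since the exact filter is not specified; in the comodulated condition the same modulator is reused for OFM and FB, as stated in the text:

# 20-Hz-wide masker band: a 2-10 Hz modulator noise multiplied with a sinusoidal carrier.
import numpy as np

def lowpass_modulator(dur_s=0.5, fs=44100, rng=np.random.default_rng()):
    n = int(dur_s * fs)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, 1.0 / fs)
    spec[(f < 2.0) | (f > 10.0)] = 0.0        # modulator noise limited to 2-10 Hz
    return np.fft.irfft(spec, n)

def masker_band(fc_hz, mod, fs=44100):
    t = np.arange(len(mod)) / fs
    return mod * np.sin(2.0 * np.pi * fc_hz * t)   # 20-Hz-wide band centred at fc_hz

mod = lowpass_modulator()
ofm = masker_band(2000.0, mod)   # on-frequency masker at the signal frequency
fb = masker_band(500.0, mod)     # comodulated flanking band (same modulator)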
4 Results and Discussion
4.1 Experiment 1
In the first experiment the dependence of CMR on the centre frequency CFFB and the level of the FB was investigated. The signal frequency was 2 kHz and the OFM had a level of 20 dB SPL. The CFFB varied in the range from four octaves
below the signal frequency to one octave above the signal frequency. The level of the FB was 20, 30, 40, 50, 60, 70, or 80 dB SPL. Only five of the levels of the FB were used for each CFFB, with the highest level producing at least 10 dB less excitation in the auditory filter at the signal frequency than the OFM. The left panel of Fig. 1 shows the mean CMR for 10 listeners, i.e. the difference between the threshold for the reference condition (4 dB re OFM level) and the thresholds for the CM conditions. CMR is plotted as a function of CFFB. Different symbols indicate the CMR for different levels of the FB. Two general trends were observed in the data for CFFB smaller than the signal frequency: (i) CMR increased as the FB level increased and (ii) CMR decreased as the CFFB decreased. The magnitude of CMR was largest (9 dB) for the FB three octaves below the signal frequency and a level of the FB of 80 dB SPL. CMR was generally absent for CFFB higher than the signal frequency (less than 1 dB). Similar trends were observed in Cohen (1991) and Ernst and Verhey (2005). The right panel of Fig. 1 shows the model predictions. In agreement with the data, the predicted CMR decreased with increasing spectral distance between the FB and the OFM and decreasing level of the FB. In general, the model overestimated the CMR for the highest level of the FB and slightly underestimated the CMR for low levels of the FB. The simulations indicate that, even for large spectral distances between the masker components, CMR can still be accounted for by within-channel cues as long as the level of the off-frequency components is high compared to the level of the OFM.
Fig. 1 Mean data (left panel) for ten subjects and model predictions (right panel) as a function of the centre frequency of the flanking band CFFB relative to the centre frequency of the signal-centred band CFOFM (2 kHz); the ordinate gives the CMR in dB. The level of the OFM was 20 dB SPL. The data for the different levels of the FB (20–80 dB SPL) are indicated by different symbols. Error bars indicate plus or minus one standard error
This interpretation is in line with Oxenham and Plack (1998), who suggested suppression as a possible mechanism to account for the CMR for FBs centred at frequencies below the signal frequency. In contrast to the data, a CMR of up to 3 dB was also predicted for CFFB higher than the signal frequency. This is presumably due to the inadequacy of the model to predict suppression in this frequency region (Plack et al. 2002).
4.2 Experiment 2
The first experiment showed a substantial CMR (up to 5 dB) over a four-octave range. The second experiment investigated whether CMR was also obtained for spectral distances between the masker components larger than four octaves. The FB was centred at 125 Hz. The level of the FB was either 60 or 70 dB HL (i.e. 80 or 90 dB SPL). The signal frequency was 2, 4, or 8 kHz, i.e. four, five, or six octaves above CFFB. The level of the OFM was set to 20 dB HL. Figure 2 shows mean data for five subjects (left panel) and model predictions (right panel). Different symbols indicate different levels of the FB. The mean thresholds in the reference condition were the same for the different signal frequencies (4–5 dB re OFM level). The data showed the same trends as in the first experiment. The CMR decreased as the spectral distance between FB and signal increased, and the CMR was larger for the higher level of the FB. A CMR of up to 6 dB was still measured for a spectral distance between FB and OFM of six octaves. In general, the model predictions show the same trends as the data.
Fig. 2 Mean measured CMR (left panel) for five subjects and model predictions (right panel) as a function of CFFB relative to CFOFM; the ordinate gives the CMR in dB. CFFB was always 125 Hz. The OFM level was set to 20 dB HL. The FB level (60 or 70 dB HL) is indicated by different symbols as shown in the legend. Error bars indicate plus or minus one standard error
However, for the FB centred four octaves below the signal frequency, the model overestimates the CMR for the higher level of the FB and underestimates the CMR for the level of the FB of 60 dB HL. For a spectral distance of six octaves, the model predictions are 3 dB lower than the measured CMR for both levels of the FB. This and the failure of the model to predict CMR for lower levels of the FB and for smaller spectral distances might indicate that part of the CMR is a consequence of an across-channel process and that this process operates over a six-octave range.
4.3 Experiment 3
The third experiment investigated whether onset asynchrony between the masker components eliminates CMR in conditions where the model predicts a CMR similar to the measured data. In contrast to the previous experiments, the OFM was gated on and off synchronously with the 250-ms signal, i.e. 125 ms after FB onset and 125 ms before masker offset (fringe condition). For comparison, masked thresholds were also measured for a synchronous condition, where all masker components and the signal were gated on and off synchronously. Only four combinations of level and centre frequency of the FB from the first experiment were considered. The FB was centred either two or three octaves below the signal frequency. For both FB positions the two highest levels of the first experiment were used. In these conditions, the suppression model predicted a substantial CMR.
Fig. 3 Mean measured CMR (dB) for 11 subjects for two CFFB, for a synchronous condition (open symbols), where all masker components were gated on and off synchronously, and for the fringe condition (filled symbols), where the FB was gated on earlier and off later. The OFM level was 20 dB SPL. The FB level (60, 70, or 80 dB SPL) is indicated by different symbols. Error bars indicate plus or minus one standard error
Figure 3 shows the mean CMR for 11 subjects for the synchronous condition (open symbols) and for the fringe condition (filled symbols) for the two frequency separations between FB and OFM. For all combinations of level and centre frequency of the FB, the CMR was similar for the fringe and the synchronous condition, i.e. the CMR was not eliminated by introducing a gating asynchrony between the OFM and the FB, in contrast to what Dau et al. (2005) found for widely spaced masker components. The failure to find an effect of onset asynchrony between FB and OFM might be due to differences in the stimulus parameters. Dau et al. (2005) used more FBs with the same onset and offset, which may enhance the impression of two objects (FBs and OFM) with different gating. In addition, the present study investigated CMR with levels of the FB higher than that of the OFM, where suppression is likely to occur. In contrast, all masker components had the same level in Dau et al. (2005).
5 Summary and Conclusions
CMR was measured and predicted with various combinations of level and centre frequency of OFM and FB. The results and conclusions can be summarized as follows:
1. CMR decreased with increasing spectral distance between OFM and FB and increased with increasing level of the FB. A CMR of up to 6 dB could still be observed for the FB centred six octaves below the OFM. For an FB higher in frequency than the OFM, CMR was considerably smaller than for an FB at the same spectral distance below the OFM.
2. The suppression model predicted the CMR for high FB levels over several octaves but underestimated the CMR for lower FB levels and also the asymmetry between FBs spectrally above and below the OFM.
3. The simulations indicate that even for large spectral distances between the masker components within-channel cues may still play a role in CMR experiments. The failure to eliminate CMR by introducing an onset asynchrony between the masker components further supported this hypothesis.
Acknowledgments. The present work was supported by the DFG. We would like to thank Bastian Epp and Sarah-Charlotta Heidorn for their assistance with the data collection.
References
Breebaart J, van de Par S, Kohlrausch A (2001) Binaural processing model based on contralateral inhibition. I. Model structure. J Acoust Soc Am 110:1074–1088
Cohen MF (1991) Comodulation masking release over a three octave range. J Acoust Soc Am 90:1381–1384
Dau T, Ewert SD, Oxenham AJ (2005) Effects of concurrent and sequential streaming in comodulation masking release. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics and models. Springer, Berlin Heidelberg New York, pp 335–341
Ernst SMA, Verhey JL (2005) Comodulation masking release over a three octave range. Acta Acust united with Acust 91:998–1006
Grose JH, Hall JW (1993) Comodulation masking release: is comodulation sufficient? J Acoust Soc Am 93:2896–2902
Meddis R, O’Mard LP, Lopez-Poveda EA (2001) A computational algorithm for computing nonlinear auditory frequency selectivity. J Acoust Soc Am 109:2852–2861
Oxenham AJ (2001) Forward masking: adaptation or integration? J Acoust Soc Am 109:732–741
Oxenham AJ, Plack CJ (1998) Suppression and the upward spread of masking. J Acoust Soc Am 104:3500–3510
Plack CJ, Oxenham AJ, Drga V (2002) Linear and nonlinear processes in temporal masking. Acta Acust united with Acust 88:348–358
14 Neuromagnetic Representation of Comodulation Masking Release in the Human Auditory Cortex
ANDRÉ RUPP1, LIORA LAS2, AND ISRAEL NELKEN2,3
1 Section of Biomagnetism, Department of Neurology, University of Heidelberg, [email protected]
2 Department of Neurobiology, The Alexander Silberman Institute of Life Sciences, Hebrew University, Jerusalem, [email protected], [email protected]
3 Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem
1 Introduction
The detection of a low-level signal masked by a noisy background can be improved when the noise masker is coherently modulated over a wide frequency range, a psychoacoustical phenomenon referred to as comodulation masking release (CMR; Hall et al. 1984). Tone detection is more efficient when the modulated noise is increased in bandwidth. In vivo recordings in cats of three successive stages of the auditory pathway, namely the inferior colliculus (IC), medial geniculate body (MGB), and the primary auditory cortex (A1), revealed a possible neural correlate of CMR in neurons of A1 (Las et al. 2005). While A1 neurons tend to show some locking to the amplitude modulation of wideband maskers (envelope locking), envelope locking can be diminished markedly by the addition of a low-level tone. This effect is referred to as locking suppression. Although specific activation by low-level tones has already been demonstrated in the cochlear nucleus of guinea pigs (Pressnitzer et al. 2001; Neuert et al. 2004), locking suppression as observed in A1 neurons was hypothesized to enhance the representation of the low-level tone in cortex and to be a correlate of the formation of an auditory object (Las et al. 2005). In the present study, whole-head magnetoencephalography was employed to investigate the specific effects of CMR in the auditory cortex of human listeners and to compare these field recordings to in vivo extracellular recordings of cat primary auditory cortex.
2 Locking Suppression in the Primary Auditory Cortex
The data presented here were collected in four halothane-anesthetized cats using extracellular recordings. Electrophysiological techniques have been described in detail in Bar-Yosef et al. (2002), and the stimuli used here are
the same as in the study of Las et al. (2005). In short, the masker consisted of noise bands centered on the best frequency (BF) of one of the simultaneously recorded neurons, with bandwidths spanning the range from BF/16 to BF Hz. The tone was always at the center of the band, and its level was selected to include masked threshold. The modulation pattern was trapezoidal (10 Hz, 50-ms bursts with 10-ms on-off linear ramps). Tone onset occurred between the second and third noise cycles. Figure 1 shows average responses of both well-separated units and small clusters. These units were tested with the tone and masker in various relationships relative to BF, with the tone at levels comparable to those used in the MEG experiments. About half of the units (20 of 41 units averaged in Fig. 1, bottom) showed significant responses to the tone in quiet. This figure is therefore a rough approximation of the expected average activity in primary auditory cortex when using a fixed set of maskers and tones as done in the MEG experiments. Based on these data, we made two predictions. The first is that a wideband modulated masker should give rise to stronger phase locking in the evoked magnetic fields than narrowband noise (Fig. 1, top).
Fig. 1 Average unit responses (rate in sp/s vs time in ms) in A1 of halothane-anesthetized cats. Top: responses to wideband modulated maskers are larger than responses to narrowband modulated maskers (45 units). Bottom: adding a tone to a modulated masker reduces, but does not abolish, envelope locking (41 units tested at SNRs comparable to those used in the MEG experiment). The arrow marks tone onset
the larger number of neurons presumably activated by the wideband relative to the narrowband masker. The second prediction from the neural data was that the addition of a tone would reduce, but not eliminate, the locking of the magnetic fields to the noise. This was due to our finding of hypersensitive locking suppression in A1 (Las et al. 2005). However, in the data described by Las et al., the tone was always a best-frequency tone. MEG recordings would be influenced by activity evoked in large neuronal populations, extending beyond the population whose BF exactly matches the tone frequency. Figure 1 (bottom) illustrates the average population response to a wideband masker alone and to a masker+tone stimulus. The arrow indicates tone onset. While locking is substantially the same during the first noise cycle following tone onset in the two conditions (as described by Las et al. 2005), it is partially suppressed during the following noise cycles. The remaining locking is presumably due to the presence of neurons with higher masked thresholds and to neurons whose best frequency may be away from the tone frequency, and therefore would be less sensitive to its presence.
3 Neuromagnetic Representation of CMR in Human Listeners
3.1 Methods
Twelve normal-hearing listeners (24–41 years) participated in the neuromagnetic study. In the first condition we used amplitude-modulated noise maskers with a 10-Hz modulation rate. The noise contained six cycles, the first being 25 ms and the others 50 ms long, with 50 ms of silence between each noise burst. In the second condition maskers were presented in a continuous manner. Masking conditions included a broadband (900 Hz) and a narrowband noise (90 Hz) gated using a 12-ms linear window. The test signal was a 500-Hz pure tone, 275 ms in duration including 10-ms linear ramps. It was switched on 250 ms after masker onset. The longer delay between masker onset and tone onset was selected in order to switch the tone on after the main magnetic fields evoked by noise onset had died out. Psychoacoustic thresholds were determined by an adaptive 2AFC procedure in conjunction with a two-down one-up tracking rule to estimate the 70.7%-correct point of the psychometric function. We applied two different signal levels for the MEG experiments, i.e. 5 dB and 15 dB above the average thresholds of all stimulus conditions (derived from the first five participants of the psychoacoustic task). The resulting signal-to-noise ratios were kept constant for all MEG recordings. The broadband masker level was 70 dB SPL. Sounds were presented diotically via ER-3 earphones (Etymotic Research Inc.) connected to 90-cm plastic tubes and foam earpieces.
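For reference, the two-down one-up rule used here converges on the 70.7%-correct point of the psychometric function. The sketch below, in which the listener is replaced by a simulated observer, only illustrates the logic of such a track; the step size, stopping rule and simulated psychometric function are arbitrary assumptions and not the exact settings of the experiment.

```python
import numpy as np

def two_down_one_up(respond, start_level=20.0, step_db=2.0, n_reversals=10):
    """Minimal 2AFC two-down one-up staircase (tracks the 70.7%-correct point).
    `respond(level)` must return True for a correct response; here it stands
    in for the listener.  Threshold = mean of the later reversal levels
    (one common convention, not necessarily the one used in this study)."""
    level, n_correct, direction, reversals = start_level, 0, 0, []
    while len(reversals) < n_reversals:
        if respond(level):
            n_correct += 1
            if n_correct == 2:                       # two correct in a row -> harder
                n_correct = 0
                if direction == +1:
                    reversals.append(level)
                direction = -1
                level -= step_db
        else:                                        # one error -> easier
            n_correct = 0
            if direction == -1:
                reversals.append(level)
            direction = +1
            level += step_db
    return np.mean(reversals[2:])

# Simulated observer with a "true" threshold near 10 dB (2AFC guessing rate 0.5)
rng = np.random.default_rng(0)
observer = lambda lvl: rng.random() < 0.5 + 0.5 / (1.0 + np.exp(-(lvl - 10.0)))
print(two_down_one_up(observer))
```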
The auditory evoked magnetic fields (AEFs) were recorded using a Neuromag-122 whole-head system with a sampling rate of 1000 Hz (bandwidth: DC–330 Hz) and an interstimulus interval of 1000 ms. About 350 sweeps were averaged per condition. During the MEG recordings subjects were watching a silent movie and were instructed not to attend to the sounds. Spatio-temporal source analysis (Scherg 1990) was performed using the BESA® 5.1 software package (MEGIS Software GmbH). Two source models with one equivalent dipole in each hemisphere were used to analyse the CMR-related responses. The first source model was based on the initial deflection of the pooled conditions to analyse the Pam. The specific N1m response elicited by the test tone embedded in the noise masker was modeled using the N1m deflection of the difference waveforms (tone+masker minus masker of the unmodulated wideband condition). Dipole solutions of both models were held fixed and used as a spatial filter to derive the source waveforms for each of the 12 stimulus conditions separately. Grand-average source waveforms were computed for each condition. Difference waveforms were computed to derive the specific response elicited by the test tones. The specific representation of the middle latency components was analyzed using a bandpass filter ranging from 15 to 100 Hz. The analysis of the late AEFs was based on lowpass-filtered data (0.001–30 Hz, zero-phase filter). The significance of the amplitudes was assessed by a permutation test for waveform differences (Blair and Karniski 1993). The output of this procedure is a single multivariate statistic, denoted as tsum.
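The permutation test of Blair and Karniski (1993) assesses waveform differences by computing pointwise t-values over the analysis window, collapsing them into a single statistic, and comparing it with a null distribution obtained by randomly exchanging the condition labels within subjects. The sketch below implements one common variant of this idea with a summed t-statistic; the array layout, function names and number of permutations are illustrative assumptions, not the exact analysis pipeline used here.

```python
import numpy as np
from scipy import stats

def tsum_permutation_test(cond_a, cond_b, n_perm=5000, seed=0):
    """Paired permutation test on waveform differences (cf. Blair and Karniski 1993).
    cond_a, cond_b: arrays of shape (n_subjects, n_samples) holding the source
    waveforms of two conditions.  Returns the observed summed t-value (tsum)
    and its two-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    diff = cond_a - cond_b                                # paired differences

    def tsum(d):
        t = stats.ttest_1samp(d, 0.0, axis=0).statistic   # pointwise t-values
        return t.sum()

    observed = tsum(diff)
    null = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        null[i] = tsum(diff * flips)                      # randomly flip subject signs
    p_value = np.mean(np.abs(null) >= np.abs(observed))
    return observed, p_value
```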
3.2 Results and Discussion
In all subjects, both dipole sources were localized bilaterally within or near Heschl's gyrus. In both modulated masker conditions, the bandpass-filtered (10–100 Hz) dipole source waveforms showed a highly similar pattern in the left and right auditory cortex, with a Pam-P1m complex evoked by each noise burst. Due to the similarity of the Pam responses, the comparisons of this component were carried out using the average of the left and right waveforms (Fig. 2). The simultaneous comparison of all Pam deflections evoked by the six noise cycles showed significantly higher amplitudes in the wideband condition (tsum = 165.9, P<0.005). A significant suppression of the Pam locking to the bursts after tone onset was found for the fifth and sixth responses in the +15 dB condition (fifth burst: tsum = −33.8, P<0.05; sixth burst: tsum = −35.7, P<0.05), but was not observed for the first burst after tone onset (tsum = 0.36, n.s.), nor in the +5 dB wideband condition and the two narrowband conditions. Thus, the Pam, which is supposed to represent specific activation of the primary auditory cortex (Liègeois-Chauvel et al. 1994), exhibited behaviour comparable to that found in extracellular recordings of cat primary auditory cortex, at least for tones 15 dB above masked threshold.
Fig. 2 Top panel (modulated wideband masker): locking of the Pam and its suppression after low-level tone onset (grand-average waveforms). The grey lines at the bottom of each panel indicate the temporal position of the masker and tone. Note the reduction of the Pam component evoked by the fifth and sixth noise bursts when a tone (15 dB above threshold) is added to the modulated masker (grey arrows). Bottom panel (modulated narrowband masker): locking suppression is not found for low-level tones
In order to demonstrate these effects more directly, we analysed the gradiometer data of a single subject (AR). The two 100-ms long periods following the first noise burst after tone onset (fifth and sixth noise cycles) were used. Modulation of the detrended signal was tested by a one-way ANOVA with the lag in the period as the factor and the signal during the two periods as repeated measurements (as in Nelken et al. 2004). F-values above 2 were considered significant in order to compensate for multiple comparisons. The average modulation pattern tended to have a rather similar shape in the responses of all gradiometers that were significantly modulated by the noise, except for possible sign switches. The polarity of the data from all gradiometers was therefore adjusted such that the main peak around 40 ms in the response to the modulated masker alone was positive. The results of this analysis are shown in Fig. 3. The black lines represent the modulation pattern in response to the masker alone, averaged across all gradiometers with significant modulation (F > 2). The error bars represent
Fig. 3 Analysis of single-subject gradiometer data (magnetic field in fT/cm vs time in ms; wideband and narrowband maskers, masker alone, +15 dB and +5 dB). Average modulation patterns across all gradiometers showing significant modulation locked to the noise periodicity, based on all noise periods following the fifth and sixth periods
one SEM (estimated from the variance across gradiometers, averaged across all conditions). The modulation depth in the responses to the wideband modulated masker was significantly larger than in the responses to the narrowband masker. Furthermore, the addition of the tone reduced the depth of modulation. In order to test for the changes in responses due to the presence of the tone, the squared modulation patterns were analyzed in a three-way ANOVA (condition × lag × gradiometer). All three factors had a highly significant main effect. This analysis confirms, at the level of the magnetic fields, the results of the source modeling presented above. In addition to the early response components, the magnetic fields showed late response components locked to the tone onset. The analysis showed the emergence of an N1m preceded by a small P1m (Fig. 4). These late responses had three important properties: first, the resulting tone-evoked N1m was larger for tones at about 15 dB above threshold compared to the +5 dB conditions. Second, there was a clear hemispheric asymmetry (data not shown), and the larger right hemisphere responses were significant for all conditions except for the low-level tone in the modulated narrowband condition (tsum = 448.4, P = 0.66). Third, the N1m evoked by the +15 dB tone was much larger in both unmodulated masking conditions, presumably due to the much higher absolute level of that tone.
Fig. 4 Grand-average source waveforms (dipole moment in nAm vs time in ms) of the two-dipole model of the tone-evoked N1m for all conditions (modulated and unmodulated wideband and narrowband maskers; masker alone, +15 dB and +5 dB tones). Due to the similarity of the difference wave morphology of the left and right hemisphere data, the average of both hemispheres is shown. Note the increase in negativity at about 400 ms evoked by the test tone
4 Conclusions
The presence of envelope locking and locking suppression in the cat A1 suggested the presence of similar findings in the evoked magnetic fields in humans. The main result of this paper is the confirmation of the resulting predictions. Indeed, the middle latency AEF demonstrated locking suppression for the fifth and sixth noise bursts in the wideband condition. Since these are the second and third noise bursts following tone onset, this finding mirrors exactly the data of Las et al. (2005), as illustrated here in Fig. 1. In addition to the locking suppression of the early responses, we identified another response component, the N1m wave locked to tone onset. This component was significant even with very low-level tones, just 5 dB above psychoacoustic threshold, at least in the right hemisphere. This component did not have a correlate in the neural data from cat auditory cortex. There are two possible interpretations of this finding: the cat may not show an N1 wave locked to tone onset, or alternatively the N1 is not elicited in A1. Because of the high correlation between this component and human perception, we speculate that it is related to the appearance of a new auditory object in the sound. Acknowledgement. This work was supported by grants from the Israeli Science Foundation and the Human Frontiers Science Program.
References
Bar-Yosef O, Rotman Y, Nelken I (2002) Responses of neurons in cat primary auditory cortex to bird chirps: effects of temporal and spectral context. J Neurosci 22:8619–8632
Blair RC, Karniski W (1993) An alternative method for significance testing of waveform difference potentials. Psychophysiology 30:518–524
Hall JW, Haggard MP, Fernandes MA (1984) Detection in noise by spectro-temporal pattern analysis. J Acoust Soc Am 76:50–56
Las L, Stern EA, Nelken I (2005) Representation of tone in fluctuating maskers in the ascending auditory system. J Neurosci 25:1503–1513
Liègeois-Chauvel C, Musolino A, Badier JM, Marquis P, Chauvel P (1994) Evoked potentials recorded from the auditory cortex in man: evaluation and topography of the middle latency components. Electroencephalogr Clin Neurophysiol 92:204–214
Nelken I, Bizley JK, Nodal FR, Ahmed B, Schnupp JW, King AJ (2004) Large-scale organization of ferret auditory cortex revealed using continuous acquisition of intrinsic optical signals. J Neurophysiol 92:2574–2588
Neuert V, Verhey JL, Winter IM (2004) Responses of dorsal cochlear nucleus neurons to signals in the presence of modulated maskers. J Neurosci 24:5789–5797
Pressnitzer D, Meddis R, Delahaye R, Winter IM (2001) Physiological correlates of comodulation masking release in the mammalian ventral cochlear nucleus. J Neurosci 21:6377–6386
Scherg M (1990) Fundamentals of dipole source analysis. In: Grandori F, Hoke M, Romani GL (eds) Auditory evoked magnetic fields and electric potentials, advances in audiology, vol 6. Karger, Basel, pp 40–69
Comment by Verhey Wouldn’t your N100 data (shown in Fig. 4) indicate that thresholds should be lower in the unmodulated condition than in the corresponding modulated condition, in contrast to the psychophysical data? Reply The tone levels we used were set relative to the average masked thresholds in each condition. The +5 dB tones in the modulated condition has an absolute level which was more than 30 dB below that of the +5 dB tone in the unmodulated condition. Thus, the N100 data is in fact fully consistent with the psychophysical data. When the tones were above energy detection threshold (the +15 dB unmodulated conditions), the data indicates indeed a steep increase in N100 amplitude which may reflect non-linear effects.
15 Psychophysically Driven Studies of Responses to Amplitude Modulation in the Inferior Colliculus: Comparing Single-Unit Physiology to Behavioral Performance
PAUL C. NELSON AND LAUREL H. CARNEY
1 Introduction
Psychophysical envelope processing has received renewed attention, largely in response to the success of the modulation filterbank model (Dau et al. 1997) in predicting perceptual data that are difficult to account for with the classical modulation low-pass filter (Viemeister 1979). In many behavioral studies, the stimulus modulation depth (m) is adaptively varied to determine thresholds for detection and discrimination of amplitude-modulated (AM) signals. To investigate directly and quantitatively the role of the inferior colliculus (IC) in the processing of such sounds, we recorded single-unit extracellular responses to AM stimuli with a wide range of modulation depths (from below psychoacoustic detection thresholds to 100% modulation), and in some neurons with resolution in depth finer than behavioral discrimination performance (~1–2 dB). Neural responses were quantified in terms of average firing rate and synchronization to the envelope; the significance of changes in the different metrics was determined by the slope of modulation depth functions (MDFs) and the across-repetition variability of the given response quantification (Nelson 2006). Similar approaches have been used to relate auditory-nerve (AN) responses to audio-frequency tone detection and intensity discrimination (Siebert 1968; Young and Barta 1986; Delgutte 1987; Viemeister 1988; Colburn et al. 2003). The most sensitive (high spontaneous-rate) AN fibers exhibit rate changes at tone levels in line with psychophysical detection thresholds (e.g. Young and Barta 1986), while fine-structure phase-locking can emerge in fibers with low characteristic frequencies at SPLs 10–20 dB lower than those required to elicit a rate increase (Johnson 1980). Intensity discrimination thresholds at mid to high SPLs are more difficult to account for with single-fiber average-rate analyses because of saturation and increased spike-count variance at high SPLs (Colburn et al. 2003). Several schemes have been proposed to offset these effects, including (1) pooling of
rate responses (Delgutte 1987; Viemeister 1988), (2) spread of excitation (Siebert 1968; Heinz et al. 2001), and (3) the use of level-dependent response phase for low-frequency AN fibers (Colburn et al. 2003). AM detection and discrimination performance predicted with central neural responses suggests a fundamentally different representation of modulation depth in the IC as compared to the representation of tone level in the AN. Specifically, at low modulation depths, changes in average rate cannot account for perceptual thresholds, while synchrony to the envelope can be significant at these low depths. In contrast, for higher modulation depths, rate-based MDFs are not as prone to saturation, whereas vector strength tends to saturate in most neurons and phase is not systematically depth-dependent (Krishna and Semple 2000; Nelson 2006). These properties of the synchronized response are inconsistent with behavioral depth-discrimination thresholds, which remain approximately constant above about −25 dB (Ewert and Dau 2004). In other words, a qualitative description of IC responses to AM suggests a transition from a temporal code at low depths to a rate-based code at high depths. The current analysis was designed to further characterize IC responses with two specific encoding strategies in mind. The first hypothesis is that rate-based neural detection thresholds might improve by pooling information across a number of cells. The second hypothesis is that a single-neuron decision statistic incorporating both rate and temporal information will predict thresholds across a wider portion of the perceptually relevant range of AM depths than rate or synchrony alone.
2 Methods
Detailed descriptions of the experimental methods are available elsewhere (Batra et al. 1989; Nelson 2006). Briefly, single-unit extracellular recordings over a wide range of AM depths were obtained in 152 single neurons in the ICs of three awake female Dutch-belted rabbits (Oryctolagus cuniculus). Glass-coated tungsten microelectrodes were positioned with a stereotaxic system mounted on a steel cylinder that was affixed to the animal's skull. Parameters of the AM stimuli were designed for each neuron based on responses to a battery of simpler sounds. The best-frequency-tone carrier had an SPL on the ascending portion of the rate-level function. Sinusoidal AM was at the modulation frequency (fm) that elicited the largest value of synchronized rate (the product of vector strength and average rate) over a range of fm from 2 to 312 Hz. This fm almost always corresponded to the frequency that resulted in the highest value of the Rayleigh statistic (2nVS², where n is the number of spikes and VS is the magnitude of vector strength at fm). Three repetitions of 2-s AM tones were presented at each tested frequency and depth. For statistical purposes, the responses were
analyzed in 500-ms segments, omitting the initial 500 ms, resulting in nine estimates of each response metric at each fm and m. Neglecting the onset response avoided the calculation of significant synchrony to the AM period based solely on a transient burst at stimulus onset, but did not strongly affect the quantifications of most neural responses, which usually exhibited little temporal adaptation to AM stimulation. A t-test was used to establish significant differences in rate across variations in AM depth. The lowest modulation depth that resulted in a significant value of vector strength (Rayleigh statistic >13.8, p<0.001) was defined as the synchrony threshold. Statistical analyses involving comparisons of two different synchrony measures took advantage of a transformation [–ln(1–VS)] that results in uniform variance of VS (Johnson 1974; Joris et al. 1994).
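As a concrete illustration of these response metrics, the sketch below computes vector strength from a set of spike times, the Rayleigh statistic 2nVS² with the >13.8 significance criterion used here, and the variance-stabilising transform −ln(1−VS). The simulated spike train is only a stand-in for recorded data.

```python
import numpy as np

def vector_strength(spike_times, fm):
    """Vector strength of spike times (in s) relative to a modulator at fm (Hz)."""
    phases = 2.0 * np.pi * fm * np.asarray(spike_times)
    return np.abs(np.mean(np.exp(1j * phases)))

def synchrony_metrics(spike_times, fm):
    """Return VS, the Rayleigh statistic 2nVS^2, and the transform -ln(1 - VS)."""
    vs = vector_strength(spike_times, fm)
    n = len(spike_times)
    rayleigh = 2.0 * n * vs ** 2      # values > 13.8 taken as significant (p < 0.001)
    return vs, rayleigh, -np.log(1.0 - vs)

# Toy example: 200 spikes loosely phase-locked to a 112-Hz modulator
rng = np.random.default_rng(1)
spikes = (np.arange(200) + 0.15 * rng.standard_normal(200)) / 112.0
print(synchrony_metrics(spikes, 112.0))
```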
3 Results
3.1 Example Single-Unit Responses
Rate- and synchrony-MDFs for one IC neuron are shown in Fig. 1, with period histograms of the spike times over a wide range of AM depths. Human AM detection thresholds for similar stimulus conditions approach −30 dB (e.g. Kohlrausch et al. 2000). Filled circles correspond to neural detection thresholds, or the first values of rate or synchrony that were significantly different from the response at the lowest tested depth (−35 dB). This cell had a rate threshold of −15 dB and a synchrony threshold of −20 dB. We consistently observed lower thresholds based on synchrony than on rate (93% of synchrony
Fig. 1 Single-neuron response dependence on AM depth in terms of average rate, strength of synchrony, and raw period histograms (plotted twice for visual clarity). Error bars represent standard deviations across repetitions. Stimulus fm was 112 Hz, and the carrier was a 60-dB SPL, 3900-Hz tone
thresholds were equal to or lower than their rate-based counterparts; Nelson 2006). For supra-threshold AM depths, this cell's average rate increased monotonically over the remaining 15-dB dynamic range. In contrast, its VS was effectively constant for depths higher than −10 dB; the period histograms illustrate the emergence of a bimodal distribution of spike times that contributed to the saturation of this timing metric. Figure 1 also shows the across-repetition variability of rate and synchrony estimates as a function of stimulus m. Rate estimates were slightly more variable at higher driven rates (and higher m) in this neuron than at lower rates, while the variability of VS showed the opposite trend. The enhanced precision of VS at high depths was mainly a result of the increase in the number of spikes, as opposed to a transition into the compressive region of the metric's dynamic range. Further analysis (not shown) supported this notion: mean VS did not change, but its variability increased, when spikes were deleted to match the average rates at different modulation depths.
3.2 Relationship Between Mean Count and Count Variance
One strategy used in comparisons of AN responses to intensity discrimination psychophysical thresholds is the simulation of a population of fibers based on assumed rate functions and dependence of spike count variance on the mean spike count (e.g. Delgutte 1987; Viemeister 1988). This approach provides an estimate of the number of fibers with a stereotypical rate function that would be required to account for psychophysical performance, and is justifiable for peripheral responses because of the systematic variability of count estimates in the AN (Young and Barta 1986; Winter and Palmer 1991): count variance is slightly less than that expected from a Poisson process. Here, we show that a similar strategy is not appropriate in the IC because rate variability is not predictable from the number of spikes. Count variance as a function of the mean count for the population of neurons tested with a range of modulation depths (1168 observations) is shown in Fig. 2. The thin line in Fig. 2 indicates the dependence that would be expected from a Poisson process, while the thicker lines represent the best linear fit to the data (in a minimized sum-of-squares error sense). The slope of the best-fit line is approximately 0.8 (consistent with Hancock and Delgutte 2004); however, the raw data make it clear that a simple relationship between variance and count does not adequately describe these data. In contrast to more peripheral auditory neurons (see above) and visual cortical neurons (e.g. Geisler and Albrecht 1997), the spike-count variance of IC neurons in the awake rabbit is clearly not proportional to the mean count. An inspection of variance-vs-count functions for individual neurons also revealed no systematic relationship for most cells (not shown). The neuron illustrated in Fig. 1 was atypical in this sense, in that its spike-count variance increased monotonically with firing rate.
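The comparison summarised in Fig. 2 amounts to regressing the logarithm of the count variance on the logarithm of the mean count and comparing the fit with the identity line expected from a Poisson process (variance equal to the mean). A small sketch of that calculation is given below, applied to simulated Poisson counts; the window size and array layout are assumptions made for illustration.

```python
import numpy as np

def variance_vs_count_slope(counts):
    """counts: array of shape (n_conditions, n_repetitions) of spike counts per
    analysis window.  Returns slope and intercept of a least-squares fit of
    log10(variance) on log10(mean); a Poisson process predicts slope 1, intercept 0."""
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    keep = (mean > 0) & (var > 0)                        # logs need positive values
    slope, intercept = np.polyfit(np.log10(mean[keep]), np.log10(var[keep]), 1)
    return slope, intercept

# Simulated Poisson counts recover a slope near 1; the IC data in Fig. 2 did not
# show any such systematic mean-variance relationship.
rng = np.random.default_rng(2)
mean_rates = rng.uniform(1.0, 50.0, size=200)            # expected counts per window
counts = rng.poisson(mean_rates[:, None], size=(200, 9)) # nine repetitions each
print(variance_vs_count_slope(counts))
```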
Fig. 2 Spike count variance is not systematically dependent on the mean count (per 500-ms window) in the IC. Linear axes are used in the left panel, focusing on measurements with counts and variances less than 50 (939/1168 observations); the logarithmic scale in the right panel allows for almost the entire population to be included (1115/1168)
The count-variance dependence of several subsets of cells was examined; no obvious differences were observed between groups with different pure-tone histogram types (onset or sustained), or rate-based sensitivities to AM depth (low versus high neural detection thresholds). Also, there were no clear trends that suggested a difference in this relationship for responses to stimuli with low vs high modulation depths (responses to all depths are included in Fig. 2).
3.3 Combining Rate and Timing Information
The fact that synchronization to the envelope can become significant in the IC at AM depths near psychophysical thresholds, and that average rate often increases monotonically at higher values of m, points towards a hybrid (e.g. synchronized-rate) metric as a decision statistic with the potential to account for perceptual thresholds across the entire AM depth dynamic range. We tested this idea by quantifying the average rate, synchrony, and the product of synchrony and rate. A transformation of synchrony, –ln(1–VS), compensated for the compressive nature of the vector strength metric and resulted in an equal-variance synchrony axis (Joris et al. 1994). Note that this transformation does not alter the count-dependence of across-repetition synchrony variance (Fig. 1). Neural AM depth discrimination thresholds of one neuron for each tested standard depth and response metric are shown in Fig. 3; for reference, human psychophysical thresholds for high carrier frequencies and fm below 150 Hz are also included in the figure (from Ewert and Dau 2004). Two features of the neural thresholds are worth noting. First, predicted performance was essentially the same for all three of the tested decision statistics. Second, the trend toward higher thresholds at standard depths below −10 dB in the neural predictions was inconsistent with the human listeners' ability to discriminate
Fig. 3 Comparison of neural [10 log((mc² − ms²)/ms²) for the rate, synchronized-rate, and synchrony metrics] and psychophysical AM depth discrimination thresholds across a wide range of standard depths (20 log ms). Perceptual data from Ewert and Dau (2004)
small changes in m for all standard depths above approximately −25 dB. Qualitatively similar results were observed for the 19 other neurons examined with high depth resolution. One interpretation of these results (Fig. 3) is that a simple combination of rate and timing information does not carry adequate information to account for the low-depth psychophysical thresholds, as hypothesized above. Although synchrony was significant in the neuron at a depth of −19 dB, the variability of both synchrony and synchronized rate at depths near threshold was too high with respect to the slope of the modulation depth functions to predict the perceptual 1- or 2-dB discrimination thresholds (−6 to −2 dB on the ordinate axes shown in Fig. 3).
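The neural discrimination thresholds plotted in Fig. 3 are, in essence, obtained by asking how far the modulation depth must be raised above the standard before the change in a decision statistic exceeds its across-repetition variability. A simplified, d'-like version of that computation is sketched below; the interpolation, the d' = 1 criterion, and the toy modulation depth function are illustrative assumptions rather than the exact procedure of Nelson (2006).

```python
import numpy as np

def depth_discrimination_threshold(depths_db, metric_mean, metric_sd,
                                   standard_db, criterion=1.0):
    """Smallest increment above the standard depth (in dB of 20 log m) at which a
    d'-like statistic, (mean change)/(SD at the standard), reaches `criterion`.
    metric_mean / metric_sd describe the MDF (rate, synchrony, or synchronized
    rate) and its across-repetition variability at the tested depths (ascending)."""
    fine = np.linspace(standard_db, depths_db[-1], 2000)   # dense depth axis
    mean_fine = np.interp(fine, depths_db, metric_mean)
    ref_mean = np.interp(standard_db, depths_db, metric_mean)
    ref_sd = np.interp(standard_db, depths_db, metric_sd)
    dprime = (mean_fine - ref_mean) / ref_sd
    above = np.flatnonzero(dprime >= criterion)
    return np.inf if above.size == 0 else fine[above[0]] - standard_db

# Toy rate-based MDF: flat at low depths, rising above about -15 dB
depths = np.array([-35., -30., -25., -20., -15., -10., -5., 0.])
rate = np.array([20., 20., 20.5, 21., 23., 27., 32., 38.])
sd = np.full_like(rate, 2.0)
print(depth_discrimination_threshold(depths, rate, sd, standard_db=-20.0))
```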
4 Summary and Conclusions
The neural representation of AM stimuli in the IC was examined in terms of the mean and variance of several response quantifications. Spike-count variability had no systematic relationship with mean count. This finding has implications for modeling strategies that pool rate responses across a population of neurons. Rate codes are also affected by several other fundamental issues that raise serious questions regarding the feasibility of proposed decoding algorithms. For instance, spike counts computed over a finite duration depend on both the duration of the counting window and the absolute position of the window in time. Neural discharge rates exhibit temporal
correlations that can be modeled with a long-range dependent point process (e.g. Jackson and Carney 2005); therefore, to make efficient use of short-term changes in average rate, these slow temporal fluctuations must be correlated across populations of neurons that converge at a higher point in the system. There is no strong evidence either for or against the existence of such correlations in the central auditory system. Although we conclude that a strict rate code is probably not used in the IC to represent AM (at least at low AM depths), we were also not able to identify an alternative single-neuron decision statistic that was capable of accounting for human performance across the entire perceptually relevant dynamic range of AM depths. The variability associated with estimates of VS computed using small numbers of spikes overwhelms the systematic changes in timing-based metrics that are present at lower AM depths. Taken together, these findings suggest one of three possibilities: (1) differences in percepts elicited by variations of AM depth are not directly mediated by the activity of single IC neurons, (2) rabbits and humans experience different AM-induced sensations, or (3) the system uses another single-neuron response metric that has yet to be identified. Future work will address these issues with simultaneous recordings of populations of IC neurons, measurement of behavioral rabbit AM detection and discrimination thresholds, and the testing of additional response quantifications. Acknowledgements. We thank Anita Sterns for technical assistance and Shigeyuki Kuwada and Blagoje Filipovic for their generous contributions of time and electrode-making advice. This research was supported by NIHNIDCD F31-7268 (PCN) and NIH-NIDCD R01-01641 (LHC).
References
Batra R, Kuwada S, Stanford TR (1989) Temporal coding of envelopes and their interaural delays in the inferior colliculus of the unanesthetized rabbit. J Neurophysiol 61:257–268
Colburn HS, Carney LH, Heinz MG (2003) Quantifying the information in auditory-nerve responses for level discrimination. J Assoc Res Otolaryngol 4:294–311
Dau T, Kollmeier B, Kohlrausch A (1997) Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J Acoust Soc Am 102:2892–2905
Delgutte B (1987) Peripheral auditory processing of speech information: implications from a physiological study of intensity discrimination. In: Schouten MEH (ed) The psychophysics of speech perception. Nijhoff, Dordrecht, pp 333–353
Ewert SD, Dau T (2004) External and internal limitations in amplitude-modulation processing. J Acoust Soc Am 116:478–490
Geisler WS, Albrecht DG (1997) Visual cortex neurons in monkeys and cats: detection, discrimination, and identification. Vis Neurosci 14:897–919
Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117
Heinz MG, Colburn HS, Carney LH (2001) Rate and timing cues associated with the cochlear amplifier: level discrimination based on monaural cross-frequency coincidence detection. J Acoust Soc Am 110:2065–2084
Jackson BS, Carney LH (2005) The spontaneous-rate histogram of the auditory nerve can be explained by only two or three spontaneous rates and long-range dependence. J Assoc Res Otolaryngol 6:148–159
Johnson DH (1974) The response of single auditory-nerve fibers in the cat to single tones: synchrony and average discharge rate. PhD dissertation, MIT, Cambridge, MA
Johnson DH (1980) The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J Acoust Soc Am 68:1115–1122
Joris PX, Carney LH, Smith PH, Yin TCT (1994) Enhancement of neural synchronization in the anteroventral cochlear nucleus. I. Responses to tones at the characteristic frequency. J Neurophysiol 71:1022–1051
Kohlrausch A, Fassel R, Dau T (2000) The influence of carrier level and frequency on modulation and beat-detection thresholds for sinusoidal carriers. J Acoust Soc Am 108:723–734
Krishna BS, Semple MN (2000) Auditory temporal processing: responses to sinusoidally amplitude-modulated tones in the inferior colliculus. J Neurophysiol 84:255–273
Nelson PC (2006) Physiological correlates of temporal envelope perception. PhD dissertation, Syracuse University
Siebert WM (1968) Stimulus transformations in the peripheral auditory system. In: Kolers PA (ed) Recognizing patterns. MIT Press, Cambridge, MA, pp 104–133
Viemeister NF (1979) Temporal modulation transfer functions based upon modulation thresholds. J Acoust Soc Am 66:1364–1380
Viemeister NF (1988) Intensity coding and the dynamic range problem. Hear Res 34:267–274
Winter IM, Palmer AR (1991) Intensity coding in low-frequency auditory-nerve fibers of the guinea pig. J Acoust Soc Am 90:1958–1967
Young ED, Barta PE (1986) Rate responses of auditory nerve fibers to tones in noise near masked threshold. J Acoust Soc Am 79:426–442
Comment by Verhey
Would your data be in line with the following two interpretations?
1. The most sensitive units show thresholds similar to the psychophysical data on modulation detection.
2. Units with different thresholds are sensitive to different ranges of modulation depths, i.e. for the modulation discrimination task the auditory system may use units whose thresholds are close to the modulation depth of the reference.
Reply
1. Although this aspect of the data was not emphasized here, it is true that a subset of neurons in our population with the lowest synchrony-based modulation depth thresholds can account for human AM detection performance [this is described more fully in Nelson (2006, PhD thesis), and in Nelson and Carney 2007]. One of the main points in the current presentation is that neural AM discrimination thresholds for standard depths below about −15 dB apparently cannot be predicted by changes in either
the average rate or strength of synchrony in single IC neurons when the across-repetition variability of these metrics is taken into account (this leads to Verhey's second suggestion).
2. The recruitment of neurons with different AM depth sensitivities is certainly a reasonable mechanism to explain discrimination thresholds at low depths. We would only add that average rate alone in some neurons is sufficient to predict discrimination psychoacoustics at high standard modulation depths (above −10 dB).
References
Nelson PC, Carney LH (2007) Neural rate and timing cues for detection and discrimination of amplitude-modulated tones in the awake rabbit inferior colliculus. J Neurophysiol 97:522–539
16 Source Segregation Based on Temporal Envelope Structure and Binaural Cues
STEVEN VAN DE PAR, OTHMAR SCHIMMEL, ARMIN KOHLRAUSCH, AND JEROEN BREEBAART
1 Introduction
The lateralization of a single auditory object is mediated by interaural time delays (ITDs) and interaural level differences (ILDs). In daily life, listeners regularly encounter multiple auditory objects simultaneously, and it is of interest to learn to what extent and how listeners can localize each object. When the spectra of the objects differ sufficiently, the binaural cues within each pair of auditory filters would result from the auditory object that has most energy in the frequency range of that auditory filter. Although in principle this could provide cues for the separation of auditory objects, in experiments using simultaneously presented noises shaped to represent different vowels, subjects could not use ITD cues for segregation (Culling and Summerfield 1995). In this chapter we want to investigate whether listeners are able to discern the lateralization of two simultaneously presented auditory objects with different temporal structures that are spectrally fully overlapping. Both objects (band-pass noise and a harmonic tone complex) resulted in nearly the same spectral excitation pattern, while due to their different temporal structures, the objects could be well distinguished when listened to in isolation. By presenting these two auditory objects simultaneously and with different binaural properties, binaural cues stemming from both objects are equally reflected within each auditory filter. Therefore, in order to correctly lateralize one of the two objects, listeners would need to somehow separate the binaural cues within each single auditory filter and couple these cues to the auditory objects. Headphone lateralization experiments were done for different bandwidths and center frequencies in a similar way as in our earlier ISH contribution that was dealing with the discrimination of the same two auditory objects based on binaural cues (van de Par et al. 2005).
2 Experiment I
This experiment was carried out to investigate the ability of listeners to correctly lateralize two simultaneously presented, spectrally fully overlapping signals with opposing binaural cues. In order to perform this lateralization, listeners must be able to segregate and lateralize the signals based on their opposing binaural (ITD or ILD) cues and their different temporal structures. This task will be termed the "discrimination task", because it required listeners to discriminate between the lateralizations of the signals. Besides measuring lateralization thresholds for the two signals presented simultaneously, lateralization thresholds were, as a reference, also measured for each of the signals separately. These tasks will be termed the "detection task", because they deal with basic lateralization detection thresholds.
2.1 Method and Stimuli
For the discrimination task, stimuli consisted of two signals: a band-pass noise (BPN) with a flat spectral envelope, and a harmonic tone complex (HTC) with 20-Hz component spacings and a sinusoidal phase spectrum. Both signals had the same spectral range but differed in their temporal envelope structures. The level for each of the two signals was 65 dB SPL. The two reference intervals had binaural cues such that the BPN was lateralized to the right and the HTC to the left, using identical but opposing binaural cues. In the target interval the lateralizations of both signals were reversed. In the narrowband conditions, bandwidths were one critical band wide (1 ERB) and centered at 280, 550 or 800 Hz. In the wideband condition, the bandwidth was 600 Hz wide (7 ERB) and centered at 550 Hz. The intervals had a duration of 400 ms, including 30-ms raised-cosine onset and offset ramps, and were separated by 300 ms of silence. For the ILD conditions, level changes were such that the total added energy of the left and right target signals was always constant. For the detection task, threshold ITDs and ILDs were measured for the BPN and the HTC in isolation. The two reference intervals had binaural cues such that the signal was lateralized to the left. The target interval had identical binaural cues such that the signal was lateralized to the right. Thus, total interaural differences between reference and test intervals were twice the values reported hereafter. The method used for measuring lateralization thresholds was a three-interval, three-alternative forced-choice adaptive-tracking procedure. The adaptive variable (ITD or ILD) was adjusted according to a two-down one-up rule, to track the 70.7%-correct response level. The adaptive variable was adjusted by multiplying or dividing it by a certain factor. Initially this factor was 2.51 (= 10^(8/20)). After each second reversal the factor was reduced by taking its square root until the value of 1.12 (= 10^(1/20)) was reached. Another eight reversals were measured at this step size and the median of these eight levels was used as an estimate of threshold. Feedback was provided after each trial.
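A minimal sketch of the two signal classes described above is given below: a sine-phase harmonic tone complex with 20-Hz component spacing and a flat-spectrum band-pass noise covering the same frequency range, so that the long-term spectra are closely matched while the temporal envelopes differ. The sampling rate, the exact component frequencies, and the omitted gating and level calibration are illustrative assumptions.

```python
import numpy as np

FS = 44100          # illustrative sampling rate
DUR = 0.4           # stimulus duration in s (ramps omitted for brevity)

def sine_phase_htc(f_lo, f_hi, spacing=20.0):
    """Harmonic tone complex: equal-amplitude components every `spacing` Hz,
    all starting in sine phase, spanning f_lo..f_hi (peaky 20-Hz periodic envelope)."""
    t = np.arange(int(DUR * FS)) / FS
    comps = np.arange(np.ceil(f_lo / spacing), np.floor(f_hi / spacing) + 1) * spacing
    return np.sin(2 * np.pi * comps[:, None] * t).sum(axis=0)

def flat_bandpass_noise(f_lo, f_hi, seed=0):
    """Noise with a flat spectral envelope over the same band (irregular envelope)."""
    n = int(DUR * FS)
    freqs = np.fft.rfftfreq(n, 1.0 / FS)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    spec = np.zeros(freqs.size, dtype=complex)
    rng = np.random.default_rng(seed)
    spec[in_band] = np.exp(1j * rng.uniform(0.0, 2 * np.pi, in_band.sum()))
    return np.fft.irfft(spec, n=n)

# Wideband condition: a 600-Hz band centred at 550 Hz
htc = sine_phase_htc(250.0, 850.0)
bpn = flat_bandpass_noise(250.0, 850.0)
```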
For each condition, at least four attempts were made by each subject to measure a threshold. However, when the adaptive variable exceeded a certain maximum value, the tracking procedure was terminated and no threshold was measured. For these conditions, measurements were either repeated to obtain a total of at least four proper threshold values, or, for the most difficult conditions, measurements were stopped and thresholds (if any) discarded. The thresholds for each condition were pooled and checked for severe outliers; thresholds at more than three times the interquartile range were removed from the data. Five normal-hearing male subjects, including the four authors, participated in the experiments.
2.2 Results and Discussion
Figure 1 shows the average detection and discrimination threshold and standard error of the mean for the various conditions. For the ITD conditions (left panel), detection thresholds for lateralization changes of the narrowband BPN and HTC are about 10–20 µs, while for these narrowband conditions a discrimination threshold could not be obtained. Detection thresholds for wideband BPN and HTC are similar to the lowest of the narrowband detection thresholds, about 12 µs, while the wideband discrimination threshold is about 29 µs. These data indicate that segregation of the BPN and the HTC by interaural time delays is not possible when their spectral energy is limited to one auditory filter. For the ILD conditions (right panel), all detection thresholds for BPN and HTC are about 1 dB independent of center frequency and bandwidth, while narrowband discrimination thresholds are about 5–32 dB. Note that this 32-dB threshold at the 280-Hz center frequency is beyond the plausible range
Fig. 1 Average detection thresholds for BPN (circles) and HTC (squares), and discrimination thresholds (diamonds), shown for ITD (left panel; µs) and ILD (right panel; dB) conditions, for narrowband (open markers) and wideband (closed markers) conditions. Vertical lines indicate the standard error of the mean
for localization. For these narrowband conditions, discrimination thresholds decrease with increasing center frequency. The wideband discrimination threshold is nearly equal to its corresponding detection thresholds. These data indicate that, in contrast to the ITD narrowband conditions, segregation by interaural level differences of the narrowband noise and the harmonic tone complex is possible within one auditory filter, although thresholds are very high. Analogous to the ITD conditions, segregation is best for conditions where the stimulus spectrum covers multiple auditory filters. From this finding we conclude that segregation by interaural level differences depends not only on the bandwidth of the signals, but also on the center frequency of the auditory filter. Overall we can conclude that spectral energy in multiple auditory filters facilitates segregation by binaural listening of spectrally fully overlapping concurrent sound sources.
3 Experiment II
This experiment was designed to investigate to what extent the ILD thresholds for the discrimination task obtained in Experiment I can be understood in terms of listening with one ear only.
3.1 Method and Stimuli
The method was the same as in Experiment I, and measurements were limited to discrimination thresholds. The stimuli were the same as the ILD conditions from Experiment I, only now the right ear signal was presented diotically. In this way, only monaural level cues resulting from the different temporal envelope structures of the HTC and the BPN were available for discriminating between the target and the reference intervals.
3.2 Results and Discussion
Table 1 shows the subject average of monaural and previously established binaural thresholds and standard error of the mean for the various bandwidths and center frequencies. As for binaural thresholds from Experiment I, monaural thresholds in the narrowband conditions decreased with an increase in center frequency. However, the monaural thresholds were much lower than binaural thresholds. Thresholds for the wideband condition were similar, indicating that monaural and binaural discrimination were equivalent.
Table 1 Average discrimination thresholds and standard errors of the mean (in brackets) for various bandwidths and center frequencies

                     Narrowband                              Wideband
                     280-Hz       550-Hz       800-Hz        550-Hz
Monaural [in dB]     4.7 (0.3)    3.2 (0.2)    2.5 (0.2)     1.2 (0.1)
Binaural [in dB]     34.2 (6.9)   7.4 (0.8)    5.1 (0.8)     1.0 (0.1)
From these findings, we conclude that for narrowband signals binaural stimulus presentation can actually reduce discrimination performance in the case of ILD cues. In other words, in the binaural ILD discrimination condition, much better performance could be obtained if listeners were able to listen with only one ear.
4 Experiment III
This experiment was designed to explore the effect of temporal envelope structure on the ability to segregate two spectrally overlapping signals by binaural cues.
4.1 Method and Stimuli
The temporal envelopes of the BPN and HTC were manipulated in two ways. First, a random phase was applied to the HTC components, to yield a temporal envelope more similar to that of the BPN. Second, a 20-Hz amplitude modulation (AM) was applied to the BPN, to yield a temporal envelope more similar to that of the sine-phase HTC. The AM BPN and the HTC were presented temporally in-phase (φ = 0), such that the envelope maxima of both signals coincided, or out-of-phase (φ = π). The method was the same as in Experiment I, except that measurements were limited to discrimination thresholds for the wideband condition, because these measurements led to the lowest thresholds.
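The two envelope manipulations can be sketched as follows: randomising the component starting phases of the HTC flattens its envelope, while imposing a 20-Hz sinusoidal amplitude modulation on the BPN gives it a regular envelope whose maxima can be placed in phase (φ = 0) or in antiphase (φ = π) relative to a reference. The component range, the use of white noise as a stand-in for the band-pass noise, and the phase convention are illustrative assumptions only.

```python
import numpy as np

fs, dur, spacing = 44100, 0.4, 20.0
t = np.arange(int(dur * fs)) / fs
comps = np.arange(13, 43) * spacing        # illustrative components, ~260-840 Hz

def htc(start_phases):
    """Harmonic tone complex with the given per-component starting phases."""
    return np.sin(2 * np.pi * comps[:, None] * t + start_phases[:, None]).sum(axis=0)

rng = np.random.default_rng(3)
htc_sine = htc(np.zeros(comps.size))                     # sine phase: peaky envelope
htc_random = htc(rng.uniform(0, 2 * np.pi, comps.size))  # random phase: flatter envelope

# 20-Hz AM imposed on a noise carrier (white noise as a stand-in for the BPN)
noise = rng.standard_normal(t.size)
am_bpn_phi0 = noise * (1 + np.cos(2 * np.pi * spacing * t))           # phi = 0
am_bpn_phipi = noise * (1 + np.cos(2 * np.pi * spacing * t + np.pi))  # phi = pi
```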
4.2 Results and Discussion
Table 2 shows the subject average and standard error of the mean for the different temporal envelope structure conditions (including the sine-phase HTC of Experiment I).
Table 2 Average discrimination thresholds and standard errors of the mean (in brackets) for various temporal envelope structures of HTC (sine/random phase) and AM BPN (φ = 0/φ = π)

              Sine         Random        φ = 0         φ = π
ITD [in µs]   29.2 (2)     -             -             30.0 (3)
ILD [in dB]   1.0 (0.1)    29.0 (8.2)    1.5 (0.1)     1.0 (0.1)
For random phase conditions, segregation by ITDs was not possible, and segregation by ILDs was seriously degraded. For the conditions with the AM BPN and HTC presented in-phase (φ = 0), segregation by ITDs was also not possible, and for ILDs segregation was slightly reduced compared to the sine-phase condition. For conditions with the AM BPN and HTC presented out-of-phase (φ = π), segregation was normal for both binaural cues. From these findings, we conclude that in order to be able to segregate and lateralize the two signals, it is important that the temporal envelope structure of both signals is different, either due to different degrees of modulation or due to a difference in the relative timing of envelope maxima.
5 Discussion and Conclusions
This study investigated to what extent two signals with a very similar spectral envelope but different temporal structures can be segregated based on binaural cues. From Experiment I, it appears that listeners can indeed segregate such signals based on a difference in binaural cues of the two signals if the bandwidth of the signals exceeds one critical band. When we consider the ILD conditions, as was shown in Experiment II, the left or right ear signals by themselves provide monaural cues that are sufficient to distinguish reference and target intervals. Therefore, it is not certain that in Experiment I, the lateralization thresholds for the ILD conditions are necessarily based on binaural processing. Interestingly, for narrowband conditions, the performance based on listening only with the right ear signal was better than based on binaural listening, while in the wideband condition, performance was very similar. This suggests that for narrowband conditions, binaural fusion is mandatory and listening with one of the two ears is not possible. When we consider the ITD conditions, segregation or lateralization of either one of the signals can only be explained based on binaural processing. However, the binaural cues are mixed equally within each critical band and therefore we have to assume that, somehow, the auditory system is able to separate the cues corresponding to both signals and couple them to the two signals.
It is not clear, however, what would be the nature of binaural processing that allows separation of the ITD cues of both signals in Experiments I and III. It appears that a prerequisite for being able to segregate two binaural signals is that both signals have a distinctly different temporal envelope extending across several auditory filters. For the random-phase HTC and for the in-phase AM BPN, temporal envelopes of both signals were correlated and segregation was impossible. For the sine-phase HTC and the out-of-phase AM BPN, temporal envelopes were much less correlated and lateralization performance was not too dissimilar from that of the signals in isolation. In addition, it appears that temporal envelopes resulting from narrowband signals by themselves do not provide sufficient differences in temporal envelopes across the various signals to facilitate segregation (cf. Experiment I). Possibly, across-frequency integration based on temporal envelope coherence helps to facilitate segregation (cf. Trahiotis and Stern 1994), but within-channel cues may also improve even when the signal exceeds one critical band. One speculation would be that the monaural temporal envelope cues somehow help to select temporal intervals that belong to one of the two signals and that in this way temporally varying binaural cues are organized to facilitate lateralization. Alternatively, the auditory system may adopt several hypotheses about how the monaural input signals can be segregated based on monaural temporal envelope features. The binaural cues corresponding to each of the envelope features may help to tip the balance towards one of the hypotheses based on the assumption that within one auditory stream, binaural cues have to be coherent across time. In a previous contribution (van de Par et al. 2005) we reported a study where listeners had to discriminate between BPN and HTC signals both presented interaurally out-of-phase within an in-phase noise masker. Listeners were able to discriminate between these signals at signal-to-masker levels for which no monaural cues were available for performing this task. This suggests that within the binaural display, information is available about the temporal structure of the out-of-phase signal. An equalization-cancellation (EC) stage (e.g. Durlach, 1963; Breebaart et al. 2001) could in principle provide such information, because it would cancel the noise masker component of the stimulus but not the out-of-phase signal component. It would require, however, that within the binaural display, the capacity to process temporal information is sufficiently good to distinguish between HTC and BPN signals. In the current study, in a similar way, an EC stage could remove one of the two signals, allowing for the temporal processing of the other signal. It is not evident, however, how this could explain the finding of Experiment III, that the in-phase AM BPN could not be segregated from the HTC. Possibly it is related to the observation that for any internal time delay that is applied in the equalization stage, the output of the cancellation stage has one dominant modulation rate of 20 Hz. For the out-of-phase condition, however, modulation rates of 40 and 20 Hz can be observed depending on the internal delay. This difference in modulation rate across the binaural display may facilitate segregation.
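The equalization-cancellation reasoning above can be made explicit with a very reduced sketch: the left-ear signal is delayed so as to equalize the interaural parameters of one component and is then subtracted from the right-ear signal, which cancels that component while leaving the other one in the residual. No internal noise, gain equalization or per-channel processing is modelled, and all parameter values are illustrative.

```python
import numpy as np

def ec_residual(left, right, itd_s, fs):
    """Equalization-cancellation residual: delay the left-ear signal by itd_s
    (equalization) and subtract the right-ear signal (cancellation).  A component
    whose ITD equals itd_s is cancelled; components with other ITDs remain."""
    shift = int(round(itd_s * fs))
    return np.roll(left, shift) - right      # crude integer-sample delay

# Toy example: a 500-Hz tone leading in the left ear and a 700-Hz tone leading
# in the right ear, each by 0.5 ms; equalizing for the 500-Hz ITD cancels it.
fs = 44100
t = np.arange(fs) / fs
itd = 0.5e-3
left = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 700 * (t - itd))
right = np.sin(2 * np.pi * 500 * (t - itd)) + np.sin(2 * np.pi * 700 * t)
residual = ec_residual(left, right, itd_s=itd, fs=fs)   # dominated by the 700-Hz tone
```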
Finally, we want to draw attention to the rather low lateralization threshold of 29 µs that was found in the ITD discrimination conditions with wideband stimuli. This indicates that a difference in azimuth between the simultaneously presented HTC and the BPN of about 6° in a free-field condition would already be sufficient for listeners to notice the difference in azimuth. In conclusion, this study presents an interesting stimulus paradigm that reveals binaural segregation based on monaural temporal envelope cues. Results are not easily understood in terms of existing binaural models.
References
Breebaart J, van de Par S, Kohlrausch A (2001) Binaural processing model based on contralateral inhibition. I. Model setup. J Acoust Soc Am 110:1074–1088
Culling JF, Summerfield Q (1995) Perceptual separation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay. J Acoust Soc Am 98:785–797
Durlach NI (1963) Equalization and cancellation theory of binaural masking-level differences. J Acoust Soc Am 35:1206–1218
Trahiotis C, Stern RM (1994) Across-frequency interaction in lateralization of complex binaural stimuli. J Acoust Soc Am 96:3804–3806
van de Par S, Kohlrausch A, Breebaart J, McKinney M (2005) Discrimination of different temporal structures of diotic and dichotic target signals within diotic wide-band noise. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York, pp 398–404
Comment by Kollmeier
Your statement that monaural detection precedes binaural processing is in conflict with current models of binaural noise suppression for wideband experiments such as speech discrimination in noise with different spatial locations of the noise masker and the speech signal. Beutelmann and Brand (2006, JASA) from our group, for example, were able to predict speech intelligibility in noise in different reverberant environments using EC models in each critical band followed by a (monaural) speech intelligibility prediction stage employing the SII. How did you make sure that the discrimination task performed by your listeners was not corrupted by a confusion between signal and interferer?
References
Beutelmann R, Brand T (2006) Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners. J Acoust Soc Am 120:331–342
Reply
We made sure that the two target signals were not confused by presenting the harmonic tone complex target in isolation before each single adaptive threshold measurement. Therefore the listeners knew what the target was that they needed to lateralize. The data that we obtained for the wideband ITD conditions are difficult to reconcile with the idea that after binaural interaction (e.g. Equalization and Cancellation) the binaural display provides a representation that is rich enough to process one of the stimulus components as if it were presented in isolation monaurally. For example, assuming the Equalization and Cancellation processing is able to cancel out the bandpass noise component, the remaining output of the EC stage would reflect the presence of the harmonic tone complex. If this information could be processed with the same temporal processing capacity as the monaural system could provide, listeners should have no difficulties in determining the lateralization of the tone complex in any of the ITD conditions of Experiment III. In contrast, we found that listeners were not able to perform their task for conditions where monaural segregation is more difficult; i.e., for a random-phase tone complex, for the sine-phase tone complex temporally aligned with an AM noise, or for any of the narrowband ITD conditions. Therefore, we think that these results are related to the difficulties that listeners have to monaurally segregate the tone complex and the bandpass noise, which apparently is a prerequisite for being able to use the binaural ITD information.
Comment by Verhey
One of your conclusions is that monaural (better-ear) cues cannot be used in the binaural case. This assumption is based on a diotic experiment which produced considerably lower thresholds than the binaural condition. This raises the question about the order of the experiments and the instructions given to the subjects. If the diotic data were obtained after the binaural data, the lower thresholds may have resulted from training effects. Did the authors check for possible training effects? In addition, can the authors rule out the possibility that listeners are able to use the monaural cue if they are explicitly told to do so in the binaural experiment?
Reply
After returning from the conference, we ran an additional series of measurements where the diotic and dichotic narrowband ILD condition was remeasured for a center frequency of 550 Hz. The two conditions were presented in an interleaved manner. Listeners were instructed to listen only to the right ear for the dichotic condition. We found essentially the same results as reported in our
paper, suggesting that differences in training across the different conditions did not play a role. In addition, the specific instruction did not lead to an improved performance for the dichotic condition. We note, however, that the increase in thresholds for the dichotic condition as compared to the diotic condition was more pronounced in some subjects than in others, both in the initial and the new measurements. Nevertheless, we consistently found an increase in thresholds in all listeners for the dichotic condition.

Comment by Ihlefeld and Shinn-Cunningham
We find your results very interesting, but think that there may be alternative interpretations of the results. Did you ask your listeners how many objects they perceived and where they heard the stimuli? You discuss the results as if the subjects must have both segregated and lateralized the sources properly in order to perform the task. However, because you used a 3AFC task, subjects could have simply listened for any difference across intervals, whether they heard the stimuli as one object whose spatial properties changed or as two segregated objects in different locations. Moreover, as was brought up in the other comment, we wondered whether the similarity of the two signals was a factor. We believe that it is necessary, but not sufficient, to hear sound sources as separate objects in order to perceive them as coming from different locations – that spatial cues do not drive segregation. If the listeners could not segregate the stimuli, they might be able to do the task by listening for changes in the spatial quality of the single source. To satisfy our curiosity about what was actually happening, we generated and listened to stimuli constructed as described in your paper. In the narrowband stimulus conditions of Experiment I, we could hear two objects, but they appeared to come from one diffuse spatial location, preventing us from performing the task. In the broadband conditions of Experiment III that were difficult (in-phase AM and random-phase HTC), we only heard one object with a diffuse location. Only when the stimuli were broadband with very different envelopes (the sine-phase HTC in Experiment I and the pi-phase AM condition of Experiment III) did we hear two objects at different locations. In short, we think that there are important interactions between the way listeners group the stimuli and how they perform the task.

Reply
We agree that any change in spatial quality of the composite stimulus across the test and the reference interval could in principle be sufficient to allow listeners to perform the task as long as this spatial quality has a left-right orientation. We performed a control experiment (which we didn't mention in our presentation or paper) where the lateralization cues (ITDs) of both the tone complex and the
bandpass noise were roved by a certain random offset in the same direction. The offset was never larger than the lateralization cues before roving such that both the tone complex and the bandpass noise would still have ITDs that pointed to different sides of the head. In this way the spatial quality of the composite stimulus varied strongly across the three intervals, which should make it rather difficult to detect changes in the spatial quality of the composite stimulus. Nevertheless, ITD thresholds for wideband stimuli with a sine-phase tone complex hardly increased as a result of adding the roving. This result seems to be more in line with the assumption that listeners are able to lateralize the separate stimulus components. Furthermore, your comments seem to follow our line of thinking expressed in the presentation and the paper, that the monaural dissimilarity between stimuli is a prerequisite to be able to assign the binaural cues to at least one of the objects (see also our reply to the comment of Kollmeier).
17 Simulation of Oscillating Neurons in the Cochlear Nucleus: A Possible Role for Neural Nets, Onset Cells, and Synaptic Delays
ANDREAS BAHMER AND GERALD LANGNER
Neuroakustik, Institut für Zoologie, Technische Universität Darmstadt, Germany, [email protected], [email protected]

1 Introduction
Neurons in the cochlear nucleus (CN), the first processing center in the auditory system, show various response patterns after stimulation. Among these, chopper neurons (choppers) are outstanding because their oscillations are characterized by stable interspike intervals (ISIs) even when the stimulus amplitude changes (Pfeiffer 1966; Blackburn and Sachs 1989). Choppers are subdivided according to their coefficient of variation (CV; the ratio of the standard deviation to the mean of the ISI) into regular (CV < 0.35) and irregular (CV > 0.35). T-stellate cells, which are classified as choppers, seem to form interconnected circular networks (Ferragamo et al. 1998). These chopper networks have certain features (self-excitation and inputs from other neuron types) that differ from those of single neurons. The question arises whether, as a result of minimization processes, the fastest possible network of choppers exists. Analysis of electrophysiological data has provided evidence for the existence of the smallest possible network consisting of two interconnected neurons with a synaptic delay of 0.4 ms. Choppers show a high dynamic range of up to 90 dB of AM coding and at the same time are more sharply tuned than nerve fibers (Frisina et al. 1990). This suggests that only a few nerve fibers from a narrow frequency range project to choppers. This is in line with the finding of Ferragamo et al. (1998) that T-stellate cells receive only about five monosynaptic inputs from the nerve. If there is only a small input from nerve fibers with a dynamic range of 30–40 dB of AM coding (Frisina et al. 1990), it is difficult to understand how the high dynamic range of choppers is achieved. Therefore, we suggest that onset neurons, which are classified as octopus cells (Ostapoff et al. 1994) and show a high dynamic range of up to 90–115 dB of AM coding (Frisina et al. 1990), project to choppers. Based on the physiological and anatomical data, we propose a model consisting of a minimum network of two choppers that are interconnected with a synaptic delay of 0.4 ms (Bahmer and Langner 2006a). Such minimum delays have been found in different systems and in various animals (e.g. Hackett et al. 1982; Borst et al. 1995). The choppers receive input from both the auditory nerve and an onset neuron. This model can reproduce the mean, standard deviation, and coefficient of variation of the ISI and the dynamic features of AM coding of choppers.
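The CV-based subdivision of choppers is straightforward to reproduce from recorded spike times. The following minimal Python sketch illustrates the computation; the spike train and jitter values are hypothetical, and only the 0.35 criterion is taken from the text.

```python
import numpy as np

def classify_chopper(spike_times_ms):
    """Classify a chopper response as 'regular' or 'irregular' from its
    interspike intervals (ISIs), using CV = std(ISI) / mean(ISI) and the
    0.35 criterion quoted in the text."""
    isis = np.diff(np.sort(np.asarray(spike_times_ms, dtype=float)))
    cv = isis.std() / isis.mean()
    return ("regular" if cv < 0.35 else "irregular"), cv

# Example: a highly regular train with 2-ms ISIs and a small amount of jitter
rng = np.random.default_rng(0)
spikes = np.cumsum(2.0 + 0.2 * rng.standard_normal(200))
print(classify_chopper(spikes))
```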
2 Methods
The simulation of Hodgkin-Huxley (HH)-like chopper and onset neurons is based on models of Rothman and Manis (2003). The models consist of a single electrical compartment with a membrane capacitance connected in parallel with a sodium current, a low-threshold potassium current, a high-threshold potassium current, a cation current, a leakage current, an excitatory synaptic current, and an external electrode current source. For the simulations concerning periodicity encoding, a model for the inner ear and leaky-integrate-and-fire (LIF) neurons with synapses is used (Bleeck 2000; Bahmer and Langner 2006b). NEURON and Matlab are used as simulation environments. For the simulation of a network of choppers in NEURON, the input of the two choppers is slightly phase delayed to ensure spike-to-spike oscillations. The resulting topology for the simulations with the proposed additional input from an onset neuron is shown in Fig. 1a (topology I). To produce ISIs as large as those occasionally observed in real choppers (several ms), topology I would require a high number of neurons, as synaptic delays of neurons in the auditory system are found to be in the submillisecond range. Therefore, an alternative topology (topology II, Fig. 1b) is suggested, in which a small network of choppers plays the role of a pacemaker. This network can trigger other choppers, which have larger refractory periods to produce larger ISIs. It is reasonable to assume that minimization processes of the networks concerning fast analysis in the time domain are used to produce a minimum synaptic delay. In line with the physiological evidence given below, a minimum synaptic delay of 0.4 ms is introduced in the model. To test the physiological relevance of the topologies, pure tone responses and periodicity encoding are tested and compared with physiological data.

Fig. 1 Scheme of the simulation topologies I (a) and II (b). A model of the inner ear produces a response which converges on each chopper neuron via five inputs. The onset neuron receives a broadband input and excites one chopper neuron: a the chopper neurons are arranged serially in a circle and can excite each other; the self-excitation in this network can be stopped by decreasing its input from the auditory nerve; b two (or three) fast chopper neurons in a network project as a pacemaker to slower chopper neurons
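The timing argument behind topology I can be illustrated with a deliberately simplified, event-driven sketch: two mutually exciting units with a fixed synaptic delay produce spike-to-spike oscillations whose ISI is set by twice the delay. This is only a toy reduction of the HH-based NEURON model described above; the refractory value and the event-driven formulation are assumptions made purely for illustration.

```python
import heapq

def simulate_ring(delay_ms=0.4, refractory_ms=0.3, t_stop_ms=10.0):
    """Toy event-driven sketch of a minimal two-chopper ring: each unit
    excites the other with a fixed synaptic delay, so once the loop is
    started by a single auditory-nerve event the interspike interval of
    each unit equals twice the delay, independent of the input strength."""
    spikes = {0: [], 1: []}
    events = [(0.0, 0)]              # (time, target unit); one starting input
    while events:
        t, n = heapq.heappop(events)
        if t > t_stop_ms:
            break
        if spikes[n] and t - spikes[n][-1] < refractory_ms:
            continue                 # still refractory: the input is ignored
        spikes[n].append(t)          # unit n fires
        heapq.heappush(events, (t + delay_ms, 1 - n))   # excite the partner
    return spikes

sp = simulate_ring()
isis = [round(b - a, 2) for a, b in zip(sp[0], sp[0][1:])]
print(isis)   # -> [0.8, 0.8, ...]: ISI = 2 x synaptic delay
```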
3 Results

3.1 Invariance of ISI in Networks of Chopper Neurons
An HH-like model of a single chopper was compared to a model of a network of two interacting choppers. The simulation results are shown in Fig. 2. While the ISIs of the single-neuron model depend strongly on signal amplitude, the ISIs of the choppers in the network are relatively constant. This and the corresponding dynamic variation of spike rates are in line with physiological data (Pfeiffer 1966; Blackburn and Sachs 1989).
Fig. 2 Simulation results of a single chopper and of a network of two interconnected choppers in NEURON (HH-like model from Rothman and Manis 2003); ISI [ms] and spike rate [spikes/s] are plotted against the input current [pA]. The variance of the ISIs of the single neuron is high, whereas the variance of the ISIs of the network chopper is low
3.2 Simulation of Pure Tone Response of Chopper Neurons
Figure 3a shows the PSTH of a sustained chopper (Blackburn and Sachs 1989), which is characterized by a low CV (<0.35) indicating highly regular interspike intervals. In the PSTH, four to five response maxima can be clearly seen. The regularity analysis (Fig. 3b) shows a mean interspike interval of about 2 ms, a standard deviation of nearly 0.25 ms and a resulting CV of about 0.15. These values are stable in time. The result of our simulation with topology I, which in this case includes five choppers, is shown in Fig. 3a′,b′.
Fig. 3 a,b PSTH and regularity analysis (mean interval µ, standard deviation σ, coefficient of variation CV) of a sustained chopper neuron in the CN (from Blackburn and Sachs 1989). a′,b′ Response of the simulated chopper neuron (topology I; CV = 0.14). Stimulus parameters are the same as in the physiological experiment. The graphs match the physiological data. a′′,b′′ Response of the simulated chopper neuron (topology II; CV = 0.07). The response shows more peaks than the physiological data
The properties of the simulation, such as firing rate, number of peaks, and ratio of peak heights, are nearly the same as in the electrophysiological results. Even the regularity analysis could be matched to the analysis of the in vivo recording. For this purpose a jitter (standard deviation 0.26 ms) had to be added to each synaptic delay of the five interconnected choppers to fit the CV. The CV in the simulation has a mean value of 0.14 (0.15 in the in vivo recording). Figure 3a′′,b′′ shows the simulation results of topology II. Again, firing rate and ratio of peak heights match the physiological properties. The number of peaks is increased and the regularity analysis shows smoother results and a lower CV (0.07). The jitter (the same as in topology I) is added only to the synaptic delay of the interconnections of the fast choppers.

3.3 Synchronization to AM Signals at Different Sound Pressure Levels
To verify the conclusion that choppers receive input from both the auditory nerve and onset neurons, we simulate responses of auditory nerve fibers, of onset neurons with different integration widths, and of choppers with and without input from an onset neuron. For the simulation of the choppers, two choppers are arranged in a circular network. To quantify the degree of synchronization (periodicity coding), the vector strength (e.g. Langner 1992) is calculated for the simulations. Without the input from the onset neuron, the weights of the synapses of the auditory nerve have to be increased to enable chopping. The simulation of a nerve fiber with a CF at the carrier frequency of the AM signal shows that the synchronization to the modulation is small and nearly vanishes above 40 dB SPL (Fig. 4a). Because of their broader bandwidth, the onset neurons encode periodicity much better than an auditory nerve fiber (Fig. 4a). For a “broad” in comparison to a “regular” bandwidth, the synchronization is better at high levels (above 50 dB SPL), but worse at low levels. The explanation for this is that unsaturated nerve fibers away from CF, and therefore also away from fc, are able to encode periodicity information even at high intensities. Figure 4b shows the VS of simulations of choppers with and without input from an onset neuron. The different bandwidths of the onset neurons result in chopper responses which show the same dynamic effect as discussed in the previous paragraph for onset neurons. A comparison of the simulations of choppers with and without input from an onset neuron shows the effect and significance of such an input for the dynamic range of periodicity coding.

Fig. 4 a Comparison of the VS of simulated auditory nerve fibers and two onset neurons at different SPLs (response to SAM, fm: 100 Hz, fc: 600 Hz). The onset neurons have different bandwidths (regular and broad) and show robust synchronization over a wide dynamic range. b Comparison of simulations of chopper neurons. With input from an onset neuron, chopper neurons synchronize to AM signals over a wide dynamic range. By contrast, without input from an onset neuron synchronization deteriorates above 20 dB SPL
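The vector strength used to quantify synchronization in Fig. 4 can be computed directly from spike times and the modulation frequency. The short sketch below uses made-up spike trains; only the standard vector-strength definition is assumed.

```python
import numpy as np

def vector_strength(spike_times_s, mod_freq_hz):
    """Vector strength of a spike train with respect to a modulation
    frequency: 1 = perfect phase locking, 0 = no synchronization."""
    phases = 2.0 * np.pi * mod_freq_hz * np.asarray(spike_times_s, dtype=float)
    return np.abs(np.exp(1j * phases).mean())

# Example with fm = 100 Hz (as in the SAM stimulus of Fig. 4): spikes locked
# to one phase of the 10-ms modulation cycle give a VS close to 1, whereas a
# random spike train gives a VS close to 0.
locked = 0.010 * np.arange(100) + 0.002
random = np.sort(np.random.default_rng(1).uniform(0.0, 1.0, 100))
print(vector_strength(locked, 100.0), vector_strength(random, 100.0))
```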
3.4 Evidence for a Time Constant of 0.4 ms in Intervals of Chopper Responses

Evidence that preferred intervals in intrinsic oscillations consist of multiples of 0.4 ms was first found in the central nucleus of the IC of guinea fowls (Langner 1981, 1983) and cats (Langner and Schreiner 1988). The intrinsic oscillations were only weakly influenced by changes in stimulus frequency or intensity (Langner and Schreiner 1988). Peaks were prevalent at intervals of 0.8, 1.2, 1.6, 2.0, and 2.4 ms, which are all multiples of a base period of 0.4 ms. Since the IC receives a major input from the choppers of the CN (Adams 1983), it was hypothesized that the origin of the intrinsic oscillation found in the IC is the CN (Langner 1992). Further evidence for a minimum synaptic delay was found in responses of units in the VCN recorded by Young et al. (1988). We analysed the ISIs of choppers for preferences of certain oscillation intervals. The histogram (Fig. 5) shows peaks at integer multiples of 0.4 ms, indicating a corresponding preference of choppers for such intervals. The null hypothesis, assuming that the interspike intervals of choppers are equally distributed, was tested statistically. For this purpose two classes were generated. One class contained intervals centered at multiples of 0.4 ms with an interval width of 0.2 ms, and the remaining intervals of the second class were centered at multiples of 0.4 ms + 0.2 ms. By rejection of the null hypothesis (level of significance: 5%), the preference for intervals centered at multiples of 0.4 ms was shown to be significant.

Fig. 5 Histogram of the interspike intervals of chopper neurons measured by Young et al. (1988). The bin width is 0.1 ms; the number along the x-axis indicates the lower edge of the interval (in units of 0.4 ms). The curve was fitted to the histogram manually
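The two-class comparison described above can be sketched as follows. The use of a binomial test and the exact class boundaries are assumptions made for illustration, since the original statistic is not specified in detail.

```python
import numpy as np
from scipy.stats import binomtest

def test_04ms_preference(isi_ms, base=0.4, alpha=0.05):
    """Sketch of the two-class test: ISIs within +/- base/4 of an integer
    multiple of `base` form the "on-grid" class, all other ISIs the
    "off-grid" class. Under the null hypothesis of uniformly distributed
    ISIs both classes are equally likely, so the on-grid count can be
    compared against p = 0.5 with a binomial test."""
    isi_ms = np.asarray(isi_ms, dtype=float)
    offset = np.mod(isi_ms, base)                    # position within one 0.4-ms period
    on_grid = (offset < base / 4) | (offset >= 3 * base / 4)
    result = binomtest(int(on_grid.sum()), n=isi_ms.size, p=0.5)
    return result.pvalue, result.pvalue < alpha

# Example: intervals clustered near multiples of 0.4 ms
rng = np.random.default_rng(2)
isis = 0.4 * rng.integers(1, 8, 200) + 0.05 * rng.standard_normal(200)
print(test_04ms_preference(isis))
```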
4 Discussion
The simulation of interconnected choppers shows that such a network can produce stable ISIs in spite of changing input strengths, thereby modelling real choppers and superseding simulations of single choppers. The response of the network also matches the PSTHs and regularity analysis of real choppers. When synchronizing to AM signals, a dramatic difference shows up between the responses of simulated choppers with and without input from onset neurons. The dynamic ranges of periodicity encoding differ by at least 70 dB (Fig. 4b). Since real choppers code periodicity with a dynamic range similar to that of onset neurons, and since they are located close to these neurons, it seems reasonable to assume that they may receive an input from onset neurons. The question of whether the pitch of harmonic sounds is based on a neuronal analysis of periodicity information available as a temporal code in different frequency channels or, alternatively, on resolved harmonics in single channels continues to be debated. In light of this, the most remarkable feature of the chopper network may be that the combination of broad-band integration in onset neurons and narrow-band input of choppers allows for the coding of both periodicity and resolved harmonics in single neurons. This would suggest that both types of information are used in parallel in pitch perception.
5 Conclusions
– Based on the anatomical and physiological data we suggest a model of choppers that are arranged in a circular network and receive input from both auditory nerve fibers and an onset neuron.
– It is reasonable to assume that the chopper network employs a minimum synaptic delay.
– Evidence for a minimum synaptic delay of 0.4 ms is given by electrophysiological data.
– The simulation of the model is able to explain the large dynamic range of periodicity encoding in spite of the frequency tuning.
– By combining broad-band with narrow-band analysis, the model may explain corresponding aspects of pitch coding.
Acknowledgements. We would like to thank Mr. W. Hemmert and Mr. M. Holmberg of Infineon Technologies for their support and reading, and Mr. G. T. Sims for critical reading.
References
Adams JC (1983) Multipolar cells in the ventral cochlear nucleus project to the dorsal cochlear nucleus and the inferior colliculus. Neurosci Lett 37:205–208
Bahmer A, Langner G (2006a) Oscillating neurons in the cochlear nucleus: I. Experimental basis of a simulation paradigm. Biol Cybern 95:371–379
Bahmer A, Langner G (2006b) Oscillating neurons in the cochlear nucleus: II. Simulation results. Biol Cybern 95:381–392
Blackburn C, Sachs M (1989) Classification of unit types in the anteroventral cochlear nucleus: PST histograms and regularity analysis. J Neurophysiol 62:1303–1329
Bleeck S (2000) Holistische Signalverarbeitung in einem Modell latenzverknüpfter Neuronen. PhD thesis, TU Darmstadt
Borst JG, Helmchen F, Sakmann B (1995) Pre- and postsynaptic whole-cell recordings in the medial nucleus of the trapezoid body of the rat. J Physiol 489:825–840
Ferragamo M, Golding N, Oertel D (1998) Synaptic inputs to stellate cells in the ventral cochlear nucleus. J Neurophysiol 79:51–63
Frisina RD, Smith RL, Chamberlain SC (1990) Encoding of amplitude modulation in the gerbil cochlear nucleus: I. A hierarchy of enhancement. Hear Res 44:99–122
Hackett JT, Jackson H, Rubel EW (1982) Synaptic excitation of the second and third order auditory neurons in the avian brain stem. Neurosci 7:1455–1469
Langner G (1981) Neuronal mechanisms for pitch analysis in the time domain. Exp Brain Res 44:450–454
Langner G (1983) Evidence for neuronal periodicity detection in the auditory system of the guinea fowl: implications for pitch analysis in the time domain. Exp Brain Res 52:333–355
Langner G (1992) Periodicity coding in the auditory system. Hear Res 60:115–142
Langner G, Schreiner C (1988) Periodicity coding in the inferior colliculus of the cat: I. Neuronal mechanisms. J Neurophysiol 60:1799–1822
Ostapoff EM, Feng JJ, Morest DK (1994) A physiological and structural study of neuron types in the cochlear nucleus. II. Neuron types and their structural correlation with response properties. J Comp Neurol 346:19–42
Pfeiffer RR (1966) Classification of response patterns of spike discharges for units in the cochlear nucleus: tone-burst stimulation. Exp Brain Res 1:220–235
Rothman J, Manis P (2003) The roles potassium currents play in regulating the electrical activity of ventral cochlear nucleus neurons. J Neurophysiol 89:3097–3113
Young ED, Robert JM, Shofner WP (1988) Regularity and latency of units in ventral cochlear nucleus: implications for unit classification and generation of response properties. J Neurophysiol 60:1–29
Comment by Kollmeier
Your neural circuits for modulation tuning are critically dependent on the synaptic transmission time and the inherent time constant of the pacemaker chopper unit. Another critical parameter is the effective gain in your circular loop of chopper units, which may either become unstable or fail to provide enough ringing if the synaptic strength of successive units is not fixed to an optimum value. As also noted in my comment to Meddis, the time constants
of a single cell (which has to be large in this case) as well as its synaptic strength might not be the best fundamental principle onto which a model of modulation tuning should be based (see, e.g., the dissertation by Ulrike Dicke (Dicke 2003) and Dicke et al. (2006), which employ an alternative strategy without relying on time constants of individual cells). How do you prevent your ring of choppers from oscillating without any input? What evidence do you have that the chopper unit time constant is the “critical” fundamental parameter, as opposed to the assumption that modulation tuning is a network property rather than a property of a single neuron?

References
Dicke U (2003) Neural models of modulation frequency analysis in the auditory system. Universität Oldenburg, download at http://docserver.bis.uni-oldenburg.de/publikationen/dissertation/2004/dicneu03/dicneu03.html
Dicke U, Ewert S, Dau T, Kollmeier B (2006) A neural circuit transforming temporal periodicity information into a rate-based representation in the mammalian auditory system (submitted)
Reply
Response to Question 1: The ring of choppers receives input from auditory nerve fibers. Without activation from the nerve, the chopper neurons of the model cannot oscillate (Bahmer and Langner 2006b).
Response to Question 2: In contrast to the Meddis model, our model is designed for encoding modulation, but not for modulation tuning. Evidence for the time constant was found in single-unit recordings in the midbrain of the guinea fowl (Langner 1983) and the cat (Langner and Schreiner 1988), in the cat cochlear nucleus (Young et al. 1988, according to Bahmer and Langner 2006a), and also in pitch-shift experiments with AM signals in human subjects (Langner 1981).
References
Bahmer A, Langner G (2006a) Oscillating neurons in the cochlear nucleus: I. Experimental basis of a simulation paradigm. Biol Cybern, DOI: 10.1007/s00422-006-0092-6
Bahmer A, Langner G (2006b) Oscillating neurons in the cochlear nucleus: II. Simulation results. Biol Cybern, DOI: 10.1007/s00422-006-0091-7
Comment by Greenberg
Your model seems plausible if the onset units described in your model are multipolar stellate cells, which are known as “onset-chopper” cells in the physiological literature. The onset choppers are characterized by a large dynamic range (often 60–70 dB), a broad bandwidth of auditory-nerve fiber
inputs (twice as broad as other cochlear nucleus unit types), an absence of inhibitory sidebands, extremely high discharge rates (on average ca. 600 spikes/s) and exquisitely precise neural timing (synchronization coefficients as high as 0.99) (Rhode and Greenberg 1992).

References
Rhode WS, Greenberg S (1992) Physiology of the cochlear nuclei. In: Fay RR, Popper AN (eds) The mammalian auditory pathway: neurophysiology. Springer, Berlin Heidelberg New York, pp 94–152
Reply
Indeed, onset-choppers have a large dynamic range, but octopus cells have the highest dynamic range of periodicity encoding in the cochlear nucleus (up to 120 dB, Frisina et al. 1990). Moreover, according to our simulations (Fig. 4b), the bandwidth has a level-dependent optimum in periodicity encoding, and a large bandwidth is suitable only for high levels. Octopus cells are located adjacent to the region of chopper neurons. It is unknown whether they are connected to choppers, but according to our model it would be sufficient if they triggered only a small minority of neighbouring chopper neurons. However, we agree that the model could be extended and possibly improved by including onset choppers.
18 Forward Masking: Temporal Integration or Adaptation?
STEPHAN D. EWERT1,2, OLE HAU1, AND TORSTEN DAU1
1 Centre for Applied Hearing Research, Ørsted DTU, Technical University of Denmark, Denmark, [email protected], [email protected]
2 Medizinische Physik, Fakultät V, Institut für Physik, Carl von Ossietzky Universität, Oldenburg, Germany, [email protected]

1 Introduction
When a short signal tone is presented after a noise or tone masker, the detection threshold for the signal is elevated; the shorter the gap between masker and signal, the larger the elevation. This phenomenon is termed forward masking and refers to the fact that a masker affects the signal threshold when both are presented in a non-simultaneous, consecutive manner. With increasing temporal separation, the signal threshold usually drops to the performance in silence when the gap is in the region of hundreds of milliseconds. As a possible explanation for forward masking, mainly two different mechanisms have been discussed in the literature: (i) continuation or persistence of neural activity (e.g., Plomp 1964; Oxenham and Moore 1994), referring to temporal integration of neural activity at presumably higher stages than the auditory nerve; (ii) neural adaptation (e.g., Duifhuis 1973; Nelson and Swain 1996), assuming adaptation at various levels of the auditory pathway (including high levels). A third possible source for interaction of masker and signal is linked to the ringing of the auditory filters but is generally assumed to be negligible for signal frequencies of 1 kHz or higher (e.g., Vogten 1978). It is still unclear whether temporal integration or adaptation better accounts for forward masking in various stimulus configurations (Oxenham 2001), and the two mechanisms have not been compared directly in a common modeling framework to investigate their relation.

The current study compares two well-established models of temporal processing in the auditory system using a unified modeling framework: (i) the temporal-window model (e.g., Oxenham and Moore 1994), representing a temporal-integration mechanism, and (ii) the adaptation-loop model (e.g., Dau et al. 1996) as the representative of the adaptation mechanism. The unified modeling framework shares a compressive, non-linear auditory filter stage and a template-based (optimal detector) decision stage. The question is whether the temporal-window model and the adaptation-loop model can be considered in a unified modeling framework while maintaining their predictive power. Specifically, it is investigated whether the two models can help to distinguish between persistence and adaptation, the two hypothetical mechanisms underlying forward masking.
2 Methods

2.1 Procedure and Subjects
A three-interval, three-alternative forced-choice adaptive procedure (two-down, one-up rule) was used to determine detection thresholds in the simulations and experiments. The step size was 8 dB at the beginning and was halved after every two reversals until it reached a minimum of 1 dB, at which eight reversals were obtained for the threshold estimate. The starting level of the signal was 90 dB SPL. Two subjects participated as a control group. The stimuli were presented to one ear via headphones (AKG K-501) in a double-walled, sound-attenuating booth.
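For readers who wish to reproduce the tracking rule, the following sketch implements the two-down, one-up procedure with the step-size schedule given above. The simulated "listener" (a logistic psychometric function with a threshold near 60 dB) is purely hypothetical.

```python
import numpy as np

def adaptive_track(p_correct_at, start_level=90.0, start_step=8.0,
                   min_step=1.0, n_final_reversals=8, rng=None):
    """Two-down, one-up track: the step size starts at 8 dB, is halved after
    every two reversals until it reaches 1 dB, and the threshold estimate is
    the mean level at the final eight reversals. `p_correct_at(level)` is a
    stand-in for the listener: probability of a correct response at a level."""
    rng = rng or np.random.default_rng()
    level, step = start_level, start_step
    n_correct, direction = 0, 0
    reversal_levels, n_reversals = [], 0
    while len(reversal_levels) < n_final_reversals:
        correct = rng.random() < p_correct_at(level)
        if correct:
            n_correct += 1
            move = -step if n_correct == 2 else 0.0   # two-down
            if n_correct == 2:
                n_correct = 0
        else:
            n_correct, move = 0, +step                 # one-up
        if move:
            new_direction = np.sign(move)
            if direction and new_direction != direction:      # a reversal
                n_reversals += 1
                if step <= min_step:
                    reversal_levels.append(level)
                elif n_reversals % 2 == 0:
                    step = max(step / 2.0, min_step)
            direction = new_direction
            level += move
    return np.mean(reversal_levels)

# Hypothetical 3AFC listener: chance rate 1/3, true threshold near 60 dB
print(adaptive_track(lambda L: 1/3 + (2/3) / (1 + np.exp(-(L - 60) / 3))))
```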
2.2 Stimuli
Two forward masking experiments with signal tones at 1 and 4 kHz were conducted to test the models. At 1 kHz, a 10-ms, Hanning-windowed signal was used. The masker was a 200-ms, 77-dB, 20- to 5000-Hz frozen noise; no ramps were applied. The experimental design was the same as in Dau et al. (1996). At 4 kHz, a 12-ms signal, including 2-ms, raised-cosine ramps, was added to a 200-ms, 78-dB, 0- to 7000-Hz, frozen-noise masker. The masker included 2-ms, raised-cosine ramps. The design was the same as in Oxenham (2001). The offset-offset time of the signal and the masker was varied in the range from −10 to 150 ms, thus including conditions of simultaneous as well as non-simultaneous masking.
3 Models and Predictions
The processing modules of the temporal-window (TW) model according to Oxenham (2001) were implemented. The TW model uses a linear, time-invariant integration after non-linear peripheral processing. The shape of the integration window relevant for forward masking results from two exponential functions with time constants of 4.6 ms and 16.6 ms, added with a weight of 0.17 for the longer time constant. In the nonlinear part, the model uses an instantaneous power-law compression with an exponent of 0.25 for levels higher than about 35 dB SPL after peripheral band-pass filtering. At the output of the temporal window, representations of the signal and masker are derived, and detection is based on the best (signal+masker)-to-masker ratio at one instant in time.

The modules of the adaptation-loop (AD) model were implemented according to Dau et al. (1996). In the model, a series of five non-linear feedback loops mimics adaptation in the auditory system. The time constants of the adaptation loops are 5, 50, 129, 253, and 500 ms. In contrast to the TW model, the AD model (Dau et al. 1996) does not employ instantaneous compression of the output of the peripheral filters. For the detector, the model calculates a template representation consisting of the normalized difference between the masker-plus-supra-threshold-signal representation and the masker-alone representation after adaptation. Detection is based on cross-correlation of the template and the output representation given by the difference between masker-plus-current-signal and masker alone. For details of the TW and AD models, the reader is referred to the respective publications.
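The forward-masking side of the temporal window and the (M+S)/M decision variable can be sketched as follows. The double-exponential weights follow the values quoted above; the sampling rate, the omission of the backward-masking skirt and of the compressive nonlinearity, and the toy masker and signal envelopes are simplifying assumptions.

```python
import numpy as np

FS = 10_000                      # sampling rate of the intensity envelope [Hz]

def forward_window(t_ms, tau1=4.6, tau2=16.6, w=0.17):
    """Forward-masking side of the temporal window: two decaying exponentials
    with the quoted time constants and a weight of 0.17 on the longer one."""
    return (1 - w) * np.exp(-t_ms / tau1) + w * np.exp(-t_ms / tau2)

def smoothed(intensity, fs=FS):
    """Convolve an intensity envelope with the normalized temporal window."""
    t_ms = np.arange(0, 100.0, 1000.0 / fs)
    win = forward_window(t_ms)
    win /= win.sum()
    return np.convolve(intensity, win)[: len(intensity)]

# Toy 200-ms masker followed, after a 20-ms gap, by a brief 10-ms signal
t = np.arange(0, 0.4, 1 / FS)
masker = (t < 0.200).astype(float)
signal = ((t >= 0.220) & (t < 0.230)).astype(float) * 0.05
ratio = smoothed(masker + signal) / (smoothed(masker) + 1e-12)
print("max (M+S)/M ratio:", ratio.max())   # stand-in for the TW decision variable
```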
3.1 Unified Model Framework
Figures 1 and 2 show the modified TW and AD models as part of a common framework that shares the preprocessing and the decision stage. Peripheral filtering was simulated using the dual resonance non-linear (DRNL) model (Meddis et al. 2001). The DRNL model was adjusted to show a compression ratio of 0.25 for input levels in the region of about 40 to 70 dB SPL, comparable to the nonlinearity used in the original TW model. Stimuli were then subjected to half-wave rectification, lowpass filtering and squaring. The TW model structure has been modified to fit the optimal detector of the original AD model. Both model implementations in the unified framework have been verified to match the predictions of the original models in the 1- and 4-kHz experiments described above. Results are shown in Fig. 3. The time constants of the unified TW model were fitted to the 4-kHz condition while the adaptation loops of the AD model were kept unchanged. For both models, a better agreement between the control data and the model predictions was observed at 4 kHz. Both models showed too steep a decay of forward masking in the 10- to 30-ms offset-offset time region. Overall, the TW model showed a slightly better agreement with the data than the AD model. Average deviations between the original and unified models were in the region of a few decibels.

Fig. 1 Modified temporal-window (TW) model. The DRNL filter and the optimal detector were changed with respect to the original implementation. The optimal detector derives a template from the upper processing path. During the run of the experiment, the reference intervals, M/M, and the actual signal interval, (M+S)/M, are processed and correlated with the template

Fig. 2 Modified adaptation-loop (AD) model. The DRNL filter and the squaring module were changed with respect to the original implementation

Fig. 3 Predictions of the original models (squares) and the unified TW and AD models (triangles) in the 1- and 4-kHz forward masking experiment. Left: TW model (black). Right: AD model (gray). The stars and crosses indicate empirical control data from two subjects
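A reduced version of the shared preprocessing chain (with the DRNL stage replaced by the identity, and with an assumed first-order, 1-kHz low-pass filter) might look as follows.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(waveform, fs, lp_cutoff_hz=1000.0):
    """Sketch of the shared preprocessing after the peripheral filter:
    half-wave rectification, low-pass filtering and squaring. The cutoff
    and filter order are assumptions, not the published parameters."""
    rectified = np.maximum(waveform, 0.0)
    b, a = butter(1, lp_cutoff_hz / (fs / 2))
    return lfilter(b, a, rectified) ** 2

fs = 32_000
t = np.arange(0, 0.05, 1 / fs)
envelope = preprocess(np.sin(2 * np.pi * 4000 * t), fs)
print(envelope.shape, envelope.max())
```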
4 Model Analysis
In order to investigate how the modules of the two models account for forward masking, two simplified block diagrams of the TW and the AD model are shown in Figs. 4 and 5. In the TW model (Fig. 4), the internal representation of the masker+signal is divided by the representation of the masker alone, which is referred to as “divisor” in the following. The TW model divisor is shown in Fig. 6 (black line). The TW model is able to account for forward masking since the divisor declines only gradually after masker offset at 0.2 s, reflecting persisting masker energy or the effect of temporal integration. However, in addition to the persistence of masker energy, the fact that the TW model incorporates a division of both stimulus paths is crucial for the function of the model. The division module in the current model implementation equals the ratio detection criterion, (M+S)/M, in the original TW model.

A comparable analysis of the AD model (Fig. 5) reveals that the stages of the adaptation loops can be viewed as a similar division process (upper and middle panel). The difference is that the AD model incorporates feedback while the TW model is a pure feed-forward circuit. For each input stimulus condition, however, an equivalent adaptation divisor for a feed-forward mechanism can be derived (Fig. 5, lower panel). The equivalent AD model divisor is indicated by the dashed gray line in Fig. 6. In comparison to the TW divisor, it shows a bump at the temporal position of the signal, in this case at about 250 ms. This bump reflects a “self-suppression” effect of the signal.

Fig. 4 Simplified block diagram highlighting the division stage of the temporal-window model

Fig. 5 Simplified block diagram of the AD model (upper panel). For each input stimulus condition, the adaptation loops can be replaced by a division of the input waveform with an equivalent divisor, as shown in the lower panel

Fig. 6 Comparison of the divisors in the temporal-window model (black) and the adaptation-loop model (dashed gray)
4.1 Simplified Adaptation-loop Model
The above analysis has shown that the prominent difference between the two models could be reduced to the effect of the signal on the divisor function, which reflects a “self-suppression” of the signal. The AD model was further simplified (see Fig. 7) to derive the equivalent AD divisor function from the masker only, similar to the processing in the TW model. The hypothesis was that if the parameters of the temporal window in the TW model were adjusted to produce a divisor function matching the divisor of the simplified AD model, both models should predict the same forward masking curve. Figure 8 shows the matched divisors (left) and the predictions obtained with the two models (right). Both models predicted essentially the same data when their divisors were matched.
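The idea of an input-specific, feed-forward "equivalent divisor" can be illustrated with a single feedback loop. The single-loop reduction, the time constant and the masker envelope below are assumptions made for illustration only and do not reproduce the five-loop AD model.

```python
import numpy as np

def adaptation_loop(x, fs, tau_s=0.05):
    """One feedback adaptation loop: the output is the input divided by a
    low-pass filtered copy of the output (parameters are illustrative)."""
    y = np.zeros_like(x)
    state, alpha = 1e-3, 1.0 / (tau_s * fs)
    for n, xn in enumerate(x):
        y[n] = xn / state
        state += alpha * (y[n] - state)      # RC low-pass in the feedback path
    return y

def equivalent_divisor(x, fs):
    """Feed-forward divisor that produces the same output as the feedback
    loop for this particular input (cf. Fig. 5, lower panel)."""
    y = adaptation_loop(x, fs)
    return x / y

fs = 1000
masker = np.concatenate([np.ones(200), 0.01 * np.ones(200)])  # 200-ms masker, then near-silence
print(equivalent_divisor(masker, fs)[::50])   # divisor decays slowly after masker offset
```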
Fig. 7 Simplified adaptation-loop model where only the adaptation effect originating from the masker is considered
Fig. 8 Left: Divisor of the TW model (black) with time constants adjusted to match the divisor of the simplified AD model (dashed gray). Right: Predictions of the TW model and the simplified AD model at 4 kHz with the divisors shown in the left panel
5 Discussion
The key mechanism responsible for the simulation of forward masking in both models is the division of the internal representation of the signal by the representation of the persisting or temporally smoothed masker. In the temporal-window model, this division is realized in the detection process, while it is part of the feedback loops in the adaptation-loop model. With regard to forward masking, the temporal-window model can be viewed as a simplified adaptation model neglecting the “self-suppression” of the signal. In fact, both models use the identical key mechanism to describe forward masking. Thus, these model realizations cannot be used to critically distinguish between adaptation and persistence. The concept of persistence, as realized in the TW model, cannot lead to successful predictions without the ratio-based decision criterion.

The temporal-window model has proven its strength as a very flexible and well “controllable” tool to investigate, e.g., effects of peripheral compressive non-linearity on forward masking in the normal versus the impaired auditory system (e.g., Plack and Oxenham 1998). The adaptation-loop model has proven its strength in a variety of experimental conditions in addition to forward masking, such as spectro-temporal masking and modulation detection. The model has also been used as a front end in automatic speech recognition and objective speech quality assessment (whereby the detection stage was replaced by other post-processing devices). A possible modification of the adaptation stage might use a single low-pass filter in the feedback loops, where the parameters of the impulse response could be adjusted in a similar way to the time constants in the temporal-window model. The poorer fit in the 1-kHz case could be addressed either by a variation of the parameters in the TW and AD models or by using alternative peripheral filter functions (with different ringing) at low frequencies.
6 Conclusions
It was found that the TW and AD models can be considered essentially equivalent in predicting forward masking: the combination of integration and the signal-to-noise-ratio-based detection criterion in the TW model acts effectively as a simplified adaptation mechanism. However, since there is physiological evidence for adaptation along the auditory pathway, the AD model appears to be the more general approach. It shows the effect of adaptation in the internal representation of the stimuli and can be applied successfully to a broader class of masking conditions than the TW model.

Acknowledgments. This work was supported by the Danish Research Council.
References
Dau T, Püschel D, Kohlrausch A (1996) A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J Acoust Soc Am 99:3615–3622
Duifhuis H (1973) Consequences of peripheral frequency selectivity for nonsimultaneous masking. J Acoust Soc Am 54:1471–1488
Meddis R, O’Mard LP, Lopez-Poveda EA (2001) A computational algorithm for computing nonlinear auditory frequency selectivity. J Acoust Soc Am 109:2852–2861
Nelson DA, Swain AC (1996) Temporal resolution within the upper accessory excitation of a masker. Acust Acta Acust 82:328–334
Oxenham AJ (2001) Forward masking: adaptation or integration? J Acoust Soc Am 109:732–741
Oxenham AJ, Moore BC (1994) Modeling the additivity of nonsimultaneous masking. Hear Res 80:105–118
Plack CJ, Oxenham AJ (1998) Basilar-membrane nonlinearity and the growth of forward masking. J Acoust Soc Am 103:1598–1608
Plomp R (1964) The rate of decay of auditory sensation. J Acoust Soc Am 36:277–282
Vogten LLM (1978) Low-level pure-tone masking: a comparison of “tuning curves” obtained with simultaneous and forward masking. J Acoust Soc Am 63:1520–1527
Comment by Kohlrausch
You provided evidence that, for one forward masking condition, the two schemes previously published to explain forward masking are conceptually equivalent and predict the same results. My question: does this equality also apply to some of the additional properties of the adaptation-loop scheme, which were considered to be important when this scheme was first proposed by Dirk Pueschel in his Ph.D. thesis? 1) The fact that forward masking curves become steeper for shorter maskers, and 2) that, in simultaneous masking, no such influence of masker duration on detection is observed, while, in contrast, signal duration has a considerable effect on detection. In my understanding, the ability to model observation 1 lies in the different values of the time constants of the feedback loops, while observation 2 is primarily attributed to using a matched template for detection.

Reply
We would like to underline that our conclusion is that the temporal-window model can be viewed as a simplified adaptation model, not as a fully equivalent model. This implies that the temporal-window model cannot account for all effects that an adaptation model can cover. In particular, changes in the forward masking curve as a function of masker duration can, to our knowledge, not be accounted for by the temporal-window model with a fixed set of parameters/time constants. We also attribute your second observation to the matched-filter detection mechanism, which is conceptually different from the detector used in the temporal-window model as published in the literature. In our unified model, however, we used the matched-filter detector for both the temporal-window processing scheme and the adaptation scheme. Thus, we disregarded potentially limiting effects of the original temporal-window model detector.

Comment by Plack
I agree that the TW model and the modified AD model are equivalent in most forward masking situations. However, there is considerable evidence to suggest that processes after the basilar membrane (BM) are effectively linear in the way they combine the energy of BM vibration over time, at least with respect to forward and backward masking (Plack et al. 2002, 2006). Hence, it may be advantageous to keep the non-linearity out of the adaptation loops if possible. I’m not sure that I agree with the final statement in your paper that the AD model can be applied to a broader class of experimental masking conditions
than the TW model. The TW model has been applied successfully to forward masking, backward masking, simultaneous masking, and increment and decrement detection. Being effectively a low-pass filter, the TW can also account for gap detection, temporal integration, and some aspects of modulation detection. Although the decision device in the AD model is more sophisticated than that in the TW model, my understanding is that the AD model is successful across the full range of temporal resolution and masking tasks only when combined with an additional processing stage, such as a modulation filterbank. I like the physiological realism of the adaptation stage, but do you know of any psychophysical result that requires a simulation of adaptation to model the data?

References
Plack CJ, Oxenham AJ, Drga V (2002) Linear and nonlinear processes in temporal masking. Acustica 88:348–358
Plack CJ, Oxenham AJ, Drga V (2006) Masking by inaudible sounds and the linearity of temporal summation. J Neurosci 26:8767–8773
Reply
The temporal-window (TW) model has been a powerful tool for demonstrating the effects of cochlear compression on forward (and backward) masking and, in particular, for accounting for consequences of sensorineural hearing loss on forward masking functions. We agree that the broad applicability of the original AD model is also related to its more general detector, while the TW model uses a simpler detection mechanism. Our point in the current study is that the TW model can only function correctly under the assumption of the (S+N)/N ratio criterion in the decision process, i.e., the whole concept of persistence or temporal integration after compression only holds when it is connected with this specific criterion. This, in turn, is in principle equivalent to an adaptation mechanism. Thus, we argue that the TW model represents a concept which effectively provides correct predictions in these specific conditions, while essentially being a simplified adaptation model. As such, it does not allow one to simulate an internal representation of the stimulus that reflects properties of adaptation as found in physiology. We agree that it might not be necessary for a model that is used for the prediction of psychophysical detection/masking data to resemble effects of neural adaptation in the internal representation. However, in our view, the AD model remains the more general approach and therefore seems applicable to a broader class of experiments. Regarding your first point, we would like to point out that the non-linearity in the adaptation loops is, according to our analysis, to some extent equivalent to the division process in the ratio criterion of the TW model.
19 The Time Course of Listening Bands
PIERRE DIVENYI AND ADAM LAMMERT
Speech and Hearing Research, VA Medical Center and East Bay Institute for Research and Education, Martinez, California, USA, [email protected], [email protected]

1 Introduction
It has been known for decades that frequency analysis in the auditory system reveals the existence of bandpass filter-like channels – the Critical Bands – a finite number of which cover the whole range of audible frequencies, with the consequence that nearby frequencies are not resolved individually. Because critical bands are formed already in the cochlea, they appear with no delay: their contours are formed the moment an acoustic wave reaches the inner ear and disappear when the wave goes silent. Psychophysical measurement of the width of critical bands has been pursued by many investigators (for a summary, see Chap. 3 in Moore 2003), and this behavioral indicator of frequency selectivity has often been compared to biophysical and physiological measures of frequency selectivity, both excitatory and inhibitory, at various stages along the auditory pathway (see e.g., Evans 2001). Yet, when moving away from the psychophysics of what the listener is capable of doing with the aim of investigating what he/she actually does, one comes upon phenomena that critical band analysis alone cannot explain.

One such phenomenon was studied by the early proponents of signal detection theory in audition (Swets 1963), long before the concept of attention surreptitiously escaped the watchful eyes of orthodox behaviorism and gradually settled in experimental psychology. These investigators were wondering whether detection of signals of uncertain frequency occurs by the system switching between different bands or shifting a unique band from one frequency to another. They were careful not to refer to these bands as critical bands – to avoid confusion they named them “listening bands.” Among the many properties of listening bands reported, an important one was their capability to keep the listener’s attention tuned to a particular frequency even after a pure-tone signal was turned off, and to maintain it for a rather long duration in the absence of any signal or in the presence of broad-band random noise (Greenberg and Larkin 1968). Other investigators demonstrated that the listening band sluggishly remained centered at the frequency of the last signal heard until an audible tone of a different frequency was presented (Pastore and Sorkin 1971). Still others showed that it is possible to simultaneously tune several listening bands to different frequencies, or even to a frequency the listener was never physically presented with, only instructed to imagine (Schlauch and Hafter 1991). Thus, it seems that listening bands exist in the memory of listeners and linger on for long periods of time in quiet or in the absence of hearing another signal with a different, salient pitch (Demany and Semal 2005). Recent physiological data also suggest that tones give rise to activity patterns whose excitatory and inhibitory contours outlast the presence of the tone itself (Fritz et al. 2005). It could well be that the existence of listening bands stems from these contours and may underlie behavioral findings on listeners’ ability to shift the frequency focus. However, if the new frequency focus is in one of the inhibitory bands flanking the excitatory band of the previous frequency focus, the build-up of new excitation takes time and thus the shift of the listening band should not be instantaneous.

Unfortunately, important questions related to the timing of this phenomenon have not been asked: how long does it take to establish listening bands, how long do they last in the absence of a stimulus, and how long does it take to establish a new listening band when the frequency of a tonal stimulus changes? The present study attempts to answer these questions, which, it seems, have important physiological implications. The hypothesis to be tested in a psychophysical experiment is that when a tone of a different frequency follows one of a given frequency, it will be perceived with a delay because it first has to overcome the inhibitory effect of the previous tone – a process which takes time.
2 Methods
Since the delay stated by the hypothesis is not expected to be longer than a few milliseconds, the question to ask is whether there is a psychophysical method sensitive enough to measure time intervals so short. Earlier work (Divenyi and Danner 1977) showed that, in the 20- to 50-ms range, 4–6% differences of unfilled time intervals marked by brief tone or noise bursts can be reliably discriminated. In the present study, we used this ability to have listeners compare a 40-ms onset-to-onset time interval marked by two tone bursts of frequency f1 to a time interval marked by one tone burst of frequency f1 and another of frequency f2. The comparison was done using the Method of Adjustment: the listener was instructed to adjust the second time interval between the f1 and f2 frequency markers to match the first interval, the 40-ms standard between the two f1 frequency markers, as shown in the top diagram of Fig. 1. The difference between frequencies f1 and f2 was varied from condition to condition such that the geometric mean remained constant at 1 kHz. The tone bursts of 20-ms nominal duration (and 15-ms half-power duration) were shaped with a 2-ms onset and a 10-ms offset; their envelope was rounded to minimize transients. The time separating the onsets of the first bursts of the two observation intervals was 600 ms; the trials proceeded at a 2.5-s rate, thus allowing the subject a response time of close to 2 s. That is, the diagram should be imagined to be repeating with a cycle of 2.5 s. Stimuli were presented monaurally to the subject's right ear through an earphone (TDH-49 in MX/AR cushions) at 86 dB SPL. A run was terminated when the subject indicated that the two intervals were perceived as identical. The reported data represent the average of 48–96 adjustments for each subject in each experimental condition. Subjects were normal-hearing young volunteers.

Fig. 1 The stimulus from Experiment 1 (top) and the perceptual error in judging the second time interval (bottom)
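A sketch of the stimulus construction for the standard interval is given below. Raised-cosine ramps and a 44.1-kHz sampling rate are assumptions, since the exact envelope rounding of the original stimuli is not specified.

```python
import numpy as np

FS = 44_100

def tone_burst(freq_hz, dur_ms=20.0, rise_ms=2.0, fall_ms=10.0, fs=FS):
    """20-ms tone burst with a 2-ms onset and a 10-ms offset ramp
    (raised-cosine ramps assumed)."""
    n = int(round(dur_ms / 1000 * fs))
    t = np.arange(n) / fs
    env = np.ones(n)
    n_rise, n_fall = int(rise_ms / 1000 * fs), int(fall_ms / 1000 * fs)
    env[:n_rise] = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
    env[-n_fall:] = 0.5 * (1 + np.cos(np.pi * np.arange(n_fall) / n_fall))
    return env * np.sin(2 * np.pi * freq_hz * t)

def standard_interval(f1_hz, onset_to_onset_ms=40.0, fs=FS):
    """First observation interval: two f1 bursts whose onsets are 40 ms apart."""
    burst = tone_burst(f1_hz, fs=fs)
    out = np.zeros(int(round(onset_to_onset_ms / 1000 * fs)) + burst.size)
    out[: burst.size] += burst
    out[-burst.size:] += burst
    return out

# f1 and f2 one-third of an octave apart, with a geometric mean of 1 kHz
f1, f2 = 1000.0 * 2 ** (1 / 6), 1000.0 * 2 ** (-1 / 6)
x = standard_interval(f1)
print(x.size / FS, "s of audio for the standard interval")
```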
3 Results

3.1 Experiment 1

The averaged durations of the second time interval judged by our subjects to be identical to the first are illustrated in Fig. 1 as the “perceptual error”: the difference between the adjusted second observation interval and the 40-ms standard, as a function of the frequency difference ∆f in Hertz and in octaves. In this experiment, the frequency f2 was always equal to or lower than f1, i.e., the frequency change within each trial went in a downward direction. One notices that the error is positive when ∆f is zero, an outcome we attribute to the “time order error” (Helson and Himelstein 1955) frequently observed when comparing two identical stimuli. We assume for the present data that this error does not change across frequency differences, that is, that the temporal judgment error due to the frequency difference can be attributed to another source. This source appears to have to do with frequency f2 influencing the listeners’ time judgments such that they adjusted the interval shorter than 40 ms. This perceptual error takes a “W” shape with two minima (i.e., maximum error points): one when ∆f is about 1/3 octave and the other when it is around 2/3 octave. In addition, although at frequency differences of 1 octave and larger the time judgment error essentially vanishes, its uncertainty (measured as the standard error of the mean shown in the error bars) increases several fold.

On the whole, the results confirm our hypothesis: the negative perceptual errors suggest that the perceived onset of the burst with the f2 frequency occurred a few ms later than its physical onset, possibly because it had to overcome the inhibitory contour of the preceding burst of frequency f1. Interestingly, the subjects seemed to be very certain when adjusting this second time interval shorter than the first, as indicated by the small degree of uncertainty. The inhibitory contours of f1 do not appear to extend beyond 2/3 octave, as if the listening band was “shifted” up to that frequency difference, but for differences larger than this limit a “switching” (Swets’ [1963] term) between listening bands took place.

However, what could account for the dual error maxima? Indeed, this non-monotonicity is difficult to explain unless one assumes that the inhibitory contours of the listening band are long-lasting, that is, persisting at least for the duration of the 2-s interval that separates the last (f2) burst of a trial and the first burst (f1) of the next – in which case the onset of this first burst of frequency f1 will be perceived after a delay necessary to overcome the inhibition around the burst of frequency f2, i.e., the last tone of the preceding trial. However, unlike the downward frequency change in the second observation interval from f1 to f2, that frequency change moves in an upward direction. If the inhibitory contours around a certain frequency do not spread to the same extent above and below, as in masking (Shannon 1976) and in the ventral cochlear nucleus (Rhode and Greenberg 1994), then we would expect to obtain the “W”-shaped result. Granted, this explanation is based on two corollary hypotheses – (1) that the inhibition lingers on far beyond the cessation of a tone and (2) that it spreads farther in one direction on the frequency axis than in the other. If these hypotheses are true, they also bring with them the consequence that the perceptual errors observed in Experiment 1 actually represent the
sum of two delays: one that makes the second observation interval longer by delaying its second burst (of frequency f2), and one that makes the first observation interval shorter by delaying its first burst (of frequency f1). Fortunately, both hypotheses are testable with a modification of the stimulus used in Experiment 1.

3.2 Experiment 2
As it turns out, the above hypotheses are easily tested by introducing two modifications to the stimulus: (1) have a tone of frequency f1 precede the first burst of the trial, so that, if the inhibitory contours of the f2 tone in the previous trial must be overcome when the frequency of the tone burst changes back to f1, such a “disinhibition” occurs before the 40-ms interval is presented, and (2) add conditions in which the second observation interval’s frequency change goes upward. These modifications should also measure the perceptual error due to the f2 burst overcoming the inhibitory contours of the f1 burst in the second observation interval alone, rather than the two summed effects we predicted that the results of Experiment 1 reflect. The top diagram of Fig. 2 shows the stimulus: it differs from that of Experiment 1 in that it has a long (150-ms) tone of frequency f1 precede the stimulus proper. This added “cueing” tone has gradual (50-ms) onset and offset and is presented at a level 15 dB below the stimulus, in order for it not to interfere with the fine temporal discrimination needed to compare the 40-ms standard and the variable time intervals.

Averaged results of the subjects are shown in the body of Fig. 2. The top data graphs illustrate those separately for the downward and upward f1-to-f2 frequency changes and indicate two effects: (1) the perceptual errors of the time adjustments are about half the size of those observed in Experiment 1, and (2) the largest error for the upward frequency change occurs at a frequency difference of about 2/3 octave, i.e., about twice the ∆f at which the largest error for the downward frequency change is observed (~1/3 octave). In other words, the data confirm the two corollary hypotheses. To test more rigorously the hypothesis that the Experiment 1 results were induced by two frequency changes within any single trial (one downward and one upward), we computed for each ∆f value the sum of the perceptual errors observed at the downward and the upward frequency changes in Experiment 2 and compared it with the results of Experiment 1 averaged for the three subjects. The bottom graph of Fig. 2, in which this comparison is displayed, indicates that the summed results of Experiment 2 and the results of Experiment 1 are essentially identical. The most surprising finding that derives from this equivalence is that, in Experiment 1, the putative inhibitory effects of the tone of frequency f2 appear to have been still in effect when the first tone of frequency f1 was presented in the next trial – i.e., over a period of about 2 s.
[Fig. 2 graphics: top, the stimulus schematic (a 150-ms f1 cueing tone followed by the f1–f1 standard interval tstd and the f1–f2 variable interval tvar, with √(f1f2) = 1 kHz); middle and bottom, perceptual error in ms plotted against the frequency difference (0–700 Hz, i.e. 0–1 octave), including the cumulative down-up errors of Exp. 2 and the Exp. 1 results.]
Fig. 2 The stimulus from Experiment 2 (top), the error in judging the second time interval (middle) and a comparison of errors in both experiments (bottom)
The two experiments have generated results consistent with our hypotheses and in general agreement with physiological observations. We are thus inclined to think that establishing listening bands is a dynamic process that may reflect excitatory and inhibitory profiles at diverse stages of the auditory system.

3.3 A Model of Listening Band Dynamics
Excitatory build-up and decay in the auditory nerve expressed as discharge rate have been shown to obey an exponential law (Smith 1977). The model we explored follows this law and represents t∆f, the time required to shift the listening band from a first frequency to a second located ∆f Hz (or octaves) away, as

t∆f = A{exp(−a∆f) + u(∆f − flim)[1 − exp(−b(∆f − flim))]}    (1)

where A is a weighting constant that affects the inhibitory and the disinhibitory processes to the same degree, a and b are the growth constants of the inhibitory and disinhibitory processes, respectively, flim is the frequency difference at which the disinhibitory process begins, and u is the unit step function. The model constants were calculated using a nonlinear regression and the individual subjects’ data. The model output is shown as the dotted lines in the top graph of Fig. 2.
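As an illustration of Eq. (1) and of the nonlinear regression step, the sketch below (not the authors' analysis code) fits the model to synthetic data generated from the model itself with arbitrary parameter values; the data points, starting values and noise level are placeholders, not the subjects' results.

```python
# A minimal sketch, not the authors' analysis code: Eq. (1) plus a nonlinear
# regression (scipy curve_fit). The "observations" are synthetic values drawn
# from the model itself with arbitrary parameters, purely to show the fit step.
import numpy as np
from scipy.optimize import curve_fit

def shift_time(df, A, a, b, flim):
    """Eq. (1): time t_df needed to shift the listening band by df (octaves)."""
    u = (df >= flim).astype(float)                 # unit step u(df - flim)
    return A * (np.exp(-a * df) + u * (1.0 - np.exp(-b * (df - flim))))

rng = np.random.default_rng(0)
df_oct = np.linspace(0.0, 1.0, 13)                 # frequency differences (octaves)
true_params = (3.0, 4.0, 6.0, 0.5)                 # arbitrary A, a, b, flim
t_obs = shift_time(df_oct, *true_params) + rng.normal(0.0, 0.1, df_oct.size)

fit, _ = curve_fit(shift_time, df_oct, t_obs, p0=(2.0, 3.0, 5.0, 0.4), maxfev=20000)
print(dict(zip(("A", "a", "b", "flim"), np.round(fit, 2))))
```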
4 Discussion
The results’ general agreement with the model suggests that the buildup of listening bands, taking up to 2–4 ms at the maximum frequency change at which this buildup is observed, may be related to the excitatory buildup when the stimulus frequency is one that falls in the inhibitory area generated by a previously presented different frequency. Such a short delay for the buildup of excitation also suggests that the process responsible for it is likely to take place in, or close to, the auditory periphery. Our data also show that past this frequency limit the buildup diminishes and eventually vanishes, suggesting that the new excitatory process encounters less, and eventually no, inhibition: it builds up a contour around a frequency not affected by the previous tone’s response contours. However, the moment of the onset of this new tone is not integrated efficiently with that of the old – hence the increased variability at frequency differences approaching the octave – consistent with what has been observed for the discrimination of gaps between spectrally different markers (Divenyi and Danner 1977). However, the results also raise many questions. What would be the buildup time of response contours generated by broad-band instead of pure-tone markers? If the buildup delay truly originates at the periphery, does this
mean that shifting listening bands across ears would not result in any observable delay? Also, since the inhibitory sidebands are intensity dependent, would changing the intensity of the markers influence the buildup delay? If data collected in other experiments were to answer these questions in a way still consistent with the original hypotheses, these new experiments would strengthen the view that listening bands develop dynamically. In the absence of such data, our knowledge about changing the frequency focus of listening bands has not been significantly advanced.
5 Conclusion

Listening bands have an itch:
Will the ear scan? Will it switch?
Yet, unless one bets
on wisdom by Swets (1963),
knows this no son of a ....
Acknowledgments. The authors thank Ira Hirsh, James Saunders, and Steven Greenberg for many helpful comments on earlier versions of the manuscript, and the assistance of JC Sander for data analysis. The research was supported by the National Institutes of Health and the Department of Veterans Affairs.
References Demany L, Semal C (2005) The slow formation of a pitch percept beyond the ending time of a short tone burst. Percept Psychophys 67:1376–1383 Divenyi PL, Danner WF (1977) Discrimination of time intervals marked by brief acoustic pulses of various intensities and spectra. Percept Psychophys 21:125–142 Evans EF (2001) Latest comparison between physiological and behavioural frequency selectivity. In: Breebart DJ, Houtsma AJM, Kohlrausch A, Prijs VF, Schoonhoven R (eds), Physiological and psychological bases of auditory function. Shaker Publishing, Maastricht, the Netherlands, pp 382–387 Fritz J, Elhilali M, Shamma S (2005) Active listening: task-dependent plasticity of spectrotemporal receptive fields in primary auditory cortex. Hear Res 206:159–176 Greenberg GZ, Larkin WD (1968) Frequency-response characteristics of auditory observers detecting signals at a single frequency in noise: the probe-signal method. J Acoust Soc Am 44:1513–1523 Helson H, Himelstein P (1955) A short method for calculating the adaptation-level for absolute and comparative rating judgments. Am J Psychol 68:631–637 Moore BCJ (2003) Psychology of hearing, 5th edn. Academic Press, San Diego Pastore RE, Sorkin RD (1971) Adaptive auditory signal processing. Psychon Sci 23:259–260 Rhode WS, Greenberg S (1994) Lateral suppression and inhibition in the cochlear nucleus of the cat. J Neurophysiol 71:493–514 Schlauch RS, Hafter ER (1991) Listening bandwidths and frequency uncertainty in pure-tone signal detection. J Acoust Soc Am 90:1332–1339
Shannon RV (1976) Two-tone unmasking and suppression in a forward-masking situation. J Acoust Soc Am 59:1460–1470 Smith RL (1977) Short-term adaptation in single auditory nerve fibers: some poststimulatory effects. J Neurophysiol 40:1098–1111 Swets JA (1963) Central factors in auditory frequency selectivity. Psychol Bull 60:429–440
Comment by Shinn-Cunningham

Given my own interests in how spatial auditory cues affect performance, I wonder if you have considered what happens in your experiments when spatial cues are manipulated. Does this alter the basic results?

Reply

When the first marker of both intervals is presented in the left, and the second marker of both intervals in the right ear, the shape of the perceptual errors as a function of the frequency difference f1−f2 remains basically the same, except that (contrary to what was observed in the monaural Experiments 1 and 2) the errors do not entirely recover for large frequency differences. This suggests that the listening band center frequency can be manipulated by tones presented in the opposite ear. However, the true test of this hypothetical conclusion is an experiment in which only the second marker of the second interval, the one with the “odd” frequency f2, is presented in the opposite ear. Data from such an experiment indicate a generally similar nonmonotonicity for the perceptual error as a function of frequency difference, although the error is slightly smaller and the within-condition variability slightly larger than for the monaural case. Thus, it appears that the shift of the focus of the listening band can, indeed, be accomplished across the ears, i.e., at a site more central than the cochlear nucleus. This, of course, does not negate the possibility that a shift can occur also at a peripheral level, although the presence of an efferent control cannot be excluded.
20 Frogs Communicate with Ultrasound in Noisy Environments

PETER M. NARINS1, ALBERT S. FENG2, AND JUN-XIAN SHEN3

1 Departments of Physiological Science and Ecology & Evolutionary Biology, University of California, Los Angeles, CA 90095 USA
2 Department of Molecular & Integrative Physiology, University of Illinois, Urbana, IL 61801 USA
3 State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
1 Introduction
Males of the concave-eared torrent frog (Amolops tormotus) from Huangshan Hot Springs, China, produce diverse bird-like melodic calls with pronounced rising and/or falling frequency modulations that often contain spectral energy in the ultrasonic range (Feng et al. 2002; Narins et al. 2004). Acoustic playback experiments with these frogs in their natural habitat showed that males exhibited distinct evoked vocal responses when presented with the ultrasonic or audible components of a frog call. Electrophysiological recordings from the inferior colliculus (IC) confirmed the ultrasonic hearing capacity of these frogs and another sympatric species (Feng et al. 2006). To determine if the neural responses to ultrasound were the result of direct stimulation of the frog brain, we recorded averaged evoked potentials (AEPs) from the IC in the intact condition and again with the ears occluded. Occluding the ears completely eliminated the AEPs from the IC, suggesting that the ultrasound must be transduced by the inner ear itself. The dramatic shift into the ultrasonic range of both the harmonic content of the advertisement calls and the frog’s hearing sensitivity likely represents an adaptation that reduces signal masking by the intense broadband background noise from local streams.
2 Behavioral Evidence
To determine whether A. tormotus uses ultrasound to communicate, we conducted acoustic-playback experiments with eight males in their natural habitat. We recorded the vocalization patterns of these frogs under three experimental conditions for a period of 3 min each: (i) an NS period during which no sound
was presented, (ii) a US period during which we presented the ultrasonic components of a previously-recorded conspecific vocal signal at ~77 dB SPL (RMS reading), a sound level that is behaviorally relevant, (iii) an AUD period during which we presented the audible components (<20 kHz) of the same vocal signal at a similar level. For five frogs (* in Table 1), the male’s calling rate was markedly increased during the AUD and/or US period, compared to spontaneous calling during the NS periods; two frogs (601-4, 602-2) showed no overt evoked vocal responses to any playback stimulus (Feng et al. 2006). The stimulatory effect of the US components was most robust for frogs 531-1 and 601-2. Frog 531-1 did not produce any call during the NS periods, but it emitted 11 calls during the US period. Frog 601-2 produced 6 calls during the NS-1 period; it emitted 18 calls during the US period including four antiphonal responses that were precisely time-locked (within 60 ms of the stimulus offset) to the US stimulus (Fig. 1B). Frog 602-1 produced five calls during the US period including one antiphonal response. In summary, the 8 frogs studied produced a total of n=47 calls during the playbacks of the US stimuli, of which 5 were antiphonal calls. The probability of exactly k successes (antiphonal calls) in a binomial distribution (n, p) is:

P[X = k] = C(n,k) p^k q^(n−k)    (1)

where

C(n,k) = n! / [k! (n − k)!]    (2)
Table 1 Intraspecific playback experiments to determine the behavioral significance of the ultrasonic components of the Amolops call. NS-1 and NS-2: spontaneous calls given with no stimulus; US: calls given in response to ultrasound stimulus; AUD: calls given in response to audible components of the stimulus; * marks the five frogs whose calling rate increased during the US and/or AUD periods; numbers in parentheses: antiphonal responses

Frog #     NS-1    US       AUD      NS-2
*531-1     0       11       10(2)    0
*531-2     2
*601-2     6       18(4)
*531-3     0
*601-5     6
601-4      0       1        1(1)     –
602-1      3       5(1)     1        2
602-2      0       1        0
Total              47(5)    45(3)
Fig. 1 Averaged AEPs from the ICs of three Chinese frog species in response to 10 tone bursts over 1–40 kHz presented at a rate of 0.5 bursts/s. The arrows in a, b and c indicate the responses to 34, 22 and 4 kHz, respectively, the highest tone frequencies that elicited AEPs above the noise level for each species
In Eqs. (1) and (2), p is the probability of a success (60/15,000) and q = 1 − p is the probability of a failure. Using Eqs. (1) and (2), and with an interval of 15 s between stimuli, the probability that five antiphonal responses out of 47 total responses occur within a 60-ms time window by chance is 1.3 × 10⁻⁶ (binomial probability). In light of this vanishingly small probability, the most parsimonious conclusion is that the antiphonal responses are not a result of chance, but rather that males of A. tormotus detect and respond to ultrasound. Similarly, the probability that three antiphonal responses out of 45 total responses occur within a 60-ms time window in response to the AUD stimulus by chance is 7.7 × 10⁻⁴ (binomial probability).
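The two probabilities quoted above follow directly from Eqs. (1) and (2), as the short sketch below shows; it is a plain re-computation, not the authors' code.

```python
# Recomputes the binomial probabilities quoted in the text: p = 60/15,000 is the
# chance that a single response falls in a 60-ms window of a 15-s interval.
from math import comb

def binom_pmf(k, n, p):
    """Eqs. (1) and (2): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

p = 60.0 / 15000.0
print(binom_pmf(5, 47, p))   # US stimulus: ~1.3e-06
print(binom_pmf(3, 45, p))   # AUD stimulus: ~7.7e-04
```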
3 Physiological Evidence
In order to verify ultrasonic sensitivity, auditory-evoked potentials (AEPs) and single-unit activity were recorded from the inferior colliculus of three species of frogs: A. tormotus and Odorrana livida living near the noisy Tau Hua Creek in Huangshan Hot Springs, and the black-spotted pond frog, Pelophylax nigromaculata, an inhabitant of rice fields throughout much of China. Frogs were anesthetized by immersion in tricaine methanesulfonate and the IC was surgically exposed. Tone bursts (50–100 ms duration, 5-ms rise/fall times, presented at 0.5–1 burst/s) were broadcast from a free-field US loudspeaker (Tucker-Davis ES-1, 1–100 kHz) located 10 cm from the frog’s contralateral eardrum. The stimulus delivery system was calibrated such that the frequency response was equalized ±6 dB between 2 and 40 kHz. AEPs were averaged over 10 trials; for single unit recordings, each tone/intensity was presented 20 times to construct a PSTH. Representative AEPs are shown in Fig. 1. A. tormotus and O. livida, two sympatric frogs living near a fast-flowing creek that generates broadband noise up to 20 kHz (Feng et al. 2006), have clear US sensitivity; P. nigromaculata
inhabiting rice paddies lacks such sensitivity. As a control, both ears of A. tormotus were filled with modeling clay and the recordings were repeated. No AEPs resulted, suggesting that the US stimulus did not evoke AEPs via a direct effect on the IC, but rather required transmission through the middle and inner ears to be effective. Recordings from 12 of 30 single units in the IC of A. tormotus exhibited responses over a wide range of frequencies and confirmed the US sensitivity seen in the AEP recordings (Feng et al. 2006). Figure 2 illustrates the PSTHs and spike count vs frequency plots from one tonic unit (a, b) tuned to 20 kHz, and a second unit (c, d) tuned to 10 kHz. Both of these units are tuned to frequencies higher than those of any auditory cell or fiber previously reported from any species of frog (Gridi-Papp and Narins 2007). Clear responses to US frequencies are evident in these cells, as well as in 10 additional cells isolated from the IC.

Fig. 2 a PSTHs in response to tone-bursts between 5 and 30 kHz from a single tonic unit in the IC of A. tormotus. b Spike count plot showing peak response at 20 kHz. c PSTHs in response to tone-bursts between 5 and 30 kHz for a single phasic unit in the IC. d Spike count plot showing peak response at 10 kHz
4 Discussion
Fig. 3 A. tormotus. Scale bar: 3.5 mm

Males of A. tormotus exhibit middle ear morphology which would favor detection of high frequencies. The eardrum is recessed in a cavity or chamber such that it is afforded some degree of protection from objects coming into
contact with the head (Fig. 3). As a consequence of the recessed eardrum, the columella (stapes) and extracolumella (extrastapes) are physically smaller and less massive than those in the large majority of frogs which have tympanic membranes on the surface of the head (Wever 1985). The tympanic membrane, as in many species of Amolops, is transparent and exceedingly thin (3–4µm) at its edges (Feng et al. 2006). This combination of lightweight ossicles and thin tympanic membranes may be viewed as an adaptation for detection of high frequencies. Assuming the ear cavity is a simple Helmholtz resonator, its resonant frequency was calculated to be 4.3 kHz, very close to the fundamental frequency of many of the advertisement calls of this species (Narins et al. 2004). This correlation may have implications both for the reception (Capranica and Moffat 1983) and broadcasting (Purgue 1997) of the animal’s calls. Experiments are planned to examine the specific contributions of the hair cells, amphibian papilla, basilar papilla, tectorial membrane and Eustachian tubes of Amolops to its US sensitivity as well as the directional characteristics of its specialized ear. Moreover, since the female of this species lacks the recessed eardrum, it would be of interest to compare US sensitivity between the sexes. Although it is tempting to ascribe the US sensitivity of A. tormotus and O. livida to the outcome of selection pressure from the noisy environment, it is clear this argument does not apply in all cases. For example, it has been recently shown that one sympatric species, the piebald odorous frog, Odorrana schmackeri, lacks US sensitivity (Yu et al. 2006). Additional behavioral and physiological studies are needed to identify this trait among species inhabiting such noisy environments. Acknowledgments. We thank Chun-He Xu, Wen-Yu Lin, Zu-Lin Yu, Qiang Qiu, Zhi-Min Xu for their help with this work. We thank M. Kowalczyk for her help drawing Fig. 3. Supported by grants from the National Institutes of Health (R01DC-00222 to PMN and R01DC04998 to ASF), a UCLA Academic Senate Grant to PMN and a grant from the National Natural Sciences Foundation to J-XS.
References Capranica RR, Moffatt A (1983) Neurobehavioral correlates of sound communication in anurans. In: Ewert J-P, Capranica RR, Ingle DJ (eds) Advances in vertebrate neuroethology. Plenum Press, New York, pp 701–730 Feng AS, Narins PM, Xu C-H (2002) Vocal acrobatics in a Chinese frog, Amolops tormotus. Naturwissenschaften 89:352–356 Feng AS, Narins PM, Xu C-H, Lin W-Y, Yu Z-L, Qiu Q, Xu Z-M, Shen J-X (2006) Ultrasonic communication in frogs. Nature 440:333–336 Gridi-Papp M, Narins PM (2007) Sensory ecology of hearing. In: Dallos P, Oertel D, Hoy RR (eds) Handbook of the senses. Academic Press, London Narins PM, Feng AS, Schnitzler H-U, Denzinger A, Suthers RA, Lin W, Xu C-H (2004) Old world frog and bird vocalizations contain prominent ultrasonic harmonics. J Acoust Soc Am 115:910–913 Purgue AP (1997) Tympanic sound radiation in the bullfrog Rana catesbeiana. J Comp Physiol 181:438–445 Wever EG (1985) The amphibian ear. Princeton University Press, Princeton, NJ Yu Z-L, Qiu Q, Xu Z-M, Shen J-X (2006) Auditory response characteristics of the piebald odorous frog and their implications. J Comp Physiol 192:801–806
21 The Olivocochlear System Takes Part in Audio-Vocal Interaction

STEFFEN R. HAGE1, UWE JÜRGENS1, AND GÜNTER EHRET2

1 Dept. of Neurobiology, German Primate Center, Göttingen, Germany
2 Dept. of Neurobiology, University of Ulm, Germany
1 Introduction
The auditory system and the vocal control system do not function independently of each other. On the one hand, vocal output is directly influenced by auditory feedback; an example is the “Lombard” reflex. On the other hand, auditory perception is directly influenced by the vocal output. The middle-ear reflex is an example, in which the auditory input is attenuated by contraction of the middle ear muscles during self-produced sounds in order to protect the inner ear (Suga and Jen 1975). Damping of inner ear activation during one’s own vocalizations is also achieved via the action of the olivocochlear system (OCS) (Goldberg and Henson 1998). In order to tune the auditory sensitivity to environmental sounds of possible importance and, at the same time, protect the inner ear during self-produced vocalization, complex audio-vocal integration mechanisms must exist. Single-unit recording studies in the monkey and bat have revealed audio-vocal interactions in the auditory cortex, inferior colliculus and paralemniscal area at the midbrain-pons transition (Eliades and Wang 2003; Metzner 1993; Tammer et al. 2004). The caudal pontine brainstem, though rarely investigated, is another candidate area for such audio-vocal integration. In the caudal pontine brainstem, the superior olivary complex (SOC), including the periolivary region (POR), is part of the ascending and descending auditory systems (e.g. Helfert and Aschoff 1997) and vocalization output is blocked by injection of kynurenic acid (a glutamate antagonist) into this area (Jürgens 2000). In a recent study we reported that neurons with properties consistent with audio-vocal integrators are present at this lower brainstem level in awake, behaving, and vocalizing squirrel monkeys (Saimiri sciureus) during communication (Hage et al. 2006). Here, we discuss the possible contributions of these neurons to audio-vocal regulation processes.
2 Methods
We used a telemetric single-neuron recording technique, which allowed recording of hearing- and vocalization-related activity in freely moving animals (for a detailed description of the method: Grohrock et al. 1997; Jürgens and Hage 2006). Neuronal activity was recorded during all call types uttered. Quantitative data analysis was done for a highly frequency-modulated call type with a specific repetitive character (trill). To test whether the recorded neurons showed a consistent auditory response, bursts of white noise were used as acoustic stimuli, besides the animal’s own vocalization and vocalizations from its group mates. For the identification of vocal-motor and auditory units, conventional peri-event time histograms (PETH) and peri-stimulus time histograms (PSTH), respectively, were constructed after the original recording had passed a spike-sorting procedure (template-based spike-clustering). Verification of the recording sites was carried out histologically at the end of the experiments by staining the brain sections for glial fibrillary acidic protein, allowing the identification of electrode tracks even many months after withdrawal of the electrodes (Benevento and McCleary 1992).
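A conventional peri-event time histogram of the kind mentioned above can be sketched as follows; the spike and vocal-onset times are toy values, and the 5-ms bin width is taken from the legend of Fig. 1 rather than from a stated analysis parameter.

```python
# A minimal sketch (not the authors' analysis code) of a conventional PETH:
# spike times of a sorted unit are aligned to vocal-onset times and binned.
import numpy as np

def peth(spike_times, event_times, window=(-0.5, 1.0), bin_s=0.005):
    """Return bin edges and mean firing rate (spikes/s) aligned to the events."""
    edges = np.arange(window[0], window[1] + bin_s, bin_s)
    counts = np.zeros(edges.size - 1)
    for t0 in event_times:
        rel = spike_times - t0
        rel = rel[(rel >= window[0]) & (rel < window[1])]
        counts += np.histogram(rel, bins=edges)[0]
    rate = counts / (len(event_times) * bin_s)
    return edges, rate

# toy data: random spike times (s) and three hypothetical vocal onsets (s)
rng = np.random.default_rng(1)
spikes = np.sort(rng.uniform(0.0, 60.0, 2000))
onsets = np.array([10.0, 25.0, 40.0])
edges, rate = peth(spikes, onsets)
print(rate[:5])
```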
3 Results
A total of 322 units were isolated in the ventrolateral pontine brainstem of three squirrel monkeys, showing various response patterns to noise bursts. Neurons were located in the SOC, the ventral nucleus of the lateral lemniscus (vLL), the lateral lemniscus (LL) and the adjacent pontine reticular formation (FRP; Fig. 1A). Most of these neurons (n = 295) did not show activity prior to the onset of own vocalizations and, therefore, were defined as auditory neurons. A small group of the isolated neurons (n = 27) showed modulations of their activity (increases or decreases) to external acoustic stimuli as well as prior to and during self-produced vocalizations (for examples, see Fig. 1B). These audio-vocal units (AVU) were found at locations not described before, namely in the POR of the SOC and the adjacent pontine reticular formation. Two-thirds of the recorded AVU showed an increase of activity immediately before and during self-produced vocalization (excitatory AVU). The remaining AVU were characterized by suppression of spontaneous activity prior to and during self-produced vocalization (inhibitory AVU). About one-third of recorded AVU showed activity correlated with the repetition of the trill syllables, as shown in Fig. 1B. The locations of excitatory and inhibitory AVU showed little overlap. Almost all excitatory AVU were located more laterally in the POR and the adjacent pontine reticular formation than the inhibitory AVU (see Fig. 1A). A comparison of the auditory response types with those to self-produced vocalizations showed a significant non-homogeneous distribution (Fisher’s exact test, P<0.05). Most neurons with excitatory responses to
self-produced vocalizations had a tonic response pattern to noise bursts (13/18). Tonic activity in inhibitory AVU, in contrast, was very rare (1/9). Inhibitory AVU mainly showed phasic (4/9) or inhibited responses (4/9). In other words, most AVU (17/27) responded similarly to external and self-produced acoustic stimuli, with the only difference that the change in neural activity started before the acoustic stimulus when self-produced, as shown in Fig. 1B. When we compared auditory responses of excitatory and inhibitory AVU with those of close-by purely auditory neurons in the same electrode track (Fig. 1A), we found a statistically significant relationship (P<0.01, Fisher’s exact test): except for the very caudal periolivary region, excitatory AVU were more frequently co-localized with tonically responding auditory neurons, while inhibitory AVU were more frequently co-localized with phasically responding auditory neurons.

Fig. 1 A Frontal views of the squirrel monkey’s brainstem showing the spatial distribution of recorded excitatory (filled black circles) and inhibitory audio-vocal neurons (open circles) and the purely auditory neurons (grey dots). Scale, 500 µm. B Peri-event time histograms and raster plots of excitatory and inhibitory AVU showing similar activity patterns to self-produced vocalizations and bursts of white noise. Black bars below the trill-related activity indicate the onset of all trill vocalizations (time 0.0) and the duration of the shortest trill emitted; the gray bars indicate the duration of the longest trill. Different call durations are mainly due to different numbers of syllables in trill vocalizations. Black bars below the noise-related activity indicate the onset and duration of the noise bursts (300 ms). The relationship between neuronal activity and trill syllables of a representative trill call is shown. Bin size, 5 ms. (modified from Hage et al. 2006)
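The first of these comparisons can be reconstructed from the counts given in the text (13 of 18 excitatory AVU and 1 of 9 inhibitory AVU with tonic responses to noise bursts). The sketch below is such a reconstruction, not the authors' analysis; collapsing the remaining response patterns into a single non-tonic category is an assumption made here for illustration.

```python
# Fisher's exact test on the 2x2 table implied by the counts quoted in the text.
from scipy.stats import fisher_exact

table = [[13, 18 - 13],   # excitatory AVU: tonic vs non-tonic responses
         [1,  9 - 1]]     # inhibitory AVU: tonic vs non-tonic responses
odds, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds:.1f}, p = {p:.4f}")   # p < 0.05, as reported
```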
4 Discussion

4.1 Modulation of the Auditory System by Vocalization
AVU showed vocalization-related excitation or inhibition prior to vocal onset, indicating that they got input from the vocalization pathways or from AVU in the upper auditory pathways. Since more than half of the recorded AVU did not follow the rhythm of trill syllables and thus did not show an activity correlated with call patterns, we suggest that these neurons received their vocalization-related input rather indirectly, possibly via projections from the auditory cortex (e.g. Mulders and Robertson 2001) or the inferior colliculus (e.g. Huffman and Henson 1990), both of which are known to modulate the activity of the olivocochlear system (e.g. Groff and Liberman 2003; Mulders and Robertson 2005; Xiao and Suga 2002). The AVU described here may be involved in a general modulation of cochlear sensitivity to self-produced and external sounds. In the pontine brainstem, two systems are known to modulate cochlear sensitivity: the OCS, having its efferent neurons mainly in the POR of the SOC, and the middle-ear reflex, having its motoneurons ventrolateral to the trigeminal motor nucleus and ventromedial to the facial nucleus in primates (M. stapedius: Thompson et al. 1985; M. tensor tympani: Rouiller et al. 1986). Since the motoneurons of the middle-ear muscles differ in their activity patterns from the neurons recorded here (Suga and Jen 1975) and are located outside the explored area, we can exclude the middle-ear reflex as a possible function represented by the AVU activity. However, all except the two neurons dorsal of vLL could be part of the OCS. The neurons of the lateral OCS in the squirrel monkey are located lateral and caudal of the medial nucleus of the SOC (MSO); the neurons of the medial OCS are located medial, rostral and ventral of the MSO (Thompson
and Thompson 1986). With this division, the six neurons located in the POR medial of the MSO would belong to the medial OCS, probably together with the four neurons dorsal of the rostral MSO (stereotaxic coordinate 0.5 in Fig. 1A), while the 13 other neurons in the POR and the two neurons dorsal of the POR in between the MSO and the lateral nucleus of SOC (LSO) would belong to the lateral OCS. The lateral OCS projects in the cochlea directly to the afferent fibers from the inner hair cells and/or the inner hair cells themselves; the medial OCS innervates the outer hair cells (e.g. Guinan et al. 1984). Seven of the 10 AVU belonging to the medial OCS (as defined above) were inhibitory AVU. Their contribution to OCS function could be a tonic inhibitory influence on the olivocochlear neurons as found by Liberman (1988). The reduced activity of the inhibitory AVU during vocalization, in this case, would lead to an increased activity (disinhibition) of the olivocochlear neurons leading to a suppression of cochlear output (e.g. Wiederhold and Kiang 1970) and, thus, to a reduced cochlear sensitivity to self-produced trills. Since trills are quite loud (about 86 dB sound pressure level at a distance of 0.5 m) and have their main energy in a frequency range of high sensitivity in the squirrel monkey (Wienicke et al. 2001), the inhibitory AVU of the medial OCS could have a protective effect on the cochlea against overstimulation by self-produced trills, as has been proposed for external sounds before (e.g. Patuzzi and Thompson 1991). Since most inhibitory AVU responded to loud external noise bursts (80 dB) by inhibition or a weak phasic excitation followed by inhibition, this response would lead to the same protective effect as assumed for the self-produced trills. Fourteen of the 15 AVU belonging to the lateral OCS (as defined above) were excitatory AVU, responding either tonically or weakly phasically (sometimes followed by inhibition) to external sounds. The tonically responding excitatory AVU were found in the rostral brainstem (stereotaxic coordinates rostral to AP 0), the phasically responding excitatory AVU in the very caudal brainstem. The lateral OCS can have increasing and decreasing effects on the amplitude of the cochlear output by excitatory and inhibitory influences on the afferents from the cochlear inner hair cells (e.g. Mulders and Robertson 2005). Hence, tonically active excitatory AVU may sensitize or desensitize auditory nerve fibers when processing self-produced vocalizations. Excitatory AVU responding tonically to external sounds should, according to our results, exert the same modulatory effects on auditory nerve fibers during self-produced trills. Excitatory AVU with a weak phasic or even inhibitory response to external sounds should have little or even a reversed effect compared to that of self-produced trills. Thus, AVU of the lateral OCS are expected to have diverse effects on cochlear sensitivity, partly depending on whether they become activated by self-produced and/or external sounds. Such a diversity of effects of the lateral OCS is in agreement with existing evidence (e.g. Groff and Liberman 2003).
4.2 Other Implications
The Lombard reflex is characterized by an increase in vocal intensity when the auditory feedback of the vocalizer’s own voice is masked by noise. It can be found in monkeys and humans (Brumm et al. 2004). Furthermore, deafened kittens and monkeys show an increase in vocal intensity when compared with normal-hearing animals (Romand and Ehret 1984; Talmage-Riggs et al. 1972). This means that the intensity of one’s own voice is automatically up-regulated if it is less than a critical level above the acoustic background. Earlier studies assumed that the Lombard-reflex arc is located in the brainstem, because of its survival in decerebrated cats (Nonaka et al. 1997). With their unique properties of tonically modulating the afferent sensitivity to self-produced sounds while responding little, or in a reversed manner, to external sounds, the excitatory AVU in the caudal POR are candidates for mediating the Lombard reflex. Their vocalization-related activity could provide the decisive difference in the cochlear responses to self-produced compared to external sounds over a large range of intensities of external sounds.
5 Conclusion
The locations and activity patterns of most audio-vocal neurons with suppressed activity during self-produced vocalizations suggest that they are part of the medial olivocochlear system and function to protect the cochlea against overstimulation by the animal’s own sounds. The locations and activity patterns of most audio-vocal neurons with increased activity during self-produced vocalization suggest that they are part of the lateral olivocochlear system. Their function is to modulate the activity of cochlear nerve fibers and/or to control the Lombard reflex.

Acknowledgments. The authors thank Ludwig Ehrenreich for technical support and Roland Tammer for medical support. This study was supported by the Deutsche Forschungsgemeinschaft, Ju 181/16-1.
References Benevento LA, McCleary LB (1992) An immunocytochemical method for marking microelectrode tracks following single-unit recordings in long surviving, awake monkeys. J Neurosci Methods 41:199–204 Brumm H, Voss K, Kollmer I, Todt D (2004) Acoustic communication in noise: regulation of call characteristics in a New World monkey. J Exp Biol 207:443–448 Eliades SJ, Wang X (2003) Sensory-motor interaction in the primate auditory cortex during selfinitiated vocalizations. J Neurophysiol 89:2194–2207 Goldberg RL, Henson OW Jr (1998) Changes in cochlear mechanics during vocalization: evidence for a phasic medial efferent effect. Hear Res 122:71–81
Groff JA, Liberman MC (2003) Modulation of cochlear afferent response by the lateral olivocochlear system: activation via electrical stimulation of the inferior colliculus. J Neurophysiol 90:3178–3200 Grohrock P, Häusler U, Jürgens U (1997) Dual-channel telemetry system for recording vocalization-correlated neuronal activity in freely moving squirrel monkeys. J Neurosci Methods 76:7–13 Guinan JJ Jr, Warr WB, Norris BE (1984) Topographic organization of the olivocochlear projections from the lateral and medial zones of the superior olivary complex. J Comp Neurol 226:21–27 Hage SR, Jürgens U, Ehret G (2006) Audio-vocal interaction in the pontine brainstem during self-initiated vocalization in the squirrel monkey. Eur J Neurosci 23:3297–3308 Helfert RH, Aschoff A (1997) Superior olivary complex and nuclei of the lateral lemniscus. In: Ehret G, Romand R (eds) The central auditory system. Oxford University Press, New York, pp 193–258 Huffman RF, Henson OW Jr (1990) The descending auditory pathway and acousticomotor systems: connections with the inferior colliculus. Brain Res Brain Res Rev 15:295–323 Jürgens U (2000) Localization of a pontine vocalization-controlling area. J Acoust Soc Am 108:1393–1396 Jürgens U, Hage SR (2006) Telemetric recording of neuronal activity. Methods 38:195–201 Liberman MC (1988) Response properties of cochlear efferent neurons: monaural vs. binaural stimulation and the effects of noise. J Neurophysiol 60:1779–1798 Metzner W (1993) An audio-vocal interface in echolocating horseshoe bats. J Neurosci 13:1899–1915 Mulders WH, Robertson D (2001) Origin of the noradrenergic innervation of the superior olivary complex in the rat. J Chem Neuroanat 21:313–322 Mulders WH, Robertson D (2005) Diverse responses of single auditory afferent fibres to electrical stimulation of the inferior colliculus in guinea-pig. Exp Brain Res 160:235–244 Nonaka S, Takahashi R, Enomoto K, Katada A, Unno T (1997) Lombard reflex during PAGinduced vocalization in decerebrate cats. Neurosci Res 29:283–289 Patuzzi RB, Thompson ML (1991) Cochlear efferent neurones and protection against acoustic trauma: protection of outer hair cell receptor current and interanimal variability. Hear Res 54:45–58 Romand R, Ehret G (1984) Development of sound production in normal, isolated, and deafened kittens during the first postnatal months. Dev Psychobiol 17:629–649 Rouiller EM, Capt M, Dolivo M, De Ribaupierre F (1986) Tensor tympani reflex pathways studied with retrograde horseradish peroxidase and transneuronal viral tracing techniques. Neurosci Lett 72:247–252 Suga N, Jen PH (1975) Peripheral control of acoustic signals in the auditory system of echolocating bats. J Exp Biol 62:277–311 Talmage-Riggs G, Winter P, Ploog D, Mayer W (1972) Effect of deafening on the vocal behavior of the squirrel monkey (Saimiri sciureus). Folia Primatol 17:404–420 Tammer R, Ehrenreich L, Jürgens U (2004) Telemetrically recorded neuronal activity in the inferior colliculus and bordering tegmentum during vocal communication in squirrel monkeys (Saimiri sciureus). Behav Brain Res 151:331–336 Thompson GC, Thompson AM (1986) Olivocochlear neurons in the squirrel monkey brainstem. J Comp Neurol 254:246–258 Thompson GC, Igarashi M, Stach BA (1985) Identification of stapedius muscle motoneurons in squirrel monkey and bush baby. J Comp Neurol 231:270–279 Wiederhold ML, Kiang NY (1970) Effects of electric stimulation of the crossed olivocochlear bundle on single auditory-nerve fibers in the cat. 
J Acoust Soc Am 48:950–965 Wienicke A, Häusler U, Jürgens U (2001) Auditory frequency discrimination in the squirrel monkey. J Comp Physiol A 187:189–195 Xiao Z, Suga N (2002) Modulation of cochlear hair cells by the auditory cortex in the mustached bat. Nat Neurosci 5:57–63
22 Neural Representation of Frequency Resolution in the Mouse Auditory Midbrain

MARINA EGOROVA1, INNA VARTANYAN1, AND GUENTER EHRET2

1 Laboratory of Comparative Physiology of Sensory Systems, I.M. Sechenov Institute of Evolutionary Physiology and Biochemistry, RAS, St. Petersburg, Russia
2 Department of Neurobiology, University of Ulm, Ulm, Germany
1 Introduction
The theory of frequency analysis of complex sounds in the auditory system is based on the concept of a bank of internal bandpass filters with continuously variable center frequencies. These filters, called critical bands, determine the spectral resolution power of the ear, i.e. the ability to detect and perceive peaks in a sound spectrum separately. Originally, critical bands have been determined in experiments on the perception of tones in noise (Fletcher 1940); therefore they are psychophysical measures. Critical bandwidths (CBs) have been studied in psychophysical tests in humans (e.g. Zwicker and Feldtkeller 1967; Scharf 1970; Moore 1982) and behavioral tests in animals (e.g. Pickles 1975; Ehret 1976; Nienhuys and Clark 1979; Dooling 1980). In these tests, two fundamental properties of critical bands have been established: 1) species-specific dependence of the CBs on their center frequency and 2) intensity independence of the CBs through a wide range of sound intensities. The starting point of the frequency resolution of the whole auditory system is the frequency-place transformation in the cochlea. For mammals with non-specialized cochleae, CBs cover about equal distances along the basilar membrane (Greenwood 1961, 1990). Hence, the sizes of the CBs of a given species are strongly related to the species’ cochlear tonotopy. The neural coding of CBs has its first correlate in the frequency tuning of the cochlear nerve fibers without realizing, however, the intensity independence of the CBs (Ehret 1995). How are CBs coded in higher auditory brain centers? Only for the cat, experimental data on CBs are available for the auditory nerve (Pickles and Comis 1976), the central nucleus of the inferior colliculus (Ehret and Merzenich 1985, 1988) and the primary auditory cortex (Ehret and Schreiner 1997). Compared with behavioral data (Pickles 1975; Nienhuys and Clark 1979) obtained in tests with comparable methods of sound presentation (narrow-band masking), the two main perceptual properties of CBs mentioned above are found in neurons at the midbrain and cortical levels. The midbrain
data lead to the hypothesis that the influence of lateral inhibition flanking the excitatory receptive fields of neurons is important for the determination of the widths of CBs and for their independence of sound intensity (Ehret et al. 1988; Ehret 1995). The major goal of our present studies on the inferior colliculus in the mouse is to test this hypothesis and to determine the relationship between inhibitory receptive fields and neural CBs.
2 Methods
Single unit recordings were taken from the central nucleus of the inferior colliculus (ICC) of anesthetized (ketamine, xylazine) house mice Mus musculus, hybrids of the outbred strain NMRI and feral mice. For every neuron, the conventional single-tone excitatory tuning curve and two-tone response areas (Egorova et al. 2001, 2002) were measured to get the distribution of excitatory and inhibitory response areas in the neuron’s receptive field. Computer-controlled measurements were taken within the whole frequency range of hearing sensitivity of the mouse (3–80 kHz) and over a broad range of sound levels: from the units’ tone-response thresholds up to 80 dB above them, corresponding to a range of −20 to 90 dB SPL. Estimation of the neural CBs was made by simultaneous narrowband white noise masking with constant noise spectral level and variable bandwidth (96 dB/octave slope of the filter). First, a neuron’s response to tone bursts at its excitatory characteristic frequency (CFE) was masked by a continuous broadband noise. Then, the noise bandwidth was narrowed gradually and separately both from the high-frequency and the low-frequency sides until the tone response appeared again. The noise bandwidth that was just effective in masking the tonal response directly determined the lower and the upper borders of the critical bands (Vartanian et al. 1999).
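One way to realize the masker described above, band-limited white noise with steep (about 96 dB/octave) slopes and a fixed spectrum level, is sketched below. It is an illustrative implementation under assumed parameters (sampling rate, filter type, cutoffs), not the stimulus software used in the experiments.

```python
# A minimal sketch (not the authors' stimulus code) of a band-limited white-noise
# masker with steep slopes and constant spectrum level, so that the band edges
# can be moved in toward the tone frequency while power per Hz stays fixed.
import numpy as np
from scipy.signal import butter, sosfilt

def noise_band(f_lo, f_hi, dur=0.5, fs=192000, spectrum_level=1.0, seed=0):
    """Band-limited white noise with constant spectrum level (power per Hz)."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(int(dur * fs))
    # 16-pole Butterworth bandpass: roughly 96 dB/octave asymptotic slopes
    sos = butter(16, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, white)
    # scale so that power per Hz (not total power) is fixed as bandwidth changes
    band *= np.sqrt(spectrum_level * (f_hi - f_lo)) / (band.std() + 1e-12)
    return band

masker = noise_band(8000.0, 32000.0)   # e.g. a band around a 16-kHz CF
print(masker.shape, round(float(masker.std()), 2))
```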
3 Results and Discussion

3.1 Coding of Critical Band Properties
The data presented here are based on recordings from 76 neurons. The main properties of the neuronal CBs in mouse ICC, their dependence on neuron’s CFE and sound intensity, are shown in Fig. 1a,b (Vartanian et al. 1999). Regression analysis (see equation and regression line as solid line in Fig. 1a) showed with high statistical significance (p<0.001) that CBs increased with an increase in the neuron’s CFE. This frequency dependence of neural CBs is very similar to the frequency dependence of psychophysically measured CBs in the mouse (Ehret 1976, line with open circles in
Fig. 1a). Both frequency dependencies of CBs agree with the pattern of frequency representation along the basilar membrane in house mice (Ehret 1975). Hence, the average CFE dependence of CBs reflects the general property of the auditory system, which is the dependence of the auditory frequency resolution on the cochlear tonotopy.

Fig. 1 a Critical bandwidths (left ordinate) measured in single units of the inferior colliculus (closed symbols) and from behavioral tests (line with open symbols; Ehret 1976) of the mouse as a function of the units’ characteristic frequencies or tone frequencies (abscissa). The regression line (solid line) of neural data follows the equation indicated in the figure. The line with closed symbols shows the relationship between cochlea place (in mm from the apex, right ordinate) and the representation of frequencies (abscissa). b Critical bandwidths (ordinate) of single units of the mouse inferior colliculus as a function of the sound pressure level of tones above response threshold (abscissa)
The independence of neural CBs from sound intensity becomes evident in Fig. 1b. On average, neural CBs remain constant over an intensity range of up to at least 70 dB above the neurons’ tone-response thresholds. Thus, the fundamental properties of frequency resolution and critical band formation in hearing are present in the average activity of neurons in the mouse auditory midbrain.

3.2 Critical Bands and Neural Excitatory and Inhibitory Receptive Fields
The ICC neurons have been divided into four classes based on the shape parameters of their excitatory frequency response areas, the steepness of slopes on the two sides of the excitatory frequency response areas, and the degree of overlap of excitatory and inhibitory frequency response areas (Vartanian et al. 2000; Egorova et al. 2001). Examples of the receptive fields of neurons of the four classes are shown in Fig. 2. The percentages of neurons in the classes were the following: class I, 35.5%; class II, 31.5%; class III, 29%; class IV, 4%. Besides one or two excitatory characteristic frequencies (CFE), neurons often display two characteristic frequencies of inhibition, one below (CFLI)
and one above (CFHI) the CFE. The positions of the CFLI and CFHI relative to the position of the CFE indicate whether or not lateral inhibition starts close to the center of the excitatory response area of a neuron and whether the neuron may be more sensitive to excitatory or inhibitory influences. Usually, class II neurons had the largest inhibitory areas with the strongest inhibition often covering a large part of the excitatory area. Inhibition in class I neurons never extended through the excitatory area. In class III neurons, inhibition was usually weak and inhibitory areas small or only unilateral. The borders of the critical bands as indicated in Fig. 2 correlated with the shapes of excitatory and inhibitory response areas (Vartanian et al. 2000). CBs always extended at least through part of the excitatory area and often ended in the inhibitory areas. CBs never extended through areas of total inhibition, i.e. through areas in which a tone led to a total inhibition of the neuron’s response to a CFE tone. Thus, areas of total inhibition (dark gray areas in Fig. 2) always marked the limits of the critical band filters. The same relationship between inhibitory response areas and extensions of CBs has been found also in neurons of the cat ICC (Ehret and Merzenich 1988).

Fig. 2 Examples of receptive field properties of single neurons of the four classes (class I–IV) of the mouse inferior colliculus. Excitatory response areas are indicated by stippling, inhibitory response areas by shades of gray (light gray: partial inhibition, dark gray: total inhibition). Critical bands (solid horizontal lines) are indicated for the different signal levels above the CFE threshold. CFE: neuron’s excitatory characteristic frequency; CFLI: characteristic frequency of inhibition below the CFE; CFHI: characteristic frequency of inhibition above the CFE; CFEL: low-frequency CFE of class IV neuron; CFEH: high-frequency CFE of class IV neuron; CFCI: characteristic inhibitory frequency of central inhibitory area of class IV neuron. Abscissa: frequency, kHz; ordinate: sound level, dB SPL

3.3 Borders of Critical Bands and Characteristic Frequencies of Inhibition
The quantitative evaluation of the borders of CBs in relationship with the characteristic frequencies of inhibition showed highly significant correlations (p < 0.001) in class I–III neurons, as indicated by the correlation coefficients (r) of the linear regression lines in Fig. 3. Data of class IV are not included in this analysis because only a few (3) neurons have been recorded. The slopes of the regression lines of the high-frequency borders of the CBs (CBHF) vs the characteristic frequencies of the high-frequency side inhibition (CFHI) are all very close to 1.0, indicating that the high-frequency borders of the neural critical bands are determined by the characteristic frequencies of the high-frequency inhibitory fields. The high-frequency borders of neural CBs (CBHF) can be calculated by subtracting 1.6 kHz (class I), 1.8 kHz (class II) or 3.4 kHz (class III) from the values of the characteristic frequencies of high-frequency inhibitory fields (CFHI). Thus, it is evident that, on average, CBs are strongly restricted in their extension towards high frequencies by the presence of high-frequency lateral inhibition. The slopes of the regression lines of the low-frequency borders of the CBs (CBLF) vs the characteristic frequencies of the low-frequency side inhibition (CFLI) are close to 1.0 only for class I and class II neurons. For these neurons, the low-frequency borders of CBs (CBLF) can be calculated by adding 3 kHz (class I) or 1.7 kHz (class II) to the values of the characteristic frequencies of the low-frequency inhibitory fields (CFLI). Compared to the high-frequency side, this adding of a frequency range to the CFLI in order to obtain the CBLF indicates that, on average, the inhibition on the low-frequency side is not as strong as on the high-frequency side to stop the extension of the CBs. It is clear,
however, that at least for class I and class II neurons lateral inhibition in the auditory midbrain determines the borders of the CBs. The role of class III neurons with weak inhibitory areas remains to be clarified.

Fig. 3 Correlations between the borders of neural critical bands (CBHF, CBLF) and the characteristic frequencies of inhibition (CFHI, CFLI) of single neurons in the inferior colliculus of the mouse separately for neurons in the three classes (class I–III) and separately for the high-frequency (closed circles) and low-frequency (open circles) CB borders (high-frequency and low-frequency inhibitory areas), respectively. Linear regression lines (solid lines) follow the indicated equations. For further explanations, see text
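The average relationships summarized above can be restated compactly, as in the sketch below (values in kHz; the example neuron is hypothetical, and the class III low-frequency border is omitted because its regression slope was not close to 1).

```python
# A compact restatement (not the authors' code) of the reported average offsets
# between critical-band borders and the characteristic frequencies of inhibition.
OFFSETS_KHZ = {
    # class: (offset subtracted from CFHI, offset added to CFLI)
    "I":   (1.6, 3.0),
    "II":  (1.8, 1.7),
    "III": (3.4, None),   # low-frequency relation not reliable for class III
}

def cb_borders(neuron_class, cf_hi_inhib, cf_lo_inhib):
    """Estimate the lower (CBLF) and upper (CBHF) critical-band borders in kHz."""
    hi_off, lo_off = OFFSETS_KHZ[neuron_class]
    cb_hf = cf_hi_inhib - hi_off
    cb_lf = None if lo_off is None else cf_lo_inhib + lo_off
    return cb_lf, cb_hf

# e.g. a hypothetical class I neuron with CFLI = 12 kHz and CFHI = 22 kHz
print(cb_borders("I", cf_hi_inhib=22.0, cf_lo_inhib=12.0))   # (15.0, 20.4)
```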
4 Conclusions
Fundamental properties of the psychophysical critical band filters, their species-specific frequency dependence and their constancy over a wide range of sound intensities, are coded in the activity of neurons in the central nucleus of the auditory midbrain. The highly significant correlations between the widths and borders of CBs and characteristics of inhibitory receptive fields of ICC neurons support the hypothesis that CBs are created by the cochlear tonotopy and filtering and are shaped to their perceptual properties by inhibition at the level of the auditory midbrain.

Acknowledgements. Supported by the Volkswagen Foundation (project no. 1/69 589), the DFG (grants to GE), and the Russian Foundation for Basic Research (no. 06-04-48616).
References Dooling RJ (1980) Behavior and psychophysics of birds. In: Popper AN, Fay RR (eds) Comparative studies of hearing in vertebrates. Springer, Berlin Heidelberg New York, pp 261–288 Egorova M, Ehret G, Vartanian I, Esser KH (2001) Frequency response areas of neurons in the mouse inferior colliculus. I. Threshold- and tuning-characteristics. Exp Brain Res 140:145–161 Egorova MA, Vartanian IA, Ehret G (2002) Neural critical bands and inhibition in the auditory midbrain of house mouse (Mus domesticus). Dokl Biol Sci 382:131–133 Ehret G (1975) Masked auditory thresholds, critical ratios and scales of the basilar membrane of the house mouse (Mus musculus). J Comp Physiol 103:329–341 Ehret G (1976) Critical bands and filter characteristics of the ear of the house mouse (Mus musculus). Biol Cybern 24:35–42 Ehret G (1995) Auditory frequensy resolution in mammals: from neuronal representation to perception. In: Manley GA, Klump GM, Koeppl C, Fastl H, Oekingham H (eds) Advances in hearing research. World Scientific, Singapore, pp 387–397 Ehret G, Merzenich MM (1985) Auditory midbrain responses parallel spectral integration phenomena. Science 227:1245–1247 Ehret G, Merzenich MM (1988) Complex sound analysis (frequency resolution, filtering and spectral integration) by single units of the inferior colliculus of the cat. Brain Res Revs 13:139–163 Ehret G, Schreiner CE (1997) Frequency resolution and spectral integration (critical band analysis) in single units of the cat primary auditory cortex. J Comp Physiol A 181:635–650 Fletcher H (1940) Auditory patterns. Rev Mod Phys 12:47–65 Greenwood DD (1961) Critical bandwidth and the frequency coordinates of the basilar membrane. J Acoust Soc Am 33:1344–1356 Greenwood DD (1990) A cochlear frequency-position function for several species - 29 years later. J Acoust Soc Am 87:2592–2605 Moore BCJ (1982) An introduction to the psychology of hearing. Academic Press, London Nienhuys TGW, Clark GM (1979) Critical bands following the selective destruction of cochlear inner and outer hair cells. Acta Otolaryngol 88:350–358 Pickles JO (1975) Normal critical bands in the cat. Acta Otolaryngol 80:245–254 Pickles JO, Comis SD (1976) Auditory-nerve-fiber bandwidths and critical bandwidths in the cat. J Acoust Soc Am 60:1151–1156 Scharf B (1970) Critical bands. In: Tobias JV (ed) Foundations of modern auditory theory, vol 1. Academic Press, New York, pp 159–202 Vartanian IA, Egorova MA, Ehret G (1999) Expression of the main properties of critical bands in the neuronal activity of posterior quadrigemini in mice. Dokl Biol Sci 368:437–439 Vartanian IA, Egorova MA, Ehret G (2000) Critical bandwidths of different types of neurons in the mouse auditory midbrain. Dokl Biol Sci 373:364–366 Zwicker E, Feldtkeller R (1967) Das Ohr als Nachrichtenempfänger. Hirzel, Stuttgart
23 Behavioral and Neural Identification of Birdsong under Several Masking Conditions

BARBARA G. SHINN-CUNNINGHAM1, VIRGINIA BEST1, MICHEAL L. DENT2, FREDERICK J. GALLUN1, ELIZABETH M. MCCLAINE2, RAJIV NARAYAN1, EROL OZMERAL1, AND KAMAL SEN1

1 Boston University Hearing Research Center, USA
2 Department of Psychology, University at Buffalo, SUNY, USA
1 Introduction
Many animals are adept at identifying communication calls in the presence of competing sounds, from human listeners communicating in a cocktail party to penguins locating their kin amongst the thousands of conspecifics in their colony. The kind of perceptual interference in such settings differs from the interference arising when targets and maskers have dissimilar spectrotemporal structure (e.g., a speech target in broadband noise). In the latter case, performance is well modeled by accounting for the target-masker spectrotemporal overlap and any low-level binaural processing benefits that may occur for spatially separated sources (Zurek 1993). However, when the target and maskers are similar (e.g., a target talker in competing speech), a fundamentally different form of perceptual interference arises. In such cases, interference is reduced when target and masker are dissimilar (e.g., in timbre, pitch, perceived location, etc.), presumably by enabling a listener to focus attention on target attributes that differentiate it from the masker (Darwin and Hukin 2000; Freyman et al. 2001). We investigated the interference caused by different maskers when identifying bird songs. Using identical stimuli, three studies compare (a) human performance, (b) avian performance, and (c) neural coding in the avian auditory forebrain. Results show that the interference caused by maskers with spectrotemporal structure similar to the target differs from that caused by dissimilar maskers.
2 Common Stimuli
Targets were songs from five male zebra finches (five tokens from each bird). Three maskers were used that had identical long-term spectral content but different short-term statistics (see Fig. 1): 1) song-shaped noise (steady-state noise with spectral content matching the bird songs), 2) modulated noise (song-shaped noise multiplied by the envelope of a chorus), and 3) chorus (random combinations of three unfamiliar birdsongs). These maskers were chosen to elicit different forms of interference. Although the noise is qualitatively different from the targets, its energy is spread evenly through time and frequency so that its spectrotemporal content overlaps all target features. The chorus is made up of birdsong syllables that are statistically identical to target song syllables; however, the chorus is relatively sparse in time-frequency. The modulated noise falls between the other maskers, with gross temporal structure like the chorus but dissimilar spectral structure. Past studies demonstrate that differences in masker statistics cause different forms of perceptual interference. A convenient method for differentiating the forms of interference present in a task is to test performance for co-located and spatially separated target and maskers. We recently examined spatial unmasking in human listeners for tasks involving the discrimination of bird song targets in the presence of the maskers described above (Best et al. 2005). Results show that spatial unmasking in the noise and modulated noise conditions is fully explained by acoustic better-ear effects. However, spatial separation of target and chorus yields nearly 15 dB of additional improvement beyond any acoustic better-ear effects, presumably because differences in perceived location allow listeners to focus attention on the target syllables and reduce central confusions between target and masker. Here we describe extensions to this work, measuring behavioral and neural discrimination performance in zebra finches when target and maskers are co-located.

Fig. 1 Example spectrograms of a target birdsong and one of each of the three types of maskers
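For concreteness, the sketch below shows one plausible way to construct the three masker types from waveform arrays: phase-randomized noise with the songs' long-term magnitude spectrum, that noise multiplied by a chorus envelope, and a chorus built by summing songs. It is not the authors' stimulus-generation code, and the song tokens used here are placeholder signals rather than real zebra finch recordings.

```python
# A minimal sketch (not the authors' stimulus code) of the three masker types.
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(0)
fs = 32000
songs = [rng.standard_normal(fs * 2) for _ in range(3)]   # placeholder 2-s "songs"

def song_shaped_noise(songs):
    """Noise whose magnitude spectrum matches the average song spectrum."""
    mags = np.mean([np.abs(np.fft.rfft(s)) for s in songs], axis=0)
    phases = rng.uniform(0.0, 2.0 * np.pi, mags.size)
    noise = np.fft.irfft(mags * np.exp(1j * phases), n=len(songs[0]))
    return noise / np.max(np.abs(noise))

def chorus(songs):
    """Simple chorus: sum of (here, three) song tokens."""
    return np.sum(songs, axis=0)

def modulated_noise(songs):
    """Song-shaped noise multiplied by the broadband envelope of a chorus."""
    env = np.abs(hilbert(chorus(songs)))
    return song_shaped_noise(songs) * (env / np.max(env))

maskers = {"noise": song_shaped_noise(songs),
           "mod_noise": modulated_noise(songs),
           "chorus": chorus(songs)}
print({k: v.shape for k, v in maskers.items()})
```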
3 Human and Avian Psychophysics
Five human listeners were trained to identify the songs of five zebra finches with 100% accuracy in quiet, and then asked to classify songs embedded in the three maskers for target-to-masker energy ratios (TMRs) between −40 and +8 dB. Details can be found in Best et al. (2005). Four zebra finches were trained using operant conditioning procedures to peck a left (or right) key when presented with a song from a particular individual bird. For symmetry, songs from six zebra finches were used as targets, so that avian subjects performed a categorization task in which they pecked left for three of the songs and right for the remaining three (with the category groupings randomly chosen for each subject). Subjects were trained on this categorization task in quiet until performance reached asymptote (about 85–90% correct after 30–35 100-trial training sessions). Following training, the birds were tested with all three maskers on the target classification task at TMRs from −48 to +60 dB. Figure 2 shows psychometric functions (percent correct as a function of TMR) for the human and avian subjects (left and middle panels, respectively; the right panel shows neural data, discussed in Sect. 4). At the highest TMRs, both human and avian performance reach asymptote near the accuracy obtained during training with targets in quiet (100% for humans, 90% for birds). More importantly, results show that human performance is above chance for TMRs above −16 dB, but avian performance does not exceed chance until the TMR is near 0 dB. On this task, humans generally perform better than their avian counterparts. This difference in absolute performance levels could be due to a number of factors, including differences between the two species’ spectral and temporal sensitivity (Dooling et al. 2000) and differences in the a priori knowledge available (e.g., human listeners knew explicitly that a masker was present on every trial).
(Fig. 2 panels: Human Psychophysics, Zebra Finch Psychophysics, Zebra Finch Neurophysiology; ordinate: Percent Correct; abscissa: TMR (dB); legend: Noise, Mod Noise, Chorus, clean targets; chance performance indicated)
Fig. 2 Mean classification performance as a function of TMR in the presence of the three maskers for humans, zebra finches, and Field L neurons. Each panel is scaled vertically to cover the range from chance to perfect performance (also note different TMR ranges)
Comparison of the psychometric functions for the three different maskers reveals another interesting difference between the human and avian listeners. At any given TMR, human performance is poorest for the chorus, whereas the avian listeners show very similar levels of performance for all three maskers. In the previous study (Best et al. 2005) poor performance with the chorus masker was attributed to difficulties in segregating the spectrotemporally similar target and masker. Consistent with this, performance improved dramatically with spatial separation of target and chorus masker (but not for the two kinds of noise masker). The fact that the birds did not exhibit poorer performance with the chorus masker than the two noise maskers in the co-located condition may reflect the birds' better spectrotemporal resolution (Dooling et al. 2000), which enables them to segregate mixtures of rapidly fluctuating zebra finch songs more easily than humans do. For humans, differences in the forms of masker interference were best demonstrated by differences in how spatial separation of target and masker affected performance for the chorus compared to the two noise maskers. Preliminary results from zebra finches suggest that spatial separation of targets and maskers also improves avian performance, but we do not yet know whether the size of this improvement varies with the type of masker as it does in humans.
4 Avian Neurophysiology
Extracellular recordings were made from 36 neural sites (single units and small clusters) in Field L of the zebra finch forebrain (n = 7) using standard techniques (Sen et al. 2001). Neural responses were measured for “clean” targets (presented in quiet), the three maskers (each presented in quiet), and targets embedded in the three maskers. In the latter case, the TMR was varied (by varying the intensity of the target) between −10 dB and +10 dB. The ability of sites to encode target song identity was evaluated. Responses to clean targets were compared to the spike trains elicited by targets embedded in the maskers. A spike-distance metric that takes into account both the number and timing of spikes (van Rossum 2001; Narayan et al. 2006) was used to compare responses to targets embedded in maskers to each of the clean target responses. Each masked response was classified into a target song category by selecting the target whose “clean” response was closest to the observed response. Percent-correct performance in this one-in-five classification task (comparable to the human task) was computed for each recording site, with the temporal resolution of the distance metric set to give optimal classification performance. The recorded spike trains were examined for additions and deletions of spikes (relative to the response to the target in quiet) by measuring firing rates within and between target song syllables. Each target song was temporally hand-labeled to mark times with significant energy (within syllable) and temporal gaps (between syllable). The average firing rates in the clean and
masked responses of each site were then calculated separately for the within and between syllable portions of the spike-train responses. In order to account for the neural transmission time to Field L, the hand-labeled classifications of the acoustic waveforms were delayed by 10 ms to align them better with the neural responses. The across-site average of percent-correct performance is shown in Fig. 2 (right panel) as a function of TMR for each of the three maskers. In general, as suggested by the mean data, single-site classification performance improves with increasing TMR for all sites, but does not reach the level of accuracy possible with clean responses, even at the largest TMR tested (+10 dB TMR; rightmost data point). Strikingly, performance with the chorus was better than with either noise masker. This implies that, for the single-site neural representation in Field L, the spike trains in response to a target embedded in a chorus are most similar (in a spike-distance-metric sense) to the responses to the clean targets. The fact that zebra finch behavioral data are similar for chorus and noise maskers suggests that the main interference caused by the chorus arises at a more central stage of neural coding (e.g., due to difficulties in segregating the target from the chorus masker). As in the human and avian psychophysical results, overall percent correct performance for a given masker does not give direct insight into how each masker degrades performance. Such questions can only be addressed by determining whether the form of neural interference varies with masker type. We hypothesized that maskers could: 1) suppress information-carrying spikes by acoustically masking the target content (causing spike deletions), and/or 2) generate spurious spikes in response to masker energy at times that the target alone would not produce spikes (causing spike additions). Furthermore, we hypothesized that: 1) the spectrotemporally dense noise would primarily cause deletions, particularly at low TMRs, because previous data indicate that constant noise stimuli typically suppress sustained responses and the noise completely overlaps any target features in time/frequency; 2) the temporally sparse modulated noise would primarily cause additions, as the broadband temporal onsets in the modulated noise were likely to elicit spikes whenever they occurred; and 3) the spectrotemporally sparse chorus was also likely to cause additions, but fewer than the modulated noise, since not all chorus energy would fall within a particular site's spectral receptive field. Figure 3 shows the analysis of the changes in firing rates within and between target syllables. The patterns of neural response differ with the type of masker, supporting the idea that different maskers cause different forms of interference. Firing rates for the modulated noise masker (grey bars in Fig. 3) are largest overall, and are essentially independent of both target level and whether or not analysis is within or between target syllables. This pattern is consistent with the hypothesis that the modulated noise masker causes neural additions (i.e., the firing rate is always higher than for the target alone). The noise masker (black bars in Fig. 3) generally elicits firing rates lower than the modulated noise but greater than the chorus (compare black bars to grey and white bars).
(Fig. 3 panels: Within Syllables and Between Syllables; top-row ordinate: Mean Firing Rate (spike/s); bottom-row ordinate: ∆Rate re: no target (spike/s); abscissa: TMR (dB), including a no-target condition; legend: Chorus, ModNoise, Noise, Target alone)
Fig. 3 Analysis of firing rates within and between target song syllables. Top panels show average rates as a function of TMR for each masker (line shows results for target in quiet). Bottom panels show changes in rates caused by addition of the target songs (i.e., relative to presentation of the masker alone)
Within syllables, the firing rate in the presence of noise is below the rate to the target alone at low TMRs and increases with increasing target intensity (see black bars in the top left panel of Fig. 3 compared to the solid line). This pattern is consistent with the hypothesis that the noise masker causes spike deletions. Finally, responses in the presence of a chorus are inconsistent with our simple assumptions. Within target syllables at low TMRs, the overall firing rate is below the rate to the target alone (i.e., the chorus elicits spike deletions; white bars in the top left panel of Fig. 3). Of particular interest, between syllables, there are fewer spikes when the target is present than when only the chorus masker is present (i.e., the target causes deletions of spikes elicited by the chorus; e.g., the white bars in the bottom right panel of Fig. 3 are negative). In summary, the general trends for the noise and the modulated noise maskers are consistent with our hypotheses, i.e., we observe deletions for the noise at low TMRs and the greatest number of additions for the modulated noise. However, the results for the chorus are surprising. While we hypothesized that the chorus would cause a small number of additions, instead we observe nonlinear interactions, where the targets suppress responses to the chorus, and vice versa.
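For readers who want a concrete picture of the classification analysis described at the beginning of this section, here is a minimal sketch of a van Rossum-style spike-distance template classifier; the kernel time constant and spike-time lists are placeholders, and this illustrates the general approach rather than the actual analysis code.

```python
import numpy as np

def filtered_train(spike_times, tau, dt, duration):
    """Convolve a spike train with a one-sided exponential kernel (van Rossum 2001)."""
    t = np.arange(0.0, duration, dt)
    trace = np.zeros_like(t)
    for s in spike_times:
        later = t >= s
        trace[later] += np.exp(-(t[later] - s) / tau)
    return trace

def spike_distance(train_a, train_b, tau, dt, duration):
    fa = filtered_train(train_a, tau, dt, duration)
    fb = filtered_train(train_b, tau, dt, duration)
    return np.sqrt(np.sum((fa - fb) ** 2) * dt / tau)

def classify(masked_response, clean_templates, tau, dt, duration):
    """Assign a masked response to the target whose clean response is closest."""
    d = [spike_distance(masked_response, template, tau, dt, duration)
         for template in clean_templates]
    return int(np.argmin(d))
```

In practice the time constant tau would be varied per site to maximize percent-correct performance, as described above.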
5 Conclusions
In order to communicate effectively in everyday settings, both human and avian listeners rely on auditory processing mechanisms to ensure that they can: 1) hear the important spectrotemporal features of a target signal, and 2) segregate it from similar competing sounds.
The different maskers used in these experiments caused different forms of interference, both perceptually (as measured in human behavior) and neurally (as seen in the pattern of responses from single-site recordings in Field L). Equating overall masker energy, humans have the most difficulty identifying a target song embedded in a chorus. In contrast, for the birds, all maskers are equally disruptive, and in Field L, the chorus causes the least disruption. These avian behavioral and physiological results suggest that species specialization enables the birds to segregate and identify an avian communication call target embedded in other bird songs more easily than humans can. Neither human nor avian listeners performed as well in the presence of the chorus as might be predicted by the single-site neural responses (which retained more information in the presence of the chorus than the two noise maskers). However, the neural data imply that there is a strong non-linear interaction in neural responses to mixtures of target songs and a chorus. Human behavioral results suggest that identifying a target in the presence of spectrotemporally similar maskers causes high-level perceptual confusions (e.g., difficulties in segregating a target song from a bird song chorus). Moreover, such confusion is ameliorated by spatial attention (Best et al. 2005). Consistent with this, neural responses are degraded very differently by the chorus (i.e., there are significant interactions between target and masker responses) than by the noise (which appears to cause neural deletions) or the modulated noise (which causes neural additions). Future work will explore the mechanisms underlying the different forms of interference more fully, including gathering avian behavioral data in spatially separated conditions to see if spatial attention aids performance in a chorus masker more than in noise maskers. We will also explore how spatial separation of target and masker modulates the neurophysiological responses in Field L. Finally, we plan on developing an awake, behaving neurophysiological preparation to explore the correlation between neural responses and behavior on a trial-to-trial basis and to directly test the importance of avian spatial attention on behavioral performance and neural responses. Acknowledgments. This work is supported in part by grants from the Air Force Office of Scientific Research (BGSC), the National Institutes of Health (KS and BGSC), the Deafness Research Foundation (MLD) and the Office of Naval Research (BGSC).
References
Best V, Ozmeral E, Gallun FJ, Sen K, Shinn-Cunningham BG (2005) Spatial unmasking of birdsong in human listeners: energetic and informational factors. J Acoust Soc Am 118:3766–3773
Darwin CJ, Hukin RW (2000) Effectiveness of spatial cues, prosody, and talker characteristics in selective attention. J Acoust Soc Am 107:970–977
Dooling RJ, Lohr B, Dent ML (2000) Hearing in birds and reptiles. In: Popper AN, Fay RR (eds) Comparative hearing: birds and reptiles. Springer, Berlin Heidelberg New York
Freyman RL, Balakrishnan U, Helfer KS (2001) Spatial release from informational masking in speech recognition. J Acoust Soc Am 109:2112–2122
Narayan R, Grana GD, Sen K (2006) Distinct time-scales in cortical discrimination of natural sounds in songbirds. J Neurophys [epub ahead of print; doi: 10.1152/jn.01257.2005]
Sen K, Theunissen FE, Doupe AJ (2001) Feature analysis of natural sounds in the songbird auditory forebrain. J Neurophys 86:1445–1458
van Rossum MCW (2001) A novel spike distance. Neural Comp 13:751–763
Zurek PM (1993) Binaural advantages and directional effects in speech intelligibility. In: Studebaker G, Hochberg I (eds) Acoustical factors affecting hearing aid performance. College-Hill Press, Boston, MA
24 Near-Threshold Auditory Evoked Fields and Potentials are In Line with the Weber-Fechner Law BERND LÜTKENHÖNER, JAN-STEFAN KLEIN, AND ANNEMARIE SEITHER-PREISLER
1 Introduction
According to the Weber-Fechner law, the relationship between sound pressure and perceived loudness should be logarithmic. However, psychoacoustic data do not support this law. Instead, it is widely accepted now that intensity and loudness are approximately related by a power law such that a doubling in loudness roughly requires a tenfold increase in intensity (Stevens 1955). The power law is not applicable at low intensities, though: Extrapolation of the function derived for higher intensities leads to a considerable overestimation of the loudness at threshold (see, e.g., the compilation of data in Buus et al. 1998). The physiological basis of loudness perception is only partially understood. Fletcher and Munson (1933) postulated that loudness is proportional to the number of nerve impulses per second reaching the brain along all the excited nerve fibers. More recent studies (e.g. Relkin and Doucet 1997) showed the limitations of this old concept. Nevertheless, the fact remains that the overall activity of the auditory pathways is, over a wide range, a monotonic function of stimulus intensity, and it still appears plausible to assume that related measures, such as the amplitudes of far-field evoked potentials, reflect aspects of loudness coding, at least qualitatively. This holds especially for low intensities, where cochlear compression and factors such as saturation and non-monotonic rate-intensity functions of neurons are less relevant than at higher intensities. Here, we investigated the intensity dependence of wave V of the brainstem auditory evoked potential (BAEP) and, as a measure of the cortical activation, wave N100m of the auditory evoked field (AEF). We will show that, at low sound intensities, both responses are basically in line with the Weber-Fechner law.
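For orientation, the two laws contrasted here can be written in their standard textbook forms (the symbols below are generic and are not taken from this chapter):

\[ L_{\mathrm{Stevens}} = k\,I^{\,0.3}, \qquad L_{\mathrm{Fechner}} = c\,\log\!\left(\frac{I}{I_0}\right) \]

With an exponent of about 0.3 for loudness as a function of intensity, a tenfold increase in intensity roughly doubles loudness, since \(10^{0.3} \approx 2\).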
2 Methods
In a first study (Klein 2006), we recorded the AEF in response to 1-kHz tonebursts of 120 ms duration (including 10 ms for both rise and fall). The measurements were carried out in a magnetically shielded room using a 37-channel
axial gradiometer system (for details see Lütkenhöner and Steinsträter 1998). The auditory threshold (0 dB sensation level, SL) was defined as the intensity where the subject detected about 30% of the presented stimuli. It was determined while the subject was in the final measurement position. During the AEF measurements, the stimuli were presented at intervals randomized between 1.2 and 1.8 s (mean: 1.5 s). Most of the recording time was devoted to intensities below 10 dB SL. All in all, about 9000 stimuli were presented, 600 at the highest intensity (40 dB SL) and 2400 at the lowest intensity (2 dB SL). The total measurement time was subdivided into six sessions (two per day). The data of each session were reduced to a current dipole (for technical details see Lütkenhöner et al. 2003). The time courses of the dipole moments were finally averaged over all sessions, separately for each intensity. In what follows, we will consider only the grand average of our five subjects. In a second study, we recorded the BAEP elicited by a 4-kHz tone pulse (effective duration of about 1 ms). The stimuli were transmitted through a short plastic tube (length of 25 cm) from an electrically shielded headphone to an earplug in the subject's right ear. The investigation was done in an anechoic chamber so that the threshold could be determined more reliably than in the first study. It was estimated using a two-interval two-alternative forced-choice procedure that was combined with a transformed up-down method converging to p = 0.794 (Levitt 1971). The two intervals (each having a duration of 4 s, test stimulus in the middle of one of them) were marked by three weak 250-Hz tone pips. The sequence was preceded (gap of 500 ms) by a weak 500-Hz tone pip that served as an alarm signal. The standard test stimulus was a sequence of five 4-kHz tone pips presented at 500-ms intervals. The threshold for this sequence will be referred to as 0 dB SL1 (the subscript indicates that this is basically the threshold for a single tone pulse, because significant temporal integration in the auditory periphery is not to be expected for the five pulses). The BAEP was recorded between vertex and right and left earlobe, respectively. Here we will present only data from the most extensively studied subject (the first author). In this subject, more than one million stimulus repetitions were achieved for eight intensities between 11 and −3 dB SL1. Although the stimuli were presented with a high repetition rate (intervals of 16 ms), the overall measurement time was exorbitant. Thus, this experiment could be realized only because the subject did part of his desk work (e.g. reviewing manuscripts, reading doctoral theses) in the anechoic chamber while being measured, over a period of several months. Here, we will confine ourselves to the dominant peak in the response, wave V.
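As an aside on the tracking rule: within Levitt's (1971) framework, a target probability of p = 0.794 is the convergence point of a three-down one-up procedure, which is presumably the transformation used here, since

\[ p^{3} = 0.5 \;\Rightarrow\; p = 0.5^{1/3} \approx 0.794. \]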
3 Results
3.1 Auditory Evoked Field (AEF)
Figure 1 shows how the AEF time course changes with decreasing stimulus intensity. At the highest intensity (40 dB SL), three peaks can be recognized: a small positive peak around 50 ms (P50m), a pronounced negative peak
around 100 ms (N100m), and a second positive peak around 200 ms (P200m). We will consider only the N100m.
Fig. 1 Auditory evoked field (AEF) in response to a 1-kHz toneburst of 120 ms duration (grand average of five subjects). The stimulus intensities were 2 (thickest curve), 4, 6, 8, 10 (thick curves), 15, 20, 30, and 40 dB SL. The amplitude of the dominant peak (N100m) decreases with decreasing intensity, while the latency increases (Lütkenhöner and Klein 2007)
Figure 2A shows that, for intensities below 20 dB SL, the N100m amplitude is approximately a linear function of the stimulus intensity in dB. Thus, if the dB value is denoted as x and the N100m amplitude as y, we may write

y = α · (x − x0)    (1)
Fig. 2 Intensity dependence of N100m: A amplitude; B latency (Lütkenhöner and Klein 2007)
The parameter x0 may be interpreted as the N100m threshold: the intensity where the N100m amplitude becomes zero. A least-squares fit based on the intensities below 15 dB (dashed line) resulted in x0 = 0.2 dB. Thus, N100m threshold and psychoacoustic threshold agree almost perfectly. For the parameter α we obtained the value 1.19 nAm/dB. A good fit for all intensities is achieved by assuming that the N100m saturates at higher intensities such that its amplitude becomes

s(y) = c·y / √(c² + y²)    (2)
(gray solid curve in Fig. 2A). For c we derived the value 45.7 nAm. The N100m latency (Fig. 2B) approximately follows the function L = L3+ Lo : 10- q (x - x ) /20 0
(3)
As in Eq. (1), x0 is the threshold extrapolated from the N100m amplitudes. A least-squares fit for the remaining parameters resulted in L∞ = 96.0 ms, L0 = 141.9 ms, and q = 0.947 (curve plotted as a gray solid line). The parameter L∞ specifies
the N100m latency for very high intensities and may be interpreted as the transmission delay of the system, whereas L0 is the additional delay for a signal of threshold intensity x0. The following consideration suggests that the additional delay L0 is largely caused by temporal integration. For a system with perfect sound pressure integration, the product of integration time T and sound pressure P is constant: T · P = T0 · P0, where T0 is the integration time for a signal at the threshold P0. With the above terminology, we may write

P / P0 = 10^((x − x0)/20)    (4)

Thus

T = T0 · 10^(−(x − x0)/20)    (5)
which is basically the second term on the right side of Eq. (3), because the parameter q in that equation is close to 1. At threshold, the integration time T0 should correspond to the effective duration of the stimulus. The latter is about 110 ms in our case (counting rise and fall time half). This value is somewhat smaller than the value estimated for L0 in Eq. (3). Nevertheless, in view of the coarseness of the above consideration, the match is surprisingly good. If we constrain Eq. (3) by letting q = 1 and L0 = 110 ms, a least-squares fit results in the curve plotted as a dashed line in Fig. 2B (the scale on the right refers to the temporal integration effect described by this model). While the fit is excellent at lower intensities, marked deviations are found now at higher intensities. However, the problem could be easily fixed by assuming that the transmission delay is not completely independent of stimulus intensity (slightly shorter delay at higher intensities, which does not appear unreasonable).
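As an illustration of how parameter values such as those quoted above can be obtained, the following sketch fits Eqs. (1)–(3) with a standard least-squares routine; the data arrays are invented placeholders rather than the measured values, and the fitting details of the original study may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder data: intensity x in dB SL, N100m amplitude in nAm, N100m latency in ms.
x = np.array([2, 4, 6, 8, 10, 15, 20, 30, 40], dtype=float)
amp = np.array([2.0, 4.5, 7.0, 9.5, 11.5, 16.5, 21.0, 26.5, 30.0])
lat = np.array([235.0, 218.0, 203.0, 190.0, 180.0, 160.0, 146.0, 122.0, 110.0])

def amp_linear(x, alpha, x0):                 # Eq. (1), valid at low intensities
    return alpha * (x - x0)

def amp_saturating(x, alpha, x0, c):          # Eq. (1) passed through the saturation of Eq. (2)
    y = alpha * (x - x0)
    return c * y / np.sqrt(c**2 + y**2)

def latency(x, L_inf, L0, q, x0=0.2):         # Eq. (3); x0 fixed from the amplitude fit
    return L_inf + L0 * 10 ** (-q * (x - x0) / 20)

(alpha, x0), _ = curve_fit(amp_linear, x[x < 15], amp[x < 15], p0=[1.0, 0.0])
sat_params, _ = curve_fit(amp_saturating, x, amp, p0=[alpha, x0, 45.0])
lat_params, _ = curve_fit(latency, x, lat, p0=[100.0, 140.0, 1.0])
```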
3.2 Brainstem Auditory Evoked Potential (BAEP)
Figure 3 shows how the time course of the BAEP changes when the intensity of the 4-kHz tone pulse is reduced from 30 to −3 dB SL1. The response is dominated by wave V, which can be recognized at all intensities. Except for the lowest intensities, waves I and III can be recognized as well, but we refrain from considering these waves in more detail. Figure 4A shows that between 1 and 20 dB SL1 there is a roughly linear relationship between the wave-V amplitude and the intensity in dB, which may again be described by Eq. (1). For x0 we now obtain the value −1.8 dB SL1. Above 20 dB SL1, saturation appears to take effect. Below 1 dB SL1 the response amplitude is larger than predicted by the linear function. This discrepancy is presumably not a measuring error; it appears to indicate that the proposed linear relationship between response amplitude and intensity in dB is not applicable at the limits of audibility.
Fig. 3 Brainstem auditory evoked potential (BAEP) in response to 4-kHz tone pulses (solid lines: ipsilateral recording; dotted lines: contralateral recording). The stimulus intensity was varied between 30 and −3 dB SL1 (sensation level for a single pulse). While BAEP waves I and III can be recognized only at higher intensities, wave V is clearly visible at all intensities. Responses for negative dB values presumably reflect temporal integration (the stimuli were presented not individually, but in a series at intervals of 16 ms)
Fig. 4 Intensity dependence of: A the amplitude; B the latency of BAEP wave V
Figure 5 offers an alternative view of the data, which appears much more conclusive regarding the lowest intensities. The figure is analogous to Fig. 4A, except that the abscissa represents sound pressure rather than intensity in dB. The figure suggests that between −3 and 9 dB SL1 the response amplitude is a linear function of the sound pressure (dashed line). For comparison, the solid line represents a linear relationship between response amplitude and dB SL1 (corresponding to the solid line in Fig. 4A). The sound pressure was normalized so that the value one corresponds to x0. Threshold extrapolation using the dashed line yields a sound pressure of 0.44. This corresponds to about −8.9 dB SL1 (a sound pressure of one corresponds to x0 = −1.8 dB SL1). Figure 4B shows the intensity dependence of the wave V latency. Between 1 and 10–15 dB SL1, the data could be fitted well by a linear function. At higher intensities the latency changes more slowly than predicted by the
linear function. Irregularities at the lowest intensities are possibly due to the fact that the latency estimation becomes increasingly problematic with decreasing response amplitude.
Fig. 5 Amplitude of wave V as a function of sound pressure. The numbers close to the data points indicate the corresponding intensity in dB SL1
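For completeness, the arithmetic behind the extrapolated threshold discussed in connection with Fig. 5 (using the normalization that a sound pressure of one corresponds to x0 = −1.8 dB SL1) is

\[ 20\log_{10}(0.44) \approx -7.1\ \mathrm{dB}, \qquad -1.8\ \mathrm{dB\,SL_1} - 7.1\ \mathrm{dB} \approx -8.9\ \mathrm{dB\,SL_1}. \]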
3.3 Psychoacoustic Threshold: Single Pulse vs Sequence of Pulses
BAEP wave V was clearly visible at −3 dB SL1. However, this does not mean that we measured a subliminal response. While the dB SL1 scale basically refers to an isolated, single stimulus, the BAEPs were measured while presenting a series of stimuli at 16-ms intervals, which gives rise to temporal integration. The sequence was indeed audible at −3 dB SL1, although the sensation was extremely weak. To get an idea of the threshold difference between a single pulse and a series of pulses, supplementary psychoacoustic measurements were done in three subjects. The threshold for two pulses at 16-ms intervals was about −1.6 dB SL1 (roughly consistent with Viemeister and Wakefield 1991), while the threshold for a series of 16 pulses at 16-ms intervals was about −5.5 dB SL1. The latter condition seems to be comparable to the situation during the BAEP measurements.
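A rough consistency check is possible if one assumes the √N improvement of a multiple-looks account (Viemeister and Wakefield 1991) and detection limited by stimulus amplitude; this is only a back-of-the-envelope comparison, not an analysis from the study:

\[ 5\log_{10} 2 \approx 1.5\ \mathrm{dB}, \qquad 5\log_{10} 16 \approx 6.0\ \mathrm{dB}, \]

which is close to the measured threshold improvements of about 1.6 dB (two pulses) and 5.5 dB (sixteen pulses).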
4 Discussion
A logarithmic transformation represents a convenient way to map a continuum spanning many orders of magnitude to a compact scale. This is the reason why the dB scale turned out to be so useful for quantifying physical
measures such as sound intensity. Our findings for both wave V of the BAEP and peak N100m of the AEF suggest that, regarding the representation of low sound intensities, the brain also uses a logarithmic transformation. A roughly linear relationship between the amplitude of the neural response and the logarithm of sound intensity is nothing else than a neurophysiological analogue of the Weber-Fechner law. This does, of course, not mean that our study corroborates the ideas that guided Fechner when deriving the law. Besides that, our data significantly deviate from the law at extremely low intensities (data available only for wave V) and at higher intensities. For wave V, Fig. 5 suggests a more general model. According to that model, the response amplitude is basically a linear function of sound pressure up to intensities of about 9 dB SL1, which is about 18 dB above the threshold that was extrapolated from the response amplitudes. If the intensity increases further, a compressive nonlinearity (similar to Eq. 2) begins to take effect. Taking that model as a basis, a logarithmic scaling analogous to the Weber-Fechner law is merely an approximation which is valid for a relatively limited intensity range. The threshold extrapolated from the sound-pressure dependence of the response amplitude is almost 6 dB lower than the lowest intensity studied, and more than 3 dB lower than the psychoacoustic threshold estimated for a series of pulses at 16-ms intervals. This implies that the response magnitude at threshold is greater than zero. A corresponding view was proposed for loudness (Buus et al. 1998; Moore and Glasberg 2004; Zwislocki 1965).
References
Buus S, Musch H, Florentine M (1998) On loudness at threshold. J Acoust Soc Am 104:399–410
Fletcher H, Munson WA (1933) Loudness, its definition, measurement and calculation. J Acoust Soc Am 5:82–108
Klein JS (2006) Auditorisch evozierte Felder im Bereich der Hörschwelle. Dissertation, Medizinische Fakultät der Westfälischen Wilhelms-Universität, Münster
Levitt H (1971) Transformed up-down methods in psychoacoustics. J Acoust Soc Am 49:467–477
Lütkenhöner B, Steinsträter O (1998) High-precision neuromagnetic study of the functional organization of the human auditory cortex. Audiol Neurootol 3:191–213
Lütkenhöner B, Krumbholz K, Lammertmann C, Seither-Preisler A, Steinsträter O, Patterson RD (2003) Localization of primary auditory cortex in humans by magnetoencephalography. NeuroImage 18:58–66
Lütkenhöner B, Klein JS (2007) Auditory evoked field at threshold. Hear Res 228:188–200
Moore BC, Glasberg BR (2004) A revised model of loudness perception applied to cochlear hearing loss. Hear Res 188:70–88
Relkin EM, Doucet JR (1997) Is loudness simply proportional to the auditory nerve spike count? J Acoust Soc Am 101:2735–2740
Stevens SS (1955) The measurement of loudness. J Acoust Soc Am 27:815–829
Viemeister NF, Wakefield GH (1991) Temporal integration and multiple looks. J Acoust Soc Am 90:858–865
Zwislocki J (1965) Analysis of some auditory characteristics. In: Luce RD, Bush RR, Galanter E (eds) Handbook of mathematical psychology, vol III. Wiley, New York, pp 1–97
Comment to Lütkenhöner (and Langers) by Chait
In both studies, I wonder how much the responses observed are related to the properties of the acoustic environments in which the listeners operated. In general, does it make sense at all to talk about "perceptual loudness" without considering the specific acoustic context? In Langers' fMRI experiment, responses were recorded while listeners were exposed to high intensity machine noise. In Lütkenhöner's experiments, stimuli with different intensities were presented in a randomized manner. Since your stimulus set included mostly low intensity stimuli, and since we know that listeners adjust to the properties of their acoustic environments (e.g. Dean et al. 2005), could it be that the particular stimulus set that you used influenced the responses you measure? Specifically, responses to rare high intensity stimuli might be different from those you might have observed if they were less rare. Similarly, might you have observed different responses to low-intensity stimuli if your mean intensity (across stimuli) was still lower?
References
Dean I, Harper NS, McAlpine D (2005) Neural population coding of sound level adapts to stimulus statistics. Nat Neurosci 8:1684–1689
Reply
Your argument does not apply to our BAEP experiment, where the stimulus intensity was typically kept constant for about 15 min. In the MEG experiment, louder stimuli were indeed rare events compared to near-threshold stimuli, and this fact probably influenced the results to some extent. It is quite conceivable, for example, that a block of stimuli presented at 40 dB SL would have resulted in slightly smaller N100m amplitudes, owing to refractory effects. However, such factors are probably irrelevant near threshold.
Comment by Greenberg
It may be worthwhile to model the variation in magnitude and latency of the brainstem and cortical potentials in terms of excitation patterns rather than stimulus intensity (or other physical parameters). Because there is a systematic shift in latency with frequency as well as with intensity, the two factors may be conflated. This is particularly the case for the cortical M100 MEG response where latency is systematically related to signal frequency
(Greenberg et al. 1998). One way to test this hypothesis would be to collect fine-grained parametric data over a variety of intensities and signal frequencies as a means of deriving frequency-intensity latency functions.
References
Greenberg S, Poeppel D, Roberts T (1998) A space-time theory of pitch and timbre based on cortical expansion of the cochlear traveling wave delay. In: Palmer A, Summerfield Q, Rees A, Meddis R (eds) Psychophysical and physiological advances in hearing. Whurr Publishers, London, pp 293–300
Reply
Only the intensity was varied in our experiments. Thus, the data are not suitable for speculations about the effect of stimulus frequency. Previously, we observed a minimum of the N100m latency at tone frequencies between 500 and 1000 Hz (Lütkenhöner et al. 2003), and regarding the lower frequencies this finding is compatible with your idea of a cortical expansion of the cochlear traveling wave delay. But I would like to sound a note of caution. The N100m arises from multiple cortical sources so that the frequency dependence of the N100m latency does not necessarily have a direct counterpart in the activities of the contributing cortical sources. An activity maximum might be observed, for example, at a time when the activity of one source is still rising while the activity of another source is already falling.
References
Lütkenhöner B, Krumbholz K, Seither-Preisler A (2003) Studies of tonotopy based on wave N100 of the auditory evoked field are problematic. Neuroimage 19:935–949
25 Brain Activation in Relation to Sound Intensity and Loudness DAVE LANGERS1,2, WALTER BACKES3, AND PIM VAN DIJK1,2
1 Introduction
In spite of extensive research, still relatively little is known about how sound is processed by the brain and how various sound attributes are neurally represented. Questions remain even regarding very basic attributes like sound level. In subjects with sensorineural hearing loss, hearing thresholds are elevated over some range of frequencies, due to defects in the inner-ear hair cells that are required to achieve good sensitivity to soft sounds. At the same time, uncomfortable intensity levels may not have changed notably or may even have decreased. The difference between the hearing threshold and the intensity level of uncomfortable loudness is a measure of the input dynamic range of the ear. A reduction of the input dynamic range is a characteristic of sensorineural hearing loss. Besides intensity, loudness is the attribute of subjective auditory sensation in terms of which sounds may be ordered on a scale extending from soft to loud. Despite a reduction of intensity range, the loudness range is not reduced in sensorineural hearing loss. A relatively large rise in loudness can then be generated by a relatively small increase in sound intensity. As a result, the relation between loudness and intensity is modified. This phenomenon is known as loudness recruitment (Oxenham and Bacon 2003) and is usually accompanied by other changes related to perception, e.g. a reduced comprehension of speech. As the input to the brain is often permanently changed in people with sensorineural hearing loss, qualitative modifications in the brain activation characteristics as a function of sound intensity and/or loudness may occur. Hence, people with sensorineural hearing loss are of interest in gaining more knowledge on the cortical representation of sound level and potentially for studying plasticity in the auditory system (McDermott et al. 1998).
Previous functional magnetic resonance imaging (fMRI) studies have systematically investigated sound level dependent brain activation exclusively in normal hearing subjects. It has been shown that the hemodynamic response signal in the auditory cortex increases with increasing sound level (Jäncke et al. 1998). However, due to the hindrance of acoustic scanner noise as well as the limited signal detection power of fMRI, the determination of sound level dependent activation has been limited to high sound intensity levels over 40 dB sound pressure level (SPL), and often more. Using sparse imaging techniques, auditory stimuli can be delivered at low intensities with limited interference from acoustic scanner noise (Hall et al. 1999). Thus, it should be possible to measure brain activation over a much larger intensity range than has yet been done. For these reasons, an fMRI study was performed to characterize brain responses in a broad intensity level range of 0–70 dB above the hearing threshold for subjects with normal hearing, spanning most of the range that is relevant in daily life. Also, activation was studied for an equivalent loudness range in subjects with elevated hearing thresholds. In this report, the magnitude of cortical activation will be described as a function of sound intensity level and equivalent loudness level to provide further insights into the relation between brain activation, the strength of the stimulus, and its percept.
2 Subject Population and Quality of Hearing
Ten normal hearing subjects (eight male, two female; age 22–61 years, mean 37 years) and ten subjects with high-frequency hearing loss (nine male, one female; age 57–70 years, mean 64 years) were included in this study. Pure tone audiometry was performed to characterize the subject population. For the normal hearing subjects, hearing thresholds were determined at 7 ± 5 dB hearing level (HL; mean ± standard deviation), averaged over both ears and over frequencies of 0.25–8.0 kHz. In the hearing impaired subjects, average air conduction thresholds for frequencies of 0.25–1.0 kHz equaled 13 ± 8 dB HL, and 66 ± 11 dB HL for frequencies of 4.0–8.0 kHz. The absolute difference in threshold between the left and right ears averaged over all frequencies was 8 ± 4 dB. Bone conduction thresholds equaled 11 ± 7 dB HL (at 0.25–1.0 kHz) and 59 ± 7 dB HL (at 4.0 kHz; bone conduction thresholds were not determined at 8.0 kHz). Air-bone gaps were smaller than 5 dB on average. Therefore, all impaired subjects suffered from moderately severe symmetrical bilateral sensorineural hearing loss for high frequencies (≥4 kHz), while hearing was normal for low frequencies (≤1 kHz). To characterize loudness perception at low and high frequencies, high-frequency auditory stimuli were matched in loudness to low-frequency stimuli. In this study, stimuli consisted of frequency modulated (FM) tones with 5-Hz sinusoidal modulations in a spectral range of 0.5–1.0 kHz and 4.0–8.0 kHz, respectively. Because all subjects had normal hearing at low frequencies,
the low-frequency FM-tones were used as reference tones, and were presented at levels of 0–70 dB sensation level (SL) above the individually determined threshold in 10-dB increments. In a matching procedure, subjects were asked to adjust the intensity of alternatingly presented high-frequency stimuli until low- and high-frequency tones were perceptually equally loud. The subject-dependent loudness level of the FM-tones was quantified by a loudness scale analogous to the phon scale for pure tone stimuli. For the high-frequency FM-tones, the equivalent loudness level (expressed in dB EL) equaled the intensity level of the low-frequency reference FM-tone with equal loudness (expressed in dB SL). For the low-frequency FM-tones, the equivalent loudness level was by definition equal to the intensity level. For example, if a low-frequency FM-tone at 60 dB above the corresponding individual threshold was perceptually matched in loudness level with a high-frequency FM-tone at 40 dB above threshold, then these tones had an intensity level of 60 and 40 dB SL respectively, while both tones were said to have an equivalent loudness level of 60 dB EL. The relationships between the intensity level and the equivalent loudness level of the high-frequency stimuli according to the results of the loudness matching task are displayed in Fig. 1. For the group of normal hearing subjects, the high-frequency intensity level increased with loudness level by 0.75 ± 0.06 dB SL/dB EL (mean ± standard error). For the impaired subjects this increase was significantly smaller (p<0.01), and equaled 0.55 ± 0.03 dB SL/dB EL. Subjects with impaired hearing therefore showed loudness recruitment at high frequencies, i.e. a disproportionately strong increase in loudness with
Fig. 1 Stimulus intensity level vs loudness level in subjects with normal and impaired hearing. The intensity level above threshold of high-frequency FM-tones is plotted as a function of their equivalent loudness level. In the impaired subjects, the increase in stimulus intensity that is required to evoke a certain rise in perceptual loudness is significantly smaller than in normal hearing subjects. This phenomenon is commonly referred to as loudness recruitment, and is characteristic for sensorineural hearing loss. The gray band indicates the 95% confidence interval of the quadratic polynomial fit through the data points of all subjects collectively
intensity. This indicates a distorted (compressed) loudness perception, which is typical for sensorineural hearing loss.
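To illustrate the dB EL bookkeeping defined above, a small helper can map a high-frequency intensity level onto an equivalent loudness level by interpolating a subject's matching data; the numbers below are invented placeholders, not measured matches.

```python
import numpy as np

# Placeholder matching data for one subject: low-frequency reference levels
# (dB SL, equal to dB EL by definition) and the loudness-matched high-frequency levels (dB SL).
reference_db_el = np.arange(0, 71, 10)
matched_hf_db_sl = np.array([0.0, 6.0, 12.0, 17.0, 23.0, 28.0, 34.0, 39.0])

def equivalent_loudness_level(hf_level_db_sl):
    """dB EL of a high-frequency tone = level of the equally loud low-frequency tone."""
    return np.interp(hf_level_db_sl, matched_hf_db_sl, reference_db_el)
```

With the placeholder values above, equivalent_loudness_level(23.0) returns 40.0.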
3 Brain Activation
Functional MRI was performed on a 1.5-T clinical MR-system using a ‘sparse’ acquisition paradigm to overcome the influence of acoustic scanner noise (Hall et al. 1999). Functional scans were acquired in a volume covering the superior surface of the temporal lobe, and consisted of a dynamic series of 2.5-s single-shot T2*-sensitive echo planar imaging (EPI) acquisitions at 10.0-s intervals. In the 7.5-s silent intervals between scans, 4.0-s FM-tone fragments were presented with a loudness of 0–70 dB EL. The functional image volumes were corrected for motion and drift, and spatially smoothed. For each voxel, the fMRI blood oxygenation level dependent (BOLD) signal across all acquisitions was correlated with the stimulus loudness level. For every subject the 100 voxels with the most significant positive correlation coefficients were selected to form a reference set of voxels that responded most strongly to the presented stimuli. These were mainly located in the auditory cortices in the temporal lobes (Fig. 2). Per subject, signals were averaged over this set of voxels and over all acquisitions corresponding with a certain stimulus condition to obtain average signal levels for each of the stimulus conditions (with regard to stimulus frequency and sound level).
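A minimal sketch of the voxel-selection and averaging steps just described, assuming a motion- and drift-corrected 4-D BOLD array and a per-acquisition loudness regressor; all names, shapes, and the correlation details are placeholders rather than the actual analysis pipeline.

```python
import numpy as np

def strongest_voxels(bold, loudness, n_select=100):
    """Correlate each voxel's time series with the loudness level and keep the
    n_select voxels with the most positive correlation (the 'reference set')."""
    n_acq = bold.shape[-1]
    ts = bold.reshape(-1, n_acq)                        # voxels x acquisitions
    z_ts = (ts - ts.mean(1, keepdims=True)) / (ts.std(1, keepdims=True) + 1e-12)
    z_ld = (loudness - loudness.mean()) / loudness.std()
    r = z_ts @ z_ld / n_acq                             # Pearson correlation per voxel
    return np.argsort(r)[-n_select:]

def condition_means(bold, conditions, voxels):
    """Average the selected voxels over all acquisitions of each stimulus condition."""
    ts = bold.reshape(-1, bold.shape[-1])[voxels]
    mean_ts = ts.mean(axis=0)
    return {c: mean_ts[conditions == c].mean() for c in np.unique(conditions)}
```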
3.1 Low Frequency Activation
Figure 3 displays the subjects' average activation as a function of the low-frequency stimulus level. For this stimulus frequency, the intensity level and loudness level were equal by definition.
Fig. 2 Distribution and density of the 100 most active voxels in all subjects, projected and overlaid on an anatomical reference image. In general, the bilateral auditory cortices were activated most strongly
Fig. 3 Activation to low-frequency FM-tones as a function of stimulus intensity level or, equivalently, loudness level. The gray band indicates the 95% confidence interval of the quadratic polynomial fit through the data points of all subjects collectively
Table 1 The rate of increase in brain activation (mean ± standard error [10⁻³ %/dB]) in subjects with normal or impaired hearing, as a function of the physical intensity level (in dB SL) or perceptual loudness level (in dB EL) of the two stimulus types. The hypothesis that both subject groups show equal increase in activation was tested using an independent samples T-test, and rejected only for the high-frequency stimuli when calculated as a function of sound intensity level

Stimulus frequency   Sound level measure   Normal hearing   Impaired hearing   p(equal)
0.5–1.0 kHz          SL, EL                24.3 ± 3.3       28.8 ± 2.0         0.25
4.0–8.0 kHz          SL                    20.5 ± 2.3       37.2 ± 4.8         0.005
4.0–8.0 kHz          EL                    14.8 ± 1.7       20.4 ± 2.7         0.10
In all individual subjects the activation increased significantly with the stimulus level (p<0.05). The average rates of increase in activation are listed in Table 1 for both subject groups. The increase rates did not differ significantly between the two groups.
3.2 High Frequency Activation
The activation levels in response to the high-frequency stimuli are shown in Fig. 4.
Fig. 4 Activation to high-frequency FM-tones as a function of: a physical stimulus intensity level; b perceptual stimulus loudness level. Differences between the two subject groups were significant as a function of intensity level, but not as a function of loudness level. The gray band indicates the 95% confidence interval of the quadratic polynomial fit through the data points of all subjects collectively
Again, in all individual subjects the activation increased significantly (p<0.05) with stimulus intensity level (Fig. 4a). Moreover, in the impaired subjects the activation increased significantly more strongly as a function of stimulus intensity level than in the normal hearing subjects (Table 1). The observed difference in brain activation between the two subject groups may be a direct reflection of the loudness recruitment phenomenon. To test this hypothesis, Fig. 4b displays the activation levels as a function of the equivalent loudness level. In all individual subjects the activation increased significantly with loudness level (p<0.05). However,
in contrast with the findings as a function of intensity level, the difference in the activation increase rate between both groups if calculated as a function of loudness level was not significant (Table 1).
4 Discussion and Conclusions
In summary, we found that the loudness level of stimuli, in contrast with intensity level, related strongly to cortical brain activation even for groups of subjects with vastly different sound perception. For the low-frequency stimuli, both groups of subjects displayed normal hearing thresholds and the fMRI response level did not show a significant difference in the rate of increase with sound level. However, for the high-frequency stimuli, hearing thresholds in the impaired subjects were worse than those in subjects with normal hearing. In addition, loudness recruitment was observed, as the equivalent loudness of high-frequency stimuli increased more strongly with intensity in impaired subjects than in normally hearing subjects. We also found that the cortical activation increased more strongly with intensity in impaired subjects than in normally hearing subjects. This suggests that the cortical activation level reflects stimulus loudness more closely than stimulus intensity. Indeed, in spite of the severely disturbed perception in the impaired subjects, the increase in cortical activation was not significantly different between both subject groups if expressed as a function of loudness. While loudness recruitment is a symptom that is commonly associated with inner ear impairment, the corresponding neural mechanisms are poorly understood. It has been reported that the afferent signals in the auditory nerve fibers do not provide a simple representation of the excitation of the basilar membrane in people that display loudness recruitment (Heinz and Young 2004). Therefore, it remains unclear how the loudness percept is generated by the brain from available input signals, especially in pathological conditions. The present study is the first to document the cortical responses related to loudness recruitment in humans using fMRI. Previous magnetoencephalography (MEG) studies have suggested that brain activity increases abnormally quickly with stimulus intensity in individuals with loudness recruitment (Morita et al. 2003). Our results agree with such findings and confirm that cortical activity is more closely related to the perceptual loudness level of sound than to its intensity level. This suggests that fMRI activation can be interpreted as a correlate of the subjective strength of the stimulus percept. In contrast with suggestions that brain activation reflects the physical stimulus attributes in primary sensory cortices and relates to the stimulus percept only at higher levels of processing in the frontal cortices (de Lafuente and Romo 2005), we found that activation correlates well with perceptual attributes already at the level of the auditory cortices in the temporal lobes.
References
de Lafuente V, Romo R (2005) Neuronal correlates of subjective sensory experience. Nat Neurosci 8:1698–1703
Hall DA, Haggard MP, Akeroyd MA, Palmer AR, Summerfield AQ, Elliott MR, Gurney EM, Bowtell RW (1999) "Sparse" temporal sampling in auditory fMRI. Hum Brain Mapp 7:213–223
Heinz MG, Young ED (2004) Response growth with sound level in auditory-nerve fibers after noise-induced hearing loss. J Neurophysiol 91:784–795
Jäncke L, Shah NJ, Posse S, Grosse Ryuken M, Muller Gartner HW (1998) Intensity coding of auditory stimuli: an fMRI study. Neuropsychologia 36:875–883
McDermott HJ, Lech M, Kornblum MS, Irvine DR (1998) Loudness perception and frequency discrimination in subjects with steeply sloping hearing loss: possible correlates of neural plasticity. J Acoust Soc Am 104:2314–2325
Morita T, Naito Y, Nagamine T, Fujiki N, Shibasaki H, Ito J (2003) Enhanced activation of the auditory cortex in patients with inner-ear hearing impairment: a magnetoencephalographic study. Clin Neurophysiol 114:851–859
Oxenham AJ, Bacon SP (2003) Cochlear compression: perceptual measures and implications for normal and impaired hearing. Ear Hear 24:352–366
Comment to (Lütkenhöner and) Langers by Chait
In both studies, I wonder how much the responses observed are related to the properties of the acoustic environments in which the listeners operated. In general, does it make sense at all to talk about "perceptual loudness" without considering the specific acoustic context? In Langers' fMRI experiment, responses were recorded while listeners were exposed to high intensity machine noise. In Lütkenhöner's experiments, stimuli with different intensities were presented in a randomized manner. Since your stimulus set included mostly low intensity stimuli, and since we know that listeners adjust to the properties of their acoustic environments (e.g. Dean et al. 2005), could it be that the particular stimulus set that you used influenced the responses you measure? Specifically, responses to rare high intensity stimuli might be different from those you might have observed if they were less rare. Similarly, might you have observed different responses to low-intensity stimuli if your mean intensity (across stimuli) was still lower?
References
Dean I, Harper NS, McAlpine D (2005) Neural population coding of sound level adapts to stimulus statistics. Nat Neurosci 8:1684–1689
Reply by Langers
I fully agree that the loudness percept will depend on the context, with regard to acoustic aspects (e.g. background noise) and possibly also with regard to other
aspects (e.g. subject alertness). Although, in this study, listeners were exposed to high intensity machine noise, a sparse acquisition paradigm was used such that stimuli were presented during long periods of silence (8 s) between consecutive scans, limiting forward/backward masking effects to negligible levels. However, other sources of ambient noise were inevitably present (e.g., a helium pump) that can indeed have affected the perceived loudness. In fact, stimuli at threshold (as determined in silence) were reported to be completely imperceptible in the noisy MR-environment. Still, the 10-dB stimuli were clearly audible. Also, the loudness of both low and high frequency stimuli will be affected by the presence of background noise, such that the resulting mismatch in loudness between the stimulus pairs in this experiment due to the difference in acoustic environment will likely be limited to values well below 10 dB; the higher intensity stimuli are expected to be affected less than that. In comparison, the reduction in dynamic range of intensities related to loudness recruitment in the included patients was much larger. In summary, in my opinion this comment is certainly justified, but in practice the conclusion that fMRI brain activation more closely reflects stimulus loudness than stimulus intensity will remain valid.
Comment by Verhey
In your fMRI study you showed a more or less linear relation between activation and perceived loudness. You found some deviations at very low and at very high levels. Could this deviation be a consequence of the loudness scale used in the study? The perceived loudness was expressed on a scale which is essentially a phon scale. Would the authors expect a different relation (maybe even linear over the whole level range) if they used a different loudness scale such as the sone scale?
Reply
Given that the non-linearities in the fMRI activation level as a function of stimulus intensity level had a negative sign when significant, with the strongest effects occurring near threshold, and given that the sone scale shows similar characteristics, the deviations from linearity in brain activation should be expected to become smaller when expressed as a function of loudness in sones, as compared to a phon-like scale. This is especially the case for the low-frequency data, for which non-linearities and threshold effects were strongest. In addition, some of the variability between subjects could possibly be accounted for, if there is a corresponding variability in loudness judgment (in sones). However, although our data indicate that brain activation is more closely related to perceived stimulus levels (i.e., loudness measures) than to stimulus presentation levels (i.e., intensity measures), the variance in the activation data is too large to assess whether fMRI brain activation levels are a better neural correlate for either phon-based loudness or sone-based loudness scales.
26 Duration Dependency of Spectral Loudness Summation, Measured with Three Different Experimental Procedures MAARTEN F.B. VAN BEURDEN AND WOUTER A. DRESCHLER
1 Introduction
Many studies have investigated loudness perception in normal hearing and hearing impaired subjects. In these studies different measurement procedures have been applied. In loudness matching a subject has to compare the loudness of a target signal to the loudness of a reference signal at a certain level. In loudness scaling a subject has to judge the loudness of a single signal on a particular scale for a set of signal levels. Specific advantages of the measuring procedures are that loudness matching is an accurate procedure, while loudness scaling is more appropriate when the loudness perception of a large range of levels is of interest. In this study we describe the results of two loudness matching procedures and a loudness scaling procedure in an experiment to determine the time dependency of loudness summation. Verhey and Kollmeier (2002) showed with a loudness matching procedure that loudness summation depends on the duration of a signal, with shorter durations leading to more spectral loudness summation. This effect was investigated in more detail using three different experimental test procedures. The first two procedures used loudness matching, with a more traditional and a more experimental response task, the latter aiming at improved accuracy. The third procedure was loudness scaling, designed to investigate a larger range of levels.
2
Methods
2.1
Stimuli and Apparatus
In all procedures a computer controlled the stimulus generation, registered the subjects’ responses and executed the adaptive procedure. In the loudness matching procedures all stimuli were generated in Matlab with a sampling rate of 20 kHz. The stimuli were converted from digital to
[email protected],
[email protected] Hearing – From Sensory Processing to Perception B. Kollmeier, G. Klump, V. Hohmann, U. Langemann, M. Mauermann, S. Uppenkamp, and J. Verhey (Eds.) © Springer-Verlag Berlin Heidelberg 2007
238
M.F.B. van Beurden and W.A. Dreschler
analogue by a D/A converter (TDT DA 3-2) and low pass filtered at 8 kHz (TDT FT6). The output of the low pass filter was attenuated by a programmable attenuator (TDT PA4), led to a headphone buffer (TDT HB6), and presented monaurally via headphones (TDH 39). In the loudness scaling procedure the stimuli were generated in Matlab with a sampling rate of 44.1 kHz. The stimuli were played by an Echo Audio Gina sound card, led to a headphone buffer (TDT HB6), and presented monaurally via headphones (TDH 39). All noises were low-noise noise (LNN) with a peak factor, defined as W = ⟨x⁴⟩/⟨x²⟩² (with ⟨·⟩ denoting the time average), of approximately 1.7 for each bandwidth applied in the experiments. The noises were generated from pink noise with a method similar to the method described by Kohlrausch et al. (1997). Besides restricting the bandwidth by zeroing the components in the power spectrum outside the original bandwidth, the noise was created by performing an appropriate amplitude transformation. The entire procedure provided a pink noise with a well-defined bandwidth. The noises were gated with a raised-cosine rise and fall of 6.67 ms. The nominal duration of such a rise and fall is 1.67 ms shorter than the duration between the half-amplitude points and thus amounts to 5.0 ms. The intensity level of the reference signal was roved between 54 dB SPL and 66 dB SPL. The test and reference signals were band-limited noise signals geometrically centered around 2000 Hz. In the loudness matching procedures the reference signal had a bandwidth of 800 Hz and the test signals had bandwidths of 1600, 3200, and 6400 Hz. Two durations were measured: 25 and 1000 ms. In the loudness scaling procedure no reference bandwidth was needed and test signals had bandwidths of 400, 3200 and 6400 Hz. Two durations were measured: 25 and 400 ms. The calibration of all signals was based on the long-term rms level of each signal measured in dB SPL. Sound pressure levels were measured using the artificial ear B&K 4153 and the sound level meter B&K 2260 Investigator.
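To make the construction more concrete, the following is a minimal Python sketch of such an iterative low-noise-noise generation (band restriction by zeroing spectral components, followed by an amplitude transformation); the function name, the use of the Hilbert envelope as the amplitude transformation, the iteration count and the example band edges are illustrative assumptions and not the authors' implementation.

import numpy as np
from scipy.signal import hilbert

def low_noise_noise(fs, dur, f_lo, f_hi, n_iter=10, seed=0):
    """Sketch of an LNN generator: restrict the band, flatten the envelope, iterate."""
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    # Start from band-limited pink noise: 1/sqrt(f) magnitude with random phases.
    mag = np.zeros(freqs.size)
    mag[band] = 1.0 / np.sqrt(np.maximum(freqs[band], 1.0))
    x = np.fft.irfft(mag * np.exp(1j * rng.uniform(0, 2 * np.pi, freqs.size)), n)
    for _ in range(n_iter):
        env = np.abs(hilbert(x))          # amplitude transformation: divide by the envelope
        x = x / np.maximum(env, 1e-12)
        spec = np.fft.rfft(x)             # restore the band limits by zeroing outside components
        spec[~band] = 0.0
        x = np.fft.irfft(spec, n)
    return x / np.sqrt(np.mean(x ** 2))   # unit rms; scale to the desired dB SPL afterwards

# Example: roughly the 800-Hz-wide reference band geometrically centered on 2 kHz.
x = low_noise_noise(fs=20000, dur=1.0, f_lo=1625, f_hi=2425)
W = np.mean(x ** 4) / np.mean(x ** 2) ** 2   # peak factor, to compare with the ~1.7 quoted above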
2.2
Procedures
The first loudness matching procedure (called matching 1) was an adaptive two-interval, two-alternative forced choice procedure similar to the procedure used by Verhey and Kollmeier (2002). In each trial the subject heard two sounds, a reference signal and a test signal, separated by a 400-ms silent interval. Test and reference signals were presented in random order and with equal a priori probability. The listeners indicated which signal was louder by pressing a button on a two-button console. A simple one-up one-down procedure was used, which converges at the 50% point of the psychometric function. The initial step size of 4 dB was decreased to 2 dB after the second reversal in the adaptive tracking procedure and held constant for the next eight reversals. To reduce biases several interleaved tracks were used.
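As a rough illustration of this adaptive logic (a sketch under simplifying assumptions, not the authors' code; the listener model, stopping rule and averaging rule are placeholders), a single one-up one-down track could look like this in Python:

import math
import random

def one_up_one_down(respond_louder, start_level=0.0, initial_step=4.0,
                    final_step=2.0, step_change_after=2, max_reversals=10):
    """One-up one-down track converging on the 50% point of the psychometric function.

    respond_louder(level_diff_db) returns True if the test signal was judged louder
    than the reference at the given test-minus-reference level difference (dB)."""
    level, step = start_level, initial_step
    reversals, last_direction = [], None
    while len(reversals) < max_reversals:
        direction = -1 if respond_louder(level) else +1     # judged louder -> lower the test level
        if last_direction is not None and direction != last_direction:
            reversals.append(level)                         # a reversal of the track direction
            if len(reversals) == step_change_after:
                step = final_step                           # 4 dB -> 2 dB after the second reversal
        level += direction * step
        last_direction = direction
    late = reversals[step_change_after:]                    # reversals obtained at the final step size
    return sum(late) / len(late)                            # estimate of the equal-loudness level

# Hypothetical listener whose point of subjective equality lies at -3 dB.
def fake_listener(level_db, pse=-3.0, slope=1.0):
    return random.random() < 1.0 / (1.0 + math.exp(-(level_db - pse) / slope))

print(one_up_one_down(fake_listener))

In the actual experiments several such tracks were run interleaved, which this single-track sketch does not capture.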
The second loudness matching procedure (called matching 2) is a variation on the first procedure (van Beurden and Dreschler 2005) and was intended to increase the accuracy of the loudness matching procedure. Matching is a very difficult task, especially around the equal loudness point. This procedure was designed to make matching close to the equal loudness point easier by changing the task from indicating the louder of two signals to discriminating whether or not there is a loudness difference in a signal pair. Instead of comparing the loudness of two signals, the task is to compare the loudness differences of two sound pairs. In each trial the subject heard two pairs of sounds, each pair consisting of a reference signal and a test signal, separated by 400 ms. The two sound pairs were separated by 800 ms. The reference signal had the same level in both pairs, but – at random – one of the two test signals had a level increase of 2 dB (this value is just above the just noticeable difference; see Ozimek and Zwislocki 1996). In each stimulus presentation the position of test and reference was randomized, but the order was the same for both sound pairs in a single presentation. Listeners indicated in which sound pair the loudness difference was larger by pressing a button. A simple one-up one-down procedure was used, converging to the 50% point of the psychometric function. If a listener indicated that the interval containing the intensity increase had the greater loudness difference, the levels of the test signals were decreased – otherwise they were increased. All starting levels were chosen randomly from a set of levels ranging from −6 to 6 dB with respect to the level of the reference signal. The initial step size of 4 dB was decreased to 2 dB after the second reversal in the adaptive procedure and held constant for the next eight reversals. A reversal was defined as a change in choice of the interval with the greater loudness difference, between the interval with and the interval without the 2-dB level increase. The level difference between test and reference signal yielding the same loudness was determined by calculating the average of the levels within the uncertainty region for the last six reversals (Fig. 1). Randomization of the position of the 2-dB level increase ensured that subjects were not able to follow the adaptive procedure. We expected this task to be less sensitive to a shift towards the comfortable loudness level and to ignoring the fixed reference sound, compared to the task in the conventional procedure. Because of this assumption, and because presentation of one track at a time lets the subjects better focus their attention on the small loudness differences of the signals under consideration, no interleaved tracks were applied in this procedure. Roving was applied to help the subject to focus on loudness and to ignore other differences such as pitch. The loudness scaling procedure used was the Oldenburg-Adaptive CAtegorical LOudness Scaling (ACALOS) procedure designed by Brand and Hohmann (2002). This is a loudness scaling procedure with 11 response categories: 5 named categories, 4 un-named intermediate categories and 2 limiting categories, which correspond to categorical loudness levels from 0 to 50. The level assigned to a given loudness category x is termed the “categorical loudness level” Lx.
Fig. 1 An example of an outcome of the loudness difference procedure. At each turning point the level differences between the variable and the variable +2 dB increase re. the reference are shown. The estimated equal loudness level is constructed from the level differences at the upper turning points of the variable and at the lower turning points of the variable +2 dB increase. The dotted lines represent the assumed loudness uncertainty region
The procedure consists of two phases. In the first phase the limits of the auditory range are estimated by an interleaved ascending and descending stimulus sequence. In the second phase the four named intermediate categorical loudness levels are estimated. This last phase consists of two blocks. In the first block the four named intermediate categorical loudness levels are estimated by linear interpolation between the two limits of the auditory range, which are the values at L5 (very soft) and L50 (too loud). In the second block the named intermediate categorical loudness levels are estimated by a modified least-squares fit of a linear model function. In this study three iterations of the final block were applied. In the analysis each ACALOS measurement was fitted with a model function consisting of two linear parts with independent slopes and a free cut-point. This model function is slightly different from the model function applied by Brand and Hohmann (2002), because their model function had a fixed cut-point at 25 CU.
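One simple way to fit such a two-segment linear function with a free cut-point (a sketch, not the fitting routine actually used in the study) is a grid search over candidate cut-points combined with a least-squares fit of both slopes and the shared value at the cut-point:

import numpy as np

def fit_two_segment(levels_db, loudness_cu, n_candidates=50):
    """Fit loudness (CU) vs level (dB) with two lines joined at a free cut-point.

    Returns (cut_level, cu_at_cut, slope_below, slope_above) minimizing squared error."""
    L = np.asarray(levels_db, float)
    cu = np.asarray(loudness_cu, float)
    best = None
    for cut in np.linspace(L.min() + 1.0, L.max() - 1.0, n_candidates):
        lo = L <= cut
        hi = ~lo
        if lo.sum() < 2 or hi.sum() < 2:
            continue
        # Design matrix: common value at the cut-point, separate slopes below and above it.
        A = np.zeros((L.size, 3))
        A[:, 0] = 1.0
        A[lo, 1] = L[lo] - cut
        A[hi, 2] = L[hi] - cut
        coef, *_ = np.linalg.lstsq(A, cu, rcond=None)
        err = float(np.sum((A @ coef - cu) ** 2))
        if best is None or err < best[0]:
            best = (err, cut, coef)
    _, cut, (cu_at_cut, slope_lo, slope_hi) = best
    return cut, cu_at_cut, slope_lo, slope_hi

# The free cut-point found this way can be compared with the fixed 25-CU
# cut-point of Brand and Hohmann (2002).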
2.3
Subjects
In the loudness matching experiments nine normal-hearing subjects (four male, five female) participated. The age of the subjects ranged from 18 to 34 years. Two of the subjects were members of the Audiology department; the
other subjects were paid volunteers without previous experience with loudness matching experiments. In the loudness scaling experiment 12 other normal-hearing subjects (5 male, 7 female) aged 18 to 36 years participated. The subjects were paid volunteers without previous experience with loudness scaling experiments. All subjects had auditory thresholds <15 dB HL and no previous history of any hearing problems.
3
Results
Figure 2 shows the results of matching procedure 1 and matching procedure 2 at a center frequency of 2000 Hz and with a reference bandwidth of 800 Hz. The figures show the differences between the level of the test signal and the reference signal (∆L) at equal loudness as a function of the bandwidth of the test signal. A negative level difference means that the test signal needs a lower level to be judged as equally loud as the reference signal. Signal durations were 25 ms (circles) and 1000 ms (squares). The error bars indicate plus and minus one standard error of the mean. The results of the loudness scaling procedure are presented in Fig. 3. Each ACALOS measurement was fitted with a model function consisting of two linear parts with independent slopes and a free cut-point. Therefore, each fit was characterized by four parameters. The fits shown are based on the average of the parameters across subjects, where the parameters per subject are based on all points obtained in the three tests.
Fig. 2 Results of the two procedures at 25 ms (circles) and 1000 ms (squares)
Fig. 3 Results of loudness scaling for 25 ms and 400 ms signals of different bandwidths
The figure shows that:
1. The slopes of the higher-intensity part are usually steeper than those of the low-intensity part. This effect is found for all signal bandwidths and both signal durations.
2. The low-intensity slope is less steep for the 25-ms signals.
3. Furthermore, the loudness curves are ordered according to bandwidth, with a larger bandwidth leading to a higher loudness at the same level. This is what is expected from spectral loudness summation.
4. Finally, a comparison of both figures shows that corresponding levels yield a higher loudness level for the 400-ms signals than for the 25-ms signals, as would be expected from temporal integration.
In order to obtain a similar parameter of loudness summation from the loudness scaling data, we calculated for each of the stimuli the level difference relative to the level of the reference signal (60 dB) needed to obtain the same loudness as the 400-Hz wide reference signal. Table 1 shows the calculated summation data. At 25 ms the loudness of this signal is 11.6 CU and at 400 ms the loudness is 14.5 CU.
Table 1 Spectral loudness summation difference between short and long duration signals in dB SPL

                                    1600 Hz    3200 Hz    6400 Hz
Summation difference Matching 1       0.64       2.47       2.31
Summation difference Matching 2      −0.14       1.23       2.44
Summation difference Scaling         −1.58       2.75       1.14
So, there is a slight loudness difference between the reference signals at different durations. Moreover, whereas the loudness scaling data used the 400-Hz wide stimulus as a reference, the loudness matching procedures used an 800-Hz wide reference stimulus. This makes the data of the loudness matching not directly comparable to the data of the loudness scaling.
4
Discussion
Although the matching procedures and the scaling procedure have been conducted with slightly different stimuli, the same trends can be observed. First of all, in all three procedures spectral loudness summation is larger for short signals than for long signals. This corresponds well with the findings of Verhey and Kollmeier (2002) and Chalupper (2002). The fact that duration-dependent spectral loudness summation has been found in three different measuring procedures provides extra support for the existence of this effect and excludes possible artifacts due to the measurement procedure. The amount of spectral loudness summation difference depends on the amount of summation, which is in agreement with the results of Brand and Hohmann (2002). The duration-dependency of spectral loudness summation is small when the loudness summation is small. As loudness summation increases, the loudness summation difference also increases. However, the maximum amount of loudness summation difference between short and long signals seems limited. In all three procedures the amount of loudness summation at a bandwidth of 3200 Hz and 6400 Hz is approximately the same. A further investigation with even broader bandwidths is needed to confirm the observation that a ceiling effect may be present. It would be interesting to determine the “critical” bandwidth at which the summation difference between long and short signals reaches the maximum value. There are also differences between the procedures, especially with respect to the amount of summation found. This is probably a consequence of procedural differences. In the second matching procedure we assumed that interleaving of the different conditions was not necessary. Verhey (1999) found that an adaptive procedure with interleaved tracks leads to larger loudness summation. The differences we found between matching procedure 1 and matching procedure 2 correspond to the differences found between an interleaved and a non-interleaved procedure. The long duration condition of the scaling procedure was conducted with a 400-ms signal instead of a 1000-ms signal. The influence of this difference in signal duration may be expected to be negligible, as the effect of temporal loudness integration is thought to be limited to approximately 200 ms. The scaling procedure is much less sensitive than the two matching procedures at one specific loudness. The results depend heavily on the definition of the fitting curve. Nevertheless, the results correspond reasonably well with the matching results and also give a hint of the level dependency of the effect. At low levels there is almost no spectral loudness summation and
therefore the summation difference is also very small. Around the cut-point of the fitting curve, which seems to lie at the lower side of the most comfortable loudness region, both the spectral summation and the summation difference are largest. At higher levels they tend to decrease again.
5
Summary and Conclusion
This study shows, with three different measuring procedures, that spectral loudness summation is larger at short signal durations. Although the amount of summation differs between the procedures, the summation difference is approximately the same. Our data show a possible ceiling effect in the amount of spectral loudness summation differences between short and long signals. Further research is needed to investigate the effect of bandwidth on the loudness summation difference between short and long signals. An adapted version of the model of loudness applicable to time-varying sounds (Glasberg and Moore 2002) that increases the loudness of short signals at low levels has proven to model these effects reasonably well.
References
Brand T, Hohmann V (2002) An adaptive procedure for categorical loudness scaling. J Acoust Soc Am 112:1597–1604
Chalupper J (2002) Perzeptive Folgen von Innenschwerhörigkeit: Modellierung, Simulation und Rehabilitation. Shaker, Aachen
Glasberg BR, Moore BCJ (2002) A model of loudness applicable to time-varying sounds. J Audio Eng Soc 50:331–342
Kohlrausch A, Fassel R, van der Heijden M, Kortekaas R, van de Par S, Oxenham AJ, Püschel D (1997) Detection of tones in low-noise noise: further evidence for the role of envelope fluctuations. Acust Acta Acust 83:659–669
Ozimek E, Zwislocki JJ (1996) Relationships of intensity discrimination to sensation and loudness levels: dependence on sound frequency. J Acoust Soc Am 100:3304–3320
van Beurden MFB, Dreschler WA (2005) Bandwidth dependency of loudness in series of short noise bursts. Acust Acta Acust 91:1020–1024
Verhey JL (1999) Psychoacoustics of spectro-temporal effects in masking and loudness perception. PhD thesis
Verhey JL, Kollmeier B (2002) Spectral loudness summation as a function of duration. J Acoust Soc Am 111:1349–1358
Comment by Verhey
In your talk you presented a model that predicts the duration-dependent spectral loudness summation as reported in, e.g., Verhey and Kollmeier (2002, JASA 111, 1349–1358). The model was based on the assumption of a
larger gain applied to short signals than to long signals at low levels. Such a mechanism results in a higher compression at the medium to high levels. How does this assumption relate to Epstein and Florentine (2005a, b) and Anweiler and Verhey (2006), who showed that the loudness function of the short signals is essentially a vertically (downward) shifted version of the loudness function of the long signals? Such a vertical shift produces the same slope for different durations at the same intensity, i.e., no change in compression with duration.
References
Epstein M, Florentine M (2005a) Inferring basilar-membrane motion from tone-burst otoacoustic emissions and psychoacoustic measurements. J Acoust Soc Am 117:263–274
Epstein M, Florentine M (2005b) A test of the equal-loudness-ratio hypothesis using cross-modality matching functions. J Acoust Soc Am 118:907–913
Anweiler AK, Verhey JL (2006) Spectral loudness summation for short and long signals as a function of level. J Acoust Soc Am 119:2919–2928
Reply
First of all, an equal level difference at equal loudness for short and long signals over all levels is indeed in disagreement with my hypothesis. A small decrease in level difference at equal loudness is expected at low levels. In fact, the data of Epstein and Florentine (2005b) are somewhat ambiguous, and a small decrease in level difference at equal loudness can be seen. If the lowest point is neglected, a decrease in loudness ratio at low levels is found. The ambiguity can also be seen in the individual data, in which some subjects seem to show a clear level difference decrease at low levels (L5, L6, L9). In that case no contradiction between the adapted model and their data is present. The magnitude estimation data of Epstein and Florentine (2005a) show a clear increase in level difference at equal loudness at low levels, which is clearly in contrast with my hypothesis. But here the group data appear to be heavily influenced by subject L3, and the individual data again also show subjects with a decrease in level difference at low levels (L5, L6). The scaling data from Anweiler and Verhey (2006) and the present paper are not accurate enough at low levels to make well-founded statements. Therefore I think it is not possible to say whether these data agree or disagree with my hypothesis. Unfortunately Epstein and Florentine (2005a, b) have not presented data for broadband signals. For such signals the model predicts a larger decrease in level difference at low levels than for narrowband signals, which should be easier to measure.
27
The Correlative Brain: A Stream Segregation Model
MOUNYA ELHILALI AND SHIHAB SHAMMA
Institute for Systems Research & Department of Electrical and Computer Engineering, University of Maryland, College Park MD, USA; [email protected], [email protected]
1
Introduction
The question of how everyday cluttered acoustic environments are parsed by the auditory system into separate streams is one of the most fundamental in perceptual science. Despite its importance, the study of its underlying neural mechanisms remains in its infancy, with a lack of general frameworks to account for both psychoacoustic and physiological experimental findings. Consequently, the few attempts at developing computational models of auditory stream segregation remain highly speculative. This in turn has considerably hindered the development of such capabilities in engineering systems such as automatic speech recognition or sophisticated interfaces for communication aids (hearing aids, cochlear implants, speech-based human-computer interfaces). In the current work, we present a mathematical model of auditory stream segregation, which accounts for both perceptual and neuronal findings of scene analysis. By closely coordinating with ongoing perceptual and physiological experiments, the proposed computational approach provides a rigorous framework for facilitating the integration of these results in a mathematical scheme of stream segregation, for developing effective algorithmic implementations to tackle the “cocktail party problem” in engineering applications, as well as for generating new hypotheses to better understand the neural basis of active listening.
2
Framework and Foundation
2.1
Premise of the Model
Numerous studies have attempted to reveal the perceptual cues necessary and/or sufficient for sound segregation. Researchers have identified frequency separation, harmonicity, onset/offset synchrony, amplitude and
frequency modulations, sound timbre and spatial location as the most prominent candidates for grouping cues in auditory streaming (Cooke and Ellis 2001). It is, however, becoming more evident that any sufficiently salient perceptual difference along any auditory dimension (at the periphery or central auditory stages) may lead to stream segregation. On the biophysical level, our knowledge of neural properties particularly in the auditory cortex indicates that cortical responses (Spectro-Temporal Receptive Fields, STRFs) exhibit elaborate selectivity to spectral shapes, symmetry and dynamics of sound (Kowalski et al. 1996; Miller et al. 2002). This intricate mapping of acoustic waveforms into a multidimensional space suggests a role of the cortical circuitry in representing sounds in terms of auditory objects (Nelken 2004). Moreover, this organizational role is supported by the correspondence between time scales of cortical processing and the temporal dynamics of stream formation and auditory grouping. In this study, we formalize these principles in a computational scheme that emphasizes two critical stages of stream segregation: (1) mapping sounds into a multi-dimensional feature space; (2) organizing sound features into temporally coherent streams. The first stage captures the mapping of acoustic patterns onto multiple auditory dimensions (tonotopic frequency, spectral timbre and bandwidth, harmonicity and common onsets). In this mapping, acoustic elements that evoke sufficiently non-overlapping activity patterns in the multi-dimensional representation space are deemed perceptually distinguishable and hence may potentially form distinct streams. We assume that these features are rapidly extracted and hence this mapping simulates “instantaneous” organization of sound elements (over short time windows; e.g. <200 ms), thus evoking the notion of simultaneous auditory grouping processes (Bregman 1990). The second stage simulates the sequential nature of stream segregation. It highlights the principle that sound elements belonging to the same stream tend to evolve together in time. Conversely, temporally uncorrelated features are an indication of multiple streams or a disorganized acoustic scene. Identifying temporal coherence among multiple sequences of features requires integration of information over relatively long time periods (e.g. >300 ms), consistent with known dynamics of streaming-buildup. Therefore, the current model postulates that grouping features according to their levels of temporal coherence is a viable organizing principle underlying cortical mechanisms in sound segregation.
2.2
Stage 1: Multi-dimensional Cortical Representation
Current understanding of auditory cortical processing inspires our model for the multi-dimensional representation of sound. The model takes in as input an auditory spectrogram, and effectively performs a wavelet decomposition using a bank of linear spectro-temporal receptive fields (STRFs). The
analysis proceeds in two steps (as detailed in Chi et al. 2005): (i) a spectral step that maps each incoming spectral slice into a 2D frequency-scale representation. It is implemented by convolving the time-frequency spectrogram y(t,x) with a complex-valued spectral receptive field SRF, parametrized by spectral tuning Ωc and characteristic phase φc; (ii) a temporal step in which the time-sequence from each frequency-scale combination (channel) is convolved with a temporal receptive field TRF to produce the final 4D cortical mapping r. Each temporal filter is characterized by its modulation rate ωc and phase θc. This cortical mapping is depicted in Fig. 1A, and can be captured by

s(t, x; Ωc, φc) = y(t, x) ∗x SRF(x; Ωc, φc)
r(t, x; ωc, Ωc, θc, φc) = s(t, x; Ωc, φc) ∗t TRF(t; ωc, θc)    (1)
We choose the model’s parameters to be consistent with cortical response properties, spanning the range Γ=[0.5–4] peaks/octave spectrally and Ψ = [1–30] Hz temporally. Clearly, other feature dimensions (such as spatial location and pitch) can supplement this multidimensional representation as needed.
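As a schematic illustration of Eq. (1), the following Python sketch applies a bank of Gabor-like spectral and temporal filters to a spectrogram; the filter shape is a stand-in for the seed functions of Chi et al. (2005), and all names and parameter values are illustrative assumptions.

import numpy as np

def gabor(axis, tuning, phase=0.0):
    """Complex Gabor-like receptive field along one axis (a stand-in for the SRF/TRF)."""
    return np.exp(-0.5 * (axis * tuning) ** 2) * np.exp(1j * (2 * np.pi * tuning * axis + phase))

def cortical_map(y, dt, dx, rates, scales):
    """Map a spectrogram y[t, x] onto the 4-D representation r[t, x, rate, scale] of Eq. (1)."""
    n_t, n_x = y.shape
    x_ax = (np.arange(n_x) - n_x // 2) * dx          # frequency axis in octaves
    t_ax = (np.arange(n_t) - n_t // 2) * dt          # time axis in seconds
    r = np.zeros((n_t, n_x, len(rates), len(scales)), dtype=complex)
    for si, Om in enumerate(scales):
        srf = gabor(x_ax, Om)                        # spectral receptive field, tuning Om (peaks/octave)
        s = np.stack([np.convolve(y[t], srf, mode="same") for t in range(n_t)])
        for ri, w in enumerate(rates):
            trf = gabor(t_ax, w)                     # temporal receptive field, rate w (Hz)
            for xi in range(n_x):                    # temporal step per frequency-scale channel
                r[:, xi, ri, si] = np.convolve(s[:, xi], trf, mode="same")
    return r

# Parameter ranges quoted in the text: scales 0.5-4 peaks/octave, rates 1-30 Hz, e.g.
# r = cortical_map(spectrogram, dt=0.005, dx=0.1, rates=[1, 2, 4, 8, 16, 30], scales=[0.5, 1, 2, 4])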
Fig. 1 A,B Schematic of stream segregation model
2.3
Stage 2: Temporal Coherence Analysis
The essential function of this stage is twofold: (i) estimate a pair-wise correlation matrix (C) among all scale-frequency channels, and then (ii) determine from it the optimal factorization of the spectrogram into two streams (foreground and background) such that responses within each stream are maximally coherent. The correlation is derived from an instantaneous coincidence match between all pairs of frequency-scale channels integrated over time. Given that TRF filters provide an analysis over multiple time windows, this step is equivalent to an instantaneous pair-wise correlation across channels summed over rate filters (Fig. 1B):

∫ si(t) sj(t) dt = Σω∈Ψ ri(ω) rj*(ω) ≜ Cij    (2)
where (∗) denotes the complex conjugate. We can find the “optimal” factorization of this matrix into two uncorrelated streams, by determining the direction of maximal incoherence between the incoming stimulus patterns. Such a factorization is accomplished by a principal component analysis of the correlation matrix C (Golub and Van Loan 1996), where the principal eigenvector corresponds to a map labeling channels as positively or negatively correlated entries. The value of its corresponding eigenvalue reflects the degree to which the matrix C is decomposable into two uncorrelated sets, and hence reflects how ‘streamable’ the input is.
2.4
Computing the Two Streams
Therefore, the computational algorithm for factorizing the matrix C is as follows:
1. At each time step, the matrix C(t) is computed from the cortical representation as in Eq. (2). The correlation matrix keeps evolving as the cortical output r(t) changes over time. However, for stationary stimuli the correlation pattern reaches a stable point after a buildup period.
2. Given its Hermitian nature (since it is a correlation matrix), C can be expressed as C = λmm† + ε, where m is the principal eigenvector of C, λ its corresponding eigenvalue, and ε(t) the residual energy in C not accounted for by the outer product of m. (†) denotes the Hermitian transpose.
The ratio of λ² to the total energy in C corresponds to the proportion of the correlation matrix accounted for by its best factorization m. This ratio is an indicator of the separability of the matrix C, and hence the streamability of the sound. The principal eigenvector m can be viewed as a ‘mask’, which can differentially shape the scale-frequency input pattern at any given time instant. This mask
consists of a map of weights that positively scales channels with a common orientation and suppresses channels in the opposite direction. Effectively, m (and its complement 1-m) acts as a “filter” through which we can produce the foreground (and background) stream.
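For concreteness, the coincidence matrix of Eq. (2) and the extraction of the mask m could be sketched as follows (a simplified illustration; the time windowing, normalization and any thresholding of the mask in the full model are not reproduced):

import numpy as np

def stream_mask(r):
    """r: complex cortical output of shape (time, channels, rates), where 'channels'
    indexes the flattened frequency-scale plane. Returns the mask m and a streamability index."""
    n_t = r.shape[0]
    # Eq. (2): pairwise coincidence summed over rate filters and integrated over time.
    C = np.einsum('tiw,tjw->ij', r, np.conj(r)) / n_t
    # C is Hermitian; its principal eigenvector gives the best split into two streams.
    evals, evecs = np.linalg.eigh(C)                    # eigenvalues in ascending order
    lam, m = evals[-1], evecs[:, -1]
    streamability = float(lam ** 2 / np.sum(evals ** 2))   # share of C explained by the factorization
    return m, streamability

# The foreground and background streams are then obtained by weighting the
# channels with m and with its complement (1 - m), respectively.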
3
Simulation Results
The model was tested on several classic stream segregation conditions to demonstrate its ability to emulate known percepts as reported by human subjects. The first row in Fig. 2 illustrates results of the classic alternating tone paradigm (Bregman 1990). The leftmost panel shows the mask profile m for this stimulus. Given its stationary nature, the matrix C stabilizes rapidly, and its factorization m reveals that the energy in channel A (low tone) is temporally anti-correlated with channel B (high tone), and hence should belong to a different stream. The second row of Fig. 2 depicts simulation results for a target tone in a multi-tone background, commonly used in Informational Masking (IM) tests. This stimulus is the focus of the remainder of this study, where we attempt to use the model to account for perceptual and physiological results using the same paradigm. The right lower panels of Fig. 2 show the outcome of applying the mask m to the IM spectrogram. As the correlation pattern builds up in time, the target tone is flagged as temporally un-correlated with the background tones, and hence is slowly suppressed in the left stream. Given the random nature of the background, some maskers are occasionally labeled as weakly correlated with the target. This explains why the target stream has a weak contribution from the maskers.
Fig. 2 Model simulations
Fig. 3 Predicted target detection
4
Perceptual Measures
To validate the simulation results against human perception with IM stimuli, we derived a measure of how detectable the target is, based on our mask profile. The measure quantifies the mean vectorial distance between the complex-valued energy of m at the target channel and the energy in any other masker channel. Figure 3 illustrates the change in this distance d as the protection zone separating the target from the maskers varies. In accord with findings from psychoacoustic tests (Micheyl et al. 2007), the trend in this distance plot reveals that unmasking effects of the target depend on the size of the spectral protective region around the target tone. Additionally, the model reveals that temporal regularity of the target does not seem to be a critical cue for target detection. The open symbols in Fig. 3 demonstrate that regular targets or roved irregular targets (on average one target every two masker bursts) yield virtually similar distance values, and hence result in similar unmasking levels, as shown by perceptual findings in Micheyl et al. (2007).
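Under one plausible reading of this measure (an assumption, since its exact definition is not spelled out here), the distance d could be computed from the mask as:

import numpy as np

def target_detectability(m, target_idx, masker_idx):
    """Mean distance in the complex plane between the mask value at the target channel
    and the mask values at the masker channels."""
    m = np.asarray(m, dtype=complex)
    return float(np.mean(np.abs(m[target_idx] - m[masker_idx])))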
5
Physiological Correlates
In addition to mimicking human perceptual performance, the model enables us to explore neural correlates of streaming and attention, as observed in physiological studies using the same IM paradigm (Yin et al. 2007). To do so, we add more biological realism to the model by incorporating a stage of neuronal adaptation, simulated via mechanisms of synaptic depression known to
Fig. 4 Gain change of target tuning curve as predicted by model’s mask response
operate at the thalamo-cortical projections (as described in Elhilali et al. 2003). This stage shapes the energy pattern of each channel at the input of the cortical model by effectively adapting its activity in a nonlinear fashion. This neuronal adaptation has been explored as a potential mechanism underlying observed tuning curve changes in naïve or non-behaving animals presented with streaming-like paradigms (Yin et al. 2007). Consistent with this speculation, our simulations reveal a drop in tuning curve gain during the buildup period (Fig. 4). These tuning curves are obtained by weighting the model’s spectral receptive field (SRF) region around the target tone with its corresponding mask profile m at different time epochs of the stimulus. By contrast, simulating behavioral shifts in trained animals has to invoke top-down attentional mechanisms which would, for instance, modulate the weights of the cortical map by emphasizing the STRF regions associated with the task at hand. Specifically, when a trained animal is performing a detection task of a single tone surrounded by broadly distributed masker tones (referred to as Task 2 in Yin et al. 2007), a potential mechanism at play is learning to promote narrowly tuned neuronal ensembles so as to focus on a single target tone. Such consistent attentional emphasis can be simulated by applying a high-pass weighting along the scale dimension in Fig. 1, hence amplifying the response from the high-scale (i.e., narrowband) region. Conversely, when the animal learns to attend to the broadband masker background tones (Task 1), it could potentially emphasize activity in the broadband region. We simulate this situation by a low-pass weighting along the scale dimension. The effect of these task dependencies is illustrated in Fig. 5, which depicts the changing bandwidth of a tuning curve during the performance of these two tasks, as shown in the physiological findings of Yin et al. (2007).
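A toy version of this task-dependent emphasis (the sigmoidal weighting profile and its pivot are illustrative assumptions) simply reweights the scale axis of the cortical output:

import numpy as np

def attentional_scale_weighting(r, scales, mode="narrowband", pivot=2.0):
    """Emphasize high scales (narrowband focus, as in Task 2) or low scales (broadband, Task 1).

    r: cortical output with the scale axis last, shape (..., n_scales);
    scales: the scale (peaks/octave) of each channel."""
    scales = np.asarray(scales, float)
    w = 1.0 / (1.0 + np.exp(-(scales - pivot)))   # sigmoidal high-pass along the scale axis
    if mode == "broadband":
        w = 1.0 - w                               # low-pass: emphasize the broadband (low-scale) region
    return r * w                                  # broadcasts over the trailing scale axis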
Fig. 5 Bandwidth changes of target tuning curve as predicted by model’s mask response during task performance
6
Final Remarks
We have demonstrated that analysis of response coherence in a model of auditory cortical processing can account for the perceptual organization of sound streams. While response coherence emerges as the key overarching organizational principle, its computational implementation can take different but essentially equivalent forms. For instance, this paper focused on the correlation matrix C and its factorization as the vehicle for the analysis. Alternatively, a focus on predicting response consistency within different streams results in a Kalman filtering interpretation (Elhilali and Shamma 2006). Ongoing and future investigations must also incorporate biologically plausible adaptive mechanisms to account for the observed effects of behavior on cortical responses during streaming. Acknowledgment. This work is supported by CSRNS RO1 AG02757301, AFOSR and SWRI.
References
Bregman A (1990) Auditory scene analysis. MIT Press
Chi T, Ru P, Shamma S (2005) Multiresolution spectrotemporal analysis of complex sounds. J Acoust Soc Am 118:887–906
Cooke M, Ellis D (2001) The auditory organization of speech and other sources in listeners and computational models. Speech Commun 35:141–177
Elhilali M, Shamma S (2006) A biologically-inspired approach to the cocktail party problem. Proc ICASSP
Elhilali M, Fritz J, Klein D, Simon J, Shamma S (2003) Dynamics of precise spiking in primary auditory cortex. J Neurosci 24(5):1159–1172
Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins Univ Press
Kowalski N, Depireux D, Shamma S (1996) Analysis of dynamic spectra in ferret primary auditory cortex. J Neurophysiol 76:3503–3523
Micheyl C, Oxenham A, Shamma S (2007) Detection of repeating targets in random multi-tone backgrounds: perceptual mechanisms. Current volume
Miller L, Escabi M, Read H, Schreiner C (2002) Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. J Neurophysiol 87(1):516–527
Nelken I (2004) Processing of complex stimuli and natural scenes in the auditory cortex. Curr Opin Neurobiol 14:474–480
Yin P, Ma L, Elhilali M, Fritz J, Shamma S (2007) Neural correlates of attention during streaming. Current volume
Comment by Yost
The discussion following your excellent talk underscored what I think can be an important distinction. I do not believe that ‘streaming’ and ‘source’ segregation are always the same thing. That is, in your A-B example two sounds (A and B) that occur at the same time can be perceived as coming from two sources, but they may not be perceived as being a continuation of the sources perceived at a different point in time – segregation may occur but streaming did not. From my perspective, streaming is a form of source segregation that involves an element of continuity over time. Or, put another way, streaming is an example of source segregation, but they are not the same thing. You presented your model as a stream segregation model, but it appears that with the proper time constants it might also be used for source segregation in the absence of perceived continuity from stimulus presentation to stimulus presentation. For instance, with a very short time constant the model might be able to account for the segregation of two different transients that occurred at the same time. Is this correct?

Reply
I agree with your argument about the use of the model (namely the first stage of a sound multi-feature representation) as a scheme for segregating sound components present in the environment at any instant in time. In this multidimensional representation, acoustic elements that evoke sufficiently non-overlapping activity patterns in the feature space are deemed perceptually distinguishable and can hence be perceived as individual components in a complex scene at any instant in time. This sound representation builds up over tens of milliseconds (<150 ms), but reflects the instantaneous segregation of an acoustic scene. In contrast, the temporal coherence stage of the model reflects the dynamic nature of stream segregation as it builds up over time, requiring information integration over a few hundred milliseconds. Hence, as expressed in your comment, the ‘instantaneous’ elements parsed in the first stage might or might not evolve to a percept of segregated streams,
depending on whether they maintain a coherent evolution over time from one stimulus presentation to the next. My only disagreement with your statement is the use of the term ‘source’ segregation, because that expression reflects more the physical cause of a sound, and not necessarily our perception of the individual components of a sound. Hence, I would prefer to call this instantaneous segregation a parsing of the scene into its constituent elements, which allows us at every instant of time to perceive different elements present in the environment (which you called sources).

Comment by Divenyi
I don’t think anybody would argue that the basic premise of streaming is source perception. When a sequence is perceived as two streams, it is because we attribute the two alternating sounds to two different sources. Conversely, when the sequence is perceived as a single stream, we attribute the whole sequence to a single source. So, when two sounds that are segregated into two streams are now played simultaneously and repeated over a longer period, they are grouped together by virtue of their shared temporal properties. I think that the listener will end up considering the ensemble as being produced by a single source, just like a consonant burst is considered as coming from one source regardless of how many disparate spectral patches it may consist of. In music, too, a repeated chord, no matter how complex, will be considered as the same repeated event. Would it not be preferable for the correlation metric you propose to indicate the number of sources instead?

Reply
I do agree that the percepts that arise from many acoustic scenes do not necessarily reflect the actual physical sound sources present in the environment. However, I would disagree with your statement that the premise of streaming is source separation. Rather, I would agree with Bregman’s definition of a stream, where he argued to reserve the word ‘stream’ for the perceptual representation, and the word ‘sound’ or ‘source’ for the physical cause. Aside from the nomenclature issue, I completely agree with your argument. As for the use of the model for indicating the number of streams (or ‘sources’) in the scene, we can definitely expand our formulation to incorporate information from the second and higher principal dimensions of the coherence correlation matrix C (after performing the matrix factorization). These additional degrees of freedom can indicate the presence of a third or fourth stream whose components are highly correlated amongst themselves. We have not yet explored this extension of the model in the current study, but will try to incorporate it in alternative implementations of the model.
28 Primary Auditory Cortical Responses while Attending to Different Streams
PINGBO YIN1, LING MA2, MOUNYA ELHILALI1,3, JONATHAN FRITZ1, AND SHIHAB SHAMMA1,2,3
1 Institute for Systems Research, University of Maryland, College Park, MD USA 20742
2 Bioengineering Program, University of Maryland, College Park, MD USA 20742
3 Electrical and Computer Engineering Department, University of Maryland, College Park, MD USA 20742; [email protected]
1
Introduction
Auditory streaming is a fundamental perceptual component of auditory scene analysis. It manifests itself in the everyday ability of humans and animals to parse complex acoustic information arising from multiple sound sources into meaningful auditory “streams”. While seemingly effortless, the neural mechanisms underlying auditory streaming remain a mystery, largely because experiments in non-human species have been hampered by the difficulty of assessing the subjective perception of streaming without relying on introspection and language. Here we overcome this difficulty through the use of specially designed stimuli and psychoacoustic tasks to induce, manipulate, and at the same time objectively assess streaming in animals. While the basic structure of the stimuli and tasks is identical to those used with humans in Micheyl, Oxenham, and Shamma (this volume), they are slightly adapted so that physiological data can be collected simultaneously with behavior in animals.
2
Background
There has recently been growing interest in the neural basis of auditory stream segregation using animal models ranging from awake monkeys to birds (Fishman et al. 2004; Bee and Klump 2004). Using the classic ABAB... repeating two-tone paradigm (Bregman 1990), these studies have inferred that neural responses to the tones become more segregated, reflecting the well-known streamed percept of the tones. For instance, Fishman et al. (2004) recorded multi-unit and local field potentials while setting the frequency of the A tone to correspond to the BF of the cortical site contacted. The frequency of the B tone was set either below or above the BF such that the response elicited by the
B tones was approximately half that elicited by the A tones. The results showed that when the tone repetition rate was increased from 2.5 to 20 Hz, the responses of cortical sites with a BF at the A-tone frequency displayed less and less activation to the B tones. Similarly, changing the frequency separation, presentation rate, tone duration, and other parameters elicited response changes consistent with perceptual trends observed in human listeners (Fishman et al. 2004; Bee and Klump 2004). Finally, single-unit recordings of responses in the primary auditory cortex of awake monkeys to 10-s repeating (ABA_) tone sequences have demonstrated that responses change gradually following the onset of the sequence in a way that was consistent with the phenomenon of “build-up of stream segregation” (Micheyl et al. 2005b). Despite these encouraging results, two criticisms can be made of all neural studies of streaming so far. First, it could be argued that the neural response patterns putatively associated with one- and two-stream percepts could merely be reflections of the change in stimulus parameters, rather than providing the neural basis for streaming. Stronger neural evidence for streaming would be obtained if one could demonstrate parallel changes in neural responses and percepts in the absence of concomitant changes in the physical stimulus. Second, no previous studies provided behavioral tests of the percepts that the animals were experiencing with the stimuli used in the experiments, since all involved non-behaving animals. This may be particularly important for streaming-related issues, as attention seems to be involved in the formation or modulation of auditory streams (e.g., Carlyon et al. 2001; Cusack 2005; Sussman 2005). The experiments described below aimed to address these two objections by combining behavior with physiological experiments.
3
Behavioral Tasks
The perceptual phenomena investigated here are similar to those already described in Micheyl et al. (this volume). The stimuli have been adapted for animal experimentation by (1) creating tasks with objective performance criteria, similar to those already used with human subjects by Micheyl et al. (2005a), and (2) selecting parameter ranges appropriate for the ferret and the isolated cortical units. Ferrets were trained on the two behavioral tasks illustrated in Fig. 1. The two tasks share an identical initial sequence of random tone maskers (A-tones) and an embedded sequence of B-tones. This entire portion of the stimulus is referred to as reference. The two tasks differ in the final three bursts that the animals must attend to, which are designated as targets. In task 1, the ferret detects that the A maskers become a repeated set of tones (B-tone remains unchanged throughout). By contrast, in task 2, the target is a 1/4 octave change in the frequency of the B tone (B′). The hypothesis is that animals must now switch attention from the broadly distributed masker tones (“A”) to the spectrally narrow B-tone. With the addition of this task, we
Fig. 1 Two behavioral tasks for ferrets: Task 1 – detect stationary A (A’); Task 2 – detect a change in B (B’)
can compare the neural correlates of the perceptual state of the animals in three different conditions: (1) naïve, (2) attending to the stream of B-tones, (3) attending to the broader spectrum (or the “stream” of the A-maskers). The behavioral training paradigm was the conditioned avoidance procedure, which involved a combination of positive and negative reinforcements (Heffner and Heffner 1995; Fritz et al. 2003). Specifically, two ferrets readily learned to lick a waterspout during the reference signals, and then to immediately withdraw upon hearing the targets. They achieved adequate performance levels (Discrimination Rate>0.65) in both tasks at various parameter settings, the two most important of which are the width of the protection zone (<2 ERBs; shaded region in Fig. 1) and the level of B-tones relative to the A-maskers (<20 dB). Increasing these parameters further separates the A and B streams, facilitating the detection of targets.
4
Physiological Responses
Responses of single-units in the primary auditory cortex were measured or contrasted while animals were in each of three conditions: naïve, tasks 1 and 2. In particular, because of the broadband and random structure of the A-tones, it was possible to estimate the tuning curves of the isolated cells with the reverse-correlation method (deCharms et al. 1998, and Fig. 2), using only the reference portion of the stimuli (which was identical across all conditions).
Fig. 2 Correlates of streaming in the naïve animal: A raster, post-stimulus histogram, and short-time tuning curves of a single unit responding to the tone-in-masker stimulus (see text for details). Short-time tuning curves were computed over the intervals a–b–c and e–f–g marked below and above the histogram, respectively; B the gain of the tuning curves decreases during the buildup, whereas its bandwidth remains unchanged; C tuning curve bandwidths decrease with increasing rates, but remain approximately unchanged with respect to other stimulus parameters
4.1
Responses in the Naïve Ferret
As a benchmark for the behavioral neurophysiology measurements, we recorded responses in 23 single units in AI of a naïve ferret while it passively listened to the stimuli of Fig. 1. In particular, we extracted the gain and bandwidth of the tuning curves for different stimulus parameters (presentation rate, size of protection region, relative B-tone level), and as they varied during the early “build-up” period following the onset of the stimulus. Figure 2A (left panel) illustrates the raster and PST histograms of responses from a single unit to ten repetitions of nine different sequences of different lengths. Tuning curves are computed by reverse-correlating the responses with the spectrogram of the stimulus (deCharms et al. 1998) over a series of short-time intervals following the onset of the sequences (marked at the bottom of the raster as a–b–c in Fig. 2A; right panel). In Fig. 2B,C we illustrate the total population tuning curves computed by centering and then averaging the tuning curves from 23 single units. During the buildup period, the data reveal two changes: (1) The gain of the tuning curves declines by about 10%, first rapidly, but then much more slowly or not at all past 1 s (Fig. 2B; left panel); (2) The bandwidths of the tuning curves remain essentially unchanged throughout (Fig. 2B, right panel). This suggests that the decreasing responses to off-BF tones observed previously (Fishman et al. 2001) are due to a change in the tuning curve gain and not its bandwidth. We also examined changes in the population tuning curves under the following stimulus conditions: (1) Presentation rate (5 vs 10 Hz): there was a significant decrease in the average bandwidth at faster rates of about 25% (Fig. 2C, left panel). This finding is consistent with the decrease of responses to the off-BF tone at higher rates observed in previous studies (e.g., Fishman et al. 2001), which was attributed to “forward suppression” or “adaptation” of responses to tone frequencies remote from the BF by preceding tones at the BF. (2) Signal-to-masker levels (0 vs 10 dB) and protection zone widths (0 vs 2 ERB): increasing the B-tone level or the protection zone surrounding it facilitates its perceptual streaming (see psychoacoustics above), and hence one might expect a narrowing of the tuning curves with both. However, there were no significant changes in tuning curve bandwidths with either manipulation (Fig. 2C; right panel). There are no comparable physiological data in the literature regarding the signal-to-masker level manipulation. The protection zone width, however, is analogous to the frequency separation between the A and B tones in the classic two-tone streaming experiments. Therefore, the absence of tuning curve bandwidth changes here suggests that decreasing responses to the off-BF tones with increasing ∆F may simply reflect the finite bandwidth of the tuning curves and not a change in tuning related to the perceptual state.
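For reference, a spike-triggered average of the stimulus spectrogram, one common form of the reverse-correlation estimate (the exact windows, latency and normalization used in the study are not reproduced here), can be sketched as:

import numpy as np

def short_time_tuning_curve(spike_times, spectrogram, frame_times, window, lag=0.015):
    """Average the spectrogram frame preceding each spike by `lag` seconds,
    using only spikes that fall inside the analysis window (t0, t1).

    spectrogram: array [frame, frequency]; frame_times: time (s) of each frame."""
    t0, t1 = window
    frames = []
    for t in spike_times:
        if t0 <= t < t1:
            idx = np.searchsorted(frame_times, t - lag)
            if idx < spectrogram.shape[0]:
                frames.append(spectrogram[idx])
    if not frames:
        return np.zeros(spectrogram.shape[1])
    return np.mean(frames, axis=0)   # mean pre-spike energy per frequency ~ tuning curve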
4.2
Responses in Behaving Ferrets
We explored how responses adapted when the focus of the animal’s attention changed from the spectrally broad A stream in task 1 to the spectrally narrow B-tone stream in task 2. Specifically, we hypothesized that the tuning curves may sharpen during the onset of streaming in task 2 to reflect this shift in focus. Responses of 15 single units were measured from two animals while they engaged in the two tasks of Fig. 1. Recall that tuning curves were only estimated and compared based on responses during the reference portion of the stimuli (which was identical in both tasks). As in the naïve animal, tuning curve gains declined by about 10% during the initial 1–2 s (not shown). However, the bandwidth of the tuning curves in most cells (60%) changed depending on the behavioral task (as discussed in detail below). An example from one single unit is shown in Fig. 3. This cell was tuned at a BF of 2.7 kHz. The frequency of the B-tone in the stimulus was set about 1/2 octave below the BF (at 1.6 kHz; arrowheads in Fig. 3B), surrounded by a protection zone of 3 ERB on either side. We also measured the spectro-temporal receptive field (STRF) of this cell (left panel of Fig. 3D) before any behavioral tasks commenced, using the methods of Fritz et al. (2003). The STRF indicates a strong excitatory region near the BF of the cell at 2.7 kHz (white area), surrounded by a weak inhibitory field (darker). The range of frequencies of the A-maskers and B-tone used in the behavioral tasks are schematically depicted on the left of the STRF by the dashed and solid lines, respectively. Response PSTHs during the two tasks are shown in Fig. 3A. The cell responded during both stimulus intervals, labeled A (only A-maskers present) and A+B (A-maskers + B-tone). In task 1 (left panel), the responses during the A intervals were bigger than those during the intervening A+B intervals, presumably because the B-tone coincided with the weak inhibitory fields just outside of the excitatory tuning curve of the cell. The average tuning curves during this task (Fig. 3B) were relatively broad, with little evidence of the inhibitory sidebands. During task 2, responses to the A+B interval diminished considerably relative to their levels in task 1 (Fig. 3A; right panel). Evidently, the reason is a suppression of the tuning curve near the B-tone, causing a substantial narrowing of the excitatory tuning curve (see arrows in Fig. 3B). This emergent inhibitory sideband apparently persisted, since it could also be detected in the STRF measured after the completion of the task (Fig. 3D; right panel). In addition to the task-dependent changes in the overall (average) tuning curves that persisted after the end of the experiments (e.g., as in Fig. 3D), there were also rapid tuning curve changes that occurred within each stimulus trial and may correlate with the “build up” of streaming. For instance, in Fig. 3C, the short-time average tuning curves from a population of ten cells responding during task 1 demonstrate a rapid increase in bandwidth during the initial response period (1–2 s following stimulus onset). So it is possible that there are two kinds of superimposed tuning curve and STRF changes: (1) persistent changes that reflect the overall effect of the
Fig. 3 Effects of behavior on responses and tuning curves: A PSTH of responses from a single unit during the two behavioral tasks, collected from all trials as shown in Fig. 2A. The primary difference between the two panels is the substantial decrease of responses during the A+B intervals (indicated by the “dots”) during task 2; B normalized tuning curves computed only from responses during the reference interval of the stimuli (Fig. 1). Shaded region highlights the sharpening of the curves during task 2. The arrows indicate the BF of the cell, and the frequency of the B-tone; C the rapid broadening of the tuning curves in the population of single units within 2 s following onset of the stimulus trials in task 1; D STRFs of the cell measured before and after the two tasks. The dashed lines represent the maskers. Before task 1, the lone excitatory region (white) is near 2.7 kHz. After task 2, an inhibitory field (black) emerges at the frequency of the B-tone
attentional state of the animal (e.g., attending to stream A or B throughout the block of trials enhances the representation of the attended stream relative to the other), and (2) rapid changes within a trial reflecting more automatic processes (e.g., correlates of the buildup of streaming).
5
Summary and Discussion
The results described here suggest that attentive behavior alters two aspects of the neural correlates of streaming: (1) those associated with the build-up and formation of the streams, and (2) others that are more global and persist after the behavior. In the first, stable tuning curves and highly phasic responses following the onset of tone sequences in the naïve animal give way to narrowing bandwidths and less precipitously decreasing gains in the behaving animal. We conjecture that these changes facilitate the formation of two streams in the attentive animal. Second, attention to a foreground (target) sequence against a background (reference) induces additional tuning curve or STRF changes, with target frequencies enhanced and reference frequencies depressed. Unlike the first set of changes that start anew at the onset of each trial, these latter changes are not reset every trial, but instead may persist after the behavior. They are very similar to the “rapid plasticity” effects observed in STRFs of animals during and after similar target/reference discrimination behaviors (Fritz et al. 2005). Such plasticity, in fact, is not related to streaming per se, since it occurs even when the target/reference stimuli are not “streams”, e.g., if they are presented at slow rates as in Fritz et al. (2005). Therefore, it is clear that “attention” induces complex intertwined changes in the response properties, with different time-courses and longevity. Perhaps attending to a “streamed” reference simply amplifies the effects already dictated by the structure of the task. To clarify these factors, it will be necessary in the future to contrast the effects of streaming on the tuning curves in behavioral tasks where “streams” play the role of reference and/or target. Acknowledgments. This research is funded by NIH grants R01DC007657 and R01DC005779.
References
Bee M, Klump G (2004) Primitive auditory stream segregation: a neurophysiological study in the songbird forebrain. J Neurophysiol 92:1088–1104
Bregman A (1990) Auditory scene analysis: perceptual organization of sound. MIT Press
Carlyon R, Cusack R, Foxton J, Robertson I (2001) Effects of attention and unilateral neglect on auditory stream segregation. J Exp Psychol Hum Percept Perform 27:115–127
Cusack R (2005) The intraparietal sulcus and perceptual organization. J Cogn Neurosci 17:641–651
deCharms R, Blake D, Merzenich M (1998) Optimizing sound features for cortical neurons. Science 280:1439–1443
Fishman YI, Reser DH, Arezzo JC, Steinschneider M (2001) Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hear Res 151:167–187
Fishman Y, Arezzo J, Steinschneider M (2004) Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J Acoust Soc Am 116:1656–1670
Fritz J, Shamma S, Elhilali M, Klein D (2003) Rapid task-dependent plasticity of spectrotemporal receptive fields in primary auditory cortex. Nat Neurosci 6:1216–1223
Fritz J, Elhilali M, Shamma S (2005) Active listening: task-dependent plasticity of receptive fields in primary auditory cortex. Hear Res 206:159–176
Heffner HE, Heffner RS (1995) Conditioned avoidance. In: Klump G, Dooling R et al. (eds) Methods in comparative psychoacoustics. Birkhäuser, Basel, pp 79–94
Micheyl C, Carlyon R, Cusack R, Moore B (2005a) Performance measures of auditory organization. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York
Micheyl C, Tian B, Carlyon R, Rauschecker J (2005b) Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48:139–148
Sussman E (2005) Integration and segregation in auditory scene analysis. J Acoust Soc Am 117:1285
29 Hearing Out Repeating Elements in Randomly Varying Multitone Sequences: A Case of Streaming?
CHRISTOPHE MICHEYL1, SHIHAB A. SHAMMA2, AND ANDREW J. OXENHAM1
1 Introduction
One of the most essential functions of perception is to detect and track signals of interest amid other, potentially distracting stimuli. Understanding the mechanisms behind this function is important not just from a theoretical perspective, but also because of its potential applications in artificial scene analysis. An important aspect of auditory scene analysis is the organization of sounds into streams (Bregman 1990). While the vast majority of studies on auditory streaming have used repeating sound sequences, perhaps one of the most informative findings regarding the potential usefulness of streaming in everyday life comes from a study that involved randomly varying tones. Kidd et al. (1994) measured detection thresholds for a fixed-frequency target tone presented simultaneously with multiple other spectral components, the frequencies of which varied unpredictably across trials; a situation known to produce large amounts of informational masking (Neff and Green 1987). Kidd et al. (1994) found that if the target-plus-masker bursts were repeated identically several times, so that they formed an unchanging sequence, large informational masking effects remained. However, if the masker component frequencies were allowed to vary from one burst to the next within the course of the stimulus sequence, as illustrated schematically in Fig. 1, thresholds were substantially reduced. Kidd et al. (1994) explained this outcome as being due to the repeating target tones being perceptually organized into a coherent auditory stream, which stood out from the randomly varying background. From this point of view, the results of Kidd et al. (1994) appear to demonstrate that stream segregation can greatly alleviate informational masking effects, which might otherwise dramatically impair our ability to detect sounds of interest in complex varying backgrounds. While an interpretation of the findings of Kidd et al. (1994) in terms of auditory streaming is intuitively appealing, streaming is a subjective phenomenon
1 Research Laboratory of Electronics, MIT, Cambridge, MA; new address: Department of Psychology, University of Minnesota, MN, USA, [email protected], [email protected]
2 Institute for Systems Research and Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA, [email protected]
Fig. 1 Schematic spectrogram showing a sequence of regularly repeating tones amid random masker tones
and unless the term is used to imply a set of well-defined perceptual or physiological mechanisms, it does not explain why and how the repeating tones are detected in the randomly varying background. The perception of the target tones as a stream might be a by-product of their detection, which in turn might be mediated by mechanisms that themselves have little to do with streaming. Some steps have already been taken towards ruling out candidate explanations for the detection of the repeating target tones. Kidd et al. (2003) showed that introducing longer gaps between successive tone bursts affected thresholds in a way that could not be accounted for by a simple multiple-looks mechanism, but was consistent with the known dependence of auditory streaming on inter-tone interval (Bregman et al. 2000). The present study continues this effort in an attempt to better understand the links between, and the mechanisms underlying, informational masking and auditory streaming in three ways. First, to test whether the detection of the target tones depends on the stimulation rate or temporally integrated energy being higher at the target frequency, we presented the target every other (or every third) burst. Second, to test whether detection was based on a frequency template, we roved the target frequency over a wide range. Third, to test whether detection was based on a temporal template, we introduced randomness in the temporal positions of the target tones. All these manipulations were tested using different widths of a “protected region” around the target frequency, which determined the minimum spectral distance between the target and the nearest masker components.
2 Methods
In all the experiments, the stimuli were sequences of 10, 20, or 40 multi-tone bursts. Each burst was 60 ms in duration (including 20-ms onset and offset ramps), resulting in sequence durations of 0.6, 1.2, and 2.4 s. Each burst consisted of 8 synchronous sinusoidal masker components, with frequencies
drawn randomly from a list of 16 frequencies spaced 2 semitones (about 12%) apart, with the constraint that no masker components were permitted within a 6-, 10-, 14-, or 18-semitone-wide “protected region” around the target frequency. The target frequency was either fixed at 1 kHz (fixed-target-frequency condition) or drawn randomly on each trial from a fixed list of 15 frequencies ranging from −14 to +14 semitones around 1 kHz in 2-semitone steps (roving-target-frequency condition), in such a way that at least one masker component was always present on either side of the target protected region. With this design, possible stimulus component frequencies ranged from 265 to 3776 Hz. The target and masker components all had the same level (50 dB SPL). Unless otherwise specified, the target tone was presented on every other burst after the second, so that, given the 50% chance of each of the 16 masker frequencies being present on a given burst, the average rate and energy at the target and masker frequencies were equal. In three additional control conditions involving a single sequence length (20 bursts), the targets were presented on every burst (as was commonly the case in the experiments by Kidd et al. 1994), on every third burst after the second, or randomly distributed across the bursts, following an irregular temporal pattern. Using these stimuli, two types of one-interval experiments were performed. The first type involved target-detection experiments, where listeners were presented on each trial with a sequence of 40 bursts (2.4 s), which contained repeating target tones or not, and had to press a key if, and as soon as, they detected the repeating target tone – a psychophysical paradigm known as “go/no-go”. Response times were recorded and used to construct psychometric functions showing the cumulative proportion of “detect” responses as a function of time relative to the onset of the stimulus sequence in the different conditions. “Detect” responses on catch trials containing no target were used to estimate false-alarm rates in the different listeners and conditions, and to compute the sensitivity index, d′. In these experiments, listeners performed 16 trials per condition. The second type consisted of target-frequency discrimination experiments, where the target was always present and the listeners had to report the direction (up or down) of a small (2%) frequency shift applied to the last target in the sequence. These experiments were inspired by earlier studies that used forced-choice frequency discrimination performance as a measure of listeners’ ability to hear out target tones within sequences (Watson et al. 1975; Micheyl et al. 2005a). To ensure that performance was not based on the detection of any small frequency shift near the end of the sequence (see Demany and Ramos 2005 and this volume), frequency changes of the same magnitude but random sign were also applied to the masker tones that accompanied the target on the last burst. In these experiments, between 100 and 300 trials were obtained in each condition for each listener. Blocks of trials with the maskers were always preceded by at least one block of 25 trials in which the target was presented in isolation in order to allow listeners to clearly identify it. All experiments used a constant-stimuli procedure, with the main
conditions (i.e., the fixed- and roving-target-frequency conditions, and the different target rates) tested in a blocked fashion, but the different protected-region sizes within each condition intermingled and presented in random order (across blocks as well as across listeners). Stimuli were computer generated at a 50-kHz rate, converted to analog (LynxStudio LynxOne), attenuated (TDT PA4), pre-amplified (TDT HB6), and presented diotically through Sennheiser HD580 headphones. Six experienced normal-hearing listeners took part. They were tested individually in a soundproof booth (IAC).
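For readers who wish to experiment with this paradigm, the following Python sketch reconstructs one stimulus sequence from the parameters given above (60-ms bursts with 20-ms ramps, eight maskers per burst drawn from a 2-semitone grid, a protected region around the target, and a target on every other burst after the second). The exact placement of the 16 masker frequencies relative to the target is our assumption, and all function and variable names are illustrative rather than taken from the original study.

import numpy as np

FS = 50000          # sampling rate (Hz), as in the study
BURST = 0.06        # burst duration (s)
RAMP = 0.02         # raised-cosine onset/offset ramps (s)

def masker_grid(f_target, protect_semitones, n_low=8, n_high=8):
    """Candidate masker frequencies, 2 semitones apart, outside the protected region.
    Assumption: half of the 16 candidate frequencies lie below and half above the target."""
    half = protect_semitones / 2.0
    low = f_target * 2.0 ** (-(half + 2.0 * np.arange(1, n_low + 1)) / 12.0)
    high = f_target * 2.0 ** ((half + 2.0 * np.arange(1, n_high + 1)) / 12.0)
    return np.concatenate([low, high])

def tone_burst(freq, fs=FS, dur=BURST, ramp=RAMP):
    t = np.arange(int(dur * fs)) / fs
    x = np.sin(2 * np.pi * freq * t)
    n_r = int(ramp * fs)
    env = np.ones_like(x)
    env[:n_r] = 0.5 * (1 - np.cos(np.pi * np.arange(n_r) / n_r))
    env[-n_r:] = env[:n_r][::-1]
    return x * env

def make_sequence(n_bursts=40, f_target=1000.0, protect=14, target_period=2, rng=None):
    """Sequence of multitone bursts; the target is added on every target_period-th burst
    starting from the third burst (one reading of 'every other burst after the second')."""
    rng = np.random.default_rng(rng)
    grid = masker_grid(f_target, protect)
    bursts = []
    for k in range(n_bursts):
        maskers = rng.choice(grid, size=8, replace=False)
        burst = sum(tone_burst(f) for f in maskers)
        if k >= 2 and (k - 2) % target_period == 0:
            burst = burst + tone_burst(f_target)
        bursts.append(burst)
    return np.concatenate(bursts)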
3 Results
Figure 2 shows how d′ in the detection task increased over the course of a 40-burst (2.4-s) sequence containing a target on every other burst. The filled and open symbols represent fixed and roving target-frequency conditions, respectively, and the different symbols denote different protected-region widths. It can be seen that performance generally increased over time, and that the increase was more rapid and marked as the protected-region width increased from 6 to 14 semitones.
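As an illustration of how such time-resolved d′ curves can be derived from the go/no-go data described in the Methods, the sketch below computes d′ at successive time points from the cumulative proportions of “detect” responses on target and catch trials, using the standard equal-variance Gaussian model. This is a generic reconstruction, not necessarily the authors’ exact computation, and the function names are ours.

import numpy as np
from scipy.stats import norm

def dprime_over_time(target_rts, catch_rts, n_target, n_catch, times):
    """Time-resolved d' from response times in a go/no-go detection task.

    target_rts, catch_rts : arrays of response times (s) on trials with a response
    n_target, n_catch     : total numbers of target and catch trials
    times                 : array of time points after sequence onset (s)
    """
    d = []
    for t in times:
        hit = np.sum(target_rts <= t) / n_target        # cumulative hit rate
        fa = np.sum(catch_rts <= t) / n_catch           # cumulative false-alarm rate
        # keep rates away from 0 and 1 so the z-transform stays finite
        hit = np.clip(hit, 0.5 / n_target, 1 - 0.5 / n_target)
        fa = np.clip(fa, 0.5 / n_catch, 1 - 0.5 / n_catch)
        d.append(norm.ppf(hit) - norm.ppf(fa))
    return np.array(d)

# example: d' every 0.3 s over a 2.4-s sequence
# d = dprime_over_time(target_rts, catch_rts, 16, 16, np.arange(0.3, 2.5, 0.3))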
Fig. 2 Detection performance as a function of time after sequence onset in the fixed (open symbols) and roving (filled symbols) target-frequency conditions, for different protected-region widths (as indicated by the legend). The line with no symbols shows the predictions of a serial-search model (see text). In this and other figures, the error bars denote standard errors of the mean across listeners
At the smallest width (6 semitones), performance was unaffected by the roving. At larger widths, roving the target frequency reduced both the rate at which performance improved and the asymptotic level that it reached. The line with no symbols in Fig. 2 shows the predictions of a serial-search model, which “scanned” the stimulus spectrogram sequentially, randomly selecting one of the components in every burst (with equal probability and no memory) until it found one that repeated twice at the same rate as the target – hence the initial “dead time”. The predictions of this model do not depend on the width of the protected region, and they clearly cannot account for the rapidly increasing performance of the human listeners at large protected-region widths, even when only the roving-target-frequency condition is considered. The major trends in the detection data were confirmed by the frequency-discrimination results. As shown in Fig. 3, performance improved as the number of bursts in the sequence increased, consistent with the improvement over time in the detection experiment. Moreover, as in the detection experiment, the increase in performance was steeper at large protected-region widths. Finally, here too, performance was higher in the fixed- than in the roving-target-frequency condition. Figure 4 illustrates the influence of the target repetition rate on discrimination performance. Presenting the target on every burst, as in Kidd et al. (1994), produced higher performance than presenting it every two bursts. However, further decreasing the target repetition rate, by presenting the target every three bursts, had no effect.
Fig. 3 Performance in the discrimination task as a function of the number of bursts in the sequence under the fixed (filled symbols) and roving (open symbols) target-frequency conditions. The different symbols indicate different protected-region widths, using the same format as in Fig. 2
Fig. 4 Performance in the discrimination task as a function of the number of bursts between consecutive targets, for 20-burst-long sequences with roving target frequency. The different symbols indicate different protected-region widths, using the same format as in Fig. 2
The detection data (not shown) showed similar trends, although in that task, reducing the target rate from one every two to one every three bursts did produce some decrement in performance. One last result was obtained in a condition where the target tones were distributed randomly across the bursts, forming an irregular temporal pattern. Testing this condition in the detection task turned out to be impractical, because listeners could no longer simply be asked to indicate whether they heard regularly repeating targets, and it was not obvious how else to instruct them. However, the condition could be tested in the frequency-discrimination task; informally, listeners found it no more difficult to perceive the frequency shift applied to the target tones when these were temporally irregular than when they occurred at regular intervals. The results (not shown here) obtained in this condition using 10- and 20-burst-long sequences confirmed this impression: performance with the temporally irregular targets was statistically indistinguishable from that obtained in the condition where the targets repeated at regular intervals with the same density.
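The serial-search baseline shown in Fig. 2 can be approximated with a short Monte-Carlo simulation of the mechanism described above. The sketch below is our own simplified reconstruction (it simulates only target-present trials and ignores chance matches on masker frequencies), so it reproduces the qualitative “dead time” and gradual rise rather than the authors’ exact predictions.

import numpy as np

def serial_search_times(n_trials=10000, n_bursts=40, n_components=9,
                        target_period=2, burst_dur=0.06, seed=0):
    """Monte-Carlo sketch of the serial-search model: on each target-bearing burst the
    model happens to have picked the target (rather than one of the 8 maskers) with
    probability 1/n_components; detection is declared once the picked component is
    confirmed to repeat at the target rate, one period later (hence the initial dead time).
    Maskers are assumed never to repeat at exactly the target rate in this simplification."""
    rng = np.random.default_rng(seed)
    times = np.full(n_trials, np.nan)
    for trial in range(n_trials):
        for k in range(2, n_bursts - target_period, target_period):
            if rng.random() < 1.0 / n_components:
                times[trial] = (k + target_period) * burst_dur
                break
    return times  # NaN where the model never locked onto the target

# the cumulative proportion of `times` below t traces the rising part of the model curve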
4 Discussion
The results allow us to rule out certain potential explanations for how a tone sequence is detected in a random-masker background. First, detection does not seem to depend on average rate or energy differences between the target and masker frequencies: most of the conditions presented here had
average target rates and energy that were identical to those of the masker components, and yet detection was possible, even when the target frequency was roved. Thus, the results argue strongly against temporal energy integration and event-counting mechanisms. Second, in the case of the roved target frequency, detection did not seem to be based on a serial search mechanism, which monitors each frequency band sequentially for the presence of the target. Third, the results argue against detection based on the temporal regularity of the target, since making the targets randomly irregular in time did not adversely affect discrimination performance. The results of this study are generally consistent with the hypothesis that the detection of repeating target tones depends on the same mechanisms that have been proposed for the formation of auditory streams of pure tones. In particular, the results show a strong and systematic dependence of performance on the size of the protected region around the target tones. This parameter can be thought of as an analogue of the frequency-separation parameter in streaming experiments involving repeating tone sequences (Bregman 1990), and its influence is likely to depend on the frequency selectivity of neurons in the central auditory system (Fishman et al. 2001; Micheyl et al. 2005b). The improvement in performance found when the targets were presented every burst is also consistent with the fact that the degree of streaming is related to the gaps between successive tones in a stream (Bregman et al. 2000). Based on our other results (above) and those of Kidd et al. (2003), the improvement is unlikely to be due simply to increased signal energy or multiple looks. Alternative explanations based on physiological findings, such as response enhancement effects (Brosch and Schreiner 2000), remain viable. A final similarity between our results and the known properties of auditory streaming relates to how performance improved over time, or as the number of bursts increased. Although we ruled out temporal energy-integration and serial-search mechanisms, this effect could still be explained by sensory-evidence-accumulation mechanisms; however, it is unclear just what evidence is accumulated, since the temporal regularity of the target does not seem to be key. A possible solution is that the increasing salience of the targets is not mediated by the accumulation of sensory information but rather by adaptation. Micheyl et al. (2005b) have recently shown how multi-second neural adaptation in the auditory cortex may explain the build-up of auditory stream segregation (Bregman 1978). Unfortunately, the characteristics of cortical adaptation to randomly varying tones like those used here are not known, and we can therefore only speculate as to whether this type of phenomenon may also explain the increasing salience of repeating target tones in random multitone backgrounds. Neurophysiological and modeling studies, which are currently being performed (see Elhilali and Shamma, this volume), will hopefully answer this and other important questions. Acknowledgments. Work supported by NIDCD grant R01 DC 07657. The authors would like to thank Gerald Kidd for helpful suggestions.
References
Bregman AS (1978) Auditory streaming is cumulative. J Exp Psychol 4:380–387
Bregman AS (1990) Auditory scene analysis. MIT Press, Cambridge
Bregman AS, Ahad PA, Crum PA, O’Reilly J (2000) Effects of time intervals and tone durations on auditory stream segregation. Percept Psychophys 62:626–636
Brosch M, Schreiner CE (2000) Sequence sensitivity of neurons in cat primary auditory cortex. Cereb Cortex 10:1155–1167
Demany L, Ramos C (2005) On the binding of successive sounds: perceiving shifts in nonperceived pitches. J Acoust Soc Am 117:833–841
Fishman YI, Reser DH, Arezzo JC, Steinschneider M (2001) Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hear Res 151:167–187
Kidd G Jr, Mason CR, Deliwala PS, Woods WS, Colburn HS (1994) Reducing informational masking by sound segregation. J Acoust Soc Am 95:3475–3480
Kidd G Jr, Mason CR, Richards VM (2003) Multiple bursts, multiple looks, and stream coherence in the release from informational masking. J Acoust Soc Am 114:2835–2845
Micheyl C, Carlyon RP, Cusack R, Moore BCJ (2005a) Performance measures of auditory organization. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York, pp 203–209
Micheyl C, Tian B, Carlyon RP, Rauschecker JP (2005b) Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48:139–148
Neff DL, Green DM (1987) Masking produced by spectral uncertainty with multicomponent maskers. Percept Psychophys 41:409–415
Watson CS, Wroton HW, Kelly WJ, Benbassat CA (1975) Factors in the discrimination of tonal patterns. I. Component frequency, temporal position, and silent intervals. J Acoust Soc Am 57:1175–1185
30 The Dynamics of Auditory Streaming: Psychophysics, Neuroimaging, and Modeling
MAKIO KASHINO1,2,3, MINAE OKADA2, SHIN MIZUTANI1, PETER DAVIS1, AND HIROHITO M. KONDO1
1 Introduction
Listening to a speaker or a melody in the presence of competing sounds crucially depends on our brain’s sophisticated ability to organize a complex sound mixture changing over time into coherent perceptual objects or “streams”, which generally correspond to sound sources in the environment (Bregman 1990). The acoustic factors governing this “auditory streaming” are well established (Carlyon 2004), and various theories have been proposed to explain auditory stream formation (Anstis and Saida 1985; Beauvois and Meddis 1991; Bregman 1990; Hartmann and Johnson 1991; McCabe and Denham 1997; van Noorden 1975). However, it remains unclear how and where auditory streaming is achieved in the brain. A common limitation of early studies that tried to find the neural correlates of auditory streaming is that the neural response patterns corresponding to the different states of perceptual streaming were evoked by physically different stimuli (Alain et al. 1998; Fishman et al. 2001, 2004; Näätänen et al. 2001; Sussman et al. 1999). This makes it difficult to determine whether the neural response patterns reflect perception per se or whether they simply reflect the physical properties of the stimuli. To overcome this difficulty, recent studies have taken advantage of the fact that the segregation of sounds into streams typically takes several seconds to build up (Carlyon et al. 2001; Cusack 2005; Gutschalk et al. 2005; Micheyl et al. 2005). Under appropriate conditions, a physically unchanging sequence of alternating tones initially tends to be heard as a single coherent stream, and after several seconds it appears to split into two distinct streams (Anstis and Saida 1985). This makes it possible to compare neural responses corresponding to different percepts without introducing any physical change in the stimulus. Based on this approach, the neural correlates of auditory streaming have been found in the primary auditory cortex (Micheyl et al. 2005), the non-primary auditory cortex (Gutschalk et al. 2005), and the intraparietal sulcus (Cusack 2005).
1 NTT Communication Science Laboratories, NTT Corporation, Japan,
[email protected],
[email protected],
[email protected],
[email protected] 2 ERATO Shimojo Implicit Brain Function Project, JST, Japan,
[email protected] 3 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Japan
These findings appear to diverge, and additional lines of evidence are needed. Here, we focus on an aspect of auditory streaming that has received little attention, namely the spontaneous transitions between percepts of a physically unchanging sequence of alternating tones, not only during but also after the initial build-up of stream segregation. We examine the nature of these perceptual transitions in a psychophysical experiment and analyze the data from the viewpoint of a stochastic point process. Moreover, we explore brain activity correlated with the perceptual transitions using functional magnetic resonance imaging (fMRI).
2 Psychophysical Experiment and Stochastic Process Analysis
2.1 Methods
The test sequences were 900 repetitions (6 min) of a triplet pattern composed of L and H tones (Fig. 1). The frequency difference (∆ƒ) between L and H was varied from approximately 1/12 to 1 octave while the center frequency was fixed at 1 kHz. To avoid harmonic consonance between L and H, the frequencies of the tones were set to the following values: L = 967 Hz and H = 1039 Hz in the ∆ƒ≈1/12 octave condition, L = 937 Hz and H = 1069 Hz in the ∆ƒ≈1/6 octave condition, L = 883 Hz and H = 1129 Hz in the ∆ƒ≈1/3 octave condition, L = 823 Hz and H = 1213 Hz in the ∆ƒ≈1/2 octave condition, and L = 691 Hz and H = 1447 Hz in the ∆ƒ≈1 octave condition. The duration of each tone was 40 ms, including rising and falling cosine ramps of 10 ms. The onset-to-onset interval between adjacent L and H tones within a triplet was 100 ms, and that between neighboring triplets was 200 ms. Each triplet cycle lasted 400 ms, including 160 ms of silence after the offset of the third tone.
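For illustration, the triplet sequences described above can be synthesized with a few lines of Python. The sketch below uses the tone frequencies and timing from the text; the sampling rate and function names are our assumptions.

import numpy as np

FS = 44100  # sampling rate (Hz); our assumption, not stated in the chapter

# (L, H) frequencies in Hz for each nominal delta-f condition (values from the text)
FREQ_PAIRS = {'1/12': (967, 1039), '1/6': (937, 1069), '1/3': (883, 1129),
              '1/2': (823, 1213), '1': (691, 1447)}

def tone(freq, dur=0.040, ramp=0.010, fs=FS):
    t = np.arange(int(dur * fs)) / fs
    x = np.sin(2 * np.pi * freq * t)
    n = int(ramp * fs)
    r = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))   # rising cosine ramp
    x[:n] *= r
    x[-n:] *= r[::-1]
    return x

def lhl_sequence(delta_f='1/2', n_triplets=900, fs=FS):
    """LHL- triplet sequence: tone onsets at 0, 100 and 200 ms, 400 ms per triplet cycle
    (160 ms of silence after the offset of the third tone)."""
    L, H = FREQ_PAIRS[delta_f]
    cycle = np.zeros(int(0.400 * fs))
    for onset, f in [(0.000, L), (0.100, H), (0.200, L)]:
        x = tone(f, fs=fs)
        i = int(onset * fs)
        cycle[i:i + len(x)] += x
    return np.tile(cycle, n_triplets)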
Fig. 1 Schematic representation of test sounds used in the experiment
Ten adults aged from 20 to 33 years with normal hearing participated. They were instructed to listen to the test sequence passively, without any particular focus or effort, and to judge whether they perceived “one stream” (LHL-LHL-. . .) with a galloping rhythm or “two streams” (L-L-. . . and H---H---. . .), each with an isochronous rhythm. Participants responded by pressing the corresponding key of a response box whenever their perception changed during a session. The stimuli were presented to the left ear through a headphone at 60 dB SPL. Each participant ran ten sessions for each of the five ∆ƒ conditions.
2.2 Results and Discussion
In each session, we obtained a series of response times at which the percept changed from “one stream” to “two streams” or in the opposite direction. In all ∆ƒ conditions, for all participants, the initial response was “one stream”, which then changed to “two streams” within several seconds. This confirmed the cumulative effect of stream segregation reported in previous studies (Anstis and Saida 1985). The time required for the build-up of streaming decreased as ∆ƒ increased. This tendency is evident in the time course of the mean number of perceived streams, averaged every millisecond across sessions and participants in each ∆ƒ condition (Fig. 2). As the stimulus presentation progressed further, participants’ responses changed frequently in all ∆ƒ conditions. Figure 3 shows, in the line plot, the mean number of perceptual transitions (Nt) averaged across sessions and participants for each ∆ƒ condition.
Fig. 2 Time course of the mean number of perceived streams. Each line shows the transition of the mean number of perceived streams every millisecond
Fig. 3 The mean number of perceptual transitions in a single session (Nt) in a line plot, and the mean total duration of “one stream” (T1) and “two streams” (T2) in bar plots
More than 30 perceptual transitions occurred in a session, regardless of the size of ∆ƒ. This is surprising, because it has been well established that such perceptual transitions can occur only within a limited range of ∆ƒ (1/3 ≤ ∆ƒ ≤ 1/2 octave at the repetition rate of 100 ms) (van Noorden 1975). Figure 3 also shows the mean total duration of “one stream” responses (T1) and that of “two streams” responses (T2) per session (360 s) in bar plots. Both values were averaged across sessions and participants in each ∆ƒ condition. As the size of ∆ƒ increased, the value of T1 decreased and that of T2 increased, and their proportions in a session reversed at ∆ƒ≈1/3 octave. The autocorrelation functions of the response time series revealed no periodicity in the perceptual changes in any ∆ƒ condition. Moreover, no correlation was found between successive intervals of the same percept, or between successive intervals of different percepts. These findings suggest that auditory streaming should be thought of as a stochastic process: we know the likelihood of a particular percept but cannot determine which percept will occur at a particular time. The likelihood of a percept depends on ∆ƒ in the present case, but there is no fixed perceptual boundary along the ∆ƒ axis. Figure 4 shows the histograms of the duration of each perceptual state for each ∆ƒ condition, pooled over the ten sessions and ten participants. The width of each bin is 500 ms. All of the distributions show a fast rise and a long tail. We fitted lognormal, gamma, and Weibull distributions to the histograms using the maximum likelihood method, and found that the lognormal distribution gave the best fit in 92% of the histograms. The probability density function of the lognormal distribution is given by

p(t) = \frac{1}{\sqrt{2\pi}\,\sigma t}\, e^{-(\ln t - \mu)^{2}/2\sigma^{2}}    (1)

where µ and σ are the mean and standard deviation of the logarithm of t (for t ≥ 0), respectively.
Fig. 4 Histograms of the durations of each of the two percepts and the estimated probability distribution of durations for each ∆ƒ condition
Each panel of Fig. 4 shows the values of the parameters µ and σ and the estimated lognormal distribution, plotted as a continuous line, in each ∆ƒ condition. The Kolmogorov-Smirnov goodness-of-fit test rejected the lognormal hypothesis in 32% of the histograms (p<0.05). This outcome is comparable to a recent study on perceptual transitions for ambiguous visual figures, which also found the lognormal distribution to give the best fit (Zhou et al. 2004). Next, we assumed two independent stochastic point processes that alternate with each other, one governing transitions from “one stream” to “two streams” and the other governing transitions in the opposite direction. Since no correlation was found between successive intervals, the rate of transition depends only on the time elapsed since the previous transition. The transition rate λ(t) can be calculated from the duration distribution p(t) by Eq. (2):

\lambda(t) = \frac{p(t)}{1 - \int_{0}^{t} p(x)\,dx}    (2)
Fig. 5 Transition rates for a typical participant calculated from experimental histograms (cross points) and from the best fitting lognormal distribution (lines)
Figure 5 shows examples of transition rates for a typical participant calculated from experimental histograms (cross points) and from the best fitting lognormal distribution (lines). The transition rates grow quickly after the previous transition, and decay slowly to nonzero values. This shape cannot be produced by the gamma or Weibull distributions. Neural mechanisms having such transition rates may underlie the spontaneous transitions in auditory streaming.
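For reference, Eq. (2) and the lognormal fit can be reproduced in a few lines. The sketch below estimates µ and σ by maximum likelihood (the closed-form estimators for a lognormal sample) and evaluates the corresponding transition rate; variable and function names are ours.

import numpy as np
from scipy.stats import lognorm

def fit_lognormal(durations):
    """Maximum-likelihood estimates of mu and sigma for lognormally distributed durations."""
    log_d = np.log(durations)
    return log_d.mean(), log_d.std(ddof=0)

def transition_rate(t, mu, sigma):
    """Eq. (2): lambda(t) = p(t) / (1 - P(t)) for the fitted lognormal distribution."""
    dist = lognorm(s=sigma, scale=np.exp(mu))   # scipy's parameterization of the lognormal
    return dist.pdf(t) / dist.sf(t)             # sf(t) = 1 - CDF(t)

# example: transition rate over the first 30 s for one percept's durations
# mu, sigma = fit_lognormal(durations)
# lam = transition_rate(np.linspace(0.5, 30, 100), mu, sigma)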
3 Neuroimaging
3.1 Methods
We performed a pretest to select, for the fMRI experiment, participants who had long intervals between perceptual transitions, taking into account the slow time constants of the blood oxygen level dependent (BOLD) response. The 24 participants selected by the pretest (12 males, 12 females; 19–30 years of age) were right-handed adults with no history of neurological or psychiatric illness. The stimuli and procedure of the pretest and the fMRI experiment were essentially the same as those in the psychophysical experiment described in Sect. 2, except for the following points. The pretest and the fMRI experiment consisted of 5 sessions (90 s each). Only 2 ∆ƒ conditions (∆ƒ≈1/6 and 1/2 octave) were tested, using separate groups of participants (12 participants each).
Images were obtained using a 1.5-T MRI scanner. Functional images sensitive to the BOLD signal were acquired with a single-shot echo-planar imaging sequence (TR 2 s, TE 48 ms, flip angle 80°, voxel size 3 × 3 × 7 mm, 20 contiguous slices). Data were analyzed with SPM2. Each perceptual-transition event was modeled with a canonical hemodynamic response function. A fixed-effects model was used to obtain activation maps for subject-specific linear contrasts. These contrasts were then entered into a random-effects model to estimate group-averaged activations.
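As an illustration of this event-related modeling step, the sketch below builds a regressor by convolving reported transition times with a canonical double-gamma HRF. The HRF parameters are the common SPM-style defaults, assumed here rather than taken from the chapter, and the function names are ours.

import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr=2.0, duration=32.0):
    """Double-gamma hemodynamic response function sampled at the TR (SPM-like defaults)."""
    t = np.arange(0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

def transition_regressor(transition_times, n_scans, tr=2.0):
    """Convolve a train of perceptual-transition events with the canonical HRF."""
    events = np.zeros(n_scans)
    idx = np.round(np.asarray(transition_times) / tr).astype(int)
    events[idx[idx < n_scans]] = 1.0
    return np.convolve(events, canonical_hrf(tr))[:n_scans]

# example for one 90-s session (45 scans at TR = 2 s), with illustrative transition times:
# reg = transition_regressor([12.4, 31.0, 55.2, 70.8], n_scans=45)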
3.2 Results and Discussion
The behavioral data from the pretest and the fMRI experiment replicated the essential features of the psychophysical data described in Sect. 2. In the ∆ƒ≈1/2 octave condition, significant activation synchronized with the perceptual transitions was observed in the auditory areas (BA42), the superior temporal sulcus (BA21/22), and the posterior insular cortex. In the ∆ƒ≈1/6 octave condition, the supramarginal gyrus (BA40), the left intraparietal sulcus (BA7), and the thalamus were activated in addition to the regions activated in the ∆ƒ≈1/2 octave condition.
Fig. 6 Regions activated when the percept changed from “one stream” to “two streams” (bottom) and in the opposite direction (top) in the ∆ƒ≈1/6 octave condition (left) and in the ∆ƒ≈1/2 octave condition (right) (z=0, N=12)
Figure 6 shows regions activated when the percept changed from “one stream” to “two streams” and in the opposite direction, separately for the two ∆ƒ conditions (z = 0). In the ∆ƒ≈1/6 octave condition, where the “one stream” percept was dominant, more activation was observed at the transition from “one stream” to “two streams” than at the transition in the opposite direction. On the other hand, in the ∆ƒ≈1/2 octave condition, where the “two streams” percept was dominant, more activation was observed at the transition from “two streams” to “one stream”. Apparently, escaping from a dominant percept is associated with higher activation. The involvement of the auditory areas in auditory streaming has been shown by single-unit recordings in animals (Micheyl et al. 2005) and by magnetoencephalography in humans (Gutschalk et al. 2005), but not by fMRI (Cusack 2005). The discrepancy may be due to the poor temporal resolution of fMRI. However, the present study revealed activation of the auditory areas related to auditory streaming using fMRI, taking advantage of event-related image acquisition time-locked to the changes in percept. Our findings indicate that the formation of streams may involve multiple neural sites, from subcortical to supra-sensory levels.
4 Conclusions
We have uncovered several features of perceptual transitions in auditory streaming, including their stochasticity, interval distributions, and time-dependent transition rates. These features cannot readily be explained by previous theories of streaming, such as the peripheral channeling theory (Beauvois and Meddis 1991; Hartmann and Johnson 1991), the pitch-jump detector theory (Anstis and Saida 1985), or the accumulation-of-evidence theory (Bregman 1990). Some recent studies assume that neural habituation observed in the auditory areas provides a basis for streaming (Fishman et al. 2001; Micheyl et al. 2005). However, such a theory cannot explain the interaction between the direction of transition and ∆ƒ shown in the present fMRI experiment. We suggest that theories of auditory streaming should incorporate neural dynamics. Detailed data on the perceptual transitions in auditory streaming and their neural correlates would provide important constraints on the development of such models.
References
Alain C, Cortese F, Picton TW (1998) Event-related activity associated with auditory pattern processing. Neuroreport 15:3537–3541
Anstis S, Saida S (1985) Adaptation to auditory streaming of frequency-modulated tones. J Exp Psychol Hum Percept Perform 11:257–271
Beauvois MW, Meddis R (1991) A computer model of auditory stream segregation. Q J Exp Psychol 43:517–541
Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA
Carlyon RP (2004) How the brain separates sounds. Trends Cogn Sci 8:465–471
Carlyon RP, Cusack R, Foxton JM, Robertson IH (2001) Effects of attention and unilateral neglect on auditory stream segregation. J Exp Psychol Hum Percept Perform 27:115–127
Cusack R (2005) Intraparietal sulcus and perceptual organization. J Cogn Neurosci 17:641–651
Fishman YI, Reser DH, Arezzo JC, Steinschneider M (2001) Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hear Res 151:167–187
Fishman YI, Arezzo JC, Steinschneider M (2004) Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J Acoust Soc Am 116:1656–1670
Gutschalk A, Micheyl C, Melcher JR, Rupp A, Scherg M, Oxenham AJ (2005) Neuromagnetic correlates of streaming in human auditory cortex. J Neurosci 25:5382–5388
Hartmann WM, Johnson D (1991) Stream segregation and peripheral channeling. Music Percept 9:155–183
McCabe SL, Denham MJ (1997) A model of auditory streaming. J Acoust Soc Am 101:1611–1621
Micheyl C, Tian B, Carlyon RP, Rauschecker JP (2005) Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48:139–148
Näätänen R, Tervaniemi M, Sussman E, Paavilainen P, Winkler I (2001) ‘Primitive intelligence’ in the auditory cortex. Trends Neurosci 24:283–288
Sussman E, Ritter W, Vaughan HG Jr (1999) An investigation of the auditory streaming effect using event-related brain potentials. Psychophysiology 36:22–34
van Noorden LPAS (1975) Temporal coherence in the perception of tone sequences. Unpublished doctoral dissertation, Eindhoven University of Technology
Zhou YH, Gao JB, White KD, Merk I, Yao K (2004) Perceptual dominance time distributions in multistable visual perception. Biol Cybern 90:256–263
Comment by Langner
Looking at a Necker cube for some minutes, I see the two possible three-dimensional perspectives oscillate, first with a slow period of several seconds, until the alternation gets faster and the figure is finally blurred into a two-dimensional set of lines. It seems to me that at least one of the transition curves in Fig. 5 shows periodic peaks, which may indicate a similar oscillatory behaviour for your paradigm.
Reply
We re-examined the individual data on perceptual transitions, but did not find clear evidence for such an oscillation. Related findings have been reported by Pressnitzer and Hupé (2006), who found that the first percept is significantly longer than subsequent ones in both visual and auditory bistable perception, but that there is no long-term trend in the duration of phases after the first one.
References
Pressnitzer D, Hupé JM (2006) Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Curr Biol 16:1351–1357
31 Auditory Stream Segregation Based on Speaker Size, and Identification of Size-Modulated Vowel Sequences
MINORU TSUZAKI1, CHIHIRO TAKESHIMA1, TOSHIO IRINO2, AND ROY D. PATTERSON3
1 Introduction
When a receiver of acoustic signals is surrounded by several vibrating bodies, it becomes important to “sort out” the incoming sound energy into subparts that appropriately represent the original sources. This is the problem of source segregation, which has been investigated in many ways as a core issue of auditory scene analysis. Pitch, the perceptual attribute corresponding to the fundamental periodicity, has been regarded as one of the most important cues for sound segregation. It is also known that “timbre” can function as another cue (Bregman 1990). However, the definition of timbre remains ambiguous. The work by Irino and Patterson (2002) on the wavelet-Mellin image has drawn attention to the scale dimension in natural sounds, and to the information it carries about the size of the resonators in a source. The existence of this scale information illustrates the ambiguity of timbre: is it just a dimension of timbre, or is it a dimension of perception in its own right, like pitch? If it is a dimension that can be separated from the rest of the timbre information using the Mellin transform, as suggested by Irino and Patterson (2002), this would explain listeners’ ability to estimate speaker size, as reported recently by Ives et al. (2005). Tsuzaki and Irino (2004) tried to estimate the temporal resolution of this computational process by investigating the identification of vowel sequences whose “size” was modulated sinusoidally. A puzzling aspect of the results was that performance did not show a monotonic, low-pass characteristic: although it tended to drop when the modulation period was around 250 ms, performance became better again at shorter modulation periods. The sinusoidal modulation of Tsuzaki and Irino (2004) was applied independently of the vowel durations in the sequence. Informal listening indicated that the stimuli with the short modulation period, i.e., the most rapidly size-modulated speech, sounded as if two people were speaking identical utterances simultaneously.
1 Kyoto City University of Arts, Kyoto, Japan, [email protected]
2 Department of Design Information Sciences, Wakayama University, Wakayama, Japan, [email protected]
3 Centre for the Neural Basis of Hearing, Cambridge University, Cambridge, UK, [email protected]
The implication was that listeners segregated the speech into two auditory streams based on the size information. The fast modulation condition in the study of Tsuzaki and Irino (2004) produced the perception of size modulation without disrupting the vowel-type information. We wondered how the perception would change if the size and the vowel-type information changed coincidentally. Would listeners hear two concurrent speakers saying different things? Or would the perception simply become chaotic? To answer these questions, two experiments were conducted. The first experiment investigated the identification of the vowels in size-modulated sequences. The second experiment evaluated the detection of a target vowel in size-modulated vowel sequences.
2 Experiment 1: Identification of Vowels in Size-Modulated Sequences
The purpose of the first experiment was to investigate the effects of the depth and speed of size modulation on the identification of vowel sequences whose size parameter alternated between two values, vowel by vowel. If the auditory system is able to extract size information and to build images of two concurrent sources, identification of the whole sequence should become more difficult as faster and deeper modulation is applied, because of the difficulty in judging the order.
2.1 Stimulus
Vowel sequences were synthesized with a channel vocoder, STRAIGHT (Kawahara et al. 1999; Kawahara and Irino 2005), based on sampled natural utterances by a Japanese male speaker. All the sequences had six segments, and each segment contained one of five Japanese vowels, i.e., “a”, “e”, “i”, “o”, and “u”. The sequences were generated by concatenating “doublets” of two vowels. For example, a sequence “aiu” was generated by concatenating “ai” and “iu”, where the middle of the “i” segment was used for the transition. In the transition, the spectrum changed gradually from that of the first “i” into that of the second to minimize the discontinuity associated with the concatenation. The forty sequences shown in Table 1 were prepared as base sequences. The first and last segments always contained the same vowel, and the middle four segments were permutations of the other four vowels. Size modification was applied by dilating or compressing the frequency axis of the STRAIGHT spectra for the vowel. Dilation raises formant frequencies proportionally, which corresponds to a reduction of vocal tract size; conversely, compression lowers the formants. The size modulation was achieved by alternating dilation and compression segment by segment. The modulation depth, defined as the amount of size modification in one direction, was either a quarter or half an octave.
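Conceptually, the size manipulation is a uniform scaling of the spectral-envelope frequency axis. The sketch below applies such a scaling by interpolation and alternates its sign segment by segment; it is a schematic stand-in for the actual STRAIGHT-based processing, and the function names are ours.

import numpy as np

def scale_spectral_envelope(env, depth_octaves):
    """Dilate (depth > 0) or compress (depth < 0) the frequency axis of a spectral
    envelope by a factor 2**depth_octaves; dilation raises formants proportionally,
    mimicking a shorter vocal tract.

    env : array (n_freq_bins,), spectral envelope sampled on a linear frequency axis
    """
    n = len(env)
    factor = 2.0 ** depth_octaves
    src_bins = np.arange(n) / factor            # where each output bin reads from
    return np.interp(src_bins, np.arange(n), env, right=env[-1])

def alternate_size(envelopes, depth_octaves=0.25):
    """Apply +/- depth alternately, segment by segment, as in Experiment 1."""
    return [scale_spectral_envelope(e, depth_octaves * (1 if k % 2 == 0 else -1))
            for k, e in enumerate(envelopes)]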
Table 1 List of vowel sequences
The other main factor was the speed of size modulation. The speaking rate of the original utterances was slower than natural speech: the average segment duration was 340 ms. Two conditions were prepared from the recordings by reducing the segment duration to either one half or one quarter of the original duration. For convenience, the former will be referred to as the “fast” condition and the latter as the “slow” condition. The F0 pattern of the original sequence was used for all the stimuli. Accordingly, there were no abrupt changes in the F0 contour at the segment boundaries.
2.2 Listeners
Six students of Kyoto City University of Arts participated in the experiment. Their audiograms were normal, and they were paid for their participation. Four listeners were assigned to the slow condition and the other two to the fast condition.
2.3 Procedure
Modulation depth was a within-listener factor, while modulation speed was a between-group factor. Each listener was presented with three modulation depths, i.e., 0, 1/4, and 1/2 octave, in either the fast or the slow form. The task of the listeners was to identify all six vowels in each sequence in the correct order, using virtual buttons on a GUI labeled with the five vowel names. No feedback was provided. Each listener received 40 trials at each modulation depth and 20 trials in the no-modulation condition for each of the 40 sequences, giving 4000 trials per listener. The stimuli were synthesized off-line in advance on a workstation (Apple PowerMac G5) and presented to the listeners by a DSP system controlled by a workstation (SymbolicSound Capybara 350 + Apple iMac G5) through headphones (Sennheiser HD 600, amplified with a Luxman P1).
2.4 Results and Discussion
The vowels in the initial and final positions were regarded as “fillers” and were discarded from the analysis because of strong primacy and recency effects. For each modulation speed, the percentage of trials on which the four central vowels were identified correctly was calculated and plotted as a function of the modulation depth in Fig. 1. Percent correct decreases as modulation depth increases in both the fast and slow conditions. In addition, the listeners performed better in the slow condition than in the fast condition. Performance in the control condition with no modulation was not perfect. To estimate the deterioration caused by the size modulation, the ratio of percent correct in each test condition to percent correct in the control condition was calculated for each listener and each token. The geometric mean of these scores is plotted as a function of modulation depth in Fig. 2. The data are consistent with the hypothesis that the auditory system segregates the sequence on the basis of speaker size. If the sequence were segregated into two streams, one from a “longer” vocal tract and the other from a “shorter” vocal tract, it would become difficult to perceive the correct order of the vowels, and hence to perform the task, which required reporting each sequence in the correct order. It is reasonable to assume that the segregation is augmented when the size separation becomes larger, as well as when the alternation occurs faster, as in the case of pitch-based segregation. It could also be that the observed deterioration with increasing size modulation was caused by a constraint of Mellin image construction, i.e., a limit on the temporal resolution of the process. This seems unlikely, however, for the following reason.
Fig. 1 Percent correct sequence identification plotted as a function of the modulation depth with the modulation speed as the parameter
Fig. 2 Ratio of percent correct in the test and control conditions as a function of modulation depth, with modulation speed as a parameter
The two lines in Figs. 1 and 2 are almost parallel, and this indicates that there is no interaction between these factors. If the observed deterioration with increasing size modulation were caused by a restriction in the rate of Mellin image construction, we would expect an interaction. The segregation hypothesis implies that the size information is properly extracted and normalized. This predicts that the identification of individual vowels will not suffer significantly from the size modulation. The purpose of Experiment 2 was to check this prediction by requiring listeners to detect a vowel in size-modulated sequences.
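For completeness, the deterioration score used here and again in Experiment 2, i.e., the geometric mean across listeners and tokens of the ratio of test to control percent correct, can be computed as in the short sketch below (array names are illustrative).

import numpy as np

def deterioration_score(pc_test, pc_control):
    """Geometric mean of per-listener, per-token ratios of percent correct in a test
    condition to percent correct in the control (no-modulation) condition.

    pc_test, pc_control : arrays of matching shape (n_listeners, n_tokens), in percent
    """
    ratios = np.asarray(pc_test, float) / np.asarray(pc_control, float)
    return np.exp(np.mean(np.log(ratios)))   # 1.0 means no deterioration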
3 Experiment 2: Detection of a Target Vowel in Size-Modulated Vowel Sequences
The purpose of Experiment 2 was to investigate the identification of individual vowels in the size-modulated sequences. The target-detection task was chosen to avoid judgments about order and to minimize changes in the stimulus characteristics.
3.1 Stimulus
Half of the stimuli were identical to those in Experiment 1. They are called “positive” stimuli, i.e., stimuli containing a target vowel, which was either the third or the fourth vowel in the sequence. The “negative” stimuli were copies of the positive stimuli in which the target vowel was replaced by a different
vowel, with the restriction that it was not the same as the first/last vowel and not the same as the preceding or following vowel. For example, if the positive stimulus was “aeioua” with “i” as the target vowel, the negative stimulus was “aeuoua”. The size modulation was applied as in Experiment 1, using the two modulation depths of 1/4 and 1/2 octave. There were also control conditions, with stimuli having no modulation. The speed of modulation was limited to the slow condition in this experiment.
3.2 Procedure
The listeners’ task was simply to say whether the target vowel existed in the sequence or not. At the start of each trial, the “target” vowel was displayed on the computer screen, on both positive and negative trials. Listeners provided their answer (yes or no) by clicking one of two buttons on the GUI. Feedback was provided at the end of each trial. Each listener was presented with 20 trials at each modulation depth (0, 1/4, or 1/2 octave) in each target position (3rd or 4th), for each of 40 pairs of positive and negative stimuli. Thus, there were 4800 trials per listener over 10 experimental sessions.
3.3 Listeners
Four students of Kyoto City University of Arts participated in the experiment. Their audiograms were normal, and they were paid for their participation.
3.4 Results and Discussion
The percentage of correct responses, averaged over listeners and sequence type, is shown in Table 2. The reduction in percent correct with increasing modulation depth was smaller than in Experiment 1. To evaluate the effect of size modulation, the ratio of percent correct in each test condition to that in the control condition was calculated, for each listener and each token, as in Experiment 1. The geometric mean is plotted as a function of the modulation depth in Fig. 3, together with the results from Experiment 1. If there were no effect of size modulation, the scores should be close to unity. The effect of modulation depth is much smaller than in Experiment 1. This supports the segregation hypothesis, which assumes that the perceptual problem with the size-modulated stimuli was mainly caused by the difficulty in judging the correct order of the vowels due to stream segregation.
Table 2 Average percent correct in Experiment 2
Modulation depth    0 oct    1/4 oct    1/2 oct
Percent correct     99       96         87
Fig. 3 Ratio of percent correct in the modulated condition to that in the unmodulated condition, plotted as a function of the modulation depth, for both Experiments 1 and 2
4 General Discussion
The results of Experiment 1 suggested that it becomes difficult to perceive the vowels as a single sequence when vocal tract size alternates segment by segment. On the other hand, it was less difficult to recognize a single vowel, as shown in Experiment 2. One could explain this task difference by assuming that the sequences were segregated on the basis of perceived size. For example, when a sequence like “aeuioa” was presented with the size factor alternating between “long” and “short” segment by segment, it would be perceived as two concurrent streams, i.e., a big person saying “a-u-o-” and a small person saying “-e-i-a”. It would then be difficult to judge whether the “e” in the second position came before or after the “u” in the third position. Problems in judging order are typical when sounds are segregated into streams. The fact that perceived speaker size functions as a streaming cue suggests that it is used in source identification, as might be expected. Although body size increases as animals mature, it is a very gradual
process, and over the course of a communication, size is normally fixed for a given source. It is worth noting that the listeners in Experiment 2 had to organize the stimuli pre-attentively. Although the target vowel was presented visually at the start of the trial, they were not told the position of the target within the stream, nor which stream it would appear in. The listeners had to store and remember the two streams to make the judgment. Acknowledgments. Work supported in part by the Grant-in-Aid for Scientific Research (C) No. 17530529 and (B) No. 18300060 from JSPS. Author RP was supported by the UK MRC (G9900369, G0500221) during the research.
References
Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Massachusetts
Irino T, Patterson R (2002) Segregating information about the size and shape of the vocal tract using a time-domain auditory model: the stabilised wavelet-Mellin transform. Speech Commun 36:181–203
Ives DT, Smith DRR, Patterson RD (2005) Discrimination of speaker size from syllable phrases. J Acoust Soc Am 118(6):3816–3822
Kawahara H, Irino T (2005) Underlying principles of a high-quality speech manipulation system STRAIGHT and its application to speech segregation. In: Divenyi P (ed) Speech separation by humans and machines. Kluwer Academic Pub, Dordrecht, pp 167–180
Kawahara H, Masuda-Katsuse I, de Cheveigné A (1999) Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun 27:187–207
Tsuzaki M, Irino T (2004) Perception of size-modulated speech: the relation between the modulation period and the vowel identification. Trans Tech Committee Psychol Physiolog Acoust, Acoust Soc Jpn H-2004-125, 34
Comment by Divenyi
You are asking your subjects to identify the sequence of six vowels modulated in the size domain as two interleaved three-vowel sequences. You consider a correct identification as an indication that the two sizes did not form two streams. In the opposite case, I think that one of the sizes (I guess the smaller and higher-pitched one) will be more salient than the other. If this is true, I think that a smart subject could produce a correct identification on every trial on which the sequence separates into two streams. He/she would do it by simply listening to the salient half sequence in the salient stream and reconstructing the sequence from that information, plus the fact of knowing that every sequence contains all five vowels and that the first and last vowels are identical.
Reply
First, we used the ‘original’ F0 to avoid (or minimize) segregation by pitch. Although it may be possible for a listener to attend to one of the two streams, that strategy prevents him/her from obtaining the necessary information from the unattended stream. Because we used two versions of the modulated stimuli, i.e., one starting with a small size and one starting with a large size, and both versions were tested in random order within a session, the listener would still be required to judge correctly not only the order of the two streams but also the order of the vowels in the unattended stream.
Comment by Roberts
You have used a reduction in the ability of listeners to identify correctly the temporal order of a sequence of six vowels as evidence that the sequence has segregated into two streams, based on differences in the size of the vocal tract. However, you have not included an analysis of the nature of the errors made by the listeners. Stream segregation should lead to errors in identifying the relative order of acoustic elements heard in different streams, but the relative order of elements within a stream should be judged correctly. Only if this pattern of errors has occurred can the decline in accuracy be taken as evidence of stream segregation. Even if clear evidence of stream segregation can be shown, the effect of changing the size of the vocal tract may be an indirect one. This is because introducing a difference in vocal tract size is likely to increase the extent to which the excitation patterns evoked by the stimuli differ, and hence to increase the peripheral channelling cues for segregation.
identification. The results indicated that they could not. This suggests that there was another factor at work, and we think it is size-based streaming.

Comment by Hartmann
Stream segregation on the basis of the perceived size of a talker is an interesting idea, but there may be a more parsimonious explanation for the effects you observe. Large and small talkers are likely to produce vowels with important spectral differences, and it is known that musical tones with different spectra are readily segregated (Wessel 1979). Different spectra excite different tonotopic regions of the auditory periphery. It might be possible to filter your stimuli in a way that preserves the distinction between large and small talkers but eliminates the grossest spectral differences. However, even rather subtle spectral differences are able to produce streaming. For instance, Hartmann and Johnson (1991) found that tones with only odd harmonics could be segregated from tones with only a fundamental and even harmonics, though the overall spectral envelopes were the same for both kinds of tone.

References
Hartmann WM, Johnson D (1991) Stream segregation and peripheral channeling. Music Percept 9:155–184
Wessel DL (1979) Timbre space as a musical control structure. Comput Music J 3:45–52
32 Auditory Scene Analysis: A Prerequisite for Loudness Perception NICOLAS GRIMAULT1, STEPHEN MCADAMS2, AND JONT B. ALLEN3
1 Introduction
Auditory scene analysis (ASA) refers to the ability of the auditory system to organize perceptually a sound mixture into distinct auditory streams, each stream ideally corresponding to a single sound source (see Bregman 1990 for a review). Level cues can be used to promote perceptual segregation of auditory objects. For example, temporal sequences of pure tone bursts with the same frequencies but with alternating levels can be perceived as segregated as soon as a level difference is introduced between the tones (van Noorden 1975). The inverse relation, i.e., the dependency of loudness on grouping processes, has been much less described in the literature, and the extent to which ASA processes can influence loudness perception remains largely undetermined. Jeng (1992) measured the relative loudness of two distinct sounds (speech and a simulated jack-hammer sound). She found that the total loudness did not follow the Zwicker power-spectrum loudness model (Zwicker and Fastl 1990). This is similar to the situation in which speech is masked by noise. As shown by Fletcher (Fletcher and Munson 1933, 1937; Fletcher 1938), speech can be masked by noise, and as the noise level is increased, the loudness of the speech is decreased. From the spectral point of view, the total loudness should be found by summing up all the specific loudness components. In fact, this is not the case, because of grouping. McAdams et al. (1998) provided an argument supporting the assumption that the loudness of an auditory event can be influenced by perceptual organization. The stimuli used in their experiments were alternating sequences of two identical bursts with no silent gap between them and played at different levels, well above threshold. According to Warren et al. (1972) and van Noorden (1975), these sequences lead to the perception of a continuous sound upon which is superimposed an additional intermittent stream of bursts. This phenomenon suggests that the
1 UMR 5020, CNRS, Université Claude Bernard Lyon 1, IFR19, Institut fédératif des neurosciences de Lyon, Lyon, France,
[email protected] 2 CIRMMT, Schulich School of Music, McGill University, Montréal, Québec, Canada H3A 1E3,
[email protected] 3 Beckman Inst., Urbana, IL 61801, USA,
[email protected]
higher level portion is interpreted by the auditory system as being composed of two parts: the continuation of the preceding sound and a new superimposed sound, each with its own loudness. The results of this experiment indicate that the loudness of the intermittent stream can be influenced by up to about 12 dB by this auditory continuity illusion. McAdams et al. (1998) discussed these results in the light of several loudness models and subtractive mechanisms (Warren 1999). The experiment presented here is a continuation of such investigations of the relation between grouping and loudness, across, rather than within, equivalent rectangular bandwidths. Our intuitive assumption is that a loudness value must always be associated with an auditory event (Allen 1996). As suggested by the Jeng experiment, two acoustic components belonging to distinct auditory objects do not contribute to a single loudness value. None of the loudness models described in the literature (see Plack and Carlyon 1995 for a review) takes into account the perceptual organization of the stimuli. It follows that the experimental data of McAdams et al. (1998) cannot be predicted by a subtraction mechanism, operating in sone units, that would occur subsequent to this loudness computation. These authors explored the limitations of simple pressure and power models. Predictably, these “linear” models accounted for their data better than the classical loudness models since the stimulus components all lie within equivalent rectangular bandwidths. Altogether, these results are consistent with the hypotheses that, first, within an equivalent rectangular bandwidth, pressure must sum, and, second, loudness is computed subsequent to auditory grouping mechanisms and consequently depends on these mechanisms. The experimental procedures used only involved the segregation of simultaneous auditory events with similar spectral properties in each equivalent rectangular bandwidth. Given the nature of the stimulus, this experimental design does not address how remote spectral components can lead to a single (or multiple) loudness sensation(s) that depend(s) on their grouping status, given that within an equivalent rectangular bandwidth the pressure sums, whereas across equivalent rectangular bandwidths, loudness sums. The goal of the present study is to investigate how ASA processes can influence loudness summation across remote equivalent rectangular bandwidths. In other words, this experiment has been specifically designed to check for a possible effect of auditory grouping on the loudness additivity law, as initially defined by Fletcher and Munson (1933); see also Allen (1996). The perceptual grouping status of two simultaneous tonal components has been shown in the literature to be strongly influenced by the spectral relationships they share (spectral regularities, harmonicity, etc.), and by the temporal context in which they are presented (Bregman and Tougas 1989). Thus, the tonal components of a two-tone complex that would otherwise be perceived as integrated will be heard individually once a tonal context that forces segregation is introduced (Bregman and Tougas 1989). It is known that a slight asynchrony between the onsets of the components will lead to segregation
(Bregman and Pinker 1978). In the present experiment, asynchrony and temporal context were used to promote either integration or segregation across conditions, while the loudness level of the components was measured.
2 Method
2.1 Subjects
Twenty subjects (10 males and 10 females, mean age 26.9 years) with no history of hearing disorder participated; 14 subjects participated in the 50 dB level condition and 15 in the 73.7 dB level condition. Of these, nine participated in both conditions, and one was dropped from the 73.7 dB condition.
2.2 Stimuli and Procedure
Five experimental conditions (C1 to C5), corresponding to the five sequences of bursts shown in Fig. 1, were generated. The burst duration, in all conditions, was 0.2 s including 3-ms linear rise and fall ramps. Bursts were separated by a 0.1-s silent gap except for C2 in which the bursts were alternately separated by 0.1-s or 0.4-s gaps. All sequences of bursts in all conditions were about 45 s long. During the sequence presentation, the subject’s task was always to adjust continuously the level of a particular tone (indicated by an arrow in Fig. 1) to match the loudness of either a pure tone (C1, C2, C3 and C5) or a two-tone complex (C4). The adjustment procedure was terminated at the end of the sequence presentation or when the subject pressed a button on a response box. The last adjusted level was used in this procedure as the adjustment result. Five consecutive adjustment procedures were performed and averaged in each condition. The levels of presentation for all non adjustable 1-kHz pure tones were either 50 or 73.7 dB SPL (the reference levels). All other tone levels were either adjustable or fixed experimentally (see below). The start levels for the adjustable tones were chosen at random between 5 and 8 dB above or below the reference levels for each sequence presentation. C1 and C2 simply consisted of the presentation of successive bursts of 1-kHz pure tones with two different rhythms. The subject’s task was to adjust the level of the tones to attain a sequence of constant loudness across events. C3 consisted of a repeating sequence including a fixed-level 1-kHz pure tone and an adjustable-level 2-kHz pure tone. The empirically determined matching levels obtained in this condition were then used to generate the two-tone complexes used in C4 and C5, independently for each individual. In fact, testing the additivity law as defined by Fletcher and Munson (1933) and Allen (1996) requires two-tone complexes with equally loud components.
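As a rough illustration of the burst construction described above, the following Python sketch builds one cycle of a C1-style sequence: a 0.2-s 1-kHz burst with 3-ms linear onset/offset ramps followed by a 0.1-s silent gap. The sampling rate, the full-scale reference used for level scaling, and the way the ramps are applied are assumptions made for the example; they are not taken from the authors' implementation.

```python
import numpy as np

fs = 44100                        # assumed sampling rate
f0 = 1000.0                       # 1-kHz tone
burst_dur, ramp_dur, gap_dur = 0.2, 0.003, 0.1

t = np.arange(int(burst_dur * fs)) / fs
burst = np.sin(2 * np.pi * f0 * t)

# 3-ms linear onset and offset ramps
n_ramp = int(ramp_dur * fs)
ramp = np.linspace(0.0, 1.0, n_ramp)
burst[:n_ramp] *= ramp
burst[-n_ramp:] *= ramp[::-1]

# scale to a nominal level; the 100-dB full-scale reference is an assumption
level_db = 50.0
burst *= 10 ** ((level_db - 100.0) / 20.0)

gap = np.zeros(int(gap_dur * fs))
sequence = np.tile(np.concatenate([burst, gap]), 10)   # ten bursts of a C1-like sequence
```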
Fig. 1 Schematic time-frequency representations of the first 1.4 seconds of the repeating sequences used in conditions C1 to C5 (see text for details)
C4 consisted of a repeating sequence including a simultaneous two-tone complex with frequencies equal to 1 kHz and 2 kHz at fixed levels determined in C3 and a 2-kHz adjustable-level pure tone. All subjects subjectively reported hearing the complex as integrated before proceeding to the loudness-adjustment task. This condition was designed to match the loudness of a pure tone with that of a perceptually integrated two-tone complex. C5 consisted of a repeating sequence including: 1) an asynchronous two-tone complex with frequencies equal to 1 kHz (component with 0.05 s asynchrony) and 2 kHz, both having the same loudness, 2) a 1-kHz pure tone, and 3) a 2-kHz adjustable-level pure tone. All subjects but one (who did not perform this condition and was subsequently eliminated from the analyses) subjectively reported hearing the complex as segregated before proceeding to the loudness-adjustment task. The subject’s task was to adjust the level of the 2-kHz tones to match the loudness of the 2-kHz component in the complex that was perceptually segregated from the 1-kHz component.
2.3 Apparatus
All stimuli were delivered to the subjects through a Tucker Davis Technology system including an analogue converter (TDT AD1), an anti-aliasing filter (TDT FT6-2), two programmable attenuators (TDT PA4), a mixer (TDT SM3), a headphone buffer (TDT HB6) and a Sennheiser HD 250 headphone. In order to adjust the level of the tones, the subject directly controlled one attenuator (TDT PA4) by using two buttons of a response box (TDT RBOX4/PI2). All signals were calibrated (Larson Davis LD824 system with an AEC101 coupler).
3 Results
The average adjustments in each experimental condition are shown in Fig. 2 for the 73.7 dB SPL (filled squares) and 50 dB SPL (unfilled squares) reference-level conditions. As expected, the adjustments in C1 led to values very close to the reference levels (74 dB and 50 dB, where 73.7 dB and 50 dB were targeted). As only 9 out of 20 subjects performed both level conditions, two independent statistical analyses (one-way repeated-measures ANOVAs) were performed for each reference level (14 subjects for each reference level) with the experimental condition as the within-subject factor. Probabilities are corrected where necessary by the Geisser-Greenhouse epsilon (Geisser and Greenhouse 1958). The effect of experimental condition is significant at the 73.7 dB reference level [F(4,52) = 5.96, p = 0.01, εGG = 0.44], but not at 50 dB SPL [F(4,52) = 2.73, p = 0.08, εGG = 0.54]. Moreover, a Fisher LSD test applied to the data shows that, at the 73.7 dB reference level, the adjustments in C4 are significantly higher than in all other conditions (C1, C2, C3 and C5). The same statistical test applied to the data at the 50 dB reference level shows that the adjustments in C4 are higher than in C1, C2 and, to a lesser extent, in C5. This pattern of results indicates that all adjustments are about the same (within a level condition) with the notable exception of the higher adjustments in C4. This result is also confirmed by a two-way repeated-measures ANOVA with factors reference level and condition, applied to the adjustments from the nine subjects common to the two level conditions. This analysis reveals a strong effect of level [F(1,8) = 1024, p<0.00001, εGG = 1], an effect of condition [F(4,32) = 6.13, p<0.01, εGG = 0.58], but no interaction between these factors [F(4,32)<1].
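For readers who want to reproduce this type of analysis, the sketch below shows one way the one-way repeated-measures ANOVA could be run in Python with statsmodels. The data-frame layout, column names, and the placeholder values are invented for illustration only, and the Geisser-Greenhouse correction reported above would still have to be applied separately, since statsmodels' AnovaRM returns uncorrected degrees of freedom.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# hypothetical long-format table: one adjusted level per subject and condition
data = pd.DataFrame({
    "subject":   [s for s in range(1, 15) for _ in range(5)],
    "condition": ["C1", "C2", "C3", "C4", "C5"] * 14,
    "level_db":  np.random.default_rng(0).normal(73.7, 2.0, 70),  # placeholder values
})

res = AnovaRM(data, depvar="level_db", subject="subject", within=["condition"]).fit()
print(res)   # uncorrected F and p; apply the Geisser-Greenhouse epsilon separately
```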
4 Discussion
The adjustment procedure used in this experiment seems to be validated by the good correspondence between the mean adjustment values in C1 and the reference levels. Moreover, the consistency between the results in C1 and C2
Fig. 2 Mean adjustments across conditions with reference levels equal to either 73.7 dB SPL (filled squares) or 50 dB SPL (unfilled squares). The reference levels are indicated by the continuous (73.7 dB SPL) and dotted (50 dB SPL) lines. The experimental adjustment differences between C4 and C5 (∆) are also indicated
indicates a negligible effect of rhythm. The adjustments in C3 indicate that the level difference between 2-kHz and 1-kHz tones that is necessary to obtain equally loud tones is not significantly different from zero. This result does not compare well with the equal-loudness contours from the literature, which generally predict a positive increment. The equal-loudness contours are, however, derived from free-field measurements. The headphone presentation, the frequency-response curve of the headphone (Hirahara 1997) and the use of an AEC101 coupler for calibration may have contributed to this difference. This observation makes comparison of the absolute adjustment values obtained in this experiment with those from the literature difficult. The differences between conditions C4 and C5 (∆ in Fig. 2) can, however, be compared with predicted differences from Fletcher and Munson (1933). The adjustment
values in C4 are higher on average than those in the other conditions, and in C5 in particular, leading to positive values for ∆. The complex in C5 was reported by all listeners to be perceptually segregated such that the tonal components were then perceived separately. This could presumably explain the higher adjustment values in C4, in which listeners reported the complex tone as being integrated. This result indicates that a single tonal component from a complex can give rise to a particular loudness as soon as it is segregated from the other components (C5), or it can contribute to a global loudness if it is integrated along with the other components into a single auditory object (C4). McAdams et al. (1998) have already demonstrated the influence of ASA processes on within-channel loudness computation processes, where pressure sums. The results from the present experiment therefore provide clear evidence for the influence of ASA processes on across-channel loudness additivity. However, this result must be tempered by the fact that the additivity law (Fletcher and Munson 1933) predicts larger ∆ values than the experimental ∆ values. The lower-than-predicted ∆ values might be explained first by the fact that neither in C4 nor in C5 was the percept clearly integrated or segregated for all subjects. However, the individual results were consistent across subjects, and all subjects reported a clear perception of the integrated and segregated events. Thus, this explanation would be difficult to support. Second, the lower-than-predicted ∆ values might be explained by the short duration of the bursts used in this experiment to promote segregation (0.2 s instead of 1 s in Fletcher and Munson 1933). From the results of Munson (1947), the effect of the reduced burst duration on loudness should not exceed 3 dB, and therefore would not seem to account for the higher observed difference between the additivity law and the data. However, Munson’s results might not apply to the loudness of a two-tone complex. Third, the lower-than-predicted ∆ values might finally be explained by biases. In fact, in the loudness adjustment procedure, the complex tone was always fixed and the pure tone variable. The reverse arrangement was not tested, although this is recommended by Gabriel et al. (1997). Moreover, the range of starting levels for the adjustable stimulus was not determined experimentally to be symmetric around the final adjusted value. These biases should, however, apply similarly to C4 and C5 and would not explain the lower-than-predicted ∆ values. This therefore requires further experimental analysis involving additional experimental conditions with multi-component complex tones. This study, along with McAdams et al. (1998), suggests that the stage at which ASA processes act on the auditory stimulus representation must precede loudness computation. Indeed, these studies demonstrate that loudness perception is strongly influenced by both within-channel and across-channel ASA mechanisms, because grouping processes have been shown to affect loudness perception in the case of homophonic continuity (McAdams et al. 1998) and in more complex sounds in the current study. We conclude that loudness models must be revised to account for this large ASA effect, which has so far been largely underestimated.
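To make the size of the predicted ∆ concrete, the following sketch computes the level increment implied by strict loudness additivity for a two-tone complex of equally loud, widely spaced components, using the common approximation that the loudness of a 1-kHz tone in sones doubles for every 10-dB increase above 40 dB SPL. Both the loudness function and the reference point are simplifying assumptions for illustration, not the exact Fletcher and Munson (1933) calculation.

```python
import numpy as np

def sones(level_db):
    """Rule-of-thumb loudness of a 1-kHz tone: 1 sone at 40 dB SPL,
    doubling for every further 10 dB (an assumption, not the authors' model)."""
    return 2.0 ** ((level_db - 40.0) / 10.0)

def level_from_sones(n):
    return 40.0 + 10.0 * np.log2(n)

for ref in (50.0, 73.7):
    total = 2.0 * sones(ref)                       # two equally loud, remote components: sones add
    delta = level_from_sones(total) - ref          # matching level of a single tone
    print(f"reference {ref:5.1f} dB SPL -> predicted Delta = {delta:.1f} dB")

# With this loudness function, strict additivity predicts Delta of about 10 dB,
# i.e., clearly larger than the experimental Delta values discussed above.
```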
Acknowledgments. We thank Brian Moore for helpful comments on a previous version of this manuscript.
References
Allen JB (1996) Harvey Fletcher’s role in the creation of communication acoustics. J Acoust Soc Am 99:1825–1839
Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA
Bregman AS, Pinker S (1978) Auditory streaming and the building of timbre. Can J Psychol 32:19–31
Bregman AS, Tougas Y (1989) Propagation of constraints in auditory organization. Percept Psychophys 46:395–396
Fletcher H (1938) Loudness, masking and their relation to the hearing process and the problem of noise measurement. J Acoust Soc Am 9:275–293
Fletcher H, Munson WA (1933) Loudness, its definition, measurement and calculation. J Acoust Soc Am 5:82–108
Fletcher H, Munson WA (1937) Relation between loudness and masking. J Acoust Soc Am 9:1–10
Gabriel B, Kollmeier B, Mellert V (1997) Influence of individual listener, measurement room and choice of test-tone levels on the shape of equal-loudness level contours. Acustica - Acta Acustica 83:670–683
Geisser S, Greenhouse SW (1958) An extension of Box’s results on the use of the F distribution in multivariate analysis. Ann Math Stat 29:885–891
Hirahara T (1997) Physical characteristics of headphones used in psychoacoustical experiments. J Acoust Soc Jpn 798–806
Jeng PS (1992) Loudness prediction using a physiologically based auditory model. Unpublished PhD thesis, CUNY, New York
McAdams S, Botte M-C, Drake C (1998) Auditory continuity and loudness computation. J Acoust Soc Am 103:1580–1591
Munson WA (1947) The growth of auditory sensation. J Acoust Soc Am 19:584–591
Plack CJ, Carlyon RP (1995) Loudness perception and intensity coding. In: Moore BCJ (ed) Hearing. Academic Press, pp 123–159
van Noorden LPAS (1975) Temporal coherence in the perception of tone sequences. Unpublished Doctoral Dissertation, Technische Hogeschool Eindhoven, Eindhoven, The Netherlands
Warren RM (1999) Auditory perception: a new analysis and synthesis. Cambridge University Press
Warren RM, Obusek CJ, Ackroff JM (1972) Auditory induction: perceptual synthesis of absent sounds. Science 176:1149–1151
Zwicker E, Fastl H (1990) Psychoacoustics – facts and models. Springer, Berlin Heidelberg New York
33 Modulation Detection Interference as Informational Masking STANLEY SHEFT AND WILLIAM A. YOST
1 Introduction
The elevation in thresholds for detecting amplitude modulation (AM) of a probe tone due to modulation of a masking tone is referred to as modulation detection interference (MDI). Past work has suggested a relationship between MDI and auditory grouping, with a possible, though not necessary, basis in the similarity of concurrent probe and masker modulation. An alternate but related approach is to view MDI in the context of informational masking. Because it is nonenergetic at the peripheral level, and because similarity of probe and masker is a component, MDI exhibits characteristics used to describe informational masking (e.g., Watson 2005). The intent of the present work was to evaluate MDI in the context of informational masking, using a more stringent definition which extends energetic masking to the modulation domain. To allow for consideration of auditory grouping and segregation effects, envelope slope and concurrency of modulation were manipulated in experiments I and II, respectively.
2 Experiment I
2.1 Method
The task was to detect either 4- or 10-Hz sinusoidal AM (SAM) of a 1.8-kHz probe carrier. The probe was either presented alone or in the presence of a two-carrier (0.75 and 4.5-kHz) masker complex. The masker AM index was 0.0 or 0.7 with masker modulation either sinusoidal or complex at the probe AM rate. Complex masker modulators were defined by two parameters: envelope slope or rise/fall (r/f) time, which varied from 1 ms to half the modulator period, and steady-state factor (ssf), the ratio of peak to peak-plus-valley durations of a modulation cycle (Fig. 1). Probe SAM was either in phase with the masker AM fundamental or was advanced 180°. The 500-ms probe and
Parmly Hearing Institute, Loyola University Chicago, USA,
[email protected],
[email protected]
Fig. 1 Schematic representation of the effect of steady-state factor on masker waveforms. The bottom two waveforms show the possible envelope-phase configurations for the probe
masker were shaped with 20-ms cos2 r/f ramps. Probe and masker levels were 67 and 57 dB SPL, respectively, with overall levels held constant regardless of AM depth or waveshape.
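For illustration, the sketch below generates one possible realization of such a complex (trapezoidal) modulator from a rate, rise/fall time, and steady-state factor, and applies it to one masker carrier. The sampling rate, the exact ramp shape, and the omission of depth scaling and level normalization are assumptions made for the example rather than details taken from the authors' signal generation.

```python
import numpy as np

def trapezoid_modulator(fm, rf, ssf, fs, dur):
    """Assumed realization of a complex modulator defined by rate fm (Hz),
    rise/fall time rf (s), and steady-state factor ssf = peak / (peak + valley)."""
    period = 1.0 / fm
    plateau = period - 2.0 * rf                 # time shared by peak and valley plateaus
    peak, valley = ssf * plateau, (1.0 - ssf) * plateau
    n = lambda x: int(round(x * fs))
    cycle = np.concatenate([
        np.linspace(0.0, 1.0, n(rf), endpoint=False),   # rising ramp
        np.ones(n(peak)),                                # peak plateau
        np.linspace(1.0, 0.0, n(rf), endpoint=False),   # falling ramp
        np.zeros(n(valley)),                             # valley plateau
    ])
    reps = int(np.ceil(dur * fs / len(cycle)))
    return np.tile(cycle, reps)[: n(dur)]

fs = 44100                                               # assumed sampling rate
env = trapezoid_modulator(fm=4.0, rf=0.005, ssf=0.5, fs=fs, dur=0.5)
t = np.arange(len(env)) / fs
masker_low = env * np.sin(2 * np.pi * 750.0 * t)         # 0.75-kHz masker carrier
# depth scaling (m = 0.7) and constant-level normalization omitted for brevity
```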
2.2 Results and Discussion
Probe modulation rate affected the pattern of results (Fig. 2). For 4-Hz probe SAM, there was an interaction between factors r/f and ssf, with no MDI, relative to unmodulated-masker thresholds, in three of the six conditions in which the ssf was either 0.0 or 1.0. Shailer and Moore (1993) suggested that, for complex modulators with steep envelope slopes, masker salience can affect MDI. To evaluate the relationship between envelope waveshape and perceived salience, listeners judged by triadic comparison the perceptual prominence of various masker waveforms. Multidimensional scaling of judgments of salience indicated a correlation between salience and fundamental amplitude in the modulation spectrum for modulators with higher values of rms amplitude (see Fig. 3 for spectra). For lower-level modulators, gross envelope waveshape accounted for salience judgments. With 10-Hz probe SAM, thresholds varied by only slightly more than 3 dB across the complex masker-modulation conditions. The relatively constant masking effect is consistent neither with the change in masker salience with modulator waveshape nor with an energetic analysis in the modulation domain. Across the 10-Hz-masker conditions, masker power at the probe SAM rate varies by over 35 dB and masker-modulator rms amplitude varies by more than 15 dB. As with 10-Hz modulation, the 4-Hz thresholds do not follow trends indicated by energetic analysis.
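The “energetic” comparison above rests on the amplitude of the masker modulator's spectrum at the probe AM rate (cf. Fig. 3). A minimal sketch of that computation is given below; the envelope sampling rate and the crude stand-in modulator are assumptions for illustration, and in practice the trapezoidal modulators from the earlier sketch would be analyzed instead.

```python
import numpy as np

fs_env = 1000.0                                            # assumed envelope sampling rate
t = np.arange(int(2.0 * fs_env)) / fs_env
env = (np.sin(2 * np.pi * 4.0 * t) >= 0).astype(float)     # crude 4-Hz square modulator as a stand-in

env_ac = env - env.mean()                                  # ac-couple, as in Fig. 3
spec = np.abs(np.fft.rfft(env_ac)) / len(env_ac)           # single-sided amplitude spectrum
freqs = np.fft.rfftfreq(len(env_ac), 1.0 / fs_env)

probe_rate = 4.0
amp_at_probe = spec[np.argmin(np.abs(freqs - probe_rate))]
rms = np.sqrt(np.mean(env_ac ** 2))
print(f"component at {probe_rate} Hz: {amp_at_probe:.3f}, modulator rms: {rms:.3f}")
```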
Fig. 2 Mean thresholds averaged across eight listeners for detecting 4-Hz (left panel) or 10-Hz (right panel) probe SAM as a function of masker ssf with parameter masker r/f. Probe SAM was in phase with the fundamental of the masker modulator. Error bars represent one standard error (s.e.) of the mean threshold
Fig. 3 Amplitude spectra of the nine complex-masker modulators with a 4-Hz periodicity. Results are ordered across columns by modulator r/f and across rows by ssf. The horizontal line at the top left of each panel indicates the amplitude with sinusoidal AM. The number in the top right corner of each panel lists, in dB relative to sinusoidal modulation, the ac-coupled rms amplitude of the complex modulators. Similar trends were obtained with 10-Hz AM
Fig. 4 Mean thresholds averaged across six listeners for detecting 4-Hz (left panel) or 10-Hz (right panel) probe SAM as a function of masker ssf with parameter probe-modulator phase. The masker-modulator r/f time was 5 ms. Error bars are one s.e. of the mean threshold
At both probe modulation rates, probe modulator phase had no effect (Fig. 4). This result is not consistent with MDI based on cross-spectral envelope summation coupled with detection determined by a max/min rule.
3 Experiment II
3.1 Method
Probe and masker modulators were either continuous 8-Hz SAM, or complex waveforms termed “dropped-cycle” modulators in which only the even- or odd-numbered fluctuation cycles were present with the modulator remaining at dc during the time of the omitted cycles (Fig. 5). These modulator types are labeled as “all”, “odd”, or “even.” The task was to detect probe AM with each of the three modulator types. The 1.8-kHz probe was either presented alone or in the presence of a two-carrier (0.75- and 4.5-kHz) masker complex with the masker AM index either 0.0 or 1.0. Across conditions, each of the three types of masker AM was paired with each pattern of probe modulation. The overall level of each carrier of the probe and masker was 60 dB SPL regardless of AM depth or pattern. In the initial condition set, the 500-ms probe and masker were synchronously gated, while in the second set, masker duration was increased to 1000-ms with probe onset delayed 500-ms from the masker onset. All probe and masker waveforms were shaped with 5-ms cos2 r/f ramps.
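A sketch of one way to construct such a "dropped-cycle" modulator is given below; the sampling rate, the starting phase, and the omission of the overall-level normalization and cos2 gating are assumptions made for the example.

```python
import numpy as np

def dropped_cycle_modulator(fm, fs, dur, keep="even", m=1.0):
    """Sinusoidal AM in which only the even- or odd-numbered fluctuation cycles
    are present; the modulator stays at its dc value (1.0) during omitted cycles."""
    t = np.arange(int(dur * fs)) / fs
    cycle = np.floor(t * fm).astype(int)                     # index of the current cycle
    keep_mask = (cycle % 2 == 0) if keep == "even" else (cycle % 2 == 1)
    sam = 1.0 + m * np.sin(2 * np.pi * fm * t)               # each cycle starts and ends at dc
    return np.where(keep_mask, sam, 1.0)

fs = 44100                                                   # assumed sampling rate
t = np.arange(int(0.5 * fs)) / fs
probe = dropped_cycle_modulator(8.0, fs, 0.5, keep="even") * np.sin(2 * np.pi * 1800.0 * t)
masker_low = dropped_cycle_modulator(8.0, fs, 0.5, keep="odd") * np.sin(2 * np.pi * 750.0 * t)
```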
Fig. 5 On the left, schematic representation of a “dropped-cycle” condition in which the probe carrier is modulated by only the even cycles of an 8-Hz sinusoidal function while the masker is modulated by the odd cycles. In both cases, the modulator remains at dc during the time of the omitted fluctuation cycles. The right panel shows the amplitude spectrum of a “dropped-cycle” modulator with m equal to 1.0
3.2 Results and Discussion
Results are shown in Table 1. With synchronous gating of the probe and masker carriers, masker modulation elevated thresholds in all conditions. In conditions with continuous probe modulation (probe-condition “all”), eliminating half the masker-modulation cycles (masker-condition “odd” or “even”) did not significantly reduce MDI relative to the conventional MDI stimulus configuration (masker-condition “all”). This absence of a release from interference when omitting masker-modulator cycles was obtained despite the roughly 7-dB drop in the masker-modulator amplitude spectrum in the vicinity of the 8-Hz probe SAM rate (i.e., when integrating levels of the 4-, 8-, and 12-Hz masker-modulator spectral components; see Fig. 5). With synchronous gating and “dropped-cycle” probe modulation (probe-conditions “odd” and “even”), results again indicated an exception to energetic masking in the modulation domain. When “dropped-cycle” modulators were used for both the probe and masker modulators, it did not matter whether the probe and masker fluctuations were concurrent or sequential. Working with frequency modulation (FM), Gockel et al. (2002) reported significant MDI with probe modulation only during a single temporal gap in masker FM. In the present work, across the six modulated-masker conditions in which the probe modulation was not continuous, mean thresholds ranged from −15.9 to −17.7 dB. In all synchronous-masker conditions, the same data trends were obtained when the rate of sinusoidal envelope fluctuation was increased from 8 to 16 Hz.
Table 1 Mean AM detection thresholds (dB) averaged across four subjects with both continuous and “dropped-cycle” 8-Hz AM. In the synchronous-masker conditions, the probe and masker carriers were gated on and off together; for the asynchronous masker, the masker onset preceded the probe onset by 500 ms with the probe and masker offsets coterminous. Each column is for a specific probe-modulation pattern, and each row a given masker characteristic; s.e.s of the mean thresholds are shown in parentheses. Separate subject groups ran in the synchronous- and asynchronous-masker conditions.

                          Probe
Masker                All             Odd             Even
Synchronous masker
  None                −32.6 (0.7)     −28.6 (0.5)     −30.5 (1.2)
  Unmodulated         −28.9 (1.2)     −24.9 (1.3)     −25.0 (0.1)
  All                 −20.1 (2.1)     −17.6 (1.2)     −17.7 (0.3)
  Odd                 −20.4 (2.2)     −17.0 (0.3)     −17.0 (0.7)
  Even                −17.9 (0.8)     −16.1 (0.5)     −15.9 (0.6)
Asynchronous masker
  None                −32.0 (0.4)     −27.7 (0.8)     −28.4 (0.6)
  Unmodulated         −31.0 (0.8)     −26.6 (0.9)     −27.3 (0.7)
  All                 −24.3 (1.0)     −20.8 (0.7)     −19.8 (0.8)
  Odd                 −25.6 (1.6)     −21.3 (0.7)     −21.4 (1.3)
  Even                −24.7 (1.0)     −22.3 (1.5)     −22.0 (1.6)
Results from additional conditions indicated that the absence of an effect of cross-spectral modulator concurrency cannot be accounted for by temporal or nonsimultaneous masking of AM detection (e.g., Wojtczak and Viemeister 2005). In these additional conditions, the carrier(s) were gated off rather than remaining at dc during the time of omitted modulation cycles. AM adaptation should persist through the silent intervals of pulsed carriers. MDI, however, was generally not obtained with modulation of the pulsed carriers. A second consideration concerns potential “ringing” of the AM filters of a modulation filterbank (Sek and Moore 2002). Figure 6 illustrates the effect of filtering on modulator waveforms. Though filter output shows oscillation during the times the modulator is at dc, this low-level effect is insufficient to account for the absence of effect of modulation concurrency in the MDI conditions. In the asynchronous-masker conditions, there was a 500-ms delay of the probe-carrier onset. Results are shown in the bottom half of Table 1. Though the extent of MDI was greatly reduced by the gating asynchrony between the probe and masker carriers, data trends regarding the effects of variation in masker-modulator spectrum and cross-spectral concurrency of modulation are the same as obtained in the synchronous-masker conditions.
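To get a feel for the filter-"ringing" check illustrated in Fig. 6, the sketch below passes two complementary dropped-cycle modulators through a bandpass filter centered near 8 Hz with a Q of about 1 and correlates the outputs. The FIR design, envelope sampling rate, and filter length are assumptions standing in for the unspecified filterbank implementation, so the resulting correlation will not match the 0.27 value quoted in the figure caption exactly.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs_env = 500.0                                         # assumed envelope sampling rate
fm, dur = 8.0, 2.0
t = np.arange(int(dur * fs_env)) / fs_env
cycle = np.floor(t * fm).astype(int)

sam = np.sin(2 * np.pi * fm * t)                        # ac-coupled sinusoidal fluctuation
even = np.where(cycle % 2 == 0, sam, 0.0)               # 'even' dropped-cycle modulator (ac part)
odd = np.where(cycle % 2 == 1, sam, 0.0)                # 'odd' counterpart

# bandpass around 8 Hz with Q of roughly 1 (passband taken as 4-12 Hz)
taps = firwin(257, [4.0, 12.0], pass_zero=False, fs=fs_env)
even_f, odd_f = lfilter(taps, 1.0, even), lfilter(taps, 1.0, odd)

print("correlation before filtering:", np.corrcoef(even, odd)[0, 1])
print("correlation after filtering: ", np.corrcoef(even_f, odd_f)[0, 1])
```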
Fig. 6 Illustrations of the effect of filtering on ac-coupled modulators with the left column showing the input waveforms and the right column the filter outputs. Modulators were passed through an FIR filter centered at 8 Hz with a Q of 1. Filtering raises the correlation between “even” and “odd” modulators from 0.0 to 0.27. However, filter “ringing” introduces less than a 0.5-dB power increment during the “dropped” cycles in auditory simulations
4 Discussion and Summary
Results from experiment I showed a significant effect of modulator waveshape only at 4 Hz, and not at 10 Hz. The 4-Hz effect in part relates to variation in masker salience, with salience presumably having some relationship to auditory attention. In experiment II, there was no effect of envelope-fluctuation concurrency between the probe and masker. It is difficult to imagine a scheme for MDI based on grouping by common modulation in which concurrency would not come into play. In both experiments, significant departures from energetic masking in the modulation domain were obtained. As an alternative to basing MDI on auditory grouping of the probe and masker, the nonenergetic results suggest involvement of informational masking. Stimulus uncertainty is often associated with informational masking. For the MDI task, the requisite
uncertainty may be the difficulty in associating near-threshold modulation with its appropriate carrier when several are present (Hall and Grose 1991). That is, this uncertainty makes it difficult for the listener to attend to the probe. A basis in informational masking does not eliminate a possible effect of perceptual segregation on MDI; segregation enhances structure, which reduces the uncertainty underlying the masking effect. Auditory information processing can be divided into the general areas of sound-source determination or scene analysis, and information extraction. At best moderate, and at times absent, effects have been observed directly linking low-rate AM to source determination (see Sheft 2007). Regarding the second area, signal variation, or modulation, is the basis of transmitted information. We believe that this aspect of information processing better represents auditory processing of AM. It is likely that MDI is due to difficulty in attending to the relevant information in complex modulated signals, and it is within this context that MDI may represent a form of informational masking.

Acknowledgments. This work was supported by NIDCD R01 Grant Nos. DC005423 and DC006250.
References
Gockel H, Carlyon RP, Deeks JM (2002) Effects of modulator asynchrony of sinusoidal and noise modulators on frequency and amplitude modulation detection interference. J Acoust Soc Am 112:2975–2984
Hall JW, Grose JH (1991) Some effects of auditory grouping factors on modulation detection interference (MDI). J Acoust Soc Am 90:3028–3035
Sek A, Moore BCJ (2002) Mechanisms of modulation gap detection. J Acoust Soc Am 111:2783–2792
Shailer MJ, Moore BCJ (1993) Effects of modulation rate and rate of envelope change on modulation discrimination interference. J Acoust Soc Am 94:3138–3143
Sheft S (2007) Envelope processing and sound-source perception. In: Yost WA, Fay RR, Popper AN (eds) Auditory perception of sound sources. Springer, Berlin Heidelberg New York
Watson CS (2005) Some comments on informational masking. Acta Acust 91:502–512
Wojtczak M, Viemeister NF (2005) Forward masking of amplitude modulation: basic characteristics. J Acoust Soc Am 118:3198–3210
Comment by Ewert
You’ve shown that energetic masking at the (fundamental) modulation frequency cannot explain your results. Is it possible to rule out energetic masking in the envelope domain if you used a smoothed spectral representation of the envelope, assuming processing in relatively broadly tuned band-pass modulation filters?
Reply
Use of a modulation filter does not improve the ability of the envelope amplitude spectrum (‘energetic masking’) to account for MDI. For the 10-Hz conditions in which the extent of MDI was relatively constant, the rms amplitude of the output of a 10-Hz modulation filter significantly varies across conditions in response to the masker. Due to the sluggish response of a low-CF modulation filter, the decibel range of the ac-coupled output values is nearly as great as the change in fundamental amplitude in the modulation spectrum, this despite the limited integration across spectral components by the filter. In the 4-Hz conditions, the correspondence between modulation-filter output and MDI is generally no better than that obtained with consideration of the masker-modulator spectrum. In both cases, the largest mismatch is with a steady-state factor of either 0.0 or 1.0.
34
A Paradoxical Aspect of Auditory Change Detection
LAURENT DEMANY AND CHRISTOPHE RAMOS
1 Introduction
The auditory entities to which human listeners attach meaning are generally combinations of successive and spectrally different sounds rather than static acoustic features. Thus, an important task of the brain is to connect, or bind, successive sounds. When several sound sources are concomitantly active, the connections must of course be selective. It appears that connections are established on the basis of automatic rules, in particular a rule of spectral proximity (Bregman 1990). This can lead, for instance, to the perception of a melodic “motion” between successive tones mixed or interleaved with other tones. The neural processes underlying the perceptual binding of successive sounds, and more generally auditory scene analysis, are still a matter of speculation, although precise hypotheses based on physiological facts have been proposed (e.g., Micheyl et al. 2005). One idea, put forth by van Noorden (1975), is that the auditory system contains “frequency-shift detectors” which are functionally comparable to the motion detectors known to exist in the visual system. We recently described a paradoxical perceptual phenomenon apparently supporting this view (Demany and Ramos 2005). The stimuli used in that study were sequences of two sounds: (1) a “chord” of five peripherally resolvable pure tones with randomly chosen frequencies and (2) a single pure tone (T). Because the components of the chord were gated on and off synchronously, they were very difficult to hear out individually. This was confirmed in an experimental condition (called “present/absent”) where T could be either identical to a randomly selected component of the chord (one of the three intermediate components) or halfway in (log-)frequency between two components. Listeners could not reliably discriminate between these two types of sequences. Surprisingly, however, performance was much better in another condition, called “up/down”, where T was positioned slightly (one semitone) above or below a randomly selected component of the chord (one of the three intermediate components, again) and listeners had to identify the direction of this frequency shift. Overall, it appeared that a sequence
Laboratoire de Neurophysiologie, CNRS and Université Victor Segalen, Bordeaux, France,
[email protected]
of two pure tones is able to produce a percept of directional pitch shift even while the pitch of the first tone is not consciously audible. What are the precise implications of listeners’ good performance in the “up/down” task? One possible interpretation is that, in this task, listeners take advantage of the existence of neural units or structures specifically responding to frequency shifts in a given direction (rather than to steady sounds). An alternative interpretation may be termed the “partial forward masking (PFM) hypothesis”. According to the PFM hypothesis, the “up/down” task is easy because the excitation pattern of the T tone (that is, the representation of this tone in a bank of auditory bandpass filters) has a slightly different shape on “up” and “down” trials, due to a residual masking effect of the chord component which was close in frequency to T; listeners identify the two possible shapes and infer from this information the direction of the frequency shift. In our previous study, each tone had a nominal SPL of 65 dB and each T tone was separated from the preceding chord by an inter-stimulus interval (ISI) of at least 500 ms. In view of this, the PFM hypothesis did not seem to be a likely explanation of our paradoxical results. However, we could not definitely rule it out. The new experiments reported here provide direct tests of the PFM hypothesis.
2 Experiment 1
2.1 Method
The chords used in this experiment were generated according to the same rules as those used by Demany and Ramos (2005). They thus consisted of five 300-ms pure tones, synchronously gated on and off with 20-ms raised-cosine amplitude ramps, and spaced by intervals varying at random, independently of each other, between six and ten semitones (the probability distributions being rectangular). The chord presented on each trial was randomly positioned within a four-octave range, 200–3200 Hz. As in the “up/down” condition of our previous study, the chord was followed, after a 500-ms silent ISI, by a single pure tone (T), positioned at random one semitone above or below one of the chord’s three intermediate components; once more, the listener’s task was to identify the direction of this frequency shift. Responses were given by means of mouse clicks on two virtual buttons. No feedback was provided within a block of trials. The stimuli were presented via earphones (Sennheiser HD265), in a soundproof booth. All tones had a nominal SPL of 65 dB. A random melody of seven 300-ms tones, with frequencies drawn between 200 and 3200 Hz, was generated at the beginning of every trial; this melody was separated from the following chord by a 600-ms ISI.
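The chord-generation rule just described can be made concrete with the short sketch below, which draws one chord and the corresponding T tone for an "up" or "down" trial. The random-number interface and the choice to express frequencies in semitones re the lower range limit are implementation assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng()

def random_chord(n_tones=5, min_st=6.0, max_st=10.0, lo=200.0, hi=3200.0):
    """Chord of n_tones pure tones spaced by intervals drawn uniformly between
    min_st and max_st semitones, the whole chord placed at random in [lo, hi] Hz."""
    intervals = rng.uniform(min_st, max_st, n_tones - 1)        # semitone spacings
    span = intervals.sum()                                       # chord extent in semitones
    range_st = 12.0 * np.log2(hi / lo)                           # available range in semitones
    base_st = rng.uniform(0.0, range_st - span)                  # position of the lowest tone
    tone_st = base_st + np.concatenate(([0.0], np.cumsum(intervals)))
    return lo * 2 ** (tone_st / 12.0)                            # frequencies in Hz

chord = random_chord()
component = rng.choice(chord[1:-1])          # one of the three intermediate components
direction = rng.choice([+1, -1])             # 'up' or 'down' trial
t_freq = component * 2 ** (direction / 12.0)  # T tone one semitone above or below
```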
Whereas the random melody was presented diotically, the chord and T were presented monaurally. Our goal was to determine whether performance would be much poorer if the chord and T were presented to opposite ears than if they were presented to the same ear. We thus used four experimental subconditions, in which the chord and T were each presented either to the left ear or to the right ear. Whenever a chord or T tone was delivered to one ear, a synchronous burst of pink noise was delivered to the contralateral ear, at a level of 67 dB (A weighting). For each subcondition and subject, eight blocks of 50 trials were run, the four subconditions being randomly ordered within sessions. Four listeners, including the two authors, served as subjects. All of them had previously performed the same “up/down” task on diotic versions of the chords and T tones (Demany and Ramos 2005). All of them had also been tested in the “present/absent” condition that we described in the Introduction; the corresponding data will be presented below, together with those obtained (from three of the listeners tested here) in a third condition, called “present/close”.
2.2 Results
The values of d′ obtained in the four “up/down” subconditions are displayed in the first four columns of Fig. 1. Each subject is represented by a specific symbol. It can be seen that performance was very good when the chord and T were presented to the same ear (left-left, right-right; average d′: 3.10), but
Fig. 1 First four columns: results of experiment 1. Last two columns: data obtained from the same listeners in a previous study. The error bars represent 95% confidence intervals (Macmillan and Creelman 1991, Chap. 11)
not appreciably worse when the chord and T were presented to opposite ears (left-right, right-left; average d′: 2.79). This is strong evidence against the PFM hypothesis since forward masking is essentially a monaural phenomenon (Lüscher and Zwislocki 1949). In the fifth column of Fig. 1 we display the performance of the same listeners in the “present/absent” condition of our previous study. [Three of the four listeners had been tested twice in the “present/absent” condition; for these listeners, the data shown in Fig. 1 are those obtained in the second test, i.e., experiment 2.] Recall that, in the “present/absent” condition, the task was to discriminate sequences in which T was identical to one of the chord’s three intermediate components (“present” trials) from sequences in which T was halfway in (log-) frequency between two components (“absent” trials). It can be seen that performance exceeded the chance level but was poor (average d′: 0.94). Three listeners had also been tested in a “present/close” condition where the “present” trials were identical to those of the “present/absent” condition and, on “close” trials, T was positioned at random 1.5 semitones above or below one of the chord’s three intermediate components. For a theoretical listener hearing out individually the components of a chord and making explicit pitch comparisons between them and T, a “present” trial should be easier to discriminate from an “absent” trial than from a “close” trial. However, Fig. 1 indicates that we found just the opposite in real listeners. It is possible to account for the relative difficulties of the “up/down”, “present/absent”, and “present/close” conditions under the assumption that performance in the three conditions is determined by the activity of automatic frequency-shift detectors. A schematic model of their behavior is presented in our previous paper (Demany and Ramos 2005).
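As a reference for how d′ values of this kind can be obtained from the "up"/"down" judgments, here is a minimal computation using the standard equal-variance Gaussian model, with "up" trials arbitrarily treated as signal trials. The trial counts are invented, and the log-linear correction is one common choice rather than necessarily the one used by the authors.

```python
from scipy.stats import norm

def dprime(n_up_correct, n_up_total, n_down_wrong, n_down_total):
    """Equal-variance d' with a log-linear correction to avoid infinite z-scores."""
    hit_rate = (n_up_correct + 0.5) / (n_up_total + 1.0)   # 'up' trials answered 'up'
    fa_rate = (n_down_wrong + 0.5) / (n_down_total + 1.0)  # 'down' trials answered 'up'
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# hypothetical counts from one 50-trial block (25 'up' and 25 'down' trials)
print(round(dprime(22, 25, 4, 25), 2))
```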
3 Experiments 2 and 3
3.1 Method
In experiments 2 and 3, new chords were used. In experiment 2, the chords consisted of 7 tones spaced by constant intervals of 7.5 semitones. In experiment 3, the chords consisted of 10 tones spaced by constant intervals of 5.5 semitones. Except for this difference, the two experiments were identical. In each of them, the 300-ms chord presented on a given trial was randomly positioned in a five-octave range, 125–4000 Hz, and it was followed, after a 1-s ISI, by a 300-ms T tone which was positioned one semitone above or below (at random) a randomly selected component of the chord (the edge components were not excluded in this selection). As in experiment 1, the subjects had to vote for “up” or “down”, and received no feedback during the 50-trial blocks. All stimuli were presented diotically and each tone had a nominal SPL of 65 dB. Once more, a random melody of 300-ms tones was
generated at the beginning of each trial. (However, it was now made up of only five tones, with frequencies drawn between 125 and 4000 Hz.) There were three types of trial blocks. They differed from each other only with respect to what occurred during the 1-s ISI separating the chord from T. In blocks of a first type, this ISI was simply silent. In the other blocks, a 300-ms interfering sound was generated in the middle of the ISI. In blocks of the second type, the interfering sound was a burst of pink noise. This noise burst had a level of 72 dB (A weighting); its loudness was thus similar to that of the chord. In blocks of the third type, the interfering sound was a soft pure tone. The frequency of this tone was on each trial drawn at random between 125 and 4000 Hz (the probability distribution being rectangular on a logfrequency scale), with one constraint: the interfering tone had to be at least four semitones away from the crucial component of the chord, i.e., the component one semitone away from T. The SPL of the interfering tones was slightly subject-dependent; its average value was 56 dB. For each subject, the chosen SPL corresponded to the detection threshold [P(C) = 0.75 in a 2I-2AFC paradigm] of the interfering tones when they were mixed with (and masked by) the noise bursts used in the trial blocks of the second type; the masked threshold in question had been measured in a short preliminary session. The logic of our comparison between the consequences of noise interference and tonal interference is straightforward. If the PFM hypothesis were correct, then noise interference should strongly impair performance; in contrast, tonal interference should have little effect since the interfering tone was never close in frequency to the crucial component of the chord. If, on the other hand, the PFM hypothesis is wrong and success in the task is due to the activity of frequency-shift detectors, then noise interference may have little effect, but an interfering tone should be harmful since the frequency-shift detectors will undesirably respond to its frequency relations with the components of the chord as well as with T. In both experiments, three listeners were tested. Two of them were the authors. The third subject was not the same person in the two experiments. For each subject, experiment, and type of interference (none, noise, tone), at least 12 blocks of 50 trials were run. The three types of interference were used within each experimental session, in a counterbalanced order. 3.2
Results
As before, performance was measured in terms of d′. In the case of tonal interference, it could be expected that listeners’ responses (“up” or “down”) would be biased by the direction of the frequency shift between the interfering tone and T. This led us to compute, for each listener, separate d′ statistics for the trials on which the interfering tone had been, respectively, lower and higher than T. In each experiment, however, we found that the grand mean of these d′ values was almost identical to the mean of the d′
values obtained in the absence of a bipartition of trials. It was thus finally decided to process the data in exactly the same way for the three types of interference. The results of experiment 2 (seven-tone chords) and experiment 3 (ten-tone chords) are respectively displayed in the upper and lower panels of Fig. 2. It may be noted first that performance was globally poorer in experiment 3 than in experiment 2. We do not know whether the main source of this effect was the “number of tones” factor or the fact that the components of the ten-tone chords were more closely spaced than the components of the seven-tone chords. In any case, and more importantly, the effects of interfering sounds were similar in the two experiments. Performance was impaired by noise interference, but not very much, and definitely less than by tonal interference. This contradicts the PFM hypothesis. In a previous and as yet unpublished study, the three subjects of experiment 3 had been tested in a “present/absent” condition using chords of
Fig. 2 Results of experiment 2 (top panel) and experiment 3 (bottom panel). Error bars represent 95% confidence intervals
exactly the same type and a silent ISI of only 500 ms between each chord and the following T tone. The average d′ obtained was 0.01. This means that, in experiment 3, the components of the chords were absolutely impossible to hear out individually by the tested listeners. Thus, if the PFM hypothesis were correct, d′ should have been essentially zero in the presence of noise interference. The average d′ actually obtained for the noise interference (0.64) does not fit in with this prediction.
4 Discussion
The present study demonstrates that the perceptual phenomenon initially described by Demany and Ramos (2005) cannot be accounted for in terms of partial forward masking. Taken together, the results of the two studies show that a sequence of two pure tones is able to produce a percept of directional pitch shift while (1) the pitch of the first tone is not consciously audible, and (2) the “excitation pattern” of the second tone (that is, the auditory representation of this tone which matters from the point of view of masking) provides no cue about the direction of the shift. Under such circumstances, what can be the mechanism of pitch shift perception? Two general hypotheses may be considered. According to one of them, the representation of the second tone (T) at some central level of the auditory system is influenced by its frequency relation with the previous tone, even though the previous tone does not affect the excitation pattern of T at a more peripheral level, where forward masking takes place. This hypothesis is consistent with the results of several physiological investigations. In the auditory cortex of cats or monkeys, the response of a neuron to a pure tone can be markedly increased by the previous presentation of another pure tone, differing in frequency (e.g., Brosch and Schreiner 2000; McKenna et al. 1989). These effects are observable in the presence of a silent ISI between the two tones and are sensitive to the direction of the frequency shift. To our knowledge, however, there is currently no evidence that a tone can affect the response to a subsequent tone when the two tones are presented to opposite ears or are separated by a noise burst. Moreover, one must bear in mind that the mere existence of an interaction between two successive sounds in the auditory system does not immediately account for the perception of a relation between these two sounds: a problem is to dissociate, in the neural activity concomitant to the presentation of the second sound, what is due to the relation between the two sounds from what could be due to intrinsic features of the second sound. An alternative hypothesis is that frequency relations between successive tones are encoded by “frequency-shift detectors” which do not participate in the encoding of the tones themselves. The basic function of these detectors would be to bind successive tones within coherent auditory streams, and thus
to serve as tools in auditory scene analysis. In the same vein, Ullman (1978) pointed out that the motion detectors of the visual system are used to bind the elements of successive and distinct images in such a way as to match corresponding elements (a necessity for the perceptual constancy of physical objects). This visual binding follows a rule of (two-dimensional) spatial proximity that is comparable to the principle of spectral proximity at work in hearing. It should be noted, however, that visual perception of apparent motion (“phi motion”) is possible only for ISIs below 200 or 300 ms. We have reported here evidence that a frequency shift between two tones can still be detected automatically in the presence of a 1-s ISI. Demany and Ramos (2005) reported data suggesting that even longer ISIs are not prohibitive. Auditory change perception thus seems to have unique properties, associated with particularities of auditory memory.

Acknowledgments. We thank Christophe Micheyl, Daniel Pressnitzer, and Maja Serman for comments on a previous version of this paper.
References
Bregman AS (1990) Auditory scene analysis. MIT Press, Cambridge, USA
Brosch M, Schreiner CE (2000) Sequence sensitivity of neurons in cat primary auditory cortex. Cereb Cortex 10:1155–1167
Demany L, Ramos C (2005) On the binding of successive sounds: perceiving shifts in nonperceived pitches. J Acoust Soc Am 117:833–841
Lüscher E, Zwislocki J (1949) Adaptation of the ear to sound stimuli. J Acoust Soc Am 21:135–139
Macmillan NA, Creelman CD (1991) Detection theory: a user’s guide. Cambridge University Press, Cambridge, UK
McKenna TM, Weinberger NM, Diamond DM (1989) Responses of single auditory cortical neurons to tone sequences. Brain Res 481:142–153
Micheyl C, Tian B, Carlyon RP, Rauschecker JP (2005) Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48:139–148
Ullman S (1978) Two-dimensionality of the correspondence process in apparent motion. Perception 7:683–693
van Noorden LPAS (1975) Temporal coherence in the perception of tone sequences. Doctoral dissertation, Technische Hogeschool, Eindhoven
Comment by Tsuzaki
You assume that there are multiple frequency-change detectors. If this is the case, do you expect any difficulty in judging the direction of change when two T tones are used whose frequencies are shifted from their respective closest chord components in opposite directions? Do you also expect any “enhancement” if the two T tones change in the same direction?
Reply
It would be very interesting to perform an experiment in which, as you suggest, each chord is followed by two simultaneous T tones (T1, T2) instead of a single T tone. The most appropriate task is probably a “present/close” task in which T1 and T2 are either both present in the chord or slightly different from two components of the chord (C1, C2). In the latter case, |T1-C1| should be equal to |T2-C2| within trials, but, as you suggest, T1-C1 and T2-C2 could have either the same sign or opposite signs. The required judgments would be merely “present” or “not present”. For a given value of |T1-C1| and |T2-C2|, will performance be better when T1-C1 and T2-C2 have the same sign than when they have opposite signs? A positive answer would suggest that, in response to a sequence of two sounds, the frequency-shift detectors provide only one global output; when T1-C1 and T2-C2 have opposite signs, the activations of the detectors by T1-C1 and T2-C2 would thus cancel each other. In contrast, a negative answer would suggest that the frequency-shift detectors provide separate outputs in different frequency regions.
35 Human Auditory Cortical Processing of Transitions Between ‘Order’ and ‘Disorder’ MARIA CHAIT1,4, DAVID POEPPEL2, AND JONATHAN Z. SIMON3
1 Neuroscience and Cognitive Science program, Cognitive Neuroscience of Language Laboratory, University of Maryland, College Park, MD, USA, [email protected]
2 Departments of Linguistics and Biology, University of Maryland, College Park, MD, USA, [email protected]
3 Departments of Biology and Electrical and Computer Engineering, University of Maryland, College Park, MD, USA, [email protected]
4 Laboratoire de Psychologie de la Perception, CNRS (UMR 2929), Université Paris Descartes and École Normale Supérieure, Paris, France

1 Introduction

Sensitivity to changes in sound is important to auditory scene analysis and to the detection of new objects appearing in the environment. In this chapter we describe two experiments that used magnetoencephalography (MEG) to investigate the temporal dynamics of auditory cortical responses to changes in ongoing stimuli. The experiments used very different stimuli (dichotic vs diotic, noise-like vs tonal, stationary vs dynamic), but shared the abstract characteristic that both involved a transition from a state of order to disorder, or vice versa (Fig. 1). In one experiment (Chait et al. 2005) we studied changes in the interaural correlation (IAC) of wide-band noise. Stimuli consisted of interaurally correlated noise (identical noise signals played to the two ears) that changed into uncorrelated noise (different noise signals at the two ears), or vice versa. The stimuli of the second experiment (Chait et al. 2007) were designed to mimic the abstract properties of those in the IAC experiment while changing the acoustic properties completely. Signals consisted of a constant tone that changed into a sequence of random tone pips, or vice versa. In both experiments, magnetic responses were gathered while subjects attended to an auditory task unrelated to the dimension along which the stimuli varied. The responses are thus presumed to reflect pre-attentive 'bottom-up' mechanisms, processing aspects of sound that the subject does not attend to consciously. We show that early auditory cortical responses are remarkably similar between experiments. For both experiments, the response pattern differed radically between transitions from order to disorder and vice versa. We interpret this result as possibly reflecting the different requirements of the process that estimates the regularity of the stimulus (interaural correlation vs decorrelation, constant tone vs random pip sequence) according to the direction of the change. The data shed light on the heuristics with which auditory cortex samples, represents, and detects changes in the environment, including those that are not the immediate focus of attention.

Fig. 1A,B Schematic representation of the abstract similarity between the stimuli used in the two experiments. In the IAC experiment, the dimension of change (Y axis) is interaural difference. Correlated noise is hypothesized (Durlach 1963) to be represented as a constant 'zero' value, whereas uncorrelated noise is represented as a randomly fluctuating value. The size of the black segments symbolizes the temporal resolution of binaural processing. In the tone experiment, the dimension of change is frequency. Pip size (black segments) is 30 ms
2 Materials and Methods
Methods are described in full in Chait et al. (2005). Subjects – 18 and 24 right-handed, paid subjects gave written informed consent to participate in the IAC and tone experiments, respectively. We also conducted behavioral experiments with the same subjects and stimuli, the results of which are reported elsewhere (Chait et al. 2005; Chait et al. in preparation).

Stimuli – the signals in the IAC experiment were 1100-ms-long wide-band noise bursts, consisting of an initial 800-ms-long segment (reference correlation) that was either interaurally correlated (IAC = 1) or interaurally uncorrelated (IAC = 0), followed by a 300-ms segment with one of six fixed values of IAC: 1.0, 0.8, 0.6, 0.4, 0.2, 0.0. The purpose of the relatively long initial segment was to ensure that responses to the change in IAC did not overlap with those associated with stimulus onset. Stimuli were ramped on and off with 15-ms cosine-squared ramps and presented in random order with an inter-stimulus interval that varied randomly between 600 and 1400 ms. The signals in the tone experiment were 1440 ms long, consisting of an initial 840-ms pre-transition segment (either random or constant), immediately
followed by a 600-ms post-transition segment (either random or constant). Controls were 1440 ms random or constant throughout. Random sequences contained successive tone pips of a random frequency. We used three pip durations (15, 30 and 60 ms), which were presented in separate blocks. Frequencies were drawn from 20 frequency values equally spaced on a log scale between 222 and 2000 Hz. Tone and pip onsets and offsets were ramped with 3-ms cosine-squared ramps. Presentation order was randomized with an inter-stimulus interval between 600 and 1400 ms. The stimulus set also included a proportion (25–33%) of target stimuli (not analyzed), to which subjects were required to respond rapidly. These stimuli served to keep the listeners alert and attentive but did not involve IAC or tone-change processing.

Neuromagnetic recording and analysis – the magnetic signals were recorded using a 160-channel, whole-head axial gradiometer system (KIT, Kanazawa, Japan). The 20 strongest channels over the temporal regions of the head of each subject (5 sinks and 5 sources in each hemisphere) were determined in a pre-experiment (see Chait et al. 2005). Two measures of the dynamics of cortical processing are reported: the amplitude time course (increases and decreases in activation) as reflected in the root mean square (RMS) of the selected channels, and the accompanying spatial distributions of the magnetic field (contour plots) at certain times post onset. For illustration purposes, we plot the RMS derived from the grand average (average over all subjects for each of the channels), but the statistical analysis is always performed on a subject-by-subject, hemisphere-by-hemisphere basis, using the RMS values of the 10 channels chosen for each subject in each hemisphere. For each of the stimulus conditions 1500-ms epochs (including 200 ms pre-onset) were created. Approximately 100 epochs of each condition were averaged, low-pass filtered at 30 Hz, and baseline-corrected to the pre-onset interval. The consistency, across subjects, of peaks and magnetic field distributions was assessed automatically as described in Chait et al. (2005).
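As a rough illustration of the analysis just described, the following numpy/scipy sketch epochs a continuous recording around stimulus onsets, averages the epochs of one condition, low-pass filters the average at 30 Hz, baseline-corrects it to the pre-onset interval, and computes the RMS over a set of selected channels. This is not the authors' analysis code; the sampling rate, array layout and function names are assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1000                      # sampling rate in Hz (assumed)
pre, post = 0.200, 1.300       # 1500-ms epochs: 200 ms pre-onset, 1300 ms post-onset

def condition_rms(raw, onsets, picks):
    """raw: (n_channels, n_samples) continuous recording; onsets: sample indices
    of stimulus onsets for one condition; picks: indices of the 10 channels
    selected for one subject and hemisphere. Returns the RMS time course of
    the averaged evoked field over the selected channels."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = np.stack([raw[:, o - n_pre:o + n_post] for o in onsets])  # (epochs, ch, time)
    evoked = epochs.mean(axis=0)                                       # average ~100 epochs
    b, a = butter(4, 30, btype="low", fs=fs)                           # 30-Hz low-pass
    evoked = filtfilt(b, a, evoked, axis=-1)
    evoked -= evoked[:, :n_pre].mean(axis=-1, keepdims=True)           # baseline correction
    return np.sqrt(np.mean(evoked[picks] ** 2, axis=0))                # RMS over channels
```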
3 IAC Experiment Results
Clear transition responses were recorded from auditory cortex even though listeners performed a task that was irrelevant to the IAC change. In Chait et al. (2005) we show that this pre-attentive cortical sensitivity mirrors psychophysically measured sensitivity. Results for all conditions (including intermediate values of IAC) are reported in Chait et al. (2005). Here we concentrate on transitions between extreme values of IAC (1 and 0), denoted as 1→0 and 0→1. The auditory evoked responses to 1→0 and 0→1 transitions are shown in Fig. 2A,B. The origin of the time scale coincides with the onset of the signals and the change is introduced at 800 ms post onset.
Fig. 2 Measured data. RMS magnetic field measured in the IAC experiment (left) and tone experiment (right). Plotted in black is the root mean square (RMS) magnetic field over all 157 channels derived from the grand-average (average over subjects and stimulus repetitions) of the evoked auditory cortical field. The upper plot in each column corresponds to the order-to-disorder transition (correlated to uncorrelated, or constant to random). The bottom plots correspond to the opposite transition. The appropriate control conditions (no change) are plotted in grey. The onset of the stimulus and onset of change are marked with dashed lines. Contour plots indicate the topography of the magnetic field at critical time periods. Source = white, Sink = black. Onset response patterns differed between experiments. In the IAC experiment (left), stimulus onset responses to correlated (A) and uncorrelated (B) noise stimuli were comparable, and characterized by two peaks at approximately 70 ms and 170 ms post onset with similar magnetic field distributions (analogous to the classic M50 and M150 onset responses). In the tone experiment (right), onset responses were characterized by a single peak at approximately 110 ms post onset, with a topography characteristic of the classic M100 response. In contrast to the onset responses, transition responses were remarkably similar between experiments (compare A and C, B and D). Both experiments revealed the same asymmetry. Transitions from order to disorder (A and C) were characterized by peaks at ~70 ms and ~150 ms post change onset whereas transitions from disorder to order (B,D) were dominated by a single peak at around 150 ms. Furthermore, whereas both peaks in A and C exhibited an M50-like topography, the single peak in B and D was characterized by an opposite (M100 like) dipolar distribution
While the onset responses for correlated and uncorrelated noise are similar in temporal dynamics (two peaks at about 50 and 150 ms post onset) and spatial distribution, the responses to the IAC transitions (shaded) are markedly different. The change in the 1→0 condition (Fig. 2A) is characterized by two peaks at ~70 ms and ~150 ms post change. In contrast, the opposite, 0→1, transition evoked only one peak, at about 130 ms. Remarkably, the dipolar distribution of the peak in the 0→1 stimulus is of opposite polarity to the two peaks in the 1→0 condition, indicating that this response cannot be a delayed activation of the neural substrates responsible for the first peak in the 1→0 condition, and is most probably generated by a different neural mechanism.
4 Tone Experiment Results
The auditory evoked responses to constant-to-random and random-to-constant 30-ms pip-size stimuli are shown in Fig. 2C,D (results for other conditions, and regarding cortical adjustment to pip size, are reported elsewhere; Chait et al. 2007). The change is introduced at 840 ms post onset. The MEG activity evoked by the stimuli exhibits an onset response, about 100 ms after the onset of the stimuli, and a later response related to processing the change, which begins at about 900 ms post onset (60 ms post change). Unlike the onset responses, which are very similar across conditions, the transition responses are distinctly disparate in their temporal dynamics and field distributions. The change from a constant tone to a random sequence of tone pips evokes two consecutive deflections, at about 70 and 150 ms post change onset. The opposite transition evokes only one peak, occurring about 150 ms post change onset. As in the IAC data, the dipolar distribution of the transition response peak in Fig. 2D is of opposite polarity from the transition response peaks in Fig. 2C, indicating that the activity arises from a different neural substrate. Thus the response to the 'random-to-constant' transition does not merely reflect delayed activation compared to the 'constant-to-random' transition. Rather, the data suggest that the sequence of cortical activation is distinct in each case: processing of transitions in each direction involves different neural computations.
5 Discussion
‘Binaural sluggishness’ refers to the apparent insensitivity of the binaural system to rapidly varying interaural configuration. For example, it has been demonstrated that listeners become less sensitive to time-varying changes in interaural correlation as the change rate is increased (Grantham 1982). This ‘sluggishness’ is assumed to reflect the existence of a ‘binaural integration
window' that operates subsequent to the site of binaural interaction, i.e. at or above the IC. However, Joris et al. (2006) found no evidence of sluggishness in responses of IC cells to stimuli with dynamic interaural correlation. It is thus possible that the mechanisms we observe reflect the operation of this integration process. However, the similarity between the tone and IAC data presented here points to the fact that this mechanism is not special to binaural processing. The resemblance suggests that the responses we observe reflect the operation of a general auditory cortical change-detection mechanism, which handles in a common fashion changes along very different stimulus dimensions, and specifically transitions between irregularity and regularity. Therefore 'binaural sluggishness' might not be a result of special binaural integration windows but may in fact result from the same integration mechanisms that process monaural data. The apparently longer integration for binaural than for monaural stimuli (e.g. Kollmeier and Gilkey 1990) may stem from differences in the statistics of the two kinds of signals. By searching for a monaural stimulus (for example by adjusting the pip size/distribution properties of our tone stimuli) that results in the same response properties (peak latencies) as the binaural stimuli in the IAC experiment, we may be able to learn more about how binaural information is centrally represented.

Change detection in humans is usually investigated with the MMN paradigm (Polich 2003), which is based on comparing brain responses to deviant (low-probability) signals presented among standard (high-probability) signals. The MMN response (derived by subtracting the response to standards from the response to deviants) peaks between 100 and 200 ms and is interpreted as reflecting a process that registers a change in a sound feature and updates the ongoing representation (Winkler et al. 1996). MMN experiments are typically conducted with silent intervals between relevant stimuli, so that the time at which a stimulus is compared to the preceding one is defined by the experimenter. The experiments here target a stage that directly precedes the one probed with standard MMN techniques: in natural environments changes are superimposed on the continuous waveform that enters the ear, and a listener thus needs a mechanism to decide at which point in a continuous sound the change is introduced.

In transitions such as correlated-to-uncorrelated or constant-to-random (Fig. 1A) the change can be detected immediately by the system: the first waveform sample that violates the regularity rule suffices to signal the transition. However, in the opposite transition (from uncorrelated to correlated or from random to constant; Fig. 1B) the system must wait long enough to distinguish the transition from a momentary 'lull' in the fluctuation, such as might occur by chance. The amount of time an optimal listener would have to wait in order to detect the change depends on the statistical properties of the stimulus. This may explain the finding that the two kinds of transitions are handled by different cortical systems: the first deflection at ~70 ms post change, with an M50-like dipolar distribution, which occurs in situations where the change is immediately detectable (Fig. 2A,C), may reflect the operation of an 'obligatory cortical integration window'. The response at around 100–150 ms with the M100-like
dipolar distribution may reflect the operation of another, 'adjustable window', provided by a separate neural substrate, which integrates incoming information to reach a sufficient level of certainty that a change has occurred. Interestingly, these transition responses are similar to the properties of the 'pitch onset response' (POR; Krumbholz et al. 2003; Gutschalk et al. 2004; Ritter et al. 2005; Chait et al. 2006). The POR, hypothesized to reflect cortical pitch processing mechanisms, is evoked by transitions between irregular click trains, which do not have a pitch, and regular click trains, which are perceived to have a sustained temporal pitch (Gutschalk et al. 2004), or by transitions between white noise and iterated rippled noise (Krumbholz et al. 2003). However, it is also possible to describe this shift as a transition between states that differ along a more abstract dimension, such as degree of 'regularity' or 'order'. Our stimuli, whether constant/random tones or correlated/uncorrelated noise, are very simple examples of 'order' and 'disorder'. We have recently replicated this pattern of results with regularly alternating pip sequences (rather than a constant tone) and speculate that more complex stimuli involving transitions between regularity and irregularity should evoke similar response patterns to the ones observed here. Future experiments are needed to test the generality of the present findings, to clarify the relation between behavioral and brain responses to change, and to understand better the rules that determine factors such as integration time. In addition to studying the dynamics of change detection, the paradigm introduced here may serve as a methodological tool to explore the perceptual relevance of various sound dimensions: are changes in different features that are relevant to auditory objects (e.g. loudness, ITD, ILD, pitch) processed in the same way and equally quickly? Such an investigation may shed light on their relative importance in determining the emergence of auditory objects.

Acknowledgments. This work was supported by R01 DC05660 to DP. We thank Alain de Cheveigné for comments and discussion and Jeff Walker for excellent MEG technical support.
References

Chait M, Poeppel D, de Cheveigné A, Simon JZ (2005) Human auditory cortical processing of changes in interaural correlation. J Neurosci 25:8518–8527
Chait M, Poeppel D, Simon JZ (2006) Neural response correlates of detection of monaurally and binaurally created pitches in humans. Cereb Cortex 16:835–848
Chait M, Poeppel D, de Cheveigné A, Simon J (2007) Processing asymmetry of transitions between order and disorder in human auditory cortex. J Neurosci 27:5207–5214
Durlach NI (1963) Equalization and cancellation theory of binaural masking-level differences. J Acoust Soc Am 35:1206–1218
Grantham DW (1982) Detectability of time-varying interaural correlation in narrow-band noise stimuli. J Acoust Soc Am 72:1178–1184
Gutschalk A, Patterson RD, Scherg M, Uppenkamp S, Rupp A (2004) Temporal dynamics of pitch in human auditory cortex. Neuroimage 22:755–766
Joris PX, van de Sande B, Recio-Spinoso A, van der Heijden M (2006) Auditory midbrain and nerve responses to sinusoidal variations in interaural correlation. J Neurosci 26:279–289
Kollmeier B, Gilkey RH (1990) Binaural forward and backward masking: evidence for sluggishness in binaural detection. J Acoust Soc Am 87:1709–1719
Krumbholz K, Patterson RD, Seither-Preisler A, Lammertmann C, Lütkenhöner B (2003) Neuromagnetic evidence for a pitch processing center in Heschl's gyrus. Cereb Cortex 13:765–772
Polich J (2003) Detection of change: event related potential and fMRI findings. Kluwer Academic Press, Boston, MA
Ritter S, Gunter Dosch H, Specht HJ, Rupp A (2005) Neuromagnetic responses reflect the temporal pitch change of regular interval sounds. Neuroimage 27:533–543
Winkler I, Karmos G, Näätänen R (1996) Adaptive modeling of the unattended acoustic environment reflected in the mismatch negativity event-related potential. Brain Res 742:239–252
Comment by Duifhuis

Inspecting your results as presented in Fig. 2 (and supported by the additional data that you presented), it seems to me that the response to the 1→0 condition (A) is very similar to the negative of the response to the 0→1 transition (B), and vice versa. In addition, the MEG profile indicates opposite directions. Thus the question arises whether you have to consider the differences as resulting from different neural mechanisms, or just as the 'opposite' response of the same mechanism to the 'opposite' stimulus. The resulting positive peak then automatically comes later in B than in A, but could be generated by the same neural mechanism.
Reply

Figure 2 shows the root mean square of the magnetic field (computed over channels), so it is not meaningful to refer to the 'negative' of the response. The troughs of the RMS correspond to a lack of response (not a negative response) and therefore it cannot be flipped.
Comment by Riedel

EEG studies by Dajani and Picton (2006) and Lüddemann et al. (current volume) did not reveal strong differences in the evoked potential to changes from correlated to uncorrelated broadband noise and vice versa. Both conditions showed components P1, N1 and P2 with comparable amplitudes at similar latencies. Could you comment on possible reasons for the difference from your data?
References

Dajani HR, Picton TW (2006) Human auditory steady-state responses to changes in interaural correlation. Hear Res 219:85–100
Reply

Our data for detection of change in interaural correlation (IAC) are in line with several subsequent experiments investigating transitions between 'order' and 'disorder' in the frequency domain. The response asymmetry we observe in all of these experiments is consistent with MEG data for transitions between irregular and regular click trains (Gutschalk et al. 2002, 2004) and between noise and IRN (Lütkenhöner et al., in preparation; Rupp et al. 2005; Krumbholz et al. 2003), as well as with EEG data for transitions between randomly alternating and constant tones (Jones 2002). This consistency supports our interpretation of the responses as reflecting different demands on temporal integration in the course of change detection, specifically for transitions between regularity or 'order' and irregularity or 'disorder'. The discrepancy between your data and ours probably arises from the different procedures of stimulus presentation that you (as well as Dajani and Picton 2006) employed. Specifically, in your experiments segments with different IAC alternate regularly (a change occurs exactly every 800 ms), such that the task faced by the auditory cortex is not change detection per se. There is much evidence in the MEG/EEG literature indicating that in cases where a change is expected at a particular point in time (e.g. Lange et al. 2003) auditory cortex reacts faster and differently compared to situations where the change is unexpected (e.g. our studies, or the MMN literature). It is clear, however, that the only way to resolve this discrepancy is to investigate it experimentally via replication. We intend to do that at the earliest opportunity. Late update (June 2007): see Chait et al. (in press) for such a replication.
References

Chait M, Poeppel D, Simon JZ (in press) Stimulus context affects auditory cortical responses to changes in interaural correlation. J Neurophysiol
Dajani HR, Picton TW (2006) Human auditory steady-state responses to changes in interaural correlation. Hear Res 219:85–100
Gutschalk A, Patterson RD, Rupp A, Uppenkamp S, Scherg M (2002) Sustained magnetic fields reveal separate sites for sound level and temporal regularity in human auditory cortex. Neuroimage 15:207–216
Gutschalk A, Patterson RD, Scherg M, Uppenkamp S, Rupp A (2004) Temporal dynamics of pitch in human auditory cortex. Neuroimage 22:755–766
Jones SJ (2002) The internal auditory clock: what can evoked potentials reveal about the analysis of temporal sound patterns, and abnormal states of consciousness? Neurophysiol Clin 32:241–253
Lange K, Rösler F, Röder B (2003) Early processing stages are modulated when auditory stimuli are presented at an attended moment in time: an event-related potential study. Psychophysiology 40:806–817
Rupp A, Uppenkamp S, Bailes J, Gutschalk A, Patterson RD (2005) Time constants in temporal pitch extraction: a comparison of psychophysical and magnetic data. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York, pp 119–125
36 Wideband Inhibition Modulates the Effect of Onset Asynchrony as a Grouping Cue BRIAN ROBERTS1, STEPHEN D. HOLMES1, STEFAN BLEECK2, AND IAN M. WINTER2
1 Psychology, School of Life and Health Sciences, Aston University, Birmingham, UK, [email protected], [email protected]
2 Centre for the Neural Basis of Hearing, Physiological Laboratory, University of Cambridge, Cambridge, UK, [email protected]; [email protected]

1 Introduction

Onset asynchrony is arguably the most powerful grouping cue for the separation of temporally overlapping sounds (see Bregman 1990). A component that begins only 30–50 ms before the others makes a greatly reduced contribution to the timbre of a complex tone, or to the phonetic quality of a vowel (e.g. Darwin 1984). This effect of onset asynchrony does not necessarily imply a cognitive grouping process; instead it may result from peripheral adaptation in the response to the leading component in the few tens of milliseconds before the other components begin (e.g., Westerman and Smith 1984). However, two findings suggest that the effect of onset asynchrony cannot be explained entirely by peripheral adaptation. First, though the effect is smaller, the contribution of a component to the phonetic quality of a short-duration vowel is reduced when it ends after the other components (Darwin and Sutherland 1984; Roberts and Moore 1991). Second, the reduced contribution of a leading component to vowel quality can be partly restored by accompanying its leading portion with an additional pure tone (Darwin and Sutherland 1984). This extra sound, called the captor tone, was set to begin at the same time as the leading component, to end at vowel onset, and to be an octave higher than the leading component. The partial restoration, of about one third, cannot be attributed either to peripheral adaptation or to two-tone suppression, because the captor tone was too remote in frequency from that of the leading component. Instead, Darwin and Sutherland (1984) attributed this restoration to the formation of a perceptual group between the leading portion and the captor, based on their shared onset time and harmonic relations, leaving the remainder of the component to integrate into the vowel percept. This interpretation has been highly influential in the grouping literature, but to date has not been tested. The experiments reported here have used both psychophysical and physiological approaches to evaluate Darwin and Sutherland's interpretation.

2 Psychophysical Methods
The extent to which extra energy added to the first-formant (F1) region integrates into the percept of a vowel was investigated, under various conditions, using a paradigm developed by Darwin (1984). It is possible to construct a continuum of vowels, differing only in their F1 frequencies, that spans the boundary between the phonemes /ɪ/ and /ɛ/. In terms of the F1 values used to synthesize the vowels, the position of the phoneme boundary along this continuum will shift downwards if the effect of adding energy to a harmonic raises the perceived F1 frequency, and vice versa. For a fixed increase in the level of a harmonic near F1, the effect on perceptual integration of introducing an asynchrony or adding a captor can be assessed by comparing the size of the boundary shifts across conditions. All vowels were synthesized on a fundamental (F0) frequency of 125 Hz and were 56 ms long (including rise/fall times of 16 ms). The F1 frequency ranged from 350 to 550 Hz (in 10-Hz steps) and the bandwidth was 90 Hz. The frequencies of formants 2–5 were set to 2300, 2900, 3800, and 4600 Hz, with bandwidths of 120, 170, 1000, and 1000 Hz. The standard stimulus arrangements used are illustrated in Fig. 1.
Fig. 1 Schematic spectrograms showing stimuli from the four standard conditions
heard diotically at 69 dB SPL, comprised the vowel-alone condition. The level of the 4th harmonic could be boosted by adding in-phase a 500-Hz tone set to 6 dB above the original level of that harmonic. The extra tone could be added in synchrony with the vowel (incremented-4th condition) or extended to begin before the vowel (leading-4th condition). Each captor condition was created by accompanying the leading portion of the added 500-Hz tone with a captor, whose onset time and harmonic status could be varied relative to the leading portion. The captor was set to the same level as that of the added tone, except in one experiment in which the relative level of the captor was varied. Each captor condition had a corresponding captor-control condition that differed from its counterpart only in that the leading portion of the added 500-Hz tone was removed. These conditions were included to control for any effect of the captor other than through its interaction with the leading portion of the added tone. After initial training, listeners in each experiment (N = 12) heard ten randomized blocks of tokens from each condition and indicated on each trial whether the vowel token was perceived as /Ι/ or /ε/. The position of the phoneme boundary in each condition for each listener was estimated using probit analysis, and these boundary estimates were compared. The main purpose of the experiments was to measure the extent to which the asynchronyinduced reduction in the effect of the added 500-Hz tone is reversed by different kinds of captor. To this end, a measure called the restoration effect was computed. It was defined as the difference in Hz between the boundaries for the leading-4th and incremented-4th conditions minus the difference between the boundaries for the captor condition and its control. The efficacy of the different captors was compared using these restoration values. Each restoration effect can also be expressed as a percentage of the boundary difference between the leading-4th and incremented-4th conditions.
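As an illustration of how these measures could be computed, the sketch below (not the authors' analysis code) fits a cumulative-Gaussian psychometric function to the proportion of /ɛ/ responses as a function of F1 to estimate a boundary, and then combines four boundary estimates into the restoration effect; function names and the starting values for the fit are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def phoneme_boundary(f1_hz, prop_E):
    """Fit a cumulative-Gaussian (probit-style) psychometric function to the
    proportion of /E/ responses at each F1 value; the fitted mean is taken as
    the phoneme-boundary estimate in Hz."""
    def pf(f1, mu, sigma):
        return norm.cdf(f1, loc=mu, scale=sigma)
    (mu, sigma), _ = curve_fit(pf, f1_hz, prop_E, p0=[450.0, 40.0])
    return mu

def restoration_effect(b_leading, b_incremented, b_captor, b_control):
    """Restoration effect in Hz, and as a percentage of the boundary
    difference between the leading-4th and incremented-4th conditions."""
    asynchrony_shift = b_leading - b_incremented
    effect_hz = asynchrony_shift - (b_captor - b_control)
    return effect_hz, 100.0 * effect_hz / asynchrony_shift
```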
3 Psychophysical Findings: Part 1
Darwin and Sutherland's (1984) key finding that accompanying the leading portion of the added 500-Hz tone with a synchronous 1-kHz captor partly reverses the effect of the asynchrony was replicated in a series of experiments. However, the effects of varying the properties of the captor indicate that the restoration effect does not depend on either a shared onset time or a harmonic relationship between the captor and the leading portion of the added tone (Roberts and Holmes 2006). Figure 2 shows the boundary positions (means and inter-subject standard errors) and restoration values for an experiment in which the lead-time on the 4th harmonic was long (240 ms), and where the captor tone (1 kHz) either began at the same time as the leading component, or 160 ms before or after it. The overall mean restoration effect was about 35%, and the three types of captor were similarly effective in reducing the asynchrony-induced exclusion of the added 500-Hz tone.
Fig. 2 Effect of captor asynchrony
As a precaution against the possibility that the perceptual consequences of a difference in onset time between the captor and the added tone had worn off before vowel onset, an experiment was conducted using a short (40-ms) lead-time on the 4th harmonic. The 1-kHz captor either began at the same time as the leading 4th harmonic or 160 ms before it. Once again, both types of captor produced similar restoration effects.

Figure 3 shows the restoration values for an experiment in which the frequency of a synchronous-onset captor was varied from 900 to 2250 Hz. Some captor frequencies were harmonically related to the added 500-Hz tone (lead-time = 240 ms); others were not. Captor efficacy was found to depend on frequency proximity to the added tone rather than on harmonic relations. The captor effect is wideband, with an upper cut-off frequency around 1500 Hz. The negative restoration value associated with the highest-frequency captor may reflect captor-induced adaptation in the region of the vowel's second formant.

Fig. 3 Effect of the frequency proximity and harmonicity (indicated by 'H') of the captor

As a precaution against the possibility that captor efficacy is maintained only when the captor retains at least one of these cues (a common onset time or a harmonic relationship with the leading 500-Hz tone), an experiment was conducted that included a condition in which neither of these potential grouping cues was present. This kind of captor was just as effective as the others, and so it can be concluded
that captor efficacy does not depend on the perceptual grouping of the captor with the leading portion of the 4th harmonic.

The asynchrony-induced exclusion of the added 500-Hz tone from perceived vowel quality can also be reduced by attenuating its leading portion. This suggests that the effect of the captor may be mediated by a reduction in the effective level of the leading portion. To quantify this relationship, an experiment was conducted in which the leading portion of the added 500-Hz tone was unaccompanied but varied in level, relative to its continuation, over the range +6 to −24 dB, in 6-dB steps. An attenuation of the leading portion (lead-time = 160 ms) by 6 dB produced a return of about one third in the phoneme boundary towards its value for the vowel-alone condition. This suggests that accompanying the leading portion of the 500-Hz tone with a captor tone of equal level is equivalent to reducing its physical level by about 6 dB. It is proposed that a captor-induced reduction in the response to the leading portion of the 500-Hz tone arises from a wideband inhibitory interaction between the two components in the central auditory system.
4 Physiological Correlates
Wideband inhibitory cells have been identified by physiologists at several different levels along the auditory pathway. The first examples along this pathway are onset chopper (OC) cells, which are found in the ventral part of the cochlear nucleus (CN). Transient chopper (CT) cells in the CN receive both
narrowband excitatory input from the cochlea and wideband inhibitory input from OC cells (Ferragamo et al. 1998), and these connections are assumed to be between cells that share a similar best frequency (Pressnitzer et al. 2001). Figure 4 illustrates a neural processing scheme based on the interaction of these unit types that might plausibly account for the captor effect observed psychophysically using vowel stimuli. The excitatory response of CT cells with a best frequency (BF) of 500 Hz to the leading portion of the 4th harmonic will be reduced by inhibitory input from OC cells with the same BF, as they will also be stimulated by the leading portion. The addition of a remote captor tone at (say) 1 kHz will further reduce the excitatory response of the CT cells to the leading portion, as the captor also falls within the wide receptive field of the OC cells. As well as the increase in captor-induced inhibition on the response of CT cells to the leading portion, the offset of the captor as the vowel begins may also produce a transient rebound in the excitatory response through release from inhibition. Either of these responses might reduce the impact of the onset asynchrony on the 4th harmonic. Note that the frequency-proximity effect seen for captor efficacy is compatible with the typical bandwidth of OC cells (Winter and Palmer 1995; Palmer et al. 1996).

Fig. 4 A hypothetical neural circuit in CN to account for the captor effect

Single-unit recordings to a simplified analogue of the psychophysical stimuli were made in the CN of anaesthetized guinea pigs. This analogue comprised a two-component complex in which the higher-frequency component ended first. 'Tickle tone' response maps were used to identify and characterize units whose receptive fields included both inhibitory and excitatory regions. These included CT cells, and also units classified as sustained chopper, primary-like with notch, pauser, and onset cells. The lower of the two components (corresponding to the 4th harmonic) was chosen to match the BF of the unit and the
higher component (corresponding to the captor) was chosen to fall in an inhibitory sideband of the unit's receptive field. Figure 5 displays an example response from a typical unit to one of these stimuli. Note that both sustained inhibition and rebound components can be seen in the response. First, the presence of the short tone reduced the unit's response to the long tone, relative to the case when the short tone was absent; second, the termination of the short tone produced an abrupt increase in the unit's response to the ongoing long tone. All units tested showed sustained inhibition when the short tone was present, and 85% of them also showed rebound excitation when the short tone ended. The rebound effect was most prominent in CT cells and cells classified as primary-like with notch (Bleeck et al. 2005). These findings are consistent with an account of the psychophysical captor effect in terms of the proposed CN scheme.

Fig. 5 Example single-unit response in CN
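To make the hypothesized interaction concrete, here is a deliberately simplified rate-model sketch of the scheme in Fig. 4: a CT cell receives narrowband excitation at its BF and wideband inhibition from an OC cell, and a decaying 'rebound' term is added at captor offset to stand in for release from inhibition. All rates, weights and time constants are invented for illustration; this is not a model fitted to the recordings.

```python
import numpy as np

fs = 1000.0                                  # rate-function sampling rate (Hz)
t = np.arange(0.0, 0.400, 1.0 / fs)          # 400-ms time axis (s)

def tone_drive(onset, offset):
    """Rectangular 'drive' for a tone present from onset to offset (s)."""
    return ((t >= onset) & (t < offset)).astype(float)

lead_500 = tone_drive(0.000, 0.400)          # added 500-Hz tone: leads, then continues into the vowel
captor_1k = tone_drive(0.000, 0.240)         # 1-kHz captor accompanying only the leading portion

oc_rate = lead_500 + captor_1k               # OC cell: wideband, driven by both components

# CT cell at BF = 500 Hz: narrowband excitation minus wideband (OC) inhibition,
# plus a decaying 'rebound' bump after captor offset (release from inhibition).
w_inh, w_reb, tau = 0.4, 0.5, 0.050
rebound = w_reb * np.where(t >= 0.240, np.exp(-(t - 0.240) / tau), 0.0)
ct_rate = np.maximum(lead_500 - w_inh * oc_rate + rebound, 0.0)

print("mean CT rate during the leading portion:", ct_rate[t < 0.240].mean())
print("peak CT rate just after captor offset:", ct_rate[(t >= 0.240) & (t < 0.300)].max())
```

With the captor present, the sketch reproduces the two features emphasized above: a reduced response to the leading portion (extra wideband inhibition) and a brief increase at captor offset, i.e. at vowel onset.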
5 Psychophysical Findings: Part 2
The inhibitory account of the captor effect has been evaluated further, both in general and in terms of the proposed CN circuit. Figure 6 shows the effect of changing the level of a 1-kHz captor relative to that of the added 500-Hz tone. The restoration effect gradually declines as the captor level is attenuated, which is consistent with the known wide dynamic range of OC cells (Winter and Palmer 1995).
Fig. 6 Effect of relative captor level
The effect of three other manipulations on captor efficacy was explored (Holmes and Roberts 2006). Although two of the findings are not explicable in terms of the proposed CN circuit without modification, the general inhibitory account of the captor effect was supported. First, replacing a pure-tone captor with a noise-band captor, matched for centre frequency and energy, does not reduce captor efficacy. This is consistent with the known strong response to bands of noise by OC cells and other inhibitory units (e.g., Winter and Palmer 1995). Second, the reduced contribution of the added 500-Hz tone to vowel quality when it ends 240 ms after the vowel cannot be restored by accompanying the lagging portion with a synchronous 1-kHz captor. This suggests that the captor effect seen when the added tone begins before the vowel is mediated primarily by a rebound in excitation at captor offset, rather than by sustained inhibition while the captor is present. In the leading-tone context, rebound occurs at the same time as vowel onset, but in the lagging-tone context it occurs long after the vowel has ended. Given that units in CN generally show both types of response, it is unclear why only the rebound response appears to be important for producing a restoration effect. Third, accompanying the leading portion (320 ms) with a captor that ends before vowel onset (from 0 to 160 ms earlier) indicates that captor efficacy is lost only for a gap of more than 80 ms. This implies that the neural rebound following captor offset persists for at least 80 ms, which is long compared to the decay times typically observed for units in CN (e.g., Bleeck et al. 2005).
6 Conclusions
Captor efficacy does not depend on grouping cues, but rather appears to arise from wideband inhibitory interactions in the central auditory system. Single units in the CN show response properties of broadly the right form to explain the restoration effect, but the absence of a psychophysical counterpart to the sustained inhibition and the long psychophysical decay time for the restoration effect after captor offset are not consistent with the known responses of CN units. This suggests that the neural rebound hypothesized to mediate the captor effect either first arises further along the auditory pathway than the CN or is modified by processing subsequent to the CN. Whatever its point of origin along the auditory pathway, it is clear that the presence of wideband inhibition can produce patterns of behaviour that may be confusable with those arising from more cognitive processes of auditory grouping. Acknowledgments. This research was supported by Research Grants S17804 and S17805 from the BBSRC (UK). Figures 1–4 are adapted with permission from “Asynchrony and the grouping of vowel components: Captor tones revisited,” by B. Roberts and S.D. Holmes, 2006, J Acoust Soc Am 119:2905–2918. Copyright 2006 by the Acoustical Society of America.
References

Bleeck S, Ingham N, Verhey J, Winter IM (2005) Wideband suppression in cochlear nucleus: a role in grouping by common onset? ARO Abstr 28:236
Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA
Darwin CJ (1984) Auditory processing and speech perception. In: Bouma H, Bouwhuis DG (eds) Attention and performance X: control of language processes. Erlbaum, Hillsdale, NJ, pp 197–210
Darwin CJ, Sutherland NS (1984) Grouping frequency components of vowels: when is a harmonic not a harmonic? Q J Exp Psychol 36A:193–208
Ferragamo MJ, Golding NL, Oertel D (1998) Synaptic inputs to stellate cells in the ventral cochlear nucleus. J Neurophysiol 79:51–63
Holmes SD, Roberts B (2006) Inhibitory influences on asynchrony as a cue for auditory segregation. J Exp Psychol Hum Percept Perform 32:1231–1242
Palmer AR, Jiang D, Marshall DH (1996) Responses of ventral cochlear nucleus onset and chopper units as a function of signal bandwidth. J Neurophysiol 75:780–794
Pressnitzer D, Meddis R, Delahaye R, Winter IM (2001) Physiological correlates of comodulation masking release in the mammalian ventral cochlear nucleus. J Neurosci 21:6377–6386
Roberts B, Holmes SD (2006) Asynchrony and the grouping of vowel components: captor tones revisited. J Acoust Soc Am 119:2905–2918
Roberts B, Moore BCJ (1991) The influence of extraneous sounds on the perceptual estimation of first-formant frequency in vowels under conditions of asynchrony. J Acoust Soc Am 89:2922–2932
Westerman LA, Smith RA (1984) Rapid and short-term adaptation in auditory nerve responses. Hear Res 15:249–260
Winter IM, Palmer AR (1995) Level dependence of cochlear nucleus onset unit responses and facilitation by second tones or broadband noise. J Neurophysiol 73:141–159
37 Discriminability of Statistically Independent Gaussian Noise Tokens and Random Tone-Burst Complexes TOM GOOSSENS1, STEVEN VAN DE PAR2, AND ARMIN KOHLRAUSCH1,2
1 University of Technology Eindhoven, The Netherlands, [email protected]
2 Philips Research Laboratories Eindhoven, The Netherlands, [email protected], [email protected]

1 Introduction
Hanna (1984) has shown that noise tokens with a duration of 400 ms are harder to discriminate than noise tokens of 100 ms. This is remarkable because a 400-ms stimulus potentially contains four times as much information for judging dissimilarity as the 100-ms stimulus. Apparently, the ability to use all information in a stimulus is impaired by some kind of limitation, e.g. a memory limitation (cf. Cowan 2000) or a limitation in the ability to allocate attentional resources (cf. Kidd and Watson 1992). In a first experiment, this study examined the influence of the stimulus duration and bandwidth of Gaussian noise tokens on the ability to perform an auditory discrimination task. In a second experiment, the amount of potential information in a stimulus was decoupled from its duration in order to examine more carefully the properties of the memory or attention limitation that results in the discrimination impairment. Finally, a computational model that limits the amount of perceptual information is introduced as an attempt to model the findings of the first and second experiments.
2 Discrimination of Gaussian Noise Tokens

2.1 Method and Stimuli
This psychoacoustic experiment is a replication of, and partly an extension to, an experiment by Hanna (1984). It was executed to test the ability of listeners to discriminate between Gaussian noise tokens. The experiment was performed using a same-different procedure where two noise tokens were presented to the listener in each experimental trial. These noise
tokens were either identical or uncorrelated. For each trial, new noise samples were generated. The a priori probability of same and different presentations was 50%. Subjects were given feedback about the correctness of their answer after each trial. Three subjects, including the first and second authors, participated in the experiments, which were divided into several sessions of at most 1 h. Each experimental condition was measured in four blocks of 50 (subjects S1 and S2) or three blocks of 100 (subject S3) trials. The blocks were presented in random order. Each subject had a training session of at least 25 blocks of 100 trials before the actual experiment started.

The responses of the listeners were transformed into d′ values by calculating the percentages correct of same and different presentations for each condition. These percentages correct were converted to z-scores. Finally d′, the quantity of interest, was calculated by adding the z-scores of the same and different presentations. At chance performance, the d′ value equals zero. Above-chance performance results in positive d′ values; e.g. 69% and 84% correct for both same and different trials result in d′ values of approximately 1 and 2, respectively.

Measurement conditions were combinations of five noise bandwidths and nine durations. The −3 dB bandwidths were: 100–3300 Hz; 100–600 Hz; 225–275 Hz; 2800–3300 Hz; and 2975–3025 Hz. The specified durations before filtering were: 1.6, 6.4, 10.2, 16.1, 25.6, 40.6, 64.5, 102.4 and 409.6 ms. Noise tokens were produced by digitally generating broadband noise of the specified duration at 40 dB/Hz SL. Subsequently, the tokens were filtered with a Chebyshev Type II digital filter with slopes of at least 100 dB/oct. The stimuli included the ringing of the filters to avoid audible truncation effects. The stimuli were presented from a PC through a high-quality soundcard at 16-bit, 44.1-kHz sampling resolution on Beyerdynamic DT990Pro headphones.
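As a concrete illustration, the following short Python sketch (not the authors' code) computes d′ as described above, by z-transforming the percentages correct for 'same' and 'different' trials and adding them; the example values reproduce the approximate d′ of 1 and 2 quoted in the text.

```python
from scipy.stats import norm

def d_prime(p_correct_same, p_correct_different):
    """d' for one condition of the same-different task, computed as described
    in the text: z-transform the percent correct on 'same' and on 'different'
    trials and add the two z-scores."""
    return norm.ppf(p_correct_same) + norm.ppf(p_correct_different)

print(round(d_prime(0.69, 0.69), 2))   # ~1, as quoted in the text
print(round(d_prime(0.84, 0.84), 2))   # ~2
```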
2.2 Results and Discussion
Figure 1 shows the results of the Gaussian noise token discrimination experiment. The abscissa indicates the noise duration and the ordinate indicates the across-subject means of the d′ values. In order to get an estimate of the confidence intervals, a bootstrap procedure was used to create subsets of repeated measurement blocks across all subjects. The means of these subsets were used to calculate means and 95% confidence intervals. The bootstrap sample size was 1000.

Looking at the influence of bandwidth on the results, it can be observed that for the stimuli comprising low frequencies (dashed lines and solid line) discrimination ability improved with increasing bandwidth. For the high frequencies (dash-dotted lines) discrimination proved to be very difficult for subjects in all conditions.
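A minimal sketch of such a bootstrap is given below; the exact way blocks were pooled across subjects is an assumption made for the example, and this is not the authors' code.

```python
import numpy as np

def bootstrap_ci(block_dprimes, n_boot=1000, seed=0):
    """block_dprimes: 1-D array of per-block d' values pooled over subjects for
    one condition. Resamples blocks with replacement, recomputes the mean, and
    returns (mean, lower, upper) of a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    x = np.asarray(block_dprimes, dtype=float)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return x.mean(), lo, hi
```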
Fig. 1 Across-subject d′ means for Gaussian noise token discrimination of three subjects for broadband conditions (circles), 500-Hz wide bandpass conditions (squares), and 50-Hz wide bandpass conditions (triangles) at low frequencies (dashed lines) and at high frequencies (dash-dotted lines). The abscissa indicates the noise duration in ms. Error bars indicate 95% confidence intervals
With increasing duration, discrimination ability increased up to a maximum performance at durations of about 40 ms. Beyond this maximum, the ability to discriminate decreased with increasing duration. This degradation of discriminability is remarkable because the longer-duration stimuli contain more information for performing the discrimination. These results are similar to those found by Hanna (1984). It seems that listeners do not have access to all the information that is available in the stimulus. In the next experiment the duration effect will be studied more carefully.
3 Discrimination of Random Tone-Burst Complexes

In the previous experiment, the number of degrees of freedom in the stimulus increased with increasing duration. In this experiment the number of degrees of freedom is decoupled from the stimulus duration.

3.1 Method and Stimuli
The method used for this second experiment was the same as for the previous experiment – only the stimulus type was different. Instead of Gaussian noise tokens, random tone-burst complexes were used as stimuli. The random tone-burst complexes in this experiment consisted of a number of 5-ms Hanning-windowed tone-bursts placed at random time positions
within a time-frame of either 51.2 ms or 409.6 ms (cf. Fig. 2). This was done by multiplying a 70-dB SPL carrier tone (trace c in Fig. 2) of appropriate duration and frequency with a modulation envelope (trace m in Fig. 2). This modulation envelope was generated by adding a number of Hanning windows, depending on the condition, at random temporal positions within the stimulus duration. This way of stimulus generation did not introduce phase mismatches when two tone bursts of the same frequency overlapped. A random tone-burst complex allows the amount of spectral and temporal information in the stimulus to be distributed in a duration-independent way, whereas the amount of information in a Gaussian-noise token always increases with increasing duration.

Fig. 2 A random tone-burst signal (s) is generated by multiplying a carrier (c) with a modulation envelope (m) that consists of a number of Hanning windows additively placed at random temporal positions within the duration of the stimulus

In one set of conditions, the tone bursts all had a frequency of 607 Hz (ERB rate of 12). In another set of conditions there were tone bursts of seven frequencies, i.e. 208, 314, 444, 607, 808, 1057, and 1367 Hz (ERB rates of 6 up to and including 18 with a spacing of 2). In both sets, the number of tone bursts was 2, 4, 8, 16, 32, 64, 128, and 256 tone bursts per frequency, except that for the 51.2-ms conditions the 128 and 256 tone bursts per frequency were not measured. Per experimental condition, trials were presented in 4 blocks of 100 trials. Blocks were presented in random order, in sessions of at most 1 h, to four subjects, including the three subjects of the first experiment. Each subject had a training session of at least 8 conditions of 100 trials before the actual experiment started.
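The following Python sketch illustrates this stimulus construction. It is not the authors' code; the sampling rate and the omission of the absolute level calibration (70 dB SPL) are simplifying assumptions.

```python
import numpy as np

def tone_burst_complex(n_bursts, freq_hz, dur_s=0.4096, fs=44100, burst_s=0.005, seed=0):
    """Return one random tone-burst signal: a sinusoidal carrier multiplied by
    an envelope built from 5-ms Hanning windows added at random positions."""
    rng = np.random.default_rng(seed)
    n = int(round(dur_s * fs))
    win = np.hanning(int(round(burst_s * fs)))
    env = np.zeros(n)
    for start in rng.integers(0, n - len(win), size=n_bursts):
        env[start:start + len(win)] += win      # additive placement, as in the text
    carrier = np.sin(2 * np.pi * freq_hz * np.arange(n) / fs)
    return env * carrier   # one carrier phase throughout, so overlapping bursts never clash

# e.g. 16 bursts of a 607-Hz carrier in a 409.6-ms frame; a seven-frequency
# complex would be the sum of one such signal per carrier frequency.
signal = tone_burst_complex(16, 607.0)
```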
3.2 Results and Discussion
The results of the random tone-burst discrimination experiment are shown in Fig. 3. The across-subject means of the d′ values are indicated on the ordinate. The error bars show the 95% confidence intervals. Again, a bootstrap procedure was used to calculate the means and 95% confidence intervals.
Fig. 3 Across-subject d′ means for random tone-burst discrimination of four subjects, plotted against the total number of tone bursts (left panel) and against the number of tone bursts per frequency (right panel). Error bars indicate 95% confidence intervals
The left and right panels show the same results, only plotted against different abscissas. The left panel shows the results as a function of the total number of tone bursts, whereas the right panel shows them as a function of the number of tone bursts per frequency. Note that the conditions are much more alike in the right panel, implying that the total number of tone bursts is less predictive of the ability to discriminate than the number of tone bursts per frequency. However, it has to be noted that this trend is less clear for individual subjects than in the average data. All conditions show the same trend, except that the single-frequency conditions (circles) with two tone bursts show worse performance than the conditions with four tone bursts. Again, this means that more potential information in a stimulus does not necessarily improve discrimination performance. Moreover, in contrast to the first experiment, the duration of the stimulus did not show a strong effect on performance. This supports the assumption that the cause of the degradation of performance for longer durations in the first experiment lies in a limitation of the amount of information that can be kept in memory rather than in a limitation related only to stimulus duration.
4 A Discrimination Model

4.1 Method
Several psychoacoustic models have been proposed for predicting the discrimination of stimuli (e.g., Dau et al. 1996). These make use of an internal representation (IR) of a stimulus. It is inherent to the IR approach that the number of degrees of freedom of the IR increases with the duration of the
stimulus; i.e. there is more information. According to such models, long-duration stimuli are easier to discriminate than short-duration stimuli, which contradicts the findings in the first experiment of this paper and those of Hanna (1984). In this chapter a model is introduced that fixes the number of degrees of freedom in the IR. It consisted of three stages, of which the first computed an IR of both stimuli in a trial according to the model by Dau et al. (1996). This model includes outer and middle-ear filtering (not used in the present simulations), basilar-membrane filtering (gamma-tone filter bank, 52 filters from 20 Hz to 10 kHz), inner hair cell, adaptation, temporal smoothing, and addition of internal noise. The number of degrees of freedom of this IR was still dependent on the duration.

The second stage reduced the degrees of freedom to a fixed amount regardless of the duration of the stimuli. This was done by simply windowing each auditory channel of the IR with 75%-overlapping Hanning windows of a length of 1/3 of the stimulus duration (including 150 ms of ringing of the filters), and by taking the average IR across each window. Note that, by doing so, the length of the Hanning windows depends on the stimulus duration and thus the number of values obtained in each auditory channel is fixed. The window averages together form a new set of values, or points, which essentially is a reduced, fixed-size IR.

The last stage was a decision device that decided whether the two stimuli presented in a trial were the same or different. For this purpose it first calculated a decision variable by taking the sum of squares of the difference in IR, integrated across time and frequency. Then, a decision noise was added. Finally, the decision variable was compared to a criterion. If it was larger than the criterion, the decision device decided that the stimuli were different; otherwise it decided they were the same. The criterion was determined heuristically. At the start of a block of 100 trials, the criterion was set to a fixed arbitrary value. The criterion was then adjusted after each trial by storing the values of the decision variable in two separate stores: one for values from trials that, according to the feedback, were 'same' trials, and the other for values from 'different' trials. By making Gaussian fits, the decision device calculated means and variances for both cases to determine a new criterion using maximum-likelihood estimates. In every subsequent trial, the model adapted to a more accurate criterion and its performance improved. After about six to ten trials the criterion had converged to a reasonable value.

The variance of the memory noise was chosen such that the sum of squares of the difference between the model simulations and the experimental data, expressed in d′, was minimal. Different variances of the observation noise were needed for the simulation of the first and second experiments. This reflects the different nature of the stimulus features in the two experiments. Per condition, trials were presented to the model in 8 blocks of 100 trials.
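The second and third stages might be sketched as follows. This is a schematic re-implementation for illustration only: the Dau et al. (1996) front end is not included, and the adaptive criterion is represented by a fixed argument rather than by the trial-by-trial maximum-likelihood update described above.

```python
import numpy as np

def reduce_ir(ir):
    """ir: (n_channels, n_samples) internal representation, e.g. from the
    Dau et al. (1996) front end (not implemented here). Each channel is
    windowed with 75%-overlapping Hanning windows of length 1/3 of the
    duration and averaged, so the number of points per channel is fixed
    regardless of stimulus duration."""
    n = ir.shape[1]
    wlen = max(n // 3, 1)
    hop = max(wlen // 4, 1)
    win = np.hanning(wlen)
    starts = range(0, n - wlen + 1, hop)
    return np.stack([(ir[:, s:s + wlen] * win).sum(axis=1) / win.sum()
                     for s in starts], axis=1)

def decide_same(ir_a, ir_b, criterion, noise_sd=1.0, seed=0):
    """Decision device: sum of squared differences between the reduced IRs,
    summed over time and frequency, plus Gaussian decision noise, compared
    with a criterion."""
    rng = np.random.default_rng(seed)
    dv = np.sum((reduce_ir(ir_a) - reduce_ir(ir_b)) ** 2) + rng.normal(0.0, noise_sd)
    return dv <= criterion          # True -> respond 'same', False -> 'different'
```

Because the window length scales with stimulus duration while the hop stays at a quarter of the window, the reduced IR always contains roughly the same number of points per channel, which is the property the model uses to explain the non-monotonic duration effect.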
4.2 Results and Discussion
The left panel of Fig. 4 shows the results of the model simulations of the first experiment. As can be observed when comparing this panel to Fig. 1, the order of the bandwidth dependencies was correctly predicted by the model, although the model showed overly strong spectral integration, especially for shorter durations. For example, in Fig. 4 (left panel) the simulated broadband conditions (circle markers) do not coincide with the simulated low-frequency 500-Hz conditions (dashed line, square markers) in the way that they do in the psychoacoustic data. The effect of too much spectral integration can also be observed in the right panel of Fig. 4, where the results of the model simulations of the second experiment are shown. It is demonstrated by the fact that the seven-frequency conditions at the 51.2-ms duration (dotted line with cross markers) did not coincide with the corresponding 409.6-ms conditions (solid line with cross markers) as they did in the psychoacoustic data shown in Fig. 3. In contrast, these two curves do coincide in the single-frequency conditions, where no frequency integration was possible. The model correctly predicts the dependence of discriminability on stimulus duration for the first experiment, showing a maximum at about 40 ms. This model behavior is related to the fixed number of points in the fixed-size IR, which is independent of stimulus duration. As a result, the discriminability of the stimulus now only depends on the variance of each of the points in the fixed-size IR. When the stimulus duration is about 40 ms, the stimulus has a number of degrees of freedom comparable to the number of points in the IR; as a result the variance in the internal representation is large, yielding high discrimination performance of the model. For shorter-duration stimuli, fewer degrees of freedom will be present in the stimulus and the points within the internal representation will be highly correlated,
Fig. 4 Model results (d′) for the data of Fig. 1 (left panel, as a function of duration in ms) and the data of Fig. 3 (right panel, as a function of the number of tone bursts per frequency)
reducing the discriminability. For longer-duration stimuli, the number of degrees of freedom will be much larger than the number of points in the IR, and the averaging will reduce the variance in the points of the fixed-size IR and thus the discriminability. A final remark about the model: analysis has shown that the variability in the fixed-size IR takes place mostly at the onset and offset of the stimulus. For this implementation of the model it would be preferable to have more homogeneous variability, as we know from other studies, e.g. Fallon and Robinson (1992), that listeners are also able to discriminate based on information in the middle of a stimulus.
5 General Discussion
The simulations show that a simple method for limiting the amount of information in the internal representation can already account for a large portion of the degraded discriminability in the experiment with noise tokens of various durations. It should be noted, however, that the method of limiting the amount of available information presented in this study should not be taken as a definitive model of memory limitation; the exact nature of this limitation needs further investigation. There is a difference in the processing of spectral and of temporal information. While increasing the amount of temporal information in a stimulus can impair the ability to discriminate, increasing the amount of spectral information never led to such impairments in this study. The random tone-burst experiment, in which the amount of information was decoupled from duration, has shown that the ability to discriminate is not so much a function of the duration but rather of the number of degrees of freedom, or the amount of information, in a stimulus.
References
Cowan N (2000) The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci 24:87–185
Dau T, Püschel D, Kohlrausch A (1996) A quantitative model of the "effective" signal processing in the auditory system. I. Model structure. J Acoust Soc Am 99:3615–3622
Fallon SM, Robinson DE (1992) Discrimination of bursts of reproducible noise. J Acoust Soc Am 92:2630–2635
Hanna TE (1984) Discrimination of reproducible noise as a function of bandwidth and duration. Percept Psychophys 36:409–416
Kidd GR, Watson CS (1992) The "proportion-of-the-total-duration rule" for the discrimination of auditory patterns. J Acoust Soc Am 92:3109–3118
Comment by Emiroğlu
Did you try to "squeeze" the temporal dimension in the model, i.e. to average the "internal representation" over all time steps or over a certain minimal duration, respectively, before feeding the IR into the decision device (i.e. the optimal detector)? I think the brightness (or the "brightness melody" if the stimulus is long enough) is the distinguishable and memorizable cue here, and the "spectral percept" probably needs a minimal time span to establish/recognize one "brightness".
Reply
Two alternative approaches to reducing the information in the internal representation (IR) are addressed in your question. The first approach is to collapse the time dimension of the IR by taking the average over its full duration, resulting in a one-dimensional IR with only spectral information in it. Simulations show that the duration at which the ability to discriminate is maximal is inversely related to the number of (temporal) degrees of freedom in the IR. Hence, averaging across the complete duration does not improve the predictions of the model. The second approach is to take a fixed window length (the minimal duration in your comment) instead of making the window length dependent on the stimulus duration. This would cause the degrees of freedom in the IR to increase with increasing duration. As a consequence, the ability to discriminate would also increase, and this is not what we observe in the measurements.
38 The Role of Rehearsal and Lateralization in Pitch Memory CHRISTIAN KAERNBACH1, KATHRIN SCHLEMMER2, CHRISTINA ÖFFL3, AND SANDRA ZACH3
1 The Role of Rehearsal in Pitch Memory
In classical short-term memory (STM) for categorical information it is a well-known fact that the lifetime of a trace can be lengthened ad infinitum by rehearsing the stored information. If one wishes to measure the lifetime of a trace in STM, one therefore needs to prevent rehearsal. For instance, in the classical Brown-Peterson paradigm (Brown 1958; Peterson and Peterson 1959), participants are prevented from rehearsing by articulatory tasks such as counting backwards. Auditory sensory memory has been shown to share many characteristics with categorical STM (Kaernbach 2004). A major difference, however, seems to be that rehearsal is not effective with sensory information. Keller et al. (1995) note that in standard delayed pitch comparison tasks no measures are taken to prevent rehearsal, yet auditory information is nevertheless lost over time. Moreover, Demany et al. (2001, 2004) failed to demonstrate a beneficial influence of perceptual attention. It is quite natural to sing or hum a pitch, and intuitively one might think of this as helpful for the retention of the pitch trace. If rehearsal is overt, the recorded voice pitch might also help to elucidate the mechanisms underlying sensory retention. Therefore, the present study compared overt, covert, and no-rehearsal conditions, making use of the recorded pitch data in the overt rehearsal condition in order to understand the performance in these three conditions.
1.1 Methods
Participants (N = 3) had to compare two stimuli (S1 and S2) that were separated by a certain retention interval. The second stimulus was slightly higher or slightly lower than the first stimulus, and participants had to indicate which of these two possibilities was the case.
1 Institut für Psychologie, Christian-Albrechts-Universität zu Kiel, Germany, http://www.psychologie.uni-kiel.de/emotion
2 Kognitionspsychologie, Institut für Psychologie, Humboldt Universität, Berlin, Germany; Institut für angewandte Familien-, Kindheits- und Jugendforschung, Universität Potsdam, Burgwall 15, 16727 Oberkrämer, Germany
3 Psychologische Methodik und computergestützte Modellierung, Institut für Psychologie, Karl-Franzens-Universität, Graz, Austria
In order to facilitate rehearsal we employed Shepard tones (Shepard 1964). These tones allow participants to rehearse the presented tone at whatever octave is most appropriate to them. The duration of the tones was 1 s, with 0.1-s ramps at the onset and the offset of the tones. The chroma of the first stimulus S1 was randomized uniformly within one octave. Prior to the main experiment we determined the just noticeable difference (JND) for these stimuli (inter-stimulus interval 0.5 s) for each participant. In an unforced-choice adaptive procedure (Kaernbach 2001) we estimated the point of the psychometric function where 75% of the judgments were correct. In the main experiment we tested four different conditions. These conditions were tested blockwise in blocks of 20 trials of the same type. Participants performed several training blocks until they felt at ease with the different tasks. Then they performed 20 blocks of each condition in circulating order. In one condition, the duration of the retention interval was 0.5 s. With this short retention interval, no specific rehearsal instruction was given. The second stimulus S2 differed from the first stimulus S1 by −5/3, −3/3, −1/3, +1/3, +3/3, or +5/3 JNDs. In three other conditions, the duration of the retention interval was 6 s. The difference between S2 and S1 was taken randomly from −3, −1, +1, or +3 JNDs. These three conditions differed only by the rehearsal instruction. The first rehearsal instruction was "no rehearsal". It was symbolized by a mandala appearing on the computer screen, and the participants were told not to focus their attention on the pitch of the first stimulus until the second stimulus appeared. The other two conditions were rehearsal conditions, with silent vs overt rehearsal. They too were symbolized by icons on the computer screen, but this time the participants had to direct their attention towards the pitch of S1. In the overt rehearsal condition we recorded the voice during the retention interval.
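As an illustration of the stimulus construction, a Shepard tone can be approximated by summing octave-spaced partials under a fixed spectral envelope. The sketch below (Python/NumPy) is not the code used in the experiment; the envelope shape, frequency range and sampling rate are illustrative assumptions.

import numpy as np

def shepard_tone(chroma_hz=261.6, fs=44100, dur=1.0, ramp=0.1,
                 f_lo=30.0, f_hi=10000.0, center=500.0, sigma_oct=1.5):
    """Sum octave-spaced partials of the given chroma under a Gaussian
    log-frequency envelope, then apply raised-cosine on/off ramps."""
    t = np.arange(int(dur * fs)) / fs
    x = np.zeros_like(t)
    f = chroma_hz
    while f > f_lo:                      # step down to the lowest partial
        f /= 2.0
    f *= 2.0
    while f < f_hi:
        w = np.exp(-0.5 * (np.log2(f / center) / sigma_oct) ** 2)
        x += w * np.sin(2 * np.pi * f * t)
        f *= 2.0
    n_ramp = int(ramp * fs)              # raised-cosine (Hann) onset/offset ramps
    env = np.ones_like(x)
    ramp_win = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    env[:n_ramp] = ramp_win
    env[-n_ramp:] = ramp_win[::-1]
    return x * env / np.max(np.abs(x))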
1.2 Pitch Discrimination
The data of the three participants were similar when compared on a JND scale. They were therefore combined into a single dataset. Figure 1 shows the psychometric functions for all four conditions. The psychometric function for the 0.5-s retention condition is much steeper than the other three, illustrating that the precision of the trace is much higher after 0.5 s than after 6 s. The differences between the other three psychometric functions are not marked.
1.3 Voice Pitch Analysis
In each single trial of the overt rehearsal condition, voice pitch during the retention interval was determined with the YIN algorithm (de Cheveigné and Kawahara 2002). The voice pitch was set into relation to the nearest octave
Fig. 1 Pitch discrimination performance
Fig. 2 Mean mirrored voice error during overt rehearsal for successful (light gray symbols) and unsuccessful (dark symbols) trials
component of the S1 stimulus. On average it was 1.2% lower than this component. The standard deviation relative to the nearest component was 43%, i.e. nearly half a semitone. Further analysis of the voice pitch during overt rehearsal, beyond mean and standard deviation, was carried out within the scope of a random-walk model of sensory retention that will be put forward in a future publication. Figure 2 shows the mean voice error during overt rehearsal as a function of time. Voice data of trials with S2<S1 have been mirrored at the nearest matching component of S1. The data show a clear tendency of increasing voice error
over time. Triangles are for trials with a 3-JND difference, and squares are for trials with a 1-JND difference between S2 and S1. Also plotted in Fig. 2 are the predictions of the random-walk model for these four voice-error curves. Solid lines give the model prediction for a 3-JND difference, and dotted lines for a 1-JND difference.
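The relation of the rehearsed voice pitch to the nearest octave component of S1, and the mirroring of trials with S2<S1, can be sketched as follows (Python/NumPy; the function names are ours, and f0_track stands for the YIN output of one retention interval).

import numpy as np

def voice_deviation_semitones(f0_track, s1_chroma_hz):
    """Deviation of each voiced frame from the nearest octave component of S1,
    in semitones (negative values: voice below the component). f0_track should
    contain voiced frames only."""
    octaves = np.log2(np.asarray(f0_track, dtype=float) / s1_chroma_hz)
    dev_oct = octaves - np.round(octaves)     # fold onto [-0.5, 0.5) octaves
    return 12.0 * dev_oct

def mirrored_error(deviation, s2_below_s1):
    """Mirror trials in which S2 was below S1, so that upward and downward
    comparisons can be averaged on a common axis."""
    return -deviation if s2_below_s1 else deviation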
1.4 Discussion
The processes going on during retention might, after all, not be that different between the rehearsal conditions. The analysis of the recorded voice during overt rehearsal provides insight into the random-walk behaviour of the trace of S1, and it yields parameters for this random-walk process that are well compatible with the performance in the other rehearsal conditions. It therefore seems unjustified to treat these conditions as evoking different retention mechanisms. More generally, it seems unjustified to term attention effects on sensory retention "rehearsal". It could well be that attention cannot change the rate of information loss during retention.
2 The Role of Lateralization in Pitch Memory
In a second study we investigate the role of lateralization in pitch memory. Deutsch (1980) found that moderately left-handed participants perform better in a pitch memory task than right-handed and extremely left-handed participants, and speculated that this was due to a different lateralization of pitch memory. We test these findings with her original task and with a pitch memory task similar to the one in the rehearsal study, and we determine the localization of pitch memory in these participants with functional magnetic resonance imaging.
References
Brown J (1958) Some tests of the decay theory of immediate memory. Q J Exp Psychol 10:12–21
de Cheveigné A, Kawahara H (2002) YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am 111:1917–1930
Demany L, Clément S, Semal C (2001) Does auditory memory depend on attention? In: Breebaart DJ, Houtsma AJM, Kohlrausch A, Prijs VF, Schoonhoven R (eds) Physiological and psychophysical bases of auditory function. Shaker Publishing BV, Maastricht, pp 461–467
Demany L, Montandon G, Semal C (2004) Pitch perception and retention: two cumulative benefits of selective attention. Percept Psychophys 66:609–617
Deutsch D (1980) Handedness and memory for tonal pitch. In: Herron J (ed) Neuropsychology of left-handedness. Academic Press, New York, pp 263–271
Kaernbach C (2001) Adaptive threshold estimation with unforced-choice tasks. Percept Psychophys 63:1377–1388
Kaernbach C (2004) The memory of noise. Exp Psychol 51:240–248
Keller TA, Cowan N, Saults JS (1995) Can auditory memory for tone pitch be rehearsed? J Exp Psychol Learn Mem Cogn 21:635–645
Peterson LR, Peterson MJ (1959) Short-term retention of individual items. J Exp Psychol 58:193–198
Shepard RN (1964) Circularity in judgments of relative pitch. J Acoust Soc Am 36:2346–2353
39 Interaural Correlation and Loudness JOHN F. CULLING AND BARRIE A. EDMONDS
1 Introduction
An extensive literature shows that when the same sound is presented to both ears rather than to just one, its overall loudness is greater, a phenomenon called binaural summation (Reynolds and Stevens 1960). A much smaller literature on the effect of interaural correlation on binaural loudness indicates that it has no additional effect (Dubrovskii and Chernyak 1969; Dubrovskii et al. 1972). However, Culling et al. (2001) found that listeners judged a band of noise spectrally flanked by correlated noise as progressively louder as its interaural correlation decreased towards zero. Culling et al. attributed the effect to the mechanism of binaural unmasking (e.g. Durlach 1972; Culling and Summerfield 1995). Further, Culling et al. (2003) found that, provided the flanking noise was not spectrally contiguous with the target band, the target became less loud as the correlation was reduced further towards minus one. Both of those articles featured experiments that employed correlated flanking noise surrounding the manipulated target band in an attempt to make the task relatively similar to a broadband binaural unmasking task. Both articles focussed on the loudness dimension of the stimuli by employing a loudness discrimination task. However, the second article also employed a three-interval odd-one-out task, without instruction to attend to loudness. In that experiment, the perceived width of the auditory image, which also changes with interaural correlation, might have provided an alternative cue; the flanking bands were used in an attempt to reduce the salience of this cue. The present chapter examines further the effect of interaural correlation on loudness, but using a loudness-matching paradigm in the absence of flanking bands. As such, it bears a much greater similarity to the conventional loudness literature and, in particular, to the literature on binaural loudness summation. Remarkably, a search of that literature found only two direct references to any potential effect of interaural correlation (Dubrovskii and Chernyak 1969; Dubrovskii et al. 1972). The experiments reported there
School of Psychology, Cardiff University, Tower Building, Park Place, Cardiff, CF10 3AT, UK,
[email protected],
[email protected]
indicated that interaural correlation had no effect on loudness. A further article (Zwicker and Zwicker 1991) indicated that a 20% increase in loudness ratings can result from rapidly alternating a noise source from ear to ear (at 7, 49 and 343 cycles/s), compared to a slower alternation rate (1 cycle/s). It is possible that this phenomenon may be attributable to an effect of interaural correlation, since such alternation can create an interaurally uncorrelated stimulus within a finite temporal analysis window. The maximal loudness they observed from this effect was slightly less than occurred when both ears were continuously stimulated (a stimulus 3 dB more intense).
2 Method
Stimuli were generated online in MATLAB at a 20-kHz sampling rate. Broadband noises of 500 ms duration were generated digitally and band-pass filtered. Filtering was performed by applying a discrete Fourier transform to the entire waveform, zeroing the frequency bins outside the passband, and then applying the inverse transform. The cut-off frequencies for the band-pass noises were 460–540 Hz, 100–900 Hz, or 100–5000 Hz. These band-pass filtered noises were then presented to the listener in one of four interaural configurations: monaural, correlated, anticorrelated and uncorrelated. In 'monaural' stimuli, a single noise band was presented, randomly, at either the left or the right ear of the listener. For the 'correlated' stimuli, identical noise bands were presented to the left and right channels of the headphones (correlation = 1). For 'anticorrelated' stimuli, a single noise band was created for the left channel of the stimulus and then copied and inverted at the right channel (correlation = −1). Finally, for 'uncorrelated' stimuli, two independent noise bands were presented at the left and right channels of the headphones (correlation ≈ 0). An adaptive-matching paradigm was used to measure the difference in intensity between stimuli that equated them in loudness. On each trial, listeners were presented with two stimuli in random order (separated by a short silent interval). One of these stimuli was the "reference" stimulus and the other the "tracked" stimulus. Listeners were asked to judge which interval contained the louder stimulus. The intensity of the reference stimulus was kept at a constant rms throughout the experiment (measured at 57 dB (A) for the 460–540-Hz and 100–900-Hz bandwidths; 60 dB (A) for the 100–5000-Hz bandwidth), but the intensity of the tracked stimulus was adjusted in accordance with a one-up/one-down adaptive track (Levitt 1971). If the listener judged the tracked stimulus to be louder on a given trial, the intensity of the tracked stimulus was decreased for the next trial of that type; if the listener judged the reference to be louder, the intensity of the tracked stimulus was increased. The step size was ±2 dB for the first two reversals and ±1 dB thereafter.
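A sketch of the stimulus construction is given below. It is written in Python/NumPy rather than the original MATLAB, and the function and variable names are ours; it is intended only to make the four interaural configurations concrete.

import numpy as np

FS = 20000  # 20-kHz sampling rate

def bandpass_noise(f_lo, f_hi, dur=0.5, fs=FS, rng=np.random.default_rng()):
    """Gaussian noise band made by zeroing FFT bins outside the passband."""
    n = int(dur * fs)
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n)

def make_stimulus(config, f_lo, f_hi, rng=np.random.default_rng()):
    """Return an (n_samples, 2) stereo array for one interaural configuration."""
    x = bandpass_noise(f_lo, f_hi, rng=rng)
    if config == 'correlated':              # identical bands, correlation = 1
        return np.column_stack([x, x])
    if config == 'anticorrelated':          # inverted copy, correlation = -1
        return np.column_stack([x, -x])
    if config == 'uncorrelated':            # independent bands, correlation ~ 0
        y = bandpass_noise(f_lo, f_hi, rng=rng)
        return np.column_stack([x, y])
    if config == 'monaural':                # random ear, other channel silent
        stereo = np.zeros((len(x), 2))
        stereo[:, rng.integers(2)] = x
        return stereo
    raise ValueError(config)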
A total of 72 listeners were recruited from among Cardiff University undergraduates; 24 listeners were tested at each bandwidth. Half of the listeners for a given bandwidth started all staircases with the tracked stimulus 3 dB more intense than the reference, whilst the other half started each staircase with the tracked stimulus 3 dB less intense than the reference. Sixteen loudness matches (i.e. all pairwise comparisons of the 4 interaural configurations) were obtained from each listener, using 16 concurrent staircases with randomly interleaved trials. Testing continued until every staircase had reached at least 12 reversals. The matching intensity offset was taken to be the mean intensity of the tracked stimulus over the last 8 trials of each staircase. Listeners sat in a single-walled IAC sound-attenuating booth. Instructions were delivered through a computer monitor visible through the booth window and responses were taken with the mouse using a graphical interface. No feedback was given. All stimuli were presented using an Edirol UA-20 soundcard, MTR HPA-2 headphone amplifier and Sennheiser HD650 headphones.
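The adaptive track itself follows the standard one-up/one-down rule (Levitt 1971) with the step sizes given above; a minimal sketch of the level update (our own helper, not the authors' code):

def update_track(level_db, tracked_judged_louder, n_reversals):
    """One-up/one-down rule: lower the tracked level if it was judged louder,
    otherwise raise it. Step size is 2 dB before the third reversal, 1 dB after."""
    step = 2.0 if n_reversals < 2 else 1.0
    return level_db - step if tracked_judged_louder else level_db + step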
3 Results
Figures 1–3 show the mean matched intensity offsets in dB for the three different stimulus bandwidths. Each plot shows 16 measurements corresponding to all possible combinations of the 4 stimulus types, including the identity conditions, in which listeners were required to match the intensity of 2 similarly constructed stimuli. A dotted line through the middle of each plot at a matching offset of 0 dB indicates the expected match for the identity conditions. Average matches for the identity conditions did not differ significantly from this expectation, since the zero line is always within the error bars, which represent 95% confidence intervals. Matches for each pairing of different stimulus types appear twice, once with each of the two stimulus types in the pairing as the tracked stimulus. Data for each of the three bandwidths were subjected to separate two-way (4 reference stimulus types × 4 tracked stimulus types) analyses of variance, with α adjusted for three tests (Bonferroni). All three analyses indicated significant effects of both reference and tracked stimulus type (F(3,69)>90, p<0.001), with no interaction. Figures 1–3 show a strong binaural summation effect in which a monaural stimulus (squares) is matched in loudness to a binaural one (other symbols) that is 4–8 dB less intense. Tukey HSD post-hoc tests showed that this difference was significant at each bandwidth, for each type of binaural stimulus, and with the monaural stimulus both as the reference and as the tracked stimulus (q>15, p<0.001, in all cases). Figure 1 shows the results for the narrowest bandwidth, 460–540 Hz. As well as a 4.6-dB difference between monaural and correlated stimuli, the figure shows an effect of interaural correlation, in which uncorrelated noise (circles)
Fig. 1 Mean intensity offsets at matched loudness for 460–540-Hz bandwidth stimuli. Means plotted with 95% confidence intervals for the four reference stimuli for each of the four types of tracked stimuli
Fig. 2 Mean intensity offsets at matched loudness for 100–900-Hz bandwidth stimuli. Means plotted with 95% confidence intervals for the four reference stimuli for each of the four types of tracked stimuli
is matched in loudness with a correlated noise (inverted triangles) that is 1.9 dB more intense and with anticorrelated noise (upright triangles) that is 1.6 dB more intense. These differences were also statistically significant (q>4, p<0.01, in each case). However, anticorrelated noise is matched to
Fig. 3 Mean intensity offsets at matched loudness for 100–5000-Hz bandwidth stimuli. Means plotted with 95% confidence intervals for the four reference stimuli for each of the four types of tracked stimuli
correlated noise that is only 0.3 dB more intense, and this difference is non-significant. Figure 2 shows similar data for the 100–900-Hz bandwidth. Here, the difference between the monaural and correlated stimulus types is a little larger at 5.7 dB (q>34, p<0.001). The difference between uncorrelated and correlated stimuli is similar to the 460–540-Hz case at 2.1 dB (q>11, p<0.001), but the difference between correlated and anticorrelated has increased to 0.9 dB and is significant when the anticorrelated noise is the reference stimulus (q = 6.4, p<0.001). Figure 3 shows the data for the 100–5000-Hz bandwidth. Here, the pattern is a little different. The binaural summation effect (correlated vs monaural) is still 5.5 dB (q>24, p<0.001), but the correlation effects are only 0.8 dB for uncorrelated vs correlated and 0.5 dB for anticorrelated vs correlated, and neither of these differences is significant.
4 Discussion and Conclusions
These experiments provide the first direct evidence that interaural correlation can have an effect on the loudness of a stimulus, an effect that can be offset by a compensating difference in physical intensity. Previous research has either used a discrimination paradigm (Culling et al. 2001, 2003) or has not detected an effect on loudness (Dubrovskii and Chernyak 1969).
From the standpoint of loudness research, the method employed represents an improvement over those used by Culling et al. (2001, 2003), because those studies focussed listeners' attention on the loudness dimension only through instruction. Those designs were therefore open to the criticism that listeners might have responded to some other stimulus dimension, particularly the image width. In the present study, the listener was required to match stimuli in loudness using an intensity offset. Sound intensity has negligible influence on image width, and so would not provide the listeners with an alternative means of performing the task. The design also conforms to the methods used in the literature to measure binaural summation. There, the method is known as "binaural-monaural loudness matching" (Reynolds and Stevens 1960). The binaural summation effect reported by Reynolds and Stevens using this and other methods was also replicated. Given that the results genuinely reflect an influence of interaural correlation on loudness, one may ask why Dubrovskii and Chernyak did not observe it in their experiments. The answer to this question probably lies in the effect of stimulus bandwidth. Dubrovskii and Chernyak used full-spectrum white noise, limited only by the frequency response of their headphones (Russian TD-6 audiological headphones), which they stated had a high-frequency cut-off at around 5 kHz. According to measurements by Robinson (1971), this is a rather conservative limit for these headphones. In the experiments reported here, the effect of interaural correlation is about 2 dB at bandwidths of 460–540 Hz and 100–900 Hz, but reduces to 0.8 dB (non-significant) at a bandwidth of 100–5000 Hz. Since the first two cases are both limited to frequencies at which binaural unmasking is effective, while the third extends to much higher frequencies, it is tempting to suppose that the substantial elevation of loudness at low frequencies is diluted by the presence of higher frequencies at which there is little effect. Further experiments will be needed to establish that the effect of interaural correlation is absent at higher frequencies using this paradigm. Certainly, the figure of 0.8 dB is close to the theoretical prediction of about 0.65 dB that one would expect from a linear dilution effect. Since Dubrovskii and Chernyak's stimuli probably had an even broader spectrum, it is not surprising that they did not observe an effect. The results support an interpretation of Zwicker and Zwicker (1991) in terms of interaural correlation. There are two aspects that seem to correspond well. First, they found that the overall increase in loudness produced by alternating a noise between the ears was insufficient to make the stimuli louder than continuous diotic noise at the same spectrum level. Thus, it was equivalent to an increase in intensity of a little less than 3 dB. This corresponds well with the 2 dB observed here. Although Zwicker and Zwicker used a broad band of noise, which might have been susceptible to the dilution effect discussed above, it contained equal energy in each critical band and was thus spectrally tilted towards the low-frequency region where binaural effects are strongest. Second, as noted by Zwicker
and Zwicker, the lowest of the alternation rates at which they observed increased loudness, 7 cycles/s, creates a situation in which at least one interaural transition occurs within any 100-ms window, corresponding with estimates of binaural temporal resolution (Grantham and Wightman 1978; Culling and Summerfield 1998). Culling et al. (2001, 2003) provide a theoretical explanation for the effect of interaural correlation on loudness. They drew upon the work of Osman (1971) and Durlach et al. (1986), which suggested that the mechanism of binaural unmasking is sensitive to deviations in interaural correlation from unity. According to these ideas, during binaural unmasking, the presence of a signal with different interaural phase from the noise reduces the interaural correlation of the stimulus at the signal frequency. If, instead, the correlation is directly manipulated at a given frequency, an illusory experience of the signal is created. Extending this idea, Culling et al. suggested that beyond the point of detection, further decreases in correlation should be interpreted by the binaural system as increases in the relative intensity of the signal and hence result in progressive increases in the perceived loudness of the illusory signal. The present study shows that these increases in loudness also occur when the correlation of the whole stimulus is altered. Loudness was found to be lower for the anticorrelated stimuli compared to the uncorrelated stimuli. This result is consistent with previous observations (Culling et al. 2003). The phenomenon may be explained with recourse to the idea that the mechanism of binaural unmasking operates independently in each frequency channel (Culling and Summerfield 1995). In order to recover signals from noise in any direction, the binaural system is thought to apply a compensating internal delay to the stimuli at each ear before assaying the correlation (e.g. Durlach 1972). The overall process is thus similar to measurement of the coherence. If the process operates independently in each frequency channel, anticorrelated noise will be interpreted as having a high coherence in all channels, because within each frequency channel the π phase difference can be approximately compensated by a single internal delay; within an individual channel a π phase shift is equivalent to much the same delay at each of the narrow range of frequencies that the frequency channel contains. Across different frequency channels the measurement mechanism will thus apply a delay equivalent to half the period of the centre-frequency of each channel. After these delays, the stimuli from each ear will always be highly correlated in all channels and therefore will not have the same degree of enhanced loudness as an uncorrelated stimulus. Models of loudness are essentially monaural (Zwicker and Scharf 1965; Moore et al. 1999), including binaural processing only through the simple binaural summation process (Scharf and Fishken 1970). The evidence presented here suggests that some modification may be necessary in order to include the effect of interaural correlation at low frequencies.
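The channel-by-channel argument can be illustrated numerically: for a narrow band centred on f0, inverting the waveform is nearly equivalent to delaying it by half a period of f0, so a single compensating internal delay restores a high correlation within the channel. A sketch (Python/NumPy; illustrative only, not part of the original study):

import numpy as np

fs, f0, bw, dur = 20000, 500.0, 50.0, 0.5
rng = np.random.default_rng(0)

# Narrowband noise centred on f0, made by FFT-bin zeroing as above
n = int(dur * fs)
spec = np.fft.rfft(rng.standard_normal(n))
freqs = np.fft.rfftfreq(n, 1.0 / fs)
spec[(freqs < f0 - bw / 2) | (freqs > f0 + bw / 2)] = 0.0
left = np.fft.irfft(spec, n)
right = -left                                   # anticorrelated at the two ears

shift = int(round(fs / (2 * f0)))               # half a period of f0 (1 ms at 500 Hz)
r_raw = np.corrcoef(left, right)[0, 1]          # close to -1
r_comp = np.corrcoef(left[:-shift], right[shift:])[0, 1]  # close to +1 after the delay

print(r_raw, r_comp)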
Acknowledgement. Pilot data for these experiments, using the method of adjustment, were collected by Sonya Ginty in her final-year project. Work supported by UK EPSRC.
References
Culling JF, Summerfield Q (1995) Perceptual segregation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay. J Acoust Soc Am 98:785–797
Culling JF, Summerfield Q (1998) Measurements of the binaural temporal window using a detection task. J Acoust Soc Am 103:3540–3553
Culling JF, Colburn HS, Spurchise M (2001) Interaural correlation sensitivity. J Acoust Soc Am 110:1020–1029
Culling JF, Hodder KI, Colburn HS (2003) Interaural correlation discrimination with spectrally remote flanking bands: constraints for models of binaural unmasking. Acta Acust united with Acustica 89:1049–1058
Dubrovskii NA, Chernyak RI (1969) Binaural summation under varying degrees of noise correlation. Sov Phys Acoust 14:326–332
Dubrovskii NA, Chernyak RI, Shapiro VM (1972) Binaural summation of differently correlated noises. Sov Phys Acoust 17:468–473
Durlach NI (1972) Binaural signal detection: equalization and cancellation theory. In: Tobias JV (ed) Foundations of modern auditory theory, vol 2. Academic Press, New York, pp 369–462
Durlach NI, Gabriel KJ, Colburn HS, Trahiotis C (1986) Interaural correlation discrimination: II. Relation to binaural unmasking. J Acoust Soc Am 78:1458–1557
Grantham DW, Wightman FL (1978) Detectability of varying interaural temporal differences. J Acoust Soc Am 63:511–523
Levitt H (1971) Transformed up-down methods in psychoacoustics. J Acoust Soc Am 49:467–477
Moore BCJ, Glasberg BR, Vickers DA (1999) Further evaluation of a model of loudness perception applied to cochlear hearing loss. J Acoust Soc Am 106:898–907
Osman E (1971) A correlation model of binaural masking level differences. J Acoust Soc Am 50:1494–1495
Reynolds GS, Stevens SS (1960) Binaural summation of loudness. J Acoust Soc Am 32:192–205
Robinson DW (1971) A review of audiometry. Phys Med Biol 16:1–24
Scharf B, Fishken D (1970) Binaural summation of loudness reconsidered. J Exp Psychol 86:374–379
Zwicker E, Scharf B (1965) A model of loudness summation. Psychol Rev 72:3–26
Zwicker E, Zwicker UT (1991) Dependence of binaural loudness summation on interaural level differences, spectral distribution, and temporal resolution. J Acoust Soc Am 89:758–764
Comment by Weber When listening to your presentation I got the following idea: What do we basically need if we want to perceive a complex tone from a certain place in space? We would need a set of resonators for the complex tone and a delay line for performing the cross-correlation to position the complex sound in space. We assume that the basilar membrane gives us the basis to calculate the complex tone. But what about the delay line? We would perhaps have the tendency to construct a delay line out of RC-elements.
But nature might be more efficient. It might use the basilar membrane itself also as a delay line. The RC-elements might be a bit different from what we expect at first glance, but the basilar membrane might at the same time be the basis for frequency and correlation analysis. The basilar membrane would then not only be regarded as an important part of a frequency analyser but also as a delay line for the calculation of auditory space. When, e.g., using the HRTFs for spatial calculations, both functions – the frequency analysis and the correlation analysis – need to be processed in combination. This might be performed by the basilar membrane being the basic part of a frequency analyser and a delay line at the same time. And if the basilar membrane has to perform a twofold role as frequency analyser and delay line, the requirements of both functions might then determine its special construction and functioning. Are there consequences for the understanding of auditory signal processing? As one example, one may speculate that the differences in loudness summation of correlated and uncorrelated noise might rely on the different spatial extents of the two noises. The uncorrelated noise is more spatially extended than the correlated one and it is perceived as being louder. It may be that the summation of loudness of objects in space shows similar behaviour to spectral summation in the frequency domain. But this can be checked.
Reply
The idea that the internal delays needed for binaural processing might have a cochlear origin was proposed by Shamma et al. (1989). However, evidence in favour of the idea has since proved elusive. The most recent evidence points towards a role for timed inhibition rather than a Jeffress-style delay network (Brand et al. 2002). It is tempting to think that the effect of interaural correlation on loudness might be mediated by the stimulation of a larger number of "spatial channels." As you point out, there is an analogous effect in the frequency domain, in which equal-energy bands of noise are perceived as louder if they extend beyond one critical band (Feldtkeller and Zwicker 1956, as cited in Scharf 1970). Given our modern understanding of cochlear non-linearity, it seems likely that the bandwidth effect found by Zwicker and Feldtkeller is mediated by the cochlea's compressive input-output function; when all the acoustic energy is concentrated in a single frequency channel, the compression in that channel reduces the cochlea's response relative to the situation where the energy is spread over many channels. I am not aware of any analogous mechanism that might be called upon to explain the effect of correlation on loudness, but I don't think the idea can be ruled out.
References
Brand A, Behrend O, Marquardt T, McAlpine D, Grothe B (2002) Precise inhibition is essential for microsecond interaural time difference coding. Nature 417:543–547
Scharf B (1970) Critical bands. In: Tobias JV (ed) Foundations of modern auditory theory, vol 1. Academic Press, pp 159–202
Shamma SA, Shen N, Gopalaswamy P (1989) Stereausis: binaural processing without neural delays. J Acoust Soc Am 86:989–1006
40 Interaural Phase and Level Fluctuations as the Basis of Interaural Incoherence Detection MATTHEW J. GOUPELL AND WILLIAM M. HARTMANN
1 Introduction
Interaural coherence is a measure of the similarity of signals in a listener’s two ears. It is derived from the interaural cross-correlation function, which is a function of the interaural lag. The peak of the cross-correlation function is of particular interest. The value of the lag for which the peak occurs is regarded as the relevant interaural time difference (ITD) cue for the location of the sound image. This value of lag was given a place representation in the famous binaural model by Jeffress (1948). The height of the peak is thought to determine the compactness of the image. If the sounds in the two ears are identical except for an interaural delay, then the peak height has its maximum value of 1, and the image is expected to be maximally compact. If the height of the peak is less than 1, the image is broader or more diffuse (Barron 1983; Blauert and Lindemann 1986). Listeners are particularly sensitive to deviations from a reference coherence of 1.0. Using narrowband noise, Gabriel and Colburn (1981) found that listeners could easily distinguish between noise with a coherence of 1.0 and noise with a coherence of 0.99. A reference coherence of 1.0 is also of interest in connection with the masking level difference (MLD). Wilbanks and Whitmore (1967) and Koehnke et al. (1986) concluded that the threshold signal-to-noise ratio for a heterophasic signal in a homophasic noise is essentially determined by the ability to detect the incoherence introduced by the out-of-phase signal. The present work is also concerned with incoherence detection starting with perfectly coherent noise as a reference. Its working hypothesis is that the detection of a small amount of interaural incoherence is not based on the cross-correlation or the coherence per se. Instead, detection is hypothesized to be based on fluctuations in interaural phase difference (IPD) and/or interaural level difference (ILD) leading to salient fluctuations in the output of brainstem nuclei.
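For reference, the coherence as used here is the height of the peak of the normalized interaural cross-correlation taken over a limited range of lags; a minimal sketch (Python/NumPy; our own helper, with the lag range as an assumed parameter):

import numpy as np

def interaural_coherence(left, right, fs, max_lag_ms=1.0):
    """Peak of the normalized cross-correlation within +/- max_lag_ms.
    The lag at the peak is the ITD estimate; the peak height is the coherence."""
    left = (left - left.mean()) / left.std()
    right = (right - right.mean()) / right.std()
    max_lag = int(max_lag_ms * 1e-3 * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.array([np.mean(np.roll(right, lag) * left) for lag in lags])
    best = np.argmax(cc)
    return cc[best], lags[best] / fs   # (coherence, ITD in seconds)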
Michigan State University, USA,
[email protected],
[email protected]
2 Experiment
The experiment sought to test the adequacy of coherence per se by presenting listeners with reproducible noise samples, all of which had exactly the same value of interaural coherence. However, different noises had different amounts of IPD and ILD fluctuations. If it could be shown that the incoherence was significantly more detectable in some noise samples than in others, and if that difference correlated with IPD and ILD fluctuations, then the hypothesis would be supported. Implementing the experimental plan required a choice of noise bandwidth and a choice of stimulus duration. A narrow bandwidth was chosen because fluctuations in different noise samples differ more widely when the bandwidth is narrow. A range of durations was chosen. Although many MLD experiments have been done with 500-ms stimuli, recent binaural models have stressed the potential importance of short-term correlations (Bernstein et al. 1999). In order to make a convincing case for or against coherence per se, it was necessary to perform the experiments over all relevant time scales.
2.1 Stimuli
The stimuli were noise bands with 14-Hz nominal bandwidth and durations of 500, 100, 50, and 25 ms. The bands were centered on 500 Hz and experienced some inevitable splatter when the duration was brief. Splatter was reduced by using 30-ms or 10-ms raised-cosine edges. One hundred different noises were created for each duration. The computation of noise stimuli targeted a fixed coherence, and stimulus selection after enveloping ensured that every noise had an interaural coherence of exactly 0.992. After the noises were computed, the interaural phase and level differences were computed as functions of time from the Hilbert transforms of the noises. Fluctuations were then characterized by the standard deviation of the IPD and the standard deviation of the ILD as computed over the duration of the noise. Those standard deviations form the vertical and horizontal axes, respectively, of Fig. 1, which shows the 100 noises for 500-ms duration. It is evident that some noises, e.g. noise number 79, have large IPD fluctuations whereas other noises, like noise number 2, have large ILD fluctuations. For some noises, e.g. number 90, both the IPD and the ILD fluctuations are large. Other noises, e.g. number 96, have very small fluctuations.
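The fluctuation measures follow directly from the analytic signals of the two channels; a minimal sketch (Python with NumPy/SciPy; our own code, written from the description above):

import numpy as np
from scipy.signal import hilbert

def ipd_ild_fluctuations(left, right):
    """Standard deviations over time of the interaural phase difference (degrees)
    and the interaural level difference (dB), from the Hilbert envelopes and phases."""
    za, zb = hilbert(left), hilbert(right)
    ipd = np.angle(za * np.conj(zb))                 # instantaneous IPD, wrapped to +/- pi
    ild = 20.0 * np.log10(np.abs(za) / np.abs(zb))   # instantaneous ILD in dB
    return np.degrees(np.std(ipd)), np.std(ild)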
2.2 Fluctuation Statistics
Average fluctuation statistics are shown in Table 1 as a function of duration. The two columns on the left show the mean of the standard deviation of the IPD and ILD fluctuations. This mean is computed over the 100 stimuli in the entire ensemble of noises. These means correspond to the centroids of the mass of
Fig. 1 IPD and ILD fluctuations for 100 noises, given as standard deviations averaged over time, for 500-ms noises
Table 1 Values of the mean and standard deviation of st[∆Φ] and st[∆L] for "14-Hz" noise-pairs with four durations: 25, 50, 100, and 500 ms. Correlations between st[∆Φ] and st[∆L] are also given

Duration (ms)   m(st[∆Φ]) (degrees)   m(st[∆L]) (dB)   s(st[∆Φ]) (degrees)   s(st[∆L]) (dB)   corr
25              3.62                  0.41             3.56                  0.36             0.53
50              5.85                  0.73             6.83                  0.65             0.74
100             7.81                  1.06             6.83                  0.68             0.75
500             12.20                 1.60             5.14                  0.50             0.68
points, shown for example in Fig. 1. The next two columns show the standard deviations (over 100 noises) of the standard deviations (over time). The IPD and ILD standard deviations correspond respectively to vertical and horizontal widths of the mass of points, shown for example in Fig. 1.
2.3 Procedure
To create the stimuli for the experiment the five noises with the largest IPD fluctuations and the five noises with the smallest IPD fluctuations were chosen to make the ten stimuli of a collection to be called the “phase set.”
Similarly, ten stimuli were chosen on the basis of the largest and smallest ILD fluctuations to make the "level set." The experiment was three-interval, two-alternative forced choice. The first interval was diotic; the second or third interval, selected randomly, was dichotic with the interaural fluctuation; the remaining interval was again diotic. The two diotic intervals were created by presenting just the left channel of one of the ten noises, different from the target noise in the dichotic interval and different from each other. The listener's task was to say which interval, the second or the third, contained the dichotic noise. Beyond the simple forced choice, the listener was given the opportunity to declare confidence in his judgment. That procedural element led to a Confidence Adjusted Score (CAS) equal to the number of correct responses plus the number of correct responses that were also rated as confident. Listeners were discouraged from using the confidence rating casually: if an experimental run included more than one incorrect judgment that the listener had rated as confident, the run immediately terminated and the listener was obliged to begin again. The relative weighting given to confidence, whereby a confidence rating was given the same weight as a correct response, was determined by statistical tests, which showed that the overall CAS was a relatively flat function of the weighting parameter when the relative weight was 1.0 (Goupell 2005). An experimental run consisted of six trials for each of the ten dichotic noises, presented in random order. A total of 6 runs led to a total of 36 trials for each noise; thus, the maximum possible CAS was 72. There were three listeners in the experiment. All were male and all had normal thresholds near 500 Hz.
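With this reading of the scoring rule, the CAS for one noise can be written compactly (a sketch in Python; our own formulation):

def confidence_adjusted_score(correct, confident):
    """CAS = number of correct responses plus the number of correct responses
    that were also rated as confident (confidence weighted equally with correctness)."""
    return sum(int(c) + int(c and f) for c, f in zip(correct, confident))

# e.g. confidence_adjusted_score([True, True, False], [True, False, False]) == 3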
2.4 Results
The resulting CAS values for the phase sets and level sets are shown in Tables 2 and 3, respectively. The entries in the tables can be compared with the maximum possible value of 72. The random guessing limit would correspond to a CAS of 18, where half the decisions are correct and the listener is never confident about anything. Columns labeled “Min” and “Max” correspond to the five stimuli with minimum fluctuations and the five stimuli with maximum fluctuations. Therefore, each table entry is the average of the responses to five noises. Tables 2 and 3 indicate that CAS values are considerably higher for the maximum fluctuations than for the minimum fluctuations for both the phase sets and the level sets, so long as the duration is 50 ms or greater. For a duration of 25 ms, there is little difference between Max and Min columns. A two-sample one-tailed t-test tested the hypothesis that CAS values for maximum fluctuations were larger than CAS values for minimum fluctuations. The results of the test are given in Table 4. It is evident that the hypothesis is supported, normally at the 0.01 level or better, except for a duration of 25 ms.
Table 2 CAS values (maximum 72) for three listeners – phase set

             25 ms          50 ms          100 ms         500 ms
Listener     Min    Max     Min    Max     Min    Max     Min    Max
D            16     25      11     45      8      57      40     62
M            23     24      12     50      13     60      37     61
W            25     25      14     53      17     67      28     60
Table 3 CAS values (maximum 72) for three listeners – level set

             25 ms          50 ms          100 ms         500 ms
Listener     Min    Max     Min    Max     Min    Max     Min    Max
D            15     21      12     46      13     58      40     64
M            20     27      15     47      15     50      35     70
W            17     29      19     44      19     54      31     66
Table 4 The p-values for the phase and level sets with a nominal bandwidth of 14 Hz and four durations: 25, 50, 100, and 500 ms

             25 ms               50 ms               100 ms              500 ms
Listener     Phase     Level     Phase     Level     Phase     Level     Phase     Level
D            0.027     0.039     <0.001    0.010     <0.001    <0.001    <0.001    <0.001
M            0.306     0.023     0.002     0.002     <0.001    0.004     0.002     0.002
W            0.457     0.065     0.002     0.015     <0.001    0.002     0.001     <0.001
3
Discussion and Further Experiments
The experiment made it clear that coherence itself, or the cross-correlation of the stimulus waveforms, is an inadequate predictor of incoherence detection. Instead, detection performance correlates well with the size of the fluctuations in interaural phase and level. This conclusion applies to narrowband (14-Hz) noise and stimulus durations from 50 to 500 ms.
3.1
25-ms Duration
The above conclusion does not apply for a stimulus duration of 25 ms, but for a 14-Hz bandwidth and 25-ms duration the stimulus fluctuations themselves are quite infrequent. As shown by the standard deviation columns in Table 1,
the 25-ms noises tend not to fluctuate but tend to present lateralization cues that are left or right for the entire duration. Informally, listeners reported that they often used lateralization instead of width as a cue for this duration. The number of fluctuations in a 25-ms stimulus can be increased by increasing the bandwidth. The same listeners participated in an experiment wherein the bandwidth was increased to 28 Hz. The CAS values then resembled those obtained for a 14-Hz bandwidth and a 50-ms duration, with p-values less than 0.03 for all listeners. The conclusion that can be drawn from these experiments is that fluctuations, and not short-term coherence, are the basis of incoherence detection down to the smallest durations for which fluctuations occur physically and for which listeners employ their sense of source width to perform the task. An important point is that the incoherence, controlled in these experiments, was in the stimuli themselves and not in model cross-correlations (Bernstein et al. 1999).
3.2
CAS Values Less than Chance
Tables 2 and 3 show that some CAS values were systematically below the chance value of 18. This effect was especially prominent for particular noises, for which CAS values sometimes fell below 10. The effect was traced to an experimental artifact wherein listeners confused envelope fluctuations with interaural fluctuations. For the narrow-band noises, as used in the 500-ms experiment for example, envelope fluctuations in a single channel correlated with interaural fluctuations in phase (0.59) and level (0.65). As a result, listeners found more "action" in a diotic noise with large envelope fluctuations than in a dichotic noise with small interaural and envelope fluctuations. The role of large envelope fluctuations was demonstrated in a control experiment wherein the left channel of the noises was presented to both ears on all three intervals. Listeners were instructed to use the same criteria that they had used in the normal dichotic experiments. Although confidence ratings were low, listeners made systematic choices. Inter-listener correlations were 0.40 for the phase set and 0.76 for the level set. This effect proved the advantage of collecting confidence ratings in addition to forced-choice responses in the normal experiments, because although listeners may have systematically chosen some diotic noises over small-fluctuation dichotic noises, they were never confident about these choices.
3.3
Increased Bandwidth
Further experiments were performed at larger bandwidth. As bandwidths grow, different noises have increasingly similar fluctuations in IPD and ILD. The widths of the distributions, as shown in Fig. 1, shrink, approximately as the inverse cube root of the bandwidth. Although listeners may still use fluctuations to detect incoherence, the different noises increasingly resemble one another.
When the bandwidth grew to become about as large as a critical band (100 Hz), the difference in detectability for maximum-fluctuation noises as compared with minimum-fluctuation noises was barely significant. For the 500-ms experiment, the p-value increased from 0.002 to 0.05. Further, the correlation in performance among listeners, as computed across the ten noises of the phase set or the ten noises of the level set, decreased from 0.91 to 0.47. Experiments with a bandwidth of 2400 Hz, maintaining 500 Hz as the geometric mean frequency, found little physical difference between noises with the largest fluctuations and noises with the smallest, and also found no significant difference in the detectability of incoherence. Correlations among listeners dropped to 0.11. The conclusion of bandwidth experiments is that as the bandwidth increases, the value of the coherence, or incoherence, itself becomes an increasingly good predictor of incoherence detection. It is believed that the main reason for that effect is the change in the distributions of the physical fluctuations themselves. Acknowledgments. We are grateful to Drs. H.S. Colburn, J.F. Culling, A. Kohlrausch, and C. Trahiotis for useful discussions about coherence. Peter Xinya Zhang helped with the manuscript production. This work was supported in part by the National Institute on Deafness and Other Communicative Disorders, grant DC00181.
References

Barron M (1983) Objective measures of spatial impression in concert halls. Proc. Sixth International Congress on Acoustics, vol 7, pp 105–108
Bernstein LR, van de Par S, Trahiotis C (1999) The normalized interaural correlation: accounting for NoSπ thresholds obtained with Gaussian and "low-noise" masking noise. J Acoust Soc Am 106:870–876
Blauert J, Lindemann W (1986) Auditory spaciousness: some further psychoacoustic analyses. J Acoust Soc Am 80:533–542
Gabriel KJ, Colburn HS (1981) Interaural correlation discrimination: I. Bandwidth and level dependence. J Acoust Soc Am 69:1394–1401
Goupell MJ (2005) The use of interaural parameters during incoherence detection in reproducible noise. PhD dissertation, Michigan State University, unpublished
Jeffress LA (1948) A place theory of sound localization. J Comp Physiol Psychol 41:35–39
Koehnke J, Colburn HS, Durlach NI (1986) Performance in several binaural-interaction experiments. J Acoust Soc Am 79:1558–1562
Wilbanks WA, Whitmore JK (1967) Detection of monaural signals as a function of interaural noise correlation and signal frequency. J Acoust Soc Am 43:785–797
Comment by Hohmann
This is a more technical remark. I very much like the idea of considering the IPD statistics. However, I think the way you calculate the IPD makes the derivation of its statistics somewhat awkward and may influence your results. Because the phase is a cyclic variable, the IPD should be calculated by multiplying the left analytic signal with the conjugated right analytic signal, i.e.,
Y = Xl · Xr* = al·ar · exp(i(φl − φr)),  IPD = angle(Y).

By this, the variable Y is a complex-valued (vector) variable whose argument is the IPD. If statistics of the IPD are to be derived, or if a temporal average of IPD-related responses in a binaural temporal-window model is to be calculated, the complex Y-vectors should be averaged instead of the real IPD values. This is known as cyclic statistics of cyclic variables, as opposed to linear moments of continuous variables. The mean IPD then is the angle of the averaged variable, i.e., IPDmean = angle(⟨Y⟩), or IPDmean = angle(⟨Y/|Y|⟩). Whereas the former formula includes intensity weighting, the latter formula is unweighted. The reason why vector-averaging is necessary is the phase wrapping at ±π inherent to the angle operation. Assume you have a signal in which the IPD is somehow fluctuating in time around π, which often happens in realistic listening conditions and in binaural detection tasks, where some external noise is involved. The mean IPD calculated from cyclic statistics as defined above will give a mean phase of around π, as expected, whereas the linear average of the IPD variable that you used, i.e., IPDmean_lin = ⟨angle(Y)⟩, would give a mean of around zero, i.e., a value that was almost never present in the IPD values. Also, the linear variance would give a much larger value than its cyclic counterpart. However, if the IPD happens to fluctuate around zero, the difference between linear and cyclic analysis is small. See Nix and Hohmann (2006) for a description of the method of calculating IPD and its cyclic statistics.

References

Nix J, Hohmann V (2006) Sound source localization in real sound fields based on empirical statistics of interaural parameters. J Acoust Soc Am 119:463–479
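The following sketch illustrates the point numerically: for IPD samples fluctuating around ±π, the cyclic (vector) averages stay near π, whereas the linear average of the wrapped angles lands near zero. All numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
# IPD values fluctuating around pi, wrapped into (-pi, pi]
ipd_true = np.pi + 0.2 * rng.standard_normal(10000)
ipd = np.angle(np.exp(1j * ipd_true))          # wrapped samples, half near +pi, half near -pi
amp = 1.0 + 0.1 * rng.standard_normal(10000)   # some amplitude fluctuation (al * ar)

Y = amp * np.exp(1j * ipd)                     # complex interaural variable Y = Xl * conj(Xr)

ipd_mean_cyclic   = np.angle(np.mean(Y))               # intensity-weighted cyclic mean, close to +/- pi
ipd_mean_unweight = np.angle(np.mean(Y / np.abs(Y)))   # unweighted cyclic mean, close to +/- pi
ipd_mean_linear   = np.mean(np.angle(Y))               # linear mean, close to 0: a value never observed

# Spread: the circular variance stays small, the linear standard deviation is close to pi.
circ_var = 1.0 - np.abs(np.mean(Y / np.abs(Y)))
lin_std = np.std(np.angle(Y))
print(ipd_mean_cyclic, ipd_mean_unweight, ipd_mean_linear, circ_var, lin_std)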
Reply
We agree that our procedure for calculating the average IPD runs into trouble if the IPD is fluctuating around an average value of ±π radians, which may occur in realistic listening conditions such as listening in rooms. However,
that problem did not arise for the stimuli in our experiments, which studied the detection of a very small amount of incoherence as a deviation from diotic noise. See the calculations of IPD distributions for incoherent stimuli with different bandwidths and values of coherence in Breebaart and Kohlrausch (2001). Our stimuli had distributions similar to those of Breebaart and Kohlrausch, which had a mean IPD that was essentially 0 radians. The question that you raise would indeed be important for the interesting experiment where the task is to detect a deviation from a pure Nπ stimulus due to the addition of some uncorrelated noise. Possibly cyclic statistics might be used profitably, but our tendency would be to approach this problem with some form of physiologically based model, probably with IPD converted into ITD. Then the averages and the fluctuations about the averages are likely to be interpretable in terms of perceptual experiments.

References

Breebaart J, Kohlrausch A (2001) The influence of interaural stimulus uncertainty on binaural signal detection. J Acoust Soc Am 109:331–345
41 Logarithmic Scaling of Interaural Cross Correlation: A Model Based on Evidence from Psychophysics and EEG HELGE LÜDDEMANN, HELMUT RIEDEL, AND BIRGER KOLLMEIER
1
Introduction
The aim of this contribution is to demonstrate that the detection of transitions in interaural cross correlation (IAC, denoted by ρ) depends on the Fisher Z-transform ρlog of ρ (see Eq. 3), rather than on ρ itself. Evidence is provided by two electrophysiological and closely corresponding psychophysical experiments, both using noise stimuli with stepwise transitions between different values of IAC. The conclusions from these experiments are used to develop a model of binaural feature extraction which avoids several shortcomings of other binaural models, e.g., normalization, perceptually inadequate scaling of IAC, and lack of neurophysiological plausibility. The IAC between the signals l(t) at the left and r(t) at the right ear describes the physical basis of how compact or diffuse a sound is perceived in auditory space. The value ρ of the IAC is defined as the normalized scalar product of l and r:

ρ = ∫ l(t)·r(t) dt · [ ∫ l²(t) dt · ∫ r²(t) dt ]^(−1/2).    (1)
When presented via headphones, (noise) signals with ρ = 1 are perceived as a compact sound source at a central position in the head, whereas signals with ρ = 0 are associated with a continuum of simultaneously active sources between both ears (diffuse). Noise stimuli with any desired ρ can be generated by mixing a diotic noise (N0) and an antiphasic noise (Nπ) in an appropriate ratio, with both noise sources being orthonormalized prior to mixing:

Nρ = √((1+ρ)/2) · N0 + √((1−ρ)/2) · Nπ.    (2)
Accordingly, ρ can be expressed as the dB-scaled ratio of correlated and anti-correlated components in the stimulus, hereby denoted as the 'dB(N0/Nπ) scaled IAC' ρlog. This ratio, which is proportional to Fisher's ρ-to-Z transform, can be directly obtained from the mixing factors in Eq. (2):
Medizinische Physik, Carl von Ossietzky Universität, Oldenburg, Germany, [email protected], [email protected], [email protected]
ρlog [dB(N0/Nπ)] = 10 · log10 ((1+ρ)/(1−ρ)).    (3)
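A small sketch of Eqs. (1)–(3), assuming the usual N0/Nπ construction in which the left and right ear receive the sum and difference of the orthonormalized noise components; the amount of correlation and the signal length are arbitrary.

import numpy as np

def noise_with_iac(rho, n=65536, rng=None):
    """Left/right noise pair with interaural correlation rho, built from an
    orthonormalized diotic noise N0 and antiphasic noise Npi (Eq. 2)."""
    rng = np.random.default_rng(rng)
    n0 = rng.standard_normal(n)
    npi = rng.standard_normal(n)
    npi -= n0 * (n0 @ npi) / (n0 @ n0)          # orthogonalize
    n0 /= np.linalg.norm(n0)
    npi /= np.linalg.norm(npi)
    left = np.sqrt((1 + rho) / 2) * n0 + np.sqrt((1 - rho) / 2) * npi
    right = np.sqrt((1 + rho) / 2) * n0 - np.sqrt((1 - rho) / 2) * npi
    return left, right

def rho_log(rho):
    """Eq. (3): IAC on the dB(N0/Npi) scale."""
    return 10.0 * np.log10((1 + rho) / (1 - rho))

left, right = noise_with_iac(0.9, rng=1)
rho_hat = (left @ right) / np.sqrt((left @ left) * (right @ right))   # Eq. (1)
print(rho_hat, rho_log(0.9))   # rho_hat ~ 0.9; rho_log(0.9) ~ 12.8 dB(N0/Npi)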
The just noticeable difference between different values of IAC (ρ-JND) critically depends on the reference IAC ρref (see Fig. 4). For |ρref| = 1, psychoacoustical thresholds are markedly lower than for ρref = 0 (Boehnke et al. 2002; Gabriel and Colburn 1981). For intermediate ρref there is a nonlinear decrease of the ρ-JND with ρref between 0 and 1 (Pollack and Trittipoe 1959; Culling et al. 2001). The difference between (ρref)log and the IAC at the respective discrimination threshold (ρref + ρ-JND)log, after transforming both to the dB(N0/Nπ) scale (Eq. 3), however, appears to be about 4 dB(N0/Nπ), independent of ρref (dashed line in Fig. 4, left). This suggests that the dB(N0/Nπ) scaled ρlog, rather than ρ itself, should be used as the parameter in a model of interaural correlation processing (see Sect. 4; also compare to van der Heijden and Trahiotis 1997).
1.1
Binaural Sluggishness and the Disputable Normalization Hypothesis
An analogous dependency of perceptual performance on ρref has been found for the detection of binaural gaps, i.e., fixed-valued IAC changes of short duration (Boehnke et al. 2002). Akeroyd and Summerfield (1999) pointed out that this analogy is a direct consequence of 'binaural sluggishness', i.e., the limited capability of the binaural system in tracking fast changes of interaural parameters. In order to quantitatively describe the relationship between the gap duration threshold and the ρ-JND they proposed a model (similar to that of Kollmeier and Gilkey 1990; Culling and Summerfield 1998) in which an 'instantaneous IAC' ρ(t) is smoothed by convolution with a normalized window function w(t) in the time domain:

ρw(t) = ∫ ρ(τ) · w(t − τ) dτ.    (4)
In their model a binaural gap becomes detectable if the difference between the smoothed internal IAC ρw(t) and ρref exceeds the ρ-JND (|ρw(t) − ρref| ≥ ρ-JND). Thus, the binaural system's sluggishness can be characterized by the shape and the duration of the temporal window w. Longer window durations correspond to higher degrees of sluggishness and vice versa. However, although providing a functionally useful description of gap detection, the model by Akeroyd and Summerfield lacks plausibility. (1) The model has to compute the IAC ρ(t) instantaneously and therefore needs some kind of samplewise normalization, which does not make much sense for single spikes at the neural level. (2) Moreover, van de Par et al. (2001) reasonably concluded from their psychoacoustical data that it is 'highly unlikely that normalization, per se, actually occurs as part of binaural processing.'
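A minimal sketch of the smoothing model of Eq. (4) together with the detection rule just described. The window shape (a one-sided exponential), its duration and the ρ-JND value are illustrative assumptions; only the convolution and the |ρw(t) − ρref| ≥ ρ-JND criterion follow the text.

import numpy as np

def smoothed_iac(rho_inst, fs, window_dur=0.1):
    """Eq. (4): convolve an 'instantaneous' IAC trace with a normalized
    temporal window (here a one-sided exponential; shape is illustrative)."""
    t = np.arange(0, 5 * window_dur, 1 / fs)
    w = np.exp(-t / window_dur)
    w /= w.sum()                                       # normalized window
    return np.convolve(rho_inst, w)[:len(rho_inst)]

def gap_detected(rho_inst, fs, rho_ref, rho_jnd, window_dur=0.1):
    """A binaural gap is detectable if |rho_w(t) - rho_ref| >= rho-JND somewhere."""
    rho_w = smoothed_iac(rho_inst, fs, window_dur)
    return np.any(np.abs(rho_w - rho_ref) >= rho_jnd)

fs = 1000.0
rho = np.ones(1000)                                    # 1 s of reference IAC, rho_ref = 1
rho[500:520] = 0.0                                     # a 20-ms binaural gap (delta rho = -1)
print(gap_detected(rho, fs, rho_ref=1.0, rho_jnd=0.3, window_dur=0.020),   # True: fast window resolves the gap
      gap_detected(rho, fs, rho_ref=1.0, rho_jnd=0.3, window_dur=0.100))   # False: sluggish window smooths it away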
2
Experiments and Methods
In the first experiment, the step size Δρ of a small but persistent IAC transition is varied, in order to estimate the electrophysiological correlate of the ρ-JND (Fig. 1A). In the second experiment the duration tgap of a short but fixed-valued IAC change (binaural gap with |Δρ| = 1) is varied (Fig. 1B). In both experiments the same four combinations of ρref and IAC-change direction were investigated, corresponding to the columns in Fig. 1. In addition, all subjects also participated in corresponding psychoacoustical experiments. The stimuli were generated in Matlab as a concatenation of 'reference' and 'gap' segments of Gaussian bandpass noise (100–2000 Hz) with the respective correlations ρref and ρtarget = ρref + Δρ. The stimuli were presented to nine normal-hearing subjects at a level of 65 dB SPL via insert earphones in an acoustically and electrically shielded booth. During the EEG recordings several 'ρref|ρtarget|ρref' sequences were concatenated by crossfading consecutive reference segments to yield a continuous running noise with no silence in between. Thus all brain activity time-locked to the IAC transition is solely due to binaural interaction along the auditory pathway, without being superposed by any monaurally evoked responses. This technique allows for a direct measurement of purely binaural late auditory evoked potentials (LAEP) and avoids the shortcomings of indirect methods such as the mismatch negativity or the binaural difference potentials. LAEP were recorded from both mastoids (A1, A2) with the vertex (Cz) as reference electrode. For each subject, amplitudes N1 and P2 were determined after filtering and averaging 1000 responses per experimental condition and value of Δρ and tgap.
Fig. 1 Time functions of the stimulus' IAC ρ(t): A experiment 1: ρ-JND; B experiment 2: tgap. Both experiments were performed for the same four combinations of ρref and the direction of the IAC change
Fig. 2 LAEP peak-to-peak amplitudes P2−N1 (mean over nine subjects): A dependence on (ρtarget)log = (ρref + Δρ)log (IAC scaled in dB(N0/Nπ)); B dependence on tgap (x-axis scaled logarithmically). Error bars denote the standard deviation over subjects. The black lines are linear functions fitted to the data. The vertical grey lines correspond to the respective psychoacoustical thresholds
3
Experimental Results
As an indicator for the activity in the auditory cortices, Fig. 2 shows the peak-to-peak amplitude P2−N1 (mean over nine subjects and both channels) as a function of the logarithmically transformed stimulus parameters ρtarget = ρref + Δρ and tgap. For experiment 1 there is a linear relationship between the difference P2−N1 and ρtarget scaled in dB(N0/Nπ) according to Eq. (3). In experiment 2 the values of P2−N1 increase linearly with the logarithm of tgap.
4
Model
Besides providing an adequate description of psychoacoustical data, the model has been designed to incorporate two important findings: (1) the linear relationships between the LAEP magnitudes and the (appropriately rescaled) stimulus parameters in both experiments indicate that, at the cortical level, sensitivity to changes in the IAC appears to be based on the dB-scaled ratio of N0- and Nπ-energies within the binaural temporal window; (2) psychoacoustical data by van de Par et al. (2001) suggest that the binaural system does not perform any normalization of levels in order to derive a measure of interaural correlation.
Our model structure (described in Fig. 3) therefore does not require any normalization, yet provides the time-resolved, dB(N0/Nπ) scaled IAC at its output. The model has two free parameters: (1) the SNR of uncorrelated internal noise, which simulates errors in neural processing independent of the stimulus levels at the left or right ear, and (2) the effective duration T90 of the temporal window (including 90% of the window's area; see Kollmeier and Holube 1992). For a set of stimuli with different static correlations we generated histograms of the model's output (ρw(t))log. The model's predictions for the ρ-JND are based on a comparison of these histograms. For each pair of histograms, representing a pair of stimuli with ρref and a deviant IAC, ROC curves were computed. From the areas below the ROC curves corresponding to a common ρref, the psychometric function with respect to ρref was derived, on which the 70.7% point was used to determine the discrimination threshold.
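A sketch of this decision stage: ROC areas are computed from pairs of model-output histograms, and the 70.7% point of the resulting psychometric function is taken as the threshold. The Gaussian placeholder distributions stand in for the real (ρw(t))log histograms and are purely illustrative.

import numpy as np

def roc_area(ref_samples, dev_samples):
    """Probability that a random deviant sample exceeds a random reference
    sample (area under the ROC curve), estimated by pairwise comparison."""
    ref = np.asarray(ref_samples)[:, None]
    dev = np.asarray(dev_samples)[None, :]
    return np.mean(dev > ref) + 0.5 * np.mean(dev == ref)

def jnd_from_psychometric(deviant_values, pc, criterion=0.707):
    """Deviant parameter at which percent correct reaches the 70.7% point
    (linear interpolation of the psychometric function)."""
    return np.interp(criterion, pc, deviant_values)

# Placeholder model-output histograms: Gaussian rho_w_log samples whose mean
# grows with the size of the IAC step (purely illustrative numbers).
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 2.0, 2000)                         # output for rho_ref, in dB(N0/Npi)
steps = np.array([0.5, 1.0, 2.0, 4.0, 8.0])              # deviant shifts in dB(N0/Npi)
pc = np.array([roc_area(ref, rng.normal(s, 2.0, 2000)) for s in steps])
print(pc, jnd_from_psychometric(steps, pc))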
Fig. 3 Model structure. After peripheral preprocessing (gammatone filterbank, addition of internal noise Nu at a fixed SNR, power-law envelope compression, hair cell transformation) a subtractive/additive mechanism decomposes the internal signals li(t) and ri(t) into N0 and Nπ components, i.e., c(t) = li(t) + ri(t) and a(t) = li(t) − ri(t). The corresponding powers c²(t) and a²(t) are then smoothed by a weighted temporal integration, yielding the N0- and Nπ-energies EC(t) and EA(t) within the binaural system's time window w(t − τ) (symmetrical double-sided exponential, centered at t). Finally, the difference of logarithmically compressed energies, 10 · [log(EC) − log(EA)], is computed for each time step. The result (ρw(t))log is the model's internal representation of IAC, being scaled in dB(N0/Nπ) automatically. After feature extraction, histograms of (ρw(t))log are generated.
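A reduced sketch of the feature extraction in Fig. 3, restricted to the additive/subtractive decomposition, the smoothed N0- and Nπ-energies and the logarithmic difference. The peripheral filterbank, compression and hair-cell stages are omitted, and the relation between the window time constant and T90 is an assumption chosen to reproduce the stated 10-ms equivalent rectangular duration.

import numpy as np

def iac_feature(left, right, fs, t90=0.025, snr_db=15.0, rng=None):
    """dB(N0/Npi)-scaled internal IAC trace, following the structure of Fig. 3
    (without the peripheral filterbank and compression stages)."""
    rng = np.random.default_rng(rng)
    noise = 10 ** (-snr_db / 20.0) * np.std(left)
    li = left + noise * rng.standard_normal(len(left))       # internal noise at a fixed SNR
    ri = right + noise * rng.standard_normal(len(right))
    c, a = li + ri, li - ri                                   # N0 and Npi components (EE / IE)
    # symmetrical double-sided exponential window; tau chosen so that +/- T90/2
    # contains 90% of the area (equivalent rectangular duration ~ 10 ms for T90 = 25 ms)
    tau = t90 / (2 * np.log(10))
    t = np.arange(-5 * tau, 5 * tau, 1 / fs)
    w = np.exp(-np.abs(t) / tau)
    w /= w.sum()
    ec = np.convolve(c ** 2, w, mode="same")                  # smoothed N0 energy
    ea = np.convolve(a ** 2, w, mode="same")                  # smoothed Npi energy
    return 10.0 * (np.log10(ec) - np.log10(ea))               # (rho_w)_log in dB(N0/Npi)

rng = np.random.default_rng(2)
x = rng.standard_normal(48000)
print(np.median(iac_feature(x, x.copy(), 48000)))                      # diotic input: near the internal-noise ceiling (~ +18 dB)
print(np.median(iac_feature(x, rng.standard_normal(48000), 48000)))    # uncorrelated input: near 0 dB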
For narrowband stimuli (1.3 ERB) with center frequencies between 250 and 1000 Hz the model was adjusted (fitting T90 and the SNR) to match the ρ-JND at ρref = 1 and ρref = 0 measured by Akeroyd and Summerfield (1999) and Gabriel and Colburn (1981), respectively. To test the model for intermediate ρref we used ρ-JND data by Pollack and Trittipoe (1959). They are displayed in Fig. 4 together with the model's predictions at a center frequency of 500 Hz.
5
Model Results and General Discussion
For an SNR of about 15 dB and an effective window duration T90 of about 25 ms (corresponding to an equivalent rectangular duration of 10 ms), the model's predictions agree very well with the ρ-JNDs from the literature for all center frequencies between 250 and 1000 Hz. Using the same parameters the model is also able to simulate binaurally masked tone detection thresholds reported by Breebaart and Kohlrausch (2001) and by van der Heijden and Trahiotis (1998) for a variety of masker correlations and bandwidths (see Fig. 4). The model's ρ-JND predictions are slightly asymmetric for positive (IAC change towards ρ = 1) vs negative (towards ρ = 0, not shown) directions, which is in agreement with the shape of cumulative d-prime functions by Culling et al. (2001). A variation of model parameters reveals that the model's performance for ρref near zero mainly depends on T90, while the ρ-JND at ρref = 1 is limited by the amount of internal noise. The reason for this behaviour is that the
Fig. 4 Comparison of model predictions (empty symbols) to literature data (filled symbols): A ρ-JND, data by Akeroyd and Summerfield (1999), Boehnke et al. (2002) and Pollack and Trittipoe (1959); dashed lines: 4 dB(N0/Nπ) difference to ρref. B, C detection thresholds for an antiphasic tone in narrowband noise for different bandwidths as functions of the masker's IAC. B data by Breebaart and Kohlrausch (2001). C data by van der Heijden and Trahiotis (1998)
predicted ρ-JND increases with the width of the histograms of the model's output, i.e., the variance of ρw. Because the variance of ρw is inversely proportional to the duration of the temporal window and inversely proportional to the bandwidth of the peripheral filter outputs, we conclude that the ρ-JND at ρref = 0 is a consequence of binaural sluggishness and (for stimulus bandwidths ≥ 1 ERB) of monaural peripheral filter bandwidth. An unrealistic improvement of the ρ-JND with increasing peripheral filter bandwidths at higher frequencies is prevented by the hair cell transformation. Accordingly, it is very likely that the model will provide (at least qualitatively) correct predictions for the effects of stimulus duration and bandwidth on IAC discrimination. Furthermore, since the model does not need any normalization of levels, its ρ-JND will increase if an additional interaural level difference is applied to the stimulus. In contrast, any normalization-based model would be insensitive to ILDs. However, the model has not been tested quantitatively for such stimuli yet. The window duration of the model's feature extraction mechanism (T90 = 25 ms) is much shorter than the direct T90 estimate of 120 ms from our own psychoacoustical data (analytically derived from Eq. 4). This apparent contradiction can be resolved by the hypothesis that two time windows contribute to the overall sluggishness of the whole perceptual process: the first window (25 ms) refers to the mechanism of feature extraction only. The hypothetical second time window characterizes the sluggishness of a presumably cortical mechanism which sorts the values of ρw(t) into the bins of an 'internal histogram'. Note that, for stimuli with static IAC, the second time window would have no effect at all on the shape (or width) of the internal histograms and the corresponding psychometric functions for IAC discrimination. The hypothesis that two time constants are required to account for the sluggishness of binaural perception has also been proposed by, e.g., Bernstein et al. (2001) for psychoacoustical data and by Dajani and Picton (2006) for electrophysiological data. In summary, the proposed model quantitatively explains the dependency of the ρ-JND on ρref as a consequence of binaural sluggishness, monaural filter bandwidth and the hair cell transformation. The predicted thresholds of the model are compatible with the literature and are roughly constant, at about 4 dB, on the dB(N0/Nπ) scale. The model avoids normalization, and its components are computationally simple and neurophysiologically plausible (e.g., EE and IE neurons for the additive and subtractive mechanism, respectively). Changes of the model's output due to IAC transitions in the stimulus are linearly related to the amplitudes of the LAEP to these stimuli. For these reasons we believe that the model structure provides an adequate description of IAC processing in both perceptual and neurophysiological terms. Acknowledgements. This research has been supported by the Deutsche Forschungsgemeinschaft.
References

Akeroyd MA, Summerfield AQ (1999) A binaural analog of gap detection. J Acoust Soc Am 105(5):2807–2820
Bernstein LR, Trahiotis C, Akeroyd MA, Hartung K (2001) Sensitivity to brief changes of interaural time and interaural intensity. J Acoust Soc Am 109(4):1604–1615
Boehnke SE, Hall SE, Marquardt T (2002) Detection of static and dynamic changes in interaural correlation. J Acoust Soc Am 112(4):1617–1626
Breebaart J, Kohlrausch A (2001) The influence of interaural stimulus uncertainty on binaural signal detection. J Acoust Soc Am 109(1):331–345
Culling JF, Summerfield Q (1998) Measurements of the binaural temporal window using a detection task. J Acoust Soc Am 103(6):3540–3553
Culling JF, Colburn HS, Spurchise M (2001) Interaural correlation sensitivity. J Acoust Soc Am 110(2):1020–1029
Dajani H, Picton T (2006) Human auditory steady-state responses to changes in interaural correlation. Hear Res 219(1/2):85–100
Gabriel KJ, Colburn HS (1981) Interaural correlation discrimination: I. Bandwidth and level dependence. J Acoust Soc Am 69(5):1394–1401
Kollmeier B, Gilkey RH (1990) Binaural forward and backward masking: evidence for sluggishness in binaural detection. J Acoust Soc Am 87(4):1709–1719
Kollmeier B, Holube I (1992) Auditory filter bandwidths in binaural and monaural listening conditions. J Acoust Soc Am 92(4.1):1889–1901
Pollack I, Trittipoe W (1959) Interaural noise correlation: examination of variables. J Acoust Soc Am 31(12):1616–1618
van der Heijden M, Trahiotis C (1997) A new way to account for binaural detection as a function of interaural noise correlation. J Acoust Soc Am 101(2):1019–1022
van der Heijden M, Trahiotis C (1998) Binaural detection as a function of interaural correlation and bandwidth of masking noise: implications for estimates of spectral resolution. J Acoust Soc Am 103(3):1609–1614
van de Par S, Trahiotis C, Bernstein LR (2001) A consideration of the normalization that is typically included in correlation-based models of binaural detection. J Acoust Soc Am 109(2):830–833
Comment by Carlyon
I don't think you can conclude that the linear relationship between LAEP amplitudes and the 'dB(C/A) scaled IAC' is evidence that sensitivity to changes in IAC is based on that log-ratio measure. This is because you do not have evidence that the linear LAEP amplitude is the appropriate decision statistic. For example, you could have calculated the log (or some other transform) of the LAEP amplitude, and we don't know whether one transform is better or worse than any other (or none). Depending on the transform used, you would need to change your log-scaled ratio to some other measure in order to maintain a linear relationship. To determine what scale the LAEP should be expressed in, one would have to measure not only the LAEP mean amplitude, but also its standard deviation. One could then use a scale on which the standard deviation is constant across the range of LAEP amplitudes measured. This might be a linear scale, but it might not.
Reply
Indeed the variance over epochs in our EEG data does not depend on the IAC or the IAC transition and is roughly the same at all latencies; in particular, it does not increase at latencies corresponding to the N1-/P2-response. Since the dB(N0/Nπ) transform is the Fisher Z-transform of the linear normalized IAC, the variance of the instantaneous dB(N0/Nπ)-IAC over time, as computed at the output of a moving temporal window, is also the same for all our stimuli. Following your argument, one could easily interpret the constancy of variance of both the EEG and the stimulus properties in the sense that equal differences in LAEP amplitude correspond to equal discriminability by means of d-prime. However, such an interpretation would be invalid for the following reason. In EEG recordings it is impossible to observe the activity of only those parts of the brain which are involved in binaural processing separately from other parts of the brain. Instead one always measures a far-field superposition of a comparatively tiny stimulus-related brain response S (we found maximum LAEP peak amplitudes of 3–4 µV in the average over epochs) and spontaneous brain activity N which is much higher in amplitude than S and is considered as noise (with an RMS value of about 10 µV in the filtered data before averaging). Because the stimulus-related binaural response (and in particular its possible variance) is much weaker than the spontaneous brain activity, it is not possible to separate the contributions of S and N to the EEG or its overall variance. Therefore the statistical approach described above is unfortunately not practicable for the analysis of EEG data from the auditory cortex. Additionally, using methods similar to Shackleton et al. (2003, 2005), we computed the area below the ROC curve corresponding to the amplitude statistics over 1000 epochs of EEG data elicited by stimuli with either the reference IAC or the deviant IAC. Even for the largest differences in IAC the ROC area (probability of correct discrimination) did not exceed 0.6, although the corresponding IAC transitions were clearly detectable in psychoacoustics (i.e., psychoacoustical ROC area close to 1).

References

Shackleton TM, Skottun BC, Arnott RH, Palmer AR (2003) Interaural time difference discrimination thresholds for single neurons in the inferior colliculus of guinea pigs. J Neurosci 23(2):716–724
Shackleton TM, Arnott RH, Palmer AR (2005) Sensitivity to interaural correlation of single neurons in the inferior colliculus of guinea pigs. J Assoc Res Otolaryngol 6(3):244–259
Comment by Lütkenhöner
The proposed transformation from the interaural correlation ρ to the dB-scaled ratio of diotic and antiphasic noise components, ρlog, is appealing. However, the question arises how to interpret ρlog in physiological terms. While the quantities
ρ and ρlog are more or less proportional for |ρ| < 0.5 (roughly corresponding to |ρlog| < 5 dB), |ρlog| becomes infinite for |ρ| → 1. The following consideration might help to understand this kind of singularity. The case ρ → −1 is related to the detection of a faint diotic noise, N0, against an antiphasic noise background, Nπ. If N0 is formally considered as the signal and Nπ as the noise, ρlog may be interpreted as the signal-to-noise ratio in dB. Correspondingly, −ρlog may be interpreted as the signal-to-noise ratio related to the detection of a faint antiphasic noise against a diotic noise background. This consideration suggests that the JNDs for |ρ| ≈ 1 might be of a quite different nature than the JNDs for |ρ| ≈ 0 (detection threshold versus discrimination threshold).

Reply
It is correct that the basic dB(N0/Nπ) transform as given in our article becomes infinite for values of ρ = +1 or −1. On the physiological level, however, such infinite values will never occur, due to irregularities of neural processing. In our model this is simulated by adding uncorrelated noise Nu to the dichotic input signal S at a signal-to-noise ratio (SNR) of 15 dB, which is assumed to be independent of the input signal's IAC. In a model without hair cell transform, the effective "internal IAC" of the mixture (S + Nu/SNR) is then ρ · SNR²/(1 + SNR²). The corresponding transformed value is 10 · log10 [(1 + ρ + 1/SNR²)/(1 − ρ + 1/SNR²)]. Thus, for an SNR of 5.6 (= 15 dB) the internal IAC ranges between −18 and +18 dB(N0/Nπ). Durlach et al. (1986), Koehnke et al. (1986), Culling et al. (2001) and Boehnke et al. (2002) suggested that the BMLD and IAC-JNDs can be explained by similar mechanisms. Our model can explain both kinds of experiments, e.g., BMLD data by van der Heijden and Trahiotis (1998) and by Breebaart and Kohlrausch (2001). However, Culling et al. (2001) described how different cues are used depending on masker bandwidth: in broadband conditions the signal is detected as a tone in the background noise. In contrast, a signal in narrowband noise causes a percept of spatial movement of the whole stimulus. Accordingly, whether listeners adopt a discrimination or a detection strategy could depend on the stimulus bandwidth rather than on the masker's IAC.

References

Durlach NI, Gabriel KJ, Colburn HS, Trahiotis C (1986) Interaural correlation discrimination: II. Relation to binaural unmasking. J Acoust Soc Am 79(5):1548–1557
Koehnke J, Colburn HS, Durlach NI (1986) Performance in several binaural-interaction experiments. J Acoust Soc Am 79(5):1558–1562
42 A Physiologically-Based Population Rate Code for Interaural Time Differences (ITDs) Predicts Bandwidth-Dependent Lateralization KENNETH E. HANCOCK1,2
1
Introduction
Interaural time differences (ITDs) are the most important cue to the location of sounds containing low-frequency energy (Wightman and Kistler 1992). ITDs are encoded centrally in the medial (MSO) and lateral (LSO) superior olives, which transmit the code to the inferior colliculus (IC) (Batra et al. 1997; Goldberg and Brown 1969). Each ITD-sensitive neuron is characterized by its best ITD (BD), the one producing maximal discharge rate. It is a longstanding view that these neurons are conceptually arranged in an array with best frequency (BF) on one axis and BD on the other to form a labeled-line code. According to this model, the stimulus ITD corresponds to the BD (i.e. the label) of the most active neuron in the array (Jeffress 1948). The labeled-line model is challenged by physiological data from guinea pig (confirmed in cat and gerbil) showing that the distribution of BD is highly dependent on BF, and in general does not correspond to the range of naturally-occurring ITDs (Brand et al. 2002; Hancock and Delgutte 2004; McAlpine et al. 2001). Instead, best interaural phase (BP = BD × BF) is more nearly independent of BF, such that the steepest slopes of neural rate-ITD curves tend to occur near the midline. Because the slopes, not the peaks, align near the midline (where perceptual ITD acuity is finest), it has been suggested that ITD is encoded by the discharge rate itself rather than by the locus of maximal activity (McAlpine et al. 2001). Thus, ITD may be represented by a population rate code, in which the activity of many neurons is pooled to form monolithic ITD channels on each side of the brain, and the stimulus ITD may be inferred by comparing the relative activity of the two channels (van Bergeijk 1962; von Békésy 1960). Though the physiological data suggest the existence of a rate code, analysis of its viability has barely begun (Marquardt and McAlpine 2001). Here, we demonstrate that a population rate code model can account for the dependence of perceived laterality on stimulus bandwidth.
1 Eaton-Peabody Laboratory, Massachusetts Eye & Ear Infirmary, Boston MA USA, [email protected]
2 Department of Otology and Laryngology, Harvard Medical School, Boston MA USA
2
Methods
Model neurons are arranged into four arrays, one representing each MSO and LSO (Fig. 1C). Individual neurons are modeled by the cross-correlation operation depicted in Fig. 1A. The sounds at each ear are filtered using identical gammatone filters (center frequency CF and time constant τ). The contralateral filter output is both delayed and phase-shifted (CD and CP, respectively), then multiplied by the ipsilateral filter output. The cross-correlator output is converted to a firing rate by a quadratic function with coefficients A and B. The single neuron model is thus specified by the six parameters {CF, τ, CD, CP, A, B}, whose values were previously constrained by fitting the model to cat IC data (Hancock and Delgutte 2004). For all model neurons, the coefficients A and B were assigned the mean physiological values. The filter time constant varies inversely with CF according to τ = Q/CF, where Q = 0.3. The remaining parameters were assigned as described below. We have made the simplest possible assumption that MSO and LSO have similar CF and BP distributions, and differ primarily in characteristic phase. The CP was set to zero for all neurons in the MSO arrays, and set to 0.5 cycles
Fig. 1 Population rate model of ITD coding: A model for ITD-sensitive neurons. Acoustic inputs are bandpass-filtered, then cross-correlated after applying delay (CD) and phase shift (CP) to one side. Quadratic function converts cross-correlation to neural firing rate; B broadband noise rate-ITD curves for two model neurons. Solid line, MSO neuron (CP=0). Dashed line, LSO neuron (CP = 0.5 cycles); C model neurons are grouped into four arrays, representing each MSO and LSO. CF is distributed along one dimension of array, best phase (BP) along the other. Output of each array is the sum of its neural rates; D array outputs as a function of ITD in response to broadband noise
in the LSO arrays. The CF parameter was represented along one dimension of each array according to a log-normal distribution with a mean of about 600 Hz, and ranging from 50 Hz to 1500 Hz. The best phase BP was represented along the other dimension following a Gaussian distribution with mean and standard deviation each equal to 0.3 cycles. Each value of BP was used to assign the characteristic delay: CD = BP/CF. Responses of two model neurons to broadband noise varied in ITD are shown in Fig. 1B. For both neurons, CF = 650 Hz and BP = 0.2 cycles (CD = 308 µs). The solid line represents the response of a model MSO neuron (CP = 0), and shows a peak firing rate at CD. The dashed line represents a model LSO neuron (CP = 0.5 cycles), and exhibits a null at CD. The output of each array is the sum of its individual firing rates. The outputs are illustrated in Fig. 1D as functions of ITD for broadband noise stimulation. Each MSO is maximally active when the stimulus is in the contralateral hemisphere, while each LSO is most strongly activated by ipsilateral stimulation.
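The following sketch assembles one such array along the lines just described: a gammatone-filtered cross-correlator per neuron with a characteristic delay and phase, and CF/BP values drawn from the stated distributions. The quadratic rate coefficients, the exact spread of the CF distribution and the sign conventions for delays are placeholders; the fitted physiological values are those of Hancock and Delgutte (2004).

import numpy as np
from scipy.signal import fftconvolve, hilbert

def gammatone(cf, fs, q=0.3, dur=0.05):
    """4th-order gammatone impulse response with time constant tau = q/cf."""
    t = np.arange(0, dur, 1 / fs)
    g = t ** 3 * np.exp(-t / (q / cf)) * np.cos(2 * np.pi * cf * t)
    return g / np.sqrt(np.sum(g ** 2))

def neuron_rate(left, right, fs, cf, cd, cp, a=1.0, b=0.0):
    """Cross-correlator neuron of Fig. 1A: filter both ears, delay and phase-shift
    the contralateral side by CD and CP, multiply, time-average, then apply a
    quadratic rate function (coefficients a, b are placeholders)."""
    g = gammatone(cf, fs)
    xl = fftconvolve(left, g, mode="same")
    xr = fftconvolve(right, g, mode="same")
    xr = np.roll(xr, int(round(cd * fs)))                    # characteristic delay (circular shift for brevity)
    xr = np.real(hilbert(xr) * np.exp(-2j * np.pi * cp))     # characteristic phase shift of the narrowband output
    r = np.mean(xl * xr)
    return a * r ** 2 + b * r                                # quadratic rate conversion

def sample_array(n, cp, rng):
    """One MSO/LSO array: CF roughly log-normal around 600 Hz (clipped to 50-1500 Hz),
    best phase Gaussian (0.3 +/- 0.3 cycles), CD = BP/CF."""
    cf = np.clip(rng.lognormal(np.log(600.0), 0.5, n), 50.0, 1500.0)
    bp = rng.normal(0.3, 0.3, n)
    return cf, bp / cf, np.full(n, cp)

# Array output for a broadband noise carrying an ITD (sign conventions illustrative):
rng = np.random.default_rng(0)
fs, itd = 48000, 500e-6
noise = rng.standard_normal(int(0.2 * fs))
left, right = noise, np.roll(noise, int(round(itd * fs)))
cf, cd, cp = sample_array(200, cp=0.0, rng=rng)              # a 200-neuron MSO array (CP = 0)
mso_out = sum(neuron_rate(left, right, fs, *p) for p in zip(cf, cd, cp))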
3
Results

3.1
Summary of Psychophysical Results to be Modeled
This section summarizes psychophysical data which illustrate a dependence of laterality on stimulus bandwidth (Trahiotis and Stern 1994), and which represent a nontrivial test of lateralization in the model. The stimulus is bandpass noise centered at 500 Hz, with ITD=1.5 ms. Figure 2A shows the pattern of activity produced by this stimulus in the BF-BD plane. The full display consists of a series of peaks and valleys, but for clarity we show only the two peaks closest to the midline (solid black lines). The straight contour is at 1.5 ms, corresponding to the ITD of the stimulus. The secondary contour is separated from the main contour by the CF period, and hence is curved. When the stimulus is narrowband (dark gray shading), it is perceived on the left. Trahiotis and Stern (1994) explained this percept as “centrality” dominated, because it favors the secondary peak in the BF-BD plane which occurs nearer the midline. In contrast, broadband stimuli (light gray shading) are perceived on the right side. This was described as “straightness” dominated, because it favors the peak for which the ITD is consistent across CF. Laterality also depends on the stimulus interaural phase difference (IPD). The dashed lines in Fig. 2A show the contours corresponding to a 270° phase shift. The ITD was adjusted so that the contours always passed through the center frequency at constant ITD values. Shifting the phase straightens the left contour and curves the right one. In the broadband condition, this stimulus is perceived on the left because that contour is favored by both straightness and centrality.
Fig. 2 A Contours of peak activity in CF-BD plane produced by noise with ITD=1.5 ms. Solid lines, IPD=0°. Dashed lines, IPD=270°. B Image heard to the left for narrow bandwidths and/or large IPDs. Image heard to the right for wide bandwidths
3.2
Model
The dependence of lateralization on bandwidth and IPD can be explained using models that incorporate both straightness- and centrality-weighting (Stern et al. 1988). Weighting that reflects the overall tendency for physiological BD values to occur within the naturally-occurring ITD range is the most straightforward realization of centrality (Shackleton et al. 1992; Stern et al. 1988), and is an explicit property of the model described in this chapter. One method of implementing straightness-weighting is to integrate across BF along constant values of BD (Shackleton et al. 1992). This fundamentally requires a labeled-line representation because the inputs to the integration stage must be segregated according to BD. We show here that the bandwidth-dependent lateralization data can also be simulated using a simple population rate code model, without explicit straightness-weighting and the resulting need for labeled lines. Figure 3A shows the output of each of the four channels as a function of bandwidth (for IPD = 0°). We consider first a two-channel model comprising only MSO activity, and argue that it is insufficient to predict the psychophysical data. We assume that the position estimate is simply a vector sum of the MSO rates, and note that activity in one MSO corresponds to an image position in the contralateral hemifield. As bandwidth gets larger, the activity in the right MSO (RMSO) decreases (solid gray line) while the activity in the left MSO (LMSO) increases (solid black line). This correctly predicts that the lateral
Fig. 3 A Individual channel responses to noise (ITD=1.5 ms, IPD=0°) vs bandwidth. B Left and right components of model position estimates. C Model fit to psychophysical data
position moves rightward with increasing bandwidth. But the image can never actually cross to the right of the midline because RMSO is always more active than LMSO. This reflects the fact that the secondary peak produced by the stimulus is always more central than the main peak. A four-channel model incorporating both MSO and LSO, however, can account for both the trends and magnitudes of the psychophysical data. Lateral position estimates were generated by linear combination of the channel outputs: P = a (LMSO − RMSO ) + b (LLSO − RLSO )
(1)
The parameters a = 1.55 × 10−3 and b = 1.45 × 10−3 were chosen to minimize the sum of squared error between the model position estimates and the psychophysical data. The resulting model fit (Fig. 3C) agrees well with the data (Fig. 2B), for the reasons discussed below.
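A minimal sketch of this fitting step: with the four channel outputs and the psychophysical laterality judgments arranged over a common bandwidth axis, Eq. (1) is linear in a and b and can be solved by least squares. The numbers below are hypothetical stand-ins for the curves in Figs. 2B and 3A.

import numpy as np

# Hypothetical channel outputs and laterality data on a common bandwidth axis:
lmso, rmso = np.array([200., 260., 330., 400.]), np.array([420., 400., 380., 360.])
llso, rlso = np.array([500., 520., 560., 600.]), np.array([300., 310., 330., 350.])
positions  = np.array([-0.4, -0.1, 0.2, 0.5])          # perceived laterality, arbitrary units

# Eq. (1): P = a*(LMSO - RMSO) + b*(LLSO - RLSO); solve for a and b by least squares.
X = np.column_stack([lmso - rmso, llso - rlso])
(a, b), *_ = np.linalg.lstsq(X, positions, rcond=None)
print(a, b)     # these play the same role as a = 1.55e-3 and b = 1.45e-3 in the chapter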
The effect of incorporating LSO channels into the model is illustrated in Fig. 3B. The position estimates derived from the channel outputs of Fig. 3A are shown decomposed into left and right components: PR = a LMSO − b RLSO
PL = a RMSO − b LLSO
(2)
Including LSO channels preserves the essential trend exhibited by the MSO outputs (position increasingly favors the right as bandwidth increases), but shifts the balance such that the position estimate crosses the midline and attains realistic magnitudes. Because the main peak occurs outside the natural range of ITD, it evokes unnaturally large activity in LLSO. In this model, it is heightened LLSO activity, rather than straightness, that trades with the centrality manifested in the RMSO activity to shift the image across the midline. As IPD increases from 0° to 270°, the main stimulus peak curves away from the array of BDs comprising the LMSO channel. At the same time, the secondary peak straightens, becoming more closely aligned with the RMSO channel. Consequently, RMSO activity increases with respect to LMSO activity, and the image position shifts to the left. As discussed in Sect. 3.1, this is purely a reflection of centrality.
4
Discussion

4.1
Advantages to a Code Without Labeled Lines
Neural best ITDs are determined by several factors, including axonal propagation delays, inhibition, and perhaps peripheral mechanical delays due to interaural CF mismatches (Beckius et al. 1999; Brand et al. 2002; Joris et al. 2004). A labeled-line code demands a stable and relatively precise distribution of best ITDs from these combined factors. In contrast, the population rate model requires only a hemifield bias in the best ITD distribution of each channel. The details of the distribution are not necessarily important, especially if the ITD processor is part of a larger feedback control system that guides orienting movements by restoring balance among the outputs of the four sensory channels.
4.2
A Motor Interpretation of the Population Rate Code
Rotation of the head about the vertical axis involves four muscles, two on each side of the head. Rotation to the right is accomplished primarily by the sternocleidomastoid (SCM) muscle on the left side (i.e. contralateral to
Fig. 4 Motor control suggests a possible strategy underlying the population rate code. Broken lines indicate that the sensory-motor connections are conceptual; no physical connections are implied
the direction of motion) with assistance from the splenius muscles on the right (ipsilateral to the motion). The outputs of the population rate model may be suitable for generating motor commands that orient the head toward a sound source (Fig. 4). Specifically, because the MSO is broadly responsive to sounds originating in the contralateral hemisphere, its output is appropriate for driving the muscles which turn the head toward the contralateral hemisphere. The LSO channels in Fig. 4 inhibit motion toward the ipsilateral side, consistent with the function illustrated by Eq. (2) and Fig. 3. Low-frequency LSO neurons, which presumably comprise the ITD-sensitive component of the LSO, make inhibitory projections to the ipsilateral IC (Glendenning and Masterton 1983). The broken lines in Fig. 4 emphasize that the auditory brainstem does not have direct control over motor action. Rather, the intention is only to suggest a functional strategy underlying the coding of ITD. It is conceivable, however, that the population rate code described here is an evolutionary remnant of a primitive, more direct coupling between sensory stimulation and motor response. In that context, it is interesting to note that the saccule responds to moderately intense acoustic stimulation and projects to SCM motoneurons by way of the vestibular nucleus (McCue and Guinan 1997).
4.3
Conclusion
The population rate code and the motor interpretation represent a simple, alternative framework to the conventional labeled-line view of ITD coding. It may prove useful in considering such issues as the development of ITD processing, comparison of ITD processing across species, and binaural hearing in reverberant environments (Devore et al. 2006). Acknowledgements. This work was supported by NIH grants DC07353 and DC002258.
References

Batra R, Kuwada S, Fitzpatrick DC (1997) Sensitivity to interaural temporal disparities of low- and high-frequency neurons in the superior olivary complex. I. Heterogeneity of responses. J Neurophysiol 78:1222–1236
Beckius GE, Batra R, Oliver DL (1999) Axons from anteroventral cochlear nucleus that terminate in medial superior olive of cat: observations related to delay lines. J Neurosci 19:3146–3161
Brand A, Behrend O, Marquardt T, McAlpine D, Grothe B (2002) Precise inhibition is essential for microsecond interaural time difference coding. Nature 417:543–547
Devore S, Ihlefeld A, Shinn-Cunningham BG, Delgutte B (2006) Neural and behavioral sensitivities to azimuth degrade similarly in reverberant environments. International Symposium on Hearing, Chap 24
Glendenning KK, Masterton RB (1983) Acoustic chiasm: efferent projections of the lateral superior olive. J Neurosci 3:1521–1537
Goldberg JM, Brown PB (1969) Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: some physiological mechanisms of sound localization. J Neurophysiol 32:613–636
Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117
Jeffress LA (1948) A place theory of sound localization. J Comp Physiol Psychol 41:35–39
Joris PX, van der Heijden M, Louage D, Van de Sande B, Van Kerckhoven C (2004) Dependence of binaural and cochlear "best delays" on characteristic frequency. In: Pressnitzer D, de Cheveigne A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York
Marquardt T, McAlpine D (2001) Simulation of binaural unmasking using just four binaural channels. Assoc Res Otolaryngol Abs, p 87
McAlpine D, Jiang D, Palmer AR (2001) A neural code for low-frequency sound localization in mammals. Nat Neurosci 4:396–401
McCue MP, Guinan JJ Jr (1997) Sound-evoked activity in primary afferent neurons of a mammalian vestibular system. Am J Otol 18:355–360
Shackleton TM, Meddis R, Hewitt MJ (1992) Across frequency integration in a model of lateralization. J Acoust Soc Am 91:2276–2279
Stern RM, Zeiberg AS, Trahiotis C (1988) Lateralization of complex binaural stimuli: a weighted-image model. J Acoust Soc Am 84:156–165
Trahiotis C, Stern RM (1994) Across-frequency interaction in lateralization of complex binaural stimuli. J Acoust Soc Am 96:3804–3806
van Bergeijk W (1962) Variation on a theme of von Békésy: a model of binaural interaction. J Acoust Soc Am 34:1431–1437
von Békésy G (1960) Experiments in hearing. McGraw-Hill, New York, pp 272–301
Wightman FL, Kistler DJ (1992) The dominant role of low-frequency interaural time differences in sound localization. J Acoust Soc Am 91:1648–1661
Comment by Carr
Your paper discusses the advantages of a code without labeled lines for ITD. It is not clear to me why there need be a dichotomy between the labeled-line code strategy of an ITD map and a rate code. Many sensory variables are encoded by activity in populations of neurons with bell-shaped tuning curves. Models of information coding from these populations appear consistent with
the behavioral resolution of ITD (Skottun et al. 2001). Takahashi et al. (2003), however, have argued that rate coding and place coding are not mutually exclusive. In the barn owl, they have shown that changes in the firing rate of space-specific neurons can serve as the basis of a spatial discrimination task. The place code, which is based on the position of active neurons in the space map of the barn owl IC, appears to be used to direct orientation toward a sound source, such as the motor task you use in your example.

References

Skottun BC, Shackleton TM, Arnott RH, Palmer AR (2001) The ability of inferior colliculus neurons to signal differences in interaural delay. Proc Natl Acad Sci 98:14050–14054
Takahashi TT, Bala AD, Spitzer MW, Euston DR, Spezio ML, Keller CH (2003) The synthesis and use of the owl's auditory space map. Biol Cybern 89:378–387
Reply
As you describe, Takahashi et al. argue that the two codes might coexist, but are applicable to different tasks. They suggest a population rate code for the general task of discriminating two stimuli, and a place code for the more specific task of sound localization. What I tried to show is that the population rate code (with four channels) can also account for sound localization; that is, it can apply to both kinds of tasks, at least under the stimulus conditions considered. The larger issue, of course, is whether or not mammals use the same strategy as barn owls (and chickens?) to localize sound based on ITD. That, I believe, is still an open and interesting question.
43 A π-Limit for Coding ITDs: Neural Responses and the Binaural Display DAVID MCALPINE1, SARAH THOMPSON1, KATHARINA VON KRIEGSTEIN2, TORSTEN MARQUARDT1, TIMOTHY GRIFFITHS2, AND ADENIKE DEANE-PRATT1
1
Introduction
Interaural time differences (ITDs) are the main cues used by humans to determine the horizontal position of low-frequency (<1500 Hz) sound sources. The neural representation of ITDs is presumed to be one in which brain centres in each hemisphere encode the opposite (contralateral) side of space (Jenkins and Merzenich 1984). Most human psychophysical studies assume that the range of ITDs encoded is constant across the range of sound frequencies at which sensitivity to ITDs in the fine-structure of sounds is observed (<1500 Hz) (Trahiotis and Stern 1989) and is determined largely by the physiological range, but with greater ITDs, probably up to at least 3000 µs, explicitly encoded in the 500-Hz frequency band in order to account for human psychophysical performance (van der Heijden and Trahiotis 1999). The brain's response to ITDs is commonly represented in the form of a cross-correlogram, which plots the cross-correlation function of the sound at each ear for each frequency channel following cochlear filtering. For the example in Fig. 1, a 500-Hz tone presented over stereo headphones, and containing an ITD of −1500 µs (i.e. leading at the left ear), activates the cross-correlogram at multiple periods of the stimulus waveform, giving peaks in activity every 2000 µs within the 500-Hz channel (black sinusoid in 500-Hz channel). Human listeners report such a sound to have an intracranial image on the side contralateral to the leading ear (Trahiotis and Stern 1989), in this case the right side, consistent with an ITD of +500 µs. It appears that for tones and narrow bands of noise, the auditory system resolves the ambiguity in the internal representation by selecting the shortest of the possible ITDs – a weighting for centrality that has been explained by the existence of more coincidence-counting units encoding ITDs that lie within the physiological range (Stern et al. 1988). This central weighting is shown in Fig. 1 as an increased gray-scale density for shorter ITDs. As signal bandwidth is

1 Ear Institute, University College London, London, UK, [email protected], [email protected], [email protected], [email protected]
2 Wellcome Department of Imaging Neuroscience, Institute of Neurology, University College London, London, UK, [email protected], [email protected]
Fig. 1 Representation of ITD information in the form of a cross-correlogram. Grey-scale density illustrates central weighting function of ITD detectors used in central weighting models. Curved dashed lines indicate the “π-limit”. See text for full description
increased, however, the intracranial auditory image shifts from the right to the left side until, for a bandwidth of 400 Hz (grey vertical bar to left of correlogram in Fig. 1), the image is lateralised fully to the left side corresponding to that of the “true” ITD. As for a tone, multiple peaks of activation appear in the plot, but the peaks at the true ITD (in this case, −1500 µs) are aligned across frequency channels, a pattern referred to as “straightness” (Trahiotis and Stern 1989). As such, it has been proposed that a second layer of coincidence detectors exists, potentially in the midbrain, multiplying across-frequency straightness (Stern and Trahiotis 1997) and thereby giving weight to the true ITD which, under real-world listening conditions, would be consistent across frequency for a single sound emanating from a single source. This straightness-weighting therefore accounts for the ‘true’ lateralized image of broadband sounds with high ITD on the basis of midbrain activity on the opposite side to the lateralised sound image (grey curve at top of Fig. 1). However, straightness-weighting is unsatisfactory because not only would a full representation of ITDs have to exist in each frequency band, including ITDs that can never be experienced under natural listening conditions, but a second level of coincidence detectors, tuned specifically to across-frequency straightness at these ITDs, would also be required. The existence or otherwise of neural mechanisms specialized to detect such long ITDs is disputed.
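The central-weighting account outlined above can be made concrete with a small numerical sketch. The following Python fragment is an editorial illustration, not part of the original chapter; the 600-µs weighting constant and all other numbers are assumptions. It computes the 500-Hz channel of a cross-correlogram for a tone carrying a −1500-µs ITD and picks the peak favoured by a centrality weighting; cochlear filtering is omitted because a pure tone passes the 500-Hz channel essentially unchanged.

import numpy as np

fs = 48_000.0
cf = 500.0                       # channel centre frequency (Hz)
itd = -1500e-6                   # stimulus ITD (s); negative = leading at the left ear
t = np.arange(int(0.5 * fs)) / fs

s = lambda x: np.sin(2 * np.pi * cf * x)
left = s(t)                      # left-ear signal; the right ear receives s(t + itd), lagging by 1500 us

# Cross-correlation as a function of internal delay (the abscissa of the correlogram);
# for a pure tone the lagged right-ear signal can be written analytically as s(t - lag + itd).
lags = np.arange(-3000, 3001, 10) * 1e-6
corr = np.array([np.mean(left * s(t - lag + itd)) for lag in lags])

# Local maxima recur every stimulus period (2000 us); weight them with a simple
# centrality function standing in for the density of coincidence-counting units.
peak = np.zeros_like(corr, dtype=bool)
peak[1:-1] = (corr[1:-1] > corr[:-2]) & (corr[1:-1] > corr[2:])
weighted = np.exp(-np.abs(lags[peak]) / 600e-6) * corr[peak]
print(lags[peak][np.argmax(weighted)] * 1e6)   # ~ +500 us, not the true -1500 us

For the 400-Hz-wide noise of Fig. 1 one would repeat this for each cochlear channel and then ask whether across-frequency “straightness” or centrality dominates, which is exactly the point at issue in what follows.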
Recent investigations indicate only a restricted range of ITD detectors in the mammalian brain (McAlpine et al. 2001; Hancock and Delgutte 2004), with no neurons showing tuning for ITDs beyond approximately 1/2 a cycle of the centre frequency of each auditory filter. We refer to this as the π-limit (denoted by the curved dashed lines running down Fig. 1). Note that with the π-limit the 500-Hz frequency band possesses ITD detectors up to a maximum of 1000 µs, corresponding to 1/2 the period of the 500-Hz centre frequency. The π-limit must therefore account for the ‘true’ lateralized image of broadband sounds with high ITD on the basis of midbrain activity on the same side as the lateralised sound image. Note that the models make entirely opposite predictions for ITDs of 1500 µs: the straightness model predicts greater activity in the brain hemisphere contralateral to the lateralised sound image, while the π-limit predicts greater activity in the ipsilateral hemisphere. This is in contrast to ITDs of ±500 µs, for which both models predict greater activation in the brain hemisphere contralateral to the lateralized sound image. Here, we examine the representation of ITD in the human midbrain and cortex using functional magnetic resonance imaging (fMRI), the mismatch negativity potential (MMN), and headphone-presented sounds with variable ITD.
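As a worked illustration of the opposing predictions (again an editorial sketch in Python, with purely illustrative numbers), the π-limit can be expressed as a fold-back of the stimulus ITD into the ±half-period range of a frequency channel:

def pi_limit_us(cf_hz):
    """Largest internal delay available in a channel under the pi-limit: half the period."""
    return 0.5e6 / cf_hz

def folded_itd_us(itd_us, cf_hz):
    """Fold a stimulus ITD back into the +/- half-period range of the channel."""
    period = 1e6 / cf_hz
    return (itd_us + period / 2) % period - period / 2

print(pi_limit_us(500))            # 1000.0 us: no detector tuned beyond this in the 500-Hz channel
print(folded_itd_us(-1500, 500))   # +500.0 us: the detectors driven have the opposite sign to the true ITD

For a broadband sound lateralised to the side of the true −1500-µs ITD, the π-limit therefore predicts greater activity in the hemisphere ipsilateral to the image, whereas the straightness model predicts greater contralateral activity.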
2
Methods
For the functional magnetic resonance imaging (fMRI) experiment, stimuli were 400-Hz-wide band-pass noise exemplars (fixed amplitude, random phase) centred at 500 Hz. Stimuli consisted of eight consecutive noise bursts of 1 s each from the same condition (8 s), or 8 s of silence, presented via stereo headphones. ITDs were ±1500 µs, ±500 µs and 0 µs, plus a silent condition. Sounds with negative ITDs were always perceived on the left, and those with positive ITDs on the right. BOLD contrast images were acquired using T2*-weighted EPI and sparse imaging. A total of 48 slices in ascending axial order were obtained with cardiac gating for 14 subjects, all of whom were right-handed, with no hearing impairment or history of neurological disorder. Subjects were asked to pay attention to the noises, and to press a button on their keypad at the end of each trial to maintain alertness. They were also asked to keep their eyes open and fixate on a central cross, in order to counteract any confound caused by correlated eye movements. Imaging data were analysed using the statistical parametric mapping algorithm implemented in SPM2. Structural and EPI images were co-registered and normalised to a standard grey-matter template (Montreal Neurological Institute). Data were thereby transformed into a standard stereotaxic space and subsampled with a voxel resolution of 2 × 2 × 2 mm (original voxel resolution was approximately 3 × 3 × 3 mm). Data were spatially smoothed with a Gaussian smoothing kernel of 5 mm. SPM2 was used to compute individual-subject analyses according to the General Linear Model by
fitting the data time-series with the canonical Haemodynamic Response Function (cHRF) at the onset of each trial. Each condition was modeled as an individual regressor in the design matrix, and statistical parameter estimates were computed individually for each brain voxel. The mismatch negativity (MMN), a component of the auditory evoked potential produced by context-dependent changes in the environment, is reported as being greater at electrodes contralateral to the perceived location of a sound. In each of four blocks, participants read self-selected material while seated in a soundproof booth, and were instructed to ignore the sounds. Stimuli were presented via headphones at 65 dB SPL in pseudo-random order with an SOA of 1 s. Each block was ten minutes long and consisted of 480 standard stimuli and 30 × 4 deviant stimuli (0.8 and 0.05 probability of occurrence, respectively). A total of 2400 (1920 standard and 480 deviant) 400-Hz-wide, bandpass-filtered white-noise bursts, centred on 500 Hz, were generated in MATLAB using the Binaural Toolbox. Noises of 500 ms duration, including a 50-ms rise/fall time, were digitized at 20,000 Hz. There was no difference in onset time, and delays were applied to the right channel only. Standard stimuli had an ITD of 0 µs and a perceived location in the middle of the head. The electroencephalogram (EEG) was recorded using NUAMPS with Ag-AgCl electrodes from 32 channels positioned according to the extended 10–20 system and re-referenced offline to the average from the mastoids. Horizontal and vertical electrooculograms were recorded with electrodes placed above and below the left eye, and on the outer canthi. Blink removal, artefact rejection (threshold: ±100 µV) and lowpass-filtering below 20 Hz were performed offline. Data for each condition were averaged from 600-ms epochs (beginning 100 ms pre-stimulus). MMN was defined as the mean amplitude of the deviant-minus-standard difference waveforms in a 40-ms window centred on the grand-average negative peak at each electrode. Statistical analysis was performed at three left-right electrode pairs: F3-F4, FC3-FC4, C3-C4. One-sample t-tests were used to confirm that MMN amplitudes differed from zero, and a three-way ANOVA with factors delay (−1500, −500, +500, +1500) × hemisphere (left, right) × row (F, FC, C) confirmed no main effects and no interactions for peak MMN latencies.
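For readers who wish to reproduce the amplitude measure, the following Python/NumPy fragment is a schematic reconstruction (the sampling rate, array names and epoch layout are assumptions, not the authors' analysis code): MMN is taken as the mean of the deviant-minus-standard difference waveform in a 40-ms window centred on the grand-average negative peak.

import numpy as np

fs = 1000.0                                        # assumed EEG sampling rate (Hz)
t_ms = np.arange(int(0.6 * fs)) / fs * 1000.0 - 100.0   # 600-ms epochs starting 100 ms pre-stimulus

def mmn_amplitude(standard_epochs, deviant_epochs, peak_ms, win_ms=40.0):
    """Mean deviant-minus-standard amplitude in a win_ms window centred on peak_ms.

    standard_epochs, deviant_epochs: arrays of shape (n_trials, n_samples) for one
    electrode, already baseline-corrected, artefact-rejected and low-pass filtered.
    """
    diff = deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)
    mask = np.abs(t_ms - peak_ms) <= win_ms / 2
    return diff[mask].mean()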
3
Results
Initial analysis served to map the IC functionally in both hemispheres. To that end, individual statistical maps were generated for the contrast All Sounds > Silence. Data from the resulting contrast maps for 14 subjects were entered into a second-level analysis to allow population-level inferences to be drawn (random effects). The group analysis was thresholded at p < 0.05 (FWE-corrected) to map the IC in both hemispheres as our a priori ROI. The one resulting ROI was divided at the midline (x = 0) into two ROIs representing
right and left ICs respectively. Contrasts of interest were between [(−500 µs normal orientation) + (500 µs at right-left flipped orientation)] > [(1500 µs normal orientation) + (−1500 µs at right-left flipped orientation)]. These data are shown in Fig. 2, and demonstrate clearly that the representation of ITDs in the human midbrain switches from contralateral to ipsilateral when the ITD of bandpass noise is increased beyond the π-limit. In this, our result accords with a growing body of physiological data from the auditory midbrain of other mammals (McAlpine et al. 2001; Hancock and Delgutte 2004), suggesting that only a limited representation of ITDs, corresponding to an upper limit of approximately 1/2 the period of the stimulus centre frequency, exists within each sound frequency channel in the auditory brain. The data are also consistent with an absence of any preference for straightness weighting when the ITD is ±1500 µs.
Fig. 2 a Coronal slice, mapping the IC by contrasting all noise conditions with the silent baseline for all subjects. Activation is superimposed on a standard structural brain template. The area marked with a square is shown enlarged in b which shows group statistical parametric maps for the contrasts between ipsi- and contralateral delays at different ITD (500 µs/1500 µs) within the functionally mapped IC. c Percentage change in BOLD response
Fig. 3 a Grand mean difference waveforms (left panels) for each condition at the electrode pair FC3-FC4. Note that negative amplitudes are plotted upwards. b Mean MMN amplitudes from the electrode pairs in a. The ITDs of ±1500 µs correspond to 3/4 of a cycle of the stimulus centre frequency and ±500 µs to 1/4 of a cycle. Ipsilateral and contralateral refer to the brain hemisphere referenced to the ear at which the sound leads in time
A similar pattern to the IC responses was also observed at the cortical level in the analysis of MMN at each electrode pair. Figure 3a shows the difference waveforms at FC3 (left hemisphere) and FC4 (right hemisphere). Peak activation occurs between 100 and 200 ms post-stimulus onset and is greater at the electrode contralateral to the perceived location for the −500-µs condition (p<0.05). This conforms to a bias toward greater right-hemisphere activation previously seen in a number of studies, presumably reflecting right-hemispheric specialisation for spatial stimuli. Sounds on the left that are within the π-limit elicit stronger contralateral activity, whereas right-sided stimuli are represented more bilaterally; outside the π-limit there is no difference between left and right electrodes. In order to examine the data without this bias, mean contralateral (e.g. [+500 µs at FC3] + [−500 µs at FC4]; [+1500 µs at FC3] + [−1500 µs at FC4]) and ipsilateral (e.g. [+500 µs at FC4] + [−500 µs at FC3]; [+1500 µs at FC4] + [−1500 µs at FC3]) MMN amplitudes were compared. Here, contralateral activation was greater (p<0.05) within, but not beyond, the π-limit, as can be seen in Fig. 3b. These data are consistent with our fMRI results and with animal data, but not with the straightness model or with a full-range representation of ITDs.
4
Discussion
Imaging data from the human brain are consistent with the notion that ITD is represented by a restricted range of detectors, such that ITDs beyond 1/2 a cycle of the stimulus centre frequency are not explicitly coded. For the stimuli used in the current study, ITDs equivalent to ±3/4 of the period of the stimulus centre frequency are therefore likely encoded by neurons whose response
maxima are evoked by ITDs equivalent to 1/4 of the period of the stimulus centre frequency, and of opposite sign (and opposite lateralised location). Thus, the notion that activation of auditory spatial detectors is determined by the side from which a sound source is heard to originate does not hold. Clearly, however, both the absolute brain activation and the relative activation across the two brain hemispheres differ between the ITDs of ±500 and ±1500 µs. These differences likely stem from differences in interaural correlation between the two stimuli. How such differences might be interpreted as differences in lateral position and extent of an intracranial sound image remains to be determined, but it should be noted that successful models of binaural hearing exist, designed to account for psychophysical data, in which the notion of internal delays is completely absent. The weighted-image model of ITD processing (Stern et al. 1988), designed to account for human psychophysical performance, posits the existence of a second level of coincidence detection explicitly to account for the switch in lateralised percept of noise with long (±1500 µs) delays as the bandwidth is increased. However, the current data call into question the existence of the required straightness detectors, at least up to the level of primary auditory cortex. Explicit anatomical and physiological manifestations such as have been suggested for the barn-owl brain (Wagner et al. 1987) might therefore not be relevant to binaural hearing in mammals. Future use of binaural displays such as the cross-correlogram should take account of such physiological data when interpreting psychophysical findings. Although the straightness-weighting model is a simple and attractive notion, our data demonstrate that it does not hold for all ITDs, e.g. it cannot account for sensitivity to ITDs greater than 1/2 the stimulus period. Our data are entirely consistent with the π-limit model. Changes in the perceived location of a sound have been reported to generate an MMN, which is reported to be maximal in the brain hemisphere contralateral to the perceived location. Ergo, there is presumed to be a switch in the laterality of brain activation to accompany the switch in lateral image for broadband stimuli with ITDs of opposite sign. The current study indicates this not to be the case. The MMN does not appear to follow the perceptual shift, but rather the most active neural population, suggesting that its generation is related to detection of activity in a neural population rather than to the perceived location of a sound source. Therefore, the popularly held notion that the MMN is a marker for the perceptual processing of auditory cues, including cues for spatial hearing, may have to be revised.
References Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117 Jenkins WM, Merzenich MM (1984) Role of cat primary auditory cortex for sound-localization behavior. J Neurophysiol 52:819–847
McAlpine D, Jiang D, Palmer AR (2001) A neural code for low-frequency sound localization in mammals. Nat Neurosci 4:396–401 Stern RM, Trahiotis C (1997) Psychophysical and physiological advances in hearing. In: Palmer AR, Rees A, Summerfield AQ, Meddis R (eds) Proceedings of the XI International Symposium on Hearing. Whurr Publishers, London Stern RM, Zeiberg AS, Trahiotis C (1988) Lateralization of complex binaural stimuli: a weighted-image model. J Acoust Soc Am 84:156–165 Trahiotis C, Stern RM (1989) Lateralization of bands of noise: effects of bandwidth and differences of interaural time and phase. J Acoust Soc Am 86:1285–1293 van der Heijden M, Trahiotis C (1999) Masking with interaurally delayed stimuli: the use of “internal” delays in binaural detection. J Acoust Soc Am 105:388–399 Wagner H, Takahashi T, Konishi M (1987) Representation of interaural time difference in the central nucleus of the barn owl’s inferior colliculus. J Neurosci 7:3105–3116
Comment by Hall Can you comment on whether the pattern of response asymmetries observed within the IC was also found within the MGB? One would probably expect this to be the case, especially since in both these sites your activation is likely to be represented by one voxel (given the size of your smoothing kernel). The convergent evidence would further support your interpretation even if, as you say, the cortical activation pattern is complicated by the right hemisphere preference for spatial cues.
44 A π-Limit for Coding ITDs: Implications for Binaural Models TORSTEN MARQUARDT AND DAVID MCALPINE
1
Introduction
Interaural time difference (ITD) is an important sound localization cue, arising from the different travel time of a sound from its source to the left and right ears for sources located to either side of the head. Neural extraction of ITDs occurs in the superior olivary complex (SOC). SOC neurons receive binaural input and are thought to perform a process of cross-correlation between the spike trains arriving from the left and right ears. Since the acoustic signals are bandpass filtered by the cochlea, the ITD tuning curves of SOC neurons to noise stimuli exhibit shapes similar to cross-correlation functions of bandpass noise (Yin and Chan 1990). They are generally sinusoidal in shape and their amplitude is symmetrically damped either side of their tuning maximum (Fig. 1A). The current view of binaural processing, based on Jeffress’s hypothesis (Jeffress 1948), posits a system of differing axonal travel times to individual coincidence-detecting SOC neurons. This results in a shift of the entire tuning curve along the ITD axis, as illustrated in Fig. 1A for several neurons with increasing best ITD (tuning maximum). There is, in principle, no limit to the shift in Jeffress-based models. The range of best ITD is, rather, determined by the range of naturally occurring ITDs, the so-called ecological range. Many physiologically recorded tuning curves, however, are not symmetric about the peak. A well-known extension of the Jeffress model therefore includes neurons innervated by excitation from one ear and inhibition from the other (e.g. Durlach 1963; Breebaart et al. 2001). This leads theoretically to a horizontal inversion of the cross-correlation function, i.e. to tuning curves symmetric about their tuning minimum, or trough. However, many recorded tuning curves still do not fit into either the “peaker” or the “trougher” category but are somewhere in between (McAlpine et al. 1996; Fitzpatrick et al. 2000). These neurons have been referred to as “tweeners” and are currently ignored in binaural models.
Ear Institute, University College London, London, UK, [email protected], [email protected]
Fig. 1 A Idealized noise ITD tuning curves of binaural neurons with various best ITDs, where the difference in ITD tuning is accomplished by axonal delay differences. Note that the envelope (dashed line) is shifted, too. B ITD tuning curves of neurons in the right IC divided into three groups of different best ITD ranges (average curve in white, its envelope dashed). C Distribution of best ITD expressed in cycles of DF (frequency normalized) and corresponding DF; neurons of right IC only (ecological range of the guinea pig in gray). D Computer simulation using a Hodgkin-Huxley model with phase-locked inhibition
2
ITD Tuning Curves of Neurons in the Guinea Pig IC
Binaural SOC neurons project directly, or indirectly via the lateral lemniscus, to the inferior colliculus (IC), and IC neurons appear to inherit their ITD tuning properties from neurons in the SOC. We present here an analysis of ITD tuning curves to wideband noise stimuli recorded from isolated neurons in the right IC of urethane-anesthetized guinea pigs. The analysis involves the fitting of these tuning curves by a sinusoid. Where multiple tuning curves were obtained for any one neuron, the tuning curve with the strongest ITD modulation was included in the analysis. The frequency of the sinusoid determines the dominant frequency (DF) of the neuron. In almost all cases, the DF closely matched the best frequency (BF) of the neuron. There was a trend in neurons with BF greater than 400 Hz for the DF to be slightly below BF, and in neurons with BF less than 400 Hz for the DF to be slightly above BF. The best ITD of the neuron was determined from the fitted sinusoid as the position of its maximum closest to the maximum of the ITD tuning curve. This way, all of the data points, not just the tuning maximum, contributed to the estimate. The ITD was then normalized by the
DF of the neuron and expressed in periods of DF. As shown in Fig. 1C, in almost all cases the maximum was within half a cycle of the DF from zero ITD. In other words, the two right-most cases of Fig. 1A, illustrating hypothetical tuning curves based on internal delays larger than half a cycle, did not exist in our sample of tuning curves. We refer to this finding as the “π-limit” of coded ITD. This boundary has been observed previously (guinea pig: McAlpine et al. 2001; cat: Hancock and Delgutte 2004; Joris et al. 2006). Note also that the distribution seems not to be related to the ±200-µs ecological range of the guinea pig (gray area). It appears that best ITDs can cover the full ITD range within the π-limit in a rather frequency-independent manner. (An explanation of the few outliers in Fig. 1C follows below.) However, the best ITDs within the π-limit are, although continuous, by no means equally distributed, and the actual density function might well be influenced by the ecological range. Figure 1B shows the shape of the tuning curves matching column-wise the best ITDs of the three left-most cases of Fig. 1A. The tuning curves have been divided into three equal ranges of best ITD. Note that this does not imply the existence of three types of neurons but, rather, a continuum of best ITDs as suggested above. From the total of 234 tuning curves, those with negative best ITDs or with tuning curves that showed obvious flooring (zero spikes or spontaneous rate) at both minima either side of the tuning maxima were not included in the analysis. Furthermore, the normalization of ITD by DF permitted the calculation of an average shape for the three groups. The average curves show a systematic change in shape with increasing best ITD, which could best be described by a constant envelope centered at zero ITD (gray dashed lines). This shift of the fine structure alone is characteristic of a phase shift, implying the existence of a phase delay rather than an axonal time-delay difference between the two ears. This phase delay could be accomplished by the action of fast inhibitory inputs onto SOC neurons, as demonstrated physiologically by Brand et al. (2002). The phase-locked inhibitory signal, slightly leading the phase-locked excitatory version of the same ear’s signal, potentially creates sensitivity to the negative slopes of the incoming signal and, like a differentiator, effectively delays the signal by a constant phase. Figure 1D shows further simulations using the Hodgkin-Huxley model employed by Brand et al. (2002). Ipsilateral excitation was adjusted to maintain average spike rate, whilst contralateral inhibition (leading by 200 µs) was increased, and contralateral excitation was decreased (left to right). The results show a clear phase shift of the simulated tuning curve resembling the averaged curves of the physiological data in Fig. 1B. Figure 1B shows that most tuning curves with best ITD close to 0.5 cycles DF have two maxima with similar height. Almost all outliers in Fig. 1C are of such “trough” shape but their maximum on the same ITD side as the trough is slightly higher resulting, by our definition of best ITD, in a best ITD larger
than 0.5 cycles DF, although their secondary maximum, within the π-limit, is only marginally smaller.
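The fitting procedure described above can be sketched as follows (Python with SciPy; a reconstruction for illustration only, not the authors' analysis code — initial values, bounds and variable names are assumptions). A sinusoid with free frequency is fitted to the noise-ITD tuning curve, the fitted maximum closest to the empirical maximum is taken as the best ITD, and the best ITD is then expressed in cycles of the dominant frequency so that the π-limit corresponds to |best ITD| ≤ 0.5 cycles.

import numpy as np
from scipy.optimize import curve_fit

def fit_itd_tuning(itd_s, rate, df_guess=500.0):
    """Return dominant frequency DF (Hz) and best ITD (s) from a noise-ITD tuning curve."""
    def model(itd, m, a, df, itd_max):
        return m + a * np.cos(2 * np.pi * df * (itd - itd_max))

    p0 = [np.mean(rate), np.std(rate), df_guess, itd_s[np.argmax(rate)]]
    bounds = ([-np.inf, 0.0, 50.0, -5e-3], [np.inf, np.inf, 2000.0, 5e-3])
    (m, a, df, itd_max), _ = curve_fit(model, itd_s, rate, p0=p0, bounds=bounds)

    # use the fitted maximum closest to the empirical maximum of the tuning curve
    period = 1.0 / df
    shift = np.round((itd_s[np.argmax(rate)] - itd_max) / period)
    return df, itd_max + shift * period

# best ITD in cycles of DF; values with |best_cycles| <= 0.5 respect the pi-limit
# df, best_itd = fit_itd_tuning(itd_s, rate)
# best_cycles = best_itd * df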
3
Implication for Binaural Models
A phase shift in ITD tuning has principal implications for binaural models. First, it provides an inherent explanation for the π-limit in the physiological data (Fig. 2A). Second, it necessitates a revision of the current definition of the interaural coherence (IAC). Inspired by Jeffress’ model, IAC is currently defined by the peak value of the cross-correlation function. Here we suggest the use of the classic, phase-based definition of coherence for IAC (Fig. 3A).
Fig. 2 A Illustration of phase-shifted ITD tuning curves limiting the range of best ITDs to half a cycle of DF (“π-limit”). B Representation of 4.5-ms-ITD wideband noise using an IPD circle (500-Hz channel). C Idealized noise ITD tuning curve of a 500-Hz neuron with a best IPD of 45° (dotted lines at positive ITDs: std. dev. due to fluctuations in IAC/IPD; at negative ITDs: due to IAC/IPD fluctuations plus spike generation (Poisson noise)). D Upper: normalized binaural component (cbinaural) over time. Lower: predicted spike rate over time, ignoring the Poisson noise (dotted: due to stimulus intensity fluctuations only)
Fig. 3 A Extraction of the three interaural parameters. B Simulation results using a noise stimulus of 0.8 IAC (500 Hz ch, 0 IPD, 30 ms Hw). Full circles: cross-power phase over time (instantaneous IPD, ∠R • L*). Marker diameter indicates cross-power magnitude. Open circles: same after smoothing by temporal integrator (IPD at model output). C Illustration of how an IAC<1 is caused by inconsistency of IPD within the binaural integration window
Figure 2B introduces a frequency-independent representation of the interaural parameters based on interaural phase difference (IPD), the IPD circle. The radial dimension shows the IAC, and the angle shows the IPD. Note that the projection onto the vertical axis (pointing towards zero IPD) corresponds to the interaural correlation coefficient. This projection also equals the intensity-normalized output of a model neuron with 0.0 best IPD. Similarly, one can read the output of neurons with other best IPDs, e.g. 45° (gray arrow). The output consists of an intensity component, the radius of the circle, which is common for neurons of any best IPD, and a binaural component (cbinaural) which is determined by the neuron’s best IPD. The binaural component is a fraction of the intensity component which can, depending on the relation between best IPD and stimulus IPD, either increase or decrease the output of the neuron. Figure 2C,D illustrate three sources of spike rate variance: 1. the intensity fluctuations (stimulus envelope), 2. the variation of the binaural component if IAC<1 (see Fig. 2D), and 3. the stimulus-independent Poisson noise of the spike generation. For binaural detection, the intensity fluctuations are irrelevant since they co-vary in all IPD channels within a frequency channel (hence the normalization in the IPD circle and in the IAC definition; see also Van de Par et al. 2001). Note that the first two will
have a strong covariance between neurons of similar frequency tuning, meaning that the d′ values derived from individual neurons cannot be regarded as independent! Figure 3 shows how the interaural stimulus parameters for the IPD circle are derived. The analytic signal from the gammatone filter output (Hohmann 2002) has complex values, and the phase difference between the left and right signals can be interpreted as the instantaneous IPD. Only the observation of this instantaneous IPD over a period of time (temporal integration window, referred to here specifically as the binaural window) allows the IAC to be defined. This temporal integration leads to a smoothing of the instantaneous IPD and models the behaviorally observed binaural “sluggishness” (Fig. 3B). Figure 3C shows how inconsistency in instantaneous IPD values within the binaural window, e.g. caused by ITD, leads to a reduction in IAC. The vectors could stem from the samples of instantaneous IPD (as in Fig. 3A), but might also be the spectral lines of a cross-spectrum computed from the left and right time signals within the binaural window.
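The extraction illustrated in Fig. 3A can be written very compactly. The sketch below (Python/NumPy/SciPy; an editorial illustration in which a Hilbert transform stands in for a complex-valued gammatone channel and a first-order low-pass stands in for the binaural window, with an assumed time constant) returns the IAC and IPD of one frequency channel as functions of time; the projection onto a detector with best IPD phi0 is then simply iac * cos(ipd - phi0), i.e. the binaural component described above.

import numpy as np
from scipy.signal import hilbert, lfilter

def interaural_parameters(left, right, fs, tau=0.03):
    """IAC and IPD of one frequency channel, using an exponential binaural window (tau in s)."""
    L, R = hilbert(left), hilbert(right)          # complex (analytic) channel signals
    a = np.exp(-1.0 / (tau * fs))                 # first-order low-pass = temporal integrator
    smooth = lambda x: lfilter([1.0 - a], [1.0, -a], x)

    cross = smooth(R * np.conj(L))                # smoothed cross-power; its angle is the IPD
    power = np.sqrt(smooth(np.abs(L) ** 2) * smooth(np.abs(R) ** 2))
    iac = np.abs(cross) / np.maximum(power, 1e-12)   # radial coordinate of the IPD circle
    ipd = np.angle(cross)                            # angular coordinate (right re left)
    return iac, ipd

# intensity-normalized output of a model neuron with best IPD phi0:
# c_binaural = iac * np.cos(ipd - phi0)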
4 Illustrations of the Interaural Parameters Using the IPD Circle The IPD circle usefully illustrates the interaural parameters of a binaural stimulus; Fig. 4 shows it for a 3-s burst of wideband noise. Interaural parameters are sampled at intervals of 100 ms, resulting in clusters of 30 points. Figure 4A illustrates how, in contrast to a stimulus IPD, an ITD causes a decrease in IAC, and consequently a variation of IPD over time. It also shows how an ITD translates into different IPDs in the different frequency channels. Figure 4B illustrates IPD/IAC variations if the stimulus IAC is decreased. An IAC of 0.99 is approximately the human detection threshold for a deviation from a reference of 1.0. The left IPD circle shows that detection is likely to be based on random departures in IPD from the zero reference rather than in the radial IAC dimension. At 0.95 IAC the cluster starts to move along the radial IAC dimension. Comparison of the middle pictures in Fig. 4A,B gives an IAC estimate of 0.95 for the stimulus with 2 ms ITD. The influence of the binaural window length, denoted as half-width (Hw), on cluster size is shown in Fig. 4C, where an uncorrelated noise stimulus is applied. The longer the extent of temporal integration, the more the instantaneous IPD samples are averaged. The cluster becomes tighter around the location determined by the stimulus parameters. Note that the cluster is compressed in the radial direction (IAC) as it approaches an IAC of 1.0, which constitutes the upper IAC boundary. In models, Fisher’s z-transform helps to transform the cluster’s projection onto any radial axis (as in Fig. 2B) into a Gaussian distribution (e.g. for covariance estimation).
Fig. 4 IPD circle representation of wide band noise stimuli with A 2 ms ITD in different frequency channels (100 ms Hw.); B various interaural coherences (500 Hz ch, 0 IPD, 100 ms Hw); C zero interaural coherence using various temporal integration times (500 Hz ch)
Acknowledgments. We thank Neil Ingham, Susan Boehnke, Dan Jiang, Alan Palmer, and John Agapiou for providing data. Work supported by Medical Research Council.
References Brand A, Behrend O, Marquardt T, McAlpine D, Grothe B (2002) Precise inhibition is essential for microsecond interaural time difference coding. Nature 417:543–547 Breebaart J, van de Par S, Kohlrausch A (2001) Binaural processing model based on contralateral inhibition. I. Model structure. J Acoust Soc Am 110:1073–1088 Durlach NI (1963) Equalization and cancellation theory of binaural masking-level differences. J Acoust Soc Am 35:1206–1218
Fitzpatrick D, Batra R, Kuwada S (2000) Neural sensitivity to interaural time differences: beyond the Jeffress model. J Neurosci 20:1605–1615 Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117 Hohmann V (2002) Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica Acust 88(3):433–442 Jeffress LA (1948) A place theory of sound localization. J Comp Physiol Psychol 41:35–39 Joris PX, van de Sande B, Louage DH, van der Heijden M (2006) Binaural and cochlear disparities. Proc Natl Acad Sci USA 103(34):12917–12922 McAlpine D, Jiang D, Palmer AR (1996) Interaural delay sensitivity and the classification of low best-frequency binaural responses in the inferior colliculus of the guinea pig. Hear Res 97:136–152 McAlpine D, Jiang D, Palmer AR (2001) A neural code for low-frequency sound localization in mammals. Nat Neurosci 4:396–401 Van de Par S, Trahiotis C, Bernstein LR (2001) A consideration of the normalisation that is typically included on correlation based models of binaural detection. J Acoust Soc Am 109:830–833 Yin TCT, Chan JCK (1990) Interaural time sensitivity in the medial superior olive of cat. J Neurophysiol 64:465–488
Comment by Carr David McAlpine asked if the chicken data follow the π-limit. This is partly a technical question, because if one measures ITD responses with a tonal stimulus at CF, and defines the peak closest to zero as the best ITD, all data fall within ±π. Where this becomes interesting is when the response minimum is close to 0 ITD. Which peak is then the best ITD? Characteristic delay measures have generally been used to resolve this issue, but we also used click responses to resolve the ambiguous cases. We have examined our data, and have four sites that show values greater than π. These are, however, only slightly beyond 0.5, with a maximal value of 0.57. Are these points significantly beyond π? Your paper (Marquardt and McAlpine, this volume, Fig. 1C) shows a few data points beyond 0.5 cycles, and McAlpine and Marquardt state that you observe no tuning for ITDs beyond approximately a half cycle of the centre frequency of each auditory filter. Thus the data from the chicken appear to conform to the π-limit about as well as the data from the guinea pig. This leads to two observations that bear on your model. First, is the π-limit result surprising? At the systems level, I would suggest not. In the barn owl, recorded best ITDs can be beyond π. In some cases, the peak nearest to 0 µs ITD is not the best ITD, which may instead be a response to contralateral ITDs within the biological range but beyond π, as shown by measures of characteristic delay, click and mean monaural phase delays. This is unlikely to be true in the chicken, because chicken head size and range of best frequencies are such that responses beyond π would not fall within the biological range. It should be noted that the range below 800 Hz has not been measured, but we assume that the biological range of ITDs would increase
with decreasing best frequency because of pressure gradient effects (Larsen et al. 2006). Second, at the cellular level, the π-limit is surprising. If there were a single mechanism underlying the π-limit, such as inhibitory synaptic delay, the presence of a π-limit in the chicken would be surprising, because avian inhibitory mechanisms differ from mammalian ones (Yang et al. 1999). My question is therefore “Is the similar π-limit in gerbil and chicken a coincidence, or does it point to an underlying similar mechanism, related to phase locking and coincidence detection?”
References Larsen ON, Dooling RJ, Michelsen A (2006) The role of pressure difference reception in the directional hearing of budgerigars (Melopsittacus undulatus). J Comp Physiol A Neuroethol Sens Neural Behav Physiol 192:1063–1072 Yang L, Monsivais P, Rubel EW (1999) The superior olivary nucleus and its influence on nucleus laminaris: a source of inhibitory feedback for coincidence detection in the avian auditory brainstem. J Neurosci 19:2313–2325
Reply Although the number of chicken neurones in your plot of best ITD vs best frequency does not allow a final conclusion, there is a strong indication of a π-limit. If the neural coding of interaural timing differences in the form of an interaural phase spectrum is an efficient solution, the finding that the bird’s evolution has also led to a π-limit might be neither surprising nor a coincidence. I showed that the shape of the noise ITD tuning curves in the guinea pig IC can be described by a phase shift which leads intrinsically to a π-limit. I wonder whether the π-limit in the chicken can be similarly explained by the shape of their noise ITD function. I modelled one possible biological implementation of a phase delay which involves precisely-timed glycinergic inhibition on MSO neurones. This is obviously not feasible in the absence of fast glycinergic inhibition in birds, and might not necessarily be the one operating in mammals. Thus, other possible underlying mechanisms need to be investigated. Since the binaural systems of mammals and birds evolved independently, the mechanisms might well be different. I believe that the distribution of best ITDs within the π-limit is likely to be influenced by the animal’s ecological range. However, the range of coded best ITDs in mammals is apparently independent of head size and covers one full IPD cycle (±π). I agree that the unique high-frequency binaural system of the barn owl is clearly different.
Answer from Carr and Köppl We do not think that the barn owl is different to the chicken, but instead has a larger head and the ability to phase-lock to higher frequencies. We predict responses beyond π whenever the ITD range extends beyond the stimulus period. Although the barn owl has an ITD range of only about 200 µs, it is able to phase lock up to about 8 kHz, and thus points in far contra- or ipsilateral space will be beyond the π-limit. The same logic suggests that a large mammal such as a tiger would show responses beyond π. Even a human might show responses beyond π for a narrow-band stimulus centered around 2 kHz.
45 Strategies for Encoding ITD in the Chicken Nucleus Laminaris CATHERINE CARR1 AND CHRISTINE KÖPPL2,3
1
Introduction
Animals, including humans, use interaural time differences (ITDs) that arise from different sound path lengths to the two ears as a cue to horizontal sound source location. The nature of the neural code for ITD is still controversial. Current models advocate either a map-like place code of ITD along an array of neurons, consistent with a large body of data in the barn owl, or a rate-based population code, consistent with data from small mammals. Recently, it was proposed that these different codes reflect an optimal coding strategy that depends on head size and sound frequency. The chicken makes an excellent test of this hypothesis because its physical features are similar to those of small mammals, yet it shares a more recent common ancestry with the owl.
1.1
Theories of ITD Coding
Theories of ITD detection have been dominated by the Jeffress model (Jeffress 1948). In this model, delay lines and coincidence detectors form a circuit for detection of ITDs. The delay line inputs synapse on coincidence detector neurons, which are organized in an array where each element has a different relative delay between its ipsilateral and contralateral excitatory inputs (Fig. 1, inset). Thus, ITD is encoded by the place or location of the coincidence detectors whose delay line inputs best cancel out the acoustic ITD (for reviews, see Joris et al. 1998; Konishi 2003). In this circuit, the temporal code, where spikes encode the phase of the sound at each ear, is transformed into a rate-place code, where position and response in the array encode sound location for each frequency band. In vivo data from the barn owl, and brain slice data from the chicken, are consistent with the Jeffress model. Inputs from the cochlear nucleus magnocellularis (NM) project to the nucleus laminaris (NL) and act as delay lines
1 Department of Biology, University of Maryland, College Park, USA, [email protected]
2 Lehrstuhl für Zoologie, Technische Universität Muenchen, Garching, Germany, [email protected]
3 Department of Physiology, University of Sydney, New South Wales, Australia
Fig. 1 Cross-section through the chicken brain with a small neurobiotin injection (arrow) in NL. Inset shows proposed circuit for ITD detection in the avian auditory brainstem
(Carr and Konishi 1990; Overholt et al. 1992), while the postsynaptic neurons act as coincidence detectors (Funabiki et al. 1998) and form a map of ITD in each frequency band. The cochlear nuclei receive descending GABAergic inputs from the superior olive that function as gain-control elements, or negative feedback, that allows NL neurons to maintain sensitivity to ITDs at high sound intensities (Pena et al. 1996). The mammalian medial superior olive had been assumed to be organized along similar lines, although evidence for maps of ITDs within each frequency band was insufficient to show that ITD was mapped within each isofrequency band (Yin and Chan 1990; Beckius et al. 1999). Studies in the guinea pig inferior colliculus cast doubt on this assumption. McAlpine et al. (2001) found a different kind of systematic relationship between sound-frequency tuning and sensitivity to interaural time delays. In the inferior colliculus, ITD-sensitive neurons with relatively low best frequencies showed response peaks at long delays, whereas neurons with higher best frequencies showed response peaks at short delays. Thus the steepest region of the function relating discharge rate to ITD fell close to the midline for all neurons. This finding led McAlpine et al. (2001) to propose that, in the guinea pig, ITD-sensitive coincidence detectors were organized into two broad, hemispheric spatial channels, rather than into a map of sound location. A subsequent test of this population code hypothesis in the gerbil medial superior olive found that the slope of the recorded ITD responses fell within the gerbil’s biological range (Brand et al. 2002). The computational utility of a slope function and population code in this system is undisputed (Shackleton et al. 2003; Hancock and Delgutte 2004), since the slope of the ITD function contains more information than the peak. In the inferior colliculus of the barn owl, circumscribed spatial receptive fields form a rate-place code for localization (Knudsen and Konishi 1978). Owls’ behavioral discrimination of spatial locations is, however, sharper than these receptive fields, and consistent with coding by receptive-field slopes (Takahashi et al. 2003). Use of a slope code does not, however, preclude the
simultaneous existence of a map of ITD, which also has computational utility (Konishi 1986). The mammalian data do not at present show whether or not the computation of ITD also leads to the formation of a map of location within each frequency band in the MSO. In order to address the proposed dichotomy between a rate-place code and a broad hemispheric representation of ITD without maps of ITD, we examined ITD coding in vivo in the chicken. These birds have a similar head size and range of ITDs to small mammals like the guinea pig and gerbil, and should therefore be subject to similar constraints. In contrast to the mammalian MSO, however, the chicken NL is organized as a largely flat monolayer of neurons, with input from the ipsilateral NM terminating upon the dorsal dendrites of NL neurons, and the input from the contralateral NM terminating upon the ventral dendrites (Fig. 1, Smith and Rubel 1979). This flat lamina organization makes it easier to test the hypothesis that ITD is mapped within an isofrequency band. Both the ipsilateral and contralateral projections from an individual NM neuron innervate an isofrequency line of cells oriented roughly caudomedial to rostrolateral along NL (Young and Rubel 1983). The terminal branching patterns of the ipsilateral and contralateral projections to these cells differ. The ipsilateral inputs form a spray of axonal branches such that the conduction time from the NM neuron does not change much along the lamina, while the contralateral NM axons show a linear increase in the latency of field potential responses from medial to lateral positions in NL (Overholt et al. 1992). This arborization pattern of the contralateral axons is consistent with these inputs acting as delay lines for auditory information as hypothesized in Jeffress’s model (Fig. 1).
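The delay-line arrangement of Fig. 1 can be caricatured in a few lines of code. The toy sketch below (Python; an editorial illustration with arbitrary numbers, using deterministic sinusoids in place of phase-locked NM spike trains) places an array of coincidence detectors along an isofrequency line, gives each a different contralateral axonal delay, and shows that the detector whose delay matches the acoustic ITD responds most strongly, i.e. a place code of ITD.

import numpy as np

fs = 100_000.0
cf = 1000.0                               # stimulus frequency / CF (Hz)
itd = 120e-6                              # contralateral-leading stimulus ITD (s)
t = np.arange(int(0.2 * fs)) / fs
tone = lambda delay: np.sin(2 * np.pi * cf * (t - delay))

ipsi = tone(0.0)                          # ipsilateral input: negligible extra axonal delay
axonal = np.arange(0, 201, 20) * 1e-6     # contralateral axonal delays along the NL array

# The contralateral ear hears the sound itd seconds earlier; that input then accrues
# the axonal delay d before reaching each detector in the array.
rate = np.array([np.mean(ipsi * tone(d - itd)) for d in axonal])
best = axonal[np.argmax(rate)]
print(f"strongest response at axonal delay {best * 1e6:.0f} us "
      f"(matches the {itd * 1e6:.0f}-us ITD)")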
2
Methods
Results are reported from 15 chickens, aged P15–47. Animal husbandry and experimental protocols were approved by the Regierung von Oberbayern (AZ 209.1/211-2531-56/04). Anesthesia was induced by injections of ketamine and xylazine until the transition to isoflurane. Body temperature was maintained at 41°C. Chickens were placed in a sound-attenuating chamber, and closed, custom-made sound systems containing small earphones and miniature microphones were placed at the entrance of both ear canals. The sound systems were calibrated before the recordings. Stimuli were tone bursts of 50 ms duration. Monaural frequency-vs-level responses for both ipsi- and contralateral stimulation were recorded first, by presenting tones from a matrix of frequencies and sound pressure levels in random sequence. Characteristic frequencies (CF) were used to test ITD selectivity. For ITD curves, the mean spike rate was used for single units. For neurophonic recordings, the averaged analog waveform was fitted with a cosine function at the stimulus frequency and the amplitude of this fit used as the response
amplitude at a particular ITD. All ITD curves were fitted with a cosine function to determine best ITD, the peak closest to zero ITD. Monaural best phase and click delay were also measured. Recordings were made with glass electrodes filled with 5% neurobiotin in K-acetate. At selected recording sites, neurobiotin was deposited iontophoretically. Single-unit recordings in NL and its mammalian analog, the medial superior olive, are difficult to achieve in vivo. Because we combined electrophysiological characterization with histological verification of recording sites, we used data from both single-unit and neurophonic-potential recordings. The neurophonic is a sinusoidal evoked potential reflecting the frequency of the pure-tone stimulus. It is more easily and stably recorded than single units. The amplitude of the neurophonic potential fell sharply with increasing distance from the cellular monolayer of NL, and the maximal neurophonic amplitude could usually be determined to within 50 µm. Recordings and neurobiotin injections were made at the maximum neurophonic amplitude. In many cases, 1–3 labeled cell bodies were later seen in the histological sections. Chickens were fixed by perfusion with 4% paraformaldehyde, after which the brains were sectioned and the neurobiotin visualized using standard protocols. All sections containing NL were identified, and the nucleus mapped using image analysis software (AnalySIS by Soft Imaging Software, Münster, Germany). These data were used to construct a surface view of NL and determine the position of the label in a normalized map of NL.
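As an illustration of the two cosine fits just described (Python with SciPy; a schematic reconstruction, not the analysis code actually used, and all parameter choices are assumptions), the first function extracts the neurophonic response amplitude at the stimulus frequency, and the second fits a cosine with the stimulus period to the resulting ITD curve and returns the peak closest to zero ITD as the best ITD.

import numpy as np
from scipy.optimize import curve_fit

def neurophonic_amplitude(waveform, fs, f_stim):
    """Amplitude of a cosine at the stimulus frequency fitted to the averaged waveform."""
    t = np.arange(len(waveform)) / fs
    X = np.column_stack([np.cos(2 * np.pi * f_stim * t), np.sin(2 * np.pi * f_stim * t)])
    coef, *_ = np.linalg.lstsq(X, waveform, rcond=None)
    return float(np.hypot(coef[0], coef[1]))

def best_itd(itds_s, amplitudes, f_stim):
    """Best ITD: maximum of a fitted cosine (period = stimulus period) closest to zero ITD."""
    itds_s, amplitudes = np.asarray(itds_s), np.asarray(amplitudes)
    model = lambda itd, m, a, phi: m + a * np.cos(2 * np.pi * f_stim * itd - phi)
    p0 = [amplitudes.mean(), np.ptp(amplitudes), 0.0]
    (m, a, phi), _ = curve_fit(model, itds_s, amplitudes, p0=p0,
                               bounds=([-np.inf, 0.0, -np.pi], [np.inf, np.inf, np.pi]))
    period = 1.0 / f_stim
    peak = phi / (2 * np.pi * f_stim)               # one maximum of the fitted cosine
    return (peak + period / 2) % period - period / 2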
3
Maps of Interaural Time Difference in Nucleus Laminaris
The brainstem nucleus laminaris in the chicken displays the major features of a rate-place code of ITD. We showed that best ITD varied in a fashion consistent with the presence of a map, by recording where possible from pairs of locations within NL. Usually, one recording was made close to the medial edge of NL, and the second was made more laterally. If the NM inputs act as delay lines to form a map of interaural time differences within NL, cells in the first location should respond to sound from in front, with similar ipsi- and contralateral delays, while the more lateral locations should map sound from the contralateral hemifield. We found that the physiological range of ITDs is systematically represented in the maximal responses of neurons along each isofrequency band.
3.1
Responses in Nucleus Laminaris Encode ITD
We characterized and labeled 22 recording sites in the chicken NL, including 4 single units and 18 neurophonic recordings. All recordings showed selectivity for ITD, in the form of cyclic changes of neurophonic amplitude or spike rate with variations in ITD (Fig. 2). The cycle period corresponded to the period of the stimulus, which was chosen to be close to CF. CFs ranged
Fig. 2 Neurophonic amplitude at 2500 Hz. A cosine fit yielded a best ITD of −26 µs
from 80 Hz to 3250 Hz. The peak of the cyclic function closest to an ITD of zero was taken as the best ITD, with positive values indicating contralateral leading and negative values indicating ipsilateral leading, unless monaural click responses predicted a different laterality.
3.2
ITD Was Systematically Mapped Within Nucleus Laminaris
The chicken NL is tonotopically organized, with the lowest frequencies represented caudolaterally and the highest rostromedially (Rubel and Parks 1975). Accordingly, isofrequency bands run from caudomedial to rostrolateral. Best ITD was systematically related to position along the isofrequency axis and shifted towards increasingly contralateral values when moving rostrolaterally within NL (Fig. 3). Best ITDs were generally evenly distributed over about ±160 µs, which corresponds to the biological range of ITDs available to the chicken above 1 kHz (Hyson et al. 1994). In all cases, neurons with best ITDs close to 0 were located medially within NL, while units sensitive to sound from the contralateral hemifield were located laterally. Thus, the chicken NL conforms to the requirements of the Jeffress model.
3.3
Delay Lines Form Maps of ITD in Nucleus Laminaris
Laminaris neurons phase-lock to the auditory stimulus, and are driven by both monaural and binaural stimulation. The phase of each monaural response reflects the delay between the ear and the recording point. When an
Fig. 3 Best ITD of all histologically verified recording sites as a function of their position along the isofrequency axis. Data are divided by frequency, and one point with an ITD of −770 µs is not shown. Dashed lines show the biological range of ITDs at frequencies above 1 kHz (Hyson et al. 1994)
Fig. 4 Click delays predict best ITD. For neurophonic recordings, high-pass filtered click-evoked responses were fit with a gammatone function (see Wagner et al. 2005)
equal but opposite interaural delay nullifies this internal delay, the two phase-locked responses coincide to produce the large binaural response that characterizes coincidence detection. In ITD-coding circuits based on coincidence detection, the phase difference between the monaural phase-locked responses should equal the best ITD. This prediction holds for recordings in the chicken NL, where differences in ipsi- and contralateral click delays predict best ITD (Fig. 4). Best ITD and the ITD predicted by the click delay were
generally at or contralateral to zero ITD, i.e. predominantly in the contralateral sound field. Delay differences between the two ears were generally such that, to bring the peaks into coincidence, the stimulus to the ipsilateral ear had to be delayed relative to that at the contralateral ear.
4
Conclusions
Harper and McAlpine (2004) proposed that an optimal coding strategy for animals with small heads and sensitivity to low frequencies would be two broadly tuned hemispheric channels sensitive to ITD. The results from the chicken NL do not support this hypothesis. Although the chicken shares a similar head size and frequency sensitivity with the gerbil and guinea pig (Hyson et al. 1994), recordings from the chicken show that best ITDs from frontal space are mapped medially within NL, while best ITDs from contralateral space are mapped laterally. Furthermore, differences between ipsi- and contralateral click delays predict best ITD. The presence of maps of ITD in both the chicken and the barn owl NL further suggests that evolutionary history influences coding strategy. The ancestors of birds and mammals appear to have evolved sensitivity to airborne sound independently (Grothe et al. 2005). It is possible that modern birds and mammals differ in the organization of their brainstem centers for ITD detection, and that ITD is not mapped within the medial superior olive. Recordings from the guinea pig inferior colliculus, however, show a range of best ITDs (McAlpine et al. 2001; Sterbing et al. 2003), and the existence of a place map within the medial superior olive cannot be ruled out. Acknowledgments. We are grateful to Mark Konishi for the generous gift of the software “xdphys”, custom-written in his lab and used in our experiments. We also thank Jose-Luis Peña, Richard Kempter and Katrin Vondershen for data analysis routines and Birgit Seibel for excellent technical assistance with the histology. Supported by a Humboldt Research Award to CEC, DFG grant KO 1143/12-2 to CK and NIH grant 000436 to CEC.
References Beckius GE, Batra R, Oliver DL (1999) Axons from AVCN that terminate in medial superior olive of cat: observations related to delay lines. J Neurosci 19:3146–3161 Brand A, Behrend O, Marquardt T, McAlpine D, Grothe B (2002) Precise inhibition is essential for microsecond interaural time difference coding. Nature 417:543–547 Carr CE, Konishi M (1990) A circuit for detection of interaural time differences in the brainstem of the barn owl. J. Neurosci 10:3227–3246 Funabiki K, Koyano K, Ohmori H (1998) The role of GABAergic inputs for coincidence detection in the neurones of nucleus laminaris of the chick. J Physiol 508:851–869 Grothe B, Carr CE, Casseday JH, Fritzsch B, Köppl C (2005) The evolution of central pathways and their neural processing patterns. In: Manley GA, Popper AN, Fay RR (eds) Evolution of the vertebrate auditory system. Springer, Berlin pp 289–359
Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117 Harper NS, McAlpine D (2004) Optimal neural population coding of an auditory spatial cue. Nature 430:682–686 Hyson RL, Overholt EM, Lippe WR (1994) Cochlear microphonic measurements of interaural time differences in the chick. Hear Res 81:109–118 Jeffress L (1948) A place theory of sound localization. J Comp Physiol Psychol 41:35–39 Joris PX, Smith PH, Yin TC (1998) Coincidence detection in the auditory system: 50 years after Jeffress. Neuron 21:1235–1238 Knudsen EI, Konishi M (1978) A neural map of auditory space in the owl. Science 200:795–797 Konishi M (1986) Centrally synthesized maps of sensory space. TINS 9:163-168 Konishi M (2003) Coding of auditory space. Annu Rev Neurosci 26:31-55. McAlpine D, Jiang D, Palmer AR (2001) A neural code for low-frequency sound localization in mammals. Nat Neurosci 4:396–401 Overholt EM, Rubel EW, Hyson RL (1992) A circuit for coding interaural time differences in the chick brainstem. J Neurosci 12:1698–1708 Pena JL, Viete S, Albeck Y, Konishi M (1996) Tolerance to sound intensity of binaural coincidence detection in the nucleus laminaris of the owl. J Neurosci 16:7046–7054 Rubel EW, Parks TN (1975) Organization and development of brainstem auditory nuclei of the chicken: tonotopic organization of NM and NL. J Comp Neurol 164:411–414 Shackleton TM, Skottun BC, Arnott RH, Palmer AR (2003) Interaural time difference discrimination thresholds for single neurons in the inferior colliculus of Guinea pigs. J Neurosci 23:716–724 Smith ZDJ, Rubel EW (1979) Organization and development of brain stem auditory nuclei of the chicken: dendritic gradients in nucleus laminaris. J Comp Neurol 186:213–239 Sterbing SJ, Hartung K, Hoffmann KP (2003) Spatial tuning to virtual sounds in the inferior colliculus of the guinea pig. J Neurophysiol 90:2648–2659 Takahashi TT, Bala AD, Spitzer MW, Euston DR, Spezio ML, Keller CH (2003) The synthesis and use of the owl’s auditory space map. Biol Cybern 89:378–387 Wagner H, Brill S, Kempter R, Carr CE (2005) Microsecond precision of phase delay in the auditory system of the barn owl. J Neurophysiol 94:1655–1658 Yin TC, Chan JC (1990) Interaural time sensitivity in medial superior olive of cat. J Neurophysiol 64:465–488 Young SR, Rubel EW (1983) Frequency-specific projections of individual neurons in chick brainstem auditory nuclei. J Neurosci 7:1373–1378
46 Interaural Level Difference Discrimination Thresholds and Virtual Acoustic Space Minimum Audible Angles for Single Neurons in the Lateral Superior Olive DANIEL J. TOLLIN
1
Introduction
It has long been hypothesized that the lateral superior olive (LSO) is where the interaural level difference (ILD) cue to sound source location is first processed in a functionally meaningful way (see Tollin 2003). First, LSO neurons are some of the most peripheral neurons where inputs from both ears converge; they receive an excitatory input from the ipsilateral ear but an inhibitory input from the contralateral ear via the medial nucleus of the trapezoid body. Second, LSO neurons are sensitive to ILDs of stimuli presented to the two ears. Third, horizontal localization, where ILDs (and interaural time differences) are used, is impaired in animals in which the input pathways to the LSO, or the LSO neurons themselves, were lesioned. Finally, mammals that use predominantly high-frequency ILD cues for localization tend to have a large and well-developed LSO. These data together implicate the LSO as critical for encoding ILDs. Behaviorally, humans and cats can discriminate changes in the azimuth of broadband noise, the minimum audible angle (MAA), of 1° and 3–6°, respectively. Both can discriminate 1-dB ILDs in tones presented over headphones. While there is little doubt that ILDs are first processed in the LSO, it is unknown whether the responses of individual LSO neurons can signal changes in ILD or azimuth with a resolution comparable to psychophysics or whether further processing across neurons is required.
2
Methods
Details of the physiological methods can be found in Tollin and Yin (2002). Briefly, extracellular responses of well-isolated LSO neurons were recorded in barbiturate-anesthetized cats. Electrode penetrations through LSO were verified histologically. Neurons were studied under two conditions. First, ILD sensitivity was examined using 300-ms duration characteristic frequency (CF)
Department of Physiology and Biophysics, University of Colorado Health Sciences Center, Aurora, CO, USA, [email protected]
tones by holding the signal level to the ipsilateral ear (20 dB re: threshold) constant and varying the signal level at the contralateral ear in 5-dB steps from ~25 dB below to 25 dB above ipsilateral threshold. Second, the virtual acoustic space technique was used to manipulate the azimuth of a 200-ms “frozen” noise that was filtered by the head-related transfer functions of a standard cat (Tollin and Yin 2002). The range of azimuths was ±90° along the horizontal plane in the frontal hemisphere in 9° steps. In both conditions, mean rate and SD over the stimulus duration were computed from 20 repetitions. The ability of LSO neurons to signal changes in ILD or azimuth via changes in discharge rate was examined. To facilitate the analyses, descriptive functions were fitted to the data for each neuron. Rate-ILD data were fitted with a four-parameter sigmoid, rate(ILD) = yo + a/(1 + exp(−(ILD − ILDo)/b)) (e.g., Fig. 2A), and rate-azimuth data were fitted with a five-parameter Gaussian, rate(az) = yo + a·exp(−0.5(|az − azo|/b)^c) (Fig. 4A). These equations have no functional significance; they simply allowed the determination of a rate for any arbitrary ILD or azimuth, in steps of 0.1 (dB or degrees). These functions were also used to determine the SD. First, the empirical variance (e.g., SD^2) was characterized by a two-parameter power function of rate, var(rate) = a·rate^b. To simplify the modeling, instead of using data for each neuron, a single power function was fit to the response data from the population of neurons, separately for tone and for noise stimulation (Fig. 1). The variance was computed for each neuron by inputting to the power function the appropriate sigmoid or Gaussian rate function; SD was computed from the square root of the variance. For each LSO neuron, the standard separation, D (Sakitt 1973), was used to compute the smallest increment in ILD or azimuth (az) just necessary to discriminate that increment, based on the change in rate and response variability:
D(∆ILD) = [r(ILD + ∆ILD) − r(ILD)] / √[δ(ILD + ∆ILD) · δ(ILD)]    (1)
where r(ILD) refers to the rate and δ(ILD) to the SD at the base ILD (or az), and r(ILD + ∆ILD) and δ(ILD + ∆ILD) represent the rate and SD after an ILD (or az) increment. At each base ILD (or az), ILD (or az) was incremented and decremented until D reached 1.0 or −1.0. The increment or decrement that first resulted in a D of ±1.0 was taken as the threshold ILD (or az) at that base value. Figure 2B shows for one neuron how D changed as ILD was varied from a base of 0 dB. A D of −1 was first reached for an increment of 1.0 dB, which was taken as the threshold at that base ILD. Base ILD (or az) was then changed and the process repeated. Figure 2C shows for the same neuron the ILD thresholds as a function of base ILD, from which three values were computed (Fig. 2D): threshold ILD (or az) at 0 dB ILD (or 0°), minimum threshold ILD (or az), and the base ILD (or az) at which this minimum occurred.
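For illustration, the threshold procedure described above can be summarized in a short numerical sketch. This is a reconstruction rather than the analysis code used in the study: the sigmoid and power-function parameter values, the search limit and all names are hypothetical; only the small step size and the |D| = 1 stopping rule follow the text.

import numpy as np

# Four-parameter sigmoid fitted to the rate-ILD data (illustrative parameters).
def rate_sigmoid(ild, y0=5.0, a=40.0, ild0=-2.0, b=4.0):
    return y0 + a / (1.0 + np.exp(-(ild - ild0) / b))

# SD from the population power function var(rate) = a * rate**b (illustrative a, b).
def sd_from_rate(rate, a=1.2, b=1.1):
    return np.sqrt(a * np.power(rate, b))

# Standard separation D (Sakitt 1973): rate change over the geometric mean of the SDs.
def standard_separation(base, delta):
    r0, r1 = rate_sigmoid(base), rate_sigmoid(base + delta)
    return (r1 - r0) / np.sqrt(sd_from_rate(r0) * sd_from_rate(r1))

# Smallest |ILD change| (dB) for which |D| first reaches 1, at a given base ILD.
def threshold_at_base(base, step=0.1, max_delta=50.0):
    best = np.inf
    for sign in (1.0, -1.0):          # increment and decrement from the base
        delta = step
        while delta <= max_delta:
            if abs(standard_separation(base, sign * delta)) >= 1.0:
                best = min(best, delta)
                break
            delta += step
    return best

bases = np.arange(-25.0, 25.0 + 0.1, 0.1)
thresholds = np.array([threshold_at_base(b) for b in bases])
print("threshold at 0 dB base: %.1f dB" % threshold_at_base(0.0))
print("minimum threshold: %.1f dB at base %.1f dB"
      % (thresholds.min(), bases[thresholds.argmin()]))

The population "lower envelope" plotted in Fig. 3B then corresponds to taking the minimum of such thresholds across neurons at each base ILD.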
Fig. 1 Response variance is a power function of discharge rate. Stimuli were: A CF tones; B broadband noise filtered by HRTFs. N indicates the number of neural responses used
3
Results
Results are based on 45 high-CF (>3 kHz) LSO neurons. Only one side of the brain is modeled in this chapter.
3.1
Response Variance is Characterized by a Power Function of Discharge Rate
Figure 1 shows a scatterplot of discharge rate and response variance for the population of LSO neurons. A power function provided a good fit for both tone (r2 = 0.9) and noise stimuli (r2 = 0.88), whereas a linear function did not (r2 = 0.5 and 0.51 for tones and noise, respectively). The power functions for the population of neurons shown in Fig. 1 were similar to those for the individual neurons themselves (e.g., insets, Figs. 2A and 4A), so the SDs were also well
Fig. 2 A Mean rate (filled circles) and SD (open circles) as a function of ILD for one LSO neuron. Four-parameter sigmoid fit to the rate (black line, r2 = 0.99) and SD computed from the power function in Fig. 1A (grey line). Inset: variance vs rate for this neuron and for the population (Fig. 1A) completely overlap. B Standard separation, D, as a function of ∆ILD at a 0-dB base. Threshold is the smallest ∆ILD that first yields a D of ±1. C Threshold as a function of base ILD. D Three threshold metrics
modeled by this simplification (e.g., grey lines, Figs. 2A and 4A). Use of each neuron's own variance-rate relationship did not alter the data presented in this chapter.
3.2
Neural and Psychophysical Threshold ILDs
Figure 2 shows an example of threshold ILDs for an individual neuron (see Methods). For the population of neurons, the sigmoid function provided a good fit to the rate-ILD functions (mean r2 = 0.98 ± 0.035, n = 32 neurons). SD was computed as described above. The relationship between variance and rate for this neuron (inset, Fig. 2A) was virtually the same as that for the population of neurons in Fig. 1A. The power fit to the individual data completely overlapped the population function; this was generally true for all neurons tested with tones. The neuron in Fig. 2 yielded a threshold ILD of 1 dB at a base of 0 dB. The minimum threshold ILD was 0.9 dB at a base of −3.6 dB.
In general, the base ILD at which the minimum occurred did not always correspond to the steepest portion of the rate-ILD function but rather tended to occur nearer to the positive inflection point. Figure 3A plots the minimum thresholds and the threshold ILDs computed at a base of 0 dB as a function of the CF of 32 neurons. The thresholds at 0 dB ILD (range = 1–16.6 dB, mean = 5.2 ± 3.46 dB) were significantly larger than the minimum (range 0.8–13.6 dB, mean = 3.1 ± 2.38 dB) [paired t(31) = 3.89, p = 0.0005]. There was no obvious trend in the thresholds with CF. The psychophysical threshold ILD of ~1.5 dB for cats, measured at a base of 0 dB, is also shown (Fig. 3A, horizontal line, Wakeford and Robinson 1974).
Fig. 3 A Minimum and threshold ILDs at 0 dB as a function of CF. Behavioral threshold for cats (horizontal line). B Minimum thresholds for each neuron as a function of the base ILD where they occurred (circles). Minimum population thresholds (line)
Figure 3B shows the minimum threshold ILD for each neuron as a function of the base ILD at which it occurred. Across the population, the minimum thresholds did not necessarily occur at 0 dB, but rather at an ILD slightly favoring the ipsilateral ear. However, the mean base ILD (−1.6 ± 13.2 dB, n = 32) corresponding to the minimum threshold was not significantly different from 0 dB [t(31) = −0.69, p > 0.05]. The minimum thresholds computed across the population as a function of base ILD (Fig. 3B, line) were between 1 and 1.5 dB and were virtually invariant to changes in base ILD over ±25 dB.
3.3
Neural and Psychophysical Minimum Audible Angles
Figure 4 shows a characteristic example of threshold azimuths for one neuron. For the population, the Gaussian function provided good descriptions of the rate-azimuth data (mean r2 = 0.975 ± 0.037, n = 32 neurons). The relationship between response variance and rate for this neuron (inset, Fig. 4A) was
Fig. 4 A,B Example of threshold azimuth computation. Same as in Fig. 2, but for virtual space azimuth and different neuron
virtually the same as that for the population of neurons (Fig. 1B). This was generally true of all neurons tested with the virtual space stimuli; hence, the modeled SD (grey line) provided a good description of the empirical SD. This neuron yielded a minimum threshold of 4.9° at −5°. A threshold of 4.9° was also obtained at 0°. Figure 5A plots threshold azimuths at 0° as well as the minimum threshold as a function of CF for 30 neurons. Threshold azimuths at 0° (range 2.6–21.5°, mean = 7.9 ± 5.25°) were significantly larger than the minimum (range 2.1–17°, mean = 5.7 ± 3.67°) [paired t(29) = 3.1, p = 0.004]. The behavioral MAAs of cats from three studies (Heffner and Heffner 1988; Martin and Webster 1987; Huang and May 1996) for noise stimuli with a base of 0° ranged from 3° to 6° (Fig. 5A, right). Figure 5B shows the minimum threshold azimuth for each neuron as a function of the base azimuth at which it occurred. Across the population, the minimum thresholds occurred at base azimuths slightly favoring the ipsilateral
Fig. 5 A Minimum (triangles) and threshold azimuths computed at a base of 0° (circles) as a function of CF. Behavioral MAAs of cats (right). B Minimum threshold azimuths as a function of the base azimuth at which they occurred (open circles). Minimum thresholds computed across the population of neurons (line). Behavioral MAAs from cats (filled circles) and humans (squares)
ear; the mean base azimuth (−7.7 ± 15.25°, n = 30) corresponding to the minimum threshold was significantly different from 0° [t(29) = −2.75, p = 0.008]. The solid line shows the minimum threshold azimuth computed across the population of neurons as a function of base azimuth. As base azimuth was moved contralateral from 0°, where the population minimum threshold was 2.6°, thresholds increased substantially. But when moved ipsilateral from 0°, thresholds dipped to a minimum of 2.1° at a base of −11.9°. Both human and cat MAAs for noise stimuli (Heffner and Heffner 1988) are shown as a function of base azimuth.
4
Discussion
These are the first estimates of the abilities of auditory neurons to discriminate changes in the spatial locations of sounds and changes in the ILD cue to location based on discharge rate. The lowest ILD and azimuth thresholds of 0.8 dB and 2.1°, respectively, are comparable not only to cat psychophysical thresholds but also to human thresholds. Even more remarkable is the fact that the data were obtained from one of the most peripheral stages of binaural interaction in the auditory system, the LSO. The data indicate that there is sufficient information in the discharge rates of some LSO neurons to permit discrimination of azimuth and ILD at psychophysical levels without having to pool information across neurons. While such selectivity is important, it does not by itself establish a specific role for the LSO in the extraction and encoding of ILDs. However, in combination with the experimental evidence from neurophysiology, behavioral psychology, and comparative neuroanatomy cited in the Introduction, the demonstration that individual LSO neurons can discriminate ILD and azimuth with a resolution comparable to psychophysical abilities strongly reinforces the hypothesis that the functional role of the LSO is to encode the ILD cue to sound source location. Threshold ILDs and azimuths of some individual neurons were comparable to human and cat psychophysics, even at a base of 0 dB and 0°, respectively. Human threshold ILDs have been shown to vary little with tone frequency (Grantham 1984) or with changes of base ILD over ±24 dB (Hafter et al. 1977). The neural data in Fig. 3 also show relative invariance to stimulus frequency and base ILD over a range of ±25 dB, provided that the best thresholds are allowed to be taken from different neurons. Indeed, the insensitivity of threshold ILDs to changes in these parameters occurs because different neurons have their regions of best acuity at different frequencies and base ILDs. A similar result was found for threshold azimuths (Fig. 5); the best thresholds across the population were consistent with psychophysical data for azimuths near and slightly contralateral to the midline, but failed to show the systematic increase in threshold for large base azimuths. In general, these data are consistent with the “lower envelope” hypothesis, which states that
psychophysical performance is limited by the most sensitive neurons (Parker and Newsome 1998). On this point, our results are in agreement with observations that interaural time difference (ITD) discrimination thresholds of the most sensitive neurons in the inferior colliculus are also comparable to human psychophysical abilities (Shackleton et al. 2003). However, Hancock and Delgutte (2004) reported that the increases in ITD thresholds for noise with increases in base ITD exhibited by human observers could not be accounted for by the most sensitive neurons, but could be modeled by pooling across neurons. Pooling across LSO neurons may likewise be needed to account for the general increase in human and cat MAAs with increasing base azimuth shown in Fig. 5B.
Acknowledgements. Supported by NIH NIDCD grants DC006865 to DJT and DC00116 and DC02840 to Tom C.T. Yin.
References
Grantham DW (1984) Interaural intensity discrimination: insensitivity at 1000 Hz. J Acoust Soc Am 75:1191–1194
Hafter ER, Dye RH, Nuetzel JM, Aronow H (1977) Difference thresholds for interaural intensity. J Acoust Soc Am 61:829–834
Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117
Heffner RS, Heffner HE (1988) Sound localization acuity in the cat: effect of azimuth, signal duration, and test procedure. Hear Res 36:221–232
Huang AY, May BJ (1996) Spectral cues for sound localization in cats: effects of frequency on minimum audible angles in the median and horizontal planes. J Acoust Soc Am 100:2341–2348
Martin RL, Webster WR (1987) The auditory spatial acuity of the domestic cat in the interaural horizontal and median vertical planes. Hear Res 30:239–252
Parker AJ, Newsome WT (1998) Sense and the single neuron: probing the physiology of perception. Ann Rev Neurosci 21:227–277
Sakitt B (1973) Indices of discriminability. Nature 241:133–134
Shackleton TM, Skottun BC, Arnott RH, Palmer AR (2003) Interaural time difference discrimination thresholds for single neurons in the inferior colliculus of guinea pigs. J Neurosci 23:716–724
Tollin DJ (2003) The lateral superior olive: a functional role in sound source localization. Neuroscientist 9:127–143
Tollin DJ, Yin TCT (2002) The coding of spatial location by single units in the lateral superior olive of the cat. I. Spatial receptive fields in azimuth. J Neurosci 22:1454–1467
Wakeford OS, Robinson DE (1974) Lateralization of tonal stimuli by the cat. J Acoust Soc Am 55:649–652
47 Responses in Inferior Colliculus to Dichotic Harmonic Stimuli: The Binaural Integration of Pitch Cues TREVOR M. SHACKLETON1, LIANG-FA LIU2, AND ALAN R. PALMER1
1
Introduction
Humans have the ability to integrate harmonics which alternate between the ears into a single percept with a pitch corresponding either to the fundamental of the entire complex or to the spacing of the harmonics in each ear, depending upon the resolvability of the harmonics (Bernstein and Oxenham 2003; Houtsma and Goldstein 1972). Additionally, it is possible to binaurally extract a pitch from dichotic stimuli which possess no pitch when the signal to either ear is presented alone (e.g. Bilsen et al. 1998; Cramer and Huggins 1958; Culling et al. 1998; Fourcin 1970, amongst many others). These two facts suggest that pitch percepts are generated after integration of the information from both ears (cf. Bilsen et al. 1998; Zurek 1979). In the auditory system, whilst there is some binaural processing in the cochlear nucleus (e.g. Mast 1970), the first major site of binaural processing is the superior olivary complex (e.g. Goldberg and Brown 1969). The inferior colliculus is an obligatory ascending relay from these nuclei, as well as being a de novo site of binaural convergence (see, e.g. Winer and Schreiner 2005 for reviews). In the experiments reported here we determined the extent to which pitch cues from each of the two ears are combined at the level of the inferior colliculus.
2
Methods
Details of the methods have been previously published (see Shackleton et al. 2003 for details). All experiments were performed in accordance with the United Kingdom Animals (Scientific Procedures) Act of 1986. Briefly, recordings were made in the right inferior colliculus (IC) of pigmented guinea pigs
1 MRC Institute of Hearing Research, University Park, Nottingham, NG7 2RD, UK, [email protected], [email protected]
2 Department of Otolaryngology, Head and Neck Surgery, Chinese PLA General Hospital, 28 Fuxing Road, Beijing, P.R. China 100853, [email protected]
anesthetized with urethane and Hypnorm (Janssen, High Wycombe, UK), supplemented with Hypnorm on indication by pedal withdrawal reflex. A premedication of atropine sulphate reduced bronchial secretions. The animals were placed inside a sound attenuating room in a stereotaxic frame in which hollow plastic specula replaced the ear bars to allow sound presentation and direct visualization of the tympanic membrane. A craniotomy was performed over the right IC, the dura reflected and recordings made using a linear array of eight glass-insulated tungsten electrodes (Bullock et al. 1988) nominally spaced at 200 µm, advanced through the intact cortex by a piezoelectric motor (Burleigh Inchworm IW-700/710, Scientifica, Uckfield, UK). Extracellular action potentials were amplified and filtered between 300 Hz and 3 kHz (RA16AC, RA16PA, 4xRA16BA, Tucker-Davis Technologies, Alachua, FL). Spike sorting software (Brainware, v7.43, Jan Schnupp, Oxford University) was used to separate the responses into single units and unit-clusters. Stimuli were delivered to each ear through sealed acoustic systems comprising custom-modified tweeters (Radioshack 40-1377; M. Ravicz, Eaton Peabody Laboratory, Boston, MA), which fitted into the hollow speculum. The output was calibrated a few millimeters from the tympanic membrane using a microphone fitted with a calibrated probe tube. Stimuli were digitally synthesized (RP2.1, Tucker-Davis Technologies) at a 50 kHz sampling rate and were output through 24-bit sigma-delta converters and a waveform reconstruction filter set at 40 kHz (135 dB/octave elliptic: Kemo 1608/500/01 modules supported by custom electronics). Stimuli were of 100 ms duration, switched on and off simultaneously in the two ears with cosine-squared gates with 2-ms rise/fall times (10% to 90%). A response area was first obtained using tonal stimuli (0 to 100 dB in 5-dB steps, 200 Hz to 20 kHz in 0.1-octave steps) followed by a rate vs level function (0 to 100 dB in 5-dB steps) for harmonic complexes with 100 Hz fundamental frequency, presented to the left, right and both ears. Once these preliminary data had been obtained, the main experiment was run in a single block comprising 100 repeats of all 7 fundamental frequency and 8 condition combinations in random order. Stimuli consisted of harmonic series containing all of the harmonics up to 10 kHz with fundamental frequencies (F0) from 50 Hz to 400 Hz in half-octave steps. The level of individual harmonics was 50 dB SPL. Eight conditions were presented: 1) all harmonics in the left ear; 2) all harmonics in the right ear; 3) all harmonics in both ears; 4) even harmonics in the left ear and odd harmonics in the right; and 5) odd harmonics in the left ear and even harmonics in the right. The final three conditions comprised alternating phase harmonics and will not be considered here. Data were analysed using the peri-stimulus time histograms (PSTHs), the Fourier transform of the PSTH, and the period histogram synchronized to F0. Of particular interest were the response rate averaged over the stimulus duration and the rate synchronized to either F0 or 2F0 (obtained from the Fourier-transformed PSTH). Vector strength was also obtained from the period histogram using the method of Goldberg and Brown (1969).
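As an informal illustration of this analysis, the sketch below computes the rate synchronized to F0 and to 2F0 from the Fourier transform of a PSTH, and the vector strength obtained by normalizing the spectrum by its DC component. It is not the analysis software used in the study; the bin width, repeat count and the synthetic spike train (here phase-locked at 2F0, as in the dichotic conditions) are invented for the example.

import numpy as np

f0 = 100.0           # fundamental frequency (Hz)
dur = 0.1            # 100-ms stimulus
bin_width = 0.5e-3   # 0.5-ms PSTH bins (illustrative)
n_reps = 100

# Hypothetical spike times pooled over repeats, thinned so that firing
# probability follows a modulation at 2*F0.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, dur, 2000))
spikes = t[rng.random(t.size) < 0.5 * (1.0 + np.cos(2.0 * np.pi * 2.0 * f0 * t))]

# PSTH and its Fourier transform.
edges = np.arange(0.0, dur + bin_width, bin_width)
psth, _ = np.histogram(spikes, bins=edges)
spectrum = np.fft.rfft(psth)
freqs = np.fft.rfftfreq(psth.size, d=bin_width)

def sync_rate(f):
    # Spectral magnitude at f expressed per stimulus (convention illustrative).
    k = np.argmin(np.abs(freqs - f))
    return np.abs(spectrum[k]) / n_reps

def vector_strength(f):
    # Normalizing by the DC component gives the vector strength at f.
    k = np.argmin(np.abs(freqs - f))
    return np.abs(spectrum[k]) / np.abs(spectrum[0])

print("sync to F0 : %.2f spikes/stimulus, VS = %.2f" % (sync_rate(f0), vector_strength(f0)))
print("sync to 2F0: %.2f spikes/stimulus, VS = %.2f" % (sync_rate(2 * f0), vector_strength(2 * f0)))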
3
Results
Responses were recorded from 95 single and multi-unit clusters with characteristic frequencies (CFs) between 0.2 and 15 kHz. Qualitatively, the responses of all 95 units were similar, so we shall show the results for a typical unit that is representative of the whole sample. The rate response is shown in Fig. 1. The response to contralateral stimulation was very similar to that to diotic stimulation, so only the diotic response is shown here. The strongest response to diotic stimulation was to an F0 of 200 Hz. There were no striking differences between the two modes of dichotic stimulation (odd or even harmonics to left or right ears), so these results are also shown averaged together (Fig. 1). The dichotic response is also strongest at one F0, but the peak is at half the frequency of that for diotic presentation. The temporal responses of the example unit of Fig. 1 are shown in Fig. 2. There are clear responses at the period of the fundamental up to 200 Hz for the diotic condition. The shapes of the PSTHs for the dichotic conditions are very similar to those for the diotic conditions in response to twice the F0. It appears that the responses are phase locked at 2F0. This is confirmed by the Fourier transforms of the PSTHs (Fig. 3), where it can be seen that although there is a clear response to the fundamental (and harmonics) in the diotic condition, there is little response to F0 in the dichotic condition (see
[Fig. 1: Diotic and Dichotic panels; spikes/stimulus vs fundamental frequency (Hz); curves: Rate, Sync F0, Sync 2F0]
Fig. 1 Responses of a unit with CF of 1.8 kHz to a harmonic series with fundamental frequencies (F0s) increasing from 50 Hz to 400 Hz. Solid circles show total number of spikes, open circles show number of spikes synchronized to F0, and triangles show spikes synchronized to 2F0
[Fig. 2: PSTH panels for the Diotic and Dichotic conditions at F0 = 50, 100, 200 and 400 Hz; spikes/bin vs time (ms)]
Fig. 2 Peristimulus time histograms (PSTHs) of the unit shown in Fig. 1
[Fig. 3: Fourier-transform panels for the Diotic and Dichotic conditions at F0 = 50, 100, 200 and 400 Hz; spikes/stimulus vs frequency (Hz)]
Fig. 3 Fourier transforms of the PSTHs shown in Fig. 2. Arrows emphasise F0
arrows). Instead, the response is to 2F0 and the harmonics of 2F0. The magnitude of each Fourier component shows the degree of phase locking to that frequency (indeed if the Fourier transform is normalized by the DC component then the magnitudes are equivalent to the vector strength at that frequency). The Fourier magnitudes at (i.e. synchronization to) F0 and 2F0 are shown in Fig. 1. These plots emphasise that whilst the degree of phase locking to F0 is greater than to 2F0 in the diotic conditions, it is the reverse in the dichotic conditions. The ratio of synchronization to 2F0 relative to F0 is shown for all units as a function of CF in Fig. 4. It is clear that on average, for diotic presentation, synchronization to 2F0 is lower than, or equal to, synchronization to F0.
[Fig. 4: ratio 2F0/F0 vs characteristic frequency (kHz) for Diotic and Dichotic presentation at 50, 100, 200 and 400 Hz; R and U mark resolved and unresolved regions]
Fig. 4 Ratio of synchronization to 2F0 to synchronization to F0 for all units. Results for diotic stimulation are shown in the top row and for dichotic stimulation in the lower row. The symbols R and U indicate regions where harmonics should be resolved, and unresolved, respectively, and the vertical dashed lines indicate the boundaries, calculated from the rule in Shackleton and Carlyon (1994) assuming guinea pig filter bandwidths (Evans 2001)
[Fig. 5: ratio of ipsi/contra response vs characteristic frequency (kHz), with a histogram of the number of units at right]
Fig. 5 Ratio of ipsilateral response rate to contralateral response rate averaged across all F0s. The histogram on the right shows the number of units pooled across frequency. Stimuli were monaural harmonic series
Conversely, for dichotic presentation, synchronization to 2F0 is up to 10 times higher than synchronization to F0. An explanation consistent with all the data described is that the units are responding to a single, dominant, ear. Figure 5 shows the ratios of the response to an ipsilateral stimulus divided by the response to a contralateral stimulus, averaged across all F0s. It is clear that for almost all units the ipsilateral response is lower than the contralateral response, and that for about half the
units the ipsilateral response is less than half that of the contralateral. There is a slight trend for the responses to be more equal at lower CFs and for the contralateral response to be more dominant at higher CFs.
4
Discussion
The data presented here are largely consistent with units responding to the stimulus at a single, dominant, ear. In the diotic conditions the stimuli at each ear share the same spectra and waveforms as each other, and central combination of the signals would not affect this. On the other hand, the dichotic signals have components spaced at twice the fundamental at each of the ears, and so have a waveform with an envelope periodicity at twice the rate of the diotic stimulus at each of the ears. Therefore, consideration of either the spectrum, or the waveform, at either ear alone would result in responses characteristic of 2F0. It is only if the spectra, or waveforms, from each ear are somehow combined that we would expect the dichotic stimulus to have the same response as the diotic. Our results might, therefore, appear completely unsurprising. However, it remains the case that the pitch cues from each ear are integrated into a single percept, and purely binaural pitches do exist which can be combined with “monaural” pitch cues (Akeroyd and Summerfield 2000). We also need to account for the result that motivated this study, that pairs of harmonics presented to different ears can elicit a pitch corresponding to their separation (Houtsma and Goldstein 1972). This result was true for all the pairs of harmonics which were used. However, as the average harmonic number increased (i.e. from harmonics 2 and 3 to 10 and 11), performance declined equally for both monotic and dichotic presentation from perfect performance to near chance. A similar result was reported by Bernstein and Oxenham (2003) (Expt 2), who found that frequency discrimination thresholds of both diotic and dichotic harmonic complexes (with harmonics alternating between ears, like ours) increased significantly as the lowest harmonic in the complex was increased beyond the 10th. Such diotic results have been shown in many experiments (see Bernstein and Oxenham 2003; Shackleton and Carlyon 1994 for reviews), emphasising again that harmonic number is an important variable in pitch perception! In a further experiment, Bernstein and Oxenham (2003) (Expt 3) compared the perceived pitches of complexes presented diotically and dichotically. If the complexes contained harmonics below the 10th, then the dichotic complex was perceived with a pitch corresponding to F0, whereas if the complexes contained only harmonics above the 15th then the dichotic stimulus had a pitch equal to 2F0 (in between the pitch was uncertain). That is, for low harmonics pitch information is combined across the ears, whereas for high harmonics the pitch corresponds to the periodicity, or harmonic separation, of the stimulus at a single ear. This latter finding is consistent with our results, so we need to consider the harmonic number of our stimuli.
There are two complications in making comparisons with Bernstein and Oxenham (2003). First, all of our stimuli comprised all harmonics below 10 kHz, so they obviously contain the lower harmonics. We assume that the harmonics “seen” by a neuron are only those which are contained within its excitatory response area. To a first approximation, this has the same width as the tuning at the auditory nerve. Second, guinea pig filter bandwidths are approximately twice those of humans (both physiologically and psychophysically) (Evans 2001), so, to the extent that the pitch transition region depends upon the number of harmonics passing through a single filter, we need to compensate for guinea pig peripheral resolvability. We do not have the space here to discuss whether the pitch transition region is determined by peripheral resolvability or harmonic number (see Bernstein and Oxenham 2003 for a discussion), but following Bernstein and Oxenham (2003) we will assume that the critical harmonic numbers are those which would become unresolved if all harmonics in a stimulus were passing through the same peripheral filters. Applying the rule for resolvability derived by Shackleton and Carlyon (1994) to guinea pig filter bandwidths we obtain the transition regions shown in Fig. 4. There is no noticeable difference between the results in the resolved and unresolved regions. However, most of the resolved harmonics were with a fundamental of 400 Hz, which tended to show minimal phase locking to either F0 or 2F0 (Figs. 2 and 3 are typical). It is therefore probable that all of the units which showed significant phase locking were responding to unresolved harmonics. In which case, our results are entirely consistent with the psychophysics. Throughout this chapter we have mostly been concerned with the temporal properties of IC neurons. However, some pitch theories posit that a place code for periodicity exists in the IC (see Rees and Langner 2005 for a review). About half of our units were preferentially tuned to a single F0 for diotic stimuli and these all were tuned to half-F0 for dichotic stimuli (e.g. Fig. 1). In a population this means that the peak activation for dichotic stimulation would be at 2F0, consistent with the psychophysics for high harmonics.
References
Akeroyd MA, Summerfield AQ (2000) Integration of monaural and binaural evidence of vowel formants. J Acoust Soc Am 107:3394–3406
Bernstein JG, Oxenham AJ (2003) Pitch discrimination of diotic and dichotic tone complexes: harmonic resolvability or harmonic number? J Acoust Soc Am 113:3323–3334
Bilsen FA, van der Meulen AP, Raatgever J (1998) Salience and JND of pitch for dichotic noise stimuli with scattered harmonics: grouping and the central spectrum theory. In: Palmer AR, Rees A, Summerfield AQ, Meddis R (eds) Psychophysical and physiological advances in hearing. Whurr, London, pp 403–411
Bullock DC, Palmer AR, Rees A (1988) Compact and easy-to-use tungsten-in-glass microelectrode manufacturing workstation. Med Biol Eng Comput 26:669–672
Cramer EM, Huggins WH (1958) Creation of pitch through binaural interaction. J Acoust Soc Am 30:413–417
Culling JF, Marshall DH, Summerfield AQ (1998) Dichotic pitches as illusions of binaural unmasking. II. The Fourcin pitch and the dichotic repetition pitch. J Acoust Soc Am 103:3527–3539
Evans EF (2001) Latest comparisons between physiological and behavioural frequency selectivity. In: Houtsma AJM, Kohlrausch A, Prijs VF, Schoonhoven R (eds) Physiological and psychophysical bases of auditory function. Shaker Publishing BV, Maastricht, pp 382–387
Fourcin AJ (1970) Central pitch and auditory lateralization. In: Plomp R, Smoorenburg GF (eds) Frequency analysis and periodicity detection in hearing. Sijthoff, Leiden, pp 319–328
Goldberg JM, Brown PB (1969) Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: some physiological mechanisms of sound localization. J Neurophysiol 32:613–636
Houtsma AJM, Goldstein JL (1972) The central origin of the pitch of complex tones: evidence from musical interval recognition. J Acoust Soc Am 51:520–529
Mast TE (1970) Binaural interaction and contralateral inhibition in dorsal cochlear nucleus of the chinchilla. J Neurophysiol 33:108–115
Rees A, Langner G (2005) Temporal coding in the auditory midbrain. In: Winer JA, Schreiner CE (eds) The inferior colliculus. Springer, Berlin Heidelberg New York
Shackleton TM, Carlyon RP (1994) The role of resolved and unresolved harmonics in pitch perception and frequency-modulation discrimination. J Acoust Soc Am 95:3529–3540
Shackleton TM, Skottun BC, Arnott RH, Palmer AR (2003) Interaural time difference discrimination thresholds for single neurons in the inferior colliculus of guinea pigs. J Neurosci 23:716–724
Winer JA, Schreiner CE (2005) The inferior colliculus. Springer, Berlin Heidelberg New York
Zurek PM (1979) Measurements of binaural echo suppression. J Acoust Soc Am 66:1750–1757
Comment by Langner
Pitch is not defined by single neurons in the midbrain. The neuron you show responds very nicely to the periodicity of 200 Hz in the stimulus. This would also correspond to the perceived pitch provided the stimulus contained only harmonics of 200 Hz. Under your special dichotic conditions the perceived pitch is 100 Hz because the orthogonal map of frequency and periodicity in the midbrain also contains lower, resolved, harmonics – all multiples of 100 Hz – and this lower pitch obviously dominates the pitch. In conclusion: I believe that your results are in line with a periodicity map in the midbrain.
Reply
We are not in disagreement. During the presentation we were at pains to stress that we were looking for representations of pitch cues which could underpin the extraction of a pitch percept. Further, in the last paragraph of the paper we point out that if a periodicity map exists, the results presented are consistent with the psychophysics for unresolved harmonics. A periodicity map containing units behaving like ours would predict a doubled pitch for dichotically alternating harmonic stimuli. To explain the pitch of stimuli containing resolved harmonics your argument requires units which are only
responding to individual harmonics; however, our data do not show this, although it is possible that our method was biased against finding them.
Comment by Greenberg
You may not have used the appropriate stimuli in searching for binaural integration underlying dichotic pitch. In humans, dichotic pitch is restricted to fundamental frequencies lower than 330 Hz (Bilsen and Goldstein 1974) and extremely low sound pressure levels (generally within 10–20 dB of the listener’s threshold). Even under optimal listening conditions, dichotically generated pitch is extremely weak; many listeners have difficulty hearing it at all. However, with trained listeners, such as those used in the studies reported by Bilsen and Goldstein (1974) and Houtsma and Goldstein (1972), reliable pitch matching and discrimination data can be obtained. Therefore, it would be useful to use signals known to generate a reliable sensation of dichotic pitch and whose acoustic properties are sufficiently distinct in the monotic and dichotic cases that there would be no possibility of confusing responses to the two sets of stimulus conditions. Both dichotic repetition noise (Bilsen and Goldstein 1974) and multiple-phase-shift noise (Bilsen 1977) have such properties and therefore could be used for investigating the neural correlates of dichotic pitch in central brainstem nuclei.
References
Bilsen FA (1977) Pitch of noise signals: evidence for a central spectrum. J Acoust Soc Am 61:150–161
Bilsen FA, Goldstein JL (1974) Pitch of dichotically delayed noise and its possible spectral basis. J Acoust Soc Am 55:292–296
Houtsma AJM, Goldstein JL (1972) The central origin of the pitch of complex tones: evidence from musical interval recognition. J Acoust Soc Am 51:520–529
Reply
We agree with your sentiment that we should use stimuli which distinguish between monaural and binaural processing. However, we question whether dichotic pitches are the relevant stimulus. We chose the alternating harmonics stimulus because it produces different pitches depending upon whether monaural or binaural cues are being used. Dichotic pitches, however, are necessarily generated by binaural processing, so do not, on their own, indicate whether fusion is taking place in, or before, the inferior colliculus. Indeed, Hancock and Delgutte (2002) have already demonstrated a representation of Huggins pitch in the IC consistent with coincidence detection in the superior olive. We were, however, planning to extend our studies
by comparing the responses to diotic and alternating harmonic stimuli, and to “pseudo” harmonic stimuli created by binaural interaction, like multiple-phase-shift noise.
References
Hancock KE, Delgutte B (2002) Neural correlates of Huggins dichotic pitch. Assoc Res Otolaryngol Abstr 25:40
Comment by Hartmann
Your introduction begins by noting the fusion of binaural information in the Houtsma-Goldstein experiment. This experiment has an interesting history. When it was first announced in the early 1970s, many psychoacousticians attempted to hear the effect and failed. Eventually it emerged that listeners who were not specially trained needed to hear the stimuli at quite low levels, and it was also helpful if the region of the low pitch was cued by previous trials that pointed unambiguously to that region. Subsequently, Houtgast (1976) showed that, with appropriate cuing and given a low signal-to-noise ratio, a low pitch, or subharmonic, could be evoked by only a single sine component! The Bernstein-Oxenham work is similar to Houtsma-Goldstein in that the S/N was low – only 10 dB SL for each component. It also seems likely that the method, with its varying lowest harmonic number but f_0 always in the same range, helped to cue the periodicity pitch. More research needs to be done on the Houtsma-Goldstein effect with the goal of determining how “real” it actually is. Should it stand together with monaural periodicity pitch as motivation for physiological investigations? By contrast, the dichotic noise pitches, also noted in your introduction, don’t seem to require any special conditions for audibility.
References
Houtgast T (1976) Subharmonic pitches of a pure tone at low S/N ratio. J Acoust Soc Am 60:405–409
Reply
We thank you for your clarification of the original Houtsma-Goldstein effect. Whilst it is true that this effect was an initial motivator, the experiment we conducted is one which has intrinsic merit just from a physiological point of view. The question of how, when, and where the information from the two, independent, ears is integrated into a single, fused percept is fundamentally important.
Comment on Shackleton et al. and Greenberg by Frans A. Bilsen
In response to Greenberg, I want to comment that in order to have a listener perceive the (strong) low pitch, the binaural two-tone stimulus (Houtsma and Goldstein 1972; see also Bilsen 1973) does indeed have to be presented at a rather low sensation level. This might be related to the fact that, due to the optimal separation of harmonics, optimal rivalry exists between the synthetic mode of perception (low pitch) and the analytic mode of perception (harmonics perceived separately). On the other hand, the (relatively weak) low pitch evoked by dichotic stimuli based on white noise at either ear does not call for a low sensation level. In the past, low sensation levels were applied only to ensure the absence of acoustic cross talk. In response to Shackleton and colleagues, as to the place of generation of the low pitch in the auditory pathway, one has to make a clear distinction between the binaural two-tone stimulus and dichotic-pitch stimuli. The two-tone stimulus is expected to fuse and stimulate the central pitch processor directly at a ‘central’ level, while dichotic-pitch stimuli require binaural processing at a ‘peripheral’ level beforehand (compare Bilsen 1977, Fig. 6 therein). The search for low-pitch coding with stimuli related to the two-component stimulus at a rather ‘peripheral’ level like the inferior colliculus therefore seems not to have a firm basis. But more importantly, ample psychophysical evidence exists that optimal low pitch is derived from the lower (resolved) harmonics. However, given the response areas of the inferior colliculus units investigated and the diotic vs dichotic stimuli used, the present experiments (as argued by Shackleton and colleagues) dealt mainly with the rate (PSTH) response of units to unresolved harmonics. Those might indeed not be expected to show either the proper low-pitch coding or the binaural integration as hypothesized by the psychophysics of the two-tone stimulus.
References
Bilsen FA (1973) On the influence of the number and phase of harmonics on the perceptibility of the pitch of complex signals. Acustica 28:59–65
Bilsen FA (1977) Pitch of noise signals: evidence for a “central spectrum”. J Acoust Soc Am 61:150–161
Houtsma AJM, Goldstein JL (1972) The central origin of the pitch of complex tones: evidence from musical interval recognition. J Acoust Soc Am 51:520–529
Reply
We agree with Bilsen that dichotic-pitch stimuli and binaural two-tone stimuli are likely to have different origins; see our reply to Greenberg. However, they need to be integrated somewhere, and to set arbitrary labels for what is central and what is peripheral does not seem helpful. The inferior
colliculus is the first major relay station in the auditory system where monaural and binaural information converges; it is therefore a possible place for pitch (or at least periodicity) integration. Indeed, Langner (above) argues that there is a periodicity map in the IC. In response to the second point, as we argued in the paper, since most of the neurons we record from are responding to unresolved harmonics we have not tested the integration hypothesis as rigorously as we would have liked. However, the responses from neurons in the resolved region are no different from the responses to unresolved harmonics.
48 Level Dependent Shifts in Auditory Nerve Phase Locking Underlie Changes in Interaural Time Sensitivity with Interaural Level Differences in the Inferior Colliculus ALAN R. PALMER1, LIANG-FA LIU2, AND TREVOR M. SHACKLETON1
1
Introduction
Interaural time differences are initially analyzed in the medial superior olive (MSO) in the brainstem. Neurons in this nucleus act as coincidence detectors, only firing when the activity from the two ears reaches the cell within a small time window (Batra et al. 1997a, b; Goldberg and Brown 1969; Spitzer and Semple 1995; Yin and Chan 1990). Maximal values of interaural time difference (ITD) for humans are 700 µs, with just noticeable differences often of the order of a few tens of µs (Durlach and Colburn 1978; Hafter et al. 1979; Mills 1958). To achieve such accuracy requires a very precise time signal from the two ears, which is provided by the phase-locking in the auditory nerve fibers (Johnson 1980; Kiang et al. 1965; Palmer and Russell 1986), that is a direct result of the manner of activation of the inner hair cells by the vibration of the basilar membrane (see Ruggero and Rich 1987 for a review). The vibration of the basilar membrane is non-linear, resulting in shifts in the phase of vibration as a function of the level of tonal stimuli (e.g. review in Robles and Ruggero 2001). Such phase shifts can also be seen in the phase-locked activity of auditory nerve fibers (e.g. Anderson et al. 1971). An implication of these level dependent phase shifts is that the output of the MSO coincidence detectors should be sensitive to interaural level differences (ILDs) as these will cause a phase shift between their inputs from the two ears. This was tested (Kuwada and Yin 1983; Yin and Kuwada 1983) by recording the ITD sensitivity of inferior colliculus (IC) neurons (the target of the ascending projection from the MSO). They saw a continuum along which the ITD sensitivity of some neurons was unchanged by ILD, while others showed marked and systematic changes in phase. The relationship between the ITD, ILD and the CF was not explored in detail in the paper, so the degree to which the observations at the periphery (e.g. the auditory nerve – Anderson et al. 1971; Robles and Ruggero 2001) match those at the IC remains unclear. Here we
1 MRC Institute of Hearing Research, University Park, Nottingham, NG7 2RD, UK, [email protected], [email protected]
2 Department of Otolaryngology, Head and Neck Surgery, Chinese PLA General Hospital, 28 Fuxing Road, Beijing, P.R. China 100853, [email protected]
reexamine this question by recording in the IC and the auditory nerve of the guinea pig. The changes in ITD sensitivity that we measure in the IC appear to be consistent with the picture of phase changes that we have measured in the guinea pig auditory nerve.
2
Methods
Details of the methods have been previously published (see McAlpine and Palmer 2002; Palmer et al. 1986). Briefly, recordings were made in the right IC of pigmented guinea pigs anesthetized with urethane and Hypnorm (Janssen, High Wycombe, UK). A premedication of atropine sulfate reduced bronchial secretions. The animals were placed inside a sound attenuating room in a stereotaxic frame in which hollow plastic specula replaced the ear bars to allow sound presentation and direct visualization of the tympanic membrane. Single-fiber recordings were obtained by introduction of micropipettes filled with 2.7 M KCl into the auditory nerve under direct visual guidance, through a posterior craniotomy, after retraction of the cerebellum using a spatula. The bulla was vented, a silver wire electrode was placed on the round window of the cochlea, and the threshold of the cochlear action potential was monitored periodically. For IC recordings the bullae on both sides were vented. A craniotomy was performed over the IC, the dura reflected and recordings made from single, well-isolated neurons, with glass-insulated tungsten electrodes (Bullock et al. 1988) advanced through the intact cortex by a piezoelectric motor (Burleigh Inchworm IW-700/710). Extracellular action potentials were amplified (Axoprobe 1A, Axon Instruments, Foster City, CA), filtered between 300 Hz and 2 kHz, discriminated using a level-crossing detector (SD1, Tucker-Davis Technologies, Alachua, FL), and time stamped with a resolution of 1 µs. All experiments were performed in accordance with the United Kingdom Animals (Scientific Procedures) Act of 1986.
2.1
Stimulus Generation
Stimuli were delivered to each ear through sealed acoustic systems comprising custom-modified tweeters (Radioshack 40-1377; M. Ravicz, Eaton Peabody Laboratory, Boston, MA), which fitted into the hollow speculum. The output was calibrated a few millimeters from the tympanic membrane using a microphone fitted with a calibrated probe tube. Stimuli were digitally synthesized (System II, Tucker-Davis Technologies) at between 100 kHz and 200 kHz sampling rates and were output through a waveform reconstruction filter set at 1/4 the sampling rate (135 dB/octave
elliptic: Kemo 1608/500/01 modules supported by custom electronics). Stimuli were of 50 ms duration, switched on and off simultaneously in the two ears with cosine-squared gates with 2 ms rise/fall times (10% to 90%). When a neuron was isolated the lowest threshold and frequency at that threshold (characteristic frequency: CF) were obtained audio-visually. Frequency response areas, rate-level functions and peristimulus response histograms (PSTHs) were obtained using pure tones to enable neurons to be characterized (see Shackleton et al. 2003 for details). ITDs were obtained by delaying, or advancing, the fine structure of the signal to the ipsilateral ear while keeping the signal to the contralateral ear fixed. Coarse noise ITD functions (spike response vs ITD of the broadband noise) were obtained at 30 dB above noise response threshold to allow classification. Positive values of ITD are defined as contralateral leading. ITD functions to tones at different ILDs were obtained by fixing the contralateral level at 20 dB above threshold, and varying the ipsilateral level from threshold to 40 dB above threshold in steps of 10 dB. For consistency with the definition of ITDs, positive values of ILD are defined as contralateral more intense (note that this choice results in the unfortunate effect of increasing ipsilateral level corresponding to decreasing ILD). In three early experiments we used 30 dB suprathreshold for the contralateral level and from threshold to 60 dB above threshold for the ipsilateral level, in 15-dB steps. The signals were set at CF and at a selection of frequencies in steps of 0.25 octaves away from CF. Spikes were counted if they occurred between 10 and 80 ms after the stimulus onset. ITD functions were measured over ±0.5 cycles in 0.02-cycle steps using 10 repeats at a repetition rate of 5 per second. Best Phase (BP), vector strength and Rayleigh coefficient were calculated from the ITD functions using a modification of the method of Goldberg and Brown (1969). Phase-locking in auditory nerve fibers was obtained by computation from the period histograms constructed from isointensity frequency response curves measured in 1/8 octave steps from near threshold in 10-dB steps with up to 10 repetitions at each frequency and level combination.
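The sketch below shows one way to obtain best phase, vector strength and the Rayleigh statistic from a measured ITD function, in the spirit of the modified Goldberg and Brown (1969) method mentioned above. The spike counts are invented, and the exact weighting and significance criterion used in the study may differ.

import numpy as np

# Interaural phase difference (cycles) in 0.02-cycle steps over +/-0.5 cycles,
# with hypothetical spike counts forming a synthetic ITD function.
ipd = np.arange(-0.5, 0.5 + 0.02, 0.02)
counts = np.maximum(5.0 + 4.0 * np.cos(2.0 * np.pi * (ipd - 0.1)), 0.0)

# Spike-count-weighted mean vector over interaural phase.
z = np.sum(counts * np.exp(1j * 2.0 * np.pi * ipd)) / np.sum(counts)
vs = np.abs(z)                                # vector strength
best_phase = np.angle(z) / (2.0 * np.pi)      # best phase in cycles
rayleigh = 2.0 * np.sum(counts) * vs ** 2     # often taken as significant above ~13.8

print("best phase = %.3f cycles, VS = %.2f, Rayleigh = %.1f" % (best_phase, vs, rayleigh))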
3
Results
3.1
Effect of ILD on ITD Sensitivity in IC
ITD functions were recorded from 72 well isolated IC neurons (CFs 56–1300 Hz, but mostly below 700 Hz) of which noise ITD functions were obtained for 54 and full data sets for 50. Of the 54, 54% were peak type units, 20% were trough and 19% were intermediate (see, for example, McAlpine et al. 1996). Initially we measured ITD functions to CF tones and subsequently as time permitted, or as the data dictated, we used frequencies above and below CF. The design of this protocol was based upon the shifts in auditory nerve phase
[Fig. 1: spikes/stimulus vs interaural phase difference (cycles) at −0.5 octaves re CF, at CF, and +0.5 octaves re CF, for ILDs of −26, −15, 0, 15 and 30 dB]
Fig. 1 ITD functions at different stimulus frequencies and ILDs (key). At 0.5 octave below CF, the best phase hardly changes at all with ILD. In this neuron, at CF and 0.5 octave above CF, the best phase moved ipsilaterally with increasing ILD as shown by the symbols and arrows above the ITD curves
locking (Anderson et al. 1971), which indicated that the largest shifts occurred at up to an octave away from CF. An example of the responses of a single IC neuron tested at three different frequencies and five different ILDs is shown in Fig. 1. The arrows indicate that as the ILD is increased to favour the contralateral ear there is a progressive change in the best phase for CF and half an octave above CF, but no change at half an octave below CF. We use the term “null frequency” to describe the frequency at which no change in phase with ILD occurred; in this case it was −0.5 octaves re CF. Contrary to the widely held view based on the existing auditory nerve data, we found the null to occur at the CF in less than a third of the neurons for which we had sufficient data (14/50). Equally often the null occurred below CF (15/50) and, less often, above CF (8/50). The remaining (13/50) neurons for which we had sufficient data showed phase changes, but these were erratic or parallel and no null could be attributed. The changes in best phase were maximally about 0.2 cycles. In the majority (33/37) of the units with an identifiable null frequency, best phase moved more ipsilaterally with increasing ILD (as shown by the arrows in Fig. 1) for frequencies above the null frequency, whereas below the null frequency the best phase moved contralaterally with increasing ILD. In two units with the null above CF the best phase moved ipsilaterally below the null. The remaining two units showed clear, but inconsistent, phase shifts. The phase shifts were unrelated to the type of delay sensitivity (peak, trough, intermediate). Even with this limited sample it was clear that null frequencies could be found as frequently above, at, and below CF across the whole range of CFs.
Effect of Sound Level on Auditory Nerve Phase Locking
We measured the level dependency of phase locking in 183 auditory nerve fibers (CFs 0.071–3.227 kHz). These auditory nerve data are consistent in almost all respects with those previously published, but highlight some previously unremarked aspects of those reports. To facilitate comparison, our early protocol in the auditory nerve replicated as closely as possible that used in the IC. We supplemented these data with isointensity plots covering three or more octaves in 1/8 octave steps. These latter data gave a better definition of the null and so in later experiments we gathered only isointensity plots with as many repeats as possible. Following earlier studies, best phase was obtained from period histograms, unwrapped, and the slope of the plot of best phase against frequency at the highest level used to estimate the cochlear delay, which was then subtracted from all data for that neuron to give the phase changes. An example of the phase changes with level for an auditory nerve fiber for which stability was sufficient to run both protocols is shown in Fig. 2. It is clear that the data from the isointensity plots although noisier give the same phase as using more repeats. Equally clear is the fact that there is a null frequency which in this case occurs well below CF
[Fig. 2: best phase (cycles) vs frequency (Hz) for levels of 42, 52, 62, 72 and 82 dB SPL; CF 222 Hz]
Fig. 2 Best phase of phase locking in an auditory nerve fiber (CF 222 Hz) as a function of sound level and frequency. The vertical line indicates CF. The arrows highlight the phase change as ipsilateral level increases. The large symbols are from 50 repeats at steps of 0.5 octaves from CF. The smaller symbols are from 5 repeats at 1/8 octave steps. The legend gives the level in dB SPL
(−0.5 octaves). Ninety-six fibers yielded data that allowed a null frequency to be determined. The null frequency was at CF in 41 fibers, above CF in 36 fibers and below CF in 19. In all fibers, above the null, there was an increasing phase lead as level increased (as shown by the arrows in Fig. 2), and an increasing phase lag below the null. Nulls at and above the CF could occur at all CFs, but nulls below CF only occurred in lower-CF fibers. There was no clear CF at which the null moved from below to above CF.
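The auditory nerve analysis described in this section (unwrapping the best phase, estimating the cochlear delay from the slope at the highest level, subtracting it, and locating the null frequency) can be sketched as follows. The best-phase matrix and all numerical constants here are synthetic; only the sequence of steps follows the text.

import numpy as np

# Hypothetical best phases (cycles) for one fiber: rows = levels, columns = frequencies.
cf = 222.0
freqs = cf * 2.0 ** np.arange(-1.5, 1.01, 0.125)       # roughly 1/8-octave steps
levels = np.array([42.0, 52.0, 62.0, 72.0, 82.0])      # dB SPL
null_freq = cf * 2.0 ** -0.5                           # invented null at -0.5 octaves re CF
delay = 2.0e-3                                         # invented 2-ms cochlear delay
bp = (-delay * freqs[None, :]
      + 0.004 * (levels[:, None] - levels.max()) * np.log2(freqs[None, :] / null_freq))

# 1) Unwrap best phase across frequency (work in radians, then convert back to cycles).
bp_unwrapped = np.unwrap(2.0 * np.pi * bp, axis=1) / (2.0 * np.pi)

# 2) Estimate the cochlear delay from the slope of phase vs frequency at the highest
#    level, and remove that linear component from every level.
slope, intercept = np.polyfit(freqs, bp_unwrapped[-1], 1)
residual = bp_unwrapped - (slope * freqs + intercept)

# 3) The null frequency is where the phase change with level is closest to zero.
phase_change = residual[-1] - residual[0]              # highest minus lowest level
est_null = freqs[np.argmin(np.abs(phase_change))]
print("estimated null frequency: %.0f Hz (%.2f octaves re CF)"
      % (est_null, np.log2(est_null / cf)))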
4
Discussion
The direction of the changes in ITD sensitivity that we see in the IC is consistent, in general, with the direction of the changes in phase locking that we find in the auditory nerve. Initially, the changes in ITD sensitivity with ILD appeared to be inconsistent with earlier reports of the changes at the input to the coincidence detector due to the level dependency of phase locking at frequencies away from CF in the auditory nerve. However, careful rereading of even the earliest reports (e.g. Anderson et al. 1971) revealed that null frequencies in the nerve could occur well away from the CF, as we had found in the IC data. Our own auditory nerve fiber data confirmed that this was the case and showed that null frequencies could occur at CF, or above, across all of the phase-locking range, but only below CF for CFs below 1 kHz. Existing basilar membrane data suggest that the level dependency of phase locking reflects non-linearities in the mechanics of the cochlea (e.g. Ruggero and Rich 1987), but most of these mechanical measurements are from relatively high-frequency positions along the cochlea where the null frequency occurs at CF. Those from the low-frequency part of the basilar
membrane (Cooper and Rhode 1995; Zinn et al. 2000) suggest complicated mechanics in which the effect of the cochlear amplifier may even be expansive instead of compressive, producing attenuation rather than amplification. The phase of the mechanical responses in these studies is difficult to reconcile with our auditory nerve data, which look much more like those at more basal locations. We, like others (Kuwada and Yin 1983), found IC cells whose ITD sensitivity did not change with ILD or in which the changes were erratic. While some of the effects may reflect the complicated apical mechanics, we cannot rule out the possibility that, in these instances, effects along the pathway from the auditory nerve (such as reconvergence and inhibitory modulation) might also be contributing.
References Anderson D, Rose J, Hind J, Brugge J (1971) Temporal position of discharges in single auditory nerve fibers within the cycle of a sine-wave stimulus: frequency and intensity effects. J Acoust Soc Am 49:1131–1139 Batra R, Kuwada S, Fitzpatrick DC (1997a) Sensitivity to interaural temporal disparities of lowand high- frequency neurons in the superior olivary complex. I. Heterogeneity of responses. J Neurophysiol 73:1222–1236 Batra R, Kuwada S, Fitzpatrick DC (1997b) Sensitivity to interaural temporal disparitites of lowand high- frequency neurons in the superior olivary complex. II. Coincidence detection. J Neurophysiol 78:1237–1247 Bullock DC, Palmer AR, Rees A (1988) Compact and easy-to-use tungsten-in-glass microelectrode manufacturing workstation. Med Biol Eng Comput 26:669–672 Cooper N, Rhode W (1995) Nonlinear mechanics at the apex of the guinea-pig cochlea. Hear Res 82:225–243 Durlach NI, Colburn HS (1978) Binaural phenomena. In: Carterette EC, Friedman MP (eds) Handbook of perception, vol IV, hearing. Academic Press, New York, pp 365–466 Goldberg JM, Brown PB (1969) Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: some physiological mechanisms of sound localization. J Neurophysiol 32:613–636 Hafter ER, Dye RH, Gilkey RH (1979) Lateralization of tonal signals which have neither onsets nor offsets. J Acoust Soc Am 65:471–477 Johnson DH (1980) The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J Acoust Soc Am 68:1115–1122 Kiang NYS, Watanabe T, Thomas EC, Clark LF (1965) Discharge patterns of single fibers in the cat’s auditory nerve. MIT, Cambridge, Mass Kuwada S, Yin TCT (1983) Binaural interaction in low-frequency neurons in inferior colliculus of the cat. I. Effects of long interaural delays, intensity, and repetition rate on interaural delay function. J Neurophysiol 50:981–999 McAlpine D, Palmer AR (2002) Blocking gabaergic inhibition increases sensitivity to sound motion cues in the inferior colliculus. J Neurosci 22:1443–1453 McAlpine D, Jiang D, Palmer AR (1996) Interaural sensitivity and the classification of low bestfrequency binaural responses in the inferior colliculus of the guinea pig. Hear Res 97:136–152 Mills AW (1958) On the minimum audible angle. J Acoust Soc Am 30:237–246 Palmer AR, Russell IJ (1986) Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair cells. Hear Res 24:1–15
Palmer AR, Winter IM, Darwin CJ (1986) The representation of steady-state vowel sounds in the temporal discharge patterns of the guinea-pig cochlear nerve and primary-like cochlear nucleus neurons. J Acoust Soc Am 79:100–113 Robles L, Ruggero M (2001) Mechanics of the mammalian cochlea. Physiolog Rev 81:1305–1352 Ruggero M, Rich N (1987) Timing of spike initiation in cochlear afferents: dependence on site of innervation. J Neurophysiol 58:379–403 Shackleton TM, Skottun BC, Arnott RH, Palmer AR (2003) Interaural time difference discrimination thresholds for single neurons in the inferior colliculus of guinea pigs. J Neurosci 23:716–724 Spitzer MW, Semple MN (1995) Neurons sensitive to interaural phase disparity in gerbil superior olive: diverse monaural and temporal response properties. J Neurophysiol 73:1668–1690 Yin TCT, Chan JCK (1990) Interaural time sensitivity in medial superior olive of cat. J Neurophysiol 64:465–488 Yin TCT, Kuwada S (1983) Binaural interaction in low-frequency neurons in inferior colliculus of the cat. II. Effects of changing rate and direction of interaural phase. J Neurophysiol 50:1000–1019 Zinn C, Maier H, Zenner H-P, Gummer A (2000) Evidence for active, nonlinear, negative feedback in the vibration response of the apical region of the in-vivo guinea pig cochlea. Hear Res 142:159–183
Comment by Gleich
One issue related to your finding that the "cross-over" point of the phase vs frequency functions obtained at different levels is not at the best frequency (BF) is the accuracy of the audio-visual determination of BF. Fibers originating from a given cochlear location have the same BF and the same cochlear response latency that in turn should be reflected in the slopes of the phase vs frequency functions. Consequently, if the audio-visually determined BFs correctly represent cochlear location, one would predict that the slopes of the phase vs frequency functions at a given BF are the same for fibers with the "cross-over" points below, at and above the BF. Alternatively, steeper slopes (equivalent to longer cochlear delays) in cells with "cross-over" points below and shallower slopes (shorter cochlear delays) in cells with "cross-over" points above the audio-visually determined BF would suggest some error in BF determination. In addition to audio-visually determined BFs you might consider deriving the frequency-dependent discharge rate from the data used to construct the phase vs frequency functions and compare the response peaks with audio-visually determined BFs.
Reply
The estimation of characteristic frequency (CF) is obviously important for the points we are making in this chapter. Certainly, for low-CF fibers the tuning curve is relatively broad and errors could occur. The relationship of cochlear delay to CF is of course correct, but we are dealing here with minor perturbations imposed on this very steep phase slope so it is not completely obvious what the relationship of the major slope to the null frequency should be. We
plotted the cochlear delays derived from the slopes at the highest sound level. With a perfect place-frequency map of the cochlea these delays alone could be used to derive characteristic frequencies. However, these data are noisy and the estimates would likely be little better than our audio-visual estimates. Comparison of the delays for two fibers from the same cochlea gave cochlear delays that differed by the amount predicted from the difference in audio-visually estimated CFs. While the relationships we report in the chapter are with respect to the audio-visually derived CFs, we have also extracted CFs from the iso-level functions. It is a moot point whether these are a better estimate, as CF changes with sound level and the iso-level curves near threshold are noisy. We took the lowest level at which an unambiguous peak was present, but since this was often not at absolute threshold the CF estimate will still be approximate. However, when we calculated the proportions of fibers with null frequencies above, below and at CF using the CF derived from the iso-level function, the result was no different from that using audio-visual CFs.

Comment by Greenberg
Some of the variability observed in the phase plots as a function of auditory-nerve-fiber characteristic frequency may be correlated with fiber spontaneous discharge rate. Spontaneous rate is correlated with threshold sensitivity (Liberman and Kiang 1978; Geisler et al. 1985) and phase-locking precision (Greenberg 1986) and thus may affect the phase/CF function as well. It would be interesting to partition your data into low (<10 spikes/s) and high spontaneous rate fiber populations to ascertain if there is a significant difference in the phase behavior. A separate issue is the method by which a fiber's characteristic frequency is determined. Sinusoidal signals may not always be the most precise stimuli to use, particularly for CFs below 1500 Hz (where significant phase locking is observed). One alternative is to use a two-component signal whose frequencies are arithmetically centered around the nominal unit CF. By moving the two-tone signal through the center of the unit's response area in fine frequency steps and examining the relative synchrony patterns associated with each component, it may be possible to derive a more accurate method for estimating CF than the conventional discharge-rate approach (see Greenberg et al. 1986 for a description).

References Geisler CD, Deng L, Greenberg S (1985) Thresholds for primary auditory fibers using statistically defined criteria. J Acoust Soc Am 77:1102–1109 Greenberg S (1986) Possible role of low and medium spontaneous rate cochlear nerve fibers in the encoding of waveform periodicity. In: Moore B, Patterson R (eds) Auditory frequency selectivity. Plenum, New York, pp 241–248
Greenberg S, Geisler CD, Deng L (1986) Frequency selectivity of single cochlear nerve fibers based on the temporal response patterns to two-tone signals. J Acoust Soc Am 79:1010–1019 Liberman MC, Kiang NY-S (1978) Acoustic trauma in cats: cochlear pathology and auditory-nerve activity. Acta Oto-Laryngol Suppl Stockh 358:1–63
Reply
An interesting suggestion, but we have plotted the difference between the null frequency and the CF (from the iso-level function) against the spontaneous rate and found no dependence on spontaneous discharge rate. We are certainly aware of the importance of the accurate determination of the CF. Unfortunately, from the present data set there are only two estimates readily available: the audio-visual estimate, which can be subject to error, and the CF from the iso-level function (see previous reply), which is also not necessarily a good estimate of CF. Taking either of these CF estimates did not change the conclusions we drew from our data.
49 Remote Masking and the Binaural Masking-Level Difference G. BRUCE HENNING1, IFAT YASIN1, AND CAROLINE WITTON2
1 Introduction
The improvement in detection performance known as the binaural masking-level difference (BMLD) is usually measured as the difference between signal levels that correspond to 75% correct obtained when both signal and masking noise have zero interaural phase difference (NoSo) and when the noise is in-phase but the signal has a 180° interaural phase difference (NoSπ) (Colburn and Durlach 1978). One of the principal models for the BMLD is Durlach's equalization and cancellation model (Durlach 1963). One result that is difficult to accommodate within the context of this model is the near disappearance of the BMLD in remote masking, when the signals have a frequency remote from a narrowband masking noise (Zwicker and Henning 1984). The decrease in the BMLD is not caused by reduced masking, which remains substantial at signal frequencies where the BMLD has become very small. Figure 1 [from Fig. 5 of Zwicker and Henning (1984)] illustrates the awkward result. The black (NoSπ) and white (NoSo) symbols in the left panel show median thresholds for detecting 600-ms tones of different frequency. (Thresholds in quiet are shown near the 10-dB level.) The continuous masking noise, centred on 250 Hz, had a bandwidth of 10 Hz and a noise-power density of 50 dB SPL. The corresponding median BMLD, shown in black in the right panel, was about 25 dB when the signal was centred in the noise, but fell abruptly when the signal lay more than a few tens of Hertz from the noise. Because some 30 dB of remote masking occurs, sizeable BMLDs would be expected to result if the stimuli to the ears were, in effect, subtracted in the NoSπ condition. The grey symbols (left panel, NoSπ eff) show thresholds when the centre frequency of the masking noise was shifted to correspond to that of the signal and the level of the noise adjusted to produce the same masking in NoSo as when the noise was centred on 250 Hz. When the signal phase was then
1 Sensory Research Unit, Department of Experimental Psychology, The University of Oxford, Oxford, UK, [email protected], [email protected] 2 Neurosciences Research Institute, School of Life and Health Sciences, Aston University, Birmingham, UK, [email protected]
Fig. 1 Detection thresholds (left; signal level in dB SPL as a function of signal frequency in Hz) and BMLDs (right; in dB) from Zwicker and Henning (1984)
inverted, the data in grey resulted. The BMLD, shown in grey in the right panel, was slightly reduced, attributable to a lower effective masking level (McFadden 1968), but it was 15 dB greater than in the remote masking condition. Thus the small BMLD in remote masking is caused neither by the signal frequency itself nor by the reduction in the amount of masking. One possible explanation for the decrease in the BMLD in remote masking comes from consideration of the stimuli formed by adding the signal in one ear and subtracting it in the other. These considerations form the main feature of the other principal model of the BMLD (Jeffress 1972). Figure 2 illustrates Jeffress' "vector model"; the diagram depicts the relative instantaneous magnitude and phase of the signal and masker, and shows the instantaneous interaural phase difference of the resultant stimuli at the left and right ear (VL and VR). Sinusoidal signals are produced by the projection of vectors, of length VS, onto one axis. In a continuous sample these vectors rotate at a constant angular speed ω = 2πfS (radians/s), where fS is the signal frequency (Hz). Figure 2 shows the NoSπ condition with the signal centred in the noise, so the signal vectors are 180° out of phase and are shown added to a noise vector of length OVM (the square root of the instantaneous noise power), which has a Rayleigh distribution. The phase of the noise masker rotates at a mean rate of ωM = 2πfm, where fm is usually taken to be the centre frequency of the noise band (Rice 1954; Davenport and Root 1958; Papoulis and Pillai 2002). The noise phase distribution is uniform on [0, 2π], so the signal phase relative to the noise, α, also has a uniform distribution. The resultant signals, VL and VR, have the random interaural phase difference, ∆Ø, the distribution of which is a function of α, VS, and VM (Henning 1973; Zurek 1991), and there are 2WT independent samples in a T-second sample of noise with a bandwidth W (Davenport and Root 1958). The arguments of Jeffress and his colleagues concerning the use of ∆Ø (or functions of it) are discussed elsewhere (Henning 1973), but Jeffress' diagram is useful in considering the temporal characteristics of the stimuli.
Fig. 2 Jeffress' vector model applied to the NoSπ condition
In particular, when the signal is centred in the noise, the parameters VM and α change at rates that are proportional to the noise bandwidth (Rice 1954; Davenport and Root 1958). [The slowly varying changes in α and VM produce the envelope fluctuations extensively studied in detection (Richards and Nekrich 1993; Bernstein et al. 2001; Witton et al. 2005; Richards and Henning 1994; van de Par and Kohlrausch 2005).] We are concerned with changes in the interaural phase ∆Ø when fs ≠ fm, because then α, in addition to its slowly changing random component, has an additional term, 2π∆f t, where ∆f is fs − fm. In Fig. 2, when fs ≠ fm, the signal vectors rotate at a mean rate of 2π∆f relative to the noise vector. In other words, α changes at that mean rate. The consequences of differences in stimulus and masker properties for the rate of change of ∆Ø are not immediately apparent. For example, if VM were constant and equal to VS, ∆Ø would take only two values, ±π/2, and the triangle OVSVM would then be inscribed in a circle with diameter VSVM, subtending ∆Ø. The absolute value of the interaural phase difference would thus always be |π/2| whatever the value of α: positive with 0 ≤ α < π (left ear leading); negative with π ≤ α < 2π. Under these circumstances the term associated with a non-zero ∆f would not alter the distribution; rather the sign of ∆Ø would change at mean intervals of 1/(2∆f), with each value of ∆Ø lasting T/2 s. The situation is more complicated when VM is random, as with a noise masker, but if VS and the mean value of VM are similar, the probability-density function for independent samples of ∆Ø is bimodal with modes near ±π/2 radians, and observers, who report hearing the signal in both ears, easily achieve 100% correct. The temporal aspects of the stimuli thus depend in complicated ways on the signal-to-noise ratio, the bandwidth of the noise, and fs − fm. The signal level in NoSπ will usually be well below that of the effective noise, and a given value of ∆Ø and its associated interaural delay, ∆t, will last for an average of only 1/∆f s. Thus either a "sluggish" binaural system (Grantham and Wightman 1978) or the brief duration for which a given ∆t lasts (Zwicker and Henning 1985) could explain the small BMLDs. We aimed to examine this further first by reproducing the results of Zwicker and Henning (1984) using two-alternative forced-choice
measurement techniques and then repeating the experiment with simplified temporal characteristics by using 12-ms signal bursts.
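The bimodal behaviour of ∆Ø described above is easy to check numerically. The short Python sketch below is our own illustration of the vector model, not code from any of the cited studies: masker magnitudes are drawn from a Rayleigh distribution, the signal-masker phase α is uniform, the NoSπ resultants are formed by adding the signal vector at one ear and subtracting it at the other, and the interaural phase difference is examined. With the signal comparable in size to the mean masker, the distribution has modes near ±π/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
vs = 1.0                                      # signal vector length (arbitrary units)
vm = rng.rayleigh(scale=1.0, size=n)          # instantaneous masker magnitude
alpha = rng.uniform(0.0, 2 * np.pi, n)        # signal phase relative to the masker

# NoSpi: the signal is added at one ear and subtracted at the other.
v_left = vm + vs * np.exp(1j * alpha)
v_right = vm - vs * np.exp(1j * alpha)

dphi = np.angle(v_left) - np.angle(v_right)
dphi = (dphi + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]

# The median of |dphi| lies close to pi/2, reflecting the two modes at +/- pi/2.
print("median |dPhi| (rad):", np.median(np.abs(dphi)))
```

Making vm a constant equal to vs reproduces the limiting case in which ∆Ø takes only the two values ±π/2.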
2 Methods
Three observers, two of them authors, served in standard 2-AFC experiments. A white Gaussian masking noise, with a spectral density of 67 dB SPL and an 11-Hz bandwidth, was gated on in two observation intervals separated by a 600-ms pause. The centre frequency of the noise, fm, was 250 Hz, and the masker was gated for either 600 or 12 ms with 50- and 6-ms rise and fall times, respectively. A signal tone burst with the same temporal envelope occurred in one observation interval. The level, bandwidth, and centre frequency of the masker were fixed, but signal frequencies ranged from 100 to 400 Hz. The noise was always presented in phase at the ears, No; the signal was either in-phase, So, or 180° out-of-phase, Sπ. Trials were run in blocks of 50 with signals of fixed amplitude, frequency (fs), and interaural phase. The probability of the signal's being in the first interval was 0.5 on every trial. Lights indicated the beginning and end of the observation intervals, the answer and warning intervals, and the correct interval. For a given fs, the amplitude of the signal was varied to determine 8- or 9-point psychometric functions for each interaural-phase condition. Cumulative Gaussian functions were fit to the psychometric functions (Wichmann and Hill 2001a, b) and 75% correct levels and their ±1-standard-deviation confidence intervals extracted. Three stimuli were calculated in MATLAB for each trial: noise alone, noise-plus-signal, and noise-minus-signal. The largest stimulus filled the 16-bit dynamic range of the DACs of the System II Tucker-Davis equipment without audible clipping. Separate DACs were used for each ear, and their output passed through separate digital attenuators and earphone drivers to TDH-39 earphones mounted in Grason-Stadler 001 earmuffs with the liquid filling removed. The earphones operated in phase. (For the detection of low-intensity signals in quiet, the amplitude of the digital signal was fixed and the level at the earphones adjusted using the attenuators.) In the NoSo condition, the noise-plus-signal stimulus was presented to both ears; in the NoSπ condition, the noise-plus-signal stimulus was presented to one ear and the noise-minus-signal stimulus to the other.
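As a schematic of the stimulus construction (a simplified Python stand-in, not the authors' MATLAB code; the sample rate, signal level and the 300-Hz signal frequency are arbitrary examples), the same narrowband noise is presented diotically in both conditions, and the tone is either added at both ears (NoSo) or added at one ear and subtracted at the other (NoSπ):

```python
import numpy as np

def narrowband_noise(fs, dur, fc, bw, rng):
    """Gaussian noise restricted to fc +/- bw/2 via FFT masking (simplified)."""
    n = int(fs * dur)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, 1.0 / fs)
    spec[(f < fc - bw / 2) | (f > fc + bw / 2)] = 0.0
    x = np.fft.irfft(spec, n)
    return x / np.std(x)

fs, dur = 48000, 0.6
rng = np.random.default_rng(1)
t = np.arange(int(fs * dur)) / fs
noise = narrowband_noise(fs, dur, fc=250.0, bw=11.0, rng=rng)
signal = 0.1 * np.sin(2 * np.pi * 300.0 * t)             # hypothetical level and frequency

left_NoSo, right_NoSo = noise + signal, noise + signal    # same waveform at both ears
left_NoSpi, right_NoSpi = noise + signal, noise - signal  # signal inverted at one ear
```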
3 Results and Discussion
The top panel of Fig. 3 shows the 600-ms results for two observers; the signal amplitude at 75% correct (dB SPL) is plotted as a function of signal frequency: white symbols for NoSo, black for NoSπ. [Detection "thresholds" in quiet are also shown.] The NoSo results are broadly consistent with those of Zwicker and Henning (1984) and van de Par and Kohlrausch (2005); masking falls by 30 dB
Fig. 3 Detection thresholds (upper) and BMLD (lower) for the 600-ms stimulus
once fs is 25 Hz away from fm. Masking in NoSπ also falls as |∆f| increases, but not nearly as fast as in the NoSo case, leading, as Zwicker and Henning found, to the rapid reduction in the BMLD shown as open symbols in the bottom panel. Figure 4 is identical to Fig. 3 save that the stimulus duration was 12 ms. The observers behave the same way in the NoSo case (white symbols) but the
Fig. 4 Detection thresholds (upper) and BMLDs (lower) for 12-ms stimuli
reduction in masking as |∆f| increases is much slower than with 600-ms stimuli. (The differences in the observers' thresholds in quiet are also bigger.) There are also substantial differences between the observers in the NoSπ case (black symbols): the observer with the higher threshold in quiet is much better than the observer with the lower threshold. The effect on the BMLD is shown as the black symbols in the lower panel. For one observer, GBH, the BMLD remains relatively high, particularly for fs > fm; for IY, the BMLD falls less rapidly than with 600-ms signals but much more rapidly than for GBH. The third observer is still learning the task.
4 General Discussion
The change from 600 to 12 ms reduces 2WT to less than 1 (2 × 11 Hz × 0.012 s ≈ 0.26), with only one sample of the noise available per observation interval. When ∆f = 0, no envelope fluctuations occur and, in the NoSo case, the task should thus be harder than at 600 ms, where envelope cues could be used. In terms of signal amplitude, however, neither observer is worse. At both durations, the signal and masker have the same temporal envelope so that, although the 12-ms signal has a broader spectrum than the 600-ms signal, the noise is splattered the same way. It is difficult to imagine that the spectral spread of the masker accounts for increased effectiveness of the masker when ∆f is large, but given the results shown in Fig. 1, that is a possibility. Another possible interpretation is that envelope changes with increasing ∆f remain too small to be effective until ∆f > ~100 Hz for the 12-ms stimuli but might be used above 2 Hz with the 600-ms stimuli. That envelope changes caused by adding the signal are available to help in the latter condition, but not the former, is consistent with our results. Similar arguments for interaural cues apply in the NoSπ case: when ∆f is zero, only one sample of ∆Φ, or the associated ∆t, is available. The associated BMLD is 6 dB smaller than with the 600-ms stimuli but still substantial. With 600-ms stimuli, many more samples are available, but frozen noise (Zwicker and Henning 1985) and other experiments (van de Par and Kohlrausch 2005) suggest that it is not easy to see how information from different epochs of the stimulus is combined in the NoSπ case. Whatever the cues, they provide little help beyond those used in the NoSo case. With 12-ms stimuli these additional cues are not available until ∆f > 100 Hz. But the results with the 12-ms stimuli are difficult to interpret because the observers behave differently. In the NoSπ case, the performance of IY improves rapidly with increasing ∆f, leading to as rapid a fall in her BMLD as with 600-ms stimuli. The other observer's BMLD, however, does not fall as fast, partly because he is slightly worse than IY in the NoSo case and slightly better in NoSπ. In summary, the possibility that observers switch cues is difficult to preclude, and different conditions produce BMLDs in detection only when there are cues of different potency in the NoSo and NoSπ conditions. Lack of envelope fluctuations
may cause both the smaller BMLDs at 12 ms and their slow decrease as ∆f increases, because the NoSo thresholds are small when ∆f = 0 and drop slowly as ∆f increases. Inconsistent with some models of binaural "sluggishness", there is a substantial BMLD with 12-ms stimuli. Acknowledgements. This research was supported by the Wellcome Trust and the Lord Dowding Fund for Humane Research. IY was supported by Deafness Research UK.
References Bernstein LR, Trahiotis C, Akeroyd MA, Hartung K (2001) Sensitivity to brief changes of interaural time and interaural intensity. J Acoust Soc Am 109:1604–1615 Colburn HS, Durlach NI (1978) Models of binaural interaction. In: Carterette EC, Friedman MP (eds) Handbook of perception. Academic Press, New York Davenport WB Jr, Root WL (1958) An introduction to the theory of random signals and noise. McGraw-Hill, New York Durlach NI (1963) Equalization and cancellation theory of the binaural masking level difference. J Acoust Soc Am 35:1206–1218 Grantham DW, Wightman FL (1978) Detectability of varying interaural temporal differences. J Acoust Soc Am 63:511–523 Henning GB (1973) The effect of interaural phase on amplitude and frequency discrimination. J Acoust Soc Am 54:1160–1178 Jeffress LA (1972) Binaural signal detection: Vector theory. In: Tobias JV (ed) Foundations of modern auditory theory, vol II. Academic Press, New York, pp 349–368 McFadden D (1968) Masking-level differences determined with and without interaural disparities in masker intensity. J Acoust Soc Am 44:212–223 Papoulis A, Pillai SU (2002) Probability, random variables and stochastic processes. McGrawHill, New York Rice SO (1954) Mathematical analysis of random noise. In: Wax N (ed) Selected papers on noise and stochastic processes. Dover, New York, pp 133–294 Richards VM, Henning GB (1994) The effects of two types of level variation on the size of the binaural masking-level difference. J Acoust Am 95:3003 Richards VM, Nekrich RD (1993) The incorporation of level and level invariant cues for the detection of a tone added to noise. J Acoust Soc Am 94:2560–2574 van de Par S, Kohlrausch A (2005) The role of intrinsic masker fluctuations on the spectral spread of masking. Forum Acusticum, Budapest, pp 1635–1640 Wichmann FA, Hill NJ (2001a) The psychometric function: I. Fitting, sampling, and goodnessof-fit. Percept Psychophys 63:1293–1313 Wichmann FA, Hill NJ (2001b) The psychometric function: II. Bootstrap-based confidence intervals and sampling. Percept Psychophys 63:1314–1329 Witton C, Green GGR, Henning GB (2005) Binaural “sluggishness” as a function of stimulus bandwidth. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychophysics, and models. Springer, Berlin Heidelberg New York, pp 443–453 Zurek PM (1991) Probability distribution of interaural phase and level differences in binaural detection stimuli. J Acoust Soc Am 90:1927–1932 Zwicker E, Henning GB (1984) Binaural masking-level differences with tones masked by noises of various bandwidths and levels. Hear Res 14:179–183 Zwicker E, Henning GB (1985) The four factors leading to binaural masking-level differences. Hear Res 19:29–47
Comment by Carlyon
A simple reason for the pattern of results you report may lie in the phenomenon of comodulation masking release (CMR). When the narrowband-noise masker and the signal have the same frequency, their excitation patterns will overlap substantially. As the signal is moved off-frequency there will be portions of the excitation pattern that are excited by the masker but not by the signal. These portions could provide information on the masker envelope that the subject could use to detect the signal, e.g. by "dip listening" or by decorrelation in the envelope between different frequency regions. In the NoSo condition this information may reduce thresholds. In the NoSπ condition this cue may not be useful, if the additional cues produced by interaural decorrelation and CMR cannot be summed optimally.

Reply
It is an interesting speculation – that the modulation of a single band of masking noise, decoded in separate channels tuned to different bands of frequency, might allow comodulation masking release (CMR) for a tonal signal presented at a frequency remote from that of the noise masker – a sort of "self CMR". Such a mechanism could affect only the long-duration conditions (Figs. 1 and 3) because, with 12-ms bursts of 11-Hz-wide noise, the noise modulation, changing at a rate inversely proportional to the noise bandwidth, will barely be observable during the 12-ms stimulus presentation time. The postulated CMR, if it occurs, should be visible in the difference between the spread of masking observed with pure-tone maskers (Wegel and Lane 1924) and with narrowband noise maskers (Egan and Hake 1950) in that, because of the putative CMR, the results with noise maskers should fall below those with tonal maskers. This appears not to be the case for signal frequencies that differ by up to 600 Hz from that of the masker (Green 1976, Fig. 6.2, p 137), but it is not easy to interpret the data, not least because of the difficulty in equating masker effectiveness. The relative symmetry (linear x-axis) of our results with 600-ms low-frequency signals above and below the band of masking noise is not inconsistent with the CMR observations of Yasin et al. (2006), but the principal difficulty with the notion of "self CMR" is that the amplitude and phase response of channels tuned to different frequency bands will affect the amplitude and phase modulation of their responses to the noise; the modulation in the channel centred on the noise will be different in depth and timing from the modulation in the channel centred on the signal frequency because, even if the mechanisms are similar, the noise falls on the peak of one channel but on the skirt of the other. Thus we think the improvement in the NoSo case is not likely to be caused by "self CMR".
References Egan JP, Hake HW (1950) On the masking pattern of a simple auditory stimulus. J Acoust Soc Am 22:622–630 Green DM (1976) An introduction to hearing. Lawrence Erlbaum, Hillsdale, New Jersey Wegel RL, Lane CE (1924) The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear. Phys Rev 23:266–285 Yasin I, Fantini DA, Plack CJ (2006) Effect of flanker band number and excitation pattern symmetry on envelope comparisons in masking release. J Acoust Soc Am 120:3084
50 Perceptual and Physiological Characteristics of Binaural Sluggishness IDA SIVEKE1, STEPHAN D. EWERT2, AND LUTZ WIEGREBE1
1 Introduction
When the binaural system is presented with stimuli that change their interaural time delay (ITD) or interaural correlation over time, the ability to follow these changes diminishes even for relatively slow variation frequencies, in the region of 2–3 Hz (Blauert 1972). This feature of the response of the binaural system to changes in the interaural configuration of the stimulus is usually referred to as binaural sluggishness. Psychophysically, several studies have characterized the ability to follow ongoing changes in the interaural properties (time difference and correlation) of stimuli (Blauert 1972; Grantham 1982, 1986; Grantham and Wightman 1978, 1979; Boehnke et al. 2002). In addition, the ability to detect fluctuations in the interaural level difference (ILD) was characterized by Stellmack et al. (2005) and can be directly compared to the detection of monaural amplitude modulation (Viemeister 1979). For broadband stimuli, the shape of the threshold curve in response to sinusoidal amplitude modulation (SAM) could be fitted relatively well by a single-pole low-pass filter with a time constant of about 2.5 ms (Viemeister 1979). In contrast, the derivation of time constants from binaural data leads to a relatively rough estimate of about 50–200 ms (Grantham and Wightman 1978, 1979), an approximately doubled time constant for interaural compared to monaural modulation (Stellmack et al. 2005), and a threefold increase using a masking paradigm (Kollmeier and Gilkey 1990). However, Pollack (1978) suggested integration times for the detection of binaural disparities similar to the minimum integration times for monaural tasks. This binaural sluggishness is in strong contrast to the outstanding sensitivity of the binaural system to (static) interaural time disparities. While sensitivity to static interaural disparities is assumed to be limited by the neural processing in relatively early stages (Brand et al. 2002), binaural sluggishness is assumed to originate from more central stages of auditory processing. A recent physiological study of binaural sluggishness using the same sinusoidal correlation modulation (SCM) stimuli as Grantham (1982) showed

1 Neurobiologie, Biozentrum, Ludwig-Maximilians-Universität München, Germany, [email protected], [email protected]
2 Medizinische Physik, Fakultät V, Institut für Physik, Carl von Ossietzky Universität, Oldenburg, Germany, [email protected]
that at the level of the cat inferior colliculus (IC), neurons are capable of following SCMs well beyond the range of modulation frequencies perceived by humans (Joris et al. 2006). It was suggested that while the cochlear nucleus may be able to process fast modulations in the auditory-nerve rate response caused by SAM stimuli, the higher auditory system beyond the IC may lack a process to analyse fast modulations of IC-unit rate responses. The current study revisits both the psychophysics and the physiology of SCM stimuli and, using a new class of SCM stimuli, shows that the binaural system is not necessarily slower in response to temporal variations of interaural parameters than the monaural system is in response to temporal variations in level.
2 Methods
2.1 Stimuli
Psychophysical and neurophysiological sensitivity was measured for three types of SAM and SCM stimuli as a function of the modulation frequency (MF), which ranged in doublings from 2 Hz to 512 Hz or 1024 Hz. The stimuli were SAM noise and two types of SCM noise, referred to as 'Oscor' and 'Phase-warp' stimuli. The generation of the latter stimulus types is illustrated in Fig. 1. In short, Oscor stimuli are generated by adding two independent noise tokens after they have been multiplied by the sine and the cosine of a sinusoidal modulator, respectively. Phase-warp stimuli are, in essence, a broadband version of binaural-beat stimuli. They are generated in the frequency domain using a frequency-independent magnitude and a random phase, whereby the phase components for one ear are shifted along the frequency axis by an amount equal to the MF relative to the other ear. The lower row of Fig. 1 shows the interaural correlation of the stimuli as a function of ITD and time. Note that while the Oscor and the Phase-warp stimuli share the same MF of 4 Hz, the Phase-warp stimulus shows a pattern of correlation which is modulated both along the ITD and the time axis. The Oscor stimulus is modulated only along the time axis. None of the SCM stimuli produces monaural amplitude modulation. The standard stimulus was always interaurally uncorrelated noise. Stimulus duration was 1 s, including 20-ms raised-cosine ramps, and the stimuli were presented from a high-quality Creative Labs soundcard through AKG K271 headphones at an SPL of 60 dB.
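The two SCM stimulus types can be generated in a few lines. The sketch below is a simplified Python rendering of the description above, not the code used in the study; band limiting, calibration, ramping and the choice of which ear receives which waveform are assumptions, and the parameter values are examples only.

```python
import numpy as np

fs, dur, mf = 48000, 1.0, 4.0      # sample rate (Hz), duration (s), modulation frequency (Hz)
n = int(fs * dur)
t = np.arange(n) / fs
rng = np.random.default_rng(0)

# Oscor: one ear receives noise n1; the other receives n1 and n2 weighted by the
# sine and cosine of the modulator, so the interaural correlation oscillates at mf
# while the long-term monaural power stays constant (sin^2 + cos^2 = 1).
n1, n2 = rng.standard_normal(n), rng.standard_normal(n)
oscor_left = n1
oscor_right = n1 * np.sin(2 * np.pi * mf * t) + n2 * np.cos(2 * np.pi * mf * t)

# Phase-warp: flat magnitude spectrum with random phase; the phases for one ear
# are those for the other ear shifted along the frequency axis by mf
# (here mf bins, since the bin spacing of a 1-s stimulus is 1 Hz).
n_bins = n // 2 + 1
phase = rng.uniform(0.0, 2 * np.pi, n_bins)
shift_bins = int(round(mf * dur))
pw_left = np.fft.irfft(np.exp(1j * phase), n)
pw_right = np.fft.irfft(np.exp(1j * np.roll(phase, shift_bins)), n)
```

A running interaural cross-correlation of these waveforms (e.g. in 1-ms windows, as in Fig. 1) shows the correlation oscillating at 4 Hz in both cases, but only for the Phase-warp stimulus does the pattern also shift along the ITD axis.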
2.2 Listeners and Procedure
In an adaptive two-alternative forced-choice procedure, four listeners (two females, two males, aged 21–24) were required to detect the modulated stimuli following a three-down, one-up rule. The dependent variable was the relative level of uncorrelated noise added to the SAM or SCM stimulus.
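The three-down, one-up rule converges on the level corresponding to roughly 79% correct. A minimal sketch of the per-trial update is given below (Python); the 2-dB step size, starting level and stopping rule are illustrative assumptions rather than the parameters used in the experiment.

```python
def three_down_one_up(noise_level_db, correct, state, step_db=2.0):
    """One-trial update of a 3-down, 1-up adaptive track. Here the adapted
    variable is the level of uncorrelated noise added to the modulated
    stimulus; raising it makes the modulation harder to detect."""
    if correct:
        state["run"] += 1
        if state["run"] == 3:        # three correct in a row -> make the task harder
            state["run"] = 0
            noise_level_db += step_db
    else:                            # any error -> make the task easier
        state["run"] = 0
        noise_level_db -= step_db
    return noise_level_db, state

# usage (per trial): level, state = three_down_one_up(level, listener_was_correct, state)
```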
Fig. 1 Stimulus generation for SCM stimuli. The Phase-warp stimulus (left) is generated in the frequency domain with a uniform magnitude spectrum and a random phase spectrum, where the phase for the left ear is shifted by the MF along the frequency axis compared to the right ear. The Oscor stimulus is generated in the time domain using two noise sources as described in the text. The lower row shows an interaural correlogram for the stimuli with an MF of 4 Hz as a function of time and ITD. The different shades of gray indicate the value of the running interaural cross-correlation (1-ms window)
2.3 In Vivo Recordings
Single-unit recordings using glass electrodes were made from units with best frequencies (BFs) between 0.4 and 2.6 kHz in the dorsal nucleus of the lateral lemniscus (DNLL) of the Mongolian gerbil. The animals were anaesthetised by injection of physiological NaCl solution containing ketamine (20%) and Rompun (2%) (initial i.p. 0.5 ml/100 g body weight, sustained s.c. 0.05 ml/30 min). Sounds were presented in closed field using methods previously described (Siveke et al. 2006). Randomized broadband and narrowband (±10% around BF) noises were 200 ms long and presented every 300 ms with a cosine-squared gating rise-fall time of 1 ms to determine the level (20 dB
above the threshold of the rate-level function, 5-dB steps) and the best ITD (peak of the noise-delay function, 0.15-cycle steps) for the SAM and SCM stimuli. Responses to 10 repeats of the SAM and SCM stimuli with 1-s duration, a repetition period of 1.5 s and a cosine-squared gating rise-fall time of 20 ms were obtained. The vector strength (VS) for each MF was calculated according to Goldberg and Brown (1968) and counted as significant if the P < 0.05 criterion in the Rayleigh test was fulfilled (Batschelet 1991).
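Vector strength and the Rayleigh criterion can be computed directly from the spike times. The Python sketch below follows the standard definitions (vector strength as in Goldberg and Brown 1968; the large-sample Rayleigh approximation as in Batschelet 1991); it is an illustration, not the analysis code used in the study.

```python
import numpy as np

def vector_strength(spike_times_s, mod_freq_hz):
    """VS = |mean of exp(i*phase)|, with spike phases taken relative to the modulation period."""
    phases = 2 * np.pi * mod_freq_hz * np.asarray(spike_times_s)
    return float(np.abs(np.mean(np.exp(1j * phases))))

def rayleigh_significant(vs, n_spikes, alpha=0.05):
    """Rayleigh test for non-uniform phase: for large n, p is approximately exp(-n*VS^2)."""
    p = np.exp(-n_spikes * vs ** 2)
    return p < alpha, p

# usage: vs = vector_strength(spikes, mf); significant, p = rayleigh_significant(vs, len(spikes))
```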
3 Results
3.1 Psychophysics
Sensitivity is typically highest for an MF of 8 Hz, decreasing slightly for decreasing MF and decreasing in a stimulus-type-dependent manner with increasing MF. Data were converted to effective modulation depth and fitted with a first-order low-pass filter. The time constants of these filters were 2.5 ms for the SAM stimuli, 2.9 ms for the Phase-warp stimuli, and 15.9 ms for the Oscor stimuli (Fig. 2).
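A first-order (single-pole) low pass attenuates a modulation frequency f by 10·log10(1 + (2πfτ)²) dB, and the time constants above come from fitting τ to the data after conversion to effective modulation depth. The sketch below shows one way such a fit might be done (Python/SciPy); the data values and the conversion to effective modulation depth are hypothetical placeholders, not the measured thresholds.

```python
import numpy as np
from scipy.optimize import curve_fit

def lowpass_db(f_hz, tau_s, gain_db):
    """Transfer magnitude (dB) of a first-order low pass: |H(f)| = 1/sqrt(1+(2*pi*f*tau)^2)."""
    return gain_db - 10.0 * np.log10(1.0 + (2.0 * np.pi * f_hz * tau_s) ** 2)

# hypothetical effective modulation depths (dB re best sensitivity) vs modulation frequency
mf_hz = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)
depth_db = np.array([0.0, 0.0, -0.3, -0.7, -1.3, -4.0, -8.0, -14.0, -20.0])

(tau_s, gain_db), _ = curve_fit(lowpass_db, mf_hz, depth_db, p0=[0.003, 0.0])
print(f"fitted time constant: {tau_s * 1e3:.1f} ms")   # roughly 3 ms for these made-up data
```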
3.2 Neurophysiology
Responses were obtained from a total of 59 DNLL single cells. Responses to the SCM and SAM stimuli from one cell are shown in Fig. 3 for the two different filtering conditions.
Fig. 2 Sensitivity to SCM or SAM, expressed as the signal-to-noise level required to mask the modulation, plotted as a function of modulation frequency. Data are averaged across listeners. Error bars represent across-listener standard errors
Fig. 3 Responses of a DNLL cell with a BF of 1000 Hz to narrowband (upper row) and broadband (lower row) versions of SAM and SCM stimuli. Raster plots in response to a MF of 8 Hz are shown in the left row and extracted period histograms in the middle row. Responses are well locked to the modulator phase for all stimuli. The right row shows VS as a function of MF for the three stimulus types. Stars indicate significant VS
The averaged VS across the 28 cells which showed significant VS for at least one stimulus type is shown in Fig. 4. It clarifies the three common features revealed by ITD-sensitive DNLL neurons in response to the SCM and SAM stimuli: 1. Narrowband filtering impairs the encoding of high MFs for all stimulus types (solid lines vs dashed lines). The narrowband filtering imposes a low-pass filter in the monaural (SAM) and binaural (SCM) modulation-frequency domain. This modulation low-pass affects the dynamic properties of all stimuli similarly. Thus, the impaired encoding of high-frequency modulations does not reflect changes in binaural or monaural sluggishness. 2. Broadband stimulation decreases VS for MFs below 128 Hz for both SCM stimuli. This decrease is not observed for SAM stimuli. Typically, response
Fig. 4 Averaged VS as a function of modulation frequency. Different colours represent different stimuli; broadband stimuli are shown with dashed lines, narrowband with solid lines
rates were dramatically lower for broadband compared to narrowband stimulation. This decrease also impairs the VS measure. With SAM stimuli, however, this impairment may be counteracted by the shape of the modulation: as is evident from the raster plots in Fig. 3, the SAM responses have a higher 'duty cycle' than the SCM responses. This difference in duty cycle results from the amplitude modulation, which is sinusoidal along a linear amplitude axis. With this type of temporal excitation pattern, the smaller rate response with broadband stimulation does not impair the VS measure. 3. Responses to Oscor and Phase-warp stimuli are very similar (black vs dark-gray lines). This is discussed in more detail below.
4 Discussion
The current data show that psychophysical sensitivity to a modulation of interaural correlation depends on the stimulus type: The Phase-warp stimuli allow for a considerably improved sensitivity to SCM compared to the Oscor stimuli. The time constant describing the low-pass characteristic of SCM sensitivity with the Phase-warp stimulus is in the range of that describing SAM sensitivity, arguing against a sluggishness that is specific to the binaural system when detection or discrimination tasks of binaural disparities are considered. Instead, performance appears to be limited by similar constraints both in the SAM task and in the SCM task with Phase-warp stimuli: Many previous simulations of amplitude modulation sensitivity suggest that these constraints partly arise from the narrow peripheral filters which enforce frequency-dependent temporal integration (Dau et al. 1997; Ewert and Dau 2000; Stein et al. 2005).
The pronounced improvement in SCM sensitivity for Phase-warp compared to Oscor stimuli could be related to the shape of the interaural correlogram of the stimuli (Fig. 1). As pointed out above, the Oscor stimulus is modulated only along the time axis whereas the Phase-warp stimulus is modulated both along the ITD and the time axis. Physiologically, no difference between the temporal encoding of Oscor and Phase-warp stimuli was observed. The recorded DNLL cells are tuned to a specific ITD, and this tuning prevents them from profiting from the modulation along the ITD axis caused by the Phase-warp stimulus. Psychophysically, however, listeners profit considerably from the additional modulation along an internal ITD axis produced by the Phase-warp stimuli. These data indicate that the auditory display, as generated by ITD-sensitive units in the brainstem, has a temporal resolution which is limited only by monaural preprocessing. The Phase-warp stimulus maximises our capability to read out this display. Acknowledgments. We would like to thank Benedikt Grothe for technical support, intense discussions and critical comments on the manuscript. This work was supported by the 'Deutsche Forschungsgemeinschaft' and the 'Studienstiftung des deutschen Volkes'.
References Batschelet E (1991) Circular statistics in biology. Academic Press, London Blauert J (1972) On the lag of lateralization caused by interaural time and intensity differences. Audiology 11:265–270 Boehnke SE, Hall SE, Marquardt T (2002) Detection of static and dynamic changes in interaural correlation. J Acoust Soc Am 112:1617–1626 Brand A, Behrend O, Marquardt T, McAlpine D, Grothe B (2002) Precise inhibition is essential for microsecond interaural time difference coding. Nature 417:543–547 Dau T, Kollmeier B, Kohlrausch A (1997) Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J Acoust Soc Am 102:2906–2919 Ewert SD, Dau T (2000) Characterizing frequency selectivity for envelope fluctuations. J Acoust Soc Am 108:1181–1196 Goldberg JM, Brown PB (1968) Functional organization of the dog superior olivary complex: an anatomical and electrophysiological study. J Neurophysiol 31:639–656 Grantham DW (1982) Detectability of time-varying interaural correlation in narrow-band noise stimuli. J Acoust Soc Am 72:1178–1184 Grantham DW (1986) Detection and discrimination of simulated motion of auditory targets in the horizontal plane. J Acoust Soc Am 79:1939–1949 Grantham DW, Wightman FL (1978) Detectability of varying interaural temporal differences. J Acoust Soc Am 63:511–523 Grantham DW, Wightman FL (1979) Detectability of a pulsed tone in the presence of a masker with time-varying interaural correlation. J Acoust Soc Am 65:1509–1517 Joris PX, van de Sande B, Recio-Spinoso A, van der Heijden M (2006) Auditory midbrain and nerve responses to sinusoidal variations in interaural correlation. J Neurosci 26:279–289 Kollmeier B, Gilkey RH (1990) Binaural forward and backward masking: evidence for sluggishness in binaural detection. J Acoust Soc Am 87:1709–1719
Pollack I (1978) Temporal switching between binaural information sources. J Acoust Soc Am 63:550–558 Siveke I, Pecka M, Seidl AH, Baudoux S, Grothe B (2006) Binaural response properties of low-frequency neurons in the gerbil dorsal nucleus of the lateral lemniscus. J Neurophysiol 96(3):1425–1440 Stein A, Ewert SD, Wiegrebe L (2005) Perceptual interaction between carrier periodicity and amplitude-modulation in broadband stimuli: a comparison of the autocorrelation and modulation filterbank model. J Acoust Soc Am 118(4):2470–2481 Stellmack MA, Viemeister NF, Byrne AJ (2005) Monaural and interaural temporal modulation transfer functions measured with 5-kHz carriers. J Acoust Soc Am 118:2507–2518 Viemeister NF (1979) Temporal modulation transfer functions based upon modulation thresholds. J Acoust Soc Am 66:1364–1380
51 Precedence-Effect with Cochlear Implant Simulation BERNHARD U. SEEBER1 AND ERVIN HAFTER
1 Introduction
Cochlear implants (CIs) help many patients to understand speech in quiet and in acoustically dry environments. However, patients still encounter great difficulties in speech-in-noise situations and in reverberation. The precedence-effect paradigm can be used to study the impact of reflections on perception. It describes the perceptual suppression of a delayed copy of a sound in the presence of the leading sound. From the view of auditory scene analysis, precedence can be seen as the inability to segregate the lead and lag sounds into two separate objects. Previous studies have shown that CI-patients rely on interaural level differences (ILDs) for localization while mostly ignoring interaural time difference cues (ITDs) (Seeber and Fastl 2004). Since ITDs at low frequencies play an important role for precedence as well as for localization in normal hearing, the goal was to see if the altered cues with CIs carry enough information for precedence. Results from CI-patients show no precedence and two different outcomes: an immediate breakup into two images for short lead-lag delays, or the localization of a single image even for longer delays (Seeber and Hafter 2006). The purpose of the present study was to investigate whether similar results could be obtained from normal-hearing subjects who listen to precedence stimuli processed with a bilateral CI-simulation. In this way the influence of the reduction of monaural and binaural information by a noise-vocoder CI-simulation can be studied in precedence situations. The study is intended to take a broad look at precedence in normal-hearing subjects that is not biased by past experience. Therefore, several ongoing sounds are studied.
Auditory Perception Lab, Department of Psychology, University of California at Berkeley, USA, [email protected], [email protected]
1 Now at MRC Institute of Hearing Research, University Park, Nottingham, UK
2 Methods
2.1 The Simulated Open-Field Environment and Virtual Acoustics
Experiments were done in the Simulated Open-Field Environment, a multichannel listening environment with 48 loudspeakers placed in an anechoic chamber. All loudspeakers are mounted horizontally at ear level with 7.5° spacing. The loudspeakers are equalized to give phase-corrected and frequency-independent responses over 300 Hz to 10 kHz at the center position of the subject's head. Parallel playback on all 48 loudspeakers is possible through customized Matlab software. Software to simulate reflections in arbitrarily shaped rooms allows for rendering of sources in accord with the direction of incidence of the reflections at the subject's head. The advantage of using the free-field approach for sound playback is that normal-hearing subjects as well as subjects with hearing devices can be tested in exactly the same environment, which allows for accurate comparisons. In addition to the loudspeakers, headphones were used inside the anechoic chamber for comparison of localization and precedence between headphone and free-field presentation with matched methods. Experiments 2–4 used headphones in connection with virtual acoustics. Head-related transfer functions (HRTFs) for virtual directional presentation were individually selected from a catalog of non-individual HRTFs, a selection which yields HRTFs that minimize localization variance and maximize externalization (Seeber and Fastl 2003).
2.2 Implementation of the ProDePo-Localization Method
A visual renderer allowed for real-time manipulation of the position and synchronized control of visual objects in a 2D projection of a virtual 3D scene. This visual renderer ran on a separate computer under direct control from the audio renderer. The visual scene was projected in front of the subject on an acoustically transparent curtain. Subjects indicated their localization using the ProDePo light-pointer method by adjusting a light spot to the perceived sound location. The light spot appeared 0.5 s after the sound ceased at 0° in front of the subjects. It could be moved on a horizontal track within a range of ±37° by turning a trackball. The localized direction was confirmed by pressing a button on the trackball. Since the subject's hand or head position is not directly involved in the pointing process, the method is called the Proprioception Decoupled Pointer (ProDePo) (Seeber 2002). A lateralization method was used in experiments 3 and 4, where stimuli were perceived to be inside the head. A horizontal white line was projected in front of the listener, terminated by vertical strips labeled "left ear" and "right ear". To display the lateralized position, a red bar that appeared 0.5 s after the sound terminated provided the visual pointer for use with the
ProDePo method. Subjects were instructed to place the bar at a position corresponding to the perceived position of the sound on the interaural axis.
2.3 Extension of the Noise-Band Vocoder to Binaural Presentation
A 16-channel noise-band vocoder was used to simulate CI-processing. The precedence sounds were first filtered with HRTFs. In the vocoder, the HRTF-processed sound was band-pass filtered into 16 logarithmically spaced channels between 300 Hz and 8 kHz. The channel envelopes were computed by rectification and low-pass filtering at 200 Hz. The envelopes were computed independently for both ears and applied to noise bands (the carrier, or synthesis noise) of varying interaural correlation. The interaural correlation of the carrier noise is an important additional parameter of the binaural CI-simulation since it determines the compactness of the auditory image independent of the applied envelope. In a way it represents the interaural correlation of the carrier pulses in CIs. Noise was chosen as the carrier since it evokes a pitch corresponding to the region of excitation on the tonotopic axis, with weak pitch strength. Temporal modulation of noise bands can also evoke a temporal pitch sensation, similar to modulating pulse trains in electric hearing. Other carriers, like sinusoids, produce a clear pitch which does not represent perception of the pulse-train carriers used with current CIs. In addition, the harmonicity present in log-spaced sinusoids in the CI-simulation might interfere with perceptual across-channel grouping processes.
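A reduced version of such a binaural noise-band vocoder is sketched below in Python/SciPy. It follows the description above (16 log-spaced bands between 300 Hz and 8 kHz, per-ear envelopes from rectification and 200-Hz low-pass filtering, envelopes applied to noise carriers whose interaural correlation is a free parameter), but the filter orders, the half-wave rectification and other details are our assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def vocode(left, right, fs, n_bands=16, f_lo=300.0, f_hi=8000.0,
           carrier_rho=0.0, seed=0):
    """Binaural noise-band vocoder sketch. carrier_rho sets the interaural
    correlation of the carrier noise (0 = uncorrelated, 1 = identical)."""
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    env_lp = butter(2, 200.0, btype="low", fs=fs, output="sos")
    out_l, out_r = np.zeros_like(left), np.zeros_like(right)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env_l = sosfiltfilt(env_lp, np.maximum(sosfiltfilt(band, left), 0.0))
        env_r = sosfiltfilt(env_lp, np.maximum(sosfiltfilt(band, right), 0.0))
        a, b = rng.standard_normal(left.size), rng.standard_normal(left.size)
        carrier_l = sosfiltfilt(band, a)  # band-limited carrier noise
        carrier_r = sosfiltfilt(band, carrier_rho * a + np.sqrt(1.0 - carrier_rho ** 2) * b)
        out_l += env_l * carrier_l
        out_r += env_r * carrier_r
    return out_l, out_r
```

For experiment 4 (Sect. 2.4), env_l and env_r would additionally be quantized into 1.5-ms frames before being applied to the carriers.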
2.4 Experimental Paradigms
The precedence effect was investigated in a localization dominance experiment. The lead and lag sounds were played from ±30°, with a probability of 0.5 that the lead would be on the left. The lead-lag delay was varied within 0–48 ms, with the upper limit depending on the stimulus. In one experimental session subjects were instructed to localize the leftmost of the sound images if they perceived more than a single image. Randomizing the side of lead and lag sounds on every trial thus meant that subjects responded to the lead on one half of the trials and the lag on the other half, without having to determine which of the sounds came first. Pointing biases were reduced by localizing the rightmost sound in separate sessions. In other conditions subjects were asked to localize the most dominant or the weakest image. Four experimental paradigms were studied:
1. Precedence was studied in the free field with two loudspeakers of the Simulated Open-Field Environment placed at ±30°. The ProDePo method was used.
2. Using identical methods, precedence was studied with virtual acoustics based on subjectively selected non-individual HRTFs (Sect. 2.1).
3. The precedence stimuli of experiment 2 were processed through a binaural noise-band vocoder to simulate CI-processing (Sect. 2.3). Lateralization was measured with a line-dissection method.
4. A CI-simulation similar to experiment 3 was used, but channel envelopes were quantized in 1.5-ms steps before being applied to the carrier noise. This was done to reduce the impact of ITDs in the envelope (see the sketch after this list).
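The envelope quantization in paradigm 4 simply holds each channel envelope constant over consecutive 1.5-ms frames, so that ongoing envelope ITD cues are coarsened. A minimal Python sketch is shown below; frame alignment and the use of the frame mean (rather than, say, the frame onset value) are assumptions.

```python
import numpy as np

def quantize_envelope(env, fs, step_s=0.0015):
    """Hold the envelope constant over frames of step_s seconds, reducing the
    fidelity of interaural time differences carried by the envelope."""
    hop = max(1, int(round(step_s * fs)))
    out = np.empty_like(env)
    for start in range(0, len(env), hop):
        out[start:start + hop] = env[start:start + hop].mean()
    return out
```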
2.5 Subjects and Stimuli
Five normal hearing subjects (<20 dB HL in 300 Hz to 10 kHz) participated in the study, but results are only shown for one subject (female, age 29 years). Three stimuli were used: (1) a burst of white noise (10 ms duration, 300 Hz to 10 kHz), (2) a low-pass noise (10 ms, cut-off at 770 Hz, but playback/vocoder high-pass at 300 Hz), and (3) the spoken CVC-word “shape” (female speaker). The level was roved in 2-dB steps within ±6 dB from a base level of 60 dB(A) (55 dB(A) for the CVC). For each sound 10 trials were collected each for the leads at −30° and +30°.
3 Results and Discussion
Figures 1 and 2 compare precedence results for experiment 1 in the free-field and for the identical experiment 2 with virtual acoustics. Data are displayed for one subject for the CVC-word “shape”. Both experiments show similar results: summing localization for zero delay between lead and lag, the well known localization dominance of the lead for short delays, and a split into two perceived images at the lead and lag locations for larger delays (Blauert 1997). When subjects were instructed to point to the dominant image, they always pointed to the lead location for all delays, whereas the instruction to point to the weakest image corresponded to the lag image for delays larger than the echo-threshold. Although not observable in Figs. 1 and 2, slightly higher variance and some localization bias are visible for most subjects with non-individual HRTFs in experiment 2 compared to the free-field presentation in experiment 1. Another difference seems to be a slight decrease in echo-threshold with virtual acoustics, i.e. two auditory objects were perceived instead of one for slightly shorter delay times. The reasons for the reduction of echo-thresholds are not known. One speculative cause might be a misrepresentation of localization cues caused by the use of nonindividual HRTFs relative to learned individual cues. Another cause might
Fig. 1 Results of experiment 1 on precedence in the free-field. Data from one subject are presented for the CVC-word “shape”. Scattered localization response data and medians are shown as a function of the lead-lag delay time in the precedence experiment. In different sessions the subject was instructed to respond either to the image on the rightmost (+, here: leading), the leftmost (◊, lagging), or the dominant (*) sound image if two images were heard. In the experiment the lead was played randomly from ±30°, but for clarity in the picture the lead is plotted at +30° and the lag at −30°. Data plotted at the lead (+) were combined from data for pointing to the rightmost image if the lead was on the right and from side-inverted data for pointing to the leftmost image if the lead was on the left. Data for the lag (◊) were combined in a similar way
Fig. 2 Results of experiment 2 on precedence with virtual acoustics based on selected non-individual HRTFs. Data are presented from one subject for the word “shape”. Scattered localization response data and medians are shown as a function of the lead-lag delay time in the precedence experiment. In different sessions the subject was instructed to respond either to the image on the rightmost (+, here: leading), the leftmost (◊, lagging), the dominant (*), or the weakest (■ ■) sound image if two images were heard. Results from pointing to either the left or rightmost image were combined and plotted as in Fig. 1
be that the visual scale is used in a different way for pointing to sounds that are not well externalized with HRTFs. The results for the two other stimuli (wide-band and low-pass noise bursts) show a similar pattern of summing localization and precedence as that found for the word, but echo-thresholds were shorter. Experiment 3 investigated effects of CI-simulation on precedence stimuli. Figure 3 shows lateralization results for a short wide-band noise burst. For short delays, summing localization provides a single image lateralized between the ears. Unlike the unprocessed case, this image is still mostly centered at 0.5 ms delay and not lateralized towards the lead. For larger delays an image is heard at the lead and described as the dominant one. At 2 ms delay, two images start to appear and lead and lag images are clearly separated at 4 ms delay. The lag image appears partly suppressed for delays between 0.5 and 4 ms. Despite the processing it is described as the weakest image. The suppression of the lag as well as the dominance of the lead suggest that limited precedence occurs. Since the simulation does not transmit ITDs in the noise carrier precedence must solely be based on ILDs at high frequencies and ITDs in the envelope. These cues also seem to be sufficient to evoke summing localization at short delays. The fact that the single image in summing localization at short delays moves so slowly towards the lead with increasing delay time can be attributed to the low-pass filtering of the envelope in the CI-simulation. The low-pass filtering temporally smears the
Fig. 3 Results of experiment 3 for precedence with CI-simulation with interaurally uncorrelated carrier noise. The stimulus was a wide-band noise burst of 10 ms duration. Lateralization was measured with the ears depicted by ±1. Results from pointing to either the left or rightmost sound image were combined and plotted similar to Fig. 1, with the leading sound at the right. Figure 2 shows the legend
information at both ears and interaural cues point to the center for an extended period of time. Figure 4 shows results of experiment 3 with a processed low-pass noise burst. The results are in general similar to those from the wide-band noise burst, with two main differences: (1) Summing localization seems not to occur at zero delay between lead and lag; instead two images are reported. (2) For a delay of 0.5 ms “anomalous localization” is seen in a combined image lateralized towards the lag (Gaskell 1983). For the low-pass noise, the simulation confines binaural information to ITDs in the envelope. The envelope-ITDs seem to be sufficient to evoke two perceived images at longer delays, with a slightly more dominant image at the lead. The breakdown of summing localization at zero delay might be due to an inability of the auditory system to integrate two independent low-pass noises at both ears into a single image, despite the common envelope modulation of the signal. It seems that the envelope modulation inherent in the narrow-band noise carriers is a stronger cue for two images than the additional modulation imposed by the signal. “Anomalous localization” towards the lag has been reported to be based on interaural level or phase cues (Gaskell 1983; Tollin and Henning 1999). Because the carrier noise is interaurally uncorrelated, we assume that the effect here is based on ILDs. The strong occurrence of “anomalous localization” is surprising, since the noise covers several frequency bands (300–770 Hz) and correct ITDs should still be present in the envelope.
Fig. 4 Results of experiment 3 for precedence with CI-simulation with interaurally uncorrelated carrier noise, but for a 770 Hz low-pass noise burst of 10 ms duration. Lateralization was measured with the ears depicted by ±1. Results from pointing to either the left or rightmost image were combined and plotted similar to Fig. 1 with the leading sound at the right/top. The legend is given in Fig. 2
Precedence for the CVC “shape” cannot be seen for the selected subject at any delay and for any correlation of the carrier noise (Figs. 5 and 6). It appears that with an uncorrelated carrier (Fig. 5) the lag image is always audible, even at very short delay times. Even though there is no apparent precedence, the lead is reported to be the dominant and the lag the weaker image. Other subjects show more responses at the lead, but many responses are again also present at the lag. The correlated carrier in Fig. 6 centralizes and fuses lead and lag images for short delays. However, two images are already heard at 5 ms delay, which is a far smaller echo-threshold than in the unprocessed case (cf. Fig. 1). Thus it seems that the decorrelation produced by the envelope modulation of the signal is sufficient to suggest two images despite the fusing effect of the carrier. The reasons for this breakdown of precedence are not clear at present. Apparently, the breakdown occurs mostly for ongoing sounds, which suggests a change in auditory scene analysis. Two hypotheses can be stated:
1. The incorrect ITDs from the carrier noise at low frequencies and the natural ILDs at high frequencies point to different locations and thus suggest two images. However, for isolated sound sources, across-channel grouping still functions and CI-listeners hear a single image (Seeber et al. 2004).
2. The missing pitch information prevents across-channel grouping, which leads to a split into two images. Pitch and harmonicity information serve as the strongest cues for combining auditory objects, but they are not well represented either in CI-listeners or in the simulation (Culling and Darwin 1993).
Fig. 5 Results for precedence with CI-simulation with interaurally uncorrelated carrier noise (experiment 3), but for the word “shape”. Presentation as in Fig. 3
Fig. 6 Results of experiment 3 for precedence with CI-simulation for the word “shape”, but with correlated carrier noise. Presentation as in Fig. 3
In experiment 4 the impact of envelope-ITDs on precedence was assessed by temporally quantizing the envelope. No changes in localization occurred compared to experiment 3, which suggests a restricted influence of envelope ITDs. Thus, precedence seems to be predominantly based on ILD cues or coarse onset cues.
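A minimal sketch of the envelope quantization used in experiment 4 is given below. The 1.5-ms step size comes from the text; the sample-and-hold implementation, the sampling rate, and the function name are assumptions and do not reproduce the authors' vocoder channel processing.

import numpy as np

def quantize_envelope(env, fs, step_ms=1.5):
    # Hold each channel envelope constant over consecutive step_ms windows
    # (sample-and-hold), which removes fine envelope timing such as envelope
    # ITDs before the envelope is re-applied to the carrier noise.
    step = max(1, int(round(step_ms * 1e-3 * fs)))
    out = np.copy(env)
    for start in range(0, len(env), step):
        out[start:start + step] = env[start:start + step].mean()
    return out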
4 Conclusions
The extension of CI-simulation to binaural stimuli provides an interesting new way to study the relative importance of monaural and binaural cues in precedence. Despite obvious limitations of the simple CI-simulation for the prediction of CI-patient performance, the simulation results are congruent with results on precedence from some CI-listeners. Because the simulation results show precedence for short sounds but strongly reduced precedence for longer sounds, we assume that the test reveals limitations of the auditory system in combining multiple cues to form auditory objects, in the sense of auditory scene analysis, rather than limitations purely in the precedence mechanism. If this assumption proves correct, the study of object separation in a precedence setting might be meaningful for concurrent speech segregation. Acknowledgements. We gratefully acknowledge support by NIH RO1 DCD 00087 and NOHR grant 018750 (patient studies).
References Blauert J (1997) Spatial hearing. MIT Press, Cambridge, USA Culling JF, Darwin CJ (1993) Perceptual separation of simultaneous vowels: within and acrossformant grouping. J Acoust Soc Am 93:3454–3467 Gaskell H (1983) The precedence effect. Hear Res 12:277–303 Seeber B (2002) A new method for localization studies. Acta Acust Acust 88:446–450 Seeber BU, Fastl H (2003) Subjective selection of non-individual head-related transfer functions. In: Brazil E, Shinn-Cunningham B (eds) Proc 9th Int Conf on Aud Display. Boston University Publications Prod Dept, Boston, USA, pp 259–262 Seeber B, Fastl H (2004) Localization cues with bilateral cochlear implants investigated in virtual space – a case study. Proc Joint Congress CFA/DAGA’04, Strasbourg, France, 22.-25.03.2004, vol I. Dt Ges f Akustik, Oldenburg, pp 213–214 Seeber B, Hafter E (2006) Precedence effect with cochlear implants – simulation and results. In: Santi PA (ed) Abstracts of the 29th Annual Midwinter Meeting. Assoc Res Otolaryngol, p 150 Seeber B, Baumann U, Fastl H (2004) Localization ability with bimodal hearing aids and bilateral cochlear implants. J Acoust Soc Am 116:1698–1709 Tollin DJ, Henning GB (1999) Some aspects of the lateralization of echoed sound in man. II. The role of the stimulus spectrum. J Acoust Soc Am 105:838–849
52 Enhanced Processing of Interaural Temporal Disparities at High-Frequencies: Beyond Transposed Stimuli LESLIE R. BERNSTEIN AND CONSTANTINE TRAHIOTIS
1 Introduction
At the previous two ISH meetings and in subsequent publications we reported that the processing of interaural temporal disparities (ITDs) within high-frequency auditory channels is enhanced by the use of “transposed” stimuli (Bernstein and Trahiotis 2002, 2003, 2004, 2005). Transposed stimuli are designed to provide envelope-based ITD-information within highfrequency channels similar to the ITD-information provided by the waveform itself within low-frequency channels. The enhancement occurred both in terms of better resolution of ITDs and larger extents of ITD-based laterality, as compared to those measured with conventional high-frequency stimuli. Transposed stimuli also exhibited resistance to the types of binaural interference found with conventional stimuli when remote, low-frequency stimulation occurred simultaneously with the high-frequency stimulation conveying the ITD. This presentation reports initial attempts to discern which particular aspects of the envelopes of high-frequency waveforms are sufficient to yield enhanced ITD-processing. We report data collected using what we refer to as “raised-sine” stimuli, which were recently described and employed by John et al. (2002) in research concerning auditory evoked potentials. Their method permits one to control the temporal characteristics of the envelopes of high-frequency waveforms by independently varying modulation frequency, modulation depth, and “dead-time/relative peakedness,” while also suitably restricting the spectral content of the stimulus. The patterning of the behavioral data collected with such stimuli reveals graded amounts of enhancement of ITD-processing as the characteristics of the envelope are changed, in a graded manner, from those characteristic of conventional stimuli toward those characteristic of transposed stimuli. This general outcome was observed in ITD-discrimination and ITD-based lateralization experiments.
Departments of Neuroscience and Surgery (Otolaryngology), University of Connecticut Health Center, Farmington, Connecticut, USA, [email protected], [email protected]
2 “Raised-Sine” Stimuli
The method of John et al. (2002) entails raising a DC-shifted modulator to an exponent greater than or equal to 1.0 prior to multiplication with a carrier. For the case of sinusoidal modulation, the equation used to generate such stimuli is
y(t) = sin(2π f_c t) · [2m (((1 + sin(2π f_m t)) / 2)^n − 0.5) + 1]
where f_c is the frequency of the carrier, f_m is the frequency of the modulator, m is the modulation index, and n is the exponent to which the DC-shifted modulator is raised. The left side of Fig. 1 depicts time-waveforms for instances when a 128-Hz modulating tone was raised to the power 1, 2, 4, or 8 prior to multiplication with a 4-kHz carrier. As shown in the top row, an exponent of 1.0 yields a conventional SAM waveform. The bottom row of the figure depicts the time-waveform of a 128-Hz tone transposed to 4 kHz. Examination of the figure reveals that the dead-time/relative peakedness of the envelope increases directly with the value of the exponent to which the modulator is raised. The right side of the figure displays the long-term spectrum of each stimulus. Note that, for the raised-sine stimuli, the number of sidebands and their spectral extent increase directly with the value of the exponent. Still, for all of the stimuli depicted, the vast majority of the power (greater than 95%) would fall within the approximately 500-Hz-wide auditory filter centered at 4 kHz (Moore 1997).
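The raised-sine stimuli (and, for comparison, the transposed stimuli described in the next section) can be generated directly from these definitions. The Python sketch below is illustrative only; the sampling rate, the fourth-order Butterworth low-pass used for the transposed modulator, and the function names are assumptions rather than the authors' implementation.

import numpy as np
from scipy.signal import butter, filtfilt

def raised_sine(fc, fm, m, n, dur, fs):
    # Raised-sine stimulus of John et al. (2002):
    # y(t) = sin(2*pi*fc*t) * (2*m*(((1 + sin(2*pi*fm*t))/2)**n - 0.5) + 1).
    # n = 1 gives a conventional SAM tone; larger n increases the envelope
    # dead-time/peakedness while keeping modulation rate and depth fixed.
    t = np.arange(int(dur * fs)) / fs
    env = 2 * m * (((1 + np.sin(2 * np.pi * fm * t)) / 2) ** n - 0.5) + 1
    return np.sin(2 * np.pi * fc * t) * env

def transposed(fc, fm, dur, fs, cutoff=2000.0):
    # Transposed stimulus: half-wave rectified low-frequency tone, low-pass
    # filtered (2-kHz cutoff), multiplied by a high-frequency carrier
    # (Bernstein and Trahiotis 2002); the filter order here is an assumption.
    t = np.arange(int(dur * fs)) / fs
    mod = np.maximum(np.sin(2 * np.pi * fm * t), 0.0)
    b, a = butter(4, cutoff / (fs / 2))
    return np.sin(2 * np.pi * fc * t) * filtfilt(b, a, mod)

fs = 48000
sam = raised_sine(4000, 128, 1.0, 1, 0.3, fs)    # exponent 1: SAM tone
rs8 = raised_sine(4000, 128, 1.0, 8, 0.3, fs)    # exponent 8: peaked envelope
trans = transposed(4000, 128, 0.3, fs)           # 128-Hz tone transposed to 4 kHz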
3 Threshold ITDs
Threshold-ITDs were measured with a two-cue, two-alternative, forced-choice adaptive task targeting 71% correct. The stimuli employed were all of those depicted in Fig. 1. Transposed stimuli were generated by multiplying rectified, low-pass filtered (2-kHz cutoff) low-frequency tones by high-frequency carriers (Bernstein and Trahiotis 2002). All stimuli were 300 ms long and were presented at an overall level of 70 dB SPL. Figure 2 displays threshold-ITDs obtained from one well-practiced listener whose results were typical of those measured with other listeners. Note that the threshold-ITDs obtained with the raised-sine stimuli were all smaller than those measured with the SAM tone, decreased with increases in the exponent, and, with an exponent of eight, approximated the threshold obtained with the 128-Hz tone transposed to 4 kHz. Because all of the stimuli had the same rates and depths of modulation, the differing thresholds might be attributable to the differences in the dead-time/peakedness of the envelopes, which increased with increases of the exponent. It is also logically possible that the differences in threshold-ITDs stem from differences among the spectra of the stimuli, per se. In order to test this notion,
Fig. 1 Left-hand panels: 50-ms epochs of the resulting time-waveforms when a 128-Hz modulating tone was raised to the exponents 1 (i.e., a SAM waveform), 2, 4, and 8 prior to multiplication with a 4-kHz carrier. The bottom row depicts a transposed stimulus. Right-hand panels: the corresponding long-term power spectra of time-waveforms at left
Fig. 2 Threshold-ITDs for SAM, raised-sine, and transposed stimuli (see Fig. 1)
we measured threshold-ITDs from a single, well-practiced listener (the first author) with the conventional SAM tone depicted in Fig. 1 and two raised-sine stimuli having an exponent of four. One of the raised-sine stimuli was the stimulus depicted in Fig. 1 and the second was identical to the first, save for the fact that its phase-spectrum was randomized. For this listener, the threshold-ITD for the SAM tone was 120 µs while the threshold for the raised-sine stimulus depicted in Fig. 1 was 42 µs. When the phase-spectrum of the raised-sine stimulus was randomized, however, the threshold-ITD increased to 116 µs. That is, the increased sensitivity to ITD produced by the raised-sine stimulus, as compared to that obtained with the SAM tone, was eliminated. This outcome strongly suggests that the advantages in detection measured with the raised-sine stimuli as compared to the SAM stimulus do not result from changes in spectral content, per se.
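The phase-randomization control can be sketched as follows. The text does not specify the exact procedure, so the rFFT-based manipulation below is one common way to randomize the phase spectrum while leaving the magnitude spectrum untouched; the function name and details are assumptions.

import numpy as np

def randomize_phase(x, rng=None):
    # Same magnitude spectrum as x, but with a random phase spectrum.
    # Using rfft/irfft keeps the output real-valued.
    rng = np.random.default_rng() if rng is None else rng
    X = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, X.shape)
    phases[0] = 0.0                      # keep the DC bin real
    if len(x) % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist bin real (even length)
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), n=len(x))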
4 Extents of Laterality
Figure 3 displays the results of another preliminary study employing another well-practiced listener and the raised-sine stimuli described above. The data were gathered using an acoustic pointing task in which the listener adjusted the interaural intensitive difference (IID) of a 200-Hz-wide band of noise centered at 500 Hz so that its intracranial position matched that of the raised-sine
Fig. 3 Extent of laterality (measured in terms of the IID of an acoustic pointer) as a function of the exponent of the raised-sine stimulus. The dashed line indicates the IID obtained with a transposed tone
stimulus. In the figure, the IID of the pointer is plotted as a function of the exponent of the raised-sine stimulus. The ITD was fixed at 600 µs, a value close to the maximum ITD encountered by human listeners in “natural” acoustic settings. The data indicate that extent of laterality increases with the exponent of the raised-sine stimulus. The raised-sine having an exponent of eight was matched with an IID of the pointer of a little more than 6 dB. According to the listener’s verbal report, this corresponded to an intracranial image located about midway between the eye and the ear. The dashed line indicates the IID of the pointer required to match a transposed sine having the same rate of modulation as the raised-sine stimuli (128 Hz). Clearly, raised-sine stimuli not only yield lower threshold-ITDs, they also yield enhanced extents of laterality. Both sets of measures indicate that the potency of ITDs increases in a graded manner with increases in the exponent of the raised-sine. In our view, it is important to note, and to emphasize in particular, that the cross-correlogram of the envelopes of all of these stimuli, considered as external inputs, would be characterized by a peak at 600 µs, the ITD imposed on the physical stimuli.
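The envelope cross-correlogram argument can be made concrete with a short sketch that extracts the Hilbert envelopes of the two ear signals and locates the lag of their cross-correlation peak; for any of the stimuli above with a 600-µs imposed ITD, that peak falls at 600 µs. This is an illustrative computation, not the authors' analysis, and the function name is an assumption.

import numpy as np
from scipy.signal import hilbert

def envelope_xcorr_peak(left, right, fs):
    # Lag (in seconds) at which the cross-correlation of the two Hilbert
    # envelopes reaches its maximum; the mean is removed from each envelope
    # so that the peak reflects envelope modulation rather than DC offset.
    env_l = np.abs(hilbert(left)) - np.mean(np.abs(hilbert(left)))
    env_r = np.abs(hilbert(right)) - np.mean(np.abs(hilbert(right)))
    xcorr = np.correlate(env_l, env_r, mode="full")
    lag = np.argmax(xcorr) - (len(env_r) - 1)
    return lag / fs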
5 Binaural Interference
Binaural interference refers to the degradation in sensitivity to interaural disparities that occurs for a stimulus in one spectral region (the region of the “target”) produced by the presence of a (typically diotic) stimulus in a second,
remote spectral region (the region of the “interferer”) (e.g., McFadden and Pasanen 1976; Buell and Hafter 1991; Stellmack and Dye 1993; Bernstein and Trahiotis 1995; Heller and Trahiotis 1995, 1996). The term interference is used to differentiate binaural interference phenomena from other phenomena, such as “energetic” masking, that also produce degradations of sensitivity. It appears that binaural interference is a manifestation of “central” effects as opposed to reflecting monaural peripheral interactions or phenomena (e.g., Heller and Trahiotis 1995). Our recent findings with transposed stimuli suggest that much less binaural interference occurs when they serve as “targets” as compared to the amount of binaural interference observed when the targets are conventional high-frequency stimuli (Bernstein and Trahiotis 2004, 2005). This resistance to binaural interference was observed for measures of threshold-ITDs and for measures of extents of laterality. Such results suggest that the magnitude of binaural interference one observes depends on the properties of the stimulus that serves as the target within the high-frequency channel and not on properties inherent to the channel itself. Recent findings using raised-sine stimuli as targets are consistent with that point of view. Figure 4 contains the threshold-ITDs shown in Fig. 2 (open bars) along with threshold-ITDs obtained when those stimuli were accompanied by a simultaneously gated, diotic 400-Hz-wide band of
Fig. 4 Threshold-ITDs for SAM, raised-sine, and transposed stimuli measured in the presence (shaded bars) and absence (open bars) of a low-frequency, diotic interferer
noise centered at 500 Hz (shaded bars). Note that binaural interference was substantial with the conventional SAM tone but was virtually absent for transposed stimuli and raised-sine stimuli. At least for these initial data, the gradation in threshold-ITD with changes in the exponent of the raised-sine stimulus is not mirrored in a graded amount of binaural interference. Specifically, once the exponent was increased beyond 1.0 (a SAM tone), interference was no longer observed even though such increases in the exponent yielded lower and lower thresholds. This is consistent with the idea that the temporal characteristics of envelopes of high-frequency targets that are sufficient to prevent binaural interference are not simply the ones that are sufficient to yield the lowest threshold-ITDs. It also appears to be the case that the temporal properties of the interferer can influence whether and to what degree binaural interference is observed. The data in Fig. 5, based on three listeners, reveal that a transposed stimulus centered at 2 kHz can produce substantial binaural interference on a target centered at 6 kHz, even though its conventional counterpart, a Gaussian band of noise centered at 2 kHz, produces little or no interference. This is another indication that characteristics of individual frequency channels, per se, do not determine whether and to what degree binaural interference occurs.
Fig. 5 Threshold-ITDs measured with a Gaussian noise “target” centered at 6 kHz. The type of interferer employed is indicated along the abscissa
Rather, it appears that what matters are stimulus attributes such as the temporal characteristics of the targets and interferers. Incidentally, the fact that both types of interferers in Fig. 5 had the same sound-pressure-level, yet yielded diverging results, reinforces the notion that binaural interference does not result from energetic masking.
5.1 Thoughts Regarding Binaural Interference and Neurophysiology
Kiang and Moxon (1974) demonstrated that low-frequency stimulation within the “tail” of a high-frequency neuron’s tuning curve raised thresholds for stimuli within the “tip” of the tuning curve. In addition, Kim and Molnar (1979) showed that such low-frequency stimulation produced highly phase-locked activity. Griffin et al. (2005) demonstrated that the neural responses to envelopes of transposed stimuli were more synchronized than those to envelopes of conventional stimuli. Such neurophysiological outcomes could help to explain binaural interference. Perhaps the magnitude of binaural interference observed is proportional to the relative quality/magnitude of “timing information” conveyed by targets and interferers, respectively (e.g., conventional
References Bernstein LR, Trahiotis C (1995) Binaural interference effects measured with masking-level difference and with ITD- and IID-discrimination paradigms. J Acoust Soc Am 98:155–163 Bernstein LR, Trahiotis C (2002) Enhancing sensitivity to interaural delays at high frequencies by using transposed stimuli. J Acoust Soc Am 112:1026–1036 Bernstein LR, Trahiotis C (2003) Enhancing interaural-delay-based extents of laterality at high frequencies by using ‘transposed stimuli.’ J Acoust Soc Am 113:3335–3347 Bernstein LR, Trahiotis C (2004) The apparent immunity of high-frequency transposed stimuli to low-frequency binaural interference. J Acoust Soc Am 116:3062–3069 Bernstein LR, Trahiotis C (2005) Measures of extents of laterality for high-frequency ‘transposed’ stimuli under conditions of binaural interference. J Acoust Soc Am 118:1626–1635 Buell TN, Hafter ER (1991) Combination of binaural information across frequency bands. J Acoust Soc Am 90:1894–1900
Griffin SJ, Bernstein LR, Ingham NJ, McAlpine D (2005) Neural sensitivity to interaural envelope delays in the inferior colliculus of the guinea pig. J Neurophysiol 93:3463–3478 Heller LM, Trahiotis C (1995) Interference in detection of interaural delay in a SAM tone produced by a second, spectrally remote SAM tone. J Acoust Soc Am 97:1808–1816 Heller LM, Trahiotis C (1996) Extents of laterality and binaural interference effects. J Acoust Soc Am 99:3632–3637 John MS, Dimitrijevic A, Picton T (2002) Auditory steady-state responses to exponential modulation envelopes. Ear Hear 23:106–117 Kiang NYS, Moxon EC (1974) Tails of tuning curves of auditory-nerve fibers. J Acoust Soc Am 55:620–630 Kim DO, Molnar CE (1979) A population study of cochlear nerve fibers: comparison of spatial distributions of average-rate and phase-locking measures of response to single tones. J Neurophys 42:16–30 McFadden D, Pasanen EG (1976) Lateralization at high frequencies based on interaural time differences. J Acoust Soc Am 59:634–639 Moore BCJ (1997) Frequency analysis and pitch perception. In: Crocker M (ed) Handbook of acoustics, vol III. Wiley, New York, pp 1447–1460 Stellmack MA, Dye RH (1993) The combination of interaural information across frequencies: the effects of number and spacing of components, onset asynchrony, and harmonicity. J Acoust Soc Am 93:2933–2947
53 Models of Neural Responses to Bilateral Electrical Stimulation H. STEVEN COLBURN1, YOOJIN CHUNG1, YI ZHOU2, AND ANDREW BRUGHERA1
1 Introduction
Cochlear implants are becoming more available to deaf and hard of hearing people, and are increasingly fitted in bilateral configurations. There are also increasing numbers of psychophysical experiments with implanted subjects as well as increasing numbers of physiological experiments with electrical stimulation. From the point of view of binaural processing, some of the results of these experiments have been surprising. For example, many reports of interaural time sensitivity indicate performance much worse than that achieved by normal hearing listeners, even though responses to electrical pulses have better temporal compactness and repeatability than responses to acoustic stimuli (Moxon 1967; Kiang and Moxon 1972). Another striking result is that the physiological sensitivity to time delay in the inferior colliculus (IC) is dependent on the modulation of the waveform (Smith 2006), similar to human psychophysics (Poon 2006). Although there are increasing amounts of empirical data and interesting questions related to the interpretations of the data, there has been little modeling of these experiments. Some early work on the modeling of the CI systems is reported here.
2 General Comments on Modeling of Electrical Stimulation
The overall processing of auditory stimulation that is provided through electrical stimuli can be divided into a sequence of stages, each of which has an output that can be measured empirically. These measurable responses include the acoustic signals recorded at the microphones; the electrical current pulses that are presented to each electrode; the electrical activity as it propagates through the cochlear tissue; the firing patterns of primary auditory nerve fibers; the firing patterns of neurons in the central auditory system, notably in the inferior colliculus (IC); and the responses of CI wearers in
1 Department of Biomedical Engineering and Hearing Research Center, Boston University, Boston, USA, [email protected], [email protected]
2 Department of Biomedical Engineering, Johns Hopkins University, USA, [email protected]
psychophysical or clinical tests. Not all of these responses have been measured directly. For example, the distribution of electrical activity within the cochlea is indirectly measured by evoked potentials or by monitoring potentials on nonstimulated electrodes. Between each of these measurable outputs, there is a processing stage that could be explicitly described with a computational model. An overall model would include all of these stages, and would be useful in understanding the limitations of cochlear implants. In this chapter, because of the authors’ interest in binaural hearing, the focus of attention is on the processing between the auditory nerves and the inferior colliculus. One appealing model of central auditory processing in cochlear implant listening is that the central processing is the same as in a normal auditory system. According to this point of view, the central activity is abnormal only because it has abnormal input patterns. An attractive feature of this conceptualization is that existing computational models for central neurons are already available for describing cochlear implant processing. On the other hand, there are reasons for thinking that the central processing is different. First, the central processing may adapt to the ongoing CI inputs. Second, central processing may be impaired before implantation, particularly when there are long periods in which peripheral inputs are either missing or severely distorted. This question is still open. Although the degree of plasticity within auditory brainstem neurons is not known, there are increasing indications that the processing evolves over time after implantation. In this chapter, we focus on the responses of IC neurons to inputs that are oversimplified but relevant to CI stimulation. This focus is stimulated by the recent recordings in the cat IC of responses to electrical stimulation with cochlear electrodes (Smith 2006) and by the observations by Poon (2006) that some aspects of her human psychophysical data on ITD discrimination are consistent with patterns seen in the data of Smith. Although the focus here is on IC activity, it is apparent that the activity distribution on the auditory nerve would be very abnormal with current cochlear implants. Several simplified models of auditory-nerve patterns in response to electrical stimulation are considered and the consequences for IC activity in response to such stimulation are evaluated.
3 Inferior Colliculus Rate Model of CI Stimulation
One of the striking aspects of the responses of IC neurons to electrical stimulation is that modulation is required for sustained firings with ongoing trains of pulses. As has been shown by several authors (e.g., Lane and Delgutte 2005) with acoustic stimulation and by Smith (2006) with electrical stimulation, there is little sustained response of IC neurons to constant amplitude pulse trains when the frequency is higher than a few hundred
Hertz. There are several ways to achieve this property in models of neural activity. In this section we describe results from a simple mathematical model that includes adaptation. This model is described in terms of continuous waveforms that represent firing rates of neurons at several stages of processing in the auditory brainstem. The model is designed to test ideas and to generate insights with very simple descriptions.
3.1 Model Description
The model explicitly describes the firing rates of auditory-nerve fibers, MSO neurons, and IC neurons in response to electrical stimulation. Input signals to the model were constant-amplitude pulses and cosine amplitude-modulated pulses with 100-µs width. Auditory-nerve rates in response to electrical pulses are modeled as sequences of alpha-functions, and rates of MSO neurons are modeled as the cross-correlation of auditory-nerve response rates. The IC model neurons receive excitatory input from ipsilateral MSO and delayed inhibitory input from the contralateral MSO. In addition, an adaptation mechanism is imposed in the generation of the IC response. The underlying structure of this model is similar to that of Cai et al. (1998) and Lane et al. (2005).
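A heavily simplified sketch of such a continuous rate model is given below: alpha-function auditory-nerve rates, an MSO stage formed by a windowed product (running cross-correlation) of the two auditory-nerve rates, and an IC stage with delayed contralateral inhibition and divisive adaptation. All time constants, gains and the exact form of the adaptation are illustrative assumptions, not the parameter values of Cai et al. (1998) or Lane et al. (2005).

import numpy as np

def alpha_kernel(tau, fs, dur=0.02):
    # Alpha function (t/tau)*exp(1 - t/tau): the assumed auditory-nerve rate
    # response to a single electrical pulse.
    t = np.arange(int(dur * fs)) / fs
    return (t / tau) * np.exp(1.0 - t / tau)

def an_rate(pulse_times, amps, fs, dur, tau=0.5e-3):
    # Auditory-nerve firing rate: a sequence of alpha functions, one per pulse.
    rate = np.zeros(int(dur * fs))
    k = alpha_kernel(tau, fs)
    for t0, a in zip(pulse_times, amps):
        i = int(t0 * fs)
        seg = k[:max(0, len(rate) - i)]
        rate[i:i + len(seg)] += a * seg
    return rate

def mso_rate(ipsi, contra, itd, fs, win=1e-3):
    # MSO rate modeled as a short-window running product (cross-correlation)
    # of the two auditory-nerve rates, with the contralateral input delayed by
    # the ITD (circular shift; adequate when the stimulus is much longer than the ITD).
    shifted = np.roll(contra, int(round(itd * fs)))
    w = np.ones(int(win * fs)) / int(win * fs)
    return np.convolve(ipsi * shifted, w, mode="same")

def ic_rate(mso_ipsi, mso_contra, fs, inh_delay=2e-3, inh_gain=1.0,
            adapt_tau=20e-3, adapt_gain=5.0):
    # IC rate: ipsilateral MSO excitation minus delayed contralateral MSO
    # inhibition, divided by a slowly building adaptation term.
    d = int(round(inh_delay * fs))
    inh = np.concatenate([np.zeros(d), mso_contra[:len(mso_contra) - d]])
    drive = np.maximum(mso_ipsi - inh_gain * inh, 0.0)
    out = np.zeros_like(drive)
    a, alpha = 0.0, 1.0 / (adapt_tau * fs)
    for i, x in enumerate(drive):
        out[i] = x / (1.0 + adapt_gain * a)
        a += alpha * (out[i] - a)        # adaptation state tracks recent output
    return out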
3.2 Results
Figure 1 shows relative firing rates in response to constant-amplitude pulse trains with ITD on the ordinate and time on the abscissa. The firing rate is indicated by the darkness of the plotted points. As the pulse rate increases from panel to panel, the onset response is emphasized relative to the ongoing response and the ITD dependence in the ongoing response is poor. In this model the onset-emphasis in the response to the high-rate pulse-train is
Fig. 1 Firing rates to constant-amplitude pulse trains of 40, 160, and 640 pps. As the pulse rate increases, the firing rate pattern becomes more onset-like
Fig. 2 Firing rates to 40-Hz cosine amplitude-modulated pulse trains of 320 and 640 pps
solely due to the adaptive mechanisms in the IC model. When high-rate pulse trains are amplitude modulated, ITD sensitivity to the ongoing stimulus is observed along with a strong sustained response (Fig. 2). This result supports the hypothesis that, in combination with an adaptation mechanism, amplitude modulation enhances ITD sensitivity.
4 Cell Membrane Factors in Modulation Effects
In contrast with the Jeffress (1948) coincidence model for ITD sensitivity, real coincidence mechanisms involve neurons with refractory properties that prevent neural firings and degrade ITD tuning for high frequency inputs (Cook et al. 2003; Dasika et al. 2005). Refractory behavior may also play a significant role in restoring neural discharges with amplitude modulation. Explored here are effects of synaptic parameters and sinusoidally amplitude modulated (SAM) pulse-train stimuli on the ITD sensitivity of a model neuron with an active membrane.
4.1 Model Description
The neural model and stimulus generation are similar to a multi-compartment MSO model used in a previous study (Zhou et al. 2005) with two modifications: 1) the cell model has only a soma compartment with the set of ionic channels, and 2) inputs to the model simulate the peripheral responses to amplitude-modulated electrical-pulse stimulations instead of tones. With a probability proportional to the amplitude of the pulse, a pulse triggers a change in post-synaptic conductance with maximum value Ge. The input
pulse train is described as X(t) = X_amp(t)·X_fine(t), where the pulse-train carrier is X_fine(t) = Σ_k p(t − kT) and the modulated amplitude is X_amp(t) = A·(1 + sin(2π f_m t)). The amplitude A is a Gaussian random variable [N(1,σ)], f_m is the modulation frequency, and 1/T is the carrier pulse rate. Effects of σ, f_m, T and Ge on ITD tuning are explored below.
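How such inputs might be generated can be sketched as follows. The mapping from pulse amplitude to the probability of a post-synaptic event, and the choice of drawing A anew for every pulse, are assumptions made for illustration; they are not the exact procedure of the model described above.

import numpy as np

def sam_pulse_train_events(pulse_rate, fm, sigma, dur, ge_max, rng=None):
    # Pulses occur every T = 1/pulse_rate seconds. Pulse k has amplitude
    # A_k * (1 + sin(2*pi*fm*t_k)) with A_k ~ N(1, sigma). Each pulse triggers
    # a post-synaptic conductance event (peak value ge_max) with a probability
    # proportional to its amplitude (normalized and clipped to [0, 1] here).
    rng = np.random.default_rng() if rng is None else rng
    T = 1.0 / pulse_rate
    times = np.arange(0.0, dur, T)
    A = rng.normal(1.0, sigma, size=times.shape)
    amps = A * (1.0 + np.sin(2.0 * np.pi * fm * times))
    prob = np.clip(amps / np.max(amps), 0.0, 1.0)
    fired = rng.random(times.shape) < prob
    return times[fired], np.full(int(fired.sum()), ge_max)

# Example: 1000-pps carrier, 100-Hz modulation, sigma = 0.1, Ge = 0.9 nS
event_times, event_ge = sam_pulse_train_events(1000.0, 100.0, 0.1, 0.5, 0.9e-9)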
4.2 Results
Figure 3 shows that the synaptic strength influences the neural entrainment to pulse trains. The model cell responds to each input pulse at low pulse rates but not at high pulse rates when the synaptic strength is weak (i.e., low Ge, upper traces). Entrainment to the input pulse-train improves when the Ge value is increased (middle trace on right). This result is consistent with in vitro observations, which show that activities of specific potassium channels block neural responses to high-frequency stimuli (Reyes et al. 1996). The model action potential has a short duration and fast recovery time (a few milliseconds) due to the combined activities of fast-sodium and high-threshold and low-threshold potassium channels on the soma (Rothman and Manis 2003). The combination of these temporal characteristics and the strength of the conductance determine the relative refractory period (RRP), which accounts for the reduced neural activity. ITD computations involve temporal summations of bilateral inputs, and either the in-phase or out-of-phase responses may be in the regime where the RRP has a pronounced effect on neural discharges. Noting that the effective rate of out-of-phase inputs is twice the in-phase rate (ignoring strength), it is not surprising that the synaptic strength may have different effects for these conditions. Figure 4 shows ITD tuning curves for the model cell in response to un-modulated pulse trains for three synaptic strength levels. The sub-threshold, near-threshold and super-threshold levels of strength are defined in terms of unilateral response probabilities. Specifically, the entrainment probabilities of model discharges to low-rate (100 pps) unilateral pulse trains are 0%, 14%, and 100% for the three
Fig. 3 The effect of the membrane refractoriness on neural entrainment to input pulse trains with constant amplitude: A model responses when conductance Ge is low and the inter-pulse interval (T=10 ms) is larger than the relative refractory period of the model cell – note perfect entrainment; B model responses when the inter-pulse interval (T=2 ms) is comparable to the relative refractory period
Fig. 4 ITD tuning for unmodulated pulse trains (500 and 1000 pps) for several values of synaptic conductance Ge. Tuning is very sensitive to synaptic strengths (Ge). For the sub-threshold Ge condition, membrane refractoriness decreases model responses to in-phase inputs (relative to the near-threshold Ge case), resulting in no ITD tuning. For the near-threshold Ge condition, membrane refractoriness decreases model responses to out-of-phase inputs, resulting in better ITD tuning. For the super-threshold Ge condition, summed inputs overcome membrane refractoriness at most ITD values, resulting in saturated ITD tuning
threshold levels. Simulation results indicate that refractory processes lead to reduced discharges for in-phase inputs with the sub-threshold Ge and for out-of-phase inputs with the near-threshold Ge, and have less effect on model discharges with a super-threshold Ge. As a result, the corresponding ITD tuning curves for these three conditions show different overall activities, dynamic ranges, and tuning widths. Amplitude modulating the inputs reduces the effective synaptic strength (and therefore neural firing) during half of the modulation cycle, thereby minimizing the effects of refractoriness on subsequent inputs. However, the resultant ITD tuning can be either improved or degraded depending on the synaptic strength level. Figure 5 shows that amplitude modulation improves ITD tuning in the sub-threshold Ge condition and degrades ITD tuning in the near-threshold Ge condition. This is due to neural release from the RRP for either in-phase or out-of-phase inputs. For the sub-threshold Ge condition (A and B), amplitude modulation releases in-phase responses from membrane refractoriness and ITD tuning is improved relative to that for un-modulated pulse trains. For the near-threshold Ge condition (C and D), amplitude modulation releases out-of-phase responses from membrane refractoriness and ITD tuning is degraded relative to that for un-modulated pulse trains. Finally, adding amplitude variation decreases the effective synaptic strength by reducing the coincidence of unilateral inputs, and therefore could be used to improve ITD tuning. Results (not shown) indicate that an increase in jitter (larger σ) decreases the response rate such that ITD dependence may increase or decrease with increases in σ.
Fig. 5 Amplitude modulation (f_m=50 or 100 Hz) can either improve or degrade ITD tuning to pulse-train stimuli (500 or 1000 pps): A,B the sub-threshold Ge condition; C,D the near-threshold Ge condition
5 Effects of Input Rate and Phase on ITD Sensitivity
The output of a real coincidence detector is also highly dependent on the rate and synchrony of its inputs. Principal MSO neurons are believed to derive ITD sensitivity by coincidence detection of their bilateral inputs, which are phase locked to auditory stimuli. This section explores why higher-rate, more highly synchronized responses in the auditory nerve during electrical stimulation may produce a loss in ITD sensitivity. Simulations of MSO neurons under acoustic and electrical stimulation of the auditory periphery were compared using the single-compartment model.
5.1 Model Description
The single-compartment model MSO cell here is nearly identical to that in Sect. 4, with identical channel dynamics, conductances, and capacitance (Zhou et al. 2005), and Ih calculated in the manner of Rothman and Manis (2003). The model MSO cell has 20 bilateral inputs (10 per side) from explicitly modeled auditory nerve fibers. A 500-Hz tone was the driving stimulus for both acoustic and electric input models. In each stimulus period, it is determined for each fiber whether a spike occurs and, if so, when it occurs, according to a Gaussian distribution with σ set by the nominal synchronization index (SI) (Zhou et al. 2005). Average input spike rate and SI were set according to condition, and phase delays between electric inputs were introduced as described below.
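The bilateral inputs can be sketched as follows. The wrapped-Gaussian relation between temporal jitter and synchronization index is a standard approximation; the one-spike-per-cycle rule, parameter values and function names are illustrative assumptions rather than the procedure of Zhou et al. (2005).

import numpy as np

def phase_locked_spikes(rate, si, f0, dur, rng=None):
    # Spike times for one input fiber phase-locked to a tone of frequency f0.
    # At most one spike occurs per stimulus cycle; the per-cycle probability is
    # chosen to match the average rate, and spike times are jittered around the
    # cycle peak by a Gaussian with sigma = sqrt(-2*ln(si)) / (2*pi*f0).
    rng = np.random.default_rng() if rng is None else rng
    period = 1.0 / f0
    sigma = np.sqrt(-2.0 * np.log(si)) / (2.0 * np.pi * f0)
    cycles = np.arange(0.0, dur, period)
    fire = rng.random(cycles.shape) < min(1.0, rate * period)
    times = cycles[fire] + rng.normal(0.0, sigma, int(fire.sum()))
    return np.sort(times[(times >= 0.0) & (times < dur)])

# Simulated acoustic inputs (120 spikes/s, SI = 0.7) vs electric (250 spikes/s, SI = 0.99)
acoustic_inputs = [phase_locked_spikes(120.0, 0.70, 500.0, 1.0) for _ in range(10)]
electric_inputs = [phase_locked_spikes(250.0, 0.99, 500.0, 1.0) for _ in range(10)]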
5.2 Results
Figure 6 shows the rate-ITD characteristics of the model MSO cell for acoustic inputs and three variations of electric inputs. Synaptic strength was set and maintained such that simulated acoustic inputs (120 spikes/s for each input, SI=0.7) produced rate-ITD sensitivity similar to real MSO cells (Goldberg and Brown 1969; Yin and Chan 1990). Electrical stimulation was modeled as 250 spikes/s for each input, with SI=0.99. In simulated electrical stimulation without phase-dispersion (top curve), the high-rate, highly phase-locked inputs produced entrainment and saturation of the rate-ITD curve at good phase, and the resulting “monaural coincidences” produced high discharge rates at all phases, including bad phase. To simulate the possible effect of neural delays that may normally compensate for the cochlear traveling wave delay, phase dispersion between inputs was introduced such that added phases were equally spaced across the stimulus period (bottom curve). With this high degree of phase-dispersion, ITD sensitivity was eliminated, and few output discharges occurred across the range of ITD. To simulate the possible effect of neural plasticity after cochlear implantation, phase dispersion was limited to half the stimulus period. With this moderate phase dispersion, enhanced ITD sensitivity occurred in terms of the amplitude of the rate-ITD curve, though with slightly broader peaks (broader tuning) than for the lower-rate acoustic stimulation.
ITD (ms) Fig. 6 ITD sensitivity of a model MSO neuron in response to acoustic and electric stimulation
6 Summary and Conclusions
Cochlear implant studies of ITD sensitivity in human psychoacoustics and animal electrophysiology lead to a consideration of ITD-dependent neural activities. Our simulation results suggest stimulation strategies to improve ITD sensitivity for CI users. Results show that increased neural firing to SAM stimuli can be explained by either release from neural adaptation or from membrane refractoriness. Model results also indicate that, dependent upon the synaptic strength level, the introduction of amplitude modulation can either improve or degrade ITD tuning to high-frequency pulse trains. These results suggest that, to enhance the ITD sensitivity of individual neurons, the effective parameter range of amplitude-modulated stimuli would be small and vary across ITD-sensitive cells, which have different membrane properties and synaptic input strengths. Further studies are needed to explore input-parameter combinations that could further optimize tuning to carrier ITDs. Acknowledgments. This work was supported by the US Public Health Service NIH/NIDCD grant R01 DC05775 Bertrand Delgutte, PI.
References Cai HM, Carney LH, Colburn HS (1998) A model for binaural response properties of inferior colliculus neurons. II. A model with interaural time difference-sensitive excitatory and inhibitory inputs and an adaptation mechanism,” J Acoust Soc Am 103:494–506 Cook DL, Schwindt PC, Grande LA, Spain WJ (2003) Synaptic depression in the localization of sound. Nature 421:29–30 Dasika VK, White JA, Carney LH, Colburn HS (2005) Effects of inhibitory feedback in a network model of avian brain stem. J Neurophysiol 94:400–414 Goldberg JM, Brown PB (1969) Responses of binaural neurons of dog superior olive to dichotic time stimuli: some physiological mechanisms of sound localization. J Neurophysiol 32:516–523 Jeffress LA (1948) A place theory of sound localization. J Comp Physiol Psychol 41:35–39 Kiang NYS, Moxon EC (1972) Physiological considerations in artificial stimulation of the inner ear. Ann Otol Rhinol Laryngol 81:714–730 Lane CC, Delgutte B (2005) Neural correlates and mechanisms of spatial release from masking: single-unit and population responses in the inferior colliculus. J Neurophysiol 94:1180–1198. Epub 2005 Apr 27 Lane CC, Kopco N, Delgutte B, Shinn-Cunningham BG, Colburn HS (2005) A cat’s cocktail party: psychophysical, neurophysiological and computational studies of spatial release from masking. In: Pressnitzer D, de Cheveign A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York Moxon E (1967) Electrical stimulation of the inner ear of cat. Doctoral dissertation, MIT Elec Eng Dept Poon B (2006) Sound localization and interaural time sensitivity with bilateral cochlear implants. PhD Dissertation, M.I.T. Health Science and Technology, Cambridge, MA Reyes AD, Rubel EW, Spain WJ (1996) In vitro analysis of optimal stimuli for phase-locking and time-delayed modulation of firing in avian nucleus laminaris neurons. J Neurosci 16:993–1007
Rothman J, Manis P (2003) Kinetic analyses of three distinct potassium conductances in ventral cochlear nucleus neurons. J Neurophysiol 89:3083–3096 Smith ZM (2006) Binaural interactions with bilateral electric stimulation of the cochlea. PhD Dissertation, M.I.T. Health Science and Technology, Cambridge, MA Yin TCT, Chan JCK (1990) Interaural time sensitivity in the medial superior olive of the cat. J Neurophysiol 64:465–488 Zhou Y, Carney LH, Colburn HS (2005) A model for interaural time difference sensitivity in the medial superior olive: interaction of excitatory and inhibitory synaptic inputs, channel dynamics, and cellular morphology. J Neurosci 25:3046–3058
54 Neural and Behavioral Sensitivities to Azimuth Degrade with Distance in Reverberant Environments SASHA DEVORE1, ANTJE IHLEFELD2, BARBARA G. SHINN-CUNNINGHAM2, AND BERTRAND DELGUTTE1
1 Introduction
Reverberation poses a challenge to sound localization in rooms. In an anechoic space, the only energy reaching a listener’s ears arrives directly from the sound source. In reverberant environments, however, acoustic reflections interfere with the direct sound and distort the ongoing directional cues, leading to fluctuations in interaural time and level differences (ITD and ILD) over the course of the stimulus (Shinn-Cunningham et al. 2005). These effects become more severe as the distance from sound source to listener increases, which causes the ratio of direct to reverberant energy (D/R) to decrease (Hartmann et al. 2005; Shinn-Cunningham et al. 2005). Few neurophysiological and psychophysical studies have systematically examined sensitivity to sound source azimuth as a function of D/R (Rakerd and Hartmann 2005). Here we report the results of two closely-integrated studies aimed at characterizing the influence of acoustic reflections like those present in typical classrooms on both the directional sensitivity of auditory neurons and the localization performance of human listeners. We used low-frequency stimuli to emphasize ITDs, which are the most important binaural cue for sounds containing low-frequency energy (MacPherson and Middlebrooks 2002; Wightman and Kistler 1992). We find that reverberation reduces the directional sensitivity of low-frequency, ITD-sensitive neurons in the cat inferior colliculus (IC), and that this degradation becomes more severe with decreasing D/R (increasing distance). We show parallel degradations in human sensitivity to the azimuth of low-frequency noise.
1 Eaton-Peabody Laboratory, Massachusetts Eye and Ear Infirmary, Boston, MA, USA, [email protected], [email protected] 2 Hearing Research Center, Boston University, Boston, MA, USA, [email protected], [email protected]
2 Single-Unit Neurophysiology
2.1 Methods
Methods for recording from low-frequency, ITD-sensitive neurons in the IC of anesthetized cats were as described by Hancock and Delgutte (2004). We focused on measuring neural responses as a function of source azimuth in simulated rooms. Binaural room impulse responses (BRIRs) were simulated using the room-image method (Allen and Berkley 1979) for a pair of receivers corresponding to the left and right ears of a cat in the center of a simulated reverberant room (11 × 13 × 3 m). We did not include a model of the head in the simulations, so that the resulting BRIRs contained ITD but essentially no ILD cues. BRIRs were calculated for azimuths spanning the frontal hemifield (−90° to 90°) at distances of 1 m and 3 m with respect to the midpoint of the receivers. Anechoic impulse responses were created by time-windowing the direct wavefront from the 1-m reverberant BRIRs. The direct-to-reverberant energy ratio (D/R) was calculated as the ratio of the energy in the direct sound (time-windowed from the BRIR) to the energy of the remaining impulse response. An overall D/R was determined for each source distance by averaging across all azimuths. Virtual room stimuli were created by convolving the BRIRs with a reproducible 400-ms broadband noise burst. The first 400 ms of the resulting signals were presented to dial-in-urethane anesthetized cats over calibrated, closed acoustic systems. Neural responses were measured as a function of source azimuth for each virtual room condition (anechoic, 1 m, and 3 m). We typically used 11 azimuths (15° steps) or, occasionally, 7 azimuths (30° steps). The noise stimulus was repeated 16 times at each azimuth, with random order across azimuths. We computed the average firing rate by counting the number of action potentials over the stimulus duration.
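The D/R computation can be illustrated with a short sketch. Only the time-windowing of the direct wavefront is specified above, so the 2.5-ms window length, the onset argument and the function name are assumptions for illustration.

import numpy as np

def direct_to_reverberant_db(brir, fs, direct_onset_s, direct_dur_s=2.5e-3):
    # D/R in dB: energy of the time-windowed direct wavefront divided by the
    # energy of the remaining impulse response. For an anechoic (direct-only)
    # impulse response the denominator is ~0 and the ratio tends to infinity.
    i0 = int(round(direct_onset_s * fs))
    i1 = i0 + int(round(direct_dur_s * fs))
    direct_energy = np.sum(brir[i0:i1] ** 2)
    reverb_energy = np.sum(brir[:i0] ** 2) + np.sum(brir[i1:] ** 2)
    return 10.0 * np.log10(direct_energy / reverb_energy)

An overall D/R per source distance would then be obtained by averaging this quantity across azimuths, as described above.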
2.2 Results
We measured neural responses as a function of azimuth for 25 IC units from 7 cats. Rate-azimuth curves for two typical units are shown in Fig. 1A. Neural rate responses depend strongly on source azimuth in the anechoic condition (D/R = ∞ dB), with a preference for contralateral azimuths. Reverberation reduces the range of firing rates over all source azimuths (a “demodulation”), although rates still vary systematically with azimuth. To quantify these observations, we define the relative response range as the range of firing rates for a given room condition normalized by the range of firing rates in the anechoic condition. The relative range for the anechoic condition is 1, by definition. For most neurons, the relative ranges in the reverberant conditions are less
Fig. 1 A Mean neural response rate (±s.e.m.) as a function of source azimuth for two neurons in the cat IC. B Mean (large filled symbols) relative response range as a function of D/R. Small symbols indicate relative response range for individual units
Fig. 2 Mean (large filled symbols) rIT as a function of D/R. Small symbols show rIT for individual units
than 1, indicating a compression of the rate-azimuth curves; furthermore, relative range decreases with decreasing D/R (Fig. 1B). Neural sensitivity to azimuth depends not only on the response range, but also on the variability in responses at each azimuth. To quantify sensitivity, we computed the mutual information (MI) between the source azimuth and the neural spike counts for each neuron and room condition. MI characterizes the precision with which the source azimuth can be estimated from the neural spike counts without making additional assumptions about the neural code. To compare across experiments using different numbers of stimuli, MI was normalized by the stimulus entropy to get the relative information transfer (rIT). Figure 2 shows that the rIT systematically decreases with decreasing D/R.
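A plug-in estimate of MI and rIT from paired azimuth/spike-count observations can be sketched as follows; any bias correction applied in the actual analysis is omitted, and the function name is illustrative.

import numpy as np

def mutual_information_and_rit(stim_labels, spike_counts):
    # Plug-in (histogram) estimate of the mutual information, in bits, between
    # stimulus azimuth and spike count, and its value normalized by the
    # stimulus entropy (relative information transfer, rIT).
    stims, s_idx = np.unique(stim_labels, return_inverse=True)
    counts, c_idx = np.unique(spike_counts, return_inverse=True)
    joint = np.zeros((len(stims), len(counts)))
    for i, j in zip(s_idx, c_idx):
        joint[i, j] += 1.0
    joint /= joint.sum()
    ps = joint.sum(axis=1, keepdims=True)       # stimulus marginal
    pc = joint.sum(axis=0, keepdims=True)       # spike-count marginal
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log2(joint[nz] / (ps @ pc)[nz]))
    stim_entropy = -np.sum(ps * np.log2(ps))
    return mi, mi / stim_entropy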
2.3 Discussion
Reverberation leads to a decrease in the directional sensitivity of low-frequency ITD-sensitive neurons, as seen both by a reduction in the relative response range and a decrease in rIT between source azimuth and neural firing rate. Moreover, the degradation in sensitivity is largest at the smallest D/R. It is tempting to attribute the reduction in response range with decreasing D/R to a decrease in the interaural correlation of the input signals, since the rate responses of cat IC neurons to interaurally-delayed broadband noise become compressed when the noise is statistically decorrelated (Yin et al. 1987). However, a cross-correlation model of binaural processing in the IC (Hancock and Delgutte 2004) predicts an even larger degradation in directional sensitivity than we actually observed (see reply to comment below), indicating that the responses of IC neurons may be more robust to reverberation than predicted by cross-correlator models of binaural processing. Further studies are needed to identify the neural mechanisms underlying such robust ITD sensitivity.
3 Human Psychophysics
3.1 Methods
Seven paid subjects with normal hearing participated in the experiment. Binaural room impulse responses (BRIRs; see Shinn-Cunningham et al. 2005 for details) were measured using KEMAR in a small room (T60 of approximately 485 ms; 3.3 × 5.8 × 2.6 m). BRIRs were used to simulate sources from 52 locations (all combinations of 13 azimuths, evenly spaced from −90° to 90°, and four distances: 15, 40, 100 and 170 cm, in the horizontal plane containing the ears). Simulated source azimuth and distance were varied randomly from trial to trial by convolving low-pass noise (500 Hz–1 kHz, 20-ms sin² ramp, 200 ms duration) with the appropriate pair of BRIRs. Signals were presented through ER-1 insert earphones. Subjects, seated in a sound-treated booth, were asked to indicate the perceived source location on a continuous scale ranging from −90° to 90° using a graphical user interface. Each subject performed two training sessions with pseudo-anechoic simulations during which they received correct-answer feedback. Following training, subjects completed 12 sessions in the simulated room conditions without feedback. For the results reported here, subjects completed 45 trials for each azimuth and distance pair. For each listener, mean response was calculated for each stimulus location. From the resulting confusion matrices (probability of responding each
azimuth, given which azimuth was presented), the mutual information (MI) and relative information transfer (rIT) between source and response were computed.
3.2 Results
Figure 3 shows the across-subject average response angle as a function of source azimuth (error bars show the across-subject standard error). At each distance, perceived source location varies monotonically, but nonlinearly, with stimulus azimuth such that the change in perceived location decreases with increasing source laterality. In addition, for each azimuth, mean response laterality decreases with increasing source distance. However, within-subject response variability is nearly independent of source distance (Fig. 3, inset). Figure 4 shows that rIT and unnormalized MI between stimulus and response decrease with increasing source distance (and therefore decreasing D/R, shown at the top of the plot).
[Fig. 3 plot: response angle (deg) versus stimulus angle (deg), both spanning −90° to 90°, with separate curves for source distances of 15, 40, 100 and 170 cm; inset: response STD (deg) versus stimulus distance (cm)]
Fig. 3 Localization judgments as a function of stimulus azimuth for different distances. Inset shows the across-subject and across-azimuth average of the standard deviations in responses at each distance
[Fig. 4 plot: rIT (%, left axis) and MI (bits, right axis) versus source distance (15–170 cm), with the corresponding D/R (dB) along the top axis]
Fig. 4 rIT (left axis) and MI (right axis) vs source distance (D/R along top axis)
3.3 Discussion
Both localization accuracy and rIT decrease with increasing distance. Although the mean ITDs in our reverberant BRIRs decrease somewhat with increasing distance (not shown), the magnitude of these changes cannot account for the observed effects of source distance on mean localization judgments or on sensitivity to stimulus azimuth. Acoustical analyses show that, with increasing distance in a room, (1) variability in both ITD and ILD increases, particularly for lateral sources, even though the mean ITD does not change substantially, and (2) the magnitude of ILD cues decreases (Shinn-Cunningham and Kawakyu 2003; Hartmann et al. 2005; Shinn-Cunningham et al. 2005). Decreases in the size of ILDs with increasing distance may partly explain why perceived source laterality decreases with increasing distance. However, it is somewhat surprising that response variability does not increase with distance, as the variability in ITD and ILD cues increases. As measured by rIT, behavioral sensitivity to azimuth decreases with decreasing D/R. Theoretically, such a decrease could be due to either an increase in response variability or a reduction in the range of responses. Given that response variability is nearly independent of D/R, the reduction in rIT with decreasing D/R must be due to the compression of the range of perceived azimuths. Robust sound localization ability in reverberant environments is often ascribed to the precedence effect, in which the direct wavefront dominates localization judgments (Litovsky et al. 1999). Our results show that late-arriving reflections do degrade perceptual sensitivity to source azimuth.
4 General Discussion
There are parallels in the effects of reverberation on neural responses and human localization performance. Physiologically, decreasing D/R leads to decreases in both the relative neural response range and rIT between stimulus azimuth and neural firing rate. Behaviorally, decreasing D/R leads to both a reduction in the range of reported azimuths and a decrease in rIT between stimulus and perceived azimuths. In interpreting these parallels, methodological differences between the neurophysiological and psychophysical studies must be kept in mind. While the BRIRs used in the physiological study contained only ITD cues, the BRIRs used in the psychophysical study contained ILDs as well. Although ITD is usually the dominant cue for the low-frequency stimuli used in our study (Wightman and Kistler 1992), ILD cues do carry some weight (MacPherson and Middlebrooks 2002), and may be particularly important in our psychophysical experiments where large ILD cues are present at all frequencies for the short distances examined.
Nonetheless, we can ask whether our behavioral results can be explained by current models of sound localization based on ITD. By making explicit assumptions about the neural code for sound localization, we can relate the activity of single neurons to perceptual judgments of azimuth and therefore test whether these codes can account for the observed behavioral results. Here, we evaluate a simple 2-channel hemispheric difference model of binaural processing (Hancock, this volume; Marquardt and McAlpine 2001; van Bergeijk 1962). The model’s key assumption is that the neural code for azimuth is derived by comparing the total neural activity in the population of ITD-sensitive neurons across the two inferior colliculi. Since we lack an explicit analytic expression relating D/R to the firing rate of IC neurons, we estimated the total activity in one IC empirically by summing the normalized rate-azimuth curves across our sample of 25 neurons. Assuming symmetry in the nervous system, the activity in the opposite IC is obtained by reflecting the activation signal across the midline (Fig. 5A). The hemispheric difference signal is then the difference in total activity between the two hemispheres. Figure 5B shows the hemispheric difference signal for each of our three room conditions in the physiological experiments.
[Fig. 5 plots: A total population activity of the left and right IC versus azimuth (−90° to 90°); B hemispheric difference signal versus azimuth for D/R = ∞ dB, 0 dB and −10 dB]
Fig. 5 A Total neural activity in the left and right IC as a function of stimulus azimuth. B Hemispheric difference signal as a function of stimulus azimuth
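A minimal sketch of the hemispheric-difference readout described above is given below. The normalization of each rate-azimuth curve to its own maximum and the mirror-symmetry assumption for the opposite IC follow the text, while the array layout and the interpolation used for the reflection are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def hemispheric_difference(rate_curves, azimuths):
    """rate_curves: array (n_neurons, n_azimuths) of firing rates recorded in
    one IC at the azimuths given (deg, in increasing order from -90 to 90).
    Returns the difference between the summed, normalized activity of the
    two ICs as a function of azimuth."""
    # Normalize each neuron's rate-azimuth curve to its own maximum
    norm = rate_curves / rate_curves.max(axis=1, keepdims=True)
    left_ic = norm.sum(axis=0)                       # total activity, one IC
    # Mirror-symmetric activity assumed for the opposite IC
    right_ic = np.interp(-azimuths, azimuths, left_ic)
    return left_ic - right_ic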
Assuming a monotonic relationship between the hemispheric difference signal and perceived azimuth, the model predictions are qualitatively consistent with the behavioral results (compare Figs. 3 and 5B). The hemispheric difference signals become increasingly compressed with decreasing D/R, as do listener judgments of azimuth. Thus, the effects of reverberation on perceived source azimuth are qualitatively consistent with a neural code whereby perceived azimuth is monotonically related to the difference in summed neural activity from both sides of the midbrain. Our results establish the usefulness of combined neurophysiological and psychophysical studies of sound localization in reverberant environments as a tool for testing neural models of binaural sensitivity. Future efforts will be aimed at testing whether other models of sound localization, e.g., the Jeffress model (Jeffress 1948), can also predict the observed behavior. Further progress in identifying neural codes for sound localization at the midbrain level will require quantitative descriptions of the effects of reverberation on ITD sensitivity.
Acknowledgements. This research was supported by NIH grants DC 02258 (BD), 05209 (BD) and 05778 (BGSC), as well as AFOSR grant FA9550-04-1-0260 (BGSC). The authors thank Connie Miller for surgical assistance and Justin Kiggins for help in collecting the psychophysical data.
References
Allen JB, Berkley DA (1979) Image method for efficiently simulating small-room acoustics. J Acoust Soc Am 65:943–950
Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24:7110–7117
Hartmann WM, Rakerd B, Koller A (2005) Binaural coherence in rooms. Acta Acust 91:451–462
Jeffress LA (1948) A place theory of sound localization. J Comp Physiol Psych 41:35–39
Litovsky RY, Colburn HS, Yost WA, Guzman SJ (1999) The precedence effect. J Acoust Soc Am 106:1633–1654
Macpherson EA, Middlebrooks JC (2002) Listener weighting of cues for lateral angle: the duplex theory of sound localization revisited. J Acoust Soc Am 111:2219–2236
Marquardt T, McAlpine D (2001) Simulation of binaural unmasking using just four binaural channels. Assoc Res Otolaryngol Abs 21716
Rakerd B, Hartmann WM (2005) Localization of noise in a reverberant environment. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York, pp 414–423
Shinn-Cunningham BG, Kawakyu K (2003) Neural representation of source direction in reverberant space. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (New Paltz, New York), pp 79–82
Shinn-Cunningham BG, Kopco N, Martin TJ (2005) Localizing nearby sound sources in a classroom: binaural room impulse responses. J Acoust Soc Am 117:3100–3115
van Bergeijk W (1962) Variation on a theme of von Békésy: a model of binaural interaction. J Acoust Soc Am 34:1431–1437
Wightman FL, Kistler DJ (1992) The dominant role of low-frequency interaural time differences in sound localization. J Acoust Soc Am 91:1648–1661
Yin TC, Chan JC, Carney LH (1987) Effects of interaural time delays of noise stimuli on low-frequency cells in the cat’s inferior colliculus. III. Evidence for cross-correlation. J Neurophys 58:562–583
Comment by Hartmann
It appears from your data that the onset was included in the stimuli in a way that allows the precedence effect to operate. If the onset had been masked, you would have found that the variance increases as the separation between source and receiver increases.
Reply
The precedence effect undoubtedly influenced subjects’ judgments of lateral source angle by allowing listeners to weight spatial cues in the onset of the signal more heavily than those in the steady state. Indeed, the results you presented at the last ISH (Rakerd and Hartmann 2005) suggest that the variance of listeners’ judgments increases with increasing distance when the onset is masked. However, that study was done in a room with a rather extreme reverberation time (4 s). The distribution of spatial cues in the ongoing stimulus in such an environment is much more variable than in everyday spaces such as classrooms (Ihlefeld and Shinn-Cunningham, unpublished observation). Some recent results by Dizon and Colburn (2006) are relevant. They found robust localization dominance, even in the absence of an onset containing the direct sound alone, when listeners were presented with an ongoing segment of the sum of a single direct source and a single delayed reflection. Such an ongoing precedence effect could allow good localization in modest reverberation even when spatial information in the onset is eliminated. So long as the mean IPD does not depend on the distance between source and listener, and the time window over which the auditory system averages binaural cues is long compared to the time scale of the IPD fluctuations, the response variability should not increase if the onsets are removed.
References
Dizon RM, Colburn HS (2006) The influence of spectral, temporal, and interaural stimulus variations on the precedence effect. J Acoust Soc Am 119:2947
Hartmann WM, Rakerd B, Koller A (2005) Binaural coherence in rooms. Acta Acust 91:451–462
Rakerd B, Hartmann WM (2005) Localization of noise in a reverberant environment. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Springer, Berlin Heidelberg New York, pp 414–423
Comment by Kollmeier
A straightforward analogy to adding (uncorrelated) reverberation is the addition of uncorrelated noise. Do your data show any evidence that the auditory system treats reverberation differently than expected when
using the anechoic condition with (interaurally uncorrelated) noise added at the appropriate signal-to-noise ratio? If so, this would hint at an interesting de-reverberation process.
Reply
This is an excellent point, which we did not have space to explore in the paper. We are investigating whether a physiologically-based cross-correlation model of ITD processing (Hancock and Delgutte 2004) can account for neural responses in reverberation (Devore and Delgutte 2006). Since the model is based on the average interaural correlation over the entire stimulus duration, this approach is equivalent to choosing the signal-to-noise ratio of uncorrelated and anechoic noise sources to match the interaural cross-correlation coefficient of our reverberant stimuli. The model parameters were first fit to the anechoic data for each neuron, and then held constant to predict neural rate responses as a function of azimuth for the reverberant conditions. Figure A shows the range (max − min) of predicted neural response functions plotted against the range of measured neural responses. The model accurately fits the anechoic data (black circles). However, in the reverberant conditions, the model often predicts more compression than was actually observed, suggesting that the auditory system may treat reverberation differently than statistical decorrelation. We are actively investigating the mechanisms that give rise to this robust directional sensitivity in reverberation (see reply to Hohmann’s question).
Fig. A Predicted neural response range (spikes/s) plotted against measured neural response range (spikes/s), for D/R = ∞ dB, 0 dB and −10 dB. Gray shaded region depicts ±1 std from the line y = x
References
Devore S, Delgutte B (2006) Robustness to reverberation of directionally-sensitive neurons in the inferior colliculus. Computational and Systems Neuroscience 2006, Abstract 130, Salt Lake City, Utah
Hancock KE, Delgutte B (2004) A physiologically based model of interaural time difference discrimination. J Neurosci 24(32):7110–7117
Comment by Hohmann
You showed nicely how the IPD statistics broaden with increasing reverberation. However, you did not show a relation between this effect and the changes in unit responses. Would it be possible to estimate a unit’s response from its IPD tuning curve and the IPD statistics? If this is not the case, and given that your assumption on how IPD is represented in the auditory system is basically correct, this would suggest that the response is more than an ergodic stochastic process. Answering this question would be interesting in order to further assess localization models that utilize statistics of interaural parameters, e.g., Nix and Hohmann (2006).
References
Nix J, Hohmann V (2006) Sound source localization in real sound fields based on empirical statistics of interaural parameters. J Acoust Soc Am 119:463–479
Response to Comment by Hohmann
The short stimulus (frozen 400-ms noise) used in the present physiological experiments provides insufficient data for an accurate characterization of short-term changes in neuronal firing rate. However, we are currently investigating the relationship between short-term interaural statistics of the stimulus and the instantaneous firing rates of ITD-sensitive neurons in the inferior colliculus (IC). The firing rates of these neurons are undoubtedly modulated not only by changes in short-term interaural correlation (Joris et al. 2006), but also by other factors such as monaural temporal envelopes (Lane and Delgutte 2005) and interactions between frequency bands. Identifying and characterizing these factors (along with the distribution of short-term IPDs) may be necessary to accurately predict the temporal discharge patterns of IC neurons in reverberant conditions. It is worth noting that, in our psychophysical data, listeners’ judgments do not directly follow the IPD statistics. The mean response azimuth varies with distance even though the mean IPD does not, and the response variability is stable with distance even though the variance of the IPD increases.
References
Joris PX, van de Sande B, Recio-Spinoso A, van der Heijden M (2006) Auditory midbrain and nerve responses to sinusoidal variations in interaural correlation. J Neurosci 26(1):279–289
Lane CC, Delgutte B (2005) Neural correlates and mechanisms of spatial release from masking: single-unit and population responses in the inferior colliculus. J Neurophysiol 94:1180–1198
55 Spectro-temporal Processing of Speech – An Information-Theoretic Framework
THOMAS U. CHRISTIANSEN¹, TORSTEN DAU¹, AND STEVEN GREENBERG¹,²
¹ Center for Applied Hearing Research, Ørsted DTU, Acoustic Technology, Technical University of Denmark, Ørsteds Plads, Bldg. 352, DK-2800 Kgs. Lyngby, Denmark
² Silicon Speech, 46 Oxford Drive, Santa Venetia, CA 94903, USA
1 Introduction
Which acoustic cues are most important for understanding spoken language? Traditionally, the speech signal has been described primarily in spectral terms (i.e., the distribution of energy across the acoustic frequency axis). In contrast, temporal properties have largely been ignored. However, there is mounting evidence that low-frequency energy modulations play a crucial role, particularly those below 16 Hz (e.g., Houtgast and Steeneken 1985; Drullman et al. 1994; Greenberg et al. 1998; Greenberg and Arai 2004; Christiansen and Greenberg 2005). Modulations higher than 16 Hz may also contribute under certain conditions (Apoux and Bacon 2004; Christiansen and Greenberg 2005; Greenberg and Arai 2004; Silipo et al. 1999). Currently lacking is a detailed understanding of how low-frequency amplitude-modulation cues are combined across the acoustic frequency spectrum, as well as how spectral and temporal information interact. Such knowledge is likely to enhance our understanding of how spoken language is processed in noisy and reverberant environments by both normal and hearing-impaired individuals.
2 Experimental Methods
The current study investigates the spectro-temporal cues associated with identification of Danish consonants through systematic filtering of the modulation spectrum in different regions of the audio frequency spectrum. Because of speech’s inherent redundancy, much of the signal’s audio frequency content must be discarded in order to delineate the interaction between spectral and temporal information. For this reason, amplitude modulations associated with each of three separate spectral regions were low-pass
filtered, and the resulting signals evaluated in terms of consonant identification and the amount of information associated with each consonant’s constituent phonetic features. The phonetic features of voicing (e.g., differentiating the consonants [p, t, k] from [b, d, g]), articulatory manner (e.g., distinguishing [b] from [m] and [f], [d] from [n] and [s]) and place of articulation (e.g., distinguishing [p] from [t] and [k]) can be used to assess the contribution of each audio-frequency channel and modulation-frequency region to consonant recognition by computing confusion matrices and calculating the amount of information transmitted for each phonetic feature. In this way, the contribution of each acoustic frequency region to consonant recognition can be discerned when presented alone and in combination with other spectral bands (see Christiansen and Greenberg 2005 for additional details).

Table 1 Consonant identification accuracy (percent correct) for each condition (average of six subjects). The coefficient of variation (i.e., standard deviation divided by the mean) was always less than 0.08 and usually lower than 0.03. The presence of a speech band (“slit”) at each of three center frequencies (750, 1500 and 3000 Hz) is indicated by either “–” (no low-pass modulation filtering) or “●” (low-pass modulation filtered). The low-pass modulation filter cutoff varied between 3 and 24 Hz. Of the consonants, 99% were correctly identified in the absence of spectral and modulation filtering (i.e., unprocessed stimuli)
Slits (● = modulation filtered, – = unfiltered)   All pass   <24 Hz   <12 Hz   <6 Hz   <3 Hz
750 ●                                              38.4       32.8     27.5     21.5    18.2
1500 ●                                             40.2       35.9     29.0     19.7    16.2
3000 ●                                             39.6       31.3     29.0     21.5    16.7
750 ● + 1500 ●                                     67.6       62.1     55.6     41.7    26.3
1500 ● + 3000 ●                                    77.1       76.4     71.7     56.8    34.8
750 ● + 3000 ●                                     74.6       73.2     63.6     46.0    31.6
750 ● + 1500 ● + 3000 ●                            88.4       87.9     81.1     64.1    42.9
Two slits, one ● and one –                                    64.8     59.1     57.1    50.0
Two slits, one ● and one –                                    69.7     55.3     50.3    47.0
Two slits, one ● and one –                                    76.0     71.5     67.2    61.4
Two slits, one ● and one –                                    64.3     60.9     57.8    45.5
Two slits, one ● and one –                                    71.5     59.6     56.6    51.3
Two slits, one ● and one –                                    77.7     73.0     68.4    60.4
750 ● + 1500 – + 3000 –                                       85.9     84.6     83.6    79.0
750 – + 1500 – + 3000 ●                                       87.1     85.4     80.1    76.0
750 – + 1500 ● + 3000 –                                       78.3     79.5     74.5    71.5
750 ● + 1500 – + 3000 ●                                       86.6     82.3     75.8    65.9
750 ● + 1500 ● + 3000 –                                       82.8     78.8     77.3    66.7
750 – + 1500 ● + 3000 ●                                       87.9     84.1     75.0    61.4
Stimuli were Danish monosyllabic words and nonsense syllables recorded in a sound-insulated room at Aalborg University. Their original sampling rate was 20 kHz (at which the signal processing was performed). Subsequently, the speech signals were up-sampled to 44.1 kHz for stimulus presentation. The acoustic frequency spectrum was partitioned into three separate channels (“slits”), each three-quarters of an octave wide. The lowest slit was centered at 750 Hz, the middle slit at 1500 Hz and the highest slit at 3000 Hz. Each slit was presented either in isolation or in combination with one or two other slits, and was low-pass modulation filtered using the “Modulation Toolbox” (Schimmel and Atlas 2005). The low-pass cutoff frequency of the modulation filtering ranged between 3 Hz and 24 Hz in octave steps. Each slit was also presented without any modulation filtering. The stimulus was preceded by a short, unfiltered carrier phrase, “På pladsen mellem hytten og...”, and contained one of eleven consonants, [p, t, k, b, d, g, m, n, f, s, v], followed by one of three vowels, [i, a, u]. Each token concluded with the liquid segment [l] (e.g., “talle,” “tulle,” “tille”). The full set of stimulus conditions is listed in Table 1. The material was spoken by one of two talkers (one male, one female) and presented diotically over Sennheiser HD-580 headphones at a sound pressure level of 65 dB to the subject, who was seated in a double-walled sound booth. The subject’s task was to identify the initial consonant of each stimulus. No feedback was provided. Six individuals (3 males, 3 females) between the ages of 21 and 28 participated in the study. All reported normal hearing and no history of auditory pathology.
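The following sketch illustrates the kind of processing involved in producing one low-pass modulation-filtered slit. It is not the Modulation Toolbox procedure used in the study: a simple Hilbert envelope/fine-structure decomposition and Butterworth filters stand in for the coherent envelope detection of Schimmel and Atlas (2005), and the filter orders and 3/4-octave slit edges are assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_lowpass_slit(x, fs, fc, cutoff_hz):
    """Band-pass a 3/4-octave 'slit' around fc (Hz), then low-pass filter its
    temporal envelope at cutoff_hz and re-impose it on the fine structure."""
    lo, hi = fc * 2 ** (-0.375), fc * 2 ** 0.375          # 3/4-octave edges
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    slit = sosfiltfilt(sos, x)
    env = np.abs(hilbert(slit))                           # temporal envelope
    fine = slit / np.maximum(env, 1e-12)                  # fine structure
    sos_env = butter(4, cutoff_hz, btype='lowpass', fs=fs, output='sos')
    env_lp = np.maximum(sosfiltfilt(sos_env, env), 0.0)   # filtered envelope
    return env_lp * fine
```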
3 Data Analysis and Results
The data were analyzed in a variety of ways. Consonant identification accuracy declines progressively with decreasing low-pass modulation-frequency cutoff (Table 1). The number of slits also affects consonant recognition accuracy. Consonant identification was also scored in terms of how well a consonant’s phonetic features – voicing, manner and place of articulation – were decoded. When a consonant is correctly identified, its constituent phonetic features are (by definition) also decoded accurately. However, when a consonant is incorrectly recognized, it is rare that all of its constituent phonetic features are incorrectly decoded; one or two features are usually decoded accurately. Consonant perception is usually studied in terms of accuracy for individual phonetic segments. Because consonants are phonetically related to each other, scoring only in terms of the proportion of consonants correct may obscure patterns associated with cross-spectral integration and modulation processing. Confusion matrices of consonantal identification error patterns provide a
straightforward means of delineating how much information associated with constituent phonetic features is transmitted. In order to compute the “true” amount of information associated with each consonant, a bias-neutral metric (such as d′ used in signal detection theory) is required. To compute the amount of information transmitted (Miller and Nicely 1955), the 11 consonants were partitioned into 3 (overlapping) groups according to voicing, articulatory manner and place of articulation. Voicing is a binary distinction, whereas manner and place encompass three class distinctions for the Danish consonants used in this study. Confusion matrices were computed for each phonetic feature (see Christiansen and Greenberg 2005 for details). In essence, each phonetic feature is treated as a quasi-independent information channel. Although a consonant may be identified incorrectly, there may be information pertaining to its constituent phonetic properties that is correctly decoded. Information associated with voicing and manner of articulation is generally decoded accurately even when the consonant is not identified correctly (Fig. 1). In contrast, place of articulation is rarely decoded accurately when the consonant is incorrectly identified. Such analyses demonstrate that consonant identification depends largely on decoding the place-of-articulation dimension correctly.
Fig. 1 Phonetic feature classification as a function of consonant identification accuracy for the same conditions and listeners shown in Table 1. The correlation coefficient (R²) is shown for each phonetic feature
In order to compute the amount of information associated with a specific feature and stimulus condition, it is necessary to calculate the co-variance between a specific stimulus and response category (as a means of neutralizing the effect of response bias). The information associated with voicing, manner and place is computed as follows:

T(c) = -\sum_{i,j} p_{ij} \log \frac{p_i \, p_j}{p_{ij}}    (1)

where T(c) refers to the number of bits per feature transmitted across channel c, p_{ij} is the probability of feature i co-occurring with response j, p_i is the probability of feature i occurring, and p_j is the probability of response j occurring. When the data are plotted in terms of the amount of information transmitted, interesting patterns emerge (Table 2).
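A compact way to evaluate Eq. (1) from a feature-level confusion matrix is sketched below. The consonant-to-feature grouping shown for voicing is illustrative only (the exact class definitions follow Christiansen and Greenberg 2005), and base-2 logarithms are assumed so that T(c) is expressed in bits.

```python
import numpy as np

# Illustrative voicing grouping of the 11 Danish consonants (0 = voiceless,
# 1 = voiced); the study's exact feature definitions may differ.
VOICING = {'p': 0, 't': 0, 'k': 0, 'f': 0, 's': 0,
           'b': 1, 'd': 1, 'g': 1, 'm': 1, 'n': 1, 'v': 1}

def feature_confusions(consonant_counts, labels, feature_map, n_classes):
    """Collapse an 11 x 11 consonant confusion matrix into a feature-level one."""
    out = np.zeros((n_classes, n_classes))
    for i, ci in enumerate(labels):
        for j, cj in enumerate(labels):
            out[feature_map[ci], feature_map[cj]] += consonant_counts[i, j]
    return out

def transmitted_information(counts):
    """T(c) in bits per feature, Eq. (1), from a feature-level count matrix."""
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    nz = p_ij > 0
    return float(-np.sum(p_ij[nz] * np.log2((p_i * p_j)[nz] / p_ij[nz])))
```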
Table 2 The amount of transmitted information (as specified in Eq. 1) computed for each phonetic feature (place, manner, voicing) in conditions where each slit undergoes the same amount of low-pass modulation filtering. The signals contain 1, 2 or 3 spectral slits (whose center frequencies are indicated). Bold cells indicate conditions in which the low-pass modulation filtering results in a significant decline (≥25%) of transmitted information. Cells marked a indicate where cross-spectral integration of transmitted information is more than 50% greater than predicted on the basis of linear summation. Cells marked b indicate where cross-spectral integration is more than 200% greater than predicted on the basis of linear summation

Feature   Low-pass modulation filtering   750    1500   3000   750+1500   1500+3000   750+3000   750+1500+3000
PLACE     All pass                        0.14   0.10   0.09   0.41       0.72 b      0.62 a     1.03 b
          <24 Hz                          0.09   0.13   0.07   0.40 a     0.74 b      0.59 b     1.12 b
          <12 Hz                          0.03   0.05   0.06   0.27 b     0.65 b      0.38 b     0.94 b
          <6 Hz                           0.02   0.01   0.02   0.11 b     0.37 b      0.21 b     0.47 b
          <3 Hz                           0.02   0.01   0.02   0.05 a     0.19 b      0.07 a     0.27 b
MANNER    All pass                        0.58   0.45   0.42   1.04       0.96        1.10       1.24
          <24 Hz                          0.42   0.36   0.31   0.85       1.10        1.00       1.18
          <12 Hz                          0.22   0.22   0.16   0.80 a     0.98 a      0.87 a     1.04 a
          <6 Hz                           0.10   0.09   0.07   0.59 b     0.72 b      0.55 b     0.84 b
          <3 Hz                           0.11   0.06   0.04   0.27 a     0.41 b      0.32 a     0.51 a
VOICE     All pass                        0.55   0.30   0.39   0.79       0.72        0.90       0.94
          <24 Hz                          0.31   0.25   0.30   0.68       0.72        0.87       0.95
          <12 Hz                          0.27   0.23   0.22   0.66       0.77 a      0.79 a     0.94
          <6 Hz                           0.11   0.14   0.12   0.51 a     0.56 a      0.59 a     0.81 a
          <3 Hz                           0.07   0.07   0.04   0.33 b     0.33 a      0.37 b     0.48 a
Information combines differently across the audio spectrum for each phonetic feature. In the absence of low-pass modulation filtering, both voicing and manner information combine linearly for two-slit signals. For three-slit stimuli, voicing information saturates (contains the same amount of information as the two-slit signals), while manner information increases slightly (i.e., exhibits some compression). In contrast, place of articulation combines synergistically (two or three slits contain far more information than linear summation would predict). There is substantially greater than linear summation across slits for virtually all conditions. The amount of place of articulation information transmitted within any single slit is small (substantially less than manner or voicing). Thus, this feature depends most on cross-spectral integration. It is also the feature that is most closely associated with the ability to accurately decode consonant identity (Fig. 1). From such patterns we conclude that cross-spectral integration is particularly important for speech robustness. There is a progressive decline in place and manner information transmitted with low-pass filtering of the modulation spectrum, particularly above 6 Hz for single-slit stimuli. In contrast, voicing information is most sensitive to modulation filtering below 6 Hz. When two or three slits are combined, phonetic feature information is relatively unaffected by modulation filtering as long as modulation frequencies greater than 6 Hz are preserved. When the modulation spectrum is filtered below 6 Hz, cross-spectral integration becomes extremely important for decoding all three phonetic features (voicing, place and manner of articulation). Amplitude-modulation cues present in separate audio frequency regions largely compensate for the low-pass filtering of the modulation spectrum.
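As a worked example of the synergy criterion behind the a and b marks in Table 2, assuming that "X% greater than linear summation" means the multi-slit value exceeds (1 + X/100) times the sum of the corresponding single-slit values:

```python
def synergy_mark(combined, singles):
    """Return 'b' for > 200% above linear summation, 'a' for > 50%, else ''."""
    ratio = combined / sum(singles)
    if ratio > 3.0:
        return 'b'
    if ratio > 1.5:
        return 'a'
    return ''

# Place of articulation, all-pass, 1500 + 3000 Hz slits: 0.72 vs 0.10 + 0.09
print(synergy_mark(0.72, [0.10, 0.09]))   # -> 'b'
```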
4 Conclusions and Significance
Conventional methods of estimating the contribution made by different parts of the audio spectrum and the modulation spectrum fail to dissociate these two dimensions. Nor do they examine the specific contribution made by cross-spectral integration to speech decoding in a quantitative way. We conclude that:
1. Place of articulation information is crucial for accurate consonant recognition.
2. Accurate decoding of place-of-articulation information requires cross-spectral integration across a broad range of acoustic frequencies.
3. Place of articulation information is associated most closely with the modulation spectrum between 6 and 24 Hz. Hence, consonant decoding requires cross-spectral integration of the modulation spectrum above 6 Hz.
4. Voicing is mainly associated with the modulation spectrum below 6 Hz.
5. Manner of articulation is associated with the modulation spectrum between 3 and 24 Hz for single-band stimuli. For signals containing two or more slits, the modulation spectrum below 6 Hz may be particularly important, especially in noisy and reverberant conditions.
6. Cross-spectral integration of modulation patterns is crucial for accurate decoding of spoken language, and therefore is likely to hold the key for improving intelligibility in acoustically challenging environments.
References
Apoux F, Bacon SP (2004) Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise. J Acoust Soc Am 116:1671–1680
Christiansen TU, Greenberg S (2005) Frequency selective filtering of the modulation spectrum and its impact on consonant identification. In: Rasmussen A, Poulsen T (eds) Twenty First Danavox Symposium, pp 585–599
Drullman R, Festen JM, Plomp R (1994) Effect of reducing slow temporal modulations on speech reception. J Acoust Soc Am 95:2670–2680
Greenberg S, Arai T (2004) What are the essential cues for understanding spoken language? IEICE Trans Inf Syst E87-D:1059–1070
Greenberg S, Arai T, Silipo R (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proc 5th Int Conf Spoken Lang Proc, pp 74–77
Houtgast T, Steeneken HJM (1985) A review of the MTF-concept in room acoustics. J Acoust Soc Am 77:1069–1077
Miller GA, Nicely PE (1955) An analysis of perceptual confusions among some English consonants. J Acoust Soc Am 27:338–352
Schimmel S, Atlas L (2005) Coherent envelope detection for modulation filtering of speech. Proc Int Conf Audio, Speech & Signal Proc (ICASSP), pp 221–224
Silipo R, Greenberg S, Arai T (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proc 6th European Conf Speech Comm Tech (Eurospeech-99), pp 2687–2690
56 Articulation Index and Shannon Mutual Information
ARNE LEIJON
Sound and Image Processing Lab., KTH, Stockholm, Sweden
1 Introduction
The Articulation Index (AI), later revised and standardized as the Speech Intelligibility Index (SII), and the Speech Transmission Index (STI) have been successful in predicting speech intelligibility from acoustic measurements. Both approaches calculate the index as a sum of additive audibility contributions from different frequency bands. Allen (2003) noted that a similar additivity property also holds for Shannon’s information-theoretic concept of Channel Capacity. Allen showed that the contributions to channel capacity are (approximately) linearly related to the signal-to-noise ratio (in dB), just like the audibility contributions to the AI, and suggested that the AI is actually a kind of channel-capacity measure. This would provide a fundamental information-theoretical basis for the empirical success of AI theory. Leijon (2002) observed that the rate of auditory speech information is mathematically guaranteed to set a definite upper bound on speech-recognition performance, and exemplified how the performance-vs-information bound depends strongly on the entropy of the speech test material. He also suggested a way to estimate the rate of information conveyed from a sequence of discrete “phonetic” categories, supposedly coded in the acoustic speech signal, to the stream of sensory data available for central perceptual processing. A problem with that approach was that the analysis used only the acoustic input signal, and the “phonetic” categories were not real phonetic labels, but only statistical clusters automatically defined by the calculation procedure. This paper presents a way to avoid this problem, still without requiring a phonetically labelled speech signal. We now estimate the rate of information conveyed from the continuous-valued acoustic clean speech signal to the corresponding stream of sensory data. This approach allows us to test the hypothesis of a direct relation between SII (AI) and the acoustic-to-auditory information rate. The calculation does not make any a priori assumptions about additivity across frequency bands.
The acoustic-to-auditory information rate is estimated for low-pass and high-pass filtered speech and for speech masked with speech-shaped noise. If there is indeed a direct proportionality between SII and acoustic-to-auditory information rate, we should find that both the SII and the information rate are affected similarly by these acoustic degradations.
2 Theory and Methods
A standardised Swedish speech test material (Hagerman 1982) is used, with a separate channel of noise that has the same long-time spectrum as the speech. The processing steps are schematically shown in Fig. 1. The recorded clean speech is analysed into a sequence of short-time spectra (in dB), with 20-ms steps and 50% segment overlap. A hidden Markov model (HMM) is then trained to represent the statistical characteristics of this sequence. (For the present calculation the HMM had 40 states, each associated with a GMM with 3 sub-states, as defined in Sect. 2.2.) The HMM is then modified, first to account for effects of added noise, and then for the auditory signal transformations, including the statistical variability of sensory data that limits auditory spectral discrimination.
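A rough sketch of this analysis and training stage is given below. The hmmlearn package and the diagonal-covariance choice are illustrative stand-ins rather than the software actually used; the frame length follows from the 20-ms step with 50% overlap described above.

```python
import numpy as np
from scipy.signal import stft
from hmmlearn.hmm import GMMHMM

def short_time_log_spectra(x, fs, frame_ms=40, step_ms=20):
    """Sequence of short-time spectra in dB (20-ms steps, 50% overlap)."""
    nper = int(fs * frame_ms / 1000)
    _, _, X = stft(x, fs=fs, nperseg=nper, noverlap=nper // 2)
    return 20 * np.log10(np.abs(X).T + 1e-10)      # frames x frequency bins

def train_speech_hmm(log_spectra):
    """Ergodic HMM with 40 states and 3 Gaussian mixture components per state,
    as in the calculation described above (covariance type is an assumption)."""
    model = GMMHMM(n_components=40, n_mix=3, covariance_type='diag', n_iter=50)
    model.fit(log_spectra)
    return model
```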
2.1 Rate of Mutual Information
Let us denote the sequence of short-time spectra for the clean speech as a discrete-time random sequence of (vector-valued) signal samples, (X_1, …, X_t, …). When this sequence is input to the transmission channel, the corresponding stream of channel output data is denoted (R_1, …, R_t, …). The Rate of Mutual Information (MI) specifies the amount of information successfully transmitted through the channel per time unit. With stationary
Fig. 1 Block diagram of processing used to estimate the rate of mutual information between a recorded clean speech signal and the corresponding stream of sensory data. The recorded speech and noise signals are analysed into sequences of short-time spectra (in dB). The trained clean-speech HMM is transformed, first by adding acoustic noise, and then by an auditory model that includes the effects of sensory noise
random sequences and a memory-less channel, the MI rate (in bits per sample interval) can be defined (e.g., Cover and Thomas 1991) as

r_{XR} = \lim_{t\to\infty}\bigl[\, h(R_t \mid R_1,\ldots,R_{t-1}) - h(R_t \mid X_t) \,\bigr] = \lim_{t\to\infty} E\!\left[ \log_2 \frac{f(R_t \mid X_t)}{f(R_t \mid R_1,\ldots,R_{t-1})} \right]    (1)
where h(·) is the differential entropy function, defined using the expectation operator E[·] and the probability density f(·). As the differential entropy is a logarithmic measure of variability, the mutual information rate can also be seen as a measure of Modulation Transfer. The Channel Capacity is the highest possible MI rate that can be achieved with any statistical distribution of the input sequences. When we have a given class of input signals, it is more interesting to calculate the channel MI rate for that particular signal class. The auditory channel capacity may also be interesting, but not for the present purpose.
2.2 Signals Represented by Hidden Markov Models (HMM)
The probabilistic dependence over time, expressed in the first term of Eq. (1), is important in our application because it describes the temporal redundancy in any speech signal. One convenient method to handle this problem is to describe the random sequences by ergodic hidden Markov models (Rabiner 1989) with Gaussian Mixture Models (GMM) for the observable output. The HMM-GMM signal model implies that each observable sequence element is obtained by three random steps. First, a hidden discrete state S_t ∈ {1, …, N} is chosen, with known time-invariant conditional (transition) probabilities a_{ij} = P(S_t = j | S_{t−1} = i). Second, a discrete sub-state U_t ∈ {1, …, M} is chosen with time-invariant conditional probabilities w_{jm} = P(U_t = m | S_t = j). Third, the observable X_t is chosen from a multidimensional Gaussian probability distribution, with time-invariant mean vector \mu^X_{jm} and covariance matrix C^X_{jm}, for state j and sub-state m. Using this notation, the MI rate can be conveniently calculated as

r_{XR} = r_{SUR} + r_{XR|SU}    (2)

where

r_{SUR} = \lim_{t\to\infty}\bigl[\, h(R_t \mid R_1,\ldots,R_{t-1}) - h(R_t \mid S_t, U_t) \,\bigr]    (3)

r_{XR|SU} = h(R_t \mid S_t, U_t) - h(R_t \mid X_t, S_t, U_t)    (4)
Here, r_{SUR} represents the MI rate from the sequence of discrete states and sub-states to the output, and r_{XR|SU} shows the information about the X-pattern variations within each sub-state. This formulation is exactly equivalent to Eq. (1), because R_t depends on (S_t, U_t) only via X_t, and therefore

h(R_t \mid X_t, S_t, U_t) = h(R_t \mid X_t)    (5)

2.3 Stochastic Integration
The MI rate component r_{SUR} in Eq. (3) is estimated by replacing the exact expectation by stochastic averaging. The HMM is used to generate long random state and sub-state sequences, and a corresponding output sequence, all with length T = 10,000. The same HMM is then used again to analyze this output sequence with the Forward Algorithm (Rabiner 1989), which calculates the sequence of conditional probability density values needed in the first term of Eq. (3), for all 1 ≤ t ≤ T. Several such sequences were analyzed, until the standard deviation of the final average MI rate estimate was less than 1 bit/s.
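The following sketch illustrates this stochastic estimate of r_SUR for a simplified HMM in which the GMM sub-states are folded into the states and each state has a single Gaussian emission; the forward recursion supplies the predictive densities needed in the first term of Eq. (3), and the Gaussian conditional entropy gives the second term. All names and the single-Gaussian simplification are illustrative, and in practice several generated sequences would be averaged as described above.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def r_sur_monte_carlo(A, means, covs, pi, T=10000, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of r_SUR (Eq. 3), in bits per sample interval."""
    N = len(pi)
    # --- generate one state sequence and the corresponding outputs
    s = np.empty(T, dtype=int)
    s[0] = rng.choice(N, p=pi)
    for t in range(1, T):
        s[t] = rng.choice(N, p=A[s[t - 1]])
    R = np.array([rng.multivariate_normal(means[j], covs[j]) for j in s])
    # --- forward recursion for the predictive densities f(R_t | R_1..R_{t-1})
    log_pred = np.zeros(T)
    predict = pi.copy()                          # P(S_t | R_1..R_{t-1})
    for t in range(T):
        like = np.array([mvn.pdf(R[t], means[j], covs[j]) for j in range(N)])
        f_pred = predict @ like                  # predictive density value
        log_pred[t] = np.log2(f_pred)
        posterior = predict * like / f_pred      # P(S_t | R_1..R_t)
        predict = posterior @ A                  # P(S_{t+1} | R_1..R_t)
    # --- h(R_t | S_t): closed form for Gaussian emissions
    h_cond = np.array([0.5 * np.log2(np.linalg.det(2 * np.pi * np.e * covs[j]))
                       for j in range(N)])
    return -log_pred.mean() - h_cond[s].mean()
```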
2.4 HMM Transformations
As indicated in Fig. 1, we transform the clean-speech HMM into a corresponding HMM describing the statistical characteristics of sensory data. Both HMMs are assumed to have the same hidden state and sub-state sequences. Only the Gaussian mean and covariance parameters for the observable elements are modified by the channel. The non-linear transformation is approximated as a locally linear expansion around the mean for each state j and sub-state m, as

R_t = g(\mu^X_{jm}) + D\,(X_t - \mu^X_{jm}) + W_t    (6)

where D is the partial-derivative matrix of the non-linear transformation g(·), and W_t is an additive noise that represents all random variations that are independent of the input signal. Thus, the transformed conditional mean and covariance are

\mu^R_{jm} = g(\mu^X_{jm}), \qquad C^R_{jm} = D\, C^X_{jm}\, D^T + C^W_{jm}    (7)

Using this approximation, and the HMM stationary state probabilities p_j and sub-state probabilities w_{jm}, the MI rate component in Eq. (4) is obtained analytically as

r_{XR|SU} = \frac{1}{2} \sum_j p_j \sum_m w_{jm} \log_2 \frac{\det C^R_{jm}}{\det C^W_{jm}}    (8)
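Eq. (8) can be evaluated directly from the transformed model parameters, for example as in the following sketch (array layout and names are illustrative):

```python
import numpy as np

def r_xr_given_su(p_state, w_sub, C_R, C_W):
    """Analytic MI-rate component of Eq. (8), in bits per sample interval.
    p_state[j]  : stationary state probabilities
    w_sub[j][m] : sub-state probabilities within state j
    C_R[j][m]   : transformed output covariance (Eq. 7)
    C_W[j][m]   : covariance of the additive (sensory) noise"""
    rate = 0.0
    for j, pj in enumerate(p_state):
        for m, wjm in enumerate(w_sub[j]):
            _, logdet_r = np.linalg.slogdet(C_R[j][m])
            _, logdet_w = np.linalg.slogdet(C_W[j][m])
            rate += 0.5 * pj * wjm * (logdet_r - logdet_w) / np.log(2)
    return rate
```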
2.5 Auditory Model
The auditory model includes effects of outer ear transmission to the eardrum, middle-ear transmission, non-linear cochlear filtering and outer hair cell amplification, in a similar way to Moore and Glasberg (2004). For each input short-time spectrum the model calculates a corresponding output excitation-level pattern. Cochlear filtering is modelled by a combination of two roex-shaped filters, a linear tail part and a non-linear peak filter with gain depending on the peak-filter output. Details were described by Leijon (2002). The peak filters are symmetric with normal ERB (Moore 2003). The bandwidth is independent of input level (Baker and Rosen 2006). The tail filter has a fixed roex slope parameter p = 8 towards the low-frequency side. The maximum OHC gain was 50 dB at high frequencies, and reduced at frequencies below 500 Hz (Moore and Glasberg 2004, Fig. 2). Excitation patterns were calculated at cochlear “places” in the frequency range 50–15,000 Hz, with a resolution of 0.5 ERB between “place” samples. The additive sensory noise signal was adapted to represent the neural and perceptual variability that limits spectral discrimination. The noise components were statistically independent across both time and cochlear “place”, with a variance adjusted to reproduce normal intensity discrimination for broadband noise (Leijon 2002; Houtsma et al. 1980).
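As an illustration of the filter shape involved, the sketch below computes a roex-weighted excitation pattern from a short-time power spectrum. It uses a single symmetric roex peak filter per place, with an assumed slope parameter, and omits the level-dependent OHC gain and the separate p = 8 tail filter of the full model, so the numbers it produces are only indicative.

```python
import numpy as np

def roex_weight(f, fc, p):
    """roex(p) weighting W(g) = (1 + p*g) * exp(-p*g), with g = |f - fc| / fc."""
    g = np.abs(f - fc) / fc
    return (1.0 + p * g) * np.exp(-p * g)

def excitation_pattern(freqs, power_spectrum, places, p_peak=25.0):
    """Excitation level (dB) at each cochlear 'place' (centre frequency, Hz),
    summing roex-weighted power across the input spectrum."""
    exc = []
    for fc in places:
        w = roex_weight(freqs, fc, p_peak)
        exc.append(10 * np.log10(np.sum(w * power_spectrum) + 1e-20))
    return np.array(exc)
```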
Fig. 2 Approximate acoustic-to-auditory information rate from a clean speech short-time spectrum sequence to the sequence of sensory patterns elicited by low-pass and high-pass filtered versions of the signal, as a function of the cut-off frequency. Normal hearing was simulated. Speech was presented at 70-dB SPL in quiet (upper curves), and with speech-shaped noise at 60- (middle curves) and 70-dB SPL (lower curves)
Fig. 3 Approximate acoustic-to-auditory information rate (thick line with circles) and SII (thin line) for clean speech at 70-dB SPL, mixed with speech-shaped noise, as a function of the signal-to-noise ratio, with normal hearing
3 Results
As expected, the estimated speech-to-sensory MI rate increases with the bandwidth of filtered speech (Fig. 2). The cross-over frequency, where low-pass and high-pass filtered versions convey equal amounts of information, is about 1200 Hz for speech in quiet and about 800 Hz at 0 dB SNR. The corresponding cross-over frequency of the SII is 1550 Hz for average speech (ANSI 1997). If the speech information carried by non-overlapping frequency bands had been additive, the MI rate at the cross-over point should have been about half the full-band rate. This is obviously not the case. The acoustic speech information is completely destroyed by masking noise at speech-to-noise ratios below −15 dB (Fig. 3). The speech-to-sensory MI rate increases non-linearly with SNR in the range −15 to +15 dB, and continues to increase at higher SNRs, whereas the SII just approaches 1 asymptotically.
4 Discussion
The rate of speech information transmitted to the brain by the stream of sensory-neural data certainly provides an upper limit on the listener’s ability to recognize speech. Still, the present results show that there is no direct
proportionality relation between the SII and the acoustic-to-auditory information rate. For speech in quiet the estimated rate is much higher than what is theoretically needed to decode the speech message. Speech-spectrum variability at very low levels contributes to the information rate but probably not to intelligibility. SII contributions from different frequency bands are assumed to be additive. However, the acoustic speech information in non-overlapping high-pass and low-pass frequency bands is not additive, most probably because the statistical variations in speech spectra are correlated across frequency bands. The present method used a parametric model to estimate acoustic-to-auditory mutual information. There are also other, non-parametric ways to estimate differential entropy from an observed data sample (Beirlant et al. 1997; Nilsson and Kleijn 2004). However, it would have been difficult to include the temporal redundancy in a non-parametric method, given the rather small speech sample provided by the speech-recognition test material used here.
5 Conclusion
A parametric HMM-GMM-based method was used to estimate the rate of information transmitted from an acoustic clean speech signal to the corresponding stream of sensory data. Although the sensory information rate sets a definite upper limit on speech-recognition performance, the Speech Intelligibility Index (based on Articulation Theory) is not directly proportional to the acoustic-to-auditory information rate. The acoustic speech information in non-overlapping high-pass and low-pass frequency bands is not additive.
References
Allen JB (2003) The articulation index is a Shannon channel capacity. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics, and models. Proc 13th Int Symp on Hearing. Springer, Berlin Heidelberg New York, pp 314–320
ANSI (1997) American national standard methods for the calculation of the speech intelligibility index. ANSI S3.5. American National Standards Institute, New York
Baker RJ, Rosen S (2006) Auditory filter nonlinearity across frequency using simultaneous notched-noise masking. J Acoust Soc Am 119(1):454–462
Beirlant J, Dudewicz E, Györfi L, van der Meulen E (1997) Nonparametric entropy estimation: an overview. Int J Math Stat Sci 6(1):17–39
Cover T, Thomas J (1991) Elements of information theory. Wiley, New York
Hagerman B (1982) Sentences for testing speech intelligibility in noise. PhD Thesis, KI-Dept of Technical Audiology, Stockholm, pp 79–87
Houtsma A, Durlach N, Braida L (1980) Intensity perception. XI. Experimental results on the relation of intensity resolution to loudness matching. J Acoust Soc Am 68:807–813
Leijon A (2002) Estimation of auditory information transmission capacity using a hidden Markov model of speech stimuli. Acust Acta Acust 88(3):423–432
Moore B (2003) An introduction to the psychology of hearing, 5th edn. Academic Press, London
Moore BC, Glasberg BR (2004) A revised model of loudness perception applied to cochlear hearing loss. Hear Res 188:70–88
Nilsson M, Kleijn WB (2004) Shannon entropy estimation based on high-rate quantization theory. In: Proc XII European Signal Processing Conf (EUSIPCO), pp 1753–1756
Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
Comment by Greenberg
Do you believe that the results of your study would be affected by changing the spectral and temporal granularity of the speech representation? In other words, instead of using 10-ms time windows and 0.5-ERB sample points in the auditory spectrum, a coarser representation (e.g., 40-ms windows and a low-order Mel frequency cepstral or perceptual linear prediction representation) might be used. In automatic speech recognition, coarser representations (up to a certain limit) often improve performance due to their ability to generalize across talkers and speaking conditions.
Reply
I believe the main conclusions are not significantly affected by the chosen model structure or the signal representation. The present results were obtained with 20-ms time windows and 0.5-ERB resolution in the acoustic spectral analysis. I have also tested 10-ms windows and a more complicated HMM structure, as well as a low-order cepstral representation of the signal spectra. There were some small differences in the numerical information rates for speech in quiet, but the main effects of low-pass and high-pass filtering were very similar, regardless of the model.
57 Perceptual Compensation for Reverberation: Effects of ‘Noise-Like’ and ‘Tonal’ Contexts
ANTHONY WATKINS AND SIMON MAKIN
Department of Psychology, School of Psychology and Clinical Language Sciences, The University of Reading
1 Introduction
Listeners’ perceptual decisions about the characteristics of speech sounds that are heard in rooms appear to utilize information about the room’s acoustic properties. This is shown in experiments on perceptual compensation for room reverberation, where the mechanism responsible seems to pick up information about reverberation from a running-speech ‘context’, and uses it to ameliorate effects of the reverberation on neighbouring test words (Watkins 1992, 2005). Some of these compensation experiments have used the [s] vs [st] distinction between “sir” and “stir” test-words and the context phrase “next you’ll get _ to click on”. When reverberation in the context is kept at a minimal level, reverberation added to test words generally makes them sound more like “sir”. This effect arises because of the smoothly-decreasing energy decay that is typical of reverberation in box-shaped rooms (Schroeder 1965; Allen and Berkley 1979). Such reverberation adds a ‘tail’ to the [s], which obscures the gap that is a major cue to the presence of a [t]. If this happens, compensation may be seen when the amount of reverberation in the context is also increased, to a level close to the amount in the test word. The effect of this compensation is to make the test words sound more like “stir” again.
The present experiments ask about the characteristics of context sounds that are important in effecting compensation. Sharp offsets in a sound are a likely source of information about reverberation (Stecker and Hafter 2000) because they will be prominent if the sound is relatively ‘dry’, or extended to form smooth ‘tails’ when there is more reverberation. Sharp offsets can occur at the ends of sounds, but they can also occur in narrow frequency-bands during continuous speech if the band’s power falls abruptly due to a change in the short-term spectrum (‘spectral transition’, Furui 1986). Narrow-band offsets of this type are therefore likely to be prominent in continuous speech, as the inherently dynamic origins of this signal give rise to numerous spectral transitions. Such offsets might therefore provide listeners with information about a room’s
Fig. 1 a Temporal envelopes (upper 20 dB) of the dry context, “next you’ll get ...” played through auditory filters at two centre-frequencies. Prominent offsets are arrowed. b Temporal envelope of the same context in room reverberation at a 5-m distance. This trace is superimposed on the dry trace, and the area between them is filled at the smooth ‘tails’ that are added at offsets
acoustic properties, because the ‘auditory filters’ that characterize peripheral auditory processing provide a narrow-band analysis of incoming sounds. These ideas are illustrated in Fig. 1, which shows the temporal envelopes of speech played through auditory filters centred at two frequencies. There are sharp offsets in these temporal envelopes when the sound is relatively dry. These offsets are at different points in time in the two bands, indicating that they have arisen through changes in the short-term spectrum. Figure 1b shows how added reverberation extends these offsets to form smooth tails. The ‘single-band’ contexts used in the present experiments do not possess any of the spectral transitions that inhere in speech. However, these sounds do have fluctuations in their temporal envelopes that are comparable to those found in a narrow band of speech. These temporal envelopes were obtained by playing the speech context through auditory-filter simulations, as illustrated in Fig. 2. Spectral transitions are present to an extent in three- and five-band ‘tonal’ contexts that are the sum of single bands at different centre-frequencies. Tonal contexts are noise bands with auditory-filter bandwidths, while ‘noise-like’ contexts are wide-band noises with the same overall temporal envelopes as their tonal counterparts.
2 Method
Listening conditions were chosen with the aim of providing substantial compensation effects. Accordingly, monaural presentation was used, as the effects of adding reverberation to test words are then larger than in dichotic conditions (Watkins 2005).
Speech context → Convolve with gammatone function → Signal correlated noise → Convolve with reversed gammatone function → Tonal context → Signal correlated noise → Speech-shaping filter → Noise-like context
Fig. 2 Single-band contexts
Fig. 3 Test-word continuum
Test words were drawn from a continuum of steps between “sir” and “stir”. Clear “stir” sounds were obtained by amplitude modulation of a dry recording of “sir”, using a modulation function that interchanged the temporal envelopes of the two words’ waveforms. Point-wise interpolations between these end-point envelopes gave intervening steps to form the 11-step continuum of test-words, as shown in Fig. 3. These test words were used to measure distorting effects of adding reverberation, as well as to measure any perceptual compensation for this distortion.
Accordingly, the category boundary was measured, which is the continuum step where listeners switch from “sir”- to “stir”-identifications. Category boundaries were obtained from groups of six listeners who were all asked to identify test words from each continuum step. Each of the 11 steps of a continuum was presented 3 times in a randomized sequence. The step corresponding to the category boundary was then found from the total number of “sir” responses to all of a continuum’s steps, by dividing this total by 3 before subtracting 0.5, giving a boundary step-number between −0.5 and 10.5. Identification functions and category boundaries obtained for one listener in speech conditions are shown in Fig. 4.
To obtain different amounts of reverberation in a ‘natural’ way, room impulse responses were obtained from recordings made at a dummy head’s ear (KEMAR) in an office room where a dummy head ‘talker’ (Bruel and Kjaer 4128) played the measurement sound (a maximum length sequence, Gardner and Martin 1994). To give the different amounts of reverberation, the transducers faced each other, while the talker’s position was varied to give distances from the listener of 0.32 m or 10 m.
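The boundary computation described above amounts to the following minimal sketch; the example listener shown is hypothetical and not drawn from the data.

```python
def category_boundary(sir_counts_per_step):
    """Boundary step number from the per-step "sir" counts (11 steps,
    3 presentations each): (total "sir" responses) / 3 - 0.5."""
    total_sir = sum(sir_counts_per_step)
    return total_sir / 3.0 - 0.5

# e.g. a listener who answers "sir" for steps 0-4 on every trial:
print(category_boundary([3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0]))   # -> 4.5
```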
Fig. 4 Identification functions
The amount of reverberation at these distances is indicated by the time taken for the room’s impulse-response energy to decay by 10 dB, which is the early decay time, EDT (ISO 3382, 1997). At 10 m the A-weighted EDT was 0.14 s, while at 0.32 m this EDT was less than 0.01 s.
The context, “next you’ll get _ to click on”, was originally a dry recording, which contained a test word that was excised with a waveform editor. Test words were subsequently re-embedded in the context phrase, or in a processed version of this speech. This re-embedding was performed by adding the context’s waveform to the test word’s waveform. Before the addition, silent sections were added to preserve temporal alignment, and to allow different amounts of reverberation to be separately introduced into the two sounds.
To obtain contexts that have the narrow-band temporal envelope of the original speech, the signals were first played through an auditory-filter simulation [gammatone, n = 4, with a bandwidth equal to the ‘Cambridge ERB’ (Hartmann 1998)]. This filtering was followed by an ‘SCN operation’, which turns the sound into signal correlated noise by reversing the polarity of a randomly selected half of the signal’s samples. The resulting wide-band noise has the temporal envelope that arises in one auditory filter when the context’s speech is heard. A second stage of processing was then performed on this noise so that its temporal envelope would occur in the auditory filter of a listener. This was done by playing the signal through a version of the gammatone filter whose impulse response was reversed. This results in a single-band ‘tonal’ version of the context that has the bandwidth and centre frequency of the auditory filter used.
Single-band tonal contexts were made with auditory filters centred at 0.25 kHz, 1 kHz and 4 kHz. Three-band tonal contexts were made by adding together single-band contexts with auditory filters centred at 0.5 kHz, 1 kHz and 2 kHz. Five-band tonal contexts were made by adding single-band contexts with auditory filters centred at 0.25 kHz, 0.5 kHz, 1 kHz, 2 kHz and 4 kHz. Both the three- and five-band contexts were subsequently played through a ‘speech shaping’ filter whose response was the long-term average spectrum of the original speech signal. Noise-like contexts were made by applying a second SCN operation to the corresponding tonal contexts and then playing the sounds through the speech-shaping filter. Consequently, the noise-like contexts have the same wide-band temporal envelopes as their tonal counterparts.
Reverberation was introduced into the dry contexts and test words by convolution with the room impulse responses. To obtain signals at the listener’s eardrum that match the signal at KEMAR’s ear, the frequency-response characteristics of the dummy head talker and of the listener’s headphones were removed using appropriate inverse filters.
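A minimal sketch of the single-band ‘tonal’ context processing chain (gammatone filtering, the SCN polarity-flipping operation, and filtering with the time-reversed gammatone) is given below; the gammatone implementation, bandwidth constants and normalization are illustrative rather than the exact filters used in the study.

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, dur=0.05):
    """Impulse response of an n-th order gammatone filter; the bandwidth is
    set from an ERB approximation (24.7 + 0.108*fc) scaled by 1.019."""
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * (24.7 + 0.108 * fc)
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))

def scn(x, rng=np.random.default_rng(0)):
    """Signal-correlated noise: flip the polarity of a randomly chosen half
    of the samples, preserving the temporal envelope."""
    flip = rng.permutation(len(x)) < len(x) // 2
    y = x.copy()
    y[flip] *= -1
    return y

def single_band_tonal_context(speech, fs, fc):
    """Forward gammatone -> SCN -> time-reversed gammatone, as described above."""
    g = gammatone_ir(fc, fs)
    band = np.convolve(speech, g, mode='same')
    return np.convolve(scn(band), g[::-1], mode='same')
```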
context is also increased, to a level close to the amount in the test word. The compensation moves the category boundary back down towards “sir”, to a step close to that found when test words have a minimal amount of reflected sound.
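For readers who want to experiment with this kind of context processing, the following is a minimal numpy sketch of the single-band 'tonal' context construction described above: forward gammatone filtering, the SCN polarity-flipping operation, then filtering through the time-reversed gammatone. The gammatone implementation, sampling rate and the random stand-in for the context signal are illustrative assumptions rather than the authors' code; three- and five-band contexts would simply be sums of such single-band outputs at the centre frequencies listed above.

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, dur=0.05):
    """4th-order gammatone impulse response with a 'Cambridge ERB' bandwidth
    (ERB = 24.7 + 0.108*fc; bandwidth factor 1.019), unit-energy normalised."""
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * (24.7 + 0.108 * fc)
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))

def scn(x, rng):
    """'SCN operation': flip the polarity of a randomly selected half of the samples,
    giving signal-correlated noise with the same temporal envelope as x."""
    flip = rng.permutation(len(x)) < len(x) // 2
    y = x.copy()
    y[flip] *= -1.0
    return y

def single_band_tonal_context(context, fs, fc, rng=None):
    """Gammatone analysis -> SCN -> time-reversed gammatone (re-band-limiting)."""
    rng = rng or np.random.default_rng(0)
    g = gammatone_ir(fc, fs)
    narrow = np.convolve(context, g, mode="same")     # envelope in one auditory filter
    noise = scn(narrow, rng)                          # wide-band, same envelope
    return np.convolve(noise, g[::-1], mode="same")   # reversed impulse response

if __name__ == "__main__":
    fs = 16000
    speech_like = np.random.default_rng(1).standard_normal(fs)  # stand-in for the dry context
    tonal = single_band_tonal_context(speech_like, fs, fc=1000.0)
    print(tonal.shape)
```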
3 Results
Means and standard errors of category boundaries in conditions with speech contexts are shown in Fig. 5. The compensation effect, defined in Fig. 5, is shown for conditions with the tonal contexts in Table 1, and for conditions with noise-like contexts in Table 2.
Fig. 5 Category boundaries
Table 1 Compensation effects with tonal contexts

Tonal context                 Mean compensation   Standard error   t(5)   p if <0.05, else n.s.
Single band, fc = 0.25 kHz    0.22                0.35             0.63   n.s.
Single band, fc = 1 kHz       1.11                0.41             2.71   <0.05
Single band, fc = 4 kHz       −0.78               0.32             2.44   n.s.
Three-band                    2.44                0.44             5.61   <0.01
Five-band                     3.17                0.73             4.31   <0.01
4 Conclusions
The perceptual compensation mechanism seems to be able to pick up information about reverberation not only from running speech but also from other types of sounds that form a context for test words. In these cases, the effects of reverberation on neighbouring words are reduced when the context's reverberation is increased to match the test word's reverberation. Contexts that are effective in this respect include broad-band 'noise-like' sounds that have a steady spectrum with no changes in the shape of the short-term spectral envelope over time. In addition, certain 'tonal' contexts can be effective, as long as they have several component frequency-bands. Single-band 'tonal' contexts were sharply-filtered noise, with the centre frequency and corresponding bandwidth of an auditory filter. These sounds were processed to give the temporal envelope found in the auditory filter concerned when the speech context was played through it. Results with these sounds indicate that they give little or no compensation at the centre frequencies that were tested. Nevertheless, the temporal envelope of each of these sounds does bring about compensation when it is heard in a broad range of the ear's frequency channels, as shown by the results with the broad-band, 'noise-like' versions of these contexts. These findings suggest that compensation is confined to the frequency region occupied by the context, which leaves the bulk of the test-word's frequency-content unaffected by a single-band tonal context. The three- and five-band tonal contexts were more 'speech-like', as they were the sum of single-band contexts using auditory filters with different centre frequencies. Results with these contexts indicate that they are effective in generating compensation, and that the five-band context is the most effective. When the overall, broad-band temporal envelope of the five-band context was heard in a wide range of the ear's frequency channels, with its 'noise-like' counterpart, the compensation effect diminished. The compensation in the five-band noise-like condition was less than the compensation in the corresponding tonal condition, and it was less than the compensation in any of the single-band noise-like conditions. These results are consistent with the idea that compensation is informed by the presence or absence of sharp offsets in the context's temporal envelope (Watkins 2005), which are less prominent in the broad-band temporal envelope of a five-band context than they are in a narrow-band temporal envelope from this sound. This 'smoothing' of the broad-band temporal envelope would seem to be a straightforward consequence of the spectro-temporal fluctuations that inhere in speech, which give imperfectly correlated temporal envelopes in different frequency bands. Consequently, temporal envelopes of speech in the ear's numerous frequency channels are able to give more information about a room's acoustic properties than is apparent from the broad-band temporal envelope of the signal.
Table 2 Compensation effects with noise-like contexts

Noise-like context            Mean compensation   Standard error   t(5)    p if <0.05, else n.s.
Single band, fc = 0.25 kHz    3.39                0.34             10.03   <0.001
Single band, fc = 1 kHz       2.33                0.32             7.25    <0.001
Single band, fc = 4 kHz       3.39                0.26             12.83   <0.001
Three-band                    2.61                0.79             3.30    <0.05
Five-band                     2.06                0.36             5.72    <0.01
Speech contexts tended to be more effective than any of the ‘tonal’ or ‘noise-like’ contexts, giving larger compensation effects. This probably reflects the wide range of frequency channels available when listening to speech. The temporal envelopes in some of these channels might well be more informative about the presence of reverberation than any of the channels that were selected for study in the present experiments. Acknowledgment. This research was supported by a grant to the first author from EPSRC.
References

Allen JB, Berkley DA (1979) Image method for efficiently simulating small-room acoustics. J Acoust Soc Am 62:943–950
Furui S (1986) On the role of spectral transition for speech perception. J Acoust Soc Am 80:1016–1025
Gardner B, Martin K (1994) HRTF measurements of a KEMAR dummy-head microphone. Perceptual Computing Technical Report #280, MIT Media Lab
Hartmann WM (1998) Signals, sound and sensation. Springer, Berlin Heidelberg New York
ISO (1997) Acoustics – measurement of the reverberation time of rooms with reference to other acoustical parameters. ISO 3382. International Organization for Standardization, Geneva
Schroeder MR (1965) New method of measuring reverberation time. J Acoust Soc Am 37:409–412
Stecker GC, Hafter ER (2000) An effect of temporal asymmetry on loudness. J Acoust Soc Am 107:3358–3368
Watkins AJ (1992) Perceptual compensation for effects of reverberation on amplitude-envelope cues to the 'slay'–'splay' distinction. Proc Inst Acoust 14:125–132
Watkins AJ (2005) Perceptual compensation for effects of reverberation in speech identification. J Acoust Soc Am 118:249–262
58 Towards Predicting Consonant Confusions of Degraded Speech O. GHITZA1, D. MESSING2, L. DELHORNE2, L. BRAIDA2, E. BRUCKERT1, AND M. SONDHI3
1 Introduction
The work described here arose from the need to understand and predict speech confusions caused by acoustic interference and by hearing impairment. Current predictors of speech intelligibility are inadequate for making such predictions (even for normal-hearing listeners). The Articulation Index, and related measures, STI and SII, are geared to predicting speech intelligibility. But such measures only predict average intelligibility, not error patterns, and they make predictions for a limited set of acoustic conditions (linear filtering, additive noise, reverberation). We aim at predicting consonant confusions made by normal-hearing listeners listening to degraded speech. Our prediction engine comprises an efferent-inspired peripheral auditory model (PAM) connected to a template-match circuit (TMC) based upon basic concepts of neural processing. The extent to which this engine is an accurate model of auditory perception will be measured by its ability to predict consonant confusions in the presence of noise. The approach we have taken involves two separate steps. First, we tune the parameters of the PAM in isolation from the TMC. We then freeze the resulting PAM and use it to tune the parameters of the TMC. In Sect. 2 we describe a closed-loop model of the auditory periphery that comprises a nonlinear model of the cochlea (Goldstein 1990) with efferent-inspired feedback. To adjust the parameters of the PAM with minimal interference from the TMC, we use confusion patterns for speech segments generated in a paradigm with a minimal cognitive load (DRT; Voiers 1983). To further reduce PAM-TMC interaction we have synthesized DRT word-pairs, restricting stimulus differences to the initial diphones. In Sect. 3 we describe initial steps in a study towards predicting confusions of naturally spoken diphones, i.e. tokens that inherently exhibit phonemic variability. We describe a TMC inspired by principles of cortical neural processing (Hopfield 2004). A desirable property of the circuit is insensitivity to time-scale variations of the input stimuli. We demonstrate the validity of this hypothesis in the context of the DRT consonant discrimination task.
1 Sensimetrics Corporation, Somerville, Massachusetts, USA, [email protected]
2 Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
3 Avaya Research Laboratory, Basking Ridge, New Jersey, USA
2 Peripheral Auditory Model (PAM)
We have developed a closed-loop model of the auditory periphery (PAM) that was inspired by current evidence about the role of the efferent system in regulating the operating point of the cochlea. This regulation results in an auditory nerve (AN) representation that is less sensitive to changes in the environmental conditions. In implementing the PAM we use a bank of overlapping cochlear channels uniformly distributed along the ERB scale, four channels per ERB. Each cochlear channel comprises a nonlinear filter and a generic model of the Inner Hair Cell (IHC) – half-wave rectification followed by lowpass filtering, representing the reduction of synchrony with CF. The dynamic range of the simulated IHC response is restricted to a dynamic-range window (DRW), representing the observed dynamic range at the AN level. The filter is Goldstein's model of nonlinear cochlear mechanics (MBPNL; Goldstein 1990). This model operates in the time domain and changes its gain and bandwidth with changes in the input intensity, in accordance with observed physiological and psychophysical behavior. The model is shown in Fig. 1. The lower path (H1/H2) is a compressive nonlinear filter that represents the sensitive, narrow-band nonlinearity at the tip of the basilar-membrane tuning curves. The upper path (H3/H2) is a linear filter that represents the insensitive, broad-band linear tail response of basilar-membrane tuning curves. A parameter G controls the gain of the tip of the basilar-membrane tuning curves. To best mimic psychophysical tuning curves of a healthy cochlea in quiet, the tip gain is set to G = 40 dB (Goldstein 1990). The "iso-input" frequency response of an MBPNL filter at a CF of 3400 Hz is shown in Fig. 2, upper-left panel.
Fig. 1 Goldstein’s MBPNL model (Goldstein1990)
Fig. 2 “Iso-input” frequency response, CF = 3400 Hz. Inside box are input levels in dB SPL
As for the efferent-inspired part of the model, we mimic the effect of the Medial Olivocochlear efferent path (MOC). Morphologically, MOC neurons project to different places along the cochlear partition in a tonotopic manner, making synaptic connections to the outer hair cells and, hence, affecting the mechanical properties of the cochlea (e.g. an increase in basilar-membrane stiffness). Therefore, we introduce a frequency-dependent feedback mechanism which controls the tip gain (G) of each MBPNL channel according to the intensity level of the sustained noise in that frequency band. By reducing G, the MBPNL response to weaker stimuli (e.g. background noise) is attenuated. The lower-right panel of Fig. 2, for example, shows the MBPNL response for G = 10 dB. Compared to G = 40 dB, the response to high-energy stimuli is hardly affected, while the response to low-energy stimuli (e.g. 20 dB SPL) is reduced by some 30 dB. In our realization of the model, the value of the tip gain (G) per cochlear channel is adjusted so that the intensity of the background noise at the output does not exceed a prescribed value.
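As a concrete illustration of the two ingredients just described (the generic IHC stage and the per-channel, noise-dependent adjustment of the tip gain G), here is a small Python sketch. It deliberately replaces the MBPNL filter with simple linear and logarithmic book-keeping, and the cutoff frequency, dynamic-range window and criterion level are assumptions chosen for illustration only, not the paper's values.

```python
import numpy as np

def ihc_stage(x, fs, cutoff=1000.0, drw_db=(0.0, 30.0)):
    """Generic IHC stage as described in the text: half-wave rectification followed by
    lowpass filtering, with the output restricted to a dynamic-range window (DRW).
    Cutoff frequency and DRW bounds are illustrative assumptions."""
    hw = np.maximum(np.asarray(x, float), 0.0)        # half-wave rectification
    a = np.exp(-2.0 * np.pi * cutoff / fs)            # one-pole lowpass coefficient
    y = np.zeros_like(hw)
    for i in range(len(hw)):
        y[i] = (1.0 - a) * hw[i] + a * (y[i - 1] if i else 0.0)
    env_db = 20.0 * np.log10(np.maximum(y, 1e-12))
    return np.clip(env_db, *drw_db)                   # restrict to the DRW

def tip_gain(noise_band_level_db, criterion_db=40.0, g_max=40.0, g_min=0.0):
    """Efferent-inspired per-channel rule: reduce G from its quiet value (40 dB) just
    enough that the sustained-noise output does not exceed a prescribed value.
    Linear book-keeping stands in for the nonlinear MBPNL tip path."""
    return float(np.clip(criterion_db - noise_band_level_db, g_min, g_max))

# Example: channels containing more sustained background noise receive less tip gain.
noise_levels = np.array([10.0, 25.0, 40.0, 55.0])     # assumed noise level per channel (dB)
print([tip_gain(n) for n in noise_levels])            # -> [30.0, 15.0, 0.0, 0.0]
```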
Fig. 3 Simulated IHC response for open-loop PAM (left) and for closed-loop PAM (right)
Figure 3 shows – in terms of a spectrogram – simulated IHC responses to the diphone je (as in "jab") in two noise conditions (70 dB SPL/10 dB SNR and 50 dB SPL/10 dB SNR), for an open-loop MBPNL-based system (left-hand side) and for the closed-loop system (right-hand side). Due to the nature of the noise-responsive feedback, the closed-loop system produces spectrograms that fluctuate less with changes in noise intensity compared to spectrograms produced by the open-loop system. This property is desirable for stabilizing the performance of the template-matching operation under varying noise conditions, as reflected in the quantitative evaluation reported next.

2.1 Quantitative Evaluation – Isolating PAM from Template Matching
The evaluation system comprises the PAM followed by the TMC. Ideally, to eliminate PAM-TMC interaction, errors due to template matching should be reduced to zero. In reality we could only minimize interaction. This was achieved by using the following three steps: (1) we use the simplest possible psychophysical task in the context of speech perception, namely a binary discrimination test. In particular, we use Voiers’ DRT (Voiers 1983) which presents the
subject with a two-alternative forced choice between two alternative CVC words that differ in their initial consonants (i.e. a minimal pair). Such a task minimizes the influence of cognitive and memory factors while maintaining the complex acoustic cues that differentiate initial diphones (recall the central role of diphones in speech perception, e.g. Ghitza 1993); (2) we use the DRT paradigm with synthetic speech stimuli. An acoustic realization of the DRT word-pairs was synthesized so that the target values for the formants of the vowel in a word-pair are identical, restricting stimulus differences to the initial diphones; and (3) we use a "frozen speech" methodology (e.g. Hant and Alwan 2003), namely, the same acoustic speech token is used for training and for testing, so that testing tokens differ from training tokens only by the acoustic distortion. These three steps presumably result in a reduction in the number of errors induced by the template matching. Recall that a template-match operation comprises measuring the distance of the unknown token to the templates, and labeling the unknown token as the template with the smaller distance. Hence, template matching is defined by the distance measure and the choice of templates. As a distance measure we use the minimum mean-squared error. This is an effective choice here because: (1) by using synthetic speech stimuli, the identical target values of the vowel formants for the two words result in zero error in time-frequency cells associated with the final diphone; and (2) by using frozen-speech stimuli, a distortion in a given time-frequency cell is generated locally (by the noise component within the range of the cell) and is independent of the noise at other cells. We have conducted formal DRT sessions using the synthetic stimuli in quiet and in additive noise, using speech-shaped noise at three levels (70, 60 and 50 dB SPL) and at three SNRs (10, 5 and 0 dB). The data were collected from six subjects (four repetitions each), all students with normal hearing. All subjects had zero errors in quiet. Figure 4 shows errors produced by a DRT mimic with open-loop (upper panel) and closed-loop (lower panel) PAMs. Signal conditions are the same as those used to collect the human data. The DRT-mimic data are averaged over four exemplars of the database, each differing in the realization of the added noise. Templates were created for the 60 dB SPL/5 dB SNR condition. The abscissa marks the six Jakobsonian dimensions: Voicing, Nasality, Sustention, Sibilation, Graveness and Compactness (denoted VC, NS, ST, SB, GV and CM, respectively). The "+" sign stands for attribute present and the "−" sign for attribute absent. Bars show the difference between mean machine and human scores. The lines indicate plus and minus one standard deviation of the human data. Gray bars indicate that the difference is greater than one standard deviation. Scores with the open-loop PAM are worse than the human scores. Scores for the closed-loop PAM are superior to human scores and the difference is similar for all conditions. We are currently developing an iterative procedure for adjusting the parameters of the PAM (constrained by physiological plausibility) so as to match scores to those achieved by humans. The resulting PAM will then be frozen and used to formulate the template-match operation.
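A minimal sketch of the template-match decision described above: the unknown token's time-frequency representation is compared with each member of the minimal pair by mean-squared error and labelled with the nearer template. The array shapes and the random stand-ins for PAM outputs are purely illustrative assumptions.

```python
import numpy as np

def mmse_label(unknown, templates):
    """Label the unknown token with the template giving the smallest mean-squared error.
    `unknown` and each template are (time x frequency) arrays of internal (PAM) responses,
    assumed time-aligned (reasonable under the 'frozen speech' methodology)."""
    errors = {name: float(np.mean((unknown - tmpl) ** 2)) for name, tmpl in templates.items()}
    return min(errors, key=errors.get), errors

# Hypothetical example: random arrays stand in for the PAM outputs of a DRT minimal pair.
rng = np.random.default_rng(0)
templates = {"daunt": rng.standard_normal((50, 32)),
             "taunt": rng.standard_normal((50, 32))}
noisy_token = templates["daunt"] + 0.3 * rng.standard_normal((50, 32))  # degraded "daunt"
label, errors = mmse_label(noisy_token, templates)
print(label)   # -> "daunt" for this modest amount of added noise
```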
Fig. 4 DRT mimic scores for open-loop PAM (upper; Abs Mean = 15%) and closed-loop PAM (lower; Abs Mean = 8%). The ordinate is the mimic error minus the human error (%); the abscissa marks the ± poles of the six Jakobsonian dimensions (VC, NS, ST, SB, GV, CM)
3 A Template-Matching Circuit (TMC)
In developing the PAM (Sect. 2) we deployed a psychophysical task with a minimal cognitive load and used speech stimuli with restricted phonemic variation. In contrast, the parameters of the TMC will be tuned so as to predict human performance in a consonant identification task (i.e. predicting a confusion matrix). Towards this goal we seek a perceptually-relevant distortion measure between speech tokens that inherently exhibit phonemic variability. In this section we describe a template-matching circuit inspired by principles of cortical neural processing (Hopfield 2004). A block diagram of the circuit is shown in Fig. 5. It comprises three stages: a front-end, a layer of
Fig. 5 A block-diagram of the template-match circuit
“integrate and fire” (IAF) neurons (Layer-I neurons) and a layer of coincidence neurons (Layer-II neurons). The front-end is the auditory model described in Section 2. Each neuron in Layer-I is characterized by the differential equation du(t)/dt+u(t)/RC = i(t)/C, where i(t) is the input current, u(t) the output voltage and RC is the time-constant of the circuit. Once u(t) reaches a prescribed threshold value the neuron produces a firing and u(t) is shunted to zero. The parameters of all Layer-I neurons are identical except of the threshold-of-firing. All Layer-I neurons are driven by one, global, underlying sub-threshold oscillatory current A · cosg t. Hence, the input current to the n-th IAF cell is in(t) = xn(t)+A · cosg t, where xn(t) the output of the n-th cochlear channel. In our realization RC = 20 ms and the frequency of the Gamma oscillator is 25 Hz. Each channel drives 100 Layer-I neurons which
Fig. 6 Illustrating the performance of the TMC in the context of the DRT discrimination task
differ only in their thresholds of firing. Therefore, in our realization the number of Layer-I neurons is M = 2600. The final stage comprises N = 6000 Layer-II coincidence neurons. Each Layer-II neuron is driven by K randomly selected "patches" of Layer-I neurons (in our system K = 6). A patch is composed of L Layer-I neurons with successive thresholds – all driven by the same frequency channel (here L = 10). The computational principle realized by the proposed circuit can be summarized as follows. A given Layer-II neuron fires at time t0 only if all K Layer-I patches fire simultaneously at time t0. And a patch of Layer-I neurons fires at time t0 only if the time-evolution of the frequency channel prior to that time drives one of the L neurons in the patch to its threshold precisely at time t0. Hence each Layer-II neuron is "tuned" to a particular time-frequency template expressed in terms of the time evolution of K frequency channels. The same Layer-II neuron will also fire, albeit at a delayed time, if the time-evolution of all K cochlear channels is scaled by the same factor (this is so because all corresponding Layer-I neurons will reach their thresholds with a similar time delay). Figure 6 illustrates the discrimination power of the circuit in the DRT context. Assume that we have identified 40 Layer-II neurons that are most sensitive to the time-frequency template of the initial diphone of the word "daunt" (phonetically transcribed as dont). Similarly, we have identified 140 neurons for the word "taunt" (transcribed as tont). We term these sets of neurons "State-1" and "State-2" neurons, respectively. The upper-left two panels of Fig. 6 show a spectrographic display of the front-end response to the first 200 ms of two realizations of the word dont spoken by a single speaker (note the phonemic variability). Below each spectrogram is a time-histogram of the number of state neurons firing to the corresponding stimuli (shown is the fraction out
of 40). The lower-right four panels show the analogous display for the response of State-2 neurons to the word tont. The lower-left (and the upper-right) panels show the response to the opposite word. The response to stimuli matched to the state neurons peaks at a time instant associated with the end-time of the initial diphone. For stimuli of the opposite token there is a small response. Further study of the TMC is underway.
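The Layer-I dynamics described above can be made concrete in a few lines of Python. The sketch below integrates the membrane equation du/dt = i(t)/C − u(t)/(RC) with Euler steps, adds the global sub-threshold gamma drive, and resets a neuron to zero when its individual threshold is crossed. RC = 20 ms and a 25-Hz gamma frequency follow the text; the drive amplitude A, the value of C, the thresholds and the ramp-shaped channel input are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def iaf_layer(x, fs, thresholds, rc=0.020, c=1.0, a_gamma=0.1, f_gamma=25.0):
    """Euler simulation of the Layer-I IAF neurons: du/dt = i(t)/C - u(t)/(R*C),
    with i_n(t) = x(t) + A*cos(2*pi*f_gamma*t); a neuron fires when u reaches its
    threshold, after which u is shunted to zero."""
    t = np.arange(len(x)) / fs
    drive = np.asarray(x, float) + a_gamma * np.cos(2.0 * np.pi * f_gamma * t)
    dt = 1.0 / fs
    u = np.zeros(len(thresholds))
    spikes = np.zeros((len(thresholds), len(x)), dtype=bool)
    for k in range(len(x)):
        u += dt * (drive[k] / c - u / rc)      # membrane update
        fired = u >= thresholds
        spikes[fired, k] = True
        u[fired] = 0.0                         # shunt to zero after a firing
    return spikes

# Example: a 10-neuron "patch" with successive thresholds driven by a slowly rising
# channel output; low-threshold neurons fire often, while the highest threshold is
# reached only when the gamma drive pushes u over the top late in the ramp.
fs = 1000
channel = np.linspace(0.0, 1.0, fs)            # stand-in for one cochlear-channel output
thresholds = np.linspace(0.002, 0.020, 10)     # successive thresholds (equilibrium u ~ RC*i)
spikes = iaf_layer(channel, fs, thresholds)
print(spikes.sum(axis=1))                      # spike count per neuron in the patch
```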
4 Summary
We are developing a model of diphone perception based on models of salient properties of peripheral and central auditory processing. The model comprises a closed-loop peripheral auditory model (which provides a representation of speech that is robust against sustained background noise) connected to a template-match circuit based upon basic concepts of neural processing. Our strategy is to tune the PAM in isolation from the TMC, then freeze the PAM and use it to tune the parameters of the TMC. As probe-stimuli we use speech in the presence of noise. Speech stimuli provide rich, relevant time-varying spectral patterns, and the presence of noise imposes focus on the salient speech cues. Our measure of success is the ability of the model to predict consonant confusions in noise. Acknowledgment. This work is supported by the U.S. Air Force Office of Scientific Research.
References

Ghitza O (1993) Processing of spoken CVCs in the auditory periphery: I. Psychophysics. JASA 94(5):2507–2516
Goldstein JL (1990) Modeling rapid waveform compression on the basilar membrane as a multiple-bandpass-nonlinearity filtering. Hear Res 49:39–60
Hant JJ, Alwan A (2003) A psychoacoustic-masking model to predict the perception of speech-like stimuli in noise. Speech Commun 40:291–313
Hopfield JJ (2004) Encoding for computation: recognizing brief dynamical patterns by exploiting effects of weak rhythms on action-potential timing. PNAS 101(16):6255–6260
Voiers WD (1983) Evaluating processed speech using the diagnostic rhyme test. Speech Technol 1(4):30–39
Comment by Greenberg The task used in your study has a very limited number of alternatives (two); it is essentially a 1-bit response paradigm. Although there are compelling reasons for using a task with such limited entropy, do you believe that the improvement in performance observed using the efferent system component
in your representation would generalize to tasks with far greater entropy (such as open-set word identification)?
Reply Our model comprises an efferent-inspired peripheral auditory model (PAM) connected to a template-match circuit (TMC). We believe that robustness against background noise is provided principally by the signal processing performed by the peripheral circuitry, and that consonant confusions (e.g. in open-set word identification) result from errors in the internal representation (since the “shield” provided by the periphery is not perfect) and from computational properties of the neural template-matching circuit. Your comment, therefore, raises two questions: (1) does the resulting closed-loop cochlear model – which matches human performance in the DRT task for noisy speech – capture the signal processing principles that indeed are responsible for providing the shield against background noise?; and (2) assuming that the answer to question (1) is yes, do we believe that the templatematching circuit (suggested in Sect. 3) can be tuned to predict consonant confusions in an open-set task? The second question is currently being studied; hence we can’t provide an answer yet. As for the answer to question (1), our methodology calls for adjusting the parameters of the PAM with minimal interference of the TMC. The reason for choosing the DRT paradigm, binary in nature, is to reduce the role of the back-end to a minimum. Note, that although we predict human performance in a binary task, the parameters of the model were tuned to match errors between minimal pairs jointly along all Jakobsonian dimensions. Hence we believe that the spectro-temporal patterns generated by the resulting closed-loop cochlear model are an adequate model of the internal representation of degraded speech.
59 The Influence of Masker Type on the Binaural Intelligibility Level Difference S. THEO GOVERTS1, MARIEKE DELREUX2 , JOOST M. FESTEN1, AND TAMMO HOUTGAST1
1 Introduction
The Binaural Intelligibility Level Difference (BILD) was first described by Licklider (1948). It is a manifestation of binaural unmasking, the advantage of binaural over monaural hearing of a signal S against the background of a spatially separated noise N. In headphone experiments, a design with an N0S0 presentation vs an N0Sπ presentation is often used, in which the noise is presented homophasic and the signal either homophasic or antiphasic. The BILD is then defined as the difference in the speech reception threshold (SRT) between the N0S0 and N0Sπ presentation modes. Blauert (1997) provides an overview of experimental work on BMLD and BILD. Estimating the BILD requires SRT measurements in the N0S0 and N0Sπ presentation modes. Since the diotic N0S0 stimuli contain no binaural information, they can be considered as an estimation of monaural speech perception (Siegel and Colburn 1983). The BILD for a stationary masker is known to be about 4–7 dB (e.g. Blauert 1997; Johansson and Arlinger 2002). We are interested in the BILD for fluctuating maskers because of their relevance for daily life. When assessing speech intelligibility in the presence of a fluctuating masker in an N0Sπ condition, several components should be taken into account (see Fig. 1): the point of departure is the diotic speech reception threshold (SRT) for stationary noise (Fig. 1a). If temporal modulations are introduced there will be a release from masking (masking release, MR): the SRT decreases (Fig. 1b). A typical value of this masking release is 10 dB. On the other hand, if an interaural phase shift is introduced in the speech signal the SRT will be reduced because of binaural unmasking (Fig. 1c). A typical value of this binaural unmasking is 5 dB. The question addressed in this chapter is whether there is an interaction between the diotic ("monaural") release from masking and the binaural unmasking. So, can the reduction in SRT in the condition with modulated noise and interaurally phase-shifted speech be predicted by adding the values of masking release and binaural unmasking, i.e. 10 + 5 = 15 dB, or is it different?
1 Audiology, ENT, VU University Medical Center, Amsterdam, Netherlands, [email protected], [email protected], [email protected]
2 EXP ORL, Leuven University, Leuven, Belgium, [email protected]
Fig. 1 Envelope of speech and masker (with estimated forward masking): a point of departure, stationary masker and a N0S0 presentation; b masking release, fluctuating masker and a N0S0 presentation; c binaural unmasking, fluctuating masker and a N0Sπ presentation; d masking release+binaural unmasking, fluctuating masker and a N0Sπ presentation
The BILD is investigated for: (1) 16-Hz block-modulated speech-shaped noise; (2) speech-shaped noise with speech-like fluctuations; and (3) ongoing speech of an interfering same-sex talker. Since the spectro-temporal content of noises (2) and (3) is roughly the same, we hypothesize that the effect of informational masking can be investigated by comparing the results for these two maskers.
2 Materials and Method

2.1 Stimuli
The speech stimuli consisted of lists of 13 everyday Dutch sentences of eight to nine syllables read by a female speaker (Versfeld et al. 2000). Based on this speech material a stationary masking noise was derived with a long-term spectrum that resembled the long-term spectrum of the female voice. From this stationary noise two fluctuating noises were derived: (1) a 16-Hz block-modulated speech-shaped noise with a duty cycle of 50% and a modulation depth of 100%; and (2) a speech-shaped noise with speech-like fluctuations (Festen and Plomp 1990). Finally, continuous speech of an interfering female speaker (Plomp and Mimpen 1979) was used as a masker with additional informational content. The entire experiment was controlled by a personal computer. Subjects were tested individually in a soundproof room.

2.2 Subjects

Five female and seven male normal-hearing subjects participated in this study. Their thresholds did not exceed 15 dB HL and their ages ranged between 20 and 30 years. All subjects were native speakers of Dutch.

2.3 Conditions and Procedure

The BILD, defined as the difference in threshold for speech in noise between diotic presentation of speech (SRTN0S0) and dichotic presentation of speech (SRTN0Sπ), is deduced for the six masker conditions listed in Table 1.

Table 1 List of conditions

Condition    Masker type                             Level [dB]
ST65         Stationary noise                        65
BL65         Block-modulated noise                   65
BL75         Block-modulated noise                   75
FL65         Noise with speech-like fluctuations     65
FL75         Noise with speech-like fluctuations     75
TA65         Interfering, same-sex talker            65

The SRT in noise is defined as the signal-to-noise ratio (SNR) at which 50% of the sentences are reproduced correctly. Measurement procedures were in accordance with Plomp and Mimpen (1979), varying the SNR adaptively in an up-down procedure with 2-dB steps. Each single SRT measurement is based on one 13-sentence list. All measurements were performed three times. Since we were interested in differences between the BILD in the six conditions, the order of conditions was balanced over subjects.
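As a rough illustration of how such an SRT estimate comes about, the sketch below runs a simple 1-up/1-down track with 2-dB steps over a 13-sentence list against a simulated listener. It is a schematic stand-in, assuming a generic adaptive rule and a generic scoring of the track levels, not the exact Plomp and Mimpen (1979) procedure.

```python
import numpy as np

def srt_track(respond, n_sentences=13, start_snr=0.0, step_db=2.0):
    """Schematic 1-up/1-down adaptive track: after each sentence the SNR is lowered by
    2 dB if it was reproduced correctly and raised by 2 dB otherwise, converging on the
    50%-correct point.  `respond(snr)` stands for the listener's correct/incorrect response."""
    snr = start_snr
    levels = []
    for _ in range(n_sentences):
        levels.append(snr)
        snr += -step_db if respond(snr) else step_db
    return float(np.mean(levels[3:]))   # estimate: mean of the later track levels (assumption)

# Example with a simulated listener whose probability correct follows a logistic in SNR.
rng = np.random.default_rng(2)
true_srt = -5.0
listener = lambda snr: rng.random() < 1.0 / (1.0 + np.exp(-(snr - true_srt)))
print(round(srt_track(listener), 1))    # an estimate in the vicinity of -5 dB SNR
```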
3 Results
For the 12 normal-hearing subjects the SRT was measured with N0S0 and N0Sπ presentation as described in the Methods section. For each condition the BILD was calculated as the SRTN0S0 minus the SRTN0Sπ, and the MR as the SRT-STATN0S0 minus the SRT-CONDN0S0, i.e. the diotic SRT for the stationary masker minus the diotic SRT for the condition in question.
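A worked example with the ST65 and BL65 means from Table 2 makes these definitions concrete (the 0.1-dB discrepancy with the tabulated BILD simply reflects rounding of the listed means):

```python
# Worked example with values from Table 2 (SRTs in dB SNR):
srt_stat_n0s0 = -4.5     # ST65, diotic
srt_bl65_n0s0 = -14.8    # BL65, diotic
srt_bl65_n0spi = -17.3   # BL65, antiphasic speech

bild = srt_bl65_n0s0 - srt_bl65_n0spi   # 2.5 dB of binaural unmasking (Table 2 lists 2.6)
mr = srt_stat_n0s0 - srt_bl65_n0s0      # 10.3 dB of masking release
print(bild, mr)
```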
Average results and standard deviations are listed in Table 2. An ANOVA was performed on the data to investigate the significance of the differences. For the BILD, results that are significantly different from the ST65 condition are indicated by an asterisk (*).

Table 2 Means and standard deviations of the measured SRTs and of the calculated BILDs and MRs. BILD values that differ significantly from the BILD in the ST65 condition are marked *

Condition   SRTN0S0 [dB]      SRTN0Sπ [dB]      BILD [dB]        MR [dB]
            mean    stdev     mean    stdev     mean    stdev    mean    stdev
ST65        −4.5    0.5       −9.1    0.7       4.6     0.5      –       –
BL65        −14.8   1.8       −17.3   1.8       2.6*    0.8      10.3    2.0
BL75        −18.5   2.7       −20.4   2.2       1.9*    1.0      14.0    2.9
FL65        −11.5   0.8       −14.5   0.8       2.9*    1.2      7.1     0.8
FL75        −11.4   1.2       −14.8   1.1       3.4     1.6      6.9     1.3
TA65        −11.8   1.5       −14.2   1.4       2.3*    1.8      7.3     1.4

The data for the BILD and for the SRT in the N0S0 and N0Sπ presentation modes are also plotted in Fig. 2a,b. The SRTN0S0 data for the different conditions, and thus the MR data, are in line with the literature. The BILD results show that the binaural unmasking for the fluctuating maskers is reduced compared to the stationary noise. This is in line with Carhart et al. (1966), who found a slightly reduced BILD for modulated compared to stationary noise (3.9 vs 4.5 dB). Coming back to the question posed in the Introduction, the total advantage of the BL65 N0Sπ presentation mode, compared to the diotic ST65 condition, is about 13 dB, which is less than the sum of binaural unmasking and masking release, 4.6 + 10.3 ≈ 15 dB.

Fig. 2 a Mean BILD data and standard deviations. b Mean SRT data for N0S0 (open circles) and N0Sπ presentations, and standard deviations. c Mean BILD data (and standard deviations) plotted vs the absolute level of the diotic speech for the different conditions. d Mean BILD data (and standard deviations) plotted vs the diotic SRT for the different conditions, which is directly related to the masking release for those conditions
4 Analysis and Discussion

4.1 Relation of the BILD with the Absolute Level of the Diotic Speech
The reduced binaural unmasking might be caused by a reduced absolute level (Blauert 1997). In an earlier study we found a dependence of binaural unmasking on level, for levels typically below 50 dB SPL (Goverts 2004; Goverts and Houtgast 2007). To investigate this, in Fig. 2c the mean BILD is plotted vs the absolute level of the diotic speech for the different conditions. There is no strong relation between the BILD and the average absolute level of the diotic speech, so the difference in absolute level does not appear to explain the difference in binaural unmasking. This is in line with other findings in the literature (e.g. Carhart et al. 1966, 1969).

4.2 The Relation Between Masking Release and BILD
In Fig. 2d the mean BILD is plotted vs the SRT of the diotic speech for the different conditions, which is, of course, directly related to the masking release. The BILD appears to be related to the SRT for diotic speech: these data suggest a lower binaural unmasking for conditions in which a higher diotic ("monaural") masking release is found. In order to understand this relation we should inspect the envelopes of the different maskers. In Fig. 3a–c the envelope of ST65, BL65, and FL65, respectively, is plotted together with the long-term average of the speech in the diotic condition. It can be seen that the proportion of masked speech in the diotic presentation varies considerably among the three maskers. This is further illustrated in Fig. 3d–f, where the distribution of instantaneous signal-to-noise ratios is given. Our hypothesis is that the reduced BILD for the fluctuating maskers is caused by a reduced proportion of time for which the instantaneous signal-to-noise ratio is in the range in which binaural unmasking is active.

Fig. 3 a–c The envelopes of the maskers ST65, BL65, and FL65, respectively, plotted as dots. For this qualitative analysis, envelopes of the broadband signals are calculated, and forward masking is estimated assuming a decay to zero in 200 ms on a log time-scale. For each condition the long-term average of speech at the diotic SRT level is given by the drawn line. d–f The distribution of instantaneous signal-to-noise ratios for those conditions

To model these results, we assume a binaural unmasking of 5 dB for all signal-to-noise ratios up to a critical value (CV), and no unmasking at all for the higher signal-to-noise ratios. For this critical value two possibilities can be considered: (1) for signal-to-noise ratios of more than 15 dB speech perception is not influenced by the noise (as is well known from the Speech Intelligibility Index (SII) approach, for example); and (2) for signal-to-noise ratios of more than 0 dB normal-hearing listeners reach an intelligibility of 100% for sentences in stationary noise. To evaluate this hypothesis in a qualitative way we computed a weighted BILD, multiplying the distribution of instantaneous signal-to-noise ratios by a simple weighting function that is 5 for signal-to-noise ratios up to CV and 0 for the higher ratios. The results are given in Fig. 4. If we compare this to the actual BILD data of Fig. 2a we find a rather good correspondence, especially for CV = 0. Thus the reduced binaural unmasking for fluctuating maskers can be understood in terms of the reduced presence of the effective diotic masker compared to a stationary masker: owing to the temporal gaps in the noise in the diotic condition that fall below the long-term average level of the speech, binaural unmasking cannot be as effective as in stationary noise. Along these lines we can understand the data for all the maskers used.

Fig. 4 Weighted BILD values for the six conditions for two values of the critical value (CV) above which binaural unmasking is modeled to be not effective
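The weighting computation is simple enough to state in a few lines. The sketch below applies it to an assumed sample of instantaneous signal-to-noise ratios; in the actual analysis the SNR distributions would come from the masker and speech envelopes, as in Fig. 3d–f.

```python
import numpy as np

def weighted_bild(inst_snr_db, cv_db, unmasking_db=5.0):
    """Qualitative weighted-BILD computation as described: binaural unmasking contributes
    a fixed 5 dB whenever the instantaneous signal-to-noise ratio is at or below the
    critical value CV, and nothing above it; the result is the average over the
    distribution of instantaneous SNRs."""
    inst_snr_db = np.asarray(inst_snr_db, dtype=float)
    weights = np.where(inst_snr_db <= cv_db, unmasking_db, 0.0)
    return float(weights.mean())

# Example with an assumed instantaneous-SNR sample for a strongly modulated masker:
rng = np.random.default_rng(3)
snr_samples = np.concatenate([rng.normal(-5, 3, 500),    # masker-on portions
                              rng.normal(25, 5, 500)])   # gaps where speech dominates
print(weighted_bild(snr_samples, cv_db=0.0),   # unmasking effective ~half the time -> ~2.5 dB
      weighted_bild(snr_samples, cv_db=15.0))
```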
4.3 Informational Masking
Comparing the data of the conditions FL65 and TA65, we see no additional effect of informational masking. The results for the TA65 condition can be understood on the basis of the spectro-temporal and energetic properties of the masker. Binaural unmasking is not influenced by informational masking, at least not of the type and degree used in this study. This is in line with Carhart et al. (1969), who found similar results for masking release and binaural unmasking with modulated noise and interfering speech.
5 Conclusion
Binaural unmasking of speech for fluctuating maskers is reduced in comparison to results in stationary noise. Departing from diotic ("monaural") speech intelligibility in noise, the effects of masking release and binaural unmasking cannot simply be added to predict dichotic speech intelligibility in a masker with temporal fluctuations. The reduction of binaural unmasking can be understood in terms of the reduced presence of the effective diotic masker in conditions of masking release. Using this hypothesis we can, in a qualitative way, predict binaural unmasking for a block-modulated masker, a masker with speech-like modulations and an interfering talker. This means that the relative importance of binaural unmasking diminishes in daily life for normal-hearing subjects. Acknowledgements. We thank Hans van Beek for technical assistance.
References

Blauert J (1997) Spatial hearing: the psychophysics of human sound localization, rev edn. MIT Press, Cambridge
Carhart RT, Tillman TW, Greetis ES (1966) Binaural masking of speech by periodically modulated noise. J Acoust Soc Am 39:1037–1050
Carhart RT, Tillman TW, Greetis ES (1969) Perceptual masking in multiple sound backgrounds. J Acoust Soc Am 45:694–703
Festen JM, Plomp R (1990) Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. J Acoust Soc Am 88:1725–1736
Goverts ST (2004) Assessment of spatial and binaural hearing in hearing impaired listeners. PhD thesis, VU University, Amsterdam
Goverts ST, Houtgast T (2007) The BILD of hearing-impaired subjects – the role of suprathreshold coding. To be submitted to J Acoust Soc Am
Johansson MSK, Arlinger SD (2002) Binaural masking level difference for speech signals in noise. Int J Audiol 41:279–284
Licklider J (1948) The influence of interaural phase relations upon the masking of speech by white noise. J Acoust Soc Am 20:150–159
Plomp R, Mimpen AM (1979) Improving the reliability of testing the speech reception threshold for sentences. Audiology 18:43–52
Siegel RA, Colburn HS (1983) Internal and external noise in binaural detection. Hear Res 11:117–123
Versfeld NJ, Daalder L, Festen JM, Houtgast T (2000) Method for the selection of sentence materials for efficient measurement of the speech reception threshold. J Acoust Soc Am 106:1671–1684