Lecture Notes in Computer Science 4666

Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
Mike E. Davies, Christopher J. James, Samer A. Abdallah, Mark D. Plumbley (Eds.)

Independent Component Analysis and Signal Separation

7th International Conference, ICA 2007
London, UK, September 9-12, 2007
Proceedings
Volume Editors

Mike E. Davies
Edinburgh University, IDCOM & Joint Research Institute for Signal and Image Processing
King's Buildings, Mayfield Road, Edinburgh EH9 3JL, UK
E-mail: [email protected]

Christopher J. James
University of Southampton, Signal Processing and Control Group, ISVR
Southampton, SO17 1BJ, UK
E-mail: [email protected]

Samer A. Abdallah and Mark D. Plumbley
Queen Mary University of London, Department of Electronic Engineering, Centre for Digital Music
Mile End Road, London, E1 4NS, UK
E-mail: {samer.abdallah,mark.plumbley}@elec.qmul.ac.uk
Library of Congress Control Number: 2007933693

CR Subject Classification (1998): C.3, F.1.1, E.4, F.2.1, G.3, H.1.1, H.2.8, H.5.1, I.2.7

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN: 0302-9743
ISBN-10: 3-540-74493-2 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-74493-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com

© Springer-Verlag Berlin Heidelberg 2007
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12113033 06/3180 543210
Preface
This volume contains the papers presented at the 7th International Conference on Independent Component Analysis (ICA) and Source Separation held in London, 9–12 September 2007, at Queen Mary, University of London.

Independent Component Analysis and Signal Separation is one of the most exciting current areas of research in statistical signal processing and unsupervised machine learning. The area has received attention from several research communities including machine learning, neural networks, statistical signal processing and Bayesian modeling. Independent Component Analysis and Signal Separation has applications at the intersection of many science and engineering disciplines concerned with understanding and extracting useful information from data as diverse as neuronal activity and brain images, bioinformatics, communications, the World Wide Web, audio, video, sensor signals, or time series.

This year's event was organized by the EPSRC-funded UK ICA Research Network (www.icarn.org). There was also a minor change to the conference title this year with the exclusion of the word "blind". The motivation for this was the increasing number of interesting submissions using non-blind or semi-blind techniques that did not really warrant this label.

Evidence of the continued interest in the field was demonstrated by the healthy number of submissions received: of the 149 papers submitted, just over two thirds were accepted. These proceedings have been organized into six sections: theory, algorithms, sparse methods, biomedical applications, speech and audio applications, and miscellaneous. Within each section, papers have been organized alphabetically by the first author's last name. However, the strong interaction between theory, method and application inevitably means that many papers could equally have been placed in alternative categories.

In this year's papers there was a significant growth in the development of sparsity as a tool for source separation, while the application areas were once again dominated by submissions focusing on speech and audio, and on biomedical processing. The organizing committee decided to reflect this in their choice of keynote speakers and were pleased to secure keynote talks from two leading researchers in these fields: Shoji Makino from NTT Communication Science Laboratories, Kyoto, Japan; and Scott Makeig from the Swartz Center for Computational Neuroscience, Institute for Neural Computation, UCSD, USA.

Following the successful additions made to the 2006 event, the 2007 conference continued to offer a "Student Best Paper Award" and included two tutorial sessions on the day preceding the main conference. This year's tutorials covered two topics closely connected with recent research progress in ICA and source separation: "Information Filtering," lectured by José Principe of the University of Florida; and "Sparse Representations," lectured by Rémi Gribonval of INRIA at IRISA, Rennes.

A further innovative aspect of the 2007 event was the introduction of the "Stereo Audio Source Separation Evaluation Campaign", which aimed to compare the performance of source separation algorithms from different researchers when applied to stereo under-determined mixtures. The different contributions to the challenge were presented in a special poster session on the last day of the conference, which was then followed by a panel discussion. The overall results of the evaluation campaign are also summarized in a paper in this volume.

There are many people who should be thanked for their hard work, which helped to produce the high-quality scientific program. First and foremost we would like to thank all the authors who have contributed to this volume; without them there would be no proceedings. In addition, we thank the members of the organizing committee and the reviewers for their efforts in commissioning the reviews, and for their help in selecting the very best papers for inclusion in this volume. We are also grateful to the organizers of the "Stereo Audio Source Separation Evaluation Campaign" for managing both the submissions to and the evaluation of this work. Thanks also go to the members of the ICA international steering committee for their continued advice and ongoing support for the ICA conference series. All these contributions went towards making the conference a great success. Last but not least, we would like to thank Springer for their rapid transformation of a collection of papers into this volume in time for the conference.

June 2007
Mike Davies
Christopher James
Samer Abdallah
Mark Plumbley
ICA 2007 Committee Listings
Organizing Committee
Mark Plumbley, Queen Mary, University of London, UK
Mike Davies, IDCOM, University of Edinburgh, UK
Christopher James, ISVR, University of Southampton, UK
Samer Abdallah, Queen Mary, University of London, UK
Betty Woessner, Queen Mary, University of London, UK

UK Advisory Committee
Paul Baxter, QinetiQ, UK
Richard Everson, University of Exeter, UK
Colin Fyfe, University of Paisley, UK
Mark Girolami, University of Glasgow, UK
Asoke Nandi, University of Liverpool, UK
Saeid Sanei, Cardiff University, UK

International ICA Steering Committee
Luis Almeida, INESC-ID, Lisbon, Portugal
Shun-ichi Amari, RIKEN BSI, Japan
Jean-François Cardoso, ENST, Paris, France
Andrzej Cichocki, RIKEN BSI, Japan
Scott Douglas, Southern Methodist University, USA
Simon Haykin, McMaster University, Canada
Christian Jutten, INPG, France
Te-Won Lee, UCSD, USA
Shoji Makino, NTT CS Labs, Japan
Klaus-Robert Müller, Fraunhofer FIRST, Germany
Noboru Murata, Waseda University, Japan
Erkki Oja, Helsinki University of Technology, Finland
Liqing Zhang, Shanghai Jiaotong University, China
Referees

Samer Abdallah, Karim Abed-Meraim, Pierre-Antoine Absil, Tulay Adali, Luis Almeida, Jörn Anemüller, Shoko Araki, Simon Arberet, Francis Bach, Radu Balan, Paul Baxter, Adel Belouchrani,
Raphael Blouet, Thomas Blumensath, Markus Borschbach, Jean-François Cardoso, María Carrión, Marc Castella, Taylan Cemgil, Jonathon Chambers, Chong-Yung Chi, Seungjin Choi, Heeyoul Choi, Andrzej Cichocki, Pierre Comon, Sergio Cruces-Alvarez, Adriana Dapena, Mike Davies, Lieven De Lathauwer, Yannick Deville, Konstantinos Diamantaras, Scott Douglas, Ivan Duran-Diaz, Deniz Erdogmus, Jan Eriksson, Da-Zheng Feng, Cédric Févotte, Colin Fyfe, Pando Georgiev, Mark Girolami, Pedro Gomez Vilda, Juan Manuel Gorriz Saez, Rémi Gribonval, Anne Guérin-Dugué, Zhaoshui He, Kenneth Hild, Antti Honkela, Tetsuya Hoya, Patrik Hoyer, Aapo Hyvarinen, Shiro Ikeda, Alexander Ilin, Mika Inki, Yujiro Inouye, Maria Jafari, Christopher James, Tzyy-Ping Jung, Christian Jutten, Juha Karhunen, Wlodzimierz Kasprzak, Mitsuru Kawamoto, Taesu Kim, Kevin Knuth, Visa Koivunen, Kostas Kokkinakis, Ercan Kuruoglu, Elmar Lang, Soo-Young Lee, Intae Lee, Yuanqung Li, Ju Liu, Hui Luo, Ali Mansour, Rubén Martín Clemente, Kiyotoshi Matsuoka, Oleg Michailovich, Nikolaos Mitianoudis, Dmitri Model, Eric Moreau, Ryo Mukai, Juan José Murillo-Fuentes, Klaus-Robert Müller, Wakako Nakamura, Noboru Nakasako, Asoke Nandi, Klaus Obermayer, Erkki Oja, Umut Ozertem, Sunho Park, Barak Pearlmutter, Dinh-Tuan Pham, Ronald Phlypo, Mark Plumbley, Carlos Puntonet, Martin Pyka, Stephen Roberts, Justinian Rosca, Diego Ruiz Padillo, Tomasz Rutkowski, Jaakko Särelä, Saïd Moussaoui, Fathi Salem, Reza Sameni, Ignacio Santamaria, Hiroshi Saruwatari, Hiroshi Sawada, Mikkel Schmidt, Christine Serviere, Shahram Hosseini, Shohei Shimizu, Paris Smaragdis, Jordi Solé-Casals, Toshihisa Tanaka, Fabian Theis, Nadège Thirion-Moreau, Alle-Jan van der Veen, Jean-Marc Vesin, Ricardo Vigário, Emmanuel Vincent, Tuomas Virtanen, Frédéric Vrins, DeLiang Wang, Wenwu Wang, Yoshikazu Washizawa, Lei Xu, Kehu Yang, Arie Yeredor, Vicente Zarzoso, Liqing Zhang, Yimin Zhang, Yi Zhang, Xu Zhu, Michael Zibulevsky, Andreas Ziehe
Sponsoring Institutions
UK ICA Research Network
UK Engineering and Physical Sciences Research Council
European Association for Signal Processing (EURASIP)
IEEE Region 8 UK & Republic of Ireland Section, Engineering in Medicine and Biology Society Chapter
Queen Mary, University of London, UK
Table of Contents
Theory

A Flexible Component Model for Precision ICA
   Jean-François Cardoso and Maude Martin

Blind Separation of Instantaneous Mixtures of Dependent Sources
   Marc Castella and Pierre Comon

The Complex Version of the Minimum Support Criterion
   Sergio Cruces, Auxiliadora Sarmiento, and Iván Durán

Optimal Joint Diagonalization of Complex Symmetric Third-Order Tensors. Application to Separation of Non Circular Signals
   Christophe De Luigi and Eric Moreau

Imposing Independence Constraints in the CP Model
   Maarten De Vos, Lieven De Lathauwer, and Sabine Van Huffel

Blind Source Separation of a Class of Nonlinear Mixtures
   Leonardo Tomazeli Duarte and Christian Jutten

Independent Subspace Analysis Is Unique, Given Irreducibility
   Harold W. Gutch and Fabian J. Theis

Optimization on the Orthogonal Group for Independent Component Analysis
   Michel Journée, Pierre-Antoine Absil, and Rodolphe Sepulchre

Using State Space Differential Geometry for Nonlinear Blind Source Separation
   David N. Levin

Copula Component Analysis
   Jian Ma and Zengqi Sun

On Separation of Signal Sources Using Kernel Estimates of Probability Densities
   Oleg Michailovich and Douglas Wiens

Shifted Independent Component Analysis
   Morten Mørup, Kristoffer H. Madsen, and Lars K. Hansen

Modeling and Estimation of Dependent Subspaces with Non-radially Symmetric and Skewed Densities
   Jason A. Palmer, Ken Kreutz-Delgado, Bhaskar D. Rao, and Scott Makeig
On the Relationships Between Power Iteration, Inverse Iteration and FastICA
   Hao Shen and Knut Hüper

A Sufficient Condition for the Unique Solution of Non-Negative Tensor Factorization
   Toshio Sumi and Toshio Sakata

Colored Subspace Analysis
   Fabian J. Theis and M. Kawanabe

Is the General Form of Renyi's Entropy a Contrast for Source Separation?
   Frédéric Vrins, Dinh-Tuan Pham, and Michel Verleysen
Algorithms

A Variational Bayesian Algorithm for BSS Problem with Hidden Gauss-Markov Models for the Sources
   Nadia Bali and Ali Mohammad-Djafari

A New Source Extraction Algorithm for Cyclostationary Sources
   C. Capdessus, A.K. Nandi, and N. Bouguerriou

A Robust Complex FastICA Algorithm Using the Huber M-Estimator Cost Function
   Jih-Cheng Chao and Scott C. Douglas

Stable Higher-Order Recurrent Neural Network Structures for Nonlinear Blind Source Separation
   Yannick Deville and Shahram Hosseini

Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization
   Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari

Pivot Selection Strategies in Jacobi Joint Block-Diagonalization
   Cédric Févotte and Fabian J. Theis

Speeding Up FastICA by Mixture Random Pruning
   Sabrina Gaito and Giuliano Grossi

An Algebraic Non Orthogonal Joint Block Diagonalization Algorithm for Blind Separation of Convolutive Mixtures of Sources
   Hicham Ghennioui, El Mostafa Fadaili, Nadège Thirion-Moreau, Abdellah Adib, and Eric Moreau
Non Unitary Joint Block Diagonalization of Complex Matrices Using a Gradient Approach
   Hicham Ghennioui, Nadège Thirion-Moreau, Eric Moreau, Abdellah Adib, and Driss Aboutajdine

A Toolbox for Model-Free Analysis of fMRI Data
   P. Gruber, C. Kohler, and F.J. Theis

An Eigenvector Algorithm with Reference Signals Using a Deflation Approach for Blind Deconvolution
   Mitsuru Kawamoto, Yujiro Inouye, Kiyotaka Kohno, and Takehide Maeda
Robust Independent Component Analysis Using Quadratic Negentropy
   Jaehyung Lee, Taesu Kim, and Soo-Young Lee

Underdetermined Source Separation Using Mixtures of Warped Laplacians
   Nikolaos Mitianoudis and Tania Stathaki

Blind Separation of Cyclostationary Sources Using Joint Block Approximate Diagonalization
   D.T. Pham

Independent Process Analysis Without a Priori Dimensional Information
   Barnabás Póczos, Zoltán Szabó, Melinda Kiszlinger, and András Lőrincz

An Evolutionary Approach for Blind Inversion of Wiener Systems
   Fernando Rojas, Jordi Solé-Casals, and Carlos G. Puntonet

A Complexity Constrained Nonnegative Matrix Factorization for Hyperspectral Unmixing
   Sen Jia and Yuntao Qian

Smooth Component Analysis as Ensemble Method for Prediction Improvement
   Ryszard Szupiluk, Piotr Wojewnik, and Tomasz Zabkowski

Speed and Accuracy Enhancement of Linear ICA Techniques Using Rational Nonlinear Functions
   Petr Tichavský, Zbyněk Koldovský, and Erkki Oja

Comparative Speed Analysis of FastICA
   Vicente Zarzoso and Pierre Comon

Kernel-Based Nonlinear Independent Component Analysis
   Kun Zhang and Laiwan Chan
Linear Prediction Based Blind Source Extraction Algorithms in Practical Applications
   Zhi-Lin Zhang and Liqing Zhang

Sparse Methods

Blind Audio Source Separation Using Sparsity Based Criterion for Convolutive Mixture Case
   A. Aïssa-El-Bey, K. Abed-Meraim, and Y. Grenier

Maximization of Component Disjointness: A Criterion for Blind Source Separation
   Jörn Anemüller

Estimator for Number of Sources Using Minimum Description Length Criterion for Blind Sparse Source Mixtures
   Radu Balan

Compressed Sensing and Source Separation
   Thomas Blumensath and Mike Davies

Morphological Diversity and Sparsity in Blind Source Separation
   J. Bobin, Y. Moudden, J. Fadili, and J.-L. Starck

Identifiability Conditions and Subspace Clustering in Sparse BSS
   Pando Georgiev, Fabian Theis, and Anca Ralescu

Two Improved Sparse Decomposition Methods for Blind Source Separation
   B. Vikrham Gowreesunker and Ahmed H. Tewfik

Probabilistic Geometric Approach to Blind Separation of Time-Varying Mixtures
   Ran Kaftory and Yehoshua Y. Zeevi

Infinite Sparse Factor Analysis and Infinite Independent Components Analysis
   David Knowles and Zoubin Ghahramani

Fast Sparse Representation Based on Smoothed ℓ0 Norm
   G. Hosein Mohimani, Massoud Babaie-Zadeh, and Christian Jutten

Estimating the Mixing Matrix in Sparse Component Analysis Based on Converting a Multiple Dominant to a Single Dominant Problem
   Nima Noorshams, Massoud Babaie-Zadeh, and Christian Jutten

Dictionary Learning for L1-Exact Sparse Coding
   Mark D. Plumbley
Supervised and Semi-supervised Separation of Sounds from Single-Channel Mixtures
   Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka

Image Compression by Redundancy Reduction
   Carlos Magno Sousa, André Borges Cavalcante, Denner Guilhon, and Allan Kardec Barros

Complex Nonconvex lp Norm Minimization for Underdetermined Source Separation
   Emmanuel Vincent

Sparse Component Analysis in Presence of Noise Using an Iterative EM-MAP Algorithm
   Hadi Zayyani, Massoud Babaie-Zadeh, G. Hosein Mohimani, and Christian Jutten
Speech and Audio Applications

Mutual Interdependence Analysis (MIA)
   Heiko Claussen, Justinian Rosca, and Robert Damper

Modeling Perceptual Similarity of Audio Signals for Blind Source Separation Evaluation
   Brendan Fox, Andrew Sabin, Bryan Pardo, and Alec Zopf

Beamforming Initialization and Data Prewhitening in Natural Gradient Convolutive Blind Source Separation of Speech Mixtures
   Malay Gupta and Scott C. Douglas

Blind Vector Deconvolution: Convolutive Mixture Models in Short-Time Fourier Transform Domain
   Atsuo Hiroe

A Batch Algorithm for Blind Source Separation of Acoustic Signals Using ICA and Time-Frequency Masking
   Eugen Hoffmann, Dorothea Kolossa, and Reinhold Orglmeister

The Role of High Frequencies in Convolutive Blind Source Separation of Speech Signals
   Maria G. Jafari and Mark D. Plumbley

Signal Separation by Integrating Adaptive Beamforming with Blind Deconvolution
   Kostas Kokkinakis and Philipos C. Loizou

Blind Signal Deconvolution as an Instantaneous Blind Separation of Statistically Dependent Sources
   Ivica Kopriva
Solving the Permutation Problem in Convolutive Blind Source Separation
   Radoslaw Mazur and Alfred Mertins

Discovering Convolutive Speech Phones Using Sparseness and Non-negativity
   Paul D. O'Grady and Barak A. Pearlmutter

Frequency-Domain Implementation of a Time-Domain Blind Separation Algorithm for Convolutive Mixtures of Sources
   Masashi Ohata and Kiyotoshi Matsuoka

Phase-Aware Non-negative Spectrogram Factorization
   R. Mitchell Parry and Irfan Essa

Probabilistic Amplitude Demodulation
   Richard E. Turner and Maneesh Sahani

First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results
   Emmanuel Vincent, Hiroshi Sawada, Pau Bofill, Shoji Makino, and Justinian P. Rosca

'Shadow BSS' for Blind Source Separation in Rapidly Time-Varying Acoustic Scenes
   S. Wehr, A. Lombard, H. Buchner, and W. Kellermann
Biomedical Applications

Evaluation of Propofol Effects in Atrial Fibrillation Using Principal and Independent Component Analysis
   Raquel Cervigón, Conor Heneghan, Javier Moreno, Francisco Castells, and José Millet

Space-Time ICA and EM Brain Signals
   Mike Davies, Christopher James, and Suogang Wang

Extraction of Gastric Electrical Response Activity from Magnetogastrographic Recordings by DCA
   C.A. Estombelo-Montesco, D.B. De Araujo, A.C. Roque, E.R. Moraes, A.K. Barros, R.T. Wakai, and O. Baffa

ECG Compression by Efficient Coding
   Denner Guilhon, Allan K. Barros, and Silvia Comani

Detection of Paroxysmal EEG Discharges Using Multitaper Blind Signal Source Separation
   Jonathan J. Halford
Constrained ICA for the Analysis of High Stimulus Rate Auditory Evoked Potentials
   James M. Harte

Gradient Convolution Kernel Compensation Applied to Surface Electromyograms
   Aleš Holobar and Damjan Zazula

Independent Component Analysis of Functional Magnetic Resonance Imaging Data Using Wavelet Dictionaries
   Robert Johnson, Jonathan Marchini, Stephen Smith, and Christian Beckmann

Multivariate Analysis of fMRI Group Data Using Independent Vector Analysis
   Jong-Hwan Lee, Te-Won Lee, Ferenc A. Jolesz, and Seung-Schik Yoo

Extraction of Atrial Activity from the ECG by Spectrally Constrained ICA Based on Kurtosis Sign
   Ronald Phlypo, Vicente Zarzoso, Pierre Comon, Yves D'Asseler, and Ignace Lemahieu

Blind Matrix Decomposition Techniques to Identify Marker Genes from Microarrays
   R. Schachtner, D. Lutter, F.J. Theis, E.W. Lang, A.M. Tomé, J.M. Gorriz Saez, and C.G. Puntonet

Perception of Transformation-Invariance in the Visual Pathway
   Wenlu Yang, Liqing Zhang, and Libo Ma

Subspaces of Spatially Varying Independent Components in fMRI
   Jarkko Ylipaavalniemi and Ricardo Vigário

Multi-modal ICA Exemplified on Simultaneously Measured MEG and EEG Data
   Heriberto Zavala-Fernandez, Tilmann H. Sander, Martin Burghoff, Reinhold Orglmeister, and Lutz Trahms
Miscellaneous

Blind Signal Separation Methods for the Identification of Interstellar Carbonaceous Nanoparticles
   O. Berné, Y. Deville, and C. Joblin

Specific Circumstances on the Ability of Linguistic Feature Extraction Based on Context Preprocessing by ICA
   Markus Borschbach and Martin Pyka
Conjugate Gamma Markov Random Fields for Modelling Nonstationary Sources
   A. Taylan Cemgil and Onur Dikmen

Blind Separation of Quantum States: Estimating Two Qubits from an Isotropic Heisenberg Spin Coupling Model
   Yannick Deville and Alain Deville

An Application of ICA to BSS in a Container Gantry Crane Cabin's Model
   Juan-José González de-la-Rosa, Carlos G. Puntonet, A. Moreno Muñoz, A. Illana, and J.A. Carmona

Blind Separation of Non-stationary Images Using Markov Models
   Rima Guidara, Shahram Hosseini, and Yannick Deville

Blind Instantaneous Noisy Mixture Separation with Best Interference-Plus-Noise Rejection
   Zbyněk Koldovský and Petr Tichavský

Compact Representations of Market Securities Using Smooth Component Extraction
   Hariton Korizis, Nikolaos Mitianoudis, and Anthony G. Constantinides

Bayesian Estimation of Overcomplete Independent Feature Subspaces for Natural Images
   Libo Ma and Liqing Zhang

ICA-Based Image Analysis for Robot Vision
   Naoya Ohnishi and Atsushi Imiya

Learning of Translation-Invariant Independent Components: Multivariate Anechoic Mixtures
   Lars Omlor and Martin A. Giese

Channel Estimation for O-STBC MISO Systems Using Fourth-Order Cross-Cumulants
   Héctor J. Pérez-Iglesias and Adriana Dapena

System Identification in Structural Dynamics Using Blind Source Separation Techniques
   F. Poncelet, G. Kerschen, J.C. Golinval, and F. Marin

Image Similarity Based on Hierarchies of ICA Mixtures
   Arturo Serrano, Addisson Salazar, Jorge Igual, and Luis Vergara
Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses
   Xavier Sevillano, Germán Cobo, Francesc Alías, and Joan Claudi Socoró
Top-Down Versus Bottom-Up Processing in the Human Brain: Distinct Directional Influences Revealed by Integrating SOBI and Granger Causality
   Akaysha C. Tang, Matthew T. Sutherland, Peng Sun, Yan Zhang, Masato Nakazawa, Amy Korzekwa, Zhen Yang, and Mingzhou Ding

Noisy Independent Component Analysis as a Method of Rotating the Factor Scores
   Steffen Unkel and Nickolay T. Trendafilov

Multilinear (Tensor) ICA and Dimensionality Reduction
   M. Alex O. Vasilescu and Demetri Terzopoulos

ICA in Boolean XOR Mixtures
   Arie Yeredor

A Novel ICA-Based Image/Video Processing Method
   Qiang Zhang, Jiande Sun, Ju Liu, and Xinghua Sun
Keynote Talk

Blind Audio Source Separation Based on Independent Component Analysis (Abstract)
   Shoji Makino, Hiroshi Sawada, and Shoko Araki

Author Index
A Flexible Component Model for Precision ICA

Jean-François Cardoso (1,2) and Maude Martin (2)

(1) CNRS/LTCI, France
(2) Univ. Paris 7/APC, France
Abstract. We describe an ICA method based on second order statistics which was originally developed for the separation of components in astrophysical images but is appropriate in contexts where accuracy and versatility are of primary importance. It combines several basic ideas of ICA in a new flexible framework designed to deal with complex data scenarios. This paper describes our approach and discusses its implementation in terms of a library of components.
1 Introduction

Objectives. This paper describes a framework for component separation which has been designed with the following objectives in mind:

- the ability to model components with much more flexibility than in most basic ICA techniques; this includes, in particular, the case of correlated or multidimensional components,
- the ability to take noise into account; the noise is not necessarily well characterized: its correlation structure may have to be estimated,
- the ability to deal with signals/images of varying resolution,
- the ability to incorporate easily prior information about the components,
- 'relative' speed (enabling error estimates via Monte-Carlo simulations).

Motivation. The original motivation for the work presented here is the processing of spherical maps resulting from all-sky surveys of the microwave emission. The object of interest is the Cosmic Microwave Background (CMB). The space-based Planck mission of the European Space Agency will provide observations of the CMB in 9 frequency channels which will be used as inputs to a component separation method. This is needed because the cosmological background is observed together with several other astrophysical emissions, dubbed 'foregrounds', both of galactic and extra-galactic origins. These foregrounds, together with the CMB, are the components which are to be separated. The CMB map itself is very well modeled as a realization of a (spherical) Gaussian stationary random field but this is not the case of the other components. Our method, however, is not specific to this particular problem and may be considered for application to any situation where 'expensive' data deserve special care and have to be fitted by a complex component model.

Section 2 describes the statistical framework while Section 3 discusses implementation in terms of a library of components.
Maude Martin is partly supported by the Cosmostat project funded by CNRS.
2 The Additive Component Model

2.1 A Component Based Model
We introduce a special notation for the ICA model which is more convenient to our framework, in particular to deal with correlated sources. Traditionally, the noisy ICA model is denoted as X = AS + N, where matrix A is m × n for m channels (sensors) and n sources. This is a multiplicative model in the sense that the mixing matrix A multiplies an hypothetical vector S of 'sources'. The ith source S_i contributes a_i S_i to the observation vector X, where a_i is the ith column of A. Hence, the model can be rewritten 'additively' as a superposition of C = n + 1 random components:

$$ X = \sum_{c=1}^{C} X^c \qquad (1) $$
where X^i = a_i S_i for 1 ≤ i ≤ n and X^{n+1} = N. Such a reformulation is worthless if all sources are modeled as independent. Assume now that two sources, say i and j, are modeled as statistically dependent. They contribute a_i S_i + a_j S_j to the observed X. We decide to lump them into a single component denoted X^c for some index c: X^c = a_i S_i + a_j S_j. Then X can still be written as in eq. (1) and all the components X^c are independent, again. However, the new component X^c can no longer be written as one-dimensional, i.e. as the product of a single fixed column vector multiplied by a random variable. Instead, it can be written as [S_i S_j]^T left-multiplied by the m × 2 matrix [a_i a_j]. Such a component is termed 'bi-dimensional' and we could obviously define multidimensional components of any dimension.

In eq. (1), we have included the noise term as one of the components. Note that if the noise is uncorrelated from channel to channel, as is often the case, then the noise component is m-dimensional (the largest possible dimension). More generally, our model does not require any component to be low-dimensional. Rather, our model is a plain superposition of C components as in eq. (1). None of these components is required to have any special structure, one-dimensional or otherwise. We only require that they are mutually uncorrelated. In other words, we rely on the property

$$ R = \sum_{c=1}^{C} R^c \qquad (2) $$
where R (resp. R^c) is the covariance matrix of the data vector X (resp. of the cth component X^c).

2.2 Component Restoration by Wiener Filtering
The best (in the mean-square sense) linear estimate X̂^c of X^c based on X is well known to be Cov(X^c, X) Cov(X, X)^{-1} X, which reduces here to

$$ \hat{X}^c = R^c R^{-1} X \qquad (3) $$
Hence optimal linear recovery of X^c from X requires only the determination of matrices R and R^c. The total covariance matrix R can often be estimated directly from the data so that, in order to restore component c from the mixture, one "only" has to estimate its covariance matrix R^c.

2.3 Localized Statistics, Localized Separation
In practice, we do not consider a single covariance matrix. Rather, in order to capture the correlation structure better, we compress the data into a set R̂ = {R̂_q}_{q=1}^{Q} of Q covariance matrices of size m × m. For instance, one would estimate the covariance of X over several domains (time intervals for time series, spatial domains for images) or in Fourier space over several frequency bands. More generally, one could localize the covariance matrices in both domains using a wavelet basis or just plain windowed transforms. Index q can be thought of as a time interval (or a spatial zone), a Fourier band, a wavelet domain, etc. In our application (see Introduction), we would consider a hundred angular frequency bands localized over a few zones on the sky, and Q would be the product of these two numbers.

This can be formalized by denoting X(i) the ith coefficient of the data in some localized basis of p elements:

$$ X(i), \quad i \in [1, 2, \ldots, p] = \cup_{q=1}^{Q} D_q \qquad (4) $$
where the set [1, . . . , p] of all coefficient indices is partitioned into Q domains D_1, . . . , D_Q. For instance, X(i) is an m × 1 vector of Fourier coefficients and D_q is a set of discrete Fourier frequencies in a particular band. The sample covariance matrix for the qth domain and its expected value are defined/denoted as

$$ \hat{R}_q = \frac{1}{p_q} \sum_{i \in D_q} X(i) X(i)^\dagger, \qquad R_q = E\{\hat{R}_q\} \qquad (5) $$

where p_q is the number of coefficients in D_q. The same notation is also used for each component so that, these being mutually uncorrelated by assumption, one has the decomposition

$$ R_q = \sum_{c=1}^{C} R_q^c \qquad (6) $$
(7)
i.e. the reconstruction filter also is localized, taking advantage of the ‘local SNR conditions’ on domain Dq .
4
J.-F. Cardoso and M. Martin
Second, the diversity of the statistics of the components over several domains is precisely what may make this model blindly identifiable. For instance, if all components are one-dimensional and there is no noise, we are back to the standard ICA model. Then, if X(i) are Fourier coefficients and Dq are spectral bands, it is known that spectral diversity (no two components have identical spectrum) is a sufficient condition for blind identifiability. 2.4
Model Identification
So far, the separation of components has been discussed without any blindness ingredient. However, we saw that computing the MSE-optimal separating filter for component c in domain q requires only, by eq. (7), the determination of Rcq . A generic procedure for identifying these matrices is to assume some parametric model for each component: the set Rc of localized covariance matrices for the cth component is parametrized by a vector θc of parameters and the component model is some well thought of function θc → Rc (θc ) = {Rcq (θc )}Q q=1 . Some examples are given at section 3.1. A parametric model R(θ) follows by assuming component decorrelation (6) and taking the global parameter θ as the concatenation of the parameters c cof each Q component: θ = (θ1 , . . . , θc ) , so that R(θ) = {Rq (θ)}Q q=1 = { c Rq (θ )}q=1 . The unknown parameters are found by matching model to data, that is, by and R(θ). More specifically: minimizing some measure of discrepancy between R θ = arg min φ(θ)
where φ(θ) =
Q
q , Rq (θ)). wq K(R
(8)
q=1
Here, K(·, ·) is a measure of mismatch between two positive matrices and the w_q are positive weights (example below).

2.5 Summary. Blind. . . or Not?
At this stage, (most of) the statistical framework is in place but our method is not well defined yet because many options are available:

1. choice of a basis to obtain coefficients X(i) and of domains {D_q}_{q=1}^{Q} to define their second-order statistics R̂,
2. choice of a model θ^c → R^c(θ^c) for each component contribution,
3. choice of weights w_q and matrix mismatch K(·, ·) in criterion φ(θ).

Regarding point 3, our most common choice is to use w_q = p_q and

$$ K(R_1, R_2) = \frac{1}{2} \left[ \mathrm{trace}(R_1 R_2^{-1}) - \log\det(R_1 R_2^{-1}) - m \right]. $$

Then φ(θ) is the negative log-likelihood of the model where X(i) ∼ N(0, R_q) for i ∈ D_q and is independent from X(i′) for i ≠ i′. Another design choice is to implement the recovery (7) of individual components either as X̂^c(i) = R_q^c(θ̂^c) R_q(θ̂)^{-1} X(i) or as X̂^c(i) = R_q^c(θ̂^c) R̂_q^{-1} X(i).

Is this a blind component separation method? It all depends on the component model. If all components are modeled as 'classic' ICA components (see 3.1), then the method is as blind as regular ICA. Our approach, however, leaves open the possibility of tuning the blindness level at will by specifying more or less stringent models θ^c → R^c for some or all of the components.
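For concreteness, a minimal sketch of the covariance-matching criterion (8) under the stated choices (w_q = p_q and the Kullback-type mismatch K above) might look as follows; the function names and the `model` callable are our own assumptions, not the authors' implementation.

```python
import numpy as np

def K_mismatch(R1, R2):
    """Kullback-type divergence between positive matrices, Section 2.5."""
    m = R1.shape[0]
    M = R1 @ np.linalg.inv(R2)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M).real - logdet - m)

def phi(theta, Rhat_list, p_list, model):
    """Covariance-matching criterion phi(theta), eq. (8), with w_q = p_q.

    model(theta) must return the list of model covariances {R_q(theta)},
    i.e. the sum over components of R_q^c(theta_c), as in eq. (6).
    """
    Rq_model = model(theta)
    return sum(p * K_mismatch(Rhat, Rq)
               for p, Rhat, Rq in zip(p_list, Rhat_list, Rq_model))
```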
3 Implementation

We are 'only' left with the numerical issue of actually minimizing φ(θ) using an arbitrary library of components; this is the topic of the present section. We call a collection of models θ^c → R^c(θ^c) a library of components. In practice, each member of the library must not only specify a function θ^c → R^c(θ^c) but also its gradient and other related quantities, as we shall see next.

3.1 A Library of Components
Typical examples of component models are now listed.

1. The 'classic' ICA component is one-dimensional: X^c(i) = a_c S_c(i). Denoting σ²_qc the average variance of S_c(i) over the qth domain, the contribution R_q^c of this component to R_q is the rank-one matrix

   $$ R_q^c = a_c a_c^\dagger \, \sigma_{qc}^2 $$

   This component can be described by an (m + Q) × 1 vector θ^c of parameters containing the m entries of a_c and the Q variance values σ²_qc. Such a parametrization is redundant, but we leave this issue aside for the moment.

2. A d-dimensional component can be modeled as R_q^c = A_c P_q^c A_c^† where A_c is an m × d matrix and P_q^c is a d × d positive matrix varying freely over all domains. This can be parametrized by a vector θ^c of m × d + Q × d(d + 1)/2 scalar parameters (the entries of A_c and of the P_q^c). Again, this is redundant, but we ignore this issue for the time being.

3. Noise component. A simple noise model is given by

   $$ R_q^c = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2) $$

   that is, uncorrelated noise from channel to channel, with the same level in all domains but not in all channels. This component is described by a vector θ^c of only m parameters. In our application, we also use R_q^c = diag(σ²_{1q}, . . . , σ²_{mq}), meaning that the noise changes from domain to domain. We then need a parameter vector θ^c of length mQ × 1.

4. As a final example, for modeling 'point sources', we also use R_q^c = R^c. This component contributes identically in all channels. If, for instance, we assume that this contribution R^c is known, then the parameter vector θ^c is void. If R^c is known up to a scale factor, then θ^c is just a scalar, etc.
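To fix ideas, the 'library' could be sketched as follows for items 1 and 3; the class interface (`covariances()` returning the stack {R_q^c}) is purely illustrative and is not taken from the authors' code.

```python
import numpy as np

class ClassicICAComponent:
    """'Classic' rank-one ICA component: R_q^c = a a^H sigma_q^2 (item 1)."""
    def __init__(self, a, sigma2):          # a: (m,), sigma2: (Q,)
        self.a, self.sigma2 = np.asarray(a), np.asarray(sigma2)

    def covariances(self):                  # returns (Q, m, m)
        aaH = np.outer(self.a, self.a.conj())
        return self.sigma2[:, None, None] * aaH[None, :, :]

class NoiseComponent:
    """Diagonal noise: R_q^c = diag(sigma_1^2, ..., sigma_m^2) (item 3)."""
    def __init__(self, sigma2, Q):          # sigma2: (m,), same in all domains
        self.sigma2, self.Q = np.asarray(sigma2), Q

    def covariances(self):
        return np.tile(np.diag(self.sigma2), (self.Q, 1, 1))

def total_covariances(components):
    """Model covariances R_q(theta) = sum_c R_q^c(theta_c), eq. (6)."""
    return sum(comp.covariances() for comp in components)
```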
3.2 Optimization
For a noise-free model containing only 'classic ICA' components, criterion φ(θ) is a joint diagonalization criterion for which a very efficient algorithm exists [3]. In the noisy case, this is no longer true but it is possible, for simple component models, to use the EM algorithm. The EM algorithm, however, is not convenient for general component models and, in addition, EM appears too slow for our purposes. Specialized optimization algorithms are thus required.

The Conjugate Gradient (CG) algorithm has been found well suited for minimizing φ(θ). Its implementation requires that ∂φ/∂θ be computed. Also, CG only works well when properly pre-conditioned by (some approximation of) the inverse of ∂²φ/∂θ². Since φ(θ) actually is a negative log-likelihood in disguise, its Hessian can be approximated by F(θ), the Fisher information matrix (FIM). The FIM is also useful for computing (approximate) error bars on θ̂. Hence we need to compute ∂φ/∂θ and (possibly an approximation of) ∂²φ/∂θ². This computation offers no particular difficulty in theory but our aim is to implement it in the framework of a library of components. It means that we seek to organize the computations in such a way that each component model works as a 'plug-in'.

Computing the gradient. Slightly abusing the notation, the derivative with respect to θ^c takes the form

$$ \frac{\partial \phi(\theta)}{\partial \theta^c} = \sum_{q=1}^{Q} \mathrm{trace}\left( G_q(\theta) \, \frac{\partial R_q^c(\theta^c)}{\partial \theta^c} \right) \qquad (9) $$

where matrix G_q(θ) is defined as

$$ G_q(\theta) = \frac{1}{2} w_q \left[ R_q^{-1}(\theta) - R_q^{-1}(\theta) \, \hat{R}_q \, R_q^{-1}(\theta) \right] \qquad (10) $$
Hence the computation of ∂φ/∂θ at a given point θ = (θ¹, . . . , θ^C) can be organized as follows. A first loop through all components computes R(θ) by adding up the contribution R^c(θ^c) of each component. Then, a second loop over all Q domains computes the matrices {G_q(θ)}_{q=1}^{Q}, which are stored in a common workspace. Finally, a third loop over all components concatenates all partial gradients ∂φ/∂θ^c, each component implementing the computation of the right-hand side of (9) in the best possible way, using the available matrices {G_q(θ)}_{q=1}^{Q}.
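A sketch of this three-loop organization, under the assumption that each component exposes a `covariances()` method and a `grad_from_G()` method implementing its side of eq. (9); this interface is our own illustration, not the authors' actual software design.

```python
import numpy as np

def G_matrices(Rq_model, Rq_hat, w):
    """Matrices G_q(theta) of eq. (10), one per domain."""
    G = []
    for Rq, Rhat, wq in zip(Rq_model, Rq_hat, w):
        Rq_inv = np.linalg.inv(Rq)
        G.append(0.5 * wq * (Rq_inv - Rq_inv @ Rhat @ Rq_inv))
    return G

def gradient(components, Rq_hat, w):
    """Three-loop organization of the gradient of phi, eq. (9)."""
    Rq_model = sum(c.covariances() for c in components)   # first loop, eq. (6)
    G = G_matrices(list(Rq_model), Rq_hat, w)             # second loop
    # third loop: each component contracts G against dR_q^c/dtheta_c
    return np.concatenate([c.grad_from_G(G) for c in components])
```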
Computing an (approximate) Hessian. The Fisher information matrix can be partitioned component-wise with the off-diagonal block [F(θ)]_{cc′} depending on components c and c′. This seems to be a problem for a plug-in architecture because its principle requires that new component models can be introduced (plugged in) independently of each other. Nonetheless, this requirement can be fulfilled because the (c, c′) block of the FIM is

$$ [F(\theta)]_{cc'} = \frac{1}{2} \sum_q w_q \, \mathrm{trace}\left( \frac{\partial R_q^c(\theta^c)}{\partial \theta^c} \, R_q^{-1}(\theta) \, \frac{\partial R_q^{c'}(\theta^{c'})}{\partial \theta^{c'}} \, R_q^{-1}(\theta) \right) \qquad (11) $$
Hence, the FIM can be computed by a double loop over c and c′ since it is only necessary that the code for each component be able to return {∂R_q^c(θ^c)/∂θ^c}_{q=1}^{Q}.

A straightforward implementation of this idea may be impractical, though, because {∂R_q^c(θ^c)/∂θ^c}_{q=1}^{Q} is a set of |θ^c| × Q matrices, possibly very large. This problem can be largely alleviated in the frequent case where components have 'local' variables, that is, whenever θ^c can be partitioned as θ^c = (θ_0^c, θ_1^c, . . . , θ_Q^c) where, for q > 0, vector θ_q^c influences only R_q^c and where θ_0^c collects all the remaining parameters, i.e. those which affect the covariance matrix over two or more domains (the simplest example is the 'classic' ICA component R_q^c = a_c a_c^† σ²_qc, for which θ_0^c = a_c and θ_q^c = σ²_qc for q = 1, . . . , Q). In that case, vector θ can be partitioned into a 'global part' θ_0 = (θ_0^1, . . . , θ_0^C) and Q local parts θ_q = (θ_q^1, . . . , θ_q^C). With such a partitioning, the FIM has many zero blocks since then [F(θ)]_{qq′} = 0 for 1 ≤ q ≠ q′ ≤ Q, and the computations can be organized much more efficiently. Space is lacking for giving more details here.

Indeterminations and penalization. We saw in Section 3.1 that 'natural' component parametrizations often are redundant. From a statistical point of view, this is irrelevant: we seek ultimately to identify R^c = {R_q^c}_{q=1}^{Q} as a member of a family described by a mapping θ^c → R^c(θ^c), but this mapping does not need to be one-to-one. The simplest example again is R_q^c = a_c a_c^† σ²_qc, which is invariant if one changes a_c to αa_c and σ_qc to α^{-1}σ_qc. This is the familiar ICA scale indetermination but R_q^c itself is perfectly well defined [2].

The only serious concern about over-parametrization is from the optimization point of view. Redundancy makes the criterion φ(θ) flat in the redundant directions and it makes the FIM a singular matrix. Finding non-redundant reparametrizations is a possibility, but it is often simpler to add a penalty function to φ(θ) for any redundantly parametrized component. For instance, the scale indetermination of the classic ICA component R_q^c = a_c a_c^† σ²_qc, when parametrized as θ_0^c = a_c and θ_q^c = σ²_qc (q ≥ 1), is fixed by adding φ^c(θ^c) = g(‖a_c‖²) to φ(θ), where g(u) is any reasonable function which has a single minimum at, say, u = 1.
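A direct, unoptimized sketch of eq. (11), again with illustrative names and interfaces of our own choosing:

```python
import numpy as np

def fim_block(dR_c, dR_cp, Rq_inv, w):
    """(c, c') block of the Fisher information matrix, eq. (11).

    dR_c   : list over q of arrays (n_c, m, m), derivatives dR_q^c/dtheta_c
    dR_cp  : same for component c', shape (n_c', m, m)
    Rq_inv : list over q of (m, m) inverses of R_q(theta)
    w      : list of weights w_q
    """
    n_c, n_cp = dR_c[0].shape[0], dR_cp[0].shape[0]
    F = np.zeros((n_c, n_cp))
    for Dc, Dcp, Ri, wq in zip(dR_c, dR_cp, Rq_inv, w):
        for a in range(n_c):
            for b in range(n_cp):
                F[a, b] += 0.5 * wq * np.trace(Dc[a] @ Ri @ Dcp[b] @ Ri).real
    return F
```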
4 Conclusion
Our technique for component separation gains a lot of its flexibility from realizing that one can start with covariance matrix separation —i.e. the identification of individual component terms in the domain-wise decomposition (6)— followed by data separation according to (3). It is thus sufficient to identify matrices Rcq . Whether or not minimizing the covariance matching criterion φ(θ) leads to uniquely identified components depends on the particular component models picked from a ‘library of components’. Uniqueness (identifiability) can only be decided on a case-by-case basis, either from analytical considerations or by inspection of the Fisher information matrix which can be numerically computed using the component library. By using more or less constrained components, the method ranges from totally blind to semi-blind, to non-blind.
Some strong points of the approach are the following.

Speed: the method is potentially fast because large data sets are compressed into R̂, a possibly much smaller object.

Accuracy: the method is potentially accurate because it can model complex components and then recover separated data via local Wiener filtering.

Flexibility: the method is flexible because it can be implemented via a library of components with arbitrary structure.

Noise: the method can take noise into account without increased complexity since noise is not processed differently from any other component.

Prior: the implementation also allows for easy inclusion of prior information about a component c if it can be cast in the form of a prior probability distribution p^c(θ^c), in which case one only needs to subtract log p^c(θ^c) from φ(θ), and the related changes can be delegated to the component code.

Varying resolution: in our application, and possibly others, the input channels are acquired by sensors with channel-dependent resolution. Accurate component separation can only be achieved if this effect is taken into account. This can be achieved with relative simplicity if the data coefficients entering in R̂_q are Fourier coefficients.

This paper combines several ideas already known in the ICA literature: lumping together correlated components into a single multidimensional component is in [2]; minimization of a covariance-matching contrast φ(θ) derived from the log-likelihood of a simple Gaussian model is found for instance in [3]; the extension to noisy models is already explained in [4]. The current paper goes one step further by showing how arbitrarily structured components can be separated and how the related complexity can be managed at the software level by a library of components.
References

1. Pham, D.T., Cardoso, J.F.: Blind separation of instantaneous mixtures of non stationary sources. IEEE Trans. on Sig. Proc. 49(9), 1837–1848 (2001)
2. Cardoso, J.F.: Multidimensional independent component analysis. In: Proc. ICASSP '98, Seattle (1998)
3. Pham, D.: Blind separation of instantaneous mixture of sources via the Gaussian mutual information criterion. Signal Processing 81(4), 855–870 (2001)
4. Delabrouille, J., Cardoso, J.F., Patanchon, G.: Multi-detector multi-component spectral matching and applications for CMB data analysis. MNRAS 346(4), 1089–1102 (2003), http://arXiv.org/abs/astro-ph/0211504
Blind Separation of Instantaneous Mixtures of Dependent Sources

Marc Castella (1) and Pierre Comon (2)

(1) GET/INT, UMR-CNRS 5157, 9 rue Charles Fourier, 91011 Évry Cedex, France
    [email protected]
(2) CNRS, I3S, UMR 6070, BP 121, Sophia-Antipolis Cedex, France
    [email protected]
Abstract. This paper deals with the problem of Blind Source Separation. Contrary to the vast majority of works, we do not assume the statistical independence between the sources and explicitly consider that they are dependent. We introduce three particular models of dependent sources and show that their cumulants have interesting properties. Based on these properties, we investigate the behaviour of classical Blind Source Separation algorithms when applied to these sources: depending on the source vector, the separation may be successful or some additional indeterminacies can be identified.
1 Introduction
Independent Component Analysis (ICA) is now a well-recognized concept, which has fruitfully spread out to a wide panel of scientific areas and applications. Contrary to other frameworks where techniques take advantage of strong information on the diversity, for instance through the knowledge of the array manifold in antenna array processing, the core assumption in ICA is much milder and reduces to the statistical mutual independence between the inputs.

However, this assumption is not mandatory in Blind Source Separation (BSS). For instance, in the case of static mixtures, sources can be separated if they are only decorrelated, when their nonstationarity or their color can be exploited. Other properties, such as the fact that sources belong to a finite alphabet, can alternatively be utilized [1,2] and do not require statistical independence. Inspired by [3,4], we investigate the case of dependent sources, without assuming nonstationarity nor color. To our knowledge, only a few references have tackled this issue [5,6].
2 Mixture Model and Notations
We consider a set of N source signals (s_i(n))_{n∈Z}, i = 1, . . . , N. The dependence on time of the signals will not be made explicit in the paper. The sources are mixed, yielding a P-dimensional observation vector x = (x(n))_{n∈Z} according to the model:

$$ x = A s \qquad (1) $$

where s = (s_1, . . . , s_N)^T, x = (x_1, . . . , x_P)^T and A is a P × N matrix called the mixing matrix. We assume that A is left-invertible. Source separation consists in finding an N × P separating matrix B such that its output y = Bx corresponds to the original sources. When only the observations are used for this, the problem is referred to as the BSS problem. Introducing the N × N global matrix G = BA, the BSS problem is solved if G is a so-called trivial matrix, i.e. the product of a diagonal matrix with a permutation: these are well-known ambiguities of BSS. In this paper, we will study separation criteria as functions of G.

Source separation sometimes proceeds iteratively, extracting one source at a time (e.g. the deflation approach). In this case, we will write y = bx = gs, where b and g = bA respectively correspond to a row of B and G, and y denotes the only output of the separating algorithm. In this case, the separation criteria are considered as functions of g.

Finally, we denote by E{.} the expectation operator and by Cum{.} the cumulant of a set of random variables. Cum_4{y} is equivalent to Cum{y, y, y, y} and, for complex variables, Cum_{2,2}{y} stands for Cum{y, y, y*, y*}.
3 Examples and Properties of Dependent Sources
We introduce in this section different cases of vector sources that are dependent and that will be considered in this paper.

3.1 Three Dependent Sources
Binary phase shift keying (BPSK) signals have specificities that will allow us to obtain source vectors with desired properties. In this paper, we will consider BPSK sources that take values s = +1 or s = −1 with equal probability 1/2. We define the following source vector:

A1. s = (s_1, s_2, s_3)^T where s_1 is BPSK; s_2 is real-valued, non-Gaussian, independent of s_1 and satisfies E{s_2} = E{s_2^3} = 0; and s_3 = s_1 s_2.

Interestingly, the following lemma holds true:

Lemma 1. The sources s_1, s_2, s_3 defined by A1 are obviously mutually dependent. Nevertheless they are decorrelated and their fourth-order cross-cumulants vanish, that is:

$$ \mathrm{Cum}\{s_i, s_j\} = 0 \text{ except if } i = j, \qquad (2) $$
$$ \mathrm{Cum}\{s_i, s_j, s_k, s_l\} = 0 \text{ except if } i = j = k = l. \qquad (3) $$
Proof. Using the definition of s_1, s_2 and their independence, one can easily check that E{s_1} = E{s_2} = E{s_3} = 0. For these centered random variables, it is known that cumulants can be expressed in terms of moments:

Cum{s_i, s_j} = E{s_i s_j}    (4)
Cum{s_i, s_j, s_k, s_l} = E{s_i s_j s_k s_l} − E{s_i s_j}E{s_k s_l} − E{s_i s_k}E{s_j s_l} − E{s_i s_l}E{s_j s_k}    (5)
Using again the definition of s_1, s_2 and their independence, it is then easy to check all cases of equations (4) and (5) and to verify that these fourth-order cross-cumulants are indeed null. On the other hand, the third-order cross-cumulant reads:

Cum{s_1, s_2, s_3} = E{s_1 s_2 s_3} = E{s_1^2 s_2^2} = E{s_1^2} E{s_2^2} > 0    (6)

and this proves that s_1, s_2, s_3 are mutually dependent.
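The lemma is easy to check numerically. In the sketch below (our illustration; we take s_2 to be BPSK, one admissible instance of A1), the cumulants are estimated from samples using the moment expansions (4)-(5):

```python
# Empirical check of Lemma 1: cross-cumulants vanish, Cum{s1,s2,s3} > 0.
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(1)
T = 200000
s1 = rng.choice([-1.0, 1.0], T)
s2 = rng.choice([-1.0, 1.0], T)      # BPSK s2 satisfies E{s2} = E{s2^3} = 0
s = np.vstack([s1, s2, s1 * s2])

def cum4(a, b, c, d):
    # fourth-order cumulant of zero-mean real variables, as in eq. (5)
    return (np.mean(a*b*c*d) - np.mean(a*b)*np.mean(c*d)
            - np.mean(a*c)*np.mean(b*d) - np.mean(a*d)*np.mean(b*c))

# no message should be printed: all fourth-order cross-cumulants are ~ 0
for i, j, k, l in combinations_with_replacement(range(3), 4):
    val = cum4(s[i], s[j], s[k], s[l])
    if abs(val) > 0.02 and len({i, j, k, l}) > 1:
        print("nonzero cross-cumulant", (i, j, k, l), round(val, 3))

print("Cum{s1,s2,s3} ~", round(np.mean(s[0]*s[1]*s[2]), 3))  # ~ 1 here
```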
Depending on s_2, more can be proved about the source vector defined by A1. For example, if the probability density function of s_2 is symmetric, then s_1 and s_3 are independent. On the contrary, s_2 and s_3 are generally not independent. An even more specific case is obtained when s_2 is itself BPSK. In this case, one can check that the sources (s_1, s_2, s_3) are pairwise independent, although not mutually independent.

3.2 Pairwise Independent Sources
We now investigate further the case of pairwise independent sources and introduce the following source vector:

A2. s = (s_1, s_2, s_3, s_4)^T where s_1, s_2 and s_3 are independent BPSK and s_4 = s_1 s_2 s_3.

This case has been considered in [3], where it has been shown that

∀i ∈ {1, ..., 4}, Cum{s_i, s_i, s_i, s_i} = −2,  Cum{s_1, s_2, s_3, s_4} = 1    (7)
and all other cross-cumulants vanish. The latter cumulant value shows that the sources are mutually dependent, although it can be shown that they are pairwise independent. It should be clear that pairwise independence is not equivalent to mutual independence, but in an ICA context it is relevant to recall the following proposition, which is a direct consequence of Darmois' theorem [7, p. 294]:

Proposition 1. Let s be a random vector with mutually independent components, and x = Gs. Then the mutual independence of the entries of x is equivalent to their pairwise independence.

Based on this proposition, the ICA algorithm in [7] searches for an output vector with pairwise independent components. Let us stress that this holds only if the source vector has mutually independent components: pairwise independence is indeed not sufficient to ensure identifiability, as we will see in Section 4.2.
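The distinction can be checked empirically. In the sketch below (our illustration), all pairwise joint frequencies of (s_i, s_4) factorize as 1/4, while the fourth-order cross-cumulant equals 1, as stated in (7):

```python
# A2 numerically: pairwise independence but mutual dependence.
import numpy as np

rng = np.random.default_rng(2)
T = 100000
s = rng.choice([-1.0, 1.0], size=(3, T))
s4 = s[0] * s[1] * s[2]

# pairwise check: P(si = a, s4 = b) ~ 1/4 for every pair of signs
for i in range(3):
    joint = [np.mean((s[i] == a) & (s4 == b)) for a in (-1, 1) for b in (-1, 1)]
    print("pair", i + 1, [round(p, 3) for p in joint])   # all ~ 0.25

# mutual dependence: the cross-cumulant reduces to the moment here and is ~ 1
print("Cum{s1,s2,s3,s4} ~", round(np.mean(s[0] * s[1] * s[2] * s4), 3))
```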
3.3 Complex Valued Sources

We consider quaternary phase shift keying (QPSK) sources, which take their values in {e^{ıπ/4}, e^{−ıπ/4}, e^{ı5π/4}, e^{−ı5π/4}} with equal probability 1/4. We then define the following source vector:

A3. s = (s_1, s_2, s_3, s_4)^T where s_1, s_2 and s_3 are mutually independent QPSK and s_4 = s_1 s_2 s_3.
Based on Equations (4) and (5), which hold for the above centered sources, one can check the following:

Lemma 2. The sources in A3 are dependent and Cum{s_1, s_2, s_3, s_4*} = 1. However, they are second-order decorrelated and all their fourth-order circular cross-cumulants (i.e. with as many conjugates as non-conjugates) vanish, that is:

Cum{s_i, s_j*} = 0 and Cum{s_i, s_j} = 0 except if i = j,    (8)
Cum{s_i, s_j, s_k*, s_l*} = 0 except if i = j = k = l.    (9)
Actually, we can prove that the above lemma, as well as Lemma 4 and Proposition 6, still hold in the case where s_1, s_2 and s_3 are second-order circular and have unit modulus; this is not detailed for reasons of space and clarity.
4 ICA Algorithms and Dependent Sources

4.1 Independence Is Not Necessarily Required
The sources given by A1 provide us with a specific example of dependent sources that are successfully separated by several ICA methods:

Proposition 2. Let y = gs where the vector of sources is defined by A1. Then the function

g → |Cum_4{y}|^α,  α ≥ 1    (10)

defines a MISO contrast function, that is, its maximization over the set of unit-norm vectors (‖g‖_2 = 1) leads to a vector g with only one non-zero component.

Proof. The above proposition follows straightforwardly from Lemma 1, since the proof of the validity of the above contrast functions relies only on the property in Equation (3).

Considering again the argument in the proof, one should easily notice that the above proposition can be generalized to the case of sources which satisfy:

A4. s = (s_1, ..., s_{3K}) where: ∀i ∈ {0, ..., K−1}, s_{3i+1} is BPSK; s_{3i+2} is non-Gaussian and satisfies E{s_{3i+2}} = E{s_{3i+2}^3} = 0; s_{3i+3} = s_{3i+1} s_{3i+2}; and the random variables {s_{3i+1}, s_{3i+2}; i = 0, ..., K−1} are mutually independent.

In addition, the above result can be generalized to convolutive systems and to MIMO (multiple input/multiple output) contrast functions as defined in [7,2]:

Proposition 3. Let y = Gs where the vector of sources is defined by A1. Then the function

G → \sum_{i=1}^{N} |Cum_4{y_i}|^α,  α ≥ 1    (11)

is a MIMO contrast, that is, its maximization over the group of orthogonal matrices leads to a solution G which is a trivial matrix (permutation, scaling).
Many classical algorithms for BSS or ICA first whiten the data: it is known that in so doing they constrain the matrix G to be orthogonal. In particular, so does the algorithm proposed in [7], which relies on the contrast function in (11). This justifies why this algorithm successfully separates the sources A1. Actually, any algorithm relying on prewhitening and associated with a contrast function based on the vanishing of the fourth-order cross-cumulants (e.g. JADE) is able to separate sources such as A1.
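As an illustration (our own sketch, not the algorithms of [7] or JADE themselves), the following simulation draws an orthogonal mixing matrix (so the data are already white) and runs a kurtosis-based fixed-point deflation, which maximizes |Cum_4{y}| as in (10); the A1 sources are recovered up to permutation and sign:

```python
# A1 sources (s2 taken BPSK), orthogonal mixing, fourth-order deflation.
import numpy as np

rng = np.random.default_rng(3)
T = 50000
s1 = rng.choice([-1.0, 1.0], T)
s2 = rng.choice([-1.0, 1.0], T)
s = np.vstack([s1, s2, s1 * s2])

A, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # orthogonal mixing = white x
x = A @ s

W = np.zeros((3, 3))
for k in range(3):                      # deflation, one row of B at a time
    w = rng.standard_normal(3)
    for _ in range(100):
        y = w @ x
        w_new = (x * y**3).mean(axis=1) - 3 * w   # fixed point of |Cum4{y}|
        w_new -= W[:k].T @ (W[:k] @ w_new)        # deflate against earlier rows
        w = w_new / np.linalg.norm(w_new)
    W[k] = w

G = W @ A
print(np.round(np.abs(G), 2))           # ~ permutation: sources recovered
```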
4.2 Pairwise Independence Is Not Sufficient
We now consider the pairwise independent sources given by A2 and show that pairwise independence is not sufficient to ensure identifiability of the ICA model. We first have the following preliminary result:

Lemma 3. Let y = gs where the vector of sources is defined by A2. Assume that the vector (s_1, s_2, s_3) takes all 2^3 possible values. If the signal y has values in {−1, +1}, then g = (g_1, g_2, g_3, g_4) is one of the solutions below:

∃i ∈ {1, ..., 4}: g_i = ±1 and ∀j ≠ i, g_j = 0,
∃i ∈ {1, ..., 4}: g_i = ±1/2 and ∀j ≠ i, g_j = −g_i.    (12)

Proof. If y = gs, using the fact that s_i^2 = 1 for i = 1, ..., 4, we have with the particular sources given by A2:

y^2 = g_1^2 + g_2^2 + g_3^2 + g_4^2 + 2[(g_1 g_2 + g_3 g_4) s_1 s_2 + (g_1 g_3 + g_2 g_4) s_1 s_3 + (g_2 g_3 + g_1 g_4) s_2 s_3].

Since (s_1, s_2, s_3) takes all possible values in {−1, 1}^3, we deduce from y^2 = 1 that the following equations necessarily hold:

g_1^2 + g_2^2 + g_3^2 + g_4^2 = 1,
g_1 g_2 + g_3 g_4 = g_1 g_3 + g_2 g_4 = g_2 g_3 + g_1 g_4 = 0.    (13)

First observe that the values given in (12) indeed satisfy (13). Yet, if a polynomial system of N equations of degree d in N variables admits a finite number of solutions¹, then there can be at most d^N distinct solutions. Hence we have found them all in (12), since (12) provides us with 16 solutions for (g_1, g_2, g_3, g_4).

Using the above result, we are now able to specify the output of classical ICA algorithms when applied to a mixture of sources which satisfy A2.

Constant modulus and contrasts based on fourth-order cumulants. The constant modulus (CM) criterion is one of the best-known criteria for BSS. In the real-valued case, it simplifies to:

J_CM(g) = E{(y^2 − 1)^2}  with  y = gs.    (14)

¹ One can show that the number of solutions of (13) is indeed finite.
Proposition 4. For the sources given by A2, the minimization of the constant modulus criterion with respect to g leads to one of the solutions given by Equation (12).

Proof. We know that the minimum value of the constant modulus criterion is zero and that this value can be reached (for g having one entry equal to ±1 and the other entries zero). Moreover, the vanishing of the constant modulus criterion implies that y^2 − 1 = 0 almost surely, and one can then apply Lemma 3.

A connection can now be established with the fourth-order autocumulant if we impose the following constraint:

E{y^2} = 1  (or equivalently ‖g‖ = 1 since y = gs).    (15)

Because of the scaling ambiguity of BSS, the above normalization can be freely imposed. Under (15), we have Cum_4{y} = E{(y^2 − 1)^2} − 2, and minimizing J_CM(g) thus amounts to maximizing −Cum_4{y}. Unfortunately, since Cum_4{y} may be positive or negative, no simple relation between |Cum_4{y}| and J_CM(g) can be deduced from the above equation. However, we can state:

Proposition 5. Let y = gs where the vector of sources is defined by A2. Then, under the constraint (15) (‖g‖ = 1), we have:
(i) The maximization of g → −Cum_4{y} leads to one of the solutions given by Equation (12).
(ii) |Cum_4{y}| ≤ 2, and the equality |Cum_4{y}| = 2 holds true if and only if g is one of the solutions given in Equation (12).

Proof. Part (i) follows from the arguments given above. In addition, using the multilinearity of cumulants and (7), we have for y = gs:

Cum_4{y} = −2(g_1^4 + g_2^4 + g_3^4 + g_4^4) + 24 g_1 g_2 g_3 g_4.    (16)

The result then follows straightforwardly from the study of the polynomial function in Equation (16). Indeed, optimizing (16) leads to the following Lagrangian:

L = −2 \sum_{i=1}^{4} g_i^4 + 24 \prod_{i=1}^{4} g_i − λ (\sum_{i=1}^{4} g_i^2 − 1).    (17)

After solving the polynomial system which cancels the gradient of the above expression, one can check that all solutions are such that |Cum_4{y}| ≤ 2. Details are omitted for reasons of space. Part (ii) of the proposition easily follows.

Similarly to the previous section, the above proposition can be generalized to MIMO contrast functions. In particular, this explains why, for a particular set of mixing matrices such as that studied in [3], the pairwise maximization algorithm of [7] still succeeded: a separation was luckily obtained for the considered mixing matrices and initialization point of the algorithm, but it would actually not succeed in separating dependent BPSK sources for general mixing matrices.
Let us also stress that the results in this section are specific to the contrast functions given by (10) or (11). In particular, these results do not apply to algorithms based on other contrast functions such as JADE, contrary to the results in Sections 4.1 and 4.3.

4.3 Complex Case
The output given by separation algorithms in the case of complex-valued signals may differ from the previous results, which have been proved for real-valued signals only. Indeed, complex-valued BSS does not always sum up to an obvious generalization of the real-valued case [8]. We illustrate this in our context and show that, quite surprisingly, blind separation of the sources given by A3 can be achieved up to the classical indeterminacies of ICA. This is in contrast with the result in Equation (12), where additional indeterminacies appeared. First, we have:

Lemma 4. Let y = gs where the vector of sources is defined by A3. Assume that the vector (s_1, s_2, s_3) takes all 4^3 possible values. If the signal y is such that its values satisfy |y|^2 = 1, then g = (g_1, g_2, g_3, g_4) satisfies:

∃i ∈ {1, ..., 4}: |g_i| = 1 and ∀j ≠ i, g_j = 0.    (18)
Proof. If y = gs, using the fact that |s_i|^2 = 1 for i = 1, ..., 4, we have with the particular sources given by A3:

|y|^2 = \sum_{i=1}^{4} |g_i|^2 + \sum_{i≠j} g_i g_j^* s_i s_j^*.    (19)
Since (s_1, s_2, s_3) takes all possible values in {1, ı, −1, −ı}^3, we deduce from |y|^2 = 1 that the following equations necessarily hold:

|g_1|^2 + |g_2|^2 + |g_3|^2 + |g_4|^2 = 1,
g_1 g_2^* = g_1 g_3^* = g_1 g_4^* = g_2 g_3^* = g_2 g_4^* = g_3 g_4^* = 0.    (20)

Solving the polynomial system in the variables |g_1|, |g_2|, |g_3| and |g_4|, we obtain that the solutions are the ones given in Equation (18).

Constant modulus and fourth-order cumulant based contrasts. In contrast with Propositions 4 and 5, we have the following result:

Proposition 6. Let y = gs where the sources satisfy A3. Then the functions

g → −E{(|y|^2 − 1)^2}    (21)
g → |Cum_{2,2}{y}|  under the constraint E{|y|^2} = 1    (22)

are contrast functions, that is, their maximization leads to g satisfying (18).

Proof. The validity of the first contrast function is obtained with the same arguments as in the proof of Proposition 4: at the maximum we have |y|^2 = 1 in the mean square sense, which yields (20) via (19). In the case of independent sources, the proof of the validity of the second contrast involves only cumulants with an equal number of conjugate and non-conjugate variables: invoking Lemma 2, one can see that the same proof still holds here. Note that the same arguments can be applied to ICA methods such as the pairwise algorithm in [2] or JADE [9]. Figure 1 illustrates our result.
Fig. 1. Typical observed separation result of the sources A3 with the algorithm JADE (left: sensors, right: separation result)
References
1. Li, T.H.: Finite-alphabet information and multivariate blind deconvolution and identification of linear systems. IEEE Trans. on Information Theory 49(1), 330–337 (2003)
2. Comon, P.: Contrasts, independent component analysis, and blind deconvolution. Int. Journal Adapt. Control Sig. Proc. 18(3), 225–243 (2004), special issue on Signal Separation. Preprint: I3S Research Report RR-2003-06, http://www3.interscience.wiley.com/cgi-bin/jhome/4508
3. Comon, P.: Blind identification and source separation in 2 × 3 under-determined mixtures. IEEE Trans. Signal Processing 52(1), 11–22 (2004)
4. Comon, P., Grellier, O.: Non-linear inversion of underdetermined mixtures. In: Proc. of ICA'99, Aussois, France, pp. 461–465 (January 1999)
5. Hyvärinen, A., Shimizu, S.: A quasi-stochastic gradient algorithm for variance-dependent component analysis. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 211–220. Springer, Heidelberg (2006)
6. Cardoso, J.F.: Multidimensional independent component analysis. In: Proc. ICASSP '98, Seattle (1998)
7. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
8. Eriksson, J., Koivunen, V.: Complex random vectors and ICA models: Identifiability, uniqueness and separability. IEEE Trans. on Information Theory 52(3), 1017–1029 (2006)
9. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEE Proceedings-F 140, 362–370 (1993)
The Complex Version of the Minimum Support Criterion

Sergio Cruces, Auxiliadora Sarmiento, and Iván Durán

Dpto. de Teoría de la Señal y Comunicaciones, Escuela de Ingenieros, Universidad de Sevilla, Camino de los Descubrimientos s/n, 41092 Sevilla, Spain
[email protected]
Abstract. This paper addresses the problem of the blind signal extraction of sources by means of an information theoretic and geometric criterion. Our main result is the extension of the minimum support criterion to the case of mixtures of complex signals. This broadens the scope of its possible applications in several fields, such as communications.
1 Introduction
★ Part of this research was supported by the MCYT Spanish project TEC2004-06451-C05-03.

The paradigm of linear ICA consists in the decomposition of the observations into a linear combination of independent components (or sources), plus some added noise. The problem is named blind signal separation (BSS) when one tries to recover all the involved sources, whereas it is named blind signal extraction (BSE) when one is interested in one or a subset of the sources. In the late 1970s, a powerful contrast function was proposed to solve the problem of blind deconvolution [1]. This contrast function, which minimizes the Shannon entropy of the output under a variance constraint on its signal component, was a direct consequence of the entropy power inequality [2]. A similar principle was much later rediscovered in the field of ICA, where the minimization of the mutual information of the outputs, under a covariance constraint, was seen as a natural contrast function to solve the BSS problem [3]. Indeed, provided that the inverse system exists, there is a continuum of contrast functions based on marginal entropies which allow the simultaneous extraction of an arbitrary number of source signals [4]. Since then, the ICA literature has explored the properties of other generalized entropy measures, like Renyi's entropies, to obtain novel information theoretic contrast functions [5,6]. A criterion which involves the minimization of the sum of the ranges of the outputs was proposed in [7] for solving the BSS problem with order statistics. Some time later, we independently proposed a similar criterion (the minimum support criterion), which minimizes the zero-order Renyi entropy of the output, for solving the problem of the blind extraction of one of the sources [8]. In [9] the minimum range criterion for extraction was rediscovered and proved
to be free of erroneous minima, a very desirable property. The minimum support and the minimum range criteria coincide only when all the involved signals have convex support; otherwise they differ [10]. In this paper, we revisit the minimum support criterion and extend its role as a contrast function to mixtures of complex source signals. The paper is organized as follows. In Section 2 we present the signal model. Sections 3 and 4 detail some useful results and definitions of geometrical objects. Section 5 presents the complex version of the minimum support criterion and other extensions. Section 6 presents the simulations and, finally, Section 7 discusses the conclusions.
2 Signal Model and Notation
We consider the standard linear mixing model of complex stationary processes in a noiseless situation. The observation random vector obeys the following equation:

X = AS,    (1)

where S = [S_1, ..., S_n]^T ∈ C^{n×1} is a random vector with independent components, and A ∈ C^{n×n} is a mixing matrix with complex elements. In order to extract one non-Gaussian source from the mixture, one can compute the inner product of the observations with a vector u, to obtain the output random variable or estimated source

Y = u^H X = g^H S,    (2)

where g^H = u^H A denotes the vector with the coefficients of the mixture of the sources at the output. The Darmois-Skitovitch theorem [3] guarantees the identifiability of non-Gaussian complex sources, up to a permutation, scaling and phase term. Let e_i, i = 1, ..., n, denote the coordinate vectors; one source is extracted when

g = g e^{jθ} e_i,  i ∈ {1, ..., n}.    (3)
3 Support Sets and Geometric Inequalities
Consider two m-dimensional vectors of random variables A and B, whose respective densities are f_A(a) and f_B(b).

Definition 1. The support set of a random vector A, which we denote by S_A = supp{A}, is the set of points for which its probability density function is nonzero, i.e., S_A = {a ∈ R^m : f_A(a) > 0}.

Definition 2. The convex hull of the set S_A, which we denote by S̆_A = conv S_A, is the intersection of all convex sets in R^m which contain S_A.
In this paper, we will consider that all the support sets of interest are compact (bounded and closed); thus we will make no distinction between the convex hull and the convex closure.

Definition 3. The Minkowski sum of two given sets S_A and S_B is defined as the set S_A ⊕ S_B = {a + b : a ∈ S_A, b ∈ S_B}, which contains all the possible sums of the elements of S_A with the elements of S_B.

In the case of two independent random vectors A and B, it is easy to observe that the support of their sum, S_{A+B}, is equal to the Minkowski sum of the original support sets S_A ⊕ S_B. The following famous theorem in geometry establishes the superadditivity of the m-th root of the volume of a Minkowski sum of two sets.

Theorem 1 (Brunn-Minkowski inequality in R^m). Let S_A and S_B be nonempty bounded Lebesgue measurable sets in R^m such that S_A ⊕ S_B is also measurable. Then

μ_m(S_A ⊕ S_B)^{1/m} ≥ μ_m(S_A)^{1/m} + μ_m(S_B)^{1/m}.    (4)

The Brunn-Minkowski inequality is formulated for nonempty bounded measurable sets in R^m. However, we want to apply it to obtain a criterion that works for complex data. The next section will help us in this task.
4 Isomorphisms Between Real and Complex Sets
The following bijective mapping,

c = ℜ{c} + jℑ{c} → T_1(c) = [ℜ{c}, ℑ{c}]^T,    (5)

defines a well-known isomorphism between the space of complex scalar numbers C and the vector space R^2 with the operations of addition and multiplication by a real number. However, the multiplication of two complex numbers is not naturally carried over to R^2. Fortunately, there is another isomorphism, between the space of complex scalar numbers c ∈ C and a subfield of the vector space of real 2 × 2 matrices, which does carry the operation of multiplication. It is defined by the following bijective mapping:

c = ℜ{c} + jℑ{c} → T_2(c) = [ℜ{c}, −ℑ{c}; ℑ{c}, ℜ{c}].    (6)

The two previously presented isomorphisms allow one to express the following operation on complex random variables,

Y = \sum_{i=1}^{n} g_i^* S_i,    (7)

as the equivalent real operation between real vectors of random variables:

[ℜ{Y}; ℑ{Y}] = \sum_{i=1}^{n} [ℜ{g_i}, ℑ{g_i}; −ℑ{g_i}, ℜ{g_i}] [ℜ{S_i}; ℑ{S_i}].    (8)

Moreover, to any given set of complex numbers S_A we can associate an area μ_2(A) which represents the area of the equivalent set T_1(S_A) = {T_1(a) : a ∈ S_A} of R^2 defined by the real and imaginary pairs of coordinates. Thus, the measure of the support of a complex scalar random variable is defined as the measure of the support of the random vector formed by its real and imaginary parts:

μ_1^c(S_C) ≡ μ_2(supp{[ℜ{C}; ℑ{C}]}).    (9)

Note that the measure of the support of the complex scalar multiplication g_i^* S_i is invariant to the phase of the complex scalar g_i^*, because the phase term only implies a rotation of the space. This can be better seen from the fact that

μ_1^c(S_{g_i^* S_i}) = μ_2(supp{[ℜ{g_i}, ℑ{g_i}; −ℑ{g_i}, ℜ{g_i}] [ℜ{S_i}; ℑ{S_i}]}) = |g_i|^2 μ_1^c(S_{S_i}).
5 The Complex Version of the Minimum Support Criterion
Now we are ready to apply the Brunn-Minkowski theorem. We will implicitly assume complex sources whose densities have bounded, Lebesgue measurable and non-empty supports. Under these conditions, we can exploit the previously defined isomorphisms between real and complex sets to rewrite the Brunn-Minkowski inequality in R^2 (see equation (4)) as an inequality for the measure of the support of complex random variables:

(μ_1^c(S_Y))^{1/2} ≥ \sum_{i=1}^{n} (μ_1^c(S_{g_i^* S_i}))^{1/2} = \sum_{i=1}^{n} |g_i| (μ_1^c(S_{S_i}))^{1/2}.    (10)
A theorem originally formulated by Lusternik, whose proof was later corrected by Henstock and Macbeath [13], establishes the general conditions for equality to hold in the Brunn-Minkowski theorem.

Theorem 2 (Conditions for equality). Let S_A and S_B be nonempty bounded Lebesgue m-dimensional measurable sets, and let S̄_A and S̆_A denote, respectively, the complement and the convex closure of S_A.

a) If μ_m(S_A) = 0 and 0 < μ_m(S_B) < ∞, then the necessary and sufficient condition for equality in the Brunn-Minkowski theorem is that S_A should consist of one point only.
b) If 0 < μ_m(S_A) μ_m(S_B) < ∞, equality in the Brunn-Minkowski theorem holds if and only if μ_m(S̆_A ∩ S̄_A) = μ_m(S̆_B ∩ S̄_B) = 0 and the convex closures S̆_A and S̆_B are homothetic¹.

By the application of Theorem 2, equality in (10) is only obtained when one of the following conditions is true:

Case a) The mixture at the output is trivial, i.e.,

Y = g_i^* S_i,  i ∈ {1, ..., n},    (11)

which happens when the output is an arbitrarily scaled and rotated version of only one of the sources.

Case b) When the sources whose contribution to the output does not vanish have support sets which are all convex and homothetic.

The connection between the zero-order Rényi entropy of a random vector in R^2 and the volume of its support set (see [11]) leads us to identify the zero-order entropy of a complex random variable with the joint zero-order entropy of its real and imaginary parts:

h_0^c(Y) = log μ_1^c(S_Y) ≡ h_0(ℜ{Y}, ℑ{Y}).    (12)
Then, we can use equation (10) to obtain a different inequality which relates the zero-order entropy of the output with those of the sources and which, at the same time, prevents equality from holding in the situations described in case b). This new inequality is at the heart of the following result.

Theorem 3. If the measure of the support set of the complex sources is finite and does not vanish for at least n − 1 of them,

μ_1^c(S_{S_{π_i}}) ≠ 0,  i = 1, ..., n − 1,  π a permutation of {1, ..., n},    (13)

then the zero-order entropy of the normalized output,

Ψ(X, u) = h_0^c(u^H X / ‖u‖_2) = h_0^c(Y / ‖u‖_2),    (14)

is a contrast function for the extraction of one of the sources. The global minimum of this contrast function is obtained for the source (or sources) with the smallest scaled measure of support, i.e.,

min_u Ψ(X, u) = min_i h_0^c(S_i / ‖a_i^-‖_2),    (15)

where a_i^- denotes the i-th column of A^{-H}, the inverse Hermitian transpose of the mixing matrix.

¹ They are equal sets up to translation and dilation.
Due to lack of space, its proof is omitted. The result tells us that we can extract one of the sources by minimizing the area of the support set of the output. Note that the theorem requires neither the typical ICA assumption of circularity of the complex sources nor the mutual independence between their real and imaginary parts. The minimum support contrast function does not work for discrete sources (drawn from alphabets of finite cardinality) because their supports are of zero measure, a case not covered by the conditions of the theorem. Nevertheless, after replacing the support sets of the original random variables by their convex hulls, we return to the conditions of the theorem, obtaining the well-behaved contrast function
Ψ̆(X, u) = log μ_1^c(S̆_{Y/‖u‖_2}) ≡ h_0^c(Y̆ / ‖u‖_2).    (16)

Indeed, in all of our experiments, and similarly to the minimum range contrast for the case of real mixtures [9], this contrast function was apparently free of deceptive minima. Although we still do not know whether this property is true in general, we succeeded in proving the following result.

Theorem 4. For a mixture of n complex sources with bounded circular convex hull, the minima of the contrast function Ψ̆(X, u) can only be attained at the solutions of the extraction problem, i.e., there are no local deceptive minima.
6 Simulations
In order to optimize the contrast function, we first parametrize a complex unit-norm vector u in terms of 2n − 2 angles (ignoring a common phase term). Let R(1, k+1, α_k, β_k), for k = 1, ..., n−1, denote a class of planar rotation matrices; then u = e_1^T R(1, n, α_{n−1}, β_{n−1}) ··· R(1, 2, α_1, β_1). Since the extraction solutions are non-differentiable points of the contrast function, we used the downhill simplex method of Nelder and Mead to optimize it in low dimensions [14]. In high dimensions, improved convergence is obtained when combining the previous optimization technique with numerical gradient and line-search methods. Each function evaluation requires the computation of the planar convex hull of a set of T outputs. The optimal algorithms for this task have, in the worst case, a computational complexity of O(T log V), where V is the number of vertices of the convex hull [15]. Consider the sample experiment of 200 observations of a complex mixture of two 16QAM sources (a typical constellation used in communications). The illustration of figure 1 presents the graph of the contrast function Ψ̆(X, u), which periodically tessellates the (α_1, β_1)-plane. The figure shows a contrast function with no local deceptive minima, which is non-differentiable at those points where
Fig. 1. Graph of the contrast function, with respect the parameters (α1 , β1 ), for a mixture of two 16QAM sources. The solutions to the extraction problem are at the minima of the function.
Fig. 2. The 16QAM source recovered by the extraction algorithm and the frontier of the convex hull of its support (dashed line)
the Brunn-Minkowski equality holds true. The illustration of figure 2 presents the 16QAM source extracted by the previously described algorithm and the frontier of the convex hull of its support.
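A simplified version of this experiment can be sketched as follows (our illustration: n = 2, a single pair of angles instead of the full product of rotations, and SciPy's Nelder-Mead and ConvexHull in place of the authors' implementation):

```python
# Minimum-support extraction of one of two mixed 16QAM sources (n = 2).
import numpy as np
from scipy.spatial import ConvexHull
from scipy.optimize import minimize

rng = np.random.default_rng(6)
T = 2000
qam = rng.integers(0, 4, (2, 2, T)) - 1.5       # 16QAM: {-1.5,-0.5,0.5,1.5}^2
S = qam[:, 0] + 1j * qam[:, 1]
A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
X = A @ S

def hull_area(params):
    alpha, phi = params
    u = np.array([np.cos(alpha), np.sin(alpha) * np.exp(-1j * phi)])
    y = np.conj(u) @ X                          # Y = u^H X, with ||u|| = 1
    pts = np.column_stack([y.real, y.imag])
    return ConvexHull(pts).volume               # support area of the output

res = minimize(hull_area, x0=[0.3, 0.3], method='Nelder-Mead')
print(res.x, hull_area(res.x))   # one source's (scaled) support is recovered
```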
7 Conclusions
We have presented a geometric criterion for the extraction of one independent component from a linear mixture of complex and mutually independent signals. The criterion favors the extraction of the source signals with minimum scaled support and does not require mutual independence between their real and imaginary parts. Under certain given conditions, the criterion is proved to be free of defective local minima, although a general proof is still elusive.
References
1. Donoho, D.: On minimum entropy deconvolution. In: Findley, D.F. (ed.) Applied Time Series Analysis II, pp. 565–608. Academic Press, New York (1981)
2. Blachman, N.M.: The convolution inequality for entropy powers. IEEE Trans. on Information Theory IT-11, 267–271 (1965)
3. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
4. Cruces, S., Cichocki, A., Amari, S.-i.: From blind signal extraction to blind instantaneous signal separation: criteria, algorithms and stability. IEEE Trans. on Neural Networks 15(4), 859–873 (2004)
5. Bercher, J.-F., Vignat, C.: A Renyi entropy convolution inequality with application. In: Proc. of EUSIPCO, Toulouse, France (2002)
6. Erdogmus, D., Principe, J.C., Vielva, L.: Blind deconvolution with minimum Renyi's entropy. In: Proc. of EUSIPCO, Toulouse, France, vol. 2, pp. 71–74 (2002)
7. Pham, D.T.: Blind separation of instantaneous mixture of sources based on order statistics. IEEE Trans. on Signal Processing 48(2), 363–375 (2000)
8. Cruces, S., Durán, I.: The minimum support criterion for blind signal extraction. In: Proc. of the Int. Conf. on Independent Component Analysis and Blind Signal Separation, Granada, Spain, pp. 57–64 (2004)
9. Vrins, F., Verleysen, M., Jutten, C.: SWM: A class of convex contrasts for source separation. In: Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, Philadelphia, USA, vol. V, pp. 161–164 (2005)
10. Cruces, S., Sarmiento, A.: Criterion for blind simultaneous extraction of signals with clear boundaries. Electronics Letters 41(21), 1195–1196 (2005)
11. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley Series in Telecommunications. John Wiley, Chichester (1991)
12. Gardner, R.J.: The Brunn-Minkowski inequality. Bulletin of the American Mathematical Society 39(3), 355–405 (2002)
13. Henstock, R., Macbeath, A.M.: On the measure of sum-sets. I. The theorems of Brunn, Minkowski, and Lusternik. In: Proceedings of the London Mathematical Society, third series, vol. 3, pp. 182–194 (1953)
14. Nelder, J.A., Mead, R.: A simplex method for function minimization. Computer Journal 7, 308–313 (1965)
15. Chan, T.M.: Optimal output-sensitive convex hull algorithms in two and three dimensions. Discrete and Computational Geometry 16, 361–368 (1996)
Optimal Joint Diagonalization of Complex Symmetric Third-Order Tensors. Application to Separation of Non Circular Signals

Christophe De Luigi and Eric Moreau

STD, ISITV, av. G. Pompidou, BP 56, F-83162 La Valette du Var Cedex, France
[email protected],
[email protected]

Abstract. In this paper, we address the problem of blind source separation of non-circular digital communication signals. A new Jacobi-like algorithm that achieves the joint diagonalization of a set of symmetric third-order tensors is proposed. The application to the separation of non-Gaussian sources using fourth-order cumulants is particularly investigated. Finally, computer simulations on synthetic signals show that this new algorithm improves on the STOTD algorithm.
1 Introduction
In the classical blind source separation problem, see e.g. [1] [2] [3] and [4], statistics-based matrices or tensors often have an identical decomposition. This known decomposition is then used through a Jacobi-like algorithm to estimate the so-called mixing matrix. Perhaps one of the most popular algorithms of that kind is given in [1]. It is called JADE and its goal is to joint-diagonalize a set of Hermitian matrices. The algorithm in [5] is intended to "joint-diagonalize" a set of complex symmetric matrices. The ICA algorithm in [2] is intended to diagonalize a fixed-order (cumulant) tensor. The STOTD algorithm in [3] is intended to "joint-diagonalize" a particular set of (cumulant) third-order tensors. Actually, principally in wireless telecommunication applications, non-circular signals are of importance, see e.g. [5][6][7][8]. The main goal of this paper is to propose a novel approach that can easily combine "non-circular" statistics with circular ones for separation. It is based on a particular decomposition of symmetric third-order tensors. Notice that the circular part corresponds to the STOTD algorithm [3], while the non-circular one is original. We apply the proposed algorithm and compare it with STOTD using computer simulations. They illustrate the usefulness of considering both kinds of statistics.
2 The Proposed Algorithm

2.1 The "Non Circular" Algorithm
We consider N_1 symmetric complex third-order tensors T_l, l = 1, ..., N_1, of dimension N × N × N, decomposed linearly as:

D_l = T_l ×_1 U ×_2 U ×_3 U    (1)
where the D_l are also symmetric complex third-order tensors and U is a complex unitary matrix. The notation in (1) is defined component-wise as

(D_l)_{j_1 j_2 j_3} = \sum_{k_1, k_2, k_3} (T_l)_{k_1 k_2 k_3} (U)_{j_1 k_1} (U)_{j_2 k_2} (U)_{j_3 k_3}.    (2)

It is important to notice that our decomposition is different from the one in [3]. Indeed, there, the considered complex third-order tensors satisfy the following decomposition:

D_l = T_l ×_1 U ×_2 U^* ×_3 U^*    (3)

where the D_l are also symmetric complex third-order tensors. The goal is to estimate a unitary matrix U in such a way that the tensors D_l are (approximately) diagonal. For that task, it is rather classical to consider the following quadratic criterion,

C(U, {T}) = \sum_{l=1}^{N_1} \sum_{i=1}^{N} |(D_l)_{iii}|^2,    (4)
to be maximized. It corresponds to the maximization of the sum of the squared norms of all diagonals of the set of tensors {D}, hence to the minimization of the squared norm of all off-diagonal components. As we consider a Jacobi-like algorithm, we study the case N = 2. In that case the unitary matrix U can be parameterized as

U = [cos(α), −sin(α) exp(jφ); sin(α) exp(−jφ), cos(α)]    (5)

where α and φ are two angles. We can now propose the following result.

Proposition 1. The criterion in (4) can be written as

C(U, {T}) = u^T B_1 u    (6)

where

u = (cos(2α), sin(2α) sin(φ), sin(2α) cos(φ))^T    (7)

and B_1 is a real symmetric matrix whose expression is given in the proof.

Proof. With the property of symmetry of the tensor T_l, that is to say

(T_l)_{ppk} = (T_l)_{pkp} = (T_l)_{kpp},    (8)

we can write the two elements of the diagonals of one single (2 × 2 × 2) tensor D_l:

(D_l)_{111} = (T_l)_{111} cos^3(α) − 3(T_l)_{112} sin(α) cos^2(α) e^{jφ} + 3(T_l)_{122} sin^2(α) cos(α) e^{j2φ} − (T_l)_{222} sin^3(α) e^{j3φ}
(D_l)_{222} = (T_l)_{111} sin^3(α) e^{−j3φ} + 3(T_l)_{112} cos(α) sin^2(α) e^{−j2φ} + 3(T_l)_{122} cos^2(α) sin(α) e^{−jφ} + (T_l)_{222} cos^3(α).    (9)
Then, we obtain the squared norms of the diagonals of this single (2 × 2 × 2) tensor:

|(D_l)_{111}|^2 + |(D_l)_{222}|^2 = (|(T_l)_{111}|^2 + |(T_l)_{222}|^2)(cos^6(α) + sin^6(α)) + 2.25 (|(T_l)_{112}|^2 + |(T_l)_{122}|^2) sin^2(2α) + 3 Re{(T_l)_{222}(T_l)_{122}^* e^{jφ} − (T_l)_{111}(T_l)_{112}^* e^{−jφ}} sin(2α) cos(2α) + 1.5 Re{(T_l)_{111}(T_l)_{122}^* e^{−j2φ} + (T_l)_{222}(T_l)_{112}^* e^{j2φ}} sin^2(2α),    (10)

which can be written

|(D_l)_{111}|^2 + |(D_l)_{222}|^2 = u^T B_l u    (11)

where u is a real (3 × 1) vector such that u^T u = 1, defined by (7), and B_l is a real symmetric (3 × 3) matrix defined by

(B_l)_{11} = |(T_l)_{111}|^2 + |(T_l)_{222}|^2
(B_l)_{12} = −1.5 Im{(T_l)_{222}(T_l)_{122}^* + (T_l)_{111}(T_l)_{112}^*}
(B_l)_{13} = 1.5 Re{(T_l)_{222}(T_l)_{122}^* − (T_l)_{111}(T_l)_{112}^*}
(B_l)_{22} = 0.25 (|(T_l)_{111}|^2 + |(T_l)_{222}|^2) + 2.25 (|(T_l)_{112}|^2 + |(T_l)_{122}|^2) − 1.5 Re{(T_l)_{111}(T_l)_{122}^* + (T_l)_{222}(T_l)_{112}^*}    (12)
(B_l)_{23} = 1.5 Im{(T_l)_{111}(T_l)_{122}^* − (T_l)_{222}(T_l)_{112}^*}
(B_l)_{33} = 0.25 (|(T_l)_{111}|^2 + |(T_l)_{222}|^2) + 2.25 (|(T_l)_{112}|^2 + |(T_l)_{122}|^2) + 1.5 Re{(T_l)_{111}(T_l)_{122}^* + (T_l)_{222}(T_l)_{112}^*}

So, the maximization of the sum of the squared norms of the diagonals of the set of tensors {D_l} is obtained through

\sum_{l=1}^{N_1} (|(D_l)_{111}|^2 + |(D_l)_{222}|^2) = u^T B_1 u    (13)

where the real matrix B_1 is defined by

B_1 = \sum_{l=1}^{N_1} B_l,    (14)

which completes the proof of the proposition.
The maximization of the criterion in (6) is easily achieved by computing the eigenvector associated with the largest eigenvalue, and the angles are then obtained using

cos(α) = sqrt((1 + u(1)) / 2),
sin(α) exp(jφ) = (u(3) + j u(2)) / (2 cos(α)),    (15)

with α ∈ [−π/4, π/4]. Finally, the unknown unitary matrix U is obtained from the accumulation of the successive Jacobi matrices, which are taken transposed and conjugated.
2.2 The General Algorithm
In general, one has to use all available useful statistics to solve a problem. Hence we propose to combine, in an optimal way, the developments above with those of the STOTD algorithm [3]. As will now be seen, this can be done very easily. For third-order tensors that can be decomposed as in (3), it was shown in [3] that, in the case N = 2, the criterion in (4) can be written as

C(U, {T}) = u^T B_2 u    (16)

where B_2 is a real symmetric matrix. Hence an optimal combination of the two kinds of tensors can be considered altogether by simply searching for the eigenvector of (1 − λ)B_1 + λB_2 associated with the largest eigenvalue, where λ is a real parameter with λ ∈ [0, 1]. We can see that λ = 0 corresponds to the "non circular" algorithm, called NC-STOTD, and λ = 1 corresponds to the STOTD algorithm. In this paper, the optimal coefficient λ will be found by simulations (in fact, it would be possible to find it by minimizing a norm of the covariance matrix of the parameters α and φ). A sketch of the resulting Jacobi update is given below.
3 Link with Source Separation
In the source separation problem, an observed signal vector x[n] is assumed to follow the linear model

x[n] = As[n]    (17)

where n ∈ Z is the discrete time, s[n] is the (N, 1) vector of N ≥ 2 unobservable complex input signals s_i[n], i ∈ {1, ..., N}, called sources, x[n] is the (N, 1) vector of observed signals x_i[n], i ∈ {1, ..., N}, and A is the (N, N) square mixing matrix, assumed invertible. It is classical to consider that the sources s_i[n], i ∈ {1, ..., N}, are zero-mean, unit-power, stationary and statistically mutually independent.
We also assume that the sources possess a non-zero high-order cumulant (of the order under consideration), i.e. ∀i ∈ {1, ..., N}, the R-th order cumulant

Cum{s_i[n], ..., s_i[n]}  (R terms)  = C_R{s_i}    (18)
is non-zero for all i and for a fixed R ≥ 3. We also assume that the matrix A is unitary. This can always be arranged, assuming that a first whitening stage is applied to the observations. The blind source separation problem now consists in estimating a unitary matrix H in such a way that the vector

y[n] = Hx[n]    (19)

restores one of the different sources on each of its different components. Perhaps one of the most useful ways to solve the separation problem consists in the use of contrast functions. They correspond to objective functions which depend on the outputs of the separating system and which have to be maximized to get a separating solution. Let us now propose the following result.

Proposition 2. Let R be an integer such that R ≥ 3. Using the notation

C_R{y, i, j} = Cum{y_i, y_i, y_i, y_{j_1}, ..., y_{j_{R−3}}}  (R − 3 terms y_{j_1}, ..., y_{j_{R−3}}),    (20)

the function

J_R(y) = \sum_{i, j_1, ..., j_{R−3} = 1}^{N} |C_R{y, i, j}|^2    (21)

is a contrast for white vectors y.

The proof is reported in a forthcoming paper. We now show that the contrast J_R(y) is linked to a joint-diagonalization criterion of a set of symmetric third-order tensors. Such a joint-diagonalization criterion is defined as in (4). This equivalence is given by the following result.

Proposition 3. With R ≥ 3, let T_R be the set of M = N^{R−3} third-order tensors T(j_1, ..., j_{R−3}) = (T_{i,j,k}(j_1, ..., j_{R−3})) defined as

T_{i,j,k}(j_1, ..., j_{R−3}) = Cum{x_i, x_j, x_k, x_{j_1}, ..., x_{j_{R−3}}}.    (22)

Then, if H is a unitary matrix, we have

C(H, T_R) = J_R(Hx).    (23)

Hence the joint diagonalization of third-order symmetric tensors is a sufficient condition for separation. Moreover, different orders of cumulants can be considered within the same framework. The numerical sketch below checks this equivalence for R = 4.
4 Simulations
We illustrate the performance of the proposed algorithm, in comparison with the STOTD algorithm (the case λ = 1) and the NC-STOTD one (the case λ = 0), by Monte Carlo simulations in which we average over 500 runs. In our experiment, we consider two independent complex source signals which are non-circular, and two noisy mixtures. We have taken the mixing matrix unitary to avoid the whitening step, which may degrade the performance. The objective is to emphasize the existence of an optimal parameter λ which allows an optimal performance of the general algorithm. We consider two situations: one with the 5-state source distribution S1, taking the values {−1; −j; 0; β; jβ} with the respective probabilities {1/(2(1+β)); 1/(2(1+β)); (β−1)/β; 1/(2β(1+β)); 1/(2β(1+β))}, and the other one with the 4-state source distribution S2, taking the values {−1; −j; 0; β; jβ} with uniform probabilities. These two sources are non-circular and, in S1, the parameter β may be chosen such that the cumulant C_{40}{·} is more weighty than the cumulant C_{22}{·}, while in S2, whatever the parameter β, the cumulant C_{22}{·} is more weighty than the cumulant C_{40}{·}. In each run, we take 10000 samples for each of the chosen sources and we take the same unitary 2-by-2 mixing matrix. The noise distribution is a zero-mean Gaussian distribution. The signal-to-noise ratio (SNR) is set so that the noise power equals 0, 1/8, 1/4, 1/2 and 3/4 of the power of the source signal. In order to find the optimal coefficient, we vary λ from 0 to 1 with a step of 1/40. We consider the following index of performance [4], which evaluates the proximity of the estimated global matrix ÂA, where Â is the separating matrix, to a trivial matrix:

I(ÂA) = (1 / (N(N−1))) [ \sum_{i=1}^{N} ( \sum_{j=1}^{N} |(ÂA)_{i,j}|^2 / max_ℓ |(ÂA)_{i,ℓ}|^2 − 1 ) + \sum_{j=1}^{N} ( \sum_{i=1}^{N} |(ÂA)_{i,j}|^2 / max_ℓ |(ÂA)_{ℓ,j}|^2 − 1 ) ],    (24)

with N the dimension of the considered matrix ÂA. In Fig. 1 and Fig. 2, we plot, for different noise powers, this index of performance for the general algorithm, called G-STOTD, versus λ. First, we can see in the two cases S1 and S2 that there exists an optimal coefficient λ which gives better performance than the STOTD and NC-STOTD algorithms. So we can say that combining the statistics of symmetry in an optimal way improves the results of the ICA algorithm in the case of non-circular sources. A sketch of the index computation is given below.
Fig. 1. Performance of the G-STOTD algorithm versus λ, for different noise powers, for S1
Fig. 2. Performance of the G-STOTD algorithm versus λ, for different noise powers, for S2
5 Conclusion
This paper has proposed a new general algorithm for the joint diagonalization of complex symmetric third-order tensors, which not only improves on the STOTD algorithm but also opens new perspectives for non-circular sources.
References
1. Cardoso, J.-F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEE Proceedings-F 140, 362–370 (1993)
2. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994)
3. De Lathauwer, L., De Moor, B., Vandewalle, J.: Independent component analysis and (simultaneous) third-order tensor diagonalization. IEEE Transactions on Signal Processing 49, 2262–2271 (2001)
4. Moreau, E.: A generalization of joint-diagonalization criteria for source separation. IEEE Transactions on Signal Processing 49, 530–541 (2001)
5. De Lathauwer, L., De Moor, B., Vandewalle, J.: ICA techniques for more sources than sensors. In: Proceedings of the IEEE Signal Processing Workshop on Higher-Order Statistics (HOS'99), Caesarea, Israel, pp. 121–124 (June 1999)
6. De Lathauwer, L., De Moor, B.: On the blind separation of non-circular sources. In: Proceedings of EUSIPCO-02, Toulouse, France, vol. II, pp. 99–102 (September 2002)
7. Chevalier, P.: Optimal time invariant and widely linear spatial filtering for radiocommunications. In: Proc. EUSIPCO'96, Trieste, Italy, pp. 559–562 (September 1996)
8. Chevalier, P.: Optimal array processing for non stationary signals. In: Proc. ICASSP'96, Atlanta, pp. 2868–2871 (May 1996)
Imposing Independence Constraints in the CP Model

Maarten De Vos¹, Lieven De Lathauwer², and Sabine Van Huffel¹

¹ ESAT, Katholieke Universiteit Leuven, Leuven, Belgium
² CNRS - ETIS, Cergy-Pontoise, France
[email protected]
Abstract. We propose a new algorithm to impose independence constraints in one mode of the CP model, and show with simulations that it outperforms the existing algorithm.
1 Introduction
One of the most fruitful tools in linear algebra-based signal processing is the Singular Value Decomposition (SVD) (4). Most other important algebraic concepts use the SVD as a building block, generalizing or refining this concept for analysing quantities that are characterised by only two variables. When the data has an intrinsically higher dimensionality, higher-order generalizations of the SVD can be used. An example of a multi-way decomposition method is the CP model (also known as Canonical Decomposition (CANDECOMP) (3) or Parallel Factor Model (PARAFAC) (5)). Recently, a new interesting concept arose in the biomedical field. In (1), the idea of combining Independent Component Analysis (ICA) and the CP model was introduced. However, the multi-way structure was imposed after the computation of the independent components. In this paper, we propose an algorithm to impose the CP structure during the ICA computation. We also performed some numerical experiments to compare our algorithm to the algorithm proposed in (1).
This research is funded by a PhD grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen). Research supported by Research Council KUL: GOA-AMBioRICS, CoE EF/05/006 Optimization in Engineering, IDO 05/010 EEG-fMRI; Flemish Government: FWO: projects, G.0407.02 (support vector machines), G.0360.05 (EEG, Epileptic), G.0519.06 (Noninvasive brain oxygenation), FWO-G.0321.06 (Tensors/Spectral Analysis), G.0341.07 (Data fusion), research communities (ICCoS, ANMMM); IWT: PhD Grants; Belgian Federal Science Policy Office IUAP P6/04 (‘Dynamical systems, control and optimization’, 2007-2011); EU: BIOPATTERN (FP6-2002-IST 508803), ETUMOUR (FP6-2002-LIFESCIHEALTH 503094), Healthagents (IST2004-27214), FAST (FP6-MC-RTN-035801); ESA: Cardiovascular Control (Prodex8 C90242).
1.1 Basic Definitions
Definition 1. A 3rd-order tensor T has rank 1 if it equals the outer product of 3 vectors A_1, B_1, C_1: t_{ijk} = a_i b_j c_k for all values of the indices. The outer product of A_1, B_1 and C_1 is denoted by A_1 ◦ B_1 ◦ C_1.

Definition 2. The rank of a tensor is defined as the minimal number of rank-1 terms in which the tensor can be decomposed.

Definition 3. The Kruskal rank or k-rank of a matrix is the maximal number r such that any set of r columns of the matrix is linearly independent.

Definition 4. The Frobenius norm of a tensor T ∈ R^{I×J×K} is defined as

||T||_F = ( \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} t_{ijk}^2 )^{1/2}.    (1)
Notation. Scalars are denoted by lower-case letters (a, b, ...), vectors are written as italic capitals (A, B, ...), matrices correspond to bold-face capitals (A, B, ...) and tensors are written as calligraphic letters (A, B, ...). This notation is used consistently for lower-order parts of a given structure. For instance, the entry with row index i and column index j in a matrix A, i.e. (A)_{ij}, is symbolized by a_{ij} (also (A)_i = a_i and (A)_{i_1 i_2 ... i_N} = a_{i_1 i_2 ... i_N}). The i-th column vector of a matrix A is denoted as A_i, i.e. A = [A_1 A_2 ...]. Italic capitals are also used to denote index upper bounds (e.g. i = 1, 2, ..., I). ⊙ is the Khatri-Rao or column-wise Kronecker product.

1.2 Independent Component Analysis
Assume the basic linear statistical model

Y = M · X + N    (2)

where Y ∈ R^I is called the observation vector, X ∈ R^J the source vector and N ∈ R^I additive noise. M ∈ R^{I×J} is the mixing matrix. The goal of Independent Component Analysis is to estimate the mixing matrix M, and/or the source vector X, given only realizations of Y. In this study, we assume that I ≥ J. Blind identification of M in (2) is only possible when some assumptions about the sources are made. One assumption is that the sources are mutually statistically independent, as well as independent from the noise components, and that at most one source is Gaussian (2). For more details, we refer to (9; 6).

1.3 The CP Model

The model. The CP model (5; 3; 15) of a three-way tensor T ∈ R^{I×J×K} is a decomposition of T as a linear combination of a minimal number R of rank-1 terms:

T = \sum_{r=1}^{R} λ_r A_r ◦ B_r ◦ C_r (+E).    (3)
A pictorial representation of the CP model for third-order tensors is given in figure 1.
Fig. 1. Pictorial representation of the CP model
Consider a third-order (I × J × K) tensor T whose CP model can be expressed as

t_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr},  ∀i, j, k,    (4)

in which A ∈ R^{I×R}, B ∈ R^{J×R} and C ∈ R^{K×R}. Another equivalent and useful expression of the same CP model is given with the Khatri-Rao product. We assume that min(IJ, K) ≥ R. Associate with T a matrix T ∈ R^{IJ×K} as follows:

(T)_{(i−1)J+j, k} = T_{ijk}.    (5)

This matrix has the following structure:

T = (A ⊙ B) · C^T.    (6)
Comparing the number of free parameters of a generic tensor and of the CP model, it can be seen that this model is very restricted. The advantage of this model is its uniqueness under mild conditions (7; 14):

rank_k(A) + rank_k(B) + rank_k(C) ≥ 2R + 2,    (7)

with rank_k(A) the k-rank of matrix A and R the rank of the tensor.

Computation of the CP decomposition. Originally, an alternating least-squares (ALS) algorithm was proposed in order to minimize the least-squares cost function for the computation of the CP decomposition:

min_{A,B,C} ||T − A(B ⊙ C)^T||^2.    (8)

Due to the symmetry of the model in the different modes, the updates for all modes are essentially identical, with the roles of the different modes shifted. Assume that B and C are fixed; the estimate of A can then be optimized via a classical linear least squares problem:

min_A ||X − AZ^T||^2    (9)
where Z equals B ⊙ C. This has to be repeated until convergence, while the matrices in the other modes are kept fixed, in order to compute all factors of the decomposition. Afterwards, it was also shown that the CP decomposition can in theory be computed from an eigenvalue decomposition (EVD) (12; 13) under certain assumptions, of which the most restrictive is that R ≤ min{I, J}. This results in a faster computation. However, when the model is only approximately valid, this will only form the initialization of the ALS procedure. In (11), it is shown that the computation of the CP model based on a simultaneous EVD is actually more robust than one based on a single EVD. This again implied the rank condition R ≤ min{I, J}. As we will need this algorithm in the further developments, we review the computational scheme here. Substitution of (5) in (4) shows that any vector in the range of T can be represented by an I × J matrix that can be decomposed as:

V = A · D · B^T    (10)

with D diagonal. If the range is spanned by K matrices V_1, V_2, ..., V_K, the computation of the canonical decomposition can be obtained by the simultaneous decomposition of the set {V_k}_{1≤k≤K}:

V_1 = A · D_1 · B^T    (11)
V_2 = A · D_2 · B^T    (12)
...
V_K = A · D_K · B^T    (13)

The best choice for these matrices, in order to span the full range of this mapping, consists of the K dominant left singular matrices of the mapping in (5) (10). In order to deal with these equations in a numerically proper way, the problem can be formulated in terms of orthogonal unknowns (17; 11). Introducing a QR-factorization A = Q^T R and an RQ-factorization B^T = R̃ Z^T leads to a simultaneous generalized Schur decomposition:

Q V_1 Z = R · D_1 · R̃    (14)
Q V_2 Z = R · D_2 · R̃    (15)
...
Q V_K Z = R · D_K · R̃.    (16)
This simultaneous generalized Schur decomposition can be computed by an extended QZ-iteration (17). Recently, it was shown in (8) that the canonical components can be obtained from a simultaneous matrix diagonalization with a much less severe restriction on R. A minimal ALS sketch is given below.
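For illustration, here is a minimal ALS sketch along the lines of (8)-(9) (our own simplified code, not the cited implementations; no EVD-based initialization is used):

```python
# Plain CP-ALS on an exact rank-3 tensor; each mode solved by linear LS.
import numpy as np

rng = np.random.default_rng(10)
I, J, K, R = 6, 5, 4, 3
A0, B0, C0 = (rng.standard_normal((n, R)) for n in (I, J, K))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

def kr(U, V):                        # Khatri-Rao product U ⊙ V
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
for _ in range(200):                 # each line is the LS update (9) per mode
    A = T.reshape(I, J*K) @ np.linalg.pinv(kr(B, C)).T
    B = T.transpose(1, 0, 2).reshape(J, I*K) @ np.linalg.pinv(kr(A, C)).T
    C = T.transpose(2, 0, 1).reshape(K, I*J) @ np.linalg.pinv(kr(A, B)).T

# residual is typically tiny; ALS can occasionally need more iterations
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))
```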
2 Combination of ICA and CP Model
In this section, we first review the tensorial extension of ICA, called 'tensor pICA', as introduced by Beckmann (1) (§2.1). Then we present a new algorithm to compute the CP decomposition of a tensor T in which independence is imposed on the factors of one mode (§2.2). For notational convenience, we restrict ourselves to the three-way real case, but the generalization to higher dimensions or complex tensors is straightforward. In the following, we always assume that the components of the third mode are independent. Due to the symmetric structure of the PARAFAC model, equivalent equations can be derived for the other two modes. In formulas, we consider the matricized version of the real tensor T, given by equation (6), where the matrix C contains the independent source values and the mixing matrix M equals (A ⊙ B).

2.1 Tensor pICA
In (1), a generalization of the standard bilinear (two-way) exploratory analysis to higher dimensions was derived as follows:

1. Perform an iteration step for the decomposition of the full data using the two-dimensional probabilistic ICA approach, for the decomposition into a compound mixing matrix M_{IJ×R} and the associated source signals C_{K×R}: X_{IJ×K} = MC^T + Ẽ_1.
2. Decompose the estimated mixing matrix M such that M = (A ⊙ B) + Ẽ_2 via a column-wise rank-1 eigenvalue decomposition: each column in (A ⊙ B) is formed by K scaled repetitions of a single column from A. In order to obtain A and B, the matrices G_1, ..., G_R ∈ R^{I×J} can be introduced as

(G_r)_{ij} = m_{(i−1)J+j, r},  ∀i, j, r.    (17)

A_r and B_r can then be computed as the dominant left and right singular vectors of G_r, 1 ≤ r ≤ R (see the sketch after this list).
3. Iterate the decomposition of X_{IJ×K} and M until convergence, i.e. until ||A_new − A_old||_F + ||B_new − B_old||_F + ||C_new − C_old||_F < ε.
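Here is a small sketch of the rank-1 step 2 (our illustration): when M = A ⊙ B holds exactly, the dominant singular vectors of each G_r recover the rank-1 factor A_r B_r^T exactly:

```python
# Recovering A_r, B_r from the columns of a Khatri-Rao structured M.
import numpy as np

rng = np.random.default_rng(11)
I, J, R = 4, 3, 2
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
M = np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)    # compound mixing A ⊙ B

for r in range(R):
    G = M[:, r].reshape(I, J)              # (G_r)_{ij} = m_{(i-1)J+j, r}
    U, s_vals, Vt = np.linalg.svd(G)
    A_r, B_r = U[:, 0] * s_vals[0], Vt[0]  # dominant rank-1 factors
    print(np.allclose(np.outer(A_r, B_r), np.outer(A[:, r], B[:, r])))  # True
```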
2.2 ICA-CP
The ordinary ICA problem is solved by diagonalizing the fourth-order cumulant (9). This cumulant can be written as the following CP decomposition:

C_y^{(4)} = \sum_{r=1}^{R} κ_{x_r} M_r ◦ M_r ◦ M_r ◦ M_r.    (18)
With a mixing matrix M = A ⊙ B, this fourth-order cumulant can be expressed as an eighth-order tensor with CP structure:

C_y^{(8)} = \sum_{r=1}^{R} κ_{x_r} A_r ◦ B_r ◦ A_r ◦ B_r ◦ A_r ◦ B_r ◦ A_r ◦ B_r.    (19)
This can be seen as follows. Define matrices E_1, ..., E_R ∈ R^{I×J} as

(E_r)_{ij} = m_{(i−1)J+j, r},  ∀i, j, r.    (20)

When the model in (6) is exactly satisfied, E_r can be decomposed as

E_r = A_r B_r^T,  r = 1, ..., R,    (21)
which explains the CP decomposition in equation (19). This CP decomposition can be computed in different ways, depending on the rank R of the tensor. It is not even necessary to compute the full decomposition: once A and B are known, the mixing matrix (A ⊙ B) can be computed and the independent sources can be estimated from equation (6).

Rank R restricted by R ≤ min{I, J}. In order to compute the mixing matrix (A ⊙ B) from (19), associate a matrix H ∈ R^{IJ×I^3 J^3} with C_y^{(8)} as follows:

H_{(i−1)J+j, (k−1)I^2 J^3 + (l−1)I^2 J^2 + (m−1)IJ^2 + (n−1)IJ + (o−1)J + p} = (C_y^{(8)})_{ijklmnop}.    (22)

This mapping can be represented by a matrix H ∈ R^{IJ×I^3 J^3}:

H = (A ⊙ B) · Λ · (A ⊙ B ⊙ A ⊙ B ⊙ A ⊙ B)^T,    (23)

with Λ = diag{κ_1, ..., κ_R}. Substituting (22) in (19) shows that any vector in the range of C_y^{(8)} can be represented by an (I × J) matrix that can be decomposed as

V = A · D · B^T    (24)

with D diagonal. Any matrix in this range can be diagonalized by congruence with the same loading matrices A and B. A possible choice of {V_k}_{1≤k≤K} consists of 'matrix slices' obtained by fixing the 3rd to 8th indices. An optimal approach would be to estimate the R dominant left singular vectors of (23). The joint decomposition of the matrices {V_k}_{1≤k≤K} will give a set of equations similar to equations (11)-(13). We have explained in §1.3 how to solve these equations simultaneously.
3 Numerical Experiments
In this section we illustrate the performance of our algorithm by means of numerical experiments and compare it to tensor pICA. Rank-R tensors T̃ ∈ R^{5×3×100}, of which the components in the different modes will be estimated afterwards, are generated in the following way:

$$\tilde{T} = \frac{T}{\|T\|_F} + \sigma_N \frac{N}{\|N\|_F}, \qquad (25)$$

in which T exactly satisfies the CP model with R = 3 independent sources in the third mode (C) and N represents Gaussian noise. All the source distributions are binary (+1 or −1), with equal probability for both values. The sources are
zero mean and have unit variance. The entries of the two other modes (A and B) are drawn from a zero-mean unit-variance Gaussian distribution. We conduct Monte Carlo simulations consisting of 500 runs. We evaluate the performance of the different algorithms by means of the normalized Frobenius norm of the difference between the estimated and the real sources:

$$\mathrm{error}_C = \frac{\|C - \hat{C}\|_F}{\|C\|_F}. \qquad (26)$$
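A minimal numpy sketch of this data generation and of the error measure (26) could look as follows (variable names are ours; the matricized model (6) is built with a Khatri-Rao product, and in practice the permutation and sign of the estimated sources must be fixed before the error is evaluated).

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 5, 3, 100, 3
A = rng.standard_normal((I, R))                 # Gaussian loadings in modes 1 and 2
B = rng.standard_normal((J, R))
C = rng.choice([-1.0, 1.0], size=(K, R))        # binary sources: zero mean, unit variance

KR = np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)   # Khatri-Rao product A ⊙ B
T = KR @ C.T                                           # matricized tensor, eq. (6)
N = rng.standard_normal(T.shape)
sigma_N = 0.5                                          # an assumed noise level
T_noisy = T / np.linalg.norm(T) + sigma_N * N / np.linalg.norm(N)   # eq. (25)

def error_C(C_true, C_hat):
    """Normalized Frobenius error of eq. (26)."""
    return np.linalg.norm(C_true - C_hat) / np.linalg.norm(C_true)
```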
In Figure 2, we plot the mean value of error_C over the 500 simulations. The previously proposed method, tensor pICA, is clearly outperformed by the new algorithm.

Fig. 2. The mean value of error_C as a function of the noise level σ_N for the algorithms ICA-CP (solid) and tensor pICA (dashed).
4 Conclusion
We proposed a new algorithm to impose the CP structure already during the ICA computation, for the case where the rank R is restricted by R ≤ min{I, J}. We showed with simulations that, by taking this structure into account, the algorithm outperforms tensor pICA. A follow-up paper will discuss an algorithm for the case where the rank R exceeds min{I, J}. For a detailed comparison between CP and the combination ICA-CP, we refer to [16].
References

[1] Beckmann, C.F., Smith, S.M.: Tensorial extensions of independent component analysis for multisubject fMRI analysis. NeuroImage 25, 294–311 (2005)
[2] Cardoso, J.-F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proc. F 140, 362–370 (1994)
[3] Carroll, J.D., Chang, J.: Analysis of individual differences in multidimensional scaling via an n-way generalization of 'Eckart-Young' decomposition. Psychometrika 35, 283–319 (1970)
[4] Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
[5] Harshman, R.A.: Foundations of the PARAFAC procedure: models and conditions for an 'explanatory' multi-modal factor analysis. UCLA Working Papers in Phonetics 16 (1970)
[6] Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Chichester (2001)
[7] Kruskal, J.B.: Three-way arrays: rank and uniqueness of trilinear decompositions, with applications to arithmetic complexity and statistics. Linear Algebra and its Applications 18, 95–138 (1977)
[8] De Lathauwer, L.: A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM J. Matrix Anal. Appl. 28, 642–666 (2006)
[9] De Lathauwer, L., De Moor, B., Vandewalle, J.: An introduction to independent component analysis. J. Chemometrics 14, 123–149 (2000)
[10] De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000)
[11] De Lathauwer, L., De Moor, B., Vandewalle, J.: Computation of the canonical decomposition by means of a simultaneous generalized Schur decomposition. SIAM J. Matrix Anal. Appl. 26, 295–327 (2004)
[12] Leurgans, S.E., Ross, R.T., Abel, R.B.: A decomposition for three-way arrays. SIAM J. Matrix Anal. Appl. 14, 1064–1083 (1993)
[13] Sanchez, E., Kowalski, B.R.: Tensorial resolution: a direct trilinear decomposition. J. Chemometrics 4, 29–45 (1990)
[14] Sidiropoulos, N.D., Bro, R.: On the uniqueness of multilinear decomposition of N-way arrays. J. Chemometrics 14, 229–239 (2000)
[15] Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis with Applications in the Chemical Sciences. John Wiley & Sons, Chichester (2004)
[16] Stegeman, A.: Comparing independent component analysis and the PARAFAC model for artificial multi-subject fMRI data. Technical Report, Heymans Institute, University of Groningen, The Netherlands (2007)
[17] Van Der Veen, A.-J., Paulraj, A.: An analytical constant modulus algorithm. IEEE Trans. Signal Processing 44, 1136–1155 (1996)
Blind Source Separation of a Class of Nonlinear Mixtures
Leonardo Tomazeli Duarte and Christian Jutten
GIPSA-lab, INPG-CNRS, Grenoble, France
[email protected], [email protected]

Abstract. In this work, we deal with blind source separation of a class of nonlinear mixtures. The proposed method can be regarded as an adaptation of the solutions developed in [1,2] to the considered mixing system. Also, we provide a local stability analysis of the employed learning rule, which permits us to establish necessary conditions for appropriate convergence. The validity of our approach is supported by simulations.
1 Introduction
The problem of blind source separation (BSS) concerns the retrieval of an unknown set of source signals by using only samples that are mixtures of these original signals. A great number of methods have been proposed for the case wherein the mixture process is of linear nature. The cornerstone of the majority of these techniques is independent component analysis (ICA) [3]. In contrast to the linear case, recovering independence, which is the very essence of ICA, does not, as a rule, guarantee the separation of the sources when the mixture model is nonlinear. In view of this limitation, a more reasonable approach is to consider constrained mixing systems such as post-nonlinear (PNL) mixtures [4] and linear-quadratic mixtures [2]. In this work, we investigate the problem of BSS in a particular class of nonlinear systems that is related to a chemical sensing application. More specifically, the contributions of this paper are the adaptation of the ideas presented in [1,2] to the considered mixing system, as well as a study of some necessary conditions for a proper operation of the obtained separating method. Concerning the organization of the document, we begin, in Section 2, with a brief description of the application that motivated us. After that, in Section 3, we expose the separation method and also a stability analysis of the learning rule. In Section 4, simulations are carried out in order to verify the viability of the proposal. Finally, in Section 5, we state our conclusions and remarks.
2 Motivation and Problem Statement
The classical methods for chemical sensing applications are generally based on the use of a single highly selective sensor. As a rule, these techniques demand
Leonardo Tomazeli Duarte would like to thank CNPq (Brazil) for the financial support.
sophisticated laboratory analysis, which makes them expensive and time consuming. An attractive alternative to these methods relies on the use of an array of less selective sensors combined with a post-processing stage whose purpose is precisely to extract the relevant information from the acquired data. In [5,6], post-processing stages based on BSS methods were considered in the problem of estimating the concentrations of several ions in a solution. In this sort of application, a device called ion-sensitive field-effect transistor (ISFET) [5] may be employed as sensor. In short, the ISFET is built on a MOSFET by replacing the metallic gate with a membrane sensitive to the ion of interest, thus permitting the conversion of chemical information into electrical information. The Nikolsky-Eisenman (NE) model [5] provides a very simple and yet adequate description of the ISFET operation. According to this model, the response of the i-th ISFET sensor is given by

$$x_i = c_{i1} + c_{i2} \log\Bigl(s_i + \sum_{j,\, j\neq i} a_{ij}\, s_j^{z_i/z_j}\Bigr), \qquad (1)$$
where s_i and s_j are the concentrations of the ion of interest and of the j-th interfering ion, respectively, and where z_i and z_j denote the valences of the ions i and j. The selectivity coefficients a_ij model the interference process; c_i1 and c_i2 are constants that depend on some physical parameters. Note that when the ions have the same valence, the model (1) can be seen as a particular case of the class of PNL systems, as described in [6]. In the present work, we envisage the situation in which z_i ≠ z_j. According to the NE model, one obtains a tough nonlinear mixing model in this case. For the sake of simplicity, we assume in this paper that the coefficients c_i1 and c_i2 are known (even if their estimation is not so simple). Considering a mixture of two ions, this simplification leads to the following nonlinear mixing system, which will be considered in this work:

$$x_1 = s_1 + a_{12}\, s_2^{k}, \qquad x_2 = s_2 + a_{21}\, s_1^{1/k}, \qquad (2)$$
where k = z_1/z_2 is known. We consider that k takes only positive integer values. Indeed, in many actual applications, typical target ions are H3O+, NH4+, Ca2+, K+, etc. Consequently, many cases correspond to k ∈ N and, in this paper, we will focus on this case. Also, the sources are assumed positive, since they represent concentrations. Finally, it is assumed that the s_i are mutually independent, which is equivalent to assuming that there is no interaction between the ions.
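For concreteness, the mixing model (2) can be simulated in a few lines (a sketch with assumed parameter values matching the first experimental scenario of Section 4):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2                                  # valence ratio z1/z2, assumed known
a12, a21 = 0.5, 0.5                    # selectivity coefficients
s1 = rng.uniform(0.1, 1.1, 3000)       # positive sources (ion concentrations)
s2 = rng.uniform(0.1, 1.1, 3000)

x1 = s1 + a12 * s2**k                  # eq. (2)
x2 = s2 + a21 * s1**(1.0 / k)
```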
3 Separation Method
For separating the sources s_i from the mixtures (2), we propose a parametric recursive model (see (3) below), whose parameters w_ij are adjusted by a simple ICA algorithm. Consequently, the equilibrium points and their stability depend both on a structural condition (due to the recursive nature of (3)) and on the learning algorithm, as explained in Subsection 3.3.
3.1 Separating Structure
In this work, we adopted the following recurrent network as separating system:

$$y_1(m+1) = x_1 - w_{12}\, y_2(m)^{k}, \qquad y_2(m+1) = x_2 - w_{21}\, y_1(m)^{1/k}, \qquad (3)$$
where [w_12 w_21]^T are the parameters to be adjusted. In order to understand how this structure works, let s = [s_1 s_2]^T denote a sample of the sources. By considering (2), one can easily check that when [w_12 w_21]^T = [a_12 a_21]^T, then s corresponds to an equilibrium point of (3). This approach, which counterbalances the action of the mixing system without relying on its direct inversion, was first developed in [1] for linear BSS. Its extension to the nonlinear case was proposed in [2], in the context of source separation of linear-quadratic mixtures.

Naturally, an ideal operation of (3) as a separating system requires that s = [s_1 s_2]^T be the only equilibrium point when [w_12 w_21]^T = [a_12 a_21]^T. Unfortunately, this is not the case, as can be checked by setting y_1(m+1) = y_1(m) = y_1 and y_2(m+1) = y_2(m) = y_2 in (3). From this, one observes that the determination of the equilibrium points of (3) leads to the following equation:

$$y_1 = x_1 - a_{12}\bigl(x_2 - a_{21}\, y_1^{1/k}\bigr)^{k}. \qquad (4)$$

After straightforward calculation, including a binomial expansion, (4) becomes

$$(1 + a_{12} b_0)\, y_1 + a_{12} \sum_{i=1}^{k-1} b_i\, y_1^{1 - i/k} + (a_{12} b_k - x_1) = 0, \qquad (5)$$

where b_i = \binom{k}{i} x_2^{i} (-a_{21})^{k-i}. By considering the transformation u = y_1^{1/k} in (5), one can verify that solving this expression is equivalent to determining the roots of a polynomial of order k; as a consequence, the number of equilibrium points grows linearly as k increases. Thus, it becomes evident that the use of (3) is appropriate only for small values of k. For instance, when k = 2 there are just two equilibrium points: one corresponds to the sources themselves and the other corresponds to a mixture of these sources.

In the next step of our investigation, we shall verify the conditions to be satisfied so that the equilibrium point associated with the sources is stable. In view of the difficulty embedded in a global stability analysis, we study the local stability in the neighborhood of the equilibrium point s = [s_1 s_2]^T, based on the first-order approximation of the nonlinear system (3). This linearization can be expressed in vectorial notation as follows:

$$y(m+1) \approx c + J\, y(m), \qquad (6)$$

where y(m) = [y_1(m) y_2(m)]^T, c is a constant vector, and J is the Jacobian matrix of (3) evaluated at [s_1 s_2]^T, which is given by:

$$J = \begin{pmatrix} 0 & -a_{12}\, k\, s_2^{k-1} \\ -\frac{1}{k}\, a_{21}\, s_1^{(1/k)-1} & 0 \end{pmatrix}. \qquad (7)$$
It can be proved that a necessary and sufficient condition for local stability of a discrete system is that the absolute values of the eigenvalues of the Jacobian matrix, evaluated at the equilibrium point of interest, be smaller than one [7]. Applying this result to (7), the following condition of local stability is obtained:

$$\bigl| a_{12}\, a_{21}\, s_1^{(1/k)-1}\, s_2^{k-1} \bigr| < 1. \qquad (8)$$
This is a first constraint of our strategy, given that this condition must be satisfied for each sample [s1 s2 ]T . In order to illustrate this limitation, the stability boundaries in the (a12 , a21 ) plane for several cases are depicted in Figure 1.
Fig. 1. Stability boundaries in the (a_12, a_21) plane. (a) Influence of k: sources distributed between (0.1, 1.1) with k = 2 (solid) and k = 3 (dashed). (b) For k = 2: sources distributed between (0.1, 1.1) (solid) and between (0.2, 2.2) (dashed).
3.2 Learning Algorithm
We consider a learning rule founded on the cancellation of nonlinear correlations, given by E{f(y_i) g(y_j)}, between the retrieved sources [1]. The following nonlinear functions were chosen: f(·) = (·)^3 and g(·) = (·). Therefore, at each time n, the iteration of the separating method consists of: 1) the computation of y_i for each sample of the mixtures, according to the dynamics (3), and 2) the update of the parameters w_ij according to:

$$w_{12}(n+1) = w_{12}(n) + \mu\, E\{y_1^3\, \bar{y}_2\}, \qquad w_{21}(n+1) = w_{21}(n) + \mu\, E\{y_2^3\, \bar{y}_1\}, \qquad (9)$$
where μ corresponds to the learning rate, [y_1 y_2]^T denotes the equilibrium point of (3), and ȳ_i is a centered version of y_i; more specifically, we adopt the notation ȳ_i^r = y_i^r − E{y_i^r}. Given that the signals are not assumed zero-mean, the centering of one of the variables in (9) is necessary so that the rule converges when y_1 and y_2 are mutually independent; note that (9) converges when E{y_i^3 ȳ_j} = E{y_i^3 y_j} − E{y_i^3} E{y_j} = 0. One can check that (9) converges
when E{y_1^3 y_2} = E{y_1^3} E{y_2} and E{y_2^3 y_1} = E{y_2^3} E{y_1}. Obviously, these conditions are only necessary ones for statistical independence and, as a consequence, there may be particular sources for which such a strategy fails. On the other hand, this strategy provides a less complex algorithm than those that deal directly with a measure of statistical independence. In the last section, a stability condition concerning the separation structure was provided. Likewise, as will be seen in the sequel, it is possible to analyze the stability of the learning rule (9). This study will permit us to determine whether the separating equilibrium point, i.e., [w_12 w_21]^T = [a_12 a_21]^T, corresponds to a stable one and, as a consequence, whether it is attainable for the learning rule.
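Putting (3) and (9) together, a minimal numpy sketch of the whole separation method could read as follows (all names, the inner iteration count, and the absolute-value safeguard for transient negative values of y_1 are our own choices, not from the paper; expectations are replaced by sample means):

```python
import numpy as np

def separate(x1, x2, k, mu=0.05, n_updates=3500, n_net=50):
    """Recurrent separating network (3) driven by the nonlinear
    decorrelation rule (9); assumes condition (8) holds."""
    w12, w21 = 0.0, 0.0
    y1, y2 = np.zeros_like(x1), np.zeros_like(x2)
    for _ in range(n_updates):
        for _ in range(n_net):                      # iterate (3) to its equilibrium
            y1 = x1 - w12 * y2**k
            y2 = x2 - w21 * np.abs(y1)**(1.0 / k)   # abs(): safeguard, our addition
        w12 += mu * np.mean(y1**3 * (y2 - y2.mean()))   # eq. (9), centered y2
        w21 += mu * np.mean(y2**3 * (y1 - y1.mean()))   # eq. (9), centered y1
    return y1, y2, (w12, w21)
```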
3.3 Stability Analysis of the Learning Rule
According to ordinary differential equation theory, it is possible, by assuming that μ is sufficiently small, to rewrite (9) as:

$$\frac{dw_{12}}{dt} = E\{y_1^3\, \bar{y}_2\}, \qquad \frac{dw_{21}}{dt} = E\{y_2^3\, \bar{y}_1\}. \qquad (10)$$

A first point to be stressed is that the determination of all equilibrium points of (10) is a rather difficult task. Even when k = 1 in (2), which corresponds to the linear BSS problem, this calculation demands a great deal of effort [8]. Secondly, we are interested in the stability of the point [w_12 w_21]^T = [a_12 a_21]^T, but one must keep in mind that there are structural conditions to be assured so that it corresponds to an equilibrium point of (10). For example, when k = 2, we observed through simulations that this ideal adjustment of the separating system usually guarantees the separation of the sources when the local condition (8) is satisfied. Thus, in this situation and under the hypothesis of independent sources, it is assured that E{y_i^3 ȳ_j} = E{s_i^3 s̄_j} = 0.

As in Section 3.1, the local stability analysis is based on a first-order approximation of (10). However, since we are dealing with a continuous dynamics in this case, a given equilibrium point of the learning rule is locally stable when the real parts of all eigenvalues of the Jacobian matrix are negative [7]. After straightforward calculations, one obtains the Jacobian matrix evaluated at the equilibrium point [a_12 a_21]^T:

$$J = \begin{pmatrix} 3E\{y_1^2 \bar{y}_2 \frac{\partial y_1}{\partial a_{12}}\} + E\{\bar{y}_1^3 \frac{\partial y_2}{\partial a_{12}}\} & 3E\{y_1^2 \bar{y}_2 \frac{\partial y_1}{\partial a_{21}}\} + E\{\bar{y}_1^3 \frac{\partial y_2}{\partial a_{21}}\} \\ 3E\{y_2^2 \bar{y}_1 \frac{\partial y_2}{\partial a_{12}}\} + E\{\bar{y}_2^3 \frac{\partial y_1}{\partial a_{12}}\} & 3E\{y_2^2 \bar{y}_1 \frac{\partial y_2}{\partial a_{21}}\} + E\{\bar{y}_2^3 \frac{\partial y_1}{\partial a_{21}}\} \end{pmatrix}. \qquad (11)$$

Note that, assuming an ideal operation of the separating system, [y_1 y_2]^T can be replaced by [s_1 s_2]^T, which permits us to express the stability conditions of (9) in terms of some statistics of the sources. The entries of the Jacobian matrix can be calculated by applying the chain rule to (3). For instance, it is not difficult to verify that:

$$\frac{\partial y_1}{\partial a_{12}} = -\Bigl(y_2^k + a_{12}\, k\, y_2^{k-1} \frac{\partial y_2}{\partial a_{12}}\Bigr). \qquad (12)$$
Given that

$$\frac{\partial y_2}{\partial a_{12}} = -\frac{1}{k}\, a_{21}\, y_1^{(1/k)-1}\, \frac{\partial y_1}{\partial a_{12}}, \qquad (13)$$

and substituting this expression in (12), one obtains:

$$\frac{\partial y_1}{\partial a_{12}} = \frac{-y_2^k}{1 - a_{12} a_{21}\, y_1^{(1/k)-1}\, y_2^{k-1}}. \qquad (14)$$

By conducting similar calculations, one obtains the other derivatives:

$$\frac{\partial y_2}{\partial a_{12}} = \frac{a_{21}\, y_1^{(1/k)-1}\, y_2^{k}}{k\,\bigl(1 - a_{12} a_{21}\, y_1^{(1/k)-1}\, y_2^{k-1}\bigr)}, \qquad (15)$$

$$\frac{\partial y_1}{\partial a_{21}} = \frac{k\, a_{12}\, y_1^{1/k}\, y_2^{k-1}}{1 - a_{12} a_{21}\, y_1^{(1/k)-1}\, y_2^{k-1}}, \qquad (16)$$

$$\frac{\partial y_2}{\partial a_{21}} = \frac{-y_1^{1/k}}{1 - a_{12} a_{21}\, y_1^{(1/k)-1}\, y_2^{k-1}}. \qquad (17)$$
As would be expected, when k = 1 one obtains from the derived expressions the same conditions developed in [8] and [9] for the stability of the Hérault-Jutten algorithm for linear source separation.
4 Experimental Results
Aiming to assess the performance of the proposed solution, experiments were conducted for the cases k = 2 and k = 3. In both situations, the efficacy of the obtained solutions was quantified according to the following index:

$$SNR_i = 10 \log\Bigl(\frac{E\{s_i^2\}}{E\{(s_i - y_i)^2\}}\Bigr). \qquad (18)$$

From this, a global index can be defined as SNR = 0.5 (SNR_1 + SNR_2).

k = 2. In a first scenario, we consider the separation of two sources uniformly distributed between [0.1, 1.1]. The mixing parameters are given by a_12 = 0.5 and a_21 = 0.5; a set of 3000 samples of the mixtures was considered, and the number of iterations of the learning algorithm (9) was set to 3500 with μ = 0.05. The initial conditions of the dynamics (3) were chosen as [y_1(1) y_2(1)]^T = [0 0]^T. The results of this first case are reported in the first row of Table 1. In Figure 2, the joint distributions of the mixtures and of the retrieved signals are depicted for a typical case (SNR = 35 dB). Note that the outputs of the separating system are almost uniformly distributed, which indicates that the separation task was fulfilled. Also, we performed experiments by considering on each sensor an additive white Gaussian noise with a signal-to-noise ratio of 17 dB. The results for this second scenario are reported in the second row of Table 1.
Table 1. Average SNR results over 100 experiments and standard deviation (STD)
                     SNR_1   SNR_2   SNR     STD(SNR)
k = 2 (Scenario 1)   37.18   33.05   35.12   8.90
k = 2 (Scenario 2)   17.98   15.35   16.67   1.72
k = 2 (Scenario 3)   36.49   31.84   34.17   4.92
k = 3 (Scenario 1)   22.46   21.04   21.75   5.02
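The index (18) translates directly into code (a sketch with our own function name):

```python
import numpy as np

def snr_db(s, y):
    """Output SNR of one recovered source, eq. (18), in dB."""
    return 10.0 * np.log10(np.mean(s**2) / np.mean((s - y)**2))

# Global index of eq. (18): SNR = 0.5 * (snr_db(s1, y1) + snr_db(s2, y2)).
```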
Fig. 2. First scenario, k = 2: joint distributions of (a) the mixed signals and (b) the retrieved signals.
A third scenario was composed of a uniformly distributed source between [0.3, 1.3] and a sinusoidal source varying in the range [0.2, 1.2]. In this case, the mixing parameters are given by a_12 = 0.6 and a_21 = 0.6, and the constants related to the separating system were adjusted as in the first experiment. Again, the method was able to separate the original sources, as can be seen in the third row of Table 1.

k = 3. The problem becomes trickier when k = 3. Firstly, we observed through simulations that, even for a separating point [w_12 w_21]^T = [a_12 a_21]^T that satisfies the stability condition (8), the structure (3) does not guarantee source separation, since there can be another stable equilibrium solution that has no relation to the sources. In this particular case, we observed, after performing some simulations, that the adopted network may be attracted by a stable limit cycle and, also, that it is possible to overcome this problem by changing the initial conditions of (3) when a periodic equilibrium solution occurs. A second problem in this case is related to the convergence of the learning rule: some simulations suggested the existence of spurious minima. These two problems result in a performance degradation of the method when compared to the case k = 2, as can be seen in the last row of Table 1. In this case, we considered a scenario with two sources uniformly distributed between [0.3, 1.3] and mixing parameters given by a_12 = 0.5 and a_21 = 0.5. The initial conditions of (3) were set to [0.5 0.5]^T. Also, we considered 3000 samples of the mixtures and 10000 iterations of the learning algorithm with μ = 0.01.
5 Conclusions
The aim of this work was to design a source separation strategy for a class of nonlinear systems that is related to a chemical sensing application. Our approach was based on the ideas presented in [1,2], in such a way that it may be viewed as an extension of these works to the particular model considered herein. Concerning the proposed technique, we investigated the stability of the separation structure as well as the stability of the learning algorithm. This study permitted us to obtain necessary conditions for a proper operation of the separation method. Finally, the viability of our approach was attested by simulations. A first perspective of this work concerns its application to a real problem of chemical sensing. Also, there are several questions that deserve a detailed study, for example the design of algorithms that minimize a better measure of independence between the retrieved sources (e.g., mutual information), including an investigation of the separability of the considered model. Another envisaged extension is to provide a source separation method for the most general case of the Nikolsky-Eisenman model, given by (1): 1) by considering the logarithmic terms; and 2) by considering the cases k ∈ Q. Actually, preliminary simulations show that our proposal works for simple cases with k ∈ Q, such as k = 1/3 and k = 2/3. However, there are tricky points in the theoretical analysis conducted in this paper that do not carry over to this new situation.
References

1. Jutten, C., Hérault, J.: Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24, 1–10 (1991)
2. Hosseini, S., Deville, Y.: Blind separation of linear-quadratic mixtures of real sources using a recurrent structure. In: Mira, J.M., Álvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 241–248. Springer, Heidelberg (2003)
3. Comon, P.: Independent component analysis: a new concept? Signal Processing 36, 287–314 (1994)
4. Taleb, A., Jutten, C.: Source separation in post-nonlinear mixtures. IEEE Trans. Signal Processing 47, 2807–2820 (1999)
5. Bermejo, S., Jutten, C., Cabestany, J.: ISFET source separation: Foundations and techniques. Sensors and Actuators B Chemical 113, 222–233 (2006)
6. Bedoya, G., Jutten, C., Bermejo, S., Cabestany, J.: Improving semiconductor-based chemical sensor arrays using advanced algorithms for blind source separation. In: Proc. of the ISA/IEEE Sensors for Industry Conference (SIcon04), pp. 149–154 (2004)
7. Hilborn, R.C.: Chaos and Nonlinear Dynamics: An Introduction for Scientists and Engineers, 2nd edn. Oxford University Press, New York (2000)
8. Sorouchyari, E.: Blind separation of sources, part III: Stability analysis. Signal Processing 24, 21–29 (1991)
9. Fort, J.C.: Stabilité de l'algorithme de séparation de sources de Jutten et Hérault (in French). Traitement du Signal 8, 35–42 (1991)
Independent Subspace Analysis Is Unique, Given Irreducibility
Harold W. Gutch and Fabian J. Theis
Max Planck Institute for Dynamics and Self-Organization, 37073 Göttingen, Germany
[email protected], [email protected]
Abstract. Independent Subspace Analysis (ISA) is a generalization of ICA. It tries to find a basis in which a given random vector can be decomposed into groups of mutually independent random vectors. Since the first introduction of ISA, various algorithms to solve this problem have been introduced; however, a general proof of the uniqueness of ISA decompositions remained an open question. In this contribution we address this question and sketch a proof for the separability of ISA. The key condition for separability is to require the subspaces to be not further decomposable (irreducible). Based on a decomposition into irreducible components, we formulate a general model for ISA without restrictions on the group sizes. The validity of the uniqueness result is illustrated on a toy example. Moreover, an extension of ISA to subspace extraction is introduced and its indeterminacies are discussed.
With the increasing popularity of Independent Component Analysis, people started to get interested in extensions. Cardoso [2] was the first to formulate an extension denoted here as Independent Subspace Analysis. The general idea is that for a given observation X we try to find an invertible matrix W such that WX = (S_1^T, ..., S_k^T)^T with mutually independent random vectors S_i. If all S_i are one-dimensional, this is ICA, and we have the well-known separability results of ICA [3]. However, without dimensionality restrictions, if mutual independence of the vectors S_i is the only restriction imposed on W, ISA cannot produce meaningful results: if W simply is the identity and k = 1, then S_1 = X, which is independent of the (non-existing) rest. So, further restrictions are required for a meaningful model. A common approach is to fix the group size in advance; see [5] for a short review of ISA models. Here, we propose a more general concept based on [5], namely irreducibility of the recovered sources S_i, that is, the requirement that any S_i cannot be further decomposed. Our main contribution is a sound proof for the separability of this model together with a confirming simulation, thereby giving the details for the proposed ISA model from [5]. The manuscript is organized as follows. In the next section, we motivate the existence of such a separability result by studying a toy example. Then we give the sketch of the proof, and finally extend it to blind subspace extraction.
1 Motivation
Usually ISA is seen as a byproduct of ICA algorithms, which are assumed to decompose signals into components 'as independent as possible'; the components are then simply sorted to give a decomposition into higher-dimensional subspaces. However, this approach is not as straightforward as it might seem: strictly speaking, if ICA is performed on a data set that cannot be completely decomposed into one-dimensional independent components, we are applying ICA to a data set that does not follow the ICA model, and we have no theoretical results predicting the behavior of ICA algorithms. Here we present some simulations which give a hint that indeed ISA might not be so unproblematic. We generated a toy data set consisting of two independent sources, each of which was not further decomposable. The first data set consisted of a wireframe model of a 3-dimensional cube; the second data set was created from a solid 2-dimensional circle (see Figure 1). We uniformly picked N = 10,000 samples and mixed them in batch runs by applying M = 200,000 uniformly sampled orthogonal matrices. The 200,000 matrices were sampled by choosing random matrices B with entries normally sampled with mean 0 and variance 1, which were then symmetrically orthogonalized by A = (BB^T)^{-0.5} B. A mixture with an orthogonal matrix deviating from the block structure should also deviate from independence within the blocks. As an ad-hoc measure for dependence within the blocks, we used the fourth-order cumulant tensor:

$$\delta_D(X) = \sum_{i=1}^{3} \sum_{j=4}^{5} \sum_{k=1}^{5} \sum_{l=1}^{5} \mathrm{cum}^2(X_i, X_j, X_k, X_l).$$
This is motivated by the well-known fact, often used in ICA, that the cross-cumulant tensor is zero, i.e., cum(Y_1, Y_2, Y_*, Y_*) = 0, if Y_1 and Y_2 are independent. We measured the deviation of our mixing matrices from block structure by simply taking the Frobenius norm of the off-block-diagonal blocks:

$$\mathrm{off}(A) := \sum_{i=1}^{3} \sum_{j=4}^{5} \bigl(a_{ij}^2 + a_{ji}^2\bigr).$$
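For concreteness, both measures can be estimated from samples as follows (a numpy sketch with our own helper names; the fourth-order cumulant of zero-mean data is computed by the standard moment formula):

```python
import numpy as np

def off_block(A):
    """off(A): sum of squares of the two off-block-diagonal blocks of a
    5x5 matrix partitioned into a 3-d and a 2-d block."""
    return np.sum(A[:3, 3:]**2) + np.sum(A[3:, :3]**2)

def delta_D(X):
    """delta_D(X): sum of squared fourth-order cross-cumulants between
    the two blocks; X is a 5 x N array of (assumed zero-mean) samples."""
    N = X.shape[1]
    C2 = X @ X.T / N                                     # second-order moments
    M4 = np.einsum('in,jn,kn,ln->ijkl', X, X, X, X) / N  # fourth-order moments
    cum = (M4
           - np.einsum('ij,kl->ijkl', C2, C2)            # cumulant = moment minus
           - np.einsum('ik,jl->ijkl', C2, C2)            # products of second-order
           - np.einsum('il,jk->ijkl', C2, C2))           # moments (zero-mean case)
    return np.sum(cum[:3, 3:, :, :]**2)
```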
If ISA actually guarantees a unique block structure in fourth order, we should get a dependence of 0 only if the mixing matrix itself is block-diagonal, that is, if off(A) = 0. However, due to sampling errors, this is of course never reached, so we estimate the minima of δ_D. Figure 2 shows the relation of off(A) and δ_D(AS), and here we observe not only the expected minimum at off(A) = 0, but two additional minima at off(A) = 2 and off(A) = 4. In order to take a closer look at these three points, we chose three matrices A_0, A_2 and A_4, corresponding to the three local minima of the plot in Fig. 2. Starting with these matrices, we performed in their neighborhood a search for matrices with a lower model deviation. Again we sampled random orthogonal matrices, but this time biased them to be close to the identity matrix, as we wanted to search locally.
Fig. 1. Toy data set: (a) 3-dimensional sources S_1; (b) 2-dimensional sources S_2.
We therefore again orthogonalized matrices as above; however, we chose the matrices B not arbitrarily, but took matrices whose entries were normally sampled with mean 0 and variance v, which we then added to the identity matrix, modifying A_0, A_2 and A_4 in every step only if the step yielded better block-independence. We evaluated this for v = 0.1, v = 0.01 and v = 0.001, each time running for 20,000 steps. The result is plotted in Fig. 3, and we indeed observe considerably better block-independence, on the order of 1.5 orders of magnitude, in the neighborhood of A_0 than in the neighborhoods of A_2 and A_4. While the three local minima found by random sampling show only small differences (δ_D(A_0) = 0.0135, δ_D(A_2) = 0.0107, δ_D(A_4) = 0.0285), local searches reveal better minima for all three areas (δ_D(A_0) = 0.0002, δ_D(A_2) = 0.0055, δ_D(A_4) = 0.0053), especially the area around off(A) = 0. As a side note, the final matrices A_2 and A_4 correspond to the product of a block-diagonal matrix and a permutation matrix where one, respectively two, indices in each of the two off-diagonal blocks are non-zero. This shows us that while we observe local minima of our block-dependency measure on our data set, a closer inspection reveals that these minima are of different quality and we actually have only a single global minimum. We conclude that separability of ISA indeed should hold.
2 Uniqueness of ISA
In this section we present the proof of uniqueness of ISA. After explaining the notion of irreducibility of a random vector, we show why this idea is essential for the separability of ISA.

2.1 The ICA Model
Fig. 2. Relation between the block cross-error off(A) and block-independence δ_D. Note the two additional minima at off(A) = 2 and off(A) = 4.

Let us quickly repeat a few facts about ICA. The linear, noiseless ICA model can be described by the equation X = AS, where S = (S_1, ..., S_n)^T denotes a random vector with mutually independent components S_i (the sources) and A an invertible mixing matrix. The task of ICA is the recovery of S, given only the observations X. This is obviously only possible up to the indeterminacies of scaling and permutation, and it is well-known that recovery is possible up to exactly these indeterminacies if S is square-integrable and contains at most one Gaussian component [3,4].
2.2 The ISA Model
Loosening the requirement of mutual independence of the sources naturally brings up the idea of describing ISA through the same equation X = AS, where now S = (S_1^T, S_2^T, ..., S_n^T)^T with mutually independent random vectors S_i, although this time dependencies within the multidimensional S_i are allowed. Obvious indeterminacies of such a model are invertible linear transforms within the subspaces S_i (which can be seen as a generalization of scaling to higher dimensions) and permutations of subspaces of the same size (which, again, is the higher-dimensional generalization of the regular permutation seen in ICA). However, this model is not complete, since for any observation X a decomposition into mutually independent subspaces where dependencies within the subspaces are allowed is given simply by X itself. Realizing this naturally brings up the requirement of S to be 'as independent as possible'. This is formally described by the following definition.

Definition 1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S_1^T, ..., S_k^T)^T with mutually independent, irreducible random vectors S_i. Then (S_1^T, ..., S_k^T) is called an irreducible decomposition of X.

Irreducibility is a key property in the uniqueness of ISA and indeed, if we additionally assume irreducibility, we can show that this essentially allows for separability of ISA up to the above-mentioned indeterminacies of higher-dimensional scaling and permutation of subspaces of the same size.
Fig. 3. Search for local minima around off A = 0 (lower graph) and off A = 2 respectively off A = 4 (upper two graphs)
2.3 Uniqueness of ISA
We will now prove uniqueness of Independent Subspace Analysis under the additional assumption of no independent Gaussian components. Indeed, any orthogonal transformation of two decorrelated (and hence independent) Gaussians is again independent, so for such random vectors such a strong identification result would clearly not be possible.

Theorem 1. Given a random vector X with existing covariance and no Gaussian independent component, an ISA of X exists and is unique except for scaling and permutation.

Existence holds trivially, but uniqueness is not obvious. Defining the equivalence relation ∼ on random vectors as X ∼ Y :⇔ X = AY for some A ∈ Gl(n), we are easily able to show uniqueness given the following lemma:

Lemma 1. Let S = (S_1^T, ..., S_N^T)^T be a square-integrable decomposition of S into irreducible, mutually independent components S_i, where no S_i is a one-dimensional Gaussian. If (X_1^T, X_2^T)^T is an independent decomposition of S, then there is some permutation π of {1, ..., N} such that X_1 ∼ (S_{π(1)}^T, ..., S_{π(l)}^T)^T and X_2 ∼ (S_{π(l+1)}^T, ..., S_{π(N)}^T)^T for some l.

So, given an irreducible decomposition of a random variable S with no independent Gaussian components, any decomposition of it into independent (not necessarily irreducible) components 'splits along the irreducible components'. Using this lemma, Theorem 1 is easy to show: given two irreducible decompositions (X_1^T, ..., X_N^T)^T and (S_1^T, ..., S_M^T)^T, we search for the smallest irreducible component appearing, which we may assume to be X_1. We then group
(X_2^T, ..., X_N^T)^T into a (larger) random vector. As this independent decomposition splits along the irreducible components S_i, and dim(X_1) ≤ dim(S_j) for all j, X_1 is identical to one of the S_j. We may remove both of these and go on iteratively, thus proving the theorem. The more complicated part is the proof of Lemma 1, and due to space restrictions we can only sketch it. Before starting, we note that due to the assumption of existing covariance, we may whiten both X and S, in which case it is easy to observe that A is orthogonal. For notational reasons, we will split up the mixing matrix A into submatrices, the sizes of which accord with the sizes of the S_i and X_j:

$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} A_{11} & \cdots & A_{1N} \\ A_{21} & \cdots & A_{2N} \end{pmatrix} \begin{pmatrix} S_1 \\ \vdots \\ S_N \end{pmatrix}, \qquad (1)$$

so X_i = Σ_{k=1}^{N} A_{ik} S_k. We now claim that in every pair {A_{1j}, A_{2j}}, one of the two matrices is zero. We fix k = k_0 and show this claim for k_0. Let us assume the converse, that is, that both rank(A_{1k_0}) ≠ 0 and rank(A_{2k_0}) ≠ 0. As A has full rank, rank(A_{1k_0}) + rank(A_{2k_0}) ≥ dim(S_{k_0}) =: D. This leaves us with two cases to handle: rank(A_{1k_0}) + rank(A_{2k_0}) = D and rank(A_{1k_0}) + rank(A_{2k_0}) > D. Let us first address the first case and show that it contradicts the irreducibility of S_{k_0}.
Lemma 2. Assume

$$S = (A_1 \,|\, A_2) \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

with independent random vectors X_1 and X_2, and A_1, A_2 such that rank(A_1^T) + rank(A_2^T) = dim(S) and rank(A_1|A_2) = dim(S). Then S is reducible.
Proof. Let D := dim(S) and d := dim ker(A_1^T). Then dim ker(A_2^T) = D − d, and we can find a linearly independent set {v_1, ..., v_d} such that v_i^T A_1 = 0 for any 1 ≤ i ≤ d, and similarly a linearly independent set {v_{d+1}, ..., v_D} such that v_j^T A_2 = 0 for any d + 1 ≤ j ≤ D. These two sets are guaranteed to be disjoint, as rank(A_1|A_2) = dim(S). Using these vectors, we define

$$T := \begin{pmatrix} v_1^T \\ \vdots \\ v_D^T \end{pmatrix}.$$
Then

$$TS = (TA_1 \,|\, TA_2) \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} T_1 & 0 \\ 0 & T_2 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} T_1 X_1 \\ T_2 X_2 \end{pmatrix}$$

with some full-rank matrices T_1 and T_2. It follows that S is reducible, as X_1 and X_2 are independent and T is invertible.
The other case, rank(A_{1k_0}) + rank(A_{2k_0}) > D, is harder to prove and follows some of the ideas presented in [4].

Lemma 3. Given (1), if there is some 1 ≤ k_0 ≤ N such that rank(A_{1k_0}) + rank(A_{2k_0}) > dim(S_{k_0}), then S_{k_0} contains an irreducible Gaussian component.

This concludes the proof of Theorem 1.

2.4 Dealing with Gaussians
The section above explicitly excluded independent Gaussian components in order to avoid additional indeterminacies. Recently, a general decomposition model dealing with Gaussians was proposed in the form of so-called non-Gaussian component analysis (NGCA) [1]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (S_N, S_G) and S_N a square-integrable m-dimensional random vector is called an m-decomposition of X if S_N and S_G are stochastically independent and S_G is Gaussian. In this case, X is said to be m-decomposable, and X is said to be minimally n-decomposable if X is not (n−1)-decomposable. According to our previous notation, S_N and S_G are independent components of X. It has been shown that the subspaces of such decompositions are unique [6]:

Theorem 2. The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces.

Moreover, explicit algorithms can be constructed for identifying the subspaces [6]. This result enables us to generalize Theorem 1 and to get a general decomposition theorem, which characterizes solutions of ISA.

Theorem 3. Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part.

Proof. Existence is obvious. Uniqueness follows after first applying Theorem 2 to X and then Theorem 1 to the non-Gaussian part.
3 Independent Subspace Extraction
Having shown uniqueness of the decomposition, we are able to introduce Independent (Irreducible) Subspace Extraction, which separates independent (irreducible) subspaces out of the random vector.

Definition 2. A pseudo-invertible (n × m) matrix W is said to be an Independent Subspace Extraction of an m-dimensional random vector X if WX is an independent component of X. If WX is even irreducible, then W is called an Irreducible Subspace Extraction of X.
This could lead to a wider variety of algorithms, such as deflationary approaches, which are already common in standard ICA. The interesting aspect here is that we only strive to extract a single component, so Independent (Irreducible) Subspace Extraction could prove to be simpler to handle algorithmically than a complete Independent Subspace Analysis, and thus play an important role in applications (such as dimension reduction) that need to extract only a single component or subspace.
4 Conclusion
Although Independent Subspace Analysis has become common practice in the last few years, its separability had not been fully shown. We presented examples showing that ISA is not as unproblematic as it seems. Additionally, we proved uniqueness of ISA, up to higher-dimensional generalizations of the indeterminacies of ICA, given no independent Gaussians, and showed how to combine this with existing theoretical results on NGCA into a full ISA uniqueness result. Using these results, it is now possible to speak of the ISA of any given random vector. Moreover, Theorem 3 now gives a complete characterization of decompositions of distributions into independent factors, which might prove to be a useful result in general statistics. Now that uniqueness of ISA has been shown in the theoretical limit of perfect knowledge of the recordings, the next obvious step is the conversion to the real-world case, where only a finite number of samples of the observations is known. Here, a decomposition of the mixtures X such that X = AS, where S = (S_1^T, ..., S_N^T)^T with irreducible (or merely independent) S_i, cannot be expected, as in this case we expect to always see some dependency due to sampling errors. Due to uniqueness of ISA in the asymptotic case, identification of the underlying sources should hold here too, given enough samples, but additional work is required to show this in the future.
References

1. Blanchard, G., Kawanabe, M., Sugiyama, M., Spokoiny, V., Müller, K.R.: In search of non-Gaussian components of a high-dimensional distribution. Journal of Machine Learning Research 7, 247–282 (2006)
2. Cardoso, J.F.: Multidimensional independent component analysis. In: Proc. of ICASSP '98, Seattle (1998)
3. Comon, P.: Independent component analysis - a new concept? Signal Processing 36, 287–314 (1994)
4. Theis, F.J.: A new concept for separability problems in blind source separation. Neural Computation 16, 1827–1850 (2004)
5. Theis, F.J.: Towards a general independent subspace analysis. In: Proc. NIPS 2006 (2007)
6. Theis, F.J., Kawanabe, M.: Uniqueness of non-Gaussian subspace analysis. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 917–925. Springer, Heidelberg (2006)
Optimization on the Orthogonal Group for Independent Component Analysis
Michel Journée 1, Pierre-Antoine Absil 2, and Rodolphe Sepulchre 1
1 Dept. of Electrical Engineering and Computer Science, University of Liège, Belgium
2 Dept. of Mathematical Engineering, Université catholique de Louvain, Belgium
Abstract. This paper derives a new algorithm that performs independent component analysis (ICA) by optimizing the contrast function of the RADICAL algorithm. The core idea of the proposed optimization method is to combine the global search of a good initial condition with a gradient-descent algorithm. This new ICA algorithm performs faster than the RADICAL algorithm (based on Jacobi rotations) while still preserving, and even enhancing, the strong robustness properties that result from its contrast. Keywords: Independent Component Analysis, RADICAL algorithm, optimization on matrix manifolds, line-search on the orthogonal group.
1 Introduction
Independent Component Analysis (ICA) was originally developed for the blind source separation problem. It aims at recovering independent source signals from linear mixtures of these. As in the seminal paper of Comon [1], a linear instantaneous mixture model will be considered in this paper,

$$X = AS, \qquad (1)$$

where X, A and S are matrices in R^{n×N}, R^{n×p} and R^{p×N} respectively, with p less than or equal to n. The rows of S are assumed to be samples of independent random variables. Thus, ICA provides a linear representation of the data X in terms of components S that are statistically independent. ICA algorithms are based on the inverse of the mixing model (1), Z = W^T X, where Z and W are matrices in R^{p×N} and R^{n×p}, respectively. The aim of ICA algorithms is to optimize over W the statistical independence of the p random variables whose samples are given in the p rows of Z. The statistical independence is measured by a cost function γ : R^{n×p} → R : W ↦ γ(W), termed the contrast function.
In the remainder of this paper, we assume that the data matrix X has been preprocessed by means of prewhitening and that its dimensions have been reduced by retaining the dominant p-dimensional subspace. Consequently, the contrast function γ is defined on a set of square matrices, i.e., γ : R^{p×p} → R : W ↦ γ(W). Several contrast functions for ICA can be found in the literature. In this paper, we consider the RADICAL contrast function proposed in [2]. Advantages of this contrast are a strong robustness to outliers as well as to the lack of samples. A good contrast γ is not enough to make an efficient ICA algorithm; the other ingredient is a suitable numerical method to compute an optimizer of γ. This is the topic of the present paper. The authors of [2] optimize their contrast by means of Jacobi rotations combined with an exhaustive search. This yields the complete Robust Accurate Direct ICA aLgorithm (RADICAL). We propose a new steepest-descent-based optimization method that reduces the computational load of RADICAL. The paper is organized as follows. The contrast function of RADICAL is detailed in Section 2. Section 3 describes a gradient-descent optimization algorithm. In Section 4, this local optimization is integrated within a global optimization framework. The performance of this new ICA algorithm is briefly illustrated in Section 5.
2 A Robust Contrast Function

Like many other measures of statistical independence, the contrast of RADICAL [2] is derived from the mutual information [3]. The mutual information I(Z) of a multivariate random variable Z = (z_1, ..., z_p) is defined as the Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions,

$$I(Z) = \int p(z_1, \dots, z_p)\, \log \frac{p(z_1, \dots, z_p)}{p(z_1) \cdots p(z_p)}\; dz_1 \dots dz_p. \qquad (2)$$

This quantity presents all the required properties for a contrast function: it is nonnegative and equals zero if and only if the variables Z are statistically independent. Hence, its global minimum corresponds to the solution of the ICA problem. The challenge is to get a good estimator of I(Z). A possible approach is to express the mutual information in terms of the differential entropy of a univariate random variable z,

$$S(z) = -\int p(z) \log(p(z))\, dz, \qquad (3)$$

for which efficient statistical estimators are available.
According to definitions (2) and (3), the following holds:

$$I(Z) = \sum_{i=1}^{p} S(z_i) - S(z_1, \dots, z_p). \qquad (4)$$
The introduction of the demixing model Z = W^T X within (4) results in

$$\gamma(W) = \sum_{i=1}^{p} S^{(i)}(W) - \log(|W|) - S(x_1, \dots, x_p), \qquad (5)$$
where S^{(i)}(W) = S(e_i^T W^T X) and e_i is the i-th basis vector. The last term of (5) is constant and its evaluation can be skipped by the ICA algorithm. An estimator for the differential entropy of univariate variables was derived in [2] by considering order statistics. Given a univariate random variable z defined by its samples, the order statistics of z is the set of samples {z^{(1)}, ..., z^{(N)}} rearranged in non-decreasing order, i.e., z^{(1)} ≤ ... ≤ z^{(N)}. The differential entropy of z can be estimated by the simple formula

$$\hat{S}(z) = \frac{1}{N-m} \sum_{j=1}^{N-m} \log\Bigl(\frac{N+1}{m}\bigl(z^{(j+m)} - z^{(j)}\bigr)\Bigr), \qquad (6)$$

where m is typically set to √N. Function (5), with the differential entropies estimated by (6), is the contrast of the RADICAL algorithm [2]. This contrast presents several assets in terms of robustness. Its robustness to outliers was underlined in the original paper [2]. Robustness to outliers means that the presence of some corrupted entries in the observations data set X has little influence on the position of the global minimizer of that contrast. This is a key feature in many applications, especially for the analysis of gene expression data [4], where each entry in the observation matrix results from individual experiments that are likely to sometimes fail. The RADICAL contrast also brings advances in terms of robustness to the lack of samples. This will be illustrated in Section 5.
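For reference, the estimator (6) and the resulting contrast (keeping only the first term of (5), since the log-determinant vanishes for orthogonal W and the last term is constant) can be sketched in a few lines of numpy (our function names; a small floor on the spacings is added as a guard against repeated samples):

```python
import numpy as np

def entropy_spacing(z, m=None):
    """Order-statistics (spacing) estimator of differential entropy, eq. (6)."""
    N = z.size
    m = int(np.sqrt(N)) if m is None else m
    zs = np.sort(z)
    spacings = np.maximum(zs[m:] - zs[:-m], 1e-12)   # z^(j+m) - z^(j)
    return np.mean(np.log((N + 1.0) / m * spacings))

def radical_contrast(W, X):
    """Sum of marginal entropies of Z = W^T X; equals the contrast (5)
    up to an additive constant when W is orthogonal."""
    Z = W.T @ X
    return sum(entropy_spacing(Z[i]) for i in range(Z.shape[0]))
```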
3 A Line-Search Optimization Algorithm
In accordance with the fact that independence between random variables is not altered by scaling, the contrast function (5) presents the scale invariance property γ(W) = γ(WΛ) for all invertible diagonal matrices Λ. Optimizing a function with such an invariance property is a degenerate problem, which entails difficulties of both theoretical (convergence analysis) and practical nature unless some constraints are introduced. In the case of prewhitening-based ICA, it is common practice to restrict the matrix W to be orthonormal [1], i.e., W^T W = I. Classical constrained optimization methods could be used. We favor the alternative of incorporating the
constraints directly into the search space and performing unconstrained optimization over the orthogonal group, i.e.,

$$\min_{W \in O_p} \gamma(W), \quad \text{with } O_p = \{W \in \mathbb{R}^{p\times p} \mid W^T W = I\}. \qquad (7)$$
Most classical unconstrained optimization methods, such as gradient-descent, Newton, trust-region and conjugate gradient methods, have been generalized to optimization over matrix manifolds (see [5] and references therein). The remainder of this section deals with the derivation of a line-search optimization method on the orthogonal group for the RADICAL contrast function (5). Line-search on a nonlinear manifold is based on the update formula

$$W_+ = R_W(t\eta), \qquad (8)$$

which consists in moving from the current iterate W ∈ O_p in the search direction η with a certain step size t to identify the next iterate W_+ ∈ O_p. Here t is a scalar and η belongs to T_W O_p = {WΩ | Ω ∈ R^{p×p}, Ω^T = −Ω}, the tangent space to O_p at W. The retraction R_W is a mapping from the tangent space to the manifold; more details about this notion can be found in [5]. Our algorithm selects the Armijo point t_A as step size and the opposite of the gradient of the cost function γ at the current iterate as search direction. The Armijo step size is defined by t_A = β^m α, with scalars α > 0, β ∈ (0, 1) and m the first nonnegative integer such that

$$\gamma(W) - \gamma\bigl(R_W(\beta^m \alpha\, \eta)\bigr) \ge -\sigma \bigl\langle \mathrm{grad}\,\gamma(W),\, \beta^m \alpha\, \eta \bigr\rangle_W,$$

where W is the current iterate on O_p and σ ∈ (0, 1). This step size ensures a sufficient decrease of the cost function at each iteration. The resulting line-search algorithm converges to the set of points where the gradient of γ vanishes [5].

An analytical expression of the gradient of the RADICAL contrast (5) was derived in [6]. Let us just sketch the main points of this computation. First, because of the orthonormality condition, the second term of (5) vanishes. Furthermore, since the last term is constant, we have

$$\mathrm{grad}\,\gamma(W) = \sum_{i=1}^{p} \mathrm{grad}\, S^{(i)}(W).$$

The gradient of S^{(i)} is given by grad S^{(i)}(W) = P_{T_W}(grad S̃^{(i)}(W)), where S̃^{(i)} is the extension of S^{(i)} over R^{p×p}, i.e., S̃^{(i)}|_{O_p} = S^{(i)}, and P_{T_W}(Z) is the projection operator, which in the case of the orthogonal group is given by P_{T_W}(Z) = ½ W (W^T Z − Z^T W). The evaluation of the gradient in the embedding manifold is performed by means of the identity

$$D\tilde{S}^{(i)}(W)[Z] = \bigl\langle \mathrm{grad}\, \tilde{S}^{(i)}(W), Z \bigr\rangle,$$
with the metric ⟨Z_1, Z_2⟩ = tr(Z_1^T Z_2), and where

$$D\tilde{S}^{(i)}(W)[Z] = \lim_{t\to 0} \frac{\tilde{S}^{(i)}(W + tZ) - \tilde{S}^{(i)}(W)}{t}$$

is the standard directional derivative of S̃^{(i)} at W in the direction Z. Since one wants to compute the gradient on the orthogonal group, the direction Z can be restricted to the tangent plane at the current iterate, i.e., Z ∈ T_W O_p. As we have shown in [6], the gradient of the differential entropy estimator on the orthogonal group O_p is finally given by

$$\mathrm{grad}\, S^{(i)}(W) = P_{T_W}\Biggl(\frac{1}{N-m} \sum_{j=1}^{N-m} \frac{\bigl(x^{(k_{j+m})} - x^{(k_j)}\bigr)\, e_i^T}{e_i^T W^T \bigl(x^{(k_{j+m})} - x^{(k_j)}\bigr)}\Biggr),$$

where x^{(k)} denotes the k-th column of the data matrix X. The indices k_{j+m} and k_j point to the samples of the estimated source z_i that are at positions j + m and j, respectively, in the order statistics of z_i. The computational cost of the gradient is of the same order as that of the contrast, namely O(pN log N). More details about the Armijo point, the computation of gradients and, more generally, about line-search algorithms on manifolds can be found in [5].
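This formula translates almost verbatim into numpy (a sketch with our own names; proj_tangent implements P_{T_W}, and the sum over i gives the full Riemannian gradient):

```python
import numpy as np

def proj_tangent(W, Z):
    """Projection P_{T_W}(Z) = (1/2) W (W^T Z - Z^T W) onto the tangent
    space of the orthogonal group at W."""
    return 0.5 * W @ (W.T @ Z - Z.T @ W)

def grad_entropy_i(W, X, i, m):
    """Riemannian gradient of S^(i) at W; X is n x N."""
    z = W[:, i] @ X                          # z_i = e_i^T W^T X
    order = np.argsort(z)                    # indices k_j of the order statistics
    dX = X[:, order[m:]] - X[:, order[:-m]]  # x^(k_{j+m}) - x^(k_j), n x (N-m)
    dz = z[order[m:]] - z[order[:-m]]        # the corresponding spacings of z_i
    G = np.zeros_like(W)
    G[:, i] = (dX / dz).mean(axis=1)         # the displayed sum lands in column i
    return proj_tangent(W, G)

def grad_contrast(W, X, m):
    return sum(grad_entropy_i(W, X, i, m) for i in range(W.shape[1]))
```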
4 Towards a Global Optimization Scheme
The algorithm described in the previous section inherits all the local convergence properties of line-search optimization methods [5]. Nevertheless, the contrast of RADICAL presents many spurious local minima that do not properly separate the observations X into independent sources; the line-search algorithm may thus fail in the context of ICA. It does, however, lead to an efficient ICA algorithm when it is initialized within the basin of attraction of the global minimizer W_∗. It is therefore essential to find good initial candidates for the line-search algorithm. The procedure proposed in this paper rests on empirical observations about the shape of the contrast function γ(W). Figure 1 represents the evolution of this function, as well as of the norm of its gradient, along geodesic curves on the orthogonal group O_p for a particular benchmark setup (p = 6, N = 1000). Figure 1 and extensive simulations not included in the present paper lead us to view the contrast function of RADICAL as possessing a very deep global minimum surrounded by many small local minima. Furthermore, the norm of the gradient tends to be much larger within the basin of attraction of the global minimizer. The norm of the gradient thus provides a criterion to discriminate between points that are inside this basin of attraction and those that are outside. Our algorithm precedes the gradient optimization with a global search for a point where the gradient has a large magnitude. The search is performed along particular geodesics of the orthogonal group, exploiting the low numerical cost of Jacobi rotations.
Fig. 1. Evolution of the contrast γ and of the norm of its gradient along geodesics of O_p (regular geodesics vs. a geodesic through the global minimum W_∗; horizontal axes: curvilinear abscissa).
All geodesics on the orthogonal group O_p have the form Γ(t) = W e^{tB}, where W ∈ O_p and B is a skew-symmetric matrix of the same size as W. Jacobi rotations correspond to B having only zero elements except one element in the upper triangle and its symmetric counterpart, i.e., B(i,j) = 1 and B(j,i) = −1 with i < j. The contrast function γ evaluated along such geodesics has a periodicity of π/2, i.e.,

$$\gamma\bigl(W e^{tB}\bigr) = \gamma\bigl(W e^{(t + \pi/2)B}\bigr).$$

Such a geodesic is in fact a Jacobi rotation on the two-dimensional subspace spanned by the directions i and j. This periodicity is an interesting feature for an exhaustive search over the curvilinear abscissa t, since it allows one to define upper and lower bounds for t. Our algorithm evaluates the gradient at a fixed number of points that are uniformly distributed on randomly selected geodesics of periodicity π/2. This process is pursued until a point with sufficient steepness is found. The steepness is simply evaluated by the Frobenius norm of the gradient of γ. Such a point is expected to belong to the basin of attraction of the global minimum and serves as initialization for the line-search algorithm of the previous section.
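A schematic of the resulting two-stage initialization (our sketch: grad_norm stands for the Frobenius norm of the gradient of γ, e.g. built from the gradient expressions of Section 3, and the steepness threshold tau and sample counts are assumed tuning parameters):

```python
import numpy as np

def jacobi_geodesic(W, i, j, t):
    """Point W e^{tB} on a Jacobi geodesic: B(i,j) = 1, B(j,i) = -1 and
    zeros elsewhere, so e^{tB} is a plane rotation of angle t."""
    G = np.eye(W.shape[0])
    c, s = np.cos(t), np.sin(t)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = s, -s
    return W @ G

def find_steep_start(W0, grad_norm, tau, n_points=16, seed=0):
    """Evaluate the gradient at points uniformly spread on randomly chosen
    Jacobi geodesics (period pi/2) until its norm exceeds tau; the point
    found initializes the gradient descent of Section 3."""
    rng = np.random.default_rng(seed)
    p, W = W0.shape[0], W0
    while grad_norm(W) < tau:
        i, j = sorted(rng.choice(p, size=2, replace=False))
        ts = np.linspace(0.0, np.pi / 2, n_points, endpoint=False)
        W = max((jacobi_geodesic(W, i, j, t) for t in ts), key=grad_norm)
    return W
```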
5 Some Benchmark Simulations
This section evaluates the performance of the new algorithm against the performance of the RADICAL algorithm. All results are obtained on benchmark setups that artificially generate observations X by linear transformation of known statistically independent sources S. Figure 2 illustrates that the new algorithm reaches the global minimum of the contrast with less than half the computational effort required by the RADICAL algorithm. These results are based on a benchmark with N = 1000 samples while the dimension p of the problem varies from 2 to 8. For each p, five different data matrices X are obtained by randomly mixing p sources chosen as sinusoids of random frequencies and random phases. The indicated computational time is an average over these five ICA runs. Figure 3 highlights the robustness properties of the contrast discussed in Section 2. The left graph results from a benchmark with p = 6 sources and
Optimization on the Orthogonal Group for ICA
63
50 40
Time [sec]
60
New algorithm RADICAL
30 20 10
Dimension (p) 0 2
3
4
5
6
7
8
Fig. 2. Reduced computational time of the new ICA algorithm
N = 1000 samples. A given percent of the entries of the data set have been artificially corrupted to simulate outliers. The right graph considers a benchmark with p = 6 sources, no outliers and a varying number of samples. The quality of the ICA separation is measured by an index α1 , which stands for a good performance once it is close to zero. The left graph indicates that both the new algorithm and the RADICAL algorithm are robust to these outliers while classical ICA algorithms such as JADE [7] or FastICA [8] collapse immediately. It should be noted that the new algorithm supports up to 3% of outliers on the present benchmark and is thus more robust than RADICAL. Similarly, the right graph of Figure 3 suggests that the new algorithm is more robust to the lack of samples than RADICAL.
α
α
New algorithm RADICAL JADE FastICA
0.6
0.6
0.4
0.4
0.2
0.2
0 0
1
2
New algorithm RADICAL JADE FastICA
0
3 Percent outliers [%]
50
100
Number of samples (N)
500
1000
Fig. 3. Robustness properties of the new ICA algorithm
6
Conclusions
The RADICAL algorithm [2] presents very desirable robustness properties: robustness to outliers and robustness to the lack of samples. These are essential 1
Given the demixing matrix W ∗ and the matrix W identified by the ICA algorithm, α(W, W ∗ ) = min Λ,P
W ΛP − W ∗ F , W ∗ F
where Λ is a non-singular diagonal matrix and P a permutation matrix.
64
M. Journ´ee, P.-A. Absil, and R. Sepulchre
for some applications, in particular for the analysis of biological data that are usually of poor quality because of the few number of samples available and the presence of corrupted entries resulting from failed experiments [4]. The RADICAL algorithm inherits these robustness properties from its contrast function. In this paper, we have shown that the computation of the demixing matrix by optimization of the RADICAL contrast function can be performed in a more efficient manner than with the Jacobi rotation approach considered in [2]. Our new optimization process works in two stages. It first identifies a point that supposedly belongs to the basin of attraction of the global minimum and performs afterwards the local optimization of the contrast by gradient-descent from this point. This new ICA algorithm requires less computational effort and seems to enhance the robustness margins.
Acknowledgments M. Journ´ee is a research fellow of the Belgian National Fund for Scientific Research (FNRS). This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors.
References 1. Comon, P.: Independent Component Analysis, a new concept. In: Signal Processing, vol. 36(3), pp. 287–314. Elsevier, Amsterdam (1994) (Special issue on Higher-Order Statistics) 2. Learned-Miller, E.G., Fisher III, J.W.: ICA using spacings estimates of entropy. Journal of Machine Learning Research 4, 1271–1295 (2003) 3. Cover, T.M., Thomas, J.A.: Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, Chichester (2006) 4. Liebermeister, W.: Linear modes of gene expression determined by independent component analysis. Bioinformatics 18, 51–60 (2002) 5. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press (to appear) 6. Journ´ee, M., Teschendorff, A.E., Absil, P.-A., Sepulchre, R.: Geometric optim. methods for ICA applied on gene expression data. In: Proc. of ICASSP (2007) 7. Cardoso, J.-F.: High-order contrasts for independent component analysis. Neural Computation 11(1), 157–192 (1999) 8. Hyv¨ arinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Chichester (2001)
Using State Space Differential Geometry for Nonlinear Blind Source Separation David N. Levin University of Chicago
[email protected]
Abstract. Given a time series of multicomponent measurements of an evolving stimulus, nonlinear blind source separation (BSS) usually seeks to find a “source” time series, comprised of statistically independent combinations of the measured components. In this paper, we seek a source time series that has a phase-space density function equal to the product of density functions of individual components. In an earlier paper, it was shown that the phase space density function induces a Riemannian geometry on the system’s state space, with the metric equal to the local velocity correlation matrix of the data. From this geometric perspective, the vanishing of the curvature tensor is a necessary condition for BSS. Therefore, if this data-derived quantity is non-vanishing, the observations are not separable. However, if the curvature tensor is zero, there is only one possible set of source variables (up to transformations that do not affect separability), and it is possible to compute these explicitly and determine if they do separate the phase space density function. A longer version of this paper describes a more general method that performs nonlinear multidimensional BSS or independent subspace separation.
1
Introduction
Consider a set of data consisting of x ˜(t), a time-dependent multiplet of n measurements (˜ xk for k = 1, 2, . . . , n). The usual objectives of nonlinear BSS are: 1) to determine if these observations are instantaneous mixtures of n statistically independent source components x(t) x ˜(t) = f [x(t)]
(1)
where f is an unknown, possibly nonlinear, n-component mixing function, and, if so, 2) to compute the mixing function. In most approaches to this problem [1,2], the desired source components are required to be statistically independent in the sense that their state space density function ρ(x) is the product of the density functions of the individual components. However, it is well known that this problem always has many solutions (see [3] and references therein). Specifically, any observed density function can be integrated in order to construct an entire family of functions f −1 that transform it into a separable (i.e., factorizable) form. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 65–72, 2007. c Springer-Verlag Berlin Heidelberg 2007
66
D.N. Levin
The observed trajectories of many classical physical systems [4] can be characterized by density functions in phase space (i.e., (˜ x, x ˜˙ )-space). Furthermore, if such a system is composed of non-interacting subsystems, the state space variables can be chosen so that the system’s phase space density function is separable (i.e., is the product of the phase space density functions of the subsystems). This fact motivates the approach to BSS described in this paper [5]: we search for a function of the state space variable x ˜ that transforms the observed phase space density function ρ˜(˜ x, x ˜˙ ) into a separable form. Unlike conventional BSS, this “phase space BSS problem” has a unique solution in the following sense: either the data are inseparable, or they can be separated by a mixing function that is unique, up to transformations that do not affect separability (translations, permutations, and possibly nonlinear rescaling of individual source components). This form of the BSS problem has a unique solution because separability in phase space is a stronger requirement than separability in state space. In other words, if a choice of variables x leads to a separable phase space density function, it also produces a separable state space density function; however, the converse is not true. In particular, the above-mentioned procedure of using integrals of the state space density function to transform it into separable form [3] cannot be used to separate the phase space density function. It was previously demonstrated [6] that the phase space density function of a time series induces a Riemannian metric on the system’s state space and that this metric can be directly computed from the local velocity correlation matrix of the data. In the following Section, we show how this differential geometry can be used to determine if there is a source coordinate system in which the phase space density function is separable and, if so, to find the transformation between the coordinate systems of the observed variables and the source variables. In a technical sense, the method is straight-forward. The data-derived metric is differentiated to compute the affine connection and curvature tensor on state space. If the curvature tensor does not vanish, the observed data are not separable. On the other hand, if the curvature tensor does vanish, there is only one possible set of source variables (up to translations, permutations, and transformations of individual components), and it is possible to compute these explicitly and determine if they do separate the phase space density function. A longer version of this paper [5] describes the solution of a more general BSS problem (sometimes called multidimensional independent component analysis [MICA] or independent subspace analysis) in which the source components can be partitioned into groups, so that components from different groups are statistically independent but components belonging to the same group may be dependent [7,8,9]. As mentioned above, this paper exploits a stronger criterion of statistical independence than conventional approaches (i.e., separability of the phase space density function instead of separability of the state space density function). Furthermore, the new method differs from earlier approaches on the technical level. 
For example, the proposed method exploits statistical constraints on source time derivatives that are locally defined in the state space, in contrast to the usual criteria for statistical independence that are global conditions on the source time
Using State Space Differential Geometry
67
series or its time derivatives [10]. Furthermore, the nonlinearities of the mixing function are unraveled by imposition of local second-order statistical constraints, unlike many conventional approaches that rely on higher-order statistics [1,2]. In addition, the constraints of statistical independence are used to construct the mixing function in a “deterministic” manner, without the need for parameterizing it (with a neural network architecture or other means) and without using probabilistic learning methods [11,12]. And, the new method is quite general, unlike some other techniques that are limited to the separation of post-nonlinear mixtures [13] or other special cases. Finally, the use of differential geometry in this paper should not be confused with existing applications of differential geometry to BSS. In our case, the observed measurement trajectory is used to derive a metric on the system’s state space, and the vanishing of the curvature tensor is shown to be a necessary condition for separability of the data. In contrast, other authors [14] define a metric on a completely different space, the search space of possible mixing functions, so that “natural” (i.e., covariant) differentiation can be used to expedite the search for the function that optimizes the fit to the observed data.
2
Method
This Section describes how the phase space density function of the observed data induces a Riemannian geometry on the state space and shows how to compute the metric and curvature tensor of this space from the observed time series. Next, we show that, if curvature tensor is non-vanishing, the observed data are not separable. However, if the curvature tensor vanishes, we show how to determine whether the data are separable, and, if they are, we show how to find the mixing function, which is essentially unique. Let x = x(t) (xk for k = 1, 2, . . . , n) denote the trajectory of a time series. Suppose that there is a phase space density function ρ(x, x), ˙ which measures the fraction of total time that the trajectory spends in each small neighborhood dxdx˙ of (x, x)-space ˙ (i.e., phase space). As discussed in [6], most classical physical systems in thermal equilibrium with a “bath” have such a phase space density function: namely, the Maxwell-Boltzmann distribution [4]. Next, define g kl (x) to be the local second-order velocity correlation matrix [6] ¯˙ k ) (x˙ l − x ¯˙l ) >x g kl (x) =< (x˙ k − x
(2)
where the bracket denotes the time average over the trajectory’s segments in a ¯˙ =< x˙ >x , the local time average of x. small neighborhood of x and where x ˙ kl In other words, g is a combination of first and second moments of the local velocity distribution. Because this correlation matrix transforms as a symmetric contravariant tensor, it can be taken to be a contravariant metric on the system’s state space. Furthermore, as long as the local velocity distribution is not confined to a hyperplane in velocity space, this tensor is positive definite and can be inverted to form the corresponding covariant metric gkl . Thus, under these conditions, the time series induces a non-singular metric on state space. This metric
68
D.N. Levin
k can then be used to compute the affine connection Γlm and Riemann-Christoffel k curvature tensor R lmn of state space by means of the standard formulas of differential geometry [15] k Γlm (x) =
1 kn ∂gnl ∂gnm ∂glm g ( + − ) 2 ∂xm ∂xl ∂xn
(3)
and
∂Γ k lm ∂Γ k ln + + Γ k im Γ i ln − Γ k in Γ i lm (4) ∂xn ∂xm where we have used the Einstein convention of summing over repeated indices. Now, assume that the data are separable and that x represents a set of source variables; i.e., assume that the phase space density function ρ is equal to the product of density functions of each component of x. It follows from definition (2) that the metric g kl (x) is diagonal and has positive diagonal elements, each of which is a function of the corresponding coordinate component. Therefore, the individual components of x can be transformed in order to create a new state space coordinate system in which the metric is the identity matrix and the curvature tensor (4) vanishes. It follows that the curvature tensor must vanish in every coordinate system, including the coordinate system x ˜ defined by the observed data ˜ k lmn (˜ x) = 0 (5) R Rk lmn (x) = −
In other words, the vanishing of the curvature tensor is a necessary consequence of separability. Therefore, if this data-derived quantity does not vanish, the data cannot be transformed so that their phase space density function is separable. On the other hand, if the data do satisfy (5), there is only one possible separable coordinate system (up to transformations that do not affect separability), and it can be explicitly constructed from the observed data x ˜(t). To see this, first note that, on a flat manifold (e.g., (5)) with a positive definite metric, it is always possible to explicitly construct a “Euclidean” coordinate system for which the metric is the identity matrix. Furthermore, if a coordinate system has a diagonal metric with positive diagonal elements that are functions of the corresponding coordinate components, it can be derived from this Euclidean one by means of an n-dimensional rotation, followed by transformations that do not affect separability (i.e., translations, permutations, and transformations of individual components). Therefore, because every separable coordinate system must have a diagonal metric with the aforementioned properties, all possible separable coordinate systems can be found by constructing a Euclidean coordinate system and then finding all rotations of it that are separable. The first step is to construct a Euclidean coordinate system in the following manner: 1) at some arbitrarilyx(i) (i = 1, 2, . . . , n) that are orthonormal chosen point x ˜0 , select n small vectors δ˜ with respect to the metric at that point (i.e., g˜kl (˜ x0 )δ˜ x(i)k δ˜ x(j)l = δij , where δij is the Kronecker delta); 2) starting at x ˜0 , use the affine connection to repeatedly parallel transfer all δ˜ x along δ˜ x(1) ; 3) starting at each point along the resulting geodesic path, repeatedly parallel transfer these vectors along δ˜ x(2) ; ... continue the parallel transfer process along other directions ... n+1) starting
Using State Space Differential Geometry
69
at each point along the most recently produced geodesic path, parallel transfer these vectors along δ˜ x(n) . Finally, each point is assigned the geodesic coordinate s (sk , k = 1, 2, . . . , n), where sk represents the number of parallel transfers of the vector δ˜ x(k) that was required to reach it. Differential geometry [15] guarantees that the metric of a flat, positive definite manifold will be the identity matrix in a geodesic coordinate system constructed in this way. We can now transform the data into this Euclidean coordinate system and examine the separability of all possible rotations of it. The easiest way to do this is to compute the second-order correlation matrix (6) σkl =< (sk − s¯k ) (sl − s¯l ) > where the brackets denote the time average over the entire trajectory and s¯ =< s >. If this data-derived matrix is not degenerate, there is a unique rotation that diagonalizes it, and the corresponding rotation of the s coordinate system is the only candidate for a separable coordinate system (up to transformations that do not affect separability). Its separability can be determined by explicitly computing the data’s phase space density function in order to see if it factorizes in this rotated coordinate system. Alternatively, we can use higher-order statistical criteria to see if the rotated s components are truly independent. In summary, the BSS problem can be solved by the following procedure: 1. Use the data x ˜(t) to compute the metric, affine connection, and curvature of the state space [(2-4)]. 2. If the curvature does not vanish at each point, the data are not separable. 3. If the state space curvature does vanish: (a) Compute the transformation to a Euclidean coordinate system s and transform the data into it. (b) Find the rotation that diagonalizes the second-order correlation matrix σ and transform to the corresponding rotation of the s coordinate system. (c) Compute the phase space density function of the data in the rotated s coordinate system. (d) If the density function factorizes, the data are separable, and the rotated s coordinates are the unique source variables (up to translations, permutations, and transformations of individual components). If the density function does not factorize, the data are not separable.
3
Discussion
This paper outlines a new approach to nonlinear BSS that is based on a notion of statistical independence, which is characteristic of a wide variety of classical noninteracting physical systems. Specifically, the new method seeks to determine if the observed data are mixtures of source variables that have a phase-space density function equal to the product of density functions of individual components. This criterion of statistical independence is stronger than that of conventional approaches to BSS, in which only the state-space density function is required to be separable. Because of the relative strength of this requirement, the new
70
D.N. Levin
approach to BSS produces a unique solution in each case (i.e., data are either inseparable or are separable by a unique mixing function), unlike the conventional approach that always finds an infinite number of mixing functions. Given a time series of observations in a measurement-defined coordinate system (˜ x) on the system’s state space, the basic problem is to determine if there is another coordinate system (a source coordinate system x) in which the density function is factorizable. The existence (or non-existence) of such a source coordinate system is a coordinate-system-independent property of the time series of data (i.e., an intrinsic or “inner” property). This is because, in all coordinate systems, there either is or is not a transformation to such a source coordinate system. In general, differential geometry provides mathematical machinery for determining whether a manifold has a coordinate-system-independent property like this. In the case at hand, we can induce a geometric structure on the state space by identifying its metric with the local second-order correlation matrix of the data’s velocity [6]. Then, a necessary condition for BSS is that the curvature tensor vanishes in all coordinate systems (including the measurement coordinate system). Therefore, if this data-derived quantity is non-vanishing, the observations are not separable. However, if the curvature tensor is zero, the data are separable if and only if the density function is seen to factorize in a coordinate system that can be explicitly constructed from the data-derived affine connection. In that case, these coordinates are the unique source variables (up to transformations that do not affect separability). A longer version of this paper [5] describes the solution of a more general BSS problem (sometimes called multidimensional ICA or independent subspace analysis) in which the source components are only required to be partitioned into groups that are statistically independent of one another but contain statistically interdependent variables [7,8,9]. The possible separable coordinate systems are a subset of all coordinate systems in which the metric is block -diagonal (instead of fully diagonal as in this paper). All of these “block-diagonal coordinate systems” can be derived from geodesic coordinate systems constructed from geodesics along a finite number of special directions in state space, and these special directions can be computed from algebraic equations involving the curvature tensor. Thus, it is possible to construct every block-diagonal coordinate system and then explicitly determine if the density function is separable in it. An exceptional situation arises if the metric can be transformed into a block-diagonal form with two or more one-dimensional blocks. In this case, there is an unknown rotation on this two-dimensional (or higher dimensional) subspace that is not determined by the requirement of metric block-diagonality. However, much as in Sect. 2, this rotation can be determined by applying other statistical requirements of separability, such as block diagonality of the second-order state variable correlation matrix or block-diagonality of higher-order local velocity correlation functions. In reference [5], this procedure for performing multidimensional ICA is described in detail, and it is illustrated with analytic examples, as well as with a detailed numerical simulation of an experiment.
Using State Space Differential Geometry
71
What are the limitations of the applicability of this method? It is certainly critical that there be a well-defined metric on state space. However, this will be the case if the measurement time series is described by a phase space density function, a requirement that is satisfied by the trajectories of a wide variety of physical systems [6]. In practical applications, the measurements must cover state space densely enough to be able to compute the metric, as well as its first and second derivatives (required to calculate the affine connection and curvature tensor). In the numerical simulation in [5], approximately 8.3 million short trajectory segments (containing a total of 56 million points) were used to compute the metric and curvature tensor on a three-dimensional state space. Of course, if the dimensionality of the state space is higher, even more data will be needed. So, a relatively large amount of data may be required in order to be able to determine their separability. There are few other limitations on the applicability of the technique. For example, computational expense is not prohibitive. The computation of the metric is the most CPU-intensive part of the method. However, it can be distributed over multiple processors by dividing the observed data into “chunks” corresponding to different time intervals, each of which is sent to a different processor where its contribution to the metric (2) is computed. As additional data are accumulated, they can be processed separately and then added into the time average of the data that were used to compute the earlier estimate of the metric. Thus, the earlier data need not be processed again, and only the latest observations need to be kept in memory.
Acknowledgments The author is indebted to Michael A. Levin for insightful comments and for a critical reading of the manuscript. Aapo Hyv¨ arinen and Shun-ichi Amari also read the manuscript, and their comments on the uniqueness issue are appreciated. This work was performed with the partial support of the National Institute of Deafness and Communicative Disorders.
References 1. Hyv¨ arinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001) 2. Jutten, C., Karhunen, J.: Advances in nonlinear blind source separation. In: Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA 2003), Nara, Japan (April 2003) 3. Hyv¨ arinen, A., Pajunen, P.: Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12, 429–439 (1999) 4. Sears, F.W.: Thermodynamics, the Kinetic Theory of Gases, and Statistical Mechanics, 2nd edn. Addison-Wesley, Reading, MA (1959) 5. Levin, D.N.: Using state space differential geometry for nonlinear blind source separation (2006), http://arxiv.org/abs/cs/0612096 6. Levin, D.N.: Channel-independent and sensor-independent stimulus representations. J. Applied Physics 98, 104701 (2005), Also, see the papers posted at http://www.geocities.com/dlevin2001/
72
D.N. Levin
7. Cardoso, J-F.: Multidimensional independent component analysis. In: Proc. 1998 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’98), Seattle, pp. 1941–1944. IEEE Computer Society Press, Los Alamitos (1998) 8. Bingham, E., Hyv¨ arinen, A.: A fast fixed-point algorithm for independent component analysis of complex-valued signals. Int. J. of Neural Systems 10, 1–8 (2000) 9. Nishimori, Y., Akaho, S., Plumbley, M.D.: Riemannian optimization method on the flag manifold for independent subspace analysis. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 295–302. Springer, Heidelberg (2006) 10. Lagrange, S., Jaulin, L., Vigneron, V., Jutten, C.: Analytic solution of the blind source separation problem using derivatives. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 81–88. Springer, Heidelberg (2004) 11. Haykin, S.: Neural Networks - A Comprehensive Foundation, 2nd edn. Prentice Hall, New York (1998) 12. Yang, H., Amari, S.I., Cichocki, A.: Information-theoretic approach to blind separation of sources in non-linear mixture. Signal Processing 64, 291–300 (1998) 13. Taleb, A., Jutten, C.: Source separation in post nonlinear mixtures. IEEE Trans. on Signal Processing 47, 2807–2820 (1999) 14. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10, 251–276 (1998) 15. Weinberg, S.: Gravitation and Cosmology - Principles and Applications of the General Theory of Relativity. Wiley, New York (1972)
Copula Component Analysis Jian Ma and Zengqi Sun Department of Computer Science, Tsinghua University, Beijing 100084, China
[email protected]
Abstract. A framework named copula component analysis (CCA) for blind source separation is proposed as a generalization of independent component analysis (ICA). It differs from ICA which assumes independence of sources that the underlying components may be dependent by certain structure which is represented by Copula. By incorporating dependency structure, much accurate estimation can be made in principle in the case that the assumption of independence is invalidated. A two phrase inference method is introduced for CCA which is based on the notion of multi-dimensional ICA. Simulation experiments preliminarily show that CCA can recover dependency structure within components while ICA does not.
1
Introduction
Blind source separation (BSS) is to recover the underlying components from their mixtures, where the mixing matrix and distribution of the components are unknown. To solve this problem, independent component analysis (ICA) is the most popular method to extract those components under the assumption of statistically independence[1,2,3,4,5]. However, in practice, the independence assumption of ICA cannot always be fully satisfied and thus strongly confines its applications. Many works have been contributed to generalize the ICA model,[6] such as Tree-ICA[7], Topology ICA[8]. A central problem of those works is how to relax the independent assumption and to incorporate different kinds of dependency structure into the model. Copula [9] is a recently developed mathematical theory for multivariate probability analysis. It separates joint probability distribution function into the product of marginal distributions and a Copula function which represents the dependency structure of random variables. According to Sklar theorem, given a joint distribution with margins, there exists a copula uniquely determined. Through Copula, we can clearly represent the dependent relation of variables and analysis multivariate distribution of the underlying components. The aim of this paper is to use Copula to model the dependent relations between elements of random vectors. By doing this, we transform BSS into a parametric or semi-parametric estimation problem which mainly concentrate on the estimation of dependency structure besides identification of the underlying components as ICA do. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 73–80, 2007. c Springer-Verlag Berlin Heidelberg 2007
74
J. Ma and Z. Sun
This paper is organized as follows: we briefly review ICA and its extensions in section 2. The main conclusions of copula theory are briefly introduced in section 3. In section 4, we propose a new model for BSS, named copula component analysis (CCA) which takes dependency among components into consideration. Inference method for CCA is presented in section 5. Simulation experiments are presented in section 6. Finally, we conclude the paper and give some further research directions.
2
ICA and Its Extensions
Given a random vector x, ICA is modeled as x = As,
(1)
where the source signals s = {s1 , . . . , sn } assume to be mutually independent, A and W = A− is the invertible mixing and demixing matrix to be solved so that the recovered underlying components {s1 , . . . , sn } is estimated as statistically independent as possible. Statistical independence of sources means that the joint probability density of x and s can be factorized as p(x) = p(As) =| det(W) n | p(s) p(s) = i=1 pi (si )
(2)
The community has presented many extensions of ICA with different types of dependency structures. For example, Bach and Jordan [7] assumed that dependency can be modeled as a tree (or a forest). After the contrast function is extended with T-mutual information, Tree-ICA tries to find both a mixing matrix A and a tree structure T by embedding a Chow-Liu algorithm into algorithm. Hyv¨ arinen etc. [8] introduced the variance into ICA model so as to model dependency structure. Cardoso generalized the notion of ICA into multidimensional ICA using geometrical structure.[6]
3
A Brief Introduction on Copula
Copula is a recently developed theory which separates the margin law and the joint law and therefore gives dependency structure as a function. According to Nelson [9], it is defined as follows: Definition 1 (Copula). A bidimensional copula is a function C(x, y) : I 2 → I with following properties: 1. (x, y) ⊂ I 2 2. C(x2 , y2 ) − C(x1 , y2 ) − C(x2 , y1 ) + C(x1 , y1 ) ≥ 0, for x1 ≤ x2 and y1 ≤ y2 ; 3. C(x, 1) = x and C(1, y) = y. It’s not hard to know that such defined C(x, y) is a cdf on I 2 . Multidimensional version can be generalized in a same manner which presents in [9].
Copula Component Analysis
75
Theorem 1 (Sklar Theorem). Given a multidimensional random vector x = (x1 , . . . , xn ) ∈ Rn with its corresponding distribution function and density function ui = Fi (xi ) and pi (xi ), i = 1, . . . , n. Let F (x) : Rn → I denotes the joint distribution, then there exists a copula C(·) : I n → I so that F (x) = C(u).
(3)
where u = (u1 , . . . , un ). If the copula is differentiable, the joint density function of F (x) is P1,...,n (x) =
n
pi (xi )C (u).
(4)
i=1
where C (u) =
∂C(u) ∂u1 ,...,∂un .
Given a random vector x = (x1 , . . . , xn ) with mutually independent variables, and their cdf F (x) = i Fi (xi ). It is easy to obtain that the corresponding copula function called Product Copula is C(u) = i ui and C (u) = 1.
4 4.1
Copula Component Analysis Geometry of CCA
As previously stated, ICA assumes that the underlying components are mutually independent, which can be represented as (1). CCA also use the same representation (1) as ICA, but without independence assumption. Here, Let the joint density function represents by Copula: pc (x) =
N
pi (xi )C (u)
(5)
i=1
where the dependency structure is modeled by function C(u). The goal of estimation is to minimize the distance between the ’real’ pdf of random vector x and its counterpart of the proposed model. Given a random vector x with pdf p(x), the distance between p(x) and pc (x) in a sense of K-L divergence can be represented as p(x) pc (x) p(x) = Ep(x) log − Ep(x) log C (u) i pi (xi )
D(ppc ) = Ep(x) log
(6)
The first term on the right of (6) is corresponding to the K-L divergence between p(x) and ICA model and the second term is corresponding to entropy of copula C(x).
76
J. Ma and Z. Sun
Theorem 2. Given arandom vector x = (x1 , . . . , xn ) ∈ Rn with pdf p(x) and n its joint pdf pc (x) = i=1 pi (xi )C (u), where ui = Fi (xi ) is the cdf of xi and dependency structure is presented by copula function C(u) : I n → I, u ∈ Rn and n C(u) is the derivative of C(u). The K-L divergence D(ppc ) is as C (u) = ∂u∂1 ,...,∂u n D(ppc ) = I(x1 , . . . , xn ) + H(C (u)).
(7)
where H(·) is the Shannon differential entropy. That is, the K-L divergence between p(x) and pc (x) equal to the sum of the mutual information I(x) and copula entropy H for function u ∼ C . Using the invariant of K-L divergence, we now have the following corollary to Theorem 2 for BSS problem s = Wx. Corollary 1. With the same denotation of Theorem 2, the K-L divergence for BSS problem is (8) D(ppc ) = I(s1 , . . . , sn ) + H(C (us )), where us denotes the marginal variable for sources s. Assume that the number of sources equals to that of observations. In other words, the distance between ICA model and the true model is presented by dependency structure and its value equals to entropy of the underlying copula function. It can be easily learned from (7) that if dependency structure was incorporated into model, the distance between data and model can be further closer than that of ICA model. ICA is a special case when it assumes mutual independence of underlying components. Actually, ICA only minimizes the first part of (7) under the assumption of independence. This also explains why sometime ICA model is not applicable when dependency relations between source components exist. 4.2
Multidimensional ICA
From the notion of multidimensional ICA generalized from ICA by Cardoso [6], it can be derived that p(x) =
m
pk (xk ) =
k=1
=
m ik+1 −1
m
pk (xik , . . . , xik+1 −1 )
k=1
pk (xl )Ck (uk )
=
n i=1
k=1 l=ik
pi (xi )
m
(9) Ck (uk )
k=1
where Ck (·) is the copula with respect to pk (·). On the other side, the definition of copula gives n p(x) = pi (xi )C (u) (10) i=1
According to Sklar theorem, if all pi (·) exist, then C(·) is unique. Therefore, we can derive the following result.
Copula Component Analysis
77
Theorem 3. The copula corresponding to multidimensional ICA is factorial if all the marginal pdf of component exist, that is C (u) =
m
Ck (uk )
(11)
k=1
Proof. Because of the unique of C, the above (11) can be easily derived by comparing (9) and (10). The theorem can guide hypothesis selection of copula. That is, Copula should be factorized as a product of sub-function with different type for dependency structure of different sub-space. Combining (7) and (11), we can derived the following: D(ppc ) = I(u1 , . . . , un ) +
m
H(Ck )
(12)
k=1
It means that the distance between the true model and ICA model composes of entropy of Copulas which corresponds to every un-factorial ICs spaces. Therefore, if we want to derive a model much closer to the ’true’ one than ICA, we should find dependency structure of each space, that is, approach the goal step by step. This is one of the guide principles for designing algorithm of copula component analysis.
5 5.1
Inference of CCA General Framework
In this section, we study inference method for CCA based on the notion of multidimensional ICA. Suppose the underlying copula function parameterized by θ ∈ Θ, thus the estimation of CCA should infer the demixing matrix W and θ. According to theorem 2, estimation of the underlying sources through our model requires the minimization of the K-L divergence of (7) or (12). Thus the objective function is (13) min D(ppc ; W, θ) which composes of two sub-objective: min I(x1 , . . . , xn ; W) and min H(C (u); W, θ). Because u in the latter objective depends on the structure of IC spaces derived from the former objective, we should handle the optimal problem min I(x1 , . . . , xn ; W) at first. The first objective can be achieved by ICA algorithm. For the second one we proposed the Infomax like principle given a parametric family of copula. We propose that the framework of CCA composes of two phrases: 1. Solve W through minimization of mutual information . 2. Determine W and θ so that the objective function (13) is minimized.
78
5.2
J. Ma and Z. Sun
Maximum Likelihood Estimation
Given the parametric model of Copula, maximum likelihood estimation can be deployed under the constraint of ICA. Consider a group of independent observations x 1 , . . . , xT of n × 1 random vector X with a common distribution P = Cθ (x) Ti=1 pi (xi ); θ ∈ Θ where pi (xi ) is marginal distribution associated with xi , and the log-likelihood is L(W, θ) =
T 1 log Cθ (x) pi (xi ) T i=1
T 1 1 = log pi (xi ) + log Cθ (x) T T
(14)
i=1
The representation is consist with two-phrase CCA framework in that the first term on the right of equation (14) implies mutual information of x and that the second term is empirical estimation of entropy of x. It is not hard to proof that min D(ppc ) ⇔ max L(W, θ) 5.3
(15)
Estimation of Copula
Suppose the IC subspaces have been correctly determined by ICA and then we can identify the copula by minimizing the second term on the right of (7). Given a class of Copula C(u; θ) with parameter vector θ ∈ Θ, and a set of sources s = (s1 , . . . , sn ) identified from data set X, the problem is such a optimization one (16) max Ep(s) (C (us ; W, θ)) W,θ
By using Sklar theorem, the copula to be identified has been separate with marginal distributions which are known except non-Gaussianity in ICA model. Therefore, the problem here is a semi-parametric one and only need identifying the copula. Parametric method is adopted. First, we should select a hypothesis for copula among many types of copula available. The selection depends on many factors, such as priori knowledge, computational ease, and individual preference. Due to space limitations, only few of them are introduced here. For more detail please refer to [9]. When a set of sources s and a parametric copula C(·; θ) is prepared, the optimization of (16) becomes an optimization problem which can be solved as follows: n ∂C (u; θ) = 0 (17) ∂θ s =1 i
where many readily methods can be utilized.
Copula Component Analysis
6
79
Simulation Experiments
In this section, simulation experiments are designed to compare CCA and ICA on two typical cases to investigate whether CCA can perform better than ICA as previous stated. One cases is with independent components and the other is where there are dependent components. We first apply both methods on independent components recovery from their mixtures and then on recovery of components with dependency structure. In both experiments, the basic case of BSS with two components are considered. Two components are generated by bi-variate distribution associated with Gumbel copula: −θ (18) C(u, v) = exp (− ln u)θ + (− ln v)θ where θ = 1, 5 respectively. Note that two components such generated are independent when θ = 1 and thus compose of sources of ICA problem. The marginal density of components are uniform. Sources are mixed by randomly generated and invertible 2 × 2 matrix A. In our experiments, A is
0.4936 0.9803 A= 0.4126 0.5470 Both ICA and CCA are used to recover the components from their mixtures. Without the attention to study model selection, Gumbel copula family is adopted in CCA method. The results are illustrated in Figure 1. Due to space limitations, we only present copula density structure of sources and their recoveries by both methods indepedent components
components by Gumbel copula
1 0.5 0 −0.5
1 0.5 0 −0.5 0
1000
2000
0
1000
2000
−3
x 10
(a)
5 0 −5 50 0 0
(b) 0.02 0 −0.02 50 50
−3
x 10
0 0
50
−3
x 10
(c)
5 0 −5 50 0 0
50
(d)
5 0 −5 50 0 0
50
−3
x 10
(e)
5 0 −5 50 0 0
(f) 0.02 0 −0.02 50 50
0 0
50
Fig. 1. Simulation experiments. The left column is for independent component experiments and the right column is for the experiment of components by Gumbel copula. The top two sub-figure is sources and (a) and (b) is their corresponding copula density. (c) and (d) is for ICA and (e) and (f) is for CCA.
80
J. Ma and Z. Sun
in Figure 1. Note that copula density structure should be a plane if two components are independent, that is, C(u, v) = 1. It can be learned from figure 1 that both methods works well when components are mutually independent and more importantly that ICA always try to extracts components mutually independent while CCA can recover the dependency between components successfully.
7
Conclusions and Further Directions
In this paper, a framework named Copula Component Analysis for blind source separation is proposed as a generalization of ICA. It differs from ICA which assumes independence of sources that the underlying components may be dependent with certain structure which is represented by Copula. By incorporating dependency structure, much accurate estimation can be made, especially in the case where the assumption of independence is invalidated. A two phrase inference method is introduced for CCA which is based on the notion of multidimensional ICA. A preliminary simulated experiment demonstrates the advantage of CCA over ICA on dependency structure discovery. Many problems remain to be studied in the future, such as Identifiability of the method, selection of copula model and applications.
References 1. Comon, P.: Independent component analysis - A new concept? Signal Processing 36, 287–314 (1994) 2. Bell, A., Sejnowski, T.: An information-maximization approach to blind separation and blind deconvolution. Neural Comp. 7, 1129–1159 (1995) 3. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind source separation. In: Advances in Neural Information Processing, pp. 757–763. MIT Press, Cambridge (1996) 4. Cardoso, J.-F., Laheld, B.H.: Equivariant adaptive source separation. IEEE Transactions on Signal Processing 44, 3017–3030 (1996) 5. Pham, D.T., Garat, P.: Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing 45, 1712–1725 (1997) 6. Cardoso, J.-F.: Multidimensional independent component analysis, Acoustics, Speech, and Signal Processing. In: ICASSP’98. Proceedings of the 1998 IEEE International Conference on, vol. 4, pp. 1941–1944 (1998) 7. Bach, F.R., Jordan, M.I.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003) 8. Hyv¨ arinen, A., Hoyer, P.O., Inki, M.: Topographic Independent Component Analysis. Neural Computation 13, 1527–1558 (2001) 9. Nelsen, R.B.: An Introduction to Copulas. Lecture Notes in Statistics. Springer, New York (1999)
On Separation of Signal Sources Using Kernel Estimates of Probability Densities Oleg Michailovich1 and Douglas Wiens2 1 2
University of Waterloo, Waterloo, Ontario N2L 3G1, Canada University of Alberta, Edmonton, Alberta T6G 2G1, Canada
Abstract. The discussion in this paper revolves around the notion of separation problems. The latter can be thought of as a unifying concept which includes a variety of important problems in applied mathematics. Thus, for example, the problems of classification, clustering, image segmentation, and discriminant analysis can all be regarded as separation problems in which one is looking for a decision boundary to be used in order to separate a set of data points into a number of (homogeneous) subsets described by different conditional densities. Since, in this case, the decision boundary can be defined as a hyperplane, the related separation problems can be regarded as geometric. On the other hand, the problems of source separation, deconvolution, and independent component analysis represent another subgroup of separation problems which address the task of separating algebraically mixed signals. The main idea behind the present development is to show conceptually and experimentally that both geometric and algebraic separation problems are very intimately related, since there exists a general variational approach based on which one can recover either geometrically or algebraically mixed sources, while only little needs to be modified to go from one setting to another.
1
Introduction
Let X = {xi ∈ IRd , i = 1, . . . , N } be a set of N observations of a random variable def
X which is described by M conditional densities {pk (x) = p(x | X ∈ Ck )}M k=1 , with Ck denoting a class to which a specific realization of X may belong. In other words, the set X can be viewed as a mixture of realizations of M random variables associated with different classes described by corresponding probability densities. In this case, the problem of classification (or, equivalently, separation) refers to the task of ascribing each observation xi to the class Ck which it has most likely come from. The most challenging version of the above problem occurs in the case when the decision has to be made given the observed set X alone. The setting considered above is standard for a variety of important problems in applied mathematics. Probably, the most famous examples here are unsupervised machine learning and data clustering [1,2]. Signal detection and image segmentation are among other important examples of the problems which could be embedded into the same separation framework [3,4]. It should be noted that, although a multitude of different approaches have been proposed previously to M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 81–88, 2007. c Springer-Verlag Berlin Heidelberg 2007
82
O. Michailovich and D. Wiens
address the above problems, most of them are similar at the conceptual level. Specifically, viewing the observations {xi } as points on either a linear or a nonlinear manifold Ω, the methods search for such a partition of the latter so that the points falling at different subsets of Ω are most likely associated with different classes Ck . Moreover, the boundaries of the partition, which are commonly referred to as decision boundaries, are usually defined by means of geometric descriptors. The latter, for example, can be hyperplanes in machine learning [1] or active contours [4] in image segmentation. For this reason, we refer to the problems of this type as the problems of geometric source separation (GSS), in which case the data set X is considered to be a geometric mixture of unknown sources. In parallel to the case of GSS, there exists an important family of problems concerned with separating sources that are mixed algebraically [5]. In a canonical setting, the problem of algebraic source separation (ASS) can be formulated as follows. Let S be a vector of M signals (sources) [s1 (t), s2 (t), . . . , sM (t)]T , with t = 1, . . . , T being either a temporal or a spatial variable. Also, let A ∈ IRM×M be an unknown mixing matrix of full rank. Subsequently, the problem of blind source separation consists in recovering the sources given an observation of their mixtures X = [x1 (t), x2 (t), . . . , xM (t)]T acquired according to1 : X = A S.
(1)
Note that, in (1), neither the sources S nor the matrix A are known, and hence the above estimation problem is conventionally referred to as blind. Note that the problem of (algebraic) blind source separation constitutes a specific instance of Independent Component Analysis, which is a major theory encompassing a great number of applications [5]. Moreover, when M = 1 and A is defined to be a convolution operator, the resulting problem becomes the problem of blind deconvolution [6], which can also be inscribed in our framework of separation problems. The main purpose of this paper is to show conceptually and experimentally that both GSS and ASS problems are intimately interrelated, since they can be solved using the same tool based on variational analysis [7]. To define this tool, let us first introduce an abstract, geometric separation operator ϕ : X → {Sk }M k=1 that “sorts” the points of X into M complementary and mutually exclusive subsets {Sk }M k=1 which represent estimates of the geometrically mixed sources. On the other hand, in the case of ASS, the separation operator is defined algebraically as a de-mixing matrix W ∈ IRM×M such that: S W X,
(2)
with S and X defined to be S = [s1 (t), . . . , sM (t)]T and X = [x1 (t), . . . , xM (t)]T , respectively. Additionally, let yk be an estimate of either a geometric or an algebraic k-th source signal, computed via applying either ϕ or W to the data. This estimate 1
Here and hereafter, the matrix A is assumed to be square which is merely a technical assumption which can be dropped; this is discussed in the sequel.
On Separation of Signal Sources Using Kernel Estimates
83
can be characterized by its empirical probability density function (pdf) which can be computed as given by: Nk 1 K(z − yk (t)), p˜k (z) = Nk t=1
z ∈ IRd ,
(3)
where Nk is the size of the estimate (that is independent of k in the case of ASS). Note that (3) defines a kernel based estimate of the pdf of yk when the kernel function K(z) is normalized to have unit integral [8]. There exist a number of possibilities for choosing K(z), among which the most frequent one is to define the kernel in the form of a Gaussian density function. Accordingly, this choice of K(z) is used throughout the rest of this paper. The core idea of the preset approach is quite intuitive and it is based on the assumption that the “overlap” between the informational contents of the estimated sources has to be minimal. To minimize this “overlap”, we propose to find the optimal separation operator (viz. either ϕ or W) as a minimizer of the cumulative Bhattacharyya coefficient between the empirical pdfs of the estimated sources, which is defined to be [9]: 2 p˜i (z) p˜j (z)dz, i, j = 1, . . . , M. (4) BM = M (M − 1) i<j IRd It should be noted that, apart from the Bhattacharyya coefficient, a number of alternative metrics are available to assess the distance between the probability densities. Thus, for example, the Kullback-Leibler divergence was employed in [10] and [5] to solve the problems of image segmentation and blind source separation, respectively. However, for the reasons discussed below, we prefer using (4), since it results in comparatively more stable and reliable separation. To demonstrate how BM can be used to unify the concept of separation, as it appears in both geometric and algebraic settings, we turn to some specific examples, among which the problem of image segmentation is chosen to be first.
2
Geometric Source Separation: Image Segmentation
In order to facilitate the discussion, we confine the derivations below to the case of two segmentation classes. In this case, the values of a vector-valued image I(u) : Ω ⊆ IR2 → IRd are viewed as a geometric mixture of two sources, viz. the object of interest and its background. Consequently, the segmentation problem can be reformulated as the problem of partitioning the domain of definition Ω of I(u) (with u ∈ Ω) into two mutually exclusive and complementary subsets Ω− and Ω+ . These subsets can be represented by their respective characteristic functions χ− and χ+ , which can, in turn, be defined as χ− (u) = H(−ϕ(u)) and χ+ (u) = H(ϕ(u)), with H standing for the Heaviside function. Given a level-set function ϕ(u), its zero level set {u | ϕ(u) ≡ 0, u ∈ Ω} is used to implicitly represent a curve – active contour – embedded into Ω. For the sake
84
O. Michailovich and D. Wiens
of concreteness, we associate the subset Ω− with the support of the object of interest, while Ω+ is associated with the support of corresponding background. In this case, the objective of active-contour-based image segmentation is, given an initialization ϕ0 (u), to construct a convergent sequence of level-set functions {ϕt (u)}t>0 (with ϕt (u)t=0 = ϕ0 (u)) such that the zero level-set of ϕ∞ (u) coincides with the boundary of the object of interest. The above sequence of level-set functions can be conveniently constructed using the variational framework. Specifically, the sequence can be defined by means of a gradient flow that minimizes the value of the cost functional (4). In the case of two segmentation classes, the optimal level set ϕ (u) is defined as: ϕ (u) = arg inf {B2 (ϕ(u))},
(5)
ϕ(u)
where B2 (ϕ(u)) =
z∈IRN
p− (z | ϕ(u)) p+ (z | ϕ(u)) dz.
(6)
with p− (z | ϕ(u)) and p+ (z | ϕ(u)) being the kernel-based estimates of the pdf’s of the class and background sources. In order to contrive a numerical scheme for minimizing (5), its first variation should be computed first. The first variation of B2 (ϕ(u)) (with respect to ϕ(u)) can be shown to be given by: δB2 (ϕ(u)) = δ(ϕ(u)) V (u), δϕ(u) where 1 1 −1 V (u) = B2 (ϕ(u))(A−1 − − A+ ) + 2 2
with 1 L(z | ϕ(u)) = A+
(7)
z∈IRd
K(z − I(u)) L(z | ϕ(u)) dz,
1 p− (z | ϕ(u)) − p+ (z | ϕ(u)) A−
p+ (z | ϕ(u)) . p− (z | ϕ(u))
(8)
(9)
Note that, in the equations above, δ(·) stands for and A− the delta function, and A+ are the areas of Ω− and Ω+ given by Ω χ− (u) du and Ω χ+ (u) du, respectively. Finally, introducing an artificial time parameter t, the gradient flow of ϕ(u) that minimizes (5) is given by: ϕt (u) = −
δB2 (ϕ(u)) = −δ(ϕ(u)) V (u), δϕ(u)
(10)
where the subscript t denotes the corresponding partial derivative, and V (u) is defined as given by (8). It should be added that, in order to regularize the shape of the active contour, it is common to constrain its length and to replace the theoretical delta function
On Separation of Signal Sources Using Kernel Estimates
85
¯ δ(·) by its smoothed version δ(·). In this case, the final equation for the evolution of the active contour becomes: ¯ (α κ(u) − V (u)) , ϕt (u) = δ(ϕ(u))
(11)
∇ϕ(u) where κ(u) is the curvature of the active contour given by κ(u) = −div ∇ϕ(u) and α > 0 is a user-defined regularization parameter. Note that, in the segmentation results reported in this paper, α was set to be equal to 1.
3
Blind Separation of Algebraically Mixed Sources
It is surprising how little has to be done to modify the separation approach of the previous section to suit the ASS setting. Indeed, let Y = [y1 (t), y2 (t), . . . , yM (t)]T be the matrix of estimated sources computed as Y = W X. Additionally, let def
T th {p(z; wi ) = p˜i (z | W)}M i=1 (where wi is the i row of W) be the set of empirical densities computed as at (3) and that correspond to the source estimates in Y. Consequently, the optimal separation matrix W∗ can be found as:
W = arg inf {BM (W)}, W
where BM (W) =
2 M (M − 1)
p(z; wi ) p(z; wj ), i, j = 1, . . . M.
(12)
(13)
IRd i<j
It should be noted that intrinsic in blind (algebraic) source separation is the problem of permutation and normalization, as, using (2), the sources can only be recovered in an arbitrary order and up to arbitrary multiplication factors. While the order of the sources is rarely of importance, the normalization could become an issue, especially from the viewpoint of numerical minimization. To overcome this difficulty, it is common to prewhiten the mixtures X before they are passed into the computations. In this case, it can be easily shown that the optimal solution W becomes a member of the orthogonal group O(M ) = {W ∈ IRM×M | WWT = I}. We solve this constrained minimization problem with the aid of Lagrange multipliers {λαβ }α≤β . Consider the problem of minimizing
T (14) λαβ wα wβ − δαβ , F (w1 , ..., wM , λ) = BM (W) + α≤β
where δαβ is Kronecker’s delta. Solving the equations ∂ F (w1 , ..., wM , λ) = 0T (∈ IR1×M ), ∂wi
(15)
together with WWT = I (details available from authors) leads to the characterization of W∗ as a fixed point of the function G(W) = (PPT )−1/2 P,
(16)
86
O. Michailovich and D. Wiens
where (PPT )1/2 is a symmetric square root and P = P(W) is defined as follows. ˙ Let K(z) be the N × M matrix with (i, j)th element K (z − yj (i)) and let D(z) be the diagonal matrix with diagonal elements p(z; wj ) j=1,...,M;j =i di (z) = . (17) p(z; wi ) Then PT =
1 X N M (M − 1)
IRd
˙ K(z) D(z) dz.
(18)
We solve (16) by iteration: 1. Initialize W, say W(0) = IM . 2. For l = 0, 1, ... to convergence, compute P(l) from (18), and update W(l) to W(l+1) = G(W(l) ) = (P(l) PT(l) )−1/2 P(l)) .
4 4.1
Results Image Segmentation
The image of Lizard shown in Subplot A of Fig. 1 is considered to be relatively hard to segment due to the multimodality of the pdf related to the object class. Moreover, the intensity distributions of the object and background classes of the image are very similar, which makes it impossible to segment the image based on gray-level information alone. To overcome this difficulty, the input image I(u) was defined to be the bivariate image of the partial derivatives of Lizard, which are shown in Subplots B and C of the figure.
Fig. 1. (Subplot A) Original image of Lizard; (Subplot B) Row-derivative of the image; (Subplot C) Column-derivative of the image; (Subplot D) Initial segmentation; (Subplot E) Separation by the Bhattacharyya flow; (Subplot F) Separation by the K-L flow
On Separation of Signal Sources Using Kernel Estimates
87
The initial segmentation of Lizard and its segmentation obtained using the proposed method are shown in Subplots D and E of Fig. 1, respectively. For the sake of comparison, we have also segmented the image of Lizard using the active contour that maximized the Kullback-Leibler (K-L) divergence between the empirical pdf’s of the object and background classes. The resulting segmentation is shown in Subplot F of Fig.1. It is obvious that the proposed approach (i.e., the one that exploits the Bhattacharyya metric) is the best performer here. It is worthwhile noting that the relatively worse performance of the image segmentation using the K-L divergence seems to be stemming from the properties of the functions involved in its definition, viz. of the logarithm. In particular, the latter is known to be very sensitive to variations of its argument in vicinity of relatively small values of the latter. Moreover, the logarithm is undefined at zero, which makes computing the K-L gradient flow prone to the errors caused by inaccuracies in estimating the tails of probability densities. On the other hand, the square root is a well-defined function in vicinity of zero. Moreover, for relatively small values of its argument, the variability of the square root is considerably smaller than that of the logarithm. As a result, the Bhattacharyya flow is much less susceptible to the influence of the inaccuracies mentioned above.
Fig. 2. (Subplots A1-A3) Original image sources; (Subplot B1-B3) Corresponding mixtures; (Subplot C1-C3) Estimated sources
88
4.2
O. Michailovich and D. Wiens
Blind Source Separation
Subplots A1-A3 of Fig. 2 show the original source images which have been used to test the performance of the proposed separation methodology. The corresponding mixtures obtained using a random mixing matrix A are shown in Subplots B1-B3 of the same figure, whereas Subplots C1-C3 of Fig.2 show the source images estimated by applying 50 iterations of the fixed point algorithm described in Section 3. One can see that the algorithm results in virtually perfect reconstruction of the image sources. For this case, the average interference-to-signal ratio (ISR) was found to be equal to 0.0024, while minimizing the mutual information between the estimated sources resulted in ISR equal to 0.036.
5
Conclusions
The present study has demonstrated the applicability and practicability of the method for separating different components of a data signal based on the notion of a distance between probability distributions. The latter was defined by means of the Bhattacharyya coefficient which was shown to be advantageous over the K-L divergence (and, hence, over the related criterion of mutual information) in practical settings, in which class-conditional densities have to be estimated in a non-paramentric manner. Additionally, the versatility of the proposed criterion was demonstrated via its application to the problems of blind separation of both geometrically and algebraically mixed sources. Thus, from a certain perspective, the proposed method can be seen as unifying for the problems of both classes.
References 1. Duda, R., Hart, R., Stork, D.: Pattern Recognition. Willey, New York (2001) 2. Han, J., Kamber, M.: Data mining: Concepts and techniques. Morgan Kaufmann, San Francisco (2001) 3. Tuzlukov, V.: Signal detection theory. Springer, Heidelberg (2001) 4. Sapiro, G.: Geometric partial differential equations and image analysis. Cambridge University Press, Cambridge (2001) 5. Hyvarinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley and Sons, Chichester (2001) 6. Haykin, S.: Blind deconvolution. Prentice Hall, Englewood Cliffs (1994) 7. Gelfand, I., Fomin, S., Silverman, R.: Calculus of variations. Prentice-Hall, Englewood Cliffs (1975) 8. Silverman, B.: Density estimation for statistics and data analysis. CRC Press, Boca Raton (1986) 9. Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 78, 99–109 (1943) 10. Kim, L., Fisher, J., Yezzi, A., Cetin, M., Willsky, A.: A nonparametric statistical method for image segmentation using information theory and curve evolution IEEE Proc. Image Processing 4(10), 1486–1502 (2005)
Shifted Independent Component Analysis Morten Mørup, Kristoffer H. Madsen, and Lars K. Hansen Technical University of Denmark Informatics and Mathematical Modelling Richard Petersens Plads, Building 321 DK-2800 Kgs. Lyngby, Denmark {mm,khm,lkh}@imm.dtu.dk
Abstract. Delayed mixing is a problem of theoretical interest and practical importance, e.g., in speech processing, bio-medical signal analysis and financial data modelling. Most previous analyses have been based on models with integer shifts, i.e., shifts by a number of samples, and have often been carried out using time-domain representation. Here, we explore the fact that a shift τ in the time domain corresponds to a multiplication of e−iωτ in the frequency domain. Using this property an algorithm in the case of sources≤sensors allowing arbitrary mixing and delays is developed. The algorithm is based on the following steps: 1) Find a subspace of shifted sources. 2) Resolve shift and rotation ambiguity by information maximization in the complex domain. The algorithm is proven to correctly identify the components of synthetic data. However, the problem is prone to local minima and difficulties arise especially in the presence of large delays and high frequency sources. A Matlab implementation can be downloaded from [1].
1
Introduction
Factor analysis is widely used to reconstruct latent effects from mixtures of multiple effects based on the model Xn,m =
An,d Sd,m + En,m ,
(1)
d
where En,m is additive noise. However, this decomposition is not unique since = AQ and S = Q−1 S yields same approximation as A, S. Consequently, conA straints have been imposed such as Varimax rotation for Principal Component Analysis (PCA) [2], statistical independence of the sources S as in Independent Component Analysis (ICA)[3,4]. A related strategy is sparse coding where the objective of minimizing the error is combined with a term penalizing the non-sparsity of S [5]. Factor analysis in the setting of ICA is often illustrated by the so-called cocktail party problem. Here mixtures of several speakers are recorded in several microphones forming the measured signal X. The task is to identify the sources M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 89–96, 2007. c Springer-Verlag Berlin Heidelberg 2007
90
M. Mørup, K.H. Madsen, and L.K. Hansen
S of each original speaker. However, even in an anechoic environment the mixing model is typically not accurate because of different delays in the microphones. Consider two microphones placed at distance L and L + h from a given speaker. Under normal atmospheric conditions, the speed of sound is approximately c = 344 m/s while a typical sampling rate is fs = 22 kHz. Then the delay in samples between the two microphones is given by: #samples= fsch such that the delay increases linearly with the difference in distance. Consequently, a distance of 1 cm gives a delay of 0.6395 samples while h = 1m leads to a delay of 63.95 samples. Harshman and Hong [6] proposed a generalization of the factor models in which the underlying sources have specific delays when they reach the sensors. The model is called shifted factor analysis (SFA), and reads An,d Sd,m−τ n,d + En,m . (2) Xn,m = d
In real acoustic environments we expect echoes due to paths that are created by reflection off surfaces. To account for general delay mixing effects, the ICA model has been generalized to convolutive mixtures, see e.g., [7,8,9] Xn,m =
Aτn,d Sd,m−τ + En,m .
(3)
τ,d
Here Aτ is a filter that accounts for the presence of each source in the sensors at time delay τ . The shifted factor model, thus is a special case of the convolutive model where the filter coefficients Aτn,d = An,d if τ n,d = τ else Aτn,d = 0. In fact shifted mixtures are also seen in many other contexts. For instance, astronomy where star motion Doppler effects induce frequency red shifts that can be modelled using SFA. Here we will focus on the delayed source model. In [6] strong support was found for the conjecture that the incorporation of shifts can strengthen the model enough to make the parameters identifiable up to scaling and permutation (essential uniqueness). We will demonstrate that this conjecture is not correct when allowing for arbitrary shifts. Indeed, the model is, as for regular factor analysis, ambiguous. In [10] an algorithm was proposed to estimate the model. However, the algorithm has the following drawbacks. 1. 2. 3. 4.
All potential shifts have to be specified in the model. Exhaustive integer search for the delays is expensive. The model only accounts for shifts by whole samples. The model is in general not unique.
Prior to the work of [6,10] Bell and Sejnowski [4] sketched how to handle time delays in networks based on a model similar to equation 2. This was further explored in [11]. Although their algorithms derive gradients to search for the delays (alleviating the first two drawbacks above) the models are still based on pure integer delays. In [12] a different model based on equally mixed sources, i.e. A = 1, formed by moving averages incorporated non-integer delays by signal interpolation. Yeredor [13] solved the SFA model by joint diagonalization of
Shifted Independent Component Analysis
91
Fig. 1. Example of activities obtained (black graph) when summing three components (gray, blue dashed and red dash-dotted graphs) each shifted to various degrees (given in samples by the colored numbers). Clearly, the resulting activities are heavily impacted by the shifts such that a regular instantaneous ICA analysis would be inadequate.
the source cross spectra based on the AC-DC algorithm with non-integer shifts for the 2 × 2 system. This approach was extended to complex signals in [14]. The algorithm is least squares optimal for equal number of sensors and sources. More sensors than sources is not a problem for conventional ICA; we simply reduce dimension by variance decomposition, this procedure is exact for noiseless mixing. Due to the delays projection based dimensional reduction will not reproduce the simple single delay structure, but rather lead to a more general convolutive mixture. We will therefore aim at an algorithm for finding a shift invariant subspace. Hence, solve equation 2 by use of the fact that a shift τ in the time domain can be approximated by multiplication by the complex coefficients e−iωτ in the frequency domain. This alleviates the first three drawbacks of the SFA algorithm. We will denote this algorithm a Shift Invariant Subspace Analysis (SISA). To further deal with shift and rotation ambiguities, we impose independence in the complex domain based on information-maximization (IM) [4]. Hence, we form an algorithm for ICA with shifted sources (SICA). Notice, that algorithms for ICA in the complex domain without shifts have previously been derived, see for instance [9,15] and references therein.
2
Method and Results
denotes the In the following U will denote a matrix in the time domain, while U corresponding matrix in the frequency domain. U and U denotes 3-way arrays in the time and frequency domains respectively. Furthermore, U • V denotes −1 such that the direct product, i.e. element-wise multiplication. Also, ω = 2π fM (f ) = U • e−i2π U 2.1
f −1 M τ.
Finally, the ith row of a matrix will be denoted Ui,: .
Shift Invariant Subspace Analysis (SISA)
In the following we will device an algorithm to find a shift invariant subspace based on the SFA model. Consider the SFA model and its frequency transformed f −1 d,f e−i2π M τ n,d + E n,f = n,f . Xn,m = An,d Sd,m−τ n,d + En,m , X An,d S d
d
(4)
92
M. Mørup, K.H. Madsen, and L.K. Hansen
In matrix notation this can be stated as (f ) S f. f = A f + E X
(5)
Due to Parseval’s identity the following holds 1 n,f 2 . Cls = En,m 2F = M E F n,m
(6)
n,f
Thus, minimizing the least square error in the time and frequency domain is equivalent. The algorithm will be based on alternatingly solving for A, S and τ . S update: According to equation 5, Sf can be estimated as f = A f. (f )† X S
(7)
Although, S is updated in the frequency domain the updated version has to remain real when taking the inverse FFT. For S to be real valued the following has to hold ∗ , M−f +1 = S (8) S f where ∗ denotes complex conjugate. This constraint is enforced by updating the first M/2 + 1 elements, i.e. up to the Nyquist frequency, while setting the remaining elements according to equation 8. d,f to the (n) denote the delayed version of the source signal S A update: Let S d,f (n) = S d,f e−i2π nth channel, i.e. S d,f
f −1 M τ n,d .
Then equation 2 can be restated as
Xn,: = An,: S(n) + En,: ,
(9)
This is the regular factor analysis problem giving the update †
An,: = Xn,: S(n) .
(10)
τ update: The least square error for the model stated in equation 5, is given by 1 f − A (f ) S f − A (f ) S f )H (X f ), Cls = M (X (11) f H
denotes the conjugate transpose. Define TN D×1 = vec(τ ), i.e. the where vectorized version of the matrix τ such that Tn+(d−1)N = τ n,d . Let further n,d,f = A (f ) S Q n,d d,f ,
f . f = X f − A (f ) S E
Then the gradient of Cls with respect to τ n,d is given as ∂Cls ∂Cls −1 ∗ ] n,d,f E gn+(d−1)N = ∂Tn+(d−1)N = ∂τ = 2ω[Q n,f M n,d f
(12)
(13)
Shifted Independent Component Analysis
93
The Hessian has the following structure 2 −2 ∗ ] if n = n ∧ d = d ω [Qn,d,f Q n ,d ,f Hn+(d−1)N,n+(d −1)N = −2 M 2 f ∗ ∗ ω [ Q ( Q n,d,f n ,d ,f + En ,f )] if n = n ∧ d = d f M (14) As a result, τ can be estimated using the Newton-Raphson method T ← T − ηH−1 g,
(15)
where η is a step size parameter that is tuned to keep decreasing the cost function. The above iterative update for τ is sensitive to local minima. Thus, to improve the algorithm from being stuck in suboptimal solutions τ was re-estimated by the following cross-correlation procedure every 10th iteration. Let (f ) S n,f = X n,f − A R (16) n,d d,f , d=d
i.e. the signal at the nth sensor at frequency f when projecting all but the d The cross-correlation between the d source and nth sensor is source out of X. ∗ S given as cf = R n,f d ,f , such that τ n,d can be estimated as t = arg max |cm |, m
τ n,d = t − (M + 1),
An,d =
ct . Sd ,: STd ,:
(17)
I.e. as the delay corresponding to maximum cross-correlation between the sensor and source. The value of An,d corresponding to this delay is also given above. 2.2
SISA Is Not Unique
According to equation 5, the reconstructed signal in the complex domain is given f −1 f = A (f ) W (f )−1 S f .Such that W (f ) S (f ) W (f ) = W • e−i2π M τˆ is a f ≈ A as X (f ) is also a rotation, rotation, scaling and shift matrix. Assume the inverse of W (f )−1 = V • e−i2π scaling and shift matrix, i.e. W we find d
f −1 Wd,d Vd ,d e−i2π M (ˆτ d,d +ˇτ d ,d )
f −1 ˇ M τ .
=
(f )−1 = I, (f ) W Since W
0 for d = d ∀ f 1 for d = d ∀ f
(18)
From f = 1 we obtain the relation V = W−1 . For the remaining frequencies this expression can only be valid if τˆ dd + τˇ d d = 0 (diagonal elements) and τˆ dd + τˇ d d = kdd (off diagonal elements) where kdd denotes an arbitrary constant. The first relation gives the constraint that τˆ = −ˇ τ T . The second relation further constraints all the elements of the columns of τˆ to be equal. f −1 (f ) = [W diag(e−i2π M τ )]. Where τ is a Thus the ambiguity is given by W vector describing the shift ambiguity.
94
M. Mørup, K.H. Madsen, and L.K. Hansen
Fig. 2. Results obtained by a shift invariant subspace analysis (SISA). Left panel: the true factors forming a synthetic data set. To the left, the strength of the mixing A of each source is indicated in gray color scale. In the middle, the three sources are shown and to the right is given the time delays of each source to each channel. Right panel: The estimated factors from the SISA analysis. Although, all the variance is explained the decomposition has not identified the true underlying components but an ambiguous mix. Clearly, as for regular factor analysis the SISA is not unique.
2.3
Shifted Independent Component Analysis (SICA)
A common approach to ICA is the maximum likelihood (ML) method [16] which corresponds to the approach of maximizing information proposed in [4]. In the framework of ML a non-gaussian distribution on the sources is assumed such that ambiguity can be resolved up to the trivial ambiguities of scale, permutation and source shifting relative to the time delays. (f ) S f , i.e. the sources at frequency f when transformed f = W Define, U according to the rotation and shift ambiguity described in the previous section. The ambiguity can be resolved by maximizing the log-likelihood assuming the d,f | f ) ∝ e−|U , i.e. (non-gaussian) Laplace distribution p(U f ) f |W, τ ) = (f ) )|p(W (f ) S f |W, τ ) = p(S |det(W (19) p(S f
f
Such that the log-likelihood as a function of W and τ becomes f |d (f ) )| − (f ) S L(W, τ ) = ln | det(W |W f
(20)
d
By maximizing L(W, τ ) W and τ is estimated and a new unambiguous S solu f . The corresponding mixing and delays can be estif = W (f ) S tion found by S mated alternating between the A and τ update. We initialized A as A = AW−1 and τ i,d by the cross-correlation procedure.
Shifted Independent Component Analysis
95
Fig. 3. Result obtained using the SICA on the decomposition found using SISA. By imposing independence, e.g., requiring the amplitudes in the frequency domain to be sparse, the rotation and shift ambiguity inherited in the model is resolved. Clearly the true underlying components and their respective mixing are correctly identified. However, a local minimum has been found, resulting in errors in the estimation of the delays particularly for the first component.
3
Discussion
Traditionally, ICA analysis is based on subspace analysis often using singular value decomposition (SVD). The sources are then found by rotating the vectors spanning the subspace according to a measure of independence. Similarly, we derived the SISA algorithm to find a shift invariant subspace by alternating least squares. Shift and rotation ambiguities were solved by imposing independence on the amplitudes of the frequency transform of the sources. While SVD has a closed form solution the SISA algorithm is non-convex. Estimating both A, S and each delay in τ using the cross-correlation procedure has a closed form solution for fixed values of τ , S and A. While the cross correlation procedure only finds integer delays the Newton-Rhapson procedure can estimate the noninteger delays. The cross-correlation procedure greatly reduces the algorithm’s vulnerability to local minima, however due to the alternating least squares estimation the problem cannot be circumvented completely. Furthermore, the problem becomes increasingly difficult for high frequency sources and large shifts due to additional local minima. In an example we saw this happen: The SICA algorithm failed in correctly identifying the delays of the first component; the component with the highest frequencies. A multistart strategy was invoked, we choose the best of ten random initializations to obtain a good initial solution for the estimation of the shift invariant subspace. While our algorithm was based on likelihood maximization, Yeredor [13] developed an algorithm based on joint diagonalization. The present SISA is potentially useful as a preprocessing step for this latter algorithm when estimating less sources than sensors.
96
M. Mørup, K.H. Madsen, and L.K. Hansen
Previous work based on integer shifts conjectured the decomposition to be unique [6]. When using integer shifts some shifts might perform better than others due to a better integer rounding error. Hence, this might be why the integer shifts formed seemingly unique solutions. However, as demonstrated in figure 2 the shifted factor analysis model is not in general unique. But, by imposing independence unique solutions can be obtained up to trivial permutation, scaling and specific onset relative to the delays of the sources as demonstrated in figure 3. The shift/delay model may prove useful for a wide range of data where ICA already has been employed. Furthermore, the extra information of delays can be useful for spatial source localization when combined with information of position of the sensors. Future work will focus on implementing additional constraints such as non-negativity and attempt to further improve the identifiability in the presence of many local minima. The current algorithm can be downloaded from [1].
References 1. Mørup, M., Madsen, K.H.: Algorithm for sica (2007), www2.imm.dtu.dk/pubdb/views/publication_details.php?id=5206 2. Kaiser, H.F.: The varimax criterion for analytic rotation in factor analysis. Psychometrica 23, 187–200 (1958) 3. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994) 4. Bell, A.J., Sejnowski, T.J.: An information maximization approach to blind source separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995) 5. Olshausen, B.A., Field, D.: Emergence of simple-cell receptive field propertises by learning a sparse code for natural images. Nature 381, 607–609 (1996) 6. Harshman, R., Hong, S., Lundy, M.: Shifted factor analysis—part i: Models and properties. Journal of Chemometrics 17, 363–378 (2003) 7. Attias, H., Schreiner, C.: Blind source separation and deconvolution: the dynamic component analysis algorithm. Neural Computation 10(6), 1373–1424 (1998) 8. Parra, L., Spence, C., Vries, B.: Convolutive blind source separation based on multiple decorrelation. In: IEEE Workshop on Neural Networks and Signal Processing, pp. 23–32 (1998) 9. Anemuller, J., Sejnowski, T.J., Makeig, S.: Complex independent component analysis of frequency-domain electroencephalographic data. Neural Networks 16(9), 1311–1323 (2003) 10. Harshman, R., Hong, S., Lundy, M.: Shifted factor analysis—part ii: Algorithms. Journal of Chemometrics 17, 379–388 (2003) 11. Torkkola, K.: Blind separation of delayed sources based on information maximization. Acoustics, Speech, and Signal Processing. ICASSP-96 6, 3509–3512 (1996) 12. Emile, B., Comon, P.: Estimation of time delays between unknown colored signals. Signal Processing 68(1), 93–100 (1998) 13. Yeredor, A.: Time-delay estimation in mixtures. ICASSP 5, 237–240 (2003) 14. Yeredor, A.: Blind source separation in the presence of doppler frequency shifts. ICASSP 5, 277–280 (2005) 15. Cardoso, J.F., Tulay, A.: The maximum likelihood approach to complex ica. ICASSP, pp. 673–676 (2006) 16. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley and Sons, Chichester (2001)
Modeling and Estimation of Dependent Subspaces with Non-radially Symmetric and Skewed Densities Jason A. Palmer1 , Ken Kreutz-Delgado2, Bhaskar D. Rao2 , and Scott Makeig1 1
Swartz Center for Computational Neuroscience University of California San Diego, La Jolla, CA 92093 {jason,scott}@sccn.ucsd.edu 2 Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 92093 {kreutz,brao}@ece.ucsd.edu
Abstract. We extend the Gaussian scale mixture model of dependent subspace source densities to include non-radially symmetric densities using Generalized Gaussian random variables linked by a common variance. We also introduce the modeling of skew in source densities and subspaces using a generalization of the Normal Variance-Mean mixture model. We give closed form expressions for subspace likelihoods and parameter updates in the EM algorithm.
1 Introduction The Gaussian scale mixture representation can be extended to vector subspaces to yield a model of non-affine dependency, sometimes referred to as “variance dependency” [8]. Hyv¨arinen [8,9] has recently proposed such a model for Independent Subspace Analysis of images. A similar approach is developed by Eltoft, Kim et al. [11,6], which is referred to as Independent Vector Analysis (IVA). In the IVA model, the EM algorithm is used with a particular case of the multivariate Gaussian scale mixture involving a Gamma mixing density. In [10] a method is proposed for convolutive blind source separation in reverberative environments using a frequency domain approach with sources having variance (scale) dependency across frequencies. Typically in the frequency domain approach to blind deconvolution, the “permutation problem” arises when the signals are unmixed separately at each frequency. Due to the permutation ambiguity inherent in ICA [1], the frequency components of each source are output in arbitrary order at each frequency, and some matching heuristic must be employed to reconstruct the complete spectra of the sources. The IVA model allows the modeling of dependency of the frequency components of sources, while maintaining the mutual independence of the sources. Variance dependency also arises in EEG/MEG analysis. In particular, the electromagnetic signals generated by muscles in the scalp, face, and ears will commonly activate together in various facial expressions. In this case, the individual muscle signals are not related or dependent in phase, but their variance increases and decreases together as the components are activated and deactivated together. Variance dependency may also exist among cortex regions that are simultaneously active in certain contexts. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 97–104, 2007. c Springer-Verlag Berlin Heidelberg 2007
98
J.A. Palmer et al.
The densities employed in models proposed previously for speech use only a particular dependent subspace density model, which may limit the flexibility of the model in application to more general domains such as communications and biological signal processing. We propose a general method for constructing multivariate Gaussian scale mixtures, giving an example of a multivariate dependent logistic density. We also propose a scale mixture of Generalized Gaussians model, in which a generalized Gaussian random vector with independent components, is multiplied by a common scalar variance parameter, which is distributed Generalized Inverse Gaussian. This yields a generalization of the generalized hyperbolic density of Barndorff-Nielsen [3]. Finally we show how to use the Normal variance-mean mixtures to model skew in dependent subspaces. The location and “drift” parameters can be updated in closed form using the EM algorithm and exploiting the conditional Gaussianity and closed form formula for the posterior moment in terms of derivatives of the multivariate density function.
2 General Dependent Gaussian Scale Mixtures In this section we show how general dependent multivariate densities can be derived using scalar Gaussian scale mixtures, x = ξ 1/2 z where z is a standard Normal random variable, and ξ is a non-negative random variable. 2.1 Example Densities Examples of Gaussian scale mixtures include the generalized Gaussian density, which has the form, 1 −|x|ρ GG(x; ρ) = 1 e 2Γ (1 + ρ ) It is a Gaussian scale mixture for 0 < ρ ≤ 2. The scale mixing density is related to a positive alpha stable density of order ρ/2. The generalized Cauchy has the form, GC(x; α, ν) =
1 αΓ (ν + 1/α) 2Γ (ν)Γ (1/α) (1 + |x|α )ν+1/α
The Generalized Cauchy is a Gaussian scale mixture for ν > 0 and 0 < α < 2. The scale mixing density is related to the Gamma density. The generalized Logistic, also called the symmetric Fisher’s z distribution [3], has the form, e−αx Γ (2α) GL(x; α) = 2 Γ (α) (1 + e−x )2α The Generalized Logistic is a Gaussian scale mixture for all α > 0. The scale mixing density is related to the Kolmogorov-Smirnov distance statistic [2,3,7].
Modeling and Estimation of Dependent Subspaces
99
2.2 Multidimensional Analogues If x is distributed according to the Gaussian scale mixture density p(x), then, ∞ √ 1 −1 1 p( x) = ξ −1/2 e− 2 ξ x p(ξ)dξ (2π)1/2 0
(1)
We can construct a random vector by multiplying the same scalar random variable ξ 1/2 by a Gaussian random vector, x = ξ 1/2 z where z ∼ N (0, I). For the density of x we then have, ∞ 2 1 −1 1 ξ −d/2 e− 2 ξ x p(ξ)dξ p(x) = (2π)d/2 0 If ξ −1 is a Gamma random variable, then the density of x can be written in terms of the modified Bessel function of the second kind [6]. In general, taking the kth derivative of both sides of (1), we find, (−2)−k ∞ −k−1/2 − 1 ξ−1 x dk √ p( x) = ξ e 2 p(ξ)dξ dxk (2π)1/2 0 Thus, if d is odd, then with k = (d − 1)/2, √ π −(d−1)/2 (−D)(d−1)/2 p( x) =
1 (2π)d/2
∞
ξ −d/2 e− 2 ξ 1
−1
x
p(ξ)dξ
0
and we can write the density of p(x) d odd :
√ p(x) = π −(d−1)/2 (−D)(d−1)/2 p( x )x=x2
(2)
For even d, the density of p(x) can be written formally in terms of the Weyl fractional derivative. However as the fractional derivative is is not generally obtainable in closed form, we consider a modification of the original univariate scale density p(ξ), ξ −1/2 p(ξ) p˜(ξ) = ∞ −1/2 p(ξ)dξ 0 ξ
√ With this modified scale density, the density of x evaluated at x becomes, ∞ √ 1 −1 Z e− 2 ξ x p˜(ξ)dξ p( x) = (2π)1/2 0 where,
Z=
∞
(3)
ξ −1/2 p(ξ)dξ
0
Proceeding as we did for odd d, taking the kth derivative of both sides of (3), with k = d/2, we get, √ √ (4) d even : p(x) = Z −1 2π −(d−1)/2 (−D)d/2 p( x )x=x2
100
J.A. Palmer et al.
2.3 Posterior Moments of Gaussian Scale Mixtures To use scale mixtures in the EM context, it is necessary to calculate posterior moments of the scaling random variable. This section indicates how this is accomplished [5]. Differentiating under the (absolutely convergent) integral we get, ∞ ∞ d p (x) = p(x|ξ)p(ξ)dξ = − ξ −1 xp(x, ξ) dξ dx 0 0 ∞ −1 ξ p(ξ|x) dξ = −xp(x) 0
Thus, with p(x) = exp(−f (x)), we see that, ∞ f (xi ) p (xi ) = E(ξi−1 |xi ) = ξi−1 p(ξi |xi ) dξi = − xi p(xi ) xi 0
(5)
Similar formulae can be derived for higher order posterior moments, and moments of multivariate scale parameters. These results are used in deriving EM algorithms for fitting univariate and multivariate Gaussian scale mixtures. 2.4 Example: 3D Dependent Logistic Suppose we wish to formulate a dependent Logistic type density on R3 . The scale mixing density in the Gaussian scale mixture representation for the Logistic density is related to the Kolmogorov-Smirnov distance statistic [2,3,7], which is only expressible in series form. However, we may determine the multivariate density produced from the product, x = ξ 1/2 z where x, z ∈ R3 , and z ∼ N (0, I). Using the formula (2) with d = 3, we get, sinh 12 x 1 p(x) = 8π x cosh3 12 x
3 Non-radially Symmetric Dependency Models A possible limitation of the Gaussian scale mixture dependent subspace model is the implied radial symmetry of vectors in the subspace, which leads to non-identifiability of features within the subspace—only the subspace itself can be identified. However, a similar approach using multivariate Generalized Gaussian scale mixtures can be developed, in which the multivariate density becomes a function of the p-norm of the subspace vector rather than the radially symmetric 2-norm, maintaining the directionality and identifiability of the within-subspace features, while preserving their (non-affine) dependence. The mixing density of the generalized hyperbolic distribution is the generalized inverse Gaussian, which has the form, (κ/δ)λ λ−1 ξ exp − 12 δ 2 ξ −1 + κ2 ξ (6) N † (δ, κ, λ) = 2Kλ (δκ)
Modeling and Estimation of Dependent Subspaces
101
where Kλ is the Bessel K function, or modified Bessel function of the second kind. The moments of the generalized inverse Gaussian [6] are given by, r δ Kλ+r (δκ) E ξr = (7) κ Kλ (δκ) The isotropic generalized hyperbolic distribution [3] in dimension d, 2 + x2 κ d/2 δ K λ−d/2 1 κ GH(δ, κ, λ) = d/4−λ/2 (2π)d/2 δ λ Kλ (δκ) 2 2 δ + x
(8)
is derived as a Gaussian scale mixture with N † (δ, κ, λ) mixing density. Now, for a generalized Gaussian scale mixture, ∞ −1 1 (9) p(x) = ξ − i pi exp −ξ −1 i |xi |pi p(ξ) dξ Z(p) 0 where, Z(p) = 2 d
d
Γ (1 + 1/pi )
i=1
with N † mixing density p(ξ), the posterior density of ξ given x is also N † ,
p¯ δ 2 + 2 xp , κ, λ − d/¯ p p(ξ|x) = N † where p¯ is the harmonic mean d/
i
(10)
p−1 i , and
xp
d
1/p¯ |xi |
pi
i=1
For x we then get the anisotropic distribution,
p¯ 2 κd/p¯ Kλ−d/p¯ κ δ + 2 xp 1 p(x; δ, κ, λ, p) = (d/p−λ)/2 ¯ Z(p) δ λ Kλ (δκ) 2 p¯ δ + 2 xp
(11)
Using (7) and (10), we have, 2 + 2 xp¯ κ K δ λ−d/ p−1 ¯ p κ E ξ −1 |x = p ¯ p ¯ δ 2 + 2 xp Kλ−d/p¯ κ δ 2 + 2 xp
(12)
The EM algorithm does not require that the complete log likelihood be maximized at each step, but only that it be increased, yielding the generalized EM (GEM) algorithm [4,13]. We employ this method here to increase the complete likelihood in (9) (see [13,12]).
102
J.A. Palmer et al.
4 Skew Models 4.1 Construction of Multivariate Skew Densities from Gaussian Scale Mixtures Given a Gaussian scale mixture x = ξ 1/2 z, ∞ 1 p(x) = ξ −d/2 exp − 12 ξ −1 xT Σ −1 x p(ξ) dξ d/2 1/2 (2π) |Σ| 0 we have, trivially, for arbitrary β, 1 p(x) exp(β T Σ −1 x) = × (2π)d/2 |Σ|1/2 ϕ 12 β T Σ −1 β ∞ 0
ξ −d/2 exp −
1 2
ξ −1 xT Σ −1 x + β T Σ −1 x − 12 ξ β T Σ −1 β
p(ξ) exp
ϕ
1 2
ξ β T Σ −1 β
1 T −1 β Σ β 2
dξ
(13) where ϕ(t) = E exp tξ is the moment generating function of ξ. Now (13) can be written, ∞ 1 2 (14) ξ −d/2 exp − 12 ξ −1 x − ξβΣ −1 p˜(ξ; β) dξ p˜(x) = (2π)d/2 |Σ|1/2 0 where, p˜(x) =
p(x) exp(β T Σ −1 x) , 2 ϕ 12 βΣ −1
p˜(ξ; β) =
p(ξ) exp 12 ξ β2Σ −1 1 2 ϕ 2 βΣ −1
We have thus constructed a skewed density p˜(x) in terms of the isotropic density p(x) and the moment generating function ϕ of the scale mixing density p(ξ). The skewed density is a that of a location-scale mixture [3] of the Gaussian z ∼ N (0, Σ , x = ξ 1/2 z + ξ β. 4.2 EM Algorithm Posterior Updates We now assume arbitrary location parameter μ, along with drift β, and structure matrix Σ. To use the EM algorithm with the Gaussian complete log likelihood in (14), we need to calculate posterior expectation of ξ −1 . We do this using the method of §2.3. If we take the derivative of − log p(x − μ) with respect to 12 x − μ2Σ −1 , then we get,
∂ − log p(x − μ, ξ)dξ 2 ∂ 12 x − μΣ −1 −1 −1 ξ p(x − μ, ξ) dξ ξ p˜(x − μ, ξ) dξ = = = E(ξ −1 |x) p(x − μ, ξ) dξ p˜(x − μ, ξ) dξ Thus, from (2) and (4), with k d/2 (the greatest integer less than d/2) we have, p(k+1) x − μΣ −1 −1 −1 E(ξ |x) = x − μΣ −1 p(k) x − μΣ −1
Modeling and Estimation of Dependent Subspaces
103
where p(k) is kth derivative of the univariate scale mixture p(x). 4.3 Closed Form Parameter Updates Given N observations {xk }N k=1 , the μ that maximizes the complete log likelihood is found to be, 1 k γk xk − β N μ= (15) 1 k γk N where γk = E(ξ −1 |xk ). The estimation equation to be solved for β, which does not involve the posterior estimate of ξk , is, 2 ϕ 12 βΣ −1 (16) β = c−μ 2 ϕ 12 βΣ −1 where c = N1 k xk . This gives β in terms of μ up to a scale factor. Given μ, the 2 optimal β, denoted β ∗ , may be found by first determining ζ 12 β ∗ Σ −1 from, h(ζ)
ϕ (ζ) ϕ(ζ)
2 ζ =
1 2
2
c − μΣ −1
assuming that the univariate function h is invertible. Then β ∗ is given as, β∗ =
ϕ(ζ) c−μ ϕ (ζ)
Given β ∗ , we may determine the optimal μ∗ by substituting β ∗ into (15). Repeated iteration constitutes a coordinate ascent EM algorithm for μ and β. An alternative method suggests itself: if we fix the norm of β in the mixing density, then we can solve for new estimates of μ and β simultaneously. Let, ϕ 12 β2Σ −1 1 1 a N k γk , b N k γk xk , τ 1 2 ϕ 2 βΣ −1 Then from (15) and (16), we have, a μ∗ + β ∗ = b μ∗ + τ β ∗ = c Solving for the components μi , βi , i = 1, . . . , d, we get, ∗ τ b i − ci b a 1 μi , = i ⇒ μ∗i = βi∗ ci 1 τ aτ − 1
βi∗ =
aci − bi aτ − 1
For the structure matrix, Σ, setting the complete log likelihood gradient to zero, we get, 2 T T Σ = N1 k γk (xk − μ)(xk − μ) − N k (xk − μ)β −1 T −1 −1 T ββ . = N1 k γk (xk − μ − γk β)(xk − μ − γk β) − k γk
104
J.A. Palmer et al.
5 Conclusion We have shown how to derive general multivariate Gaussian scale mixtures in terms of scalar Gaussian scale mixtures, and how to optimize them using an EM algorithm. We generalized the spherically (or ellipsoidally) symmetric Gaussian scale mixture by introducing a generalization of Barndorff-Nielsen’s generalized hyperbolic density using Generalized Gaussian scale mixtures, yielding a multivariate dependent anisotropic model. We also introduced the modeling of skew in ICA sources, deriving a general form of skewed multivariate Gaussian scale mixture, and an EM algorithm to update the location, drift, and structure parameters.
References 1. Amari, S.-I., Cichocki, A.: Adaptive blind signal processing—neural network approaches. Proceedings of the IEEE 86(10), 2026–2047 (1998) 2. Andrews, D.F., Mallows, C.L.: Scale mixtures of normal distributions. J. Roy. Statist. Soc. Ser. B 36, 99–102 (1974) 3. Barndorff-Nielsen, O., Kent, J., Sørensen, M.: Normal variance-mean mixtures and z distributions. International Statistical Review 50, 145–159 (1982) 4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977) 5. Dempster, A.P., Laird, N.M., Rubin, D.B.: Iteratively reweighted least squares for linear regression when errors are Normal/Independent distributed. In: Krishnaiah, P.R. (ed.) Multivariate Analysis V, pp. 35–57. North Holland Publishing Company, Amsterdam (1980) 6. Eltoft, T., Kim, T., Lee, T.-W.: Multivariate scale mixture of Gaussians modeling. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 799–806. Springer, Heidelberg (2006) 7. Gneiting, T.: Normal scale mixtures and dual probability densities. J. Statist. Comput. Simul. 59, 375–384 (1997) 8. Hyv¨arinen, A., Hoyer, P.O.: Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation 12, 1705– 1720 (2000) 9. Hyv¨arinen, A., Hoyer, P.O., Inki, M.: Topographic independent component analysis. Neural Computation 13(7), 1527–1558 (2001) 10. Kim, T., Attias, H., Lee, S.-Y., Lee, T.-W.: Blind source separation exploiting higher-order frequency dependencies. IEEE Transactions on Speech and Audio Processing, 15(1) (2007) 11. Kim, T., Eltoft, T., Lee, T.-W.: Independent vector analysis: An extension of ICA to multivariate components. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 165–172. Springer, Heidelberg (2006) 12. Palmer, J.A., Kreutz-Delgado, K., Makeig, S.: Super-Gaussian mixture source model for ICA. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 854–861. Springer, Heidelberg (2006) 13. Palmer, J.A., Kreutz-Delgado, K., Wipf, D.P., Rao, B.D.: Variational EM algorithms for nongaussian latent variable models. In: Advances in Neural Information Processing Systems, NIPS 2005, MIT Press, Cambridge (2006)
On the Relationships Between Power Iteration, Inverse Iteration and FastICA Hao Shen1,2 and Knut H¨ uper1,2 Department of Information Engineering Research School of Information Sciences and Engineering, The Australian National University, Canberra ACT 0200, Australia 2 Canberra Research Laboratory, National ICT Australia Locked Bag 8001, Canberra ACT 2612, Australia
[email protected],
[email protected] 1
Abstract. In recent years, there has been an increasing interest in developing new algorithms for digital signal processing by applying and generalising existing numerical linear algebra tools. A recent result shows that the FastICA algorithm, a popular state-of-the-art method for linear Independent Component Analysis (ICA), shares a nice interpretation as a Newton type method with the Rayleigh Quotient Iteration (RQI), the latter method wellknown to the numerical linear algebra community. In this work, we develop an analogous theory of single vector iteration ICA methods. Two classes of methods are proposed for the one-unit linear ICA problem, namely, power ICA methods and inverse iteration ICA methods. By means of a scalar shift, scalar shifted versions of both power ICA method and inverse iteration ICA method are proposed and proven to be locally quadratically convergent to a correct demixing vector.
1
Introduction
Independent Component Analysis (ICA) is a standard statistical tool for solving the Blind Source Separation (BSS) problem. Recently, there has been an increasing interest in developing new algorithms for digital signal processing by applying and generalising existing numerical linear algebra tools. In this work, we develop a theory of one-unit linear ICA algorithms in the framework of single vector iteration methods, which are efficient numerical linear algebra tools for computing one eigenvalue-eigenvector pair of a real symmetric matrix. The FastICA algorithm, developed by the Finnish school, is one of the most popular algorithms for the linear ICA problem. Recent work in [1,2] suggests a strong connection between FastICA and the Rayleigh Quotient Iteration (RQI) method, which is wellknown to the numerical linear algebra community. A deeper result further shows that FastICA shares a nice interpretation as a Newton type method similar to RQI [3]. Other than being interpreted as a Newton method, RQI was originally developed as a single vector iteration method, specifically, a scalar shifted inverse iteration method [4]. In this work, we propose two classes of single vector iteration method for the one-unit linear ICA problem, namely, power ICA methods and inverse iteration ICA methods. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 105–112, 2007. c Springer-Verlag Berlin Heidelberg 2007
106
H. Shen and K. H¨ uper
This paper is organised as follows. Section 2 briefly introduces the one-unit linear ICA model with the motivation of developing single vector iteration ICA methods. By means of a scalar shift , scalar shifted versions of both power ICA method and inverse iteration ICA method are proposed in Section 3 and Section 4, respectively. Both scalar shifted ICA methods are proven to be locally quadratically convergent to a correct demixing vector. As an aside, the standard FastICA can be considered as a special case of the scalar shifted power ICA method. Finally in Section 5, several numerical experiments are provided to verify the theoretical results regarding the local convergence properties of the proposed algorithms.
2
The One-Unit Linear ICA Model
We consider the standard noiseless linear instantaneous ICA model, Z = AS, where S ∈ Rm×n represents n samples of m sources with m n, the full rank matrix A ∈ Rm×m is the mixing matrix, and Z ∈ Rm×n is the observed mixtures, see [5]. The source signals S are assumed to be unknown, having zero mean and unit variance, being mutually statistically independent, and at most one being Gaussian. The task of linear ICA is to recover the sources S by estimating the mixing matrix A given only the mixtures Z. By finding a matrix V ∈ Rm×m such that W = V Z = V AS with E[ww ] = I, the whitened demixing linear ICA model can be formulated as Y = X W , where W ∈ Rm×n is the whitened mixture, X ∈ Rm×m is the demixing matrix with X X = I, and Y ∈ Rm×n is the recovered signal. Let us denote by S m−1 := {x ∈ Rm | x = 1} the (m − 1)-dimensional unit sphere and by X = [x1 , . . . , xm ] the orthogonal demixing matrix. In this work, we study the so-called one-unit linear ICA problem, which estimates only one source at one time. It is equivalent to seeking an x ∈ S m−1 which gives a correct estimation of one single source. A generic contrast function of the one-unit linear ICA, which was proposed for developing the FastICA algorithm [5], can be given as follows f (x) := E[G(x w)], (1) f : S m−1 → R, where G : R → R is usually assumed to be even and differentiable. Under certain weak assumptions, it has been shown that a correct demixing vector x∗ ∈ S m−1 is a critical point of f , refer to [2] for details. Recall the critical point condition of the contrast function f as follows (2) E G (x w)w = γx, with G being the first derivative of G and γ ∈ R. One might consider the expression E[G (x w)w] as a nonlinear operator acting on x ∈ Rm . Thus, a solution (γ, x) of the critical point equation as in (2) can be treated as an eigenvalueeigenvector pair of this nonlinear operator. We can rewrite (2) as G (x w) G (x w)w x w = γx ⇐⇒ E w w x = γx. E (3) x w x w
Power Iteration, Inverse Iteration and FastICA
107
w) It is known that the operator E G x(x w w ∈ Rm×m can be made positive w definite by choosing the function G carefully [2], e.g., two functions widely used for FastICA, G(a) = log cosh(a) and G(a) = a4 both do the job. In this way, the expression E[G (x w)w] can be decomposed as a product of a positive definite matrix with a vector. Let H : R → R be smooth with H(a) ≥ 0 for all a ∈ R, we define (4) B(x) := E H(x w)w w . B : S m−1 → Rm×m , Similar to (3) we then define F : S m−1 → Rm ,
F (x) := B(x)x.
(5)
Notice that such a B(x) is a real symmetric matrix. Therefore the key work of this paper is to develop a theory of single vector iteration methods for solving the one-unit linear ICA problem, which is completely analogous to the numerical linear algebra tools for the real symmetric eigenvalue problem.
3
Power ICA Methods
According to the multiplicative decomposition of the nonlinear operator F suggested in (5), a simple power method applied to the matrix part B(x) can be formulated as follows η : S m−1 → S m−1 ,
x →
B(x)x . B(x)x
(6)
Let x∗ ∈ S m−1 be a correct demixing vector. By the assumption of zero mean ∗ and unit variance of the sources, the expression B(x ) gives an invertible matrix ∗ = E H(x w) > 0 occurring with with two positive eigenvalues, namely, λ 1 multiplicity m − 1 and single λ2 = E H(x∗ w)(x∗ w)2 > 0. The corresponding eigenvector of the eigenvalue λ2 is x∗ , i.e. B(x∗ )x∗ = λ2 x∗ holds. We therefore have proven. Lemma 1. Let x∗ ∈ S m−1 be a correct demixing vector. Then x∗ is a fixed point of the power ICA method η.
By taking the first derivative of η at x∗ in any direction ξ ∈ Tx∗ S m−1 , one gets D η(x∗ )ξ = 0. That is, by a Taylor-type argument, the algorithmic map η does not correspond to a locally quadratically fast algorithm. Here, Tx S m−1 = {ξ ∈ Rm |x ξ = 0} denotes the tangent space of the unit sphere S m−1 at a point x ∈ S m−1 . In the rest of this section, we will modify the power ICA method (6) to obtain second order convergence in the framework of a scalar shift strategy, which has been successfully used in developing the RQI [4] and generalising a simple oneunit ICA method proposed by Regalia and Kofidis [6,3]. Let us define a smooth function ρ : S m−1 → R. We construct a scalar shifted nonlinear operator acting on S m−1 as follows
108
H. Shen and K. H¨ uper
Fs : S m−1 → Rm ,
Fs (x) := (B(x) − ρ(x)I) x. (7) ∗ ∗ 2 Let x = x , one gets Fs (x ) = λs x with λs = E H(x w)(x w) − ρ(x∗ ). To formulate a well-defined power method based as in (7), it on the operator is necessary to have λs = 0, i.e., ρ(x∗ ) = E H(x∗ w)(x∗ w)2 . Moreover, if λs < 0, the corresponding power method is then not differentiable at the point x∗ following a similar argument as for the standard FastICA in [3]. Therefore, by introducing a sign correction term, see [3], we formulate the scalar shifted power ICA method as follows ∗
∗
ηs : S
m−1
→S
∗
m−1
x →
,
1 τ (x) (B(x)x − ρ(x)x) , 1 τ (x) (B(x)x − ρ(x)x)
(8)
where τ (x) = x B(x)x − ρ(x). The following lemma is then immediate. Lemma 2. Let x∗ ∈ S m−1 be a correct demixing vector and ρ : S m−1 → R a ∗ ∗ ∗ 2 smooth map with ρ(x ) = E H(x w)(x w) . Then x∗ is a fixed point of the scalar shifted power ICA method ηs .
Now we will study the additional conditions on the scalar shift ρ, which fulfils already the condition stated in Lemma 2, such that the algorithmic map ηs is locally quadratically convergent to a correct demixing vector x∗ . Define Fs (x) Fs (x) = . τ (x)
(9)
By a straightforward computation, the first derivative of ηs at x∗ in direction ξ ∈ Tx∗ S m−1 can be computed as
Fs (x∗ ) Fs (x∗ ) 1 D ηs (x)ξ|x=x∗ = D Fs (x)ξ|x=x∗ , (10) I− Fs (x∗ ) Fs (x∗ ) Fs (x∗ ) =:P (x∗ )
where P (x∗ ) is an orthogonal projection operator onto the orthogonal complement of the span of x∗ . Thus one has D ηs (x)ξ|x=x∗ = 0
⇐⇒
D Fs (x)ξ|x=x∗ = γx∗ .
(11)
Now by the chain rule, we compute 1 ξ|x=x∗ Fs (x∗ ) + D Fs (x)ξ|x=x∗ = D τ (x)
1 τ (x∗ )
D Fs (x)ξ|x=x∗
(12)
with D Fs (x)ξ|x=x∗ =(E[H(x∗ w)] + E[H (x∗ w)(x∗ w)])ξ − D ρ(x)ξ|x=x∗ x∗ − ρ(x∗ )ξ.
(13)
According to the fact that the first summand in (12) and the second summand in (13) give already a scalar multiple of x∗ , the expression in (10) vanishes if and only if (14) ρ(x∗ ) = E[H(x∗ w)] + E[H (x∗ w)(x∗ w)]. Therefore, following a Taylor-type argument, we conclude
Power Iteration, Inverse Iteration and FastICA
109
Theorem 1. Let x∗ ∈ S m−1 be a correct demixing vector and ρ : S m−1 → R a smooth map such that ρ(x∗ ) = E H(x∗ w)(x∗ w)2 , and ρ(x∗ ) = E[H(x∗ w)] + E[H (x∗ w)(x∗ w)]. Then the scalar shifted power ICA method ηs is locally quadratically convergent
to x∗ . Naturally, a simple choice of the scalar shift ρ to make ηs locally quadratically convergent can be constructed by ρp : S m−1 → R,
ρp (x) := E[H(x w)] + E[H (x w)(x w)].
(15)
We denote the corresponding algorithmic map by ηs using ρp as the scalar shift.
Remark 1. If one defines H(a) = G a(a) as suggested in (3), the resulting algorithm is essentially the same as the FastICA/ANPICA algorithm in [3].
4
Inverse Iteration ICA Methods
Recall the result in Section 3 that the expression B(x∗ ) is indeed an invertible matrix. In this section, we propose power-type methods applied on the inverse of B(x) and its scalar shifted generalisations. We call them inverse iteration ICA methods. Firstly, we define a nonlinear operator acting on S m−1 as K : S m−1 → Rm ,
K(x) := B(x)−1 x.
(16)
Note that the above operator K is locally well defined at least in an open neighborhood Uε (x∗ ) ⊂ S m−1 around x∗ . It is also worthwhile to notice that K(x) can be computed efficiently by solving the following linear system for u ∈ Rm , B(x)u = x.
(17)
Thus a simple inverse iteration ICA method based on K can be formulated as ζ : S m−1 ⊃ Uε (x∗ ) → S m−1 ,
x →
B(x)−1 x . B(x)−1 x
(18)
By the fact that B(x∗ )−1 x∗ = 1/λ1 x∗ , we just have proven Lemma 3. Let x∗ ∈ S m−1 be a correct demixing vector. Then x∗ is a fixed point of the inverse iteration ICA method ζ.
Again, by showing that the first derivative of ζ at x∗ in any ξ ∈ Tx∗ S m−1 does not vanish in general, i.e., D ζ(x∗ )ξ = 0, we conclude that the algorithmic map ζ is not locally quadratically convergent to x∗ . Once more, we will modify the inverse iteration ICA method as in (18) in the framework of scalar shift strategy to obtain second order convergence.
110
H. Shen and K. H¨ uper
Let ρ : S m−1 → R be smooth. We define a scalar shifted nonlinear operator as follows Ks : S m−1 → Rm ,
Ks (x) := (B(x) − ρ(x)I)
−1
x.
(19)
Such an operator is well defined if and only if, for any x ∈ S m−1 , B(x) − ρ(x)I is nonsingular. Now let x = x∗ . As discovered in Section 3, the matrix B(x∗ ) has ∗ only two positive eigenvalues λ1 = E H(x w) and λ2 = E H(x∗ w)(x∗ w)2 . Further analysis shows (i) If ρ(x∗ ) = λ1 , the resulting operator B(x∗ ) − ρ(x∗ )I is of rank one, i.e. Ks is not defined at x∗ ; (ii) If ρ(x∗ ) = λ2 , the matrix B(x∗ ) − ρ(x∗ )I is of rank m − 1. Although the operator Ks is still not defined, one can rescue the situation, i.e. remove the pole, by replacing the inversion by the classical adjoint, which has been used to handle a similar situation when analysing RQI in [7]. We therefore define
s (x) := adj (B(x) − ρ(x)I) x. K
(20)
s is locally well defined in Uε (x∗ ) ⊂ S m−1 and K s (x∗ ) = (λ1 − Note that K m−1 ∗ x . We now propose the following iteration method, still called the scalar λ2 ) shifted inverse iteration ICA method, ζs : S m−1 ⊃ Uε (x∗ ) → S m−1 ,
x →
adj (B(x) − ρ(x)I) x . adj (B(x) − ρ(x)I) x
(21)
We can now state Lemma 4. Let x∗ ∈ S m−1 be vector and ρ : S m−1 → R a a correct demixing ∗ ∗ ∗ smooth map with ρ(x ) = E H(x w) . Then x is a fixed point of the scalar shifted inverse iteration ICA method ζs .
By similar arguments as in Section 3, to make the first derivation of ζs at x∗ in direction ξ ∈ Tx∗ S m−1 vanish, i.e. D ζs (x)ξ|x=x∗ = 0, is equivalent to requiring s (x)ξ|x=x∗ = γx∗ . DK
(22)
A tedious computation shows that the equation in (22) holds true if and only if ρ(x∗ ) = E[H(x∗ w)(x∗ w)2 ] − E[H (x∗ w)(x∗ w)].
(23)
Therefore we just proved Theorem 2. Let x∗ ∈ S m−1 be a correct demixing vector and ρ : S m−1 → R a smooth map such that ρ(x∗ ) = E H(x∗ w) , and ρ(x∗ ) = E[H(x∗ w)(x∗ w)2 ] − E[H (x∗ w)(x∗ w)]. Then the scalar shifted inverse iteration ICA method ζs is locally quadratically convergent to x∗ .
Power Iteration, Inverse Iteration and FastICA
111
A natural and simple choice of the scalar shift ρ to make ζs locally quadratically convergent to x∗ can be constructed by choosing ρi (x) := E[H(x w)(x w)2 ] + E[H (x w)(x w)].
ρi : S m−1 → R,
(24)
It is clear that, in general, ρi (x∗ ) = E[H(x∗ w)(x∗ w)2 ]. Therefore one can construct the following algorithmic map locally, using ρi as the scalar shift, 1 −1 (B(x) − ρ (x)I) x i κ(x) x → ζs : S m−1 ⊃ Uε (x∗ ) → S m−1 , , (25) 1 κ(x) ((B(x) − ρi (x)I)−1 x) with κ(x) = x (B(x) − ρi (x)I)−1 x. The convergence properties as in Theorem 2 apply to ζs as well.
5
Numerical Experiments
In this section, we will verify the results in Theorem 1 and 2 by several experiments. Local convergence properties of two scalar shifted single vector iteration ICA methods, namely ηs and ζs , are investigated and compared with the classical FastICA/ANPICA. We specify the function H for both single vector iteration ICA methods by choosing H(a) = log cosh(a) and the function G for FastICA/ANPICA by G(a) = log cosh(a), as well. A toy set of sources is illustrated in Fig. 1(a). It is well known that the mutual statistical independence can only be ensured if the sample size n tends to infinity. In this experiment, therefore, we set n = 107 artificially high. All three methods are initialised by the same demixing vector. However it is worthwhile to notice that these algorithms can converge to different correct demixing vectors. We only show a case where it happens that they all converge
Source 1
0
10
2 0
−2
10 −2 0
50
100
150
200
250
300
350
400
450
500 −4
10 || xk − x* ||
Source 2
5 0 −5
0
50
100
150
200
250
300
350
400
450
500
−6
10
−8
Source 3
10 2
Scalar shifted power ICA method Scalar shifted inverse iteration method FastICA/ANPICA
−10
10
0 −2 0
50
100
150
200
250 Index
300
350
400
450
500
−12
10
1
2
3
4
5
6
Iteration (k)
(a) Source signals
(b) Convergence properties
Fig. 1. Local convergence properties of scalar shifted single vector iteration ICA methods (scalar shifted power ICA method vs. scalar shifted inverse iteration ICA method) and FastICA/ANPICA
112
H. Shen and K. H¨ uper
to the same correct demixing vector x∗ , see Fig. 1. The error is measured by the distance of the accumulation point x∗ to the current iterate xk , i.e., by the norm xk − x∗ . The numerical results in Fig. 1(b) show that both scalar shifted single vector iteration ICA methods, namely ηs and ζs , share the the same local quadratic convergence properties with the classical FastICA/ANPICA.
Acknowledgment. National ICT Australia is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Research Centre of Excellence programs.
References

1. Douglas, S.: Relationships between the FastICA algorithm and the Rayleigh quotient iteration. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 781–789. Springer, Heidelberg (2006)
2. Shen, H., Hüper, K.: Local convergence analysis of FastICA. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 893–900. Springer, Heidelberg (2006)
3. Shen, H., Hüper, K., Seghouane, A.K.: Geometric optimisation and FastICA algorithms. In: Proceedings of the 17th International Symposium on Mathematical Theory of Networks and Systems (MTNS 2006), Kyoto, Japan, pp. 1412–1418 (2006)
4. Parlett, B.N.: The Symmetric Eigenvalue Problem. SIAM Classics in Applied Mathematics, vol. 20. SIAM, Philadelphia (1998)
5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
6. Regalia, P., Kofidis, E.: Monotonic convergence of fixed-point algorithms for ICA. IEEE Transactions on Neural Networks 14(4), 943–949 (2003)
7. Hüper, K.: A Calculus Approach to Matrix Eigenvalue Algorithms. Habilitation Dissertation, Department of Mathematics, University of Würzburg, Germany (2002)
A Sufficient Condition for the Unique Solution of Non-Negative Tensor Factorization

Toshio Sumi and Toshio Sakata

Faculty of Design, Kyushu University, Japan
[email protected],
[email protected]
Abstract. Non-Negative Tensor Factorization (NNTF) is an important tool for brain wave (EEG) analysis. For it to work efficiently, it is essential for NNTF to have a unique solution. In this paper we give a sufficient condition for NNTF to have a unique global optimal solution. For a third-order tensor T we define a matrix by some rearrangement of T, and it is shown that the rank of the matrix is less than or equal to the rank of T. It is also shown that if both ranks are equal to r, the decomposition into a sum of r tensors of rank 1 is unique under some assumption.
1 Introduction
In the past few years, Non-Negative Tensor Factorization (NNTF) has become an important tool for brain wave (EEG) analysis through Morlet wavelet analysis (see, for example, Miwakeichi [MMV] and Morup [MHH]). NNTF algorithms are based on Non-Negative Matrix Factorization (NNMF) algorithms, among which the most well-known were contributed by Lee–Seung [LS]. Recently, Chichoki et al. [CZA] proposed a new NNTF algorithm using Csiszar's divergence. Furthermore, Wang et al. [WZZ] also worked on NNMF algorithms and their interesting application to privacy preservation in data mining. These algorithms converge to some stationary point but in general not to a global minimization point; in fact, it is easily shown that the problem has no unique minimization point in general (see [CSS]). In applications of NNTF to EEG analysis, it is important for NNTF to have a unique solution. However, this uniqueness problem has not been addressed sufficiently, as far as the authors are aware. As in Non-Negative Matrix Factorization (NNMF), the uniqueness problem seemed unsolved; here we obtain the uniqueness and prove it (see Proposition 1). In this paper we give a sufficient condition for NNTF to have a unique solution, and for the usual NNTF algorithm to find its minimization point, in the case when an exact (not merely approximate) NNTF exists (see Theorem 3).
The research is supported partially by User Science Institute in Kyushu University (Special Coordination Funds for Promoting Science and Technology).
2 Quadratic Form
As the NNMF problem is a minimization of a quadratic function, we shall first review quadratic functions in general. Consider the quadratic form defined by f(x) = xᵀAx − 2bᵀx, where A is an n × n symmetric matrix and b is an n-vector. The symmetric matrix A is diagonalized by an orthogonal matrix P as PAPᵀ = diag(e_1, ..., e_n). Then, setting y = (y_1, ..., y_n)ᵀ = Px and c = (c_1, ..., c_n)ᵀ = Pb, we obtain the equality

f(x) = yᵀ(PAPᵀ)y − 2cᵀy = Σ_i (e_i y_i² − 2c_i y_i) = Σ_i [ e_i (y_i − c_i/e_i)² − c_i²/e_i ].

We assume that the matrix A is positive definite. Then f(x) reaches its minimum over R^n at y = (PAPᵀ)^{−1}c = (PAPᵀ)^{−1}Pb = PA^{−1}b, that is, at x = Pᵀy = A^{−1}b ∈ R^n, with the value f(A^{−1}b) = −bᵀA^{−1}b. We are interested in the minimal value under the condition x ≥ 0. Here, some basic facts will be explained. Let a ∈ R^n and let h : R^n → R be the function defined as h(x) = ‖x − a‖², where ‖·‖ stands for the common Euclidean norm.

Lemma 1. On an arbitrary closed set S of R^n, h attains a global minimal value in S.

Proof. Choose an arbitrary t_0 ∈ S, and set s = h(t_0) and U = h^{−1}([0, s]) ∩ S. The set U is a closed subset of R^n. By the triangle inequality, h(t) ≥ (‖t‖ − ‖a‖)². Since s ≥ h(t) for t ∈ U, it holds that ‖t‖ ≤ √s + ‖a‖, which shows that U is bounded. Hence, since U is bounded and closed, it is compact. Thus h restricted to U is continuous on a compact set, and h(U) is also compact. That is, h attains a global minimum value on U, say s_0. Thus, for t ∈ S, h(t) > s if t ∉ U, and h(t) ≥ s_0 if t ∈ U. This means that s_0 is the global minimum of h on S.

Lemma 2. Let S be a closed convex subset of R^n. Then h attains a global minimal value at a unique point of S.

Proof. The existence of a global minimal value follows from Lemma 1. Let x and y be points in S which attain the global minimal value r := min_{z∈S} h(z). Note that x, y ∈ S ∩ ∂B_r(a), where B_r(a) := {x | ‖x − a‖ ≤ r} and ∂B_r(a) := {x | ‖x − a‖ = r}. Since S ∩ B_r(a) is also convex, tx + (1 − t)y ∈ S ∩ B_r(a) for each 0 ≤ t ≤ 1. If x ≠ y, then ‖a − (x + y)/2‖ < r, which is a contradiction. Therefore x = y.

Let D = diag(√e_1, ..., √e_n), z = DPx and S' = {z ∈ R^n | x ≥ 0}. Note that S' is a convex set of R^n and f(x) = h(z) − ‖a‖² for a = D^{−1}Pb. Therefore f(x) reaches a global minimal value at a unique point under the condition x ≥ 0.
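As a quick numerical check of the unconstrained case, the following Python sketch (with randomly generated data; all names and sizes are illustrative, not from the paper) verifies that f(x) = xᵀAx − 2bᵀx attains its minimum −bᵀA⁻¹b at x = A⁻¹b when A is positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # positive definite symmetric matrix
b = rng.standard_normal(n)

f = lambda x: x @ A @ x - 2 * b @ x

x_star = np.linalg.solve(A, b)     # unconstrained minimiser x = A^{-1} b
assert np.isclose(f(x_star), -b @ x_star)   # f(A^{-1}b) = -b^T A^{-1} b

# Any perturbation increases f, consistent with positive definiteness.
for _ in range(100):
    assert f(x_star + 0.1 * rng.standard_normal(n)) >= f(x_star)
```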
The following are some basic facts about matrix decompositions. Let A, W and H be m × n, m × r and r × n matrices, respectively. Then ‖A − WH‖ over all H ≥ O and W ≥ O reaches a global minimal value at a unique m × n product matrix WH, but W and H themselves are not unique. We state this precisely below.

Proposition 1. The following properties hold:
1. If r = rank(A), there exist W and H such that A = WH.
2. If r > rank(A), there exist infinitely many pairs of W and H such that A = WH.
3. Let r = rank(A). Then if A = WH = W'H', there exists a non-singular matrix X such that W = W'X and H = X^{−1}H' ([CSS, Full-Rank Decomposition Theorem]).
4. If r < rank(A), there exists no pair of W and H such that A = WH.

Proof. (1) In this case, let W be a matrix whose r columns are linearly independent vectors of length m spanning the column space of A. From the assumption it is clear that the columns of A are expressed as linear combinations of the columns of W; hence A = WH for some H.

(2) In this case, put s = rank(A) (r > s). By property (1), there exist an m × s matrix W_1 and an s × n matrix H_1 such that A = W_1 H_1. Set W = (W_1 W_2) and let H be the matrix obtained by stacking H_1 on top of H_2, where W_2 and H_2 are m × (r − s) and (r − s) × n matrices, respectively, satisfying W_2 H_2 = O. There are infinitely many such pairs (W_2, H_2), and for all of those it clearly holds that A = WH.

(3) From r = rank(A), in the expression A = WH = W'H', the columns of W and of W' are each linearly independent. Hence we have W = W'X for some regular r × r matrix X. From this the rest of (3) follows easily.

(4) Since rank(WH) ≤ r, it is impossible to have A = WH.
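The non-uniqueness in property (3) is easy to observe numerically. The following sketch (ours, with arbitrary sizes) builds a rank-r product A = WH and verifies that the pair (WX, X⁻¹H) reproduces A for an arbitrary non-singular X.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 5, 3
W = rng.random((m, r))                  # non-negative rank-r factors
H = rng.random((r, n))
A = W @ H                               # rank(A) = r (generically)

X = rng.standard_normal((r, r))         # any non-singular r x r matrix
W2, H2 = W @ X, np.linalg.solve(X, H)   # W' = WX, H' = X^{-1}H

assert np.allclose(W2 @ H2, A)          # the product WH is unique, the factors are not
```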
3 Non-Negative Matrix Factorization
It is well known that NNMF (Non-Negative Matrix Factorization) is not unique ([CSS]). Let V, W and H be m × n, m × r and r × n matrices, respectively. For a matrix A, we denote by A_{ij} the (i, j)-component of A, and its Frobenius norm is defined by

‖A‖_F := √(tr(AᵀA)) = √(Σ_{i,j} A_{ij}²),

where tr takes the sum of all diagonal entries.

Lemma 3. Fixing H, f(W) = ‖V − WH‖_F attains its minimum at any solution W of the equation W(HHᵀ) = VHᵀ. In particular, if HHᵀ is non-singular, the minimum is attained at the unique point W = VHᵀ(HHᵀ)^{−1}.
Proof. It holds that

f(W) = Σ_{i,j} ( V_{ij} − Σ_p W_{ip}H_{pj} )²
     = Σ_{i,j} [ Σ_{p,q} W_{ip}H_{pj}W_{iq}H_{qj} − 2 Σ_p V_{ij}W_{ip}H_{pj} + V_{ij}² ]
     = Σ_{i,p,q} (HHᵀ)_{pq} W_{ip}W_{iq} − 2 Σ_{i,p} (VHᵀ)_{ip} W_{ip} + Σ_{i,j} V_{ij}².

Therefore f(W) is a quadratic function of the W_{ij} (i = 1, ..., m, j = 1, ..., r). Put x = (W_{11}, ..., W_{1r}, ..., W_{m1}, ..., W_{mr})ᵀ ∈ R^{mr}, a = ((VHᵀ)_{11}, ..., (VHᵀ)_{1r}, ..., (VHᵀ)_{m1}, ..., (VHᵀ)_{mr})ᵀ ∈ R^{mr}, and define the mr × mr matrix M by diag(HHᵀ, ..., HHᵀ). Then M is positive semi-definite and f(W) is expressed as

f(W) = xᵀMx − 2aᵀx + Σ_{i,j} V_{ij}².

Assume that HHᵀ is non-singular. Then M is positive definite and thus the minimum of f(W) is attained at the unique point x = M^{−1}a, that is, Wᵀ = (HHᵀ)^{−1}(VHᵀ)ᵀ, equivalently W = VHᵀ(HHᵀ)^{−1}. The minimum value is f(W) = ‖V‖²_F − ‖WH‖²_F, and we also have

‖WH‖²_F = tr(WᵀVHᵀ) = tr((HHᵀ)^{−1}(VHᵀ)ᵀ(VHᵀ)).   (1)
Since ‖V − WH‖_F = ‖Vᵀ − HᵀWᵀ‖_F, fixing W, ‖V − WH‖_F attains its minimum at the unique point H = (WᵀW)^{−1}WᵀV if WᵀW is non-singular. We recall the Lee–Seung NNMF algorithm for the Frobenius norm.

Theorem 1 ([LS]). The Frobenius norm ‖V − WH‖_F is non-increasing under the update rules:

H_{ij} ← H_{ij} (WᵀV)_{ij} / (WᵀWH)_{ij},   W_{ij} ← W_{ij} (VHᵀ)_{ij} / (WHHᵀ)_{ij}.
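For reference, here is a minimal NumPy sketch of the multiplicative updates of Theorem 1. The small constant eps, which guards against division by zero, is an implementation detail of ours and is not part of the stated rules.

```python
import numpy as np

def lee_seung_step(V, W, H, eps=1e-12):
    """One Lee-Seung update of H and W; ||V - WH||_F is non-increasing."""
    H = H * (W.T @ V) / (W.T @ W @ H + eps)
    W = W * (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# usage sketch: alternate updates from a random non-negative start
rng = np.random.default_rng(2)
V = rng.random((20, 15))
W, H = rng.random((20, 4)), rng.random((4, 15))
for _ in range(200):
    W, H = lee_seung_step(V, W, H)
```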
Now we propose the following improvement of the Lee–Seung NNMF algorithm for the Frobenius norm. For matrices X with X ≥ 0 and Y, let

t_max(X, Y) = max{ t | (1 − t)X + tY ≥ O, 0 ≤ t ≤ 1 }.
Theorem 2. The Frobenius norm ‖V − WH‖_F is non-increasing under the update rules:

H ← (1 − h_0)(WᵀW)^{−1}WᵀV + h_0 H, if WᵀW is non-singular and h_0 > 0; otherwise H_{ij} ← H_{ij} (WᵀV)_{ij}/(WᵀWH)_{ij};

W ← (1 − w_0)VHᵀ(HHᵀ)^{−1} + w_0 W, if HHᵀ is non-singular and w_0 > 0; otherwise W_{ij} ← W_{ij} (VHᵀ)_{ij}/(WHHᵀ)_{ij};

where h_0 = t_max((WᵀW)^{−1}WᵀV, H) and w_0 = t_max(VHᵀ(HHᵀ)^{−1}, W).

Proof. If either WᵀW is singular or h_0 = 0, the claim follows from Theorem 1. Suppose that WᵀW is non-singular and h_0 > 0. By Lemma 3, fixing W, ‖V − WH‖_F attains its minimum at (WᵀW)^{−1}WᵀV when the constraint x ≥ 0 is dropped. Let us write H' = (1 − h_0)(WᵀW)^{−1}WᵀV + h_0 H for clarity. On the segment from H to H', the Frobenius norm decreases, and thus ‖V − WH'‖_F ≤ ‖V − WH‖_F. Clearly H' ≥ 0, which follows from the definition of h_0.
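A sketch of the improved H-update of Theorem 2 follows (the W-update is symmetric). The coefficient h0 is computed here under one natural reading of t_max: since the non-negativity constraint is entrywise, h0 reduces to the smallest weight on H for which the convex combination with the unconstrained minimizer stays entrywise non-negative. This entrywise reduction is our interpretation, not a formula from the paper.

```python
import numpy as np

def improved_h_update(V, W, H, eps=1e-12):
    """Improved H-update in the spirit of Theorem 2 (a sketch)."""
    G = W.T @ W
    if np.linalg.matrix_rank(G) < G.shape[0]:
        return H * (W.T @ V) / (G @ H + eps)      # Lee-Seung fallback
    X = np.linalg.solve(G, W.T @ V)               # unconstrained minimiser
    if np.all(X >= 0):
        return X                                   # already feasible
    # entrywise: (1-t)*X + t*H >= 0  <=>  t >= -X/(H-X) wherever X < 0
    mask = X < 0
    h0 = np.max(np.clip(-X[mask] / (H[mask] - X[mask] + eps), 0.0, 1.0))
    if h0 <= 0 or h0 >= 1:
        return H * (W.T @ V) / (G @ H + eps)      # fallback
    return (1 - h0) * X + h0 * H                   # closest feasible point to X
```

Since the objective is a convex quadratic minimized at X, it decreases monotonically along the segment from H toward X, so the smallest feasible weight on H gives the best feasible point on that segment.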
4 Non-Negative Tensor Factorization

4.1 Existence of a Global Optimal Solution
Let R_{≥0} be the set of all non-negative real numbers, and let T = (t_{ijk}) be a third-order tensor in R_{≥0}^{a×b×c}. Let X = (x_1 ... x_r), Y = (y_1 ... y_r) and Z = (z_1 ... z_r) be a × r, b × r and c × r matrices, respectively. We define a function f over R_{≥0}^{(a+b+c)r} as

f(X, Y, Z) = Σ_{ijk} ( t_{ijk} − Σ_ℓ X_{iℓ} Y_{jℓ} Z_{kℓ} )².
Let S_b = {x ∈ R_{≥0}^b | ‖x‖ = 1} be the intersection of the unit sphere in R^b with R_{≥0}^b. Put S = (R_{≥0}^a)^{×r} × (S_b)^{×r} × (S_c)^{×r} for short, where M^{×r} = M × ··· × M (r times). Then S is a closed subspace of R_{≥0}^{(a+b+c)r}, and the image f(S) coincides with the full image f(R_{≥0}^{(a+b+c)r}). Let (X, Y, Z) ∈ S. Then X ≥ O, Y ≥ O, Z ≥ O and ‖y_j‖ = ‖z_j‖ = 1 for all j. Noting that

Σ_{i,j,k} ( Σ_ℓ X_{iℓ} Y_{jℓ} Z_{kℓ} )² ≥ Σ_{i,j,k,ℓ} (X_{iℓ} Y_{jℓ} Z_{kℓ})² = ‖X‖²_F,

we see that if Σ_{i,j,k}(Σ_ℓ X_{iℓ}Y_{jℓ}Z_{kℓ})² is bounded then ‖X‖_F is also bounded, and thus so is the relevant sublevel set of f in S. Hence, we can apply the proof of Lemma 1 to the function f on S instead of h, and we obtain the existence of a global minimal value.
4.2 Uniqueness
We show the uniqueness under some assumption. First, several facts are presented. For convenience, we define

X_1 ∘ ··· ∘ X_k = ( x_1^{(1)} ⊗ ··· ⊗ x_1^{(k)}, ..., x_r^{(1)} ⊗ ··· ⊗ x_r^{(k)} )

for matrices X_1 = (x_1^{(1)}, ..., x_r^{(1)}), ..., X_k = (x_1^{(k)}, ..., x_r^{(k)}) with r columns. For u = (1, ..., 1)ᵀ ∈ R^r, we have f(X, Y, Z) = ‖T − (X ∘ Y ∘ Z)u‖²_F. For a permutation σ of {1, ..., r}, the permutation matrix M_σ = (m_{ij}) is defined by m_{ij} = δ_{iσ(j)}. For a permutation matrix M_σ it holds that M_σᵀ = M_σ^{−1} = M_{σ^{−1}}.

Proposition 2. For a general matrix P, the equation

(X_1 P) ∘ ··· ∘ (X_r P) = (X_1 ∘ ··· ∘ X_r)P

does not hold. However, if P is a permutation matrix and P_1, ..., P_r are diagonal matrices, then

(X_1 P) ∘ ··· ∘ (X_r P) = (X_1 ∘ ··· ∘ X_r)P,
(X_1 P_1) ∘ ··· ∘ (X_r P_r) = (X_1 ∘ ··· ∘ X_r)P_1 ··· P_r

do hold.

Lemma 4. Let A and C be m × r matrices, B and D be n × r matrices, and Q an r × r non-singular matrix. Assume that A ∘ B = (C ∘ D)Q and rank(C) = rank(C ∘ D) = r. Then there exists a permutation matrix P = M_σ such that both PQ and QP^{−1} are diagonal matrices, and A = CQX and B = DP^{−1}X^{−1} hold for some diagonal matrix X. Further suppose that A, B, C, D ≥ 0, and let Q^{1/2} be the r × r matrix whose (i, j)-component is the square root of the (i, j)-component of Q. Then Q^{1/2} is a real matrix, and both A = CQ^{1/2}X and B = DQ^{1/2}X^{−1} hold for some diagonal matrix X.

Proof. We use the notations A = (a_1, ..., a_r), B = (b_1, ..., b_r) = (b_{ij}), C = (c_1, ..., c_r), D = (d_1, ..., d_r) = (d_{ij}), Q = (q_{ij}). Since ⊗ is a bilinear operation and rank(C ∘ D) = r, it holds that d_k ≠ 0 (∀k) and that d_k ∥ d_ℓ implies k = ℓ. Since A ∘ B = (C ∘ D)Q, we have

a_k ⊗ b_k = Σ_ℓ q_{ℓk} c_ℓ ⊗ d_ℓ, ∀k,   and   b_{ik} a_k = Σ_ℓ q_{ℓk} d_{iℓ} c_ℓ, ∀i, k.

Since Q is non-singular, for each k there exists a σ(k) such that q_{σ(k)k} ≠ 0. Now we show that for each ℓ there exists an i such that b_{iℓ} ≠ 0. Assume that b_s = 0 for some s. Then Σ_ℓ q_{ℓs} c_ℓ ⊗ d_ℓ = 0, and since rank(C ∘ D) = r, it holds that q_{ℓs} = 0 (∀ℓ). This contradicts the fact that Q is non-singular. Therefore, for each ℓ there exists a τ(ℓ) such that b_{τ(ℓ)ℓ} ≠ 0. It then follows that

q_{ℓk} b_{ik} d_{τ(k)ℓ} = q_{ℓk} b_{τ(k)k} d_{iℓ}, ∀i, k, ℓ,
from the equality

b_{τ(k)k} b_{ik} a_k = Σ_ℓ b_{ik} q_{ℓk} d_{τ(k)ℓ} c_ℓ = Σ_ℓ b_{τ(k)k} q_{ℓk} d_{iℓ} c_ℓ, ∀i, k,

and rank(C) = r. Under the assumption q_{ℓk} ≠ 0, since d_{iℓ} = (d_{τ(k)ℓ}/b_{τ(k)k}) b_{ik} for all i, it holds that d_ℓ = (d_{τ(k)ℓ}/b_{τ(k)k}) b_k. In particular d_{τ(k)ℓ} ≠ 0, that is, d_ℓ ∥ b_k. Hence, by rank(C ∘ D) = r, if q_{ℓk} ≠ 0 then ℓ = σ(k). This implies that there exists a permutation matrix P = M_σ such that both PQ and QP^{−1} are diagonal. Then, choosing X = diag(d_{τ(k)σ(k)}/b_{τ(k)k}), it holds that

a_k = q_{σ(k)k} (d_{τ(k)σ(k)}/b_{τ(k)k}) c_{σ(k)},   b_k = (b_{τ(k)k}/d_{τ(k)σ(k)}) d_{σ(k)},   ∀k,

that is, A = CQX and B = DP^{−1}X^{−1}. Further, under the assumption Q ≥ 0, choosing Y = diag(√(q_{σ(k)k}) d_{τ(k)σ(k)}/b_{τ(k)k}), it holds that A = CQ^{1/2}Y and B = DQ^{1/2}Y^{−1}. This completes the proof of Lemma 4.

We should note that the factorization (X ∘ Y ∘ Z) has a scalar uncertainty: for scalars a, b, c it holds that

(a'X) ∘ (b'Y) ∘ (c'Z) = (abc)(X ∘ Y ∘ Z),

where (a', b', c') denotes any permutation of (a, b, c).

Now we give a sufficient condition under which NNTF has a unique global solution. From now on set u = (1, ..., 1)ᵀ ∈ R^r, and let fl_1(T) be the a × bc matrix whose (i, j + b(k − 1))-component is t_{ijk}. Then the following theorem holds.

Theorem 3. For f(X, Y, Z) = ‖T − (X ∘ Y ∘ Z)u‖²_F, assume rank(fl_1(T)) = r and min f(X, Y, Z) = 0. Then, under the condition rank(Y) = rank(Y ∘ Z) = r, the global optimal point is unique up to permutations and scalar uncertainty.

Proof. By the triangle inequality we have ‖(X_1 ∘ Y_1 ∘ Z_1)u − (X_0 ∘ Y_0 ∘ Z_0)u‖_F ≤ √f(X_0, Y_0, Z_0) + √f(X_1, Y_1, Z_1) = 0, and thus (X_0 ∘ Y_0 ∘ Z_0)u = (X_1 ∘ Y_1 ∘ Z_1)u, which is equivalent to the equation X_0(Y_0 ∘ Z_0)ᵀ = X_1(Y_1 ∘ Z_1)ᵀ. By Proposition 1(3), there exists a non-singular matrix Q such that X_1 = X_0(Qᵀ)^{−1} and Y_1 ∘ Z_1 = (Y_0 ∘ Z_0)Q. From Lemma 4, for some permutation matrix P and diagonal matrix D_1, it holds that D_2 := PQ is a diagonal matrix, Y_1 = Y_0QD_1 and Z_1 = Z_0P^{−1}D_1^{−1}. Hence, noting P^{−1} = Pᵀ, it holds that X_1 = X_0P^{−1}D_2^{−1}, Y_1 = Y_0P^{−1}D_2D_1, Z_1 = Z_0P^{−1}D_1^{−1}. Up to scalar uncertainty, (X_1, Y_1, Z_1) is equal to (X_0P^{−1}, Y_0P^{−1}, Z_0P^{−1}) = (X_0, Y_0, Z_0)P^{−1}, which is, up to permutation, equal to (X_0, Y_0, Z_0).
In general, (X_0 ∘ Y_0 ∘ Z_0)u = (X_1 ∘ Y_1 ∘ Z_1)u does not hold, but we can show the following property.

Proposition 3. For the function f(X, Y, Z) = ‖T − (X ∘ Y ∘ Z)u‖²_F, assume that (X_0, Y_0, Z_0) and (X_1, Y_1, Z_1) are two stationary points which attain the minimal value, i.e., f(X_0, Y_0, Z_0) = f(X_1, Y_1, Z_1). Then it holds that ‖(X_0 ∘ Y_0 ∘ Z_0)u‖_F = ‖(X_1 ∘ Y_1 ∘ Z_1)u‖_F.

Proof. Since f(X, Y, Z) = ‖fl_1(T) − X(Y ∘ Z)ᵀ‖²_F, from equation (1) we have ‖fl_1(T)‖²_F − ‖X_0(Y_0 ∘ Z_0)ᵀ‖²_F = ‖fl_1(T)‖²_F − ‖X_1(Y_1 ∘ Z_1)ᵀ‖²_F. That is, ‖X_0(Y_0 ∘ Z_0)ᵀ‖_F = ‖X_1(Y_1 ∘ Z_1)ᵀ‖_F. Finally, we remark that the equality ‖X_0(Y_0 ∘ Z_0)ᵀ‖_F = ‖Y_0(Z_0 ∘ X_0)ᵀ‖_F = ‖Z_0(X_0 ∘ Y_0)ᵀ‖_F also holds.
5 Conclusion
For a third-order tensor T and each r, there exists a sum of r tensors of rank 1 which is closest to T in the sense of the Frobenius norm (existence property). In general, a global optimal solution is not unique for NNTF. For this problem we proved that if T is of rank r, then the rank of the matrix made by a rearrangement of T is less than or equal to r, and that if both ranks are equal, the decomposition of T into a sum of r tensors of rank 1 is unique under some condition (uniqueness property).
References

[CSS] Cao, B., Shen, D., Sun, J.-T., Wang, X., Yang, Q., Chen, Z.: Detect and Track Latent Factors with Online Nonnegative Matrix Factorization. In: IJCAI 2007, pp. 2689–2694 (2007)
[CZA] Chichoki, A., Zdunek, R., Amari, S.-i.: Non-Negative Tensor Factorization Using Csiszar's Divergence. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 32–39. Springer, Heidelberg (2006)
[LS] Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 556–562. MIT Press, Cambridge (2001)
[MMV] Miwakeichi, F., Martínez-Montes, E., Valdés-Sosa, P.A., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG Data into Space–Time–Frequency Components Using Parallel Factor Analysis. NeuroImage 22, 1035–1045 (2004)
[MHH] Morup, M., Hansen, L.K., Herman, C.S., Parnas, J., Arnfred, S.M.: Parallel Factor Analysis as an Exploratory Tool for Wavelet Transformed Event-related EEG. NeuroImage 29, 938–947 (2006)
[WZZ] Wang, J., Zhong, W., Zhang, J.: NNMF-Based Factorization Techniques for High-Accuracy Privacy Protection on Non-negative-valued Datasets. In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 513–517. Springer, Heidelberg (2006)
Colored Subspace Analysis

Fabian J. Theis¹ and M. Kawanabe²

¹ Bernstein Center for Computational Neuroscience, MPI for Dynamics and Self-Organisation, Göttingen, Germany
² Fraunhofer FIRST.IDA, Germany
[email protected] http://fabian.theis.name
Abstract. With the advent of high-throughput data recording methods in biology and medicine, the efficient identification of meaningful subspaces within these data sets becomes an increasingly important challenge. Classical dimension reduction techniques such as principal component analysis often do not take the large statistics of the data set into account, and thereby fail if the signal space is for example of low power but meaningful in terms of some other statistics. With 'colored subspace analysis', we propose a method for linear dimension reduction that evaluates the time structure of the multivariate observations. We differentiate the signal subspace from noise by searching for a subspace of non-trivially autocorrelated data; algorithmically we perform this search by joint low-rank approximation. In contrast to blind source separation approaches we however do not require the existence of sources, so the model is applicable to any wide-sense stationary time series without restrictions. Moreover, since the method is based on second-order time structure, it can be efficiently implemented even for large dimensions. We conclude with an application to dimension reduction of functional MRI recordings.
1 Introduction
Dimension reduction considers the question of removing a noise subspace from a larger multivariate signal. Classically, a signal is differentiated from noise by having a higher variance, and algorithms such as principal component analysis (PCA) in the linear case remove the low-variance components. This can be extended to nonlinear settings, which results in methods including nonlinear PCA [1], kernel PCA [2] and ISOMAP [3], to name but a few. These techniques are well-developed and powerful if the noise is comparatively low (in terms of power i.e. variance) when compared to the signal; in other words a signal manifold has to be 'visible' in the local covariance matrix. However the methods fail to capture signals that are deteriorated by noise of similar or stronger power. Broadly speaking, there are two solutions to extract signals from higher-variance noise: (a) use higher-order statistics of the data to differentiate signal from noise, or (b) use additional information of the data such as temporal structure to define a signal manifold. (a) leads to the recently proposed non-Gaussian
component analysis (NGCA) [4,5,6], which is a semiparametric statistical framework for searching for non-Gaussian subspaces—there are a few algorithmic implementations such as multi-index projection pursuit. The noise subspace is characterized simply by being Gaussian. NGCA tries to detect the non-Gaussian signal subspace within the data, and in contrast to independent component analysis no assumption of independence within the subspace is made. More precisely, given a random vector x, a factorization x = As with an invertible matrix A, s = (s_N, s_G) and s_N a square-integrable n-dimensional random vector is called an n-decomposition of x if s_N and s_G are stochastically independent and s_G is Gaussian. In this case, x is said to be n-decomposable; x is said to be minimally n-decomposable if x is not (n − 1)-decomposable. It has been shown that the minimal NGCA signal subspaces of a minimally n-decomposable decomposition are unique [5]. This method is essentially the only available alternative to second-order approaches if i.i.d. signals are given. However, if the observations possess additional structure such as temporal dependencies, approach (b) provides an often simpler dimension reduction framework. Frequently, it is implicitly taken by methods that preprocess the data by transforming them into, for example, a Fourier or a wavelet basis, which uses the time structure only in the preprocessing step; the assumption is that in the transformed domain, variance-based methods then suffice. Here, we take approach (b) in a more direct fashion, and propose a novel method that takes the idea of NGCA and its underlying algorithms [4,6], namely the decomposition into a maximally white and 'another' signal, to the temporal domain, and apply it to the extraction of the signal subspace of fMRI data sets.
2 Colored Subspace Analysis (CSA)
The goal of CSA is to determine a basis of a random process such that in this basis as many components as possible are white (i.i.d.). The remaining components then span the 'colored subspace', onto which we project for dimension reduction. Let x(t) be an (observed) d-dimensional real stochastic process and A an invertible real matrix such that x(t) = As(t). As in NGCA, an n-temporal-decomposition of s(t) is defined by s(t) = (s_C(t), s_W(t)). Here s_C(t) is an n-dimensional square-integrable wide-sense stationary random process and s_W(t) is i.i.d., such that the auto-crosscorrelation of s_W(t) and s_C(t) vanishes. Splitting up A = (A_C, A_W) accordingly yields the generative model x(t) = A_C s_C(t) + A_W s_W(t). With W := A^{−1} =: (W_Cᵀ, W_Wᵀ)ᵀ, the dimension reduction consists of projecting x(t) onto the lower-dimensional signal s_C(t) = W_C x(t). Note that the more traditional model x(t) = A_G s_G(t) + n(t) using full-rank noise n(t) is included in the above model, simply by adding the n-dimensional part of n(t) lying in the image of A_G to s_G(t). Our claim then is that we cannot distinguish between signal and noise in the signal subspace without additional assumptions.
2.1 Indeterminacies
The subspace given by the range of W_Cᵀ is denoted as the colored subspace of x(t). Clearly, the coefficients of A or W cannot be unique. However, from arguments similar to those below, it can be shown that the colored subspace itself is unique if the n-temporal-decomposition is minimal in the sense that no (n − 1)-temporal-decomposition of x(t) exists; we have to assume that the noise subspace is maximal, as we do not make any assumptions on s_C(t).

2.2 Algorithm
The key assumption of the model is that s_C(t) and s_W(t) have no common autocorrelations, i.e.—after centering—that R_s(τ) := E(s(t + τ)s(t)ᵀ) is block diagonal of the form

R_s(τ) = [ R_{s_C}(τ)  0 ; 0  R_{s_W}(τ) ]   (1)

for all τ. Moreover, the noise component s_W(t) is characterized by being i.i.d., hence R_{s_W}(τ) = 0 for τ ≠ 0. It can be shown that minimality of the colored subspace is guaranteed if n is chosen maximal such that there still exists a τ ≠ 0 with full-rank R_{s_C}(τ). The factorization model now provides that the observed autocorrelations R_x(τ) can be factorized as

R_x(τ) = A R_s(τ) Aᵀ.   (2)

As preprocessing, we first remove correlations by PCA, which guarantees that R_x(0) = I. Since the bases in the signal and noise subspaces are non-unique, we may choose coordinates as normalization such that, without loss of generality, R_{s_C}(0) = I and R_{s_W}(0) = I, hence R_s(0) = I according to (1). Then A is orthogonal, because AAᵀ = AR_s(0)Aᵀ = R_x(0) = I. So the model factorization (2) together with the block structure (1) implies that A, and hence the colored subspace, can be algorithmically detected by block-diagonalizing one symmetrized R̄_x(τ) = (1/2)(R_x(τ) + R_x(τ)ᵀ). Robustness can be increased by performing orthogonal joint block-diagonalization [7,8] of multiple or all R̄_x(τ) for τ ≠ 0. The dimension n of the signal subspace can be determined as

n := max_{τ≠0} rank R̄_x(τ),
124
F.J. Theis and M. Kawanabe
any wide-sense stationary random process; the signal subspace is automatically and uniquely determined, and additional assumptions within the data subspace (such as autodecorrelated sources) are not necessary. This is analogous to the step from ICA to NGCA as discussed in the introduction. Interestingly, the two models of ICA and autodecorrelation can also be combined, see e.g. JADET D [13], where JADE and TDSEP are combined with R(τ ), τ = 0 for ICA in the presence of i.i.d. Gaussian noise. Similar combinations are possible for corresponding dimension reduction frameworks. A review of related cost functions is given in [14]. 2.3
Block-Diagonalization by Joint Low-Rank Approximation
Recently, the authors have presented a method for extracting a single non-zero block from a set of matrices distributed by unitary transformations [14]. There we focused on the NGCA problem and proposed a procedure called joint lowrank approximation (JLA) with a set {Mk }K k=1 of transformed block matrices as Rx (τ ) in Eq.(2) for τ = 0. The reduction matrix W0 , which extracts the non-Gaussian part of the data x can be determined by maximizing L(W0 ) = K 2 d 2 k=1 W0 Mk W0 Fro over Stiefel manifold W0 ∈ Vn (R ), where CFro = ∗ tr(CC ). It can be shown that the true reduction matrix is the unique maximizer of L up to equivalence in the Stiefel manifold. By taking derivative of L, we get the equation K Mk (W0 ) = ΛW0 , W0 k=1
which can be solved by iterative eigenvalue decomposition, where Mk (W0 ) := Mk W0 W0 M∗k + M∗k W0 W0 Mk . Examples of such matrix sets for NGCA case are: (a) fourth-order cumulant tensor, i.e. Q(kl) := (cum(xi , xj , xk , xl ) ) for all (k, l), ∂2 (b) Hessian of log characteristic function, i.e. Mk := ∂ζ∂ζ log E[exp(iζ x)]+Id . For the second case, we developed somewhat sophisticated choices and updates of the frequency vectors ζ k which is necessary to improve the performance of JLA. In the case of CSA, we commonly fix the autocovariance matrices in advance, but informative lags τ can be chosen by a similar idea. Algorithm 1 shortly summarizes how JLA is applied to our autocovariance data set.
3 Simulations
As a simple toy example, we consider n = 3-dimensional colored signals in d = 10-dimensional data. The colored signals are three sinusoids of equal frequency and varying phase, which have been instantaneously gaussianized, see figure 1(a), so methods based on higher-order statistics such as NGCA cannot work. They have been embedded in white Gaussian noise of approximately equal power. The resulting 10-dimensional data set is then mixed by a matrix A with coefficients
chosen from a standard normal distribution; the mixtures x(t) are shown in figure 1(b). The resulting SNR is −5 dB, so distinguishing the signal from the noise subspace by power (→ PCA) cannot work either, as will also be shown later. We first apply CSA with K = 10 autocovariance matrices and known signal subspace dimension n = 3. If we multiply the recovered projection W_C with the mixing matrix A, we expect W_C A to have strong contributions in the first n × n block and close-to-zero entries everywhere else. This is the case, indicating that CSA works fine, see the Hinton diagram in figure 1(c). Indeed, a possible error index e(W_CSA) := ‖(W_CSA A)(:, n + 1 : d)‖_F is low (0.0139). If we perform a similar joint block-diagonalization-based search for the projection, extending the SOBI algorithm, we also achieve an approximate signal projection, however with an increased error of 0.0363. If however only PCA is applied, the resulting error is high with e(W_PCA) = 5.12, see figure 1(d). A more systematic comparison of the three methods is achieved when we perform the above experiment in a batch run of length 100, with randomly chosen A in each run. The resulting statistics, figures 1(e–f), confirm the superior performance of CSA in terms of recovery error, as well as computational time (with respect to the extension of SOBI).
[Fig. 1. Toy example of an n = 3-dimensional signal (a) in d = 10 dimensions (b). CSA outperforms the other methods (c–f). Panels: (a) signal subspace; (b) mixtures; (c) mixing-separating matrix W_CSA·A; (d) mixing-separating matrix W_PCA·A; (e) recovery error over 100 runs; (f) computation time over 100 runs. See text for details.]
4 Signal-Subspaces in fMRI Data
Functional magnetic-resonance imaging (fMRI) can be used to measure brain activity. Multiple MRI scans are taken in various functional conditions; the extracted task-related component reveals information about the task-activated brain regions. Classical power-based methods fail to blindly recover the task-related component, as it is very small with respect to the total signal, usually around one percent in terms of variance. Hence we propose to use the autocovariance structure (in this case spatial autocovariances) in combination with CSA to properly reduce the data dimension.
fMRI data with 98 images (TR/TE = 3000/60 msec) were acquired with five periods of rest and five photic stimulation periods; stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. Resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point, and a dark background with a central fixation point during the control periods [15]. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment. In order to compare the performance of CSA versus standard PCA-based dimension reduction with varying source data dimension, we reduce the total data to p ∈ {2, 5, 10, 20, 50, 98} dimensions by PCA. Then we apply either CSA or PCA and order the components in decreasing order of the eigenvalues (of the total autocovariance or covariance, respectively). We analyze how well the task-related component with the known task vector v ∈ {0, 1}^98 is contained in a component by f(i) := (W_0(i, :)v)², where W_0 is the separating matrix. In order to allow for finite-sample effects, we compare the recovered subspace for all varying reduced dimensions n by comparing it to the total power, plotting

c(n) = Σ_{i=1}^{n} f(i) / Σ_{i=1}^{p} f(i)

versus n, see figure 2. For strongly reduced data, p ≤ 5, both methods capture the task component for low n, PCA more so than CSA. But in more realistic data settings, p ≥ 10, necessary for full data evaluation, CSA consistently needs n = 5 components to guarantee that the task-related component is contained in the signal subspace (with cumulative contribution ratio .8 for p ≤ 20), whereas PCA already needs n = 18 components to guarantee the same for p = 20, and more for larger p. This illustrates that CSA can be used as a preprocessing tool for fMRI data much more efficiently than PCA, albeit at a somewhat higher computational cost.
[Fig. 2. Comparison of CSA (left, panel a) and PCA (right, panel b) for dimension reduction: cumulative contribution c(n) versus n, for p ∈ {2, 5, 10, 20, 50, 98}.]
Conclusions. We have presented a generally applicable, efficient method for linear dimension reduction that separates a subspace with nontrivial autocorrelations (color) from the white remainder. Results on toy and real data are
promising. Presently, we are working on a statistically sound estimation of the subspace dimension as well as on a generalization without prewhitening. Moreover, we are planning to study the performance of CSA on other medical imaging applications. We believe that the method may provide a useful tool for preprocessing to allow for more efficient analysis in a lower-dimensional signal subspace still capturing fine-grained and low-power statistics of the observations.
References

1. Oja, E.: Nonlinear PCA criterion and maximum likelihood in independent component analysis. In: Proc. ICA 1999, Aussois, France, pp. 143–148 (1999)
2. Mika, S., Schölkopf, B., Smola, A.J., Müller, K., Scholz, M., Rätsch, G.: Kernel PCA and de-noising in feature spaces. In: Kearns, M.S., Solla, S.A., Cohn, D.A. (eds.) Advances in Neural Information Processing Systems, vol. 11. MIT Press, Cambridge (1999)
3. Tenenbaum, J., de Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
4. Blanchard, G., Kawanabe, M., Sugiyama, M., Spokoiny, V., Müller, K.: In search of non-Gaussian components of a high-dimensional distribution. Journal of Machine Learning Research 7, 247–282 (2006)
5. Theis, F., Kawanabe, M.: Uniqueness of non-Gaussian subspace analysis. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 917–925. Springer, Heidelberg (2006)
6. Kawanabe, M., Theis, F.: Estimating non-Gaussian subspaces by characteristic functions. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 157–164. Springer, Heidelberg (2006)
7. Abed-Meraim, K., Belouchrani, A.: Algorithms for joint block diagonalization. In: Proc. EUSIPCO 2004, Vienna, Austria, pp. 209–212 (2004)
8. Févotte, C., Theis, F.: Orthonormal approximate joint block-diagonalization. Technical report, GET/Télécom Paris (2007)
9. Tong, L., Liu, R.W., Soon, V., Huang, Y.F.: Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems 38, 499–509 (1991)
10. Wiskott, L., Sejnowski, T.: Slow feature analysis: unsupervised learning of invariances. Neural Computation 14, 715–770 (2002)
11. Belouchrani, A., Meraim, K.A., Cardoso, J.F., Moulines, E.: A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing 45(2), 434–444 (1997)
12. Ziehe, A., Müller, K.R.: TDSEP—an efficient algorithm for blind separation using time structure. In: Niklasson, L., Bodén, M., Ziemke, T. (eds.) Proc. of ICANN 1998, Skövde, Sweden, pp. 675–680. Springer, Berlin (1998)
13. Müller, K.R., Philips, P., Ziehe, A.: JADE_TD: combining higher-order statistics and temporal information for blind source separation (with noise). In: Proc. of ICA 1999, Aussois, France, pp. 87–92 (1999)
14. Theis, F., Inouye, Y.: On the use of joint diagonalization in blind signal processing. In: Proc. ISCAS 2006, Kos, Greece (2006)
15. Wismüller, A., Lange, O., Dersch, D., Leinsinger, G., Hahn, K., Pütz, B., Auer, D.: Cluster analysis of biomedical image time-series. International Journal on Computer Vision 46, 102–128 (2002)
Is the General Form of Renyi's Entropy a Contrast for Source Separation?

Frédéric Vrins¹, Dinh-Tuan Pham², and Michel Verleysen¹

¹ Machine Learning Group, Université catholique de Louvain, Louvain-la-Neuve, Belgium
{vrins,verleysen}@dice.ucl.ac.be
² Laboratoire de Modélisation et Calcul, Centre National de la Recherche Scientifique, Grenoble, France
[email protected]
Abstract. Renyi's entropy-based criteria have been proposed as objective functions for independent component analysis because of their relationship with Shannon's entropy and their computational advantages in specific cases. These criteria were suggested on the basis of "convincing" experiments. However, there is no theoretical proof that globally maximizing these functions leads to the separation of the sources; actually, this was implicitly conjectured. In this paper, the problem is tackled in a theoretical way; it is shown that globally maximizing the Renyi entropy-based criterion, in its general form, does not necessarily provide the expected independent signals. The contrast-function property of the corresponding criteria depends simultaneously on the value of the Renyi parameter and on the (unknown) source densities.
1 Introduction
Blind source separation (BSS) aims at recovering underlying source signals from mixtures of them. Under mild assumptions, including mutual independence between the sources, it is known from Comon [1] that finding the linear transformation that minimizes a dependence measure between outputs can solve the problem, up to acceptable indeterminacies on the sources. This procedure is known as Independent Component Analysis (ICA). The problem can be expressed mathematically in a very simple way. Consider the square, noiseless BSS mixture model: a K-dimensional vector of independent unknown sources S = [S_1, ..., S_K]ᵀ is observed via an instantaneous linear mixture X = AS, X = [X_1, ..., X_K]ᵀ, where A is the full-rank square mixing matrix. Many separation methods are based on the maximization (or minimization) of a criterion. A specific class of separation criteria is called "contrast functions" [1]. The contrast property ensures that a given criterion is suitable to achieve BSS. Such a function i) is scale invariant, ii) only depends on the demixing matrix B and on the mixture densities, and iii) reaches its global maximum if and only if the transfer matrix W = BA is non-mixing [1]. A matrix W
is said to be non-mixing if it belongs to the subgroup 𝒲 of the general linear group GL(K) of degree K, defined as

𝒲 := { W ∈ GL(K) : ∃ P ∈ 𝒫_K, Λ ∈ 𝒟_K, W = PΛ }.   (1)

In the above definition, 𝒫_K and 𝒟_K respectively denote the groups of permutation matrices and of regular diagonal matrices of degree K. Many contrast functions have been proposed in the literature. One of the best known contrast functions is the opposite of the mutual information I(Y) [2], where Y = BX, which can be equivalently written as a sum of differential entropies h(·):

I(Y) = Σ_{i=1}^K h(Y_i) − h(Y) = Σ_{i=1}^K h(Y_i) − log |det B| − h(X).   (2)

The differential (Shannon) entropy of X ∼ p_X is defined by

h(X) := −E[log p_X].   (3)
Since I(Y) has to be minimized with respect to B, its minimization is equivalent, after a prewhitening step, to the following optimization problem:

problem 1:   max_{B ∈ SO(K)} C(B),   C(B) := − Σ_{i=1}^K h(b_i X),   (4)
where b_i denotes the i-th row of B and SO(K) is the special orthogonal group SO(K) := {W ∈ GL(K) : WWᵀ = I_K, det W = +1}, with I_K the identity matrix of degree K. The restriction B ∈ SO(K), yielding log |det B| = 0, results from the fact that, without loss of generality, the sources can be assumed to be centered with unit variance (E[SSᵀ] = I_K, and A ∈ SO(K) if the mixtures are whitened [8]). Clearly, if W = BA, problem 1 is equivalent to problem 2:

problem 2:   max_{W ∈ SO(K)} C̃(W),   C̃(W) := − Σ_{i=1}^K h(w_i S).
A few years ago, it was suggested to replace Shannon's entropy by Renyi's entropy [4,5]. More recent works still focus on this topic (see e.g. [7]). Renyi's entropy is a generalization of Shannon's in the sense that they share the same key properties of information measures [10]. The Renyi entropy of index r ≥ 0 is defined as

h_r(X) := (1/(1 − r)) log ∫_{Ω(X)} p_X^r(x) dx,   (5)

where r ≥ 0, lim_{r→1} h_r(X) = h_1(X) = h(X), and Ω(X) := {x : p_X(x) > 0}. Based on simulation results, some researchers have proposed to modify the
above BSS contrast C(B) defined in problem 1 into the following modified criterion:

C_r(B) := − Σ_{i=1}^K h_r(Y_i),   (6)

assuming implicitly that the contrast property of C_r(B) is preserved even for r ≠ 1. This is clearly the case for the specific values r = 1 (because obviously C_1(B) = C(B)) and r = 0 (under mild conditions); this can be easily shown using the entropy power and the Brunn–Minkowski inequalities [3], respectively. However, there is no formal proof that the contrast property of C_r(B) still holds for other values of r. In order to check whether this property may be lost in some cases, we restrict ourselves to checking whether a necessary condition ensuring that C_r(B) is a contrast function is met. More specifically, the criterion C̃_r(W) should admit a local maximum when W ∈ 𝒲. To see if this condition is fulfilled, a second order Taylor development of C̃_r(W) is provided around a non-mixing point W* ∈ 𝒲 in the next section. For the sake of simplicity, we further assume K = 2 and that a prewhitening is performed, so that we shall constrain W ∈ SO(2); this is sufficient for our purposes, as shown in the example of Section 3 (the extension to K ≥ 3 is easy).
2 Taylor Development of Renyi's Entropy

Setting K = 2, we shall study the variation of the criterion C̃_r due to a slight deviation of W from any W* ∈ 𝒲 ∩ SO(2), of the form W ← EW*, where

E = [ cos θ   sin θ ; −sin θ   cos θ ]   (7)

and θ ≈ 0 is a small angle. This kind of update covers the neighborhood of W* in SO(2): if W, W* ∈ SO(2), there always exists Φ ∈ SO(2) such that W = ΦW*; Φ can be written in the form E, and if W is further restricted to the neighborhood of W*, θ must be small enough. Let us first consider a first order expansion of the criterion, to analyse whether non-mixing matrices are stationary points of the criterion; this is obviously a necessary condition for C_r(B) to be a contrast function.

2.1 First Order Expansion: Stationarity of Non-mixing Points

Let Z be a random variable independent from Y. From the definition of Renyi's entropy given in eq. (5), the Renyi entropy of Y + εZ is

h_r(Y + εZ) = (1/(1 − r)) log ∫ p_{Y+εZ}^r(x) dx,   (8)

where the density p_{Y+εZ} reduces, up to first order in ε, to [9]:

p_{Y+εZ}(y) = p_Y(y) − ε ∂[E(Z|Y = x) p_Y(x)]/∂x |_{x=y} + o(ε).   (9)
Therefore, we have:

p_{Y+εZ}^r(y) = p_Y^r(y) − ε r p_Y^{r−1}(y) ∂[E(Z|Y = x) p_Y(x)]/∂x |_{x=y} + φ(ε, y),   (10)

where φ(ε, y) is o(ε). Hence, noting that log(1 + a) = a + o(a) as a → 0, equations (8) and (9) yield¹

h_r(Y + εZ) = h_r(Y) − ε (r/(1 − r)) [ ∫ p_Y^{r−1}(y)[E(Z|Y)p_Y]'(y) dy / ∫ p_Y^r(y) dy ] + o(ε).   (11)

But, by integration by parts, one gets

∫ p_Y^{r−1}(y)[E(Z|Y)p_Y]'(y) dy = −(r − 1) ∫ p_Y^{r−2}(y) E(Z|Y = y) p_Y'(y) p_Y(y) dy,   (12)

yielding

−(r/(1 − r)) ∫ p_Y^{r−1}(y)[E(Z|Y)p_Y]'(y) dy / ∫ p_Y^r(y) dy = −r ∫ p_Y^{r−2}(y) E(Z|Y = y) p_Y'(y) p_Y(y) dy / ∫ p_Y^r(y) dy.   (13)

From the general iterated expectation lemma (p. 208 of [6]), the right-hand side of the above equality equals

−r E[p_Y^{r−2}(Y) p_Y'(Y) Z] / ∫ p_Y^r(y) dy = E[ψ_r(Y)Z],   (14)

if we define the r-score function ψ_r(Y) of Y as

ψ_r(Y) := − r p_Y^{r−2}(Y) p_Y'(Y) / ∫ p_Y^r(y) dy = − (1/p_Y(Y)) (p_Y^r)'(Y) / ∫ p_Y^r(y) dy.   (15)
Observe that the 1-score reduces to the score function of Y, defined as −(log p_Y)'. Then, using eq. (11) and noting Y = EW*S, cos θ = 1 + o(θ) and sin θ = θ + o(θ), the criterion C̃_r(EW*) becomes, up to first order in θ,

C̃_r(EW*) = −h_r(Y_1) − h_r(Y_2) ≈ C̃_r(W*) ± θ ( E[ψ_r(S_1)S_2] − E[ψ_r(S_2)S_1] ).   (16)

The sign of θ in the last equation depends on the matrix W*; for example, if W* = I_2, it is negative, and if the rows of I_2 are permuted in the last definition of W*, it is positive. Recall that the criterion is not sensitive to a left multiplication of its argument by a scale and/or permutation matrix; for instance, C̃_r(W*) = C̃_r(I_2) = −h_r(S_1) − h_r(S_2). Since independence implies non-linear decorrelation, both expectations vanish in eq. (16), and it results that C̃_r admits a stationary point at every W* ∈ 𝒲.
¹ Provided that there exist ε₀ > 0 and an integrable function Φ(y) > 0 such that for all y ∈ R and all |ε| < ε₀, |φ(ε, y)/ε| < Φ(y). It can be shown that this is indeed the case under mild regularity assumptions.
2.2 Second Order Expansion: Characterization of Non-mixing Points

Let us now characterize these stationary points. To this end, consider the second order expansion of p_{Y+εZ} provided in [9] (Z is assumed to be zero-mean to simplify the algebra):

p_{Y+εZ} = p_Y + (ε²/2) E(Z²) p_Y'' + o(ε²).   (17)

Therefore, since Renyi's entropy is not sensitive to translation, we have, for r > 0:

h_r(Y + εZ) = h_r(Y) + (ε²/2) J_r(Y) var(Z) + o(ε²),   where   J_r(Y) := (r/(1 − r)) ∫ p_Y^{r−1}(y) p_Y''(y) dy / ∫ p_Y^r(y) dy,   (18)

and J_r(Y) is called the r-th order information of Y. Observe that the first order information reduces to J_1(Y) = E[ψ_1²(Y)], i.e. to Fisher's information [2]. In order to study the "nature" of the stationary point reached at W* (minimum, maximum, saddle), we shall check the variation of C̃_r resulting from the update W ← EW* up to second order in θ. Clearly, cos θ = 1 − θ²/2 + o(θ²) and tan θ = θ + o(θ²); the criterion then becomes:

C̃_r(EW*) = −h_r(Y_1) − h_r(Y_2)
= −h_r(S_1 + tan θ S_2) − h_r(S_2 − tan θ S_1) − 2 log |cos θ|
= C̃_r(W*) − (θ²/2)[J_r(S_1)var(S_2) + J_r(S_2)var(S_1)] − 2 log |1 − θ²/2| + o(θ²)
= C̃_r(W*) − (θ²/2)[J_r(S_1)var(S_2) + J_r(S_2)var(S_1) − 2] + o(θ²),   (19)

where we have used h_r(αY) = h_r(Y) + log |α| for any real number α > 0. This clearly shows that if the sources share the same density, with variance var(S) and r-th order information J_r(S), the sign of C̃_r(EW*) − C̃_r(W*) equals, up to second order in θ, sign(1 − J_r(S)var(S)). In other words, the criterion reaches a local minimum at any W* ∈ 𝒲 if J_r(S)var(S) < 1, instead of an expected global maximum. In this specific case, maximizing the criterion does not yield the sought sources.

3 Example

A necessary and sufficient condition for a scale-invariant criterion f(W), W ∈ SO(K), to be an orthogonal contrast function is that the set of its global maximum points matches the set of orthogonal non-mixing matrices, i.e. argmax_{W∈SO(K)} f(W) = 𝒲. Hence, in the specific case where the two sources share the same density p_S, it is necessary that the criterion admits (at least) a local maximum at non-mixing matrices. Consequently, according to the results derived in the previous section, the inequality J_r(S)var(S) < 1 implies that the sources cannot be recovered through the maximization of C_r(B).

3.1 Theoretical Characterization of Non-mixing Stationary Points
The above analysis would be useless if the inequality J_r(S)var(S) < 1 were never satisfied for non-Gaussian sources. Actually, it can be shown that simple and common non-Gaussian densities satisfy this inequality. This is e.g. the case of the triangular density. We assume that both sources S_1, S_2 share the same triangular density p_T²:

p_T(s) = (1 − |s/√6|)/√6   if |s| ≤ √6,   and   p_T(s) = 0   otherwise.   (20)

Observe that E[S_i] = 0 and var(S_i) = 1, i ∈ {1, 2}, and note that, using integration by parts, the r-th order information can be rewritten as

J_r(Y) = r ∫ p_Y^{r−2}(y)[p_Y'(y)]² dy / ∫ p_Y^r(y) dy.

Then, for S ∈ {S_1, S_2} and setting u = 1 − s/√6:

J_r(S) = r ∫₀^{√6} (1 − s/√6)^{r−2} ds / [ 6 ∫₀^{√6} (1 − s/√6)^r ds ] = r ∫₀¹ u^{r−2} du / [ 6 ∫₀¹ u^r du ] = r(r + 1)/[6(r − 1)] if r > 1, and ∞ if r ≤ 1.

Thus J_r(S)var(S) < 1 if and only if r(r + 1)/[6(r − 1)] < 1. But for r ≥ 1, the last inequality is equivalent to 0 > r(r + 1) − 6(r − 1) = (r − 2)(r − 3). Therefore J_r(S)var(S) < 1 if and only if 2 < r < 3, as shown in Figure 1(a). We conclude that for a pair of triangular sources, the criterion C_r(B) is not a contrast for 2 < r < 3.
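The closed form r(r + 1)/[6(r − 1)] is easy to confirm numerically. The sketch below (ours; grid size illustrative) evaluates J_r(S) for the triangular density by simple quadrature and flags the region where J_r(S)var(S) < 1, exploiting the symmetry of p_T about 0 and the fact that [p_T'(s)]² = 1/36 almost everywhere.

```python
import numpy as np

sqrt6 = np.sqrt(6.0)
ds = 1e-5
s = np.arange(ds / 2, sqrt6, ds)            # midpoint grid on (0, sqrt(6))
p = (1 - s / sqrt6) / sqrt6                 # triangular density, eq. (20)

for r in [1.5, 2.0, 2.5, 3.0, 4.0]:
    # J_r(S) = r * int p^{r-2} (p')^2 dy / int p^r dy, with (p')^2 = 1/36;
    # the grid spacing and the symmetry factors of 2 cancel in the ratio
    Jr = r * np.sum(p ** (r - 2)) / 36.0 / np.sum(p ** r)
    print(r, Jr, r * (r + 1) / (6 * (r - 1)), Jr < 1)   # var(S) = 1
```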
3.2 Simulation
Let us note Y = [Y_1, Y_2]ᵀ, Y = W_θ S, where W_θ is a 2D rotation matrix of angle θ, of the same form as E but where θ can take arbitrary values in [0, π]. The criterion −(h_r(Y_1) + h_r(Y_2)) is plotted with respect to the transfer angle θ. Obviously, the set of non-mixing points reduces to {W_θ : θ ∈ {kπ/2 | k ∈ Z}}. Drawing this figure requires some approximations, and we are aware of the fact that it does not constitute a proof of the violation of the contrast property by itself; this proof is provided in the above theoretical development, where it is shown that, independently of any problem of e.g. density estimation or exact integration approximation, Renyi's entropy is not always a contrast for BSS. The purpose of this plot, complementary to Section 2, is to show that in practice, too, the use of Renyi's entropy with an arbitrary value of r might be dangerous. Figure 1(b) has been drawn as follows. For each angle θ ∈ [0, π], the exact triangular probability density function p_T is used to compute the pdfs of sin θ·S and cos θ·S, S ∼ p_T², by using the well-known formula for the pdf of a transformation of random variables.
² This density is piecewise differentiable and continuous. Therefore, even if the density expansions are not valid everywhere, eq. (19) is still of use.
[Fig. 1. Triangular sources. (a): log(J_r(S)var(S)) vs. r. (b): Estimated Renyi criterion C̃_r(E_θ) vs. θ, for r = 1 (top), r = 2.5 (middle) and r = 5 (bottom). The criterion is not a contrast function for r = 2.5 and r = 5.]
Then, the output pdfs are obtained by convolving the pdfs of the independent sources scaled by sin θ and cos θ. Finally, Renyi's entropy is computed by replacing exact integration by a Riemannian summation restricted to points where the output density is larger than τ = 10⁻⁴, to avoid numerical problems resulting from the log operator. At each step, it is checked that the pdfs of sin θ·S, cos θ·S, Y_1 and Y_2 integrate to one with an error smaller than τ, and that the variance of the outputs deviates from unity by less than τ. Note that at non-mixing points, the exact density p_T is used as the output pdf to avoid numerical problems. The two last plots of Figure 1(b) clearly indicate that the problem can be exhibited even when dealing with an approximated form of Renyi's entropy. At the top of the figure (r = 1), the criterion C̃_r(W) = C̃(W) (or, more precisely, C_r(B) = C(B)) is a contrast function, as expected. In the middle plot (r = 2.5), C̃_r(W_θ) admits a local minimum point when W_θ ∈ 𝒲 (this results from J_r(S)var(S) < 1), and thus violates a necessary requirement of a contrast function. Finally, in the last plot (r = 5), the criterion is not a contrast even though J_r(S)var(S) > 1, since the set of global maximum points of the criterion does not correspond to the set 𝒲.
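The numerical procedure just described can be condensed into the following Python sketch (ours; grid sizes and thresholds are illustrative). As in the text, the exact density p_T is used at the non-mixing angles, and Renyi's entropy is approximated by a thresholded Riemann sum.

```python
import numpy as np

sqrt6 = np.sqrt(6.0)
pT = lambda s: np.where(np.abs(s) <= sqrt6, (1 - np.abs(s) / sqrt6) / sqrt6, 0.0)

def renyi(p, dy, r, thr=1e-4):
    q = p[p > thr]                          # Riemann sum over p > threshold
    return np.log(np.sum(q ** r) * dy) / (1 - r)

dy = 0.005
y = np.arange(-10, 10, dy)

def criterion(theta, r):
    c, s = np.cos(theta), np.sin(theta)
    total = 0.0
    for (a1, a2) in [(c, s), (-s, c)]:      # Y1 = c*S1 + s*S2, Y2 = -s*S1 + c*S2
        if min(abs(a1), abs(a2)) < 1e-12:
            total += -renyi(pT(y), dy, r)   # non-mixing: use the exact density
            continue
        p1 = pT(y / a1) / abs(a1)           # pdf of a1*S1
        p2 = pT(y / a2) / abs(a2)           # pdf of a2*S2
        total += -renyi(np.convolve(p1, p2) * dy, dy, r)   # pdf of the sum
    return total

for theta in np.linspace(0, np.pi, 7):
    print(round(theta, 2), round(criterion(theta, 2.5), 4))
```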
4 Conclusion
In this paper, the contrast property of a well-known Renyi entropy-based criterion for blind source separation is analyzed. It is proved that, at least in one realistic case, globally maximizing the related criterion does not provide the expected sources, whatever the value of the Renyi exponent; the transfer matrix W globally maximizing the criterion might be a mixing matrix, with possibly
more than one non-zero element per row. Even worse, it is not guaranteed that the criterion reaches a local maximum at non-mixing solutions! Actually, the only thing we are sure of is that the criterion is stationary at non-mixing matrices. This is weak information since, even if the criterion has a local maximum (resp. minimum) point at non-mixing matrices, a stationary point might also exist at a mixing solution, i.e. at a W such that the components of WS are not proportional to distinct sources. Consequently, the value of the Renyi exponent has to be chosen with respect to the source densities in order to satisfy Σ_{i=1}^K J_r(S_i)var(S_i) > K (again, this is not a sufficient condition: it does not ensure that the local maximum is global). Unfortunately, the problem is that the sources are unknown. Hence, nowadays, the only way to guarantee that C_r(B) is a contrast function is to set r = 1 (mutual information criterion) or r = 0 (log-measure of the supports criterion, which requires that the sources are bounded); it can be shown that counter-examples exist for any other value of r, including r = 2. To conclude, we would like to point out that, contrary to the kurtosis criterion case, it seems that there does not exist a simple mapping φ[·] (such as e.g. the absolute value or even powers) that would match the set argmax_B φ[C_r(B)] to the set {B : BA ∈ 𝒲}, where 𝒲 is the set of non-mixing matrices, because there is no information about the sign of the relevant local optima.
References

1. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
2. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley Series in Telecommunications. Wiley and Sons, Inc., Chichester (1991)
3. Dembo, A., Cover, T.M., Thomas, J.A.: Information theoretic inequalities. IEEE Transactions on Information Theory 37(6), 1501–1518 (1991)
4. Erdogmus, D., Hild, K.E., Principe, J.: Blind source separation using Renyi's mutual information. IEEE Signal Processing Letters 8(6), 174–176 (2001)
5. Erdogmus, D., Hild, K.E., Principe, J.: Blind source separation using Renyi's α-marginal entropies. Neurocomputing 49, 25–38 (2002)
6. Gray, R., Davisson, L.: An Introduction to Statistical Signal Processing. Cambridge University Press, Cambridge (2004)
7. Hild, K.E., Erdogmus, D., Principe, J.: An analysis of entropy estimators for blind source separation. Signal Processing 86, 182–194 (2006)
8. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley and Sons, Inc., New York (2001)
9. Pham, D.-T.: Entropy of a variable slightly contaminated with another. IEEE Signal Processing Letters 12(7), 536–539 (2005)
10. Renyi, A.: On measures of entropy and information. Selected Papers of Alfred Renyi 2, 565–580 (1976)
A Variational Bayesian Algorithm for BSS Problem with Hidden Gauss-Markov Models for the Sources Nadia Bali and Ali Mohammad-Djafari Laboratoire des Signaux et Syst`emes, Unit´e mixte de recherche 8506 (CNRS-Sup´elec-UPS) Sup´elec, Plateau de Moulon, 91192 Gif-sur-Yvette, France
Abstract. In this paper we propose a Variational Bayesian (VB) estimation approach for Blind Sources Separation (BSS) problem, as an alternative method to MCMC. The data are M images and the sources are N images which are assumed piecewise homogeneous. To insure these properties, we propose a piecewise Gauss-Markov model for the sources with a hidden classification variable which is modeled by a Potts-Markov field. A few simulation results are given to illustrate the performances of the proposed method and some comparison with other methods (MCMC and VBICA) used for BSS, are presented.
Introduction We consider the problem of sources separation in the case of instantaneous mixture with noisy images. We propose to use the Bayesian inference which gives the possibility to take into account uncertainties and all prior knowledge on the model of sources and observations. We assign priors to the noise, to the sources, to the mixing matrix and to all the hyperparameters of the model and obtain the expression of the posterior of all the unknowns. However, using this posterior to compute its mode, its mean or its exploration needs approximation. Classical methods of approximations are i) numerical methods such as MCMC which requires in practice a great computational effort, ii) analytical methods such as asymptotic approximation of Laplace which is more easily practicable but makes typically a rough approximation. In this paper we propose an alternative approximation method which is based on a variational approach which offers a practical framework in term of efficiency and precision[1],[2]. Indeed the goal is to give an approximation to marginal likelihood or the model evidence which is the distribution of the data knowing the model. The model evidence is obtained by integrating over the hidden variables of the joined posterior law. Using variational approximation of the posterior law is not new in image processing and in particular in image segmentation [3]. However, in our knowledge its use in blind image separation with a particular mixture of Gaussians with a Potts Markov model is new. The hidden Potts Markov model is and many works exist on different approximations such Mean M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 137–144, 2007. c Springer-Verlag Berlin Heidelberg 2007
138
N. Bali and A. Mohammad-Djafari
Field Approximation (MFA) [4],[5] to resolve the problem. Here we propose a global variational approach that insures us to maximize free energy with a more complex hierarchical model since our problem is extended to sources separation problem.
1
Sources Separation Model
An instantaneous sources separation problem can be written as: x(r) = As(r) + (r),
(1)
Where: - x(r) = {xi (r), i = 1, · · · , M } is a set of M images (observations) and r ∈ R = {1, · · · , R} is a pixel position with R the total number of pixels. - A is an unknown mixture matrix with dimension (M, N ), - s(r) = {sj,r , j = 1, · · · , n} is a set of N unknown components (sources images); In the following, we assume that the errors white, Gaussian (r) are centered, 1 1 with inverse covariance matrix Σ = diag σ2 , · · · , σ2 . 1
M
Now, if we note by x = {x(r), r ∈ R}, s = {s(r), r ∈ R} and = {(r), r ∈ R}, then we can write x = As + . (2) and p(x|s, Σ ) =
N (As(r), Σ −1 )
(3)
r
We note that Σ is an inverse covariance matrix. 1.1
Sources Modelization
We propose to model the prior marginal law of each source sj (r) by a mixture of Gaussians model: p(sj (r)) =
K
p(zj (r) = k) N (mj k , σj2 k )
(4)
k=1 2 which implies that p(sj (r)|zj (r) = k) = N (mj k , σj k ) where p(zj (r) = k) = αj,k and k αj,k = 1. This model is appropriate for the image sources which we consider, where the discrete valued hidden variables zj (r) ∈ {1, · · · , Kj } represent the classification labels of the source images pixels sj (r). To insure some spatial regularity to these labels, they are modelized by a Potts-Markov ⎡ ⎤ random field: δ(zj (r) − zj (r ))⎦ . p(zj (r)|zj (r ), r = r, r ∈ R) ∝ exp ⎣βj r ∈V(r)
The parameters βj controls the mean size of regions.
A Variational Bayesian Algorithm
1.2
139
Prior Models for the Mixing Matrix and the Hyperparameters
Mixing matrix model: We consider a Gaussian distribution law for mixture matrix, so the prior distribution of A is given by: π(A|A0 , Γ 0 ) = N (A0 , Γ 0 ).
(5)
Inverse covariance noise model: We assign a Wishart distribution to the covariance of noise Σ π(Σ |ν0 , Σ 0 ) ∝ |Σ |
(ν0 −M −1) 2
1 exp − Tr Σ Σ −1 0 2
(6)
Where ν0 is the number of degrees of freedom and Σ 0 is the prior covariance matrix. Means and variances of different classes: We assign Gaussian laws to the means: π(mz |μ0 , T0 ) = N (μ0 , T0 ), (7) and Wishart law to the inverse covariances π(Σ z |ν0 , V0 ) = W(ν0 , V0 ).
2
(8)
Variational Bayesian Algorithm
Our goal is to obtain a separable approximation q(s, z, θ) = q(s, z) q(θ) for the joint posterior law p(s, z, θ|x, M) of (s, z) and θ = (A, Σ , mz , Σ z ). The idea is thus to minimize the Kullback Leibler divergence between the approximate distribution law q(s, z)q(θ) and the joint posterior law on the hidden variables and the parameters p(s, z, θ|x): KL[q(s, z|x, M)q(θ|x, M)||p(s, z, θ|x))] =
q(s, z|x, M)q(θ|x, M) q(s, z|x, M)q(θ|x, M) ln dθ ds p(s, z, θ|x) z where M is the model. In case of sources separation M represents the number of sources N . Developing this expression at one side and looking to the expression of the evidence p(x|M) at the other-side, it is easy to show that: KL[q(s, z|x, M)q(θ|x, M)||p(s, z, θ|x))] = ln p(x|M) − F(q(s, z), q(θ)) (9) where F is given by:
F (q(s, z), q(θ)) =
⎡ dθq(θ) ⎣
ds
z
⎤ p(θ|M) ⎦ p(x, s, z|θ, M) + ln q(s, z) ln q(s, z) q(θ) (10)
140
N. Bali and A. Mohammad-Djafari
which is called free energy of the model. From these relations we see that minimizing KL is equivalent to maximizing F . The distribution of the variational approximation q(s, z) and q(θ) must belong to a family of distributions simpler than that of the posterior distribution p(s, z, θ|x). Obtaining expressions for q(s, z) and q(θ) is done iteratively. The family of distributions is selected such that q be in the same family than the true posterior distributions. [2] [6] noted that important simplifications are made when updating the variational equations if the choice of the distributions of the variables conditionally to their parameters is done from conjugated exponential families model. In this case, the posterior distribution has analytically stables and intuitive forms. To optimize F (q(s, z), q(θ)) we simply equate to zero the functional derivatives with respect to each distribution q. In summary, the two main hypothesis of the proposed method are: i) posterior independence between (s, z) and θ, ii) separable conjugate priors for all the hyperparameters. This last hypothesis associated with dF dq = 0 results to: q(θi ) ∝ exp(< log p(x|θ, M) >q(θ|i ) )π(θi ), where < f (x) >q = Eq {f (x)} = q(x)f (x)dx. 2.1
(11)
Approximate Posterior Laws for Mixing Matrix and Hyperparameters
Approximation posterior law for mixing matrix: We note by Av the vector wise representation of a matrix defined by :Av = [A(1,.) , · · · , A(m,.) ]t . By dF taking the functional derivative of Eq.(10) and equating to zero dq(A) = 0, we get the update: q(A) ∝ π(A) exp(< ln p(x|s, Σ ) >q(Σ )q(s) )
(12)
With the appropriate conjugate prior π(A) that we chosen in (5), it is easy to Σ A ) = N (A, Σ A ), with: see that q(A|A, A = RΣ ⊗( Σ
−1
)+ q(z(r))Σ s|z
z(r)
and
r
⊗( Σ
q(z(r)) sz (r) stz (r)) + Σ A (13)
z(r)
⎡ ⎤ −1 v = Σ ⎣ (xt (r) ⊗ ( A q(z(r)) stz (r))) + Γ p Ap ⎦ (14) Σ A r
z(r)
Approximate posterior law for noise inverse covariance: q(Σ ) is obdF tained by equating to zero dq(Σ = 0 which results to: ) (R + ν0 − m − 1) 1 ln |Σ | − (Σ Q ) (15) 2 2 −1 −1 where Q = Σ −1 0 + Q + RDA ( z q(z)Σ s|z , Σ A , A). For the definition of DA and for more details see [7], defining a posterior wishart distribution with a mean = (R + ν0 )Q−1 matrix Σ . ln(q(Σ )) =
A Variational Bayesian Algorithm
141
Approximate posterior laws for mz and Σ z : By writing F as a function of mz and Σ z only we can differentiate with respect to these hyperpaz ) = N ( mz , T mz , Tz ), rameters to yield the following update equations : q(mz | νz , Vz ) = W( νz , Vz ). The details of deriving the update equations are q(Σ z | omitted due to the space constraints. They can be obtained in [8]. 2.2
Approximate Posterior Laws for Hidden Variables
Approximate posterior distribution q(s|z): The expression of q(s|z) is dF obtained by dq(s(r)|z(r)) = 0 which results to : 1 x(r))t s(r)−1(s(r)−mz )tΣ z (s(r)−mz ) t Σ ln q(s(r)|z(r)) =− s(r)t Qs(r)+(A 2 2 A + F (Σ , Σ −1 Σ ) which is quadratic in s(r). For the definition with Q = A A −1 of F (Σ , Σ A ) and for more details see [7]. In summary, we obtain s|z ) with q(s(r)|z(r)) = N ( sz (r), Σ s|z = Q + Σ z Σ
(16)
x(r) + Σ z mz ] −1 [A t Σ sz (r) = Σ s|z
(17)
Approximate posterior law for labels variables: Deriving an expression for q(z|x, M) is the most difficult task in this paper due to its Markovian model. Hopefully, using the four nearest neighbors neighborhood system often used in image processing, it is easy to divide z into two subsets zN and zB in the manner of a chess-board. From this, we only need to work with the distributions qzB (z B ) and qzN (z N ). Thus, each white pixel (respectively black) has its four black neighbors (respectively white). All the white pixels (respectively black), knowing the black pixels (respectively white), are thus independent. We can thus write: qz (z|x, M) = qzN (z N |x, M)qzB (z B |x, M). The expression of qzB is obtained by
dF dqzB
= 0:
qzB (z B |x, m) ∝ exp{< ln p(z B |z N , β) >q(zN ) +HB (r)} where: HB (r) =< ln p(s|z, mz , Σ z ) + ln p(x|s, z, A, Σ ) >q(θ),q(s|z),q(zN ) Expanding these expressions, we obtain: exp β δ(zr − zr )q(zr = k) + HB (r) qzB (z B |x, m) ∝ r∈RB
r
k
(18)
142
N. Bali and A. Mohammad-Djafari
with: HB (r)
=
1 −1 + m z sz (r) zΣ z sz (r) − 1 Tr Σ tz Σ − stz (r)Σ s|z 2 2 1 t 1 −1 z − Tr Σ z Σ z m m − z Tz 2 2 M R ν +R+1−i | + ) + M ln 2 + ln |Σ Ψ( 2 i=1 2 ⎧ ⎫ ⎬ 1 −1 −1 R ⎨ Q s|z , Σ A , A) − Tr Σ q(z(r))Σ − Tr Σ DA ( ⎭ 2 2 ⎩ z(r)
In the same manner, we obtain an expression for qzN (z N |x, M) qzN (z N |x, M) ∝
exp β δ(z(r) − z(r ))q(z(r ) = k) + HN (r)
r∈RN
r
k
(19) where: HN (r) =< ln p(s|z, mz , Σ z ) + ln p(x|s, z, A, Σ ) >q(θ),q(s|z),q(zB ) Given all these expressions, the general iterative algorithm obtained and proposed consists in updating successively ⎧ q(s|z, x, θ, M), using (16, 17) ⎪ ⎪ ⎨ qzB (z B |z N , x, M), using (18) q ⎪ z (z |z , x, M), using (19) ⎪ ⎩ N N B q(θi |s, z, x, θ|i , M), using (13, 14, 15) Once these iterations converged, one can use these laws to provide estimation for {s, z and θ}.
3
Free Energy Estimation
The estimate of F enables us to have a criterion of convergence. The maximization of the free energy is made by an iterative procedure (Figure: 2, (a)), following a total iteration which contains an update of all the parameters. We use a threshold on ΔF F to stop the iterations. Since calculations of the parameters for all the q functional are already made, it is easy to calculate F : F
=
< ln p(x|s, z, A, Σ ) >q(s|z),q(A),q(Σ ) − < KL(q(s|z)||p(s|z)) >q(z) − KL(q(A)||p(A)) − KL(q(Σ )||p(Σ )) − KL(q(mz )||p(mz )) − KL(q(Σ z )||p(Σ z ))
We also use the final values of F for model selection.
A Variational Bayesian Algorithm
4
143
Results
To evaluate the performance of our method we generate synthetic data with: 0.86 0.44 10 0 −1 1 0.01 0.01 A= , Σ = , mz = , Σz = and 0.50 0.89 0 10 1 −1 0.01 0.01 β = 1.5. We choose to compare with MCMC method that had been proposed with the same modelization (Figure 1: (d), (e)) This enables us to see a gain in computation time. We also compared with the method (VBICA [8]) (Figure 1: (c), (E)). This method uses a mixture of Gaussian with an independent model on the coefficients
10
10
10
10
10
10
20
20
20
20
20
20
30
30
30
30
30
30
40
40
40
40
40
40
50
50
50
50
50
60
60
60
60
60
10
20
30
40
50
60
10
20
S1
30
40
50
60
10
20
30
40
50
60
10
20
b1 S
X1
30
40
50
50
60
60
10
20
Sb1
30
40
50
60
10
10
10
10
10
10
20
20
20
20
20
20
30
30
30
30
30
30
40
40
40
40
40
40
50
50
50
50
50
60
60
60
60
60
20
30
40
50
60
10
20
S2 (a)
30
40
50
60
10
20
30
40
50
60
10
b2 S (c)
X2 (b)
20
30
40
50
30
40
50
60
40
50
60
b1 Z
10
10
20
Sb1
50
60
60
10
20
Sb2 (d)
30
40
50
60
10
Sb2 (e)
20
30
b2 Z (f )
Fig. 1. Results of separation for two images X 1 et X2 (b) obtained by an instantaneous mixture of S1 et S2 (a) every images is modelized by Markov model. In (c), (d) and (e) are represented sources obtained by three different methods VB ICA, MCMC and the proposed approach with their associated segmentation. 4
−1
x 10
1
0.9 −2
0.8
0.7 −3
0.6
0.5
−4
0.4 −5
0.3
0.2 −6
0.1
−7
0
20
40
60
(a)
80
100
120
0
2
3
4
5
6
7
(b)
Fig. 2. (a) evolution of F during the iterations (b) Final values of F for simulated data composed with 4 mixed images obtained from 3 sources. F has been normalized with respect to its minimum value for N = 4. The maximum is obtained for the right sources number 3.
144
N. Bali and A. Mohammad-Djafari
of mixture contrary to the method we propose. This type of model is often used in a variational estimation[1],[2] considering the a priori law are always selected in separable families. Our method converges at the end of 120 iterations (Figure 2: (a)) whereas method (VBICA) reached 500 iterations without convergence for a degree of tolerance on F of 1e−4 . F enables us to make the suitable choice of the model. For a simulated example with three sources we compute the converged values of F for different number of sources from 2 to 7. As it is shown in the (figure 2: (b)), maximum of F is 3 which is the good result.
5
Conclusion
In this paper we propose a new approach for sources separation problem based on Bayesian estimation and variational approximation. Variational approximation is applied to the posterior law obtained with a more complex source model. The proposed source model is marginally a mixture of Gaussians but also we assigned a Potts model to its hidden variable which enforces the homogeneity and compactness of the pixel positions in the same class. Compared to VBICA, our source model is reacher but also due to the Potts model, it is non-separable. This makes the expressions of the posterior law more complex. However, the main benefice of the complex modelization of the sources is that the hidden variables now can be used as a non-supervised segmentation result. The proposed method then does simultaneously separation and segmentation.
References 1. Miskin, J.W.: Ensemble Learning for Independent Component Analysis, Phd thesis, Cambridge (2000), http://www.inference.phy.cam.ac.uk/jwm1003/ 2. Attias, H.: A variational Bayesian framework for graphical models. In: Solla, S., Leen, T.K., Muller, K.L. (eds.) Advances in Neural Information Processing Systems, vol. 12, pp. 209–215. MIT Press, Cambridge (2000) 3. Cheng, L., Jiao, F., Schuurmans, D., Wang, S.: Variatonal bayesian image modelling. In: Proceedings of the 22nd international conference on Machine learning, Bonn Germany, August 2005, vol. 119, pp. 129–136. ACM Press, New York (2005) 4. Peyrard, N.: Approximations de type champ moyen des probl`emes de champ de Markov pour la segmentation des donn´ees spatiales, Ph.D. thesis, Universit´e Joseph Fourrier (2001) 5. Bali, N., Mohammad-Djafari, A.: Mean Field Approximation for BSS of images with compound hierarchical Gauss-Markov-Potts model. In: Proceedings of MaxEnt05, august 2005, pp. 249–256 (2005) 6. Ghahramani, Z., Beal, M.J.: Propagation algorithms for variational Bayesian learning. In: Leen, T.K., Dietterich, T., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 507–513. MIT Press, Cambridge (2001) 7. Ichir, M.M., Mohammad-Djafari, A.: A mean field approximation approach to blind source separation with lp priors. In: Eusipco, Antalya, Turkey (September 2005) 8. Choudrey, R.A., Roberts, S.J.: Bayesian ICA with Hidden Markov Model Sources. In: ICA, NARA, JAPAN (April 2003)
A New Source Extraction Algorithm for Cyclostationary Sources C. Capdessus1, A.K. Nandi2, and N. Bouguerriou1 1
Laboratoire d’Electronique, Signaux et Images 21 rue Loigny la Bataille 28000 Chartres – France
[email protected] 2 Department of Electrical Engineering and Electronics The University Liverpool, Brownlow Hill Liverpool, L69 3GJ, UK
[email protected]
Abstract. Cyclostationary signals can be met in various domains, such as telecomunications and vibration analysis. Cyclostationarity allows to model repetitive signals and hidden periodicities such as those induced by modulation for communications and by rotating machines for vibrations. In some cases, the fundamental frequency of these repetitive phenomena can be known. The algorithm that we propose aims at extracting one cyclostationary source, whose cyclic frequency is a priori known, from a set of observations. We propose a new criterion based on second order statistics of the measures which is easy to estimate and leads to extraction with very good accuracy. Keywords: Source extraction, cyclostationarity, second order statistics, vibration analysis.
1 Introduction The method we propose here has been developed within the frame of vibration analysis applied to rotating machinery monitoring. Rotating machines produce vibrations whose characteristics depend on the machine state. These vibrations can thus be used to monitor systems such as engines, roller bearings, toothed gearings [1]. One of the problems that arises for complex systems is that each vibration sensor receives a mixture of the vibrations produced by the different parts of the system. These vibrations are usually wideband and spread over much of the spectrum. Though these different contributions can be separated neither in the time domain nor in the frequency domain, some hope still lays in their cyclostationary features. Indeed, due to their symmetric geometry and repetitive movements, each part of the system produces random but repetitive vibrations which are cyclostationary at specific frequencies [2], [3], [4]. In order to avoid early wearing, the different parts are usually designed with different characteristics, ensuring that the system will seldom come back to its initial position. This precaution also ensures that these all M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 145–151, 2007. © Springer-Verlag Berlin Heidelberg 2007
146
C. Capdessus, A.K. Nandi, and N. Bouguerriou
produce vibrations cyclostationary at different cyclic frequencies. Furthermore, the rotation speed is ususally known or easy to measure, so that these frequencies can be computed. We are interested here in the case when one peculiar part of the system, whose cyclic frequency is a priori known, is to be monitored, and no other assumption about the other parts is made except that, if cyclostationary, their cyclic frequencies are different from the one we are interested in. The methods previously developped [5], [6], which suppose that all the cyclic frequencies of the sources are known, are not fitted to this case. We thus developped a new source extraction algorithm based on the only cyclostationary properties of the source to be extracted. In order to derive an algorithm as simple and robust as possible, we chose to base the extraction criterion on second order statistics. We first present the principle of the method, with thorough calculations in the two sources two mixtures case and application to three different cases.
2 Principle of the Method 2.1 Criterion The method will be described and proofs will be given in the two sources and two mixtures case. We assume that the two mixtures are additive and we call the source vector S = [si ] , the mixture vector X = [xi ] and the mixing matrix A = aij (all real
[ ]
valued), where (i, j ) ∈ {1,2} and represent respectively lines and columns. They are related by: 2
X = AS The source
(1)
s1 is supposed to be cyclostationary at a known frequency α 0 . The
source s 2 can be either stationary or cyclostationary at any cyclic frequency except to
α 0 . The sources are supposed to be uncorrelated and their respective powers are denoted σ 12 and σ 22 . Estimating both sources would mean estimating a B matrix
such that Z = BX is an estimate of the source vector S. Our goal is only to retrieve the first source on the first estimate vector component, i.e. to find coefficients b11 and
b12 such that z1 = b11 x1 + b12 x 2
(2)
is an estimate of s1 . The method that we propose consists in maximizing
cyclostationarity at frequency α 0 on z1 and minimizing the power of z1 , in order to ensure that only the cyclostationary source is kept on that estimate. The criterion that we chose to minimize is given by:
C (b11 , b12 ) =
R z1 (0 ) R z1 (0) α0
(3)
A New Source Extraction Algorithm for Cyclostationary Sources
147
where Rz1 (0 ) and R z1 (0 ) are coefficients of the Fourier series decomposition of the α0
autocorrelation function of z1 respectively for cyclic frequencies 0 and α 0 and zero time lag. 2.2 Theoretical Validation of the Criterion In the 2*2 case, the estimate z1 can be written as a function of the mixing matrix A, the demixing coefficients b11 and b12 , and the sources.
z1 = (b11 a11 + b12 a 21 ) s1 + (b11 a12 + b12 a 22 ) s 2
(4)
There exist an infinity of coefficients pairs (b11 ,b12 ) that can lead to the extraction of s1 on z1 , which are all the pairs satisfying
b11 a = − 22 b12 a12
(5)
Let us show that minimizing the chosen criterion leads to these solutions. From eq. (3) and (4), we derive:
C (b11 , b12 ) =
(b11a12 + b12 a 22 )2 σ 22 (0) (b11a11 + b12 a 21 )2 Rsα (0)
σ 12 Rsα1 0
+
0
(6)
1
It is easy to see that this criterion reaches an absolute minimum for any coefficients pair (b11 ,b12 ) that satisfies equation (5). Therefore, for a fixed value of one of the coefficients, the criterion exhibits an absolute minimum versus the other one, and this minimum corresponds to one of the solutions. 2.3 Criterion Evaluation The criterion is estimated from the statistics of the measures. The different statistics to be estimated are:
R xα (0) =
T
[
]
1 0 * E xi (t )x j (t ) e −2πjαt dt ∫ T0 0
(7)
with
T0 =
1
α0
(8)
where the cyclic frequency α ∈ {0, α 0 } and i and j can take the values {1,2} . Assuming that the cyclostationary signal
s1 is cycloergodic, the ensemble averaging
E [...] can be replaced by temporal synchronized averaging, i.e. averaging over cyclic periods of the signal, which leads to the following estimator:
148
C. Capdessus, A.K. Nandi, and N. Bouguerriou
1 Rˆ xα (0) = NT0
NT0
∫ x (t )x (t )e *
i
− 2πjαt
j
dt
(9)
0
The criterion can then be computed for any value of (b11 ,b12 ) by :
Cˆ (b11 , b12 ) =
b112 Rˆ x01 (0) + b122 Rˆ x02 (0) + b11b12 Rˆ x01x2 (0) b 2 Rˆ α 0 (0) + b 2 Rˆ α 0 (0) + b b Rˆ α 0 (0) 11
x1
12
x2
11 12
(10)
x1 x2
The coefficient b11 is arbitrarily set to 1 and the criterion is computed for different values of b12 (with a step 0.01) until it reaches a minimum. If b12 increases until it reaches a given threshold, the strategy is reversed: b12 is set to one and b11 made to vary.
3 Simulations In the following sections, the method is applied to three different cases, each with two sources and two mixtures. The first case includes one artificial cyclostationary source and one stationary source. The second one includes two cyclostationary sources at different cyclic frequencies. In the last case, one of the sources is a real, vibrations signal and the other one is an artificial stationary source. In each case, we apply the algorithm to a set of different mixing matrices randomly generated. Their four coefficients aij are random real numbers equally distributed
over the interval [− 1, 1] . In order to evaluate the performance of the method, we compute the mean square error between the cyclostationary source to be extracted and its estimate. Both are normalised before computing the error, in order to take into account the indeterminacy over the estimate amplitude. This error is computed for each of the random matrices and given in dB relatively to the power of the source. The parameters were estimated over 100 realisations for all the simulations. 3.1 One Cyclostationary Source and One Stationary Source The first simulation was performed over a simple cyclostationary signal
s1 (t ) = a(t ) cos(2πf 0 t )
(11)
with a (t ) a random white noise. This signal is second order cyclostationary at
frequency α 0 = 2 f 0 . The second source is chosen to be a random white noise. Both sources are uncorrelated and have power equal to one. Note that the two sources are both wideband and cannot be separated directly in the frequency domain, so that classical second order source separation methods such as SOBI fail to separate these two peculiar sources. The algorithm was applied to these sources with 100 different mixing matrices generated as previously described. Fig. 1 shows the estimation error versus the determinant of the mixing matrix. It shows that the source was estimated with a –20
A New Source Extraction Algorithm for Cyclostationary Sources
149
dB accuracy for 98% of the matrices, and with a –30 dB accuracy for 95% of the matrices. For most of the mixing matrices, the source was estimated with very good accuracy, close to –40 dB. The only matrices leading to a poor estimation are ill conditioned ones, whose determinant is close to zero. -5
Estimate mean square error in dB
-10
-15
-20
-25
-30
-35
-40 -1.5
-1
-0.5 0 0.5 Determinant of the mixing matrix
1
1.5
Fig. 1. Mean square error (in dB) between source s1 and its estimate represented versus the determinant of the mixing matrix for one cyclostationary source and one stationary source
3.2 Two Cyclostationary Sources Source
s1 is the same cyclostationary source as in section 3.1, while source s2 is a
cyclostationary source built the same way with cyclic frequency α 2 ≠ α 0 . Both sources have power equal to one. 0
Estimate mean square error in dB
-10
-20
-30
-40
-50
-60
-70 -1
-0.5
0 0.5 Determinant of the mixing matrix
1
1.5
Fig. 2. Mean square error (in dB) between source s1 and its estimate represented versus the determinant of the mixing matrix for two cyclostationary sources
150
C. Capdessus, A.K. Nandi, and N. Bouguerriou
Fig. 2 shows the results obtained over 100 randomly generated mixing matrices. This figure shows that the results are really good, since source s1 was estimated with –20 dB accuracy for 99% of the mixing matrices, and –30 dB accuracy for 97% of the mixing matrices. The estimation accuracy is between –30 dB and –60 dB for well conditioned matrices. 3.3 Real Vibration Source and Stationary Source For this last simulation, we mixed the same stationary source s 2 as in section 3.1 with a real damaged roller bearing vibration that was the source to be estimated. The vibration was recorded at 100kHz sampling frequency and the vibration spectrum spreads over the whole recorded frequency range. Damaged roller bearing vibrations were shown to be wide sense cyclostationary and one of their cyclic frequencies, which corresponds to the frequency of the shocks over the damaged part, can be computed from the rotation speed and the location of the damage. For the vibration that we used, this frequency is α 0 = 195Hz . We computed the criterion at this very frequency in order to achieve extraction. Both sources were normalised so as to have the same power and separation was performed with 100 randomly generated mixing matrices. As in the artificial signal cases, the source is estimated with very good accuracy. Though the roller bearing vibration is a complex signal, that exhibits several cyclostationary frequencies, the knowledge of one of them is enough to extract the vibration from a set of additive mixtures. Table 1. sumarises the results obtained in the three studied cases and shows that the proposed algorithm achieves extraction with very good accuracy whatever the nature of the second source can be, provided that it is not cyclostationary at the same frequency as the one to be extracted. Table 1. Percentage of good estimates for a given accuracy depending on the nature of the sources
Cyclostationary / Stationary
Both cyclostationary
Vibration / Stationary
- 20 dB
98 %
99 %
98 %
- 30 dB
95 %
97 %
92 %
4 Conclusion We presented here a new source extraction method for cyclostationary source. The method is based on second order statistics of the sources and only two hypotheses are made about the sources : they are uncorrelated at order two and the source to be extracted exhibits at least one cyclic frequency that it does not share with the other
A New Source Extraction Algorithm for Cyclostationary Sources
151
sources and that is a priori known. These hypotheses are realistic when coping with vibration signals. The method has been presented here in the two sources and two additive mixtures cases. The criterion to be minimised in order to achieve the extraction was shown to exhibit a unique minimum leading to perfect extraction. The method was applied to different mixtures of artificial or real sources and shown to achieve proper estimation for most of the mixing matrices. It can be easily extended to the N*N case. The set of solutions is then given by a set of (N-1) equations with N unknown variables. This will be presented in further publications.
References 1. Tandon, N., Choudhury, A.: A review of vibration and acoustic measurement methods for the detection of defects in rolling element bearings. Tribology International 32, 469–480 (1999) 2. Capdessus, C., Sidahmed, M., Lacoume, J.L.: Cyclostationary processes : application in gear faults early diagnosis. Mechanical Systems and Signal Processing 14(3), 371–385 (2000) 3. McCormick, C., Nandi, A.K: Cyclostationarity in rotating machine vibrations. Mechanical Systems and Signal Processing 12(2), 225–242 (1998) 4. Antoni, J., Bonnardot, F., Raad, A., El Badaoui, M.: Cyclostationary modelling of rotating machine vibration signals. Mechanical Systems and Signal Processing 18, 1285–1314 (2004) 5. Abed-Meraim, K., Xiang, Y., Manton, J.H., Hua, Y.: Blind Source separation Using Second-Order Cyclostationary Statistics. IEEE Trans on Signal Processing, 49(4) (2001) 6. Jafari, M.G., Chambers, J.A.: Normalised natural gradient algorithm for the separation of cyclostationary sources. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Ap, vol. 5, pp. 301–304 (2003)
A Robust Complex FastICA Algorithm Using the Huber M-Estimator Cost Function Jih-Cheng Chao1 and Scott C. Douglas2 1 2
Semiconductor Group, Texas Instruments, Dallas, Texas 75243, USA Department of Electrical Engineering, Southern Methodist University, Dallas, Texas 75275, USA
Abstract. In this paper, we propose to use the Huber M -estimator cost function as a contrast function within the complex FastICA algorithm of Bingham and Hyvarinen for the blind separation of mixtures of independent, non-Gaussian, and proper complex-valued signals. Sufficient and necessary conditions for the local stability of the complex-circular FastICA algorithm for an arbitrary cost are provided. A local stability analysis shows that the algorithm based on the Huber M -estimator cost has behavior that is largely independent of the cost function’s threshold parameter for mixtures of non-Gaussian signals. Simulations demonstrate the ability of the proposed algorithm to separate mixtures of various complex-valued sources with performance that meets or exceeds that obtained by the FastICA algorithm using kurtosis-based and other contrast functions.
1
Introduction
In complex-valued blind source separation (BSS), one possesses a set of measured signal vectors x(k) = As(k) + ν(k),
(1)
where A is an arbitrary complex-valued (m × m) mixing matrix, such that A = AR + jAI , s(k) = [s1 (k) · · · sm (k)]T is a√complex-valued signal of sources, and si (k) = sR,i (k) + jsI,i (k), where j = −1, and ν(k) contains circular Gaussian uncorrelated noise. In most treatments of the complex-valued BSS task, the {si (k)} are assumed to be statistically-independent, and A is full rank. The goal is to obtain a separating matrix B such that y(k) = Bx(k)
(2)
contains estimates of the source signals. In independent component analysis (ICA), the linear model in (1) may not hold, yet the goal is to produce signal features in y(k) that are as independent as possible. One of the most-popular procedures for complex-valued BSS is the complex circular FastICA algorithm in [1]. This algorithm first prewhitens the mixtures x(k) to obtain v(k) = Px(k) such that E{v(k)vH (k)} = I, after which the M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 152–160, 2007. c Springer-Verlag Berlin Heidelberg 2007
A Robust Complex FastICA Algorithm
153
rows of a unitary separation matrix W are adapted sequentially such that y(k) = Wv(k) contains the separated sources. For mixtures of sources that are proper, such that E{s2i (k)} = 0 for all i, this algorithm appears to separate such complex mixtures given enough snapshots N for an appropriate choice of algorithm nonlinearity. Several algorithm nonlinearities are suggested as possible candidates, although little work has been performed to determine the suitability of these choices for general complex-valued source signals. More recently, several researchers have explored the structure of the complex-valued BSS task for mixtures of non-circular sources, such that E{s2i (k)} = 0 [2]–[4]. In what follows, we limit our discussion to the complex-circular source distribution case, as several practical applications involve mixtures of complex-circular sources. In this paper, we extend our recent work on employing the Huber M -estimator cost function from robust statistics as a FastICA algorithm contrast [5] to the complex-valued BSS task for mixtures of proper sources (E{s2i (k)} = 0). We provide the complete form of the local stability condition for the complex-circular FastICA algorithm omitted in [1]. We then propose a single-parameter nonlinearity for the algorithm and show through both theory and simulations that the algorithm’s performance is largely independent of the cost function’s threshold parameter for many source distributions, making it a robust choice for separating complex-valued mixtures with unknown circularly-symmetric source p.d.f.’s. Simulations comparing various contrast choices for the complex circular FastICA algorithm show that ours based on the Huber M -estimator cost often works better than others based on kurtosis maximization or heuristic choice.
2
Complex Circular FastICA Algorithm
We first give the general form of the single-unit FastICA algorithm for extracting one non-Gaussian-distributed proper source from an m-dimensional complex linear mixture [1] and study its local stability properties. The algorithm assumes that the source mixtures have been prewhitened by a linear transformation P where v(k) = Px(k) contains uncorrelated entries, such that the sample covariance of v(k) is the identity matrix. For the vector wt = [w1t · · · wmt ]T , the complex circular FastICA update is (3) yt (k) = wtT v(k) t = E{yt (k)g(|yt (k)|2 )v∗ (k)} − E{g(|yt (k)|2 ) + |yt (k)|2 g (|yt (k)|2 )}wt (4) w t w wt+1 = , (5) tH w t w where yt (k) is the estimated source at time k and algorithm iteration t, g(u) is a real-valued nonlinearity, g (u) = dg(u)/du, and the expectations in (4) are computed using N -sample averages. This algorithm is formulated in [1] as the solution to the following optimization problem: 2 (6) maximize E{G(|yt (k)|2 )} − E{G(|n|2 )} such that E{|yt (k)|2 } = 1,
(7)
154
J.-C. Chao and S.C. Douglas
where n has a circularly-symmetric unit-variance Gaussian distribution and G(u) is a real-valued even-symmetric but otherwise “arbitrary non-linear contrast function” [1] producing g(u) = dG(u)/du. The criterion in (6) is described as the square of a simple estimate of the negentropy of yt (k). Several √ cost functions are suggested as possible choices for G(u), including G(u) = a1 + u for a1 ≈ 0.1 , G(u) = log (a2 + u) for a2 ≈ 0.1, and the kurtosis-based G(u) = 0.5u2 , although no verification of (9) for the first two choices of G(u) and any well-known nonGaussian distributions has been given. In [1], the authors give the following necessary condition for the above algorithm to be locally-stable at a separating solution, where si possesses the distribution of the source extracted in yt (k): (E{g(|si |2 ) + |si |2 g (|si |2 ) − |si |2 g(|si |2 )}) = 0.
(8)
This condition is not sufficient, however, for local stability of the algorithm, as the curvature of the cost function has not been considered in [1]. Although omitted for brevity, we can show that the necessary and sufficient local stability conditions for the algorithm about a separating solution are [E{g(|si |2 ) + |si |2 g (|si |2 ) − |si |2 g(|si |2 )}] ×[E{G(|si |2 )} − E{G(|n|2 )}] < 0.
(9)
This result can be compared to that for the real-valued FastICA algorithm in [6], which shows a somewhat-different relationship. Thus, it is necessary and sufficient for the two real-valued quantities on the left-hand-side of the inequality in (9) to be non-zero and have different signs for the complex circular FastICA algorithm to be locally-stable.
3
A Huber M-Estimator Cost Function for the Complex Circular FastICA Algorithm
In [5], a novel single-parameter cost function based on the Huber M -estimator cost in robust statistics [7] was proposed for the real-valued FastICA algorithm. Unlike most other cost functions, the one chosen in [5] has certain nice practical and analytical properties. In particular, it is possible to show that there always exists a nonlinearity parameter for the cost function such that two sufficient conditions for local stability of the algorithm are met. We now extend this work to design a novel cost function for the complex-circular FastICA algorithm. As the algorithm in [1] implicitly assumes mixtures of proper source signals, we propose to choose G(|yt (k)|2 ) such that the amplitude of yt (k) is maximized according to the Huber M -estimator cost. Thus, we have ⎧u ⎨ G(u) = 2 2 ⎩ θu1/2 − θ 2
u < θ2 u ≥ θ2
(10)
A Robust Complex FastICA Algorithm
155
where θ > 0 is a threshold parameter designed to trade off the parameter estimation quality with the estimate’s robustness to outliers and lack of prior distributional knowledge. The corresponding algorithm nonlinearities are ⎧ ⎪1 u < θ2 ∂G(u) ⎨ 2 g(u) ≡ = θ (11) ⎪ u−1/2 u ≥ θ2 ∂u ⎩ 2 0 u < θ2 dg(u) = g (u) ≡ (12) θ −3/2 du − u u ≥ θ2 . 4 After some simplification, we can implement the circular complex FastICA update using the above nonlinearities as t = 2E{yt (k)hθ (|yt (k)|)v∗ (k)} − E{tθ (|yt (k)|) + hθ (|yt (k)|)}wt (13) w
1 u<θ 1 u<θ , tθ (u) = hθ (u) = θ (14) 0 u ≥ θ. u≥θ u The functions hθ (u) and tθ (u) depend on the threshold parameter θ, and the choice of this nonlinearity will be considered in the next section. Table 1 lists a short MATLAB script for implementing the multiple-unit version of this algorithm, in which the QR decomposition is used for signal deflation. Table 1. Complex circular FastICA algorithm with Huber M -estimator cost %-------------------------------------------------------------------[N,m]=size(x); R = (1/N)*(x’*x); v = x/chol(0.5*(R+R’)); W = eye(m); for i=1:iter y = v*W; absy = abs(y); t = (absy
4
On the Local Stability of the Huber M-Estimator Cost for FastICA
Given the new stability condition in (9), what can be said about the circularlysymmetric Huber M -estimator cost function when it is used in the complex FastICA algorithm? The following two theorems, proven in the Appendix, illustrate two properties about this cost. These theorems make statements about the p.d.f. of u = |si |2 , the squared amplitude of the extracted source. The theorems are non-trivial extensions of the theorems presented in [5].
156
J.-C. Chao and S.C. Douglas
Theorem 1: Let g(u) and g (u) have the forms in (11) and (12), respectively. Then, so long as the random variable u is not exponentially-distributed, there always exists a value of θ such that E{g(u)} + E{ug (u)} − E{ug(u)} = 0.
(15)
Theorem 2: Let G(u) have the form in (10). Then, so long as the random variable u is not exponentially-distributed, there always exists a value of θ such that E{G(u)} − E{G(|n|2 )} = 0.
(16)
Note that if si is unit-variance circular Gaussian, the p.d.f. of u = |si | is exponential (p(u) = e−u for u ≥ 0). Taken together, these two theorems do not ensure (9) for all non-Gaussian proper source distributions. They suggest, however, that the design range for θ could be significant for many distributions. We substantiate this claim through the analysis below and by simulations in the next section. These results are significant because, to our knowledge, few if any statements about the stability of a specific non-kurtosis-based cost function within the complex FastICA algorithm have been given in the scientific literature. Moreover, it is unlikely that such results could be easily found given the complexity of the integrals for other g(y) choices (e.g. g(y) = 0.5(a1 + y)−1/2 for a1 ≈ 1). We have evaluated the range of θ values for which (9) is satisfied for five well-known zero-mean, unit-power, non-Gaussian distributions: 4-QAM-{±1} + j{±1}, 16-QAM-{± √110 ± √310 } + j{± √110 ± √310 }, 64-QAM-{± √142 ± √342 ± √5 ± √742 } + j{± √142 ± √342 ± √542 ± √742 }, the uniform amplitude cir42 √ cular distribution such that |si | is equally probable for 0 ≤ |si | ≤ 2 and is zero otherwise, and the exponential amplitude distribution in which |si | is exponentially-distribution with E{|si |2 } = 1. For all of these five distributions, the Huber M -estimator cost produces an algorithm that is locally-stable for θ in the range [0, |smax |), where smax is the maximum possible value of si (k) admitted by the source p.d.f. Thus, any positive value of θ that places part of the nonlinear portion of g(u) within the range of |si (k)|2 often results in a locallyconvergent algorithm. Again, this evaluation does not guarantee that the chosen cost function will always work, but it suggests that one does not need to design specific values of θ to achieve separation. In practice, one may not know what θ value to choose to obtain separation of a particular source mixture. As was suggested in [5] in the real-valued case, we recommend that one randomize the value of θ over a range of positive values during coefficient adaptation. The main observed effect using such randomization is a slight slowdown in convergence speed. 2
5
Simulations
We now explore the performance of the FastICA algorithm with various cost function via simulations. In these simulations, m = 15-source mixtures were
A Robust Complex FastICA Algorithm
157
−5 (a +y)0.5 1 log(a2+y) kurtosis Huber,θ=0.9 Huber,θ~Unif[0.5−1]
−10
E{ γ }[dB]
−15
−20
−25
−30 100
200
300
400
500 600 # of snapshots N
700
800
900
1000
Fig. 1. E{γ} vs. number of snapshots N for the various algorithms in the simulation example
generated consisting of three 4-QAM, three 16-QAM, three 64-QAM, three uniform and three exponential amplitude circular-distributed independent sources, and a random mixing matrix. The multi-unit FastICA procedure was applied to this data for numbers of snapshots ranging from N = 100 to N = 5000 and for different θ values. The performance factor computed is the separation cost ⎛ ⎞ m m 2 2
|cil | |cil | 1 ⎝ ⎠ −1 γ= (17) 2+ 2m i=1 max |c | max |cli |2 il l=1 1≤i≤m
1≤l≤m
with C = WPA as obtained at convergence of the algorithm. One hundred iterations were averaged to obtain each data point shown. Fig. 1 compares the performance of FastICA with the Huber cost function and θ = 0.9 and with the Huber cost function and a uniformly-randomized θ in the range 0.5 ≤ √ θ ≤ 1 at each iteration with three other versions of FastICA – using G(y) = a1 + y or g(y) = 2√a11 +y for a1 ≈ 0.1, G(y) = log (a2 + y) or g(y) = a21+y for a2 ≈ 0.1, and the kurtosis-based choice G(y) = 0.5y 2 or g(y) = y. As can be seen, the Huber cost function-based versions outperform the algorithms based on previously-proposed contrast functions. More significantly, our algorithm version with a randomized threshold parameter θ provides good separation performance across all sample sizes; performance deviations were less than ±1dB from the algorithm with a fixed θ = 0.9 value. Fig. 2 illustrates the performance sensitivity of the FastICA algorithm with Huber M -estimator cost to the value of θ for these signal mixtures. As can be seen, the algorithm performs well for values of θ satisfying 0.1 ≤ θ ≤ 1, and its performance degrades gracefully for higher θ values.
158
J.-C. Chao and S.C. Douglas
−5
−10
−15
E{ γ }
−20
−25 N=100 N=200 N=500 N=1000 N=2000 N=5000
−30
−35
−40
0
0.5
1 threshold parameter θ
1.5
2
Fig. 2. E{γ} vs. θ for the FastICA algorithm with Huber M -estimator cost in the simulation example
6
Conclusions
In many blind source separation and independent component analysis algorithms, the cost function used to measure signal independence is a design parameter. In this paper, we have considered Huber’s single-parameter M -estimator cost function for use within the complex-valued FastICA algorithm for proper source mixtures. The algorithm obtained is computationally-simple, and the procedure works well for a wide range of threshold parameters θ. The reasons for the algorithm’s robust behavior for a wide range of the threshold parameter is indicated through a stability analysis.
References 1. Bingham, E., Hyvarinen, A.: A fast fixed-point algorithm for independent component analysis of complex valued signals. Int J. Neural Systems 10(1), 1–8 (2000) 2. De Lathauwer, L., De Moor, B.: On the blind separation of non-circular sources. In: Proc. EUSIPCO-02, Toulouse, France (September 2002) 3. Novey, M., Adali, T.: ICA by maximization of nongaussianity using complex functions. In: Proc. IEEE Workshop Machine Learning for Signal Processing, Mystic, CT, September 2005, pp. 21–26. IEEE Computer Society Press, Los Alamitos (2005) 4. Eriksson, J., Koivunen, V.: Complex random vectors and ICA models: Identifiability, uniqueness and separability. IEEE Trans. Inform. Theory 52, 1017–1029 (2006) 5. Chao, J., Douglas, S.C.: A simple and robust fastICA algorithm using the Huber Mestimator cost function. In: Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, Toulouse, France, vol. 5, pp. 685–688 (May 2006) 6. Hyv¨ arinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks 10, 626–634 (1999) 7. Huber, P.: Robust Statistics. Wiley, New York (1981)
A Robust Complex FastICA Algorithm
7
159
Appendix
Proof of Theorem 1: Assume without loss of generality that u is unit variance. Consider the terms on the left-hand-side of (15) for the nonlinearities in (11) and (12), and define f1 (θ) = 2(E{g(u)} + E{ug (u)} − E{ug(u)}). Then, we obtain ∞ θ (18) f1 (θ) = u−1/2 (u3/2 − θu − u1/2 + )p(u)du. 2 θ2 For Eq. (15) not to hold, f1 (θ) = 0 for all possible values of θ. Suppose that the slightly-more-general condition f1 (θ) = c1 θ + c2
(19)
is true, where c1 and c2 are unknown constants. Such a condition justified when p(u) is smooth, as f1 (θ) an then be modeled by a polynomial approximation see the comment below. Then, ∞ ∂f1 (θ) 1 = p(θ2 )θ + ( u−1/2 − u1/2 )p(u)du = c1 (20) ∂θ θ2 2 ∂ 2 f1 (θ) = p (θ2 ) + p(θ2 ) = 0, (21) ∂θ2 which yields the relationship p (u) = −p(u).
(22)
The only distribution p(u) satisfying (22) is the exponential distribution, i.e. p(u) = e−u for u ≥ 0. Thus, the theorem follows. Note that if si is circular Gaussian-distributed, |si |2 has an exponential distribution, although other distributions for si could lead to an exponential distribution for |si |2 . Proof of Theorem 2: Substituting (10) into the left-hand-side of (16), defining f2 (θ) = 2E{G(u) − G(|n|2 )}, and simplifying yields the expression ∞ f2 (θ) = − (u1/2 − θ)2 [p(u) − pn (u)]du, (23) θ2
where pn (u) = e−u for u ≥ 0. For Eq. (16) not to hold, f2 (θ) = 0 for all possible values of θ. Suppose that the slightly-more-general condition f2 (θ) = c1 θ + c2 is true, where c1 and c2 are unknown constants. Then, ∞ ∂f2 (θ) =2 (u1/2 − θ)[p(u) − pn (u)]du = c1 ∂θ θ2 ∞ ∂ 2 f2 (θ) = −2 [p(u) − pn (u)]du = 0. ∂θ2 θ2
(24)
(25) (26)
160
J.-C. Chao and S.C. Douglas
For (24) to hold for all θ > 0, we must have p(u) = pn (u),
(27)
which results in c1 = 0, c2 = 0, and finally f2 (θ) = 0. Thus, the theorem follows. Comment: In both of the above proofs, fi (θ) is a continuous function of θ given a continuous smooth amplitude-squared distribution p(u). Thus, we can express fi (θ) as a polynomial function of θ with coefficients ci . Now, for the condition fi (θ) = 0, we must have all ci = 0. Clearly, it is impossible that c0 = 0 and all ci not equal to 0 for i > 0 and the condition fi (θ) = 0, because any change in θ would make fi (θ) not equal to zero. Hence, fi (θ) defines only one function and therefore only one distribution p(u) has fi (θ) = 0. In both of the above proofs, the exponential distribution yields fi (θ) = 0.
Stable Higher-Order Recurrent Neural Network Structures for Nonlinear Blind Source Separation Yannick Deville and Shahram Hosseini Universit´e Paul Sabatier Toulouse 3 - Observatoire Midi-Pyr´en´ees - CNRS, Laboratoire d’Astrophysique de Toulouse-Tarbes (UMR 5572), 14 Av. Edouard Belin, 31400 Toulouse, France
[email protected],
[email protected]
Abstract. This paper concerns our general recurrent neural network structures for nonlinear blind source separation, especially suited to polynomial mixtures. We here focus on linear-quadratic mixtures. We introduce an extended structure, with additional free parameters as compared to the structure that we previously proposed. We derive the equilibrium points of our new structure, thus showing that it has no spurious fixed points. We analyze its stability in detail and propose a practical procedure for selecting its free parameters, so as to guarantee the stability of a separating point. We thus solve the stability issue of our previous structure. Numerical results illustrate the effectiveness of this approach.
1
Introduction
Blind source separation (BSS) methods aim at restoring a set of N unknown source signals sj (n) from a set of P observed signals xi (n) which are mixtures of these source signals [1], with P = N in the standard configuration considered hereafter. In the simplest case, the observed signals are Linear Instantaneous (LI) mixtures of the source signals. Denoting A = [aij ] the matrix composed of these unknown mixture coefficients aij and s(n) = [s1 (n) . . . sN (n)]T and x(n) = [x1 (n) . . . xP (n)]T the source and observation vectors respectively, the mixing model reads in matrix form x(n) = As(n).
(1)
One of the very first solutions to this LI-BSS problem reported in the literature is the H´erault-Jutten artificial neural network [2]. This network has a recurrent (or feedback) structure, i.e. each of its outputs yi (n) consists of an LI combination of input xi (n) and of all other outputs yj (n) with j = i, using adequate combination coefficients estimated from the outputs by means of an unsupervised algorithm. Such a recurrent structure is not mandatory however, i.e. the same class of LI mappings from the signals xi (n) to the signals yi (n) may also be achieved by a feedforward structure, where each output yi (n) is derived only as an LI combination of all inputs xi (n). The latter structure is simpler than the recurrent one M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 161–168, 2007. c Springer-Verlag Berlin Heidelberg 2007
162
Y. Deville and S. Hosseini
and is therefore the one mainly used today for LI mixtures. The same evolution occurred for convolutive mixtures, i.e. an approach based on a recurrent structure was first proposed by Nguyen and Jutten [3] and extended by Charkani and Deville [4],[5] but other approaches, based on feedforward structures, were then preferred, since they also provide the considered linear (convolutive) mappings. The situation is quite different for nonlinear mixtures, which are now receiving increasing attention. We especially showed in [6]-[8] that recurrent structures are much more attractive than feedforward ones for polynomial mixtures. Building upon our above-mentioned experience on recurrent structures, we proposed in [6]-[8] a general approach for polynomial mixtures and we investigated it in more detail in the case of linear-quadratic mixtures. The higher-order recurrent neural network that we proposed for such mixtures was shown to be promising, but its applicability is limited by stability constraints. This paper also focuses on linear-quadratic mixtures (references of previously published BSS approaches for such mixtures are provided in [7]-[8]). Our main contributions then consist in : (i) introducing a more general higher-order recurrent neural network structure than in [6]-[8], (ii) providing a detailed theoretical analysis of its fixed points and their stability, depending on its parameter values, and (iii) deriving a practical method for selecting these values so as to guarantee stability at a separating point. As a spin-off, we also obtain stability conditions for our previous network. This confirms that it cannot operate with any signals, while its extended version can.
2
Mixing and Separating Models
We here consider an instantaneous mixing model, where two observed signals x1 (n) and x2 (n) consist of linear combinations of two source signals s1 (n) and s2 (n), added to quadratic terms, i.e. terms proportional to s1 (n)s2 (n). Taking into account BSS scale indeterminacies, the rescaled mixing model reads [7]-[8] x1 (n) = s1 (n) − L12 s2 (n) − Q1 s1 (n)s2 (n) x2 (n) = −L21 s1 (n) + s2 (n) − Q2 s1 (n)s2 (n).
(2) (3)
For each time n, a recurrence is performed to compute the values of the outputs yi of the feedback network that we now introduce in this paper. We denote as m the index associated to this recurrence and yi (m) the successive values of each output in this recurrence at time n1 . This recurrence reads y1 (m + 1) = x1 (n) + l11 y1 (m) + l12 y2 (m) + q1 y1 (m)y2 (m)
(4)
y2 (m + 1) = x2 (n) + l21 y1 (m) + l22 y2 (m) + q2 y1 (m)y2 (m)
(5)
where lij and qi are the adaptive weights of the proposed neural network. 1
These successive output values therefore also depend on n. This index n is omitted in the notations yi (m) however, in order to improve readability and to focus on the recurrence on output values for given input values x1 (n) and x2 (n).
Stable Higher-Order Recurrent Neural Network Structures
163
As compared to our previous papers [6]-[8], we here consider the same mixing model (2)-(3), but we propose an extended version (4)-(5) of our previous recurrent neural network, where we introduce the linear feedback terms l11 and l22 from each output yi (m) to the input with the same index. In other words, our previous network is a specific case of the new one, obtained by forcing l11 = 0 and l22 = 0.
(6)
By analyzing the stability of our previous and extended networks, we will show below that the constraint (6) does not make it possible to handle all signal values and we will introduce adequate values of l11 and l22 for solving this problem.
3
Fixed and Separating Points
The stability of the recurrence (4)-(5) is analyzed for fixed points (i.e. equilibrium points) of this recurrence. The first step of this investigation therefore consists in determining these fixed points, i.e. the points (y1E , y2E ) which are such that y1 (m + 1) = y1 (m) = y1E
and
y2 (m + 1) = y2 (m) = y2E .
(7)
As may be seen by combining (4)-(5) and (7), the fixed points of the recurrence (4)-(5) depend: (i) on the inputs xi (n), and therefore on the source values and mixing parameters through the mixing equations (2)-(3), and (ii) also on the network weights lij and qi . Our eventual goal will be to adapt the network weights lij and qi so as to achieve BSS, i.e. so as to make the network outputs yi (m) equal to the source signals, up to BSS indeterminacies. Therefore, before considering that adaptation, we here determine all network weights which are such that one of the associated fixed points corresponds to BSS without permutation, i.e. to y1E = k1 s1 (n)
and
y2E = k2 s2 (n)
(8)
where k_1 and k_2 are two arbitrary scale factors. In other words, for given mixture parameters in (2)-(3), we look for all network weights l_ij and q_i and scale factors k_i such that Eq. (2)-(5) and (7)-(8) are met whatever the source values s_i(n). Our calculations, which are skipped here due to space limitations, yield

    l_11 = −1/k_1 + 1                                                      (9)
    l_12 = L_12/k_2 = L_12 l'_22                                           (10)
    q_1 = Q_1/(k_1 k_2) = Q_1 l'_11 l'_22                                  (11)
    l_21 = L_21/k_1 = L_21 l'_11                                           (12)
    l_22 = −1/k_2 + 1                                                      (13)
    q_2 = Q_2/(k_1 k_2) = Q_2 l'_11 l'_22                                  (14)
with

    l'_11 = 1 − l_11    and    l'_22 = 1 − l_22.                           (15)
The first expressions in (10)-(12) and (14) show that we thus obtain an infinite number of solutions, due to the two arbitrary scale factors k_i with which the sources appear in the network outputs. There is a one-to-one correspondence between these factors and the two network weights l_11 and l_22, as shown by (9) and (13). One may therefore consider the weights l_11 and l_22 as the primary parameters, select them (freely at this stage), and then assign accordingly the other network weights, using the second expressions in (10)-(12) and (14). For these weight values, we know by construction that the network has at least one fixed point, i.e. the point defined by (8) with (9) and (13). We must then determine all fixed points for these weight values, because the network may converge to any of these points (depending on their stability) and we should especially determine whether each of them yields separated sources. This topic is addressed by looking for all solutions of Eq. (2)-(5), (7), (10)-(12) and (14)-(15). Long calculations then yield two solutions, more easily expressed by defining

    α = Q_2 + Q_1 L_21                                                     (16)
    β = −(Q_2 L_12 + Q_1)                                                  (17)
    γ = 1 − L_12 L_21                                                      (18)
    ε_y1 = ±1                                                              (19)
    T_1 = ε_y1 sgn(l'_11 l'_22) sgn(−α s_1(n) + β s_2(n) + γ).             (20)
These solutions then read

    y_1E = (1/(2 l'_11 α)) {[α s_1(n) + β s_2(n) + γ] − T_1 [−α s_1(n) + β s_2(n) + γ]}   (21)
    y_2E = (1/(β l'_22)) {−α l'_11 y_1E + [α s_1(n) + β s_2(n)]}.          (22)

These two solutions correspond to the two possible values of ε_y1 in (19) and are more easily expressed with respect to T_1. The first solution, corresponding to T_1 = 1, reads

    y_1E = (1/l'_11) s_1(n)    and    y_2E = (1/l'_22) s_2(n).             (23)

This solution yields BSS without permutation (and with scale factors). It is nothing but the above solution (8), as shown by (9), (13) and (15). The second solution, corresponding to T_1 = −1, reads

    y_1E = (1/l'_11) ((β/α) s_2(n) + γ/α)    and    y_2E = (1/l'_22) ((α/β) s_1(n) − γ/β).   (24)

This additional solution yields BSS with a permutation (and with scale factors and additive constants). We thus only obtain the classical BSS solutions, with indeterminacies associated to nonlinear mixtures. In other words, for these weight values, this network does not yield spurious fixed points, i.e. fixed points such that the outputs are still mixtures of the sources. We now analyze the stability of these fixed points.
4 Stability Condition
The considered network is a two-dimensional nonlinear dynamic system, since the evolution of its state vector [y_1(m), y_2(m)]^T is defined by the nonlinear equations (4)-(5). The linear stability of any such system at a given fixed point may be analyzed by considering a first-order approximation of its evolution equations at that point. The new value of a small disturbance added to the state vector is thus expressed as the product of a matrix H, which defines the first-order approximation of the system, by the former value of that disturbance (see e.g. details in [9]). The (asymptotic) stability of the system at the considered point is guaranteed by constraining the moduli of both eigenvalues of the corresponding matrix H to be lower than 1. Our calculations show that this condition may be expressed, in terms of the trace T and determinant D of H, as

    T > −D − 1,    T < D + 1,    D < 1                                     (25)

and, in terms of the network parameters, as

    |l'_11 l'_22| √δ_y1 − 2(A l'_11 + B l'_22) + 4 > 0                     (26)
    |l'_11 l'_22| √δ_y1 − A l'_11 − B l'_22 < 0                            (27)

with

    δ_y1 = [Q_2 x_1(n) − Q_1 x_2(n) + γ]² − 4α [x_1(n) + L_12 x_2(n)]      (28)
    A = (Q_1/(2β)) [−Q_2 x_1(n) + Q_1 x_2(n) + γ − sgn(l'_11 l'_22) √δ_y1] + 1   (29)
    B = (Q_2/(2α)) [−Q_2 x_1(n) + Q_1 x_2(n) − γ + sgn(l'_11 l'_22) √δ_y1] + 1.  (30)

5 Limitations of Our Previous Network
The stability condition (26)-(27) especially applies to the more specific network that we proposed in [6]-[8], which corresponds to (6) as explained above, and therefore to l'_11 = 1 and l'_22 = 1. Inserting these values in (26)-(27) shows that our previous network yields a stable fixed point only for some values of the mixing coefficients and observed signals, and therefore of the source signals. To clearly illustrate this phenomenon with an example, one may consider the simple case

    L_12 = 0,    L_21 = 0,    Q_1 = Q_2 = Q > 0.                           (31)
It may be shown that our previous network then only yields stability for a limited domain of source values, which consists of a strip in the (s_1(n), s_2(n)) plane, defined by

    −s_1(n) − 1/Q < s_2(n) < −s_1(n) + 3/Q.                                (32)

6 A Method for Stabilizing Our New Network

6.1 Analysis of Stability Condition
Condition (26)-(27) is of high theoretical interest, because it completely defines the stability of the considered fixed point. However, it does not show easily if and how l'_11 and l'_22 may be selected in order to ensure stability at the considered fixed point for any given observed signal values. To address that topic, we consider l'_11 as the primary variable and l'_22 as the secondary variable, and we express the latter as l'_22 = λ l'_11, where λ is a parameter. For any fixed λ, we first investigate whether there exist values of l'_11 such that (26)-(27) is met, in order to eventually determine if there exist values of l'_11 and λ (and therefore l'_22) such that (26)-(27) is met. In other words, we first determine the intersection of the part of the (l'_11, l'_22) plane where (26)-(27) is met and of a given line in that plane, defined by l'_22 = λ l'_11 (with λ ≠ 0). Using the latter expression of l'_22, Eq. (26)-(27) become

    |λ| √δ_y1 (l'_11)² − 2(A + Bλ) l'_11 + 4 > 0                           (33)
    |λ| √δ_y1 (l'_11)² − (A + Bλ) l'_11 < 0.                               (34)

For a given λ, (28)-(30) show that A and B do not depend on l'_11 and l'_22. (33)-(34) then yield two inequalities with respect to l'_11, which are solved as follows. For (34), any (non-zero) λ is suitable and the solutions of (34) for a given λ are

    l'_11 = μ (A + Bλ)/(|λ| √δ_y1)    with    0 < μ < μ_max                (35)

where μ_max = 1. It may then be shown that taking (33) into account in addition still yields (35), but now with μ_max defined as follows. Denoting

    C(λ) = (A + Bλ)²/(|λ| √δ_y1),                                          (36)

we have

    μ_max = 1                      if C(λ) ≤ 4,
    μ_max = 1 − √(1 − 4/C(λ))      otherwise.                              (37)
The above analysis shows that any value of λ yields a non-empty interval of solutions for l'_11. For a given λ, a simple and safe approach therefore consists in selecting for l'_11 the value situated in the middle of the allowed range (35), i.e.

    l'_11 = (μ_max/2) (A + Bλ)/(|λ| √δ_y1)                                 (38)
with μ_max defined by (37). We should eventually propose a method for selecting λ. As stated above, any (non-zero) value of λ is acceptable. A simple solution is λ = ±1, which gives the same "weight" to l'_11 and l'_22 = λ l'_11. Moreover, the sign of λ may be chosen as follows. As explained above, the considered fixed points correspond to ε_y1 = 1. Since l'_22 = λ l'_11 in addition, (20) here reduces to
(39)
Therefore, if λ has the same sign as (−αs1 (n) + βs2 (n) + γ), then T 1 = 1, so that the considered stable fixed point yields the non-permuted sources defined in (23). Otherwise, the permuted sources of (24) are obtained. We thus guarantee that both solutions are obtained by successively applying our approach with two opposite values of λ. Moreover, if the source and mixture coefficients are such that the sign of (−αs1 (n) + βs2 (n) + γ) is known, then selecting λ also with that sign guarantees (local) convergence to the non-permuting point. Otherwise, this version of our approach yields a permutation issue, to be further investigated. 6.2
Summary of Proposed Method
Based on the above analysis, the following procedure (sketched in code below) is guaranteed to yield local convergence towards a separating point:
1. Select λ as explained above.
2. Set l'_11 according to (38), taking into account sgn(l'_11 l'_22) = sgn(λ).
3. Set l'_22 = λ l'_11.
4. Set all other network parameters according to (10)-(12) and (14)-(15).
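A sketch of this procedure in Python, under our reading of Eqs. (16)-(18) and (28)-(38) above; in a blind setting the mixing coefficients are unknown, but here, as in the numerical test of Section 7, they are assumed known:

```python
import numpy as np

def select_network_weights(x1, x2, L12, L21, Q1, Q2, lam=1.0):
    """Steps 1-4 of Section 6.2, assuming known (rescaled) mixing
    coefficients and delta_y1 > 0; equation numbers refer to the text."""
    alpha = Q2 + Q1 * L21                                      # (16)
    beta = -(Q2 * L12 + Q1)                                    # (17)
    gamma = 1.0 - L12 * L21                                    # (18)
    delta = (Q2 * x1 - Q1 * x2 + gamma) ** 2 - 4 * alpha * (x1 + L12 * x2)  # (28)
    sq = np.sqrt(delta)                                        # assumes delta > 0
    sgn = np.sign(lam)                                         # sgn(l'11 l'22) = sgn(lambda)
    A = Q1 / (2 * beta) * (-Q2 * x1 + Q1 * x2 + gamma - sgn * sq) + 1.0     # (29)
    B = Q2 / (2 * alpha) * (-Q2 * x1 + Q1 * x2 - gamma + sgn * sq) + 1.0    # (30)
    C = (A + B * lam) ** 2 / (abs(lam) * sq)                   # (36)
    mu_max = 1.0 if C <= 4 else 1.0 - np.sqrt(1.0 - 4.0 / C)   # (37)
    l11p = 0.5 * mu_max * (A + B * lam) / (abs(lam) * sq)      # (38)
    l22p = lam * l11p
    return dict(l11=1.0 - l11p, l22=1.0 - l22p,                # (9), (13) via (15)
                l12=L12 * l22p, l21=L21 * l11p,                # (10), (12)
                q1=Q1 * l11p * l22p, q2=Q2 * l11p * l22p)      # (11), (14)
```

With the values of Section 7 (x_1 = −2, x_2 = −3, L_12 = L_21 = 0, Q_1 = Q_2 = 0.5, λ = 1), this sketch yields l'_11 = l'_22 = 0.4, consistent with the value reported there.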
7 Numerical Results
To illustrate the performance of our approach, we consider observations defined by (2)-(3) at a single time n (the same test could then be repeated for different times).
Fig. 1. Divergence of previous network (left), convergence of new network (right)
We use s_1(n) = −1, s_2(n) = −2 and mixing coefficients defined by (31), with Q = 0.5. We first implement our previous network [6]-[8] by running 10 steps of the recurrence (4)-(5) with (6), (10)-(12) and (14), starting from y_1(1) = 0, y_2(1) = 0. The resulting trajectory of (y_1(m), y_2(m)) is provided in Fig. 1. This shows that the network diverges very rapidly. This is in agreement with the fact that condition (32) is not met here. We then implement our new network by running the recurrence (4)-(5) as above but with parameters selected as explained in Section 6.2 (with λ = 1). The network then converges to the solution (23), as shown in Fig. 1 (we here have l'_11 = l'_22 = 0.4). Moreover, it converges very rapidly. This shows the effectiveness of the proposed approach.
8 Conclusion
In this paper, we introduced a new BSS neural network structure which solves the stability issue of our previous network. We plan to investigate whether other aspects of its operation may be further improved by taking advantage of its free parameters, especially λ. We will also develop algorithms to adapt the network weights so that they converge towards separating points. Acknowledgment. Y. Deville would like to thank Y. Naudet for drawing his attention to linear-quadratic mixtures, which first resulted in the approach presented in [6], which was then developed in [7]-[8].
References
1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
2. Jutten, C., Hérault, J.: Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24, 1–10 (1991)
3. Thi, H.L.N., Jutten, C.: Blind source separation for convolutive mixtures. Signal Processing 45(2), 209–229 (1995)
4. Charkani, N., Deville, Y.: Self-adaptive separation of convolutively mixed signals with a recursive structure. Part I: Stability analysis and optimization of asymptotic behaviour. Signal Processing 73(3), 225–254 (1999)
5. Charkani, N., Deville, Y.: Self-adaptive separation of convolutively mixed signals with a recursive structure. Part II: Theoretical extensions and application to synthetic and real signals. Signal Processing 75(2), 117–140 (1999)
6. Deville, Y.: Méthode de séparation de sources pour mélanges linéaires-quadratiques, private communication (September 6, 2000)
7. Hosseini, S., Deville, Y.: Blind separation of linear-quadratic mixtures of real sources using a recurrent structure. In: Mira, J., Alvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 241–248. Springer, Heidelberg (2003)
8. Hosseini, S., Deville, Y.: Blind maximum likelihood separation of a linear-quadratic mixture. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 22–24. Springer, Heidelberg (2004)
9. Thompson, J.M.T., Stewart, H.B.: Nonlinear Dynamics and Chaos. Wiley, Chichester, England (2002)
Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization

Andrzej Cichocki¹, Rafal Zdunek², and Shun-ichi Amari³

¹ Dept. of EE, Warsaw University of Technology, and IBS PAN Warsaw, Poland
² Institute of Telecommunications, Teleinformatics and Acoustics, Wroclaw University of Technology, Poland
³ RIKEN Brain Science Institute, Wako-shi, Saitama, Japan
{cia,zdunek,amari}@brain.riken.jp
Abstract. In the paper we present new Alternating Least Squares (ALS) algorithms for Nonnegative Matrix Factorization (NMF) and their extensions to 3D Nonnegative Tensor Factorization (NTF) that are robust in the presence of noise and have many potential applications, including multi-way Blind Source Separation (BSS), multi-sensory or multi-dimensional data analysis, and nonnegative neural sparse coding. We propose to use local cost functions whose simultaneous or sequential (one by one) minimization leads to a very simple ALS algorithm which works under some sparsity constraints both for an under-determined model (a system which has fewer sensors than sources) and an over-determined model. The extensive experimental results confirm the validity and high performance of the developed algorithms, especially with the use of the multi-layer hierarchical NMF. An extension of the proposed algorithm to multidimensional Sparse Component Analysis and Smooth Component Analysis is also proposed.
1 Introduction - Problem Formulation
Nonnegative Matrix Factorization (NMF) and its multi-way extensions: Nonnegative Tensor Factorization (NTF) and Parallel Factor analysis (PARAFAC) models with sparsity and/or non-negativity constraints have been recently proposed as promising and quite efficient tools for processing sparse signals, images, or general data [1,2,3,4,5,6,7,8]. From a viewpoint of data analysis, NMF/NTF provides nonnegative and usually sparse common factors or hidden (latent) components with physiological meaning and interpretation [6,9]. NMF, NTF, and Sparse Component Analysis (SCA) are used in a variety of applications, ranging from neuroscience and psychometrics to chemometrics [10,1,6,7,9,11,12]. In this paper, we propose new Hierarchical Alternating Least Squares (HALS) algorithms for NMF/NTF. By incorporating the regularization and penalty terms into the local squared Euclidean norms, we are able to achieve sparse and local representations of the desired solution, and to alleviate the problem of getting stuck in local minima.
We impose nonnegativity and sparsity constraints to the following NTF (i.e., standard PARAFAC with nonnegativity constraints) model [3]:

    X_q = A D_q S̃ + E_q,    (q = 1, 2, ..., Q)                             (1)

where X_q ∈ R_+^{I×T} are frontal slices (matrices) of the observed 3D tensor data or signals X ∈ R^{I×T×Q}, D_q ∈ R_+^{J×J} are diagonal scaling matrices, A = [a_1, a_2, ..., a_J] ∈ R_+^{I×J} is a mixing or basis matrix, S̃ ∈ R_+^{J×T} represents unknown sources or hidden (nonnegative and sparse) components, and E_q ∈ R^{I×T} represents the q-th frontal slice of the tensor E ∈ R^{I×T×Q} representing noise or error. In the special case Q = 1, the model simplifies to the standard NMF model. The objective is to estimate the set of all nonnegative matrices: A, {D_q}, S̃.¹ The problem can be converted to a tri-NMF model by averaging the frontal slices. In this section, we develop the alternative algorithm which converts the problem to a simple tri-NMF model (under the condition that all frontal slices X_q have the same dimension):

    X̄ = A D̄ S̃ + Ē = A S + Ē,                                              (2)

where X̄ = Σ_{q=1}^Q X_q, D̄ = Σ_{q=1}^Q D_q with D_q = diag{d_q1, d_q2, ..., d_qJ}, Ē = Σ_{q=1}^Q E_q, and S = D̄ S̃ is a scaled matrix of sources. The above system of linear algebraic equations can be represented in an equivalent scalar form as x_it = Σ_j a_ij s_jt + e_it, or equivalently in the vector form X̄ = Σ_j a_j s_j + Ē, where s_j are rows of S, and a_j are columns of A (j = 1, 2, ..., J). Such a simple model provides improved performance if the noise (in the frontal slices) is not correlated. The majority of NMF/NTF algorithms for BSS applications works only if the assumption T >> I ≥ J holds, where J is known or can be estimated using SVD. In this paper, we propose an NMF algorithm that can also work in the under-determined case, i.e. T >> J > I, if the signal representations are sufficiently sparse. Our objective is to estimate the mixing (basis) matrix A and the sources S, subject to nonnegativity and sparsity constraints.
2 Locally Regularized ALS Algorithm
Most of the known and used adaptive algorithms for NMF are based on alternating minimization of the squared Euclidean distance expressed by the Frobenius norm:

    D_F(X̄||AS) = (1/2)||X̄ − AS||_F² + α_A ||A||_1 + α_S ||S||_1,           (3)

subject to nonnegativity constraints of all the elements in A and S, where ||A||_1 = Σ_ir a_ir, ||S||_1 = Σ_jt s_jt, and α_A and α_S are nonnegative regularization coefficients controlling the sparsity of the matrices [9].

¹ Usually, the common factors, i.e., the matrices A and S̃, are normalized to unit-length column vectors and rows, respectively, and are forced to be as sparse as possible.
The basic approach to NMF is alternating minimization or alternating projection: the specified cost function is alternately minimized with respect to two sets of the parameters {s_jt} and {a_ij}, each time optimizing one set of arguments while keeping the other one fixed [9,1]. In this paper we consider minimization of the set of local squared Euclidean cost functions:

    D_F^(j)(X^(j)||a_j s_j) = (1/2)||X^(j) − a_j s_j||_F² + α_A^(j) J_A(a_j) + α_S^(j) J_S(s_j),   (4)

for j = 1, 2, ..., J, subject to nonnegativity constraints for all elements: a_ij ≥ 0 and s_jt ≥ 0, where

    X^(j) = X̄ − Σ_{p≠j} a_p s_p,                                           (5)

a_j ∈ R^{I×1} are columns of the basis mixing matrix A, s_j ∈ R^{1×T} are rows of S, α_A^(j) ≥ 0 and α_S^(j) ≥ 0 are local parameters controlling a sparsity level of the individual vectors, and the penalty terms J_A(a_j) = Σ_i a_ij and J_S(s_j) = Σ_t s_jt enforce sparsification of the columns in A and the rows in S, respectively. The construction of such a set of local cost functions follows from the simple observation that the observed data can be decomposed approximately as X̄ = Σ_{p=1}^J a_p s_p + Ē, or more generally X̄ = Σ_{p=1}^J λ_p a_p s_p + Ē with λ_1 ≥ λ_2 ≥ ... ≥ λ_J > 0. The gradients of the cost function (4) with respect to the unknown vectors a_j and s_j are expressed by

    ∂D_F^(j)(X^(j)||a_j s_j)/∂s_j = a_j^T a_j s_j − a_j^T X^(j) + α_S^(j),  (6)
    ∂D_F^(j)(X^(j)||a_j s_j)/∂a_j = a_j s_j s_j^T − X^(j) s_j^T + α_A^(j),  (7)
where the scalars α_S^(j) and α_A^(j) are added/subtracted component-wise. By equating the gradient components to zero and assuming that we enforce the nonnegativity constraints with a simple "half-rectifying" nonlinear projection, we obtain a new set of sequential learning rules:

    s_j ← (1/(a_j^T a_j)) [a_j^T X^(j) − α_S^(j)]_+ ,    a_j ← (1/(s_j s_j^T)) [X^(j) s_j^T − α_A^(j)]_+ ,   (8)

for j = 1, 2, ..., J, where [ξ]_+ = max{ε, ξ}, and ε is a small constant to avoid numerical instabilities (usually ε = 10^{-16}).
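As an illustration, a minimal NumPy sketch of one sweep of the learning rules (8); the names and the optional column rescaling are ours:

```python
import numpy as np

EPS = 1e-16  # small constant used by the half-rectifying projection

def hals_sweep(X, A, S, alpha_A=0.0, alpha_S=0.0):
    """One sequential sweep of (8): for each j, update the row s_j and the
    column a_j from the local residual X^(j) of (5)."""
    J = A.shape[1]
    for j in range(J):
        # X^(j) = X - sum_{p != j} a_p s_p, computed incrementally
        Xj = X - A @ S + np.outer(A[:, j], S[j, :])
        aj = A[:, j]
        S[j, :] = np.maximum(EPS, (aj @ Xj - alpha_S) / (aj @ aj))
        sj = S[j, :]
        A[:, j] = np.maximum(EPS, (Xj @ sj - alpha_A) / (sj @ sj))
        # optional rescaling to unit l2 length, cf. Remark 1 below
        A[:, j] /= max(np.linalg.norm(A[:, j]), EPS)
    return A, S
```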
172
A. Cichocki, R. Zdunek, S.-i. Amari
algorithms can be further simplified by neglecting the denominator in (8). After we estimate the diagonal matrices estimating the normalized matrices A and S, as follows: ˜ +} , (q = 1, 2, . . . , Q) (9) D q = diag{A+ X q S +
Remark 2. In this paper we have applied a simple nonlinear half-wave rectifying projection [s_jt]_+ = max{ε, s_jt}, ∀t (element-wise) in order to impose nonnegativity constraints. However, other nonlinear projections or filtering can be applied to extract sources (not necessarily nonnegative) with specific properties. First of all, the proposed method can be easily extended for semi-NMF and semi-NTF, where nonnegativity constraints are imposed only for some preselected sources, i.e., rows of the matrix S and/or some selected columns of the matrix A, if some a priori information is available. Furthermore, instead of using the simple nonlinear half-rectifying projection, we can apply more complex nonlinear projections and filtering to estimate bipolar sources which have some specific properties, for example, sources can be bounded, sparse or smooth. In order to estimate the sparse bipolar sources, we can apply well-known adaptive (soft or hard) shrinking nonlinear transformations (e.g., the nonlinear projection can be defined as: P_sr(s_jt) = s_jt for |s_jt| > δ and P_sr(s_jt) = 0 otherwise, with the adaptive threshold δ > 0). Alternatively, we may apply a power nonlinear element-wise transformation: P_sp(s_jt) = sign(s_jt)|s_jt|^{1+γ_s}, ∀t, where γ_s is a small coefficient which controls a sparsity/density level of individual sources [11]. In order to achieve smoothness of the estimated sources, we may apply a local averaging operator (such as MA or ARMA models) or low-pass filtering which gradually enforces some level of smoothness during an iterative process.
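For instance, the two alternative projections mentioned in this remark could be sketched as follows (the function names are ours; δ and γ_s are as defined above):

```python
import numpy as np

def hard_shrink(S, delta):
    """Hard-shrinking projection P_sr of Remark 2: keep entries whose
    magnitude exceeds the threshold delta, zero the rest."""
    return np.where(np.abs(S) > delta, S, 0.0)

def power_sparsify(S, gamma_s=0.05):
    """Power nonlinearity P_sp(s) = sign(s) |s|^(1+gamma_s), controlling
    the sparsity/density level of the estimated sources."""
    return np.sign(S) * np.abs(S) ** (1.0 + gamma_s)
```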
3 Possible Extensions and Improvements
To deal with the factorization problem (1) efficiently, we adopt several approaches from constrained optimization and multi-criteria optimization, where we minimize simultaneously several cost functions using alternating switching between the sets of parameters: {A}, {S}. The above simple algorithm can be further extended or improved (in respect of convergence rate and performance). First of all, different cost functions can be used for estimation of the rows in the matrix S and the columns in the matrix A. Furthermore, the columns of A can be estimated simultaneously, instead of one by one. For example, by minimizing the set of cost functions in (4) with respect to s_j, and simultaneously the cost function (3) with normalization of the columns a_j to unit l_2-norm, we obtain the new ALS learning algorithm in which the rows of S are updated locally (row by row) and the matrix A is updated globally (all columns a_j simultaneously):

    s_j ← [a_j^T X^(j) − α_S^(j)]_+  (j = 1, ..., J),    A ← [(X̄ S^T − α_A)(S S^T)^{-1}]_+ ,   (10)

with normalization (scaling) of the columns in A to unit length.
Secondly, instead of the standard gradient descent approach we can apply the quasi-Newton method [13,14] for the estimation of the matrix A. Since the Hessian ∇²_A(D_F) = I_I ⊗ S S^T ∈ R^{IJ×IJ} of D_F(X̄||AS) has a diagonal block structure with identical blocks, we can simplify the update of A with the Newton method to the very simple form

    A ← [A − ∇_A(D_F(X̄||AS)) H_A^{-1}]_+ ,                                 (11)

where ∇_A D_F(X̄||AS) = (AS − X̄) S^T ∈ R^{I×J}, and H_A = S S^T ∈ R^{J×J}. The matrix H_A may be ill-conditioned, especially if S is sparse, and due to this the Levenberg-Marquardt approach is used to control the ill-conditioning of the Hessian. Thus we have developed the following NMF/NTF algorithm:

    s_j ← [a_j^T X^(j) − α_S^(j)]_+ ,    A ← [A − (AS − X̄) S^T (S S^T + λ I_J)^{-1}]_+ ,   (12)
for j = 1, ..., J, where λ ← λ_0 exp{−τ k}, k is the index of the current alternating step, and I_J ∈ R^{J×J} is an identity matrix. Since the alternating minimization technique in NMF is not convex, the selection of initial conditions is very important. Our algorithms are initialized with random uniform matrices. Thus, to minimize the risk of getting trapped in local minima of the cost functions, we use a steering technique that comes from a simulated annealing approach. The solution is triggered with the exponential rule. For our problems, we set heuristically λ_0 = 100 and τ = 0.02.
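A minimal sketch of one alternating step of (12) with the exponential damping rule; all names are ours:

```python
import numpy as np

def qnewton_hals_step(X, A, S, k, alpha_S=0.05, lambda0=100.0, tau=0.02, eps=1e-16):
    """One alternating step of algorithm (12): local updates of the rows
    of S, then a Levenberg-Marquardt-damped Newton update of A.
    k is the index of the current alternating step."""
    J = A.shape[1]
    for j in range(J):
        Xj = X - A @ S + np.outer(A[:, j], S[j, :])        # X^(j) of (5)
        S[j, :] = np.maximum(eps, A[:, j] @ Xj - alpha_S)  # s_j update in (12)
    lam = lambda0 * np.exp(-tau * k)                       # lambda <- lambda0 exp{-tau k}
    H = S @ S.T + lam * np.eye(J)                          # damped Hessian S S^T + lambda I_J
    A = np.maximum(eps, A - (A @ S - X) @ S.T @ np.linalg.inv(H))
    return A, S
```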
3.1 Multi-layer NMF/NTF
In order to improve the performance of the NTF algorithms proposed in this paper, especially for ill-conditioned and badly scaled data, and also to reduce the risk of getting stuck in local minima in non-convex alternating minimization computations, we have developed a simple hierarchical multi-stage procedure [15] combined together with multi-start initializations, in which we perform a sequential decomposition of nonnegative matrices as follows. In the first step, we perform the basic decomposition (factorization) X_q ≈ A^(1) D_q^(1) S^(1) using any available NTF algorithm. In the second stage, the results obtained from the first stage are used to build up a new tensor from the estimated frontal slices, defined as X_q^(1) = S_q^(1) = D_q^(1) S^(1), (q = 1, 2, ..., Q), and in the next step we perform a similar decomposition for the new available frontal slices: X_q^(1) = S_q^(1) ≈ A^(2) D_q^(2) S^(2), using the same or different update rules. We continue our decomposition taking into account only the last achieved components. The process can be repeated arbitrarily many times until some stopping criteria are satisfied. In each step, we usually obtain gradual improvements of the performance. Thus, our NTF model has the form

    X_q ≈ A^(1) A^(2) · · · A^(L) D_q^(L) S^(L),    (q = 1, 2, ..., Q),     (13)

with final results A = A^(1) A^(2) · · · A^(L), S = S^(L) and D_q = D_q^(L).
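The multi-layer scheme itself is a thin wrapper around any single-layer factorization; a sketch, where the `factorize` callback is hypothetical and stands for any one of the algorithms above:

```python
def multilayer_nmf(X, J, n_layers, factorize):
    """Multi-layer decomposition (13): sequentially re-factorize the source
    estimate of the previous layer, so that X ~ A(1) A(2) ... A(L) S(L).
    `factorize` is any single-layer NMF routine returning (A_l, S_l)
    as NumPy arrays."""
    A_total = None
    S = X
    for _ in range(n_layers):
        A_l, S = factorize(S, J)    # decompose the current "sources"
        A_total = A_l if A_total is None else A_total @ A_l
    return A_total, S
```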
Fig. 1. (left) Original 10 sparse source signals; (middle) observed 6 mixed signals with randomly generated mixing matrix A ∈ R^{6×10} (under-determined case); (right) estimated 10 source signals using our new algorithm (12). For 10 layers we achieved the following performance: SIRs for A and S are SIR_A = 38.1, 37.0, 35.9, 32.4, 28.2, 33.1, 34.5, 41.2, 25.1, 25.1 [dB] and SIR_S = 23.1, 32.7, 35.2, 26.2, 29.8, 22.5, 41.8, 29.3, 30.2, 32.5 [dB], respectively
Physically, this means that we build up a system that has many layers or cascade connections of L mixing subsystems. The key point in our approach is that the learning (update) process to find the parameters of the matrices S^(l) and A^(l) is performed sequentially, i.e. layer by layer. In fact, we found that applying the hierarchical multi-layer approach is necessary in order to achieve high performance for all the proposed algorithms.
4 Simulation Results
All the NTF algorithms presented in this paper have been extensively tested on many difficult benchmarks for signals and images with various statistical distributions and additive noise, and also in preliminary tests with real EEG data. Due to space limitations we present here only selected simulation results in Figs. 1–2. The synthetic benchmark illustrated in Fig. 1 (left) contains 10 sparse, nonnegative and weakly statistically dependent source components. The sources have been mixed by the randomly generated full-rank matrix A ∈ R_+^{6×10}. The typical mixed signals are shown in Fig. 1 (middle). The results obtained with the new algorithm (12) with α_S^(j) = 0.05 are illustrated in Fig. 1 (right), with an average Signal-to-Interference (SIR) level greater than 25 [dB]. Since the proposed algorithms (alternating techniques) perform a non-convex optimization, the estimated components depend on the initial conditions. To estimate the performance in a statistical sense, we present the histograms of 100 mean-SIR samples for the estimation of S (Fig. 2). We tested two different algorithms (combinations of the algorithms): algorithm (10): ALS for A and HALS for S (α_A = 0, α_S^(j) = 0.05), and algorithm (12): quasi-Newton for A and HALS for S.
Fig. 2. Histograms of 100 mean-SIR samples for estimating S from Monte Carlo analysis performed using the following algorithms with 10 layers: (left) ALS for A and HALS for S (10); (right) quasi-Newton for A and HALS for S (12)
5 Conclusions and Discussion
The main objective and motivation of this paper is to derive simple algorithms which are suitable both for under-determined and over-determined cases. We have proposed the generalized and flexible cost function (controlled by sparsity penalty) that allows us to derive a family of robust and efficient alternating least squares algorithms for NMF and NTF. Exploiting gradient and Hessian properties, we have derived a family of efficient algorithms for estimating nonnegative sources even if the number of sensors is smaller than the number of hidden nonnegative sources, under the assumption that the sources are sufficiently sparse and not strongly overlapped. This is a unique modification of the standard ALS algorithm, and to the authors' best knowledge, the first time such a cost function and algorithms have been applied to NMF and NTF. The proposed algorithm also gives better performance (SIRs and speed) than the ordinary ALS algorithm for NMF, and also than some applications of the FOCUSS algorithm [16,17]. We implemented the discussed algorithms in our NMFLAB/NTFLAB MATLAB Toolboxes [18]. The algorithms may also be promising for other applications, such as Sparse Component Analysis, Smooth Component Analysis and EM Factor Analysis, because they relax the problem of getting stuck in local minima much better than the standard ALS algorithm. We have motivated the use of the proposed models in three areas of data analysis (especially, EEG and fMRI) and signal/image processing: (i) multiway blind source separation, (ii) model reduction and selection, and (iii) sparse image coding. Our preliminary experiments are promising. The models can be further extended by imposing additional, natural constraints such as smoothness, continuity, closure, unimodality, local rank - selectivity, and/or by taking into account a prior knowledge about specific 3D, or more generally, multi-way data.
Obviously, there are many challenging open issues remaining, such as global convergence and an optimal choice of the associated parameters.
References
1. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing (New revised and improved edition). John Wiley, New York (2003)
2. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Neural Information Proc. Systems, Vancouver, Canada (2005)
3. Hazan, T., Polak, S., Shashua, A.: Sparse image coding using a 3D non-negative tensor factorization. In: International Conference of Computer Vision (ICCV), pp. 50–57 (2005)
4. Heiler, M., Schnoerr, C.: Controlling sparseness in non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 56–67. Springer, Heidelberg (2006)
5. Hoyer, P.: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 1457–1469 (2004)
6. Morup, M., Hansen, L.K., Herrmann, C.S., Parnas, J., Arnfred, S.M.: Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage 29, 938–947 (2006)
7. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. John Wiley and Sons, New York (2004)
8. Oja, E., Plumbley, M.D.: Blind separation of positive sources by globally convergent gradient search. Neural Computation 16, 1811–1825 (2004)
9. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999)
10. Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis (in press, 2006)
11. Cichocki, A., Amari, S., Zdunek, R., Kompass, R., Hori, G., He, Z.: Extended SMART algorithms for non-negative matrix factorization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 548–562. Springer, Heidelberg (2006)
12. Kim, M., Choi, S.: Monaural music source separation: Nonnegativity, sparseness, and shift-invariance. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 617–624. Springer, Heidelberg (2006)
13. Zdunek, R., Cichocki, A.: Non-negative matrix factorization with quasi-Newton optimization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 870–879. Springer, Heidelberg (2006)
14. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with constrained second-order optimization. Signal Processing 87, 1904–1916 (2007)
15. Cichocki, A., Zdunek, R.: Multilayer nonnegative matrix factorization. Electronics Letters 42, 947–948 (2006)
16. Murray, J.F., Kreutz-Delgado, K.: Learning sparse overcomplete codes for images. Journal of VLSI Signal Processing 45, 97–110 (2006)
17. Kreutz-Delgado, K., Murray, J.F., Rao, B.D., Engan, K., Lee, T.W., Sejnowski, T.J.: Dictionary learning algorithms for sparse representation. Neural Computation 15, 349–396 (2003)
18. Cichocki, A., Zdunek, R.: NTFLAB for Signal Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
Pivot Selection Strategies in Jacobi Joint Block-Diagonalization

Cédric Févotte¹,* and Fabian J. Theis²

¹ GET/Télécom Paris (ENST), 37-39 rue Dareau, 75014 Paris, France
[email protected], http://www.tsi.enst.fr/~fevotte/
² Bernstein Center for Computational Neuroscience, MPI for Dynamics and Self-Organisation, Göttingen, Germany
[email protected], http://fabian.theis.name

* Cédric Févotte acknowledges support from European Commission under contract FP6-027026-K-SPACE.
Abstract. A common problem in independent component analysis after prewhitening is to optimize some contrast on the orthogonal or unitary group. A popular approach is to optimize the contrast only with respect to a single angle (Givens rotation) and to iterate this procedure. In this paper we discuss the choice of the sequence of rotations for such so-called Jacobi-based techniques, in the context of joint block-diagonalization (JBD). Indeed, extensive simulations with synthetic data, reported in the paper, illustrate the sensitivity of this choice, as standard cyclic sweeps appear to often lead to non-optimal solutions. While not being able to guarantee convergence to an optimal solution, we propose a new schedule which, from empirical testing, considerably increases the chances of achieving global minimization of the criterion. We also point out the interest of initializing JBD with the output of joint diagonalization (JD), corroborating the idea that JD could in fact perform JBD up to permutations, as conjectured in previous works.
1 Introduction
Joint diagonalization techniques have received much attention in the last fifteen years within the field of signal processing, and more specifically within the fields of independent component analysis (ICA) and blind source separation (BSS). JADE, one of the most popular ICA algorithms developed by Cardoso and Souloumiac [1], is based on orthonormal joint diagonalization (JD) of a set of cumulant matrices. To this purpose the authors designed a Jacobi algorithm for approximate joint diagonalization of a set of matrices [2]. In a BSS parlance, JADE allows for separation of determined linear instantaneous mixtures of mutually independent sources, exploiting fourth-order statistics. Other standard BSS techniques involving joint diagonalization include the SOBI algorithm [3], TDSEP [4], stBSS [5] and TFBSS [6], which all rely on second-order statistics
of the sources, namely covariance matrices in the first through third case and spatial Wigner-Ville spectra in the fourth case; see [7] for a review. Joint block-diagonalization (JBD) came into play in BSS when Abed-Meraim, Belouchrani and co-authors extended the SOBI algorithm to overdetermined convolutive mixtures [8]. Their idea was to turn the convolutive mixture into an overdetermined linear instantaneous mixture of block-dependent sources, the second-order statistics matrices of the source vector thus becoming block-diagonal instead of diagonal. Hence, the joint diagonalization step in SOBI needed to be replaced by a JBD step. Another area of application can be found in the context of multidimensional ICA or independent subspace analysis [9,10]. Its goal is to linearly transform an observed multivariate random vector such that its image is decomposed into groups of stochastically independent vectors. It has been shown that by using fourth-order cumulants to measure the independence, JADE now translates into a JBD problem [11]; similarly also SOBI and other JD-based criteria can be extended to this group ICA setting [12,13]. Abed-Meraim et al. have sketched several Jacobi strategies in [14,15,16]: the JBD problem is turned into a minimization problem, where the matrix parameter (the joint block-diagonalizer) is constrained to be unitary (because of spatial prewhitening). The minimizer is searched for iteratively, as a product of Givens rotations, each rotation minimizing a block-diagonality criterion around a fixed axis, which we refer to as 'pivot'. Convergence of the algorithm is easily shown, but convergence to an optimal solution (which minimizes the chosen JBD criterion) is not guaranteed. In fact, we observed that results vary widely according to the choice of the successive pivots (which we refer to as 'schedule') and the initialization of the algorithm, which is not discussed in previous works [14,15,16]. The main contributions of this paper are 1) to point out that the choice of the rotation schedule is a sensitive issue which greatly influences the convergence properties of the Jacobi algorithm, as illustrated by extensive simulations with synthetic data, and 2) to propose a new schedule which, from empirical testing, offers better chances to converge to the optimal solution (while still not guaranteeing it), as compared to the standard cyclic Jacobi technique. We also point out the interest of initializing JBD with the output of JD, corroborating the idea that JD could in fact perform JBD up to permutations, as suggested by Cardoso in [10], more recently conjectured by Abed-Meraim and Belouchrani in [16] and partially proved in [11,17]. The paper is organized as follows. Section 2 briefly describes the Jacobi approach to approximate JBD, with fixed equal block sizes. Section 3 compares the convergence results obtained with three choices of initialization/schedule on generated sets of matrices exactly joint block-diagonalizable, with various size, block size and set dimension. Section 4 reports conclusions.
2 Jacobi Approximate Joint Block-Diagonalization

2.1 Approximate Joint Block-Diagonalization
Let A = {A_1, ..., A_K} be a set of K complex matrices of size n × n. The problem of approximate JBD consists of finding a unitary matrix U ∈ C^{n×n} such that
∀k ∈ ⟦1, K⟧ := {1, ..., K}, the matrices U A_k U^H = B_k are as block-diagonal as possible. More precisely, let us denote L the (fixed) length of the diagonal blocks and m = n/L the number of blocks. Writing, for k ∈ ⟦1, K⟧,

    A_k = [ A_k^{11} ... A_k^{1m}
            ...           ...
            A_k^{m1} ... A_k^{mm} ]

where A_k^{ij} is a subblock of dimensions L × L, ∀(i, j) ∈ ⟦1, m⟧², our block-diagonality criterion is chosen as

    boff(A_k) := Σ_{1 ≤ i ≠ j ≤ m} ||A_k^{ij}||_F².                         (1)
Here ||B||_F² = Σ_{ij} |b_ij|² denotes the squared Frobenius norm. We look for U by minimizing the cost function

    C_jbd(V; A) := Σ_{i=1}^{K} boff(V A_i V^H)

with respect to V ∈ U(n), where U(n) is the set of unitary n × n matrices.
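For concreteness, the criterion can be sketched in a few lines of NumPy (the function names are ours):

```python
import numpy as np

def boff(A, L):
    """Block-diagonality criterion (1): squared Frobenius norm of all
    off-diagonal L x L blocks of A."""
    m = A.shape[0] // L
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += np.linalg.norm(A[i*L:(i+1)*L, j*L:(j+1)*L], 'fro') ** 2
    return total

def C_jbd(V, mats, L):
    """JBD cost: sum of boff(V A_k V^H) over the set of matrices."""
    return sum(boff(V @ Ak @ V.conj().T, L) for Ak in mats)
```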
2.2 The Jacobi Approach
Jacobi approaches rely on the fact that any unitary matrix V ∈ U(n) can be written as a product of complex Givens matrices G(p, q, c, s) ∈ U(n), 1 ≤ p < q ≤ n, defined as everywhere equal to the identity I_n except for

    [G(p, q, c, s)]_pp = [G(p, q, c, s)]_qq = c,  [G(p, q, c, s)]_pq = s̄,  [G(p, q, c, s)]_qp = −s,

with (c, s) ∈ R × C such that c² + |s|² = 1. The Jacobi approach consists of iteratively applying the same Givens rotation to all the matrices in set A, with (p, q) chosen so as to minimize criterion C_jbd. In other words, for fixed p and q, one iteration of the method consists of the following two steps:

– compute (c*, s*) = argmin_{c,s} C_jbd(G(p, q, c, s); A)
– ∀k ∈ ⟦1, K⟧, A_k ← G(p, q, c*, s*) A_k G(p, q, c*, s*)^H

Let I_1, ..., I_m be the partition of ⟦1, n⟧ defined by I_i = ⟦(i − 1)L + 1, iL⟧, and let i(k) = ⌈k/L⌉ give the index i of the interval I_i to which k belongs. Let B_k = G(p, q, c, s) A_k G(p, q, c, s)^H, k ∈ ⟦1, K⟧. B_k is everywhere equal to A_k, except for its p-th and q-th rows and columns, which depend on c and s, such
that [18,17]

    b_k^{pp} = c² a_k^{pp} + |s|² a_k^{qq} + c s a_k^{pq} + c s̄ a_k^{qp}
    b_k^{qq} = c² a_k^{qq} + |s|² a_k^{pp} − c s a_k^{pq} − c s̄ a_k^{qp}
    b_k^{pj} = c a_k^{pj} + s̄ a_k^{qj}        (j ∈ I_{i(p)}, j ≠ p)
    b_k^{jp} = c a_k^{jp} + s a_k^{jq}        (j ∈ I_{i(p)}, j ≠ p)
    b_k^{qj} = −s a_k^{pj} + c a_k^{qj}       (j ∈ I_{i(q)}, j ≠ q)
    b_k^{jq} = −s̄ a_k^{jp} + c a_k^{jq}       (j ∈ I_{i(q)}, j ≠ q)
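A sketch of one such rotation step applied to the whole set, using the Givens matrix defined above (0-based indices; the function names are ours):

```python
import numpy as np

def givens(n, p, q, c, s):
    """Complex Givens matrix G(p, q, c, s): identity except for the
    entries [p,p] = [q,q] = c, [p,q] = conj(s), [q,p] = -s."""
    G = np.eye(n, dtype=complex)
    G[p, p] = G[q, q] = c
    G[p, q] = np.conj(s)
    G[q, p] = -s
    return G

def rotate_set(mats, p, q, c, s):
    """One Jacobi iteration for a fixed pivot: apply the same rotation
    to all matrices of the set, A_k <- G A_k G^H."""
    G = givens(mats[0].shape[0], p, q, c, s)
    return [G @ Ak @ G.conj().T for Ak in mats]
```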
Using the fact that the Frobenius norm is invariant to rotations, minimization of criterion C_jbd(G(p, q, c, s); A) with respect to (c, s) can be shown to be equivalent to the maximization of

    C'_jbd(c, s) := Σ_{k=1}^{K} { |b_k^{pp}|² + |b_k^{qq}|² + Σ_{j∈I_{i(p)}, j≠p} (|b_k^{pj}|² + |b_k^{jp}|²) + Σ_{j∈I_{i(q)}, j≠q} (|b_k^{qj}|² + |b_k^{jq}|²) }

However, the latter criterion is constant if p and q belong to the same interval I_{i(p)} (i.e., i(p) = i(q)). Details of the above derivations can be found in [18,17]. It may be shown [15,16] that the maximization of C'_jbd(c, s) boils down to the constrained maximization of a linear quadratic form. This optimization can be achieved using Lagrange multipliers. The computation of the latter requires solving a polynomial of degree 6 in the complex case (i.e., U ∈ C^{n×n}), and of degree 4 in the real case (i.e., U ∈ R^{n×n}). First-order approximations of the criterion are also considered in [15,16] to simplify its maximization. A tensorial rank-1 approximation is also found in [19]. For real matrices, when both A_k and U belong to R^{n×n}, maximization of C'_jbd(c, s) directly amounts to rooting a polynomial of order 4 (without requiring a Lagrangian parametrization), as sketched in [19] and developed in [18,17]. So far, the indices p and q have been fixed. However, the important issue appears not to be how to maximize C'_jbd(c, s), which can be done exactly in a way or another, but how to choose these pivots (p, q). Similarly to JD, the convergence of the proposed (joint) block-diagonalization scheme is guaranteed by construction, whatever the chosen rotation schedule [18,17]. If convergence to the global minimum was in practice usually observed with joint diagonalization schemes, this is certainly not the case for joint block-diagonalization, where we found convergence to be very sensitive to initialization and rotation schedule, as illustrated in the next section.
3 Simulations
The employed algorithms as well as some of the following examples are freely available for download at [20]. The programs have been realized in MATLAB, and sufficient documentation is given to reproduce the results and extend the algorithms. We propose to test the following initialization/schedule strategies.
(M1) The first method is inspired from the standard cyclic Jacobi approach [2,21], which consists of systematically sweeping the pivots one after the other, except for the fact that the couples (p, q) are chosen not to include the diagonal blocks. The algorithm is initialized with the identity matrix, i.e. U = I_n. The algorithm is stopped when all the values of s are lower than 10^{-4} within a sweep.

(M2) The second method is identical to (M1) except for the fact that the algorithm is initialized with the matrix U_jdr provided by joint diagonalization of A, as obtained from [2].

(M3) The third method is inspired from the classical Jacobi method for the diagonalization of a normal matrix [21] and consists, after initialization as in (M2), of choosing at each iteration the pivot (p, q) ensuring a maximum decrease of criterion C_jbd. This requires computing the differences |Σ_{k=1}^{K} (boff(B_k) − boff(A_k))| for all couples (p, q) and picking up the couple which yields the largest difference value. The algorithm stops when 20 successive values of s are all lower than 10^{-4}.

For simplicity, the three methods are tested in the real case. The three methods are applied to 100 random draws of K real matrices exactly joint block-diagonalizable in a real common orthogonal basis (optimal rotation angles are thus computed by rooting a polynomial of order 4, as in [18,17]). Various values of L (size of the blocks), m (number of blocks) and K (number of matrices) are considered. The number of failures over the 100 realizations (i.e., the number of times the methods do not converge to a solution such that C_jbd = 0) is reported in Table 1.

Table 1. Number of failures of methods M1, M2 and M3 over 100 random realizations of K matrices exactly block-diagonalizable in a common orthonormal basis
(each entry gives the failure counts M1/M2/M3)

m L |   K=1    |   K=3   |  K=6   |  K=12  |  K=24
2 2 |   1/0/0  |  4/0/0  | 4/0/0  | 1/0/0  | 2/0/0
2 4 |  32/11/5 | 33/1/0  | 25/0/0 | 10/0/0 | 11/0/0
2 6 | 55/43/14 | 33/2/0  | 21/0/0 | 24/0/0 | 16/0/0
3 2 |   3/0/0  | 14/0/0  | 11/0/0 | 18/0/0 | 8/0/0
3 4 | 68/29/15 | 54/5/1  | 38/1/0 | 33/2/3 | 32/0/1
3 6 | 84/53/44 | 60/10/0 | 48/8/0 | 51/7/2 | 52/8/8
4 2 |   5/0/0  | 30/0/0  | 21/0/0 | 19/0/0 | 16/0/0
4 4 | 87/47/21 | 75/7/5  | 68/6/4 | 60/4/2 | 59/2/3
4 6 | 99/88/65 | 83/15/8 | 77/8/2 | 77/4/0 | 75/10/5
Fig. 1. Evolution of criterion C_jbd for a random set A such that m = 3, L = 4, K = 3. Using a 2.60 GHz Pentium 4 with 1 GB RAM, the computation times for this particular dataset are: (M1 - 1.2 s), (M2 - 0.3 s), (M3 - 1.2 s). The three methods succeed in minimizing the criterion.
Fig. 2. Evolution of criterion C_jbd for a random set A such that m = 4, L = 6, K = 3. Using a 2.60 GHz Pentium 4 with 1 GB RAM, the computation times for this particular dataset are: (M1 - 28.4 s), (M2 - 4.1 s), (M3 - 6.9 s). Only (M3) succeeds in minimizing the criterion.
The results emphasize the importance of the initialization and the choice of the schedule. Failure rates of (M1) are very high, in particular when m and L increase. (M2) and (M3), which are both initialized by joint diagonalization, give much better results, with (M3) being in nearly every case more reliable than
(M2). However, neither of the two methods systematically converges to a global minimum of C_jbd when m ≥ 3, and, interestingly, the methods do not usually fail on the same data sets. Also, Fig. 1 and Fig. 2 show that (M3) only needs a few iterations after JD to minimize C_jbd. This indicates the validity of the claim from [16] that JD minimizes the joint block-diagonality C_jbd, however only up to a permutation. In the above simulation, the permutation is then discovered by application of the JBD algorithm — this also explains why in Figures 1 and 2, when (M2) is used, the cost function after JD only decreases in discrete steps, corresponding to identified permutations. Audio results of the separation of a convolutive mixture with 3 observations and 2 sources, obtained with the generalization of SOBI using our pivot selection scheme and followed by a SIMO identification step removing filtering ambiguities, are found at [22], following the approach described in [23].
4 Conclusions
The main algorithmic conclusion of this paper is: Jacobi algorithms for joint block-diagonalization bring up convergence problems that do not occur in joint diagonalization and that still need to be properly addressed. However, we proposed a strategy (method (M3)) which considerably reduces the failure rates of the straightforward approach (M1). The fact that lower failure rates are obtained with (M2) and (M3), which are initialized with joint diagonalization, tends to corroborate the conjecture that JBD could be achieved up to an arbitrary permutation of columns via JD [10,16], but it still does not explain why this permutation cannot be solved by minimization of C_jbd. This is a question we are currently working on, and for which partial results exist already [11,17]. Moreover, extensions to the case of varying, possibly unknown block sizes are interesting [11], with respect to both the optimization and the application in the field of ICA.
References
1. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEE Proceedings-F 140, 362–370 (1993)
2. Cardoso, J.F., Souloumiac, A.: Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl. 17, 161–164 (1996)
3. Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A blind source separation technique based on second order statistics. IEEE Trans. Signal Processing 45, 434–444 (1997)
4. Ziehe, A., Mueller, K.R.: TDSEP – an efficient algorithm for blind separation using time structure. In: Niklasson, L., Bodén, M., Ziemke, T. (eds.) Proc. of ICANN'98, Skövde, Sweden, pp. 675–680. Springer, Berlin (1998)
5. Theis, F., Gruber, P., Keck, I., Meyer-Bäse, A., Lang, E.: Spatiotemporal blind source separation using double-sided approximate joint diagonalization. In: Proc. EUSIPCO 2005, Antalya, Turkey (2005)
6. Févotte, C., Doncarli, C.: Two contributions to blind source separation using time-frequency distributions. IEEE Signal Processing Letters 11(3) (2004)
7. Theis, F., Inouye, Y.: On the use of joint diagonalization in blind signal processing. In: Proc. ISCAS 2006, Kos, Greece (2006)
8. Bousbiah-Salah, H., Belouchrani, A., Abed-Meraim, K.: Jacobi-like algorithm for blind signal separation of convolutive mixtures. Electronics Letters 37(16), 1049–1050 (2001)
9. De Lathauwer, L., Callaerts, D., De Moor, B., Vandewalle, J.: Fetal electrocardiogram extraction by source subspace separation. In: Proc. IEEE Signal Processing / ATHOS Workshop on Higher-Order Statistics, pp. 134–138. IEEE Computer Society Press, Los Alamitos (1995)
10. Cardoso, J.F.: Multidimensional independent component analysis. In: Proc. ICASSP (1998)
11. Theis, F.: Towards a general independent subspace analysis. In: Proc. NIPS 2006 (2007)
12. Theis, F.: Blind signal separation into groups of dependent signals using joint block diagonalization. In: Proc. ISCAS 2005, Kobe, Japan, pp. 5878–5881 (2005)
13. Févotte, C., Doncarli, C.: A unified presentation of blind source separation methods for convolutive mixtures using block-diagonalization. In: Proc. 4th Symposium on Independent Component Analysis and Blind Source Separation (ICA'03), Nara, Japan (April 2003)
14. Belouchrani, A., Amin, M.G., Abed-Meraim, K.: Direction finding in correlated noise fields based on joint block-diagonalization of spatio-temporal correlation matrices. IEEE Signal Processing Letters 4(9), 266–268 (1997)
15. Belouchrani, A., Abed-Meraim, K., Hua, Y.: Jacobi-like algorithms for joint block diagonalization: Application to source localization. In: Proc. International Symposium on Intelligent Signal Processing and Communication Systems (1998)
16. Abed-Meraim, K., Belouchrani, A.: Algorithms for joint block diagonalization. In: Proc. EUSIPCO'04, Vienna, Austria, pp. 209–212 (2004)
17. Févotte, C., Theis, F.J.: Orthonormal approximate joint block-diagonalization. Technical Report GET/Télécom Paris 2007D007 (2007), http://service.tsi.enst.fr/cgi-bin/valipub_download.cgi?dId=34
18. Févotte, C.: Approche temps-fréquence pour la séparation aveugle de sources non-stationnaires. Thèse de doctorat de l'École Centrale de Nantes (2003)
19. De Lathauwer, L., Févotte, C., De Moor, B., Vandewalle, J.: Jacobi algorithm for joint block diagonalization in blind identification. In: Proc. 23rd Symposium on Information Theory in the Benelux, Louvain-la-Neuve, Belgium, pp. 155–162 (May 2002)
20. http://www.biologie.uni-regensburg.de/Biophysik/Theis/researchjbd.html
21. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
22. http://www.tsi.enst.fr/~fevotte/bass_demo.html
23. Févotte, C., Debiolles, A., Doncarli, C.: Blind separation of FIR convolutive mixtures: application to speech signals. In: Proc. 1st ISCA Workshop on Non-Linear Speech Processing, Le Croisic, France (May 20-23, 2003)
Speeding Up FastICA by Mixture Random Pruning

Sabrina Gaito and Giuliano Grossi

Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
Via Comelico 39, I-20135 Milano, Italy
[email protected]
Abstract. We study and derive a method to speed up kurtosis-based FastICA in the presence of information redundancy, i.e., for large samples. It consists in randomly decimating the data set as much as possible while preserving the quality of the reconstructed signals. By performing an analysis of the kurtosis estimator, we find the maximum reduction rate which guarantees a narrow confidence interval of such estimator with high confidence level. Such a rate depends on a parameter β easily computed a priori by combining the fourth and the eighth norms of the observations. Extensive simulations have been done on different sets of real-world signals. They show that the sample size reduction is actually very high, preserves the quality of the decomposition and impressively speeds up FastICA. On the other hand, the simulations also show that, when decimating data more than the rate fixed by β, the decomposition ability of FastICA is compromised, thus validating the reliability of the parameter β. We are confident that our method will help to better approach real-time applications.
1 Introduction
Independent Component Analysis (ICA) ([1,2,3,4]) is a method to identify a set of unknown and generally non-Gaussian source signals whose mixtures are observed, under the only assumption that they are mutually independent. ICA has become more and more popular and, thanks to the few assumptions needed and its feasibility, it is applied in many areas, such as blind source separation (BSS), which we are interested in [5]. More generally, the aim of ICA is to describe a very large set of data in terms of variables better capturing the essential structure of the problem. In many cases, due to the huge amount of data, it is crucial to make ICA analysis as fast as possible. From this point of view, one of the most popular algorithms is the well-known FastICA [6], which is based on the optimization of some nonlinear contrast functions [7] characterizing the non-Gaussianity of the components. Because of its widespread use, in this paper we refer only to the kurtosis-based FastICA [6].
Our aim is to speed up FastICA by a suitable pruning of the linear mixtures that preserves the output quality. Essentially, the proposed method consists in randomly selecting a subset of data of size d′ less than the original size d, whose sample kurtosis is not too far from the right one. More in detail, we perform an analysis of the kurtosis estimator on the sub-sample with the purpose of finding the minimum reduction ratio ρ = d′/d which guarantees a narrow confidence interval with high confidence level. In particular, we identify a data-dependent parameter, called β, which combines both the fourth and eighth norms of the observations and on which the reduction rate depends. The main step in our method is to compute β on the mixed signals and obtain the actual reduction ratio ρ = β/(δε²), where ε and δ are the fixed confidence interval parameters of the sub-sample kurtosis. Then we randomly decimate the sample and we apply FastICA to the reduced dataset. To assess the reliability of β, many simulations have been done on different sets of both real-world and artificial signals. The experiments show that, according to β, a considerable reduction ratio can normally be applied when the sample size is large, achieving a great benefit in terms of computation time. Furthermore, since β (and consequently ρ) decreases also with respect to the number of signals n, the simulations show that the computation time is weakly affected by n. Moreover, the experiments also show that, when forcing the reduction ratio over the bounds derived by our analysis, the reconstruction error of FastICA grows noticeably. Section 2 describes the pruning methodology. The effect of the data reduction is analyzed in terms of an analysis of the kurtosis estimator in Section 3. In the same section the statistical meaning of the parameter β is explained. In Section 4 we apply the method to a large set of real signals extracted from audio signals, showing the performance of the proposed method.
2 Random Pruning
The model we assume for ICA is instantaneous and the mixture is linear and noiseless: X = AS, where the n × d matrices X and S are respectively the observed mixtures and the mutually independent unknown signals, while A is a full-rank n × n mixing matrix. Thus, n is the number of mixed non-Gaussian signals and d is their length. Therefore, for each i ∈ [1..n] the i-th row x_i of X represents an i.i.d. sample of size d of the random variable x_i representing the i-th mixture. The goal of ICA is to estimate the demixing matrix Ŵ ≈ A^{-1} in order to reconstruct the original source signals Ŝ = ŴX.
Kurtosis-based FastICA is a very simple fixed-point algorithm with satisfactory performance, but it is time consuming for large-scale real signals because its computational complexity is O(nd³) [6]. In order to spare running time, before running FastICA we operate a random pruning procedure on the mixtures, reducing the data by decimating the sample down to the minimum size allowed by β. Denoting with ||x_i||_p the usual p-norm, the overall procedure, with the preprocessing pruning preliminary phase, can be summarized in the following steps:

Pruning preprocessing
1. β(x_i) = ||x_i||_8^8 / ||x_i||_4^8,  ∀i ∈ [1..n]
2. β = max_i β(x_i)
3. d′ = (1/(δε²)) (dβ − 1) ≈ dβ/(δε²)
4. randomly draw I_{d′} ⊆ [1..d] of size d′
5. ∀i ∈ [1..n], ∀j ∈ I_{d′}: y_ij = x_ij, so that y_i = (y_{ij1}, ..., y_{ijd′})

FastICA
1. Perform FastICA on the matrix Y (whose i-th row is y_i) instead of X, obtaining Ŵ by maximizing the sequence kurt(w_i^T Y), where w_i^T is the i-th row of Ŵ.
2. Reconstruct the signals Ŝ = ŴX.
Note that the decimation process throws away the same set of intermediate data points in all mixtures.
3
Theoretical Motivation
In this section we look for a lower bound for the reduction ratio ρ. The main step in FastICA where the sample size is relevant is when the kurtosis is being estimated on the data set. Assuming, as usual, that each mixture xi has zero mean and unitary variance, the kurtosis of each random variable xi reduces to its fourth moment M4 [xi ]. Thus we analyze the effects coming from the use of a reduced data set in terms of confidence interval of the sample fourth moment. The fourth moment estimate is generally performed on the whole sample xi ˆ d [xi ]: of size d via the sample fourth moment M 4 ˆ d [xi ] = 1 M x4 , 4 d t=1 it d
having the following mean and variance: d ˆ [xi ] = M4 [xi ], E M 4
d ˆ [xi ] = 1 (M8 [xi ] − (M4 [xi ])2 ). var M 4 d
188
S. Gaito and G. Grossi
Let us now estimate M4 [xi ] on the basis of the sub-sample y i . Using the Chebyschev inequality we obtain the probability bounds:
ˆ di var M 4 di ˆ [y i ] ≤ M4 [xi ](1 + ε) ≥ 1 − Pr M4 [xi ](1 − ε) ≤ M 4 ε2 (M4 [xi ])2 M8 [xi ] − (M4 [xi ])2 =1− . d ε2 (M4 [xi ])2 Setting the previous term equal to the confidence 1−δ, fixing the margin of error ε and introducing the sample moments, we derive the minimum sample size di which respects the probability bound above: di
ˆ d [xi ])2 ˆ d [xi ] − (M M 8 4 = . 2 ˆ δε (M4 [xi ])2
Expressing the sample moments in terms of norms: ˆ d [xi ] = 1 xi 4 M 4 4 d
ˆ d [xi ] = 1 xi 8 , and M 8 8 d
dxi 88 1 − 1 . δε2 xi 84 It is evident that the minimum allowed sample size depends on the ratio of the two norms xi 88 and xi 84 . Their statistical meaning is related to the variance of the estimator of the fourth moments estimated on the whole sample as: ˆ d [xi ] = 1 (xi 8 − 1 xi 8 ) var M 4 8 4 d2 d we obtain:
di =
Of course a low variance implies a good estimate and the possibility of highly reduce the sample size di . Since it holds that: 1 xi 88 ≤ ≤ 1, d xi 84 we note that the better ratio for the variance is xi 88 = 1d xi 84 . On the other side, the variance of the estimator is highest when xi 88 = xi 84 . Introducing the parameter β = max xi
xi 88 xi 84
the minimum allowed sample size is: d =
1 dβ (dβ − 1) ≈ 2 2 δε δε
and the reduction ratio is: ρ=
β d = 2. d δε
Speeding Up FastICA by Mixture Random Pruning
4
189
Numerical Experiments
In this section we report the summary of extensive computer simulations obtained from the executions of FastICA on different set of sampled source signals: speech, musical and environmental sounds of various nature, mixed with randomly generated matrix. All the experiments have been carried out on Pentium P4 (2GHz, 1GB RAM) through software environment MATLAB 7.0.1. The main purpose of the simulations is to apply the preprocessing pruning technique in order to appreciate the performance of FastICA both in terms of computation complexity and of quality of the reconstructed signals. Specifically, we are interested in validating the reliability of the parameter β observing the performance decay. This attitude may find application in real time scenarios where high sampling rate can make prohibitive the use of the ICA technique. All signals considered in the experiments are very big (order of magnitude 105 and 106 ) because for short sample size FastICA sometimes fails to converge or gets stuck at saddle points [8]. To measure the accuracy of the demixing matrix we use the performance index reported in [9], which represents a plausible measure of discrepancy between the ˆ and the identity matrix, defined as: product matrix P = (pij )n×n = AW ⎞ ⎞ ⎛ ⎛ n n n n |p | |p | ij ij ⎝ ⎝ − 1⎠ + − 1⎠ . Err = max |p | max |pkj | ik i=1 j=1 j=1 i=1 k
k
Due to the limit of space we present here only the most illustrative example, which examines signals of size d = 106 . Table 1 shows the results on different groups of n signals (with 2 ≤ n ≤ 10). Table 1. Average performance index and average computation time of FastICA on various groups of signals (from 2 to 10 with d = 106 ). Second column reports the reduction ratio ρ < 1, third and fourth columns report the performance index both with full and reduced sample size respectively. The last two columns report the computation times in both the cases. The numbers between brackets are the standard deviations calculated on the 30 trials. n
ρ<1
Err (ρ = 1)
Err (ρ < 1)
Time (ρ = 1)
Time (ρ < 1)
2
0.03 (0.01)
0.02 (0.05)
0.03 (0.02)
2.5 (0.9)
0.1 (0.0)
3
0.27 (0.01)
0.04 (0.02)
0.05 (0.02)
4.5 (0.8)
1.3 (0.6)
4
0.25 (0.07)
0.11 (0.11)
0.11 (0.05)
6.7 (0.8)
1.7 (0.6)
5
0.22 (0.07)
0.18 (0.07)
0.33 (0.63)
9.4 (1.3)
2.1 (0.7)
6
0.19 (0.07)
0.37 (0.15)
0.46 (0.14)
12.0 (1.7)
2.4 (0.9)
7
0.16 (0.06)
0.62 (0.70)
0.97 (0.97)
14.7 (1.1)
2.4 (1.0)
8
0.16 (0.06)
1.08 (0.75)
1.44 (1.12)
18.5 (2.1)
2.9 (1.1)
9
0.12 (0.04)
1.23 (1.30)
1.75 (2.70)
26.5 (3.8)
2.8 (0.9)
10
0.11 (0.04)
1.43 (0.29)
1.91 (2.23)
33.5 (3.4)
2.8 (1.0)
190
S. Gaito and G. Grossi
For each group we randomly generated 30 mixtures in order to observe, on average, both the time of convergence and the performance index of FastICA for the whole and the reduced samples respectively. All the experiments are obtained at confidence level 0.9 and margin of error 0.1. Based on the simulations we can draw the following conclusions. 1. Sample size is highly reduced (up to one hundred times) while the quality of the decomposition is preserved, as highlighted by the performance index. Here, in particular, β = ρ ∗ 10−3 is sufficiently small, lying in the range between 10−5 and 10−4 . 2. The discrepancy between the error given by the whole sample and that given by the pruned sample increases very slowly with n (number of signals) as shown graphically in Fig. 1 (the lowest two errors corresponding to the third and fourth column of Table 1). 3. To assess the reliability of β, in the same figure we report the data obtained with a reduction ratio of one order of magnitude under that provided by analysis, i.e., with ρsub = 10−1 ρ (highest error in the graphic). This experiment shows that the error grows noticeably. 4. As far as computation time is concerned, Fig. 2 (average times corresponding to the fifth and sixth column of Table 1) highlights the impressive gain of the computational cost. This gain depends on the fact that the computational cost is cubic with respect to sample size. Moreover, it can be noticed that in our pruning FastICA the computation time depends weakly on the number of signals because β decreases with respect to n. 8 Full samples (ρ = 1) Reduced samples (ρ < 1)
7
−1
Reduced samples (10 ρ) 6
Error
5 4 3 2 1 0 2
3
4
5
6 7 # mixed signals
8
9
10
Fig. 1. Three average errors measured for various groups of signals (d = 106 ): the first is obtained with ρ = 1 (without reduction), the second decimated with ρ = β ∗103 (where β is computed in according to the previous analysis) and the third with ρsub = β ∗ 102 (reducing β of one order of magnitude)
Speeding Up FastICA by Mixture Random Pruning
191
35 Full samples (ρ = 1) Reduced samples (ρ < 1)
30
Times
25
20
15
10
5
0 2
3
4
5
6 7 # mixed signals
8
9
10
Fig. 2. Average times of FastICA on different groups of signals of full and reduced size: the first is obtained with ρ = 1 (without reduction), the second by decimation with ρ = β ∗ 103
5
Conclusions
The contribution of this paper is the derivation of a signal-dependent parameter useful to randomly decimate high-dimensional mixtures in order to reduce the time in kurtosis-based FastICA executions. Such a parameter has been validated both in terms of rigorous high-order moments analysis and by means of computer simulations on real word signals. The results encourage to study the pruning technique deeper by exploring different sub-sampling methodologies and different contrast functions used in ICA. Finally, we are confident that our method can be used in real-time applications dealing with high sampling rate, where the online decimation permits to reasonably reduce the mixture size enabling FastICA to operate tightly.
References 1. Comon, P.: Independent component analysis - a new concept? Signal Processing 36, 287–314 (1994) 2. Jutten, C., Herault, J.: Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24, 1–10 (1991) 3. Hyvrinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Chichester (2001) 4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, Chichester (2002)
192
S. Gaito and G. Grossi
5. Cardoso, J.: Eigen-structure of the fourth-order cumulant tensor with application to the blind source separation problem (1990) 6. Hyv¨ arinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997) 7. Hyv¨ arinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999) 8. Tichavsky, P., Koldovsky, Z., Oja, E.: Performance analysis of the fastica algorithm and cramr-rao bounds for linear independent component analysis. IEEE Transaction on Signal Processing 54(4), 1189–1202 (2006) 9. Amari, S., Cichocki, A.: Recurrent neural networks for blind separation of sources. In: Proceedings of International Symposium on Nonlinear Theory and Applications. vol. I, pp. 37–42 (1995)
An Algebraic Non Orthogonal Joint Block Diagonalization Algorithm for Blind Separation of Convolutive Mixtures of Sources Hicham Ghennioui2,3 , El Mostafa Fadaili1 , Nad`ege Thirion-Moreau2, Abdellah Adib3,4 , and Eric Moreau2 1 2
IBISC, CNRS FRE 2873 40 rue du Pelvoux, F-91020 Evry-Courcouronnes, France STD, ISITV, av. G. Pompidou, BP56, F-83162 La Valette du Var Cedex, France [email protected], {fadaili,thirion,moreau}@univ-tln.fr 3 GSCM-LRIT, FSR, av. Ibn Battouta, BP1014, Rabat, Maroc 4 DPG, IS, av. Ibn Battouta, BP703, Rabat, Maroc [email protected]
Abstract. This paper deals with the problem of the blind separation of convolutive mixtures of sources. We present a novel method based on a new non orthogonal joint block diagonalization algorithm (NO − JBD) of a given set of matrices. The main advantages of the proposed method are that it is more general and a preliminary whitening stage is no more compulsorily required. The proposed joint block diagonalization algorithm is based on the algebraic optimization of a least mean squares criterion. Computer simulations are provided in order to illustrate the effectiveness of the proposed approach in three cases: when exact blockdiagonal matrices are considered, then when they are progressively perturbed by an additive Gaussian noise and finally when estimated correlation matrices are used. A comparison with a classical orthogonal joint block-diagonalization algorithm is also performed, emphasizing the good performances of the method.
1
Introduction
In the signal processing community, many works have been recently dedicated to the study of the problem of joint decomposition of matrices or tensors because of their numerous applications especially in blind source separation and array processing [1]-[14]. Here, we are interested in the problem of the blind separation of convolutive mixtures of sources. That is why this communication is dedicated to the socalled joint block-diagonalization of matrices problem. In such a decomposition, the wanted matrices are block diagonal ones1 . Such a problem has been already considered in [1][4][7] but under the constraint that the joint-block diagonalizer is an orthogonal (unitary in the complex case) matrix. Our purpose, here, is to 1
A block diagonal matrix is a block matrix in which the off-diagonal block terms are zero matrices and the diagonal matrices are square.
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 193–200, 2007. c Springer-Verlag Berlin Heidelberg 2007
194
H. Ghennioui et al.
discard this unitary constraint. To that aim, we show how the (non necessarily orthogonal) joint-block diagonalizer can be algebraically estimated by minimizing a least mean squares criterion, leading to a new non-orthogonal joint blockdigonalization algorithm. Some computer simulations are provided in order to illustrate the good behaviour of the proposed algorithm. Then, it is shown how this algorithm finds application in blind source separation where it is applied, here, to a set of observations correlation matrices at different time delays. The rest of this communication is organized as follows. The problem statement and the proposed joint block-diagonalization algorithm are both introduced in the Section 2. In the Section 3, we show how this algorithm can be applied to solve the problem of blind separation of convolutive mixtures of sources. Computer simulations are provided in both sections to illustrate the effectiveness of the proposed algorithm and to compare it with another one based on an orthogonal joint block-diagonalization.
2 2.1
Non-orthogonal Joint Block-Diagonalization Problem Problem Statement
The non-orthogonal joint block-diagonalization problem is stated in the following way: let us consider a set M of Nm , Nm ∈ N∗ square invertible matrices Mi ∈ RM×M , i ∈ {1, . . . , Nm } which all admit the following decomposition: Mi = ADi AT ,
or Di = BMi BT ,
∀i ∈ {1, . . . , Nm }
(1)
⎞ Di1 . . . 0 ⎟ ⎜ .. where Di = ⎝ ⎠, ∀i ∈ {1, . . . , Nm }, are N × N block diagonal . 0 . . . Dir matrices with Dij , i ∈ {1, . . . , Nm }, j ∈ {1, . . . , r} are nj × nj square matrices so that n1 + . . . + nr = N (in our case, we will assume that all the matrices have the same size i.e N = r × nj , ∀j ∈ {1, . . . , r}) and where 0 denotes the nj × nj null matrix. A is the M × N (M ≥ N ) full rank matrix and B is its pseudo-inverse (or generalized Moore-Penrose inverse). The non-orthogonal joint block-diagonalization problem consists in estimating the matrix A and the matrices Dij , i ∈ {1, . . . , Nm }, j ∈ {1, . . . , r} (or more simply the matrix B only) from the matrices set M. The case of an orthogonal matrix A has been already considered in [7] where a first solution is proposed. ⎛
2.2
Joint Block-Diagonalization Algorithm
In this communication, we propose to consider the following cost function CBD (C) =
Nm k=1
OffBdiag{CT Mk C}2 ,
(2)
An Algebraic Non Orthogonal Joint Block Diagonalization Algorithm
195
where the operator OffBdiag{·} denotes the zero-block-diagonal matrix and C = BT . Thus: ⎛ ⎞ ⎛ ⎞ 0 M12 . . . M1r M11 M12 . . . M1r ⎜ ⎜ .. ⎟ . ⎟ ⎜M21 . . . ⎜M21 . . . . . . .. ⎟ . ⎟ ⎜ ⎟ ⎜ ⎟ ⇒ OffBdiag{M} = M=⎜ . ⎜ ⎟ .. ⎟ . (3) .. .. .. ⎝ ⎠ ⎝ .. . . ⎠ . ... ... . Mr1 Mr2 . . . 0, Mr1 Mr2 . . . Mrr Let C = [C1 , · · · , Cr ], where Cj , j ∈ {1, · · · , r}, are r block matrices of dimension M × nj . The cost function (2) can be rewritten as: CBD (C) =
Nm
r
CTi Mk Cj 2 =
nj ni Nm
r
T n 2 |(cm i ) Mk cj | (4)
k=1 m=1 n=1 i,j=1(i=j)
k=1 i,j=1(i=j)
where ∀n ∈ {1, . . . , nj } stand for the nj column vectors of matrices Cj , ∀j ∈ {1, . . . , r}. Then: cnj ,
CBD (C) =
Nm n i ,nj
r
T n m T n T ((cm i ) Mk cj )((ci ) Mk cj )
k=1 m,n=1 i,j=1(i=j)
=
Nm n i ,nj
r
T
T
n n T m (cm i ) (Mk cj (cj ) Mk )ci
k=1 m,n=1 i,j=1(i=j)
=
ni r
⎡
T (cm i )
m=1 i=1
=
ni r m=1 i=1
where Qi (Ci ) =
r
j=1(j=i)
⎣
r
nj Nm
⎤ Mk cnj (cnj )T MTk ⎦ cm i
j=1(j=i) n=1 k=1 T
m (cm i ) Qi (Ci )ci
nj Nm n=1
k=1
(5) T
Mk cnj (cnj ) MTk is a symmetric matrix.
cnj (cnj )T
As is rank one, ∀j = 1, . . . , r, and ∀n = 1, . . . , nj , the matrix Qi (Ci ) possesses N −(r −1)nj = nj eigenvectors associated with null eigenvalues. Then, the minimization of this quadratic form under the unit norm constraint can be achieved by taking the nj unit eigenvectors associated with the nj smallest eigenvalues of Qi (Ci ). However since matrix Qi for a given i also depends on column vectors of matrix C, we propose to use an iterative procedure. The proposed non-orthogonal joint block-diagonalization (denoted by NO − JBD) writes: ∀i ∈ {1, . . . , r} with l ∈ N∗ and given Ci
(0)
an initial matrix, do (a) and (b)
(l)
(a) Calculate Qi (Ci ) (l) (b) Find the ni lowest eigenvalues λm , m ∈ {1, . . . , ni } and the associated i (l) m (l) eigenvectors ci , m ∈ {1, . . . , ni } of matrix Qi (Ci ) (l) (l−1) − λm | ≤ ε where ε is Stop after a given number of iterations or when |λm i i a given small positive threshold.
196
H. Ghennioui et al.
2.3
Computer Simulations
We present simulations to illustrate the effectiveness of the proposed algorithm. We consider a set D of Nm = 11 (resp. 31, 56, 96) matrices, randomly chosen (according to a Gaussian law) of mean 0 and variance 1. Initially these matrices are exactly block-diagonal, then random noise matrices of mean 0 and variance σb2 are added. A signal to noise ratio can be defined as SNR = 10 log( σ12 ). To b measure the quality of the separation, the following performance index (which is an extension of the one introduced in [10]) is used: ⎞ ⎞⎤ ⎛ ⎛ ⎡ r r r r 2 2 1 (G) (G) i,j i,j ⎠⎦ ⎝ ⎣ I(G) = − 1⎠ + ⎝ 2 −1 r(r − 1) i=1 j=1 max (G)i, 2 max (G) ,j j=1 i=1
ˆ T A. where (G)i,j ∀i, j ∈ {1, . . . , r} is the (i, j)-th (square) block matrix of G = C All the displayed results have been averaged over 30 Monte-Carlo trials. On the Fig. 1, the performance index of algorithm NO − JBD is displayed versus the number of used matrices (left) and versus the SNR (right). These curves illustrate the good behaviour of the algorithm since I ≈ −110 dB at high SNR. 60
0
SNR=10dB 40
11 matrices 31 matrices 56 matrices 96 matrices
−10
SNR=20dB −20
SNR=30dB 20
SNR=50dB
−30
SNR=100dB
Performance index (dB)
Performance index (dB)
SNR=70dB 0
−20
−40
−60
−40 −50 −60 −70 −80 −90
−80
−100
−100 −110
−120
1
6
11
16
21
26
31
36
41 46 51 56 61 Number of matrices
66
71
76
81
86
91
96
−120
0
5
10
15
20
25
30
35
40
45 50 55 SNR (dB)
60
65
70
75
80
85
90
95 100
Fig. 1. Left: performance index versus number of matrices, right: performance index versus SNR
3 3.1
Separation of Convolutive Mixtures of Sources Model and Assumptions
We consider a convolutive finite-duration impulse response (FIR) model given by xi (t) =
n L j=1 =0
hij ()sj (t − ) + nj (t), ∀i = 1, . . . , m
(6)
An Algebraic Non Orthogonal Joint Block Diagonalization Algorithm
197
where sj (t), ∀j = 1, . . . , n are the n sources, xi (t), i = 1, . . . , m, are the m > n observed signals, hij (t) is the real transfer function between the j-th source and i-th sensor with an overall extent of (L+1) taps. ni (t), ∀i = 1, . . . , m are additive noises. Our developments are based on the two following assumptions: Assumption A: Each source signal is a real temporally coherent signal. Moreover they are uncorrelated two by two, i.e., for all pairs of sources (si (t),sj (t)) with i = j, for all time delay τij , we have Rij (t, τij ) = 0, where Rij (t, τ ) denotes the cross-correlation function between the sources si (t) and sj (t). It is defined as follows: Rij (t, τ ) = E{si (t)sj (t + τ )}, where E{.} stands for the mathematical expectation. Assumption B: The noises ni (t), i = 1, . . . , m, are assumed real stationary white random signals, mutually uncorrelated, independent from the sources, with the same variance σn2 . The noises correlation matrix can be written as: Rn (τ ) = E{n(t)nT (t + τ )} = σn2 δ(τ )Im
(7)
where δ(τ ) stands for the Delta impulse, Im for the m × m identity matrix and (.)T for the transpose operator. Let us now recall how the convolutive mixing model can be reformulated into an instantaneous one [4][7]. Considering the vectors S(t), X(t) and N(t) respectively defined as: S(t) = [s1 (t), . . . , s1 (t − (L + L ) + 1), . . . , sn (t − (L + L ) + 1)]T X(t) = [x1 (t), . . . , x1 (t − L + 1), . . . , xm (t − L + 1)]T N(t) = [n1 (t), . . . , n1 (t − L + 1), . . . , nm (t − L + 1)]T and the (M × N ) matrix A, where M = mL and N = n(L + L ): ⎞ ⎛ A11 . . . A1n ⎟ ⎜ A = ⎝ ... . . . ... ⎠ Am1 . . . Amn where
⎞ hij (0) . . . . . . hij (L) 0 . . . 0 ⎜ .. ⎟ ⎜ 0 ... ... ... ... ... . ⎟ ⎟ ⎜ Aij = ⎜ . ⎟ . . . . . .. .. 0 ⎠ .. .. .. ⎝ .. 0 . . . 0 hij (0) . . . . . . hij (L) ⎛
(8)
are (L × (L + L )) matrices, the model described by Eq. (6) can be written in matrix form as: X(t) = AS(t) + N(t) (9) In order to have an over-determined model, L must be chosen such that mL ≥ n(L + L ). We assume, here, that the matrix A is full rank. Because of the Assumption A, all the components of S(t) are temporally coherent signals. Moreover, two different components of this vector are correlated at least in one non
198
H. Ghennioui et al.
null time delay. With regard to the noise vector N(t), the Assumption B holds for each of its components involving that its correlation matrix RN (τ ) reads: RN (τ ) = E{N(t)NT (t + τ )} ⎛ 2˜ ⎞ 0L σn IL (τ ) 0L . . . ⎜ ⎟ .. .. .. ⎜ 0L ⎟ . . . ⎟ =⎜ ⎜ ⎟ .. .. .. ⎝ . . . 0L ⎠ 0L . . . 0L σn2 ˜IL (τ )
(10)
where ˜IL (τ ) is the L × L matrix which contains ones on the τ th superdiagonal if 0 ≤ τ < L or on the |τ |th subdiagonal if −L ≤ τ ≤ 0 and zeros elsewhere. Then, we have: RX (t, τ ) − RN (τ ) = ARS (t, τ )AT = RY (t, τ )
(11)
Because sources signals are spatially uncorrelated and temporally coherent, the matrices RS (t, τ ), ∀τ are block diagonal matrices. To recover the mixing matrix A, the matrices RY (t, τ ), ∀τ and ∀t can be joint block diagonalized without any unitarity constraint about the wanted matrix A. Notice that in this case, the recovered sources after inversion of the system are obtained up to a permutation and up to a filter but we will not discuss about these indeterminations in this communication. 3.2
Computer Simulations
We present simulations to illustrate the effectiveness of the proposed algorithm in the blind source separation context and to establish a comparison with another algorithm (O − JBD) for the orthogonal joint block diagonalization of matrices. While our algorithm is directly applied on the correlation matrices of the observations, the second algorithm is applied after a pre-whitening stage on the correlation matrices of the pre-whitened observations. We consider m = 4 mixtures of n = 2 speech source signals sampled at 8 kHz, L = 2 and L = 4. These signal sources are mixed according to the following transfer function matrix whose components are randomly generated: ⎛ ⎞ 0.9772 + 0.2079z −1 − 0.0439z −2 −0.6179 + 0.7715z −1 + 0.1517z −2 ⎜−0.2517 − 0.3204z −1 + 0.9132z −2 −0.1861 + 0.4359z −1 − 0.8805z −2⎟ ⎟ A[z] = ⎜ ⎝ 0.0803 − 0.7989z −1 − 0.5961z −2 0.5677 + 0.6769z −1 + 0.4685z −2 ⎠ −0.7952 + 0.3522z −1 + 0.4936z −2 −0.2459 + 0.8138z −1 − 0.5266z −2 where A[z] stands for the z transform of A(t). On the Fig. 2, we have displayed the performance index versus the number of matrices (left) and versus the SNR. One can check that the obtained performance are better with the NO − JBD algorithm than with the O − JBD algorithm. One can also evaluate the blockdiagonalization error defined as:
An Algebraic Non Orthogonal Joint Block Diagonalization Algorithm 40
0 SNR=15dB, O−JBD SNR=15dB, NO−JBD SNR=20dB, O−JBD SNR=20dB, NO−JBD SNR=35dB, O−JBD SNR=35dB, NO−JBD SNR=100dB, O−JBD SNR=100dB, NO−JBD
35 30 25 20 15
5 matrices, O−JBD 5 matrices, NO−JBD 20 matrices, O−JBD 20 matrices, NO−JBD 100 matrices, O−JBD 100 matrices, NO−JBD
−5
−10
10
Performance index (dB)
Performance index (dB)
199
5 0 −5 −10 −15
−15
−20
−25
−30
−20 −25
−35
−30 −35
−40
−40 −45
5
10
20
30
40 50 60 Number of matrices
70
80
90
−45
100
0
5
10
15
20
25
30
35
40
45 50 55 SNR (dB)
60
65
70
75
80
85
90
95 100
Fig. 2. Left: performance index versus number of matrices, right: performance index versus SNR
Nm E = 10 log10 { N1m k=1 OffBdiag{BRY (t, τk )BT 2F } where B is the pseudoinverse of the mixing matrix A and .F denotes the Frobenius norm. Finally, a comparaison of the block-diagonalization error with the NO − JBD and O − JBD algorithms versus the number of matrices (resp. SNR) is given in the left of Fig. 3 (resp. its right). 10
−15
5
SNR=10dB, O−JBD
0
SNR=10dB, NO−JBD −25
−10
SNR=25dB, NO−JBD
−30
−15
SNR=35dB, O−JBD
−35
−20
SNR=35dB, NO−JBD
−25
SNR=45dB, O−JBD
−30
10 matrices, O−JBD 10 matrices, NO−JBD 50 matrices, O−JBD 50 matrices, NO−JBD 100 matrices, O−JBD 100 matrices, NO−JBD
−20
SNR=25dB, O−JBD
Block−diagonalization error (dB)
Block−diagonalization error (dB)
−5
SNR=45dB, NO−JBD
−35 −40 −45 −50 −55
−40 −45 −50 −55 −60 −65
−60
−70
−65 −70
−75
−75 −80
−80 −85
5
10
20
30
40 50 60 Number of matrices
70
80
90
100
−85
0
5
10
15
20
25
30
35
40
45 50 55 SNR (dB)
60
65
70
75
80
85
90
95 100
Fig. 3. Left: block-diagonalization error versus number of matrices, right: blockdiagonalization error versus SNR
4
Discussion and Conclusion
In this paper, we have proposed a new joint block diagonalization algorithm for the separation of convolutive mixtures of sources that does not rely upon a unitary constraint. We have illustrated the usefulness of the proposed approach thanks to computer simulations: the considered algorithm has been applied to source separation using the correlation matrices of speech sources evaluated over different time delays.
200
H. Ghennioui et al.
References 1. Abed-Mera¨ım, K., Belouchrani, A., Leyman, R.: Time-frequency signal analysis and processing: a comprehensive reference. In: Boashash, B. (ed.) chapter in Blind source separation using time-frequency distributions, Prentice-Hall, Oxford (2003) 2. Belouchrani, A., Abed-Mera¨ım, K., Amin, M., Zoubir, A.: Joint antidiagonalization for blind source separation. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’2001), Salt Lake City, Utah (May 2001) 3. Belouchrani, A., Abed-Mera¨ım, K., Cardoso, J.-F., Moulines, E.: A blind source separation technique using second order statistics. IEEE Trans. on Signal Processing 45, 434–444 (1997) 4. Bousbiah-Salah, H., Belouchrani, A., Abed-Meraim, K.: Blind separation of non stationary sources using joint block diagonalization. In: Proc. IEEE Workshop on Statistical Signal Processing, pp. 448–451 (2001) 5. Cardoso, J.-F., Souloumiac, A.: Blind Beamforming for non Gaussian signals. IEEE Proceedings-F 40, 362–370 (1993) 6. Comon, P.: Independant component analysis, a new concept? Signal Processing 36, 287–314 (1994) 7. DeLathauwer, L., F´evotte, C., De Moor, B., Vandewalle, J.: Jacobi algorithm for joint block diagonalization in blind identification. In: 23rd Symposium on Information Theory in the Benelux, Louvain-la-Neuve, Belgium (May 2002) 8. Fadaili, E.-M., Thirion-Moreau, N., Moreau, E.: Algorithme de z´ero-diagonalisation conjointe pour la s´eparation de sources d´eterministes, dans les Proc. du 20`eme colloque GRETSI, Louvain-La-Neuve, Belgique, Septembre 2005, pp. 981–984 (2005) 9. Fadaili, E.-M., Thirion-Moreau, N., Moreau, E.: Non orthogonal joint diagonalization/zero-diagonalization for source separation based on time-frequency distributions. To appear in IEEE Trans. on Signal Processing 55(5) (2007) 10. Moreau, E.: A generalization of joint-diagonalization criteria for source separation. IEEE Trans. Signal Processing 49(3), 530–541 (2001) 11. Pham, D.-T.: Joint approximate diagonalization of positive definite matrices. SIAM Journal on Matrix Analysis and Appli 22(4), 1136–1152 (2001) 12. Yeredor, A.: Non-orthogonal joint diagonalization in the least square sense with application in blind source separation. IEEE Transactions on signal processing 50(7), 1545–1553 (2002) 13. Yeredor, A., Ziehe, A., M¨ uller, K.R.: Approximate joint diagonalization using a natural gradient approach. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 89–96. Springer, Heidelberg (2004) 14. Ziehe, A., Laskov, P., Nolte, G.G., M¨ uller, K.-R.: A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation Journal of Machine Learning Research, No. 5, pp. 801–818 (July 2004)
Non Unitary Joint Block Diagonalization of Complex Matrices Using a Gradient Approach Hicham Ghennioui1,2 , Nad`ege Thirion-Moreau1, Eric Moreau1 , Abdellah Adib2,3 , and Driss Aboutajdine2 1
STD, ISITV, av. G. Pompidou, BP56, F-83162 La Valette du Var Cedex, France [email protected], {thirion,moreau}@univ-tln.fr 2 GSCM-LRIT, FSR, av. Ibn Battouta, BP1014, Rabat, Maroc 3 DPG, IS, av. Ibn Battouta, BP703, Rabat, Maroc [email protected], [email protected]
Abstract. This paper addresses the problem of the non-unitary approximate joint block diagonalization (NU − JBD) of matrices. Such a problem occurs in various fields of applications among which blind separation of convolutive mixtures of sources and wide-band signals array processing. We present a new algorithm for the non-unitary joint blockdiagonalization of complex matrices based on a gradient-descent algorithm whereby the optimal step size is computed algebraically at each iteration as the rooting of a 3rd-degree polynomial. Computer simulations are provided in order to illustrate the effectiveness of the proposed algorithm.
1
Introduction
In the recent years, the problem of the joint decomposition of matrices or tensors sets have found interesting solutions through signal processing applications in blind source separation and array processing. One of the first considered problem was the joint-diagonalization of matrices under the unitary constraint, leading to the nowadays well-known JADE [4] and SOBI [2] algorithms. The following works have addressed either the problem of the joint-diagonalization of tensors [5][7][12] or the problem of the jointdiagonalization of matrices but discarding the unitarity constraint [6][10][14] [15][16][17]. A second type of matrices decomposition has proven to be useful in blind source separation, telecommunications and cryptography. It consists in joint zero-diago-nalizing several matrices either under the unitary constraint [1] or not [9][10]. Most of the proposed (unitary) joint-diagonalization and/or zerodiagonalization algorithms have been applied to the problem of the blind separation of instantaneous mixtures of sources. Finally, a third particular type of matrices decomposition arises in both the wide-band sources localization in correlated noise fields and the blind separation M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 201–208, 2007. c Springer-Verlag Berlin Heidelberg 2007
202
H. Ghennioui et al.
of convolutive mixtures of sources problems. It is called joint block-diagonalization since the wanted matrices are block diagonal matrices1 in such a decomposition. Such a problem has been considered in [3][8] where the block-diagonal matrices under consideration have to be positive definite and hermitian matrices and the required joint-block diagonalizer is a unitary matrix. In this paper, our purpose is to discard this unitary constraint. To that aim, we generalize the non unitary joint-diagonalization approach proposed in [16] to the non-unitary joint block-diagonalization of several complex hermitian matrices. The resulting algorithm is based on a gradient-descent approach whereby the optimal step size is computed algebraically at each iteration as the rooting of a 3rd-degree polynomial. The main advantage of the proposed algorithm is that it is relatively general since the only needed assumption about the complex matrices under consideration is their hermitian symmetry. Finally, the use of the optimal step size speeds up the convergence. The paper is organized as follows. We state the considered problem in the Section 2. In the Section 3, we present the algebraical derivations leading to the proposed non-unitary joint block-diagonalization algorithm. Computer simulations are provided in the Section 4 in order to illustrate the behaviour of the proposed approach.
2
Problem Statement
The non-unitary joint block-diagonalization problem is stated in the following way: let us consider a set M of Nm , Nm ∈ N∗ square matrices Mi ∈ CM×M , i ∈ {1, . . . , Nm } which all admit the following decomposition: Mi = ADi AH ⎛ Di1 . . . 0 ⎜ .. where Di = ⎝ .
or Di = BMi BH , ⎞
∀i ∈ {1, . . . , Nm }
(1)
⎟ ⎠, ∀i ∈ {1, . . . , Nm }, are N × N block diagonal
0 . . . Dir matrices with Dij , i ∈ {1, . . . , Nm }, j ∈ {1, . . . , r} are nj × nj square matrices so that n1 + . . . + nr = N (in our case, we assume that all the matrices have the same size i.e. N = r × nj , j ∈ {1, . . . , r}) and where 0 denotes the nj × nj null matrix. A is the M × N (M ≥ N ) full rank matrix and B is its pseudo-inverse (or generalized Moore-Penrose inverse). The non-unitary joint bloc-diagonalization problem consists in estimating the matrix A and the matrices Dij , i ∈ {1, . . . , Nm }, j ∈ {1, . . . , r} from only the matrices set M. The case of a unitary matrix A has been considered in [8] where a first solution is proposed. 1
A block diagonal matrix is a square diagonal matrix in which the diagonal elements are square matrices of any size (possibly even), and the off-diagonal elements are 0. A block diagonal matrix is therefore a block matrix in which the blocks off the diagonal are the zero matrices and the diagonal matrices are square.
Non Unitary Joint Block Diagonalization of Complex Matrices
3
203
Non-Unitary Joint Block-Diagonalization Using a Gradient Approach
In this section, we present a new algorithm to solve the problem of the nonunitary joint block-diagonalization. We propose to consider the following cost function Nm CBD (B) = OffBdiag{BMi BH }2F , (2) i=1
where .F stands for the Frobenius norm and the operator OffBdiag{·} denotes the zero block-diagonal matrix. Thus: ⎛ ⎛ ⎞ ⎞ 0 M12 . . . M1r M11 M12 . . . M1r ⎜ ⎜ .. ⎟ . ⎟ ⎜M21 . . . ⎜M21 . . . . . . .. ⎟ . ⎟ ⎜ ⎜ ⎟ ⎟ ⇒ OffBdiag{M} = M=⎜ . ⎜ ⎟ . . ⎟ E (3) . .. ⎝ .. ⎝ .. . . . . . . .. ⎠ . .. ⎠ Mr1 Mr2 . . . 0 Mr1 Mr2 . . . Mrr Our aim is to minimize the cost function (2). To make sure that the found matrix B keeps on being invertible, it is updated according to the following scheme (see [17]): B(m) = (I + W(m−1) )B(m−1)
∀m = 1, 2, . . . ,
(4)
where B(0) is some initial guess, B(m) denotes the estimated matrix B at the m-th iteration, W(m−1) is a sufficiently small (in terms of Frobenius norm) zeroblock diagonal matrix and I is the identity matrix. (m) Denoting Mi = B(m−1) Mi B(m−1)H ∀i = 1, . . . , Nm and ∀m = 1, 2, . . ., where (·)H stands for the transpose conjugate operator, then at the m-th iteration, the cost function can be expressed versus W(m−1) rather than B(m) . We now have: CBD (W(m−1) ) =
Nm
(m)
OffBdiag{(I + W(m−1) )Mi
(I + W(m−1) )H }2F
(5)
i=1
Nm (m) (m) or more simply CBD (W) i=1 OffBdiag{(I + W)Mi (I + W)H }2F . At each iteration, the wanted matrix W is then updated according to the following adaptation rule: W(m) = −μ∇CBD (W(m−1) ) ∀m = 1, 2, . . .
(6)
where μ is the step size or adaptation coefficient and where ∇CBD (W(m−1) ) stands for the complex gradient matrix defined, like in [13], by: ∇CBD (W(m−1) ) = 2
∂CBD (W(m−1) ) ∂W(m−1)∗
∀m = 1, 2 . . .
(7)
where (·)∗ is the complex conjugate operator. We now have to calculate the (m)
complex gradient matrix ∇CBD (W) = 2
(m)
∂CBD (W) . ∂W∗
204
3.1
H. Ghennioui et al. (m)
Gradient of the Cost Function CBD (W) (m)
(m)
Let Di and Ei respectively denote the block-diagonal and zero block-diagonal (m) (m) (m) (m) matrices extracted from the matrix Mi (Mi = Ei +Di ). As W is a zero(m) block diagonal matrix too, the cost function CBD (W) can be expressed as: (m) CBD (W)
=
Nm
(m)
OffBdiag{Mi
(m)
} + OffBdiag{Mi
(m)
WH } + OffBdiag{WMi
}
i=1 (m)
+ OffBdiag{WMi =
Nm
(m)
Ei
(m)
+ Di
WH }2F (m)
WH + WDi
(m)
+ WEi
WH 2F
i=1
=
Nm
(m)
tr{(Ei
(m)
+Di
(m)
WH +WDi
(m)
+WEi
(m)
WH )H (Ei
(m)
+Di
WH
i=1 (m)
+ WDi
(m)
+ WEi
WH )}
(8)
where tr{.} stands for the trace operator. Then, using the linearity property of the trace and assuming to simplify the derivations that the considered matrices are hermitian, we finally find that: (m)
CBD (W) =
Nm
(m)H
tr{Ei
(m)
Ei
(m)H
} + 2tr{Ei
(m)
(Di
(m)
WH + WDi
)}
i=1 (m)H
+ tr{WDi + + + +
(m)
Di
(m)H
WH + Di
(m)
WH WDi
}
(m)H (m) 2tr{Ei WEi WH } (m)H (m) (m)H (m) tr{WDi WDi + Di WH Di WH } (m)H (m) (m) 2tr{WEi WH (Di WH + WDi )} (m)H (m) tr{WEi WH WEi WH }
(9)
Using now the following properties [11] tr{PQR} = tr{RPQ} = tr{QRP} ∂tr{PXH } =P ∂X∗ ∂tr{PX} =0 ∂X∗ dtr{P} = tr{dP} dtr{PXH QX} = tr{PdXH QX + PXH QdX} ∂tr{PXH QX} = QXP ∂X∗
(10) (11) (12) (13) (14) (15)
Non Unitary Joint Block Diagonalization of Complex Matrices
205
It finally leads to the following result: (m)
∇CBD (W) = 4
Nm
(m)H
Ei
(m)
Di
(m)H
+ WDi
(m)
(m)H
Di
+ Ei
(m)
WEi
i=1 (m)H
+ WEi (m)
+ Di 3.2
(m)
WH Di
(m)H
WH WEi
(m)
(m)H
WH WEi (m) (m)H . + WDi WEi + WEi
(m)
+ Di
(m)H
WH Di
(16)
Seek of the Optimal Step Size
The expression (16) is then used in the gradient descent algorithm (6). To accelerate its convergence, the optimal step size μ is computed algebraically at (m) (m) each iteration. To that aim, one has to calculate CBD (W ← −μ∇CBD (W)), but (m) (m) here we use CBD (W ← μF(m) = −μOffBdiag{∇CBD (W)}). F(m) is the anti(m) (m) gradient matrix. We use OffBdiag{∇CBD (W)} instead of ∇CBD (W) because W is a sufficiently small (in terms of norm) zero block-diagonal matrix and thus only the off block-diagonal terms are involved in the descent of the criterion. We now have to seek for the optimal step μ ensuring the minimization of the cost (m) function CBD (μF(m) ). This step is determined by the rooting of the 3rd-degree polynomial (18) which is obtained as the derivative of the 4rd-degree polynomial (m) CBD (μF(m) ) with respect to μ: (m)
(m)
CBD (μF(m) ) = a0 (m) ∂CBD (μF(m) )
∂μ
(m)
(m)
(m)
(m)
+ a1 μ + a2 μ2 + a3 μ3 + a4 μ4 ,
(m)
(m)
(m)
(17)
(m)
= 4a4 μ3 + 3a3 μ2 + 2a2 μ + a1 ,
(18)
where the coefficients have been found to be equal to: (m)
a0
=
Nm
(m)H
Ei
(m)H
(Di
tr{Ei
(m)
}
(19)
i=1 (m)
a1
=
Nm
tr{Ei
(m)
(m)
FH + FDi
(m)
) + (Di
(m) H
FH + FDi
(m)
) Ei
}(20)
i=1 (m) a2
=
Nm
(m)H (m) (m)H H (m) tr Ei FEi FH + FEi F Ei
i=1 (m)
+ (Di (m)
a3
=
Nm
(m) H
FH + FDi
(m)
) (Di
(m)H
+ FEi (m)
=
Nm i=1
)
(21)
(m) (m) (m) tr (Di FH + FDi )H FEi FH
i=1
a4
(m)
FH + FDi
(m)
FH (Di
(m)H
tr{F(m) Ei
(m)
FH + FDi
) (m)
F(m)H F(m) Ei
(22) F(m)H }.
(23)
206
H. Ghennioui et al.
The optimal step μ corresponds to the root of the polynomial (18) attaining the absolute minimum in the polynomial (17). 3.3
Summary of the Proposed Algorithm
The proposed non-unitary joint block-diagonalization based on a gradient algorithm denoted by JBDNU,G is now presented below: (0)
(0)
(0)
Denote the Nm square matrices as M1 , M2 , . . . , MNm Given initial estimates W(0) = 0 and B(0) = I For m = 1, 2, . . . For i = 1, . . . , Nm (m) Compute Mi as (m)
Mi
(m−1)
= B(m−1) Mi
B(m−1)H
(m)
Compute ∇CBD (W) whose expression is given by equation (16) EndFor (m) Set F(m) = −OffBdiag{∇CBD (W)} (m) (m) Compute the coefficients a0 , . . . , a4 thanks to (19), (20), (21), (22) and (23) Set the optimal step μ by the research of the root of the polynomial (18) attaining the absolute minimum in the polynomial (17) Set W(m) = μF(m) and B(m) = (I + W(m−1) )B(m−1) EndFor
4
Computer Simulations
In this section, we perform simulations to illustrate the behaviour of the proposed algorithm. We consider a set D of Nm = 11 (resp. 31, 101) matrices, randomly chosen (according to a Gaussian law of mean 0 and variance 1). Initially these matrices are exactly block-diagonal, then matrices with random entries chosen from a Gaussian law of mean 0 and variance σb2 are added. The signal to noise ratio (SNR) is then defined by SNR = 10 log( σ12 ) . We use the following perforb mance index which is an extension of that introduced in [12]: ⎡ ⎛ ⎛ ⎞ ⎞⎤ r r r r 2 2 (G) (G) 1 i,j i,j ⎣ ⎝ ⎝ ⎠⎦ − 1⎠ + I(G) = 2 −1 r(r − 1) i=1 j=1 max (G)i, 2 max (G) ,j j=1 i=1
ˆ where (G)i,j ∀i, j ∈ {1, . . . , r} is the (i, j)-th matrix block (square) of G = BA. The displayed results are averaged over 30 Monte-Carlo trials. In this example, they were obtained considering M = N = 12, r = 3 and real and symmetric matrices. On the left of Fig. 1 we display the performance index obtained with the proposed algorithm versus the number of used matrices for different values of the SNR. On its right we have plotted the evolution of the performance index versus the SNR.
Non Unitary Joint Block Diagonalization of Complex Matrices 20
207
0
0
−20
Performance index (dB)
Performance index (dB)
−20
−40
−60
−40
−60
−80 −80
−100
−100
−120
1
6
11
16
21
26
31
36
41 46 51 56 61 Number of matrices
66
71
76
81
86
91
96 101
−120
0
5
10
15
20
25
30
35
40
45 50 55 SNR (dB)
60
65
70
75
80
85
90
95 100
Fig. 1. Left: performance index versus number Nm of used matrices for different values of the SNR (SNR=10 dB (×), 20 dB (◦), 50 dB () and 100 dB (+)). Right: performance index versus SNR for different size of the matrices set to be joint block-diagonalized (Nm =11 (×), 31 (◦), 101 (+)).
5
Discussion and Conclusion
In this paper, we have proposed a new algorithm (named JBDNU,G ) based on a gradient approach to perform the non-unitary joint block-diagonalization of a given set of complex matrices. One of the main advantages of this algorithm is that it applies to complex hermitian matrices. This algorithm finds application in blind separation of convolutive mixtures of sources and in array processing. In the context of blind sources separation, it should enable to achieve better performances by discarding the unitary constraint. In fact, starting with a prewhitening stage is a possible way to amount to a unitary square mixture of sources to be able to use unitary joint-decomposition algorithms. But such a pre-whitening stage imposes a limit on the attainable performances that can be overcome thanks to non-unitary algorithms.
References 1. Belouchrani, A., Abed-Mera¨ım, K., Amin, M., Zoubir, A.: Joint antidiagonalization for blind source separation. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’2001), Salt Lake City, Utah (May 2001) 2. Belouchrani, A., Abed-Mera¨ım, K., Cardoso, J.-F., Moulines, E.: A blind source separation technique using second order statistics. IEEE Transactions on Signal Processing 45, 434–444 (1997) 3. Bousbiah-Salah, H., Belouchrani, A., Abed-Mera¨ım, K.: Blind separation of non stationary sources using joint block diagonalization. In: Proc. IEEE Workshop on Statistical Signal Processing, pp. 448–451 (2001) 4. Cardoso, J.-F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEEE Proceedings-F 40, 362–370 (1993)
208
H. Ghennioui et al.
5. Comon, P.: Independant component analysis, a new concept? Signal Processing 36, 287–314 (1994) 6. D´egerine, S.: Sur la diagonalisation conjointe approch´ee par un crit`ere des moindres carr´es. In: Proc. 18`eme Colloque GRETSI, Toulouse, Septembre 2001, pp. 311–314 (2001) 7. DeLathauwer, L.: Signal processing based on multilinear algebra. PhD Thesis, Universit´e Catholique de Leuven, Belgique (September 1997) 8. DeLathauwer, L., F´evotte, C., De Moor, B., Vandewalle, J.: Jacobi algorithm for joint block diagonalization in blind identification. In: 23rd Symposium on Information Theory in the Benelux, Louvain-la-Neuve, Belgium (May 2002) 9. Fadaili, E.-M., Thirion-Moreau, N., Moreau, E.: Algorithme de z´ero-diagonalisation conjointe pour la s´eparation de sources d´eterministes. In: dans les Proc. du 20`eme colloque GRETSI, Louvain-La-Neuve, Belgique, Septembre 2005, pp. 981–984 (2005) 10. Fadaili, E.-M., Thirion-Moreau, N., Moreau, E.: Non orthogonal joint diagonalization/zero-diagonalization for source separation based on time-frequency distributions. To appear in IEEE Transactions on Signal Processing, 55(4) (April 2007) 11. Joho, M.: A systematic approach to adaptive algorithms for multichannel system identification, inverse modeling and blind identification. PHD Thesis, Swiss Federal Institute of Technology, Z¨ urich (December 2000) 12. Moreau, E.: A generalization of joint-diagonalization criteria for source separation. IEEE Trans. Signal Processing 49(3), 530–541 (2001) 13. Petersen, K.B., Pedersen, M.S.: The matrix cookbook (January 5, 2005) 14. Pham, D.-T.: Joint approximate diagonalization of positive definite matrices. SIAM Journal on Matrix Analysis and Applications 22(4), 1136–1152 (2001) 15. Yeredor, A.: Non-orthogonal joint diagonalization in the least square sense with application in blind source separation. IEEE Transactions on Signal Processing 50(7), 1545–1553 (2002) 16. Yeredor, A., Ziehe, A., Muller, K.R.: Approximate joint diagonalization using a natural gradient approach. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 89–96. Springer, Heidelberg (2004) 17. Ziehe, A., Laskov, P., Nolte, G.G., M¨ uller, K.-R.: A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, No. 5, 801–818 (July 2004)
A Toolbox for Model-Free Analysis of fMRI Data P. Gruber1 , C. Kohler1 , and F.J. Theis2 2
1 Institute of Biophysics, University of Regensburg, D-93040 Regensburg Max Planck Institute for Dynamics and Self-Organization, D-37073 G¨ ottingen
Abstract. We introduce Model-free Toolbox (MFBOX), a Matlab toolbox for analyzing multivariate data sets in an explorative fashion. Its main focus lies on the analysis of functional Nuclear Magnetic Resonance Imaging (fMRI) data sets with various model-free or data-driven techniques. In this context, it can also be used as plugin for SPM5, a popular tool in regression-based fMRI analysis. The toolbox includes BSS algorithms based on various source models including ICA, spatiotemporal ICA, autodecorrelation and NMF. They can all be easily combined with higher-level analysis methods such as reliability analysis using projective clustering of the components, sliding time window analysis or hierarchical decomposition. As an example, we use MFBOX for the analysis of an fMRI experiment and present short comparisons with the SPM results. The MFBOX is freely available for download at http://mfbox.sf.net.
1
Introduction
With the increasing complexity and dimensionality of large-scale biomedical data sets, classical analysis techniques yield way to more powerful methods that can take into account higher-order relationships in the data. Such often explorative methods have been popular in the field of engineering and statistics for quite some time, however they perpetrate into the application areas such as psychology, biology or medicine in a much slower fashion. This is partially due to the fact that visual and simple tools for more complex analyses are rarely available. In this contribution, we present a toolbox, MFBOX, for performing blind source separation of complex tasks with appropriate pre- and postprocessing methods. One of our main goals is to provide a simple toolbox that allows for the model-free analysis of fMRI data sets [14], although MFBOX may just as well be applied to other recordings. The graphical user interface of MFBOX enables users to easily try out various model-free algorithms, together with additional pre- and postprocessing and reliability analysis. The design of the toolbox is modular, so it can be easily extended to include your algorithm of choice. It can integrate into SPM5 [1] and can be used to perform model-free analysis of biomedical image time series such as fMRI or PET data sets. The toolbox is realized in MATLAB, and has been tested on Linux, Windows and Mac OS. The paper itself is organized as follows: In the next section, we present the core features of MFBOX, M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 209–217, 2007. c Springer-Verlag Berlin Heidelberg 2007
210
P. Gruber, C. Kohler, and F.J. Theis
Fig. 1. The data flow of the MFBOX application. The user interface is divided into the spm mf box part and the mfbox compare ic part.
from preprocessing to Blind Source Separation (BSS) and higher-order analysis itself to postprocessing. We then illustrate the usage of MFBOX on a complex fMRI experiment, and conclude with an outlook.
2
Features of the MFBOX
The MFBOX application includes two main graphical interfaces, spm mf box which gives access to all processing possibilities and mfbox compare ic which allows for easy comparison of different results. Any of the algorithms can also be used separately from the main toolbox interface and additionally a batch run command mfbox runbatch is provided to ease the analysis of multiple data sets. The workflow can be divided into modular stages, also see figure 1: – Preprocessing – Model-free processing and higher-level analysis – Postprocessing 2.1
Preprocessing
The preprocessing stage can include one or more preprocessing steps in a row where the precedence can be easily controlled. The purpose of this stage is to select regions of interest to apply the model-free algorithms on or enhance the effectiveness of the main processing algorithms by denoising or transforming the original sequence data.
A Toolbox for Model-Free Analysis of fMRI Data
211
datavalue mask selection by bounds on voxel values denoise high quality Local ICA (lICA) based 3d denoising [4] infomap gridding using a Self-Organizing Maps (SOM) [12] based on the information content of the voxels remmean different mean removal options roiselect mask selection by loadable masks selectslices rectangular mask selection in voxel space varthreshold mask selection by bounds on voxel variances Recommended standard options are to mask out the parts of the signal which are outside the brain and regions uninteresting for the Blood Oxygenation Level Dependent Contrast (BOLD) effect like blood support system. Moreover the analysis can be enhanced by only using the white matter voxels. How this mask selection is achieved depends on the available data. If structural data is available the preprocessing option roiselect can use the data from a segmentation step. Otherwise the mask selection can be accomplished by datavalue or varthreshold selection. 2.2
Model-Free Analysis and Higher-Level Analysis
Given a multivariate data set x(t), our goal is to find a new basis W such that Wx(t) fulfills some statistical properties. If we for example require the transformed random vector to be statistically independent, this is denoted as Independent Component Analysis [8]. Independent Component Analysis (ICA) can be applied to solve the BSS problem, where x(t) is known to be the mixture of some underlying hidden independent sources. Depending on the data set also other models are of interest, as for e.g. depending on the data set. The MFBOX currently includes three different paradigms for tackling the BSS problem and for each of these fundamental types it contains different algorithms to perform the separation. 1. ICA algorithms PearsonICA spatial ICA algorithm which employs a fast fixed point algorithm as extension of FastICA [8] where the nonlinearity is estimated from the data and automatically adjusted at each step [9] JADE spatial ICA algorithm which is based on the approximate diagonalization of the fourth cumulant tensor [3] stJADE spatiotemporal version of the JADE algorithm [15], for an impression on how the spatiotemporal weighting can enhance the separation of a real data set see figure 2 TemporalICA temporal ICA [2] optionally using Projection Approximation Subspace Tracking (PAST) [17] for the temporal Principal Component Analysis (PCA) reduction hrfICA semi-blind spatial ICA algorithm using the Haemoglobin Response Function (HRF) function to incorporate prior knowledge about the assumed sources
212
P. Gruber, C. Kohler, and F.J. Theis
2. second order algorithms mdSOBI Multidimensional version of the Second Order Blind Identification (SOBI) [16] algorithm, which is based on second-order autodecorrelation stSOBI Spatio-temporal version of the SOBI algorithm [15] 3. other decomposition algorithms hNMF a Nonnegative Matrix Factorization (NMF) decomposition algorithm using hyperplane clustering [5] These base BSS algorithms can be combined with different types of higherlevel analysis methods. These three methods share the fact that they apply a previously selected BSS algorithm multiple times in order to extract additional statistics e.g. for reliability. The methods will be explained in the following. Reliability analysis with projective k-means clustering. A common problem in explorative data analysis is how to access the reliability of the obtained results. A proven method of reliability analysis in statistics is bootstrapping i.e. to randomly subsample the same methods as before and compare the results from multiple different random sub sample runs, see e.g. [6, 7]. For most BSS algorithms this leads to a permutation problem (namely identifying corresponding Independent Component (IC)) since there is an inherent permutation and scaling, especially sign, invariance of BSS. Here we use projective k-means clustering [5] for the assignment and to evaluate the quality of a component using the resulting cluster size. It is essentially a k-means-type algorithm, which acts
1
0.7 0.65
0.9 0.6 0.55
correlation
reliability
0.8
0.7
0.6
0.5 0.45 0.4 0.35
0.5
0.3 0.4 0.25 0.3
0.2 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
spatiotemporal weighting
(a) Comparison of the maximal temporal reliability measure with varying weighting α
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
spatiotemporal weighting
(b) Comparison of the maximal correlation with the external stimulus
Fig. 2. Evaluation of some of the higher level algorithms present in the MFBOX on a real data set using the stJADE algorithm with 10 components. In both graphs α = 1 is the situation where only temporal ICA is performed whereas at α = 0 only spatial ICA is performed.
A Toolbox for Model-Free Analysis of fMRI Data
213
on samples in the projective space RPn of all lines of Rn . This models the scaling indeterminacy of ICA more precisely than projection onto its hypersphere since it also deals with the sign invariance. The result (figure 2) is explained by the fact that the spatial dimension of the data is strongly smoothed and reveals a some structure. Hence it does not follow the usual linear ICA model of independently drawn samples from an identically distributed random variable. The temporal dimension does not expose such additional structure and is less smoothed by the data acquisition and so the temporal ICA should be more stable. Hierarchical analysis of component maps. Another common issue in datadriven BSS methods is how to choose the number of components and how to evaluate the large amount of possibly interesting components the process might yield without means to identify the ones which are interesting in the current problem. The hierarchical analysis tries to overcome this problem by extracting different numbers of components and extracting a hierarchical structure form the relations between the timecourses and the component maps of the extracted sources. This yields a tree structure of the components which can be used to easily navigate to the components which are of interest in an experiment. For more detailed implementation details see also [11]. Sliding time-window analysis. Usualy ICA cannot deal with non-stationary data, so most commonly approximations or windowing techniques are used. In ordinary ICA, the whole data set is transformed, and as many component maps as time steps are reconstructed. Window ICA [10] groups the observations in windows of a fixed size, and extracts components out of each of them. Correlation analysis then gives corresponding components over the whole time interval, thus yielding additional structure in a single full-scale component map. 2.3
Import, Export and Postprocessing
The MFBOX has rich im- and export possibilities to load and save data, masks, designs, reference brain masks, and parameters from different formats as Analyze, Nifti, plain MATLAB, and its own file format. selectcomp semi-manual selection, labeling and grouping of extracted components denoise lICA based 3d denoising [4] of the extracted components
3
Using the MFBOX on fMRI Recording of an Wisconsin Card Sorting Test (WCST)
In this part we will demonstrate the results from using the MFBOX on a real world data set. After a short introduction into the nature of the data we will present the result and compare it to the standard SPM result. The WCST is
214
P. Gruber, C. Kohler, and F.J. Theis front
side
top −8.6
8.6
stJADE analysis component 2 1.5 1 0.5 0 −0.5 −1 0
100
200
300
(a) The stimulus component as exported by MFBOX
(b) SPM result of the stimulus component
Fig. 3. Result of a stJADE run with α=1.0, selecting the whole brain area as mask, and after applying spatial and temporal mean removal. The component network in the prefrontal cortex and a more compact network at the parietal/occipital lobe is clearly visible.
a traditional neuropsychological test that is sensitive to frontal lobe dysfunction and allows the integrity of the subjects' executive functions to be assessed. The WCST task primarily activates the dorsolateral prefrontal cortex, which is responsible for executive working memory operations and cognitive control functions. The given fMRI data set originates from a modified version of the WCST [13]; its aim is to segregate the network components participating in the above-mentioned process. At first, a number of stimulus cards are presented to the participants. Afterwards the subject is asked to match additional cards to one of the stimulus cards with respect to a sorting criterion unknown to the subject; in fact, the subject has to discover it by trial and error (sorting dimensions include the color of the symbols, their shape, and the number of symbols displayed on the card). The sorting dimension changes after a previously defined number of consecutive correct answers. Three different test variants of the WCST were applied:

Task A. No instruction of dimensions (very close to the original WCST).
Task B. Instruction of dimensional change as the sorting criterion changes.
Task C. Reminder of the dimension prior to each trial; the subject knows in advance the attribute to be searched for in the test.

Our results largely verify the findings of [13], as can be seen in figure 3(a). In addition to the stimulation of the prefrontal cortex, increased activity in the parietal lobe as well as in the occipital lobe was revealed by our stJADE algorithm. The increased activity in the rear section of the brain (see figure 3(b)) was also discovered by the model-free algorithm. The complete summary of one stJADE analysis of the data set is shown in figure 4. The classification into
Fig. 4. A complete BSS analysis output as provided by the SVG export function of the MFBOX. The component labeled stimulus is the one with the highest correlation with the stimulus. The components labeled artifact are most likely artifacts; two of them are grouped together, so that 9 components are displayed although 10 components were extracted by stJADE.
stimulus component and artifacts was performed after the analysis, using the reliability analysis and a Minimum Description Length (MDL) likelihood-based noise estimator included in the MFBOX.
4 Conclusion
The MFBOX is an easy-to-use but nonetheless powerful instrument for the exploratory analysis of fMRI data. It includes several modern BSS algorithms, and due to its highly modular structure it can easily be extended with novel as well as classical approaches.

Acknowledgements. The data has been kindly provided by Chuh Hyoun Lie, Institute of Medicine, Research Centre Jülich. The authors gratefully acknowledge partial financial support by the BMBF (project 'ModKog').
References

[1] SPM5, edn. 5 (July 2005), http://www.fil.ion.ucl.ac.uk/spm/spm5.html
[2] Calhoun, V.D., Adali, T., Pearlson, G.D., Pekar, J.J.: Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Hum. Brain Map. 13, 43–53 (2001)
[3] Cardoso, J.-F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings-F 140(6), 362–370 (1993)
[4] Gruber, P., Stadlthanner, K., Böhm, M., Theis, F.J., Lang, E.W., Tomé, A.M., Teixeira, A.R., Puntonet, C.G., Górriz, J.M.: Denoising using local projective subspace methods. Neurocomputing 69, 1485–1501 (2006)
[5] Gruber, P., Theis, F.J.: Grassmann clustering. In: Proc. of EUSIPCO, Florence, Italy (2006)
[6] Harmeling, S., Meinecke, F., Müller, K.R.: Injecting noise for analysing the stability of ICA components. Signal Processing 84, 255–266 (2004)
[7] Himberg, J., Hyvärinen, A., Esposito, F.: Validating the independent components of neuroimaging time-series via clustering and visualization. NeuroImage 22, 1214–1222 (2004)
[8] Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
[9] Karvanen, J., Koivunen, V.: Blind separation methods based on Pearson system and its extensions. Signal Processing 82(4), 663–673 (2002)
[10] Karvanen, J., Theis, F.J.: Spatial ICA of fMRI data in time windows. In: Proc. MaxEnt 2004, AIP Conference Proceedings, Garching, Germany, vol. 735, pp. 312–319 (2004)
[11] Keck, I.R., Lang, E.W., Nassabay, S., Puntonet, C.G.: Clustering of signals using incomplete independent component analysis. In: Cabestany, J., Prieto, A.G., Sandoval, F. (eds.) IWANN 2005. LNCS, vol. 3512, pp. 1067–1074. Springer, Heidelberg (2005)
[12] Kohonen, T.: Self-Organizing Maps. Springer, New York (2001)
[13] Lie, C.-H., Specht, K., Marshall, J.C., Fink, G.R.: Using fMRI to decompose the neural processes underlying the Wisconsin Card Sorting Test. NeuroImage 30, 1038–1049 (2006)
[14] McKeown, M.J., Sejnowski, T.J.: Independent component analysis of fMRI data: Examining the assumptions. Human Brain Mapping 6, 368–372 (1998)
[15] Theis, F.J., Gruber, P., Keck, I.R., Meyer-Bäse, A., Lang, E.W.: Spatiotemporal blind source separation using double-sided approximate joint diagonalization. In: Proc. of EUSIPCO, Antalya, Turkey (2005)
[16] Theis, F.J., Meyer-Bäse, A., Lang, E.W.: Second-order blind source separation based on multi-dimensional autocovariances. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 726–733. Springer, Heidelberg (2004)
[17] Yang, B.: Projection approximation subspace tracking. IEEE Trans. on Signal Processing 43(1), 95–107 (1995)
An Eigenvector Algorithm with Reference Signals Using a Deflation Approach for Blind Deconvolution

Mitsuru Kawamoto¹, Yujiro Inouye², Kiyotaka Kohno², and Takehide Maeda²

¹ National Institute of Advanced Industrial Science and Technology (AIST), Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki, 305-8568, Japan
[email protected], http://staff.aist.go.jp/m.kawamoto/
² Department of Electronic and Control Systems Engineering, Shimane University, 1060 Nishikawatsu, Matsue, 690-8504, Japan
[email protected], [email protected], [email protected]
Abstract. We propose an eigenvector algorithm (EVA) with reference signals for the blind deconvolution (BD) of multiple-input multiple-output infinite impulse response (MIMO-IIR) channels. Unlike conventional EVAs, each output of the deconvolver is used as a reference signal, and moreover the BD can be achieved without using whitening techniques. The validity of the proposed EVA is shown by comparison with our conventional EVA.
1 Introduction
This paper deals with the blind deconvolution (BD) problem for multiple-input multiple-output (MIMO) infinite impulse response (IIR) channels. To solve this problem, we use eigenvector algorithms (EVAs) [6,7,12]. The EVA was first proposed by Jelonnek et al. [6], for solving blind equalization (BE) problems of single-input single-output (SISO) channels or single-input multiple-output (SIMO) channels. In [12], several procedures for the blind source separation (BSS) of instantaneous mixtures, using the generalized eigenvalue decomposition (GEVD), were introduced. Recently, the authors proposed an EVA which can solve BSS problems in the case of MIMO static systems (instantaneous mixtures) [8]. Moreover, based on the idea in [8], an EVA was derived for MIMO-IIR channels (convolutive mixtures) [9]. The EVAs in [8,9] adopted the idea of using reference signals. Research applying this idea to blind signal processing (BSP) problems, such as BD, BE, and BSS, has been carried out by Jelonnek et al. (e.g., [6]), Adib et al. (e.g., [2]), Rhioui et al. [13], and Castella et al. [3]. In [8,9], differently from the conventional methods, only one reference signal was utilized for recovering all the source signals simultaneously. However, the EVA in [9] performs differently for different choices of the reference signal (see Section 4), and in order to recover all source signals, it
Fig. 1. The composite system of an unknown system and a deconvolver, and a reference system: the source vector s(t) and the Gaussian noise vector n(t) enter the channel H(z); the deconvolver W(z) produces the output signal z(t) from the channel output y(t), while the reference system f^T(z) produces the reference signal x(t); G(z) denotes the cascade W(z)H(z).
must be taken into account how to select appropriate eigenvectors from the set of eigenvectors calculated by the EVA. In this paper, in order to circumvent such a tedious (or nasty) task, the output of the deconvolver which is used to recover the source signals is itself used as a reference signal. Accordingly, deflation techniques are needed to recover all source signals. The method proposed in [3] is almost the same as the proposed EVA. However, the proposed EVA can achieve the BD without using whitening techniques. Moreover, the proposed EVA provides good performance compared with our conventional EVA [9] (see Section 4).

The present paper uses the following notation. Let $Z$ denote the set of all integers and $C$ the set of all complex numbers. Let $C^n$ denote the set of all $n$-column vectors with complex components and $C^{m \times n}$ the set of all $m \times n$ matrices with complex components. The superscripts $T$, $*$, and $H$ denote, respectively, the transpose, the complex conjugate, and the complex conjugate transpose (Hermitian) of a matrix. The symbols block-diag$\{\cdots\}$ and diag$\{\cdots\}$ denote respectively a block diagonal and a diagonal matrix with the block diagonal and the diagonal elements $\{\cdots\}$. The symbol cum$\{x_1, x_2, x_3, x_4\}$ denotes a fourth-order cumulant of the $x_i$'s. Let $i = 1, n$ stand for $i = 1, 2, \ldots, n$.
2 Problem Formulation and Assumptions
We consider a MIMO channel with $n$ inputs and $m$ outputs, described by

$$y(t) = \sum_{k=-\infty}^{\infty} H^{(k)} s(t-k) + n(t), \quad t \in Z, \tag{1}$$

where $s(t)$ is an $n$-column vector of input (or source) signals, $y(t)$ is an $m$-column vector of channel outputs, $n(t)$ is an $m$-column vector of Gaussian noises, and $\{H^{(k)}\}$ is an $m \times n$ impulse response matrix sequence. The transfer function of the channel is defined by $H(z) = \sum_{k=-\infty}^{\infty} H^{(k)} z^k$, $z \in C$. To recover the source signals, we process the output signals by an $n \times m$ deconvolver (or equalizer) $W(z)$ described by

$$z(t) = \sum_{k=-\infty}^{\infty} W^{(k)} y(t-k) = \sum_{k=-\infty}^{\infty} G^{(k)} s(t-k) + \sum_{k=-\infty}^{\infty} W^{(k)} n(t-k), \tag{2}$$

where $\{G^{(k)}\}$ is the impulse response matrix sequence of $G(z) := W(z)H(z)$, which is defined by $G(z) = \sum_{k=-\infty}^{\infty} G^{(k)} z^k$, $z \in C$. The cascade connection of the unknown system and the deconvolver is illustrated in Fig. 1.
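For intuition, here is a minimal numerical sketch of the model (1) under a causal FIR truncation of the doubly infinite channel; the tap count, array shapes and noise level are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mix(H, s, noise_std=0.0):
    """Generate observations y(t) = sum_k H^(k) s(t-k) + n(t) for an
    FIR channel: H has shape (K, m, n) (K causal taps), s is (n, T)."""
    K, m, _ = H.shape
    T = s.shape[1]
    y = np.zeros((m, T))
    for k in range(K):
        y[:, k:] += H[k] @ s[:, :T - k]   # delayed, mixed source block
    return y + noise_std * np.random.randn(m, T)
```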
Here, we put the following assumptions on the channel, the source signals, the deconvolver, and the noises.

A1) The transfer function $H(z)$ is stable and has full column rank on the unit circle $|z| = 1$. This implies that the unknown system has fewer inputs than outputs, i.e., $n < m$, and that there exists a stable left inverse of the unknown system.

A2) The input sequence $\{s(t)\}$ is a complex, zero-mean, non-Gaussian random vector process whose element processes $\{s_i(t)\}$, $i = 1, n$, are mutually independent. Each element process $\{s_i(t)\}$ is an i.i.d. process with a nonzero variance $\sigma_{s_i}^2 \neq 0$ and a nonzero fourth-order cumulant $\gamma_i \neq 0$ defined as

$$\gamma_i = \mathrm{cum}\{s_i(t), s_i(t), s_i^*(t), s_i^*(t)\} \neq 0. \tag{3}$$

A3) The deconvolver $W(z)$ is an FIR filter of sufficient length $L$ so that the truncation effect can be ignored.

A4) The noise sequence $\{n(t)\}$ is a zero-mean, Gaussian vector stationary process whose component processes $\{n_j(t)\}$, $j = 1, m$, have nonzero variances $\sigma_{n_j}^2$, $j = 1, m$.

A5) The two vector sequences $\{n(t)\}$ and $\{s(t)\}$ are mutually statistically independent.

Under A3), the impulse response $\{G^{(k)}\}$ of the cascade system is given by

$$G^{(k)} := \sum_{\tau = L_1}^{L_2} W^{(\tau)} H^{(k - \tau)}, \quad k \in Z, \tag{4}$$

where the length $L := L_2 - L_1 + 1$ is taken to be sufficiently large. In vector form, (4) can be written as

$$\tilde{g}_i = \tilde{H} \tilde{w}_i, \quad i = 1, n, \tag{5}$$
where $\tilde{g}_i$ is the column vector consisting of the $i$th output impulse response of the cascade system, defined by

$$\tilde{g}_i := [g_{i1}^T, g_{i2}^T, \ldots, g_{in}^T]^T, \quad g_{ij} := [\ldots, g_{ij}(-1), g_{ij}(0), g_{ij}(1), \ldots]^T, \quad j = 1, n, \tag{6}$$

where $g_{ij}(k)$ is the $(i,j)$th element of the matrix $G^{(k)}$, and $\tilde{w}_i$ is the $mL$-column vector consisting of the tap coefficients (corresponding to the $i$th output) of the deconvolver, defined by

$$\tilde{w}_i := [w_{i1}^T, w_{i2}^T, \ldots, w_{im}^T]^T \in C^{mL}, \quad w_{ij} := [w_{ij}(L_1), w_{ij}(L_1+1), \ldots, w_{ij}(L_2)]^T \in C^{L}, \quad j = 1, m, \tag{7}$$

where $w_{ij}(k)$ is the $(i,j)$th element of the matrix $W^{(k)}$, and $\tilde{H}$ is the $n \times m$ block matrix whose $(i,j)$th block element $H_{ij}$ is the matrix (of $L$ columns and a possibly infinite number of rows) with the $(l,r)$th element $[H_{ij}]_{lr}$ defined by $[H_{ij}]_{lr} := h_{ji}(l - r)$, $l = 0, \pm 1, \pm 2, \ldots$, $r = L_1, \ldots, L_2$, where $h_{ij}(k)$ is the $(i,j)$th element of the matrix $H^{(k)}$. In the multichannel blind deconvolution problem, we want to adjust the $\tilde{w}_i$'s ($i = 1, n$) so that

$$[\tilde{g}_1, \ldots, \tilde{g}_n] = \tilde{H}[\tilde{w}_1, \ldots, \tilde{w}_n] = [\tilde{\delta}_1, \ldots, \tilde{\delta}_n] P, \tag{8}$$
where $P$ is an $n \times n$ permutation matrix and $\tilde{\delta}_i$ is the $n$-block column vector defined by

$$\tilde{\delta}_i := [\delta_{i1}^T, \delta_{i2}^T, \ldots, \delta_{in}^T]^T, \quad i = 1, n, \tag{9}$$

$$\delta_{ij} := \begin{cases} \hat{\delta}_i, & \text{if } i = j, \\ (\ldots, 0, 0, 0, \ldots)^T, & \text{otherwise}. \end{cases} \tag{10}$$

Here, $\hat{\delta}_i$ is the column vector (of infinite elements) whose $r$th element $\hat{\delta}_i(r)$ is given by $\hat{\delta}_i(r) = d_i \delta(r - k_i)$, where $\delta(t)$ is the Kronecker delta function, $d_i$ is a complex number standing for a scale change and a phase shift, and $k_i$ is an integer standing for a time shift.
3 Eigenvector Algorithms (EVAs)

3.1 Analysis of Eigenvector Algorithms with Reference Signals for MIMO-IIR Channels
In order to solve the BD problem, the following cross-cumulant between $z_i(t)$ and a reference signal $x(t)$ (see Fig. 1) is defined:

$$D_{z_i x} = \mathrm{cum}\{z_i(t), z_i^*(t), x(t), x^*(t)\}, \tag{11}$$

where $z_i(t)$ is the $i$th element of $z(t)$ in (2) and the reference signal $x(t)$ is given by $x(t) = f^T(z) y(t)$, using an appropriate filter $f(z)$. The filter $f(z)$ is called a reference system. Let $a(z) := H^T(z) f(z) = [a_1(z), a_2(z), \ldots, a_n(z)]^T$; then $x(t) = a^T(z) s(t) = f^T(z) H(z) s(t)$. The element $a_i(z)$ of the filter $a(z)$ is defined as $a_i(z) = \sum_{k=-\infty}^{\infty} a_i(k) z^k$, and the reference system $f(z)$ is an $m$-column vector whose elements are $f_j(z) = \sum_{k=L_1}^{L_2} f_j(k) z^k$, $j = 1, m$.

Jelonnek et al. [6] have shown in the single-input case that, by the Lagrangian method, the maximization of $|D_{z_i x}|$ under $\sigma_{z_i}^2 = \sigma_{s_{\rho_i}}^2$ leads to a closed-form solution expressed as a generalized eigenvector problem, where $\sigma_{z_i}^2$ and $\sigma_{s_{\rho_i}}^2$ denote the variances of the output $z_i(t)$ and of a source signal $s_{\rho_i}(t)$, respectively, and $\rho_i$ is one of the integers $\{1, 2, \ldots, n\}$ such that the set $\{\rho_1, \rho_2, \ldots, \rho_n\}$ is a permutation of the set $\{1, 2, \ldots, n\}$. In our case, $D_{z_i x}$ and $\sigma_{z_i}^2$ can be expressed in terms of the vector $\tilde{w}_i$ as, respectively,

$$D_{z_i x} = \tilde{w}_i^H \tilde{B} \tilde{w}_i, \qquad \sigma_{z_i}^2 = \tilde{w}_i^H \tilde{R} \tilde{w}_i, \tag{12}$$

where $\tilde{B}$ is the $m \times m$ block matrix whose $(i,j)$th block element $B_{ij}$ is the matrix with the $(l,r)$th element $[B_{ij}]_{lr}$ calculated by $\mathrm{cum}\{y_i^*(t - L_1 - l + 1), y_j(t - L_1 - r + 1), x^*(t), x(t)\}$ ($l, r = 1, L$), and $\tilde{R} = E[\tilde{y}^*(t) \tilde{y}^T(t)]$ is the covariance matrix of the $m$-block column vector $\tilde{y}(t)$ defined by

$$\tilde{y}(t) := [y_1^T(t), y_2^T(t), \ldots, y_m^T(t)]^T \in C^{mL}, \tag{13}$$

$$y_j(t) := [y_j(t - L_1), y_j(t - L_1 - 1), \ldots, y_j(t - L_2)]^T \in C^{L}, \tag{14}$$

$j = 1, m$. Therefore, in a similar way to [6], the maximization of $|D_{z_i x}|$ under $\sigma_{z_i}^2 = \sigma_{s_{\rho_i}}^2$ leads to the following generalized eigenvector problem:

$$\tilde{B} \tilde{w}_i = \lambda_i \tilde{R} \tilde{w}_i. \tag{15}$$

Moreover, Jelonnek et al. have shown in [6] that the eigenvector corresponding to the maximum-magnitude eigenvalue of $\tilde{R}^\dagger \tilde{B}$ becomes the solution of the blind equalization problem, which is referred to as an eigenvector algorithm (EVA). Note that since Jelonnek et al. have dealt with SISO-IIR or SIMO-IIR channels, the constructions of $\tilde{B}$, $\tilde{w}_i$, and $\tilde{R}$ in (15) are different from those proposed in [6,7]. In this paper, we want to show how the eigenvector algorithm (15) works for the BD of the MIMO-IIR channel (1). To this end, we use the following equalities:

$$\tilde{R} = \tilde{H}^H \tilde{\Sigma} \tilde{H}, \qquad \tilde{B} = \tilde{H}^H \tilde{\Lambda} \tilde{H}, \tag{16}$$

where $\tilde{\Sigma}$ is the block diagonal matrix defined by

$$\tilde{\Sigma} := \text{block-diag}\{\Sigma_1, \Sigma_2, \ldots, \Sigma_n\}, \tag{17}$$

$$\Sigma_i := \text{diag}\{\ldots, \sigma_{s_i}^2, \sigma_{s_i}^2, \sigma_{s_i}^2, \ldots\}, \quad i = 1, n, \tag{18}$$

and $\tilde{\Lambda}$ is the block diagonal matrix defined by

$$\tilde{\Lambda} := \text{block-diag}\{\Lambda_1, \Lambda_2, \ldots, \Lambda_n\}, \tag{19}$$

$$\Lambda_i := \text{diag}\{\ldots, |a_i(-1)|^2 \gamma_i, |a_i(0)|^2 \gamma_i, |a_i(1)|^2 \gamma_i, \ldots\}, \quad i = 1, n. \tag{20}$$

Since both $\tilde{\Sigma}$ and $\tilde{\Lambda}$ are diagonal, (16) shows that the two matrices $\tilde{R}$ and $\tilde{B}$ are simultaneously diagonalizable. Here, let the eigenvalues of the diagonal matrix $\tilde{\Sigma}^{-1} \tilde{\Lambda}$ be denoted by

$$\lambda_i(k) := |a_i(k)|^2 \gamma_i / \sigma_{s_i}^2, \quad i = 1, n, \quad k \in Z. \tag{21}$$

We put the following assumption on the eigenvalues $\lambda_i(k)$'s.

A6) All the eigenvalues $\lambda_i(k)$'s are distinct for $i = 1, n$ and $k \in Z$.

Theorem 1. Suppose the noise term $n(t)$ is absent and the length $L$ of the deconvolver is infinite (that is, $L_1 = -\infty$ and $L_2 = \infty$). Then, under the assumptions A1) through A6), the $n$ eigenvectors $\tilde{w}_i$ corresponding to the $n$ nonzero eigenvalues $\lambda_i(k)$ of the matrix $\tilde{R}^\dagger \tilde{B}$, for $i = 1, n$ and an arbitrary $k \in Z$, become the vectors $\tilde{w}_i$ satisfying (8).

Outline of the proof: Based on (15), we consider the following eigenvector problem:

$$\tilde{R}^\dagger \tilde{B} \tilde{w}_i = \lambda_i \tilde{w}_i. \tag{22}$$

Then, from (16), (22) becomes

$$(\tilde{H}^H \tilde{\Sigma} \tilde{H})^\dagger \tilde{H}^H \tilde{\Lambda} \tilde{H} \tilde{w}_i = \lambda_i \tilde{w}_i. \tag{23}$$

Under $L_1 = -\infty$ and $L_2 = \infty$, we have the following equalities:

$$(\tilde{H}^H \tilde{\Sigma} \tilde{H})^\dagger = \tilde{H}^\dagger \tilde{\Sigma}^\dagger \tilde{H}^{H\dagger}, \qquad \tilde{H}^{H\dagger} \tilde{H}^H = I, \tag{24}$$

which are shown in [11] along with their proofs. Then it follows from (23) and (24) that

$$\tilde{H}^\dagger \tilde{\Sigma}^{-1} \tilde{\Lambda} \tilde{H} \tilde{w}_i = \lambda_i \tilde{w}_i. \tag{25}$$

Multiplying (25) by $\tilde{H}$ from the left side and using (24), (25) becomes

$$\tilde{\Sigma}^{-1} \tilde{\Lambda} \tilde{H} \tilde{w}_i = \lambda_i \tilde{H} \tilde{w}_i. \tag{26}$$

By (21), $\tilde{\Sigma}^{-1} \tilde{\Lambda}$ is a diagonal matrix with diagonal elements $\lambda_i(k)$, $i = 1, n$ and $k \in Z$, and thus (22) and (26) show that its diagonal elements $\lambda_i(k)$ are eigenvalues of the matrix $\tilde{R}^\dagger \tilde{B}$. Here we use the following fact:

$$\lim_{L \to \infty} (\mathrm{rank}\, \tilde{R})/L = n, \tag{27}$$

which is shown in [10] and whose proof is found in [4]. Using this fact, the other remaining eigenvalues of $\tilde{R}^\dagger \tilde{B}$ are all zero. By the assumption A6), the $n$ eigenvectors $\tilde{w}_i$, $i = 1, n$, corresponding to the $n$ nonzero eigenvalues $\lambda_i(k) \neq 0$, $i = 1, n$, obtained by (26), that is, obtained by (22), become the $n$ solutions of the vectors $\tilde{w}_i$ satisfying (8).

3.2 How to Choose a Reference Signal

In [9], a reference system $f(z)$ is appropriately chosen, and then all source signals can be recovered simultaneously from the observed signals. However, the performances obtained by the EVA in [9] change with the way of choosing a reference system (see Section 4); moreover, the EVA entails the complicated task of selecting appropriate eigenvectors from the set of eigenvectors calculated by the EVA. In this paper, by adopting $x_i(t) = \tilde{w}_i^T \tilde{y}_i(t)$ as a reference signal, we want to circumvent such a tedious (or nasty) task. To this end, (11) can be reformulated as

$$D_{z_i x_i} = \mathrm{cum}\{z_i(t), z_i^*(t), x_i(t), x_i^*(t)\}, \quad i = 1, n. \tag{28}$$

The vector $\tilde{w}_i$ in $x_i(t)$ is given by the eigenvector obtained from the EVA at the previous time, that is, $x_i(t) = \tilde{w}_i^T(t-1) \tilde{y}_i(t)$, where the value of $\tilde{w}_i^T(t-1)$ is assumed to be fixed. By using $x_i(t)$, the matrix $\tilde{B}$ is calculated, which is denoted by $\tilde{B}_i(t)$, and then the eigenvector $\tilde{w}_i(t)$ at time $t$ can be obtained from the EVA using $\tilde{B}_i(t)$. By repeating this operation, the BD can be achieved. It can be seen that, as the EVA works successfully, $x_i(t)$ gradually becomes a source signal $s_{\rho_i}(t - k_i)$; namely, the diagonal elements of $\tilde{\Lambda}$ in (19) gradually become zero except for the one element corresponding to $s_{\rho_i}(t - k_i)$. This means that when the eigenvectors of $\tilde{R}^\dagger \tilde{B}_i(t)$ are calculated for achieving the BD, it is enough that we select the eigenvector corresponding to the absolute maximum eigenvalue of $\tilde{R}^\dagger \tilde{B}_i(t)$. This is the reason why we can circumvent the tedious task by using the reference signal. After all, the EVA is implemented as follows:

Set initial values: $\tilde{w}_i(0)$, $\tilde{R}(0)$, $\tilde{B}_i(0)$
for $t_l$ = 1 : $t_{l_{all}}$
    for $t$ = $t_d(t_l - 1) + 1$ : $t_d t_l$
        $x_i(t) = \tilde{w}_i^T(t_l - 1) \tilde{y}_i(t)$
        Calculate $\tilde{R}(t)$ and $\tilde{B}_i(t)$ by a moving average.
    end
    Calculate the eigenvector $\tilde{w}_i(t_l)$ associated with the absolute maximum eigenvalue $|\lambda_i|$ from (22).
end

where $t_{l_{all}}$ denotes the total number of iterations and $t_d$ denotes the number of data samples for estimating the matrices $\tilde{R}(t)$ and $\tilde{B}_i(t)$. Note that $\tilde{R}$ does not need to be estimated iteratively, but this way is adopted for convenience. It is worth noting that when the above algorithm is implemented, it may happen that each output of the deconvolver provides the same source signal. Therefore, in order to avoid such a situation, we apply a deflation approach, that is, the Gram-Schmidt decorrelation [1], to the eigenvectors $\tilde{w}_i(t_l)$ for $i = 1, n$.
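A compact numerical sketch of this procedure is given below. It is an assumption-laden illustration, not the authors' implementation: the cumulant matrix uses the standard sample formula for zero-mean signals, the matrices are estimated from the whole block rather than by a moving average, and a QR factorization realizes the Gram-Schmidt decorrelation.

```python
import numpy as np
from scipy.linalg import eig

def estimate_B(Y, x):
    """Sample fourth-order cross-cumulant matrix of Eq. (12):
    B[l, r] = cum{y_l*(t), y_r(t), x*(t), x(t)}, zero-mean signals."""
    T = Y.shape[1]
    Yc, ax2 = Y.conj(), np.abs(x) ** 2
    return ((Yc * ax2) @ Y.T / T
            - (Yc @ Y.T / T) * ax2.mean()
            - np.outer(Yc @ x.conj(), Y @ x) / T ** 2
            - np.outer(Yc @ x, Y @ x.conj()) / T ** 2)

def eva_deflation(Y, n_src, n_iter=10):
    """EVA sketch with deflation: each deconvolver output is its own
    reference; keep the eigenvector of pinv(R)B with largest-magnitude
    eigenvalue (Eq. (22)); Y is (mL, T) stacked observations y_tilde."""
    R = Y.conj() @ Y.T / Y.shape[1]           # covariance matrix
    Rp = np.linalg.pinv(R)
    W = np.linalg.qr(np.random.randn(Y.shape[0], n_src))[0]
    for _ in range(n_iter):
        for i in range(n_src):
            x = W[:, i] @ Y                   # reference signal x_i(t)
            vals, vecs = eig(Rp @ estimate_B(Y, x))
            W[:, i] = np.real(vecs[:, np.argmax(np.abs(vals))])
        W = np.linalg.qr(W)[0]                # Gram-Schmidt deflation
    return W
```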
4 Simulation Results
To demonstrate the validity of the proposed method, many computer simulations were conducted. Some results are shown in this section. The unknown system H(z) was set to be the same channel with two inputs and three outputs as in [9]. Also, other setup conditions, that is, the source signals si (t)’s, the noises ni (t)’s,
Fig. 2. The performances of the proposed EVA and our conventional EVA with varying SNR levels (5 to 40 dB), in the case of 5,000 data samples. The three curves (a), (b) and (c) correspond to the three reference signals described in the text (vertical axis: performance from 0 down to −25 dB; horizontal axis: SNR in dB).
and their SNR levels were the same as in [9]. As a measure of performance, we used the multichannel intersymbol interference (MISI) [5], averaged over 30 Monte Carlo runs. In each Monte Carlo run, the number of iterations $t_{l_{all}}$ was set to 10, and the number of data samples $t_d$ was set to 5,000. For comparison, our conventional EVA in [9] was used; the conventional EVA does not need deflation approaches. Fig. 2 shows the performances of the EVAs when the SNR levels were taken to be 5 through 40 dB in steps of 5 dB, for three kinds of reference signals:

(a) $x(t) = \sum_{i=1}^{3} f_i(5) y_i(t-5)$, where each parameter $f_i(5)$ was randomly chosen from a Gaussian distribution with zero mean and unit variance;
(b) $x(t) = f_2(2) y_2(t-2)$, where $f_2(2)$ was also randomly chosen from the Gaussian distribution;
(c) $x_i(t) = \tilde{w}_i^T(t-1) \tilde{y}_i(t)$, $i = 1, 3$.

The last reference signal, (c), corresponds to the proposed EVA, while the other two, (a) and (b), correspond to our conventional EVA. From Fig. 2, it can be seen that the proposed EVA provides better performance than our conventional EVA [9].
5 Conclusions
We have proposed an EVA for solving the BD problem. By using the output of the deconvolver as a reference signal, the tedious task of our conventional EVA can be circumvented. The simulation results have demonstrated the effectiveness of the proposed EVA. However, from the simulation results one can also see that all our EVAs share the drawback of being sensitive to Gaussian noise. Therefore, as further work, we will propose an EVA with the property that the BD can be achieved as insensitively to Gaussian noise as possible.

Acknowledgments. This work is supported by the Research Projects, Grants-in-Aid No. 18500146 and No. 18500542 for Scientific Research of the JSPS.
References

1. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks 10(3), 626–634 (1999)
2. Adib, A., et al.: Source separation contrasts using a reference signal. IEEE Signal Processing Letters 11(3), 312–315 (2004)
3. Castella, M., et al.: Quadratic higher-order criteria for iterative blind separation of a MIMO convolutive mixture of sources. IEEE Trans. Signal Processing 55(1), 218–232 (2007)
4. Inouye, Y.: Autoregressive model fitting for multichannel time series of degenerate rank: Limit properties. IEEE Trans. Circuits and Systems 32(3), 252–259 (1985)
5. Inouye, Y., Tanebe, K.: Super-exponential algorithms for multichannel blind deconvolution. IEEE Trans. Sig. Proc. 48(3), 881–888 (2000)
6. Jelonnek, B., Kammeyer, K.D.: A closed-form solution to blind equalization. Signal Processing 36(3), 251–259 (1994)
7. Jelonnek, B., Boss, D., Kammeyer, K.D.: Generalized eigenvector algorithm for blind equalization. Signal Processing 61(3), 237–264 (1997)
8. Kawamoto, M., et al.: Eigenvector algorithms using reference signals. In: Proc. ICASSP 2006, vol. V, pp. 841–844 (May 2006)
9. Kawamoto, M., et al.: Eigenvector algorithms for blind deconvolution of MIMO-IIR systems. In: Proc. ISCAS 2007, pp. 3490–3493 (May 2007), downloadable at http://staff.aist.go.jp/m.kawamoto/manuscripts/ISCAS2007.pdf
10. Kohno, K., et al.: Adaptive super-exponential algorithms for blind deconvolution of MIMO systems. In: Proc. ISCAS 2004, vol. V, pp. 680–683 (May 2004)
11. Kohno, K., et al.: Robust super-exponential methods for blind equalization of MIMO-IIR systems. In: Proc. ICASSP 2006, vol. V, pp. 661–664 (2006)
12. Parra, L., Sajda, P.: Blind source separation via generalized eigenvalue decomposition. Journal of Machine Learning Research 4, 1261–1269 (2003)
13. Rhioui, S., et al.: Quadratic MIMO contrast functions for blind source separation in a convolutive context. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 230–237. Springer, Heidelberg (2006)
Robust Independent Component Analysis Using Quadratic Negentropy

Jaehyung Lee, Taesu Kim, and Soo-Young Lee

Department of Bio & Brain Engineering, KAIST, Republic of Korea
{jaehyung.lee, taesu.kim, sylee}@kaist.ac.kr
Abstract. We present a robust algorithm for independent component analysis that uses the sum of marginal quadratic negentropies as a dependence measure. It can handle arbitrary source density functions by using kernel density estimation, but is robust for a small number of samples by avoiding empirical expectation and directly calculating the integration of quadratic densities. In addition, our algorithm is scalable because the gradient of our contrast function can be calculated in O(LN) using the fast Gauss transform, where L is the number of sources and N is the number of samples. In our experiments, we evaluated the performance of our algorithm for various source distributions and compared it with other, well-known algorithms. The results show that the proposed algorithm consistently outperforms the others. Moreover, it is extremely robust to outliers and is particularly more effective when the number of observed samples is small and the number of mixed sources is large.
1 Introduction
In the last decade, Independent Component Analysis (ICA) has proven to be a great success in many applications, including sound separation, EEG signal analysis, and feature extraction. ICA shows quite good performance for simple source distributions, if the given assumptions hold well, but its performance degrades for sources with skewed or complex density functions [1]. Several ICA methods are currently available for arbitrary distributions, but these methods have not yet shown practical performance when the number of sources is large and the number of observed samples is small, thus preventing their application to more challenging real-world problems, such as blind source separation for non-stationary mixing environments and frequency-domain BSS for convolutive mixtures [2]. The problem of ICA for arbitrary distributions mainly arises from the difficulty of estimating the marginal entropies that usually appear in the contrast function derived from mutual information. Direct estimation of marginal entropies without parametric assumptions involves excessive computation, including numerical integration, and is sensitive to outliers because of the log terms. Several approximations are available, but these still rely on higher-order statistical terms that are also sensitive to outliers. Different estimators of entropy [3] or dependence measures based on canonical correlations [1] have been suggested to
overcome this problem and have shown promising performance. In addition, there have been approaches using nonparametric mutual information via Rényi's entropy [4] for ICA [5]. However, this method requires sign correction by kurtosis because Rényi's entropy does not have a maximum at a Gaussian distribution [6]. In this paper, we define the concept of quadratic negentropy, replace the original negentropy with quadratic negentropy in the original definition of mutual information, and obtain a new contrast function for ICA. Using kernel density estimation along with quadratic negentropy reduces the integration terms into sums of pairwise interactions between samples. The final contrast function can be calculated efficiently using the fast Gauss transform, guaranteeing scalability. Our algorithm consistently outperforms the best existing algorithms for various source distributions and in the presence of outliers, especially when the number of observed samples is small and the number of mixed sources is large. This paper is organized as follows. In Section 2, we review the basic problem of ICA and the contrast function using negentropy. In Section 3, we define a new contrast function for ICA using quadratic negentropy along with kernel density estimation; we also apply the fast Gauss transform to reduce computation. In Section 4, we evaluate the performance of the derived algorithm on various source distributions, varying the number of sources and the number of samples, and compare the proposed algorithm with other well-known algorithms, such as FastICA and KernelICA.
2 Background on ICA
In this section, we briefly review the basic problem of ICA and the contrast function using the original negentropy.

2.1 The Basic Problem of ICA
Let $s_1, s_2, \ldots, s_L$ be $L$ statistically independent source random variables that are linearly mixed by some unknown but fixed mixing coefficients to form $L$ observed random variables $x_1, x_2, \ldots, x_L$. For example, the source variables can be the voices of different people at a location, and the observation variables represent the recordings from several microphones at that location. This can be written in matrix form as

$$x = A s, \tag{1}$$

where $x = (x_1, x_2, \ldots, x_L)^T$, $s = (s_1, s_2, \ldots, s_L)^T$, and $A$ is an $L \times L$ matrix. The basic problem of ICA is to determine $W$, the inverse of the mixing matrix $A$, to recover the original sources from the observations, using $N$ samples of the observation $x$ under the assumption that the sources are independent of each other.

2.2 Contrast Function Using Negentropy
Mutual information between components of estimated source vectors is known to be a natural contrast function for ICA because it has a zero value when
the components are independent and a positive value otherwise. In addition, it is well known that mutual information can be represented using joint and marginal negentropies [7], as follows:

$$I(x) = J(x) - \sum_{i=1}^{N} J(x_i) + \frac{1}{2} \log \frac{\prod_{i=1}^{N} V_{ii}}{\det V}, \tag{2}$$

where $x$ is a vector random variable of dimension $N$, $x_i$ is the $i$-th component of $x$, $V$ is the covariance matrix of $x$, and $J(x)$ is the negentropy of a random variable $x$, which can be represented using the Kullback-Leibler divergence, as shown below. The proof is based on the fact that only the first and second order moments of a Gaussian density are nonzero and that $\log p_\phi(\xi)$ is a polynomial of degree 2 [7]:

$$J(x) = D_{KL}(p_x \| p_\phi) = \int p_x(\xi) \log \frac{p_x(\xi)}{p_\phi(\xi)} \, d\xi, \tag{3}$$

where $\phi$ is a Gaussian random variable that has the same mean and variance as $x$, and $p_\phi$ is the pdf of $\phi$. As a result, negentropy is nonnegative, invariant to invertible transforms, and zero if $p_x \equiv p_\phi$. If we assume $x$ to be whitened, then the last term of Eq. (2) becomes zero and only the negentropy terms remain. Now, we define the contrast function of ICA using mutual information as

$$C(\hat{W}) = -I(\hat{s}) = \sum_{i=1}^{L} J(\hat{s}_i) - J(\hat{s}). \tag{4}$$

In Eq. (4), $\hat{s} = \hat{W} x$ is the estimated source vector using the current estimate of the unmixing matrix $\hat{W}$, and $\hat{s}_i$ is the $i$-th component of $\hat{s}$. We assume that the observation is whitened and thus can restrict the unmixing matrix to rotations only, making the first term of Eq. (2) constant and the third term zero. The final contrast function of ICA using negentropy can be interpreted as the total non-Gaussianity of the estimated source components.
3 ICA Using Quadratic Negentropy

3.1 Contrast Function Using Quadratic Negentropy
We replace the KL divergence with the L2 distance in Eq. (3) and obtain the quadratic negentropy, defined as

$$J_q(x) = \int (p_x(\xi) - p_\phi(\xi))^2 \, d\xi. \tag{5}$$

We can easily show that it is nonnegative, invariant under rotational transforms, and zero if $p_x \equiv p_\phi$. Assuming $x$ is whitened and using quadratic negentropy instead of the original negentropy in Eq. (4), we obtain

$$C_q(\hat{W}) = -I_q(\hat{s}) = \sum_{i=1}^{L} J_q(\hat{s}_i) - J_q(\hat{s}). \tag{6}$$
In addition, $J_q(\hat{s})$ is constant because the quadratic negentropy is invariant under a rotational transform. Ignoring the constant gives us

$$C_q(\hat{W}) = \sum_{i=1}^{L} J_q(\hat{s}_i) = \sum_{i=1}^{L} \int \left( \hat{p}_{\hat{s}_i}(\xi) - \frac{1}{\sqrt{2\pi}} e^{-\xi^2/2} \right)^2 d\xi, \tag{7}$$

where $\hat{p}_{\hat{s}_i}$ is the estimated marginal pdf of $\hat{s}_i$. Here $\hat{s}_i$ has zero mean and unit variance because $\hat{W}$ is a rotation and $x$ is whitened; thus $p_\phi$ in (5) becomes a standard Gaussian pdf. To be a contrast function, Eqs. (6) and (7) should have a global maximum when the components are independent. We hope this can be proved for general source distributions, but currently we have a proof only for Laplacian distributions, and further work is needed.
3.2 Kernel Density Estimation
Using kernel density estimation, $\hat{p}_{\hat{s}_i}$ can be estimated as

$$\hat{p}_{\hat{s}_i}(y) = \frac{1}{N} \sum_{n=1}^{N} G(y - \hat{s}_i(n), \sigma^2), \tag{8}$$

where $N$ is the number of observed samples, $\hat{s}_i(n)$ is the $n$-th observed sample of the $i$-th estimated source, and $G(y, \sigma^2)$ is a Gaussian kernel defined as

$$G(y, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-y^2/2\sigma^2}. \tag{9}$$
Interestingly, the integrals involving quadratic terms of $\hat{p}_{\hat{s}_i}$ estimated as in (8) can be simplified into sums of pairwise interactions between samples [8]. Simplifying Eq. (7) in this way yields

$$C_q(\hat{W}) = \sum_{i=1}^{L} \left( \frac{1}{2\sqrt{\pi}} + \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} G(\hat{s}_i(n) - \hat{s}_i(m), 2\sigma^2) - \frac{2}{N} \sum_{n=1}^{N} G(\hat{s}_i(n), 1 + \sigma^2) \right), \tag{10}$$

which is our final contrast function to maximize. Obtaining the partial derivative of $C_q(\hat{W})$ with respect to $w_{ij}$ yields

$$\frac{\partial C_q}{\partial w_{ij}} = \sum_{n=1}^{N} \left( \frac{2\, G(\hat{s}_i(n), 1+\sigma^2)\, \hat{s}_i(n)}{N (1+\sigma^2)} - \sum_{m=1}^{N} \frac{G(\hat{s}_i(n) - \hat{s}_i(m), 2\sigma^2)\, (\hat{s}_i(n) - \hat{s}_i(m))}{N^2 \sigma^2} \right) x_j(n), \tag{11}$$

where symmetry with respect to $m$ and $n$ is utilized to simplify the equation. Also note that $\hat{s}_i(n) = \sum_{j=1}^{L} w_{ij} x_j(n)$.
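Before turning to the fast Gauss transform, the contrast (10) and the kernel (9) can be checked by direct O(LN²) computation; a minimal sketch with illustrative names:

```python
import numpy as np

def gauss(y, var):
    """Gaussian kernel G(y, sigma^2) of Eq. (9), with var = sigma^2."""
    return np.exp(-y ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def contrast(S, sigma):
    """Direct evaluation of Eq. (10); S is (L, N) current estimates."""
    total = 0.0
    N = S.shape[1]
    for s in S:
        pair = gauss(s[:, None] - s[None, :], 2.0 * sigma ** 2).sum() / N ** 2
        total += (0.5 / np.sqrt(np.pi) + pair
                  - 2.0 * gauss(s, 1.0 + sigma ** 2).sum() / N)
    return total
```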
3.3 Efficient Computation Using Fast Gauss Transform
It takes $O(LN^2)$ time to directly compute the gradient given in Eq. (11). To reduce the computation we use the fast Gauss transform [9], which evaluates the following in $O(N + N')$ time, given 'source' points $x = \{x_1, \ldots, x_N\}$ and 'target' points $y = \{y_1, \ldots, y_{N'}\}$:

$$FGT(y_j, x, q, h) = \sum_{i=1}^{N} q_i\, e^{-(y_j - x_i)^2 / h^2}, \quad j = 1, \ldots, N', \tag{12}$$
where $q = \{q_1, \ldots, q_N\}$ are weight coefficients and $h$ is the bandwidth parameter. Using Eq. (12), Eq. (10) can be rewritten as

$$C_q(\hat{W}) = \sum_{i=1}^{L} \left( \frac{1}{2\sqrt{\pi}} + \frac{1}{2\sqrt{\pi}\,\sigma N^2} \sum_{n=1}^{N} FGT(\hat{s}_i(n), \hat{\mathbf{s}}_i, \mathbf{1}, 2\sigma) - \frac{2}{N} \sum_{n=1}^{N} G(\hat{s}_i(n), 1+\sigma^2) \right), \tag{13}$$

and the partial derivative in Eq. (11) can be rewritten as

$$\frac{\partial C_q}{\partial w_{ij}} = \sum_{n=1}^{N} \left( \frac{2\, G(\hat{s}_i(n), 1+\sigma^2)\, \hat{s}_i(n)}{N (1+\sigma^2)} - \frac{\hat{s}_i(n)\, FGT(\hat{s}_i(n), \hat{\mathbf{s}}_i, \mathbf{1}, 2\sigma) - FGT(\hat{s}_i(n), \hat{\mathbf{s}}_i, \hat{\mathbf{s}}_i, 2\sigma)}{2\sqrt{\pi}\, N^2 \sigma^3} \right) x_j(n), \tag{14}$$

where $\hat{\mathbf{s}}_i = \{\hat{s}_i(1), \ldots, \hat{s}_i(N)\}$ and $\mathbf{1}$ is an $N$-dimensional vector of ones. Now, Eqs. (13) and (14) can be computed in $O(LN)$ by performing the fast Gauss transform $2L$ times.
3.4 Steepest Descent on the Stiefel Manifold
The set of orthogonal matrices is a special case of the Stiefel manifold, and the gradient of a function can be computed based on the canonical metric of the Stiefel manifold [10]. Unconstrained optimization on the Stiefel manifold is more efficient than orthogonalizing the weight matrix at each iteration. In this paper, we used steepest descent with a bracketed backtracking line search along geodesics.
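A minimal sketch of one gradient step along a geodesic of the orthogonal group (a special case of the Stiefel manifold); the step size is an illustrative assumption, and the bracketed line search of the paper is omitted:

```python
import numpy as np
from scipy.linalg import expm

def geodesic_step(W, grad, eta):
    """Move W along a geodesic: project the Euclidean gradient onto the
    tangent space (skew-symmetric A) and apply the matrix exponential,
    so W stays orthogonal up to numerical error."""
    A = grad @ W.T - W @ grad.T
    return expm(eta * A) @ W
```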
3.5 Parameter Selection and Convergence Criterion
Our learning rule has one parameter: the bandwidth parameter $\sigma$ of the kernel density estimation. We used $\sigma = 1.06 \times N^{-1/5}$ [11]. We calculated the value of the contrast function at each iteration to check convergence. If the difference between iterations becomes less than a given ratio $\tau = 10^{-8}$ of the contrast function, then the algorithm is regarded as converged. In general, ICA contrast functions have multiple local maxima. This is also true for our contrast function, and we needed a fixed number of restarts to find
a good local optimum. We restarted our algorithm four times with a convergence criterion $\tau = 10^{-6}$ and picked the best one as an initial estimate for the final optimization.
4 Experimental Results
We conducted an extensive set of simulation experiments using a variety of source distributions, sample numbers, and component numbers. The 18 source distributions used in our experiment were adopted from the KernelICA paper [1]. They include sub-Gaussian, super-Gaussian and nearly Gaussian source distributions and

Table 1. LEFT: The normalized Amari errors (×100) for mixtures of identical source distributions (top left) and random source distributions (bottom left). L: number of mixed components, N: number of samples, Fast: FastICA, Np: NpICA, Kgv: KernelICA-KGV, Imax: extended infomax ICA, QICA: our method. For identical sources, the simulation is repeated 100 times for each of the 18 source distributions for L = {2, 4}, 50 times for L = 8, and 20 times for L = 16. For random sources, the simulation is repeated 2000 times for L = {2, 4}, 1000 times for L = 8, and 400 times for L = 16. RIGHT: Amari errors for each source distribution for L = 2 and N = 1000.

Identical source distributions:

L    N      Fast   Np     Kgv    Imax   QICA
2    100    20.6   20.3   16.3   21.3   15.7
2    250    13.0   12.9    8.6   14.4    7.7
2    1000    6.5    9.8    3.0    8.5    2.9
4    100    28.6   23.0   28.4   23.1   18.9
4    250    16.8   13.9   19.2   14.5    9.8
4    1000    6.9    6.5    7.2    8.7    3.6
8    250    30.2   20.9   31.3   18.1   15.9
8    1000   10.6    7.8   20.6    8.2    4.7
8    2000    6.4    4.7   14.4    6.2    2.8
16   1000   26.2   17.3   30.4   11.1   12.4
16   2000   11.8   12.5   26.1    6.6    6.9
16   4000    7.1    6.9   21.3    4.8    4.3

Random source distributions:

L    N      Fast   Np     Kgv    Imax   QICA
2    100    18.0   13.6   13.4   19.0   12.0
2    250    11.3    7.2    6.3   13.1    6.1
2    1000    5.6    2.8    2.4    6.7    2.5
4    100    24.5   18.1   26.3   21.2   14.9
4    250    13.7    8.5   14.1   13.1    6.9
4    1000    5.7    2.6    3.4    5.9    2.5
8    250    25.4   14.8   30.0   16.0   10.1
8    1000    6.3    2.9   13.4    6.0    2.7
8    2000    4.0    1.7    5.6    4.1    1.8
16   1000   12.5    8.5   27.9    7.9    4.1
16   2000    4.3    2.6   27.0    4.3    2.3
16   4000    2.9    1.2   20.3    2.9    2.0

Amari errors per source pdf (L = 2, N = 1000):

pdf    Fast   Np     Kgv    Imax   QICA
a       4.7    5.6    3.0    2.1    2.7
b       5.5    4.1    3.0    2.7    2.4
c       2.3    3.1    1.6    3.0    2.1
d       7.2    8.8    5.7    6.4    6.4
e       5.7    0.9    1.3    3.3    1.6
f       4.7   26.9    1.5    1.6    1.5
g       1.7   30.0    1.3    1.1    1.3
h       5.8    5.7    4.5    3.4    3.6
i       9.4   14.9    9.5    6.9    7.3
j       7.0   29.7    1.4   11.4    1.4
k       5.8    3.3    2.8    4.9    2.7
l      12.1    4.8    5.5    8.2    4.8
m       3.5   14.9    1.4    4.3    1.4
n       5.7   10.7    1.8   22.3    1.9
o       4.4    3.1    3.6    4.2    3.9
p       3.8    1.1    1.5    8.0    1.6
q      21.8    4.3    2.1   53.2    2.5
r       6.0    3.5    2.9    5.1    3.5
mean    6.5    9.8    3.0    8.5    2.9
std     4.5    9.7    2.1   12.2    1.7
Fig. 1. Robustness to outliers for L = 2, N = 1000. Up to 25 observations are corrupted by adding +5 or -5. The experiment is repeated 1000 times with random source distributions.
unimodal, multimodal, symmetric, and skewed sources. We varied the number of samples from 100 to 4000 and the number of components from 2 to 16. Comparisons were made with four existing ICA algorithms: the FastICA algorithm [12], the KernelICA-KGV algorithm [1], the extended infomax algorithm [13] using tanh nonlinearity, and the NpICA algorithm [14]. Software programs were downloaded from the corresponding authors' websites and were used with default parameters, except for the extended infomax algorithm, which is our own implementation. Note that KernelICA-KGV also uses four restarts by default to obtain initial estimates. The performance was measured using the Amari error [15], which is invariant to permutation and scaling, lies between 0 and L−1, and is zero for perfect demixing. We normalized the Amari error by dividing it by L−1, where L is the number of independent components. We summarize our results in Table 1. Consistent performance improvement over existing algorithms was observed. The improvement was significant when the number of components was large and the number of observations was small. However, the performance gain became smaller as the number of observations increased. Amari errors for each source pdf are also shown separately for two components and 1000 observations. The proposed method showed the smallest standard deviation among the five methods. All of the methods except KernelICA-KGV and the proposed method had problems with specific pdfs. Another interesting result was the high performance of the extended infomax algorithm for a large number of components: for L = 16, it showed the best performance among the five methods. But further experiments with outliers discouraged its practical use. Fig. 1 shows the result of the outlier experiment. We randomly chose up to 25 observations and added the value +5 or −5 to a single component in each chosen observation, the same setup as in the KernelICA paper. The results show that our method is extremely robust to outliers.
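For reference, a minimal sketch of the normalized Amari error computed on the global matrix P = WA (names are illustrative):

```python
import numpy as np

def amari_error(P):
    """Amari error: 0 for a scaled permutation; the raw index lies in
    [0, L-1] and is divided by L-1 to normalize into [0, 1]."""
    P = np.abs(P)
    L = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return (rows.sum() + cols.sum()) / (2.0 * L) / (L - 1.0)
```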
5 Conclusions
We have proposed a robust algorithm for independent component analysis that uses the sum of marginal quadratic negentropies as a dependence measure. The proposed algorithm can handle arbitrary source distributions and is scalable with respect to the number of components and observations. Experimental results have shown that the proposed algorithm consistently outperforms the others. In addition, it is extremely robust to outliers and more effective when the number of observed samples is small and the number of mixed sources is large. The proposed contrast function is not guaranteed to have the same maximum as the original one. Empirically, however, our method shows good performance and can be applied to cases where a limited number of observations is available.
Acknowledgment. This research was supported as a Brain Neuroinformatics Research Program by the Korean Ministry of Commerce, Industries, and Energy.
References

1. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. Journal of Machine Learning Research 3, 1–48 (2002)
2. Araki, S., Makino, S., Nishikawa, T., Saruwatari, H.: Fundamental limitation of frequency domain blind source separation for convolutive mixture of speech. In: Proc. ICASSP, vol. 5, pp. 2737–2740 (2001)
3. Learned-Miller, E.G.: ICA using spacings estimates of entropy. Journal of Machine Learning Research 4, 1271–1295 (2003)
4. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research 3, 1415–1438 (2003)
5. Hild II, K.E., Erdogmus, D., Principe, J.C.: Blind source separation using Rényi's mutual information. IEEE Signal Processing Letters 8(6), 174–176 (2001)
6. Hild II, K.E., Erdogmus, D., Principe, J.C.: An analysis of entropy estimators for blind source separation. Signal Processing 86, 182–194 (2006)
7. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994)
8. Principe, J.C., Fisher III, J.W., Xu, D.: Information theoretic learning. In: Haykin, S. (ed.) Unsupervised Adaptive Filtering. Wiley, New York (2000)
9. Greengard, L., Strain, J.: The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing 12(1), 79–94 (1991)
10. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
11. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, Sydney (1986)
12. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997)
13. Lee, T.-W., Girolami, M., Sejnowski, T.J.: Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation 11, 417–441 (1999)
14. Boscolo, R., Pan, H., Roychowdhury, V.P.: Independent component analysis based on nonparametric density estimation. IEEE Transactions on Neural Networks 15(1), 55–65 (2004)
15. Amari, S., Cichocki, A., Yang, H.: A new learning algorithm for blind source separation. In: Advances in Neural Information Processing Systems 8 (Proc. NIPS'95), pp. 757–763 (1996)
Underdetermined Source Separation Using Mixtures of Warped Laplacians

Nikolaos Mitianoudis and Tania Stathaki

Communications and Signal Processing Group, Imperial College London, Exhibition Road, London SW7 2AZ, UK
[email protected]
Abstract. In a previous work, the authors have introduced a Mixture of Laplacians model in order to cluster the observed data into the sound sources that exist in an underdetermined two-sensor setup. Since the assumed linear support of the ordinary Laplacian distribution is not valid to model angular quantities, such as the Direction of Arrival to the set of sensors, the authors investigate the performance of a Mixture of Warped Laplacians to perform efficient source separation with promising results.
1 Introduction
Assume that a set of $M$ microphones $x(n) = [x_1(n), \ldots, x_M(n)]^T$ observes a set of $N$ sound sources $s(n) = [s_1(n), \ldots, s_N(n)]^T$. The case of instantaneous mixing, i.e. each sensor capturing a scaled version of each signal with no delay in transmission, will be considered, with negligible additive noise. The instantaneous mixing model can thus be expressed in mathematical terms as follows:

$$x(n) = A s(n), \tag{1}$$

where $A$ represents the mixing matrix and $n$ the sample index. The blind source separation problem provides an estimate of the source signals $s$, based on the observed microphone signals and some general statistical profile of the sources. The underdetermined source separation problem ($M < N$) is a challenging one. In this case, the estimation of the mixing matrix $A$ is not sufficient for the estimation of the non-Gaussian source signals $s$, as the pseudo-inverse of $A$ cannot provide a valid solution. Hence, this blind estimation problem can be divided into two sub-problems: i) estimating the mixing matrix $A$, and ii) estimating the source signals $s$. In this study, we will assume a two-sensor instantaneous mixing approach. The combination of several instruments into a stereo mixture in a recording studio follows the instantaneous mixing model of (1). The proposed approach can thus be used to decompose a studio recording into the separate instruments that exist in the mixture, for many possible applications, such as music transcription, object-based audio coding and audio remixing.
The solution of the two above problems is facilitated by moving to a sparser representation of the data, such as the Modified Discrete Cosine Transform (MDCT). In the case of sparse sources, the density of the data in the mixture space shows a tendency to cluster along the directions of the mixing matrix columns. It has been demonstrated [6] that the phase difference $\theta_n$ between the two sensors can be used to identify and separate the sources in the mixture:

$$\theta_n = \mathrm{atan}\,\frac{x_2(n)}{x_1(n)}. \tag{2}$$

Using the phase difference information between the two sensors is equivalent to mapping all the observed data points onto the unit circle. The strong super-Gaussian characteristics of the individual components in the MDCT domain are preserved in the angle representation $\theta_n$. We can also define the amplitude $r_n$ of each point $x(n)$ as follows:

$$r_n = \sqrt{x_1(n)^2 + x_2(n)^2}. \tag{3}$$

In a previous work [6], we proposed a clustering approach on the observed $\theta_n$ to perform source separation. In order to model the sparse characteristics of the source distributions, we introduced the following Mixture of Laplacians (MoL), trained using an Expectation-Maximisation (EM) algorithm on the observed angles $\theta_n$ of the input data:

$$p(\theta_n) = \sum_{i=1}^{N} \alpha_i L(\theta_n, c_i, m_i) = \sum_{i=1}^{N} \alpha_i c_i\, e^{-2 c_i |\theta_n - m_i|}, \tag{4}$$

where $N$ is the number of Laplacians in the mixture, $m_i$ defines the mean and $c_i \in R^+$ controls the "width" of the distribution. Once the model is trained, each of the Laplacians of the MoL should, in the majority of cases, be centred on the Direction of Arrival (DOA) of a source, i.e. the angles denoted by the columns of the mixing matrix. One can then perform separation using optimal detection approaches for the individual trained Laplacians.

There is a shortcoming in the previously assumed model. The model in (4) assumes a linear support for $\theta_n$, which is not valid, as the actual support for $\theta_n$ wraps around $\pm 90°$. The linear support is not a problem if the sources are well contained within $[-90°, 90°]$. To overcome this problem, we proposed a strategy in [6] where, at each update, we check whether any of the centres are close to either of the boundaries ($\pm 90°$). In that case, all the data points and the estimated centres $m_i$ are rotated, so that the affected boundary ($-90°$ or $90°$) is mapped to the middle of the two centres $m_i$ that feature the greatest distance. This seemed to alleviate the problem in the majority of cases; however, it still serves as a heuristic solution. To address this problem in a more eloquent manner, one can introduce wrapped distributions to provide a more complete solution. In the literature, there exist several "circular" distributions, such as the von Mises distribution (also known as the circular normal distribution). However, this definition is
rather difficult to optimise in an EM framework. In this study, we examine the use of an approximate warped-Laplacian distribution to model the periodicity of 180° that exists in atan(·), with encouraging results.
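In code, the mapping of Eqs. (2)-(3) from two-sensor MDCT points to angles and amplitudes is direct (a sketch with illustrative names):

```python
import numpy as np

def angles_amplitudes(x):
    """x is (2, K) MDCT coefficients; returns theta in degrees within
    [-90, 90] per Eq. (2) and the amplitude r per Eq. (3)."""
    theta = np.degrees(np.arctan(x[1] / x[0]))
    r = np.sqrt(x[0] ** 2 + x[1] ** 2)
    return theta, r
```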
2 A Mixture of Warped Laplacians
The observed angles $\theta_n$ of the input data can be modelled as a Laplacian wrapped around the interval $[-90°, 90°]$ using the following additive model:

$$L_w(\theta, c, m) = \frac{1}{2T+1} \sum_{t=-T}^{T} c\, e^{-2c|\theta - m - 180t|} = \frac{1}{2T+1} \sum_{t=-T}^{T} L(\theta - 180t, c, m), \quad \forall\, \theta \in [-90°, 90°], \tag{5}$$
where $T \in Z^+$ denotes the number of ordinary Laplacians participating in the wrapped version. The above expression models the wrapped Laplacian by an ordinary Laplacian and its periodic repetitions by 180°. This is an extension, to the Laplacian case, of the wrapped Gaussian distribution proposed by Smaragdis and Boufounos [7]. The wrapping of the distribution aims at mirroring the wrapping of the observed angles at $\pm 90°$ due to the atan(·) function. In general, the model should have $T \to \infty$ components; however, in practice a small range of values for $T$ can successfully approximate the full warped probability density function. In a similar fashion to Gaussian Mixture Models (GMM), one can introduce a Mixture of Warped Laplacians (MoWL) in order to model a mixture of angular or circular sparse signals. A Mixture of Warped Laplacians can thus be defined as follows:

$$p(\theta) = \sum_{i=1}^{N} \alpha_i L_w(\theta, c_i, m_i) = \sum_{i=1}^{N} \alpha_i \frac{1}{2T+1} \sum_{t=-T}^{T} c_i\, e^{-2 c_i |\theta - m_i - 180t|}, \tag{6}$$
where $\alpha_i$, $m_i$, $c_i$ represent the weight, mean and width of each Laplacian respectively, and all weights should sum up to one, i.e. $\sum_{i=1}^{N} \alpha_i = 1$. The Expectation-Maximization (EM) algorithm has been proposed as a valid method to train a mixture model [1]; consequently, the EM can be employed to train a MoWL over a training set. We derive the EM algorithm based on Bilmes's analysis [1] for the estimation of a GMM, which estimates Maximum Likelihood mixture density parameters using the EM. Assuming $K$ training samples of $\theta_n$ and the Mixture of Warped Laplacians density (6), the log-likelihood of the training samples takes the following form:

$$I(\alpha_i, c_i, m_i) = \sum_{n=1}^{K} \log \sum_{i=1}^{N} \alpha_i L_w(\theta_n, c_i, m_i). \tag{7}$$
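A minimal numpy sketch of the densities (5) and (6); the vectorized layout and names are illustrative assumptions:

```python
import numpy as np

def wrapped_laplacian(theta, c, m, T=1):
    """Wrapped Laplacian of Eq. (5): an ordinary Laplacian plus its
    periodic repetitions by 180 degrees, averaged over 2T+1 terms."""
    t = np.arange(-T, T + 1)
    comps = c * np.exp(-2.0 * c * np.abs(theta[:, None] - m - 180.0 * t))
    return comps.sum(axis=1) / (2 * T + 1)

def mowl_pdf(theta, alpha, c, m, T=1):
    """Mixture of Warped Laplacians, Eq. (6)."""
    return sum(a * wrapped_laplacian(theta, ci, mi, T)
               for a, ci, mi in zip(alpha, c, m))
```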
Fig. 1. An example of the Wrapped Laplacian for T = [−1, 0, 1], c = 0.01 and m = 45° (probability density plotted over angles θ from −300° to 300°).
Introducing unobserved data items that identify the component that "generated" each data item, we can simplify the log-likelihood of (7) for Warped Laplacian Mixtures as follows:

$$J(\alpha_i, c_i, m_i) = \sum_{n=1}^{K} \sum_{i=1}^{N} p(i|\theta_n) \left( \log \alpha_i - \log(2T+1) + \log \sum_{t=-T}^{T} L(\theta_n - 180t, c_i, m_i) \right), \tag{8}$$

where $p(i|\theta_n)$ represents the probability of sample $\theta_n$ belonging to the $i$th Laplacian of the MoWL. In a similar manner, we can also introduce unobserved data items to identify the individual Laplacian of the $i$th Warped Laplacian on which $\theta_n$ depends:

$$H(\alpha_i, c_i, m_i) = \sum_{n=1}^{K} \sum_{i=1}^{N} \left( \log \alpha_i - \log(2T+1) + \log c_i - \sum_{t=-T}^{T} 2 c_i |\theta_n - 180t - m_i|\, p(t|i, \theta_n) \right) p(i|\theta_n), \tag{9}-(10)$$
where $p(t|i, \theta_n)$ represents the probability of sample $\theta_n$ belonging to the $i$th Warped Laplacian and the $t$th individual Laplacian. The updates for $p(t|i, \theta_n)$, $p(i|\theta_n)$ and $\alpha_i$ are given by the following equations:

$$p(t|i, \theta_n) = \frac{L(\theta_n - 180t, m_i, c_i)}{\sum_{t=-T}^{T} L(\theta_n - 180t, m_i, c_i)}, \tag{11}$$

$$p(i|\theta_n) = \frac{\alpha_i L_w(\theta_n, m_i, c_i)}{\sum_{i=1}^{N} \alpha_i L_w(\theta_n, m_i, c_i)}, \tag{12}$$

$$\alpha_i \leftarrow \frac{1}{K} \sum_{n=1}^{K} p(i|\theta_n). \tag{13}$$
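The E-step responsibilities (11)-(12) vectorize over all K samples, N mixtures and 2T+1 repetitions at once; a sketch whose array layout is an illustrative choice:

```python
import numpy as np

def e_step(theta, alpha, c, m, T=1):
    """Returns p(t|i, theta_n) with shape (K, N, 2T+1) per Eq. (11)
    and p(i|theta_n) with shape (K, N) per Eq. (12)."""
    t = np.arange(-T, T + 1)
    lap = c[None, :, None] * np.exp(
        -2.0 * c[None, :, None]
        * np.abs(theta[:, None, None] - m[None, :, None] - 180.0 * t))
    p_ti = lap / lap.sum(axis=2, keepdims=True)      # Eq. (11)
    mix = alpha[None, :] * lap.mean(axis=2)          # alpha_i * L_w
    p_i = mix / mix.sum(axis=1, keepdims=True)       # Eq. (12)
    return p_ti, p_i
```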
In a similar manner to [6], one can set $\partial H(\alpha_i, c_i, m_i)/\partial m_i = 0$ and solve for $m_i$ to obtain the recursive update for $m_i$:

$$\frac{\partial H}{\partial m_i} = \sum_{n=1}^{K} \sum_{t=-T}^{T} 2 c_i\, \mathrm{sgn}(\theta_n - 180t - m_i)\, p(t|i, \theta_n)\, p(i|\theta_n) = 0 \;\Rightarrow \tag{14}$$

$$\sum_{n=1}^{K} \sum_{t=-T}^{T} \frac{\theta_n - 180t}{|\theta_n - 180t - m_i|}\, p(t|i, \theta_n)\, p(i|\theta_n) = m_i \sum_{n=1}^{K} \sum_{t=-T}^{T} \frac{p(t|i, \theta_n)\, p(i|\theta_n)}{|\theta_n - 180t - m_i|} \;\Rightarrow \tag{15}$$

$$m_i \leftarrow \frac{\sum_{n=1}^{K} \sum_{t=-T}^{T} \frac{\theta_n - 180t}{|\theta_n - 180t - m_i|}\, p(t|i, \theta_n)\, p(i|\theta_n)}{\sum_{n=1}^{K} \sum_{t=-T}^{T} \frac{1}{|\theta_n - 180t - m_i|}\, p(t|i, \theta_n)\, p(i|\theta_n)}. \tag{16}$$
Similarly, one can set $\partial H(\alpha_i, c_i, m_i)/\partial c_i = 0$ to solve for the estimate of $c_i$:

$$\frac{\partial H}{\partial c_i} = \sum_{n=1}^{K} \left( c_i^{-1} - 2 \sum_{t=-T}^{T} |\theta_n - 180t - m_i|\, p(t|i, \theta_n) \right) p(i|\theta_n) = 0 \;\Rightarrow \tag{17}$$

$$c_i \leftarrow \frac{\sum_{n=1}^{K} p(i|\theta_n)}{2 \sum_{n=1}^{K} \sum_{t=-T}^{T} |\theta_n - 180t - m_i|\, p(t|i, \theta_n)\, p(i|\theta_n)}. \tag{18}$$
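The M-step updates (13), (16) and (18) then reduce to weighted sums over the same arrays; a sketch under the same assumptions (as a minor implementation choice, Eq. (18) here reuses the freshly updated mean):

```python
import numpy as np

def m_step(theta, p_ti, p_i, m, T=1):
    """One M-step pass: returns updated alpha (Eq. 13), means (Eq. 16)
    and widths (Eq. 18); m is the current mean estimate entering the
    reweighting in Eq. (16)."""
    t = np.arange(-T, T + 1)
    alpha = p_i.mean(axis=0)                                    # Eq. (13)
    shifted = theta[:, None, None] - 180.0 * t                  # theta_n - 180t
    w = p_ti * p_i[:, :, None] / np.abs(shifted - m[None, :, None])
    m_new = (w * shifted).sum(axis=(0, 2)) / w.sum(axis=(0, 2))   # Eq. (16)
    dev = np.abs(shifted - m_new[None, :, None])
    c_new = p_i.sum(axis=0) / (
        2.0 * (p_ti * p_i[:, :, None] * dev).sum(axis=(0, 2)))    # Eq. (18)
    return alpha, m_new, c_new
```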
Once the MoWL is trained, optimal detection theory and the estimated individual Laplacians can be employed to provide estimates of the sources. The centre $m_i$ of each warped Laplacian should represent a column of the mixing matrix $A$ in the form $[\cos(m_i)\ \sin(m_i)]^T$. Each warped Laplacian should model the statistics of one source in the transform domain and can be used to perform underdetermined source separation. A "winner takes all" strategy attributes each point $(r_n, \theta_n)$ to only one of the sources. This is performed by setting a hard threshold at the intersections between the trained warped Laplacians; consequently, the source separation problem becomes an optimal decision problem. The decision threshold $\theta_{ij}^{opt}$ between the $i$th and the $j$th neighbouring Laplacians is given by

$$\theta_{ij}^{opt} = \frac{\ln \frac{\alpha_i c_i}{\alpha_j c_j} + 2 (c_i m_i + c_j m_j)}{2 (c_i + c_j)}. \tag{19}$$

Using these thresholds, the algorithm attributes the points with $\theta_{ij}^{opt} < \theta_n < \theta_{jk}^{opt}$ to source $j$, where $i, j, k$ are neighbouring Laplacians (sources). Having attributed the points $x(n)$ to the $N$ sources using the proposed thresholding technique, the next step is to reconstruct the sources. Let $S_i \cap K$ represent the point indices that have been attributed to the $i$th source. We initialise $u_i(n) = 0$, $\forall\, n = 1, \ldots, K$ and $i = 1, \ldots, N$. The source reconstruction is performed by substituting

$$u_i(S_i) = [\cos(m_i)\ \sin(m_i)]\, x(S_i), \quad \forall\, i = 1, \ldots, N. \tag{20}$$
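Finally, a sketch of the hard decision of Eq. (19) and the reconstruction of Eq. (20); components are sorted by mean so that the thresholds apply between neighbours, and all names are illustrative:

```python
import numpy as np

def separate(x, alpha, c, m):
    """'Winner takes all' separation: x is (2, K) MDCT points; returns
    (N, K) reconstructed source coefficients."""
    order = np.argsort(m)
    a, cc, mm = alpha[order], c[order], m[order]
    theta = np.degrees(np.arctan(x[1] / x[0]))                  # Eq. (2)
    th = (np.log(a[:-1] * cc[:-1] / (a[1:] * cc[1:]))           # Eq. (19)
          + 2.0 * (cc[:-1] * mm[:-1] + cc[1:] * mm[1:])) / (2.0 * (cc[:-1] + cc[1:]))
    labels = np.digitize(theta, th)       # index of the winning Laplacian
    u = np.zeros((len(mm), x.shape[1]))
    for i, mi in enumerate(np.radians(mm)):
        idx = labels == i
        u[i, idx] = np.cos(mi) * x[0, idx] + np.sin(mi) * x[1, idx]  # Eq. (20)
    return u[np.argsort(order)]           # back to the original ordering
```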
Fig. 2. Estimation of the mean using MoL with the shifting strategy (left) and the warped MoL (right). Both panels plot the estimated angles θ (between −80° and 80°) against the number of iterations (up to 160 on the left and 200 on the right).
3 Experiments
In this section, we evaluate the algorithm proposed in the previous section. We use Hyvärinen's clustering approach [4], O'Grady and Pearlmutter's soft LOST algorithm [5], and the MoL-EM Hard approach proposed in a previous work [6] to demonstrate several trends, using artificial mixtures and publicly available datasets¹. In order to quantify the performance of the algorithms, we estimate the Signal-to-Distortion Ratio (SDR) from the BSS EVAL Toolbox [2]. The frame length for the MDCT analysis is set to 64 msec for the test signals sampled at 16 kHz and to 46.4 msec for those at 44.1 kHz. We initialise the parameters of the MoL and MoWL as follows: $\alpha_i = 1/N$, $c_i = 0.001$ and $T = [-1, 0, 1]$ (for the MoWL only). The centres $m_i$ were initialised in both cases using a K-means step. The initialisation of $m_i$ is important: if we choose two initial values for $m_i$ that are very close, it is quite probable that the individual Laplacians will not converge to different clusters. To provide a more accurate estimation of $m_i$, training is initially performed using a "reduced" dataset containing all points that satisfy $r_n > 0.2$, provided that the input signals are scaled to $[-1, 1]$. The second phase is to use the "complete" dataset to update the values of $\alpha_i$ and $c_i$.
Artificial Experiment
In this experiment, we use 5 solo audio uncorrelated recordings (a saxophone, an accordion, an acoustic guitar, a violin and a female voice) of sampling frequency 16 KHz and duration 8.19 msec. The mixing matrix is constructed as in (21), choosing the angles in Table 1. Two of the sources are placed close to the wrapping edges (−80o , 60o ) and three of them are placed rather closely at 1
All the experimental audio results are available online at: http://www.commsp.ee.ic.ac.uk/∼nikolao/lmm.htm
242
N. Mitianoudis and T. Stathaki
−40o, −20o , 10o , in order to test the algorithm’s resilience to the wrapping at ±90o. In Table 1, we can see the estimated angles of the original MoL Hard with the shifting solution and the MoWL. In both cases, the algorithms estimate approximately the same means mi , which are very close to the original ones. In Fig. 2, the convergence of the means mi in the two cases is depicted. The proposed warped solution seems to converge smoothly and faster without the perturbations caused by the shifting solution in the previous algorithm. Note that Fig. 2(a) depicts the angles after the rotating steps to demonstrate the shifting of ψi in the original MoL solution. Their performance in terms of SDR is depicted in Table 2. Hyv¨ arinen’s approach is very prone to initialisation, however, the results are acquired using the best run of the algorithm. This could be equally avoided by using a K-means initialisation step. The Soft Lost algorithm managed to separate the sources in most cases, however, there were some audible artifacts and clicks that reduced the calculated quality measure. To appreciate the results of this rather difficult problem, we can spot the improvement performed by the methods compared to the input signals. It seems that the proarinen’s approach, posed algorithm performs similarly to MoL Hard and the Hyv¨ which implies that the proposed solution to approximate the wrapping of the pdf is valid. cos(ψ1 ) cos(ψ2 ) . . . cos(ψN ) (21) A= sin(ψ1 ) sin(ψ2 ) . . . sin(ψN )
Table 1. The five angles used in the artificial experiment and their estimates using the MoL and MoWL approaches ψ1 ψ2 ψ3 ψ4 ψ5 Original −80o −40o -20o 10o 60o Estimated MoL −81.52o −45.45o −23.45o 12.59o 64.18o Estimated MoWL −81.59o −44.98o −23.59o 12.18o 64.19o
3.2
Real Recording
In this section, we tested the algorithms with the Groove dataset, available by (BASS-dB) [3], sampled at 44.1 KHz. The “Groove” dataset features four widely spaced sources: bass (far left), distortion guitar (center left), clean guitar (center right) and drums (far right). In Table 2, we can see the results for the four methods in terms of SDR. The proposed MoWL approach managed to perform similarly to the previous MoL EM, despite the small spacing of the sources and the source being placed at the edges of the solution space, which implies that the warped Laplacian model manages to model the warping of θn without any additional steps. The proposed MoL approaches managed to outperform Hyv¨ arinen and Soft LOST approach for the majority of the sources. Again, the LOST approach still introduces several audio artifacts and clicks.
Underdetermined Source Separation Using Mixtures of Warped Laplacians
243
Table 2. The proposed MoWL approach is compared in terms of SDR (dB) with arinen’s, soft LOST and the average SDR of the mixtures MoL-EM hard, Hyv¨
s1 Mixed Signals MoWL-EM hard MoL-EM hard Hyv¨ arinen soft LOST
4
Artificial experiment s2 s3 s4 s5
-6.00 6.07 6.69 6.53 4.58
-13.37 -2.11 0.32 -1.16 -4.01
-26.26 5.62 7.66 7.60 5.09
-6.67 4.09 3.65 4.14 1.67
s1
Groove Dataset s2 s3 s4
-6.81 -30.02 -10.25 -6.14 -21.24 6.15 4.32 -4.35 -1.16 3.27 6.03 2.85 -4.47 -0.86 3.28 5.79 3.79 -3.72 -1.13 1.49 3.93 4.54 -5.77 -1.74 3.62
Conclusions
The problem of underdetermined source separation is examined in this study. In a previous work, we proposed to address the two-sensor problem by clustering using a Mixture of Laplacian approach on the source Direction of Arrival (DOA) θn to the sensors. In this study, we address the problem of wrapping of θn using a Warped Mixture of Laplacians approach. The new proposed approach features similar performance and faster convergence to MoL hard and seems to be able to separate sources that are close to the boundaries (±900 ) without any extra trick and therefore serves as a valid solution to the problem.
References 1. Bilmes, J.A.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian Mixture and Hidden Mixture Models, Tech. Rep. Dep. of Electrical Eng. and Computer Science, U.C. Berkeley, California (1998) 2. F´evotte, C., Gribonval, R., Vincent, E.: BSS EVAL Toolbox User Guide, Tech. Rep. IRISA Technical Report 1706, Rennes, France (April 2005), http://www.irisa.fr/metiss/bsseval/ 3. Vincent, E., Gribonval, R., Fevotte, C., Nesbit, A., Plumbley, M.D., Davies, M.E., Daudet, L.: BASS-dB: the blind audio source separation evaluation database, Available at http://bass-db.gforge.inria.fr/BASS-dB/ 4. Hyv¨ arinen, A.: Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood. Neurocomputing 22, 49–67 (1998) 5. O’Grady, P.D., Pearlmutter, B.A.: Soft-LOST: EM on a mixture of oriented lines. In: Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation, Granada, Spain, pp. 428–435 (2004) 6. Mitianoudis, N., Stathaki, T.: Underdetermined Source Separation using Laplacian Mixture Models, IEEE Trans. on Audio, Speech and Language (to appear) 7. Smaragdis, P., Boufounos, P.: Position and Trajectory Learning for Microphone Arrays. IEEE Trans. Audio, Speech and Language Processing 15(1), 358–368 (2007)
Blind Separation of Cyclostationary Sources Using Joint Block Approximate Diagonalization D.T. Pham Laboratoire Jean Kuntzmann - CNRS/INPG/UJF, BP 53, 38041 Grenoble Cedex, France [email protected]
Abstract. This paper introduces an extension of an earlier method of the author for separating stationary sources, based on the joint approximated diagonalization of interspectral matrices, to the case of cyclostationary sources, to take advantage of their cyclostationarity. the proposed method is based on the joint block approximate diagonlization of cyclic interspectral density. An algorithm for this diagonalization is described. Some simulation experiments are provided, showing the good performance of the method.
1 Introduction Blind source separation aims at recovering sources from their unknown mixtures [1]. All separation methods are based on some “non properties” of the source signals. Early methods which do not exploit the time structure of the signals would require non Gaussianity of the sources. However, by exploiting the time structure, one can separate mixtures of Gaussian sources provided that the sources are not independent identically distributed (iid) in time., that is one (or both) of the two “i” in “iid” is not met. If only the first “i” is not met, one has stationary correlated sources and separation can be achieved by considering the lagged covariances or inter-spectra between mixtures signals. This is the basis of most second order separation methods [2, 3, 4]. If the second “i” in “iid” is not fulfilled, one has nonstationary sources and separation methods can again be developed using only second order statistics [5, 6]. However, “nonstationarity” is a too general non property to be practical, the above works actually focus only on a particular aspect of it: They assume temporal independence (or more accurately ignore possible temporal dependency) and focus only on the variation of variance of the signal in time and assume that this variation is slow enough to be adequately estimated nonparametrically. In this paper, we consider another aspect of non stationarity: the cyclostationarity. The variance of the source is also variable in time but in an (almost) periodic manner. Further, the autocovariance between the source at different time points does not depend only on the delay as in the stationary case, but also on time as well and again in a (almost) periodic manner. Thus the “nonstationary” method in [6] may not work as this source variance can vary rapidly since the period (frequency) can be short (high). Moreover, such method ignores the lagged autocovariance of the sources, which provide important useful information for the separation. The “stationary” methods [2, 3, 4] still work in general if one takes as lagged covariances the average lagged covariances M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 244–251, 2007. c Springer-Verlag Berlin Heidelberg 2007
Blind Separation of Cyclostationary Sources
245
over time. In fact the usual lagged covariance estimator when applied to cyclostationary signal actually estimates the average lagged covariance. However, such methods ignore the specificity of cyclostationary signals and thus don’t benefice from it and further would fail if the sources are noncorrelated (but has variance varying periodically with time). Our method is specially designed to exploit this specificity. There have been several works on blind separation of cyclostationary sources [7, 8, 9, 10]. Our work is different in that we work with cyclic inter-spectral densities while the above works are mainly based on cyclic cross-covariances. Our work may be viewed as an extension of our earlier work for blind separation of stationary sources [4] based on the joint approximate diagonalization of a set of inter-spectral matrices. As said earlier, this method still works for cyclostationary sources, provided that their average spectra are different up to a constant factor. The present method exploits the extra information of cyclostationarity and thus yields better performance and also can be dispensed with the above restriction.
2 Cyclostationary Signals A discrete time (possibly complex value) process {X(t)} is said to be cyclostationary (or almost periodically correlated) if its mean function t → E[X(t)] and its covariance functions t → cov{X(t + τ ), X(t)} are almost periodic [11]. The definition of almost periodicity is rather technical, but here we consider only zero mean cyclostationary process with a finite “number of cycles”, for which an equivalent definition is that there exists a finite subset A of (−1/2, 1/2] such that R(α; τ )ei2παt , ∀t, ∀τ. (1) E[X(t + τ )X ∗ (t)] = α∈A
where ∗ denotes the complex conjugate. The function τ → R(α; τ ) is called the cyclic autocovariance function of cycle α. From (1), it can be computed as T 1 E[X(t + τ )X ∗ (t)]e−i2παt T →∞ T t=1
R(α; τ ) = lim
(2)
Note that for α ∈ / A, the last right hand side yields zero by (1). Thus we may define R(α, τ ) for all α, τ by the above right hand side, and A as the set {α : R(α; ·) = 0}. We shall assume that the function R(α; ·) admits a Fourier transform f (α; ·), called the cyclic spectral density of cycle α: f (α; ν) =
∞ τ =−∞
R(α; τ )e−i2πντ
⇔
1
f (α; ν)ei2πντ dν.
R(α; τ ) = 0
Note It can be seen from (2) that R(−α; τ ) = R∗ (α; −τ )e−i2πατ . This means that if A contains α, it must contain −α. Let α1 , . . . , αq be in A, the matrix of general j, k element R(αk − αj ; τ )ei2παj τ can be seen to be the average autocovariance of lag τ of the vector process {[X(t)ei2πα1 t · · · X(t)ei2παq t ]T }, since
246
D.T. Pham T 1 E[X(t + τ )X ∗ (t)]ei2παj (t+τ ) e−i2παk t T →∞ T t=1
R(αk − αj ; τ )ei2παj τ = lim
Therefore this matrix as a function of τ is of type positive and it follows that its Fourier transform is a non negative (matrix) function. In other words: ⎡ ⎤ · · · f (αq − α1 ; ν − α1 ) f (0; ν − α1 ) ⎢ ⎥ .. .. .. (3) ⎣ ⎦≥0 . . . f (α1 − αq ; ν − αq )
···
f (0; ν − αq )
In particular, f (0; ·) ≥ 0. The functions R(0; ·) and f (0; ·) may be viewed as the average covariance function and spectral density of the process {X(t)}. Since R(0; 0) =
T limT →∞ T −1 t=1 E[|X(t)|2 ] > 0, 0 ∈ A. By taking α1 = 0, one see that the matrix in (3) can contain all the cyclic spectral densities of cycle in A and possibly some other vanishing cyclic spectral densities (since its cycle is not in A) as well. The natural estimator of R(α; τ ) based on an observed sample X(1), . . . , X(T ) is 1 ˆ R(α; τ) = T
min(T,T −τ )
X(t + τ )X ∗ (t)e−i2παt .
(4)
t=max(1,1−τ )
From this estimator, one may construct an estimator for f (α; ν) as fˆ(α; ν) =
T −1
ˆ kM (τ )R(α; τ )e−i2πντ
(5)
τ =1−T
where kM (·) is a given lag windows, often of the form k(·/M ) with k being some given even function taking the value 1 at 0, and M is a window width parameter.
3 The Mixture Model and Separation Method We consider the simplest mixture model in which the mixing is instantaneous without noise and there is a same numbers of mixtures as the sources: X(t) = AS(t) where X(t) and S(t) denote the vectors of mixtures and of sources at time t, and A is a square matrix. The sources are assumed to be independent cyclostationary processes. It is easily seen that the observed mixtures are also cyclostationary, with the set of cycle frequencies contained in the union of the sets of cycle frequencies of the sources, which we denote by A. The goal is to recover the sources from their mixtures. For simplicity, we shall assume that A is known. In practice, such set can be estimated. Further, it is not important that A be accurately known. We define the cyclic autocovariance function RX (α; ·) of cycle α of the vector process {X(t)} similar to (2) except that X(t) is replaced by X(t) and ∗ is understood as the transpose conjugate. Clearly RX (α; τ ) = ARS (α; τ )A∗ where RS (α; ·) is the cyclic autocovariance function of cycle α of the vector source process {S(t)}. The independence of the sources implies that the matrices RS (α; τ ) are diagonal for all α, τ
Blind Separation of Cyclostationary Sources
247
(of course if α ∈ / A this matrix vanishes and is of no interest). Similarly, we define the cyclic spectral density of cycle α of the vector process {X(t)} as the Fourier transform fX (α; ·) of RX (α; ·). Again, we have fX (α; ν) = AfS (α; ν)A∗ where fS (α; ·) is the cyclic spectral density of cycle α of the vector process {S(t)}, which is diagonal for all frequencies and all α. The analogue of the matrix in (3) is the block matrix ⎤ ⎡ C11 (ν) · · · C1K (ν) ⎥ ⎢ .. .. .. (6) C(ν) = ⎣ ⎦ . . . CK1 (ν) where
· · · CKK (ν)
⎤ · · · fXj Xk (αq − α1 ; ν − α1 ) fXj Xk (0; ν − α1 ) ⎥ ⎢ .. .. .. Cjk (ν) = ⎣ ⎦ . . . fXj Xk (α1 − αq ; ν − αq ) · · · fXj Xk (0; ν − αq ) ⎡
(7)
fXj Xk denoting the jk element of fX . The relation fX (α; ν) = AfS (α; ν)A∗ implies that C(ν) = (A ⊗ Iq )D(ν)(A∗ ⊗ Iq ) where D is defined similar to C but with fSj Sk (the jk element of fS ) in place of fX , Iq is the identity matrix of order q and ⊗ denotes the Kronecker product: ⎤ ⎡ A11 M A12 M · · · A ⊗ M = ⎣ A21 M A22 M · · · ⎦ , .. .. .. . . . Aij being the elements of A. The independence of the sources implies that the matrix D is block diagonal (Djk = 0 except when j = k). Thus our idea is to find a separation matrix B such that B ⊗ Iq block diagonalizes all the matrices C(ν) in the sense that the matrices (B ⊗ Iq )C(ν)(B∗ ⊗ Iq ) are block diagonal (of block size q) for all ν. ˆ In practice, the matrices C(ν) have to be replaced by their estimators C(ν). This ˆ estimator is naturally built from the estimators fX (α; ν) of fX (α; ν), defined similarly ˆ ˆ X (α; τ ), the estimator of RX (α; τ ). The last estias in (5) with R(α; τ ) replaced by R mator is defined similarly as in (4) with X(t) replaced by X(t). As the lag window kM in (5) has the effect of a smoothing, the (cyclic) spectral density estimator at a given frequency actually does not estimate the spectral density at this frequency but the average density over a frequency band centered around it. Therefore, we shall limit ourselves to ˆ the matrices C(ν) for ν on some finite grid, so that we have only a finite set of matrices to be block diagonalized. The spacing of the grid would be directly related to the ˆ resolution of the spectral estimator. Of course, since the C(ν) are not exactly equal to C(ν), one cannot block diagonalize them exactly but only approximately, according to some block diagonality measure, which will be introduced below. ˆ It is important that the estimator C(ν) be non negative as C(ν) is. One can ensure that this is the case regardless of the data, by chosing the (real) window kM in (5) such
that τ kM (τ )e−2πντ is real and non negative for all ν. Indeed, there then exists a
1/2 1/2 real window kM (not unique) such that τ kM (τ )e−2πντ = | τ kM (τ )e−2πντ |2
1/2 1/2 or kM (τ ) = u kM (u − τ )kM (u). Therefore
248
D.T. Pham
1 1/2 1/2 ∗ −i2παv ˆ ˜ ˜ X(v + τ )X (v)e kM (u − τ )kM (u) fX (α; ν) = e−i2πντ T τ u v ˜ where X(t) = X(t) for 1 ≤ t ≤ T, = 0 otherwise. The last right hand side equals, after
˜ ν )(v+u) k 1/2 (u)X∗ (v)ei2π(ν−α)v summing up with respect to τ : T −1 u v (kM X M −i2πνt ˜ ν (t) = X(t)e ˜ where X and denotes the convolution. Let t = u+v and summing up again first with respect to u, one gets 1 ˜ ν )(t) (k 1/2 X∗ )(t). fˆX (α; ν) = (kM X ν−α M T t ˆ This formula shows that C(ν) is the sample covariance of certain vector sequence, ˆ hence is non negative, and can be used for the calculation of C(ν).
4 Joint Block Approximate Diagonalization The separation method in previous section leads to the problem of joint ˆ m ), m = approximate block diagonalizing a set of positive definite block matrices C(ν 1, . . . , M , of block size q, by a matrix of the form B ⊗ Iq . Following [4] we take as the measure of block diagonality of a Hermitian non negative block matrix M: (1/2)[log det Diag(M) − log det(M)] where Diag denotes the operator which builds a bloc diagonal matrix from its argument. This measure is always positive and can be zero if and only if the matrix M is block diagonal. Indeed, each diagonal block Mii of M, being non negative, can be diagonalized by a unitary matrix Ui . Thus the matrices Ui Mii U∗i are diagonal with diagonal elements being also those of UMU∗ where U is the block diagonal matrix with diagonal block Ui . Hence by the Hadamard inequality [12], i det(Ui Mii U∗i ) ≥ det UMU∗ with equality if and only if UMU is diagonal. This yields the announced result, since the right and left hand sides of the above inequality are no other than det Diag(M) and det(M), and UMU∗ diagonal is the same as M is block diagonal. Therefore we consider the joint block diagonality criterion M 1 ˆ m )(B ⊗ Iq )] − log det[(B ⊗ Iq )C(ν ˆ m )(B∗ ⊗ Iq )]}. {log det Diag[(B ⊗ Iq )C(ν 2 m=1 (8) Note that the last term in the above curly bracket { } may be replaced by 2q log det |B| since these two terms differ by log det[C(νm )] which does not depend on B. The algorithm in [13] can be adapted to solve the above problem. For lack of space, we here only describe how it works. Starting from a current value of B, it consists in performing successive transformations, each time on a pair of rows of B, the i-th row Bi· and the j-th row Bj· say, according to
Bi· Bj·
← Tij
Bi· , Bj·
Blind Separation of Cyclostationary Sources
249
where Tij is a 2 × 2 non singular matrix, chosen such that the criterion is decreased and whose expression is given later. Once this is done, the procedure is repeated with another pair of rows. The processing of all the K(K − 1)/2 is called a sweep. The algorithm consists of repeated sweeps until convergence is achieved. Put gij =
M 1 tr[C−1 ii (m; B)Cij (m; B)], M q m=1
1 ≤ i = j ≤ K,
ωij =
M 1 tr[C−1 ii (m; B)Cjj (m; B)], M q m=1
1 ≤ i = j ≤ K.
(9)
where Cij (m; B) stands for the ij block of (B ⊗ Iq )C(νm )(B∗ ⊗ Iq ) for short. The matrix is Tij given by 2 0 hij 1 0 − 0 1 1 + hij hji − h∗ij h∗ji + (1 + hij hji − h∗ij h∗ji )2 − 4hij hji hji 0 where hij and hji are the solution of hij gij ωij 1 = . ∗ h∗ji gji 1 ωji ˆ T (−τ )ei2πατ , ˆ X (α; τ ) = R Note 1. In the case where the signal X(t) is real, R X T denoting the transpose, hence fX (α; −ν) = fX (α; α + ν). It follows that
T
fXj Xk (αm − αl ; −ν − αl ) = fXk Xj (αm − αl ; ν + αm ). We already know that if A contain α it must contain −α. Thus it is of interest to choose α1 = 0 and αj = −αq+2−j , 2 ≤ j ≤ q (which implies that q is odd, unless 1/2 ∈ A, in this case q may be even with αq/2+1 = 1/2)1 . Then the above right hand side can be written as fXk Xj (αq+2−l − αq+2−m ; ν − αq+2−m ), with αq+1 = 0 by convention. Therefore by (7): Cjk (−ν) = ΠCTkj (ν)ΠT for some permutation matrix Π, hence C(−ν) = (IK ⊗ Π)CT (ν)(IK ⊗ ΠT ). It follows that for a real matrix B (B ⊗ Iq )C(−ν)(B∗ ⊗ Iq ) = (IK ⊗ Π)[(B ⊗ Iq )C(ν)(B∗ ⊗ Iq )]T (IK ⊗ ΠT ), and thus the measure of block diagonality of the matrix in the above left hand side is the same as that of (B ⊗ Iq )C(ν)(B∗ ⊗ Iq ). It is then of interest to consider a grid of frequencies ν1 , . . . , νM with M even and νm = −νM+1−m mod 1, so as to reduce the number of matrices to be block diagonalized by half, since the term corresponding to νm in (8) can be grouped with the one corresponding to νM+1−m . One may take νm = (m − 1/2)/M which yield a regular grid of spacing 1/M . Note 2. In the case where the signals are real, the matrix B must be constrained to be real, that is the minimization of (8) must be done over the set of real matrices. It can be shown that the algorithm is the same as before but the gij are now defined as the real part of the right hand side of (9). 1
{α1 , . . . , αq } need not be equal to A but can be a subset of A.
250
D.T. Pham
5 Some Simulation Examples We consider two cyclostationary sources constructed as Gaussian stationary autoregressive (AR) processes of second order, modulated with sine waves cos(α2 πt) and cos(α3 πt) respectively. Thus they have cycle frequencies 0, ±α2 and 0, ±α3 respectively. We take α2 = 0.3/π = 0.0955 and α3 = 0.7/π = 0.2228 (the same as in [9]). The AR coefficients are 1.9 cos(0.16π), −0.952 and cos(0.24π), −0.52 for the first and second sources, respectively. This corresponds to the AR polynomials with roots 0.95e±i0.16π and 0.5e±i0.24π respectively. Four hundred series of length 256 are generated for each source. The 2 sources are 1 1 mixed according to the mixing matrix A = and our method is applied to −1 1 obtain the separation matrix B. The number of positive frequency bins is set to 4. To quantify the quality of the separation, we introduce two contamination coefficients c12 and c21 defined as follows. First the global matrix G = BA is formed, then its rows is eventually permuted such that |G11 G22 | ≥ |G12 G21 |, Gij denoting the elements of G. Finally c12 = G12 /G11 and c21 = G21 /G22 . Table 1 shows the mean square of the contamination coefficients and of their products, all multiplied by 256 which is the length of the observed series (since the variance of the estimator should be asymptotically inversely proportional to this length). The mean number of iterations is also listed. For comparison, the values for the stationary method in [4] is also given. This method amounts to running our algorithm with no cycle frequency: q = 1 and α1 = 0, which means that one just ignore the cyclostationarity of the sources and considers them as stationary (with spectrum being the average spectrum over time). It can be seen that cyclostationary method yields better results than the stationary method. However, the algorithm converges a little more slowly and each iteration is also more costly computationally. Table 1. Mean square of the contamination coefficients and of their products and mean number of iterations, obtained from the cyclostationary and stationary methods. The sources are modulated AR processes.
cyclostationary method stationary method
256(mean c212 ) 256(mean c221 ) 256(mean c12 c21 ) mean # iterations 0.3707 0.0310 0.0010 5.86 0.5513 0.1250 −0.0628 3.97
In a second test, we consider two cyclostationary sources constructed as (temporally) independent Gaussian processes of unit variance, modulated in the same way as before. Thus the sources are uncorrelated but have variance varying periodically. Therefore, the stationary methods, which amount to considers the sources as stationary with spectrum being the average spectrum over time, would fail since the average sources spectra are constant. Table 2 compares the results of the cyclostationary and stationary methods. It can be seen the stationary method fails completely, as expected. The cyclostationary still works reasonably well, although less well than in the case where the sources are modulated AR processes. The “nonstationarity” method in [6] is also not suitable since
Blind Separation of Cyclostationary Sources
251
the variance function vary too fast. Indeed, the variance function of the sources have frequencies α1 and α2 respectively, which corresponds to the periods 1/α1 = π/0.3 = 10.472 and 1/α1 = π/0.7 = 4.4880. Thus in order to “see” the variation of the source variances one has to estimate them in a moving window of size less than 4 which is to short. Table 2. Mean square of the contamination coefficients and of their products and mean number of iterations, obtained from the cyclostationary and stationary methods. The sources are modulated independent Gaussian processes of unit variance.
cyclostationary method stationary method
256(mean c212 ) 256(mean c221 ) 256(mean c12 c21 ) mean # iterations 0.6638 0.6131 −0.2586 8.27 74.4639 72.8648 −72.5062 4.43
References 1. Cardoso, J.F.: Blind signal separation: statistical principles. Proceedings of the IEEE 9, 2009–2025 (1998) 2. Tong, L., Soon, V.C., Huang, Y.F., Liu, R.: Amuse: A new blind identification algorithm. In: Proc. IEEE ISCAS, New Orleans, LA, USA, pp. 1784–1787. IEEE Computer Society Press, Los Alamitos (1990) 3. Belouchrani, A., Meraim, K.A., Cardoso, J.F., Moulines, E.: A blind source separation technique based on second order statistics. IEEE Trans. on Signal Processing 45, 434–444 (1997) 4. Pham, D.T.: Blind separation of instantaneous mixture of sources via the Gaussian mutual information criterion. Signal Processing 81, 850–870 (2001) 5. Matsuoka, K., Ohya, M., Kawamoto, M.: A neural net for blind separation of nonstationary signals. Neural networks 8, 411–419 (1995) 6. Pham, D.T., Cardoso, J.F.: Blind separation of instantaneous mixtures of non stationary sources. IEEE Trans. Signal Processing 49, 1837–1848 (2001) 7. Liang, Y.C., Leyman, A.R., Soong, B.H.: Blind source separation using second-order cyclicstatistics. In: First IEEE Signal Processing Workshop on Advances in Wireless Communications, Paris, France, pp. 57–60. IEEE Computer Society Press, Los Alamitos (1997) 8. Ferreol, A., Chevalier, P.: On the behavior of current second and higher order blind source separation methods for cyclostationary sources. IEEE Trans. on Signal Processing 48, 1712– 1725 (2000) 9. Abed-Meraim, K., Xiang, Y., Manton, J.H., Hua, Y.: Blind source-separation using secondorder cyclostationary statistics. IEEE Trans. on Signal Processing 49, 694–701 (2001) 10. Ferreol, A., Chevalier, P., Albera, L.: Second-order blind separation of first- and second-order cyclostationary sources-application to AM, FSK, CPFSK, and deterministic sources. IEEE Trans. on Signal Processing 52, 845–861 (2004) 11. Dehay, D., Hurd, H.L.: Representation and estimation for periodically and almost periodically correlated random processes. In: Gardner, W.A. (ed.) Cyclostationarity in Communications and Signal Processing, IEEE Press, Los Alamitos (1993) 12. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New-York (1991) 13. Pham, D.T.: Joint approximate diagonalization of positive definite matrices. SIAM J. on Matrix Anal. and Appl. 22, 1136–1152 (2001)
Independent Process Analysis Without a Priori Dimensional Information Barnabás Póczos, Zoltán Szabó, Melinda Kiszlinger, and András Lőrincz Department of Information Systems Eötvös Loránd University, Budapest, Hungary {pbarn,szzoli}@cs.elte.hu, {kmelinda,andras.lorincz}@elte.hu
Abstract. Recently, several algorithms have been proposed for independent subspace analysis where hidden variables are i.i.d. processes. We show that these methods can be extended to certain AR, MA, ARMA and ARIMA tasks. Central to our paper is that we introduce a cascade of algorithms, which aims to solve these tasks without previous knowledge about the number and the dimensions of the hidden processes. Our claim is supported by numerical simulations. As an illustrative application where the dimensions of the hidden variables are unknown, we search for subspaces of facial components.
1
Introduction
Independent Subspace Analysis (ISA) [1] is a generalization of Independent Component Analysis (ICA). ISA assumes that certain sources depend on each other, but the dependent groups of sources are still independent of each other, i.e., the independent groups are multidimensional. The ISA task has been subject of extensive research [1,2,3,4,5,6,7,8]. In this case, one assumes that the hidden sources are independent and identically distributed (i.i.d.) in time. Temporal independence is, however, a gross oversimplification of real sources including acoustic or biomedical data. One may try to overcome this problem, by assuming that hidden processes are, e.g., autoregressive (AR) processes. Then we arrive to the AR Independent Process Analysis (AR-IPA) task [9,10]. Another method to weaken the i.i.d. assumption is to assume moving averaging (MA). This direction is called Blind Source Deconvolution (BSD) [11], in this case the observation is a temporal mixture of the i.i.d. components. The AR and MA models can be generalized and one may assume ARMA sources instead of i.i.d. ones. As an additional step, these models can be extended to non-stationary integrated ARMA (ARIMA) processes, which are important, e.g., for modelling economic processes [12]. In this paper, we formulate the AR-, MA-, ARMA-, ARIMA-IPA generalizations of the ISA task, when (i) one allows for multidimensional hidden components and (ii) the dimensions of the hidden processes are not known. We show that in the undercomplete case, when the number of ‘sensors’ is larger than the number of ‘sources’, these tasks can be reduced to the ISA task.
Corresponding author.
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 252–259, 2007. c Springer-Verlag Berlin Heidelberg 2007
Independent Process Analysis Without a Priori Dimensional Information
2
253
Independent Subspace Analysis
The ISA task can be formalized as follows: x(t) = Ae(t), where e(t) = e1 (t); . . . ; eM (t) ∈ RDe
(1)
dm e
and e(t) is a vector concatenated of components em (t) ∈ R . The total dimenM m sion of the components is De = m=1 dm e . We assume that for a given m, e (t) m 1 M is i.i.d. in time t, and sources e are jointly independent, i.e., I(e , . . . , e ) = 0, where I(.) denotes the mutual information (MI) of the arguments. The dimension of the observation x is Dx . Assume that Dx > De , and A ∈ RDx ×De has rank De . Then, one may assume without any loss of generality that both the observed (x) and the hidden (e) signals are white. For example, one may apply Principal Component Analysis (PCA) as a preprocessing stage. Then the ambiguities of the ISA task are as follows [13]: Sources can be determined up to permutation and up to orthogonal transformations within the subspaces. 2.1
The ISA Separation Theorem
We are to uncover the independent subspaces. Our task is to find orthogonal matrix W ∈ RDe ×Dx such that y(t) = Wx(t), y(t) = y1 (t); . . . ; yM (t) , m ym = [y1m ; . . . ; ydmm ] ∈ Rde , (m = 1, . . . , M ) with the condition that components e ym are independent. Here, yim denotes the ith coordinate of the mth estimated subspace. This task can be solved by means of a cost function that aims to minimize the mutual information between components: . (2) J1 (W) = I(y1 , . . . , yM ). One can rewrite J1 (W) as follows: M . J2 (W) = I(y11 , . . . , ydMM ) − I(y1m , . . . , ydmm ). e e
(3)
m=1
The first term of the r.h.s. is an ICA cost function; it aims to minimize mutual information for all coordinates. The other term is a kind of anti-ICA term; it aims to maximize mutual information within the subspaces. One may try to apply a heuristics and to optimize (3) in order: (1) Start by any ’infomax’ ICA algorithm and minimize the first term of the r.h.s. in (3). (2) Apply only permutations to the coordinates such that they optimize the second term. Surprisingly, this heuristics leads to the global minimum of (2) in many cases. In other words, in many cases, ICA that minimizes the first term of the r.h.s. of (3) solves the ISA task apart from the grouping of the coordinates into subspaces. This feature was observed by Cardoso, first [1]. The extent of this feature is still an open issue. Nonetheless, we call it ‘Separation Theorem’, because for elliptically symmetric sources and for some other distribution types one can prove that it is rigorously true [14]. (See also, the result concerning local minimum points [15]). Although there is no proof for general sources as of yet, a number of algorithms apply this heuristics with success [1,3,15,16,17,18].
254
2.2
B. Póczos et al.
ISA with Unknown Components
Another issue concerns the computation of the second term of (3). If the dm e dimensions of subspaces em are known then one might rely on multi-dimensional entropy estimations [8], but these are computationally expensive. Other methods deal with implicit or explicit pair-wise dependency estimations [16,15]. Interestingly, if the observations are indeed from an ICA generative model, then the minimization of the pair-wise dependencies is sufficient to get the solution of the ICA task according to the Darmois-Skitovich theorem [19]. This is not the case for the ISA task, however. There are ISA tasks, where the estimation of pair-wise dependencies is insufficient for recovering the hidden subspaces [8]. Nonetheless, such algorithms seem to work nicely in many practical cases. m are not A further complication arises if the dm e dimensions of subspaces e known. Then the dimension of the entropy estimation becomes uncertain. Methods that try to apply pair-wise dependencies were proposed to this task. One can find a block-diagonalization method in [15], whereas [16] makes use of kernel estimations of the mutual information. Here we shall assume that the separation theorem is satisfied. We shall apply ICA preprocessing. This step will be followed by the estimation of the pair-wise mutual information of the ICA coordinates. These quantities will be considered as the weights of a weighted graph, the vertices of the graph being the ICA coordinates. We shall search for clusters of this graph. In our numerical studies, we make use of Kernel Canonical Correlation Analysis [4] for the MI estimation. A variant of the Ncut algorithm [20] is applied for clustering. As a result, the mutual information within (between) cluster(s) becomes large (small). The problem is that this ISA method requires i.i.d. hidden sources. Below, we show how to generalize the ISA task to more realistic sources. Finally, we solve this more general problem when the dimensions of the subspaces are not known.
3
ISA Generalizations
We need the following notations: Let z stand for the time-shift operation, that is (zv)(t) := v(t − 1). The N order polynomials of D1 ×D2 matrices are denoted as N 1 ×D2 R[z]D := {F[z] = n=0 Fn z n , Fn ∈ RD1 ×D2 }. Let ∇r [z] := (I−Iz)r denote N the rth order difference operator, where I is the identity matrix, r ≥ 0, r ∈ Z. Now, we are to estimate unknown components em from observed signals x. We always assume that e takes the form like in (1) and that A ∈ RDx ×Ds is of full column rank. 1. AR-IPA: The AR generalization of the ISA task is defined by the following equations: x = As, where s is a multivariate AR(p) process i.e, P[z]s = Qe, p s ×Ds . We assume that Q ∈ RDs ×De , and P[z] := IDs − i=1 Pi z i ∈ R[z]D p P[z] is stable, that is det(P[z] = 0), for all z ∈ C, |z| ≤ 1. For dm e = 1 this task was investigated in [9]. Case dm > 1 is treated in [10]. The special case e of p = 0 is the ISA task.
Independent Process Analysis Without a Priori Dimensional Information
255
2. MA-IPA or Blind Subspace Deconvolution (BSSD) task: The ISA task is generalized to blind deconvolution task, MA(q)) as qtask (moving average x ×De follows: x = Q[z]e, where Q[z] = j=0 Qj z j ∈ R[z]D . q 3. ARMA-IPA task: The two tasks above can be merged into a model, where the hidden s is multivariate ARMA(p,q): x = As, P[z]s = Q[z]e. Here s ×Ds s ×De , Q[z] ∈ R[z]D . We assume that P[z] is stable. Thus P[z] ∈ R[z]D p q the ARMA process is stationary. 4. ARIMA-IPA task: In practice, hidden processes s may be non-stationary. ARMA processes can be generalized to the non-stationary case. This generalization is called integrated ARMA, or ARIMA(p,r,q). The assumption here is that the rth difference of the process is an ARMA process. The corresponding IPA task is then x = As, where P[z]∇r [z]s = Q[z]e.
4
(4)
Reduction of ARIMA-IPA to ISA
We show how to solve the above tasks by means of ISA algorithms. We treat the ARIMA task. Others are special cases of this one. In what follows, we assume that: (i) P[z] is stable, (ii) the mixing matrix A is of full column rank, and (iii) Q[z] has left inverse. In other words, there exists a polynomial matrix W[z] ∈ R[z]De ×Ds such that W[z]Q[z] = IDe .1 The route of the solution is elaborated here. Let us note that differentiating the observation x of the ARIMA-IPA task in Eq. (4) in rth order, and making use of the relation zx = A(zs), the following holds: ∇r [z]x = A (∇r [z]s) , and P[z] (∇r [z]s) = Q[z]e.
(5)
That is taking ∇r [z]x as observations, one ends up with an ARMA-IPA task. Assume that Dx > De (undercomplete case). We call this task uARMA-IPA. Now we show how to transform the uARMA-IPA task to ISA. The method is similar to that of [22] where it was applied for BSD. Theorem. If the above assumptions are fulfilled then in the uARMA-IPA task, ˜ (t) := x(t) − observation process x(t) is autoregressive and its innovation x E[x(t)|x(t − 1), x(t − 2), . . .] = AQ0 e(t), where E[·|·] denotes the conditional expectation value. Consequently, there is a polynomial matrix WAR [z] ∈ R[z]Dx ×Dx such that WAR [z]x = AQ0 e. Due to lack of space the proof is omitted here. Thus, AR fit of x(t) can be used for the estimation of AQ0 e(t). This innovation corresponds to the observation of an undercomplete ISA model (Dx > De )2 , which can be reduced to a complete 1
2
One can show for Ds > De that under mild conditions Q[z] has an inverse with probability 1 [21]; e.g., when the matrix [Q0 , . . . , Qq ] is drawn from a continuous distribution. Assumptions made for Q[z] and A in the uARMA-IPA task implies that AQ0 is of full column rank and thus the resulting ISA task is well defined.
256
B. Póczos et al.
ISA (Dx = De ) using PCA. Finally, the solution can be finished by any ISA procedure. The reduction procedure implies that hidden components em can be recovered only up to the ambiguities of the ISA task: components of (identical dimensions) can be recovered only up to permutations. Within each subspaces, unambiguity is warranted only up to orthogonal transformations. The steps of our algorithm are summarized in Table 1. Table 1. Pseudocode of the undercomplete ARIMA-IPA algorithm Input of the algorithm Observation: {x(t)}t=1,...,T Optimization Differentiating: for observation x calculate x∗ = ∇r [z]x ˆ AR [z] AR fit: for x∗ estimate W ˆ AR [z]x∗ ˜=W Estimate innovation: x ˆ PCA x ˜ =W ˜ Reduce uISA to ISA and whiten: x ∗ ˆ ˜ : e = WICA x ˜ Apply ICA for x Estimate pairwise dependency e.g., as in [16] on e∗ Cluster e∗ by Ncut: the permutation matrix is P Estimation ˆ ICA W ˆ ARIMA-IPA [z] = P W ˆ PCA W ˆ AR [z]∇r [z] W ˆ ˆ = WARIMA-IPA [z]x e
5
Results
In this section we demonstrate the theoretical results by numerical simulations. 5.1
ARIMA Processes
We created a database for the demonstration: Hidden sources em are 4 pieces of 2D, 3 pieces of 3D, 2 pieces of 4D and 1 piece of 5D stochastic variables, i.e., M = 10. These stochastic variables are independent, but the coordinates of each stochastic variable em depend on each other. They form a 30 dimensional space together (De = 30). For the sake of illustration, the 3D (2D) sources emit random samples of uniform distributions defined on different 3D geometrical forms (letters of the alphabet). The distributions are depicted in Fig. 1a (Fig. 2b). 30,000 samples were drawn from the sources and they were used to drive an ARIMA(2,1,6) process defined by (4). Matrix A ∈ R60×60 was randomly generated and orthogonal. We also generated the polynomial Q[z] ∈ R[z]60×30 and 5 randomly. The visualization of the 60 the stable polynomial P[z] ∈ R[z]60×60 1 dimensional process is hard: a typical 3D projection is shown in Fig. 1c. The task is to estimate original sources em using these non-stationary observations. rth -order differencing of the observed ARIMA process gives rise to an ARMA process. Typical 3D projection of this ARMA process is shown Fig. 1d. Now, one can execute the other steps of Table 1 and these steps provide the estimations of
Independent Process Analysis Without a Priori Dimensional Information
257
ˆm . Here, we estimated the AR process and its order by the hidden components e the methods detailed in [23]. Estimations of the 3D (2D) components are provided in Fig. 1e (Fig. 1f). In the ideal case, the product of matrix AQ0 and the ˆ PCA )AQ ∈ RDe ×De ˆ ICA W matrices provided by PCA and ISA, i.e., G := (P W 0 m m is a block permutation matrix made of de × de blocks. This is shown in Fig. 1g.
(a)
(b)
(e)
(f )
(c)
(d)
(g)
Fig. 1. (a-b) components of the database. (a): 3 pieces of 3D geometrical forms, (b): 4 pieces of 2D letters. Hidden sources are uniformly distributed variables on these objects. (c): typical 3D projection of the observation. (d): typical 3D projection of the r th -order difference of the observation, (e): estimated 3D components, (f): estimated 2D components, (g): Hinton diagram of G, which – in case of perfect estimation – becomes a block permutation matrix.
5.2
Facial Components
We were interested in the components that our algorithm finds when independence is a crude approximation at best. We have generated another database using the FaceGen3 animation software. In our database we had 800 different front view faces with the 6 basic facial expressions. We had thus 4,800 images in total. All images were sized to 40 × 40 pixel. Figure 2a shows samples of the database. A large X ∈ R4800×1600 matrix was compiled; rows of this matrix were 1600 dimensional vectors formed by the pixel values of the individual images. The columns of this matrix were considered as mixed signals. This treatment replicates the experiments in [24]: Bartlett et al., have shown that in such cases, undercomplete ICA finds components resembling to what humans consider facial components. We were interested in seeing the components grouped by undercomplete ISA algorithm. The observed 4800 dimensional signals were compressed by PCA to 60 dimensions and we searched for 4 pieces of ISA subspaces using the algorithm detailed in Table 1. The 4 subspaces that our algorithm found are shown in Fig. 2b. As it can be seen, the 4 subspaces embrace facial components which correspond mostly to mouth, eye brushes, facial profiles, and eyes, respectively. Thus, ICA finds interesting components and MI based ISA groups them sensibly. The generalization up to ARIMA-IPA processes is straightforward. 3
http://www.facegen.com/modeller.htm
258
B. Póczos et al.
(a)
(b) Fig. 2. (a) Samples from the database. (b) Four subspaces of the components. Distinct groups correspond mostly to mouth, eye brushes, facial profiles, and eyes, respectively.
6
Conclusions
We have extended the ISA task in two ways. (1) We solved problems where the hidden components are AR, MA, ARMA, or ARIMA processes. (2) We suggested partitioning of the graph defined by pairwise mutual information to identify the hidden ISA subspaces under certain conditions. The algorithm does not require previous knowledge about the dimensions of the subspaces. An artificially generated ARIMA process was used for demonstration. The algorithm provided sensible grouping of the estimated components for facial expressions.
Acknowledgment This material is based upon work supported by the EC FET and NEST grants, ‘New Ties’ (No. 003752) and ‘PERCEPT’ (No. 043261). Any opinions, findings and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the EC, or other members of the EC projects.
References 1. Cardoso, J.F.: Multidimensional independent component analysis. In: Proc. of ICASSP, vol. 4, pp. 1941–1944 (1998) 2. Vollgraf, R., Obermayer, K.: Multi-dimensional ICA to separate correlated sources. In: Proc. of NIPS, vol. 14, pp. 993–1000. MIT Press, Cambridge (2001) 3. Stögbauer, H., Kraskov, A., Astakhov, S.A., Grassberger, P.: Least dependent component analysis based on mutual information. Phys. Rev. E, 70 (2004) 4. Bach, F.R., Jordan, M.I.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003)
Independent Process Analysis Without a Priori Dimensional Information
259
5. Theis, F.J.: Blind signal separation into groups of dependent signals using joint block diagonalization. In: Proc. of ISCAS, pp. 5878–5881 (2005) 6. Hyvärinen, A., Köster, U.: FastISA: A fast fixed-point algorithm for independent subspace analysis. In: Proc. of ESANN, Evere, Belgium (2006) 7. Nolte, G., Meinecke, F.C., Ziehe, A., Müller, K.R.: Identifying interactions in mixed and noisy complex systems. Physical Review E, 73 (2006) 8. Póczos, B., Lőrincz, A.: Independent subspace analysis using geodesic spanning trees. In: Proc. of ICML, pp. 673–680. ACM Press, New York (2005) 9. Hyvärinen, A.: Independent component analysis for time-dependent stochastic processes. In: Proc. of ICANN, pp. 541–546. Springer, Berlin (1998) 10. Póczos, B., Takács, B., Lőrincz, A.: Independent subspace analysis on innovations. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 698–706. Springer, Heidelberg (2005) 11. Choi, S., Cichocki, A., Park, H.-M., Lee, S.-Y.: Blind source separation and independent component analysis. Neural Inf. Proc. - Lett. Rev. 6, 1–57 (2005) 12. Mills, T.C.: Time Series Techniques for Economists. Cambridge University Press, Cambridge (1990) 13. Theis, F.J.: Uniqueness of complex and multidimensional independent component analysis. Signal Processing 84(5), 951–956 (2004) 14. Szabó, Z., Póczos, B., Lőrincz, A.: Separation theorem for K-independent subspace analysis with sufficient conditions. Technical report (2006), http://arxiv.org/abs/math.ST/0608100 15. Theis, F.J.: Towards a general independent subspace analysis. In: Proc. of NIPS, vol. 19 (2007) 16. Bach, F.R., Jordan, M.I.: Finding clusters in Independent Component Analysis. In: Proc. of ICA2003, pp. 891–896 (2003) 17. Szabó, Z., Póczos, B., Lőrincz, A.: Cross-entropy optimization for independent process analysis. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 909–916. Springer, Heidelberg (2006) 18. Abed-Meraim, K., Belouchrani, A.: Algorithms for joint block diagonalization. In: Proc. of EUSIPCO, pp. 209–212 (2004) 19. Comon, P.: Independent Component Analysis, a new concept. Signal Processing, Elsevier 36(3), 287–314 (1994) (Special issue on Higher-Order Statistics) 20. Yu, S., Shi, J.: Multiclass spectral clustering. In: Proc. of ICCV (2003) 21. Rajagopal, R., Potter, L.C.: Multivariate MIMO FIR inverses. IEEE Transactions on Image Processing 12, 458–465 (2003) 22. Gorokhov, A., Loubaton, P.: Blind identification of MIMO-FIR systems: A generalized linear prediction approach. Signal Processing 73, 105–124 (1999) 23. Neumaier, A., Schneider, T.: Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans. on Math. Soft. 27(1), 27–57 (2001) 24. Bartlett, M., Movellan, J., Sejnowski, T.: Face recognition by independent component analysis. IEEE Trans. on Neural Networks 13(6), 1450–1464 (2002)
An Evolutionary Approach for Blind Inversion of Wiener Systems Fernando Rojas1, Jordi Solé-Casals2, and Carlos G. Puntonet1 1
Computer Architecture and Technology Department, University of Granada, 18071, Spain {frojas, carlos}@atc.ugr.es http://atc.ugr.es/ 2 Signal Processing Group, University of Vic (Spain) [email protected] http://www.uvic.es/
Abstract. The problem of blind inversion of Wiener systems can be considered as a special case of blind separation of post-nonlinear instantaneous mixtures. In this paper, we present an approach for nonlinear deconvolution of one signal using a genetic algorithm. The recovering of the original signal is achieved by trying to maximize an estimation of mutual information based on higher order statistics. Analyzing the experimental results, the use of genetic algorithms is appropriate when the number of samples of the convolved signal is low, where other gradient-like methods may fail because of poor estimation of statistics. Keywords: Independent component analysis, signal deconvolution, blind source separation, Wiener systems, genetic algorithms, mutual information.
1 Introduction This paper deal with a particular class of nonlinear systems, composed by a linear subsystem followed by a memoryless nonlinear distortion. This class of nonlinear systems, also known as Wiener systems (Figure 1), can model a considerable range of actual systems in nature, such as the activity of individual primary neurons in response to prolonged stimuli [1], the dynamic relation between muscle length and tension [2], and other situations in biology, industry and psychology. Source (s)
Linear filter (H)
Nonlinear distortion (f)
Observation (x)
Fig. 1. A Wiener system (linear filter + nonlinear distortion)
The inverse configuration for a Wiener system is known as a Hammerstein system and consists of a nonlinear distortion followed by a linear filter. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 260–267, 2007. © Springer-Verlag Berlin Heidelberg 2007
An Evolutionary Approach for Blind Inversion of Wiener Systems Observation (x)
Nonlinear distortion (finv)
Linear filter (W)
261
Estimation (y)
Fig. 2. A Hammerstein system (nonlinear distortion + linear filter)
We propose in this contribution the use of genetic algorithms (GA) for the nonlinear blind inversion of the (nonlinear) convolved mixed signal. The aim of the genetic algorithm is to give more aptitude to those individuals who represent a solution giving a minimal result of a mutual information estimation measure between a known number of pieces of the observed signal (x). In this way, the best solution given by the genetic algorithm must represent a valid estimation signal (y) of the original source (s). Genetic algorithms have been previously applied in linear deconvolution [3] and linear and nonlinear blind source separation [4,5,6,7]. The theoretic framework for using source separation techniques in the case of blind deconvolution is presented in [8]. There, a quasi-nonparametric gradient approach is used, minimizing the mutual information of the output as a cost function to deal with the problem. This work was extended in [9]. The paper has been organized as follows: Section 2 describes the preliminary issues and the inversion main guidelines. Section 3 explains the genetic algorithm which will accomplish blind inversion, specially concerning chromosome representation and fitness function expression. Section 4 presents the experimental results showing the performance of the method. As a final point, section 5 presents the conclusions of this contribution.
2 Blind Inversion of Nonlinear Channels 2.1 Nonlinear Convolution We suppose that the original source (s) is an unknown, non-Gaussian, time independent and identically distributed (i.i.d.) process. The filter H is linear, unknown and invertible. Finally, the nonlinear distortion f is invertible and differentiable. Following this notation, the observation x can be modeled as: x = f(H · s)
(1)
where f is the nonlinear distortion and:
L L L ⎛L ⎜ L h ( t + 1 ) h ( t ) h ( t − 1) ⎜ H=⎜ L h(t + 2) h(t + 1) h(t ) ⎜ ⎜L L L L ⎝
L⎞ ⎟ L⎟ L⎟ ⎟ L⎟⎠
(2)
is an infinite dimension Toeplitz matrix which represents the action of the filter h to the signal s(t). The matrix H is non-singular provided that the filter h is invertible, i.e. satisfies h −1 (t ) * h(t ) = h(t ) * h −1 (t ) = δ (t ) , where δ(t) is the Dirac impulse. The infinite
262
F. Rojas, J. Solé-Casals, and C.G. Puntonet
dimension of vectors and matrix is due to the lack of assumption on the filter order. If the filter h is a finite impulse response (FIR) filter of order Nh, the matrix dimension can be reduced to the size Nh. In practice, because infinite-dimension equations are not tractable, we have to choose a pertinent (finite) value for Nh.
2.1 Nonlinear Deconvolution The process of nonlinear blind inversion of the observed signal (x) assumes solving a Hammerstein system (Figure 1). Therefore, the algorithm should calculate the unknown inverse nonlinear distortion (finv), followed by the unknown inverse filter (W):
y = W(g · x)
(3)
The goal of the proposed algorithm will be to minimize mutual information (Eqn. 4), as it is assumed that mutual information vanishes when the samples in the measured signal are mutually statistically independent: 1 ⎧ T ⎫ ⎨ ∑ H ( y (t ))− H ( y −T ,..., y T )⎬ = H ( y (τ )) − H (Y ) T → ∞ 2T + 1 ⎩t = −T ⎭
I (Y )= lim
(4)
where τ is arbitrary due to the stationary assumption.
3 Genetic Algorithm for Nonlinear Blind Deconvolution Nonlinear blind deconvolution can be handled by a genetic algorithm which evolves individuals corresponding to different inverse filters and nonlinear distortions. Each individual represent a potential solution and it is evaluated according to a measure of statistical independence. This is a problem of global optimization: minimizing or maximizing a real valued function f (x ) in the parameter space x ∈ P . This particular type of problems is suitable to be solved by a genetic algorithm. GAs are designed to move the population away from local minima that a traditional hill climbing algorithm might get stuck in. They are also easily parallelizable and their evaluation function can be any that assigns to each individual a real value into a partially ordered set.
3.1 GA Characterization The proposed canonical genetic algorithm can be generally characterized by the following features: − Encoding Scheme. The genes will represent the coefficients of the unknown deconvolution filter W (real coding) and the unknown nonlinear distortion f. This nonlinear function may be approximated by n-th order odd polynomials. An initial decision must therefore be taken about the length of the inverse filter (s) and the order of the polynomial (n). − Initialization Procedure. Genes within the chromosome are randomly initialized.
An Evolutionary Approach for Blind Inversion of Wiener Systems
263
x(t)
finv z(t) W2s+1
w3
∑
fpoly n
∑
z-1
fpoly n−2 ... fpoly
1
w2⋅s+1
z-1
K
w1
w2
w3
∑
w2
z-1
∑
y(t)
w1
Chromosome
Fig. 3. Encoding scheme in the genetic algorithm for the odd coefficients of the polynomial (fpoly j) approximating the inverse nonlinearity and the linear filter coefficients (wi). The values of the variables stored in the chromosome are real numbers.
− Fitness Function. The chosen evaluation function must give higher scores for those chromosomes representing estimations which maximize statistical independence. According to the central limit theorem, a simple approach as maximizing a higher order statistic like kurtosis absolute value demonstrates to be a good estimator. However, as dependence is supposed to exist between the samples due to time delays, a better estimator is achieved by dividing the observed signal in several “sub-signals” (in the simulations 10 partitions were made) and then computing kurtosis in each of them. Finally, the expression for the chosen fitness function is: n
evalKurt ( w) = ∑ kurt ( yi )
(5)
i =1
where kurt ( x ) =
E( x 4 ) − 3 and the yi are each of the partitions of the estimated E( x 2 )2
signal obtained after applying the nonlinear inverse function (finv) and the inverse filter w (both encoded in the chromosome w ) to the observation x . − Genetic Operators. Typical crossover and mutation operators will be used for the manipulation of the current population in each iteration of the GA. The crossover operator is “Simple One-point Crossover”. The mutation operator is “NonUniform Mutation” [10]. This operator makes the exploration-exploitation tradeoff be more favorable to exploration in the early stages of the algorithm, while exploitation takes more importance when the solution given by the GA is closer to the optimal. − Parameter Set. Population size, number of generations, probability of mutation and crossover and other parameters relative to the genetic algorithm operation were chosen depending on the characteristics of the mixing problem. Generally a population of 50-80 individuals was used, stopping criteria was set between 100-150 iterations, crossover probability is 0.8 per chromosome and mutation probability is typically set between 0.05 and 0.08 per gene.
3.2 GA Scheme
The operation of the algorithm can be summarized as follows (see Fig. 4):

0. Normalization of the observed mixture: x ← (x − x̄)/σ(x).
1. Random initialization of the population.
2. Apply genetic operators to the population: (i) crossover, (ii) mutation.
3. Evaluate the fitness of each chromosome: the fitness function value for each estimation y (determined by chromosome g_i) is calculated.
4. Selection of the individuals which will take part in the next generation (elitism + selection).
5. If the maximum number of iterations has not been reached, return to step 2.
6. Solution: y = W · f(x).

Fig. 4. Genetic algorithm operation for nonlinear blind deconvolution
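The following Python sketch mirrors this flow. It is illustrative only: the selection scheme is simplified, plain Gaussian mutation stands in for the non-uniform mutation of [10], and `fitness` is any evaluation function such as the one sketched in Sect. 3.1:

```python
import numpy as np

def run_ga(x, fitness, n_genes, pop_size=50, n_gen=100,
           p_cross=0.8, p_mut=0.08, n_elite=2):
    x = (x - x.mean()) / x.std()                    # step 0: normalization
    pop = np.random.randn(pop_size, n_genes)        # step 1: random init
    for _ in range(n_gen):                          # step 5: iteration limit
        for i in range(0, pop_size - 1, 2):         # step 2i: one-point crossover
            if np.random.rand() < p_cross:
                cut = np.random.randint(1, n_genes)
                pop[[i, i + 1], cut:] = pop[[i + 1, i], cut:]
        mask = np.random.rand(pop_size, n_genes) < p_mut
        pop[mask] += np.random.randn(mask.sum())    # step 2ii: mutation
        scores = np.array([fitness(c, x) for c in pop])  # step 3: evaluation
        pop = pop[np.argsort(-scores)]              # rank by fitness
        keep = np.r_[np.arange(n_elite),            # step 4: elitism +
                     np.random.randint(0, pop_size // 2, pop_size - n_elite)]
        pop = pop[keep]                             # selection from top half
    return pop[0]                                   # step 6: best chromosome
```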
4 Experimental Results

Following the description of the proposed algorithm, some experimental results using uniform random sources are presented. In all the experiments, the source signal is a uniform random source with zero mean and unit variance. As the performance criterion, we have used the crosstalk between the estimations (y) and the sources (s), measured in decibels. The unknown filter is the low-pass filter H(z) = 1 + 0.5z⁻¹, and we applied a strong nonlinearity, f(x) = atanh(10x). The genetic algorithm was configured with a crossover probability of 0.8, a mutation probability of 0.08 per gene, a population size of 50, and a stopping criterion of 100 generations. In the first experiment, we applied this configuration to a randomly generated signal of t = 1900 samples. As the algorithm is non-deterministic, it was run 10 times with the same configuration. The average crosstalk is shown in the equation below:
CTalk(s(t), y(t)) = −13.6 dB  (6)
Figure 5 shows the evolution of the fitness of the best individual (left) and the average fitness (right) of each generation, illustrating the smooth convergence of the algorithm.

Fig. 5. Genetic algorithm operation for nonlinear blind deconvolution: fitness for the best solution of each generation (left) and average fitness of each generation (right), versus generation number
Figure 6 shows how the algorithm cancels the nonlinear part of the Wiener system (which is the most difficult component). Secondly, the number of samples was decreased to t = 500 samples, keeping the remaining parameters unchanged. Thus, we can determine whether the algorithm still works when the number of samples is low.

Fig. 6. First experiment, with t = 1900 samples. Left: effect of the original distortion (f) on the filtered signal. Center: effect of the inverse nonlinearity (finv) on the observed signal (x). Right: composition of finv over f. In the ideal situation, this composition should be linear.
The algorithm was again run 10 times, obtaining an average crosstalk of:

CTalk(s(t), y(t)) = −10.7 dB  (7)
Although these results would not be satisfactory in a linear convolution setting, the presence of a strong nonlinearity and the insufficient number of samples make the performance of the algorithm quite acceptable.

Fig. 7. Second experiment, with t = 500 samples. Left: effect of the original distortion (f) on the filtered signal. Center: effect of the inverse nonlinearity (finv) on the observed signal (x). Right: composition of finv over f. In the ideal situation, this composition should be linear.
5 Conclusion

This contribution discusses an appropriate application of genetic algorithms to the complex problem of nonlinear blind deconvolution. Using a simple approach based on kurtosis calculation and mutual information approximation, the proposed algorithm operates effectively even when the number of samples is low, as shown by the experimental results.
References

1. Segal, B.N., Outerbridge, J.S.: Vestibular (semicircular canal) primary neurons in bullfrog: nonlinearity of individual and population response to rotation. J. Neurophys. 47, 545–562 (1982)
2. Hunter, I.W.: Experimental comparison of Wiener and Hammerstein cascade models of frog muscle fiber mechanics. Biophys. J. 49(81)a (1986)
3. Rojas, F., Solé-Casals, J., Monte-Moreno, E., Puntonet, C.G., Prieto, A.: A Canonical Genetic Algorithm for Blind Inversion of Linear Channels. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 238–245. Springer, Heidelberg (2006)
4. Rojas, F., Puntonet, C.G., Rodríguez-Álvarez, M., Rojas, I.: Evolutionary Algorithm Using Mutual Information for Independent Component Analysis. In: Mira, J.M., Álvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 302–309. Springer, Heidelberg (2003)
5. Martín, R., Rojas, F., Puntonet, C.G.: Post-nonlinear Blind Source Separation using metaheuristics. Electronics Letters 39, 1765–1766 (2003)
6. Rojas, F., Puntonet, C.G., Rodríguez, M., Rojas, I., Martín-Clemente, R.: Blind Source Separation in Post-Nonlinear Mixtures using Competitive Learning, Simulated Annealing and Genetic Algorithms. IEEE Transactions on Systems, Man and Cybernetics (Part C) 34(4), 407–416 (2004)
7. Rojas, F., Puntonet, C.G., Gorriz, J.M., Valenzuela, O.: Assessing the Performance of Several Fitness Functions in a Genetic Algorithm for Nonlinear Separation of Sources. In: Wang, L., Chen, K., Ong, Y.S. (eds.) ICNC 2005. LNCS, vol. 3612, pp. 863–872. Springer, Heidelberg (2005)
8. Taleb, A., Solé-Casals, J., Jutten, C.: Quasi-Nonparametric Blind Inversion of Wiener Systems. IEEE Transactions on Signal Processing 49(5), 917–924 (2001)
9. Solé-Casals, J., Jutten, C., Pham, D.T.: Fast approximation of nonlinearities for improving inversion algorithms of PNL mixtures and Wiener systems. Signal Processing 85(9), 1780–1786 (2005)
10. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn. Springer, Heidelberg (1996)
A Complexity Constrained Nonnegative Matrix Factorization for Hyperspectral Unmixing Sen Jia and Yuntao Qian College of Computer Science, Zhejiang University Hangzhou 310027, P.R. China [email protected], [email protected]
Abstract. Hyperspectral unmixing, as a blind source separation (BSS) problem, has been intensively studied from the independence aspect in the last few years. However, independent component analysis (ICA) cannot fully unmix all the materials because the sources (abundance fractions) are not statistically independent. In this paper a complexity constrained nonnegative matrix factorization (CCNMF) for simultaneously recovering both constituent spectra and corresponding abundances is proposed. Three important facts are exploited: First, the spectral data are nonnegative; second, the variation of the material spectra and abundance images is smooth in time and space respectively; third, in most cases, both the material spectra and the abundances are localized. Experiments on real data are provided to illustrate the algorithm's performance.
1 Introduction
Hyperspectral sensors collect imagery simultaneously in hundreds of narrow and contiguously spaced spectral bands, with wavelengths ranging from 0.4 to 2.5 μm. Owing to the low spatial resolution of the sensor, disparate substances may contribute to the spectrum measured from a single pixel, causing it to be a "mixed" pixel. Consequently, the extraction of constituent spectra from a mixed spectrum, as well as their proportions, is important to both civilian and military applications in which subpixel detail is valuable. Hyperspectral unmixing is therefore a necessary preprocessing step for hyperspectral applications [1,2]. Linear spectral unmixing [1] is a commonly accepted approach to analyze the massive volume of data. In addition, due to the difficulty of obtaining a priori knowledge about the endmembers, unsupervised hyperspectral unmixing tries to identify the endmembers and abundances directly from the observed data without any user interaction [3]. Unsupervised linear hyperspectral unmixing falls into the class of blind source separation (BSS) problems [4]. Therefore, independent component analysis (ICA) is naturally selected to unmix hyperspectral data [5,6]. However, the source (abundance fraction) independence assumption of ICA is violated because the sum of the abundance fractions is constant, implying statistical dependence among them, which compromises the applicability of ICA to hyperspectral data [7].
Recently, Stone [8] has proposed a complexity based BSS algorithm, called complexity pursuit, which studies the complexity of the sources instead of their independence. We introduced this algorithm to unmix hyperspectral data [9], and proved that the undercomplete case, i.e., when the number of bands (mixtures) is much larger than the number of endmembers (sources), is an advantage rather than a limitation. However, it neglects the nonnegativity of both spectra and abundances, which could lead to unmixing results with negative values that are obviously meaningless. Lately, nonnegative matrix factorization (NMF) [10,11], which is a useful technique for representing high dimensional data by nonnegative components, has been extended to unmix hyperspectral data by enforcing smoothness constraints on spectra and abundances, called SCNMF [12]. For each estimated endmember, it selects the closest signature from a spectral library as the endmember spectrum. However, the best match is not reliable due to the strong atmospheric and environmental variations [13]. The results of this method are given for comparative analysis in the experimental section. In this paper, we develop a complexity constrained NMF (CCNMF) that incorporates the complexity and locality constraints into NMF. Three important facts are exploited: First, the spectral data are nonnegative; second, the variation of endmember spectra and abundances is smooth due to the high spectral resolution and low spatial resolution of hyperspectral data [14]; third, in many cases, both the endmember spectra and the abundances are localized: part of the reflectance spectra are small due to atmospheric and environmental effects, and the fractions of each endmember are often localized in the scene. The experimental results validate the efficiency of the approach. The rest of the paper is organized as follows. Section 2 reviews the complexity based hyperspectral unmixing. Section 3 presents the CCNMF algorithm. Section 4 evaluates the performance of our proposed algorithm on two real data sets. Section 5 sets out our conclusion.
2 Complexity Based BSS Algorithm for Hyperspectral Unmixing
In this section, we first present the linear superposition model for the reflectances, then give a brief introduction of the complexity based BSS algorithm for hyperspectral unmixing.

2.1 Linear Spectral Mixture Model
Hyperspectral data is a three dimensional array with the width and length corresponding to spatial dimensions and the spectral bands as the third dimension, which are denoted by I, J and L in sequence. Let R be the image cube with each spectrum Rij being an L × 1 pixel vector. Let M be an L × P spectral signature matrix that each column vector Mp corresponds to an endmember spectrum and P is the number of endmembers in the image. Let S be the abundance cube (the length of each dimension is I, J and P respectively) and every column Sij be
a P × 1 abundance vector associated with R_ij, with each element denoting the abundance fraction of the relevant endmember present in R_ij. The simplified linear spectral mixture model for the pixel with coordinate (i, j) can be written as

R_ij = M S_ij + n  (1)

where n is noise that can be interpreted as receiver electronic noise. Meanwhile, endmember spectra and fractional abundances are subject to

M_lp ≥ 0 (1 ≤ l ≤ L),  S_ijp ≥ 0,  Σ_{p=1}^{P} S_ijp = 1  (2)
which are called nonnegativity and full additivity respectively [2].

2.2 Complexity Based BSS Algorithm
Instead of the assumption of independence in ICA, the complexity based BSS algorithm makes the complexity of the extracted signals as low as possible. One simple measure of complexity can be formulated in terms of predictability. The predictability of the hyperspectral data is composed of two parts

F(R) = F(M) + F(S)  (3)
F(M) is the predictability of the spectral signatures, which is defined as

F(M) = Σ_{p=1}^{P} ln [ Σ_{l=1}^{L} (M̄_p − M_lp)² / Σ_{l=1}^{L} (M̂_lp − M_lp)² ] = Σ_{p=1}^{P} ln (V_p / U_p)  (4)

M̂_lp = λ_S M̂_(l−1)p + (1 − λ_S) M_(l−1)p,  0 ≤ λ_S ≤ 1  (5)
V_p is the overall variance of the spectral signature M_p, in which M̄_p is the mean value. U_p is a measure of the temporal "roughness" of the signature. M̂_lp is the short-term moving average used to predict M_lp, with λ_S being the predictive rate. Maximizing the ratio V_p/U_p means: (i) M_p has a nonzero range, (ii) the values in M_p change slowly. Consequently, F(M) characterizes the smoothness of the spectral signature. F(S) is the predictability of the abundance cube, which is defined as

F(S) = Σ_{p=1}^{P} ln [ Σ_{i,j=1}^{I,J} (S̄_p − S_ijp)² / Σ_{i,j=1}^{I,J} χ(S_ijp) ]  (6)
where S̄_p is the mean value of the pth abundance image except S_ijp. The energy function of the Gibbs distribution is used to formulate χ(S_ijp), which measures the local correlation of S_ijp. That is,

χ(S_ijp) = Σ_{i′j′ ∈ N_ijp} ω_i′j′ φ(S_ijp − S_i′j′p, δ)  (7)
where N_ijp is the nearest neighborhood of S_ijp, ω_i′j′ is a weighting factor, δ is a scaling factor, and φ(ξ, δ) is the potential function, which takes the form [15]

φ(ξ, δ) = δ ln[cosh(ξ/δ)]  (8)
We assume δ = 0.1, ωij = 1 and Nijp = {(i − 1)j, (i + 1)j, i(j − 1), i(j + 1)}. Similarly, F (S) characterizes the spatial correlation of each abundance image.
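For concreteness, the two predictability terms above can be prototyped in a few lines of Python. This is a sketch under the stated choices ω = 1 and δ = 0.1; the function names are ours, not the authors':

```python
import numpy as np

def predictability_M(M, lam=0.9):
    """F(M) of Eq. (4): sum over endmembers of ln(V_p / U_p)."""
    F = 0.0
    for p in range(M.shape[1]):
        m = M[:, p]
        mhat = np.empty_like(m)
        mhat[0] = m[0]
        for l in range(1, len(m)):          # moving-average predictor, Eq. (5)
            mhat[l] = lam * mhat[l-1] + (1 - lam) * m[l-1]
        V = np.sum((m.mean() - m) ** 2)     # overall variance
        U = np.sum((mhat - m) ** 2)         # temporal "roughness"
        F += np.log(V / U)
    return F

def chi(S_img, i, j, delta=0.1):
    """chi(S_ijp) of Eqs. (7)-(8) for one abundance image S_img."""
    phi = lambda xi: delta * np.log(np.cosh(xi / delta))
    nbrs = [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
    return sum(phi(S_img[i, j] - S_img[a, b]) for a, b in nbrs
               if 0 <= a < S_img.shape[0] and 0 <= b < S_img.shape[1])
```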
3 Complexity Constrained NMF (CCNMF)
NMF uses alternating minimization of a cost function subject to nonnegativity constraints. The most widely used cost function is the Euclidean distance function (two matrices R′ and S′ are introduced, in which each row is obtained by converting the band and abundance images of R and S into vectors, respectively; their dimensions are L × K and P × K, where K is the number of pixels):

E(M, S′) = ½ ‖R′ − MS′‖² = ½ Σ_{l,k} (R′_lk − (MS′)_lk)²  (9)
To represent the local constraints of the spectra and abundances, nonsmooth NMF [16], which explicitly controls the degree of locality, is used:

nsE(M, S′) = ½ ‖R′ − MCS′‖²,  C = (1 − α)I + (α/P) 11ᵀ  (10)
where C ∈ R^{P×P} is a "nonsmoothing" matrix, I is the identity matrix, 1 is a vector of ones, the notation (·)ᵀ is matrix transposition, and the parameter α (0 ≤ α ≤ 1) explicitly controls the extent of nonsmoothness of C; it is set to 0.5 in the experiments. Taking the complexity constraints into consideration, the cost function of CCNMF can be formulated as

D(M, S′) = nsE(M, S′) + θ_M J_M(M) + θ_S J_S(S′)  (11)

J_M(M) = ½ ‖M̂ − M‖²,  J_S(S′) = |χ(S′)| / ‖S̄′ − S′‖²  (12)

θ_M and θ_S are regularization parameters, M̂ and χ(S′) are L × P and P × K matrices, with elements M̂_lp and χ(S′)_pk at the (l, p) and (p, k) positions respectively, |·| is the sum of matrix elements, and S̄′ is a P × K matrix with the entries in the pth row equal to S̄′_p. The terms in (12) are approximations of the reciprocals of (4) and (6) respectively. The numerator of (4) is neglected because the function nsE(M, S′) ensures that the extracted spectrum is not constant, while that of (6) is retained to avoid the abundances of adjacent pixels being equal.
The general additive update rules can be constructed as

M ← M − μ .* ∂D(M, S′)/∂M,  S′ ← S′ − ν .* ∂D(M, S′)/∂S′  (13)

where ".*" denotes element-wise multiplication. Taking the derivatives of D(M, S′) with respect to M and S′ and after some algebraic manipulations, the gradients with respect to M and S′ are

∂D(M, S′)/∂M = −(R′ − MCS′)(CS′)ᵀ − θ_M (M̂ − M)  (14)

∂D(M, S′)/∂S′ = −(MC)ᵀ(R′ − MCS′) + θ_S ∂J_S(S′)/∂S′  (15)

where

∂J_S(S′)/∂S′ = (∂|χ(S′)|/∂S′) / ‖S̄′ − S′‖² + 2|χ(S′)|(S̄′ − S′) / ‖S̄′ − S′‖⁴  (16)

Choosing the step sizes from [11], the multiplicative rules are given below:

M ← M .* (R′(CS′)ᵀ + θ_M (M̂ − M)) ./ (MCS′(CS′)ᵀ)  (17)

S′ ← S′ .* ((MC)ᵀR′ − θ_S ∂J_S(S′)/∂S′) ./ ((MC)ᵀMCS′)  (18)

where "./" denotes element-wise division. In addition, to ensure the full additivity of the abundance cube, S′ should be normalized:

S′_pk ← S′_pk / Σ_{p=1}^{P} S′_pk,  1 ≤ p ≤ P, 1 ≤ k ≤ K  (19)
One hurdle of the NMF problem is the existence of local minima due to the nonconvexity of the objective function. But through adding the complexity and locality constraints, the feasible solution set is confined and the nonuniqueness of the solution is alleviated. Finally, we summarize the CCNMF algorithm:

1. Use the virtual dimensionality (VD) method [17] to find the number of endmembers P involved in the mixture data.
2. Initialize M and S′ with non-negative values.
3. For t = 1, 2, …, until D(M, S′) ≤ tol, for a tolerance value tol ∈ R, update M and S′ by (17) and (18), and then normalize S′ by (19).
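These update rules translate almost directly into NumPy. The sketch below is illustrative only (our function names; the gradient (16) is supplied as a callable, the tolerance test is replaced by a fixed iteration count, and in practice small positive floors may be needed since the additive θ terms can make a factor negative):

```python
import numpy as np

def smooth_prediction(M, lam=0.9):
    # moving-average predictor M-hat of Eq. (5), applied over the bands
    Mhat = np.empty_like(M)
    Mhat[0] = M[0]
    for l in range(1, M.shape[0]):
        Mhat[l] = lam * Mhat[l-1] + (1 - lam) * M[l-1]
    return Mhat

def ccnmf(R, P, theta_M=0.1, theta_S=0.1, alpha=0.5, grad_JS=None, n_iter=500):
    # R: L x K nonnegative matricized data; returns spectra M, abundances S
    L, K = R.shape
    M = np.random.rand(L, P)                     # step 2: nonnegative init
    S = np.random.rand(P, K)
    C = (1 - alpha) * np.eye(P) + alpha / P      # "nonsmoothing" matrix (10)
    for _ in range(n_iter):                      # step 3 (tol test omitted)
        Mhat = smooth_prediction(M)
        M *= (R @ (C @ S).T + theta_M * (Mhat - M)) / (M @ C @ S @ (C @ S).T)  # (17)
        g = grad_JS(S) if grad_JS is not None else 0.0
        S *= ((M @ C).T @ R - theta_S * g) / ((M @ C).T @ M @ C @ S)           # (18)
        S /= S.sum(axis=0, keepdims=True)        # full additivity (19)
    return M, S
```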
4 Experimental Results
In this section, the data being analyzed were collected by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) system. It is composed of 210
channels with a spectral resolution of 10 nm acquired in the 0.4-2.5 micron region. To evaluate the performance of SCNMF and CCNMF, spectral angle distance (SAD) and root-mean-square error (RMSE) are employed (the definitions are omitted here due to the length of the paper; readers are referred to [18] for details). The ground truth of the data is computed according to [9].

4.1 Washington D.C. Data
Figure 1 shows a subscene of size 30 × 30 extracted from the Washington D.C. data set. After low signal-to-noise ratio (SNR) bands are removed, only 191 bands remain (i.e., L = 191). Using the VD method to estimate P gives P = 4.
Fig. 1. The subscene (30 × 30) extracted from Washington D.C. data set
Fig. 2. Abundance maps estimated using SCNMF: (a) grass, (b) trail, (c) tree, (d) water
Fig. 3. Abundance maps estimated using CCNMF: (a) grass, (b) trail, (c) tree, (d) water
Firstly, SCNMF is applied to the data set. Figure 2 presents the estimated abundance maps. While grass and trail are detected in Figures 2(a) and 2(b), the other two maps are still mixtures. Then CCNMF is utilized, and the results are displayed in Figure 3. Unlike in Figure 2, all four abundances (grass, trail, tree and water) are successfully extracted. Table 1 quantifies the unmixing results using the two performance metrics.

Table 1. SAD-based similarity and RMSE-based error scores between the unmixing results of the Washington D.C. data by SCNMF and CCNMF and the ground truth (the numbers in bold represent the best performance)

Method (SAD / RMSE)   tree             grass            trail            water
SCNMF                 0.1349 / 0.2435  0.1075 / 0.1628  0.24 / 0.2299    0.2926 / 0.2152
CCNMF                 0.1031 / 0.2272  0.1299 / 0.2071  0.1886 / 0.1403  0.1673 / 0.0975

4.2 Urban Data

Fig. 4. Urban scene (307 × 307) extracted from HYDICE data set

Fig. 5. Abundance maps estimated using SCNMF: (a) asphalt, (b) grass, (c) roof, (d) tree
The data to be used were obtained from a HYDICE scene of 307 × 307 pixels shown in Figure 4. After low signal-to-noise ratio (SNR) bands are removed, a total of 162 bands are used in the experiment. The estimated number of endmembers using the VD method is 4.
Fig. 6. Abundance maps estimated using CCNMF: (a) asphalt, (b) grass, (c) roof, (d) tree
The estimated abundances using SCNMF are illustrated in Figure 5. Only grass and tree are detected, in Figures 5(b) and 5(d); the other two are still mixed. In contrast, all four endmembers are separated out by CCNMF, as displayed in Figure 6. Likewise, Table 2 quantifies the unmixing results.

Table 2. SAD-based similarity and RMSE-based error scores between the unmixing results of the urban data by SCNMF and CCNMF and the ground truth (the numbers in bold represent the best performance)

Method (SAD / RMSE)   asphalt          grass            roof             tree
SCNMF                 0.2704 / 0.2468  0.2608 / 0.2093  0.32 / 0.2301    0.1887 / 0.1792
CCNMF                 0.189 / 0.1559   0.111 / 0.1247   0.2942 / 0.2228  0.1005 / 0.1017

5 Conclusion
We have presented a complexity constrained NMF (CCNMF) for hyperspectral unmixing. The algorithm extends the original NMF by incorporating the complexity and locality constraints, which accord with the three characteristics of hyperspectral data. Its effectiveness has been tested by comparison to SCNMF with data from HYDICE data sets. The experimental results show that CCNMF has the potential of providing more accurate estimates of both endmember spectra and abundance maps.
References 1. Keshava, N., Mustard, J.F.: Spectral unmixing. IEEE Signal Processing Mag. 19(3), 44–57 (2002) 2. Keshava, N.: A survey of spectral unmixing algorithms. Lincoln Lab. J. 14(1), 55–73 (2003)
3. Parra, L., Spence, C., Sajda, P., Ziehe, A., Müller, K.R.: Unmixing hyperspectral data. In: Adv. Neural Inform. Process. Syst. 12, Denver, Colorado, USA, pp. 942–948. MIT Press, Cambridge (1999)
4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, Chichester (2002)
5. Chiang, S.-S., Chang, C.-I., Smith, J.A., Ginsberg, I.W.: Linear spectral random mixture analysis for hyperspectral imagery. IEEE Trans. Geosci. Remote Sensing 40(2), 375–392 (2002)
6. Chang, C.-I.: Hyperspectral Imaging: Techniques for Spectral Detection and Classification. Kluwer Academic/Plenum Publishers, New York (2003)
7. Nascimento, J.M.P., Dias, J.M.B.: Does independent component analysis play a role in unmixing hyperspectral data? IEEE Trans. Geosci. Remote Sensing 43(1), 175–187 (2005)
8. Stone, J.V.: Blind source separation using temporal predictability. Neural Comput. 13(7), 1559–1574 (2001)
9. Jia, S., Qian, Y.T.: Spectral and spatial complexity based hyperspectral unmixing. IEEE Trans. Geosci. Remote Sensing (to appear)
10. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
11. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Adv. Neural Inform. Process. Syst. vol. 13, pp. 556–562 (2000)
12. Pauca, V.P., Piper, J., Plemmons, R.J.: Nonnegative matrix factorization for spectral data analysis. Linear Algebra and Applications 416(1), 29–47 (2006)
13. Miao, L.D., Qi, H.R.: Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization. IEEE Trans. Geosci. Remote Sensing 45(3), 765–777 (2007)
14. Ferrara, C.F.: Adaptive spatial/spectral detection of subpixel targets with unknown spectral characteristics. In: Proc. SPIE, vol. 2235, pp. 82–93 (1994)
15. Green, P.J.: Bayesian reconstructions from emission tomography data using a modified EM algorithm. IEEE Trans. Med. Imag. 9(1), 84–93 (1990)
16. Pascual-Montano, A., Carazo, J.M., Kochi, K., Lehmann, D., Pascual-Marqui, R.D.: Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans. Pattern Anal. Machine Intell. 28(3), 403–415 (2006)
17. Chang, C.-I., Du, Q.: Estimation of number of spectrally distinct signal sources in hyperspectral imagery. IEEE Trans. Geosci. Remote Sensing 42(3), 608–619 (2004)
18. Plaza, A., Martinez, P., Perez, R., Plaza, J.: A quantitative and comparative analysis of endmember extraction algorithms from hyperspectral data. IEEE Trans. Geosci. Remote Sensing 42(3), 650–663 (2004)
Smooth Component Analysis as Ensemble Method for Prediction Improvement

Ryszard Szupiluk¹,², Piotr Wojewnik¹,², and Tomasz Ząbkowski¹,³

¹ Polska Telefonia Cyfrowa Ltd., Al. Jerozolimskie 181, 02-222 Warsaw, Poland {rszupiluk,pwojewnik,tzabkowski}@era.pl
² Warsaw School of Economics, Al. Niepodleglosci 162, 02-554 Warsaw, Poland
³ Warsaw Agricultural University, Nowoursynowska 159, 02-787 Warsaw, Poland
Abstract. In this paper we apply a novel smooth component analysis algorithm as an ensemble method for prediction improvement. When many prediction models are tested we can treat their results as a multivariate variable with latent components having a constructive or destructive impact on the prediction results. We show that the elimination of the destructive components and proper mixing of the constructive ones can improve the final prediction results. The validity and high performance of our concept is demonstrated on the problem of energy load prediction.
1 Introduction

The blind signal separation methods have applications in telecommunication, medicine, economics and engineering. Starting from separation problems, BSS methods are used in filtration, segmentation and data decomposition tasks [5,11]. In this paper we apply the BSS method for prediction improvement in the case when many models are tested. The prediction problem, like other regression tasks, aims at finding the dependency between input data and a target. This dependency is represented by a specific model, e.g. neural networks [7,13]. In fact, in many problems we can find different acceptable models, and ensemble methods can be used to improve the final results [7]. Usual solutions propose the combination of a few models by mixing their results or parameters [1,8,18]. In this paper we propose an alternative concept based on the assumption that the prediction results contain latent destructive and constructive components common to all the model results [16]. The elimination of the destructive ones should improve the final results. To find the latent components we apply blind signal separation methods with a new algorithm for smooth component analysis (SmCA), which is addressed to signals with temporal structure [4]. The full methodology will be tested on a load prediction task [11].
2 Prediction Results Improvement

We assume that after the learning process each prediction result includes two types of latent components: constructive, associated with the target, and destructive, associated
with the inaccurate learning data, individual properties of the models, missing data, imprecise parameter estimation, distribution assumptions, etc. Let us assume there are m models. We collect the results of a particular model in a column vector x_i, i = 1, …, m, and treat such vectors as a multivariate variable X = [x₁, x₂, …, x_m]ᵀ, X ∈ R^{m×N}, where N is the number of observations. We describe the set of latent components as S = [ŝ₁, ŝ₂, …, ŝ_k, s_{k+1}, …, s_n]ᵀ, S ∈ R^{n×N}, where ŝ_j denotes a constructive component and s_i a destructive one [3]. For simplicity of further considerations we assume m = n. Next we assume the relation between the observed prediction results and the latent components to be the linear transformation
X = AS,  (1)

where matrix A ∈ R^{n×n} represents the mixing system. Equation (1) means the decomposition of matrix X by the latent component matrix S and the mixing matrix A.
Fig. 1. The scheme of modelling improvement method by multivariate decomposition
Our aim is to find the latent components and reject the destructive ones (replace them with zero). Next we mix the constructive components back to obtain improved prediction results as

X̂ = AŜ = A[ŝ₁, ŝ₂, …, ŝ_k, 0_{k+1}, …, 0_n]ᵀ.  (2)

The replacement of a destructive signal by zero is equivalent to putting zero in the corresponding column of A. If we express the mixing matrix as A = [a₁, a₂, …, a_n], the purified results can be described as

X̂ = ÂS = [a₁, a₂, …, a_k, 0_{k+1}, …, 0_n]S,  (3)

where Â = [a₁, a₂, …, a_k, 0_{k+1}, …, 0_n]. The crucial point of the above concept is the proper estimation of A and S. It is a difficult task because we have no information about which decomposition is most adequate. Therefore we must test various transformations giving us components of different properties. The most adequate methods to solve the first problem seem to be the blind signal separation (BSS) techniques.
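In code, the elimination step (2)-(3) is a one-liner once A and S have been estimated (a sketch; `destructive` is the index list of components judged destructive):

```python
import numpy as np

def purify(A, S, destructive):
    # Eq. (3): zero the mixing-matrix columns of the destructive components
    A_hat = A.copy()
    A_hat[:, destructive] = 0.0
    return A_hat @ S   # improved prediction results, one row per model
```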
3 Blind Signal Separation and Decomposition Algorithms

Blind signal separation (BSS) methods aim at the identification of unknown signals mixed in an unknown system [2,4,10,15]. There are many different methods and algorithms used in the BSS task. They explore different properties of the data, like independence [2,10], decorrelation [3,4], sparsity [5,19], smoothness [4], nonnegativity [12], etc. In our case, we are not looking for specific real signals but rather for an interesting analytical representation of the data of the form (1). To find the latent variables A and S we can use a transformation defined by a separation matrix W ∈ R^{n×n}, such that
Y = WX,  (4)

where Y is related to S. We also assume that Y satisfies the following relation

Y = PDS,  (5)
where P is a permutation matrix and D is a diagonal matrix [4,10]. Relation (5) means that the estimated signals can be rescaled and reordered in comparison to the original sources. These properties are not crucial in our case, therefore Y can be treated directly as an estimated version of the sources S. There are some additional assumptions depending on the particular BSS method. We focus on methods based on decorrelation, independent component analysis and smooth component analysis.
Decorrelation is one of the most popular statistical procedures for the elimination of linear statistical dependencies in the data. It can be performed by diagonalization of the correlation matrix R_xx = E{XXᵀ}. It means that matrix W should satisfy the following relation

R_yy = W R_xx Wᵀ = E,  (6)

where E is any diagonal matrix. There are many methods utilizing different matrix factorisations leading to the decorrelation matrix W, see Table 1 [6,17]. Decorrelation is not an effective separation method on its own and is typically used as preprocessing. However, we find it very useful for our analytical representation.

Table 1. Methods of decorrelation possible for models decomposition

Method            Factorisation                  Decorrelation
Correlation form  R_xx = R_xx^{1/2} R_xx^{1/2}   W = R_xx^{−1/2}
Cholesky          R_xx = GᵀG                     W = G^{−T}
EIG (PCA)         R_xx = UΣUᵀ                    W = Uᵀ
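The three factorisations of Table 1 translate directly into NumPy (a sketch; X holds the model results row-wise):

```python
import numpy as np

def decorrelation_matrices(X):
    """Return the W's of Table 1 for data X (models x samples)."""
    Rxx = X @ X.T / X.shape[1]
    d, U = np.linalg.eigh(Rxx)
    W_sqrt = U @ np.diag(d ** -0.5) @ U.T   # W = Rxx^(-1/2)
    G = np.linalg.cholesky(Rxx).T           # upper triangular, G^T G = Rxx
    W_chol = np.linalg.inv(G.T)             # W = G^(-T)
    W_pca = U.T                             # W = U^T
    return W_sqrt, W_chol, W_pca
```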
Independent component analysis, ICA, is a statistical tool which allows the decomposition of an observed variable X into independent components Y = [y₁, y₂, …, yₙ]ᵀ [2,4,10]. Typical algorithms for ICA explore higher order statistical dependencies in a dataset, so after ICA decomposition we obtain signals (variables) without any linear and non-linear statistical dependencies. To obtain independent components we explore the fact that the joint probability of independent variables can be factorized as the product of the marginal probabilities:
p₁(y₁)p₂(y₂)⋯pₙ(yₙ) = p₁…ₙ(y₁, y₂, …, yₙ),  (7)

where the left-hand side defines the factorized density q_y(Y) and the right-hand side the joint density p_y(Y).
One of the most popular methods to obtain (7) is to find such W that minimizes the Kullback-Leibler divergence between p_y(Y) and q_y(Y) [5]:

W_opt = arg min_W D_KL(p_y(WX) ‖ q_y(WX)) = arg min_W ∫_{−∞}^{+∞} p_y(Y) log [p_y(Y)/q_y(Y)] dY.  (8)
There are many numerical algorithms for estimating independent components, like Natural Gradient, FOBI, JADE or FASTICA [2,4,10].

Smooth Component Analysis, SmCA, is a method for finding smooth components in a multivariate variable [4]. The analysis of signal smoothness is strongly associated with the definitions and assumptions about such characteristics [9,17]. For signals with temporal structure we propose a new smoothness measure

P(y) = [ (1/N) Σ_{k=2}^{N} |y(k) − y(k−1)| ] / [ max(y) − min(y) + δ(max(y) − min(y)) ],  (9)
where the symbol δ(·) means the Kronecker delta, and P(y) ∈ [0,1]. Measure (9) has a simple interpretation: it is maximal when the changes in each step are equal to the range (maximal change), and minimal when the data are constant. The Kronecker delta term is introduced to avoid dividing by zero. The range calculated in the denominator is sensitive to local data, which can be avoided using extremal value distributions. The components are taken as linear combinations of the signals x_i and should be as smooth as possible. Our aim is to find such W = [w₁, w₂, …, wₙ] that for Y = WX we obtain Y = [y₁, y₂, …, yₙ]ᵀ where y₁ maximizes P(y₁), so we can write

w₁ = arg max_{‖w‖=1} P(wᵀx).  (10)
Having estimated the first n − 1 smooth components, the next one is calculated as the smoothest component of the residual obtained by Gram-Schmidt orthogonalization:

wₙ = arg max_{‖w‖=1} P(wᵀ(x − Σ_{i=1}^{n−1} y_i y_iᵀ x)),  (11)
where y_i = w_iᵀx, i = 1, …, n. As the numerical algorithm for finding wₙ we can employ the conjugate gradient method with golden section as a line search routine. The algorithm outline, for initial w_i(0) = rand, p_i(0) = −g_i(0), is as follows:

1. Identify the indexes l for extreme signal values:

l_max = arg max_{l∈1…N} w_iᵀ(k)x(l),  (12)

l_min = arg min_{l∈1…N} w_iᵀ(k)x(l),  (13)
2. Calculate the gradient of P(w_iᵀx):

g_i = ∂P(w_iᵀx)/∂w_i = [ (1/N) Σ_{l=2}^{N} Δx(l)·sign(w_iᵀΔx(l)) − P(w_iᵀx)·(x(l_max) − x(l_min)) ] / [ max(w_iᵀx) − min(w_iᵀx) + δ(max(w_iᵀx) − min(w_iᵀx)) ],  (14)

where Δx(l) = x(l) − x(l−1).

3. Identify the search direction (Polak-Ribiere formula [19]):

p_i(k) = −g_i(k) + [ g_iᵀ(k)(g_i(k) − g_i(k−1)) / (g_iᵀ(k−1)g_i(k−1)) ] p_i(k−1),  (15)

4. Calculate the new weights:

w_i(k+1) = w_i(k) + α(k)·p_i(k),  (16)

where α(k) is found by golden section search. The above optimization algorithm should be applied as a multistart technique with random initialization [14].
where α (k ) is found in golden search. The above optimization algorithm should be applied as a multistart technique with random initialization [14].
4 Component Classification After latent component are estimated by e.g. SmCA we need to label them as destructive or constructive. The problem with proper signal classification can be difficult task because obtained components might be not pure constructive or destructive due to many reasons like improper linear transformation assumption or other statistic characteristics than explored by chosen BSS method [21]. Consequently, it is possible that some component has constructive impact on one model and destructive on the other. There may also exist components destructive as a single but constructive in a group. Therefore, it is advisable to analyze each subset of ˆ the components separately. In particular, we eliminate each subset (use the matrix A ) and check the impact on the final results. Such process of component classification as destructive or constructive is simple and works well but for many components it is computationally extensive.
5 Generalized Mixing

As was mentioned above, the latent components may not be pure, so their impact should have a weight other than 0. It means that we can try to find a better mixing system than the one described by Â. The new mixing system can be formulated in a more general form than the linear one; e.g., we can employ an MLP neural network:

X̂ = g⁽²⁾(B⁽²⁾[g⁽¹⁾(B⁽¹⁾S + b⁽¹⁾)] + b⁽²⁾),  (17)
where g⁽ⁱ⁾(·) is a vector of nonlinearities, B⁽ⁱ⁾ a weight matrix and b⁽ⁱ⁾ a bias vector for the i-th layer, i = 1, 2. The first weight layer will produce results related to (4) if we take B⁽¹⁾ = Â. But we also employ nonlinearities and a second layer, so in comparison to the linear form the mixing system gains some flexibility. If we learn the whole structure starting from a system with initial weights B⁽¹⁾(0) = Â, we can expect the results to be better, see Fig. 2.
Fig. 2. The concept of filtration stage
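A forward pass of the remixing network (17) takes only a line of NumPy. Training the weights B⁽ⁱ⁾, b⁽ⁱ⁾ (by any gradient-based method) is omitted in this sketch, and the choice of tanh for the nonlinearities is ours:

```python
import numpy as np

def mlp_remix(S, B1, b1, B2, b2, g=np.tanh):
    # Eq. (17); with B1 initialized to A-hat the net starts at the linear mix
    return g(B2 @ g(B1 @ S + b1) + b2)
```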
6 Electricity Consumption Forecasting

The tests of the proposed concept were performed on the problem of energy load prediction [11]. Our task was to forecast the hourly energy consumption in Poland 24 hours ahead, based on the energy demand from the last 24 hours and calendar variables: month, day of the month, day of the week, and a holiday indicator. We learned six MLP neural networks using 1851 instances in the training, 1313 in the validation, and 1313 in the testing phase. The networks have the structure M1=MLP(5,12,1), M2=MLP(5,18,1), M3=MLP(5,24,1), M4=MLP(5,27,1), M5=MLP(5,30,1), M6=MLP(5,33,1), where in parentheses the number of neurons in each layer is given. The quality of the results is measured with the Mean Absolute Percentage Error:

MAPE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i| / y_i,  (18)
where i is the index of an observation, N the number of instances, y_i the real load value, and ŷ_i the predicted value. In Table 2 we can observe the MAPE values for the primary models, the effects of improving the modelling results with each particular decomposition, and with decomposition supported by neural network remixing. The last column in Table 2 shows the percentage improvement of the best result from each method versus the best primary result.
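The same measure in NumPy (a one-line sketch):

```python
import numpy as np

def mape(y, y_hat):
    # Eq. (18): mean absolute percentage error
    return np.mean(np.abs(y - y_hat) / y)
```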
Table 2. Values of MAPE for the primary models and after improvement

Methods           M1     M2     M3     M4     M5     M6     Best result   %
Primary results   2.392  2.365  2.374  2.402  2.409  2.361  2.361         -
Decorr.           2.304  2.256  2.283  2.274  2.255  2.234  2.234         5.4
Smooth            2.301  2.252  2.357  2.232  2.328  2.317  2.232         5.5
ICA               2.410  2.248  2.395  2.401  2.423  2.384  2.248         4.8
Decorr&NN         2.264  2.241  2.252  2.247  2.245  2.226  2.226         5.7
Smooth&NN         2.224  2.227  2.223  2.219  2.232  2.231  2.219         6.0
ICA&NN            2.327  2.338  2.377  2.294  2.299  2.237  2.237         5.3

Fig. 3. The MAPE for the primary models, after improvement with SmCA, and after improvement by SmCA&NN
To compare the obtained results with other ensemble methods, we also applied bagging and boosting techniques to the presented problem of energy load prediction. They produced predictions with MAPE of 2.349 and 2.226, respectively, which means results slightly worse than SmCA with neural generalisation.
7 Conclusions

Smooth Component Analysis, as well as the other Blind Signal Separation methods, can be successfully used as a novel methodology for prediction improvement. The practical experiment with energy load prediction confirmed the validity of our method. Due to lack of space we compare the SmCA approach only with basic BSS methods like decorrelation and ICA. For the same reason an extended comparison with other ensemble methods is left as further work.
References

1. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
2. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural Computation 11, 157–192 (1999)
3. Choi, S., Cichocki, A.: Blind separation of nonstationary sources in noisy mixtures. Electronics Letters 36(9), 848–849 (2000)
4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley, Chichester (2002)
5. Donoho, D.L., Elad, M.: Maximal Sparsity Representation via l1 Minimization. Proc. Nat. Acad. Sci. 100, 2197–2202 (2003)
6. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins, Baltimore (1996)
7. Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan, NY (1994)
8. Hoeting, J., Madigan, D., Raftery, A., Volinsky, C.: Bayesian model averaging: a tutorial. Statistical Science 14, 382–417 (1999)
9. Hurst, H.E.: Long term storage capacity of reservoirs. Trans. Am. Soc. Civil Engineers 116 (1951)
10. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley, Chichester (2001)
11. Lendasse, A., Cottrell, M., Wertz, V., Verleysen, M.: Prediction of Electric Load using Kohonen Maps – Application to the Polish Electricity Consumption. In: Proc. Am. Control Conf., Anchorage, AK, pp. 3684–3689 (2002)
12. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999)
13. Mitchell, T.: Machine Learning. McGraw-Hill, Boston (1997)
14. Scales, L.E.: Introduction to Non-Linear Optimization. Springer, NY (1985)
15. Stone, J.V.: Blind Source Separation Using Temporal Predictability. Neural Computation 13(7), 1559–1574 (2001)
16. Szupiluk, R., Wojewnik, P., Zabkowski, T.: Model Improvement by the Statistical Decomposition. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 1199–1204. Springer, Heidelberg (2004)
17. Therrien, C.W.: Discrete Random Signals and Statistical Signal Processing. Prentice Hall, New Jersey (1992)
18. Yang, Y.: Adaptive regression by mixing. Journal of American Statistical Association 96 (2001)
19. Zibulevsky, M., Kisilev, P., Zeevi, Y.Y., Pearlmutter, B.A.: BSS via multinode sparse representation. Adv. in Neural Information Proc. Sys. 14, 185–191 (2002)
Speed and Accuracy Enhancement of Linear ICA Techniques Using Rational Nonlinear Functions

Petr Tichavský¹, Zbyněk Koldovský¹,², and Erkki Oja³

¹ Institute of Information Theory and Automation, Pod vodárenskou věží 4, P.O. Box 18, 182 08 Praha 8, Czech Republic [email protected] http://si.utia.cas.cz/Tichavsky.html
² Faculty of Mechatronic and Interdisciplinary Studies, Technical University of Liberec, Hálkova 6, 461 17 Liberec, Czech Republic
³ Adaptive Informatics Research Centre, Helsinki University of Technology, P.O. Box 5400, 02015 HUT, Finland [email protected]
Abstract. Many linear ICA techniques are based on minimizing a nonlinear contrast function and many of them use a hyperbolic tangent (tanh) as their built-in nonlinearity. In this paper we propose two rational functions to replace the tanh and other popular functions that are tailored for separating supergaussian (long-tailed) sources. The advantage of the rational functions is two-fold. First, a rational function requires a significantly lower computational complexity than tanh, e.g. nine times lower. As a result, algorithms using the rational functions are typically twice as fast as algorithms with tanh. Second, it can be shown that a suitable selection of the rational function allows one to achieve better separation performance in certain scenarios. This improvement can be systematic if the rational nonlinearities are selected adaptively to the data.
1 Introduction
In this paper, a square linear ICA is treated (see e.g. [4,3]):

X = AS,  (1)
where X is a d × N data matrix. The rows of X are the observed mixed signals, thus d is the number of mixed signals and N is their length or the number of samples in each signal. Similarly, the unknown d × N matrix S includes samples of the original source signals. A is an unknown regular d × d mixing matrix. As usual in linear ICA, it is assumed that the elements of S, denoted s_ij, are mutually independent i.i.d. random variables with probability density functions (pdf) f_i(s_ij), i = 1, …, d. The row variables s_ij for all j = 1, …, N, having the same density, are thus an i.i.d. sample of one of the independent sources, denoted by s_i. It is assumed that at most one of the densities f_i(·) is Gaussian, and the unknown matrix A has full rank. In the following, let W denote the demixing matrix, W = A⁻¹.
Many popular ICA methods use a nonlinear contrast function to blindly separate the signals. Examples include FastICA [5], an enhanced version of the algorithm named EFICA [7], the recursive algorithm EASI [2], Extended Infomax [1], and many others. Adaptive choices of the contrast functions were proposed in [6,8]. In practical large-scale problems, the computational speed of an algorithm is a factor that limits its applications. The main goal of this paper is to propose suitable rational functions that can be quickly evaluated when used instead of tanh and other nonlinearities, and yet achieve the same or better performance. We design such suitable rational nonlinearities for the algorithms FastICA and EFICA, based on our recent analytical results on their asymptotic performances, see [5,7]. It is believed that the nonlinearities proposed here will work well when applied to other methods as well. The structure of the paper is as follows. In Section 2 we present a brief description of the algorithms FastICA and EFICA, and the analytic expressions that characterize the asymptotic performance of the methods. In Section 3 we propose A) two general-purpose rational nonlinearities that have similar performance to tanh, and B) nonlinearities that are tailored for separation of supergaussian (heavy-tailed) sources.
2 FastICA, EFICA, and Their Performance
In general, the FastICA algorithm is based on minimization/maximization of the criterion c(w) = E[G(wᵀZ)], where G(·) is a suitable nonlinearity, called a contrast function, applied elementwise to the row vector wᵀZ; see [4]. Next, w is the unitary vector of coefficients to be found that separates one of the independent components from a mixture Z. Here Z denotes mean-removed and decorrelated data, Z = Ĉ^{−1/2}(X − X̄), where Ĉ is the sample covariance matrix, Ĉ = (X − X̄)(X − X̄)ᵀ/N, and X̄ is the sample mean of the mixture data. In the following, in accordance with the standard notation [4], g(·) and g′(·) denote the first and the second derivative of the function G(·). The application of g(·) and g′(·) to the vector wᵀZ is elementwise. Classical widely used functions g(·) include "pow3", i.e. g(x) = x³ (then the algorithm performs kurtosis minimization), "tanh", i.e. g(x) = tanh(x), and "gauss", g(x) = x exp(−x²/2). The algorithm FastICA can be considered either in one-unit form, where only one row w of the estimated demixing matrix Ŵ is computed, or in symmetric form, which estimates the whole matrix W. The outcome of the symmetric FastICA obeys the orthogonality condition, meaning that the sample correlations of the separated signals are exactly zero. Recently, it was proposed to complete the symmetric FastICA by a test of saddle points that eliminates convergence to side minima of the cost function, which may occur for most nonlinearities g(·) [9]. The test consists in checking whether possible saddle points exist for each pair of the signal components exactly halfway between them in the angular sense. This test requires multiple evaluations
of the primitive (integral) function of g(·), i.e. G(·). If the intention is to perform the test of saddle points in a fast manner, then it is desired that G is a rational function as well.

We introduced recently a novel algorithm called EFICA [7], which is essentially an elaborate modification of the FastICA algorithm employing a data-adaptive choice of the nonlinearities used in FastICA, and thus reaching a very small asymptotic error. The algorithm is initialized by performing a symmetric FastICA with a fixed nonlinearity. After that, the algorithm uses the idea of a generalized symmetric FastICA and an adaptive choice of nonlinearities, which may be different for each signal component separately. The final demixing matrix does not obey the orthogonality condition. See [7] for details. For the purpose of this paper we shall assume that the adaptive selection of the nonlinearity in the EFICA algorithm is turned off and a fixed nonlinearity g(·) is used instead.

Assume now, for simplicity, that all signal components have the same probability distribution with density f(·). It was shown in [9] and in [7] that the asymptotic interference-to-signal ratio of the separated signals (one off-diagonal element of the ISR matrix) for the one-unit FastICA, for the symmetric FastICA and for EFICA is, respectively,

ISR_1U = (1/N)(γ/τ²),  ISR_SYM = (1/(2N))(γ/τ² + 1/2)  (2)

ISR_EF = (1/N) γ(γ + τ²) / [τ²γ + τ²(γ + τ²)]  (3)

where γ = β − μ², τ = |μ − ρ|, and

μ = ∫ s g(s) f(s) ds,  ρ = ∫ g′(s) f(s) ds,  β = ∫ g²(s) f(s) ds,
1/N + ISR1U 1/N + 2 ISR1U
(4)
and ISREF ≤ min {ISR1U , ISRSYM } .
(5)
It is well known that all three ISR’s are simultaneously minimized, when the nonlinearity g(·) is proportional to the score function of the distribution of the sources, g(x) = ψ(x) = −f (x)/f (x). To be more accurate the optimum nonlinearity may have the form g(x) = c1 ψ(x) + c2 x, where c1 and c2 are arbitrary 1
Note that it is the orthogonality constraint that makes the ISR of the symmetric FastICA lower bounded by 1/(4N ).
288
P. Tichavsk´ y, Z. Koldovsk´ y, and E. Oja
constants, c1 = 0. The choice of the constants c1 , c2 does not make any influence on the algorithm performance. For this case it was shown EFICA is maximally efficient: the ISR in (3) in fact equals the respective Cram´er-Rao-induced lower bound [9,7].
3
Optimality Issues
From the end of the previous section it is clear that it is not possible to suggest a nonlinearity that would be optimum for all possible probability distributions of the sources. The opposite is true, however: for each nonlinearity g there exists a source distribution fg such that all other nonlinearities, that are not linear combinations of g and x, perform worse in separating the data having this distribution (in the sense of mean ISR). The density fg can be found by solving the equation g(x) = −c1
fg (x) d + c2 x = −c1 [log fg (x)] + c2 x fg (x) dx
(6)
and has the solution
1 c2 2 fg (x) = exp − x + c3 . g(x)dx + c1 2c1
(7)
The constants c1 , c2 , and c3 should be selected in the way that f is a valid pdf, i.e. is nonnegative, its integral over the real line is one and have zero mean and the variance one. For example, the nonlinearity tanh is optimum for the source distributions of the form (8) ftanh = C0 exp(−C1 x2 )(cosh x)C2
0.5 0.45 C2=5 C2=2 C2=−2 GAUSS
0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 −5
−4
−3
−2
−1
0
1
2
3
4
5
Fig. 1. Probability density functions (8) for which tanh is the optimum nonlinearity
where C₀, C₁, and C₂ are suitable constants. It can be shown that for any C₂ it is possible to find C₀ and C₁ such that f_tanh is a valid density function. Examples of probability densities for which tanh is the optimum nonlinearity are shown in Figure 1. The pdfs are compared with the standard Gaussian pdf, which would be obtained for C₂ = 0. The figure explains why tanh works very well for so many different pdfs: the family includes supergaussian distributions for C₂ < 0 and subgaussian, even bimodal, distributions for C₂ > 0.
4 All-Purpose Nonlinearities
In this section we propose two rational functions that can replace tanh in FastICA and in other ICA algorithms:

g₁(x) = x / (1 + x²/4),  g₂(x) = x(2 + |x|) / (1 + |x|)².  (9)
The former has behavior very similar to that of tanh in a neighborhood of zero, and the latter has a global behavior that is more similar to tanh; see Figure 2. For example, as x → ±∞, g₂(x) approaches ±1. These rational functions will be called RAT1 and RAT2, for easy reference.
Fig. 2. Comparison of nonlinearities (a) TANH, RAT1 and RAT2 and (b) GAUSS, EXP1 and RAT3(4), discussed in Section 5. In diagram (b), the functions were scaled to have the same maximum value, 1.
The speed of evaluation of tanh and the rational functions can be compared as follows. In Matlab notation, put x = randn(1, 1000000). It was found that evaluation of the command y = tanh(x); takes 0.54 s, evaluation of RAT1 via the command y = x./(1 + x.^2/4); requires 0.07 s, and evaluation of RAT2 via the pair of commands h = x.*sign(x) + 1; and y = x.*(h + 1)./h.^2; requires 0.11 s.
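The same comparison is easy to reproduce outside Matlab, e.g. in Python/NumPy (our sketch; absolute timings are machine-dependent and will differ from the figures above):

```python
import numpy as np, time

x = np.random.randn(1_000_000)
nonlinearities = [
    ("tanh", np.tanh),
    ("RAT1", lambda x: x / (1 + 0.25 * x**2)),
    ("RAT2", lambda x: x * (2 + np.abs(x)) / (1 + np.abs(x))**2),
]
for name, fn in nonlinearities:
    t0 = time.perf_counter()
    for _ in range(20):
        fn(x)
    print(name, (time.perf_counter() - t0) / 20, "s per call")
```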
The computations were performed on an HP Unix workstation, using the Matlab profiler. We can conclude that evaluation of RAT1 is nine times faster than tanh, and evaluation of RAT2 is five times faster than tanh. As a result, FastICA using the nonlinearity RAT1 is about twice as fast as FastICA using tanh. The performance of the algorithms using the nonlinearities RAT1 and RAT2 appears to be very similar to that of the same algorithms using tanh for many probability distributions of the sources. Assume, for example, that the source distribution belongs to the class of generalized Gaussian distributions with parameter α, which will be denoted GG(α) for easy reference. The pdf of the distribution is proportional to exp(−β_α|x|^α), where β_α is a suitable function of α such that the distribution has unit variance. The asymptotic variance of one-unit FastICA (2) with the three nonlinearities is plotted as a function of α in Figure 3(a). The variance is computed for N = 1000. We can see that the performance of the algorithm with nonlinearity RAT1 is very similar to that of nonlinearity TANH. The performance of RAT2 is slightly better than the previous two if the sources are supergaussian (spiky), i.e. for α < 2, and slightly worse when the distribution is subgaussian (α > 2).
Fig. 3. Performance of one-unit FastICA with nonlinearities (a) TANH, RAT1 and RAT2 and (b) GAUSS, EXP1 and RAT3(4), discussed in Section 5, for sources with distribution GG(α) as a function of the parameter α
The advantage of RAT2 compared to RAT1 is that while the primitive function of g₁(x) is G₁(x) = 2 log(1 + x²/4), which is relatively complex to evaluate, the primitive function of g₂(x) is rational, G₂(x) = x²/(1 + |x|), and can be evaluated faster. This might be important for the test of saddle points. It is, however, possible to combine both approaches and use RAT1 in the main iteration, and the primitive function of RAT2 in the test of saddle points.

It can be shown that the asymptotic variance ISR_1U goes to infinity for any nonlinearity in rare cases when the source pdf is a linear combination of a supergaussian and a subgaussian distribution (τ in (2) is zero). An example is shown in Figure 4, where ISR_1U is plotted for sources s = βb + √(1 − β²)ℓ as a function of the parameter β, where b and ℓ stand for binary (BPSK) and Laplacean random variables, respectively. The performances of nonlinearities TANH and RAT1 appear to be very similar, while the performance of RAT2 is slightly different.
Fig. 4. Performance of one-unit FastICA with nonlinearities TANH, RAT1 and RAT2 for sources of the type s = βb + √(1 − β²)ℓ and N = 1000
5 Nonlinearities for Supergaussian Sources
In [5] the following nonlinearity was proposed for the separation of supergaussian (long-tailed) sources:

g(x) = x exp(−x²/2).  (10)

For a long time, this nonlinearity was considered the best known one for the separation of supergaussian sources. In [7] it was suggested to use the nonlinearity

g(x) = x exp(−η|x|),  (11)

where η = 3.348 was selected. This nonlinearity will be referred to as EXP1. This constant is the optimum constant for the nonlinearity of the form (11) provided that the distribution of the sources is Laplacean, i.e. GG(1). It was shown that the latter nonlinearity outperforms the former one for most distributions GG(α) with 0 < α < 2. It was also shown in [7] that for sources with the distribution GG(α), α ∈ (0, 1/2], the asymptotic performance of the algorithm grows monotonically with increasing η. In this paper we suggest using the following nonlinearity, denoted RAT3(b) for easy reference:

g_{3b}(x) = x / (1 + b|x|)².  (12)

We note that, as in the case of the nonlinearity EXP1, the slope of the function at x = 0 increases with growing parameter b. This phenomenon improves the asymptotic performance of the algorithm in the separation of highly supergaussian (long-tailed) sources, but makes the convergence of the algorithm more difficult.
We found that the choice b = 4 is quite a good trade-off between the performance and the ability to converge. Evaluation of the nonlinearity RAT3(b) was found to be about five times faster than evaluation of EXP1. The performance of the algorithm using the three nonlinearities in separating sources with the distribution GG(α), α < 2, is shown in Figure 3(b).
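For reference, EXP1 (11) and RAT3(b) (12) are one-liners; the sketch below uses the recommended b = 4:

```python
import numpy as np

def exp1(x, eta=3.348):
    return x * np.exp(-eta * np.abs(x))      # Eq. (11)

def rat3(x, b=4.0):
    return x / (1 + b * np.abs(x))**2        # Eq. (12)
```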
6 Conclusions
The rational nonlinearities were shown to be highly viable alternatives to the classical ones in terms of speed and accuracy. Matlab code of EFICA utilizing these nonlinearities can be downloaded from the second author's web page. Acknowledgements. This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the project 1M0572 and by the Grant Agency of the Czech Republic through the project 102/07/P384.
References
1. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
2. Cardoso, J.-F., Laheld, B.: Equivariant adaptive source separation. IEEE Tr. Signal Processing 45, 434–444 (1996)
3. Cichocki, A., Amari, S.-I.: Adaptive signal and image processing: learning algorithms and applications. Wiley, New York (2002)
4. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience, New York (2001)
5. Hyvärinen, A., Oja, E.: A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation 9, 1483–1492 (1997)
6. Karvanen, J., Eriksson, J., Koivunen, V.: Maximum likelihood estimation of ICA model for wide class of source distributions. Neural Networks for Signal Processing X 1, 445–454 (2000)
7. Koldovský, Z., Tichavský, P., Oja, E.: Efficient Variant of Algorithm FastICA for Independent Component Analysis Attaining the Cramér-Rao Lower Bound. IEEE Tr. Neural Networks 17, 1265–1277 (2006)
8. Pham, D.T., Garat, P.: Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. Signal Processing 45, 1712–1725 (1997)
9. Tichavský, P., Koldovský, Z., Oja, E.: Performance Analysis of the FastICA Algorithm and Cramér-Rao Bounds for Linear Independent Component Analysis. IEEE Tr. on Signal Processing 54, 1189–1203 (2006)
10. Vrins, F.: Contrast properties of entropic criteria for blind source separation, Ph.D. Thesis, Université catholique de Louvain (March 2007)
Comparative Speed Analysis of FastICA

Vicente Zarzoso and Pierre Comon
Laboratoire I3S, CNRS/UNSA
Les Algorithmes – Euclide-B, BP 121, 06903 Sophia Antipolis Cedex, France
{zarzoso,pcomon}@i3s.unice.fr

Abstract. FastICA is arguably one of the most widespread methods for independent component analysis. We focus on its deflation-based implementation, where the independent components are extracted one after another. The present contribution evaluates the method's speed in terms of the overall computational complexity required to reach a given source extraction performance. FastICA is compared with a simple modification referred to as RobustICA, which merely consists of performing exact line search optimization of the kurtosis-based contrast function. Numerical results illustrate the speed limitations of FastICA.
1 Introduction
Independent component analysis (ICA) aims at decomposing an observed random vector into statistically independent variables [1]. Among its numerous applications, ICA is the most natural tool for blind source separation (BSS) in instantaneous linear mixtures when the source signals are assumed to be independent. Under certain identifiability conditions, the independent components correspond to the sources up to admissible scale and permutation indeterminacies [1]. Two main approaches to ICA have been proposed to date. In the original definition of ICA carried out in early works such as [1] and [2], the independent components are extracted jointly or simultaneously, an approach sometimes called symmetric. On the other hand, the deflation approach estimates the sources one after another [3], and has also been shown to work successfully to separate convolutive mixtures [4]. Due to error accumulation throughout successive deflation stages, it is generally acknowledged that joint algorithms outperform deflationary algorithms without necessarily incurring an excessive computational cost. The FastICA algorithm [5], [6], [7], originally put forward in deflation mode, appeared when many other ICA methods had already been proposed, such as COM2 [1], JADE [2], COM1 [8], or the deflation methods by Tugnait [4] or Delfosse-Loubaton [3]. A thorough comparative study was carried out in [9], where FastICA is found to fail for weak or highly spatially correlated sources. More recently, its convergence has been shown to slow down or even fail in the presence of saddle points, particularly for short block sizes [10]. The objective of the present contribution is to carry out a brief critical review and experimental assessment of the deflationary kurtosis-based FastICA algorithm. In particular, we aim at evaluating objectively the algorithms' speed and
efficiency. For the sake of fairness, FastICA is not compared to joint extraction algorithms [1], [2], [3] but only to a simple modification called RobustICA, possibly the simplest deflation algorithm that can be thought of under the same general conditions.
2 Signal Model
The observed random vector x ∈ C^L is assumed to be generated from the instantaneous linear mixing model

x = Hs + n (1)
where the source vector s = [s1 , s2 , . . . , sK ]T ∈ CK is made of K ≤ L unknown mutually independent components. The elements of mixing matrix H ∈ CL×K are also unknown, and so is the noise vector n, which is only assumed to be independent of the sources. Our focus is on block implementations, which, contrary to common belief, are not necessarily more costly than adaptive (recursive, on-line, sample-by-sample) algorithms, and are able to use more effectively the information contained in the observed signal block. Given a sensor-output signal block composed of T samples, ICA aims at estimating the corresponding T -sample realization of the source vector.
3 FastICA Revisited

3.1 Optimality Criteria
In the deflation approach, an extracting vector w is sought so that the estimate

z = w^H x (2)

maximizes some optimality criterion or contrast function, and is hence expected to be a component independent from the others. A widely used contrast is the normalized kurtosis, which can be expressed as:

K(w) = (E{|z|⁴} − 2E²{|z|²} − |E{z²}|²) / E²{|z|²}. (3)
This criterion is easily seen to be insensitive to scale, i.e., K(λw) = K(w), ∀λ ≠ 0. Since this scale indeterminacy is typically unimportant, we can impose, without loss of generality, the normalization ‖w‖ = 1 for numerical convenience. The kurtosis maximization (KM) criterion started to receive attention with the pioneering work of Donoho [11] and Shalvi-Weinstein [12] on blind equalization, and was later employed for source separation even in the convolutive-mixture case [4]. Contrast (3) is quite general in that it does not require the observations to be prewhitened and can be applied to real- or complex-valued sources without any modification.
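For concreteness, a sample-based version of contrast (3) can be written in a few lines. The sketch below (Python/NumPy is an assumption, as the paper shows no code) replaces the expectations with sample means over a T-sample block and illustrates the scale invariance K(λw) = K(w).

import numpy as np

def kurtosis_contrast(w, X):
    # X is L x T (real or complex); z = w^H x evaluated over the block
    z = w.conj() @ X
    m2 = np.mean(np.abs(z) ** 2)            # E{|z|^2}
    m4 = np.mean(np.abs(z) ** 4)            # E{|z|^4}
    c2 = np.mean(z ** 2)                    # E{z^2}, non-circular moment
    return (m4 - 2.0 * m2 ** 2 - np.abs(c2) ** 2) / m2 ** 2

X = np.random.randn(4, 1000)
w = np.random.randn(4)
print(kurtosis_contrast(w, X), kurtosis_contrast(3.7 * w, X))  # identical values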
To simplify the source extraction, the kurtosis-based FastICA algorithm [5], [6], [7] first applies a prewhitening operation resulting in transformed observations with an identity covariance matrix, R_x = E{xx^H} = I. In the real-valued case, contrast (3) then becomes equivalent to the fourth-order moment criterion:

M(w) = E{|z|⁴}, (4)

which must be optimized under a constraint, e.g., ‖w‖ = 1, to avoid arbitrarily large values of z. Under the same constraint, criteria (3) and (4) are also equivalent if the sources are complex-valued but second-order circular, i.e., the non-circular second-moment matrix C_x = E{xx^T} is null. Consequently, contrast (4) is less general than criterion (3) in that it requires the observations to be prewhitened and the sources to be real-valued, or complex-valued but circular. Indeed, the extension of the FastICA algorithm to complex signals [13], [14] is only valid for second-order circular sources. In the remainder, we shall restrict our attention to sources fulfilling these requirements.

3.2 Contrast Optimization
Under the constraint ‖w‖ = 1, the stationary points of M(w) are obtained as a collinearity condition:

E{|w^H x|² xx^H} w = λw (5)

where λ is a Lagrangian multiplier. As opposed to the claims of [5], eqn. (5) is a fixed-point equation only if λ is known, which is not the case here; λ must be determined so as to satisfy the constraint, and thus it depends on w₀, the optimal value of w: λ = M(w₀) = E{|w₀^H x|⁴}. For the sake of simplicity, λ is arbitrarily set to a deterministic fixed value [5], [7], so that FastICA becomes an approximate standard Newton algorithm, as eventually pointed out in [6]. In the real-valued case, the Hessian matrix of M(w) is approximated as

E{(w^T xx^T w) xx^T} ≈ E{w^T xx^T w} E{xx^T} = (w^T w) I = I. (6)
As a result, the kurtosis-based FastICA reduces to a gradient-descent algorithm with a judiciously chosen fixed step size leading to cubic convergence:

w⁺ = w − (1/3) E{x (w^T x)³} (7)
w⁺ ← w⁺ / ‖w⁺‖. (8)

This is a particular instance of the family of algorithms proposed in [4].
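A direct transcription of (7)-(8) into Python/NumPy (an illustrative sketch, not the authors' code) is:

import numpy as np

def fastica_kurtosis_step(w, Z):
    # One fixed-point iteration (7)-(8); Z is L x T prewhitened real data.
    y = w @ Z                                       # w^T x over the block
    w_new = w - np.mean(Z * y ** 3, axis=1) / 3.0   # Eq. (7)
    return w_new / np.linalg.norm(w_new)            # Eq. (8)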
4 RobustICA
A simple and quite natural modification of FastICA consists of performing an exact line search of the kurtosis contrast (3):

μ_opt = arg max_μ K(w + μg). (9)
The search direction g is typically (but not necessarily) the gradient: g = ∇_w K(w). Exact line search is in general computationally intensive and presents other limitations [15], which explains why, despite being a well-known optimization method, it is very rarely used. However, for criteria that can be expressed as rational functions of μ, such as the kurtosis, the constant modulus [16], [17] and the constant power [18], [19] contrasts, the optimal step size μ_opt can easily be determined by finding the roots of a low-degree polynomial. At each iteration, optimal step-size (OS) optimization performs the following steps:

S1) Compute the OS polynomial coefficients. For the kurtosis contrast, the OS polynomial is given by

p(μ) = Σ_{k=0}^{4} a_k μ^k. (10)
The coefficients {a_k}_{k=0}^{4} can easily be obtained at each iteration from the observed signal block and the current values of w and g (their expressions are not reproduced here due to lack of space; see [20] for details). Numerical conditioning in the determination of μ_opt can be improved by normalizing the gradient vector.

S2) Extract the OS polynomial roots {μ_k}_{k=1}^{4}. The roots of the 4th-degree polynomial (quartic) can be found at practically no cost using standard algebraic procedures known since the 16th century, such as Ferrari's formula [15].

S3) Select the root leading to the absolute maximum: μ_opt = arg max_k K(w + μ_k g).

S4) Update w⁺ = w + μ_opt g.

S5) Normalize as in (8).

For sufficient sample size, the computational cost per iteration is (5L + 12)T flops, whereas that of FastICA's iteration (7) is 2(L + 1)T flops. A flop is conventionally defined as a real product followed by an addition. To extract more than one independent component, the Gram-Schmidt-type deflationary orthogonalization procedure proposed for FastICA [5], [6], [7] can also be used in conjunction with RobustICA. After step S4, the updated extracting vector is constrained to lie in the orthogonal subspace of the extracting vectors previously found.
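Steps S2-S5 admit a compact sketch. In the hedged Python version below, the quartic coefficients a = [a0, ..., a4] are assumed to have been computed in step S1 (their expressions are in [20]), and the real-valued specialization of contrast (3) is used for the root selection.

import numpy as np

def contrast(w, X):
    # Real-valued case of (3): (E{z^4} - 3 E^2{z^2}) / E^2{z^2}
    z = w @ X
    m2 = np.mean(z ** 2)
    return (np.mean(z ** 4) - 3.0 * m2 ** 2) / m2 ** 2

def os_update(w, g, a, X):
    roots = np.roots(a[::-1])                      # S2: highest-degree coefficient first
    cand = roots[np.abs(roots.imag) < 1e-9].real   # keep the real roots
    mu_opt = max(cand, key=lambda mu: contrast(w + mu * g, X))  # S3
    w_new = w + mu_opt * g                         # S4
    return w_new / np.linalg.norm(w_new)           # S5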
5 Numerical Experiments
The experimental analysis of this section aims at evaluating objectively the speed and efficiency of FastICA and RobustICA in several simulation conditions. The influence of prewhitening on the methods’ performance is also assessed.
Performance-complexity trade-off. Noiseless unitary random mixtures of K independent unit-power BPSK sources are observed at the output of an L = K element array in signal blocks of T samples. The search for each extracting vector is initialized with the corresponding canonical basis vector, and is stopped at a fixed number of iterations. The total cost of the extraction can then be computed as the product of the number of iterations, the cost per iteration per source (Sec. 4) and the number of sources. Prewhitening, if used, also adds to the total cost. The complexity per source per sample is given by the total cost divided by KT. As a measure of extraction quality, we employ the signal mean square error (SMSE), a contrast-independent criterion defined as

SMSE = (1/K) Σ_{k=1}^{K} E{|s_k − ŝ_k|²}. (11)
The estimated sources are optimally scaled and permuted before evaluating the SMSE. This performance index is averaged over 1000 independent random realizations of the sources and the mixing matrix. Extraction solutions are computed directly from the observed unitary mixtures (methods labelled as 'FastICA' and 'RobustICA') and after a prewhitening stage based on the SVD of the observed data matrix ('pw+FastICA', 'pw+RobustICA'). The cost of the prewhitening stage is of the order of 2K²T flops. Fig. 1 summarizes the performance-complexity variation obtained for T = 150 samples and different values of the mixture size K. Clearly, the best trade-off is provided by RobustICA without prewhitening: a given performance level is achieved at lower cost or, alternatively, an improved extraction quality is reached with a given complexity.
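The SMSE evaluation just described (optimal scaling and permutation before applying (11)) can be sketched as follows; the Hungarian algorithm is one reasonable choice for the pairing, which the paper does not specify.

import numpy as np
from scipy.optimize import linear_sum_assignment

def smse(S, S_hat):
    # S, S_hat: K x T arrays of true and estimated (real) sources
    K = S.shape[0]
    mse = np.empty((K, K))
    for i in range(K):                  # candidate estimate
        for j in range(K):              # candidate true source
            a = (S_hat[i] @ S[j]) / (S_hat[i] @ S_hat[i])  # optimal scale
            mse[i, j] = np.mean((S[j] - a * S_hat[i]) ** 2)
    rows, cols = linear_sum_assignment(mse)        # best permutation
    return mse[rows, cols].mean()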
[Figure 1: SMSE (dB) versus complexity per source per sample (flops), with curves for K = 5, 10 and 20 and legend FastICA, pw+FastICA, RobustICA, pw+RobustICA.]
Fig. 1. Average extraction quality against computational cost for different mixture sizes K, with signal blocks of T = 150 samples
[Figure 2: two panels (a) and (b) plotting SMSE (dB) versus sample size T, with legends FastICA, RobustICA, MMSE and pw+FastICA, pw+RobustICA, MMSE.]

Fig. 2. Average extraction quality against signal block size for unitary mixtures of K = 10 sources and a total complexity of 400 flops/source/sample: (a) without prewhitening, (b) with prewhitening. '×': SNR = 10 dB; '': SNR = 20 dB; '◦': SNR = 40 dB
The use of prewhitening worsens RobustICA's performance-complexity trade-off and, due to the finite sample size,
imposes the same SMSE bound for the two methods. Using prewhitening, FastICA improves considerably and becomes slightly faster than RobustICA. Efficiency. We now evaluate the methods' performance for a varying block sample size T. Extractions are obtained by limiting the number of iterations per source, as explained above. To make the comparison meaningful, the overall complexity is fixed at 400 flops/source/sample for all tested methods. Accordingly, since RobustICA is more costly per iteration than FastICA, it performs fewer iterations per source. Isotropic additive white real Gaussian noise is present at the sensor output, with signal-to-noise ratio:

SNR = trace(HH^T) / (σ_n² L). (12)
Results for the minimum mean square error (MMSE) receiver are also obtained by jointly estimating the separating vectors assuming that all transmitted symbols are used for training. The MMSE can be considered as a performance bound for linear extraction. Fig. 2(a) shows the results without prewhitening for random unitary mixtures of K = 10 sources and three different SNR values (10 dB, 20 dB and 40 dB). RobustICA attains the MMSE bound for block sizes of about 1000 samples for the tested SNR levels; the required block size can be shown to decrease for smaller K. FastICA seems to require longer block sizes, particularly for noisier conditions at the given overall complexity. As shown in Fig. 2(b), the use of prewhitening in the same experiment worsens the performance-complexity ratio of RobustICA while improving that of FastICA, making both methods’ efficiency comparable.
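The trained MMSE receiver used as a bound can be sketched as follows; this is the standard linear MMSE solution with sample statistics, since the paper does not detail its computation.

import numpy as np

def mmse_receiver(X, S):
    # X: L x T observations, S: K x T transmitted symbols, all used for training;
    # W minimizes E{||s - Wx||^2}.
    T = X.shape[1]
    Rsx = S @ X.T / T                   # E{s x^T}
    Rxx = X @ X.T / T                   # E{x x^T}
    return Rsx @ np.linalg.inv(Rxx)     # W = R_sx R_x^{-1}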
6 Conclusions
The computational complexity required to reach a given source extraction quality is put forward as a natural objective measure of convergence speed for BSS/ICA algorithms. The kurtosis-based FastICA method can be considered as a gradient-based algorithm with constant step size. Its speed is shown to depend heavily on prewhitening and sometimes on initialization. Without the performance limitations imposed by the second-order preprocessing, a simple algebraic line optimization of the more general kurtosis contrast proves computationally faster and more efficient than FastICA, even in scenarios favouring the latter method. Although not demonstrated in this paper, RobustICA is also more robust to initialization [20], and the optimal step-size technique it relies on proves less sensitive to saddle points or local extrema [17], [19].
References
1. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
2. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings-F 140(6), 362–370 (1993)
3. Delfosse, N., Loubaton, P.: Adaptive blind separation of independent sources: a deflation approach. Signal Processing 45(1), 59–83 (1995)
4. Tugnait, J.K.: Identification and deconvolution of multichannel non-Gaussian processes using higher order statistics and inverse filter criteria. IEEE Transactions on Signal Processing 45, 658–672 (1997)
5. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9(7), 1483–1492 (1997)
6. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
7. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
8. Comon, P., Moreau, E.: Improved contrast dedicated to blind separation in communications. In: Proc. ICASSP-97, 22nd IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, April 20-24, 1997, pp. 3453–3456. IEEE Computer Society Press, Los Alamitos (1997)
9. Chevalier, P., Albera, L., Comon, P., Ferreol, A.: Comparative performance analysis of eight blind source separation methods on radiocommunications signals. In: Proc. International Joint Conference on Neural Networks, Budapest, Hungary (July 25-29, 2004)
10. Tichavsky, P., Koldovsky, Z., Oja, E.: Performance analysis of the FastICA algorithm and Cramér-Rao bounds for linear independent component analysis. IEEE Transactions on Signal Processing 54(4), 1189–1203 (2006)
11. Donoho, D.: On minimum entropy deconvolution. In: Proc. 2nd Applied Time Series Analysis Symposium, Tulsa, OK, pp. 565–608 (1980)
12. Shalvi, O., Weinstein, E.: New criteria for blind deconvolution of nonminimum phase systems (channels). IEEE Transactions on Information Theory 36(2), 312–321 (1990)
13. Bingham, E., Hyvärinen, A.: A fast fixed-point algorithm for independent component analysis of complex valued signals. International Journal of Neural Systems 10(1), 1–8 (2000)
14. Ristaniemi, T., Joutsensalo, J.: Advanced ICA-based receivers for block fading DS-CDMA channels. Signal Processing 82(3), 417–431 (2002)
15. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C. The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
16. Godard, D.N.: Self-recovering equalization and carrier tracking in two-dimensional data communication systems. IEEE Transactions on Communications 28(11), 1867–1875 (1980)
17. Zarzoso, V., Comon, P.: Optimal step-size constant modulus algorithm. IEEE Transactions on Communications (to appear, 2007)
18. Grellier, O., Comon, P.: Blind separation of discrete sources. IEEE Signal Processing Letters 5(8), 212–214 (1998)
19. Zarzoso, V., Comon, P.: Blind and semi-blind equalization based on the constant power criterion. IEEE Transactions on Signal Processing 53(11), 4363–4375 (2005)
20. Zarzoso, V., Comon, P., Kallel, M.: How fast is FastICA? In: Proc. EUSIPCO-2006, XIV European Signal Processing Conference, Florence, Italy (September 4-8, 2006)
Kernel-Based Nonlinear Independent Component Analysis

Kun Zhang and Laiwan Chan
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
{kzhang,lwchan}@cse.cuhk.edu.hk
Abstract. We propose the kernel-based nonlinear independent component analysis (ICA) method, which consists of two separate steps. First, we map the data to a high-dimensional feature space and perform dimension reduction to extract the effective subspace; this is achieved by kernel principal component analysis (PCA) and can be considered as a pre-processing step. Second, we adjust a linear transformation in this subspace to make the outputs as statistically independent as possible. In this way, nonlinear ICA, a complex nonlinear problem, is decomposed into two relatively standard procedures. Moreover, to overcome the ill-posedness of nonlinear ICA solutions, we utilize the minimal nonlinear distortion (MND) principle for regularization, in addition to the smoothness regularizer. The MND principle states that we prefer the nonlinear ICA solution whose mixing system has minimal nonlinear distortion, since in practice the nonlinearity in the data generation procedure is usually not very strong.
1 Introduction
Independent component analysis (ICA) aims at recovering independent sources from their mixtures, without knowing the mixing procedure or any specific knowledge of the sources. In particular, in this paper we consider the general nonlinear ICA problem. Assume that the observed data x = (x_1, · · · , x_n)^T are generated from an independent random vector s = (s_1, · · · , s_n)^T by a nonlinear transformation x = H(s), where H is an unknown real-valued n-component mixing function. (For simplicity, it is usually assumed that the number of observable variables equals that of the original independent variables.) The general nonlinear ICA problem is to find a mapping G : R^n → R^n such that y = G(x) has statistically independent components. In the general nonlinear ICA problem, in order to model arbitrary nonlinear mappings, one may need to resort to a flexible nonlinear function approximator, such as the multi-layer perceptron (MLP) network or the radial basis function (RBF) network, to represent the nonlinear separation system G or the mixing system H (see,
This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.
e.g. [1]). In such a way, parameters at different locations of the network are adjusted simultaneously, which would probably slow down the learning procedure. Kernel-based methods have also been considered for solving the nonlinear blind source separation (BSS) problem [5,10].¹ These methods exploit the temporal information of the sources for source separation, and do not enforce mutual independence of the outputs. In [5], the data are first implicitly mapped to a high-dimensional feature space, and the effective subspace in feature space is extracted. TDSEP [13], a BSS algorithm based on temporal decorrelation, is then performed in the extracted subspace. Denote by d the reduced dimension. This method produces d outputs, from which one needs to select n outputs as an estimate of the original sources. This method produces successful results in many experiments. However, a problem is that its outputs may not contain the estimate of the original sources, due to the effect of spurious outputs. Moreover, this method may fail if some sources lack specific time structures. In this paper we propose a kernel-based method to solve nonlinear ICA. The separation system G is constructed using kernel methods, and the unknown parameters are adjusted by minimizing the mutual information between the outputs y_i. The first step of this method is similar to that in [5], and kernel principal component analysis (PCA) is adopted to construct the feature subspace of reduced dimension. In the second step we solve a linear problem: we adjust the n × d linear transformation matrix W to make the outputs statistically independent. As stated in [5], standard linear ICA algorithms do not work here. We derive the algorithm for learning W, which is in a similar form to the traditional gradient-based ICA algorithm. We then consider suitable regularization conditions with which the proposed kernel-based nonlinear ICA leads to nonlinear BSS. In the general nonlinear ICA problem, although we do not know the form of the nonlinearity in the data generation procedure, fortunately, the nonlinearity in the generation procedure of natural signals is usually not strong. Hence, provided that the nonlinear ICA outputs are mutually independent, we prefer the solution with the mixing transformation as close as possible to linear. This information, formulated as the minimal nonlinear distortion (MND) principle [12], can help to reduce the indeterminacies in solutions of nonlinear ICA greatly. MND and smoothness are incorporated for regularization in the kernel-based nonlinear ICA.
2 Kernel-Based Nonlinear ICA
Kernel-based learning has become a popular technique, in that it provides an elegant way to tackle nonlinear algorithms by reducing them to linear ones in some feature space F, which is related to the original input space R^n by a possibly nonlinear map Φ. Denote by x^(i) the ith sample of x. Dot products of the form Φ(x^(i)) · Φ(x^(j)) can be computed using kernel representations k(x^(i), x^(j)) = Φ(x^(i)) · Φ(x^(j)). Thus, any linear algorithm formulated in terms of dot products can be made nonlinear by making use of the kernel trick, without
¹ Note that kernel ICA [3] actually performs linear ICA with the kernel trick.
knowing explicitly the mapping Φ. Unfortunately, ICA cannot be kernelized directly, since it cannot be carried out using dot products. However, the kernel trick can still help to perform nonlinear ICA, in an analogous manner to the development of kTDSEP, a kernel-based algorithm for nonlinear BSS [5]. Kernel-based nonlinear ICA involves two separate steps. The first step is the same as that in kTDSEP: the data are implicitly mapped to a high-dimensional feature space and its effective subspace is extracted. As a consequence, the nonlinear problem in input space is transformed to a linear one in the reduced feature space. In the second step, a linear transformation in the reduced feature space is constructed such that it produces n statistically independent outputs. In this way nonlinear ICA is performed faithfully, without any assumption on the time structure of sources. Many techniques can help to find the effective subspace in feature space F. Here we adopt kernel PCA [11], since the subspace it produces gives the smallest reconstruction error in feature space. The effective dimension of feature space, denoted by d, can be determined by inspection of the eigenvalues of the kernel matrix. Let x be a test point, and let k̃(x^(i), x) = Φ̃(x^(i)) · Φ̃(x), where Φ̃ denotes the centered image in feature space. The pth centered nonlinear principal component of x, denoted by z_p, is of the form (for details see [11]):

z_p = Σ_{i=1}^{T} α̃_{pi} k̃(x^(i), x) (1)
Let z = (z_1, · · · , z_d)^T. It contains all principal components of the images Φ(x) in feature space. Consequently, in the following we just need to find an n × d matrix W which makes the components of

y = Wz (2)

as statistically independent as possible.
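The first step (kernel PCA, yielding the d × T matrix of features z) can be sketched as below; Python/NumPy is used for illustration, the eigenvector normalization follows [11], and the degree-4 polynomial kernel used later in Sec. 4 is included as an example.

import numpy as np

def kernel_pca_features(X, d, kernel):
    # X: n x T training data; returns the d x T matrix of centered
    # nonlinear principal components z, cf. Eq. (1) and [11].
    T = X.shape[1]
    K = kernel(X.T, X.T)                              # T x T Gram matrix
    J = np.eye(T) - np.ones((T, T)) / T
    Kc = J @ K @ J                                    # double centering
    vals, vecs = np.linalg.eigh(Kc)                   # ascending eigenvalues
    vals, vecs = vals[::-1][:d], vecs[:, ::-1][:, :d]
    alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))  # alpha_p = v_p / sqrt(lambda_p)
    return (Kc @ alphas).T                            # z_p(x_i) = sum_j alpha_pj Kc_ij

poly4 = lambda A, B: (A @ B.T + 1.0) ** 4             # k(a, b) = (a^T b + 1)^4, cf. Sec. 4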
2.1 Can Standard Linear ICA Work in Reduced Feature Space?
As claimed in [5], applying standard linear ICA algorithms, such as JADE [4] and FastICA [6], to the signals z does not give successful results. In our problem, z_p, p = 1, · · · , d, are nonlinear mixtures of only n independent sources, and we aim at transforming z_p to n signals (generally n ≪ d) which are statistically independent, with a linear transformation. But standard ICA algorithms, such as the natural gradient algorithm [2] and JADE, assume that W is square and invertible and try to extract d independent signals from z_i. So they cannot give successful results in our problem. Although FastICA, which aims at maximizing the nongaussianity of outputs, can be used in a deflationary manner, its relation to maximum likelihood of the ICA model or minimization of mutual information between outputs is established when the linear ICA model holds and W is square and invertible [7]. When the linear ICA model does not hold, as in our problem, nongaussianity of the outputs does not necessarily lead to independence between them. In fact, if we apply FastICA in a deflationary manner to z_i, the outputs y_i will be
extremely nongaussian, but they are not necessarily mutually independent. The extreme nongaussianity of y_i arises because, theoretically, with a properly chosen kernel function, by adjusting the ith row of W the mapping from x to y_i covers quite a large class of continuous functions.

2.2 Learning Rule
Now we aim to adjust W in Eq. 2 to make the outputs y_i as independent as possible. This can be achieved by minimizing the mutual information between the y_i, which is defined as I(y) = Σ_{i=1}^{n} H(y_i) − H(y), where H(·) denotes the differential entropy. Denote by J the Jacobian matrix of the transformation from x to y, i.e. J = ∂y/∂x, and by J_1 the Jacobian matrix of the transformation from x to z, i.e. J_1 = ∂z/∂x.² Due to Eq. 2, one can see J = W · J_1. We also have H(y) = H(x) + E log |det J|. Consequently,

I(y) = Σ_{i=1}^{n} H(y_i) − H(y) = − Σ_{i=1}^{n} E log p_{y_i}(y_i) − E log |det(W · J_1)| − H(x).

As H(x) does not depend on W, the gradient of I(y) w.r.t. W is

∂I(y)/∂W = E[ψ(y) · z^T] − E[J^{−T} · J_1^T] (3)

where ψ(y) = (ψ_1(y_1), · · · , ψ_n(y_n))^T with ψ_i being the score function of p_{y_i}, defined as ψ_i = −(log p_{y_i})′ = −p′_{y_i}/p_{y_i}. W can then be adjusted according to Eq. 3 with the gradient-based method. Note that the gradient in Eq. 3 is in a similar form to that in standard ICA; the only difference is that the second term becomes −E[W^{−T}] in standard ICA.³ In standard ICA, we can obtain correct ICA results even if the estimation of the densities p_{y_i} or the score functions ψ_i is not accurate. But in the nonlinear case, they should be estimated accurately. We use a mixture of 5 Gaussians to model p_{y_i}. After each iteration of Eq. 3, the parameters in the Gaussian mixture model are adjusted by the EM algorithm to adapt to the current outputs y_i.
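One iteration of the gradient (3) can be sketched as follows; the score functions ψ_i (estimated in the paper with a 5-component Gaussian mixture updated by EM) are abstracted here into a user-supplied function, and the per-sample Jacobians J_1 are assumed precomputed in the first step.

import numpy as np

def grad_step(W, Z, J1_all, score, lr=0.01):
    # W: n x d, Z: d x T features, J1_all: list of T Jacobians dz/dx (each d x n),
    # score: psi applied to the n x T output array.
    T = Z.shape[1]
    Y = W @ Z                                 # outputs, Eq. 2
    term1 = score(Y) @ Z.T / T                # E[psi(y) z^T]
    term2 = np.zeros_like(W)
    for J1 in J1_all:                         # E[J^{-T} J1^T], with J = W J1 (n x n)
        J = W @ J1
        term2 += np.linalg.inv(J).T @ J1.T
    term2 /= T
    return W - lr * (term1 - term2)           # gradient descent on I(y)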
3 With Minimum Nonlinear Distortion
Solutions to nonlinear ICA always exist and are highly non-unique [8]. In fact, in the general nonlinear ICA problem, nonlinear BSS is impossible without additional prior knowledge on the mixing model [9]. Smoothness of the mapping
² J_1 is involved in the obtained update rule Eq. 3. Since k̃(x^(i), x) = Φ̃(x^(i)) · Φ̃(x) = k(x^(i), x) − (1/T) Σ_{p=1}^{T} k(x^(p), x) − (1/T) Σ_{q=1}^{T} k(x^(i), x^(q)) + (1/T²) Σ_{p=1}^{T} Σ_{q=1}^{T} k(x^(p), x^(q)), we have ∂k̃(x^(i), x)/∂x = ∂k(x^(i), x)/∂x − (1/T) Σ_{p=1}^{T} ∂k(x^(p), x)/∂x. According to Eq. 1, the pth row of J_1 is then ∂z_p/∂x = Σ_{i=1}^{T} α̃_{pi} ∂k̃(x^(i), x)/∂x. This can be easily calculated and saved in the first step of our method for later use.
³ Assuming that W is square and invertible, the natural gradient ICA algorithm is obtained by multiplying the right-hand side of ∂I(y)/∂W by W^T W [2]. However, as W in Eq. 2 is n × d, the natural gradient for W cannot be derived in this simple way.
G provides a useful regularization condition to lead nonlinear ICA to nonlinear BSS [1]. But it seems not sufficient, as shown by the counterexample in [9]. In this paper, in addition to the smoothness regularization, we exploit the "minimal nonlinear distortion" (MND) principle [12] for regularization of nonlinear ICA. MND has exhibited quite good performance for regularizing nonlinear ICA when the nonlinearity in the data generation procedure is not very strong [12]. The objective function to be minimized thus becomes

J(W) = I(y) + λ_1 R_1(W) + λ_2 R_2(W) (4)

where R_1 denotes the regularization term for achieving MND, R_2 is that for enforcing smoothness, and λ_1 and λ_2 are the corresponding regularization parameters.

3.1 Minimum Nonlinear Distortion
MND states that, under the condition that the separation outputs y_i are mutually independent, we prefer the nonlinear mixing mapping H that is as close as possible to linear. So R_1 is a measure of "closeness to linear" of H. Given a nonlinear mapping H, its deviation from the affine mapping A* that fits H best among all affine mappings A is an indicator of its "closeness to linear", or of the level of its nonlinear distortion. Mean square error (MSE) is adopted to measure the deviation, since it greatly facilitates the following analysis. Let x* = (x*_1, · · · , x*_n)^T = A*y. R_1 can be defined as the total MSE between x_i and x*_i (here we assume that both x and y are zero-mean):

R_1 = E{(x − x*)^T (x − x*)}, where x* = A*y, and A* = arg min_A E{(x − Ay)^T (x − Ay)} (5)

The derivative of R_1 w.r.t. A* is ∂R_1/∂A* = −2E{(x − A*y)y^T}. Setting the derivative to 0 gives A*: E{(x − A*y)y^T} = 0 ⇔ A* = E{xy^T}[E{yy^T}]^{−1}. We can see that due to the adoption of MSE, A* can be obtained in closed form. This greatly simplifies the derivation of learning rules. We then have R_1 = Tr E[(x − A*y)(x − A*y)^T] = −Tr(E[xy^T]{E[yy^T]}^{−1} E[yx^T]) + const. Since in the learning process the y_i are approximately independent from each other, they are approximately uncorrelated. We can also normalize the variance of y_i after each iteration. Consequently E[yy^T] = I. Let L = E[xz^T]; we then have E[xy^T] = LW^T. Thus R_1 = −Tr(LW^T WL^T) + const. This gives

∂R_1/∂W = −2WL^T L (6)
It was suggested to initialize λ_1 in Eq. 4 with a large value at the beginning of training and to decrease it to a small constant during the learning process [12]. A large value of λ_1 at the beginning helps to reduce the possibility of getting into unwanted solutions or local optima. As training goes on, the influence of the regularization term is reduced, and G gains more freedom. In addition, initializing G to an almost identity mapping would also be useful. This can be achieved by simply initializing W with W = E[xz^T]{E[zz^T]}^{−1}.
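Both the MND gradient (6) and the suggested initialization of W have simple sample-based forms; a minimal sketch follows (the small ridge term is our own numerical safeguard, not part of the paper).

import numpy as np

def mnd_gradient(W, X, Z):
    # X: n x T zero-mean inputs, Z: d x T features; dR1/dW = -2 W L^T L, Eq. (6)
    T = X.shape[1]
    L = X @ Z.T / T                     # L = E[x z^T]
    return -2.0 * W @ L.T @ L

def init_W(X, Z):
    # W = E[x z^T] {E[z z^T]}^{-1}, making G an almost identity mapping
    d = Z.shape[0]
    return (X @ Z.T) @ np.linalg.inv(Z @ Z.T + 1e-8 * np.eye(d))  # ridge: assumption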
The MND principle can be incorporated in many nonlinear ICA/BSS methods to avoid unwanted solutions, under the condition that the nonlinearity in the mixing procedure is not too strong. As an example, for kTDSEP [5], MND provides a way to select a subset of output components corresponding to the original sources [12].

3.2 Smoothness: Local Minimum Nonlinear Distortion
Both MND and smoothness are used for regularization in our nonlinear ICA method. In fact, the smoothness regularizer exploiting second-order derivatives is related to MND: enforcing local closeness to linear of the transformation at every point leads to such a smoothness regularizer [12].

For a one-dimensional sufficiently smooth function g(x), its second-order Taylor expansion in the vicinity of x is g(x + ε) ≈ g(x) + (∂g/∂x)^T ε + (1/2) ε^T H_x ε, where ε is a small variation of x and H_x denotes the Hessian matrix of g. Let ℓ_{ij} = ∂²g/∂x_i∂x_j. It can be shown [12] that if we use the first-order Taylor expansion of g at x to approximate g(x + ε), the square error is

[g(x + ε) − g(x) − (∂g/∂x)^T ε]² ≈ (1/4) (ε^T H_x ε)² = (1/4) (Σ_{i,j=1}^{n} ℓ_{ij} ε_i ε_j)²
≤ (1/4) (Σ_{i=1}^{n} ℓ_{ii}² + 2 Σ_{i<j} ℓ_{ij}²)(Σ_{i=1}^{n} ε_i⁴ + 2 Σ_{i<j} ε_i² ε_j²) = (1/4) ‖ε‖⁴ Σ_{i,j=1}^{n} ℓ_{ij}²

The above inequality holds due to Cauchy's inequality. We can see that in order to make g locally close to linear at every point in the domain D_x of x, we just need to minimize ∫_{D_x} Σ_{i,j=1}^{n} ℓ_{ij}² dx. When the mapping is vector-valued, we apply this regularizer to each output of the mapping. R_2 in Eq. 4 can then be constructed as R_2 = ∫_{D_x} Σ_{i,j=1}^{n} P_{ij} dx, where P_{ij} = Σ_{l=1}^{n} (∂²y_l/∂x_i∂x_j)². The derivation of ∂R_2/∂W is straightforward. In the result, ∂²z_p/∂x_i∂x_j is involved; it can be computed and saved in the first step of kernel-based nonlinear ICA.
4 Experiments
According to the experimental results in [1] and our experience, mixtures of subgaussian sources are more difficult to separate well than those of supergaussian sources. So, to save space, here we just present some experimental results on separating two subgaussian sources. The sources are a sawtooth signal (s_1) and an amplitude-modulated waveform (s_2), with 1000 samples. The x_i are generated in the same form as the example in Sec. 4 of [5], i.e. x = Bs + c s_1 s_2, but here c = (−0.15, 0.5)^T. The waveforms and scatterplots of s_i and x_i are shown in Fig. 1, from which we can see that the nonlinear effect is significant. The regularization parameter for enforcing smoothness is λ_2 = 0.2, and that for enforcing MND, λ_1, decays from 0.3 to 0.01 during the learning process.
Kernel-Based Nonlinear Independent Component Analysis
s1
2
4
2
0
2
400
0
x
200
s2
0
2
1 −2
307
0
s2
2 −1
−2
0
−2
−2 0
200
400
−2
0 s1
−4 −5
2
0 x1
5
Fig. 1. Source and their nonlinear mixtures. Left: waveforms of sources. Middle: scatterplot of sources. Right: scatterplot of mixtures.
We chose the polynomial kernel of degree 4, i.e. k(a, b) = (a^T b + 1)⁴, and found d = 14. Here we compare the separation results of four methods/schemes: linear ICA (FastICA is adopted), kernel-based nonlinear ICA (kNICA) without explicit regularization, kNICA with only the smoothness regularization, and kNICA with both smoothness and MND regularization. Table 1 shows the SNR of the recovered signals. Numbers in parentheses are the SNR values after trivial indeterminacies are removed.⁴ Fig. 2 shows the scatterplots of y_i obtained by the various schemes. In this experiment, kNICA with the smoothness and MND regularization clearly gives the best separation result.

Table 1. SNR of the separation results of the various methods (schemes)

Channel | FastICA     | kNICA (no regu.) | kNICA (smooth) | kNICA (smooth & MND)
No. 1   | 3.72 (4.59) | 9.25 (9.69)      | 11.1 (14.4)    | 12.1 (16.5)
No. 2   | 5.76 (6.04) | 6.07 (8.19)      | 8.9 (12.7)     | 15.4 (25.1)
[Figure 2: four scatterplots of y_2 versus y_1, panels (a)-(d).]
Fig. 2. Scatterplots of y_i obtained by the various methods/schemes. (a) FastICA. (b) kNICA without explicit regularization. (c) kNICA with the smoothness regularizer. (d) kNICA with the smoothness and MND regularization.

⁴ We applied a 1-8-1 MLP, denoted by T, to y_i to minimize the square error between s_i and T(y_i). In this way trivial indeterminacies are removed.
5 Conclusion
We have proposed to solve the nonlinear ICA problem using kernels. In the first step of the method, the data are mapped to a high-dimensional feature space and the effective subspace is extracted. Thanks to the kernel trick, in the second step we only need to solve a linear problem. The algorithm for this second step was derived in a form similar to standard ICA. In order to achieve nonlinear BSS, we incorporated the minimal nonlinear distortion principle and the smoothness regularizer for regularization of the proposed nonlinear ICA method. MND helps to overcome the ill-posedness of nonlinear ICA, under the condition that the nonlinearity in the mixing procedure is not very strong. This condition usually holds for practical problems.
References
1. Almeida, L.B.: MISEP - linear and nonlinear ICA based on mutual information. Journal of Machine Learning Research 4, 1297–1318 (2003)
2. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind signal separation. In: Advances in Neural Information Processing Systems (1996)
3. Bach, F.R., Jordan, M.I.: Beyond independent components: trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003)
4. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings-F 140(6), 362–370 (1993)
5. Harmeling, S., Ziehe, A., Kawanabe, M., Müller, K.R.: Kernel-based nonlinear blind source separation. Neural Computation 15, 1089–1124 (2003)
6. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
7. Hyvärinen, A.: The fixed-point algorithm and maximum likelihood estimation for independent component analysis. Neural Processing Letters 10(1), 1–5 (1999)
8. Hyvärinen, A., Pajunen, P.: Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks 12(3), 429–439 (1999)
9. Jutten, C., Karhunen, J.: Advances in nonlinear blind source separation. In: Proc. ICA2003, pp. 245–256 (2003). Invited paper in the special session on nonlinear ICA and BSS
10. Martinez, D., Bray, A.: Nonlinear blind source separation using kernels. IEEE Transactions on Neural Networks 14(1), 228–235 (2003)
11. Schölkopf, B., Smola, A., Müller, K.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
12. Zhang, K., Chan, L.: Nonlinear independent component analysis with minimum nonlinear distortion. In: ICML 2007, Corvallis, OR, US, pp. 1127–1134 (2007)
13. Ziehe, A., Müller, K.R.: TDSEP − an efficient algorithm for blind separation using time structure. In: Proc. ICANN98, Skövde, Sweden, pp. 675–680 (1998)
Linear Prediction Based Blind Source Extraction Algorithms in Practical Applications

Zhi-Lin Zhang¹,² and Liqing Zhang¹
¹ Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
² School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected], [email protected]
Abstract. Blind source extraction (BSE) has advantages over blind source separation (BSS) when one wants to obtain only some of the underlying source signals from high-dimensional observed signals. Among the variety of BSE algorithms, a large number are based on linear prediction (LP-BSE). In this paper we analyze them from a practical point of view. We reveal that they are, in nature, minor component analysis (MCA) algorithms, and thus they inherit some problems of MCA algorithms. We also find a switch phenomenon of online LP-BSE algorithms, showing that different parts of a single extracted signal are the counterparts of different source signals. These two issues should be noticed when one applies these algorithms to practical applications. Computer simulations are given to confirm these observations.
1 Introduction
Blind source extraction (BSE) [1] is a powerful technique that is closely related to blind source separation (BSS). The basic task of BSE is to estimate some of the underlying source signals that are linearly combined in observations. Compared with BSS, BSE has some advantages [1]. An attractive one is its ability to extract a small subset of source signals from high-dimensional observed signals. Hence it is often recommended for use in EEG/MEG fields and alike [1,2,3]. There are many BSE algorithms for extracting source signals based on their temporal structures [1,4]. Among them there is a class of algorithms based on linear prediction. For example, Cichocki, Mandic, and Liu et al. [2,6,7,8] proposed several BSE algorithms based on short-term linear prediction. Barros et al. [5] proposed a BSE algorithm based on long-term linear prediction. Later, Smith et al. [3] proposed a BSE algorithm combining short-term and long-term prediction. Recently, Liu et al. [9] extended a basic linear prediction based algorithm to one suitable for noisy environments. In this paper we consider some possible problems when applying the linear prediction based BSE (LP-BSE) algorithms to practical applications, especially in EEG/MEG fields.
2 The Linear Prediction Based Algorithms
Suppose that the unknown source signals s(k) = [s_1(k), · · · , s_n(k)]^T are zero-mean and spatially uncorrelated, and suppose that x(k) = [x_1(k), · · · , x_n(k)]^T is a vector of observed signals, which is a linear instantaneous mixture of the source signals, x(k) = As(k), where k is the time index and A ∈ R^{n×n} is an unknown mixing matrix of full rank. The goal of BSE is to find a demixing vector w such that y(k) = w^T x(k) = w^T As(k) is an estimate of a source signal. To cope with ill-conditioned cases and to make algorithms simpler and faster, before extraction whitening [1] is often used to transform the observed signals x(k) to z(k) = Vx(k) such that E{z(k)z(k)^T} = I, where V ∈ R^{n×n} is a prewhitening matrix and VA is an orthogonal matrix. Assuming that the underlying source signals have temporal structures, the class of LP-BSE algorithms is derived by minimizing the normalized mean square prediction error given by [6,8]

J_1 = E{e(k)²} / E{y(k)²} = E{(y(k) − b^T ȳ(k))²} / E{y(k)²} (1)
where y(k) = w^T x(k), b = [b_1, b_2, · · · , b_P]^T, ȳ(k) = [y(k−1), y(k−2), · · · , y(k−P)]^T, and P is the AR order, which is set before running the algorithms. If one performs the whitening and normalizes the demixing vector w, the objective function (1) reduces to [2,5]:

J_2 = E{e(k)²} = E{(y(k) − b^T ȳ(k))²} (2)

where y(k) = w^T z(k) = w^T Vx(k) and ‖w‖ = 1. Without loss of generality, we only consider the objective function (2) in the following. After some algebraic calculations, from (2) we obtain

J_2 = E{e(k)²} = w^T R̃_z w = w^T VA R̃_s A^T V^T w = q^T R̃_s q (3)

in which q = A^T V^T w, and

R̃_z = R_z(0) − Σ_{p=1}^{P} b_p R_z(p) − Σ_{q=1}^{P} b_q R_z(−q) + Σ_{p=1}^{P} Σ_{q=1}^{P} b_p b_q R_z(q−p) (4)

R̃_s = R_s(0) − 2 Σ_{p=1}^{P} b_p R_s(p) + Σ_{p=1}^{P} Σ_{q=1}^{P} b_p b_q R_s(q−p) (5)

where R_z(p) = E{z(k)z(k−p)^T}, and R_s(p) = E{s(k)s(k−p)^T} is a diagonal matrix due to the assumptions. Also, R̃_s is a diagonal matrix, whose diagonal elements are given by

ρ_i = r_i(0) − 2 Σ_{p=1}^{P} b_p r_i(p) + Σ_{p=1}^{P} Σ_{q=1}^{P} b_p b_q r_i(q−p), i = 1, · · · , n (6)
where r_i is the autocorrelation function of s_i. Now we calculate the concrete value of ρ_i. Suppose that when J_2 achieves its minimum, b achieves b* = [b*_1, b*_2, · · · , b*_P]^T. We express all the source signals as

s_i(k) = Σ_{p=1}^{P} b*_p s_i(k−p) + e_i(k), i = 1, · · · , n (7)

where the e_i(k) are called residual processes. Then we have

r_i(0) = E{(Σ_{p=1}^{P} b*_p s_i(k−p) + e_i(k))(Σ_{q=1}^{P} b*_q s_i(k−q) + e_i(k))}
       = Σ_{p=1}^{P} Σ_{q=1}^{P} b*_p b*_q r_i(q−p) + 2E{e_i(k)s_i(k)} − E{e_i(k)²} (8)

where we use the relationship (7). On the other hand, we also have

r_i(0) = E{(Σ_{p=1}^{P} b*_p s_i(k−p) + e_i(k)) s_i(k)} = Σ_{p=1}^{P} b*_p r_i(p) + E{e_i(k)s_i(k)}. (9)

Substituting (8) and (9) into (6), we obtain

ρ_i = E{e_i(k)²}, (10)

implying that the ρ_i (i = 1, · · · , n) are just the powers of the residual processes of the linear prediction of the source signals given the coefficients b*_p (p = 1, · · · , P). Obviously, finding the minimum of J_2 is equivalent to finding the minimum among all the ρ_i (i = 1, · · · , n), which are the eigenvalues of R̃_s and also the ones of R̃_z. And the demixing vector w is the associated eigenvector. Thus the LP-BSE algorithms are in nature MCA algorithms [10,11,15].
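This MCA interpretation yields a closed-form batch solution: build the sample version of R̃_z in (4) for given prediction coefficients b and take the eigenvector of its smallest eigenvalue. A hedged Python sketch (the margin trimming so that all lags use the same samples is our own implementation detail):

import numpy as np

def lp_bse_minor(Z, b):
    # Z: n x T prewhitened data, b: prediction coefficients [b_1, ..., b_P]
    T = Z.shape[1]
    P = len(b)
    def R(p):                                     # sample estimate of R_z(p)
        return Z[:, P:T - P] @ Z[:, P - p:T - P - p].T / (T - 2 * P)
    Rt = R(0)
    for p, bp in enumerate(b, start=1):
        Rt = Rt - bp * (R(p) + R(p).T)            # R_z(-p) = R_z(p)^T
    for p, bp in enumerate(b, start=1):
        for q, bq in enumerate(b, start=1):
            Rt = Rt + bp * bq * R(q - p)
    vals, vecs = np.linalg.eigh((Rt + Rt.T) / 2.0)
    return vecs[:, 0]                             # minor eigenvector = demixing vector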
3 Analysis of the LP-BSE Algorithms
It is recognized that MCA algorithms have some flaws in practical applications [10,11]. First, in practice the small eigenvalues of the covariance matrix R̃_z are often close to each other, which reduces the estimation accuracy of the associated eigenvectors [14] and brings difficulties to global convergence [10,11]. Moreover, the performance of MCA algorithms often suffers from outliers and noise [12]. Naturally, the LP-BSE algorithms inherit some of these flaws when dealing with high-dimensional observed signals. Take the extraction of event-related potentials as an example. The number of sensor signals is often larger than 64, and some underlying source signals have similar time structures [13]. According to (10) and (7), the small eigenvalues of R̃_z are then close to each other, which makes the estimation of the minor eigenvector sensitive to sensor noise [12].
Now consider online versions of LP-BSE algorithms. Suppose the currently extracted source is s_1(k), whose current residual process has power level e_1²(k). This implies that, given the prediction AR order P in the algorithms, e_1²(k) is the smallest among all e_j²(k), j = 1, · · · , n. If at time k+1 the true AR order of s_1(k+1) starts to change but the given prediction AR order does not, e_1²(k+1) may become larger.¹ There may then be another source signal, say s_2(k+1), whose e_2²(k+1) with the given prediction order is smaller than that of s_1(k+1). Consequently, the algorithms switch to extracting s_2(k+1). The extracted signal is therefore still mixed by the two source signals, in the sense that the first part of the extracted signal is the counterpart of s_1 and the second part is the counterpart of s_2. We call this the switch phenomenon. The essential reason for the existence of the switch phenomenon is the use of a fixed prediction order that is set before running the LP-BSE algorithms. Similarly, if the true AR coefficients of the source signals vary fast and the b_i (i = 1, · · · , P) cannot be adjusted at the same pace, the switch phenomenon may also occur. Note that in EEG data processing, especially in event-related brain potential extraction, the AR order and coefficients of the underlying source signals may vary quickly. Thus the phenomenon may occur in these cases.
4 Simulations
In the first simulation we illustrate the unsatisfying performance of LP-BSE algorithms due to their MCA nature. We used the data set ABio7, a benchmark in ICALAB [17]. Three typical LP-BSE algorithms, i.e. the ones in [2,7,8], were used to extract these signals. For comparison, we intuitively give a PCA-like BSE algorithm, a variation of our algorithm [4], as follows²:

w = PCA_i(Σ_{i=1}^{P} R_z(τ_i)) = PCA_i(R̄_z), (11)

where R_z(τ_i) = E{z(k)z(k − τ_i)^T}, τ_i is a time delay, and PCA_i(R̄_z) is the operator that calculates the i-th principal eigenvector of R̄_z. Using a priori knowledge one can choose a specific set of time delays to achieve better performance [4]. Actually, (11) is only a framework, and can be implemented offline or online using many efficient and robust methods [14,16]. Note that the PCA-like BSE algorithm obtains principal eigenvectors, while the LP-BSE algorithms obtain minor ones. All the algorithms were implemented offline. The source signals were randomly mixed and whitened. Then each algorithm was performed on these signals. The step-size of the algorithm in [8] was 0.1.

¹ It may also become smaller, in which case the switch phenomenon does not occur.
² Note that we present the PCA-like algorithm on purpose, to show that the class of LP-BSE algorithms may not achieve satisfying results when applied to practical applications. Admittedly, better algorithms may be developed, which is not the topic of this paper.
The learning rate parameter μ_0 of the algorithm in [7] (see Eq. (16) in [7]) was 0.5. The extraction performance was measured by

PI = (1/(n−1)) (Σ_{i=1}^{n} q_i² / max_i q_i² − 1) (12)

where q = [q_1, · · · , q_n] = w^T VA is the global vector, V is the whitening matrix, A is the mixing matrix, and w is the demixing vector obtained by the algorithms.
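A direct implementation of (12) (Python, for illustration):

import numpy as np

def perf_index(w, V, A):
    # q = w^T V A is the global vector; PI = 0 means perfect extraction
    q = w @ V @ A
    q2 = q ** 2
    return (q2.sum() / q2.max() - 1.0) / (len(q) - 1)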
The value of PI lies in [0, 1] for any vector q; the smaller it is, the better the extraction performance. The simulations were independently carried out over 50 trials. The results are shown in Table 1, from which we can see that the LP-BSE algorithms generally performed poorly.

Table 1. The averaged performance indexes of the algorithms in the first simulation. For the three LP-BSE algorithms the parameter P is the prediction order, while for the PCA-like algorithm P means that the time delay set is {1, · · · , P}.

P         | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 12   | 20   | 30   | 40   | 50   | 400
Alg. (11) | 0.00 | 0.00 | 0.01 | 0.09 | 0.02 | 0.07 | 0.01 | 0.01 | 0.06 | 0.02 | 0.01 | 0.05 | 0.04 | 0.07 | 0.03 | 0.01
Alg. [8]  | 0.19 | 0.17 | 0.17 | 0.18 | 0.18 | 0.18 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19
Alg. [2]  | 0.02 | 0.00 | 0.07 | 0.07 | 0.06 | 0.03 | 0.03 | 0.02 | 0.04 | 0.08 | 0.10 | 0.05 | 0.06 | 0.07 | 0.08 | 0.14
Alg. [7]  | 0.11 | 0.14 | 0.11 | 0.09 | 0.12 | 0.09 | 0.10 | 0.11 | 0.13 | 0.12 | 0.13 | 0.06 | 0.13 | 0.08 | 0.16 | 0.08
In the second simulation we used the 122-dimensional MEG data set (Fig. 1(a)) from [18] to show the performance of a typical LP-BSE algorithm in extracting horizontal eye movements, which occurred at about the 4000-th sampling point and the 15500-th sampling point. Since the movements resulted from the same group of muscles, we can safely believe that artifacts associated with movements occurring at different times should appear in the same extracted signal. After performing the same preprocessing as that in [18], we used the offline LP-BSE algorithm in [8] to extract the artifacts with different levels of data dimension reduction. Its step-size was 0.5 and its prediction order was 10. The results are shown in Fig. 1(b), where y_1, y_2, y_3 and y_4 were extracted by the LP-BSE algorithm with the data dimension reduced to 120, 80, 60, and 40, respectively. y_5 was extracted by the PCA-like algorithm (11) without data dimension reduction (τ_i = {1, · · · , 5}). y_6 was extracted by FastICA, which was also used in [18]. Since Vigário et al. have shown that FastICA can perfectly extract the horizontal eye movement artifacts, we regard y_6 as a benchmark. From y_1 − y_3 we see that the artifacts were not perfectly extracted, since the horizontal eye movement artifact at about the 15500-th sampling point was not extracted. Although in y_4 all the artifacts were extracted, it was mixed with artifacts resulting from eye blinks [18]. Besides, we see that the extraction performance of the LP-BSE algorithm was affected by the dimension reduction; when the dimension was reduced to a certain degree, the extraction performance became relatively better. In contrast, in y_5 all the horizontal eye movement artifacts were extracted without being mixed
[Figure 1: (a) a subset of the MEG channels; (b) the extracted signals y_1–y_6 over 16000 sampling points.]
Fig. 1. A subset of the MEG data set [18] (a) and extracted artifacts (b)
by other artifacts, and we found that the extraction quality was not affected by the data dimension (the extraction results with dimension reduction are not shown here due to limited space). We also ran other LP-BSE algorithms and obtained almost the same results; due to space limits we omit the report. In the last simulation we show the switch phenomenon of online LP-BSE algorithms. We generated three AR(6) Gaussian signals of 5-second duration (Fig. 2). Each source signal had zero mean and unit variance. The sampling frequency was 1000 Hz. The AR coefficients of each signal were unchanged during the first 2.5 seconds, given by:

source1: b = [−1.6000, 0.9000, −0.2000, 0.0089, 0.0022, −0.0002]
source2: b = [−0.1000, −0.4300, 0.0970, 0.0378, −0.0130, 0.0009]
source3: b = [−2.3000, 2.0400, −0.8860, 0.1985, −0.0216, 0.0009]

and hereafter the AR coefficients changed to:

source1: b = [−1.6000, 0.9000, −0.2000, 0.0089, 0.0022, −0.0002]
source2: b = [−2.3000, 2.0400, −0.8860, 0.1985, −0.0216, 0.0009]
source3: b = [−0.1000, −0.4300, 0.0970, 0.0378, −0.0130, 0.0009]

We used the online version of the LP-BSE algorithm in [8] to extract a signal. Its step-size was 0.01 and its prediction order was 10. The result is shown in Fig. 2 (see y_1), from which we see that the first part of y_1 (before 2.5 s) was the counterpart of source signal s_3, but from about 3.6 s the signal was clearly the counterpart of source signal s_1. To further confirm this, we measured the similarity between the extracted signal and the source signals using the performance index PI_2 = −10 lg(E{(s(k) − s̃(k))²}) (dB), where s(k) is the desired source signal and s̃(k) is the extracted signal (both normalized to zero mean and unit variance). The higher PI_2 is, the better the performance. Denote by Part1 the extracted signal's segment from 2.0 s to 2.5 s, and by Part2 the extracted signal's segment from 4.0 s to 5.0 s.
The PI_2 of Part1, measuring the similarity between Part1 and the counterpart of s_3, was 18.5 dB, showing that Part1 was very similar to the counterpart of s_3. The PI_2 of Part2, measuring the similarity between Part2 and the counterpart of s_1, was 19.7 dB, showing that Part2 was very similar to the counterpart of s_1. Next we used an online version of the PCA-like algorithm (11), implemented by the OJAN PCA algorithm [16], to extract a source signal. The extracted signal is shown in Fig. 2 (see y_2), from which we can see that the extracted signal was just s_3 and the switch phenomenon did not occur. We also calculated this algorithm's PI_2 at Part1 and Part2. The PI_2 of Part1, measuring the similarity between Part1 and the counterpart of s_3, was 22.3 dB, and the PI_2 of Part2, measuring the similarity between Part2 and the counterpart of s_3, was 19.9 dB, showing that both parts were very similar to the counterparts of s_3. These results show that the online version has well extracted the whole source signal s_3.
Fig. 2. Segments of the AR source signals (s1 , s2 , s3 ) and the extracted signals. y1 was extracted by the online LP-BSE algorithm in [8], while y2 was extracted by the online version of the algorithm (11).
5 Conclusion
In this paper we analyzed a class of linear prediction based BSE algorithms, revealing that they are in nature MCA algorithms and showing a switch phenomenon of their online versions. In light of these results, care should be taken when applying these algorithms in practical fields such as EEG and MEG analysis.
Acknowledgments. The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National Natural Science Foundation of China (Grant No. 60375015).
References

1. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, New York (2002)
2. Cichocki, A., et al.: A Blind Extraction of Temporally Correlated but Statistically Dependent Acoustic Signals. In: Proc. of the 2000 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing X, pp. 455–464. IEEE Computer Society Press, Los Alamitos (2000)
3. Smith, D., Lukasiak, J., Burnett, I.: Blind Speech Separation Using a Joint Model of Speech Production. IEEE Signal Processing Lett. 12(11), 784–787 (2005)
4. Zhang, Z.-L., Yi, Z.: Robust Extraction of Specific Signals with Temporal Structure. Neurocomputing 69(7-9), 888–893 (2006)
5. Barros, A.K., Cichocki, A.: Extraction of Specific Signals with Temporal Structure. Neural Computation 13(9), 1995–2003 (2001)
6. Cichocki, A., Thawonmas, R.: On-line Algorithm for Blind Signal Extraction of Arbitrarily Distributed, but Temporally Correlated Sources Using Second Order Statistics. Neural Processing Letters 12, 91–98 (2000)
7. Mandic, D.P., Cichocki, A.: An Online Algorithm for Blind Extraction of Sources with Different Dynamical Structures. In: Proc. of the 4th Int. Conf. on Independent Component Analysis and Blind Signal Separation (ICA 2003), pp. 645–650 (2003)
8. Liu, W., Mandic, D.P., Cichocki, A.: A Class of Novel Blind Source Extraction Algorithms Based on a Linear Predictor. In: Proc. of ISCAS 2005, pp. 3599–3602 (2005)
9. Liu, W., Mandic, D.P., Cichocki, A.: Blind Second-order Source Extraction of Instantaneous Noisy Mixtures. IEEE Trans. Circuits Syst. II 53(9), 931–935 (2006)
10. Taleb, A., Cirrincione, G.: Against the Convergence of the Minor Component Analysis Neurons. IEEE Trans. Neural Networks 10(1), 207–210 (1999)
11. Feng, D.-Z., Zheng, W.-X., Jia, Y.: Neural Network Learning Algorithms for Tracking Minor Subspace in High-dimensional Data Stream. IEEE Trans. Neural Networks 16(3), 513–521 (2005)
12. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford Univ. Press, Oxford (1965)
13. Makeig, S., Westerfield, M., Jung, T.-P., et al.: Dynamic Brain Sources of Visual Evoked Responses. Science 295, 690–694 (2002)
14. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
15. Zhang, Q., Leung, Y.-W.: A Class of Learning Algorithms for Principal Component Analysis and Minor Component Analysis. IEEE Trans. Neural Networks 11(1), 200–204 (2000)
16. Chatterjee, C.: Adaptive Algorithms for First Principal Eigenvector Computation. Neural Networks 18, 145–159 (2005)
17. ICALAB Toolboxes. Available: http://www.bsp.brain.riken.jp/ICALAB
18. Vigário, R., et al.: Independent Component Analysis for Identification of Artifacts in Magnetoencephalographic Recordings. In: Proc. of NIPS 1997, pp. 229–235 (1997)
Blind Audio Source Separation Using Sparsity Based Criterion for Convolutive Mixture Case A. Aïssa-El-Bey, K. Abed-Meraim, and Y. Grenier ENST-Paris, TSI Department, 46 rue Barrault 75634, Paris Cedex 13, France {elbey,abed,grenier}@tsi.enst.fr
Abstract. In this paper, we are interested in the separation of audio sources from their instantaneous or convolutive mixtures. We propose a new separation method that exploits the sparsity of the audio signals via an $\ell_p$-norm based contrast function. A simple and efficient natural gradient technique is used for the optimization of the contrast function in the instantaneous mixture case. We extend this method to the convolutive mixture case by exploiting the properties of the Fourier transform. The resulting algorithm is shown to outperform existing techniques in terms of separation quality and computational cost.
1 Introduction
Blind Source Separation (BSS) is an approach to estimate and recover independent source signals using only the information within the mixtures observed at each channel. Many algorithms have been proposed to solve the standard blind source separation problem, in which the mixtures are assumed to be instantaneous. A fundamental and necessary assumption of BSS is that the sources are statistically independent, and thus they are often separated using higher-order statistical information [1]. If extra information about the sources is available, such as temporal coherency [2], source nonstationarity [3], or source cyclostationarity [4], then one can remain in the second-order statistical scenario to achieve BSS. In the case of non-stationary signals (including audio signals), certain solutions using time-frequency analysis of the observations exist [5]. Other solutions use the statistical independence of the sources, assuming local stationarity, to solve the BSS problem [6]. This is a strong assumption that is not always verified [7]. To avoid this problem, we propose a new approach that handles the general linear instantaneous model (possibly noisy) by using the sparsity assumption on the sources in the time domain. Then, we extend this algorithm to the convolutive mixture case by transforming the convolutive problem into an instantaneous problem in the frequency domain and separating the instantaneous mixtures in every frequency bin. The use of sparsity to handle this model has arisen in several papers in the area of source separation [8,9]. We first present a sparsity contrast function for BSS. Then, in order to achieve BSS, we optimize the considered contrast function using an iterative algorithm based on the relative gradient technique.
In the following section, we discuss the data model that formulates our problem. Next, we detail the different steps of the proposed algorithm. In Section 4, some simulations are undertaken to validate our algorithm and to compare its performance to other existing BSS techniques.
2 Instantaneous Mixture Case

2.1 Data Model

Assume that N audio signals impinge on an array of M ≥ N sensors. The measured array output is a weighted superposition of the signals, corrupted by additive noise, i.e.

$$x(t) = A s(t) + w(t), \quad t = 0, \ldots, T-1 \qquad (1)$$

where $s(t) = [s_1(t), \cdots, s_N(t)]^T$ is the N × 1 sparse source vector, $w(t) = [w_1(t), \cdots, w_M(t)]^T$ is the M × 1 complex noise vector, A is the M × N full column rank mixing matrix (i.e., M ≥ N), and the superscript T denotes the transpose operator. The purpose of blind source separation is to find a separating matrix, i.e. an N × M matrix B such that $\hat{s}(t) = B x(t)$ is an estimate of the source signals. Before proceeding, note that complete blind identification of the separating matrix B (or equivalently, the mixing matrix A) is impossible in this context, because the exchange of a fixed scalar between a source signal and the corresponding column of A leaves the observations unaffected. Also note that the numbering of the signals is immaterial. It follows that the best that can be done is to determine B up to a permutation and scalar shifts of its columns, i.e., B is a separating matrix iff:

$$B x(t) = P \Lambda s(t) \qquad (2)$$

where P is a permutation matrix and Λ a non-singular diagonal matrix.

2.2 Sparsity-Based BSS Algorithm
Before starting, we propose to use an optional whitening step, which sets the mixtures to the same energy level and reduces the number of parameters to be estimated. More precisely, the whitening step is applied to the signal mixtures before using our separation algorithm. The whitening is achieved by applying an N × M matrix W to the signal mixtures in such a way that Cov(Wx) = I in the noiseless case, where Cov(·) stands for the covariance operator. As shown in [2], W can be computed as the inverse square root of the noiseless covariance matrix of the signal mixtures (see [2] for more details). In the following, we apply our separation algorithm to the whitened data: $x_w(t) = W x(t)$.
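As an illustration, a minimal sketch of this whitening step is given below, assuming zero-mean NumPy data. For simplicity it keeps all M channels, whereas the N × M matrix W of the text additionally reduces the dimension to N.

```python
import numpy as np

def whiten(x):
    # x: (M, T) array of observed mixtures (assumed zero-mean, noiseless case).
    cov = np.cov(x, bias=True)               # M x M covariance of the mixtures
    d, E = np.linalg.eigh(cov)               # cov = E diag(d) E^T
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T  # inverse square root of cov
    return W @ x, W                          # whitened data xw = W x, and W
```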
We propose an iterative algorithm for the separation of sparse audio signals, namely ISBS, for Iterative Sparse Blind Separation. It is well known that audio signals are characterized by their sparsity property in the time domain [8,9], which is measured by their $\ell_p$ norm with $0 \le p < 2$. More specifically, one can define the following sparsity based contrast function

$$G_p(s) = \sum_{i=1}^{N} \left[ J_p(s_i) \right]^{\frac{1}{p}}, \qquad (3)$$

where

$$J_p(s_i) = \frac{1}{T} \sum_{t=0}^{T-1} |s_i(t)|^p. \qquad (4)$$
The algorithm finds a separating matrix B such that

$$B = \arg\min_{B} \{ G_p(B) \}, \qquad (5)$$

where

$$G_p(B) \triangleq G_p(z), \qquad (6)$$

and $z(t) \triangleq B x_w(t)$ represents the estimated sources. The approach we choose to solve (5) is inspired from [10]. It is a block technique based on the processing of T received samples, and consists in searching iteratively for the minimum of (5) in the form:

$$B^{(k+1)} = (I + \varepsilon^{(k)}) B^{(k)} \qquad (7)$$
$$z^{(k+1)}(t) = (I + \varepsilon^{(k)}) z^{(k)}(t) \qquad (8)$$

where I denotes the identity matrix. At iteration k, a matrix $\varepsilon^{(k)}$ is determined from a local linearization of $G_p(B^{(k+1)} x_w)$. It is an approximate Newton technique with the benefit that $\varepsilon^{(k)}$ can be computed very simply (no Hessian inversion) under the additional assumption that $B^{(k)}$ is close to a separating matrix. This procedure is illustrated in the following. At the (k+1)-th iteration, the proposed criterion (4) can be developed as follows:

$$J_p\big(z_i^{(k+1)}\big) = \frac{1}{T} \sum_{t=0}^{T-1} \left| z_i^{(k)}(t) + \sum_{j=1}^{N} \varepsilon_{ij}^{(k)} z_j^{(k)}(t) \right|^p = \frac{1}{T} \sum_{t=0}^{T-1} |z_i^{(k)}(t)|^p \left| 1 + \sum_{j=1}^{N} \varepsilon_{ij}^{(k)} \frac{z_j^{(k)}(t)}{z_i^{(k)}(t)} \right|^p.$$

Under the assumption that $B^{(k)}$ is close to a separating matrix, we have $|\varepsilon_{ij}^{(k)}| \ll 1$
and thus a first order approximation of $J_p(z_i^{(k+1)})$ is given by:

$$J_p\big(z_i^{(k+1)}\big) \approx \frac{1}{T} \sum_{t=0}^{T-1} \left[ |z_i^{(k)}(t)|^p + p \sum_{j=1}^{N} \left( \Re e(\varepsilon_{ij}^{(k)})\, \Re e\big\{ |z_i^{(k)}(t)|^{p-1} e^{-j\phi_i^{(k)}(t)} z_j^{(k)}(t) \big\} - \Im m(\varepsilon_{ij}^{(k)})\, \Im m\big\{ |z_i^{(k)}(t)|^{p-1} e^{-j\phi_i^{(k)}(t)} z_j^{(k)}(t) \big\} \right) \right] \qquad (9)$$

where $\Re e(x)$ and $\Im m(x)$ denote the real and imaginary parts of x, and $\phi_i^{(k)}(t)$ is the argument of the complex number $z_i^{(k)}(t)$. Using equation (9), equation (3) can be rewritten in a more compact form as:

$$G_p\big(B^{(k+1)}\big) = G_p\big(B^{(k)}\big) + \Re e\left\{ Tr\big( \varepsilon^{(k)} R^{(k)H} D^{(k)H} \big) \right\} \qquad (10)$$

where $(\cdot)^*$ denotes the conjugate of $(\cdot)$, $Tr(\cdot)$ is the matrix trace operator, and the $ij$-th entry of matrix $R^{(k)}$ is given by:

$$R_{ij}^{(k)} = \frac{1}{T} \sum_{t=0}^{T-1} |z_i^{(k)}(t)|^{p-1} e^{-j\phi_i^{(k)}(t)} z_j^{(k)}(t) \qquad (11)$$

$$D^{(k)} = \mathrm{diag}\left( R_{11}^{(k)}, \ldots, R_{NN}^{(k)} \right)^{\frac{1}{p}-1}. \qquad (12)$$

Using a gradient technique, $\varepsilon^{(k)}$ can be chosen as:

$$\varepsilon^{(k)} = -\mu\, D^{(k)} R^{(k)} \qquad (13)$$

where $\mu > 0$ is the gradient step. Replacing (13) into (10) leads to

$$G_p\big(B^{(k+1)}\big) = G_p\big(B^{(k)}\big) - \mu \left\| D^{(k)} R^{(k)} \right\|^2. \qquad (14)$$

So μ controls the decrement of the criterion. Now, to avoid the algorithm's convergence to the trivial solution B = 0, one normalizes the outputs of the separating matrix to unit power, i.e. $\rho_{z_i} \triangleq \frac{1}{T} \sum_{t=0}^{T-1} |z_i^{(k+1)}(t)|^2 = 1$, $\forall i$. Using a first order approximation, this normalization leads to:

$$\varepsilon_{ii}^{(k)} = \frac{1 - \rho_{z_i}^{(k)}}{2 \rho_{z_i}^{(k)}}. \qquad (15)$$
After convergence of the algorithm, the separation matrix $B = B^{(K)}$ is applied to the whitened signal mixtures $x_w$ to obtain an estimate of the original source signals. Here K denotes the number of iterations, which can be either chosen a priori or given by a stopping criterion of the form $\|B^{(k+1)} - B^{(k)}\| < \delta$, where δ is a small threshold value.
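The sketch below implements one such iteration (Eqs. (7), (11)–(13) and (15)) for whitened data xw of shape (N, T); the values of μ and p are illustrative, and the stopping rule described above is left to the caller.

```python
import numpy as np

def isbs_step(B, xw, p=1.0, mu=0.1):
    N, T = B.shape[0], xw.shape[1]
    z = B @ xw                                            # current estimates z = B xw
    U = np.abs(z) ** (p - 1) * np.exp(-1j * np.angle(z))  # |z_i|^{p-1} e^{-j phi_i(t)}
    R = (U @ z.T) / T                                     # Eq. (11)
    D = np.diag(np.diag(R).real ** (1.0 / p - 1.0))       # Eq. (12)
    eps = -mu * (D @ R)                                   # Eq. (13)
    rho = np.mean(np.abs(z) ** 2, axis=1)                 # output powers rho_{z_i}
    eps[np.diag_indices(N)] = (1.0 - rho) / (2.0 * rho)   # Eq. (15)
    return (np.eye(N) + eps) @ B                          # Eq. (7)
```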
3 Convolutive Mixture Case
Unfortunately, instantaneous mixing is very rarely encountered in real-world situations, where multipath propagation with large channel delay spread occurs, in which case convolutive mixtures are considered. In this case, the signal can be modeled by the following equation:

$$x(t) = \sum_{l=0}^{L} H(l)\, s(t-l) + w(t) \qquad (16)$$

where the H(l) are M × N matrices for $l \in [0, L]$ representing the impulse response coefficients of the channel, and the polynomial matrix $H(z) = \sum_{l=0}^{L} H(l) z^{-l}$ is assumed to be irreducible (i.e. H(z) is of full column rank for all z). If we apply a short time Fourier transform (STFT) to the observed data x(t), the model in (16) (in the noiseless case) becomes approximately

$$S_x(t, f) \approx H(f)\, S_s(t, f) \qquad (17)$$

where $S_x(t, f)$ is the mixture STFT vector, $S_s(t, f)$ is the source STFT vector and H(f) is the channel Fourier transform matrix. It shows that, for each frequency bin, the convolutive mixtures reduce to simple instantaneous mixtures. Therefore we can apply our ISBS algorithm in each frequency bin and separate the signals. As a result, in each frequency bin we obtain the STFT source estimate

$$\hat{S}_s(t, f) = B(f)\, S_x(t, f). \qquad (18)$$
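The following sketch illustrates this frequency-domain strategy, assuming SciPy's STFT, as many sources as sensors, and a hypothetical callback `separate_bin` that runs an instantaneous separation (such as ISBS) in a single bin.

```python
import numpy as np
from scipy.signal import stft

def separate_convolutive(x, fs, separate_bin, nperseg=1024):
    # x: (M, T) time-domain mixtures; returns per-bin source STFT estimates.
    f, t, Sx = stft(x, fs=fs, nperseg=nperseg)   # Sx has shape (M, F, T')
    Ss = np.empty_like(Sx)
    B_all = []
    for k in range(Sx.shape[1]):
        B = separate_bin(Sx[:, k, :])            # instantaneous BSS in bin k
        B_all.append(B)
        Ss[:, k, :] = B @ Sx[:, k, :]            # Eq. (18)
    # The scaling/permutation alignment of Sec. 3.1 and an inverse STFT
    # are still required before the time-domain sources are recovered.
    return Ss, B_all
```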
It seems natural to reconstruct the separated signals by aligning the estimates $\hat{S}_s(t, f)$ obtained for each frequency bin and applying the inverse short time Fourier transform. For that we first need to solve the permutation and scaling ambiguities, as shown next.

3.1 Removing the Scaling and Permutation Ambiguities
In this stage, the output of the separation filter is processed with the permutation matrix Π(f) and the scaling matrix C(f):

$$G(f) = \Pi(f)\, C(f)\, B(f). \qquad (19)$$

The scaling matrix C(f) is an N × N diagonal matrix found as in [11] by $C(f) = \mathrm{diag}[B(f)^{\#}]$. For the permutation matrix Π(f), we exploit the continuity property of the acoustic filter in the frequency domain [12]. To align the estimated sources at two successive frequency bins, we test the closeness of $G(f_n) G(f_{n-1})^{\#}$ to a diagonal matrix. Indeed, by using the representation (19), one can find the permutation matrix by minimizing:

$$\Pi(f_n) = \arg\min_{\Pi} \left\{ \sum_{i \neq j} \left| \left[ \Pi\, C(f_n) B(f_n) G(f_{n-1})^{\#} \right]_{ij} \right|^2 \right\}. \qquad (20)$$
In our simulations, we have used an exhaustive search to solve (20). However, when the number of sources is large, the exhaustive search becomes prohibitive. In that case, one can estimate $\Pi(f_n)$ as the matrix with ones at the $ij$-th entries satisfying $|M(f_n)|_{ij} = \max_k |M(f_n)|_{ik}$ and zeros elsewhere, with $M(f_n) = C(f_n) B(f_n) G(f_{n-1})^{\#}$. This solution has the advantage of simplicity but may lead to erroneous solutions in difficult contexts. An alternative solution is to decompose $\Pi(f_n)$ as a product of elementary permutations¹ $\Pi^{(pq)}$. The latter is applied at a given iteration only if it decreases criterion (20), i.e. if

$$|M(f_n)|_{pq}^2 + |M(f_n)|_{qp}^2 > |M(f_n)|_{pp}^2 + |M(f_n)|_{qq}^2.$$

Finally, we obtain:

$$\Pi(f_n) = \prod_{\text{nb of iterations}}\ \prod_{1 \le p < q \le N} \tilde{\Pi}^{(pq)}, \qquad (21)$$
$\tilde{\Pi}^{(pq)}$ being either the identity matrix or the above permutation matrix $\Pi^{(pq)}$, depending on the binary decision rule defined above. We stop the iterative process when all matrices $\tilde{\Pi}^{(pq)}$ are equal to the identity. We have observed that one or, at most, two iterations are sufficient to get the desired permutation. Finally, we apply the updated separation matrix G(f) to the frequency domain mixture:

$$\hat{S}_s(t, f) = G(f)\, S_x(t, f). \qquad (22)$$
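As an illustration, the sketch below performs the exhaustive search of Eq. (20) for one frequency bin, assuming as many sources as sensors; `B_f` stands for $B(f_n)$ and `G_prev` for the already aligned matrix $G(f_{n-1})$.

```python
import numpy as np
from itertools import permutations

def align_bin(B_f, G_prev):
    N = B_f.shape[0]
    C = np.diag(np.diag(np.linalg.pinv(B_f)))      # scaling matrix C(f) = diag[B(f)#]
    M_mat = C @ B_f @ np.linalg.pinv(G_prev)       # should be close to diagonal
    best_P, best_cost = None, np.inf
    for perm in permutations(range(N)):
        P = np.eye(N)[list(perm)]                  # candidate permutation matrix
        PM = P @ M_mat
        off = PM - np.diag(np.diag(PM))            # off-diagonal part, cf. Eq. (20)
        cost = np.sum(np.abs(off) ** 2)
        if cost < best_cost:
            best_P, best_cost = P, cost
    return best_P @ C @ B_f                        # G(f) = Pi(f) C(f) B(f), Eq. (19)
```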
4 Simulation Results
We present here some numerical simulations to evaluate the performance of our algorithm. We consider an array of M = 2 sensors receiving two audio signals in the presence of stationary temporally white noise of covariance $\sigma^2 I$ ($\sigma^2$ being the noise power). 10000 samples are used with a sampling frequency of 8 kHz (this represents a 1.25 s recording). In order to evaluate the performance in the instantaneous mixture case, the separation quality is measured using the Interference to Signal Ratio (ISR) criterion [2] defined as:

$$\mathrm{ISR} \stackrel{\mathrm{def}}{=} \sum_{p \neq q} \frac{E\big( |(BA)_{pq}|^2 \big)\, \rho_q}{E\big( |(BA)_{pp}|^2 \big)\, \rho_p} \qquad (23)$$
where $\rho_i = E(|s_i(t)|^2)$ is the $i$-th source power, evaluated here as $\frac{1}{T} \sum_{t=0}^{T-1} |s_i(t)|^2$. Fig. 1-(a) represents the two original sources and their mixtures in the noiseless case. In Fig. 1-(b), we compare the performance of the proposed algorithm in the instantaneous mixture case to the Relative Newton algorithm developed by Zibulevsky et al. in [9], where the case of sparse sources is considered, and to the SOBI algorithm developed by Belouchrani et al. in [2]. We plot the residual interference between separated sources (ISR) versus the SNR. It is clearly shown that our algorithm (ISBS) performs better in terms of ISR, especially at low SNRs, as compared to the two other methods. In Fig. 2-(a), we represent the evolution of the ISR as a function of the iteration number.
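In practice the expectations in Eq. (23) are replaced by sample estimates; the sketch below evaluates the criterion for a single realization with a known mixing matrix (the conversion to dB matches the plots, and the permutation is assumed to have been resolved so that BA is near-diagonal).

```python
import numpy as np

def isr_db(B, A, s):
    # Global system G = BA should be close to a scaled permutation.
    G = B @ A
    rho = np.mean(np.abs(s) ** 2, axis=1)    # source powers rho_i, s: (N, T)
    N = G.shape[0]
    isr = sum(np.abs(G[p, q]) ** 2 * rho[q] / (np.abs(G[p, p]) ** 2 * rho[p])
              for p in range(N) for q in range(N) if p != q)
    return 10.0 * np.log10(isr)              # ISR in dB
```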
¹ $\Pi^{(pq)}$ is defined in such a way that for a given vector y, $\tilde{y} = \Pi^{(pq)} y$ iff $\tilde{y}(k) = y(k)$ for $k \notin \{p, q\}$, $\tilde{y}(p) = y(q)$ and $\tilde{y}(q) = y(p)$.
Fig. 1. (a) Top: the two original source signals; bottom: the two signal mixtures. (b) Interference to Signal Ratio (ISR) versus SNR for 2 audio sources and 2 sensors in the instantaneous mixture case.
A fast convergence rate is observed. In Fig. 2-(b), we compare, in the 2 × 2 convolutive mixture case, the separation performance of our algorithm, Deville's algorithm [13], Parra's algorithm [14] and an extended version of Zibulevsky's algorithm for the convolutive mixture case. The filter coefficients are chosen randomly and the channel order is L = 128. We use in this experiment the ISR criterion defined for the convolutive case in [14], which takes into account the fact that the source estimates are obtained up to a scalar filter. We observe a significant performance gain in favor of the proposed method, especially at low SNR values.
Fig. 2. (a) ISR as a function of the iteration number for 2 audio sources and 2 sensors in instantaneous mixture case. (b) ISR versus SNR for 2 × 2 convolutive mixture case.
Moreover, the complexity of the proposed algorithm is equal to $2N^2 T + O(N^2)$ flops per iteration, whereas the complexity of the Relative Newton algorithm in [9] is $2N^4 + N^3 T + N^6/6$.
5 Conclusion
This paper presents a blind source separation method for sparse sources in the instantaneous mixture case and its extension to the convolutive mixture case. A sparsity contrast function is introduced, and an iterative algorithm based on a gradient technique is proposed to minimize it and perform the BSS. Numerical simulation results give evidence of the usefulness of the method. The proposed technique outperforms existing solutions in terms of separation quality and computational cost in both the instantaneous and convolutive mixture cases.
References

1. Cardoso, J.F.: Blind signal separation: statistical principles. Proceedings of the IEEE 86(10), 2009–2025 (1998)
2. Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., Moulines, E.: A blind source separation technique using second-order statistics. IEEE T-SP 45(2) (1997)
3. Belouchrani, A., Amin, M.G.: Blind source separation based on time-frequency signal representations. IEEE Transactions on Signal Processing 46(11) (1998)
4. Abed-Meraim, K., Xiang, Y., Manton, J.H., Hua, Y.: Blind source separation using second order cyclostationary statistics. IEEE T-SP 49(4), 694–701 (2001)
5. Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 52(7), 1830–1847 (2004)
6. Pham, D.T., Cardoso, J.F.: Blind separation of instantaneous mixtures of non stationary sources. IEEE Transactions on Signal Processing 49, 1837–1848 (2001)
7. Smith, D., Lukasiak, J., Burnett, I.S.: An analysis of the limitations of blind signal separation application with speech. Signal Processing 86(2), 353–359 (2006)
8. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing, Ch. 2. Wiley & Sons, Ltd., UK (2003)
9. Zibulevsky, M.: Sparse source separation with relative Newton method. In: Proc. ICA, pp. 897–902 (April 2003)
10. Pham, D.T., Garat, P.: Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE T-SP 45(7), 1712–1725 (1997)
11. Murata, N., Ikeda, S., Ziehe, A.: An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 41(1-4), 1–24 (2001)
12. Pham, D.T., Servière, C., Boumaraf, H.: Blind separation of convolutive audio mixtures using nonstationarity. In: Proc. ICA, Nara, Japan, pp. 981–986 (April 2003)
13. Albouy, B., Deville, Y.: Alternative structures and power spectrum criteria for blind segmentation and separation of convolutive speech mixtures. In: Proc. ICA, Nara, Japan, pp. 361–366 (April 2003)
14. Parra, L., Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Transactions on Speech and Audio Processing 8(3), 320–327 (2000)
Maximization of Component Disjointness: A Criterion for Blind Source Separation Jörn Anemüller Medical Physics Section Dept. of Physics Carl von Ossietzky University Oldenburg 26111 Oldenburg, Germany
Abstract. Blind source separation is commonly based on maximizing measures related to independence of estimated sources, such as mutual statistical independence assuming non-Gaussian distributions, decorrelation at different time-lags assuming spectral differences, or decorrelation assuming source non-stationarity. Here, the use of an alternative model for source separation is explored which is based on the assumption that sources emit signal energy at mutually different times. In the limiting case, this corresponds to only a single source being "active" at each point in time, resulting in mutual disjointness of source signal supports and negative mutual correlations of source signal envelopes. This assumption will not be fulfilled perfectly for real signals; however, by maximizing disjointness of estimated sources (under a linear mixing/demixing model) we demonstrate that source separation is nevertheless achieved when this assumption is only partially fulfilled. The conceptual benefits of the disjointness assumption are that (1) in certain applications it may be desirable to explain observed data in terms of mutually disjoint "parts", and (2) the method presented here preserves the special physical information assigned to amplitude zero of a signal, which corresponds to the absence of energy (rather than subtracting the signal mean prior to analysis, which for non zero-mean sources destroys this information). The method of disjoint component analysis (DCA) is derived and it is shown that its update equations bear remarkable similarities with maximum likelihood independent component analysis (ICA). Sources with systematically varied degrees of disjointness are constructed and processed by DCA and by Infomax and JADE ICA. Results illustrate the behaviour of DCA and ICA under these regimes, with three main results: (1) DCA leads to a higher degree of separation than ICA, (2) DCA performs particularly well on positive-valued sources as long as they are at least moderately disjoint, and (3) the performance peak of ICA for zero-mean sources is achieved when sources are disjoint (but not independent)¹.
1 Introduction

Representation of measured data in terms of a number of generating causes or underlying "sources" is an important problem that has gained widespread attention in recent
¹ This research was supported by the EC under the DIRAC integrated project IST-027787.
years, either with the goal of extracting known-to-exist sources from measurements (blind source separation), or in order to find an efficient—possibly lower-dimensional—description of given data (exploratory data analysis). We propose and investigate a novel technique, "disjoint component analysis" (DCA), that is based on the goal of extracting components with maximally disjoint support from given data, i.e., it seeks to describe the data in terms of components of which as few as possible should be activated at any single time (or sample) point. Ideally, only a single source process would account for a single sample of measured data. Since this goal is too strong for real-world data, we demonstrate that it can be significantly relaxed while still retaining the beneficial characteristics of the method. Disjoint support between generating source processes may constitute a relevant general principle in domains where other assumptions, e.g., statistical independence and the implied effective physical separation of generating source processes, have to be postulated or justified post-hoc rather than deduced a priori. In some cases, such as communicating speakers or densely interconnected nervous cells in the brain, theoretical considerations argue in favor of dependencies between source processes. Even though such dependencies might turn out to be largely negligible in some domains, it does appear worthwhile to consider the implications of incorporating such dependencies into the models. In the opposite direction (and with a different intention than ours), some authors have argued that sources that are often regarded as independent can effectively be modeled as being "w-disjoint orthogonal" [10]. We demonstrate a close formal link between algorithms derived from striving for independent and disjoint representations, respectively, which may be seen as an indication that both notions contain similarities that have not yet been fully appreciated. In relation to existing techniques, DCA differs from ICA [2,4] since the disjointness assumption corresponds to a source model with dependent sources. Sparse-coding approaches [9], unlike DCA, impose a sparse prior on each source but do not incorporate mutual disjointness of the sources. Non-negative matrix factorisation approaches [6] and $\ell_1$-norm minimization methods [5] aim to obtain a parts-based description of the data with fundamentally different algorithms than DCA.
2 Disjoint Component Analysis

2.1 Derivation of Algorithm

Consider N observed signals $x(t) = [x_1(t), \ldots, x_N(t)]^T$ which are assumed to be generated from N underlying sources $s(t) = [s_1(t), \ldots, s_N(t)]^T$ by multiplication with a mixing system A as

$$x(t) = A\, s(t) \qquad (1)$$

It is sought to linearly transform the observations by a matrix W to obtain output signals

$$y(t) = W x(t) \qquad (2)$$
with components $y(t) = [y_1(t), \ldots, y_N(t)]^T$. When source reconstruction is desired, these should resemble the sources up to arbitrary rescaling and permutation.
Fig. 1. Disjoint component analysis of four sources (top left) which are not strictly disjoint but exhibit significant overlap. Sources were mixed with a randomly chosen 4 × 4 mixing matrix to yield observation signals (bottom left) which were successfully separated into the original sources up to arbitrary permutation, rescaling and sign flip (top right) using DCA.
When an exploratory data analysis view is adopted, the output signals should convey a signal representation that is meaningful in some to-be-specified sense. A central notion in our approach is the overlap between two output signals $y_i$ and $y_j$, which we define as²

$$o_{ij} = E(|y_i|\, |y_j|), \qquad (3)$$

where $E(\cdot)$ denotes expectation and the sample index t is omitted where convenient. With $o_{ij} \ge 0$, and $o_{ij} = 0$ if and only if $y_i(t)\, y_j(t) = 0$ for all t and $i \neq j$, two signals $y_i$ and $y_j$ have disjoint support if $o_{ij} = 0$. In this case, $y_i$ and $y_j$ are called disjoint, i.e., at most one of the signals is non-zero at any time. For strictly disjoint source signals s(t) and a non-singular matrix A, strictly disjoint outputs can be obtained that resemble the sources up to arbitrary permutation and rescaling. Note that in this case the sources are not mutually independent but exhibit statistical dependencies through the negative correlations of their signal envelopes or signal power time-courses.
² Different definitions of the overlap, involving other non-linear functions of the output signals, are possible but beyond the scope of the present paper.
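A sample-based estimate of all pairwise overlaps is immediate; the sketch below (assuming NumPy) averages over the T columns of an (N, T) output array y.

```python
import numpy as np

def overlap_matrix(y):
    # o_ij = E(|y_i| |y_j|) estimated by a sample mean, cf. Eq. (3).
    a = np.abs(y)
    return (a @ a.T) / y.shape[1]
```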
While it is not possible in general to linearly transform an arbitrary signal x(t) into a signal y(t) with only disjoint components, finding minimally overlapping outputs is a natural goal, as it corresponds to a signal description in terms of processes out of which only a small number is active at any given time. A natural choice to obtain maximally disjoint, minimally overlapping output signals is minimization of the function

$$H = \frac{1}{2} \sum_{i \neq j} o_{ij} = \frac{1}{2} \sum_{i \neq j} E(|y_i|\, |y_j|) \qquad (4)$$

The global minimum H = 0 is attained only for strictly disjoint signals where, for all t, any signal $y_i(t) \neq 0$ only if $y_j(t) = 0$ for all $j \neq i$. Substituting (2) into (4), the partial derivatives are given by

$$\frac{\partial H}{\partial w_{ij}} = E\Big( \mathrm{sign}(y_i)\, x_j \sum_{k \neq i} |y_k| \Big) \qquad (5)$$

which in matrix notation is easily rewritten as

$$\nabla H = E\left( -y x^H + \|y\|_1\, \mathrm{sign}(y)\, x^H \right) \qquad (6)$$

where $\|y\|_1 = \sum_i |y_i|$ denotes the 1-norm of y. Right-multiplication with $W^T W$ yields an expression similar to the natural gradient ICA algorithm of [1],

$$\tilde{\nabla} H = E\left( -y y^H + \|y\|_1\, \mathrm{sign}(y)\, y^H \right) W. \qquad (7)$$

Gradients (6) and (7) are similar to the corresponding gradients derived from infomax or maximum-likelihood ICA with a sparse prior; however, we emphasize that the mean has not been removed from the output signals (i.e., source estimates) y(t). Without constraints, the gradients converge to the trivial solution W = 0. To remove the scaling ambiguity, each row $w_i$ of matrix W is fixed to unit norm $\|w_i\|_2 = 1$. Hence, each row $\Delta_i$ of $\nabla H$ is projected according to

$$\Delta_i^{\perp} = \Delta_i - (\Delta_i w_i^H)\, w_i \qquad (8)$$

resulting in the projected gradient matrix $\Delta^{\perp}$ that is then used for gradient descent. The final update rule for matrix W with a step size of η is

$$W \leftarrow W - \eta\, \Delta^{\perp} \qquad (9)$$

for the ordinary gradient (6), and similarly for (7). Periodic row re-normalization of W is applied to keep it on the constraint manifold for non-infinitesimal η.
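The sketch below implements one such update for real-valued data, with sample averages replacing the expectations; η is an illustrative step size, and the mean is deliberately not removed from y.

```python
import numpy as np

def dca_step(W, x, eta=0.01):
    y = W @ x                                           # outputs, x: (N, T)
    T = x.shape[1]
    l1 = np.sum(np.abs(y), axis=0)                      # ||y(t)||_1 for each sample
    grad = (-(y @ x.T) + (np.sign(y) * l1) @ x.T) / T   # Eq. (6)
    # project each row onto the tangent space of the unit-norm constraint, Eq. (8)
    proj = grad - np.sum(grad * W, axis=1, keepdims=True) * W
    W = W - eta * proj                                  # Eq. (9)
    return W / np.linalg.norm(W, axis=1, keepdims=True) # row re-normalization
```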
3 Evaluation

3.1 Synthetic Data Generation

Disjoint sources $s_i(t)$ are generated from mutually independent signals $\zeta_i(t)$ by multiplying them with disjoint masking functions $\mu_i(t) \in \{0, 1\}$ for all i, t:

$$s_i(t) = \mu_i(t)\, \zeta_i(t) \qquad (10)$$
$$E(\mu_i \mu_j) = 0 \quad \text{if } i \neq j \qquad (11)$$
Fig. 2. Separation performance of DCA and ICA in terms of signal-to-interference ratio (SIR) in dB after separation. Performance is given for data class 1 (left panel, sources with positive and negative observation values) and data class 2 (right panel, sources with positive only observation values) as a function of overlap γ. A value of γ = 0 corresponds to strictly disjoint sources (statistical dependencies between sources through negative correlation of signal envelopes); γ = 0.5 corresponds to statistically independent sources; and γ = 1.0 corresponds to fully overlapping, not disjoint sources (statistical dependencies through positive correlation of signal envelopes). Mean and variance of performance for 100 separation runs, each with independently generated data, are given for each condition.
These sources may then be used to generate observations by multiplication with a matrix A according to Eq. (1). Strictly disjoint sources with zero overlap are not expected to be an appropriate model for real data. Hence, sources with variable masker overlap $\gamma_{ij}$, which may depend on the source pair (i, j),

$$\gamma_{ij} = E(\mu_i \mu_j) / E(\mu_i^2) \qquad (12)$$

with $E(\mu_i^2) = \mathrm{const}$ for all i, are also generated. In the experiments reported below, the masker overlap $\gamma_{ij}$ is chosen such that a value of $\gamma_{ij} = 1$ corresponds to a source pair $(s_i, s_j)$ exhibiting mutual statistical dependence through maskers with positive correlation. The value $\gamma_{ij} = 0$ corresponds to strictly disjoint sources that exhibit mutual statistical dependence through maskers with negative correlation. Finally, a value of $\gamma_{ij} = 0.5$ coincides with statistically independent sources $(s_i, s_j)$ because of uncorrelated maskers (and statistically independent $\zeta_i(t)$). The signal generation scheme was inspired by a functional magnetic resonance imaging (fMRI) experiment design [3].
3.2 Separation of Synthetic Sources

Four sources were generated according to the scheme described above, mixed with a randomly chosen mixing matrix and processed with the natural gradient disjoint component analysis algorithm (Eq. 7) with regularization (Eq. 8). The underlying mutually independent signals $\zeta_i(t)$ were chosen as a speech signal ($\zeta_1$), i.i.d. noise from a normal distribution with zero mean and unit variance ($\zeta_2$), i.i.d. noise from a uniform distribution on the interval [0, 1] ($\zeta_3$), and a sine wave ($\zeta_4$). The maskers $\mu_i(t)$ were chosen such that $\gamma_{ij} = 0.6$ for source pairs (1, 2), (2, 3), (3, 4), (1, 4), and $\gamma_{ij} = 0.4$ for source pairs (1, 3), (2, 4). Source signals, observed (mixed) signals and output signals are displayed in Fig. 1, demonstrating that the algorithm performs successful separation even though the sources are not strictly disjoint but show significant overlap. Similarly, the algorithm successfully separates mixtures of four strictly disjoint sources with $\gamma_{ij} = 0$ for all $i \neq j$ (data not shown here).

3.3 Variable Degree of Overlap

The goal of this experiment was to systematically study the influence of the degree of overlap on the performance of the disjoint component analysis algorithm. Results are reported for the gradient version of the algorithm (Eq. 6) with regularization (Eq. 8); results for the natural gradient version are virtually identical and not reported separately. Sources were generated based on two different underlying signal classes. In the first part of the experiment ("data class 1"), two sources $s_1$ and $s_2$ were generated from $\zeta_1$ and $\zeta_2$ that were drawn as i.i.d. signals from a zero-mean, unit-variance normal distribution, hence containing positive and negative values. In the second part of the experiment ("data class 2"), $\zeta_1$ and $\zeta_2$ were chosen to be i.i.d. signals from a uniform distribution on the interval [0, 1], hence containing only positive values. For both data sets the single overlap parameter γ was varied from 0 (no overlap, source dependence through negative masker correlation) via 0.5 (50% overlap, statistically independent sources) to 1.0 (full overlap, source dependence through positive masker correlation) in steps of 0.1. Hence, 11 data set conditions were generated for each of the two data classes. For each condition, disjoint component analysis was performed on 100 individual datasets drawn independently according to the description above. This resulted in a total of 2200 datasets, each with 10000 samples for each of the two sources. Fig. 2 shows the results, with mean and variance of signal separation in dB signal-to-interference ratio (SIR) after separation, separately for data class 1 (left panel) and data class 2 (right panel). For data class 1, with sources that adopt positive and negative values, DCA separation performance shows no significant dependence on the overlap parameter γ except (as expected) for complete overlap at γ = 1, where the algorithm essentially attempts to separate two i.i.d. normally distributed sources, which is ill-posed. In all other cases of data class 1, DCA separation is excellent, at about 100 dB SIR. The results look different for data class 2 with positive-only source values. Separation remains excellent for data sets with small overlap (0.0 ≤ γ ≤ 0.4), again at about 100 dB SIR. In the case of independent sources at γ = 0.5, separation is still very good at 80 dB. Performance breaks down for large overlaps (0.6 ≤ γ ≤ 1.0), an effect which we attribute to the positivity of the sources.

3.4 Comparison with Independent Component Analysis

The same data generated for Section 3.3 was re-analyzed with natural gradient infomax ICA [1,2] using the ICA toolbox [7,8] with a logistic function non-linearity. For comparison, a simple gradient approach with a fixed step size and a sign function non-linearity
was also used and gave virtually identical results for data class 1. On data class 2, the fixed-step gradient approach gave qualitatively similar results but was outperformed by the referenced ICA toolbox in terms of SIR separation performance. All source signals have been checked to have positive kurtosis. Processing of the same signals with the JADE algorithm [4] gave virtually identical results. The results in Fig. 2 show that in most cases ICA results in a poorer SIR than DCA. For data class 1, ICA shows excellent signal separation for strictly disjoint sources (γ = 0.0). Performance is significantly lower, though still good, for independent sources, which seems to stand in contradiction to the independence assumption. As expected, performance decreases towards sources with strong overlap (γ = 1.0). For data class 2, ICA performs best when the sources are independent (γ = 0.5), with a drop-off in performance towards both lower and higher source overlaps, which is plausible given ICA's independence assumption.
4 Conclusion

Disjoint component analysis (DCA) has been shown to yield good performance for strictly disjoint and moderately disjoint data sets. For data with high overlap between sources (weakly disjoint data), performance depends on the specific type of data, with good performance for data sets whose sources take positive and negative observation values, and a break-down of performance in the case of purely positive source data. We have shown that under certain approximations DCA is closely related to independent component analysis (ICA), although both start from significantly different assumptions. The empirical algorithm evaluation showed a better separation performance for DCA than for ICA under most conditions. Interestingly, ICA produced the best performance not for statistically independent sources but for strictly disjoint ones (cf. also the ICA 2006 oral presentation of I.C. Daubechies). The results presented here appear to warrant a closer investigation of the differences and similarities of both algorithm classes. It would be desirable to gain experience with a wider range of synthetic and natural data than could be presented here. We are tempted to speculate that DCA might be appropriate in particular for analyzing data where the independence assumption is not strictly fulfilled, where a data representation in terms of disjoint components is preferable to independent components, and where signals are comprised of positive-only measurement values. This could be the case, e.g., for brain signals such as fMRI, for data from dialog speech signals, and for comparably short signal sequences where independence cannot be fully attained due to finite sample effects.
References

1. Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation 10, 251–276 (1998)
2. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
3. Benharrosh, M.S., Takerkart, S., Cohen, J.D., Daubechies, I.C., Richter, W.: Using ICA on fMRI: Does independence matter? In: Human Brain Mapping, abstract no. 784 (2003)
4. Cardoso, J.-F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEE Proceedings–F 140, 362–370 (1993)
5. Donoho, D., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proc. Nat. Acad. Sci. 100, 2197–2202 (2003)
6. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
7. Makeig, S., et al.: EEGLAB: ICA toolbox for psychophysical research. Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California, San Diego (2000), http://www.sccn.ucsd.edu:/eeglab
8. Makeig, S., Bell, A.J., Jung, T.-P., Sejnowski, T.J.: Independent component analysis of electroencephalographic data. Advances in Neural Information Processing Systems 8, 145–151 (1996)
9. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37, 3311–3325 (1997)
10. Rickard, S., Yilmaz, Z.: On the approximate w-disjoint orthogonality of speech. In: ICASSP '02, pp. I-529–I-532 (2002)
Estimator for Number of Sources Using Minimum Description Length Criterion for Blind Sparse Source Mixtures Radu Balan Siemens Corporate Research 755 College Road East Princeton, NJ 08540 [email protected]
Abstract. In this paper I present a Minimum Description Length estimator for the number of sources in an anechoic mixture of sparse signals. The criterion is roughly equal to the sum of the negative normalized maximum log-likelihood and the logarithm of the number of sources. Numerical evidence supports this approach, which compares favorably to both the Akaike (AIC) and Bayesian (BIC) Information Criteria.
1 Signal and Mixing Models
Consider the following model in the time domain:

$$x_d(t) = \sum_{l=1}^{L} s_l(t - (d-1)\tau_l) + n_d(t), \quad 1 \le d \le D \qquad (1)$$

This model corresponds to an anechoic Uniform Linear Array (ULA) with L sources and D sensors. In the frequency domain, (1) becomes

$$X_d(k, \omega) = \sum_{l=1}^{L} e^{-i\omega(d-1)\tau_l} S_l(k, \omega) + N_d(k, \omega) \qquad (2)$$

We use the following notations: $X(k, \omega)$ for the D-dimensional complex vector of components $(X_d(k, \omega))_d$, $S(k, \omega)$ for the L-dimensional complex vector of components $(S_l(k, \omega))_l$, and $A(\omega)$ for the D × L complex matrix whose (d, l) entry is $A_{d,l}(\omega) = e^{-i\omega(d-1)\tau_l}$. In this paper I make the following statistical assumptions:

1. (H1) Noise signals $(n_d)_{1 \le d \le D}$ are Gaussian i.i.d. with zero mean and unknown variance $\sigma^2$;
2. (H2) Source signals are unknown, but for every time-frequency point $(k, \omega)$, at most one signal $S_l(k, \omega)$ is nonzero among the total of L signals;
3. (H3) The number of source signals L is a random variable.

The problem is to design a statistically principled estimator for L, the number of source signals. In this paper I study the Minimum Description Length approach for this problem.
For this model, the measured data is $\Xi = \{(X_d(k, \omega))_{1 \le d \le D},\ 1 \le k \le T,\ 1 \le \omega \le F\}$. Furthermore, the number of sensors D is also known. The rest of the parameters are unknown. I denote $\theta = (\theta', L)$, where:

$$\theta' = \left\{ \{(S_l(k, \omega))_{1 \le l \le L};\ 1 \le k \le T,\ 1 \le \omega \le F\},\ (\tau_l)_{1 \le l \le L},\ \sigma^2 \right\} \qquad (3)$$
Notice that hypothesis (H2) above imposes a constraint on the set $(S_l(k, \omega))_{1 \le l \le L}$ for every $(k, \omega)$. More specifically, the L-dimensional complex vector $(S_l(k, \omega))_{1 \le l \le L}$ has to lie on one of the L 1-dimensional coordinate axes (that is, all but one component has to vanish). This fact has a profound implication on estimating the complexity penalty associated with the parameter set. Some real-world signals may satisfy (H2) only approximately; for instance, [1] studies this assumption for speech signals.

1.1 Prior Works
The signal and mixing model described above has been analyzed in many prior works. In the past series of papers [2,3,4,5,6,7] the authors studied (1) and several generalizations of this model in the following respects. Mixing model: each channel may have an attenuation factor (equivalently, $\tau_l$ may be complex). Noise statistics: noise signals may have inter-sensor correlations. Signals: more signals may be non-vanishing at each time-frequency point (the maximum number allowed is D − 1); more recently we have considered temporal, and time-frequency, dependencies in the signal statistics. A similar model, and a similar sparseness assumption, has been used by the DUET algorithm [1], and by [8], [9]. Assumptions similar to [5] have been made by [10] for an instantaneous mixing model. As the authors mentioned there, as well as in [11,12] and several other works, a new signal separation class is defined by the sparseness assumption, called Sparse Component Analysis (SCA). In this vein, the present paper proposes a look at the Minimum Description Length paradigm in the context of Sparse Component Analysis. Before discussing the new results of this paper, I would like to comment on other approaches to the BSS problem. Many other works dealt with the mixing model (1) or its generalizations to a fully echoic model. A completely different class of algorithms is furnished by the observation that, in the frequency domain, the echoic model simply becomes an instantaneous mixing model. Therefore standard ICA techniques can be applied, as in [13,14], to name a few. Next, one has to connect the frequency-domain components together for the same source. The permutation ambiguity is the main stumbling block. Several approaches have been proposed, some based on ad-hoc arguments [15,9]. A more statistically principled approach has been proposed and used by Zibulevsky [16] and in more recent papers, as well as by other authors, by assuming a stochastic prior model for the source signals. The Maximum A Posteriori (MAP) or Minimum Mean Square Error (MMSE) estimators can then be derived. While in principle they are superior to Maximum Likelihood type estimators derived in [4,5], or mixed estimators such
as [1,8,9], they require a good prior stochastic model. This makes the comparison between classes of BSS solutions difficult. In the absence of noise, the number of sources can be estimated straightforwardly by building a histogram of the instantaneous delay (τ); for a more general model see [10]. As I mention later, the MDL paradigm here may well be applied in conjunction with other signal estimators, in particular with the MAP estimators described before.
2 Estimators
Assume the mixing model (1) and hypotheses (H1), (H2), (H3). Then the associated likelihood is given by

$$L(\theta) := P(\Xi|\theta) = \prod_{(k,\omega)} \frac{1}{\pi^D \sigma^{2D}} \exp\left( -\frac{1}{\sigma^2} \left\| X(k, \omega) - A(\omega) S(k, \omega) \right\|^2 \right) \qquad (4)$$
In the next subsection, the maximum likelihood estimator for $\theta'$ and the maximum likelihood value are derived. Following a long tradition of statistics papers, consider the following framework. Let P(X) denote the unknown true probability of the data (measurements), and $P(X|\theta)$ the data likelihood given the model (1) and (H1)-(H3). Then the estimation objective is to minimize the misfit between these two distributions, measured by a distance between the two distribution functions. One can choose the Kullback-Leibler divergence and obtain the following optimization criterion:

$$J(\theta) = D(P_X \| P_{X|\theta}) := \int \log \frac{P(X)}{P(X|\theta)}\, dP(X) = \int \log P(X)\, dP(X) - \int \log P(X|\theta)\, dP(X) \qquad (5)$$

Since the first term does not depend on θ, the objective becomes maximization of the second term:

$$\hat{\theta} = \arg\max_{\theta} E[\log P_{X|\theta}(X|\theta)] \qquad (6)$$

where the expectation is computed over the true data distribution $P_X$. However, the true distribution is unknown. A first approximation is to replace the expectation E by an average over data points. Thus one obtains the maximum likelihood estimator (MLE):

$$\hat{\theta}_{ML} = \arg\max_{\theta} \frac{1}{N} \sum_{t=1}^{N} \log P_{X|\theta}(X_t|\theta) \qquad (7)$$
where N is the number of sample points $(X_t)_{1 \le t \le N}$. As is well known in statistical estimation (see [17,18]), the MLE is usually biased. For discrete parameters, such as the number of source signals, this bias has a bootstrapping effect that monotonically increases the likelihood and makes estimating the number of parameters impossible through naive MLE. Several approaches have been proposed to estimate and correct for this bias. In general, the optimization problem is restated as:
$$\hat{\theta} = \arg\min_{\theta} \left[ -\frac{1}{N} \sum_{t=1}^{N} \log P(X_t|\theta) + \Phi(\theta, N) \right] \qquad (8)$$

Following e.g. [18], we call Φ the regret. Akaike [17] proposes the following regret:

$$\Phi_{AIC}(\theta, N) = \frac{|\theta|_0}{N} \qquad (9)$$

where $|\theta|_0$ represents the total number of parameters. Schwarz [19] proposes a different regret, namely

$$\Phi_{BIC}(\theta, N) = \frac{|\theta|_0 \log N}{2N} \qquad (10)$$

In a statistically plausible interpretation of the world, Rissanen [20] obtains for the regret the shortest possible description of the model using the universal distribution function of Kolmogorov, hence the name Minimum Description Length:

$$\Phi_{MDL}(\theta, N) = \mathrm{CodingLength}_{\text{Kolmogorov p.d.f.}}\big(\mathrm{Model}(\theta, N)\big) \qquad (11)$$

Based on this interpretation, $\Phi(\theta, N)$ represents a measure of the model complexity. My approach here is the following. I propose the regret function

$$\Phi_{MDL\text{-}BSS}(\theta, N) = \log_2(L) + \frac{L \log_2(M)}{N} \qquad (12)$$

where M represents the precision in the optimization estimation of the delay parameters τ (for instance, the number of grid points of a 1-D exhaustive search). Thus the optimization in (8) is carried out in two steps. First, for fixed L, the log-likelihood is optimized over $\theta'$:

$$\hat{\theta}'_{MLE}(L) = \arg\max_{\theta'} P(X|\theta', L), \quad MLV(L) = P(X|\hat{\theta}'_{MLE}, L) \qquad (13)$$

Here MLV denotes the Maximum Likelihood Value. Then L is estimated via:

$$\hat{L}_{MDL\text{-}BSS} = \arg\min_L \left[ -\log(MLV(L)) + \log_2(L) + \frac{L \log_2(M)}{N} \right] \qquad (14)$$
In the next subsection I present the computation of the Maximum Likelihood Value (MLV). Then, in the following subsection, I argue for the particular form (12) of $\Phi(\theta, N)$, inspired by the MDL interpretation. In the same subsection I also present the difficulties in a straightforward application of the AIC or BIC criteria.

2.1 The Maximum Likelihood Value
The material in this subsection is presented in more detail in [4]; results are summarized here for the benefit of the reader. The constraint (H2) assumed in Section 1 can be recast by introducing the selection variable $V(k, \omega)$, with $V(k, \omega) = l$ iff $S_l(k, \omega) \neq 0$, and the complex amplitudes $G(k, \omega)$. Thus a slightly different parametrization of the model is obtained. The new set of parameters is $\psi = (\psi', L)$, where

$$\psi' = \left\{ \{(G(k, \omega), V(k, \omega));\ 1 \le k \le T,\ 1 \le \omega \le F\},\ (\tau_l)_{1 \le l \le L},\ \sigma^2 \right\} \qquad (15)$$

The signals in $\theta'$ are simply obtained through $S_{V(k,\omega)}(k, \omega) = G(k, \omega)$ and $S_l(k, \omega) = 0$ for $l \neq V(k, \omega)$.
The likelihood (4) becomes:

$$L(\psi) = \frac{1}{\pi^{DN} \sigma^{2DN}} \exp\left( -\frac{1}{\sigma^2} \sum_{(k,\omega)} \left\| X(k, \omega) - G(k, \omega) A_{V(k,\omega)}(\omega) \right\|^2 \right) \qquad (16)$$

where N is the number of time-frequency data points, and $A_l(\omega)$ denotes the l-th column of the matrix $A(\omega)$. The optimization over G is performed immediately, as a least squares problem. The optimal value is replaced in $L(\psi)$:

$$\log L\big((V)_{k,\omega}, (\tau_l)_l, L\big) = -DN \log(\pi) - DN \log(\sigma^2) - \frac{1}{\sigma^2} \sum_{k,\omega} \left( \| X(k, \omega) \|^2 - \frac{1}{D} |\langle X(k, \omega), A_{V(k,\omega)}(\omega) \rangle|^2 \right)$$

The optimization over $(V)_{k,\omega}$ and $(\tau_l)_{1 \le l \le L}$ is performed iteratively, as in the K-means algorithm:

– For a fixed set of delays $(\tau_l)_l$, the optimal selection variables are

$$V(k, \omega) = \arg\max_m |\langle X(k, \omega), A_m(\omega) \rangle| \qquad (17)$$

– For a fixed selection map $(V(k, \omega))_{k,\omega}$, consider the induced partition $\Pi_m = \{(k, \omega);\ V(k, \omega) = m\}$. Then $\tau_m$ is obtained by solving L 1-dimensional optimization problems

$$\tau_m = \arg\max_{\tau} \sum_{(k,\omega) \in \Pi_m} |\langle X(k, \omega), A_m(\omega; \tau) \rangle|^2 \qquad (18)$$
These steps are iterated until convergence is reached (usually in a relatively small number of steps, e.g. 10). Denote by $\hat{V}_{MLE}(k, \omega)$ and $\hat{\tau}_{l,MLE}$ the final values, and replace these values into L. The noise variance parameter is estimated by maximizing L over $\sigma^2$:

$$\hat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{(k,\omega)} \left( \| X(k, \omega) \|^2 - \frac{1}{D} |\langle X(k, \omega), A_{\hat{V}_{MLE}(k,\omega)}(\omega; \hat{\tau}_{MLE}) \rangle|^2 \right) \qquad (19)$$

Finally, the normalized log maximum likelihood value becomes:

$$\log(MLV(L)) = \frac{1}{N} \log\big(L(\hat{\psi}_{MLE}; L)\big) = -D \log(\pi) - 1 - D \log(\hat{\sigma}^2_{MLE}) \qquad (20)$$
where $\hat{\psi}_{MLE}$ denotes the optimal parameter set ψ containing the combined optimal values $(\hat{V}_{MLE}(k, \omega))_{(k,\omega)}$, $(\hat{G}_{MLE}(k, \omega))_{(k,\omega)}$, $(\hat{\tau}_l)_{1 \le l \le L}$, $\hat{\sigma}^2_{MLE}$.
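A compact sketch of this alternating maximization is given below. It assumes the time-frequency points have been flattened into a (D, K) complex array X with per-point frequencies in the array `omegas`, uses the ULA steering vectors of Section 1, and fixes the number of iterations rather than testing convergence.

```python
import numpy as np

def log_mlv(X, omegas, L, tau_grid, n_iter=10, seed=0):
    D, K = X.shape
    d = np.arange(D)[:, None]                                  # sensor index (D, 1)
    taus = np.random.default_rng(seed).choice(tau_grid, size=L)
    V = np.zeros(K, dtype=int)
    for _ in range(n_iter):
        # steering columns A_l(omega) for the current delays, shape (L, D, K)
        A = np.exp(-1j * omegas[None, None, :] * d[None, :, :] * taus[:, None, None])
        corr = np.abs(np.einsum('ldk,dk->lk', A.conj(), X))    # |<X, A_l>|
        V = np.argmax(corr, axis=0)                            # Eq. (17)
        for m in range(L):                                     # Eq. (18), 1-D grid search
            Xm, Om = X[:, V == m], omegas[V == m]
            scores = [np.sum(np.abs(np.sum(np.exp(1j * Om[None, :] * d * tau) * Xm,
                                           axis=0)) ** 2) for tau in tau_grid]
            taus[m] = tau_grid[int(np.argmax(scores))]
    A = np.exp(-1j * omegas[None, None, :] * d[None, :, :] * taus[:, None, None])
    best = np.abs(np.einsum('ldk,dk->lk', A.conj(), X))[V, np.arange(K)]
    sigma2 = np.mean(np.sum(np.abs(X) ** 2, axis=0) - best ** 2 / D)   # Eq. (19)
    return -D * np.log(np.pi) - 1.0 - D * np.log(sigma2)               # Eq. (20)
```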
2.2 Number of Sources Estimation
The next step is to establish the regret function. As mentioned earlier, the approach here is to use an estimate of the Minimum Description Length of the model (1) together with hypotheses (H1)-(H3). In general this is an impossible task, since Kolmogorov's universal distribution function is unknown. However, the L-dependent part of the model description is embodied in the mixing parameters $(\tau_l)_{1 \le l \le L}$ and the selection map $(V(k, \omega))_{(k,\omega)}$. Approximating by a uniform distribution in the space of delays with a finite discretization of, say,
M levels, and assuming no prior preferential treatment of one source signal versus the others, an upper bound on the description length is obtained as the code length of an entropic encoder for this data added to the description length of the entire sequence of models with respect to the Kolmogorov universal distribution:

$$l^*(\mathrm{Model}; N) \le L \log_2(M) + N \log_2(L) + C(\mathrm{Model}) \qquad (21)$$
This represents an upper bound, since $l^*(\mathrm{Model}; N)$ is supposed to represent the optimal (minimal) description length, whereas the description here splits into two parts: the sequence of models parametrized by ψ and N, and then, for a given (L, N), the entropic length of ψ. This clearly represents only one possible way of encoding the pair $(\mathrm{Model}(\psi), N)$. This discussion justifies the following choice for the regret function $\Phi_{MDL\text{-}BSS}$:

$$\Phi_{MDL\text{-}BSS}(L, N) = \frac{L \log_2(M) + N \log_2(L)}{N} = \log_2(L) + \frac{L \log_2(M)}{N} \qquad (22)$$
as mentioned earlier in (12). Before presenting experimental evidence supporting this approach, I would like to comment on the AIC and BIC criteria. The main difficulty comes from the estimation of the number of parameters. Notice that, using the θ description, the number of parameters is $LN + L + 2$, whereas in the ψ description this number is only $2N + L + 2$. The difference is due to the fact that the set of realizable signal vectors $(S_l)_{1 \le l \le L}$ lies in a collection of L 1-dimensional spaces. Thus this can be modeled either as a collection of L variables, or by 2 variables: a complex amplitude and a selection map V. Consequently, the regret function for AIC can be either $L + \frac{L+2}{N}$ or $2 + \frac{L+2}{N}$. Similarly, for BIC the regret function can be $\frac{L \log(N)}{2} + \frac{(L+2)\log(N)}{2N}$ or $\log(N) + \frac{(L+2)\log(N)}{2N}$. The criterion I propose in (22) interpolates between these two extrema and, in my opinion, better captures the actual size of the model parametrization.
3 Experimental Evaluation
Consider the following setup. A Uniform Linear Array (ULA) with a variable number of sensors ranging from 2 to 5, with a distance of 5 cm between adjacent sensors, records anechoic mixtures of signals coming from L ≤ 6 sources. The sources are spread uniformly with a minimum of 30 degrees separation. Additive Gaussian noise with an average SNR ranging from 10 dB to 100 dB has been added to the recordings. The signals were TIMIT voices sampled at 16 kHz, each of length 38000 samples (roughly 3 male and female voices saying "She had a dark suit in a greasy wash water all year"). For this setup, the noise was varied in 10 dB steps, and the number of sources ranged from 1 to 6. The delay optimization (18) was performed through a grid search with step 0.05 samples. Since $\tau_{max} = 2.4$, there were M = 96 possible values of τ. Thus $\frac{\log_2(M)}{N} = 1.7 \times 10^{-4}$, and the correction term $\frac{L \log_2(M)}{N}$ in $\Phi_{MDL\text{-}BSS}$ had no influence. Similarly, the $\frac{L}{N}$ term in AIC and the $\frac{L \log(N)}{N} = 3 \times 10^{-4} L$ term in BIC are too small. Therefore the only meaningful AIC and BIC were
given by the former regret functions. To summarize, the source number estimators are given by:

$$\hat{L}_{MDL\text{-}BSS} = \arg\min_L \left[ -\log MLV(L) + \log_2(L) \right] \qquad (23)$$
$$\hat{L}_{AIC} = \arg\min_L \left[ -\log MLV(L) + L \right] \qquad (24)$$
$$\hat{L}_{BIC} = \arg\min_L \left[ -\log MLV(L) + L \log(N) \right] \qquad (25)$$
where the optimization is done by exhaustive search over L in the range 1 to 10. For a total of 1680 experiments (10 noise levels × 4 sensor counts × 6 source counts × 7 realizations), the histogram of the estimation error has been obtained. For each of the three estimators, the histogram is rendered in Fig. 1, and the statistical performance of the estimators is presented in the accompanying table.
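Using the `log_mlv` sketch given earlier, the three criteria can be evaluated and minimized as follows; N here denotes the number of time-frequency points, as in the text.

```python
import numpy as np

def estimate_sources(X, omegas, tau_grid, N, L_max=10):
    Ls = np.arange(1, L_max + 1)
    neg = np.array([-log_mlv(X, omegas, L, tau_grid) for L in Ls])
    L_mdl = Ls[np.argmin(neg + np.log2(Ls))]           # Eq. (23)
    L_aic = Ls[np.argmin(neg + Ls)]                    # Eq. (24)
    L_bic = Ls[np.argmin(neg + Ls * np.log(N))]        # Eq. (25)
    return L_mdl, L_aic, L_bic
```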
Fig. 1. The histograms of estimation errors for MDL-BSS criterion (left bar), AIC criterion (middle bar), BIC criterion (right bar). Table with statistical performance of the three estimators.
4 Conclusions
The MDL-BSS estimator clearly performed best among the three estimators, since its error distribution is the most concentrated around zero in every sense: the number of errors is the smallest, the average error is the smallest, the variance is the smallest, and the bias is the smallest. The estimation error is explained by a combination of two factors: 1) the source signals (voices) do not satisfy hypothesis (H2); instead, there is always an overlap between the time-frequency signal supports; and 2) the estimates for location, noise variance, and separated signals were biased; this bias compounded and shifted the position of the minimum. The other two estimators (AIC and BIC) were biased towards underestimating the number of sources. This paper provides a solid theoretical footing for a statistical criterion to estimate the number of source signals in an anechoic BSS scenario with sparse signals. The extension to other mixing models (such as instantaneous) is straightforward: the regret function stays the same, and only the MLV is modified. The same approach can be applied to other Sparse Component Analysis problems, and this analysis will be done elsewhere. The numerical simulations confirmed the estimation performance.
340
R. Balan
References 1. Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. on Sig. Proc. 52(7), 1830–1847 (2004) 2. Rickard, S., Balan, R., Rosca, J.: Real-time time-frequency based blind source separation. In: Proc. ICA, pp. 651–656 (2001) 3. Balan, R., Rosca, J., Rickard, S.: Non-square blind source separation under coherent noise by beamforming and time-frequency masking. In: Proc. ICA (2003) 4. Balan, R., Rosca, J., Rickard, S.: Scalable non-square blind source separation in the presence of noise. In: ICASSP2003, Hong-Kong, China (April 2003) 5. Rosca, J., Borss, C., Balan, R.: Generalized sparse signal mixing model and application to noisy blind source separation. In: Proc. ICASSP (2004) 6. Balan, R., Rosca, J.: Convolutive demixing with sparse discrete prior models for markov sources. In: Proc. BSS-ICA (2006) 7. Balan, R., Rosca, J.: Map source separation using belief propagation networks. In: Proc. ASILOMAR (2006) 8. Aoki, M., Okamoto, M., Aoki, S., Matsui, H.: Sound source segregation based on estimating incident angle of each frequency component of input signals acquired by multiple microphones. Acoust. Sci. & Tech. 22(2), 149–157 (2001) 9. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. SAP 12(5), 530–538 (2004) 10. Georgiev, P., Theis, F., Cichocki, A.: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Tran. Neur.Net. 16(4), 992–996 (2005) 11. Cichocki, A., Li, Y., Georgiev, P., Amari, S.-I.: Beyond ica: Robust sparse signal representations. In: IEEE ISCAS Proc. pp. 684–687 (2004) 12. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. Wiley, Chichester (April 2002) 13. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994) 14. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995) 15. Annemuller, J., Kollmeier, B.: Amplitude modulation decorrelation for convolutive blind source separation. In: ICA, pp. 215–220 (2000) 16. Bofill, P., Zibulevsky, M.: Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform. In: Proc. ICA, Helsinki, Finland, pp. 87–92 (June 19-22, 2000) 17. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Aut. Cont. 19(6), 716–723 (1974) 18. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling. IEEE Trans. Inf. Th. 44(6), 2743–2760 (1998) 19. Schwarz, G.: Estimating the dimension of a model. Ann. Statist. 6(2), 461–464 (1978) 20. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
Compressed Sensing and Source Separation Thomas Blumensath and Mike Davies IDCOM & Joint Research Institute for Signal and Image Processing The University of Edinburgh, The King’s Buildings, Edinburgh, EH9 3JL, UK Tel.: +44(0)131 6505659; Fax.: +44(0)131 6506554 [email protected], [email protected]
Abstract. Separation of underdetermined mixtures is an important problem in signal processing that has attracted a great deal of attention over the years. Prior knowledge is required to solve such problems and one of the most common forms of structure exploited is sparsity. Another central problem in signal processing is sampling. Recently, it has been shown that it is possible to sample well below the Nyquist limit whenever the signal has additional structure. This theory is known as compressed sensing or compressive sampling and a wealth of theoretical insight has been gained for signals that permit a sparse representation. In this paper we point out several similarities between compressed sensing and source separation. We here mainly assume that the mixing system is known, i.e. we do not study blind source separation. With a particular view towards source separation, we extend some of the results in compressed sensing to more general overcomplete sparse representations and study the sensitivity of the solution to errors in the mixing system.
1
Compressed Sensing
Compressed sensing or compressive sampling is a new emerging technique in signal processing, coding and information theory. For a good place of departure see for example [1] and [2]. Assume that a signal y is to be measured. In general y is assumed to be a function defined on a continuous domain, however, for the discussion here it can be assumed to be a finite vector, i.e. y ∈ RNy say. In a standard DSP textbook we learn that one has to sample a function on a continuous domain at least at its Nyquist rate. However, assume that we know that y has a certain structure, for example we assume that y can be expressed as y = Φs,
(1)
Ny ×Ns
and where we allow Ns ≥ Ny , i.e. we allow Φs to be an where Φ ∈ R overcomplete representation of y. Crucially, we assume s to be sparse, i.e. we assume that only a small number of elements in s are non-zero or, more generally,
This research was supported by EPSRC grant D000246/1. MED acknowledges support of his position from the Scottish Funding Council and their support of the Joint Research Institute with the Heriot-Watt University as a component part of the Edinburgh Research Partnership.
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 341–348, 2007. c Springer-Verlag Berlin Heidelberg 2007
342
T. Blumensath and M. Davies
that most of the energy in s is concentrated in a few coefficients. It has recently been shown that, if a signal has such a sparse representation, then it is possible to take less samples (or measurements) from the signal than would be suggested by the Nyquist limit. Furthermore, one is then often still able to reconstruct the original signal using convex optimisation techniques [1] and [2]. The simplest scenario are measurements taken as follows: x = My,
(2)
where x ∈ RNx with Nx < Ny . Extensions to noisy measurements can be made [2], i.e. one can consider the problem of approximating y given a noisy measurement: ˜ = x + e = My + e. x
(3)
The ability to reconstruct the original signal relies heavily on the structure of y and different conditions have been derived under which one can exactly or approximately recover y. For example, if y = Φs and s has only a small number of non-zero elements, then linear programming can exactly recover y if enough measurements have been taken. Similar results have been derived in the case where s is not exactly sparse, but where the ordered coefficients in s decay with a power law. In this case y can be recovered up to some small error. We give examples of these theorems below. More details can be found in for example [2] and the references therein.
2
Relationship to Source Separations
Sparsity has also often been exploited for source separation. In particular, the problem of underdetermined blind source separation has been solved using the fact that an orthogonal transform can often be found in which the data is sparse [3] [4] [5] [6]. More general, possibly over-complete dictionaries have been used for source separation in [7]. Let us assume a quite general source separation scenario. A set of sources, say g1 , g2 , . . . , each represented in a column vector, are collected into a matrix T
G = [g1 g2 . . .]
(4)
and similarly, the set of observations f1 , f2 , . . . are gathered in a matrix F. The relationship between the sources G and the observations F is then modelled by a general linear operator A: F = A(G). (5) Note that this operator does not have to be a matrix and can also represent for example convolutions, so that the above model incorporates a wide range of source separation problems. The connection to compressive sampling becomes evident if instead of collecting the sources and observations in a matrix, we interleave them into vectors as: T
x = [f1 [1] f2 [1] . . . f1 [2] f2 [2] . . .] and
(6)
Compressed Sensing and Source Separation
y = [g1 [1] g2 [1] . . . g1 [2] g2 [2] . . .]T .
343
(7)
We further assume that the operator A can be expressed in matrix form M so that the mixing system becomes: x = My,
(8) 1
which is exactly the compressive sampling measurement equation . If we have more sources g than observations f , where the length of each observation and each source is assumed to be equal, then we have less measurements than samples in y. We therefore require knowledge of additional structure if we want to be able to (approximately) reconstruct y. We can, for example, assume the existence of a sparse representation of y of the form y = Φs, where s is sparse. Note that we do not assume that Φ is an orthogonal transform and explicitly allow Φ to be overcomplete, i.e. to have more columns than rows. Let us look at a simple example of an instantaneous mixture. In this case, A is the Nf × Ng mixing matrix and the matrix M becomes matrix diagonal: ⎡ ⎤ A 0 ··· 0 ⎢ 0 A ··· 0 ⎥ ⎢ ⎥ (9) M=⎢ . ⎥, .. ⎣ .. . ⎦ 0 0 ··· A Where 0 is a Nf × Ng matrix of zeros. Similarly, assume a convolutive model, in which the impulse responses, say h1,1 , h2,1 , h3,1 and h1,2 , h2,2 , h3,2 and so on are interleaved into the matrix H as follows: ⎤ ⎡ h1,1 [n] h2,1 [n] h3,1 [n] h1,1 [n − 1] · · · ⎥ ⎢ (10) H = ⎣ h1,2 [n] h2,2 [n] h3,2 [n] h1,2 [n − 1] · · · ⎦ . .. . The measuring matrix then becomes: ⎡
H ···0 ⎢0 H ···0 ⎢ M=⎢. .. ⎣ .. . 0 0 ···H
⎤ ⎥ ⎥ ⎥, ⎦
(11)
where again 0 is a Nf × Ng matrix of zeros. Also, depending on boundary assumptions, the first and last rows of M might only contain part of the matrix H.
3
Theoretic Results
The source separation problem is equivalent to the decoding problem faced in compressive sampling and theoretical results from the compressive sampling literature therefore also apply to the source separation problem. However, in 1
We could have alternatively stacked the vectors f1 , f2 , . . . (and/or g1 , g2 , . . . ) on top of each other to produce a permutation of the above model.
344
T. Blumensath and M. Davies
compressive sampling, the measurement matrix M can often be ‘designed’ to fulfil certain conditions2 . Furthermore, in the current compressive sampling literature, Φ is normally assumed to be the identity matrix or an orthogonal transform. In source separation, the measuring system is not normally at our control. Furthermore, orthogonal transform are often not available to sufficiently ‘sparsify’ many signals of interest. In this paper we address these problems and extend several results from the compressed sensing literature to more general sparse representations. We start by reviewing some of the important results on compressed sensing [8], which we here write in terms of the m-restricted isometry condition of the matrix P = MΦ. 3.1
m-Restricted Isometry
For any matrix P and integer m, define the m-restricted isometry δm (P) as the smallest quantity such that: (1 − δm (P)) ≤
PΓ y22 ≤ (1 + δm (P)), y22
(12)
for all Γ : |Γ | ≤ m and all y. Here |Γ | is a set of indices and PΓ the associated submatrix of P with all columns removed apart from those with indices in Γ . δm is then a measure of how much any sub-matrices of P with size m can change the norm of a vector, hence the name. The quantities (1 − δm (P)) and (1 + δm (P)) can be understood as lower and upper bounds on the squared singular values of all possible sub-matrices of P with m or less columns. 3.2
Estimation Error Bounds
As examples of the types of theorems available in the compressive sampling literature, we here state two of the fundamental results (these can be found in [2] and references therein), which rely on δm (MΦ) to be small3 . Theorem 1. (Exact Recovery) Assume that s has a maximum of m non-zero coefficients and that x = MΦs and that δ2m (MΦ) + δ3m (MΦ) < 1, then the solution to the linear program: min ˆs, such that x = MΦˆs
(13)
recovers the exact representation y = Φs. 2 3
‘Design’ here often means taking a random matrix drawn from certain distributions. Note that Georgiev et al. [9] have also studied a similar problem in relationship to blind source separation. However, the results in [9] are concerned with identifiability of both y and A. The theorems given here assume knowledge of A but are stronger than those in [9] in that they also states that we can use convex optimisation methods to identify the sources. Furthermore, theorem 2 is valid for more general sources and does not require the existence of an exact m-term representation.
Compressed Sensing and Source Separation
345
A similar result can be derived for noisy observations and a further generalisation was derived in [8] for signals for which the original signal is not m-sparse, but has a power law decay, i.e. the magnitude of the ordered coefficients decays as 1 |sik | ≤ Ck − p , where p ≤ 1. In particular, for an i.i.d. Gaussian observation error with variance σ 2 , we have: Theorem 2. (Dantzig selector) Assume that s can be reordered so that |sik | ≤ 1 ck − p√for p ≤1. For some m assume that δ2m (MΦ) + 3δ3m (MΦ) < 1. For λ = 2 log Nx the solution to: min ˆ y1 : MT (x − Mˆ y)∞ ≤ λσ obeys the bound:
(14)
p
s − ˜s22 ≤ C2(log Nx )σ p c1− 2 .
(15)
This can be extend trivially to a bound on the error in the signal space: p
˜ 22 = Φs − Φ˜s22 ≤ C2(log Nx )σ p c1− 2 Φ22 . y − y
3.3
(16)
Random Mixing Conditions
In source separation, the mixing system should be considered independently from the dictionary Φ in which the signal has a sparse representation. The theorems above are based on δm (MΦ), which is required to be small. In this and the next section we derive new results that give insight into this quantity by considering M and Φ separately. The first theorem is a slight modification from [10]4,5 : Theorem 3. Assume that M ∈ RNx ×Ny is a random matrix with columns drawn uniformly from the unit sphere and let Φ ∈ RNy ×Ns have restricted isometry δm (Φ) < 1, then there exists a constant c, such that for m ≤ cNx log(Ns /m): (1 − δPm (M)) ≤
MΦΓ s22 ≤ (1 + δPm (M)), Φs22
(17)
and (1 − δm (Φ))(1 − δPm (M)) ≤
MΦΓ s22 ≤ (1 + δPm (M))(1 + δm (Φ)), s22
(18)
holds with probability ≥ 1 − 2(eNs /m)m (12/δPm )m e− 4
5
Nx 2
2 3 (δP /8−δm /24) m
.
(19)
Note, that there are a range of other distributions for which this theorem would hold, see [10] for details. Since the first submission of this manuscript we became aware of the paper [11], which contains very similar results.
346
T. Blumensath and M. Davies
Proof (Outline). The proof that equation (17) holds is similar to the proof given in [10] with the only difference that Theorem 5.1 in [10] can be shown to hold for any m dimensional subspace, and where N in theorem 5.2 in [10] can be replaced by Ns . The restricted isometry in equation (18) then follows by bounding ||Φs||22 from above and below using the restricted isometry (1 − δm (Φ)) ≤ ||Φs||22 ≤ (1 + δm (Φ)). Therefore, for any dictionary Φ with δm (Φ) < 1 and for M sampled uniformly from the unit sphere, δm (MΦ) ≤ δm (Φ) + δPm (M ) + δm (Φ)δPm (M ) with high probability, whenever m ≤ CNx / log(Ns /Nx ). 3.4
Non-random Mixing Matrix Conditions
Unfortunately, theorem 3 assumes randomly generated mixing systems, which is rather restrictive. We therefore derive conditions that relate the measurement matrix M , the dictionary Φ and δm (MΦ). To bound δm (MΦ) we define the (to our knowledge novel) concept of M coherence: μM (Φ) = max |φTi MT Mφj |. i,j:i=j
(20)
This quantity measures the coherence in the dictionary as ‘seen through’ the measuring matrix. We also need the quantities: amin = min Mφi 22 and amax = max Mφi 22 , i
i
(21)
which measure how much the measuring matrix can deform elements of the dictionary. We assume that amin ≥ mμM (Φ), then by the Gersgorin disk theorem for the eigenvalues of ΦΓ MT MΦΓ , we find that all squared singular values σ 2 of the matrix MΦΓ with |Γ | ≤ m are bounded by: amin − mμM (Φ) ≤ σ 2 ≤ amax + mμM (Φ).
(22)
We therefore have the bound: amin − mμM (Φ) ≤
MΦΓ s22 ≤ amax + mμM (Φ). s22
(23)
Using a = max{amax − 1, 1 − amin } we have6 the bound on δm (MΦ) of δm (MΦ) ≤ a + mμM (Φ),
(24)
which is in terms of quantities that are easy to determine for a given dictionary Φ and measurement matrix M. 6
Note that we have the bound M2 ≥ amax ≥ amin ≥ 0.
Compressed Sensing and Source Separation
3.5
347
Sensitivity to Errors in M
In most source separation applications the mixing system is not given a priori and has to be estimated. This leads to the question of robustness of the method to errors in the estimation of the measuring matrix M. ˜ = M + N and estimated Assume we have an estimated mixing system M ˜ s and such that ˜s is supported on m ˜ = MΦ˜ sources ˜s such that x ˜ elements. Also assume that x = MΦs is the true generating system with s supported on m elements. Further assume that ˜ x − x2 ≤ 2. ˜ ( MΦ) < 1, then we have the bound: If δm+m ˜ ˜ 2 ≤ y − y
˜ ˜ s2 Φ2 MΦs − MΦ˜ . ˜ 1 − δm+m ˜ (MΦ)
(25)
˜ ˜ s ≤ 2 together with Replacing MΦs with MΦs+ NΦs and using MΦs− MΦ˜ the triangle inequality, we get the bound: 2 + Ny2 Φ2 2 + N2 Φ2 y2 ˜ 2 ≤ y − y ≤ . ˜ ˜ 1 − δm+m ( MΦ) 1 − δ ( MΦ) ˜ m+m ˜
4
(26)
Discussion and Conclusion
Underdetermined mixtures are a form of compressive sampling. Source separation is therefore equivalent to the decoding problem faced in compressive sampling. This equivalence opens up many new lines of enquiry, both in compressive sampling and in source separation. On the one hand, as done in this paper, results form compressive sampling shed new insight into the source separation problem. For example, theorems 1 and 2 state that for signals with a sparse underlying representation, whether exact, or with decaying coefficients, convex optimisation techniques can be used to recover or approximate the original signal from a lower dimensional observation. Furthermore, results from compressive sampling give bounds on the estimation error for sources that have a sparse representation with decaying coefficients. The error is a function of this coefficient decay and properties of the matrix mapping the sparse representation into the mixed domain. In the source separation literature, linear programming techniques have been a common approach [3] [4] [7] and the new theory gives additional justification for the application of these techniques and, what is more, provides estimation bounds for certain problems. The main novel contribution of this paper was an extension of recent results from compressive sampling to allow for more general, possibly over-complete dictionaries for the sparse representation. In source separation, the mixing matrix is in general unrelated to the dictionary and the main contribution of this paper was to derive conditions on the dictionary, the mixing system and their interaction that allow the application of standard compressive sampling results to
348
T. Blumensath and M. Davies
the more general source separation problem. In particular we have disentangled the dictionary and the measurement matrix and could shown that for randomly generated mixing systems, the required conditions hold with high probability. For more general mixing systems, we have presented bounds on this condition, which are functions of simple to establish properties of the mixing system, the overcomplete dictionary and their interaction. If the mixing system has to be estimated as in many source separation settings, errors in this estimate will influence the estimates of the sources. The theory in subsection 3.5 gives bounds on this error. Not only does source separation benefit from progress made in compressive sampling, compressive sampling has also much to learn from the extensive work done on source separation. For example, in source separation, the mixing system is not known in general and has to be estimated together with the sources. Many different estimation techniques have therefore been developed in the source separation community able to estimate the mixing system. This suggests an extension of compressive sampling to blind compressive sampling (BCS). Different scenarios seem possible depending on the application and, in our notation, either Φ or M (or both) might be unknown or known only approximately. Preliminary work in this direction has shown encouraging first results and more formal studies are currently undertaken.
References 1. Donoho, D.: Compressed sensing. IEEE Trans. on Information Theory 52(4), 1289– 1306 (2006) 2. Cand`es, E.: Compressive sampling. In: Proceedings of the International Congress of Mathematics, Madrid, Spain (2006) 3. Lewicki, M.S., Sejnowski, T.J.: Learning overcomplete representations. Neural Computation 12, 337–365 (2000) 4. Zibulevsky, M., Bofill, P.: Underdetermined blind source separation using sparse representations. Signal Processing 81(11), 2353–2362 (2001) 5. Davies, M., Mitianoudis, N.: A simple mixture model for sparse overcomplete ICA. IEE Proc.-Vision, Image and Signal Processing 151(1), 35–43 (2004) 6. Fevotte, C., Godsill, S.: A bayesian approach for blind separation of sparse sources. IEEE Transactions on Speech and Audio Processing PP(99), 1–15 (2005) 7. Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13(4), 863–882 (2001) 8. Candes, E., Tao, T.: The dantzig selector: statistical estimation when p is larger than n. manuscript (2005) 9. Georgiev, P., Theis, F., Cichocki, A.: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Transactions on Neural Networks 16(4), 992–996 (2005) 10. Baraniuk, R., Davenport, M., De Vore, R., Wakin, M.: The Johnson-Lindenstrauss lemma meets compressed sensing (2006) 11. Rauhut, H., Schnass, K., Vandergheynst, P.: Compressed sensing and redundant dictionaries. IEEE Transactions on Information Theory (submitted, 2007)
Morphological Diversity and Sparsity in Blind Source Separation J. Bobin1 , Y. Moudden1 , J. Fadili2 , and J.-L. Starck1 CEA-DAPNIA/SEDI, Service d’Astrophysique, CEA/Saclay, 91191 Gif sur Yvette, France [email protected], [email protected], [email protected] 2 GREYC CNRS UMR 6072, Image Processing Group, ENSICAEN 14050, Caen Cedex, France [email protected] 1
Abstract. This paper describes a new blind source separation method for instantaneous linear mixtures. This new method coined GMCA (Generalized Morphological Component Analysis) relies on morphological diversity. It provides new insights on the use of sparsity for blind source separation in a noisy environment. GMCA takes advantage of the sparse representation of structured data in large overcomplete signal dictionaries to separate sources based on their morphology. In this paper, we define morphological diversity and focus on its ability to be a helpful source of diversity between the signals we wish to separate. We introduce the blind GMCA algorithm and we show that it leads to good results in the overdetermined blind source separation problem from noisy mixtures. Both theoretical and algorithmic comparisons between morphological diversity and independence-based separation techniques are given. The effectiveness of the proposed scheme is confirmed in several numerical experiments.
Introduction Hereafter, we address the classical blind source separation problem. The m × t data matrix X is the concatenation of m mixtures {xi }i=1,··· ,m each of which being the instantaneous linear combination of n sources {si }i=1,··· ,n stored in the n × t matrix S: X = AS + N (1) where A is the mixing matrix and N models noise or model imperfections. In this setting, the aim of blind source separation (BSS) techniques is to estimate both the sources S and the mixing matrix A. BSS is clearly an ill-posed inverse problem which requires additional prior information in order to be solved. Previous work addressing BSS issues clearly emphasized on the need for diversity between the sources to be separated. From a statistical point of view, ICA-like source separation methods use statistical independence (more precisely mutual information) as a kind of “diversity measure” to distinguish between the sources. In [1], the authors proved that maximizing any measure of independence M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 349–356, 2007. c Springer-Verlag Berlin Heidelberg 2007
350
J. Bobin et al.
is equivalent to minimizing mutual information. ICA algorithms are then devised according to particular approximations of mutual information. Recently, sparsity has raised interest in a wide range of applications. Briefly, a signal is said to be sparse in representation Φ if most of the entries of α such that x = αΦ are almost zero and only a few have significant amplitudes. Sparsity-based BSS methods have recently been devised. In [2], a BSS algorithm is described in which it is taken advantage of sparsity to enhance the diversity between independent sources. Several studies (see [3] and references therein) have explored the extreme sparse case as they considered sources with strictly disjoint (and thus orthogonal) supports. In Section 1, we define a particular sparsity-based diversity measure coined morphological diversity. We propose a new effective BSS algorithm coined GMCA which separates the mixed sources based on their morphological diversity. In Section 2, numerical experiments are given showing how GMCA performs well to separate sources from noisy mixtures.
1
The GMCA Framework
Notations and Definitions Let x be a 1×t signal and Φ a signal dictionary. For the sake of simplicity, we will first assume that Φ is orthonormal. In this case, x has a unique representation α T in Φ such that x = αΦ readily as α = xΦ . The support S0,Φ (x) of x obtained in Φ is defined as S0,Φ (x) = t α[t]| > 0 where α[t] is the t-th entry of α. Let us also define the δ-support of x in Φ as : Sδ,Φ (x) = tα[t]| > δx∞ . We then say that two sources s1 and s2 are δ-disjoint in Φ if Sδ,Φ (s1 ) ∩ Sδ,Φ (s2 ) = ∅. Sources with strictly disjoint supports in Φ are obviously δ-disjoint with δ = 0. 1.1
Generalized Morphological Component Analysis
Sparse coding: Let us first assume that the mixing matrix A is known. In the GMCA framework, the data are modelled as a linear combination of several sources as in Equation 1. Furthermore, the sources {si }i=1,··· ,n are assumed to be the linear combination of so-called morphological components (see [4]) : D si = k=1 ϕik . By definition, those morphological components are assumed to be sparse in different orthonormal bases {Φk }k=1,··· ,D . Based on these assumptions, the GMCA algorithm endeavors to estimate the sources via the estimation of those morphological components : {ϕik } = Arg min X − AS22 + 2λ {ϕik }
n D
ϕik ΦTk 1
(2)
i=1 k=1
In [5], we proposed solving this optimization problem by estimating iteratively and alternately each multichannel morphological component {ϕik } via a “blockcoordinate”-like algorithm (see [6]). The product AS is then split into n × D
Morphological Diversity and Sparsity in Blind Source Separation
351
terms. Introducing the data residual Xik = X − {j,l}={i,k} ai ϕjl , where ai is the i-th column of A, the morphological components are estimated one at a time according to : ϕik = Arg minϕik Xik − ai ϕik 22 + 2λϕik ΦTk 1 . This equation has an exact solution known as soft-thresholding (see [7]). This sparse decomposition is closely linked to a sparse coding stage as already exposed in [8]. Dictionary learning: In the previous paragraph, we assumed that the mixing matrix was known and we showed that estimating the morphological components (and thus the sources) boils down to a sparse coding step. We consider now that the morphological components are fixed and we want to learn the mixing mixing matrix A. This dictionary learning issue has already been addressed by extensive work for a wide range of applications. Refer to [8] and references therein for more on that question. Following the same estimation scheme we introduced previously, we propose to estimate each column of A assuming the morphological components are fixed as follows: ai = Arg minai X − j=i aj sj − ai si 22 . This update clearly leads to a least-squares estimate of the columns of A : ai = j T X − j=i a sj si /si 22 . 1.2
The GMCA Algorithm for Blind Source Separation
Owing to the “block-coordinate”-like structure of our optimization scheme, for a fixed threshold λ, the blind GMCA algorithm estimates alternately the different parameters in the model i.e. the columns of A and the morphological components. The blind GMCA algorithm is as follows. 1. Set the number of iterations Imax and threshold λ(0) 2. While λ(h) is higher than a given lower bound λmin (e.g. can depend on the noise variance), For i = 1, · · · , n For k = 1, · · · , D (h) • Compute the residual term rik assuming the current estimates of ϕ{pq}={ik} , (h−1) ϕ ˜{pq}={ik} are fixed: (h−1) T (h−1) (h−1) (h) rik = a X − {p,q}={i,k} a ˜i ˜p ϕ ˜{pq} (h)
• Estimate the current coefficients of ϕ ˜ik by Thresholding with threshold λ(h) : (h) (h) T α ˜ ik = Δλ(h) rik Φk (h)
• Get the new estimate of ϕik by reconstructing from the selected coefficients α ˜ ik : (h)
(h)
˜ ik Φk ϕ ˜ik = α (h)
(h)
p=k Update ai assuming the morphological components ϕ˜pq are fixed : a and(h−1) (h) (h)T (h) (h) i(h) p 1 s ˜ a ˜ = (h) 2 X − n a ˜ s ˜ where s˜i = D ˜ik p i p=i k=1 ϕ ˜ si
2
– Decrease the thresholds λ(h) following a given strategy
352
J. Bobin et al.
Note that the value of λ fixes a certain sparsity level in the sparse coding stage. When λ is “high”, the sparse coding step will select the most “significant” features in the data which are very likely to belong to the true morphological components. As already introduced in [5] and [7], the threshold λ decreases towards λ(min) progressively incorporating new features. The purpose of such a thresholding scheme is twofold: i) it provides numerical stability to the algorithm, ii) it gives robusteness to noise as the morphological components {ϕik } are first estimated from their most significant coefficients in {Φk }. The sparse coding step is quite similar to a thresholding-based “denoising”. Handling noisy mixtures then boils down to fixing the final threshold λmin . Typically, in the white Gaussian noise case, λmin = 3σ where σ is the noise standard deviation. 1.3
A Fast GMCA Algorithm
When Φ is orthogonal: Note that the above GMCA algorithm is a multichannel extension of MCA (Morphological Component Analysis - see [9,4]) which has been devised in the single channel case. In [4], we showed that MCA is likely to solve the 0 decomposition (and thus the 1 decomposition when the two problems are equivalent - see [10] and references therein) of sparse signals in the overcomplete dictionary Φ = [Φ1 , · · · , ΦD ]. Nevertheless, in the multichannel case, this sparse coding step requires the use of D transforms for each of the n sources leading to a prohibitive computational cost. Interestingly, if we restrict ourselves to the case where Φ is orthonormal, the problem in Equation 2 becomes simpler: n ΘS 1 (3) ΘS = Arg min ΘX − AΘS 22 + 2λ ΘS
i=1
T
where ΘS = SΦ . The sources and the mixing matrix can be estimated in the sparse domain Φ which drastically reduces the computational burden. Assuming that A is nearly orthonormal1 leads to a very simple sparse coding step:
(4) ΘS = Δλ A† ΘX where Δλ (.) is the soft-thresholding operator with threshold λ and A† is the pseudo-inverse of A. In that setting, estimating A leads straightforwardly to a simple least-squares estimate : −1 (5) A = ΘX ΘS T ΘS ΘS T Interestingly, we show in [5] that alternating the updates in Equation 4 and Equation 5 provides a fixed-pointalgorithm the convergence condition of which T T is the following: ΘS Δλ (ΘS ) = Δλ (ΘS ) Δλ (ΘS ) In [5], we give heuristics supporting the good convergence of our algorithm. In the same paper, we also show that the same fast blind GMCA algorithm can be used with non-orthonormal Φ. 1
In practice, even if this assumption is very stringent and seldom true, the algorithm performs well.
Morphological Diversity and Sparsity in Blind Source Separation
1.4
353
Morphological Diversity
At the beginning of this section we introduced a particular sparsity-based diversity measure to distinguish the sources. A classical sparsity-based diversity measure (see [3]) leads to separate sources with strictly disjoint supports in a sparsifying representation Φ. Nevertheless, most natural signals seldom have strictly disjoint supports in most practical dictionaries Φ (e.g. discrete cosine, wavelets, bandlets ... etc.). In [7], we slightly relaxed this assumption by considering sources with disjoint supports in an overcomplete signal dictionary made of a union of orthonormal bases Φ = [Φ1 T , Φ2 T ]T . In this paper, we introduce a new sparsity-based diversity measure which relaxes the strict disjoint support assumption. The genesis - a deterministic diversity measure: In the next section, we switch from the deterministic point of view we adopted in the above, and examine the concept of morphological diversity from the statistical side. In the former viewpoint, separable sources are such that there exists a sparse representation Φ in which these signals have δ-disjoint supports. Heuristically, given sources e.g. images which are “visually” and in that sense morphologically different, there exists a dictionary Φ and a value of δ for which these sources are sparse and have δ-disjoint supports. In that setting, the way Φ is chosen specifies which signals are distinguishable. In practice, in image processing, taking Φ to be the union of the curvelet frame [11] and the local discrete cosine representation leads to good separation results for a wide range of images. From a probabilistic viewpoint: The minimization problem in Equation 2 can also be interpreted as Maximum A Posteriori estimator of the morphological components assuming: 1) the coefficients of the morphological components are generated independently from the same Laplacian law with zero mean and precision λ, 2) the entries of the mixing matrix are uniformly distributed, 3) the additive noise follows a Gaussian distribution with zero mean and identity covariance matrix. Then what is the meaning of morphological diversity in a statistical framework? The point is that sources generated independently from the same iid sparse stochastic process are very likely to have δ-disjoint supports for some value of δ. For instance, let us assume that the sources s1 and s2 are independently generated from the same Laplacian probability density in the sparse Φ-domain. Indeed, each entry of the coefficient vector αi=1,2 is drawn according to: Pα (αi [k]) = μ2 exp (−μ|αi [k]|). We would like to assess the probability for such sources to have δ-disjoint supports. Define the proposition Hτ = “|α1 [t ]| = α1 ∞ > τ, |α2 [t ]| = α2 ∞ > τ and ∀t = t , |αi=1,2 [t]| ≤ τ ”. Hτ states that s1 and s2 have strictly τ -joint supports; otherwise, if Hτ is false then s1 and s2 have at least τ -disjoint supports. We define P|α|>τ = Pα (|α| > τ ) and P|α|≤τ = Pα (|α| ≤ τ ). As the entries of each vector αi=1,2 are independently generated from the same probability density function Pα , then P (Hτ ) si such that: P (Hτ ) = T exp (−2μτ ) (1 − exp (−μτ ))2(T −1) . As, in practice, T = dN 1, sources generated independently from the same sparse probability density function are τ -disjoint. In other words, from a statistical viewpoint, sparse
354
J. Bobin et al.
independent sources are morphologically diverse with very high probability. In that sense a separation technique based on morphological diversity is closely related to ICA in a statistical framework. ICA and GMCA from an algorithmic viewpoint: The fast blind GMCA method introduced in section 1.3 can be expressed as a fixed-point algorithm the conT vergence condition of which asks that the matrix ΘS Δλ (ΘS ) be symmetric, for all values of λ. Interestingly, as summarized in [12], this condition can be related to the convergence condition of some ICA algorithms which require the symmetry of matrix E {f (BX)BX} where B is a demixing matrix and f (.) is the so-called score function. The thresholding operator Δλ (.) in blind GMCA is similar in its role to the score function of ICA algorithms. A specific and important feature of the thresholding Δλ (.) is that it evolves as the threshold λ decreases from one iteration to the next of the blind GMCA algorithm. In [5], we give heuristics showing that this “evolving” score function is likely to avoid local false mimima of the objective thus providing some numerical stability to the algorithm. A clear difference lies in the estimation of A instead of a demixing matrix B; this distinction is important as GMCA is also designed to handle data in a noisy environment.
2
Results
The last paragraph emphasized on sparsity as the key for very efficient source separation methods. In this section, we will compare several BSS techniques with GMCA in an image separation context. We choose 3 different reference BSS methods: i) JADE: the well-known ICA (Independent Component Analysis) based on fourth-order statistics (see [13]), ii) Relative Newton Algorithm: the separation technique we already mentioned. This seminal work (see [14]) paved the way for sparsity in Blind Source Separation. In the next experiments, we used the Relative Newton Algorithm (RNA) on the data transformed by a basic orthogonal bidimensional wavelet transform (2D-DWT), iii) EFICA: this separation method improves the FastICA algorithm for sources following generalized Gaussian distributions (which can be well-suited for some sparse signals). EFICA was also applied after a 2D-DWT of the data where the assumptions on the source distributions are appropriate. Figure 1 shows the original sources (top pictures) and the 2 mixtures (bottom pictures). The original sources s1 and s2 have unit variance. The matrix A that mixes the sources is such that x1 = 0.25s1 + 0.5s2 + n1 and x2 = −0.75s1 + 0.5s2 + n2 where n1 and n2 are Gaussian noise vectors (with decorrelated samples) such that the SNR equals 10dB. The noise covariance matrix ΓN is diagonal. Figure 2 depicts the behav˜ † A1 (A ˜ is the estimate of ior of the mixing matrix criterion ΩA = In − PA A) as the signal-to-noise ratio (SNR in dB) increases. When the mixing matrix is perfectly estimated, ΩA = 0, otherwise ΩA > 0. First, JADE does not perform well; it points out the importance of choosing an appropriate diversity measure to separate the sources. Thus, fourth-order statistics are not well suited to the source images in these experiments. Secondly, RNA and EFICA behave
Morphological Diversity and Sparsity in Blind Source Separation
355
Fig. 1. Left pictures: the 256 × 256 source images. Right pictures: two different mixtures. Gaussian noise is added such that the SNR is equal to 10dB.
Mixing Matrix Criterion
2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 5
10
15
20
25
30
35
40
SNR in dB
Fig. 2. Evolution of the mixing matrix criterion ΔA as the noise variance varies: solid line: GMCA, dashed line: JADE, (): EFICA, (+): RNA. Abscissa: SNR in dB. Ordinate: mixing matrix criterion value.
rather similarly. Finally, GMCA provides the best results giving mixing matrix criterion values that are approximately 2 times lower than RNA and EFICA. These results clearly show that i) sparsity enhances the distinguishability of the sources, ii) morphological diversity is a well-performing diversity measure and GMCA is well suited to account for that measure.
3
Conclusion
In this paper we introduced a new diversity measure to distinguish between sources: the morphological diversity. It states that morphologically diverse signals should be separated in so-called sparse representations. The recent advances in harmonic analysis and overcomplete representation theory make morphological diversity a practical way to disentangle source processes from mixtures. We proposed a new algorithm coined blind GMCA (Generalized Morphological Component Analysis) to address the blind source separation problem based on morphological diversity. Numerical experiments show that GMCA performs notably well. Furthermore, GMCA is an effective algorithm designed to handle noisy mixtures. Further work will focus on extending the algorithm to the underdetermined blind source separation issue.
356
J. Bobin et al.
References 1. Lee, T.W., Girolami, M., Bell, A.J., Sejnowski, T.J.: A unifying informationtheoretic framework for independent component analysis (1998) 2. Zibulevsky, M., Pearlmutter, B.B.: Blind source separation by sparse decomposition. Neural Computations 13/4 (2001) 3. Li, Y., Amari, S., Cichocki, A., Ho, D.W.C., Xie, S.: Underdetermined blind source separation based on sparse representation. IEEE Transactions on signal processing 54, 423–437 (2006) 4. Bobin, J., Starck, J-L., Fadili, J., Moudden, Y., Donoho, D.L.: Morphological component analysis: new results. IEEE Transactions on Image Processing - revised (2006), available at http://perso.orange.fr/jbobin/pubs2.html 5. Bobin, J., Starck, J.-L., Fadili, J., Moudden, Y.: Sparsity and morphological diversity in blind source separation. IEEE Transactions on Image Processing - submitted (2007), available at http://perso.orange.fr/jbobin/pubs2.html 6. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimizations. J. of Optim. Theory and Appl. 109(3), 457–494 (2001) 7. Bobin, J., Moudden, Y., Starck, J.L., Elad, M.: Morphological diversity and source separation. IEEE Signal Processing Letters 13(7), 409–412 (2006) 8. Aharon, M., Elad, M., Bruckstein, A.: k-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11), 4311–4322 (2006) 9. Elad, M., Starck, J.L., Donoho, D., Querre, P.: Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). ACHA 19(3), 340–358 (2005) 10. Tropp, T.: Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory 50(10), 2231–2242 (2004) 11. Candes, E., Demanet, L., Donoho, D., Ying, L.: Fast discrete curvelet transforms. SIAM Multiscale Model. Simul. 5/3, 861–899 (2006) 12. Cichocki, A.: New tools for extraction of source signals and denoising. In: Proc. SPIE 2005, Bellingham. vol. 5818, pp. 11–24 (2005) 13. Cardoso, J.F.: Blind signal separation: statistical principles. Proceedings of the IEEE. Special issue on blind identification and estimation 9(10), 2009–2025 (1998) 14. Zibulevski, M.: Blind source separation with relative newton method. In: Proccedings ICA2003, pp. 897–902 (2003)
Identifiability Conditions and Subspace Clustering in Sparse BSS Pando Georgiev1, Fabian Theis2 , and Anca Ralescu1 1
2
Computer Science Department University of Cincinnati Cincinnati, OH 45221, USA {pgeorgie,aralescu}@ececs.uc.edu Institute of Biophysics, University of Regensburg D-93040 Regensburg, Germany [email protected]
Abstract. We give general identifiability conditions on the source matrix in Blind Signal Separation problem. They refine some previously known ones. We develop a subspace clustering algorithm, which is a generalization of the k-plane clustering algorithm, and is suitable for separation of sparse mixtures with bigger sparsity (i.e. when the number of the sensors is bigger at least by 2 than the number of non-zero elements in most of the columns of the source matrix). We demonstrate our algorithm by examples in the square and underdetermined cases. The latter confirms the new identifiability conditions which require less hyperplanes in the data for full recovery of the sources and the mixing matrix.
1 Introduction A goal of the Blind Signal Separation (BSS) is the recovering of underlying source signals of some given set of observations obtained by an unknown linear mixture of the sources. In order to decompose the data set, different assumptions on the sources have to be made. The most common assumption nowadays is statistical independence of the sources, which leads to the field of Independent Component Analysis (ICA), see for instance [2], [6] and references therein. ICA is very successful in the linear complete case, when as many signals as underlying sources are observed, and the mixing matrix is non-singular. In [3] it is shown that the mixing matrix and the sources are identifiable except for permutation and scaling. In the overcomplete or underdetermined case, less observations than sources are given. It can be seen that still the mixing matrix can be recovered [4], but source identifiability does not hold. In order to approximatively detect the sources, additional requirements have to be made, usually sparsity of the sources. We refer to [7,9,10,11] and reference therein for some recent papers on sparsity and underdetermined ICA (m < n).
2 Blind Source Separation Using Sparseness Definition 1. A vector v ∈ Rm is said to be k-sparse if v has at least k zero entries. A matrix S ∈ Rm×n is said to be k-sparse if each column of it is k-sparse. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 357–364, 2007. c Springer-Verlag Berlin Heidelberg 2007
358
P. Georgiev, F. Theis, and A. Ralescu
The goal of Blind Signal Separation of level k (k-BSS) is to decompose a given m-dimensional random vector X into X = AS
(1)
with a real m × n-matrix A and an n × N -dimensional k-sparse matrix S. S is called the source matrix, X the mixtures and A the mixing matrix. We speak of complete, overcomplete or undercomplete k-BSS if m = n, m < n or m > n respectively. More generally, when a solution to the above problem doesn’t exist, we formulate least square BSS problem as follows: find a best approximation of X by AS, such that S is k-sparse, i.e. minimize|X − AS under constraints A ∈ Rm×n , S ∈ Rn×N and S is k-sparse
(2) (3)
In the following without loss of generality we will assume m n: the undercomplete case can be reduced to the complete case by projection of X.
3 Identifiability Conditions The following identifiability conditions are extension and refinement of those in [5]. The extension is for the case of bigger sparsity i.e. when the source matrix is at least (n − m + 2)-sparse. The refinement can be seen, for instance, when n m + 2 and the source matrix is (n − m + 1)-sparse. The provided examples illustrate both cases. Assume that: (H1) the indeces {1, ..., N } are divided in two groups N1 and N2 such that (a) for any index i ∈ {1, ..., n} there exist ki m different sets of indeces Ii,j ⊂ {1, ..., n}, j = 1, ..., ki each of which contains i, such that |Ii,j | m − 1, (b) for any set Ii,j from (a) there exist mi,j > |Ii,j | vectors s1 , ..., smi,j from the set S1 := {S(:, j), j ∈ N1 }, such that each of them has zero elements in places with indeces not in Ii,j , the rank of the set {s1 , ..., smi,j } is full (i.e. equal to |Ii,j |), and i Ii,j = {i}. (c) kj=1 (H2) Every m vectors from the group {S(:, j), j ∈ N2 } are linearly independent. Theorem 1 (Matrix identifiability). Let the assumptions (H1) and (H2) be satisfied. Then, in the set of all matrices with dimension m × n there is a subset of a full measure A such that, if X = AS and A ∈ A, then A is uniquely determined by X up to permutation and scaling of the columns. Sketch of the proof. The condition (H1) (c) implies that for any column ai of the mixing matrix there exists less that |Ii,j | detectable subspaces in the data which intersection is this column. One more subspace is reserved for ”confirmation” that their intersection is indeed a column of the mixing matrix. Their intersection is stable under small perturbation of the columns, so ”most” of the matrices will exclude false intersection (not corresponding to a column). ”Detectable subspace” here means that it contains at least |Ii,j | + 1 data vectors (columns of X) different from zero and any |Ii,j | from them are linearly independent (follows from (H1) (b)), so such a subspace can be detected by an algorithm. Condition (H2) ensures that there is no false intersection of subspaces (not corresponding to a column of A).
Identifiability Conditions and Subspace Clustering in Sparse BSS
359
4 Affine Hyperplane Skeletons of a Finite Set of Points The solution {(n0i , b0i )}ki=1 of the minimization problem: N minimize f {(nTi , bi )}ki=1 = min |nTi xj − bi |l j=1
1ik
subject to ni = 1, bi ∈ R, i = 1, ..., k,
(4) (5)
defines affine hyperplane k (l) -skeleton of X ∈ Rm×N , introduced for l = 1 in [8] and for l = 2, in [1]. It consists of a union of k affine hyper-planes T
Hi = {x ∈ Rm : n0i x = b0i }, i = 1, ..., k, such that the sum of minimum distances raised to power l, from every point xj to them is minimal. Consider the following minimization problem: minimize
N j=1
˜ j |l min |˜ nTi x
1ik
subject to ˜ ni = 1, i = 1, ..., k,
(6) (7)
˜ i = (ni , bi ), x ˜ j = (xj , −1). Its solution n ˜ i defines hyperpane skeleton in where n Rm+1 , consisting of a union of k hyperplanes ˜ i = {˜ ˜ = 0}, i = 1, ..., k, ˜ Ti x H x ∈ Rm+1 : n ˜ j to them such that the sum of minimum distances raised to power l, from every point x m ˜ is minimal. The restriction of Hi to R gives Hi . The usefulness of working in Rm+1 ˜ i the i-th cluster can be seen for l = 2, if we denote by X ˜ i = {˜ ˜ j |2 = |˜ ˜ j |2 } X xj : min |˜ nTp x nTi x 1pk
(8)
for some given {˜ nTi }ki=1 . Then the following minimization problem ˜ iX ˜ T vi minimize viT X i subject to v ∈ Rm+1 , vi = 1,
(9) (10)
will produce a solution {vi0 }ki=1 (consisting of eigenvectors corresponding to the minimal eigenvalues) for which f in (4) will not increase, i.e. f ({vi0 }ki=1 ) f ({˜ ni }ki=1 ). So we arrived to the following analogue of the classical k-means clustering algorithm. Hyperplane clustering algorithm. (see the pseudocode of the subspace clustering algorithm below in the particular case when ri = 1). Apply iteratively the following two steps until convergence:
360
P. Georgiev, F. Theis, and A. Ralescu
˜ i , i = 1, ..., k by (8), 1) Cluster assignment - forming X 2) Cluster update - solving the eigenvalue problem (9), (10). Such kind of analogue of the classical k-means clustering algorithm, (working in Rm instead in Rm+1 ) was introduced by Bradley and Mangasarian in [1] and called k-plane clustering algorithm. Since our hyperplane clustering algorithm is equivalent to the k-plane clustering algorithm, it has the same properties, i.e. termination after finite number of steps at a cluster assignment which is locally optimal, i.e. f in (4) cannot be decreased by either reassigning of a point to a different cluster plane, or by defining a new cluster plane for any of the clusters (see Theorem 3.7 in [1]). Affine Subspace Skeleton The solution of the following minimization problem minimize F ({Ui }ki=1 ) =
N j=1
min
1ik
ri
|uTi,s xj − bi,s |l
(11)
s=1
subject to ui,s = 1, bi,s ∈ R, i = 1, ..., k, s = 1, ..., ri , uTi,p ui,q = 0, p = q,
(12) (13)
i where where Ui = {ui,s }rs=1 , will be called affine subspace skeleton. It consists of a union of k affine subspaces Si = {x ∈ Rm : uTi,s x = bi,s , s = 1, ..., ri } each with codimension ri , i = 1, ..., k, such that the sum of minimum distances raised to power l, from every point xj to them is minimal. Consider the following minimization problem:
˜ i }k ) = minimize F˜ ({U i=1
N j=1
min
1ik
ri
˜ j |l |˜ uTi,s x
(14)
s=1
subject to ˜ ui,s = 1, ∈ R, i = 1, ..., k, ˜ i,q = 0, p = q, ˜ Ti,p u s = 1, ..., ri , u
(15) (16)
i ˜ i = {˜ ˜ i,s = (ui,s , bi,s ), x ˜ j = (xj , −1). Its solution n ˜ i,s defines a where U ui,s }rs=1 ,u subspace skeleton in Rm+1 , consisting of a union of k subspaces ˜ = 0, s = 1, ..., ri }, i = 1, ..., k, ˜ Ti,s x S˜i = {˜ x ∈ Rm+1 : n
˜ j to them such that the sum of minimum distances raised to power l, from every point x is minimal. The restriction of S˜i to Rm gives Si (a non-trivial fact). Similarly to the hyperspace clustering algorithm, we can write the following analogue of the classical k-means clustering algorithm for finding the subspace skeleton, when l = 2. Subspace clustering algorithm. (see the pseudocode below) It consists of two steps applied alternatively until convergence: ˜ i , i = 1, ..., k by: for given k orthonormal families 1) Cluster assignment - forming X ri of vectors {˜ ni,s }s=1 , i = 1, ..., k, put ri ri ˜i = x ˜ j |2 = min ˜ j |2 . ˜j : X (17) |˜ uTi,s x |˜ uTi,s x s=1
1ik
s=1
Identifiability Conditions and Subspace Clustering in Sparse BSS
361
2) Cluster update - for every i = 1, ..., k, solving the minimization problem minimize
ri
˜ iX ˜ Ti Vi ) trace(ViT X
(18)
s=1
under orthogonality constraints ViT Vi = Iri ,
(19)
where Vi has dimensionality (n + 1 × ri ). For any i, it will produce a solution Vi0 - a matrix with columns equal to the eigen˜ T (a non-trivial ˜ iX vectors corresponding to the ri smallest eigenvalues of the matrix X i 0 k k ˜ ˜ ˜ ˜ fact) and F in (14) will not increase, i.e. F ({Vi }i=1 ) F ({Ui }i=1 ). This algorithm terminates in a finite number of steps (similarly to the BradleyMangasarian algorithm) to a solution which is locally optimal, i.e. F˜ in (14) cannot be decreased by either reassigning of a point to a different cluster subspace, or by defining a new cluster subspace for any of the clusters.
Algorithm 1. Subspace clustering algorithm Data: samples x(1), . . . , x(T ) Result: estimated k subspaces S˜i ⊂ Rn+1 , i = 1, ..., k given by the orthonormal bases i of their orthogonal complements S˜i⊥ . {˜ ui,s }rs=1 1
2
3 4 5
˜ i,s = (ui,s , bi,s ) with |˜ Initialize ε > 0 and randomly u ui,s | = 1, i = 1, . . . , k, s = 1, . . . , ri . ˜ i }ki=1 is less do (Until the difference of F˜ calculated in two subsequent estimates of {U than ε) Cluster assignment. for t ← 1, . . . , T do ˜ i (a matrix), where i is chosen to minimize ˜ (t) = (x(t), −1) to cluster X Add x ri T 2 s=1 |ui,s xj − bi,s | (distance to the subspace Si ). end Cluster update. for i ← 1, . . . , k do ˜ iX ˜ Ti . Calculate the i-th cluster correlation C := X Choose eigenvectors vs , s = 1, ..., ri of C corresponding to the ri minimal eigenvalues. ˜ i = ∪ri u ˜ i,s ← vs /|vs |, s = 1, ..., ri , U Set u s=1 ˜ i,s . end end
Algorithm for Identification of the Mixing Matrix 1) Cluster the columns {X(:, j) : j ∈ N1 } in k groups Hi , i = 1, ..., k such that the span of the elements of each group Hi produces ri -codimensional subspace and these ri -codimensional subspaces are different. 2) Calculate any basis of the orthogonal complement of each of these ri-codimensional subspaces.
362
P. Georgiev, F. Theis, and A. Ralescu
3) Find all possible groups such that each of them is composed of the elements of at least m bases in 2), and the vectors in each group form a hyperplane. The number of these hyperplanes gives the number of sources n. The normal vectors to these hyperplanes are estimations of the columns of the mixing matrix A (up to permutation and scaling). In practical realization all operations in the above algorithm are performed up to some precision ε > 0. Source Recovery Algorithm 1. Repeat for j = 1 to N : 2.1. Identify the subspace Hi containing xj := X(:, j), or, in practical situation with presence of noise, identify Hi to which the distance from xj is minimal and project xj ˜j ; onto Hi to x 2.2. if Hi is produced by the linear hull of column vectors ap1 , ..., apm−ri , then find m−r i ˜j = coefficients Lj,l such that x Lj,l apl . These coefficients are uniquely deterl=1
˜ j (see [5], Theorem 3). mined generically (i.e. up to measure zero) for x 2.3. Construct the solution sj = S(:, j): it contains Lj,l in the place pl for l = 1, ..., m − ri , the other its components are zero. It is clear that such found (A, S) is a solution to the least square BSS problem defined by (2).
5 Computer Simulation Examples First example. We created artificially four source signals, sparse of level 2, i.e. each column of the source matrix contains at least 2 zeros (shown in Figure 1). They are mixed with a square nonsingular randomly generated matrix A: ⎛
0.4621 ⎜ 0.2002 A=⎝ 0.8138 0.2899
0.6285 0.4921 0.4558 0.3938
0.5606 0.5829 0.2195 0.5457
⎞
0.4399 0.3058 ⎟ . 0.2762 ⎠ 0.7979
The mixed sources are shown in Fig.2, the recovered sources by our algorithm are shown in Fig. 3. The created signals are superposition of sinusoidal signals, so it is clear that they are dependent and ICA methods could not separate them (lack of space prevent us to show the results of applying ICA algorithms). The estimated mixing matrix with our subspace clustering algorithm is ⎛
0.6285 ⎜ 0.4921 ⎜ M=⎝ 0.4558 0.3938
0.4399 0.3059 0.2762 0.7979
0.4621 0.2002 0.8138 0.2899
⎞ 0.5606 0.5829 ⎟ ⎟. 0.2195 ⎠ 0.5457
Identifiability Conditions and Subspace Clustering in Sparse BSS
363
Second example. Now we create six signals with level of sparsity 4, i.e., each column of the source matrix has 4 zeros. We mixed them with a randomly generated matrix of dimension 3 × 6:

H = [ 0.8256  0.3828  0.4857  0.4053  0.7720  0.2959
      0.2008  0.7021  0.0197  0.5610  0.6182  0.6822
      0.5273  0.6004  0.8739  0.7218  0.1476  0.6686 ]
The mixing matrix estimated with our subspace hyperplane clustering algorithm is

M = [ 0.2959  0.4857  0.8256  0.4053  0.3828  0.7720
      0.6822  0.0197  0.2008  0.5610  0.7021  0.6182
      0.6686  0.8739  0.5273  0.7218  0.6004  0.1476 ]
Here the number of hyperplanes present in the data is 9, while the number of all possible hyperplanes (according to the identifiability conditions in [5]) when all combinations of 4 zeros in the columns of the source matrix are present is 15.
Fig. 1. Example 1: original source signals (left) and mixed ones (right)
Fig. 2. Left: Example 1 - separated signals. Right: Example 2 - original source signals
Fig. 3. Example 2: mixed signals (left) and separated signals (right)
6 Conclusion

We presented new identifiability conditions for the sparse BSS problem, requiring fewer hyperplanes in the data points for full recovery of the original sources and the mixing matrix. A new subspace clustering algorithm, based on eigenvalue decomposition, is designed for subspace detection. The square and underdetermined cases for signals with greater sparsity are illustrated by examples.
References
1. Bradley, P.S., Mangasarian, O.L.: k-Plane Clustering. J. Global Optim. 16(1), 23–32 (2000)
2. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley, Chichester (2002)
3. Comon, P.: Independent component analysis – a new concept? Signal Processing 36, 287–314 (1994)
4. Eriksson, J., Koivunen, V.: Identifiability and separability of linear ICA models revisited. In: Proc. of ICA 2003, pp. 23–27 (2003)
5. Georgiev, P., Theis, F., Cichocki, A.: Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures. IEEE Trans. on Neural Networks 16(4), 992–996 (2005)
6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Chichester (2001)
7. Lee, T.-W., Lewicki, M.S., Girolami, M., Sejnowski, T.J.: Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Process. Lett. 6(4), 87–90 (1999)
8. Rubinov, A.M., Soukhoroukova, N., Ugon, J.: Minimization of the sum of minima of convex functions and its application to clustering. In: Jeyakumar, V., Rubinov, A. (eds.) Continuous Optimization: Current Trends and Modern Applications, pp. 409–434. Springer, Heidelberg (2005)
9. Theis, F.J., Lang, E.W., Puntonet, C.G.: A geometric algorithm for overcomplete linear ICA. Neurocomputing 56, 381–398 (2004)
10. Waheed, K., Salem, F.: Algebraic Overcomplete Independent Component Analysis. In: Proc. Int. Conf. ICA 2003, Nara, Japan, pp. 1077–1082 (2003)
11. Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Comput. 13(4), 863–882 (2001)
Two Improved Sparse Decomposition Methods for Blind Source Separation B. Vikrham Gowreesunker and Ahmed H. Tewfik Dept. of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 {gowr0001,tewfik}@umn.edu
Abstract. In underdetermined blind source separation problems, it is common practice to exploit the underlying sparsity of the sources for demixing. In this work, we propose two sparse decomposition algorithms for the separation of linear instantaneous speech mixtures. We also show how a properly chosen dictionary can improve the performance of such algorithms by improving the sparsity of the underlying sources. The first algorithm proposes the use of a single channel Bounded Error Subset Selection (BESS) method for robustly estimating the mixing matrix. The second algorithm is a decomposition method that performs a constrained decomposition of the mixtures over a stereo dictionary.
1 Introduction
In the blind source separation (BSS) problem, we have mixtures of several source signals and the goal is to separate them with as little prior information as possible. In this work, we study the instantaneous underdetermined BSS case, where we have more sources than mixtures. We are concerned with separating mixtures of speech signals when the mixing matrix and the number of underlying sources are unknown. This problem is intrinsically ill-defined and its solution requires some additional assumptions compared to its overdetermined counterpart. The difficulty of the underdetermined setup can be somewhat alleviated if there exists a representation where all the sources are rarely simultaneously active, which entails finding a representation where the sources are sparse. Some authors have shown that speech signals are sparser in the time-frequency domain than in the time domain, and that there exist several other representations, such as wavelet packets, where different degrees of sparsity can be obtained [12]. It has been shown that better separation can indeed be achieved by exploiting such sparsity [7],[6],[11]. In this paper, we investigate methods for performing BSS using overcomplete dictionaries in the underdetermined case. The fundamental success of the separation depends on two factors, namely the type of dictionary used and the type of decomposition method employed. We study both areas. For the first case, we demonstrate how the underlying sparsity of the sources can be improved using a KSVD-based [3] trained dictionary. For the second case, we propose the use of the Bounded Error Subset Selection (BESS) [5] method as a robust method
for estimating the mixing matrix, even in the presence of additive noise in the mixture. Furthermore, we propose a multichannel sparse decomposition method for extracting the underlying sources. The remainder of this paper is organized as follows. In section 2 we give a mathematical description of the problem and an explanation of how sparsity is used in source separation. In section 3, we show how the right dictionary can improve the underlying sparsity of the sources. In section 4, we illustrate how single channel sparse decomposition algorithms fail to distribute some coefficients to the proper source, setting the stage for multichannel methods. Finally, we detail a decomposition algorithm with directionality constraints and discuss its benefits.
2 Mixture Model for an Arbitrary Dictionary
For a problem where we have M mixtures of N sound sources and M ≤ N, our goal is to separate the sources into individual tracks. We are concerned with the underdetermined linear instantaneous mixture model, which can be formulated mathematically as follows,

x(t) = As(t) + q(t),     (1)

where s(t) is an unknown N × 1 vector containing the source data, x(t) is a known M × 1 observation vector, q(t) is the M × 1 additive noise vector, t is the sample index and A is an unknown M × N mixing matrix. Over T time samples, we have the following expression,

X = AS + Q,     (2)

where X = [x(1) x(2) . . . x(T)], S = [s(1) s(2) . . . s(T)], Q = [q(1) q(2) . . . q(T)]. One approach to solving this problem is to assume that the sources are sufficiently sparse in a given representation. To solve the underdetermined BSS problem using sparsity, one can decompose the signal into a dictionary where the source signals are known to be sparse, or use a Sparse Decomposition (SD) algorithm to find a sparse representation for an overcomplete dictionary. In the rest of this section, we illustrate how sparsity can lead to separation, then show how sparsity in alternative representations can be exploited. To simplify the discussion, let us assume without loss of generality that M = 2 and N = 3. Assuming there is no additive noise for illustrative purposes, we can expand Eq. 2 as follows

[x1]   [a11 a12 a13]   [s1]
[x2] = [a21 a22 a23] × [s2]     (3)
                       [s3]

By looking at the ratio x1/x2 for the case when only the j-th source is active, we get

x1/x2 = a1j/a2j.     (4)
Hence, a scatter plot of x1 vs. x2 for the case where the sources never overlapped would reveal 3 distinct lines such that the j-th source corresponds to the line with gradient a1j/a2j. Separation can be easily achieved for Eq. 4. The general thinking is that the sparser the sources are, the less likely they are to be active at the same time, resulting in better separation. The same argument can be applied when the data is represented in a different dictionary. The dictionary can be a basis matrix or an overcomplete matrix, and each column is referred to as a dictionary atom. We can express the signals {s_j}_{j=1}^N in terms of the dictionary D and coefficient matrix C such that

S^T = DC.     (5)

Substituting in Eq. 2, we get

X = A C^T D^T,     (6)

where C is a K × N coefficient matrix and D is a T × K matrix. Thus, it is clear that the sparsity of the source is inherently limited by the dictionary used. In the following section we illustrate how the quest for a sparse representation entails finding a good dictionary. In the rest of the paper, we consider the case where we have 2 mixtures and 3 sources.
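The geometric picture behind Eq. (4) is easy to reproduce numerically. The snippet below is an illustration only: the mixing matrix, sparsity level and sample count are arbitrary choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.90, 0.50, 0.10],     # an arbitrary 2 x 3 mixing matrix
              [0.44, 0.87, 0.99]])
T = 5000
# Sparse sources: each source is active only ~10% of the time.
S = rng.laplace(size=(3, T)) * (rng.random((3, T)) < 0.1)
X = A @ S
active = np.abs(X).sum(axis=0) > 1e-9
ratios = X[0, active] / X[1, active]
# Samples where a single source is active sit at the column ratios a1j/a2j,
# i.e. the scatter of (x1, x2) concentrates along the columns of A:
print(np.round(A[0] / A[1], 3))       # the three line gradients
print(np.round(ratios[:8], 3))        # observed ratios cluster near them
```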
3 Effect of Dictionary on Signal Sparsity
In this section, we illustrate the importance of dictionary selection for BSS algorithms that rely on sparsity. Although it has already been well established that speech signals exhibit different degrees of sparsity in different representations [11],[12], we do not know what the optimally sparse representation for speech signals is, or whether optimal sparsity offers sufficient improvement in separation quality over standard overcomplete dictionaries such as the Cosine Packet (CP) dictionary [8]. In the absence of an optimal representation, we resort to trained dictionaries to demonstrate that a properly chosen dictionary can offer significant performance improvement. In this section, we present some numerical results to illustrate that the trained dictionary indeed offers better sparsity than the CP dictionary. In section 4, we compare their impact on separation.

3.1 Dictionary Design
The CP dictionary was designed using the Atomizer and Wavelab Matlab toolboxes [1], with dimensions 128 × 896. The matched dictionary, of the same size as the CP dictionary, was designed using a KSVD method [3]. This method takes after the Vector Quantization technique used in codebook design and tends to promote a sparse structure. The speech data used for training the dictionary first had to be formatted into frames of length T, with a standard windowing technique. To compare the sparsity improvement of the matched and unmatched dictionaries for speech signals, we use a standard sparse decomposition method,
the Orthogonal Matching Pursuit (OMP) [8]. Sparsity for a dictionary was evaluated as the minimum number of coefficients required to represent the signal in that dictionary using OMP. Using three different speakers, we trained three dictionaries. The first, dictionary1, was trained using speaker 1 only; the second, dictionary12, was trained with speakers 1 and 2; and the third was trained using all three speakers. The signal from each speaker was then independently decomposed in the CP dictionary and the three trained dictionaries.
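For concreteness, a minimal OMP routine of the kind used for this sparsity measurement might look as follows. This is a sketch under the assumption that the stopping rule is a simple residual tolerance; the function name is ours.

```python
import numpy as np

def omp(D, y, tol=1e-6, max_atoms=None):
    """Orthogonal Matching Pursuit: repeatedly pick the atom most
    correlated with the residual, then refit y on all selected atoms.
    The size of the final support is the sparsity count used above."""
    T, K = D.shape
    max_atoms = max_atoms if max_atoms is not None else K
    residual = y.astype(float).copy()
    support, coefs = [], np.zeros(0)
    while np.linalg.norm(residual) > tol and len(support) < max_atoms:
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coefs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coefs
    return support, coefs
```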
3.2 Sparsity Comparison
We summarize the sparsity of the original sources for each dictionary in Fig. 1. In all three experiments, source1 is clearly much sparser with the KSVD dictionary than with the CP dictionary. The most important observation is that all speakers exhibit a very high sparsity improvement with dictionary123. A study of the number of dictionary atoms concurrently used by simultaneous sources reveals that the sparsest dictionary, dictionary123, indeed results in the lowest number of overlapping atoms. For space considerations, details are omitted here. Hence, for our purposes, dictionary123 is the sparsest available dictionary for the sources under study.
Fig. 1. Number of nonzero coefficients when using OMP. Dictionary1 is trained using source1 only. Dictionary12 is trained using source1 and source2. Dictionary123 is trained using source1, source2 and source3.
4 Sparse Decomposition Methods for Separation
Our goal in this section is to introduce two algorithms for underdetermined BSS. The first algorithm is a single channel BESS, which is shown to be a robust technique for estimating the mixing matrix. The second algorithm performs a constrained decomposition of the two mixture signals over a stereo dictionary.

4.1 Single Channel BESS as a Robust Estimate of the Mixing Matrix
In Fig. 2(a), we show the scatter plot of coefficients from performing single channel BESS independently on each mixture signal. There are two interesting
observations about this plot. First, there is a set of coefficients that appear on the axes, and second, there is a different set of coefficients clustered along the mixing matrix columns. The coefficients on the axes are dictionary elements that appear in one mixture but not in the other, and cannot be separated using this method. The coefficients along the matrix orientation share the same dictionary atoms in both mixtures and belong to a single source. We propose using only the dictionary atoms common to both mixtures in estimating the mixing matrix. A well known method for estimating the mixing matrix is performing a Hough transform on the coefficients of the sparse decomposition. This basically entails finding the histogram of angles, arctan(c2/c1), for all the coefficient pairs (c1, c2) of the decomposed mixtures. The mixing matrix columns show up as peaks in the histogram. The quality of the estimate depends on how well the peaks can be estimated. We combine this method with the BESS decomposition discussed above, and compare the results with the Hough transform of the coefficients of the Modified Discrete Cosine Transform (MDCT) of the mixtures. We find that both methods provide equally good estimates, but that the BESS technique is more robust in the presence of additive noise. Fig. 2(b) shows the histogram for MDCT coefficients and Fig. 2(c) shows the results for the BESS method, in the presence of noise. We can clearly see that the peak positions get blurred in the presence of noise with the MDCT transform, while the new method provides a more robust estimate of the peaks under noisy conditions.
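A rough sketch of this histogram-of-angles estimator follows; the bin count, the naive peak picking by sorting (which can return adjacent bins of the same peak), and the function name are our simplifications, not the authors' implementation.

```python
import numpy as np

def estimate_directions(C1, C2, n_bins=360, n_peaks=3):
    """Hough-style estimate of the mixing directions: histogram the
    angles arctan(c2/c1) of coefficient pairs shared by both mixtures;
    the columns of A show up as peaks of the histogram."""
    angles = np.arctan2(C2, C1)
    hist, edges = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    peaks = np.argsort(hist)[-n_peaks:]            # naive peak picking
    theta = (edges[peaks] + edges[peaks + 1]) / 2  # bin centers
    # Each direction (cos(theta), sin(theta)) estimates a column of A
    # up to scale.
    return np.vstack([np.cos(theta), np.sin(theta)])
```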
4.2 A Decomposition Method with Directionality Constraints
Inspired by the stereo dictionary method of Gribonval [7], we propose an algorithm that performs a constrained decomposition of the two mixture signals. This requires prior knowledge or estimation of the mixing matrix. We detail the algorithm below and contrast it with the stereo dictionary decomposition method. Given a T × K dictionary D, consisting of {d_i}_{i=1}^K, and two mixture vectors, x1 and x2, the decomposition proceeds as follows.

1. At iteration M = 1, the n-th residual R_n^M is initialized to the n-th mixture signal x_n. Each channel has a set of coefficients c_{n,i} = 0, where i is the index of the dictionary atom. A list of the available dictionary atoms is kept in a list Li; all indices are included at initialization. At each iteration, a dictionary atom is picked and the coefficients associated with that atom are computed.
2. For picking the dictionary atom, we use the same criterion as [7], i.e., we pick the atom k that is maximally correlated with both mixture channels, k = argmax_k (⟨R_1^M, d_k⟩² + ⟨R_2^M, d_k⟩²).
3. Next we find the projection of the residuals of each channel onto the k-th dictionary element. Denoting this projection pair as c_t = [⟨R_1^M, d_k⟩, ⟨R_2^M, d_k⟩], we proceed to constrain it to lie along one of the mixing matrix columns.
4. By taking the inner product of this projection pair with the mixing matrix columns, we find the closest column as J = argmax_J |⟨c_t, a_J⟩|, where J is the column index.
(a) Scatter plot of 2 independent BESS decomposition
(b) Histogram of all MDCT coefficients
(c) Histogram of BESS coefficients, common to both mixtures
Fig. 2. Coefficients of 2 mixtures in the presence of additive noise at sensors
5. We then set the coefficient for the k-th atom to be equal to ⟨c_t, a_J⟩, and remove the index of this dictionary element from the index list Li.
6. We check the reconstruction error of both mixture signals and increment the iteration if necessary.

Stages 1 and 2 are essentially the same as in StereoMP. However, stage 2 can be replaced by some other criterion for picking the dictionary element. In stages 3 through 5, we constrain the coefficients of a particular dictionary element to lie along one of the columns of the mixing matrix. The difference between doing the projection during the decomposition, as opposed to after it, is that the error between c_t × d_k and ⟨c_t, a_J⟩ × c_t × d_k can be reassigned to another dictionary element during the decomposition. A sketch of one iteration is given below.
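The sketch below condenses steps 1–6 into code. It assumes unit-norm mixing-matrix columns (so the constrained coefficient is simply ⟨c_t, a_J⟩) and recomputes the full reconstruction at each iteration for clarity rather than speed; all names are ours.

```python
import numpy as np

def constrained_stereo_mp(x1, x2, D, A, n_iter=200):
    """One possible reading of steps 1-6: pick the atom most correlated
    with both residuals, project the correlation pair onto the closest
    mixing-matrix column, and book the coefficient under that source."""
    T, K = D.shape
    C = np.zeros((A.shape[1], K))          # coefficient per (source, atom)
    R1, R2 = x1.astype(float).copy(), x2.astype(float).copy()
    available = list(range(K))             # the index list Li
    for _ in range(n_iter):
        if not available:
            break
        p1 = D[:, available].T @ R1
        p2 = D[:, available].T @ R2
        k = available[int(np.argmax(p1 ** 2 + p2 ** 2))]   # step 2
        ct = np.array([D[:, k] @ R1, D[:, k] @ R2])        # step 3
        J = int(np.argmax(np.abs(A.T @ ct)))               # step 4
        C[J, k] = A[:, J] @ ct             # step 5 (unit-norm columns)
        available.remove(k)                # step 5: drop atom from Li
        recon = A @ C @ D.T                                # step 6
        R1, R2 = x1 - recon[0], x2 - recon[1]
    return C
```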
4.3 Results
We computed commonly used performance indices [4] for the above-mentioned algorithm for 2 cases, a CP dictionary and a KSVD-trained dictionary. For the purpose of this experiment, we used three 10-second-long male speech tracks and artificially mixed them using a predefined mixing matrix. The experiments were run on frames of 512 points with a 50% overlap, and the dictionary used was of size 512 × 4608. In Fig. 3, we compared the results for the CP dictionary and the trained dictionary. We find that the trained dictionary offers a Signal-to-Interference Ratio (SIR) improvement of 4 to 9 dB over the CP dictionary and a Signal-to-Artifact Ratio (SAR) improvement of 3 to 4 dB. Also, not shown here for space considerations, is an improvement in Signal-to-Distortion Ratio (SDR) similar to the SAR. This confirms the claim in section 3 that dictionary selection is a very important part of the separation process. We also found our method to have comparable SIR to Gribonval's method, but with a noticeable improvement, numerically and subjectively, in distortion and artifacts. This improvement in distortion metrics confirms our previous claim that redistributing the coefficient projection error indeed improves the quality of the separation. The downside, however, is that a larger number of iterations is required.
Cosine Packet Dictionary
Signal-to-Artifact Ratio(SAR)
371
Cosine Packet Dictionary
10
Speech Trained Dictionary
25
8 7
15
SAR(dB)
SIR (dB)
Speech Trained Dictionary
9
20
10 5
6 5 4 3 2
0 1
2
3
1 0
Source Number
1
2
3
Source Number
(a) Signal-to-Interference Ratio (dB)
(b) Signal-to-Artifact Ratio
Fig. 3. Performance Comparison between CP and Trained dictionaries for algorithm of section 4.2
5 Conclusion
We discussed the implications of sparsity for source separation and explored different avenues to maximize separation for given signals. We looked at both the dictionary selection problem and the selection of decomposition algorithms, and used the BESS method as a robust estimate of the mixing matrix. In this work, we clearly illustrated how the choice of a proper dictionary makes a difference in the sparsity of the coefficients and the resulting separation. Furthermore, we proposed a decomposition method which constrains the coefficients at each iteration, and demonstrated its performance. We are currently in the process of developing an extension of the BESS method to multichannel source separation.
References
1. http://www-stat.stanford.edu/~wavelab/
2. http://sassec.gforge.inria.fr/
3. Aharon, M., Elad, M., Bruckstein, A.: The K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation. IEEE Trans. on Signal Processing 54, 4311–4322 (2006)
4. Vincent, E., Gribonval, R., Fevotte, C., et al.: BASS-dB: the blind audio source separation evaluation database. Available on-line: http://www.irisa.fr/metiss/BASS-dB/
5. Alghoniemy, M., Tewfik, A.H.: Reduced Complexity Bounded Error Subset Selection. In: IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), pp. 725–728 (March 2005)
6. Fevotte, C., Godsill, S.: A Bayesian approach for blind separation of sparse sources. IEEE Trans. on Speech and Audio Processing 14, 2174–2188 (2006)
7. Gribonval, R.: Sparse decomposition of stereo signals with Matching Pursuit and application to blind separation of more than two sources from a stereo mixture. In: IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), vol. 3, pp. 3057–3060 (May 2002)
8. Mallat, S.: A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, San Diego (1999)
9. Shindo, H., Hirai, Y.: Blind Source Separation by a Geometrical Method. In: Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN), pp. 1109–1114 (May 2002)
10. Mallat, S., Zhang, Z.: Matching Pursuit with Time-Frequency Dictionaries. IEEE Trans. on Signal Processing 41, 3397–3415 (1993)
11. Tan, V.Y., Fevotte, C.: A study of the effect of source sparsity for various transforms on blind audio source separation performance. In: Workshop on Signal Processing with Adaptative Sparse Structured Representations (SPARS'05), Rennes, France (November 2005)
12. Zibulevsky, M., Pearlmutter, B.A., Bofill, P., Kisilev, P.: Blind Source Separation by Sparse Decomposition. In: Roberts, S.J., Everson, R.M. (eds.) Independent Component Analysis: Principles and Practice. Cambridge University Press, Cambridge (2001)
Probabilistic Geometric Approach to Blind Separation of Time-Varying Mixtures Ran Kaftory and Yehoshua Y. Zeevi Technion - Israel Institute of Technology, Department of Electrical Engineering, Haifa 32000, Israel [email protected], [email protected]
Abstract. We consider the problem of blindly separating time-varying instantaneous mixtures. It is assumed that the arbitrary time dependency of the mixing coefficients is known up to a finite number of parameters. Using sparse (or sparsified) sources, we geometrically identify samples of the curves representing the parametric model. The parameters are found using a probabilistic approach of estimating the maximum likelihood of a curve, given the data. After identifying the model parameters, the mixing system is inverted to estimate the sources. The new approach to blind separation of time-varying mixtures is demonstrated using both synthetic and real data.
1 Introduction
Most of the research in the area of blind source separation has focused on the instantaneous model. In recent years, more attention is being directed towards convolutive and time-varying problems, which represent the majority of real-world mixtures. In blind separation of time-varying instantaneous mixtures, the sources are mixed with time-varying coefficients. It is assumed that the sources are neither filtered nor delayed prior to being mixed. Many studies have addressed this problem by using an adaptive form of stationary blind source separation algorithms [4,6,7]. Some have even utilized particle filtering in order to track the mixing coefficients [2]. These approaches assume that the variation of the mixing system is small, in order to enable the extraction of the statistical nature of the mixtures. A different idea was employed in [9,10], where a parametric model of the mixing coefficients was used in the case of a linear or periodic dependency of the mixing system on time. This approach can be more efficient for larger variations in comparison to the adaptive algorithms. The case where the mixing system is an arbitrary function of time, with large variations, is still overlooked. In this paper, the model of the mixing coefficients is arbitrary. We assume that it is known up to a finite number of parameters, and that the sources are sparse (or can be sparsely represented using an appropriate transform). For such sources, the value of the mixing coefficients can be identified over many time instances. We interpret these instances geometrically, as samples of curves representing the parametric model. Estimating the parameters requires the grouping of the time instances and assigning them to a curve. This is done
by using a probabilistic approach of estimating the maximum likelihood of a curve given the data. After identifying the model parameters, the mixing system is inverted to estimate the sources.
2 Time-Varying Instantaneous Mixtures
In the problem of blind separation of time-varying (or position-varying in the case of images) instantaneous mixtures, sensors x_i(t) receive sources s_j(t), which are linearly mixed by a matrix A(t) with time-dependent elements a_ij(t)¹:

x(t) = A(t)s(t) + n(t),     (1)

where n(t) is an additive noise with known or unknown parameters. We can usually assume the form of a_ij(t) by using some knowledge about the physics of the problem²:

a_ij(t) = g_ij(α_ijk; t),     (2)

where the function g_ij is known up to some finite number of parameters α_ijk. The index k = 1, ..., N, where N is the number of unknown parameters. To illustrate a system of 2 sensors receiving data from position-varying instantaneous mixtures of 2 sources, consider Fig. 1. The semi-reflective glass mixes an image and a reflection. The position-varying mixing is obtained by non-uniform illumination. The objective of the blind source separation problem is to estimate the sources s_j and the unknown mixing parameters α_ijk from the observed mixtures x_i. One way of doing so is to identify the matrix A(t) and apply its pseudo-inverse to the observations x(t)³:

ŝ(t) = A(t)† x(t) = s(t) + A(t)† n(t),     (3)

where A(t)† stands for the pseudo-inverse of A(t). Similar to the case of instantaneous mixtures, where the separation is up to scaling and permutation of the original sources, we assume that the blind separation of time-varying instantaneous mixtures is up to a time-varying scaling and permutation of the sources. Therefore, we can define our separation problem to account for this fact and rewrite it as follows:

x(t) = Ã(t)s̃(t) + n(t),     (4)

where ã_ij(t) = a_ij(t)/a_1j(t) and s̃_j(t) = a_1j s_j(t). Similar to Eq. (2), we define g̃_ij(α̃_ijk; t), such that

ã_ij(t) = g̃_ij(α̃_ijk; t) = g_ij(α_ijk; t) / g_1j(α_1jk; t).     (5)

¹ In the case of images, the spatial coordinate y replaces t in the equations.
² If the function is unknown, we can still assume that it is an analytic function and can therefore be approximated by a polynomial (representing its Taylor expansion).
³ Presuming the noise is low.
375
Fig. 1. A setup for acquiring spatial varying instantaneous mixtures: two pictures are positioned opposite to each other while a semi-reflective glass is mounted on the optical axis of one of them. First mixture is acquired using uniform lighting. A second mixture is acquired using a non-uniform illumination by a fluorescent lamp. This setup mimics a physical situation, where the position-varying mixing parameter cannot be avoided.
The time-varying scaled estimated sources, can be found by using ˜ † x(t). sˆ(t) = A(t)
(6)
˜ using only the observed mixtures x(t). The problem is how to estimate A(t)
3
Geometric Approach to Separation of Sparse Signals
Sparse signal is characterized by a small percentage of samples with nonzero value. It can be instrumental in estimation of the mixing matrix [5]. If a signal is not sparse, it can be sparsified by using a linear transformation to a domain in which it is sparsely represented [11]4 . ˜ Sparse signals can be geometrically interpreted to identify A(t). If we observe during some time instance t0 , the existence of a nonzero signal in the sensors, the most probable assumption would be that the signal was originated from a single source. We can see from Eq. (4) that if only the j th source is active during t0 , a ˜ij (t0 ) can be found by: a ˜ij (t0 ) = 4
xi (t0 ) xi (t0 ) − ni (t0 ) ≈ . x1 (t0 ) − n1 (t0 ) x1 (t0 )
(7)
By knowing the parametric form of A(t), a transformation can be chosen such that T [A(t)s(t)] ≈ A(t)T [s(t)] and the blind source separation can be solved in the transformed sparse domain.
Since s_j is rarely active, we have several non-uniform samples of the continuous curve ã_ij(t). The obstacle is that when one observes a nonzero signal in the sensors, it is not known to which source it corresponds. Thus, the problem requires grouping the time instances in which the received signal originated from the same source.
4 Probabilistic Framework for Identifying Ã(t)
We want to construct a framework for grouping time instances originated from the same source and estimating the α̃_ijk parameters. This framework is necessary since correct estimation of α̃_ijk solves the grouping problem and, vice versa, correct grouping estimates α̃_ijk. Suppose we take all the time instances t0, ..., t_{M−1} in which a nonzero (or above some threshold) signal was detected. For each individual time instance t_l, we define the ratio r_i(t_l) ≡ x_i(t_l)/x_1(t_l). A maximum likelihood approach for estimating α̃_ijk would be to maximize:

α̃_ijk = argmax_{α̃_ijk} J(α̃_ijk) ≡ argmax_{α̃_ijk} P[r_i(t0 ... t_{M−1}) | α̃_ijk].     (8)

Using Bayes' rule, we can calculate the conditional probability on the right hand side of Eq. (8) by:

J(α̃_ijk) ≡ P[r_i(t0 ... t_{M−1}) | α̃_ijk] = P[α̃_ijk | r_i(t0 ... t_{M−1})] P[r_i(t0 ... t_{M−1})] / P[α̃_ijk].     (9)

Since prior information regarding the distribution of α̃_ijk is not available, a uniform distribution is assumed. Omitting P[α̃_ijk], which is assumed to be constant, and omitting P[r_i(t0 ... t_{M−1})], which does not depend on α̃_ijk, does not affect the maximization of Eq. (9) with respect to α̃_ijk. Therefore, estimating α̃_ijk using the maximum likelihood approach would be to maximize:

α̃_ijk = argmax_{α̃_ijk} J̃(α̃_ijk) ≡ argmax_{α̃_ijk} P[α̃_ijk | r_i(t0 ... t_{M−1})].     (10)

In order to evaluate the conditional probability P[α̃_ijk | r_i(t0 ... t_{M−1})], we first construct a density estimate of obtaining a certain ratio at a specific time, given the measurements x_i(t_l) and x_1(t_l). It can be done using a kernel density estimate:

f̂(r_i, t | x_i(t_l), x_1(t_l)) = (1/(M h_r h_t)) K((r_i − r_i(t_l))/h_r, (t − t_l)/h_t),     (11)

where M is the number of time instances in which a signal was detected, K is a multivariate kernel density estimator, and h_r, h_t are the kernel supports in the r and t axes, respectively. Using the law of total probability, f̂(r_i, t) can be found by calculating:

f̂(r_i, t) = Σ_{l=0}^{M−1} f̂(r_i, t | x_i(t_l), x_1(t_l)) P[x_i(t_l), x_1(t_l)].     (12)

We interpret the probability P[x_i(t_l), x_1(t_l)] as a measure of the correctness of calculating ã_ij(t_l) using Eq. (7). If the noise is at least one order of magnitude
smaller than the observed signals, the approximation of Eq. (7) holds. If the noise parameters can be estimated, for example in the case of normally distributed noise with a known variance σ², the probability of the noise being an order of magnitude smaller than the measurement x_i(t_l) or x_1(t_l) is⁵:

P[x_i(t_l), x_1(t_l)] ≡ min{ ∫_{−|x_i(t_l)|}^{|x_i(t_l)|} (1/(σ√(20π))) e^{−v²/(20σ²)} dv, ∫_{−|x_1(t_l)|}^{|x_1(t_l)|} (1/(σ√(20π))) e^{−v²/(20σ²)} dv }.     (13)

We want to evaluate the conditional probability of ã_ij(t), represented by the α̃_ijk parameters, given the ratio of the measurements r_i(t0 ... t_{M−1}), i.e., Eq. (10). Since we know from Eq. (5) that ã_ij(t) = g̃_ij(α̃_ijk; t), we can calculate this probability using a line integral over a scalar field, being the density f̂(r_i, t):

J̃(α̃_ijk) ≡ P[α̃_ijk | r_i(t0 ... t_{M−1})] = ∫_{t1}^{t2} f̂(g̃_ij(α̃_ijk; t), t) √(1 + [g̃′_ij(α̃_ijk; t)]²) dt,     (14)

where g̃′_ij stands for the derivative with respect to t, and t1, t2 are the observation start and stop times, respectively. The optimization can be done using the Newton method, starting from several initial points, as follows:
1. Take as an initial guess a vector of α̃_ijk^(0) parameters.
2. Use the vector α̃_ijk^(m) obtained from the previous step and construct the gradient vector ∇J̃(α̃_ijk^(m)) and the Hessian matrix H_J̃(α̃_ijk^(m)) using:

∂J̃(α̃_ijk^(m)) / ∂α̃_ijk^(m) = ∂P[α̃_ijk^(m) | r_i(t0 ... t_{M−1})] / ∂α̃_ijk^(m),     (15)

∂²J̃(α̃_ijk^(m)) / (∂α̃_ijk^(m) ∂α̃_ijp^(m)) = ∂²P[α̃_ijk^(m) | r_i(t0 ... t_{M−1})] / (∂α̃_ijk^(m) ∂α̃_ijp^(m)).     (16)

3. Update the estimated parameters:

α̃_ijk^(m+1) = α̃_ijk^(m) − H_J̃^{−1}(α̃_ijk^(m)) ∇J̃(α̃_ijk^(m)).     (17)
4. Repeat steps 2 and 3 until convergence.

The initial points can be chosen by using an approach similar to that of the Hough transform in image processing [1]. The grouping problem is resolved by assigning each time instance to the maximum which maximized Eq. (10) with sufficient probability⁵.

⁵ In the case where the noise parameters are unknown, we assume that P[x_i(t_l), x_1(t_l)] is constant.
A few implementation remarks are in order:
– The integral of Eq. (14) should be evaluated as an indefinite integral. Therefore, we suggest using the Epanechnikov radially symmetric kernel [8], which is defined as:

K(r, t) = (2/π)(1 − r² − t²) if (r² + t²) ≤ 1, and 0 otherwise.     (18)

– If g̃_ij(α̃_ijk) makes the indefinite integral of Eq. (14) unsolvable, an approximation for g̃_ij(α̃_ijk) can be used.
– It is usually preferable to define b̃_ij(t) ≡ arctan(ã_ij(t)) in order to eliminate the noise amplification accompanying the calculation of x_i(t)/x_1(t). The definition of r_i(t_l) to be inserted in Eq. (11) should then be changed to r_i(t_l) ≡ arctan(x_i(t_l)/x_1(t_l)). After optimizing for the α̃_ijk parameters, ã_ij(t) can be found using ã_ij(t) = tan(b̃_ij).
A numerical illustration of these remarks is sketched below.
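As an illustration, the curve likelihood of Eq. (14) can be approximated numerically with the Epanechnikov kernel of Eq. (18). The sketch below replaces the closed-form integral by a trapezoidal sum and takes P[x_i, x_1] constant (the unknown-noise case of footnote 5), so it is not the analytic procedure described above; all names are ours.

```python
import numpy as np

def epanechnikov(r, t):
    """Radially symmetric Epanechnikov kernel of Eq. (18)."""
    q = r ** 2 + t ** 2
    return np.where(q <= 1.0, (2.0 / np.pi) * (1.0 - q), 0.0)

def curve_likelihood(alpha, g, r_samples, t_samples, hr, ht, t_grid):
    """Approximate the line integral of Eq. (14): evaluate the KDE of
    Eqs. (11)-(12) along the candidate curve t -> g(alpha; t) and
    integrate with the arc-length factor sqrt(1 + g'(t)^2)."""
    M = len(t_samples)
    curve = g(alpha, t_grid)
    f = np.zeros_like(t_grid, dtype=float)
    for rl, tl in zip(r_samples, t_samples):       # Eq. (12), constant P
        f += epanechnikov((curve - rl) / hr, (t_grid - tl) / ht)
    f /= M * hr * ht                               # Eq. (11) normalization
    dg = np.gradient(curve, t_grid)                # numerical derivative g'
    y = f * np.sqrt(1.0 + dg ** 2)
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t_grid))  # trapezoid rule
```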
5 Results
We tested our approach on both synthetic and real mixtures. For synthetic mixtures, 2 random signals were drawn from an exponential distribution and mixed with the diagonal matrix A(t) with diagonal coefficients a_jj(t) = α_jj1 t² + α_jj2 t + α_jj3. Random noise with normal distribution was added to the mixtures, yielding a mixture-to-noise ratio of 26 dB. The setup depicted in Fig. 1 was built. Real mixtures were acquired with a Canon D100 digital camera connected to a computer. The PSNR, which was estimated by taking consecutive identical pictures, was 40 dB. We defined b̃_2j(t) ≡ arctan(ã_2j(t)) and assumed that a second-order polynomial is the estimated Taylor expansion of b̃_2j(t).
Fig. 2. Estimated b̃_2j(t) for the synthetic (left) mixture and b̃_2j(y) for the real (right) mixture. The b̃_2j(t) or b̃_2j(y) estimated using our algorithm and a quadratic approximation is depicted by the solid line. The true b̃_2j(t), for the synthetic mixture, is depicted by the dashed line. Scattered dots represent r_2(t_n) or r_2(y_n), where t_n and y_n are all the time instances or pixels where the signal (or its derivative) was above a threshold.
Table 1. Signal-to-Noise Ratio of the Synthetic Example

Mixture/Estimated   s1 [dB]   s2 [dB]
x1                  -15       8
x2                  10        -10
Estimated           16        21
Since images are not naturally sparse, a derivative with respect to the y axis (along the rows) was applied to the image mixtures. We initiated our algorithm with several starting parameters, and the algorithm converged to two local maxima. Fig. 2 shows the results of identifying b̃_2j(t) for the synthetic and the real mixtures. It can be seen that the algorithm produces a close approximation. Table 1 summarizes the results of inverting the system of the synthetic example; the signal-to-noise ratio has improved dramatically after the separation. Fig. 3 shows the results of inverting the system of the real mixture to recover the separated images. As can be seen, the estimated images are separated properly.
Fig. 3. Results of separating the real mixtures: [Top] The mixtures. [Bottom] Estimated separated sources.
6 Conclusion
The proposed approach is general in that it can be applied to any type of position-varying or time-varying mixing model. As such, it is useful in various optical cases where the mixing lens cannot be assumed to acquire distortionless images, and likewise in various time-domain problems where the mixing characteristics vary with time. Whereas for simple linear or quadratic forms of the varying mixing parameters a simple solution can be obtained according to our approach, in other cases a solution is possible by solving the integral equations numerically. A further extension of this study, now in progress, separates time-varying convolutive mixtures by using a proper transformation to a time-frequency domain in which the problem is of time-varying instantaneous form (see [3] for details).
Acknowledgement Research supported in part by the Ollendorff Minerva center, by the HASSIP Research Network Program HPRN-CT-2002-00285, sponsored by the European Commission, and by the Fund for Promotion of Research at the Technion. R. K. gratefully acknowledges the special doctoral fellowship awarded by HP.
References
1. Ballard, D.H.: Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognition 13(2), 111–122 (1981)
2. Everson, R.M., Roberts, S.J.: Particle Filters for Non-stationary ICA. In: Advances in Independent Component Analysis, pp. 23–41. Springer, Heidelberg (2000)
3. Kaftory, R., Zeevi, Y.Y.: Blind Separation of Time-Varying Signal Mixtures Using Zadeh's Transform. In: Proceedings of EUSIPCO (2006)
4. Kenneth, H., Deniz, E., Jose, P.: Blind Source Separation of Time-Varying Instantaneous Mixtures Using an On-line Algorithm. In: Proc. ICASSP '02, pp. 993–996 (2001)
5. Kisilev, P., Zibulevsky, M., Zeevi, Y.Y.: Blind Source Separation Using Multinode Sparse Representation. In: Proc. ICIP 2001, pp. 202–205 (2001)
6. Koivunen, V., Enescu, M., Oja, E.: Adaptive Algorithm for Blind Separation from Noisy Time-Varying Mixtures. Neural Computation 13, 2339–2358 (2001)
7. Mukai, R., Sawada, H., Araki, S., Makino, S.: Robust Real-Time Blind Source Separation for Moving Speakers in a Room. In: Proc. ICASSP '03, pp. 469–472 (2003)
8. Scott, D.W.: Multivariate Density Estimation, p. 139. Wiley, Chichester (1992)
9. Weisman, T., Yeredor, A.: Separation of Periodically Time-Varying Mixtures Using Second-Order Statistics. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889. Springer, Heidelberg (2006)
10. Yeredor, A.: TV-SOBI: An Expansion of SOBI for Linearly Time-Varying Mixtures. In: Proceedings of ICA 2003 (2003)
11. Zibulevsky, M., Pearlmutter, B.A., Bofill, P., Kisilev, P.: Blind Source Separation by Sparse Decomposition. In: Independent Component Analysis: Principles and Practice. Cambridge University Press, Cambridge (2001)
Infinite Sparse Factor Analysis and Infinite Independent Components Analysis

David Knowles and Zoubin Ghahramani
Department of Engineering, University of Cambridge, CB2 1PZ, UK
[email protected], [email protected]

Abstract. A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov Chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.
1 Introduction
Independent Components Analysis (ICA) is a model which explains observed data y_t (dimension D) in terms of a linear superposition of independent hidden sources x_t (dimension K), so y_t = G x_t + ε_t, where G is the mixing matrix and ε_t is Gaussian noise. In the standard ICA model we assume K = D and that there exists W = G^{−1}. We use FastICA [1], a widely used implementation, as a benchmark. The assumption K = D may be invalid, so Reversible Jump MCMC [2] could be used to infer K. In this paper we propose a sparse implementation which allows a potentially infinite number of components and the choice of whether a hidden source is active for a data point, allowing increased data reduction for systems where sources are intermittently active. Although ICA is not a time-series model, it has been used successfully on time-series data such as electroencephalograms [3]. It has also been applied to gene expression data [4], the application we choose for a demonstration.
2 The Model
We define a binary vector z_t which acts as a mask on x_t. Element z_kt specifies whether hidden source k is active for data point t. Thus

Y = G(Z ⊙ X) + E,     (1)
where ⊙ denotes element-wise multiplication and X, Y, Z and E are the concatenated matrices of x_t, y_t, z_t and ε_t, respectively. We allow a potentially infinite number of hidden sources, so that Z has infinitely many rows, although only a finite number will have non-zero entries. We assume Gaussian noise with variance σ_ε², which is given an inverse Gamma prior IG(σ_ε²; a, b). We define two variants based on the prior for x_kt: infinite sparse Factor Analysis (isFA) has a unit Gaussian prior; infinite Independent Components Analysis (iICA) has a Laplacian(1) prior. Other heavy-tailed distributions are possible but are not explored here. Varying the variance is redundant because we infer the variance of the mixture weights. The prior on the elements of G is Gaussian with variance σ_G², which is given an inverse Gamma prior. We define the prior on Z using the Indian Buffet Process with parameter α (and later β), as described in Section 2.1 and in more detail in [5]. We place Gamma priors on α and β. All four variants share

ε_t ∼ N(0, σ_ε² I),   σ_ε² ∼ IG(a, b)     (2)
g_k ∼ N(0, σ_G² I),   σ_G² ∼ IG(c, d)     (3)
Z ∼ IBP(α, β),   α ∼ G(e, f)     (4)

The differences between the variants are summarised here:

                x_kt ∼ N(0, 1)    x_kt ∼ L(1)
β = 1           isFA1             iICA1
β ∼ G(1, 2)     isFA2             iICA2

2.1 Defining a Distribution on an Infinite Binary Matrix
N K
P (zkt |πk ) =
k=1 t=1
K
πkmk (1 − πk )N −mk
(5)
k=1
N where N is the total number of data points and mk = t=1 zkt is the number α , 1) prior on πk , of data points for which source k is active. We put a Beta( K where α is the strength parameter. Due to the conjugacy between the binomial and beta distributions we are able to integrate out π to find P (Z) =
K k=1
α +K )Γ (N − mk + 1) α Γ (N + 1 + K )
α K Γ (mk
(6)
Infinite Sparse Factor Analysis and Infinite ICA
383
Take the infinite limit. By defining a scheme to order the non-zero rows of Z (see [5]) we can take K → ∞ and find (N − mk )!(mk − 1)! αK+ exp {−αHN } N! h>0 Kh ! K+
P (Z) =
(7)
k=1
N where K+ is the number of active features, HN = j=1 1j is the N -th harmonic number, and Kh is the number of rows whose entries correspond to the binary number h. Go to an Indian Buffet. This distribution corresponds to a simple stochastic process, the Indian Buffet Process. Consider a buffet with a seemingly infinite number of dishes (hidden sources) arranged in a line. The first customer (data point) starts at the left and samples Poisson(α) dishes. The ith customer moves from left to right sampling dishes with probability mik where mk is the number of customers to have previously sampled that dish. Having reached the end of the previously sampled dishes, he tries Poisson( αi ) new dishes. If we apply the same ordering scheme to the matrix generated by this process as for the finite model, we recover the correct exchangeable distribution. Since the distribution is exchangeable with respect to the customers we find by considering m where m = the last customer that P (zkt = 1|z−kt ) = k,−t k,−t s=t zks , which N is used in sampling Z. By exchangeability and considering the first customer, the number of active sources for a data point follows a Poisson(α) distribution, and the expected number Z is N α. We also see that the number of active of entries in α Poisson( ) features, K+ = N t=1 t = Poisson(αHN ) which grows as α log N for large N . Two parameter generalisation. A problem with the one parameter IBP is that the number of features per object, α, and the total number of features, N α, are both controlled by α and cannot vary independently. Under this model, we cannot tune how likely it is for features to be shared across objects. To overcome this restriction we follow [6], introducing β, a measure of the feature mk and repulsion. The ith customer now samples dish k with probability β+i−1 αβ ) new dishes. samples Poisson( β+i−1 mk,−t We find P (zkt = 1|z−kt , β) = β+N −1 . The marginal probability of Z becomes
P(Z|α, β) = ((αβ)^{K_+} / Π_{h>0} K_h!) exp{−α H_N(β)} Π_{k=1}^{K_+} B(m_k, N − m_k + β),     (8)

where H_N(β) = Σ_{j=1}^N β/(β + j − 1).
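The buffet metaphor translates directly into a sampler for this prior. The following is a minimal sketch (function and variable names are ours) of drawing Z from the two-parameter IBP; β = 1 recovers the one-parameter process.

```python
import numpy as np

def sample_ibp(N, alpha, beta=1.0, seed=0):
    """Draw a binary matrix Z from the two-parameter Indian Buffet
    Process: customer i samples existing dish k with probability
    m_k/(beta+i-1), then tries Poisson(alpha*beta/(beta+i-1)) new dishes."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((0, N), dtype=int)      # rows: dishes (hidden sources)
    m = np.zeros(0)                      # number of previous tasters m_k
    for i in range(1, N + 1):
        take = rng.random(len(m)) < m / (beta + i - 1)
        Z[take, i - 1] = 1
        m[take] += 1
        k_new = rng.poisson(alpha * beta / (beta + i - 1))
        if k_new:
            rows = np.zeros((k_new, N), dtype=int)
            rows[:, i - 1] = 1           # new dishes tried by customer i
            Z = np.vstack([Z, rows])
            m = np.concatenate([m, np.ones(k_new)])
    return Z
```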
3 Inference
Given the observed data Y, we wish to infer the hidden sources X, which sources are active Z, the mixing matrix G, and all hyperparameters. We use Gibbs sampling, but with Metropolis-Hastings (MH) steps for β and sampling new features.
We draw samples from the marginal distribution of the model parameters given the data by successively sampling the conditional distributions of each parameter in turn, given all other parameters.

Hidden sources. We sample each element of X for which z_kt = 1. We denote the k-th column of G by g_k and ε_t|_{z_kt=0} by ε_{−kt}. For isFA we find the conditional distribution is a Gaussian:

P(x_kt|G, x_{−kt}, y_t, z_t) = N(x_kt; g_k^T ε_{−kt}/(σ_ε² + g_k^T g_k), σ_ε²/(σ_ε² + g_k^T g_k)).     (9)

For iICA we find a piecewise Gaussian distribution, which it is possible to sample from analytically given the Gaussian c.d.f. function F:

P(x_kt|G, x_{−kt}, y_t, z_t) ∝ N(x_kt; μ_−, σ²) for x_kt > 0, and N(x_kt; μ_+, σ²) for x_kt < 0,     (10)

where

μ_± = (g_k^T ε_{−kt} ± σ_ε²)/(g_k^T g_k)   and   σ² = σ_ε²/(g_k^T g_k).     (11)
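For illustration, the isFA draw of Eq. (9) for a single element x_kt is one line of conjugate algebra. This is a sketch: the residual ε_{−kt} is assumed precomputed by the caller, and the function name is ours.

```python
import numpy as np

def sample_x_isfa(g_k, eps_minus, sigma_eps2, rng=np.random.default_rng()):
    """Gibbs draw of x_kt for isFA, Eq. (9): the unit-Gaussian prior is
    conjugate to the Gaussian likelihood, so the conditional is Gaussian.
    eps_minus is epsilon_{-kt} = y_t - sum_{l != k} g_l (z_lt x_lt)."""
    gg = float(g_k @ g_k)
    mean = float(g_k @ eps_minus) / (sigma_eps2 + gg)
    var = sigma_eps2 / (sigma_eps2 + gg)
    return rng.normal(mean, np.sqrt(var))
```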
Active sources. To sample Z we first define the ratio of conditionals, r:

r = [P(y_t|G, x_{−kt}, z_{−kt}, z_kt = 1, σ_ε²) / P(y_t|G, x_{−kt}, z_{−kt}, z_kt = 0, σ_ε²)] · [P(z_kt = 1|z_{−kt}) / P(z_kt = 0|z_{−kt})] = r_l · r_p,     (12)

so that P(z_kt = 1|G, X_{−kt}, Y, Z_{−kt}) = r/(r + 1). From Section 2.1 the ratio of priors is r_p = m_{k,−t}/(β + N − 1 − m_{k,−t}). To find P(y_t|G, x_{−kt}, z_{−kt}, z_kt = 1) we must marginalise over all possible values of x_kt:

P(y_t|G, x_{−kt}, z_{−kt}, z_kt = 1) = ∫ P(y_t|G, x_t, z_{−kt}, z_kt = 1) P(x_kt) dx_kt.     (13)

For isFA, using Equation (9) and integrating, we find r_l = σ exp(μ²/(2σ²)). For iICA we use Equation (11) and integrate above and below 0 to find

r_l = σ √(π/2) [ F(0; μ_+, σ) exp(μ_+²/(2σ²)) + (1 − F(0; μ_−, σ)) exp(μ_−²/(2σ²)) ].     (14)

Creating new features. Z is a matrix with infinitely many rows, but only the non-zero rows can be held in memory. However, the zero rows still need to be taken into account. Let κ_t be the number of rows of Z which contain 1 only in column t, i.e., the number of features which are active only at time t. New features are proposed by sampling κ_t with a MH step. We propose a move ξ → ξ* with probability J(ξ*|ξ) which, following [7], we set equal to the prior on ξ*. This move is accepted with probability min(1, r_{ξ→ξ*}), where

r_{ξ→ξ*} = P(ξ*|rest) J(ξ|ξ*) / (P(ξ|rest) J(ξ*|ξ)) = P(rest|ξ*) P(ξ*) P(ξ) / (P(rest|ξ) P(ξ) P(ξ*)) = P(rest|ξ*) / P(rest|ξ),     (15)
where "rest" denotes all other variables. By this choice r_{ξ→ξ*} becomes the ratio of likelihoods. From the IBP, the prior for κ_t is P(κ_t|α) = Poisson(αβ/(β + N − 1)). For isFA we can integrate out x′_t, the new elements of x_t, but not G*, the new columns of G, so our proposal is ξ = {G*, κ_t}. We find r_{ξ→ξ*} = |Λ|^{−1/2} exp(½ μ^T Λ μ), where Λ = I + G*^T G*/σ_ε² and Λμ = G*^T ε_t/σ_ε². For iICA marginalisation is not possible, so ξ = {G*, x′_t, κ_t}. From Equation (15) we find

r_{ξ→ξ*} = exp(−(1/(2σ_ε²)) x′_t^T G*^T (G* x′_t − 2ε_t)).     (16)

Mixture weights. We sample the columns g_k of G. We denote the k-th row of (Z ⊙ X) by x̄_k. We have P(g_k|G_{−k}, X, Y, Z, σ_ε², σ_G²) ∝ P(Y|G, X, Z, σ_ε²) P(g_k|σ_G²). The total likelihood function has exponent

−(1/(2σ_ε²)) tr(E^T E) = −(1/(2σ_ε²)) ((x̄_k x̄_k^T)(g_k^T g_k) − 2 g_k^T E|_{g_k=0} x̄_k^T) + const,     (17)

where E = Y − G(Z ⊙ X). We thus find the conditional of g_k is N(μ, Λ), where μ = (σ_G²/(x̄_k x̄_k^T σ_G² + σ_ε²)) E|_{g_k=0} x̄_k^T and Λ = (x̄_k x̄_k^T)/σ_ε² + (1/σ_G²) I_{D×D}.
Learning the noise level. We allow the model to learn the noise level σ_ε². Applying Bayes' rule we find

P(σ_ε²|E, a, b) ∝ P(E|σ_ε²) P(σ_ε²|a, b) = IG(σ_ε²; a + ND/2, b/(1 + (b/2) tr(E^T E))).     (18)

Inferring the scale of the data. For sampling σ_G², the conditional prior on G acts as the likelihood term:

P(σ_G²|G, c, d) ∝ P(G|σ_G²) P(σ_G²|c, d) = IG(σ_G²; c + DK/2, d/(1 + (d/2) tr(G^T G))).     (19)
IBP parameters. We infer the IBP strength parameter α. The conditional prior on Z, given by Equation (8), acts as the likelihood term:

P(α|Z, β) ∝ P(Z|α, β) P(α) = G(α; K_+ + e, f/(1 + f H_N(β))).     (20)

We sample β by a MH step with acceptance probability min(1, r_{β→β*}). By Equation (15), setting J(β*|β) = P(β*) = G(1, 1) results in r_{β→β*} = P(Z|α, β*)/P(Z|α, β).
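A compact sketch of the conjugate hyperparameter updates of Eqs. (18)–(20) follows. We assume the inverse-Gamma parameterization IG(x; a, b) in which 1/x ∼ Gamma(shape a, scale b), which is the convention that matches the posterior scale terms above; the function name is ours.

```python
import numpy as np

def resample_hyperparams(E, G, K_plus, H_N_beta, a, b, c, d, e, f,
                         rng=np.random.default_rng()):
    """Conjugate updates of Eqs. (18)-(20): inverse-Gamma posteriors for
    the noise and mixing-weight variances, Gamma posterior for alpha."""
    D, N = E.shape
    K = G.shape[1]
    sigma_eps2 = 1.0 / rng.gamma(a + N * D / 2.0,                     # (18)
                                 b / (1.0 + 0.5 * b * np.trace(E.T @ E)))
    sigma_G2 = 1.0 / rng.gamma(c + D * K / 2.0,                       # (19)
                               d / (1.0 + 0.5 * d * np.trace(G.T @ G)))
    alpha = rng.gamma(K_plus + e, f / (1.0 + f * H_N_beta))           # (20)
    return sigma_eps2, sigma_G2, alpha
```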
4 Results
4.1 Synthetic Data
We ran all four variants and three FastICA variants (using the pow3, tanh and gauss non-linearities) on 30 sets of randomly generated data with D = 7, K = 6, N = 200, the Z matrix shown in Figure 1(a), and Gaussian or Laplacian source distributions. Figure 1 shows the average inferred Z matrix and algorithm convergence for a typical 1000-iteration iICA1 run. Z is successfully recovered up to an arbitrary ordering. The gaps in the inferred Z are a result of inferring z_kt = 0 where x_kt = 0.
(a) Top: True Z. Bottom: Inferred Z
(b) Log likelihood, α and K_+ for the duration of a 1000 iteration run
Fig. 1. True and inferred Z and algorithm convergence for typical iICA1 run
Boxplots of the Amari error [8] for each algorithm are shown in Figure 2. Figure 2(a) shows the results when the synthetic data has Gaussian source distributions. All four variants perform significantly better on the sparse synthetic data than any of the FastICA variants, but then we do not expect FastICA to recover Gaussian sources. Figure 2(b) shows the results when the synthetic data has Laplacian source distributions. As expected, the FastICA performance is much improved because the sources are heavy tailed. However, isFA1, iICA1 and iICA2 still perform better on average because they correctly model the sparse nature of the data. The performance of isFA2 is severely affected by having the incorrect source model, suggesting the iICA variants may be more robust to deviations from the assumed source distribution. The two-parameter IBP variants of both algorithms actually perform no better than the one-parameter versions: β = 1 happens to be almost optimal for the synthetic Z used.
Gene Expression Data
We now apply our model to the microarray data from an ovarian cancer study [4], which represents the expression level of N = 172 genes (data points) across
387
(b) Laplacian sources
Fig. 2. Boxplots of Amari errors for 30 synthetic data sets with D = 7, K = 6, N = 100, analysed using each algorithm variant and FastICA
D = 17 tissue samples (observed variables). The tissue samples are grouped into five tissue types: one healthy and four diseased. ICA was applied to this dataset in [4], where the term gene signature is used to describe the inferred hidden sources. Some of the processes which regulate gene expression, such as DNA methylation, completely silence the gene, while others, such as transcription regulation, affect the level at which the gene is expressed. Thus our sparse model is highly valid for this system: Z represents which genes are silenced, and X represents the expression level of active genes. Figure 3 shows the mean G matrix inferred by iICA1. Gene signature (hidden source) 1 is expressed across all the tissue samples, accounting for genes shared by all the samples. Signature 7 is specific to the pd-spa tissue type. This is consistent with the findings in [4], with the same top 3 genes. Such tissue-type-dependent signatures could be used for observer-independent classification. Signatures such as 5, which is differentially expressed across the pd-spa samples, could help subclassify tissue types.
Fig. 3. Hinton diagram of G^T: the expression level of each gene signature within each tissue sample
5 Conclusions and Future Work
In this paper we have defined the infinite sparse FA and infinite ICA models using a distribution over the infinite binary matrix Z corresponding to the Indian Buffet Process. We have derived MCMC algorithms for each model to infer the parameters given observed data. These have been demonstrated on synthetic data, where the correct assumption about the hidden source distribution was shown to give optimal performance, and on gene expression data, where the results were consistent with those obtained using ICA. A MATLAB implementation of the algorithms will be made available at http://learning.eng.cam.ac.uk/zoubin/. There are a number of directions in which this work can be extended. The recently developed stick-breaking constructions for the IBP will allow a slice sampler to be derived for Z, which should allow faster mixing than the MH step currently used in sampling new features. Faster, partially deterministic algorithms would be useful for online learning in applications such as audio processing. The sparse nature of the model could have useful applications in data compression for storage or data reduction for further analysis.
References
1. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
2. Richardson, S., Green, P.J.: On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society 59, 731–792 (1997)
3. Makeig, S., Bell, A.J., Jung, T.P., Sejnowski, T.J.: Independent component analysis of electroencephalographic data. Advances in Neural Information Processing Systems 8, 145–151 (1996)
4. Martoglio, A.M., Miskin, J.W., Smith, S.K., MacKay, D.J.C.: A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics 18(12), 1617–1624 (2002)
5. Griffiths, T., Ghahramani, Z.: Infinite latent feature models and the Indian buffet process. Technical Report 1, Gatsby Computational Neuroscience Unit (2005)
6. Ghahramani, Z., Griffiths, T., Sollich, P.: Bayesian nonparametric latent feature models. In: Bayesian Statistics 8. Oxford University Press, Oxford (2007)
7. Meeds, E., Ghahramani, Z., Neal, R., Roweis, S.: Modeling dyadic data with binary latent factors. In: Neural Information Processing Systems, vol. 19 (2006)
8. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind signal separation. In: Advances in Neural Information Processing Systems, vol. 8, pp. 757–763. The MIT Press, Cambridge (1996)
Fast Sparse Representation Based on Smoothed 0 Norm G. Hosein Mohimani1 , Massoud Babaie-Zadeh1, , and Christian Jutten2 Electrical Engineering Department, Advanced Communications Research Institute (ACRI), Sharif University of Technology, Tehran, Iran 2 GIPSA-lab, Department of Images and Signals, National Polytechnic Institute of Grenoble (INPG), France [email protected], [email protected], [email protected]
1
Abstract. In this paper, a new algorithm for Sparse Component Analysis (SCA) or atomic decomposition on over-complete dictionaries is presented. The algorithm is essentially a method for obtaining sufficiently sparse solutions of underdetermined systems of linear equations. The solution obtained by the proposed algorithm is compared with the minimum 1 -norm solution achieved by Linear Programming (LP). It is experimentally shown that the proposed algorithm is about two orders of magnitude faster than the state-of-the-art 1 -magic, while providing the same (or better) accuracy. Keywords: sparse decomposition.
1
component
analysis,
over-complete
atomic
Introduction
Obtaining sparse solutions of under-determined systems of linear equations is of significant importance in signal processing and statistics. Despite recent theoretical developments [1,2,3], the computational cost of the methods has remained as the main restriction, especially for large systems (large number of unknowns/equations). In this article, a new approach is proposed which provides a considerable reduction in complexity. To introduce the problem in more details, we will use the context of Sparse Component Analysis (SCA). The discussions, however, may be easily followed in other contexts of application, for example, in finding ‘sparse decomposition’ of a signal in an over-complete dictionary, which is the goal of the so-called over-complete ‘atomic decomposition’ [4]. SCA can be viewed as a method to achieve separation of sparse sources. The general Blind Source Separation (BSS) problem is to recover n unknown (statistically independent) sources from m observed mixtures of them, where little or no information is available about the sources (except their statistical independence) and the mixing system. In linear instantaneous (noiseless) model, it is assumed that x(t) = As(t) in which x(t) and s(t) are the m × 1 and n × 1
This work has been partially funded by Sharif University of Technology, by Center for International Research and Collaboration (ISMO) and by Iran NSF (INSF).
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 389–396, 2007. c Springer-Verlag Berlin Heidelberg 2007
390
G.H. Mohimani, M. Babaie-Zadeh, and C. Jutten
vectors of sources and mixtures and A is the m × n mixing matrix. The goal of BSS is then to find s(t) only from x(t). The general BSS problem is not easy for the case n > m. However, if the sources are sparse (i.e., not a totally blind situation), then the problem can be solved in two steps [3,2]: first estimating the mixing matrix [3,2,5,6], and then estimating the sources assuming A to be known [3,2,7,8]. In this paper we only consider the second step. To obtain the sparsest solution of As = x, we may search for a solution of it having minimal 0 norm, i.e., minimum number of nonzero components. It is usually stated in the literature [4,9,10,3] that searching the minimum 0 norm is an intractable problem as the dimension increases (because it requires a combinatorial search), and it is too sensitive to noise (because any small amount of noise completely changes the 0 norm of a vector). Consequently, the researchers look for other approaches to find sparse solution of As = x which are tractable. One of the most successful approaches is Basis Pursuit (BP) [11,1,10,3] which finds the minimum 1 norm (that is, the solution of As = x for which i |si | is minimized). Such a solution can be easily found by Linear Programming (LP) methods. The idea of Basis Pursuit is based on the property that for large systems of equations, the minimum 1 norm solution is also the minimum 0 norm solution [1,11,12,13]. By utilizing fast LP algorithms, specifically interior-point LP solvers or 1 -magic [14] (which is about one order of magnitude faster), large-scale problems with thousands of sources and mixtures become tractable. However, although this approach is tractable, it is still very time-consuming. Another approach is Matching Pursuit (MP) [15,16,3] which is very fast, but is somewhat heuristic and does not provide good estimation of the sources. In this article, we present a fast method for finding the sparse solution of an under-determined system of linear equations, which is based on minimization of 0 norm. The paper is organized as follows. The next section introduces a family of Gaussian sparsity norms and discusses their optimization. The algorithm is then stated in Section 3. Finally, Section 4 provides some experimental results of our algorithm and its comparison with BP.
2
The Main Idea
The main idea of this article is to approximate the 0 norm by a smooth (continuous) function, which lets us to use gradient based methods for its minimization and solves also the sensitivity of 0 norm to noise. In this section we introduce a family of smooth approximators of 0 norm, whose optimization results in a fast algorithm for finding the sparse solution while preserving noise robustness. The 0 norm of s = [s1 . . . sn ]T is defined as the number of non-zero components of s. In other words if we define 1 s = 0 ν(s) = (1) 0 s=0 then s0 =
n i=1
ν(si ).
(2)
Fast Sparse Representation Based on Smoothed 0 Norm
391
It is clear that the discontinuities of 0 norm are caused by discontinuities of the function ν. If we replace ν by a smooth estimation of it in (2), we obtain a smooth estimation of 0 norm. This may also provide some robustness to noise. Different functions may be utilized for this aim. We use zero-mean Gaussian family of functions which seem to be very useful for this application, because of their differentiability. By defining: fσ (s) = exp(−s2 /2σ 2 ),
we have: lim fσ (s) =
σ→ 0
1 0
s=0 . s = 0
(3) (4)
Consequently, limσ→ 0 fσ (s) = 1 − ν(s), and therefore if we define: Fσ (s) =
n
fσ (si ),
(5)
i=1
we have: lim Fσ (s) =
σ→ 0
n
(1 − ν(si )) = n − s0 .
(6)
i=1
We take then n − Fσ (s) as an approximation to s0 : s0 ≈ n − Fσ (s).
(7)
The value of σ specifies a trade-off between accuracy and smoothness of the approximation: the smaller σ, the better approximation, and the larger σ, the smoother approximation. From (6), minimization of 0 norm is equivalent to maximization of Fσ for sufficiently small σ. This maximization should be done on the affine set S = {s | As = x}. For small values of σ, Fσ contains a lot of local maxima. Consequently, it is very difficult to directly maximize this function for very small values of σ. However, as value of σ grows, the function becomes smoother and smoother, and for sufficiently large values of σ, as we will show, there is no local maxima (see Theorem 1 of the next section). Our idea for escaping local maxima is then to decrease the value of σ gradually1 : for each value of σ we use a steepest ascent algorithm for maximizing Fσ , and the initial value of this steepest ascent algorithm is the maximizer of Fσ obtained for the previous (larger) value of σ. Since the value of σ changes slowly, the steepest ascent algorithm is initialized not far from the actual maximum. Consequently, we hope that it would not be trapped in the local maxima. Remark 1. Equation (6) proposes that Fσ (·) can be seen as a measure of ‘sparsity’ of a vector (especially for small values of σ): the sparser s, the larger Fσ (s). In this viewpoint, maximizing Fσ (s) on a set is equivalent to finding the ‘sparsest’ element of that set. 1
This idea is similar to simulated annealing, but, here, the sequence of decreasing values is short and easy to define so that the solution is achieved in a few steps, usually less than 10.
392
G.H. Mohimani, M. Babaie-Zadeh, and C. Jutten
– Initialization: 1. Choose an arbitrary solution from the feasible set S, v0 , e.g., the minimum 2 norm solution of As = x obtained by pseudo-inverse (see the text). 2. Choose a suitable decreasing sequence for σ, [σ1 . . . σK ]. – for k = 1, . . . , K: 1. Let σ = σk . 2. Maximize (approximately) the function Fσ on the feasible set S using L iterations of the steepest ascent algorithm (followed by projection onto the feasible set): • Initialization: s = vk−1 . • for j = 1 . . . L (loop L times): (a) Let: Δs = [s1 exp (−s21 /2σk2 ), . . . , sn exp (−s2n /2σk2 )]T . (b) Let s ← s − μΔs (where μ is a small positive constant). (c) Project s back onto the feasible set S: s ← s − AT (AAT )−1 (As − x) 3. Set vk = s. – Final answer is s = vl .
Fig. 1. The final algorithm (SL0 algorithm)
3
The Algorithm
Based on the main idea of the previous section, the final algorithm (smoothed 0 or SL0) is given in Fig. 1. As indicated in the algorithm, the final value of previous estimation is used for the initialization of the next steepest ascent. By choosing a slowly decreasing sequence of σ, we may escape from getting trapped in the local maxima, and obtain the sparsest solution. Remark 2. The internal loop (steepest ascent for a fixed σ) is repeated a fixed and small number of times (L). In other words, for increasing the speed, we do not wait for the (internal loop of the) steepest ascent algorithm to converge. This may be justified by gradual decrease in value of σ, and the fact that for each value, we do not need the exact maximizer of Fσ . All we need, is to enter a region near the (absolute) maximizer of Fσ for escaping from its local maximizers. Remark 3. Steepest ascent consists of iterations of the form s ← s + μk ∇Fσ (s). Here, the step-size parameters μk should be decreasing, i.e., for smaller values of σ, smaller values of μk should be applied. This is because for smaller values of σ, the function Fσ is more ‘fluctuating’, and hence smaller step-sizes should be used for its maximization. If we set2 μk = μσk2 , we obtain s ← s − μΔs as stated in the algorithm of Fig. 1 (note that Δs −∇Fσ (s)/σ 2 ). Remark 4. The algorithm may work by initializing v0 (initial estimation of the sparse solution) to an arbitrary solution of As = x. However, the best initial 2
In fact, we may think about changing the σ in (3) and (5) as looking at the same curve (or surface) at different ‘scales’, where the scale is proportional to σ 2 . For having the same step-sizes of the steepest ascent algorithm in these different scales, μk should be proportional to σ 2 .
Fast Sparse Representation Based on Smoothed 0 Norm
393
value of v0 is the minimum 2 norm solution of As = x, which is given by the pseudo-inverse of A. It is because this solution is the (unique) maximizer of Fσ (s) on the feasible set S, where σ tends to infinity. This is formally stated in the following theorem (refer to appendix for the proof). Theorem 1. The solution of the problem: Maximize Fσ (s) subject to As = x, where σ → ∞, is the minimum 2 norm solution of As = x, that is, s = AT (AAT )−1 x. Remark 5. Having initiated the algorithm with the minimum 2 norm solution (which corresponds to σ = ∞), the next value for σ (i.e., σ1 ) may be chosen about two to four times of the maximum absolute value of the obtained sources (maxi |si |). To see the reason, note first that: 1 , if |si | σ . (8) exp(−s2i /2σ 2 ) = 0 , if |si | σ Consequently, if we take, for example, σ > 4 maxi |si | for all 1 ≤ i ≤ n, then exp(−s2i /2σ 2 ) > 0.96 ≈ 1, and comparison with (8) shows that this value of σ acts virtually like infinity for all values of si , 1 ≤ i ≤ n. For the next σk ’s (k ≥ 2), we have used σk = ασk−1 , where α is usually between 0.5 and 1. Remark 6. The final value of σ depends on the noise level. For the noiseless case, it can be decreased arbitrarily to zero (its minimum values is determined by the desired accuracy, and/or machine precision). For the noisy case, it should be terminated about one to two times of energy of the noise. This is because, while σ is in this range, (8) shows that the cost function treats small (noisy) samples as zero (i.e., for which fσ (si ) ≈ 1, 1 ≤ i ≤ n). However, below this range, the algorithm tries to ‘learn’ these noisy values, and moves away from the true answer. Restricting σ to be above energy of the noise, provides the robustness of this approach to noise, which was one of the difficulties of using the exact 0 norm. In the simulations of this paper, this noise level was assumed to be known3 .
4
Experimental Results
In this section, we justify performance of the presented approach and compare it with BP. Sparse sources are artificially created using Mixture of Gaussian (MoG) model4 : si ∼ p · N (0, σon ) + (1 − p) · N (0, σoff ), (9) 3
4
Note that its exact value is not necessary, and in practice a rough estimation is sufficient. The model we have used is also called the Bernoulli-Gaussian model.
394
G.H. Mohimani, M. Babaie-Zadeh, and C. Jutten Table 1. Progress of SL0, Compared to 1 -magic itr. # σ 1 1 0.5 2 0.2 3 0.1 4 0.05 5 0.02 6 0.01 7 algorithm total time SL0 0.227 seconds 1 -magic 20.8 seconds
MSE SNR (dB) 3.75 e −2 2.88 2.19 e −2 5.21 4.28 e −3 12.29 1.67 e −3 16.37 6.18 e −4 20.71 1.91 e −4 25.80 1.87 e −4 25.89 MSE SNR (dB) 2.34 e −4 25.67 4.64 e −4 21.95
where p denotes probability of activity of the sources. σon and σoff are the standard deviations of the sources in active and inactive mode, respectively. In order to have sparse sources, the parameters are required to satisfy the conditions σoff σon and p 1. σoff is to model the noise in sources, and the larger values of σoff produces stronger noise. In the simulation σon is fixed to 1. Each column of the mixing matrix is randomly generated using the normal distribution which is then normalized to unity. The mixtures are generated using the noisy model x = As + n, where n is an additive white Gaussian noise with variance σn Im (Im is the m × m identity matrix). Note that both σoff and σn can be used for modeling the noise and they are both set to 0.01 in the simulation5 . The values used for the experiment are n = 1000, m = 400, p = 0.1, σoff = 0.01, σon = 1, σn = 0.01 and the sequence of σ is fixed to [1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01]. μ is set equal to 2.5. For each value of σ the gradient-projection loop (the internal loop) is performed three times (i.e., L = 3). We use the CPU time as a measure of complexity. Although, the CPU time is not an exact measure, it can give us a rough estimation of complexity, and lets us roughly compare SL0 (Smoothed 0 norm) with BP6 . Table 1 shows the gradual improvement in the output SNR after each iteration, for a typical run of SL0. Moreover, for this run, total time and final SNR have been shown for SL0 and for BP (using 1 -magic). Figure 2 shows the actual source and it’s estimations in different iterations for this run of SL0. The experiment is then repeated 100 times (with the same parameters, but for different randomly generated sources and mixing matrices) and the values of SNR (in dB) obtained over these simulations are averaged. For SL0, this averaged SNR was 25.67dB with standard derivation of 1.34dB. For 1 magic, 5
6
Note that, although in the theoretical model only the noiseless case was addressed, because of continuity of the cost functions, the method can work as well in noisy conditions. Our simulations are performed in MATLAB7 under WinXP using an AMD Athlon sempron 2400+, 1.67GHz processor with 512MB of memory.
Fast Sparse Representation Based on Smoothed 0 Norm
395
2 0 −2
0
100
200
300
400
500
600
700
800
900
1000
0
100
200
300
400
500
600
700
800
900
1000
0
100
200
300
400
500
600
700
800
900
1000
0
100
200
300
400
500
600
700
800
900
1000
2 0 −2 2 0 −2 2 0 −2
Fig. 2. Evolution of SL0 toward the solution: From top to bottom, first plot corresponds to the actual source, second plot is its estimation at the first level (σ = 1), third plot is its estimation at the second level (σ = 0.5), while the last plot is its estimation at third level (σ = 0.2)
these values were 21.92dB and 1.36dB, respectively. The minimum value of SNR was 20.12dB compared with minimum of 18.51dB for BP.
5
Conclusions
In this article, a fast method for finding sparse solutions of an under-determined system of linear equations was proposed (to be applied in atomic decomposition and SCA). SL0 was based on maximizing a ‘smooth’ measure of sparsity. SL0 shows to be about two orders of magnitude faster than the 1 -magic, while providing the same (or better) accuracy. The authors conclude that sparse decomposition problem is not computationally as hard as suggested by the LP approach.
References 1. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal l1 -norm solution is also the sparsest solution, Tech. Rep. (2004) 2. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353–2362 (2001) 3. Gribonval, R., Lesage, S.: A survey of sparse component analysis for blind source separation: principles, perspectives, and new challenges. In: Proceedings of ESANN’06, April 2006, pp. 323–330 (2006) 4. Donoho, D.L., Elad, M., Temlyakov, V.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info. Theory 52(1), 6–18 (2006) 5. Movahedi, F., Mohimani, G.H., Babaie-Zadeh, M., Jutten, C.: Estimating the mixing matrix in sparse component analysis (SCA) based on partial k-dimensional subspace clustering, Neurocomputing (sumitted)
396
G.H. Mohimani, M. Babaie-Zadeh, and C. Jutten
6. Washizawa, Y., Cichocki, A.: on-line k-plane clustering learning algorithm for sparse comopnent analysis. In: Proceedinds of ICASSP’06, Toulouse (France), pp. 681–684 (2006) 7. Li, Y.Q., Cichocki, A., Amari, S.: Analysis of sparse representation and blind source separation. Neural Computation 16(6), 1193–1234 (2004) 8. Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13(4), 863–882 (2001) 9. Georgiev, P.G., Theis, F.J., Cichocki, A.: Blind source separation and sparse component analysis for over-complete mixtures. In: Proceedings of ICASSP’04, Montreal (Canada), May 2004, pp. 493–496 (2004) 10. Li, Y., Cichocki, A., Amari, S.: Sparse component analysis for blind source separation with less sensors than sources. In: ICA2003, pp. 89–94 (2003) 11. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1), 33–61 (1999) 12. Donoho, D.L., Huo, X.: Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47(7), 2845–2862 (2001) 13. Elad, M., Bruckstein, A.: A generalized uncertainty principle and sparse representations in pairs of bases. IEEE Trans. Inform. Theory 48(9), 2558–2567 (2002) 14. Candes, E., Romberg, J.: 1 -Magic: Recovery of Sparse Signals via Convex Programming, URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf 15. Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Proc. 41(12), 3397–3415 (1993) 16. Krstulovic, S., Gribonval, R.: MPTK: Matching pursuit made tractable. In: ICASSP’06 (2006)
Appendix: Proof of Theorem 1 Let g(s) As − x, and consider the method of Lagrange multipliers for maximizing Fσ (s) subject to the constraint g(s) = 0. Setting ∇Fσ (s) = λT ∇(g(s)), where λ = [λ1 , . . . , λm ]T is the vector collecting the m Lagrange multipliers, along with the constraints As = x, results in the nonlinear system of m + n equations and m + n unknowns: As = x (10) T A = [s1 exp (−s2 /2σ 2 ) . . . sn exp (−s2 /2σ 2 )] λ 1
n
is an m × 1 unknown vector (proportional to λ). In general, it is not where λ easy to solve this system of nonlinear equations and for small values of σ, the solution is not unique (because of existence of local maxima). However, when σ increases to infinity, the system becomes linear and easy to solve: As = x T T T −1 (11) = s ⇒ AA λ = x ⇒ s = A (AA ) x AT λ which is the minimum 2 -norm or the pseudo-inverse solution of As = x.
Estimating the Mixing Matrix in Sparse Component Analysis Based on Converting a Multiple Dominant to a Single Dominant Problem Nima Noorshams1, Massoud Babaie-Zadeh1, , and Christian Jutten2 Electrical Engineering Department, Advanced Communications Research Institute (ACRI), Sharif University of Technology, Tehran, Iran 2 GIPSA-lab, Department of Images and Signals, National Polytechnic Institute of Grenoble (INPG), France nima [email protected], [email protected], [email protected]
1
Abstract. We propose a new method for estimating the mixing matrix, A, in the linear model x(t) = As(t), t = 1, . . . , T , for the problem of underdetermined Sparse Component Analysis (SCA). Contrary to most previous algorithms, there can be more than one dominant source at each instant (we call it a “multiple dominant” problem). The main idea is to convert the multiple dominant problem to a series of single dominant problems, which may be solved by well-known methods. Each of these single dominant problems results in the determination of some columns of A. This results in a huge decrease in computations, which lets us to solve higher dimension problems that were not possible before.
1
Introduction
Sparse Component Analysis (SCA) [1,2,3,4] is a semi-Blind Source Separation problem [5], in which it is a priori known that the source signals are ‘sparse’. A sparse signal is a signal whose most samples are nearly zero, and just a few percents take significant values. It has been already shown that such a prior information permits source separation for the case the number of sources exceeds the number of sensors [6,1,2,3,4]. The problem of SCA can be stated as follows. Consider the linear model: x(t) =
n
si (t)ai = As(t)
t = 1, 2, . . . , T
(1)
i=1
where A = [a1 . . . an ] ∈ Rm×n is the mixing matrix, s(t) and x(t) are the vectors of all samples of n sources and m observed signals (mixtures) at instant t, T is the number of ‘time’ samples. The goal of SCA is then to estimate A and s(t), only from x(t), 1 ≤ t ≤ T and the sparsity assumption. In this paper, we address only the problem of estimation of A (note that where there are more sources than sensors, it is not equivalent to the estimation of the sources). We call each
This work has been partially funded by Sharif University of Technology, by Center for International Research and Collaboration (ISMO) and by Iran NSF (INSF).
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 397–405, 2007. c Springer-Verlag Berlin Heidelberg 2007
398
N. Noorshams, M. Babaie-Zadeh, and C. Jutten
column of the mixing matrix, i.e. each ai , 1 ≤ i ≤ n, a mixing vector. Although the word “time” will be used throughout this paper, the above model may be in another domain, in which the sparsity assumption holds. To see this, let T be a linear ‘sparsifying’ transform, and the mixing system is stated as x = As in the time domain. Then, we have T {x} = A T {x} in the transformed domain, and because of the sparsity of T {s}, it is in the form of (1). Let k denote the average number of active sources at each instant. In fact, if the probability of inactivity of a source is denoted by p (sparsity implies that p ≈ 1), then1 k = n(1 − p). Then, two different cases should be distinguished for estimating the mixing matrix: single dominant component and multiple dominant components. In the former, k is equal to one, and the scatter plot of x(t) (t = 1, . . . , T ) geometrically shows the data concentration directions. This can be seen from the fact that at each instant, x(t) = As(t) = s1 (t)a1 +· · ·+sn (t)an , t = 1, . . . , T ; and for most instants, only one of si ’s is dominant and the others are almost zero. Consequently, in most samples, x(t) is in the direction of one of the mixing vectors. In the latter, k is greater than one and the mixing matrix would not be estimated easily from the scatter plot. Up to now, many papers have been addressed the former case [1,3,4], while only few researchers have considered the latter case [4,7,8]. In this paper, we focus on the case of multiple dominant components. In the multiple dominant components SCA, the observed data concentrate around k-dimensional subspaces which are spanned by a set of k mixing vectors. We call these subspaces concentration subspaces throughout this paper. In a multiple dominant problem finding a k-dimensional concentration subspace is not equivalent to find some of the mixing vectors. All of the existing methods [7,9] need to find most of the concentration subspaces and then estimate the mixing matrix from them (this is not the case for our algorithm). The main idea of this paper is to show that the multiple dominant problem can be converted to a series of single dominant problems, which may be solved by simple algorithms of the single dominant problem to estimate the mixing matrix. Moreover, by estimating each concentration subspace, some of the mixing vectors are found (contrary to [7,9] in which all or many concentration subspaces were needed to be estimated before starting the estimation of mixing vectors). This results in a low computational cost in comparison to the methods of [7,9] and therefore, problems with higher dimensions can be solved by this algorithm. Up to our best knowledge there is no practical algorithm for solving this problem when k ≥ 3 but our method can handle dimensions more than this. Throughout the paper, we suppose that the sources are sparse enough so k < m/2 (where m is the number of mixtures), the sources are independent and the probability of activity are the same for all of them. Finally, we assume also that each subset of m columns of A is linearly independent. 1
More precisely, in this paper, by the average number of active sources we mean an integer. If n(1 − p) is slightly greater than an integer k = n(1 − p) (for example is n(1 − p) = 1.05, then the k-means algorithm, which has been designed for k = 1, still works). In other cases, k = n(1 − p).
Estimating the Mixing Matrix in Sparse Component Analysis
2
399
The Main Idea
The main idea of converting a multiple dominant problem to a single dominant one comes from the following theorem (the proof is left to the appendix). Theorem 1. If k ≤ m 2 and the sources are statistically independent then the average number of active sources in a k dimensional concentration subspace (denoted by k) is k = k(1 − p). The above theorem states that although the average number of active sources k = n(1 − p) may be greater than 1, the average number of active sources within a concentration subspace B (that is, k˜ = k(1 − p) = n(1 − p)2 ) is one level sparser . In other words, a multiple dominant problem in the original space may be transformed into a single dominant problem within the subspace B. Consequently in the subset of data points which lie in B, we can use a single dominant algorithm (like that of [2]) for estimating the mixing vectors which are a subset of the mixing vectors of the main problem. If n(1 − p)2 does not less than or approximately equal to one, then the single dominant assumption does not hold and the above technique should be used one or several levels. In summary, our approach for estimating the mixing matrix consists of the following steps: – Step 1: Find a new concentration subspace. A concentration subspace can be found by maximizing a cost function (see Sec. 3). For finding a ‘new’ concentration subspace, the steepest ascent is initialized by a randomly different starting point (note that there are a lot of concentration subspaces). – Step 2: Determine all data points which lie in this concentration subspace, and run a single dominant algorithm to find the mixing vectors in that subspace, which are a subset of the mixing vectors of the main problem. The points whose distances to the desired subspace are less than a specific value are supposed to belong to this subspace. – Step 3: If all of the mixing vectors have been found, the search has been finished. Otherwise, go to step one, and continue. In this paper the number of sources is supposed to be known in advance. Remark: Assuming that the probability of inactivity (p) is identical for all sources, pn is the probability of no source being active, and hence p can be 1/n estimated as pˆ = ( N , where T is the total number of data points, and N is T ) the number of ‘active’ data points (i.e., x’s whose distances from the origin is greater than a threshold). However, in this paper, p is assumed already known.
3
Finding Concentration Subspaces
Each k-dimensional subspace can be represented by an m by k matrix, whose columns form an orthonormal basis for the subspace. In this paper, we do not distinguish between a subspace and its matrix representation. Let B ∈ Rm×k be the orthonormal matrix representation of an arbitrary k-dimensional subspace.
400
N. Noorshams, M. Babaie-Zadeh, and C. Jutten
The following cost function has been presented in [9] to detect whether B is a concentration subspace or not: fσ (B) =
T i=1
exp
−d2 (xi , B) 2σ 2
,
(2)
where d(xi , B) is the distance of xi from the subspace represented by B [9]. For small values of d(xi , B) compared to σ, exp(−d2 (xi , B)/2σ 2 ) is about 1 and for large values of d(xi , B), it is nearly zero. Therefore, for sufficiently small values of σ, the above function is approximately equal to the number of data points close to B. Therefore, by maximizing the function f , we actually maximize the number of data points close to B thus we find a concentration subspace. Moreover, if the set of points are concentrated around several different k-dimensional concentration subspaces, f has a local maximum where B is close to the basis of each of them. The idea of [9] for finding a concentration subspace is to maximize the function fσ for a sufficiently small σ, using steepest ascent method. For very small σ, many local maxima exist which do not correspond to any concentration subspaces. These local maxima correspond to spaces which contains r < k mixing vectors instead of k. On the other hand if σ is large, then the peaks are mixed together. In contrast to [9] which uses an iterative method by considering a sequence of decreasing σ to prevent getting trapped in local maxima, in this paper we use only a medium value for σ. In each step, we find a subset of k mixing vectors which are related to the estimated concentration subspace. As will be discussed in Sec. 7, if an incorrect concentration subspace with r (r < k) mixing vectors is estimated, the algorithm detects r mixing vector rather than k and therefore it is robust to these errors.
4
Estimating Mixing Vectors and the Mixing Matrix
Consider a concentration subspace B and suppose that the points xi for i ∈ I ⊂ {1...T } belong to this subspace. The fact that k < 1 ensure us that most of these points concentrate along k, 1-dimensional subspaces. Then, we use the same idea of [2] designed for finding the mixing vector in the case k = 1: Firstly, data samples are normalized by dividing them by their norms (x¯i = xi /xi ), that is, the points are projected onto the unit sphere. Moreover, the sign of the first component is forced to be positive. Then, we have a point distribution on a unit hemisphere. Note that most of these points are concentrated around k points, and hence the mixing vectors (which corresponded to the centroid of these clusters) may be found by a clustering algorithm. However, there are numerous outliers which do not belong to any clusters. Outlier points make the clustering algorithms inaccurate and increase the probability of error in detecting cluster centers, therefore they have to be removed as more as possible. We say that two points are neighbor if the distance between them is less than a specific value r which is dependent to the energy of
Estimating the Mixing Matrix in Sparse Component Analysis
401
the sources. For outlier detection the fact that outliers are alone in the space is used. In other words, they do not have any neighbor, but this is not true for cluster centers because the density around them is high. By this definition, a point is considered as an outlier if it does not have any neighbor. The method we used in this paper for the clustering is subtractive clustering [10]. In this method each point is considered as a cluster center and its potential for being a cluster center is computed. The point with highest potential is considered as a center and that cluster is removed. This process continues to find all clusters.
5
The Final Algorithm
Putting all together, the final algorithm is summarized as follows. 1. Remove the data samples (x(t)) which are near the origin. In these samples, all of the sources are probably inactive. 2. Estimate k to set the dimension of the concentration subspaces and also p to check that if k˜ is smaller than 1. 3. Assume an appropriate value for the free parameter of the cost function (σ). 4. Maximize fσ (B) with the steepest ascent algorithm in several steps: (a) Choose a random starting subspace (an orthonormal m by k matrix B1 ). (b) Set Bj+1 = Bj + μ∂fσ /∂Bj .2 (c) Orthonormalize Bj+1 . (d) If Bj+1 − Bj < 10−3 go to (5) else j = j + 1 and go to (b). 5. Consider the points whose distances to B are less than a specific value (d) and ignore the other points. 6. Normalize the points and force the sign of the first component to be positive. 7. Remove the points with no adjacent (outlier points) by preprocessing. 8. Detect the cluster centers with subtractive clustering algorithm (these vectors are some of the mixing vectors). 9. Compare obtained vectors (in the previous step) with former mixing vectors. If each of these vectors is new3 , then add it up to the list of estimated vectors, else throw it away. 10. If the number of estimated mixing vectors is n, then stop the algorithm, else go back to (4).
6
Experimental Results
In this section, 2 simulations are presented to justify the algorithm. In all of these simulations, sparse sources are generated independently and identically distributed (i.i.d) by the Bernoulli-Gaussian model. In other words, the sources 2 3
In all simulations we consider μ = .01. Two vectors are considered identical if the angle between them is less than a certain amount (5 degree in our simulations).
402
N. Noorshams, M. Babaie-Zadeh, and C. Jutten 0.018
Error in estimated mixing vectors
0.016 0.014 0.012 0.01 0.008 0.006 0.004 0.002 0
0
10
20 30 Simulation number
40
50
Fig. 1. Error of the overall algorithm for all simulations in the case n = 10, m = 6, k = 2 and T = 10000 for 50 different simulations
are inactive with probability p and are active with probability 1 − p. In the inactive case, their value is a zero mean Gaussian with standard deviation σoff , and in active case it is a zero mean Gaussian with standard deviation σon . Consequently si ∼ (1 − p) N (0, σon ) + p N (0, σoff ). In order to have sparse sources, the conditions σon σoff and p ≈ 1 should be applied (σoff is to model the noise). In all simulations, the values σon = 1 and σoff = 0.005 have been used and each component of the mixing matrix is generated randomly in the [0, 1] interval after that each column of it, is normalized. All simulations were performed in MATLAB 7 under WindowsXP, using an Intel Pentium IV 2.4 GHz processor with 1 Gigabyte RAM. Experiment 1: Performance In this experiment, the performance of our algorithm is demonstrated. 50 simulations for 50 different mixing matrixes are performed for the case n = 10, m = 6, k = 2 (p = 0.8) and T = 10000. The parameters are chosen as σ = 1/40, d = .01 and r = .02. In all cases, the obtained vectors are compared with the mixing vectors. For ˆ 2 is used, where P is the set of comparison the criterion E = minP∈P A − AP all permutation matrices (this is the same criterion used in [7]). This estimation error is shown in Fig. 1 for all simulations. The average number of iterations for successfully finding all mixing vectors is around 30, but in 3 simulations this number exceeded 100 iterations and in 1 case more than 500 iterations was required. This may increase the run time of the algorithm. By considering this inefficiency the processes took less than 90 sec in average for estimating a mixing matrix. Moreover the maximum error in the mixing matrix estimation is .018, therefore, the error is negligible. Experiment 2: Middle and large scale problems To show that the method is capable of solving medium scale problems, two simulations are performed. In the first simulation, the parameters were n = 25,
Estimating the Mixing Matrix in Sparse Component Analysis
403
m = 15, k = 5 and T = 100000, whereas in the second experiment, they were n = 35, m = 20, k = 4 and T = 50000. The process took about 1 hour for the first case and 3 hours for the second case. As far as we know there is no algorithm to estimate the mixing vectors in these dimensions (k = 4 or 5). In these scales the sources are not so sparse but our algorithm can handle this situation. To measure the accuracy of the estimation, the angle between each estimated vector and its corresponding actual mixing vector (i.e. inverse cosine of their dot product) were calculated. These n angles were all less than 0.01 radian, showing that all of the mixing matrix have been correctly estimated.
7
Conclusion and Discussion
In this paper, we introduced a method for estimating the mixing matrix in the multiple dominant SCA problem which can handle larger k in comparison to other methods ([7,9]). At our best knowledge, all existing SCA methods are unable to estimate mixing matrix in large and even medium scales, for the multiple dominant case. However, our method solves the problem at least in the medium scale cases and maybe it can handel larger scales in comparison to other existing methods till now (our algorithm is capable of solving this problem when the averaged number of active sources is up to 5). As observed in the experimental results, all mixing vectors may be detected with good accuracy. However, some mixing vectors might not be found in few iterations, either because of lack of sufficient data, or because some of the actual mixing vectors are close to each other. As was mentioned in the section 3 a medium value for σ must be considered and for very small σ the chance of error in finding a concentration increases. The subtractive clustering method does not need any prior information about the number of cluster centers, therefore, if the estimated subspace contains r (r < k) mixing vectors rather than k, the projected data on the positive normal hemisphere concentrate around r clusters and the clustering method detects r centers instead of k, thus our algorithm is somehow robust to these errors and consequently to σ. Unfortunately, our algorithm is not efficient to some extent, because some of the mixing vectors are detected several times in order to find all vectors. This may lead to a greater number of iterations and consequently a longer run time. Finding an efficient method for estimating the mixing matrix is a future work.
References 1. Gribonval, R., Lesage, S.: A survey of sparse component analysis for blind source separation: principles, perspectives, and new challenges. In: Proceedings of ESANN’06, April 2006, pp. 323–330 (2006) 2. Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13(4), 863–882 (2001)
404
N. Noorshams, M. Babaie-Zadeh, and C. Jutten
3. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353–2362 (2001) 4. Georgiev, P.G., Theis, F.J., Cichocki, A.: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Transactions of Neural Networks 16(4), 992–996 (2005) 5. Babaie-Zadeh, M., Jutten, C.: Semi-blind approaches for source separation and independent component analysis. In: Proceedings of ESANN’06 (April 2006) 6. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal l1 -norm solution is also the sparsest solution, Tech. Rep (2004) 7. Washizawa, Y., Cichocki, A.: on-line k-plane clustering learning algorithm for sparse comopnent analysis. In: Proceedinds of ICASSP’06, Toulouse (France), pp. 681–684 (2006) 8. Li, Y., Amari, S., Cichocki, A., Ho, D.W.C., Xie, S.: Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing 54(2), 423–437 (2006) 9. Movahedi, F., Mohimani, G.H., Babaie-Zadeh, M., Jutten, C.: Estimating the mixing matrix in sparse component analysis (SCA) based on multidimensional subspace clustering. In: ICT’07, Malaysia (May 2007) 10. Chiu, S.: Fuzzy model identification based on cluster estimation. Journal of Intelligent and Fuzzy Systems, 2(3) (September 1994)
Appendix: Proof of Theorem 1 Consider a concentration subspace B. Then, by definition, it is formed by a linear combination of k mixing vectors. Let al1 · · · alk be these mixing vectors. Then k for every point x in this subspace, we have x = i=1 sli ali where sli 1 ≤ i ≤ k are the sources. Lemma 1. If k ≤ m 2 and the mixing matrix is full rank then, each point in a concentration subspace (B), for the sparsest solution, is almost always n the linear combination of a set of k fixed mixing vectors. Precisely if x = i=1 s˜i ai then si = 0 for i ∈ {1, 2, ..., n} − {l1 , l2 , ..., lk }. To prove this lemma suppose that there is another set of mixing h vectors {at1 · · · ath } and real valued variables s´t1 , · · · , s´th such that x = ti ati . i=1 s´ k h Then x = i=1 sli ali = i=1 s´ti ati shows that the set {al1 , · · · , alk , at1 , · · · , ath } is not linearly independent. A is assumed to be full rank thus each k´ ≤ m different mixing vectors are linearly independent. From this comment we can conclude that k + h > m, moreover, k ≤ m/2 (see section 1) thus h > m/2 and we have h > k. This is in contrast to our basic assumption that we want to find the sparsest solution to BSS problem. This lemma is in fact similar to the theorem of uniqueness of the sparsest solution [6]. Using the above lemma, the expected value of active sources in B is k˜ =
k i=0
iP {i sources from l1 , · · · , lk active| remaining n − k sources inactive}
Estimating the Mixing Matrix in Sparse Component Analysis
405
where P {·} denotes the probability and k˜ is the expected value of the number of active sources in a concentration subspace. Since the sources (and hence their activity status) are assumed to be independent, the above equation is reduced to: k=
k i=0
iP {i sources of l1 , · · · , lk active} =
k n i (1 − p)i pk−i k i=0
This is the expected value of a binomial random variable and hence k = k(1 − p).
Dictionary Learning for L1-Exact Sparse Coding Mark D. Plumbley Department of Electronic Engineering, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom [email protected]
Abstract. We have derived a new algorithm for dictionary learning for sparse coding in the 1 exact sparse framework. The algorithm does not rely on an approximation residual to operate, but rather uses the special geometry of the 1 exact sparse solution to give a computationally simple yet conceptually interesting algorithm. A self-normalizing version of the algorithm is also derived, which uses negative feedback to ensure that basis vectors converge to unit norm. The operation of the algorithm is illustrated on a simple numerical example.
1
Introduction
Suppose we have a sequence of observations X = [x1 , . . . , xp ], xk ∈ IRn . In the sparse coding problem [1] we wish to find a dictionary matrix A and representation matrix S such that X = AS (1) and where the representations sk ∈ IRm in the matrix S = [s1 , . . . , sp ] are sparse, i.e. where there are few non-zero entries in each sj . In the case where we look for solutions X = AS with no error, we say that this is an exact sparse solution. In the sparse coding problem we typically have m > n, so this is closely related to the overcomplete independent component analysis (overcomplete ICA) problem, which has the additional assumption that the components of the representation vectors sj are statistically independent. If the dictionary A is given, then for each sample x = xk we can separately look the sparsest representation min s0 s
such that
x = As.
(2)
However, even this is a hard problem, so one approach is to solve instead the ‘relaxed’ 1 -norm problem min s1 s
such that
x = As.
(3)
This approach, known in the signal processing literature as Basis Pursuit [2], can be solved efficiently with linear programming methods (see also [3]). Even if such efficient sparse representation methods exist, learning the dictionary A is a non-trivial task. Several methods have been proposed in the M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 406–413, 2007. c Springer-Verlag Berlin Heidelberg 2007
Dictionary Learning for L1-Exact Sparse Coding
407
literature, such as those by Olshausen and Field [4] and Lewicki and Sejnowski [5], and these can be derived within a principled probabilistic framework [1]. A recent alternative is the K-SVD algorithm [6], which is a generalization of the K-means algorithm. However, many of these algorithms are designed to solve the sparse approximation problem X = AS + R for some nonzero residual term R, rather than the exact sparse problem (1). For example, the Olshausen and Field [4] approximate maximum likelihood algorithm is ΔA = ηE(rsT )
(4)
where r = x − As is the residual after approximation, and the K-SVD algorithm [1] minimizes the norm RF = X − ASF . If we have a sparse representation algorithm that is successfully able to solve (3) exactly on each data sample xk , then we have a zero residual R = 0, and there is nothing to ‘drive’ the dictionary learning algorithm. Some other dictionary learning algorithms have other constraints: for example, the method of Georgiev et al [7] requires at most m − 1 nonzero elements in each column of S. While these algorithms have been successful for practical problems, in this paper we specifically explore the special geometry of the 1 exact sparse dictionary learning problem. We shall derive a new dictionary learning algorithm for † the 1 exact sparse problem, using the basis vertex c = (A )T 1 associated with a subdictionary (basis set) A identified in the 1 exact sparse representation problem (3).
2
The Dual Problem and Basis Vertex
The linear program (3) has a corresponding dual linear program [2] max xT c c
such that
± aTj c ≤ 1
j = 1, . . . , m
(5)
which has an optimum c∗ associated with any optimum s∗ of (3). In a previous paper we explored the polytope geometry of this type of dual problem, and derived an algorithm, Polytope Faces Pursuit (PFP), which searches for the optimal vertex which maximizes xT c, and uses that to find the optimal vector s [8]. Polytope Faces Pursuit is a gradient projection method [9] which iteratively builds a solution basis A consisting of a subset of the signed columns σj aj of A, σj ∈ {−1, 0, +1}, chosen such that x = A¯s with ¯s > 0 containing the absolute value of the nonzero coefficients of s at the solution. The algorithm is similar in structure to orthogonal matching pursuit (OMP), but with a modified admission criterion aTi r (6) a = arg max ai 1 − aT c i to add a new basis vector a to the current basis set, together with an additional rule to switch out basis vectors which are no longer feasible.
408
M.D. Plumbley †
The basis vertex c = (A )T 1 is the solution to the dual problem (5). During the operation of the algorithm c satisfies AT c ≤ 1, so it remains dual-feasible throughout. For all active atoms aj in the current basis set, we have aTj c = 1. Therefore at the minimum 1 norm solution the following conditions hold: T
A c=1 x = A¯s
(7) ¯s > 0.
(8)
We will use these conditions in our derivation of the dictionary learning algorithm that follows.
3
Dictionary Learning Algorithm
We would like to construct an algorithm to find the matrix A that minimizes the total 1 norm p sk 1 (9) J= k=1 k
k
k
where s is chosen such that x = As for all k = 1, . . . , p, i.e. such that X = AS, and where the columns of A are constrained to have unit norm. In particular, we would like to construct an iterative algorithm to adjust A to reduce the total 1 norm (9): let us therefore investigate how J depends on A. For the contribution due to the kth sample we have J = k J k where J k = k
sk 1 = 1T ¯sk since ¯sk ≥ 0. Dropping the superscripts k from xk , A and ¯sk we therefore wish to find how J k = 1T ¯s changes with A, so taking derivatives of J k we get (10) dJ k /dt = 1T (d¯s/dt). Now taking the derivative of (8) for fixed x we get 0 = (dA/dt)¯s + A(d¯s/dt)
(11)
and pre-multiplying by cT gives us cT (dA/dt)¯s = −cT A(d¯s/dt)
(12)
= −1T (d¯s/dt)
(13)
= −dJ /dt
(14)
k
where the last two equations follow from (7) and (10). Introducing trace(·) for the trace of a matrix, we can rearrange this to get dJ k T dA T T dA T dA ¯s = − trace (c¯s ) = − trace c = − c¯s , (15) dt dt dt dt from which we see that the gradient of J k with respect to A is given by ∇A J k = −c¯sT . Summing up over all k and applying to the original matrix A we get ck (sk )T = −CST (16) ∇A J = − k
Dictionary Learning for L1-Exact Sparse Coding
409
with C = [ck ], a somewhat surprisingly simple result. Therefore the update ΔA = η
ck (sk )T
(17)
k
will perform a steepest descent search for the minimum total 1 norm J, and any path dA/dt for which (dA/dt), ∇A J < 0 will cause J to decrease. 3.1
Unit Norm Atom Constraint
Now without any constraint, algorithm (17) will tend to reduce the 1 norm by causing A to increase without bound, so we need to impose a constraint on A. A common constraint is to require the columns aj of A to be unit vectors, 2 aj 2 = 1, i.e. aTj aj = 1. We therefore require our update to be restricted to paths daj /dt for which aTj (daj /dt) = 0. To find the projection of (16) in this direction, consider the gradient component dJ =− ck skj . (18) gj = daj k
The orthogonal projection of gj onto the required tangent space is given by ˜j = Paj gj = g
I−
aj aTj
gj = gj −
2
aj 2
1
T 2 aj (aj gj ).
aj 2
(19)
Now considering the rightmost factor aTj gj , from (18) we get aTj gj = −
aTj ck skj .
(20)
k k
Considering just the kth term aTj ck skj , if aj is one of the basis vectors in A (with possible change of sign σjk = sign(skj )) which forms part of the solution k
xk = A ¯sk found by a minimum 1 norm solution, then we must have σjk aTj ck = kT
1 where σjk = sign(skj ), because A
c = 1, so aTj ck skj = σjk skj = |skj |. On the k
other hand, if aj does not form part of A , then skj = 0 so aTj ck skj = 0 = |skj |. k
Thus regardless of the involvement of aj in A , we have aTj ck skj = |skj |, so aTj gj = −
|skj |
(21)
k
and therefore ˜j = − g
k
ck skj
−
1
k 2 aj |sj | aj 2
.
(22)
410
M.D. Plumbley
Therefore we have the following ‘tangent’ update rule: 1 k k k aj (T + 1) = aj (T ) + η aj (T )|sj | c sj − aj (T )22 k
(23)
which will perform a tangent-constrained steepest descent update to find the minimum total 1 norm J. We should note that the tangent update is not entirely sufficient to constrain aj to remain of unit norm, so an occasional renormalization step aj ← aj /aj 2 will be required after a number of applications of (23).
4
Self-normalizing Algorithm
Based on the well-known negative feedback structure used in PCA algorithms such as the Oja [10] PCA neuron, we can modify algorithm (23) to produce the following self-normalizing algorithm that does not require the explicit renormalization step:
aj (T + 1) = aj (T ) + η (24) ck skj − aj (T )|skj | k 2
where we have simply removed the factor 1/aj (T )2 from the second term in (23). This algorithm is computationally very simple, and suggests an online
version aj (k) = aj (k − 1) + η ck skj − aj (k − 1)|skj | with the dictionary updated as each data point is presented. For unit norm basis vectors aj (T )2 = 1, the update produced by algorithm (24) is identical to that produced by the tangent algorithm (23). Therefore, for unit norm basis vectors, algorithm (24) produces a step in a direction which reduces J. (Note that algorithm (24) will not necessarily reduce J when aj is not unit norm.) To show that the norm of the basis vectors aj in algorithm (24) converge to unit length, we require that each aj must be involved in the representation of at least one pattern xk , i.e. for some k we have skj = 0. (If this were not true, that basis vector would have been ignored completely so would not be updated by the algorithm.) Consider the ordinary differential equation (ode) version of (24):
daj = ck skj − aj |skj | (25) dt k 1 = −˜ gj + − 1 a |skj | (26) j 2 aj 2 k
1 2 = −˜ gj + |skj | (27) 2 1 − aj 2 aj aj 2 k ˜j = aTj Paj gj = 0, gives us which, noting that aTj g
1 daj 2 2 aTj 1 − a aTj aj = |skj | = (1 − aj 2 ) |skj |. j 2 2 dt aj 2 k k
(28)
Dictionary Learning for L1-Exact Sparse Coding
411
Constructing the Lyapunov function [11] Q = (1/4)(1 − aj 22 )2 ≥ 0, which is zero if and only if aj has unit length, we get 2
dQ/dt = −(1 − aj 2 )aTj (daj /dt) 2 = −(1 − aj 2 )2 |skj |
(29) (30)
k
≤0
(31)
where k |skj | > 0 since at least one of the skj is nonzero, so equality holds in 2 2 (31) if and only if aj 2 = 0. Therefore the ode (25) will cause aj 2 to converge to 1 for all basis vectors aj . While algorithm (24) does not strictly require renormalization, we found experimentally that an explicit unit norm renormalization step did produce slightly more consistent behaviour in reduction of the total 1 norm J. Finally we note that at convergence of algorithm (24), the basis vectors must satisfy k k c sj (32) aj = k k k |sj | so that aj must be a (signed) weighted sum of the basis vertices ck in which it is involved. While equation (32) is suggestive of a fixed point algorithm, we have observed that it yields unstable behaviour if used directly. Nevertheless we believe that it would be interesting to explore this in future for the final stages of an algorithm, as it nears convergence.
5
Augmenting Polytope Faces Pursuit
After an update to the dictionary A, it is not necessary to restart the search for the minimum 1 norm solutions sk to xk = Ask from sk = 0. In many cases the dictionary vector will have changed only slightly, so the signs σjk and k
selected subdictionary A may be very similar to the previous solution, before the dictionary update. At the T th dictionary learning step we can therefore restart the search for sk (T ) from the basis set selected by the last solution sk (T − 1). However, if we start from the same subdictionary selection pattern after a change to the dictionary, we can no longer guarantee that the solution will be dual-feasible, i.e. that (5) is always satisfied, which is required for the Polytope Faces Pursuit algorithm [8]. While we will still have aTj c = 1 for all vectors aj in the solution subdictionary, we may have increased some other aj , not in the original solution set, so that now aTj c > 1. To overcome this problem, if aTj ck > 1 such that dual feasibility fails for a particular sample k and basis vector aj , we simply restart the sparse Polytope Faces Pursuit algorithm from sk = 0 for this particular sample, to guarantee that dual-feasibility is restored. We believe that it may be possible to construct
412
M.D. Plumbley
a more efficient method to restore dual-feasibility, based on selectively swapping vectors to bring the solution into feasibility, but it appears to be non-trivial to guarantee that loops will not result.
6
Numerical Illustration
6
4
4
2
2 2
6
0
x
x2
To illustrate the operation of the algorithm, Figure 1 shows a small graphical example. Here four source variables sj are generated with identical mixture-of-
0
−2
−2
−4
−4
−6
−6 −5
0
5
−5
0
x
5
x
1
1
(a)
(b)
6
3100
4
3000 L1 norm
x2
2 0 −2 −4
2900 2800 2700
−6 −5
0 x
1
(c)
5
2600
0
5
10 15 Iteration
20
25
(d)
Fig. 1. Illustration of dictionary learning of a n = 2 dimensional mixture of m = 4 MoG-distributed sources, for p = 3000 samples. The plots show (a) the initial condition, and updates after (b) 5 dictionary updates and (c) 25 dictionary updates. The learning curve (d) confirms that the 1 norm decreases as the algorithm progresses. On (a)-(c), the longer arrows are scaled versions of the learned dictionary vectors aj , with the shorter arrows showing the directions of the generating dictionary vectors for comparison.
gaussian (MoG) densities in an n = 2 dimensional space and added with angles θ ∈ {0, π/6, π/3, 4π/6}. It is important to note that, even in the initial condition Figure 1(a), the basis set spans the input space and optimization of (3) has an exact solution xk = Ask for all data samples xk , at least to within numerical precision of the algorithm. Therefore this situation would not be suitable for any dictionary learning algorithm which relies on a residual r = x − As.
7 Conclusions
We have derived a new algorithm for dictionary learning for sparse coding in the $\ell_1$ exact sparse framework. The algorithm does not rely on an approximation residual to operate, but rather uses the special geometry of the $\ell_1$ exact sparse solution to give a computationally simple yet conceptually interesting algorithm. A self-normalizing version of the algorithm is also derived, which uses negative feedback to ensure that basis vectors converge to unit norm. The operation of the algorithm was illustrated on a simple numerical example. While we emphasize the derivation and geometry of the algorithm in the present paper, we are currently working on applying this new algorithm to practical sparse approximation problems, and will present these results in future work.
Acknowledgements. This work was supported by EPSRC grants GR/S75802/01, GR/S82213/01, GR/S85900/01, EP/C005554/1 and EP/D000246/1.
References

1. Kreutz-Delgado, K., Murray, J.F., Rao, B.D., Engan, K., Lee, T.W., Sejnowski, T.J.: Dictionary learning algorithms for sparse representation. Neural Computation 15, 349-396 (2003)
2. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20, 33-61 (1998)
3. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353-2362 (2001)
4. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature 381, 607-609 (1996)
5. Lewicki, M.S., Sejnowski, T.J.: Learning overcomplete representations. Neural Computation 12, 337-365 (2000)
6. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: Design of dictionaries for sparse representation. In: Proceedings of SPARS'05, Rennes, France, pp. 9-12 (2005)
7. Georgiev, P., Theis, F., Cichocki, A.: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Transactions on Neural Networks 16, 992-996 (2005)
8. Plumbley, M.D.: Recovery of sparse representations by polytope faces pursuit. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 206-213. Springer, Heidelberg (2006)
9. Rosen, J.B.: The gradient projection method for nonlinear programming. Part I. Linear constraints. J. Soc. Indust. Appl. Math. 8, 181-217 (1960)
10. Oja, E.: A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology 15, 267-273 (1982)
11. Cook, P.A.: Nonlinear Dynamical Systems. Prentice Hall, Englewood Cliffs, NJ (1986)
Supervised and Semi-supervised Separation of Sounds from Single-Channel Mixtures

Paris Smaragdis¹, Bhiksha Raj¹, and Madhusudana Shashanka²

¹ Mitsubishi Electric Research Laboratories, Cambridge MA, USA
² Department of Cognitive and Neural Systems, Boston University, Boston MA, USA
Abstract. In this paper we describe a methodology for model-based single-channel separation of sounds. We present a sparse latent variable model that can learn sounds based on their distribution of time/frequency energy. This model can then be used to extract known types of sounds from mixtures in two scenarios: one where all sound types in the mixture are known, and one where only the target or the interference models are known. The model we propose has close ties to non-negative decompositions and latent variable models commonly used for semantic analysis.
1 Introduction
Separation of sounds from single-channel mixtures can be a daunting task. There is no exact solution nor a process that guarantees good separation behavior. Most approaches in this scenario are model-based and perform separation by splitting the spectrogram of the mixture into parts that correspond to a single source. This approach has been taken in [1,2,3,4] among many others and has been one of the easiest ways to obtain reasonable results. In this paper we employ a similar approach using a new decomposition algorithm which is best suited for spectrogram analysis. We show how this approach can be used in both supervised and semi-supervised settings for separation from monophonic mixtures, and draw connections to various types of known analysis methods.

1.1 Probabilistic Latent Component Analysis
In this section we describe the statistical model we will use for acoustic modeling. Probabilistic Latent Component Analysis (PLCA) is a straightforward extension of Probabilistic Latent Semantic Indexing (PLSI) [5] which deals with an arbitrary number of dimensions and can easily be extended to exhibit various features such as sparsity or shift-invariance. The basic model is defined as:
$$P(x) = \sum_z P(z) \prod_{j=1}^{N} P(x^{(j)}|z) \qquad (1)$$
Work performed while at Mitsubishi Electric Research Laboratories. M. Shashanka was partially supported by an AFOSR grant to Prof. Barbara Shinn-Cunningham.
where $P(x)$ is a distribution over the $N$-dimensional random variable $x$, and $x^{(j)}$ denotes the $j$'th dimension. The random variable $z$ is a latent variable, and the $P(x^{(j)}|z)$ are one-dimensional distributions. Effectively this model represents a mixture of marginal distribution products to approximate an $N$-dimensional distribution. Our objective is to discover the most appropriate marginal distributions. The estimation of the marginals $P(x^{(j)}|z)$ is performed using the EM algorithm. In the expectation step we estimate the posterior probability of the latent variable $z$:
$$P(z|x) = \frac{P(z) \prod_{j=1}^{N} P(x^{(j)}|z)}{\sum_{z'} P(z') \prod_{j=1}^{N} P(x^{(j)}|z')} \qquad (2)$$
and in a maximization step we re-estimate the marginals using the above weighting to obtain a new and more accurate estimate:
$$P(z) = \int P(x) P(z|x)\, dx \qquad (3)$$
$$P^*(x^{(j)}|z) = \int \cdots \int P(x) P(z|x) \prod_{k \neq j} dx^{(k)} \qquad (4)$$
$$P(x^{(j)}|z) = \frac{P^*(x^{(j)}|z)}{P(z)} \qquad (5)$$
Repeating the above steps in an alternating manner multiple times produces a converging solution for the marginals $P(x^{(j)}|z)$ along each dimension $j$, and the latent variable priors $P(z)$. In the case where $P(x)$ is discrete we only have to substitute the integrations with summations. Likewise the latent variable $z$ can be continuous valued, in which case the summations over $z$ become integrals. In practical applications $P(x)$ and $z$ will both be discrete and we assume that to be the case in the remainder of this paper.
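As a concrete illustration of the discrete 2-D case used later in this paper, one possible EM loop for eqs. (1)-(5) is sketched here in Python/NumPy; the function name, initialization and iteration count are our own choices, not the authors' implementation.

```python
import numpy as np

def plca_2d(V, K, n_iter=100, eps=1e-12):
    """Minimal sketch of discrete 2-D PLCA, assuming V is a nonnegative
    matrix (e.g. a magnitude spectrogram) of shape (F, T). Returns P(z),
    P(f|z) and P(t|z) as arrays of shape (K,), (F, K) and (K, T)."""
    P = V / V.sum()                                  # treat input as P(f, t)
    F, T = V.shape
    rng = np.random.default_rng(0)
    Pf = rng.random((F, K)); Pf /= Pf.sum(axis=0, keepdims=True)   # P(f|z)
    Pt = rng.random((K, T)); Pt /= Pt.sum(axis=1, keepdims=True)   # P(t|z)
    Pz = np.full(K, 1.0 / K)                                       # P(z)
    for _ in range(n_iter):
        # E-step (eq. 2): posterior P(z|f,t) proportional to P(z)P(f|z)P(t|z)
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step (eqs. 3-5): weighted sums replace the integrals
        W = P[:, :, None] * post
        Pz = W.sum(axis=(0, 1))
        Pf = W.sum(axis=1) / (Pz[None, :] + eps)
        Pt = (W.sum(axis=0) / (Pz[None, :] + eps)).T
    return Pz, Pf, Pt
```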
1.2 Sparsity Constraints
In this section we will introduce a modification to the PLCA algorithm which enables us to produce sparse (or maximally non-sparse) estimates of $P(x^{(j)}|z)$. Since the estimated quantities of PLCA are probability distributions, we can directly obtain sparsity by imposing an entropic prior, instead of obtaining the effect by more traditional means such as L1-norm minimization. This prior can impose a bias towards estimating a low (or high) entropy $P(x^{(j)}|z)$. We can thus obtain a sparse estimate by requesting low entropy results, a flatter estimate by requesting high entropy results, or any combination of the two cases for different values of the latent variable $z$. Let us assume that we wish to manipulate the entropy of the distribution $P(x^{(j)}|z)$. The form of the entropic prior for this distribution is defined as $e^{-\beta H(P(x^{(j)}|z))} = e^{\beta \sum_i P(x_i^{(j)}|z) \log P(x_i^{(j)}|z)}$, where $P(x_i^{(j)}|z)$ denotes the $i$'th element of the distribution $P(x^{(j)}|z)$. Incorporating the entropic prior in the PLCA
model and adding the constraint that $\sum_i P(x_i^{(j)}|z) = 1$ leads to the following optimality condition:
$$\frac{P^*(x_i^{(j)}|z)}{P(x_i^{(j)}|z)} + \beta + \beta \log P(x_i^{(j)}|z) + \lambda = 0, \qquad (6)$$
where $P^*(x^{(j)}|z)$ is defined in equation (4) and $\lambda$ is the Lagrange multiplier enforcing the unity summation constraint. As shown in [6] this equation can be solved using Lambert's W function, resulting in:
$$P(x^{(j)}|z) = \frac{P^*(x^{(j)}|z)/\beta}{\mathcal{W}\!\left(-P^*(x^{(j)}|z)\, e^{1+\lambda/\beta}/\beta\right)}. \qquad (7)$$
Alternating between the last two equations for a couple of iterations we can obtain a refined estimate of $P(x^{(j)}|z)$ which accommodates the entropy constraint. This process is described in more detail in [7].
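For intuition, eq. (7) can be evaluated directly with SciPy's Lambert W. The sign handling and the choice of the $W_{-1}$ branch below are our own reading of refs. [6,7] (this branch recovers the plain EM estimate as $\beta \to 0$), and the outer re-estimation of $\lambda$ is left to the alternation described above.

```python
import numpy as np
from scipy.special import lambertw

def entropic_update(p_star, beta, lam):
    """Hedged sketch of eq. (7). p_star: unnormalized marginal of eq. (4) as
    a 1-D array; beta > 0 requests low entropy (sparsity); lam: current
    Lagrange multiplier. Valid while the argument below stays in (-1/e, 0),
    where W_{-1} is real and negative, making the returned values positive."""
    arg = -(p_star / beta) * np.exp(1.0 + lam / beta)
    w = np.real(lambertw(arg, k=-1))      # W_{-1} branch (our assumption)
    return -(p_star / beta) / w
```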
2 Applications of PLCA for Source Separation
The two separation scenarios we will introduce in the next sections both make use of PLCA models of sounds. We will now briefly introduce how we can model a class of sounds using PLCA. One major feature that we can use to describe a sound is its frequency distribution. For example we know that speech tends to have a harmonic distribution with most energy towards the low end of the spectrum, whereas, say, a siren would have a simpler timbral profile mostly present at higher frequencies. We can use the PLCA model to obtain a dictionary of spectral profiles that best describe a class of sounds. To do so we consider the 2-d formulation of PLCA when applied to time-frequency distributions of sounds. The model will be:
$$P(f, t) = \sum_z P(z) P(f|z) P(t|z) \qquad (8)$$
where $P(f, t)$ is a magnitude spectrogram. The decomposition will result in two sets of marginals, one for the frequency axis and one for the time axis. The time axis marginals are not particularly informative; the frequency axis marginals however will contain a dictionary of spectra which best describe the sound represented by the input spectrogram. To illustrate this operation consider the spectrograms in Figure 1 and their corresponding frequency marginals. One can easily see that the extracted marginals are latching on to the specific spectral structure of each sound. These frequency marginals can be used as a model of a class of sounds such as human voice, speech of a specific speaker, a specific type of background noise, etc. In the next sections we describe how this model can be used for supervised and semi-supervised source separation.
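In terms of the plca_2d sketch given in Section 1.1, building such a dictionary of spectral profiles from a recording could look like this; the STFT parameters and the stand-in signal are illustrative only, not taken from the paper.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
audio = np.random.randn(4 * fs)                 # stand-in for a real recording
_, _, X = stft(audio, fs=fs, nperseg=1024)
V = np.abs(X)                                   # magnitude spectrogram P(f, t)
Pz, Pf, Pt = plca_2d(V, K=20)                   # columns of Pf: spectral profiles P(f|z)
```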
Fig. 1. Example of PLCA models of two different sounds. The two left plots display a spectrogram of speech and a set of speech-derived frequency marginals. Likewise the two right plots display the same information for a harp sound. Note how the derived marginals in both cases extract representative spectra for each sound.
2.1 Supervised Separation
In the case of supervised separation we assume that the mixture we are operating on contains classes of sounds for which we have already trained PLCA models (in the form of frequency marginals, as described above). If the kind of time-frequency distribution that we use is (at least approximately) linearly additive in nature, we can assume that the marginal distributions of our trained models can be used to approximate the mixture's distribution. For the experiments presented in this paper we employ the magnitude short-time Fourier transform. Although the linearity assumption does not exactly hold for this transform, it is a sufficiently good approximation in the context of sound mixtures. In order to perform the separation let us consider a mixture composed of samples from the two sound classes analyzed in Figure 1. Let us denote the already known frequency marginals from these two sounds as $P_1(f|z)$ and $P_2(f|z)$. The spectrogram of the mixture, which we denote by $P(f, t)$, is shown in Figure 2. One can easily see elements of both sounds present in it. Once we obtain the spectrogram of the mixture we need to find how to use the already known marginals from prior analysis to approximate it. Doing so is a very simple operation which involves partial use of the training procedure shown above. First we consolidate the marginals of the known sounds into one set $P(f|z) = \{P_1(f|z), P_2(f|z)\}$. Since all of the marginals in $P(f|z)$ should explain the mixture spectrogram $P(f, t)$, we only need to estimate a set of time marginals $P(t|z)$ which will facilitate the approximation. We therefore perform the training outlined in the previous sections, only this time we only estimate $P(t|z)$ and keep $P(f|z)$ fixed to the already known values. After we obtain a satisfactory estimate of $P(t|z)$ we appropriately split it into two sets which correspond to each $P_i(t|z)$. We can then reconstruct the elements of the input spectrogram that correspond to only one sound class by using only the time and frequency marginals that correspond to that sound class. The results in this particular case are shown in Figure 2. As is evident the contribution to the mixture from each of the two sources is cleanly separated into two spectrograms. Once the spectrograms of each known sound
have been recovered we can easily transform them back to the time domain by using the corresponding phase values from the original mixture.
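A compact sketch of this fixed-marginal procedure, building on plca_2d above; the function name and the final reapportioning step are our reading of the text, not the authors' code.

```python
import numpy as np

def supervised_separate(V, Pf_list, n_iter=60, eps=1e-12):
    """V: mixture magnitude spectrogram (F x T). Pf_list: one learned
    frequency-marginal array P_i(f|z) of shape (F, K_i) per known source,
    held fixed. Only P(t|z) and P(z) are estimated; V is then reapportioned
    between the sources. Hedged sketch of Section 2.1."""
    Pf = np.concatenate(Pf_list, axis=1)              # consolidated P(f|z)
    K, T = Pf.shape[1], V.shape[1]
    rng = np.random.default_rng(0)
    Pt = rng.random((K, T)); Pt /= Pt.sum(axis=1, keepdims=True)
    Pz = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)    # E-step
        W = V[:, :, None] * post                   # M-step with P(f|z) fixed
        Pz = W.sum(axis=(0, 1)) + eps
        Pt = W.sum(axis=0).T / Pz[:, None]
        Pz /= Pz.sum()
    # reapportion the mixture: each source keeps its share of the model
    model = Pz[None, None, :] * Pf[:, None, :] * Pt.T[None, :, :]
    total = model.sum(axis=2) + eps
    outputs, start = [], 0
    for P in Pf_list:
        sl = slice(start, start + P.shape[1]); start += P.shape[1]
        outputs.append(V * model[:, :, sl].sum(axis=2) / total)
    return outputs
```

Each output spectrogram is then inverted to the time domain with the mixture phase, as described in the text.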
Fig. 2. Example of supervised separation using PLCA. The leftmost plot displays the input spectrogram. We can easily see features of the speech and harp sounds. The two remaining plots show the mixture spectrogram as approximated by the speech marginals (center plot), and the harp marginals (right plot).
As one might suspect this approach does not allow the separation of spectrally similar sounds, since there will be significant similarity between the marginals of each sound class. The more dissimilar the sounds in the mixture are, the better the quality of the separation will be. In this particular example separation was almost flawless since the two sounds had very different spectral profiles. For experiments using 0 dB speech mixtures the target source improvement ranged from 3 dB to 10 dB depending on the similarity between the speakers. Using examples such as various types of ambient noise and speech we often achieved separation of more than 12 dB.

2.2 Semi-supervised Separation
In the case of semi-supervised separation we assume that we only have a PLCA model for one of the sounds in a mixture. In this case we cannot directly use the aforementioned procedure to perform separation. In this section we introduce a methodology which deals with this problem. Assume that we have a mixture of multiple sounds and we only have a PLCA model for one of them. We can perform PLCA on the mixture using the known frequency marginals for one of the sounds and in the process estimate additional marginals to explain the elements in the mixture we cannot already explain. Doing so with the training procedure we have shown in the previous sections is very easy. We train as we usually do when learning both the frequency and the time marginals, but we make sure that a portion of the frequency marginals is kept fixed as we update only the remaining ones using the same training procedure as before. The fixed marginals are the ones we already know as a model for one of the sounds. Conclusion of training will result in a set of new frequency marginals which are best suited to explain the sources in the mixture other than the one we already
know. Since there will most likely be some spectral similarity between the known sound and the rest of the sources, we also encourage sparsity on the time marginals to ensure that there is minimal co-occurrence of frequency marginals at any time. Once the marginals of the additional sources have been identified we can revert back to the supervised separation methodology to obtain the results we seek. The additional complication in this scenario is that having a model of only one of the sources results in the ability to extract either that source by itself or all the other sources as one. This means that we can use this method for applications akin to denoising, where we know either the target characteristics or the background noise characteristics. In our experiments we have used this approach to separate speech from music, where the results are often very impressive¹. A separation example is shown in figure 3, where a soprano is separated from a piano. We only had a model for the piano and learned the soprano model using the aforementioned methodology. The suppression of the piano was audibly flawless and the only artifact of this approach was a slight coloring of the extracted soprano voice (attributed mostly to the usage of the phase of the original mixture).
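In code terms, the only change relative to the supervised sketch is that part of P(f|z) is re-estimated while the known columns stay frozen; the sparsity bias on the time marginals (Section 1.2) is omitted here for brevity. A hedged sketch:

```python
import numpy as np

def semi_supervised_fit(V, Pf_known, K_new, n_iter=60, eps=1e-12):
    """V: mixture spectrogram (F x T). Pf_known: frozen marginals of the
    modeled source (F x K0). K_new: number of extra marginals to learn for
    the remaining sources. Returns the full P(f|z), P(t|z) and P(z)."""
    F, T = V.shape
    K0 = Pf_known.shape[1]
    rng = np.random.default_rng(0)
    Pf_new = rng.random((F, K_new)); Pf_new /= Pf_new.sum(axis=0, keepdims=True)
    Pf = np.concatenate([Pf_known, Pf_new], axis=1)
    K = K0 + K_new
    Pt = rng.random((K, T)); Pt /= Pt.sum(axis=1, keepdims=True)
    Pz = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        W = V[:, :, None] * post
        Pz = W.sum(axis=(0, 1)) + eps
        Pt = W.sum(axis=0).T / Pz[:, None]
        upd = W.sum(axis=1) / Pz[None, :]        # full update of P(f|z)
        Pf[:, K0:] = upd[:, K0:]                 # known columns stay frozen
        Pz /= Pz.sum()
    return Pf, Pt, Pz
```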
Fig. 3. Example of semi-supervised separation using PLCA. The left plot displays the mixture of a piano and a soprano, the right plot displays the extracted soprano voice. One can easily see that the harmonic series corresponding to the piano notes are strongly suppressed.
3 Discussion
In this section we discuss the selection of parameters and their effect on separation performance, and point to some of the relationships of the PLCA model to other known decompositions.

¹ Demonstration sound samples of this approach can be found at http://www.merl.com/people/paris/sep.html under the section "PLCA for spectral factoring".
3.1 Parameter Selection
In order to obtain reasonable results we have to make sure that the right parameters are used in the process. First we need to ensure that the time/frequency decomposition we employ is adequate to perform separation. In our experience a 1/10 sec analysis window is usually a good choice for separation. As this window becomes smaller it results in inadequate frequency resolution, and as it grows larger it results in time smearing. The hop size of the transform also needs to be small enough to ensure a clean reconstruction during the transformation from time/frequency to time (a fourth of the transform size is a good choice). Applying a Hanning window for the frequency analysis is also advised since it minimizes high frequency artifacts which are not part of the sound we model and can result in a skewed representation. The selection of the PLCA parameters is very important in order to achieve good results. In most of our simulations, sounds were modeled using around 100 marginals (i.e. z = {1, 2, 3, ..., 100}). Using a small number of marginals results in a poor representation which produces spectrally quantized results, whereas a large number of marginals results in large sets of simple marginals which can also describe elements of the interfering sounds. The tradeoff in this case is between accuracy of the model versus separability of the models. The sparsity parameter is something we only use in the case of semi-supervised learning, on the time marginals. It ensures that the new marginals that we learn will not overlap as much with the already known ones. Common usage values in our experiments were β = {0, 0.01, 0.05, 0.1}, where larger values were used in harder separation problems where more spectral overlap between sources was present. The audible effect of using sparsity is a degradation of the reconstruction quality of the source to be learned. Therefore using the sparsity parameter is best when we have a model of the target source and we wish to remove the remaining sources.

3.2 Relation to Similar Decompositions
The PLCA model which we introduced is closely related to a variety of known decompositions. The non-sparse 2-d manifestation is identical to the Probabilistic Latent Semantic Indexing (PLSI) algorithm [5], which itself is a probabilistic generalization of the Singular Value Decomposition. The functional difference is that PLSI/PLCA operate on distributions instead of raw data, which means that they can effectively only analyze non-negative inputs. If we rewrite the 2-d PLCA model in terms of matrix operations, this relationship is more evident:
$$P(f, t) = \sum_z P(z) P(f|z) P(t|z) \;\equiv\; V = W \cdot S \cdot H \qquad (9)$$
where $V$ is a matrix containing the distribution $P(f, t)$, $W$ is a matrix containing in its columns $P(f|z)$ for every $z$, $S$ is a diagonal matrix containing in its diagonal the values of $P(z)$, and $H$ is a matrix containing in its rows $P(t|z)$ for every $z$. Additionally, if we absorb the values of $S$ into the two matrices $W$ and $H$ so that $V = W \cdot S \cdot H = \bar{W} \cdot \bar{H}$, we can make a connection to the Non-negative
Matrix Factorization (NMF) decomposition [8]. NMF can employ the Kullback-Leibler divergence to measure how well the factorization $\bar{W} \cdot \bar{H}$ approximates the input $V$. The EM training which we perform also indirectly optimizes the same cost function as it improves the model's log-likelihood. In fact the two training procedures for 2-d PLCA and NMF can be shown to be numerically identical. Finally we can make a loose connection to non-negative ICA by noting that by using the entropic prior to manipulate the joint entropy of $H$ we can obtain the equivalent of an ICA mixing matrix in $W$. Although this is only a conjecture on our part, preliminary results from simulations are encouraging.
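The correspondence in eq. (9) is easy to check numerically; a tiny sketch under our earlier array conventions (the symmetric square-root absorption of S into both factors is one arbitrary choice):

```python
import numpy as np

# toy PLCA marginals with the shapes returned by plca_2d: Pz (K,), Pf (F,K), Pt (K,T)
Pz, Pf, Pt = np.ones(3) / 3, np.full((5, 3), 0.2), np.full((3, 4), 0.25)
W_bar = Pf * np.sqrt(Pz)[None, :]      # absorb S = diag(P(z)) into both factors
H_bar = np.sqrt(Pz)[:, None] * Pt
assert np.allclose(W_bar @ H_bar, Pf @ np.diag(Pz) @ Pt)   # eq. (9): V = W.S.H
```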
4 Conclusions
In this document we introduced a sparse latent variable model which can be employed for the decomposition of time/frequency distributions to perform separation of sources from monophonic recordings. We demonstrated the use of this model for both supervised and semi-supervised source separation, and discussed its relationship with other known decompositions. Our results are very encouraging, and the method is amenable to various modifications, such as the use of convolutive bases and transformation invariance, which can help to successfully apply this work to even more challenging source separation problems.
References

1. Casey, M., Westner, A.: Separation of Mixed Audio Sources by Independent Subspace Analysis. In: Proceedings ICMC (2000)
2. Roweis, S.T.: One Microphone Source Separation. In: NIPS (2000)
3. Benaroya, L., McDonagh, L., Bimbot, F., Gribonval, R.: Non negative sparse representation for Wiener based source separation with a single sensor. In: Proceedings of the ICASSP (2003)
4. Vincent, E., Rodet, X.: Music transcription with ISA and HMM. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195. Springer, Heidelberg (2004)
5. Hofmann, T.: Probabilistic Latent Semantic Indexing. In: Proceedings SIGIR'99 (1999)
6. Brand, M.E.: Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction. Neural Computation 11(5), 1155-1182 (1999)
7. Shashanka, M.V.S.: A Unified Probabilistic Approach to Modeling and Separating Single-Channel Acoustic Sources. Ph.D. Thesis, Department of Cognitive and Neural Systems, Boston University, Boston (2007)
8. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: NIPS (2001)
Image Compression by Redundancy Reduction

Carlos Magno Sousa, André Borges Cavalcante, Denner Guilhon, and Allan Kardec Barros
[email protected], [email protected], [email protected], [email protected]
http://pib.dee.ufma.br
Abstract. Image compression is achieved by reducing redundancy between neighboring pixels while preserving features such as edges and contours of the original image. Deterministic and statistical models are usually employed to reduce redundancy. Compression methods that use statistics have been heavily influenced by neuroscience research. In this work, we propose an image compression system based on the efficient coding concept derived from neural information processing models. The system performance is compared with principal component analysis (PCA) and the discrete cosine transform (DCT) at several compression ratios (CR). Evaluation through both visual inspection and objective measurements showed that the proposed system is more robust to distortions such as ringing and blocking artifacts than PCA and DCT.
1 Introduction
The goal of data compression is to reduce the space or bandwidth required to store or transmit some information [1]. Specifically, in the case of images, the data contains a high degree of correlation or redundancy between neighboring samples. This way, compression is achieved by reducing the redundancy of the data while preserving quality and features such as edges and contours of the original image. The redundancy reduction principle can be analyzed in both deterministic and statistical fashions [1,2]. In the first, redundancy is understood as data samples that can be inferred without use of statistical information of the data. In the second, redundancy reduction is performed by transforming the data into an efficient representation according to a statistical independence criterion [2]. There is a large number of deterministic methods used for image compression. For instance, the well-known JPEG scheme employs the discrete cosine transform (DCT) [4,3] to encode images. The DCT is a Fourier-related transform which converts data into frequency components. Contrary to Fourier's, these components are real coefficients defined as the inner product between the image to be encoded and the DCT basis functions. Although the DCT is easy to implement and fast to compute, it still suffers from as many drawbacks as any Fourier-related transform. The Gibbs phenomenon [5] is an example of such deficiencies, which in the case of images appears as smoothed edges. Another DCT problem is blocking artifacts [6]. These artifacts are due to the block processing of images and can be understood as luminance discontinuities between block boundaries.
On the other hand, compression methods that use statistics have been heavily influenced by neural information processing models [2]. Neuroscience studies suggested that neuron populations process stimulus information according to the concept of "efficient coding" [7]. Under this concept, neuron responses are mutually statistically independent, which means that there is no "redundant information" processed throughout the population. Statistical independence criteria may be explored by two approaches: principal component analysis (PCA) and independent component analysis (ICA). PCA utilizes second-order statistics while ICA uses higher-order statistics to obtain an efficient code. For instance, PCA is employed in several image compression systems in order to reduce data dimension [8]. The shortcoming of PCA-based systems is that second-order statistics can only provide efficient representations for Gaussian data, and images are normally non-Gaussian [7]. To circumvent this problem, new compression systems have used ICA to encode images, such as [9]. However, this method does not take into account the non-orthogonality property of ICA basis functions in the code estimation. Therefore the purpose of this work is to propose an image compression system based on the "efficient coding" concept using ICA. In this model, data compression is carried out by projecting images onto subspaces learned by ICA, where the efficient code is given by the respective projection coefficients. This model is also applied to electrocardiogram data compression in [10].
2 Methods
Let us divide an image into a vector of blocks $y = [y_1, y_2, \ldots, y_n]^T$, each of size $m \times m$. Now, assume that each block $y_i$ can be reconstructed as a linear combination of vectors from a stochastically learned subspace $\Phi = [\phi_1, \phi_2, \ldots, \phi_n]^T$, where $\phi_i$ is also called a basis function. This process might be written as
$$\hat{y}_i = w_1 \phi_1 + w_2 \phi_2 + \ldots + w_n \phi_n, \qquad (1)$$
where $\hat{y}_i$ is the reconstructed version of block $y_i$ and each component $w_i$ is the projection coefficient on the $i$th basis function. To learn the subspace $\Phi$ and estimate the projection coefficients $w_i$, we propose the image compression system shown in Figure 1, which is based on the concept of efficient coding. The following subsections provide a full explanation of the structure of our model.

Efficient Coding. Let us assume that an image block is encoded by the projection coefficients $w = [w_1, w_2, \ldots, w_n]^T$ in Eq. (1). The goal of efficient coding is to estimate a subspace $\Phi$ that reduces the mutual statistical dependence between the coefficients $w_i$. An estimation of the subspace $\Phi$ may be accomplished by either PCA or ICA. The first assumes that the components are uncorrelated while the second assumes that components are mutually statistically independent. In this work, we use the latter approach.
Fig. 1. Proposed image compression system. The system consists of two phases: learning and projection. In the learning phase, we use independent component analysis to learn a subspace. In the projection phase, we estimate the projection coefficients through a minimum mean square error (MSE) estimation.
Learning a Subspace through ICA. Let $x = [x_1, x_2, \ldots, x_n]^T$ be a set of observations taken from the same data class and written as in Eq. (2). Using $x$ as a training input, ICA learns basis functions $\phi_i$ for the data class so that the set of variables which composes the vector $a = [a_1, a_2, \ldots, a_n]^T$ is mutually statistically independent:
$$x = a^T \Phi. \qquad (2)$$
To achieve statistical independence, ICA algorithms work with higher-order statistics which point out directions where the data is maximally independent. Here, we used the FastICA algorithm [11].

Projecting an Image onto Subspaces. The projection phase consists of finding out the image representation for the subspace $\Phi$. This representation is given by the projection vector $w$, where each coefficient $w_i$ shows how the $i$th basis function is activated inside the image. In our model, the projection vector is found through minimum mean square error (MSE) estimation. Hence, for a given image block $y_i$, the projection vector is estimated such that it minimizes the MSE between the reconstructed block $\hat{y}_i$ and the original one. For this estimation method, the solution vector is given by
$$w = E[\Phi\Phi^T]^{-1} E[y_i \Phi]. \qquad (3)$$
The term $E[\Phi\Phi^T]^{-1}$ in Eq. (3) holds information about the angles between every two basis functions of $\Phi$ and is necessary to achieve the minimum error, since the ICA bases are non-orthogonal. Eq. (3) is also used to select a smaller subspace $\Psi$ from the larger subspace $\Phi$ with minimum MSE. This process consists of a block-level deflationary basis pursuit which allows us to change the compression ratio (CR) by using as many coefficients as desired to represent the image. This basis pursuit is summarized in the following procedure (a code sketch of the procedure is given below):

Step 1: Define an empty subspace $\Psi$;
Step 2: Repeat the next step for $k = 1, 2, \ldots, n$, where $n$ is the dimension of $\Phi$;
Step 3: Using Eq. (3), find the reconstructed image $\hat{y}_i^k$ by projecting $y_i$ onto the subspace composed of $[\Psi, \phi_k]$, where $\phi_k$ is the $k$th basis function of $\Phi$;
Step 4: Select the basis function according to the following criterion:
$$\phi_k = \arg\min_k MSE(\hat{y}_i^k - y_i); \qquad (4)$$
Step 5: Move $\phi_k$ from $\Phi$ to $\Psi$, so that $n = n - 1$;
Step 6: Return to Step 2 until $\Psi$ reaches the desired dimension.

It is important to notice that this basis pursuit is non-orthogonal, in contrast to standard Matching Pursuit [12]. Hence, once $\Psi$ has obtained a dimension higher than one, the term $E[\Phi\Phi^T]^{-1}$ is used to estimate the correct $w$.
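A minimal sketch of Steps 1-6 in Python/NumPy follows; rows of Phi hold the basis functions, as in Eq. (1), and the names and greedy loop structure are ours, not the authors' implementation.

```python
import numpy as np

def select_subspace(Phi, y, k):
    """Greedy, non-orthogonal deflationary basis pursuit over image block y
    (flattened), choosing k basis functions out of Phi by reconstruction MSE.
    Hedged sketch of the procedure above."""
    remaining = list(range(Phi.shape[0]))
    chosen = []
    for _ in range(k):
        best, best_err, best_w = None, np.inf, None
        for j in remaining:
            B = Phi[chosen + [j]]                   # candidate subspace [Psi, phi_j]
            # Eq. (3): minimum-MSE projection; pinv handles non-orthogonal bases
            w = np.linalg.pinv(B @ B.T) @ (B @ y)
            err = np.mean((B.T @ w - y) ** 2)       # criterion of Eq. (4)
            if err < best_err:
                best, best_err, best_w = j, err, w
        chosen.append(best); remaining.remove(best)
    return chosen, best_w
```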
3 Results
In the learning phase, we have used 200 male and 200 female face images from the AR Face Database [13]. This database contains 4000 faces from 126 persons, with many expressions and clothing. To form the training set for ICA, the 400 images were divided into 8 x 8 blocks. For each block, two training samples of 64 pixels were obtained. The first sample was obtained by performing a line-wise reading of the pixels inside the block, and the second one through column-wise reading. Figure 2 shows an example of an ICA subspace learned from face images, along with the PCA subspace for the same training set and the standard DCT subspace.
Fig. 2. Example of subspaces: (a) ICA and (b) PCA subspaces learned from face images, (c) standard DCT subspace
In the projection phase, the coefficients were quantized using Lloyd's algorithm [14]. Since we are concerned with the losses introduced in the coding process, we used no further lossless methods such as Huffman coding [15]. To evaluate the proposed system performance, we have compressed several images (not used in the learning phase) with ICA, PCA and DCT. Then, the respective reconstructed images were compared using visual inspection and objective measurements. For subjective insight, Figure 3 shows several reconstructed images along with the respective values of CR and picture quality scale (PQS) [16].
Fig. 3. Reconstructed images for ICA, PCA and DCT. The face images in the first column were all reconstructed using three basis functions, and their projection coefficients were quantized using four bits, so that CR = 13:1. For the image in the second column, five basis functions and six bits for quantization were used, so that CR = 10:1. (The panel annotations give PQS = 0.22 and 0.94 for the proposed ICA system, with negative PQS values for both PCA and DCT.)
It is important to notice that the images are compressed with ICA, PCA, and DCT at the same CR. The PQS value varies from zero to five and matches the subjective scale MOS, but can assume negative values for poor reconstructions. Further, we used a weighted version of the standard percent root-mean-square difference (WPRD), which is defined as
$$\mathrm{WPRD} = \sqrt{\frac{E[\mathrm{CF}(\hat{y}_i - y_i)^2]}{E[y_i^2]}}. \qquad (5)$$
The WPRD is found by filtering the spatial frequency of the error image $(\hat{y}_i - y_i)$ with the human contrast sensitivity function (CF) [17]. The weighting aims to match the error influence according to the human visual system's sensitivity to certain frequencies. Figure 4 shows the average WPRD, for ICA, PCA and DCT, obtained by increasing the number of basis functions used for the reconstruction of five face images.
Fig. 4. Average WPRD performance for five reconstructed images which were not used in the learning phase. In the projection phase, the coefficients were quantized using five bits.
4 Discussion
The proposed model has a reasonable physiological foundation: efficient coding. Hence, this model should introduce less perceptible distortions for humans than methods not based on neural modeling, such as DCT or even PCA. Actually, there are many interesting points to highlight. Firstly, let us analyze the subspaces presented in Figure 2. The statistical independence criteria used
in ICA span edge detectors as basis functions for face images. These "edges" correspond to significant variations of the grey level in the training set and are the result of the redundancy reduction process [18]. PCA and DCT are both orthogonal transforms, so they distribute the data spectrum information between their bases in a similar way. This fact can be confirmed by observing that the shapes of their basis functions are alike. Further, we can also see that the reconstructed images for PCA and DCT in Figure 3 are very similar. Secondly, let us analyze an important point in the performance of compression techniques: the presence of ringing and blocking artifacts in reconstructed images. Ringing artifacts are produced by the Gibbs phenomenon [5]. This phenomenon can be understood as large oscillations in image areas that have discontinuities such as edges and contours. These oscillations produce a smoothing effect on the figure, which can be unacceptable for applications such as medical image compression [19]. From the reconstructed images in the first and second columns of Figure 3, we can see that edges and contours defining image details, such as the pupils, eyes, nose, lips and strands of hair of the face in the first column, and the table, arms, legs and face contours in the second image, were better preserved by the proposed model than by PCA and DCT. The problem of ringing might be even worse if blocking artifacts are considered. In this model, images are processed in non-overlapping blocks which are independently transformed and quantized. Therefore, luminance discontinuities may be introduced between block boundaries, which is highly perceptible by humans when few coefficients are used in the reconstruction. Thus, either the compression method is intrinsically robust to this distortion, or additional methods may be required to solve this problem, increasing the complexity of the compression system. Once more, a neural-based model should fit. In fact, comparing the images in Figure 3, we can clearly see that our model is more robust to blocking artifacts. Indeed, since PQS and WPRD take human sensitivity into account, we can see from the negative values of PQS for the reconstructed images in Figure 3 and the WPRD performance in Figure 4 that our model introduces fewer errors than PCA and DCT with regard to human perception.
5 Conclusions
We have proposed an image compression system based on the concept of efficient coding. Our system consists of two phases: learning and projection. In the learning phase, we used independent component analysis (ICA) to learn a subspace that maximizes the code efficiency. In the projection phase, we found the code by projecting the image onto the subspace. The projection was carried out through a mean square error estimation. The system was compared with principal component analysis (PCA) and the discrete cosine transform (DCT) through both visual inspection and objective measurements. The analysis of the results
showed that our model is more robust to distortions which are highly perceptible by humans. Several reconstructed images from our model can be found at http://pib.dee.ufma.br
References

1. Kortman, C.M.: Redundancy Reduction - A Practical Method of Data Compression. Proc. of the IEEE 55(3), 223-226 (1967)
2. Barros, A.K., Cichocki, A.: Neural Coding by Redundancy Reduction and Correlation. In: Proc. of the VII Brazilian Symposium on Neural Networks (SBRN), IEEE (2002)
3. Ahmed, N., Natarajan, T., Rao, K.R.: Discrete Cosine Transform. IEEE Trans. Computers, 90-93 (1974)
4. Wallace, G.K.: The JPEG still-picture compression standard. Commun. ACM 34, 30-44 (1991)
5. Gibbs, J.W.: Fourier Series. Nature 59 (1898)
6. Coudoux, F.X., Gazalet, M.G., Corlay, P., Rouvaen, J.M.: A Perceptual Approach to the Reduction of Blocking Effect in DCT-Coded Images. Journal of Visual Communication and Image Representation 8(4), 327-337 (1997)
7. Simoncelli, E.P., Olshausen, B.A.: Natural Image Statistics and Neural Representation. Annu. Rev. Neurosci. 24, 1193-1216 (2001)
8. Dony, R.D., Haykin, S.: Neural Network Approaches to Image Compression. Proc. of the IEEE 83(2), 288-303 (1995)
9. Ferreira, A.J., Figueiredo, M.A.T.: On the use of independent component analysis for image compression. Signal Processing: Image Communication 21, 378-389 (2006)
10. Guilhon, D., Barros, A.K.: ECG Compression by Efficient Coding. Submitted to ICA (2007)
11. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley and Sons, New York (2001)
12. Mallat, S., Zhang, Z.: Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing 41(12), 3397-3415 (1993)
13. Martinez, A.M., Benavente, R.: The AR Face Database. CVC Technical Report 24 (1998)
14. Lloyd, S.P.: Least Squares Quantization in PCM. IEEE Transactions on Information Theory IT-28, 129-137 (1982)
15. Huffman, D.A.: A method for the construction of minimum-redundancy codes. In: Proceedings of the I.R.E., pp. 1098-1102 (1952)
16. Miyahara, M., Kotani, K., Algazi, V.: Objective picture quality scale (PQS) for image coding. IEEE Trans. Commun. 46, 1215-1226 (1998)
17. Mannos, J.L., Sakrison, D.J.: The Effects of a Visual Fidelity Criterion on the Encoding of Images. IEEE Trans. on Information Theory IT-20, 525-536 (1974)
18. Ziou, D., Tabbone, S.: Edge Detection Techniques - An Overview. International Journal of Pattern Recognition and Image Analysis 8, 537-559 (1998)
19. Ström, J., Cosman, P.C.: Medical image compression with lossless regions of interest. Signal Processing 59, 155-171 (1991)
Complex Nonconvex lp Norm Minimization for Underdetermined Source Separation

Emmanuel Vincent
METISS Group, IRISA-INRIA
Campus de Beaulieu, 35042 Rennes Cedex, France
[email protected]
Abstract. Underdetermined source separation methods often rely on the assumption that the time-frequency source coefficients are independent and Laplacian distributed. In this article, we extend these methods by assuming that these coefficients follow a generalized Gaussian prior with shape parameter p. We study mathematical and experimental properties of the resulting complex nonconvex lp norm optimization problem in a particular case and derive an efficient global optimization algorithm. We show that the best separation performance for three-source stereo convolutive speech mixtures is achieved for small p.
1 Introduction
Underdetermined source separation is the problem of recovering the single-channel source signals $s_j(t)$, $1 \le j \le J$, underlying a multichannel mixture signal $x_i(t)$, $1 \le i \le I$, with $I < J$. The mixing process can be modeled in the time-frequency domain via the Short-Term Fourier Transform (STFT) as [1]
$$X(n, f) = A(f) S(n, f) \qquad (1)$$
where S(n, f) is the vector of source STFT coefficients in time-frequency bin (n, f), X(n, f) is the vector of mixture STFT coefficients in the same bin, and A(f) is a complex mixing matrix. This problem can be tackled by first estimating the mixing matrices and then deriving the Maximum A Posteriori (MAP) source coefficients under the constraint (1), based on some prior distribution. Existing separation methods rely on the assumption that the source coefficients are independent and sparsely distributed, i.e. a large proportion of coefficients are close to zero. Examples of sparse priors include mixtures of Dirac impulses and Gaussians [2], mixtures of Gaussians [3], Student t distributions [4] and the Laplacian distribution [5,6,1,7]. The latter is popular since it leads to a convex optimization problem that can be solved efficiently. In this paper, we extend this approach by assuming that the source coefficients follow a generalized Gaussian prior, of which the Laplacian is a special case. This extension is not straightforward, since the resulting criterion can be nonconvex. The structure of the rest of this paper is as follows. In Section 2, we argue that generalized Gaussian priors are well suited to the modeling of speech signals. We
study the properties of the resulting optimization problem in Section 3 in the case where J = I + 1 and derive an efficient global optimization algorithm. We evaluate its performance on instantaneous and convolutive speech mixtures in Section 4 and conclude in Section 5.
2 Generalized Gaussian Priors
Generalized Gaussian priors were first introduced in the context of source separation via Independent Component Analysis (ICA) [8,9,10]. The phases of the source STFT coefficients $S_j(n, f)$ are assumed to be uniformly distributed, while their magnitudes are modeled by
$$P(|S_j(n, f)|) = p\, \frac{\beta^{1/p}}{\Gamma(1/p)}\, e^{-\beta |S_j(n, f)|^p} \qquad (2)$$
where the parameters $p > 0$ and $\beta > 0$ govern respectively the shape and the variance of the prior. This prior includes the Laplacian ($p = 1$) and the Gaussian ($p = 2$) as special cases, and its sparsity increases with decreasing $p$. In order to assess the benefit of using this prior, we computed the best shape parameters $p$ for 30 speech signals, considering all frequency bins either separately or together. The signals were sampled at 8 kHz and had a duration of 12 s. The STFT was computed using half-overlapping sine windows of various lengths $L$ and each frequency bin was scaled to unit variance. The Maximum Likelihood (ML) parameters were estimated using Matlab's fminunc optimizer¹. The observed parameter range is depicted in Figure 1. On average, $p$ varies between 0.4 and 0.9 depending on the window length $L$ and stays almost constant across frequency, except at very low frequencies where background noise dominates. This shows that generalized Gaussian priors with $p < 1$ fit speech sources better than Laplacian priors. Interestingly, the observed value of $p$ reaches a minimum for $L = 64$ ms, which was also previously determined to be the optimal window length for source separation via binary masking [11,12].
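As an illustration of this fit, a Python analogue of the fminunc-based ML estimation might look as follows; the log-parametrization and function names are ours, with eq. (2) supplying the likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_shape(mag, p0=1.0, b0=1.0):
    """Hedged sketch: ML fit of the generalized Gaussian parameters (p, beta)
    of eq. (2) to a 1-D array of STFT magnitudes mag (zeros excluded)."""
    n = len(mag)
    def nll(theta):
        p, beta = np.exp(theta)                 # enforce p, beta > 0
        # negative log-likelihood of eq. (2) summed over samples
        return -n * (np.log(p) + np.log(beta) / p - gammaln(1.0 / p)) \
               + beta * np.sum(mag ** p)
    res = minimize(nll, np.log([p0, b0]))
    return np.exp(res.x)                        # (p, beta)
```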
3 Properties of the Complex lp Norm Criterion
Given these results, we now assume that the mixing matrices $A(f)$ are known and that the source STFT coefficients follow a generalized Gaussian prior with fixed parameters $p$ and $\beta$. The MAP source coefficients are given by
$$\hat{S}(n, f) = \arg\min_{S \in \mathbb{C}^J} \|S\|_p \quad \text{subject to} \quad A(f) S = X(n, f) \qquad (3)$$
where $\|S\|_p$ is the $\ell_p$ norm of the vector $S$, defined by $\|S\|_p^p = \sum_{j=1}^{J} |S_j|^p$. When $p < 1$, this criterion is nonconvex and hence difficult to minimize.
¹ This algorithm is based on a subspace trust region method. For more details, see http://www.mathworks.com/access/helpdesk_r13/help/toolbox/optim/fminunc.html
Fig. 1. Variation of the shape parameter p measured on speech STFT coefficients as a function of the window length L (left) and as a function of frequency f with L = 64 ms (right). The black curve and the gray area represent respectively the geometric mean and the geometric standard deviation of the measured values. This illustration is motivated by the fact that the measured values are approximately log-Gaussian.
3.1 Unconstrained Expression
We focus in the rest of this article on the simple case where $J = I + 1$ and $A(f)$ has full row rank. Under this assumption, the constrained optimization problem (3) is equivalent to a one-dimensional unconstrained complex optimization problem [1]. The MAP source coefficients are then expressed as
$$\hat{S}(n, f) = O + u V \qquad (4)$$
where $O$ is any vector satisfying the constraint, e.g. $O = A(f)^{\dagger} X(n, f)$ with $\dagger$ denoting pseudo-inversion, $V$ is any vector spanning the null space of $A(f)$, and
$$u = \arg\min_{u \in \mathbb{C}} \|O + uV\|_p^p. \qquad (5)$$
This optimization problem has to be solved for each STFT bin $(n, f)$ individually. Using the complex derivative notation [13], the first and second order derivatives of the criterion $L(u) = \|O + uV\|_p^p$ are given by
$$\frac{\partial L}{\partial \bar{u}} = \overline{\left(\frac{\partial L}{\partial u}\right)} = \frac{p}{2} \sum_{j=1}^{J} |O_j + u V_j|^{p-2}\, \bar{V}_j (O_j + u V_j) \qquad (6)$$
$$\frac{\partial^2 L}{\partial u\, \partial \bar{u}} = \frac{\partial^2 L}{\partial \bar{u}\, \partial u} = \frac{p^2}{4} \sum_{j=1}^{J} |O_j + u V_j|^{p-2}\, |V_j|^2 \qquad (7)$$
$$\frac{\partial^2 L}{\partial \bar{u}^2} = \overline{\left(\frac{\partial^2 L}{\partial u^2}\right)} = \frac{p(p-2)}{4} \sum_{j=1}^{J} |O_j + u V_j|^{p-4}\, \bar{V}_j^2 (O_j + u V_j)^2 \qquad (8)$$
3.2 Singular and Non-singular Solutions
It is well known that for real variables the global minimum of $L$ with $p \le 1$ results in at least one source coefficient being zero and can be found by combinatorial optimization [6,1]. However, this is not true anymore with complex variables, as shown in [1,7] in the particular case $p = 1$. Nevertheless, the local minima of $L$ can still be characterized using the two lemmas below.

Lemma 1. Let $\mathcal{J} = \{j : V_j \neq 0\}$. When $p < 1$, the points $z_j = -\frac{O_j}{V_j}$, $j \in \mathcal{J}$, are singular (i.e. non-differentiable) local minima of $L$.

Proof. Let $\mathcal{Z}_j = \{k : z_k = z_j\}$. The point $z_j$ is characterized by the fact that $\hat{S}_k(n, f) = 0$ for all $k \in \mathcal{Z}_j$ and $\hat{S}_k(n, f) \neq 0$ for all $k \notin \mathcal{Z}_j$. By developing the expression of $L$ around this point when $p < 1$, we get
$$L(z_j + u) = L(z_j) + \left(\sum_{k \in \mathcal{Z}_j} |V_k|^p\right) |u|^p + O(u). \qquad (9)$$
Thus $L$ is non-differentiable at $z_j$ and $L(z_j + u) > L(z_j)$ for small $u \neq 0$.
Lemma 2. The other local minima of $L$ are non-singular and within the convex hull of $z_j$, $j \in \mathcal{J}$.

Proof. If $u \neq z_j$ for all $j$ and $u$ is a local minimum of $L$, then $L$ is differentiable at $u$ according to (6) and $\frac{\partial L}{\partial u} = 0$. After rearranging this equality, we get
$$u = \frac{\sum_{j \in \mathcal{J}} |O_j + u V_j|^{p-2} |V_j|^2 z_j}{\sum_{j \in \mathcal{J}} |O_j + u V_j|^{p-2} |V_j|^2}. \qquad (10)$$
Thus $u$ can be expressed as a weighted sum of $z_j$, $j \in \mathcal{J}$, with positive weights summing to one.

In the following, we use the term "singular" to characterize, by extension, the local minima of $L$ where at least one source coefficient is zero, although $L$ is differentiable at these minima when $p > 1$.

3.3 Critical Value of p for the Existence of Non-singular Solutions
The above distinction between singular and non-singular local minima raises the question whether non-singular minima can exist for all values of $p$ and whether the global minimum can be non-singular. We studied this question experimentally with $I = 2$ and $J = 3$. We drew 100 independent source coefficient vectors following the generalized Gaussian distribution with shape parameter $p = 0.4$ using the Metropolis-Hastings algorithm [14]. We also drew 100 instantaneous (real) mixing matrices of the form $A_{1j} = \cos(\theta_j)$ and $A_{2j} = \sin(\theta_j)$ with $\theta_j$ uniformly distributed in $[-\frac{\pi}{2}, \frac{\pi}{2}]$, and 100 convolutive (complex) mixing matrices of the form $A_{1j} = 1$ and
$A_{2j} = e^{2i\pi\theta_j}$ with $\theta_j$ uniformly distributed in $(-\pi, \pi]$. The multiplication of each source coefficient vector by each matrix resulted in a total of 10000 instantaneous mixtures and 10000 convolutive mixtures. For each mixture, we tested whether non-singular minima of $L$ existed and whether the global minimum was non-singular as follows. Given Lemma 2, we sampled $L$ on a discrete grid spanning the convex hull of $z_j$, $1 \le j \le J$, containing points of the form $u = \frac{k_1}{3K} z_1 + \frac{k_2}{3K} z_2 + \frac{3K - k_1 - k_2}{3K} z_3$, $1 \le k_j \le K$, with $K = 50$, and we selected the global minimum $u$ on this grid. If $u$ was non-singular, then the true global minimum was necessarily non-singular. Otherwise, we decided that the global minimum was singular. In the latter case, we also sampled the gradient and the Hessian of $L$ on the same grid, selected as a potential local minimum the point with the smallest gradient among all points with positive definite Hessian, and refined it using the fminunc optimizer. We then observed whether the optimizer converged to a non-singular local minimum or not. The results were very similar for instantaneous and convolutive mixtures. The average percentage of mixture draws resulting in a non-singular local minimum or a non-singular global minimum is depicted in Figure 2 as a function of $p$. Both quantities decrease with decreasing $p$, with a large drop around $p = 1$. For $p \lesssim 0.75$, there remain a few non-singular local minima, but no global minima. This can be illustrated in a more general case using the example below.
Fig. 2. Percentage of mixture draws resulting in a non-singular local minimum (dashed curve) or a non-singular global minimum (plain curve) of the lp norm criterion with three-source two-channel mixtures
Example 1. Let $O_j = e^{2i\pi j/J}$ and $V_j = 1$, $1 \le j \le J$. Then $u = 0$ is a non-singular local minimum of $L$ for all $p > 0$. Furthermore, the value of $L$ at this minimum is smaller than at all singular local minima for $p > p_{crit}$, where $p_{crit}$ is defined implicitly by $\sum_{j=1}^{J-1} |1 - O_j|^{p_{crit}} = J$ and equals respectively 0.738, 0.612, 0.534, 0.481 for $J = 3, 4, 5, 6$, and decreases with increasing $J$.
Proof. Using the fact that $\sum_{j=1}^{J} O_j = \sum_{j=1}^{J} O_j^2 = 0$, the coefficients of the complex gradient, the diagonal coefficients of the complex Hessian and the off-diagonal coefficients of the complex Hessian of $L$, defined in [13], are given respectively at $u = 0$ by $\frac{\partial L}{\partial u} = \frac{\partial L}{\partial \bar{u}} = 0$, $\frac{\partial^2 L}{\partial u \partial \bar{u}} = \frac{\partial^2 L}{\partial \bar{u} \partial u} = \frac{Jp^2}{4}$ and $\frac{\partial^2 L}{\partial u^2} = \frac{\partial^2 L}{\partial \bar{u}^2} = 0$. Thus the complex gradient is zero and the complex Hessian is positive-definite, which proves that $u = 0$ is a non-singular local minimum of $L$ [13].

The values of the criterion at this non-singular minimum and at the singular local minima $z_j = -O_j$ are given by $L(0) = J$ and $L(z_j) = \sum_{j=1}^{J-1} |1 - O_j|^p$ for all $j$. The latter is a strictly increasing function of $p$. Indeed, it can be checked that $\frac{dL(z_j)}{dp} = \log J > 0$ at $p = 0$ and $\frac{d^2 L(z_j)}{dp^2} > 0$ for all $p > 0$. Thus $L(0) > L(z_j)$ if and only if $p > p_{crit}$, where $p_{crit}$ is the value of $p$ such that $L(0) = L(z_j)$. This shows that the global minimum of $L$ can be non-singular when $p > p_{crit}$. We conjecture that $p_{crit}$ is the lowest value of $p$ for which this can happen.

Conjecture 1. The global minimum of $L$ with $p < p_{crit}$ is always singular.

We have not yet managed to prove this conjecture mathematically. However we verified it experimentally with $J = 3$ (see Figure 2) and with $4 \le J \le 6$ using the same number of mixture draws and a similar discrete grid for optimization.

3.4 Efficient Optimization Algorithm
Lemmas 1 and 2 and Conjecture 1 suggest the following efficient algorithm for the estimation of the MAP source coefficients (a code sketch follows below).

– If $p \ge 1$, run any gradient-based optimizer initialized randomly using (6)–(8).
– If $p \le p_{crit}$, sample the criterion at the singular points $z_j$, $1 \le j \le J$, and select the minimum of the criterion among these points.
– If $p_{crit} < p < 1$, sample the criterion on a discrete grid spanning the convex hull of the singular points $z_j$, $1 \le j \le J$, and containing these points. Select the minimum of the criterion on this grid. If it is non-singular, refine it via any gradient-based optimizer using (6)–(8).

Provided that Conjecture 1 is true, this algorithm is guaranteed to find the global minimum of the criterion when $p \ge 1$ or $p \le p_{crit}$, but also when $p_{crit} < p < 1$ if the discrete grid is tight enough. Moreover, this algorithm is quite fast, particularly for small $p$. Using Matlab on a 1.2 GHz CPU with the fminunc optimizer and the discrete grid defined in Section 3.3, the computation time equals on average 0.15 s, 0.0065 s and 0.00025 s for $p = 1$, $p = 0.9$ and $p = 0.5$ respectively, with $I = 2$ and $J = 3$. By contrast, the optimization via Second Order Cone Programming (SOCP) for $p = 1$ takes about 0.36 s, using the same Matlab toolbox as in [1,7].
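A hedged sketch of the p < 1 branches for J = 3 follows; the gradient-based refinement of the third branch is omitted, and the names, grid resolution and table of critical values (taken from Example 1) are arranged by us.

```python
import numpy as np

P_CRIT = {3: 0.738, 4: 0.612, 5: 0.534, 6: 0.481}   # values from Example 1

def lp_map_source(O, V, p, K=50):
    """Minimize L(u) = sum_j |O_j + u V_j|^p over complex u and return the
    MAP source coefficients of eq. (4); assumes all V_j are nonzero."""
    L = lambda u: np.sum(np.abs(O + u * V) ** p)
    z = -O / V                                   # singular points (Lemma 1)
    if p <= P_CRIT.get(len(O), 0.0):
        u = min(z, key=L)                        # global minimum is singular
    else:
        # barycentric grid over the convex hull of z_1, z_2, z_3 (Section 3.3)
        ks = np.arange(K + 1) / K
        cand = [w1 * z[0] + w2 * z[1] + (1 - w1 - w2) * z[2]
                for w1 in ks for w2 in ks if w1 + w2 <= 1.0]
        u = min(cand, key=L)
    return O + u * V                             # eq. (4) with the optimal u
```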
4 Source Separation Results
We evaluated the proposed algorithm for the separation of 10 instantaneous and 10 convolutive speech mixtures with I = 2 and J = 3. The mixture signals
were obtained by mixing the source signals of Section 2 either with a matrix of positive coefficients or with a set of simulated room impulse responses corresponding to a reverberation time of 250 ms, as described in [12]. Following [11,12], the STFT window length $L$ was set to 512 (64 ms) for instantaneous mixtures and 2048 (256 ms) for convolutive mixtures. The frequency-domain complex mixing matrices $A(f)$ were computed by Fourier transform of the mixing filters in the convolutive case. The performance was measured in decibels (dB) for each estimated source by $\mathrm{SDR}_j = 20 \log_{10}\left(\|s_j\| / \|\hat{s}_j - s_j\|\right)$ and subsequently averaged. For comparison, we also evaluated the performance when the criterion was optimized over the singular points only, as suggested in [1].

The results are shown in Figure 3. In the instantaneous case, the best SDR is achieved for $p = 1$ and the SDR for smaller values of $p$ is 0.6 dB smaller. In the convolutive case, the best SDR is achieved for $p \to 0$ and it is 1.2 dB larger than for $p = 1$. Note that the ML value of $p$ determined in Section 2, namely $p = 0.4$, does not result in the best SDR. This suggests that algorithms estimating the value of $p$ from the data might not improve performance. Note also that the SDR remains much smaller than the theoretical upper bound computed in [12].
Fig. 3. Average SDR as a function of p for the separation of instantaneous (left) and convolutive (right) three-source two-channel speech mixtures (plain curve: optimization over the full space, dashed curve: optimization over the singular points only)
5 Conclusion
In this article, we investigated the benefit of modeling the sources via generalized Gaussian priors instead of Laplacian priors for underdetermined source separation. This generalization is not straightforward, since the resulting $\ell_p$ norm criterion is nonconvex for $p < 1$. In the simple case where $J = I + 1$, we characterized mathematically the local minima of this criterion, conjectured that the global minimum is always singular below a critical value of $p$, and derived an efficient global optimization algorithm. We evaluated this algorithm on speech mixtures and showed that small values of $p$ resulted in the best separation
performance in the convolutive case, but also in the fastest optimization. This work raises further research issues, including the proof of the above conjecture and the extension of the proposed algorithm when J > I + 1.
References

1. Winter, S., Sawada, H., Makino, S.: On real and complex valued l1-norm minimization for overcomplete blind source separation. In: Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 86-89. IEEE Computer Society Press, Los Alamitos (2005)
2. Vielva, L., Erdoğmuş, D., Príncipe, J.C.: Underdetermined blind source separation using a probabilistic source sparsity model. In: Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA), pp. 675-679 (2001)
3. Davies, M.E., Mitianoudis, N.: Simple mixture model for sparse overcomplete ICA. IEE Proceedings on Vision, Image and Signal Processing 151, 35-43 (2004)
4. Févotte, C., Godsill, S.J.: A Bayesian approach for blind separation of sparse sources. IEEE Trans. on Audio, Speech and Language Processing 14, 2174-2188 (2006)
5. Lee, T.W., Lewicki, M.S., Girolami, M.A., Sejnowski, T.J.: Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters 6, 87-90 (1999)
6. Zibulevsky, M., Pearlmutter, B.A., Bofill, P., Kisilev, P.: Blind source separation by sparse decomposition in a signal dictionary. In: Independent Component Analysis: Principles and Practice, pp. 181-208. Cambridge Press, Cambridge (2001)
7. Bofill, P., Monte, E.: Underdetermined convoluted source reconstruction using LP and SOCP, and a neural approximator of the optimizer. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 569-576. Springer, Heidelberg (2006)
8. Choi, S., Cichocki, A., Amari, S.: Flexible independent component analysis. In: Neural Networks for Signal Processing (NNSP 8), pp. 83-92 (1998)
9. Everson, R., Roberts, S.: Independent component analysis: a flexible nonlinearity and decorrelating manifold approach. Neural Computation 11, 1957-1983 (1999)
10. Wu, H.C., Príncipe, J.C.: Generalized anti-Hebbian learning for source separation. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. II, pp. 1073-1076 (1999)
11. Yılmaz, O., Rickard, S.T.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. on Signal Processing 52, 1830-1847 (2004)
12. Vincent, E., Gribonval, R., Plumbley, M.D.: Oracle estimators for the benchmarking of source separation algorithms. Signal Processing 87, 1933-1950 (2007)
13. van den Bos, A.: Complex gradient and Hessian. IEE Proceedings on Vision, Image and Signal Processing 141, 380-382 (1994)
14. Casella, G., Robert, C.P.: Monte Carlo Statistical Methods, 2nd edn. Springer, New York (2005)
Sparse Component Analysis in Presence of Noise Using an Iterative EM-MAP Algorithm

Hadi Zayyani¹, Massoud Babaie-Zadeh¹, G. Hosein Mohimani¹, and Christian Jutten²

¹ Electrical Engineering Department, Advanced Communications Research Institute (ACRI), Sharif University of Technology, Tehran, Iran
² GIPSA-lab, Department of Images and Signals, National Polytechnic Institute of Grenoble (INPG), France
[email protected], [email protected], [email protected], [email protected]

(This work has been partially funded by Sharif University of Technology, by Center for International Research and Collaboration (ISMO) and by Iran NSF (INSF).)
Abstract. In this paper, a new algorithm for source recovery in underdetermined Sparse Component Analysis (SCA), or atomic decomposition on over-complete dictionaries, is presented for the noisy case. The algorithm is essentially a method for obtaining sufficiently sparse solutions of under-determined systems of linear equations with additive Gaussian noise. The method is based on iterative Expectation-Maximization of a Maximum A Posteriori estimation of the sources (EM-MAP), and a new steepest-descent method is introduced for the optimization in the M-step. The solution obtained by the proposed algorithm is compared to the minimum ℓ1-norm solution achieved by Linear Programming (LP). It is experimentally shown that the proposed algorithm is about one order of magnitude faster than the interior-point LP method, while providing better accuracy.

Keywords: sparse component analysis, sparse decomposition, blind source separation, independent component analysis.
1 Introduction
Finding (sufficiently) sparse solutions of under-determined systems of linear equations (possibly in the noisy case) has been studied extensively in recent years [1,2,3,4,5,6,7,8,9,10]. The problem has a growing range of applications in signal processing. One of these applications is noisy under-determined sparse source separation, which is also called Sparse Component Analysis (SCA) [1,2,3,4,5,6]. Another application is the so-called 'atomic decomposition' problem, which aims at finding a sparse representation of a signal in an overcomplete dictionary [7,8,9,10]. In this paper, we will mainly use the context of SCA in stating our approach. The discussion, however, may be easily followed in other contexts of application such as atomic decomposition. SCA can be viewed as a method to achieve separation of sparse sources. The Blind Source Separation (BSS) problem is to recover m unknown sources from n
observed mixtures of them, where little or no information is available about the sources (except their statistical independence) and the mixing system. In this paper we consider the noisy linear instantaneous model:

x(t) = A s(t) + n(t),   (1)

where x(t), s(t) and n(t) are the n × 1 vector of mixtures, the m × 1 vector of sources and the n × 1 vector of white Gaussian noise, respectively, and A is the n × m mixing matrix. In the under-determined case (m > n), estimating the mixing matrix is not sufficient to recover the sources, since the mixing matrix is not invertible. The estimation of the sources then requires some prior information on the sources, and the problem passes from a blind problem to a semi-blind problem. One such prior information is the sparsity of the sources: only a few samples of the sources are nonzero (we say they are active) and most of them are almost zero (we say they are inactive). SCA can then be solved in two steps: first estimating the mixing matrix, and then estimating the sources. The first step may be accomplished by means of clustering [1] or other methods [4]. The second step requires finding the sparse solution of (1) assuming A to be known [9]. In this paper, we focus on the source estimation assuming A is known.

In the atomic decomposition viewpoint [10], we have one signal whose samples are collected in the n × 1 signal vector s, and the objective is to express it as a linear combination of a set of predetermined signals whose samples are collected in the vectors {ϕ_i}_{i=1}^m. After [11], the ϕ_i's are called atoms, and they collectively form a dictionary over which the signal is to be decomposed. In this paper, we also consider a noise term for the decomposition, so we can write s = ∑_{i=1}^m α_i ϕ_i + n = Φα + n, where Φ is the n × m dictionary (matrix) whose columns are the atoms and α is the m × 1 vector of coefficients. The vector n can be interpreted either as the noisy part of the original signal that we intend to decompose or as the allowed error for the decomposition process.

To obtain the sparse solution of (1), one approach is to search for solutions having minimal ℓ0 norm, i.e., a minimum number of nonzero components. This method is intractable as the dimension increases (due to the combinatorial search), and it is too sensitive to noise (due to the discontinuity of the ℓ0 norm). One of the most successful approaches is Basis Pursuit (BP) [10], which finds the minimum ℓ1-norm solution of (1) and can be easily implemented by Linear Programming (LP) methods (especially fast interior-point LP solvers). Another approach is Matching Pursuit (MP) [11], which is very fast, but is somewhat heuristic and does not provide a good estimation of the sources. In [5], we proposed a three-step (sub-)optimum (in the MAP sense) method for SCA in the noisy under-determined case (briefly called MAP), which has the drawback of great complexity and is not tractable for the sparse decomposition application (which requires large values of m and n). This problem exists also in [6]. In this article, we propose an iterative method to tackle the great complexity of our MAP method. In the maximization step of our algorithm, we propose an optimization method based on steepest descent. Our method results in a fast
sparse decomposition (faster than interior-point LP), while improving the quality of the source recovery because of its optimality in the MAP sense and its explicit handling of noise.
2 System Model
The noise vector in the model (1) is assumed zero-mean Gaussian with covariance matrix σ_n² I. For modeling the sparse sources the following model is used: the sources are inactive with probability p, and are active with probability 1 − p (sparsity of the sources implies that p should be near 1). In the inactive case the sample of the source is zero, and in the active case the sample has a Gaussian distribution. We call this model the 'spiky model'; it is a special case of the Bernoulli-Gaussian model used in [5]. The probability density function (PDF) of the sources is then:

p(s_i) = p δ(s_i) + (1 − p) N(0, σ_r²).   (2)
In this model, any sample of the sources can be written as s_i = q_i r_i, where q_i is a binary variable (with a Bernoulli distribution) and r_i is the amplitude of the i'th source, with a Gaussian distribution. So the source vector can be written as:

s = Q r,   Q = diag(q).   (3)
We refer to the vector q ≜ [q_1, …, q_m]ᵀ as the 'source activity vector', where ᵀ denotes vector/matrix transposition. Each element of this vector shows the activity of the corresponding source. That is:

q_i = 1 if s_i is active (with probability 1 − p),
q_i = 0 if s_i is inactive (with probability p).   (4)

The probability of the source activity vector, p(q), is equal to:

p(q) = (1 − p)^{n_a} p^{m − n_a},   (5)

where n_a is the number of active sources, i.e., the number of 1's in q.
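For concreteness, the following sketch draws sources from this spiky model and forms noisy mixtures according to (1), using the mixing-matrix construction described later in the simulation section (uniform entries on [−1, 1] with unit-norm columns); the specific sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
m, n, p, sigma_r, sigma_n = 1000, 400, 0.9, 1.0, 0.01

q = (rng.random(m) > p).astype(float)        # q_i = 1 (active) with probability 1 - p, see (4)
r = sigma_r * rng.standard_normal(m)         # Gaussian amplitudes r_i
s = q * r                                    # s = Q r, see (3)

A = rng.uniform(-1.0, 1.0, size=(n, m))      # mixing matrix
A /= np.linalg.norm(A, axis=0)               # normalize each column to unit length
x = A @ s + sigma_n * rng.standard_normal(n) # noisy mixtures, model (1)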
3 Review of Our Previous MAP Algorithm
In [5] we proposed a three-step MAP algorithm for noisy sparse component analysis. The parameter estimation step is done by a novel method based on second and fourth order moments of one mixture and an EM algorithm. The source activity estimation step is done with a MAP method that maximizes the posterior probability. This step is the maximization of:

p(q) p(x|q) = ( p(q) / √det(2π Q_q) ) exp( −(1/2) xᵀ Q_q⁻¹ x ),   (6)
where Q_q = A V_q Aᵀ + σ_n² I, in which Q_q ≜ E{x xᵀ | q} and V_q ≜ E{s sᵀ | q} = σ_r² Q are the conditional covariances of the observations and the sources given q. After finding the optimum source activity vector, the source amplitudes are estimated as:

r̂ = σ_r² Q Aᵀ (σ_r² A Q Aᵀ + σ_n² I)⁻¹ x.   (7)

Maximization of (6) is done over the discrete space of the vector q, with 2^m discrete elements. In [5] this maximization was done through an exhaustive search over all 2^m cases. In this paper, the maximization is done by first converting it to a continuous maximization and then applying a steepest descent algorithm (this is similar to the idea used in [9]). To convert our discrete problem to a continuous one, we use a mixture of two Gaussians centered around 0 and 1 with sufficiently small variances. By this method our discrete binary variable q_i is converted to a continuous variable. To avoid falling into local maxima of (6), a gradually decreasing variance can be used across the iterations (similar to simulated annealing methods). But (6) is still too complex to differentiate for providing an efficient optimization method such as steepest descent.
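As an illustration, the logarithm of the objective in (6) can be evaluated directly for a given activity pattern q. This is only a sketch assuming real-valued data; the function name is ours.

import numpy as np

def log_map_objective(q, x, A, p, sigma_r, sigma_n):
    # log of p(q) p(x|q) in (6), with Q_q = A V_q A^T + sigma_n^2 I and V_q = sigma_r^2 diag(q)
    n, m = A.shape
    Qq = sigma_r**2 * (A * q) @ A.T + sigma_n**2 * np.eye(n)
    n_a = q.sum()
    log_prior = n_a * np.log(1 - p) + (m - n_a) * np.log(p)   # log p(q), see (5)
    _, logdet = np.linalg.slogdet(2 * np.pi * Qq)
    return log_prior - 0.5 * logdet - 0.5 * x @ np.linalg.solve(Qq, x)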
4 The Iterative EM-MAP Algorithm
The main idea of our algorithm is that, as observed from (3), estimating the sources is equivalent to estimating the vectors q and r. Estimation of q and r can be done iteratively. First, an estimated vector q̂ is assumed, and the MAP estimate of the vector r based on the known estimated vector q̂ and the observation vector x is obtained (we refer to it as r̂). Secondly, the MAP estimate of the vector q is obtained based on the estimated vector r̂ and the observation vector x (we refer to it as q̂). Therefore, the MAP estimation of the sources is done in two separate MAP estimation steps.

In the first step a source activity vector q̂ is assumed and the estimate of r is computed. Because the vector r is Gaussian, its MAP estimate is equal to the Linear Least Squares (LLS) estimate [13] and can be computed as follows:

r̂_MAP = r̂_LLS = E(r | x, q̂) = E(r xᵀ | q̂) E(x xᵀ | q̂)⁻¹ x.   (8)

This step can be nominated the Expectation step or Estimation step (E-step). Computation and simplification of (8) (as done in [5]) leads to the following equation, which is similar to (7):

r̂ = σ_r² Q̂ Aᵀ (σ_r² A Q̂ Aᵀ + σ_n² I)⁻¹ x.   (9)

In the second step we estimate q based on the known r̂ and the observed x. The MAP estimate is:

q̂_MAP = argmax_q p(q | x, r̂) ≡ argmax_q p(q | r̂) p(x | q, r̂) ≡ argmax_q p(q) p(x | q, r̂).   (10)
In (10), p(q) can be computed as a continuous variable:

p(q) = ∏_{i=1}^m p(q_i) = ∏_{i=1}^m [ p exp(−q_i² / 2σ_0²) + (1 − p) exp(−(q_i − 1)² / 2σ_0²) ].   (11)
Also, the term p(x | q, r̂) in (10) can be computed as:

p(x | q, r̂) = p_n(x − A Q r̂) = (2πσ_n²)^{−n/2} exp( −(1/2σ_n²) (x − A Q r̂)ᵀ (x − A Q r̂) ).   (12)
The second step can be called the Maximization step (M-step). The maximization can be done over the logarithm of (10), so this step can be simplified as:

M-step:  q̂ = argmax_q L(q),   (13)

where

L(q) = ∑_{i=1}^m log(p(q_i)) − (1/2σ_n²) (x − A Q r̂)ᵀ (x − A Q r̂).   (14)
Maximization of L(q) in the M-step can be done with the steepest descent method. The main steepest descent iteration is:

q_{k+1} = q_k − μ ∂L(q)/∂q.   (15)
In the appendix, we show that the steepest descent algorithm for the M-step is:

q_{k+1} = q_k + (μ/σ_0²) g(q) + (μ/σ_n²) diag(Aᵀ A Q r̂ − Aᵀ x) r̂,   (16)
where g(q) is defined in the appendix. In the successive iterations, we gradually decrease the variance σ_0 in the form σ_0^(i) = α σ_0^(i−1), where α is selected between 0.6 and 1. Also, the step-size μ should be decreasing, i.e., for smaller σ's, smaller μ's should be applied. This is because for smaller variances, the function being maximized fluctuates more. So the step size can be decreased in the similar form μ^(i) = α μ^(i−1). Our simulations show that for α = 0.8, only about 4 or 5 iterations are sufficient to maximize the expression L(q) in the M-step. Also, the EM iteration converges by the third or fourth step. The EM-MAP method is initialized with the minimum ℓ2-norm solution. As we see from (16), the second summand is responsible for increasing the prior probability p(q) while the third summand is responsible for decreasing the noise power ‖x − As‖. When σ_0 is much larger than σ_n, the third term is more effective than the second term, and as a result exactness of x = As is more important than sparsity of s. When σ_0 decreases to become comparable to σ_n, both terms are effective and yield an equilibrium point between sparsity and noise. In summary, the overall algorithm is an iterative two-step algorithm (the E-step and M-step in (9) and (13), respectively) in which the M-step is itself done iteratively with the steepest descent method in (16).
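The sketch below implements the iteration literally as written in (9) and (16). The continuous initialization of q from the minimum ℓ2-norm solution is our own assumption, since the text only specifies that the method is initialized with that solution; the remaining constants follow the simulation section.

import numpy as np

def g(q, p, sigma0):
    # elementwise scalar function g(q_i) of (18) in the appendix
    e0 = np.exp(-q**2 / (2 * sigma0**2))
    e1 = np.exp(-(q - 1)**2 / (2 * sigma0**2))
    return (p * q * e0 + (1 - p) * (q - 1) * e1) / (p * e0 + (1 - p) * e1)

def em_map(x, A, p=0.9, sigma_r=1.0, sigma_n=0.01,
           em_iters=4, sd_iters=5, alpha=0.8, mu=1e-6):
    n, m = A.shape
    s0 = A.T @ np.linalg.solve(A @ A.T, x)          # minimum l2-norm solution
    q = np.abs(s0) / np.abs(s0).max()               # continuous activity initialization (assumption)
    sigma0 = sigma_r
    for _ in range(em_iters):
        AQ = A * q                                  # A diag(q)
        # E-step, eq. (9)
        r = sigma_r**2 * AQ.T @ np.linalg.solve(
            sigma_r**2 * AQ @ A.T + sigma_n**2 * np.eye(n), x)
        # M-step, eq. (16): steepest descent updates of q
        for _ in range(sd_iters):
            v = A.T @ (A @ (q * r)) - A.T @ x       # A^T A Q r - A^T x
            q = q + (mu / sigma0**2) * g(q, p, sigma0) + (mu / sigma_n**2) * v * r
        sigma0 *= alpha                             # gradually decrease sigma_0 and mu
        mu *= alpha
    return q * r                                    # s = Q r, eq. (3)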
[Figure 1 here: average SNR (dB), ranging roughly from 25 to 35, versus the number of mixtures n = 400–700, for EM-MAP with actual parameters, EM-MAP with simple estimated parameters, and LP.]

Fig. 1. Comparison of the results of our algorithm in two cases and the (interior-point) LP method. The parameters of the simulation are m = 1000, p = 0.9, σ_r = 1, σ_n = 0.01, α = 0.8, σ_0^(0) = σ̂_r and μ^(0) = 10⁻⁶. Four iterations are used for the EM-step and five iterations for the M-step (steepest descent).
5 Simulation Results
In this section, we examine the performance of our algorithm in two cases (the case of using the actual parameters and the case of using a simple method for estimating the parameters, explained below) and then compare it to the interior-point LP method. Our performance criterion is the Signal to Noise Ratio (SNR) in dB, defined by SNR = 10 log₁₀ ( ‖s‖² / ‖ŝ − s‖² ). The simulations have been repeated 50 times (with the same parameters, but with different randomly generated sources and mixing matrices) and the resulting SNRs (in dB) have been averaged. The values used for the experiment are m = 1000, n = 400, …, 700, p = 0.9, σ_r = 1 and σ_n = 0.01. The elements of the mixing matrix are randomly chosen between −1 and 1, and each column is normalized to have unit length. In the M-step the value of α is between 0.6 and 1. This parameter affects the speed of convergence; we use an average value of α = 0.8 in our simulations. The initial value of σ_0 is set equal to the estimated σ_r. The initial value of μ can be selected between 10⁻³ and 10⁻⁸, but for small and large values in this range the performance is somewhat deteriorated, so we select μ = 10⁻⁶. Four iterations are used for the EM-step and five iterations are used for the M-step (steepest descent).

In one of our simulations we use a very simple estimation of the parameters. In this case the parameter p is underestimated as p = 0.8. With this assumption, by considering the ergodicity of the sources (i.e., the mixtures are the ensembles of a random variable x_j = ∑_{i=1}^m a_ji s_i + e_j, where a_ji = b_ji / √(b_1i² + b_2i² + … + b_ni²), b_ji is a random variable with uniform distribution on [−1, 1], and s_i and e_j are random variables), and by neglecting the noise power, we have E(x_j²) = m E(a_ji²) E(s_i²). We know that E(a_ji²) = 1/n (which comes from ∑_{j=1}^n a_ji² = 1 and the independence of the a_ji's), and E(s_i²) = (1 − p) σ_r². With the assumption of p = 0.8, we will have σ̂_r = √( 5 n E(x_j²) / m ). For the noise variance, we choose σ̂_n = σ̂_r / 10.
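A minimal sketch of this simple parameter estimate and of the SNR criterion defined above; reading E(x_j²) off the sample mixtures is our own interpretation of the ergodicity argument.

import numpy as np

def estimate_parameters(x, n, m):
    Ex2 = np.mean(x**2)                        # sample estimate of E(x_j^2)
    sigma_r_hat = np.sqrt(5 * n * Ex2 / m)     # uses the assumed p = 0.8, as above
    return sigma_r_hat, sigma_r_hat / 10       # sigma_n_hat = sigma_r_hat / 10

def snr_db(s, s_hat):
    # SNR = 10 log10( ||s||^2 / ||s_hat - s||^2 )
    return 10 * np.log10(np.sum(s**2) / np.sum((s_hat - s)**2))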
The results of our simulation are shown in Fig. 1. These results show a 3–4 dB improvement (with actual parameters) and a 1–2 dB improvement (with simply estimated parameters) of our algorithm over the LP method. Although the CPU time is not an exact measure of complexity, it can give us a rough estimate of it, and we compare our algorithm with LP using this measure. Our simulations were performed in the MATLAB 7.0 environment using an Intel 2.40 GHz processor with 512 MB of RAM under the Microsoft Windows XP operating system. For one typical simulation, our algorithm takes about 34 seconds while the (interior-point) LP method requires about 204 seconds. So our algorithm is roughly one order of magnitude faster.
6 Conclusions
In this paper, a relatively fast method for finding the sparse solution of an under-determined system of linear equations was proposed. The method was based on iterative MAP estimation of the sources. The algorithm is approximately one order of magnitude faster than (interior-point) LP, while providing a 1–2 dB improvement (with simply estimated parameters). The better performance is due to the optimality of our algorithm, which is based on optimum MAP estimation of the sources. The simplicity of our algorithm (and its high speed) comes from the iterative estimation of the source activities and amplitudes, and from the use of an efficient steepest descent method in the M-step.
References

1. Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13(4), 863–882 (2001)
2. Gribonval, R., Lesage, S.: A survey of sparse component analysis for blind source separation: principles, perspectives, and new challenges. In: Proceedings of ESANN'06, pp. 323–330 (2006)
3. Davies, M., Mitianoudis, N.: Simple mixture model for sparse overcomplete ICA. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 35–43. Springer, Heidelberg (2004)
4. Li, Y.Q., Amari, S., Cichocki, A., Ho, D.W.C., Xie, S.: Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing 54(2), 423–437 (2006)
5. Zayyani, H., Babaie-Zadeh, M., Jutten, C.: Source estimation in noisy sparse component analysis. Accepted in DSP 2007 (2007)
6. Balan, R., Rosca, J.: Source separation using sparse discrete prior models. In: Proceedings of ICASSP'06 (2006)
7. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal ℓ1 norm is also the sparsest solution. Technical Report (2004)
8. Donoho, D.L., Elad, M., Temlyakov, V.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory 52(1), 6–18 (2006)
9. Mohimani, G.H., Babaie-Zadeh, M., Jutten, C.: Fast sparse representation based on smoothed ℓ0 norm. Accepted in ICA 2007 (2007)
10. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1), 31–61 (1999)
11. Mallat, S., Zhang, Z.: Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing 41(12), 3397–3415 (1993)
12. Djafari, A.M.: Bayesian source separation: beyond PCA and ICA. In: Proceedings of ESANN'06 (2006)
13. Anderson, B.D., Moore, J.B.: Optimal Filtering, 2nd edn. Prentice-Hall, Englewood Cliffs (1979)
Appendix: Steepest Descent Algorithm

From (13), we have:

∂L(q)/∂q = (∂/∂q) ∑_{i=1}^m log(p(q_i)) − (1/2σ_n²) (∂/∂q) (x − A Q r̂)ᵀ (x − A Q r̂).   (17)
We define g(q) ≜ −σ_0² (∂/∂q) ∑_{i=1}^m log(p(q_i)) and n(q) ≜ (x − A Q r̂)ᵀ (x − A Q r̂). With these definitions, the scalar function g(q_i) and n(q) (omitting the constant terms) can be computed as:

g(q_i) = [ p q_i exp(−q_i²/2σ_0²) + (1 − p)(q_i − 1) exp(−(q_i − 1)²/2σ_0²) ] / [ p exp(−q_i²/2σ_0²) + (1 − p) exp(−(q_i − 1)²/2σ_0²) ],   (18)

n(q) = −2 xᵀ A Q r̂ + r̂ᵀ Q Aᵀ A Q r̂.   (19)
With the definitions C ≜ Aᵀ A, n_1(q) ≜ −2 xᵀ A Q r̂ and n_2(q) ≜ r̂ᵀ Q C Q r̂, we can write:

∂n_1(q)/∂q = diag(−2 xᵀ A) r̂.   (20)

If we define W ≜ Q r̂ (an m × 1 vector), then n_2(q) = Wᵀ C W, and so we have:

∂n_2(q)/∂q_i = ∑_{j=1}^m (∂n_2(q)/∂W_j)(∂W_j/∂q_i).   (21)

From the vector derivatives, we have ∂n_2(q)/∂W = 2 C W ≜ d. Also, from the definition of W we have ∂W_j/∂q_i = r_i δ_ij. So (21) becomes ∂n_2(q)/∂q_i = ∑_{j=1}^m d_j r_i δ_ij = r_i d_i, and the vector form of (21) is:

∂n_2(q)/∂q = diag(d) r̂.   (22)

From (20) and (22), n(q) = n_1(q) + n_2(q) and the definitions of the vectors d and C, we can write:

∂n(q)/∂q = 2 diag(Aᵀ A Q r̂ − Aᵀ x) r̂.   (23)

Finally, combining (23), (17) and (15) with the definitions of n(q) and g(q) yields the steepest descent iteration in (16).
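As a quick numerical check of (23) (not part of the paper), the analytic gradient of n(q) can be compared against central finite differences; all data below are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 7
A = rng.standard_normal((n, m))
x = rng.standard_normal(n)
r = rng.standard_normal(m)
q = rng.random(m)

n_of_q = lambda qv: float((x - A @ (qv * r)) @ (x - A @ (qv * r)))  # n(q) = ||x - A Q r||^2
analytic = 2 * (A.T @ (A @ (q * r)) - A.T @ x) * r                  # right-hand side of (23)

eps = 1e-6
numeric = np.array([(n_of_q(q + eps * np.eye(m)[i]) - n_of_q(q - eps * np.eye(m)[i])) / (2 * eps)
                    for i in range(m)])
print(np.allclose(analytic, numeric, atol=1e-5))                    # True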
Mutual Interdependence Analysis (MIA)

Heiko Claussen¹,², Justinian Rosca¹, and Robert Damper²

¹ Siemens Corporate Research Inc., 755 College Road East, Princeton, New Jersey 08540, USA
{Heiko.Claussen,Justinian.Rosca}@siemens.com
² University of Southampton, School of Electronics and Computer Science, Southampton SO17 1BJ, UK
{hc05r,rid}@ecs.soton.ac.uk
Abstract. Functional Data Analysis (FDA) is used for datasets that are more meaningfully represented in functional form. Functional principal component analysis, for instance, is used to extract a set of functions of maximum variance that can represent the data. In this paper, a method of Mutual Interdependence Analysis (MIA) is proposed that extracts a function that is equally correlated with all inputs in a set. Formally, the MIA criterion defines the function whose correlations with the inputs have minimal variance around their mean. The meaningfulness of the MIA extraction is demonstrated on real data. In a simple text independent speaker verification example, MIA is used to extract a signature function for each speaker, and results in an equal error rate of 2.9% on a set of 168 speakers.
1 Introduction
Principal Component Analysis (PCA), discussed in [1] for nonfunctional and [2] for functional data, is an essential feature extraction method. Principal component functions are such that they can approximate new data with minimum mean square error even if only a subset of all components is used. The principal component functions are given by the directions of maximum variance in the projections of the data. The directions of minimum variance, or minor components, have received much less attention in the literature. Minor components correspond to directions that are either orthogonal to the inputs or minimize the variance of their projections. Hence, they represent an invariant or “common” direction of the inputs. Further information about PCA and MCA, including their probabilistic models, can be found in [3], [4], [5] and [6]. [7] acknowledged that minor components are important in some signal processing applications. Examples are spectral estimation, curve and hyper-surface fitting, cognitive perception and computer vision. This work coined the name Minor Component Analysis (MCA) to refer to neural network fitting methods that compute minor components. The authors also suggested that MCA solves the Total Least Square (TLS) problem [8]. Recently, a method called Extreme Component Analysis (XCA) was introduced that combines minor and principal components in order to represent a
dataset “optimally”. As discussed in [9], whether the “optimal” representation is by minor or principal components depends on the dataset. By freely choosing among minor components, principal components and their combinations, datasets can be represented with a possibly lower number of components. As previously mentioned, MCA can be used to extract minor components that represent invariants of the data. However, when it is desired to extract the mutual interdependencies of a set of input functions, MCA suffers because of the necessary preprocessing step of input data centering. In the usual case, where the input functions are linearly independent, centering reduces the span of the data. Therefore, these centered functions can no longer fully represent the inputs. Furthermore, as discussed in [10], the TLS and therefore the MCA solution is nontrivial and usually cannot be found in closed form. In this paper, we propose Mutual Interdependence Analysis (MIA) to solve a TLS-like optimization problem in the span of the original inputs in order to extract a mutually interdependent function from the input function set. We prove that in the case of linearly independent input functions, a closed form solution can be found that minimizes the MIA criterion. In section 2 we define the MIA problem, derive its solution and illustrate its properties. In the experimental section 3, MIA is used for simple text independent speaker recognition. We end the paper with conclusions and directions for further work.
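The remark above that centering reduces the span can be checked numerically: for D linearly independent inputs, subtracting the mean function drops the rank of the data matrix from D to D − 1. A minimal sketch with random placeholder inputs:

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.standard_normal((N, D))            # columns are D linearly independent inputs
Xc = X - X.mean(axis=1, keepdims=True)     # centered inputs x_i - x^(m)
print(np.linalg.matrix_rank(X),            # 3
      np.linalg.matrix_rank(Xc))           # 2: the centered functions span a smaller space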
2 Mutual Interdependence Analysis (MIA)
Throughout the paper, we use xᵀ to denote the transpose of the column vector x. Also, we denote by X the matrix whose columns are x_i with i = 1, …, D. We use 1 to represent a vector of ones, 𝟙 to represent a matrix of ones and I to be the identity matrix; the dimensions will be clear from the context. Consider D real inputs x_i(t_j) with i = 1, …, D and j = 1, …, N, where each input x_i = [x_i(t_1), …, x_i(t_N)]ᵀ is viewed as a single entity (i.e., it has the intrinsic structure of a function) rather than a series of individual observations. Functional Data Analysis (FDA) normally treats data this way, therefore we refer to each x_i as an input. In our case, N is typically larger or much larger than D. TLS solves a linear equation Xᵀ · s = b by finding a direction s that minimizes the squared orthogonal distances to the data points x_i:

min_s ∑_{i=1}^D |x_iᵀ · s − b_i|² / (sᵀ · s + 1).

On the other hand, the ordinary Least Squares (LS) problem finds a direction s that minimizes

min_s ∑_{i=1}^D (x_iᵀ · s − b_i)².

While LS assumes only the vector b to contain errors, the TLS approach models uncertainties both in b and in the data X. Our goal is different: extract a new vector (function) s that “optimally” represents the D input functions. An optimal solution is a function that is maximally correlated with all inputs, with the constraint that it is a linear combination of the inputs. We call this problem Mutual Interdependence Analysis (MIA). Formally, the optimality criterion J(X|s), given a functional data series s, is as follows:
J(X|s) = ∑_{i=1}^D ( sᵀ · x_i − (1/D) ∑_{k=1}^D sᵀ · x_k )².   (1)
Our problem is to find the maximum likelihood vector ŝ of norm one, in the span of the inputs, that minimizes J(X|s):

ŝ = arg min_{s : ‖s‖=1, s = ∑_{k=1}^D c_k·x_k} J(X|s).   (2)
2.1 Solution to MIA
Let us find an equivalent formulation for the MIA problem in (2). Consider the mean function x^(m) = (1/D) ∑_{i=1}^D x_i and the centered functions x_i^(c) = x_i − x^(m) with i = 1, …, D. It can be easily shown that sᵀ · x_i − (1/D) ∑_{k=1}^D sᵀ · x_k = sᵀ · x_i^(c). Hence, (2) becomes:

ŝ = arg min_{s : ‖s‖=1, s = ∑_{k=1}^D c_k·x_k} ‖sᵀ · X^(c)‖²,   (3)

where X^(c) has as columns the x_i^(c). Then, x^(m) = (1/D) X · 1 and X^(c) = [x_1 − x^(m) | … | x_D − x^(m)] = X − x^(m) · 1ᵀ. It follows that

X^(c) = X · P   with   P = I − (1/D) 𝟙.   (4)
Obviously, ∑_{i=1}^D x_i^(c) = 0. Hence, the nullspace NULL(x_1^(c), x_2^(c), …, x_D^(c)) is nontrivial. All vectors s ∈ NULL(x_1^(c), x_2^(c), …, x_D^(c)) will minimize J(X|s). In the next theorem, we show that the problem given by (3) has at least one solution.

Theorem 1. Assume x_1, x_2, …, x_D are linearly independent. Then, there exists s ≠ 0 in NULL(x_1^(c), x_2^(c), …, x_D^(c)) such that s is in the span of the inputs x_i, i = 1, …, D.

Proof. A solution s ≠ 0 and c = [c_1, c_2, …, c_D]ᵀ ∈ R^D of the system of equations

sᵀ · X^(c) = 0,   (5)
s = X · c,   (6)

will also satisfy the theorem and solve the optimization criterion of problem (3). Indeed, (5) is equivalent to the existence of s such that s ∈ NULL(x_1^(c), x_2^(c), …, x_D^(c)), and (6) specifies that s is in the span of the inputs x_i with s ≠ 0. Let us substitute s from (6) and X^(c) from (4) into (5):

⇒ cᵀ · Xᵀ · X · P = 0.   (7)
Given that G = (Xᵀ · X) is a Gram matrix formed by the linearly independent vectors x_1, x_2, …, x_D, G is invertible (see Theorem 7.2.10 in [11]). Let

cᵀ = dᵀ · (Xᵀ · X)⁻¹.   (8)
Therefore, (7) becomes:

dᵀ · P = 0   with   d = ζ 1 and ζ ∈ R.   (9)

Substituting (9) into (8) gives c = ζ (Xᵀ · X)⁻¹ · 1. Hence:

ŝ = ζ X · (Xᵀ · X)⁻¹ · 1.   (10)
xi ← xi − λ 1
∀i
(11)
then the solution of J(X |s) changes. Indeed, it can be easily proven that the criterion (1) itself is invariant to a translation (11), however the constraint of (2) will require that the solution ˆs is in the span of xi . 2.2
Example with Synthetic Data
A synthetic example is given below to compare MIA, MCA, PCA and Independent Component Analysis (ICA) (see [12]). Assume three inputs given by: x1 = f1 + f2 x2 = f1 − 0.5 f2 x3 = 2 f1 + 2 f3 +10 1 where f1 = sin( 2Πi N ) i = 1, . . . , N , f2 is Gaussian noise N (0, 1) and f3 is Laplacian noise. The inputs and results of the methods are illustrated in Fig. 1. The MIA solution closely approximates f1 , in contrast to the other methods.
3 Application: Text Independent Speaker Verification
In this section, we apply MIA to the problem of extracting signatures from speech data for the purpose of text independent speaker recognition. This problem is challenging when we need to verify the identity of a person but cannot control the way the data is acquired (i.e., recording equipment, environment, etc.). For this study, we have used the TIMIT database [13]. Data from 168 speakers was partitioned 50-50 for training and testing. The data was preprocessed by silence removal and normalization of each recording. Data for a given speaker was used as input to MIA in order to generate a speaker signature as described below. We compare the equal error rate (EER) results for speaker verification obtained with MIA versus the PCA- and ICA-based methods described in [14].
[Figure here: (a) the inputs x_1, x_2, x_3 plotted as amplitude versus sample number; (b) the 1st IC (ICA), 1st PC (PCA), 1st MC (MCA) and s (MIA), each plotted as normalized amplitude versus sample number.]

Fig. 1. (a) Inputs x_i are linear combinations of three basis functions. (b) Signals extracted using ICA, MCA, PCA and MIA. The MIA result, with λ = 0, is meaningful.
3.1 Data Model
A speech signal can be modeled as an excitation that is convolved with a linear dynamic filter which represents the vocal tract. The excitation signal can be modeled for voiced speech as a periodic signal and for unvoiced speech as random noise. It is common to analyze the voiced and unvoiced speech separately [15]. In this example, only the voiced speech is used for speaker recognition. Let E^(p), H^(p) and V^(p) be the spectral representations of the excitation, the vocal tract filter and the voiced signal parts of a person p, respectively. Moreover, let M represent speaker independent signal parts in the spectral domain (i.e., recording equipment, environment, etc.). Therefore, the data can be modeled as:

V^(p) = E^(p) · H^(p) · M.   (12)
By cepstral deconvolution, the model can be represented as a linear combination of its basis functions:

x_i^(p) = log V^(p) = log E^(p) + log H^(p) + log M.   (13)

This model suggests that we could use MIA to extract a function that represents the speaker's signature. In practice, we take speech segments of about one second as MIA inputs x_i^(p) in order to achieve spectral accuracy. An example of inputs x_i^(p) is shown in Fig. 2(a). MIA will therefore extract signatures that capture typical speaker dependent correlations in the logarithmic spectral domain. Speaker independent signal parts M will be minimized if they are not equally present in all MIA inputs.
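A sketch of forming such log-spectral MIA inputs from a voiced recording. The segmentation into fixed one-second blocks and the small flooring constant are our own simplifications of the windowing described in the results section.

import numpy as np

def mia_inputs(voiced, fs, n_segments=8):
    seg = fs                                   # one-second segments
    cols = []
    for k in range(n_segments):
        frame = voiced[k * seg:(k + 1) * seg]
        cols.append(np.log(np.abs(np.fft.rfft(frame)) + 1e-12))  # x_i^(p), cf. (13)
    return np.column_stack(cols)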
3.2 MIA-Based Text Independent Speaker Verification
We partition the training and the testing data for each person p into D = 8 segments {x_i^(p)}_{i=1,…,D} of one second each. For each person, we extract a voice signature s^(p)
using MIA. The cosine distance between the testing data signatures and the training data signatures is used as a measure of similarity. A matrix that represents the cosine distances between all signatures in the database is illustrated in Fig. 2(b). Let N_FA and N_CA represent the numbers of false and correct acceptances, respectively, and let N_U be the number of registered users. The speaker recognition results are evaluated by comparing the false acceptance rate FA with the false rejection rate FR, calculated as follows:

FA = N_FA / (N_U² − N_U)   and   FR = (N_U − N_CA) / N_U.   (14)
N_FA and N_CA result from testing each learned speaker signature against all test speaker signatures in the entire database. This means that N_U² tests are performed, including N_U correct combinations and N_U² − N_U impostors. By changing a threshold value, more people can be accepted, which results in a trade-off between FA and FR. The FA versus FR plot for this example is illustrated in Fig. 3(a). The equal error rate (EER), where FA equals FR, is used to compare with previous results in [14].
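A small sketch of this evaluation protocol, assuming a precomputed N_U × N_U cosine-similarity matrix (rows: test signatures, columns: learned signatures); the function name is ours.

import numpy as np

def fa_fr(similarity, threshold):
    NU = similarity.shape[0]
    accept = similarity >= threshold
    N_CA = np.trace(accept)                 # genuine trials lie on the diagonal
    N_FA = accept.sum() - N_CA              # off-diagonal acceptances are impostors
    return N_FA / (NU**2 - NU), (NU - N_CA) / NU   # (FA, FR), see (14)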
[Figure here: (a) the inputs plotted as logarithmic amplitude versus frequency (0–8 kHz); (b) the matrix of similarity scores, test signature index versus learned signature index (1–168).]

Fig. 2. MIA applied to text independent speaker recognition. (a) Representation of the input data {x_i^(p)}_{i=1,…,8}, given by speech segments of a single speaker p in the logarithmic Fourier domain. (b) Matrix of similarity scores between different signatures. Bright gray stands for high and dark gray for low similarity between signatures.
3.3 Results
We used a set of 168 speakers from the TIMIT database [13]. For each signature extraction from the training and testing data, we used 8 seconds of voiced speech. The data was partitioned into 8 windows with non-overlapping, nearly rectangular windowing functions of one second length and Gaussian tails of 1/20 second. The input functions had their mean subtracted. Thereafter, each extracted signature was down-sampled to 256 points. The mean signature was subtracted
from all signatures to focus the evaluation on differences. The comparison of the MIA-based signature with the 1st minor component and the 1st principal component is illustrated in Fig. 3(b). Note visually that only MIA extracts signal amplitudes in accordance with the well-known result that low frequencies contain most information about a speaker. In order to alter the proportion between FA and FR, two different thresholds were used. First, a trace dependent threshold treats every user the same. Secondly, the variance dependent threshold uses information about the similarity between test cases to learn a speaker dependent weighting. The EER of this MIA-based text independent speaker recognition system was 2.9% for the variance dependent threshold and 5.4% for the trace dependent threshold. For a similar experiment, [14] reports EERs between 4.3% and 6.1% using ICA and PCA features. It has to be noted that that test was done with 462 speakers; however, one person was there represented by 16 or 32 features of 128 or 256 samples in length, whereas MIA uses only a single signature of 256 samples per speaker.

[Figure here: (a) false rejection rate versus false acceptance rate (percent), for the speaker-dependent variance threshold and the trace-dependent equal threshold; (b) normalized amplitude versus frequency (0–8 kHz) for the 1st MC (MCA), the 1st PC (PCA) and s (MIA).]

Fig. 3. Results of MIA-based text independent speaker verification. (a) False rejection (FR) versus false acceptance (FA) rate. (b) Speaker signature extracted by MIA from {x_i^(p)}_{i=1,…,8}. Also plotted are the first minor component (MCA) and the first principal component (PCA) of the data.
4 Conclusion
We proposed a novel feature extraction method, Mutual Interdependence Analysis (MIA), which finds an invariant function of the input function set, representing the direction of minimum variance of input projections. Intuitively, this function is mutually interdependent with all inputs. The proof of the minimization problem exploits the unconstrained span of the original input data to infer a closed form solution. Furthermore, we showed the effect of input data translation by a constant value λ. In this way, one can control the degree of correlatedness with outlier functions in the input set. Indeed, one can choose a value of λ to
discriminate between inputs and bias towards a result which correlates only with a subset of them. Further work will analyze the robustness of MIA to noise. Moreover, the effect of changes in the span constraint, or the choice of a basis function set, will be explored.
Acknowledgments. We would like to thank Radu Balan for his invaluable input to this paper.
References

1. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer, Heidelberg (2002)
2. Ramsay, J.O., Silverman, B.W.: Functional Data Analysis, 2nd edn. Springer, Heidelberg (2006)
3. Oja, E.: Principal components, minor components, and linear neural networks. Neural Networks 5, 927–935 (1992)
4. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622 (1999)
5. Williams, C., Agakov, F.: Products of Gaussians and probabilistic minor components analysis. Neural Computation 14(5), 1169–1182 (2002)
6. Thompson, P.A.: An adaptive spectral analysis technique for unbiased frequency estimation in the presence of white noise. In: Systems and Computers, pp. 529–533 (1980)
7. Xu, L., Oja, E., Suen, C.Y.: Modified Hebbian learning for curve and surface fitting. Neural Networks 5, 441–457 (1992)
8. Moon, T.K., Stirling, W.C.: Mathematical Methods and Algorithms for Signal Processing. Prentice-Hall, Upper Saddle River, NJ (2000)
9. Welling, M., Agakov, F., Williams, C.K.I.: Extreme components analysis. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA (2004)
10. Rao, Y.N., Principe, J.: Efficient total least squares method for system modeling using minor component analysis. In: Proceedings of the International Workshop on Neural Networks for Signal Processing, pp. 259–268 (2002)
11. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1999)
12. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley and Sons, Chichester (2001)
13. Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., Zue, V.: TIMIT acoustic-phonetic continuous speech corpus. CD-ROM (1993)
14. Rosca, J., Kofmehl, A.: Cepstrum-like ICA representations for text independent speaker recognition. In: ICA, pp. 999–1004 (2003)
15. Deng, L., O'Shaughnessy, D.: Speech Processing: A Dynamic and Optimization-Oriented Approach. Signal Processing and Communications. Marcel Dekker, Inc. (2003)
Modeling Perceptual Similarity of Audio Signals for Blind Source Separation Evaluation

Brendan Fox, Andrew Sabin, Bryan Pardo, and Alec Zopf

Northwestern University, Evanston, IL 60201, USA
[email protected]
http://music.cs.northwestern.edu
Abstract. Existing perceptual models of audio quality, such as PEAQ, were designed to measure audio codec performance and are not well suited to evaluation of audio source separation algorithms. The relationship of many other signal quality measures to human perception is not well established. We collected subjective human assessments of distortions encountered when separating audio sources from mixtures of two to four harmonic sources. We then correlated these assessments to 18 machine-measurable parameters. Results show a strong correlation (r=0.96) between a linear combination of a subset of four of these parameters and mean human assessments. This correlation is stronger than that between human assessments and several measures currently in use. Keywords: Source Separation, Perceptual Model, Music, Audio.
1 Introduction
Blind Source Separation (BSS) is the process of isolating individual source signals from mixtures of source signals, when the characteristics of the individual sources are not known beforehand. BSS is an active area of research [1,2,3,4,5] and new techniques are continually developing. The effectiveness of a BSS algorithm is typically measured by comparing the quality of a signal extracted from a mixture (the signal estimate) to the original source signal. Given this methodology, it becomes important to choose an error measure that captures the salient differences between the original and the estimate. Our research [6] focuses on source separation of acoustic sound sources from audio mixtures. Because our ultimate goal is the creation of audio for a human listener, human perception determines what we consider “good” results. Unfortunately, it is not practical to conduct a human listening study each time one varies a parameter of a BSS algorithm. Thus, researchers typically use machine-measurable signal quality measures. Most BSS researchers for audio applications use existing measures of signal quality such as the Signal to Distortion Ratio (SDR) [6] or quality measures specific to audio source separation, such as the Signal to Interference Ratio (SIR) [10]. The relationship between human perception of signal quality and such commonly used machine-measurable statistics remains unstudied over the range
of distortions introduced by audio source separation algorithms. This makes it difficult to estimate the perceptual effect of a change in the value of such statistics. One approach to measuring BSS effectiveness for audio applications has been to use the PEAQ (Perceptual Evaluation of Audio Quality) [7] perceptual model for audio codecs. PEAQ calculates a set of statistics about the audio that are fed into a three-layer feed-forward perceptron that maps the statistics onto a single quality rating called an Objective Difference Grade (ODG). Vanam and Creusere implemented a version of PEAQ that improves its correlation with subjective human data for intermediate quality codecs [8]. Unfortunately, their improvement depends on the kinds of distortions introduced by particular audio codecs, making it unsuitable for BSS evaluation. Although PEAQ works well for evaluating the small degradations of audio signals introduced by audio compression codecs, the measure has shortcomings when evaluating signals with the larger distortions resulting from source separation. For these signals, PEAQ does not correlate well with subjective human quality assessments and often saturates at the maximum possible rating. Thus, the relationship of the currently used measures of BSS effectiveness to human perception of audio quality is not well established. In this paper we measure the correlation of 18 existing machine-measurable statistics to human perception of signal quality for sounds extracted from audio mixtures with BSS. We then create a combined model from those statistics that correlate best with human perception.
2 Study of Perceived Sound Similarity
We performed a study to collect human similarity assessments between reference recordings and distorted versions of the references extracted from audio mixtures using BSS algorithms. For this study, each participant was seated at a computer terminal. A series of audio recordings, in matched pairs, was played to the participant over headphones. Each pair consisted of a reference audio recording followed by a distorted version of the recording, called the test. The participant had only one chance to hear each pair. For each pair, the participant was asked to rate the similarity of the reference sound to the test sound on a scale from 0 to 10, where the values correspond to the following ratings:

10 – Signals are indistinguishable
8 – Signals are just barely distinguishable
6 – Signals strongly resemble each other but are easily distinguishable
4 – Signals resemble each other
2 – Signals just barely resemble each other
0 – Signals are completely dissimilar

The task began with a short training session of ten pairs to familiarize the participant with the task. Participants then listened to 130 audio pairs and rated
the similarity of each pair. The task, complete with instructions, typically took less than one hour per participant. We collected responses from 31 participants drawn from Northwestern University students, faculty and staff. The median participant age was 22 and the age range was from 18 to 35. Just under half (15) of the participants were male and 16 were female. Participants were screened to ensure they had never been diagnosed with a hearing disorder or language disorder.
2.1 Audio Corpus
The reference audio recordings used in the study are individual long tones, ranging from 2–4 seconds, played on the alto saxophone, linearly encoded as 16-bit, 44.1 kHz audio. Mixtures of these recordings were created to simulate the stereo microphone pickup of spaced source sounds in an anechoic environment. We assume omni-directional microphones, spaced according to the highest frequency we expect to process. Instruments were placed in a semi-circle around the microphone pair at a distance of one meter. In the two-instrument mixtures, the difference in azimuth angle from the sources to the microphones was 180 degrees. The BSS algorithms we currently study depend on having significant differences in azimuth between sources: a difference of 180 degrees between sources produces the best results, and as the difference tends to 0 degrees, source separation degrades. To generate a range of BSS output from good to bad, instruments were placed at random angles around the microphones in the three- and four-instrument mixtures. For each mixture, each source signal was assigned a randomly selected pitch from the 13 pitches of the equal tempered chromatic scale from C4 through C5. We created nine two-instrument mixtures, six three-instrument mixtures, and five four-instrument mixtures in this manner, which gives a total of 56 individual instrument comparisons, once extracted. This provides an approximately equal number of single-note samples for each type of mixture. Mixtures were separated using the Active Source Estimation (ASE) [6] and DUET [4] source separation algorithms, resulting in 112 extracted sounds. The corpus also included a set of calibration sounds. For these calibration sounds, the proportion of altered time-frequency frames varied from 0.2 to 1.0 (where 1 means all frames were altered). In altered frames, the phase was randomly varied over the full range and the amplitude was randomly varied between 8 dB and 20 dB. For eight of the example pairs the test sound was a repeat of the reference sound. The 112 extracted examples, 10 manually distorted examples, and 8 repeat examples give a total of 130 example pairs for the test corpus.
2.2 Human Study Results
In our listening data we included eight reference-test pairs where the reference and test sounds were identical. We excluded data from three participants who proved unreliable at labeling identical pairs as highly similar (a 9 or a 10). These
three participants gave an average similarity score below 8 for the set of identical pairs. This mean fell over two standard deviations below the mean similarity score given to identical pairs by the group as a whole. The group, excluding these three outliers, gave a mean similarity rating of 9.6 to the identical pairs. There was a strong correlation among the remaining 28 participants in the subjective similarity ratings assigned to example pairs of audio. We compared the individual response of each participant to the mean response reported by the group, excluding that participant. The correlation coefficient between each individual and the remainder of the group ranged from 0.8458 to 0.9737, with a median correlation of r = 0.9155. The left panel of Figure 1 illustrates the correlation between the mean group ratings and those of a randomly selected individual. Given the strength of correlation across participants, we based our perceptual model on the mean similarity ratings for each of the 130 reference-test pairs, averaged across the 28 remaining participants.
Fig. 1. (Left) Correlation between group mean ratings and those of a randomly selected participant (r = 0.935). Each point is one reference-test pair. (Right) The standard deviation of the range of participant responses, indexed by the mean value of these responses. Each point is one reference-test pair.
The right panel of Figure 1 shows the standard deviation of participant similarity ratings for example pairs, indexed by the mean response value. The line shown is a quadratic polynomial fit to the data with r = 0.83. The standard deviation is quite low for ratings toward the maximum (10) and minimum (0) similarity. There is an increase in across-participant variability at middle values, indicating more agreement at the extremes.
3 Modeling Human Responses
To build a model that effectively predicts human judgments of the similarity between two sounds, one must map machine-quantifiable measures onto human similarity assessments. In our study we consider the measures listed in Table 1. These measures were selected due to their use in the blind source separation community or as inputs to perceptual models used for audio codec evaluation. We refer the reader to the original paper citations for detailed definitions of these measures.
Table 1. Linear correlation of machine-measurable statistics to mean human subject ratings

Machine Measurable Statistic | r value
ISR - Ratio of signal energy to error due to spatial distortion [10] | 0.87563
SIR - Ratio of the signal energy to the error due to interference [10] | 0.82131
SAR - Ratio of signal energy to the error due to artifacts [10] | 0.75001
SDR - Signal to Distortion Ratio [6] | 0.72313
ODG - Output of the PEAQ model [7] | 0.67735
DIX - A measure of perceived audio quality [7] | 0.67074
BWT - Bandwidth of Test signal [7] | 0.48946
HSE - Harmonic structure of the error over time [7] | 0.13531
NLS - Noise Loudness in Sones [11] | -0.09409
AMD2 - Alternate calculation of average modulation difference [12] | -0.12796
NMR - Noise to mask ratio [7] | -0.34614
MPD - Maximum probability of detection after lowpass filter [12] | -0.36465
THD - Total Harmonic Distortion [7] | -0.48789
WMD - Windowed modulation difference [12] | -0.58947
BWR - Bandwidth of Reference signal (Hz) [7] | -0.67536
AMD - Average modulation difference [12] | -0.75003
RDF - Relative number of Distorted Frames [12] | -0.78455
ADB - Average Distorted Block [12] | -0.81710
As the table shows, the ODG values reported by the PEAQ perceptual model are only loosely correlated with human similarity assessments in our dataset. One might argue that this is because the ODG values are related to human assessments through a more complex function than a simple linear fit. This hypothesis is not supported when ODG is plotted against the mean human assessments, as shown in Figure 2. The figure also shows a ceiling effect for SDR and poor correlation between THD and the mean human assessment. The measures ISR, SIR and ADB all have stronger negative or positive correlation with the mean human similarity assessments than do ODG, SDR or THD.
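Each r value in Table 1 is a linear (Pearson) correlation between one statistic and the mean human rating over the 130 pairs. A minimal sketch with placeholder arrays (the data here are random stand-ins, not the study's values):

import numpy as np

rng = np.random.default_rng(0)
human = rng.uniform(0, 10, 130)                  # placeholder mean human ratings
measures = {"ISR": rng.standard_normal(130),     # placeholder per-pair statistics
            "SIR": rng.standard_normal(130)}
r = {name: np.corrcoef(vals, human)[0, 1] for name, vals in measures.items()}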
3.1 Results and Data Analysis
Fig. 2. Correlation of ODG (left), SDR (center) and THD (right) to human perceptions of audio similarity. Each data point indicates one example pair. The vertical axis shows the value of the measure and the horizontal axis the mean human similarity assessment.

We followed the lead of the PEAQ researchers by mapping objective signal measures to human assessments using a variety of feed-forward, multilayer perceptrons. Every network architecture used all measures except ODG (the output of the PEAQ perceptual model) from Table 1 as input, with one input node for each measure. We varied the number of hidden layers from 0 to 2 and the number of nodes in each hidden layer from 8 to 13. All networks used 11 output nodes, representing the ratings from 0 through 10 reported by human listeners. In training our neural networks, we applied 6-fold cross validation by dividing our full dataset (130 responses from 31 participants, making 4030 examples) into 6 bins. Figure 3 shows correlation results of the best-performing network for each number of hidden layers. For the neural network plots (all except the far right panel) the vertical coordinate of each point is determined by a weighted average of the output node activations for a given comparison pair. The horizontal coordinate is the mean human response for that pair. Perfect performance for a model would result in a straight line from (0,0) to (10,10), an r value of 1, and a root mean squared error (rmse) of 0. Here, the error for a single data point is the difference between the model output and the mean human response. As can be seen from the figure, the performance difference between network architectures is minor.
Fig. 3. Correlation of model responses to human responses. In every panel, each data point indicates one example pair. The vertical axis shows the output of the given model. The horizontal axis shows the mean similarity value over the 31 human participants. Both the r value and the root-mean-squared error (rmse) are shown for each network.
Neural networks with no hidden layer can only successfully discriminate linearly separable classes. Networks with hidden layers can discriminate between classes that are not linearly separable. Since the performance difference between our networks was negligible, we infer that a linear combination of the statistics from Table 1 could be used to map machine-measurable statistics of signal similarity onto human estimates of signal similarity. Thus, we fitted human responses to a multivariate linear regression model. As expected, its correlation to human
similarity assessments is nearly identical to those of the neural networks, as shown in Figure 3. To make a more parsimonious model of human similarity assessments, we performed a stepwise multivariate linear regression on the machine-measurable statistics used in our study. Here, the dependent variable was the mean human response and the independent variables were all the measures from Table 1. Stepwise multivariate linear regression generates a linear model using only those inputs that independently account for the most variance in the dependent variable. After performing this process, we achieved a linear fit to the data with r = 0.96 using only four of the measures from Table 1. The resulting linear model is shown in Table 2. The order in which parameters were added to the model is shown by "Entrance Order." Any measure from Table 1 not shown in Table 2 did not significantly increase the correlation of the linear model, given the measures already added to the model.

Table 2. Results of stepwise multivariate linear regression using mean human similarity responses as the dependent variable and the measures from Table 1 as the independent variables

Statistic | Coefficient (b) | Cumulative Correlation (r) | Entrance Order
constant offset | 14.968 | n/a | n/a
ISR | 0.194 | 0.876 | 1
SIR | 0.064 | 0.938 | 2
SAR | 0.103 | 0.952 | 3
MPD | -12.787 | 0.960 | 4
The r value corresponding to the multivariate linear regression model in Figure 3 is 0.913, while the corresponding value in Table 2 is 0.960. This is because the results shown in Figure 3 were generated using a 6-fold round-robin validation technique where there was no overlap between the training and testing sets, whereas the correlation in Table 2 was computed over the full data set rather than a subset.
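Copying the coefficients from Table 2 gives the fitted predictor directly; this sketch simply evaluates that linear model (the function name is ours).

def predicted_similarity(isr, sir, sar, mpd):
    # linear model of Table 2: constant offset plus weighted ISR, SIR, SAR and MPD
    return 14.968 + 0.194 * isr + 0.064 * sir + 0.103 * sar - 12.787 * mpd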
4 Conclusions
We have shown that a linear combination of four machine-measurable statistics can successfully model human similarity assessments for pairs of sounds with a correlation of r=0.96. Three of these statistics (ISR, SIR and SAR) are used in a recent comparison of multichannel audio source separation [10]. Correlation of a linear combination is not improved upon by the nonlinear modeling possible with a multilayer perceptron. For the range of signals under consideration (woodwinds distorted by audio source separation algorithms), the linear model performed much better than the PEAQ (ODG) perceptual model. In future work, we plan to expand the range of test signals over which we study human evaluations of similarity, with a focus on speech, as well as music. If the results of future
studies correlate with those reported here, this will provide further evidence that the measures with high correlation in Table 2 are the most useful statistics with which to measure the effectiveness of source separation for audio applications.

Acknowledgements. This work was funded in part by National Science Foundation Grant number IIS-0643752. We thank John Woodruff and Emmanuel Vincent for their help.
References
1. Anemüller, J., Kollmeier, B.: Amplitude Modulation Decorrelation for Convolutive Blind Source Separation. In: International Symposium on Independent Component Analysis and Blind Source Separation, Helsinki, Finland (1999)
2. O'Grady, P.D., Pearlmutter, B.A., Rickard, S.: Survey of Sparse and Non-Sparse Methods in Source Separation. International Journal of Imaging Systems and Technology 15(1), 18–33 (2005)
3. Virtanen, T., Klapuri, A.: Separation of Harmonic Sounds Using Multipitch Analysis and Iterative Parameter Estimation. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY (2001)
4. Yilmaz, O., Rickard, S.: Blind Separation of Speech Mixtures via Time-Frequency Masking. IEEE Transactions on Signal Processing 52(7), 1830–1847 (2004)
5. Master, A.: Stereo Music Source Separation via Bayesian Modeling. Doctoral Thesis, Stanford University, 1–199 (2006)
6. Woodruff, J., Pardo, B.: Using Pitch, Amplitude Modulation and Spatial Cues for Separation of Harmonic Instruments from Stereo Music Recordings. EURASIP Journal on Advances in Signal Processing (Article ID 86369) (2007)
7. Thiede, T., Treurniet, W., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J., Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K., Feiten, B.: PEAQ – The ITU Standard for Objective Measurement of Perceived Audio Quality. Journal of the Audio Engineering Society 48(1/2), 3–29 (2000)
8. Creusere, C.: Evaluating low bitrate scalable audio quality using advanced version of PEAQ and energy equalization approach. Acoustics, Speech, and Signal Processing, vol. 3, pp. 189–192 (2005)
9. Vincent, E.: Musical Source Separation Using Time-Frequency Source Priors. IEEE Transactions on Audio, Speech and Language Processing 14(1), 91–98 (2006)
10. Vincent, E., Sawada, H., Bofill, P., Makino, S., Rosca, J.: First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results. In: International Conference on Independent Component Analysis and Blind Source Separation (ICA) (2007)
11. Schroeder, M., Atal, B., Hall, J.: Optimizing digital speech coders by exploiting masking properties of the human ear. Journal of the Acoustical Society of America 66(6), 1647–1652 (1979)
12. Kabal, P.: An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality. McGill University Technical Report, pp. 1–96 (2003)
Beamforming Initialization and Data Prewhitening in Natural Gradient Convolutive Blind Source Separation of Speech Mixtures

Malay Gupta and Scott C. Douglas
Department of Electrical Engineering, Southern Methodist University, Dallas, Texas 75275 USA

Abstract. Successful speech enhancement by convolutive blind source separation (BSS) techniques requires careful design of all aspects of the chosen separation method. The conventional strategy for system initialization in both time- and frequency-domain BSS involves a diagonal center-spike FIR filter matrix and no data preprocessing; however, this strategy may not be the best for any chosen separation algorithm. In this paper, we experimentally evaluate two different approaches for potentially improving the performance of time-domain and frequency-domain natural gradient speech separation algorithms – prewhitening of the signal mixtures, and delay-and-sum beamforming initialization for the separation system – to determine which of the two classes of algorithms benefits most from them. Our results indicate that frequency-domain-based natural gradient BSS methods generally need geometric information about the system to obtain any reasonable separation quality. For time-domain natural gradient separation algorithms, either beamforming initialization or prewhitening improves separation performance, particularly for larger-scale problems involving three or more sources and sensors.
1 Introduction
Convolutive blind source separation (CBSS) refers to the separation of signals that have been mixed through a dispersive environment, using signal processing procedures that do not have specific knowledge of the source properties or the mixing conditions. Due to the dispersive nature of the channel, CBSS algorithms must attempt to undo both spatial and temporal mixing effects. As a result, CBSS algorithms tend to be more complicated than their spatial-only BSS counterparts. Frequency-domain approaches to CBSS transform the measured mixtures into the discrete frequency domain via the short-time Fourier transform (STFT) and apply spatial-only (instantaneous) BSS algorithms in each frequency component of the mixtures individually [1]. After separation in the frequency domain, these signals must be carefully reconstructed before being inverse-Fourier-transformed to recover the time-domain signals. This reconstruction process requires estimating the permutation and scaling ambiguities for all the frequency
components of the separated sources. Prior information about the array geometry and directions-of-arrival (DOAs) of the sources at the sensor array is often assumed, and several researchers have offered ways to use this information in the reconstruction process [2]–[6]. Post-processing permutation resolution can be computationally demanding if more than two sources are being separated, and in many cases a closed-form solution is not possible [4]. In contrast, time-domain CBSS algorithms adapt the impulse response of a multichannel linear filter using only as many output signals as the number of sources being extracted [7]–[10]. Because they use time-domain convolutions instead of frequency-domain multiplications, these methods tend to be more difficult to code, although their computational complexities can be made similar to those of frequency-domain approaches through block processing [8]. Since the algorithms employ a separation criterion whose number of outputs equals the number of sources being estimated, time-domain CBSS approaches do not appear to have severe source permutation problems across frequencies. These time-domain methods tend to converge more slowly, however, if careful strategies for algorithm implementation are not considered. One simple way to improve convergence of the time-domain method in [8] has been described in [11]. In this paper, we compare the use of two well-known strategies for improving the performance of CBSS algorithms: (1) beamforming initialization [5,12], and (2) multichannel prewhitening [8]. Both time-domain and frequency-domain versions of the well-known natural gradient CBSS method are evaluated and their performances compared to other competing approaches using data collected from a controlled laboratory measurement setup. These numerical experiments show that (a) beamforming initialization is required for frequency-domain natural gradient CBSS methods if no other technique is used to resolve permutation ambiguities, (b) prewhitening alone does not improve the performance of frequency-domain natural gradient CBSS methods, and (c) the performance of time-domain natural gradient algorithms improves with either signal prewhitening or beamforming coefficient initialization, and this improvement is significant when dealing with mixtures of more than two sources.
2 Time- and Frequency-Domain Signal Models
For multichannel acoustic recordings, the n-dimensional signal mixtures at time k, x(k) = [x_1(k) \cdots x_n(k)]^T, can be modeled as

x(k) = \sum_{l=-\infty}^{\infty} A_l \, s(k - l),    (1)

where \{A_l\} denotes a sequence of n \times m mixing matrices, A(z) = \sum_{l=-\infty}^{\infty} A_l z^{-l} is the multichannel system transfer function, and s(k) = [s_1(k) \cdots s_m(k)]^T is the m-dimensional signal vector at time k. All CBSS algorithms attempt to find a time-varying separating or demixing system B(k, z) to process the signal
mixtures x(k) = A\{s(k)\} such that y(k) = B\{x(k)\} contains the estimates of each of the sources in s(k) without repetition. Mathematically, this can be represented as

y(k) = \sum_{l=-\infty}^{\infty} B_l(k) \, x(k - l).    (2)

In practice, a truncated causal approximation to (2) is often employed, where L is a positive integer and

y(k) = \sum_{l=0}^{L} B_l(k) \, x(k - l).    (3)
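As a concrete reading of (3), the following sketch implements the truncated causal demixer with time-invariant taps; the array shapes and names are our own, with B holding the L+1 matrix taps and x the n-channel mixture.

```python
import numpy as np

def demix(B, x):
    """Truncated causal demixer (3): y(k) = sum_{l=0}^{L} B_l x(k - l).
    B: (L+1, m, n) filter taps; x: (n, T) sensor signals."""
    L_plus_1, m, n = B.shape
    T = x.shape[1]
    y = np.zeros((m, T))
    for l in range(L_plus_1):
        x_delayed = np.pad(x, ((0, 0), (l, 0)))[:, :T]  # x(k - l), zero-padded
        y += B[l] @ x_delayed
    return y
```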
Frequency-domain CBSS algorithms use the STFT to transform the time-domain data into the frequency domain, whereby a separate complex-valued instantaneous demixing system is found for each of the frequency components of the mixed signals. The input data in the lth frequency bin \omega_l is given by

x(\omega_l, k) = A(\omega_l) \, s(\omega_l, k),    (4)

where k denotes the time dependence of the STFT, s(\omega_l, k) is the transformed source signal vector, and A(\omega_l) denotes the mixing matrix for the lth frequency bin. As such, the demixing process in each frequency bin is formulated as

y(\omega_l, k) = B(\omega_l, k) \, x(\omega_l, k),    (5)

where y(\omega_l, k) and B(\omega_l, k) are the estimated source signal vector and the demixing matrix, respectively, in the lth frequency bin at time k.
3 Beamforming vs. Prewhitening in Convolutive Blind Source Separation
Both beamforming and prewhitening solve part of the goal achieved by successful application of CBSS methods in certain contexts. Beamforming and CBSS both attempt to suppress interference caused by spatially-distinct sources in order to extract individual source signals when operating on data collected from a uniform linear array. Beamforming methods work by providing maximum gain in the direction of the desired user, while CBSS-based methods have been observed to place spatio-temporal nulls in the directions of interfering users in some environments [12]. Beamforming methods typically assume a working knowledge of the sensor array manifold and the directions-of-arrival (DOAs). CBSS methods, on the other hand, typically assume no known signal or measurement structure other than a linear dispersive channel for the mixing process. Researchers have suggested merging beamforming with CBSS to include prior information about the array manifold and DOAs within CBSS algorithms [3]. These techniques are primarily designed to remove permutation difficulties that lead to lower separation
performance. In situations where DOA information is not available a priori, researchers have suggested procedures for estimating DOAs as part of the CBSS algorithm being developed [3,6]. In narrowband beamforming, the directional vector associated with a frequency \omega for a source impinging on the array from a direction \theta is given as

d(\omega, \theta) = [\exp(j\omega\tau_0(\theta)) \cdots \exp(j\omega\tau_{m-1}(\theta))]^T,    (6)
where \tau_l(\theta) = l d \sin(\theta)/c is the time delay associated with the lth sensor with respect to the reference sensor, m is the number of sensors in the array, d is the array element separation, and c is the speed of sound. Signals with significant frequency content (e.g. audio signals) received at a sensor array will typically have directional vectors associated with each of their frequency components. Thus, clustering of the directional vectors may be needed to obtain a consistent estimate of the source DOAs. Perhaps the simplest way to employ DOA knowledge to improve CBSS convergence performance is to initialize the separation system coefficients \{B_l(0)\} or their frequency counterparts \{B(\omega_l, 0)\} to a series of fixed beamformers in which the mainlobe of each of the beampatterns in each frequency bin for the ith separation system points toward a talker. In this case, we would choose

B(\omega_l, 0) = [d(\omega_l, \theta_1) \; d(\omega_l, \theta_2) \; \ldots \; d(\omega_l, \theta_m)]^H.    (7)
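A minimal sketch of the initialization in (6)–(7) for a uniform linear array follows; the function names are our own, and the talker angles, element spacing d, and sound speed c are assumed inputs.

```python
import numpy as np

def steering_vector(omega, theta, m, d, c=343.0):
    """d(omega, theta) for a uniform linear array, per (6)."""
    taus = np.arange(m) * d * np.sin(theta) / c   # tau_l(theta) = l d sin(theta)/c
    return np.exp(1j * omega * taus)

def beamforming_init(omegas, thetas, d, c=343.0):
    """B(omega_l, 0) = [d(omega_l, theta_1) ... d(omega_l, theta_m)]^H, per (7)."""
    m = len(thetas)
    B0 = []
    for w in omegas:
        D = np.stack([steering_vector(w, th, m, d, c) for th in thetas], axis=1)
        B0.append(D.conj().T)                     # Hermitian transpose
    return np.stack(B0)                           # shape (n_bins, m, m)
```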
For time-domain algorithms, we can compute the appropriate initial coefficients by taking the inverse FFTs of the frequency-domain responses in (7) about their points of symmetry. Initializing CBSS algorithms in this way does not modify the algorithm's operation other than choosing its initial state. The main alternative to this coefficient initialization is center-spike initialization, in which

B_l(0) = \begin{cases} I, & l = L/2 \\ 0, & \text{otherwise.} \end{cases}    (8)

For frequency-domain algorithms, we can compute the appropriate initial coefficients by taking the FFT of the time-domain responses in (8) about their points of symmetry. Center-spike initialization makes no assumption about the source-sensor array geometry. Prewhitening is a preprocessing strategy whereby the measured signals \{x_i(k)\}, 1 \le i \le n, are linearly filtered such that the filtered signals \{v_i(k)\} approximately satisfy

E\{v_i(k) v_j(k - l)\} \approx E\{|v_i(k)|^2\} \, \delta_{i-j} \, \delta_l,    (9)
where δl is the Kronecker delta function. These prewhitened signals are used in place of {xi (k)} in the separation system. Examples of prewhitening algorithms include the linear phase adaptive procedure in [13] and the least-squares multichannel linear predictor described in [8]. When block processing is used, one can use successive filtering operations to process {x(k)} to produce v(k), which is likely the most computationally efficient method.
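For comparison with the beamforming sketch above, the following sketch builds the center-spike initialization of (8) and numerically checks how closely a set of signals satisfies the prewhitening condition (9); the shapes and names are again our own, and the whiteness check is a diagnostic, not a prewhitening algorithm itself.

```python
import numpy as np

def center_spike_init(L, n):
    """Center-spike initialization (8): identity at tap l = L/2, zero elsewhere."""
    B = np.zeros((L + 1, n, n))
    B[L // 2] = np.eye(n)
    return B

def whiteness_error(v, max_lag):
    """Mean magnitude of the lagged correlations that (9) requires to vanish."""
    n, T = v.shape
    errs = []
    for l in range(max_lag + 1):
        R = v[:, l:] @ v[:, :T - l].T / (T - l)   # estimate of E{v(k) v(k-l)^T}
        if l == 0:
            R = R - np.diag(np.diag(R))           # lag-0 diagonal may be nonzero
        errs.append(np.abs(R).mean())
    return float(np.mean(errs))
```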
Prewhitening solves part of the CBSS task, as decorrelation is a necessary but not sufficient condition for the separation of mixtures of statistically-independent signals. Hence, it is reasonable to use prewhitening as a preprocessing step to remove any signal correlations contained in the data prior to performing separation with any CBSS method. In this case, the input signals \{x_i(k - l)\} used in the separation system are replaced by the prewhitened signals \{v_i(k - l)\} obtained at the outputs of the prewhitening system. One can view beamforming initialization and prewhitening as two simple but competing approaches for improving the performance of CBSS methods that do not require significant alteration of the separation algorithm. Note that prewhitening effectively alters the DOAs seen by the separation system within the prewhitened data, so using both prewhitening and beamforming initialization does not make sense unless special constraints are placed on the prewhitening task. It is unclear without performing experiments which procedure is to be preferred, and whether both frequency-domain and time-domain CBSS algorithms benefit from such procedures. The goal of this paper is to explore these issues through experimental evaluation on real-world speech signal mixtures, to see which classes of algorithms benefit most from them.
4 Numerical Experiments
We now present numerical evaluations to illustrate the separate effects that beamforming initialization and data prewhitening have on the behaviors of one class of CBSS algorithms. In order to minimize any performance effects due to the choice of separation criterion, we focus on the natural gradient algorithms presented in [4] and [8]. These algorithms attempt to minimize the mutual information of the extracted signals using frequency-domain and time-domain system structures, respectively. For comparison, we show the performance of two other algorithms on this data: one employing decorrelation with geometric beamforming constraints [3], and one using contrast-based optimization with prewhitening [9,10]. These latter algorithms incorporate either beamforming or prewhitening within their structures and are not claimed to work without such pre-processing. Data for these evaluations was generated in an acoustically-isolated laboratory environment with three loudspeakers playing recordings of talkers (one female and two male) as the sources. The sources were located 127 cm away from the three omnidirectional microphones and were spaced at angles of −30°, 0°, and 27.5°, respectively, from the array normal. The inter-sensor spacing of the microphone array was 4 cm. Acoustic foam was placed on the walls of the room to obtain a reverberation time of 300 ms for the environment. All recordings were made using 7 seconds of data per channel and a 48 kHz sampling rate and were downsampled to an 8 kHz sampling rate for processing. Fig. 1 shows the impulse responses of the loudspeaker/microphone paths for these mixing conditions.
[Figure 1: a 3 × 3 grid of measured impulse responses (loudspeakers 6–8 to microphones 1–3), each plotted over 0–0.2 s at Fs = 8000 Hz.]
Fig. 1. Acoustic chamber impulse responses in a three-source, three-microphone setup. Room conditions correspond to a reverberation time (RT) of 300 ms.
The various algorithms were applied to this measured microphone data for two- and three-source mixtures, whereby the 0° source was omitted for the two-source mixture. After separation, least-squares methods were used to estimate the contributions of the source recordings to each of the recorded mixtures as well as to the output signals from each algorithm. By calculating power ratios from these least-squares estimates, we can compute the average improvement in signal-to-interference-plus-noise ratio (SINR) for each algorithm in each case. For the normalized natural gradient algorithm in the frequency domain [11], the parameters chosen were L = 512 and μ = 0.35, and 200 passes of the algorithm through the data were used to adapt the filter. For the natural gradient time-domain algorithm [8], we used L = 512 and a step size schedule of μ = 0.0009 for 150 data passes, followed by μ = 0.0001 for a single data pass and μ = 0.00001 for a second single data pass. The data nonlinearity used in each algorithm was f(y) = y/|y|, where y corresponds to the ith frequency bin output or the ith time-domain filter output, respectively. Table 1 shows the SINR improvements obtained by the various algorithms for the various processing strategies on the two-source mixture data. As can be seen, the frequency-domain natural gradient method does not perform well either with center-spike initialization or with data prewhitening. With beamforming initialization, the algorithm achieves good performance on this data that closely matches the time-domain natural gradient algorithm. The latter algorithm's performance is quite good for center-spike initialization on this data, but improvements of 1.0 dB and 2.7 dB are obtained with beamforming initialization and data prewhitening, respectively. Shown for comparison are the behaviors of the decorrelation-based method in [3] as well as the contrast-based method with prewhitening in [9,10].
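The following hedged sketch mirrors this evaluation procedure in simplified form: it decomposes one algorithm output into least-squares contributions from the solo source recordings and forms the resulting power ratio. Unlike the paper's projection, it fits instantaneous gains only; all names are our own.

```python
import numpy as np

def sinr_db(y, s, target):
    """y: (T,) one algorithm output; s: (m, T) solo source recordings;
    returns the output SINR in dB for the given target source index."""
    coef, *_ = np.linalg.lstsq(s.T, y, rcond=None)   # y ~ s.T @ coef
    parts = coef[:, None] * s                        # per-source contributions
    signal = np.sum(parts[target] ** 2)
    interference = np.sum((y - parts[target]) ** 2)  # other sources + residual
    return 10 * np.log10(signal / interference)
```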
[Figure 2: a 3 × 3 grid of combined impulse responses C11–C33, each plotted over 0–2000 taps.]
Fig. 2. Combined impulse response of the time-domain truncated natural gradient with delay-and-sum beamforming initialization

Table 1. Improvement in average SINR [dB]; RT = 300 ms

                     TWO-SOURCE CASE                     THREE-SOURCE CASE
Algorithm            Center   w/Beam-    w/Prewhi-       Center   w/Beam-    w/Prewhi-
                     Spike    forming    tening          Spike    forming    tening
SNGFD [11]           0.25     13.56      1.52            3.33     12.55      4.55
NGTD [8]             12.63    13.60      15.34           10.89    17.07      16.80
Parra-GBSSII [3]     –        7.95       –               –        5.42       –
STFICA-Symm [9,10]   –        –          11.23           –        –          12.66
As can be seen, the time-domain natural gradient method outperforms both of these competing methods when using the same spatial knowledge of the environment or data pre-processing. Also shown in Table 1 are the SINR improvements obtained by the various algorithms for the various processing strategies on the three-source mixture data. Similar performance relationships to the two-source case are observed. The frequency-domain natural gradient algorithm obtains adequate separation only with beamforming initialization, whereas the time-domain natural gradient algorithm can separate the source mixtures with any of the three strategies employed. The best performance is obtained with beamforming initialization, although separation using data prewhitening is nearly as good. Fig. 2 shows the combined impulse responses at convergence for the natural gradient time-domain algorithm with beamforming initialization when applied to this data, indicating that separation has occurred. It should be noted that
prewhitening-based processing strategies can still be used if knowledge of the source-sensor array geometry is not available.
5 Conclusions
In convolutive blind source separation of speech signal mixtures, beamforming initialization and prewhitening are two simple strategies for improving the performance of any separation algorithm not already leveraging this structural knowledge. This paper evaluates the behaviors of two versions of the well-known natural gradient algorithm, as implemented in the time and frequency domains respectively, when using each of these strategies. Experiments indicate that the frequency-domain natural gradient algorithms rely on the spatial structure of the source-microphone mixing conditions, and they cannot adequately separate sources without using knowledge of the directions-of-arrival within the algorithm. Prewhitening alone does not help the performance of frequency-domain algorithms. Time-domain natural gradient algorithms can separate without directions-of-arrival knowledge; however, their performance is improved when either beamforming initialization or data prewhitening is employed.
References
1. Smaragdis, P.: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 22(1-3), 21–34 (1998)
2. Parra, L., Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Processing 8, 320–327 (2000)
3. Parra, L., Alvino, C.: Geometric source separation: Merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Processing 10(6), 352–362 (2002)
4. Mitianoudis, N., Davies, M.E.: Audio source separation of convolutive mixtures. IEEE Trans. Speech Audio Processing 11, 489–497 (2003)
5. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Processing 12, 530–538 (2004)
6. Saruwatari, H., Kawamura, T., Nishikawa, T., Lee, A., Shikano, K.: Blind source separation based on a fast-convergence algorithm combining ICA and beamforming. IEEE Trans. Audio Speech Language Processing 14, 666–678 (2006)
7. Amari, S., Douglas, S.C., Cichocki, A., Yang, H.H.: Multichannel blind deconvolution and equalization using the natural gradient. In: Proc. IEEE Workshop Signal Proc. Adv. Wireless Comm., Paris, France, pp. 101–104 (April 1997)
8. Douglas, S.C., Sawada, H., Makino, S.: Natural gradient multichannel blind deconvolution and speech separation using causal FIR filters. IEEE Trans. Speech Audio Processing 13, 92–104 (2005)
9. Douglas, S.C., Sawada, H., Makino, S.: A spatio-temporal FastICA algorithm for separating convolutive mixtures. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Philadelphia, PA, vol. 5, pp. 165–168 (March 2005)
10. Douglas, S.C., Gupta, M., Sawada, H., Makino, S.: Spatio-temporal FastICA algorithms for the blind separation of convolutive mixtures. IEEE Trans. Audio Speech Language Processing 15(5) (July 2007)
11. Douglas, S.C., Gupta, M.: Scaled natural gradient algorithms for instantaneous and convolutive blind source separation. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Honolulu, HI (April 2007) (to appear)
12. Araki, S., Makino, S., Hinamoto, Y., Mukai, R., Nishikawa, T., Saruwatari, H.: Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures. EURASIP J. Applied Signal Processing 2003(11), 1157–1166 (2003)
13. Douglas, S.C., Cichocki, A.: Neural networks for blind decorrelation of signals. IEEE Trans. Signal Processing 45, 2829–2842 (1997)
Blind Vector Deconvolution: Convolutive Mixture Models in Short-Time Fourier Transform Domain

Atsuo Hiroe
Intelligent Systems Research Laboratory, Information Technologies Laboratories, Sony Corporation, 5-1-12 Kitashinagawa, Shinagawa-ku, Tokyo, 141-0001 Japan
[email protected]
Abstract. For short-time Fourier transform (STFT) domain ICA, dealing with reverberant sounds is a significant issue. It often invites a dilemma on STFT frame length: frames shorter than the reverberation time (short frames) generate incomplete instantaneous mixtures, while overly long frames may disturb the separation. To improve the separation of such reverberant sounds, the authors propose a new framework which accounts for STFT with short frames. In this framework, time-domain convolutive mixtures are transformed to STFT domain convolutive mixtures. For separating the mixtures, an approach of applying another STFT is presented so as to treat them as instantaneous mixtures. The authors experimentally confirmed that this framework outperforms the conventional STFT domain ICA.
1 Introduction
To separate mixtures of speech or other sounds, Independent Component Analysis (ICA) in the short-time Fourier transform (STFT) domain (or frequency domain) has often been used [1]. Compared with time-domain ICA, STFT domain ICA has the advantages of faster convergence and less computation, since time-domain convolutive mixtures are reduced to instantaneous mixtures in each frequency bin. It can also generate 'permutation-free' unmixed results by using a measure of independence that is computed from the whole spectrograms [5]. For STFT domain ICA, however, another issue remains: the length of the STFT analysis frame [2]. STFT frames shorter than the real reverberation (short frames) make the conversion to instantaneous mixtures incomplete, while long frames decrease the number of substantive samples since they worsen the time resolution in the STFT domain. Both effects can disturb the separation. As their trade-off, STFT domain ICA often reaches its peak separation performance at a frame length much shorter than the reverberation time. A model that accounts for the STFT domain mixing process more accurately can improve the performance of short frames. Such an approach has been proposed in [3]; their separation algorithm, however, is simplified to the two-input-two-output case, and permutation correction is outside their framework.
In this paper we present a new framework which includes the mixing process with short frames and the separation without permutation inconsistencies.
2 STFT Domain Convolutive Mixture Models
In this section, we examine what occurs when convolutive mixtures are transformed with short frames. Let x_{ki}(t) be the contribution from s_i(t), source i at time t, to x_k, the observation at sensor k. It is represented as a convolution between the source s_i(t) and the filter coefficients a_{ki}(\tau):

x_{ki}(t) = \sum_{\tau=0}^{p-1} a_{ki}(\tau) \, s_i(t - \tau)  (p: filter length)    (1)

Then define X_{ki}(\omega, r) as the STFT of x_{ki}(t) in frequency bin \omega and frame r:

X_{ki}(\omega, r) \overset{def}{=} \sum_{t=0}^{L-1} w(t) \, x_{ki}(rN + t) \exp\left(-2\pi j \frac{\omega - 1}{L} t\right)  (j: imaginary unit)    (2)

where L and N are the frame length (number of taps) and the shift width, respectively. Now, consider the case p > L. According to the discussion in the Appendix, X_{ki}(\omega, r) is approximately represented as a convolution also in the STFT domain:

X_{ki}(\omega, r) \approx \sum_{\tau=0}^{P-1} A_{ki}(\omega, \tau) \, S_i(\omega, r - \tau),    (3)

where A_{ki}(\omega, \tau) and S_i(\omega, r) are the STFTs of a_{ki}(\tau) and s_i(t), respectively, and P denotes the number of frames (frame taps) in the STFT domain convolution. Therefore X_k(\omega, r), the STFT domain observation at sensor k, is approximately represented as a convolutive mixture also in the STFT domain:

X_k(\omega, r) = \sum_{i=1}^{m} X_{ki}(\omega, r) \approx \sum_{i=1}^{m} \sum_{\tau=0}^{P-1} A_{ki}(\omega, \tau) \, S_i(\omega, r - \tau).    (4)

Expanding (4) over k and \omega, we write it as a single formula:

\begin{bmatrix} X_1(r) \\ \vdots \\ X_n(r) \end{bmatrix} \approx \sum_{\tau=0}^{P-1} \begin{bmatrix} A_{11}^{[\tau]} & \cdots & A_{1m}^{[\tau]} \\ \vdots & \ddots & \vdots \\ A_{n1}^{[\tau]} & \cdots & A_{nm}^{[\tau]} \end{bmatrix} \begin{bmatrix} S_1(r - \tau) \\ \vdots \\ S_m(r - \tau) \end{bmatrix}, \quad \text{i.e.} \quad X(r) \approx \sum_{\tau=0}^{P-1} A^{[\tau]} S(r - \tau),    (5)

where X_k(r) = [X_k(1, r), \ldots, X_k(M, r)]^T and S_k(r) = [S_k(1, r), \ldots, S_k(M, r)]^T are the spectra of the observations and sources, respectively; n, m and M are the numbers of sensors, sources and frequency bins, respectively; and A_{ki}^{[\tau]} = diag\{A_{ki}(1, \tau), \ldots, A_{ki}(M, \tau)\} is an M \times M diagonal matrix.
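To make (3)–(4) concrete, the sketch below applies an STFT-domain convolutive filter to a source spectrogram, bin by bin across frames; the shapes and names are our own, and it assumes P ≤ R.

```python
import numpy as np

def stft_domain_convolve(A_spec, S_spec):
    """STFT-domain convolution (3): X(w, r) = sum_tau A(w, tau) S(w, r - tau).
    A_spec: (M, P) filter STFT; S_spec: (M, R) source spectrogram."""
    M, P = A_spec.shape
    R = S_spec.shape[1]
    X = np.zeros((M, R), dtype=complex)
    for tau in range(min(P, R)):
        X[:, tau:] += A_spec[:, [tau]] * S_spec[:, :R - tau]
    return X
```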
Equation (5) gives us another interpretation in the STFT domain: each source spectrum S_i(r), a vector of M elements, occurs independently in frame r and arrives at the sensors with at most P frames' delay. This means that S(r) affects P frames' observations X(r), ..., X(r + P − 1), and equivalently that X(r) is a convolution between the previous P frames' sources S(r), ..., S(r − P + 1) and the mixing matrices A^{[0]}, ..., A^{[P−1]}. The conventional STFT domain ICA corresponds to the particular case P = 1 in (5), which is satisfied only when the filter length p is relatively small.
3 Proposed Framework: Blind Vector Deconvolution
As in the time-domain case [8], we can consider two approaches to estimate the sources from STFT domain convolved observation vectors:

1. Convert the observations to instantaneous mixtures through another STFT (modulation spectrogram domain ICA).
2. Unmix the observations directly in the STFT domain (STFT domain deconvolution).

We call these approaches 'Blind Vector Deconvolution' (BVD). In this paper, we address the first approach, modulation spectrogram domain ICA. This is an application of our previous work [5] to the modulation spectrogram domain.

3.1 Transform to Modulation Spectrograms
Applying another STFT to spectrograms generates modulation spectrograms (MS) [9]. Use of a proper frame length (L') and shift (N') transforms the STFT domain convolutive mixtures (4) to MS domain instantaneous mixtures:

X'_k(\omega, \omega_2, r') \approx \sum_{i=1}^{m} A'_{ki}(\omega, \omega_2) \, S'_i(\omega, \omega_2, r')    (6)

\Leftrightarrow X'_k(\omega', r') \approx \sum_{i=1}^{m} A'_{ki}(\omega') \, S'_i(\omega', r'),    (7)

where X'_k(\cdot), A'_{ki}(\cdot) and S'_i(\cdot) are MS domain data transformed from X_k(\omega, r), A_{ki}(\omega, \tau) and S_i(\omega, r), respectively; \omega_2 and r' correspond to the newly generated bin and frame, respectively. Using a serial index \omega' instead of (\omega, \omega_2), we can write (6) also as (7). Fig. 1 shows the outline of the conversion to MS: (a) is a set of STFT domain observations; each bin's STFT (2nd STFT) converts them to the MS domain cube structures (b). Using the same notation as (5), we can write the MS domain mixing process simply as

X'(r') = A' S'(r').    (8)
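The following sketch performs the second STFT described above, turning a spectrogram into a modulation-spectrogram cube; L2 and N2 stand for L' and N', the Hamming window follows the experimental setup in Section 4, and all names are our own.

```python
import numpy as np

def second_stft(spec, L2, N2):
    """2nd STFT: transform each bin's frame sequence. spec: (M, R) complex
    spectrogram -> modulation spectrogram of shape (M, L2, R2)."""
    M, R = spec.shape
    win = np.hamming(L2)
    R2 = 1 + (R - L2) // N2                     # number of new frames r'
    ms = np.empty((M, L2, R2), dtype=complex)
    for r2 in range(R2):
        seg = spec[:, r2 * N2: r2 * N2 + L2] * win
        ms[:, :, r2] = np.fft.fft(seg, axis=1)  # FFT along the frame axis
    return ms
```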
[Figure 1: (a) the spectrograms X_1, ..., X_n (M bins × R frames); (b) the modulation-spectrogram cubes X'_1, ..., X'_n produced by the length-L' second STFT, indexed by (\omega, \omega_2, r').]

Fig. 1. (a) Spectrograms, (b) Modulation spectrograms

Fig. 2. KL divergence is computed as \sum_k H(Y'_k) − H(Y')

3.2 Independence Measure and Learning Rule for MS Domain ICA
Since the MS domain observations are instantaneous mixtures, techniques for instantaneous ICA apply. In particular, using the independence measure computed from the whole MSs [5] generates unmixed results without permutation inconsistencies, as described below. The unmixing process in channel k and bin \omega' is written as (9), where W_{ki}(\omega') denotes an unmixing coefficient. Expanding (9) over k and \omega', as in (5), the whole unmixing process is written as (10):

Y'_k(\omega', r') = \sum_{i=1}^{n} W_{ki}(\omega') \, X'_i(\omega', r')    (9)

\begin{bmatrix} Y'_1(r') \\ \vdots \\ Y'_n(r') \end{bmatrix} = \begin{bmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & \ddots & \vdots \\ W_{n1} & \cdots & W_{nn} \end{bmatrix} \begin{bmatrix} X'_1(r') \\ \vdots \\ X'_n(r') \end{bmatrix}, \quad \text{i.e.} \quad Y'(r') = W X'(r').    (10)

As the measure of independence, we use the Kullback-Leibler divergence (KLD) computed from the whole MSs Y'. The KLD of Y' is equivalent to the difference between the sum of each channel's entropy H(Y'_k) and the joint entropy H(Y'):

KLD(Y') = \sum_{k=1}^{n} H(Y'_k) − H(Y')    (11)
        = \sum_{k=1}^{n} E_{r'}[−\log P(Y'_k(r'))] − \log|\det W| − H(X'),    (12)

where P(Y'_k(r')) denotes a multivariate probability density function of Y'_k(r'), and E_{r'}[\cdot] means expectation over r' (Fig. 2). Applying the natural gradient rule [4] to (12), we obtain a set of learning rules to seek the unmixing matrix W that makes Y'_1, ..., Y'_n mutually independent in the MS domain:

W(\omega') \leftarrow W(\omega') + \eta \, \Delta W(\omega')  (\eta: learning rate)    (13)

\Delta W(\omega') = E_{r'}\left[ I + \begin{bmatrix} \varphi_{\omega'}(Y'_1(r')) \\ \vdots \\ \varphi_{\omega'}(Y'_n(r')) \end{bmatrix} \begin{bmatrix} Y'_1(\omega', r') \\ \vdots \\ Y'_n(\omega', r') \end{bmatrix}^H \right] W(\omega'),    (14)

where \varphi_{\omega'}(Y'_k(r')) = \frac{\partial \log P(Y'_k(r'))}{\partial Y'_k(\omega', r')}, \quad W(\omega') = \begin{bmatrix} W_{11}(\omega') & \cdots & W_{1n}(\omega') \\ \vdots & \ddots & \vdots \\ W_{n1}(\omega') & \cdots & W_{nn}(\omega') \end{bmatrix}    (15)
3.3 Multivariate Probability Density Functions
To compute H(Y'_k), the entropy of each channel's MS, we employ P(Y'_k(r')) \propto \exp(−K \|Y'_k(r')\|_2), as proposed in [5]. We then obtain the activation function

\varphi_{\omega'}(Y'_k(r')) = −K \frac{Y'_k(\omega', r')}{\|Y'_k(r')\|_2}, \qquad \|Y'_k(r')\|_2 = \left( \sum_{\omega'} |Y'_k(\omega', r')|^2 \right)^{1/2}.    (16)
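A compact sketch of one sweep of the learning rule (13)–(16) follows; the array layout, the variable names, and the small constant guarding the norm are our own choices, and Y is assumed to hold the current outputs W X' for this sweep.

```python
import numpy as np

def natural_gradient_step(W, Y, eta, K):
    """One sweep of (13)-(16). Y: (n, n_bins, T) MS-domain outputs;
    W: (n_bins, n, n) per-bin unmixing matrices."""
    n, n_bins, T = Y.shape
    norms = np.sqrt(np.sum(np.abs(Y) ** 2, axis=1, keepdims=True))  # ||Y'_k(r')||_2
    phi = -K * Y / np.maximum(norms, 1e-12)                         # eq (16)
    for w in range(n_bins):
        # eq (14): E_r'[ I + phi(Y'(r')) Y'(w, r')^H ] W(w)
        G = np.eye(n) + (phi[:, w, :] @ Y[:, w, :].conj().T) / T
        W[w] = W[w] + eta * (G @ W[w])                              # eq (13)
    return W
```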
3.4 Postprocesses
After the learning, minimal distortion principle based rescaling [6] is applied, i.e. W \leftarrow diag(W^{-1}) \, W. Then, through overlap-add based inverse STFT [10], the MS domain unmixed results Y' are converted to STFT domain data Y. To convert Y to time-domain signals, another inverse STFT is applied.
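The rescaling step can be written in a few lines; the sketch below applies W ← diag(W^{-1}) W independently in every bin, with names our own.

```python
import numpy as np

def rescale_minimal_distortion(W):
    """Per-bin minimal distortion rescaling: W <- diag(W^{-1}) W."""
    for w in range(W.shape[0]):
        W_inv = np.linalg.inv(W[w])
        W[w] = np.diag(np.diag(W_inv)) @ W[w]
    return W
```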
4 Experimental Results
To confirm the separation performance, we performed experiments using a set of data recorded separately and mixed on a computer. Recording was done in our office room (0.25–0.3 s reverberation time) by playing each source from a different loudspeaker (Fig. 3). Four or eight seconds of recorded data were used in the experiments, and the sampling rate (Fs) was 16 kHz. As observations, we generated eight different mixtures as in Fig. 4, where S, F and M denote street noise [7], female speech and male speech, respectively. To unmix these mixtures, we used the following two methods with the STFT parameters shown in Table 1:

Benchmark (BM): STFT domain ICA described in [5].
Proposed: Modulation spectrogram domain ICA described in Section 3.
[Figure 3: plan of the recording room (450 cm × 700 cm, height 260 cm) showing four loudspeakers (height 80 cm), a four-microphone array (height 55 cm, 5 cm spacing) and a partition (height 153 cm).]
Fig. 3. Recording environment
Fig. 4. Sources in each test (S: street noise, F/M: female/male speech); each of Tests 1–8 assigns a subset of the sources to loudspeakers 1–4.
Table 1. Experimental parameters (L, L': frame length; N, N': shift width)

          1st STFT                        2nd STFT
          L                         N     L'                    N'
BM        256 to 8192 (Hanning)     L/4   —                     —
Proposed  512 or 1024 (Hanning)     L/4   4 to 64 (Hamming)     L'/8

Common: \eta = 0.3, 400 iterations, K = square root of the number of bins
In the second STFT, we used a Hamming window to utilize both ends of a frame effectively (in the case L' = 4, we used N' = 1). To evaluate the separation performance, we used the Signal-to-Interference Ratio (SIR) in the STFT domain instead of the time domain. It is computed as follows (a sketch of steps 1–2 appears after this list):

1. Approximate each bin's unmixed result Y_k(\omega, r) as a linear sum of all sources S_1(\omega, r), ..., S_m(\omega, r), namely Y_k(\omega, r) \approx \sum_{i=1}^{m} \alpha_{ki}(\omega) S_i(\omega, r).
2. Calculate SIR(Y_k, S_i), i.e. each output channel's SIR, as E_\omega[10 \log_{10} E_r\{|\alpha_{ki}(\omega) S_i(\omega, r)|^2\} / E_r\{|\sum_{l \neq i} \alpha_{kl}(\omega) S_l(\omega, r)|^2\}].
3. Define SIR_Y, i.e. the SIR of the unmixed results, as E_i[\max_k SIR(Y_k, S_i)].
4. Define SIR_{imp}, i.e. the improved SIR, as E_i[\max_k [SIR(Y_k, S_i) − SIR(X_k, S_i)]].
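The sketch below implements steps 1–2 under our own shape conventions: Yk is one output channel's spectrogram (M × R) and S_all stacks the m source spectrograms.

```python
import numpy as np

def sir_pair(Yk, S_all, i):
    """SIR(Y_k, S_i). Yk: (M, R) spectrogram of output k;
    S_all: (m, M, R) source spectrograms; i: target source index."""
    m, M, R = S_all.shape
    vals = []
    for w in range(M):
        # step 1: least-squares gains alpha_ki(w) for this bin
        alpha, *_ = np.linalg.lstsq(S_all[:, w, :].T, Yk[w], rcond=None)
        contrib = alpha[:, None] * S_all[:, w, :]
        target = np.mean(np.abs(contrib[i]) ** 2)
        interf = np.mean(np.abs(contrib.sum(axis=0) - contrib[i]) ** 2)
        vals.append(10 * np.log10(target / max(interf, 1e-12)))
    return float(np.mean(vals))  # step 2: E_omega[...]
```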
(or L/Fs for the benchmark) (17)
The average SIRimp over eight tests are plotted in Fig. 5. In these plots, dashed vertical lines, dash-dot lines and solid lines show the reverberation time (assuming 0.27[s]), the benchmark and proposed methods respectively. Table 2 shows the best SIRs and corresponding parameters in each method.
[Figure 5: two panels of improved SIR (dB) versus timespan (s, logarithmic scale), comparing the benchmark (L = 256–8192) with the proposed method (L = 512, L' = 4–64 and L = 1024, L' = 4–32); a dashed vertical line marks the reverberation time of 0.27 s.]
Fig. 5. Plots of SIRs (Left: 4 seconds, Right: 8 seconds)

Table 2. Best SIR in each method

Method    Length [s]  SIR_Y [dB]  SIR_imp [dB]  1st STFT (L, N)  2nd STFT (L', N')  Timespan [s]
BM        4.0         20.36       22.33         1024, 256        —                  0.064
Proposed  4.0         21.65       23.66         512, 128         16, 2              0.088
BM        8.0         23.31       25.35         2048, 512        —                  0.128
Proposed  8.0         25.94       27.99         512, 128         16, 2              0.152
From the above experiments, we have confirmed the following:
1. For the benchmark, the best SIR is obtained with L = 1024 or L = 2048, much shorter than the reverberation time. Longer frames rather worsen the SIR, as mentioned in Section 1.
2. For the proposed methods, the best SIR is achieved at a longer timespan.
5 Discussion on Experimental Results
In the framework of Blind Vector Deconvolution, a single frame's unmixed result Y'(r') is estimated from observations over L' consecutive frames around frame r', while in the conventional STFT domain ICA it is estimated from a single frame's observations X(r). This property can remove cross-frame interference caused by using short frames. Moreover, the fact that adjacent STFT domain frames overlap each other in the time domain prevents a decrease in the number of substantive samples in the second STFT. These are the reasons why the proposed methods outperform the benchmark. At longer timespans, however, the proposed methods have also shown a decline in SIR, which is more apparent in the 4-second case. We conjecture that this is due to the following reasons: 1) in the second STFT, the dilemma of frame
length still remains; 2) longer frames in the second STFT add more cross-frame artifacts to the target signal in the STFT domain. We expect that the dilemma can be avoided by carefully selecting the parameters of the first and second STFTs, or by introducing STFT domain deconvolution.
6 Conclusion
We proposed a new framework based on STFT domain convolutive mixtures and presented an approach that unmixes them in the modulation spectrogram domain. We experimentally confirmed that the proposed methods outperform the conventional STFT domain ICA by mitigating the dilemma of the STFT frame length, although the dilemma still remains at long timespans. In future work, we plan to evaluate our methods in various environments and to examine STFT domain deconvolution.
References
1. Smaragdis, P.: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 22(1-3), 21–34 (1998)
2. Araki, S., Makino, S., Mukai, R., Nishikawa, T., Saruwatari, H.: Fundamental limitation of frequency domain blind source separation for convolved mixture of speech. In: Proc. ICA 2001, pp. 132–137 (December 2001)
3. Servière, C.: Separation of speech signals under reverberant conditions. In: Proc. EUSIPCO 2004, pp. 1693–1696 (2004)
4. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind signal separation. In: Advances in Neural Information Processing Systems, vol. 8. MIT Press, Cambridge (1996)
5. Hiroe, A.: Solution of Permutation Problem in Frequency Domain ICA, Using Multivariate Probability Density Functions. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 601–608. Springer, Heidelberg (2006)
6. Matsuoka, K., Nakashima, S.: Minimal distortion principle for blind source separation. In: Proc. ICA 2001, pp. 722–727 (December 2001)
7. http://sound.media.mit.edu/ica-bench/sources/
8. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis (2001)
9. Greenberg, S., Kingsbury, B.E.D.: The modulation spectrogram: in pursuit of an invariant representation of speech. In: Proc. ICASSP-97, pp. 1647–1650 (1997)
10. Rabiner, L., Schafer, R.: Short-Time Fourier Analysis. In: Digital Processing of Speech Signals. Prentice-Hall, London (1978)
11. Kim, T., Eltoft, T., Lee, T.-W.: Independent Vector Analysis: an extension of ICA to multivariate components. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 165–172. Springer, Heidelberg (2006)
Appendix: STFT of Convolved Signals

First we examine the STFT with a single shift (N = 1) [10]. Let \tilde{S}_i(\omega, R) be the STFT of s_i(t) in frame R (also at time R), defined in (18). From (1) and (18),
\tilde{X}_{ki}(\omega, R), the STFT of x_{ki}(t), is written as (19), which represents a convolution between \tilde{S}_i(\omega, R − \tau) and a_{ki}(\tau):

\tilde{S}_i(\omega, R) \overset{def}{=} \sum_{t=0}^{L-1} w(t) \, s_i(R + t) \exp\left(-2\pi j \frac{\omega - 1}{L} t\right)    (18)

\tilde{X}_{ki}(\omega, R) = \sum_{\tau=0}^{p-1} a_{ki}(\tau) \, \tilde{S}_i(\omega, R − \tau)    (19)

Then we apply (19) to the STFT with general shift width N. Using integers P, N and L such that (P − 1)N < p \le (P − 1)N + L, (19) is rewritten in the overlap-add form (20). It can be approximated as in (21), since \tau close to 0 should satisfy (22). Equation (21) shows the convolution between \tilde{S}_i(\omega, R − \tau' N) and \tilde{A}_{ki}(\omega, \tau' N). It means that, as long as the approximation in (22) is sound, the STFT with shift N converts time-domain convolution to STFT domain convolution:

(19) = \sum_{\tau'=0}^{P-1} \sum_{\tau=0}^{L-1} \frac{N}{L} \, a_{ki}(\tau' N + \tau) \, \tilde{S}_i(\omega, R − \tau' N − \tau)    (20)

\approx \sum_{\tau'=0}^{P-1} \underbrace{\left[ \sum_{\tau=0}^{L-1} \frac{N}{L} \, a_{ki}(\tau' N + \tau) \exp\left(-2\pi j \frac{\omega - 1}{L} \tau\right) \right]}_{\tilde{A}_{ki}(\omega, \tau' N)} \tilde{S}_i(\omega, R − \tau' N)    (21)

\tilde{S}_i(\omega, R − \tau' N − \tau) \approx \exp\left(-2\pi j \frac{\omega - 1}{L} \tau\right) \tilde{S}_i(\omega, R − \tau' N) \quad (0 \le \tau < L)    (22)

Finally, we rewrite (21) as (3) through the following replacements:

X_{ki}(\omega, r) = \tilde{X}_{ki}(\omega, rN), \qquad A_{ki}(\omega, \tau) = \tilde{A}_{ki}(\omega, \tau N), \qquad S_i(\omega, r − \tau) = \tilde{S}_i(\omega, R − \tau N)
A Batch Algorithm for Blind Source Separation of Acoustic Signals Using ICA and Time-Frequency Masking

Eugen Hoffmann, Dorothea Kolossa, and Reinhold Orglmeister
Berlin University of Technology, Electronics and Medical Signal Processing Group, Einsteinufer 17, 10587 Berlin, Germany
{Eugen.Hoffmann.1, Reinhold.Orglmeister}@tu-berlin.de, [email protected]
Abstract. The problem of Blind Source Separation (BSS) of convolved acoustic signals is of great interest for many classes of applications such as in-car speech recognition, hands-free telephony or hearing devices. Due to the convolutive mixing process, the source separation is performed in the frequency domain, using Independent Component Analysis (ICA). However, the quality of the ICA solution can be improved by applying time-frequency masking. In this paper we present a batch algorithm for time-frequency masking that uses the time-frequency structure of the separated signals.
1 Introduction
Blind Source Separation (BSS) deals with the problem of recovering source signals from their mixtures when the mixing process is unknown. Recently, the problem has been widely studied and many methods have been proposed [7]. In this paper we concentrate on the case of BSS for acoustic speech signals observed in a real environment, i.e. convolutive mixtures. Most existing demixing methods are based on Independent Component Analysis (ICA) in the frequency domain, where the convolutions of the source signals with the room impulse response are reduced to multiplications with the corresponding transfer functions, so that for each frequency bin an individual instantaneous ICA problem arises [2],[7]. The quality of the recovered source signals can be improved by applying time-frequency masking to the ICA outputs. There exist a number of algorithms calculating the time-frequency mask from the estimated direction of arrival of the separated signals [4],[5], algorithms calculating the power ratio between inputs and outputs of a spatial filter [3], and algorithms based on the cosine distance between a sample vector and the basis vector corresponding to the target [6]. The proposed algorithm estimates the time-frequency mask by comparing the time-frequency structure of the separated signals. On this basis, we propose a batch algorithm for the separation of two sources by applying the JADE algorithm [1] for the computation of unmixing filters and
time-frequency masking to improve the separation quality. The effectiveness of this approach is shown by the separation results of the algorithm.
2 The Proposed Method
The block diagram of the algorithm is shown in Fig. 1. First, the time-domain signals x(t) are converted into frequency-domain time-series signals X(Ω, τ) using the Short-Time Fourier Transform (STFT), to which the JADE algorithm is applied in order to compute the unmixing filter matrices W(Ω). In the next step the permutation problem is treated, so that the unmixing filter matrices can be corrected by multiplication with a permutation matrix P(Ω). At the post-processing stage of the algorithm, the time-frequency mask M(Ω) is estimated to minimize the crosstalk components that could not be eliminated by the ICA algorithm, and the output vector y(t) is obtained by transforming the unmixed signals Y(Ω, τ) back into the time domain.
Fig. 1. Overview of the algorithm
2.1 ICA
Acoustic signal mixtures in reverberant environments can be described by

x(t) = A * s(t),    (1)

where s(t), x(t) and A denote the vector of source signals, the vector of mixed signals and a matrix containing the impulse responses between the sources and the sensors, respectively, and * denotes the convolution operator. Transforming (1) into the frequency domain reduces the convolutions to multiplications:

X(Ω, τ) ≈ A(Ω) S(Ω, τ),    (2)

where Ω is the angular frequency, τ represents the frame index, A(Ω) is the mixing system in the frequency domain, S(Ω, τ) = [S_1(Ω, τ), ..., S_N(Ω, τ)] represents the source signals, and X(Ω, τ) = [X_1(Ω, τ), ..., X_N(Ω, τ)] denotes the observed signals. So for each frequency bin an instantaneous ICA problem has to be solved. For this purpose we use the JADE algorithm, which is based on joint diagonalization of the most significant higher-order cumulant matrices [1].
2.2 Permutation Correction
The filter matrices calculated by ICA can be randomly permuted. To solve the permutation problem, the phase differences in the estimated mixing filter matrices are used [2],[5]. For this purpose we normalize the estimated mixing matrix A(Ω) by its first row, so that the normalized mixing matrix can be written as

\hat{A}(Ω) = \begin{bmatrix} 1 & \cdots & 1 \\ \hat{a}_{21} e^{-jΩδ_{21}} & \cdots & \hat{a}_{2n} e^{-jΩδ_{2n}} \\ \cdots & \cdots & \cdots \end{bmatrix}    (3)

To correct the permutations, we compare the phase differences δ_i and sort the columns of the mixing filter matrix so that δ_i < δ_k for all i < k. Figure 2 shows the results of the permutation correction; a sketch of this column sort is given after the figure caption below.

Remark 1. It should be noted that this approach alone works only in the case of signals coming from different directions.
Fig. 2. DOA estimation for two signals using ICA. The gray scale values show the grade of attenuation.
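A minimal sketch of the permutation fix follows: it normalizes each bin's estimated mixing matrix by its first row, reads the delays δ_i off the phase of the second row per (3), and sorts the columns; all names are our own, positive bin frequencies are assumed, and in practice the unmixing matrices would be re-ordered in the same way.

```python
import numpy as np

def fix_permutations(A, omegas):
    """Sort each bin's mixing-matrix columns by the delay delta from (3).
    A: (n_bins, n, n) estimated mixing matrices; omegas[w] > 0 per bin."""
    for w in range(A.shape[0]):
        A_norm = A[w] / A[w][0]                   # normalize on the first row
        delta = -np.angle(A_norm[1]) / omegas[w]  # phase of a21*e^{-j*Omega*delta}
        A[w] = A[w][:, np.argsort(delta)]         # order so delta_i < delta_k, i < k
    return A
```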
2.3 Time-Frequency Masking
Despite the good performance of the ICA algorithm, some interference remains in the estimated source signals. To minimize the remaining interference, we use time-frequency masking, which is performed as

\tilde{Y}(Ω, τ) = M(Ω, τ) Y(Ω, τ),    (4)

where 0 ≤ M(Ω, τ) ≤ 1 is a mask specified for each time index τ in each frequency bin. To calculate the mask, we estimate the interferences

U_k(Ω, τ, R_Ω, R_τ) = \frac{\|\Phi(Ω, τ, R_Ω, R_τ)(Y_k(Ω, τ) − \sum_{m \neq k} Y_m(Ω, τ))\|}{\|\Phi(Ω, τ, R_Ω, R_τ) \sum_{m \neq k} Y_m(Ω, τ)\|}    (5)

and

N_k(Ω, τ, R_Ω, R_τ) = \frac{\|\Phi(Ω, τ, R_Ω, R_τ)(Y_k(Ω, τ) − \sum_{m \neq k} Y_m(Ω, τ))\|}{\|\Phi(Ω, τ, R_Ω, R_τ) Y_k(Ω, τ)\|}    (6)

between the spectrogram Y_k(Ω, τ) and \sum_{m \neq k} Y_m(Ω, τ), where \|\cdot\| denotes the Euclidean norm operator and

\Phi(Ω, τ, R_Ω, R_τ) = \begin{cases} W(Ω − Ω_0, τ − τ_0, R_Ω, R_τ), & |Ω − Ω_0| \le R_Ω, \; |τ − τ_0| \le R_τ \\ 0, & \text{otherwise} \end{cases}    (7)

uses a two-dimensional window function W(Ω − Ω_0, τ − τ_0, R_Ω, R_τ) of size R_Ω × R_τ (e.g. a two-dimensional Hanning window). The estimate of the interference gives us different possibilities for mask calculation:

– The interference can be used to estimate the time-frequency bins of signal k where the signal of interest is active:

M_k(Ω, τ) = \begin{cases} 1, & U_k(Ω, τ, R_Ω, R_τ) > λ_u \\ 0, & \text{otherwise} \end{cases}    (8)

where λ_u is the threshold for the signal of interest and U_k(Ω, τ, R_Ω, R_τ) is the interference from (5).

– The interference can be used to estimate the time-frequency bins of signal k where the jammer signal is active:

M_k(Ω, τ) = \begin{cases} 0, & N_k(Ω, τ, R_Ω, R_τ) > λ_n \\ 1, & \text{otherwise} \end{cases}    (9)

where λ_n is the threshold for the jammer signal.

– It is possible to combine both methods, using both estimated interferences U_k(Ω, τ, R_Ω, R_τ) and N_k(Ω, τ, R_Ω, R_τ):

M_k(Ω, τ) = \begin{cases} 1, & N_k(Ω, τ, R_Ω, R_τ) < λ_n \wedge U_k(Ω, τ, R_Ω, R_τ) > λ_u \\ 0, & \text{otherwise} \end{cases}    (10)

Figure 3 shows the results of the interference estimation and masking with (10).
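The following sketch computes (5)–(10) for one output channel. It relies on the identity that the squared windowed norm over a patch equals the sum of W² |Z|², so the local norms for all (Ω, τ) can be obtained with a single 2-D convolution; scipy's convolve2d is used for that, and all names are our own.

```python
import numpy as np
from scipy.signal import convolve2d

def tf_mask(Yk, Y_rest, R_omega, R_tau, lam_u, lam_n):
    """Combined mask (10) from the interference estimates (5) and (6).
    Yk: target spectrogram; Y_rest: sum of the other outputs (same shape)."""
    W2 = np.outer(np.hanning(R_omega), np.hanning(R_tau)) ** 2

    def local_norm(Z):
        # ||Phi o Z|| at every (Omega, tau): sum of W^2 |Z|^2 over each patch
        return np.sqrt(convolve2d(np.abs(Z) ** 2, W2, mode='same'))

    diff = local_norm(Yk - Y_rest)
    U = diff / np.maximum(local_norm(Y_rest), 1e-12)   # eq (5)
    N = diff / np.maximum(local_norm(Yk), 1e-12)       # eq (6)
    return ((N < lam_n) & (U > lam_u)).astype(float)   # eq (10)
```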
3 Experiments and Results

3.1 Conditions
For the evaluation of the proposed algorithm, the TIDigits database [8] with two different male speakers was used; it was played back and recorded once with both speaker signals simultaneously and once separately, in three different loudspeaker setups. The recordings were made in a real room
with a reverberation time T_R = 300 ms (Fig. 4). The loudspeakers were placed with the angles of incidence relative to broadside as shown in Table 1. The algorithm was tested at a resolution of N_FFT = 512; the recording sample rate was 11 kHz.

Table 1. Experimental configurations

config   θ1    θ2     recordings
A        45°   45°    speaker 1, speaker 2, both speakers
B        45°   25°    speaker 1, speaker 2, both speakers
C        10°   25°    speaker 1, speaker 2, both speakers

3.2 Performance Measures
For the calculation of the effectiveness of the proposed algorithm, the signal-to-interference ratio (SIR) was used as a measure of the separation performance, and the signal-to-distortion ratio (SDR) as a measure of signal quality:

SIR_i = 10 \log \frac{\sum_n y_{i,s_i}^2(n)}{\sum_{j \neq i} \sum_n y_{i,s_j}^2(n)}    (11)

SDR_i = 10 \log \frac{\sum_n x_{k,s_i}^2(n)}{\sum_n (x_{k,s_i}(n) − \alpha y_{i,s_i}(n − D))^2}    (12)

where y_{i,s_j} is the i-th separated signal with only source s_j active, and x_{k,s_j} is the observation obtained by microphone k when only s_j is active. α and D are amplitude and phase parameters chosen to optimally compensate the difference between y_{i,s_j} and x_{k,s_j}.

3.3 Experimental Results
In this section we compare the results of ICA with and without the proposed time-frequency masking algorithm, and show the results of time-frequency masking for different parameter values and different masks as described in Sect. 2.3. Table 2 shows the average SIR improvement resulting from the different algorithms applied to the configurations from Table 1.

Table 2. Result comparison. Average SIR improvement in dB.

Algorithm                     Config. A   Config. B   Config. C
ICA only                      11.9 dB     23.1 dB     12.7 dB
ICA and TF-Mask from [3]      31.9 dB     35.4 dB     16.1 dB
ICA and TF-Mask from [6]      30.8 dB     34.6 dB     18.1 dB
DUET [5]                      31.4 dB     23.6 dB     17.7 dB
ICA and proposed algorithm    35.3 dB     39.4 dB     18.1 dB
Fig. 3. The lower 200 frequency bins of spectrograms of 1) the mixture X1 (Ω, τ ), 2) the output signal Y1 (Ω, τ ) obtained only with ICA, 3) the estimated interference N1 (Ω, τ, RΩ , Rτ ), 4) the estimated interference U1 (Ω, τ, RΩ , Rτ ), 5) the output signal Y1 (Ω, τ ) obtained by a combination of ICA and T-F masking calculated with (10), and 6) the clean source signal S1 (Ω, τ )
Fig. 4. Experimental setup
[Figure 5: four panels plotting SIR and SDR (dB) for signals 1 and 2 versus window size.]
Fig. 5. Comparison of the results of time frequency masking for different window lengths. Δ : λu = λn = 1.3, ◦ : λu = λn = 2.5. sir signal 1
sir signal 2
40
20 dB
40
dB
60
20 0 0.5
0 1
1.5 2 treshold sdr signal 1
2.5
3
−20 0.5
2
0 0.5
1.5 2 treshold sdr signal 2
2.5
3
1
1.5 2 treshold
2.5
3
10 dB
dB
4
1
1
1.5 2 treshold
2.5
3
5
0 0.5
Fig. 6. Comparison of the results of time-frequency masking for different thresholds. ∗: the mask calculated with equation (8), ◦: the mask calculated with equation (9), Δ: the mask calculated with equation (10); in all cases λu = λn.
Figure 5 shows the SIR and SDR for masking as a function of window size; in both cases the mask was calculated with (10) for configuration A of Table 1. Figure 6 shows the SIR and SDR for masking with different masks and different thresholds.
4 Conclusion
An approach for time-frequency masking has been presented that uses an estimate of the signal interference for mask calculation. The interference is estimated using the normalized correlation of the separated signals.
The proposed algorithm has been tested on real-room recordings with a reverberation time of 300 ms, where an SIR improvement of up to 16 dB has been obtained, 19 dB above the ICA performance for the same dataset. The results show that a significant improvement of separation performance is possible by evaluating residual correlations between separated signals.
References
1. Cardoso, J.-F.: High-order contrasts for independent component analysis. Neural Computation 11, 157–192 (1999)
2. Baumann, W., Kolossa, D., Orglmeister, R.: Maximum Likelihood Permutation Correction for Convolutive Source Separation. In: Proc. ICA 2003, pp. 373–378 (2003)
3. Kolossa, D., Orglmeister, R.: Nonlinear postprocessing for blind speech separation. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 832–839. Springer, Heidelberg (2004)
4. Rickard, S., Balan, R., Rosca, J.: Real-time time-frequency based blind source separation. In: Proc. ICA 2001, pp. 651–656 (December 2001)
5. Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Processing 52(7), 1830–1847 (2004)
6. Sawada, H., Araki, S., Mukai, R., Makino, S.: Blind extraction of a dominant source from mixtures of many sources using ICA and time-frequency masking. In: IEEE International Symposium on Circuits and Systems (ISCAS 2005), pp. 5882–5885 (May 2005)
7. Mansour, A., Kawamoto, M.: ICA Papers Classified According to their Applications and Performances. IEICE Trans. Fundamentals E86-A(3), 620–633 (2003)
8. Leonard, R.G.: A Database for Speaker-Independent Digit Recognition. In: Proc. ICASSP 1984, vol. 3, pp. 11–42 (1984)
The Role of High Frequencies in Convolutive Blind Source Separation of Speech Signals

Maria G. Jafari and Mark D. Plumbley
Centre for Digital Music, Queen Mary University of London, UK
[email protected]
http://www.elec.qmul.ac.uk
Abstract. In this paper, we investigate the importance of the high frequencies in the problem of convolutive blind source separation (BSS) of speech signals. In particular, we focus on frequency domain blind source separation (FD-BSS), and show that when separation is performed in the low frequency bins only, the recovered signals are similar in quality to those extracted when all frequencies are taken into account. The methods are compared through informal listening tests, as well as using an objective measure.
1 Introduction
Convolutive blind source separation is often addressed in the frequency domain, through the short-time Fourier transform (STFT), where source separation is performed separately at each frequency bin, thus reducing the problem to that of several instantaneous BSS problems. Although the approximation of convolutions by multiplications results in reduced computational complexity, frequency domain BSS (FD-BSS) remains computationally expensive because source separation has to be carried out on a large number of bins (a typical STFT length is 2048 points), each containing sufficient data samples for the independence assumption to hold. In addition, transforming the problem into several independent instantaneous problems has the unwelcome side effect of introducing the problem of frequency permutations, whose solution is often quite computationally expensive [1], as it involves clustering the frequency components of the recovered sources using methods such as beamforming approaches, e.g. [3,4]. These methods exploit phase information contained in the de-mixing filters identified by the source separation algorithm. Generally, the characteristics of speech signals are such that little information is contained in the frequencies above 4 kHz [9], suggesting a possible approach to BSS for speech mixtures that focuses on the lower frequencies. Motivated by this, and in order to reduce the computational load of FD-BSS algorithms, we consider here the role of high frequencies in source separation of speech signals. We show that high frequencies are not as important as low frequencies,
and that intelligibility is preserved even when the high frequency subbands are left unmixed and simply added back onto the separated signal. Other possible approaches would exploit existing methods that assume that high frequencies are not available, such as bandwidth extension. The structure of this paper is as follows: the basic convolutive BSS problem is described in Section 2; an overview of FD-ICA is given in Section 3, while the role of high frequencies is discussed in Section 4. Simulation results are presented in Section 5, and conclusions are drawn in Section 6.
2 Problem Formulation
The simplest convolutive BSS problem arises when 2 microphones record mixtures x(n) of 2 sampled real-valued signals, s(n), which in this paper are considered to be speech signals. The aim of blind source separation is then to recover the sources from only the 2 convolutive mixtures available. Formally, the signal recorded at the q-th microphone, $x_q(n)$, is

$$x_q(n) = \sum_{p=1}^{2} \sum_{l=1}^{L} a_{qp}(l)\, s_p(n-l), \qquad q = 1, 2 \quad (1)$$

where $s_p(n)$ is the p-th source signal, $a_{qp}(l)$ denotes the impulse response from source p to sensor q, and L is the maximum length of all impulse responses [1]. The source signals are then reconstructed according to

$$y_p(n) = \sum_{q=1}^{2} \sum_{l=1}^{L} w_{qp}(l)\, x_q(n-l), \qquad p = 1, 2 \quad (2)$$

where $y_p(n)$ is the p-th recovered source, and $w_{qp}(l)$ are the unmixing filters which must be estimated.
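As an illustration, a minimal NumPy sketch of the 2 × 2 convolutive mixing model (1), with hypothetical exponentially decaying random impulse responses standing in for the room acoustics:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 64                                   # impulse response length (assumption)
s = rng.standard_normal((2, 8000))       # stand-ins for two speech sources

# a[q, p] is the impulse response from source p to microphone q, cf. Eq. (1)
a = rng.standard_normal((2, 2, L)) * np.exp(-np.arange(L) / 10.0)

x = np.zeros_like(s)
for q in range(2):
    for p in range(2):
        x[q] += np.convolve(a[q, p], s[p])[:s.shape[1]]
```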
3 Frequency Domain Blind Source Separation
Convolutive audio source separation is often addressed in the frequency domain. It entails the evaluation of the N-point short-time Fourier transform of the observed signals, followed by the use of instantaneous BSS independently on each of the resulting N subbands. Thus, the mixing and separating models in (1) and (2) become, respectively,

$$X(f,t) = A(f)\,S(f,t) \quad (3)$$
$$Y(f,t) = W(f)\,X(f,t) \quad (4)$$

where $S(f,t)$ and $X(f,t)$ are the STFT representations of the source and mixture vectors respectively, $A(f)$ and $W(f)$ are the mixing and separating matrices at frequency bin f, $Y(f,t)$ is the frequency domain representation of the recovered sources, and t denotes the STFT block index.
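In code, the per-bin models (3)–(4) are just bin-wise matrix products. A minimal sketch, assuming the STFT data is stored as an array of shape (bins, frames, channels) with one matrix per bin:

```python
import numpy as np

def apply_per_bin(M, X):
    """Y(f, t) = M(f) X(f, t) for every bin f and frame t (cf. (3)-(4)).

    M : (bins, N, N) complex, one mixing/unmixing matrix per bin
    X : (bins, frames, N) complex STFT data
    """
    return np.einsum('fij,ftj->fti', M, X)
```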
FD-BSS has the drawback of introducing the problem of frequency permutations, which is typically solved by clustering the frequency components of the recovered sources, often using beamforming techniques, such as in [1,3,4,5], where the directions of arrival (DOA) of the sources are evaluated from the beamformer directivity patterns

$$F_p(f,\theta) = \sum_{q=1}^{2} W^{ICA}_{qp}(f)\, e^{j 2\pi f d \sin\theta_p / c}, \qquad p = 1, 2 \quad (5)$$

where $W^{ICA}_{qp}$ is the ICA de-mixing filter from the q-th sensor to the p-th output, d is the spacing between two sensors, $\theta_p$ is the angle of arrival of the p-th source signal, and c ≈ 340 m/s is the speed of sound in air. The frequency permutations are then determined by ensuring that the directivity pattern for each beamformer is approximately aligned along the frequency axis.

The BSS algorithm considered in this paper is given in [6]. It updates the unmixing filters according to

$$\Delta W(f) = D\,\big[\mathrm{diag}(-\alpha_i) + E\{\phi(y(f,t))\, y^H(f,t)\}\big]\, W(f)$$
$$W(f) \leftarrow W(f)\,\big(W(f)^H W(f)\big)^{-0.5} \quad (6)$$

where $y^H$ is the conjugate transpose of y, $\alpha_i = E\{y_i(f,t)\,\phi(y_i(f,t))\}$, $D = \mathrm{diag}\big(1/(\alpha_i - E\{\phi'(y_i(f,t))\})\big)$, and the activation function $\phi(y(f,t))$ is given by

$$\phi(y(f,t)) = \frac{y(f,t)}{|y(f,t)|}, \qquad \forall\, |y(f,t)| \neq 0 \quad (7)$$

with its derivative approximated by $\phi'(y(f,t)) \approx |y(f,t)|^{-1} - y(f,t)^2\, |y(f,t)|^{-3}$ [6]. Moreover, the algorithm (6) requires that the mixtures x(f,t) be prewhitened; we refer to it as MD2003.
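A minimal sketch of one iteration of an update of this form in a single frequency bin, using the nonlinearity (7). The block averages standing in for E{·}, the small ε guarding |y| = 0, and the real parts taken for the αᵢ terms are assumptions of this sketch; the loop over all bins and the prewhitening are omitted:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def md2003_step(W, X, mu=0.1, eps=1e-8):
    """One sketched update of Eq. (6) for a single frequency bin.

    W : (N, N) complex unmixing matrix for this bin
    X : (N, T) complex, prewhitened STFT frames for this bin
    """
    N, T = X.shape
    Y = W @ X
    mag = np.abs(Y) + eps                          # guard against |y| = 0
    phi = Y / mag                                  # nonlinearity of Eq. (7)
    dphi = 1.0 / mag - np.abs(Y) ** 2 / mag ** 3   # derivative approximation
    alpha = np.mean(phi * Y.conj(), axis=1).real   # alpha_i (real part taken)
    D = np.diag(1.0 / (alpha - np.mean(dphi, axis=1)))
    C = np.diag(-alpha) + (phi @ Y.conj().T) / T   # diag(-alpha) + E{phi(y) y^H}
    W = W + mu * D @ C @ W
    # normalization step: W <- W (W^H W)^(-1/2)
    return W @ fractional_matrix_power(W.conj().T @ W, -0.5)
```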
4 The Role of High Frequencies
In this paper, we aim to investigate the role of the high frequencies in convolutive blind source separation of speech signals, whose characteristics are such that little information is contained in the frequencies above a certain cut-off frequency [9], which we define in this paper as $f_c$. Here, we consider the following decomposition of the observed signal:

$$X(f,t) = X_{LFs}(f,t) + X_{HFs}(f,t) \quad (8)$$

where $X_{LFs}(f,t)$ is the STFT representation of the mixtures with the subbands corresponding to the high frequencies ($f > f_c$) set to zero, and similarly $X_{HFs}(f,t)$ has the low frequency subbands ($f \le f_c$) set to zero. Defining the recovered signal as $Y(f,t) = Y_{LFs}(f,t) + Y_{HFs}(f,t)$, the following four scenarios are considered, in which source separation is performed using MD2003:

1. on all frequency bins (MD2003): $Y(f,t) = Y_{LFs}(f,t) + Y_{HFs}(f,t)$
2. on the low frequency bins only; the high frequencies are set to zero (LF): $Y(f,t) = Y_{LFs}(f,t)$
3. on the low frequency bins; the high frequency components are extracted using a beamformer $W_{BF}(f)$ based on the DOAs estimated from the low frequency components (LF-BF): $Y(f,t) = Y_{LFs}(f,t) + W_{BF}(f)\,X_{HFs}(f,t)$
4. on the low frequency bins; the high frequency components are left mixed and are added back to the separated low frequencies prior to applying the inverse STFT (LF-HF): $Y(f,t) = Y_{LFs}(f,t) + X_{HFs}(f,t)$

Figure 1 illustrates the four methods described above.
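A minimal sketch of the decomposition (8) and of the LF-HF reconstruction (scenario 4), assuming an STFT array indexed by bin along the first axis and a cut-off bin index derived from fc:

```python
import numpy as np

def split_bands(X, fc_bin):
    """Zero out complementary bands to form X_LFs and X_HFs (Eq. (8))."""
    X_lf = X.copy(); X_lf[fc_bin:] = 0.0   # keep f <= fc only
    X_hf = X.copy(); X_hf[:fc_bin] = 0.0   # keep f > fc only
    return X_lf, X_hf

def lf_hf_reconstruct(Y_lf, X_hf):
    """Scenario 4 (LF-HF): separated low band plus the still-mixed high band."""
    return Y_lf + X_hf
```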
5 Simulation Results
In this section, we consider the separation of two speech signals, from two male speakers, sampled at 16 kHz. The sources were mixed using simulated room impulse responses, determined by the image method [2] using McGovern's RIR Matlab function (available from http://2pi.us/code/rir.m), with a room reverberation time of 160 ms. The STFT frame length used was set to 2048 in all cases. The performance of the FD-BSS method in [6] (MD2003) was compared for the four methods described in Section 4, and permutations were aligned as in [3]. We set fc = 4.7 kHz, so that the low frequency bands are between 0 and 4.7 kHz, while the high frequencies are above 4.7 kHz. This value was obtained empirically by inspecting the frequency content of the mixtures, and with the aim of ensuring that as much information as possible is preserved in the low frequencies.

Table 1. Signal-to-distortion (SDR), signal-to-interference (SIR), and signal-to-artefact (SAR) ratios for the four methods separating the source signals, for a cut-off of 4.7 kHz: at all frequencies (MD2003); at low frequencies only (LF); at low frequencies, with BF applied at high frequencies (LF-BF); at low frequencies, with the high frequencies added back still mixed (LF-HF)

Method      SDR (dB)  SIR (dB)  SAR (dB)  Listening Tests
MD2003 [6]  5.37      19.17     6.08      +++
LF          5.37      19.66     5.59      +
LF-BF       5.15      17.33     5.52      ++
LF-HF       5.04      13.16     6.14      ++++
The performance of each method was evaluated using the objective criteria of Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR) and Signal-to-Artefacts Ratio (SAR), as defined in [7]. SDR, SIR and SAR measure, respectively, the level of the total distortion in the estimated source with respect to the target source, the distortion due to interfering sources, and other remaining artefacts.
[Figure: block diagrams of the four separation schemes]

Fig. 1. Illustration of the four methods compared: (a) separation of all frequency bins (MD2003): $Y(f,t) = Y_{LFs}(f,t) + Y_{HFs}(f,t)$; (b) separation of low frequency bins only (LF): $Y(f,t) = Y_{LFs}(f,t)$; (c) separation of low frequency bins, with beamforming in the high frequencies (LF-BF): $Y(f,t) = Y_{LFs}(f,t) + W_{BF}(f)\,X_{HFs}(f,t)$; (d) separation of low frequency bins, with the high frequencies added back without separation (LF-HF): $Y(f,t) = Y_{LFs}(f,t) + X_{HFs}(f,t)$
The evaluation criteria allow for the recovered sources to be modified by a permitted distortion, and we considered a time-invariant filter of length 512 samples when calculating the performance measures. This length was chosen so that the filter would cover the reverberation time. We obtained SDR, SIR and SAR figures for the four methods, and for all sources and microphones. The results are shown in Table 1, where each single figure was produced by averaging the criteria across all microphones and all sources.

The SDRs in Table 1 show that the total distortion for all methods is essentially the same. Distortion increases for LF-HF, due to the high frequencies not being separated and therefore re-introducing some level of distortion. This is supported by the corresponding SIR figure for the same method, which shows that a higher level of interference from the other source is present. The values for SAR indicate that most artefacts are introduced when separation is performed on the low frequency (LF) components only, and when the high frequency components are extracted using beamforming (LF-BF). This is hardly surprising, since both methods can have quite severe effects on the data.

The most interesting result is observed from the SIR figures. They show that separating only the low frequency components, and truncating the high frequency ones, has the effect of removing more interference from the undesired source signal than when working with all frequencies, while not introducing any additional distortion (SDR is unchanged), although the level of artefacts present increases. This result is rather counterintuitive, as it suggests that there is little to be gained from performing separation in the high frequencies. This might be explained by the fact that source separation methods perform worse on high frequency components, which are generally lower in amplitude; using beamforming methods to deal with the permutation problem also yields poor results, due to phase ambiguity in the high frequencies [8].

Informal listening tests were performed to corroborate the outcome of the objective criteria. They indicated that the ratios are a good guide to the audible performance. The outputs of LF were found to sound the least natural among all the recovered signals, due to the high frequencies not being present, while the sources separated with LF-HF were found to sound somewhat better than the outputs of MD2003. However, the crucial point is that the outputs of all methods sounded similar in quality, suggesting that they all have similar performance. The last column in Table 1 shows a classification of the recovered sources, with the number of + signs indicating how good the quality of the separated signal is. In general, LF-HF gave the best results, and LF is the worst only because it is not as natural as the others. Nonetheless, the output of LF is equally as intelligible as the others. We can conclude from these results that performing separation in all subbands is not always the best approach. Especially for speech signals, it might be more advantageous to apply BSS only in the low frequencies, hence reducing, or even halving, the computational burden of some frequency domain algorithms.
6 Conclusions
In this paper, we discussed the role of the high frequencies in frequency domain blind source separation of speech signals. We found that when the high frequencies are ignored, the separated sources remain quite clear, although they do not always sound very natural. Our findings were supported by objective criteria and informal listening tests, which have suggested that it might be a good strategy to separate the mixtures in the low frequencies only, and then add on the high frequency components, without performing any processing on them. This approach may bring significant advantages in terms of reduced computational complexity.
References

1. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. on Speech and Audio Processing 12, 530–538 (2004)
2. McGovern, S.: A model for room acoustics (2003), available at http://2pi.us/rir.html
3. Mitianoudis, N., Davies, M.: Permutation alignment for frequency domain ICA using subspace beamforming methods. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 669–676. Springer, Heidelberg (2004)
4. Saruwatari, H., Kurita, S., Takeda, K.: Blind source separation combining frequency-domain ICA and beamforming. In: Proc. ICASSP, vol. 5, pp. 2733–2736 (2001)
5. Ikram, M., Morgan, D.: A beamforming approach to permutation alignment for multichannel frequency-domain blind speech separation. In: Proc. ICASSP, vol. 1, pp. 881–884 (2002)
6. Mitianoudis, N., Davies, M.: Audio source separation of convolutive mixtures. IEEE Trans. on Audio and Speech Processing 11, 489–497 (2003)
7. Févotte, C., Gribonval, R., Vincent, E.: BSS EVAL Toolbox User Guide, IRISA Technical Report 1706 (April 2005), http://www.irisa.fr/metiss/bss_eval/
8. Jafari, M.G., Abdallah, S.A., Plumbley, M.D., Davies, M.E.: Sparse coding for convolutive blind audio source separation. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 132–139. Springer, Heidelberg (2006)
9. Balcan, D., Rosca, J.: Independent component analysis for speech enhancement with missing TF content. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 552–560. Springer, Heidelberg (2006)
Signal Separation by Integrating Adaptive Beamforming with Blind Deconvolution

Kostas Kokkinakis and Philipos C. Loizou
Center for Robust Speech Systems, Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083, USA
{kokkinak,loizou}@utdallas.edu
Abstract. In this paper, we present a broadband two-microphone blind spatial separation technique by efficiently combining adaptive beamforming (ABF) with multichannel blind deconvolution (MBD). First, the inaccessible source signal streams are partially identified by simple time-delay steering and then are spatially separated through an MBD structure. The proposed spatio-temporal ABF-MBD algorithm exhibits fast convergence properties and high computational efficiency. Numerical experiments illustrate the practical appeal of the proposed method in separating convolutive mixtures of speech within nearly anechoic and also highly reverberant enclosures.
1 Introduction
Multi-microphone speech enhancement methods have a very strong potential for use in a variety of applications such as automatic speech recognition, hearing aid devices and hands-free telephony. In such applications, speech from various sources is often collected simultaneously over two spatially distributed microphones or through a closely-spaced multi-sensor array. The key challenge here is to develop algorithms that can maximize speech intelligibility in both anechoic and modest-to-severe reverberant scenarios by recovering and perceptually enhancing the waveform of the desired (or target) source signal, while relying only on an observed set of composite (or mixed) signals. One such prominent technique is blind source separation (BSS). Practically, BSS can blindly recover a set of unknown signals, the so-called sources, from their observed mixtures, based on very little to almost no prior knowledge about the source characteristics or the mixing structure itself. Recent work on BSS has been encouraging even in long reverberation cases. However, most BSS techniques either (1) suffer from ambiguities (e.g., scaling and permutation) in the independence criterion when the problem is treated exclusively in the frequency domain, or (2) are inherently slow and computationally inefficient when operating in the time domain, especially if in this latter case long filters are needed to reach an adequate level of separation.
Work supported in part by Grant R01–DC07527 awarded from the NIDCD/NIH.
To overcome these problems, various authors have proposed spatial filtering techniques that combine BSS with adaptive beamforming (ABF). Beamforming has long been used in many areas, such as radar, medical imaging and hearing aids [5]. By definition, the purpose of beamforming is to pick up and amplify sounds from one direction, while suppressing undesired interferences and reverberation arriving from all other directions. Exploiting the similarity between ABF and BSS, Parra and Alvino [14] resorted to beamforming and incorporated geometric constraints to solve permutations between adjacent frequency bands. Their method, called geometric source separation (GSS), was shown to work well in reverberant conditions, albeit at the expense of adding multiple sensors to estimate the directions of arrival (DOAs) of the sources. The potential of using the directivity (or gain) pattern formed by two parallel beamformers to yield maximally independent outputs was also later explored in [1] (time domain) and [3] (frequency domain). Unlike most frequency-domain methods, the latter technique did not suffer from different permutations along the frequency bands. More recently, similar techniques have also achieved a correct permutation alignment based on information acquired from a beamformer about the direction of the nulls (sidelobes) in different frequency bins (e.g., see [13], [16]).

In this paper, we use multichannel blind deconvolution (MBD) for convolutive BSS. In stark contrast to frequency-domain BSS, our MBD approach operates only partially in the z-domain and thus remains immune to permutation disparities [12]. Also, scaling indeterminacies that cause whitening are fully alleviated, to retain intelligible source contributions at the output [9], [10]. Nonetheless, MBD approaches are computationally demanding when long filters are used. To ameliorate this and speed up convergence, we use ABF to place spatial nulls at the location of the interfering sources before passing the signals through MBD.
2 Signal Model
To isolate the original or "true" sources in a multipath propagation scenario, one needs to rely solely on information extracted from the convolutive mixtures of the original signal streams $x(t) = [x_1(t), \ldots, x_m(t)]^T \in \mathbb{R}^m$ given by

$$x(t) = \sum_{\ell=0}^{\infty} H(\ell)\, s(t-\ell), \qquad t = 1, 2, \ldots \quad (1)$$

where $H(\ell)$ is the unknown linear time-invariant (LTI) multiple-input multiple-output (MIMO) mixing system that models the acoustic channel. MBD can blindly achieve the recovery of the sources s(t) by processing measurements at the sensors, such that the system outputs $u(t) = [u_1(t), \ldots, u_n(t)]^T \in \mathbb{R}^n$ read

$$u(t) = \sum_{\ell=0}^{L-1} W(\ell)\, x(t-\ell), \qquad t = 1, 2, \ldots \quad (2)$$

where $W(\ell)$ is the unmixing matrix linking the j-th source estimate $u_j(t)$ with the i-th sensor observation $x_i(t)$, composed of sufficiently long finite impulse response (FIR) filters with each element given by the vector $[w_{ji}(0), w_{ji}(1), \ldots, w_{ji}(L-1)]$ for all coefficients $0 \le \ell \le L-1$, with $j = 1, 2, \ldots, n$ and $i = 1, 2, \ldots, m$.
[Figure: signal-flow diagrams of the two Griffiths-Jim beamformer structures]

Fig. 1. Setup of two-microphone Griffiths-Jim beamformers. The microphone signals are added and subtracted to form the sum signal and the difference signal. (a) A null pattern is formed towards source S2. (b) A null pattern is formed towards source S1.
3 Beamforming with Multichannel Blind Deconvolution

3.1 Stage 1. ABF
The basis of our multi-microphone algorithm is the Griffiths-Jim beamformer [6]. Although any multiple-input single-output (MISO) algorithm can be used to adapt the filter coefficients, here we choose the least-mean-squares (LMS) algorithm to continuously adjust the filter weights. Since our purpose is to segregate two signals S1 and S2 with two microphones, we use two such MISO beamformers (see Figure 1). The ABF shown in Figure 1(a) can identify S1 by forming a null directivity pattern towards S2 using W12. Similarly, the ABF in Figure 1(b) can focus on S2 by using the spatial filter W21 to attenuate source S1. The upper ABF structure forms a sum and a difference signal by adding and subtracting the microphone signals X1 and X2. The sum signal is passed through the delay element $z^{-\Delta}$. Typically, if nothing is known a priori about the setup of the sources and the microphone locations, Δ = L/2. The difference (or interference) signal is filtered through the (L + 1)-point adaptive filter W12 to form
an interference cancellation signal, which is then subtracted from the delayed sum signal in the primary channel to form the desired source estimate U1. The weights of the filter taps are adapted using LMS to minimize the error, and ultimately yield U1. The same process is repeated while recovering source U2.
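As an illustration, a minimal NumPy sketch of one such Griffiths-Jim structure with an LMS-adapted interference-cancellation filter (all names, the 0.5 scaling of the sum channel, and the step-size are our assumptions):

```python
import numpy as np

def griffiths_jim_lms(x1, x2, L=64, mu=0.01):
    """One two-microphone Griffiths-Jim beamformer (illustrative sketch).

    The sum (primary) channel is delayed by L/2; the difference
    (interference) channel drives an LMS filter whose output is
    subtracted from the delayed primary channel."""
    delay = L // 2
    s = 0.5 * (x1 + x2)                    # sum (primary) signal
    d = x1 - x2                            # difference (interference) signal
    w = np.zeros(L + 1)                    # adaptive filter W12
    buf = np.zeros(L + 1)
    u = np.zeros(len(x1))
    for n in range(len(x1)):
        buf[1:] = buf[:-1]; buf[0] = d[n]  # shift in newest difference sample
        primary = s[n - delay] if n >= delay else 0.0
        y = w @ buf                        # interference estimate
        e = primary - y                    # cancel interference -> output
        w += mu * e * buf                  # LMS weight update
        u[n] = e
    return u
```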
3.2 Equivalence Between ABF and MBD
Focusing on the z-domain for convenience, in the 2 × 2 scenario depicted in Figure 2, we can easily deduce that the mixtures at the sensor inputs are

$$X_1(z) = H_{11}(z)\,S_1(z) + H_{12}(z)\,S_2(z)$$
$$X_2(z) = H_{21}(z)\,S_1(z) + H_{22}(z)\,S_2(z) \quad (3)$$

By observing the ABF structures in Figure 1, the source estimates are equal to

$$U_1(z) = \big[X_1(z) + X_2(z)\big] - W_{12}(z)\,\big[X_1(z) - X_2(z)\big]$$
$$U_2(z) = \big[X_1(z) + X_2(z)\big] - W_{21}(z)\,\big[X_2(z) - X_1(z)\big] \quad (4)$$

If we assume that the transfer functions from the sources to the microphones are similar (a valid assumption for closely-spaced arrays), then upon subtracting one microphone signal from the other, the contribution of the undesired source will (ideally) cancel out. Hence, the difference signal in the upper ABF structure can be approximated as $X_1(z) - X_2(z) \approx U_2(z)$, and accordingly in the lower ABF structure as $X_2(z) - X_1(z) \approx U_1(z)$. In addition, summing the microphone inputs will result in filtered versions of $X_1(z)$ (upper ABF) and $X_2(z)$ (lower ABF). Based on such simplifications, (4) reduces to

$$U_1(z) \approx X_1(z) - W_{12}(z)\,U_2(z)$$
$$U_2(z) \approx X_2(z) - W_{21}(z)\,U_1(z) \quad (5)$$
which resembles a feedback network of FIR filters in the MBD configuration. A drawback of ABF is that due to signal subtraction in the auxiliary channel, the output or target signals will almost always exhibit a typical high-pass characteristic resulting in loss of frequencies below 1 kHz (see also [8]). In order to compensate for this effect, the outputs from the beamformer can be low-pass filtered before any further processing [15]. Another drawback is that the effectiveness of ABF in realistic conditions is only limited to zero-to-moderate reverberation settings. Several authors have suggested that the presence of reverberant energy severely degrades the performance of beamforming [5]. As a general rule, the more reverberation present in the environment, the more difficulty this algorithm has in placing nulls at the locations of the undesired signals. Still, after performing spatial filtering with ABF, the tasks of multichannel separation and deconvolution are expected to become easier, leading to a significant increase in separation performance and faster convergence to the optimal filter coefficients.
[Figure: block diagram of the cascaded 2 × 2 mixing system H, the two ABF stages, and the MBD unmixing stage W]

Fig. 2. Cascaded mixing and unmixing system configuration in the two-source and two-sensor scenario when integrating two ABF structures (see Figure 1) with MBD
3.3 Stage 2. MBD
Based on the isomorphic mapping between scalar and FIR polynomial matrices (e.g., see [12]), several adaptation rules derived from the entropy maximization principle [4] have been extended to MBD. One such efficient update is the linear prediction-based NGA (LP-NGA) algorithm (e.g., see [9], [10]), which stems from the natural gradient algorithm (NGA) introduced by Amari et al. [2]. As shown in Figure 2, the partly identified signal estimates $U_1(z)$ and $U_2(z)$ obtained from the ABF stage are then fed as inputs to the LP-NGA algorithm, which reads

$$W_{k+1}(z) = W_k(z) + \mu\, \Delta W_k(z) \quad (6)$$

where W(·) is the unmixing FIR polynomial matrix, μ denotes the step-size, and

$$\Delta W_k(z) = \left( \begin{bmatrix} \bar{1} & \bar{0} \\ \bar{0} & \bar{1} \end{bmatrix} - \mathrm{FFT}[\varphi(u)]\, u^H \right) \begin{bmatrix} W_{11}(z) & W_{12}(z) \\ W_{21}(z) & W_{22}(z) \end{bmatrix} \quad (7)$$

Here, $W_{ji}(\cdot)$ are the unmixing FIR filters in the z-domain, $(\cdot)^H$ is the Hermitian operator, the matrix composed of a sequence of all ones ($\bar{1}$) on the main diagonal and all zeros ($\bar{0}$) elsewhere is the identity (unit) FIR matrix, whereas the term $\mathrm{FFT}[\varphi(u)]$ denotes the score function vector $\varphi(u)$, operating in the time domain:

$$\varphi_i(u_i) = -\frac{d}{du_i} \log p_{u_i}(u_i), \qquad i = 1, 2, \quad (8)$$

with the ABF-MBD spatially separated source outputs written as

$$u_{MBD}(z) = [U_1(z), U_2(z)]^T = W(z)\, u_{ABF}(z) \quad (9)$$
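A loose sketch of how one NGA-style iteration of (6)–(7) might look, with the identity FIR matrix realized bin-wise in the DFT domain, a sign function in place of the score (8), and a sample average in place of the expectation. These are all simplifying assumptions of ours; the actual LP-NGA algorithm of [9], [10] additionally includes a linear-prediction stage not shown here:

```python
import numpy as np

def nga_style_step(W, U, mu=0.1):
    """One sketched NGA-style update, cf. Eqs. (6)-(7).

    W : (2, 2, F) complex, unmixing FIR matrix, one 2x2 matrix per DFT bin
    U : (2, T) real, current source estimates from the ABF stage
    """
    F = W.shape[2]
    phi = np.sign(U)                        # stand-in for the score function (8)
    Phi = np.fft.fft(phi, n=F, axis=1)      # FFT[phi(u)]
    Uf = np.fft.fft(U, n=F, axis=1)
    dW = np.empty_like(W)
    for f in range(F):                      # bin-wise (I - FFT[phi(u)] u^H) W
        C = np.eye(2) - np.outer(Phi[:, f], np.conj(Uf[:, f])) / U.shape[1]
        dW[:, :, f] = C @ W[:, :, f]
    return W + mu * dW                      # Eq. (6)
```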
4 Experimental Results
The source signals were sentences of one male and one female speaker, approximately 3 s in duration, recorded at a sampling rate of 8 kHz. In total, we processed 20 speech stimuli taken from the IEEE database [7]. The sound level of each individual source was also adjusted relative to the fixed level of the other, yielding a signal-to-interference ratio (SIR) equal to 0 dB. Both speech signals had the same onset, and where necessary were edited to have an equal duration.

4.1 Experiment 1. Short Reverberation
A set of head-related transfer functions (HRTFs) was used to simulate speech mixtures under moderately reverberant scenarios. The length of the HRTFs was 256 samples, amounting to a small delay of 32 ms and very short reverberation.

4.2 Experiment 2. Long Reverberation
The speech signals were convolved with a set of binaural room impulse responses (BRIRs) (e.g., see [17]). In contrast to the relatively smooth and nearly free-field HRTFs used in Experiment 1, these exhibit rapid variations both in phase and magnitude and are, in general, difficult to invert with FIR filters [11]. The BRIRs were measured in a 5 × 9 × 3.5 m classroom using a KEMAR positioned at 1.5 m above the floor at ear level [17]. By convolving the speech signals with the premeasured impulse responses, one source is placed directly to the front and the other at an angle of 30° to the right, while both are 1 m away from the KEMAR. In this case, the broadband reverberation time of the room is T60 = 300 ms.

4.3 Results and Discussion
The convolutive speech mixtures were processed with four different spatial processing schemes: ABF only ('ABF'), MBD only with a single pass (on-line mode) ('MBD [1]'), MBD only with 10 passes ('MBD [10]') corresponding to 30 s of total training time (off-line mode), and the new processing scheme merging ABF with MBD ('ABF-MBD'). Note that in the latter case, the algorithm was allowed only one pass through the data. The algorithms were executed with 128- and 512-sample point adaptive FIR filters for Experiments 1 and 2, respectively, and a fixed step-size maximized up to the stability margin. The overlap between successive frames (or blocks) of data was set to 50%. To assess the separation ability of the algorithms in different reverberation scenarios, we used the signal-to-interference-ratio improvement (SIRI) and measured the overall amount of crosstalk reduction in dB, before (SIRi) and after (SIRo) separation, as in [11]. The SIRI values averaged for both sources and across all sentences are plotted in Figure 3 (Experiment 1) and Figure 4 (Experiment 2). According to Figure 3, the MBD algorithm yields a substantial improvement in SIR when allowed multiple passes through the mixed speech data. Separation performance for ABF-MBD was slightly lower (around 25%) compared to MBD with 10 passes. The
[Figure: horizontal bar chart of mean SIRI (dB) for MBD (10), ABF-MBD, MBD (1), and ABF at T60 = 5 ms]

Fig. 3. Mean SIRI values (dB) for 10 IEEE sentences (T60 = 5 ms). Error bars indicate standard deviations.
[Figure: horizontal bar chart of mean SIRI (dB) for MBD (10), ABF-MBD, MBD (1), and ABF at T60 = 300 ms]

Fig. 4. Mean SIRI values (dB) for 10 IEEE sentences (T60 = 300 ms). Error bars indicate standard deviations.
performance of MBD with one pass was significantly lower than ABF-MBD for the same amount of training. Also, ABF alone only partially managed to recover the source estimates and provided a marginal improvement. By observing the results shown in Figure 4, we notice that the overall separation performance decreases when the reverberation energy increases, for all spatial separation schemes tested here. MBD performs fairly well, but only with an adequate amount of training. In contrast, the benefit of ABF was found to be negligible. The separation performance obtained with ABF-MBD is still lower by about 20% than the
one obtained with MBD, albeit this new adaptive processing strategy requires much less training and uses relatively short FIR filters to equalize long BRIRs.
5 Conclusions
The joint use of ABF and MBD has been shown to reduce computational demands without severely compromising separation performance. The proposed ABF-MBD algorithm can achieve a satisfactory SIR improvement and can be implemented in on-line mode. ABF-MBD requires only a single pass through the data and therefore is potentially amenable to real-time implementation. Experimental results reveal an equally encouraging performance in both nearly anechoic and highly reverberant settings. The potential of this technique when operating in a subband processing scheme is currently under investigation (e.g., see [11]).
References

1. Aichner, R., Araki, S., Makino, S., Nishikawa, T., Saruwatari, H.: Time domain blind source separation of non-stationary convolved signals by utilizing geometric beamforming. In: Proc. 12th IEEE Int. Workshop on Neural Networks for Signal Process., Martigny, Valais, Switzerland, September 4-6, pp. 445–454. IEEE Computer Society Press, Los Alamitos (2002)
2. Amari, S., Cichocki, A., Yang, H.: A new learning algorithm for blind signal separation. In: Advances in Neural Information Processing Systems, vol. 8, pp. 757–763. MIT Press, Cambridge, MA (1996)
3. Baumann, W., Kolossa, D., Orglmeister, R.: Beamforming-based convolutive source separation. In: Proc. 28th IEEE Int. Conf. on Acoust., Speech and Signal Process., Hong Kong, April 6-10, pp. 357–360 (2003)
4. Bell, A.J., Sejnowski, T.J.: An information maximization approach to blind separation and blind deconvolution. Neural Computat. 7(6), 1129–1159 (1995)
5. Greenberg, Z.E., Zurek, P.M.: Evaluation of an adaptive beamforming method for hearing aids. J. Acoust. Soc. Am. 91(3), 1662–1676 (1992)
6. Griffiths, L.J., Jim, C.W.: An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propag. 30(1), 27–34 (1982)
7. IEEE Subcommittee: IEEE recommended practice speech quality measurements. IEEE Trans. Audio Electroacoust. 17(3), 225–246 (1969)
8. Joho, M., Moschytz, G.S.: On the design of the target-signal filter in adaptive beamforming. IEEE Trans. Circuits Syst. 46(7), 963–966 (1999)
9. Kokkinakis, K., Nandi, A.K.: Optimal blind separation of convolutive audio mixtures without temporal constraints. In: Proc. 29th IEEE Int. Conf. on Acoust., Speech and Signal Process., Montréal, Canada, May 17-21, pp. 217–220 (2004)
10. Kokkinakis, K., Nandi, A.K.: Multichannel blind deconvolution for source separation in convolutive mixtures of speech. IEEE Trans. Audio, Speech, Lang. Process. 14(1), 200–212 (2006)
11. Kokkinakis, K., Loizou, P.C.: Subband-based blind signal processing for source separation in convolutive mixtures of speech. In: Proc. 32nd IEEE Int. Conf. on Acoust., Speech and Signal Process., Honolulu, Hawaii, April 15-20, pp. 917–920 (2007)
12. Lambert, R.H.: Multichannel Blind Deconvolution: FIR Matrix Algebra and Separation of Multipath Mixtures. Ph.D. Thesis, University of Southern California (May 1996)
13. Mitianoudis, N., Davies, M.: Permutation alignment for frequency domain ICA using subspace beamforming methods. In: Proc. 5th Int. Conf. on ICA and BSS, Granada, Spain, September 22-24, pp. 669–676 (2004)
14. Parra, L., Alvino, C.V.: Geometric source separation: Merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process. 10(6), 352–362 (2002)
15. Puder, H.: Adaptive signal processing for interference cancellation in hearing aids. Signal Process. 86(6), 1239–1253 (2006)
16. Saruwatari, H., Kawamura, T., Nishikawa, T., Lee, A., Shikano, K.: Blind source separation based on a fast-convergence algorithm combining ICA and beamforming. IEEE Trans. Speech Audio Process. 14(2), 666–678 (2006)
17. Shinn-Cunningham, B., Kopco, N., Martin, T.: Localizing nearby sound sources in a classroom: Binaural room impulse responses. J. Acoust. Soc. Am. 117(5), 3100–3115 (2005)
Blind Signal Deconvolution as an Instantaneous Blind Separation of Statistically Dependent Sources

Ivica Kopriva
Rudjer Bošković Institute, Bijenička cesta 54, P.O. Box 180, 10002 Zagreb, Croatia
[email protected]
Abstract. We propose a novel approach to blind signal deconvolution. It is based on the approximation of the source signal by a Taylor series expansion and the use of a filter bank-like transform to obtain a multichannel representation of the observed signal. Currently, as an ad hoc choice, a wavelet packets filter bank has been used for that purpose. This leads to a multi-channel instantaneous linear mixture model (LMM) of the observed signal and its temporal derivatives, converting the single channel blind deconvolution (BD) problem into an instantaneous blind source separation (BSS) problem with statistically dependent sources. The source signal is recovered provided it is a non-Gaussian, non-stationary and non-independent identically distributed (i.i.d.) process. An important property of the proposed approach is that the order of the channel filter does not have to be known or estimated. We demonstrate the viability of the proposed concept by blind deconvolution of speech and music signals passed through a linear low-pass channel.

Keywords: Blind deconvolution, Blind source separation, Independent component analysis, Instantaneous mixture model, Statistically dependent sources.
1 Introduction

The problem of single channel BD is to reconstruct the original signal from its filtered version, also termed the observed signal, where only the observed signal is available. Neglecting the noise term, the process is modeled as a convolution of the unknown causal channel impulse response h(t) with an original source signal s(t) as

$$x(t) = \sum_{\tau=0}^{T} h(\tau)\, s(t-\tau) \quad (1)$$

where T denotes the order of the channel filter. Standard algorithms for blind deconvolution are capable of recovering the source signal s(t) based on the observed signal x(t) only, provided that s(t) is a non-Gaussian i.i.d. process [1]. We shall demonstrate here that the proposed concept is capable of blind deconvolution of signals with colored statistics, such as speech. The original signal s(t − τ) can be approximated by a Taylor series expansion around s(t), giving

$$s(t-\tau) = \sum_{n=0}^{N} \frac{(-\tau)^n}{n!}\, s^{(n)}(t) + \text{H.O.T.} \quad (2)$$
where $s^{(n)}(t)$ denotes the n-th order temporal derivative of s(t) and H.O.T. denotes higher-order terms. It is assumed that $s^{(0)}(t) = s(t)$. Inserting (2) into (1) yields

$$x(t) \cong \sum_{n=0}^{N} a_{1(n+1)}\, s^{(n)}(t) \quad (3)$$

where $a_{11} = \sum_{\tau=0}^{T} h(\tau)$, $a_{12} = -\sum_{\tau=0}^{T} \tau\, h(\tau)$, $a_{13} = \sum_{\tau=0}^{T} (\tau^2/2)\, h(\tau)$, etc. Evidently, the quality of the approximations (2) and (3) depends on the number of terms in the Taylor series expansion of the source signal s(t). However, x(t) in (3) can also be obtained as an inverse Fourier transform of the expression $H(j\omega) S(j\omega)$, where $H(j\omega)$ and $S(j\omega)$ respectively represent the Fourier transforms of the channel impulse response and the source signal. Owing to the fact that h(t) is an aperiodic sequence, $H(j\omega)$ is obtained as

$$H(j\omega) = \sum_{\tau=0}^{T} h(\tau)\, e^{-j\omega\tau} \cong \sum_{\tau=0}^{T} h(\tau) \;-\; j\omega \sum_{\tau=0}^{T} \tau\, h(\tau) \;+\; \frac{(j\omega)^2}{2} \sum_{\tau=0}^{T} \tau^2\, h(\tau) \quad (4)$$

which yields

$$X(j\omega) \cong \Big(\sum_{\tau=0}^{T} h(\tau)\Big) S(j\omega) \;-\; \Big(\sum_{\tau=0}^{T} \tau\, h(\tau)\Big) j\omega\, S(j\omega) \;+\; \Big(\sum_{\tau=0}^{T} \tau^2\, h(\tau)\Big) \frac{(j\omega)^2}{2}\, S(j\omega) \quad (5)$$
Evidently, the number of terms in the expansions (4) and (5) depends on a property of the channel, the size of the support T of the impulse response h(t), but also on a property of the signal, the size of its support Ω in the frequency domain, i.e. $S(j\omega) \cong 0$ for $\omega > \Omega$. For example, for either T = 0 or Ω = 0, relation (3) and the inverse Fourier transform of (5) yield the same result. Thus, channels with a maximal delay that is small relative to the coherence time of the signal, i.e. T << (2π/Ω), will demand a small number of terms, N, in the approximation (3), and vice versa.

Taylor series expansion has already been used in [2]-[7] to convert the multichannel convolutive BSS problem into an instantaneous BSS problem. Two cases can be distinguished. In [2]-[5] the authors assumed a sensor array that is smaller than the shortest wavelength of the sources. This allows keeping only the first-order derivative in the Taylor series expansion in Eq. (2). This is due to the fact that the delay is defined relative to the center of the array and is therefore always smaller than the coherence time of the sources. Under this assumption, another array that calculates spatial gradients of the observed signal converts the convolutive BSS problem into an instantaneous BSS problem with the first-order temporal derivatives of the source signals acting as sources. Once they are recovered by instantaneous ICA, the true sources are obtained by their temporal integration. In [6] and [7] the Taylor series expansion is also used to convert the multichannel convolutive BSS problem into an instantaneous BSS problem. In [6] it is assumed that the delay T is smaller than the coherence time of the source signals, which allows using only the first-order temporal derivative in the Taylor series expansion Eq. (2). Assuming that the signal and its first-order derivative are statistically independent, which is actually proven for stationary signals only [8][9], the instantaneous BSS problem is solved by some of the standard ICA methods. However, the assumption that the delay is smaller than the coherence time of the source signals is too restrictive for
realistic reverberant environments. That was realized in [7]. In that case, higher-order temporal derivatives exist in the Taylor series expansion Eq. (2), and they are statistically dependent. An algorithm is derived in [7] for grouping dependent sources and extracting source signals from each group.

The algorithm proposed here solves the single channel BD problem by converting it into an instantaneous BSS problem with statistically dependent sources. No special assumption is made on the amount of delay. Thus, higher-order derivatives in the Taylor series expansion are allowed. The problem of their statistical dependence is solved by means of an independence enhancement technique, which is based on innovations of the multichannel version of the observed signal. However, another transform, such as high-pass filtering [10], may be used for the independence enhancement purpose as well.

A BD algorithm capable of recovering temporally dependent signals is derived in [11]. It is based on the measure of temporal predictability and the argument that the output of a low-pass channel is smoother and therefore more predictable than the input to the channel. Thus, the BD problem is formulated as a temporal predictability minimization problem and numerically solved as a general eigenvalue problem. An equivalent solution of the instantaneous BSS problem by looking for the maximum of the temporal predictability is defined in [12]. In relation to the proposed Taylor series expansion BD method, the temporal predictability approach suffers from the fact that the order of the deconvolution filter has to be defined based on some a priori knowledge. Because the order of the generalized eigenvalue problem equals the order of the deconvolution filter, the temporal predictability based algorithm can become numerically very demanding. Temporal predictability itself is defined for the deconvolved signal $y(t) = \sum_{k=0}^{M} w(k)\, x(t-k)$ as

$$F(\{y(t)\}) = \log \frac{V(\{y(t)\})}{U(\{y(t)\})} = \log \frac{\sum_{t}^{t_{max}} \big(\bar{y}(t) - y(t)\big)^2}{\sum_{t}^{t_{max}} \big(\tilde{y}(t) - y(t)\big)^2} \quad (6)$$
where V reflects the extent to which y(t) is predicted by the long-term moving average $\bar{y}(t)$, and U reflects the extent to which y(t) is predicted by the short-term moving average $\tilde{y}(t)$ [12][11].
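A minimal sketch of the predictability measure (6); the use of exponential moving averages and the specific window spans are assumptions of ours, not taken from [11][12]:

```python
import numpy as np

def temporal_predictability(y, short=8, long_=512):
    """Stone-style temporal predictability F = log(V/U), cf. Eq. (6)."""
    def ema(x, span):
        a = 2.0 / (span + 1.0)
        out = np.empty_like(x)
        acc = x[0]
        for i, v in enumerate(x):
            acc = (1 - a) * acc + a * v   # exponential moving average
            out[i] = acc
        return out
    y_long = ema(y, long_)    # long-term prediction  -> V term
    y_short = ema(y, short)   # short-term prediction -> U term
    V = np.sum((y_long - y) ** 2)
    U = np.sum((y_short - y) ** 2)
    return np.log(V / U)
```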
2 Formulation of the Instantaneous Linear Mixture Model

We now apply a filter bank-like transform on (3) in order to obtain a multichannel representation, x, of the observed signal x(t). It is the matter of further analysis to find out which type of transform is optimal. Here, in order to illustrate the concept, as an ad hoc choice we have used a non-decimated wavelet packets filter bank with two decomposition levels, which results in L = 6 filters. In order to have clear notation, let us introduce $x_1(t) = x(t)$. When the filters are applied on the observed signal x(t), we obtain

$$x_{l+1}(t) \cong a_{l+1,1}\, s(t) + a_{l+1,2}\, s^{(1)}(t) + a_{l+1,3}\, s^{(2)}(t), \qquad l = 1, \ldots, L \quad (7)$$

where $a_{l+1,1} = \sum_{\tau=0}^{\tilde{T}} h_l(\tau)$, $a_{l+1,2} = -\sum_{\tau=0}^{\tilde{T}} \tau\, h_l(\tau)$, $a_{l+1,3} = \sum_{\tau=0}^{\tilde{T}} (\tau^2/2)\, h_l(\tau)$, where $h_l(t)$ represents the convolution of the appropriate l-th filter with h(t), $\tilde{T} = T + M + 2$, and M is the order of the filter. The observed signal and its filtered versions can be represented in the form of the following instantaneous LMM:

$$x(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{L+1}(t) \end{bmatrix} \cong \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1,N+1} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2,N+1} \\ \vdots & \vdots & \vdots & & \vdots \\ a_{L+1,1} & a_{L+1,2} & a_{L+1,3} & \cdots & a_{L+1,N+1} \end{bmatrix} \begin{bmatrix} s(t) \\ s^{(1)}(t) \\ \vdots \\ s^{(N)}(t) \end{bmatrix} = A\, s(t) \quad (8)$$

where $x \in \mathbb{R}^{(L+1)\times K}$, $A \in \mathbb{R}^{(L+1)\times(N+1)}$, $s \in \mathbb{R}^{(N+1)\times K}$, K represents the number of samples and N represents the unknown number of temporal derivatives of the source signal. We have used inspection of the singular values of the sample data covariance matrix $\hat{R}_{xx} = (1/K)\, x x^T$ to estimate the overall number of sources, N+1. ICA algorithms can be applied to the LMM given by Eq. (8) in order to extract the source signal s(t), with the benefit that the order T of the channel impulse response h(t) is absorbed in the mixing matrix A and does not have to be known or estimated. The source signals have to be non-Gaussian and statistically independent but not i.i.d. This has an important practical consequence, because BD of signals with colored statistics is possible. This is demonstrated in Section 4, where simulation results are presented.
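A minimal sketch of building the multichannel representation of Eq. (8) and of the singular-value inspection. Random FIR filters are a stand-in for the non-decimated wavelet packets bank used in the paper, and the detection threshold is our crude choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)           # stand-in for the observed signal x(t)
bank = rng.standard_normal((6, 16))     # hypothetical L = 6 FIR filters

# Multichannel representation: x itself plus its filtered versions
rows = [x] + [np.convolve(h, x)[:x.size] for h in bank]
X = np.vstack(rows)                     # shape (L+1, K)

# Estimate the number of sources N+1 from singular values of (1/K) x x^T
Rxx = X @ X.T / X.shape[1]
svals = np.linalg.svd(Rxx, compute_uv=False)
n_sources = int(np.sum(svals / svals[0] > 1e-3))   # crude threshold (assumption)
```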
3 Statistical Properties of the Source Signal: Implications to Deconvolution Results

We reproduce here results and conditions from [8][14] necessary for the stochastic differentiability of the random source signal s(t). We emphasize that conclusions drawn from this analysis can in principle be generalized to the blind image deconvolution problem, due to the existence of space filling curves (Peano-Hilbert curves) that enable 2D-to-1D mapping and vice versa while preserving local or neighborhood statistics [13]. First we present two important results that relate (non-)stationarity and linear signal representation. If the signal s(t) is stationary, it can be represented by the linear time-invariant generative model

$$s(t) = \sum_{\nu=0}^{\infty} b(\nu)\, \varepsilon(t-\nu) \quad (9)$$

where ε(t) is an i.i.d. driving signal. If the signal s(t) is non-stationary, the linear signal model becomes time-variant:

$$s(t) = \sum_{\nu=0}^{\infty} b(t,\nu)\, \varepsilon(t-\nu) \quad (10)$$

The first-order derivative $s^{(1)}(t)$ of the stationary signal s(t) is defined if the first-order derivative of the autocorrelation function at time lag zero is zero, i.e. $\rho_s^{(1)}(0) = 0$ [8]. $\rho_s^{(1)}(0)$ is always zero for a non-i.i.d. process, due to the symmetry of $\rho_s(\tau)$. According to [8], the stronger condition for the existence of $s^{(1)}(t)$ is $\rho_s^{(2)}(0) \neq 0$. If this is true, then from [9] it is also true that $\rho_s^{(2)}(\tau) \neq 0,\ \forall\tau$. Analogously, the condition for the existence of $s^{(2)}(t)$
assumes $\rho_s^{(3)}(0) \neq 0$. If the first-order derivative of the stationary signal s(t) exists, then [8]

$$E\big[ s(t)\, s^{(1)}(t) \big] = 0 \quad (11)$$
where E represents mathematical expectation. We now interpret these results for three types of the source signal s(t).

Source signal is a stationary i.i.d. process. In this case the condition $\rho_s^{(1)}(0) = 0$ is not fulfilled. The reason is that the autocorrelation function of an i.i.d. process is a delta function, i.e. $\rho_s(\tau) = \sigma_s^2 \delta_\tau$. Therefore, the Taylor series expansion (2) for such a signal does not exist. Consequently, the LMM model (8) also does not exist. Thus, i.i.d. signals cannot be blindly deconvolved by the proposed algorithm. However, this is not a drawback, since a number of blind deconvolution methods solve this problem [1][15].

Source signal is a stationary non-i.i.d. process. As stated above, such a signal has a first-order derivative. Under the previously defined conditions, the second-order derivative also exists. However, we have to emphasize that stationary signals, which are represented by the linear time-invariant generative signal model (9), can also not be blindly deconvolved by the proposed algorithm. Assuming that b(t) represents the impulse response of the linear time-invariant signal generative model, it is impossible to distinguish the channel filter h(t) from the linear convolution of the channel filter and the modeling filter h(t)*b(t). Thus, the proposed algorithm will deconvolve the i.i.d. driving sequence ε(t), i.e. the algorithm will have a whitening effect on the stationary non-i.i.d. signal.

Source signal is a non-stationary and non-i.i.d. process. Although the conditions required for stochastic differentiability are derived for stationary signals only, we can use the linear generative model of the non-stationary signal (10) and derive derivatives of the non-stationary signal s(t), provided that the time-varying filter b(t,ν) is stationary with respect to the independent variable t. In such a case we define

$$s^{(m)}(t) = \sum_{\nu=0}^{\infty} b^{(m)}(t,\nu)\, \varepsilon(t-\nu) \quad (12)$$
where $b^{(m)}(t,\nu) = d^m b(t,\nu)/dt^m$. Thus, the Taylor series expansion (2) and the LMM (8) do exist. However, we cannot draw conclusions regarding statistical independence between s(t), $s^{(1)}(t)$, $s^{(2)}(t)$, etc., as was the case with a stationary signal, (11). Thus, it is justified to use some of the methods derived to enhance statistical independence between the hidden variables in the LMM (8). One method that is computationally efficient is based on innovations [16]. It is known that innovations are more non-Gaussian and more statistically independent than the original processes. These conditions are of essential importance for the success of ICA algorithms. The innovation process of the hidden components of s is

$$\tilde{s}_n(t) = s_n(t) - E\big[ s_n(t) \,\big|\, s_n(t-1), s_n(t-2), \ldots \big], \qquad s_n \in \{s, s^{(1)}, s^{(2)}, \ldots\} \quad (13)$$

where the second term in Eq. (13) represents the conditional expectation. If both sides of (13) are multiplied by the unknown basis matrix A, we obtain

$$\tilde{x}(t) = A\,\tilde{s}(t) \quad (14)$$
Eq. (14) implies that innovations preserve the basis matrix A. The innovations-based multichannel model (14) enables a more accurate estimation of the mixing matrix A by means of ICA algorithms than when ICA algorithms are applied directly to the LMM (8). The expectation is in practice replaced by an autoregressive (AR) model of finite order, yielding

$$\tilde{x}_l(t) = \sum_{j=0}^{J} g_j\, x_l(t-j) \quad (15)$$
where J represents the order of the AR model and $g_0 = 1$. The coefficients of the prediction-error filter $g_j$ are efficiently estimated by means of Levinson's algorithm [17]. We identify L+1 filters for the LMM model (8) and obtain the prediction-error filter in (15) as an average of all identified filters. The hidden variables are then recovered by applying the Moore-Penrose pseudoinverse $A^\dagger$ to the originally observed process x. The temporal predictability measure Eq. (6) could be used as a criterion for the selection of the recovered source signal $\hat{s}(t)$ after solution of the BSS problem (8).
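A minimal sketch of computing the innovations (15), with the AR coefficients obtained from the Yule-Walker equations via a Levinson-type Toeplitz solver. Fitting one filter per channel, instead of averaging the filters over all channels as the paper does, is a simplification of ours:

```python
import numpy as np
from scipy.linalg import solve_toeplitz   # Levinson-type Toeplitz solver

def innovations(X, J=10):
    """Prediction-error (innovation) signals of each channel, cf. Eq. (15).

    X : (channels, samples) array."""
    out = np.zeros_like(X)
    for l, x in enumerate(X):
        # biased autocorrelation estimates r[0..J]
        r = np.correlate(x, x, mode='full')[x.size - 1:x.size + J] / x.size
        a = solve_toeplitz((r[:J], r[:J]), r[1:J + 1])   # AR coefficients
        g = np.concatenate(([1.0], -a))                  # prediction-error filter
        out[l] = np.convolve(g, x)[:x.size]
    return out
```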
4 Simulation Results

We have conducted the following experiments: BD of a female speech signal and BD of choir singing, each passed through a lowpass channel. A 2nd order Butterworth lowpass filter has been used to model the channel response. Figure 1 shows one hundred time points of the true female speech source signal, the signal recovered by the temporal predictability based algorithm [12], and the signal recovered by the proposed algorithm. For the temporal predictability based algorithm we have shown the best result obtained after experimenting with several values for the order of the deconvolution filter. The normalized correlation coefficients between the source and mixed signal, between the source signal and the signal recovered by the proposed algorithm, and between the source signal and the signal recovered by the algorithm of [12] are, respectively, 0.71774, 0.88658 and 0.75476. Spectrograms of these signals are shown in Figure 2. Regarding the choir-singing signal, the normalized correlation coefficients, in the same order as before, were 0.5276, 0.86152 and 0.84015.

Fig. 1. One hundred time samples of the source signal (solid), the signal recovered by the temporal predictability based algorithm (dashed) [12], and by the proposed algorithm based on the Taylor series expansion (dotted)

Fig. 2. From top to bottom: spectrograms of the source signal, the observed signal, the signal recovered by the temporal predictability based algorithm [12], and the signal recovered by the proposed algorithm
5 Conclusion

A novel single channel BD algorithm has been formulated. It is based on the approximation of the source signal by a Taylor series expansion and the use of a filter bank-like transform to yield a multichannel representation of the observed single-sensor signal. This yields an instantaneous LMM and converts the single channel BD problem into an instantaneous BSS problem with statistically dependent sources, with the important property that the channel order does not have to be known. It has been shown that a signal amenable to BD by the proposed method must be a non-stationary, non-i.i.d., non-Gaussian process. As yet unresolved issues remain: the optimality of the linear transforms used to yield a multivariate representation of the observed signal, and the efficiency of the linear transforms used to enhance statistical independence among the hidden variables of the LMM. The latter issue might affect the performance of the proposed algorithm when degradations are strong, as can be expected for real acoustic channels.
Acknowledgment. Part of this work was supported through grant 098-0982903-2558 funded by the Ministry of Science, Education and Sport, Republic of Croatia. Initial suggestion by Professor Andrzej Cichocki is also gratefully acknowledged.
References

1. Haykin, S. (ed.): Unsupervised Adaptive Filtering – Blind Deconvolution, vol. II. John Wiley, Chichester (2000)
2. Cauwenberghs, C., Stanacevic, M., Zweig, G.: Blind Broadband Localization and Separation in Miniature Sensor Arrays. In: Proc. International Symposium on Circuits and Systems (ISCAS 2001), Sydney, Australia, pp. 193–196 (2001)
3. Stanacevic, M., Cauwenberghs, C., Zweig, G.: Gradient flow broadband beamforming and source separation. In: Proc. ICA 2001, San Diego, pp. 49–52 (2001)
4. Stanacevic, M., Cauwenberghs, C., Zweig, G.: Gradient flow adaptive beamforming and signal separation in a miniature microphone array. In: Proc. ICASSP, pp. 4016–4019 (2002)
5. Stanacevic, M., Cauwenberghs, C.: Gradient Flow Bearing Estimation with Blind Identification of Multiple Signals and Interference. In: Proc. International Symposium on Circuits and Systems, vol. 5, pp. 5–8 (2004)
6. Barrére, J., Chabriel, G.: A Compact Sensor Array for Blind Separation of Sources. IEEE Trans. Circuits and Systems-I: Fundamental Theory and Applications 49, 565–574 (2002)
7. Chabriel, G., Barrére, J.: An Instantaneous Formulation of Mixtures for Blind Separation of Propagating Waves. IEEE Trans. on Signal Processing 54, 49–58 (2006)
8. Priestley, M.B.: Spectral Analysis and Time Series. Academic Press, London (1981)
9. Yaglom, A.M.: Introduction to the Theory of Stationary Random Functions. Prentice-Hall, Englewood Cliffs (1962)
10. Cichocki, A., Georgiev, P.: Blind source separation algorithms with matrix constraints. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E86-A, 522–531 (2003)
11. Stone, J.V.: Blind Source Separation Using Temporal Predictability. Neural Computation 13, 1559–1574 (2001)
12. Stone, J.V.: Blind deconvolution using temporal predictability. Neurocomputing 49, 79–86 (2002)
13. Lam, W.M., Shapiro, J.M.: A Class of Fast Algorithms for the Peano-Hilbert Space Filling Curve. In: Proc. IEEE International Conference on Image Processing (ICIP-94), vol. 1, pp. 638–641 (1994)
14. Kopriva, I.: Approach to Blind Image Deconvolution by Multiscale Subband Decomposition and Independent Component Analysis. Journal of the Optical Society of America A 24, 973–983 (2007)
15. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley, Chichester (2002)
16. Hyvärinen, A.: Independent component analysis for time-dependent stochastic processes. In: Proc. International Conference on Artificial Neural Networks (ICANN'98), Skövde, Sweden, pp. 541–546 (1998)
17. Orfanidis, S.J.: Optimum Signal Processing – An Introduction, 2nd edn. MacMillan Publishing Company, New York (1988)
Solving the Permutation Problem in Convolutive Blind Source Separation

Radoslaw Mazur and Alfred Mertins
Institute for Signal Processing, University of Lübeck, 23538 Lübeck, Germany
{mazur,mertins}@isip.uni-luebeck.de
This work has been supported by the German Research Foundation under Grant No. ME 1170/1.
Abstract. This paper presents a new algorithm for solving the permutation ambiguity in convolutive blind source separation. When transformed to the frequency domain, the source separation problem reduces to independent instantaneous separation in each frequency bin, which can be efficiently solved by existing algorithms. But this independency leads to the problem of correct alignment of these single bins which is still not entirely solved. The algorithm proposed in this paper models the frequency-domain separated signals using the generalized Gaussian distribution and utilizes the small deviation of the exponent between neighboring bins for the detection of correct permutations.
1 Introduction
Blind Source Separation (BSS) is used to recover signals from observed mixtures without prior knowledge of the sources or the mixing system. For the case of linear instantaneous mixtures, a number of different efficient approaches have been proposed [1,2]. When aiming at real-world mixtures of audio signals like speech, the situation becomes much more difficult. In this case, the mixing process is convolutive and can be modeled using FIR filters, where, for realistic scenarios, the length of these filters can be up to several thousand taps. The unmixing then has to be done using FIR filters of similar length. It is possible to calculate such filters directly in the time domain [3,4], but this approach suffers from high computational cost and difficulties of convergence. The most successful approach is to transform the signals to the frequency domain, where the convolution becomes a multiplication [5]. Then the separation can be done independently in each frequency bin, which is a much simpler task. The major drawback of this approach is that the separated bins usually have different scaling and are arbitrarily permuted. Therefore they have to be correctly equalized and aligned, because otherwise the entire process of separation will fail. While it is possible to obtain a proper scaling for the frequency components [6], there is still no algorithm that can tackle the permutation problem in all cases. One idea for solving the permutation problem is based on the assumption that neighboring bins have a similar time structure [7]. Correlation coefficients for neighboring
(This work has been supported by the German Research Foundation under Grant No. ME 1170/1.)
Correlation coefficients for neighboring bins then yield a criterion for the correct permutation. Another approach uses the unmixing matrices as beamformers: after computing the directions of arrival for all bins, most of them can be aligned properly [8]. Unfortunately, if there are more than two sensors in a nonuniform array, the computation becomes very difficult. In this paper, we present a new approach for solving the permutation problem based solely on the statistics of the signals. The new algorithm models the single frequency bins using the generalized Gaussian distribution (GGD) and utilizes the small changes of the shape parameter of the GGD between neighboring bins.
2 Model and Methods

2.1 BSS for Instantaneous Mixtures
In the instantaneous case, the mixing process of N sources into N observations can be modeled by an N × N matrix A. Given the source vector $s(n) = [s_1(n), \ldots, s_N(n)]^T$ and assuming negligible measurement noise, the vector of observation signals $x(n) = [x_1(n), \ldots, x_N(n)]^T$ can be described as

$$x(n) = A \cdot s(n). \qquad (1)$$
The separation can be written as a multiplication with an N × N matrix B:

$$y(n) = [y_1(n), \ldots, y_N(n)]^T = B \cdot x(n). \qquad (2)$$
The aim of BSS is to find B from the observed process x(n) so that BA = DΠ, where Π is a permutation matrix and D an arbitrary diagonal matrix. These matrices represent the two ambiguities of BSS: (a) the separated signals appear in arbitrary order, and (b) they are scaled versions of the sources. We here consider the well-known gradient-based update rule [1]

$$\Delta B \propto \left(I - E\{g(y)\,y^T\}\right) B \qquad (3)$$

with $g(y) = (g_1(y_1), \ldots, g_N(y_N))^T$ being a component-wise vector of nonlinear score functions $g_i$ of the assumed source probability densities $p_i(s_i)$:

$$g_i(s_i) = -\frac{p_i'(s_i)}{p_i(s_i)}. \qquad (4)$$

In order to achieve good separation performance, the probability density function of the sources has to be known or at least well approximated [9].
2.2 Statistical Source Models and Estimators
Speech signals usually follow a Laplacian distribution. Therefore, for instantaneous mixtures, the nonlinear function $g_i(\cdot)$ reduces to

$$g_i(y) = \frac{\operatorname{sgn}(y)}{\sigma}. \qquad (5)$$
Unfortunately, this assumption does not hold for the time-frequency representation X(ωk , n). The probability density functions of the components in the
bins $\omega_k$ can vary over a large range, from sub- to super-Gaussian. A sufficient approximation can be achieved by the generalized Gaussian distribution (GGD) [10]:

$$p_y(y) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}\, e^{-(|y|/\alpha)^\beta} \qquad (6)$$

with $\alpha, \beta > 0$ and the Gamma function given by $\Gamma(y) = \int_0^\infty x^{y-1} e^{-x}\,dx$. The β-parameter of the GGD describes the overall structure of the distribution. With β = 2 the GGD reduces to the standard Gaussian distribution, with β = 1 to a Laplacian distribution, and with β = 0.5 to a Gamma distribution. Generally, a large value of β indicates a flat distribution, whereas a small value yields a spiky distribution. α is the generalized measure of the standard deviation. Usually, nonlinearities for super-Gaussian distributions utilize sigmoidal functions like sgn() or tanh(). Using the GGD, this model can be generalized further. The nonlinear function $g_i(\cdot)$ becomes $g_i(x) = |x|^{\beta-1}\operatorname{sgn}(x)$, and using $\operatorname{sgn}(x) = x/|x|$, we obtain

$$g_i(x) = \frac{x}{|x|^{2-\beta}}. \qquad (7)$$

As shown in [9], based on this nonlinear function, even mixtures of sub- and super-Gaussian signals can be separated. Although the authors used fixed values for β, they could achieve good results. The above approach has been extended in [11], where an adaptive algorithm has been proposed. Because, in the blind scenario, the sources are not available and an accurate estimation of β from them is not possible, the authors proposed to calculate β based on the statistics of the separated signals. They used the method of moments [12] to estimate β after each iteration of (3) and used this new value for the next step. It was shown that this approach leads to improved overall performance in terms of better separation and faster convergence.
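To make the moment-based estimation of β concrete, the following is a minimal NumPy/SciPy sketch (our own illustration, not the authors' implementation): for a zero-mean GGD, the ratio $E[|x|]^2 / E[x^2] = \Gamma(2/\beta)^2 / (\Gamma(1/\beta)\,\Gamma(3/\beta))$ depends only on β and is monotonically increasing in β, so β can be recovered by numerically inverting it.

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def ggd_moment_ratio(beta):
    """M(beta) = E[|x|]^2 / E[x^2] for a zero-mean GGD with shape beta."""
    return gamma(2.0 / beta) ** 2 / (gamma(1.0 / beta) * gamma(3.0 / beta))

def estimate_beta(x, lo=0.1, hi=10.0):
    """Method-of-moments estimate of the GGD shape parameter beta."""
    m1 = np.mean(np.abs(x))
    m2 = np.mean(np.abs(x) ** 2)
    ratio = m1 ** 2 / m2
    # M(beta) is monotonically increasing, so invert it by root finding.
    return brentq(lambda b: ggd_moment_ratio(b) - ratio, lo, hi)

# Example: samples from a Laplacian (beta = 1) should give beta close to 1.
rng = np.random.default_rng(0)
samples = rng.laplace(size=100_000)
print(estimate_beta(samples))  # approximately 1.0
```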
2.3 Convolutive Mixtures
In real-world acoustic scenarios, the mixing channels can be modeled by FIR filters of length L, where L can be 2000 or more, depending on the reverberation time and sampling rate. The convolutive mixing model reads

$$x(n) = H(n) * s(n) = \sum_{l=0}^{L-1} H(l)\, s(n-l) \qquad (8)$$
where H(n) is a sequence of N × N matrices containing the impulse responses of the mixing channels. For the separation we use FIR filters of length M ≥ L − 1 and obtain

$$y(n) = W(n) * x(n) = \sum_{l=0}^{M-1} W(l)\, x(n-l) \qquad (9)$$
with W(n) containing the unmixing coefficients. Estimating W(n) in the time domain is a very difficult task, because the number of unknowns, $MN^2$, can reach several tens of thousands. Although there
exist approaches to this problem [3,4], the results are not satisfying because of distortions introduced by the unmixing system. Because of this, another approach is widely used: after transforming the signals to the frequency domain, for example using the blockwise short-time Fourier transform (STFT), the convolution becomes a multiplication [5]:

$$Y(\omega_k, n) = W(\omega_k)\, X(\omega_k, n) \qquad (10)$$
Instead of estimating all coefficients at once, in the frequency domain it is possible to separate every bin independently. However, since there is a scaling and permutation ambiguity in every bin, we obtain

$$Y(\omega_k, n) = W(\omega_k)\, X(\omega_k, n) = D(\omega_k)\, \Pi(\omega_k)\, S(\omega_k, n) \qquad (11)$$
with $\Pi(\omega_k)$ being a permutation matrix and $D(\omega_k)$ a diagonal scaling matrix for frequency $\omega_k$. Therefore, it is necessary to correct the amplitudes and solve the permutation before transforming the signals back to the time domain. The scaling ambiguity can be resolved to an acceptable degree using the method proposed by Ikeda and Murata [6]. The central idea is to recover the signals as they have been recorded by the sensors. Matsuoka and Nakashima [13] showed that this is the optimal approach, as it minimizes $E\{|y(t) - x(t)|^2\}$. Their minimal distortion principle uses the following unmixing matrix:

$$W(\omega_k) \leftarrow \operatorname{diag}\left(W^{-1}(\omega_k)\right) \cdot W(\omega_k) \qquad (12)$$
with diag(·) returning the argument with all off-diagonal elements set to zero. The correction of the permutation ambiguity is even more important. Even if every bin is perfectly separated, different permutations at different frequencies make both signals appear in every output channel.
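As a concrete illustration of Eq. (12), here is a minimal NumPy sketch (our own, not from the paper) that applies this minimal-distortion rescaling to a stack of per-bin unmixing matrices:

```python
import numpy as np

def mdp_rescale(W):
    """Apply the minimal distortion principle, Eq. (12), to unmixing
    matrices W of shape (num_bins, N, N): W[k] <- diag(W[k]^-1) @ W[k]."""
    W_mdp = np.empty_like(W)
    for k in range(W.shape[0]):
        Winv = np.linalg.inv(W[k])
        # Keep only the diagonal of the inverse, zeroing off-diagonals.
        W_mdp[k] = np.diag(np.diag(Winv)) @ W[k]
    return W_mdp

# Example: two random complex 2x2 unmixing matrices for two frequency bins.
rng = np.random.default_rng(1)
W = rng.standard_normal((2, 2, 2)) + 1j * rng.standard_normal((2, 2, 2))
print(mdp_rescale(W).shape)  # (2, 2, 2)
```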
3 Resolving the Permutation Ambiguity
One of the first ideas used for the permutation problem is based on the statistics of the separated signals [6,7]. The key assumption is that the envelopes of all bins of one source are highly correlated. With $V(\omega_k, n) = |Y(\omega_k, n)|$, the correlation between two bins k, l is defined as

$$\rho_{qp}(\omega_k, \omega_l) = \frac{\sum_{n=0}^{N-1} V_q(\omega_k, n)\, V_p(\omega_l, n)}{\sqrt{\sum_{n=0}^{N-1} V_q(\omega_k, n)^2 \; \sum_{n=0}^{N-1} V_p(\omega_l, n)^2}} \qquad (13)$$

with p, q being the indices of the separated signals. To decide whether two bins are permuted equally, the value of

$$r = \frac{\rho_{pp}(\omega_k, \omega_l) + \rho_{qq}(\omega_k, \omega_l)}{\rho_{pq}(\omega_k, \omega_l) + \rho_{qp}(\omega_k, \omega_l)} \qquad (14)$$
can be used. If r > 1, the bins are sorted correctly; otherwise, with r < 1, a permutation has occurred. With more than two sources, the value of r has to be estimated for all pairs, which means that N! calculations have to be performed.
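For illustration, a minimal NumPy sketch of Eqs. (13)–(14) for two separated channels might look as follows (our own sketch; the array shapes are assumptions):

```python
import numpy as np

def envelope_corr(Vk, Vl):
    """Eq. (13): normalized correlation of two envelope sequences."""
    return np.sum(Vk * Vl) / np.sqrt(np.sum(Vk ** 2) * np.sum(Vl ** 2))

def permutation_score(Y_k, Y_l):
    """Eq. (14): r > 1 suggests bins k and l are permuted consistently,
    r < 1 suggests a swap. Y_k, Y_l have shape (2, num_frames)."""
    V_k, V_l = np.abs(Y_k), np.abs(Y_l)
    r = (envelope_corr(V_k[0], V_l[0]) + envelope_corr(V_k[1], V_l[1])) / \
        (envelope_corr(V_k[0], V_l[1]) + envelope_corr(V_k[1], V_l[0]))
    return r
```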
Fig. 1. Beta values of two signals over the frequency index. The detected clusters are indicated with bars.
Although there are algorithms with less complexity, the practical use is restricted to only a few sources [7]. Trying to sort all bins with respect to r for all p and q usually does not work for speech signals. The reason is that the key assumption of highly correlated envelopes often does not hold for frequencies that are not close together. Restricting the test to only neighboring frequencies is also not a solution, because at some frequencies the envelopes of the individual signals do not differ enough to allow for correct sorting. A compromise is the dyadic sorting [7], which starts with pairwise correlation of two neighboring bins and then successively builds groups of bins in a dyadic fashion. This algorithm utilizes the fact that, in a sorted group, a few outliers do not dominate, so the groups can be aligned properly. But like other proposals that rely only on the correlation of the separated signals, this algorithm suffers if there are too many poorly separated bins close to each other. Because of this, some of the first small groups are often not sorted properly, and the error then propagates while building the larger groups. The result is block permutations, and the separation of the whole signal fails.
4 The Proposed Method
In this paper, we propose to use the smoothness of the exponent β of the GGD. For this, we approximate the statistics of every bin by (6). Although the values of β vary in a significant way, the values in neighboring bins do not differ much. Furthermore, two different signals usually have distinct values in most bins, as can be seen in Fig. 1 for a typical situation. However, it can also be seen in Fig. 1 that there are some bins with almost the same value of β, like the bins around 3920. In this range, no differentiation of the two signals is possible on the basis of the value of β. But large ranges, like bins 3800–3840, can be clustered with certainty. These clusters can be used to correctly de-permute wide frequency ranges. Afterwards, the remaining bins can be de-permuted using alternative methods. The proposed method consists of three parts: (1) estimation of the boundaries of the clusters, (2) calculation of the permutation between the clusters, and (3) aligning the remaining bins. The algorithm is first derived for two signals and then extended to multiple signals.
4.1 Calculation of the Cluster Boundaries
The first step is to estimate the β values for all bins of both estimated sources. One possibility is to estimate this parameter in every iteration of the BSS algorithm mentioned in Section 2.1. Alternatively, any other known BSS algorithm can be used, because β can also be estimated after separation. The second step is to make a simple grouping. The bins are compared pairwise: the ones with higher values of β are assigned to one source and the ones with lower values to the other. The third step is to determine the actual clusters. The idea for a simple and fast method is the following: take an existing cluster and find out whether the neighboring bin can be added to it. The decision is based on the assumption of the values of β being distinct and smooth. The actual implementation is as follows (a sketch of this loop is given after the list):

1. Start at bin l = 1.
2. Test, by comparing the β-values, whether the next bin l + 1 can be added.
3. If yes, add this bin to the cluster, increase l, and go to Step 2.
4. If not, the end of the cluster has been found. If the cluster is large enough, mark it as correctly permuted. Increase l, mark l as the beginning of a new cluster, and go to Step 2.
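The following is our own reading of Steps 1–4 in code; the smoothness tolerance `tol` and the minimum cluster size `min_size` are illustrative parameters that the text does not specify:

```python
import numpy as np

def find_clusters(beta_hi, beta_lo, tol=0.05, min_size=10):
    """Greedy clustering of frequency bins (Steps 1-4). beta_hi/beta_lo are
    the per-bin larger/smaller beta values after the pairwise grouping."""
    clusters, start = [], 0
    K = len(beta_hi)
    for l in range(K - 1):
        # Extend the cluster while beta varies smoothly and the two
        # signals stay distinct; otherwise close the cluster here.
        smooth = (abs(beta_hi[l + 1] - beta_hi[l]) < tol and
                  abs(beta_lo[l + 1] - beta_lo[l]) < tol)
        distinct = beta_hi[l + 1] - beta_lo[l + 1] > tol
        if not (smooth and distinct):
            if l + 1 - start >= min_size:
                clusters.append((start, l))  # mark as reliably permuted
            start = l + 1
    if K - start >= min_size:
        clusters.append((start, K - 1))
    return clusters
```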
The result of this algorithm is shown in Fig. 1. If there are more than two signals, the algorithm can be extended. For this, the β values are sorted, and the two largest ones are assigned to $\beta_H(\omega_k)$ and $\beta_L(\omega_k)$, respectively. After clustering and removing $\beta_H(\omega_k)$, the same procedure can be applied to the remaining bins. An analogous procedure can be applied to the bottommost values for increased performance.

4.2 Calculation of Cluster Correlations and Aligning the Remaining Bins
The next step after the identification of the clusters is to determine the permutation between them. As the gaps between clusters are usually much smaller than the clusters themselves, the assumption of highly correlated envelopes can be used. Here we follow the idea of dyadic sorting and calculate the value of r, as defined in (14), for all combinations of all bins of two clusters. As the bins within the clusters are de-permuted with high confidence, the correct permutation between clusters can be determined by the highest or lowest value of r, as for the dyadic sorting in [7]. After calculating the correct permutation for the clusters, the remaining bins also have to be aligned. Again, a comparison of the correlation coefficients r for these bins with all coefficients for the bins in the neighboring clusters can be used.
5 Simulations
In a first simulation, the algorithm has been used on unmixed audio signals, which have been arbitrarily permuted in the frequency domain. This should
simulate the behavior of the algorithm under ideal conditions, as if the blind separation stage in each frequency bin worked perfectly. In this case, the algorithm was able to correctly de-permute all bins. When using real-world data, the separation in the single bins is not always perfect. Therefore, the estimation of correct permutations is harder. In the experiments we used a data set where the individual contributions from the sources to the microphones were available [14], and the separation performance could be estimated using the signal-to-interference ratio

$$\mathrm{SIR}_{y_i} = 10\log_{10} \frac{E\left[(g_{ii}(n) * s_i(n))^2\right]}{E\left[\left(\sum_{j=1,\, j\neq i}^{N} g_{ij}(n) * s_j(n)\right)^{\!2}\right]} \qquad (15)$$
with $g_{ij}(n) = w_i(n) * h_j(n)$. In Fig. 2, the separation performance for the single bins is given. As the individual sources are known, the best possible unmixing can be estimated. In Fig. 3, the difference between this best approach and the result of the proposed algorithm is shown. As we can see, above 300 Hz the proposed algorithm produces exactly the same output as the ideal de-permutation. Below this frequency permutation errors occur, but this is a frequency range where the separation has failed in several bins. Further inspection of the data showed that the estimation of clusters worked, but the cluster correlations were incorrect. This is typical behavior for correlation-based approaches when the separation is not perfect. The overall performance with swapped bins is an SIR of 13.16 dB. When leaving the low frequencies out and recovering only the signal components above 300 Hz, the overall performance increases to 20.03 dB.
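A direct NumPy transcription of the per-output SIR of Eq. (15), assuming the filtered source contributions are available as time-domain arrays, could look like this (our own sketch):

```python
import numpy as np

def sir_db(target, interference):
    """SIR of one output, Eq. (15): target is the g_ii * s_i contribution,
    interference is the sum of g_ij * s_j for j != i (time-domain arrays)."""
    return 10.0 * np.log10(np.mean(target ** 2) / np.mean(interference ** 2))
```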
Fig. 2. Separation performance
Fig. 3. Swap errors
6 Summary
In this paper, we presented a new approach for resolving the permutation problem, which occurs in convolutive blind source separation. For this, we modeled every bin using the generalized Gaussian distribution and used the exponent β for estimating the correct permutation. The performance of the algorithm has been studied on artificial and real-world data.
References
1. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind signal separation. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neural Information Processing Systems, vol. 8, pp. 757–763. The MIT Press, Cambridge (1996)
2. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997)
3. Douglas, S.C., Sawada, H., Makino, S.: Natural gradient multichannel blind deconvolution and speech separation using causal FIR filters. IEEE Trans. Speech and Audio Processing 13(1), 92–104 (2005)
4. Aichner, R., Buchner, H., Araki, S., Makino, S.: On-line time-domain blind source separation of nonstationary convolved signals. In: Proc. 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA 2003), Nara, Japan, pp. 987–992 (2003)
5. Smaragdis, P.: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 22(1–3), 21–34 (1998)
6. Ikeda, S., Murata, N.: A method of blind separation based on temporal structure of signals. In: Proc. Int. Conf. on Neural Information Processing, pp. 737–742 (1998)
7. Rahbar, K., Reilly, J.: A frequency domain method for blind source separation of convolutive audio mixtures. IEEE Trans. Speech and Audio Processing 13(5), 832–844 (2005)
8. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech and Audio Processing 12(5), 530–538 (2004)
9. Choi, S., Cichocki, A., Amari, S.: Flexible independent component analysis. In: Constantinides, T., Kung, S.Y., Niranjan, M., Wilson, E. (eds.) Neural Networks for Signal Processing VIII, pp. 83–92 (1998)
10. Gazor, S., Zhang, W.: Speech probability distribution. IEEE Signal Processing Letters 10(7), 204–207 (2003)
11. Kokkinakis, K., Nandi, A.K.: Multichannel speech separation using adaptive parameterization of source PDFs. In: ICA 2004. LNCS, vol. 3195, pp. 486–493. Springer, Heidelberg (2004)
12. Varanasi, M.K., Aazhang, B.: Parametric generalized Gaussian density estimation. Journal of the Acoustical Society of America 86(4), 1404–1415 (1989)
13. Matsuoka, K.: Minimal distortion principle for blind source separation. In: Proceedings of the 41st SICE Annual Conference, vol. 4, pp. 2138–2143 (2002)
14. http://www.kecl.ntt.co.jp/icl/signal/sawada/demo/bss2to4/index.html
Discovering Convolutive Speech Phones Using Sparseness and Non-negativity
Paul D. O'Grady¹ and Barak A. Pearlmutter²
¹ Complex & Adaptive Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland
² Hamilton Institute, National University of Ireland Maynooth, Co. Kildare, Ireland
[email protected], [email protected]
Abstract. Discovering a representation that allows auditory data to be parsimoniously represented is useful for many machine learning and signal processing tasks. Such a representation can be constructed by Non-negative Matrix Factorisation (NMF), which is a method for finding parts-based representations of non-negative data. Here, we present a convolutive NMF algorithm that includes a sparseness constraint on the activations and has multiplicative updates. In combination with a spectral magnitude transform of speech, this method extracts speech phones that exhibit sparse activation patterns, which we use in a supervised separation scheme for monophonic mixtures.
1 Introduction
A preliminary step in many data analysis tasks is to find a suitable representation of the data. Typically, methods exploit the latent structure in the data. For example, ICA [1] reduces the redundancy of the data by projecting it onto its independent components, which can be discovered by maximising a statistical measure such as independence or non-Gaussianity. Non-negative Matrix Factorisation (NMF) approximately decomposes a non-negative matrix V into a product of two non-negative matrices W and H [2, 3]. NMF is a parts-based approach that does not make a statistical assumption about the data. Instead, it assumes that for the domain at hand, negative numbers are physically meaningless. Data that contains negative components, for example audio, must be transformed into a non-negative form before NMF can be applied. Here, we use a magnitude spectrogram. Spectrograms have been used in audio analysis for many years and, in combination with NMF, have been applied to a variety of problems such as sound separation [4] and automatic transcription of music [5]. In this paper, we combine a previous convolutive extension of NMF [4] with a sparseness constraint on H, and present an algorithm that has multiplicative updates. This paper is structured as follows: we overview convolutive NMF in Section 2 and present sparse convolutive NMF in Section 3. In Section 4 we apply sparse convolutive NMF to speech spectrograms, and extract phones that
have sparse activation patterns. We use these phones in a supervised separation scheme for monophonic mixtures, and in Section 5 demonstrate the superior separation performance they achieve over phones extracted by convolutive NMF.
2 Convolutive NMF
NMF [3] is a linear non-negative approximate factorisation and is formulated as follows. Given a non-negative matrix $V \in \mathbb{R}_{\geq 0}^{M \times T}$, the goal is to approximate V as a product of two non-negative matrices $W \in \mathbb{R}_{\geq 0}^{M \times R}$ (basis) and $H \in \mathbb{R}_{\geq 0}^{R \times T}$ (activations), V ≈ WH, where R ≤ M, such that the reconstruction error is minimised. For our purposes we require a convolutive basis; such a model has previously been used to extend NMF [4], which we review in this section. For conventional NMF each object is described by its spectrum and corresponding activation in time, while for convolutive NMF each object has a sequence of successive spectra and a corresponding activation pattern across time. The conventional NMF model is extended to the convolutive case:

$$V \approx \sum_{t=0}^{T_o - 1} W_t\, \overset{t\rightarrow}{H}, \qquad v_{ik} \approx \sum_{t=0}^{T_o - 1} \sum_{j=1}^{R} w_{ijt}\, (\overset{t\rightarrow}{h_{jk}}) \qquad (1)$$
where $T_o$ is the length of each spectrum sequence and the j-th column of $W_t$ describes the spectrum of the j-th object t time steps after the object has begun. The operator $\overset{i\rightarrow}{(\cdot)}$ denotes a column shift that moves its argument i places to the right; as columns are shifted off to the right, the leftmost columns are zero-filled. Conversely, the $\overset{\leftarrow i}{(\cdot)}$ operator shifts columns off to the left, with zero-filling on the right. We use the beta divergence, which is a parameterisable divergence, as the reconstruction objective:

$$D_{BD}(V \| \Lambda, \beta) = \sum_{ik} \left( v_{ik}\, \frac{v_{ik}^{\beta-1} - [\Lambda]_{ik}^{\beta-1}}{\beta(\beta-1)} + [\Lambda]_{ik}^{\beta-1}\, \frac{[\Lambda]_{ik} - v_{ik}}{\beta} \right) \qquad (2)$$
where β controls the reconstruction penalty and Λ is the current estimate of V, $\Lambda = \sum_{t=0}^{T_o-1} W_t\, \overset{t\rightarrow}{H}$. The choice of the β parameter depends on the statistical distribution of the data and requires prior knowledge; see [6, Chapter 3]. For β = 2, the squared Euclidean distance is obtained; for β → 1, the divergence tends to the Kullback-Leibler divergence; and for β → 0, it tends to the Itakura-Saito divergence. It is evident that Eq. 1 can be viewed as a summation of $T_o$ conventional NMF operations. Consequently, as opposed to updating two matrices (W and H) as in conventional NMF, $T_o + 1$ matrices require an update ($W_0, \ldots, W_{T_o-1}$ and H). The resultant convolutive NMF update equations are

$$w_{ijt} \leftarrow w_{ijt}\, \frac{\sum_{k=1}^{T} (v_{ik}/[\Lambda]_{ik}^{2-\beta})\, \overset{t\rightarrow}{h_{jk}}}{\sum_{k=1}^{T} [\Lambda]_{ik}^{\beta-1}\, \overset{t\rightarrow}{h_{jk}}}, \qquad h_{jk} \leftarrow h_{jk}\, \frac{\sum_{i=1}^{M} w_{ijt}\, \overset{\leftarrow t}{(v_{ik}/[\Lambda]_{ik}^{2-\beta})}}{\sum_{i=1}^{M} w_{ijt}\, [\overset{\leftarrow t}{\Lambda}]_{ik}^{\beta-1}} \qquad (3)$$
where H is updated to the average result of its updates for all t. When To = 1 this reduces to conventional NMF.
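To make the model and updates concrete, here is a minimal NumPy sketch of convolutive NMF with the beta-divergence updates of Eq. (3) (our own illustration, not the authors' code; for simplicity the W and H updates within one iteration reuse the same reconstruction Λ):

```python
import numpy as np

def shift(X, t):
    """Column shift operator: move columns of X t places to the right,
    zero-filling on the left (t >= 0) or on the right (t < 0)."""
    Y = np.zeros_like(X)
    if t == 0:
        return X.copy()
    if t > 0:
        Y[:, t:] = X[:, :-t]
    else:
        Y[:, :t] = X[:, -t:]
    return Y

def conv_nmf(V, R, To, beta=1.0, n_iter=200, eps=1e-9, seed=0):
    """Convolutive NMF: V (M x T) ~ sum_t W[t] @ shift(H, t), Eq. (1),
    with the multiplicative beta-divergence updates of Eq. (3)."""
    rng = np.random.default_rng(seed)
    M, T = V.shape
    W = rng.random((To, M, R)) + eps
    H = rng.random((R, T)) + eps
    for _ in range(n_iter):
        Lam = sum(W[t] @ shift(H, t) for t in range(To)) + eps
        H_new = np.zeros_like(H)
        for t in range(To):
            # W update for lag t (Eq. (3), left rule).
            num = (V / Lam ** (2.0 - beta)) @ shift(H, t).T
            den = Lam ** (beta - 1.0) @ shift(H, t).T + eps
            W[t] *= num / den
            # Accumulate the H update for lag t (Eq. (3), right rule).
            num_h = W[t].T @ shift(V / Lam ** (2.0 - beta), -t)
            den_h = W[t].T @ shift(Lam ** (beta - 1.0), -t) + eps
            H_new += H * num_h / den_h
        H = H_new / To  # H is set to the average of its per-t updates
    return W, H
```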
3 Sparse Convolutive NMF
Combining our reconstruction objective (Eq. 2) with a sparseness constraint on H results in the following objective function:

$$G(V \| \Lambda, H, \beta) = D_{BD}(V \| \Lambda, \beta) + \lambda \sum_{jk} h_{jk}, \qquad (4)$$
where the left term of the objective function corresponds to convolutive NMF, and the right term is an additional constraint on H that enforces sparsity by minimising the $L_1$-norm of its elements. The parameter λ controls the trade-off between sparseness and accurate reconstruction.

3.1 Basis Normalisation
The objective of Eq. 4 creates a new problem: the right term is a strictly increasing function of the absolute value of its argument, so it is possible that the objective can be decreased by scaling $w_{ijt}$ up and H down ($w_{ijt} \rightarrow \alpha w_{ijt}$ and $H \rightarrow (1/\alpha)H$, with α > 1). This does not alter the left term in the objective function, but causes the right term to decrease, resulting in the elements of $w_{ijt}$ growing without bound and H tending toward zero. Consequently, the solution arrived at by the optimisation algorithm is not influenced by the sparseness constraint. To avoid the scaling misbehaviour of Eq. 4, another constraint is needed; by normalising the convolutive bases we can control the scale of the elements in $w_{ijt}$ and H. Here, normalisation is performed for each object matrix, $W_j$, by rescaling it to unit $L_2$-norm, $\bar{W}_j = W_j / \|W_j\|$, $j = 1, \ldots, R$, where the matrix $W_j$ is constructed from the j-th column of $w_{ijt}$ at each time step, $t = 0, 1, \ldots, T_o - 1$.
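A minimal sketch of this per-object normalisation, assuming the convolutive basis is stored as an array of shape (To, M, R) so that object j is the slice over all lags:

```python
import numpy as np

def normalize_bases(W):
    """Rescale each object's convolutive basis W_j (stacked over lags) to
    unit L2 norm. W has shape (To, M, R); object j is W[:, :, j]."""
    W_bar = W.copy()
    for j in range(W.shape[2]):
        norm = np.linalg.norm(W[:, :, j])  # L2 norm over all lags and rows
        if norm > 0:
            W_bar[:, :, j] /= norm
    return W_bar
```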
3.2 Multiplicative Updates
Multiplicative updates can be obtained by including the normalisation requirement in the objective. Previously, this has been achieved for conventional NMF using the squared Euclidean distance reconstruction objective [7]. Here, we present the multiplicative updates for a convolutive NMF algorithm utilising the beta divergence. Our new reconstruction objective is a modification of Eq. 2 where each object, $W_j$, is normalised to $\bar{W}_j$, resulting in the following generative model: $\Delta = \sum_{t=0}^{T_o-1} \sum_{j=1}^{R} \bar{w}_{jt}\, (\overset{t\rightarrow}{h_j})$. By substituting Δ for Λ in Eq. 4 we obtain [6, Chapter 4] the following multiplicative update rules for H and W:

$$h_{jk} \leftarrow h_{jk}\, \frac{\sum_{i=1}^{M} \bar{w}_{ijt}\, \overset{\leftarrow t}{(v_{ik}/[\Delta]_{ik}^{2-\beta})}}{\sum_{i=1}^{M} \bar{w}_{ijt}\, [\overset{\leftarrow t}{\Delta}]_{ik}^{\beta-1} + \lambda}, \qquad (5)$$
$$w_{ijt} \leftarrow w_{ijt}\, \frac{\sum_{k=1}^{T} \overset{t\rightarrow}{h_{jk}} \left( v_{ik}/[\Delta]_{ik}^{2-\beta} + \bar{w}_{ijt}\, (\bar{w}_{ijt}\, [\Delta]_{ik}^{\beta-1}) \right)}{\sum_{k=1}^{T} \overset{t\rightarrow}{h_{jk}} \left( [\Delta]_{ik}^{\beta-1} + \bar{w}_{ijt}\, (\bar{w}_{ijt}\, (v_{ik}/[\Delta]_{ik}^{2-\beta})) \right)} \qquad (6)$$

Fig. 1. A collection of 40 phone-like basis functions for a mixture of a male (DMT0) and female (SMA0) speaker taken from the TIMIT speech database

4 Sparse Convolutive NMF Applied to Speech Spectra
We apply sparse convolutive NMF to speech, and present a learned basis for the sparse representation of speech using the TIMIT database. Recently, such work has been presented for convolutive NMF [8].

4.1 Discovering a Phone-Like Basis
To illustrate the differences between the phones extracted by convolutive NMF and sparse convolutive NMF, we perform the following experiment for both algorithms: we take around 15 seconds of speech from a male (DMT0) and a female (SMA0) speaker to create a contiguous mixture. The data is normalised to unit variance, down-sampled from 16 kHz to 8 kHz, and a magnitude spectrogram of the data is constructed. We use an FFT frame size of 512, a frame overlap of 384, and a Hamming window to reduce the presence of sidelobes. We extract 40 bases, R = 40, with a temporal extent of 0.176 seconds, $T_o$ = 8, and run convolutive NMF (with β = 1) for 200 iterations. The extracted bases are presented in Figure 1. The experiment is repeated for sparse convolutive NMF with λ = 15, and the corresponding bases are presented in Figure 2. For convolutive NMF, it is evident that the extracted bases correspond to speech phones, which can be verified by listening to an audible reconstruction. Most of the phones represent harmonic series with differing pitch inflections, while a smaller subset of phones contain wideband components that correspond to consonant sounds. It is evident for the harmonic phones that
Fig. 2. A collection of 40 phone-like basis functions for a mixture of a male (DMT0) and female (SMA0) speaker taken from the TIMIT speech database. The basis is extracted using sparse convolutive NMF with λ = 15.
some bases have harmonics that are spaced much closer together, which is indicative of a lower-pitched male voice, while others are farther apart, indicating a higher-pitched female voice. Therefore, the extracted phones correspond to either the male or the female speaker, which indicates that the timbral characteristics of the two speakers are sufficiently different that phones representative of both cannot be extracted. By placing a sparseness constraint on the activations of the basis functions, we specify that the expressive power of each basis be extended such that it is capable of representing phones parsimoniously, much like an over-complete dictionary. The result is that the extracted phones exhibit a structure that is rich in phonetic content, where harmonics at higher frequencies have a much greater intensity than seen in the phones extracted by convolutive NMF. Analysis of the male and female sparse phone sets reveals another important difference between the two speakers. In addition to the difference in harmonic spacing, the structure of the male phones is of a more complex nature, with changes over time that are much more varied than for the female phone set.
5 Supervised Method for the Separation of Speakers
As illustrated in our previous experiments, the structure of the bases that are extracted from the speech spectrogram is uniquely dependent on the speaker (given the same algorithm parameters). In the context of speech separation, it is not unreasonable to expect that the bases extracted for a specific speaker characterise that speaker adequately, such that they can be used to discriminate them from other speakers. For a monophonic mixture where a number of speakers are summed together, it is possible to separate the speakers in the mixture by constructing an individual magnitude spectrogram for each speaker, using the
phones specific to that speaker. More formally, we use the following procedure for the separation of a known male and female speaker from a monophonic mixture:

1. Obtain training data for the male, $s^m(t)$, and female, $s^f(t)$, speaker, create a magnitude spectrogram for both, and extract corresponding phone sets, $W^m_t$ and $W^f_t$, using sparse convolutive NMF.
2. Construct a combined basis set $W^{mf}_t = [W^m_t \,|\, W^f_t]$, which results in a basis that is twice as big as R.
3. Take a mixture that is composed of two unknown sentences voiced by our selected speakers, and create a magnitude spectrogram of the mixture. Fit the mixture to $W^{mf}_t$ by performing sparse convolutive NMF with $W_t$ fixed to $W^{mf}_t$, learning only the associated activations H.
4. Partition H such that the activations are split into male, $H^m$, and female, $H^f$, parts that correspond to their associated bases, $H = [H^m \,|\, H^f]^T$.
5. Construct a magnitude spectrogram for both speakers, using their respective bases and activations: $S^m = \sum_{t=0}^{T_o-1} W^m_t\, \overset{t\rightarrow}{H^m}$ and $S^f = \sum_{t=0}^{T_o-1} W^f_t\, \overset{t\rightarrow}{H^f}$.
6. Use the phase information from the mixture to create an audible reconstruction for both speakers.

This procedure may also be used for convolutive NMF, and can be generalised to more than two speakers, and to speakers of the same gender.
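Steps 2–5 can be sketched as follows (our own illustration with β = 1, ignoring the shift's edge effect in the denominator; `W_m` and `W_f` are assumed to be arrays of shape (To, M, R) holding the trained phone sets):

```python
import numpy as np

def shift(X, t):
    """Shift columns right by t (zero-fill left); negative t shifts left."""
    Y = np.zeros_like(X)
    if t == 0:
        return X.copy()
    if t > 0:
        Y[:, t:] = X[:, :-t]
    else:
        Y[:, :t] = X[:, -t:]
    return Y

def separate(V_mix, W_m, W_f, n_iter=150, lam=0.1, eps=1e-9, seed=0):
    """Steps 2-5 of the separation procedure (a sketch): learn only H
    against the fixed combined basis, then reconstruct the per-speaker
    magnitude spectrograms."""
    To, M, R = W_m.shape
    W = np.concatenate([W_m, W_f], axis=2)          # Step 2: W^mf, 2R bases
    rng = np.random.default_rng(seed)
    H = rng.random((2 * R, V_mix.shape[1])) + eps
    for _ in range(n_iter):                         # Step 3: fit activations
        Lam = sum(W[t] @ shift(H, t) for t in range(To)) + eps
        H_new = np.zeros_like(H)
        for t in range(To):
            num = W[t].T @ shift(V_mix / Lam, -t)
            den = W[t].T @ np.ones_like(Lam) + lam + eps  # sparse penalty
            H_new += H * num / den
        H = H_new / To
    H_m, H_f = H[:R], H[R:]                         # Step 4: partition H
    S_m = sum(W_m[t] @ shift(H_m, t) for t in range(To))  # Step 5
    S_f = sum(W_f[t] @ shift(H_f, t) for t in range(To))
    return S_m, S_f  # Step 6: combine with the mixture phase and invert
```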
5.1 Separation Experiments
Here, we compare the separation performance of convolutive NMF and sparse convolutive NMF. For an extensive study of the relationship between parameter selection and separation performance for convolutive NMF, see [8]. We select three male (ABC0, BJV0, DWM0) and three female (EXM0, KLH0, REH0) speakers from the TIMIT database, and create a training set for each that includes all but one sentence voiced by that speaker. We artificially generate a monophonic mixture by summing the remaining sentences for a selected male-female pair, for a total of nine mixtures. Each sentence pair is normalised to unit variance, down-sampled from 16 kHz to 8 kHz, and summed together. A magnitude spectrogram of each mixture is constructed using an FFT frame size of 512, a frame overlap of 256, and a Hamming window. The separation performance of both algorithms is evaluated for each mixture over a selection of values for R (R = {40, 80, 140, 220}). For both algorithms, the temporal extent of each phone is set to 0.224 seconds ($T_o$ = 6), the number of iterations is 150, β is set to 1, and each experiment is repeated for 10 Monte Carlo runs. For convolutive NMF, a total of 24 speaker phone sets are extracted and used in 360 (9 × 4 × 10) separation experiments. For sparse convolutive NMF, separation performance is tested for λ = {0.01, 0.1, 0.3, 1.0, 2.0}, resulting in 120 (6 × 4 × 5) speaker phone sets and 1800 (9 × 4 × 5 × 10) separation experiments. For ease of comparison with existing separation methods, we evaluate the separation performance of both algorithms using the Source-to-Distortion Ratio (SDR) measure provided by the BSS_EVAL toolbox [9].
Fig. 3. A comparison of the SDR results obtained by convolutive and sparse convolutive NMF: Box plots are used to illustrate the performance results, where each box represents the median and the interquartile range of the results. It is evident that for λ = 0.1, a better spread of results is obtained, indicating that sparse convolutive NMF achieves superior overall performance.
SDR indicates overall separation performance and is expressed in dB, with higher values indicating better quality estimates. An extensive investigation utilising all measures provided by the toolbox is presented in [6, Chapter 4].

5.2 Separation Performance
We statistically analyse the performance of convolutive NMF and sparse convolutive NMF by collating the results from all experiments and presenting them using box plots: each box presents information about the median and the statistical dispersion of the results. The top and bottom of each box represent the upper and lower quartiles, while the length between them is the interquartile range; the whiskers represent the extent of the rest of the data, and outliers are represented by +. Box plots for SDR are presented in Figure 3. The SDR results indicate that for λ = {0.1, 0.3}, the median performance obtained (0.66 dB, 0.62 dB) exceeds that of convolutive NMF (0.44 dB) for our given algorithm parameters. It is also evident that a better spread of results is produced for sparse convolutive NMF, demonstrating that when λ is chosen appropriately, sparse convolutive NMF achieves superior overall performance. However, audible reconstructions reveal that convolutive NMF is more resilient to artifacts; this may reflect the fact that each sparse phone set exhibits phones that are rich in features, which may manifest as artifacts in the resultant source estimates. It is also evident that the performance of the sparse convolutive algorithm degrades significantly for large λ, so much so that it renders the results useless; for our data this is especially evident for λ > 1.
6 Conclusion
In this paper, we presented a sparse convolutive NMF algorithm with multiplicative updates, which effectively discovers a sparse parts-based convolutive representation for non-negative data. This method extends the convolutive NMF objective by including a sparseness constraint on the activation patterns, enabling the discovery of over-complete representations. Furthermore, we demonstrated the superiority of sparse convolutive NMF over convolutive NMF, when applied to a supervised monophonic speech separation task.
Acknowledgements. Supported by the Higher Education Authority of Ireland (An tÚdarás Um Ard-Oideachas) and Science Foundation Ireland grant 00/PI.1/C067.
References
[1] Comon, P.: Independent component analysis: A new concept. Signal Processing 36, 287–314 (1994)
[2] Paatero, P., Tapper, U.: Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126 (1994)
[3] Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems 13, pp. 556–562. MIT Press, Cambridge (2001), URL citeseer.ist.psu.edu/lee00algorithms.html
[4] Smaragdis, P.: Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 494–499. Springer, Heidelberg (2004)
[5] Abdallah, S.A., Plumbley, M.D.: Polyphonic transcription by non-negative sparse coding of power spectra. In: Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), pp. 318–325 (2004)
[6] O'Grady, P.D.: Sparse Separation of Under-Determined Speech Mixtures. PhD thesis, National University of Ireland Maynooth (2007), URL http://ee.ucd.ie/~pogrady/ogrady2007_phd.pdf
[7] Eggert, J., Körner, E.: Sparse coding and NMF. In: Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2529–2533. IEEE, Los Alamitos (2004)
[8] Smaragdis, P.: Convolutive speech bases and their application to supervised speech separation. IEEE Transactions on Audio, Speech and Language Processing (2007)
[9] Févotte, C., Gribonval, R., Vincent, E.: BSS_EVAL toolbox user guide. Technical Report 1706, IRISA (2005)
Frequency-Domain Implementation of a Time-Domain Blind Separation Algorithm for Convolutive Mixtures of Sources Masashi Ohata and Kiyotoshi Matsuoka Department of Brain Science and Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu city, 808-0196, Japan
Abstract. This paper proposes a way to implement a time-domain blind separation algorithm for convolutive mixtures of source signals. The approach provides another form of the algorithm via the discrete Fourier transform and offers the possibility of designing a separating filter in the frequency domain, without bothering about the permutation problem inherent in the frequency-domain blind separation approach. This paper also shows a technique to improve separation performance in the frequency domain. The validity of our approach was demonstrated by an experiment on the separation of convolutive mixtures of two speech signals. Keywords: source separation, convolutive models, time-frequency representations, normalization.
1 Introduction
Blind source separation (BSS) for convolutive mixtures of sources is formulated as follows. Consider a case where M sensors receive convolutive mixtures of M statistically independent signals referred to as source signals. The relationship between the source signals and their mixtures is expressed as

$$x(n) = \sum_{l=0}^{K-1} A_l\, s(n-l), \qquad (1)$$
where s(n) and x(n) are M-dimensional real-valued vectors, representing collections of the source signals and the observations at discrete time n, respectively. The matrices $A_l$ represent the impulse response of the channel from the sources to the sensors. Although the channel is not known beforehand, it is assumed to satisfy: 1) the channel is invertible, i.e., $A(z) = \sum_{l=0}^{K-1} A_l z^{-l}$ is nonsingular for every complex variable z on the unit circle |z| = 1; 2) each source signal is a stationary non-Gaussian process with zero mean and contains every frequency component. Approaches to the separation problem can be classified into two types: one is referred to as frequency-domain BSS (FD-BSS) and the other as time-domain BSS (TD-BSS); various separation methods are collected in [1]. The former cuts out
a series of frames with appropriate length $L_s$ from the sequence {x(n)} and converts the frames to a series of frequency-domain data by the discrete Fourier transform (DFT):

$$x[n, k] = \sum_{m=0}^{L_s - 1} x(n+m)\, e^{-j\omega_k m} \approx A[k]\, s[n, k] \qquad (k = 0, 1, \ldots, N-1). \qquad (2)$$
Here, $\omega_k$ denotes a discrete angular frequency, $\omega_k = 2\pi k/N$. N is the number by which the interval [0, 2π) is divided, and is a power of two chosen large enough ($L_s \leq N$). The symbol j is the imaginary unit. {A[k]} is the DFT of {$A_l$} and is a collection of M × M nonsingular matrices. x[n, k] and s[n, k] are the DFTs of the segmented sequences of x(n) and s(n), respectively. By applying an ICA (independent component analysis) method to the converted sequence at each frequency individually, independent components can be obtained:

$$y[n, k] = B[k]\, x[n, k] \qquad (k = 0, 1, \ldots, N-1). \qquad (3)$$
Here, the B[k] are M × M nonsingular matrices. We refer to {B[k]} as a separating filter, or simply a separator, hereafter. This approach evaluates the statistical independence at each frequency. The ICA solutions are given in the form $\tilde{B}[k] = P[k]\, D[k]\, A^{-1}[k]$, where P[k] and D[k] represent a permutation matrix and an invertible diagonal matrix, respectively. These two matrices cannot be determined from statistical independence alone. When converting the frequency components back to the time-domain expression by the inverse discrete Fourier transform (IDFT), the independent components at multiple frequencies have to be aligned in such a way that the IDFTs of the frequency components correspond to the source signals; it is necessary to set P[k] to a constant matrix P over every frequency. To solve this alignment problem, various methods have been proposed, for example using the time structure of speech signals in [2] and a beamforming technique for microphone arrays in [3],[4]. Advantages of this approach are that the algorithm is simple and can be parallelized on a computer, and that the separation performance can be improved in the frequency domain. On the other hand, the TD-BSS approach does not require such an alignment step, but separation performance in the frequency domain is not taken into account. To make the most of the advantages of these two approaches, this paper shows a way to implement a TD-BSS algorithm in the frequency domain. More specifically, our method evaluates the statistical independence among signals in the time domain, but designs the separating filter in the frequency domain.
2 Time-Domain Blind Source Separation

2.1 Basic Algorithm
The output sequence of a separator, i.e., the IDFT of Eq. (3), is given as

$$y(n+m) = \frac{1}{N} \sum_{k=0}^{N-1} y[n, k]\, e^{j\omega_k m} \qquad (m = 0, 1, \ldots, N-1), \qquad (4)$$
where $y(n) = [y_1(n), \ldots, y_M(n)]^T$ (superscript 'T' denotes the transpose of a vector or a matrix). Since x(n) is real-valued, its DFT satisfies $x[n, k] = x^*[n, N-k]$, where superscript '*' denotes the conjugate of a complex value. Equation (4) is real-valued if and only if the DFT sequence {B[k]} satisfies $B[k] = B^*[N-k]$ (k = 0, 1, …, N−1). Several contrast functions have been proposed to solve the blind source separation problem. In this paper, an information-theoretical approach is employed:

$$C(n, \{B[k]\}) = -\sum_{m=0}^{N-1} \sum_{p=1}^{M} \log r_p(y_p(n+m)) - \frac{1}{2} \sum_{k=0}^{N-1} \log \det B[k]\, B^H[k]. \qquad (5)$$
Here, superscript 'H' denotes the conjugate transpose of a vector or a matrix, and $r_p(y)$ is a model for the probability density function (pdf) of source p. This is the DFT expression of the contrast function mentioned in [5]. Since the DFT and IDFT can be expressed using nonsingular matrices, the optimization of the contrast function in the discrete time domain is equivalent to that in the discrete frequency domain (when the length of the optimized filter is N). The contrast function evaluates the independence among the filter outputs for the input {x(n), …, x(n+L_s−1)}. To search for a desired separator, the function is minimized with respect to the matrices B[k]. The sequence {B[k]} must satisfy $B[k] = B^*[N-k]$ during the minimization. Let B[n, k] be the updated filter of B[k] at the n-th iteration and ΔB[n, k] be the update value for the filter: B[n+1, k] = B[n, k] + ΔB[n, k]. The derivative of $(-\log r_p(y))$ is denoted by $\varphi_p(y)$. Let $\varphi(y(n))$ be the M-dimensional column vector whose elements are $\varphi_p(y_p(n))$. Define $f[n, k] = [\varphi_1[n, k], \ldots, \varphi_M[n, k]]^T = \sum_{m=0}^{N-1} \varphi(y(n+m))\, e^{-j\omega_k m}$. Using the natural gradient method proposed by Amari [6], the following rule is obtained:

$$\Delta B[n, k] = \alpha \left( N I - f[n, k]\, y^H[n, k] \right) B[n, k], \qquad \Delta B[n, N-k] = \Delta B^*[n, k] \quad (k = 0, 1, \ldots, N/2). \qquad (6)$$

Here, α is a small positive constant and I is the M × M identity matrix. This rule is suitable for the case of independent, identically distributed (i.i.d.) source signals, but it is undesirable for colored sources. It is possible to obtain a set of unwhitened independent signals by employing the nonholonomic constraint method proposed by Amari et al. [7]. Specifically, by applying $\operatorname{diag}\{\Delta B[n, k]\, B^{-1}[n, k]\} = O$ to Eq. (6), the following rule is obtained:

$$\Delta B[n, k] = -\alpha\, \operatorname{off\text{-}diag}\left\{ f[n, k]\, y^H[n, k] \right\} \cdot B[n, k]. \qquad (7)$$
Here, diag Z (off-diag Z) represents the matrix whose diagonal (off-diagonal) elements are identical to those of the square matrix Z and whose other elements are zero.

2.2 Proposed Algorithm
Algorithm (7) is affected by the magnitudes of $y_p[n, k]$ and $\varphi_p[n, k]$, because they are distributed over a wide range. The learning rate therefore has to be carefully
set to obtain sufficient separation results. This may induce good separation in some frequency ranges, but poor separation in others. To improve the separation performance, we modify the algorithm as follows. Let $\gamma_p^2[n, k]$ and $\sigma_p^2[n, k]$ be estimates of the variances of $\varphi_p[n, k]$ and $y_p[n, k]$, respectively. The estimates are updated in accordance with

$$\gamma_p^2[n, k] = (1-\beta_1)\, \gamma_p^2[n-1, k] + \beta_1 |\varphi_p[n, k]|^2, \qquad (8)$$
$$\sigma_p^2[n, k] = (1-\beta_2)\, \sigma_p^2[n-1, k] + \beta_2 |y_p[n, k]|^2, \qquad (9)$$
where $\beta_1$ and $\beta_2$ are positive constants smaller than one. Define the M × M diagonal matrices $\Gamma[n, k] = \operatorname{diag}\{1/\sqrt{\gamma_1^2[n, k]}, \ldots, 1/\sqrt{\gamma_M^2[n, k]}\}$ and $\Xi[n, k] = \operatorname{diag}\{1/\sqrt{\sigma_1^2[n, k]}, \ldots, 1/\sqrt{\sigma_M^2[n, k]}\}$. By using these positive-definite matrices and a method proposed in [8], the separation algorithm can be extended as

$$\Delta B[n, k] = -\alpha\, \operatorname{off\text{-}diag}\left\{ \Gamma[n, k]\, f[n, k]\, y^H[n, k]\, \Xi[n, k] \right\} \cdot B[n, k]. \qquad (10)$$
The elements of the time average of $\Gamma[n, k]\, f[n, k]\, y^H[n, k]\, \Xi[n, k]$ are the correlation coefficients corresponding to those of $f[n, k]\, y^H[n, k]$. Learning rules (7) and (10) cannot obtain a separator with a unique D[k]. It is possible to remove the ambiguity in D[k] by using the minimal distortion principle (MDP), originally proposed by Matsuoka and Nakashima [9]. The MDP separator can be obtained as

$$\hat{B}[n+1, k] = \operatorname{diag}\left\{ B^{-1}[n+1, k] \right\} \cdot B[n+1, k]\, e^{-j\omega_k \tau}, \qquad (11)$$
where τ is a positive integer representing a delay (a similar equation is given in [4]). The MDP solution can also be obtained iteratively without calculating the inverse of a matrix; the procedure is omitted in this paper. A given sequence is segmented into a series of frames of length $L_s$ by shifting a frame by interval R, and the shifted segments are then transformed to frequency sequences by DFT. Let l be an integer variable representing the frame position. Setting n = lR in Eq. (10), the following iterative procedure is obtained:

P1: Initialization: l = 0, B[0, k], $\gamma_p^2[0, k]$, $\sigma_p^2[0, k]$ (k = 0, 1, …, N/2).
P2: Obtain $x[lR, k] = \sum_{m=0}^{L_s-1} x(lR+m)\, e^{-j\omega_k m}$.
P3: Calculate y[lR, k] = B[lR, k] x[lR, k] (y[lR, N−k] = y*[lR, k]).
P4: Obtain the IDFT {y(lR), …, y(lR+N−1)} of {y[lR, k]}.
P5: Calculate {f[lR, k]}: the DFT of {φ(y(lR)), …, φ(y(lR+N−1))}.
P6: Update $\gamma_p^2[lR, k]$ and $\sigma_p^2[lR, k]$ by Eqs. (8) and (9), respectively.
P7: Update the separator in accordance with Eq. (10).
P8: Increment l: l ← l+1.
P9: Go to P2.
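The following NumPy sketch (our own transcription, not the authors' MATLAB code) implements the loop P1–P9 for M = 2 sensors, using only the non-negative frequencies k = 0, …, N/2, the tanh nonlinearity of Table 1, and an identity initialization for simplicity:

```python
import numpy as np

def fd_bss(x, N=8192, Ls=4096, R=2750, alpha=8.5e-3, b1=0.15, b2=0.005):
    """Sketch of the iterative procedure P1-P9.
    x: observations of shape (M, num_samples). Returns per-bin B[k]."""
    M, T = x.shape
    K = N // 2 + 1                      # bins 0..N/2; the rest by symmetry
    B = np.tile(np.eye(M, dtype=complex), (K, 1, 1))   # P1: initialization
    gam = np.full((M, K), 10.0)         # variance estimates of phi_p[n, k]
    sig = np.full((M, K), 10.0)         # variance estimates of y_p[n, k]
    phi = lambda y: np.tanh(75.0 * y)   # nonlinearity from Table 1
    for l in range((T - Ls) // R):      # loop over frame positions
        frame = x[:, l * R : l * R + Ls]
        X = np.fft.rfft(frame, n=N)                    # P2: DFT of frame
        Y = np.einsum('kij,jk->ik', B, X)              # P3: y[lR, k]
        y = np.fft.irfft(Y, n=N)                       # P4: IDFT
        F = np.fft.rfft(phi(y), n=N)                   # P5: DFT of phi(y)
        gam = (1 - b1) * gam + b1 * np.abs(F) ** 2     # P6: Eq. (8)
        sig = (1 - b2) * sig + b2 * np.abs(Y) ** 2     #     Eq. (9)
        for k in range(K):                             # P7: Eq. (10)
            G = np.diag(1.0 / np.sqrt(gam[:, k]))
            Xi = np.diag(1.0 / np.sqrt(sig[:, k]))
            C = G @ np.outer(F[:, k], Y[:, k].conj()) @ Xi
            np.fill_diagonal(C, 0.0)                   # off-diag(...)
            B[k] -= alpha * C @ B[k]
    return B                                           # P8, P9: next frame
```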
The convolution realized via the DFT is basically not linear convolution but circular convolution, as mentioned in [10]. The output y(n+m) (m = 0, …, N−1) obtained by Eqs. (3) and (4) is a linear convolution between {x(n), …, x(n +
$L_s$ − 1)} and the impulse response {$B_l$} corresponding to the filter {B[k]} if the length $L_f$ of the impulse response and $L_s$ satisfy $L_f + L_s - 1 \leq N$. This may affect the search for a desired separator, because the length of the IDFT of {B[k]} does not necessarily satisfy the inequality; the length is N in general. During the learning of a separator, it is necessary that the inequality hold. Thus, the impulse response {$B_l$} should be truncated at fixed intervals during the learning.
3 Experiment
An experiment on speech separation was performed to demonstrate the validity of our algorithm. Two different speech signals were produced with two loudspeakers, and their mixtures were recorded with two omnidirectional microphones (M = 2) in a room. The loudspeakers and the microphones were arranged as shown in Fig. 1 (the reverberation time was approximately 550 ms). They were set 1.2 m above the floor. While the position of Loudspeaker 2 was fixed, that of Loudspeaker 1 was changed as θ = 0, 15, 30, 45, 60, 75, 90 [deg].
Fig. 1. Experimental setup: (a) positions of the microphones and the speakers in a room; (b) arrangement of the microphones and the speakers
We used four kinds of speech, consisting of two male voices and two female voices, and selected every combination of two different speech signals from them. The number of combinations is $_4C_2 \times 2 = 12$, which includes permutations of the speech signals. To easily evaluate the separation performance of an obtained separator, we individually recorded the two speech signals produced with the loudspeakers at each position of Loudspeaker 1. The speech was recorded at a sampling rate of 8 kHz for ten seconds. The signals recorded when a speech signal is produced with Loudspeaker q are represented as $x_q(n) = [x_{1q}(n)\; x_{2q}(n)]^T$. We summed these recorded signals on a computer and used the summations as convolutive mixtures of two different speech signals:
$x(n) = x_1(n) + x_2(n)$. Let $y_q(n) = [y_{1q}(n)\; y_{2q}(n)]^T$ be the outputs of separator B(z) to inputs $x_q(n)$ (q = 1, 2): $y(n) = y_1(n) + y_2(n)$. The separation performance of the obtained separators can be evaluated in terms of the signal-to-interference ratio (SIR):

$$\mathrm{SIR}[B(z)] = \frac{1}{2} \sum_{p=1}^{2} \mathrm{SIR}_p[B(z)] = \frac{1}{2} \sum_{p=1}^{2} 10 \log_{10} \left( \max_q \langle y_{pq}^2(n) \rangle \,/\, \min_q \langle y_{pq}^2(n) \rangle \right),$$

where ⟨·⟩ represents the time average. If a desired separator were obtained, the ratio would become infinite; a separator with a larger SIR is closer to a desired separator. Let $P_{pq}^{xx}(f)$, $P_r^{yy}(f)$, and $P_{r,pq}^{yx}(f)$ be the power and cross spectra of $x_{pq}(n)$ and $y_r(n)$ (p, q, r = 1, 2). The coherence between $x_{pq}(n)$ and $y_r(n)$ is defined as

$$\mathrm{Coh}_{r,pq}^{yx}(f) = \frac{|P_{r,pq}^{yx}(f)|^2}{P_r^{yy}(f)\, P_{pq}^{xx}(f)}.$$

This takes a value in the range [0, 1]. To investigate the separation performance in the frequency domain, we used the function defined as $\mathrm{Coh}_{rq}^{yx}(f) = \frac{1}{2} \sum_{p=1}^{2} \mathrm{Coh}_{r,pq}^{yx}(f)$. If this function equaled one over the whole frequency range, then the output signal $y_r(n)$ would be identical to signal $s_q(n)$. For both algorithms, the initial value of the separator was set to
$$B[0, k] = \begin{pmatrix} 1 & -0.9\,e^{-j\omega_k} \\ -0.9\,e^{-j\omega_k} & 1 \end{pmatrix} e^{-j\omega_k \tau} \qquad (k = 0, 1, \ldots, N-1).$$

This setting came from a consideration based on a microphone array technique (the details are omitted due to limited space). For algorithm (10), the initial values of $\gamma_p^2[n, k]$ and $\sigma_p^2[n, k]$ were set to 10.0. The values of the learning rates and other parameters are given in Table 1. The repetition number in the table represents the number of times the algorithm is applied to the mixtures. Algorithms (7) and (10) were applied to the speech data, and SIRs were calculated for the obtained separators and for the MDP separators obtained by Eq. (11). The SIRs were averaged over all combinations of two different speech signals for each position of Loudspeaker 1.

Table 1. Parameters in the experiments
Parameter | Algorithm (7) | Algorithm (10)
Number of observation samples Ls | 4096 | 4096
Filter length Lf | 4000 | 4000
Length of DFT N | 8192 | 8192
Frame shift R | 2750 | 2750
Delay τ | 2000 | 2000
Repetition number | 5 | 5
Nonlinear function ϕp(y) | tanh(75y) | tanh(75y)
Learning rate for separator α | 1.5 × 10⁻⁵ | 8.5 × 10⁻³
Learning rate β1 in Eq. (8) | – | 0.15
Learning rate β2 in Eq. (9) | – | 0.005
Fig. 2. Separation performance in SIR: (a) variations of the averaged SIRs for algorithms (7) and (10) at θ = 30 deg; (b) averaged SIRs for the MDP separators (at the fifth repetition)
Figure 2(a) illustrates the variations of the averaged SIRs for the separators obtained by the algorithms with θ = 30 [deg]; these separators are not minimal-distortion solutions. This suggests that algorithm (10) can obtain a better separating filter faster than algorithm (7). Figure 2(b) depicts the averaged SIRs for the MDP separator obtained at each location of Loudspeaker 1. This shows that our setting of the initial separator is effective in cases other than θ = 0 and 15 [deg]. Figure 3 depicts $\mathrm{Coh}_{rq}^{yx}(f)$ for the MDP separators. This result shows that Eq. (10) can design a better separator in the frequency domain than Eq. (7).
Fig. 3. Coherences for the MDP filters (θ = 30 deg): (a) MDP filter B̂(z) obtained by (7); (b) MDP filter B̂(z) obtained by (10). The solid lines represent $\mathrm{Coh}_{11}^{yx}(f)$ and $\mathrm{Coh}_{21}^{yx}(f)$, and the dashed lines represent $\mathrm{Coh}_{12}^{yx}(f)$ and $\mathrm{Coh}_{22}^{yx}(f)$.
We programmed the above algorithms in MATLAB. It took approximately 13.4 seconds to run the program for Eq. (10) for one repetition (when the algorithm was applied once to ten seconds of data) on a computer with a 3.0 GHz Intel Pentium 4 CPU.
4 Conclusion
This paper proposed a frequency-domain implementation of a time-domain blind separation algorithm. Our method provides the possibility of improving the separation performance in the frequency domain. This paper also showed an improvement of the performance obtained by normalizing the terms evaluating statistical independence in the algorithm. The result of the experiment on speech separation shows that our algorithm can separate source signals from their mixtures without solving the permutation problem inherent in FD-BSS.
References
1. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley & Sons, Ltd., Chichester (2002)
2. Murata, N., Ikeda, S., Ziehe, A.: An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 41(4), 1–24 (2001)
3. Saruwatari, H., Kurita, S., Takeda, K., Itakura, F., Nishikawa, T., Shikano, K.: Blind source separation combining independent component analysis and beamforming. EURASIP Journal on Applied Signal Processing, No. 11, 1135–1146 (2003)
4. Sawada, H., Mukai, R., Araki, S., Makino, S.: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. on Speech and Audio Processing 12(5), 530–538 (2004)
5. Amari, S., Douglas, S.C., Cichocki, A., Yang, H.H.: Multichannel blind deconvolution and equalization using the natural gradient. In: Proc. of IEEE International Workshop on Signal Processing Advances in Wireless Communications, pp. 101–104. IEEE Computer Society Press, Los Alamitos (1997)
6. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
7. Amari, S., Chen, T.-P., Cichocki, A.: Nonholonomic orthogonal learning algorithms for blind source separation. Neural Computation 12(6), 1463–1484 (2000)
8. Georgiev, P., Cichocki, A., Amari, S.: On some extensions of the natural gradient algorithm. In: Proc. of the Third International Conference on Independent Component Analysis and Blind Signal Separation, San Diego, CA, December 9–12, pp. 581–585 (2001)
9. Matsuoka, K., Nakashima, S.: Minimal distortion principle for blind source separation. In: Proc. of the Third International Conference on Independent Component Analysis and Blind Signal Separation, San Diego, CA, December 9–12, pp. 722–727 (2001)
10. Oppenheim, A.V., Schafer, R.W., Buck, J.R.: Discrete-Time Signal Processing, 2nd edn. Prentice Hall International, Inc., Englewood Cliffs (1998)
Phase-Aware Non-negative Spectrogram Factorization R. Mitchell Parry and Irfan Essa Georgia Institute of Technology College of Computing / GVU Center 85 Fifth Street NW, Atlanta, GA USA {parry,irfan}@cc.gatech.edu http://www.cc.gatech.edu/
Abstract. Non-negative spectrogram factorization has been proposed for single-channel source separation tasks. These methods operate on the magnitude or power spectrogram of the input mixture and estimate the magnitude or power spectrogram of source components. The usual assumption is that the mixture spectrogram is well approximated by the sum of source components. However, this relationship additionally depends on the unknown phase of the sources. Using a probabilistic representation of phase, we derive a cost function that incorporates this uncertainty. We compare this cost function against four standard approaches for a variety of spectrogram sizes, numbers of components, and component distributions. This phase-aware cost function reduces the estimation error but is more affected by detection errors. Keywords: audio processing, source separation, sparse representations, time-frequency representations, unsupervised learning.
1
Introduction
Non-negative spectrogram factorization (NSF) has been proposed for singlechannel source separation [1, 2, 3], music transcription [4, 5], and speech recognition [6]. The input mixture is first transformed into a time-frequency representation such as the short-time Fourier transform (STFT). Because of phase-invariant aspects of human hearing the phase information in the STFT is removed yielding the absolute value or absolute square of the STFT (i.e., magnitude or power spectrogram) [2]. The resulting spectrogram matrix is then factored into the sum of rank-one component spectrograms using independent component analysis (ICA) or non-negative matrix factorization (NMF). Each component comprises a static spectral shape and time-varying amplitude envelope. Ideally, each component contains information unique to a particular source for separation or a particular event for transcription. We focus on this basic approach although various other algorithms incorporate sparseness, convolution, or multiple channels [4, 7, 8]. NSF methods commonly assume that the mixture magnitude or power spectrogram is well approximated by the sum of source components. ICA forces this M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 536–543, 2007. c Springer-Verlag Berlin Heidelberg 2007
Phase-Aware Non-negative Spectrogram Factorization
537
relationship while maximizing the independence of the spectral components [1], whereas NMF minimizes a cost function between the mixture spectrogram and the sum of spectral components [9]. However, because of the nonlinearity of the absolute value function a mixture spectrogram is not the sum of the component spectrograms. Instead, the mixture spectrogram depends on the component spectrograms and their phases. We derive a cost function suitable for NSF by treating the phase as a uniform random variable and maximizing the likelihood of the mixture spectrogram. In previous work, we derived the explicit likelihood function for the case of two components [10]. In this paper, we extend this result to the case of more than two components and show that it is analogous to the multiplicative noise model employed by Abdallah and Plumbley [4]. Even though this cost function is specifically tailored to non-negative spectrograms, the Euclidean distance or generalized Kullback-Leibler divergence is more commonly used for NSF. We compare each cost function based on its ability to estimate the component spectrograms for a variety of spectrogram sizes, numbers of components, and component distributions.
2
Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) was first proposed for the decomposition of images [11]. Image data is inherently non-negative and a single image can be regarded as a linear combination of underlying image parts. NMF estimates these components by minimizing the distance between a set of mixture images contained in the columns of a matrix, A, and the sum of the component matrices, B. The two common distance functions are the Euclidean distance: (Aij − Bij )2 (1) A − B2 = ij
and a generalized version of the Kullback-Leibler divergence: Aij D(AB) = Aij log − Aij + Bij . Bij ij
(2)
When applied to non-negative spectrograms, A represents the mixture spectrogram and B represents the sum of component spectrograms. Instead of decomposing multiple images, spectrogram factorization decomposes multiple spectral frames contained in the columns of A. Although magnitude or power spectrograms are non-negative they are not a linear combination of underlying component spectrograms because of the nonlinearity of the absolute value function used to generate them.
3
Non-negative Spectrograms
A popular way to transform an audio signal into a series of image-like representations is to extract its frequency spectrum at multiple time-points. We consider
538
R.M. Parry and I. Essa
the case of one mixture signal and model it as the sum of R source component signals: R x(t) = sr (t). (3) r=1
The short-time Fourier transform (STFT) is a linear transformation into the frequency domain that preserves this relationship: Fx (k, t) =
R
Fsr (k, t).
(4)
r=1
The magnitude spectrogram is the absolute value of the complex-valued STFT: Xkt = |Fx (k, t)|
[Sr ]kt = |Fsr (k, t)|.
The original STFT contains additional phase information: Fx (k, t) = Xkt (cos Θkt + i sin Θkt ) = [Sr ]kt (cos [Θr ]kt + i sin [Θr ]kt ).
(5)
(6)
r
When applied to non-negative spectrograms, ICA and NMF estimate rank-one component spectrograms. The columns of a K ×R matrix W specify the spectral shapes and the rows of an R × T matrix H represent the amplitude envelopes of all the component spectrograms: [Sr ]kt = Wkr Hrt .
(7)
The various algorithms for NSF vary in the way that they estimate W and H.
4
Non-negative Spectrogram Factorization
The vast majority of NSF methods treat each column of a magnitude or power spectrogram matrix as though it were an image and use ICA or NMF to estimate the components. To our knowledge, there has been only one cost function specifically designed for non-negative spectrograms, namely that of Abdallah and Plumbley [4]. They derive a divergence function based on a multiplicative noise model for estimating the variance (i.e., power) at each time-frequency bin. In this paper, we define the mixture magnitude spectrogram in terms of the component magnitude spectrograms and their phases. Using a probabilistic representation of the phase, we derive an analogous divergence function. Both ICA- and NMF-based techniques implicitly assume that the mixture non-negative spectrogram, X, is well approximated by the sum of the spectral components, Sr . However, by incorporating the phase of the components, Θr , we make this relationship precise: Xkt = [Sq ]kt [Sr ]kt cos ([Θq ]kt − [Θr ]kt ). (8) qr
The mixture magnitude spectrogram does not equal the sum of component magnitude spectrograms unless at most one component is active at a time or all active components have the same phase.
Phase-Aware Non-negative Spectrogram Factorization
5
539
Probabilistic Representation of Phase
Given the mixture spectrogram’s dependence on the phase in Equation 8, we represent the phase as a uniform random variable. We also make the simplifying assumption that the phase is independent at different time-frequency points. To some degree, this is true. However, the unwrapped phase of a steady state signal can be approximated from the previous two time-steps [12]. Although this violates the independence assumption, we have found that the resulting approach works well in practice. We wish to maximize the likelihood of the mixture magnitude spectrogram as a function of the source component magnitude spectrograms. For the case of two components, Equation 8 is a function of one random variable (i.e., Θd = Θ1 − Θ2 ) and it is relatively straightforward to derive p(X|S1 , S2 ) directly [10]. However, for more components it becomes increasingly difficult to derive the precise likelihood function. Instead, we estimate the likelihood using the central limit theorem to capture the shape of the distribution for a large number of components. The probability density function for a complex random variable with magnitude Sr and uniform random phase has a mean of zero and a variance of Sr2 . According to the Lindeberg-Feller central limit theorem [13], the sum of many such tends toward a complex Gaussian with zero mean and a variance variables 2 of r Sr . This theorem is valid under the Lindeberg condition, which states that the component variances, Sr2 , are small relative to their sum [13]. Applied to magnitude spectrograms we have the following: 2 1 Xkt exp − , (9) p(Fx |S1 , . . . , SR ) = πΛkt Λkt kt where Λkt = r [Sr2 ]kt . We find the likelihood of X by integrating with respect to phase, resulting in a Rayleigh distribution: 2 2Xkt Xkt exp − . (10) p(X|S1 , . . . , SR ) = Λkt Λkt kt
6
Maximum Likelihood
In order to estimate Sr , we propose minimizing the negative log likelihood of X: 2Xkt X 2 log − kt . (11) − log p(X|S1 , . . . , SR ) = − Λkt Λkt kt
For comparison, we frame our maximum likelihood approach in terms of a diver2 . By gence function. The minimum of Equation 11 is 1− log (2/Xkt ) at Λkt = Xkt subtracting this value we find a divergence function that is non-negative reaching 2 zero only when all Λkt = Xkt : X2 Λkt 2 kt − 1 + log Ds = D(1X /Λ) = , (12) 2 Λkt Xkt kt
540
R.M. Parry and I. Essa
which is equivalent to Equation 8 in Abdallah and Plumbley [4]. We derive the 2 2 gradient for Ds with respect to Wkr and Hrt : 2 2 Λkt − Xkt ∂Ds Λkt − Xkt ∂Ds 2 2 = = H W , (13) rt kr 2 ) 2) ∂(Wkr Λ2kt ∂(Hrt Λ2kt t k 2 2 2 Hrt . Although Ds is not convex with respect to Wkr or where Λkt = r Wkr 2 Hrt , we find local minima using the following multiplicative update rules: 2 2 2 2 2 2 2 2 2 2 t Hrt Xkt /Λkt k Wkr Xkt /Λkt Wkr ← Wkr Hrt ← Hrt . (14) 2 2 t Hrt /Λkt k Wkr /Λkt
7
Results
We compare the phase-aware cost function, Ds , to four other cost functions based on Euclidean or Kullback-Leibler divergence for magnitude or power spectrograms. Figure 1 plots the shape of the likelihood functions for each of the cost functions with X = 1. Magnitude spectrogram methods (Em and Dm ) reach a maximum on the line S1 + S2 = X. Power spectrogram methods (Ep , Dp , and Ds ) reach a maximum on the circle S12 + S22 = X 2 . When X = 1, the sum of S1 and S2 must be greater than one. Ds encourages this result by penalizing solutions near the origin more than the other cost functions. In our experiment, we evaluate the performance of the cost functions for a variety of spectrogram sizes, numbers of components, and component distributions. Specifically, we construct square spectrograms and vary their size with
1 0.75 0.5 0.25 0
2 1.5
1 0.75 0.5 0.25 0
1 S2
0.5 S1
1 S2
0.5
0.5
1
2 1.5
S1
1.5 2
2 1.5 1 S2
0.5
0.5
1
S1
2
1 0.75 0.5 0.25 0
2 1.5 1 S2
0.5
0.5
1
2
1 0.75 0.5 0.25 0
2 1.5 1 S2
0.5
0.5
1
S1
1.5 2
(c) Ep = X − Λ 2
(b) Dm = D(XW H)
2
S1
1.5
1.5 2
(a) Em = X − W H
1 0.75 0.5 0.25 0
0.5
1
1.5 2
(d) Dp = D(X Λ) 2
(e) Ds = D(1X 2 /Λ)
Fig. 1. The shape of the likelihood functions derived from the 5 labeled cost functions for the case of two components and X = 1
Phase-Aware Non-negative Spectrogram Factorization
541
[1,32] [1,64] [1,128] [1,256] [1,512] [1,1024] [2,1024] [2,512] [2,256] [3,1024] [2,128] [3,512] [4,1024] [2,64] [3,256] [4,512] [5,1024] [2,32] [3,128] [4,256] [6,1024] [5,512] [3,64] [7,1024] [6,512] [5,256] [8,1024] [4,128] [7,512] [9,1024] [6,256] [10,1024] [8,512] [5,128] [3,32] [9,512] [7,256] [4,64] [10,512] [8,256] [6,128] [4,32] [9,256] [5,64] [7,128] [6,64] [10,256] [8,128] [9,128] [5,32] [7,64] [6,32] [8,64] [10,128] [7,32] [9,64] [10,64] [8,32] [9,32] [10,32]
K = T ∈ [32, 64, 128, 256, 512, 1024], R ∈ [1, . . . , 30], and W and H drawn from the uniform, positive normal, or exponential distribution. After drawing W and H from the specified distribution, we construct X using Equations 5–7 with uniformly distributed random phase, Θr . We then estimate W and H using each cost function with multiplicative update rules derived in Section 6 or by Lee and Seung [9]. Because scaling W by α and H by 1/α produces the same cost, we normalize the rows of H to unit L2 norm after every update. We evaluate each cost function according to the mean square error between the original and estimated {Sr }. Because the factorization technique is permutation invariant, we must determine the mapping between each estimated and original Sr . For this purpose, we use a greedy algorithm that matches the two most similar components (one original and one estimated) and then removes them from consideration. The process repeats until the mapping is complete. Figure 2 plots the average performance over ten trials for each configuration of parameters. For space considerations, we only show R ∈ [1, . . . , 10] and W and H drawn from the uniform distribution. Each of the 60 [R, K] pairs are sorted along the x-axis in order of increasing minimum error among the five cost functions. Clearly, the problem becomes more difficult as R increases or as K decreases.
0.18
0.16
Em = X − W H Dm = D(XW H) Ep = X 2 − Λ Dp = D(X 2 Λ) Ds = D(1X 2 /Λ)
1
0.8
Detection Rate
Detection Rate and Estimation Error vs. Problem Difficulty (Uniform dist.)
0.4
0.1 0.2
0.08
0.06
Estimation Error
0
0.04
0.02
0
Problem Difficulty
Fig. 2. Estimation error and detection rate for the five cost functions
Detection Rate
0.6 0.12
[1,32] [1,64] [1,128] [1,256] [1,512] [1,1024] [2,1024] [2,512] [2,256] [3,1024] [2,128] [3,512] [4,1024] [2,64] [3,256] [4,512] [5,1024] [2,32] [3,128] [4,256] [6,1024] [5,512] [3,64] [7,1024] [6,512] [5,256] [8,1024] [4,128] [7,512] [9,1024] [6,256] [10,1024] [8,512] [5,128] [3,32] [9,512] [7,256] [4,64] [10,512] [8,256] [6,128] [4,32] [9,256] [5,64] [7,128] [6,64] [10,256] [8,128] [9,128] [5,32] [7,64] [6,32] [8,64] [10,128] [7,32] [9,64] [10,64] [8,32] [9,32] [10,32]
Mean Square Estimation Error
0.14
542
R.M. Parry and I. Essa
The bottom of Figure 2 plots the mean square estimation error. For simpler versions of the problem, Ds outperforms the rest. However, toward the right of the plot the performance becomes markedly worse and Em and Dm perform better. This inversion of performance is linked to the detection rate. The top of Figure 2 plots the detection rate. When each estimated component uniquely matches a real component, the detection rate is 100%. However, when none of the estimated components match one of the real components, that component is not detected. We compute the detection rate as the fraction of real components that are the closest match (in the mean square sense) for at least one estimated component. At [R, K] = [4, 32], the detection rate for Ds drops below 100% for the first time and this corresponds to the first large increase in estimation error. After that, the estimation rate for Ds accelerates until it is the worst of the group. We speculate that if 100% detection could be maintained, Ds would continue to outperform the others. The underlying distribution of W and H also affects estimation and detection. As presented, the cost functions implicity assume a uniform prior distribution on W and H in the maximum likelihood framework. Therefore, as the component distributions diverge from the uniform distribution (e.g., become more sparse) the maximum likelihood approach becomes less realistic. The aggregated mean square error for the uniform, positive normal (more sparse), and exponential (most sparse) distribution is 0.036, 0.19, and 0.44, respectively. However, sparseness has the opposite effect on detection. All of the cost functions attain 100% detection for more problems as sparseness increases. Table 1 lists the number of problems that resulted in 100% detection and the number of times each algorithm provides the best estimation error for each of the distributions and R between 2 and 10. Table 1. Summary of detection rate and lowest estimation error for R = [2, 10]
Distribution: Uniform Cost func. 100% det. Best est. Em 27 9 34 8 Dm 23 0 Ep 33 0 Dp 35 37 Ds Total 152 54
8
Positive Normal 100% det. Best est. 37 3 43 6 29 0 38 4 40 41 187 54
Exponential 100% det. Best est. 44 0 47 6 30 0 41 3 42 45 204 54
Conclusion
We present a new derivation of a divergence function, Ds , specifically tuned to non-negative spectrogram factorization. We compare its performance against four standard approaches for a variety of spectrogram sizes, numbers of components, and sparseness. We show that this divergence improves the estimation of
Phase-Aware Non-negative Spectrogram Factorization
543
the source components. However, it is more affected by detection error. Algorithms aimed at improving detection rates (e.g., a prior distribution on W and H) are likely to improve Ds .
References 1. Casey, M., Westner, W.: Separation of mixed audio sources by independent subspace analysis. In: Proc. of the Int’l. Computer Music Conf. (2000) 2. Smaragdis, P.: Redundancy Reduction for Computational Audition, a Unifying Approach. PhD thesis, MAS Dept. Massachusetts Institute of Technology (2001) 3. Wang, B., Plumbley, M.D.: Investigating single-channel audio source separation methods based on non-negative matrix factorization. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 17–20. Springer, Heidelberg (2006) 4. Abdallah, S.A., Plumbley, M.D.: Polyphonic transcription by non-negative sparse coding of power spectra. In: Proc. of the Int’l. Conf. on Music Information Retrieval, pp. 318–325 (2004) 5. FitzGerald, D., Coyle, E., Laylor, B.: Sub-band independent subspace analysis for drum transcription. In: Proc. of Int’l. Conf. on Digital Audio Effects, pp. 65–69 (2002) 6. Raj, B., Singh, R., Smaragdis, P.: Recognizing speech from simultaneous speakers. In: Eurospeech (2005) 7. Virtanen, T.: Separation of sound sources by convolutive sparse coding. In: ISCA Tutorial & Research Wkshp on Statistical & Perceptual Audio Processing (2004) 8. FitzGerald, D., Cranitch, M., Coyle, E.: Sound source separation using shifted non-negative tensor factorisation. In: Proc. of the IEEE ICASSP (2006) 9. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in NIPS 13, pp. 556–562. MIT Press, Cambridge (2001) 10. Parry, R.M., Essa, I.: Incorporating phase information for source separation via spectrogram factorization. In: Proc. of IEEE ICASSP (2007) 11. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999) 12. Bello, J.P., Sandler, M.B.: Phase-based note onset detection for music signals. In: Proc. of the IEEE ICASSP. vol. 5, pp. 441–444 (2003) 13. Feller, W.: An Introduction to Probability Theory and Its Applications. Wiley, New York (1971)
Probabilistic Amplitude Demodulation Richard E. Turner and Maneesh Sahani Gatsby Computational Neuroscience Unit, UCL, Alexandra House, 17 Queen Square, London, U.K. {turner,maneesh}@gatsby.ucl.ac.uk http://www.gatsby.ucl.ac.uk
Abstract. Auditory scene analysis is extremely challenging. One approach, perhaps that adopted by the brain, is to shape useful representations of sounds on prior knowledge about their statistical structure. For example, sounds with harmonic sections are common and so timefrequency representations are efficient. Most current representations concentrate on the shorter components. Here, we propose representations for structures on longer time-scales, like the phonemes and sentences of speech. We decompose a sound into a product of processes, each with its own characteristic time-scale. This demodulation cascade relates to classical amplitude demodulation, but traditional algorithms fail to realise the representation fully. A new approach, probabilistic amplitude demodulation, is shown to out-perform the established methods, and to easily extend to representation of a full demodulation cascade. Keywords: audio processing, dynamic and temporal models, hierarchical models, sparse representations, unsupervised learning.
1
Introduction
Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10−1 s); glottal pulses (∼ 10−2 s); and formants (∼ 10−3 s or less). This temporal diversity results directly from the diversity of physical processes that support and control sound production. If the impact of these many processes could be expressed in a single efficient representation, then difficult problems like source separation and auditory scene analysis, that are routinely solved by the brain, might become more accessible to machine audition. However, the diversity of structures and time-scales makes this hard, and most work has concentrated on shorter-time features (e.g. time-frequency representations). Here, we introduce representations that capture longer-range temporal structure in natural sounds. For speech, which will be the running example throughout, this means the sentence and phoneme structure. The basic idea is to represent a sound as a product of processes drawn from a hierarchy, or cascade, of progressively longer time-scale modulators. For speech this might involve three processes: representing sentences on top, phonemes in the middle, and pitch and formants at the bottom (e.g. fig. 2A). To construct such a representation, one M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 544–551, 2007. c Springer-Verlag Berlin Heidelberg 2007
Probabilistic Amplitude Demodulation
545
might start with a traditional amplitude demodulation algorithm, which decomposes a signal into a quickly-varying carrier and more slowly-varying envelope. The cascade could then be built by applying the same algorithm to the (possibly transformed) envelope, and then to the envelope that results from this, and so on. This procedure is only stable, however, if both the carrier and the envelope found by the demodulation algorithm are well-behaved. In section 2 we show that traditional methods return a suitable carrier or envelope, but not both. A new approach to amplitude demodulation is thus called for. Fundamentally, amplitude demodulation is ill-posed: there are infinitely many decompositions of a signal into a slow positive modulator and quickly varying carrier. Ill-posed problems cannot be solved without assumptions, and a deficiency of traditional methods is to make these assumptions implicit. The approach developed here (section 3) is quite different: Demodulation is viewed as a task of probabilistic inference, to which prior information is integral. Our Bayesian approach thus serves to make the unavoidable assumptions, that determine the solution, explicit [2]. This tack yields many benefits. One is that we can tap into the extensive collection of methods developed for probabilistic inference. These are used to construct a family of new algorithms that out-perform traditional amplitude demodulation methods. A second is that the approach generalises easily, for instance to hierarchies and to multidimensional time-series.
2
Traditional Amplitude Demodulation
We begin by briefly reviewing two traditional methods for amplitude demodulation. The first is to obtain the envelope by low-pass filtering a non-linear transformation of the stimulus (for example, square and low pass (SLP)). Roughly, this works because the non-linearity moves energy associated with the modulation from high to low frequencies (via a self convolution in the case of SLP), where it can then be extracted by a low-pass filter. By tuning the filter cutoff one can recover a good estimate for the modulator (see fig. 1A). This type of algorithm derives from applications in radio engineering, where the carrier is a pure sinusoid of known frequency. However, for more general carriers, the demodulated carrier (obtained by point-wise division) is often badly behaved, with large spikes and a non-stationary variance. While the method drives the envelope to be smooth, it places no useful constraint on the carrier waveform. The second method is based on a quantity called the analytic signal. The goal is to express the original signal y(t) in terms of a time varying amplitude and phase such that, y(t) = [a(t) exp [iθ(t)]]. In general, however, this problem is ill posed and the solution is shaped by assumption. One choice might be to constrain the time-scale of a(t) (see section 3). More commonly, however, the imaginary part is chosen to make the signal analytic, by setting it to the Hilbert transform of y(t). This does mean that, in certain circumstances, the amplitude of the analytic signal (AAS), a(t), is restricted to lower frequencies than the phase component exp [iθ(t)] [1], and this property might well be desirable. In general, however, the particular signals that result may not correspond to intuition. Indeed, by contrast
546
R.E. Turner and M. Sahani
to SLP, the modulators often seem poor (see fig. 1B), but the demodulated carriers are good, at least in that they have a stationary variance. B.
x(1)
y and x(2)
A.
1
1.5
2
2.5
3
0.5
1
1.5
2
2.5
3
x(1)
y and x(2)
0.5
2.5
2.6
2.7
time/s
2.8
2.9
2.5
2.6
2.7
2.8
2.9
time/s
Fig. 1. The result of applying two traditional demodulation schemes to a spoken sentence (black), shown at two different scales (top and bottom). Both the envelopes (red) and carriers (blue) are shown. A) SLP: The cut-off of the filter was chosen to have a time-scale of 0.1s corresponding to the phoneme structures. The extracted envelope is good, but the carrier contains large spikes in regions where the envelope is zero, but the signal is non-zero. B) AAS: This demodulates the pitch period and not the phoneme structure. The carrier variance is stationary.
To summarise, SLP was derived from the perspective of feed-forward processing and, broadly speaking, extracts good estimates for the envelopes, but poor carriers. By contrast, the AAS, a demodulation method by accident rather than design, extracts good carriers, but poor envelopes. SLP has a tunable parameter (the cut-off of the low pass filter), which is useful to select one of several different envelopes present in a signal. However, it might be advantageous to automate the setting of such a slowness parameter. The AAS has no free parameters which is favourable when we want a method to work quickly.
3
Probabilistic Amplitude Demodulation (PAD)
One conclusion from the previous section is that a good demodulation algorithm should recover not only a smooth and slow envelope, but also a carrier with stationary variance. The new approach we propose explicitly utilises these two types of prior knowledge in order to find an optimal envelope and carrier. A natural framework in which to leverage prior information to solve an illposed problem is provided by Bayesian methods [2]. These are based on a probabilistic model for the observed signal, which in our case includes a model for
Probabilistic Amplitude Demodulation
547
the two latent variables, the carrier (X(1) ) and the modulator (X(2) ), and for the dependence of the observed data (Y) on these. Our prior beliefs about the variables (p(X(1) ), p(X(2) )) are expressed through probability distributions; so for example, the distribution over envelopes may assign higher probability to slow processes than to fast ones. Having specified this model for p(Y, X(1) , X(2) |θ), the calculus of probabilities leads naturally to algorithms to infer the latent variables, and to learn parameters. Whilst the integrals required to form such quantities may be analytically intractable, there are a variety of well known approximation schemes that can be exploited. 3.1
The Generative Model
Perhaps the simplest generative model for amplitude modulation is as follows, (2) (2) (2) (2) = Norm (0, 1) , p zt |zt−1 = Norm λzt−1 , σ ∀t > 0, (1) p z0 (2) (2) (1) (2) (1) xt = fa(2) zt (2) , xt = Norm (0, 1) , yt = xt xt . This expresses the generation of the envelope in two steps. First a slowly varying, but symmetric, process is produced (Z(2) ); the Gaussian random-walk gives this an effective length-scale determined by λ, leff = − log(λ), which is inherited by X(2) . This length-scale is learnt from data, and is typically long (i.e. λ is close to one and leff = − log(1 − δ) ≈ 1δ ). The positive envelope is obtained using point-wise non-linearity, here given by (2) (2) fa(2) zt = log exp zt + a(2) + 1 , (3) (2)
which is logarithmic for large negative values of zt , and linear for large positive values. This transforms the Gaussian distribution over Z(2) into a sparse distribution, which is a good match to the marginal distributions of natural envelopes. The parameter a(2) controls exactly where the transition from log to linear occurs, and consequently alters the degree of sparsity. Having generated the envelope, the carrier is simply Gaussian white noise. The observations Y are generated by a point-wise product of the envelope and carrier. A typical draw from this generative model can be seen in Fig. 2B. This model is a fairly crude one for speech. For example, the speech carrier process will be structured (containing formant and pitch information) and yet it is modelled as Gaussian white noise. Whilst more complex models can certainly be developed, surprisingly even this very simple model is excellent at demodulation. 3.2
Learning and Results
The joint probability of both latent and observed signals is: T (2) (1) (2) (2) (2) (1) (2) (1) p yt |xt , xt p xt |xt−1 p xt , p Y, X , X |λ, σ = p x0
(4)
t=1
with
(2) (2) dz(2) (2) (2) t p xt |xt−1 = p zt |zt−1 (2) . dx t
(5)
548
R.E. Turner and M. Sahani
B.
C.
y
1:T
x(1)
1:T
x(2)
1:T
x(3)
1:T
A.
2 4 time /s
6
Fig. 2. An example of a modulation-cascade representation of speech (A) and typical samples from generative models used to derive that representation (B and C). A) The spoken-speech waveform (black) is represented as the product of a carrier (blue), a phoneme modulator (red) and a sentence modulator (magenta). (Derived using the method described in section 4.) B) The standard generative model with an envelope (red), a carrier (blue), and the waveform (black). C) The extended model (M = 3) with an additional slowly varying envelope (magenta). For sounds drawn from generative model and processed using PAD see [5].
(1) (2) (1) (2) = δ yt − xt xt , we can integrate out the carrier from this As p yt |xt , xt expression which yields, T (2) T (2) yt dzt 1 (2) (2) (1) (2) p Y, X |λ, σ = p z0 p zt |zt−1 p xt = (2) . (6) (2) (2) dx x x t=0
t
t=1
t
t
Unfortunately, this expression cannot be analytically marginalised with respect to the envelope X(2) , and so an approximation is needed. One approach is to assume the distribution over X(2) is highly peaked and to approximate the integral by its value at the peak: the maximum a posteriori (MAP) value, p (Y|λ, σ) ≈ p Y, X(2) MAP |λ, σ . This is a coarse approximation, but it is well established and resembles a zero-temperature form of expectation maximisation. An alternative, to which we will return, is to approximate the integral by sampling. Before discussing these approximations in more detail, we describe a final improvement to the model. In the Bayesian methodology parameters have the same status as latent variables and accordingly, may also be integrated out. In the present case, this is possible for either of the parameters controlling Z(2) ; σ 2 or λ. More general models might have multidimensional λ, and so we choose to integrate over this, but for the simple model both methods work equally well. There are several specific advantages to this approach in the present application. Firstly, while the old model had one smoothness, the new model is more flexible, being essentially a weighted sum of the old models with different smoothnesses
Probabilistic Amplitude Demodulation
549
(see eq. 7). Secondly, it is not possible to learn λ, σ 2 and X(2) using the old (2) approach: a trivial solution (xt = 1, λ = 1, σ 2 = 0) causes the likelihood to diverge. However, this maximum has infinitesimal width and therefore essentially no mass, so integration over λ removes this deficiency. Practically the integration proceeds as follows: A conjugate Gaussian prior pri is placed on the smoothnesses p(λ) = Norm(μpri λ , σλ ), and the integral that results is Gaussian,
pri pri (2) pri . (7) p Y, X |μλ , σλ , σ = dλ p Y, X(2) |λ, σ p λ|μpri λ , σλ Completing this integral yields the following objective function, ⎡ ⎤ T 2 1 1 (2) 2 y ⎢ ⎥ (2) pri log p Y, X(2) |μpri + t 2 ⎦ ⎣2 log xt + 2 zt λ , σλ , σ = − 2 t=1 σ (2) xt 2 2 T dz(2) 1 2 T 1 μpri 1 μpost t (2) 2 λ λ z + log (2) − − log σ − + dx 2 0 2 2 σλpri 2 σλpost t=0 t σλpost 3 T + 1 log 2π, (8) + log pri + 2 σλ 2 where μpost and σλpost are the posterior mean and variance over the smoothλ ness parameter, which are given by, 2 2 (2) (2) T pri pri pri 2 2 σ σ σ post 2 t=1 zt−1 zt + μλ σ λ λ post = σλ 2 2 , μλ = 2 2 . (2) (2) T T σ 2 + σλpri σ 2 + σλpri t=1 zt−1 t=1 zt−1 (9) The MAP estimate of the envelope can be found by gradient-based optimisation of this cost function. We used a conjugate-gradient algorithm on the log of the envelope (to ensure positivity). Depending on the application we can optimise the remaining parameter and hyperparameters too, or set them by hand. Results are shown for both approaches. In fig. 3A all parameters and hyper-parameters have been optimised and the envelope picks off the sentence structure. In fig. 3B the priors and the variances have been fixed in order that the algorithm picks off a rather faster envelope (this occurs for a wide range of prior/parameters settings and does not require fine tuning). In this case the phonemes are discovered. Qualitatively the performance appears far superior to that of traditional algorithms: a smooth envelope and a demodulated carrier of approximately constant variance are recovered reliably.
4
Extensions to Probabilistic Amplitude Demodulation
Our original motivation was to develop new representations for the long-temporal structures in sounds, particularly those based on a product of processes. A
550
R.E. Turner and M. Sahani
B.
x(1)
y, x(2)
A.
x(1)
y, x(2,3)
C.
time/s
time /s
Fig. 3. Carriers (blue) and modulators (red and magenta) extracted by probabilistic amplitude demodulation (PAD) from a spoken sentence (black). A) Vanilla PAD selects a slow sentence envelope, but the carrier is still significantly modulated. B) Fixing the priors and variances leads to a faster, phoneme envelope, and results in a carrier that is more demodulated. C) The cascaded version of PAD, using sampling to generate error-bars on the extracted processes, provides an elegant representation of the sound.
necessary stepping stone along this path was the development of new methods for amplitude demodulation. We have already outlined a recursive procedure, which can use these new algorithms, for deriving such a representation (see section 1). The approach was to successively remove the fastest remaining process. However, ideally we would like to estimate the processes concurrently. Fortunately this can be done by extending the probabilistic method to a cascade of M processes: (m) (m) (m) (m) 2 = Norm (0, 1) , p zt |zt−1 = Norm λm zt−1 , σm ∀t > 0, (10) p z0 (m)
xt
(m) = fa(m) zt ∀m > 1,
(1)
xt
= Norm (0, 1) ,
yt =
M
(m)
xt
. (11)
m=1
A suitable model for speech might have M =3 with a “sentence” modulator (X(3) ) and a “phoneme” modulator (X(2) ). A priori we would expect λm+1 > λm . Learning and inference in this model can be completed in an analogous manner to that described in the previous section. That is; integrate out the carrier X(1)
Probabilistic Amplitude Demodulation
551
and the dynamics λ1:M and optimise over the modulators X(2:M) simultaneously (this was how fig. 2A was generated). However, an alternative method is to use sampling to integrate out the modulators, approximately. The most amenable method is Hamiltonian Monte Carlo as it requires the exact same evaluations as required by a gradient-based optimiser, namely evaluations of the PAD objective function and its derivatives. There are several potential advantages of the sampling approach, which gives back samples from the posterior distribution over the modulators p(X(2:M) |Y), rather than just the maximum value of this distribution. Firstly, using information about the whole distribution might help learn better parameters. Secondly, we can now put error-bars on our inferences for the envelope. Thirdly we can check whether the mode of the posterior is typical of the distribution, and therefore assess the merits of the previous approach. This sampling procedure was used to learn the cascade model with M = 3. The results are shown in fig. 3C. The algorithm extracts a sentence process and a phoneme process, and provides an elegant representation of the speech sound. Empirically, the mode and the mean of the distribution over envelopes is found to be in a similar location, and the parameter values discovered by both methods similar. This indicates that the MAP approximation might not be too severe.
5
Conclusion
The contributions of this paper are two fold. Firstly we provide a family of algorithms for probabilistic amplitude demodulation that out perform traditional methods. Secondly, and more generally, we propose an elegant new representation for the long time-scale temporal structure in sounds based on a cascade of modulatory processes. The goal of future research will be to wed this model for phonemes and sentences, to one that models pitch and formant information (e.g. [4]), to solve hard machine-audition tasks like blind-source separation and auditory scene analysis. Acknowledgements. We would like to thank Pietro Berkes for helpful discussions. This work was supported by the Gatsby Charitable Foundation.
References 1. Cohen, L.: Time-Frequency Analysis. Prentice Hall Signal Processing Series (1995) 2. Mackay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003) 3. Lawrence, N.: Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. J. Mach. Learn. Res. 6, 1783–1816 (2005) 4. Turner, R.E., Sahani, M.: Modeling Natural Sounds with Gaussian Modulation Cascade Processes. In: Advances in Models for Acoustic Processing Workshop (2006) 5. Turner, R.E., Sahani, M.: Probabilitic Amplidude Demodulation Technical Report. GCNU TR 2007-002 (2007), http://www.gatsby.ucl.ac.uk/publications/tr/
First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results Emmanuel Vincent1 , Hiroshi Sawada2 , Pau Bofill3 , Shoji Makino2 , and Justinian P. Rosca4 METISS Group, IRISA-INRIA Campus de Beaulieu, 35042 Rennes Cedex, France [email protected] 2 Signal Processing Research Group, NTT Communication Science Labs 2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan Departament d’Arquitectura de Computadors, Universitat Polit`ecnica de Catalunya Campus Nord M` odul D6, Jordi Girona 1-3, 08034 Barcelona, Spain 4 Siemens Corporate Research 755 College Road East, Princeton NJ 08540, USA 1
3
Abstract. This article provides an overview of the first stereo audio source separation evaluation campaign, organized by the authors. Fifteen underdetermined stereo source separation algorithms have been applied to various audio data, including instantaneous, convolutive and real mixtures of speech or music sources. The data and the algorithms are presented and the estimated source signals are compared to reference signals using several objective performance criteria.
1
Introduction
Large-scale evaluations facilitate progress in a field by revealing the effects of different choices in algorithm design, promoting common test data and evaluation criteria and attracting the interest of funding bodies. Several evaluations of audio source separation algorithms have been conducted recently, focusing on singlechannel speech mixtures1 or multichannel over-determined speech mixtures2,3,4 . This article provides an overview of the complementary evaluation campaign for stereo underdetermined audio mixtures organized by the authors. Detailed results of the campaign are available at http://sassec.gforge.inria.fr/. We define the source separation task and describe test data and evaluation criteria in Section 2. Then we present the algorithms submitted by the participants in Section 3 and summarize their results in Section 4. We conclude in Section 5. 1 2 3 4
http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm http://bme.engr.ccny.cuny.edu/faculty/parra/bss/ http://homepages.inf.ed.ac.uk/mlincol1/SSC2/ http://mlsp2007.conwiz.dk/index.php?id=43
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 552–559, 2007. c Springer-Verlag Berlin Heidelberg 2007
First Stereo Audio Source Separation Evaluation Campaign
2
553
Data and Evaluation Criteria
2.1
The Stereo Underdetermined Source Separation Task
Common audio signals, e.g. radio, television, music CDs and MP3s, are typically available in stereo (two-channel) format and consist of a mixture of more than two sound sources. Denoting by J > 2 the number of sources, each channel xi (t) (1 ≤ i ≤ 2) of the mixture signal can be expressed as [1] xi (t) =
J
simg ij (t)
(1)
j=1
where simg ij (t) is the spatial image of source j (1 ≤ j ≤ J) on channel i, that is the contribution of this source to the observed mixture in this channel. Different types of mixtures can be distinguished. Instantaneous mixtures are generated via (1) using a mixing desk or dedicated software by constraining the spatial images of each source j to simg ij (t) = aij sj (t), where sj (t) is a singlechannel source signal and aij are positive mixing gains. Synthetic convolutive mixtures are obtained similarly via simg ij (t) = τ aij (τ )sj (t − τ ), where aij (τ ) are mixing filters. Live recordings are acquired by recording all the sources simultaneously in a room using a pair of microphones. These recordings may also be obtained by recording the sources one at a time in the same room and adding the resulting source images together within each channel [2]. We define the source separation task as that of estimating the spatial images simg ij (t) of all sources j on all channels i from the two channels xi (t) of a mixture. This definition has two advantages: it is valid for all types of mixtures, even with spatially extended sources that cannot be represented as single-channel signals, and potential gain or filtering indeterminacies about the estimated single-channel source signals sj (t) disappear when considering their spatial images instead [1]. 2.2
Development and Test Data
The development and test data used for the evaluation campaign involved four classes of signals: male speech, female speech, non-percussive music and music including drums. Music mixtures involved three sources taken from synchronized multitrack recordings, while speech mixtures involved four independent sources. All the source signals were sampled at 16 kHz and had a duration of 10 s. The development data consisted of one instantaneous mixture, two synthetic convolutive mixtures and two live recordings per class. Instantaneous mixtures were generated by scaling the source signals by positive gains. Live recordings were acquired by playing the source signals through loudspeakers in a room at NTT with RT60 = 250 ms reverberation time and recording them using two pairs of omnidirectional microphones with spacings of 5 cm and 1 m. Figure 1 depicts the arrangement of loudspeakers and microphones. Synthetic convolutive mixtures were obtained by filtering the sources with simulated room impulse responses computed for the same arrangement using Roomsim5 . Ground truth 5
http://media.paisley.ac.uk/˜campbell/Roomsim/
554
E. Vincent et al.
data, i.e. the source signals, their spatial images and the mixing filters or gains, were distributed with the mixture signals at http://sassec.gforge.inria.fr/. 4.45m 2m
Room height 2.5m
3.55m
Omnidirectional 04 1 microphone pairs 0 1 1m 3 1 45° 3 Spacing 5cm/1m 0 00 11 Height 1.4m 00 11 15° 11 0 1.5m
1 20 40 1
00 Loudspeakers 11
002 11 −10°
11 00 00 11 −50° 001 11
Height 1.6m
Fig. 1. Recording arrangement used for development data. Only three of the four loudspeakers were used for music mixtures.
The same number of test data was obtained similarly to the development data, using different source signals and positions for each mixture. The distances of the sources from the center of the microphone pairs were drawn randomly between 80 cm and 1.2 m and their angles of arrival between −60◦ and +60◦ with a minimal spacing of 15◦ . The mixture signals were made available, but ground truth data, including the exact source positions, was kept hidden6 . 2.3
Objective Performance Criteria
The participants were asked to provide estimates sˆimg ij (t) of the spatial images of all sources j for some test mixtures. The quality of these estimates was then evaluated by comparison with the true source images simg ij (t) using four objective performance criteria, inspired from criteria previously designed for singlechannel source estimates [3]. By contrast with other existing measures [4,5], the proposed criteria can be computed for all types of separation algorithms and do not necessitate knowledge of the separating filters or masks. The criteria derive from the decomposition of an estimated source image as img spat interf (t) + eartif sˆimg ij (t) ij (t) = sij (t) + eij (t) + eij
(2)
spat interf where simg (t) and eartif ij (t) are disij (t) is the true source image and eij (t), eij tinct error components representing spatial (or filtering) distortion, interference and artifacts. This decomposition is motivated by the auditory distinction between sounds from the target source, sounds from other sources and “gurgling” 6
Only the first two authors of this article had potentially access to these data.
First Stereo Audio Source Separation Evaluation Campaign
555
spat interf noise, corresponding to the signals simg (t) and eartif ij (t) reij (t) + eij (t), eij spectively. The computational modeling of this auditory segregation process is an open issue so far. For simplicity, we chose to express spatial distortion and interference components as filtered versions of the true source images, computed by least-squares projection of the estimated source image onto the corresponding signal subspaces [3] L img sij )(t) − simg espat ij (t) = Pj (ˆ ij (t)
einterf (t) = ij eartif ij (t)
=
L L img Pall (ˆ simg sij )(t) ij )(t) − Pj (ˆ img img L sˆij (t) − Pall (ˆ sij )(t)
(3) (4) (5)
where PjL is the least-squares projector onto the subspace spanned by simg kj (t−τ ), L is the least-squares projector onto the subspace 1 ≤ k ≤ I, 0 ≤ τ ≤ L−1, and Pall spanned by simg kl (t − τ ), 1 ≤ k ≤ I, 1 ≤ l ≤ J, 0 ≤ τ ≤ L − 1. The filter length L was set to 512 (32 ms), which was the maximal tractable length. The relative amounts of spatial distortion, interference and artifacts were then measured using three energy ratio criteria expressed in decibels (dB): the source Image to Spatial distortion Ratio (ISR), the Source to Interference Ratio (SIR) and the Sources to Artifacts Ratio (SAR), defined by I img 2 t sij (t) i=1 (6) ISRj = 10 log10 I spat 2 t eij (t) i=1 I img 2 (sij (t) + espat ij (t)) (7) SIRj = 10 log10 i=1I t interf (t)2 t eij i=1 I img spat interf (t))2 t (sij (t) + eij (t) + eij i=1 SARj = 10 log10 . (8) I artif 2 t eij (t) i=1 The total error was also measured by the Signal to Distortion Ratio (SDR) I img 2 t sij (t) i=1 (9) SDRj = 10 log10 I spat interf (t) + eartif (t))2 ij t (eij (t) + eij i=1 We emphasize that this measure is arbitrary, in the sense that it weights the three error components equally. In practice, each component should be given a different weight depending on the application. For instance, spatial distortion is of little importance for most applications, except for karaoke where it can result in imperfect source cancellation, even in the absence of interference or artifacts. Similarly, artifacts are crucial for hearing aid applications, for which “gurgling” noise should be avoided at the cost of increased interference. These criteria were implemented in Matlab and distributed at http://sassec.gforge.inria.fr/.
3
Algorithms
The campaign involved thirteen participants, who submitted the results of fifteen source separation algorithms. The underlying approaches are summarized in
556
E. Vincent et al. Table 1. Submitted source separation algorithms
N◦ Submitter Name
Source localization
Source signal estimation
Algorithms for instantaneous mixtures only Manual IID clustering from a Source magnitude estimation in magnitude-weighted histogram the STFT bins associated with with auditory feedback [6] each IID cluster [6] P. Bofill Peak picking on a smoothed IID Minimization of the l1 norm of histogram [7] with STFT bins the real and imaginary parts of selected as in [8] the source STFTs [9] A. Ehmann Manual peak picking on an IID Binary STFT masking with difhistogram ferent resolutions at high/low frequencies V. Gowreesunker Peak picking on a thresholded Binwise MDCT projection onto IID histogram [10] the nearest IID subspace [10] M. Kleffner Peak picking on a thresholded Online FFT-domain minimumIID histogram [11] with STFT variance beamforming [13] bins selected as in [12] N. Mitianoudis Soft IID clustering given the Binwise MDCT projection onto number of sources [14] the nearest IID subspace [14] H. Sawada Hard IID clustering given the Binary STFT masking number of sources E. Vincent Manual peak picking on an IID Minimization of the l0 norm of histogram weighted as in [12] the source STFTs [15] M. Xiao Hard fixed-width IID clustering Mixing inversion with 2 sources SABM+SSDP on selected STFT bins [8] per time frame estimated from the mixture covariance [16] M. Xiao Hard fixed-width IID clustering Extension of [16] with more acSABM+SNSDP on selected STFT bins [8] tive sources in some time frames
1 D. Barry ADRess 2
3
4 5
6 7 8 9
10
Algorithms for instantaneous and/or convolutive mixtures Soft (IID,ITD) clustering given Maximum SNR beamforming the number of sources [17] [18] and soft STFT masking [19] Y. Izumi Soft clustering of the mixture Soft STFT masking by cluster STFT bins based on (IID,IPD) probabilities [20] given the number of sources [20] T. Kim FFT-domain independent component analysis [21] and soft masking (two sources only) R. Weiss & M. Soft (IID,IPD) clustering given Soft STFT masking by cluster Mandel the number of sources [22] probabilities [22] H. Sawada Frequency-wise (IID,IPD) clus- Binary STFT masking tering given the number of sources as in [17] and sorting [23]
11 S. Araki 12
13
14 15
Table 1. All algorithms except n◦ 13 could be broken into (possibly iterated) source localization and source signal estimation steps. These two steps were conducted in the time-frequency domain via a Short-Time Fourier Transform (STFT) or a
First Stereo Audio Source Separation Evaluation Campaign
557
Table 2. Results for instantaneous mixtures Algorithm SDR (dB) ISR (dB) SIR (dB) SAR (dB) Time (s)
1 4.0 7.5 13.2 5.3 1
2 4.2 8.2 12.9 10.8 300
3 6.8 13.9 15.5 7.8 5
4 57 6 7 3.5 -23.4 -16.0 7.2 6.2 -21.8 -12.8 14.6 14.4 12.8 13.2 15.9 5.5 5.9 5.3 8.1 10 600 200 9
8 10.3 19.2 16.0 12.2 5
9 5.8 15.9 10.7 5.8 2
10 14 2.7 -2.4 20.0 4.1 6.8 -3.0 8.7 4.2 2 1000
Table 3. Results for synthetic convolutive mixtures and live recordings with two different microphone spacings Mixtures Algorithm SDR (dB) ISR (dB) SIR (dB) SAR (dB) Time (min)
Synth 5 117 14 2.5 0.9 6.0 2.8 5.8 -2.7 4.9 14.1 1 20
cm Synth 1 m 15 14 15 0.2 0.7 0.6 4.6 2.8 4.4 4.4 -0.4 4.2 7.5 10.7 7.5 0.6 20 0.6
Live 5 cm 117 127 138 14 2.6 -23.2 -20.3 1.2 5.9 -19.2 -17.0 4.0 4.6 1.3 2.9 -1.9 5.4 6.2 6.2 13.0 1 1 4 20
Live 1 m 15 138 14 15 1.8 -19.0 2.1 3.6 7.0 -15.5 4.9 8.4 4.2 2.9 0.8 6.9 6.8 5.8 8.0 6.8 0.6 4 20 0.6
Modified Discrete Cosine Transform (MDCT), except for algorithms n◦ 9 and 10 where source estimation was directly performed in the time domain. The directions of the sources were modeled by the Interchannel Intensity Difference (IID) or variants thereof in the instantaneous case. The Interchannel Time Difference (ITD) or the Interchannel Phase Difference (IPD) were additionally used in the convolutive case. Algorithms n◦ 2, 4, 5, 9 and 10 were fully blind, while others required manual input of the number of sources or the source directions.
4
Results
The performance of each algorithm was assessed by sorting the estimated source image signals so as to maximize the average SIR and successively averaging the measured SDR, ISR, SIR and SAR over the sources and over the mixtures. The resulting figures are given in Tables 2 and 3 for instantaneous and convolutive mixtures respectively, along with platform-specific computation times. The large negative SDR and ISR figures for algorithms n◦ 5, 6, 12 and 13 are due to incorrect scaling of the submitted source images. Detailed results and sound files are available at http://sassec.gforge.inria.fr/. In the instantaneous case, most algorithms provided similar SIR and SAR values clustered around 13 dB and 6 dB respectively, denoting high interference rejection but clear artifacts. Algorithms n◦ 2 and 8 resulted in fewer artifacts, while algorithms n◦ 10 and 14 provided more interference. Note that blind 7 8
Average performance for speech mixtures only. Average performance over the two estimated sources for speech mixtures only.
558
E. Vincent et al.
algorithms n◦ 9 and 10 achieved similar source localization accuracy as non-blind algorithms n◦ 3, 7 and 8, as shown by large ISR values. In the convolutive case, most algorithms provided again similar SIR and SAR values but around 4 dB and 6 dB respectively, indicating both strong interference and artifacts. Algorithms n◦ 11 and 15 resulted in slightly less interference, while algorithm n◦ 14 provided much more interference but less artifacts. Interestingly, performance did not vary much between synthetic convolutive mixtures and live recordings or with different microphone spacings.
5
Conclusion
In this article, we described the test data and objective performance criteria used in the context of the first stereo audio source separation evaluation campaign and summarized the approaches behind the fifteen submitted algorithms and their results. We are currently planning to complement objective performance figures by listening tests and present detailed results on the campaign website. We hope that this campaign fosters interest for evaluation in the source separation community and that larger-scale regular campaigns will take place in the future. The creation of a collaborative organization framework appears crucial to this aim, since it would allow sharing between the participants of time-consuming tasks such as the collection of test data under appropriate licenses and the recording of live mixtures.
References 1. Cardoso, J.F.: Multidimensional independent component analysis. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IV–1941–1944 (1998) 2. Schobben, D., Torkkola, K., Smaragdis, P.: Evaluation of blind signal separation methods. In: Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA), pp. 261–266 (1999) 3. Vincent, E., Gribonval, R., F´evotte, C.: Performance measurement in blind audio source separation. IEEE Trans. on Audio, Speech and Language Processing 14, 1462–1469 (2006) 4. Mansour, A., Kawamoto, M., Ohnishi, N.: A survey of the performance indexes of ICA algorithms. In: Proc. IASTED Int. Conf. on Modelling, Identification and Control (MIC), pp. 660–666 (2002) 5. Yılmaz, O., Rickard, S.T.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. on Signal Processing 52, 1830–1847 (2004) 6. Barry, D., Coyle, E., Lawlor, B.: Real-time sound source separation using azimuth discrimination and resynthesis. In: Proc. 117th AES Convention. (2004) (preprint 6258) 7. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353–2362 (2001) 8. Xiao, M., Xie, S., Fu, Y.: A novel approach for underdetermined blind source separation in the frequency domain. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3498, pp. 484–489. Springer, Heidelberg (2005)
First Stereo Audio Source Separation Evaluation Campaign
559
9. Bofill, P., Monte, E.: Underdetermined convoluted source reconstruction using LP and SOCP, and a neural approximator of the optimizer. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 569–576. Springer, Heidelberg (2006) 10. Gowreesunker, B.V., Tewfik, A.H.: Two improved sparse decomposition methods for blind source separation. In: Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA) (2007) 11. Mohan, S., Kramer, M.L., Wheeler, B.C., Jones, D.L.: Localization of nonstationary sources using a coherence test. In: Proc. IEEE Workshop on Statistical Signal Processing (SSP), pp. 470–473 (2003) 12. Arberet, S., Gribonval, R., Bimbot, F.: A robust method to count and locate audio sources in a stereophonic linear instantaneous mixture. In: Rosca, J., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 536–543. Springer, Heidelberg (2006) 13. Lockwood, M.E., Jones, D.L., Bilger, R.C., Lansing, C.R., O’Brien Jr., W.D., Wheeler, B.C., Feng, A.S.: Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms. Journal of the Acoustical Society of America 115, 379–391 (2004) 14. Mitianoudis, N., Stathaki, T.: Underdetermined source separation using mixtures of warped Laplacians. In: Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA) (2007) 15. Vincent, E.: Complex nonconvex lp norm minimization for underdetermined source separation. In: Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA) (2007) 16. Xiao, M., Xie, S., Fu, Y.: A statistically sparse decomposition principle for underdetermined blind source separation. In: Proc. Int. Symp. on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 165–168 (2005) 17. O’Grady, P.D., Pearlmutter, B.A.: Soft-LOST: EM on a mixture of oriented lines. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 428–435. Springer, Heidelberg (2004) 18. Araki, S., Sawada, H., Makino, S.: Blind speech separation in a meeting situation with maximum SNR beamformers. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. I, pp. 41–44 (2007) 19. Cermak, J., Araki, S., Sawada, H., Makino, S.: Blind source separation based on beamformer array and time-frequency binary masking. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. I, pp. 145–148 (2007) 20. Izumi, Y., Ono, N., Sagayama, S.: Sparseness-based 2ch BSS using the EM algorithm in reverberant environment. In: Submitted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2007) 21. Kim, T., Attias, H.T., Lee, S.Y., Lee, T.W.: Blind source separation exploiting higher-order frequency dependencies. IEEE Trans. on Audio, Speech and Language Processing 15, 70–79 (2007) 22. Mandel, M.I., Ellis, D.P.W., Jebara, T.: An EM algorithm for localizing multiple sound sources in reverberant environments. In: Advances in Neural Information Processing Systems (NIPS 19) (2007) 23. Sawada, H., Araki, S., Makino, S.: Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS. In: Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), pp. 3247–3250 (2007)
‘Shadow BSS’ for Blind Source Separation in Rapidly Time-Varying Acoustic Scenes S. Wehr1 , A. Lombard1 , H. Buchner2 , and W. Kellermann1 University Erlangen-Nuremberg Multimedia Communications and Signal Processing Cauerstraße 7, 91058 Erlangen, Germany {Wehr,Lombard,WK}@LNT.de 2 Deutsche Telekom Laboratories Technical University Berlin Ernst-Reuter-Platz 7, 10587 Berlin, Germany [email protected] 1
Abstract. This paper addresses the tracking capability of blind source separation algorithms for rapidly time-varying sensor or source positions. Based on a known algorithm for blind source separation, which also allows for simultaneous localization of multiple active sources in reverberant environments, the source separation performance will be investigated for abrupt microphone array rotations representing the worst case. After illustrating the deficiencies in source-tracking with the given efficient implementation of the BSS algorithm, a method to ensure robust source separation even with abrupt microphone array rotations is proposed. Experimental results illustrate the efficiency of the proposed concept.
1
Introduction
This paper is motivated by the so-called cocktail-party problem which arises when convolutive mixtures of multiple simultaneously active speakers are recorded by multiple microphones. In many applications (e.g. hands-free human-machine interfaces, [1]), we need to focus on one single source and try to suppress interfering sources. We address this problem here by blind source separation (BSS) algorithms which can deal well with unknown microphone and source positions [2]. Furthermore, BSS provides us with several separated source signals which may be individually selected for further processing. We briefly review the generic ICA-based BSS framework for convolutive mixtures called TRINICON [3,4], which is also capable of simultaneously localizing multiple active sources [5,6]. The motivation for considering it here is that most of the known state-of-the-art BSS algorithms may be seen as certain approximations of this concept. As a fairly recent and advanced approximate practical algorithm, we investigate here [7]. This algorithm serves thus as a good representative for many of the major ICA algorithms. It is based on a special choice of Sylvester constraint, the correlation method, the natural gradient, and on an M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 560–568, 2007. c Springer-Verlag Berlin Heidelberg 2007
‘Shadow BSS’ for BSS in Rapidly Time-Varying Acoustic Scenes
561
approximated normalization. This allows for an efficient implementation, but may be responsible for the observed deficiencies in certain situations. Although the investigated algorithm is designed for Q active sources, we consider only two active sources in this paper. In Section 2, we demonstrate by simulations that both the separation performance of the considered BSS algorithm as well as the performance of the BSS-based source localization may significantly degrade for rapidly time-varying sensor positions. Analysis of this scenario leads us to proposing the so-called shadow-BSS system, which runs in parallel to the main BSS algorithm. Simulation results confirm the efficiency of the proposed extension. s1 (n)
s2 (n)
x1 (n)
h11 (n)
w11 (n)
h12 (n)
w12 (n)
h21 (n)
w21 (n)
h22 (n)
x2 (n)
w22 (n)
y1 (n)
y2 (n)
Fig. 1. 2-Channel Mixing and Demixing System
Figure 1 illustrates the BSS scheme for two sources with the source signals si (n), the sensor signals xi (n), and the BSS output signals yi (n), respectively (i = 1, 2). The unknown mixing system is modeled by M -tap room impulse responses hij (n) and the demixing system determined by BSS is modeled by Ltap FIR filters wij (n) (j = 1, 2). The microphone signals and BSS output signals can then be written as: xi (n) =
2 M−1
hji (κ)sj (n − κ)
(1)
wji (κ)xj (n − κ).
(2)
j=1 κ=0
yi (n) =
2 L−1 j=1 κ=0
The source separation problem is then solved by appropriately determined demixing filters. Further details on the adaptation of the demixing filters are given in, e.g., [7]. As shown in [6], TRINICON-based BSS inherently identifies the (unknown) mixing system up to a scaling for Q = 2: wji (n) = −αji · hji (n)
and
wjj (n) = αii · hii (n),
i = j.
(3)
Based on this system identification, the TDOA (Time Difference of Arrival) can be derived from the demixing filters simultaneously for both active sources from the main peaks of wij (n) [6].
562
2
S. Wehr et al.
BSS and Source Localization in Rapidly Time-Varying Scenarios
We now investigate the separation and localization performance of the chosen BSS algorithm for rapidly time-varying scenarios. The experimental setup is as follows: We assume abrupt rotations of a microphone array consisting of two sensors, which represents a worst case for time-variant scenarios. Dealing with rapidly rotating microphone arrays is important in many applications of BSS, e.g. when the microphone array is held and moved by a person. Figure 2 illustrates the DOAs (Direction of Arrival) for two typical scenarios: In the first scenario, the broadside of the microphone array points between the two sources and the array is rotated by ±30◦ . In the second scenario, one source is located broadside and the other source is on the side after each turn, where the rotations are ±80◦ . Note that the array orientation is abruptly changed and that the DOA is measured relative to the broadside direction of the array. We use clean speech signals as sources, which are convolved by measured impulse responses of a low-echoic chamber (T60 ≈ 50ms) to represent the microphone array inputs. All simulations are performed with sampling rate fs = 16kHz and demixing filter length L = 1024.
DOA [°]
Scenario 1 Source 1 Source 2
60 30 −30 −60 0
10
20
30 Time [s] Scenario 2
40
50
60
Source 1 Source 2
DOA [°]
80
0
−80 0
10
20
30 Time [s]
40
50
60
Fig. 2. DOAs for two scenarios with abrupt microphone array rotations
Figures 3 and 4 depict both the BSS gain and the results of the source localization obtained with the chosen BSS algorithm for both scenarios, where the BSS gain in dB represents the suppression of the interfering source in each BSS channel. The vertical dashed lines indicate the time instants, when the array orientation is changed. The channel-averaged BSS gain for each array orientation is displayed in the framed boxes. In scenario 1, the chosen BSS algorithm exhibits the expected good source separation and localization performance: After each array rotation, the demixing filters allow for good source separation and the estimated TDOAs follow the DOAs given by scenario 1. In scenario 2, we observe that the investigated BSS algorithm is only able to track the first array rotation (see the time period 10s-20s), but already with a degradation in separation and
‘Shadow BSS’ for BSS in Rapidly Time-Varying Acoustic Scenes
563
BSS Gain [dB]
Average Overall BSS−Gain: 9.0dB 15 10 5 0
TDOA [Samples]
0
8.1dB
10
9.1dB
20
9.4dB
9.0dB 30 Time [s]
40
8.7dB
50
9.8dB
60
Source 1 Source 2
10 5 0 −5 −10 0
10
20
30 Time [s]
40
50
60
Fig. 3. BSS gain (top) and Estimated TDOAs (bottom) obtained with the chosen BSS algorithm, Scenario 1 Average Overall BSS−Gain: 1.1dB BSS Gain [dB]
20 10 0 −10
TDOA [Samples]
−20
0
6.8dB
10
4.1dB
20
−0.4dB
−1.3dB −0.1dB −2.8dB 30 40 50 60 Time [s] Source 1 Source 2
10 5 0 −5 −10 0
10
20
30 Time [s]
40
50
60
Fig. 4. BSS gain (top) and Estimated TDOAs (bottom) obtained with the chosen BSS algorithm, Scenario 2
localization performance. However, the chosen BSS algorithm fails to separate and locate the two sources after the subsequent array turns: The BSS gain is even negative and the estimated TDOAs seem to be ”frozen”. The results of the TDOA estimation suggest that the chosen BSS algorithm is not able to adapt the demixing filters after the second array rotation. Therefore, we investigate the demixing filters. The first 32 coefficients of the demixing filters w12 (n) and w21 (n) are depicted in Figure 5. The filter coefficients of w11 (n) and w22 (n) are not depicted, because they mainly consist of distinct positive peaks at 16 samples, which leads to a delayed but mainly unfiltered contribution of the microphone signals to the two BSS outputs. The vertical dashed lines indicate again array rotations. The horizontal dashed lines mark the filter coefficients which are important for suppressing sources from -80◦ , 0◦ , and 80◦ , respectively. By appropriately placing negative peaks in the demixing filters w12 (n) and w21 (n), spatial nulls are formed and source 1 and source 2 are suppressed in BSS output
564
S. Wehr et al.
2 and in BSS output 1, respectively. We observe that after the second array rotation, the spatial null in 0◦ , which is caused by the negative peak at filter coefficient 16 in w21 (n), is fixed. This spatial null cancels the source located in broadside direction and thus the source on the side is enhanced. However, the BSS adaptation does not form a significant negative peak in w12 (n) to cancel the source at ±80◦ . Instead, a minor positive peak at filter coefficient 16 is formed, which basically corresponds to a filter-and-sum beamformer at 0◦ enhancing the broadside source. Demixing Filter w (n)
Demixing Filter w (n)
12
21
0.1
0.1 0
16
−0.1 6
Filter Coefficients
Filter Coefficients
0.2 26
26 0 −0.1
16
−0.2 6
−0.3
−0.2 0
20 40 Time [s]
60
0
20 40 Time [s]
60
Fig. 5. Demixing filters obtained with the chosen BSS algorithm in scenario 2
In order to further analyze and understand the encountered problem, we take a close look at the update rule of the chosen BSS algorithm given in [4]. For each online block m, matrix W(m), which contains the demixing filters wij (n) in a so-called Sylvester structure, is updated as follows: W(m) = W(m − 1) − μW(m).
(4)
Incorporating the natural gradient, the update W becomes W = 2
∞
−1
β(i, m)W {Ryy − bdiag {Ryy }} · (bdiag {Ryy })
,
(5)
i=0
where the weighting function β(i, m) allows offline and online implementations. The operator bdiag {Ryy } returns the block-diagonal elements of Ryy , which is the cross-correlation matrix of the BSS outputs. After abrupt array rotations, the update W is then based on badly adapted demixing filters and on BSS output signals, which will not exhibit any source separation. Due to this improper data, the chosen BSS algorithm is not able to adapt the demixing filters and thus fails to track the sources after abrupt array rotations in scenario 2. This deficiency might be caused by the approximations described in [7].
3
Shadow-BSS
In Section 2, we studied a scenario, where the chosen BSS algorithm was not able to track abrupt array rotations. However, we could always observe that
‘Shadow BSS’ for BSS in Rapidly Time-Varying Acoustic Scenes x2
x1
W(Sdw)
BSS(Sdw) (Sdw)
Decision
(Sdw)
rTh
W(Sdw)
565
BSS
(Sdw)
y1
y2
norm(xcorr(.))
norm(xcorr(.)) y1
y2
Fig. 6. Block diagram of the Shadow-BSS algorithm
the chosen BSS algorithm is capable to converge to well-separating demixing filters after blind initialization. Therefore, we propose the usage of a so-called shadow-BSS system, which is periodically blindly initialized and – in the case of outperforming the signal separation of the main BSS system – the demixing filters of the shadow-BSS are transferred for use in the main BSS system. The shadow-BSS system is motivated by the successful usage of shadow systems in, e.g., adaptive echo cancellation [8]. The effectiveness of the shadow-BSS scheme is demonstrated by simulations. 3.1
Algorithm
Figure 6 shows the block diagram of the proposed shadow-BSS system, where the dependency on time has been omitted for notational convenience. BSS and BSS(Sdw) denote the main BSS system and the shadow-BSS system. Note that the demixing filter length in the shadow-BSS system L(Sdw) can be chosen independently from L. Both systems use the two microphone signals x1 and x2 and (Sdw) perform source separation, which leads to the output signals y1,2 and y1,2 . The shadow-BSS system is periodically blindly reinitialized at multiples of period T (Sdw) . We now investigate the method for replacing the demixing filters of BSS by the demixing filters of BSS(Sdw) , if BSS(Sdw) performs better source separation. Based on the two pairs of output signals, the blocks norm(xcorr(.)) compute the norms of the cross-correlations as follows: D 2 |Ryy (m, τ )| (6) norm {Ryy (m)} = τ =−D
D 2 (Sdw) (Sdw) (m) = norm Ryy Ryy (m, τ ) .
(7)
τ =−D
The parameter τ denotes the time lags of the cross-correlation. The crosscorrelation norms may now be considered as a quantity measuring the separation performance of BSS and BSS(Sdw) . In the case of good source separation, the
566
S. Wehr et al.
two output signals of a separation system are sufficiently uncorrelated and the according norm in Equations (6), (7) is small. Hence the ratio of both crosscorrelation norms r(m), (Sdw) norm Ryy (m) r(m) = (8) norm {Ryy (m)} is used as a decision variable, which indicates the separation system (BSS or BSS(Sdw) ) with the better separation performance. To avoid unnecessary transfers of the demixing filters from BSS(Sdw) to BSS, we can average r(m) with an exponentially decaying forgetting factor λ(Sdw) : (Sdw)
norm Ryy (m) . (9) r(m) = λ(Sdw) r(m − 1) + 1 − λ(Sdw) norm {Ryy (m)} (Sdw)
Comparing r(m) to the threshold rTh (Sdw)
r(m) < rTh r(m) ≥
(Sdw) rTh
allows for a decision:
⇒ Transfer demixing filters from shadow-BSS to BSS ⇒ Keep demixing filters of BSS (Sdw)
The sensitivity of the overall system may be adjusted by the threshold rTh . Note that the latter L − L(Sdw) filter coefficients of BSS are set to zero, if the demixing filters are transferred from shadow-BSS to BSS. 3.2
Simulations
We now present both the separation performance and the results of the TDOA estimation obtained with the proposed algorithm based on a shadow-BSS system for scenario 2 described in Section 2. The demixing filter lengths are L = 1024 Average Overall BSS−Gain: 7.0dB BSS Gain [dB]
20 15 10 5 0
TDOA [Samples]
−5
0
6.8dB
10
7.6dB
20
5.2dB
7.5dB 30 Time [s]
40
6.3dB
50
8.3dB
60
Source 1 Source 2
10 5 0 −5 −10 0
10
20
30 Time [s]
40
50
60
Fig. 7. BSS gain (top) and Estimated TDOAs (bottom) obtained with the shadow-BSS system
‘Shadow BSS’ for BSS in Rapidly Time-Varying Acoustic Scenes Demixing Filter w12(n)
567
Demixing Filter w21(n)
0 −0.05 16
−0.1 −0.15
6
−0.2
Filter Coefficients
Filter Coefficients
0.05 26
0
26
−0.1 16 −0.2 6
−0.3
−0.25 0
20 40 Time [s]
60
0
20 40 Time [s]
60
Fig. 8. Demixing filters of BSS influenced by shadow-BSS
and L(Sdw) = 30. Moreover, the configuration of the shadow-BSS system is (Sdw) T (Sdw) = 2s, λ(Sdw) = 0.6, and rTh = 0.8. Both the separation performance of BSS and the estimated TDOAs depicted in Figure 7 illustrate that the proposed shadow-BSS system is capable to track even the worst case of abrupt microphone array rotations. After a few seconds, the estimated TDOAs represent the true DOAs illustrated by Figure 2. Again, we investigated the demixing filters of BSS influenced by shadow-BSS, where we especially focus on the two cross-filters w12 (n) and w21 (n). As we see from Figure 8, the spatial nulls formed by the demixing filters now follow the alternating DOAs of the sources caused by the abrupt array rotations. Finally, it should be mentioned that no audible artefacts could be observed when the demixing filters are transferred from shadow-BSS to BSS.
4
Conclusions
In this paper, we first investigated the source separation and localization performance of the chosen BSS algorithm in the case of abrupt array rotations. We found that for certain relevant cases the chosen BSS algorithm was not able to maintain the usually high separation performance and thus fails to localize the sources correctly. We then proposed an extension to BSS, which incorporates a periodically blindly initialized shadow-BSS system. If the separation performance of the shadow-BSS system outperforms the main BSS system, the demixing filters were transferred from the shadow-BSS system to the main BSS system. An appropriate method for comparing the separation performance of both systems was introduced. Finally, we would like to mention that the proposed shadow-BSS system may also be applied in the case of rapidly moving sources.
References 1. Brandstein, M.S., Ward, D.B. (eds.): Microphone Arrays: Signal Processing Techniques and Application. Springer, Berlin (2001) 2. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
568
S. Wehr et al.
3. Buchner, H., Aichner, R., Kellermann, W.: Audio Signal Processing for NextGeneration Multimedia Communication Systems. Kluwer Academic Publishers, Boston (2004) 4. Buchner, H., Aichner, R., Kellermann, W.: A generalization of blind source separation algorithms for convolutive mixtures based on second order statistics. IEEE Transactions on Speech and Audio Processing 13(1), 120–134 (2005) 5. Buchner, H., Aichner, R., Kellermann, W.: Relation between blind system identification and convolutive blind source separation. In: Conf. Rec. Joint Workshop for Hands-Free Speech Communication and Microphone Arrays (HSCMA) (March 2005) 6. Buchner, H., Aichner, R., Stenglein, J., Teutsch, H., Kellermann, W.: Simultaneous localization of multiple sound sources using blind adaptive MIMO filtering. In: IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) (March 2005) 7. Aichner, R., Buchner, H., Kellermann, W.: Exploiting narrowband efficiency for broadband convolutive blind source separation. EURASIP Journal on Applied Signal Processing 2007, 1–9 (2006) 8. Ochiai, K., Araseki, T., Ogihara, T.: Echo canceler with two echo path models. IEEE Transactions On. Communications COM-25(6), 589–595 (1977)
Evaluation of Propofol Effects in Atrial Fibrillation Using Principal and Independent Component Analysis Raquel Cervig´on, Conor Heneghan, Javier Moreno, Francisco Castells, and Jos´e Millet Innovation in Bioengineering Research Group, Universiy of Castilla-La Mancha, Camino del Pozuelo sn. 16071 Cuenca, Spain School of Electrical, Electronic and Mechanical Engineering, University College Dublin, Belfield, Dublin 4, Ireland {raquel.cervigon}@uclm.es, {conor.heneghan}@ucd.ie, {javier.moreno}@secardiologia.es, {fcastells,jmillet}@eln.upv.es
Abstract. The mechanisms responsible for the initiation, maintenance and spontaneous termination of atrial fibrillation (AF) are not yet completely understood. Though much of the underlying physiology has been well determined, it has been demonstrated in numerous clinical investigations that the autonomic nervous system plays an important role in AF genesis and maintenance. In this work the effects of a widely used anaesthetic (propofol) in AF therapies has been studied. ECG recording and 12 intracardiac bipolar leads were recorded from 17 patients diagnosed with AF at both baseline and during anaesthetic infusion, in order to study its effects on AF behavior. By considering all intracardiac leads, the dominant atrial cycle length found at baseline was higher than during propofol infusion, but this difference was not statistically significant. However, the process of averaging results over all 12 leads may obscure clinically significant changes. In order to try to emphasize any differences which may exist, Principal Component Analysis (PCA) and Independent Component Analysis (ICA) were applied. This statistical analysis did show a significant difference between both groups. The shorter cycle lengths found in this study at baseline are consistent with parasympathetic and/or other physiological modulation during anaesthetic infusion and suggest that propofol may have antiarrhythmic properties. Keywords: Atrial fibrillation, anaesthetic, Independent and Principal Component Analysis (ICA and PCA).
1
Introduction
Atrial fibrillation (AF) is the most commonly encountered arrhythmia in clinical practice, with prevalence rising to near 10% in the elderly. AF is originated at the atria (the upper heart chambers), and is considered to be due to the coexistence M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 569–576, 2007. c Springer-Verlag Berlin Heidelberg 2007
570
R. Cervig´ on et al.
of multiple re-entrant atrial wavelets which are often initiated by arrhythmogenic foci located within the pulmonary veins [1,2]. Among the factors contributing to the genesis or maintenance of circulating wavelets, the Autonomic Nervous System (ANS) may play a significant pro-arrhythmic role [3]. Since AF is associated with an elevated heart rate, a common treatment strategy is heart rate control, although there is still clinical controversy over whether rhythm control (i.e., conversion to normal sinus rhythm) is more desirable. Cardioversion is the process of restoring the AF rhythm to normal sinus rhythm [4]. However, pharmaceutically-induced cardioversion is only marginally effective in treating this arrhythmia, and may have the potential for serious side effects, including life-threatening pro-arrhythmic effects. Therefore, in recent years catheter-based ablation of atrial tissue by application of energy through intracardiac catheters has become a widely used therapeutic method in patients with both persistent and paroxysmal AF [5]. The radiofrequency (RF) ablation procedure consists of generating electrical barriers in various sites within the atria by altering the tissue properties in the vicinity of the ablating catheter tip. The extent of the altered tissue depends on the power and duration of the application, as well as on the characteristics of the tissue itself. During the RF catheter ablation procedure (and other cardiac electrophysiological studies), patients are typically under the influence of anaesthetic agents. One of the most useful agents is propofol (2,6-diisopropylphenol) which is a new, rapidly acting intravenous anaesthetic. The rapid redistribution and metabolism of propofol results in a short elimination half-life of approximately one hour, and suggests that the drug could be suitable for use in short procedures. This is the reason for the recent interest in whether (and how) propofol may affect the electrophysiological properties of cardiac tissue, and hence alter the electrical activity within the atria. The purpose of this study was, therefore, to explore the short-term influence of propofol on the electrical activity within the atria in patients with AF, who were undergoing catheter ablation. A working hypothesis is that the cumulative effect of the numerous wavelets circulating within the atria affect the spatiotemporal organization of AF, and hence determine the overall fibrillatory wavefront observed within the atria. The time course of these circulating electrical wavelets is determined by the local refractoriness of the tissue (i.e., its inability to generate action potentials at an arbitrarily high rate). Propofol may act by altering atrial refractoriness, but its exact effects are unknown. A good index of refractoriness is the overall atrial cycle length (which is the inverse of the dominant frequency of the fibrillatory waveform). This has previously been shown to be a local index of atrial refractoriness during fibrillation [6,7,4]. Accordingly, in this paper we attempt to measure the dominant fibrillatory frequency (and hence atrial refractoriness) both before and during the administration of propofol. Since the localized intracardiac electrograms (IEGMs) recorded during the procedure are a mixture of both local and global cardiac activity, it can be
Evaluation of Propofol Effects in Atrial Fibrillation
571
hard to distinguish overall trends. Therefore, we have used Principal Component Analysis and and Independent Component Analysis to emphasize the most significant trends in the measured set of IEGMs.
2
Data Acquisition
IEGMs were recorded in 17 patients (13 paroxysmal AF and 4 persistent AF) before and during anesthesia with propofol (bolus of 100-180 mg/kg intravenously with incremental doses depending on the weight and time to hypnosis). These patients were undergoing radio-frequency catheter ablation. Informed consent was provided by all patients, and the study was approved by the Hospital Ethics Review Board. A bipolar catheter was positioned in the high right atrium, and a 24-pole catheter (Orbiter, Bard Electrophysiology, 2-7-2 mm electrode spacing) was positioned with electrodes at the level of the atrial septum and the left and the right atriums. Since these are bi-polar electrograms, the results are in a set of 12 signals, which we refer to as lead or dipole 1-2, 3-4, etc. While the exact positioning of the catheter will vary from subject to subject, in general the leads 1-2, 3-4, and 5-6 are in the left atrium, leads 7-8, 9-10, 11-12 are in the septum, and leads from 15-16, 17-18, . . . to 23-24 are in the right atrium. Contact catheter data and surface three-lead ECGs (I, aVF, V1) (1 kHz sampling rate, 16 bit A/D conversion, equipment supplied by Siemens-Elema AB, Solna, Sweden), were recorded simultaneously on the Duo Laboratory system (Bard) over 30 to 60 seconds before and during the anaesthetic effect. Note that bipolar electrograms, collected from a pair of closely spaced electrodes, measure the differential voltage between the two electrodes. This differential measurement is equivalent to a spatially high-pass filtered version of the underlying activation traversing the two electrodes. Thus it is sensitive to the direction of nearby depolarization and repolarization wavefronts.
3 3.1
Methods Frequency Domain Analysis
A fast Fourier transform (FFT) was calculated on the wave over a 4-s Hamming window of 4096 points just prior to energy delivery, overlapping 50% between adjacent windowed sections. The maximum peak of the resulting magnitude spectrum was identified and the positions of the harmonic peaks were determined based on the position of the maximum peak [8]. 3.2
Principal Component Analysis
PCA is a popular data processing and dimension reduction technique [9]. PCA seeks the linear combinations of the original variables such that the derived variables capture maximal variance. The objective is to find a (linear) transformation of the original variables, ordered by high proportion of the variation of
572
R. Cervig´ on et al.
the old variables, in a set of new uncorrelated variables, the principal components (PCs). PCA can be done via the singular value decomposition (SVD) of the data matrix. Biological expression data are currently rather noisy, and SVD can detect and extract small signals from noisy data. In detail, let the data X be a n × p matrix, where n and p are the number of observations or samples and the number of variables, respectively. Without loss of generality, assume the column means of X are all 0. Suppose we have the SVD of X as X = UDVT ,
(1)
where U are the PCs of unit length, and the columns of V are the corresponding loadings of the PCs. The variance of the ith PC, is d2i , with di being the ith element in the diagonal of D. Usually the first q (q << p) PCs are chosen to represent the data, thus a great dimensionality reduction can be achieved. 3.3
Independent Component Analysis
ICA is a technique recently developed for the analysis of multidimensional signals [10,11]. It has been employed in numerous biomedical applications, such as electroencephalography (EEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI) electrocardiography (ECG) and magnetocardiography (MCG), among others [11,12]. PCA is a common pre-processing technique for signal processing before using ICA [10]. The first step is to center X, subtract its mean vector to make X a zero mean variable. The second is using PCA to find a smaller set of variables with less redundancy. The application of ICA to the study assumes that the sources are non-Gaussian and mutually independent [13,14]. In the ICA model, the observed data X has been generated from source data through a linear process X=AS, where both the mixing matrix A and the source S are unknown. It is assumed that X is the observed data vector with a dimension n, number of the atrial signals recorded at different electrodes, S is the m-dimensional source vector, therefore A requires to be an n × m matrix of full rank with n ≥ m. FastICA has been the algorithm employed. It is a computationally highly efficient method for performing the estimation of ICA which uses an approximation of negentropy and a fixed-point iteration scheme to carry out the optimization of the contrast function. 3.4
Statistical Analysis
The parameters are expressed as mean ± SD. Paired and unpaired t-tests were used for comparison between the 2 groups of results. Comparison of serial measures was obtained by repeated measures ANOVA coupled with the StudentNewman-Keuls test. Results were considered to be statistically significant at p < 0.05. All statistical analyses were performed with the SPSS program.
Evaluation of Propofol Effects in Atrial Fibrillation
4
573
Result
4.1
Results from Complete Set of 12 Lead Electrograms
At first, by applying frequency domain analysis to the signals (imposing a range between 4-9Hz for the atrial frequency), the average dominant frequency from each recording was calculated, before propofol administration and during its effect. The results showed a trend in which the frequency was reduced 6.14±0.93 versus 5.99 ± 0.79 Hz, but it was not statistically significant (p = 0.14).
(a) Fp from original signal
(b) Fp extracted from PCA
(c) Fp extracted from ICA
Fig. 1. Average Main Frequency (Fp) from original leads (left), 5PCs (middle), and 5ICs (right) in non-anaesthetic (basal) and anaesthetic (propofol) states
In order to improve the differences between both groups, PCA and ICA were applied. The chosen parameter to discriminate between both groups was the main atrial fibrillatory frequency. In order to discard a noise/signal separation, the dimensionality of the data was reduced using the first eigenvectors, because most of the signal is contained in the first few PCs. There were statistically significant differences between the frequencies pre and post-propofol when the analysis was carried out on the reduced set of PCs. With 7 PCs the frequency varied from (6.26 ± 0.72 Hz at baseline to 5.98 ± 0.64 Hz,(p = 0.009) at peak propofol effect. These 7 PCs captured 95% of the signal energy. With 5 components the change was from 6.35 ± 0.72 to 6.04 ± 0.57 Hz, p = 0.012), and these 5 PCs typically captured 85% of the energy (Fig. 1). Therefore PCA is a useful tool in more clearly indicating statistically significant trends. The results of applying ICA after PCA, to the index of the last eigenvalue retained with PCA, were similar in significance, when using dominant fibrillatory frequency as a classification parameter, obtaining at basal 6.26 ± 0.67 vs. 6.03 ± 0.80 Hz during propofol infusion (p = 0.012) for 7 components and 6.41 ± 0.63 vs. 6.06 ± 0.73 Hz for 5 components (p = 0.001). 4.2
Results from the Different Atrial Regions
At each different region inside the atrium, the average frequency of activation was calculated. For simplicity, we will consider three different regions: right atrium
574
R. Cervig´ on et al.
(RA), left atrium (LA) and the area between them called the septum area (SA). In each atrial region the main peak frequency was calculated. The results from original signals showed a more distinctive difference between both states in the RA, 6.32 ± 0.96 Hz at basal state vs. 6.08 ± 0.79 Hz during propofol infusion, (p = 0.069), however, there was any substantial change of dominant frequency in the LA. After the application of PCA on the LA (dipoles 1-2, 3-4 and 5-6), statistically significant differences between both groups were found in the main frequency of the first two components (6.14 ± 0.54 basal state vs. 5.83 ± 0.79 Hz during propofol infusion, p = 0.012), and after ICA processing the significance decreased to p = 0.05 (6.08 ± 0.50 vs. 5.75 ± 0.80 Hz). On the SA (dipoles 9-10, 11-12 and 13-14), the significant differences of the main frequency from PCA components are a little higher than in the LA. Before and during the anaesthetic infusion these values were 6.04±0.64 vs. 5.71±0.65 Hz respectively (p = 0.038), and after ICA, the changes had a statistical significance of p = 0.042 (6.05 ± 0.66 vs. 5.76 ± 0.51 Hz). On the RA (dipoles 15-16, 17-18, 19-20 and 21-22), the most important differences in main frequency were obtained by extracting 2 components using PCA p = 0.005 (6.49 ± 0.95 vs. 6.03 ± 0.76 Hz) and with ICA p = 0.014 (6.48 ± 0.91 vs. 6.08 ± 0.78 Hz). If the SA and RA groups are joined (leads 9-10 to 21-22), the significance of the differences in frequency using the first 4 components extracted from PCA increases to p = 0.001 (6.44 ± 0.68 vs. 6.00 ± 0.60 Hz), and with ICA p = 0.002 (6.44 ± 0.67 vs. 6.05 ± 0.64 Hz). It is possible that this processing leads to an error, because of a different covariance matrix for the groups before and during anaesthetic infusion, so the transformation applied to both groups could be different. In order to counteract this possibility, the signals before and after the anesthetic treatment were concatenated prior to PCA and ICA processing, so that a unique transformation matrix was obtained, which is applicable to both groups. After computing the fundamental frequency of the PCs and ICs for the corresponding segments, it was concluded that the differences were also significant, as indicated in Table 1. Table 1. Statistical Significance of the Change in Main Peak Frequency(Fp) for PCs and ICs with a Transformation Matrix Based on Basal and Propofol States PCA Fp ICA Fp Significance Significance 12 leads PCA 7 comp 0.05 0.03 LA 3 leads (1,2-5,6) 2comp > 0.05 > 0.05 SA 2 leads (9,10-11,12) 2comp 0.05 0.04 RA 4 leads (15,16-21,22) 2comp 0.04 0.02 SRA 7 leads (9,10-21,22) 4comp 0.02 0.01
In addition, it has been possible to find differences between the patients that have paroxysmal AF and those that have persistent AF. During persistent AF
Evaluation of Propofol Effects in Atrial Fibrillation
575
the differences between both states are low, and the highest changes are seen in the left atrial region (Table 2). Table 2. Statistical Significance of Main Peak Frequency(Fp) with different combinations of PCA and ICA components in Paroxysmal and Persistent AF Paroxysmal AF Fp Signification Original leads LA (1,2-5,6) 0.89 Original leads SA (9,10-11,12) 0.25 Original leads RA (15,16-21,22) 0.23 LA 3 leads (1,2-5,6) PCA 2comp 0.07 LA 3 leads (1,2-5,6) ICA 2comp 0.94 SA 2 leads (9,10-11,12) PCA 2comp 0.04 SA 2 leads (9,10-11,12) ICA 2comp 0.08 RA 4 leads (15,16-21,22) PCA 2comp 0.01 RA 4 leads (15,16-21,22) ICA 2comp 0.03
5
Persistent AF Fp Signification 0.89 0.36 0.38 0.02 0.78 0.36 0.28 0.29 0.18
Discussion and Conclusion
Atrial refractoriness may be affected by the administration of anaesthetic agents. The influence of propofol on the electrophysiological properties of the myocardium is sparsely reported in the literature. Results from in vitro experiments on isolated rabbit sinoatrial node preparation showed that propofol had only small effects on atrial conduction at 10μg · ml−1 , but that it reduced conduction drastically at 33μg · ml−1 and caused complete block at 100μg · ml−1 [15]. The conclusion from this study is that propofol does affect the electrical properties of the heart in patients with AF. Its effects in patients with AF are such that the atrial rate is consistently decreased as a result of the anaesthesia. The observation of these differences was optimized after the application of PCA and ICA algorithms, with rates of the first components in the range of the atrial activity in AF, discarding noise and improving the statistical significance between both groups. These results correlate with those extracted from a study that analyzes the variability of AF during circadian cycles, where the AF cycle length during chronic AF exhibits diurnal variability, with longer cycle lengths occurring at night [16]. Although the exact mechanisms of the alteration of the atrial rate remain to be fully understood, the results in the present study have important implications. The atrial cycle length is markedly modulated during anaesthesia compared to the resting conscious state immediately before. This modulation may affect the activation and the repolarization sequence in the heart. More information about the effect of individual intra-operative factors would help to evaluate these phenomena, but it is strongly suggested that the differences
576
R. Cervig´ on et al.
between electrophysiological properties before and during anaesthetic infusion are due to true changes in the physiological conditions. Acknowledgments. This work was supported by Castilla-La Mancha research scheme (Ref: PAC-05-008-1) and by the Minister of Education and Science of Spain (Ref: TEC2005-08401-C02-01).
References 1. Moe, G.K.: On multiple wavelet hypothesis of atrial fibrillation. Archives Internationales de Pharmacodynamie et de Therapie 140, 183–188 (1962) 2. Konings, K., Allessie, M., Wijels, M.: Atrial arrhythmias: State of the art. In: DiMarco, J.P., Prytowsky, E.N. (eds.) chapter Electrophysiological mechanisms of atrial fibrillation, pp. 155–161. Futura Publishing Company, Armonk (1995) 3. Akselrod, S., Gordon, D., Ubel, F.A., Shannon, D.C., Berger, A.C., Cohen, R.J.: Power spectrum analysis of heart rate fluctuation: a quantitative probe of beat-tobeat cardiovascular control. Science 213, 220–222 (1981) 4. Coumel, P.: Paroxysmal atrial fibrillation: a disorder of autonomic tone? Eur Heart J 15(Suppl A), 9–16 (1994) 5. Pappone, C., Rosanio, S., Oreto, G.: A new anatomic approach for curing atrial fibrillation. Circulation 102, 2619–2628 (2000) 6. Kim, K.-B., Rodefeld, M.D., Schuessler, R.B., Cox, J.L., Boineau, J.P.: Relationship between local atrial fibrillation interval and refractory period in the isolated canine atrium. Circulation 94, 2961–2967 (1996) 7. Capucci, A., Biffi, M., Boriani, G., Ravelli, F., Nollo, G., Sabbatani, P., Orsi, C., Magnani, B.: Dinamic electrophysiological behaviour of humann atria during paroxysmal atrial fibrillation. Circulation 92, 1193–1202 (1995) 8. Welch, P.D.: Use of fast fourier transform for estimation of power spectra: A method based on time averaging over short modified periodograms. IEEE Trans. Audio Electroacoust AE-15, 70–73 (1967) 9. Joliffe, I.T: Principal component analysis. Springer, Heidelberg (2002) 10. Hyvrinen, A., Oja, E.: Independent component analysis: A tutorial. helsinki university of technology. Laboratory of Computer and Information Science (1999) 11. Cichocki, A., Amari, S.I.: Adaptive blind signal and image processing. Sons Inc. (2002) 12. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non gaussian signals. In: IEEE Proc F, vol. 140, pp. 362–370 (1993) 13. Papoulis, A.: Probability, random variables and stochastic processes. McGraw-Hill, New York (1991) 14. Comon, P.: Independent component analysis a new concept? Signal Process 36, 287–314 (1994) 15. Briggs, I., Heapy, C.G., Pickering, L.: Electrophysiological effects of propofol on isolated sinoatrial node preparations and isolated atrial conduction in vitro. Br J Pharmacol 97, 504 (1989) 16. Meurling, C., Sornmo, L., Stridh, M., Olsson, B.: Non invesive assessment of atrial fibrillation (af) cycle length in man: potential application for studyimg af. Ist Super Sanita 37(3), 341–349 (2001)
Space-Time ICA and EM Brain Signals Mike Davies , Christopher James, and Suogang Wang IDCOM & Joint Research Institute for Signal and Image Processing, University of Edinburgh, Scotland, UK [email protected] Signal Processing and Control Group, ISVR, University of Southampton, UK {c.james,sgw}@soton.ac.uk
Abstract. Recently Single Channel ICA has been proposed where it can be shown that the algorithms learn temporal filters for separating the different components. Here we consider the natural extension to learning a set of space-time separating filters. We argue that these are capable of separation above and beyond that possible using only spatial or temporal methods alone. We then consider the potential of these ideas when applied to Ictal Electroencephalographic (EEG) data and Brain Computer Interaction (BCI).
1
Introduction
Independent Component Analysis (ICA) was originally proposed for the blind separation of vector-valued observations into independent sources. However it has also been used to learn ‘features’ or codebooks for single channel data that provide a more efficient representation of a signal than a fixed (non-adapted) representation [1,2]. Empirical evidence suggested that it was possible to also use this for signal separation (e.g. [3,4,5]) and it was recently shown [6] that under certain restricted conditions this is indeed the case. Here we examine the natural extension of the model in [6] to consider a combination of space time vectors: Space-Time ICA (ST-ICA). Previously this model has be studied empirically for the convolutional source separation problem [7,8]. Here we consider the circumstances under which signals are separable within this model. As with single channel ICA, separation of more sources than sensors is possible. It therefore provides an appealing alternative to the usual computationally demanding solutions to the more sources than sensors problem. Before we embark we note that these ideas are distinct from the Spatiotemporal ICA proposed in [9] and it is not equivalent to using temporal information to learn a spatial unmixing matrix, e.g. as in [10]. In particular ST-ICA takes advantage of full spatial-temporal filtering.
MD acknowledges support of his position from the Scottish Funding Council and their support of the Joint Research Institute with the Heriot-Watt University as a component part of the Edinburgh Research Partnership.
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 577–584, 2007. c Springer-Verlag Berlin Heidelberg 2007
578
2
M. Davies, C. James, and S. Wang
Mathematical Framework
Traditional ICA assumes that we observe a sequence of vectors x(t) ∈ RQ which are a linear mixture of independent components: x(t) = As(t). Using nonGaussianity, nonstationarity or temporal structure we can then generally estimate A blindly up to the usual ambiguities and hence estimate the individual sources: ˆ −1 x(t). ˆs(t) = A An alternative way to view the problem, first proposed by Cardoso [11], is to treat it as an additive model: x(t) =
C
xp (t)
(1)
p
which can be related back to the original ICA model with xp (t) = aTp sp (t) where A = [a1 , . . . , aQ ]T . As this alternative approach only defines the independent sources within the observation domain it generalises ICA and allows xp (t) to be intrinsically multi-dimensional. Let Ep define an np -dimensional subspace containing the ith component: xi (t) ∈ Ei . Then, if the sources are non-Gaussian (or non-stationary etc.) and as long as all the subspaces, Ei , are linearly independent the sources can still be separated. Furthermore separation can be performed using a standard ICA algorithm followed by a component grouping step. See Cardoso [11] for details. 2.1
Single Channel and Space-Time ICA
In [6] we showed that when the input data is formed from a delay vector of samples, x(t) = [x(t), x(t − 1), . . . , x(t − N + 1)]T taken from a single channel, source separation is still possible and the resulting single channel ICA can be seen as a special instance of MICA. This means that it is possible to separate multiple sources from a single channel. However this model carries a rather restrictive separability requirement. For the MICA subspaces Ep to be independent, the sources (assuming stationarity) must have disjoint spectral support. This assumption is generally over-optimistic, although under certain circumstances it may hold approximately. We can, however relax this requirement in ST-ICA as we descibe now. Suppose that we observe a vector valued sequence x(t) ∈ RQ that we believe is composed of multiple independent stationary sources, as in the MICA model (1). In the same manner as single channel ICA, we can augment the dimension of the observation space by including delayed copies of observations. Let us define ˜ (t) as: the Q × N -dimensional space-time vector x ˜ (t) = [x(t), x(t − 1), . . . , x(t − N + 1)]T x
(2)
where N is the number of ‘taps’ in our delay vector. We can now treat this as a Q × N dimensional MICA problem as opposed to a Q dimensional one. As with MICA and single channel ICA source separation can be performed by using
Space-Time ICA and EM Brain Signals
579
a standard ICA algorithm followed by component grouping, [11]. We call this ST-ICA. Note that spatial ICA, spatial MICA and convolutional ICA are all restrictions of the ST-ICA model. The link between ST-ICA and convolutional ICA modelling is further explored in [7,8].
3
Separability of Sources
Like single channel ICA, ST-ICA can provide an advantage over spatial ICA (and MICA) when the independent sources have a finite spectral support. Many signals exhibit this property, at least approximately, as is evidenced by the popular use of Singluar Systems Analysis [12]. In this context we can define ST-ICA separability requirements in terms of MICA-type requirements at each frequency. If we assume that the subspaces, Ep , have already been correctly identified it only remains to determine conditions for which they are linearly independent. This means that the space-time correlation matrix for each source, Rxp = E{xp xTp }, must be rank deficient and the sources characterisable as multichannel singular systems. If we further assume that the sources are stationary then we know that the correlation matrices, Rxp , are block Toeplitz which, in the large window limit (N → ∞) are block diagonalizable via the Fourier transform giving the matrix valued power spectra, Sxp (ω). Transforming the MICA model into the Fourier domain gives: X(ω) = Xp (ω) (3) p
and due to the block diagonal structure of the correlation matrices we can now consider each frequency, ω, separately. Let us assume that each Xp (ω) is restricted to a subspace Ep (ω) ⊂ CQ . Without any further restriction on the model we therefore have the following separability requirement: Separability: A Q-dimensional random process, x(t) = p xp (t), composed of independent stationary random processes, xp (t), is linearly separable if and only if the subspaces Ep (ω) such that Xp (ω) ⊂ Ep (ω) are linearly independent. This allows there to be more sources than sensors with the restriction that there should be no more than Q sources present at any given frequency.
4
Applications to EM Brain Signals
The technique will be trialled on two different examples of EEG data. In the first set, clinical data in the form of epileptic (ictal) EEG will be analysed with the goal of extracting epileptic seizure components from cortical recording channels placed either over the epileptic focus (focal) or further away (extra focal).
580
M. Davies, C. James, and S. Wang
In BCI brain signals are interpreted to provide a means of communication. One of the more popular applications of BCI is called the P300 word speller introduced by [13]. The P300 evoked potential is a late positive wave that occurs over the parietal cortex at about 300 ms after the onset of a meaningful stimulus. The P300 word speller presents a matrix of letters, numbers, and other symbols (generally 6x6), whereby over short intervals one of the rows or columns of the matrix is randomly flashed (the stimulus). The user selects a character by focusing attention on individual characters in the matrix. The premise is that the P300 recording in the EEG is prominent only in those responses elicited by focusing attention on the desired character, these are, however, buried in the ongoing brain activity, and artifacts (such as movement artifacts, eyeblinks, etc.). Stimulus locked coherent averaging of the P300 responses is the usual method of enhancing the SNR, however this requires trails to be repeated several times. Not only is this costly in terms of time but it is also widely believed that due to habituation this may affect the signal as well as the noise components. 4.1
Ictal EEG
The data were recorded during pre-surgical evaluation at the Epilepsy Center of the University Hospital of Freiburg, Germany. Intracranial grid-, strip-, and depth-electrodes were used. The EEG data were acquired using a Neurofile NT digital video EEG system with 128 channels, 256 Hz sampling rate, and a 16 bit analogue-to-digital converter. 23 EEG recordings with simple partial, complex partial and generalized tonic-clonic seizures from patients with focal epilepsy originated in the temporal region were recorded and made available for dissemination [14].1 We depict the results when applied to a seizure recording of one specific patient with temporal lobe epilepsy with 6 cortical recording channels. Figure 1a depicts the data, channels 1,2 & 3 are focal to the seizure and channels 4,5 & 6 are extra-focal. The vertical lines in each figure represent the start and stop of the seizure as indicated by an epileptologist. Channel 3 is the channel with the strongest visible seizure involvement, the extra-focal electrodes show no visible seizure recording. Figure 1b depicts the outputs of ensemble ICA (Fast ICA) on all 6 recording channels. It can be seen that the unmixing process is not sucessful in completely isolating seizure activity, most probably because the recordings are already fairly independent of each other this is apparent in the mixing matrix as shown in Figure 1c where the focal (seizure) components map almost 1:1 onto their corresponding recording channels. Next, the pair of electrodes 3 & 6 are analysed with ST-ICA. (The space-time vector had dimension 190 = 2×95, and a data reduction was performed to reduce the data-matrix to rank 30 through SVD). Figure 1d (upper) depicts the mixing filters learned by the ST-ICA process; two per independent component (IC). These can be manually clustered into similar groups of shifted filters. Figure 1d 1
We are grateful to the Epilepsy Center of the University Hospital of Freiburg, Germany for their permission to use the invasive EEG recordings in this study.
Space-Time ICA and EM Brain Signals (a)
1
2
(b)
(c)
IC1
1 2 3 4 5 6
IC2
600 500 400 300 200 100
1 3
581
2
3 4
5 6
IC3
(d)
4
IC4
5
IC5
6
IC6
0.8 @ lag: 12
0.9 @ lag: 12
0.7 @ lag: 12
0.8 @ lag: 14
0.9 @ lag: 12
0.7 @ lag: 11
0.8 @ lag:-9
1.0 @ lag:-11
0.6 @ lag:-12
0.8 @ lag:-11
0.9 @ lag: 10
0.9 @ lag: 11
0.8 @ lag:33
1.0 @ lag:3
0.7 @ lag:-8
0.9 @ lag:-10
1.0 @ lag: 11
1.0 @ lag:5
1.0 @ lag:1
0.9 @ lag:5
0.9 @ lag:8
0.7 @ lag: 18
1.0 @ lag: 26
1.0 @ lag:1
1.0 @ lag:4
-100
0
0.9 @ lag:8
100
-100
0
1.0 @ lag:-15
100
-100
0
100
1.0 @ lag:0
-100
0
1.0 @ lag:-1
100
-100
0
1.0 @ lag:0
100
-100
0
100
Fig. 1. (a) 6 channels of ictal EEG recorded from the cortex; (b) following ensemble ICA and (c) mixing matrix. (d) The mixing filters and their cross-correlations following ST-ICA on channels 3 and 6.
(lower) shows the cross-correlation between the two mixing filters for each IC for 100 lags - in each case greatest (absolute) value of cross-correlation is indicated along with the lag at which it occurs. It can be seen that a number of filters peak at a lag of around 10-12 samples (39.1-46.9 ms) with the filters of channel 3 leading those of channel 6, whilst others are maximal at around 0-5 samples (0-19.5 ms) with channel 6 leading channel 3. Figure 2a shows resulting waveforms following manual clustering of the filters into 3 groups, S1-S4. S1 & S2 depict seizure components, and S3 & S4 show no evidence of seizure activity. Figures 2b, 2c and 2d depict an examplar waveform extracted for each of the main clusters Seizure 1, Seizure 2 and non-seizure. In each case the relative amplitude scales have been fixed and the cross-correlation between both depicted. It can be seen that for the seizure component, the signal over the focal area is strongest and leads a weaker signal in the extra-focal electrode which lags by about 50 ms. The non-seizure components are equally present in both channels with no discernable lags.
582
(a)
M. Davies, C. James, and S. Wang
S1
(b) S1 1
3
0.8 0.6 0.4 0.2 0 -0.2 -0.4
S2
6
-0.6 -0.8 -1 -50 -40 -30 -20 -10
0
10 20 30 40 50
0
10 20 30 40 50
0
10 20 30 40 50
(c) S2 1 0.8
3
0.6 0.4 0.2 0 -0.2
S3
-0.4 -0.6
6
-0.8 -1 -50 -40 -30 -20 -10
(d) S3/S4 1
3
S4
0.8 0.6 0.4 0.2 0 -0.2 -0.4
6
-0.6 -0.8 -1 -50 -40 -30 -20 -10
Fig. 2. (a) Reconstructed sources following ST-ICA; (b)-(d) depict seizure and noise components projected to both recording channels 3 and 6 along with their crosscorrelations
4.2
BCI P300 Speller
For this demonstration we used data from data set IIb (P300 speller paradigm) obtained from the BCI competition 2003 data bank. The data was collected from one subject with 64 scalp electrodes (10/20 System) and sampled at 240 Hz. We demonstrate the proposed methods on the data by randomly selecting only five 1.5 s-epochs with possible P300 patterns and concatenate them to form a 7.5s trial. We apply ST-ICA to channels C3 and C4 (channels C3 & C4 cover the P300 focus). Finally we test using a similar setup whilst replacing alternating epochs with P300, non-P300, P300 etc. Figure 3 depicts the two channels C3 and C4 with the 5 epochs highlighted, within each epoch the first vertical line represents the presentation of the visual stimulus and the second line - 300 ms afterwards - represents the location where the P300 response (if present) should be maximal. The second part of Figure 3 depicts the two resulting P300 components after applying ST-ICA - 16 independent components were manually identified as contributing towards P300 responses and they were projected back to the measurement space. Although still relatively noisy, clear peaks can be seen around the 300 ms mark following stimulus presentation. In order to test
Space-Time ICA and EM Brain Signals
583
Raw recordings 2000
C3
Amplitude (A/D Units)
1500 1000 500 0 -500 -1000 -1500 -2000 2000
C4
Amplitude (A/D Units)
1500 1000 500 0 -500 -1000 -1500 -2000
P300 components after ST-ICA 600
C3
Amplitude (A/D Units)
400 200 0 -200 -400 -600 600
C4
Amplitude (A/D Units)
400 200 0 -200 -400 -600
Alternate P300s/ non-P300s after ST-ICA 600
C3
Amplitude (A/D Units)
400 200 0 -200 -400 -600 600
Amplitude (A/D Units)
400
C4
200 0 -200 -400
No P300
-600 0
1000
2000
No P300 3000
4000
5000
6000
7000
ms
Fig. 3. 2 channels of EEG (C3 and C4) made up of concatenated P300 epochs (5 epochs). The extracted P300s are depicted following ST-ICA. Alternate P300/ nonP300 epochs are analyzed in the same way.
the reliablity of the process in the absence of P300 responses the final section of Figure 3 depicts the same analysis as before, however the dataset is chosen such that a P300 response appears in an alternating pattern. i.e. P300 - non-P300 P300 - non-P300 - P300. It can be seen that whilst P300 responses are present where expected, the two epochs where there was no visual stimulus present, show no P300 response at all.
5
Discussion and Conclusion
ST-ICA provides a method of exploiting both spatial and temporal means of discrimination between independent sources, thereby allowing more sources to be extracted from fewer observation channels. We have shown that applied to quite different types of brain signals the ST-ICA technique yields very insightful results. This technique will be extremely useful in areas of brain signal analysis where multiple brain sources underly few channel recordings - either by design
584
M. Davies, C. James, and S. Wang
or through necessity. For the ictal EEG we have shown how an extra-focal electrode can still detect the presence of an epileptic seizure elsewhere in the brain, indicating that few channel recordings could be successfully used for systems such as seizure onset predictors. Similarly, in the BCI scenario we see that we can successfully enhance the presence of P300 evoked potentials using as little as 5 epochs - this is paramount for the design of BCI systems whose overall aims are always to be fast and accurate.
References 1. Abdallah, S., Plumbley, M.D.: If edges are the independent components of natural images, what are the independent components of natural sounds? In: Proc. Int Conf. ICA 2001, pp. 534–539 (2001) 2. Bell, A.J., Sejnowski, T.J.: An information maximization approach to blind separation and blind deconvolution. Neural Computation 7(6), 1129–1159 (1995) 3. Casey, M., Westner, A.: Separation of Mixed Audio Sources by Independent Subspace Analysis. In: Proc. Int. Comp. Music Conf. Berlin (2000) 4. James, C.J., Lowe, D.: Extracting information from multichannel versus single channel EEG data in epilepsy analysis. In: Hyder, A.K., Shahbazian, E., Waltz, E. (eds.) Multisensor Fusion, pp. 889–895. Kluwer Academic Publishers, Dordrecht (2002) 5. James, C.J., Gibson, O., Davies, M.E.: On the analysis of single versus multiple channels of electromagnetic brain signals. Artificial Intelligence in Medicine 37(2), 131–143 (2006) 6. Davies, M.E., James, C.J.: Source Separation using Single Channel ICA., Signal Processing, special issue on Advances on Independent Component Analysis (in press, 2007) 7. Abdallah, S., Plumbley, M.D.: Application of geometric dependency analysis to the separation of convolved mixtures. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 540–547. Springer, Heidelberg (2004) 8. Davies, M.E., Jafari, M., Abdallah, S., Vincent, E., Plumbley, M.D.: Blind Source Separation using Space-Time Independent Component Analysis. In: Makino, S., Lee, T-W., Sawada, H. (eds.) Blind Speech Separation, Springer, Heidelberg (to appear, 2007) 9. Stone, J.V., Porrill, J., Buchel, C., Friston, K.: Spatial, Temporal, and Spatiotemporal Independent Component Analysis of fMRI Data. In: 18th Leeds Statistical Research Workshop on Spatial-temporal modelling and its applications (July 1999) 10. Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A blind source separation technique using second-order statistics. IEEE Trans. SP 45(2), 434–444 (1997) 11. Cardoso, J.-F.: Multidimensional independent component analysis. In: Proc. ICASSP’98, Seattle, WA, pp. 1941–1944 (1998) 12. Golyandina, N., Nekrutkin, V., Zhigljavsky, A.: Analysis of Time Series Structure SSA and Related Techniques. Chapman and Hall, Sydney (2001) 13. Farwell, L.A., Donchin, E.: Talking off the top of your head: Toward a mental prosthesis utilizing event-related brain potentials. Electroencephalogr. Clin. Neurophysiol 70, 510–523 (1988) 14. epilepsy.uni-freiburg.de/freiburg-seizure-prediction-project/ eeg-database
Extraction of Gastric Electrical Response Activity from Magnetogastrographic Recordings by DCA

C.A. Estombelo-Montesco¹, D.B. De Araujo¹, A.C. Roque¹, E.R. Moraes¹, A.K. Barros², R.T. Wakai³, and O. Baffa¹
¹ Department of Physics and Mathematics, FFCLRP, University of Sao Paulo, Ribeirao Preto, SP, Brazil, 14040-901
[email protected], [email protected], [email protected], [email protected], [email protected]
² Department of Electrical Engineering, Federal University of Maranhao, Sao Luis, Maranhao, Brazil
[email protected]
³ Department of Medical Physics, Medical School, University of Wisconsin, Madison, WI, USA
[email protected]
Abstract. The detection of the basic electric rhythm (BER), composed of a 3 cycles/minute oscillation, can be performed using SQUID sensors. However, the electrical response activity (ERA), which is generated when the stomach performs mechanical activity, has been detected mainly by invasive electrical measurements, and only recently has a report been published on its detection by magnetic measurements. This study was performed with the aim of detecting and extracting the ERA and ECA noninvasively before and after a meal. After acquiring the MGG recordings, the signals were processed to extract both source components and to remove cardiac and other interference using an algorithm based on Dependent Component Analysis (DCA); autoregressive and wavelet analyses were then performed. This allows us, first, to compare the relative amplitudes of the components in the time or frequency domain and obtain evidence of the ERA signal, and second, to determine the spatial contribution of each channel to the extracted source signal. Finally, the results show an increase in signal power at higher frequencies (around 0.6-1.3 Hz) for the ERA source component, usually associated with the basic electric rhythm (ECA source component). We show that the method is effective in removing interference from MGG recordings and is computationally efficient.
1 Introduction
The first non-invasive measurement of the stomach's electrical activity was made by Alvarez in 1922. Since then this field has grown considerably due to the extensive exploration of information from electrogastrography (EGG) and more recently from magnetogastrography (MGG) [1]. While the literature shows many contributions leading to a better understanding of gastric electrical activity (GEA), the EGG is difficult to measure, mostly because it is superimposed with other electrical signals that
are difficult to discriminate [2]. The MGG records the magnetic field produced by the GEA and has been measured using SQUIDs (Superconducting Quantum Interference Devices). Magnetic signals are less affected by tissue conductivity than electric signals and show a stronger dependence on source-to-sensor distance [3]. MGG can thus provide higher spatial resolution than EGG [3]. Some evidence suggests that most gastrointestinal diseases are related to impairment of gastric motility (mechanical activity) [4]. Previous invasive studies have shown that the electrical extracellular signal detected with serosal electrodes has two distinct components. One, often referred to as 'electrical control activity' (ECA) or 'slow wave', is an omnipresent periodic activity not necessarily related to contractile motion. The second component, called electrical response activity (ERA) or 'spike activity', is time-locked to the ECA but only occurs in conjunction with phasic contractile activity [5;6]. A better characterization of the ERA is considered a subject of major importance that has not been investigated satisfactorily in noninvasive human studies. Unfortunately, MGG signals are highly contaminated. Recently, major advances in signal processing have been achieved with blind source separation (BSS). In most cases, extraction of all the source signals is unnecessary; instead, a priori information can be applied to extract only the signal of interest. Here we propose a strategy based on the algorithm of Barros and Cichocki [7;8] to separate the ECA and ERA signals from the other interference, even in cases of low signal-to-noise ratio, which we call dependent component analysis (DCA). For a quasi-periodic signal, DCA identifies the signal component based on a time delay determined from the temporal characteristics of the ECA in the MGG measurement.
2 Material and Methods
The recordings were made with a 74-channel first-order gradiometer system (Magnes, Biomagnetic Technologies, Inc.) housed within a magnetically shielded room. The system consists of two sensor units, A and B, each containing 37 channels uniformly distributed over circular areas of diameter 13.7 and 14.4 cm, respectively. Seven asymptomatic subjects volunteered for the study. The subjects lay on a special bed with an opening such that the stomach could lie directly on the B sensor; the A sensor was positioned over the back of the subject. With this experimental arrangement it was possible to acquire signals from the stomach at the closest possible distance. Three epochs of duration 10 minutes were acquired. The first was acquired before the ingestion of the test meal (pre-prandial). After that, a standard test meal comprising a cheese sandwich with 250 kcal (110 kcal bread + 140 kcal cheese) was given to the subjects immediately before the second measurement (first post-prandial). Then, the last 10-minute acquisition was made (second post-prandial). The dc-coupled MGG signals were sampled at 73.1 Hz and stored for subsequent analysis.
2.1 Blind Source Separation Using Temporal Structure
The proposed strategy to extract the ECA and ERA components is based on the Dependent Component Analysis (DCA) method. DCA is a method based on multivariate analysis
that uses a priori a delay based on the temporal characteristics of the ECA and ERA signals to be extracted (see Figure 3). The method for artifact removal is fully described in [8]; a short description follows. Consider $n$ sources $\mathbf{s} = [s_1, s_2, \ldots, s_n]^T$ that are mixed into the vector $\mathbf{x}$ through the linear combination $\mathbf{x} = \mathbf{A}\mathbf{s}$, where $\mathbf{A}$ is an $n \times n$ invertible matrix. Our goal here is to find the source of interest, $s_i$. In general, the number of independent components can be as large as the dimension of $\mathbf{x}$. The method proposed by Barros and Cichocki [7] aims at extracting only the desired component with a given characteristic, instead of all sources, and has been successfully applied to other problems [8]. The method can be described briefly as follows. As we wish to extract only a single source, we can write the signal as $y(k) = \mathbf{w}^T \mathbf{x}(k)$, where $\mathbf{w}$ is a weight vector for that single source. Defining the error as $\varepsilon_a(k) = y(k) - y(k-p)$ and minimizing the mean squared error $\xi(\mathbf{w}) = E[\varepsilon_a^2]$, we find $\mathbf{w} = E[\mathbf{x}\, y_p]$, where $y_p = y(k-p)$. We use sequential signal extraction along with a priori information about the autocorrelation function. One practical problem is how to estimate the optimal time delay. A simple solution is to calculate the autocorrelation function of the sensor signals and find the feature, in our case a peak with an appropriate time lag, corresponding to the signal of interest. To accomplish this, we model the system using autoregression, as described next.
2.2 The Adaptive Autoregressive Model for Spectrum Estimation
To characterize the power spectrum of the signal we use the well-known autoregressive (AR) model [1;8]. This power spectrum is used to estimate the optimal time delay for each desired component (before DCA processing) and, at the end (after DCA processing), to compare the energies of the estimated source signals in pre-prandial versus post-prandial measurements.
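To make the extraction step concrete, the following minimal NumPy sketch (an illustration under stated assumptions, not the authors' implementation; the function names are ours) estimates the delay p from an autocorrelation peak and then iterates the rule w = E[x y_p], assuming zero-mean observations:

```python
import numpy as np

def optimal_delay(x, lag_min, lag_max):
    """Estimate the a priori delay p (in samples) as the strongest peak of the
    autocorrelation of the channel-averaged sensor signal in a given lag range."""
    s = x.mean(axis=0)
    s = s - s.mean()
    ac = np.correlate(s, s, mode='full')[s.size - 1:]
    return lag_min + int(np.argmax(ac[lag_min:lag_max]))

def dca_extract(x, p, n_iter=100, seed=0):
    """Extract one quasi-periodic source from zero-mean observations
    x (channels x samples) with the delay-based rule w = E[x y(k-p)]."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(x.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ x                    # current source estimate y(k)
        y_p = np.roll(y, p)          # delayed copy y(k-p), circular at the edges
        w = (x * y_p).mean(axis=1)   # w = E[x y_p]
        w /= np.linalg.norm(w)       # fix the norm; scale is handled in Sect. 2.3
    return w @ x, w
```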
Fig. 1. ERA power spectrum of a pre-prandial single recording from raw data. The ERA signal has weak energy at higher frequencies compared to the heart component.
Fig. 2. Power spectrum of a first post-prandial single recording from raw data. Two peaks correspond to the ERA and heart components.
In Figure 1 and Figure 2 we show representative power spectra of segments of the raw signal from a pre-prandial and a post-prandial recording, respectively. The post-prandial power spectrum shows a peak near 0.8 Hz (ERA), indicating a different state
compared to the pre-prandial measurement in Figure 1. For the ECA signal the process is similar, where we found that the power lies at 0.05 Hz (or 3 cpm). This a priori information is used to estimate the time delay for DCA filtering.
2.3 Avoiding the Scaling Factor Problem by a Projection Approach
The signal extracted by DCA might be scaled at the output. This can lead to an erroneous comparison between pre-prandial and post-prandial energies if we do not determine the proper scale factor. To overcome this problem we need a method for estimating the scale factor of each signal component extracted by DCA. Consider an output $y = \mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{A}\mathbf{s}$. It has an indeterminacy that can be expressed as $y = \alpha s$, where $\alpha$ is a scaling factor that needs to be estimated for the signal extracted from each epoch (see Figure 3). After extracting the desired source component $y$, one can project the source signal of interest back onto the sensor array signals, calculating the scale factor as follows. First define the error $\varepsilon_b = x_i - \alpha_i y$, where $x_i$ is the desired sensor signal and $y$ is the output of the DCA filter. Next we estimate the scale factor $\alpha_i$ that minimizes the mean squared error $\xi(\alpha_i) = E[\varepsilon_b^2] = E[x_i^2] - 2\alpha_i E[x_i y] + \alpha_i^2 E[y^2]$. The minimum yields the scale factor $\alpha_i = E[y^2]^{-1} E[x_i y]$. The scaling factor provides two valuable pieces of information. First, taking the $\alpha_i$ of largest absolute value gives the scaling factor for the output of DCA (the estimated source signal); we can then compare relative amplitudes in the time or frequency domain and obtain evidence of the ERA signal. Second, we obtain the contribution of each channel to the extracted source signal, which allows spatial localization of the estimated source signal over the 37 channels.
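A short sketch of this projection step (illustrative only; the helper names are ours) computes $\alpha_i = E[x_i y]/E[y^2]$ for every channel and rescales the extracted component with the coefficient of largest magnitude:

```python
import numpy as np

def projection_scale_factors(x, y):
    """alpha_i = E[x_i y] / E[y^2], the minimizer of E[(x_i - alpha_i y)^2],
    for sensor signals x (channels x samples) and the DCA output y (samples,)."""
    return (x * y).mean(axis=1) / np.mean(y ** 2)

def rescale_output(x, y):
    """Rescale y with the largest-magnitude alpha; the full vector of alphas
    also gives the per-channel contributions used for isocontribution maps."""
    alpha = projection_scale_factors(x, y)
    return alpha[np.argmax(np.abs(alpha))] * y, alpha
```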
Fig. 3. Schematics for scaling factor determination. a) Each source (stomach, heart, tissue, artifacts, etc.) produces a magnetic signal, which is not observed directly. b) Each source is mixed with the others; here we consider a linear mixture, represented by the mixture-process block. c) At acquisition we measure the output of the mixture process, obtaining one time series per sensor. d) Separation/extraction of the source of interest is then performed and evaluated through the DCA process. e) At the output there is a single time series of interest, for which a scale factor must be calculated to estimate the relative amplitude. These steps were applied to every epoch: pre- and post-prandial measurements.
3 Results: ECA and ERA Components by DCA
The upper diagrams of Figure 4 and Figure 5 show the ECA (dotted line) and ERA (solid line) components extracted at each epoch by DCA: Figure 4 for the pre-prandial epoch and Figure 5 for the first post-prandial epoch. Figures 4 and 5 show that the ECA component is always present in the stomach. Furthermore, the post-prandial ERA components have higher amplitude than the pre-prandial component, especially during the plateau phase of the ECA near 50, 70, 90 and 110 seconds in Figure 5. The right side shows the ECA scale and the left side the ERA scale. The ECA signal component consists of an upstroke followed by a plateau and then a slow depolarization phase, with approximate frequency 3 cpm. Note the difference in amplitude between the ERA signals in the pre-prandial epoch and those of the post-prandial epochs. The upper diagram of Figure 4 shows the ERA and ECA components during a pre-prandial epoch. The lower diagram of Figure 4 shows a white solid line, obtained by summing the ERA and ECA time series, superimposed on the time-frequency representation (TFR) of the ERA component, obtained by the Morlet wavelet transformation [9;10]. The y-scale of the TFR is from 0.5 Hz to 1.3 Hz. A few localized high-energy regions can be observed for the ERA component, but they are not necessarily time-locked with the ECA component.
Fig. 4. Time interval of the pre-prandial epoch with the ECA and ERA components in the upper panel and the TFR of the ERA component in the lower panel. Superimposed on the TFR as a solid line is the sum of the ECA and ERA components shown in the upper panel. The scale on the left side is for the TFR of the ERA and the scale on the right side is for the summed time series. The x-scale is in seconds.
The upper panel of Figure 5 shows the ERA and ECA components during the first post-prandial epoch. Here, in contrast, the amplitude of the ERA component is greater than during the pre-prandial epoch. The lower panel of Figure 5 shows high-energy spots of the ERA component that are time-locked with the ECA component. This characteristic is very important for verifying the existence of the ERA component. Another characteristic to note is the fundamental frequency of the ERA component; although it concentrates at 0.8 Hz, it can increase with time up to 1.30 Hz, but it preserves the time-locking with the ECA component at the plateau.
During the second post-prandial epoch we observed that the ERA amplitude remains high, as in the first post-prandial epoch. The differences relate to a more diffuse energy distribution than in the previous epoch. However, the ERA still preserves its time-locking with the ECA component, despite the energy decrease.
Fig. 5. Time interval during the first post-prandial epoch with ECA and ERA components. In the upper panel the spikes of the ERA component can be observed, which are reflected in the TFR in the lower panel and are time-locked with the ECA component. The frequency of the ERA component in the TFR varies from 0.6 Hz to 1.0 Hz. The scales of this figure are similar to those of Figure 4. The x-scale is in seconds.
The energy contribution from each channel can be used to construct isocontribution maps, which give a spatial representation of the area from which the source signal originated. The 37-channel layout for the ERA (first post-prandial epoch) is shown in Figure 6, which shows the localization of the ERA after DCA extraction. An increase in the contribution energy is observed in a number of channels on the right side, whereas the contribution of these channels was low in the pre-prandial epoch. After extracting the desired source with DCA and estimating the scale factor, we can calculate the adaptive power spectrum to determine the energy of each epoch using the autoregressive (AR) method. The results show an amplitude increase of the signal around 0.6-1.0 Hz (Figure 7), with a dominant frequency at 0.8 Hz, usually correlated with the higher intensity of the ECA rhythm. The power spectrum integrated over the 0.5-1.33 Hz frequency band was used to generate an ERA index. Signals acquired from all volunteers showed an increase in the ERA index when the pre- and post-prandial epochs were compared. These results can be seen in Table 1.
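As an illustration of how such an index can be computed, the sketch below uses a plain Yule-Walker AR spectrum (a stand-in for the adaptive AR estimator of [1]; the model order, frequency grid, and function names are assumptions of ours) and integrates it over the 0.5-1.33 Hz band:

```python
import numpy as np
from scipy.linalg import toeplitz

def ar_psd(sig, fs, order=30, nfreq=512):
    """Yule-Walker AR power spectrum of a single extracted component."""
    sig = sig - sig.mean()
    r = np.correlate(sig, sig, mode='full')[sig.size - 1:] / sig.size
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])  # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                        # prediction error power
    f = np.linspace(0.0, fs / 2, nfreq)
    z = np.exp(-2j * np.pi * np.outer(f / fs, np.arange(1, order + 1)))
    return f, sigma2 / np.abs(1.0 - z @ a) ** 2

def era_index(sig, fs=73.1, band=(0.5, 1.33)):
    """Integrated AR power in the ERA band, as tabulated in Table 1."""
    f, psd = ar_psd(sig, fs)
    mask = (f >= band[0]) & (f <= band[1])
    return np.trapz(psd[mask], f[mask])
```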
4 Discussion
In the present study, the amplitudes of the ECA and ERA signals were different, especially in comparison with the cardiac signal. Moreover, due to the overlap of the cardiac and ERA signals in the frequency domain, it is not possible to use a classical filter to remove the cardiac component. Finally, analysis of MGG signals is
complicated by biological interference such as magnetic signals from respiration and from the small and large intestine and the duodenum.
Fig. 6. Isocontribution map of the ERA component from the first post-prandial epoch
Fig. 7. Autoregressive power spectrum of pre-prandial and post-prandial epochs after signal extraction from one subject. The inset shows the index of each epoch.
Table 1. ERA index for each epoch (integrated between 0.5 Hz and 1.33 Hz), × 10^-51

Volunteer   Pre-prandial   Post-prandial (1)   Post-prandial (2)
1           8.0            1100                930
2           0.72           12                  18
3           9.0            91                  120
4           27             29                  44
5           16             300                 83
6           4.2            2800                5100
7           8.6            99                  180
These difficulties were overcome by using DCA to extract only the desired component with a specified periodicity, rather than extracting all sources. DCA can be applied even to recordings with a low signal-to-noise ratio. One problem, however, is that the scale of the extracted component can be altered, leading to an inaccurate energy estimate. To avoid this scaling factor problem, a projection approach can be applied. Generally, scaling factors have not been a problem in the many applications of BSS that involve a single measurement or experimental condition. But this is not the case in our study, which includes three epochs, one pre-prandial and two post-prandial. It is therefore necessary to compute the scaling factor to obtain coherent relative amplitudes for each epoch, which was accomplished here by the projection approach. A previous work [2] described the detection and analysis of a type of human ERA signal; however, because their analysis was qualitative, the authors proposed future work to validate their results with simultaneous serosal or mucosal electrode recordings. In our study, the extracted signals satisfy the properties of the ERA signals reported in the invasive animal literature, apparently requiring no further experiments. It is important to note that through the DCA process the ECA and ERA components can be extracted in the time domain. The energy increase can be seen in all
volunteers, as shown in Figure 7 and Table 1. This method and result have not been reported previously. Recordings using invasive methods [6] show that the ERA component is time-locked with the ECA component. In this work, using MGG, a non-invasive method, along with DCA and time-frequency analysis, we found that the ERA component was time-locked with the ECA component, which agrees with the previous invasive findings. We conclude that MGG can detect the electrical response activity in normal volunteers. Further improvements in signal processing and standardization of signal acquisition are necessary to ascertain its possible use in clinical situations to identify and study gastric diseases.
Acknowledgements. This work was partially supported by the Brazilian agencies CNPq, FAPESP and CAPES.
References
[1] Moraes, E.R., Troncon, L.E.A., Baffa, O., Oba-Kunyioshi, A.S., Wakai, R., Leuthold, A.: Adaptive, autoregressive spectral estimation for analysis of electrical signals of gastric origin. Physiological Measurement 24(1), 91–106 (2003)
[2] Irimia, A., Richards, W.O., Bradshaw, L.: Magnetogastrographic detection of gastric electrical response activity in humans. Physics in Medicine and Biology 51(5), 1347–1360 (2006)
[3] Andrä, W., Nowak, H.: Magnetism in Medicine: A Handbook. Wiley-VCH, Chichester (2006)
[4] Chen, J.D., McCallum, R.W.: Clinical Applications of Electrogastrography. American Journal of Gastroenterology 88(9), 1324–1336 (1993)
[5] Smout, A.J.P.M., Vanderschee, E.J., Grashuis, J.L.: What Is Measured in Electrogastrography? Digestive Diseases and Sciences 25(3), 179–187 (1980)
[6] Akin, A., Sun, H.H.: Time-frequency methods for detecting spike activity of stomach. Medical & Biological Engineering & Computing 37(3), 381–390 (1999)
[7] Barros, A.K., Cichocki, A.: Extraction of specific signals with temporal structure. Neural Computation 13(9), 1995–2003 (2001)
[8] de Araujo, D.B., Barros, A.K., Estombelo-Montesco, C., Zhao, H., da Silva, A.C.R., Baffa, O., et al.: Fetal source extraction from magnetocardiographic recordings by dependent component analysis. Physics in Medicine and Biology 50(19), 4457–4464 (2005)
[9] Tallon-Baudry, C., Bertrand, O., Delpuech, C., Pernier, J.: Oscillatory gamma-band (30-70 Hz) activity induced by a visual search task in humans. Journal of Neuroscience 17(2), 722–734 (1997)
[10] Jensen, O., Gelfand, J., Kounios, J., Lisman, J.E.: Oscillations in the alpha band (9-12 Hz) increase with memory load during retention in a short-term memory task. Cerebral Cortex 12(8), 877–882 (2002)
ECG Compression by Efficient Coding

Denner Guilhon², Allan K. Barros³,*, and Silvia Comani¹,²
¹ Department of Clinical Sciences and Bio-imaging, Chieti University, Italy
² ITAB - Institute of Advanced Biomedical Technologies, University Foundation 'G. D'Annunzio', Chieti University, Italy
[email protected], [email protected]
http://www.unich.it/itab
³ Federal University of Maranhão - UFMA, São Luís - MA, Brazil
[email protected]
http://pib.dee.ufma.br
* Corresponding author.
Abstract. The continuous demand for high-performance, low-cost electrocardiogram (ECG) processing systems has required the elaboration of ever more efficient and reliable ECG compression techniques. Such techniques face a tradeoff between compression ratio and retrieved quality, where a decrease in the latter can compromise the subsequent use of the signal for clinical purposes. The objective of this work is to evaluate the validity and performance of an independent component analysis (ICA) based scheme used to efficiently compress ECG signals, while introducing tests for a different type of record of the electrical activity of the heart, the fetal magnetocardiogram (fMCG). The reconstructed signals underwent negligible visual deterioration while achieving promising compression ratios.
Keywords: independent component analysis, efficient coding, electrocardiogram, fetal magnetocardiogram.
1 Introduction
As the need for ever larger quantities of ECG data increases, more efficient compression methods are required. Whether to optimize storage or to make online transmission possible over the public phone network, many efforts have been made to enhance ECG compression techniques, and throughout this process many methods have emerged [1]. Recent works on ECG compression relate mainly to transform methods, such as the Karhunen-Loeve (KL) transform [2] and wavelet transforms [3][4][5]. Like other compression techniques, electrocardiogram (ECG) compression aims to reduce data volume while preserving the morphological features of the signal after reconstruction. This implies that signal redundancy must be minimized without loss of the information contained therein. Discussing the primary visual cortex, neuroscientists [6] argued that a primary function of visual neurons is to re-code the input in a way that
reduces redundancy and maximizes the information transmitted to the output. This requires a redundancy reduction process in which the activation of each visual feature is supposed to be as statistically independent from the others as possible [7]. As for natural scenes, an efficient ECG compression method must take into account the high-order statistical dependencies in the data and safeguard its information content, in order to seek a minimum-entropy code [8]. In this work, we discuss a compression method that, for a given ECG signal, finds its basis functions (features) and then the coefficients of the projection of this signal onto a vector subspace spanned by the basis functions. This technique aims to obtain a less redundant and therefore more efficient code representation of the ECG source. To achieve this, independent component analysis (ICA) is used to find the vector subspace. The signal is then projected on that subspace, estimating the projection coefficients such that they minimize a mean-square-error (MSE) cost function. The same reasoning has led to promising results in image compression [9]. Additionally, initial tests on the compression of fetal magnetocardiogram (fMCG) signals are introduced. Magnetocardiography is a noninvasive and risk-free technique that allows recording the magnetic fields associated with the spontaneous electrophysiological activity of the fetal heart during the second half of gestation [10]. ICA is particularly efficient at processing the fMCG and retrieving fetal cardiac signals that are undistorted by the tissues interposed between the fetal heart and the sensors. The retrieved fetal signals can complement diagnostic methods for pregnancies at risk, such as in the case of intra-uterine fetal growth retardation or in the presence of fetal arrhythmias [11][12][13]. The use of ICA to extract the fetal signal from fMCG, in conjunction with an ICA-based compression algorithm, might permit online reconstruction of the fetal cardiac signal, which would prove extremely useful in clinical cases requiring online monitoring of fetal cardiac activity, such as life-threatening fetal arrhythmias.
2 Methods
2.1 Proposed Solution
Let $e(t)$ be an ECG signal and assume that it can be divided into $m$ windows of fixed length $n$. Let us also assume that we can find through ICA a vectorial subspace $\Phi = [\phi_1(t), \ldots, \phi_n(t)]$, where the columns $\phi_i(t)$ are defined as the basis functions of $e(t)$ (see Figure 1). The projection of the $i$-th window of $e(t)$ onto $\Phi$ is expressed by [14]

    \hat{e}_i(t) = w_1 \phi_1(t) + w_2 \phi_2(t) + \ldots + w_n \phi_n(t), \quad i = 1, \ldots, m,    (1)

where $\hat{e}_i(t)$ is the estimated version of $e_i(t)$ and each component $w_i$ is the projection coefficient for the $i$-th basis function of the subspace $\Phi$. We will drop the time index $t$ for convenience.
Fig. 1. System block diagram. (a) The learning phase where the system learns the basis functions φi through the ICA algorithm. (b) The test phase, where the coefficients of the projection are calculated, through a simple mean-square error estimator.
These coefficients can be calculated through mean-square estimation, where the signal $e_i$ is the desired one and the input to the estimator. The vector which minimizes the mean-square error, $E[\varepsilon^2]$, is given by

    w_i = E[\Phi \Phi^T]^{-1} E[e_i \Phi].    (2)
Here we assume that the desired signal spans the same subspace as the training one, otherwise the output would be null, and that the length of the training input is small enough that the signal is stationary within a specific time window [15]. The chosen ICA algorithm was FastICA [11][16].
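The two phases can be sketched as follows, using scikit-learn's FastICA as a stand-in for [16]; the windowing, the function names, and the least-squares form of Eq. (2) are illustrative choices of ours, and the quantization/entropy-coding stage is omitted. The learning signal must provide at least n windows so that the n x n basis is well defined.

```python
import numpy as np
from sklearn.decomposition import FastICA

def learn_basis(train_sig, n):
    """Learning phase: slice the training ECG into m windows of length n
    (one window per row) and learn an n x n basis Phi whose columns are
    the ICA basis functions phi_i of Eq. (1)."""
    m = train_sig.size // n
    E = train_sig[:m * n].reshape(m, n)
    ica = FastICA(n_components=n, random_state=0)
    ica.fit(E)
    return ica.mixing_                       # Phi, shape (n, n)

def project(test_sig, Phi):
    """Test phase: per-window coefficients w_i, the least-squares solution
    of Phi w_i = e_i, equivalent to Eq. (2)."""
    n = Phi.shape[0]
    m = test_sig.size // n
    E = test_sig[:m * n].reshape(m, n)
    W, *_ = np.linalg.lstsq(Phi, E.T, rcond=None)
    return W.T                               # coefficients, shape (m, n)

def reconstruct(W, Phi):
    """e_hat_i = Phi w_i, concatenated back into a 1-D signal."""
    return (Phi @ W.T).T.ravel()
```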
2.2 Efficient Coding
Let the mutual information of the random variables $e_1, \ldots, e_m$ be defined as

    I(e_1, \ldots, e_m) = \sum_{i=1}^{m} H(e_i) - H(e_1, \ldots, e_m),    (3)

where $H(e_1, \ldots, e_m)$ is the joint entropy of $e_1, \ldots, e_m$. The mutual information gives a measure of the dependency among variables. Since information cannot be lost, we recall that

    I(e_1, \ldots, e_m) \geq 0.    (4)

Substituting (4) into (3) yields

    \sum_{i=1}^{m} H(e_i) \geq H(e_1, \ldots, e_m).    (5)

Likewise, let the mutual information of the random variables $w_1, \ldots, w_m$ be defined as

    I(w_1, \ldots, w_m) = \sum_{i=1}^{m} H(w_i) - H(w_1, \ldots, w_m),    (6)

where $H(w_1, \ldots, w_m)$ is the joint entropy of $w_1, \ldots, w_m$. If we assume that $w_1, \ldots, w_m$ are independent [8], we can state that [17]

    I(w_1, \ldots, w_m) = 0.    (7)

Then, substituting (7) into (6), we get

    \sum_{i=1}^{m} H(w_i) = H(w_1, \ldots, w_m).    (8)

Given the linear transform

    e_i = \Phi w_i,    (9)

we have [18]

    H(e_1, \ldots, e_m) = H(w_1, \ldots, w_m).    (10)

(Equation (10) is valid since the random variable $w_i$ is of discrete type and the transformation $e_i = g(w_i)$ has a unique inverse.) Hence, from (5), (8) and (10) we obtain

    \sum_{i=1}^{m} H(e_i) \geq H(w_1, \ldots, w_m)    (11a)

    \Longrightarrow \quad \sum_{i=1}^{m} H(e_i) \geq \sum_{i=1}^{m} H(w_i).    (11b)

Since $L_{\min} = H(\vartheta)$, i.e., the average code length is minimum when it equals the entropy of the set, we can conclude that

    \sum_{i=1}^{m} L_{\min}(e_i) \geq \sum_{i=1}^{m} L_{\min}(w_i).    (12)

Eq. (12) establishes a relationship between the total code length required to represent $e$ by means of either $e_i$ or $w_i$. We observe that both representations have the same length only when the total code length of $e_i$ is the minimum possible, since the representation of $w_i$ is already efficient [7]. Otherwise, the total code length of $e_i$ would be increased due to the presence of redundancies that were not present in $w_i$, according to Eq. (7).
3 Results
Our approach was first tested on the MIT Normal Sinus Rhythm Database and the MIT Supraventricular Arrhythmia Database, 15 records each. The first 30 minutes of each record were used in the learning phase, while the test phase used two minutes [14].
Then, similar tests were performed on 15 fetal signals reconstructed from fMCG data recorded for a pregnancy at 35 weeks. The fetal cardiac signals were extracted according to [19]. The first 100 seconds of each record were used in the learning phase, while the test phase used 10 seconds. The coefficients found through Eq. (2) were quantized using as many levels as needed to properly reconstruct the ECG and fMCG signals. Figure 2 shows that the method does not introduce errors that result in significant visual differences, indicating that the morphological characteristics of the fetal magnetocardiogram are preserved. The reconstruction errors, also shown in Figure 2, confirm these results. Figures 3 and 4 show the mean values over 50 repetitions of the percent root-mean-square difference (PRD) upon reconstruction of each record, defined as

    \mathrm{PRD} = \sqrt{ \frac{ \sum_{i=1}^{n} [\,\mathrm{sig}_{\mathrm{orig}}(i) - \mathrm{sig}_{\mathrm{rec}}(i)\,]^2 }{ \sum_{i=1}^{n} \mathrm{sig}_{\mathrm{orig}}^2(i) } } \times 100\%,    (13)

where $\mathrm{sig}_{\mathrm{orig}}(i)$ is the original signal and $\mathrm{sig}_{\mathrm{rec}}(i)$ the reconstructed one.
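Eq. (13) translates directly into a few lines of NumPy (a hypothetical helper added here for clarity):

```python
import numpy as np

def prd(sig_orig, sig_rec):
    """Percent root-mean-square difference of Eq. (13)."""
    return 100.0 * np.sqrt(np.sum((sig_orig - sig_rec) ** 2)
                           / np.sum(sig_orig ** 2))
```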
Fig. 2. Result of 1200 samples of a fMCG record. (a) Original signal. (b) Reconstructed signal, with CR = 2.66:1 and PRD = 3.43%. (c) Reconstruction error.
4 Discussion
ICA can be used to find a vectorial subspace in which the component projections of the ECG signal are mutually independent. Therefore, the coefficients of the signal projected onto this subspace are independent as well. Coding that projection, as a whole or by its parts, results in the same code length [17].
Fig. 3. Percent root-mean-square difference upon reconstruction of 15 records of the MIT Normal Sinus Rhythm Database. Full lines with circle markers stand for ICA results, whereas dash-dot lines with square markers stand for PCA results. (a), (b) and (c) show results for compression ratios fixed at 3, 2.4 and 2, respectively. Note the logarithmic scale.
Fig. 4. Percent root-mean-square difference upon reconstruction of 15 records of the MIT Supraventricular Arrhythmia Database. Full lines with circle markers stand for ICA results, whereas dash-dot lines with square markers stand for PCA results. (a), (b) and (c) show results for compression ratios fixed at 2.5, 2 and 1.667, respectively. Note the logarithmic scale.
Figures 3 and 4 show the reconstruction errors obtained when using either ICA or PCA to find a vectorial subspace associated with each of the 15 ECG records of both databases. These results reflect a performance comparison of our method against that of [2], excluding the contributions of classic compression algorithms. Using ICA, the ECG reconstruction errors, even after the quantization process, are smaller than those obtained using PCA. Again, one can see that the method achieves data compression without adding relevant distortion to the signal, as confirmed by Figure 2. Moreover, from (12) we observe that the average code length required to represent the original signal is larger than that required to represent its coefficients, unless the former is already efficiently coded. In that case, the left side of the equation equals the right, which means that our method does not alter the representation code length, because it is already minimal. Otherwise, due to the presence of redundancies, the code length of the original data representation would be larger than that achieved using our method.
5 Conclusion
We evaluated the validity and performance of an ICA-based scheme used to compress ECG and fMCG data. It was verified that such a tool efficiently compresses these signals by means of non-redundant representations, ensuring both a reduction of the total data volume and the preservation of the morphological characteristics of the signals. From the perspective of clinical applications, the demonstrated effectiveness of the described compression algorithm would be beneficial not only for adult ECG but also for prenatal online monitoring of fetal cardiac activity. A further step in the development of this tool might be its use as a preprocessing step for a classic compression algorithm; furthermore, a selection criterion might be included to allow a reduction of the number of coefficients that must be stored.
Acknowledgements. This work was supported by FAPEMA.
References
[1] Jalaleddine, S.M., Hutchens, C.G., Strattan, R.D., Coberly, W.A.: ECG data compression techniques - A unified approach. IEEE Trans. Biomed. Eng. 37(4), 329–343 (1990)
[2] Olmos, S., Millan, M., Garcia, J., Laguna, P.: ECG data compression with the Karhunen-Loeve transform. In: Computers in Cardiology, Menlo Park, CA, pp. 253–256. IEEE Comput. Soc. Press, Los Alamitos (1996)
[3] Miaou, S.G., Chao, S.N.: Wavelet-based lossy-to-lossless ECG compression in a unified vector quantization framework. IEEE Trans. Biomed. Eng. 52(3), 539–543 (2005)
[4] Blanco-Velasco, M., Cruz-Roldan, F., Godino-Llorente, J.I., Barner, K.E.: ECG compression with retrieved quality guaranteed. Electronics Letters 40(23), 1466–1467 (2004)
[5] Rajoub, B.A.: An efficient coding algorithm for the compression of ECG signals using the wavelet transform. IEEE Trans. Biomed. Eng. 49(4), 355–362 (2002)
[6] Sekuler, A.B., Bennet, P.J.: Visual neuroscience: Resonating to natural images. Curr. Biol. 11, R733–R736 (2001)
[7] Bell, A.J., Sejnowski, T.J.: Edges are the independent components of natural scenes. In: Advances in Neural Information Processing Systems, vol. 9, pp. 831–837. MIT Press, Cambridge (1997)
[8] Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996)
[9] Souza, C.M., Cavalcante, A.B., Guilhon, D., Barros, A.K.: Image compression by redundancy reduction. Submitted to ICA (2007)
[10] Tavarozzi, I., et al.: Magnetocardiography: current status and perspectives. Part I: Physical principles and instrumentation. Ital. Heart J. 3, 75–85 (2002)
[11] Hild II, K.E., Alleva, G., Nagarajan, S.S., Comani, S.: Performance comparison of six independent components analysis algorithms for fetal signal extraction from real fMCG data. Phys. Med. Biol. 52, 449–462 (2007)
[12] Comani, S., et al.: Time course reconstruction of fetal cardiac signals from fMCG: independent component analysis vs. adaptive maternal beat subtraction. Physiol. Meas. 25, 1305–1321 (2004)
[13] Comani, S., et al.: Characterization of fetal arrhythmias by means of fetal magnetocardiography in three cases of difficult ultrasonographic imaging. Pacing Clin. Electrophysiol. 27, 1647–1655 (2004)
[14] Guilhon, D., Medeiros, E., Barros, A.K.: ECG data compression by independent component analysis. In: IEEE Workshop on Machine Learning for Signal Processing, September 28-30, 2005, pp. 189–193 (2005)
[15] Barros, A.K., Ohnishi, N.: Single channel speech enhancement by efficient coding. Signal Processing 85(9), 1805–1812 (2005)
[16] Hyvarinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997)
[17] Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
[18] Papoulis, A., Pillai, S.U.: Probability, Random Variables and Stochastic Processes, 4th edn. McGraw-Hill, New York (2002)
[19] Comani, S., et al.: Independent component analysis: fetal signal reconstruction from magnetocardiographic recordings. Comput. Meth. Prog. Biomed. 75, 163–177 (2004)
Detection of Paroxysmal EEG Discharges Using Multitaper Blind Signal Source Separation

Jonathan J. Halford

Medical University of South Carolina, Department of Neuroscience, Adult Neurology Division, 96 Jonathan Lucas St., Suite 307 Clinical Science Building, Charleston, SC 29425
[email protected]
Abstract. The routine electroencephalogram (rEEG) is a useful diagnostic test for neurologists. But this test is frequently misinterpreted by neurologists due to a lack of systematic understanding of paroxysmal electroencephalographic discharges (PEDs), one of the most important features of EEG. A heuristic algorithm is described which uses conventional blind signal source separation (BSSS) algorithms to detect PEDs in a routine EEG recording. This algorithm treats BSSS as a ‘black box’ and applies it in a computationally-intensive multitaper algorithm in order to detect PEDs without a pre-specification of signal morphology or scalp distribution. The algorithm also attempts to overcome some of the limitations of conventional BSSS as applied to the study of neurophysiology datasets, specifically the ‘over-completeness problem’ and the ‘non-stationarity problem’.
1 Introduction
The routine electroencephalogram (rEEG) recorded from the scalp is an important diagnostic test used by neurologists for the medical management of patients with undetermined spells, epileptic seizures, altered mental status, and coma [1]. Unfortunately, the misinterpretation of rEEG is common, since most rEEG studies are interpreted by neurologists who lack subspecialty fellowship training in clinical neurophysiology. These misinterpretations sometimes cause medical mismanagement and harm to patients [2]. Clinically-significant misinterpretations of rEEG studies usually involve the misinterpretation of paroxysmal EEG discharges (PEDs) - short bursts of electrical activity usually lasting between 0.1 and 2.0 seconds which have higher signal amplitude than the surrounding background EEG activity. The manifestations of PEDs in rEEG are varied and complex. The morphological features which describe the boundaries of normality for PEDs are subtle (particularly in children), not rigorously defined, and vary between experts and clinical neurophysiology training programs [3]. The subspecialty training of clinical neurophysiologists in the interpretation of PEDs is much like the training of a baseball umpire who gradually develops a 'strike-zone' through observation, attention to oral history, and supervision by instructors.
In order to promote the rigorous scientific study of PEDs, it would be useful to be able to collect all PED activity in an rEEG recording objectively using a computerized detection algorithm. A database of the PEDs present in an rEEG could then be characterized in terms of signal morphology and scalp distribution and submitted, along with clinical data, to clustering algorithms. This could provide a more objective definition of what constitutes a normal or abnormal constellation of PEDs in a given rEEG, for both clinical and research purposes. To the author's knowledge, a computer algorithm which detects all PED signal activity in rEEG in a non-specific way (with little regard to signal morphology or scalp distribution) does not exist. Previous scientific study of PEDs in rEEG has focused on other algorithmic approaches. In many studies, the signal morphology and scalp distribution of normal or abnormal types of PEDs are defined a priori, and algorithms for detecting PEDs are created based on these morphologic and topographic characteristics [4,5]. This causes a significant selection bias in the detection of PEDs, since 'you find only what you are looking for'. Other studies use a neural network approach to categorize a sample of rEEG data as normal or abnormal based on a training signal from human experts [6]. This does not require an a priori specification of the characteristics of the PEDs to be analyzed, but neither does it provide an objective method of collecting and studying individual PEDs. Algorithms which enable the objective detection and capture of all PED activity in an rEEG record are needed to provide an improved first stage for advanced pattern-recognition algorithms that could be useful for rEEG research and clinical interpretation. Some clinical neurophysiology researchers have begun using BSSS as a first stage in their pattern recognition algorithms [7]. Routine EEG recordings consist of a mixture of electrical signals generated by a myriad of both intracranial and extracranial biological and artifactual sources. Blind signal source separation (BSSS) algorithms are a method for separating signals of interest from a mixture of signals [8]. BSSS is an ideal foundation for algorithms to detect PEDs in a non-specific way, since BSSS enables some separation of the PED signal from other concurrent EEG signals. But there are limitations to conventional BSSS algorithms when applied to neurophysiologic recordings. The classic description of BSSS is the 'cocktail party problem', in which BSSS analysis separates the voices of n speakers in a cocktail party using n microphones placed throughout the room [9]. Conventional BSSS algorithms assume that the number of guests at the cocktail party is limited to the number of microphones and that the locations of the guests do not change over time. But the 'cocktail party problem' facing clinical interpreters of EEG is very different from this classic one. In the 'cocktail parties' of rEEG recordings, there is an unknown number of guests at the party, the number of guests is likely larger than the number of microphones, and the number of guests probably changes over time. This leads to a significant 'over-completeness problem'. Also, many of the guests are probably walking about the room while they are talking (causing their locations to be 'non-stationary' throughout data acquisition), leading to a significant 'non-stationarity problem' [10].
This paper describes a computationally-demanding heuristic algorithm created to use conventional BSSS algorithms to:
1. Detect PEDs in neurophysiologic datasets in a non-specific way (with minimal regard to signal morphology or scalp distribution) 2. Circumvent the over-completeness problem of BSSS 3. Partially address the non-stationarity problem of BSSS.
2 Algorithm
2.1 Step 1. Multitaper Blind Signal Source Separation
Because rEEG recordings are often lengthy and obtained with a limited number of recording electrodes (approximately 20), they are assumed to be overcomplete. This overcompleteness could cause a BSSS analysis of an entire rEEG dataset to fail to resolve an individual PED into one or more components of the output signal. Also, although the source locations for brain activity are anatomically fixed, the sequential activations of interconnected neurons which occur during brain activation frequently cause non-stationarity in measured rEEG signals [10]. BSSS analysis of an entire rEEG dataset will represent an individual PED, and often a group of similar PEDs, in a limited number of components. If the source generators for a PED are non-stationary and vary slightly from PED to PED, each individual PED could be better represented by a unique set of scalp distribution vectors (SDVs) associated with different ranges of time-points during the time of occurrence of the individual PED. A simple (but computationally-intensive) algorithm is implemented to do this. In the multitaper BSSS algorithm (mBSSS), a large number of BSSS operations are performed throughout the rEEG dataset using a range of window lengths at incremental temporal locations, in a multitaper method. Because the author has implemented the mBSSS algorithm with Infomax independent component analysis (ICA), the output of the algorithm will be referred to as independent components (ICs) and SDVs [11]. By performing many BSSS operations in an evenly-spread pattern of window lengths and temporal locations, mBSSS attempts to detect each PED in one or more ICs, with one or more associated unique SDVs. Any BSSS algorithm may be 'plugged in' to the mBSSS algorithm, as long as the number of output ICs (and SDVs) matches the number of input sources and as long as it converges on a solution very reliably. A number of mBSSS parameters must be chosen empirically. These parameters describe how many BSSS operations are performed, where they are performed in the dataset, and how much the windows for BSSS overlap. The parameters used for the algorithm are:

w_min: shortest window length for BSSS, given in datapoints
w_max: longest window length for BSSS, given in datapoints
w_ef: window length expansion factor
γ: BSSS window overlap parameter
τ: parameter defining the number of 'zones' within each BSSS window
The sets used for the algorithm are:

X(n, t): time-versus-amplitude rEEG dataset of n channels and length t
W: BSSS window lengths to be used for mBSSS
R: number of BSSS operations ('repetitions') performed for each window length
B: start-time positions in X ('begin points') for all BSSS windows
S: ICs produced by mBSSS
A: SDVs produced by mBSSS

The set of window lengths W to be used for BSSS ranges exponentially from
Fig. 1. Parameter settings of τ = 2 and γ = 4. At a setting of τ = 2, the IC is divided into five zones. The window for BSSS is moved forward by a length equal to a zone length divided by γ.
w_min to w_max with an expansion factor of w_ef. The γ parameter, which can be any non-zero integer, defines the extent to which consecutive windows for BSSS overlap. The start-points of the consecutive windows for BSSS are distributed equally throughout the dataset. Each window for BSSS is partitioned into 2τ + 1 'zones' of equal length. For a given window length, the mBSSS algorithm performs consecutive BSSS operations progressively through the dataset, translating the location of the window for BSSS by a 1/(γ(2τ + 1)) fraction of the window length at each increment. The output sets of the algorithm are S, a large set of ICs, and A, a large set of the respective SDVs.
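The bookkeeping of window lengths and start points can be sketched as follows (an illustration of the scheduling only; the BSSS call itself, e.g. Infomax ICA, is treated as a black box, and the function name is ours):

```python
def mbsss_schedule(t, w_min, w_max, w_ef, gamma, tau):
    """Enumerate (start, length) pairs for all BSSS windows: lengths grow
    exponentially from w_min to w_max by factor w_ef, and each window is
    slid in steps of length / (gamma * (2 * tau + 1)) datapoints."""
    lengths = []
    w = float(w_min)
    while w <= w_max:
        lengths.append(int(round(w)))
        w *= w_ef
    schedule = []
    for length in lengths:
        step = max(1, length // (gamma * (2 * tau + 1)))
        for start in range(0, t - length + 1, step):
            schedule.append((start, length))  # one BSSS run on X[:, start:start+length]
    return schedule
```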
2.2 Step 2. Paroxysmal Event Detection
Since the magnitudes of the signals in both sets A and S are related indirectly and unpredictably to the amplitude of the source signals X, the A and S sets are normalized to create A_n and S_n, respectively, using a normalization parameter. ICs in set S_n with a paroxysmal appearance are retained, and all other ICs in set S_n are discarded along with their respective SDVs in A_n. Each window for BSSS is partitioned into 2τ + 1 'zones' of equal length (i.e., a central zone with τ zones on either side, as shown in Figure 2). An IC is considered paroxysmal (and therefore may represent a PED) if it has a higher root-mean-square
(RMS) power in the central zone of the IC ('zone B') than in the surrounding zones ('zone A' and 'zone C'). This is illustrated in Figure 2. The RMS power
Fig. 2. At a setting of τ = 2, the IC is divided into five zones. Zones A, B, and C are always assigned as the central three zones within the IC, regardless of the value of tau (τ ).
of each IC in set S_n in zones A, B, and C is defined as P_A, P_B, and P_C. The 'paroxysmal event index' (PEI), termed Υ, for each IC in S_n is calculated as

    \Upsilon = \frac{2 P_B^2 - P_A^2 - P_C^2}{W}.    (1)

The PEI (Υ) threshold parameter υ_p is selected empirically. Only those independent components in S_n (along with their respective SDVs in A_n) which have Υ > υ_p are retained and placed in set S_np (with their respective SDVs placed in set A_np). All other ICs in S_n (and respective SDVs in A_n) are discarded.
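A direct sketch of Eq. (1) follows (illustrative; it assumes τ ≥ 1 so that zones A and C exist, and the function name is ours):

```python
import numpy as np

def pei(ic, tau):
    """Paroxysmal event index of Eq. (1): squared RMS power of the central
    zone B versus the flanking zones A and C, for an IC split into
    2*tau + 1 equal-length zones; W is the window length."""
    zones = np.array_split(ic, 2 * tau + 1)
    a, b, c = zones[tau - 1], zones[tau], zones[tau + 1]
    rms2 = lambda z: np.mean(z ** 2)  # squared RMS power of a zone
    return (2 * rms2(b) - rms2(a) - rms2(c)) / ic.size
```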
2.3 Step 3. IC Redundancy Reduction and Representation
The database of independent components (ICs) in set S_np often contains multiple ICs representing each single PED. ICs which represent the same PED can be identified by simple cross-multiplication of those ICs which overlap temporally by at least 50 percent of the length of one of the two ICs. The simplest approach to removing this redundancy is to pick out and retain the IC (and its respective SDV) which has the highest Υ (or some other characteristic) within the group of ICs with the same 'event number'; a sketch of this grouping step is given below. Another, more complicated method is to develop a hierarchical representation of each group of ICs with the same 'event number', containing a longer IC and possibly several shorter 'subcomponent' ICs. This provides a description of how the SDV of the PED changes over time, helping to address the 'non-stationarity' problem of applying BSSS to neurophysiology datasets.
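The grouping by temporal overlap can be sketched as a greedy pass (our illustration; the cross-multiplication similarity check on the overlapping IC waveforms is omitted for brevity):

```python
def group_by_overlap(events, min_frac=0.5):
    """Greedily assign an 'event number' to ICs whose time spans
    (start, end) overlap by at least min_frac of the shorter span."""
    labels = [None] * len(events)
    next_label = 0
    for i, (s1, e1) in enumerate(events):
        if labels[i] is None:
            labels[i] = next_label
            next_label += 1
        for j in range(i + 1, len(events)):
            if labels[j] is not None:
                continue
            s2, e2 = events[j]
            overlap = min(e1, e2) - max(s1, s2)
            if overlap >= min_frac * min(e1 - s1, e2 - s2):
                labels[j] = labels[i]
    return labels
```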
2.4 Step 4. Visualization of PEDs Detected Using mBSSS
Depending on the parameter values selected, when the mBSSS algorithm is applied to an rEEG dataset, a large database of ICs and their respective SDVs (which represent PEDs) can be created. In the author's experience, it is possible to capture hundreds of PEDs in a typical 30-minute routine EEG recording. Visualization of this many ICs is a challenge. The individual IC signals and
SDVs can be visualized in groups based on their 'event number'. Also, single-value secondary characteristics of zone B of each IC in set S_np can be calculated. These secondary characteristics may include typical quantitative measures such as peak spectral frequency and 'spikiness' morphology, but may also include more complex measures such as peak wavelet power using various mother wavelets. Single-value secondary characteristics of the SDV, such as the location of peak IC activity and degree of focality, can also be calculated. Secondary characteristics of each IC (and its respective SDV) captured by mBSSS can be plotted in a graph with time on the x-axis, IC length on the y-axis, and a single-value IC or SDV secondary characteristic on the z-axis (color).
3 Medical Application
One year of routine clinical EEGs performed at the MUSC Neurophysiology Laboratory was reviewed by the author (over 1000 EEGs). These digital EEGs were acquired at 256 Hz with 19 channels using the standard 10-20 electrode placement. A database of 101 PEDs was collected by the author from these clinical EEG recordings. It consists of 61 PEDs from the EEGs of 50 patients without a history of epilepsy, which were judged by the author to be normal but difficult to interpret, and 40 PEDs from the EEGs of 29 patients with known epilepsy, judged by the author to be abnormal PEDs which were subtle due to their relatively low amplitude. Thirty-second EEG epochs containing these 101 PEDs were de-identified and assimilated into a single digital EEG file termed the 'source EEG dataset'. Each 30-second segment contained the PED-of-interest at approximately the temporal midpoint of the segment. A parallel database of 101 tag signals was created using segments of EEG from the single EEG channel which the author thought best represented each PED. (The length of these 'tag signals' varied with the length of each PED.) The mBSSS analysis was performed on the source EEG dataset in the sequence described below, using Matlab code composed by the author and various EEGLab Matlab scripts including Infomax ICA [12]:
1. mBSSS analysis was performed over four-second windows of EEG data centered temporally over each of the 101 PEDs. The parameters for mBSSS included τ = 5, γ = 6, and 14 BSSS window lengths varying exponentially from approximately 0.5 to 10 seconds (each window length 10% greater than the next). This mBSSS analysis produced 991 PED ICs and their respective SDVs.
2. The 991 PED ICs were compared with each other, and it was determined that they represented 379 unique PEDs in the source EEG dataset.
3. One of the 991 PED ICs was selected from each of the 379 groups of PED ICs to represent each of the 379 detected PEDs (producing 379 'representative PED ICs'), based on its PEI (Υ) value and other secondary characteristics.
4. Each of the 379 representative PED ICs was compared to the 101 EEG tag signals using the methods described in step 2 above. The representative PED ICs which matched one of the 101 PED tag signals were labelled as such.
The mBSSS analysis detected 98 of the 101 target PEDs in the source EEG dataset. All of the PEDs which were not detected by mBSSS were normal PEDs. Secondary characteristics of the detected PEDs, and the sensitivity and specificity for categorizing the target PEDs as normal or abnormal based on these characteristics, have recently been presented elsewhere as a poster and can be viewed at http://www.drivehq.com/web/halfordjj/HalfordAESposter2006.ppt [13].
4 Conclusion
Based on the results of the preliminary testing presented above, the mBSSS algorithm is able to detect many PEDs in routine EEG datasets without a pre-specification of signal morphology or scalp distribution. But the mBSSS algorithm has many limitations. First, it has been developed heuristically and is not based on the fundamental mathematical principles of BSSS. Second, the definition of a PED is subjective and needs to be better defined both mathematically and clinically; this would require the creation of standardized EEG datasets to verify the accuracy of PED detection with mBSSS. Third, the mBSSS algorithm is very computationally intensive and therefore not practical for routine clinical or research use at the present time. If the algorithm is to be useful for research or clinical practice, many improvements need to be made. First, the mBSSS algorithm needs to be restructured to minimize the number of empirically-set algorithm parameters. Second, since the current implementation uses un-compiled Matlab scripts, the algorithm needs to be coded in C and implemented with an efficient ICA algorithm, one which possibly detects not only supra-Gaussian but also sub-Gaussian sources. Third, experiments with a range of parameter settings need to be performed for parameter optimization. These experiments will require standardized clinical rEEG datasets. In order to test whether the mBSSS algorithm can detect PED activity visible to the human eye, rEEG datasets need to be developed in which all visible PEDs have been marked by several expert rEEG interpreters. Information-retrieval statistics could be used to test the precision, recall, and accuracy of the mBSSS algorithm in detecting PEDs. Also, in order to determine whether the algorithm can capture PED activity too subtle for the human eye to detect, datasets with intracranial EEG or magnetoencephalogram (MEG) data acquired concurrently with rEEG data need to be studied in a similar fashion. The high computational requirement of the algorithm may become less of a problem over time. The cost of computational power continues to decrease at a somewhat predictable rate. If the trend toward multi-core microprocessors continues, substantial computational power for parallel processing could be available in personal computers within a few decades. The code for algorithms which use multiple BSSS computations can be easily multithreaded. These algorithms could be implemented on small local clusters or used via a telemedicine approach in which clinical EEG datasets are moved to and from remote centralized data processing centers over the Web. BSSS algorithms which detect PEDs could help provide an objective structure to the computer analysis of PED activity in
routine clinical EEG. This could lead to an improved understanding of clinical EEG and an improvement in neurology patient care.

Acknowledgement. Special thanks to Martin J. McKeown MD for inspiration and to Jose C. Principe PhD and Brian C. Dean PhD for advice.
References

1. Zifkin, B.G., Cracco, R.Q.: An Orderly Approach to the Abnormal Electroencephalogram. In: Current Practice of Clinical Electroencephalography, pp. 288–302. Lippincott Williams & Wilkins, Philadelphia, PA (2003)
2. Benbadis, S.R.: The EEG in nonepileptic seizures. Journal of Clinical Neurophysiology 23, 340–352 (2006)
3. Westmoreland, B.F.: Benign Electroencephalographic Variants and Patterns of Uncertain Clinical Significance. In: Current Practice of Clinical Electroencephalography, pp. 235–245. Lippincott Williams & Wilkins, Philadelphia (2003)
4. Wilson, S.B., Emerson, R.: Spike detection: a review and comparison of algorithms. Clinical Neurophysiology 113, 1873–1881 (2002)
5. Lopes da Silva, F.: Computer-Assisted EEG Diagnosis: Pattern Recognition and Brain Mapping. In: Electroencephalography: Basic Principles, Clinical Applications, and Related Fields, vol. 2, pp. 1233–1264. Lippincott Williams & Wilkins, Philadelphia, PA (2005)
6. Acir, N., Oztura, I., Kuntalp, M., Baklan, B., Guzelis, C.: Automatic detection of epileptiform events in EEG by a three-stage procedure based on artificial neural networks. IEEE Transactions on Biomedical Engineering 52(1), 30–40 (2005)
7. LeVan, P., Urrestarazu, E., Gotman, J.: A system for automatic artifact removal in ictal scalp EEG based on independent component analysis and Bayesian classification. Clinical Neurophysiology 117, 912–927 (2006)
8. Haykin, S., Chen, Z.: The cocktail party problem. Neural Computation 17, 1875–1902 (2005)
9. Brown, G.D., Yamada, S., Sejnowski, T.J.: Independent component analysis at the neural cocktail party. Trends in Neurosciences 24, 54–63 (2001)
10. Unsworth, C., Spowart, J., Lawson, G., Brown, J., Mulgrew, B., Minns, R., Clark, M.: Redundancy of independent component analysis in four common types of childhood epileptic seizure. Journal of Clinical Neurophysiology 23(3), 245–253 (2006)
11. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
12. Delorme, A., Makeig, S.: EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics. Journal of Neuroscience Methods 134, 9–21 (2004)
13. Halford, J.J.: Analysis of temporal lobe paroxysmal events using multitaper independent component analysis. Abstract presentation at the American Epilepsy Society Annual Meeting, San Diego, CA (December 2006)
Constrained ICA for the Analysis of High Stimulus Rate Auditory Evoked Potentials

James M. Harte

Centre for Applied Hearing Research, Acoustic Technology, Ørsted•DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark
[email protected]
Abstract. A temporally-constrained blind-source-separation algorithm was used to analyse auditory evoked potentials, evoked by impulse trains with inter-stimulus rates of 95 and 198 Hz. A nonstationarity-of-variance contrast function was used, and a simulation was run showing its ability to extract sources based on a simple convolved model of auditory brainstem and middle latency responses. For a stimulus rate of 95 Hz, where no neural adaptation occurs, this approach was partially successful on experimental data. For the higher rate of 198 Hz, particularly poor results were observed for brainstem responses. It is hypothesised that this may be due to the neural adaptation process and/or an inappropriate choice of source model.
1 Introduction
Auditory evoked potentials are the summed response of many remotely located neurons, recorded via scalp electrodes. They can be recorded from all levels of the auditory pathway, from the auditory nerve and the brainstem up to the cortex. They are typically grouped in terms of time of occurrence after stimulus offset, and are thus known as: auditory brainstem responses (ABRs), recorded between 1 and 7 ms after stimulus offset; middle latency responses (MLRs), recorded in the interval 15–50 ms after the acoustic stimulus; and auditory late responses (ALRs), recorded in the interval 75–200 ms after the stimulus. To the author's knowledge, no studies to date have successfully used independent component analysis (ICA) to examine ABR responses, though examples exist of using ICA on MLR and ALR responses [4,9]. It is the goal of this paper to develop a tool for characterising the phenomenon known as neural adaptation, where a reduced neural output is observed due to prolonged or repeated stimulation, at each stage of the auditory pathway. This is by no means a trivial task, and the work presented here represents a first step toward this goal. Adaptation has an important function in auditory perception models [1], and is therefore of interest in audiological science. It is often investigated as a function of the stimulus rate of repeated impulse or chirp stimuli [6]. Here, ICA with reference [7,8] will be used to extract two components attributed to ABR- and MLR-based sources, using a correlation-based distance
metric [5]. A nonstationarity-of-variance [2] based approach for finding maximally independent components will be used. It is assumed that an approach using the temporal structure in the data is more appropriate for evoked potentials, which have well-known temporal features. If it is possible to separate signals based on a principled semi-blind source separation algorithm such as the one proposed here, then it may prove a useful tool for non-invasively investigating the human auditory pathway.
2 Constrained Blind Source Separation
The classic blind source separation problem assumes that we have a series of mixtures obtained via an unknown mixing matrix applied to a set of underlying sources, x = As. The goal is to find an unmixing matrix W such that the estimate of the sources is given by ŝ = Wx. It is common to pre-process and reduce the dimension of the linear mixtures via principal component analysis (PCA), z = Ex. All discussion of the mixtures from this point on assumes PCA and dimension reduction have been performed. Hyvärinen [2] showed that maximization of the nonstationarity of variance, measured by the absolute value of the 4th cross-cumulant of a linear combination of the observed mixtures, allows for the estimation of one source signal. The 4th-order cross-cumulant corresponds to the autocorrelation of energy in a given signal, and is given by

$$\mathrm{cum}(y, y, y_\tau, y_\tau) = E\{y^2 y_\tau^2\} - E\{y^2\}E\{y_\tau^2\} - 2\left(E\{y\,y_\tau\}\right)^2 \qquad (1)$$
where the time dependence of y(t) is omitted for brevity and $y_\tau$ represents the delayed signal y(t − τ); this convention will be used for the rest of this paper. Lu and Rajapakse [7,8] developed a framework for constrained independent component analysis (cICA), to incorporate additional requirements and prior information in the form of constraints on the ICA contrast function. The goal is to extract the component closest to some user-specified reference signal r(t), via some distance metric. The closeness constraint to be used here is the correlation at zero lag:

$$g(\mathbf{w}) = \zeta - E\{r\,(\mathbf{w}^T\mathbf{z})\} \le 0 \qquad (2)$$

where w is a single (one-unit) demixing weight vector, such that y(t) = wᵀz, and ζ is the threshold that defines the lower bound of the optimum correlation [5]. The one-unit constrained optimization problem for a nonstationarity-of-variance contrast function J(y) is given by:

$$\begin{aligned} \text{maximise } & J(y) = \left|\mathrm{cum}\left(\mathbf{w}^T\mathbf{z},\ \mathbf{w}^T\mathbf{z},\ \mathbf{w}^T\mathbf{z}_\tau,\ \mathbf{w}^T\mathbf{z}_\tau\right)\right| \\ \text{subject to } & g(\mathbf{w}) \le 0, \quad h(\mathbf{w}) = E\{(\mathbf{w}^T\mathbf{z})^2\} - 1 = 0 \end{aligned} \qquad (3)$$

The equality constraint h(w) bounds the scale of the output y and the weight vector w, and is needed because cumulants are not scale invariant. The specific algorithm used in this paper is given in Appendix A.
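A small sketch may help make Eqns. (1)–(3) concrete. The functions below compute the 4th-order cross-cumulant, the contrast J(y), and the closeness constraint g(w); they are an illustrative reading of the equations, assuming whitened mixtures z of shape (dimensions, samples), not code from the paper.

```python
import numpy as np

def cross_cumulant(y, tau):
    """4th-order cross-cumulant cum(y, y, y_tau, y_tau) of Eqn. (1)."""
    y0, yt = y[tau:], y[:-tau]          # y(t) paired with y(t - tau)
    return (np.mean(y0**2 * yt**2)
            - np.mean(y0**2) * np.mean(yt**2)
            - 2 * np.mean(y0 * yt)**2)

def contrast_J(w, z, tau):
    """Nonstationarity-of-variance contrast |cum(...)| of Eqn. (3)."""
    return abs(cross_cumulant(w @ z, tau))

def constraint_g(w, z, r, zeta):
    """Closeness constraint of Eqn. (2); the solution is feasible when <= 0."""
    return zeta - np.mean(r * (w @ z))
```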
3 Experimental Methods and AEP Data
In this preliminary study, only a single test subject was used. The stimulus was generated in MATLAB, D/A conversion was made through an ADI-8 Pro 24-bit converter, the levels were set via a DT PA5 programmable attenuator, and the stimuli were presented to the left ear of the test subject via an ER-2 insert earphone. EEG activity was recorded differentially between the vertex and 28 electrodes distributed over a 61-channel triangulated equidistant arrangement headcap, with the ground electrode placed on the forehead. Silver/silver-chloride electrodes were used, and inter-electrode impedances were maintained below 5 kΩ. The EEG activity was recorded on a SynAmps 5803 amplifier, providing 74 dB of gain before a low-pass filter stage (cut-off of 2 kHz), with a sampling rate of 10 kHz. After recording, the EEG data were epoched and filtered again from 10 to 1500 Hz using a 200-tap FIR filter. The epochs were averaged using an iterative weighted-averaging algorithm [11].
Fig. 1. (a) Experimental stimulus waveforms and (b) event-related potentials for a single exemplary channel located on the ipsilateral mastoid, showing combined ABR and MLR responses to (top) a single impulse; (middle) a 95 Hz impulse train; and (bottom) a 198 Hz impulse train. The two curves represent two repeat recordings with 4000 averages and show good repeatability.
The basic stimulus used in this experiment was an 83 μs duration impulse. Three stimulus conditions, see Fig. 1(a), were presented at a constant inter-epoch rate of ≈ 8.33 Hz (i.e. an epoch duration of 120 ms). The first stimulus set was a single impulse, acting as a control where both ABR and MLR responses could be seen. Stimulus set 2 presented a train of impulses with a within-train rate of 95 Hz, shown in [6] to produce an unadapted ABR response. The MLR response will be convolved over the epoch, as the inter-epoch rate was chosen with no jitter so that circular convolution would occur. Stimulus set 3 presented the impulses at a
rate of 198 Hz, ensuring the ABRs would be adapted over the impulse train. A total of 4000 averages were made per stimulus type, and each recording was repeated twice to ensure repeatability of the results. The stimuli were all presented at a level of 60 dB pe SPL, to ensure good SNR and test-subject comfort. An illustrative event-related potential for a single channel located at the ipsilateral mastoid is shown for the three stimulus conditions in Fig. 1(b). The results for the single impulse stimulus (top) show typical features expected in the ABR and MLR auditory evoked potentials: the ABR wave V is clearly located around 6 ms after stimulus offset. Typical MLRs are observed, with the first negativity Na around 25 ms after stimulus offset and the first positivity Pa around 35 ms. The two high-stimulus-rate ERPs are shown in the middle and bottom panes of Fig. 1(b). It can be seen that the slow MLR responses are highly convolved within the event-related potential.
4 Simulation
A simulation was carried out assuming three sources and square mixing. Source 1 used the ABR within the single-impulse evoked potential, re-filtered with a pass-band of 100–1500 Hz and windowed temporally to minimise the influence of the MLR. This 'clean' ABR was then convolved with the 95 Hz impulse train in the frequency domain to obtain circular convolution. The second source was similarly obtained by filtering (10–100 Hz), windowing and convolving the MLR response from the single-impulse evoked potential. A third, biologically inspired noise source was added to limit the algorithm's performance. It was defined to have a pink power spectrum (i.e. 1/f), typical of raw EEG recordings, limited to the 10–1500 Hz pass-band of interest in this study. The noise had a Gaussian distribution, as one might expect after synchronous averaging and filtering, due to the central limit theorem. The sources are shown in Fig. 2(a). The cICA algorithm was given the exact MLR source as reference, and a deflationary orthogonalization procedure [3] was used to extract the remaining components in the data. Fig. 2(b) shows the output from the cICA algorithm for a delay of τ = 1. The first component, obtained with the reference and constraints, was extracted with a correlation of ζrs = 1.00, i.e. perfectly. The ABR component, obtained from the deflationary orthogonalization, appears to be adequately extracted, although some corruption by the noise source is apparent. Similar results were obtained for the 198 Hz inter-impulse-rate simulation. Blind source separation via nonstationarity of variance assumes the 4th cross-cumulant for a source is non-zero, i.e. some structure is required in the energy of the signal. Hyvärinen [2] gave examples with smoothly changing variances where the envelopes were random and low-pass filtered, and thus have sharply decaying cross-cumulants for increasing delay τ. In this experiment matters are complicated by the introduction of a set of convolved or repeated energy profiles; thus the cumulant for the source becomes periodic as a function of the delay τ. We also assume two (or more) sources with the same periodicity, i.e. phase-locked to the periodic stimuli. The maxima in the 4th cross-cumulant for such a mixing model
become functions of the delay τ. As indicated by Hyvärinen [2], the choice of delay τ becomes difficult, yet theoretically crucial. The success of the algorithm can be judged through the use of the convergence index suggested in [2]: defined as the sum of the absolute values of the matrix WA minus 3, this will be zero if the mixing matrix A is estimated perfectly. The simulation was carried out 1000 times for delays τ = 1 to 50 to obtain a mean convergence index as a function of delay, in the hope that this might suggest a meaningful delay to use when analysing the experimental results. However, the variance at each τ was comparable to the variance across τ, so no useful optimal delay could be found from this simple analysis. A delay of τ = 1 was therefore chosen for the experimental results.

Fig. 2. Simulation results for a 95 Hz inter-impulse rate: (a) original sources used; (b) extracted maximally independent components
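The frequency-domain circular convolution used to construct the convolved ABR and MLR sources can be written compactly, as sketched below; the impulse-train builder and all parameter values are illustrative assumptions rather than the paper's exact stimulus code.

```python
import numpy as np

FS = 10_000                        # sampling rate (Hz), as in the recordings
EPOCH = int(0.120 * FS)            # 120 ms epoch length in samples

def impulse_train(rate_hz, n_samples, fs):
    """Binary impulse train at the given within-train rate."""
    train = np.zeros(n_samples)
    train[::int(fs / rate_hz)] = 1.0
    return train

def circular_convolve(response, train):
    """Circular convolution via multiplication in the frequency domain."""
    return np.real(np.fft.ifft(np.fft.fft(response) * np.fft.fft(train)))

# e.g. a convolved MLR source from a windowed single-impulse MLR estimate:
# mlr_source = circular_convolve(mlr_single, impulse_train(95.0, EPOCH, FS))
```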
5 Constrained ICA Results
As indicated in Section 2, the data were whitened prior to applying the cICA algorithm. A dimension reduction was carried out from 28 mixtures to only 9 principal components, accounting for 99% of the data variance. The two main benefits of dimension reduction are reducing noise and preventing overlearning. The previous section discussed a simulation in which it was assumed that the ABR and MLR source signals could be modelled as the response to a single impulse convolved with the stimulus impulse train. This further assumes that the ABR and MLRs can themselves each be represented by a single source. The adaptation process known to exist at high stimulus rates was not included in the simulation or source model. As a first approximation this simple model may be useful in analysing the experimental data. The MLR- and ABR-based sources from the simulation were used as the two reference signals for the cICA algorithm. A deflationary orthogonalization procedure was used, in which the first independent component was extracted using the MLR reference, then the second using the ABR reference, and finally the remaining unconstrained components. Fig. 3(a)
shows the reference and the extracted component for the convolved MLR reference for stimulus set 2, with an inter-impulse rate of 95 Hz. A correlation of ζry = 0.94 was obtained between the reference and the extracted component. The second reference and extracted component are shown in Fig. 3(b). Here a correlation of only ζry = 0.38 was obtained at best after a number of runs of the algorithm. The results for the 198 Hz inter-impulse-rate data were similar for the MLR-based source (ζry = 0.92); however, the ABR source never attained a correlation greater than ζry = 0.11.

Fig. 3. Extracted independent components for (a) the convolved MLR-based reference (correlation ζry = 0.94); and (b) the convolved ABR-based reference (correlation ζry = 0.38)
6 Discussion
The poor results obtained for the ABR extracted components potentially indicate one or more problems. (1) The nonstationarity-of-variance contrast function is not appropriate here, due to the periodic 4th cross-cumulant; the results from the simulation would tend not to support this. (2) The correlation constraint metric used might be inappropriate for the data; however, the constraint only guides the algorithm toward an independent component, as defined by the contrast function. (3) The underlying source model assumed here might be inappropriate. For the 198 Hz stimulus rate, neural adaptation effects might explain the poor performance. However, each ABR response is known to have contributions from multiple auditory brainstem structures [10]. The predominant recorded surface-electrode potentials for ABRs are believed to come from a propagating action potential along the auditory pathway, rather than from more cortically located post-synaptic potentials. Thus ABRs are generated by a time-varying source propagating through the neural structures from the brainstem up to the mid-brain. This nonstationarity manifests as small changes in the morphology of the ABR seen across the channels in the averaged event-related potential, i.e. the peak of
the wave V may occur at a latency of 6 ms, say, in an ipsilaterally located electrode, and at a latency of 6.2 ms at a contralateral electrode site. This fact may limit the performance of the source separation algorithm used here. The nonstationarity-of-variance based extraction method may be sensitive to the small nonstationarity associated with ABRs over time across the electrodes. The author also applied the ICA-with-reference algorithm from [7,8,5], with its negentropy contrast function. This approach might be expected to yield better results due to its insensitivity to the temporal structure of the data: negentropy is a measure of non-Gaussianity, and thus time-shifting the source yields identical results to the original. However, almost identical results were obtained to those of the nonstationarity-of-variance contrast function. Thus the nonstationarity of the ABR generating mechanisms may not play a role in its identifiability. Both algorithms may also 'incorrectly' separate the ABR source into the remaining unconstrained components, though this did not appear to be the case on inspection. More work needs to be done to verify or challenge these observations. The choice of delay or lag for the cumulant estimation is very important. The convolved model used here is too simple and did not aid the extraction of the real sources from the mixtures. Ironically, the de facto choice of delay τ = 1 used here produced the best results. It is not clearly understood why, and therefore more work is needed.
References

1. Dau, T., Püschel, D., Kohlrausch, A.: A quantitative model of the effective signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. 99(6), 3615–3622 (1996)
2. Hyvärinen, A.: Blind Source Separation by Nonstationarity of Variance: A Cumulant-Based Approach. IEEE Transactions on Neural Networks 12(6), 1471–1474 (2001)
3. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience, Chichester (2001)
4. Iyer, D., Boutros, N.N., Zouridakis, G.: Independent Component Analysis of Multichannel Auditory Evoked Potentials. In: Proc. of the 2nd Joint EMBS/BMES Conference, pp. 204–205 (2002)
5. James, C., Gibson, O.J.: Temporally Constrained ICA: An Application to Artifact Rejection in Electromagnetic Brain Signal Analysis. IEEE Transactions on Biomedical Engineering 50, 1108–1116 (2003)
6. Junius, D., Dau, T.: Influence of cochlear traveling wave and neural adaptation on auditory brainstem responses. Hearing Research 205, 53–67 (2005)
7. Lu, W., Rajapakse, J.C.: ICA with reference. In: Proc. 3rd Int. Conf. Independent Component Analysis and Blind Signal Separation: ICA2001, pp. 120–125 (2001)
8. Lu, W., Rajapakse, J.C.: Approach and Applications of Constrained ICA. IEEE Transactions on Neural Networks 16(1), 203–212 (2005)
9. Makeig, S., Jung, T.P., Bell, A., Ghahremani, D., Sejnowski, T.J.: Blind Separation of Auditory Event-Related Brain Responses into Independent Components. PNAS 94(20), 10980–10984 (1997)
10. Melcher, J.R., Kiang, N.Y.S.: Generators of the brainstem auditory evoked potential in cat III: identified cell populations. Hearing Research 93, 52–71 (1996)
11. Riedel, H., Granzow, M., Kollmeier, B.: Single-sweep-based methods to improve the quality of auditory brainstem responses. Part II: Averaging methods. Z. Audiol. 40(2), 62–85 (2001)
A Constrained ICA Algorithm
The constrained optimization problem can be solved using the method of Lagrange multipliers, transforming the inequality constraint g(w) ≤ 0 into an equality constraint by introducing a slack variable. Following [7,8], the augmented Lagrangian function L₁(w, α, β) is given by:

$$\mathcal{L}_1(\mathbf{w}, \alpha, \beta) = J(y) - \frac{1}{2\gamma}\max^2\left\{\alpha + \gamma\left(\zeta - E\{r\,(\mathbf{w}^T\mathbf{z})\}\right),\ 0\right\} - \beta\left(E\{(\mathbf{w}^T\mathbf{z})^2\} - 1\right) - \frac{1}{2}\gamma\left\|E\{(\mathbf{w}^T\mathbf{z})^2\} - 1\right\|^2 \qquad (4)$$

where α and β are the Lagrange multipliers, γ is the scalar penalty parameter, ‖·‖ denotes the Euclidean norm, and ½γ‖h(w)‖² is a penalty term to help the convergence of the algorithm [7]. The maximum of the augmented Lagrangian is found via a Newton learning algorithm:

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\left(\mathcal{L}_{1\mathbf{w}_k^2}\right)^{-1}\mathcal{L}_{1\mathbf{w}_k} \qquad (5)$$
where k is the iteration index and η is a step-size parameter that can change with the iteration count until satisfactory convergence occurs. $\mathcal{L}_{1\mathbf{w}_k}$ and $\mathcal{L}_{1\mathbf{w}_k^2}$ are respectively the Jacobian and Hessian of the augmented Lagrangian function L₁(w, α, β). It can be shown that the Jacobian is given by:

$$\mathcal{L}_{1\mathbf{w}_k} = \chi\left(2E\{\mathbf{z}(\mathbf{w}^T\mathbf{z})(\mathbf{w}^T\mathbf{z}_\tau)^2\} + 2E\{\mathbf{z}_\tau(\mathbf{w}^T\mathbf{z}_\tau)(\mathbf{w}^T\mathbf{z})^2\} - 4\mathbf{M}\mathbf{w}\,E\{(\mathbf{w}^T\mathbf{z})(\mathbf{w}^T\mathbf{z}_\tau)\} - 4\mathbf{w}\right) - \frac{\alpha}{2}E\{r\mathbf{z}\} - \beta\mathbf{w} \qquad (6)$$

where $\chi = \mathrm{sgn}\left(\mathrm{cum}\left(\mathbf{w}^T\mathbf{z},\ \mathbf{w}^T\mathbf{z},\ \mathbf{w}^T\mathbf{z}_\tau,\ \mathbf{w}^T\mathbf{z}_\tau\right)\right)$, sgn(·) is the sign function, and M is symmetric, being the sum of the covariance matrices $\mathbf{C}_{\mathbf{z}\mathbf{z}_\tau} = E\{\mathbf{z}\mathbf{z}_\tau^T\}$ and $\mathbf{C}_{\mathbf{z}_\tau\mathbf{z}} = E\{\mathbf{z}_\tau\mathbf{z}^T\}$. Similarly, with a little algebra it is possible to approximate the Hessian by:

$$\mathcal{L}_{1\mathbf{w}_k^2} \approx -4\chi\,\mathbf{M}\mathbf{w}\mathbf{w}^T\mathbf{M} + \beta\mathbf{I} \qquad (7)$$
where I is the identity matrix. The optimum Lagrange multipliers can be found using an iterative gradient-ascent method [7,8]:

$$\alpha_{k+1} = \max\left\{0,\ \alpha_k + \gamma\left(\zeta - E\{r\,(\mathbf{w}^T\mathbf{z})\}\right)\right\}, \qquad \beta_{k+1} = \beta_k + \gamma\left(E\{(\mathbf{w}^T\mathbf{z})^2\} - 1\right) \qquad (8)$$

Thus the learning algorithm for finding the maximum of the augmented Lagrangian can be formulated from Eqns. 5–8.
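A compact sketch of how Eqns. 5–8 might be iterated is given below. It is a best-effort reading of this appendix, not the author's code: it assumes whitened, dimension-reduced mixtures z (so that h(w) = 0 reduces to unit norm), and the parameter values (ζ, γ, η, iteration count) are illustrative.

```python
import numpy as np

def cica_one_unit(z, r, tau=1, zeta=0.5, gamma=1.0, eta=0.1, n_iter=200):
    """One-unit constrained ICA (Eqns. 4-8).
    z: whitened mixtures, shape (d, T); r: reference signal, shape (T,)."""
    d, T = z.shape
    w = np.random.default_rng(0).standard_normal(d)
    w /= np.linalg.norm(w)
    alpha, beta = 0.1, 0.1
    z0, zt = z[:, tau:], z[:, :-tau]              # z(t) and z(t - tau)
    M = (z0 @ zt.T + zt @ z0.T) / (T - tau)       # sum of lagged covariances
    Erz = z @ r / T                               # E{r z}
    for _ in range(n_iter):
        y0, yt = w @ z0, w @ zt
        cum = (np.mean(y0**2 * yt**2) - np.mean(y0**2) * np.mean(yt**2)
               - 2 * np.mean(y0 * yt)**2)
        chi = np.sign(cum)
        # Jacobian of the augmented Lagrangian, Eqn. (6)
        grad = chi * (2 * z0 @ (y0 * yt**2) / (T - tau)
                      + 2 * zt @ (yt * y0**2) / (T - tau)
                      - 4 * (M @ w) * np.mean(y0 * yt)
                      - 4 * w) - 0.5 * alpha * Erz - beta * w
        # Approximate Hessian, Eqn. (7)
        Mw = M @ w
        hess = -4 * chi * np.outer(Mw, Mw) + beta * np.eye(d)
        w = w - eta * np.linalg.solve(hess, grad)  # Newton step, Eqn. (5)
        w /= np.linalg.norm(w)                     # enforce h(w) = 0
        # Gradient-ascent multiplier updates, Eqn. (8)
        alpha = max(0.0, alpha + gamma * (zeta - np.mean(r * (w @ z))))
        beta = beta + gamma * (np.mean((w @ z)**2) - 1.0)
    return w
```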
Gradient Convolution Kernel Compensation Applied to Surface Electromyograms

Aleš Holobar¹ and Damjan Zazula²

¹ LISiN, Politecnico di Torino, Torino, Italy
[email protected], http://storm.uni-mb.si
² Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
Abstract. This paper introduces a gradient-based method for robust assessment of sparse pulse sources, such as motor unit innervation pulse trains in the field of electromyography. The method employs multichannel recordings and is based on Convolution Kernel Compensation (CKC). In the first step, the unknown mixing channels (convolution kernels) are compensated, while in the second step the natural gradient algorithm is used to blindly optimize the estimated source pulse trains. The method was tested on simulated mixtures with random mixing matrices, on synthetic surface electromyograms, and on real surface electromyograms recorded from the external anal sphincter muscle. The results prove the method is highly robust to noise and enables complete reconstruction of up to 10 concurrently active motor units.
1 Introduction
Biomedical signals are an important but very complex source of information. They typically comprise contributions from many concurrently active sources, such as neurons and muscle fibers. The sources are commonly considered statistically independent (or at most weakly correlated), but their mixing process is virtually unknown. Therefore, the acquired signals must be decomposed blindly. When it comes to neurophysiology, electromyograms (EMG) are one of the most active research areas. They measure the electrical activity of skeletal muscles and provide an insight into the peripheral properties of skeletal muscles and the control strategies of the human motor cortex [1]. Their field of application ranges from clinical assessment of neuromuscular disorders and objective evaluation of medical treatments to basic investigation of different physiological phenomena (e.g. cramps, muscle reinnervation, etc.). Comprising action potentials (APs) from several tens of concurrently active motor units (MUs), an EMG signal is commonly considered a random interference pattern which is very difficult to interpret [2]. This is especially true in the case of surface electromyography [1], which deals with measuring the electrical activity of human muscles on the surface of the skin. Bad electrode-to-skin contact and the presence of noise hinder the decomposition of surface EMG and make the extraction of clinically relevant information difficult.
The recent development of high-density surface electrode arrays has enabled the acquisition of several tens or even hundreds of EMG channels. Different pattern recognition techniques capable of dealing with such amounts of data have also been proposed. Kleine et al. [3] studied the importance of two-dimensional spatial filters, Gazzoni et al. [4] introduced a template-matching segmentation and classification technique, while Wood et al. [5] employed finite element analysis. Blind source separation decomposition techniques have also been proposed. Garcia et al. [6] modelled the EMG signal as an instantaneous mixture of motor unit action potential (MUAP) trains, while Holobar and Zazula [7] proposed Convolution Kernel Compensation (CKC) to deal with convolutive mixtures of motor unit innervation pulse trains (IPTs). The latter proved to be highly efficient, as it enables the complete reconstruction of up to 30 concurrently active MUs from good-quality multichannel surface EMG. In this paper a gradient-based extension of the CKC method [7], applied to low-quality noisy signals, is addressed. This extension is of paramount importance for clinical practice, where the recording environment cannot be strictly controlled. This paper is organized into five sections. In Section 2, the assumed data model is presented and the classic CKC approach is briefly summarized. Section 3 introduces its gradient-based extension, while in Section 4 the results of tests on simulated and real EMG signals are presented. Section 5 discusses the results and concludes the paper.
2 Data Model and Convolution Kernel Compensation
Suppose M convolutive measurements are simultaneously observed, and denote their sampled vector by x(n) = [x_1(n), ..., x_M(n)]^T, where x_i(n) stands for the n-th sample of the i-th measurement. In the case of a linear time-invariant (LTI) multiple-input multiple-output (MIMO) system, x(n) can be written as:

$$\mathbf{x}(n) = \mathbf{H}\,\bar{\mathbf{t}}(n) + \boldsymbol{\omega}(n) \qquad (1)$$
where ω(n) = [ω_1(n), ..., ω_M(n)]^T is a zero-mean, spatially and temporally white additive noise vector, t̄(n) = [t_1(n), t_1(n−1), ..., t_1(n−L+1), ..., t_N(n), ..., t_N(n−L+1)]^T is the extended version of the vector of input signals from N sources, t(n) = [t_1(n), ..., t_N(n)]^T, and the mixing matrix H comprises all the channel responses (convolution kernels) h_ij = [h_ij(0), ..., h_ij(L−1)], each of length L samples:

$$\mathbf{H} = \begin{bmatrix} h_{11}(0) & \cdots & h_{11}(L-1) & h_{12}(0) & \cdots & h_{12}(L-1) & \cdots \\ h_{21}(0) & \cdots & h_{21}(L-1) & h_{22}(0) & \cdots & h_{22}(L-1) & \cdots \\ \vdots & & \vdots & \vdots & & \vdots & \\ h_{M1}(0) & \cdots & h_{M1}(L-1) & h_{M2}(0) & \cdots & h_{M2}(L-1) & \cdots \end{bmatrix} \qquad (2)$$

In surface electromyography, the channel response h_ij corresponds to the j-th MUAP, as detected by the i-th measurement, while each model input t_j(n) is a
sample of an IPT, modelled as a binary pulse sequence carrying information about the MUAP triggering times only:

$$t_j(n) = \sum_{k=-\infty}^{\infty} \delta\left[n - T_j(k)\right] \qquad (3)$$

where δ(·) denotes the Dirac impulse and T_j(k) stands for the time instant at which the k-th MUAP of the j-th MU appears.

2.1 Convolution Kernel Compensation
The CKC method compensates the unknown mixing matrix H in model (1) and directly estimates the innervation pulse trains t̂_j(n) [7]:

$$\hat{t}_j(n) = \mathbf{c}_{t_j\mathbf{x}}^T \mathbf{C}_{\mathbf{xx}}^{-1}\,\mathbf{x}(n) \qquad (4)$$

where $\mathbf{C}_{\mathbf{xx}} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$ is the correlation matrix of the measurements, $\mathbf{c}_{t_j\mathbf{x}} = E\{\mathbf{x}(n)t_j(n)\}$ is the cross-correlation vector, and E(·) denotes mathematical expectation. Estimator (4) is the linear minimum mean square error (LMMSE) estimator of the j-th IPT and requires the cross-correlation vector $\mathbf{c}_{t_j\mathbf{x}}$ to be known in advance. This is rarely the case, and Holobar and Zazula [7] proposed a probabilistic iterative procedure for its blind estimation. In the first iteration step, the unknown cross-correlation vector is approximated by a vector of measurements, $\hat{\mathbf{c}}_{t_j\mathbf{x}} = \mathbf{x}(n_1)$, where, without loss of generality, we assume the j-th MU discharged at time instant $n_1$. Then the first estimate of the j-th IPT yields

$$\hat{t}_j(n) = \hat{\mathbf{c}}_{t_j\mathbf{x}}^T \mathbf{C}_{\mathbf{xx}}^{-1}\,\mathbf{x}(n) \qquad (5)$$
In the second step, the largest peak in t̂_j(n) is selected as the most probable candidate for the second discharge of the j-th source, $n_2 = \arg\max_{n \ne n_1} \hat{t}_j(n)$, and the vector $\hat{\mathbf{c}}_{t_j\mathbf{x}}$ is updated as:

$$\hat{\mathbf{c}}_{t_j\mathbf{x}} = \frac{\hat{\mathbf{c}}_{t_j\mathbf{x}} + \mathbf{x}(n_2)}{2} \qquad (6)$$
This procedure is then repeated, with special attention paid to the separation of superimposed pulse sources (note that more than one source may be active at instant $n_1$). The interested reader is referred to [7] for a further description of the classic CKC approach; a minimal sketch of the iteration is given below.
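The following is a minimal sketch of this classic CKC iteration (Eqns. 4–6), not the authors' implementation; the number of update iterations, the regularisation of C_xx and the peak-selection rule are illustrative assumptions, and the handling of superimposed sources is omitted.

```python
import numpy as np

def ckc_estimate(x, n1, n_updates=20, reg=1e-6):
    """Classic CKC iteration. x: (M, T) measurement matrix; n1: time index
    of an assumed discharge of the target MU. Returns the estimated IPT."""
    M, T = x.shape
    Cxx_inv = np.linalg.inv(x @ x.T / T + reg * np.eye(M))
    c = x[:, n1].astype(float).copy()     # first approximation of c_{tj x}
    used = {n1}
    for _ in range(n_updates):
        t_hat = c @ Cxx_inv @ x           # Eqn. (5): current IPT estimate
        order = np.argsort(t_hat)[::-1]   # candidate discharges, largest first
        n2 = next(int(n) for n in order if int(n) not in used)
        used.add(n2)
        c = (c + x[:, n2]) / 2.0          # Eqn. (6): cross-correlation update
    return c @ Cxx_inv @ x
```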
3 Gradient Descent Optimization of Estimated Pulse Trains
Let us use the shorthand notation $\mathbf{w}_{j,k} = \mathbf{C}_{\mathbf{xx}}^{-1}\hat{\mathbf{c}}_{t_j\mathbf{x}}$ to denote the estimate of the j-th separation vector in the k-th iteration step, and let $F(\hat{t}_j) = \sum_m f(\hat{t}_j(m))$ denote the
sample cost function in the manifold of pulse trains, with an arbitrary differentiable scalar function f(t) applied to each sample of the pulse train t̂_j(n). The general iteration step of the natural gradient descent algorithm is defined as [8]:

$$\mathbf{w}_{j,k+1} = \mathbf{w}_{j,k} - \eta(k)\,\mathbf{G}\,\Delta\mathbf{w}_{j,k} \qquad (7)$$
where η(k) is the learning rate and G is the Riemannian metric tensor embedding the manifold of pulse trains t(n) into the manifold of measurements x(n). For smooth manifolds, the induced metric tensor G can be computed as:

$$\mathbf{G} = \mathbf{H}^{-1}\mathbf{H}^{-T} \qquad (8)$$
where, in our case, H⁻¹ stands for the Jacobian of the system t(n) = H⁻¹x(n). Taking the common convention on the amplitude ambiguity of the trains t(n) into account, we assume $\mathbf{C}_{\bar{\mathbf{t}}\bar{\mathbf{t}}} = \mathbf{I}$. Hence, G can be written as $\mathbf{G} = \mathbf{H}^{-1}\mathbf{C}_{\bar{\mathbf{t}}\bar{\mathbf{t}}}^{-1}\mathbf{H}^{-T} = \mathbf{C}_{\mathbf{xx}}^{-1}$. Finally, by expressing the gradient $\Delta\mathbf{w}_{j,k}$ in terms of the cost function $F(\hat{t}_j)$ we arrive at the following gradient update rule:

$$\mathbf{w}_{j,k+1} = \mathbf{w}_{j,k} - \eta(k)\sum_m \frac{\partial f(\hat{t}_j(m))}{\partial \hat{t}_j(m)}\,\mathbf{C}_{\mathbf{xx}}^{-1}\mathbf{x}(m) \qquad (9)$$
There is yet another possible interpretation of update rule (9). By rewriting (9) in terms of (5) we get the update rule for the cross-correlation vector $\hat{\mathbf{c}}_{t_j\mathbf{x}}$:

$$\hat{\mathbf{c}}_{t_j\mathbf{x}} = \hat{\mathbf{c}}_{t_j\mathbf{x}} - \eta(k)\sum_m \frac{\partial f(\hat{t}_j(m))}{\partial \hat{t}_j(m)}\,\mathbf{x}(m) \qquad (10)$$
which provides insight into the convergence properties of algorithm (9). Namely, by selecting ∂f(t)/∂t to be a concave even function, e.g. ∂f(t)/∂t = t², the peaks in t̂_j(n) get reinforced (i.e. the corresponding measurement vectors x(m) in (10) get multiplied by large weights), while the baseline noise (i.e. values close to zero) is suppressed. The steeper the function ∂f(t)/∂t, the larger the weights multiplying the peaks in t̂_j(n), and the faster the convergence of (10). However, by setting the peak weights too high, we jeopardize the stability of convergence, as it may happen that the highest peak in t̂_j(n) outweighs all the others. In such a case, $\hat{\mathbf{c}}_{t_j\mathbf{x}}$ converges to x(m_p), where m_p denotes the time instant of the largest peak in t̂_j(n). Typically, ∂f(t)/∂t = |t| or ∂f(t)/∂t = t² prove to be a good compromise between the speed and stability of convergence, yielding the cost functions $F(\hat{t}_j) = \frac{1}{2}\sum_m \hat{t}_j^2(m)$ and $F(\hat{t}_j) = \frac{1}{3}\sum_m \hat{t}_j^3(m)$, respectively.
The gradient descent algorithm (9) still requires a good initial approximation of t_j(n) in order to converge to the genuine solution (i.e. the peaks in t̂_j(n) must represent the true train pulses). As demonstrated in the next section, the CKC approximation (5) with $\hat{\mathbf{c}}_{t_j\mathbf{x}} = \mathbf{x}(n_1)$ proves to be a good initialization point. The required initial time instants $n_1$ can be selected from the activity index $I_A(n) = \mathbf{x}^T(n)\mathbf{C}_{\mathbf{xx}}^{-1}\mathbf{x}(n)$, as suggested in [7].
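A minimal sketch of the resulting refinement with f′(t) = t² is given below, building on the classic CKC sketch above; the learning rate and iteration count are illustrative, the bisection-based step adaptation used in Section 4 is omitted, and the update sign is chosen here so that peaks are reinforced, as the text describes.

```python
import numpy as np

def gradient_ckc(x, c0, eta=0.05, n_iter=30, reg=1e-6):
    """Gradient refinement of the cross-correlation vector (Eqn. 10) with
    f'(t) = t**2. x: (M, T) measurements; c0: initial estimate of c_{tj x}."""
    M, T = x.shape
    Cxx_inv = np.linalg.inv(x @ x.T / T + reg * np.eye(M))
    c = c0.astype(float).copy()
    for _ in range(n_iter):
        t_hat = c @ Cxx_inv @ x            # current pulse-train estimate
        c = c + eta * (x @ t_hat**2) / T   # sum over m of f'(t_hat(m)) x(m)
        c /= np.linalg.norm(c)             # remove the arbitrary scale of c
    return c @ Cxx_inv @ x                 # refined IPT estimate
```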
4 Simulation and Experimental Results
The gradient CKC method was applied to three different sets of test signals. The first two experiments evaluated the influence of noise in the case of multichannel synthetic measurements with a well-conditioned random mixing matrix H and in the case of badly conditioned synthetic surface EMG measurements, while in the third experiment the method was applied to recordings of the external anal sphincter muscle. In all three experiments the scalar function f(t) = (1/3)t³ was used in (9), while in each iteration step the adaptive learning rate η(k) was adjusted according to bisection. The sensitivity of the gradient CKC algorithm, the false alarm rate and the number of reconstructed pulse trains were observed and compared to the results of the classic CKC approach.

Experiment 1: Ten simulation runs were performed, with the number of sources N set equal to 10. In each run, random input pulse trains $t_j(n) = \sum_{k=1}^{200} \delta(n - k \cdot 100 + T_j(k))$ were generated, with the mean inter-pulse interval (IPI) set equal to 100 samples and the values T_j(k), k = 1, 2, ..., 200, uniformly distributed on the interval [−10, 10]. The length of the simulated pulse trains t was 20,000 samples. A random zero-mean mixing matrix H was generated, with convolution kernels L = 10 samples long. The number of observations M was set equal to 25. Seven realizations of Gaussian zero-mean noise were simulated for each generated signal (1), with SNR ranging from 20 dB to −10 dB. In order to increase the number of measurements, 9 delayed repetitions of each original measurement were used as additional measurements [7]. As a result, the number of extended pulse trains increased to 190, while the number of observations was fixed at 250. Each mixture was decomposed twice: by classic and by gradient CKC. The results are summarized in Fig. 1. The gradient CKC method converged after 15 iterations, on average.
Fig. 1. Number of reconstructed IPTs (left column), True Positive rate (TP, central column) and False Positive rate (FP, right column) for gradient CKC (black line) and classic CKC (gray line). The results are averaged over 10 simulation runs (error bars indicate standard deviations). In each run, a random mixing matrix H was generated, with the condition number set equal to 240 ± 10 (upper row) and 1200 ± 50 (bottom row).
Experiment 2: Synthetic surface EMG signals were generated by a cylindrical volume conductor model consisting of bone, muscle, subcutaneous and skin tissues [2]. The biceps brachii muscle, with 200 MUs and a 200 mm² cross-section, was simulated. The distribution of MU locations was random, and the fibers of a MU were randomly scattered in a circular MU territory, with a density of 20 fibers/mm². Exponentially distributed innervation numbers ranged from 25 to 2500. The surface-recorded MUAP comprised the sum of the action potentials of the muscle fibers belonging to the MU. The MUs had muscle fiber conduction velocities of 4 ± 0.3 m/s. The recording system was a grid of 13 × 5 electrodes of circular shape (1 mm radius) with 5 mm interelectrode distance in both directions. The EMG signals were additionally corrupted by additive zero-mean Gaussian noise between 20 and 0 dB SNR. A 10-second contraction at 10% excitation level (constant over time) was simulated, which resulted in 105 active MUs. The MU IPTs were based on a motor unit population recruitment model [9], with the recruitment and peak discharge rates set to 8 and 35 pulses per second. As in Experiment 1, 9 delayed repetitions of each original measurement were added to the original set of measurements. The results of CKC decomposition, averaged over 25 Monte Carlo runs, are depicted in Fig. 2.
Fig. 2. Number of reconstructed IPTs (left panel), True Positive rate (TP) (central panel) and False Positive rate (FP) (right panel) for gradient CKC (black line) and classic CKC (gray line) when decomposing synthetic surface EMG (Experiment 2). The results are averaged over 25 simulation runs (error bars indicate std. deviations).
Experiment 3: The last experiment was conducted on real surface EMG signals, recorded by a 48-channel cylindrical anal probe from the external anal sphincter muscle. The electrodes (1 × 10 mm) were arranged in 3 circumferential arrays of 16 electrodes each. The interelectrode and inter-array distances were 2.7 mm and 5 mm, respectively. The experiment was conducted at the Gynecological Clinic of the University of Tübingen, Germany, and was approved by the local ethics committee. Six subjects participated in the experiment. The signals were acquired during three 10 s long maximum voluntary contractions. The EMG signals were amplified, band-pass filtered (3 dB bandwidth, 10 Hz–500 Hz), sampled at 2 kHz, and converted to digital form by a 12-bit A/D converter. The acquired set of measurements was additionally extended by 9 delayed repetitions of each measurement. The results of decomposition are summarized in Table 1 and exemplified by Fig. 3.

Fig. 3. IPTs reconstructed from real surface EMG by classic CKC (left panel) and by gradient CKC (right panel). Each plotted dot corresponds to a single MU discharge. EMG signals were recorded from the external anal sphincter muscle.

Table 1. Number of MUs (mean ± std. dev.) reconstructed from real EMG signals

Subject         A          B          C          D          E          F
classic CKC     2.7 ± 0.6  3.7 ± 1.2  3.3 ± 0.6  3.3 ± 0.6  3.3 ± 0.6  2.3 ± 1.2
gradient CKC    6.3 ± 1.2  8.3 ± 1.2  4.7 ± 0.6  5.7 ± 1.2  8.0 ± 2.0  3.3 ± 0.6
5 Discussion and Conclusions
The gradient CKC method proved to be highly efficient. In low-noise environments it is equivalent to the classic CKC approach; in the presence of severe noise, however, it provides superior accuracy. The results on synthetic measurements with a random mixing matrix H proved that almost complete reconstruction of pulse trains is possible at an SNR of −5 dB. In the case of synthetic surface EMG, the robustness to noise was reduced. This was mainly due to the badly conditioned mixing process (in the case of surface EMG, the typical condition number of the mixing matrix H is about 10⁷). Nevertheless, up to 5 pulse trains were reconstructed down to an SNR of 10 dB, with the pulse accuracy exceeding 97%. When compared to the classic CKC approach at low SNR, gradient CKC yielded a 30% increase in the number of reconstructed IPTs and a 0.5% increase in the accuracy of the reconstructed pulse trains. Similar results were observed in the case of the real surface EMG signals acquired from the external anal sphincter. The general quality of the acquired signals was low, mainly due to bad electrode-mucosa contact and movement of the anal probe with respect to the muscle fibers. The gradient CKC technique reconstructed 6.2 ± 2.3 MUs per contraction (compared to 3.1 ± 0.8 MUs reconstructed by classic CKC).
MU discharge patterns reconstructed by gradient CKC also exhibited higher regularity than those of the classic CKC method, as verified by careful visual inspection. It is concluded that gradient CKC is highly robust to noise. In low-SNR environments it yields performance superior to the classic CKC approach, and it has the potential to be used in regular clinical practice, where the quality of acquired signals cannot be strictly controlled.
Acknowledgment. The authors are sincerely grateful to Prof. R. Merletti from Politecnico di Torino, Italy, and to Prof. P. Enck, Dr. H. Hinninghofen and Dr. T. Kreutz from the University of Tübingen, Germany, for their measurements of the external anal sphincter. This research was supported by a Marie Curie Intra-European Fellowship within the 6th European Community Framework Programme (Contract No. MEIF-CT-2006-023537) and by the German-Italian research project TASI.
References

1. Merletti, R., Parker, P.A.: Electromyography: Physiology, Engineering, and Noninvasive Applications. IEEE Press and John Wiley & Sons (2004)
2. Farina, D., Merletti, R.: A novel approach for precise simulation of the EMG signals detected by surface electrodes. IEEE Trans. Biomed. Eng. 48, 637–646 (2001)
3. Kleine, B.U., van Dijk, J.P., Lapatki, B.G., Zwarts, M.J., Stegeman, D.F.: Using two-dimensional spatial information in decomposition of surface EMG signals. J. of Electromyography and Kinesiology (2006)
4. Gazzoni, M., Farina, D., Merletti, R.: A new method for the extraction and classification of single motor unit action potentials from surface EMG signals. J. of Neuroscience Methods 136, 165–177 (2004)
5. Wood, S.M., Jarratt, J.A., Barker, A.T., Brown, B.H.: Surface electromyography using electrode arrays: a study of motor neuron diseases. Muscle Nerve 24, 223–230 (2001)
6. Garcia, G.A., Okuno, R., Akazawa, K.: A Decomposition Algorithm for Surface Electrode-Array Electromyogram: A Noninvasive, Three-Step Approach to Analyze Surface EMG Signals. IEEE Eng. in Med. Biol. Magazine 5, 63–71 (2005)
7. Holobar, A., Zazula, D.: Multichannel Blind Source Separation Using Convolution Kernel Compensation. IEEE Trans. Sig. Process. 56 (2007)
8. Amari, S.I.: Natural Gradient Works Efficiently in Learning. Neural Computation 10, 251–276 (1998)
9. Fuglevand, A.J., Winter, D.A., Patla, A.E.: Models of recruitment and rate coding organization in motor unit pools. J. Neurophysiol. 70, 2470–2488 (1993)
Independent Component Analysis of Functional Magnetic Resonance Imaging Data Using Wavelet Dictionaries

Robert Johnson¹,² (corresponding author), Jonathan Marchini¹, Stephen Smith², and Christian Beckmann²

¹ Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
[email protected]
² FMRIB (Oxford Centre for Functional Magnetic Resonance Imaging of the Brain), John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK
Abstract. Functional Magnetic Resonance Imaging (FMRI) allows indirect observation of brain activity through changes in blood oxygenation, which are driven by neural activity. ICA has become a popular exploratory analysis approach due to its advantages over regression methods in accounting for structured noise as well as signals of interest. However, standard ICA in FMRI ignores some of the spatial and temporal structure contained in such data. Using the prior biophysical knowledge that the Blood Oxygenation Level Dependent (BOLD) response is spatially smooth and manifests itself mainly at certain spatial scales, we estimate the unmixing matrix using only the coarse coefficients of a 3D Discrete Wavelet Transform (DWT), the scales at which the BOLD response presents itself most strongly. Tests on realistic synthetic FMRI data show improved accuracy, greater robustness to misspecification of the underlying dimensionality, and an approximately fourfold speed increase; in addition, the algorithm becomes parallelizable.

Keywords: Functional Magnetic Resonance Imaging, Independent Component Analysis, Biophysical Prior, Sparse Dictionaries.
1 Introduction
ICA offers several advantages over regression in FMRI: it can account for the large amounts of structured noise found in FMRI data, such as MR acquisition artefacts, physiological noise, and the stimulus-correlated head motion which often occurs. Furthermore, ICA has been used to study resting-state networks [3], to investigate the mechanisms of memory and decision-making, and to examine brain activity before an epileptic seizure [9]. In all cases the timings of task-related activity are difficult to specify accurately, which makes it difficult, if not impossible, to use simple linear regression.
Given an FMRI experiment with n voxels (a voxel is a cube of tissue, typically of size 3 mm³ in FMRI, compared to 1 mm³ in structural MRI) measured at p different timepoints, after the pre-processing steps of motion correction, removal of low-frequency drift, voxelwise de-meaning and voxelwise variance normalisation, the spatial structure is discarded to allow the construction of a p × n matrix X. We follow the methodology outlined in [2]: the generative model is that of linear mixing with additive Gaussian noise N,

$$\mathbf{X} = \mathbf{A}\mathbf{S} + \mathbf{N} \qquad (1)$$
where S includes not only BOLD components but also structured noise of physical and physiological origin, and A is the linear mixing matrix. Typically, a Singular Value Decomposition is performed to obtain a factorisation into orthogonal regression timecourses and their corresponding spatial maps. The data is partitioned into major and minor PCA subspaces, using the eigenspectrum of the data covariance matrix to estimate the underlying dimensionality q (the number of components). Within the standard PICA framework [2], ICA unmixing based on the maximisation of non-Gaussianity is performed in the major subspace within the spatial domain, with the minor subspace used as the noise component for subsequent hypothesis testing (see [2] for details). In the resulting decomposition, A is a p × q linear mixing matrix and the ICs span the major subspace, with each voxel having factor loadings of the ICs associated with it. Such an approach partially ignores knowledge about the spatiotemporal characteristics of FMRI data. In particular, standard linear decomposition techniques ignore neighbourhood relationships in the spatial and temporal domains once the data is represented as a 2D matrix X. We have physiological knowledge that BOLD activation is spatially sparse, due to the way the brain is organised, and smooth, due to the fluid characteristics of blood and the diffusion of oxygen from the bloodstream. We also know that functional brain anatomy is on a reasonably coarse scale compared to FMRI voxel sizes. In the temporal domain we know that the BOLD response is smooth at the timescales we observe, again because oxygen is supplied by diffusion from the bloodstream. In previous work [10] we incorporated assumptions about the temporal smoothness of the signals via a smoothness constraint in the temporal domain, substituting Regularised Principal Component Analysis (RPCA) [16] for PCA (RPCA constrains the factors to be represented by a B-spline basis set and applies a roughness penalty in addition to the existing orthogonality constraint). Such an approach is computationally efficient and has been shown to provide an increase in accuracy for block paradigms. However, a disadvantage is that the temporal variance is represented less effectively and that non-smooth components are estimated less efficiently. The non-smooth components may correspond to MR-physics related effects and can co-exist with smooth components which relate to physiological processes. Other researchers have applied a temporal smoothness constraint within the ICA algorithm [5,19] on only some components of interest, although they have assumed that they are analysing
experiments with a clearly defined experimental design to enable the identification of components of interest. This introduces some of the disadvantages of a regression analysis into the modified ICA algorithm. A way of incorporating prior biophysical knowledge while not being dependent on identifiable design timecourse(s) is to use a prior in the spatial domain. In Section 2 we describe an algorithm which uses Discrete Wavelet Transforms (DWTs) to incorporate our prior biophysical knowledge that the effect we attempt to observe is spatially smooth, sparse, and occurs at certain scales. We evaluate our proposed algorithm with realistic synthetic data in Section 3. Finally, in Section 4 we summarise our findings and discuss the demonstrated and potential advantages of constraining the ICA solution in the spatial domain rather than the temporal domain.
2 Algorithm
It has previously been demonstrated that ICA on natural signals operates more effectively when the signals are represented using a sparse dictionary [20]. We use a 3-level 3D separable DWT using 'Farras' wavelets [1,18] as our (complete) spatial dictionary Φ, since the multiresolution property proves useful. We zero-pad the FMRI volumes so they have dimensions which are multiples of 2³ to enable a 3-level decomposition, and perform the DWT on each FMRI volume. The filterbank analysis and synthesis algorithms mean that we never have to explicitly calculate the n × n matrix Φ. If X, S and N have p × n coefficient matrices in the dictionary Φ, denoted by C_mixtures, C_sources and C_noise, we have

$$\mathbf{X} = \mathbf{C}_{\text{mixtures}}\,\Phi \qquad (2)$$
with similar equations for C_sources and C_noise. Having postmultiplied by Φ⁻¹, we can restate Eqn. 1 as

$$\mathbf{C}_{\text{mixtures}} = \mathbf{A}\,\mathbf{C}_{\text{sources}} + \mathbf{C}_{\text{noise}} \qquad (3)$$
We can incorporate our spatial prior within the ICA unmixing by estimating the unknown mixing matrix A from only the father wavelet coefficients. Conveniently, Gaussian noise will present itself mainly in the more detailed wavelet levels, a property used for wavelet denoising. This representation differs from [20]'s notion of sparseness because instead of the majority of the wavelet coefficients being close to zero, which can only be assumed for the mother wavelets, we have summarised the important features of the data in the father coefficients. The father coefficients themselves are not sparse, but do form a sparse representation of the data since we are ignoring the mother coefficients. Denoting the number of father coefficients by r, we find that we will be estimating A using only 1/512 of the wavelet coefficients (since the transform is dyadic, we raise 2 to the power of 3 for the scale level and then to the power of 3 again for the number of dimensions). Denoting the truncated p × r coefficient matrices by C′_mixtures, C′_sources and C′_noise and the r × n truncated dictionary matrix by
Φ′, we can restate Eqn. 1 with A constrained to lie in the subspace of the father coefficients as

$$\mathbf{C}'_{\text{mixtures}} = \mathbf{A}\,\mathbf{C}'_{\text{sources}} + \mathbf{C}'_{\text{noise}} \qquad (4)$$

We previously used PCA [2,10] in the temporal domain to constrain the number of ICs to be equal to the estimated dimensionality of the data. Noiseless ICA is performed in the major PCA subspace, while the minor PCA subspace is treated as a Gaussian noise term for later hypothesis testing. If we perform PCA in the temporal domain on Eqn. 4 and retain only the major subspace, we have q × r coefficient matrices denoted by C*_mixtures, etc. The q rows contain factor loadings for regressed timecourses and the r columns contain the spatial coefficients. We now have the generative model:

$$\mathbf{C}^*_{\text{mixtures}} = \mathbf{A}\,\mathbf{C}^*_{\text{sources}} \qquad (5)$$
We make a number of assumptions in order for ICA to be applicable:

– The underlying sources of our FMRI data are statistically independent.
– The independent components have non-Gaussian distributions (the minor PCA subspace we discarded is assumed to contain the Gaussian noise).
– We have correctly identified the dimensionality q of the data, so our unknown mixing matrix is square.

A is estimated by performing noiseless ICA in the spatial domain on Eqn. 5 using the natural gradient Infomax algorithm [15] implemented in [14]. Next, we project C_mixtures onto the regressed PCA timecourses and use Â⁻¹ to unmix the result, and finally apply the inverse DWT to the source estimates. Our BOLD ICs should now be estimated more accurately, since unmixing has been performed in the scale subspace where the BOLD signals represent themselves most strongly.
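A minimal sketch of this pipeline is given below, using PyWavelets for the 3D DWT and FastICA as a stand-in for the natural-gradient Infomax implementation cited above; the wavelet name (`db4` substituting for the 'Farras' wavelets), the dimensionality q and all variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np
import pywt
from sklearn.decomposition import FastICA   # stand-in for Infomax ICA

def wavelet_ica(volumes, q, wavelet="db4", level=3):
    """volumes: list of p pre-processed 3D FMRI volumes.
    Returns the estimated mixing matrix A and the father-coefficient sources."""
    # 1. 3-level 3D DWT of each volume; keep only the father (approximation)
    #    coefficients for unmixing
    fathers = [pywt.wavedecn(v, wavelet=wavelet, level=level)[0].ravel()
               for v in volumes]
    C = np.asarray(fathers)                    # p x r father-coefficient matrix

    # 2. PCA in the temporal domain: retain the q-dimensional major subspace
    C -= C.mean(axis=0)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    C_major = (U[:, :q] * s[:q]) @ Vt[:q]      # rank-q approximation, p x r

    # 3. Spatial ICA on the father coefficients
    ica = FastICA(n_components=q, max_iter=1000)
    S_father = ica.fit_transform(C_major.T).T  # q x r spatial sources
    A = ica.mixing_                            # p x q estimated mixing matrix
    return A, S_father
```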
3 Evaluation
Two FMRI datasets were obtained: one from a subject at rest, and one from the same subject hearing a 30-second-on, 30-second-off boxcar auditory stimulus. The second dataset was analysed within the general linear model framework as implemented in FEAT [17], and the activation maps found were thresholded at a fixed Z-statistic level. The intensity of the remaining activation map was linearly mapped to the range [0.1, 1] and combined in the temporal domain with a simulated boxcar stimulus convolved with a gamma-based haemodynamic response function. The resulting 4D activation was embedded in the resting data to produce a synthetic FMRI dataset with known spatial activation maps and timecourse. This was done at Contrast-to-Noise Ratios (CNRs) from 0% to 200% in 10% increments by scaling the timecourse amplitude accordingly.
Fig. 1. Comparison of performance of PICA with the spatial prior proposed in this paper and PICA with a temporal prior (implemented by using RPCA instead of PCA) against standard PICA. For the spatial prior a 'step' is visible: nothing is resolved below 80% CNR, and much higher accuracy is achieved above 80%. The temporal prior outperforms standard PICA in the CNR range 30%–130% and from 140% does slightly worse, the smoothness constraint outweighing the benefits when the BOLD signal is very strong.
Fig. 2. Accuracy across number of components retained and peak % CNR. Results for a simple GLM are given in the bars at the top. A trade-off is apparent between greater sensitivity at lower CNRs for lower dimensionality estimates in the range 20–30 and greater accuracy at higher CNRs for dimensionality estimates of 35 upwards. (a) Accuracy of PICA with the spatial prior proposed in this paper. (b) Accuracy of standard PICA.
3.1 Test Metrics
Receiver operating characteristic (ROC) plots are frequently used to illustrate the accuracy of FMRI analysis techniques. Here, we calculated the Gini coefficient [8] as our measure of signal detection accuracy. The Gini coefficient provides a scalar measure, but typically integrates over the entire False Positive Rate (FPR) range. For the evaluation of the detection of signals from FMRI experiments, however, a range of 0% up to 10% FPR is more reasonable, since no meaningful conclusions can be drawn if a large percentage of the brain is known to be falsely classified as active. We therefore calculated the Gini measure in the 0–10% FPR range and normalised it to lie in the range [0,1].
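A small sketch of how such a truncated, normalised ROC measure might be computed from a voxelwise statistic map is given below; the exact normalisation convention (partial area over 0–10% FPR rescaled to [0,1]) is an illustrative assumption rather than the paper's precise definition.

```python
import numpy as np

def gini_10pct_fpr(scores, truth, max_fpr=0.10):
    """Partial ROC area up to max_fpr, rescaled to lie in [0, 1].
    scores: voxelwise statistic values; truth: boolean ground-truth map."""
    order = np.argsort(scores)[::-1]          # sweep thresholds, high to low
    t = truth[order].astype(float)
    tpr = np.cumsum(t) / t.sum()
    fpr = np.cumsum(1 - t) / (len(t) - t.sum())
    mask = fpr <= max_fpr
    return np.trapz(tpr[mask], fpr[mask]) / max_fpr
```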
3.2 Test Results
Figures 1 and 2 illustrate the performance of the proposed methodology on the simulated data. Figure 1 shows a line plot of the accuracy over CNRs obtained by standard PICA, by the new algorithm we have described, and by our previous work using a temporal smoothness prior. We speculate that the jump in accuracy using the spatial scale prior occurs because below a certain level of CNR too little information is available to enable effective unmixing. A similar effect can be observed in the plots in Fig. 2 when too few PCs are retained. Figure 2 shows plots of accuracy (colour-coded) across CNR and number of components retained. With 11 or more PCs retained, our proposed methodology gives accuracy values above 0.6 at CNRs of 80% upwards, with only 2 exceptions. This indicates that our proposed methodology is more robust to errors in specifying the underlying dimensionality of the data. Figure 2(b) shows that the standard methodology can vary considerably in its performance with small changes in dimensionality estimation. The algorithm proposed here is also faster than the standard PICA technique: the time taken to perform the DWTs is more than offset by a considerable reduction in the time taken to perform the ICA step, since the computationally intensive unmixing is performed on hundreds rather than tens of thousands of spatial coefficients, leading to an overall speed increase of more than four times; in addition, the DWTs are easily parallelizable while the ICA unmixing is not.
4 Conclusion
We have described a method of incorporating a spatial biophysical prior into ICA for FMRI and tested it on realistic synthetic data. Incorporating a prior in the spatial domain offers three advantages over the temporal domain: increased accuracy, increased speed, and the fact that no experimental design needs to be specified, unlike some methods of incorporating a temporal prior. Note that the alternative approach of using all the wavelet coefficients, or any other wavelet detail level, yielded no advantage over the standard PICA method, as we would expect if the benefit comes from the scale prior rather than merely from the sparse representation in the wavelet domain increasing the efficiency of ICA.
Elsewhere [11] we have evaluated using the Dual-Tree Complex DWT to denoise FMRI statistical maps obtained using ICA. Initial results suggest an increase in estimation accuracy to approximately 0.85 from a CNR level of 80% upwards. We are encouraged by this because 80% CNR is typical of the FMRI BOLD signals we observe, and at this level on our test data the standard PICA methodology in current use achieved an accuracy measure of only approximately 0.20 in our tests. An adaptive technique is probably required to incorporate some of the more detailed coefficients in the estimation step and overcome the poor accuracy achieved by our algorithm at 70% CNR and below in Fig. 1. One disadvantage of our method is that while BOLD signals of interest are estimated more accurately, there may be a decrease in the accuracy of estimating non-BOLD structured noise, which presents itself mainly at the scale levels we discard. A solution may be to estimate different classes of components using different subsets of wavelet coefficients. Neurological conclusions are typically drawn from only about a hundred functionally defined regions of the brain and areas on the cortical surface. Performing inference using anatomically informed basis functions [12] to inform the unmixing might therefore lead to further improvements. A spatial prior could also be incorporated into the analysis of multi-subject FMRI data. Particularly when using simultaneous group analysis techniques such as tensorial extensions to ICA [4], the unmixing of coefficients from a DWT would decrease the computational and memory requirements. Acknowledgments. We are grateful for EPSRC and MRC funding for our research and for the helpful comments from our anonymous reviewers.
References
1. Farras Abdelnour, A., Selesnick, I.: Symmetric Nearly Shift-Invariant Tight Frame Wavelets. IEEE Trans. on Signal Processing 53(1), 231–239 (2005)
2. Beckmann, C., Smith, S.: Probabilistic Independent Component Analysis for Functional Magnetic Resonance Imaging. IEEE Trans. on Medical Imaging 23(2), 137–152 (2005)
3. Beckmann, C., DeLuca, M., Devlin, J., Smith, S.: Investigations into resting-state connectivity using independent component analysis. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360(1457), 1001–1013 (2005)
4. Beckmann, C., Smith, S.: Tensorial Extensions of Independent Component Analysis for Group FMRI Data Analysis. NeuroImage 25(1), 294–311 (2005)
5. Calhoun, V., Adali, T., Stevens, M., Kiehl, K., Pekar, J.: Semi-blind ICA of fMRI: a Method for Utilizing Hypothesis-Derived Time Courses in a Spatial ICA Analysis. NeuroImage 25, 527–538 (2005)
6. Damoiseaux, J., Rombouts, S., Barkhof, F., Scheltens, P., Stam, C., Smith, S., Beckmann, C.: Consistent Resting-State Networks across Healthy Subjects. PNAS 103(37), 13848–13853 (2006)
7. Donoho, D., Johnstone, I.: Adapting to Unknown Smoothness via Wavelet Shrinkage. Journal of the American Statistical Association 90, 1200–1224 (1995)
8. Fawcett, T.: ROC Graphs: Notes and Practical Considerations for Researchers. Technical report, HP Laboratories, Palo Alto, CA (2004)
9. Federico, P., Abbott, D., Briellmann, R., Harvey, A., Jackson, G.: Functional MRI of the pre-ictal state. Brain 128(8), 1811–1817 (2005)
10. Johnson, R., Marchini, J., Smith, S., Beckmann, C.: Temporal Regularisation for PCA Applied to FMRI. In: Twelfth Annual Meeting of the Organization for Human Brain Mapping (2006)
11. Johnson, R., Marchini, J., Smith, S., Beckmann, C.: Wavelet denoising as a post-processing step for ICA applied to FMRI. In: Thirteenth Annual Meeting of the Organization for Human Brain Mapping (2007)
12. Kiebel, S., Goebel, R., Friston, K.: Anatomically Informed Basis Functions. NeuroImage 11(6), 656–667 (2005)
13. Kingsbury, N.: Image processing with complex wavelets. Phil. Trans. Royal Society London A 357(1760), 2543–2560 (1999)
14. Makeig, S., et al.: EEGLAB: ICA Toolbox for Psychophysiological Research. WWW site, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, http://www.sccn.ucsd.edu/eeglab/
15. Makeig, S., Bell, A., Jung, T.-P., Sejnowski, T.: Independent component analysis of electroencephalographic data. In: Touretzky, D., Mozer, M., Hasselmo, M. (eds.) Advances in Neural Information Processing Systems, vol. 8, pp. 145–151. MIT Press, Cambridge, MA (1996)
16. Ramsay, J., Silverman, B.: Functional Data Analysis. Springer, New York (1997)
17. Smith, S.M., et al.: FEAT, part of the FSL suite. Accessed March 2007, http://www.fmrib.ox.ac.uk/fsl/feat5
18. Cai, S., Li, K., Selesnick, I.: Matlab implementation of wavelet transforms. Accessed May 2007, http://taco.poly.edu/WaveletSoftware/
19. Valente, G., De Martino, F., Balsi, M., Formisano, E.: Optimizing ICA Using Generic Knowledge of the Sources. Technical report, Universita degli Studi di Roma La Sapienza, Rome, Italy (2005)
20. Zibulevsky, M., Pearlmutter, B., Bofill, P., Kisilev, P.: Blind Source Separation by Sparse Decomposition in a Signal Dictionary. In: Independent Component Analysis: Principles and Practice, pp. 181–208. Cambridge University Press, Cambridge (2001)
Multivariate Analysis of fMRI Group Data Using Independent Vector Analysis

Jong-Hwan Lee1, Te-Won Lee2, Ferenc A. Jolesz1, and Seung-Schik Yoo1,3

1 Dept. of Radiology, Brigham and Women's Hospital, Harvard Medical School, MA, USA [email protected]
2 Institute for Neural Computation, University of California at San Diego, La Jolla, CA, USA
3 Dept. of BioSystems, Korea Advanced Institute of Science and Technology, Daejeon, Korea
Abstract. A multivariate, non-parametric approach to the processing of fMRI group data is important to address the variability of hemodynamic responses across subjects, sessions, and brain regions. Independent component analysis (ICA) has a limitation during the inference of group effects due to the permutation problem of independent components. To address this limitation, we present an independent vector analysis (IVA) for the processing of fMRI group data. Compared to ICA, IVA offers an extra dimension for dependent parameters, which can be assigned to the automated grouping of dependent activation patterns across subjects. The IVA was applied to fMRI data obtained from 12 subjects performing a left-hand motor task. In comparison with conventional univariate methods, IVA successfully characterized the group-representative activation time courses (as component vectors) without extra data-processing schemes to circumvent the permutation problem, while effectively detecting areas with hemodynamic responses deviating from canonical, model-driven ones. Keywords: Independent Vector Analysis, fMRI, Neuroimaging, Group Study, Multivariate Analysis.
1 Introduction

Functional MRI (fMRI) measures the blood-oxygenation-level-dependent (BOLD) signal changes associated with neural activity. The temporal dynamics of the BOLD signal (time series; TS), called the hemodynamic response function (HRF), is thus the key element in analyzing fMRI data. Typically, univariate approaches such as the generalized linear model (GLM) or regression analysis are performed to estimate the conformity of a measured voxel-wise BOLD TS to a fixed, canonical HRF [1]. However, the BOLD TS may not be fully captured by univariate methods due to variations across subjects, scans, and brain regions [2]. ICA [3], as one of the multivariate approaches, has provided flexibility in data processing compared to the hypothesis-driven, univariate methods. Such flexibility applies especially when observed hemodynamic responses deviate from the expected (hypothesized) HRF [4]. Task-related activation components, often similar in their
spatial/temporal patterns across subjects/sessions, may not be straightforwardly generalized from individual- to group-level analysis, since the ICA algorithm permutes the order of its output components. In a previous study [5], the task-related components across subjects/sessions were therefore manually inspected and grouped. This method may require careful selection of the component-of-interest from a large number of subjects/sessions. In the present study, we propose a novel fMRI analysis method for group processing based on independent vector analysis (IVA) [6]. IVA was originally proposed to address the limitation of the conventional ICA approach during blind source/signal separation (BSS) in the frequency domain (i.e., the permutation of extracted independent components across frequency bins). IVA correctly indexed the independent components (ICs) identified during BSS in the frequency domain by utilizing the mutual dependency among the extracted ICs across frequency bins. Intuitively, the IVA model offers an extra dimension for processing dependent components compared to the ICA model. In an fMRI study, this extra dimension can be assigned to the automated grouping of similar IC maps across subjects. Fundamentally, IVA is an extension of ICA whereby each component of the input and output stages forms a vector (instead of a scalar value, as in the case of ICA). IVA assumes, and therefore attempts to increase, independence across output vector components, while maintaining dependency among the scalar elements within each vector. The 'dependency' in the fMRI study is analogous to mutual activation patterns across the subjects, comparable to the group trend in similar spatial activation patterns. Using the IVA algorithm, the spatially similar trend in activation maps across subjects (dependent, thus representing the group trend) can be derived as the output vector components. As a result, one can avoid the complications of manually selecting task-related IC maps/time-courses (TCs) across subjects, thus rendering the whole process user-independent. To show the utility of the proposed method, we implemented and applied the IVA algorithm to analyze fMRI data from a left-hand clenching motor task using a short-time trial-based paradigm design. The obtained result was compared with the result from the generalized linear model (GLM) in SPM2 (Wellcome Department of Imaging Neuroscience, University College London, London, UK; www.fil.ion.ucl.ac.uk/spm).
2 Methods

2.1 IVA Model and Learning Algorithm

Figure 1 shows a schematic diagram of the concurrent synthesis (generative) and analysis models for group data processing using IVA. In the synthesis model, the weighting values at the vth voxel associated with the IC maps (assuming mutual dependence across subjects) are grouped into a single vector array, assuming spatial similarity across the subjects. Through the mixing matrix A (with the TCs represented by its columns), these vector arrays are linearly combined to form further sets of vector arrays (equal in number to the temporal points of the BOLD TS), each containing the measured BOLD signal from a specific (vth) voxel location across all of the subjects. In the analysis model, the unknown weighting values (also associated with the IC maps) can be estimated via a corresponding unmixing matrix W. The matrix W can be
obtained by applying a learning rule to increase independence among output vector arrays (thus, sorting out the activation patterns). This is accompanied by maintaining dependence of weighting values within each output vector array, thus deriving mutual activations across the subjects for group inference.
Fig. 1. Schematic diagrams of the synthesis and analysis models for fMRI group data processing using independent vector analysis, with $\mathbf{a}_{ji} = [a_{ji}^{(1)}, \ldots, a_{ji}^{(M)}]$ and $\mathbf{w}_{kj} = [w_{kj}^{(1)}, \ldots, w_{kj}^{(M)}]$.
From the synthesis model in Fig. 1, the measured BOLD TS at the vth voxel of subject m can be represented as
$$\mathbf{x}^{(m)}(v) = \mathbf{A}^{(m)} \mathbf{c}^{(m)}(v). \qquad (1)$$
Each subject has its own mixing matrix (i.e., TCs) with independent activations (i.e., weighting values associated with the IC maps) for the measured BOLD TS, and does not share a mixing matrix with other subjects. In addition, the dependence among the weighting values across subjects, within each unknown vector component, is captured by the multivariate probability density function (p.d.f.) $p(\mathbf{c}_i(v)) = p(c_i^{(1)}(v), \ldots, c_i^{(M)}(v))$. Boldface and lightface represent vectors/matrices and scalar values, respectively; superscripts and subscripts denote the indices of the subjects and of the IC maps/fMRI volumes, respectively. $M$ is the number of subjects ($m \in \{1,\ldots,M\}$), $N$ is the number of fMRI volume acquisitions corresponding to the unknown IC maps ($i, j, k \in \{1,\ldots,N\}$), and $V$ is the number of voxels within a brain region ($v \in \{1,\ldots,V\}$). The number of unknown IC maps was assumed to equal the number of volume acquisitions in Fig. 1; the assumed number of IC maps can be further reduced using a dimension-reduction scheme such as principal component analysis (PCA) [3,5]. Then, by applying an unmixing matrix in the analysis model (i.e., the inverse of the TCs), the weighting values at the vth voxel (across IC maps) of subject m can be estimated as
$$\hat{\mathbf{c}}^{(m)}(v) = \mathbf{W}^{(m)} \mathbf{x}^{(m)}(v). \qquad (2)$$
To derive the learning algorithm, as presented in [6], the Kullback–Leibler (KL) divergence was adopted as a measure of independence of the output vector components, and a variance-dependent multivariate p.d.f. was utilized as a measure of dependence among the elements within each output vector component. By following the procedure in [6], the update term $\Delta\mathbf{W}^{(m)}$ for the unmixing matrix of subject m can be derived as

$$\Delta\mathbf{W}^{(m)} \propto \left[\mathbf{I} - \varphi(\hat{\mathbf{c}}^{(m)}(v))\,(\hat{\mathbf{c}}^{(m)}(v))^T\right]\mathbf{W}^{(m)}, \qquad (3)$$

where $\mathbf{I}$ is an $(N \times N)$ identity matrix, $\varphi(\hat{\mathbf{c}}^{(m)}(v)) = [\varphi(\hat{c}_1^{(m)}(v)), \ldots, \varphi(\hat{c}_N^{(m)}(v))]^T$, and $\varphi(\hat{c}_k^{(m)}(v)) = \hat{c}_k^{(m)}(v) \big/ \sqrt{\sum_{l=1}^{M} (\hat{c}_k^{(l)}(v))^2}$. Applying Eq. (3), the unmixing matrix can be iteratively updated for the data from all voxels ($v = 1,\ldots,V$). The only difference compared to Infomax-based ICA (for the processing of single-subject data [3]) is the application of the nonlinear function $\varphi(\hat{\mathbf{c}}^{(m)}(v))$ (cf. the score function in ICA), which is dependent across subjects in IVA.
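A minimal sketch of the update in Eq. (3), averaged over all voxels as one batch, is given below. The batching and all variable names are our own choices; this is an illustration, not the authors' implementation.

```python
import numpy as np

def iva_step(X, W, lr=1e-3):
    """One IVA update following Eq. (3).

    X: list of M whitened data arrays, each (N x V).
    W: list of M unmixing matrices, each (N x N), updated in place.
    """
    M = len(X)
    C = [W[m] @ X[m] for m in range(M)]          # current source estimates
    joint = np.sqrt(sum(c ** 2 for c in C))      # across-subject norm, (N x V)
    N, V = C[0].shape
    for m in range(M):
        phi = C[m] / joint                       # multivariate score function
        dW = (np.eye(N) - (phi @ C[m].T) / V) @ W[m]   # natural-gradient step
        W[m] += lr * dW
    return W
```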
2.2 Application to Trial-Based fMRI Data

This study was approved by the local Institutional Review Board. Twelve right-handed subjects (aged 24.7 ± 4.5 years, 5 females) performed one session of a left-hand (LH) clenching (2 times/sec) task based on a short-time trial-based paradigm design (65-sec duration excluding 10 sec of dummy scans; task onset occurred at 15 sec, followed by a 3-sec task period). For the start/end of the task, a pre-recorded sound cue was played to the subject in the MRI system via an auditory headset (Avotec, Jensen Beach, FL). The fMRI data were obtained in a 3-Tesla clinical scanner (Signa VH, GE Medical Systems) using a single-channel, standard birdcage head coil. To obtain functional data, an EPI sequence was applied to image most of the brain volume (13 axial slices, flip angle = 80°, TE/TR = 40/1000 msec, 64 frequency and phase encodings: 64×64 in-plane voxels, 5 mm thickness with a 1 mm gap, 240 mm square field-of-view) for detection of the BOLD TS associated with neural activity. Prior to group processing, the individual EPI data were standardized to the MNI (Montreal Neurological Institute) space by following the preprocessing steps in SPM2 (in order: slice-timing correction, realignment, normalization, and spatial smoothing using an 8 mm full-width-at-half-maximum 3-D Gaussian kernel). Before processing with the IVA algorithm, a PCA-based dimension-reduction scheme [3,5] was applied to reduce the number of IC maps/TCs to 50; the sum of the corresponding 50 eigenvalues was more than 99% of the sum of all 65 eigenvalues for each subject. Applying the IVA algorithm of Eq. (3) to the normalized set of dimension-reduced fMRI group data, a semi-batch learning scheme [3] was used to update the unmixing matrix of each subject. A batch of a 10×10×10 mm³ isotropic cluster (5×5×5 = 125 voxels, given the 2×2×2 mm³ isotropic voxels) was used, assuming dependencies of neural activations within this cluster across subjects. The learning rate ($\eta$) was set to $10^{-3}$ throughout the iterations. The iteration was stopped when the ratio of weight change ($\eta\,\|\Delta\mathbf{W}^{(m)}\| / \|\mathbf{W}^{(m)}\|$) stabilized ($3.6\times10^{-5}$ to $7.5\times10^{-4}$). After the algorithm converged, the resulting IC map (i.e., the weighting values of a TC) was transformed into a z-scored map by subtracting the mean value and dividing by the standard deviation [3].
The sign ambiguity of IC maps/TCs that typically arises from ICA-based methods [3,4,5,7] also applies to IVA. Here, the voxel-wise correlation coefficients between (1) the IC z-map within the activated regions (|z| > threshold) and (2) the original fMRI volumes were utilized: if the averaged (across time points) value of the correlation coefficients was negative, the sign of the IC z-map (and of the corresponding TC) was inverted. To find task-related components, a spatial sorting scheme was employed, whereby the voxel-wise correlation between the sign-corrected IC z-map and a cross-correlation (CC) map (obtained from the temporal correlation coefficients with the task-related canonical HRF) was regarded as the degree of task relevance of each IC z-map for each subject. This similarity measure (i.e., the voxel-wise correlation value) was then averaged across all subjects for each output vector array of IVA and subsequently sorted in descending order. The resulting five most task-related IC z-maps across subjects were processed using a one-sample t-test implemented in SPM2, considering a random-effects (RFX) model [8]. The resulting task-related group activation maps of IVA were compared with the group activation map of the GLM obtained from SPM2. After obtaining a group inference, the group activation maps were qualitatively compared in terms of the location of activation. Areas with an activation volume greater than 5×5×5 mm³ (p < 10⁻³) were identified and labeled using the Brodmann Area (BA) and Automated Anatomical Labeling (AAL) templates (provided by MRIcro; www.mricro.com).
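The sign-correction heuristic might look as follows. The threshold value and all names are ours; this is a sketch of the rule described above, not the authors' code.

```python
import numpy as np

def fix_sign(z_map, volumes, z_thresh=2.0):
    """Flip an IC z-map (and, by extension, its TC) if its average
    correlation with the original volumes over activated voxels is negative.

    z_map: (V,) z-scores; volumes: (T x V) original fMRI data.
    """
    active = np.abs(z_map) > z_thresh
    corrs = [np.corrcoef(z_map[active], vol[active])[0, 1] for vol in volumes]
    return -z_map if np.mean(corrs) < 0 else z_map
```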
3 Results

Figure 2 shows the task-related group activation maps obtained from GLM and IVA. In the GLM result (Fig. 2A), a univariate approach, a single task-related group activation map was obtained. In the IVA results, the 5 most task-related components were selected after reordering the output components based on the spatial sorting strategy explained in Section 2.2. The group activations (p < 10⁻² to p < 10⁻⁷) are coded with a colour gradient, and the labeled anatomical areas of the group activations are listed in Table 1. In the GLM analysis, the directly task-related areas (e.g., right primary motor area: M1; supplementary motor area: SMA; primary sensory area: S1; and cingulate gyrus) and paradigm-related areas (auditory; superior temporal) showed more significant activations than the basal ganglia (caudate/putamen/pallidum) and thalamus (see Fig. 2A). In the IVA results, the task-related activations (e.g., the right M1, SMA, S1, cingulate gyrus, and sup./mid. temporal gyrus) were extracted as the first task-related group activation map ĉ1. The group activations in the remaining components were dominant in the primary auditory area (ĉ2), basal ganglia and thalamus (ĉ3), inf. frontal and insular cortex (ĉ4), and mid. frontal and cingulate gyrus (ĉ5).
Comparing the group activation maps of GLM and IVA (Fig. 2), some of the activations revealed by IVA were underestimated by GLM. For example, the activated regions obtained in the auditory area by GLM (examples shown with green circles in Fig. 2A) were smaller than the areas detected by IVA (marked green in Fig. 2B). Also, the activations in the basal ganglia/thalamus
identified by IVA (blue areas in Fig. 2B) were not detected when processed with GLM (examples shown with blue circles in Fig. 2A).
Fig. 2. Group activation maps obtained by (A) GLM and (B) IVA. For IVA, the five most task-related component maps (ĉ1 ~ ĉ5) are colour-coded.

Table 1. The strongly activated cortical areas inside the group activation maps. If an activated cluster (p < 10⁻³) was bigger than 5×5×5 mm³, the center of this cluster was registered using the BA and AAL indices (provided by MRIcro; www.mricro.com).
GLM
  Left hemisphere: Inf. Frontal / SMA / Insula / Mid. Cingulate / S1 / SupraMarginal / Sup. Temp. *BA: 6, 23, 24, 32, 42, 47, 48
  Right hemisphere: M1 / Inf. Frontal / SMA / Insula / Mid. Cingulate / S1 / Parietal (Sup., Inf.) / SupraMarginal / Sup. Temp. *BA: 2, 3, 4, 6, 8, 23, 24, 32, 38, 40, 42, 44, 47, 48
IVA
  ĉ1, left: SMA / Mid. Cingulate / S1 / SupraMarginal / Temp. (Sup., Mid.) *BA: 3, 6, 8, 22, 23, 24, 32, 41, 42, 48
  ĉ1, right: M1 / Frontal (Sup., Mid.) / SMA / Mid. Cingulate / S1 / Parietal (Sup., Inf.) / SupraMarginal / Sup. Temp. *BA: 2, 3, 4, 6, 23, 24, 32, 40, 48
  ĉ2, left: Insula / SupraMarginal / Heschl / Temporal (Sup., Mid.) *BA: 22, 42, 48
  ĉ2, right: Heschl / Sup. Temp. *BA: 22, 48
  ĉ3, left: Frontal (Sup., Inf.) / Insula / Hippocampus / Amygdala / Putamen / Pallidum / Thalamus / Heschl / Sup. Temp. *BA: 11, 20, 25, 34, 38, 47, 48
  ĉ3, right: Insula / Hippocampus / Amygdala / Caudate / Putamen / Pallidum / Thalamus *BA: 11, 20, 27, 34, 37, 38, 48
  ĉ4, left: Inf. Frontal / Insula / SupraMarginal / Putamen / Heschl *BA: 45, 47, 48
  ĉ4, right: Inf. Frontal / Insula / Putamen *BA: 6, 48
  ĉ5, left: Mid. Frontal / Cingulate (Ant., Mid.) / Sup. Temp. *BA: 24, 32, 38, 45, 46
  ĉ5, right: Insula / Cingulate (Ant., Mid.) / Sup. Temp. *BA: 24, 32, 48
We conjectured that the differences in the activation maps between GLM and IVA were caused by individual differences in the temporal patterns of the hemodynamic responses from these areas. Therefore, the individual differences in the obtained TCs were further compared across subjects (note that each TC represents a dominant feature of the BOLD TS corresponding to the activated voxels in the IC map). Figure 3 shows the individual TCs corresponding to the three most task-related group activation maps (ĉ1 ~ ĉ3 in Fig. 2B) obtained by IVA. Here, we adopted the convention introduced by Duann et al. [7], whereby the normalized TC (0–1) is coded in gray scale (black: 0; white: 1) so that relative amplitudes can readily be discriminated within and across subjects. First of all, the TCs (Fig. 3A) corresponding to the first (highly) task-related group activation map ĉ1 were in good agreement with the hypothesized HRF in GLM (yellow box: task-related period; correlation coefficient between the averaged TC across subjects and the hypothesized HRF: 0.88). On the other hand, the TCs (Fig. 3B and C) corresponding to ĉ2 and ĉ3 showed some variation in peak position within the task-related period. It is also notable that additional peaks during rest periods were observed (examples are shown with arrows), with reduced correlation coefficients of 0.56 and 0.63 compared to ĉ1. Because of these large variations between the actual hemodynamic responses (analogous to the TCs) and the hypothesized HRF across subjects, the activations in the corresponding areas may not be detected by GLM.
Fig. 3. Image plots of the TCs across subjects corresponding to the three most task-related group activation maps (ĉ1 ~ ĉ3). Each TC was normalized between 0 (black) and 1 (white). A yellow box indicates the period of the task-related response; green arrows indicate examples of peaks during the rest period. The plots at the bottom show the hypothesized HRF (green line), along with the averaged TC (blue line) and standard deviations (red bars) across all subjects. The correlation coefficient between the averaged TC and the hypothesized HRF is shown in the top-right corner of each averaged time plot. The task period (3 sec) is marked with a thick black bar.
4 Conclusion

In this study, we have proposed the use of IVA to infer group activation patterns from fMRI data. The IVA algorithm was applied to multiple subjects' BOLD signals,
and the spatially similar trends in activation maps across subjects (dependent, thus representing the group trend) were derived as the output vector components. In the application of the proposed method to fMRI data, the resulting IC maps/TCs of individual subjects provided reliable task-related information for further group-level inference. In addition, IVA provided more robust activation patterns than GLM (based on the hypothesized univariate HRF), especially where the actual HRF deviated from the hypothesized one (e.g., in the basal ganglia/thalamus). These results show the feasibility of using IVA in fMRI group studies as a potential alternative to the conventional univariate approach. The proposed model may also be adopted to find common multivariate activation patterns across multiple trials/sessions from single-subject data by substituting the index of a trial/session for the index of a subject. The IVA approach demands a considerable computational load, since each individual unmixing matrix is iteratively trained using the results from the other subjects (represented by the nonlinear function $\varphi(\hat{\mathbf{c}}^{(m)}(v))$), and thus all of the unmixing matrices must be updated in parallel. This increased computational demand can be alleviated by increasing hardware memory. To achieve fully automated group processing using IVA, elaborate optimization of the learning parameters and sorting schemes (for the selection of task-related features) is needed. Acknowledgments. This work was supported in part by grants from the NIH (R01-NS048242 to Yoo, SS and NIH U41RR019703 to Jolesz FA).
References
1. Worsley, K.J., Friston, K.J.: Analysis of fMRI time-series revisited—again. NeuroImage 2, 173–181 (1995)
2. Aguirre, G.K., Zarahn, E., D'Esposito, M.: The variability of human, BOLD hemodynamic responses. NeuroImage 8, 360–369 (1998)
3. McKeown, M.J., Makeig, S., Brown, G.G., Jung, T.P., Kindermann, S.S., Bell, A.J., Sejnowski, T.J.: Analysis of fMRI data by blind separation into independent spatial components. Hum. Brain Mapp. 6, 160–188 (1998)
4. Quigley, M.A., Haughton, V.M., Carew, J., Cordes, D., Moritz, C.H., Meyerand, M.E.: Comparison of independent component analysis and conventional hypothesis-driven analysis for clinical functional MR image processing. AJNR Am. J. Neuroradiol. 23, 49–58 (2002)
5. Calhoun, V.D., Adali, T., McGinty, V.B., Pekar, J.J., Watson, T.D., Pearlson, G.D.: fMRI activation in a visual-perception task: network of areas detected using the generalized linear model and independent component analysis. NeuroImage 14, 1080–1088 (2001)
6. Kim, T.S., Attias, H.T., Lee, S.Y., Lee, T.W.: Blind source separation exploiting higher-order frequency dependencies. IEEE Trans. Audio, Speech and Language Process. 15, 70–79 (2007)
7. Duann, J.R., Jung, T.P., Kuo, W.J., Yeh, T.C., Makeig, S., Hsieh, J.C., Sejnowski, T.J.: Single-trial variability in event-related BOLD signals. NeuroImage 15, 823–835 (2002)
8. Friston, K.J., Holmes, A.P., Worsley, K.J.: How many subjects constitute a study? NeuroImage 10, 1–5 (1999)
Extraction of Atrial Activity from the ECG by Spectrally Constrained ICA Based on Kurtosis Sign

Ronald Phlypo1,2, Vicente Zarzoso1, Pierre Comon1, Yves D'Asseler2, and Ignace Lemahieu2

1 Laboratoire I3S, CNRS/UNSA, Les Algorithmes - Euclide-B, BP 121, 2000 Route des Lucioles, 06903 Sophia Antipolis Cedex, France {phlypo,zarzoso,pcomon}@i3s.unice.fr
2 MEDISIP-IBBT, ELIS/UGent, IBiTech - Campus Heymans, De Pintelaan 185, B-9000 Ghent, Belgium {ronald.phlypo,yves.dasseler,ignace.lemahieu}@ugent.be
Abstract. This paper deals with the problem of estimating atrial activity during atrial fibrillation periods in the electrocardiogram (ECG). Since the signal of interest differs in kurtosis sign from the dominant sources in the ECG, we propose an independent component analysis method for source extraction based on this difference in kurtosis sign, and extend it with a constraint of spectral concentration in the 3–12 Hz frequency band. Results show that we are able to estimate the atrial fibrillation with a single algorithm of low computational complexity, O((7n−7)T).
1 Introduction

This paper describes a method to recover narrow-band independent signals from a linear mixture model in which highly impulsive (high-kurtosis) signals are the main source of interference. This set-up is a commonly encountered problem in biomedical signal analysis, e.g. when considering spectral bands of activity in electroencephalographic recordings, or atrial fibrillation (AF) signals in the electrocardiogram (ECG), where the main sources of interference are respectively the ocular activity and the QRS(-T) complex. The focus here is on the recovery of atrial activity during AF from an ECG recording, whatever the conditions of noise or interference from other physiological signals (e.g. the QRS complex). Since we consider narrow-band spectra in a volume conductor, a linear approximation of the electromagnetic Maxwell equations is valid, and hence we may suppose that a general linear mixing model holds. This mixture model translates the measured potentials at the chest or body surface into bio-electrical source signals (generally specified by their currents) and vice versa. If we consider the mixing-demixing model, all relations exhibit the
R. Phlypo would like to thank H. Rix and V. Zarzoso for their kind hospitality at the laboratories of I3S.
characteristics of a linear model, and thus we can rewrite our system of measurements y in terms of an equivalent set of potentials x, where each $x_i$ is associated with a column $\mathbf{a}_i$ of $\mathbf{A}$. The latter represent the mappings of the sources onto the measurement surface (known as source topographies). In matrix notation:

$$\mathbf{y}(t) = \mathbf{A}\mathbf{x}(t) + \boldsymbol{\eta}(t), \qquad (1)$$

which relates the measurements $\mathbf{y} \in \mathbb{R}^{m \times 1}$, the mixing matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the sources $\mathbf{x} \in \mathbb{R}^{n \times 1}$ and the noise $\boldsymbol{\eta} \in \mathbb{R}^{m \times 1}$. The measurements and the sources can be seen as realisations of random variables; we will therefore drop the time index in the subsequent work to improve readability. For the biomedical case we may assume that these sources are quasi statistically independent. In the case of atrial fibrillation, we can restrict the AF source characteristics even further by imposing the extra constraint that the AF signal should have a narrow-band spectrum, and thus platykurtic statistics. This is in contrast to the QRS(-T) complex, the main masking source, which is highly leptokurtic (see e.g. [1]). In the rest of this paper we develop this idea further, sketching a framework in which we can extract the independent AF source based on the difference in kurtosis sign and under the constraint of narrow-band source spectra. The solution is given as an ensemble of successive algebraic solutions to the pairwise separation problem, subjected to a conditional update. This guarantees a robust algorithm, with only few parameters to estimate, and omits the need for exhaustive search algorithms.
2 Methods

2.1 The Kurtic Difference as an Objective Function for ICA
ICA. The solution to the ICA problem has been proposed by different authors, using different contrast functions. Despite the diversity at the basis of these algorithms, the solution space is almost always given by components whose higher-order cross-cumulants vanish [2,3], which in turn is equivalent to a reduction of the mutual information between the components [4,5]. Solutions have been proposed that solve the problem by deflation approaches [6], estimating the sources one by one, or that act on the whole set at once. The deflation approach offers the ability to solve for independence in a component-by-component way, sorted according to the value the components take in the cost function; see e.g. RobustICA [7] and FastICA [8]. However, when considering the a priori constraint of a narrow spectrum, most algorithms cannot include it without excessive computational complexity; consider e.g. the number of tensor slices or correlation matrices needed in JADE [4]-like, respectively SOBI [9]-like, algorithms, especially when applied to high data dimensionalities n.

Givens Rotations. The method proposed here is an extraction (or deflation) approach with pairwise optimisation. The advantage is that there exists an algebraic expression to update the source estimates, avoiding computationally
unattractive search methods. Moreover, since the signals are prewhitened (i.e. mutually decorrelated), it suffices to find an orthogonal matrix to obtain maximally independent source estimates. We can thus constrain our parameter space to only one parameter per signal pair if we do not take into account permutation and scaling, which are irrelevant parameters under the independence criterion. Our search space can thus be limited to the optimal rotation angle for each pair to be processed [4,3]. This amounts to the following algorithm for the prewhitened signals $\hat{\mathbf{x}} \in \mathbb{R}^{n \times 1}$:

$$\hat{\mathbf{x}}_{ij} = \mathbf{Q}(\theta)\,\mathbf{x}_{ij}, \quad \text{where } \mathbf{Q}(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \qquad (2)$$

where $\theta$ is the optimal rotation angle to be specified, and the matrix $\mathbf{Q}(\theta)$ represents a plane rotation, also known as a Givens rotation. Here $\mathbf{x}_{ij}$ and $\hat{\mathbf{x}}_{ij}$ denote the pairs formed by the ith and jth components of $\mathbf{x}$ and $\hat{\mathbf{x}}$, respectively. Left multiplication of the data $\hat{\mathbf{x}}_{ij}$ by $\mathbf{Q}^{-1} = \mathbf{Q}^T$ would thus result in the standardized sources $\hat{\mathbf{x}}$, with additional constraints imposed by the objective function to which $\theta$ is a solution. If the objective function is chosen well, these sources are maximally independent, an assumption believed to hold true for many, if not all, bio-electrical source signals. When the objective function meets the requirements of being maximal if and only if the components are independent, while being blind to possible permutations and scaling, it becomes a contrast function for ICA [3].

Kurtic Difference as a Contrast. We have shown in [10] that the objective function

$$\Psi(\mathbf{Q}) = \sum_{i=1}^{n} \epsilon_i\, \kappa^{\hat{x}}_{iiii} \qquad (3)$$
fulfils all requirements for a contrast function for ICA, where $\epsilon_i$ is the sign of the fourth-order auto-cumulant of the ith source and $\kappa^{\hat{x}}_{iiii}$ is the fourth-order cumulant of the ith output. Based on this fact, together with the assumption that the atrial activity caused by AF is a (indeed, the sole) platykurtic source in the ECG, the contrast translates into

$$\Psi_{AF}(\mathbf{Q}) = \sum_{i=2}^{n} \kappa^{\hat{x}}_{iiii} - \kappa^{\hat{x}}_{1111}, \qquad (4)$$
which can be solved using successive Givens rotations as defined above. The next paragraph gives the algebraic solution for $\theta$ when a pair of signals is considered.

The Optimal θ-value: θ*. We can obtain the optimal value of $\theta$, denoted $\theta^{\star}$, by calculating the stationary point of our contrast function $\Psi_{AF}$ (4), setting its derivative to zero. It is sufficient to consider the pairs with opposite kurtosis signs, the other cases being known. As a function of the observed whitened signals $\hat{\mathbf{x}} = \mathbf{Q}\mathbf{x}$ we obtain for $\Psi_{AF}$:

$$\Psi_{AF}(\theta) = \lambda_2 - \lambda_1 = \alpha \cos 2\theta + 2\beta \sin 2\theta, \qquad (5)$$
where $\alpha$ and $\beta$ are given by $\kappa^{\hat{x}}_{1111} - \kappa^{\hat{x}}_{2222}$ and $\kappa^{\hat{x}}_{1112} + \kappa^{\hat{x}}_{1222}$, respectively, and $\lambda_1, \lambda_2$ are the kurtosis values of $\hat{\mathbf{x}}_{ij}$, which can be written as a multilinear function of the source kurtosis values [3] using Eq. (2). Equation (5) has its stationary points at

$$2\theta^{\star} = \arctan \frac{2\beta}{\alpha}, \qquad (6)$$

where $\theta^{\star}$ is the rotation angle sought. Equation (6) is also the equation obtained in [11] based on centroid estimators.
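Under the stated prewhitening assumption, the pairwise step of Eqs. (2)–(6) can be sketched as follows; this is an illustration with names of our own choosing, and arctan2 is used to resolve the quadrant of the stationary point.

```python
import numpy as np

def cum4(a, b, c, d):
    """Fourth-order cross-cumulant of zero-mean, unit-variance signals."""
    return (np.mean(a * b * c * d)
            - np.mean(a * b) * np.mean(c * d)
            - np.mean(a * c) * np.mean(b * d)
            - np.mean(a * d) * np.mean(b * c))

def pairwise_rotation(x1, x2):
    """Optimal Givens rotation for one whitened signal pair, Eq. (6)."""
    alpha = cum4(x1, x1, x1, x1) - cum4(x2, x2, x2, x2)
    beta = cum4(x1, x1, x1, x2) + cum4(x1, x2, x2, x2)
    theta = 0.5 * np.arctan2(2 * beta, alpha)
    Q = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])        # Eq. (2)
    return Q @ np.vstack([x1, x2]), theta
```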
2.2 Inclusion of the Spectral Concentration Constraint
AF is typically characterised by a sinusoidal to triangular waveform (depending on the relative power in the harmonics) with frequency and amplitude modulation. This spectrally rather narrow-band signal has its main frequency in the 3 to 12 Hz band. This enables us to impose additional constraints on the spectra, leading us to redefine the update sequence of the pairwise processing for source extraction given in [10]. The method extends the natural sweeping procedure for source extraction to a criterion-based sweeping procedure. The update criterion is given as:

Criterion 1. Replace the source estimates $\hat{x}_i$ and $\hat{x}_j$ with their updates $\hat{\mathbf{x}}_{ij} = \mathbf{Q}^T \mathbf{x}_{ij}$ iff the spectral concentration in the 3–12 Hz band of one of the new estimates exceeds the spectral concentration of the reference source estimate.

The spectral concentration in Criterion 1 is taken as the ratio of the spectral density in a ±10% band around the centre frequency $f_c$ to the total energy in the signal's spectrum, i.e.

$$SC = \int_{0.9 f_c}^{1.1 f_c} P(f)\, df \Big/ \int_{0}^{F_s/2} P(f)\, df \quad \text{if } f_c \in [3\,\text{Hz}, 12\,\text{Hz}], \text{ otherwise } SC = 0.$$

With this information we can define the sweep procedure as described in Table 1, where the stopping criterion becomes positive when a sweep occurs without an update of the reference source estimate.

Table 1. The pseudo-code for the sweep algorithm

  Initialise reference source with x̂1
  While false(stopping criterion)
    StartSweep:
    For j from 2 to m
      Compute θ* for the reference source estimate x̂1 and estimate x̂j
      Compute spectral concentration for x̂1 and x̂j
      If Criterion 1 holds: replace x̂1 (x̂j) with the x̂ having highest (lowest) SC
    EndSweep
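The SC measure might be implemented as in the sketch below, where the centre frequency is taken as the spectral peak inside the admissible band. All names are ours and a periodogram stands in for whatever PSD estimator the authors used.

```python
import numpy as np
from scipy.signal import periodogram

def spectral_concentration(x, fs, band=(3.0, 12.0)):
    """SC of a source estimate: energy in a +/-10% window around the
    spectral peak inside [3, 12] Hz, relative to total spectral energy;
    0 if no admissible peak exists (a sketch of the measure above)."""
    f, P = periodogram(x, fs)
    in_band = (f >= band[0]) & (f <= band[1])
    if not np.any(in_band):
        return 0.0
    fc = f[in_band][np.argmax(P[in_band])]     # centre frequency
    win = (f >= 0.9 * fc) & (f <= 1.1 * fc)
    return P[win].sum() / P.sum()
```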
3 Results

3.1 Data
Patient Data. The data upon which the algorithm was run consist of 51 patient registrations with known AF. All ECG sets are standard 12-lead ECG measurements consisting of the leads I-III, aVR, aVL, aLL and the potentials at the electrodes V1-V6. The dataset is by definition overdetermined for the biopotentials, since I-III, aVR, aVL and aLL can all be expressed in terms of the left arm (LA), right arm (RA) and left leg (LL) electrode potentials. This means that there is a redundancy of factor 2 in the leads. If we take the LL electrode as the reference electrode for all measurements (which is generally the physical measurement setup), then we are left with 8 independent variables. We can thus reduce our set to 8 recording sites or derivations only, without compromising the information in the data. Taking V1-V6 and extracting the potentials at LA and RA from the leads would eliminate this data redundancy. One has to be careful, though, in highly noisy environments where the noise at the electrodes is not stationary and the noise term would thus occupy a sufficiently large part of the total data subspace. In that case it might not suffice to take only the 8 electrode potentials, and the extra derivations might add information that helps solve the ill-conditioned problem.

Simulated Data. To assess the quality of separation we introduce a simulated dataset. This dataset contains an AF signal constructed following the method in [12], whereas the QRS-T simulation has been done using high-kurtosis components following the model

$$\text{QRS-T}_i(t) = \sum_j \tan\bigl(a_j(t)\, \sin(j\,\omega(t)\, t)\bigr). \qquad (7)$$

The model allows for amplitude modulation in $a_j$, where $\max_t \sum_j |a_j(t)| \leq 1$, and modulation in $\omega(t)$. By changing the number of harmonics and the parameters of the modulations we can change the statistics of the total time series. Additionally, we added two sources that have no physiological meaning but have a positive kurtosis value, so that we do not violate the model assumptions. The randomly drawn square mixing matrix is orthonormal, avoiding the need for the prior whitening step described in the introduction, without restricting generality.
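The leptokurtic QRS-T model of Eq. (7) can be sketched as follows; the particular modulation functions passed in are up to the user, and all names here are ours.

```python
import numpy as np

def qrst(t, a_funcs, omega):
    """Eq. (7): sum over harmonics j of tan(a_j(t) * sin(j * omega(t) * t)).

    a_funcs: amplitude-modulation callables a_j(t) with sum_j |a_j(t)| <= 1,
    which keeps the tan argument below pi/2 and the output spiky but bounded;
    omega: frequency-modulation callable omega(t).
    """
    return sum(np.tan(a(t) * np.sin((j + 1) * omega(t) * t))
               for j, a in enumerate(a_funcs))
```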
3.2 Estimating the Central Frequency
To evaluate our method, we focus on the value of the estimated main frequency, defined as the frequency at which the power spectral density is highest in the 3–12 Hz band. For the 51 patient registrations we compare the resulting frequencies with those found from a combined FastICA and SOBI approach as applied in [13]. We also compare the
results for the AEML, EML and combEML methods [11] in the same constrained updating framework. Table 2 displays the differences in the central frequency estimates among the methods, with their respective standard deviations. To exclude biasing toward short-time artefactual instances in the data, the data were resampled at each iteration, before calculation of the optimal rotation angle $\theta^{\star}$, using bootstrap sampling allowing for overlap, with a bootstrap sample size of $2 \cdot 10^3$. To exclude bootstrap-based differences between the methods, we performed 100 Monte Carlo runs per method over the whole dataset, and the means of the results so obtained were used as the frequencies in Table 2.

Table 2. Differences in main frequency estimation. The upper-right triangle displays the results when 12 leads are considered; the lower-left triangle displays the results for the reduced set of 8 electrode potentials.

                fastICA+SOBI    AEMLa           AEMLc           combEML
fastICA+SOBI    0               -0.026 (0.444)  0.083 (0.294)   0.425 (0.429)
AEMLa           0.467 (1.025)   0               0.142 (0.384)   0.032 (0.291)
AEMLc           0.291 (0.526)   -0.085 (0.755)  0               -0.110 (0.338)
combEML         0.425 (0.761)   0.063 (0.767)   0.148 (0.390)   0
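The resampling step described above amounts to a standard bootstrap over time samples; a minimal sketch (names ours):

```python
import numpy as np

def bootstrap_sample(X, size=2000, rng=None):
    """Draw `size` time indices with replacement (overlap allowed) from the
    (channels x samples) data X before each rotation-angle estimate."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, X.shape[1], size=size)
    return X[:, idx]
```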
Visualisation of the Results. To evaluate the performance on simulated and real datasets, we present the artificial mixture and the observed electrode potentials, respectively, together with the source extraction results, in Figs. 1 and 2.
Fig. 1. Extraction of an AF-like signal from an artificially generated mixture. Left: the original sources; center: the mixture; right: the extracted source.
Fig. 3 gives the PSD for both source estimates in Figs. 1 and 2.
Fig. 2. Extraction of an AF signal from the ECG. Upper: the 12-channel ECG signal; lower: the extracted AF signal.
[Figure 3: two plots of PSD (μV²) versus frequency (0–25 Hz) for panels (a) and (b).]
Fig. 3. PSD for the extracted sources from (a) the artificially generated mixture in Fig. 1 and (b) the ECG signal in Fig. 2.
4 Discussion

From Table 2 it is clear that the main-frequency estimates of the different methods are in close agreement. From the figures we can see that the restriction on the spectral concentration does not prevent the extraction of signals with multiple harmonics, a result supported by the fact that the main harmonic is still the central frequency. Moreover, the sources are maximally independent, being a solution to the contrast in [10] subject to the constraint of spectral concentration in the 3–12 Hz band. Finally, since our technique is based on source extraction, there is no need to perform a full decomposition with a posteriori source selection, which is computationally attractive: the overall complexity is of order O((7n−7)T).
5 Conclusion
The results of the method based on the constrained, extended AEML are promising for the extraction of AF from the ECG. Although the presented values and
figures already show the strengths of the method, it remains to explore how to obtain a quantitative and objective measure for evaluating the proposed source extraction technique against the widely accepted techniques of unmasking the AF through suppression of the QRS-T complex.
References
1. Castells, F., Igual, J., Millet, J., Rieta, J.: Atrial activity extraction from atrial fibrillation episodes based on maximum likelihood source separation. Signal Processing 85, 523–535 (2005)
2. Comon, P.: Analyse en composantes indépendantes et identification aveugle. Traitement du Signal 7(3), 435–450 (1990) (special issue on non-linear and non-Gaussian methods)
3. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994)
4. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural Computation 11, 157–192 (1999)
5. Lee, T.W., Girolami, M., Bell, A.J., Sejnowski, T.J.: A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling (1998)
6. Delfosse, N., Loubaton, P.: Adaptive separation of independent sources: a deflation approach. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'94), Adelaide, Australia, 19-22 April 1994, vol. 4, pp. 41–44. IEEE Computer Society Press, Los Alamitos (1994)
7. Zarzoso, V., Comon, P.: How fast is FastICA? In: Proceedings of the 14th European Signal Processing Conference (EUSIPCO), Firenze, Italy (September 2006)
8. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997)
9. Belouchrani, A., Abed-Meraim, K., Cardoso, J., Moulines, E.: A blind source separation technique using second order statistics. IEEE Trans. on Signal Processing 45(2), 434–444 (1997)
10. Phlypo, R., Zarzoso, V., Comon, P., D'Asseler, Y., Lemahieu, I.: A contrast for ICA based on the knowledge of source kurtosis signs. Technical report ISRN I3S/RR-2007-13-FR, I3S, Sophia Antipolis, France (2007), http://www.i3s.unice.fr/~mh/RR/2007/liste-2007.html
11. Zarzoso, V., Nandi, A.K., Hermann, F., Millet-Roig, J.: Combined estimation scheme for blind source separation with arbitrary source PDFs. IEE Electronics Letters 37(2), 132–133 (2001)
12. Stridh, M., Sörnmo, L.: Spatiotemporal QRST cancellation techniques for analysis of atrial fibrillation. IEEE Trans. Biomed. Eng. 48(1), 105–111 (2001)
13. Castells, F., Rieta, J., Millet, J., Zarzoso, V.: Spatiotemporal blind source separation approach to atrial activity estimation in atrial tachyarrhythmias. IEEE Trans. Biomed. Eng. 52(2), 258–267 (2005)
Blind Matrix Decomposition Techniques to Identify Marker Genes from Microarrays

R. Schachtner1, D. Lutter1, F.J. Theis1, E.W. Lang1, A.M. Tomé2, J.M. Gorriz Saez3, and C.G. Puntonet3

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany [email protected]
2 DETI / IEETA, Universidade de Aveiro, 3810 Aveiro, Portugal
3 DATC / ETSI, Universidad de Granada, 18371 Granada, Spain
Abstract. Exploratory matrix factorization methods like PCA, ICA and sparseNMF are applied to identify marker genes and to classify gene expression data sets into different categories for diagnostic purposes, or to group genes into functional categories for further investigation of related regulatory pathways. Gene expression levels of either human breast cancer (HBC) cell lines [6] or the well-known leukemia data set [10] are considered.
1 Introduction
The transcriptome comprises all cellular units and molecules needed to read out the genetic information encoded in the DNA. Among others, the level of messenger RNA (mRNA), specific to each gene, depends on environmental stimuli or the internal state of the cell and represents the gene expression profile (GEP) of the cell. High-throughput, genome-wide measurements of gene transcript levels have become available with the recent development of microarray technology [1]. Microarray data sets are characterized by many variables (the GEPs) on only few observations (environmental conditions). Traditionally, two strategies exist to analyze such data sets: a) supervised approaches, which can identify gene expression patterns, called features, specific to each class and also classify new samples; b) unsupervised approaches like PCA [3], ICA or NMF [2], which represent exploratory matrix decomposition techniques for microarray analysis. Both approaches can be joined to build classifiers which allow GEPs to be classified into different classes. We apply PCA, ICA and NMF to two well-characterized microarray data sets to identify marker genes and classify the data sets according to the diagnostic classes they represent.
2 The Data Sets

2.1 Breast Cancer Cell Lines - Bone Metastasis
The data set was taken from the supplemental data to [6]. The study investigated the ability of human breast cancer (BC) cells (MDA-MB-231 cell line) to form
bone metastasis. Data set 1 comprised 14 samples: experiments 1-8 showed weak (7 and 8 mild) metastasis ability, while experiments 9-14 were highly active. Data set 2 consists of 11 experiments, 5 of them showing high and 6 showing weak metastasis ability. Both data sets carry measured expression levels of 22283 genes using the Affymetrix U133a chip. For each measurement, the flags A (absent) or P (present) are provided. All genes showing more than 40% absent calls in one of the two data sets were removed; the remaining data sets contained the same 10529 genes. The authors published a list of 16 potential marker genes, 14 of which were still contained in the reduced data set.
2.2 Leukemia
Leukemia (LK) is a form of cancer in which white blood cells (leukocytes) show abnormal behavior. Pathologically, leukemia is split into acute and chronic forms. Acute leukemia can be divided into acute myelogenous leukemia (AML) and acute lymphoblastic leukemia (ALL). Furthermore, lymphocytes can be classified by their cell surface into B-cells and T-cells; ALL leukemia cells can thus be divided into the subtypes ALL-B and ALL-T. The data set was taken from the well-known supplemental data to [10], which comprise a training set of 38 experiments and a test set of 34 samples. The training set contains 27 ALL samples from childhood ALL patients and 11 adult AML samples; the test set in addition contains peripheral blood samples and cases of childhood AML. The original training and test data sets contain measurements of 7129 genes, all hybridized on an Affymetrix affy-HU6800 chip. The training set contains negative expression levels. In a later study [5], a non-negative version of the training data set was provided, containing expression profiles of 5000 genes; this latter data set was used in the present study.
3 Data Analysis
The gene expression profiles are commonly represented by an (N × M) data matrix X = [x₁ ··· x_M], with each column x_m representing the expression levels of all genes in one of the M experiments conducted. Note that the data matrix is non-square, with N ≈ 10³ · M typically; this renders a transposition of the data matrix necessary when techniques like PCA and ICA are applied. Hence ICA follows the data model Xᵀ = AS, where each row of Xᵀ represents the expression profile of all genes within one experiment. The rows of S contain the nearly independent component expression profiles, called expression modes, and the columns of A are the corresponding basis vectors containing the mixing coefficients. In this study the JADE algorithm [4] was used throughout, though equivalent results were obtained with the natural gradient and fastICA algorithms. With NMF, a decomposition is sought according to X = WH, which is of course not unique and needs further specification. The columns of W are usually called metagenes and the rows of H are called meta-experiments. The local NMF (LNMF) algorithm [8] was applied in this study.
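For illustration, the transposed decomposition can be sketched as below; the paper uses JADE, and FastICA merely stands in here, with all names ours.

```python
import numpy as np
from sklearn.decomposition import FastICA

def expression_modes(X, k=None):
    """Decompose X^T = A S for X of shape (N genes x M experiments), N >> M.

    Returns the mixing matrix A (M x k) and the expression modes S (k x N)."""
    ica = FastICA(n_components=k, random_state=0)
    U = ica.fit_transform(X)     # (N x k): columns are modes over the genes
    return ica.mixing_, U.T
```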
3.1 ICA - Analysis
We propose a new method, based on basic properties of the matrix decomposition model as well as on available diagnostic information, to build a diagnostic classifier. ICA seeks a decomposition Xᵀ = AS of the data matrix. Column a_m of A can be associated with expression mode s_m, the m-th row of S. The m-th row of the matrix A contains the weights with which the k ≤ M expression levels of each of the N genes, forming the columns of S, contribute to the m-th observed expression profile. Hence a concise analysis of matrix A hopefully provides insight into the structure of the data set. Each microarray data set investigated here represents at least two different diagnostic classes. If the M expression profiles of Xᵀ are ordered according to their class labels, this assignment is also valid for the rows of A. Suppose one of the independent expression modes s_m is characteristic of a putative cellular gene regulation process which is related to the difference between the classes. Then this characteristic profile should contribute substantially only to the experiments of one class, and much less so to the experiments of the other class (or vice versa). Since the m-th column of A contains the weights with which s_m contributes to all observations, this column should show large/small entries according to the class labels. In contrast to the method used in [7], the clinical diagnosis of the experiments is taken into account. The strategy concentrates on the identification of a column of A which shows a class-specific signature. Informative columns were identified using the correlation of each column vector of A with a design vector d whose i-th entry is d_i = ±1, according to the class label of experiment x_i.
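The informative-column search then reduces to ranking the columns of A by their correlation with d; a sketch, with names of our own choosing:

```python
import numpy as np

def informative_columns(A, d):
    """Rank columns of the mixing matrix A by |correlation| with the
    +/-1 design vector d; returns column indices and scores, best first."""
    c = np.array([abs(np.corrcoef(A[:, j], d)[0, 1]) for j in range(A.shape[1])])
    order = np.argsort(c)[::-1]
    return order, c[order]
```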
3.2 Local NMF - Analysis
With NMF, each column of X comprises the expression profile resulting from one experiment. After applying the LNMF algorithm [8], at least one column of W (a metagene) is expected to be characteristic of a regulatory process related to the class-specific signature of the experiments. Its contribution to the observed expression profiles is contained in the corresponding row of matrix H (a meta-experiment). The correlation coefficients c(h_j, d) between every meta-experiment h_j and d are then computed; empirically, c > 0.9 signifies a satisfactory similarity between a meta-experiment and the design vector. The number k of extracted basis components, i.e. the metagenes, controls the structure of W and H. For several decompositions X = WH using different numbers k of metagenes, the rows of H are studied with respect to their correlation with the design vector. A metagene is considered informative only if all entries of the corresponding meta-experiment which belong to class 1 are smaller than all other entries of that meta-experiment (or vice versa). After 5000 iterations, the cost function of the LNMF algorithm did not show noticeable changes with any of the data sets investigated. For k = 2, ..., 49, ten separate simulations were carried out and only the simulation showing the smallest reconstruction error was kept. Further matrix decompositions with k = 50, ..., 400 metagenes were examined; in the latter case, only three simulations were performed for each k.
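The strict informativeness criterion just described (class-1 entries fully separated from all others) translates directly into code; a sketch with names of our own:

```python
import numpy as np

def is_informative(h, labels):
    """True iff the class-1 entries of meta-experiment h all lie below,
    or all lie above, every other entry (the criterion described above)."""
    c1, c2 = h[labels == 1], h[labels != 1]
    return c1.max() < c2.min() or c1.min() > c2.max()
```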
4 Results

4.1 Breast Cancer Data Set
In order to test the idea of a diagnostic classifier, we first selected the set of expression profiles from bone-metastasis-mediating breast cancer cell lines provided by [6].

[Figure 1: top left, a bar plot of the entries of the selected column of A over the 14 experiments; top right, an image of the matrix H for k = 4 metagenes; bottom, the component-wise ratio plotted against gene number (0–10000).]
Fig. 1. Top left: entries of the 9th column of the mixing matrix A as obtained with JADE. Top right: matrix H resulting from a decomposition into k = 4 metagenes using LNMF; rows 3 and 4 show a clear separation between columns 1,...,8 and columns 9,...,14. Bottom: component-wise ratio of row $x_9$ with $a_{n9} s_9$; the genes of [6] are marked.
ICA Analysis. The analysis of the 14 × 14 matrix A identified one column with |c(a₉, d)| = 0.89 (see Fig. 1). Hence s₉ should contain genes which provide diagnostic markers for the metastasis-forming ability of the cell lines considered. In [6], a list of 16 putatively informative genes is provided. As shown in Table 1, the expression levels (taken from S) across all M experiments of many of these genes exhibit a high correlation with the design vector d, indicating rather high individual discriminative power. Many of these genes show large negative expression levels in expression mode 9. An even more revealing picture appears if one divides, component-wise, the rows of the data matrix by the weighted row of the informative expression mode: the resulting diagram marks the genes which contribute most to the observation, and many of the genes listed by [6] stick out as informative here.

NMF Analysis. The same data set was also analyzed using the LNMF algorithm. The decomposition is very robust and highly accurate. Considering the correlation between any row of matrix H and the design vector d, a decomposition into k = 4 metagenes yielded two rows of the 4 × 14 matrix H which
Table 1. The correlation coefficient c of the gene vector s_n with the design vector d for the 16 genes suggested by [6]. 'number' denotes the column index in the data set X, 'affymetrix-id' the Affymetrix probe identifier; '—' marks genes missing in the reduced data set.

number  affymetrix-id  gene name  c-value
3611    204749-at      NAP1L3     -0.96
1586    201859-at      PRG1       -0.95
5007    209101-at      CTGF       -0.94
4311    207345-at      FST        -0.93
1585    201858-s-at    PRG1       -0.92
4529    208378-x-at    FGF5       -0.92
5532    209949-at      NCF2       -0.92
860     201041-s-at    DUSP1      -0.89
3694    204948-s-at    FST        -0.89
10480   222162-s-at    ADAMTS1    -0.86
6133    211919-s-at    CXCR4      -0.81
4233    206926-s-at    IL11       -0.57
3469    204475-at      MMP1       -0.47
4232    206924-at      IL11       -0.43
—       210310-s-at    FGF5       —
—       209201-x-at    CXCR4      —
Thus, a decomposition into a comparatively small set of metagenes perfectly displays the diagnostic structure of the breast cancer data set. For the sake of comparison, a decomposition into k = 20 metagenes revealed four informative meta-experiments and their related metagenes. A comparison of the ten most strongly expressed genes in each of the four identified metagenes shows that 5 genes were also identified in the case k = 4, while 7 genes were also identified with ICA and 9 genes were identified with an SVM [9] approach as well. These genes are spread over all four metagenes.
4.2 Leukemia Data Set
Two different types of diagnostic classes are present in this data set: a) the leukemia types ALL (experiments 1, ..., 27) and AML (experiments 28, ..., 38) of the training set, for which a putative reference list of informative genes is available from [10]; b) ALL leukemia can further be split into the subtypes ALL-B (exp. 1, ..., 19) and ALL-T (exp. 20, ..., 27).

ICA Analysis. During the whitening step of the JADE algorithm, a reduction (k < M) of the number of extracted expression modes can be performed. For k = 2, ..., 38, the maximal correlation coefficient c between any column vector of A_k and the design vectors d_i, i = 1, 2, for cases a) and b) was computed. These correlation coefficients peak at c = 0.86 and k = 17 in case 1 (AML vs. ALL), and at c = 0.97 and k = 12 in case 2 (ALL-T vs. ALL-B/AML). The first 17 principal components cover 93.5% of the total variance of the data set, while the first 12 principal components still represent 89.8% of the variance. In case 1, the 9-th column of matrix A shows a very pronounced discrimination between AML and ALL cells, and in case 2, the 8-th column of matrix A shows a very pronounced discrimination between the ALL-T cells and all others. In [10], a list of 24 significantly up-regulated and 24 significantly down-regulated genes is provided for the discrimination of the ALL-AML classes. Most of the down-regulated genes appear in the clusters derived
from expression mode 9neg, while the up-regulated genes are mostly contained in 9pos. Six of them even belong to the 10 most strongly expressed genes in expression modes 9neg and 9pos.
Fig. 2. Left: mixing coefficients a_{i8}, i = 1, ..., 38, resulting in a signature of column a_8 of matrix A with a high correlation to the design vector d_2. Right: corresponding signature of column a_9 for the discrimination of AML vs. ALL.
Concerning a discrimination of the subtypes ALL-B/ALL-T, a decomposition into k = 12 expression modes revealed a strong correlation of column 8 of matrix A_12 with the design vector d_2, see Fig. 2. The most strongly expressed genes in modes 8pos and 8neg, respectively, thus form diagnostic marker genes to differentiate between subtypes of ALL-type leukemia.

LNMF Analysis. In the case AML vs. ALL, class 1 was chosen to represent AML leukemia, while in the case ALL-T vs. ALL-B/AML, class 1 was considered to be ALL-T, and class 2 consequently comprised all ALL-B and AML probes.

Case AML vs. ALL: First, the structure of the meta-experiments related to the case AML vs. ALL was investigated. Correlations between the meta-experiments and the design vector d were determined for all k ≤ 400. Strong correlations (c > 0.85) could be obtained for several decompositions; the strongest was found for k = 100. Fig. 3 shows the most informative meta-experiment and its related metagene. From the latter, the 10 most strongly expressed genes are listed in Table 2. Six of them could also be identified with an SVM classifier [9] or by applying ICA, or belonged to the list published in [10].

Case ALL-T vs. ALL-B/AML: In the case ALL-B < ALL-T, at least one meta-experiment was found with c > 0.9 for any k > 20. In the reverse situation, ALL-T < ALL-B, highly correlated (|c| > 0.85) meta-experiments were found for all k > 70. Fig. 3 shows the result of a factorization into k = 300 metagenes. Only one informative metagene could be identified; its two most strongly expressed genes have also been identified as potential marker genes using ICA. In contrast to the breast cancer data set, only few informative meta-experiments could be identified with the leukemia data set. However, the related metagenes obtained for different decompositions were very consistent, as measured by their respective correlation coefficients. Evidently, the LNMF decomposition detects rather similar metagenes related to the diversity of the ALL-T and ALL-B subtypes when the number k of extracted metagenes is varied.
Fig. 3. Top: case AML vs. ALL, k = 100: signature of the most informative meta-experiment and the related metagene. Bottom: case ALL-T vs. ALL-B, k = 300: signature of the most informative meta-experiment and corresponding metagene.

Table 2. List of the 10 most strongly expressed genes of metagene 100 (* mentioned in the original paper [10], ** identified with an SVM classifier [9], *** identified by ICA). 'number' denotes the column index in the data set X.

gene nr.      number  affy-id
 1  *         215     M96326-rna1-at
 2  ***,*     4828    M27891-at
 3            1157    J04990-at
 4  **        3543    M19507-at
 5  **        1555    M84526-at
 6            1274    D88422-at
 7            651     M27783-s-at
 8  **,*      2646    X95735-at
 9  *         4870    M57710-at
10            4454    M20203-s-at

5 Conclusion
The application of matrix decomposition techniques like ICA and NMF to microarray data explores the possibility of extracting features like statistically independent expression modes or strictly positive and sparsely encoded metagenes. Combined with a design function reflecting the experimental protocol, biomedical knowledge is incorporated into the data analysis task, which allows the construction of a classifier for diagnostic purposes based on a global analysis of the whole data set, rather than on a statistical analysis of single gene expression levels. This global analysis is based on the columns (ICA) or rows (NMF) of a matrix which contains the weights with which the underlying expression modes or metagenes contribute to any given observation in response to an applied environmental stimulus. It was shown on two benchmark data sets that if the signature of these column or row vectors matches the experimental design vector, the related expression mode or metagene contains genes with a high discriminative power. These genes represent biomarkers for diagnostic purposes, and knowledge of such marker genes allows the construction of a simple and inexpensive chip for diagnostic purposes.
Acknowledgement. Support by grants from the DFG (Graduate College 638) and the DAAD (PPP Luso-Alemã) is gratefully acknowledged.
References
1. Baldi, P., Hatfield, W.: DNA Microarrays and Gene Expression. Cambridge University Press, Cambridge (2002)
2. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing. Wiley, Chichester (2002)
3. Diamantaras, K.I., Kung, S.Y.: Principal Component Neural Networks, Theory and Applications. Wiley, Chichester (1996)
4. Souloumiac, A., Cardoso, J.-F.: Blind beamforming for non-Gaussian signals. IEE Proceedings F 140, 362–370 (1993)
5. Golub, T., Mesirov, J.P., Brunet, J.-P., Tamayo, P.: Metagenes and molecular pattern discovery using matrix factorization. PNAS 101, 4164–4169 (2004)
6. Kang, Y., Siegel, P.M., Shu, A., Drobnjak, M., Kakonen, S.M., Cordón, C., Guise, Th.A., Massagué, J.: A multigenic program mediating breast cancer metastasis to bone. Cancer Cell 3, 537–549 (2003)
7. Lee, S.-I., Batzoglou, S.: Application of independent component analysis to microarrays. Genome Biology 4, R76.1–R76.21 (2003)
8. Zhang, H.J., Cheng, Q., Li, S.Z., Hou, X.W.: Learning spatially localized, parts-based representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2001)
9. Schachtner, R.: Machine Learning Approaches to the Analysis of Microarray Data. Diploma Thesis, University of Regensburg (2006)
10. Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S., Golub, T.R., Slonim, D.K.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
Perception of Transformation-Invariance in the Visual Pathway

Wenlu Yang 1,2, Liqing Zhang 1,*, and Libo Ma 1

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
[email protected], [email protected]
2 Department of Electronic Engineering, Shanghai Maritime University, Shanghai 200135, China
Abstract. Visual perception of transformation invariance, such as invariance to translation, rotation and scaling, is one of the important functions of visual information processing in the brain. To simulate this perceptual property, we propose a computational model for the perception of transformations. First, we briefly introduce the transformation-invariant basis functions learned from natural scenes using Independent Component Analysis (ICA). Then we use these basis functions to construct the perceptual model. By using the correlation coefficients of two neural responses as the measure of transformation invariance, the model is able to perform the task of transformation perception. Comparisons with the Bilinear Sparse Coding model of Grimes and Rao and the Topo-ICA model of Hyvärinen show that the proposed perceptual model has advantages such as simplicity of implementation and greater robustness to transformations. Computer simulation results demonstrate that the model successfully simulates the mechanism for visual perception of transformation invariance.
1 Introduction
We can recognize an object regardless of its distance, position or rotation. In mathematical terms, object recognition is not influenced by transformations such as translation, rotation or scaling. Much recent research in neuroscience, neurophysiology and psychology shows that such transformation-invariant preprocessing could be a necessary step to achieve transformation-invariant classification or detection in a hierarchical computational system. In this paper, we focus on the computational mechanism for transformation invariance and propose a hierarchical model that simulates this mechanism in the visual pathway. As a result of long-term biological evolution, this mechanism is closely related to the statistical properties of natural scenes. Along these lines, Barlow [1,2] argued that the role of early sensory neurons in the visual pathway is to remove statistical redundancy in the sensory inputs, suggesting that redundancy reduction is an important processing principle in the neural system. Based on this principle, Gabor-like features
* To whom correspondence should be addressed.
resembling the receptive fields of simple cells in the primary visual cortex (V1) have been derived either by imposing sparse over-complete representations [6] or statistical independence, as in Independent Component Analysis (ICA) [8]. However, these studies have not taken transformation invariance into account, and the question is how well this line of research predicts the full spatiotemporal receptive fields of simple cells. For example, when an image rotates within the receptive fields of simple cells, how do the simple cells and complex cells respond? Some researchers have begun to address this question. Hyvärinen and Hoyer [9,10] modelled the receptive fields of complex cells, and Van Hateren [12] obtained spatiotemporal receptive fields of complex cells. Grimes and Rao [14] proposed a bilinear generative model to study translation invariance. Berkes [7] investigated temporal slowness as a learning principle for receptive fields using slow feature analysis. However, there are few models in the literature that perceive transformations of objects or images. To investigate the problem, we apply ICA to learn transformation-invariant features from natural scenes, and then use these features to construct a model for transformation-invariant perception. The goal of the model is to perceive the transformation of patches from natural images. The rest of the paper is organized as follows. Section 2 introduces a method for learning transformation-invariant basis functions and then proposes a model for perception of transformation invariance. Section 3 demonstrates these basis functions and the perceptual simulation results. The final section compares the model with other related works and models.
2 The Invariance Perception Model
In this section, we first introduce the method for learning the transformation-invariant basis functions. Then we propose a perceptual model for perception of transformation invariance.
2.1 Method for Learning Invariant Basis Functions
To obtain transformation-invariant basis functions, the training data sets must themselves possess the corresponding transformation properties. The method for generating the training data is introduced in Sect. 3.1. Applying ICA to the training data, i.e., to sequences of patches with parameters α_i (i = 1, ..., M), yields transformation-invariant basis functions with the same parameters as the patches. To explain the method simply (see Fig. 1), we use the rotation transformation and the resulting rotation basis functions. We briefly introduce the ICA learning algorithm for training sparse basis functions. For the standard ICA model x = Wu, Cichocki et al. [5] used the Kullback-Leibler divergence between the distribution p(x; W) obtained with the actual value W and a reference distribution q(x) to define the cost function

R(x, W) = -\frac{1}{2} \log |\det(WW^T)| - \sum_{i=1}^{n} E[\log q_i(x_i)].    (1)
Fig. 1. Method for learning transformation-invariant basis functions. The input data is a set of natural images. Patches selected from the images are transformed and fed to the ICA algorithm.
Applying the natural gradient rule to this cost function, the learning algorithm for W (with corresponding basis functions A = W^{-1}) can be described [3,4] as

\Delta W = -\eta(t) \frac{\partial R}{\partial W} W^T W = \eta(t) \left[ I - \varphi(x(k)) x^T(k) \right] W,    (2)

where \varphi_i(x_i) = -q_i'(x_i)/q_i(x_i) and q(x_i) is a super-Gaussian probability density, for instance the Laplacian pdf.
2.2 Model for Perception of Transformation Invariance
In this section, we propose a model for transformation-invariant perception, shown in Fig. 2. The invariance perception model consists of three layers. The first layer receives the input patterns, which are two patches with parameters α_i and α_j, respectively. Here, α_i and α_j belong to the same group of parameters. For rotation, α ranges from 0 to 360 degrees in steps of 15 degrees; for scaling, α ranges from 1 to 2 times in steps of 10%; and for translation, α ranges over the size of the input images. For simplicity, we only discuss rotational samples and basis functions in detail. The middle layer of the model sparsely represents the input patterns with a group of basis functions, taken from one of the three sets of translational, rotational, and scaling bases shown in Figs. 3-5. After the neurons respond to the stimuli u_{α_i} at time t_1 and u_{α_j} at time t_2, the final layer of the model calculates the correlation coefficients between any two responses X^{t_1}_{α_i} (i = 1, ..., M) and X^{t_2}_{α_j} (j = 1, ..., M), of which the maximum is selected to determine the relative dispersion. The index (i, j) of the maximum in the coefficient matrix reveals the relative transformation, such as the counter-clockwise rotation angle Δθ, the translational distance Δd, or the scaling ratio
Fig. 2. Model for transformation-invariant perception. For example, the input patterns are the rotational data. x^{t_1}_{α_i,k} (k = 1, 2, ..., N) denotes the response of the k-th neuron in the row α_i to the stimulus u_{α_i} at time t_1 through the basis function α_{i,k}, and similarly x^{t_2}_{α_j,l} (l = 1, 2, ..., N) at time t_2. X^{t_1}_{α_i} denotes the vector of responses of the neurons in the row α_i to the stimulus u_{α_i} at time t_1 through the subset α_i of basis functions, namely X^{t_1}_{α_i} = [x^{t_1}_{α_i,1}, x^{t_1}_{α_i,2}, ..., x^{t_1}_{α_i,N}]^T.
Δr. Note that we only need the relative transformation, not the absolute values of the stimulus parameters. For rotation, if j ≥ i, Δθ = (j − i) × 360/M; otherwise, Δθ = (M + j − i) × 360/M. For translation, i and j have corresponding coordinates (x_i, y_i) and (x_j, y_j), respectively; the relative translation distance is Δd = \sqrt{(x_i − x_j)^2 + (y_i − y_j)^2}, and the moving direction follows from the relative position of (x_i, y_i) and (x_j, y_j). For scaling, if j ≥ i, Δr = r_j − r_i; otherwise, Δr = r_i − r_j, where r_j and r_i denote the scaling ratios of the j-th and i-th subsets of basis functions, respectively.
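The perception step of the final layer amounts to one correlation matrix and an argmax. A sketch for the rotation case (variable names are ours, not from the original model):

```python
import numpy as np

def relative_rotation(X_t1, X_t2, M=24):
    """X_t1, X_t2: response matrices of shape (M, N); row i holds the N neuron
    responses of parameter row alpha_i. Returns the perceived relative rotation
    angle in degrees."""
    # cross-correlations between every response vector at t1 and every one at t2
    C = np.corrcoef(X_t1, X_t2)[:M, M:]
    i, j = np.unravel_index(np.argmax(C), C.shape)
    delta = (j - i) % M               # counter-clockwise index difference
    return delta * 360.0 / M          # equals (j-i)*360/M, or (M+j-i)*360/M if j < i
```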
3 Simulations and Results
We present experimental results to verify the performance of the proposed model and the learning algorithm. First we present the basis functions of transformation invariance, including translation, rotation and scaling. Then, as an example, rotation-invariant perception is discussed.
3.1 Training Data
To learn basis functions from natural scenes, we sample sequences of small patches of size 10×10 from a set of large natural images using three types of transformations: translation, rotation, and scaling. These three data sets are used to learn the transformation-invariant basis functions. As an example, the sampling method for the rotational data set is described in detail as follows. A sampling window is randomly located on a large natural scene and a patch is selected. Then, keeping the same center, the sampling window is rotated clockwise by an interval of 15 degrees and another patch is sampled.
The window is rotated again and the next patch sampled, until twenty-four patches have been collected. The twenty-four patches are then reshaped into one column vector of size 2400×1, forming a single sample. We select patches from a set of large natural images by the above sampling methods and generate three data sets of 20000 samples each. All data sets are then low-pass filtered by reducing the dimension of the data vectors with principal component analysis (PCA), retaining the 100 principal components with the largest variances, after which the data are whitened by normalizing the variances of the principal components. These preprocessing steps are essentially similar to those used in [6,9].
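These preprocessing steps can be written compactly, e.g., with scikit-learn; this is only a sketch under the data layout described above, with one vectorized patch sequence per row.

```python
from sklearn.decomposition import PCA

def preprocess(samples, n_components=100):
    """samples: array of shape (n_samples, 2400), one patch sequence per row.
    Keeps the principal components with the largest variances and whitens,
    i.e., normalizes the variances of the retained components."""
    pca = PCA(n_components=n_components, whiten=True)
    return pca.fit_transform(samples)    # shape (n_samples, n_components)
```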
3.2 Transformation-Invariant Basis Functions
Using the translational, scaling, and rotational training data, respectively, yields translation-, scaling-, and rotation-invariant basis functions, shown in Figs. 3-5. From these figures, we note that the Gabor-like basis functions, which are localized, oriented, and bandpass, resemble the receptive fields of simple cells found in V1 [13]. Meanwhile, there are also different characteristics among the three types of basis functions, as follows.
Fig. 3. Subsets of translation-invariant basis functions
Fig. 4. Subsets of scaling-invariant basis functions
Translation-invariant basis functions. In Fig. 3, the larger rectangles, composed of 5 × 5 basis functions with the same orientation, are similar to the receptive fields of complex cells, which are activated while contents of the same orientation move within their receptive fields. Scaling-invariant basis functions. Fig. 4 shows that the basis functions in one row represent the receptive field of a complex cell which performs the perception of scaling invariance. Those in one row are subsets with the same scaling; the scaling interval is 10%. Rotation-invariant basis functions. In Fig. 5 (left), a group of basis functions in one row is similar to the receptive field of a complex cell which performs perception of rotation invariance. Neighboring basis functions in a row differ by an interval of fifteen degrees of counter-clockwise rotation.
Fig. 5. Subsets of rotation-invariant basis functions. Right: the basis functions (shown left) are rearranged counter-clockwise along the circumference with an interval of fifteen degrees. Every circle includes the basis functions of one row (left).
Basis functions in a column form a group whose elements are used to reconstruct the input patterns, given the corresponding activities of simple cells. For convenience of viewing the regularity, these basis functions are arranged counter-clockwise along the circumference with an interval of fifteen degrees, as shown in Fig. 5 (right). Every circle resembles the receptive field of a complex cell.
3.3 Perception Experiments
For any one of the three transformations (translation, rotation, and scaling), the same experimental method is used and the invariance results are easy to obtain. Here, invariance means that complex cells maintain their existing states while a patch is moving, rotating or scaling within their receptive fields. In other words, we can recognize the same object however it moves, rotates, or scales within our field of vision. An example of rotation perception is introduced, whose goal is to calculate the relative rotation angle. Following the perception model of Sect. 2.2, two input patterns with different rotation angles are needed. We randomly select two image patches u_{α_i} and u_{α_j} (e.g., i = 6, j = 11) from a sample which is composed of twenty-four patches, shown in Fig. 6. The sixth and eleventh stimuli represent rotation angles of 90 and 165 degrees, respectively. The sixth patch is input to the perception model at time t_1 and the eleventh at time t_2. We compute the responses of the neurons at times t_1 and t_2 and the matrix of correlation coefficients Coeff(X^{t_1}_{α_k}, X^{t_2}_{α_l}) (k, l = 1, 2, ..., M), here M = 24. We then find the maximum value in a row of the matrix and obtain its row and column index; for the first row, the index of the maximum is (1, 6). Hence the relative rotation angle is (6 − 1) × 15 = 75 degrees. Note again that we only need the relative transformation, not the absolute values of the stimulus angles. In more detail, the responses X^{t_1}_{α_6} and X^{t_2}_{α_{11}} at times t_1 and t_2 are plotted in the last column of Fig. 6, and the two are clearly very similar to each other. The difference between X^{t_1}_{α_6} and X^{t_2}_{α_{11}}, plotted in Fig. 6, shows
Fig. 6. Rotation-invariant perception
the rotation invariance of neuronal responses while the input pattern is rotating from time t1 to t2 .
4 Discussions and Conclusions
We have proposed a method for learning transformation-invariant basis functions and a model for perception of transformation invariance. Computer simulation results show that the proposed model works well in simulating the perceptual function of transformation invariance in the brain. Our model has several properties that distinguish it from others, such as bilinear generative models [14] and Topo-ICA [11]. First, the bilinear generative model [14] proposed by Grimes and Rao only studies translation invariance, whereas ours provides more transformation-invariant basis functions (translational, rotational, and scaling) and performs perception of all three types of transformations; moreover, our algorithm is much simpler, whereas that of the bilinear model is more complex. Second, the Topo-ICA model [11] of Hyvärinen et al. considers the second-order correlation of the responses of simple cells, but it cannot produce overcomplete basis functions because of its orthogonality constraints. Our future work will focus on learning other transformational basis functions, such as three-dimensional geometric transformations, and on transformational perception of more complex stimuli. We also plan to extend the model to a framework for learning other transformation-invariant basis functions and for perception of other transformations such as view changes.
Acknowledgment The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National High-Tech Research Program of China (Grant No. 252006AA01Z125).
References
1. Barlow, H.B.: Possible principles underlying the transformations of sensory messages. In: Rosenblith, W.A. (ed.) Sensory Communication, pp. 217–234. MIT Press, Cambridge (1961)
2. Barlow, H.B.: Redundancy reduction revisited. Network: Computation in Neural Systems 12, 241–253 (2001)
3. Zhang, L., Cichocki, A., Amari, S.: Natural Gradient Algorithm for Blind Separation of Over-determined Mixture with Additive Noise. IEEE Signal Processing Letters 6(11), 293–295 (1999)
4. Zhang, L., Cichocki, A., Amari, S.: Self-Adaptive Blind Source Separation Based on Activation Function Adaptation. IEEE Transactions on Neural Networks 15(2), 233–244 (2004)
5. Cichocki, A., Zhang, L.: Two-stage blind deconvolution using state-space models. In: Proceedings of the Fifth International Conference on Neural Information Processing, Kitakyushu, Japan, pp. 729–732 (1998)
6. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 3311–3325 (1997)
7. Berkes, P., Wiskott, L.: Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision 5, 579–602 (2005)
8. Bell, A.J., Sejnowski, T.J.: The independent components of natural scenes are edge filters. Vision Research 37, 3327–3338 (1997)
9. Hyvärinen, A., Hoyer, P.O.: Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation 12(7), 1705–1720 (2000)
10. Hoyer, P.O., Hyvärinen, A.: A multi-layer sparse coding network learns contour coding from natural images. Vision Research 42(12), 1593–1605 (2002)
11. Hyvärinen, A., Hoyer, P.O.: Emergence of Topography and Complex Cell Properties from Natural Images using Extensions of ICA. Advances in Neural Information Processing Systems (NIPS 1999) 12, 827–833 (2000)
12. Van Hateren, J.H., Ruderman, D.L.: Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proc. R. Soc. London B 265, 2315–2320 (1998)
13. Hubel, D.H., Wiesel, T.N.: Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London) 195, 215–243 (1968)
14. Grimes, D.B., Rao, R.P.N.: Bilinear sparse coding for invariant vision. Neural Computation 17(1), 47–73 (2005)
Subspaces of Spatially Varying Independent Components in fMRI

Jarkko Ylipaavalniemi and Ricardo Vigário

[email protected]
Adaptive Informatics Research Centre, Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland
Advanced Magnetic Imaging Centre, Helsinki University of Technology, P.O. Box 3000, FI-02015 TKK, Finland
Abstract. In contrast to the traditional hypothesis-driven methods, independent component analysis (ICA) is commonly used in functional magnetic resonance imaging (fMRI) studies to identify, in a blind manner, spatially independent elements of functional brain activity. ICA is particularly useful in studies with multi-modal stimuli or natural environments, where the brain responses are poorly predictable, and their individual elements may not be directly relatable to the given stimuli. This paper extends earlier work on analyzing the consistency of ICA estimates, by focusing on the spatial variability of the components, and presents a novel method for reliably identifying subspaces of functionally related independent components. Furthermore, two approaches are considered for refining the decomposition within the subspaces. Blind refinement is based on clustering all estimates in the subspace to reveal its internal structure. Guided refinement, incorporating the temporal dynamics of the stimulation, finds particular projections that maximally correlate with the stimuli.
1 Introduction
Functional magnetic resonance imaging (fMRI) is one of the most successful methods for studying the living human brain. Traditionally, fMRI analysis relies on artificially generated stimuli, coupled with hypothesis-driven statistical signal processing (cf., [1]). Independent component analysis (ICA) (see, e.g., [2]) of fMRI data, as first proposed in [3], has recently gained considerable attention for its ability to blindly decompose the measured brain activity into spatially independent functional elements. The corresponding mixing vectors reveal the temporal dynamics of each element. However, the individual elements are often not directly relatable to a given stimulus. This is particularly true in studies using multi-modal stimuli, such as in natural environments, where the brain responses are poorly predictable. Furthermore, it has been proposed that such functional elements can participate in varying networks, to perform complex tasks [4].
The optimization landscape of ICA is defined by the structure of the data, the noise, and the objective function used. The landscape can form elongated or branched valleys containing many strong points, instead of singular local optima. Previous studies [5,6] have analyzed the consistency of independent components and suggested that some components can have a characteristic variability. The goal was to provide additional insight into the components that is not attainable with single-run approaches. Complex valleys can also be considered as separate subspaces, where statistical independence is not necessarily the best objective for decomposition. In this paper, we present a novel method to reliably identify subspaces formed by independent components, and illustrate two approaches to further refine the decomposition into functionally meaningful components. The subspace detection is based on analyzing the spatial variability under a consistent ICA framework similar to that of the previous studies. The subspaces reveal connections between the individual functional elements. One refinement method uses clustering to distinguish the internal structure of the subspace. Another is based on finding the coordinate system inside the subspace that maximally correlates with the temporal dynamics of the stimulation; the directions are found with canonical correlation analysis (CCA) [7]. Related canonical correlation approaches have recently been suggested for fMRI (see, e.g., [8,9,10]); however, their goals have been to utilize several stimulation time-courses to simply rank the individual components found by ICA, or to extend purely hypothesis-driven methods into multivariate analyses.
2 Materials and Methods
The analysis uses data from a recent fMRI study carried out by Malinen et al. at the Advanced Magnetic Imaging Centre [11]. The study combined auditory, visual, and tactile stimuli in a continuous manner. The stimuli were presented in 6–33 s blocks, with no resting periods in between. Fig. 1 illustrates the block design of the sequence, which has a duration of 8 min 15 s.
2.1 Measured and Preprocessed fMRI Data
The recordings, thoroughly described in [11], were made with a Signa VH/i 3.0 T MRI scanner (General Electric, Milwaukee, WI, USA). Functional images were acquired using gradient echo-planar-imaging sequence (TR 3 s, TE 32 ms, matrix 64 × 64, 44 oblique axial slices, voxel size 3 × 3 × 3 mm3 , FOV 20 cm, flip angle 90◦ ) producing 165 volumes including 4 dummy scans, which were excluded from further analysis. Structural images were scanned with 3-D T1 spoiled gradient imaging (TR 9 ms, TE 1.9 ms, matrix 256 × 256, slice thickness 1.4 mm, FOV 26 cm, flip angle 15◦ , preparation time 300 ms, number of excitations 2). Preprocessing of the data using SPM2 [12] included realignment, normalization and smoothing with a 6 mm (full-width half maximum) Gaussian filter. Skull stripping was also performed. For further details, see [11].
2.2 Consistent Spatial ICA
Independent component analysis is one of the most popular methods for solving the blind source separation (BSS) problem. It consists of finding solutions to the mixture X = AS, where only the observed data X is known. ICA assumes only statistical independence of the sources S, and full rank of the mixing A. In the context of fMRI, independence is considered in the spatial domain, and the mixing reveals the temporal activation patterns of the corresponding sources. The reliable ICA approach, proposed in [5], is based on multiple runs of FastICA [13] in a bootstrapping framework, i.e., with resampled data and randomized initializations. In this study, FastICA was run 100 times with tanh nonlinearity in symmetric mode. On each run, the data was whitened to 80 dimensions and 40 independent components were extracted. The estimated mixing vectors from all runs were normalized to have zero mean and unit variance, and grouped using correlation. The correlation matrix was thresholded by 0.85 and raised to a power of 4 (see [5] for further details). The parameter values were selected heuristically. Starting with a few dimensions, the dimensionality was increased until the new components were all overfits, appearing only once. Similarly, starting with a high value, the correlation was lowered as long as the most consistent components, appearing 100 times, did not split into many groups.
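A condensed sketch of this resampling scheme is given below. It uses scikit-learn's FastICA in place of the original implementation [13], collapses the two-stage whitening (80 dimensions, 40 components) into a single step, and reports, for each estimate, only how many estimates overall are similar, rather than reproducing the full grouping procedure of [5]; all names are ours.

```python
import numpy as np
from sklearn.decomposition import FastICA

def consistent_ica(X, n_runs=100, n_components=40, threshold=0.85, seed=0):
    """X: data of shape (channels, samples). Runs FastICA repeatedly on
    bootstrap-resampled data with random initializations."""
    rng = np.random.default_rng(seed)
    mixing = []
    for r in range(n_runs):
        idx = rng.integers(0, X.shape[1], X.shape[1])   # resample time points
        ica = FastICA(n_components=n_components, fun="logcosh",
                      algorithm="parallel", random_state=r)
        ica.fit(X[:, idx].T)
        A = ica.mixing_                                 # channels x components
        A = (A - A.mean(axis=0)) / A.std(axis=0)        # zero mean, unit variance
        mixing.append(A)
    A_all = np.concatenate(mixing, axis=1)
    C = np.abs(np.corrcoef(A_all.T))                    # similarity of estimates
    counts = (C > threshold).sum(axis=1) - 1            # similar estimates elsewhere
    return A_all, counts
```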
2.3 Subspace Canonical Correlation Analysis
The emergence of a subspace in ICA means that the coordinate system within the subspace cannot be identified based solely on statistical independence. Even if there is a strong relation between the subspace as a whole and the stimulation, this relation may not be readily visible as a high correlation between any given component and the stimuli. Canonical correlation analysis seeks covariations between two spaces; in the current work, these are the independent subspace and the stimulation design. Such a relation is found through maximally correlated linear transformations of both spaces. Let Y be a set of columns of the mixing matrix A, corresponding to an independent subspace, and Z the set of stimulation time-courses. The goal of CCA is to maximize corr(W_y^T Y, W_z^T Z) with respect to the transformation projections W_y and W_z. As a result, the coordinate system within the subspace is fixed according to maximal correlation with the stimuli, rather than independence.
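With scikit-learn's CCA, this refinement can be sketched as follows, where Y holds the mixing-matrix columns of an independent subspace and Z the stimulation time-courses, both with one time point per row; the function name is ours.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def refine_subspace(Y, Z, n_components=3):
    """Y: (T, d) mixing columns of an independent subspace; Z: (T, s) stimulus
    time-courses. Returns the canonical projections and their correlations."""
    cca = CCA(n_components=n_components)
    Yc, Zc = cca.fit_transform(Y, Z)   # maximally correlated projections
    corrs = [np.corrcoef(Yc[:, i], Zc[:, i])[0, 1] for i in range(n_components)]
    return Yc, Zc, corrs
```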
3 Results
Fig. 2 shows a set of independent components (ICs) strongly related to auditory stimulation. Each IC is consistent, appearing in all or most of the 100 runs. The mixing variability is also minimal. However, the spatial variance reveals a coincident location of variability, shared by all ICs. The variability links the ICs into a three-dimensional subspace, even though ICA has consistently identified directions within the subspace.
Fig. 1. Stimulation block design with hemodynamically convolved time-courses. (a) Auditory stimulation with tone pips, spoken history text and spoken instruction text (represented by red, green and blue in the color version). (b) Visual stimulation with scenes dominated by buildings, faces and hands (represented by red, green and blue in the color version). (c) Tactile stimulation.
Fig. 2. A set of independent components (IC 4: count 100, skew 5.91; IC 6: count 99, skew 3.20; IC 26: count 87, skew 3.48) identified as a subspace through the shared variance, with strongly auditory stimulus-related time-courses. (a) The mean spatial maps and time-courses of each component. (b) The spatial variance maps of the corresponding components. A sagittal, coronal and axial slice of each volume is shown with the histogram of the mean volume. Consistency counts and skewness of the histograms are shown as text, and the reference blocks for the time-courses are from Fig. 1(a).
Fig. 3. A set of linear combinations that maximize the correlation between the mean time-courses of the subspace components shown in Fig. 2 and the stimulation time-courses shown in Fig. 1(a) (canonical correlations: CC 1: 0.8, CC 2: 0.5, CC 3: 0.2). (a) The time-courses combined from the stimulation design. (b) The spatial maps and time-courses of the corresponding, maximally correlated, combinations of the independent components. Other details as in Fig. 2.
Fig. 4. A set of independent components (IC 9: count 100, skew 3.97; IC 21: count 51, skew 2.58; IC 22: count 100, skew 4.61) identified as a subspace through the shared variance, with weakly stimulus-related time-courses. Other details as in Fig. 2, except no reference blocks are shown.
The subspace in Fig. 2 was further analyzed with CCA using all auditory references, shown in Fig. 1(a). Fig. 3 shows the canonical components (CCs) identified within the subspace. Compared to the ICs, the CCs reveal the best stimulation-matching decomposition within the subspace. A thorough physiological interpretation of the results is out of the scope of this paper, but the
Fig. 5. A set of independent components (IC 13: count 80, skew 2.74; IC 17: count 77, skew 2.69; IC 19: count 38, skew 2.10; IC 29: count 38, skew 2.29; IC 45: count 4, skew 1.21) identified as a subspace through the shared variance, with transiently stimulus-related time-courses. Other details as in Fig. 2, except no reference blocks are shown.
Fig. 6. One independent component (IC 35: count 474, skew 5.20), identified as a subspace through overall variability, with strongly visual stimulus-related time-course. (a) The mean spatial map and time-course of the component. (b) The spatial variance map of the component. (c) The mean spatial maps and time-courses of the components from reclustering within the subspace (8 sub-components, with consistency counts 111, 98, 78, 73, 42, 25, 19 and 11). (d) The spatial variance maps of the corresponding components. Other details as in Fig. 2, except the reference blocks are from Fig. 1(b).
decomposition appears refined. The first, and highest correlating, CC depicts a baseline of activity related to all types of auditory stimulation. The second CC reveals a clear deviation from the baseline, occurring during the tone pip stimuli. It includes two brain regions, associated with auditory processing, having opposite signs in the spatial map. The last CC appears quite scattered, containing most of the activity within the subspace that is not explained by the other two CCs, as indicated by the low correlation.
Another example of a subspace linked through spatial variance is shown in Fig. 4; it appears weakly stimulus-related. The last IC in the subspace presents a potential artifact, with a sharp peak at a single time instance. Fig. 5 shows a more complex set of activity, also identified as a subspace by the shared spatial variance. In this case, the ICs themselves are less consistent and have considerable mixing variability. As no single component appears in all 100 runs, ICA cannot identify consistent directions within the subspace. Some of the ICs are weakly stimulus-related, so a meaningful coordinate system inside the subspace could be fixed with CCA. However, the given stimulus design is not rich enough to decompose the 5-dimensional subspace. The last example, shown in Fig. 6, is identified as a subspace already by the consistent ICA method. The strong mixing variability, together with the count of 474 estimates, suggests that ICA can separate the subspace from the other components, but roughly 5 arbitrary directions from the subspace appear on each run. Additionally, the spatial variance coincides with the component itself, rather than being shared with other ICs. To further analyze the consistency of the strongly stimulus-related subspace, the 474 estimates within the subspace were clustered again, now using a higher threshold of 0.95. Fig. 6 also shows the set of 8 most consistent directions within the subspace. The directions are not strictly independent, since the clustering does not take into account from which run the estimates are taken. The subspace directions appear functionally meaningful, representing separate brain regions of the visual processing stream, including the primary visual cortex and other areas along the occipital lobes. Again, with a richer set of stimulus references, CCA could offer further refinement. In addition to the illustrated subspaces, several others were identified, either through the overall variability of the components or by their shared spatial variance. The complete set of 46 consistent ICs also included several that were not part of a subspace.
4 Conclusions
Analyzing the variability of independent components, under a consistent ICA framework, can reveal characteristic information related to the underlying phenomena that is otherwise not visible. As shown by the results, components can be roughly divided into 3 classes based on spatial variance: individual and consistent components, with distributed variance due to noise; consistent members of a subspace, with focal variance coincident with the variance of the other members (see Fig. 2); and inconsistent subspaces, with variances coincident with their own mean (see Fig. 6). Such subspaces can provide information on networks of related activity in a purely data-driven manner. Directions within each subspace can be further refined either blindly, by clustering them into semi-independent constituents, or by using CCA with additional data. More than just refining the subspace decomposition, CCA provides a direct link to the set of related stimuli. However, the use of CCA is limited by the richness of the stimulation design. A more supervised approach was recently
presented, with the goal of relating networks of brain activity with given complex stimulus features [4].

Acknowledgments. We thank Sanna Malinen and Riitta Hari for discussions on the subject of the paper, and both, together with Yevhen Hlushchuk, for the data. All are from the Advanced Magnetic Imaging Centre (AMI Centre) and the Brain Research Unit in the Low Temperature Laboratory of Helsinki University of Technology.
References
1. Haacke, E.M., Brown, R.W., Thompson, M.R., Venkatesan, R.: Magnetic Resonance Imaging: Physical Principles and Sequence Design, 1st edn. Wiley-Interscience, New York (1999)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis, 1st edn. Wiley-Interscience, New York (2001)
3. McKeown, M.J., Makeig, S., Brown, G.G., Jung, T.P., Kindermann, S.S., Bell, A.J., Sejnowski, T.J.: Analysis of fMRI Data by Blind Separation Into Independent Spatial Components. Human Brain Mapping 6(3), 160–188 (1998)
4. Ylipaavalniemi, J., Savia, E., Vigário, R., Kaski, S.: Functional Elements and Networks in fMRI. In: Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN 2007), Bruges, Belgium, pp. 561–566 (April 2007)
5. Ylipaavalniemi, J., Vigário, R.: Analysis of Auditory fMRI Recordings via ICA: A Study on Consistency. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2004), Budapest, Hungary, vol. 1, pp. 249–254 (July 2004)
6. Ylipaavalniemi, J., Mattila, S., Tarkiainen, A., Vigário, R.: Brains and Phantoms: An ICA Study of fMRI. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 503–510. Springer, Heidelberg (2006)
7. Timm, N.H.: Applied Multivariate Analysis, 1st edn. Springer, New York (2002)
8. Friman, O., Carlsson, J., Lundberg, P., Borga, M., Knutsson, H.: Detection of neural activity in functional MRI using canonical correlation analysis. Magnetic Resonance in Medicine 45(2), 323–330 (2001)
9. Youssef, T., Youssef, A.B.M., LaConte, S.M., Hu, X.P., Kadah, Y.M.: Robust ordering of independent components in functional magnetic resonance imaging time series data using canonical correlation analysis. In: Proceedings of SPIE Medical Imaging: Physiology and Function: Methods, Systems, and Applications, San Diego, CA, vol. 5031, pp. 332–340 (February 2003)
10. Hardoon, D.R., Mourão-Miranda, J., Brammer, M., Shawe-Taylor, J.: Unsupervised fMRI Analysis. In: NIPS Workshop on New Directions on Decoding Mental States from fMRI Data, Whistler, Canada (December 2006)
11. Malinen, S., Hlushchuk, Y., Hari, R.: Towards natural stimulation in fMRI – Issues of data analysis. NeuroImage 35(1), 131–139 (2007)
12. SPM2: MATLAB Package (2002), http://www.fil.ion.ucl.ac.uk/spm
13. FastICA: MATLAB Package (1998), http://www.cis.hut.fi/research/ica/fastica
Multi-modal ICA Exemplified on Simultaneously Measured MEG and EEG Data

Heriberto Zavala-Fernandez 1, Tilmann H. Sander 2, Martin Burghoff 2, Reinhold Orglmeister 1, and Lutz Trahms 2

1 Technische Universität Berlin, Fachgebiet Elektronik und medizinische Signalverarbeitung, Einsteinufer 17, 10587 Berlin, Germany
[email protected]
2 Physikalisch-Technische Bundesanstalt, Abbestr. 2-10, 10587 Berlin, Germany
Abstract. A multi-modal linear mixing model is suggested for simultaneously measured MEG and EEG data. On the basis of this model, an ICA decomposition is calculated for a combined MEG and EEG signal vector using the TDSEP algorithm. A single modality demixing procedure is developed to classify ICA components as multi-modality sources detected by EEG and MEG simultaneously, or as single-mode sources. Under this premise, data from 10 subjects are analysed and four exemplary types of sources are selected. We found that these sources represent physically meaningful multi- and single-mode signals: alpha oscillations, heart activity, eye blinks, and slow signal drifts.
1 Introduction
Magnetoencephalography (MEG) and electroencephalography (EEG) are non-invasive methods to measure the electrical activity of groups of neurons in the human brain with millisecond temporal resolution. This electrical activity leads to potential differences at the scalp (EEG) and a magnetic field outside the head (MEG). It is well known that the sensitivity profiles of EEG and MEG are quite different with respect to the orientation and the location of neural currents. This is utilized in studies such as [1,2,3,4], where the aim was an improved source localization by a combined MEG and EEG data analysis. The value of applying ICA to single modality data, i.e., either MEG or EEG, has been demonstrated frequently (see [5] for an overview). Here we apply ICA specifically to combined EEG and MEG datasets and evaluate a different mixing model compared to the subspace channel ICA approach in [6]. Given the results of an ICA applied to multi-modal data sets, it is important to determine whether a component represents a source recorded by both modalities, MEG and EEG, or a source recorded exclusively by one of the modalities.
This work was supported by the DAAD (German Academic Exchange Service) scholarship for H.Z.F (A0421558) and the Berlin Neuroimaging Centre (BMBF 01GO 0503 BNIC).
Answering this question post hoc for a mixing matrix A estimated from experimental data is not easy, due to the permutation problem of ICA and the lack of a common scale for multi-modal data sets such as combined MEG and EEG data. In this context we propose a way of checking the origin of the estimated sources without modifying the ICA algorithm itself.
2 Multi-modal ICA

2.1 Mixing Model for Multi-modal Data Sets
The basic model of ICA is well known: the signals x(t) are described as the product of an unknown mixing matrix A and an unknown source vector s(t). ICA aims to estimate A from statistical properties of x(t). With an estimate of A, the source signals can be computed from

s(t) = A^{-1} x(t) = W x(t).    (1)
Extending the basic model to the multi-modal case of two simultaneously measured quantities, EEG and MEG with related content, the data vector x(t) now reads (a superscript M or E always denotes MEG respectively EEG; it does not denote a power):

x^{M,E}(t) = \begin{pmatrix} x^M(t) \\ x^E(t) \end{pmatrix}.    (2)

For the mixing process at least two cases have to be considered. Firstly, the number of sources is known and less than the sum of the m MEG and n EEG channels. In this case the mixing matrix A contains m + n rows and as many columns as there are sources. This case is discussed generally in [6]. It is well known that single brain sources can contribute to MEG and EEG signals at the same time. Nevertheless, the number of brain sources is generally not known, and considering the strong background and noise signals in MEG and EEG data, the more general case of a square matrix of dimension (m + n, m + n) is considered here. In case the MEG and EEG signals are without common sources, the combined mixing matrix has to have the following block form:

A = \begin{pmatrix} A^M & 0 \\ 0 & A^E \end{pmatrix},    (3)

where A^M ∈ R^{m×m} for the magnetic and A^E ∈ R^{n×n} for the electric sources. Due to an arbitrary permutation of the columns of A estimated by ICA and the lack of a common physical scale for MEG and EEG data, the simple form (3) cannot be compared directly to a matrix A resulting from experimental data. Naturally, the more general case of common sources in MEG and EEG data is of interest here, and the mixing matrix has the general form

A = \begin{pmatrix} C^M & D^M & 0 \\ C^E & 0 & D^E \end{pmatrix},    (4)
where the inner matrices C^M and C^E with columns c_j represent sources recorded by both modalities, and therefore a gain in information in comparison to a separate analysis (3). Similarly, the columns d_k and d_l of the inner matrices D^M and D^E, respectively, are sources recorded exclusively by one of the modalities. Consequently, j + k + l = m + n. The columns a_i of A now represent in the upper part, rows 1 ... m, a so-called MEG map, i.e., sensor weights for the MEG channels, and in the lower part, rows m + 1 ... m + n, an EEG map.
2.2 Single Modality Demixing
To characterize whether a multi-modal ICA component corresponds in fact to a single-modality signal, we devised a procedure termed "single modality demixing". The aim is to avoid a rescaling of the experimental data while removing the inherent ICA scaling ambiguity between the mixing matrix A and the sources s(t). Therefore, a first step is to move all signal energy into the mixing matrix. This is done by creating sources s'(t) that have a root-mean-square (rms) value of unity: s_i(t) = b_i · s'_i(t), where rms(s'_i) = 1 (equivalent to scaling the distribution of the observed variable to unit variance). Combining the b_i into a diagonal matrix B, the sources s(t) are replaced by the s'(t), and x(t) = A · B · s'(t) = A' · s'(t). Multiplying the mixing matrix A by B thus yields a new mixing matrix A', which is meant in the text from now on. This rescaling of the mixing matrix should not be confused with the whitening procedure used as a first step in most ICA algorithms. Single modality demixing is now defined as setting either x^M or x^E in (2) to zero and then applying (1) to obtain the single modality demixed time series s^{M,E=0}(t) respectively s^{M=0,E}(t):

s^{M,E=0} = W \cdot \begin{pmatrix} x^M \\ 0 \end{pmatrix}, \qquad s^{M=0,E} = W \cdot \begin{pmatrix} 0 \\ x^E \end{pmatrix}.    (5)
This single modality demixing is motivated by the block mixing form in (3): in that case only the MEG sources are non-zero for s^{M,E=0}, and vice versa for s^{M=0,E}. These single modality demixed time series can then be compared with each other or with the time series s^{M,E} from the full demixing. The ICA component vector a_i is not affected by the single modality demixing. Single modality demixing should also not be confused with completely separate ICA calculations for MEG and EEG. The rms values of s^{M,E=0}(t) and s^{M=0,E}(t) can be computed and compared with each other and with rms = 1 of s'(t), the original source; the rms of a single modality demixed time series indicates the relative contribution of the source to the MEG and EEG data. Another measure is the correlation between the single modality demixed time series and the time series resulting from the full demixing. The single modality demixing approach can be exemplified assuming two MEG channels and one EEG channel and then depicting the demixing matrix
W. Assuming w_{33} = 0 in (6), it follows that s_3^{M,E} = w_{31} x_1^M + w_{32} x_2^M irrespective of x_3^E, i.e., s_3 represents a pure MEG source.
\begin{pmatrix} s_1^{M,E}(t) \\ s_2^{M,E}(t) \\ s_3^{M,E}(t) \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix} \cdot \begin{pmatrix} x_1^M(t) \\ x_2^M(t) \\ x_3^E(t) \end{pmatrix}    (6)
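In code, single modality demixing is simply a matter of zeroing one block of the stacked data vector before applying W. The sketch below (our naming, not code from the original study) also returns the rms values used later in Table 1.

```python
import numpy as np

def single_modality_demix(W, x_meg, x_eeg):
    """W: (m+n, m+n) demixing matrix estimated on the stacked MEG/EEG data.
    x_meg: (m, T) MEG channels; x_eeg: (n, T) EEG channels."""
    s_full = W @ np.vstack([x_meg, x_eeg])                 # full demixing
    s_meg = W @ np.vstack([x_meg, np.zeros_like(x_eeg)])   # EEG block zeroed
    s_eeg = W @ np.vstack([np.zeros_like(x_meg), x_eeg])   # MEG block zeroed
    rms = lambda s: np.sqrt((s ** 2).mean(axis=1))
    return s_full, s_meg, s_eeg, rms(s_meg), rms(s_eeg)
```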
3 Experimental Procedure

3.1 MEG and EEG Data Acquisition
Using a typical recording configuration, i.e., m > n, the MEG was measured using an m = 93 channel SQUID system (ET160 Eagle Systems) installed in a shielded room. Within the MEG helmet, the EEG was recorded from n = 28 sites at positions given by the international 10-20 system for electrode placement. The MEG and EEG channels were connected to a 128 channel data acquisition system to ensure a simultaneous recording. Ten healthy subjects had to listen to 30 tones of 2 kHz and 30 tones of 1 kHz presented in random order with a random interstimulus interval in a session lasting 305 s. This task was part of a larger set of measurements investigating the auditory N100 in relation to attention; the results with respect to the N100 are not discussed here. The data were recorded with a sampling rate of 2 kHz and then off-line down-sampled to 250 Hz. The data were off-line filtered with a zero-phase-delay band pass from 0.5 to 40 Hz, preserving typical brain signals.
3.2 Data Processing Using TDSEP-ICA
The Second Order Blind Identification algorithm (SOBI [7], TDSEP [8]) is suitable for data with a rich temporal structure and consequently colored spectra. The basis of the TDSEP algorithm is a set of time-lagged covariance matrices R_x(τ) = ⟨x(t + τ) · x^T(t)⟩ with τ ≠ 0. For independent sources these matrices have to be diagonal. To estimate the sources, a joint diagonalization of the time-lagged covariance matrices is performed. Using a set of τ values is an empirically established procedure to avoid an inferior source separation, as there is no theoretically proven choice of τ values. The TDSEP ICA algorithm was applied to the combined MEG and EEG data vector given in (2). As time delays, the set τ = 1, 5, 10, 15, ..., 500 was chosen in the TDSEP calculation, and consequently a set of 101 time-lagged covariance matrices had to be simultaneously diagonalized. The calculation was performed for each subject separately. Given that TDSEP probes the temporal structure of signals by correlating EEG and MEG signals with each other, any phase difference due to the technical recording equipment has to be avoided. This was tested, and no delay was found between any of the channels of the two modalities.
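The core quantities of TDSEP are easy to form; the joint diagonalization itself is more involved and left abstract here, so `joint_diagonalize` below is a placeholder, not a library function. A sketch for whitened data, with all names ours:

```python
import numpy as np

def lagged_covariances(X, taus=None):
    """X: whitened data of shape (channels, samples). Returns the set of
    symmetrized time-lagged covariance matrices R_x(tau) = <x(t+tau) x(t)^T>."""
    if taus is None:
        taus = [1] + list(range(5, 501, 5))   # the 101 lags used in the text
    T = X.shape[1]
    covs = []
    for tau in taus:
        R = X[:, tau:] @ X[:, :T - tau].T / (T - tau)
        covs.append(0.5 * (R + R.T))          # symmetrize before diagonalization
    return covs

# W = joint_diagonalize(lagged_covariances(X))  # placeholder joint diagonalizer
```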
Fig. 1. Occipital alpha component of the multi-modal ICA applied to simultaneously measured MEG and EEG data. The left column shows the MEG and EEG part of the component vector, the top time trace of the middle column is a short section of the component time series sM,E, and the spectrum is shown to the right. The other two time series and spectra are the single modality demixed signals (sM,E=0 and sM=0,E, see text).
4 Results
The time series and power spectra of the single modality demixed signals, sM,E=0 and sM=0,E respectively, are shown below the data of sM,E in Figs. 1-3. Note that the multi-modal approach excludes the use of terminology like EOG or MOG. In the multi-modal case different physical quantities are combined in the mixing matrix, and consequently the ICA time series is dimensionless. A physical unit can be given for the scaling of the MEG and EEG maps, as indicated in Figs. 1-3. Figure 1 presents a component in which both single modality demixed signals have spectra with 10 Hz peaks, and the time series are similar to the fully demixed signal. The 3 s section of the associated time series sM,E (top of middle column) shows the corresponding amplitude modulation of about 10 Hz. A quantitative comparison is given in Table 1, where the rms values of the single modality demixed signals and the correlations ρ between the signals are mean values for the group of subjects. The rms values are about 0.6 and 0.7, indicating an equal contribution of MEG and EEG for the component of Fig. 1. The correlations between sM,E and the single modality demixed signals are fairly high, in agreement with the visual impression from Fig. 1. Interestingly, the correlation between sM,E=0 and sM=0,E is fairly small. Considering that one row of (6) reads $s_1^{M,E} = w_{11} x_1^{M} + w_{12} x_2^{M} + w_{13} x_3^{E}$, the rms values of sM,E=0 and sM=0,E and the correlation between them cannot be close to one at the same time, as this would contradict the normalization rms(sM,E) = 1.0. The ICA component shown in Fig. 1 has MEG and EEG part maps (left side of Fig. 1) with peaks at the back of the head. The most common spontaneous oscillatory signal detected in normal subjects is an occipital alpha wave, which has a spectral range between 8 and 13 Hz and
Table 1. Quantitative results for the time signals obtained by single modality demixing (ALPHA = alpha waves, HEART = heart beat, EYE = eye movements, DRIFT = slow signal drifts). The top two rows are mean rms values of the single modality demixed sources for the four types of components and the group of subjects (standard deviations indicated); the bottom three rows are correlations ρ between the time signals, including the time series obtained by a full demixing.

Type              ALPHA              HEART              EYE                DRIFT
rms(sM,0)         0.5983 ± 0.1697    0.9881 ± 0.0610    0.4029 ± 0.1188    1.0006 ± 0.0022
rms(s0,E)         0.7195 ± 0.1197    0.4716 ± 0.1908    0.8006 ± 0.1155    0.0806 ± 0.0150
ρ(sM,0, s0,E)     0.1867 ± 0.2736    −0.2403 ± 0.0750   0.3400 ± 0.3172    −0.0453 ± 0.0299
ρ(sM,E, sM,0)     0.7131 ± 0.1254    0.8665 ± 0.1228    0.6530 ± 0.1926    0.9966 ± 0.0012
ρ(sM,E, s0,E)     0.7999 ± 0.1464    0.2346 ± 0.1715    0.9232 ± 0.0529    0.0352 ± 0.0206
can be seen in relaxed awake subjects. The occipital lobes are believed to be the source of alpha activity [9]. Comparing these properties with the features of the data of Fig. 1, this ICA component can be attributed to alpha oscillations. The fact that the MEG and EEG part maps are orthogonal to each other indicates that the magnetic and electric fields are due to the same current configuration.
Fig. 2. Heart beat component of the multi-modal ICA. The figure is structured as Fig. 1. The map of the MEG part is typical for heart activity, and the sM,E and sM,E=0 time series show the heart beat. It is notably absent in sM=0,E.
Figure 2 shows a component with a time series that obviously reflects the QRS activity of the heart. Heart activity is an inherent artifact known to be present in MEG [10,12]. In contrast, the QRS complex is not visible in most raw EEG data and is of little relevance for EEG studies [11]. In our multi-modal ICA, the heart artifact is only found in the MEG part, as can be seen in Fig. 2, where sM,E and sM,E=0 show the heart beat, but sM=0,E does not. As given in Table 1, the single modality demixing rms values are 0.99 and 0.47 for
MEG and EEG, respectively, indicating a dominance of the MEG contribution. A correlation value of 0.87 between sM,E and sM,E=0 and of 0.23 between sM,E and sM=0,E confirms this behaviour. Figure 3 shows a component that is obviously generated by a source in the frontal region of the head. The corresponding tabulated values for the standard deviations of the rms values and correlations in Table 1 indicate that the multi-modal ICA result is quite homogeneous for the group of subjects investigated. We attribute this component to eye blinks, considering that the eyes and related muscles are themselves magnetic and electric sources; MEG and EEG data can be contaminated heavily by eye blink signals, especially in frontal regions [12].
Fig. 3. Eye movement component of the multi-modal ICA. The figure is structured as Fig. 1. The maps of the MEG and EEG parts have their peaks at the front and can be attributed to the magnetic and electric signature of eye blinks (muscle and eye dipole).
In addition to these components, there is another component that exhibits little correlation between MEG and EEG channels. The rms values are 1 and 0.04 for MEG and EEG, respectively, indicating a dominance of the MEG contribution, as for the heart beat component. We attribute this component to magnetic field drifts. Slow signal drifts are often present in MEG measurements due to the reduced shielding capability of mu-metal rooms at frequencies below 1 Hz. Typical EEG artifact signals, such as loose electrodes, were easily identified in the same manner as the MEG signal drifts.
5 Conclusion
In our analysis, a component associated with the auditory evoked M100/N100 was not identified. This is probably due to the random interstimulus interval with which the auditory stimuli were presented. However, we found four characteristic sources in almost all of the ten subjects studied using multi-modal ICA on common
MEG and EEG data. These four types of signals were easily found by manual inspection in almost all subjects of the group. The common source for occipital alpha oscillations was found without relying on any physical modeling of currents in the brain. A similar result was obtained for signals due to eye blinks. One heart beat component was found, which was detected mainly in the MEG, as deduced from the correlation and rms values associated with the single modality demixing. This behaviour agrees with the properties of the electric potential and the magnetic field of the heart muscle. We have demonstrated that multi-modal ICA and single modality demixing are powerful tools to identify common sources in MEG and EEG data. The next step will be a comparison between multi-modal ICA results and single modality ICA decompositions, and the use of a physical model to estimate the source location from the combined ICA EEG and MEG maps. Future work will assess the influence of the number of channels. The choice of the TDSEP algorithm was somewhat arbitrary, and the usefulness of the single modality demixing concept should be tested using other algorithms and other multi-modal data sets.
References

1. Baillet, S., Garnero, L., Marin, G., Hugonin, J.P.: Combined MEG and EEG Source Imaging by Minimization of Mutual Information. Biomedical Eng. 46(5) (1999)
2. Dale, A.M., Sereno, M.I.: Improved Localization of Cortical Activity by Combining EEG and MEG with MRI Cortical Surface Reconstruction: A Linear Approach. Journal of Cognitive Neuroscience 5(2), 162–176 (1993)
3. Huizenga, H.M., van Zuijen, T.L., Heslenfeld, D.J., Molenaar, P.C.M.: Simultaneous MEG and EEG source analysis. Phys. Med. Biol. 46, 1737–1751 (2001)
4. Yoshinaga, H., Nakahori, T., Ohtsuka, Y., Oka, E., Kitamura, Y., Kiriyama, H., Kinugasa, K., Miyamoto, K., Hoshida, T.: Benefit of Simultaneous Recording of EEG and MEG in Dipole Localization. Epilepsia 43(8), 924–928 (2002)
5. James, C., Hesse, C.: ICA for Biomedical Signals. Physiol. Meas. 26, 15–39 (2005)
6. Anemüller, J.: Second-Order Separation of Multidimensional Sources with Constrained Mixing System. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 16–23. Springer, Heidelberg (2006)
7. Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., Moulines, E.: A BSS Technique Based on Second-Order Statistics. IEEE Trans. on Sig. Proc. 45, 434–444 (1997)
8. Ziehe, A., Müller, K.-R.: TDSEP - An Efficient Algorithm for Blind Separation Using Time Structure. In: Proc. of the 8th ICANN, pp. 675–680. Springer, Heidelberg (1998)
9. Hari, R., Salmelin, R.: Human cortical oscillations: a neuromagnetic view through the skull. Trends Neurosci. 20(1), 44–49 (1997)
10. Sander, T.H., Wuebbeler, G., Lueschow, A., Curio, G., Trahms, L.: Cardiac Artifact Subspace Identification and Elimination in Cognitive MEG-Data Using Time-Delayed Decorrelation. IEEE Trans. Biomed. Eng. 49, 345–354 (2002)
11. Dirlich, G., Vogl, L., Plaschke, M., Strian, F.: Cardiac field effects on the EEG. Electroencephalography and clinical Neurophysiology 102, 307–315 (1997)
12. Zavala-Fernandez, H., Sander, T., Burghoff, M., Orglmeister, R., Trahms, L.: Comparison of ICA Algorithms for the Isolation of Biological Artifacts in Magnetoencephalography. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 511–518. Springer, Heidelberg (2006)
Blind Signal Separation Methods for the Identification of Interstellar Carbonaceous Nanoparticles

O. Berné¹,², Y. Deville², and C. Joblin¹

¹ Centre d'Etude Spatiale des Rayonnements, CNRS et Université Paul Sabatier Toulouse 3, Observatoire Midi-Pyrénées, 9 Av. du Colonel Roche, 31028 Toulouse cedex 04, France
[email protected], [email protected]
² Laboratoire d'Astrophysique de Toulouse-Tarbes, CNRS et Université Paul Sabatier Toulouse 3, Observatoire Midi-Pyrénées, 14 Av. Edouard Belin, 31400 Toulouse, France
[email protected]
Abstract. The use of Blind Signal Separation methods (ICA and other approaches) for the analysis of astrophysical data remains quite unexplored. In this paper, we present a new approach for analyzing the infrared emission spectra of interstellar dust, obtained with NASA’s Spitzer Space Telescope, using FastICA and Non-negative Matrix Factorization (NMF). Using these two methods, we were able to unveil the source spectra of three different types of carbonaceous nanoparticles present in interstellar space. These spectra can then constitute a basis for the interpretation of the mid-infrared emission spectra of interstellar dust in the Milky Way and nearby galaxies. We also show how to use these extracted spectra to derive the spatial distribution of these nanoparticles.
1 Introduction
The Spitzer Space Telescope (Spitzer) is one of today's best instruments for probing the mid-infrared (mid-IR) emission of interstellar dust in the Milky Way and nearby galaxies. This emission is mainly carried by very small (nanometric) interstellar dust particles. One of the goals of infrared astronomy is to identify the physical/chemical nature of these species, as they play a fundamental role in the evolution of galaxies. Unfortunately, the observed spectra are mixtures of the emission from various dust populations. The strategy presented in this paper is to apply Blind Signal Separation (BSS) methods, namely FastICA and NMF, to a set of Spitzer mid-IR (5-30 μm) spectra obtained with the InfraRed Spectrograph (IRS), in order to extract the genuine spectrum of each population of nanoparticles. We first present these observations in Sect. 2, then we apply the
This work is based on observations made with the Spitzer Space Telescope, which is operated by the Jet Propulsion Laboratory, California Institute of Technology under a contract with NASA.
two BSS methods to these observations, and finally give an example of how the extracted spectra can be used to trace the evolution of dust in the Milky Way and external galaxies.
2 Observations
We have observed with Spitzer nearby photo-dissociation regions (PDRs), which consist of a star illuminating the border of dense clouds of gas and dust. The physical conditions (UV field intensity and hardness, cloud density) vary strongly from one PDR to another, as well as inside each PDR depending on the considered position. These variations are extremely useful to probe the nature of dust particles, which are altered by the local physical conditions [1]. Therefore, we have observed 11 PDRs as part of the SPECPDR program using the IRS in "spectral mapping" mode. This mode enabled us to construct one dataset for each PDR. This dataset is a spectral cube, with two spatial dimensions and one spectral dimension (see Fig. 1). Each spectral cube is thus a 3-dimensional matrix C(px, py, λ), which defines the variations of the recorded data with respect to the wavelength λ, for each considered position with coordinates (px, py) in the cube. The dimensions of these cubes are generally about 30 × 30 positions and 250 points in wavelength, ranging between 5 and 30 μm.
Fig. 1. Left: Infrared (8 μm) view of the NGC 7023 North PDR. The star is illuminating the cloud situated in the upper part of the image. Right: Mid-IR spectrum for a given position in the spectral cube of NGC 7023.
3 Blind Separation of Interstellar Dust Spectra
BSS is commonly used to restore a set of unknown "source" signals from a set of observed signals which are mixtures of these source signals, with unknown mixture parameters [2]. BSS is most often achieved using ICA methods such as FastICA [3]. An alternative class of methods for achieving BSS is NMF, which was introduced in [4] and then extended by a few authors. In the astrophysical
community, ICA has been successfully used for spectra discrimination in infrared spectro-imagery of Mars ices [5], elimination of artifacts in astronomical images [6] or extraction of the cosmic microwave background signal in Planck simulated data [7]. To our knowledge, NMF has not yet been applied to astrophysical problems. However, it has been used to separate spectra in other application fields, e.g. for magnetic resonance chemical shift imaging of the human brain [8] or for analyzing wheat grain spectra [9]. The simplest version of the BSS problem concerns so-called "linear instantaneous" mixtures. It is modeled as follows:

$$X = AS \qquad (1)$$
where X is an m × n matrix containing n samples of m observed signals, A is an m × r mixing matrix and S is an r × n matrix containing n samples of r source signals. The observed signal samples are considered to be linear combinations of the source signal samples (with the same sample index). It is assumed that r ≤ m in most investigations, including this paper. The objective of BSS algorithms is then to recover the source matrix S and/or the mixing matrix A from X, up to BSS indeterminacies. The correspondence between the generic BSS data model (1) and the 3-dimensional spectral cube C(px, py, λ) to be analyzed in the present paper may be defined as follows. In this paper, the sample index is associated with the wavelength λ, and each observed signal consists of the overall spectrum recorded for a cube pixel (px, py). Each one of these signals defines a row of the matrix X in Eq. (1). Moreover, each observed spectrum is a linear combination of "source spectra" (see Sect. 3.1), which are respectively associated with each of the (unknown) types of nanoparticles that contribute to the recorded spectral cube. Therefore, the recorded spectra may here be expressed according to (1), with unknown combination coefficients in A, unknown source spectra in S and an unknown number r of source spectra.
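In practice, the correspondence between the cube and the data matrix X amounts to a reshape; a minimal sketch (NumPy assumed, names hypothetical):

```python
import numpy as np

def cube_to_matrix(C):
    """Flatten a spectral cube C[p_x, p_y, lambda] into the BSS matrix X:
    one row per cube pixel (m = p_x * p_y observed spectra),
    one column per wavelength sample (n samples)."""
    n_px, n_py, n_lambda = C.shape
    return C.reshape(n_px * n_py, n_lambda)
```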
3.1 Suitability of BSS Methods for the Analysis of Spitzer-IRS Cubes
In order to apply NMF or FastICA to the IRS data cubes, it is necessary to make sure that the "linear instantaneous" mixture condition is fulfilled. Here we consider that each observed spectrum is a linear combination of "source spectra", which are due to the emission of different populations of dust nanoparticles. The main effect that can disturb the linearity of the model is radiative transfer, as shown by [10], because of the non-linearity of the equations. In our case, however, this effect is completely negligible, because the emission spectra we observe come from the surface of clouds and are therefore not altered by radiative transfer.
3.2 Considered BSS Methods
In this section, we detail which particular BSS methods we have applied to the observed data.
NMF. We used NMF as presented in [11]. The matrix of observed spectra X is approximated by the product

$$W H \qquad (2)$$

where W and H are non-negative matrices with the same dimensions as in (1). This approximation is optimized by adapting the matrices W and H using the algorithm of [11], in order to minimize the divergence between X and WH. We implemented the algorithm in Matlab. Convergence is reached after about 1000 iterations (which takes less than one minute on a 3.2 GHz processor). The value of r (the number of "source" spectra) is not imposed by the NMF method. Our strategy for setting it so as to extract the sources was the following:

• Apply the algorithm to a given dataset with the minimum number of assumed sources, i.e. r̂ = 2, providing 2 sources.
  → If the found solutions are physically coherent and linearly independent, we consider that at least r̂ = 2 sources can be extracted.
  → Else, we consider that the algorithm is not suited for the analysis (this never occurred in our tests).
• Try the algorithm on the same dataset but with r̂ = 3 sources.
  → If the found solutions are physically coherent and linearly independent, we consider that at least r̂ = 3 sources can be extracted.
  → Else, we consider that only two sources can be extracted; extraction was over with r̂ = 2 and thus r = 2.
• Same as the previous step but with r̂ = 4 sources.
  → If the found solutions are physically coherent and linearly independent, we consider that at least r̂ = 4 sources can be extracted.
  → Else, we consider that only three sources can be extracted; extraction was over with r̂ = 3 and thus r = 3.
...

Physically incoherent spectra exhibit sparse peaks (spikes) which cannot be PDR gas lines. We found r = 3 for NGC 7023-NW and r = 2 for the other PDRs, implying that we could respectively extract 3 and 2 spectra from these data cubes. A sketch of the update rules is given after this paragraph.

FastICA. We used FastICA in the deflation version [3], in which each source is extracted one after the other and subtracted from the observations until all sources are extracted. The advantage of this FastICA method is that it is not necessary to fix the number r of sources to extract before running the algorithm, as it is for NMF. The extraction of the sources takes less than one minute using FastICA coded in Matlab on a 3.2 GHz processor.
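The authors implemented the updates in Matlab; below is a hypothetical Python sketch of the divergence-minimising multiplicative updates of [11] (the random initialisation and the small constant eps are assumptions added for numerical safety):

```python
import numpy as np

def nmf_kl(X, r, n_iter=1000, eps=1e-12, seed=0):
    """Multiplicative updates minimising the (generalised KL) divergence
    between X and W @ H, after Lee & Seung (2001)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / W.sum(axis=0)[:, None]   # update H
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / H.sum(axis=1)[None, :]   # update W
    return W, H
```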
3.3 Results
Using the BSS methods presented in this paper, we were able to extract up to three source spectra from the Spitzer observations. The number r of sources found in a given PDR is always the same with NMF and FastICA. The three extracted spectra in NGC 7023 North are presented in Fig. 2. Two of them exhibit
the series of aromatic bands which have previously been attributed to Polycyclic Aromatic Hydrocarbons (PAHs, [12] and [13]). These two spectra show different band intensity ratios. One is the spectrum of neutral PAHs (PAH0) while the other is due to ionized PAHs (PAH+). The last spectrum exhibits a continuum and aromatic bands, which can be attributed to very small carbonaceous grains (VSGs), possibly PAH clusters [14].
Fig. 2. The three BSS-extracted spectra from our study on PDRs
3.4 FastICA vs. NMF for Our Application
As mentioned in Sect. 3.3, we were able to extract the source spectra from our data using both FastICA and NMF. However, the extracted spectra are not exactly the same for the two methods. We conducted several tests in order to evaluate which of the two methods is more appropriate for our application. We created a set of two or three artificial carbonaceous nanoparticle spectra, to which we added a variable level of white, spatially homogeneous noise. We mixed these spectra with a random matrix to create a set of 100 artificial observed spectra. We then applied the two BSS methods considered in this paper. With a noise level of zero, both methods recover the original signals with high efficiency (correlation coefficients between original and extracted signals above 0.995). When adding noise, this efficiency decreases but remains acceptable down to a noise level corresponding to an SNR of 3 dB (which is much lower than the average SNR of the Spitzer spectra). We note, however, that the efficiency of FastICA drops slightly faster than that of NMF under increasing noise, and drops dramatically below an SNR of 3 dB, while NMF can still partly recover the original signals. Finally, with both methods we observe that the power of the residuals (i.e. observed signal minus signal reconstructed from the estimated sources and mixing coefficients) has the same level as the Spitzer noise. We have shown in Sect. 3.3 that there are two main populations: one with a continuum (VSGs) and one with bands only (PAHs). Using FastICA, we sometimes find a residual continuum in the BSS-extracted PAH spectrum, which we interpret as an incomplete separation. It is possible that the criterion of NMF is more appropriate in our case because it is less restrictive. Indeed, NMF only requires non-negativity of the sources and mixing coefficients, which is in essence the case for emission spectra, while FastICA is based on the statistical independence and
non-gaussianity of the sources, which is more difficult to prove. In conclusion, we would like to stress that both methods are very efficient for the first task presented in this paper; we note, however, that NMF seems slightly better for this particular application.
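A sketch of how such a synthetic benchmark could be scored (assumed details: NumPy, and absolute correlation with best matching to handle the BSS permutation and sign indeterminacies; this is an illustration, not the authors' test code):

```python
import numpy as np

def separation_quality(S_true, S_est):
    """Per-source best absolute correlation between true and estimated spectra."""
    r = S_true.shape[0]
    C = np.abs(np.corrcoef(S_true, S_est))[:r, r:]
    return C.max(axis=1)   # values close to 1.0 mean a source was well recovered

# usage idea: X = A @ S_true + sigma * noise, run FastICA/NMF on X,
# then compare separation_quality(S_true, S_est) as sigma grows
```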
4 Deriving the Spatial Distribution of Carbonaceous Nanoparticles
The next step of our analysis consists in using our extracted source spectra (Fig. 2) in order to determine the spatial distribution of the three populations in galactic clouds or in external galaxies. The Spitzer observations on-line archive contains hundreds of mid-IR spectral cubes of such regions which can be interpreted in this way. Our strategy consists in calculating the correlation parameter $c_p = E[\mathrm{Obs}(p_x, p_y, \lambda)\, y_p(\lambda)]$ between an observed spectrum $\mathrm{Obs}(p_x, p_y, \lambda)$ at a position $(p_x, p_y)$ in a spectral cube and one of our extracted source spectra $y_p(\lambda)$, where $E[\cdot]$ stands for expectation. With the considered (i.e. linear instantaneous) mixture model, each observed spectrum reads

$$\mathrm{Obs}(p_x, p_y, \lambda) = \sum_{n} w(p_x, p_y)_n\, S_n(\lambda) \qquad (3)$$
where $S_n(\lambda)$ is the nth source spectrum and the $w(p_x, p_y)_n$ are the mixing coefficients associated with that source. Moreover, BSS methods extract the sources up to arbitrary scale factors, i.e. they provide $y_p(\lambda) = \eta_p S_p(\lambda)$, where $\eta_p$ is an unknown scale factor and $S_p(\lambda)$ is the pth source. By centering the observations and thus the extracted spectra, and assuming that the sources are not correlated, the above-defined correlation parameter becomes

$$c_p = \eta_p\, w(p_x, p_y)_p\, E[S_p(\lambda)^2] \qquad (4)$$
This coefficient $c_p$ is calculated for all positions $(p_x, p_y)$, therefore yielding a 2D correlation map. Eq. (4) shows that this map is proportional to $w(p_x, p_y)_p$ and thus defines the spatial distribution of the considered extracted source $y_p(\lambda) = \eta_p S_p(\lambda)$. We applied this approach to the spectral cube of NGC 7023 North (Fig. 1) and obtained the correlation maps presented in Fig. 3. We find that the three nanoparticle populations emit in very different regions. It appears from the maps of Fig. 3 that there is an evolution from a population of VSGs to PAH0 and then to PAH+ while approaching the star. This reveals the processing of the nanoparticles by the UV stellar radiation. The same strategy was tested using the cubes of external galaxies from the SINGS program, which provides a database of mid-IR spectral cubes for tens of nearby galaxies. Fig. 4 presents a map of the ratio of the two correlation parameters, of PAH0 and PAH+ respectively, obtained for the Evil Eye galaxy. This method provides a unique way to spatially trace the ionization fraction of PAHs which, combined with other tracers, is fundamental to understanding the evolution of galaxies.
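A sketch of the correlation-map computation (NumPy assumed; the centring follows the text, while the normalisation constant is an assumption):

```python
import numpy as np

def correlation_map(C, y_p):
    """c_p = E[Obs(p_x, p_y, lambda) * y_p(lambda)] per pixel, cf. Eqs. (3)-(4).
    C: spectral cube (n_px, n_py, n_lambda); y_p: one extracted source spectrum."""
    y_c = y_p - y_p.mean()
    C_c = C - C.mean(axis=2, keepdims=True)   # centre each observed spectrum
    return np.tensordot(C_c, y_c, axes=([2], [0])) / y_p.size
```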
Fig. 3. Correlation maps of the three populations of nanoparticles in NGC 7023 North: VSGs in red, PAH0 in green and PAH+ in blue. The contours in black show the emission at 8 μm from Fig. 1. The slight correlation of VSGs with observations seen near the star is an artifact.
Fig. 4. Left: Infrared (8 μm) view of the NGC 4826 (Evil Eye) Galaxy. The rectangle indicates the region observed in spectral mapping with IRS. Right: Map of the ratio of PAH0 over PAH+ in NGC 4826, obtained using the BSS-extracted spectra (Fig. 2).
5 Conclusion
Using two BSS methods, we were able to identify the genuine mid-IR spectra of three populations of carbonaceous nanoparticles in the interstellar medium. We have shown that both FastICA and NMF are efficient for this task, although NMF is found to be slightly more appropriate. The extracted spectra enable us to study the evolution of carbonaceous nanoparticles in the interstellar medium
with unprecedented precision, including in external galaxies. These results stress the fact that BSS methods have much to reveal in the field of observational astrophysics. We are currently analyzing more spectral cube observations from the Spitzer database using the strategy presented in this paper.
References

1. Rapacioli, M., Joblin, C., Boissel, P.: Spectroscopy of polycyclic aromatic hydrocarbons and very small grains in photodissociation regions. Astronomy and Astrophysics 429, 193–204 (2005)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley (2001)
3. Hyvärinen, A.: A Fast Robust Fixed Point Algorithm for Independent Component Analysis. IEEE Transactions on Neural Networks 10, 626–634 (1999)
4. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
5. Forni, O., Poulet, F., Bibring, J.P., and the OMEGA Science Team: Component Separation of OMEGA Spectra with ICA. In: 36th Annual Lunar and Planetary Science Conference, p. 1623 (March 2005)
6. Funaro, M., Oja, E., Valpola, H.: Independent component analysis for artifact separation in astrophysical images. Neural Networks 16, 469–478 (2003)
7. Maino, D., Farusi, A., Baccigalupi, C., Perrotta, F., Banday, A.J., Bedini, L., Burigana, C., De Zotti, G., Górski, K.M., Salerno, E.: All-sky astrophysical component separation with Fast Independent Component Analysis (FASTICA). 334, 53–68 (July 2002)
8. Sajda, P., Du, S., Brown, T.R., Stoyanova, R., Shungu, D.C., Mao, X., Parra, L.C.: IEEE Transactions on Medical Imaging 23, 1453–1465 (December 2004)
9. Gobinet, A., Elhafid, A., Vrabie, V., Huez, R., Nuzillard, D.: In: Proceedings of the 13th European Signal Processing Conference (September 2005)
10. Nuzillard, D., Bijaoui, A.: Blind source separation and analysis of multispectral astronomical images. Astronomy and Astrophysics, Supplement 147, 129–138 (2000)
11. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: NIPS, vol. 13, p. 556. MIT Press, Cambridge (2001)
12. Léger, A., Puget, J.L.: Identification of the 'unidentified' IR emission features of interstellar dust? Astronomy and Astrophysics 137, L5–L8 (1984)
13. Allamandola, L.J., Tielens, A.G.G.M., Barker, J.R.: Polycyclic aromatic hydrocarbons and the unidentified infrared emission bands - Auto exhaust along the Milky Way. The Astrophysical Journal, Letters 290, L25–L28 (1985)
14. Rapacioli, M., Calvo, F., Joblin, C., Parneix, P., Toublanc, D., Spiegelman, F.: Formation and destruction of polycyclic aromatic hydrocarbon clusters in the interstellar medium. Astronomy and Astrophysics 460, 519–531 (2006)
Specific Circumstances on the Ability of Linguistic Feature Extraction Based on Context Preprocessing by ICA

Markus Borschbach and Martin Pyka

Dept. of Mathematics and Computer Science, Institute for Computer Science, University of Münster, Einsteinstr. 62, D-48149 Münster, Germany
{markus.borschbach,pyka}@uni-muenster.de
Abstract. Blind Signal Separation (BSS) based on Independent Component Analysis (ICA) is an emerging approach whose application is not limited to signal processing research, where its application principle is rather straightforward. For an increasing number of information processing fields, ICA has meaningful applications which are still undiscovered. The aim of this paper is to investigate the ability of linguistic feature extraction based on word context preprocessing by ICA. The work refers to a first brief analysis in which ICA was applied to an English corpus. We continue this analysis depending on the number of components and the amount of syntactical information that we take into account. Furthermore, we discuss to which extent the results deliver general linguistic features, or linguistic features giving us information about the text.
1 Introduction
In various information processing fields, Independent Component Analysis (ICA) [1] has emerged as a key notion of the underlying principle, often leading to an alternative preprocessing step or filter of the observed data streams. The different cases of assumed underlying models (from the scalar and linear case to convolutive, nonlinear, and single-channel ICA) have enabled a broad spectrum of suitable application areas. In many signal processing fields, the task of Blind Signal Separation based on ICA can be directly applied to a given number of observations, and the assumed underlying sources can be extracted. In all but a few information processing areas, the meaning of the independent components is rather undiscovered so far. All assumed underlying ICA models are comparable in the sense of identifying similarities expressed by the condition of independence. This paper first contributes by summarizing approaches for linguistic text analysis based on ICA and expressing the role of the independent components. More specifically, ICA is used as a preprocessing step to identify similarities in the context of all words of different types of text. The paper is organized as follows. Section 2 briefly reviews the objective function of ICA in an existing approach for linguistic feature extraction. The operation steps of ICA preprocessing and their strengths and weaknesses are outlined in section 3.
The simulation results based on this approach with two kinds of preprocessed texts are presented in section 4 and discussed in section 5.
2 Related Work

2.1 Word-Based Linguistic Feature Extraction
The approach examined in this paper was introduced primarily in [2] and applied to a corpus of 4,921,934 tokens (words in the running text) including 117,283 types (different unique words). During preprocessing, all uppercase letters were replaced by the corresponding lowercase letters and punctuation was removed. For the creation of the context matrix, one hundred common words were manually selected. The context of these words was analyzed using the 2000 most common words of the text in the following way: if one of the hundred words appears in the text followed by one of the 2000 most common words, the corresponding value c_ij, where i is the index of the manually selected word and j the index of the context word, is increased by 1. The context matrix therefore displays the number of occurrences of each context word in the immediate context of the manually selected words. If BSS is viewed as the classical ICA signal processing application task, the dimension of the mixture vectors is 100 and the sample size is 2000. In the last preprocessing step the logarithm was taken in order to reduce the effect of the very common words in the text. The FastICA algorithm [4] was used to reduce and extract the 10 strongest independent components based on the greatest eigenvalues (using principal component analysis). Each of the 2000 selected context words is then represented not by its number of occurrences in the context of different index words but by a ten-dimensional vector. Depending on the amplitude of the values in each dimension, the syntactical function of the words can be categorized. In the given experiment, for instance, all plural nouns had a significantly greater value in component five than in other components. Adjectives, verbs and even verb forms of "to be" and "to have" were represented by other categories. The results show that the ICA approach was able to extract emergent linguistic features, although the analysis reveals that the assignments of the ICs to the linguistic features are still quite loose. Therefore we consider it necessary to investigate how the results can be improved by changing the context distance, the size of the corpora and the number of independent components.
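A sketch of the context-matrix construction described above (Python; the log(1 + count) damping is an assumption standing in for "the logarithm was taken"):

```python
import numpy as np

def context_matrix(tokens, index_words, context_words):
    """c[i, j]: how often context_words[j] immediately follows index_words[i]."""
    idx = {w: i for i, w in enumerate(index_words)}
    ctx = {w: j for j, w in enumerate(context_words)}
    C = np.zeros((len(index_words), len(context_words)))
    for w, w_next in zip(tokens, tokens[1:]):
        if w in idx and w_next in ctx:
            C[idx[w], ctx[w_next]] += 1
    return np.log(1.0 + C)   # damp the effect of very common words
```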
2.2 Morpheme-Based Linguistic Feature Extraction
The same approach was applied to a Finnish corpus, taking morphemes as the smallest unit [3]. During preprocessing, a Finnish newspaper text with 30 million words and 1.3 million unique words was segmented into morphemes by applying the segmentation tool Hutmegs [10]. For a selection of 3759 data morphemes, a context matrix was generated representing the distribution of the 506 most common morphemes in the immediate context of the data morphemes. The first 50 independent components were obtained via the FastICA
algorithm. The results showed that morphemes with a common grammatical feature had similar 50-dimensional vectors, with high values (+ or −) in the components which indicate their grammatical functions. For instance, morphemes of country names had three components with high values. A further analysis showed that this group of morphemes can be extended to words which refer to languages (ranska/ranskaksa: France/French), to inhabitants (ranska/ranskalainen: France/Frenchman) or to words with the same kind of endings. Ambiguous words had a higher number of active components, representing the different meanings.
3 Simulation Onset
In this paper, we tried to reproduce and improve the results of the mentioned approach in order to figure out possible application fields. Therefore we increased the number of independent components up to 30 and, in a second step, we ran the same test with the full stop as sentence separator, which should increase the amount of linguistic information used to create the context matrix.
3.1 Settings and Selections
In the first run we tried to reproduce the same or at least similar results with a corpus of articles from the Times newspaper from 1992, including 4,261,330 words and 97,936 unique tokens. We used the 2000 most common words of the text as contextwords and 100 manually selected so-called indexwords in order to create a context matrix in which each element contains the number of occurrences of a contextword after an indexword. Figure 1 illustrates this approach. After that, the number of dimensions was reduced to 10 by principal component analysis. The analysis of all components reveals that there are five clearly distinguishable syntactical categories and four categories that have overlapping syntactical features. For one component, a meaningful linguistic binding could not be identified. While the first five components correspond very well to syntactical categories, the last four categories shown in the table (and the category that did not match any recognizable linguistic feature binding) seem to have overlapping syntactical features, which can be explained by the limited number of independent components that we chose. Nevertheless, even for those components plausible syntactic categories could be assigned. For example, component six contains verbs like 'should', 'says' or 'took'. Their common feature is the group of words, mainly pronouns, which are prefixed. All verbs in this category are used in expressions like 'he should...', 'he says...' or 'it took...'. In category eight a similar phenomenon appeared. It includes words like 'november', 'edinburgh' or '1986'. They have in common that they are mainly used with 'on', 'in' and 'of'. Although time- and location-related words dominate the results, we found words like 'corporation' or 'principle' as well. Therefore, the kinds of syntactical categories depend on the principles for establishing the context
Fig. 1. All unique words are extracted from a corpus. After a manual selection of the index- and contextwords, a contextmatrix is computed. Through ICA with PCA it is reduced to a given number of independent components.
1. Verbs in Past Tense
2. Verbs in Infinitive
3. Numbers
4. Nouns in plural
5. Proper names
6. Verbs in various forms
7. Prepositions, conjunctions, articles
8. Nouns that follow 'in' and 'on', mainly names of cities and months
9. Words which appear frequently with 'the' and 'an'

Fig. 2. Clear distinctive categories
matrix and the number of independent components. The chances and weaknesses of this behavior are discussed in the following section.
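A sketch of the reduction and extraction step using scikit-learn's FastICA (an assumed implementation; the authors do not name their toolbox), where C is the 100 × 2000 context matrix built above:

```python
import numpy as np
from sklearn.decomposition import FastICA

# each of the 2000 context words is a 100-dimensional sample
X = np.log(1.0 + C).T                      # shape (2000, 100)
ica = FastICA(n_components=10, algorithm="deflation", random_state=0)
codes = ica.fit_transform(X)               # 10-dimensional code per context word
# a large |codes[j, d]| marks context word j as belonging to category d
```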
4 Results

4.1 Settings
Based on the same context matrix, the number of components was increased in order to see how well further linguistic categories can be extracted and to which extent the overlapping components can be minimized. First, 20 categories were calculated by principal component analysis and their results investigated. Then an analysis with 30 categories was compared with the previous results. In general it could be stated: the more independent components one computes, the more specialized are the categories that can be found in the data. They do not necessarily correspond to typical linguistic categories. Their dependency on the collocations in the text leads to various groups in which one can only vaguely discern linguistic or even semantical categories. First the results of this experiment are examined in more detail, and then the possible application fields are discussed.
4.2 Implications
The results are summarized in Table 1. Categories 1, 2, 3, 4, 5 and 16 can be regarded as the clearest syntactical categories that could be extracted with ICA. They relate to a clear syntactical environment in which they occur, and they are discernible for humans as parts of speech as well. Although it seems self-evident that singular nouns form a strong syntactical group as well, they do not occur as a whole group. In the categories of the 20 and 30 components we find them divided into several subgroups, like category 10, which contains all nouns that follow after "on", "in" or "last", like "on sunday", "in december" or "last week". Category 11 contains nouns that follow prevalently after words like "former", "national", "party" and some others. Typical collocations are "former chairman", "national institute" and "party leader". These words seem to share a semantical feature, which is vaguely related to the domain 'corporate structures, business, economy'. This kind of category, and the words in it, reflect topics of the articles in the Times newspaper. So the categories that we receive are, on the one hand, the result of an emergent unsupervised learning process of the English language: we get distinctive features that have clear syntactical properties. On the other hand, they give us a notion of the topics that are discussed in the articles. We have to admit that they do not qualify for completeness, but at least it seems that a particular selection of context- and indexwords can lead to various components in which we can find key words of a text. This has to be analyzed in more detail with various indexwords and different kinds of texts. Due to the fact that we build the context matrix as described in [5], the emergent feature categories do not always match completely to classical categories.
Table 1. Linguistic categories in 10, 20 and 30 components

10 components:
1. Verbs in Past Tense
2. Verbs in Infinitive
3. Numbers
4. Nouns in plural
5. Proper names
6. Verbs in various forms
7. Prep., conj., articles
8. Nouns that follow after "in" and "on", mainly names of cities and months
9. Words which appear frequently with "the" and "an"

20 components (categories 1-7 as above):
8. Nouns that follow after "in" and "on", mainly names of cities and months
9. Verbs like "going", "based" or "able" that follow after "was" or "is"
10. Nouns that follow after "on" and "last": "january", "february", "monday"
11. Words that follow after "former" or "national": "chief", "director", "institute"
12. Words that follow after numbers or forenames: "per", "million", "clarke"
13. Numbers that begin with zero: 01, 02, 05 (part of the structure of the Times corpus)
14. Words that follow after mainly possessive pronouns: "father", "mother", "greatest"
15. Verbs that mainly occur with personal pronouns: "could", "might", "hope"
16. Adjectives

30 components (categories 1-7 as above):
8. Nouns that follow after "in" and "to", mainly years and months (some cities)
9. Verbs like "going", "based" or "able" that follow after "was" or "is"
10. Nouns that follow after "on" and "last": "january", "february", "monday"
11. Words that follow after "former" or "national": "chief", "director", "institute"
12. Words that follow after numbers: "weeks", "years", "goals"
13. Numbers that begin with zero: 01, 02, 05 (part of the structure of the Times corpus)
14. Words that follow after mainly possessive pronouns: "father", "mother", "greatest"
15. Words that can occur with possessive pronouns or "the": "biggest", "largest", "eyes"
16. Adjectives
17. Words that follow after "last", "first", "next": "week", "game", "century"
18. Words that follow after auxiliaries or conjunctions, mainly adverbs: "hardly", "usually", "not"
19. Words that follow after "national", "international", "british": "market", "group", "football"
20. Words that follow after "new": "zealand", "york", "ideas"
21. Verbs that follow after "he", "she", "it", mainly verbs in third person singular: "seems", "plans", "did"
22. A second category of proper names, mainly forenames: "richard", "george", "professor"
23. Words that follow after "for": "himself", "lunch", "example"
An interesting example can be found in category 14, which consists of words that follow after possessive pronouns like "my", "his", "her" etc. It includes words like "mother", "father", "brother", "colleagues" and body parts like "eyes" and "hands". In general, the words in this category are nouns. As there must be a large number of collocations like "my greatest..." or "my entire...", the words "greatest" and "entire" and several others are in category 14 as well. Here we expect to reduce the effect of overlapping syntactical categories by using a different context matrix which takes both the previous and the following word of an indexword into account.
4.3 Using Punctuation
In the previous chapter, the punctuation marks were not taken into account; instead we followed the guideline described in [5]. However, the punctuation marks might carry important syntactic information for the analysis, and they were therefore left in the text for the second test. Only question marks and brackets were removed; everything else was treated like a word. The extraction of the 30 strongest components showed that basically the same results as in the analysis of the text without punctuation marks can be extracted. But instead of 23 categories, we were able to find syntactical similarities in 26 of 30 components. 18 of the 23 categories were identical in both tests¹. The punctuation marks as separators in the text prevented connections between two successive words that are not located in the same sentence. Nevertheless, the analysis of the words without punctuation marks was still quite effective.
5 Conclusion
The increase of the number of components led to a more detailed syntactical segmentation of the 2000 selected words. This segmentation gives us information about the context in which the words are generally used. Although some of these categories even relate to lexical categories, there are always a few outlying words that occur through exceptions in the grammatical use of the language. The more components we extract with ICA, the more specific are the syntactical categories. In corpora with an emphasis on certain topics, we were able to extract collocations that are not typical for the English language in general but for the specific topic. For example, we extracted a category with words that are used in collocations like "former director" or "national institute" (category 11), and a category with words that follow after "international", "british" and several others. They are used in collocations like "international group" or "british market". This segmentation does not only reflect the syntactical features of the English language, but also the mainly used terms and phrasings of a specific text. Due to
¹ Due to the fact that FastICA starts with a random initialization, the results can differ from run to run. But in all tests that were made with 30 components we got at least 14 categories that occur in each test.
their disproportionately high occurrences, they were grouped in a few categories. Further experiments with different context distances and different types of text will show whether this could also be a way of extracting topic-related information from a text. The segmentation ability of this approach could be interesting in the following way as well: each context word used in this experiment is encoded in a 10- to 30-dimensional vector which represents its functionality in the text. It is an abstracted description of the word. These codes could also be used for finding certain patterns in English sentences in an unsupervised and emergent manner. For instance, it should be possible to extract grammatical rules from the text through the encoded words, because their vectors identify them as words with particular syntactical features. There are many possibilities to increase the amount of syntactical information that might be important for ICA to extract linguistic features. One is the handling of punctuation marks as single words, which allowed us to extract a few more clearly distinguishable linguistic features. Further experiments with different context matrices, greater corpora, more context words and more independent components should show to which extent we can extract linguistic features that may be usable in other applications.
References

1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley & Sons, Chichester (2001)
2. Honkela, T., Hyvärinen, A., Väyrynen, J.: Emergence of Linguistic Features: Independent Component Analysis of Contexts. In: Proc. of the 9th Neural Computation and Psychology Workshop (NCPW9), pp. 129–138 (2005)
3. Lagus, K., Creutz, M., Virpioja, S.: Latent Linguistic Codes for Morphemes using Independent Component Analysis. In: Proc. of the 9th Neural Computation and Psychology Workshop (NCPW9), pp. 129–138 (2005)
4. Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
5. Honkela, T., Hyvärinen, A.: Linguistic Feature Extraction using Independent Component Analysis. In: Proc. of Int. Joint Conf. on Neural Networks (IJCNN), pp. 279–284 (2004)
6. Rapp, R.: Mining Text for Word Senses using Independent Component Analysis. In: Proc. of Int. Conf. on Data Mining (2004)
7. Yarowsky, D.: Unsupervised Word Sense Disambiguation rivaling Supervised Methods. In: Proc. of the 33rd Annual Meeting of the ACL, pp. 189–196 (1995)
8. Steiner, P.: Wortarten und Korpus. Dissertation at the Westfälische Wilhelms-Universität Münster. Shaker Verlag (2004)
9. Rapp, R.: Automatic Identification of Word Translations from unrelated English and German Corpora. In: Proc. of the 37th Annual Meeting of the ACL, pp. 519–526 (1999)
10. Creutz, M., Lindén, K.: Morpheme Segmentation Gold Standards for Finnish and English. Techreport, Helsinki University of Technology, Publications in Computer and Information Science (2005)
Conjugate Gamma Markov Random Fields for Modelling Nonstationary Sources

A. Taylan Cemgil¹ and Onur Dikmen²

¹ Engineering Dept., University of Cambridge, CB2 1PZ, Cambridge, UK
[email protected]
http://www-sigproc.eng.cam.ac.uk/~atc27
² Dept. of Computer Eng., Bogazici University, 80815 Bebek, Istanbul, Turkey
Abstract. In modelling nonstationary sources, one possible strategy is to define a latent process of strictly positive variables to model variations in second order statistics of the underlying process. This can be achieved, for example, by passing a Gaussian process through a positive nonlinearity or defining a discrete state Markov chain where each state encodes a certain regime. However, models with such constructs turn out to be either not very flexible or non-conjugate, making inference somewhat harder. In this paper, we introduce a conjugate (inverse-) gamma Markov Random field model that allows random fluctuations on variances which are useful as priors for nonstationary time-frequency energy distributions. The main idea is to introduce auxiliary variables such that full conditional distributions and sufficient statistics are readily available as closed form expressions. This allows straightforward implementation of a Gibbs sampler or a variational algorithm. We illustrate our approach on denoising and single channel source separation.
1 Introduction
In the Bayesian framework, various signal estimation problems can be cast into posterior inference problems. For example, source separation [6,5,9,11,3,2] can be stated as

$$p(s|x) = \frac{1}{Z_x}\int d\Theta_o\, d\Theta_s\; p(x|s, \Theta_o)\, p(s|\Theta_s)\, p(\Theta_o)\, p(\Theta_s) \qquad (1)$$

where $s \equiv s_{1:K,1:N}$ and $x \equiv x_{1:K,1:M}$. Here, the task is to infer N source signals $s_{k,n}$ given M observed signals $x_{k,m}$, where $n = 1 \dots N$, $m = 1 \dots M$, at each index k, where $k = 1 \dots K$. Here, k typically denotes time or a time-frequency atom in a linear transform domain. In Eq. (1), the (possibly degenerate, deterministic) conditional distribution $p(x|s, \Theta_o)$ specifies the observation model, where $\Theta_o$ denotes
This research is funded by EPSRC (Engineering and Physical Sciences Research Council of UK) under the grant EP/D03261X/1 entitled “Probabilistic Modelling of Musical Audio for Machine Listening” and by TUBITAK (Scientific and Technological Research Council of Turkey) under the grant entitled “Time Series Modelling for Bayesian Source Separation”.
the collection of mixing parameters such as the mixing matrix, observation noise variance, etc. The prior term $p(s|\Theta_s)$, the source model, describes the statistical properties of the sources via their own prior parameters $\Theta_s$. The normalisation term $Z_x = p(x)$ is the marginal probability (evidence) of the data under the model and plays a key role in model order selection (such as determining the number of sources) [8]. The hierarchical model is completed by postulating hyper-priors over the nuisance parameters $\Theta_s$ and $\Theta_o$. Estimates of the sources can be obtained from posterior features such as the marginal maximum-a-posteriori (MMAP) or minimum-mean-square-error (MMSE) estimate¹

$$s^* = \operatorname*{argmax}_{s}\; p(s|x) \qquad\qquad \langle s \rangle_{p(s|x)} = \int s\, p(s|x)\, ds$$
Unfortunately, exact calculation of these quantities is intractable for almost all relevant observation and source models, even under conditionally Gaussian and independence assumptions. Hence, approximate numerical integration techniques have to be employed. In applications, the key object is often the source model p(s|Θs). Indeed, many popular signal estimation algorithms can be obtained by choosing a particular source prior and applying Bayes rule. If the sources have some known structure, one can design more realistic prior models to improve the estimates. In this paper, we explore generic source models that explicitly model nonstationarity. Perhaps the prototypical example of a nonstationary process is one where the conditional variance is slowly changing in time. In the finance literature, such models are known as stochastic volatility models and are important to characterise non-stationary behaviour observed in financial markets [12]. In spatial statistics, similar constructions are needed in 2-D, where one is interested in changing intensity over a region [15]. In audio processing, the energy content of a signal is typically time-varying, hence it is natural to model audio with a process with a time-varying power spectral density on a time-frequency plane [10,14,4]. In the sequel, we introduce an alternative model that is useful for modelling a random walk on variances. The main idea is to introduce auxiliary variables such that full conditional distributions and sufficient statistics are readily available as closed form expressions. This allows straightforward implementation of a Gibbs sampler or a variational algorithm. Consequently we extend the model to 2-D random fields, which is useful for modelling nonstationary time-frequency energy distributions or intensity functions that need to be strictly positive. We illustrate our approach on various denoising and source separation scenarios.
2 Model
The inverse Gamma distribution with shape parameter a and scale parameter z is defined as

$$\mathcal{IG}(v; a, z) \equiv \exp\!\big((a+1)\log v^{-1} - z^{-1} v^{-1} + a \log z^{-1} - \log \Gamma(a)\big)$$
¹ Here, and elsewhere, the notation $\langle f(x)\rangle_{p(x)}$ denotes the expectation of the function f(x) under the distribution p(x), i.e. $\langle f(x)\rangle_{p} \equiv \int dx\, f(x)\, p(x)$.
Here, Γ is the gamma (generalised factorial) function. The sufficient statistics of the inverse Gamma distribution are given by $\langle v^{-1}\rangle_{\mathcal{IG}} = az$ and $\langle \log v^{-1}\rangle_{\mathcal{IG}} = \Psi(a) - \log z^{-1}$, where Ψ is the digamma function defined as $\Psi(a) \equiv d \log \Gamma(a)/da$. The inverse Gamma distribution is the conjugate prior for the variance v of a Gaussian distribution $\mathcal{N}(s; \mu, v) \equiv \exp\!\big(-(s-\mu)^2 v^{-1}/2 + \log v^{-1}/2 - \log(2\pi)/2\big)$. When the prior p(v) is inverse Gamma, the posterior distribution p(v|s) can be represented as an inverse Gamma distribution, since the logarithm of a Gaussian is a polynomial in $v^{-1}$ and $\log v^{-1}$. Similarly, the Gamma distribution with shape parameter a and scale parameter z is defined as

$$\mathcal{G}(\lambda; a, z) \equiv \exp\!\big((a-1)\log \lambda - z^{-1}\lambda + a \log z^{-1} - \log \Gamma(a)\big)$$

The sufficient statistics of the Gamma distribution are given by $\langle \lambda \rangle_{\mathcal{G}} = az$ and $\langle \log \lambda \rangle_{\mathcal{G}} = \Psi(a) - \log z^{-1}$. The Gamma distribution is the conjugate prior for the precision parameter (inverse variance) of a Gaussian distribution, as well as for the intensity parameter λ of a Poisson distribution

$$c \sim \mathcal{PO}(c; \lambda) \equiv e^{-\lambda}\lambda^{c}/c! = \exp\!\big(c \log \lambda - \lambda - \log \Gamma(c+1)\big)$$

We will exploit this property to estimate intensity functions of non-homogeneous Poisson processes.
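To make the conjugacy statement concrete, a worked step (not spelled out in the original): multiplying the Gaussian likelihood of a single observation by the inverse Gamma prior and collecting the $v^{-1}$ and $\log v^{-1}$ terms gives

$$p(v|s) \propto \mathcal{N}(s; \mu, v)\, \mathcal{IG}(v; a, z) \quad\Longrightarrow\quad p(v|s) = \mathcal{IG}\big(v;\, a + \tfrac{1}{2},\; (z^{-1} + (s-\mu)^2/2)^{-1}\big)$$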
2.1 Markov Chain Models
It is possible to define a Markov chain on inverse Gamma random variables in a straightforward way by $v_k|v_{k-1} \sim \mathcal{IG}(v_k; a, v_{k-1}/a)$. The full conditional distribution $p(v_k|v_{k-1}, v_{k+1})$ is conjugate, i.e. it is also inverse Gamma. However, by this construction it is not possible to attain positive correlation between $v_k$ and $v_{k-1}$. Positive correlations can be obtained by conditioning on the reciprocal of $v_{k-1}$ and defining $p(v_k|v_{k-1}) = \mathcal{IG}(v_k; a, (v_{k-1}a)^{-1})$; however, in this case the full conditional distribution $p(v_k|v_{k-1}, v_{k+1})$ becomes non-conjugate, since it has $v_k$, $1/v_k$ and $\log v_k$ terms. The basic idea is to introduce latent auxiliary variables $z_k$ between $v_k$ and $v_{k-1}$ such that when the $z_k$ are integrated out we restore positive correlation between $v_k$ and $v_{k-1}$ while retaining conjugacy. We define an inverse Gamma Markov chain (IGMC) for $k = 1 \dots K$ as follows:

$$z_1 \sim \mathcal{IG}(z_1; a_z, b_z/a_z) \qquad v_k|z_k \sim \mathcal{IG}(v_k; a, z_k/a) \qquad z_{k+1}|v_k \sim \mathcal{IG}(z_{k+1}; a_z, v_k/a_z)$$

Here, the $z_k$ are auxiliary variables that ensure that the full conditionals

$$p(v_k|z_k, z_{k+1}) \propto \exp\!\big((a + a_z + 1)\log v_k^{-1} - (a z_k^{-1} + a_z z_{k+1}^{-1})\, v_k^{-1}\big) \qquad (2)$$
and p(zk |vk , vk−1 ) are inverse Gamma. By integrating out over the auxiliary variable zk we obtain the effective transition kernel of the Markov chain, where it can be easily shown that p(vk |vk−1 ) = dzk p(vk |zk )p(zk |vk−1 ) = dzk IG(vk ; a, zk /a)IG(zk ; az , vk−1 /az ) =
−1 −1 Γ (a + az ) (az vk−1 )az (avk )a v −1 −1 Γ (az )Γ (a) (az vk−1 + avk−1 )(az +a) k
(3)
This distribution, which to our knowledge does not have a designated name, is a scale mixture of inverse Gamma distributions where the scaling function is also inverse Gamma. The transition kernel $p(v_k|v_{k-1})$ has positive correlation for various shape parameters $a_z$ and $a$. The absolute values of $a_z$ and $a$ control the strength of the correlation and the ratio $a_z/a$ controls the skewness. For $a_z/a < 1$ ($a_z/a > 1$), the probability mass is shifted towards the interval $v_k < v_{k-1}$ ($v_k > v_{k-1}$); hence, typical trajectories from an IGMC will exhibit a systematic negative (positive) drift. Using an exactly analogous construction, we define a Gamma-Markov chain (GMC) as

$z_1 \sim \mathcal{G}(z_1; a_z, (b_z a_z)^{-1}) \qquad \lambda_k|z_k \sim \mathcal{G}(\lambda_k; a_\lambda, (z_k a_\lambda)^{-1}) \qquad z_{k+1}|\lambda_k \sim \mathcal{G}(z_{k+1}; a_z, (\lambda_k a_z)^{-1})$

The effective transition kernel has a very similar expression to that in Eq. (3).

Example 1, Nonstationary Gaussian Process: We define a non-stationary Gaussian process $\{y_k\}_{k=1,2,\ldots}$ by drawing the variances $\{v_k\}_{k=1,2,\ldots}$ from an IGMC and drawing

$y_k|v_k \sim \mathcal{N}(y_k; 0, v_k)$

In Figure 1(a)-top, we show a realisation of $v_{1:K}$ from the IGMC, labelled as "true", and generate $y_{1:K}$ conditionally, Figure 1(a)-bottom. Given a realisation $y_{1:K}$, we can estimate the posterior variance $\langle v_k|y_{1:K}\rangle$. In this case, inference is carried out with variational Bayes, as will be detailed in Section 3.

Example 2, Nonhomogeneous Poisson Process: We partition an interval I on the real line into small disjoint regions $R_k$ of area L such that $I = \cup_{k=1}^K R_k$. We assume that the unknown intensity function of the process is piecewise constant and has the value $\lambda_k$ on region $R_k$. The intensity function $\{\lambda_k\}_{k=1,2,\ldots}$ is drawn according to a GMC. The number of points in $R_k$, given the intensity function, is denoted by the Poisson random variable

$c_k|\lambda_k \sim \mathcal{PO}(c_k; \lambda_k L)$

To generate a realisation from the Poisson process, we can uniformly draw $c_k$ points in each region $R_k$. In Figure 1(b), we show a realisation from the model. Given the number of events in each region $R_k$, we can estimate the value of the intensity function on $R_k$ by calculating $\langle\lambda_k|c_{1:K}\rangle$.
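As an illustration, the following sketch (our variable names; parameter values mirroring those quoted in Figure 1) draws realisations from the IGMC of Example 1 and the GMC of Example 2, exploiting the fact that an IG(v; a, z) draw in this parameterisation is the reciprocal of a Gamma variate with shape a and scale z:

```python
import numpy as np

rng = np.random.default_rng(1)

def ig(shape, scale, rng):
    # Draw from IG(v; a, z) as defined above: v = 1/Gamma(shape=a, scale=z),
    # so that <v^{-1}> = a*z holds.
    return 1.0 / rng.gamma(shape, scale)

# Example 1: inverse Gamma-Markov chain and nonstationary Gaussian process
K, a, a_z, b_z = 1000, 100.0, 100.0, 1.0     # values as in Fig. 1(a)
v = np.empty(K)
z = ig(a_z, b_z / a_z, rng)                  # z_1 ~ IG(a_z, b_z/a_z)
for k in range(K):
    v[k] = ig(a, z / a, rng)                 # v_k | z_k ~ IG(a, z_k/a)
    z = ig(a_z, v[k] / a_z, rng)             # z_{k+1} | v_k ~ IG(a_z, v_k/a_z)
y = rng.normal(0.0, np.sqrt(v))              # y_k ~ N(0, v_k)

# Example 2: Gamma-Markov chain and counts of a nonhomogeneous Poisson process
a_l, b_z2, L = 100.0, 10.0, 0.001            # values as in Fig. 1(b)
lam = np.empty(K)
zg = rng.gamma(a_z, 1.0 / (b_z2 * a_z))      # z_1 ~ G(a_z, (b_z*a_z)^{-1})
for k in range(K):
    lam[k] = rng.gamma(a_l, 1.0 / (zg * a_l))    # lambda_k | z_k
    zg = rng.gamma(a_z, 1.0 / (lam[k] * a_z))    # z_{k+1} | lambda_k
c = rng.poisson(lam * L)                     # c_k ~ PO(lambda_k * L)
```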
2.2 (Inverse) Gamma Markov Random Fields – (I)GMRF
We have defined the IGMC and GMC in the previous section using conditional distributions. An alternative but equivalent factorisation encodes the same distribution as an undirected model:

$p(z, v) \propto \psi(b_z^{-1}, a_z z_1^{-1}) \prod_k \phi(v_k^{-1}; a + a_z)\,\phi(z_k^{-1}; a + a_z)\,\psi(a z_k^{-1}, v_k^{-1})\,\psi(a_z v_k^{-1}, z_{k+1}^{-1})$

where we specify singleton potentials φ and pairwise potentials ψ as

$\phi(\xi; \alpha) = \exp((\alpha + 1)\log \xi) \qquad \psi(\xi, \eta) = \exp(-\xi\eta)$
Generalising this to a general undirected graph with vertex set V and undirected edge set E, we define an IGMRF on $\xi = \{\xi_i\}_{i\in V}$ by a set of connection weights $a = \{a_{i,j}\}_{(i,j)\in E}$ for $i, j \in V$ and $i \neq j$:

$p(\xi; a) = \frac{1}{Z_a} \prod_{i\in V} \phi\Big(\xi_i^{-1}; \sum_j a_{i,j}\Big) \prod_{(i,j)\in E} \psi\big(\xi_i^{-1}, (a_{i,j}/2)\xi_j^{-1}\big) \equiv \frac{1}{Z_a}\, p_a^*(\xi)$  (4)
Fig. 1. Synthetic examples generated from the model. The thick line shows the result of the variational inference. (a) Non-stationary Gaussian process. (a-Top) A typical draw of $v_{1:K}$ from the IGMC with $b_z = 1$, $a = a_z = 100$. (a-Bottom) A draw from the Gaussian process $y_{1:K}$ given $v_{1:K}$. (b) Non-homogeneous Poisson process. (b-Top) A typical draw from the GMC with $b_z = 10$, $a = a_z = 100$ and frame length $\mu(R_k) = L = 0.001$. (b-Bottom) Number of events in each $R_k$.
where φ and ψ are defined above. A GMRF is defined similarly by the potentials $\phi(\xi_i; \sum_j a_{i,j})$ and $\psi(\xi_i, (a_{i,j}/2)\xi_j)$, but with $\phi(\xi; \alpha) = \exp((\alpha - 1)\log \xi)$.
3 Inference
Exact inference in random fields is in general intractable, and various numerical methods have been developed, based on sampling (Monte Carlo, stochastic) or analytic approximation (variational, deterministic). Here, we focus on a particularly simple variational algorithm (mean field, or variational Bayes [1,13]), but the application of Monte Carlo methods, such as the Gibbs sampler [7], is algorithmically very similar [2]. Variational methods have been applied extensively, notably in machine learning for inference in large models. While lacking the theoretical guarantees of Monte Carlo approaches, variational methods have been viable alternatives in several practical situations where only a fixed CPU budget is available.

The main idea in variational Bayes is to approximate a target distribution $P \equiv p^*(\xi)/Z$ (such as the IGMRF defined in Eq. (4)) with a simple distribution Q. The variational distribution Q is chosen such that its expectations can be obtained easily, preferably in closed form. One such distribution is a factorised one, $Q(\xi) = \prod_{i\in V} Q_i(\xi_i)$. An intuitive interpretation of the mean field method is minimising the KL divergence with respect to (the parameters of) Q, where $KL(Q\|P) = \langle \log Q\rangle_Q - \langle \log p^*/Z_a\rangle_Q$. Using the non-negativity of KL, one can obtain a lower bound on the log-normalisation constant

$\log Z_a \geq \langle \log p^*\rangle_Q - \langle \log Q\rangle_Q$  (5)

The maximisation of this lower bound is equivalent to finding the "nearest" Q to P in terms of KL divergence. Whilst the solution is in general not available in
Fig. 2. Various IGMRF topologies as priors for time-frequency energy distributions. White nodes (always placed on a rectangular grid) correspond to $v_{\nu,\tau}$, where the vertical and horizontal axes correspond to the frequency band index ν and the time index τ, respectively. Gray nodes correspond to the auxiliary variables z. (a)-(d): vertical, horizontal, band, grid.
closed form, it can be easily shown, e.g. see [13], that each factor $Q_i$ of the optimal approximating distribution should satisfy the following fixed point equation

$Q_i \propto \exp\left(\langle \log p^*\rangle_{Q_{-i}}\right)$  (6)

where $Q_{-i} \equiv Q/Q_i$, that is, the joint distribution of all factors excluding $Q_i$. Hence, the mean field approach leads to a set of (deterministic) fixed point equations that need to be iterated until convergence. For an MRF, this fixed point expression is efficient to evaluate since it depends only on the neighbouring variables $j \in N(i)$. Finally, for conjugate models the factors are available in closed form; for example, the IGMRF leads to the factors $Q_i^{(t)}(\xi_i) = \mathcal{IG}(\xi_i; \alpha_i^{(t)}, \beta_i^{(t)})$ with

$\alpha_i^{(t)} = \theta_{\alpha,i} + \sum_{j\in N(i)} a_{i,j} \qquad \big(\beta_i^{(t)}\big)^{-1} = \theta_{\beta,i} + \sum_{j\in N(i)} a_{i,j} \big\langle \xi_j^{-1} \big\rangle_{Q_j^{(t-1)}}$
Here, $\theta_{\alpha,i}$ and $\theta_{\beta,i}$ denote the data contributions when an IGMRF is used as a prior where some of the ξ are observed via a conjugate observation model. For example, in the conditionally Gaussian observation model of Section 2 we have $\theta_{\alpha,i} = 1/2$ and $\theta_{\beta,i} = y_i^2/2$. Similarly, the Poisson model with a GMRF prior has $\theta_{\alpha,i} = c_i$ and $\theta_{\beta,i} = L$.
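To make the updates concrete, the following sketch (our notation; a 1-D IGMC chain rather than a general MRF, with boundary conventions matching the chain definition of Section 2.1) runs the mean-field fixed-point iterations for the nonstationary Gaussian model of Example 1, where $\theta_\alpha = 1/2$ and $\theta_\beta = y_k^2/2$:

```python
import numpy as np

# Mean-field VB for the IGMC denoising chain of Example 1: y_k ~ N(0, v_k),
# v ~ IGMC(a, a_z, b_z).  Every factor Q(v_k), Q(z_k) stays inverse Gamma,
# so only the sufficient statistics <v_k^{-1}> = shape/rate and <z_k^{-1}>
# need to be propagated through the fixed-point equations.

def vb_igmc(y, a=100.0, a_z=100.0, b_z=1.0, n_iter=50):
    K = len(y)
    Ev_inv = np.ones(K)          # <v_k^{-1}>, arbitrary initialisation
    Ez_inv = np.ones(K + 1)      # <z_k^{-1}> for z_1 .. z_{K+1}
    for _ in range(n_iter):
        # Q(z_k) updates: shape a+a_z (a_z for the last z, which has no
        # right neighbour), rate a_z*<v_{k-1}^{-1}> + a*<v_k^{-1}>;
        # the left neighbour of z_1 is the fixed hyperparameter b_z.
        for k in range(K + 1):
            left = a_z / b_z if k == 0 else a_z * Ev_inv[k - 1]
            right = a * Ev_inv[k] if k < K else 0.0
            shape = (a + a_z) if k < K else a_z
            Ez_inv[k] = shape / (left + right)
        # Q(v_k) updates: the data contribute theta_alpha = 1/2 and
        # theta_beta = y_k^2/2, the neighbours contribute via Eq. (2).
        for k in range(K):
            shape = 0.5 + a + a_z
            rate = 0.5 * y[k] ** 2 + a * Ez_inv[k] + a_z * Ez_inv[k + 1]
            Ev_inv[k] = shape / rate
    return 1.0 / Ev_inv          # point estimates of the variances v_k
```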
3.1 Simulation Experiments
In Figure 1, we show the results of variational inference for two synthetic examples. In the following, we will illustrate the IGMRF model used as a prior for time-frequency energy distributions of nonstationary sources. Linear time-frequency representations decompose a signal y(t) as a linear decomposition of the form $y(t) = \sum_{(\nu,\tau)} s_{(\nu,\tau)} f_{\nu,\tau}(t)$, where $s_{(\nu,\tau)}$ is the expansion coefficient corresponding to the basis function $f_{\nu,\tau}(t)$. Somewhat succinctly we can write y = F s, where the collection of basis functions is denoted by a matrix F in which each column corresponds to a particular $f_{\nu,\tau}$. The well-known Gabor representation or modified discrete cosine transform (MDCT) have this form and can be
computed using fast transforms, where τ corresponds to time and ν corresponds to frequency. In the sequel, we will impose a conditionally Gaussian prior on transform coefficients, $\mathcal{N}(s_{(\nu,\tau)}; 0, v_{(\nu,\tau)})$, where the covariance structure will be drawn from an IGMRF.

Denoising: In the first experiment, we illustrate the denoising performance of 4 MRF topologies on a set of 5 audio clips (speech, piano, guitar, perc1, perc2) in 3 different noise conditions: low, medium, high. We transform each clip via MDCT to $s^{\mathrm{true}}_{(\nu,\tau)}$ and add independent white Gaussian noise with variance $r \sim \mathcal{IG}(r; a_r, b_r)$ to obtain $x_{(\nu,\tau)}$. Note that since the MDCT is an orthonormal linear transform, we could have added the noise in the time domain and the noise characteristics would have remained unaltered. As inference engine, we use variational Bayes. The task of the inference algorithm is to infer the latent source coefficients $s_{(\nu,\tau)}$ by integrating out the noise variance r and the IGMRF variables ξ. The optimisation of the MRF parameters a is carried out by the Newton method, where we maximise the lower bound in Eq. (5). We assume a homogeneous MRF structure where the coupling values are the same throughout the network². The signal-to-noise ratios of the reconstructions and the inference results are given in Figure 3-(a). The SNR results do not show big variations across topologies, with the grid model consistently providing good reconstruction, especially in high and medium noise conditions. We note that SNR may not be the best metric to measure perceptual quality, and the reader is invited to listen to the audio examples provided online at http://www.cmpe.boun.edu.tr/∼dikmen/ICA07/. In informal subjective listening tests, we perceive a somewhat better reconstruction quality with the grid topology.
N (sk ; 0; vk,j ), xk = Jj=1 sk,j In this scenario, the reconstruction equations have −1 a particularly simple form. Given the variance estimate Vk,j = 1/vk,j at k, we define a positive quantity,
which we shall name as responsibility also know as = V /( filter factors, κ k,j k,j j Vk,j ), where by construction 0 < κj for all j and
j κj ≤ 1. It turns out that the sufficient statistics can be compactly written as sk,j = κk,j xk
2 sk,j = Vk,j (1 − κk,j ) + κ2k,j x2k
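A minimal sketch of this reconstruction step (our variable names; a noise-free mixture, so the responsibilities sum exactly to one here, whereas an additional noise variance term in the denominator would make the sum strictly smaller) is:

```python
import numpy as np

def reconstruct(x, V):
    # x: (K,) mixture coefficients at the time-frequency atoms;
    # V: (K, J) per-source variance estimates V[k, j] = 1/<v_{k,j}^{-1}>.
    kappa = V / V.sum(axis=1, keepdims=True)   # responsibilities; they sum
    # to one here because the mixture is noise-free -- with an extra noise
    # variance in the denominator the sum would be strictly below one.
    s_mean = kappa * x[:, None]                            # <s_{k,j}>
    s_sq = V * (1 - kappa) + kappa ** 2 * x[:, None] ** 2  # <s_{k,j}^2>
    return s_mean, s_sq
```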
We illustrate this approach to separate a piano sound into its constituent components. We assume that J = 2 components are generated independently by two IGMRF models with vertical and horizontal topology. In Figure 3-(b), we observe that the model is able to separate transients and harmonic components.
² In (a) (vertical), each white node is connected to two gray nodes with $a_{\mathrm{north}}$ or $a_{\mathrm{south}}$, and in (b) (horizontal) with $a_{\mathrm{west}}$ or $a_{\mathrm{east}}$. The grid topology (d) has couplings in four directions and in (c) (band), we use a single a.
Fig. 3. (a) Signal-to-noise ratio results for the reconstructions obtained from the audio clips in low, medium and high noise conditions. (b) Single channel source separation example; from left to right: log-MDCT coefficients of the original signal and the reconstructions with the horizontal and vertical IGMRF models.
3.2 Discussion
We have introduced a conjugate gamma Markov random field model for modelling nonstationary sources. The conjugacy makes it possible to design fast inference algorithms in a straightforward way. The simple MRF topologies considered here are quite generic, yet provide good results without any hand tuned parameters. One can envision more structured models with a larger set of parameters to capture physical reality, certainly for acoustical signals. We are currently investigating further applications such as restoration, transcription or tracking time varying intensity functions.
References
1. Attias, H.: Independent factor analysis. Neural Computation 11(4), 803–851 (1999)
2. Cemgil, A.T., Févotte, C., Godsill, S.J.: Variational and stochastic inference for Bayesian source separation. Digital Signal Processing (in print, 2007)
3. Cichocki, A., Amari, S.I.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, revised edn. Wiley, Chichester (2003)
4. Févotte, C., Daudet, L., Godsill, S.J., Torrésani, B.: Sparse regression with structured priors: Application to audio denoising. In: Proc. ICASSP, Toulouse, France (May 2006)
5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
6. Knuth, K.H.: Bayesian source separation and localization. In: SPIE'98: Bayesian Inference for Inverse Problems, San Diego, pp. 147–158 (July 1998)
7. Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, Heidelberg (2004)
8. MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
9. Miskin, J., MacKay, D.: Ensemble learning for blind source separation. In: Roberts, S.J., Everson, R.M. (eds.) Independent Component Analysis, pp. 209–233. Cambridge University Press, Cambridge (2001)
10. Reyes-Gomez, M., Jojic, N., Ellis, D.: Deformable spectrograms. In: AI and Statistics Conference, Barbados (2005)
11. Rowe, D.B.: A Bayesian approach to blind source separation. Journal of Interdisciplinary Mathematics 5(1), 49–76 (2002)
12. Shephard, N. (ed.): Stochastic Volatility, Selected Readings. Oxford University Press, Oxford (2005)
13. Wainwright, M., Jordan, M.I.: Graphical models, exponential families, and variational inference. Technical Report 649, Department of Statistics, UC Berkeley (September 2003)
14. Wolfe, P.J., Godsill, S.J., Ng, W.J.: Bayesian variable selection and regularisation for time-frequency surface estimation. Journal of the Royal Statistical Society (2004)
15. Wolpert, R.L., Ickstadt, K.: Poisson/gamma random field models for spatial statistics. Biometrika (1998)
Blind Separation of Quantum States: Estimating Two Qubits from an Isotropic Heisenberg Spin Coupling Model

Yannick Deville and Alain Deville

Université Paul Sabatier Toulouse 3 - CNRS, Laboratoire d'Astrophysique de Toulouse-Tarbes (UMR 5572), 14 Av. Edouard Belin, 31400 Toulouse, France, [email protected]
Université de Provence, Bâtiment IRPHE, 49 Rue Joliot Curie, BP146, 13384 Marseille Cedex 13, France, [email protected]
Abstract. Blind source separation (BSS) and Quantum Information Processing (QIP) are two recent and rapidly evolving fields. No connection has ever been made between them to our knowledge. However, future practical QIP systems will probably involve "observed mixtures", in the BSS sense, of quantum states (qubits), e.g. associated to coupled spins. We here investigate how individual qubits may be retrieved from Heisenberg-coupled versions of them, and we show the relationship between this problem and classical BSS. We thus introduce new nonlinear mixture models for qubits, motivated by actual quantum physical devices. We analyze the invertibility and ambiguities of these models. We propose practical data processing methods for performing inversions.
1 Introduction
Various areas in the information processing field developed very rapidly during the last decades. This includes the generic Blind Source Separation (BSS) problem [1], which consists in estimating a set of unknown source signals from a set of observed (i.e. measured) signals which are "mixtures" of these source signals. BSS methods thus apply to a wide range of signal denoising and component extraction problems. This especially concerns communications, e.g. when a set of radio-frequency antennas provide linear combinations, i.e. "mixtures", of several emitted signals and one aims at retrieving each emitted signal only from their available mixtures (see e.g. [2] for an implementation of this approach). Another growing area is Quantum Information Processing (QIP), which is closely related to Quantum Physics (QP) [3]. QIP uses abstract representations of systems whose behavior is requested to obey the laws of QP. This already made it possible to develop new and powerful information processing methods, to be contrasted with "classical", i.e. non-quantum, methods such as the above-mentioned BSS approaches. Their effective implementation then requires the development of corresponding practical quantum systems, which is only an emerging topic today.
To our knowledge, no connection has ever been made between the BSS and QIP/QP areas. One may expect, however, that "coupling" between individual "signals" (i.e. states) will also have to be considered in the QIP/QP area. Such couplings indeed occur even in the basic situation when two spins interact according to the isotropic Heisenberg model. In this paper, we consider this configuration, we investigate how each spin may be retrieved from the coupled version of both of them, and we show the relationship between this problem and classical BSS. The relevance of this approach also stems from the fact that, to a large extent, classical BSS belongs to the more general Statistical Signal Processing (SSP) field. Since QIP and QP are essentially based on a probabilistic view of physical phenomena, trying to bridge the gap between SSP/BSS and QIP/QP is a priori a reasonable attempt.
2 Quantum and SSP Points of View for One Qubit
The fundamental concept used in abstract QIP is the quantum bit, or qubit [3]. A qubit has a state $|\psi\rangle$, which may be expressed in the basis defined by two vectors, that we denote $|+\rangle$ and $|-\rangle$ hereafter. This state thus reads

$|\psi\rangle = \alpha|+\rangle + \beta|-\rangle$  (1)

where α and β are two complex-valued coefficients, which are requested to be such that the state $|\psi\rangle$ is normalized, i.e.

$|\alpha|^2 + |\beta|^2 = 1$.  (2)
From a QP point of view, this abstract mathematical model especially concerns the spin of an electron, which is a quantum (i.e. non-classical) quantity. The component of this spin along a given arbitrary axis Oz defines a two-dimensional linear operator $s_z$. The two eigenvalues of this operator are equal to +1/2 and −1/2 in normalized units, and the corresponding eigenvectors are therefore denoted $|+\rangle$ and $|-\rangle$. The value obtained when measuring this spin component can only be +1/2 or −1/2. Moreover, assume this spin is in the state $|\psi\rangle$ defined by (1) when performing such a measurement. Then, the probability that the measured value is equal to +1/2 (resp. −1/2) is equal to $|\alpha|^2$ (resp. $|\beta|^2$), i.e. to the squared modulus of the coefficient in (1) of the associated eigenvector $|+\rangle$ (resp. $|-\rangle$).

The above discussion concerns the state of the considered spin at a given time. In addition, this state evolves with time. During the time interval when this spin (or the two coupled spins in the next section) is considered, it is supposed to be isolated or placed in a magnetic field, so that this system has a Hamiltonian. The spin state thus evolves according to Schrödinger's equation. Briefly, if this state $|\psi(t_0)\rangle$ is defined by (1) at time $t = t_0$, its value at any time t is then

$|\psi(t)\rangle = \alpha e^{-i\omega_p(t-t_0)}|+\rangle + \beta e^{-i\omega_m(t-t_0)}|-\rangle$  (3)

where $i = (-1)^{1/2}$ and the real (angular) frequencies $\omega_p$ and $\omega_m$ depend on the considered physical setup.
While we summarized above well-known concepts from the QIP and QP points of view, we now propose a way to link them with an SSP perspective. We essentially aim at handling the situation when a first user (called "W" hereafter) "writes" the considered spin, i.e. initializes its state at a given time $t_w$ with the value

$|\psi(t_w)\rangle = \alpha|+\rangle + \beta|-\rangle$  (4)

and a second user (called "R") does not know this state and aims at "reading" it, i.e. at estimating it, at another time $t_r$. From an SSP point of view, the above description suggests that this may be achieved by introducing a "repeated write/read" (RWR) procedure, which consists in performing K times the same "write/read" step. Each occurrence k of this step, with k = 1...K, consists in first letting user W write the spin at a time $t_w(k)$ with the state of interest, i.e.

$|\psi(t_w(k))\rangle = \alpha|+\rangle + \beta|-\rangle$.  (5)

Due to (3), this spin state becomes at time $t_r(k)$

$|\psi(t_r(k))\rangle = \alpha e^{-i\omega_p T(k)}|+\rangle + \beta e^{-i\omega_m T(k)}|-\rangle$  (6)

with $T(k) = t_r(k) - t_w(k)$. User R reads this state at time $t_r(k)$. As explained above, this measurement of the spin component along axis Oz can only yield one of the values +1/2 and −1/2, resp. with probabilities $p_1$ and $p_2$ equal to the squared moduli of the coefficients associated to the states $|+\rangle$ and $|-\rangle$ in (6), i.e.

$|\alpha e^{-i\omega_p T(k)}|^2 = p_1$  (7)
$|\beta e^{-i\omega_m T(k)}|^2 = p_2$.  (8)

These equations do not depend on their phase factors, i.e. they reduce to

$|\alpha|^2 = p_1$  (9)
$|\beta|^2 = p_2$.  (10)
Estimates of $p_1$ and $p_2$ may be straightforwardly obtained as the relative frequencies of occurrence of the values +1/2 and −1/2, resp., in the measurements. Eqs. (9)-(10) then directly provide the squared moduli (up to estimation errors) of the parameters α and β that user R aims at determining. This leaves a phase ambiguity, which may then be reduced or completely avoided as follows. First consider a configuration where we constrain user W to only write the spin with real-valued α and β. Eqs. (9)-(10) then make it possible to estimate these parameters up to only a sign ambiguity. Now consider another configuration, where only real and positive (including zero) values of α and β are used. The approach that we proposed above then yields these parameters without any ambiguity. We thus defined a procedure which allows user R to read a spin "blindly", i.e. without any knowledge about it except, possibly, about the domain where α and β are requested to be situated. Let us stress that this procedure may be actually implemented in practice, i.e. the component of a spin along a given axis may be measured, although this requires a complex physical setup. Building upon this solution for a single spin, we now aim at extending it to two coupled spins.
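A minimal simulation of this procedure (hypothetical values; measurement outcomes drawn as Bernoulli trials with success probability |α|², as Eqs. (9)-(10) prescribe) reads:

```python
import numpy as np

# Repeated write/read (RWR) procedure for a single spin: user W writes the
# state alpha|+> + beta|->, user R measures s_z along Oz K times and
# estimates |alpha|^2, |beta|^2 as relative frequencies (Eqs. (9)-(10)).

rng = np.random.default_rng(0)
alpha, beta = 0.6, 0.8            # real, positive initialisation: no ambiguity
K = 10000

outcomes = rng.random(K) < alpha ** 2   # True -> measured +1/2, prob |alpha|^2
p1_hat = outcomes.mean()
alpha_hat = np.sqrt(p1_hat)             # valid because alpha was taken >= 0
beta_hat = np.sqrt(1.0 - p1_hat)
print(alpha_hat, beta_hat)
```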
3 Estimating Two Qubits with Heisenberg Spin Coupling
3.1 Considered Qubits: Quantum Point of View
Future QIP systems will simultaneously handle several qubits, which will e.g. be physically implemented as sets of spins. One may expect that undesired coupling (i.e. "mixture", using BSS terminology¹) between these spins will appear in quantum physical setups, as in current classical signal processing systems, such as the one outlined in Section 1 for communication applications. Indeed, a well-known type of mixture between two spins consists of isotropic Heisenberg coupling, which may be defined as follows [4]. Assume two spins, called spin 1 and spin 2 hereafter, are resp. initialized with states

$(\alpha_1|+\rangle + \beta_1|-\rangle)$ and $(\alpha_2|+\rangle + \beta_2|-\rangle)$  (11)

at a given time $t_0$ and coupled according to the Heisenberg model from then on. Hereafter, we consider the state $|\psi(t)\rangle$ of the overall system composed of these two identifiable spins. At time $t_0$, this state is equal to the tensor product of the states (11) of both spins. It may be expressed as

$|\psi(t_0)\rangle = \alpha_1\alpha_2|++\rangle + \alpha_1\beta_2|+-\rangle + \beta_1\alpha_2|-+\rangle + \beta_1\beta_2|--\rangle$  (12)
in the four-dimensional basis $B_+ = \{|++\rangle, |+-\rangle, |-+\rangle, |--\rangle\}$ which corresponds to the operators $s_{1z}$ and $s_{2z}$, resp. associated to the components of the two spins along a given axis Oz. This state may also be expressed in the four-dimensional basis composed of the eigenvectors of the Heisenberg coupling's Hamiltonian. We here denote this basis $B_1 = \{|1,1\rangle, |1,-1\rangle, |1,0\rangle, |0,0\rangle\}$. Using the known expression of $B_+$ with respect to $B_1$, (12) yields

$|\psi(t_0)\rangle = \alpha_1\alpha_2|1,1\rangle + \beta_1\beta_2|1,-1\rangle + \frac{\alpha_1\beta_2 + \beta_1\alpha_2}{\sqrt{2}}|1,0\rangle + \frac{\alpha_1\beta_2 - \beta_1\alpha_2}{\sqrt{2}}|0,0\rangle$.  (13)

The temporal evolution of this state then corresponds to phase rotations for each eigenvector, as in (3). The state at any time t then reads in basis $B_1$

$|\psi(t)\rangle = \alpha_1\alpha_2 e^{-i\omega_{1,1}(t-t_0)}|1,1\rangle + \beta_1\beta_2 e^{-i\omega_{1,-1}(t-t_0)}|1,-1\rangle + \frac{\alpha_1\beta_2 + \beta_1\alpha_2}{\sqrt{2}} e^{-i\omega_{1,0}(t-t_0)}|1,0\rangle + \frac{\alpha_1\beta_2 - \beta_1\alpha_2}{\sqrt{2}} e^{-i\omega_{0,0}(t-t_0)}|0,0\rangle$  (14)

where $\omega_{i,j}$ is the real frequency associated to the phase rotation for each eigenvector $|i,j\rangle$. Using the expression of $B_1$ with respect to $B_+$ then yields the expression of the system state at any time t in basis $B_+$

$|\psi(t)\rangle = \alpha_1\alpha_2 e^{-i\omega_{1,1}(t-t_0)}|++\rangle + \beta_1\beta_2 e^{-i\omega_{1,-1}(t-t_0)}|--\rangle + \frac{1}{2}\left[(\alpha_1\beta_2 + \beta_1\alpha_2)e^{-i\omega_{1,0}(t-t_0)} + (\alpha_1\beta_2 - \beta_1\alpha_2)e^{-i\omega_{0,0}(t-t_0)}\right]|+-\rangle + \frac{1}{2}\left[(\alpha_1\beta_2 + \beta_1\alpha_2)e^{-i\omega_{1,0}(t-t_0)} - (\alpha_1\beta_2 - \beta_1\alpha_2)e^{-i\omega_{0,0}(t-t_0)}\right]|-+\rangle$.  (15)
¹ This should not be confused with "statistical mixtures" used in QP. From a QP point of view, this paper only concerns pure states, as opposed to statistical mixtures.
Note that this state $|\psi(t)\rangle$ is more easily expressed in basis $B_1$ than in $B_+$. We have to consider the latter expression, however, because only this basis corresponds to variables which may be measured in practice, i.e. $s_{1z}$ and $s_{2z}$. We here started from a concrete (i.e. physical) setup, thus considering a QP point of view. This led us to the state expression (15). From here on, we may therefore move to an abstract QIP point of view, only considering the couple of qubits defined by this state expression (15) and aiming at estimating each of these qubits from their coupled version (15).
3.2 Considered Qubits: SSP Point of View
The first step that we propose towards the estimation of the considered qubits is again based on an SSP approach. It extends to two qubits the RWR procedure that we introduced for one qubit in Section 2. The resulting method operates as follows. In each occurrence k of the write/read step, user W first writes both qubits at time $t_w(k)$, resp. with the states defined in (11), and user R then reads at time $t_r(k)$ the state of the system composed of the two coupled qubits, which is defined by (15) except that $(t - t_0)$ is replaced by $T(k) = t_r(k) - t_w(k)$. Reading this state means that user R measures the couple of values associated to $s_{1z}$ and $s_{2z}$. This couple is then equal to one of the four possible values (+1/2, +1/2), (−1/2, −1/2), (+1/2, −1/2) and (−1/2, +1/2), resp. with probabilities $p_1$, $p_2$, $p_3$ and $p_4$ equal to the squared moduli of the coefficients associated to the states composing $B_+$ which appear in the considered modified version of (15), i.e.

$|\alpha_1\alpha_2 e^{-i\omega_{1,1}T(k)}|^2 = p_1$  (16)
$|\beta_1\beta_2 e^{-i\omega_{1,-1}T(k)}|^2 = p_2$  (17)
$\frac{1}{4}\left|(\alpha_1\beta_2 + \beta_1\alpha_2)e^{-i\omega_{1,0}T(k)} + (\alpha_1\beta_2 - \beta_1\alpha_2)e^{-i\omega_{0,0}T(k)}\right|^2 = p_3$  (18)
$\frac{1}{4}\left|(\alpha_1\beta_2 + \beta_1\alpha_2)e^{-i\omega_{1,0}T(k)} - (\alpha_1\beta_2 - \beta_1\alpha_2)e^{-i\omega_{0,0}T(k)}\right|^2 = p_4$.  (19)

Again, (16)-(17) do not depend on their phase factors, i.e. they reduce to

$|\alpha_1\alpha_2|^2 = p_1$  (20)
$|\beta_1\beta_2|^2 = p_2$.  (21)
In order to use our SSP approach, (18)-(19) should involve the same parameter values in all occurrences k of the write/read step. The write-read time interval T(k) should therefore be the same for all occurrences. It is denoted T hereafter. Estimates of $p_1$ to $p_4$ may be straightforwardly obtained as the relative frequencies of occurrence of the four values (+1/2, +1/2) to (−1/2, +1/2), resp., in the measurements. However, unlike in Section 2, these estimated probabilities do not directly yield the parameters $\alpha_i$ and $\beta_i$ that user R aims at determining, i.e. the two considered qubits are still "mixed" in these measured data. This therefore defines a new nonlinear BSS-like problem (a survey of nonlinear BSS is e.g.
available in [5]), where the observed data consist of the measured probabilities $p_1$ to $p_4$, the "source signals" to be extracted from them are the parameters $\alpha_i$ and $\beta_i$, and the unknown coefficients of the considered set of nonlinear mixing equations are the frequencies $\omega_{i,j}$. This quantum mixture model (16)-(19), or its slightly simplified form involving (20)-(21), may be referred to as the "natural complex-valued isotropic Heisenberg spin coupling model" or, more briefly, the "DD1 model" (for Deville & Deville quantum model no. 1 from a BSS point of view). Note that the equations in this model are partly redundant:

$p_1 + p_2 + p_3 + p_4 = 1$  (22)
because the initial states (11) are normalized, so that the state $|\psi(t_w(k))\rangle$ defined by (12) is normalized, and this state then evolves according to Schrödinger's equation, which keeps the norm unchanged. We now show how to separate these sources, depending on how these qubits are initialized.
3.3 Retrieving Qubits with Real-Valued Initialization
We first consider the case when the parameters $\alpha_i$ and $\beta_i$ which define the initial states of both qubits are constrained to be real-valued. (20)-(21) then reduce to

$\alpha_1^2\alpha_2^2 = p_1$  (23)
$\beta_1^2\beta_2^2 = p_2$.  (24)
These equations are sufficient for extracting the two qubits², as shown below. This "real-valued isotropic Heisenberg spin coupling model" (or sub-model) is called "DD2" below. It may be inverted as follows. Each initial qubit state meets the normalization condition (2), so that

$\beta_i^2 = 1 - \alpha_i^2$.  (25)

Inserting it in (24) yields

$\alpha_2^2 = 1 - \frac{p_2}{1 - \alpha_1^2}$.  (26)

Inserting (26) in (23) and multiplying by $(1 - \alpha_1^2)$ yields a second-order polynomial equation for $\alpha_1^2$, whose roots read

$\alpha_1^2 = \frac{1}{2}\left[(1 + p_1 - p_2) \pm \sqrt{(1 + p_1 - p_2)^2 - 4p_1}\right]$.  (27)

The corresponding values of $\alpha_2^2$ may be derived from (26), but this is not needed: it may be shown that if $\alpha_1^2$ is set to one of the roots of (27), then the corresponding value of $\alpha_2^2$ defined by (26) is equal to the other root of (27). As for $\alpha_1^2$ and $\alpha_2^2$, the considered problem then has a single solution, up to a permutation.
² At least up to some ambiguities. The possibility to reduce these ambiguities by using (18)-(19) will be presented in a future paper, due to space limitations. The use of (18) is also discussed in Section 3.4, in a slightly different context.
Besides, each value of $\alpha_i^2$ yields a single value $\beta_i^2$, due to (25). This leads to the following results for the overall qubit values: (i) if both qubits are constrained to be initialized with positive values, then the above equations yield both qubit parameters $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$ only up to a permutation ambiguity, which is natural due to the symmetry of (23)-(24) with respect to both qubits; (ii) if the signs of the real parameters of the initial states of both qubits are not constrained, then a sign ambiguity appears in addition, because each above value of $\alpha_i^2$ or $\beta_i^2$ yields two opposite solutions for $\alpha_i$ or $\beta_i$. One may note the similarity of these results with (i) those for a single qubit obtained in Section 2, and (ii) the ambiguities in classical linear instantaneous BSS, i.e. permutation and scale, with the scale ambiguity reducing to a sign ambiguity for normalized signals.
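The following sketch (our function names) inverts the DD2 model numerically from the measured probabilities p1 and p2, implementing Eqs. (25)-(27) for the positive-initialization configuration (i):

```python
import numpy as np

# Inversion of the real-valued model DD2 (Eqs. (23)-(27)): recover
# (alpha_1, beta_1) and (alpha_2, beta_2), up to a permutation, from the
# measured probabilities p1 and p2, assuming positive initial values.

def invert_dd2(p1, p2):
    d = (1.0 + p1 - p2) ** 2 - 4.0 * p1
    a1_sq = 0.5 * ((1.0 + p1 - p2) + np.sqrt(d))   # one root of Eq. (27)
    a2_sq = 0.5 * ((1.0 + p1 - p2) - np.sqrt(d))   # the other root
    q1 = (np.sqrt(a1_sq), np.sqrt(1.0 - a1_sq))    # (alpha, beta) of one qubit
    q2 = (np.sqrt(a2_sq), np.sqrt(1.0 - a2_sq))
    return q1, q2

# Quick self-check with alpha_1 = 0.9, alpha_2 = 0.5 (betas from Eq. (2)):
a1, a2 = 0.9, 0.5
b1, b2 = np.sqrt(1 - a1 ** 2), np.sqrt(1 - a2 ** 2)
print(invert_dd2((a1 * a2) ** 2, (b1 * b2) ** 2))
```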
3.4 Retrieving Qubits with Complex-Valued Initialization
We now come back to the general DD1 model, i.e. we consider arbitrary complex-valued qubit initializations. We here express each qubit parameter in polar form:

$\alpha_1 = r_1 e^{i\theta_1} \qquad \beta_1 = q_1 e^{i\phi_1} \qquad \alpha_2 = r_2 e^{i\theta_2} \qquad \beta_2 = q_2 e^{i\phi_2}$.  (28)
Eqs. (16) and (17) are then easily shown to be equivalent to

$r_1^2 r_2^2 = p_1$  (29)
$q_1^2 q_2^2 = p_2$.  (30)
Longer calculations using phase factorizations show that (18) is equivalent to

$(r_1 q_2 \cos\Delta_E)^2 + (q_1 r_2 \sin\Delta_E)^2 - 2 r_1 r_2 q_1 q_2 \cos\Delta_E \sin\Delta_E \sin\Delta_I = p_3$  (31)

where

$\Delta_I = (\phi_2 - \phi_1) - (\theta_2 - \theta_1) \quad \text{and} \quad \Delta_E = \frac{(\omega_{1,0} - \omega_{0,0})T}{2}$.  (32)

A similar expression may be derived from (19) but, due to (22), this expression is redundant with (29)-(31). The latter equations then form our "polar complex-valued isotropic Heisenberg spin coupling model" (or sub-model), called "DD3". Eqs. (29)-(30) only concern the moduli of the qubits. They have already been solved above, since they turn out to be identical to (23)-(24), here with positive parameters, so that they yield a single solution, up to a permutation. The phase part of the qubits is then addressed by (31), which yields the following conclusions. Only the combination $\Delta_I$ of the qubit phases, defined in (32), may be retrieved from (31). To avoid ambiguities, one may therefore fix three of the phase parameters $\theta_1$, $\phi_1$, $\theta_2$ and $\phi_2$ (e.g. to 0) and only use the fourth parameter to store information. Eq. (31) only yields $\sin\Delta_I$, i.e. it provides $\Delta_I$ up to the ambiguities of the sine function. To derive $\sin\Delta_I$ from (31), all quantities $r_1$, $r_2$, $q_1$, $q_2$, $\cos\Delta_E$ and $\sin\Delta_E$ must be non-zero, and all quantities other than $\sin\Delta_I$ should be known a priori or estimated. $p_1$ to $p_3$ are estimated from measurements, and $r_1$, $r_2$, $q_1$, $q_2$ are derived from them as explained above. In a given standard configuration, $\Delta_E$ is fixed but not known a priori, because this requires very detailed knowledge of the system's physical properties. We here
aim at estimating it blindly with SSP methods, i.e. from a sequence of different coupled-qubit values, where these values are unknown but some of their statistical properties are assumed to be known. This need for SSP methods should be contrasted with the previous aspects of this paper, where the "mixing" equations could be solved from a single couple of qubits. For example, a simple SSP approach consists in using a realization of a sequence (indexed by n) of identically distributed, mutually statistically independent, random variables $r_1(n)$, $r_2(n)$, $q_1(n)$, $q_2(n)$ and $\Delta_I(n)$, and considering the expectation of (31). This yields

$E\{r_1^2(n)\}E\{q_2^2(n)\}\cos^2\Delta_E + E\{q_1^2(n)\}E\{r_2^2(n)\}\sin^2\Delta_E - 2E\{r_1(n)\}E\{r_2(n)\}E\{q_1(n)\}E\{q_2(n)\}\cos\Delta_E\sin\Delta_E\, E\{\sin\Delta_I(n)\} = E\{p_3(n)\}$.  (33)

The moduli $r_i(n)$ and $q_i(n)$ for each couple of qubits in the sequence may be estimated as explained above, and then used to estimate the statistics of these moduli used in (33). $E\{p_3(n)\}$ is also estimated from this sequence. By constraining the sequence of processed data (i.e. qubit values) to be such that their statistical parameter $E\{\sin\Delta_I(n)\}$ has a known value, (33) makes it possible to estimate $\Delta_E$. Note that the solution is quite simple when $E\{\sin\Delta_I(n)\} = 0$.
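As an illustration, when the processed sequence is constrained so that E{sin ΔI(n)} = 0, Eq. (33) reduces to A cos²ΔE + B sin²ΔE = E{p3} with A = E{r1²}E{q2²} and B = E{q1²}E{r2²}, so that tan²ΔE = (A − E{p3})/(E{p3} − B). A sketch of the corresponding moment-based estimator (our names) is:

```python
import numpy as np

# Estimating Delta_E from Eq. (33) in the simple case E{sin Delta_I(n)} = 0,
# where the cross term vanishes and
#   E{r1^2}E{q2^2} cos^2(Delta_E) + E{q1^2}E{r2^2} sin^2(Delta_E) = E{p3}.

def estimate_delta_e(r1, r2, q1, q2, p3):
    # r1, r2, q1, q2, p3: 1-D arrays over the sequence index n
    A = np.mean(r1 ** 2) * np.mean(q2 ** 2)
    B = np.mean(q1 ** 2) * np.mean(r2 ** 2)
    t2 = (A - np.mean(p3)) / (np.mean(p3) - B)   # tan^2(Delta_E)
    return np.arctan(np.sqrt(t2))                # one solution, up to the
                                                 # ambiguities left by Eq. (31)
```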
4 Conclusion
In this theoretical paper, we bridged the gap between the QIP/QP and SSP/BSS domains. We thus introduced what may become the "Blind Quantum Source Separation" (BQSS) field. From a BSS point of view, we proposed several nonlinear mixture models, motivated by actual quantum physical systems. We analyzed the invertibility and ambiguities of these models and we proposed practical methods for restoring qubits from their coupled versions. The next stages of this work, not detailed here due to space limitations, will consist in extending these BSS methods, testing them with simulations of quantum systems and introducing more general quantum mixture models which result in higher needs for SSP methods.
References
1. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
2. Deville, Y., Damour, J., Charkani, N.: Multi-tag radio-frequency identification systems based on new blind source separation neural networks. Neurocomputing 49, 369–388 (2002)
3. Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2000)
4. Caspers, W.J.: Spin Systems. World Scientific, Singapore (1989)
5. Jutten, C., Karhunen, J.: Advances in blind source separation (BSS) and independent component analysis (ICA) for nonlinear mixtures. International Journal of Neural Systems 14(5), 267–292 (2004)
An Application of ICA to BSS in a Container Gantry Crane Cabin's Model

Juan-José González de-la-Rosa, Carlos G. Puntonet, A. Moreno Muñoz, A. Illana, and J.A. Carmona

University of Cádiz, Electronics Area, EPSA, Av. Ramón Puyol S/N, E-11202, Algeciras-Cádiz, Spain, [email protected]
Research Group TIC168 - Computational Instrumentation and Industrial Electronics
University of Granada, Dept. of Architecture and Computers Technology, ESII, C/Periodista Daniel Saucedo, 18071, Granada, Spain, [email protected]
University of Córdoba, Electronics Area, Edf. Albert Einstein, Campus Rabanales, Córdoba, Spain, [email protected]
Abstract. This paper deals with the simulation of a ship-container gantry crane cabin's behavior during an operation of load releasing, and with BSS via ICA for de-noising and movement separation. The goal consists of obtaining a reliable model of the cabin, with the aim of reducing the non-desired cabin vibrations. We present the Simulink-based simulation results and the results of the signal separation algorithms when the load is released by the crane onto the container ship. We conclude that the mass center position of the cabin dramatically affects the vibrations of the crane. A set of graphs involving displacements and rotations of the cabin is presented to illustrate the effect of the mass center position's bias and the results of the ICA action.
1 Introduction
The study of the vibrations in a gantry crane used in a containers terminal is an issue related to the security of the crane operator and to the durability of the design. The vibrations take place mostly in the operator cabin. The main problem is that a short amplitude vibration in the trolley may produce high amplitude values in the cabin, which may affect the operator's safety. Numerous achievements have been made in the field of control for overhead crane systems, which have proven to be an improvement in position accuracy, safety and stabilization control [1,2,3,4,5]. With the goal of adapting the developed control schemes to portainers (container gantry cranes), the modeling of the system has to be developed. In this paper we present an innovative Simulink model of a real-life gantry crane cabin, like the one shown in Fig. 1, and its emulated performance when a container is released into the ship.
Fig. 1. Container Gantry Cranes at Algeciras harbor
The crucial role of FastICA consists of extracting any parasitic vibration which may remain in the system, affecting the human operator's comfort, along with the de-noising of the signals in order to perform a further analysis of the vibration modes of the cabin in an ulterior cabin model. Similar results of ICA have been observed in [6] and in [7], which indicates the suitability of the method for signal extraction. The novelty of this paper is the application to gantry cranes. The results show a new set of signals that may be used in a future vibration control scheme, and the power of ICA in extracting parasitic vibrations and de-noising (SNR = 20 dB). The paper is structured as follows. In Section 2 we present the Simulink model of the portainer's cabin; Section 3 summarizes ICA foundations and FastICA. Section 4 comprises the set of simulation results and the ICA analysis, which in fact are "the guts" of the paper; finally, conclusions are drawn in Section 5.
2 The Simulink Model

2.1 Model Equations
Fig. 2 shows a scheme of the complete crane structure, where we can see the cabin, whose dimensions are detailed in Fig. 3.
Fig. 2. Gantry crane model scheme
Fig. 3. Gantry crane cabin dimensions. Units in meters. Note where the mass center is and where it should be. Points 5-8 play a special role in the equations that model the dynamics.
The six degrees of freedom of the cabin are solved using the well-known Newton equations, applied to the mass center of the cabin, three of them for forces and the other three for torques, from Eq. (1) to Eq. (6), where all the variables and points are referred to Fig. 3.

$\sum_{i\in\{5,6,7,8\}} F_{i,x} = M\ddot{x}_{mc}, \quad F_{i,x} = C_{i,x}(x_{i,r} - x_{i,b}) \pm C_{i,xy}(y_{i,r} - y_{i,b}) + K_{i,x}(x_{i,r} - x_{i,b}) \pm K_{i,xy}(y_{i,r} - y_{i,b})$  (1)

$\sum_{i\in\{5,6,7,8\}} F_{i,y} = M\ddot{y}_{mc}, \quad F_{i,y} = C_{i,y}(y_{i,r} - y_{i,b}) \pm C_{i,xy}(x_{i,r} - x_{i,b}) + K_{i,y}(y_{i,r} - y_{i,b}) \pm K_{i,xy}(x_{i,r} - x_{i,b})$  (2)

$\sum_{i\in\{5,6,7,8\}} F_{i,z} = M\ddot{z}_{mc}, \quad F_{i,z} = C_{i,z}(z_{i,r} - z_{i,b}) + K_{i,z}(z_{i,r} - z_{i,b})$  (3)

$\sum_{i\in\{5,6,7,8\}} M_{i,x} = I_x\ddot{w}_{x,mc} - (I_y - I_z)w_{y,mc}w_{z,mc}, \quad \sum_{i\in\{5,6,7,8\}} M_{i,x} = \sum_{i\in\{5,6,7,8\}} (F_{i,z}d_{i,y} + F_{i,y}d_{i,z})$  (4)

$\sum_{i\in\{5,6,7,8\}} M_{i,y} = I_y\ddot{w}_{y,mc} - (I_z - I_x)w_{z,mc}w_{x,mc}, \quad \sum_{i\in\{5,6,7,8\}} M_{i,y} = \sum_{i\in\{5,6,7,8\}} (F_{i,z}d_{i,x} + F_{i,x}d_{i,z})$  (5)

$\sum_{i\in\{5,6,7,8\}} M_{i,z} = I_z\ddot{w}_{z,mc} - (I_x - I_y)w_{x,mc}w_{y,mc}, \quad \sum_{i\in\{5,6,7,8\}} M_{i,z} = \sum_{i\in\{5,6,7,8\}} (F_{i,x}d_{i,y} + F_{i,y}d_{i,x})$  (6)
Some remarks are to be made on this set of equations. The "±" refers to indices i = 5, 8, respectively; the "−" sign refers to i = 6, 7. Subindex "mc" refers to the mass center, "r" refers to the trolley and "b" to the cabin. "w" are angles, "F" forces, "M" torques, "I" inertias; "K" are for springs, "C" are for dampers; "d" symbolizes distances.
2.2 Simulink Scheme
The Simulink model solves and plots the displacements, velocities and accelerations of each one of the six degrees of freedom of the cabin. To do that the trolley movements and the system’s physical constants (mass, inertias, spring and damper values and mass center position) have to be considered. Fig. 4 presents a detail of the model, concretely the forces and torques solver block. The model is mainly divided into four blocks. The forces and torques solver block (Fig. 4) receives all the constants and positions of the system and solves every force and torque. The second block is the equations’ solver. It receives forces, torques, mass, inertias and angles to solve every acceleration of the mass center of the cabin. The third block converts accelerations into velocities and positions of the mass center, which are the outputs of the system. Finally the fourth block calculates positions and velocities of the four cabin-trolley connection points, using cabin and trolley positions and velocities; finally it connects them to the first block, so the new forces and torques may be calculated.
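Outside Simulink, the same kind of block can be reproduced with a standard ODE solver. The sketch below is a deliberately simplified one-degree-of-freedom analogue (a single vertical spring-damper between trolley and cabin, with illustrative constants, not the crane's identified parameters) driven by the 5 cm step input used in Section 4:

```python
import numpy as np
from scipy.integrate import solve_ivp

# One-degree-of-freedom analogue of the cabin model: a mass M suspended from
# the trolley through a spring K and damper C, driven by a 5 cm step in the
# trolley position (the load-release bump). Values are illustrative only.

M, K, C = 1000.0, 4.0e4, 2.0e3               # kg, N/m, N*s/m (assumed)
z_r = lambda t: 0.05 if t >= 1.0 else 0.0    # trolley step input, 1 s delay

def rhs(t, y):
    z, zdot = y                              # cabin position and velocity
    # M*zddot = K*(z_r - z) + C*(zdot_r - zdot); the step has zdot_r = 0
    zddot = (K * (z_r(t) - z) - C * zdot) / M
    return [zdot, zddot]

sol = solve_ivp(rhs, (0.0, 10.0), [0.0, 0.0], max_step=0.01)
print("peak displacement:", sol.y[0].max())  # overshoots past the 5 cm input
```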
3 The ICA Model and Algorithms

3.1 Outline of ICA
BSS by ICA is receiving attention because of its applications in many fields such as speech recognition, medicine and telecommunications [8]. Statistical
Fig. 4. Coarser representation of the Simulink model. Forces and torques solver block. This is valid only to get an approximate idea of the complex model.
methods in BSS are based on the probability distributions and the cumulants of the mixtures. The recovered signals (the source estimators) have to satisfy a condition which is modeled by a contrast function. The underlying assumptions are the mutual independence among sources and the non-singularity of the mixing matrix [6],[9]. Let $\mathbf{s}(t) = [s_1(t), s_2(t), \ldots, s_m(t)]^T$ be the transposed vector of sources (statistically independent). The mixture of the sources is modelled via

$\mathbf{x}(t) = \mathbf{A}\cdot\mathbf{s}(t)$  (7)

where $\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_m(t)]^T$ is the available vector of observations and $\mathbf{A} = [a_{ij}] \in \mathbb{R}^{m\times n}$ is the unknown mixing matrix, modelling the environment in which signals are mixed, transmitted and measured [10]. We assume that A is a non-singular n×n square matrix. The goal of ICA is to find a non-singular n×m separating matrix B that extracts the sources via

$\hat{\mathbf{s}}(t) = \mathbf{y}(t) = \mathbf{B}\cdot\mathbf{x}(t) = \mathbf{B}\cdot\mathbf{A}\cdot\mathbf{s}(t)$  (8)

where $\mathbf{y}(t) = [y_1(t), y_2(t), \ldots, y_m(t)]^T$ is an estimator of the sources. The separating matrix has a scaling freedom on each row because the relative amplitudes of sources in s(t) and columns of A are unknown [9]. The transfer matrix G ≡ BA relates the vector of independent (original) signals to its estimators.
3.2 FastICA
One of the independent components is estimated by $y = \mathbf{b}^T\mathbf{x}$. The goal of FastICA is to take the vector b that maximizes the non-Gaussianity (independence) of y, by finding the maxima of its negentropy [9]. The algorithm scheme is an approximative Newton iteration, resulting from the application of the Kuhn-Tucker conditions. This leads to Eq. (9):

$E\{\mathbf{x}g(\mathbf{b}^T\mathbf{x})\} - \beta\mathbf{b} = 0$  (9)
where g is a non-quadratic function and β is an iteration parameter. Provided with these mathematical foundations, the experimental results are now outlined.
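In practice, Eq. (9) is solved by the well-known one-unit fixed-point update b ← E{x g(bᵀx)} − E{g′(bᵀx)}b followed by renormalisation [9]. A minimal sketch on whitened data, using g = tanh (our implementation, not the authors' code), is:

```python
import numpy as np

# One-unit FastICA fixed-point iteration on whitened data X (n x T):
# b <- E{x g(b^T x)} - E{g'(b^T x)} b, then renormalise, with g = tanh.

def fastica_one_unit(X, n_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, T = X.shape
    b = rng.normal(size=n)
    b /= np.linalg.norm(b)
    for _ in range(n_iter):
        y = b @ X
        b_new = (X * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y) ** 2).mean() * b
        b_new /= np.linalg.norm(b_new)
        if np.abs(np.abs(b_new @ b) - 1) < tol:   # converged up to sign
            b = b_new
            break
        b = b_new
    return b
```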
4 Results
We present the set of results in graphical form due to their interest and novelty. We have introduced, in the simulated model, a real-life bias in the position of the mass center (A = 1 m, B = 1.35 m, C = 1 m), in order to assess the real cabin behavior. A delay of 1 sec is introduced to enhance the visualization of the graphs. The initial conditions are null for all the variables involved in the differential equations. A step-type input (5 cm amplitude) is chosen to assess the outputs of the system. This input emulates the behavior of the sudden bump in the trolley when the load is released into the container ship.

The left column in the matrix of signals in Fig. 5 shows all the signals involved, and the right column shows the ICA outputs. First of all we analyze the measured signals (left column). Regarding the X displacement of the mass center of the cabin, it can be seen that the system is not able to damp it adequately. This movement is produced by the horizontal bias; it has the peculiarity that the vertical bias of the mass center also affects this horizontal movement in a critical way. But we have to point out that the mere presence of a vertical bias is not enough to start this movement. Another coupling effect is the one produced by the vertical input, this time in the Y axis. The high frequency component of the signal is rapidly attenuated, while the low frequency component is not attenuated at all and remains as a parasitic vibration in the system. This fact has also been shown in the X direction. The displacement of the system in the Z-axis is the only one that behaves like a typical response to a step-like input. We must point out that the amplitude of the movement nearly doubles the input; so, an immediate conclusion is that the system's behavior is far from its original aim of isolating the cabin from the trolley vibrations. In other words, this movement has the peculiarity of not being fully damped. Finally, the rotations of the cabin are not attenuated. These rotations affect the X, Y, Z movements, and will not be extinguished due to the geometric disposition of the dampers.

This scheme fits a BSS scenario. The ICA results show the separation of the remaining oscillation (IC6). The noise is extracted in IC5 (an SNR = 20 dB is simulated). The z-angle movement is extracted in IC4. IC3 is the pure y-angle movement. IC2(1) is the y(x)-displacement without the parasitic vibration.
Fig. 5. Measured-simulated signals (left column: the x-, y- and z-displacements in m and the x-, y- and z-rotations in rad of the cabin, versus time instances in sec) and ICA outputs (Independent Components, ICs 1-6, right column).
5 Conclusions
We conclude that the system is not able to damp the cabin vibrations, as every real cabin mass center has a bias. Even with very high values of the damping constants, the time needed to attenuate the vibrations is high. Our direct real-life experience in those cabins shows that they continually work in a transitory vibration state, often leading the system to resonances. A real cabin prototype is being built to adapt our model to reality, and solutions to the vibration matter will be tested on it. The critical influence of the mass center position shown in this paper leads us to think that the cabin-trolley connection points must be placed around the real mass center position rather than on the top of the cabin, where the vertical distance to the mass center makes the system behave like a pendulum. ICA extracts the parasitic vibration and the uniform noise process added in the simulation, leading us to obtain the real mechanical vibration modes of the cabin's operator.
Acknowledgement We would like to acknowledge the Spanish Ministry of Education and Science for funding the projects DPI2003-00878 and PETRI-95-0824-OP, and to the Andalusian Government for funding the project TIC155. We would also like to acknowledge Mr. John Thonsem for the trust shown in the research group
PAI-TIC-168 from the University of Cádiz, which works in the MAERSK containers' terminal in the Algeciras harbor.
References
1. Ju, F., Choo, Y., Cui, F.: Dynamic response of tower crane induced by the pendulum motion of the payload. International Journal of Solids and Structures, 376–389 (2006)
2. Hua, Y.J., Shine, Y.K.: Adaptive coupling control for overhead crane systems. Mechatronics (in press, 2007)
3. Lee, D.H., Cao, Z., Meng, Q.: Scheduling of two-transtainer systems for loading outbound containers in port container terminals with simulated annealing algorithm. Int. J. Production Economics (in press, 2007)
4. Benhidjeb, A., Gissinger, G.: Fuzzy control of an overhead crane performance comparison with classic control. In: Proceedings of Control Eng. Practice, pp. 168–796 (1995)
5. Jianqiang, Y., Naoyoshi, Y., Kaoru, H.: Anti-swing and positioning control of overhead traveling crane. Inform. Sci., 19–42 (2003)
6. De la Rosa, J.J.G., Puntonet, C.G., Lloret, I.: An application of the independent component analysis to monitor acoustic emission signals generated by termite activity in wood. Measurement 37, 63–76 (2005) (available online, October 12, 2004)
7. Antoni, J.: Blind separation of vibration components: Principles and demonstrations. Mechanical Systems and Signal Processing 19, 1166–1180 (2005)
8. Puntonet, C.G., de la Rosa, J.J.G., Galiana, I.L., Sáez, J.M.G.: On the performance of a HOS-based ICA algorithm in BSS of acoustic emission signals. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 400–405. Springer, Heidelberg (2006)
9. Hyvärinen, A., Oja, E.: Independent Component Analysis: A Tutorial. Helsinki University of Technology, Laboratory of Computer and Information Science (1999)
10. De la Rosa, J.J.G., Puntonet, C.G., Górriz, J.M., Lloret, I.: An application of ICA to identify vibratory low-level signals generated by termites. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 1126–1133. Springer, Heidelberg (2004)
Blind Separation of Non-stationary Images Using Markov Models

Rima Guidara, Shahram Hosseini, and Yannick Deville

Laboratoire d'Astrophysique de Toulouse-Tarbes (LATT) - UMR 5572, Université Paul Sabatier Toulouse 3 - CNRS, 14 Avenue Edouard Belin, 31400 Toulouse, France
{rguidara,shosseini,ydeville}@ast.obs-mip.fr
Abstract. In recent works, we presented a blind image separation method based on a maximum likelihood approach, where we supposed the sources to be stationary, spatially autocorrelated and following Markov models. To make this method more adapted to real-world images, we here propose to extend it to non-stationary image separation. Two approaches, respectively based on blocking and kernel smoothing, are then used for the estimation of source score functions required for implementing the maximum likelihood approach, in order to allow them to vary within images. The performance of the proposed algorithm, tested on both artificial and real images, is compared to the stationary Markovian approach, and then to some classical blind source separation methods.
1 Introduction
In a previous work [1], we proposed a blind source separation method based on a maximum likelihood approach, where Markov processes are used to model temporal autocorrelations of stationary sources. This method exploits both non-Gaussianity and temporal autocorrelation of mutually independent sources and provides an asymptotically efficient estimator. We then extended this approach in [2] to bi-dimensional sources, where second-order Markov Random Fields (MRFs) were used to describe the spatial autocorrelation within each source image. The likelihood function was maximized using a modified equivariant Newton-Raphson algorithm, and a parametric polynomial estimator was introduced in order to reduce the computational cost of the conditional score function estimation required for implementing the method. In [2], we supposed the source images to be stationary, so that their statistics did not vary over the image pixels. This hypothesis is, however, not realistic for the majority of real-world images, so that the algorithm sometimes fails to separate highly mixed images. In this paper, we propose to modify the method presented in [2] so that the Probability Density Functions (PDFs) of the pixels may vary within each source image. As a result, the images may be non-stationary and the proposed method can then simultaneously exploit non-Gaussianity, non-stationarity and spatial autocorrelation in a quasi-optimal manner.
The non-stationarity of the sources has already been exploited in the literature [3,4,5,6,7,8] for blind source separation. Nevertheless, most of these methods do not exploit the non-Gaussianity and/or the autocorrelation of the sources and are usually based on variance non-stationarity while our method can also exploit higher-order non-stationarities.
2 Markovian Separation Method
In this paper, we consider the blind image separation problem in its simplest form, where the observations are linear instantaneous mixtures of source images. In a noiseless and determined context, this problem can be formulated as follows. Having N = N1 × N2 pixels of K linear transforms of K source images, we consider the mixture model $\mathbf{x}(n_1, n_2) = \mathbf{A}\mathbf{s}(n_1, n_2)$, where $\mathbf{x}(n_1, n_2)$ and $\mathbf{s}(n_1, n_2)$ are, respectively, the K-dimensional observation and source vectors, and A is an unknown K × K invertible mixing matrix. The Maximum Likelihood (ML) approach can be used to estimate the separating matrix $\mathbf{B} = \mathbf{A}^{-1}$ up to a diagonal matrix and a permutation matrix. It consists in maximizing, with respect to the matrix B, the joint PDF of all the pixels of all the images in the observation vector x:

$f_x(x_1(1,1), \cdots, x_K(1,1), \cdots, x_1(N_1,N_2), \cdots, x_K(N_1,N_2))$.  (1)
Assuming source images are independent and described by second-order MRFs according to the sweeping scheme defined in [2], this joint PDF can be approximated by
$\frac{1}{|\det(\mathbf{B}^{-1})|^N} \prod_{i=1}^{K} \prod_{n_1=2}^{N_1} \prod_{n_2=2}^{N_2-1} f_{s_i(n_1,n_2)}\big(s_i(n_1,n_2)\,|\,s_i(n_1,n_2-1), s_i(n_1-1,n_2+1), s_i(n_1-1,n_2), s_i(n_1-1,n_2-1)\big)$.  (2)
Defining then the conditional score function $\psi^{k,l}_{s_i(n_1,n_2)}$ of a source $s_i$, with respect to the pixel $s_i(n_1-k, n_2-l)$, by

$\psi^{k,l}_{s_i(n_1,n_2)}(n_1,n_2) = \frac{-\partial \log f_{s_i(n_1,n_2)}(s_i(n_1,n_2)|s_i(n_1,n_2-1), s_i(n_1-1,n_2+1), s_i(n_1-1,n_2), s_i(n_1-1,n_2-1))}{\partial s_i(n_1-k, n_2-l)}$  (3)

the maximization of the logarithm of (2) finally leads to a system of K(K−1) estimating equations defined as

$\sum_{(k,l)\in\Upsilon} E_N\big[\psi^{k,l}_{s_i(n_1,n_2)}(n_1,n_2)\cdot s_j(n_1-k, n_2-l)\big] = 0 \qquad i \neq j = 1,\cdots,K$  (4)

where Υ = {(0,0), (0,1), (1,−1), (1,0), (1,1)} corresponds to the considered sweeping scheme over the image [2] and $E_N[.]$ is a spatial average operator defined by $E_N[.] = \frac{1}{N}\sum_{n_1=2}^{N_1}\sum_{n_2=2}^{N_2-1}[.]$.
In practice, the sources $s_i$, actually unknown, are replaced using an iterative algorithm by the estimated sources $\hat{s}_i(n_1,n_2) = \mathbf{e}_i^T \hat{\mathbf{B}}\mathbf{x}(n_1,n_2)$, where $\mathbf{e}_i$ is the i-th column of the identity matrix. The separating matrix B may be estimated via the resolution of the system of equations (4), using for example the modified equivariant version of the Newton-Raphson algorithm presented in [2]. The score functions and their derivatives, required for the computation of the system coefficients, may be estimated using a parametric polynomial estimator. In [2], we supposed that the source images were stationary, so that the score functions $\psi^{k,l}_{s_i(n_1,n_2)}(.)$ did not vary with $n_1$ and $n_2$, i.e. reduced to $\psi^{k,l}_{s_i}(.)$. However, this condition being unrealistic for most real-world images, we propose, in this paper, to extend this approach to non-stationary sources, allowing the statistics to be dependent on the pixel position. Two methods, based respectively on blocking and kernel smoothing, are introduced in the following section in order to adapt the score function estimation to non-stationary images.
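As an illustration of how the estimating equations (4) can be evaluated on estimated source images, the following sketch (our names; the conditional score images ψ^{k,l} are assumed to be supplied by the parametric estimator of Section 3) computes the left-hand side of (4) for one pair (i, j):

```python
import numpy as np

UPSILON = [(0, 0), (0, 1), (1, -1), (1, 0), (1, 1)]   # sweeping offsets

def estimating_equation(s_i, s_j, psi):
    # s_i, s_j: (N1, N2) estimated source images; psi: dict mapping each
    # offset (k, l) to the (N1, N2) array of conditional score values of
    # source i. Returns sum_{(k,l)} E_N[psi^{k,l} * s_j(n1-k, n2-l)],
    # which should vanish at the separating solution for i != j.
    N1, N2 = s_i.shape
    total = 0.0
    for (k, l) in UPSILON:
        p = psi[(k, l)][1:N1, 1:N2 - 1]            # interior pixels (0-based)
        s = s_j[1 - k:N1 - k, 1 - l:N2 - 1 - l]    # shifted neighbour pixels
        total += (p * s).sum()
    return total / s_i.size                        # the spatial average E_N
```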
3 Non-stationary Estimation of the Score Functions
For simplicity, we denote the conditional score functions $\psi^{k,l}_{s_i(n_1,n_2)}\big(s_i(n_1,n_2) \mid s_i(n_1,n_2-1), s_i(n_1-1,n_2+1), s_i(n_1-1,n_2), s_i(n_1-1,n_2-1)\big)$ by $\psi^{k,l}_{s_i(n)}(\xi_0 \mid \xi_1,\dots,\xi_4)$. This function can be rewritten as follows:

$$\psi^{k,l}_{s_i(n)}(\xi_0 \mid \xi_1,\dots,\xi_4) = \psi^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4) - \psi^{k,l}_{s_i(n)}(\xi_1,\dots,\xi_4). \tag{5}$$
We propose to estimate each of the non-stationary joint score functions in (5) using a parametric third-order polynomial estimator. The polynomial order is chosen so that the resulting approximation induces a low computational cost without decreasing the estimation performance. Thus, assuming $g^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4,\mathbf{W})$ is a polynomial estimator of the joint score function $\psi^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4)$, we can define this estimator by

$$g^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4,\mathbf{W}) = \sum_j w^{k,l}_j(s_i(n))\, h_j(\xi_0,\dots,\xi_4) = \mathbf{h}^T\mathbf{W}^{k,l}(s_i(n)),$$
where $h_j(\xi_0,\dots,\xi_4)$ and $w^{k,l}_j(s_i(n))$ are, respectively, the monomial functions and the coefficients. The polynomial coefficients $w^{k,l}_j(s_i(n))$ should be selected so that $g^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4,\mathbf{W})$ is the least mean-square estimator of the joint score function $\psi^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4)$. Consequently, the $w^{k,l}_j(s_i(n))$ are solutions of the following optimization problem:

$$\mathbf{W}^{k,l}(s_i(n)) = \arg\min E\big\{[\psi^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4) - g^{k,l}_{s_i(n)}(\xi_0,\dots,\xi_4,\mathbf{W})]^2\big\}. \tag{6}$$
Using the theorem mentioned in Section 4 of [2], we obtain

$$\mathbf{W}^{k,l}(s_i(n)) = \arg\min\left( E\big\{\mathbf{h}^T\mathbf{W}^{k,l}(s_i(n))[\mathbf{W}^{k,l}(s_i(n))]^T\mathbf{h}\big\} - 2\,E\left\{\frac{\partial\mathbf{h}^T}{\partial\xi^{k,l}}\right\}\mathbf{W}^{k,l}(s_i(n))\right),$$
where $\xi^{k,l} \in \{\xi_0,\dots,\xi_4\}$ represents $s_i(n_1-k, n_2-l)$ according to (3) and (5). The minimum of the above function is finally given by

$$\mathbf{W}^{k,l}(s_i(n)) = \left(E\big[\mathbf{h}\,\mathbf{h}^T\big]\right)^{-1} E\left[\frac{\partial\mathbf{h}}{\partial\xi^{k,l}}\right]. \tag{7}$$
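A minimal sketch of the least mean-square fit (7), assuming a dictionary of all monomials of degrees one to three in (ξ0, …, ξ4); the function names and the feature layout are hypothetical, introduced only for illustration.

import numpy as np
from itertools import combinations_with_replacement

def monomial_features(X, deriv_col):
    # X: (n_samples, 5) array whose columns are xi_0, ..., xi_4. Returns the
    # monomials h_j of degree 1..3 and their derivatives w.r.t. xi_{deriv_col}.
    n, d = X.shape
    H, dH = [], []
    for deg in (1, 2, 3):
        for combo in combinations_with_replacement(range(d), deg):
            cols = list(combo)
            H.append(X[:, cols].prod(axis=1))
            cnt = cols.count(deriv_col)
            if cnt == 0:
                dH.append(np.zeros(n))
            else:
                rest = cols.copy()
                rest.remove(deriv_col)
                dH.append(cnt * X[:, rest].prod(axis=1))
    return np.column_stack(H), np.column_stack(dH)

def score_poly_coeffs(X, deriv_col):
    # Eq. (7): W^{k,l} = (E[h h^T])^{-1} E[dh/dxi^{k,l}], with sample means.
    H, dH = monomial_features(X, deriv_col)
    return np.linalg.solve(H.T @ H / len(X), dH.mean(axis=0))

In the stationary case of [2], the sample means run over all pixels of the image; the two methods below only change which pixels enter these means.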
A parametric estimator for the second joint score function $\psi^{k,l}_{s_i(n)}(\xi_1,\dots,\xi_4)$ may be obtained following the same calculation, or simply deduced from the first estimator by regression. In the initial version of this estimation procedure [2], we supposed that the sources were stationary. As a consequence, the mathematical expectations in Eq. (7) were replaced by a spatial average over all the pixels in the image. To extend this estimation to non-stationary sources, two methods, proposed initially in [8] for non-stationary, temporally uncorrelated one-dimensional sources, are here adapted to model the statistics of non-stationary, spatially autocorrelated images.

Blocking method: Each observed image is split into $M_1 \times M_2$ sub-images $I_j$, supposing that the score function variations within each block $I_j$ are small. Each of these sub-images may then be supposed stationary, so that the mathematical expectations required in Eq. (7) can be locally approximated by a spatial average over each sub-image. The coefficients of the parametric estimator of the score functions being unchanged within $I_j$, we can write

$$\psi^{k,l}_{s_i(n_1,n_2)} = \psi^{k,l}_{s_i}(j), \qquad \forall\, (n_1,n_2) \in I_j.$$
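A sketch of the blocking scheme in Python (names illustrative; fit_fn stands for any per-block coefficient estimator, such as the polynomial fit above):

import numpy as np

def block_score_coeffs(source, M1, M2, fit_fn):
    # Split the image into M1 x M2 sub-images and fit one set of score
    # coefficients per block; stationarity is assumed to hold locally.
    N1, N2 = source.shape
    h, w = N1 // M1, N2 // M2
    return {(p, q): fit_fn(source[p * h:(p + 1) * h, q * w:(q + 1) * w])
            for p in range(M1) for q in range(M2)}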
Kernel smoothing method: In a kernel smoothing approach, the score functions are approximated at each pixel of the image by a local spatial average around this pixel. Using a general notation, we denote by $E = E\big(\phi(\xi_0(n_1,n_2),\dots,\xi_4(n_1,n_2))\big)$ the mathematical expectations required for estimating the score function parameters at pixel $s_i(n_1,n_2)$, where $(\xi_0(n_1,n_2),\dots,\xi_4(n_1,n_2))$ represents $(s_i(n_1,n_2), s_i(n_1,n_2-1), s_i(n_1-1,n_2+1), s_i(n_1-1,n_2), s_i(n_1-1,n_2-1))$ and $\phi(\cdot)$ each of the functions involved in Eq. (7). Contrary to the blocking method, these expectations are estimated at each pixel using the formula

$$\hat{E} = \frac{\sum_{\mu_1=2}^{N_1-1}\sum_{\mu_2=2}^{N_2-1}\kappa\big(\frac{\mu_1-n_1}{\nu}, \frac{\mu_2-n_2}{\nu}\big)\,\phi\big(\xi_0(\mu_1,\mu_2),\dots,\xi_4(\mu_1,\mu_2)\big)}{\sum_{\mu_1=2}^{N_1-1}\sum_{\mu_2=2}^{N_2-1}\kappa\big(\frac{\mu_1-n_1}{\nu}, \frac{\mu_2-n_2}{\nu}\big)}, \tag{8}$$
where $\kappa(\cdot)$ is a kernel function and $\nu$ a window width parameter. The kernel smoothing method should be advantageous in cases where the score functions vary rapidly within the image. In fact, in such cases, the size of the nearly stationary blocks is so small that there are not enough pixels for a reliable estimation of the score functions using the blocking method. However, the kernel smoothing method is computationally expensive, since a new estimation is required at each pixel. To reduce the algorithm complexity, we can approximate
the estimator (8) by a sparser one, according to the formula

$$\hat{E} = \frac{\sum_{l_1=l_{11}}^{L_1}\sum_{l_2=l_{22}}^{L_2}\kappa\Big(\frac{l_1 Q_1/L_1 - n_1}{\nu}, \frac{l_2 Q_2/L_2 - n_2}{\nu}\Big)\,\phi\big(\xi_0(\tfrac{l_1 Q_1}{L_1}, \tfrac{l_2 Q_2}{L_2}),\dots,\xi_4(\tfrac{l_1 Q_1}{L_1}, \tfrac{l_2 Q_2}{L_2})\big)}{\sum_{l_1=l_{11}}^{L_1}\sum_{l_2=l_{22}}^{L_2}\kappa\Big(\frac{l_1 Q_1/L_1 - n_1}{\nu}, \frac{l_2 Q_2/L_2 - n_2}{\nu}\Big)},$$

where $Q_1 = N_1 - 1$, $Q_2 = N_2 - 1$, $L_1$ and $L_2$ are chosen so that $\frac{Q_1}{L_1}$ and $\frac{Q_2}{L_2}$ are integers, and $l_{11}$ and $l_{22}$ are the first integers greater than $\frac{2L_1}{Q_1}$ and $\frac{2L_2}{Q_2}$, respectively. The choice of the sparseness parameters $L_1$ and $L_2$ should be adapted to the smoothness of the signal.
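A hedged sketch of the kernel-smoothed expectation (8) with a Gaussian kernel; the sparse variant simply evaluates the same quantity only at the L1 × L2 grid points l1 Q1/L1, l2 Q2/L2 instead of at every pixel.

import numpy as np

def kernel_expectation(phi_img, n1, n2, nu):
    # Eq. (8) at pixel (n1, n2); phi_img[m1, m2] holds
    # phi(xi_0(m1, m2), ..., xi_4(m1, m2)), 0-based indexing.
    N1, N2 = phi_img.shape
    m1, m2 = np.meshgrid(np.arange(1, N1 - 1), np.arange(1, N2 - 1),
                         indexing="ij")
    w = np.exp(-0.5 * (((m1 - n1) / nu) ** 2 + ((m2 - n2) / nu) ** 2))
    return (w * phi_img[1:N1 - 1, 1:N2 - 1]).sum() / w.sum()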
4 Experimental Results

4.1 Artificial Images
In our first simulations, we want to compare the proposed non-stationary Markovian blocking method to the stationary Markovian image separation approach presented in [2]. In the first experiment, we apply our blocking method to two artificial mixtures of two non-stationary source images following exactly second-order MRF models. Two independent, white, uniformly distributed noise images of size 200 × 200, $e_1(n_1,n_2)$ and $e_2(n_1,n_2)$, are therefore generated and filtered by two Infinite Impulse Response (IIR) filters, according to the following scheme:

$$s_i(n_1,n_2) = e_i(n_1,n_2) + \rho^i_{0,1}s_i(n_1,n_2-1) + \rho^i_{1,-1}s_i(n_1-1,n_2+1) + \rho^i_{1,0}s_i(n_1-1,n_2) + \rho^i_{1,1}s_i(n_1-1,n_2-1). \tag{9}$$
The coefficients $\rho^i$ of the first and second filters are fixed to $\{-0.5, 0.3, 0.5, -0.29\}$ and $\{-0.5, 0.4, 0.5, 0.3\}$, respectively, and satisfy the filter stability conditions. The resulting images are then split into $L_1 \times L_2$ sub-images, and each sub-image is multiplied by a different coefficient $\alpha_{ip}$, $p = 1,\dots,L_1 \times L_2$. The resulting autocorrelated, non-stationary sources are finally mixed using the mixing matrix $\mathbf{A} = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 1 \end{pmatrix}$. The Markovian non-stationary blocking method is then used to separate the sources. After normalizing the separated sources $\hat{s}_i(n_1,n_2)$ so that they have the same variances and signs as the source signals $s_i(n_1,n_2)$, the output Signal-to-Interference Ratio (in dB) is computed using the formula

$$\mathrm{SIR} = \frac{1}{K}\sum_{i=1}^{K} 10\log_{10}\frac{E[s_i^2(n_1,n_2)]}{E[(\hat{s}_i(n_1,n_2) - s_i(n_1,n_2))^2]},$$

where $K = 2$ in the above experiment. Choosing $L_1 = L_2 = 4$, we computed the mean SIR over 100 Monte Carlo simulations for different values of the algorithm parameters $M_1$ and $M_2$. The results are shown in Fig. 1 as a function of the number of sub-images ($M_1 \times M_2$), with $M_1 = M_2 = M$.
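The data generation of Eq. (9) and the SIR measure are easy to reproduce. The following sketch is our own, with the coefficient layout rho = (ρ_{0,1}, ρ_{1,-1}, ρ_{1,0}, ρ_{1,1}); it pads the image with one zero row and two zero columns to handle the borders.

import numpy as np
rng = np.random.default_rng(0)

def gen_mrf_source(N1, N2, rho):
    # Eq. (9): causal second-order MRF driven by uniform white noise.
    e = rng.uniform(-1, 1, (N1 + 1, N2 + 2))
    s = np.zeros_like(e)
    for n1 in range(1, N1 + 1):
        for n2 in range(1, N2 + 1):
            s[n1, n2] = (e[n1, n2] + rho[0] * s[n1, n2 - 1]
                         + rho[1] * s[n1 - 1, n2 + 1]
                         + rho[2] * s[n1 - 1, n2]
                         + rho[3] * s[n1 - 1, n2 - 1])
    return s[1:, 1:N2 + 1]

def blockwise_scale(s, alphas):
    # Multiply each of the L1 x L2 sub-images by its own coefficient alpha.
    L1, L2 = alphas.shape
    N1, N2 = s.shape
    out = s.copy()
    for p in range(L1):
        for q in range(L2):
            out[p * N1 // L1:(p + 1) * N1 // L1,
                q * N2 // L2:(q + 1) * N2 // L2] *= alphas[p, q]
    return out

def sir_db(s, s_hat):
    # Output SIR of one source, after scale/sign normalization of s_hat.
    return 10 * np.log10((s ** 2).mean() / ((s_hat - s) ** 2).mean())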
The one-block case, i.e., $M_1 = M_2 = 1$, corresponding to the stationary Markovian method, led only to an average SIR of 22 dB. The separation performance increases rapidly when we use a greater number of blocks, and it reaches its maximum, with an average SIR of 99 dB, when the model uses the same number of blocks as in the source images, i.e., $M_1 = L_1$ and $M_2 = L_2$. Moreover, it can be noticed that over-blocking the image does not have a great effect on the separation result, provided that the number of pixels in each sub-image remains sufficient for the score function estimation. In comparison to the stationary Markovian algorithm, the blocking approach significantly reduces both memory and time consumption. In the above experiment, the stationary Markovian algorithm needs 192 Mbytes of memory to estimate the score functions, while the blocking algorithm with $M_1 = M_2 = 4$ needs less than 12 Mbytes. The time reduction induced by the blocking method is significant for large images: for example, using mixtures of 400 × 400 non-stationary images, the running times per iteration of the blocking algorithm with $M_1 = M_2 = 4$ and of the stationary Markovian algorithm on a 1.53 GHz AMD Athlon PC are 48 seconds and 1647 seconds, respectively.

Fig. 1. Mean SIR vs. the number of sub-images ($M_1 \times M_2 = M^2$) using IIR-filtered sources

Fig. 2. Mean SIR vs. the number of sub-images ($M_1 \times M_2 = M^2$) using FIR-filtered sources
In the second experiment, we want to test the robustness of our blocking method with respect to the Markov model assumption. As in the first simulation, two noise images of size 200 × 200 are generated, but filtered this time by two symmetrical bidimensional Finite Impulse Response (FIR) filters. Thus, the filtered images can no longer be modeled exactly by second-order MRFs. The resulting images are then split into $L^2 = 16$ square blocks, so that $L_1 = L_2 = 4$, and each sub-image is multiplied by a different coefficient $\alpha_{ip}$. Finally, the same mixing matrix $\mathbf{A}$ as in the first experiment is used, and the average SIR over 100 Monte Carlo simulations is computed and shown in Fig. 2 as a function of the number of sub-images $M_1 \times M_2 = M^2$. The results clearly show the high performance of our algorithm, even when the Markov model is not satisfied. The average SIR achieved by the stationary Markovian algorithm is only 27 dB, whereas the blocking algorithm leads to 140 dB for $M^2 = 16$ sub-images.
In the third simulation, our goal is to highlight the advantage of the kernel smoothing method over the blocking method for images whose statistics vary rapidly. Since the kernel smoothing algorithm is very time-consuming, we can only use it for small images. Two images of size 32 × 32, shown in Fig. 3, are artificially mixed using the same matrix $\mathbf{A}$ as in the first experiment. The kernel smoothing method, using a Gaussian kernel, is then applied to the mixture, and the results are compared to those achieved by the blocking method. The kernel smoothing method, with a kernel standard deviation $\sigma = 10$, leads to a 57-dB SIR, whereas the blocking method completely fails to separate the sources in this case because they are too highly non-stationary.
Fig. 3. Two small-sized images with highly non-stationary variations
Fig. 4. Two photographic real-world images
4.2 Real-World Images
In the last experiment, we want to evaluate the performance of our blocking algorithm using real-world images. Two photographic images, provided by [9], are artificially mixed using the matrix $\mathbf{A} = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 1 \end{pmatrix}$. The two images, shown in Fig. 4, are 320 × 420-sized and clearly non-stationary. The blocking method, using different values for the number of blocks, is first applied, and the resulting SIR is compared to the SIR achieved by the stationary Markovian method. Provided we select an adequate number of sub-images, which corresponds to $M_1 = M_2 = 10$ in this case, the blocking method leads to nearly 60 dB SIR, whereas the separation completely fails with the stationary Markovian algorithm. In a second step, we compared our blocking method to the 15 classical algorithms available in the ICALAB Toolbox [10,11]. Our method outperforms all 15 algorithms, which led to 37 dB SIR at best, with the SOBI-RO method.
5 Conclusion
In this paper, we extended our Markovian maximum likelihood approach for blind image separation to non-stationary source images. The proposed algorithms simultaneously exploit the non-Gaussianity, spatial autocorrelation and non-stationarity of the sources in a quasi-optimal manner. To handle non-stationarity, we adapt two approaches to our problem, based respectively on blocking and kernel smoothing. Experimental results, using both artificial and real images, clearly show the better performance of our blocking method compared to the stationary Markovian algorithm and to the classical algorithms available in the ICALAB Toolbox. Moreover, tests demonstrated the high performance of our kernel smoothing method, even with highly non-stationary images. However, this method can only be applied to small images, since it is very time-consuming. Therefore, we are currently working on reducing its computational cost.
References

1. Hosseini, S., Jutten, C., Pham, D.-T.: Markovian source separation. IEEE Trans. on Signal Processing 51, 3009–3019 (2003)
2. Hosseini, S., Guidara, R., Deville, Y., Jutten, C.: Markovian Blind Image Separation. In: Rosca, J., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 106–114. Springer, Heidelberg (2006)
3. Matsuoka, K., Ohya, M., Kawamoto, M.: A neural net for blind separation of nonstationary signals. Neural Networks 8(3), 411–419 (1995)
4. Souloumiac, A.: Blind source detection and separation using second-order nonstationarity. In: Proc. ICASSP, pp. 1912–1915 (1995)
5. Choi, S., Cichocki, A., Belouchrani, A.: Second order non-stationary source separation. Journal of VLSI Signal Processing 32(1-2), 93–104 (2002)
6. Hyvärinen, A.: Blind source separation by non-stationarity of variance: a cumulant based approach. IEEE Trans. on Neural Networks 12(6), 1471–1474 (2001)
7. Pham, D.-T., Cardoso, J.-F.: Blind separation of independent mixtures of non-stationary sources. IEEE Trans. on Signal Processing 49(9) (2001)
8. Pham, D.-T.: Blind separation of non-stationary non-Gaussian sources. In: Proc. European Signal Processing Conference (EUSIPCO 2002), Toulouse, France (September 2002)
9. Web site of ICA Central: http://www.tsi.enst.fr/icacentral/base_multi.html
10. Cichocki, A., Amari, S., Siwek, K., Tanaka, T., et al.: ICALAB Toolboxes, http://www.bsp.brain.riken.jp/ICALAB
11. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. Wiley, Chichester (2003)
Blind Instantaneous Noisy Mixture Separation with Best Interference-Plus-Noise Rejection

Zbyněk Koldovský¹,² and Petr Tichavský¹

¹ Institute of Information Theory and Automation, Pod vodárenskou věží 4, P.O. Box 18, 182 08 Praha 8, Czech Republic
[email protected]
² Faculty of Mechatronics and Interdisciplinary Studies, Technical University of Liberec, Hálkova 6, 461 17 Liberec, Czech Republic
Abstract. In this paper, a variant of the well-known algorithm FastICA is proposed for blind source separation in an off-line (block processing) setup and a noisy environment. The algorithm combines a symmetric FastICA with a test of saddle points to achieve fast global convergence, and a one-unit refinement to obtain a high noise rejection ability. A novel test of saddle points is designed for the separation of complex-valued signals. The bias of the proposed algorithm due to additive noise can be shown to be asymptotically proportional to σ³ for small σ, where σ² is the variance of the additive noise. Since the bias of the other methods (namely the bias of all methods using the orthogonality constraint, and even of the recently proposed algorithm EFICA) is asymptotically proportional to σ², the proposed method usually has a lower bias, and consequently it exhibits a lower symbol error rate when applied to blind separation of finite-alphabet signals, typical for communication systems.
1 Introduction

The noisy model of Independent Component Analysis (ICA) considered in this paper is

$$\mathbf{X} = \mathbf{A}\mathbf{S} + \sigma\mathbf{N}, \tag{1}$$

where $\mathbf{S}$ denotes a vector of $d$ independent random variables representing the original signals, $\mathbf{A}$ is an unknown regular $d \times d$ mixing matrix, and $\mathbf{X}$ represents the observed mixed signals. The noise $\mathbf{N}$ denotes a vector of independent variables having the covariance matrix $\boldsymbol{\Sigma}$. Without loss of generality, we will further assume that $\boldsymbol{\Sigma}$ equals the identity matrix $\mathbf{I}$. Consequently, $\sigma^2$ is the variance of the noise added to the mixed signals. All signals considered here are i.i.d. sequences, i.e., they are assumed to be white in the analysis. It is characteristic of most ICA methods that they were derived for the noiseless case, in order to solve the task of estimating the mixing matrix $\mathbf{A}$ or its inversion
This work was supported by Ministry of Education, Youth and Sports of the Czech Republic through the project 1M0572 and through the Grant 102/07/P384 of the Grant Agency of the Czech Republic.
$\mathbf{W} = \mathbf{A}^{-1}$. The ability to separate noisy data is then studied experimentally, and the non-vanishing estimation error as $N \to +\infty$, $N$ being the length of the data, is taken for a bias caused by the noise. To compensate for this bias, several techniques were proposed [6]. Unfortunately, these methods have the drawback that the covariance structure of the noise needs to be known a priori [4]. In accord with [7], we suggest measuring the separation quality not through the accuracy of estimation of the mixing mechanism but through the achieved interference-plus-noise-to-signal ratio (INSR) or its inverse SINR. In separating finite-alphabet signals, the ultimate criterion should be the symbol error rate (SER). Computation of the INSR or the SER assumes that the permutation, scale, and sign or phase ambiguities were resolved by minimizing the INSR. In the case $\boldsymbol{\Sigma} = \mathbf{I}$, the INSR of the $k$-th estimated signal can be computed as

$$\mathrm{INSR}_k = \frac{\sum_{i\neq k}(\mathbf{BA})_{ki}^2 + \sigma^2\sum_{i=1}^{d}\mathbf{B}_{ki}^2}{(\mathbf{BA})_{kk}^2}, \tag{2}$$

where $\mathbf{B}$ is the separating transformation [7]. The solutions that minimize (2) are known to be given by the MMSE separating matrix, denoted by $\mathbf{W}^{\mathrm{MMSE}}$, which takes the form

$$\mathbf{W}^{\mathrm{MMSE}} = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H + \sigma^2\mathbf{I})^{-1}, \tag{3}$$

where $^H$ denotes the conjugate (Hermitian) transpose. Signals given by $\mathbf{W}^{\mathrm{MMSE}}\mathbf{X}$ will further be called the MMSE solution. Note that these signals may not necessarily be normalized to unit variance, unlike the outcome of common blind separation methods, which produce normalized components. For exact comparisons, we introduce a matrix $\mathbf{W}^{\mathrm{NMMSE}}$ such that $\mathbf{W}^{\mathrm{NMMSE}}\mathbf{X}$ are the normalized MMSE signals.

The paper is organized as follows. In Section 2, we briefly describe several variants of the FastICA algorithm and the proposed method, including a novel test of saddle points for separating complex-valued signals. Section 3 presents analytic expressions for the asymptotic bias of solutions obtained by the real-domain FastICA variants [5,8] from the MMSE solution. Specifically, we study the biases of estimates of the de-mixing matrix $\mathbf{W}$ from $\mathbf{W}^{\mathrm{NMMSE}}$, and the one-unit FastICA and the proposed algorithm are shown to be less biased than the other methods. Simulations in Section 4 demonstrate drawbacks of the unbiased algorithm [6] (further referred to as unbiased FastICA) following from the required knowledge of $\boldsymbol{\Sigma}$ and/or $\sigma$. Conversely, the proposed algorithm, with one-unit FastICA-like performance, is shown to be the best blind MMSE estimator when separating noisy finite-alphabet signals.
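The two quantities above translate directly into code; a minimal sketch (ours, not the authors'), with the INSR written for the real-valued case:

import numpy as np

def insr(B, A, sigma):
    # Eq. (2): interference-plus-noise-to-signal ratio of each output of B.
    G = B @ A
    num = (G ** 2).sum(axis=1) - np.diag(G) ** 2 \
          + sigma ** 2 * (B ** 2).sum(axis=1)
    return num / np.diag(G) ** 2

def w_mmse(A, sigma):
    # Eq. (3): MMSE separating matrix A^H (A A^H + sigma^2 I)^{-1}.
    d = A.shape[0]
    return A.conj().T @ np.linalg.inv(A @ A.conj().T + sigma ** 2 * np.eye(d))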
2 FastICA and Its Variants

Common FastICA algorithms work with the decorrelated data $\mathbf{Z} = \mathbf{C}^{-1/2}\mathbf{X}$, where $\mathbf{C} = E[\mathbf{X}\mathbf{X}^H]$ is the data covariance matrix. Only the unbiased FastICA [6], which aims at an unbiased estimate of $\mathbf{A}^{-1}$ assuming that the noise has a known covariance matrix $\boldsymbol{\Sigma}$, uses the preprocessing $\mathbf{Z} = (\mathbf{C} - \boldsymbol{\Sigma})^{-1/2}\mathbf{X}$.
One-unit FastICA in the real domain [5] estimates one de-mixing vector $\mathbf{w}_k^{1U}$ iteratively via the recursion

$$\mathbf{w}_k^+ \leftarrow E[\mathbf{Z}\,g((\mathbf{w}_k^{1U})^T\mathbf{Z})] - \mathbf{w}_k^{1U} E[g'((\mathbf{w}_k^{1U})^T\mathbf{Z})], \qquad \mathbf{w}_k^{1U} \leftarrow \mathbf{w}_k^+ / \|\mathbf{w}_k^+\| \tag{4}$$
until convergence is achieved. Here $g(\cdot)$ is a smooth nonlinear function that approximates/surrogates the score function corresponding to the distribution of the original signals [11]. The theoretical expectations in (4) are, in practice, replaced by their sample-based counterparts. A similar recursion was proposed for one-unit FastICA in the complex domain [1]. The symmetric (real or complex) variant performs the one-unit iterations in parallel for all $d$ separating vectors, but the normalization in (4) is replaced by a symmetric orthogonalization. The algorithm EFICA [8] combines the symmetric approach with the test of saddle points and an adaptive choice of nonlinearity $g_k(\cdot)$ for each signal separately, and it performs a refinement step that relaxes the orthogonal constraint introduced by the symmetric approach and is designed towards asymptotic efficiency. The unbiased FastICA [6] uses the recursion

$$\mathbf{w}_k^+ \leftarrow E[\mathbf{Z}\,g((\mathbf{w}_k^{\mathrm{unb}})^T\mathbf{Z})] - (\mathbf{I} + \tilde{\boldsymbol{\Sigma}})\mathbf{w}_k^{\mathrm{unb}}\, E[g'((\mathbf{w}_k^{\mathrm{unb}})^T\mathbf{Z})],$$

where $\tilde{\boldsymbol{\Sigma}} = (\mathbf{C} - \boldsymbol{\Sigma})^{-1/2}\boldsymbol{\Sigma}(\mathbf{C} - \boldsymbol{\Sigma})^{-1/2}$. Both approaches (one-unit and symmetric) can be considered; in simulations, we use the one-unit variant, and the resulting de-mixing matrix will be denoted by $\mathbf{W}^{\mathrm{UNB}}$. In order to fairly compare the performance of the unbiased FastICA with the other techniques by means of (2), it is necessary to consider an MMSE estimate derived from $\mathbf{W}^{\mathrm{UNB}}$, namely

$$\mathbf{W}^{\text{MMSE-UNB}} = \boldsymbol{\Sigma}^{-1}(\mathbf{W}^{\mathrm{UNB}})^{-T}\big[(\mathbf{W}^{\mathrm{UNB}})^{-1}\boldsymbol{\Sigma}^{-1}(\mathbf{W}^{\mathrm{UNB}})^{-T} + \sigma^2\mathbf{I}\big]^{-1}. \tag{5}$$
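As a concrete reference for the recursions above, here is a minimal sketch of the plain one-unit iteration (4) on whitened data; g and g_prime are the nonlinearity and its derivative (e.g., tanh and 1 - tanh²), and the stopping rule is our own choice.

import numpy as np

def one_unit_fastica(Z, w, g, g_prime, n_iter=100, tol=1e-9):
    # Z: d x N whitened data; w: initial unit-norm de-mixing vector.
    for _ in range(n_iter):
        u = w @ Z
        w_new = (Z * g(u)).mean(axis=1) - w * g_prime(u).mean()
        w_new /= np.linalg.norm(w_new)
        if 1 - abs(w_new @ w) < tol:   # converged up to sign
            return w_new
        w = w_new
    return w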
2.1 Proposed Algorithm
The proposed algorithm is a combination of the symmetric FastICA, the test of saddle points, and one-unit FastICA used as a refinement. Usually, one-unit FastICA is used in a deflation manner, where the estimated components are subtracted from the mixture one by one. This is a computationally effective method, but the accuracy of the later-separated components might be compromised. Therefore, we propose to initialize the algorithm using symmetric FastICA, which is known for very good global convergence and allows equal separation precision for all components. The test of saddle points was first proposed in [11] to improve the probability that the symmetric FastICA converges to the true global maximum of the cost function $[E\{G(\mathbf{w}^T\mathbf{Z})\} - G_0]^2$, where $G(\cdot)$ is a primitive function of $g(\cdot)$ and $G_0 = E\{G(\xi)\}$, with $\xi$ a standard Gaussian random variable. In short, the test of saddle points consists in checking, for all pairs of the estimated components $(u_k, u_\ell)$, whether or not another pair of signals $(u_k', u_\ell')$ gives a higher value of the cost function

$$c(u_k', u_\ell') = [E\{G(u_k')\} - G_0]^2 + [E\{G(u_\ell')\} - G_0]^2, \tag{6}$$

where $u_k' = (u_k + u_\ell)/\sqrt{2}$ and $u_\ell' = (u_k - u_\ell)/\sqrt{2}$.
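For one pair of components, the test amounts to a few lines of code. In this sketch (ours), the Gaussian baseline G0 ≈ 0.3746 for G = log cosh is a numerical constant we supply, so treat it as an assumption:

import numpy as np

def saddle_test(u_k, u_l, G=lambda u: np.log(np.cosh(u)), G0=0.3746):
    # Eq. (6): keep the rotated pair if it increases the contrast.
    c = lambda a, b: (G(a).mean() - G0) ** 2 + (G(b).mean() - G0) ** 2
    uk2 = (u_k + u_l) / np.sqrt(2)
    ul2 = (u_k - u_l) / np.sqrt(2)
    return (uk2, ul2) if c(uk2, ul2) > c(u_k, u_l) else (u_k, u_l)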
The motivation is that a random initialization of the algorithm may begin at a point of zero gradient of the cost function (a saddle point / an unstable point of the iteration) and terminate there, despite this not being the desired stable solution. See [11] for details. In the complex domain, the situation is a bit more tricky, because if $(u_k, u_\ell)$ is a pair of valid independent components in the mixture, it is not only their weighted sum and difference that represent a false (unstable) point of the iteration. All pairs $(u_k', u_\ell')$ of the form $u_k' = (u_k + e^{i\alpha}u_\ell)/\sqrt{2}$ and $u_\ell' = (u_k - e^{i\alpha}u_\ell)/\sqrt{2}$ are stationary but unstable for any phase factor $e^{i\alpha}$, $\alpha \in \mathbb{R}$. Therefore, we propose to apply a phase shift to each separated component so that the real and imaginary parts of the signal are as mutually independent as possible before the test of saddle points. This phase shift can easily be performed using a two-dimensional symmetric FastICA in the real domain, applied to the real and imaginary parts of the component. After this preprocessing, it is sufficient to perform the test of saddle points exactly as in the real-valued case, i.e., to check all pairs $(u_k', u_\ell')$ with $u_k' = (u_k + u_\ell)/\sqrt{2}$ and $u_\ell' = (u_k - u_\ell)/\sqrt{2}$, whether they give a higher value of the cost function (6) or not. The validity of the above-described complex-domain test of saddle points can easily be confirmed in simulations by starting the algorithm from the pairs $u_k' = (u_k + e^{i\alpha}u_\ell)/\sqrt{2}$ and $u_\ell' = (u_k - e^{i\alpha}u_\ell)/\sqrt{2}$ with an arbitrary $\alpha \in \mathbb{R}$, where $u_k$ and $u_\ell$ are the true independent sources. We have successfully tested this approach on the separation of complex-valued finite-alphabet sources known in communications (QAM, V27). The resulting algorithm (symmetric FastICA + test of saddle points + one-unit refinements) will be referred to as 1FICA.
3 Bias of the FastICA Variants
In this section, asymptotic expressions for the bias of the algorithms described in the previous section, working in the real domain, are presented. (The complex-domain FastICA exhibits similar behavior in simulations.) For details of the analysis, the reader is referred to [9], due to lack of space. In brief, the theoretical analysis is done for "small" σ and an infinite number of samples. Similarly to [11], for theoretical considerations, it is assumed that the analyzed method starts from the MMSE solution and stops after one iteration. This assumption is reasonable due to the following facts: (1) the deviation of the global maximizer $\hat{\mathbf{W}}$ of the FastICA cost function from $\mathbf{W}^{\mathrm{MMSE}}$ is of order $O(\sigma^2)$, and (2) the convergence of the algorithm is at least quadratic [10]. Therefore, after performing the one iteration, the deviation of the estimate from the global maximizer $\hat{\mathbf{W}}$ is of order $O(\sigma^4)$ and, hence, is negligible.

The bias of an algorithm will be studied in terms of the deviation of $\mathbf{W}(\mathbf{W}^{\mathrm{MMSE}})^{-1}$ from a diagonal matrix. More precisely, the bias is equal to the difference between $E[\hat{\mathbf{W}}](\mathbf{W}^{\mathrm{MMSE}})^{-1}$ and $\mathbf{D} = \mathbf{W}^{\mathrm{NMMSE}}[\mathbf{W}^{\mathrm{MMSE}}]^{-1}$, where $\mathbf{D}$ is the diagonal matrix that normalizes the MMSE signals $\mathbf{S}^{\mathrm{MMSE}} = \mathbf{W}^{\mathrm{MMSE}}\mathbf{X}$.
It holds that

$$\mathbf{D} = \mathbf{I} + \frac{1}{2}\sigma^2\,\mathrm{diag}[\mathbf{V}_{11},\dots,\mathbf{V}_{dd}] + O(\sigma^3). \tag{7}$$

From here on we use the notation $\mathbf{W} = \mathbf{A}^{-1}$ and $\mathbf{V} = \mathbf{W}\mathbf{W}^T$. Finally, for a matrix $\tilde{\mathbf{W}}$ that separates the data $\mathbf{S}^{\mathrm{MMSE}}$, the bias is $E[\tilde{\mathbf{W}}] - \mathbf{D}$.

3.1 Bias of the One-Unit FastICA and 1FICA
It can be shown that the de-mixing vector $\mathbf{w}_k^{1U}$ resulting from the one-unit FastICA (applied to the data $\mathbf{S}^{\mathrm{MMSE}}$), for $N \to +\infty$, is proportional to

$$\mathbf{w}_k^{1U} = \tau_k\mathbf{e}_k + \frac{1}{2}\sigma^2\mathbf{V}_{kk}(\tau_k + \delta_k)\mathbf{e}_k + O(\sigma^3), \tag{8}$$

where $\tau_k = E[s_k g(s_k) - g'(s_k)]$, and $\delta_k$ is a scalar that depends on the distribution of $s_k$ and on the nonlinear function $g$ and its derivatives up to the third order. Since (8) is a scalar multiple of $\mathbf{e}_k$ (the $k$-th column of the identity matrix), it follows that the asymptotic bias of the one-unit approach is $O(\sigma^3)$. The separating matrix $\mathbf{W}^{1F}$ given by the proposed 1FICA is expected to have the same bias; simulations confirm this expectation [9].

3.2 Bias of the Inversion Solution
It is interesting to compare the previous result with the solution given by exact inversion of the mixing matrix, i.e., $\mathbf{W}\mathbf{X} = \mathbf{S} + \sigma\mathbf{W}\mathbf{N}$; these signals will be called the inversion solution. From $\mathbf{W}(\mathbf{W}^{\mathrm{MMSE}})^{-1} = \mathbf{W}(\mathbf{A}\mathbf{A}^T + \sigma^2\mathbf{I})\mathbf{W}^T = \mathbf{I} + \sigma^2\mathbf{V}$ it follows that the "bias" of the inversion solution is proportional to $\sigma^2$ and, in general, is greater than that of 1FICA. In other words, the algorithm 1FICA produces components that are asymptotically closer to the MMSE solution than to the inversion solution.

3.3 Bias of Algorithms Using the Orthogonal Constraint
A large number of ICA algorithms (e.g., JADE [2], symmetric FastICA, etc.) use an orthogonal constraint, i.e., they enforce the separated components to have sample correlations equal to zero. Since the second-order statistics cannot be estimated perfectly, this constraint compromises the separation quality [3,11]. Here we show that the bias of all ICA algorithms that use the constraint has the asymptotic order $O(\sigma^2)$. The orthogonality constraint can be written as

$$E[\tilde{\mathbf{W}}\mathbf{X}(\tilde{\mathbf{W}}\mathbf{X})^T] = \tilde{\mathbf{W}}(\mathbf{A}\mathbf{A}^T + \sigma^2\mathbf{I})\tilde{\mathbf{W}}^T = \mathbf{I}. \tag{9}$$

It follows that the bias of all constrained algorithms is lower bounded by

$$\min_{\tilde{\mathbf{W}}(\mathbf{A}\mathbf{A}^T + \sigma^2\mathbf{I})\tilde{\mathbf{W}}^T = \mathbf{I}} \big\|\tilde{\mathbf{W}}(\mathbf{W}^{\mathrm{MMSE}})^{-1} - \mathbf{D}\big\|_F = O(\sigma^2), \tag{10}$$
where the minimization proceeds over $\tilde{\mathbf{W}}$. The matrix $\mathbf{D}$ in (10) is the same as in (7). For the minimizer $\tilde{\mathbf{W}}$ of (10) it holds that $\tilde{\mathbf{W}}(\mathbf{W}^{\mathrm{MMSE}})^{-1} = \mathbf{I} + \sigma^2\boldsymbol{\Gamma} + O(\sigma^3)$, where $\boldsymbol{\Gamma}$ is a nonzero matrix obeying $\boldsymbol{\Gamma} + \boldsymbol{\Gamma}^T = \mathbf{V}$; see [9] for details. This result can be interpreted to mean that algorithms using the orthogonality constraint may favor some of the separated components, giving them zero bias, but the total average bias over all components has the order $O(\sigma^2)$.

3.4 Bias of the Symmetric FastICA and EFICA
The biases of the two algorithms can be expressed as

$$E[\hat{\mathbf{W}}](\mathbf{W}^{\mathrm{MMSE}})^{-1} - \mathbf{D} = \frac{1}{2}\sigma^2\,\mathbf{V}\odot(\mathbf{1}_{d\times d} - \mathbf{I} + \mathbf{H}) + O(\sigma^3), \tag{11}$$

where $\odot$ denotes the element-wise product, $H_{k\ell} = \frac{|\tau_\ell| - |\tau_k|}{|\tau_k| + |\tau_\ell|}$ for the symmetric FastICA, and $H_{k\ell} = \frac{c_{k\ell}|\tau_\ell| - |\tau_k|}{|\tau_k| + c_{k\ell}|\tau_\ell|}$ for EFICA, where $c_{k\ell} = \frac{|\tau_\ell|\gamma_k}{|\tau_k|(\gamma_\ell + \tau_\ell^2)}$ for $k \neq \ell$ and $c_{kk} = 1$. Here, $\gamma_k = E[g_k^2(s_k)] - E^2[s_k g_k(s_k)]$, and $g_k$ is the nonlinear function chosen for the $k$-th signal. It can be seen that the bias of both algorithms has the order $O(\sigma^2)$.
4 Simulations
In this section, we present the results of two experiments to demonstrate and compare the performance of the proposed algorithm 1FICA with competing methods: the symmetric FastICA (marked SYMM), the unbiased FastICA (unbiased FICA), EFICA, and JADE [2]. Results given by the "oracle" MMSE solution and the inversion solution are included as well. Examples with complex signals are not included due to lack of space. In the first example, we separate 10 randomly mixed [7] BPSK signals with added Gaussian noise, first for various lengths of data $N$ (Fig. 1(a)) and, second, for varying input signal-to-noise ratio (SNR), defined as $1/\sigma^2$ (Fig. 1(b)). The experiment encompasses several extreme conditions: in the first scenario, where SNR = 5 dB ($\sigma \doteq 0.56$), $N$ starts from 100, which is quite low for the dimension $d = 10$. The second situation examines $N = 200$ and SNR going down to 0 dB. Note that the bias may be less important than the estimation variance when the data length $N$ is low. Therefore, in the simulations, we have included two slightly modified versions of the 1FICA and EFICA algorithms, denoted "1FICA-biga" and "EFICA-biga", respectively. The modification consists in taking the nonlinear function $g$ equal to the score function of the marginal PDFs of the signals to be estimated (i.e., noisy BPSK signals, which have a bimodal Gaussian distribution; hence "biga" in the acronym). As in the noiseless case [11], better performance of the modified algorithms may be expected. Figure 1 shows the superior performance of the proposed algorithm 1FICA and of its modified version. The same performance is achieved by the modified EFICA for $N \leq 200$, but it is lower, due to the bias, when $N$ is higher. The unbiased FastICA achieves the same accuracy for $N \geq 500$ but is unstable when $N$ is low.
The average performance of an algorithm is often spoiled by poor stability, which occurs mainly in high-dimensional, low-$N$ cases. In this respect, we highlight the positive effect of the test of saddle points that is included in the proposed 1FICA and in EFICA. For instance, the results achieved by the symmetric FastICA would be significantly improved if the test were included in it. The second example demonstrates conditions where the covariance of the noise is not exactly known or is varying. To this end, the noise level was changed randomly from trial to trial. Five BPSK signals of length $N = 50000$ were mixed with a random matrix, disturbed by Gaussian noise with covariance $\sigma^2\mathbf{I}$, where $\sigma$ was randomly drawn from the interval $[0, 1]$, and then blindly separated. The mean value of the noise covariance matrix, i.e., $\mathbf{I}/3$, was used as the input parameter of the unbiased FastICA. Note that the INSR and BER of this method were computed for solutions given by $\mathbf{W}^{\text{MMSE-UNB}}$ defined in (5). The following table shows the average INSR and bit error rate (BER) achieved over 1000 trials. The performance of the proposed 1FICA is almost the same as that of the "oracle" MMSE separator because, here, $N$ is very high and the estimation error is caused by the bias only. The unbiased FastICA suffers significantly from the inaccurate information about the noise intensity.
average INSR [dB] -5,98 -5,68 6,79 -5,79 -5,98 -4,76 -5,68
BER [%] 3,19 3,55 5,25 3,41 3,19 4,71 3,55
Fig. 1. Average BER of 10 separated BPSK signals when (a) the SNR is fixed to 5 dB and (b) the number of data samples is fixed to N = 200. Averages are taken over 1000 independent trials for each setting.
5 Conclusions
This paper presents novel results from the analysis of the bias of several FastICA variants, whereby the one-unit FastICA was shown to be minimally biased from the MMSE solution, i.e., it achieves the best interference-plus-noise rejection rate for $N \to +\infty$. By virtue of the theoretical results, a new variant of the FastICA algorithm, called 1FICA, was derived to have the same global convergence as symmetric FastICA with the test of saddle points, and noise rejection like the one-unit FastICA. Computer simulations show the superior performance of the method when separating binary (BPSK) signals. Unlike the unbiased FastICA, it does not require prior knowledge of the covariance of the noise to achieve the best MMSE separation. The Matlab codes for 1FICA in the real and complex domains can be downloaded from the first author's homepage, http://itakura.kes.tul.cz/zbynek/downloads.htm.
References

1. Bingham, E., Hyvärinen, A.: A fast fixed-point algorithm for independent component analysis of complex valued signals. Int. J. Neural Systems 10, 1–8 (2000)
2. Cardoso, J.-F., Souloumiac, A.: Blind Beamforming from non-Gaussian Signals. IEE Proc.-F 140, 362–370 (1993)
3. Cardoso, J.-F.: On the performance of orthogonal source separation algorithms. In: Proc. EUSIPCO, Edinburgh, pp. 776–779 (1994)
4. Davies, M.: Identifiability Issues in Noisy ICA. IEEE Signal Processing Letters 11, 470–473 (2005)
5. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997)
6. Hyvärinen, A.: Gaussian Moments for Noisy Independent Component Analysis. IEEE Signal Processing Letters 6, 145–147 (1999)
7. Koldovský, Z., Tichavský, P.: Methods of Fair Comparison of Performance of Linear ICA Techniques in Presence of Additive Noise. In: Proc. of ICASSP 2006, Toulouse, pp. 873–876 (2006)
8. Koldovský, Z., Tichavský, P., Oja, E.: Efficient Variant of Algorithm FastICA for Independent Component Analysis Attaining the Cramér-Rao Lower Bound. IEEE Tr. Neural Networks 17, 1265–1277 (2006)
9. Koldovský, Z., Tichavský, P.: Asymptotic analysis of bias of FastICA-based algorithms in presence of additive noise. Technical report no. 2181, ÚTIA, AV ČR (2007), available at http://itakura.kes.tul.cz/zbynek/publications.htm
10. Oja, E., Yuan, Z.: The FastICA algorithm revisited: Convergence analysis. IEEE Trans. on Neural Networks 17, 1370–1381 (2006)
11. Tichavský, P., Koldovský, Z., Oja, E.: Performance Analysis of the FastICA Algorithm and Cramér-Rao Bounds for Linear Independent Component Analysis. IEEE Tr. on Signal Processing 54, 1189–1203 (2006)
Compact Representations of Market Securities Using Smooth Component Extraction

Hariton Korizis, Nikolaos Mitianoudis, and Anthony G. Constantinides

Communications and Signal Processing Group, Imperial College London, Exhibition Road, London SW7 2AZ, UK
{hariton.korizis99, n.mitianoudis, a.constantinides}@ic.ac.uk
Abstract. Independent Component Analysis (ICA) is a statistical method for expressing an observed set of random vectors as a linear combination of statistically independent components. This paper tackles the task of comparing two ICA algorithms, in terms of their efficiency for compact representation of market securities. A recently developed sequential blind signal extraction algorithm, SmoothICA, is contrasted to a classical implementation of ICA, FastICA. SmoothICA uses an additional 2nd order constraint aiming at identifying temporally smooth components in the data set. This paper demonstrates the superiority of this novel smooth component extraction algorithm in terms of global and local approximation capability, applied to a portfolio of 60 NASDAQ securities, by utilizing common ordering algorithms for financial signals. Keywords: Independent Component Analysis, SmoothICA, FastICA, market securities, Finance.
1 Introduction

The goal of Independent Component Analysis is to find a linear representation of non-Gaussian variables. Finding such a representation provides insight into the underlying structure of many signal processing problems. The ICA problem is equivalent to establishing the following generating model for the data:

$$\mathbf{x} = \mathbf{A}\mathbf{s}, \tag{1}$$

where $\mathbf{x}$ and $\mathbf{s}$ are $n$-dimensional random vectors, and the components of $\mathbf{s}$ are assumed mutually independent. $\mathbf{A}$ is a constant $n \times n$ full-rank matrix, denoting the unknown mixing matrix. Relevant to our investigation is the formulation that $\mathbf{x}$ consists of a set of observation vectors generated in the financial markets, which are driven by the hidden underlying sources $\mathbf{s}$. The driving mechanisms $\mathbf{s}$ are mixed and contaminated, among others, by elements such as news and expectations related to the results of companies and sectors, domestic and foreign politics that affect exchange and interest rates, consumer confidence, unexpected events, and even the weather, which affects commodity prices. The transformation

$$\mathbf{s} = \mathbf{W}\mathbf{x} \tag{2}$$

can be defined, with $\mathbf{W}$ the demixing matrix and $\mathbf{A} = \mathbf{W}^{-1}$. This method allows at most one Gaussian component, concentrating all the signal innovations which cannot
be accounted for by the original problem assumptions. In the case of signals originating from the financial markets, this assumption can be considered valid for a great majority of cases, as purely Gaussian financial signals are rarely generated. The assumption of statistical independence of the source signals can be assumed valid in the scope of the global economy and the hugely diverse micro- and macroeconomic factors that affect financial processes. Unexplained noise, as well as the markets' response to large trades, can also be of significance to researchers and traders. It is nevertheless arguable whether the underlying driving sources in the financial markets are truly independent, as every source might exert a small influence on all others. A review of various contrast functions can be found in [1], as the sources s can be separated using various interpretations of statistical independence. In a recent paper [2], a sequential blind signal extraction algorithm was proposed that incorporates a smoothness constraint into the original FastICA algorithm [3]. Along with the negentropy cost function, the added temporal constraint seeks smooth orthogonal projections in the mixture vector x. In [4], this algorithm is referred to as SmoothICA, and its performance is contrasted with that of FastICA in the search for temporal structure in the underlying sources that give rise to stock evolutions. A portfolio of 20 NASDAQ securities was analyzed, and possible advantages of this novel approach over FastICA were highlighted, as it produced components with smoother temporal structure. A small degree of correlation was present among the extracted components, introduced by the balancing of the 4th- and 2nd-order temporal constraints. In Principal Component Analysis (PCA), component ranking can be achieved according to the eigenvalues of the source vectors. In ICA the same task is not straightforward, as the rows of the corresponding projection matrix are normalized to unity [3]. PCA is optimal for dimensionality reduction to a set of uncorrelated components, in terms of variance. However, the idea here is the existence of independent underlying sources, so only algorithms that seek independence are examined. Common ordering algorithms for financial signals are used, and their error performances are contrasted in approximating a given portfolio using a reduced set of components. This text is organized in the following way. The next section contains an overview of ICA ordering techniques for financial time series. Section 3 contains an overview of the SmoothICA algorithm. Section 4 presents the ordering algorithms that are utilized in the experimental Section 5. An ordering method for selecting a reduced number of components for the reconstruction of a whole portfolio of securities is considered (global approximation), as well as two methods for ordering the components' contributions to each security signal and reconstructing accordingly. Section 6 concludes this study.
2 Overview of Independent Component Ordering in Finance

Several investigations of ICA with application to finance have been performed. The most influential was done in [5], examining the portfolio returns of 28 Japanese stocks. The PCA and ICA performances were compared for such signals. From the operation of ICA, the components produced have $\mathrm{var}(s_i) = 1$. It is therefore assumed that any information about the contribution of each individual component to a mixture's variation is engulfed in the mixing matrix A. The authors used the maximum norm $L_\infty$ to sort
the rows of A, and thus to determine which ICs have the maximum contribution to a selected signal's amplitude. Such a measure is applied in this research. Cheung and Xu [6] presented a criterion for ordering source signals according to their contribution to the trend preservation of each observed signal. This algorithm uses the MSE criterion, is named Testing-and-Acceptance (TnA), and, when applied to foreign exchange rates, produces superior results over the $L_\infty$ norm method. This is the second method used in this paper. The same authors presented a criterion to select the appropriate dimension of the source signal subset to approximate a portfolio of foreign exchange rates in [8]. An algorithm using the Relative Hamming Distance (RHD) instead was proposed in [7]. A consequence of the increased interest in this type of component extraction, and of its demonstrated superiority over PCA in terms of source separation, is a range of applications utilizing its capabilities in econometrics and finance: from prediction approaches [9] and factor model estimation [10] to the computation of the risk of a portfolio of securities [11] and the application of ICA in the context of state space models for interbank foreign exchange rates to obtain a better separation of the observation noise and the "true" price [12]. It is worth focusing on [11], where the contribution of an individual independent component to the variance of the whole portfolio of securities is calculated. The ICs are ordered according to that contribution, and this operates as a preprocessing step for dimensionality reduction before switching back to the price space. This is the third method examined in the current paper, testing global approximation performance.
3 The SmoothICA Algorithm

After an initial prewhitening step, SmoothICA solves the following inequality-constrained optimization problem:

$$\max_{\mathbf{w}} \; J_1(\mathbf{w}) \tag{3}$$
$$\text{subject to} \quad J_2(\mathbf{w}) \leq 0 \tag{4}$$
$$\qquad\qquad\quad\; J_3(\mathbf{w}) = 0 \tag{5}$$
where $\mathbf{w}$ is the projection operator of the whitened data $\mathbf{z}$, $J_1(\cdot)$ is the approximated negentropy as proposed by Hyvärinen [3], $J_2(\mathbf{w}) = E\{(\mathbf{w}^T\Delta\mathbf{z})^2\} - \rho E\{(\mathbf{w}^T\mathbf{z})^2\}$ is the second-order smoothness criterion, $J_3(\cdot)$ is the unit-norm constraint, $\rho \in [0, 1]$ defines the degree of smoothness [4], and $\Delta\mathbf{z}$ is the whitened data differences matrix. Modifying the inequality constraint to the equality constraint $\max(J_2(\mathbf{w}), 0) = 0$, one can find the desired optima using alternating unconstrained maximization of the Lagrangian function $J_1(\mathbf{w}) + \lambda\max(J_2(\mathbf{w}), 0) + \kappa J_3(\mathbf{w})$, where $\lambda, \kappa$ are the Lagrange multipliers. The following Newton step provides an update:

$$\mathbf{w}^+ \leftarrow \mathbf{w} - \left(\frac{\partial^2 J}{\partial\mathbf{w}^2}\right)^{-1}\frac{\partial J}{\partial\mathbf{w}} \tag{6}$$

where, in this case, the gradient vector and the Hessian matrix are estimated using the following updates:

$$\frac{\partial J}{\partial\mathbf{w}} = \mu E\{\mathbf{z}G'(\mathbf{w}^T\mathbf{z})\} + \lambda\big(E\{(\mathbf{w}^T\Delta\mathbf{z})\Delta\mathbf{z} - \rho(\mathbf{w}^T\mathbf{z})\mathbf{z}\}\big)(\mathrm{sgn}(J_2) + 1)$$
$$\frac{\partial^2 J}{\partial\mathbf{w}^2} = \mu E\{G''(\mathbf{w}^T\mathbf{z})\}\mathbf{I} + \lambda(\mathbf{C}_{\Delta z} - \rho\mathbf{I})(\mathrm{sgn}(J_2) + 1)$$

where $u = \mathbf{w}^T\mathbf{z}$, $G(u) = \log\cosh(u)$, $\mu = \mathrm{sgn}(E\{G(u)\} - E\{G(v)\})$, and $v$ is a zero-mean, unit-variance Gaussian variable. After calculating the estimate for $\mathbf{w}$, we calculate estimates for $\lambda$ via alternating optimization. The unit-norm constraint $\mathbf{w}^+ \leftarrow \mathbf{w}^+/\|\mathbf{w}^+\|$ is then imposed as a projection of the $\mathbf{w}$ estimate onto the unit hypersphere, to ensure that a rotation and not a scale deformation is performed. The orthogonal deflation procedure is used to extract subsequent smooth components [1].
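Our reading of the Newton update (6) with the estimates above can be sketched as follows; a fixed multiplier lam stands in for the alternating λ estimation, and the Gaussian baseline E{G(v)} ≈ 0.3746 is a numerical constant we supply, so this is an illustrative sketch rather than the authors' implementation.

import numpy as np

def smoothica_step(Z, dZ, w, lam, rho,
                   g=np.tanh, g_prime=lambda u: 1 - np.tanh(u) ** 2):
    # Z: d x N whitened data; dZ: whitened data differences; g = G', g' = G''.
    u = w @ Z
    mu = np.sign(np.log(np.cosh(u)).mean() - 0.3746)  # sgn(E{G(u)} - E{G(v)})
    J2 = ((w @ dZ) ** 2).mean() - rho * (u ** 2).mean()
    act = np.sign(J2) + 1                # 0 when the constraint is inactive
    grad = mu * (Z * g(u)).mean(axis=1) + lam * act * (
        (dZ * (w @ dZ)).mean(axis=1) - rho * (Z * u).mean(axis=1))
    C_dz = dZ @ dZ.T / dZ.shape[1]
    hess = mu * g_prime(u).mean() * np.eye(len(w)) \
           + lam * act * (C_dz - rho * np.eye(len(w)))
    w_new = w - np.linalg.solve(hess, grad)
    return w_new / np.linalg.norm(w_new)  # project back to the unit sphere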
4 Ordering Methods for Independent Component Analysis

In finance, dimensionality reduction is applied for various purposes. It is performed to remove unwanted information and hence obtain a clearer picture of an underlying process, allowing better modeling and understanding of its statistical nature. It is also applied to represent a large set of assets by an appropriate subset that best defines it, and to reduce memory requirements and computational burden. Unlike PCA, ICA does not have an inherent ordering of the ICs. The methods below follow two notions: approximation of a particular security using a few ICs (selected according to their contribution to that particular security), and approximation of a whole portfolio of securities by selecting an appropriate subset of independent components.

4.1 Global Approximation

In ICA the components produced are scaled to unit variance. This means that the additional information about the individual contributions of the ICs to the observed signals lies in the mixing matrix A [11]. The variance of security $i$ is $\sigma_i^2$, and the amount of total variance $V_j$ explained by each component $s_j$ can be derived from:

$$\sigma_i^2 = \sum_{j=1}^{n} a_{ij}^2 \quad\text{and}\quad V_j = \frac{\sum_{i=1}^{n} a_{ij}^2}{\sum_{i,j=1}^{n} a_{ij}^2}. \tag{7}$$
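Equation (7) amounts to normalized column energies of the mixing matrix; a minimal sketch (ours):

import numpy as np

def variance_contributions(A):
    # V_j = sum_i a_ij^2 / sum_{i,j} a_ij^2 for each component j.
    col_energy = (A ** 2).sum(axis=0)
    return col_energy / col_energy.sum()

# Global IC ranking, e.g.: order = np.argsort(variance_contributions(A))[::-1]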
Thus, by ordering the ICs according to their individual contributions to the whole portfolio, we can approximate efficiently by selecting a reduced number of components.

4.2 Local Approximation

The $L_\infty$ norm: The weighted ICs, given by (9), with the largest amplitudes are defined to be the dominant ICs. This of course presents an ordering criterion, as these ICs have the largest effect on the securities. The reconstruction of the $i$th security from the estimates of the source signals is:

$$\hat{x}_i = \sum_{k=1}^{n} a_{ik}\hat{s}_k \tag{8}$$
where $\hat{s}_k$ is the $k$th estimated IC and $a_{ik}$ is the weight in the $i$th row, $k$th column of $\mathbf{A}$. The weighted ICs are therefore obtained from:

$$\hat{s}_{ik} = a_{ik}\hat{s}_k, \qquad k = 1,\dots,n. \tag{9}$$
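The L∞ ordering then reduces to sorting the weighted ICs by their maximum amplitude; a hedged sketch, with A the estimated mixing matrix and S_hat the (n × T) matrix of estimated components:

import numpy as np

def linf_order(A, S_hat, i):
    # Rank components for security i by the max-norm of a_ik * s_k, Eq. (9).
    weighted = A[i][:, None] * S_hat
    return np.argsort(np.abs(weighted).max(axis=1))[::-1]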
The $L_\infty$ norm was used in [5] to order the weighted ICs for each particular stock, as this reveals the magnitude contribution of each source signal to that stock.

The Testing-and-Acceptance algorithm: The TnA algorithm of [6] aims at creating a list $L_i$ whose elements are the component subscripts, ordered according to decreasing contribution to a specified security signal. Initially, the IC whose omission introduces the minimum MSE in the reconstruction of the selected security is selected from the $m$ components. The reconstruction of the security while the $i$th component is omitted is $\{\hat{y}_j\}_{j=1, j\neq i}^{m}$. The subscript of this IC is put last in the list $L$. The next step of the iteration starts with the subset of the ICs that does not include the previously selected component. It finds the next component that, when omitted, causes the minimum MSE of approximation, puts it second to last in $L$, and so on. It is a suboptimal heuristic method compared with exhaustive search; however, the TnA algorithm involves just $\frac{m(m+1)}{2} - 1$ steps compared to $(m+1)!$.

The algorithm operates as follows (a sketch of the procedure is given after the list):

1. Let the set of independent component subscripts be $Z = \{j \mid 1 \leq j \leq m\}$, $d = 0$, and the order list $L_i = ()$.
2. For each $j \in Z$, with $N$ being the signal's length, let
$$\upsilon_{ij}(t) = \sum_{m\neq j,\, m\in Z}\hat{s}_{im}(t), \qquad 1 \leq t \leq N. \tag{10}$$
The $\beta$ that will be stored as the $d$th element of $L_i$ and removed from the set $Z$ is selected according to
$$\beta = \arg\min_{j\in Z} MSE(x_i, \upsilon_{ij}). \tag{11}$$
Then let: $d^{new} = d^{old} + 1$, $L_i^{new} = L_i^{old} + \beta$, $Z^{new} = Z^{old} - \{\beta\}$.
3. If $Z \neq \{\}$, go to Step 2; otherwise stop. In order to make the list ordered according to descending contribution, flip it.
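A compact Python sketch of this procedure (ours; S_hat_i holds the m weighted ICs a_ik ŝ_k of security i as rows):

import numpy as np

def tna_order(x_i, S_hat_i):
    # Repeatedly drop the IC whose omission hurts the reconstruction least.
    Z = list(range(len(S_hat_i)))
    order = []
    while Z:
        mses = [((x_i - S_hat_i[[m for m in Z if m != j]].sum(axis=0)) ** 2).mean()
                for j in Z]
        beta = Z[int(np.argmin(mses))]    # least-needed component
        order.append(beta)
        Z.remove(beta)
    return order[::-1]                    # descending contribution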
5 Experiments

5.1 Description of the Data

The experiments are performed with daily closing prices of a portfolio of 60 US technology stocks (the first 60 stocks, alphabetically, of the NASDAQ US Exchange) for the period from 01/01/2002 to 05/04/2005. The data is
centered and whitened so that uncorrelated, unit-variance signals are obtained. The SmoothICA algorithm is performed on the centered and whitened data as outlined in [2] and [4]. Extraction of smoother components than FastICA is achieved, although more computational time is required. Flexibility is added with ρ starting at 0.05 and increasing progressively in the case of non-converging components. The coefficient ρ is initialized at this level, as it corresponds to a long-period sinusoidal signal with low levels of noise. By using a large portfolio (a high number of mixtures), the issue of low correlation among the components [4] is avoided: the correlation matrix of the sources is now a proper identity matrix. The FastICA algorithm is also applied, due to its convergence efficiency, and gives the reference results for the approximation fitness comparison over all possible subset orders.

5.2 Global Approximation Comparison

After obtaining successful convergence for both algorithms, the percentage variance contribution of each of their components is calculated, using the expression for $V_j$ in (7). The result is presented in Figure 1. While in the FastICA case the variance contributions are parsed almost equally among the ICs, a significant amount of variance is concentrated in approximately the first 20 ICs that SmoothICA extracts. Indicatively, 70% of the portfolio's variance is contained in 35 FastICA ICs but, using the additional smoothness temporal constraint, in only 16 components; 90% of the variance is contained in just 21 components, contrasted to 49. This signifies the great advantage in terms of dimensionality reduction and global approximation using a smaller subset of signals. SmoothICA, as observed, produces components that have an inherent ordering of the source signals and can provide a more efficient representation of a portfolio of securities. It can be used, among other tasks, as an alternative to dimensionality reduction for simpler modeling, or for extraction of seasonal and structural variations (currently done during the pre-whitening step by PCA).
Fig. 1. Global approximation performance. Percentage variance contributions of each source signal: (a) for the FastICA case; (b) for the SmoothICA case.
5.3 Local Approximation Comparison

The local approximation performances of both algorithms are compared using the two ordering methods presented above. Two error criteria are calculated, producing mean approximation errors across all securities in the portfolio: the Mean Squared Error (MSE), a commonly used fitness measure that penalizes large deviations from the observed security prices more heavily, and the Mean Absolute Percentage Error (MAPE), an easily understood, intuitive measure. On the x-axis of Figure 2 lies the number of ICs used for the approximation of each security signal, from only 1 to all 60 ICs. The lists containing the ordered contributions are calculated for both algorithms, using both the $L_\infty$ norm measure and the TnA heuristic algorithm. The results in Figure 2 demonstrate a clearly superior local approximation performance of SmoothICA. Equally consistent results are obtained for the Root Mean Squared Error (RMSE) and the Mean Percentage Error (MPE), not presented here for economy of space. The differences between the performances of $L_\infty$ and TnA seem marginal; however, the former is significantly less computationally demanding and is thus preferred.
Fig. 2. Local approximation performance: (a) mean MSE using L∞ norm ordering; (b) mean MAPE using L∞ norm ordering; (c) mean MSE using the TnA algorithm; (d) mean MAPE using the TnA algorithm. SmoothICA is shown to be superior in terms of a more efficient representation of each source signal.
6 Conclusions

Through the addition of the 2nd-order temporal constraint, which seeks to identify temporally smooth underlying sources, the SmoothICA algorithm is more efficient than FastICA in approximating a portfolio of securities from an appropriate subset of the estimated sources (Section 5.2). This novel algorithm estimates smoother underlying sources that have an inherent ordering, as a high percentage of the portfolio's variance is contained in the first few components, compared to FastICA, which spreads the variance significantly more evenly among its ICs. Containing 70% of the portfolio's variance in just 16 components, while the classical FastICA requires 35, and 90% of the variance in just 21 contrasted to 49 components, is a significant improvement in terms of global approximation. In the local approximation part of this paper (Section 5.3), each security in the portfolio is reconstructed by appropriate subsets of the source signals of dimensions 1 to 60. For each dimension selected, the mean MSE and MAPE approximation errors across the portfolio are plotted against the subset dimension. For both component ordering methods examined, the calculated errors show consistent superiority of the SmoothICA algorithm for efficient compact representation of a portfolio of securities. Furthermore, the gradients in the plots of Figure 2 support the global-case conclusions.
References

1. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Networks 13(4-5), 411–430 (2000)
2. Mitianoudis, N., Stathaki, T., Constantinides, A.G.: Smooth signal extraction from instantaneous mixtures. IEEE Signal Processing Letters 14(4) (2007)
3. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
4. Korizis, H., Constantinides, A., Christofides, N.: Smooth Component Extraction from a Set of Financial Data Mixtures. In: Proc. Signal Processing, Pattern Recognition and Applications, Innsbruck, Austria, pp. 136–554 (2007)
5. Back, A.D., Weigend, A.S.: A First Application of Independent Component Analysis to Extracting Structure from Stock Returns. Int. J. Neural Systems 8(4), 473–484 (1997)
6. Cheung, Y.M., Xu, L.: The MSE Reconstruction Criterion for Independent Component Ordering in ICA Time Series Analysis. NSIP 8(4), 793–797 (1999)
7. Lai, Z.B., Cheung, Y.M., Xu, L.: Independent Component Ordering in ICA Analysis of Financial Data. In: Computational Finance, ch. 14, pp. 201–212. The MIT Press, Cambridge (1999)
8. Cheung, Y.M., Xu, L.: An empirical method to select dominant independent components in ICA for time series analysis. In: Proc. Int. Joint Conf. on Neural Networks '99, Washington DC, pp. 3883–3887 (1999)
9. Malaroiu, S., Kiviluoto, K., Oja, E.: Time series prediction with Independent Component Analysis. In: Proc. '99 Conf. on Advanced Investment Technologies, Gold Coast, Australia (2000)
10. Cha, S.M., Chan, L.W.: Applying Independent Component Analysis to Factor Model in Finance. In: Leung, K.-S., Chan, L., Meng, H. (eds.) IDEAL 2000. LNCS, vol. 1983, pp. 538–544. Springer, Heidelberg (2000)
11. Chin, E., Weigend, A., Zimmermann, H.: Computing portfolio risk using gaussian mixtures and independent component analysis. In: Proc. CIFEr '99, New York, pp. 74–117 (1999)
12. Moody, J., Wu, L.: What is the true price? State space models for high frequency FX data. In: Proc. CIFEr '97, New York (1997)
Bayesian Estimation of Overcomplete Independent Feature Subspaces for Natural Images

Libo Ma and Liqing Zhang

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
[email protected], [email protected]
Abstract. In this paper, we propose a Bayesian estimation approach to extend independent subspace analysis (ISA) to an overcomplete representation without imposing the orthogonal constraint. Our method is based on a synthesis of ISA [1] and overcomplete independent component analysis [2] developed by Hyvärinen et al. By introducing the variables of dot products (between basis vectors and whitened observed data vectors), we investigate the energy correlations of the dot products in each subspace. Based on the prior probability of quasi-orthogonal basis vectors, the MAP (maximum a posteriori) estimation method is used for learning overcomplete independent feature subspaces. A gradient ascent algorithm is derived to maximize the posterior probability of the mixing matrix. Simulation results on natural images demonstrate that the proposed model can yield overcomplete independent feature subspaces and the emergence of phase- and limited shift-invariant features, the principal properties of visual complex cells.
1 Introduction

Recent linear implementations of the efficient coding hypothesis [3,4], such as independent component analysis (ICA) [5] and sparse coding [6], have provided functional explanations for the early visual system, especially for neurons in the primary visual cortex (V1). Nevertheless, there are many complex nonlinear statistical structures in natural signals which cannot be extracted by a linear model. For instance, Schwartz et al. have observed that, for natural images, there are significant statistical dependencies among the variances of filter outputs [7]. Several algorithms have been proposed to extend the linear ICA model to capture such residual nonlinear dependencies [1,8,7,9]. Hyvärinen et al. developed the independent subspace analysis (ISA) method, which generalizes the assumption of component independence to subspace independence [1]. However, this method is limited to the complete case: the orthogonality requirement on the mixing matrix prevents the generalization to overcomplete representations. In an overcomplete representation, the dimension of the feature vector is larger than the dimension of the input. Overcomplete representations present several potential advantages. High-dimensional representations are more flexible in capturing inherent structures in signals, and overcomplete representations generally provide more efficient representations than the complete case [10]. Furthermore, studies of the human visual cortex have shown interesting implications of overcomplete representations in the visual system [11].
In this paper, we combine ISA [1] and overcomplete independent component analysis [2] to extend ISA to overcomplete representations. We apply Bayesian inference to estimate overcomplete independent feature subspaces of natural images. In order to derive the prior probability of the mixing matrix, the quasi-orthogonality of the dot product between two basis vectors is investigated. Moreover, we assume that the probability density of the dot products (between basis vectors and whitened observed data vectors) in one subspace depends only on the norms of the projections of the data onto the subspace. Then, a learning rule based on a gradient ascent algorithm is derived to maximize the posterior probability. Simulation results on natural image data are provided to demonstrate the performance of overcomplete representations for independent subspace analysis. Furthermore, our model can lead to the emergence of phase- and limited shift-invariant features—principal properties of visual complex cells as well. This paper is organized as follows: In section 2, we propose a Bayesian approach to estimate the overcomplete independent feature subspaces; the learning rule is given as well. In section 3, some experimental results on natural images are presented. Finally, some discussions on the representation performance of the proposed method are given in section 4.
2 Model

2.1 Bayesian Inference

In this section, we apply the Bayesian MAP (maximum a posteriori) approach to estimating overcomplete independent feature subspaces. The basic ICA model can be expressed as

$$x = As = \sum_{i=1}^{N} a_i s_i, \qquad (1)$$
where $x = (x_1, x_2, ..., x_M)^T$ is a vector of observed data, $s = (s_1, s_2, ..., s_N)^T$ is a vector of components, and $A$ is the mixing matrix. $a_i$ is the $i$-th column of $A$ and is often called a basis function or basis vector. In our model, the observed data vector $x$ is whitened to a vector $z$, as in the preprocessing step of most ICA methods. Furthermore, instead of considering the independent components, as in most ICA methods, we consider the dot product between the $i$-th basis vector and the whitened data vector. For simplicity, it is assumed that the norms of the basis vectors are set to unity and that the variances of the sources can differ from unity. Then, the dot product is

$$y_i = a_i^T z = a_i^T A s = s_i + \sum_{j \neq i} a_i^T a_j s_j, \qquad (2)$$
where $s_i$ is the $i$-th independent component. Given the overcomplete representation of our model (there is a large number of components in a high-dimensional space), the second term approximately follows a Gaussian distribution. Moreover, there is no component whose variance is considerably larger than the others. Therefore, the marginal distributions of the dot products should be maximally sparse (super-Gaussian), and maximizing the non-Gaussianities of these dot products is sufficient to provide an approximation of the basis vectors. Thus, we can replace the component $s_i$ by the dot product $y_i$ to estimate independent feature subspaces. Considering the dot product vector $y = (y_1, ..., y_N)^T = A^T z$, the probability for $z$ given $A$ can be approximated by

$$p(z(t)|A) = p(y) \approx C \prod_{i=1}^{N} p_{y_i}(y_i) = C \prod_{i=1}^{N} p_{y_i}(a_i^T z(t)), \qquad (3)$$
where $C$ is a constant. Obviously, the accuracy of the prior probability $p_{y_i}$ is important, especially for overcomplete representations [10]. Several choices of prior on the basis coefficients $P(s)$ have been applied in classical linear models. Bell and Sejnowski utilize the prior $P(s_i) \propto \mathrm{sech}(s_i)$, which corresponds to the hyperbolic tangent nonlinearity [5]. Olshausen and Field use a generalized Cauchy prior [6], whereas van Hateren and van der Schaaf simply exploit non-Gaussianity [12]. Nevertheless, all these choices of prior are derived under a single-layer linear network model. It is clearly desirable to capture nonlinear dependencies by a second or third stage in a hierarchical fashion. In our model, we apply the prior probability $p_{y_i}$ proposed in the ISA algorithm, in which the basis function coefficients in each subspace have energy correlations [1]. A diagram of feature subspaces is given in Figure 1.
Fig. 1. Illustration of feature subspaces. The dot products between basis vectors and whitened observed data vectors are taken; they are then squared and summed inside the same feature subspace, and square roots are taken for normalization.
The dot products (neuronal responses) $y_i$ are assumed to be divided into $n$-tuples, such that the $y_i$ inside a given $n$-tuple may be dependent on each other, while different $n$-tuples are mutually independent. The subspace model introduces a certain dependency structure
for the different components. Let $\Omega_j$, $j = 1, ..., J$ denote the set of independent feature subspaces, where $J$ is the number of subspaces. The probability distributions for $n$-tuples of $y_i$ are spherically symmetric. In other words, the probability density $p_{y^j}(\cdot)$ of an $n$-tuple can be expressed as a function of the sum of the squares of $y_i$, $i \in \Omega_j$ only. For simplicity, we assume the $p_{y^j}(\cdot)$ are identical for all subspaces. Therefore, the probability density inside the $j$-th $n$-tuple of $y_i$ can be calculated as

$$p_{y^j}(y^j) = \exp\Big(G\Big(\sum_{i\in\Omega_j} y_i^2\Big)\Big), \qquad (4)$$

where the function $G(y)$ should be convex for non-negative $y$. For example, one could use the form $G(y) = -\alpha_1 \sqrt{y} + \beta_1$, where $\alpha_1$ is a scaling constant and $\beta_1$ is a normalization constant. These constants are unimportant for the learning process. Overcomplete representation means that there is a large number of basis vectors; in other words, the basis vectors are randomly distributed in a high-dimensional space. In order to approximate the prior probability of the basis vectors, we employ a result presented by Hecht-Nielsen [13]: the number of almost orthogonal directions is much larger than that of orthogonal directions. This property is called quasi-orthogonality [2]. Therefore, in a high-dimensional space even vectors having random directions might be sufficiently close to orthogonality. Thus, the prior probability of the mixing matrix $A$ can be obtained in terms of the quasi-orthogonality as follows:
where the function G(y) should be convex for non-negative y. For example, one could √ use the form of G(.) as: G(y) = −α1 y + β1 , where α1 is the scaling constant and β1 is the normalization constant. These constants are unimportant for the learning process. Overcomplete representations mean that there is a large number of basis vectors. In other words, the basis vectors are randomly distributed in a high-dimensional space. In order to approximate the prior probability of basis vectors, we employ a result presented by Hecht-Nielsen [13]: the number of almost orthogonal directions is much larger than that of orthorgonal directions. This property is called quasi-orthogonality [2]. Therefore, in a high-dimensional space even vectors having random directions might be sufficiently close to orthogonality. Thus, the prior probability of the mixing matrix A can be obtained in terms of the quasi-orthogonality as follows: p(A) = cm
1 − (aTi aj )2
m−3 2
,
(5)
i<j
where $c_m$ is a constant. The detailed derivation of Equation (5) can be found in [2]. Bayes' theorem allows one to describe the probability of the model in terms of the likelihood of the data and the prior probability of the model. Thus, given the observation $z$, the posterior probability $p(A|z)$ can be derived as follows:

$$p(A|z) = \frac{p(z|A)\,p(A)}{p(z)}, \qquad (6)$$
where $p(z)$ is constant with respect to $A$. It is easier to estimate the mixing matrix that maximizes the logarithm of the posterior probability $p(A|z)$. Thus, taking the logarithm of Equation (6) and combining Equation (5) with Equations (3) and (4), we obtain the approximation of the log-posterior:

$$\log p(A|z(t), t = 1, .., T) \propto \sum_{t=1}^{T} \sum_{j=1}^{J} G\Big(\sum_{i\in\Omega_j} y_i^2\Big) + \alpha T \sum_{i<j} \log\big(1 - (a_i^T a_j)^2\big) + C, \qquad (7)$$

where $\alpha$ is a constant related to $c_m$.
2.2 Learning Rule

Gradient ascent maximization of the posterior probability over basis vector $a_k$ yields the following learning rule:

$$\Delta a_k \propto \eta\left[\sum_{t=1}^{T} z(t)\,\big(a_k^T z(t)\big)\, g\Big(\sum_{i\in\Omega_{j(k)}} (a_i^T z(t))^2\Big) + \alpha T \sum_{i<j} \frac{-2\, a_i^T a_j}{1-(a_i^T a_j)^2}\, b_k\right], \qquad (8)$$
where $\eta$ is the learning rate, and $\Omega_{j(k)}$ is the subspace to which $a_k$ belongs. $b_k$ is the $k$-th column vector of the matrix $B = [0, ..., a_j, ..., a_i, ..., 0]$, in which $a_j$ appears as the $i$-th column and $a_i$ as the $j$-th column. The function $g$ is the derivative of $G$. After each iteration of equation (8), the norm of the basis vector $a_k$ is renormalized to unity. This is different from ordinary ISA, where the mixing matrix is orthonormalized.
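As a concrete illustration, the following minimal NumPy sketch performs one such gradient step for the choice $G(u) = -\sqrt{u}$ (so $g(u) = -1/(2\sqrt{u})$); the function name, the small stabilizing constants, and the data layout (columns of Z are whitened samples) are our own assumptions and not part of the original formulation.

```python
import numpy as np

def update_basis(A, Z, subspaces, alpha, eta):
    """One gradient-ascent step on the log-posterior (7), cf. rule (8).
    A: (M, N) basis matrix with unit-norm columns; Z: (M, T) whitened data;
    subspaces: list of index arrays Omega_j; alpha, eta: constants."""
    Y = A.T @ Z                                    # dot products y_i = a_i^T z(t)
    g = lambda u: -0.5 / np.sqrt(u + 1e-9)         # g = G' for G(u) = -sqrt(u)
    grad = np.zeros_like(A)
    for omega in subspaces:
        e = (Y[omega] ** 2).sum(axis=0)            # subspace energy per sample
        grad[:, omega] += Z @ (Y[omega] * g(e)).T  # data term of (8)
    C = A.T @ A                                    # basis-vector dot products
    np.fill_diagonal(C, 0.0)
    grad += Z.shape[1] * alpha * (A @ (-2.0 * C / (1.0 - C**2 + 1e-9)))  # prior term
    A = A + eta * grad
    return A / np.linalg.norm(A, axis=0)           # renormalize columns to unit norm
```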
3 Simulations

We tested the algorithm for overcomplete independent subspace analysis on natural image data. The training set consists of 50,000 patches of size 16 × 16 that were randomly extracted from thirteen 256 × 512 pixel gray-scale images. We use the natural images of [1], which are available at http://www.cis.hut.fi/projects/ica/data/images/. The mean gray-scale value of the data (i.e., the DC component) was removed. The dimension of the data was reduced by principal component analysis, projecting onto the leading 160 eigenvectors of the data covariance matrix. Then, the data vectors were whitened as in most ICA methods. The log-posterior probability was maximized by an ordinary gradient method to estimate A, using the averaged version of the learning rule in equation (8). Note that there was no constraint of orthogonality of the basis vectors during each
Fig. 2. Learned bases from natural images. (a) complete case (40 subspaces and 4 basis vectors in each subspace) (b) 2× overcomplete case (40 subspaces and 8 basis vectors in each subspace).
iteration; only the norms of the basis vectors were set to unity. The mixing matrix was randomly initialized. The effects of varying the level of overcompleteness and the dimension of the subspaces were investigated in depth. The basis was set to be complete and 2× overcomplete, with 160 and 320 components, respectively. Figure 2 shows the estimated basis vectors for the complete case with four-dimensional subspaces and the 2× overcomplete case with eight-dimensional subspaces. To analyze the tiling properties of the estimated basis vectors, we fitted each basis vector with a Gabor function by minimizing the squared error between the estimated basis vector and the model Gabor. Figure 3 shows the distribution of parameters obtained by fitting Gabor functions to the complete and 2× overcomplete basis vectors. We can see that, as the level of overcompleteness increases, the scatter of points in the plots of location, spatial frequency, and orientation becomes denser and more uniform, and the distribution of phase is much closer to uniform.
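A minimal sketch of the preprocessing pipeline described above (DC removal, PCA reduction to 160 dimensions, whitening); the function name and array layout are assumptions for illustration only.

```python
import numpy as np

def preprocess_patches(patches, n_components=160):
    """patches: (P, 256) array of vectorized 16x16 gray-scale patches.
    Returns whitened data of shape (n_components, P)."""
    X = patches - patches.mean(axis=1, keepdims=True)  # remove the DC component
    C = np.cov(X, rowvar=False)                        # data covariance matrix
    w, E = np.linalg.eigh(C)
    idx = np.argsort(w)[::-1][:n_components]           # leading eigenvectors
    V = np.diag(1.0 / np.sqrt(w[idx])) @ E[:, idx].T   # PCA + whitening matrix
    return V @ X.T
```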
Fig. 3. The distributions of parameters derived by fitting Gabor functions for completeness and 2× overcompleteness. (a) Center location of the fitted Gabor within a patch. (b) Joint distribution of orientation and spatial frequency (plotted in the upper-half plane). (c) Histogram of the phase of the fitted Gabor (mapped to the range 0°–90°).
Furthermore, we compare the responses of all the feature subspaces and the corresponding linear filters for different stimulus cases. First, an optimal stimulus for each feature subspace was computed within a set of Gabor filters; the test stimuli for the subspace were likewise drawn from the set of Gabor functions. Each time, only one stimulus parameter was changed to see how the response changes. The tested parameters were location (shift), orientation, and phase. Figure 4 shows the median responses of the whole population of 40 subspaces and 320 linear filters for the 2× overcomplete case. The top row shows the absolute responses of the linear filters, and the
Fig. 4. Statistical curves for the whole population of subspaces and linear filters while shifting different Gabor parameters (orientation, frequency, and phase) with 2× overcompleteness. The solid line gives the median response in the population of all filters or subspaces. The dashed lines give the 90% and 10% percentiles of the responses.
bottom row shows the results for the feature subspaces. We can see that the responses of the subspaces are considerably invariant to phase, and somewhat invariant to position. The sharpness of tuning to orientation and spatial frequency remains roughly unchanged. Thus, invariance with respect to phase is a strong property of the feature subspaces. It is closely related to the response properties of complex cells in V1, which are selective for location, frequency, and orientation but independent of phase. In contrast, the responses of the linear filters show no invariance with respect to any of these parameters.
4 Discussions and Conclusions

We have demonstrated in this paper how the Bayesian approach can be employed for learning overcomplete representations by utilizing the quasi-orthogonality of basis vectors in a high-dimensional space, whereas ordinary ISA can only provide complete representations of basis functions. In addition, we examine the dot products (between basis vectors and whitened observed data vectors) instead of the basis function coefficients. Furthermore, our model need not impose the constraint of orthogonality on the basis vectors; only the norms of the basis vectors are set to unity during the learning process. In contrast, the basis vectors have to be orthogonal in ordinary ISA. Compared with methods for estimating overcomplete bases by maximum likelihood estimation, our method is as computationally efficient as basic ICA estimation. Another issue addressed in this paper is the relevance of the learned codes to neurobiological plausibility. Both complete and overcomplete basis functions adapted to natural images suggest functional similarities to the receptive fields of V1 neurons. Simulation results on natural image data demonstrate that our model can lead to the emergence of phase- and shift-invariant features—principal properties of visual complex cells as
well. The method shows promising prospects for extension to higher levels of cortical representation. An important concern in our model is the accuracy of the coefficient prior probability. Our overcomplete ISA algorithm can capture the underlying statistical structure of images, i.e., the energy correlations of coefficients in each subspace, whereas a Laplacian prior as used in overcomplete ICA algorithms cannot capture such higher-order statistics well, e.g., dependencies among the variances of filter outputs. The method finds compact descriptions of overcomplete representations and has potential in a wide variety of applications, such as image processing and pattern recognition.
Acknowledgments

The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National High-Tech Research Program of China (Grant No. 2006AA01Z125).
References

1. Hyvärinen, A., Hoyer, P.: Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation 12(7), 1705–1720 (2000)
2. Hyvärinen, A., Inki, M.: Estimating Overcomplete Independent Component Bases for Image Windows. Journal of Mathematical Imaging and Vision 17(2), 139–152 (2002)
3. Attneave, F.: Some informational aspects of visual perception. Psychol. Rev. 61(3), 183–193 (1954)
4. Barlow, H.B.: Possible principles underlying the transformation of sensory messages. Sensory Communication, 217–234 (1961)
5. Bell, A.J., Sejnowski, T.J.: The independent components of natural scenes are edge filters. Vision Research 37(23), 3327–3338 (1997)
6. Olshausen, B., Field, D.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583), 607–609 (1996)
7. Schwartz, O., Simoncelli, E.: Natural signal statistics and sensory gain control. Nature Neuroscience 4, 819–825 (2001)
8. Hyvärinen, A., Hoyer, P.O., Inki, M.: Topographic Independent Component Analysis. Neural Computation 13(7), 1527–1558 (2001)
9. Karklin, Y., Lewicki, M.S.: A Hierarchical Bayesian Model for Learning Nonlinear Statistical Regularities in Nonstationary Natural Signals (2005)
10. Lewicki, M., Olshausen, B.: Probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A 16(7), 1587–1601 (1999)
11. Popovic, Z., Sjostrand, J.: Resolution, separation of retinal ganglion cells, and cortical magnification in humans. Vision Research 41(10-11), 1313–1319 (2001)
12. van Hateren, J.H.: Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings: Biological Sciences 265(1394), 359–366 (1998)
13. Hecht-Nielsen, R.: Context vectors: general purpose approximate meaning representations self-organized from raw data. Computational Intelligence: Imitating Life, 43–56 (1994)
ICA-Based Image Analysis for Robot Vision

Naoya Ohnishi¹ and Atsushi Imiya²
¹ School of Science and Technology, Chiba University, Japan, Yayoicho 1-33, Inage-ku, Chiba, 263-8522, Japan, [email protected]
² Institute of Media and Information Technology, Chiba University, Japan, Yayoi-cho 1-33, Inage-ku, Chiba, 263-8522, Japan, [email protected]
Abstract. In this paper, we develop ICA-based obstacle detection and 3D-environment understanding for mobile robot navigation. From a camera mounted on a mobile robot, the robot observes a sequence of images. This sequence of images allows the robot to compute optical flow, which is the apparent motion of each point in the image. We apply ICA to the optical flow field computed from images captured by the camera mounted on the robot. ICA-based separation of optical flow yields an obstacle region and a ground-plane region in space. For these applications, we also introduce an ordering criterion for the independent components based on their variances.
1 Introduction

Independent Component Analysis (ICA) [6] extracts statistically independent features from signals and still images. In this paper, we apply independent component analysis to a dynamic image sequence for autonomous robot navigation; that is, ICA is applied to the optical flow [1] computed from an image sequence. Optical flow [1] is the apparent motion between successive images and, unlike edges or corner points, is independent of the features in the images. Furthermore, optical flow is considered to be fundamental information for navigation and obstacle avoidance in the context of biological data processing [11]. Therefore, the use of optical flow is valid for robot navigation using a vision system. In neuroscience, it is known that the medial superior temporal (MST) area performs visual motion processing, and it has been shown that the MST area of the brain uses independent components of optical flow for motion cognition [7,12]. Furthermore, since the optical flow field on an image can be represented as a linear combination of independent components of optical flow, we can use ICA for the detection of the dominant plane by separating the obstacles and the dominant part of an image. Our application of ICA separates the planes from image sequences. The statistical analysis of optical flow is addressed in [3,5] for robust motion understanding. Our ICA algorithm separates the optical flow on the dominant plane and in the obstacle areas of the flow observed through an uncalibrated camera mounted on a mobile robot. First, an optical flow field is computed
from a pair of successive images observed through a camera mounted on a mobile robot. The optical flow is used for the estimation of the homography of the dominant plane; the homography can be approximated by an affine transform. Then, we estimate the optical flow on the dominant plane using the obtained homography. The details of the algorithm are described in reference [9]. Next, the optical flow computed from the images and the estimated optical flow on the dominant plane are used as the input signals of ICA. Since two signals are input to ICA, two signals are output. For the concurrent detection of local and global motion, we use independent components of optical flow fields on pyramidal layers. It is known that animals, insects, and human beings use the independent components of optical flow fields for visual behavior [7,11]. For human object recognition, a hierarchical model has been proposed [4]. Furthermore, for the computation of optical flow, the pyramid transform of an image sequence is used for the analysis of global and local motion [2,8]. The pyramid transform generates multiple-resolution images as layered images. These layered images are used for the computation of optical flow in the original images, starting from the image in the lowest layer. This idea is based on the assertion that a global motion is described as a collection of local motions. We introduce the application of hierarchical image expression to motion analysis; that is, we develop an algorithm for the detection of layered optical flows from a multi-resolution image sequence.
2 ICA of Optical Flow Field

In this section, we introduce an algorithm for applying ICA to optical flow fields. The optical flow is the apparent motion of each point computed from two successive images [1]. Setting $I(x, y, t)$ to be the time-varying image, the optical flow is computed by solving the equation

$$I_x \dot{x} + I_y \dot{y} + I_t = 0, \qquad (1)$$

where $(\dot{x}, \dot{y})$ is the optical flow vector. To solve this singular equation, we adopt the Lucas-Kanade method with the pyramid transform [1,2,9,10]. Similar to ICA separating mixture signals into independent components, the MST area in the brain separates the motion fields from visual perception into independent components [7,12]. As previously introduced, we accept the assumption that optical flow fields observed by the moving camera are linear combinations of the optical flow fields of the dominant plane and the obstacles. That is, setting $\dot{u}_{dominant}$ and $\dot{u}_{obstacle}$ to be the optical flow fields of the dominant plane and the obstacles, respectively, the observed optical flow field $\dot{u}$ is approximately expressed by a linear combination of $\dot{u}_{dominant}$ and $\dot{u}_{obstacle}$ as

$$\dot{u} = a_1 \dot{u}_{dominant} + a_2 \dot{u}_{obstacle}, \qquad (2)$$

where $a_1$ and $a_2$ are the mixture coefficients, as shown in Fig. 1. This assumption is numerically and geometrically acceptable if the motion displacement is small
Fig. 1. Linear combination of optical flow fields in the scene: example of camera displacement and the corresponding optical flow fields. $a_1$ and $a_2$ are mixture coefficients.
Fig. 2. Ordering of independent components of optical flow fields. (a) Difference in the motions of the dominant plane and obstacles; the order of the components can be determined using the variances $\sigma_\alpha^2$ and $\sigma_\beta^2$. (b) Sorting using the norm $l$ for determination of the output order.
compared with the size of the obstacles, as shown in a numerical experiment. Therefore, ICA is suitable for the separation of optical flow into the independent flow components. For each image in a sequence, we consider that the optical flow vectors on the dominant plane correspond to independent components. For ICA of optical flow fields, we align the matrix of two-dimensional vectors into a one-dimensional array as

$$\dot{u} \rightarrow \big((\dot{u}, \dot{v})_1 \cdots (\dot{u}, \dot{v})_k \cdots (\dot{u}, \dot{v})_n\big) \rightarrow (\dot{u}_1 \cdots \dot{u}_n\ \dot{v}_1 \cdots \dot{v}_n) = \mathrm{vec}\,\dot{u}, \qquad (3)$$

since the relation $\dot{w} = \alpha \dot{u} + \beta \dot{v}$ leads to the relation $\mathrm{vec}\,\dot{w} = \alpha\,\mathrm{vec}\,\dot{u} + \beta\,\mathrm{vec}\,\dot{v}$. These steps are invertible. Therefore, it is possible to extract regions corresponding to $\dot{u}$ and $\dot{v}$ if the observation $\dot{w}$ is decomposed into two independent components $\dot{u}$ and $\dot{v}$. We use this vector, derived from a vector-valued image, as input to ICA.
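A minimal sketch of the alignment (3) and its inverse; the function names and the (h, w, 2) flow-array layout are our own conventions for illustration.

```python
import numpy as np

def flow_to_vector(flow):
    """flow: (h, w, 2) optical flow field -> 1D array (u_1..u_n, v_1..v_n)."""
    return np.concatenate([flow[..., 0].ravel(), flow[..., 1].ravel()])

def vector_to_flow(vec, h, w):
    """Inverse of flow_to_vector: the alignment is invertible."""
    n = h * w
    return np.stack([vec[:n].reshape(h, w), vec[n:].reshape(h, w)], axis=-1)
```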
3 ICA-Based Obstacle Detection

In this section, we present algorithms for obstacle detection using ICA and optical flow. The first algorithm is ground-plane detection for mobile robot
navigation; the details of the algorithm are described in [10]. The second algorithm detects multiple planes by extension of the first one. The third algorithm estimates multi-resolution obstacles using multi-layer optical flow fields.

3.1 Dominant-Plane Detection by ICA
For the detection of the dominant plane, ICA requires at least two input signals for separation into two independent components. We therefore use the optical flow field $\dot{u} = \{(\dot{u}, \dot{v})_{ij}\}_{i=1,j=1}^{h,w}$ and the planar flow field $\hat{u} = \{(\hat{u}, \hat{v})_{ij}\}_{i=1,j=1}^{h,w}$ as the input vectors of ICA, where $w$ and $h$ are the width and height of an image. Since the planar flow is the motion of the dominant plane relative to the robot motion, the use of the planar flow is suitable for separation into the dominant plane and obstacles. Setting $v_\alpha$ and $v_\beta$ to be the output vectors, $v_\alpha$ and $v_\beta$ have ambiguities in the order and the length of each component. We are thus required to determine which component carries the optical flow of the dominant plane and which that of the obstacle areas. We solve this problem using the difference between the variances of the norms of $v_\alpha$ and $v_\beta$. Setting $l_{\alpha,\beta} = \{l_{ij}\}_{i=1,j=1}^{h,w}$ to be the norm of $v_{\alpha,\beta} = \{(\dot{u}, \dot{v})_{ij}\}_{i=1,j=1}^{h,w}$, that is, $l_{ij} = |(\dot{u}, \dot{v})_{ij}|$, the variance $\sigma^2$ is computed as

$$\sigma^2 = \frac{1}{hw} \sum_{i=1,j=1}^{h,w} (l_{ij} - \bar{l})^2, \quad \text{where } \bar{l} = \frac{1}{hw} \sum_{i=1,j=1}^{h,w} l_{ij}. \qquad (4)$$
The motions of the dominant plane and the obstacles in the images are different, and the dominant-plane motion is smooth across the image compared with the obstacle motion, as shown in Fig. 2. Consequently, the output signal of the obstacle motion has a larger variance than the output signal of the dominant-plane motion. Therefore, if $\sigma_\alpha^2 > \sigma_\beta^2$, we use the norm $l_\alpha$ of the output flow field $v_\alpha$ for dominant-plane detection; else we use the norm $l_\beta$ of the output flow field $v_\beta$. Since the planar flow field is subtracted from the optical flow field including the obstacle motion, $l$ is constant on the dominant plane. However, the length of $l$ is ambiguous. We therefore use the median value of $l$ for the detection of the dominant plane. Since the dominant plane occupies the largest domain in the image, we compute the distance between $l$ and the median of $l$, as shown in Fig. 2(b). The area whose components take the median value is detected as the dominant plane. Setting $m$ to be the median value of the elements in $l$, the distance $d = \{d_{ij}\}_{i=1,j=1}^{h,w}$ is

$$d_{ij} = |l_{ij} - m|. \qquad (5)$$
We detect the area in which $d_{ij} \approx 0$ as the dominant plane. The procedure for dominant-plane detection by ICA is summarized as follows (a compact sketch is given after the list).

1. Input the optical flow field $\dot{u}$ and the planar flow field $\hat{u}$ to ICA, and output the optical flow fields $v_\alpha$ and $v_\beta$.
2. Compute the norms $l_\alpha$ and $l_\beta$ from $v_\alpha$ and $v_\beta$, respectively.
Fig. 3. Procedure for dominant-plane detection
Fig. 4. Input optical flow fields to ICA for translational motion in an environment with one obstacle. Left to right: captured image, optical flow field $\dot{u}$, planar flow field $\hat{u}$, detected dominant plane, and sorted norm $l$ of the output $v_\alpha$. The variances of $v_\alpha$ and $v_\beta$ are $\sigma_\alpha^2 = 1.60$ and $\sigma_\beta^2 = 0.51$, respectively. The median value is $m = 0.09$. The area where the norm $l$ is large corresponds to an obstacle, and the area where $l_{ij} \approx m$ corresponds to the dominant plane.
3. Compute the variances $\sigma_\alpha^2$ and $\sigma_\beta^2$ from $l_\alpha$ and $l_\beta$, respectively.
4. If $\sigma_\alpha^2 > \sigma_\beta^2$, then $l = l_\alpha$; else $l = l_\beta$.
5. Compute the distance $d$ between $l$ and the median of $l$.
6. Detect the area in which $d_{ij} \approx 0$ as the dominant plane.
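The steps above can be condensed into a few lines of NumPy; the threshold eps and the (h, w, 2) layout of the separated flow fields are assumptions for illustration.

```python
import numpy as np

def dominant_plane_mask(v_alpha, v_beta, eps=0.1):
    """v_alpha, v_beta: (h, w, 2) flow fields output by ICA.
    Returns a boolean mask that is True on the dominant plane."""
    l_a = np.linalg.norm(v_alpha, axis=-1)     # norms l_alpha (step 2)
    l_b = np.linalg.norm(v_beta, axis=-1)      # norms l_beta (step 2)
    l = l_a if l_a.var() > l_b.var() else l_b  # steps 3-4: pick larger variance
    d = np.abs(l - np.median(l))               # step 5: distance to the median
    return d < eps                             # step 6: d_ij ~ 0 -> dominant plane
```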
Figure 3 shows the procedure of dominant-plane detection from the image sequence using ICA. Figures 4 and 5 show experimental results on detecting the dominant plane.

3.2 Iterative Multiple Plane Segmentation
Using the dominant-plane-detection algorithm iteratively, we develop an algorithm for multiple-plane segmentation in an image. After removing the region corresponding to the dominant plane from an image, we can extract the second dominant planar region from the image. Then, it is possible to extract the third dominant plane by removing the second dominant planar area. This process is expressed as

$$D_k = \begin{cases} A(R), & k = 1, \\ A(R \setminus D_{k-1}), & k \geq 2, \end{cases} \qquad (6)$$

where $A$, $R$, and $D_k$ stand for the dominant-plane-extraction algorithm, the region of interest observed by the camera, and the $k$-th dominant planar area, respectively.
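A sketch of the iteration (6) over boolean region masks; the detector interface A(.), the maximum number of planes, and the minimum-size stopping rule are illustrative assumptions (the stopping criteria themselves are stated below).

```python
import numpy as np

def segment_planes(R, A, max_planes=3, min_size=50):
    """R: boolean mask of the region of interest; A: dominant-plane detector
    returning a boolean mask inside the given region (hypothetical interface)."""
    planes = []
    remaining = R.copy()
    for _ in range(max_planes):
        D = A(remaining) & remaining  # k-th dominant plane D_k = A(R \ D_{k-1})
        if D.sum() < min_size:        # stop when the plane becomes too small
            break
        planes.append(D)
        remaining &= ~D               # remove D_k from the region
    return planes
```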
Fig. 5. Results obtained using optical flows with error. (a) Translational motion in an environment with one obstacle ($m = 0.60$). (b) Rotational motion in an environment with one obstacle ($m = 0.68$). (c) Translational motion in an environment with two obstacles ($m = 0.45$). (d) Rotational motion in an environment with two obstacles ($m = 0.60$). Graphs in the bottom row show the sorted norm $l$ of the output $v_\alpha$.
Fig. 6. Top row: captured image and the dominant plane $d$ at the first, second, and third steps. Bottom row: optical flow and planar flow fields at the first, second, and third steps.
The algorithm stops after a pre-determined number of iterations, or when the size of the $k$-th dominant plane is smaller than a pre-determined size. Setting $R$ to be the root of the tree, this process derives a binary tree such that

$$R \supset D_1, \quad R \setminus D_1 \supset D_2, \quad R_2 \setminus D_2 \supset D_3, \quad \cdots \qquad (7)$$

Assuming that $D_1$ is the ground plane on which the robot moves, the $D_k$ for $k \geq 2$ are the planar areas on the obstacles. Therefore, this tree expresses the hierarchical structure of planar areas on the obstacles. We call this tree the binary tree of planes. Using this tree constructed by the dominant-plane-detection algorithm, we obtain geometrical properties of planes in a scene. For example, even if an object exists in the scene and lies on $D_k$, $k \geq 2$, the robot can navigate ignoring this object, using the tree of planes. Figures 6 and 7 show experimental results on detecting multiple planes.
Fig. 7. Image and results estimated from the Marbled-Block sequence. The white area is the first dominant plane; the light-gray and dark-gray areas are the second and third dominant planes.
3.3 Obstacle Detection Using ICA on Pyramid Layers

Our algorithm is processed at the layers $l = 0, \cdots, L$ of the pyramid transform. Using the optical flow field $u^l(x, y, t)$ at layer $l$, we detect obstacles in an image sequence. Figure 8 shows that, setting $O_l$ to be the obstacle region on the $l$-th layer, the hierarchical expression of obstacles satisfies the relations

$$O_0 \subset O_1 \subset \cdots \subset O_L \quad \text{and} \quad D_L \subset D_{L-1} \subset \cdots \subset D_0 \qquad (8)$$

for the dominant plane $D_K = R_{k0}$. These relations imply that a pair $C_l = (D_l, O_l)$ shows the global and local configuration of the work space for a larger and a
Fig. 8. Pyramidal representation of the Marbled-Block images in a simulated environment, the computed layer optical flow fields from the Marbled-Block images, and the detected obstacles at each layer.
smaller $l$, respectively. This hierarchical relation is automatically detected from a pyramid-based hierarchical expression of images for optical flow computation. The system selectively uses $C_l$ for navigation and spatial perception.
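For instance, with obstacle masks resampled to a common resolution, the nesting of relation (8) can be verified directly; the boolean-mask representation is an assumption for illustration.

```python
import numpy as np

def is_nested(obstacle_masks):
    """obstacle_masks: list of boolean masks O_0, ..., O_L at one resolution.
    Checks O_0 subset O_1 subset ... subset O_L (elementwise implication)."""
    return all(((~obstacle_masks[l]) | obstacle_masks[l + 1]).all()
               for l in range(len(obstacle_masks) - 1))
```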
4 Conclusion

Our algorithm is an application of ICA to vector-valued images, extending ICA-based image analysis from scalar- to vector-valued images. We proposed an ordering technique for independent components of vector-valued images. The ordering allows us to separate obstacles and ground planes from a sequence of images captured by a camera mounted on an autonomous robot. For each image in a sequence, it is shown that the dominant plane corresponds to an independent component, which provides a statistical definition of the dominant plane. The combination of our ICA-based image analysis and multi-resolution image representation provides a method that allows the detection of hierarchical configurations of obstacles. This hierarchical processing is achieved concurrently for the simultaneous detection of the local and global obstacle configuration in the work space.
References

1. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. International Journal of Computer Vision 12, 43–77 (1994)
2. Bouguet, J.-Y.: Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm. Intel Corporation, Microprocessor Research Labs, OpenCV Documents (1999)
3. Calow, D., Krüger, N., Wörgötter, F., Lappe, M.: Statistics of optic flow for self-motion through natural scenes. Dynamic Perception, 133–138 (2004)
4. Domenella, R.G., Plebe, A.: A neural model of human object recognition development. In: De Gregorio, M., Di Maio, V., Frucci, M., Musio, C. (eds.) BVAI 2005. LNCS, vol. 3704, pp. 116–125. Springer, Heidelberg (2005)
5. Fermüller, C., Shulman, D., Aloimonos, Y.: The statistics of optical flow. Computer Vision and Image Understanding 82, 1–32 (2001)
6. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Networks 13, 411–430 (2000)
7. Jabri, M.A., Park, K.-Y., Lee, S.-Y., Sejnowski, T.J.: Properties of independent components of self-motion optical flow. In: Proc. of IEEE Int. Symp. on Multiple-Valued Logic, pp. 355–362. IEEE Computer Society Press, Los Alamitos (2000)
8. Mahzoun, M.R., Kim, J., Sawazaki, S., Okazaki, K., Tamura, S.: A scaled multigrid optical flow algorithm based on the least RMS error between real and estimated second images. Pattern Recognition 32, 657–670 (1999)
9. Ohnishi, N., Imiya, A.: Featureless robot navigation using optical flow. Connection Science 17, 23–46 (2005)
10. Ohnishi, N., Imiya, A.: Dominant plane detection using optical flow and independent component analysis. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 497–506. Springer, Heidelberg (2005)
11. Vaina, L.M., Beardsley, S.A., Rushton, S.K.: Optic flow and beyond. Kluwer Academic Publishers, Dordrecht (2004)
12. Zemel, R.S., Sejnowski, T.J.: A model for encoding multiple object motions and self-motion in area MST of primate visual cortex. Neuroscience 18, 531–547 (1998)
Learning of Translation-Invariant Independent Components: Multivariate Anechoic Mixtures

Lars Omlor¹ and Martin A. Giese¹,²
¹ ARL, Hertie Institute for Clinical Brain Sciences, Tübingen, Germany
² School of Psychology, University of Wales, Bangor, UK
Abstract. For the extraction of sources with unsupervised learning techniques, invariance under certain transformations, such as shifts, rotations or scaling, is often a desirable property. A straight-forward approach for accomplishing this goal is to include these transformations and their parameters in the mixing model. For the case of one-dimensional signals in the presence of shifts, this problem has been termed anechoic demixing, and several algorithms for the analysis of time series have been proposed. Here, we generalize this approach for sources depending on multi-dimensional arguments and apply it for learning of translation-invariant features from higher-dimensional data, such as images. A new algorithm for the solution of such high-dimensional anechoic demixing problems based on the Wigner-Ville distribution is presented. It solves the multi-dimensional problem by projection onto multiple one-dimensional problems. The feasibility of this algorithm is demonstrated by learning independent features from sets of real images.
1 Introduction

Many common approaches in blind source separation (BSS) are based on linear instantaneous mixture models, where the output signals result from the linear superposition of source signals, time-point by time-point. In spite of its great success in feature extraction (see [3] for a review), this simple model usually fails when features appear under transformations, such as scaling, rotation, or shifts. Invariance against such transformations can be achieved by embedding them into the mixture model, requiring the estimation of additional parameters. This approach has been applied to acoustic data for modeling the transmission delays of different microphones, applying a generative model of the form:

$$x_i(t) = \sum_{j=1}^{d} \alpha_{ij} \cdot s_j(t - \tau_{ij}), \qquad i = 1, \ldots, m \qquad (1)$$
The time functions $x_i(t)$ signify the original signals, $s_j(t)$ the source signals, and $\alpha_{ij}$ the mixing weights. The constants $\tau_{ij}$ are the temporal shifts (delays) that need to be estimated. Since this model allows for delays, but not for reverberations of the sounds, it has been termed the anechoic mixing model. Most existing algorithms have treated this problem for the under-determined case,
where sources outnumber the sensors (i.e. $m \leq d$) [6,14,15,16,17]. Much fewer algorithms exist for the over-determined case $m \geq d$ (cf. [8]). The invariance of the estimated model under translation (time shifts) of the source signals is obvious from equation (1). This property remains valid if the time arguments are replaced by vectors, resulting in multivariate functions $x_i(t) = x_i(t_1, \ldots, t_n)$. In this case the scalar delays have to be replaced by displacement vectors $\tau_{ij} = \vec{\tau}_{ij} = (\tau_{ij1}, \cdots, \tau_{ijn}) \in \mathbb{R}^n$. Such a multivariate model is suitable for feature learning that accounts for shift-invariance in higher-dimensional spaces. In this paper, we present an algorithm for the solution of such generalized anechoic mixing problems for multivariate signals. The algorithm is derived by applying methods from stochastic time-frequency analysis to the generalized mixture model. These techniques allow the projection of the multi-dimensional mixture equation onto multiple one-dimensional problems. For their solution we introduce a modification of non-negative matrix factorization (NMF) [4,9] that extends this method to convolutive models. The efficiency of the developed algorithm is demonstrated by shift-invariant learning of features from sets of gray-level images.
2 Derivation of the Algorithm

Notation: Throughout the paper the following notations will be used:

– $e_k \in \mathbb{R}^n$ denotes the $k$-th canonical unit vector
– $E$ denotes the expectation value
– $\mathcal{F} = \mathcal{F}^1$ denotes the (multidimensional) Fourier transform, $\mathcal{F}^{-1}$ its inverse
– $T_\tau$, $M_f$ denote the time and frequency shift operators, e.g. $(T_\tau M_f x)(t) := e^{-2\pi i f(t-\tau)}\, x(t-\tau)$
– $\mathcal{R}_\xi$ denotes the multidimensional Radon transform; with $h$ being a square-integrable multivariate function and $\|\xi\| = 1$ it is defined [5] by

$$\mathcal{R}_\xi h(u) := \int h(t)\, \delta(\xi \cdot t - u)\, dt$$

– $\mathcal{F}^a$ denotes the (multidimensional) fractional Fourier transform. $\mathcal{F}^a$ is the $a$-th power of the classical Fourier operator [13]. Thus $\mathcal{F}^a$ is the operator with the same eigenfunctions as the Fourier transform (the Hermite-Gaussians $\psi_n$) but with the eigenvalues $(e^{-i n \pi/2})^a$ instead of $e^{-i n \pi/2}$.
– The notations $\mathcal{F}$ and $\mathcal{F}^a$ will also be used for the discrete versions of the fractional Fourier operators.

The algorithm for the solution of the multivariate anechoic mixture problem is based on the Wigner-Ville Spectrum (WVS), a time-frequency integral transform with particularly suitable properties for the solution of anechoic mixture problems [8]. In the following, we first introduce this transform, apply it to the mixture model, and derive the algorithm for a solution by projection onto one-dimensional problems. An efficient algorithm for the solution of the one-dimensional problems is presented in section 2.3.
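Since the discrete $\mathcal{F}^a$ is used repeatedly below, the following sketch shows one simple (though not necessarily the authors') numerical route: taking the $a$-th matrix power of the unitary DFT matrix. Branch choices of the matrix power may differ from the Hermite-eigenfunction definition, so this is illustrative only.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def frft(x, a):
    """Discrete fractional Fourier transform F^a of a 1D signal (a sketch)."""
    N = len(x)
    F = np.fft.fft(np.eye(N)) / np.sqrt(N)    # unitary DFT matrix
    return fractional_matrix_power(F, a) @ x  # a-th operator power

# a = 1 recovers the ordinary (unitary) DFT:
x = np.random.randn(8)
assert np.allclose(frft(x, 1.0), np.fft.fft(x) / np.sqrt(8))
```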
2.1 Wigner-Ville Spectrum
Stochastic time-frequency distributions provide powerful mathematical tools for the analysis of non-stationary random processes. The most prominent quadratic distribution is the Wigner-Ville spectrum (WVS), defined as [11]:

$$W_x(t, \omega) = \int E\left\{ x\left(t + \frac{\tau}{2}\right)\, x^*\left(t - \frac{\tau}{2}\right) \right\} e^{-2\pi i \omega \tau}\, d\tau \qquad (2)$$
where $x(t)$ is a random process and $x^*(t) = \bar{x}(t)$ the conjugated process. The WVS can loosely be interpreted as a time-frequency distribution of the mean energy of $x(t)$. This definition implies many useful properties; the following derivations rely on two of them in particular. The first is the time-frequency shift covariance given by

$$W_{(T_\tau M_f x)}(t, \omega) = W_x(t - \tau, \omega - f).$$

The second property is a relationship between the Radon transform of the WVS and the fractional Fourier transform of the original signal. In the case of a one-dimensional signal $h$ this relationship can be expressed as (cf. [13]):

$$|\mathcal{F}^a h(t)|^2 = \mathcal{R}_{a\pi/2}\, W_h(t, \omega).$$

Both properties can be easily generalized to the multidimensional case [2,12].
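For reference, a compact discrete-time sketch of the (deterministic) Wigner-Ville distribution of a single realization; averaging over realizations would estimate the WVS of (2). Oversampling or an analytic signal would be needed to avoid aliasing, which this toy version ignores.

```python
import numpy as np

def wigner_ville(x):
    """Discrete Wigner-Ville distribution of a 1D complex signal (a sketch).
    Row t holds the spectrum over the lag variable tau at time t."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    K = np.zeros((N, N), dtype=complex)
    for t in range(N):
        tau_max = min(t, N - 1 - t)
        tau = np.arange(-tau_max, tau_max + 1)
        K[t, tau % N] = x[t + tau] * np.conj(x[t - tau])  # x(t+tau)x*(t-tau)
    return np.fft.fft(K, axis=1).real  # FFT over the lag axis
```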
2.2 Application of the WVS to the Multivariate Mixture Model

Introducing the notation $\tau_{ij} = \vec{\tau}_{ij} = (\tau_{ij1}, \ldots, \tau_{ijn}) \in \mathbb{R}^n$ and with vectorial $t = (t_1, \ldots, t_n) \in \mathbb{R}^n$, the multivariate anechoic mixing model can be written compactly:

$$x_i(t) = \sum_{j=1}^{d} \alpha_{ij} \cdot s_j(t - \tau_{ij}), \qquad i = 1, \cdots, m \qquad (3)$$
With the assumption that the multivariate source signals are independent and have zero means, i.e. $E\{s_i\} = 0$ and $E\{s_i s_j\} = 0\ \forall i \neq j$, the mixture model (3) can be mapped into the time-frequency domain by application of the WVS:

$$W_{x_i}(t, \omega) = \sum_{j,k=1}^{d} \int E\left\{ \alpha_{ij} \alpha_{ik}\, s_j\left(t + \frac{\tau}{2} - \tau_{ij}\right) s_k^*\left(t - \frac{\tau}{2} - \tau_{ik}\right) \right\} e^{-2\pi i \omega \tau}\, d\tau$$
$$= \sum_{j=1}^{d} |\alpha_{ij}|^2\, W_{s_j}(t - \tau_{ij}, \omega), \qquad i = 1, \ldots, m \qquad (4)$$
A direct evaluation of this equation is possible only in the univariate case; even for two-dimensional time arguments the computational costs (in time and memory) grow prohibitively. However, equation (4) is redundant and can be solved by computing a set of projections onto lower-dimensional spaces that carry the same information as the original problem (3). Projections that integrate over unbounded domains with respect to the vectorial time parameter $t$ are particularly useful, as they eliminate the dependence on the unknown shifts $\tau_{ij}$.
If a sufficient number of integral projections is computed, the inversion theorem for the Radon transform guarantees that the solution of problem (4), and thus of equation (3), can be uniquely recovered. Let $\xi \in \mathbb{R}^n$ be an arbitrary unit vector; then the Radon transform of equation (4) can be computed explicitly:

$$\mathcal{R}_{\frac{\pi}{2}\xi}(W_{x_i}(t, \omega))(u) = E\{|\mathcal{F}^\xi x_i(u)|^2\} = \sum_{j=1}^{d} |\alpha_{ij}|^2\, \mathcal{R}_{\frac{\pi}{2}\xi}(W_{(T_{\tau_{ij}} s_j)}(t, \omega))(u) \qquad (5)$$
$$= \sum_{j=1}^{d} |\alpha_{ij}|^2\, E\{|\mathcal{F}^\xi s_j(u - (\tau_{ijl} \cos \xi_l)_l)|^2\} \qquad (6)$$
In contrast to the power spectrum of the ordinary Fourier transform, the fractional power spectrum is not shift-invariant. Instead, all displacement vectors $\tau_{ij}$ are scaled with the factors $\cos(\xi) = (\cos(\xi_1), \ldots, \cos(\xi_n))$ [13]. The special choice $\xi_l = \frac{\pi}{2}\ \forall l$ thus eliminates all delays from (6), since in this case the fractional Fourier transform reduces to the ordinary Fourier transform. Moreover, the special form of (6) makes it possible to specify conditions under which the projections are reversible. For every dimension, at least two fractional power spectra with distinct exponents suffice for inversion [10]. This implies that the special choice $\xi \in \{\varphi_k = (\frac{\pi}{2}, \ldots, \frac{\pi}{2}) - \varepsilon_k e_k\ \forall k\}$ (with the free parameters $\varepsilon_k \neq 0$) defines an invertible family of anechoic mixing problems. Each of these problems depends solely on a single component of the delay vectors $\tau_{ij}$. For example, this implies:

$$E\{|\mathcal{F}^{(1,\ldots,1,\varepsilon_k,1,\ldots)} x_i(u)|^2\} = \sum_{j=1}^{d} |\alpha_{ij}|^2\, E\{|\mathcal{F}^{(1,\ldots,1,\varepsilon_k,1,\ldots)} s_j(u - \tau_{ijk} \cos \varepsilon_k\, e_k)|^2\} \qquad (7)$$

By transformation of all multidimensional variables into column vectors, the equations (7) can be vectorized, each of them specifying a one-dimensional anechoic mixture with positivity constraints on all variables.
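To make the per-dimension projections of (7) concrete, the following self-contained sketch computes the fractional power spectra by applying the 1D transform separably along each axis (order 1 along all axes except the k-th); the helper names and the matrix-power construction of $\mathcal{F}^a$ are our own assumptions.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def frft_op(N, a):
    """a-th power of the unitary DFT matrix (principal branch; a sketch)."""
    F = np.fft.fft(np.eye(N)) / np.sqrt(N)
    return fractional_matrix_power(F, a)

def projected_power_spectrum(x, k, eps_k):
    """|F^{(1,...,1,eps_k,1,...)} x|^2 for one n-dimensional observation x."""
    y = x.astype(complex)
    for ax in range(x.ndim):
        M = frft_op(x.shape[ax], eps_k if ax == k else 1.0)
        y = np.moveaxis(np.tensordot(M, np.moveaxis(y, ax, 0), axes=1), 0, ax)
    return np.abs(y) ** 2
```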
2.3 Solution of the One-Dimensional Problems
The one-dimensional problems derived from (4) can be written more compactly with the non-negative time series $y_i(t)$, $i = 1, \ldots, m$ and sources $\sigma_j(t)$, $j = 1, \ldots, d$, $t \in \mathbb{R}$. The solution of these problems can be obtained by solving the optimization problem:

$$\min_{a_{ij},\, \sigma_j,\, \tau_{ij}} \left\| y_i(t) - \sum_{j=1}^{d} a_{ij}\, \sigma_j(t - \tau_{ij}) \right\| \quad \text{subject to } \sigma_j \geq 0,\ a_{ij} \geq 0\ \forall i, j$$

This is a special case of a positive deconvolution problem:

$$\min_{\nu_{ij},\, \sigma_j} \left\| y_i(t) - \sum_{j=1}^{n} (\nu_{ij} * \sigma_j)(t) \right\| \quad \text{subject to } \nu_{ij} \geq 0,\ \sigma_j \geq 0\ \forall i, j \qquad (8)$$
For the implementation, time is discretized. In this case it is more convenient to adopt a matrix-vector notation. If $Y$, $A$, $\Sigma$ signify the vectorized time-discrete variables $y_i(l)$, $\nu_{ij}(l)$, $\sigma_j(l)$, and $\mathcal{A}$, $\mathcal{X}$ the block-circulant matrices that represent the sums of convolutions with $\nu_{ij}$ and $\sigma_j$, equation (8) can be rewritten:

$$\min_{\mathcal{A},\, \Sigma} \|Y - \mathcal{A}\Sigma\| = \min_{\mathcal{X},\, A} \|Y - \mathcal{X}A\| \quad \text{subject to } A, \mathcal{X}, \Sigma, \mathcal{A} \geq 0\ \forall i, j$$
This shows that the (discretized) deconvolution (8) is a special case of a non-negative matrix factorization problem. Depending on the exact error function or norm used for the minimization, it is possible to adopt different multiplicative update rules [4]. Since the adjoint of a convolution operator is again a convolution, fast implementations of many update rules can be obtained by exploiting the Fast Fourier Transform (FFT). A standard rule based on the Euclidean distance [9] can be written:

$$\nu_{ij} \leftarrow \nu_{ij}\, \frac{\mathcal{F}^{-1}\big(\overline{\mathcal{F}\sigma_j}\, \mathcal{F}y_i\big)}{\mathcal{F}^{-1}\big(\sum_k \overline{\mathcal{F}\sigma_j}\, \mathcal{F}\nu_{ik}\, \mathcal{F}\sigma_k\big)}\ \forall i, j, \qquad \sigma_j \leftarrow \sigma_j\, \frac{\mathcal{F}^{-1}\big(\sum_k \overline{\mathcal{F}\nu_{kj}} \cdot \mathcal{F}y_k\big)}{\mathcal{F}^{-1}\big(\sum_{p,l} \overline{\mathcal{F}\nu_{pj}} \cdot \mathcal{F}\nu_{pl} \cdot \mathcal{F}\sigma_l\big)}\ \forall j$$
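A compact FFT-based sketch of one such Euclidean multiplicative update pair for problem (8); the array shapes, the circular-convolution assumption, and the small constant guarding the denominators are our own choices.

```python
import numpy as np
from numpy.fft import fft, ifft

def euclidean_update(y, nu, sigma, tiny=1e-12):
    """One multiplicative update pair for min ||y_i - sum_j nu_ij * sigma_j||^2.
    y: (m, T) data, nu: (m, d, T) non-negative filters, sigma: (d, T) sources."""
    Fy, Fn, Fs = fft(y), fft(nu), fft(sigma)
    Fx = np.einsum('ijt,jt->it', Fn, Fs)               # model spectra per sensor
    # filter update: adjoint of convolution = correlation (conjugate spectrum)
    nu = nu * ifft(np.conj(Fs)[None] * Fy[:, None]).real \
            / (ifft(np.conj(Fs)[None] * Fx[:, None]).real + tiny)
    Fn = fft(nu)
    Fx = np.einsum('ijt,jt->it', Fn, Fs)               # refresh after nu update
    sigma = sigma * ifft(np.einsum('ijt,it->jt', np.conj(Fn), Fy)).real \
                  / (ifft(np.einsum('ijt,it->jt', np.conj(Fn), Fx)).real + tiny)
    return np.maximum(nu, 0), np.maximum(sigma, 0)
```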
More suitable for the anechoic case are update rules that allow control of the sparseness of the estimated sources or filters [4]:

$$\nu_{ij} \leftarrow \nu_{ij} \left[\mathcal{F}^{-1}\left(\overline{\mathcal{F}\sigma_j}\, \mathcal{F}\left(\frac{y_i}{\mathcal{F}^{-1}\big(\sum_k \mathcal{F}\nu_{ik}\, \mathcal{F}\sigma_k\big)}\right)^{\beta}\right)\right]^{\frac{\mu}{\beta(1+\lambda_1)}} \qquad (9)$$

$$\sigma_j \leftarrow \sigma_j \left[\mathcal{F}^{-1}\left(\sum_k \overline{\mathcal{F}\nu_{kj}}\, \mathcal{F}\left(\frac{y_k}{\mathcal{F}^{-1}\big(\sum_l \mathcal{F}\nu_{kl}\, \mathcal{F}\sigma_l\big)}\right)^{\beta}\right)\right]^{\frac{\mu}{(1+\lambda_2)\beta}} \qquad (10)$$
These learning rules are quite flexible, and a variety of regularizers can be included if additional information about the sources is available. The specific choice $\mu = 1.9$, $\beta = 2$, $\lambda_1 = 0.02$, $\lambda_2 = 0$ results in sparse features [4]. Summarizing the individual steps defines the complete algorithm for the solution of the one-dimensional positive anechoic mixture problem:

Input: Data $y_i$ for $i = 1, \ldots, m$
Initialize: Choose random values $\nu_{ij}, \sigma_j \geq 0$
for iter1 = 1 : maxiter1
    Update $\sigma_j$ using (10).
    for iter2 = 1 : maxiter2   % Sparse update for the filters
        Update $\nu_{ij}$ using (9).
        Normalize $\nu_{ij} = \nu_{ij}/\|\nu_{ij}\|$.
    end
    $\nu_{ij} = \max(\nu_{ij})\, e_l$ with $l \in \{l \mid \nu_{ij}(l) = \max(\nu_{ij})\}$   % Sparse filter replaced with delta function
end

2.4 Algorithm for Multivariate Anechoic Mixtures

The algorithm for solving the multivariate anechoic mixing problem can be summarized:

1. Input: Data $x_i \in \mathbb{R}^n$, $i = 1, \ldots, m$; parameters $\varepsilon_k \neq 0$, $k = 1, \ldots, n$.
2. Compute the (fractional) Fourier transforms $\mathcal{F}x_i$ and $\mathcal{F}^{(\varepsilon_k e_k)}(\mathcal{F}x_i) =: \mathcal{F}^{\varphi_k} x_i$.
3. Solve the 1D anechoic mixture problems

$$|\mathcal{F}^{\varphi_k} x_i(\omega)|^2 = \sum_{j=1}^{d} |\alpha_{ij}|^2\, |\mathcal{F}^{\varphi_k} s_j(\omega - \tau_{ijk}\, e_k \cos \varepsilon_k)|^2.$$
These subproblems can be treated with the deconvolutive NMF algorithm presented in section 2.3. This step provides estimates for the linear weights $\alpha_{ij}$, the delays $\tau_{ij}$, and the fractional power spectra $|\mathcal{F}^{\varphi_k} s_j|$.
4. Compute the Radon inversion, applying one of the following methods:
   – Gerchberg-Saxton phase retrieval: Given $n+1$ different fractional Fourier intensity spectra $|\mathcal{F}^{\varphi_k} s_j|$, the signals $s_j$ can be reconstructed by a modification of the original Gerchberg-Saxton phase recovery algorithm [10].
   – Deconvolution: For known weights and delays, (3) specifies a normal deconvolution problem with known mixing filters. Applying standard deconvolution algorithms (e.g. a Wiener filter), the sources $s_j$ can be retrieved by least-squares estimation.

In the last step of the algorithm, several other methods can be applied for the integration of the solutions of the one-dimensional problems. For small parameters $\varepsilon_k$ one can also compute the angular derivative of the fractional Fourier transforms; the unknown phases can then be recovered by integration [1].
3 Applications in Image Processing

The described algorithm has a broad application spectrum. In principle, it permits invariant learning of mixture models in an arbitrary number of dimensions. The algorithm described in section 2.4 is adjustable to under-, over- and even-determined mixtures, and can easily be modified by the inclusion of sparseness or positivity constraints. Specifically, the last two steps of the algorithm can be implemented using a broad range of established methods for non-negative matrix factorization and inverse Radon transformation. Given the limited available space, we present here only two examples from image processing, where the method is applied for the extraction of image components that reappear at different positions in different images. One set of images was generated by pasting two objects from an image database at different randomly chosen positions in images with a size of 150 × 150 pixels. The generated images are shown in figure 1. The goal of applying the algorithm is the extraction of the original objects from the image set. For the last step of the algorithm, two implementations were compared (the Gerchberg-Saxton algorithm and the deconvolution method). Figure 2 shows the results of the feature extraction. Both implementations retrieve the original objects; however, the deconvolution approach is clearly superior to phase retrieval. This is partially due to the very slow convergence of the Gerchberg-Saxton algorithm, especially for
small fractional powers and for inaccurate estimates of the power spectra. In addition, the deconvolution method exploits the specific structure of the mixture model a second time. A quantitative comparison shows that the images reconstructed from the extracted components explain 95% of the variance of the original images for the deconvolution method, but only 72% for the phase retrieval method.
Fig. 1. Example images defining an (over-determined) anechoic mixture in two dimensions
Fig. 2. Extracted features from the image set in figure 1 using two different algorithms for Radon inversion (left: Gerchberg-Saxton phase retrieval; right: deconvolution)
The second data set consisted of four gray-scale images taken with a digital camera and resampled at a resolution of 200 × 200 pixels (cf. figure 3). The photographs show two objects (scissors and a cup) that were placed at different positions on a wooden surface. Before application of the algorithm, the images were whitened [7] to level the correlation statistics of natural images, removing strong correlations between features on small spatial scales. In this case only the deconvolution method was implemented. The reconstruction explains 85% of the variance of the pre-whitened training images and recovers the original objects with reasonable accuracy.
Fig. 3. Left: Real Images. Right: Extracted Features from pre-whitened images.
4 Conclusion

We presented a generalization of the anechoic mixture problem for the multivariate case, which allows learning of independent components with invariance
against translation in multi-dimensional spaces. An efficient algorithm was presented that makes the high-dimensional problem tractable by projection onto one-dimensional problems, resulting in a computational complexity that grows linearly with the number of dimensions. We demonstrated the efficiency of this algorithm by learning independent features that reappear at different positions of real images. Invariant unsupervised learning of independent features has a vast number of other applications, e.g. in computer vision and computer graphics, 3D data analysis, and neural data analysis. Future work will extend the approach to new applications and optimize the computational steps. In addition, the robustness of the algorithm against different levels of reverberation will be tested.

Acknowledgments. This work was supported by HFSP, DFG, the Volkswagenstiftung, and the EU FP6 Project 'COBOL'.
References

1. Alieva, T., et al.: Signal reconstruction from two close fractional Fourier power spectra. SPIE Milestone Series 181, 566–577 (2006)
2. Alieva, T., Bastiaans, M.J.: Wigner distribution and fractional Fourier transform for two-dimensional symmetric optical beams. J. Opt. Soc. Am. 17, 2319–2323 (2000)
3. Choi, S., et al.: Blind source separation and independent component analysis: A review. Neural Information Processing - Letters and Review 6, 1–57 (2005)
4. Cichocki, A., et al.: New Algorithms for Non-Negative Matrix Factorization in Applications to Blind Source Separation. In: ICASSP-06, pp. 621–625 (2006)
5. Cuyt, A., et al.: Multidimensional Integral Inversion, with Applications in Shape Reconstruction. SIAM J. Sci. Comput. 27, 1058–1070 (2005)
6. Emile, B., Comon, P.: Estimation of time delays between unknown colored signals. Signal Processing 69, 93–100 (1998)
7. Gluckman, J.: Higher order whitening of natural images. In: IEEE Computer Society Conference, vol. 2, pp. 354–360 (2005)
8. Omlor, L., Giese, M.A.: Blind source separation for over-determined delayed mixtures. In: Neural Information Processing Systems, vol. 19, MIT Press, Cambridge (2007)
9. Lee, D.D., Seung, H.S.: Learning the parts of objects by Non-Negative Matrix Factorization. Nature 401, 788–791 (1999)
10. Tian-He, L., et al.: Image recovery from double amplitudes in fractional Fourier domain. Chinese Phys. 15, 347–352 (2006)
11. Martin, W.: Time-frequency analysis of random signals. In: Proc. IEEE Int. Conf. Acoust. Speech, Sig. Proc., vol. 82, pp. 1325–1328 (1982)
12. Matz, G., Hlawatsch, F.: Wigner distributions (nearly) everywhere. Signal Processing 83, 1355–1378 (2003)
13. Ozaktas, H.M., et al.: The Fractional Fourier Transform. John Wiley & Sons, Chichester (2001)
14. Roy, R., Kailath, T.: ESPRIT - Estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech, Sig. Proc. 37, 984–995 (1989)
15. Torkkola, K.: Blind separation of delayed sources based on information maximization. In: Proc. IEEE Int. Conf. Acoust. Speech, Sig. Proc. '96, pp. 3509–3512 (1996)
16. Yeredor, A.: Time-delay estimation in mixtures. Acoustics, Speech, and Signal Processing 5, 237–240 (2003)
17. Yılmaz, Ö., Rickard, S.: Blind Separation of Speech Mixtures via Time-Frequency Masking. IEEE Transactions on Signal Processing 52, 1830–1847 (2004)
Channel Estimation for O-STBC MISO Systems Using Fourth-Order Cross-Cumulants

Héctor J. Pérez-Iglesias and Adriana Dapena
Departamento de Electrónica y Sistemas, Universidade da Coruña
Campus de Elviña 5, 15071 A Coruña, Spain
Tel.: ++34-981-167000, Fax: ++34-981-167160
[email protected], [email protected]
Abstract. This paper proposes several algorithms to recover the transmitted signals in systems with multiple antennas that make use of orthogonal space-time block codes (O-STBC) to attain full transmit diversity. We interpret the scheme proposed by Alamouti in [1] and the half-rate systems presented in [2] as classic blind source separation (BSS) problems where the received signals (observations) are instantaneous mixtures of the transmitted signals (sources). In order to recover the sources, we first propose to perform an eigenvalue decomposition of matrices containing fourth-order cross-cumulants of the observations. Subsequently, we show that the performance of this approach can be improved by a simultaneous diagonalization of the cumulant matrices. This second approach can be interpreted as a particular case of the Joint Approximate Diagonalization of Eigen-matrices (JADE) algorithm for systems where the mixing matrix is orthogonal.
1 Introduction
Wireless communication systems that employ multiple antennas at both transmission and reception are commonly referred to as Multiple Input Multiple Output (MIMO) systems. One of the major advantages of MIMO systems is their ability to provide spatial diversity gains that decrease the Symbol Error Rate (SER) in multipath fading channels [3]. Diversity gain results from combining signals that experience independent signal fades. Achieving the promised performance gains, even in practical operating conditions, requires specific Space-Time Coding (STC) techniques that spread the transmitted symbols over the space and time dimensions. The popularity of STC is due to the techniques known as Orthogonal Space-Time Block Codes (O-STBC), where different versions of the original data are transmitted by
This work has been supported by Xunta de Galicia, Ministerio de Educación y Ciencia of Spain and FEDER funds of the European Union under grant numbers PGIDT05PXIC10502PN and TEC2004-06451-C05-01.
several transmitting antennas across several time-slots. In general, O-STBC does not provide coding gain; however, the decoding method is very simple and can be performed using linear processing. In addressing the issue of decoding complexity, Alamouti proposed in [1] a remarkable O-STBC scheme for transmission with two antennas, which is currently part of both the W-CDMA and CDMA-2000 standards [4]. This code achieves a transmission rate equal to one by transmitting a pair of symbols in one time-slot and the same pair with a different phase in the next time-slot. This scheme supports Maximum-Likelihood (ML) detection based only on linear processing at the receiver. Tarokh et al. [2,5] have developed a theory to design O-STBCs which also support ML detection with linear processing at the receiver. For any number of transmitting antennas, these codes achieve the maximum possible transmission rate for symbols drawn from any arbitrary real constellation, and rate 1/2 for complex constellations. For the specific case of three or four transmitting antennas, it is possible to achieve 3/4 of the maximum transmission rate using any complex constellation. The simulation results presented in [5] show that significant gains can be achieved by increasing the number of transmitting antennas with very little decoding complexity.
This paper shows in Section 2 that the system proposed by Alamouti and the half-rate coding schemes presented by Tarokh et al. can be interpreted as classic problems of BSS where a set of unknown signals (sources) must be recovered from instantaneous mixtures of them without resorting to pilot symbols. Using this model, we propose in Section 3 to estimate the channel matrix by performing an eigenvalue decomposition of fourth-order cross-cumulant matrices. Subsequently, we present an approach based on performing a simultaneous diagonalization of the cumulant matrices. Simulation results are presented in Section 4 to compare the performance of the proposed approaches. Finally, Section 5 is devoted to the conclusions.
2 O-STBC Schemes
We consider a Multiple Input Single Output (MISO) system, a particular case of MIMO systems, with P transmitting antennas and only one receiving antenna. Let $s_i$, $i = 1, \ldots, N$, be N zero-mean complex-valued statistically independent signals (sources). An O-STBC is defined by a $K \times P$ transmission matrix $\mathbf{G}^{(P)}$ formed by orthogonal columns containing linear combinations of these N sources and their conjugates. Each row represents a time slot and each column represents one transmitting antenna. Since K time-slots are used to transmit N signals, the code rate is defined as $R = N/K$. In this paper, we will consider the O-STBC code proposed by Alamouti in [1] and the half-rate code presented in [2] for four transmitting antennas.
These codes are characterised, respectively, by the matrices
$$\mathbf{G}^{(2)} = \begin{bmatrix} s_1 & s_2 \\ -s_2^* & s_1^* \end{bmatrix}, \qquad \mathbf{G}^{(4)} = \begin{bmatrix} s_1 & s_2 & s_3 & s_4 \\ -s_2 & s_1 & -s_4 & s_3 \\ -s_3 & s_4 & s_1 & -s_2 \\ -s_4 & -s_3 & s_2 & s_1 \\ s_1^* & s_2^* & s_3^* & s_4^* \\ -s_2^* & s_1^* & -s_4^* & s_3^* \\ -s_3^* & s_4^* & s_1^* & -s_2^* \\ -s_4^* & -s_3^* & s_2^* & s_1^* \end{bmatrix} \qquad (1)$$
Denoting by $x_k$ the received signal at the $k$-th time slot, we can write
$$x_k = \sum_{i=1}^{P} h_i\, G^{(P)}_{k,i} + n_k, \qquad k = 1, \ldots, K$$
where $P = 2, 4$ is the number of transmitting antennas, $h_i$ represents the channel path from the $i$-th transmitting antenna to the receiving antenna, and $n_k$ is modelled as additive white Gaussian noise (AWGN). The term $G^{(P)}_{k,i}$ denotes the element in the $k$-th row, $i$-th column of $\mathbf{G}^{(P)}$. By defining the observation vector $\mathbf{x} = [x_1, \ldots, x_{K/2}, x^*_{K/2+1}, \ldots, x^*_K]^T$, the source vector $\mathbf{s} = [s_1, \ldots, s_N]^T$ and the noise vector $\mathbf{n} = [n_1, \ldots, n_{K/2}, n^*_{K/2+1}, \ldots, n^*_K]^T$, the relationship between the observations and the sources can be written as
$$\mathbf{x} = \mathbf{H}^{(P)} \mathbf{s} + \mathbf{n} \qquad (2)$$
where $\mathbf{H}^{(P)}$ is the channel matrix, which depends on the number of transmitting antennas,
$$\mathbf{H}^{(2)} = \begin{bmatrix} h_1 & h_2 \\ h_2^* & -h_1^* \end{bmatrix}, \qquad \mathbf{H}^{(4)} = \begin{bmatrix} h_1 & h_2 & h_3 & h_4 \\ h_2 & -h_1 & h_4 & -h_3 \\ h_3 & -h_4 & -h_1 & h_2 \\ h_4 & h_3 & -h_2 & -h_1 \\ h_1^* & h_2^* & h_3^* & h_4^* \\ h_2^* & -h_1^* & h_4^* & -h_3^* \\ h_3^* & -h_4^* & -h_1^* & h_2^* \\ h_4^* & h_3^* & -h_2^* & -h_1^* \end{bmatrix} \qquad (3)$$
For simplicity of notation we will drop the superindex $(P)$ from $\mathbf{H}^{(P)}$. Note that the matrices $\mathbf{H}$ in (3) are orthogonal, i.e.,
$$\mathbf{H}^H \mathbf{H} = \|\mathbf{h}\|^2 \mathbf{I}_P \qquad (4)$$
where $\|\mathbf{h}\|^2 = \sum_{k=1}^{P} |h_k|^2$ for Alamouti's coding scheme and $\|\mathbf{h}\|^2 = 2\sum_{k=1}^{P} |h_k|^2$ for the half-rate system, $\mathbf{I}_P$ is the $P \times P$ identity matrix and $(\cdot)^H$ is the Hermitian operator. For Alamouti's coding scheme it also holds that $\mathbf{H}\mathbf{H}^H = \|\mathbf{h}\|^2 \mathbf{I}_P$.
Since the observations can be interpreted as instantaneous mixtures of the sources (equation (2)), the channel matrix can be found using many existing BSS algorithms (see, for instance, [6] and references therein). Most of these algorithms only take into account the independence of the sources and they
do not consider other properties of the channel. Recently, algorithms based on second-order statistics have been developed for blind channel estimation in O-STBC transmissions [7,8]. In practice, when these methods are used for Alamouti's coding scheme, the communication system must be modified by including a precoder before the O-STBC encoder. In this paper we investigate the performance of approaches based on fourth-order cross-cumulants, which do not require additional modules in the encoder.
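To make the signal model concrete, the following sketch (ours, not part of the original paper; Python/NumPy with illustrative names) builds the Alamouti channel matrix of (3), generates noisy observations according to (2), and checks the orthogonality property (4):

```python
import numpy as np

def alamouti_channel(h):
    """2x2 Alamouti channel matrix H of eq. (3), built from h = [h1, h2]."""
    h1, h2 = h
    return np.array([[h1, h2],
                     [np.conj(h2), -np.conj(h1)]])

rng = np.random.default_rng(0)
h = (rng.standard_normal(2) + 1j * rng.standard_normal(2)) / np.sqrt(2)  # Rayleigh paths
H = alamouti_channel(h)

# Orthogonality property (4): H^H H = ||h||^2 I_2
assert np.allclose(H.conj().T @ H, np.sum(np.abs(h) ** 2) * np.eye(2))

# QPSK sources and observations x = H s + n, eq. (2); L is the packet size
L = 500
s = (rng.choice([-1.0, 1.0], (2, L)) + 1j * rng.choice([-1.0, 1.0], (2, L))) / np.sqrt(2)
n = 0.05 * (rng.standard_normal((2, L)) + 1j * rng.standard_normal((2, L)))
x = H @ s + n
```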
3 Proposed Approaches
In this section we present several methods to estimate the channel matrix by diagonalizing matrices containing fourth-order cross-cumulants of the observations. We will consider that the sources have the same power and the same kurtosis. The fourth-order cross-cumulant matrices are defined by
$$\mathbf{C}[k,l] = \mathrm{c}_4(\mathbf{x}, \mathbf{x}^H, x_k, x_l^*), \qquad k,l = 1, \ldots, K \qquad (5)$$
$$\Rightarrow \quad \mathbf{C}[k,l](i,j) = \mathrm{c}_4(x_i, x_j^*, x_k, x_l^*) \qquad (6)$$
where $\mathrm{c}_4(x_1, x_2, x_3, x_4) = E[x_1 x_2 x_3 x_4] - E[x_1 x_2]E[x_3 x_4] - E[x_1 x_3]E[x_2 x_4] - E[x_1 x_4]E[x_2 x_3]$. These matrices can be decomposed as
$$\mathbf{C}[k,l] = \beta\, \mathbf{H}\, \boldsymbol{\Delta}[k,l]\, \mathbf{H}^H \qquad (7)$$
where $\beta$ is a complex-valued number and $\boldsymbol{\Delta}[k,l]$ is an $N \times N$ diagonal matrix of the form
$$\boldsymbol{\Delta}[k,l] = \mathrm{diag}(H_{k,1} H_{l,1}^*,\ H_{k,2} H_{l,2}^*,\ \ldots,\ H_{k,N} H_{l,N}^*) \qquad (8)$$
where $H_{m,n}$ denotes the element in the $m$-th row, $n$-th column of the matrix $\mathbf{H}$.
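For reference, a sample estimate of the cross-cumulant matrices (5)-(6) can be computed as in the sketch below (ours, in Python/NumPy; it assumes zero-mean observations arranged as a K x L array, so the fourth-order cumulant reduces to the moment expression given above):

```python
import numpy as np

def cum4_matrix(x, k, l):
    """Sample estimate of C[k, l](i, j) = c4(x_i, x_j*, x_k, x_l*), eqs. (5)-(6).

    x : (K, L) complex array of zero-mean observations, one column per block.
    k, l : 0-based indices (the text's C[1, 2] is cum4_matrix(x, 0, 1)).
    """
    K, _ = x.shape
    xc = np.conj(x)
    E = lambda z: z.mean(axis=-1)
    C = np.empty((K, K), dtype=complex)
    for i in range(K):
        for j in range(K):
            x1, x2, x3, x4 = x[i], xc[j], x[k], xc[l]
            C[i, j] = (E(x1 * x2 * x3 * x4)
                       - E(x1 * x2) * E(x3 * x4)
                       - E(x1 * x3) * E(x2 * x4)
                       - E(x1 * x4) * E(x2 * x3))
    return C
```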
3.1 Eigenvalue Decomposition
One way to estimate the channel matrix consists in computing the eigenvectors, $\mathbf{U}$, of the fourth-order cross-cumulant matrices. For Alamouti's coding scheme, $\mathbf{U}$ is a square matrix of dimension 2 × 2. In the case of the half-rate code, $\mathbf{U}$ contains the $P$ eigenvectors of dimension $K \times 1$ corresponding to the largest eigenvalues. The sources are recovered as $\hat{\mathbf{s}} = \mathbf{U}^H \mathbf{x}$. From equation (8) we deduce that the condition guaranteeing that the mixing system is identifiable using an eigenvalue decomposition is that the matrix $\boldsymbol{\Delta}[k,l]$ has distinct real values on its diagonal. In particular, Beres and Adve have proposed in [9] to identify the channel matrix for Alamouti's coding scheme by using the matrices $\mathbf{C}[1,1]$ or $\mathbf{C}[2,2]$. Note, however, that this approach fails when $|h_1| = |h_2|$, because $\boldsymbol{\Delta}[k,k]$ for both $k = 1$ and $k = 2$ then has equal entries on its diagonal and, therefore, $\mathbf{C}[k,k] = \rho_4 |h_1|^2 \mathbf{I}_2$.
Now we focus our attention on the matrix $\mathbf{C}[1,2]$ for the Alamouti coding system. In this case the cross-cumulant matrix can be written as in equation (7)
where $\beta = \rho_4 h_1 h_2$, $\rho_4 = \mathrm{c}_4(s_1, s_1^*, s_1, s_1^*) = \mathrm{c}_4(s_2, s_2^*, s_2, s_2^*)$ is the kurtosis and $\boldsymbol{\Delta}[1,2] = \mathrm{diag}(1, -1)$. Therefore, we deduce that the mixing system is always identifiable, independently of the channel path values. Note that for the half-rate code the matrices $\mathbf{C}[k,l]$, $k \neq l$, can also be written as in equation (7); however, the elements on the diagonal matrix $\boldsymbol{\Delta}[k,l]$ are complex-valued and they cannot be identified using an eigenvalue decomposition.
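A minimal sketch of the resulting estimator for the Alamouti scheme (ours; it reuses the cum4_matrix helper above and assumes the observations are arranged per block as x = [x1, x2*]^T):

```python
# Eigenvalue-decomposition estimator for the Alamouti scheme (sketch).
# With 0-based indices, the text's C[1, 2] is cum4_matrix(x, 0, 1); it equals
# rho4*h1*h2 * H diag(1, -1) H^H, so its eigenvectors are the (normalized)
# columns of H, up to phase and permutation.
C12 = cum4_matrix(x, 0, 1)
eigvals, U = np.linalg.eig(C12)
s_hat = U.conj().T @ x          # recovered sources, up to scaling and ordering
```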
3.2 Joint Diagonalization
The mixing matrix can also be computed by performing a simultaneous diagonalization of several fourth-order cross-cumulant matrices. Towards this aim, we have used the extension of the Jacobi technique described in [10].¹ This idea can also be interpreted as a particularisation of the Joint Approximate Diagonalization of Eigen-matrices (JADE) algorithm [11], obtained by exploiting the orthogonality of the mixing matrix. For Alamouti's coding scheme we have only diagonalized the matrices C[1,1] and C[1,2]. For the half-rate code, there are 64 cross-cumulant matrices of size 8 × 8; the high computational load required to compute these matrices and to perform their diagonalization is apparent. However, there are symmetry relations between the matrices: C[k,l] = C*[l,k], C[k,k] = C[k-4,k-4] for k = 5, ..., 8, and C[k,l] = C*[k-4,l-4] for k,l = 5, ..., 8. As a consequence, the number of matrices can be considerably reduced. In the simulations we have considered the following cases (a sketch of how these matrix sets can be assembled follows the list):
– Approach I: joint-diag(C[1,1], C[1,2], ..., C[1,8])
– Approach II: joint-diag(C[1,1], C[1,2], ..., C[1,8], C[2,3], C[2,4], ..., C[2,8])
– Approach III: joint-diag(C[1,1], C[1,2], ..., C[1,8], C[2,3], C[2,4], ..., C[2,8], C[3,4], C[3,5], ..., C[3,8])
– Approach IV: joint-diag(C[1,1], C[1,2], ..., C[1,8], C[2,3], C[2,4], ..., C[2,8], C[3,4], C[3,5], ..., C[3,8], C[4,5], ..., C[4,8])
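The following sketch (ours) assembles these matrix sets; the joint_diag call in the final comment is a placeholder for a joint diagonalizer such as the Jacobi-angles routine of [10], which is not reproduced here:

```python
def approach_set(x, n_rows):
    """Cumulant matrices for Approaches I-IV (n_rows = 1, ..., 4), 0-based.

    Approach I keeps C[1,1], ..., C[1,8]; each further approach adds the
    matrices C[r, r+1], ..., C[r, 8] of the next row, exploiting the symmetry
    C[k,l] = C*[l,k] to skip the redundant lower-triangular part.
    """
    mats = [cum4_matrix(x, 0, l) for l in range(8)]
    for r in range(1, n_rows):
        mats += [cum4_matrix(x, r, l) for l in range(r + 1, 8)]
    return mats

# e.g. Approach III, fed to a joint diagonalizer (placeholder, see [10]):
# V = joint_diag(approach_set(x, 3))
```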
4 Simulation Results
This section presents the results of several computer simulations carried out to verify the estimation algorithms explained in the previous sections. The experiments have been performed by transmitting QPSK over Rayleigh-distributed, randomly generated block fading channels. We assume no Intersymbol Interference (ISI), perfect synchronisation and sampling at the symbol period. The statistics in (5) have been calculated for each block by sample averaging over the block symbols. The performance has been measured in terms of the Symbol Error Rate (SER). For comparison purposes we also present the SER obtained with Perfect Channel State Information (Perfect CSI). The simulations have been performed using MATLAB code running on an Athlon XP 3200+.
¹ The MATLAB code is available at www.tsi.enst.fr/~cardoso/jointdiag.html
Fig. 1. SER versus SNR obtained for the Alamouti's coding scheme using packets of 500 symbols (SER on a logarithmic scale versus SNR from 0 to 25 dB; curves: Perfect CSI, Beres-Adve C[1,1], C[1,2], and joint-diag(C[1,1], C[1,2]))
In the first set of simulations, the QPSK signal has been coded using the Alamouti scheme over two transmitting antennas. Figure 1 shows the performance of the proposed approaches in terms of SER versus SNR. The SER has been obtained by averaging the results for 100,000 blocks of L = 500 symbols. In the figure we can see that the joint diagonalization method achieves the optimum performance, while the other approaches present a flooring effect at high SNR. For a SNR of 20 dB, Figure 2 shows the SER versus the packet size
Fig. 2. SER versus the packet size for the Alamouti's coding scheme obtained for a SNR of 20 dB (packet sizes from 0 to 1000 symbols; curves: Perfect CSI, Beres-Adve C[1,1], C[1,2], and joint-diag(C[1,1], C[1,2]))
Fig. 3. Time required to process 10^5 packets for the Alamouti's coding scheme (time in seconds versus packet size from 0 to 500 symbols; curves: Beres-Adve C[1,1] & C[1,2], joint-diag(C[1,1], C[1,2]), and JADE)
Fig. 4. SER versus SNR obtained for the half-rate code using packets of 500 symbols (curves: Perfect CSI, Beres-Adve C[1,1], and Approaches I-IV)
and Figure 3 shows the time needed to process 10^5 packets. We can see that the joint diagonalization method requires few symbols (about 200) to achieve the optimum SER, but the time required to process one packet is higher than that needed by the other approaches. For comparison, Figure 3 also shows the time used by JADE to process one block; the difference with respect to the joint-diagonalization method proposed in this paper is apparent.
In the second set of simulations, the QPSK signal has been coded using the half-rate code with four transmitting antennas. Figure 4 plots the SER versus SNR obtained using the method proposed by Beres and Adve and the results
obtained with the joint diagonalization method presented in Subsection 3.2. It can be seen that the Perfect CSI performance is achieved using Approaches III and IV. Comparing with the results obtained with the Alamouti code (Figure 1), we conclude that a significant gain can be achieved by increasing the number of transmitting antennas.
5 Conclusions
This paper has proposed several algorithms to recover, without using pilot symbols or training sequences, the transmitted signals in systems that make use of O-STBC. The basic idea is to estimate the channel parameters by calculating the eigenvectors of a square matrix formed by fourth-order cross-cumulants obtained from the signals received in different time-slots. Simulation results show that the best performance is obtained by performing a simultaneous diagonalization of the fourth-order cross-cumulant matrices.
References
1. Alamouti, S.M.: A simple transmit diversity technique for wireless communications. IEEE Journal on Selected Areas in Communications 16, 1451–1458 (1998)
2. Tarokh, V., Jafarkhani, H., Calderbank, A.R.: Space-Time Block Codes from Orthogonal Designs. IEEE Transactions on Information Theory 45(5), 1456–1467 (1999)
3. Goldsmith, A.: Wireless Communications. Cambridge University Press, Cambridge (2005)
4. Karim, M.R., Sarraf, M.: W-CDMA and cdma2000 for 3G Mobile Networks. McGraw-Hill, New York (2002)
5. Tarokh, V., Jafarkhani, H., Calderbank, A.R.: Space-Time Block Coding for Wireless Communications: Performance Results. IEEE Journal on Selected Areas in Communications 17(3), 451–460 (1999)
6. Hyvarinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 13(4-5), 411–430 (2000)
7. Shahbazpanahi, S., Gershman, A.B., Manton, J.H.: Closed-form Blind MIMO Channel Estimation for Orthogonal Space-Time Block Codes. IEEE Transactions on Signal Processing 53(12), 4506–4516 (2005)
8. Vía, J., Santamaría, I., Pérez, J., Ramírez, D.: Blind Decoding of MISO-OSTBC Systems based on Principal Component Analysis. In: Proc. of International Conference on Acoustics, Speech and Signal Processing, vol. IV, pp. 545–549 (May 2006)
9. Beres, E., Adve, R.: Blind Channel Estimation for Orthogonal STBC in MISO Systems. In: Proc. of Global Telecommunications Conference 2004, vol. 4, pp. 2323–2328 (November 2004)
10. Cardoso, J.-F., Souloumiac, A.: Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications 17(1), 161–164 (1996)
11. Cardoso, J.-F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings F 140(6), 362–370 (1993)
System Identification in Structural Dynamics Using Blind Source Separation Techniques

F. Poncelet¹, G. Kerschen¹, J.C. Golinval¹, and F. Marin²

¹ Aerospace and Mechanical Engineering Department (LTAS), University of Liège, 1 Chemin des Chevreuils, B-4000 Liège, Belgium
{FPoncelet,G.Kerschen and JC.Golinval}@ulg.ac.be
http://www.ltas-vis.ulg.ac.be
² V2i S.A., Liège, Belgium
[email protected]
http://www.V2i.be
Abstract. This paper explores the potential of Blind Source Separation (BSS) techniques for the estimation of modal parameters, namely the resonant frequencies, vibration modes and damping ratios. The concept of virtual sources, introduced in recent publications, allows BSS to be considered as a simple way of performing output-only modal analysis. This work illustrates the proposed methodology using free and random responses of an experimental truss structure.
Keywords: Blind Source Separation, Second-Order Blind Identification, Structural Dynamics, Experimental Application.
1 Introduction
Blind Source Separation (BSS) techniques were initially developed for signal processing in the early 80's, but during the last decade the number of application fields has never stopped increasing. This success certainly comes from two of their intrinsic characteristics. Firstly, the ambition of BSS (which is to recover unobserved source signals from their observed mixtures) is shared with many other research domains. Secondly, the small number of necessary assumptions allows the methodology to be applied to various kinds of data sets (resulting from fields as diverse as finance, image or speech processing, astrophysics, and even medicine). However, although BSS techniques have proved useful in numerous application domains, they were quite underused for many years in structural dynamics. Some applications were naturally carried out, such as damage detection, condition monitoring and discrimination between pure tones and sharp-pointed resonances, but modal parameter estimation remained quite marginal in these studies. Recently, using the concept of virtual sources, a one-to-one relationship between the vibration modes and the BSS modes (i.e., the mixing matrix) was demonstrated [1], allowing the use of BSS for modal analysis. Since then, two algorithms were tested, and one of them (namely Second-Order Blind Identification) seemed to perform quite well [2].
This paper explores this possibility using an experimental structure subjected to impulse and random excitations. The results are compared with those of a well-established modal analysis method, the so-called Stochastic Subspace Identification [3].
2 System Identification in Structural Dynamics: Modal Analysis
2.1 What Is Modal Analysis?
Precise knowledge of the dynamic characteristics is essential for the design and validation of many engineering products. The usual modeling of a dynamical system is defined by modal parameters, namely the natural frequencies f_i, vibration modes n_i and damping ratios ξ_i. A modal parameter prediction is usually performed using numerical techniques such as the finite element method. But because of possible uncertainties in the material behavior, boundary conditions and joint modeling, experimental validations are often necessary. These validations are performed using Experimental or Operational Modal Analysis (EMA or OMA). Modal analysis quickly proved to be very popular, and numerous approaches were developed to estimate modal parameters from the structural dynamic response. In 1973, Ibrahim proposed a robust time domain method [4]. The stochastic subspace identification method (and its variant versions) [3] and, more recently, the polyreference least-squares complex frequency domain method [5] are also efficient modal analysis methods. For further information about modal analysis, the interested reader may consult [6].
2.2 Why Another Method?
The lack of a priori knowledge about the number of natural frequencies existing in the range of interest usually prevents classical modal analysis methods from directly identifying the modal parameters. A model order (linked to the number of identified frequencies) has to be chosen and progressively increased. A diagram collecting all these frequencies for increasing order, the so-called stabilization diagram, is then plotted. This step allows the physical natural frequencies to be separated from the numerical ones introduced by the algorithm. Unfortunately, the separation between numerical and stabilized frequencies may require a great deal of expertise and can be cumbersome. Even though many modal analysis methods already exist, a method combining an automatic selection process (with a confidence criterion for each mode) and a physical interpretation of this choice would be of interest. The application of BSS techniques to perform modal analysis partially meets these objectives. During the last years, other statistical signal processing techniques have also been considered for the analysis of structural dynamic responses. The proper orthogonal decomposition (POD) is one of these, and its modes, the so-called proper orthogonal modes (POMs), have been linked to the structural normal modes [7]. Some additional information can be found in [8,9].
3 From Signal Processing to Modal Analysis
3.1 Second-Order Blind Identification (SOBI)
The basic idea of BSS is to recover the unobservable inputs of a system, called the sources s_i, only from the measured outputs x_i, even though very little, if anything, is known about the mixing system. The simplest BSS model assumes the existence of n source signals s_1(t), ..., s_n(t) and the observation of as many mixtures x_1(t), ..., x_n(t). Note that we focus on systems with linear and static mixtures. Using matrix notation, the noisy model can be expressed as
$$\mathbf{x}(t) = \mathbf{A} \cdot \mathbf{s}(t) + \boldsymbol{\sigma}(t) \qquad (1)$$
where A is referred to as the mixing matrix, and σ is the noise vector corrupting the data. Most BSS approaches are based on a model in which the sources are independent and identically distributed variables. Independent component analysis (ICA, [10]) does not escape the rule; the sample order has no importance in the method. The objective of SOBI is to take advantage, whenever possible, of the temporal structure of the sources to facilitate their separation. The SOBI algorithm consists in constructing several time-lagged covariance matrices R(τ) from the measured data and finding a matrix U which jointly diagonalizes all of them. This matrix corresponds to the mixing matrix A of (1).
$$\mathbf{R}(\tau) = E[\mathbf{x}(t+\tau) \cdot \mathbf{x}^*(t)] \qquad (2)$$
For further details about the SOBI method, the reader can refer to [11]. A simplified sketch of the idea is given below.
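This sketch is ours and deliberately simplified: after whitening, it diagonalizes a single time-lagged covariance matrix (the one-lag AMUSE special case) instead of jointly diagonalizing R(τ) over several lags as full SOBI does; all names are illustrative.

```python
import numpy as np

def sobi_one_lag(x, tau=1):
    """Simplified SOBI (single lag): estimate the mixing matrix A of eq. (1)
    from measured outputs x, an (n_channels, n_samples) real array."""
    x = x - x.mean(axis=1, keepdims=True)
    # Whitening from the zero-lag covariance matrix
    R0 = (x @ x.T) / x.shape[1]
    d, E = np.linalg.eigh(R0)
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T          # whitening matrix
    z = W @ x
    # Time-lagged covariance of eq. (2), symmetrized for numerical stability
    R_tau = (z[:, tau:] @ z[:, :-tau].T) / (z.shape[1] - tau)
    R_tau = (R_tau + R_tau.T) / 2
    # Eigenvectors give the orthogonal unmixing in the whitened space; full
    # SOBI would jointly diagonalize R(tau) for several lags instead.
    _, U = np.linalg.eigh(R_tau)
    A_hat = np.linalg.inv(W) @ U                     # estimated mixing matrix
    sources = U.T @ z                                # estimated sources
    return A_hat, sources
```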
3.2 Concept of Virtual Source
The dynamic response of the mechanical systems considered in this study is described by the equation
$$\mathbf{M}\ddot{\mathbf{x}}(t) + \mathbf{C}\dot{\mathbf{x}}(t) + \mathbf{K}\mathbf{x}(t) = \mathbf{f}(t) \qquad (3)$$
where M, C and K are the mass, damping and stiffness matrices, respectively. The vector f represents the real excitation sources applied to the structure. The system response x(t) may be expressed as a mixture of these real sources f(t). Unfortunately, this mixture is a convolutive product between the impulse response function, denoted h(t), and the sources f(t), and the separation of convolutive mixtures of sources is not yet completely solved. An interesting alternative is to use the modal expansion. Indeed, the m normal modes n^(i) form a complete basis for the expansion of any m-dimensional vector (if m is the number of degrees of freedom). The response can then be expressed using modal superposition
$$\mathbf{x}(t) = \sum_{i=1}^{m} \mathbf{n}^{(i)} \eta_i(t) = \mathbf{N}\boldsymbol{\eta}(t) \qquad (4)$$
where the weight coefficients η_i are in fact the modal coordinates and represent the amplitude modulation of the corresponding normal modes n^(i). The similarity between equations (1) and (4) shows that the modal coordinates may act as virtual sources (which are statistically independent, as proved in [1,2]), regardless of the number and type of physical excitation forces. In addition, the time response can be interpreted as a static mixture of these virtual sources, which renders the application of BSS techniques possible. The SOBI algorithm (which requires sources with different spectral contents) is particularly appropriate for the separation of these sources. In the free response case of system (3), the theoretical expression of the normal coordinates is an exponentially damped harmonic function
$$\eta_i(t) = Y \exp(-\xi_i \omega_i t) \cos\left(\sqrt{1-\xi_i^2}\; \omega_i t + \alpha_i\right) \qquad (5)$$
where ω_i and ξ_i are the natural frequency and damping ratio of the i-th mode, respectively. The amplitude Y and the phase α are constants depending on the initial conditions. The modal coordinates are then monochromatic, with different spectral contents.
3.3 Procedure Details
In summary, a simple modal analysis procedure is proposed, using the modal coordinates as virtual sources. The procedure is as follows:
1. Perform experimental measurements of the structure response to obtain time series at different sensing positions.
2. Apply SOBI directly to the measured time series to estimate the mixing matrix A and the sources s(t).
3. The mode shapes are simply contained in the mixing matrix A.
4. In the case of random excitation, the identified (random) sources are transformed into free decaying responses using the NExT (Natural Excitation Technique) algorithm [12].
5. The identification of the other modal parameters (frequencies and damping ratios) is carried out by fitting the time series of the sources s(t) with the theoretical expression (5), as shown in the sketch after this list.
6. The fitting error between the identified and fitted sources is then computed, which allows the non-reliable virtual sources to be rejected easily.
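A possible implementation of the fitting step (ours; it uses SciPy's least-squares curve fit on model (5), and the sampling frequency fs and the initial guesses are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def damped_cosine(t, Y, f, xi, alpha):
    """Free decay of one modal coordinate, eq. (5), with frequency f in Hz."""
    w = 2 * np.pi * f
    return Y * np.exp(-xi * w * t) * np.cos(np.sqrt(1 - xi ** 2) * w * t + alpha)

def fit_modal_source(source, fs, f0):
    """Fit eq. (5) to one identified virtual source.  f0 is an initial
    frequency guess (e.g. the dominant FFT peak); returns the fitted
    parameters and the relative fitting error used to accept/reject modes."""
    t = np.arange(len(source)) / fs
    p0 = [np.max(np.abs(source)), f0, 0.01, 0.0]   # amplitude, freq, damping, phase
    params, _ = curve_fit(damped_cosine, t, source, p0=p0, maxfev=10000)
    error = np.linalg.norm(source - damped_cosine(t, *params)) / np.linalg.norm(source)
    return params, error
```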
4 Experimental Demonstration
The proposed modal analysis technique is now applied to the response of the truss structure depicted in Figure 1. The free and random response cases are considered. At each corner, on each storey, two accelerometers measure the horizontal responses. The quality of the identification results is evaluated by comparison with the covariance-driven stochastic subspace identification (SSI) method [3].
Fig. 1. Experimental fixture mounted on a 26kN electrodynamic shaker
4.1 Free Response
The free response was obtained using a hammer which provided a short impulse to the system. The sampling frequency was set to 5120 Hz, and the first 6000 samples of the measured time series were taken into account. The SOBI identification requires the definition of delays (for the construction of the correlation matrices (2)). Twenty delays were chosen, uniformly distributed between 0.0025 and 0.1 seconds, which covers the whole frequency range of interest.
Fig. 2. Fitting error of the 16 SOBI identified sources for the free response (a) and MAC comparison between SOBI and SSI modes (b)
Because there are 16 measurement locations, a total of 16 virtual sources can be considered. The fitting error of each source is shown in Figure 2(a). Eleven sources have a fitting error below 7% and can be safely retained. The sum of their
participation in the system response is above 97.7%. The identification results are listed in Table 1. Concerning the frequency and damping ratio identification, the SOBI results are very similar to those of the SSI method. Note that the damping ratios of SSI are presented as intervals because the value changes according to the chosen model order. The comparison between the two methods for the mode shapes is performed using the Modal Assurance Criterion in Figure 2(b): the closer the value to 1, the higher the correspondence. We can see a complete correlation between the modes identified using SSI and SOBI.

Table 1. Identified natural frequencies and damping ratios for the free response

SOBI Freq. [Hz]   SSI Freq. [Hz]   SOBI Damping Ratio [%]   SSI Damping Ratio [%]
75.94             75.82            0.20                     0.05 - 0.12
111.37            110.99           0.37                     0.40 - 0.60
130.75            130.76           0.21                     0.20 - 0.28
181.06            180.69           0.18                     0.20 - 0.28
256.30            256.48           0.18                     0.10 - 0.15
334.24            334.32           0.05                     0.02 - 0.05
345.75            345.76           0.04                     0.04 - 0.05
365.79            365.81           0.05                     0.05 - 0.06
374.34            374.45           0.15                     0.10 - 0.30
380.55            380.45           0.16                     0.20 - 0.40
396.91            396.81           0.08                     0.07 - 0.10

4.2 Forced Response
For the random response, the structure was mounted on a 26 kN electrodynamic shaker (see Figure 1). The sampling frequency was set to 5120 Hz, and 160000 samples were considered for the measured time series. The same parameters as previously were chosen for the SSI and SOBI methods. The fitting error of each identified source was computed and is presented in Figure 3(a). This time, 14 sources have a fitting error below 8%. Table 2 lists all the reliable identified results and Figure 3(b) compares the corresponding mode shapes. Once more the correspondence between both methods is remarkable, except for the mode at 75 Hz. If the results obtained using SSI in the free response case are taken as a reference, we note that neither method seems able to accurately estimate this mode: the MAC values SOBI random/SSI free and SSI random/SSI free are both lower than 0.65. Finally, we note that the SSI method was able to identify 4 more modes in the frequency range considered (around 162, 189, 204 and 294 Hz). Nonetheless, because the participation in the system response of the 14 sources identified using SOBI amounts to 93%, these four modes necessarily have a very low participation in the system response.
Fig. 3. Fitting error of the 16 SOBI identified sources for the random response (a) and MAC comparison between SOBI and SSI modes (b)

Table 2. Identified natural frequencies and damping ratios for the random response

SOBI Freq. [Hz]   SSI Freq. [Hz]   SOBI Damping Ratio [%]   SSI Damping Ratio [%]
74.75             74.68            2.15                     1.70 - 2.00
110.06            110.28           2.03                     1.50 - 2.00
133.59            133.77           0.85                     0.60 - 0.80
180.87            180.98           0.23                     0.20 - 0.30
245.29            245.38           0.16                     0.01 - 0.05
257.47            257.47           0.11                     0.09 - 0.11
333.21            333.34           0.12                     0.05 - 0.10
345.64            345.51           0.09                     0.10 - 0.12
365.60            365.76           0.12                     0.07 - 0.15
368.19            369.53           0.33                     0.15 - 0.30
374.34            374.69           0.16                     0.20 - 0.40
380.06            378.34           0.71                     0.50 - 0.70
390.81            390.95           0.33                     0.45 - 0.50
396.83            397.20           0.17                     0.15 - 0.25

5 Conclusions
Based on the virtual source concept, a new application has been developed for BSS methods, and particularly for the SOBI algorithm, in the field of structural dynamics. An output-only modal analysis technique is proposed. The experimental application shows that the method holds promise for the identification of mechanical systems for free as well as forced responses.
– A truly simple identification scheme is obtained for the modal parameters, due to the straightforward application of SOBI to the measured data.
– A seemingly robust criterion has been developed for the selection of reliable sources. The use of stabilization charts, which always requires a great deal of expertise, is therefore avoided. In addition, the selection of a model order, a common issue for conventional modal analysis techniques such as SSI, is not necessary.
– Compared to SSI, the computational load is much reduced, which makes the method a potential candidate for online modal analysis.
A possible limitation of the method is that the number of sensors should always be greater than or equal to the number of active modes. This will be addressed in subsequent studies.
Acknowledgments. The author F. Poncelet is supported by a grant from the Walloon Government (Contract N 516108-VAMOSNL). The author G. Kerschen is supported by a grant from the Belgian National Fund for Scientific Research (FNRS).
References
1. Kerschen, G., Poncelet, F., Golinval, J.C.: Physical interpretation of independent component analysis in structural dynamics. Mechanical Systems and Signal Processing 21, 1561–1575 (2007)
2. Poncelet, F., Kerschen, G., Golinval, J.C., Verhelst, D.: Output-only modal analysis using blind source separation techniques. Mechanical Systems and Signal Processing (in press, corrected proof available online January 5, 2007)
3. Van Overschee, P., De Moor, B.: Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers, Dordrecht (1996)
4. Ibrahim, S.R., Mikulcik, E.C.: A time domain modal vibration test technique. Shock and Vibration Bulletin 43, 21–37 (1973)
5. Peeters, B., Van Der Auweraer, H., Guillaume, P.: The PolyMAX frequency domain method: a new standard for modal parameter estimation. Shock and Vibration 11, 395–409 (2004)
6. Ewins, D.J.: Modal Testing: Theory, Practice and Application, 2nd edn. Research Studies Press, Hertfordshire (2000)
7. Feeny, B.F., Kappagantu, R.: On the physical interpretation of proper orthogonal modes in vibrations. Journal of Sound and Vibration 211, 607–616 (1998)
8. Kerschen, G., Golinval, J.C.: Physical interpretation of the proper orthogonal modes using the singular value decomposition. Journal of Sound and Vibration 249, 849–865 (2002)
9. Chelidze, D., Zhou, W.: Smooth orthogonal decomposition-based vibration mode identification. Journal of Sound and Vibration 292, 461–473 (2006)
10. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994)
11. Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing 45, 434–444 (1997)
12. James, G.H., Carne, T.G., Lauffer, J.P.: The natural excitation technique for modal parameter extraction from operating wind turbines. Report SAND92-1666, UC-261 (1993)
Image Similarity Based on Hierarchies of ICA Mixtures

Arturo Serrano, Addisson Salazar, Jorge Igual, and Luis Vergara

Universidad Politécnica de Valencia, Departamento de Comunicaciones, Camino de Vera s/n, 46022 Valencia, Spain
[email protected], [email protected]
Abstract. This paper presents a novel algorithm to build hierarchies from independent component analyzer mixtures and its application to image similarity measurement. The hierarchy algorithm builds an agglomerative (bottom-up) clustering from the estimated parameters (basis vectors and bias terms) of the ICA mixture. Merging at different levels of the hierarchy is performed using the Kullback-Leibler distance between clusters. The procedure is applied to merge similar patches of a natural image, to group different images of an object, and to create hierarchical levels of clustering from images of different objects. Results show suitable image hierarchies obtained by clustering from basis functions to higher-level structures.
1 Introduction

Independent component analyzer (ICA) mixture models were introduced in [1], considering a source model switching between Laplacian and bimodal densities. Recently this model has been relaxed using generalised exponential sources [2], self-similar areas as a mixture of Gaussian sub-features [3], and sources with non-Gaussian structures recovered by a learning algorithm using Beta divergence [4]. Real applications of those works include: separation of eye-movement artefacts from EEG recordings, separating 'background' brain tissue, fluids and tumours in fMRI images, and the separation of voices and background music in conversations. It is well known that local edge detectors can be extracted from natural scenes by standard ICA algorithms such as Infomax [5] or FastICA [6], or by new approaches such as Linear Multilayer ICA [7]. In addition, there is neurophysiological evidence suggesting a relation between primary visual cortex activity and the detection of edges, and theoretical dynamic models of the abstraction process from the visual cortex to higher levels have been proposed [8]. The contribution of this paper is a new algorithm that processes the parameters of ICA mixtures in order to obtain hierarchical structures from the basis function level (edges) to higher levels of clustering. In particular, the algorithm is applied to image analysis, obtaining promising results in discerning object similarity and suitable levels of hierarchies by processing image patches. This kind of feedforward process would suggest some relation with abstraction. The algorithm is agglomerative and uses the symmetric Kullback-Leibler distance [9] to select the grouping of the clusters at each level.
2 Hierarchy of ICA Mixtures
2.1 Estimation of the ICA Mixture Parameters

In the ICA mixture model, the observation vectors x are modelled as the result of applying a linear transformation A_k (W_k = A_k^{-1} is the filter matrix) to a vector s_k (sources), whose elements are independent random variables, plus a bias vector b_k, for each of the classes C_k (k = 1, ..., K, the number of ICAs). The probability of every available observation vector can be separated into the contributions due to every class. An iterative learning algorithm based on maximum-likelihood estimation (MLE) is used to adapt the parameters of the ICA mixtures, i.e., the basis functions and the bias terms of each class, using gradient ascent [1]. To estimate the probability density function of the sources, different priors can be used, such as Laplacian [1] or nonparametric densities [10].

2.2 Agglomerative Clustering

From the estimated ICA mixture parameters, a procedure that follows a bottom-up agglomerative scheme for merging the mixtures was developed. The conditional probability density of x for cluster C_k^h, k = 1, 2, ..., K − h + 1, in
level h = 1, 2, ..., K is p(x/C_k^h). At the first level, h = 1, it is modelled by K ICA mixtures, i.e., p(x/C_k^1) is
$$p(\mathbf{x}/C_k^1) = \left|\det \mathbf{A}_k^{-1}\right| p(\mathbf{s}_k), \qquad \mathbf{s}_k = \mathbf{A}_k^{-1}(\mathbf{x} - \mathbf{b}_k) \qquad (1)$$
At each consecutive level, two clusters are merged according to a minimum distance measure until, at level h = K, only one cluster remains. As distance measure we use the symmetric Kullback-Leibler distance between the ICA mixtures. It is defined for the clusters u, v by
$$D_{KL}(C_u^h, C_v^h) = \int p(\mathbf{x}/C_u^h) \log \frac{p(\mathbf{x}/C_u^h)}{p(\mathbf{x}/C_v^h)}\, d\mathbf{x} + \int p(\mathbf{x}/C_v^h) \log \frac{p(\mathbf{x}/C_v^h)}{p(\mathbf{x}/C_u^h)}\, d\mathbf{x} \qquad (2)$$
For level h = 1, from (2) we obtain (we write p_{x_u}(x) = p(x/C_u^1) and omit the superscript h = 1 for brevity):
$$D_{KL}(C_u, C_v) = D_{KL}(p_{x_u}(\mathbf{x})\, //\, p_{x_v}(\mathbf{x})) = \int p_{x_u}(\mathbf{x}) \log \frac{p_{x_u}(\mathbf{x})}{p_{x_v}(\mathbf{x})}\, d\mathbf{x} + \int p_{x_v}(\mathbf{x}) \log \frac{p_{x_v}(\mathbf{x})}{p_{x_u}(\mathbf{x})}\, d\mathbf{x} \qquad (3)$$
where, imposing the independence hypothesis and supposing for simplicity that both clusters have the same number of sources M (assuming that the sources follow the same model and that the data they generate lie in the same space):
$$p_{x_u}(\mathbf{x}) = \frac{\prod_{i=1}^{M} p_{s_u}(s_{u_i})}{\left|\det \mathbf{A}_u\right|}, \qquad s_{u_i} = \left[\mathbf{A}_u^{-1}(\mathbf{x} - \mathbf{b}_u)\right]_i$$
$$p_{x_v}(\mathbf{x}) = \frac{\prod_{j=1}^{M} p_{s_v}(s_{v_j})}{\left|\det \mathbf{A}_v\right|}, \qquad s_{v_j} = \left[\mathbf{A}_v^{-1}(\mathbf{x} - \mathbf{b}_v)\right]_j \qquad (4)$$
The pdf of the sources is approximated by a non-parametric kernel-based density for both clusters:
$$p_{s_u}(s_{u_i}) = \sum_{n=1}^{N} a\, e^{-\frac{1}{2}\left(\frac{s_{u_i} - s_{u_i}(n)}{h}\right)^2}, \qquad p_{s_v}(s_{v_j}) = \sum_{n=1}^{N} a\, e^{-\frac{1}{2}\left(\frac{s_{v_j} - s_{v_j}(n)}{h}\right)^2} \qquad (5)$$
where again, for simplicity, we have assumed the same kernel function for all the clusters, with the parameters a, h and the number of samples N adapted to each cluster. Note that this corresponds to a Gaussian mixture model where the number of Gaussians is maximal (one for every observation) and the weights are equal. Reducing to a standard mixture of Gaussians does not help in computing the Kullback-Leibler distance, because there is no analytical solution for it. Therefore, we prefer to maintain the non-parametric approximation of the pdf in order to model more complex distributions than a mixture of a small finite number of Gaussians. The symmetric Kullback-Leibler distance between the clusters u, v can then be expressed as
$$D_{KL}(p_{x_u}(\mathbf{x})\, //\, p_{x_v}(\mathbf{x})) = -H(\mathbf{x}_u) - H(\mathbf{x}_v) - \int p_{x_u}(\mathbf{x}) \log p_{x_v}(\mathbf{x})\, d\mathbf{x} - \int p_{x_v}(\mathbf{x}) \log p_{x_u}(\mathbf{x})\, d\mathbf{x} \qquad (6)$$
where H(x) is the entropy, defined as H(x) = −E[log p_x(x)]. To obtain the distance, we have to calculate the entropy for both clusters and the cross-entropy terms E_{x_v}[log p_{x_u}(x)], E_{x_u}[log p_{x_v}(x)]. The entropy for cluster u can be calculated through the entropy of the sources of that cluster, considering the linear transformation of the random variables and their independence (4):
$$H(\mathbf{x}_u) = \sum_{i=1}^{M} H(s_{u_i}) + \log \left|\det \mathbf{A}_u\right| \qquad (7)$$
The entropy of the sources cannot be calculated analytically. Instead, we can obtain a sample estimate Ĥ(s_{u_i}) using the training data. Denote the i-th source obtained for cluster u by {s_{u_i}(1), s_{u_i}(2), ..., s_{u_i}(Q_i)}. The entropy can then be approximated as follows:
$$\hat{H}(s_{u_i}) = -\hat{E}\left[\log p_{s_u}(s_{u_i})\right] = -\frac{1}{Q_i} \sum_{n=1}^{Q_i} \log p_{s_u}(s_{u_i}(n)), \qquad p_{s_u}(s_{u_i}(n)) = \sum_{l=1}^{N} a\, e^{-\frac{1}{2}\left(\frac{s_{u_i}(n) - s_{u_i}(l)}{h}\right)^2} \qquad (8)$$
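For illustration, a direct transcription of the estimate (8) (our Python/NumPy sketch; the kernel parameters a and h are assumed to have been fixed beforehand, e.g. by a standard bandwidth rule):

```python
import numpy as np

def kernel_pdf(points, samples, a, h):
    """Non-parametric source density of eq. (5) evaluated at `points`."""
    diff = (points[:, None] - samples[None, :]) / h
    return a * np.exp(-0.5 * diff ** 2).sum(axis=1)

def entropy_estimate(samples, a, h):
    """Sample entropy estimate of eq. (8): minus the mean log-density."""
    return -np.mean(np.log(kernel_pdf(samples, samples, a, h)))
```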
The entropy H(x_v) is obtained analogously:
$$H(\mathbf{x}_v) = \sum_{i=1}^{M} H(s_{v_i}) + \log \left|\det \mathbf{A}_v\right| \approx \sum_{i=1}^{M} \hat{H}(s_{v_i}) + \log \left|\det \mathbf{A}_v\right|,$$
$$\hat{H}(s_{v_i}) = -\frac{1}{Q_i} \sum_{n=1}^{Q_i} \log p_{s_v}(s_{v_i}(n)), \qquad p_{s_v}(s_{v_i}(n)) = \sum_{l=1}^{N} a\, e^{-\frac{1}{2}\left(\frac{s_{v_i}(n) - s_{v_i}(l)}{h}\right)^2} \qquad (9)$$
with Ĥ(s_{v_i}) defined analogously to (8); following the same procedure for the j-th source we can estimate Ĥ(s_{v_j}). Once the entropy is computed, we have to obtain the cross-entropy terms. After some operations, and considering the relationships x = A_u s_u + b_u, x = A_v s_v + b_v and thus s_v = A_v^{-1}(A_u s_u + b_u − b_v), the independence of the sources, and that the samples {s_{u_i}(1), s_{u_i}(2), ..., s_{u_i}(Q_i)}, i = 1, ..., M, and {s_{v_j}(1), s_{v_j}(2), ..., s_{v_j}(Q_j)}, j = 1, ..., M, for clusters u, v follow the corresponding distributions, we can estimate
$$\hat{H}(\mathbf{s}_v, s_{u_i}) = \frac{1}{\prod_{i=1}^{M} Q_i} \sum_{s_{v_1}=1}^{Q_1} \cdots \sum_{s_{v_M}=1}^{Q_M} \log \sum_{n=1}^{N} a\, e^{-\frac{1}{2}\left(\frac{\left[\mathbf{A}_u^{-1}(\mathbf{A}_v \mathbf{s}_v + \mathbf{b}_v - \mathbf{b}_u)\right]_i - s_{u_i}(n)}{h}\right)^2} \qquad (10)$$
and Ĥ(s_u, s_{v_j}) is defined analogously to (10), with s_u = A_u^{-1}(A_v s_v + b_v − b_u). Using the terms obtained above, we can estimate the symmetric Kullback-Leibler distance between the clusters u, v:
$$D_{KL}(p_{x_u}(\mathbf{x})\, //\, p_{x_v}(\mathbf{x})) = -\sum_{i=1}^{M} \hat{H}(s_{u_i}) - \sum_{j=1}^{M} \hat{H}(s_{v_j}) - \sum_{i=1}^{M} \hat{H}(\mathbf{s}_v, s_{u_i}) - \sum_{j=1}^{M} \hat{H}(\mathbf{s}_u, s_{v_j}) \qquad (11)$$
As we can observe, the similarity between clusters depends not only on the similarity between the bias terms, but also on the similarity between the distributions and the mixing matrices. Once the distances are obtained for all the clusters, the two clusters with minimum distance are merged at level h = 2. This is repeated at every step of the hierarchy until we reach one cluster at level h = K. To merge clusters at level h, we can calculate the distances from the distances at level h − 1. Suppose that from level h − 1 to h the clusters C_u^{h−1}, C_v^{h−1} are merged into cluster C_w^h. Then, the density for the merged cluster at level h is
$$p_h(\mathbf{x}/C_w^h) = \frac{p_{h-1}(C_u^{h-1})\, p_{h-1}(\mathbf{x}/C_u^{h-1}) + p_{h-1}(C_v^{h-1})\, p_{h-1}(\mathbf{x}/C_v^{h-1})}{p_{h-1}(C_u^{h-1}) + p_{h-1}(C_v^{h-1})} \qquad (12)$$
where p_{h−1}(C_u^{h−1}), p_{h−1}(C_v^{h−1}) are the priors, or proportions, of the clusters u, v at level h − 1. The remaining terms of the mixture model are the same at level h as at level h − 1. The only difference from one level to the next in the hierarchy is that there
is one cluster less; the prior for the new cluster is the sum of the priors of its components and its density is the weighted average of the densities that are merged to form it. Therefore, the distance at level h can be estimated easily starting from the distances at level h − 1, and so on down to level h = 1. Consequently, we can calculate the distance at level h from a cluster C_z^h to a merged cluster C_w^h, obtained by the agglomeration of clusters C_u^{h−1}, C_v^{h−1} at level h − 1, as the distance to its components weighted by the mixing proportions:
$$D_h(p_h(\mathbf{x}/C_w^h)\, //\, p_h(\mathbf{x}/C_z^h)) = \frac{p_{h-1}(C_u^{h-1}) \cdot D_{h-1}(p_{h-1}(\mathbf{x}/C_u^{h-1})\, //\, p_{h-1}(\mathbf{x}/C_z^{h-1}))}{p_{h-1}(C_u^{h-1}) + p_{h-1}(C_v^{h-1})} + \frac{p_{h-1}(C_v^{h-1}) \cdot D_{h-1}(p_{h-1}(\mathbf{x}/C_v^{h-1})\, //\, p_{h-1}(\mathbf{x}/C_z^{h-1}))}{p_{h-1}(C_u^{h-1}) + p_{h-1}(C_v^{h-1})} \qquad (13)$$
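The recursion (12)-(13) reduces the whole hierarchy to bookkeeping on a distance matrix, as in the following sketch (ours; D is the symmetric K x K distance matrix at level h = 1 computed via (11), and priors holds the initial cluster proportions):

```python
import numpy as np

def build_hierarchy(D, priors):
    """Agglomerative clustering: merge the closest pair at each level and
    update distances to the merged cluster with the weighted rule of eq. (13).
    Returns the list of merges (u, v, distance), i.e. the dendrogram."""
    D = D.astype(float).copy()
    p = list(priors)
    active = list(range(len(p)))
    merges = []
    while len(active) > 1:
        u, v = min(((a, b) for a in active for b in active if a < b),
                   key=lambda ab: D[ab[0], ab[1]])
        merges.append((u, v, D[u, v]))
        pw = p[u] + p[v]                  # prior of the merged cluster, eq. (12)
        for z in active:
            if z not in (u, v):           # eq. (13): cluster u absorbs cluster v
                D[u, z] = D[z, u] = (p[u] * D[u, z] + p[v] * D[v, z]) / pw
        p[u] = pw
        active.remove(v)
    return merges
```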
3 Application on Image Data

ICA can be used to analyze image patches as a linear superposition of basis functions. These vectors have been related to the detection of borders in natural images [5]; basis functions therefore have a physical relation with objects, and they can be used to measure the similarity between objects based on the ICA decomposition. In image patch decomposition, the set of independent components is larger than what can be estimated at one time, and what we get at one time is an arbitrarily chosen subset [11]. Nevertheless, ICA has been applied successfully in several image applications [1].

3.1 Object Similarity

For the hierarchical classification of images of objects, the COIL-100 database was used [12]. The database consists of different views of objects over a dark background. The images were preprocessed as follows. They were converted to greyscale and grouped into different views in order to obtain several images to train up to three classes per object. From each image, patches of 8 by 8 pixels were randomly taken to estimate the basis functions, after a whitening process with a reduction to 40 components. A total of 1000 patches per object were extracted [5]. The basis functions of each class were then calculated with the ICA mixtures algorithm, considering supervision, and using the Laplacian prior to estimate the source pdfs. Fig. 1 shows the 40 basis functions of six classes corresponding to different views of two objects. The basis functions of Fig. 1a correspond to a box with a label inscribed, whereas Fig. 1b corresponds to an apple. We can observe the similarity between the functions of each object as well as differences between objects; for instance, the lower frequency of the pattern corresponding to a natural object versus the frequency of the pattern of a more artificial object. The same data were used to measure the distance between classes, estimating the symmetric Kullback-Leibler distance from the mixture matrices calculated previously, as explained in Section 2. The distances reveal that basis functions allow finding
the similarity (short distances) between classes corresponding to the same object (intra-object), whereas distances are much longer between classes of different objects (inter-object), see Table 1.
Fig. 1. Two groups of basis functions corresponding to two different objects. The basis functions at the top (a) are from a little box and those at the bottom (b) are from an apple.

Table 1. Mean inter-object and intra-object distances for the objects of Fig. 1

Object      box (a)   apple (b)
box (a)     12.89     114.90
apple (b)   114.90    13.81
Additionally, experiments were performed in order to create a hierarchical classification of objects. Patches were sampled from a large number of objects, some of them very similar to one another. A hierarchical representation was then created by applying the agglomerative clustering algorithm. Fig. 2 shows an example of the classification of eight objects, with three main kinds of objects. The tree outlined by the dendrogram clearly shows a grouping of objects based on similarity of content, and suitable similarities between 'families' of objects; e.g., cars were more similar to cans than to apples.

3.2 Natural Images

The proposed algorithm was applied to natural images in order to obtain a bottom-up structure merging several zones of an image. Fig. 3 shows an image with 9 zones, some of them clearly different and others more or less similar to each other. The dendrogram of Fig. 3 shows how the zones are merged from the patches. It shows two broad kinds of basis functions: those corresponding to the part of the image that mainly contains portions of sky, and those corresponding to patches with a predominant portion of stairs (high frequency). The dendrogram also shows the distances at which the clusters are merged, which can be used as a similarity measure of the zones of the image. The bottom zones are merged at low distances due to the high similarity of their edges.
Fig. 2. Hierarchical representation of object agglomerative clustering. Three kinds of object ‘families’ are obtained.
Fig. 3. (Left) Image divided into nine zones. (Right) Hierarchical representation of the zones of the image based on basis function similarity, showing two broad groups of zones.
4 Conclusions

The new algorithm for hierarchical ICA mixtures uses the mixture matrices to calculate distances between the distributions of the independent sources based on a symmetric Kullback-Leibler distance. The estimation of the source pdfs is made using a non-parametric kernel-based approach, allowing adaptation to several kinds of densities. Clusters are merged using a bottom-up strategy, defining hierarchical levels and creating higher-level structures. The results of applying the hierarchical algorithm demonstrate its suitability for processing image data. Image content similarity between objects based on ICA basis functions allows learning an organization of objects at higher levels of abstraction, where the more separated the hierarchical levels, the more different the objects. Experiments with natural images showed an application to image segmentation based on the similarity of the different zones. The procedure could be extended to unsupervised or semi-supervised classification of images in order to discover meaningful hierarchical levels.
Many potential applications of the procedure could be approached, such as defect classification in non-destructive testing, where hierarchical levels would represent concepts such as material condition, kind of defect, defect orientation, or defect dimension [13].
Acknowledgements. This work was supported by the Spanish Administration under grant TEC 2005-01820.
References
1. Lee, T.W., Lewicki, M.S., Sejnowski, T.J.: ICA mixture models for unsupervised classification of non-Gaussian classes and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10), 1078–1089 (2000)
2. Penny, W.D., Roberts, S.: Mixtures of independent component analyzers. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) ICANN 2001. LNCS, vol. 2130, pp. 527–534. Springer, Heidelberg (2001)
3. Choudrey, R., Roberts, S.: Variational Mixture of Bayesian Independent Component Analysers. Neural Computation 15(1), 213–252 (2003)
4. Mollah, N.H., Minami, M., Eguchi, S.: Exploring Latent Structure of Mixture ICA Models by the Minimum β-Divergence Method. Neural Computation 18(1), 166–190 (2006)
5. Bell, A.J., Sejnowski, T.J.: The Independent Components of Natural Scenes are Edge Filters. Vision Research 37(23), 3327–3338 (1997)
6. van Hateren, J.H., van der Schaaf, A.: Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B 265, 359–366 (1998)
7. Matsuda, Y., Yamaguchi, K.: Linear multilayer ICA generating hierarchical edge detectors. Neural Computation 19(1), 218–230 (2007)
8. Lee, T.S., Mumford, D.: Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A 20(7), 1434–1448 (2003)
9. MacKay, D.J.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2004)
10. Vergara, L., Salazar, A., Igual, J., Serrano, A.: Data Clustering Methods Based on Mixture of Independent Component Analyzers. In: Proc. of ICA Research Network International Workshop (ICArn), Liverpool, pp. 127–130 (2006)
11. Hyvarinen, A., Hoyer, P.O., Inki, M.: Topographic independent component analysis. Neural Computation 13(7), 1527–1558 (2001)
12. Nene, S.A., Nayar, S.K., Murase, H.: Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96 (February 1996)
13. Salazar, A., Unió, J.M., Serrano, A., Gosalbez, J.: Neural networks for defect detection in non-destructive evaluation by sonic signals. In: IWANN 2007. LNCS, vol. 4507, pp. 631–638. Springer, Heidelberg (2007)
Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses

Xavier Sevillano, Germán Cobo, Francesc Alías, and Joan Claudi Socoró

GPMM - Grup de Recerca en Processament Multimodal
Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
Quatre Camins, 2. 08022 - Barcelona, Spain
{xavis,gcobo,falias,jclaudi}@salle.url.edu
Abstract. Deriving a thematically meaningful partition of an unlabeled text corpus is a challenging task. In comparison to classic term-based document indexing, the use of document representations based on latent thematic generative models can lead to improved clustering. However, determining a priori the optimal indexing technique is not straightforward, as it depends on the clustering problem faced and the partitioning strategy adopted. So as to overcome this indeterminacy, we propose deriving a consensus labeling upon the results of clustering processes executed on several document representations. Experiments conducted on subsets of two standard text corpora evaluate distinct clustering strategies based on latent thematic spaces and highlight the usefulness of consensus clustering to overcome the optimal document indexing indeterminacy.
1 Introduction
The ever-growing number of unlabeled digital text documents available calls for the development of automatic tools, such as document clustering systems, capable of organizing unlabeled document collections thematically. However, when facing any text clustering problem, practitioners must blindly make several decisions that largely condition the quality of the clustering results, such as selecting i) the document indexing technique used for representing the documents (including its dimensionality), ii) the clustering strategy employed for partitioning the data, or iii) the number of clusters to be found. As regards the former aspect, it is a commonplace that the application of feature transformations can improve clustering results significantly [6]. Thus, the influence of distinct indexing techniques on the performance of text clustering systems has been analyzed elsewhere [14, 17]. In this context, latent thematic generative models, such as Independent Component Analysis [8], Latent Semantic Analysis [1] or Non-negative Matrix Factorization [10], constitute an interesting option for finding document projections onto low-dimensional spaces where clustering can be conducted more efficiently and effectively than in the original, high-dimensional term vector space. For this reason, this work presents an extensive comparison between document clustering strategies based on the aforementioned latent thematic generative models.
Fig. 1. Extracting K latent thematic sources for clustering D documents into C clusters (block diagram: the T × D matrix X undergoes latent topic extraction to yield the K × D matrix Xlts, which is then clustered to produce the labeling vector λ)
Unfortunately, the conclusions drawn from this study are difficult to generalize, as no indexing technique guarantees a universally superior performance across distinct clustering problems, giving rise to what we call the data representation dependence effect [13]. In order to overcome this indeterminacy, we present a strategy based on consensus clustering which frees text clustering practitioners from the obligation of blindly selecting a single document representation, while still obtaining reasonably good clustering results. This paper is organized as follows: section 2 describes two clustering strategies based on latent thematic models, and section 3 presents several experiments regarding these strategies. Section 4 describes the consensus clustering proposal for overcoming the data representation dependence effect and presents related experiments. Finally, the conclusions of our work are discussed in section 5.
2 Clustering Strategies Based on Latent Thematic Models
The rationale behind latent thematic generative models establishes an analogy between the blind source separation (BSS) problem and the generation of text collections: the mixing of several latent random topics (or thematic sources) gives rise to a set of D documents (equivalent to the observations in the BSS scenario) [8, 18]. Therefore, as the goal of text clustering systems is to group documents into C thematically homogeneous clusters, recovering those latent thematic sources may lead to improved clustering. Figure 1 illustrates the application of latent topic extraction methods to document clustering. Let X denote the T × D term-by-document matrix representing the document corpus in the original high-dimensional vector space model (where T stands for the vocabulary size) [11]. The extraction of K << T latent thematic sources yields a K-dimensional representation of the D documents in the latent thematic space, Xlts. Subsequently, the documents are grouped into C clusters by applying a clustering process on Xlts, which yields the D-dimensional labeling vector λ, whose i-th component λ_i contains a numeric label identifying the cluster the i-th document is assigned to, i.e. λ_i ∈ {1, 2, ..., C}, ∀i ∈ {1, 2, ..., D}.
As regards the use of the latent thematic space document representation Xlts for clustering, two main approaches can be followed. Firstly, in what we call latent source driven clustering (LSDC), the latent topics are not used as document representations, but rather as cluster membership indicators (i.e. documents are assigned to a specific cluster depending on their maximally active latent source in
Secondly, following what we call latent thematic feature clustering (LTFC), Xlts is regarded as a projection of the documents onto a new feature space, where a clustering algorithm is applied [13, 14, 17]. Despite sharing a common conceptual background, the two approaches differ significantly from a practical viewpoint. In the LSDC approach, the number of latent topics retrieved must be tuned to match the desired number of clusters, i.e. K = C. Moreover, the clustering stage simply boils down to a cluster assignment based on finding the maxima in Xlts [7, 9]. In contrast, in LTFC, the number of clusters to be found is a parameter affecting the clustering stage rather than the latent topic extraction process, i.e. K is not necessarily equal to C. Furthermore, the grouping of the documents is conducted in this case by applying a standard partitioning algorithm on the latent thematic feature space where Xlts lies, which usually increases the computational cost of the clustering stage in comparison with the LSDC approach.
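To make the distinction concrete, the following sketch contrasts the two approaches on a toy matrix. It is illustrative only: NMF stands in for any of the latent topic extractors discussed here, k-means stands in for the partitioning algorithms compared later, and the random matrix replaces a real tf-idf corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((2000, 300))     # toy T x D term-by-document matrix
C = 6                           # number of clusters to find

# LSDC: extract K = C latent sources; each document joins the cluster of
# its maximally active source.
X_lts = NMF(n_components=C, init="nndsvd", max_iter=500).fit_transform(X.T)  # D x C
labels_lsdc = np.argmax(X_lts, axis=1)

# LTFC: extract K (not necessarily C) latent features, then run a standard
# partitioning algorithm on the projected documents.
K = 20
X_lts = NMF(n_components=K, init="nndsvd", max_iter=500).fit_transform(X.T)  # D x K
labels_ltfc = KMeans(n_clusters=C, n_init=10, random_state=0).fit_predict(X_lts)
```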
3 Experiments on LSDC and LTFC
The following experiments compare three well-known unsupervised latent thematic generative models applied to the document clustering task, following both the LSDC and LTFC approaches, namely: Latent Semantic Analysis (LSA), Independent Component Analysis (ICA) and Non-negative Matrix Factorization (NMF).¹ Two single-class balanced clustering problems have been created upon subsets of the standard miniNewsgroups [4] and OHSUMED [3] document collections. Table 1 summarizes the main aspects of both corpora: the predefined number of categories C (which is assumed to be known throughout all the experiments), the number of documents D, their vocabulary size T, and the average number of terms per document, Td. Experimental results are evaluated by comparing the labeling vector λ delivered by the clustering processes with the original labeling of the documents (enclosed in vector κ) in terms of the Normalized Mutual Information (φ(NMI)), a similarity measure ranging in value from 0 to 1 [16]:

φ(NMI)(κ, λ) = [ Σ_{h=1}^{C} Σ_{l=1}^{C} n_{h,l} log( D·n_{h,l} / (n_h^(κ) n_l^(λ)) ) ] / sqrt( ( Σ_{h=1}^{C} n_h^(κ) log(n_h^(κ)/D) ) ( Σ_{l=1}^{C} n_l^(λ) log(n_l^(λ)/D) ) )   (1)

where n_h^(κ) is the number of objects in cluster h according to κ, n_l^(λ) is the number of objects in cluster l according to λ, and n_{h,l} denotes the number of objects in cluster h according to κ as well as in group l according to λ [16].
¹ The ICA document representation is created by applying a version of the FastICA algorithm that maximizes skewness [7], using LSA (implemented by singular value decomposition) for pre-whitening and dimension reduction. The NMF-based document representation is created by applying a mean square reconstruction error minimization algorithm from NMFPACK [5].
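For reference, Eq. (1) can be computed in a few lines of numpy. This is a minimal sketch of the formula, not the evaluation code used in the experiments, and the function name is ours:

```python
import numpy as np

def phi_nmi(kappa, lam):
    """Normalized Mutual Information between two labelings, Eq. (1)."""
    kappa, lam = np.asarray(kappa), np.asarray(lam)
    D = len(kappa)
    hs, ls = np.unique(kappa), np.unique(lam)
    # Contingency counts n_{h,l} and the marginals n_h, n_l
    n_hl = np.array([[np.sum((kappa == h) & (lam == l)) for l in ls] for h in hs])
    n_h, n_l = n_hl.sum(axis=1), n_hl.sum(axis=0)
    nz = n_hl > 0                                   # skip empty cells (0 log 0 = 0)
    num = np.sum(n_hl[nz] * np.log(D * n_hl[nz] / np.outer(n_h, n_l)[nz]))
    den = np.sqrt(np.sum(n_h * np.log(n_h / D)) * np.sum(n_l * np.log(n_l / D)))
    return num / den
```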
Table 1. Document corpora subsets description

Corpus          C    D     T     Td
miniNewsgroups  6    600   3735  99
OHSUMED         11   1100  4705  120

Table 2. Latent source driven clustering results using LSA, ICA and NMF

Corpus          LSA          ICA          NMF
miniNewsgroups  .342 ± .018  .472 ± .001  .474 ± .038
OHSUMED         .188 ± .004  .229 ± .001  .213 ± .013
3.1 Latent Source Driven Clustering
This experiment evaluates the performance of LSA, ICA and NMF in the context of latent source driven clustering (i.e. as many latent thematic sources as desired clusters, C, are extracted and their values are used as cluster membership indicators). The mean values and standard deviations of the φ(NMI) scores obtained across 10 independent runs of this experiment are presented in table 2. For the miniNewsgroups corpus, the best average result is achieved by NMF-based LSDC (see table 2). In contrast, ICA is the latent source extraction method that yields the best results in the OHSUMED experiment. Note that, for both document collections, LSA is clearly outperformed by ICA and NMF, which suggests that the latent topics recovered by these two techniques are better aligned with the real thematic contents of the documents. To illustrate this fact, figure 2 compares the six latent thematic sources extracted by means of LSA and NMF with the categories in the miniNewsgroups corpus (100 documents per topic). Notice that the NMF1, NMF5 and NMF6 latent sources clearly identify the comp.graphics, sci.crypt and misc.forsale topics, in sharp contrast with the less defined pattern of the LSA latent sources, which partly explains the superiority of NMF in this experiment.

3.2 Latent Thematic Feature Clustering
This experiment compares the results of applying four state-of-the-art clustering algorithms – group average agglomerative clustering (AC), graph-based clustering (GC), direct clustering (DC) and repeated bisecting clustering (BC)² – on i) the original T-dimensional term vector space and ii) on LSA, ICA and NMF latent thematic feature spaces of dimensionalities ranging from K = 2 to K = 50. Figure 3 presents the results of this experiment, averaged across 10 independent runs. An interesting observation is that LTFC generally delivers better results than LSDC. For the miniNewsgroups corpus (figure 3a), the best clustering results are obtained using the original term-based representation, except with agglomerative clustering.

² Implementations provided by the CluTo clustering package, available online at http://glaros.dtc.umn.edu/gkhome/views/cluto
Fig. 2. Latent thematic sources extracted by LSA and NMF on the miniNewsgroups corpus (panels LSA1–LSA6 and NMF1–NMF6 plotted against document number for the six categories comp.graphics, rec.autos, sci.crypt, misc.forsale, talk.politics and talk.religion)
Fig. 3. Latent thematic feature clustering results for both corpora (miniNewsgroups on the left, OHSUMED on the right) applying four clustering strategies on latent feature spaces of varying dimensionality (each panel plots φ(NMI) against dimensionality, from 2 to 50, for the Terms, LSA, ICA and NMF representations)
In this case, the highest φ(NMI) is achieved when AC is conducted on a 7-dimensional NMF feature space. Note that fairly diverse results are obtained across the latent thematic feature space dimensionality range. Moreover, notice that nearly identical clustering results are yielded by all the clustering strategies when operating on the LSA and ICA feature spaces, which is in clear contrast with the situation found in LSDC (see table 2). In contrast, for the OHSUMED corpus (figure 3b), none of the clustering algorithms achieves its best results when operating on the term-based representation, and there exists an indeterminacy regarding both the optimal type of feature and the dimensionality of the latent thematic feature space. Above all, it should be noted that these optimality properties depend on the particular partitioning strategy employed and the clustering problem faced. These results suggest that it is not possible to claim universal superiority for any latent thematic model. To overcome this indeterminacy, we propose applying a consensus clustering strategy.
Fig. 4. Construction of a cluster ensemble Λ upon R document representations of dimensionalities K = {Km , . . . , KM }, and subsequent creation of a consensus clustering λc by means of a consensus function F
4 Consensus Clustering
Being the unsupervised counterpart of classifier committees, consensus clustering is the task of creating a consensus labeling λc by applying a consensus function F on a cluster ensemble Λ that collects the labelings output by several partitioning processes. Typical applications of cluster ensembles include knowledge reuse as well as distributed and robust clustering [16], the aim being, in this latter case, that λc approximates or even improves upon the best clustering in the ensemble. Following this principle, we propose overcoming the data representation dependence effect by building a cluster ensemble upon a set of R document indexing techniques, and subsequently constructing a consensus labeling λc (see figure 4). Several consensus functions have been proposed in the literature [2, 15, 16], and their general rationale is the application of cluster identification plus voting strategies across the labelings in the ensemble. For the sake of space, the reader is referred to [16] for a general introduction to consensus clustering.
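As a concrete illustration of the general idea, the sketch below derives a consensus labeling from a co-association matrix, in the spirit of the evidence-accumulation approach of [2]. It is not an implementation of the CSPA, HGPA or MCLA functions of [16] used in the experiments, and the function name is ours:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_labeling(labelings, C):
    """Average a co-association matrix over the ensemble, then cut a
    hierarchical clustering of it into C groups (cf. [2])."""
    L = np.asarray(labelings)                      # R x D matrix of labelings
    R, D = L.shape
    coassoc = np.zeros((D, D))
    for lam in L:                                  # fraction of partitions that
        coassoc += (lam[:, None] == lam[None, :])  # co-cluster each object pair
    coassoc /= R
    dist = 1.0 - coassoc                           # co-association -> distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=C, criterion="maxclust")  # D-dimensional labeling

# Usage: lam_c = consensus_labeling([lam_1, lam_2, lam_3], C=6)
```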
4.1 Experiments on Consensus Clustering
This experiment analyzes the ability of consensus clustering to overcome the uncertainty regarding optimal document indexing. To that effect, the 148 labelings delivered by each clustering algorithm in the LTFC experiment (section 3.2) have been collected into four cluster ensembles Λ (one per clustering strategy). Then, we have applied three state-of-the-art consensus functions on these ensembles: the Cluster-based Similarity Partitioning Algorithm (CSPA), the Hyper-Graph Partitioning Algorithm (HGPA) and the Meta-Clustering Algorithm (MCLA) [16]. It is important to note that the robustness to the data representation dependence effect is proportional to the closeness between the φ(NMI) of the consensus labeling and the maximum φ(NMI) of the labelings in the cluster ensemble. For this reason, figure 5 presents, through a φ(NMI) histogram (averaged across 10 experiment runs), a visual comparison between the labelings in the ensemble and the labelings derived by each consensus function. Complementarily, table 3 presents the relative φ(NMI) differences between the consensus labelings and the average and best labelings in the ensemble (referred to as ALE and BLE, respectively).
Fig. 5. φ(NMI) histogram of the cluster ensembles and consensus clusterings for both corpora (miniNewsgroups on the left, OHSUMED on the right) applying four clustering strategies and three consensus functions (CSPA, HGPA and MCLA)
Table 3. Relative φ(NMI) differences (mean value ± standard deviation) between the consensus labelings and the average and best individual labelings in the ensemble

Corpus               miniNewsgroups                             OHSUMED
F                    CSPA     HGPA        MCLA           CSPA     HGPA          MCLA
Δφ(NMI) w.r.t. ALE   +31.4%   −16 ± 15%   +31.6 ± 0.1%   +13.2%   −18.6 ± 6.2%  +17.2 ± 0.1%
Δφ(NMI) w.r.t. BLE   −6%      −40 ± 10%   −5.7 ± 0.1%    −11.6%   −36.5 ± 4.8%  −8.4 ± 0.1%
It can be observed that, in comparison to HGPA, the CSPA and MCLA consensus functions i) achieve notable robustness to the data representation dependence effect (i.e. the consensus labelings derived by CSPA and MCLA are closer to the BLE, being even better in some cases; see miniNewsgroups-GC or OHSUMED-BC in figure 5), and ii) show a more stable behaviour (i.e. smaller standard deviations) across the 10 experiment runs; in fact, CSPA yields repeatable results when operating on the same data.
5 Conclusions
One of the main difficulties encountered by text clustering practitioners is the uncertainty regarding the selection of a document indexing technique. In this context, document representations based on latent thematic generative models such as LSA, ICA or NMF can be advantageous with respect to the original term-based representation. However, the optimality of a single indexing technique is not a universal property, which gives rise to the so-called data representation dependence effect. Consensus clustering constitutes a reliable strategy to overcome this problem, and it can easily be extended to deal with additional indeterminacies, e.g. the one regarding the optimal clustering algorithm [13].
References

[1] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by Latent Semantic Analysis. J. American Society for Information Science 41(6), 391–407 (1990)
[2] Fred, A., Jain, A.K.: Combining Multiple Clusterings Using Evidence Accumulation. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(6), 835–850 (2005)
[3] Hersh, W., Buckley, C., Leone, T., Hickam, D.: OHSUMED: an interactive retrieval evaluation and new large text collection for research. In: Proc. of the 17th ACM SIGIR Conference, pp. 192–201 (1994)
[4] Hettich, S., Bay, S.D.: The UCI KDD Archive. University of California at Irvine, Dept. of Information and Computer Science (1999), http://kdd.ics.uci.edu
[5] Hoyer, P.O.: Non-Negative Matrix Factorization with Sparseness Constraints. J. Machine Learning Research 5, 1457–1469 (2004)
[6] Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: a Survey. ACM Computing Surveys 31(3), 264–323 (1999)
[7] Kabán, A., Girolami, M.: Unsupervised Topic Separation and Keyword Identification in Document Collections: a Projection Approach. Dept. of Computing and Information Systems, University of Paisley, Technical Report Nr. 10 (2000)
[8] Kolenda, T., Hansen, L.K., Sigurdsson, S.: Independent Components in Text. In: Girolami, M. (ed.) Advances in Independent Component Analysis, pp. 241–262. Springer, Heidelberg (2000)
[9] Kolenda, T.: Clustering text using Independent Component Analysis. Inst. of Informatics and Mathematical Modelling, Tech. University of Denmark, Technical Report (2002)
[10] Lee, D.D., Seung, H.S.: Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 401, 788–791 (1999)
[11] Sebastiani, F.: Machine Learning in Automated Text Categorisation. ACM Computing Surveys 34(1), 1–47 (2002)
[12] Sevillano, X., Alías, F., Socoró, J.C.: Reliability in ICA-Based Text Classification. In: Proc. of the 5th Intl. Conference on Independent Component Analysis and Blind Signal Separation, pp. 1210–1217 (2004)
[13] Sevillano, X., Cobo, G., Alías, F., Socoró, J.C.: A Hierarchical Consensus Architecture for Robust Document Clustering. In: Proc. of the 29th ECIR Conference, pp. 741–744 (2007)
[14] Shafiei, M., Wang, S., Zhang, R., Milios, E., Tang, B., Tougas, J., Spiteri, R.: A Systematic Study of Document Representation and Dimension Reduction for Text Clustering. Technical Report CS-2006-05, Dalhousie University (2006)
[15] Siersdorfer, S., Sizov, S.: Restrictive Clustering and Metaclustering for Self-Organizing Document Collections. In: Proc. of the 27th ACM SIGIR Conference, pp. 226–233 (2004)
[16] Strehl, A., Ghosh, J.: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. J. Machine Learning Research 3, 583–617 (2002)
[17] Tang, B., Shepherd, M., Milios, E., Heywood, M.I.: Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering. In: Proc. of the Intl. Workshop on Feature Selection for Data Mining, pp. 17–26 (2005)
[18] Xu, W., Liu, X., Gong, Y.: Document Clustering Based on Non-Negative Matrix Factorization. In: Proc. of the 26th ACM SIGIR Conference, vol. 2, pp. 267–273 (2003)
Top-Down Versus Bottom-Up Processing in the Human Brain: Distinct Directional Influences Revealed by Integrating SOBI and Granger Causality

Akaysha C. Tang¹, Matthew T. Sutherland¹, Peng Sun², Yan Zhang³, Masato Nakazawa¹, Amy Korzekwa¹, Zhen Yang¹, and Mingzhou Ding³

¹ University of New Mexico, Department of Psychology, Albuquerque, NM, USA
² University of New Mexico, Department of Electrical Engineering, Albuquerque, NM, USA
³ University of Florida, J. Crayton Pruitt Family Department of Biomedical Engineering, Gainesville, FL, USA
[email protected], [email protected], [email protected], zhyang@unm.edu, [email protected], [email protected], [email protected]
Abstract. Top-down and bottom-up processing are two distinct yet highly interactive modes of neuronal activity underlying normal and abnormal human cognition. Here we characterize the dynamic processes that contribute to these two modes of cognitive operation. We used a blind source separation algorithm called second-order blind identification (SOBI [1]) to extract from high-density scalp EEG (128 channels) two components that index neuronal activity in two distinct local networks: one in the occipital lobe and one in the frontal lobe. We then applied Granger causality analysis to the SOBI-recovered neuronal signals from these two local networks to characterize feed-forward and feedback influences between them. With three repeated observations made at least one week apart, we show that feed-forward influence is dominated by alpha while feedback influence is dominated by theta band activity and that this direction-selective dominance pattern is jointly modulated by situational familiarity and demand for visual processing. Keywords: electroencephalogram, second-order blind identification (SOBI), coherence, Granger causality, top-down, bottom-up, feed-forward, feedback.
1 Introduction

Second-order blind identification (SOBI) [1] is an emerging signal processing technique that can be used to facilitate source analysis from high-density EEG. Similar to other ICA algorithms that have been applied to EEG data [2], [3], SOBI can be used to isolate and remove ocular artifact [4]. In our laboratory, we have conducted extensive investigations to demonstrate the utility of SOBI in aiding source analysis from high-density EEG. Specifically, we have shown that: (1) SOBI can correctly recover known noise sources (noisy sensors and artificially injected noise at
known electrodes) and known neuronal sources (SI activation by median nerve stimulation) [5]; (2) SOBI can increase signal-to-noise ratios, leading to improved performance in single-trial ERP classification [6]; (3) SOBI can recover neuronal sources whose activations are correlated [7]; (4) SOBI can recover neuronal sources using EEG collected when the brain is in its default mode (i.e., the “resting” state) [8]; (5) SOBI can recover neuronal sources during free viewing of continuous streams of visual information [9]; and (6) SOBI can recover weak neuronal signals that temporally overlap with much stronger signals (e.g. signals associated with ipsilateral activation of primary somatosensory cortex) [10]. In this paper, we set out to achieve three goals. First, we seek to provide further validation for SOBI-recovered neuronal sources by investigating whether the same neuronal sources can be recovered from repeated EEG measures obtained days and weeks apart. Second, we combine SOBI with Granger causality analysis to show distinct patterns of theta (θ)/alpha (α) contributions in the feed-forward and feedback influences between the frontal and occipital cortices. Third, we investigate how such asymmetrical influence between the frontal and occipital cortices is modulated by sensory processing and by situational familiarity.
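As background for the methods that follow, here is a minimal numerical sketch of second-order separation. It implements the single-lag (AMUSE-style) special case rather than SOBI's joint diagonalization over multiple time delays, uses only numpy, and the function name is ours:

```python
import numpy as np

def amuse(x, tau=1):
    """One-lag second-order separation. x: channels x samples.
    Whiten using the zero-lag covariance, then diagonalize a single
    symmetrized time-lagged covariance of the whitened data."""
    x = x - x.mean(axis=1, keepdims=True)
    C0 = np.cov(x)                                  # zero-lag covariance
    d, E = np.linalg.eigh(C0)
    W = E @ np.diag(d ** -0.5) @ E.T                # whitening matrix
    z = W @ x
    C_tau = z[:, :-tau] @ z[:, tau:].T / (z.shape[1] - tau)
    C_tau = (C_tau + C_tau.T) / 2                   # symmetrize before eigh
    _, V = np.linalg.eigh(C_tau)
    return V.T @ z, V.T @ W                         # source time courses, unmixing
```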
2 Methods

Eight right-handed subjects volunteered to participate in the present study. All subjects were free of any history of neurological or psychological disorders. The experimental procedures were conducted in accordance with the Human Research Review Committee at the University of New Mexico. Each subject was tested in three sessions at Week 0, Week 1, and Week 4 or later. Up to 7 min of continuous 128-channel EEG data were collected at 1000 Hz during: (1) eyes-closed “resting”; (2) eyes-open “resting”; (3) video-viewing (a silently played nature video); (4) listening to only the audio track of the video; and (5) forming mental images of scenes from the video. This paper limits the discussion to conditions 1-3.

SOBI was applied to the continuous EEG data x(t), across all conditions, to extract the continuous time course of activation from two types of neuronal components: an anterior (A) and a posterior (P) component. For details on SOBI application, see [5]. Briefly, SOBI recovers the underlying sources, s(t), by minimizing the sum of squared cross-correlations between s_i(t) and s_j(t + τ), across all pairs of sources and across multiple time delays τ. A subset of SOBI-recovered components can be verified as neuronal sources via source localization using a forward model (e.g. BESA 5.0) [3]. Here we focused our analysis on two such neuronal components that correspond to focal regions within the frontal and occipital lobes.

Feed-forward (FF) and feedback (FB) influences were quantified by Granger causality between the two components, reflecting long-distance directional influences between the frontal and occipital cortices. Granger causality analysis was carried out on the continuous time courses, s_i(t), from the selected A and P components according to methods detailed in [11], [12]. As Granger causality can be decomposed into its frequency content, we computed the Granger causality spectrum and measured power within the θ (4-7 Hz) and α (8-14 Hz) bands using a 30-sec moving window with 5-sec increments.
Power in the θ and α bands from the A and P components was also computed as an indicator of synchronization within the local networks.
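For readers unfamiliar with the measure, the following toy function computes time-domain Granger causality from one signal to another by comparing restricted and full autoregressive fits. The spectral decomposition and windowing described above follow [11], [12] and are not reproduced here; the function name is ours:

```python
import numpy as np

def granger_causality(x, y, p=10):
    """Granger causality from y to x: log-ratio of the residual variance
    of an order-p AR fit of x on its own past (restricted) to a fit on
    the past of both x and y (full). Assumes len(x) == len(y)."""
    n = len(x)
    past = lambda s, i: s[p - i - 1 : n - i - 1]    # lag-(i+1) values of s
    Xr = np.column_stack([past(x, i) for i in range(p)])
    Xf = np.column_stack([Xr] + [past(y, i)[:, None] for i in range(p)])
    target = x[p:]
    res_r = target - Xr @ np.linalg.lstsq(Xr, target, rcond=None)[0]
    res_f = target - Xf @ np.linalg.lstsq(Xf, target, rcond=None)[0]
    return np.log(res_r.var() / res_f.var())        # > 0: y helps predict x
```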
3 Results

Reliable Extraction and Identification of Neuronal Components from Repeated Measures Made Weeks Apart. In all 8 subjects, across all 3 sessions, we were able to recover SOBI components that corresponded to two distinct neuronal sources, one localized to a rather focal region within the frontal cortex, in or near anterior cingulate cortex (ACC), and the other to focal regions within the occipital lobe (occipital gyrus). Repeated-measures ANOVA revealed no statistically significant differences in the location of the corresponding ECD models across the 3 recording sessions. As no session-to-session difference was found, the averaged locations across the 3 sessions are shown in Fig. 1. ECDs for each of the 8 subjects are superimposed in the figure, revealing a tight clustering of ECDs across subjects. This result demonstrates that SOBI can reliably recover components that correspond to anatomically well-defined brain regions even when the recording sessions were made weeks apart. It is important to emphasize that the recovery of these two neuronal sources was achieved without imposing constraints of fixation or use of event-related stimulation paradigms. Instead, subjects were allowed to freely move or blink their eyes as needed during the recording conditions. No segment of the EEG data was excluded prior to SOBI application. These unique features of SOBI processing have non-trivial implications for the study of mental disorders and the study of early development or aging, where subjects are often unable to conform to typical experimental constraints.

Fig. 1. Equivalent current dipole (ECD) locations for the SOBI recovered A and P components
Theoretically, this result implies that SOBI's ability to recover anatomically well-defined neuronal sources does not depend upon the use of an event-related stimulation paradigm. Thus, fast electrical brain activity in the default mode [13] can be investigated in terms of neuronal signals originating from specific, focal cortical areas. In comparison to default mode activity revealed by fMRI, the default mode activity revealed with SOBI and EEG will offer millisecond temporal resolution, allowing for the characterization of default mode brain dynamics within a new temporal domain.
Fig. 2. Median power spectra of two SOBI-neuronal components as a function of repeated exposures to the same experimental situation. Session 1: week 0; Session 2: week 1; Session 3: week 4+.
Local Network Synchrony Shows Distinct Patterns of Change across 3 Repeated Exposures to the Same Experimental Situation. For each of the 3 recording sessions, power spectra from the component time courses were computed for ~5-min segments during which the subjects had their eyes closed (red), had their eyes open (blue), or viewed a nature video (green), respectively (Fig. 2). The anterior component had peak power within the θ band while the posterior component had peak power within the α band, indicated by a significant main effect of Region in the θ-to-α ratio (F[1,7] = 52.12, p < 0.001, partial η² = 0.88). This is consistent with the well-established fact that the posterior and anterior parts of the brain are major sources of α and θ generators, respectively.
Power spectra in these two components were differentially modulated by sessions and experimental conditions [interaction effect: Region x Session (contrast coefficients: 1, -1, 0) x Condition (1, 0, -1), F(1,7) = 3.52, p = 0.05, 1-tailed, partial η² = 0.33]. For the P component, the power spectra revealed a systematic effect of session and experimental condition. Across the 3 repeated exposures to the same experimental conditions, peak α power decreased as the testing situation became increasingly familiar. Across the 3 experimental conditions, the highest peak α power was associated with the eyes-closed condition, and the peak α power was successively reduced when the demand for visual processing increased from the eyes-closed to the eyes-open and video-viewing conditions. This latter observation is consistent with the known observation that visual processing suppresses α band activity. In contrast, for the anterior component, the power spectra showed a relative insensitivity to repeated exposures to the same experimental conditions and little modulation by the eyes-closed, eyes-open, and video-viewing conditions.

Differential Modulation of θ/α Contribution to Feed-Forward and Feedback Influences by Situational Familiarity and Visual Processing. FF (posterior-to-anterior) and FB (anterior-to-posterior) influences were measured by Granger causality in the θ and α band activity separately. FF and FB Granger causality measures were plotted as a function of time (Fig. 3). For the FF influence, when the eyes were closed, α band activity clearly dominated, as indicated by the α waveforms (black) having greater area underneath the curve than the θ waveforms (grey). This α dominance was clearly reduced when the eyes were open and was further reduced to nearly non-existent when the subjects viewed a video. For the FB influence, the pattern of α dominance over θ was reversed, showing uniform θ dominance over α across all 3 experimental conditions.

Fig. 3. Theta dominance over alpha in the anterior-to-posterior feedback (lower) influence and its reversal in the posterior-to-anterior feed-forward influence (upper) from a single subject
Using the area underneath the curve as a dependent measure, we summarize results from all 8 subjects across all 3 recording sessions in Fig. 4. To determine whether θ and α band activity contribute differentially to the FF and FB influences, and how such differential contributions are modulated by situational familiarity and sensory processing, we performed an ANOVA on the θ/α ratio. The θ/α ratio differed significantly between the FF and FB influences, with a greater ratio for the FB influence than for the FF influence (main effect of Direction, F[1,7] = 34.64, p < 0.001, partial η² = 0.83), i.e. a θ dominance in the FB influence. This can be seen in the higher measures for the θ band activity than the α band activity for the FB influences in most of the 9 conditions, and the clear reversal or reduction of this θ dominance in the FF influence (Fig. 4).
Fig. 4. Cumulative Granger Causality (area underneath the curve in Fig. 3) in the θ and α band as a function of situational familiarity (repeated sessions) and a function of visual processing (eyes-closed, eyes-open, video-viewing).
This reversal of θ dominance between the FF and FB influences was significantly modulated by the familiarity of the situation [Direction x Session (contrast coefficients: 1, -1, 0), F(1,7) = 11.97, p = 0.005, 1-tailed, partial η² = 0.63]. The reversal is more prominent when the situation was novel (Week 0) than when it became more familiar (Weeks 1 and 4+). This is best seen in the eyes-closed condition, where the magnitude of reversal is clearly reduced from Week 0 in comparison to Week 4+.
For the eyes-open condition, the θ dominance was reversed in Weeks 0 and 1 and reduced in Week 4+. For the video-viewing condition, the reversal of θ dominance does not appear to be influenced by the increasing situational familiarity. These patterns indicate that the FF/FB contrast is dependent upon the amount of visual information processing involved. When the subjects were engaged in visual perception during video-viewing, θ dominance in the FB influence and θ-α balance in the feed-forward influence are maintained across recording sessions. This visual processing-dependent effect is supported by a significant 3-way interaction [Direction x Session (1, -1, 0) x Condition (1, 0, -1), F(1,7) = 7.13, p = 0.02, 1-tailed, partial η² = 0.63]. Within Week 0, when the recording situation was novel (which is comparable to most studies that do not deal with the issue of task familiarity), θ dominance in the FB influence was maintained despite varying demand for visual processing. In contrast, the α dominance in the FF influence in the eyes-closed condition was reduced by increasing demand for sensory processing. In fact, visual processing was accompanied not only by a reduction in α but also by an increase in θ band activity in the FF influence. We speculate that this increase in θ band activity serves to “match” the θ dominance in the FB influence to mediate the dynamic two-way communication between the posterior and anterior parts of the brain.
4 Discussion

We analyzed EEG data collected from 8 subjects in three sessions that were weeks apart, each including a period of resting with eyes closed, resting with eyes open, and visual perception while freely viewing a nature video. We extracted neuronal signals from focal brain regions within the frontal and occipital lobes and showed that such extraction can be achieved under free viewing conditions and from recordings made weeks apart. As many intervening events must have taken place during the inter-session intervals, the reliable extraction of the neuronal sources raises the possibility that such a wide range of variations may be overcome by the use of SOBI in the longitudinal experimental designs necessary for developmental and aging studies. Applying Granger causality analysis to the time courses of the frontal and occipital SOBI components, we presented evidence indicating distinct patterns of θ/α band activity in the FF and FB influences between the two components, with a θ dominance characterizing the FB influence and an α dominance in the FF influence. By comparing the feed-forward and feedback influences under varying degrees of situational familiarity (sessions) and under conditions of varying degrees of visual processing (eyes-closed, eyes-open, and video-viewing), we presented evidence that the balance in θ-α band activity between the FF and FB influences is modulated by two factors. First, situational familiarity can reduce the degrees of θ and α dominance in the FB and FF influences, respectively (as in the eyes-closed condition). Second, the amount of sensory processing increases the θ band contribution and decreases the α band contribution to the FF influence, but has little effect on the FB influence. Finally, situational familiarity and sensory processing jointly determine the θ-α balance. Increasing familiarity and increasing visual processing both increase the θ band contribution to the FF influence. In contrast, for the FB influence, increasing familiarity decreases the θ band contribution when there is little demand for visual processing (eyes-closed) and has no effect on the θ band contribution when there is high demand for visual processing (video-viewing).
Together, these findings demonstrate a novel non-invasive approach to the assessment of top-down and bottom-up influences in the human brain. These findings may particularly benefit those clinicians and researchers who are interested in how bottom-up and top-down influences interact in both diseased and normal brains. Future work will extend this analysis to networks involving more functionally distinct brain regions. Acknowledgments. Supported by a grant from Sandia National Labs to A.C.T.
References

1. Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., Moulines, E.: A blind source separation technique using second-order statistics. IEEE Trans. Signal Process. 45, 434–444 (1997)
2. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
3. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997)
4. Joyce, C.A., Gorodnitsky, I.F., Kutas, M.: Automatic removal of eye movement and blink artifacts from EEG data using blind component separation. Psychophysiology 41, 313–325 (2004)
5. Tang, A.C., Sutherland, M.T., McKinney, C.J.: Validation of SOBI components from high density EEG. NeuroImage 25, 539–553 (2005)
6. Tang, A.C., Sutherland, M.T., Wang, Y.: Contrasting single-trial ERPs between experimental manipulations: Improving differentiability by blind source separation. NeuroImage 29, 335–346 (2006)
7. Tang, A.C., Liu, J.Y., Sutherland, M.T.: Recovery of correlated neuronal sources from EEG: The good and bad ways of using SOBI. NeuroImage 28, 507–519 (2005)
8. Sutherland, M.T., Tang, A.C.: Blind source separation can recover systematically distributed neuronal sources from resting EEG. In: Eurasip Proceedings of the Second International Symposium on Communications, Control, and Signal Processing (ISCCSP 2006), Marrakech, Morocco (March 13-15, 2006), http://www.eurasip.org/content/Eusipco/isccsp06/defevent/papers/cr1307.pdf
9. Tang, A.C., Sutherland, M.T., McKinney, C.J., Liu, J.Y., Wang, Y., Parra, L.C., Gerson, A.D., Sajda, P.: Classifying single-trial ERPs from visual and frontal cortex during free viewing. In: IEEE Proceedings of the International Joint Conference on Neural Networks (IJCNN 2006), Vancouver, BC, Canada, July 16-21, pp. 1376–1383. IEEE Computer Society Press, Los Alamitos (2006)
10. Sutherland, M.T., Tang, A.C.: Reliable detection of bilateral activation in human primary somatosensory cortex by unilateral median nerve stimulation. NeuroImage 33, 1042–1054 (2006)
11. Ding, M.: Short-window spectral analysis of cortical event-related potentials by adaptive multivariate autoregressive modeling: Data preprocessing, model validation, and variability assessment. Biological Cybernetics 83, 35–45 (2000)
12. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: Basic theory and application to neuroscience. In: Winterhalder, M., Schelter, B., Timmer, J. (eds.) Handbook of Time Series Analysis. Wiley, Chichester (2006)
13. Gusnard, D.A., Raichle, M.E.: Searching for a baseline: Functional imaging and the resting human brain. Nature Reviews Neuroscience 2, 685–694 (2001)
Noisy Independent Component Analysis as a Method of Rotating the Factor Scores

Steffen Unkel and Nickolay T. Trendafilov

Department of Statistics, Faculty of Mathematics & Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom
{S.Unkel,N.Trendafilov}@open.ac.uk
Abstract. Noisy independent component analysis (ICA) is viewed as a method of factor rotation in exploratory factor analysis (EFA). Starting from an initial EFA solution, rather than rotating the loadings towards simplicity, the factors are rotated orthogonally towards independence. An application to Thurstone’s box problem in psychometrics is presented using a new data matrix containing measurement error. Results show that the proposed rotational approach to noisy ICA recovers the components used to generate the mixtures quite accurately and also produces simple loadings. Keywords: Independent component analysis, Exploratory factor analysis, Factor rotation, Factor scores, Gradient projection algorithm.
1 Introduction
The key difference between EFA and noisy ICA is that in the latter model the common factors are assumed to be both independent and non-normal. Since only second-order statistics are analyzed, the loadings in the EFA model can only be estimated up to an orthogonal rotation. Hence, EFA is not able to separate linear mixtures into their independent components. In contrast, the nonnormality of the common factors allows ICA to perform blind source separation. The rotational redundancy of the EFA model is removed, using supplementary information not contained in the sample covariance or correlation matrix. Several authors use EFA merely for quasi-sphering the data before doing an ICA analysis [8,14]. ICA can be considered as a method for factor rotation seeking a rotation matrix that maximizes the independence between the common factors [6,7]. This connection was first explored in [13], where a varimax-based criterion was proposed to implement noise-free ICA. The current paper implements noisy ICA from an EFA perspective by exploiting this link with factor rotation. Starting from an initial EFA solution, the predicted factor scores are rotated orthogonally towards independence. This is done using an appropriate rotation criterion and an orthogonal rotation algorithm. Recently, the rotational approach to ICA proposed here was introduced for the noise-free case using methods from principal components analysis (PCA) [10]. In the sequel, this rotational approach is applied for studying the less developed noisy version of ICA.
2 Independent Non-normal Factor Analysis Model
Consider the following linear latent variable model, in which all variables are assumed to be measured at least on an interval scale:

x = μ + Λf + u,   (1)
where x ∈ R^{p×1} is a random vector of manifest variables with mean vector μ, f ∈ R^{k×1} is a random vector of k ≪ p latent variables called common factors, Λ ∈ R^{p×k} is a matrix of fixed coefficients referred to as factor loadings, and u ∈ R^{p×1} is a random vector of latent variables called unique factors. In EFA, the choice of k is subject to some limitations, which will not be discussed here [5]. The loading matrix Λ is required to have full column rank. Assume that E(f) = 0. Furthermore, let u ∼ N_p(0, Ψ), where Ψ is assumed to be a positive definite diagonal matrix. Finally, suppose that E(ff′) = I_k and E(fu′) = 0_{k×p}. Thus, all the factors are uncorrelated with one another and the variances of the common factors equal unity. Using these assumptions, the model (1) represents an EFA model with orthogonal (uncorrelated) common factors [5]. The idea of model (1) is that the common factors account for the covariance structure among the set of manifest variables, while each unique factor corresponds to that portion of a particular manifest variable which cannot be accounted for by the common factors. As such, a unique factor contains the specificity of that variable as well as errors in measurement or noise. In EFA, it is often convenient to assume that not only u but also f and hence x are multinormally distributed. This assumption is usually made for purposes of statistical inference [15]. The elements of f, being normally distributed and uncorrelated, are thus statistically independent random variables. Unlike EFA, ICA assumes that the k common factors are both mutually independent and non-normal, or at least all but one non-normal [3]. With this key difference, the model (1) is similar to a noisy ICA model [7]. Given a multivariate sample of n independent observations on x = (x_1, . . . , x_p)′, the k-factor model (1) can be written as

X = M + FΛ′ + U,   (2)
where X = (x_1, . . . , x_p) ∈ R^{n×p} is the observed data matrix in which x_j = (x_{1j}, . . . , x_{nj})′ (j = 1, . . . , p), M = 1_n μ′ ∈ R^{n×p} is the matrix of location parameters, and F = (f_1, . . . , f_k) ∈ R^{n×k} and U = (u_1, . . . , u_p) ∈ R^{n×p} denote the unknown matrices of factor scores of the k common factors and unobserved values for the p unique factors on n observations, respectively. The aim of noisy ICA based on model (2) is to recover F from X alone without knowing Λ, Ψ, and the distributions of the common factors [4]. This problem can be transformed into a specific EFA task. The matrix M can easily be estimated by M̂, which consists of n constant rows x̄′, where x̄ = (x̄_{.1}, . . . , x̄_{.p})′ denotes the (p × 1) sample mean vector with x̄_{.j} = (1/n) Σ_{i=1}^{n} x_{ij} (j = 1, . . . , p) being the sample means for each variable.
Assume without changing notation that X has been mean-corrected and that the column vectors of X are standardized to unit variance. The model in (1) and the assumptions imply the following model correlation structure, R:

R = ΛΛ′ + Ψ.   (3)
The estimation of the parameters in EFA is a problem of finding the pair {Λ̂, Ψ̂} which gives the best fit for certain k to the sample correlation matrix C = X′X/(n − 1) with respect to some goodness-of-fit measure. To find estimates of Λ and Ψ, several factor extraction methods can be employed [5]. A natural and popular choice is the (unweighted) least squares (LS) approach. It can be formulated as the following optimization problem [11]:

min_{Λ,Ψ} ‖C − ΛΛ′ − Ψ‖²   s.t. Λ′Λ a diagonal matrix,   (4)

where ‖A‖ = sqrt(trace(A′A)) denotes the Frobenius norm of A. In EFA, the constraint in (4) eliminates the indeterminacy in (3). This indeterminacy-elimination feature is not always helpful in EFA, because such solutions are usually difficult to interpret [15]. Instead, the parameter estimation is usually followed by some kind of 'simple structure' rotation [5], which in turn gives solutions violating (4). Recall that in ICA 'simple structure' rotation is not necessary. The constraint in (4) is invoked to facilitate the algorithms for numerical solution of the LS problem. The standard numerical solutions of the optimization problem in (4) are iterative, usually based on a Newton-Raphson procedure [11].
3 Factor Scores and Rotation Towards Independence
After estimates Λ̂ and Ψ̂ of the parameters Λ and Ψ have been found, (initial) factor scores can be predicted in a second step:

F̂ = X Ψ̂⁻¹ Λ̂ (Λ̂′ Ψ̂⁻¹ C Ψ̂⁻¹ Λ̂)^(−1/2).   (5)
This set of factor scores was proposed by [1]. Equation (5) produces predicted factor scores which are orthogonal. The factor scores are also valid, which means that the predictions do have high correlations with the factors being measured. However, the factor scores are neither univocal (that is, they may correlate with factors other than those they were designed to measure) nor unbiased estimators [5]. So far, finding {Λ̂, Ψ̂} and F̂ is a standard EFA problem. To solve the corresponding ICA problem one needs to go one step further. The initial factor scores are rotated towards independence, that is,

F̃ = F̂T,   (6)

for some orthogonal matrix T. To find the matrix T that leads to (approximately) independent factor scores, an appropriate rotation criterion is set up.
Recall that if the common factors are independent, their squares are also independent. Thus, the (model) covariance matrix of the squared components is diagonal. Let V be an arbitrary orthogonal matrix and let

G = F̂V.   (7)
The sample covariance matrix between the element-wise squares of G is

S = (1/(n−1)) (G ⊙ G)′ (I_n − n⁻¹ 1_n 1_n′) (G ⊙ G),   (8)

where ⊙ denotes the element-wise (Hadamard) matrix product. Consider the following rotation criterion to be minimized [10]:

F(V) = trace( S′ (S ⊙ N) ),   (9)
where N is a square matrix with zeros on the diagonal and ones elsewhere. The aim is to minimize the sum of the squared off-diagonal elements of S over all orthogonal rotations V of F̂. The gradient projection algorithm proposed by [9] is used to find T that minimizes F. Let M be the manifold of all orthogonal matrices. Given a current value of V, this algorithm computes the gradient of F at V and moves α units in the negative gradient direction from V. The result is projected on M. The algorithm proceeds iteratively, it is strictly descending, and it converges from any starting point to a stationary point. At a stationary point of F restricted to M, the Frobenius norm of the gradient after projection onto the plane tangent to M at the current value of V is zero. The algorithm stops when the norm is less than some prescribed precision, say 10⁻⁵.

Summarizing, the proposed EFA approach to noisy ICA is as follows:

1. Set up the number of common factors, k, prescribed or estimated.
2. Estimate the parameters Λ and Ψ by a factor extraction method.
3. Calculate (initial) predicted factor scores, F̂, using (5).
4. Find an orthogonal matrix T that minimizes F in (9).
5. Calculate the approximately independent factors as F̃ = F̂T.
6. Obtain the ICA mixing matrix as Λ̃ = Λ̂T.
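A compact numerical sketch of steps 4-5 is given below. It scores criterion (9) directly, replaces the analytic gradient of [9] with finite differences, and stops on the change in V rather than on the projected gradient norm, so it illustrates the gradient projection idea rather than faithfully reimplementing it:

```python
import numpy as np

def criterion(F_hat, V):
    # Eq. (9): sum of squared off-diagonal elements of the covariance
    # matrix (8) of the element-wise squares of G = F_hat V.
    S = np.cov((F_hat @ V) ** 2, rowvar=False)
    return np.sum(S ** 2) - np.sum(np.diag(S) ** 2)

def rotate_towards_independence(F_hat, alpha=1.0, tol=1e-5, max_iter=1000):
    k = F_hat.shape[1]
    V, eps = np.eye(k), 1e-6
    for _ in range(max_iter):
        f0 = criterion(F_hat, V)
        grad = np.zeros((k, k))
        for i in range(k):                 # finite-difference gradient of F at V
            for j in range(k):
                Vp = V.copy()
                Vp[i, j] += eps
                grad[i, j] = (criterion(F_hat, Vp) - f0) / eps
        U, _, Wt = np.linalg.svd(V - alpha * grad)
        V_new = U @ Wt                     # project back onto the orthogonal manifold
        if np.linalg.norm(V_new - V) < tol:
            return F_hat @ V_new, V_new    # rotated scores F~ and rotation T
        V = V_new
    return F_hat @ V, V
```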
4 Application
Developing analytical methods for factor rotation has a long history in factor analysis [2]. It is motivated by both solving the indeterminacy problem and facilitating the factors’ interpretation. Thurstone’s 26-variable box problem [16] was notorious for being difficult to solve by any analytic rotation method. In this data set, the boxes constitute the observational units. Table 1 shows the three dimensions f1 (length), f2 (width) and f3 (height) for each box. As in [10], seven additional boxes, whose dimensions are given in Tab. 2, are added to the 20 boxes to form an independent set of boxes which in turn is well-suited for an ICA analysis.
Table 1. Dimensions f1, f2, and f3 of Thurstone's original 20 box-set

Box:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
f1 :  3  3  3  3  3  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5
f2 :  2  2  3  3  3  2  2  3  3  3  4  4  4  2  2  3  3  4  4  4
f3 :  1  2  1  2  3  1  2  1  2  3  1  2  3  1  2  2  3  1  2  3
Table 2. Dimensions f1, f2, and f3 of the 7 additional boxes

Box: 21 22 23 24 25 26 27
f1 :  3  3  3  3  4  5  5
f2 :  4  4  4  2  2  3  2
f3 :  1  2  3  3  3  1  3
Twenty-six functions of these dimensions represent the variables of the study: f1, f2, f3, f1f2, f1f3, f2f3, f1²f2, f1f2², f1²f3, f1f3², f2²f3, f2f3², f1/f2, f2/f1, f1/f3, f3/f1, f2/f3, f3/f2, 2f1+2f2, 2f1+2f3, 2f2+2f3, sqrt(f1²+f2²), sqrt(f1²+f3²), sqrt(f2²+f3²), f1f2f3, and sqrt(f1²+f2²+f3²). The analytic rotations' aim is to find loadings with simple structure which identify the dimensions of the boxes. As the three dimensions are independent, the problem seems quite appropriate to be attacked by ICA instead. Then, one can expect to find loadings with such simple structure as a side effect. The resulting 27 × 26 data matrix contains no measurement error. A less artificial data set is proposed in [12], in which the remedy is to double the number of boxes and add a 54 × 26 error matrix made up of random normal deviates to the enlarged data matrix. The boxes are doubled to make the results less dependent on the pseudo-random numbers added. This procedure injects measurement error into the data, giving the problem a greater degree of realism. A cute side-effect is obtained. Since the new data matrix yields a non-singular correlation matrix, it can be used with factor extraction methods such as maximum likelihood factor analysis, canonical factor analysis, image factoring, et cetera. The matrix of measurement errors is mean-centered and also sphered to ensure that the columns of the error matrix are uncorrelated. After that, the columns are scaled to have variance 1/19 of the variance of the corresponding columns of the original data matrix. This yields a reliability of (approximately) 95% for each variable represented in the new data matrix. The 54 × 26 data matrix, X, is finally mean-centered and standardized to unit variance. The first few eigenvalues of C sorted in decreasing order are 11.8906, 6.7467, 5.4132, 0.4015. As expected, three eigenvalues are considerably greater than one, which is Kaiser's solution for the number of common factors. The procedure described in the last section was applied to get F̃ = F̂T. The columns f̃1, f̃2, and f̃3 of F̃ are the rotated factor scores and estimates of the standardized form of the three dimensions f1, f2, and f3 used to generate the mixtures.
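For concreteness, a noisy 54 × 26 box matrix along the lines of [12] can be rebuilt as follows. This sketch standardizes the noise columns but omits the sphering step that makes them exactly uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
dims = np.array([[3,2,1],[3,2,2],[3,3,1],[3,3,2],[3,3,3],[4,2,1],[4,2,2],
                 [4,3,1],[4,3,2],[4,3,3],[4,4,1],[4,4,2],[4,4,3],[5,2,1],
                 [5,2,2],[5,3,2],[5,3,3],[5,4,1],[5,4,2],[5,4,3],   # Table 1
                 [3,4,1],[3,4,2],[3,4,3],[3,2,3],[4,2,3],[5,3,1],[5,2,3]],  # Table 2
                dtype=float)
f1, f2, f3 = dims.T
funcs = [f1, f2, f3, f1*f2, f1*f3, f2*f3, f1**2*f2, f1*f2**2, f1**2*f3,
         f1*f3**2, f2**2*f3, f2*f3**2, f1/f2, f2/f1, f1/f3, f3/f1, f2/f3,
         f3/f2, 2*f1 + 2*f2, 2*f1 + 2*f3, 2*f2 + 2*f3,
         np.sqrt(f1**2 + f2**2), np.sqrt(f1**2 + f3**2),
         np.sqrt(f2**2 + f3**2), f1*f2*f3, np.sqrt(f1**2 + f2**2 + f3**2)]
X = np.tile(np.column_stack(funcs), (2, 1))        # boxes doubled: 54 x 26
noise = rng.standard_normal(X.shape)
noise = (noise - noise.mean(0)) / noise.std(0)     # column-standardized noise
X = X + noise * np.sqrt(X.var(0) / 19)             # ~95% reliability per column
X = (X - X.mean(0)) / X.std(0)                     # mean-center, unit variance
```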
Table 3. Covariances (diagonal and above) and correlations (below diagonal) between the element-wise squares of f̃1, f̃2, and f̃3

 0.5847561   0.0000001  -0.0000238
 0.0000001   0.6107587   0.0000027
-0.0000139   0.0000016   0.5868985
According to Tab. 3, f̃1, f̃2, and f̃3 are quite independent. The off-diagonal elements of the correlation matrix for the element-wise squares of f̃1, f̃2, and f̃3 are all very small (below 3 × 10⁻⁵). The gradient projection algorithm converged after 18 iterations to a stationary point, and the value of F at the minimum is 1.97 × 10⁻¹⁰. Figure 1 shows that the EFA approach to ICA has quite accurately recovered the dimensions for each of the 54 boxes despite the noise introduced into the model. Quite common for an ICA analysis, permutation ambiguities were revealed: the first factor f1 corresponds to the third column f̃3 of F̃, and vice versa. Note that some of the manifest variables are non-linear functions of the dimensions of the boxes. However, as [10] point out, the non-linear functions are nearly linear over the values of f1, f2, and f3 used to generate the mixtures.
Fig. 1. Standardized box dimensions ’◦’ and their estimates ’*’ for each dimension f1 (upper panel), f2 (middle panel), and f3 (lower panel) and each box i (i = 1, . . . , 54)
Table 4. ICA (orthogonal) and Geomin (oblique) rotated loadings for the box data

Function             ICA (Λ̃)              Geomin
f1                   .97  .09  .09        .97 -.06  .02
f2                  -.13  .96  .10        .02  .98  .02
f3                  -.07 -.10  .96       -.02 -.01  .97
f1f2                 .45  .86  .12        .58  .78 -.03
f1f3                 .33 -.06  .89        .37 -.05  .86
f2f3                -.19  .45  .82       -.07  .54  .77
f1²f2                .69  .64  .11        .79  .52 -.04
f1f2²                .24  .91  .16        .39  .86  .01
f1²f3                .57 -.01  .77        .61 -.04  .70
f1f3²                .21 -.01  .92        .27  .02  .89
f2²f3               -.15  .65  .68        .00  .72  .60
f2f3²               -.14  .27  .90       -.04  .36  .87
f1/f2                .66 -.68 -.05        .54 -.78 -.02
f2/f1               -.63  .71  .01       -.51  .81 -.03
f1/f3                .43  .14 -.83        .40  .00 -.88
f3/f1               -.49 -.15  .77       -.46  .00  .83
f2/f3               -.06  .63 -.69       -.01  .57 -.76
f3/f2                .06 -.56  .74        .02 -.50  .80
2f1+2f2              .56  .77  .17        .69  .68  .01
2f1+2f3              .68  .03  .70        .72 -.03  .62
2f2+2f3             -.16  .60  .74       -.02  .67  .67
sqrt(f1²+f2²)        .70  .67  .10        .80  .55 -.05
sqrt(f1²+f3²)        .84  .09  .47        .87 -.01  .37
sqrt(f2²+f3²)       -.16  .71  .63        .00  .77  .55
f1f2f3               .22  .43  .83        .34  .45  .74
sqrt(f1²+f2²+f3²)    .57  .58  .51        .69  .51  .37
If one rotates the factor scores, the loadings are rotated as well. Table 4 shows the rotated loadings Λ̃. If one ignores all loadings with magnitude .19 or less, the remaining loadings perfectly identify the subsets of the variables f1, f2, and f3 that were used to generate the mixtures. The simple structure is nearly as good as the one obtained by the more sophisticated method of Geomin [2].
5 Discussion
In this paper, noisy ICA was implemented from an EFA perspective. The approach was applied to the notorious Thurstone's box problem. By rotating the factors towards independence, a simple structure of the loadings was achieved. The criterion for rotating the factor scores towards independence requires minimization of squared fourth-order statistics. Optimization was easily carried out using the gradient projection algorithm. Other methods can also be applied to rotate F̂ towards independence, for example the varimax-based criterion [13] or the FastICA algorithm [7]. The proposed approach has to be compared to
these and/or other ICA methods. One might ask whether the factor analysis approach is able to recover the common factors and produce simple loadings if the sources are dependent rather than independent, and if real correlated data are considered. Tackling these issues will be the subject of future work.

Acknowledgments. The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.
References

1. Anderson, T.W., Rubin, H.: Statistical inference in factor analysis. In: Neyman, J. (ed.) Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. V, pp. 111–150. University of California Press, Berkeley (1956)
2. Browne, M.W.: An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research 36, 111–150 (2001)
3. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994)
4. Davies, M.: Identifiability issues in noisy ICA. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 470–473. Springer, Heidelberg (2004)
5. Harman, H.H.: Modern Factor Analysis, 3rd edn. University of Chicago Press, Chicago (1976)
6. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001)
7. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
8. Ikeda, S.: ICA on noisy data: a factor analysis approach. In: Girolami, M. (ed.) Advances in Independent Component Analysis, pp. 201–215. Springer, Berlin (2000)
9. Jennrich, R.I.: A simple general procedure for orthogonal rotation. Psychometrika 66, 289–306 (2001)
10. Jennrich, R.I., Trendafilov, N.T.: Independent component analysis as a rotation method: A very different solution to Thurstone's box problem. British Journal of Mathematical and Statistical Psychology 58, 199–208 (2005)
11. Jöreskog, K.G.: Factor analysis by least-squares and maximum likelihood methods. In: Enslein, K., Ralston, A., Wilf, H.S. (eds.) Mathematical Methods for Digital Computers, vol. 3, pp. 125–153. John Wiley & Sons, New York (1977)
12. Kaiser, H.F., Horst, P.: A score matrix for Thurstone's box problem. Multivariate Behavioral Research 10, 17–25 (1975)
13. Kano, Y., Miyamoto, Y., Shimizu, S.: Factor rotation and ICA. In: Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA 2003), Nara, Japan, pp. 101–105 (2003)
14. Kawanabe, M., Murata, N.: Independent component analysis in the presence of Gaussian noise based on estimating functions. In: Proceedings of the 2nd International Workshop on Independent Component Analysis and Blind Signal Separation (ICA 2000), Helsinki, Finland, pp. 39–44 (2000)
15. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
16. Thurstone, L.L.: Multiple Factor Analysis. The University of Chicago Press, Chicago (1947)
Multilinear (Tensor) ICA and Dimensionality Reduction

M. Alex O. Vasilescu¹,³ and Demetri Terzopoulos²,³

¹ Massachusetts Institute of Technology, Cambridge, MA 02139, USA
² University of California, Los Angeles, CA 90095, USA
³ University of Toronto, Toronto, ON M5S 3G4, Canada
Abstract. Multiple factors related to scene structure, illumination, and imaging contribute to image formation. Independent Components Analysis (ICA) maximizes the statistical independence of the representational components of a training image ensemble, but it cannot distinguish between these different factors, or modes. To address this problem, we introduce a nonlinear, multifactor model that generalizes ICA. Our Multilinear ICA model of image ensembles learns the statistically independent components of each of the multiple factors. We present an associated dimensionality reduction algorithm for multifactor subspace analysis. As an application, we consider the multilinear analysis of ensembles of facial images that combine several modes, including different facial geometries (people), expressions, head poses, and lighting conditions. For the purposes of face recognition, we introduce a multilinear projection algorithm that simultaneously projects an unknown test image into the multiple constituent mode spaces in order to infer its mode labels. We show that multilinear ICA computes a set of factor subspaces that yield improved recognition rates.
1 Introduction Historically, linear models that capture the statistical properties of image or other signal data have been broadly applied in pattern recognition. For example, the linear, appearance-based face recognition method known as “Eigenfaces” is founded on the principal components analysis (PCA) of facial image ensembles [1]. PCA encodes pairwise relationships between pixels—the second-order, correlational structure of the training image ensemble—but it ignores higher-order pixel statistics. By contrast, independent components analysis (ICA) [2,3] learns a set of statistically independent components by also considering these higher-order dependencies in the training data. However, ICA cannot distinguish between higher-order statistics associated with different factors, or modes, inherent to image formation—factors pertaining to scene structure, illumination, and imaging. In particular, ICA has been employed in face recognition [4] and, like PCA, it works best when person identity is the only factor that is permitted to vary. If additional factors, such as illumination, viewpoint, and expression can modify facial images, recognition rates deteriorate dramatically. We propose a multilinear framework that addresses the aforementioned problems. Specifically, we introduce a nonlinear, multifactor model of image ensembles that generalizes conventional ICA.1 Unlike its conventional, linear counterpart, our Multilinear 1
A preliminary description of this work appeared as an extended abstract in the Learning 2004 Workshop, Snowbird, UT, April, 2004.
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 818–826, 2007. c Springer-Verlag Berlin Heidelberg 2007
Multilinear (Tensor) ICA and Dimensionality Reduction
819
ICA model exploits multilinear (tensor) algebra in order to learn the interactions of multiple factors inherent to image formation and separately encode the higher-order statistics of each of these factors. By contrast, the multilinear generalization of Eigenfaces, dubbed TensorFaces [5], encodes only their second-order statistics. Our multilinear ICA should not be confused with existing tensorial algorithms for computing the conventional, linear ICA models [6,7]. We demonstrate the application of multilinear ICA to the problem of face recognition under varying viewpoint and illumination, obtaining significantly improved recognition rates. In this context, our second contribution is a novel, multilinear projection algorithm. It projects an unknown test image into the multiple factor representation spaces to infer the person, viewpoint, illumination, and other mode labels associated with the test image. After reviewing the mathematical foundations of our tensor approach in Section 2, we motivate our work by discussing PCA and multilinear PCA in Section 3. Next, we generalize ICA (Section 4), developing our multilinear ICA algorithm in Section 5. Section 6 introduces the multilinear projection algorithm for recognition. Section 7 presents our experiments and results and Section 8 concludes the paper.
2 Multilinear (Tensor) Algebraic Fundamentals Definition 1 (Tensor). A tensor, or n-way array, is a generalization of a vector (firstorder tensor) and a matrix (second-order tensor).2 Tensors are multilinear mappings over a set of vector spaces. The order of tensor A ∈ IRI1 ×...×IN is N . An element of A is denoted as Ai1 ...in ...iN or ai1 ...in ...iN , where 1 ≤ in ≤ In . Definition 2 (Mode-n Vectors). The mode-n vectors (or fibers) of an N th −order tensor A ∈ IRI1 ×...×IN are the In -dimensional vectors obtained from A by varying index in while keeping the other indices fixed. They are the column vectors of matrix A(n) ∈ IRIn ×(In+1 ...IN I1 ...In−1 ) that results from flattening the tensor A (Fig. 1). Definition 3 (Mode-n Orthonormal Matrices). Mode matrix Un contains the orthonormal vectors spanning the column space of matrix A(n) resulting from the mode-n flattening of A. Definition 4 (Mode-n Rank). The mode-n rank Rn of A ∈ IRI1 ×...×IN is defined as the dimension of the vector space generated by the mode-n vectors: Rn = rankn (A) = rank(A(n) ). Definition 5 (Mode-n Product, ×n ). The mode-n product of a tensor A ∈ IRI1 ×...In ×...IN and a matrix M ∈ IRJn ×In , denoted by A ×n M, is a tensor of dimen×In+1 ×...IN whose entries are computed by (A ×n sionality IRI1 ×...In−1 ×Jn M)i1 ...in−1 jn in+1 ...iN = in ai1 ...in−1 in in+1 ...iN mjn xin . It can be expressed in terms of flattened matrices as B(n) = MA(n) . 2
We denote scalars by italic lowercase letters (a, b, . . .), vectors by bold lowercase letters (a, b, . . .), matrices by bold uppercase letters (A, B, . . .), and higher-order tensors by calligraphic uppercase letters (A, B, . . .).
820
M.A.O. Vasilescu and D. Terzopoulos
I3
I3
I3 I2
I2
I2
I1
I1
I1 I3 I2 I1
I2
I1 I3
A(1) I2
I3
I2
A(2)
I1
I1
I1
I3
I1
A(3)
Fig. 1. Flattening a (3rd-order) tensor. The tensor can be flattened in 3 ways to obtain matrices comprising its mode-1, mode-2, and mode-3 vectors.
A matrix representation of the mode-n product of a tensor A ∈ IRI1 ×...×In ×...×IN and a set of N matrices, Fn ∈ IRJn ×In can be obtained as follows: B = A ×1 F1 . . . ×n Fn . . . ×N FN , or in terms of flattened tensors, B(n) = Fn A(n) (Fn−1 ⊗ . . . F1 ⊗ FN ⊗ . . . Fn+1 )T , where ⊗ denotes the matrix Kronecker product. The Frobenius norm of a tensor A is given by A = A, A.
3 PCA and Multilinear PCA The principal components analysis of an ensemble of I2 images is computed by performing an SVD on a I1 × I2 data matrix D whose columns are the “vectorized” I1 -pixel “centered” images. The matrix D ∈ IRI1 ×I2 is a two-mode mathematical object that has two associated vector spaces, a row space and a column space. In a PCA analysis of D, the SVD orthogonalizes these two spaces and decomposes the matrix as D = UΣVT . Using mode-n products, the SVD can be rewritten as D = Σ×1 U×2 V. The eigenvectors U are called the principal component directions of D (Fig. 3(a)). The analysis of an ensemble of images resulting from the confluence of multiple factors, or modes, related to scene structure, illumination, and viewpoint is a problem in multilinear algebra [5]. Within this mathematical framework, the image ensemble is represented as a higher-order tensor. This image data tensor D, Fig. 2(b), must be decomposed in order to separate and parsimoniously represent the constituent factors. This can be achieved by employing the N -mode SVD, a multilinear extension of the aforementioned conventional matrix SVD [8,9]. D is an N -dimensional matrix comprising N spaces. The N -mode SVD orthogonalizes these N spaces and decomposes the tensor as the mode-n product of N orthogonal spaces: (1) D = Z ×1 U1 ×2 U2 . . . ×n Un . . . ×N UN . Tensor Z, known as the core tensor, is analogous to the diagonal singular value matrix in conventional matrix SVD (although it does not have a simple, diagonal structure). The core tensor governs the interaction between the mode matrices U1 , . . . , UN . Mode matrix Un contains the orthonormal basis vectors spanning the column space of matrix D(n) resulting from the mode-n flattening of D. The tensor basis associated with this multilinear PCA is displayed in Fig. 3(b).
Multilinear (Tensor) ICA and Dimensionality Reduction -35
-30
-25
-20
-15
-10
-5
0
5
10
15
20
25
30
35
821
People
-35 -30 -25 -20
Views
-15 -10 -5 0 5 10 15 20 25 30
Illuminations
35
(a)
(b)
Fig. 2. A facial image dataset of 2,700 training images out of 16,875 images. 3D scans of 75 subjects, recorded using a CyberwareT M 3030PS laser scanner as part of the University of Freiburg 3D morphable faces database [10]. A portion of the 4th -order data tensor D for the image ensemble formed from the dash-boxed images, Fig. 2(a), of each person. Only 4 of the 75 people are shown. The N -mode SVD algorithm for decomposing D according to equation (1) is as follows: 1. For n = 1, . . . , N , compute matrix Un in (1) by computing the SVD of the flattened matrix D(n) and setting Un to be the left matrix of the SVD. 2. Solve for the core tensor: Z = D ×1 UT1 ×2 UT2 . . . ×n UTn . . . ×N UTN .
4 ICA The independent components analysis of multivariate data can be applied in two ways [4]: 1) to DT , each of whose rows is a different image, which finds a spatially independent basis set that reflects the local properties of imaged objects; 2) to D, which finds a set of coefficients that are statistically independent while the basis reflects the global properties of imaged objects. ICA Approach 1: ICA starts essentially from the PCA solution and computes a transformation of the principal components such that they become independent components Ws1 UT = KT CT , (2) DT = VΣUT = VΣWs−1 1 where every column of D is a different image, Ws1 is an invertible transformation matrix that is computed by the ICA algorithm, C = UWsT1 are the independent components (Fig. 3(a)), and K = Ws−T ΣVT are the coefficients. Various objective functions, 1 such as those based on mutual information, negentropy, higher-order cumulants, etc., are presented in the literature for computing the independent components along with different optimization methods for extremizing the objective functions [3].
822
M.A.O. Vasilescu and D. Terzopoulos People
People
Views
Views
Illuminations
(a)
Illuminations
(b)
(c)
(d)
Fig. 3. Eigenfaces and TensorFaces bases for an ensemble of 2,700 facial images spanning 75 people, each imaged under 6 viewing and 6 illumination conditions (see Section 7). (a) PCA eigenvectors (eigenfaces), which are the principal axes of variation across all images. (b) A partial visualization of the 75 × 6 × 6 × 8560 TensorFaces representation of D, obtained as T = Z ×4 Upixels which captures viewpoint varaition, illumination variation and people variation. (c) Independent components Cpixels . (d) A partial visualization of the 75 × 6 × 6 × 8560 multilinear ICA representation of D, obtained as B = S ×4 Cpixels .
ICA Approach 2: Alternatively, ICA can be applied to D, and it transforms the principal components directions such that the coefficients are statistically independent, as follows: D = UΣVT = UWs−1 Ws2 ΣVT = CK, (3) 2 is the basis matrix and K = Ws2 ΣVT are the statistically indewhere C = UWs−1 2 pendent coefficients. Note that C, K and W are computed differently in the two approaches. Approach 1 yields statistically independent bases, whereas Approach 2 yields a “factorial code”.
5 Multilinear ICA Like PCA, ICA is a linear analysis method, hence it is not well suited to the representation of multi-factor image ensembles. To address this shortcoming, we next propose a novel multilinear generalization of ICA. Multilinear ICA is obtained by decomposing the data tensor D as the mode-n product of N mode matrices Cn and a core tensor S, as follows: D = S × 1 C1 × 2 C2 . . . × n Cn . . . × N CN . (4) The N -mode ICA algorithm is as follows: 1. For n = 1, . . . , N , compute the mode matrix Cn in (4) in one of two ways, by (5)–(6) or by (12)–(13). −1 −1 −1 2. Solve for the core tensor: S = D ×1 C−1 1 × 2 C2 . . . × n Cn . . . × N CN . As in ICA, there are two approaches for multilinear ICA.
Multilinear (Tensor) ICA and Dimensionality Reduction
823
Multilinear ICA Approach 1: Transposing the flattened data tensor D in mode n and computing the ICA as in (2), we obtain: (5) DT(n) = Vn Σn UTn = Vn Σn Wn−1 Wn UTn = KTn CTn , where the mode matrices are given by Cn = Un WnT .
(6)
The columns associated with each of the mode matrices Cn are statistically independent; i.e., a factorial code representation is computed for each mode of variation. We can derive the relationship between N -mode ICA and N -mode SVD (1) in the context of this approach as follows: D = Z ×1 U1 . . . ×N UN −T T WN = Z ×1 U1 W1T W1−T . . . ×N UN WN −T = Z ×1 C1 W1−T . . . ×N CN WN
×1 W1−T
−T × N WN )
... = (Z = S × 1 C1 . . . × N CN ,
× 1 C1 . . . × N CN
(7) (8) (9) (10) (11)
−T . where the core tensor S = Z ×1 W1−T . . . ×N WN
Multilinear ICA Approach 2: Alternatively, flattening the data tensor D in mode n and computing the ICA as in (3), we obtain: (12) D(n) = Un Σn VnT = Un Wn−1 Wn Σn VnT = Cn Kn , where the mode matrices are given by Cn = Un Wn−1 .
(13)
This second approach results in a set of basis vectors that are statistically independent across the different modes. Note that the Wn in (13) differs from the Wn in (6); the latter is analogous to Ws1 in (2) while the former is analogous to Ws2 in (3). We can derive the relationship between N -mode ICA and N -mode SVD (1) in the context of the second approach as follows: D = Z ×1 U1 . . . ×N UN −1 WN = Z ×1 U1 W1−1 W1 . . . ×N UN WN = Z × 1 C1 W 1 . . . × N CN W N = (Z ×1 W1 . . . ×N WN ) ×1 C1 . . . ×N CN = S × 1 C1 . . . × N CN ,
(14) (15) (16) (17) (18)
where the core tensor S = Z ×1 W1 . . . ×N WN . Optimal dimensionality reduction in multilinear ICA, which yields the approximaˆ 1 . . . ×N C ˆ N , is achieved by optimizing iteratively, mode per mode ˆ = Sˆ ×1 C tion D
824
M.A.O. Vasilescu and D. Terzopoulos
using alternating least squares, by holding fixed all mode components except one and solving for the remaining mode, with an additional step that computes the transformation matrix W. The error function minimized is: ˆ 1 . . . ×N C ˆ N ). ˆ = D − (Sˆ ×1 C e = D − D
(19)
The N -mode ICA dimensionality reduction algorithm is as follows: 1. Initialize: Apply Step 1 of the N -mode ICA algorithm to D; truncate each mode matrix Un , for n = 1, 2, . . . , N , to Rn columns, and compute the mode matrix Cn , thus obtaining the initial (k = 0) mode matrices C01 , C02 , . . . , C0N . 2. Iterate, for k = 1, 2, . . ., until convergence: Alternating Least Squares: Compute matrix Cn , 1 ≤ n ≤ N : + + + k−1 + ; mode-n flatten C˜nk Set C˜nk = D×1 Ck1 . . .×n−1 Ckn−1 ×n+1 Ck−1 n+1 . . .×N CN k k ˜ ˜ kn . to obtain the matrix Cn ; compute Cn according to (5) or (12) by setting D(n) = C ˆ ˆ ˆ ˜ ˆ 3. Set the converged mode matrices to C1 , C2 , . . . , CN . Compute the core tensor S = CN ×N ˆ = Sˆ ×1 C ˆ + . The approximation of D is D ˆ 1 ×2 C ˆ 2 . . . ×N C ˆ N. C N
6 Multilinear Projection We will now develop a multilinear method for simultaneously inferring the identity, illumination, viewpoint, etc., coefficient vectors of an unlabeled, test image. Multilinear ICA represents the unlabeled, test image by a set of unknown coefficient vectors, dT = B ×1 cTp ×2 cTv ×3 cTl , where the coefficient vector cp encodes the person, the coefficient vector cv encodes the viewpoint, and the coefficient vector cl encodes the illumination.
The multilinear projection algorithm is as follows: + 1. Compute the projection transformation P. In matrix form, P(mode) = BT(pixels) . T 2. Compute the response tensor R = P ×pixels d : R
C
P ×pixels dT ≈ I ×pixels (cTl ⊗ cTv ⊗ cTp ) = cp ◦ cv ◦ cl , 3. Since R = C and has rank-(1, . . . , 1), the coefficients are extracted by factorizing the response tensor using the N -mode SVD algorithm.
Intuitively, the unknowns cp , cv , and cl need to be estimated from d and B. This involves computing a pseudo-inverse tensor. Projecting d onto the pixel mode of B yields the image projection tensor R = P ×4 dT ≈ C, where the “projection transformation” (the matrix B(pixels) is the pixelP is obtained by re-tensorizing matrix P(pixels) = B+T (pixels) mode flattening of tensor B). The tensor R has the structure (cp ◦ cv ◦ cl ), the outer
Multilinear (Tensor) ICA and Dimensionality Reduction
825
product of the coefficient vectors associated with each factor inherent to the data dT ; hence, it is of rank-(1, . . . , 1). The rank of R and the fact that the coefficient vectors are unit vectors enables us to compute the three coefficient vectors via a tensor decomposition using the N -mode SVD algorithm. In principle, R = C, but sometimes in practice R ≈ C. Thus, the optimal dimensionally-reduced rank-(1, . . . , 1) decomposition must be computed by optimizing the objective function: dT − Bˆ ×1 cˆT1 ×2 . . . ×N ˆcTN .
7 Experiments In our face recognition experiments, each subject is imaged from 15 different viewpoints (θ = −35◦ to +35◦ in 5◦ steps on the horizontal plane φ = 0◦ ) under 15 different illuminations (θ = −35◦ to +35◦ in 5◦ steps on an inclined plane φ = 45◦ ). Fig. 2(a) shows the full set of 225 images for one of the subjects with viewpoints arrayed horizontally and illuminations arrayed vertically. The image set was rendered from 3D scans of 75 subjects. Of the 16,875 images, we employed 2,700 as training images. The training data tensor, a 4th -order tensor, D is shown in Fig. 2. Multilinear ICA yields better recognition rates (98.14%) than PCA (eigenfaces) (83.9%), conventional ICA (89.5%) and even multilinear PCA (93.4%) in scenarios involving the recognition of people imaged in previously unseen viewpoints and illuminations. Fig. 3(d) illustrates the multilinear ICA basis derived from the training ensemble, while Fig. 3(c) illustrates the conventional ICA basis.
8 Conclusion We presented a multilinear generalization of ICA. We applied our new multilinear ICA algorithm along with a novel, multilinear projection method to face recognition involving multiple people imaged under different viewpoints and illuminations. Multilinear ICA disentangles the multiple factors inherent to image formation and explicitly represents the higher-order statistics associated with each factor, thus yielding improved recognition rates relative to related prior methods.
References 1. Sirovich, L., Kirby, M.: Low dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A. 4, 519–524 (1987) 2. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 287–314 (1994) 3. Hyv¨arinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, Chichester (2001) 4. Bartlett, M.: Face Image Analysis by Unsupervised Learning. Kluwer, Boston (2001) 5. Vasilescu, M., Terzopoulos, D.: Multilinear analysis for facial image recognition. In: Proc. Int. Conf. on Pattern Recognition. Quebec City, vol. 2, pp. 511–514 (2002) 6. Cardoso, J.F., Comon, P.: Tensor-based independent component analysis. Signal Processing V: Theories and Applications 5, 673–676 (1990)
826
M.A.O. Vasilescu and D. Terzopoulos
7. De Lathauwer, L., De Moor, B., Vandewalle, J.: Independent component analysis and (simultaneous) third-order tensor diagonalization. IEEE Trans. Signal Processing 49(10), 2262– 2271 (2001) 8. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966) 9. De Lathauwer, L., De Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000) 10. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: SIGGRAPH 99 Conference Proceedings, ACM SIGGRAPH, pp. 187–194 (1999)
ICA in Boolean XOR Mixtures Arie Yeredor School of Electrical Engineering, Tel-Aviv University, Israel [email protected]
Abstract. We consider Independent Component Analysis (ICA) for the case of binary sources, where addition has the meaning of the boolean “Exclusive Or” (XOR) operation. Thus, each mixture-signal is given by the XOR of one or more of the source-signals. While such mixtures can be considered linear transformations over the finite Galois Field of order 2, they are certainly nonlinear over the field of real-valued numbers, so classical ICA principles may be inapplicable in this framework. Nevertheless, we show that if none of the independent random sources is uniform (i.e., neither one has probability 0.5 for 1/0), then any invertible mixing is identifiable (up to permutation ambiguity). We then propose a practical deflation algorithm for source separation based on entropy minimization, and present empirical performance results by simulation.
1
Introduction and Problem Formulation
The classical Independent Components Analysis (ICA) framework usually assumes linear combinations of independent sources over the field of real-valued numbers R, with some exceptions (e.g., [1]) that assume the field of complexvalued numbers C. It might be interesting, at least from a theoretical point of view, to explore the applicability of ICA principles to other algebraic fields. Let us consider the field (often denoted Galois Field of order 2, GF(2)) of binary numbers {0, 1}, where, for x, y ∈ {0, 1}, addition is defined as the “Exclusive Or” (XOR) operation, denoted z = x ⊕ y, where z equals 1 if and only if x = y (and equals zero otherwise). Multiplication in this field (either by 0 or by 1) is defined and denoted in the “usual” way, z = xy. These values and operations trivially satisfy all the requirements that constitute a field [2], namely: associativity, commutativity, distributivity, existence of an additive and of a multiplicative identity element (0 and 1, resp.) and of additive and multiplicative inverses (−0 = 0, −1 = 1; and 1−1 = 1, resp.). Naturally, all random variables (RVs) in this field are binary, and any probability distribution is uniquely defined by a single parameter p, denoting the probability with which the RV takes the value 1. We shall refer to p as the “1-probability” of the RV. Assume now that there are K statistically independent random sources denoted s[n] = [s1 [n] s2 [n] · · · sK [n]]T , with respective fixed, unknown 1probabilities p = [p1 p2 · · · pK ]T . For simplicity we shall further assume that the samples of each source are independent, identically distribute (iid) in time. M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 827–835, 2007. c Springer-Verlag Berlin Heidelberg 2007
828
A. Yeredor
Naturally, just like in “classical” ICA it is also possible to extend this basic model to temporally-correlated or non-stationary sources (e.g., [3] or [4], respectively, for classical ICA), but for now we choose to concentrate on this basic, iid model. Let these sources be mixed (over GF(2)) by an unknown, square (K × K) mixing matrix A (whose elements also belong to GF(2)), x[n] = A ◦ s[n],
(1)
where “◦” denotes matrix/vector multiplication over the field, such that the k-th element of x[n] is given by xk [n] = ak1 s1 [n] ⊕ ak2 s2 [n] ⊕ · · · ⊕ akK sK [n] k = 1, 2, ..., K.
(2)
We further assume that A is invertible over the field, namely that it has a unique inverse in GF(2), denoted B = A−1 , satisfying B ◦ A = A ◦ B = I, where I denotes the K × K identity matrix. Like in “classical” linear algebra (over R), A is non-singular (invertible) if and only if (iff) its determinant1 is non-zero (namely 1). Equivalently, A is singular iff there exists (in GF(2)) a nonzero vector u, such that A ◦ u = 0 (an all-zeros vector). We are interested in the possibility to recover the source signals s[n] from the observations (mixtures) x[n] under this “blind” scenario, where the only available knowledge is that the sources are statistically independent. Admittedly, this problem is not directly related to any specific application, but it is possible to think, e.g., of a hypothetical situation in a digital communication system, where cross-talk between channels might have the effect of a XOR combination (e.g., in a binary symmetric channel (BSC, [5]), the output can be considered as a XOR operation between the signal and noise processes). Note that although the mixing is linear over our GF(2) field, it is certainly not linear over the “standard” ICA fields R or C. Therefore, clearly not all classical results from the ICA theory and practice are applicable to this problem.
2
Identifiability
Let us first address the issue of identifiability (possibly up to some tolerable ambiguities) of A (or, equivalently, of its inverse B) from the set of observations x[n], n = 1, 2, ...N , under asymptotic conditions, namely when N → ∞. Due to the assumption of iid samples for each source (implying ergodicity), the joint statistics of the observations can be fully and consistently estimated from the available data. Therefore, the assumption of asymptotic conditions implies full and exact knowledge of the joint probability distribution of the observation vector x (we dropped the time-index n here, due to the stationarity). Before we proceed, let us consider the characterization of statistical properties of an arbitrary random vector in GF(2). 1
The determinant over GF(2) can be calculated just like over R, but with the ordinary addition / subtraction replaced by the XOR operation.
ICA in Boolean XOR Mixtures
2.1
829
Statistical Characterization of Random Vectors
For any K × 1 random vector y with elements in GF(2), the probability function can be fully described in a K-way tensor (K-dimensional array) P (y) , with two elements in each direction, indexed as 0 or 1 for convenience, such that
(y)
P i1 ,i2 ,...,iK = Prob{y1 = i1 , y2 = i2 , ..., yK = iK } , i1 , i2 , . . . , iK ∈ {0, 1}. (3) For convenience, we may concatenate the K indices into a K × 1 “index-vector” (y) i = [i1 i2 · · · iK ]T and use the notation P (y) (i) ≡ P i1 ,i2 ,...,iK , leading to P (y) (i) = Prob{y = i}. Evidently, P (y) has 2K elements, the sum of which is always 1. Given N iid realizations y[n] of y, a consistent estimate of P (y) (i) can be easily obtained, for all 2K possible values of i, from N (y) (i) = 1 P I{y[n] = i} N n=1
(4)
where I{·} denotes the Indicator function (being 1 if the condition in its argument is satisfied and 0 otherwise). An alternative, but generally incomplete characterization of the statistics of y can be described by its first and second joint moments, namely by
η (y) = E[y]
and
Λ(y) = E[yy T ],
(5)
respectively. Note that due to the 0/1 values in y, the elements of η (y) and of Λ(y) also carry explicit probabilistic interpretations: (y)
ηk = Prob{yk = 1},
and
(y)
Λk, = Prob{yk = 1, x = 1}.
(6)
Consistent estimates can be similarly obtained from N iid realizations, (y) = η
N 1 y[n] N n=1
and
N (y) = 1 Λ y[n]y T [n]. N n=1
(7)
We say that y has independent components if and only if the joint probability of any combination of the K elements equals the product of their marginal probabilities. This condition can be expressed as (recall that ik ∈ {0, 1}) P (y) (i) =
K
(y) (y) (ηk )ik (1 − ηk )(1−ik ) = (η (y) )i (1 − η (y) )(1−i) ,
(8)
k=1
where the notation ab is shorthand for ab11 · ab22 · · · abKK . If y has independent components then they are all uncorrelated, namely the covariance matrix
C (y) = Λ(y) − η (y) (η (y) )T
(9)
is diagonal. Note, however, that although a diagonal covariance matrix implies pairwise independence of the components of y, it does not, in general, imply full independence of these components. We now return to the identifiability problem.
830
2.2
A. Yeredor
Identifiability Through Decorrelation
A natural approach for exploring the identifiability is to look for possible linear ˆ such that the vector y = B ˆ ◦ x has independent components. transformations B Due to the invertibility of A, it is clear that there exists at least one such ˆ = ΠB, where Π is any ˆ = B = A−1 . Moreover, it is clear that any B matrix, B permutation matrix, also produces independent components in y - in accordance with the well-known inherent permutation ambiguity in ICA (fortunately, in our binary framework the classical scaling ambiguity is irrelevant and does not exist). The key question for establishing identifiability is to determine whether (and under what conditions) no other such transformations exist; namely, under what conditions independent components in y imply that the overall mixing ˆ ◦ A is a permutation matrix2 . unmixing matrix D = B The following Theorem establishes our main identifiability result. Theorem 1. Let s = [s1 s2 · · · sK ] denote K statistically independent sources in GF(2), the k-th source having 1-probability pk . Let y = D ◦ s denote a linear transformation of s over GF(2), where D is a K × K matrix (with elements in GF(2)). Let η (y) and C (y) denote the mean and covariance (resp.) of y. If: 1. 2. 3. 4.
All sources are non-degenerate, namely 0 < pk < 1, k = 1, 2, . . . , K; None of the sources is uniform, namely pk = 0.5, k = 1, 2, . . . , K; (y) All elements of η (y) are nonzero, ηk > 0, k = 1, 2, . . . , K; C (y) is diagonal,
Then D is a permutation matrix. Proof. Let us first establish the following lemma. Lemma 1. Let u and v be two RVs in GF(2) with 1-probabilities p and q (resp.),
and let w = u ⊕ v. If u and v are independent, non-degenerate (0 < p, q < 1) and non-uniform (p, q = 0.5), then w is also non-degenerate and non-uniform. To show this nearly trivial (and intuitively appealing) property, note that w has 1-probability r = p(1 − q) + q(1 − p). It can then be easily shown that the only valid values of (p, q) with which r = 0 are (0, 0) or (1, 1); with which r = 1 are (1, 0) or (0, 1); and that r = 0.5 iff either p, q or both equal 0.5. Under conditions 1 and 2 of Theorem 1, Lemma 1 establishes that no XOR sum of any (two, or more, by induction) of the sources can produce a degenerate or a uniform RV (however, if any of the sources are uniform, then any XOR sum involving one or more of these sources is also uniform). Let us assume now that D is a general matrix, and consider any pair yk and y (k = ) in y. yk and y are linear combinations of respective subgroups of the sources, indexed by the 1-s in D k,: and D,: , the k-th and -th rows (resp.) of D. These two subgroups define, in turn, three other subgroups (some of which may be empty): 2
We include I in the set of permutation matrices.
ICA in Boolean XOR Mixtures
831
1. Sub-group 1: Sources common to Dk,: and D,: . Denote the (XOR) sum of these sources as u; 2. Sub-group 2: Sources included in Dk,: but excluded from D ,: . Denote the (XOR) sum of these sources as v1 ; 3. Sub-group 3: Sources included in D,: but excluded from D k,: . Denote the (XOR) sum of these sources as v2 . For example, if (for K = 6) Dk,: = 0 1 1 1 1 1 and D ,: = 1 1 0 0 1 1 , then u = s2 ⊕ s5 ⊕ s6 , v1 = s3 ⊕ s4 and v2 = s1 . Note that by construction, the RVs u, v1 and v2 are statistically independent, and are also non-uniform. Obviously, yk = u + v1 and y = u + v2 . Let us denote the 1-probabilities of u, v1 and v2 as p, q1 and q2 , respectively. Then the 1-probabilities of yk , y are given (resp.) by (y)
ηk = p + q1 − 2pq1
and
(y)
η
= p + q2 − 2pq2 .
(10)
(y)
We’re further interested in Rk, , the probability that both yk and y are 1. This happens if u = 1 and v1 = v2 = 0, or if u = 0 and v1 = v2 = 1, so (y)
Rk, = p(1 − q1 )(1 − q2 ) + (1 − p)q1 q2 = p(1 − q1 − q2) + q1 q2 . (y)
(y)
(y) (y)
Now, condition 4 of Theorem 1 implies that Ck, = Rk, − ηk η
(11)
= 0, so
p(1 − q1 − q2) + q1 q2 − (p + q1 − 2pq1 )(p + q2 − 2pq2 ) = 0,
(12)
which with slight manipulations on the left-hand side can be written as p(1 − p)(1 − 2q1 )(1 − 2q2 ) = 0.
(13)
Since we’ve established that q1 = 0.5 and q2 = 0.5, (13) can only be satisfied if p = 0 or if p = 1. Since no non-trivial linear combination of the sources can be degenerate, p = 1 is also ruled out. The only option left is p = 0, which can happen if and only if sub-group 1 is empty, namely, iff the two rows D k,: and D,: do not share common sources, or, in other words, iff there is no column m in D such that both D k,m and D ,m are 1. (y) Applying this to all possible pairs of k = (for which Ck, = 0), and recalling that due to condition 3 of the theorem, D cannot have any all-zeros row, we immediately arrive at the conclusion that each row and each column of D must contain exactly one 1, meaning that D is a permutation matrix.
ˆ is found such that We therefore conclude that if a transformation matrix B ˆ ◦x=B ˆ ◦ (A ◦ s) = (B ˆ ◦ A) ◦ s = D ◦ s y=B
(14)
has uncorrelated, non-degenerate components, then if none of the sources is degenerate or uniform, D must be a permutation matrix. This means that the sources are fully separated, and are given by the elements of y (up to immaterial permutation ambiguity).
832
A. Yeredor
Note that as mentioned earlier, the decorrelation condition only implies pairwise independence (in GF(2)), which for a general random vector does not necessarily imply full independence. However, our theorem evidently asserts, that when y is a linear combination of independent, non-degenerate and non-uniform sources, pairwise independence indeed implies full independence.
3
Separation Algorithm
The theoretical identifiability theorem gives rise to at least one theoretically feasible separation strategy. Luckily, unlike classical ICA, in GF(2) there exists a 2 ˆ Many of these matrices finite number (2(K ) ) of possible separation matrices B. are singular, and thus cannot be considered as candidates for the inverse of A (naturally, the conditions of Theorem 1 cannot be satisfied with a singular ˆ So a possible “theoretical” algorithm would be to run an exhaustive search B). 2 among all M < 2(K ) nonsingular K × K matrices, looking for those which make ˆ ◦ x[n] as “empirically uncorrelated” as the transformed observations y[n] = B possible. However, this approach is practically inapplicable for values of K above, 2 ˆ to check. say, 5, due to the huge number (2(K ) ) of potential matrices B Instead, we now propose an entirely different approach, based on properties of the entropies of the mixtures. Let u be a binary RV with 1-probability p. Its entropy is H(u) = −p log2 p − (1 − p) log2 (1 − p), and it is easy to show [5] that H(u) takes its maximum value (of 1) when u is uniform (p = 0.5). Now, let u and v be two statistically independent binary RVs and let w = u ⊕ v. It can be easily shown3 that H(w) is greater or equal to both H(u) and H(v), where equality holds iff at least one of the RVs u and v is uniform. Consequently, if A is invertible and none of the sources is uniform, then the source with the minimal entropy can be recovered by searching for the (nontrivial) linear combination (over GF(2)) of components of x which has the minimal entropy. If there are several sources with the same (minimal) entropy, there would be just as many entropy-minimizing linear combinations, each recovering its respective source. The number of potential (non-trivial) linear combinations 2 of components in x would be 2K − 1, usually far less than the O(2(K ) ) trials required in the previous algorithm. ˆ denote a non-zero K × 1 vector (with elements in GF(2)), containLet b ing prospective linear combination coefficients. To estimate the marginal probT ability (hence the entropy) of the linear combination y = ˆb ◦ x one may use ˆT ◦ x[n], but this approach unnecessarily time-averaging over the series y[n] = b ˆ Instead, folrequires computation of the N -long series y[n] for each tested b. (x) can be lowing initial estimation of the observations’ probabilities tensor (P constructed using (4)), the 1-probability p of y[n] can be obtained from (x) (i), (bT ◦ i) · P (15) pˆ = i 3
E.g., a proof involving conditional entropy: H(w) = H(u ⊕ v) ≥ H(u ⊕ v|v) = H(u).
ICA in Boolean XOR Mixtures
833
where the summation extends over all 2K possible values of i. Having obtained ˆ we select the ˆb which produced pˆ pˆ for each of the 2K − 1 possible values of b, 4 with the minimal entropy . While this approach only enables to extract the component(s) with the smallest entropy, we may proceed by taking a “deflation approach” (e.g., [6]). To this end, we would like to first eliminate the extracted source from the mixtures, by subtracting (or adding, it doesn’t matter in GF(2)) that source from the mixture components in which it participates. Thus, given an extracted source, we have to decide, for each component in x, say xk , whether or not the extracted source y is part of the linear combination that created xk . To decide, all we have to do is to examine whether the entropy of xk is smaller or larger than the entropy of xk ⊕ y. Again, this can be done by direct empirical estimation of the 1-probabilities from both series (xk [n] and xk [n] + y[n]), or, (x) : Denoting by ek the k-th preferably, from the estimated probabilities tensor P column of I, we have pˆk {0} =
(x) (i) (eTk ◦ i)P
pˆk {1} =
and
i
(x) (i) (16) ((eTk ⊕ bT ) ◦ i)P
i
where b denotes the coefficients vector that was selected for the extraction of the first source. Here pˆk {0} and pˆk {1} denote the estimated 1-probabilities of the series xk [n] and xk [n] + y[n], respectively. We then chose the one farther from 0.5: if this is pˆk {1}, we create a new observation xk [n] = xk [n] ⊕ y[n], otherwise we simply set xk [n] = xk [n]. Repeating the procedure for each k, we obtain a new set of observations x [n] = [x1 [n] x2 [n], . . . xK [n]]T , which is related to the original set by x [n] = (I ⊕ (j · bT )) ◦ x, (17) where j is a K × 1 vector containing 1-s in the indices corresponding to values of k for which pˆk {1} was farther from 0.5 than pˆk {0}. We may now repeat the process by applying the same procedure to the new (x ) set of observations x . As a first step, we have to construct an updated P
(x) . To do this, we first set P (x ) = 0 (an all-zeros tensor), and then, from P running over all 2K possible values of i, we update
(x ) ((I ⊕ (j · bT )) ◦ i) = P (x ) ((I ⊕ (j · bT )) ◦ i) + P (x) (i). P
(18)
(x) = P (x ) , and return to the initial step, repeating We then substitute x = x , P the process K − 2 times (after the (K − 1)-th pass, the last (maximum entropy) source would appear unmixed in one or more of the components of x ). It has to be noted that after each (k-th) pass x is an over-determined set: It still has K components, but they are now mixtures of K − k sources. Conseˆ that produce null quently, there exist non-trivial linear combination coefficients b 4
There’s no real need to compute the entropy: since H(ˆ p) is a monotonically decreasing function of the distance of pˆ from 0.5, it is sufficient to monitor |ˆ p − 0.5|.
834
A. Yeredor Separation of half of the sources
empirical success rate [%]
Separation of one source
Separation of all sources
100
100
100
90
90
90
80
80
80
70
70
70
60
60
60
50
50
50
40
40
40
30
K=2
30
30
20
K=4
20
20
K=8 10 0 1 10
10 2
10
3
4
10
10
N
5
10
6
10
0 1 10
10 2
10
3
4
10
10
N
5
10
6
10
0 1 10
2
10
3
4
10
10
5
10
6
10
N
Fig. 1. Empirical success rate (1000 trials) of MEXICO for K = 2, 4, 8 vs. N . In each trial source probabilities were independently drawn in (0, 0.4) ∨ (0.6, 1), and a nonsingular mixing matrix was drawn as the product of upper- and lower-triangular matrices with random independent uniform binary elements above/below the diagonals.
ˆT ◦ x [n] = 0 (or even one or more of the K observations x components y[n] = b themselves might be null). In principle, we may apply some order reduction, but this is not necessary: we can easily detect null combinations (characterized by ˆ from the search for minimum entropy. At pˆ = 0) and exclude the respective b-s the end of each pass, the respective source can be extracted as y[n] = bT ◦ x[n]. Sources would be extracted in order of non-decreasing entropy. The algorithm is given the acronym MEXICO: Minimizing Entropies of Xored Independent COmponents. Its separation performance naturally depends on the (x) (when the true P (x) is used, perfect separation is obtained, as accuracy of P long as none of the sources is degenerate or uniform), which in turn depends on the length N and on the true probabilities p. A rigorous performance analysis is quite involved, and falls beyond the scope of this paper. Instead, we present some simulation results: performance is shown in Fig.1 in terms of the empirical probability of success, namely the percentage of trials attaining perfect source reconstruction. We present success-rates in separating the first (minimum-entropy) source (left); half of all sources (middle); and all of the sources (right). See the figure’s caption for details of the simulation setup. To conclude: We derived a necessary and sufficient condition for identifiability of invertible linear mixtures over GF(2). A separation algorithm was proposed, capable of (asymptotically) perfect separation whenever the condition is satisfied (and failing otherwise). Extensions to higher order Galois Fields are also possible.
References 1. Eriksson, J., Koivunen, V.: Complex random vectors and ica models: identifiability, uniqueness, and separability. IEEE Trans. Information Theory 52, 1017–1029 (2006) 2. Ribenboim, P.: Classical theory of algebraic numbers. Springer, Heidelberg (2001)
ICA in Boolean XOR Mixtures
835
3. Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A blind source separation technique using second-order statistics. IEEE Trans. Signal Processing 45, 434–444 (1997) 4. Pham, D.T., Cardoso, J.F.: Blind separation of instantaneous mixtures of nonstationary sources. IEEE Trans. Signal Processing 49, 1837–1848 (2001) 5. Cover, T.: Elements of Information Theory. Wiley-Interscience, Chichester (2006) 6. Delfosse, N., Loubaton, P.: Adaptive blind separation of independent sources: a deflation approach. Signal Processing 45, 59–83 (1995)
A Novel ICA-Based Image/Video Processing Method Qiang Zhang, Jiande Sun, Ju Liu, and Xinghua Sun School of Information Science and Engineering, Shandong University Jinan 250100, Shandong, China [email protected]
Abstract. Since Independent Component Analysis was developed, it has been a hotspot in the field of signal processing, and has received increasing attention in feature extraction, data compression, and so on. In this paper, a novel ICAbased image/video processing method, called ICA transform (ICAT), is proposed. Instead of the traditional blocking, ICAT derives more than one subimages/sub-videos from one original image/video by down-sampling, and features are obtained from these sub-images/sub-videos by using ICA. That helps ICAT extracts features with the global characteristics of the original. And the comparison between ICAT and Digital Wavelet Transform (DWT) is performed in image/video processing, which exhibits that the results obtained by using ICAT has something similar to those of DWT, even something superior. And the comparison also demonstrates that ICAT is promising in image/video processing. Keywords: Independent Component Analysis, Image/Video Processing, Digital Wavelet Transform.
1 Introduction Independent Component Analysis (ICA) is a useful signal processing and data analysis method developed in the research of blind signals separation. Using ICA, even without any information of the source signals and the coefficients of transmission channels, people can recover or extract the source signals only from the observations according the stochastic property of the input signals. It has been one of the most important methods of blind source separation and received increasing attentions in pattern recognition, data compression, image analyzing and so on, because the ICA process derives features that best present the data via a set of components that are as statistically independent as possible and characterizes the data in a natural way [1-9]. In ICA model, more than one observation signals are needed to achieve the analysis, so when ICA is used to image/video processing, how to generate observations from one image must be firstly considered. At present, blocking is the prevalent manner and the features obtained in such way have been applied in many areas[4-9]. Hateren divided the image into blocks with size of 8×8 or 16×16, respectively, and all the blocks are taken as the observations of ICA model, and then features can be extracted by using ICA, which are proved to be some directional M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 836–842, 2007. © Springer-Verlag Berlin Heidelberg 2007
A Novel ICA-Based Image/Video Processing Method
837
edges and can be used as the bases to reconstruct the image. Furthermore, Hateren proved that these features are similar to what human visual cortex captures, and such experiments on video also result some similar conclusions. Though such kinds of features are applied widely in current researches [4-7], their meaning is still ambiguous, because blocking destroys the global properties of an image. Down-sampling is a common signal processing method, through which subsignals, approximate to the original, can be obtained. Down-sampling can help us to derive some similar sub-signals from only one original signal. And the sub-signals and the original have the same global properties. In this paper, we propose a new image/video processing method, called ICA Transform (ICAT), based on down-sampling instead of blocking. In ICAT, the original image/video is down-sampled into sub-images/sub-videos, which are looked on as the observations in ICA model. Thus features can be derived from the subimages/sub-videos by using ICA. A comparison between ICAT and Digital Wavelet Transform (DWT) helps us to understand the properties of such kind of features. The comparison shows that ICAT can obtain features similar to those extracted by DWT, furthermore has superiorities in some aspects.
2 Comparison Between ICAT and DWT in Image Processing 2.1 2 Dimensional Digital Wavelet Transform (2D DWT) Digital Wavelet Transform (DWT) is a time-frequency analysis method. Because of its characteristic of multi-resolution, DWT has been used widely in signal processing, since it came into being. Fig. 1 shows the original image, peppers, and the four subbands obtained through 1-level 2D-DWT, which are the approximate one and the details containing vertical, horizontal, and diagonal information respectively [10, 11].
(a)
LL
HL
LH
HH
(b)
Fig. 1. (a) is the original image, peppers, and (b) is 1-level DWT of the image, peppers. The left-top, LL sub-band, is the approximate component, and the others, LH, HL, HH sub-bands, are details in the vertical, horizontal, and diagonal directions respectively.
838
Q. Zhang et al.
2.2 Independent Component Analysis Transform (ICAT) As we all know that according to ICA model, there are more than one observation signals, so how to derive them from only one image is what have to be resolved firstly. In this paper, we down-sample the original image into sub-images. Assuming the size of the original image is n × m , after down-sampling with factor 2 as shown in Fig. 2, the four sub-images are:
I sub1 (i, j ) = I (2i − 1,2 j − 1) I sub 2 (i, j ) = I (2i − 1,2 j ) I sub 3 (i, j ) = I (2i,2 j − 1) I sub 4 (i, j ) = I (2i,2 j )
(1)
where I is the original image, i = 1,2 L n 2 , j = 1,2L m 2 . And then we take the sub-images as the observations to perform ICA. That is what is called ICA Transform (ICAT) in this paper. The four Feature Images (FI) obtained by using ICAT, four subbands like, are shown in Fig. 3.
==>
Fig. 2. The diagram of image down-sampling, which results four sub-images
FI1
FI2
FI3
FI4
Fig. 3. The four FIs obtained by ICAT. FI4 is the approximate component of the original image, while the FI1, FI2, FI3 are the details of the original image. That is very similar to what in DWT as shown in Fig. 1(b).
A Novel ICA-Based Image/Video Processing Method
839
2.3 Comparison Between ICAT and 2D DWT
In the comparison, we use the 512×512 standard image, peppers, as the original image, and choose Daubechies-4 as the mother wavelet. And down-sampling with factor 2 is adopted in ICAT. Here in order to analyze the property of the FIs extracted by ICAT, the following procedure is performed. The approximate component is used to take the place of the details each at once, and the inverse transform is executed after each replacement. The results of inverse DWT(IDWT) and inverse ICAT(IICAT) are shown in Fig. 4 and Fig. 5 respectively. From Fig. 4, we can see that each result of such kind of procedure has the obvious textures in a certain direction, e.g. Fig. 4(a), which has heavy vertical textures, shows the vertical detail has been replaced by the approximate component.
(a)
(b)
(c)
(d)
Fig. 4. (a), (b), and (c) are the results of IDWT after the replacement of LH, HL, and HH respectively. And (d) is the enlarged of the rectangle part in (c).
(a)
(b)
(c)
(d)
Fig. 5. (a), (b), and (c) are the results of IICAT after the replacement of FI1, FI2, and FI3 in Fig. 3 respectively. And (d) is the enlarged of the rectangle part in (c).
Given the analysis of Fig. 4, Fig. 5 shows that FI1, FI2, and FI3 in Fig. 3 are also some directional details of the original image. Furthermore, DCT is performed on the approximate component obtained by DWT and ICAT, respectively, to analyze their frequency characteristics. Among the AC coefficients of the DWT approximate component, the first 5627 lowest frequency AC coefficients occupy the 95% in energy, while in the case of ICAT approximate component, the number is only 3451, which is only about 61% of that in DWT. That suggests that ICAT must superior to DWT in image compression.
840
Q. Zhang et al.
2.4 Discussion
Given the above comparison between ICAT and DWT, we can see that the features derived by ICAT are very similar to those derived by DWT. Furthermore, ICAT has at least two superiorities to DWT in image processing. First, when an image is analyzed by different mother wavelet, the size of sub-bands may be different. But it is not the case in ICAT, so ICAT may be more compatible for analysis. Second, the comparison on the occupation ratio of low frequency shows that ICAT is superior to DWT in image compression. Besides, one of the reasons that DCT is used in compression is that DWT can reduce the redundancy by removing the correlation [11], while ICA is to obtain independent components. So from the comparison above, we can conclude that ICAT will be a promising method to image and video processing. It must be widely used in the field of video feature analysis, motion object extraction, compression, and so on.
3 Temporal DWT and ICAT on Video Processing In this comparison, temporal DWT and ICAT are used to analyze the same video in time. And in this video, a hand plays notes on a piano and there is only 5 keys played by each finger respectively. The key pressed by thumb is defined as the No.1 key, and the one pressed by forefinger is the No.2 key, and so on. The keys are pressed in the order of 1-2-3-4-2-3-1-5. The total frames are 50. 3.1 Temporal DWT on Video
Temporal DWT is often used in video analysis in order to extract the motion features from the video, since motion is considered as the temporal details of a video, and the background is the approximate one [12]. Db4 is used here, and two video feature sequences are obtained. The one with motion information is shown in Fig. 6. 3.2 ICAT on Video
We perform ICAT to video by down-sampling in temporally with the factor 2, and two video feature sequences are obtained. The feature sequence with motion information is shown in Fig. 7 corresponding as a counterpart of Fig. 6. 3.3 Discussion
According to Fig. 6 and Fig. 7, Temporal DWT and ICAT with temporal downsampling can extract the motion of the original video. But given the two features sequences shown above, ICAT with temporal down-sampling has two superiorities to temporal DWT. One is that the frame number of ICAT is invariant and it is the half of original video. But the frame number of temporal DWT will be different with various wavelets. The other one is that the motion features obtained by ICAT have the same time order as that of original video. But this time order is lost in the temporal DWT.
A Novel ICA-Based Image/Video Processing Method
841
Fig. 6. The DWT features sequence with motion information is shown in the format of frames. The frames are ordered timely from left the top left to the lower right. And the total feature frames are 28, more than half of the total frame number in original video. In this sequence, there are some feature frames which show only the pressed keys in the original video, e.g. frame in row 2 column 2, in row 2 column 3, and so on. That means temporal DWT can filter the motion objects out.
Fig. 7. The ICAT feature sequence with motion information is shown in the format of frames. The frames are ordered timely from left the top left to the lower right. From this sequence, we can find the features only with the motion objects and some only with something like background. That means that ICAT can extract the motion objects, the pressed keys, and the background as well. By the way, the number of frames in feature sequence is exactly the half of that of original video.
4 Conclusion In this paper, a novel image/video processing method, ICA Transform, is proposed. Given the comparison with DWT in image/video processing, we can see that using
842
Q. Zhang et al.
ICAT we can obtain some features like what is extracted by DWT in both image processing and video processing. What is more, ICAT has something superior to DWT to some extent. ICAT will be promising in image/video processing. And how to overcome the indeterminacy of ICA with the help of image/video properties is the next research interest.
Acknowledgment The work is supported by Program for New Century Excellent Talents in University (NCET-05-0582), Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20050422017) and the Project sponsored by SRF for ROCS, SEM ([2005]55). The contact author is Jiande Sun ([email protected]).
References 1. Hyvarinen, A.: Survey on Independent Component Analysis. Neural Computing Surveys 2, 94–128 (1999) 2. Hyvarinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 13(4-5), 411–430 (2000) 3. Hyvarinen, A., Oja, E.: A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation 9(7), 1483–1492 (1997) 4. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face Recognition by Independent Component Analysis. IEEE Transactions on Neural Networks 16(6), 1450–1464 (2002) 5. Larsen, J., Hansen, L.K., Kolenda, T., Nielsen, F.A.: Independent Component Analysis in Multimedia Modeling. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Japan, pp. 687–695 (2003) 6. Hurri, J., Hyvarinen, A., Karhunen, J., Oja, E.: Image Feature Extraction Using Independent Component Analysis. In: Proc. IEEE Nordic Signal Processing Symposium, Espoo Finland (1996) 7. Artur, J.F., Mário, A.T.F.: On the Use of Independent Component Analysis for Image Compression. Signal Processing: Image Communication 21, 378–389 (2006) 8. van Hateren, J.H., van der Schaaf, A.: Independent Component Filters Of Natural Images Compared With Simple Cells In Primary Visual Cortex. Proceedings Royal Society of London: Biological Sciences 265(1394), 359–366 (1998) 9. van Hateren, J.H., Ruderman, D.L.: Independent Component Analysis of Natural Image Sequences Yields Spatio-Temporal Filters Similar to Simple Cells in Primary Visual Cortex. Proceedings of the Royal Society of London B 265(1412), 2315–2320 (1998) 10. Mallat, S.: Wavelets for A Vision. Proceedings of the IEEE 84(4), 604–614 (1996) 11. Antonini, M., Barlaud, M., Mathieu, P.: Daubechies: Image Coding Using Wavelet Transform. IEEE Transactions on Image Processing 1(2), 205–220 (1992) 12. Swanson, M.D., Zhu, B., Tewfik, A.H.: Multiresolution Scene-Based Video Watermarking Using Perceptual Models. IEEE Journal on Selected Areas in Communications 16(4), 540– 550 (1998)
Blind Audio Source Separation Based on Independent Component Analysis Shoji Makino, Hiroshi Sawada, and Shoko Araki NTT Communication Science Laboratories, Kyoto, Japan [email protected]
Abstract. This keynote talk describes a state-of-the-art method for the blind source separation (BSS) of convolutive mixtures of audio signals. Independent component analysis (ICA) is used as a major statistical tool for separating the mixtures. We provide examples to show how ICA criteria change as the number of audio sources increases. We then discuss a frequency-domain approach where simple instantaneous ICA is employed in each frequency bin. A directivity pattern analysis of the ICA solutions provides us with a physical interpretation of the ICA-based separation. It tells us the relationship between ICA-based BSS and adaptive beamforming. In order to obtain properly separated signals with the frequency-domain approach, the permutation and scaling ambiguity of the ICA solutions should be aligned appropriately. We describe two complementary methods for aligning the permutations, i.e., collecting separated frequency components originating from the same source. The first method exploits the signal envelope dependence of the same source across frequencies. The second method relies on the spatial diversity of the sources, and is closely related to source localization techniques. Finally, we describe methods for sparse source separation, which can be applied even to an underdetermined case.
M.E. Davies et al. (Eds.): ICA 2007, LNCS 4666, p. 843, 2007. © Springer-Verlag Berlin Heidelberg 2007
Author Index
Abed-Meraim, K. 317
Aboutajdine, Driss 201
Absil, Pierre-Antoine 57
Adib, Abdellah 193, 201
Aïssa-El-Bey, A. 317
Alías, Francesc 794
Amari, Shun-ichi 169
Anemüller, Jörn 325
Araki, Shoko 843
Babaie-Zadeh, Massoud 389, 397, 438
Baffa, O. 585
Balan, Radu 333
Bali, Nadia 137
Barros, Allan Kardec 422, 585, 593
Beckmann, Christian 625
Berné, O. 681
Blumensath, Thomas 341
Bobin, J. 349
Bofill, Pau 552
Borschbach, Markus 689
Bouguerriou, N. 145
Buchner, H. 560
Burghoff, Martin 673
Capdessus, C. 145
Cardoso, Jean-François 1
Carmona, J.A. 714
Castella, Marc 9
Castells, Francisco 569
Cavalcante, André Borges 422
Cemgil, A. Taylan 697
Cervigón, Raquel 569
Chan, Laiwan 301
Chao, Jih-Cheng 152
Cichocki, Andrzej 169
Claussen, Heiko 446
Cobo, Germán 794
Comani, Silvia 593
Comon, Pierre 9, 293, 641
Constantinides, Anthony G. 738
Cruces, Sergio 17
D'Asseler, Yves 641
Damper, Robert 446
Dapena, Adriana 770
Davies, Mike 341, 577
De Araujo, D.B. 585
De Lathauwer, Lieven 33
De Luigi, Christophe 25
De Vos, Maarten 33
Deville, Alain 706
Deville, Yannick 161, 681, 706, 722
Dikmen, Onur 697
Ding, Mingzhou 802
Douglas, Scott C. 152, 462
Duarte, Leonardo Tomazeli 41
Durán, Iván 17
Essa, Irfan 536
Estombelo-Montesco, C.A. 585
Fadaili, El Mostafa 193
Fadili, J. 349
Févotte, Cédric 177
Fox, Brendan 454
Gaito, Sabrina 185
Georgiev, Pando 357
Ghahramani, Zoubin 381
Ghennioui, Hicham 193, 201
Giese, Martin A. 762
Golinval, J.C. 778
González de-la-Rosa, Juan-José 714
Gorriz Saez, J.M. 649
Gowreesunker, B. Vikrham 365
Grenier, Y. 317
Grossi, Giuliano 185
Gruber, P. 209
Guidara, Rima 722
Guilhon, Denner 422, 593
Gupta, Malay 462
Gutch, Harold W. 49
Halford, Jonathan J. 601
Hansen, Lars K. 89
Harte, James M. 609
Heneghan, Conor 569
Hiroe, Atsuo 471
Hoffmann, Eugen 480
Holobar, Aleš 617
Hosseini, Shahram 161, 722
Hüper, Knut 105
Igual, Jorge 786
Illana, A. 714
Imiya, Atsushi 754
Inouye, Yujiro 218
Jafari, Maria G. 488
James, Christopher 577
Jia, Sen 268
Joblin, C. 681
Johnson, Robert 625
Jolesz, Ferenc A. 633
Journée, Michel 57
Jutten, Christian 41, 389, 397, 438
Kaftory, Ran 373
Kawamoto, Mitsuru 218
Kawanabe, M. 121
Kellermann, W. 560
Kerschen, G. 778
Kim, Taesu 227
Kiszlinger, Melinda 252
Knowles, David 381
Kohler, C. 209
Kohno, Kiyotaka 218
Kokkinakis, Kostas 495
Koldovský, Zbyněk 285, 730
Kolossa, Dorothea 480
Kopriva, Ivica 504
Korizis, Hariton 738
Korzekwa, Amy 802
Kreutz-Delgado, Ken 97
Lang, E.W. 649
Lee, Jaehyung 227
Lee, Jong-Hwan 633
Lee, Soo-Young 227
Lee, Te-Won 633
Lemahieu, Ignace 641
Levin, David N. 65
Liu, Ju 836
Loizou, Philipos C. 495
Lombard, A. 560
Lőrincz, András 252
Lutter, D. 649
Ma, Jian 73
Ma, Libo 657, 746
Madsen, Kristoffer H. 89
Maeda, Takehide 218
Makeig, Scott 97
Makino, Shoji 552, 843
Marchini, Jonathan 625
Marin, F. 778
Martin, Maude 1
Matsuoka, Kiyotoshi 528
Mazur, Radoslaw 512
Mertins, Alfred 512
Michailovich, Oleg 81
Millet, José 569
Mitianoudis, Nikolaos 236, 738
Mohammad-Djafari, Ali 137
Mohimani, G. Hosein 389, 438
Moraes, E.R. 585
Moreau, Eric 25, 193, 201
Moreno, Javier 569
Moreno Muñoz, A. 714
Mørup, Morten 89
Moudden, Y. 349
Nakazawa, Masato 802
Nandi, A.K. 145
Noorshams, Nima 397
O’Grady, Paul D. 520
Ohata, Masashi 528
Ohnishi, Naoya 754
Oja, Erkki 285
Omlor, Lars 762
Orglmeister, Reinhold 480, 673
Palmer, Jason A. 97
Pardo, Bryan 454
Parry, R. Mitchell 536
Pearlmutter, Barak A. 520
Pérez-Iglesias, Héctor J. 770
Pham, Dinh-Tuan 129, 244
Phlypo, Ronald 641
Plumbley, Mark D. 406, 488
Póczos, Barnabás 252
Poncelet, F. 778
Puntonet, Carlos G. 260, 649, 714
Pyka, Martin 689
Qian, Yuntao 268
Raj, Bhiksha 414
Ralescu, Anca 357
Rao, Bhaskar D. 97
Rojas, Fernando 260
Roque, A.C. 585
Rosca, Justinian P. 446, 552
Sabin, Andrew 454
Sahani, Maneesh 544
Sakata, Toshio 113
Salazar, Addisson 786
Sander, Tilmann H. 673
Sarmiento, Auxiliadora 17
Sawada, Hiroshi 552, 843
Schachtner, R. 649
Sepulchre, Rodolphe 57
Serrano, Arturo 786
Sevillano, Xavier 794
Shashanka, Madhusudana 414
Shen, Hao 105
Smaragdis, Paris 414
Smith, Stephen 625
Socoró, Joan Claudi 794
Solé-Casals, Jordi 260
Sousa, Carlos Magno 422
Starck, J.-L. 349
Stathaki, Tania 236
Sumi, Toshio 113
Sun, Jiande 836
Sun, Peng 802
Sun, Xinghua 836
Sun, Zengqi 73
Sutherland, Matthew T. 802
Szabó, Zoltán 252
Szupiluk, Ryszard 277
Tang, Akaysha C. 802
Terzopoulos, Demetri 818
Tewfik, Ahmed H. 365
Theis, Fabian J. 49, 121, 177, 209, 357, 649
Thirion-Moreau, Nadège 193, 201
Tichavský, Petr 285, 730
Tomé, A.M. 649
Trahms, Lutz 673
Trendafilov, Nickolay T. 810
Turner, Richard E. 544
Unkel, Steffen 810
Van Huffel, Sabine 33
Vasilescu, M. Alex O. 818
Vergara, Luis 786
Verleysen, Michel 129
Vigário, Ricardo 665
Vincent, Emmanuel 430, 552
Vrins, Frédéric 129
Wakai, R.T. 585
Wang, Suogang 577
Wehr, S. 560
Wiens, Douglas 81
Wojewnik, Piotr 277
Yang, Wenlu 657
Yang, Zhen 802
Yeredor, Arie 827
Ylipaavalniemi, Jarkko 665
Yoo, Seung-Schik 633
Żabkowski, Tomasz 277
Zarzoso, Vicente 293, 641
Zavala-Fernandez, Heriberto 673
Zayyani, Hadi 438
Zazula, Damjan 617
Zdunek, Rafal 169
Zeevi, Yehoshua Y. 373
Zhang, Kun 301
Zhang, Liqing 309, 657, 746
Zhang, Qiang 836
Zhang, Yan 802
Zhang, Zhi-Lin 309
Zopf, Alec 454